**Book Recommendation System**

Writing Style–Based Book Recommendation System

Project Track: Recommendation Systems (NLP-based)  
Tools Used: Python, Pandas, Sentence Transformers, scikit-learn (cosine similarity)

This project builds a content-based book recommendation system that focuses on writing style and semantic similarity rather than popularity, ratings, or user interaction data.

**1. Problem Definition & Objective**

a. Selected project track  
This project belongs to the AI / NLP-based recommendation system track and focuses on content-based recommendations using transformer models.

b. Clear problem statement  
The objective is to build a book recommendation system that suggests books with a similar writing style using only textual information, without relying on user ratings, reviews, or interaction history.

c. Real-world relevance and motivation  
- Addresses the cold-start problem where user data is unavailable  
- Useful for bookstores, libraries, and reading platforms  
- Demonstrates the use of modern NLP models for recommendations


**2. Data Understanding and Preparation**

a. Dataset source  
The dataset used is a public CSV file (books.csv). The selected features are title, authors, and publisher.

b. Data loading and exploration  
The dataset is loaded using Pandas with invalid rows skipped. The dataset shape and sample records are examined.

c. Cleaning, preprocessing, and feature engineering  
- Removed rows with missing values  
- Extracted the primary author name  
- Created a combined text field using book title and publisher  
- Cleaned text by lowercasing and removing special characters  
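
As a small illustration of the cleaning step, the sketch below applies the same regular expressions as the notebook's `clean_text` function to the first title in the dataset. The variable names `title`, `publisher`, and `cleaned` are only for this example. Note that removing special characters also removes hyphens and digits, so "Half-Blood" becomes a single token.

```python
import re

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # drop anything that is not a-z or whitespace
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()

title = "Harry Potter and the Half-Blood Prince (Harry Potter  #6)"
publisher = "Scholastic Inc."
cleaned = clean_text(title + " " + publisher)
print(cleaned)
# harry potter and the halfblood prince harry potter scholastic inc
```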

d. Handling missing values or noise  
Rows with null values were dropped, and malformed CSV rows were safely skipped.
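
The following sketch shows how skipping malformed rows and dropping nulls behave on a tiny in-memory CSV. The data and the names `raw` and `toy` are made up for illustration and are not part of books.csv; the `on_bad_lines` option assumes pandas 1.3 or newer.

```python
import io
import pandas as pd

# Toy CSV: one row has a missing author, another has an extra field.
raw = io.StringIO(
    "title,authors,publisher\n"
    "Book A,Author One,Pub X\n"
    "Book C,,Pub Z\n"
    "Book B,Author Two,Pub Y,unexpected extra field\n"
    "Book D,Author Four/Illustrator,Pub X\n"
)

# Rows with too many fields are skipped instead of raising a parser error.
toy = pd.read_csv(raw, engine="python", on_bad_lines="skip")

# Rows with any missing value in the selected columns are dropped.
toy = toy[["title", "authors", "publisher"]].dropna().reset_index(drop=True)
print(toy)
```

Only "Book A" and "Book D" should survive: "Book B" is skipped by the parser and "Book C" is removed by `dropna()`.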


In [1]:
!pip install -q sentence-transformers scikit-learn pandas numpy

import pandas as pd
import numpy as np
import re

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv(
    'books.csv',
    engine='python',
    on_bad_lines='skip'
)

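# Keep the three columns used downstream, drop rows with missing values,
# and renumber the index. Chaining returns a new DataFrame, which avoids
# modifying a slice of the original in place.
df = df[['title', 'authors', 'publisher']].dropna().reset_index(drop=True)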

print("Dataset loaded:", df.shape)
df.head()


Dataset loaded: (11119, 3)


| | title | authors | publisher |
|---|---|---|---|
| 0 | Harry Potter and the Half-Blood Prince (Harry ... | J.K. Rowling/Mary GrandPré | Scholastic Inc. |
| 1 | Harry Potter and the Order of the Phoenix (Har... | J.K. Rowling/Mary GrandPré | Scholastic Inc. |
| 2 | Harry Potter and the Chamber of Secrets (Harry... | J.K. Rowling | Scholastic |
| 3 | Harry Potter and the Prisoner of Azkaban (Harr... | J.K. Rowling/Mary GrandPré | Scholastic Inc. |
| 4 | Harry Potter Boxed Set Books 1-5 (Harry Potte... | J.K. Rowling/Mary GrandPré | Scholastic |


**3. Model or System Design**

a. AI technique used  
The system uses natural language processing with transformer-based sentence embeddings for content-based recommendations.

b. Architecture or pipeline explanation  
1. Text preprocessing  
2. Sentence embedding generation  
3. Cosine similarity computation  
4. Ranking and filtering of recommendations  

c. Justification of design choices  
Sentence Transformers are used to capture semantic similarity efficiently. The MiniLM model provides fast inference with good performance, and cosine similarity is suitable for comparing high-dimensional embeddings. A content-based approach avoids dependence on user data.
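
To make the similarity step concrete, here is a minimal NumPy sketch of a pairwise cosine similarity matrix. It is intended to be equivalent in spirit to the scikit-learn `cosine_similarity` call used in the notebook; the helper name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration, not the notebook's code.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

toy_embeddings = np.array([
    [1.0, 0.0],
    [2.0, 0.0],  # same direction as row 0, different length
    [0.0, 3.0],  # orthogonal to row 0
])
sim = cosine_similarity_matrix(toy_embeddings)
print(sim)
```

Rows 0 and 1 score 1.0 because cosine similarity ignores vector length, while row 2 scores 0.0 against both because it points in an orthogonal direction.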


In [2]:

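def normalize_author(author):
    # Keep only the first credited author (names are '/'-separated),
    # then lowercase and strip everything except letters and spaces.
    primary = author.split('/')[0].lower()
    primary = re.sub(r'[^a-z\s]', '', primary)
    # Join on single spaces; an empty result no longer raises IndexError.
    return ' '.join(primary.split())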

df['primary_author'] = df['authors'].apply(normalize_author)


df['description'] = df['title'] + ' ' + df['publisher']

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

df['clean_description'] = df['description'].apply(clean_text)

# ---------- MODEL ----------
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Generating embeddings...")
embeddings = model.encode(
    df['clean_description'].tolist(),
    show_progress_bar=True
)

similarity_matrix = cosine_similarity(embeddings)

# ---------- RECOMMENDER FUNCTION ----------
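def recommend_books_by_writing_style(book_title, top_n=5):
    """Return up to top_n similar books written by a different primary author."""
    if book_title not in df['title'].values:
        return "Book not found."

    # The index was reset after cleaning, so index labels equal row positions.
    idx = df[df['title'] == book_title].index[0]
    target_author = df.iloc[idx]['primary_author']

    # Pair each row position with its similarity to the query, highest first.
    similarity_scores = list(enumerate(similarity_matrix[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    results = []
    for i, score in similarity_scores:
        if i == idx:
            continue  # skip the query book itself
        # Skip books by the same primary author. The "harry potter" title check
        # is hard-coded for the demo query and is not a general rule.
        if df.iloc[i]['primary_author'] != target_author and "harry potter" not in df.iloc[i]['title'].lower():
            results.append({
                "Title": df.iloc[i]['title'],
                "Author": df.iloc[i]['authors'],
                "Publisher": df.iloc[i]['publisher'],
                "Style Similarity Score": round(score, 3)
            })
        if len(results) == top_n:
            break

    return pd.DataFrame(results)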


Generating embeddings...



In [3]:
recommend_books_by_writing_style(df['title'].iloc[0])


| | Title | Author | Publisher | Style Similarity Score |
|---|---|---|---|---|
| 0 | Ruby the Red Fairy (Rainbow Magic #1) | Daisy Meadows/Georgie Ripper | Scholastic Inc. | 0.539 |
| 1 | The Lion the Witch and the Wardrobe (Narnia) | C.S. Lewis/Pauline Baynes | HarperCollins Publishers | 0.526 |
| 2 | The Wish List | Eoin Colfer | Scholastic Inc. | 0.507 |
| 3 | From Potter's Field (Kay Scarpetta #6) | Patricia Cornwell | Berkley Books | 0.5 |
| 4 | The Littles and Their Amazing New Friend | John Lawrence Peterson | Scholastic Paperbacks | 0.499 |


**4. Core Implementation**

a. Model training / inference logic  
The pretrained all-MiniLM-L6-v2 sentence transformer model is used to generate 384-dimensional embeddings. No model training or fine-tuning is required.

b. Recommendation or prediction pipeline  
1. Generate embeddings for all books  
2. Compute cosine similarity matrix  
3. Select the target book  
4. Rank books by similarity score  
5. Filter out the same book and books by the same primary author (the notebook code also skips any title containing "harry potter")  
6. Return top-N recommendations  
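
The ranking and filtering steps can be sketched on toy data as follows. The helper `top_n_other_authors`, the author labels, and the similarity values are hypothetical and only illustrate the logic; the notebook's own function works on the full DataFrame and similarity matrix.

```python
import numpy as np

def top_n_other_authors(sim_row, authors, query_idx, top_n=5):
    """Return indices of the most similar items written by a different author."""
    order = np.argsort(-sim_row)  # indices from most to least similar
    picked = []
    for i in order:
        # Skip the query itself and anything by the same author.
        if i == query_idx or authors[i] == authors[query_idx]:
            continue
        picked.append(int(i))
        if len(picked) == top_n:
            break
    return picked

toy_authors = ["rowling", "rowling", "lewis", "colfer"]
toy_sim_row = np.array([1.00, 0.90, 0.40, 0.60])  # similarity of each item to item 0
result = top_n_other_authors(toy_sim_row, toy_authors, query_idx=0, top_n=2)
print(result)
# [3, 2]
```

Item 1 has the highest similarity to the query but is excluded because it shares the query's author, so the next two candidates are returned instead.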

c. Code execution  
The notebook runs sequentially from top to bottom without errors.


In [4]:
sample_book = df['title'].iloc[0]
recs = recommend_books_by_writing_style(sample_book)

print("Input book:", sample_book)
print("Unique authors in recommendations:")
recs['Author'].unique()


Input book: Harry Potter and the Half-Blood Prince (Harry Potter  #6)
Unique authors in recommendations:


array(['Daisy Meadows/Georgie Ripper', 'C.S. Lewis/Pauline Baynes',
       'Eoin Colfer', 'Patricia Cornwell', 'John Lawrence Peterson'],
      dtype=object)

**5. Evaluation and Analysis**

a. Metrics used  
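The cosine similarity score between sentence embeddings is the only metric reported. Because the dataset has no ground-truth relevance labels, the recommendations are assessed qualitatively by inspecting the ranked output.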
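
For reference, the standard definition of cosine similarity between two embedding vectors $u$ and $v$ (the symbols are introduced here for illustration) is:

$$
\operatorname{sim}(u, v) = \cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}, \qquad -1 \le \operatorname{sim}(u, v) \le 1
$$

A value of 1 means the vectors point in the same direction and 0 means they are orthogonal, so the scores of roughly 0.5 in the sample output indicate moderate rather than near-identical similarity.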

b. Sample outputs or predictions  
The output includes recommended book titles along with author names, publishers, and similarity scores.

c. Performance analysis and limitations  
Strengths include fast inference, scalability, and no requirement for user data. Limitations include a limited text representation (only the title and publisher are embedded) and a lack of personalization.
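
One practical point behind the scalability claim is the size of the full pairwise similarity matrix, which grows quadratically with the number of books. The arithmetic below is a rough estimate that assumes 4-byte (float32) scores; the actual dtype in the notebook was not checked.

```python
n_books = 11119      # rows reported by the notebook after cleaning
bytes_per_value = 4  # assumption: float32 similarity scores

n_entries = n_books * n_books
size_mb = n_entries * bytes_per_value / 1e6
print(f"{n_entries:,} entries, about {size_mb:.0f} MB")
# 123,632,161 entries, about 495 MB
```

For much larger catalogs, computing similarities for one query row at a time would avoid holding the whole matrix in memory.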


**6. Ethical Considerations and Responsible AI**

a. Bias and fairness considerations  
Recommendations may reflect biases present in the dataset, such as overrepresentation of certain authors or publishers.

b. Dataset limitations  
The dataset does not include genre information, reviews, or reader feedback, which limits recommendation accuracy.

c. Responsible use of AI tools  
The system uses pretrained models responsibly, processes no personal data, and is intended for educational use.


**7. Conclusion and Future Scope**

a. Summary of results  
The project demonstrates a content-based book recommendation system that uses transformer-based sentence embeddings and cosine similarity to recommend books with similar writing styles.

b. Possible improvements and extensions  
Future improvements may include using full book descriptions instead of limited text, incorporating genre and metadata, adding user preference modeling, and developing a hybrid recommendation application that combines content-based and collaborative filtering approaches.