![bookstore](bookstore.jpg)


Identifying popular products is incredibly important for e-commerce companies! Popular products generate more revenue and, therefore, play a key role in stock control.

You've been asked to support an online bookstore by building a model to predict whether a book will be popular or not. They've supplied you with an extensive dataset containing information about all books they've sold, including:

* `price`
* `popularity` (target variable)
* `review/summary`
* `review/text`
* `review/helpfulness`
* `authors`
* `categories`

You'll need to build a model that predicts whether a book will be rated as popular or not.

They have high expectations, so they have set a target of at least 70% accuracy! You are free to use as many features as you like, and you will need to engineer new features to reach this level of performance.
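As a rough illustration of what "engineering new features" could mean here, the sketch below derives a few numeric columns from the review fields. The toy rows and the column names `helpful_votes`, `total_votes`, `helpful_ratio`, `text_words`, and `summary_words` are assumptions made for this example, not part of the supplied dataset.

```python
import pandas as pd

# Toy rows shaped like the bookstore data (values are made up for illustration).
toy = pd.DataFrame({
    "review/summary": ["Great read", "Not for me"],
    "review/text": ["I could not put it down.", "Too slow."],
    "review/helpfulness": ["17/19", "0/0"],
})


def add_basic_features(df):
    """Return a copy of df with a few simple engineered columns (hypothetical names)."""
    out = df.copy()
    # Split "helpful/total" into two integer columns.
    votes = out["review/helpfulness"].str.split("/", expand=True).astype(int)
    out["helpful_votes"] = votes[0]
    out["total_votes"] = votes[1]
    # Share of helpful votes; 0.0 when nobody voted (avoids division by zero).
    nonzero_total = out["total_votes"].where(out["total_votes"] > 0)
    out["helpful_ratio"] = (out["helpful_votes"] / nonzero_total).fillna(0.0)
    # Word counts of the review text and the review summary.
    out["text_words"] = out["review/text"].str.split().str.len()
    out["summary_words"] = out["review/summary"].str.split().str.len()
    return out


features = add_basic_features(toy)
print(features[["helpful_votes", "total_votes", "helpful_ratio", "text_words"]])
```

Keeping `total_votes` alongside the ratio preserves the difference between "0 of 0" and "0 of 20", which a single ratio would hide. Whether any of these columns helps the model is something to check empirically.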

In [5]:
# Import some required packages
import pandas as pd

# Read in the dataset
books = pd.read_csv("data/books.csv")

# Preview the first five rows
books.head()

Unnamed: 0,title,price,review/helpfulness,review/summary,review/text,description,authors,categories,popularity
0,We Band of Angels: The Untold Story of America...,10.88,2/3,A Great Book about women in WWII,I have alway been a fan of fiction books set i...,"In the fall of 1941, the Philippines was a gar...",'Elizabeth Norman','History',Unpopular
1,Prayer That Brings Revival: Interceding for Go...,9.35,0/0,Very helpful book for church prayer groups and...,Very helpful book to give you a better prayer ...,"In Prayer That Brings Revival, best-selling au...",'Yong-gi Cho','Religion',Unpopular
2,The Mystical Journey from Jesus to Christ,24.95,17/19,Universal Spiritual Awakening Guide With Some ...,The message of this book is to find yourself a...,THE MYSTICAL JOURNEY FROM JESUS TO CHRIST Disc...,'Muata Ashby',"'Body, Mind & Spirit'",Unpopular
3,Death Row,7.99,0/1,Ben Kincaid tries to stop an execution.,The hero of William Bernhardt's Ben Kincaid no...,"Upon receiving his execution date, one of the ...",'Lynden Harris','Social Science',Unpopular
4,Sound and Form in Modern Poetry: Second Editio...,32.5,18/20,good introduction to modern prosody,There's a lot in this book which the reader wi...,An updated and expanded version of a classic a...,"'Harvey Seymour Gross', 'Robert McDowell'",'Poetry',Unpopular
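The `Unnamed: 0` column in the preview is what pandas shows when the first header cell of a CSV is empty, which usually means a row index was written out when the file was saved. Assuming that is the case for `data/books.csv`, passing `index_col=0` reads it back as the index instead of a feature column. The sketch below uses a tiny in-memory CSV with made-up rows rather than the real file.

```python
import io

import pandas as pd

# A tiny stand-in for data/books.csv whose first header cell is empty,
# which is what DataFrame.to_csv() produces when the index is written out.
csv_text = ",title,price\n0,Death Row,7.99\n1,Some Other Book,9.35\n"

# Default read: the unnamed first column becomes a regular column.
default_read = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```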


In [6]:
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_selection import SelectFromModel

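# Replace missing values with empty strings.
# Assumes the gaps are in text columns; a missing price would become "" and break scaling.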
books.fillna("", inplace=True)

# Encode target variable
le = LabelEncoder()
# LabelEncoder sorts classes alphabetically: 'Popular' -> 0, 'Unpopular' -> 1
books['popularity'] = le.fit_transform(books['popularity'])

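# Convert review/helpfulness from a "helpful/total" string to a float ratio
def convert_fraction(x):
    """Return helpful/total as a float, or 0 for "0/0", blanks, and malformed values."""
    try:
        num, denom = map(int, x.split('/'))
        return num / denom if denom != 0 else 0
    except (ValueError, AttributeError):
        # ValueError: not two integers; AttributeError: value is not a string
        return 0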

books['review/helpfulness'] = books['review/helpfulness'].apply(convert_fraction)

# Feature Engineering
# Process text features using TF-IDF
text_features = ['review/summary', 'review/text']
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)

# Convert categorical features using one-hot encoding
# Note: the resulting dummy columns are not added to X below, so they do not reach the model
books = pd.get_dummies(books, columns=['authors', 'categories'])

# Extract text-based features
X_text = books[text_features].apply(lambda x: ' '.join(x), axis=1)
X_text_tfidf = tfidf.fit_transform(X_text)

# Extract numerical features
X_numeric = books[['price', 'review/helpfulness']].values
scaler = StandardScaler()
X_numeric_scaled = scaler.fit_transform(X_numeric)

# Combine all features
X = np.hstack((X_numeric_scaled, X_text_tfidf.toarray()))
y = books['popularity']

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Feature selection to reduce dimensionality
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
X_train_reduced = selector.fit_transform(X_train, y_train)
X_test_reduced = selector.transform(X_test)

# Model training with hyperparameter tuning
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5]
}

rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_reduced, y_train)

# Best model evaluation
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_reduced)
print(classification_report(y_test, y_pred))

model_accuracy = accuracy_score(y_test, y_pred)
print(model_accuracy)

              precision    recall  f1-score   support

           0       0.78      0.50      0.61      1046
           1       0.79      0.93      0.85      2098

    accuracy                           0.79      3144
   macro avg       0.79      0.72      0.73      3144
weighted avg       0.79      0.79      0.77      3144

0.7872137404580153
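The cell above fits the TF-IDF vectorizer and the scaler on the full dataset before splitting, so the test rows influence the preprocessing. One possible way to avoid that is to put the preprocessing inside a scikit-learn `Pipeline` with a `ColumnTransformer`, so that everything is fitted on the training rows only. The sketch below runs on made-up toy data with hypothetical column names (`text`, `helpfulness`); it is not the notebook's model and says nothing about how the reported accuracy would change.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the bookstore frame (all values are made up).
toy = pd.DataFrame({
    "text": ["loved it", "great story", "wonderful book", "really enjoyed it",
             "boring plot", "waste of time", "dull and slow", "did not finish"],
    "price": [10.0, 12.5, 8.0, 15.0, 9.0, 20.0, 7.5, 11.0],
    "helpfulness": [0.9, 0.8, 1.0, 0.7, 0.1, 0.0, 0.2, 0.3],
    "popularity": [0, 0, 0, 0, 1, 1, 1, 1],
})

X = toy.drop(columns="popularity")
y = toy["popularity"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# TF-IDF on the text column, standard scaling on the numeric columns.
preprocess = ColumnTransformer([
    ("tfidf", TfidfVectorizer(stop_words="english"), "text"),
    ("scale", StandardScaler(), ["price", "helpfulness"]),
])

# Preprocessing and classifier are fitted together, on training rows only.
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
])

model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(len(predictions))
```

A pipeline like this can also be passed directly to `GridSearchCV`, with parameter names prefixed by the step name (for example `forest__max_depth`), so each cross-validation fold refits the vectorizer and scaler on its own training portion.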
