# Amazon Top 50 Bestselling Books 2009 - 2019 | Rating prediction using several methods

First we take a general look at the data, then we derive more useful variables from the original ones, and finally we compare several methods to see which one predicts user ratings best.

## 0. Setting up the analysis. Importing the libraries and the datafile.

In [None]:
# The environment is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import re

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

import warnings
warnings.filterwarnings("ignore")

datapath="/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv"


## 1. First look at the data

### 1.1 Visualizing the original data

In [None]:
rawdata=pd.read_csv(datapath)
rawdata

The data shows the user rating, the total number of reviews, the price as of 2019 and the genre (fiction/non fiction) of each book. A book occupies one row for each year in which it was a bestseller. Now let's look at the pairplot of the numeric variables.

### 1.2 Looking for direct correlations.

In [None]:
sns.pairplot(rawdata)

At first glance, the pairplot shows no direct correlation between the user rating and the other numeric variables. Let's visualize the user rating distribution for fiction and non fiction books, and the rating distribution of the bestsellers of each year.

In [None]:
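# numeric_only=True skips the text columns, which recent pandas versions refuse to average
genredata=rawdata.groupby("Name").mean(numeric_only=True)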
genredata["Genre"]=genredata.index.map(lambda name: rawdata[rawdata["Name"]==name]["Genre"].values[0])

bins_=np.arange(3.15,5.05,0.1)

sns.displot(genredata, x="User Rating",hue="Genre", element="step", bins=bins_)
sns.displot(rawdata, x="User Rating",hue="Year", element="poly", bins=bins_,fill=False)

The plots show that fiction books are rated slightly better than non fiction ones, and that books that were bestsellers in the last few years have better ratings than those from earlier years. In both cases the correlation is not very strong.
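
To put a number on the genre difference seen in the plots, one option is to compare per-genre summary statistics after reducing each book to a single row. The sketch below is only an illustration: the `toy` frame and its values are invented, and it uses `drop_duplicates` instead of the per-book mean taken in the cell above.

```python
import pandas as pd

# Toy data with the same column names as the CSV; the values are invented.
toy = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "D"],
    "User Rating": [4.8, 4.8, 4.6, 4.4, 4.5],
    "Genre": ["Fiction", "Fiction", "Fiction", "Non Fiction", "Non Fiction"],
})

# One row per book, so multi-year bestsellers are not counted several times.
per_book = toy.drop_duplicates(subset="Name")

# Mean, median and number of books per genre.
genre_summary = per_book.groupby("Genre")["User Rating"].agg(["mean", "median", "count"])
print(genre_summary)
```

Applied to the real data, the same summary would show how large the gap between the two genres is compared with the spread of the ratings.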

## 2. Exploring the data

### 2.1 Creating new variables from the original ones

We will compute the number of years as a bestseller (YearsBs) and the number of years elapsed since the first time a book was a bestseller (YearsEl).
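
As a rough illustration of what these two variables contain, here is a minimal sketch on an invented toy frame. It uses `groupby(...).transform`, which is an alternative to the lookup-by-name approach in the next cell, and the constant `REFERENCE_YEAR` is a name introduced here for the value 2020 used in that cell.

```python
import pandas as pd

# Toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "C", "C"],
    "Year": [2011, 2012, 2013, 2015, 2018, 2019],
})

REFERENCE_YEAR = 2020  # hypothetical constant, same value as in the notebook's lambda

by_name = toy.groupby("Name")["Year"]
years_bs = by_name.transform("count")                 # number of rows (years) per book
years_el = REFERENCE_YEAR - by_name.transform("min")  # years since the first appearance

toy = toy.assign(YearsBs=years_bs, YearsEl=years_el)
print(toy)
```

`transform` returns a result aligned with the original rows, so every row of a book receives the same YearsBs and YearsEl values.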

In [None]:
cleandata=rawdata.copy()  # work on a copy so that rawdata is not modified
yearsBSdata=rawdata.groupby("Name").count().Year
yearsELdata=rawdata.groupby("Name").min().Year.apply(lambda x: 2020-x)
cleandata["YearsBs"]=cleandata["Name"].apply(lambda name: yearsBSdata[name])
cleandata["YearsEl"]=cleandata["Name"].apply(lambda name: yearsELdata[name])
cleandata=cleandata.rename(columns={"User Rating":"Rating"})
cleandata

### 2.2 Looking for new correlations

In [None]:
yearsdata=cleandata.groupby("Name").mean(numeric_only=True)
sns.scatterplot(x=yearsdata.YearsBs,y=yearsdata.Rating)
sns.displot(yearsdata, x="Rating",hue="YearsBs", element="poly", stat="probability",common_norm=False,bins=bins_,fill=False)

In [None]:
sns.scatterplot(x=yearsdata.YearsEl,y=yearsdata.Rating)
sns.displot(yearsdata, x="Rating",hue="YearsEl", element="poly", stat="probability",common_norm=False,bins=bins_,fill=False)

We see that the more years a book has been a bestseller, the more likely it is to have a good rating. Also, the newer bestsellers have slightly better ratings than the older ones.

## 3. Preparing the data to be machine learnt

### 3.1 Data cleaning

Some of the data wasn't scraped as intended: some books appear in several versions.

There are books with slightly altered names for each version:

In [None]:
def is_name(string_searched,string_evaluated):
    # True if the regular expression string_searched matches anywhere in string_evaluated
    return re.search(string_searched,string_evaluated) is not None
    
dirtydata1=cleandata[cleandata.Name.apply(lambda name: is_name(r"The 5 Love Languages: The Secret to Love [Tt]hat Lasts", name))]
dirtydata1
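
The regular expression above targets one known title. A more general way to spot such variants, sketched here under the assumption that they differ only in capitalisation or spacing, is to group titles by a normalised key. The helper `normalise_title` is hypothetical and not part of the notebook.

```python
import re

def normalise_title(title):
    """Collapse whitespace and ignore case so near-identical spellings share a key."""
    return re.sub(r"\s+", " ", title).strip().casefold()

titles = [
    "The 5 Love Languages: The Secret to Love that Lasts",
    "The 5 Love Languages: The Secret to Love That Lasts",
    "Some Other Book",  # invented title for illustration
]

# Collect the original spellings under their normalised key.
variants = {}
for title in titles:
    variants.setdefault(normalise_title(title), set()).add(title)

# Keys with more than one spelling are candidate duplicates.
duplicates = {key: names for key, names in variants.items() if len(names) > 1}
print(duplicates)
```

This only catches differences in case and whitespace; versions with different subtitles or edition labels would still need a specific pattern.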

However, some books have two versions that share exactly the same name.
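
One way to detect such cases, shown here as a sketch on invented data, is to count the distinct values per book of columns that should be constant. This is an alternative to comparing each row with the per-book mean, which is what the next cell does.

```python
import pandas as pd

# Toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "C"],
    "Rating": [4.7, 4.7, 4.5, 4.3, 4.9],
    "Price": [10, 10, 8, 12, 15],
})

# Number of distinct values per book for columns that should not vary.
distinct = toy.groupby("Name")[["Rating", "Price"]].nunique()

# Books with more than one distinct rating or price.
inconsistent_names = distinct[(distinct > 1).any(axis=1)].index

inconsistent_rows = toy[toy["Name"].isin(inconsistent_names)]
print(inconsistent_rows)
```

Here only book "B" is flagged, because its two rows disagree on both rating and price.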

In [None]:
dirtydata2=cleandata[(cleandata.Rating.round(2)!=cleandata.Name.apply(lambda name: yearsdata.Rating[name].round(2))) |
                   (cleandata.Price!=cleandata.Name.apply(lambda name: yearsdata.Price[name]))].copy()  # copy so that new columns can be added safely
dirtydata2["MeanRating"]=dirtydata2.Name.apply(lambda name: yearsdata.Rating[name].round(2))
dirtydata2["MeanPrice"]=dirtydata2.Name.apply(lambda name: yearsdata.Price[name].round(2))
dirtydata2

Now let's evaluate whether we can simply remove the anomalous data or whether it is worth keeping and correcting it.

In [None]:
anomalous_ratio=(dirtydata1.shape[0]+dirtydata2.shape[0])/cleandata.shape[0]
print("The anomalous data ratio is {:.2f} %".format(anomalous_ratio*100.))

The anomalous data is less than 10% of the total data, so we can remove it.

In [None]:
# Drop both sets of anomalous rows; the index union avoids a KeyError if a row appears in both sets
cleanerdata=cleandata.drop(index=dirtydata1.index.union(dirtydata2.index))

### 3.2 Outlier cleaning

Now we will remove the outliers using the interquartile range criterion: a value is kept only if it satisfies

\begin{equation}
Q_1 - 1.5 \, Q_{13} \leq (X, y) \leq Q_3 + 1.5 \, Q_{13}
\end{equation}

where

\begin{equation}
Q_{13}=Q_3-Q_1
\end{equation}
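
A small worked example of this criterion, on an invented sample, may help. The sketch uses NumPy rather than the pandas helper defined in the next cell, and treats both bounds as inclusive, which is what `Series.between` does by default.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])  # invented sample

q1, q3 = np.percentile(values, [25, 75])   # Q_1 = 3.0, Q_3 = 7.0 for this sample
iqr = q3 - q1                              # Q_13 in the notation above
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep the values inside [lower, upper].
kept = values[(values >= lower) & (values <= upper)]
print(lower, upper)
print(kept)
```

For this sample the bounds are -3 and 13, so only the value 100 is discarded.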

In [None]:
def outlier_remover(columnseries):
    # Keep only the values inside [Q1-1.5*Q13, Q3+1.5*Q13]
    Q1=columnseries.quantile(0.25)
    Q3=columnseries.quantile(0.75)
    Q13=Q3-Q1
    lowerbound=Q1-1.5*Q13
    upperbound=Q3+1.5*Q13
    newcolumnseries=columnseries[columnseries.between(lowerbound,upperbound)]
    return newcolumnseries

# Removed values become NaN when assigned back, so dropna() discards the whole row
for columnname in cleanerdata.columns:
    if columnname not in ["Name","Author","Genre"]:
        cleanerdata[columnname]=outlier_remover(cleanerdata[columnname])
cleanerdata=cleanerdata.dropna()
cleanerdata.describe()

Before continuing, we repeat the rating distribution figures to see how much the anomalous data affected the results.

In [None]:
newgenredata=cleanerdata.groupby("Name").mean(numeric_only=True)
newgenredata["Genre"]=newgenredata.index.map(lambda name: cleanerdata[cleanerdata["Name"]==name]["Genre"].values[0])

bins_=np.arange(3.15,5.05,0.1)

sns.displot(newgenredata, x="Rating",hue="Genre", element="step", bins=bins_)
sns.displot(cleanerdata, x="Rating",hue="Year", element="poly", bins=bins_,fill=False)

In [None]:
groupeddata=cleanerdata.groupby("Name").mean(numeric_only=True)
sns.scatterplot(x=groupeddata.YearsBs,y=groupeddata.Rating)
sns.displot(groupeddata, x="Rating",hue="YearsBs", element="poly", stat="probability",common_norm=False,bins=bins_,fill=False)

In [None]:
sns.scatterplot(x=groupeddata.YearsEl,y=groupeddata.Rating)
sns.displot(groupeddata, x="Rating",hue="YearsEl", element="poly", stat="probability",common_norm=False,bins=bins_,fill=False)

Fortunately, our initial observations have not changed much.

### 3.3 Dummy variables generation

Now we generate dummy variables for "Author" (the author of the book), "Year" (the years in which the book was a bestseller) and "Genre". We first build them row by row alongside "Name". For books that were bestsellers in several years, each row only activates the dummy variable of its own year, so we group by "Name" and take the maximum of each dummy variable, which activates every year in which the book was a bestseller. We then assign the grouped dummy variables back to the cleaned dataset. In the end, every row of a multi-row book carries the same author, genre and year dummy variables, and the rows differ only in "Year".
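
The group-and-maximum step can be seen on a tiny invented example. This is only a sketch of the idea for the year dummies; the `toy` frame is made up, and `dtype=int` is used here so that the indicators print as 0/1.

```python
import pandas as pd

# Toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Year": [2010, 2011, 2011],
})

# One indicator column per year; each row activates only its own year.
year_dummies = pd.get_dummies(toy["Year"], prefix="year", dtype=int)

# The per-book maximum activates every year in which the book appears.
per_book = pd.concat([toy["Name"], year_dummies], axis=1).groupby("Name").max()
print(per_book)
```

Book "A" ends up with both `year_2010` and `year_2011` set to 1, while book "B" only has `year_2011` set.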

In [None]:
# dtype=int keeps the dummy variables as 0/1 integers regardless of the pandas version
author_dummies=pd.get_dummies(cleanerdata["Author"], prefix="author", dtype=int)
year_dummies=pd.get_dummies(cleanerdata["Year"], prefix="year", dtype=int)
genre_dummies=pd.get_dummies(cleanerdata["Genre"], prefix="genre", dtype=int)
dummydata=pd.concat([cleanerdata["Name"],author_dummies,year_dummies,genre_dummies] ,axis=1)
dummydata_summed=dummydata.groupby("Name").max()
# Attach the per-book dummy variables to every row of that book
cleanerdata=cleanerdata.join(dummydata_summed,on="Name")

Finally we group the clean data by "Name" and take the mean, which drops the non numeric columns ("Author", "Genre"). Then we drop "Year", so that each book occupies a single row with all its numeric and dummy variables.

In [None]:
learningdata=cleanerdata.groupby("Name").mean(numeric_only=True).drop(columns="Year")

### 3.4 Splitting and scaling the data

Now that the data is properly structured, we can prepare it for machine learning. First we split it into training and testing data, and then we rescale it so that it is easier to learn. We use StandardScaler, which rescales each variable to mean 0 and variance 1 on the training data.
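
The order matters: the scaler is fitted on the training split only and then applied unchanged to the test split, so no information from the test data leaks into the preprocessing. A minimal sketch on synthetic data follows; the names `X_demo`, `X_tr` and `X_te` are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 2))  # synthetic features

X_tr, X_te = train_test_split(X_demo, test_size=0.15, random_state=29)

# Fit on the training split only, then apply the same transform to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Training data: mean 0 and standard deviation 1 up to floating-point error.
print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
# Test data: close to, but generally not exactly, 0 and 1.
print(X_te_scaled.mean(axis=0), X_te_scaled.std(axis=0))
```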

#### 3.4.1 Splitting

In [None]:
learningdata_=learningdata.copy()  # copy so that pop() does not remove "Rating" from learningdata
learningdata_y=learningdata_.pop("Rating").values
learningdata_X=learningdata_.values
X_train, X_test, y_train, y_test = train_test_split(learningdata_X,learningdata_y,test_size=0.15, random_state=29)

#### 3.4.2 Scaling

In [None]:
Xscaler=StandardScaler().fit(X_train)
yscaler=StandardScaler().fit(y_train.reshape(-1,1))
X_train=Xscaler.transform(X_train)
X_test=Xscaler.transform(X_test)
y_train=np.ravel(yscaler.transform(y_train.reshape(-1,1)))
y_test=np.ravel(yscaler.transform(y_test.reshape(-1,1)))

## 4. Machine learning the data

### 4.1 Model evaluation

To evaluate each algorithm we use a grid search over its parameters with cross-validation, to limit overfitting and to select a decent parameter combination.
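
As a rough sketch of what such a search does, the example below runs `GridSearchCV` on synthetic data. Without a `scoring` argument, `GridSearchCV` ranks candidates with the estimator's own `score` method (R² for regressors); here the scoring is set explicitly to RMSE as an assumption, to match the metric reported later. The data and the names `X_demo` and `y_demo` are invented.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))                                        # synthetic features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

search = GridSearchCV(
    SVR(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    scoring="neg_root_mean_squared_error",  # higher is better, hence the negative sign
    cv=5,                                   # 5-fold cross-validation
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated RMSE of the best combination
```

Each parameter combination is fitted on four folds and scored on the fifth, and the combination with the best average score is refitted on all the data passed to `fit`.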

In [None]:
predratings_train=pd.DataFrame()
predratings_test=pd.DataFrame()

predratings_train["Reality"]=np.ravel(yscaler.inverse_transform(y_train.reshape(-1,1)))
predratings_test["Reality"]=np.ravel(yscaler.inverse_transform(y_test.reshape(-1,1)))

modelRMSE_train={}
modelRMSE_test={}

def tuned_model(model, param_grid, results=False):
    # Cross-validated grid search on the training data; returns the refitted best estimator
    tuner=GridSearchCV(model,param_grid)
    tuner.fit(X_train,y_train)
    best_model=tuner.best_estimator_
    print(f"Best model: {best_model}")
    if results:
        # Rows containing NaN (for example, combinations whose fit failed) are dropped
        print(pd.DataFrame(tuner.cv_results_).sort_values(by=['rank_test_score']).dropna())
    return best_model

def execute_model(model,modelname):
    model.fit(X_train,y_train)
    y_train_pred=model.predict(X_train)
    y_test_pred=model.predict(X_test)
    # Store the predictions in the original rating units
    predratings_train[modelname]=np.ravel(yscaler.inverse_transform(y_train_pred.reshape(-1,1)))
    predratings_test[modelname]=np.ravel(yscaler.inverse_transform(y_test_pred.reshape(-1,1)))
    # RMSE on the scaled targets; taking np.sqrt of the MSE works across scikit-learn versions
    modelRMSE_train[modelname]=np.sqrt(mean_squared_error(y_train,y_train_pred))
    modelRMSE_test[modelname]=np.sqrt(mean_squared_error(y_test,y_test_pred))

#### 4.1.1 Stochastic Gradient Descent Regressor

In [None]:
# "squared_error" is the current name of the loss formerly called "squared_loss"
SGDR_param_grid={"alpha":[1,0.5,0.1,0.05,0.01],"loss":["squared_error", "huber", "epsilon_insensitive", "squared_epsilon_insensitive"],"penalty":["l2", "l1", "elasticnet"]}
SGDR=tuned_model(SGDRegressor(),SGDR_param_grid,results=True)
execute_model(SGDR,"Stochastic Gradient Descent")

#### 4.1.2 Support Vector Regression

In [None]:
SVectoR_param_grid={"C":[0.1,0.5,1,5,10],"kernel":["linear", "poly", "rbf", "sigmoid"]}
SVectoR=tuned_model(SVR(),SVectoR_param_grid,results=True)
execute_model(SVectoR,"Support Vector")

#### 4.1.3 Decision Tree Regressor

In [None]:
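# "squared_error" and "absolute_error" are the current names of "mse" and "mae"
# Note: "poisson" requires non-negative targets, so with the standardized ratings those fits fail and score NaN
DTreeR_param_grid={"ccp_alpha":[0.8,0.5,0.01,0.005,0.0001],"max_depth":[None, 10, 20, 40],"criterion":["squared_error", "friedman_mse", "absolute_error", "poisson"]}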
DTreeR=tuned_model(DecisionTreeRegressor(),DTreeR_param_grid,results=True)
execute_model(DTreeR,"Decision Tree")

#### 4.1.4 Multi-Layer Perceptron Regressor

In [None]:
MLPR_param_grid={"alpha":[0.1,0.01,0.001,0.0001],
                 "max_iter":[10,20,40,80], 
                 "hidden_layer_sizes":[10,20,40,60],
                 "activation":["identity","tanh","logistic","relu"]}
MLPR=tuned_model(MLPRegressor(),MLPR_param_grid,results=True)
execute_model(MLPR,"Multi-Layer Perceptron")

### 4.2 Algorithm comparison

#### 4.2.1 Graphical comparison

In [None]:
fig,ax=plt.subplots(figsize=(20,10),ncols=4, nrows=2)

fig.suptitle("Model predictions vs actual ratings (Top: Training data, Bottom: Testing data)",fontsize=16)
figrows=ax.shape[0]
figcols=ax.shape[1]
for i in range(0,figrows):
    for j in range(0,figcols):
        ax[i][j].set_xlim(4,5)
        ax[i][j].set_ylim(4,5)

sns.scatterplot(x=predratings_train["Reality"],y=predratings_train["Stochastic Gradient Descent"],ax=ax[0][0])
sns.scatterplot(x=predratings_train["Reality"],y=predratings_train["Support Vector"],ax=ax[0][1])
sns.scatterplot(x=predratings_train["Reality"],y=predratings_train["Decision Tree"],ax=ax[0][2])
sns.scatterplot(x=predratings_train["Reality"],y=predratings_train["Multi-Layer Perceptron"],ax=ax[0][3])

sns.scatterplot(x=predratings_test["Reality"],y=predratings_test["Stochastic Gradient Descent"],ax=ax[1][0])
sns.scatterplot(x=predratings_test["Reality"],y=predratings_test["Support Vector"],ax=ax[1][1])
sns.scatterplot(x=predratings_test["Reality"],y=predratings_test["Decision Tree"],ax=ax[1][2])
sns.scatterplot(x=predratings_test["Reality"],y=predratings_test["Multi-Layer Perceptron"],ax=ax[1][3])


All 4 algorithms struggle to learn the data, as shown by the width of the "lines" in the training plots, and this limits the quality of the rating predictions on the testing data.

#### 4.2.2 RMSE comparison

In [None]:
print("Root Mean Square Error (Training data)")
for key in modelRMSE_train:
    print("{}: {}".format(key,modelRMSE_train[key]))
print("\n")    
print("Root Mean Square Error (Test data)")
for key in modelRMSE_test:
    print("{}: {}".format(key,modelRMSE_test[key]))
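
Because `execute_model` computes the RMSE on the standardized targets, the values printed above are in units of the training-set standard deviation of the ratings, not in rating points. The sketch below, on invented ratings, illustrates two consequences: a model that always predicts the training mean has a training RMSE of 1 on this scale, and multiplying a scaled RMSE by the standard deviation (what `StandardScaler` stores as `scale_`) converts it back to rating points.

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.uniform(4.0, 5.0, size=50)      # invented ratings

# StandardScaler uses the population standard deviation (ddof=0), like np.std.
mean, std = ratings.mean(), ratings.std()
scaled = (ratings - mean) / std

# Baseline: always predict the training mean, which is 0 after scaling.
baseline_rmse_scaled = np.sqrt(np.mean((scaled - 0.0) ** 2))

# A scaled RMSE is converted to rating points by multiplying by the scale factor.
baseline_rmse_rating = baseline_rmse_scaled * std

print(baseline_rmse_scaled)   # 1 up to floating-point error
print(baseline_rmse_rating)   # equal to the standard deviation of the ratings
```

Read this way, a scaled RMSE close to 1 means a model does little better than predicting the average rating.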

#### 4.2.3 Data comparison

TRAINING DATASET

In [None]:
predratings_train.round(2)

TESTING DATASET

In [None]:
predratings_test.round(2)

## 5. Conclusions

1. The first data analysis shows that:
   - The newer bestsellers tend to have slightly better ratings than the older ones.
   - The more years a book has been a bestseller, the more likely it is to have a good rating.
   - Fiction books are slightly better rated than non fiction ones.
2. About the model selection and the machine learning algorithms used:
   - The lack of samples after the data cleaning has made the data difficult to learn properly.
   - The Decision Tree is notably the least effective of the algorithms used.
   - The Stochastic Gradient Descent is the one that gives the best results, closely followed by the Support Vector
     and the Multi-Layer Perceptron.
   