## EDA and User Rating Prediction of Amazon's Bestselling Books

- Analyse the data to extract meaningful insights and predict the user rating of a book.
- Data obtained from Kaggle: https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019
- Python (Jupyter Notebook) is used for the analysis.

### Importing relevant libraries...

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

%matplotlib inline

In [None]:
import plotly.graph_objs as go
from plotly.offline import iplot  # import the function itself rather than aliasing the module
import plotly.express as px

In [None]:
import warnings
from warnings import filterwarnings
filterwarnings('ignore')

### Importing data...

In [None]:
df = pd.read_csv("../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv")

In [None]:
df.head()

### Checking the Data and Data Cleaning...

In [None]:
df.info()

In [None]:
df.shape

- The DataFrame has 550 rows and 7 columns.

#### Checking for null values...

In [None]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

plt.show()

- The DataFrame contains no null values.
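
A numeric count can complement the heatmap, since a handful of missing cells is easy to overlook visually. The sketch below uses a small made-up frame (`toy`, which is an assumption and not the bestsellers data) to show the pattern; on the real data the equivalent call would be `df.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    'Name': ['Book A', 'Book B', 'Book C'],
    'Price': [8.0, np.nan, 15.0],
    'Reviews': [1200, 340, 5600],
})

# Count missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Total number of missing cells in the whole frame
total_nulls = int(null_counts.sum())
print('Total missing values:', total_nulls)
```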

#### Summary of DataFrame

In [None]:
df.describe()

### Exploratory Data Analysis

In [None]:
df.columns

In [None]:
sns.pairplot(df)

plt.show()

#### Books genre and their quantities

In [None]:
df['Genre'].unique()

- There are two genres of books:
  1. Fiction
  2. Non Fiction

In [None]:
print(df['Genre'].value_counts())
print('\n')

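# Compute the counts once and pass them directly; mixing data_frame=df (550 rows)
# with two-element arrays gives arguments of mismatched length
genre_counts = df['Genre'].value_counts()
px.pie(names=genre_counts.index, values=genre_counts.values, hole=0.41,
       title='Quantity per Genre')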

In [None]:
print('Number of unique books : ',df['Name'].nunique())
print('Number of unique authors : ', df['Author'].nunique())

#### Year wise minimum, maximum and mean 'User Rating' of books

In [None]:
user_rating = df.groupby('Year')['User Rating'].agg(['min','max','mean']).reset_index()

In [None]:
user_rating

In [None]:
user_rating.columns = ['year','min_rating','max_rating','mean_rating']

In [None]:
px.line(data_frame=user_rating,x = 'year',y = ['min_rating','max_rating','mean_rating'],
       title = 'Min_Max_Average User Rating per Year')

- The plot above shows the minimum, maximum and average 'User Rating' for each year.
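
The aggregate-then-rename steps used for this table could also be written as a single call with pandas named aggregation. This is only a sketch on a made-up frame (`toy` and its values are assumptions); the column names match the ones used in the notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from the bestsellers file
toy = pd.DataFrame({
    'Year': [2009, 2009, 2010, 2010],
    'User Rating': [4.5, 4.7, 4.6, 4.8],
})

# Named aggregation: each keyword becomes the output column name
user_rating = (
    toy.groupby('Year')['User Rating']
       .agg(min_rating='min', max_rating='max', mean_rating='mean')
       .reset_index()
       .rename(columns={'Year': 'year'})
)
print(user_rating)
```

The same pattern would apply to the 'Price' and 'Reviews' summaries by swapping the column and the output names.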

#### Year wise minimum, maximum and mean 'Price' of books

In [None]:
price = df.groupby('Year')['Price'].agg(['min','max','mean']).reset_index()

In [None]:
price

In [None]:
price.columns = ['year','min_price','max_price','mean_price']

In [None]:
px.line(data_frame=price,x = 'year',y = ['min_price','max_price','mean_price'],
       title= 'Min_Max_Average Price per Year')

- The plot above shows the minimum, maximum and average 'Price' trend for each year.

In [None]:
plt.figure(figsize = (13,7))

sns.boxplot(x = 'Year', y = 'Price', data = df, hue = 'Genre')

plt.title('Price Distribution of Books by Genre per Year',fontsize = 19)
plt.xlabel('Year',fontsize = 13)
plt.ylabel('Price',fontsize = 13)

plt.show()

- The plot above shows the price distribution of books by genre for each year.

#### Year wise minimum, maximum and mean 'Reviews' of books

In [None]:
reviews = df.groupby('Year')['Reviews'].agg(['min','max','mean']).reset_index()

In [None]:
reviews.columns = ['year','min_reviews','max_reviews','mean_reviews']
reviews

In [None]:
px.line(data_frame=reviews,x = 'year',y = ['min_reviews','max_reviews','mean_reviews'],
       title = 'Min_Max_Average Reviews per Year')

- The plot above shows the minimum, maximum and average number of reviews per year.

#### Author and their minimum, maximum and mean 'User Rating'

In [None]:
author = df.groupby('Author')['User Rating'].agg(['min','max','mean']).reset_index()

In [None]:
author.head()

In [None]:
author.shape

#### Authors and their book count

In [None]:
author2 = df.groupby('Author')['Name'].count().reset_index().sort_values(by = 'Name',ascending = False)

In [None]:
author2.columns = ['Author','No of Books']

In [None]:
author2.head()

In [None]:
author2.shape

#### Number of books per 'Genre' per 'Year'

In [None]:
df.groupby('Year')['Genre'].value_counts()

In [None]:
plt.figure(figsize=(11,7))

sns.countplot(x = 'Year' , data = df,hue = 'Genre')
plt.title('Number of Books by Genre per Year',fontsize = 19)
plt.xlabel('Year',fontsize = 13)
plt.ylabel('Number of Books',fontsize = 13)

plt.show()

#### Checking of categorical and numerical columns...

In [None]:
cat_col = [col for col in df.columns if df[col].dtype == 'object']

In [None]:
print('Categorical columns: \n',cat_col)

In [None]:
num_col = [col for col in df.columns if df[col].dtype != 'object']

In [None]:
print('Numerical columns: \n',num_col)

#### Converting Categorical Features...
- Categorical features must be converted to numbers (here with `LabelEncoder`); otherwise the machine learning algorithm cannot take them as inputs.
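
To make the encoding step concrete, here is a minimal pure-Python sketch of the mapping that `LabelEncoder` produces: the distinct labels are sorted and numbered from 0. The `genres` list is made up for illustration and is not read from the dataset.

```python
# Hypothetical genre labels, not taken from the dataset
genres = ['Non Fiction', 'Fiction', 'Fiction', 'Non Fiction']

# Sort the distinct classes and number them from 0,
# mirroring the integer codes LabelEncoder assigns
classes = sorted(set(genres))
mapping = {label: code for code, label in enumerate(classes)}

# Replace every label with its integer code
encoded = [mapping[g] for g in genres]

print(mapping)
print(encoded)
```

Note that the integer codes carry an order that a linear model treats as numeric. For 'Genre' with two classes this is harmless, but for 'Author' the order is simply alphabetical.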

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
encoder = LabelEncoder()

In [None]:
df['Author'] = encoder.fit_transform(df['Author'])

In [None]:
df['Genre'] = encoder.fit_transform(df['Genre'])

In [None]:
df.head()

#### Checking for outliers...

In [None]:
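def plot(df, col):
    # Distribution (histogram with KDE) on top, boxplot below, for one column
    fig, (ax1, ax2) = plt.subplots(2, 1)
    sns.histplot(df[col], kde=True, ax=ax1)  # distplot is deprecated in recent seaborn
    sns.boxplot(x=df[col], ax=ax2)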

In [None]:
plot(df,'Price')

- There are some outliers in the Price of books
- Prices greater than 40 are treated as outliers
- We will replace those points with the median; otherwise they may affect our ML model
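
The cut-off of 40 is a fixed value chosen for this dataset. As a point of comparison, the sketch below derives a threshold from the data with Tukey's 1.5 × IQR fence and then applies the same median replacement. This is a swapped-in alternative, not the notebook's method, and the `prices` array is made up.

```python
import numpy as np

# Hypothetical prices, not taken from the dataset
prices = np.array([5, 8, 9, 10, 11, 12, 13, 15, 18, 105], dtype=float)

# Interquartile range and Tukey's upper fence
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr

# Replace values above the fence with the median, as the notebook does for > 40
median = np.median(prices)
cleaned = np.where(prices > upper, median, prices)

print('Upper fence:', upper)
print('Cleaned prices:', cleaned)
```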

#### Handling outliers...

In [None]:
df['Price'] = np.where(df['Price']>40, df['Price'].median(),df['Price'])

In [None]:
plot(df,'Price')

- There are fewer outliers than before, which is acceptable.

In [None]:
plot(df,'Reviews')

- There are some outliers in the Reviews of books
- Review counts greater than 40000 are treated as outliers
- We will replace those points with the median; otherwise they may affect our ML model

In [None]:
df['Reviews'] = np.where(df['Reviews'] >40000,df['Reviews'].median(),df['Reviews'] )

In [None]:
plot(df,'Reviews')

- There are fewer outliers than before, which is acceptable.

In [None]:
# 'Name' is still a text column, so restrict the correlation to numeric columns
sns.heatmap(df.corr(numeric_only=True), annot= True, linewidths=1,linecolor='white')

plt.show()

- Our data is almost ready.

### Training a Linear Regression Model
#### Training and Testing Data
#### Selecting Dependent and Independent Variables 


In [None]:
# Independent Variable
X = df.drop(['User Rating','Name'], axis = 1)

In [None]:
# Dependent Variable
y = df['User Rating']

#### Train Test Split
- Split the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

#### Creating and Training model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

In [None]:
lm.fit(X_train,y_train)

#### Prediction from our Model

In [None]:
predictions = lm.predict(X_test)

In [None]:
# Comparing actual Vs predicted
act_pred = pd.DataFrame({'actual':y_test,'predicted':predictions})
act_pred.head()

In [None]:
sns.lmplot(data = act_pred,x = 'actual',y = 'predicted')

plt.show()

#### Residual Plot

In [None]:
sns.histplot(y_test - predictions, bins=50, kde=True)  # distplot is deprecated in recent seaborn

plt.show()

#### Model evaluation...

In [None]:
print(lm.intercept_)

In [None]:
print('Coefficient: \n',lm.coef_)

In [None]:
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df

In [None]:
from sklearn import metrics

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('r2 score:',metrics.r2_score(y_test,predictions))
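
For reference, the four metrics printed above can be reproduced by hand with NumPy. The sketch below uses made-up `y_true` and `y_pred` arrays (assumptions, not the model's output) to show what each number measures.

```python
import numpy as np

# Hypothetical actual and predicted ratings
y_true = np.array([4.5, 4.7, 4.8, 4.6])
y_pred = np.array([4.6, 4.6, 4.7, 4.7])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
print('r2 score:', r2)
```

An R² close to 0 means the model does little better than always predicting the mean rating, which is worth keeping in mind when reading the training and testing scores.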

In [None]:
print('Training Score: \n',lm.score(X_train,y_train))

In [None]:
print('Testing Score: \n',lm.score(X_test,y_test))