In this notebook I explore the dataset and predict a target feature by blending different models.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from scipy.special import boxcox1p, inv_boxcox1p
from scipy.stats import skew, boxcox_normmax

In [None]:
data = pd.read_csv('../input/best-trilogies-book-series-ever/BestTrilogies_BookSeriesEver.csv')

# Data Exploration

In [None]:
data.info()

**Feature information**

1) bookTitle: the name of the book

2) authorName: the name of the author

3) avgRating: the average rating of the book

4) totalRating: the number of people who have rated the book

5) scoreValue: a score calculated from avgRating and peopleVoted

6) peopleVoted: the number of people who upvoted the book

In [None]:
data.head()

We see that avgRating, totalRating and scoreValue are stored as object type instead of a numerical type. This is because some entries contain non-numeric characters. We will strip those characters and store the columns as numerical values.
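A quick way to find the entries that keep a column from being numeric is to coerce it with `pd.to_numeric` and look at what fails. This is only a sketch: the `toy` series below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical values that mimic a rating column with stray text in it.
toy = pd.Series(["4.33", "really liked it 4.05", "3.9"])

# errors="coerce" turns anything that cannot be parsed into NaN,
# so the NaN positions point at the problematic entries.
parsed = pd.to_numeric(toy, errors="coerce")
bad = toy[parsed.isna()]
print(bad)
```

Running the same check on a real column before and after cleaning shows whether any non-numeric entries are left.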

In [None]:
data.describe()

In [None]:
data.describe(include="O")

# Data Cleaning

In [None]:
data['peopleVoted'].value_counts()

We replace the negative value (-1) with 0 because it caused problems when transforming the data: the Box-Cox transform applied later needs `x + 1` to be positive.

In [None]:
data['peopleVoted'] = data['peopleVoted'].replace(-1,0)
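A small sketch of why a value of -1 is a problem, assuming the transformation in question is the `boxcox1p` call used later in this notebook. The `votes` array is made up for illustration, and lambda is fixed at 0 (where `boxcox1p` reduces to `log(1 + x)`) to keep the example simple.

```python
import numpy as np
from scipy.special import boxcox1p

# Hypothetical vote counts, including the -1 placeholder.
votes = np.array([-1.0, 0.0, 10.0])

# With lambda = 0, boxcox1p(x, 0) is log(1 + x), so x = -1 maps to -inf.
raw = boxcox1p(votes, 0.0)
print(raw)

# After replacing -1 with 0, every transformed value is finite.
cleaned = np.where(votes == -1, 0.0, votes)
fixed = boxcox1p(cleaned, 0.0)
print(fixed)
```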

Now we will convert avgRating, totalRating and scoreValue to numeric datatype.

In [None]:
data[['totalRating', 'scoreValue']] = data[['totalRating', 'scoreValue']].replace(',','', regex=True)

In [None]:
data['avgRating'] = data['avgRating'].replace(r'[^\d.]', '', regex=True)  # raw string avoids an invalid-escape warning

In [None]:
data['totalRating'] = data['totalRating'].replace(r'[^\d]', '', regex=True)  # raw string avoids an invalid-escape warning

In [None]:
data[['totalRating', 'scoreValue']] = data[['totalRating', 'scoreValue']].astype(int)

In [None]:
data['avgRating'] = data['avgRating'].astype(float)

In [None]:
data.head()

With every column now numeric, describe() summarizes all of our features.

In [None]:
data.describe()

In [None]:
# If you want to extract the text in parentheses from the title for some reason:
# book = pd.DataFrame()
# book['Name'] = data['bookTitle'].replace(r'[^(]*\(|\)[^)]*', '', regex=True)

We drop these features because I don't think they will be of much use in predicting our target variable. We could use the author's name as a feature, because the rating and popularity of a book change depending on the popularity of its author. Maybe we will try that in a future version.

In [None]:
data = data.drop(['bookTitle', 'authorName'], axis=1)

# Visualization

Except for avgRating, the columns are highly skewed, which we will need to address.

In [None]:
fig = plt.figure(figsize=(16,14))
for index,col in enumerate(data.columns):
    plt.subplot(2,2,index+1)
    sns.histplot(data[col], kde=False)  # distplot is deprecated in recent seaborn releases
fig.tight_layout(pad=1.0)

There are extreme outliers in every feature except avgRating.
Note that the totalRating axis is displayed in scientific notation.

In [None]:
fig = plt.figure(figsize=(10,10))
for index,col in enumerate(data.columns):
    plt.subplot(2,2,index+1)
    sns.boxplot(y=col, data=data)
fig.tight_layout(pad=1.0)
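For reference, a box plot with seaborn's default whisker setting flags a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule on made-up numbers (the `values` array is hypothetical):

```python
import numpy as np

# Hypothetical sample with one extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```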

scoreValue and peopleVoted have a correlation of 1, which means that scoreValue is almost entirely determined by peopleVoted.

In [None]:
sns.heatmap(data.corr(), annot=True)
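A Pearson correlation of 1 means one column is an exact increasing linear function of the other. The sketch below illustrates this with made-up numbers. The relation `score = 100 * votes` is purely hypothetical and is not the formula used to build scoreValue.

```python
import numpy as np

# Hypothetical vote counts and a score that is a linear function of them.
votes = np.array([1, 5, 20, 80, 300], dtype=float)
score = 100 * votes

# Off-diagonal entry of the 2x2 correlation matrix.
r = np.corrcoef(votes, score)[0, 1]
print(r)
```

When two features are this strongly correlated, keeping both adds little information for a model.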

# Preprocessing

Removing the skewness helps in dealing with outliers and can also reduce the correlation between features.

In [None]:
data.skew()
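To get a feel for what these skewness values mean, here is a minimal sketch on synthetic data. The log-normal sample is an assumption chosen only because it has a long right tail, and `log1p` stands in for the Box-Cox transform (it is the special case with lambda equal to 0).

```python
import numpy as np
import pandas as pd

# Synthetic long-tailed sample, seeded for reproducibility.
rng = np.random.default_rng(0)
long_tailed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# Skewness before and after a log1p transform.
before = long_tailed.skew()
after = np.log1p(long_tailed).skew()
print(f"skew before: {before:.2f}, after log1p: {after:.2f}")
```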

In [None]:
highly_skewed = data.drop('avgRating', axis=1).columns

In [None]:
data_unskewed = pd.DataFrame()
for col in highly_skewed:
    data_unskewed[col] = boxcox1p(data[col], boxcox_normmax(data[col]+1))
data_unskewed.skew()

The transformation reduced the skewness of these columns.
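For reference, the transform computed by `boxcox1p` and the inverse computed by `inv_boxcox1p` can be written as follows, where $\lambda$ is the value estimated by `boxcox_normmax`:

$$
y = \begin{cases} \dfrac{(1 + x)^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[2mm] \log(1 + x), & \lambda = 0 \end{cases}
\qquad
x = \begin{cases} (1 + \lambda y)^{1/\lambda} - 1, & \lambda \neq 0 \\[2mm] e^{y} - 1, & \lambda = 0 \end{cases}
$$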

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_val_score

In [None]:
X = data_unskewed.drop('totalRating', axis=1).join(data['avgRating'])
y = data_unskewed['totalRating']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
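The scaler is fitted on the training split only, so the test split is standardized with the training mean and standard deviation and no information from the test data leaks into preprocessing. A minimal sketch with made-up arrays (the `toy_` names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature train and test splits.
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0]])

# Statistics come from the training split only.
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_scaler.mean_, toy_scaler.scale_)
print(toy_test_scaled)
```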

Function to revert the Box-Cox transformation of our target variable.

In [None]:
lam_total = boxcox_normmax(data['totalRating'] + 1)  # estimate lambda once

def inv_pred(y_pred):
    return inv_boxcox1p(y_pred, lam_total)
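The inverse only recovers the original values if it uses the same lambda as the forward transform. A small round-trip sketch, with a hypothetical fixed lambda of 0.15 and made-up counts (in the notebook the lambda comes from `boxcox_normmax`):

```python
import numpy as np
from scipy.special import boxcox1p, inv_boxcox1p

# Hypothetical rating counts and a hypothetical lambda.
counts = np.array([0.0, 3.0, 50.0, 1200.0, 90000.0])
lam = 0.15

# Forward transform, then inverse with the same lambda.
transformed = boxcox1p(counts, lam)
recovered = inv_boxcox1p(transformed, lam)
print(recovered)
```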

In [None]:
lr = LinearRegression().fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
y_pred_inv = inv_pred(y_pred_lr)
mean_absolute_error(inv_pred(y_test), y_pred_inv)  # y_test is Box-Cox transformed, so invert it too
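The mean absolute error is the average of the absolute differences between targets and predictions, so it is only meaningful when both are on the same scale. A minimal sketch with made-up numbers (the `mae` helper is hypothetical and mirrors what `mean_absolute_error` computes for one-dimensional input):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical targets and predictions, both on the original scale.
toy_true = [100, 200, 300]
toy_pred = [110, 190, 330]
error = mae(toy_true, toy_pred)
print(error)
```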

In [None]:
svr = SVR(C=20, epsilon=0.01, gamma=0.0001, tol=0.0001)
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)
y_pred_inv = inv_pred(y_pred)
mean_absolute_error(inv_pred(y_test), y_pred_inv)  # y_test is Box-Cox transformed, so invert it too

In [None]:
xgb = XGBRegressor(learning_rate=0.001, n_estimators=3000,
    max_depth=4, min_child_weight=0,
    subsample=0.8, colsample_bytree=0.4,
    n_jobs=-1, scale_pos_weight=2,
    random_state=42)
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
y_pred_inv = inv_pred(y_pred)
mean_absolute_error(inv_pred(y_test), y_pred_inv)  # y_test is Box-Cox transformed, so invert it too

In [None]:
lgbm = LGBMRegressor(objective='regression', n_estimators=3000,
                     num_leaves=5, learning_rate=0.0001,
                     max_bin=150, bagging_fraction=0.3,
                     n_jobs=-1, bagging_freq=7,
                     feature_fraction=0.1, min_data_in_leaf=8)
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict(X_test)
y_pred_inv = inv_pred(y_pred)
mean_absolute_error(inv_pred(y_test), y_pred_inv)  # y_test is Box-Cox transformed, so invert it too

In [None]:
def blend_model(X, a, b, c):
    # Weighted sum of the LGBM, XGBoost and SVR predictions.
    return a * lgbm.predict(X) + b * xgb.predict(X) + c * svr.predict(X)

In [None]:
y_pred = blend_model(X_test, 0.45, 0.25, 0.30)
y_pred_inv = inv_pred(y_pred)
mean_absolute_error(inv_pred(y_test), y_pred_inv)  # y_test is Box-Cox transformed, so invert it too
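Blending here is a weighted average of the individual predictions, with weights that sum to 1 so the result stays on the same scale as the inputs. A minimal sketch with made-up prediction arrays (the `blend` helper and the numbers are hypothetical; only the weights match the cell above):

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of several prediction arrays of equal length."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("blend weights should sum to 1")
    return np.average(np.vstack(predictions), axis=0, weights=weights)

# Hypothetical predictions from three models for two samples.
pred_a = np.array([10.0, 20.0])
pred_b = np.array([12.0, 18.0])
pred_c = np.array([8.0, 22.0])

blended = blend([pred_a, pred_b, pred_c], [0.45, 0.25, 0.30])
print(blended)
```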

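We get a mean absolute error of 11519.34 after blending our models. Although this is greater than what LGBM achieved on its own, we expect the blend to generalize better to new data. Note that this figure was measured against y_test on the Box-Cox scale, so it is worth recomputing with both predictions and targets on the original scale.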
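One way to check how well a model generalizes, rather than relying on a single train/test split, is k-fold cross-validation with `cross_val_score`. The sketch below uses synthetic data and a plain linear model as stand-ins. Neither is part of this notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, seeded for reproducibility.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# scikit-learn reports the negated MAE so that higher is better; flip the sign.
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=5, scoring="neg_mean_absolute_error")
mae_per_fold = -scores
print(mae_per_fold.mean(), mae_per_fold.std())
```

A small spread across folds suggests the error estimate is stable. A large spread suggests it depends heavily on which rows end up in the test split.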

Thanks for reading my notebook. If you found it useful, an upvote would help a lot. Feedback and suggestions are appreciated.