“The so-called paradoxes of an author, to which a reader takes exception, often exist not in the author’s book at all, but rather in the reader’s head.” – Friedrich Nietzsche

Books are open doors to unimagined worlds, and those worlds are unique to every reader. For many people, reading is more than just a hobby, and some would rather spend time with books than with anything else. Here we explore a large database of books spanning different genres and thousands of authors. In this project, we use the dataset to build a machine learning model that predicts the price of a book from a given set of features.

# Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [234]:
df=pd.read_excel('F:\\Data_Train.xlsx')
df=df.drop('Author',axis=1)
df.head()

Unnamed: 0,Title,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
0,The Prisoner's Gold (The Hunters 3),"Paperback,– 10 Mar 2016",4.0 out of 5 stars,8 customer reviews,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.0
1,Guru Dutt: A Tragedy in Three Acts,"Paperback,– 7 Nov 2012",3.9 out of 5 stars,14 customer reviews,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93
2,Leviathan (Penguin Classics),"Paperback,– 25 Feb 1982",4.8 out of 5 stars,6 customer reviews,"""During the time men live without a common Pow...",International Relations,Humour,299.0
3,A Pocket Full of Rye (Miss Marple),"Paperback,– 5 Oct 2017",4.1 out of 5 stars,13 customer reviews,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0
4,LIFE 70 Years of Extraordinary Photography,"Hardcover,– 10 Oct 2006",5.0 out of 5 stars,1 customer review,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62


In [235]:
df['Ratings'].nunique()

342

In [236]:
df['BookCategory'].value_counts()

Action & Adventure                      818
Crime, Thriller & Mystery               723
Biographies, Diaries & True Accounts    596
Language, Linguistics & Writing         594
Comics & Mangas                         583
Romance                                 560
Humour                                  540
Arts, Film & Photography                517
Computing, Internet & Digital Media     510
Sports                                  471
Politics                                325
Name: BookCategory, dtype: int64

In [237]:
df['Edition']=df['Edition'].str.split(',').str[-1]
df['year']=df['Edition'].str.split().str[-1]

In [238]:
df['year'].mode() 

0    2018
dtype: object

In [239]:
# Non-year tokens are replaced with 2018, the most frequent year found above
for token in ['Import','Print','Audiobook','Facsimile','Edition','Unabridged','NTSC','set']:
    df['year']=df['year'].str.replace(token,'2018',regex=False)

In [240]:
df['year'].value_counts()
df['year']=df['year'].astype('int')
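
The replacements above can also be written as a single numeric conversion. The following is a minimal sketch, assuming the last token of `Edition` is either a year or a non-year word; the `extract_year` helper and the `editions` sample rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

def extract_year(edition: pd.Series) -> pd.Series:
    """Return the trailing year of each edition string as a float.

    Entries whose last token is not a number are filled with the most
    frequent year, mirroring the mode-based replacement used above.
    """
    last_token = edition.str.split(',').str[-1].str.split().str[-1]
    year = pd.to_numeric(last_token, errors='coerce')  # non-years become NaN
    return year.fillna(year.mode().iloc[0])

# Hypothetical sample rows, not taken from the dataset
editions = pd.Series([
    'Paperback,– 10 Mar 2016',
    'Paperback,– 7 Nov 2016',
    'Hardcover,– Import',
    'Paperback,– 25 Feb 1982',
])
years = extract_year(editions)
print(years.tolist())
```

This avoids listing every non-year token by hand, at the cost of computing the fill value from the data rather than fixing it at 2018.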

In [241]:
df=df.drop('Edition',axis=1)

In [242]:
df['Reviews']=df['Reviews'].str.split().str[0]
df['year']=df['year'].astype('float')

In [243]:
df.head()

Unnamed: 0,Title,Reviews,Ratings,Synopsis,Genre,BookCategory,Price,year
0,The Prisoner's Gold (The Hunters 3),4.0,8 customer reviews,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.0,2016.0
1,Guru Dutt: A Tragedy in Three Acts,3.9,14 customer reviews,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93,2012.0
2,Leviathan (Penguin Classics),4.8,6 customer reviews,"""During the time men live without a common Pow...",International Relations,Humour,299.0,1982.0
3,A Pocket Full of Rye (Miss Marple),4.1,13 customer reviews,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0,2017.0
4,LIFE 70 Years of Extraordinary Photography,5.0,1 customer review,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62,2006.0


In [244]:
df['Ratings']=df['Ratings'].str.split().str[0]
df['Ratings']=df['Ratings'].str.replace(',','')
df['Ratings']=df['Ratings'].astype('float')
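
In the cells above, `Reviews` is reduced to its first token but does not appear to be cast to a number, so it likely remains a string column until scikit-learn converts it during fitting. A small sketch of converting both columns explicitly, using a hypothetical two-row frame named `raw` in the same shape as the original data:

```python
import pandas as pd

# Hypothetical sample in the same shape as the raw columns
raw = pd.DataFrame({
    'Reviews': ['4.0 out of 5 stars', '3.9 out of 5 stars'],
    'Ratings': ['8 customer reviews', '1,204 customer reviews'],
})

# Star rating: first token of the string, cast to float
stars = raw['Reviews'].str.split().str[0].astype(float)

# Review count: first token, thousands separator removed, cast to float
counts = (
    raw['Ratings'].str.split().str[0]
    .str.replace(',', '', regex=False)
    .astype(float)
)
print(stars.tolist(), counts.tolist())
```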

In [245]:
test=pd.read_csv('F:\\test.csv')
test['Title']=test['Title'].fillna('not available')

In [246]:
# Replace every non-letter character with a space (regex=True is required in pandas 2.x)
df['Title']=df['Title'].str.replace('[^a-zA-Z]',' ',regex=True)
#df['Title']=df['Title'].apply(lambda x:' '.join([word for word in x.split() if len(word)>3]))
df['Title']=df['Title'].str.lower()

In [247]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer(stop_words='english',max_features=300,min_df=2)
tf.fit(df['Title'])
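# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide
df_tf=pd.DataFrame(tf.transform(df['Title']).toarray(),columns=tf.get_feature_names_out()).add_prefix('tfidf_')
df=df.join(df_tf)
df=df.drop('Title',axis=1)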

In [248]:
test.isnull().sum()

Title                                   0
Reviews                                 0
Ratings                                 0
Synopsis                                0
Genre                                   0
year                                    0
Action & Adventure                      0
Arts, Film & Photography                0
Biographies, Diaries & True Accounts    0
Comics & Mangas                         0
Computing, Internet & Digital Media     0
Crime, Thriller & Mystery               0
Humour                                  0
Language, Linguistics & Writing         0
Politics                                0
Romance                                 0
Sports                                  0
dtype: int64

In [249]:
# Reuse the vectorizer fitted on the training titles so both frames share one vocabulary
test_tf=pd.DataFrame(tf.transform(test['Title']).toarray(),columns=tf.get_feature_names_out()).add_prefix('tfidf_')
test=test.join(test_tf)
test=test.drop('Title',axis=1)

In [250]:
cat=pd.get_dummies(df['BookCategory'])
df=df.join(cat)
df=df.drop('BookCategory',axis=1)
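
The `test` frame loaded earlier already contains the category columns, so it needs no encoding here. If a raw file had to be encoded separately, the dummy columns would need to match the training ones. A sketch of that alignment, using hypothetical `train_cat` and `test_cat` series:

```python
import pandas as pd

# Toy category columns for illustration only
train_cat = pd.Series(['Romance', 'Sports', 'Humour'], name='BookCategory')
test_cat = pd.Series(['Sports', 'Sports'], name='BookCategory')

train_dummies = pd.get_dummies(train_cat, dtype=float)

# Reindex so the test frame has exactly the training columns, in the same
# order; categories absent from the test rows become all-zero columns.
test_dummies = (
    pd.get_dummies(test_cat, dtype=float)
    .reindex(columns=train_dummies.columns, fill_value=0.0)
)
print(test_dummies)
```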

In [251]:
# Replace every non-letter character with a space (regex=True is required in pandas 2.x)
df['Synopsis']=df['Synopsis'].str.replace('[^a-zA-Z]',' ',regex=True)
#df['Synopsis']=df['Synopsis'].apply(lambda x:' '.join([word for word in x.split() if len(word)>3]))
df['Synopsis']=df['Synopsis'].str.lower()

In [252]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer(stop_words='english',max_features=700,min_df=2)
tf.fit(df['Synopsis'])
df_tf=pd.DataFrame(tf.transform(df['Synopsis']).toarray(),columns=tf.get_feature_names_out()).add_prefix('tfidf1_')
df=df.join(df_tf)
df=df.drop('Synopsis',axis=1)

In [253]:
test_tf=pd.DataFrame(tf.transform(test['Synopsis']).toarray(),columns=tf.get_feature_names_out()).add_prefix('tfidf1_')
test=test.join(test_tf)
test=test.drop('Synopsis',axis=1)

In [254]:
df.head()

Unnamed: 0,Reviews,Ratings,Genre,Price,year,tfidf_advanced,tfidf_adventure,tfidf_adventures,tfidf_age,tfidf_album,...,tfidf1_world,tfidf1_write,tfidf1_writer,tfidf1_writers,tfidf1_writing,tfidf1_written,tfidf1_year,tfidf1_years,tfidf1_york,tfidf1_young
0,4.0,8.0,Action & Adventure (Books),220.0,2016.0,0.0,0.0,0.0,0.0,0.0,...,0.095226,0.0,0.0,0.0,0.167754,0.0,0.0,0.0,0.0,0.0
1,3.9,14.0,Cinema & Broadcast (Books),202.93,2012.0,0.0,0.0,0.0,0.0,0.0,...,0.07103,0.0,0.0,0.0,0.0,0.110767,0.0,0.0,0.0,0.0
2,4.8,6.0,International Relations,299.0,1982.0,0.0,0.0,0.0,0.0,0.0,...,0.112304,0.0,0.0,0.0,0.0,0.087566,0.0,0.073238,0.0,0.0
3,4.1,13.0,Contemporary Fiction (Books),180.0,2017.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5.0,1.0,Photography Textbooks,965.62,2006.0,0.0,0.0,0.0,0.0,0.0,...,0.113579,0.0,0.0,0.0,0.0,0.0,0.0,0.14814,0.0,0.0


In [255]:
# Replace every non-letter character with a space (regex=True is required in pandas 2.x)
df['Genre']=df['Genre'].str.replace('[^a-zA-Z]',' ',regex=True)
#df['Genre']=df['Genre'].apply(lambda x:' '.join([word for word in x.split() if len(word)>3]))
df['Genre']=df['Genre'].str.lower()

In [256]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer(stop_words='english',max_features=200,min_df=2)
tf.fit(df['Genre'])
df_tf=pd.DataFrame(tf.transform(df['Genre']).toarray(),columns=tf.get_feature_names_out()).add_prefix('tfidf2_')
df=df.join(df_tf)
df=df.drop('Genre',axis=1)

In [257]:
test_tf=pd.DataFrame(tf.transform(test['Genre']).toarray(),columns=tf.get_feature_names_out()).add_prefix('tfidf2_')
test=test.join(test_tf)
test=test.drop('Genre',axis=1)

In [258]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

In [259]:
x=df.drop('Price',axis=1)
y=df['Price']

In [260]:
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42)

In [261]:
from sklearn.ensemble import BaggingRegressor
bag=BaggingRegressor(n_estimators=300)
bag.fit(x_train,y_train)
y_bag=bag.predict(x_test)
np.sqrt(mean_squared_error(y_test,y_bag))

562.3607803217104

In [262]:
from sklearn.ensemble import RandomForestRegressor
rf=RandomForestRegressor(n_estimators=400,n_jobs=-1)
rf.fit(x_train,y_train)
y_rf=rf.predict(x_test)
np.sqrt(mean_squared_error(y_test,y_rf))

610.1073758570097

In [263]:
dt=DecisionTreeRegressor()
dt.fit(x_train,y_train)
y_dt=dt.predict(x_test)
np.sqrt(mean_squared_error(y_test,y_dt))

755.7940534754465
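
The three RMSE values above come from a single train/test split, and none of the models sets a `random_state`, so the ranking may shift between runs. A hedged sketch of comparing the same model families with 5-fold cross-validation follows. It uses synthetic data from `make_regression` rather than the book dataset, and the smaller `n_estimators` values are chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the feature matrix and prices
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

models = {
    'bagging': BaggingRegressor(n_estimators=20, random_state=0),
    'random_forest': RandomForestRegressor(n_estimators=20, random_state=0),
    'decision_tree': DecisionTreeRegressor(random_state=0),
}

rmse = {}
for name, model in models.items():
    # The scorer returns negative RMSE, so flip the sign before averaging
    scores = cross_val_score(model, X, y, cv=5,
                             scoring='neg_root_mean_squared_error')
    rmse[name] = -scores.mean()

for name, value in rmse.items():
    print(f'{name}: {value:.2f}')
```

Averaging over folds gives a steadier basis for choosing which model to use for the final predictions.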

In [264]:
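# Predict prices for the competition test set.
# Note: this reuses the name y_test, replacing the hold-out labels from the split above.
y_test=bag.predict(test)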

In [265]:
y_test

array([ 304.25653333, 1339.0561    ,  356.24744111, ...,  476.40346667,
        794.17903333,  620.68406667])
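
The null-count listing for `test` shows the category columns ahead of the TF-IDF columns, while in `df` the dummies are joined after the title TF-IDF columns, so the two frames may not share a column order. Older scikit-learn versions use columns by position without checking names, whereas recent ones validate feature names at predict time. A minimal sketch of aligning the columns before predicting, using small hypothetical frames `train_X` and `new_X`:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)

# Toy training frame and a new frame with the same columns in another order
train_X = pd.DataFrame(rng.random((50, 3)),
                       columns=['year', 'tfidf_gold', 'Romance'])
train_y = 100 * train_X['year'] + rng.random(50)
new_X = pd.DataFrame(rng.random((5, 3)),
                     columns=['year', 'Romance', 'tfidf_gold'])

model = BaggingRegressor(n_estimators=10, random_state=0).fit(train_X, train_y)

# Select the training columns by name so each value reaches the right feature
aligned = new_X[train_X.columns]
pred = model.predict(aligned)
print(pred)
```

In the notebook's terms, this would correspond to something like `bag.predict(test[x.columns])`, assuming `test` contains every column of `x`.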

In [266]:
sub=pd.read_excel('F:\\Sample_Submission.xlsx')

In [267]:
sub['Price']=y_test

In [268]:
sub.to_excel('F:\\answers.xlsx',index=False)

In [269]:
sub

Unnamed: 0,Price
0,304.256533
1,1339.056100
2,356.247441
3,665.738833
4,469.897722
5,1258.968567
6,689.227667
7,344.889556
8,288.526467
9,451.568189
