# The Complete Solution To MachineHack's Predict The Book Price Hackathon
---

Published in Analytics India Magazine on October 7, 2019:

[Step-By-Step Guide To Cracking MachineHack’s Predict The Book Price Hackathon](https://analyticsindiamag.com/cracking-machinehacks-predict-the-book-price-hackathon/)


### Predict The Book Price Hackathon

“The so-called paradoxes of an author, to which a reader takes exception, often exist not in the author’s book at all, but rather in the reader’s head.” – Friedrich Nietzsche

Books are open doors to unimagined worlds that are unique to every reader. For many, reading is more than just a hobby, and plenty of us would rather spend time with books than with anything else.

Here we explore a large database of books spanning different genres and thousands of authors. In this challenge, participants are required to use the dataset to build a machine learning model that predicts the price of a book from a given set of features.

Size of training set: 6237 records
Size of test set: 1560 records

FEATURES:

* Title: The title of the book
* Author: The author(s) of the book.
* Edition: The edition of the book, e.g. (Paperback,– Import, 26 Apr 2018)
* Reviews: The customer reviews about the book
* Ratings: The customer ratings of the book
* Synopsis: The synopsis of the book
* Genre: The genre the book belongs to
* BookCategory: The department the book is usually available at.
* Price: The price of the book (Target variable)


Click [here](https://www.machinehack.com/course/predict-the-price-of-books/) to participate in the hackathon.

---

This Python notebook contains a complete step-by-step guide to working on the above-mentioned hackathon. Use this notebook to learn, and adapt this work to better your score.

### Approach

1. Exploring The Data Sets
2. Cleaning, Processing and Generating New Features
3. Building A Regressor
4. Optimizing The Hyperparameters Using Bayesian Optimization

The above steps are explained in detail as follows.

## 1. Exploring The Data Sets


---


In this step, we will import the datasets and will do a simple analysis that will help us process the data before predictive modeling.

This block involves:

* Importing the data
* Understanding the features and their characteristics
* Noting key observations from the data.






In [0]:
import pandas as pd

In [0]:
train = pd.read_excel("Data/Data_Train.xlsx")
test = pd.read_excel("Data/Data_Test.xlsx")

In [0]:
train.head(50)

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0 out of 5 stars,8 customer reviews,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9 out of 5 stars,14 customer reviews,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8 out of 5 stars,6 customer reviews,"""During the time men live without a common Pow...",International Relations,Humour,299.0
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",4.1 out of 5 stars,13 customer reviews,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006",5.0 out of 5 stars,1 customer review,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62
5,ChiRunning: A Revolutionary Approach to Effort...,Danny Dreyer,"Paperback,– 5 May 2009",4.5 out of 5 stars,8 customer reviews,The revised edition of the bestselling ChiRunn...,Healthy Living & Wellness (Books),Sports,900.0
6,Death on the Nile (Poirot),Agatha Christie,"Paperback,– 5 Oct 2017",4.4 out of 5 stars,72 customer reviews,Agatha Christie’s most exotic murder mystery\n...,"Crime, Thriller & Mystery (Books)","Crime, Thriller & Mystery",224.0
7,Yoga Your Home Practice Companion: A Complete ...,Sivananda Yoga Vedanta Centre,"Hardcover,– Import, 1 Mar 2018",4.7 out of 5 stars,16 customer reviews,"Achieve a healthy body, mental alertness, and ...",Sports Training & Coaching (Books),Sports,836.0
8,Karmayogi: A Biography of E. Sreedharan,M S Ashokan,"Paperback,– 15 Dec 2015",4.2 out of 5 stars,111 customer reviews,Karmayogi is the dramatic and inspiring story ...,Biographies & Autobiographies (Books),"Biographies, Diaries & True Accounts",130.0
9,"The Iron King (The Accursed Kings, Book 1)",Maurice Druon,"Paperback,– 26 Mar 2013",4.0 out of 5 stars,1 customer review,‘This is the original game of thrones’ George ...,Action & Adventure (Books),Action & Adventure,695.0


In [0]:
print(train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6237 entries, 0 to 6236
Data columns (total 9 columns):
Title           6237 non-null object
Author          6237 non-null object
Edition         6237 non-null object
Reviews         6237 non-null object
Ratings         6237 non-null object
Synopsis        6237 non-null object
Genre           6237 non-null object
BookCategory    6237 non-null object
Price           6237 non-null float64
dtypes: float64(1), object(8)
memory usage: 438.6+ KB
None


In [0]:
train.describe(include = 'all').head(2)

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
count,6237,6237,6237,6237,6237,6237,6237,6237,6237.0
unique,5568,3679,3370,36,342,5549,345,11,


In [0]:
print(train.columns)

Index(['Title', 'Author', 'Edition', 'Reviews', 'Ratings', 'Synopsis', 'Genre',
       'BookCategory', 'Price'],
      dtype='object')


In [0]:
## Removing Synopsis since here we are not going to use this feature

train = train[['Title', 'Author', 'Edition', 'Reviews', 'Ratings','Genre',
               'BookCategory', 'Price']]

test = test[['Title', 'Author', 'Edition', 'Reviews', 'Ratings','Genre',
               'BookCategory']]

In [0]:
train.head()

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Genre,BookCategory,Price
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0 out of 5 stars,8 customer reviews,Action & Adventure (Books),Action & Adventure,220.0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9 out of 5 stars,14 customer reviews,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8 out of 5 stars,6 customer reviews,International Relations,Humour,299.0
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",4.1 out of 5 stars,13 customer reviews,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006",5.0 out of 5 stars,1 customer review,Photography Textbooks,"Arts, Film & Photography",965.62


In [0]:
train.isnull().sum()

Title           0
Author          0
Edition         0
Reviews         0
Ratings         0
Genre           0
BookCategory    0
Price           0
dtype: int64

#### KEY OBSERVATIONS

* No null values in the dataset to treat.

* Some books have multiple authors in the Author column, which need to be processed and separated.

* The Edition column can be split into 3 different features (Type, Month and Year).

* The Reviews and Ratings columns are mislabelled: Reviews holds the star rating and Ratings holds the number of customer reviews.

* After swapping the labels, Ratings (the star score) needs to be cleaned into a float and Reviews (the review count) into an integer.

* Like authors, a book may belong to multiple categories and genres, so we will need to split both the Genre and BookCategory columns.
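The cleaning implied by the mislabelled columns can be tried out on single strings first. This is a minimal sketch rather than the notebook's own code; the helper names `parse_rating` and `parse_review_count` are hypothetical, and the last input is a made-up value that only shows how a thousands separator is handled.

```python
import re

def parse_rating(text):
    """Turn a string like '4.0 out of 5 stars' into the float 4.0."""
    return float(re.sub(r" out of 5 stars", "", text).strip())

def parse_review_count(text):
    """Turn '8 customer reviews' or '1 customer review' into an int."""
    count = re.sub(r" customer reviews?", "", text)  # singular and plural
    return int(count.replace(",", "").strip())       # drop thousands separators

print(parse_rating("4.0 out of 5 stars"))
print(parse_review_count("8 customer reviews"))
print(parse_review_count("1 customer review"))
print(parse_review_count("1,234 customer reviews"))  # made-up example value
```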



## Processing The Data
---

In this stage we will process the data by cleaning and making it ready for modeling.

This stage involves:

* Cleaning the data and generating new features
* Encoding all categorical variables
* Scaling the data




### Cleaning And Generating New Features

#### Splitting Edition Column

---

We will clean the column Edition and will create 3 new features from it which are Type, Month and Year.


In [0]:
#A method to clean and restructure the Edition column

def split_edition(data):  
  
  edition  = list(data)
  
  ed_type = [i.split(",– ")[0].strip().upper() for i in edition]
  
  edit_date = [i.split(",– ")[1].strip() for i in edition]
  
  m_y = [i.split()[-2:] for i in edit_date]
  
  
  for i in range(len(m_y)):
    if len(m_y[i]) == 1:
      m_y[i].insert(0,'NA')
      
  # Based on the given dataset below is the list of possible values for Months
  
  months =  ['Apr','Aug','Dec','Feb', 'Jan', 'Jul','Jun','Mar','May','NA','Nov','Oct','Sep']
  
  ed_month = [m_y[i][0].upper() if m_y[i][0] in months else 'NA' for i in range(len(m_y))]
  ed_year = [int(m_y[i][1].strip()) if m_y[i][1].isdigit() else 0 for i in range(len(m_y))]
  
  return ed_type, ed_month, ed_year
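To see what the parsing does to individual values, here is a minimal standalone sketch that applies the same rules to one string at a time. The function name `parse_edition` and the constant `MONTHS` are hypothetical; the two short inputs at the end are made up to exercise the fallback branches.

```python
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def parse_edition(edition):
    """Split one Edition string into (type, month, year)."""
    parts = edition.split(",– ")
    ed_type = parts[0].strip().upper()
    tokens = parts[1].strip().split()[-2:]  # last two tokens: month and year
    if len(tokens) == 1:                    # only one token, so no month
        tokens.insert(0, 'NA')
    month = tokens[0].upper() if tokens[0] in MONTHS else 'NA'
    year = int(tokens[1]) if tokens[1].isdigit() else 0
    return ed_type, month, year

print(parse_edition("Paperback,– 10 Mar 2016"))
print(parse_edition("Hardcover,– Import, 1 Mar 2018"))
print(parse_edition("Paperback,– 2016"))    # made-up: year only
print(parse_edition("Paperback,– Import"))  # made-up: no date at all
```

A missing month becomes 'NA' and a missing year becomes 0, which is why `Edition_Year` can have a minimum of 0 after restructuring.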

#### Splitting Author Columns


To split a column into multiple features, we must first determine how many features the existing column can account for. To split the Author column into multiple authors, we therefore need the maximum number of authors for a single book in the given datasets. We will combine the training and test sets to find it.

We will also store the name of every author, which we will need later for label encoding.

We will apply the same principles for the Genre as well as the BookCategory columns.

In [0]:
#Identifying the maximum number of authors for a single book from the given datasets
authors_1 = list(train['Author'])
authors_2 = list(test['Author'])

authors_1.extend(authors_2)

authorslis = [i.split(",") for i in authors_1]

max_authors = 1
for i in authorslis:
  if len(i) >= max_authors:
    max_authors = len(i)
print("Max. number of authors for a single book = ",max_authors)

for i in range(len(authorslis)):
  if len(authorslis[i]) == max_authors:
    print(i)    
    
all_authors = [author.strip().upper() for listin in authorslis for author in listin]
    

Max. number of authors for a single book =  7
7008


In [0]:
# A method to split the Author column in to 7 new columns
def split_authors(data):
  
  authors = list(data)
  
  A1 = []
  A2 = []
  A3 = []
  A4 = []
  A5 = []
  A6 = []
  A7 = []
  for i in authors:
    
    try :
      A1.append(i.split(',')[0].strip().upper())
    except :
      A1.append('NONE')
      
    try :
      A2.append(i.split(',')[1].strip().upper())
    except :
      A2.append('NONE')
        
    try :
      A3.append(i.split(',')[2].strip().upper())
    except :
      A3.append('NONE')
        
    try :
      A4.append(i.split(',')[3].strip().upper())
    except :
      A4.append('NONE')
        
    try :
      A5.append(i.split(',')[4].strip().upper())
    except :
      A5.append('NONE')
      
    try :
      A6.append(i.split(',')[5].strip().upper())
    except :
      A6.append('NONE')
     
    try :
      A7.append(i.split(',')[6].strip().upper())
    except :
      A7.append('NONE')

      
  return A1,A2,A3,A4,A5,A6,A7
  
all_authors.append('NONE')
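The same split-and-pad idea can be written more compactly. The following is a sketch of an alternative, not the code used in this notebook; the helper `split_padded` is hypothetical and the second author string is made up for illustration.

```python
def split_padded(value, n_columns, fill='NONE'):
    """Split a comma-separated string into exactly n_columns upper-cased parts."""
    parts = [p.strip().upper() for p in value.split(',')]
    parts = parts[:n_columns]                          # ignore anything beyond n_columns
    return parts + [fill] * (n_columns - len(parts))   # pad the remainder

authors = ["Agatha Christie", "Author A, Author B, Author C"]  # second entry is made up
rows = [split_padded(a, 7) for a in authors]
columns = list(zip(*rows))  # transpose rows into per-column tuples (A1..A7)

print(rows[0])
print(columns[1])
```

Passing `n_columns=2` would give the equivalent behaviour for the Genre and BookCategory columns.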

#### Splitting Genre Columns


In [0]:
#Identifying the maximum number of Genres for a single book from the given datasets

genre_1 = list(train['Genre'])
genre_2 = list(test['Genre'])

genre_1.extend(genre_2)

genre_lis = [i.split(",") for i in genre_1]


max_genres = 1
for i in genre_lis:
  if len(i) >= max_genres:
    max_genres = len(i)
print("Max. number of genres for a single book = ",max_genres)
      
all_genres = [genre.strip().upper() for listin in genre_lis for genre in listin]
    


Max. number of genres for a single book =  2


In [0]:
# A method to split the Genre column in to 2 new columns

def split_genres(data):
  
  genres = list(data)
  
  G1 = []
  G2 = []
  
  for i in genres:
    
    try :
      G1.append(i.split(',')[0].strip().upper())
      
    except :
      G1.append('NONE')
      
    try :
      G2.append(i.split(',')[1].strip().upper())
    except :
      G2.append('NONE')


      
  return G1,G2
  
all_genres.append('NONE')

#### Splitting BookCategory Column


In [0]:
#Identifying the maximum number of Categories for a single book from the given datasets

cat_1 = list(train['BookCategory'])
cat_2 = list(test['BookCategory'])

cat_1.extend(cat_2)

cat_lis = [i.split(",") for i in cat_1]


max_categories = 1
for i in cat_lis:
  if len(i) >= max_categories:
    max_categories = len(i)
print("Max. number of Categories for a single book = ",max_categories)

all_categories = [cat.strip().upper() for listin in cat_lis for cat in listin]
    

Max. number of Categories for a single book =  2


In [0]:
# A method to split the Category column in to 2 new columns

def split_categories(data):
  
  cat = list(data)
  
  C1 = []
  C2 = []

  for i in cat:
    
    try :
      C1.append(i.split(',')[0].strip().upper())
    except :
      C1.append('NONE')
      
    try :
      C2.append(i.split(',')[1].strip().upper())
    except :
      C2.append('NONE')


      
  return C1,C2
  
all_categories.append('NONE')


#### Cleaning & Restructuring The Datasets

In [0]:
# A method to clean and restructure the datasets

import re

def restructure(data):
  
  #Cleaning Title Column
  titles = list(data['Title'])
  titles = [title.strip().upper() for title in titles]
  
  #Cleaning & Restructuring Author Column
  a1,a2,a3,a4,a5,a6,a7 = split_authors(data['Author']) 
  
  #Cleaning & Restructuring Edition Column
  ed_type, ed_month, ed_year = split_edition(data['Edition'])
  
  #Cleaning Ratings Column
  ratings = list(data['Reviews'])
  ratings = [float(re.sub(" out of 5 stars", "", i).strip()) for i in ratings]
  
  #Cleaning Reviews Column
  reviews = list(data['Ratings'])
  plu = ' customer reviews'
  reviews = [re.sub(" customer reviews", "", i) if plu in i else re.sub(" customer review", "", i) for i in reviews  ]
  reviews = [int(re.sub(",", "", i).strip()) for i in reviews ]
  

  #Cleaning & Restructuring Genre Column
  g1, g2 = split_genres(data['Genre'])
  
  #Cleaning & Restructuring BookCategory Column
  c1,c2 = split_categories(data['BookCategory'])

  # Forming the Structured dataset
  structured_data = pd.DataFrame({'Title': titles,
                                  'Author1': a1,
                                  'Author2': a2,
                                  'Author3': a3,
                                  'Author4': a4,
                                  'Author5': a5,
                                  'Author6': a6,
                                  'Author7': a7,
                                  'Edition_Type': ed_type,
                                  'Edition_Month': ed_month,
                                  'Edition_Year': ed_year,
                                  'Ratings': ratings,
                                  'Reviews': reviews,
                                  'Genre1': g1,
                                  'Genre2': g2,
                                  'Category1': c1,
                                  'Category2': c2
                                  
                               })
  
  return structured_data

 

In [0]:
restructure(train).head(3)

Unnamed: 0,Title,Author1,Author2,Author3,Author4,Author5,Author6,Author7,Edition_Type,Edition_Month,Edition_Year,Ratings,Reviews,Genre1,Genre2,Category1,Category2
0,THE PRISONER'S GOLD (THE HUNTERS 3),CHRIS KUZNESKI,NONE,NONE,NONE,NONE,NONE,NONE,PAPERBACK,MAR,2016,4.0,8,ACTION & ADVENTURE (BOOKS),NONE,ACTION & ADVENTURE,NONE
1,GURU DUTT: A TRAGEDY IN THREE ACTS,ARUN KHOPKAR,NONE,NONE,NONE,NONE,NONE,NONE,PAPERBACK,NOV,2012,3.9,14,CINEMA & BROADCAST (BOOKS),NONE,BIOGRAPHIES,DIARIES & TRUE ACCOUNTS
2,LEVIATHAN (PENGUIN CLASSICS),THOMAS HOBBES,NONE,NONE,NONE,NONE,NONE,NONE,PAPERBACK,FEB,1982,4.8,6,INTERNATIONAL RELATIONS,NONE,HUMOUR,NONE


In [0]:

X_train = restructure(train)

Y_train = train.iloc[:, -1].values

X_test = restructure(test)


In [0]:
X_train.describe(include = 'all')

Unnamed: 0,Title,Author1,Author2,Author3,Author4,Author5,Author6,Author7,Edition_Type,Edition_Month,Edition_Year,Ratings,Reviews,Genre1,Genre2,Category1,Category2
count,6237,6237,6237,6237,6237,6237,6237,6237,6237,6237,6237.0,6237.0,6237.0,6237,6237,6237,6237
unique,5564,3633,264,73,21,5,1,1,19,13,,,,345,27,11,6
top,CASINO ROYALE: JAMES BOND 007 (VINTAGE),AGATHA CHRISTIE,NONE,NONE,NONE,NONE,NONE,NONE,PAPERBACK,OCT,,,,ACTION & ADVENTURE (BOOKS),NONE,ACTION & ADVENTURE,NONE
freq,4,69,5929,6159,6214,6233,6237,6237,5193,639,,,,947,5594,818,3297
mean,,,,,,,,,,,2005.101972,4.293202,35.984287,,,,
std,,,,,,,,,,,116.82151,0.662501,149.995031,,,,
min,,,,,,,,,,,0.0,1.0,1.0,,,,
25%,,,,,,,,,,,2010.0,4.0,2.0,,,,
50%,,,,,,,,,,,2014.0,4.4,7.0,,,,
75%,,,,,,,,,,,2017.0,4.8,22.0,,,,


In [0]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6237 entries, 0 to 6236
Data columns (total 17 columns):
Title            6237 non-null object
Author1          6237 non-null object
Author2          6237 non-null object
Author3          6237 non-null object
Author4          6237 non-null object
Author5          6237 non-null object
Author6          6237 non-null object
Author7          6237 non-null object
Edition_Type     6237 non-null object
Edition_Month    6237 non-null object
Edition_Year     6237 non-null int64
Ratings          6237 non-null float64
Reviews          6237 non-null int64
Genre1           6237 non-null object
Genre2           6237 non-null object
Category1        6237 non-null object
Category2        6237 non-null object
dtypes: float64(1), int64(2), object(14)
memory usage: 828.4+ KB


In [0]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 17 columns):
Title            1560 non-null object
Author1          1560 non-null object
Author2          1560 non-null object
Author3          1560 non-null object
Author4          1560 non-null object
Author5          1560 non-null object
Author6          1560 non-null object
Author7          1560 non-null object
Edition_Type     1560 non-null object
Edition_Month    1560 non-null object
Edition_Year     1560 non-null int64
Ratings          1560 non-null float64
Reviews          1560 non-null int64
Genre1           1560 non-null object
Genre2           1560 non-null object
Category1        1560 non-null object
Category2        1560 non-null object
dtypes: float64(1), int64(2), object(14)
memory usage: 207.3+ KB


### Encoding Categorical Features

In [0]:
# A method for Finding Unique items for all columns
def unique_items(list1, list2):
  a = list1
  b = list2
  a.extend(b)
  return list(set(a))  

In [0]:
from sklearn.preprocessing import LabelEncoder

le_Title = LabelEncoder()
all_titles = unique_items(list(X_train.Title),list(X_test.Title))
le_Title.fit(all_titles)

le_Edition_Type = LabelEncoder()
all_etypes = unique_items(list(X_train.Edition_Type),list(X_test.Edition_Type))
le_Edition_Type.fit(all_etypes)


le_Edition_Month = LabelEncoder()
all_em = unique_items(list(X_train.Edition_Month),list(X_test.Edition_Month))
le_Edition_Month.fit(all_em)

le_Author = LabelEncoder()
all_Authors = list(set(all_authors))
le_Author.fit(all_Authors)

le_Genre = LabelEncoder()
all_Genres = list(set(all_genres))
le_Genre.fit(all_Genres)

le_Category = LabelEncoder()
all_Categories = list(set(all_categories))
le_Category.fit(all_Categories)


LabelEncoder()
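The encoders above are fitted on labels from both the training and test sets because `LabelEncoder` assigns integer codes to the sorted unique labels it saw during fitting and raises an error when asked to transform a label it has not seen. The sketch below imitates that mapping in plain Python to make the point; `fit_labels` is a hypothetical stand-in, not scikit-learn's implementation, and the edition types are made-up sample values.

```python
def fit_labels(values):
    """Map each distinct label to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

train_types = ["PAPERBACK", "HARDCOVER", "PAPERBACK"]
test_types = ["PAPERBACK", "MASS MARKET PAPERBACK"]  # second label never occurs in train

mapping_train_only = fit_labels(train_types)
mapping_combined = fit_labels(train_types + test_types)

# Labels that a train-only encoder could not transform
unseen = [t for t in test_types if t not in mapping_train_only]

# With the combined mapping every test label has a code
encoded_test = [mapping_combined[t] for t in test_types]

print(unseen)
print(encoded_test)
```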

In [0]:

X_train['Title'] = le_Title.transform(X_train['Title'])

X_train['Edition_Type'] = le_Edition_Type.transform(X_train['Edition_Type'])



X_train['Edition_Month'] = le_Edition_Month.transform(X_train['Edition_Month'])

X_train['Author1'] = le_Author.transform(X_train['Author1'])
X_train['Author2'] = le_Author.transform(X_train['Author2'])
X_train['Author3'] = le_Author.transform(X_train['Author3'])
X_train['Author4'] = le_Author.transform(X_train['Author4'])
X_train['Author5'] = le_Author.transform(X_train['Author5'])
X_train['Author6'] = le_Author.transform(X_train['Author6'])
X_train['Author7'] = le_Author.transform(X_train['Author7'])


X_train['Genre1'] = le_Genre.transform(X_train['Genre1'])
X_train['Genre2'] = le_Genre.transform(X_train['Genre2'])


X_train['Category1'] = le_Category.transform(X_train['Category1'])
X_train['Category2'] = le_Category.transform(X_train['Category2'])



In [0]:
X_train.head()

Unnamed: 0,Title,Author1,Author2,Author3,Author4,Author5,Author6,Author7,Edition_Type,Edition_Month,Edition_Year,Ratings,Reviews,Genre1,Genre2,Category1,Category2
0,5802,797,3073,3073,3073,3073,3073,3073,13,7,2016,4.0,8,0,267,0,12
1,2120,391,3073,3073,3073,3073,3073,3073,13,10,2012,3.9,14,80,267,2,6
2,2984,4353,3073,3073,3073,3073,3073,3073,13,3,1982,4.8,6,211,267,8,12
3,189,78,3073,3073,3073,3073,3073,3073,13,11,2017,4.1,13,98,267,5,16
4,2987,1221,3073,3073,3073,3073,3073,3073,8,11,2006,5.0,1,284,267,1,7


In [0]:

X_test['Title'] = le_Title.transform(X_test['Title'])

X_test['Edition_Type'] = le_Edition_Type.transform(X_test['Edition_Type'])



X_test['Edition_Month'] = le_Edition_Month.transform(X_test['Edition_Month'])

X_test['Author1'] = le_Author.transform(X_test['Author1'])
X_test['Author2'] = le_Author.transform(X_test['Author2'])
X_test['Author3'] = le_Author.transform(X_test['Author3'])
X_test['Author4'] = le_Author.transform(X_test['Author4'])
X_test['Author5'] = le_Author.transform(X_test['Author5'])
X_test['Author6'] = le_Author.transform(X_test['Author6'])
X_test['Author7'] = le_Author.transform(X_test['Author7'])


X_test['Genre1'] = le_Genre.transform(X_test['Genre1'])
X_test['Genre2'] = le_Genre.transform(X_test['Genre2'])


X_test['Category1'] = le_Category.transform(X_test['Category1'])
X_test['Category2'] = le_Category.transform(X_test['Category2'])

In [0]:
X_test.head()

Unnamed: 0,Title,Author1,Author2,Author3,Author4,Author5,Author6,Author7,Edition_Type,Edition_Month,Edition_Year,Ratings,Reviews,Genre1,Genre2,Category1,Category2
0,5082,4058,3073,3073,3073,3073,3073,3073,12,11,1986,4.4,960,324,267,5,16
1,2906,1401,3073,3073,3073,3073,3073,3073,13,0,2018,5.0,1,273,267,4,9
2,751,949,3073,3073,3073,3073,3073,3073,13,7,2011,5.0,4,314,267,14,12
3,6232,169,3073,3073,3073,3073,3073,3073,13,9,2016,4.1,11,295,267,4,9
4,3790,3505,3073,3073,3073,3073,3073,3073,13,2,2011,4.4,9,235,267,10,11


### Scaling The Features

In [0]:
# Feature Scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

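#Reshaping to fit the scaler (it expects a 2-D array)
Y_train = Y_train.reshape((len(Y_train), 1)) 

#Note: this refits the same scaler object on the target, so from here on
#sc holds the Price statistics and is used only to inverse-transform predictions
Y_train = sc.fit_transform(Y_train)

#Restoring the original shape after scaling
Y_train = Y_train.ravel()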
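Because the target is standardized too, every prediction has to be mapped back to rupees before it is scored or submitted. The sketch below reproduces that round trip in plain Python, assuming the usual standardization with the population standard deviation; the helper names are hypothetical and the prices are the first five rows of the training set shown earlier.

```python
import statistics

def fit_scaler(values):
    """Return (mean, std) using the population standard deviation."""
    return statistics.mean(values), statistics.pstdev(values)

def scale(values, mean, std):
    """Standardize: subtract the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in values]

def unscale(values, mean, std):
    """Invert the standardization to get back the original units."""
    return [v * std + mean for v in values]

prices = [220.0, 202.93, 299.0, 180.0, 965.62]
mean, std = fit_scaler(prices)
scaled = scale(prices, mean, std)
restored = unscale(scaled, mean, std)

print(scaled)
print(restored)
```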

In [0]:
X_train.shape #SC

(6237, 17)

In [0]:
Y_train.shape #SC

(6237,)

In [0]:
X_train

array([[ 1.22489462, -1.11014305,  0.10361038, ..., -0.04054244,
        -1.21387465,  0.3167796 ],
       [-0.64983923, -1.40876249,  0.10361038, ..., -0.04054244,
        -0.82060991, -1.88135243],
       [-0.20992341,  1.50535127,  0.10361038, ..., -0.04054244,
         0.35918432,  0.3167796 ],
       ...,
       [ 0.91379674, -0.11278358,  0.10361038, ..., -0.04054244,
         1.53897855,  0.3167796 ],
       [-0.75676322, -1.56395633,  0.10361038, ..., -0.04054244,
        -1.21387465,  0.3167796 ],
       [ 0.95096556, -0.2951915 ,  0.10361038, ..., -0.04054244,
        -1.21387465,  0.3167796 ]])

In [0]:
X_test

array([[ 0.8582981 ,  1.2883741 ,  0.10361038, ..., -0.04054244,
        -0.23071279,  1.78220095],
       [-0.24963804, -0.66589149,  0.10361038, ..., -0.04054244,
        -0.42734517, -0.78228642],
       [-1.34688178, -0.99834465,  0.10361038, ..., -0.04054244,
         1.53897855,  0.3167796 ],
       ...,
       [ 1.06145367,  0.00342793,  0.10361038, ..., -0.04054244,
         0.35918432,  0.3167796 ],
       [ 0.20911677, -0.49672284,  0.10361038, ..., -0.04054244,
        -0.82060991, -1.88135243],
       [-1.15085447, -1.35139225,  0.10361038, ..., -0.04054244,
         0.75244907, -0.04957574]])

## Building a Regression Model

---

At this stage, the data is ready to be fed into a regressor. We will build a simple XGBoost regressor and fit it to the training data. We will then use the model to predict the prices of the books in the test set.




* Building a simple XGBoost regressor

* Testing the regressor on validation set
* Predicting the prices for test set data
* Saving the predictions into an excel file


### Creating Training & Validation sets

In [0]:
from sklearn.model_selection import train_test_split

train_x, val_x, train_y, val_y = train_test_split(X_train, Y_train, test_size = 0.1, random_state = 123)

In [0]:
print(train_x.shape)
print(train_y.shape)
print(val_x.shape)
print(val_y.shape)

(5613, 17)
(5613,)
(624, 17)
(624,)
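The shapes above follow from simple arithmetic on the 6237 training rows. The sketch below assumes the validation fraction is rounded up and the remainder goes to training, which matches the printed shapes.

```python
import math

n_samples = 6237
test_size = 0.1

n_val = math.ceil(test_size * n_samples)  # 623.7 rounded up
n_train = n_samples - n_val               # everything else is used for training

print(n_train, n_val)
```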


### XGBoost

### Validating The Model

---

We will first train the model and validate it with a score based on RMSLE (Root Mean Squared Logarithmic Error): 1 - RMSLE, computed with base-10 logarithms. Once validation is done, we will use both the training and validation samples to train the final model, which will be used to predict for the test set.
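The validation score can be written as a small standalone function. This is a sketch that mirrors the arithmetic of the validation cell; the name `rmsle_score` is hypothetical, and inputs are assumed to be greater than -1 so that the logarithm is defined.

```python
import math

def rmsle_score(y_true, y_pred):
    """Return 1 - RMSLE, using base-10 logarithms."""
    squared_log_errors = [
        (math.log10(p + 1) - math.log10(t + 1)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    rmsle = math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
    return 1 - rmsle

# A perfect prediction scores 1; being off by a factor of ten in (price + 1) scores 0
print(rmsle_score([99.0], [99.0]))
print(rmsle_score([99.0], [9.0]))
```

Higher is better with this score, and an error on a cheap book costs as much as the same relative error on an expensive one.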

In [0]:
from xgboost import XGBRegressor
import numpy as np

xgb=XGBRegressor( objective='reg:squarederror', max_depth=6, learning_rate=0.1, n_estimators=100, booster = 'gbtree', n_jobs = -1,random_state = 1)
xgb.fit(train_x,train_y)

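# inverse_transform expects 2-D input in recent scikit-learn versions
y_pred = sc.inverse_transform(xgb.predict(val_x).reshape(-1, 1)).ravel()
y_true = sc.inverse_transform(val_y.reshape(-1, 1)).ravel()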

error = np.square(np.log10(y_pred +1) - np.log10(y_true +1)).mean() ** 0.5
score = 1 - error

print("RMLSE Score = ", score)

RMLSE Score =  0.7163958922112829


In [0]:
# Fitting the complete training set (including val_x and val_y)
xgb.fit(X_train,Y_train)


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=6, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=-1, nthread=None, objective='reg:squarederror',
             random_state=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
             seed=None, silent=None, subsample=1, verbosity=1)

In [0]:
# Predicting for test set
y_pred_xgb = sc.inverse_transform(xgb.predict(X_test).reshape(-1, 1)).ravel()

In [0]:
# Saving the predictions in excel file

solution = pd.DataFrame(y_pred_xgb, columns = ['Price'])
solution.to_excel('Predict_Book_Price_Soln.xlsx', index = False)


In [0]:
solution.head(10)

Unnamed: 0,Price
0,214.681122
1,1330.917114
2,624.322449
3,844.769043
4,425.502533
5,890.277893
6,956.208252
7,339.453979
8,558.580017
9,467.108673


## Bayesian Optimization on XGBoost

---

In this step we will use Bayesian Optimization to tune hyperparameters such as gamma, learning_rate, max_depth and n_estimators.


We will use a pre-built library called bayesian-optimization.
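The library maximizes whatever the objective function returns, so a loss such as RMSE has to be negated before it is handed over. The sketch below illustrates only that sign convention. It swaps in plain random search for the Bayesian optimizer, and `toy_objective`, its made-up RMSE curve and `random_search_maximize` are all hypothetical.

```python
import random

def toy_objective(learning_rate):
    """Pretend the validation RMSE is lowest near learning_rate = 0.3 (made up)."""
    rmse = (learning_rate - 0.3) ** 2 + 0.9
    return -rmse  # negate so that a maximizer finds the lowest RMSE

def random_search_maximize(objective, bounds, n_trials, seed=0):
    """Sample uniformly inside bounds and keep the highest target seen."""
    rng = random.Random(seed)
    low, high = bounds
    best = {"target": float("-inf"), "params": None}
    for _ in range(n_trials):
        candidate = rng.uniform(low, high)
        target = objective(candidate)
        if target > best["target"]:
            best = {"target": target, "params": {"learning_rate": candidate}}
    return best

best = random_search_maximize(toy_objective, bounds=(0.0, 1.0), n_trials=200)
print(best)
```

A Bayesian optimizer differs in how it picks the next candidate: it fits a surrogate model to the targets seen so far instead of sampling blindly. The contract with the objective is the same.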

In [0]:
!pip install bayesian-optimization

Collecting bayesian-optimization
  Downloading https://files.pythonhosted.org/packages/72/0c/173ac467d0a53e33e41b521e4ceba74a8ac7c7873d7b857a8fbdca88302d/bayesian-optimization-1.0.1.tar.gz
Building wheels for collected packages: bayesian-optimization
  Building wheel for bayesian-optimization (setup.py) ... done
  Created wheel for bayesian-optimization: filename=bayesian_optimization-1.0.1-cp36-none-any.whl size=10032 sha256=8fd26880064a093cff7a7ea3de5daeb04582531c5362de7b4848e56e5d73439b
  Stored in directory: /root/.cache/pip/wheels/1d/0d/3b/6b9d4477a34b3905f246ff4e7acf6aafd4cc9b77d473629b77
Successfully built bayesian-optimization
Installing collected packages: bayesian-optimization
Successfully installed bayesian-optimization-1.0.1


In [0]:
from bayes_opt import BayesianOptimization
import xgboost as xgb
#from sklearn.metrics import mean_squared_error,mean_squared_log_error

In [0]:
dtrain = xgb.DMatrix(X_train, label= Y_train)


In [0]:
def bo_tune_xgb(max_depth, gamma, n_estimators ,learning_rate):
    # Note: xgb.cv runs num_boost_round boosting rounds; 'n_estimators' is a
    # scikit-learn wrapper parameter and is not used by the native API here.
    # 'eta' is an alias of 'learning_rate', so setting both is redundant.
    params = {'max_depth': int(max_depth),
              'gamma': gamma,      
              'n_estimators': int(n_estimators),
              'learning_rate':learning_rate,
              'subsample': 0.8,
              'eta': 0.1,
              'eval_metric': 'rmse'}
    
    #Cross validating with the specified parameters in 10 folds and 100 boosting rounds
    cv_result = xgb.cv(params, dtrain, num_boost_round=100, nfold=10)    
    
    #Return the negative RMSE
    return -1.0 * cv_result['test-rmse-mean'].iloc[-1]

In [0]:

xgb_bo = BayesianOptimization(bo_tune_xgb, {'max_depth': (1, 300), 
                                             'gamma': (0, 1),
                                             'learning_rate':(0,1),
                                             'n_estimators':(1,1000)
                                            })


xgb_bo.maximize(n_iter=10, init_points=10, acq='ei')

|   iter    |  target   |   gamma   | learni... | max_depth | n_esti... |
-------------------------------------------------------------------------
|  1        | -1.009    |  0.2697   |  0.5214   |  51.2     |  389.1    |
|  2        | -0.9282   |  0.06396  |  0.07131  |  240.2    |  947.1    |
|  3        | -0.9759   |  0.6246   |  0.3789   |  76.9     |  530.3    |
|  4        | -0.9172   |  0.8354   |  0.1186   |  195.2    |  685.8    |
|  5        | -0.9174   |  0.9739   |  0.1673   |  8.738    |  905.6    |
|  6        | -1.074    |  0.6809   |  0.7988   |  42.74    |  197.3    |
|  7        | -0.9909   |  0.1068   |  0.405    |  33.39    |  7

In [0]:
#Extracting the best parameters
params = xgb_bo.max['params']

print(params)

#Converting the max_depth and n_estimators values from float to int
params['max_depth']= int(round(params['max_depth']))
params['n_estimators']= int(round(params['n_estimators']))

print(params)


{'gamma': 0.835366902965147, 'learning_rate': 0.11864947002102888, 'max_depth': 195.21748298318235, 'n_estimators': 685.7597094777982}
{'gamma': 0.835366902965147, 'learning_rate': 0.11864947002102888, 'max_depth': 195, 'n_estimators': 686}


In [0]:
#Initialize an XGB with the tuned parameters and fit the training data
from xgboost import XGBRegressor
reg = XGBRegressor(**params).fit(X_train,Y_train)

y_pred_reg = sc.inverse_transform(reg.predict(X_test).reshape(-1, 1)).ravel()



In [0]:
solution_bo = pd.DataFrame(y_pred_reg, columns = ['Price'])

solution_bo.head(10)

Unnamed: 0,Price
0,241.728058
1,1828.147339
2,429.341614
3,858.137817
4,372.580841
5,503.80722
6,587.503967
7,476.95871
8,399.994812
9,335.585876


In [0]:
solution_bo.to_excel('Predict_Book_Prices_BO_Soln.xlsx', index = False)

Once you have your solution files, upload them to MachineHack to see your score.

Good Luck !!