----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# <u>**IMDB Movies Dataset Explorative Data Analysis, Modelling, Predictions and Parameter Tuning**</u>

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

**<u>Group Activity</u>** 
---

<u>AIDI-1002-01 AI Algorithms</u>
---

Abhishek (ID: 100856936)
---
Riya Xavier (ID: 100847513)
---
Bindya Biju (ID: 100886575)
---
Sanath Davis (ID: 100884693)
---


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


### This is a python notebook which details the various steps and processes involved in exploring, analysing, training and testing the dataset of IMDB Movies. 
### A project report is attached along with this submission.

## Minimum Requirements

### You will definitely need to have the following installed for running this notebook: 
    a) Pandas
    b) Sklearn
    c) Numpy
    d) Seaborn
    e) Matplotlib
    f) Pandas Profiling
    g) XG Boost



## Load dataset into pandas

The dataset file is named 'imdb_top_1000.csv' and can be downloaded from https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

In [217]:
import pandas as pd
dataset = pd.read_csv("imdb_top_1000.csv")
dataset

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,https://m.media-amazon.com/images/M/MV5BNGEwMT...,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,
996,https://m.media-amazon.com/images/M/MV5BODk3Yj...,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,
997,https://m.media-amazon.com/images/M/MV5BM2U3Yz...,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374,30500000
998,https://m.media-amazon.com/images/M/MV5BZTBmMj...,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,




Top 1000 Movies by IMDB Rating data

    PosterLink: Link of the poster that imdb using
    SeriesTitle: Name of the movie
    ReleasedYear: Year at which that movie released
    Certificate: Certificate earned by that movie
    Runtime: Total runtime of the movie
    Genre: Genre of the movie
    IMDB Rating: Rating of the movie at IMDB site
    Overview: mini story/ summary
    Meta_score: Score earned by the movie
    Director: Name of the Director
    Star1,Star2,Star3,Star4: Name of the Stars
    No of votes: Total number of votes
    Gross: Money earned by that movie



## Pre Processing dataset and cleaning


### a) Remove unwanted columns

In [218]:
dataset.columns

Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
       'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
       'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
      dtype='object')

In [219]:
#axis = 1 means we are working with columns
dataset = dataset.drop([
    'Star1', 
    'Star2', 
    'Star3', 
    'Star4', 
    'Poster_Link', 
    'Overview',
    'Series_Title',
    'Series_Title',
    'Director'
    ], 
axis = 1)

### b) Remove rows with NAN

In [220]:
#example of rows with null or nan values in the gross column
dataset["Gross"].isnull().sum()

169

In [221]:
#droping nan
row_count_before_dropping_nan = len(dataset) 
dataset = dataset.dropna()
row_count_after_dropping_nan = len(dataset)
print("Following number of rows with NAN have been removed:")
print(row_count_before_dropping_nan - row_count_after_dropping_nan)

Following number of rows with NAN have been removed:
286


### c) Clean the Column Values

In [222]:


dataset['Released_Year'].unique()



array(['1994', '1972', '2008', '1974', '1957', '2003', '1993', '2010',
       '1999', '2001', '1966', '2002', '1990', '1980', '1975', '2019',
       '2014', '1998', '1997', '1995', '1991', '1977', '1954', '2011',
       '2006', '2000', '1988', '1985', '1968', '1960', '1942', '1936',
       '1931', '2018', '2016', '2017', '2012', '2009', '1981', '1979',
       '1964', '2004', '1992', '1987', '1986', '1984', '1983', '1976',
       '1973', '1971', '1965', '1962', '1959', '1958', '1952', '1944',
       '1941', '2013', '2007', '2005', '1989', '1963', '1950', '1948',
       '2015', '1996', '1982', '1978', '1967', '1951', '1949', '1940',
       '1939', '1934', '1970', '1969', '1961', '1946', '1930', '1938',
       '1933', 'PG', '1953'], dtype=object)

Remove the phrase 'PG' which is not an year

In [223]:
row_count_before = len(dataset) 
dataset = dataset.drop(dataset[dataset['Released_Year'] == 'PG'].index)
dataset['Released_Year'].unique()
row_count_after = len(dataset)
print("Following number of rows with incorrect Released_years have been removed:")
print(row_count_before - row_count_after)

Following number of rows with incorrect Released_years have been removed:
1


Check if the IMDB rating column is good

In [224]:
dataset['IMDB_Rating'].unique()

array([9.3, 9.2, 9. , 8.9, 8.8, 8.7, 8.6, 8.5, 8.4, 8.3, 8.2, 8.1, 8. ,
       7.9, 7.8, 7.7, 7.6])

Check if the Genre column is good

In [225]:
dataset['Genre'].unique()

array(['Drama', 'Crime, Drama', 'Action, Crime, Drama',
       'Action, Adventure, Drama', 'Biography, Drama, History',
       'Action, Adventure, Sci-Fi', 'Drama, Romance', 'Western',
       'Action, Sci-Fi', 'Biography, Crime, Drama',
       'Action, Adventure, Fantasy', 'Comedy, Drama, Thriller',
       'Adventure, Drama, Sci-Fi', 'Animation, Adventure, Family',
       'Drama, War', 'Crime, Drama, Fantasy', 'Comedy, Drama, Romance',
       'Crime, Drama, Mystery', 'Crime, Drama, Thriller', 'Drama, Music',
       'Biography, Comedy, Drama', 'Drama, Mystery, Sci-Fi',
       'Biography, Drama, Music', 'Crime, Mystery, Thriller',
       'Animation, Adventure, Drama', 'Adventure, Comedy, Sci-Fi',
       'Horror, Mystery, Thriller', 'Drama, Romance, War',
       'Comedy, Drama, Family', 'Animation, Drama, Fantasy',
       'Animation, Action, Adventure', 'Drama, Western',
       'Action, Adventure', 'Comedy, Drama', 'Drama, Mystery, Thriller',
       'Action, Drama, Mystery', 'Mystery, Thr

### 4) Check the data types

In [226]:
dataset.dtypes

Released_Year     object
Certificate       object
Runtime           object
Genre             object
IMDB_Rating      float64
Meta_score       float64
No_of_Votes        int64
Gross             object
dtype: object

Convert Released Year into int

In [227]:
dataset['Released_Year'] = dataset['Released_Year'].astype(int)

Remove the phrase 'minutes' from runtime
example - > Runtime = 150 min becomes 150

In [228]:
dataset = dataset.astype({"Runtime": str})
dataset['Runtime'] = dataset['Runtime'].str.replace(' min','') 
dataset = dataset.astype({"Runtime": int})

Make sure the Gross value is numeric

In [229]:
dataset = dataset.astype({"Gross": str})
dataset['Gross']=dataset['Gross'].str.replace(',','')
dataset["Gross"] = dataset["Gross"].apply(pd.to_numeric)

Check the data types again

In [230]:
dataset.dtypes

Released_Year      int32
Certificate       object
Runtime            int32
Genre             object
IMDB_Rating      float64
Meta_score       float64
No_of_Votes        int64
Gross              int64
dtype: object

### 5) Encode string values to numerical

In [231]:
print(dataset.shape)

(713, 8)


We encode categorical features as a one-hot numeric array using the OneHotEncoder of Sklearn. This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

In [232]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')



First we Encode Genre 

In [233]:
#unique categories inside our column
#we can convert all these to numbers
#they are expecting a table, so two square brackets

enc.fit(dataset[['Genre']])
enc.categories_

[array(['Action, Adventure', 'Action, Adventure, Comedy',
        'Action, Adventure, Drama', 'Action, Adventure, Family',
        'Action, Adventure, Fantasy', 'Action, Adventure, History',
        'Action, Adventure, Horror', 'Action, Adventure, Mystery',
        'Action, Adventure, Romance', 'Action, Adventure, Sci-Fi',
        'Action, Adventure, Thriller', 'Action, Adventure, Western',
        'Action, Biography, Crime', 'Action, Biography, Drama',
        'Action, Comedy, Crime', 'Action, Comedy, Fantasy',
        'Action, Comedy, Mystery', 'Action, Crime, Drama',
        'Action, Crime, Mystery', 'Action, Crime, Thriller',
        'Action, Drama', 'Action, Drama, History',
        'Action, Drama, Mystery', 'Action, Drama, Sci-Fi',
        'Action, Drama, Sport', 'Action, Drama, War',
        'Action, Drama, Western', 'Action, Mystery, Thriller',
        'Action, Sci-Fi', 'Action, Sci-Fi, Thriller', 'Action, Thriller',
        'Adventure, Biography, Crime', 'Adventure, Biography,

In [234]:
# let us see what the encoding of the category Drama looks like
enc.transform([['Drama']]).toarray()



array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [235]:
#checking the shape
dataset.Genre.shape

(713,)

In [236]:
#Encoding all the categories
encoded_category = enc.transform(dataset[['Genre']]).toarray()
encoded_category = pd.DataFrame(encoded_category, columns=enc.categories_)
print(encoded_category.shape)
encoded_category


(713, 172)


Unnamed: 0,"Action, Adventure","Action, Adventure, Comedy","Action, Adventure, Drama","Action, Adventure, Family","Action, Adventure, Fantasy","Action, Adventure, History","Action, Adventure, Horror","Action, Adventure, Mystery","Action, Adventure, Romance","Action, Adventure, Sci-Fi",...,"Film-Noir, Mystery, Thriller",Horror,"Horror, Mystery, Sci-Fi","Horror, Mystery, Thriller","Horror, Sci-Fi","Horror, Thriller","Mystery, Romance, Thriller","Mystery, Sci-Fi, Thriller","Mystery, Thriller",Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
708,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
710,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now the Certificate Column

In [237]:
enc3 = OneHotEncoder(handle_unknown='ignore')
enc3.fit(dataset[['Certificate']])
enc3.categories_
#Encoding all the categories
encoded_category3 = enc3.transform(dataset[['Certificate']]).toarray()
encoded_category3 = pd.DataFrame(encoded_category3, columns=enc3.categories_)
print(encoded_category3.shape)
encoded_category3

(713, 12)


Unnamed: 0,A,Approved,G,GP,PG,PG-13,Passed,R,TV-PG,U,U/A,UA
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
708,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
709,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
710,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [238]:
# dataset = pd.concat([dataset, encoded_category, encoded_category2, encoded_category3], axis=1)
# dataset = dataset.drop(['Director', 'Genre'], axis = 1) #already encoded
# dataset

In [239]:
# dataset.isna().sum()
# dataset.shape
# dataset = dataset.dropna()
# dataset.isna().sum()
# dataset.shape
# dataset

### 6) Another attempt at encoding

First Genre. Find out all unique Genres - SInce many movies have multiple genres - and make them boolean columns

In [240]:
genres = dataset.Genre.unique()
required_genre_columns = []
for genre in genres:
    multiple_genres = genre.split(", ")
    for single_genre in multiple_genres:
        required_genre_columns.append(single_genre)


#removing duplicates
required_genre_columns = list(dict.fromkeys(required_genre_columns))
print(required_genre_columns)



['Drama', 'Crime', 'Action', 'Adventure', 'Biography', 'History', 'Sci-Fi', 'Romance', 'Western', 'Fantasy', 'Comedy', 'Thriller', 'Animation', 'Family', 'War', 'Mystery', 'Music', 'Horror', 'Sport', 'Musical', 'Film-Noir']


Add each unique Genre as column to the dataset

In [241]:
def transform_row(r):
    for new_genre in required_genre_columns:
        r[new_genre] = 1 if new_genre in r.Genre else 0
    return r

dataset = dataset.apply(transform_row, axis=1)
dataset = dataset.drop(['Genre'], axis = 1)
dataset

Unnamed: 0,Released_Year,Certificate,Runtime,IMDB_Rating,Meta_score,No_of_Votes,Gross,Drama,Crime,Action,...,Thriller,Animation,Family,War,Mystery,Music,Horror,Sport,Musical,Film-Noir
0,1994,A,142,9.3,80.0,2343110,28341469,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1972,A,175,9.2,100.0,1620367,134966411,1,1,0,...,0,0,0,0,0,0,0,0,0,0
2,2008,UA,152,9.0,84.0,2303232,534858444,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3,1974,A,202,9.0,90.0,1129952,57300000,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,1957,U,96,9.0,96.0,689845,4360000,1,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
990,1971,PG,157,7.6,77.0,30144,696690,1,0,0,...,0,0,0,1,0,0,0,0,0,0
991,1970,GP,144,7.6,50.0,45338,1378435,0,0,0,...,0,0,0,1,0,0,0,0,0,0
992,1967,U,78,7.6,65.0,166409,141843612,0,0,0,...,0,1,1,0,0,0,0,0,0,0
994,1964,U,87,7.6,96.0,40351,13780024,0,0,0,...,0,0,0,0,0,1,0,0,1,0


Now we look at certificates

In [242]:
certificates = dataset.Certificate.unique()
certificates

def transform_row(r):
    for certificate in certificates:
        r[certificate] = 1 if certificate == r.Certificate else 0
    return r

dataset = dataset.apply(transform_row, axis=1)
dataset = dataset.drop(['Certificate'], axis = 1)
dataset

Unnamed: 0,Released_Year,Runtime,IMDB_Rating,Meta_score,No_of_Votes,Gross,Drama,Crime,Action,Adventure,...,U,R,G,PG-13,PG,Passed,Approved,TV-PG,U/A,GP
0,1994,142,9.3,80.0,2343110,28341469,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1972,175,9.2,100.0,1620367,134966411,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2008,152,9.0,84.0,2303232,534858444,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3,1974,202,9.0,90.0,1129952,57300000,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1957,96,9.0,96.0,689845,4360000,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
990,1971,157,7.6,77.0,30144,696690,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
991,1970,144,7.6,50.0,45338,1378435,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
992,1967,78,7.6,65.0,166409,141843612,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
994,1964,87,7.6,96.0,40351,13780024,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


## Modeling to PREDICT IMDB RATING 


#### X and y

In [243]:
# X = dataset.drop(['Gross'], axis=1)
# y = dataset['Gross']

X = dataset.drop(['IMDB_Rating'], axis=1)
y = dataset['IMDB_Rating']

### 1) Linear Regression

Split into Test and Train

In [244]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=57,)
print(X_train.shape)
print(y_test.shape)

(427, 38)
(286,)


Fit and Train

In [245]:
import numpy as np
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)

Check Error

In [246]:
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
linear_reg_mean_square = mean_squared_error(y_test, y_pred)
linear_reg_mean_absolute_percentage = mean_absolute_percentage_error(y_test, y_pred)

print("mean Square: ", linear_reg_mean_square)
print("mean absolute percentage error: ", linear_reg_mean_absolute_percentage)

mse = mean_squared_error(y_test, y_pred)
print("RMSE: %.2f" % (mse**(1/2.0)))

mean Square:  0.04251641075502864
mean absolute percentage error:  0.02022678064581262
RMSE: 0.21


### 2) Random Forest Regressor

In [247]:
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X_train, y_train)

y_pred = regr.predict(X_test)

random_forest_reg_mean_square = mean_squared_error(y_test, y_pred, squared=False)
random_forest_reg_mean_absolute_percentage = mean_absolute_percentage_error(y_test, y_pred)

print("mean Square: ", random_forest_reg_mean_square)
print("mean absolute percentage error: ", random_forest_reg_mean_absolute_percentage)

mse = mean_squared_error(y_test, y_pred)
print("RMSE: %.2f" % (mse**(1/2.0)))

mean Square:  0.22038027617212402
mean absolute percentage error:  0.022039565561386296
RMSE: 0.22


### 3) XG Boost Regressor

In [248]:


from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

model = XGBRegressor()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

score = model.score(X_train, y_train)
print('Training Score:', score)

score = model.score(X_test, y_test)
print('Testing Score:', score)

output = pd.DataFrame({'Predicted':y_pred})

xg_reg_mean_square = mean_squared_error(y_test, y_pred, squared=False)
xg_reg_mean_absolute_percentage = mean_absolute_percentage_error(y_test, y_pred)

print("mean Square: ", xg_reg_mean_square)
print("mean absolute percentage error: ", xg_reg_mean_absolute_percentage)

mse = mean_squared_error(y_test, y_pred)
print("RMSE: %.2f" % (mse**(1/2.0)))



Training Score: 0.9999118275837339
Testing Score: 0.5844648960746007
mean Square:  0.1988609735299707
mean absolute percentage error:  0.019651313486915643
RMSE: 0.20


## Fine Tuning

#### Hyperparameter Grid Search with XGBoost

In [249]:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

params = { 'max_depth': [3,6,10],
           'learning_rate': [0.01, 0.05, 0.1],
           'n_estimators': [100, 500, 1000],
           'colsample_bytree': [0.3, 0.7]}
xgbr = xgb.XGBRegressor(seed = 20)
clf = GridSearchCV(estimator=xgbr, 
                   param_grid=params,
                   scoring='neg_mean_squared_error', 
                   verbose=1)
clf.fit(X, y)
print("Best parameters:", clf.best_params_)
print("Lowest RMSE: ", (-clf.best_score_)**(1/2.0))

Fitting 5 folds for each of 54 candidates, totalling 270 fits
Best parameters: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 1000}
Lowest RMSE:  0.27002963773716726


Now trying XGboost again with the best parameters obtained from grid search

In [250]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

model = XGBRegressor(colsample_bytree= 0.7, learning_rate= 0.1, max_depth= 3, n_estimators= 1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

score = model.score(X_train, y_train)
print('Training Score:', score)

score = model.score(X_test, y_test)
print('Testing Score:', score)

output = pd.DataFrame({'Predicted':y_pred})

xg_reg_mean_square = mean_squared_error(y_test, y_pred, squared=False)
xg_reg_mean_absolute_percentage = mean_absolute_percentage_error(y_test, y_pred)

print("mean Square: ", xg_reg_mean_square)
print("mean absolute percentage error: ", xg_reg_mean_absolute_percentage)

mse = mean_squared_error(y_test, y_pred)
print("RMSE: %.2f" % (mse**(1/2.0)))

Training Score: 0.9971758159073006
Testing Score: 0.6143111375498542
mean Square:  0.19158621209844706
mean absolute percentage error:  0.01867618588507256
RMSE: 0.19


#### Hyperparameter RANDOM Search with XGBoost

In [251]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
params = { 'max_depth': [3, 5, 6, 10, 15, 20],
           'learning_rate': [0.01, 0.1, 0.2, 0.3],
           'subsample': np.arange(0.5, 1.0, 0.1),
           'colsample_bytree': np.arange(0.4, 1.0, 0.1),
           'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
           'n_estimators': [100, 500, 1000]}
xgbr = xgb.XGBRegressor(seed = 20)
clf = RandomizedSearchCV(estimator=xgbr,
                         param_distributions=params,
                         scoring='neg_mean_squared_error',
                         n_iter=25,
                         verbose=1)
clf.fit(X, y)
print("Best parameters:", clf.best_params_)
print("Lowest RMSE: ", (-clf.best_score_)**(1/2.0))

Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best parameters: {'subsample': 0.7999999999999999, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.2, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.4}
Lowest RMSE:  0.27288934598995196


#### Comparing GRID SEARCH and Randomised Search on xgboost

| Search Type | Time Taken  | Lowest RMSE |
| --- | --- | --- |
| Grid Search | 1 minute 3 seconds | 0.27 |
| Randomised Search | 24 seconds |  0.27 |

## Comparing Models

| Regressor | Time Taken  | Mean Square Error | Mean Absolute Percentage Error | RMSE  |
| --- | --- | --- |  --- |  --- |
| Linear | .4 seconds | 0.0425 | 0.02 | 0.21 |
| Random Forest | .2 seconds | 0.220 | 0.0220 | 0.22 |
| XG Boost | 1.2 seconds | 0.198 | 0.0196 |0.20 |
| XG Boost after Tuning | .6 seconds | 0.191 | 0.018 |0.19 |