### Model Design stage
The data has already been preprocessed and the processed dataset saved.
##### Tasks to be done in this step
>* Splitting the dataset into training and testing datasets
>* Designing the model and training it
>* Evaluating the model performance

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder


In [2]:
dataset = pd.read_csv('processed_movie_dataset.csv')
dataset.head()

Unnamed: 0,title,Running_time,Budget,Box_office,Release_date,imdb,metascore,rotten_tomatoes,Main_actor,Second_actor,Director,Category,Main_actor_rate,Director_rate,Second_actor_rate
0,Pinocchio,88.0,2600000.0,164000000.0,1940-02-07,7.4,9.9,7.3,Cliff Edwards,Dickie Jones,Ben Sharpsteen,Blockbuster,10.0,7.5,10.0
1,Fantasia,126.0,2280000.0,83300000.0,1940-11-13,7.7,9.6,9.5,Leopold Stokowski,Deems Taylor,Samuel Armstrong,Blockbuster,10.0,10.0,10.0
2,Dumbo,64.0,950000.0,1300000.0,1941-10-23,7.2,9.6,9.8,Edward Brophy,Herman Bing,Ben Sharpsteen,Flop,5.0,7.5,5.0
3,Bambi,70.0,858000.0,267400000.0,1942-08-09,7.3,9.1,9.0,see below,missing,Supervising director,Blockbuster,10.0,10.0,8.958333
4,Make Mine Music,75.0,1350000.0,3275000.0,1946-04-20,6.3,6.0,7.0,Nelson Eddy,missing,Jack Kinney,Hit,7.5,6.25,8.958333


In [3]:
# Let's move the Category of movies to the last column
cat = dataset.pop('Category')
dataset.insert(14, 'Category', cat)
dataset.head()

Unnamed: 0,title,Running_time,Budget,Box_office,Release_date,imdb,metascore,rotten_tomatoes,Main_actor,Second_actor,Director,Main_actor_rate,Director_rate,Second_actor_rate,Category
0,Pinocchio,88.0,2600000.0,164000000.0,1940-02-07,7.4,9.9,7.3,Cliff Edwards,Dickie Jones,Ben Sharpsteen,10.0,7.5,10.0,Blockbuster
1,Fantasia,126.0,2280000.0,83300000.0,1940-11-13,7.7,9.6,9.5,Leopold Stokowski,Deems Taylor,Samuel Armstrong,10.0,10.0,10.0,Blockbuster
2,Dumbo,64.0,950000.0,1300000.0,1941-10-23,7.2,9.6,9.8,Edward Brophy,Herman Bing,Ben Sharpsteen,5.0,7.5,5.0,Flop
3,Bambi,70.0,858000.0,267400000.0,1942-08-09,7.3,9.1,9.0,see below,missing,Supervising director,10.0,10.0,8.958333,Blockbuster
4,Make Mine Music,75.0,1350000.0,3275000.0,1946-04-20,6.3,6.0,7.0,Nelson Eddy,missing,Jack Kinney,7.5,6.25,8.958333,Hit


### Transforming the Category column into numeric
The Category column holds strings ('Blockbuster', 'Flop' and 'Hit'), so we need to represent these labels as numbers such as 0, 1 and 2, because the model expects numeric labels during training.
We can achieve this with the LabelEncoder class from scikit-learn.

In [4]:
encoder = LabelEncoder()
label = encoder.fit_transform(dataset['Category'])
dataset['cat_label'] = label
dataset.head()

Unnamed: 0,title,Running_time,Budget,Box_office,Release_date,imdb,metascore,rotten_tomatoes,Main_actor,Second_actor,Director,Main_actor_rate,Director_rate,Second_actor_rate,Category,cat_label
0,Pinocchio,88.0,2600000.0,164000000.0,1940-02-07,7.4,9.9,7.3,Cliff Edwards,Dickie Jones,Ben Sharpsteen,10.0,7.5,10.0,Blockbuster,0
1,Fantasia,126.0,2280000.0,83300000.0,1940-11-13,7.7,9.6,9.5,Leopold Stokowski,Deems Taylor,Samuel Armstrong,10.0,10.0,10.0,Blockbuster,0
2,Dumbo,64.0,950000.0,1300000.0,1941-10-23,7.2,9.6,9.8,Edward Brophy,Herman Bing,Ben Sharpsteen,5.0,7.5,5.0,Flop,1
3,Bambi,70.0,858000.0,267400000.0,1942-08-09,7.3,9.1,9.0,see below,missing,Supervising director,10.0,10.0,8.958333,Blockbuster,0
4,Make Mine Music,75.0,1350000.0,3275000.0,1946-04-20,6.3,6.0,7.0,Nelson Eddy,missing,Jack Kinney,7.5,6.25,8.958333,Hit,2


As the cat_label column added to the dataset shows, the categories are encoded as follows:
>* Blockbuster => 0
>* Flop => 1
>* Hit => 2
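
Rather than reading the mapping off the first few rows, it can be recovered from the fitted encoder. The following is a minimal sketch on a toy list of categories (the list is illustrative and stands in for `dataset['Category']`); it relies on `LabelEncoder` storing the sorted unique labels in `classes_`, so each label's position is its numeric code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for dataset['Category'] (illustrative values only)
categories = ['Blockbuster', 'Flop', 'Hit', 'Flop', 'Blockbuster']

encoder = LabelEncoder()
labels = encoder.fit_transform(categories)

# classes_ holds the sorted unique labels; the index of each label is its code
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}
print(mapping)  # {'Blockbuster': 0, 'Flop': 1, 'Hit': 2}

# inverse_transform converts numeric codes back into the original strings
decoded = [str(name) for name in encoder.inverse_transform([0, 2, 1])]
print(decoded)  # ['Blockbuster', 'Hit', 'Flop']
```

The same `inverse_transform` call can be applied to model predictions later on to report category names instead of codes.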

#### Features and Label

In [5]:
# features
x = dataset[['imdb', 'metascore', 'rotten_tomatoes', 'Main_actor_rate', 'Second_actor_rate', 'Director_rate']]
x.head()


Unnamed: 0,imdb,metascore,rotten_tomatoes,Main_actor_rate,Second_actor_rate,Director_rate
0,7.4,9.9,7.3,10.0,10.0,7.5
1,7.7,9.6,9.5,10.0,10.0,10.0
2,7.2,9.6,9.8,5.0,5.0,7.5
3,7.3,9.1,9.0,10.0,8.958333,10.0
4,6.3,6.0,7.0,7.5,8.958333,6.25


In [6]:
# Label
y = dataset['cat_label']
y.head()

0    0
1    0
2    1
3    0
4    2
Name: cat_label, dtype: int32

##### Split the features and label into training and test datasets

In [7]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

print("X_train Shape: ", x_train.shape)
print("X_test Shape: ", x_test.shape)
print("Y_train Shape: ", y_train.shape)
print("Y_test Shape: ", y_test.shape)

X_train Shape:  (178, 6)
X_test Shape:  (45, 6)
Y_train Shape:  (178,)
Y_test Shape:  (45,)
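
With a test set this small, a purely random split can leave one category under-represented. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A minimal sketch on toy data (`X_toy` and `y_toy` are made-up stand-ins for `x` and `y`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 20 samples, 2 features, three classes in a 10/5/5 ratio
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 5 + [2] * 5)

# stratify=y_toy preserves the 10/5/5 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

train_counts = np.bincount(y_tr).tolist()
test_counts = np.bincount(y_te).tolist()
print(train_counts, test_counts)  # [8, 4, 4] [2, 1, 1]
```

Applied to this notebook, the equivalent change would be passing `stratify=y` in the existing call; the reported scores would then differ from those shown here because the split changes.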


#### Model training

In [8]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# AdaBoost ensemble of 10 shallow decision trees
adb = AdaBoostClassifier(
    DecisionTreeClassifier(min_samples_split=10, max_depth=4),
    n_estimators=10,
    learning_rate=0.01,
)

# dt_classifier = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=12, random_state=0)

adb.fit(x_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=4,
                                                         min_samples_split=10),
                   learning_rate=0.01, n_estimators=10)
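
The classifier above is created without a `random_state`, so repeated runs may produce slightly different ensembles. The sketch below shows one way to make training repeatable, using the same hyperparameters on synthetic data from `make_classification` (the data and the helper `build_model` are assumptions for illustration, not part of the notebook). The base tree is passed positionally because the keyword name for it differs between scikit-learn releases (`base_estimator` in the output above, `estimator` in newer versions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data with 6 features, standing in for x_train / y_train
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

def build_model(seed):
    """Hypothetical helper: same hyperparameters as above, plus a fixed seed."""
    base = DecisionTreeClassifier(min_samples_split=10, max_depth=4, random_state=seed)
    return AdaBoostClassifier(base, n_estimators=10, learning_rate=0.01, random_state=seed)

# Two models built with the same seed should make identical predictions
pred_a = build_model(0).fit(X_demo, y_demo).predict(X_demo)
pred_b = build_model(0).fit(X_demo, y_demo).predict(X_demo)
same = bool((pred_a == pred_b).all())
print(same)
```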

#### Prediction

In [9]:
y_prediction = adb.predict(x_test)
y_prediction

array([0, 1, 1, 0, 0, 1, 0, 0, 2, 1, 2, 0, 0, 2, 2, 0, 2, 0, 0, 0, 2, 1,
       0, 1, 1, 0, 1, 2, 1, 0, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 1, 2, 2, 2,
       1])

#### Model Evaluation

In [10]:
from sklearn import metrics

In [11]:
print('Accuracy Score: {:.2f}%'.format(metrics.accuracy_score(y_test, y_prediction) * 100))  
print('')
print('Precision Score: {:.2f}%'.format(metrics.precision_score(y_test, y_prediction, average='macro') * 100))
print('')
print('Recall Score: {:.2f}%'.format(metrics.recall_score(y_test, y_prediction, average='macro') * 100))
print('')
print('F1 Score: {:.2f}%'.format(metrics.f1_score(y_test, y_prediction, average='macro') * 100))


Accuracy Score: 97.78%

Precision Score: 96.97%

Recall Score: 97.62%

F1 Score: 97.18%
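
These scores come from only 45 test samples, so a single misclassified movie shifts accuracy by about 2.2 percentage points (97.78% corresponds to 44 of 45 correct). A confusion matrix shows which categories are being confused, which the aggregate scores hide. The sketch below uses small hand-written label lists (`y_true_demo` and `y_pred_demo` are illustrative, not the notebook's results) to show the calls; in the notebook the same functions would take `y_test` and `y_prediction`.

```python
from sklearn import metrics

# Hand-written labels for illustration (0=Blockbuster, 1=Flop, 2=Hit)
y_true_demo = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 2, 2, 0]

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)
print(cm.tolist())  # [[2, 1, 0], [0, 2, 0], [1, 0, 2]]

acc = metrics.accuracy_score(y_true_demo, y_pred_demo)
print(acc)  # 0.75

# Per-class precision, recall and F1 in a single text table
report = metrics.classification_report(
    y_true_demo, y_pred_demo, target_names=['Blockbuster', 'Flop', 'Hit']
)
print(report)
```

Off-diagonal entries are the errors: here one Blockbuster is predicted as Flop and one Hit as Blockbuster.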
