# Machine Learning

## Problem:
The goal is to find out whether machine learning models can learn from a movie's keywords and correctly predict the movie's genres on a held-out test set.

If the models predict the genres well, the keywords carry information about the genre.
That would mean genres can be predicted from keywords alone, which is our objective.


In [1]:
import pandas as pd
import numpy as np
import json
import nltk
import csv
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

%matplotlib inline

In [2]:
df = pd.read_excel('alpha_occur_cleaned_data.xlsx')
df.head()

Unnamed: 0,title,id,genres,keywords
0,Blondie,3924,['Comedy'],['blondi']
1,Four Rooms,5,"['Crime', 'Comedy']","['hotel', ""new year's ev"", 'witch', 'bet', 'ho..."
2,Judgment Night,6,"['Action', 'Thriller', 'Crime']","['chicago, illinoi', 'drug deal', 'escap', 'on..."
3,Star Wars,11,"['Adventure', 'Action', 'Science Fiction']","['android', 'galaxi', 'hermit', 'death star', ..."
4,Finding Nemo,12,"['Animation', 'Family']","['parent child relationship', 'sydney, austral..."


The rows with empty keywords were deleted beforehand in Microsoft Excel.

In [3]:
df.shape

(19293, 4)

In [4]:
# work on a copy so the loaded DataFrame df stays untouched
processing_data = df.copy()
processing_data.head()

Unnamed: 0,title,id,genres,keywords
0,Blondie,3924,['Comedy'],['blondi']
1,Four Rooms,5,"['Crime', 'Comedy']","['hotel', ""new year's ev"", 'witch', 'bet', 'ho..."
2,Judgment Night,6,"['Action', 'Thriller', 'Crime']","['chicago, illinoi', 'drug deal', 'escap', 'on..."
3,Star Wars,11,"['Adventure', 'Action', 'Science Fiction']","['android', 'galaxi', 'hermit', 'death star', ..."
4,Finding Nemo,12,"['Animation', 'Family']","['parent child relationship', 'sydney, austral..."


## String processing: 'genres' was stored as a string instead of a list

In the csv/xlsx file each 'genres' cell is a string, not a list. ast.literal_eval is used to convert each string back into a Python list.
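As a minimal sketch of that conversion on a single cell: the sample string below mirrors the 'genres' values shown in the table above, and `parse_list` is a hypothetical helper name, not part of the notebook. `ast.literal_eval` only accepts Python literals, so unlike `eval` it refuses arbitrary expressions.

```python
import ast

def parse_list(cell):
    """Parse a string such as "['Crime', 'Comedy']" into a list (hypothetical helper)."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return value

raw = "['Crime', 'Comedy']"   # how a 'genres' cell looks right after loading
parsed = parse_list(raw)
print(type(raw).__name__, "->", type(parsed).__name__, parsed)

# literal_eval rejects anything that is not a plain literal
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print("non-literal rejected:", rejected)
```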

In [5]:
import ast

In [6]:
# find how many rows in data
processing_data.shape

(19293, 4)

In [7]:
# Parse every 'genres' string into a list in a single column assignment
# (avoids chained indexing, which triggers pandas' SettingWithCopyWarning)
processing_data['genres'] = processing_data['genres'].apply(ast.literal_eval)


In [8]:
processing_data.head()

Unnamed: 0,title,id,genres,keywords
0,Blondie,3924,[Comedy],['blondi']
1,Four Rooms,5,"[Crime, Comedy]","['hotel', ""new year's ev"", 'witch', 'bet', 'ho..."
2,Judgment Night,6,"[Action, Thriller, Crime]","['chicago, illinoi', 'drug deal', 'escap', 'on..."
3,Star Wars,11,"[Adventure, Action, Science Fiction]","['android', 'galaxi', 'hermit', 'death star', ..."
4,Finding Nemo,12,"[Animation, Family]","['parent child relationship', 'sydney, austral..."


In [9]:
# Sanity check
processing_data['genres'][2][1]

'Thriller'

In [10]:
processing_data['genres'][1900][1]

'Action'

## String processing on 'keywords': remove whitespace

In [11]:
# Strip all spaces from every 'keywords' string in a single column assignment
# (avoids chained indexing, which triggers pandas' SettingWithCopyWarning)
processing_data['keywords'] = processing_data['keywords'].str.replace(" ", "", regex=False)


In [12]:
processing_data.head(100)

Unnamed: 0,title,id,genres,keywords
0,Blondie,3924,[Comedy],['blondi']
1,Four Rooms,5,"[Crime, Comedy]","['hotel',""newyear'sev"",'witch','bet','hotelroo..."
2,Judgment Night,6,"[Action, Thriller, Crime]","['chicago,illinoi','drugdeal','escap','onenigh..."
3,Star Wars,11,"[Adventure, Action, Science Fiction]","['android','galaxi','hermit','deathstar','jedi..."
4,Finding Nemo,12,"[Animation, Family]","['parentchildrelationship','sydney,australia',..."
...,...,...,...,...
95,Star Trek V: The Final Frontier,172,"[Science Fiction, Action, Adventure, Thriller]","['feder','lossoflovedon','selfsacrific','hosta..."
96,"20,000 Leagues Under the Sea",173,"[Adventure, Drama, Family, Fantasy, Science Fi...","['dive','ocean','submarin','julesvern','captai..."
97,Star Trek VI: The Undiscovered Country,174,"[Science Fiction, Action, Adventure, Thriller]","['farewel','feder','courtcas','plan','spaceope..."
98,Saw,176,"[Horror, Mystery, Crime]","['detect','shotgun','flashback','hospit','doct..."
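Why the spaces matter: with default settings, TfidfVectorizer lowercases the text and keeps every run of two or more word characters as a token, so a multi-word keyword is split into separate words unless the spaces are removed first. The sketch below applies that default token pattern with the standard `re` module to strings taken from the outputs in this notebook; the helper name `default_tokens` is an assumption for illustration. It also suggests that punctuation inside a keyword (apostrophes, dots, hyphens) may still split it, which could be worth checking against the fitted vocabulary.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    """Tokenize like the default vectorizer settings: lowercase, then match the pattern."""
    return TOKEN_PATTERN.findall(text.lower())

before = "['chicago, illinoi', 'drug deal', 'escap']"
after = before.replace(" ", "")

print(default_tokens(before))  # 'drug deal' becomes two tokens
print(default_tokens(after))   # 'drugdeal' stays one token

# Punctuation still splits keywords even after the spaces are gone
print(default_tokens("['drive-intheat']"))
print(default_tokens("['u.s.armi']"))
```

If keeping each keyword as exactly one token matters, one option would be to replace the remaining punctuation as well, or to pass the vectorizer a custom tokenizer; neither is done in this notebook.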


## Sanity check on the processed data

In [13]:
processing_data['genres'][10000][0]

'Comedy'

In [14]:
processing_data['keywords'][2]

"['chicago,illinoi','drugdeal','escap','onenight','box']"

In [15]:
processing_data['keywords'][10000]

"['drive-intheat']"

# Multi-Label Classification
## Transformation into binary classification problems

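The target variable 'genres' is encoded as a binary indicator (multi-hot) vector with one column per genre, because a movie can belong to several genres at once. This is done with scikit-learn's MultiLabelBinarizer.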
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html

In [16]:
from sklearn.preprocessing import MultiLabelBinarizer

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(processing_data['genres'])

# transform target variable
y = multilabel_binarizer.transform(processing_data['genres'])

In [17]:
y

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [18]:
y.shape

(19293, 19)

19 genres are encoded, one column per genre.
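To make the encoding concrete, here is a small pure-Python sketch of the same idea on three rows from the table above. The helper names `fit_classes` and `binarize` are hypothetical; the notebook itself relies on MultiLabelBinarizer, which likewise orders its classes alphabetically.

```python
def fit_classes(label_lists):
    """Collect the distinct labels in sorted order (one output column per label)."""
    return sorted({label for labels in label_lists for label in labels})

def binarize(label_lists, classes):
    """Turn each list of labels into a 0/1 row with a 1 in every matching column."""
    column = {label: j for j, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return rows

genres = [["Comedy"], ["Crime", "Comedy"], ["Action", "Thriller", "Crime"]]
classes = fit_classes(genres)
encoded = binarize(genres, classes)

print(classes)
for row in encoded:
    print(row)
```

Each row can carry several 1s, which is what distinguishes this multi-label target from an ordinary one-hot encoding.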

# Tf-idf vectorizer
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction 

In [19]:
tfidf_vectorizer = TfidfVectorizer()

In [20]:
# split dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(processing_data['keywords'], y, test_size=0.2, random_state=9)

In [21]:
X_train

6576         ['terrorist','undercov','lasvega','showgirl']
8502           ['journalist','rockstar','music','concert']
16532                                    ['privatedetect']
3481     ['supercomput','computerprogram','destini','ti...
8215                    ['robberi','bank','love','murder']
                               ...                        
4532     ['judg','juror','deathpenalti','revel','righta...
4673     ['virgin','colleg','pregnanc','yoga','bikini',...
5014                                 ['worldwarii','tank']
9979     ['neonaz','prison','gallow','coffe','auschwitz...
501      ['suicid','paranoia','blackmarket','hallucin',...
Name: keywords, Length: 15434, dtype: object

In [22]:
y_train

array([[1, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [23]:
# create TF-IDF features
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

In [24]:
X_train_tfidf

<15434x5235 sparse matrix of type '<class 'numpy.float64'>'
	with 82783 stored elements in Compressed Sparse Row format>
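For intuition about what the numbers in this matrix mean, the sketch below recomputes TF-IDF by hand on two toy keyword strings. It assumes scikit-learn's documented defaults (raw term counts, smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row); the function name `tfidf` and the toy documents are illustrative only.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # default token pattern

def tfidf(docs):
    """Return (vocabulary, rows) with smoothed idf and L2-normalized rows."""
    tokenized = [TOKEN_PATTERN.findall(d.lower()) for d in docs]
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for toks in tokenized for term in set(toks))
    vocab = sorted(df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["['bank','robberi']", "['bank','love']"]  # toy keyword strings
vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

The keyword shared by both toy movies ('bank') ends up with a lower weight than the keyword unique to one of them, which is the behaviour TF-IDF is chosen for.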

# Logistic Regression

In [25]:
from sklearn.linear_model import LogisticRegression

from sklearn.multiclass import OneVsRestClassifier

# Performance metric
from sklearn.metrics import f1_score

In [26]:
lr = LogisticRegression()
LR_clf = OneVsRestClassifier(lr)

In [27]:
# fit model on train data
LR_clf.fit(X_train_tfidf, y_train)

OneVsRestClassifier(estimator=LogisticRegression())
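What the one-vs-rest wrapper does here can be sketched on toy data: it trains one independent binary classifier per genre column and stacks their yes/no answers. The arrays below are made up for illustration (6 samples, 2 features, 2 labels) and are not the movie data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multi-label problem (hypothetical values)
X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0],
              [0.9, 0.1], [1.0, 1.0], [0.0, 0.0]])
Y = np.array([[0, 1], [0, 1], [1, 0],
              [1, 0], [1, 1], [0, 0]])

# The wrapper used in this notebook
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

# The same thing by hand: one binary model per label column
manual = np.column_stack([
    LogisticRegression().fit(X, Y[:, j]).predict(X)
    for j in range(Y.shape[1])
])

same = bool((ovr.predict(X) == manual).all())
print("wrapper matches per-column models:", same)
```

Because every genre is decided independently, a movie can receive several genres or none at all.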

In [28]:
# make predictions for test set
y_pred = LR_clf.predict(X_test_tfidf)

In [29]:
# Sanity check
y_pred[100]

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [30]:
multilabel_binarizer.inverse_transform(y_pred)[0]

('Drama',)

In [31]:
# evaluate performance
LR_f1_score = f1_score(y_test, y_pred, average="micro")

In [32]:
LR_f1_score

0.45602090095796055
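With average="micro", the true positives, false positives, and false negatives are pooled over every movie and every genre before F1 is computed, so frequent genres weigh more than rare ones. A minimal pure-Python sketch of that calculation on made-up indicator rows (the function name `micro_f1` is illustrative):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all rows and label columns."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example (hypothetical): 2 movies, 3 genres
y_true_toy = [[1, 0, 1], [0, 1, 0]]
y_pred_toy = [[1, 0, 0], [0, 1, 1]]

score = micro_f1(y_true_toy, y_pred_toy)
print(round(score, 4))  # 2 TP, 1 FP, 1 FN -> 4 / 6
```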

# Decision Tree classifier

In [33]:
# Import Decision Tree Classifier model from Scikit-Learn
# X_train, X_test, y_train, y_test
from sklearn.tree import DecisionTreeClassifier

# Create a Decision Tree Classifier object (depth limited to 2)
dectree = DecisionTreeClassifier(max_depth=2)

In [34]:
DT_clf = OneVsRestClassifier(dectree)

In [35]:
# fit model on train data
DT_clf.fit(X_train_tfidf, y_train)

OneVsRestClassifier(estimator=DecisionTreeClassifier(max_depth=2))

In [36]:
# make predictions for test set
y_pred = DT_clf.predict(X_test_tfidf)

In [37]:
# Sanity check
y_pred[100]

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [38]:
multilabel_binarizer.inverse_transform(y_pred)[100]

('Crime',)

In [39]:
# evaluate performance
DT_f1_score = f1_score(y_test, y_pred, average="micro")

In [40]:
DT_f1_score

0.2097902097902098

# Gradient Boosting Classifier

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html 


In [41]:
# GRboost
# X_train, X_test, y_train, y_test
from sklearn.ensemble import GradientBoostingClassifier
GRboost = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

In [42]:
GR_clf = OneVsRestClassifier(GRboost)

In [43]:
# fit model on train data
GR_clf.fit(X_train_tfidf, y_train)

OneVsRestClassifier(estimator=GradientBoostingClassifier(learning_rate=1.0,
                                                         max_depth=1,
                                                         random_state=0))

In [44]:
# make predictions for test set
y_pred = GR_clf.predict(X_test_tfidf)

In [45]:
# evaluate performance
GR_f1_score = f1_score(y_test, y_pred, average="micro")

In [46]:
GR_f1_score

0.40432244833569964

In [47]:
for i in range(20):
    print(multilabel_binarizer.inverse_transform(y_pred)[i])

('Drama',)
('Drama', 'Romance')
('Crime', 'Drama', 'Thriller')
('Horror',)
('Drama',)
('Drama',)
('Drama',)
('Drama',)
('Thriller',)
()
('Drama', 'Thriller')
('Drama',)
('Action', 'Science Fiction')
('Comedy',)
()
()
('Action', 'Comedy', 'Thriller')
()
()
()


## Predict a movie example on test set

In [48]:
def infer_tags(q):
    """Return the genres predicted for keyword string q by the LR, DT and GB models."""
    q_vec = tfidf_vectorizer.transform([q])
    # predict with each fitted model, then map indicator rows back to genre names
    return tuple(
        multilabel_binarizer.inverse_transform(clf.predict(q_vec))
        for clf in (LR_clf, DT_clf, GR_clf)
    )

In [49]:
X_test.sample(1).index[0]

6442

In [170]:
for i in range(5):
    k = X_test.sample(1).index[0]
    LRpredict, DTpredict, GRpredict = infer_tags(X_test[k])
    print("Movie: ", processing_data['title'][k])
    print("\nKeywords: ", processing_data['keywords'][k] )
    print("\nPredicted genre using: ")
    print("Logistic Regression : ",  LRpredict )
    print("Decision Tree       : ",  DTpredict )
    print("Gradient Boosting   : ",  GRpredict )
    print("Actual genre: ", processing_data['genres'][k], "\n")
    print("//////////////////////////////////////////////////////////")

Movie:  Attack

Keywords:  ['worldwarii','u.s.armi','nationalguard','basedonplayormus','u.s.militari','moraldilemma']

Predicted genre using: 
Logistic Regression :  [('Drama', 'War')]
Decision Tree       :  [('War',)]
Gradient Boosting   :  [('Drama',)]
Actual genre:  ['Action', 'Drama', 'War'] 

//////////////////////////////////////////////////////////
Movie:  Seven Days in May

Keywords:  ['basedonnovelorbook','thewhitehous','gener','kidnap','coldwar','pentagon','u.s.airforc','conspiraci','desert','secretservic','u.s.marin','nuclearweapon','politicalthril']

Predicted genre using: 
Logistic Regression :  [('Drama', 'Thriller')]
Decision Tree       :  [('Drama',)]
Gradient Boosting   :  [('Action', 'Drama', 'Thriller')]
Actual genre:  ['Drama', 'Thriller'] 

//////////////////////////////////////////////////////////
Movie:  Midnight Crossing

Keywords:  ['sea','cuba','boat','island','cruis','blind']

Predicted genre using: 
Logistic Regression :  [('Drama',)]
Decision Tree       :  

In [51]:
print("\nf1 scores : ")
print("Logistic Regression : ",  LR_f1_score )
print("Decision Tree       : ",  DT_f1_score )
print("Gradient Boosting   : ",  GR_f1_score )


f1 scores : 
Logistic Regression :  0.45602090095796055
Decision Tree       :  0.2097902097902098
Gradient Boosting   :  0.40432244833569964


## Observation

1) Comparing the micro-averaged F1 scores of the three classifiers, Logistic Regression scored highest (0.456), followed by Gradient Boosting (0.404), with Decision Tree last (0.210). On this metric the ranking from best to worst is therefore Logistic Regression, Gradient Boosting, Decision Tree. Note that the F1 score combines precision and recall, so it is not the same as accuracy.

2) In the prediction examples on the test set, there are instances where a classifier predicts no genre at all. This seems to occur when there are too few keywords for the classifiers to work with.
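One mechanism that would produce an empty prediction: in a one-vs-rest setup each genre has its own binary classifier, and a movie gets no genre when none of them votes positive (for probability outputs, when no genre reaches 0.5). The sketch below uses made-up probabilities and hypothetical helper names to show a simple fallback that always returns the top-scoring genre. With the fitted models, such scores could presumably be obtained from `predict_proba` on the one-vs-rest classifier; that wiring is an assumption and is not done in this notebook.

```python
def labels_above(probs, classes, threshold=0.5):
    """Genres whose score exceeds the threshold (may be empty)."""
    return tuple(c for c, p in zip(classes, probs) if p > threshold)

def labels_with_fallback(probs, classes, threshold=0.5):
    """Same as labels_above, but fall back to the single best genre if empty."""
    picked = labels_above(probs, classes, threshold)
    if picked:
        return picked
    best = max(range(len(probs)), key=probs.__getitem__)
    return (classes[best],)

classes = ["Action", "Comedy", "Drama"]  # hypothetical subset of the 19 genres
probs = [0.21, 0.34, 0.47]               # hypothetical per-genre scores

print(labels_above(probs, classes))          # nothing reaches 0.5
print(labels_with_fallback(probs, classes))  # top-scoring genre instead
```

Whether a forced prediction is desirable depends on the use case: it removes the empty outputs but can add false positives, which would show up in the F1 score.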