# NLP for Data Movies

#### The purpose of this analysis is to found a correlation between movie overviews and revenue. In order to understand a text such as "Overview" we need to decode it making use of NLP tecniques.
#### Next step will be use those vectorized overviews and fit a model to make predictions.

# Getting Data
### Source: Kaggle from IMDB

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import os

In [2]:
movies = pd.read_csv(os.path.join('RawData', 'tmdb_5000_movies.csv'))

In [3]:
df_movies = movies[['id','budget','revenue','popularity','original_title','overview']]

In [4]:
df_movies.head()

Unnamed: 0,id,budget,revenue,popularity,original_title,overview
0,19995,237000000,2787965087,150.437577,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,300000000,961000000,139.082615,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,245000000,880674609,107.376788,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,250000000,1084939099,112.31295,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,260000000,284139100,43.926995,John Carter,"John Carter is a war-weary, former military ca..."


In [5]:
df_movies.to_csv('Output/NLP_revenue/df_movies_original_dataframe.csv')

# Cleaning Data

In [6]:
#Cleaning rows with no revenue 
df_movies = df_movies.drop(df_movies[df_movies.revenue == 0].index)

In [7]:
df_movies.count()

id                3376
budget            3376
revenue           3376
popularity        3376
original_title    3376
overview          3376
dtype: int64

In [8]:
#Check if there is any movie without overview
df_movies[pd.isnull(df_movies["overview"])]

Unnamed: 0,id,budget,revenue,popularity,original_title,overview


# Feature Engineering NLP

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [10]:
#Split data into train and test
x_train,x_test, y_train,y_test = train_test_split(df_movies,df_movies["revenue"], test_size=0.2)

x_train_df=x_train.reset_index()
y_train_df=y_train.reset_index()

x_test_df=x_test.reset_index()
y_test_df=y_test.reset_index()

## Remove stop words

In [11]:
#Download library for stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to C:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
#Cleaning stop words and some regular expresiions such as: punctuation marks, numbers.
import re

REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]"')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
        text: a string
        
        return: modified initial string
    """
    
    text = text.lower()
    text = re.sub(REPLACE_BY_SPACE_RE," ",text)
    text = re.sub(BAD_SYMBOLS_RE,"",text)
    text = ' '.join([word for word in text.split() if word not in STOPWORDS])
    
    return text

In [13]:
#Testing function
text_prepare("Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the Earth with Will Turner and Elizabeth Swann. But nothing is quite as it seems")

'captain barbossa long believed dead come back life headed edge earth turner elizabeth swann nothing quite seems'

In [14]:
#Applying function to all overviews in data set
ls_train = [text_prepare(x) for x in x_train_df['overview']]
ls_test = [text_prepare(x) for x in x_test_df['overview']]

In [15]:
#Join dataframe with a new column: Overview without stop words
x_train_df['overview_sw']=ls_train
x_test_df['overview_sw']=ls_test

In [16]:
x_train_df.head()
x_test_df.head()

Unnamed: 0,index,id,budget,revenue,popularity,original_title,overview,overview_sw
0,1381,39486,35000000,60251371,9.525037,Secretariat,Housewife and mother Penny Chenery agrees to t...,housewife mother penny chenery agrees take ail...
1,1791,22894,26000000,67918658,30.981604,Legion,"When God loses faith in humankind, he sends hi...",god loses faith humankind sends legion angels ...
2,262,120,93000000,871368364,138.049577,The Lord of the Rings: The Fellowship of the Ring,"Young hobbit Frodo Baggins, after inheriting a...",young hobbit frodo baggins inheriting mysterio...
3,1103,5503,44000000,368875760,54.884297,The Fugitive,"Wrongfully accused of murdering his wife, Rich...",wrongfully accused murdering wife richard kimb...
4,676,12160,63000000,25052000,13.859307,Wyatt Earp,Covering the life and times of one of the West...,covering life times one wests iconic heroes wy...


## Tokenize, Vectorize and TF-IDF

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

vec_train = TfidfVectorizer(max_features=5000)
vec_train.fit(x_train_df["overview_sw"])
transformed_train = vec_train.transform(x_train_df["overview_sw"])
text_features_train = pd.DataFrame(transformed_train.todense())
text_features_train.columns = vec_train.get_feature_names()

In [18]:
# vec_test = TfidfVectorizer(max_features=1000)
# vec_test.fit(x_test_df["overview_sw"])
transformed_test = vec_train.transform(x_test_df["overview_sw"])
text_features_test = pd.DataFrame(transformed_test.todense())
text_features_test.columns = vec_train.get_feature_names()

In [19]:
transformed_train.shape

(2700, 5000)

In [20]:
transformed_test.shape

(676, 5000)

In [21]:
#Check results train data
text_features_train.head()

Unnamed: 0,007,10,10th,10yearold,11,11th,11yearold,12,12yearold,13,...,zebra,zero,zeus,zinos,zion,zoe,zombie,zombies,zone,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
#Check results test data
text_features_test.head()

Unnamed: 0,007,10,10th,10yearold,11,11th,11yearold,12,12yearold,13,...,zebra,zero,zeus,zinos,zion,zoe,zombie,zombies,zone,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
#Final dataframe with vectorized words
final_train_df=pd.concat([x_train_df, text_features_train], axis=1)
final_test_df=pd.concat([x_test_df, text_features_test], axis=1)

In [24]:
final_train_df.to_csv('Output/NLP_revenue/NLP_train_results.csv')
final_test_df.to_csv('Output/NLP_revenue/NLP_test_results.csv')

# Run the model: Random Forest Regressor

In [25]:
from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor()
clf.fit(text_features_train, y_train_df["revenue"])

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [26]:
#Model results: Words more correlated to revenue
pd.Series(clf.feature_importances_, index=text_features_train.columns).sort_values(ascending=False)

hogwarts       0.036346
gandalf        0.023786
brock          0.020797
decepticons    0.018761
shrek          0.017592
10th           0.017191
formidable     0.016132
villainous     0.015266
stark          0.014719
katniss        0.012730
park           0.012671
skywalker      0.012184
brink          0.011525
spiderman      0.011032
accompanied    0.010973
skull          0.009276
alien          0.009220
manny          0.009126
iron           0.009017
ring           0.008954
powerful       0.008913
po             0.008438
imf            0.008227
knights        0.007572
emerges        0.007532
earth          0.007509
lion           0.007393
peter          0.006879
fury           0.006831
nefarious      0.006479
                 ...   
dorian         0.000000
officials      0.000000
doug           0.000000
offered        0.000000
offer          0.000000
oceans         0.000000
outbreak       0.000000
outcast        0.000000
diabolical     0.000000
disturbing     0.000000
peoples        0

In [27]:
#Export results in csv format
pd.Series(clf.feature_importances_, index=text_features_train.columns).sort_values(ascending=False).to_csv('Output/NLP_revenue/most_correlated_words.csv')

# Make predictions: Test data

In [28]:
predictions = clf.predict(text_features_test.values).tolist()

In [29]:
df_predictions = pd.DataFrame({'predictions_revenue':predictions})

In [30]:
final_test_predictions_df=pd.concat([x_test_df, df_predictions], axis=1)

In [31]:
#Check results of predictions
final_test_predictions_df.head()

Unnamed: 0,index,id,budget,revenue,popularity,original_title,overview,overview_sw,predictions_revenue
0,1381,39486,35000000,60251371,9.525037,Secretariat,Housewife and mother Penny Chenery agrees to t...,housewife mother penny chenery agrees take ail...,31416328.3
1,1791,22894,26000000,67918658,30.981604,Legion,"When God loses faith in humankind, he sends hi...",god loses faith humankind sends legion angels ...,185753836.2
2,262,120,93000000,871368364,138.049577,The Lord of the Rings: The Fellowship of the Ring,"Young hobbit Frodo Baggins, after inheriting a...",young hobbit frodo baggins inheriting mysterio...,425311275.8
3,1103,5503,44000000,368875760,54.884297,The Fugitive,"Wrongfully accused of murdering his wife, Rich...",wrongfully accused murdering wife richard kimb...,61365146.0
4,676,12160,63000000,25052000,13.859307,Wyatt Earp,Covering the life and times of one of the West...,covering life times one wests iconic heroes wy...,64839228.7


In [32]:
final_test_predictions_df["variance"]=((final_test_predictions_df['predictions_revenue']-final_test_predictions_df['revenue']).div(final_test_predictions_df.revenue, axis=0))*100

final_test_predictions_df.to_csv("Output/NLP_revenue/predictions_results_variance.csv")

# Score the model

In [33]:
from sklearn.metrics import r2_score

test_score = r2_score(y_test_df["revenue"], predictions)
test_score

-0.022174976752180831

# Download model

In [34]:
from sklearn.externals import joblib
joblib.dump(clf, 'NLP_model_revenue.pkl')

['NLP_model_revenue.pkl']

In [35]:
joblib.dump(vec_train,'NLP_vectorizer_revenue.pkl')

['NLP_vectorizer_revenue.pkl']