# NLP for Data Movies

#### The purpose of this analysis is to find a correlation between movie overviews and revenue (the model below uses the `popularity` column as its target). To work with a text field such as "Overview", we first need to convert it into numeric features using NLP techniques.
#### The next step is to use those vectorized overviews to fit a model and make predictions.

## Getting Data
### Source: Kaggle, TMDB 5000 movies data set (`tmdb_5000_movies.csv`)

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import os

In [2]:
movies = pd.read_csv(os.path.join('RawData', 'tmdb_5000_movies.csv'))

In [3]:
df_movies = movies[['id','budget','revenue','popularity','original_title','overview']]

In [4]:
df_movies.head()

Unnamed: 0,id,budget,revenue,popularity,original_title,overview
0,19995,237000000,2787965087,150.437577,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,300000000,961000000,139.082615,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,245000000,880674609,107.376788,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,250000000,1084939099,112.31295,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,260000000,284139100,43.926995,John Carter,"John Carter is a war-weary, former military ca..."


In [5]:
df_movies.to_csv('Output/NLP_popularity/df_movies_original_dataframe.csv')

# Cleaning Data

In [6]:
#Drop rows whose popularity is 0
df_movies = df_movies.drop(df_movies[df_movies.popularity == 0].index)

In [7]:
df_movies.count()

id                4802
budget            4802
revenue           4802
popularity        4802
original_title    4802
overview          4799
dtype: int64

In [8]:
#Check if there is any movie without overview
df_movies[pd.isnull(df_movies["overview"])]

Unnamed: 0,id,budget,revenue,popularity,original_title,overview
2656,370980,15000000,0,0.738646,Chiamatemi Francesco - Il Papa della gente,
4140,459488,2,0,0.050625,"To Be Frank, Sinatra at 100",
4431,292539,913000,0,0.795698,Food Chains,


In [9]:
df_movies=df_movies[pd.notnull(df_movies["overview"])]

# Feature Engineering NLP

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [11]:
#Split data into train and test
x_train,x_test, y_train,y_test = train_test_split(df_movies,df_movies["popularity"], test_size=0.2)

x_train_df=x_train.reset_index()
y_train_df=y_train.reset_index()

x_test_df=x_test.reset_index()
y_test_df=y_test.reset_index()

## Remove stop words

In [12]:
#Download library for stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to C:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
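#Clean stop words and unwanted characters such as punctuation marks (digits are kept).
import re

#Characters replaced by a space so the words around them stay separated
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
#Everything except digits, lowercase letters, spaces and # + _ is removed
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')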
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
        text: a string
        
        return: modified initial string
    """
    
    text = text.lower()
    text = re.sub(REPLACE_BY_SPACE_RE," ",text)
    text = re.sub(BAD_SYMBOLS_RE,"",text)
    text = ' '.join([word for word in text.split() if word not in STOPWORDS])
    
    return text

In [14]:
#Testing function
text_prepare("Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the Earth with Will Turner and Elizabeth Swann. But nothing is quite as it seems")

'captain barbossa long believed dead come back life headed edge earth turner elizabeth swann nothing quite seems'
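To make the individual cleaning steps easier to follow, here is a minimal standalone sketch that applies them one at a time and keeps every intermediate result. The tiny `SMALL_STOPWORDS` set, the `clean_steps` helper and the sample sentence are assumptions made for illustration only; the notebook itself uses the full NLTK English stop word list.

```python
import re

# Symbols that are replaced by a space so neighbouring words stay separated.
REPLACE_BY_SPACE = re.compile(r'[/(){}\[\]|@,;]')
# Anything that is not a digit, lowercase letter, space, #, + or _ is dropped.
BAD_SYMBOLS = re.compile(r'[^0-9a-z #+_]')
# Hypothetical, deliberately tiny stop word set (NLTK's list is much longer).
SMALL_STOPWORDS = {"a", "the", "is", "to", "and", "of"}


def clean_steps(text):
    """Return the intermediate result of every cleaning step."""
    lowered = text.lower()
    spaced = REPLACE_BY_SPACE.sub(" ", lowered)
    stripped = BAD_SYMBOLS.sub("", spaced)
    tokens = [w for w in stripped.split() if w not in SMALL_STOPWORDS]
    return lowered, spaced, stripped, " ".join(tokens)


# Print each stage for a made-up sentence.
for step in clean_steps("A pre-fame author (and/or poet) goes to the city!"):
    print(repr(step))
```

Replacing separators such as `/` with a space before deleting the remaining symbols keeps "and/or" as two words, while a hyphenated word such as "pre-fame" is merged into a single token.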

In [15]:
#Applying function to all overviews in data set
ls_train = [text_prepare(x) for x in x_train_df['overview']]
ls_test = [text_prepare(x) for x in x_test_df['overview']]

In [16]:
#Join dataframe with a new column: Overview without stop words
x_train_df['overview_sw']=ls_train
x_test_df['overview_sw']=ls_test

In [17]:
x_train_df.head()
x_test_df.head()

Unnamed: 0,index,id,budget,revenue,popularity,original_title,overview,overview_sw
0,2403,679,18500000,183316455,67.66094,Aliens,When Ripley's lifepod is found by a salvage cr...,ripleys lifepod found salvage crew 50 years la...
1,2450,2977,16500000,37311672,18.236284,Becoming Jane,A biographical portrait of a pre-fame Jane Aus...,biographical portrait prefame jane austen roma...
2,4211,1651,0,0,3.654018,La sirène du Mississipi,"Adapted from a story by William Irish, it's a ...",adapted story william irish noirish tale man o...
3,4179,260778,0,0,0.320387,வாலு,"Sharp (Simbu), a happy-go-lucky guy, loves Pri...",sharp simbu happygolucky guy loves priya hansi...
4,4526,36825,0,0,0.082978,The R.M.,Jared Phelps (Kirby Heyborne) has completed tw...,jared phelps kirby heyborne completed two year...


## Tokenize, Vectorize and TF-IDF

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

vec_train = TfidfVectorizer(max_features=5000)
vec_train.fit(x_train_df["overview_sw"])
transformed_train = vec_train.transform(x_train_df["overview_sw"])
text_features_train = pd.DataFrame(transformed_train.todense())
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
text_features_train.columns = vec_train.get_feature_names_out()

In [19]:
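#Reuse the vectorizer fitted on the training data so that the test matrix has the
#same columns; a separate vectorizer must not be fitted on the test set.
transformed_test = vec_train.transform(x_test_df["overview_sw"])
text_features_test = pd.DataFrame(transformed_test.todense())
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
text_features_test.columns = vec_train.get_feature_names_out()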

In [20]:
transformed_train.shape

(3839, 5000)

In [21]:
transformed_test.shape

(960, 5000)
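Both matrices have the same number of columns because the test overviews are transformed with the vectorizer that was fitted on the training overviews. A minimal sketch on a toy corpus shows the effect; the sentences and the `toy_*` names are made up for illustration and are not part of the data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for this illustration only.
toy_train = ["captain returns earth", "young wizard returns school"]
toy_test = ["captain visits school", "zombie outbreak"]

toy_vec = TfidfVectorizer()
toy_train_matrix = toy_vec.fit_transform(toy_train)  # learn vocabulary and IDF
toy_test_matrix = toy_vec.transform(toy_test)        # reuse them unchanged

# The vocabulary comes from the training texts only.
print(sorted(toy_vec.vocabulary_))
print(toy_train_matrix.shape, toy_test_matrix.shape)

# "zombie" and "outbreak" were never seen during fit, so that row has no
# non-zero entries.
print(toy_test_matrix[1].nnz)
```

Words that appear only in the test texts are silently ignored, which is one reason a model trained on overview vocabulary may have little signal for unseen movies.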

In [22]:
#Check results train data
text_features_train.head()

Unnamed: 0,007,10,100,10th,10yearold,11,11yearold,12,12yearold,13,...,youngest,youngsters,youth,zeus,zion,zoe,zombie,zombies,zone,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.130395,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
#Check results test data
text_features_test.head()

Unnamed: 0,007,10,100,10th,10yearold,11,11yearold,12,12yearold,13,...,youngest,youngsters,youth,zeus,zion,zoe,zombie,zombies,zone,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
#Final dataframe with vectorized words
final_train_df=pd.concat([x_train_df, text_features_train], axis=1)
final_test_df=pd.concat([x_test_df, text_features_test], axis=1)

In [25]:
final_train_df.to_csv('Output/NLP_popularity/NLP_train_results.csv')
final_test_df.to_csv('Output/NLP_popularity/NLP_test_results.csv')

# Run the model: Random Forest Regressor

In [26]:
from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor()
clf.fit(text_features_train, y_train_df["popularity"])

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [27]:
#Model results: words ranked by their feature importance for predicting popularity
pd.Series(clf.feature_importances_, index=text_features_train.columns).sort_values(ascending=False)

alongside        0.056362
hatches          0.053648
discovering      0.021602
origin           0.014995
fully            0.014703
humor            0.012810
sparrow          0.011191
theme            0.011105
earth            0.010932
hammond          0.010833
hogwarts         0.010724
prime            0.010086
symbol           0.009519
abducted         0.009156
katniss          0.008505
humanity         0.007598
batman           0.007475
plague           0.007114
gotham           0.006994
epic             0.006739
aftermath        0.006520
governments      0.006481
powerful         0.005994
nearly           0.005960
dwarves          0.005862
maintain         0.005798
manhunt          0.005571
named            0.005290
hightech         0.005012
challenge        0.004970
                   ...   
unorthodox       0.000000
mccain           0.000000
sean             0.000000
education        0.000000
dysfunctional    0.000000
unfolds          0.000000
mass             0.000000
chagrin     

In [28]:
#Export results in csv format
pd.Series(clf.feature_importances_, index=text_features_train.columns).sort_values(ascending=False).to_csv('Output/NLP_popularity/most_correlated_words.csv')

# Make predictions: Test data

In [29]:
predictions = clf.predict(text_features_test.values).tolist()

In [30]:
df_predictions = pd.DataFrame({'predictions_popularity':predictions})

In [31]:
final_test_predictions_df=pd.concat([x_test_df, df_predictions], axis=1)

In [32]:
#Check results of predictions
final_test_predictions_df.head()

Unnamed: 0,index,id,budget,revenue,popularity,original_title,overview,overview_sw,predictions_popularity
0,2403,679,18500000,183316455,67.66094,Aliens,When Ripley's lifepod is found by a salvage cr...,ripleys lifepod found salvage crew 50 years la...,24.765056
1,2450,2977,16500000,37311672,18.236284,Becoming Jane,A biographical portrait of a pre-fame Jane Aus...,biographical portrait prefame jane austen roma...,7.425448
2,4211,1651,0,0,3.654018,La sirène du Mississipi,"Adapted from a story by William Irish, it's a ...",adapted story william irish noirish tale man o...,11.587493
3,4179,260778,0,0,0.320387,வாலு,"Sharp (Simbu), a happy-go-lucky guy, loves Pri...",sharp simbu happygolucky guy loves priya hansi...,15.659098
4,4526,36825,0,0,0.082978,The R.M.,Jared Phelps (Kirby Heyborne) has completed tw...,jared phelps kirby heyborne completed two year...,17.54635


In [33]:
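#Percentage error of each prediction relative to the actual popularity (stored in the "variance" column)
final_test_predictions_df["variance"]=((final_test_predictions_df['predictions_popularity']-final_test_predictions_df['popularity']).div(final_test_predictions_df.popularity, axis=0))*100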

final_test_predictions_df.to_csv("Output/NLP_popularity/predictions_results_variance.csv")

# Score the model

In [34]:
from sklearn.metrics import r2_score

test_score = r2_score(y_test_df["popularity"], predictions)
test_score

-0.073779705121055139
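A negative score means that, on the test set, the model does worse than a constant prediction equal to the mean of the test targets, because R² is defined as 1 − SS_res / SS_tot. The sketch below recomputes that definition by hand on made-up numbers (not values from this data set); `r2_manual` is a hypothetical helper written for illustration.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for a single target."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up popularity values, for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]

always_mean = [5.0] * 4             # constant prediction at the mean
poor_guess = [8.0, 6.0, 4.0, 2.0]   # predictions in reversed order

print(r2_manual(actual, always_mean))  # 0.0
print(r2_manual(actual, poor_guess))   # -3.0
```

A score slightly below zero, as obtained here, therefore suggests that the TF-IDF features alone explain essentially none of the variation in popularity on the test set.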

# Save the model

In [35]:
#sklearn.externals.joblib is no longer available in recent scikit-learn releases; import joblib directly
import joblib
joblib.dump(clf, 'NLP_model_popularity.pkl')

['NLP_model_popularity.pkl']

In [36]:
joblib.dump(vec_train,'NLP_vectorizer_popularity.pkl')

['NLP_vectorizer_popularity.pkl']
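The two saved files can later be loaded to score a new overview without retraining. The sketch below assumes that both `.pkl` files are in the current working directory and that the text is cleaned with the same steps used for training; `predict_popularity` is a hypothetical helper that receives the cleaning function as a parameter.

```python
import joblib


def predict_popularity(overview, clean,
                       model_path='NLP_model_popularity.pkl',
                       vectorizer_path='NLP_vectorizer_popularity.pkl'):
    """Load the saved vectorizer and model and score one raw overview."""
    vectorizer = joblib.load(vectorizer_path)
    model = joblib.load(model_path)
    # Apply the same cleaning and the same fitted vocabulary as in training.
    features = vectorizer.transform([clean(overview)])
    return float(model.predict(features.toarray())[0])
```

With the notebook's own objects this could be called as `predict_popularity("A young wizard returns to school.", text_prepare)`. Pickled models should be loaded with the same scikit-learn version that created them.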