# Data Preprocessing

## Data Cleansing 

In [1]:
import pandas as pd

In [2]:
credit = pd.read_csv('data/tmdb_5000_credits.csv')
movies = pd.read_csv('data/tmdb_5000_movies.csv')

We merge the credits and movies files into a single dataset by joining them on the `id` column. We first rename the credits columns so that both frames share an `id` column.

In [3]:
credit.columns = ['id', 'title', 'cast', 'crew']
movies = movies.merge(credit, on='id')
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title_x,vote_average,vote_count,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
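
The merge relies on `id` being unique in both files, and any column name present in both frames (here `title`) comes back with `_x`/`_y` suffixes, as seen in the output above. A minimal sketch on toy data shows both points. The two small frames and their values are made up for illustration and are not taken from the TMDB files; `validate="one_to_one"` is an optional guard that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values).
toy_movies = pd.DataFrame(
    {"id": [1, 2, 3], "title": ["A", "B", "C"], "budget": [10, 20, 30]}
)
toy_credit = pd.DataFrame(
    {"id": [1, 2, 3], "title": ["A", "B", "C"], "cast": ["[]", "[]", "[]"]}
)

# validate="one_to_one" raises pandas.errors.MergeError if `id` is
# duplicated on either side, instead of silently multiplying rows.
merged = toy_movies.merge(toy_credit, on="id", validate="one_to_one")

# `title` exists in both frames, so pandas keeps both copies.
print(list(merged.columns))
# ['id', 'title_x', 'budget', 'title_y', 'cast']
```

If the check fails, the duplicated ids are worth inspecting before deciding which rows to keep.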


Check which columns contain `NaN` values.

In [4]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 23 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title_x                 4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null 

The `homepage` column is missing in 3,091 of the 4,803 rows (about 64%) and is irrelevant for making recommendations, so we drop it.

In [5]:
movies = movies.drop(columns=['homepage'])
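
The share of missing values can also be computed directly rather than read off `info()`. The following is a small sketch on a toy frame standing in for `movies`; the values and the `threshold` variable are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame with deliberately missing entries (hypothetical values).
toy = pd.DataFrame(
    {
        "homepage": ["http://example.com", None, None, None],
        "overview": ["a", "b", None, "d"],
        "budget": [1, 2, 3, 4],
    }
)

# isna() gives a boolean frame; its column-wise mean is the missing fraction.
missing_ratio = toy.isna().mean()

# Columns whose missing fraction exceeds the assumed threshold.
threshold = 0.5
mostly_missing = missing_ratio[missing_ratio > threshold].index.tolist()
print(mostly_missing)
# ['homepage']
```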

Similarly, we drop the three rows whose `overview` value is `NaN`.

In [6]:
movies = movies.dropna(subset=['overview'])

In [7]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 22 columns):
budget                  4800 non-null int64
genres                  4800 non-null object
id                      4800 non-null int64
keywords                4800 non-null object
original_language       4800 non-null object
original_title          4800 non-null object
overview                4800 non-null object
popularity              4800 non-null float64
production_companies    4800 non-null object
production_countries    4800 non-null object
release_date            4799 non-null object
revenue                 4800 non-null int64
runtime                 4800 non-null float64
spoken_languages        4800 non-null object
status                  4800 non-null object
tagline                 3959 non-null object
title_x                 4800 non-null object
vote_average            4800 non-null float64
vote_count              4800 non-null int64
title_y                 4800 non-null o

We see that one `release_date` value is null. We look up the affected row and enter its release date manually.

In [8]:
movies[movies['release_date'].isnull()].index.tolist()

[4553]

In [9]:
# Use a single .loc call so the assignment writes to `movies` itself
# rather than to a temporary copy (avoids SettingWithCopyWarning)
movies.loc[movies['id'] == 380097, 'release_date'] = "2015-03-01"


In [10]:
# reset index
movies = movies.reset_index(drop=True)

We fill the null values of the `tagline` column with empty strings.

In [11]:
movies['tagline'] = movies['tagline'].fillna('')

In [12]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4800 entries, 0 to 4799
Data columns (total 22 columns):
budget                  4800 non-null int64
genres                  4800 non-null object
id                      4800 non-null int64
keywords                4800 non-null object
original_language       4800 non-null object
original_title          4800 non-null object
overview                4800 non-null object
popularity              4800 non-null float64
production_companies    4800 non-null object
production_countries    4800 non-null object
release_date            4800 non-null object
revenue                 4800 non-null int64
runtime                 4800 non-null float64
spoken_languages        4800 non-null object
status                  4800 non-null object
tagline                 4800 non-null object
title_x                 4800 non-null object
vote_average            4800 non-null float64
vote_count              4800 non-null int64
title_y                 4800 non-null o

## Feature Engineering

We create a new column `score` based on IMDb's weighted rating formula, $\text{Weighted Rating} = \frac{Rv + Cm}{v+m}$, where

- $R$ is the average rating of the movie
- $v$ is the number of votes for the movie
- $m$ is the minimum number of votes required to be listed in the top 250 (set to 3000 here)
- $C$ is the mean vote across the whole dataset (the mean of `vote_average`)
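
The effect of the formula is easiest to see with numbers. The sketch below uses made-up values for $R$, $v$, and $C$ (only $m = 3000$ comes from the text) and a standalone helper that takes the four quantities as plain arguments, which differs from the row-based function used in the notebook. A movie with few votes is pulled almost all the way to the mean $C$, while a movie with many votes keeps a score close to its own average $R$.

```python
def weighted_rating(R, v, m, C):
    """Weighted rating (R*v + C*m) / (v + m)."""
    return (R * v + C * m) / (v + m)


C = 6.0   # assumed dataset-wide mean vote (illustrative)
m = 3000  # minimum vote count from the text

# Same average rating, very different vote counts.
few_votes = weighted_rating(R=9.0, v=30, m=m, C=C)
many_votes = weighted_rating(R=9.0, v=30000, m=m, C=C)

print(round(few_votes, 2))   # 6.03, close to C
print(round(many_votes, 2))  # 8.73, close to R
```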

In [13]:
def weighted_rating(x, m=3000, C=movies['vote_average'].mean()):
    v = x['vote_count']
    R = x['vote_average']
    return (R*v + C*m)/(v+m)

In [14]:
movies['score'] = movies.apply(weighted_rating, axis=1)

In [15]:
movies['genres'][0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

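Some columns in the dataset are stored as JSON strings (see the `genres` value above), so we parse them into Python lists and keep only the `name` of each entry. For `crew`, we keep only the director's name.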

In [16]:
from ast import literal_eval

In [17]:
features = ['genres', 'keywords', 'production_companies', 'production_countries',
           'spoken_languages', 'cast', 'crew']
for feature in features:
    movies[feature] = movies[feature].apply(literal_eval)

In [18]:
for feature in features:
    if feature != 'crew':
        movies[feature] = movies[feature].apply(lambda x: [i['name'] for i in x])

In [19]:
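def get_director(val):
    # Return the name of the first crew member whose job is 'Director'
    for i in val:
        if i['job'] == 'Director':
            return i['name']
    # Placeholder string for movies with no listed director;
    # tokenization() later turns it into an empty string
    return "None"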

In [20]:
movies['crew'] = movies['crew'].apply(get_director)
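
The parsing steps above can be tried on a single value without the full dataset. This sketch swaps in `json.loads` in place of the notebook's `literal_eval`: both work on strings like the `genres` value shown earlier, but `json.loads` also accepts JSON literals such as `null` or `true`, which `literal_eval` rejects. The raw strings and the helper names `names` and `director` are hypothetical.

```python
import json

# Hypothetical raw cell values in the same shape as the dataset's columns.
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
raw_crew = (
    '[{"job": "Editor", "name": "Person A"},'
    ' {"job": "Director", "name": "Person B"}]'
)


def names(raw):
    """Parse a JSON list of objects and keep each object's name."""
    return [item["name"] for item in json.loads(raw)]


def director(raw):
    """Return the first crew member with job 'Director', or None."""
    crew = json.loads(raw)
    return next((m["name"] for m in crew if m["job"] == "Director"), None)


print(names(raw_genres))   # ['Action', 'Adventure']
print(director(raw_crew))  # Person B
```

Returning `None` rather than the string `"None"` is a design choice of this sketch; the notebook keeps the string placeholder and removes it during tokenization.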

In [21]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4800 entries, 0 to 4799
Data columns (total 23 columns):
budget                  4800 non-null int64
genres                  4800 non-null object
id                      4800 non-null int64
keywords                4800 non-null object
original_language       4800 non-null object
original_title          4800 non-null object
overview                4800 non-null object
popularity              4800 non-null float64
production_companies    4800 non-null object
production_countries    4800 non-null object
release_date            4800 non-null object
revenue                 4800 non-null int64
runtime                 4800 non-null float64
spoken_languages        4800 non-null object
status                  4800 non-null object
tagline                 4800 non-null object
title_x                 4800 non-null object
vote_average            4800 non-null float64
vote_count              4800 non-null int64
title_y                 4800 non-null o

To prepare the text for `TfidfVectorizer`, we lowercase every entry and strip its spaces so that a multi-word value, such as a cast member's full name, becomes a single token. We then build two text columns: `info`, from `cast`, `keywords`, `crew`, and `genres`, and `country_info`, from `production_companies`, `production_countries`, and `spoken_languages`.

In [22]:
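def tokenization(x):
    # Lists (cast, keywords, genres, ...): lowercase each entry and strip its spaces
    if isinstance(x, list):
        return [word.replace(" ", "").lower() for word in x]
    # The "None" placeholder from get_director becomes an empty string
    if x == "None":
        return ""
    # Plain strings (the director's name): same normalization as list entries
    return x.replace(" ", "").lower()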

In [23]:
movie_info = ['cast', 'keywords', 'crew', 'genres']

for info in movie_info:
    movies[info] = movies[info].apply(tokenization)

In [24]:
movie_country = ["production_companies", "production_countries", "spoken_languages"]

for column in movie_country:
    movies[column] = movies[column].apply(tokenization)

In [25]:
def concatenate_info(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' '\
            + x['crew'] + ' ' + ' '.join(x['genres'])

In [26]:
def concatenate_countries(x):
    return ' '.join(x['production_companies']) + ' ' + ' '.join(x['production_countries']) + ' '\
            + ' '.join(x['spoken_languages'])

In [27]:
movies['info'] = movies.apply(concatenate_info, axis=1)
movies['country_info'] = movies.apply(concatenate_countries, axis=1)
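
Stripping spaces matters because `TfidfVectorizer` splits text into word tokens by default. The sketch below compares the vocabulary learned from a name string with and without the spaces removed; the two names are illustrative inputs, not rows taken from the processed dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The same two (illustrative) names, before and after collapsing spaces.
with_spaces = ["Tom Hanks Tom Cruise"]
collapsed = ["tomhanks tomcruise"]

# vocabulary_ maps each learned token to a column index; we only need the keys.
vocab_spaces = sorted(TfidfVectorizer().fit(with_spaces).vocabulary_)
vocab_collapsed = sorted(TfidfVectorizer().fit(collapsed).vocabulary_)

print(vocab_spaces)     # ['cruise', 'hanks', 'tom']
print(vocab_collapsed)  # ['tomcruise', 'tomhanks']
```

With the spaces left in, the shared first name becomes a feature of its own and would make unrelated people look similar; after collapsing, each person maps to exactly one token.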

We drop the columns that were used to create the `info` and `country_info` columns.

In [28]:
movies = movies.drop(columns=['cast', 'keywords', 'crew', 'genres',
                             'production_companies', 'production_countries', 'spoken_languages'])

We also drop columns that are not useful in creating a recommendation system.

In [29]:
movies = movies.drop(columns=['title_x', 'title_y', 'status', 'release_date',
                            'original_language', 'budget'])

In [30]:
# change id to index and sort by id
movies = movies.set_index('id').sort_values(by='id')

In [31]:
movies.head()

id,original_title,overview,popularity,revenue,runtime,tagline,vote_average,vote_count,score,info,country_info
5,Four Rooms,It's Ted the Bellhop's first night on the job....,22.87623,4300000,98.0,Twelve outrageous guests. Four scandalous requ...,6.5,530,6.154037,hotel newyear'seve witch bet hotelroom sperm l...,miramaxfilms abandapart unitedstatesofamerica ...
11,Star Wars,Princess Leia is captured and held hostage by ...,126.393695,775398007,121.0,"A long time ago in a galaxy far, far away...",8.1,6624,7.474351,android galaxy hermit deathstar lightsaber jed...,lucasfilm twentiethcenturyfoxfilmcorporation u...
12,Finding Nemo,"Nemo, an adventurous young clownfish, is unexp...",85.688789,940335536,100.0,"There are 3.7 trillion fish in the ocean, they...",7.6,6122,7.104358,fathersonrelationship harbor underwater fishta...,pixaranimationstudios unitedstatesofamerica en...
13,Forrest Gump,A man with a low IQ has accomplished great thi...,138.133331,677945399,142.0,"The world will never be the same, once you've ...",8.2,7927,7.621502,vietnamveteran hippie mentallydisabled running...,paramountpictures unitedstatesofamerica english
14,American Beauty,"Lester Burnham, a depressed suburban father in...",80.878605,356296601,122.0,Look closer.,7.9,3313,7.041256,malenudity femalenudity adultery midlifecrisis...,dreamworksskg jinks/cohencompany unitedstateso...


## Pipeline

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
import scipy.sparse  # save_npz lives in the scipy.sparse submodule

In [35]:
word_column = ['original_title', 'overview', 'tagline', 'info', 'country_info']
num_column = ['popularity', 'revenue', 'runtime', 'vote_average', 'vote_count', 'score']
assert len(movies.columns) == len(word_column) + len(num_column)

In [42]:
full_pipeline = ColumnTransformer([
    ("num", StandardScaler(), num_column),
    ("original_title", TfidfVectorizer(), "original_title"),
    ("overview", TfidfVectorizer(), "overview"),
    ("tagline", TfidfVectorizer(), "tagline"),
    ("info", TfidfVectorizer(), 'info'),
    ("country_info", TfidfVectorizer(), "country_info")
])

movie_data = full_pipeline.fit_transform(movies)

In [44]:
scipy.sparse.save_npz('data/movie.npz', movie_data)
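
The pipeline cells can be exercised end to end on a tiny frame. This is a sketch under stated assumptions: the toy frame, its column values, and the temporary file name are made up, and only two of the transformers are kept. Note that a numeric column list is passed as a list (the scaler expects 2-D input), while each text column is passed as a plain string (the vectorizer expects a 1-D sequence of documents). On very small inputs `ColumnTransformer` may return a dense array, because its `sparse_threshold` defaults to 0.3, so the sketch converts the result explicitly before calling `save_npz`.

```python
import os
import tempfile

import pandas as pd
import scipy.sparse
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned `movies` frame (hypothetical values).
toy = pd.DataFrame(
    {
        "runtime": [90.0, 120.0, 150.0],
        "overview": [
            "a heist in space",
            "a quiet family drama",
            "space family adventure",
        ],
    }
)

pipeline = ColumnTransformer(
    [
        ("num", StandardScaler(), ["runtime"]),       # list -> 2-D input
        ("overview", TfidfVectorizer(), "overview"),  # string -> 1-D input
    ]
)

# save_npz needs a sparse matrix, so convert whatever comes back.
features = scipy.sparse.csr_matrix(pipeline.fit_transform(toy))
print(features.shape)  # (3, 8): 1 scaled column + 7 TF-IDF terms

# Round-trip through a temporary file to confirm nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.npz")
    scipy.sparse.save_npz(path, features)
    restored = scipy.sparse.load_npz(path)

print((features.toarray() == restored.toarray()).all())  # True
```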