## MOVIE RECOMMENDER SYSTEM - TRAIN TEST GENERATOR 

#### SREENATH S

**NOTE: It is assumed that all the required input files are present in the same folder where this notebook is copied to.**

This notebook is part of the project Movie Recommendation System.

Functionalities provided as part of this notebook are:
    1. Method to retrieve the preprocessed movie metadata.
    2. Method to retrieve the preprocessed movie metadata indexed by imdbId.
    3. Method to retrieve the customer interaction data (movie ratings dataset).
    4. Perform the train/test split and retrieve the training and test ratings datasets.
    5. Method to retrieve the full, train and test ratings datasets indexed by userId.

Import the required packages

In [1]:
import pandas as pd
import numpy as np
import import_ipynb
import scipy
from sklearn.model_selection import train_test_split

Importing the MovieRecommender_EDA_Preprocessing module. Please look at MovieRecommender_EDA_Preprocessing for EDA and various preprocessing performed on the datasets.

Please note that we are disabling the print option for this import; otherwise, all the print statements from the imported notebook would be printed below.

In [2]:
%%capture
import MovieRecommender_EDA_Preprocessing as EDAProcessor

Printing the first rows of the user interaction dataset and its shape to ensure it is imported properly.

In [3]:
EDAProcessor.rating_dataset_final.head()

Unnamed: 0,userId,imdbId,rating
0,1,112792,2.5
1,7,112792,3.0
2,31,112792,4.0
3,32,112792,4.0
4,36,112792,3.0


In [4]:
EDAProcessor.rating_dataset_final.shape

(99752, 3)

Storing the user interaction dataset in another dataframe for easier access within this notebook. A copy is taken because we don't want to change the state of the dataset that is part of the imported module.

In [5]:
user_behaviour_full_df = EDAProcessor.rating_dataset_final.copy()
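
Plain assignment only binds a second name to the same DataFrame object, whereas `.copy()` creates an independent one. The sketch below illustrates the difference on a toy frame; `source_df`, `alias_df` and `copy_df` are illustrative names and the values are made up, standing in for `EDAProcessor.rating_dataset_final`.

```python
import pandas as pd

# Toy stand-in for EDAProcessor.rating_dataset_final (hypothetical values).
source_df = pd.DataFrame({
    "userId": [1, 7],
    "imdbId": [112792, 112792],
    "rating": [2.5, 3.0],
})

alias_df = source_df         # same object, no data is copied
copy_df = source_df.copy()   # independent copy (deep by default)

# Add a column through each handle and check what the source sees.
copy_df["flag"] = True
print("flag" in source_df.columns)   # False: the copy does not leak back

alias_df["seen"] = True
print("seen" in source_df.columns)   # True: the alias modifies the source
```

If the dataset is only ever read, the alias is harmless; the copy matters as soon as a column is added or values are changed in place.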

Let us print the dataset shape and initial rows to ensure the data is imported correctly.

In [6]:
user_behaviour_full_df.shape

(99752, 3)

In [7]:
user_behaviour_full_df.head()

Unnamed: 0,userId,imdbId,rating
0,1,112792,2.5
1,7,112792,3.0
2,31,112792,4.0
3,32,112792,4.0
4,36,112792,3.0


In [8]:
EDAProcessor.content_based_dataset.shape

(8989, 9)

Similarly, importing the movie metadata dataset as well (note that the metadata and keyword datasets are already merged). All these datasets are filtered and merged on imdbId.

In [9]:
EDAProcessor.content_based_dataset.head()

Unnamed: 0,original_language,original_title,title,overview,movie_genre,movie_production,movie_keywords,spoken_language,imdbId
0,en,Toy Story,Toy Story,"Led by Woody, Andy's toys live happily in his ...",Animation Comedy Family,Pixar Animation Studios,jealousy toy boy friendship friends rivalry bo...,en,114709
1,en,Jumanji,Jumanji,When siblings Judy and Peter discover an encha...,Adventure Fantasy Family,"TriStar Pictures,Teitler Film,Interscope Commu...",board game disappearance based on children's b...,en fr,113497
2,en,Grumpier Old Men,Grumpier Old Men,A family wedding reignites the ancient feud be...,Romance Comedy,"Warner Bros.,Lancaster Gate",fishing best friend duringcreditsstinger old men,en,113228
3,en,Waiting to Exhale,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",Comedy Drama Romance,Twentieth Century Fox Film Corporation,based on novel interracial relationship single...,en,114885
4,en,Father of the Bride Part II,Father of the Bride Part II,Just when George Banks has recovered from his ...,Comedy,"Sandollar Productions,Touchstone Pictures",baby midlife crisis confidence aging daughter ...,en,113041


Retrieving the movie metadata dataset and changing its index to imdbId, which makes it easier to retrieve movie data by imdbId.

In [10]:
content_data_df = EDAProcessor.content_based_dataset

In [11]:
indexed_content_df = EDAProcessor.content_based_dataset.set_index('imdbId')

Checking the dataset for size and initial rows after changing the index to 'imdbId'. Please note that 'imdbId' is common to the movie metadata and ratings datasets (we have already used the links dataset to merge on imdbId).

In [12]:
indexed_content_df.shape

(8989, 8)

In [13]:
indexed_content_df.head()

imdbId,original_language,original_title,title,overview,movie_genre,movie_production,movie_keywords,spoken_language
114709,en,Toy Story,Toy Story,"Led by Woody, Andy's toys live happily in his ...",Animation Comedy Family,Pixar Animation Studios,jealousy toy boy friendship friends rivalry bo...,en
113497,en,Jumanji,Jumanji,When siblings Judy and Peter discover an encha...,Adventure Fantasy Family,"TriStar Pictures,Teitler Film,Interscope Commu...",board game disappearance based on children's b...,en fr
113228,en,Grumpier Old Men,Grumpier Old Men,A family wedding reignites the ancient feud be...,Romance Comedy,"Warner Bros.,Lancaster Gate",fishing best friend duringcreditsstinger old men,en
114885,en,Waiting to Exhale,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",Comedy Drama Romance,Twentieth Century Fox Film Corporation,based on novel interracial relationship single...,en
113041,en,Father of the Bride Part II,Father of the Bride Part II,Just when George Banks has recovered from his ...,Comedy,"Sandollar Productions,Touchstone Pictures",baby midlife crisis confidence aging daughter ...,en
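
Once imdbId is the index, a movie can be fetched by label with `.loc`. Here is a minimal sketch on a toy frame. The three rows are taken from the preview, while `toy_content_df` and `toy_indexed_df` are illustrative names that are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for content_based_dataset, reduced to three columns.
toy_content_df = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "movie_genre": [
        "Animation Comedy Family",
        "Adventure Fantasy Family",
        "Romance Comedy",
    ],
    "imdbId": [114709, 113497, 113228],
})

# set_index moves imdbId out of the columns, so the column count drops by one.
toy_indexed_df = toy_content_df.set_index("imdbId")
print(toy_content_df.shape, toy_indexed_df.shape)   # (3, 3) (3, 2)

# A scalar label returns the matching row as a Series.
jumanji = toy_indexed_df.loc[113497]
print(jumanji["title"])                             # Jumanji

# A list of labels returns the rows in the requested order.
titles = toy_indexed_df.loc[[113228, 114709], "title"]
print(titles.tolist())                              # ['Grumpier Old Men', 'Toy Story']
```

The same column drop explains why the shape changes from (8989, 9) to (8989, 8) after `set_index('imdbId')`.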


In [14]:
# METHOD NAME  : get_movie_metadata
# INPUT PARAMS : none
# OUTPUT       : content_data_df
# DESCRIPTION  : This method returns the movie metadata DF (default integer index)

In [15]:
def get_movie_metadata():
    return content_data_df

In [16]:
# METHOD NAME  : get_indexed_movie_metadata
# INPUT PARAMS : none
# OUTPUT       : indexed_content_df
# DESCRIPTION  : This method returns the movie metadata DF indexed by imdbId

In [17]:
def get_indexed_movie_metadata():
    return indexed_content_df

In [18]:
# METHOD NAME  : get_users_interaction_data
# INPUT PARAMS : userId, user_ratings_df - input dataframe indexed by userId, containing imdbId and rating
# OUTPUT       : set of imdbIds of the movies the user interacted with
# DESCRIPTION  : This method returns all the interacted movies for the specified user from user_ratings_df

In [19]:
def get_users_interaction_data(userId, user_ratings_df):
    # Pass the label as a list so a user with a single rating still yields a Series, not a scalar
    interacted_items = user_ratings_df.loc[[userId], 'imdbId'].unique()
    return set(interacted_items)
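
A user with a single rating is an edge case for label-based lookup: with a scalar label, `.loc` returns that one row as a Series rather than a DataFrame. The sketch below uses toy data and a hypothetical helper, `interacted_movies`, to show a lookup that behaves the same for one rating or many. It assumes the ratings frame is indexed by userId.

```python
import pandas as pd

# Toy ratings indexed by userId (hypothetical values).
toy_ratings_df = pd.DataFrame({
    "userId": [1, 1, 7],
    "imdbId": [112792, 114709, 112792],
    "rating": [2.5, 4.0, 3.0],
}).set_index("userId")


def interacted_movies(user_id, ratings_indexed_df):
    """Return the set of imdbIds rated by user_id (hypothetical helper)."""
    # A list label keeps the result a Series of imdbIds even for one row.
    return set(ratings_indexed_df.loc[[user_id], "imdbId"])


movies_user_1 = interacted_movies(1, toy_ratings_df)   # user with two ratings
movies_user_7 = interacted_movies(7, toy_ratings_df)   # user with one rating
print(len(movies_user_1), len(movies_user_7))          # 2 1
```

A userId that is absent from the index raises a `KeyError`, so callers that may see unknown users need to handle that case.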

In [20]:
# METHOD NAME  : train_test_user_behaviour
# INPUT PARAMS : none
# OUTPUT       : user_behaviour_train_df, user_behaviour_test_df
# DESCRIPTION  : This method performs a train/test split on the user interaction dataset and returns the train set and the test set

In [21]:
def train_test_user_behaviour():
    user_behaviour_train_df, user_behaviour_test_df = train_test_split(user_behaviour_full_df,test_size=0.15,random_state=22)
    return user_behaviour_train_df, user_behaviour_test_df
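
With `test_size=0.15` the test set receives 15% of the 99,752 rows, rounded up (0.15 × 99,752 = 14,962.8, so 14,963 rows), and the remaining 84,789 rows form the train set. Fixing `random_state` makes the shuffle repeatable. The sketch below checks both properties on a toy frame with made-up values; `toy_df` and the `*_a` / `*_b` names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy ratings frame (hypothetical values), 40 rows.
toy_df = pd.DataFrame({
    "userId": [i % 5 for i in range(40)],
    "imdbId": [1000 + i for i in range(40)],
    "rating": [3.0] * 40,
})

# Two independent calls with the same seed.
train_a, test_a = train_test_split(toy_df, test_size=0.15, random_state=22)
train_b, test_b = train_test_split(toy_df, test_size=0.15, random_state=22)

# Same seed, same rows in the same order.
same_split = test_a.index.equals(test_b.index)

# The two parts partition the input: no row is lost or shared.
covers_all = len(train_a) + len(test_a) == len(toy_df)
no_overlap = set(train_a.index).isdisjoint(test_a.index)

print(same_split, covers_all, no_overlap)   # True True True
```

Note that this split is random over rows, not over users, so nothing guarantees that every user appears in both parts.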

In [22]:
user_behaviour_train_df, user_behaviour_test_df = train_test_user_behaviour()
print('engagement in train set: %d' % len(user_behaviour_train_df))
print('engagement in test set: %d' % len(user_behaviour_test_df))
print('Shape of user_behaviour_train_df: ' , user_behaviour_train_df.shape)
user_behaviour_train_df.head()

engagement in train set: 84789
engagement in test set: 14963
Shape of user_behaviour_train_df:  (84789, 3)


Unnamed: 0,userId,imdbId,rating
2457,73,112864,3.5
49661,472,120906,5.0
48470,584,119116,3.5
33782,18,117913,4.0
80144,614,80549,2.0


In [23]:
print('Shape of user_behaviour_test_df: ' , user_behaviour_test_df.shape)
user_behaviour_test_df.head()

Shape of user_behaviour_test_df:  (14963, 3)


Unnamed: 0,userId,imdbId,rating
19754,537,73195,4.0
10293,621,86190,4.0
83591,452,95800,4.0
5875,177,106977,5.0
8529,536,96895,3.0


In [24]:
# METHOD NAME  : get_user_behaviour_indexed
# INPUT PARAMS : none
# OUTPUT       : user_behaviour_full_indexed_df, user_behaviour_train_indexed_df, user_behaviour_test_indexed_df
# DESCRIPTION  : This method takes the full user interaction dataset and its train and test sets,
# then sets userId as the index of each dataset.

In [25]:
def get_user_behaviour_indexed():
    
    user_behaviour_full_indexed_df = user_behaviour_full_df.set_index('userId')
    user_behaviour_train_indexed_df = user_behaviour_train_df.set_index('userId')
    user_behaviour_test_indexed_df = user_behaviour_test_df.set_index('userId')
    return user_behaviour_full_indexed_df,user_behaviour_train_indexed_df,user_behaviour_test_indexed_df


In [26]:
user_beh_indexed_df,user_beh_train_indexed_df,user_beh_test_indexed_df = get_user_behaviour_indexed()
print('User behaviour indexed full DF, Shape = ', user_beh_indexed_df.shape)
print('User behaviour indexed train DF, Shape = ', user_beh_train_indexed_df.shape)
print('User behaviour indexed test DF, Shape = ', user_beh_test_indexed_df.shape)

User behaviour indexed full DF, Shape =  (99752, 2)
User behaviour indexed train DF, Shape =  (84789, 2)
User behaviour indexed test DF, Shape =  (14963, 2)


Analysing the number of unique movies and users present in the train and test datasets, and the overlap between the two.

In [27]:
user_behaviour_train_df.userId.nunique()

671

In [28]:
user_behaviour_test_df.userId.nunique()

668

The two user counts are nearly identical (671 in train, 668 in test), which suggests a good overlap of users between the train and test datasets. Let's see how many movies are common to the two datasets.

In [29]:
user_behaviour_train_df.imdbId.nunique()

8500

In [30]:
user_behaviour_test_df.imdbId.nunique()

4212

In [31]:
len(set(user_behaviour_train_df.imdbId.unique()).intersection(set(user_behaviour_test_df.imdbId.unique())))

3723
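
Of the 4,212 distinct movies in the test set, 3,723 also occur in the train set, which leaves 4,212 − 3,723 = 489 test movies with no training ratings. The same check can be run for users. The sketch below shows one way to count shared and test-only IDs; the toy frames and the helper `overlap_report` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy train/test ratings (hypothetical values).
toy_train_df = pd.DataFrame({"userId": [1, 1, 2, 3], "imdbId": [10, 11, 10, 12]})
toy_test_df = pd.DataFrame({"userId": [1, 2, 4], "imdbId": [11, 13, 10]})


def overlap_report(train_series, test_series):
    """Count IDs shared by both columns and IDs found only in test."""
    train_ids, test_ids = set(train_series), set(test_series)
    return {
        "shared": len(train_ids & test_ids),
        "test_only": len(test_ids - train_ids),
    }


user_overlap = overlap_report(toy_train_df.userId, toy_test_df.userId)
movie_overlap = overlap_report(toy_train_df.imdbId, toy_test_df.imdbId)
print(user_overlap)    # {'shared': 2, 'test_only': 1}
print(movie_overlap)   # {'shared': 2, 'test_only': 1}
```

The `test_only` count is the number of IDs a model trained on the train set has never seen, which is worth knowing before evaluating on the test set.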