The objective of this challenge is to assess your ability to:

● perform basic data manipulation and data pre-processing

● demonstrate awareness of the computations involved

● perform feature engineering

● train and tune ML models

● assess the performance of the ML models

● obtain clear, useful, and business-driven insights from data and models

In [1]:
#Imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
import pymysql
import pandas_profiling
import statsmodels.api as sm


In [2]:
# Reading data

scores_df = pd.read_csv('../data/genome_scores.csv')
tags_df = pd.read_csv('../data/genome_tags.csv')
link_df = pd.read_csv('../data/link.csv')
movie_df = pd.read_csv('../data/movie.csv')
rating_df = pd.read_csv('../data/rating.csv')
tag_df = pd.read_csv('../data/tag.csv')

# Exploratory analysis

You may perform exploratory analysis on this data, but you are not required to present it
to us. We will focus mainly on the feature engineering section of this challenge.

In [3]:
scores_df.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.025
1,1,2,0.025
2,1,3,0.05775
3,1,4,0.09675
4,1,5,0.14675


In [5]:
tags_df.head(10)

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s
5,6,1950s
6,7,1960s
7,8,1970s
8,9,1980s
9,10,19th century


In [7]:
link_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [9]:
movie_df.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [11]:
tag_df.head(10)

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19
3,65,521,noir thriller,2013-05-10 01:39:43
4,65,592,dark hero,2013-05-10 01:41:18
5,65,668,bollywood,2013-05-10 01:37:56
6,65,898,screwball comedy,2013-05-10 01:42:40
7,65,1248,noir thriller,2013-05-10 01:39:43
8,65,1391,mars,2013-05-10 01:40:55
9,65,1617,neo-noir,2013-05-10 01:43:37


In [16]:
rating_df.head(50)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40
5,1,112,3.5,2004-09-10 03:09:00
6,1,151,4.0,2004-09-10 03:08:54
7,1,223,4.0,2005-04-02 23:46:13
8,1,253,4.0,2005-04-02 23:35:40
9,1,260,4.0,2005-04-02 23:33:46


## Modeling structure

Create a dataframe where each instance (row) corresponds to a rating of some movie made by some user at a given point in time.

Note in particular that if a user has several ratings, then each of her ratings must appear on a different row.

Each column will correspond to a predictive variable (below we give instructions on the predictive variables). 

Then, create a column with the response variable for your model. This response variable is defined as:

● 1 in case the rating is >= 4 (flag for "high" rating)

● 0 in case the rating is < 4

In [14]:
# Counting the distinct movies that have at least one rating

rating_df['movieId'].nunique()

26744

In [17]:
# Creating a copy of the users' rating dataframe 
df = rating_df.copy()

In [18]:
# Creating target variable: ratings of 4 or higher are assigned 1, all others 0.
# A vectorized comparison avoids calling a Python function once per row.

df['high_rating'] = (df['rating'] >= 4).astype(int)

In [21]:
# Splitting dataframes based on date, to avoid data leakage

# Ordering dataframe by date

df = df.sort_values(by=['timestamp'])

df


Unnamed: 0,userId,movieId,rating,timestamp,high_rating
4182421,28507,1176,4.0,1995-01-09 11:46:44,1
18950979,131160,1079,3.0,1995-01-09 11:46:49,0
18950936,131160,47,5.0,1995-01-09 11:46:49,1
18950930,131160,21,3.0,1995-01-09 11:46:49,0
12341178,85252,45,3.0,1996-01-29 00:00:00,0
...,...,...,...,...,...
7819902,53930,118706,3.5,2015-03-31 06:00:51,0
2508834,16978,2093,3.5,2015-03-31 06:03:17,0
12898546,89081,55232,3.5,2015-03-31 06:11:26,0
12898527,89081,52458,4.0,2015-03-31 06:11:28,1


In [22]:
# Resetting the index (the original row labels are kept in a new 'index' column)
df = df.reset_index()

In [26]:
# Checking the row at the 70% split position
df.iloc[[int(0.7 * len(df))]]

Unnamed: 0,index,userId,movieId,rating,timestamp,high_rating
14000184,6633261,45669,6016,2.5,2007-12-08 01:20:38,0


In [31]:
# Storing the timestamp that separates train and test information as a scalar;
# data from this timestamp onward cannot be used for prediction

timestamp_limit = df['timestamp'].iloc[int(0.7 * len(df))]


In [32]:
# Splitting the time-ordered dataframe into train (first 70% of rows) and test sets

split_position = int(0.7 * len(df))
df_train, df_test = df.iloc[:split_position], df.iloc[split_position:]

In [33]:
df_train

Unnamed: 0,index,userId,movieId,rating,timestamp,high_rating
0,4182421,28507,1176,4.0,1995-01-09 11:46:44,1
1,18950979,131160,1079,3.0,1995-01-09 11:46:49,0
2,18950936,131160,47,5.0,1995-01-09 11:46:49,1
3,18950930,131160,21,3.0,1995-01-09 11:46:49,0
4,12341178,85252,45,3.0,1996-01-29 00:00:00,0
...,...,...,...,...,...,...
14000179,6633146,45669,4017,3.5,2007-12-08 01:19:14,0
14000180,6632828,45669,1241,4.5,2007-12-08 01:19:28,1
14000181,6633185,45669,4641,3.5,2007-12-08 01:19:55,0
14000182,6633284,45669,6620,4.0,2007-12-08 01:20:05,1


In [34]:
df_test

Unnamed: 0,index,userId,movieId,rating,timestamp,high_rating
14000184,6633261,45669,6016,2.5,2007-12-08 01:20:38,0
14000185,6633348,45669,7942,3.0,2007-12-08 01:21:04,0
14000186,6633515,45669,46723,3.0,2007-12-08 01:21:26,0
14000187,6632788,45669,923,3.5,2007-12-08 01:21:47,0
14000188,6633042,45669,2918,5.0,2007-12-08 01:22:00,1
...,...,...,...,...,...,...
20000258,7819902,53930,118706,3.5,2015-03-31 06:00:51,0
20000259,2508834,16978,2093,3.5,2015-03-31 06:03:17,0
20000260,12898546,89081,55232,3.5,2015-03-31 06:11:26,0
20000261,12898527,89081,52458,4.0,2015-03-31 06:11:28,1


## Feature engineering

This is the part of the challenge on which our evaluation will focus the most. Implement a
series of features that you think will have high predictive power. Be creative, and explore
all the ideas you might have about what information could be useful to predict the rating a
client will give.

Important Note: When creating the features that you propose to predict the rating that a
user will give to some movie:

● assume that this model will be used to generate online predictions in a production
setting, and be aware of the implications of that, and

● pay special attention to data leakage.

Your code organization and good practices will be taken into consideration. Make sure that
your final submission is understandable and clean, and that the logic is easy for other
people to follow. It is also advisable to consider code efficiency.
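
One way to respect the leakage and online-prediction constraints above is to build history features that, for each rating, only look at rows that came earlier in time. The sketch below is an illustration under assumptions, not the notebook's implemented solution: the helper name `add_past_only_features` and the output column names are hypothetical, and it assumes a frame with the `userId`, `movieId`, `timestamp`, and `high_rating` columns created earlier.

```python
import pandas as pd


def add_past_only_features(ratings: pd.DataFrame) -> pd.DataFrame:
    """Add per-user and per-movie history features that only use earlier rows.

    Hypothetical helper: the function and column names are illustrative.
    """
    # Order rows in time so that "earlier in the frame" means "earlier in time".
    out = ratings.sort_values("timestamp", kind="stable").reset_index(drop=True)
    for key in ("userId", "movieId"):
        grouped = out.groupby(key)["high_rating"]
        # cumcount() is the number of earlier rows in the group (0 for the first).
        prior_count = grouped.cumcount()
        # cumsum() includes the current row, so subtract it to keep only the past.
        prior_sum = grouped.cumsum() - out["high_rating"]
        out[f"{key}_prior_count"] = prior_count
        # Rows without history get NaN, to be handled by the model or an imputer.
        out[f"{key}_prior_high_share"] = prior_sum / prior_count.where(prior_count > 0)
    return out


# Toy data (made up for illustration only).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 1],
    "movieId": [10, 20, 10, 30],
    "timestamp": [
        "2005-01-01 00:00:00",
        "2005-01-02 00:00:00",
        "2005-01-03 00:00:00",
        "2005-01-04 00:00:00",
    ],
    "high_rating": [1, 0, 1, 1],
})
features = add_past_only_features(toy)
print(features[["userId", "userId_prior_count", "userId_prior_high_share"]])
```

Because each value depends only on earlier rows, the same quantities could be maintained incrementally in production. One caveat: rows of the same user that share an identical timestamp are treated as ordered by their position after sorting, which may still let one of them see the other.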

In [None]:
# Overall movie rating, number of ratings, tags rating, number of tags, genres rating, number of genres

# Other databases info (imdb)
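
The genre ideas listed above could be sketched as follows, assuming the pipe-separated `genres` column shown in `movie_df`. The helper name `genre_features` and the `genre_` / `n_genres` column names are hypothetical. Since genres are static movie metadata rather than something derived from ratings, they are less exposed to temporal leakage than rating aggregates.

```python
import pandas as pd


def genre_features(movies: pd.DataFrame) -> pd.DataFrame:
    """Build one indicator column per genre plus a genre count, keyed by movieId.

    Hypothetical helper: the function and column names are illustrative.
    """
    # Split the pipe-separated string into 0/1 indicator columns.
    dummies = movies["genres"].fillna("").str.get_dummies(sep="|").add_prefix("genre_")
    # If the data uses a placeholder string for missing genres, it is counted
    # as one genre here and may need special handling.
    dummies.insert(0, "n_genres", dummies.sum(axis=1))
    dummies.insert(0, "movieId", movies["movieId"].to_numpy())
    return dummies


# Three rows copied from the movie_df preview shown earlier.
movies_sample = pd.DataFrame({
    "movieId": [1, 6, 9],
    "title": ["Toy Story (1995)", "Heat (1995)", "Sudden Death (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Action|Crime|Thriller",
        "Action",
    ],
})
genre_table = genre_features(movies_sample)
print(genre_table)
```

The resulting table could then be joined to the rating rows with something like `df_train.merge(genre_table, on="movieId", how="left")`.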

## Model implementation

Implement a ML model which predicts your response variable using the predictive features
you created. 

Explain the process you followed to generate/choose the model. 

Do not invest too much time training/tuning your model. It will be enough for us if you choose an algorithm and a configuration of hyperparameters you have seen in the past to work well for this type of problem and dataset.

Please, explain and justify your selection of the algorithm and hyperparameters.

## Feature importance

Give an explanation of the importance of each feature, and show us which of the features you created had the highest impact on your model. 

Explain and justify your choice of the importance metric.

Important note: 

Even though your model predicts whether or not a client will rate a movie as “high”, we will not assess your skills at building recommendation systems (such as collaborative filtering). As mentioned, we are interested in assessing your feature engineering and modeling skills, using the modeling structure defined above.

# Conclusions

Add some comments summarizing your work. Also, comment on how you would improve it if you were given more time.