The objective of this challenge is to assess your ability to:

● perform basic data manipulation and data pre-processing

● demonstrate awareness of the computations involved

● perform feature engineering

● train and tune ML models

● assess the performance of the ML models

● obtain clear, useful, and business-driven insights from data and models

In [1]:
#Imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
import pymysql
import pandas_profiling
import statsmodels.api as sm


In [2]:
# Reading data

scores_df = pd.read_csv('../data/genome_scores.csv')
tags_df = pd.read_csv('../data/genome_tags.csv')
link_df = pd.read_csv('../data/link.csv')
movie_df = pd.read_csv('../data/movie.csv')
rating_df = pd.read_csv('../data/rating.csv')
tag_df = pd.read_csv('../data/tag.csv')

# Exploratory analysis

You might perform exploratory analysis on this data, but you are not required to present it
to us. We will focus mainly on the feature engineering section of this challenge.

In [3]:
scores_df.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.025
1,1,2,0.025
2,1,3,0.05775
3,1,4,0.09675
4,1,5,0.14675


In [4]:
# Checking nulls

scores_df.isna().sum() / len(scores_df)

movieId      0.0
tagId        0.0
relevance    0.0
dtype: float64

In [5]:
tags_df.head(10)

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s
5,6,1950s
6,7,1960s
7,8,1970s
8,9,1980s
9,10,19th century


In [6]:
# Checking nulls

tags_df.isna().sum() / len(tags_df)

tagId    0.0
tag      0.0
dtype: float64

In [7]:
link_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [8]:
# Checking nulls

link_df.isna().sum() / len(link_df)

movieId    0.000000
imdbId     0.000000
tmdbId     0.009238
dtype: float64

In [9]:
movie_df.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [10]:
# Checking nulls

movie_df.isna().sum() / len(movie_df)

movieId    0.0
title      0.0
genres     0.0
dtype: float64

In [11]:
tag_df.head(10)

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19
3,65,521,noir thriller,2013-05-10 01:39:43
4,65,592,dark hero,2013-05-10 01:41:18
5,65,668,bollywood,2013-05-10 01:37:56
6,65,898,screwball comedy,2013-05-10 01:42:40
7,65,1248,noir thriller,2013-05-10 01:39:43
8,65,1391,mars,2013-05-10 01:40:55
9,65,1617,neo-noir,2013-05-10 01:43:37


In [12]:
# Checking nulls

tag_df.isna().sum() / len(tag_df)

userId       0.000000
movieId      0.000000
tag          0.000034
timestamp    0.000000
dtype: float64

In [13]:
# Inspecting the rows with a missing tag

tag_df[tag_df.tag.isna()]

Unnamed: 0,userId,movieId,tag,timestamp
373276,116460,123,,2008-01-04 12:47:47
373277,116460,346,,2008-01-04 13:05:46
373281,116460,1184,,2008-01-04 13:11:01
373288,116460,1785,,2008-01-04 13:06:46
373289,116460,2194,,2008-01-04 12:44:37
373291,116460,2691,,2008-01-04 12:50:02
373299,116460,4103,,2008-01-04 13:05:20
373301,116460,4473,,2008-01-04 12:50:40
373303,116460,4616,,2008-01-04 13:14:01
373319,116460,7624,,2008-01-04 13:11:06


In [14]:
# Dropping nulls

tag_df = tag_df.dropna()

In [15]:
rating_df.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40
5,1,112,3.5,2004-09-10 03:09:00
6,1,151,4.0,2004-09-10 03:08:54
7,1,223,4.0,2005-04-02 23:46:13
8,1,253,4.0,2005-04-02 23:35:40
9,1,260,4.0,2005-04-02 23:33:46


In [16]:
# Checking nulls

rating_df.isna().sum() / len(rating_df)

userId       0.0
movieId      0.0
rating       0.0
timestamp    0.0
dtype: float64

# Modeling structure

Create a dataframe where each instance (row) corresponds to a rating of some movie made by some user at a given point in time.

Note in particular that if a user has several ratings, then each of her ratings must appear on a different row.

Each column will correspond to a predictive variable (below we give instructions on the predictive variables). 

Then, create a column with the response variable for your model. This response variable is defined as:

● 1 in case the rating is >= 4 (flag for "high" rating)

● 0 in case the rating is < 4

In [17]:
# Counting the distinct movies that have at least one rating

rating_df.movieId.nunique()

26744

In [18]:
# Creating a copy of the users' rating dataframe 
df = rating_df.copy()

In [19]:
# Creating target variable: ratings of 4 or higher are assigned 1, all others 0.
# The vectorized comparison avoids a Python-level function call per row.

df['high_rating'] = (df.rating >= 4).astype(int)

In [20]:
# Splitting dataframes based on date, to avoid data leakage

# Ordering dataframe by date

df = df.sort_values(by=['timestamp'])

df


Unnamed: 0,userId,movieId,rating,timestamp,high_rating
4182421,28507,1176,4.0,1995-01-09 11:46:44,1
18950979,131160,1079,3.0,1995-01-09 11:46:49,0
18950936,131160,47,5.0,1995-01-09 11:46:49,1
18950930,131160,21,3.0,1995-01-09 11:46:49,0
12341178,85252,45,3.0,1996-01-29 00:00:00,0
...,...,...,...,...,...
7819902,53930,118706,3.5,2015-03-31 06:00:51,0
2508834,16978,2093,3.5,2015-03-31 06:03:17,0
12898546,89081,55232,3.5,2015-03-31 06:11:26,0
12898527,89081,52458,4.0,2015-03-31 06:11:28,1


In [21]:
# Resetting index (the previous index is kept as the 'index' column)
df = df.reset_index()

In [22]:
# Checking 70% split
df[df.index== int(0.7*len(df))]

Unnamed: 0,index,userId,movieId,rating,timestamp,high_rating
14000184,6633261,45669,6016,2.5,2007-12-08 01:20:38,0


In [23]:
# Storing the timestamp (as a scalar) that separates train and test information

timestamp_limit = df.loc[int(0.7 * len(df)), 'timestamp']

# To avoid data leakage, data after this timestamp cannot be used for prediction
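The same cutoff can be applied to the other timestamped tables (for instance `tag_df`) before deriving features from them. The block below is a minimal sketch of that idea, not part of the original notebook: the helper name `restrict_to_cutoff` and the toy rows are made up for illustration, and it assumes timestamps are stored as `YYYY-MM-DD HH:MM:SS` strings as in the previews above.

```python
import pandas as pd

def restrict_to_cutoff(frame, cutoff, column='timestamp'):
    """Return only the rows whose timestamp is strictly before the cutoff."""
    stamps = pd.to_datetime(frame[column])
    return frame[stamps < pd.Timestamp(cutoff)]

# Made-up rows in the same layout as tag_df (illustrative values only)
toy_tags = pd.DataFrame({
    'userId': [18, 65, 65],
    'movieId': [4141, 208, 353],
    'tag': ['tag a', 'tag b', 'tag c'],
    'timestamp': ['2009-04-24 18:19:40', '2013-05-10 01:41:18', '2007-01-01 00:00:00'],
})

# Cutoff taken from the 70% split row shown above
cutoff = '2007-12-08 01:20:38'
usable = restrict_to_cutoff(toy_tags, cutoff)
print(usable)
```

Only rows dated before the cutoff remain, so any aggregate built from `usable` relies on information that was already available when the training period ended.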

In [24]:
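# Splitting dataframe into train (first 70% of ratings by time) and test sets.
# Positional slicing replaces np.split, which is deprecated for DataFrames in recent pandas.

split_idx = int(0.7 * len(df))
df_train, df_test = df.iloc[:split_idx].copy(), df.iloc[split_idx:].copy()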

In [25]:
df_train

Unnamed: 0,index,userId,movieId,rating,timestamp,high_rating
0,4182421,28507,1176,4.0,1995-01-09 11:46:44,1
1,18950979,131160,1079,3.0,1995-01-09 11:46:49,0
2,18950936,131160,47,5.0,1995-01-09 11:46:49,1
3,18950930,131160,21,3.0,1995-01-09 11:46:49,0
4,12341178,85252,45,3.0,1996-01-29 00:00:00,0
...,...,...,...,...,...,...
14000179,6633146,45669,4017,3.5,2007-12-08 01:19:14,0
14000180,6632828,45669,1241,4.5,2007-12-08 01:19:28,1
14000181,6633185,45669,4641,3.5,2007-12-08 01:19:55,0
14000182,6633284,45669,6620,4.0,2007-12-08 01:20:05,1


In [26]:
df_test

Unnamed: 0,index,userId,movieId,rating,timestamp,high_rating
14000184,6633261,45669,6016,2.5,2007-12-08 01:20:38,0
14000185,6633348,45669,7942,3.0,2007-12-08 01:21:04,0
14000186,6633515,45669,46723,3.0,2007-12-08 01:21:26,0
14000187,6632788,45669,923,3.5,2007-12-08 01:21:47,0
14000188,6633042,45669,2918,5.0,2007-12-08 01:22:00,1
...,...,...,...,...,...,...
20000258,7819902,53930,118706,3.5,2015-03-31 06:00:51,0
20000259,2508834,16978,2093,3.5,2015-03-31 06:03:17,0
20000260,12898546,89081,55232,3.5,2015-03-31 06:11:26,0
20000261,12898527,89081,52458,4.0,2015-03-31 06:11:28,1


# Feature engineering

This is the part of the challenge on which we will focus the most in our evaluation. Implement a
series of features that you think will have a high predictive power. Be creative, and explore
all the ideas you might have on what information could be useful to predict the rating of a
client.

Important Note: When creating the features that you propose, that predict the rating that a
user will give to some movie:

● assume that this model will be used to generate online predictions in a production
setting, and be aware of the implications of that, and

● pay special attention to data leakage.

Your code organization and good practices will be taken into consideration, so make sure that
your final submission is understandable and clean, and that the logic is easy for other
people to follow. It is also advisable to consider code efficiency.
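For online predictions, a feature attached to a rating made at time t should use only ratings made before t. One possible way to honor that is a per-movie running mean that excludes the current row. The sketch below illustrates the idea on made-up data; the column name `prior_avg_rating` is hypothetical, and this is one option under those assumptions rather than the approach the notebook takes.

```python
import pandas as pd

# Made-up ratings for two movies, sorted chronologically
toy = pd.DataFrame({
    'movieId':   [1, 1, 1, 2, 2],
    'rating':    [4.0, 2.0, 3.0, 5.0, 1.0],
    'timestamp': pd.to_datetime(['2005-01-01', '2005-01-02', '2005-01-03',
                                 '2005-01-01', '2005-01-05']),
}).sort_values('timestamp')

grouped = toy.groupby('movieId')['rating']

# Number of earlier ratings of the same movie (0 for the movie's first rating)
prior_count = grouped.cumcount()

# Sum of earlier ratings: running sum minus the current rating
prior_sum = grouped.cumsum() - toy['rating']

# Mean of earlier ratings; NaN when the movie has no earlier rating
toy['prior_avg_rating'] = prior_sum / prior_count.where(prior_count > 0)
print(toy)
```

Because each row only sees earlier ratings of the same movie, the feature never contains the rating it is meant to help predict.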

**TO DO:**

- Movie's average rating
- Movie's ratings amount
- Movie's number of tags
- Genre's rating
- Movie's Number of genres
- Movie's Release year
- Difference between movie's release year and rating year
- Tag's average rating 
- Tag's relevance (?)

- Other databases info (imdb)

##  Movie's Average Rating


In [27]:
# Computing movie's average rating on training set
mean_ratings = df_train.groupby('movieId')['rating'].mean().reset_index().rename(columns={'rating': 'movie_avg_rating'})



In [28]:
# Merging with train and test datasets

df_train = df_train.merge(mean_ratings, how='left', on= 'movieId')
df_test = df_test.merge(mean_ratings, how='left', on= 'movieId')

In [29]:
# Checking nulls

df_train.isna().sum() / len(df_train)

index               0.0
userId              0.0
movieId             0.0
rating              0.0
timestamp           0.0
high_rating         0.0
movie_avg_rating    0.0
dtype: float64

In [30]:
# Checking nulls

df_test.isna().sum() / len(df_test)

index               0.000000
userId              0.000000
movieId             0.000000
rating              0.000000
timestamp           0.000000
high_rating         0.000000
movie_avg_rating    0.207419
dtype: float64
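
The null check above shows that about 20.7% of test rows refer to movies that have no rating in the training period. One possible way to handle these cold-start rows is to fall back to the global training mean and keep a flag marking the fallback. The sketch below uses made-up data, and the column name `movie_is_new` is hypothetical.

```python
import pandas as pd

# Made-up train and test sets; movie 3 never appears in train
train = pd.DataFrame({'movieId': [1, 1, 2], 'rating': [4.0, 2.0, 5.0]})
test = pd.DataFrame({'movieId': [1, 3]})

# Per-movie average and global average, both computed on train only
movie_avg = (train.groupby('movieId')['rating'].mean()
                  .rename('movie_avg_rating').reset_index())
global_avg = train['rating'].mean()

test = test.merge(movie_avg, how='left', on='movieId')

# Flag movies unseen in train, then fill their average with the global mean
test['movie_is_new'] = test['movie_avg_rating'].isna().astype(int)
test['movie_avg_rating'] = test['movie_avg_rating'].fillna(global_avg)
print(test)
```

Keeping the flag lets a model distinguish a genuinely average movie from one whose average was imputed.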

## Movie's ratings amount 

In [31]:
# Computing movie's ratings counts on training set

ratings_count = df_train.groupby('movieId')['rating'].count().reset_index().rename(columns={'rating': 'movie_ratings_count'})


In [32]:
ratings_count.movie_ratings_count.value_counts()

9        80
13       66
8        62
6        61
10       60
         ..
4705      1
2327      1
11186     1
3702      1
1076      1
Name: movie_ratings_count, Length: 2926, dtype: int64

In [33]:
# Merging with train and test datasets

df_train = df_train.merge(ratings_count, how='left', on= 'movieId')
df_test = df_test.merge(ratings_count, how='left', on= 'movieId')

In [34]:
# Checking nulls

df_train.isna().sum() / len(df_train)

index                  0.0
userId                 0.0
movieId                0.0
rating                 0.0
timestamp              0.0
high_rating            0.0
movie_avg_rating       0.0
movie_ratings_count    0.0
dtype: float64

In [35]:
# Checking nulls

df_test.isna().sum() / len(df_test)

index                  0.000000
userId                 0.000000
movieId                0.000000
rating                 0.000000
timestamp              0.000000
high_rating            0.000000
movie_avg_rating       0.207419
movie_ratings_count    0.207419
dtype: float64

## Genre's rating


In [36]:
movie_df.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [37]:
len(movie_df.genres.unique()), len(movie_df)

(1342, 27278)

In [38]:
# Adding movie information to train and test sets

df_train = df_train.merge(movie_df, how='left', on='movieId')
df_test = df_test.merge(movie_df, how='left', on='movieId')

In [39]:
# Computing the average rating per genre combination (the full 'genres' string) on training set

genre_mean_ratings = df_train.groupby('genres')['rating'].mean().reset_index().rename(columns={'rating': 'genre_avg_rating'})


In [40]:
genre_mean_ratings

Unnamed: 0,genres,genre_avg_rating
0,Action,2.826775
1,Action|Adventure,3.743394
2,Action|Adventure|Animation|Children|Comedy,4.011406
3,Action|Adventure|Animation|Children|Comedy|Fan...,3.008523
4,Action|Adventure|Animation|Children|Comedy|Sci-Fi,3.037480
...,...,...
763,Thriller|War,3.580000
764,Thriller|Western,3.065217
765,War,3.681832
766,War|Western,3.300000


In [41]:
df_train

Unnamed: 0,index,userId,movieId,rating,timestamp,high_rating,movie_avg_rating,movie_ratings_count,title,genres
0,4182421,28507,1176,4.0,1995-01-09 11:46:44,1,3.864444,1350,"Double Life of Veronique, The (Double Vie de V...",Drama|Fantasy|Romance
1,18950979,131160,1079,3.0,1995-01-09 11:46:49,0,3.872628,16440,"Fish Called Wanda, A (1988)",Comedy|Crime
2,18950936,131160,47,5.0,1995-01-09 11:46:49,1,4.021075,31577,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
3,18950930,131160,21,3.0,1995-01-09 11:46:49,0,3.599484,22094,Get Shorty (1995),Comedy|Crime|Thriller
4,12341178,85252,45,3.0,1996-01-29 00:00:00,0,3.385295,8548,To Die For (1995),Comedy|Drama|Thriller
...,...,...,...,...,...,...,...,...,...,...
14000179,6633146,45669,4017,3.5,2007-12-08 01:19:14,0,3.612600,1754,Pollock (2000),Drama
14000180,6632828,45669,1241,4.5,2007-12-08 01:19:28,1,3.657690,1801,Dead Alive (Braindead) (1992),Comedy|Fantasy|Horror
14000181,6633185,45669,4641,3.5,2007-12-08 01:19:55,0,3.791696,4299,Ghost World (2001),Comedy|Drama
14000182,6633284,45669,6620,4.0,2007-12-08 01:20:05,1,3.865844,2553,American Splendor (2003),Comedy|Drama


In [42]:
df_test

Unnamed: 0,index,userId,movieId,rating,timestamp,high_rating,movie_avg_rating,movie_ratings_count,title,genres
0,6633261,45669,6016,2.5,2007-12-08 01:20:38,0,4.261925,4822.0,City of God (Cidade de Deus) (2002),Action|Adventure|Crime|Drama|Thriller
1,6633348,45669,7942,3.0,2007-12-08 01:21:04,0,3.655738,61.0,Summer with Monika (Sommaren med Monika) (1953),Drama|Romance
2,6633515,45669,46723,3.0,2007-12-08 01:21:26,0,3.712575,1336.0,Babel (2006),Drama|Thriller
3,6632788,45669,923,3.5,2007-12-08 01:21:47,0,4.197421,13803.0,Citizen Kane (1941),Drama|Mystery
4,6633042,45669,2918,5.0,2007-12-08 01:22:00,1,3.977402,16395.0,Ferris Bueller's Day Off (1986),Comedy
...,...,...,...,...,...,...,...,...,...,...
6000074,7819902,53930,118706,3.5,2015-03-31 06:00:51,0,,,Black Sea (2014),Adventure|Thriller
6000075,2508834,16978,2093,3.5,2015-03-31 06:03:17,0,2.908915,1806.0,Return to Oz (1985),Adventure|Children|Fantasy
6000076,12898546,89081,55232,3.5,2015-03-31 06:11:26,0,3.100917,109.0,Resident Evil: Extinction (2007),Action|Horror|Sci-Fi|Thriller
6000077,12898527,89081,52458,4.0,2015-03-31 06:11:28,1,3.254667,375.0,Disturbia (2007),Drama|Thriller


In [43]:
# Merging with train and test datasets

df_train = df_train.merge(genre_mean_ratings, how='left', on= 'genres')
df_test = df_test.merge(genre_mean_ratings, how='left', on= 'genres')

In [44]:
# Checking nulls

df_train.isna().sum() / len(df_train)

index                  0.0
userId                 0.0
movieId                0.0
rating                 0.0
timestamp              0.0
high_rating            0.0
movie_avg_rating       0.0
movie_ratings_count    0.0
title                  0.0
genres                 0.0
genre_avg_rating       0.0
dtype: float64

In [45]:
# Checking nulls

df_test.isna().sum() / len(df_test)

index                  0.000000
userId                 0.000000
movieId                0.000000
rating                 0.000000
timestamp              0.000000
high_rating            0.000000
movie_avg_rating       0.207419
movie_ratings_count    0.207419
title                  0.000000
genres                 0.000000
genre_avg_rating       0.038203
dtype: float64
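
Grouping on the full `genres` string treats every combination as its own category, so combinations that first appear in the test period have no training average (about 3.8% of test rows above). A possible alternative is to average ratings per individual genre and then combine those averages per movie. The sketch below shows this on made-up data; the helper `mean_genre_rating` is a hypothetical name.

```python
import pandas as pd

# Made-up ratings with pipe-separated genres, as in movie_df
ratings = pd.DataFrame({
    'movieId': [1, 2, 3],
    'rating':  [4.0, 2.0, 3.0],
    'genres':  ['Comedy|Romance', 'Comedy', 'Romance|Drama'],
})

# One row per (rating, individual genre), then the mean rating per genre
exploded = ratings.assign(genre=ratings['genres'].str.split('|')).explode('genre')
single_genre_avg = exploded.groupby('genre')['rating'].mean()

def mean_genre_rating(genres):
    """Average the single-genre means over the known genres of one movie."""
    values = [single_genre_avg.get(g) for g in genres.split('|')]
    known = [v for v in values if v is not None]
    return sum(known) / len(known) if known else float('nan')

ratings['genre_avg_rating'] = ratings['genres'].apply(mean_genre_rating)
print(ratings)
```

With this variant, an unseen combination of known genres still receives a value; only a movie whose genres are all unseen stays null.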

## Movie's Number of genres

In [46]:
# Copying movie dataset for additional calculations

movies = movie_df.copy()

In [47]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
27273,131254,Kein Bund für's Leben (2007),Comedy
27274,131256,"Feuer, Eis & Dosenbier (2002)",Comedy
27275,131258,The Pirates (2014),Adventure
27276,131260,Rentun Ruusu (2001),(no genres listed)


In [48]:
def genre_count(genres):
    """Return the number of pipe-separated genres, or 0 if none are listed."""
    if genres == "(no genres listed)":
        return 0
    return genres.count('|') + 1

In [49]:
movies['n_of_genres'] = movies.genres.apply(genre_count)

In [50]:
movies['n_of_genres'].value_counts()

1     10583
2      8809
3      5330
4      1724
5       477
0       246
6        83
7        20
8         5
10        1
Name: n_of_genres, dtype: int64

In [51]:
movies

Unnamed: 0,movieId,title,genres,n_of_genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5
1,2,Jumanji (1995),Adventure|Children|Fantasy,3
2,3,Grumpier Old Men (1995),Comedy|Romance,2
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,3
4,5,Father of the Bride Part II (1995),Comedy,1
...,...,...,...,...
27273,131254,Kein Bund für's Leben (2007),Comedy,1
27274,131256,"Feuer, Eis & Dosenbier (2002)",Comedy,1
27275,131258,The Pirates (2014),Adventure,1
27276,131260,Rentun Ruusu (2001),(no genres listed),0


## Movie's Release year


In [52]:
# Extracting the movie's release year: the first four-digit number in parentheses in the title
movies['release_year'] = movies.title.str.extract(r"\((\d{4})\)", expand=False)

In [53]:
# Checking data types

movies.dtypes

movieId          int64
title           object
genres          object
n_of_genres      int64
release_year    object
dtype: object

In [54]:
# Checking nulls

movies.isna().sum() / len(movies)

movieId         0.000000
title           0.000000
genres          0.000000
n_of_genres     0.000000
release_year    0.000807
dtype: float64

In [55]:
# Dropping movies without a release year; the explicit copy avoids a SettingWithCopyWarning below
movies = movies.dropna(subset=['release_year']).copy()

In [56]:
# Converting release year from string to integer

movies['release_year'] = movies['release_year'].astype(int)


In [57]:
movie_features = movies.drop(columns=['title', 'genres'])

In [58]:
# Merging with train and test datasets

df_train = df_train.merge(movie_features, how='left', on='movieId')
df_test = df_test.merge(movie_features, how='left', on='movieId')

del movie_features

## Difference between movie's release year and rating year


In [59]:
# Computing rating year for train and test sets
df_train['rating_year'] = pd.to_datetime(df_train['timestamp']).dt.year
df_test['rating_year'] = pd.to_datetime(df_test['timestamp']).dt.year

In [60]:
df_train.dtypes

index                    int64
userId                   int64
movieId                  int64
rating                 float64
timestamp               object
high_rating              int64
movie_avg_rating       float64
movie_ratings_count      int64
title                   object
genres                  object
genre_avg_rating       float64
n_of_genres              int64
release_year             int64
rating_year              int64
dtype: object

In [61]:
# Computing the years elapsed between the movie's release and the rating

df_train['years_between_release_rating'] = df_train.rating_year - df_train.release_year  
df_test['years_between_release_rating'] = df_test.rating_year - df_test.release_year

## Movie's legacy 

In [62]:
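# Computing product of rating and years between movie's release and user's rating.
# rating_legacy is derived from the target, so it is only an intermediate column
# and must be dropped from df_train before modeling.

df_train['rating_legacy'] = df_train['rating']*df_train['years_between_release_rating'] 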

In [63]:
# Computing legacy's average on training set

movie_mean_legacy = df_train.groupby('movieId')['rating_legacy'].mean().reset_index().rename(columns={'rating_legacy': 'movie_avg_legacy'})


In [64]:
movie_mean_legacy

Unnamed: 0,movieId,movie_avg_legacy
0,1,20.870094
1,2,15.192719
2,3,12.314074
3,4,9.108321
4,5,10.349501
...,...,...
9691,56165,0.000000
9692,56167,0.000000
9693,56169,0.000000
9694,56171,0.000000


In [65]:
# Merging with train and test datasets

df_train = df_train.merge(movie_mean_legacy, how='left', on= 'movieId')
df_test = df_test.merge(movie_mean_legacy, how='left', on= 'movieId')

In [66]:
# Checking nulls

df_train.isna().sum() / len(df_train)

index                           0.0
userId                          0.0
movieId                         0.0
rating                          0.0
timestamp                       0.0
high_rating                     0.0
movie_avg_rating                0.0
movie_ratings_count             0.0
title                           0.0
genres                          0.0
genre_avg_rating                0.0
n_of_genres                     0.0
release_year                    0.0
rating_year                     0.0
years_between_release_rating    0.0
rating_legacy                   0.0
movie_avg_legacy                0.0
dtype: float64

In [67]:
# Checking nulls

df_test.isna().sum() / len(df_test)

index                           0.000000
userId                          0.000000
movieId                         0.000000
rating                          0.000000
timestamp                       0.000000
high_rating                     0.000000
movie_avg_rating                0.207419
movie_ratings_count             0.207419
title                           0.000000
genres                          0.000000
genre_avg_rating                0.038203
n_of_genres                     0.000067
release_year                    0.000067
rating_year                     0.000000
years_between_release_rating    0.000067
movie_avg_legacy                0.207419
dtype: float64

## Tag's average rating 


## Tag's relevance (?)

# Model implementation

Implement a ML model which predicts your response variable using the predictive features
you created. 

Explain the process you followed to generate/choose the model. 

Do not invest too much time training/tuning your model. It will be enough for us if you choose an algorithm and a configuration of hyperparameters you have seen in the past to work well for this type of problem and dataset.

Please, explain and justify your selection of the algorithm and hyperparameters.

# Feature importance

Give an explanation of the importance of each feature, and show us which of the features you created had the highest impact on your model. 

Explain and justify your choice of the importance metric.

Important note: 

Even though your model predicts whether a client will rate as “high” a movie or not, we will not look into your skills building recommendation systems (like collaborative filtering). As we mentioned, we are interested in assessing your feature engineering and modeling skills, using the modeling structure defined above.

# Conclusions

Add some comments summarizing your work. Also, add comments on how you would improve it if further time was given to you.