# Wide-and-Deep ML: Feature Engineering

In this notebook, we will engineer the features we will use to build the wide-and-deep collaborative filter recommender.

In [1]:
# import modules

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## 1. Load the data

In [2]:
# get data
df0 = pd.read_csv('../data/user_movie_interaction.csv')
df0.head()

Unnamed: 0.1,Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,0,1,2,3.5,2005-04-02 23:53:47,Jumanji (1995),Adventure|Children|Fantasy
1,1,5,2,3.0,1996-12-25 15:26:09,Jumanji (1995),Adventure|Children|Fantasy
2,2,13,2,3.0,1996-11-27 08:19:02,Jumanji (1995),Adventure|Children|Fantasy
3,3,29,2,3.0,1996-06-23 20:36:14,Jumanji (1995),Adventure|Children|Fantasy
4,4,34,2,3.0,1996-10-28 13:29:44,Jumanji (1995),Adventure|Children|Fantasy


In [3]:
df0.shape

(71554, 7)

## 2. Data wrangling

First thing we notice is the unnecessary 'Unnamed: 0' column. This data might just be 'residue' from the saving and loading transformations as `.csv`, so we get rid of it. We then preview the dataset's datatypes as they might prove useful for subsequent feature engineering steps and creating the model. For example, we see that the 'timestamp', 'title', and 'genres' columns are of the `object` data type, but from experience, we can infer that they are not the same object. 'genre' and 'title' look like `pandas` Strings (technically a little different from python Strings), and the timestamp is a pandas Timestamp. These data types are subject to different types of transformations.

One idea that comes to mind is to use one-hot encoding to be able to use richer contextual data about the time for the collaboration filter such as the specific year the movie came out and how that relates to each of the users, the time of day they 'interacted' with the movies, the number of days since a user watched their first movie in the dataset, the number of days between *that* and every subsequent movies, what proportion of movies in the dataset were watched within those periods, and what other users were watching similar/different movies within the period. We can aggregate this data for all the users as an important feature engineering step to give us temporal relations for all the user-movie interactions.

However, we observed that for this dataset, the timestamps looked like the times the ratings were done. After one-hot encoding the column and grouping the 'time_of_day' into 15-minute periods that was supposed to represent when a user watched a movie down to the quarterly hour, we noticed that the users had interacted with multiple movies within the same 15-munite period and decided that this was not going to be very useful for the predictions that we wanted to make regarding time.

In [4]:
# remove unwanted column
del df0['Unnamed: 0']
df0.dtypes

userId         int64
movieId        int64
rating       float64
timestamp     object
title         object
genres        object
dtype: object

In [5]:
# sort table primarily by `userId` then by `timestamp`, both in ascending order
df1 = df0.sort_values(by=['userId', 'timestamp'])
display(df1)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
2433,1,924,3.5,2004-09-10 03:06:38,2001: A Space Odyssey (1968),Adventure|Drama|Sci-Fi
2349,1,919,3.5,2004-09-10 03:07:01,"Wizard of Oz, The (1939)",Adventure|Children|Fantasy|Musical
6081,1,2683,3.5,2004-09-10 03:07:30,Austin Powers: The Spy Who Shagged Me (1999),Action|Adventure|Comedy
5253,1,1584,3.5,2004-09-10 03:07:36,Contact (1997),Drama|Sci-Fi
2634,1,1079,4.0,2004-09-10 03:07:45,"Fish Called Wanda, A (1988)",Comedy|Crime
...,...,...,...,...,...,...
3688,500,1200,4.0,2012-05-16 15:34:34,Aliens (1986),Action|Adventure|Horror|Sci-Fi
47936,500,162,4.0,2012-05-16 15:35:08,Crumb (1994),Documentary
63087,500,3095,4.0,2012-05-16 15:35:14,"Grapes of Wrath, The (1940)",Drama
4829,500,1291,4.0,2012-05-16 15:35:20,Indiana Jones and the Last Crusade (1989),Action|Adventure


In [6]:
# timestamp is not useful information
del df1['timestamp']

Now, the preview also shows us that we could scale the 'ratings' to allow for more efficient modeling. We simply scale down the ratings by dividing by 5 - the maximum rating a movie could get - but we recognize that we can also use other methods like *z-score* and *mean normalizarion*.

We proceed to find the average rating of each movie as we feel like this might be more representative of the overall user-movie interactions. Specifically, it can be used to discover patterns about how each user rates a specific movie compared to how everyone in the dataset who watched it felt about it. We must caution that this is not at all meant to be an inherently objective process, and a few other statistical measures could probably be used along with testing on new data to see which allows us to create better models with minimum *bias* and *variance*.

In [7]:
# scale rating column
df0['rating'] = df0['rating'].apply(lambda x: x/5.0)

In [8]:
# find every movie's average rating
# then rename the `rating` column
df2 = df1.groupby('movieId')['rating'].mean().reset_index()
df2 = df2.rename(columns={'rating': 'avg_movie_rating'})
df2.head()

Unnamed: 0,movieId,avg_movie_rating
0,1,3.996988
1,2,3.217949
2,3,3.277778
3,4,3.0
4,5,3.188889


An important feature that could prove powerful for the model is the genres. Each movie is associated with one or more genres. We can find the genres associated with each user and try to uncover patterns that show how this plays into the movies they watch as well as find associations between users. This process would allow us to create a powerful recommender model that uses deep learning on the text embeddings to determine the predictions, but more about this on [model preparation](./model_preparation).

To facilitate that step, we need to standardize the text data in the dataset. To do this, we remove the symbols and convert all the genre examples to lower case, and then we aggregate all the genres related with each user. After this step, each row has the user and movie primary features represented by ids, how the user rates the movie, the movie's average rating, its associated genres, and all the genres associated with the user according to all the movies they've interacted with. The target column is the movie's title.

In [9]:
# standardize `genres` column by removing vertical slash
# and making every letter lowercase
df1['genres'] = df1['genres'].apply(lambda x: x.replace('|', ' ').lower())
df1.head()

Unnamed: 0,userId,movieId,rating,title,genres
2433,1,924,3.5,2001: A Space Odyssey (1968),adventure drama sci-fi
2349,1,919,3.5,"Wizard of Oz, The (1939)",adventure children fantasy musical
6081,1,2683,3.5,Austin Powers: The Spy Who Shagged Me (1999),action adventure comedy
5253,1,1584,3.5,Contact (1997),drama sci-fi
2634,1,1079,4.0,"Fish Called Wanda, A (1988)",comedy crime


In [11]:
# find all the genres associated with each user
# then rename user genre
df3 = df1.groupby('userId')['genres'].agg(
    lambda x: ' '.join(
        list(set(' '.join(x).split()))
    )).reset_index()

df3 = df3.rename(columns={'genres': 'user_all_genres'})
df3.head()

Unnamed: 0,userId,user_all_genres
0,1,children war imax crime romance adventure horr...
1,2,children war imax crime romance adventure horr...
2,3,war documentary children crime romance adventu...
3,4,war children crime romance adventure sci-fi mu...
4,5,children war imax crime romance adventure horr...


In [12]:
# merge the dataframes
df4 = pd.merge(df1, df2, on='movieId', how='inner')
df4 = pd.merge(df4, df3, on='userId', how='inner')
df4.head()

Unnamed: 0,userId,movieId,rating,title,genres,avg_movie_rating,user_all_genres
0,1,924,0.7,2001: A Space Odyssey (1968),adventure drama sci-fi,0.782022,children war imax crime romance adventure horr...
1,1,919,0.7,"Wizard of Oz, The (1939)",adventure children fantasy musical,0.789286,children war imax crime romance adventure horr...
2,1,2683,0.7,Austin Powers: The Spy Who Shagged Me (1999),action adventure comedy,0.620482,children war imax crime romance adventure horr...
3,1,1584,0.7,Contact (1997),drama sci-fi,0.723944,children war imax crime romance adventure horr...
4,1,1079,0.8,"Fish Called Wanda, A (1988)",comedy crime,0.76,children war imax crime romance adventure horr...


A crucial step in reviewing the data is to make sure that there no missing values since they can impact the performance and predictive capacity of the model.

In [13]:
# find and drop any missing values
df4.isnull().sum()
# df4.dropna()

userId              0
movieId             0
rating              0
title               0
genres              0
avg_movie_rating    0
user_all_genres     0
dtype: int64

In [14]:
# get user profile data
user_profile = df4[['userId', 'rating', 'user_all_genres']]
user_profile = user_profile.drop_duplicates(subset=['userId'], keep='first')
user_profile

Unnamed: 0,userId,rating,user_all_genres
0,1,0.7,children war imax crime romance adventure horr...
175,2,1.0,children war imax crime romance adventure horr...
236,3,1.0,war documentary children crime romance adventu...
423,7,0.6,children war imax crime romance adventure horr...
699,14,0.7,children war imax crime romance adventure horr...
...,...,...,...
71430,210,0.8,crime romance film-noir horror adventure sci-f...
71454,293,0.6,children crime romance adventure horror sci-fi...
71477,123,0.6,children crime romance adventure sci-fi action...
71499,228,0.6,children war imax crime romance adventure musi...


In [15]:
# get similar profile about movies
movie_profile = df4.loc[:, ~df4.columns.isin(user_profile.columns)]
movie_profile = movie_profile.drop_duplicates(subset=['movieId'], keep='first')
del movie_profile['title']
movie_profile

Unnamed: 0,movieId,genres,avg_movie_rating
0,924,adventure drama sci-fi,0.782022
1,919,adventure children fantasy musical,0.789286
2,2683,action adventure comedy,0.620482
3,1584,drama sci-fi,0.723944
4,1079,comedy crime,0.760000
...,...,...,...
71181,71525,comedy,0.800000
71253,847,comedy drama,0.600000
71254,855,drama,0.600000
71387,26564,drama musical,0.800000


## 3. Save the data

The data is ready to be used for modeling. We split it into training, validation, and testing subsets, all of which will be critical in evaluating the performance of our model.

In [16]:
# save user and movie profiles
user_profile.to_csv('../data/user_profile.csv')
movie_profile.to_csv('../data/movie_profile.csv')

In [38]:
# split dataset into training and testing subsets
train, test = train_test_split(df4, test_size=.2)
train, val = train_test_split(train, test_size=.2)

# preview shape of datasets
print(f"{len(train)} train examples")
print(f"{len(val)} validation examples")
print(f"{len(test)} test examples")

# save train, test, validation samples to csv
train.to_csv('../data/user_movie_interaction_train.csv')
val.to_csv('../data/user_movie_interaction_val.csv')
test.to_csv('../data/user_movie_interaction_test.csv')

45794 train examples
11449 validation examples
14311 test examples
