# MovieLens data engineering

This notebook presents more sophisticated feature engineering for the MovieLens data.

Because the data set is rich in movie attributes but contains neither user attributes nor interaction context, most of the feature engineering is carried out on movies.

## Dataset download

In [2]:
import os
from nvtabular.utils import download_file

MOVIE_LENS_DATA_PATH = './data/ml-25m/'

if not os.path.isdir(MOVIE_LENS_DATA_PATH):
    download_file(
        "http://files.grouplens.org/datasets/movielens/ml-25m.zip", os.path.join(MOVIE_LENS_DATA_PATH, "ml-25m.zip")
    )
!ls $MOVIE_LENS_DATA_PATH

README.txt	   ml-25m		movies_preprocessed_v4.csv  train.csv
genome-scores.csv  ml-25m.zip		ratings.csv
genome-tags.csv    movies.csv		tags.csv
links.csv	   movies_features.csv	test.csv


In [3]:
import cudf as pd
import numpy as np

In [4]:
movies = pd.read_csv(MOVIE_LENS_DATA_PATH+"movies.csv")
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
ratings = pd.read_csv(MOVIE_LENS_DATA_PATH+"ratings.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [6]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [7]:
links = pd.read_csv(MOVIE_LENS_DATA_PATH+"links.csv", dtype={'imdbId':str, 'tmdbId': str, 'movieId': int})
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862
1,2,113497,8844
2,3,113228,15602
3,4,114885,31357
4,5,113041,11862


In [8]:
links.dtypes

movieId     int64
imdbId     object
tmdbId     object
dtype: object

In [9]:
numeric_features = []
cat_features = []

## Feature engineering for movies

### Movies rating statistics

For each movie, we extract the average rating and the number of ratings.

In [10]:
movies_num_ratings = pd.DataFrame(ratings.groupby('movieId').agg({'userId': ['count'],
                                                                  'rating': ['mean']}))

In [11]:
movies_num_ratings.columns = ["_".join(x) for x in movies_num_ratings.columns.ravel()]

In [12]:
movies_num_ratings.reset_index(inplace=True)

In [13]:
movies_num_ratings.head()

Unnamed: 0,movieId,userId_count,rating_mean
0,161012,8,1.125
1,63,2336,3.083904
2,7222,286,2.641608
3,92180,5,3.0
4,145474,5,2.3


In [14]:
movies = movies.merge(movies_num_ratings, on = 'movieId',  suffixes=('_', '_y'))

In [15]:
movies = movies.sort_values(by='movieId')
movies.head()

Unnamed: 0,movieId,title,genres,userId_count,rating_mean
44769,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,57309,3.893708
44770,2,Jumanji (1995),Adventure|Children|Fantasy,24228,3.251527
44771,3,Grumpier Old Men (1995),Comedy|Romance,11804,3.142028
44772,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2523,2.853547
44773,5,Father of the Bride Part II (1995),Comedy,11714,3.058434


In [16]:
numeric_features += ['userId_count', 'rating_mean']

### Extract movie production year

In [17]:
movies['year'] = movies.title.str.extract(r"\((\d{4})\)", expand=True)  # raw string: keep the regex escapes intact
movies.year = pd.to_datetime(movies.year, format='%Y')
movies.year = movies.year.dt.year # As there are some NaN years, resulting type will be float (decimals)

movies['year_cat'] = movies['year']
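
As a quick check of the year pattern, the sketch below applies the same regular expression to plain Python strings with the standard `re` module. The helper name `extract_year` and the second title are hypothetical and only serve the illustration; the notebook itself runs the pattern through the DataFrame string accessor.

```python
import re

# Same pattern as in the cell above, written as a raw string.
YEAR_PATTERN = re.compile(r"\((\d{4})\)")

def extract_year(title):
    """Return the four-digit year in parentheses as an int, or None if absent."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None

print(extract_year("Toy Story (1995)"))
print(extract_year("An Untitled Film"))  # hypothetical title without a year
```

Titles without a parenthesised year yield no match, which is why the `year` column can contain missing values.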

In [18]:
movies.head()

Unnamed: 0,movieId,title,genres,userId_count,rating_mean,year,year_cat
44769,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,57309,3.893708,1995,1995
44770,2,Jumanji (1995),Adventure|Children|Fantasy,24228,3.251527,1995,1995
44771,3,Grumpier Old Men (1995),Comedy|Romance,11804,3.142028,1995,1995
44772,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2523,2.853547,1995,1995
44773,5,Father of the Bride Part II (1995),Comedy,11714,3.058434,1995,1995


In [19]:
numeric_features += ['year']
cat_features += ['year_cat'] # encode year as both numeric and cat?

### Movies genres
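We encode the movie genres as one binary indicator column per genre. A movie can belong to several genres, so a row can contain more than one 1 (multi-hot encoding).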

In [20]:
genre_split = movies['genres'].str.split('|', n=10, expand=True)
genre_split.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
44769,Adventure,Animation,Children,Comedy,Fantasy,,,,,
44770,Adventure,Children,Fantasy,,,,,,,
44771,Comedy,Romance,,,,,,,,
44772,Comedy,Drama,Romance,,,,,,,
44773,Comedy,,,,,,,,,


In [21]:
genres = set()
for i in range(10):
    movies[i] = genre_split[i]
    genres = genres.union(set(genre_split[i].unique().to_pandas()))

In [22]:
genres.discard(None)
genres

{'(no genres listed)',
 'Action',
 'Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'IMAX',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western'}

In [23]:
len(genres)

20

In [24]:
cat_features += ['genres']

In [25]:
for genre in genres:
    movies[genre] = 0
    for i in range(10):
        movies[genre] = movies[genre] + (movies[i]==genre)

In [26]:
for i in range(10):
    movies.drop(i, axis=1, inplace=True)

In [27]:
movies.head()

Unnamed: 0,movieId,title,genres,userId_count,rating_mean,year,year_cat,War,Western,Adventure,...,Documentary,Thriller,IMAX,Crime,Animation,(no genres listed),Romance,Children,Film-Noir,Drama
44769,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,57309,3.893708,1995,1995,0,0,1,...,0,0,0,0,1,0,0,1,0,0
44770,2,Jumanji (1995),Adventure|Children|Fantasy,24228,3.251527,1995,1995,0,0,1,...,0,0,0,0,0,0,0,1,0,0
44771,3,Grumpier Old Men (1995),Comedy|Romance,11804,3.142028,1995,1995,0,0,0,...,0,0,0,0,0,0,1,0,0,0
44772,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2523,2.853547,1995,1995,0,0,0,...,0,0,0,0,0,0,1,0,0,1
44773,5,Father of the Bride Part II (1995),Comedy,11714,3.058434,1995,1995,0,0,0,...,0,0,0,0,0,0,0,0,0,0
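
The genre indicator columns built above can be illustrated on plain strings. The following is a minimal sketch in standard Python, assuming a small, hand-picked subset of the genre vocabulary; the helper `multi_hot` is hypothetical and not part of the notebook.

```python
def multi_hot(genre_string, vocabulary):
    """Return {genre: 0 or 1} for one pipe-separated genre string."""
    present = set(genre_string.split("|"))
    return {genre: int(genre in present) for genre in vocabulary}

# Illustrative subset of the genre vocabulary found above.
vocabulary = ["Adventure", "Children", "Comedy", "Fantasy", "Romance"]

jumanji = multi_hot("Adventure|Children|Fantasy", vocabulary)
grumpier = multi_hot("Comedy|Romance", vocabulary)
print(jumanji)
print(grumpier)
```

Each movie gets a 1 for every genre it lists, so the row sums equal the number of genres of the movie.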


### Content features

We use TF-IDF and SVD to turn the text features (title, tags and genres) into numeric features.

In [28]:
tags = pd.read_csv(MOVIE_LENS_DATA_PATH+"tags.csv")
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [35]:
concated_tags = tags.groupby('movieId')['tag'].collect()
concated_tags

movieId
1         [Owned, imdb top 250, Pixar, Pixar, time trave...
2         [Robin Williams, time travel, fantasy, based o...
3         [funny, best friend, duringcreditsstinger, fis...
4         [based on novel or book, chick flick, divorce,...
5         [aging, baby, confidence, contraception, daugh...
                                ...                        
208813                                         [might like]
208933               [black and white, deal with the devil]
209035    [computer animation, Japan, mass behavior, mas...
209037    [chameleon, computer animation, gluttony, humo...
209063    [black, education, friends schools, independen...
Name: tag, Length: 45251, dtype: list

In [31]:
movies['genres'] = movies['genres'].str.split('|')

In [32]:
movies['genres']

44769    [Adventure, Animation, Children, Comedy, Fantasy]
44770                       [Adventure, Children, Fantasy]
44771                                    [Comedy, Romance]
44772                             [Comedy, Drama, Romance]
44773                                             [Comedy]
                               ...                        
58612                                              [Drama]
58613                                        [Documentary]
58614                                      [Comedy, Drama]
58615                                 [(no genres listed)]
58616                           [Action, Adventure, Drama]
Name: genres, Length: 59047, dtype: list

In [45]:
movies['title'] = movies['title'].str.split(' ')

In [46]:
movies_meta = pd.merge(movies, concated_tags.reset_index(), on='movieId', how='left')
movies_meta

Unnamed: 0,movieId,title,genres,userId_count,rating_mean,year,year_cat,War,Western,Adventure,...,Thriller,IMAX,Crime,Animation,(no genres listed),Romance,Children,Film-Noir,Drama,tag
0,181347,"[Alien, Evolution, (2001)]",[Documentary],2,2.500000,2001,2001,0,0,0,...,0,0,0,0,0,0,0,0,0,[making of]
1,181363,"[Ciacho, (2010)]",[Comedy],5,1.700000,2010,2010,0,0,0,...,0,0,0,0,0,0,0,0,0,"[car chase, gun, hospital, police, prison, pro..."
2,181365,"[Monty, Python:, Parrot, Sketch, Not, Included...",[Comedy],6,3.833333,1989,1989,0,0,0,...,0,0,0,0,0,0,0,0,0,[monty python]
3,181369,"[The, Mole, (2011)]","[Drama, Thriller]",2,3.000000,2011,2011,0,0,0,...,1,0,0,0,0,0,0,0,1,"[airplane, beach, car set on fire, clothes, di..."
4,181371,"[Uwikłanie, (2011)]","[Crime, Thriller]",1,2.500000,2011,2011,0,0,0,...,1,0,1,0,0,0,0,0,0,"[krakow poland, murder investigation, psychoth..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59042,194140,"[Sauvage, (2018)]",[Drama],13,3.500000,2018,2018,0,0,0,...,0,0,0,0,0,0,0,0,1,
59043,194142,"[Comment, tuer, sa, mère, (2018)]",[Comedy],3,2.666667,2018,2018,0,0,0,...,0,0,0,0,0,0,0,0,0,
59044,194146,"[The, Bachelor, (1997)]","[Drama, Romance]",1,3.000000,1997,1997,0,0,0,...,0,0,0,0,0,1,0,0,1,
59045,194148,"[Lethal, Admirer, (2018)]",[Thriller],1,0.500000,2018,2018,0,0,0,...,1,0,0,0,0,0,0,0,0,


In [111]:
movies_meta['tag'].to_pandas().fillna("ccx").apply(list).apply(lambda x: ' '.join(x))

TypeError: sequence item 266: expected str instance, NoneType found

In [99]:
 movies_meta['title'].to_pandas().apply(lambda x: ' '.join(x))+ \
                           movies_meta['genres'].to_pandas().apply(lambda x: ' '.join(x))+ \
                            movies_meta['tag'].to_pandas().fillna(" ").apply(list).apply(lambda x: ''.join(str(x)))

0           Alien Evolution (2001)Documentary['making of']
1        Ciacho (2010)Comedy['car chase', 'gun', 'hospi...
2        Monty Python: Parrot Sketch Not Included (1989...
3        The Mole (2011)Drama Thriller['airplane', 'bea...
4        Uwikłanie (2011)Crime Thriller['krakow poland'...
                               ...                        
59042                             Sauvage (2018)Drama[' ']
59043               Comment tuer sa mère (2018)Comedy[' ']
59044                The Bachelor (1997)Drama Romance[' ']
59045                   Lethal Admirer (2018)Thriller[' ']
59046                      Sleep No More (2018)Horror[' ']
Length: 59047, dtype: object

In [51]:
movies_meta['genres'].to_pandas().apply(lambda x: ' '.join(x))

0           Documentary
1                Comedy
2                Comedy
3        Drama Thriller
4        Crime Thriller
              ...      
59042             Drama
59043            Comedy
59044     Drama Romance
59045          Thriller
59046            Horror
Name: genres, Length: 59047, dtype: object

#### TF-IDF vectorizer
The next step is to turn the text metadata into a TF-IDF matrix.

In [None]:
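from sklearn.feature_extraction.text import TfidfVectorizer

# 'metadata' is not a column yet: join title, genres and tags on the host, skipping missing values.
join_tokens = lambda x: ' '.join(str(t) for t in x if t is not None) if isinstance(x, (list, np.ndarray)) else ''
metadata = (movies_meta['title'].to_pandas().apply(join_tokens) + ' '
            + movies_meta['genres'].to_pandas().apply(join_tokens) + ' '
            + movies_meta['tag'].to_pandas().apply(join_tokens))

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(metadata)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index=movies_meta['movieId'].tolist())
print(tfidf_df.shape)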

In [None]:
tfidf_df.head()

In [None]:
len(tfidf.vocabulary_)
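
To make the weighting concrete, here is a minimal hand-rolled sketch of TF-IDF in standard Python. It assumes whitespace tokenisation, no stop-word removal, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` and L2 row normalisation; these choices are assumptions made for the illustration and are simpler than what the vectorizer above does. The three toy documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute L2-normalised TF-IDF vectors for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for doc in tokenised for tok in set(doc))
    # Smoothed inverse document frequency (assumed form, see lead-in).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocabulary, vectors

docs = ["toy story animation pixar", "jumanji fantasy animation", "comedy romance"]
vocabulary, vectors = tfidf_vectors(docs)
pixar, animation = vocabulary.index("pixar"), vocabulary.index("animation")
# "pixar" occurs in one document and "animation" in two, so "pixar" weighs more.
print(vectors[0][pixar] > vectors[0][animation])
```

Rare terms receive larger weights than common ones, which is the property that makes the resulting vectors useful as content features.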

In [None]:
!mkdir -p ./assets

We save the TF-IDF object for subsequent transform of new movies.

In [None]:
import pickle

with open('./assets/movie-lens-tf-idf.pkl', 'wb') as f:
    pickle.dump({"tfidf": tfidf}, f)

#### TF-IDF Transform new movie
Note: for a new test movie, its TF-IDF representation can be computed as below:

In [None]:
new_movie_feat = tfidf.transform(['cartoon'])

In [None]:
new_movie_feat

#### SVD on TF-IDF matrix
The TF-IDF matrix is numeric, but it is very high-dimensional and sparse. We use truncated SVD to approximate it using only the singular vectors with the largest singular values.

In [None]:
NUM_LATENT_FACTOR = 128

# Compress with SVD
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=NUM_LATENT_FACTOR)
latent_matrix = svd.fit_transform(tfidf_df)
 
# Cumulative explained variance, plotted below to choose the number of latent dimensions
explained = svd.explained_variance_ratio_.cumsum()


In [None]:
import matplotlib.pyplot as plt
plt.plot(explained, '.-', ms = 16, color='red')
plt.xlabel('Singular value components', fontsize= 12)
plt.ylabel('Cumulative percent of variance', fontsize=12)        
plt.show()

In [None]:
latent_matrix_1_df = pd.DataFrame(latent_matrix,
                                  index=movies_meta['movieId'].tolist())

In [None]:
latent_matrix_1_df.head()

In [None]:
latent_matrix_1_df.columns = ['SVD_latent_%d'%i for i in range(NUM_LATENT_FACTOR)]
latent_matrix_1_df.columns

In [None]:
latent_matrix_1_df['movieId'] = latent_matrix_1_df.index

In [None]:
latent_matrix_1_df.head()

In [None]:
movies = movies.merge(latent_matrix_1_df, on='movieId', how='left')

In [None]:
movies.head()

In [None]:
numeric_features += ['SVD_latent_%d'%i for i in range(NUM_LATENT_FACTOR)]

#### SVD transform new movie

Note: SVD transform of a new movie (with corresponding TF-IDF features) can be done as follows:

In [None]:
latent_matrix = svd.transform(new_movie_feat)

In [None]:
latent_matrix.shape
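
The projection of a new row can be sketched with an exact SVD in NumPy. This is an illustration under assumptions: it uses `np.linalg.svd` on a small random matrix rather than the randomized solver of the decomposition object above, and `X`, `k` and `new_row` are made-up stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))  # stand-in for a small TF-IDF matrix (6 movies, 4 terms)
k = 2                   # number of latent factors to keep

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:k]        # top-k right singular vectors, shape (k, n_terms)
latent = U[:, :k] * s[:k]  # latent features of the training rows

# Any row, seen or new, is projected by multiplying with the top-k right singular vectors.
projected = X @ components.T
print(np.allclose(latent, projected))

new_row = rng.random((1, 4))  # hypothetical TF-IDF vector of a new movie
print((new_row @ components.T).shape)
```

Because the projection only needs the stored singular vectors, a new movie can be mapped into the latent space without refitting the decomposition.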

We save the SVD object for subsequent transform of new movies.

In [None]:
with open('./assets/svd.pkl', 'wb') as f:
    pickle.dump({"svd": svd}, f)

### Popularity score

We compute the [popularity score](https://www.kaggle.com/rounakbanik/movie-recommender-systems) for each movie within the set of all movies as well as within each genre as follows:

$$\mathrm{WR} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where,

    v is the number of votes for the movie
    m is the minimum votes required to be listed in the chart
    R is the average rating of the movie
    C is the mean vote across the whole set


In [None]:
C = ratings['rating'].mean()
C

The next step is to determine an appropriate value for m, the minimum number of votes required to be listed in the chart. We use the 95th percentile as the cutoff: to feature in the chart, a movie must have at least as many votes as 95% of the movies in the list.

In [None]:
m = movies_num_ratings['userId_count'].quantile(0.95)
m



Therefore, to qualify for the chart, a movie must have at least 1503 ratings in the MovieLens data. The average rating across the data set is 3.53 on a scale of 5, and 2953 movies qualify for our chart.


In [None]:
qualified = movies[movies['userId_count'] >= m]

In [None]:
qualified.shape

In [None]:
def weighted_rating(x):
    v = x['userId_count']
    R = x['rating_mean']
    return (v/(v+m) * R) + (m/(m+v) * C)
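
The effect of the formula is easier to see on scalar inputs. The sketch below uses a hypothetical helper, `weighted_rating_scalar`, together with the values of m and C quoted in the text above; the vote counts and ratings passed to it are made up.

```python
def weighted_rating_scalar(v, R, m, C):
    """Shrink a movie's mean rating R toward the global mean C based on its vote count v."""
    return v / (v + m) * R + m / (v + m) * C

m, C = 1503, 3.53  # cutoff and global mean quoted in the text above

few_votes = weighted_rating_scalar(v=10, R=5.0, m=m, C=C)       # stays close to C
many_votes = weighted_rating_scalar(v=50000, R=4.0, m=m, C=C)   # stays close to R
print(round(few_votes, 3), round(many_votes, 3))
```

A movie with few votes is pulled strongly toward the global mean, while a movie with many votes keeps a score close to its own average. With exactly m votes, the score is the midpoint of R and C.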

In [None]:
qualified.loc[:, 'wr'] = qualified.apply(weighted_rating, axis=1)

In [None]:
qualified = qualified.sort_values('wr', ascending=False).head(250)

In [None]:
qualified.head(5)

In [None]:
cat_features += [g for g in genres]

#### Popularity score per genre

In [None]:
def build_chart(genre, percentile=0.85):
    df = movies[movies[genre] == 1] # keep only the movies of this genre
    
    C = ratings['rating'].mean()
    m = movies_num_ratings['userId_count'].quantile(percentile)
    
    qualified = df[df['userId_count'] >= m].copy()
    # Apply the formula with the local m and C; weighted_rating() reads the globals instead
    v, R = qualified['userId_count'], qualified['rating_mean']
    qualified['%s_wr'%genre] = (v/(v+m) * R) + (m/(m+v) * C)
    
    qualified = qualified.sort_values('%s_wr'%genre, ascending=False).head(250)
    qualified.reset_index(inplace=True, drop=True)
    qualified['%s_rank'%genre] = qualified.index
    
    return qualified

In [None]:
from tqdm import tqdm

for genre in tqdm(genres):
    chart = build_chart(genre)
    movies = movies.merge(chart[["movieId", "%s_wr"%genre, "%s_rank"%genre]], on='movieId', how='left')

In [None]:
movies.fillna(0, inplace=True)

In [None]:
movies.head()

In [None]:
movies.shape

In [None]:
numeric_features += ["%s_wr"%g for g in genres]
numeric_features += ["%s_rank"%g for g in genres]

In [None]:
with open('./assets/features.pkl', 'wb') as f:
    pickle.dump({"numeric_features": numeric_features, 'cat_features': cat_features }, f)


In [None]:
print(len(cat_features))
print(len(numeric_features))

### Merging with IMDB ID

In [None]:
movies = movies.merge(links, on='movieId', how='left')

In [None]:
movies.head()

In [None]:
movies.to_csv(MOVIE_LENS_DATA_PATH+"movies_features.csv", index=False)

In [None]:
movies.shape

# Divide into train and test

For each user, we hold out the most recently rated movie for testing and use all remaining examples for training.

In [None]:
ratings.tail()

## Forming binary dataset

To train a DLRM network, we need binary labels. We use a simple strategy: ratings range from 0.5 to 5 in half-star steps, and we assign label 1 to ratings above 3 (3.5 and higher) and label 0 to ratings of 3 or lower.

In [None]:
ratings['label'] = 0
ratings.loc[ratings['rating']>3,'label'] = 1
ratings.loc[ratings['rating']<=3,'label'] = 0


In [None]:
MIN_RATINGS = 20
USER_COLUMN = 'userId'
ITEM_COLUMN = 'movieId'

grouped = ratings.groupby(USER_COLUMN)

In [None]:
# Sort by time so that the last row of each user group is the most recent rating
df = ratings.sort_values(by='timestamp')

# Keep only the user, item and label columns
df = df.drop(columns=['rating', 'timestamp'])
df = df.drop_duplicates() # assuming it keeps order

# With the data sorted by time, split the last interaction of each user out as test data
grouped_sorted = df.groupby(USER_COLUMN, group_keys=False)
test_data = grouped_sorted.tail(1).sort_values(by=USER_COLUMN)
# The remaining rows of each group form the training data
train_data = grouped_sorted.apply(lambda x: x.iloc[:-1])
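
The leave-last-out split can be checked on a handful of rows. The following is a minimal sketch in standard Python; the helper `leave_last_out` is hypothetical and the `(userId, movieId, timestamp)` values are made up for the illustration.

```python
from collections import defaultdict

# (userId, movieId, timestamp) rows; the values are made up for illustration.
interactions = [
    (1, 296, 100), (1, 306, 300), (1, 307, 200),
    (2, 665, 150), (2, 899, 120),
]

def leave_last_out(rows):
    """Hold out each user's most recent interaction for testing."""
    by_user = defaultdict(list)
    for user, item, ts in sorted(rows, key=lambda r: r[2]):  # oldest first
        by_user[user].append(item)
    train = {user: items[:-1] for user, items in by_user.items()}
    test = {user: items[-1] for user, items in by_user.items()}
    return train, test

train, test = leave_last_out(interactions)
print(train)
print(test)
```

Sorting by timestamp before grouping is what guarantees that the held-out item is the most recent one for each user.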

In [None]:
train_data.head()

In [None]:
test_data.head()