# Wide-and-Deep ML: Data Preparation

The purpose of this notebook is to prepare the dataset we will use to build the wide-and-deep recommendation model.

## Introduction

For our final semester project, we use **wide-and-deep learning** to build a model that could serve as one piece of a bigger, more complicated recommendation engine. A wide-and-deep model combines the *memorization* capabilities of a linear model with the *generalization* capabilities of a deep neural network, which can allow us to build recommendation systems that predict a wider variety of choices.
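
To make the idea concrete, here is a minimal NumPy sketch of how the two parts could be combined in a single forward pass. It is not the model we build later: the function name `wide_and_deep_forward`, the parameter names, and all shapes are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def wide_and_deep_forward(x_wide, x_deep, params):
    """Return one probability per row of the batch (illustrative only)."""
    # wide part: one linear layer over sparse/cross features (memorization)
    wide_logit = x_wide @ params["w_wide"]
    # deep part: a small ReLU network over dense features (generalization)
    hidden = np.maximum(0.0, x_deep @ params["w1"] + params["b1"])
    deep_logit = hidden @ params["w2"]
    # the two logits are summed, then squashed with a sigmoid
    logit = wide_logit + deep_logit + params["bias"]
    return 1.0 / (1.0 + np.exp(-logit))

# hypothetical sizes: 6 wide features, 4 deep features, 8 hidden units
params = {
    "w_wide": rng.normal(size=6),
    "w1": rng.normal(size=(4, 8)),
    "b1": np.zeros(8),
    "w2": rng.normal(size=8),
    "bias": 0.0,
}
x_wide = rng.integers(0, 2, size=(3, 6)).astype(float)  # binary indicator features
x_deep = rng.normal(size=(3, 4))                        # dense features
probs = wide_and_deep_forward(x_wide, x_deep, params)
print(probs.shape)
```

In a trained model the wide and deep weights would be learned jointly; the sketch only shows how the two logits are added before the final activation.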

In [3]:
# import modules

import numpy as np
import pandas as pd

## 1. Generate a dataset

We imported the *movie* and *rating* tables from the [MovieLens 20M Dataset](https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset?select=movie.csv) on Kaggle. We felt it is simple enough to let us focus on building the model while still leaving enough room for the rather complicated and shifting data relations on the platform. This notebook also aims to generate several additional features that could be useful for collaborative filtering.

### 1.1. Load the data

In [4]:
# preview movie table
movies_df = pd.read_csv('../data/movie.csv')
movies_df.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [5]:
# preview ratings table
ratings_df = pd.read_csv('../data/rating.csv')
ratings_df.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40
5,1,112,3.5,2004-09-10 03:09:00
6,1,151,4.0,2004-09-10 03:08:54
7,1,223,4.0,2005-04-02 23:46:13
8,1,253,4.0,2005-04-02 23:35:40
9,1,260,4.0,2005-04-02 23:33:46


In [6]:
# view shape of tables
print(f"ratings dataframe:\t {ratings_df.shape}")
print(f"movies dataframe:\t {movies_df.shape}")

ratings dataframe:	 (20000263, 4)
movies dataframe:	 (27278, 3)


Previewing the *movie* and *rating* tables gives us a decent idea of what we want the complete dataset to look like. First, notice that the *rating* table is very large, at roughly 20 million rows. That is computationally taxing for our laptops and overkill for a project of this scope, so we work with a subset of the data. We trim the ratings dataset to the ratings from the first 500 unique users, then filter the movies dataset to the movies that appear in the smaller ratings dataset. The result is two reasonably sized datasets that run comfortably on our machines.

In [7]:
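# get the row index where the 501st unique userId first appears
# (position 500 is zero-based, so all rows before it belong to the first 500 users;
# this assumes the ratings are grouped by userId, as the preview above suggests)
kth_Id = ratings_df['userId'].drop_duplicates().index[500]

# select ratings for the first 500 userIds
ratings_df1 = ratings_df.iloc[:kth_Id]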

# get unique movies from trimmed ratings table
ratings_df1_movies = ratings_df1['movieId'].drop_duplicates()

# use table merge to filter movies table
movies_df1 = pd.merge(ratings_df1_movies, movies_df, on='movieId', how='inner')

# view new shape of tables
print(f"ratings dataframe:\t {ratings_df1.shape}")
print(f"movies dataframe:\t {movies_df1.shape}")

ratings dataframe:	 (71554, 4)
movies dataframe:	 (7411, 3)
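
The slice above relies on the ratings being grouped by `userId`. As a hedged alternative, the same trimming can be sketched with `isin`, which does not depend on row order. The helper `trim_to_first_k_users` and the toy tables are hypothetical and only illustrate the technique; the values are made up.

```python
import pandas as pd

# toy stand-ins for ratings_df and movies_df (made-up values)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 3, 4],
    "movieId": [2, 29, 2, 5, 29, 7, 9],
    "rating":  [3.5, 3.5, 4.0, 2.0, 5.0, 3.0, 4.5],
})
toy_movies = pd.DataFrame({
    "movieId": [2, 5, 7, 9, 29, 31],
    "title":   ["A", "B", "C", "D", "E", "F"],
})

def trim_to_first_k_users(ratings, movies, k):
    # first k distinct userIds, in order of first appearance
    keep_users = ratings["userId"].drop_duplicates().head(k)
    # keep only the ratings written by those users
    ratings_small = ratings[ratings["userId"].isin(keep_users)]
    # keep only the movies that still have at least one rating
    movies_small = movies[movies["movieId"].isin(ratings_small["movieId"])]
    return ratings_small, movies_small

ratings_small, movies_small = trim_to_first_k_users(toy_ratings, toy_movies, k=3)
print(ratings_small.shape, movies_small.shape)
```

On the real tables this would be called with `k=500`; whether it is worth the extra pass over 20 million rows is a judgment call.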


We merge the two dataframes and review the result before saving it for feature engineering.

In [8]:
# merge the tables to get the user-item interactions dataset.
export_df = pd.merge(ratings_df1, movies_df1, on='movieId', how='inner')
export_df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,2,3.5,2005-04-02 23:53:47,Jumanji (1995),Adventure|Children|Fantasy
1,5,2,3.0,1996-12-25 15:26:09,Jumanji (1995),Adventure|Children|Fantasy
2,13,2,3.0,1996-11-27 08:19:02,Jumanji (1995),Adventure|Children|Fantasy
3,29,2,3.0,1996-06-23 20:36:14,Jumanji (1995),Adventure|Children|Fantasy
4,34,2,3.0,1996-10-28 13:29:44,Jumanji (1995),Adventure|Children|Fantasy
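
As an illustration of the kind of additional features the merged table could support, the sketch below derives genre indicator columns and a release year from the `genres` and `title` columns. This is an assumption about what later feature engineering might look like, not a step this notebook performs; the `year` column and the regular expression are hypothetical choices.

```python
import pandas as pd

# two rows in the same format as the merged table
toy = pd.DataFrame({
    "title":  ["Jumanji (1995)", "Heat (1995)"],
    "genres": ["Adventure|Children|Fantasy", "Action|Crime|Thriller"],
})

# one 0/1 column per genre, split on the "|" separator
genre_flags = toy["genres"].str.get_dummies(sep="|")

# pull a trailing four-digit year in parentheses out of the title
year_text = toy["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
toy["year"] = pd.to_numeric(year_text)

features = pd.concat([toy, genre_flags], axis=1)
print(features.columns.tolist())
```

Titles without a trailing year would produce missing values in `year`, so a real pipeline would need to decide how to handle them.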


## 2. Save the data

In [9]:
# view column data types
export_df.dtypes

userId         int64
movieId        int64
rating       float64
timestamp     object
title         object
genres        object
dtype: object
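
Note that `timestamp` is stored as `object`, meaning plain strings. If time-based features are needed later, the column could be parsed with `pd.to_datetime`; the sketch below assumes the `YYYY-MM-DD HH:MM:SS` layout seen in the preview, and the `rating_year` column is a hypothetical example.

```python
import pandas as pd

toy = pd.DataFrame({"timestamp": ["2005-04-02 23:53:47", "1996-12-25 15:26:09"]})

# parse the strings into datetime64 values using the format seen in the preview
toy["timestamp"] = pd.to_datetime(toy["timestamp"], format="%Y-%m-%d %H:%M:%S")

# datetime columns expose calendar parts through the .dt accessor
toy["rating_year"] = toy["timestamp"].dt.year
print(toy.dtypes)
```

Saving to CSV turns the datetimes back into text, so the parsing would have to be repeated (or `parse_dates` passed to `pd.read_csv`) when the file is loaded again.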

In [10]:
# save data as csv
export_df.to_csv('../data/user_movie_interaction.csv')
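
By default `to_csv` also writes the dataframe index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below compares the default with `index=False` on a toy frame, using an in-memory buffer in place of a file; whether to drop the index depends on what the downstream notebooks expect.

```python
import io
import pandas as pd

toy = pd.DataFrame({"userId": [1, 5], "movieId": [2, 2], "rating": [3.5, 3.0]})

# default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```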