## Feature Engineering
We will extract and combine all the relevant features into a single dataset, which will serve as the main dataset for training our model.

Summary of our approach:
1. The release year embedded in the title column values can be useful, so we will extract it into a new year column.
2. We will one-hot encode the genres column in df_movies.
3. We will aggregate the ratings in df_ratings per user to create a user profile.
4. We will aggregate the ratings in df_ratings per movie to create a movie profile.
5. We will handle the missing values.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
## Forward slashes work on every OS and avoid invalid escape sequences such as '\m' and '\l'
df_movies=pd.read_csv('ml-latest-small/movies.csv')
df_links=pd.read_csv('ml-latest-small/links.csv')
df_ratings=pd.read_csv('ml-latest-small/ratings.csv')
df_tags=pd.read_csv('ml-latest-small/tags.csv')

In [3]:
df_movies.head(1)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [4]:
df_ratings.head(1)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703


In [5]:
df_tags.head(1)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994


In [6]:
import re

Step 1:

In [7]:
df_movies['year']=df_movies['title'].str.extract(r'\((\d{4})\)',expand=False)
df_movies['title']=df_movies['title'].str.replace(r'\((\d{4})\)','',regex=True).str.strip()
df_movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
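To make the behavior of the two regex calls easier to see, here is a minimal sketch on a few made-up titles (the `titles` Series is an assumption for illustration, not part of the dataset). It shows that a title without a `(YYYY)` suffix yields a missing year, which is why the year column can end up with null values.

```python
import pandas as pd

# Toy titles, made up for illustration; the last one carries no year.
titles = pd.Series(["Toy Story (1995)", "Heat (1995)", "Babylon 5"])

# Capture the four digits inside parentheses; rows without a match become NaN.
years = titles.str.extract(r"\((\d{4})\)", expand=False)

# Remove the "(YYYY)" part and trim the leftover whitespace.
clean_titles = titles.str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

print(years.tolist())
print(clean_titles.tolist())
```

Note that the extracted years are strings, not integers, so a type conversion is still needed later.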


Step 2:

In [8]:
## One hot encoding of genres
## str.get_dummies splits each string on '|' and creates one 0/1 column per genre
genres=df_movies['genres'].str.get_dummies(sep='|')

In [9]:
genres.head(2)

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
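As a small illustration of the encoding, the sketch below applies `Series.str.get_dummies` to a toy frame (the `toy` data is an assumption for illustration). Each pipe-separated genre becomes its own 0/1 column, and the columns come out in sorted order.

```python
import pandas as pd

# Toy genres column, made up for illustration.
toy = pd.DataFrame(
    {"genres": ["Adventure|Comedy", "Comedy|Romance", "(no genres listed)"]}
)

# Split on '|' and build one indicator column per distinct genre.
genre_flags = toy["genres"].str.get_dummies(sep="|")

print(genre_flags)
```

The placeholder "(no genres listed)" is treated like any other genre, so it gets its own indicator column, just as in the output above.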


In [10]:
## Combining the genres dataframe with the original df_movies
df_movies=pd.concat([df_movies,genres],axis=1)

## Dropping the 'genres' column
df_movies.drop('genres',inplace=True,axis=1)

df_movies.head()

Unnamed: 0,movieId,title,year,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story,1995,0,0,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji,1995,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men,1995,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,1995,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II,1995,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


Step 3:

In [11]:
## Aggregating the user ratings
user_profiles=df_ratings.groupby('userId').agg(
    avg_rating=('rating','mean'),
    rating_count=('rating','count')
).reset_index()

In [12]:
user_profiles.head()

Unnamed: 0,userId,avg_rating,rating_count
0,1,4.366379,232
1,2,3.948276,29
2,3,2.435897,39
3,4,3.555556,216
4,5,3.636364,44
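The named aggregation used for the user profile can be checked by hand on a tiny example. The sketch below uses made-up ratings (the `ratings` frame is an assumption for illustration) and the same column names as above.

```python
import pandas as pd

# Toy ratings, made up for illustration: user 1 rated two movies, user 2 rated three.
ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2, 2],
        "movieId": [10, 20, 10, 30, 40],
        "rating": [4.0, 5.0, 3.0, 2.0, 4.0],
    }
)

# One row per user: mean rating and number of ratings given.
profiles = (
    ratings.groupby("userId")
    .agg(avg_rating=("rating", "mean"), rating_count=("rating", "count"))
    .reset_index()
)

print(profiles)
```

Here user 1 averages 4.5 over 2 ratings and user 2 averages 3.0 over 3 ratings. Keep in mind that `count` skips missing ratings, which makes no difference here because df_ratings has no nulls.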


Step 4:

In [13]:
movie_profiles=df_ratings.groupby('movieId').agg(
    avg_movie_rating=('rating','mean'),
    num_of_ratings=('rating','count')
).reset_index()

## Merging the movie_profiles with df_movies
df_movies=df_movies.merge(movie_profiles, on='movieId',how='left')

In [14]:
df_movies.head(2)

Unnamed: 0,movieId,title,year,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,...,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,avg_movie_rating,num_of_ratings
0,1,Toy Story,1995,0,0,1,1,1,1,0,...,0,0,0,0,0,0,0,0,3.92093,215.0
1,2,Jumanji,1995,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,3.431818,110.0
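The left merge keeps every movie, including those that nobody rated. A minimal sketch on toy data (the `movies` and `ratings` frames are assumptions for illustration) shows the two side effects visible above: unrated movies get NaN in both profile columns, and num_of_ratings becomes a float column (hence values like 215.0).

```python
import pandas as pd

# Toy data, made up for illustration: movie 3 has no ratings.
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"movieId": [1, 1, 2], "rating": [4.0, 3.0, 5.0]})

movie_profiles = (
    ratings.groupby("movieId")
    .agg(avg_movie_rating=("rating", "mean"), num_of_ratings=("rating", "count"))
    .reset_index()
)

# how='left' keeps all rows of movies; missing matches are filled with NaN.
merged = movies.merge(movie_profiles, on="movieId", how="left")

print(merged)
print(merged.dtypes)
```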


In [15]:
type(df_movies['year'][0])

str

Note: After careful consideration, we are not using df_tags for our model because tags exist for only about 10% of the movies in the dataset. With roughly 90% of the movies untagged, engineering tag-based features would take considerable effort for very little coverage.
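A coverage figure like the one in the note can be computed with `isin`. The sketch below is only an illustration on made-up data (the `movies` and `tags` frames are assumptions); on the real data the same expression would use df_movies and df_tags.

```python
import pandas as pd

# Toy data, made up for illustration: 2 of the 4 movies have at least one tag.
movies = pd.DataFrame({"movieId": [1, 2, 3, 4]})
tags = pd.DataFrame({"movieId": [1, 1, 3], "tag": ["funny", "pixar", "dark"]})

# True for every movie that appears at least once in the tags table.
tagged = movies["movieId"].isin(tags["movieId"])

# The mean of a boolean Series is the fraction of True values.
coverage = tagged.mean()

print(f"{coverage:.0%} of movies have at least one tag")
```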

Step 5:

In [16]:
df_movies.isnull().sum()

movieId                0
title                  0
year                  13
(no genres listed)     0
Action                 0
Adventure              0
Animation              0
Children               0
Comedy                 0
Crime                  0
Documentary            0
Drama                  0
Fantasy                0
Film-Noir              0
Horror                 0
IMAX                   0
Musical                0
Mystery                0
Romance                0
Sci-Fi                 0
Thriller               0
War                    0
Western                0
avg_movie_rating      18
num_of_ratings        18
dtype: int64

In [17]:
df_ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

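df_movies has missing values: 13 in year, and 18 each in avg_movie_rating and num_of_ratings (movies that have no ratings in df_ratings, so the left merge filled them with NaN). df_ratings has none.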
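Before deciding how to handle the gaps, it can help to look at the affected rows rather than only the counts. A minimal sketch on a toy frame (the `toy` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration, with one missing year and one missing rating.
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "year": ["1995", np.nan, "2002"],
        "avg_movie_rating": [3.5, 4.0, np.nan],
    }
)

# Null count per column, as in the cells above.
missing_counts = toy.isnull().sum()

# Boolean mask to pull out the rows whose year is missing.
rows_missing_year = toy[toy["year"].isna()]

print(missing_counts)
print(rows_missing_year)
```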

In [18]:
year_mode=df_movies['year'].mode()[0]

In [19]:
year_mode

'2002'

In [20]:
## Assign the result back instead of calling fillna(inplace=True) on a column selection,
## which triggers a pandas FutureWarning and will stop working in pandas 3.0
df_movies['year']=df_movies['year'].fillna(year_mode)


In [21]:
df_movies.isnull().sum()

movieId                0
title                  0
year                   0
(no genres listed)     0
Action                 0
Adventure              0
Animation              0
Children               0
Comedy                 0
Crime                  0
Documentary            0
Drama                  0
Fantasy                0
Film-Noir              0
Horror                 0
IMAX                   0
Musical                0
Mystery                0
Romance                0
Sci-Fi                 0
Thriller               0
War                    0
Western                0
avg_movie_rating      18
num_of_ratings        18
dtype: int64

In [22]:
df_movies.dropna(subset=['avg_movie_rating'],inplace=True)

In [23]:
## Converting the year column values from str to int
df_movies['year']=df_movies['year'].astype(int)
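The imputation and the type conversion can be sketched end to end on a toy Series (the `years` values are an assumption for illustration). The conversion to int only works once every missing value has been filled, which is why it comes after the fillna step.

```python
import pandas as pd

# Toy year column, made up for illustration: strings with one missing value.
years = pd.Series(["1995", "2002", None, "2002"])

# mode() ignores missing values and returns a Series; take its first entry.
year_mode = years.mode()[0]

# Assign the filled result to a new Series instead of filling in place.
filled = years.fillna(year_mode)

# Safe now that no missing values remain.
as_int = filled.astype(int)

print(year_mode)
print(as_int.tolist())
```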

In [24]:
df_movies.reset_index(drop=True, inplace=True)

In [25]:
df_ratings.reset_index(drop=True, inplace=True)

In [26]:
user_profiles.reset_index(drop=True, inplace=True)

In [27]:
df_movies.to_csv('dataset/movies.csv',index=False,header=True)

In [28]:
df_ratings.to_csv('dataset/ratings.csv',index=False,header=True)

In [29]:
user_profiles.to_csv('dataset/userprofile.csv',index=False,header=True)
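The three `to_csv` calls assume that the dataset folder already exists; pandas raises an error when asked to write into a missing directory. A minimal sketch of creating the folder first, using a temporary directory and a toy frame (both are assumptions for illustration, so nothing is written into the project):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame, made up for illustration.
toy = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder if it is not there yet.
    out_dir = Path(tmp) / "dataset"
    out_dir.mkdir(parents=True, exist_ok=True)

    # Write without the index, then read back to confirm the round trip.
    out_path = out_dir / "movies.csv"
    toy.to_csv(out_path, index=False)
    round_trip = pd.read_csv(out_path)

print(round_trip)
```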