# EDA and a Recommendation System Based on the MovieLens Dataset


### Contents
<br>
1. Libraries <br>
2. Loading and preprocessing of data <br>  
3. Exploratory Data Analysis <br>
    <li> 3.1. Rating Data </li>
    <li> 3.2. User Data </li>
    <li> 3.3. Movie Data </li>
4. Collaborative Filtering and Recommendation System <br>
    <li> 4.1. Building the recommendation system </li>
    <li> 4.2. User based collaborative filtering </li>
    <li> 4.3. Item based collaborative filtering </li>
5. Content Based Filtering <br>


## 1. Libraries  


In [6]:
#importing libraries for exploratory data analysis
import pandas as pd
import numpy as np
import random
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

#importing libraries for collaborative filtering
from sklearn.metrics import mean_squared_error, pairwise

## 2. Loading and Preprocessing Data

The data comes from GroupLens (https://grouplens.org/datasets/movielens/). We use the MovieLens 100K Dataset (https://grouplens.org/datasets/movielens/100k/), specifically three of its files: u.data (ratings), u.item (movies) and u.user (users).

In [7]:
data_dir = '../data/'

In [8]:
# loading data

item = pd.read_csv("/content/u.item", sep="|",encoding="latin-1",
                      names=["movie_id", "movie_title", "release_date", "video_release_date",
                             "imbd_url", "unknown", "action", "adventure", "animation",
                             "childrens", "comedy", "crime", "documentary", "drama", "fantasy",
                             "film_noir", "horror", "musical", "mystery", "romance",
                             "sci-fi", "thriller", "war", "western"])

rating = pd.read_csv("/content/u.data", sep="\t", names=["user_id", "item_id", "rating", "timestamp"])

user = pd.read_csv("/content/u.user", sep="|", encoding="latin-1", names=["user_id", "age", "gender",
                                                                            "occupation", "zip_code"])

## 3. Exploratory Data Analysis



### 3.1 Rating data

In [9]:
# Age group dictionary as listed in the MovieLens 1M dataset description;
# the 100K u.user file stores raw ages instead (see user.head() in 3.2)
age_group_dict = {
    1:  "Under 18",
    18:  "18-24",
    25:  "25-34",
    35:  "35-44",
    45:  "45-49",
    50:  "50-55",
    56:  "56+"
}
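
The keys of `age_group_dict` are the lower bounds of each age bracket, while the `age` column shown in section 3.2 holds raw ages. A minimal sketch of how raw ages could be mapped onto these brackets with `pd.cut` follows. The sample ages are made up for illustration, and treating each bracket as left-closed is our assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw ages, in the style of the `age` column of u.user
ages = pd.Series([24, 53, 23, 33, 17, 60], name="age")

# Left-closed bins whose lower edges match the keys of age_group_dict
bin_edges = [0, 18, 25, 35, 45, 50, 56, np.inf]
labels = ["Under 18", "18-24", "25-34", "35-44", "45-49", "50-55", "56+"]

# right=False makes each bin [lower, upper), so 18 falls in "18-24"
age_group = pd.cut(ages, bins=bin_edges, labels=labels, right=False)
print(pd.DataFrame({"age": ages, "age_group": age_group}))
```

With the real data, the same call could be applied to `user["age"]`.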

In [10]:
# Occupation dictionary as listed in the MovieLens 1M dataset description;
# the 100K u.user file stores occupation names as text (see user.head() in 3.2)
occupation_dict = {
    0:  "other or not specified",
    1:  "academic/educator",
    2:  "artist",
    3:  "clerical/admin",
    4:  "college/grad student",
    5:  "customer service",
    6:  "doctor/health care",
    7:  "executive/managerial",
    8:  "farmer",
    9:  "homemaker",
    10:  "K-12 student",
    11:  "lawyer",
    12:  "programmer",
    13:  "retired",
    14:  "sales/marketing",
    15:  "scientist",
    16:  "self-employed",
    17:  "technician/engineer",
    18:  "tradesman/craftsman",
    19:  "unemployed",
    20:  "writer"
}

In [11]:
rating.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [12]:
rating.tail()

Unnamed: 0,user_id,item_id,rating,timestamp
99995,880,476,3,880175444
99996,716,204,5,879795543
99997,276,1090,1,874795795
99998,13,225,2,882399156
99999,12,203,3,879959583


In [13]:
rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   user_id    100000 non-null  int64
 1   item_id    100000 non-null  int64
 2   rating     100000 non-null  int64
 3   timestamp  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB


From this we can see that the rating data has 100,000 entries with no missing values. Each entry has 4 attributes: user id, item id, rating and timestamp.
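
The `timestamp` column holds Unix time in seconds. As a small sketch, the values can be turned into readable datetimes with `pd.to_datetime`. The three timestamps below are copied by hand from the `rating.head()` output, and the column name `rated_at` is our own choice.

```python
import pandas as pd

# First three timestamps from rating.head(), copied by hand for illustration
sample = pd.DataFrame({"timestamp": [881250949, 891717742, 878887116]})

# Interpret the integers as seconds since 1970-01-01 (UTC)
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s")
print(sample)
```

On the full data the same conversion would be `pd.to_datetime(rating["timestamp"], unit="s")`, provided it runs before the column is dropped in section 4.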

Next, we compute some summary statistics for the ratings:

In [14]:
print('Number of unique users in the dataset :', rating.user_id.nunique())
print('Number of unique movies in the dataset : ', rating.item_id.nunique())


Number of unique users in the dataset : 943
Number of unique movies in the dataset :  1682


In [15]:
print('List of unique ratings:', rating.rating.unique())
print('Average rating given by users:', round(rating.rating.mean(), 2))


List of unique ratings: [3 1 2 4 5]
Average rating given by users: 3.53


In [16]:
print('Frequency of each rating value:')
print(rating['rating'].value_counts())

Frequency of each rating value:
rating
4    34174
3    27145
5    21201
2    11370
1     6110
Name: count, dtype: int64


From this we can see that 943 unique users rated 1682 unique movies. The ratings range from 1 to 5, the average rating is 3.53, and the most frequent rating value is 4.
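
As a quick cross-check, the average rating can be recomputed from the frequency table alone. The sketch below copies the counts from the `value_counts()` output by hand and also reports each rating value's share of all ratings.

```python
import pandas as pd

# Rating frequencies copied from the value_counts() output above
counts = pd.Series({1: 6110, 2: 11370, 3: 27145, 4: 34174, 5: 21201}, name="count")

total = counts.sum()
share = (counts / total).round(4)  # fraction of all ratings per rating value
# weighted mean: sum(rating value * frequency) / total number of ratings
mean_rating = (counts.index.to_numpy() * counts.to_numpy()).sum() / total

print(share)
print("Total ratings:", total)
print("Mean rating:", round(mean_rating, 2))
```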

### 3.2 User data

In [17]:
user.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [18]:
user.tail()

Unnamed: 0,user_id,age,gender,occupation,zip_code
938,939,26,F,student,33319
939,940,32,M,administrator,2215
940,941,20,M,student,97229
941,942,48,F,librarian,78209
942,943,22,M,student,77841


In [19]:
user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     943 non-null    int64 
 1   age         943 non-null    int64 
 2   gender      943 non-null    object
 3   occupation  943 non-null    object
 4   zip_code    943 non-null    object
dtypes: int64(2), object(3)
memory usage: 37.0+ KB


From this we can see that the user data has 943 entries, each with 5 attributes: user id, age, gender, occupation and zip code.

Next, we compute some summary statistics for the users:

In [20]:
print('Number of unique occupations:', user.occupation.nunique())
print('Number of unique zipcodes:', user.zip_code.nunique())
print('Gender distribution of the given data:\n', user['gender'].value_counts())

Number of unique occupations: 21
Number of unique zipcodes: 795
Gender distribution of the given data:
 gender
M    670
F    273
Name: count, dtype: int64


From this we can see that there are 21 unique occupations and 795 unique zip codes among the users, and that the data contains 670 male and 273 female users.

### 3.3 Movie data

## 4. Collaborative Filtering and Recommendation System

In [21]:
# peek at dataframe
item.head()

# we only need the movie_id and movie_title
movies = item.loc[:, :"movie_title"].copy()
movies.head()

Unnamed: 0,movie_id,movie_title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [22]:
movies.movie_title

Unnamed: 0,movie_title
0,Toy Story (1995)
1,GoldenEye (1995)
2,Four Rooms (1995)
3,Get Shorty (1995)
4,Copycat (1995)
...,...
1677,Mat' i syn (1997)
1678,B. Monkey (1998)
1679,Sliding Doors (1998)
1680,You So Crazy (1994)


In [23]:
# peek at rating data
rating.head()

# dropping timestamp
rating.drop("timestamp", axis=1, inplace=True)
rating.head()

Unnamed: 0,user_id,item_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


### 4.1. Building the Recommendation System

In [24]:
# creating n x m matrix where n is user_id and m is item_id
user_ratings = pd.pivot_table(rating, index="user_id", columns="item_id", values="rating").fillna(0)

# user and item counts
n_users = len(user_ratings.index)
n_items = len(user_ratings.columns)

print(f"Users: {n_users}\nItems: {n_items}")
user_ratings.head()

Users: 943
Items: 1682


user_id \ item_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
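
Most cells in this matrix are 0, meaning "not rated". A short sketch of how sparse it is, using only the counts reported earlier in this notebook (943 users, 1682 items, 100,000 ratings):

```python
# Values reported earlier in this notebook
n_users, n_items, n_ratings = 943, 1682, 100_000

density = n_ratings / (n_users * n_items)   # fraction of cells holding a rating
sparsity = 1 - density                      # fraction of cells that are empty
print(f"Density: {density:.2%}, sparsity: {sparsity:.2%}")
```

On the actual matrix, the density could be computed as `(user_ratings != 0).to_numpy().mean()`.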


In [25]:


def train_test_split(data: np.ndarray, n_users: int, n_items: int):
    # create an all-zero array of shape n_users x n_items for the test set
    test = np.zeros((n_users, n_items))
    train = data.copy()

    # for each user, draw a random sample of 5 of the movies they have rated
    for user in range(n_users):
        random_sample = np.random.choice(data[user, :].nonzero()[0],
                                         size=5,
                                         replace=False)
        # set the train to zero to represent no rating and the test will be the original rating
        train[user, random_sample] = 0.
        test[user, random_sample] = data[user, random_sample]

    return train, test

train, test = train_test_split(data=user_ratings.to_numpy(), n_users=n_users, n_items=n_items)
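
The split moves 5 ratings per user from the training matrix into the test matrix, so the two should never overlap and should add back up to the original data. The sketch below checks these properties on a toy matrix. The helper `holdout_split`, the seeded generator and the toy data are our own illustrative stand-ins, not part of the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the toy example is repeatable

# Toy ratings: 4 users x 12 items with values 1..5
toy = rng.integers(1, 6, size=(4, 12)).astype(float)
toy[:, :3] = 0.0  # pretend the first three items are unrated

def holdout_split(data: np.ndarray, n_holdout: int, rng: np.random.Generator):
    """Move n_holdout rated entries per user from train to test."""
    train = data.copy()
    test = np.zeros_like(data)
    for u in range(data.shape[0]):
        rated = data[u].nonzero()[0]  # column indices this user has rated
        held_out = rng.choice(rated, size=n_holdout, replace=False)
        train[u, held_out] = 0.0
        test[u, held_out] = data[u, held_out]
    return train, test

toy_train, toy_test = holdout_split(toy, n_holdout=5, rng=rng)

# Sanity checks: disjoint split, nothing lost, 5 test ratings per user
assert np.all(toy_train * toy_test == 0)
assert np.array_equal(toy_train + toy_test, toy)
assert np.all((toy_test != 0).sum(axis=1) == 5)
```

Passing a seeded generator is a design choice for reproducibility. The notebook's function uses `np.random.choice`, so its split differs from run to run unless a seed is set.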

In [26]:
# find similar users and items
user_similarity = pairwise.cosine_similarity(train + 1e-9)
item_similarity = pairwise.cosine_similarity(train.T + 1e-9)

print(user_similarity.shape, item_similarity.shape)

(943, 943) (1682, 1682)
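
Cosine similarity between two rating vectors is their dot product divided by the product of their norms. The sketch below spells this out in plain NumPy on a toy matrix. The function name `cosine_similarity_rows` is ours. The small `1e-9` offset mirrors the cell above, where it presumably guards against dividing by a zero norm for all-zero rows.

```python
import numpy as np

def cosine_similarity_rows(matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of rows."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_rows = matrix / norms          # scale every row to unit length
    return unit_rows @ unit_rows.T      # dot products of unit vectors

# Toy user x item matrix (0 = not rated), with the same small offset as above
toy = np.array([[5.0, 3.0, 0.0],
                [4.0, 0.0, 0.0],
                [0.0, 0.0, 2.0]]) + 1e-9

sim = cosine_similarity_rows(toy)
print(sim.round(3))
```

Users 0 and 1 share a rated item and get a high similarity, while user 2 shares no rated item with the others and ends up with a similarity close to 0.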


### 4.2 User based Collaborative Filtering

In [27]:
# predict user ratings not included in data
user_preds = np.dot(user_similarity, train) / np.array([np.abs(user_similarity).sum(axis=1)]).T

# get the nonzero elements
nonzero_test = test[test.nonzero()]
nonzero_user_preds = user_preds[test.nonzero()]

user_rating_preds = mean_squared_error(nonzero_test, nonzero_user_preds)
print(f"UBCF Mean Squared Error: {user_rating_preds}")

UBCF Mean Squared Error: 8.19253209278542


### 4.3 Item Based Collaborative Filtering

In [28]:
# predict item ratings not included in data
item_preds = np.dot(train, item_similarity) / np.array([np.abs(item_similarity).sum(axis=1)])

# get the nonzero elements
nonzero_item_preds = item_preds[test.nonzero()]

item_rating_preds = mean_squared_error(nonzero_test, nonzero_item_preds)
print(f"IBCF Mean Squared Error: {item_rating_preds}")

IBCF Mean Squared Error: 11.300279983447261
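
Mean squared error is expressed in squared rating units, which makes it hard to read against a 1 to 5 scale. Taking the square root gives the root mean squared error (RMSE) in rating points. The sketch below applies this to the two MSE values printed above, copied by hand.

```python
import math

# MSE values copied from the two cell outputs above
ubcf_mse = 8.19253209278542
ibcf_mse = 11.300279983447261

ubcf_rmse = math.sqrt(ubcf_mse)
ibcf_rmse = math.sqrt(ibcf_mse)
print(f"UBCF RMSE: {ubcf_rmse:.2f}")
print(f"IBCF RMSE: {ibcf_rmse:.2f}")
```

Both errors are on the order of three rating points. This is consistent with predictions being pulled toward 0 by the zero-filled (unrated) entries of the training matrix.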


## 5. Content Based Filtering

In [29]:
# merge data so we know the features of each movie
movies = pd.merge(item, rating, right_on="item_id", left_on="movie_id")
# create a pivot table
movies_pivot = pd.pivot_table(movies, index="user_id", columns="movie_title", values="rating")

# transposed only so it fits on the screen
movies_pivot.T.head()

movie_title \ user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),2.0,,,,2.0,,,,,,...,2.0,,,2.0,4.0,,,,,
12 Angry Men (1957),5.0,,,,,4.0,4.0,,,5.0,...,,,,,,,,,,
187 (1997),,,2.0,,,,,,,,...,,,,,,,,,,


In [30]:
# avg ratings and rating counts
avg_rating = movies.groupby("movie_title")["rating"].mean()
num_ratings = movies.groupby("movie_title")["rating"].count()

# getting counts and average ratings
ratings_counts = pd.DataFrame({"avg_rating": avg_rating,
                               "num_of_ratings": num_ratings})

# joining the new values to movie data
full_movie_data = pd.merge(movies, ratings_counts, left_on="movie_title", right_index=True)
full_movie_data.head()

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,imbd_url,unknown,action,adventure,animation,childrens,...,romance,sci-fi,thriller,war,western,user_id,item_id,rating,avg_rating,num_of_ratings
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,308,1,4,3.878319,452
1,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,287,1,5,3.878319,452
2,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,148,1,4,3.878319,452
3,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,280,1,4,3.878319,452
4,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,66,1,3,3.878319,452
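
The average rating and the rating count per movie can also be computed in a single pass with pandas named aggregation. The sketch below shows this on a hypothetical miniature of the merged `movies` frame. It is an alternative formulation, not the code used above.

```python
import pandas as pd

# Hypothetical miniature of the merged `movies` frame
toy_movies = pd.DataFrame({
    "movie_title": ["A", "A", "A", "B", "B"],
    "rating": [4, 5, 3, 2, 4],
})

# One groupby, two named output columns
toy_counts = (toy_movies.groupby("movie_title")["rating"]
              .agg(avg_rating="mean", num_of_ratings="count"))
print(toy_counts)
```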


In [31]:
def get_similar_movies(full_movie_data: pd.DataFrame,
                       movie_matrix: pd.DataFrame,
                       movie_title: str,
                       min_num_of_ratings: int = 100,
                       n_recommendations: int = 5
                       ):
    """
    Get similar movies based on correlation with other movies
    """
    # get most correlated movies
    similar_movies = movie_matrix.corrwith(movie_matrix[movie_title])
    # converting to a dataframe and dropping NaN's
    similar_corr_df = pd.DataFrame({"correlation":similar_movies})
    similar_corr_df.dropna(inplace=True)

    # store the original dataframe
    orig = full_movie_data.copy()

    # merge with correlated dataframe but only keep specified columns
    corr_with_movie = pd.merge(left=similar_corr_df,
                               right=orig,
                               on="movie_title")[
        ["movie_title", "correlation", "avg_rating", "num_of_ratings"]].drop_duplicates().reset_index(drop=True)

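    # keep only movies with more than min_num_of_ratings ratings
    result = corr_with_movie[corr_with_movie['num_of_ratings'] > min_num_of_ratings].sort_values(
                                                                                     by='correlation',
                                                                                     ascending=False)
    # skip the first row (expected to be the query movie itself) and return the top n_recommendations
    return result.iloc[1:, :].head(n_recommendations)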

In [32]:
get_similar_movies(full_movie_data, movies_pivot, "Toy Story (1995)")

Unnamed: 0,movie_title,correlation,avg_rating,num_of_ratings
288,"Craft, The (1996)",0.5491,3.115385,104
356,Down Periscope (1996),0.457995,2.70297,101
825,Miracle on 34th Street (1994),0.456291,3.722772,101
479,G.I. Jane (1997),0.454756,3.36,175
51,Amistad (1997),0.449915,3.854839,124
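
To see what `corrwith` does and why the `min_num_of_ratings` filter matters, consider the toy pivot table below. The data is made up, with users as rows and movies as columns. Movie D has only two ratings, both from users who also rated A, and that tiny overlap is enough to produce a perfect correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie ratings; NaN means "not rated"
toy_pivot = pd.DataFrame({
    "A": [5.0, 4.0, 1.0, 2.0],
    "B": [4.0, 5.0, 2.0, 1.0],
    "C": [1.0, 2.0, 5.0, 4.0],
    "D": [np.nan, np.nan, 1.0, 2.0],
}, index=pd.Index([1, 2, 3, 4], name="user_id"))

# Pearson correlation of every movie column with movie A,
# computed over the users who rated both movies
corr_with_a = toy_pivot.corrwith(toy_pivot["A"])
n_ratings = toy_pivot.count()  # number of ratings per movie

print(pd.DataFrame({"correlation": corr_with_a, "num_of_ratings": n_ratings}))
```

Movie B correlates positively with A and C negatively. D ties with A itself at the top despite resting on two data points, which is the kind of result a minimum rating count is meant to screen out.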
