# MOVIELENS RECOMMENDATION SYSTEMS


### Collaborators
- 1.Ruth Kitasi
- 2.Agatha Nyambati
- 3.Joseline Apiyo
- 4.Cecilia Ngunjiri
- 5.John Mbego
- 6.Leonard Koyio

![movie-img.jpeg](movie-img.jpeg)

# 1. BUSINESS UNDERSTANDING

## 1.1 Overview

In today's world of massive data growth, recommendation systems have become essential tools for filtering information and enhancing user experiences. These systems help users find relevant content by analyzing their past behaviors, such as search queries or browsing histories.

Companies like YouTube and Spotify use recommendation algorithms to suggest the next video or curate personalized playlists based on user preferences.

In line with our project objective, we aim to use data analysis to build a movie recommendation system that provides users with personalized movie suggestions.

By analyzing user ratings of other movies, we can generate tailored recommendations that align with individual preferences. The goal is to develop a model that delivers the top 5 movie recommendations for each user, optimizing their viewing experience based on their previous interactions.

## 1.2 Problem statement

With the vast amount of content available on streaming platforms, users often feel overwhelmed by choices, making it difficult to discover movies that align with their preferences. Traditional search methods fall short in addressing this challenge, resulting in a less satisfying user experience and decreased engagement.

MovieLens has tasked our team of data scientists with optimizing their recommendation system through data-driven approaches. By analyzing user behaviors and preferences, we aim to enhance the system's ability to deliver personalized movie recommendations.

## 1.3 Objectives

- Develop a model to provide personalized top 5 movie recommendations for users based on their ratings and preferences, utilizing collaborative filtering techniques.

- Determine the rating frequency of users based on various features, such as genre, director, and release year, to identify patterns in user preferences.

- Analyze key features that contribute to the popularity of trending movies to enhance the effectiveness of the recommendation system in suggesting relevant content.

- Implement collaborative filtering techniques, including both user-based and item-based methods, to segment users and items, improving the accuracy of personalized recommendations.

- Create a solution to address the cold start problem by recommending popular and trending movies to new users with no prior ratings, ensuring an engaging initial experience.


# 2. DATA UNDERSTANDING

## 2.1 Data Source

The dataset was obtained from the GroupLens website (https://grouplens.org/datasets/movielens/), a well-known resource for research in recommendation systems and data analysis.

The MovieLens dataset comprises four files:

1.`links`:  contains three features:-

- movieId is a unique identifier for movies used by MovieLens
- imdbId is a unique identifier for movies on IMDb
- tmdbId is a unique identifier for movies on TMDb

2.`movies`:  contains three features:-

- movieId
- title contains the title of each movie
- genres contains the genres of each movie

3.`ratings`:  contains four features:-

- userId is a unique identifier assigned to each user who has rated movies in the dataset
- movieId
- rating represents the user's rating for a particular movie
- timestamp records the date and time when the rating was given

4.`tags`:  contains four features:-

- userId
- movieId
- tag contains descriptive keywords or phrases that characterize the movie
- timestamp records the date and time when the tag was applied


  

## 2.2 Data loading


In [70]:
## Importing the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [71]:
## reading the files
## raw strings (r'...') keep the backslashes in the Windows-style paths literal
links = pd.read_csv(r'ml-latest-small\links.csv')
movies = pd.read_csv(r'ml-latest-small\movies.csv')
tags = pd.read_csv(r'ml-latest-small\tags.csv')
ratings = pd.read_csv(r'ml-latest-small\ratings.csv')
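The paths above use backslashes, which only work as separators on Windows. As a minimal sketch, the same four files could be read in a portable way with `pathlib`. The helper name `load_movielens` and its `data_dir` parameter are illustrative assumptions and are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_movielens(data_dir="ml-latest-small"):
    """Read the four MovieLens CSV files in data_dir into a dict of DataFrames.

    Hypothetical helper for illustration; the folder name is the one used above.
    """
    data_dir = Path(data_dir)
    names = ["links", "movies", "tags", "ratings"]
    # The "/" operator on Path objects joins parts with the current OS separator.
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in names}
```

Assuming the folder exists next to the notebook, `frames = load_movielens()` followed by `ratings = frames["ratings"]` would give the same DataFrames as the cell above.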

Viewing a few rows of each file

In [72]:
# viewing the links file
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [73]:
# viewing the movie file
movies.tail()

Unnamed: 0,movieId,title,genres
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy


In [74]:
# viewing the ratings file
ratings.sample(n=5)

Unnamed: 0,userId,movieId,rating,timestamp
36114,246,52885,5.0,1354134427
65596,421,593,5.0,1311494584
82526,524,377,5.0,851608745
71813,462,5932,4.0,1293373783
13927,89,118572,4.0,1520408985


In [75]:
# viewing the tags file
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


Getting a concise summary of each file using the info() method.

In [76]:
links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


In [77]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [78]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [79]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


In [80]:
links.describe()

Unnamed: 0,movieId,imdbId,tmdbId
count,9742.0,9742.0,9734.0
mean,42200.353623,677183.9,55162.123793
std,52160.494854,1107228.0,93653.481487
min,1.0,417.0,2.0
25%,3248.25,95180.75,9665.5
50%,7300.0,167260.5,16529.0
75%,76232.0,805568.5,44205.75
max,193609.0,8391976.0,525662.0


In [81]:
movies.describe()

Unnamed: 0,movieId
count,9742.0
mean,42200.353623
std,52160.494854
min,1.0
25%,3248.25
50%,7300.0
75%,76232.0
max,193609.0


In [82]:
ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


In [83]:
tags.describe()

Unnamed: 0,userId,movieId,timestamp
count,3683.0,3683.0,3683.0
mean,431.149335,27252.013576,1320032000.0
std,158.472553,43490.558803,172102500.0
min,2.0,1.0,1137179000.0
25%,424.0,1262.5,1137521000.0
50%,474.0,4454.0,1269833000.0
75%,477.0,39263.0,1498457000.0
max,610.0,193565.0,1537099000.0


Getting a summary of the number of rows and columns of each dataset

In [84]:
rows, columns = links.shape
print(f'The links dataset has {rows} rows and {columns} columns')

The links dataset has 9742 rows and 3 columns


In [85]:
rows, columns = movies.shape
print(f'The movies dataset has {rows} rows and {columns} columns')

The movies dataset has 9742 rows and 3 columns


In [86]:
rows, columns = ratings.shape
print(f'The ratings dataset has {rows} rows and {columns} columns')

The ratings dataset has 100836 rows and 4 columns


In [87]:
rows, columns = tags.shape
print(f'The tags dataset has {rows} rows and {columns} columns')

The tags dataset has 3683 rows and 4 columns


#### Observations made from data understanding

- All four files share a common feature, the movieId column.
- The links and movies datasets have the same number of rows (9742).
- The datasets contain a mix of data types (int64, float64 and object).
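The `info()` summaries above also show that `tmdbId` has 9734 non-null values out of 9742 rows, so 8 values are missing in the links file. A minimal sketch of how missing values can be counted and located is given here on a toy frame. `links_demo` and the positions of its missing values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the links table; the missing tmdbId values are invented.
links_demo = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "imdbId": [114709, 113497, 113228, 114885],
    "tmdbId": [862.0, np.nan, 15602.0, np.nan],
})

# isna() flags missing cells; summing the flags counts them per column.
missing_per_column = links_demo.isna().sum()
print(missing_per_column)

# Rows in which at least one value is missing.
rows_with_missing = links_demo[links_demo.isna().any(axis=1)]
print(rows_with_missing)
```

Running the same two calls on the real `links` DataFrame should report the 8 missing `tmdbId` values implied by the `info()` output.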

## 2.3 Merging Files

Given that the four datasets share a common feature, the movie ID, we will use this column to perform a merge, consolidating the datasets into a single file. This approach ensures not only the integration of information from different sources but also enhances data completeness and facilitates more thorough analysis.
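Before merging, it can help to state how many rows each table is expected to hold per `movieId`. The `validate` argument of `pd.merge` checks that expectation and raises a `MergeError` when it does not hold. The sketch below uses small toy frames (`movies_demo`, `links_demo`, `ratings_demo`), all of which are assumptions for illustration.

```python
import pandas as pd

movies_demo = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
links_demo = pd.DataFrame({"movieId": [1, 2, 3], "imdbId": [10, 20, 30]})
ratings_demo = pd.DataFrame({
    "userId": [1, 2, 1],
    "movieId": [1, 1, 2],
    "rating": [4.0, 3.5, 5.0],
})

# movieId should identify exactly one row in movies and one row in links.
one_to_one = pd.merge(movies_demo, links_demo, on="movieId",
                      how="inner", validate="one_to_one")

# Each rating points at one movie, but a movie can have many ratings.
many_to_one = pd.merge(ratings_demo, movies_demo, on="movieId",
                       how="inner", validate="many_to_one")

# Asking for one_to_one here fails because movieId repeats in the ratings.
try:
    pd.merge(ratings_demo, movies_demo, on="movieId", validate="one_to_one")
    check_failed = False
except pd.errors.MergeError:
    check_failed = True

print(one_to_one.shape, many_to_one.shape, check_failed)
```

A merge that passes `validate="one_to_one"` or `validate="many_to_one"` cannot increase the row count of the left table, which makes unexpected row growth easier to spot.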

In [88]:
## Merging files on the common feature, movieId

## Step 1: Merging the movies and the links datasets.
movies_links_merged = pd.merge(movies, links, on='movieId', how='inner')
movies_links_merged.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,31357.0
4,5,Father of the Bride Part II (1995),Comedy,113041,11862.0


In [92]:
## Step 2: Merging the movies_links_merged and ratings datasets on movieId

movies_links_ratings_merged = pd.merge(ratings, movies_links_merged, on='movieId', how='inner')
movies_links_ratings_merged.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,imdbId,tmdbId
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0


In [98]:
## Step 3: Merging the results of movies_links_ratings_merged with the tags dataset.
final_merge = pd.merge(movies_links_ratings_merged, tags, on='movieId', how='inner')
final_merge.head()

Unnamed: 0,userId_x,movieId,rating,timestamp_x,title,genres,imdbId,tmdbId,userId_y,tag,timestamp_y
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,336,pixar,1139045764
1,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,474,pixar,1137206825
2,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,567,fun,1525286013
3,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,336,pixar,1139045764
4,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,474,pixar,1137206825


In [99]:
## Checking the number of rows and columns of our final merged dataset

rows, columns = final_merge.shape
print(f'The final merged dataset contains {rows} rows and {columns} columns')

The final merged dataset contains 233213 rows and 11 columns


In [100]:
## Getting a concise summary of the final merged dataset

final_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233213 entries, 0 to 233212
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   userId_x     233213 non-null  int64  
 1   movieId      233213 non-null  int64  
 2   rating       233213 non-null  float64
 3   timestamp_x  233213 non-null  int64  
 4   title        233213 non-null  object 
 5   genres       233213 non-null  object 
 6   imdbId       233213 non-null  int64  
 7   tmdbId       233213 non-null  float64
 8   userId_y     233213 non-null  int64  
 9   tag          233213 non-null  object 
 10  timestamp_y  233213 non-null  int64  
dtypes: float64(2), int64(6), object(3)
memory usage: 21.4+ MB


The output shows that the final merged dataset contains three data types, distributed as follows:
- 2 float64 columns
- 6 int64 columns
- 3 object columns

The memory usage of the final merged dataset is at least 21.4 MB (the `+` in the output means the object columns are not fully counted). The dataset has grown significantly after merging, mainly because movie information is repeated across every rating and tag.

Movie Engagement: The merged dataset contains 233,213 rows while the ratings dataset had 100,836 rows. Because both the ratings and the tags tables can hold many rows per movie, the inner merge on movieId produces one row for every rating-tag pair of the same movie and drops movies that have no tags. The most heavily rated and tagged movies therefore contribute most of the rows, and the increase should not be read as additional user interactions.
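The change in row count can be reproduced on toy data. When `movieId` repeats in both tables, an inner merge returns one row per rating-tag pair of the same movie, so the merged size is the sum over movies of (number of ratings × number of tags). The frames `ratings_demo` and `tags_demo` below are made-up examples, not samples from the real files.

```python
import pandas as pd

# Toy data: movie 1 has 2 ratings and 3 tags; movie 2 has 1 rating and no tags.
ratings_demo = pd.DataFrame({
    "userId": [1, 5, 7],
    "movieId": [1, 1, 2],
    "rating": [4.0, 4.0, 3.0],
})
tags_demo = pd.DataFrame({
    "userId": [336, 474, 567],
    "movieId": [1, 1, 1],
    "tag": ["pixar", "pixar", "fun"],
})

merged_demo = pd.merge(ratings_demo, tags_demo, on="movieId", how="inner")

# Expected rows: sum over movies of (number of ratings x number of tags).
# Movies missing from either table give NaN here and are dropped, as in an inner merge.
pairs_per_movie = (
    ratings_demo.groupby("movieId").size()
    * tags_demo.groupby("movieId").size()
)
expected_rows = int(pairs_per_movie.dropna().sum())

print(len(merged_demo), expected_rows)  # both are 6: 2 ratings x 3 tags for movie 1
```

In this toy case three ratings become six merged rows, and the rating for movie 2 disappears because that movie has no tags.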


In [101]:
final_merge.describe()

Unnamed: 0,userId_x,movieId,rating,timestamp_x,imdbId,tmdbId,userId_y,timestamp_y
count,233213.0,233213.0,233213.0,233213.0,233213.0,233213.0,233213.0,233213.0
mean,309.688191,12319.999443,3.966535,1213524000.0,261063.2,9378.277742,470.683564,1384774000.0
std,178.206387,28243.919401,0.968637,225044800.0,441441.1,36943.1398,153.329632,153462100.0
min,1.0,1.0,0.5,828124600.0,12349.0,11.0,2.0,1137179000.0
25%,156.0,296.0,3.5,1017365000.0,110357.0,278.0,424.0,1242494000.0
50%,309.0,1198.0,4.0,1217325000.0,110912.0,680.0,477.0,1457901000.0
75%,460.0,4638.0,5.0,1443201000.0,172495.0,1892.0,599.0,1498457000.0
max,610.0,193565.0,5.0,1537799000.0,5580390.0,503475.0,610.0,1537099000.0


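userId_x and userId_y: userId_x (the user who rated) ranges from 1 to 610 and userId_y (the user who tagged) from 2 to 610. This is consistent with the 610 users in the MovieLens small dataset, although the range alone does not establish how many distinct users remain in each column after merging; `nunique()` would confirm the exact counts.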

Each column has 233,213 entries, meaning no missing values for the columns shown (movieId, imdbId, tmdbId, userId_x, rating, timestamp_x, userId_y, timestamp_y).

Movie IDs range from 1 to 193565, suggesting a large dataset covering a wide variety of movies.
The 50th percentile (50%, or median) movie ID is 1198, indicating that half of the rows refer to movies with an ID of 1198 or lower.
The average (mean) movie ID is 12319.99, which is much higher than the median, so a small share of rows refers to movies with very high IDs. Since movieId is an identifier rather than a measurement, these statistics describe which IDs occur most often, not a property of the movies themselves.

imdbId and tmdbId: Similarly, the IMDb IDs and TMDB IDs show a broad range from 12,349 to 5,580,390 (IMDb) and from 11 to 503,475 (TMDB), also suggesting a wide variety of movie records. The high standard deviation indicates significant variation in these IDs.

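The mean rating is approximately 3.97, indicating that, on average, users gave high ratings (close to 4). This is higher than the mean of 3.50 in the original ratings file because each rating is repeated once for every tag on the same movie, so heavily tagged movies carry more weight in the merged data.

The mean timestamp (timestamp_x) is about 1.213 billion seconds, which corresponds to mid-2008, and the maximum timestamp shows that the latest ratings were given in 2018.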
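The timestamps are stored as seconds since 1 January 1970 (UTC), so they can be converted to readable dates with `pd.to_datetime`. The sketch below converts the rounded minimum, mean and maximum of `timestamp_x` taken from the `describe()` output above. The Series name `timestamp_summary` is an assumption for illustration.

```python
import pandas as pd

# Rounded values from the describe() output above (seconds since 1970-01-01 UTC).
timestamp_summary = pd.Series(
    [828124600, 1213524000, 1537799000],
    index=["min", "mean", "max"],
)

# unit="s" tells pandas the integers are seconds rather than nanoseconds.
as_dates = pd.to_datetime(timestamp_summary, unit="s")
print(as_dates)
print(as_dates.dt.year)  # years 1996, 2008 and 2018
```

The same call on a full column, for example `pd.to_datetime(ratings['timestamp'], unit='s')`, would give a datetime column that can be grouped by year or month.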

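Skewed Distributions: The movieId column is right-skewed (its mean is far above its median), while the rating column leans the other way: its mean (3.97) is slightly below its median (4.0) and its 75th percentile already equals the maximum of 5, so ratings cluster at the high end with a tail of low ratings. This could influence modeling choices, as the majority of rows fall within a specific subset of the range.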

In [103]:
# creating a copy of the final merge to perform data cleaning
# .copy() makes an independent DataFrame, so cleaning does not alter final_merge
Movies_df = final_merge.copy()
Movies_df.head()

Unnamed: 0,userId_x,movieId,rating,timestamp_x,title,genres,imdbId,tmdbId,userId_y,tag,timestamp_y
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,336,pixar,1139045764
1,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,474,pixar,1137206825
2,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,567,fun,1525286013
3,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,336,pixar,1139045764
4,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,474,pixar,1137206825
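Plain assignment only binds a second name to the same DataFrame, whereas `.copy()` creates an independent object, which matters when the copy is cleaned in place. A minimal illustration on a toy frame follows. The names `original`, `alias` and `independent` are assumptions for this sketch.

```python
import pandas as pd

original = pd.DataFrame({"rating": [4.0, 3.5]})

alias = original                # second name for the same object
independent = original.copy()   # new object with its own data

# Adding a column through the alias also changes `original`.
alias["flag"] = True

print(alias is original)               # True
print("flag" in original.columns)      # True
print("flag" in independent.columns)   # False
```

Working on a copy keeps the merged data available unchanged in case a cleaning step needs to be revisited.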


# 3. DATA CLEANING

Now that we have merged our dataset, we will take the following steps to ensure it is clean and ready for analysis:

1. `Handling duplicate columns` to avoid redundancy.

2. `Removing unnecessary columns` that don't contribute to the analysis.

3. `Checking for missing values` and addressing them appropriately.

4. `Handling outliers` to ensure the dataset accurately represents the data.

5. `Ensuring consistent data types` across all columns.

6. `Filtering irrelevant rows` to keep only valid and useful information.

## 3.1.1 Handling duplicate columns

In [55]:
# checking if there are duplicate columns
# Compare userId_x and userId_y to check whether the two columns hold identical values
Movies_df['userId_x'].equals(Movies_df['userId_y'])


False
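`Series.equals` returns a single boolean, so `False` only says the two columns are not identical in every row. As a minimal sketch, an element-wise comparison shows in how many rows the rating user and the tagging user coincide. The frame `demo` is toy data for illustration.

```python
import pandas as pd

# Toy stand-in for the merged table: who rated (userId_x) vs. who tagged (userId_y).
demo = pd.DataFrame({
    "userId_x": [1, 1, 5, 474],
    "userId_y": [336, 474, 336, 474],
})

same_user = demo["userId_x"] == demo["userId_y"]  # element-wise comparison

print(demo["userId_x"].equals(demo["userId_y"]))  # False
print(same_user.sum(), "of", len(demo), "rows match")
```

If the two columns differ in most rows, they carry different information (the rater and the tagger) and are not duplicates in the redundancy sense.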