# MOVIELENS RECOMMENDATION SYSTEMS


### Collaborators
1. Ruth Kitasi
2. Agatha Nyamabati
3. Joseline Apiyo
4. Cecilia Ngunjiri
5. John Mbego
6. Leornad Koyio

![movie-img.jpeg](movie-img.jpeg)

# 1. BUSINESS UNDERSTANDING

## 1.1 Overview

In today's world of massive data growth, recommendation systems have become essential tools for filtering information and enhancing user experiences. These systems help users find relevant content by analyzing their past behaviors, such as search queries or browsing histories.

Companies like YouTube and Spotify use recommendation algorithms to suggest the next video or curate personalized playlists based on user preferences.

In line with our project objective, we aim to use data analysis to build a movie recommendation system that provides users with personalized movie suggestions.

By analyzing user ratings of other movies, we can generate tailored recommendations that align with individual preferences. The goal is to develop a model that delivers the top 5 movie recommendations for each user, optimizing their viewing experience based on their previous interactions.

## 1.2 Problem statement

With the vast amount of content available on streaming platforms, users often feel overwhelmed by choices, making it difficult to discover movies that align with their preferences. Traditional search methods fall short in addressing this challenge, resulting in a less satisfying user experience and decreased engagement.

MovieLens has tasked our team of data scientists with optimizing their recommendation system through data-driven approaches. By analyzing user behaviors and preferences, we aim to enhance the system's ability to deliver personalized movie recommendations.

## 1.3 Objectives

- Develop a model to provide personalized top 5 movie recommendations for users based on their ratings and preferences, utilizing collaborative filtering techniques.

- Determine the rating frequency of users based on various features, such as genre, director, and release year, to identify patterns in user preferences.

- Analyze key features that contribute to the popularity of trending movies to enhance the effectiveness of the recommendation system in suggesting relevant content.

- Implement collaborative filtering techniques, including both user-based and item-based methods, to segment users and items, improving the accuracy of personalized recommendations.

- Create a solution to address the cold start problem by recommending popular and trending movies to new users with no prior ratings, ensuring an engaging initial experience.
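
To make the collaborative filtering and cold start objectives more concrete, here is a minimal sketch on a made-up ratings table. It is an illustration under stated assumptions rather than the project's final model: the toy data, the helper names `item_similarity` and `recommend_top_n`, the use of cosine similarity, and the "most-rated movies" fallback are all hypothetical choices.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical values, not taken from MovieLens)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 20, 30, 40],
    "rating":  [5.0, 4.0, 1.0, 4.0, 5.0, 2.0, 5.0, 4.0],
})

# User x movie matrix; unrated entries become 0 (a simplification)
user_item = toy_ratings.pivot_table(
    index="userId", columns="movieId", values="rating"
).fillna(0.0)


def item_similarity(matrix):
    """Cosine similarity between the movie columns of a user x movie matrix."""
    values = matrix.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unrated movies
    normalized = values / norms
    sim = normalized.T @ normalized
    return pd.DataFrame(sim, index=matrix.columns, columns=matrix.columns)


def recommend_top_n(user_id, matrix, ratings, n=5):
    """Item-based recommendations, with a popularity fallback for new users."""
    if user_id not in matrix.index:
        # Cold start: no ratings for this user, so return the most-rated movies
        popular = ratings.groupby("movieId")["rating"].count()
        return popular.sort_values(ascending=False).index[:n].tolist()

    sim = item_similarity(matrix)
    user_ratings = matrix.loc[user_id]
    # Unnormalized weighted sum: similarity to each movie times the user's rating
    scores = sim.dot(user_ratings)
    # Only recommend movies the user has not rated yet
    unseen = user_ratings[user_ratings == 0].index
    return scores.loc[unseen].sort_values(ascending=False).index[:n].tolist()


print(recommend_top_n(1, user_item, toy_ratings))   # known user
print(recommend_top_n(99, user_item, toy_ratings))  # new user, cold start
```

Treating unrated movies as 0 keeps the sketch short but biases the similarities; a fuller treatment would typically mean-center the ratings or use a dedicated recommender library.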


# 2. DATA UNDERSTANDING

## 2.1 Data Source

The dataset was obtained from the GroupLens website (https://grouplens.org/datasets/movielens/), a well-known resource for research in recommendation systems and data analysis.

The MovieLens dataset comprises four files:

1. `links`: contains three features:

- movieId: a unique identifier for each movie within MovieLens
- imdbId: a unique identifier for the movie on IMDb
- tmdbId: a unique identifier for the movie on TMDb

2. `movies`: contains three features:

- movieId
- title: the title of the movie
- genres: the genres of the movie, separated by `|`

3. `ratings`: contains four features:

- userId: a unique identifier assigned to each user who has rated movies in the dataset
- movieId
- rating: the user's rating for a particular movie
- timestamp: the date and time when the rating was given

4. `tags`: contains four features:

- userId
- movieId
- tag: a descriptive keyword or phrase that characterizes the movie
- timestamp: the date and time when the tag was applied
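
The `timestamp` columns are stored as integers (Unix epoch seconds), so they need converting before they can be read as dates. A minimal sketch, using a small hand-built frame rather than the real file; the column names `tagged_at` and `year` are illustrative additions, and the two timestamp values are copied from the `tags` preview shown later in this notebook:

```python
import pandas as pd

# Small stand-in for the tags file
sample = pd.DataFrame({
    "userId": [2, 2],
    "movieId": [60756, 89774],
    "timestamp": [1445714994, 1445715207],
})

# Interpret the integers as seconds since 1970-01-01 (UTC)
sample["tagged_at"] = pd.to_datetime(sample["timestamp"], unit="s")

# Derived fields such as the year can then be extracted with the .dt accessor
sample["year"] = sample["tagged_at"].dt.year

print(sample)
```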


  

## 2.2 Data loading


In [1]:
## Importing the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
## Reading the files
## Forward slashes avoid invalid escape sequences such as '\l' and work on all platforms
links = pd.read_csv('ml-latest-small/links.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')
tags = pd.read_csv('ml-latest-small/tags.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')

Viewing a few rows of each file

In [3]:
# viewing the links file
links.head()

| | movieId | imdbId | tmdbId |
|---|---|---|---|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |


In [4]:
# viewing the last rows of the movies file
movies.tail()

| | movieId | title | genres |
|---|---|---|---|
| 9737 | 193581 | Black Butler: Book of the Atlantic (2017) | Action\|Animation\|Comedy\|Fantasy |
| 9738 | 193583 | No Game No Life: Zero (2017) | Animation\|Comedy\|Fantasy |
| 9739 | 193585 | Flint (2017) | Drama |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | Action\|Animation |
| 9741 | 193609 | Andrew Dice Clay: Dice Rules (1991) | Comedy |


In [5]:
# viewing the ratings file
ratings.sample(n=5)

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 7491 | 51 | 1186 | 4.5 | 1230929701 |
| 27485 | 186 | 3638 | 4.0 | 1031080064 |
| 54932 | 365 | 2858 | 1.0 | 1488332711 |
| 5831 | 41 | 81591 | 3.5 | 1458939555 |
| 55963 | 369 | 4878 | 4.0 | 1237083040 |


In [6]:
# viewing the tags file
tags.head()

| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |


Getting a concise summary of each file using the info() method.

In [7]:
links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


In [8]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [9]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [10]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB
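
The `info()` summaries above show 9734 non-null `tmdbId` values out of 9742 rows in `links`, i.e. 8 missing entries, while every other column is complete. This also explains why `tmdbId` is `float64`: pandas stores integer columns with missing values as floats by default. A sketch of getting the missing counts directly, on a small stand-in frame (`links_demo` is a hypothetical name, and the `NaN` is introduced on purpose for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for the links file with one deliberately missing tmdbId
links_demo = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, np.nan, 15602.0],
})

# Number of missing values per column
missing_counts = links_demo.isna().sum()
print(missing_counts)

# Rows where tmdbId is missing, for closer inspection
rows_with_missing = links_demo[links_demo["tmdbId"].isna()]
print(rows_with_missing)
```

The same two calls can be applied to `links`, `movies`, `ratings` and `tags` once they are loaded.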


In [11]:
links.describe()

| | movieId | imdbId | tmdbId |
|---|---|---|---|
| count | 9742.0 | 9742.0 | 9734.0 |
| mean | 42200.353623 | 677183.9 | 55162.123793 |
| std | 52160.494854 | 1107228.0 | 93653.481487 |
| min | 1.0 | 417.0 | 2.0 |
| 25% | 3248.25 | 95180.75 | 9665.5 |
| 50% | 7300.0 | 167260.5 | 16529.0 |
| 75% | 76232.0 | 805568.5 | 44205.75 |
| max | 193609.0 | 8391976.0 | 525662.0 |


In [12]:
movies.describe()

| | movieId |
|---|---|
| count | 9742.0 |
| mean | 42200.353623 |
| std | 52160.494854 |
| min | 1.0 |
| 25% | 3248.25 |
| 50% | 7300.0 |
| 75% | 76232.0 |
| max | 193609.0 |


In [13]:
ratings.describe()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| count | 100836.0 | 100836.0 | 100836.0 | 100836.0 |
| mean | 326.127564 | 19435.295718 | 3.501557 | 1205946000.0 |
| std | 182.618491 | 35530.987199 | 1.042529 | 216261000.0 |
| min | 1.0 | 1.0 | 0.5 | 828124600.0 |
| 25% | 177.0 | 1199.0 | 3.0 | 1019124000.0 |
| 50% | 325.0 | 2991.0 | 3.5 | 1186087000.0 |
| 75% | 477.0 | 8122.0 | 4.0 | 1435994000.0 |
| max | 610.0 | 193609.0 | 5.0 | 1537799000.0 |


In [14]:
tags.describe()

| | userId | movieId | timestamp |
|---|---|---|---|
| count | 3683.0 | 3683.0 | 3683.0 |
| mean | 431.149335 | 27252.013576 | 1320032000.0 |
| std | 158.472553 | 43490.558803 | 172102500.0 |
| min | 2.0 | 1.0 | 1137179000.0 |
| 25% | 424.0 | 1262.5 | 1137521000.0 |
| 50% | 474.0 | 4454.0 | 1269833000.0 |
| 75% | 477.0 | 39263.0 | 1498457000.0 |
| max | 610.0 | 193565.0 | 1537099000.0 |
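
Summary statistics on identifier columns such as `userId` and `movieId` say little, since the IDs are labels rather than quantities. The `rating` column is the informative one: the `describe()` output for `ratings` shows values from 0.5 to 5.0 with a mean of about 3.5. Counts per rating value, per user and per movie are often more useful than means here. A minimal sketch on made-up data (`ratings_demo` and its values are hypothetical):

```python
import pandas as pd

# Made-up ratings in the same shape as the ratings file
ratings_demo = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 30, 10],
    "rating":  [4.0, 3.5, 5.0, 4.0, 0.5, 4.0],
})

# How often each rating value occurs, ordered by rating value
rating_distribution = ratings_demo["rating"].value_counts().sort_index()

# How many ratings each user gave and each movie received
ratings_per_user = ratings_demo.groupby("userId").size()
ratings_per_movie = ratings_demo.groupby("movieId").size()

print(rating_distribution)
print(ratings_per_user)
print(ratings_per_movie)
```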


Getting a summary of the number of rows and columns of each dataset

In [15]:
rows, columns = links.shape
print(f'The links dataset has {rows} rows and {columns} columns')

The links dataset has 9742 rows and 3 columns


In [16]:
rows, columns = movies.shape
print(f'The movies dataset has {rows} rows and {columns} columns')

The movies dataset has 9742 rows and 3 columns


In [17]:
rows, columns = ratings.shape
print(f'The ratings dataset has {rows} rows and {columns} columns')

The ratings dataset has 100836 rows and 4 columns


In [18]:
rows, columns = tags.shape
print(f'The tags dataset has {rows} rows and {columns} columns')

The tags dataset has 3683 rows and 4 columns


#### Observations made from data understanding

- All four files share a common feature, the movieId column.
- The links and movies datasets each have 9742 rows.
- The datasets contain a mix of data types (int64, float64 and object).
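
Because `movieId` appears in every file, it can serve as the join key when the files are combined. The following is a sketch of one possible merge, assuming a left join of ratings onto movies; the frames `movies_demo` and `ratings_demo` and their titles are made up for illustration and are not the notebook's actual merging step:

```python
import pandas as pd

# Made-up stand-ins for the movies and ratings files
movies_demo = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A (1995)", "Movie B (1995)", "Movie C (1995)"],
    "genres": ["Comedy", "Drama", "Action|Comedy"],
})
ratings_demo = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
})

# Attach title and genres to every rating.
# validate="many_to_one" raises an error if movieId is duplicated in movies_demo.
merged = ratings_demo.merge(
    movies_demo, on="movieId", how="left", validate="many_to_one"
)

print(merged)
```

A left join keeps every rating even if a movie were missing from the movies table, which makes unmatched IDs easy to spot as rows with a null `title`.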

In [None]:
# merging files