## Business Understanding

A recommender system is an algorithm that suggests items to users based on factors such as their preferences, past behaviour, or similarity to other users. Recommender systems are used in many domains, including movies, music, news, social tags, and products in general. Such a system produces a list of recommendations, and there are several ways to generate that list. Two of the most popular approaches are collaborative filtering and content-based filtering.
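To make the collaborative filtering idea concrete, here is a minimal sketch of a user-based approach on a tiny, made-up rating matrix. The data, the function names (`cosine_similarity`, `predict_rating`), and the choice of cosine similarity are assumptions for illustration only, not the method this notebook ends up using. Treating unrated movies as 0 inside the similarity is also a simplification.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are movies, 0 means "not rated".
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0 has not rated movie 2
    [4.0, 5.0, 4.0, 1.0],   # user 1 has tastes similar to user 0
    [1.0, 1.0, 2.0, 5.0],   # user 2 has different tastes
])


def cosine_similarity(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def predict_rating(matrix, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weights, values = [], []
    for other in range(matrix.shape[0]):
        # Skip the target user and anyone who has not rated the item.
        if other == user or matrix[other, item] == 0:
            continue
        weights.append(cosine_similarity(matrix[user], matrix[other]))
        values.append(matrix[other, item])
    if not weights or sum(weights) == 0:
        return None  # no usable neighbours (a cold-start situation)
    return float(np.dot(weights, values) / np.sum(weights))


predicted = predict_rating(toy_ratings, user=0, item=2)
print(f"Predicted rating for user 0 on movie 2: {predicted:.2f}")
```

The prediction is pulled towards user 1's rating because user 1's rating pattern is closer to user 0's. A content-based approach would instead compare movie attributes such as genres, without looking at other users.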

## Problem Statement

The primary problem addressed by this MovieLens recommender system is the information overload faced by movie enthusiasts. With the vast number of movies available, it's challenging for users to discover new content that aligns with their preferences. This can lead to wasted time browsing through irrelevant titles and a diminished overall viewing experience.

### Stakeholders

* Movie enthusiasts: The primary stakeholders are individuals who enjoy watching movies and are seeking personalized recommendations to enhance their viewing experience.

* Streaming platforms: These platforms can benefit from a recommender system by increasing user engagement and improving customer satisfaction.

* Film industry: By understanding viewer preferences through data-driven recommendations, the film industry can gain valuable insights into market trends and tailor future productions accordingly.

## Objectives

* Develop a movie recommendation system that accurately predicts user preferences by identifying the types of movies a user is likely to enjoy based on their past ratings.

* Provide personalized recommendations tailored to each user's specific preferences, using collaborative filtering techniques on their past movie ratings.

## Data Understanding

We are using the MovieLens dataset, a widely used benchmark dataset for collaborative filtering algorithms. It consists of four primary files:

* links.csv: Contains information about movie IDs, IMDb IDs, and TMDb IDs.

* movies.csv: Provides movie titles (with the release year in parentheses at the end of each title) and pipe-separated genres.

* tags.csv: Includes user-provided tags or keywords associated with movies.

* ratings.csv: Contains user ratings for movies, including user IDs, movie IDs, ratings, and timestamps.
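Since the release year is embedded in the title and the genres are stored as a single pipe-separated string, both usually need to be parsed before they can be used as features. The following is a minimal sketch on a two-row sample. The frame `movies_sample` and the new columns `year` and `genre_list` are hypothetical names, and the regular expression assumes every title ends with a four-digit year in parentheses.

```python
import pandas as pd

# Hypothetical two-row sample in the same shape as movies.csv.
movies_sample = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
    ],
})

# Capture a four-digit year in parentheses at the end of the title.
year_text = movies_sample["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies_sample["year"] = pd.to_numeric(year_text)

# Split the pipe-separated genre string into a list of genres.
movies_sample["genre_list"] = movies_sample["genres"].str.split("|")

print(movies_sample[["title", "year", "genre_list"]])
```

Titles that do not follow this pattern would get a missing year, so it is worth checking for nulls in the new column on the full file.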

This dataset is well-suited for the movie recommender system project due to the following reasons:

* Rich Feature Set: The dataset includes essential features like user ratings, movie genres, and user-provided tags, which are crucial for building effective recommendation models.

* Large Sample Size: The dataset contains a substantial number of ratings and movies, providing a sufficient basis for training and evaluating recommendation algorithms.

* Diverse Content: The dataset covers a variety of movie genres and content, ensuring that the recommender system can cater to diverse user preferences.

* Real-World Data: The data is collected from real users and reflects actual viewing behavior, making it a realistic representation of the problem domain.

### Data Limitations

While the MovieLens dataset is valuable for building a movie recommender system, it has some limitations:

* Limited Temporal Coverage: The dataset spans a specific time period, which may not capture the most recent trends or preferences.

* Cold-Start Problem: The system may struggle to provide recommendations for new users or movies with limited ratings or tags.

* Bias in Ratings: User ratings can be influenced by various factors, such as popularity bias or groupthink, which may affect the accuracy of recommendations.
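The first two limitations can be quantified directly from the ratings table. The sketch below uses a small, made-up sample (`ratings_sample` is hypothetical, with the same columns as ratings.csv) to show one possible way to measure the time span covered, the sparsity of the user-movie matrix, and how many ratings each movie has.

```python
import pandas as pd

# Hypothetical sample in the same shape as ratings.csv (timestamps are Unix seconds).
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [1, 3, 1, 2],
    "rating": [4.0, 4.0, 3.5, 2.0],
    "timestamp": [964982703, 964981247, 1445714994, 1525286001],
})

# Temporal coverage: convert Unix seconds to datetimes and take the range.
rated_at = pd.to_datetime(ratings_sample["timestamp"], unit="s")
first_rating, last_rating = rated_at.min(), rated_at.max()

# Sparsity: share of user-movie pairs that have no rating.
n_users = ratings_sample["userId"].nunique()
n_movies = ratings_sample["movieId"].nunique()
sparsity = 1 - len(ratings_sample) / (n_users * n_movies)

# Cold-start indicator: number of ratings each movie has received.
ratings_per_movie = ratings_sample.groupby("movieId")["rating"].count()

print(f"Ratings span {first_rating.date()} to {last_rating.date()}")
print(f"Sparsity: {sparsity:.2%}")
print(ratings_per_movie)
```

Movies with very few ratings in `ratings_per_movie` are the ones most exposed to the cold-start problem.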

## Data Loading

In [4]:
## Importing the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
## reading the files
links = pd.read_csv("../ml-latest-small/links.csv")
movies = pd.read_csv("../ml-latest-small/movies.csv")
ratings = pd.read_csv("../ml-latest-small/ratings.csv")
tags = pd.read_csv("../ml-latest-small/tags.csv")

In [12]:
# display the first few rows

links.head(2)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0


In [13]:
# display the first few rows

movies.head(2)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy


In [14]:
# display the first few rows

ratings.head(2)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247


In [15]:
# display the first few rows

tags.head(2)


Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996


In [18]:
# check for missing values

links.isnull().sum()

movieId    0
imdbId     0
tmdbId     8
dtype: int64

Only 8 rows in `links` are missing a `tmdbId`, so we drop those rows.

In [22]:
# dropping the missing values

links.dropna(inplace=True)

In [23]:
# re-check for missing values in links after dropping

links.isnull().sum()

movieId    0
imdbId     0
tmdbId     0
dtype: int64
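Before dropping rows, it can help to look at which movies are affected, and to drop without mutating the original frame so the step can be re-run safely. This is a sketch on hypothetical toy data (`links_sample`, `missing_rows`, and `links_clean` are illustrative names), not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same shape as links.csv, with one missing tmdbId.
links_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, np.nan, 15602.0],
})

# Inspect the affected rows first.
missing_rows = links_sample[links_sample["tmdbId"].isna()]
print(missing_rows)

# Drop only rows missing tmdbId, returning a new frame instead of using inplace.
links_clean = links_sample.dropna(subset=["tmdbId"]).copy()

# With no NaN left, the IDs can be stored as integers rather than floats.
links_clean["tmdbId"] = links_clean["tmdbId"].astype("int64")
print(links_clean.dtypes)
```

Using `subset=` makes the intent explicit: rows are removed only because `tmdbId` is missing, not because of a gap in some other column.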

In [19]:
# check for missing values

movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [20]:
# check for missing values

ratings.isnull().sum()


userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [21]:
# check for missing values

tags.isnull().sum()

userId       0
movieId      0
tag          0
timestamp    0
dtype: int64
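The per-file checks above repeat the same call four times. One possible way to condense them is to loop over a dictionary of frames. The sketch below uses hypothetical toy frames, and in the notebook the dictionary would hold `links`, `movies`, `ratings`, and `tags` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the loaded DataFrames.
frames = {
    "links": pd.DataFrame({"movieId": [1, 2], "tmdbId": [862.0, np.nan]}),
    "movies": pd.DataFrame({"movieId": [1, 2],
                            "title": ["Toy Story (1995)", "Jumanji (1995)"]}),
}

# Total number of missing cells per DataFrame.
missing_totals = {name: int(frame.isnull().sum().sum())
                  for name, frame in frames.items()}

for name, total in missing_totals.items():
    print(f"{name}: {total} missing value(s)")
```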

In [25]:
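# Merge the DataFrames on common columns.
# pd.merge defaults to an inner join, so only ratings that also have a tag
# from the same user for the same movie are kept.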
df = pd.merge(ratings, movies, on='movieId')
df = pd.merge(df, tags, on=['userId', 'movieId'])
df = pd.merge(df, links, on='movieId')
df.head()


Unnamed: 0,userId,movieId,rating,timestamp_x,title,genres,tag,timestamp_y,imdbId,tmdbId
0,336,1,4.0,1122227329,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar,1139045764,114709,862.0
1,474,1,4.0,978575760,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar,1137206825,114709,862.0
2,567,1,3.5,1525286001,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,fun,1525286013,114709,862.0
3,289,3,2.5,1143424657,Grumpier Old Men (1995),Comedy|Romance,moldy,1143424860,113228,15602.0
4,289,3,2.5,1143424657,Grumpier Old Men (1995),Comedy|Romance,old,1143424860,113228,15602.0
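Because `pd.merge` performs an inner join by default, the merge above keeps only user-movie pairs that have both a rating and a tag, and it repeats a rating once per tag (as with user 289 and movie 3 in the output). The sketch below, on hypothetical toy frames, compares the default inner join with `how="left"`, which keeps every rating and fills the tag with NaN where none exists. Which one is appropriate depends on whether tags are required downstream.

```python
import pandas as pd

# Hypothetical toy frames in the same shape as ratings.csv and tags.csv.
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 4.0, 3.5],
})
tags_sample = pd.DataFrame({
    "userId": [2, 2],
    "movieId": [1, 1],
    "tag": ["pixar", "fun"],
})

# Default inner join: only the rating by user 2 survives, once per tag.
inner = pd.merge(ratings_sample, tags_sample, on=["userId", "movieId"])

# Left join: all ratings are kept, and untagged ratings get NaN in 'tag'.
left = pd.merge(ratings_sample, tags_sample, on=["userId", "movieId"], how="left")

print(f"ratings: {len(ratings_sample)}, inner: {len(inner)}, left: {len(left)}")
print(left)
```

Comparing `len(ratings)` with `len(df)` after each merge is a quick way to see how many ratings were dropped or duplicated.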


In [27]:
# Drop the 'timestamp_x' (rating time) and 'timestamp_y' (tag time) columns
df.drop(columns=['timestamp_x', 'timestamp_y'], inplace=True)

df.head()

Unnamed: 0,userId,movieId,rating,title,genres,tag,imdbId,tmdbId
0,336,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar,114709,862.0
1,474,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar,114709,862.0
2,567,1,3.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,fun,114709,862.0
3,289,3,2.5,Grumpier Old Men (1995),Comedy|Romance,moldy,113228,15602.0
4,289,3,2.5,Grumpier Old Men (1995),Comedy|Romance,old,113228,15602.0
