<h1> Content Based Movie Recommender System </h1>

---


**Problem Statement**: Recommend movies from the dataset that are similar to a movie selected by the user.

**Data Link**: [ml-latest-small.zip](https://grouplens.org/datasets/movielens/latest/)







Solution by     : **Aditya Karanth**.

GitHub Profile  : https://github.com/Aditya-Karanth

Kaggle Profile  : https://www.kaggle.com/adityakaranth

LinkedIn Profile: https://www.linkedin.com/in/u-aditya-karanth-2206/

<h3>Project Planning :</h3> 

  - **Exploring Data -**
    - Understanding the structure of the data with *.info()*.
    - Getting unique users, titles, and genres.
    - Obtaining insights on the number of ratings, the average rating, and their relation via various graphs.
    
  - **Building a System -**
    - Creating a pivot table
    - Making suggestions using a single title. 

  - **Recommendation System -**
    - Creating a function '*similar_to*' that fetches titles similar to a user-specified movie.
    - Keyword search so the user can find a specific movie title in the dataset.
    - Suggesting similar movies along with their match percentage and number of ratings.

# Imports

In [None]:
import numpy as np
import pandas as pd
import sys

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
sns.set_context('notebook')
matplotlib.rcParams['figure.figsize'] = (15,8) 

In [None]:
# # Extracting files from Movielens Zip file
# from zipfile import ZipFile

# with ZipFile('ml-latest-small.zip', 'r') as zip:
#     # printing all the contents of the zip file
#     zip.printdir()
#     # extracting all the files
#     print('Extracting.......')
#     zip.extractall()
#     print('successful!')

# Exploring Data

In [None]:
# Only the movies.csv and ratings.csv files are used


# df_link = pd.read_csv('../input/movie-lens-latest-small-dataset-100k/ml-latest-small/links.csv')
# df_tag = pd.read_csv('../input/movie-lens-latest-small-dataset-100k/ml-latest-small/tags.csv')
df_movie = pd.read_csv('../input/movie-lens-latest-small-dataset-100k/ml-latest-small/movies.csv')
df_rating = pd.read_csv('../input/movie-lens-latest-small-dataset-100k/ml-latest-small/ratings.csv')

In [None]:
# Movie data

df_movie.info()  # info() prints its summary and returns None, so display() is not needed
print('\n\n')
df_movie.head()

In [None]:
# Ratings Data

df_rating.info()  # info() prints its summary and returns None, so display() is not needed
print('\n\n')
df_rating.head()

In [None]:
# Combining both datasets into one.
# The merged dataset has a total of 100836 records

df = pd.merge(df_rating, df_movie, on='movieId')

df.info()  # info() prints its summary and returns None, so display() is not needed
print('\n\n')
df.head()

In [None]:
print('Total number of users is {}'.format(df['userId'].nunique()))

In [None]:
print('Total number of movies is {}'.format(df['title'].nunique()))

In [None]:
# Each value in 'genres' is a pipe-separated combination, so this counts unique combinations
print('Total number of unique genre combinations is {}\n'.format(df['genres'].nunique()))
print('Top genre combinations are \n\n', df['genres'].value_counts().head())
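
The `genres` column stores pipe-separated combinations such as `Comedy|Romance`, so the counts above are per combination rather than per individual genre. A minimal sketch of counting individual genres follows; `toy_movies` is a small made-up frame used only for illustration, and the same calls could presumably be applied to `df_movie`.

```python
import pandas as pd

# Hypothetical toy data; not taken from the MovieLens files
toy_movies = pd.DataFrame({
    'title': ['A (2000)', 'B (2001)', 'C (2002)'],
    'genres': ['Comedy|Romance', 'Comedy', 'Action|Comedy|Thriller'],
})

# Split each combination into a list, give every genre its own row, then count
genre_counts = toy_movies['genres'].str.split('|').explode().value_counts()
print(genre_counts)
```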

In [None]:
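# Titles with the highest and lowest average rating.
# Selecting 'rating' before mean() avoids averaging non-numeric columns such as 'genres'.

rat_avg = df.groupby('title')['rating'].mean().sort_values(ascending=False)
rat_avg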

In [None]:
# Titles with the most and the fewest ratings.

rat_count = df.groupby('title')['rating'].count()
rat_count.sort_values(ascending=False)

In [None]:
# Creating a DataFrame from the findings above, i.e. the average rating and the number of ratings per movie.

df_rate = pd.DataFrame(rat_avg)
df_rate['no_of_ratings'] = rat_count  # aligned on the title index
df_rate
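
The two group-by steps and the manual assembly above can also be written as one named aggregation. The sketch below assumes a toy ratings frame (`toy` and `toy_rate` are illustrative names, not part of the notebook) and only shows the shape of the result.

```python
import pandas as pd

# Hypothetical toy ratings
toy = pd.DataFrame({
    'title': ['A', 'A', 'B', 'B', 'B', 'C'],
    'rating': [4.0, 5.0, 3.0, 2.0, 4.0, 1.0],
})

# One pass: average rating and number of ratings per title
toy_rate = (
    toy.groupby('title')['rating']
       .agg(rating='mean', no_of_ratings='count')
       .sort_values('rating', ascending=False)
)
print(toy_rate)
```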

## Number of Ratings

In [None]:
# Histogram: number of ratings (x) vs. count of titles falling in each bin (y)

print(df_rate['no_of_ratings'].sort_values(ascending=False))
print('\n')

plt.figure(figsize=(15,10))
df_rate['no_of_ratings'].sort_values(ascending=False).hist(bins=60)

# All 9719 titles are included here; because so many titles have
# very few ratings (e.g. 1), the histogram is hard to read.

Hence, two separate graphs are plotted:
-  Titles with about 10 or more ratings
-  Titles with fewer ratings

In [None]:
# Titles with the most ratings (about 10 or more)
# The number of ratings per title is sorted in descending order
# The top 2121 rows hold the titles with about 10 up to 329 ratings

print(df_rate['no_of_ratings'].sort_values(ascending=False)[:2121])
print('\n')

plt.figure(figsize=(15,10))
df_rate['no_of_ratings'].sort_values(ascending=False)[:2121].hist(bins=80)

In [None]:
# Titles with few ratings (about 10 or fewer)
# The number of ratings per title is sorted in descending order
# Excluding the top 2121 rows from the 9719 sorted titles
# leaves 7598 titles with about 10 ratings or fewer.

print(df_rate['no_of_ratings'].sort_values(ascending=False)[2121:])
print('\n')

plt.figure(figsize=(15,10))
df_rate['no_of_ratings'].sort_values(ascending=False)[2121:].hist(bins=80)
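
The row position 2121 is specific to this snapshot of the data. A sketch of the same split using a boolean mask on the rating count is shown below; `toy_counts` and `threshold` are illustrative stand-ins for `df_rate['no_of_ratings']` and the cut-off of 10.

```python
import pandas as pd

# Hypothetical rating counts per title
toy_counts = pd.Series(
    [329, 150, 10, 9, 3, 1, 1],
    index=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    name='no_of_ratings',
)
threshold = 10  # assumed cut-off between "popular" and "sparse" titles

popular = toy_counts[toy_counts >= threshold]
sparse = toy_counts[toy_counts < threshold]
print(len(popular), len(sparse))
```

Each part could then be plotted on its own, for example with `popular.hist(bins=80)`, without depending on where the cut-off falls in the sorted order.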

## Average Rating

In [None]:
# Average rating of titles (x) vs count(y)

df_rate['rating'].hist(bins=80)

## Number of Ratings vs Average Ratings

In [None]:
# Relation between average rating (x) and number of ratings (y)

sns.jointplot(x='rating', y='no_of_ratings', data=df_rate, alpha=0.6, kind='scatter')

# Movies with few ratings tend to have lower average ratings.
# Conversely, the more ratings a movie has, the higher its average rating tends to be.

# Building a System 

In [None]:
# Creating a pivot table with 'userId' as the index and 'title' as the columns
# Each row holds one user's ratings; movies the user has not rated are NaN

matrix = df.pivot_table(values='rating', index='userId',columns='title')
matrix
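
To make the shape of the pivot table concrete, here is a minimal sketch on made-up ratings (`toy` and `toy_matrix` are illustrative names). Unrated user/title pairs become `NaN`, and because `pivot_table` aggregates with the mean by default, duplicate ratings for the same pair would be averaged.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, title) rating
toy = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'title': ['A', 'B', 'A', 'C', 'B'],
    'rating': [4.0, 3.0, 5.0, 2.0, 1.0],
})

# Wide format: users as rows, titles as columns
toy_matrix = toy.pivot_table(values='rating', index='userId', columns='title')
print(toy_matrix)
```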

<h3> Suggesting for a Single Movie </h3>

In [None]:
# Taking 'Shawshank Redemption, The (1994)' as an example
# Extracting ratings of the movie by all users

shawshank_ratings = matrix['Shawshank Redemption, The (1994)']
shawshank_ratings

In [None]:
# Correlating the selected movie's user ratings with those of every other movie;
# highly correlated titles are treated as similar to the selected movie

shawshank_alike = matrix.corrwith(shawshank_ratings)
print(shawshank_alike.dropna().sort_values(ascending=False))
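
As a rough illustration of what `corrwith` computes, the sketch below compares it with a Pearson correlation calculated by hand over the users who rated both titles. The data is made up (`toy_matrix`, `auto`, and `manual` are illustrative names). Note how title `C` shares only two raters with `A` and so gets a perfect negative correlation, which hints at why titles with few ratings can produce unreliable matches.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-title rating matrix with missing ratings
toy_matrix = pd.DataFrame(
    {
        'A': [5.0, 4.0, 1.0, np.nan],
        'B': [4.0, 5.0, 2.0, 3.0],
        'C': [1.0, np.nan, 5.0, 4.0],
    },
    index=pd.Index([1, 2, 3, 4], name='userId'),
)

# Column-wise correlation with title A; missing pairs are skipped
auto = toy_matrix.corrwith(toy_matrix['A'])

# The same value for A vs. B, computed only over users who rated both
both = toy_matrix[['A', 'B']].dropna()
manual = np.corrcoef(both['A'], both['B'])[0, 1]

print(auto)
print(manual)
```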

In [None]:
# Many movies have very few ratings, which makes their correlations unreliable,
# so suggestions below a threshold number of ratings are excluded.

# Creating a DataFrame from the cell above and adding the no. of ratings column
shawshank_corr = pd.DataFrame(shawshank_alike,columns=['corr']).dropna()
shawshank_corr = shawshank_corr.join(df_rate['no_of_ratings'])
shawshank_corr

In [None]:
# The threshold is set to 100 (tune as needed)
# So similar movies must have more than 100 ratings
# The first row is excluded because Shawshank Redemption is perfectly correlated with itself.

print(shawshank_corr[shawshank_corr['no_of_ratings']>100].sort_values(by='corr',ascending=False)[1:].head(25))

# **Recommendation System**

## Function to Fetch Similar Movies

**Function Description**

Line by line:
1. Takes a movie name as input.
2. Stores the correlation between the selected movie and every movie in *p*.
3. Creates a temporary DataFrame from *p*, excluding null values.
4. Adds the *no_of_ratings* column, so the temporary DataFrame has two columns: *corr* and *no_of_ratings*.
5. Sets a threshold for the minimum number of ratings (*Thresh_rat*) a suggested movie must have [higher for famous movies].
6. Sets a threshold for the match percentage (*Thresh_corr*).
7. Selects similar movies based on *Thresh_rat*, excluding the selected movie itself.
8. Keeps all movies above the *Thresh_corr* match; if fewer than five movies pass, keeps the top five instead.
9. Changes the display format of the *corr* column.
10. Changes the display format of the *no_of_ratings* column.
11. Returns the similar movies.

In [None]:
def similar_to(movie_name):
    p = matrix.corrwith(matrix[movie_name])
    temp_df = pd.DataFrame(p, columns=['corr']).dropna()
    temp_df = temp_df.join(df_rate['no_of_ratings'])
    Thresh_rat = 80  # Adjusting this may give different results
    Thresh_corr = 0.50  # Minimum correlation (match) to show
    q = temp_df[temp_df['no_of_ratings'] > Thresh_rat].drop(index=movie_name, errors='ignore').sort_values(by='corr', ascending=False)  # drop the movie by name; [1:] would drop the top match if the movie itself is below Thresh_rat
    q = (q.head(5) if len(q[q['corr'] > Thresh_corr]) < 5 else q[q['corr'] > Thresh_corr]).copy()  # copy so the formatting below does not write to a slice
    q['corr'] = q['corr'].apply(lambda x: "{}{}".format(round(x * 100, 1), '% Match'))
    q['no_of_ratings'] = q['no_of_ratings'].apply(lambda x: "{}{}".format(x, ' Ratings'))
    print('Users who watched "{}" also watched these\n'.format(movie_name))
    return q

# Exclude all Warnings
import warnings
warnings.filterwarnings("ignore")
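
The thresholds inside `similar_to` are fixed in the function body. One possible variant, sketched here under the assumption that passing them as arguments is acceptable, makes the function easier to tune and to try on small data. `similar_titles`, its parameters, and the toy inputs are hypothetical and not part of the original notebook; numeric values are returned unformatted.

```python
import numpy as np
import pandas as pd

def similar_titles(ratings_matrix, rating_counts, title,
                   min_ratings=80, min_corr=0.5, fallback=5):
    """Hypothetical variant of similar_to with the thresholds as arguments."""
    corr = ratings_matrix.corrwith(ratings_matrix[title]).dropna()
    out = pd.DataFrame({'corr': corr}).join(rating_counts.rename('no_of_ratings'))
    # Remove the selected title by name, then apply the popularity threshold
    out = out.drop(index=title, errors='ignore')
    out = out[out['no_of_ratings'] > min_ratings].sort_values('corr', ascending=False)
    # Keep strong matches; fall back to the top rows if there are too few
    strong = out[out['corr'] > min_corr]
    return strong if len(strong) >= fallback else out.head(fallback)

# Hypothetical toy matrix: rows are users, columns are titles
toy_matrix = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0],
    'B': [4.0, 5.0, 2.0, 1.0],
    'C': [1.0, 2.0, 5.0, 4.0],
    'D': [5.0, np.nan, np.nan, 2.0],
})
toy_counts = toy_matrix.count()  # number of ratings per title

result = similar_titles(toy_matrix, toy_counts, 'A',
                        min_ratings=2, min_corr=0.5, fallback=1)
print(result)
```

In the toy run, `D` is filtered out for having too few ratings, `C` is negatively correlated with `A`, and only `B` remains as a match.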

## Selecting a Movie from Dataset

1. In *search*, enter any keyword or movie name as a string and run the cell.
2. Pick a movie from the displayed list.
3. Copy the desired movie title.

In [None]:
def search(x):
    # Print every unique title that contains the keyword, ignoring case
    print("\n".join(s for s in df['title'].unique() if x.lower() in s.lower()))

# Enter keyword here as search('__') and run the cell
search('figh') 
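
If the matches are needed as a list rather than printed text, a vectorized variant along the following lines may be convenient. `find_titles` and `toy_titles` are hypothetical names; with the real data one would presumably pass `df['title']`.

```python
import pandas as pd

def find_titles(titles, keyword):
    """Return unique titles containing keyword (case-insensitive, literal match)."""
    unique = pd.Series(titles).drop_duplicates()
    # regex=False treats the keyword as plain text, so characters like '(' are safe
    mask = unique.str.contains(keyword, case=False, regex=False)
    return unique[mask].tolist()

# Hypothetical titles, including a duplicate
toy_titles = ['Fight Club (1999)', 'Fighter, The (2010)',
              'Toy Story (1995)', 'Fight Club (1999)']
matches = find_titles(toy_titles, 'figh')
print(matches)
```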

## Suggested Movies


Paste the title copied above into
`similar_to('____')` and run.


In [None]:
similar_to('X-Men (2000)')

In [None]:
similar_to('Jumanji: Welcome to the Jungle (2017)')

In [None]:
similar_to('Fight Club (1999)')