<a href="https://colab.research.google.com/github/quantumhome/DataAnalysisCaseStudy/blob/master/10thMay_RecommendationEngine_Dev.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Collaborative Filtering**

**Link to dataset: https://drive.google.com/drive/folders/1lgXm5Lyqjdkisn1G9c-MqaeLT4boNIzg?usp=sharing**

## **Mounting Google Drive**

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
data_location = "/content/drive/MyDrive/Datasets/Netflix/Data for Files/combinedNetflixData.txt"

## **Step 1 - Importing Libraries**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

## **Step 2 - Data Loading**

In [None]:
df = pd.read_csv(data_location, names = ["CustID", "Ratings"], usecols = [0, 1], header = None)

In [None]:
df.head()

Unnamed: 0,CustID,Ratings
0,1:,
1,1488844,3.0
2,822109,5.0
3,885013,4.0
4,30878,4.0


### **Agenda**
**Separate the movie IDs from the customer IDs by moving the movie IDs into a new column.**

* **A row of the form `x:` marks movie ID `x`, and the rows below it are the ratings for that movie. Placing `x` in its own column makes it explicit which user rated which movie.**
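
To make the layout concrete, the sketch below rebuilds the same structure on a tiny hand-made frame and shows one possible vectorized way to carry each movie ID down to the rows beneath it. The values in `toy` are invented for illustration and are not taken from the dataset; the names `is_movie_row`, `movie_id`, and `tidy` are likewise hypothetical.

```python
import pandas as pd

# Toy frame mimicking the raw layout: an "x:" row marks a movie,
# and the rows below it are (CustID, Ratings) pairs for that movie.
toy = pd.DataFrame({
    "CustID": ["1:", "1488844", "822109", "2:", "885013"],
    "Ratings": [None, 3.0, 5.0, None, 4.0],
})

# True on the marker rows such as "1:" and "2:"
is_movie_row = toy["CustID"].str.endswith(":")

# Keep the numeric part on marker rows, NaN elsewhere, then forward-fill
movie_id = toy["CustID"].str.rstrip(":").where(is_movie_row).ffill().astype(int)

# Attach the movie ID and drop the marker rows themselves
tidy = toy.assign(MovieID=movie_id)[~is_movie_row].reset_index(drop=True)
print(tidy)
```

On the full file this kind of forward-fill is likely to be faster than a Python-level loop, but both approaches should produce the same `MovieID` column.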

## **Fetch Relevant Information**

#### **How many movies are we dealing with?**

In [None]:
# Each movie marker row ("x:") has no rating, so counting null ratings counts the movies
total_movies_in_df = df.isnull().sum()["Ratings"]

In [None]:
print(total_movies_in_df)

4499


#### **How many unique viewers do we have?**

In [None]:
# Marker rows are unique strings in CustID as well, so subtract them from the unique count
total_users_in_df = df["CustID"].nunique() - total_movies_in_df

In [None]:
print(total_users_in_df)

470758


#### **How many reviews do we have in total?**

In [None]:
# Every row has a CustID value, so remove the marker rows to get the number of ratings
total_reviews_in_df = df["CustID"].count() - total_movies_in_df

In [None]:
print(total_reviews_in_df)

24053764


**There are `4,499` movies, rated by `470,758` unique users, with `24,053,764` ratings in total.**
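
The subtractions used for these counts rely on two properties of the raw layout: each movie contributes exactly one row with a missing rating, and each marker string such as `1:` is unique. The check below replays the same arithmetic on invented data, assuming the same two-column layout and no genuinely missing ratings; the variable names are hypothetical.

```python
import pandas as pd

# Two movies ("1:" and "2:") and four distinct users (10, 20, 30, 40)
toy = pd.DataFrame({
    "CustID": ["1:", "10", "20", "2:", "10", "30", "40"],
    "Ratings": [None, 3.0, 5.0, None, 4.0, 2.0, 1.0],
})

n_movies = int(toy["Ratings"].isnull().sum())   # one null rating per marker row
n_users = toy["CustID"].nunique() - n_movies    # 6 unique strings minus 2 markers
n_reviews = toy["CustID"].count() - n_movies    # 7 rows minus 2 markers

print(n_movies, n_users, n_reviews)
```

If the real file contained a rating row with a missing value, all three counts would be off by one per such row, so the approach is only as reliable as that assumption.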

## **Step 3 - Segregation of Data**

* **Step 1 - Create a copy of the original data**

In [None]:
df_copy = df.copy()

* **Step 2 - Building the logic to separate the `:` from the movie ID**

In [None]:
# Temporary variable that tracks the movie currently being processed
curr_movie = None

# List holding, for every row, the ID of the movie that row belongs to
movie_ids = []

* **Step 3 - Implementing the logic**

In [None]:
# Walk down the CustID column, carrying the most recent movie ID forward
for cust_id in df_copy["CustID"]:
  # A value such as "1:" marks the start of a new movie
  if ":" in cust_id:
    # Strip the ":" and keep the movie ID as an integer
    curr_movie = int(cust_id.replace(":", ""))
  # Record the current movie for this row (marker rows included)
  movie_ids.append(curr_movie)

# Attach the movie ID to every row
df_copy["MovieID"] = movie_ids

# Drop the marker rows, which are the only rows without a rating
df_copy = df_copy[df_copy["Ratings"].notna()]

In [None]:
df_copy.iloc[540:600]

Unnamed: 0,CustID,Ratings,MovieID
541,548064,5.0,1
542,946102,5.0,1
543,1790158,4.0,1
544,1403184,3.0,1
545,1535440,4.0,1
546,1426604,4.0,1
547,1815755,5.0,1
549,2059652,4.0,2
550,1666394,3.0,2
551,1759415,4.0,2


## **Step 4 - Preparations for collaborative Filtering**

**Here we set some benchmarks / thresholds on the dataset, which determine the data the recommendations will be based on.**
* **Viewers who rarely rate (inactive, fake, or dummy users): users who watch movies but give no ratings, or very few, are removed.**

* **Movies with few ratings are probably not well known, so we drop them from our list.**
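
The two thresholds can be sketched on a toy frame as follows. This is an illustration under stated assumptions rather than the notebook's exact code: the helper `frequent_mask` and the toy values are hypothetical, and the 0.7 quantile is used as the cut-off for both movies and users. Both masks are computed on the unfiltered data before either is applied.

```python
import pandas as pd

def frequent_mask(frame, key, q=0.7):
    """Return a boolean mask of rows whose `key` has at least the q-quantile count."""
    per_key = frame[key].value_counts()            # ratings per movie or per user
    threshold = round(per_key.quantile(q), 0)      # rounded quantile of those counts
    return frame[key].map(per_key) >= threshold

toy = pd.DataFrame({
    "CustID":  ["a", "b", "c", "a", "a", "b"],
    "MovieID": [1, 1, 1, 2, 3, 3],
    "Ratings": [3.0, 5.0, 4.0, 2.0, 4.0, 1.0],
})

movie_ok = frequent_mask(toy, "MovieID")   # drops movie 2 (rated once)
user_ok = frequent_mask(toy, "CustID")     # drops user "c" (rated once)
filtered = toy[movie_ok & user_ok]
print(filtered)
```

Computing both masks first means the user counts are not affected by the movies that get removed, which is one reasonable design choice; filtering sequentially and recounting would give a stricter result.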

### **Benchmark / Threshold 1 - Movies**

In [None]:
df = df_copy

In [None]:
df.head()

Unnamed: 0,CustID,Ratings,MovieID
1,1488844,3.0,1
2,822109,5.0,1
3,885013,4.0,1
4,30878,4.0,1
5,823519,3.0,1


#### **Total count of reviews per movie, i.e. how many times each movie was rated**

In [None]:
movie_list = df.groupby("MovieID")["Ratings"].agg(["count"])

In [None]:
movie_list

Unnamed: 0_level_0,count
MovieID,Unnamed: 1_level_1
1,547
2,145
3,2012
4,142
5,1140
...,...
4495,614
4496,9519
4497,714
4498,269


In [None]:
movie_list["count"].quantile(0.7)

np.float64(1798.6)

**Any movie with at least 1799 ratings (the 70th-percentile value of 1798.6, rounded) is treated as a popular movie.**
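
The fractional value 1798.6 arises because `Series.quantile` interpolates linearly between neighbouring values by default. A small worked example on invented numbers shows the mechanics and the effect of rounding; the names `counts`, `q70`, and `benchmark` are hypothetical.

```python
import pandas as pd

counts = pd.Series([1, 2, 3, 4, 5])

# Position 0.7 * (5 - 1) = 2.8 in the sorted values,
# so the result lies 80% of the way from 3 to 4.
q70 = counts.quantile(0.7)

# Rounding to zero decimals gives the integer-valued cut-off
benchmark = round(q70, 0)

print(q70, benchmark)
```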

In [None]:
# Popular movies have no_of_ratings >= 1799
benchmark_Movie = round(movie_list["count"].quantile(0.7), 0)

In [None]:
benchmark_Movie

np.float64(1799.0)

In [None]:
# Movies below the benchmark, i.e. the ones that will be dropped
updated_movie_list = movie_list[movie_list["count"] < benchmark_Movie].index

In [None]:
updated_movie_list

Index([   1,    2,    4,    5,    6,    7,    9,   10,   11,   12,
       ...
       4484, 4486, 4487, 4489, 4491, 4494, 4495, 4497, 4498, 4499],
      dtype='int64', name='MovieID', length=3149)

### **Benchmark / Threshold 2 - Users**

#### **Total count of reviews per user**

In [None]:
cust_list = df.groupby("CustID")["Ratings"].agg(["count"])

In [None]:
cust_list

Unnamed: 0_level_0,count
CustID,Unnamed: 1_level_1
10,49
1000004,1
1000027,30
1000033,101
1000035,20
...,...
999964,48
999972,35
999977,14
999984,38


In [None]:
cust_list["count"].quantile(0.7)

np.float64(52.0)

**Any user with at least 52 ratings is treated as an active user.**

In [None]:
# Active users have no_of_ratings >= 52
benchmark_cust = round(cust_list["count"].quantile(0.7), 0)

In [None]:
benchmark_cust

np.float64(52.0)

In [None]:
# Users below the benchmark, i.e. the ones that will be dropped
updated_cust_list = cust_list[cust_list["count"] < benchmark_cust].index

In [None]:
updated_cust_list

Index(['10', '1000004', '1000027', '1000035', '1000038', '1000051', '1000057',
       '100006', '100007', '1000072',
       ...
       '999932', '999935', '99994', '999945', '999949', '999964', '999972',
       '999977', '999984', '999988'],
      dtype='object', name='CustID', length=327300)
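
The `dtype='object'` in this output indicates that `CustID` is stored as strings, most likely because the `x:` marker rows forced the column to be read as text. That also explains the ordering seen in the grouped output, where `'10'` sorts before `'1000004'`. One optional follow-up, sketched here on invented values, is to convert the column to integers once the marker rows are gone.

```python
import pandas as pd

toy = pd.DataFrame({
    "CustID": ["10", "1000004", "999988"],
    "Ratings": [3.0, 1.0, 4.0],
})

# Strings sort character by character, so "999988" comes last
string_order = sorted(toy["CustID"])

# After conversion the IDs sort numerically and take less memory
toy["CustID"] = toy["CustID"].astype("int64")
numeric_order = toy["CustID"].sort_values().tolist()

print(string_order)
print(numeric_order)
```

The conversion is only safe after every `x:` row has been removed, since those values cannot be parsed as integers.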

#### **Removal of Unpopular Movies and Inactive or Least Active Users**

In [None]:
# This removes the unpopular movies
df = df[~df['MovieID'].isin(updated_movie_list)]
# This removes the inactive users
df = df[~df['CustID'].isin(updated_cust_list)]

In [None]:
print(f"New Data\n---------------------------\nRows: {df.shape[0]} and Columns: {df.shape[1]}")

New Data
---------------------------
Rows: 17337458 and Columns: 3
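
As a quick sense check on how much data survives the two filters, the row count printed here can be compared with the total number of ratings reported earlier in the notebook. The variable names below are hypothetical; the two figures are copied from the notebook outputs.

```python
rows_before = 24_053_764  # total ratings before filtering
rows_after = 17_337_458   # rows left after removing unpopular movies and inactive users

retained = rows_after / rows_before
print(f"Retained {retained:.1%} of the ratings")
```

Although 70% of movies and roughly 70% of users fall below their thresholds, the retained share of ratings is much larger, which is consistent with most ratings coming from popular movies and active users.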
