## Introduction
---
- We have two datasets: one contains information about movies, and the other contains ratings from people for those movies.
- The goal is to recommend movies to users based on their past ratings.

- The movies.csv dataset contains:
    - movieId: A unique ID assigned to each movie
    - title: The title of the movie
    - genres: The genres of the movie, separated by '|'
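
Because `genres` is one `'|'`-delimited string, it usually has to be split before any per-genre analysis. The sketch below uses two rows copied from `movies.csv`. The `toy_movies`, `genre_list`, and `genre_flags` names are illustrative and not part of the notebook.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    'movieId': [1, 3],
    'title': ['Toy Story (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy', 'Comedy|Romance'],
})

# One Python list of genres per movie.
genre_list = toy_movies['genres'].str.split('|')

# One 0/1 indicator column per genre.
genre_flags = toy_movies['genres'].str.get_dummies(sep='|')
print(genre_flags.columns.tolist())
```

The recommender built later in this notebook does not use genres, so this step is optional here.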

<hr style='width : 40%;' align='left'>

- The ratings.csv dataset contains:
    - userId: A unique ID assigned to each user
    - movieId: The ID of the rated movie
    - rating: The user's rating, on a scale from 0.5 to 5 (half-star values such as 2.5 occur in the data)
    - timestamp: The UNIX timestamp of the rating
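
A UNIX timestamp counts seconds since 1970-01-01 00:00:00 UTC, so it can be turned into a readable datetime if the rating date is ever needed. A minimal sketch on two rows copied from `ratings.csv`; the `toy_ratings` frame and the `rated_at` column are hypothetical names:

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    'userId': [1, 2],
    'movieId': [169, 2571],
    'rating': [2.5, 3.5],
    'timestamp': [1204927694, 1436165433],
})

# unit='s' tells pandas the integers are seconds since the UNIX epoch (UTC).
toy_ratings['rated_at'] = pd.to_datetime(toy_ratings['timestamp'], unit='s')
print(toy_ratings[['userId', 'movieId', 'rated_at']])
```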

## Initial libraries and functions
---

In [1]:
import numpy as np
import pandas as pd

## Exploratory data analysis (EDA)
---
- Both datasets look clean: there are no missing values or duplicated rows, so very little cleaning is needed.

In [2]:
df_movies = pd.read_csv('movies.csv')
df_ratings = pd.read_csv('ratings.csv')

In [3]:
df_movies.info()
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34208 entries, 0 to 34207
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  34208 non-null  int64 
 1   title    34208 non-null  object
 2   genres   34208 non-null  object
dtypes: int64(1), object(2)
memory usage: 801.9+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3899999 entries, 0 to 3899998
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 119.0 MB


In [4]:
df_movies.isna().sum(), df_movies.duplicated().sum()

(movieId    0
 title      0
 genres     0
 dtype: int64,
 0)

In [5]:
df_ratings.isna().sum(), df_ratings.duplicated().sum()

(userId       0
 movieId      0
 rating       0
 timestamp    0
 dtype: int64,
 0)

In [6]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


## Collaborative recommendation system (user-based)
---
- A user-based collaborative recommendation system involves the following steps:
    1. Pick a dummy user who has rated some of our movies
    2. Find users who have rated the same movies as the dummy user
    3. Calculate the Pearson correlation between the dummy user's ratings and each similar user's ratings
    4. Weight every rating of the most similar users by that user's similarity, then group by movie and sum the weighted ratings and the similarities
    5. Divide each movie's sum of weighted ratings by its sum of similarities to obtain the recommendation score
    6. Create a recommendation DataFrame and recommend the movies with the highest recommendation scores
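
Steps 3 to 5 can be written compactly. The notation is introduced here for illustration and does not appear in the notebook: $d$ is the dummy user, $u$ is another user, $M_u$ is the set of movies both have rated, $r_{u,m}$ is the rating user $u$ gave movie $m$, $\bar r_d$ and $\bar r_u$ are the mean ratings over $M_u$, and $N_m$ is the set of retained similar users who rated movie $m$.

$$
\operatorname{sim}(u) = \frac{\sum_{m \in M_u} (r_{d,m} - \bar r_d)(r_{u,m} - \bar r_u)}{\sqrt{\sum_{m \in M_u} (r_{d,m} - \bar r_d)^2}\,\sqrt{\sum_{m \in M_u} (r_{u,m} - \bar r_u)^2}}
$$

$$
\operatorname{score}(m) = \frac{\sum_{u \in N_m} \operatorname{sim}(u)\, r_{u,m}}{\sum_{u \in N_m} \operatorname{sim}(u)}
$$

The score is a similarity-weighted average of ratings, so it behaves as expected only when the retained similarities are positive.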

### Preprocessing

In [8]:
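df_movies = df_movies[['movieId', 'title']].copy()  # copy so the assignment below does not target a slice
# Remove the trailing " (YYYY)" year from each title; titles without a year are left unchanged.
df_movies['title'] = df_movies['title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)

df_ratings = df_ratings.drop(columns='timestamp')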

df_movies.head()

Unnamed: 0,movieId,title
0,1,Toy Story
1,2,Jumanji
2,3,Grumpier Old Men
3,4,Waiting to Exhale
4,5,Father of the Bride Part II


### Choosing a dummy user

In [9]:
dummyuser_id = 3

# Pull out the dummy user's ratings, then remove that user from the pool of candidates.
dummyuser_ratings = df_ratings[df_ratings['userId'] == dummyuser_id].reset_index(drop=True)
df_ratings = df_ratings[df_ratings['userId'] != dummyuser_id]

dummyuser_ratings

Unnamed: 0,userId,movieId,rating
0,3,356,4.0
1,3,2394,4.0
2,3,2431,5.0
3,3,2445,4.0


### Finding similar users & calculating the Pearson correlation between the dummy user and them

In [10]:
# Keep only ratings of movies the dummy user has also rated, grouped per user.
similar_users = df_ratings[df_ratings['movieId'].isin(dummyuser_ratings['movieId'])].groupby('userId')
similar_users.head()

Unnamed: 0,userId,movieId,rating
21,4,356,2.0
355,11,2431,2.5
577,13,2394,3.0
692,14,356,5.0
786,15,356,4.5
...,...,...,...
3898246,42107,356,4.0
3898857,42116,356,4.0
3899010,42117,356,4.0
3899309,42126,356,4.0


In [11]:
# Keep the 100 users who share the most rated movies with the dummy user.
similar_users = sorted(similar_users, key=lambda group: len(group[1]), reverse=True)[:100]
similar_users[:3]

[(774,
         userId  movieId  rating
  69484     774      356     5.0
  69772     774     2394     4.0
  69777     774     2431     3.0
  69779     774     2445     1.0),
 (1250,
          userId  movieId  rating
  115209    1250      356     4.0
  115446    1250     2394     3.0
  115455    1250     2431     5.0
  115457    1250     2445     3.0),
 (1643,
          userId  movieId  rating
  149046    1643      356     3.0
  149280    1643     2394     4.5
  149290    1643     2431     0.5
  149291    1643     2445     3.5)]
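
The sorting step relies on the fact that iterating a pandas `GroupBy` yields `(key, sub-DataFrame)` tuples, which is why the key function measures `group[1]`. A small sketch on made-up ratings (the `toy` frame and `top_users` name are hypothetical), together with an alternative that counts rows per user directly:

```python
import pandas as pd

toy = pd.DataFrame({
    'userId':  [10, 10, 10, 20, 30, 30],
    'movieId': [356, 2394, 2431, 356, 356, 2445],
    'rating':  [5.0, 4.0, 3.0, 2.0, 4.5, 1.0],
})

# Each element is a (userId, frame) tuple; sort by how many common movies the frame holds.
groups = sorted(toy.groupby('userId'), key=lambda group: len(group[1]), reverse=True)
print([(user_id, len(frame)) for user_id, frame in groups])

# Alternative: count rows per user and keep the IDs of the top-N users.
top_users = toy.groupby('userId').size().nlargest(2).index.tolist()
print(top_users)
```

The `sorted` approach keeps each user's ratings attached to the ID, which the correlation step needs. The `size().nlargest()` variant only returns the IDs.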

In [12]:
# This dictionary will contain each userId as the key and its similarity as the value.
similarity = dict()

for user_id, group in similar_users:
    # Pearson correlation is undefined when a rating vector is constant (zero standard deviation),
    # so assign a similarity of 0 instead of dividing by zero.
    if group['rating'].std() == 0.:
        similarity[user_id] = 0
        continue
    
    df_temp = pd.merge(
        dummyuser_ratings,
        group,
        on='movieId',
        suffixes=(f'_{dummyuser_id}', f'_{user_id}',)
    )

    # Same check for the dummy user's ratings on the movies shared with this user.
    if df_temp[f'rating_{dummyuser_id}'].std() == 0.:
        similarity[user_id] = 0
        continue
    
    similarity[user_id] = df_temp[f'rating_{dummyuser_id}'].corr(df_temp[f'rating_{user_id}'], method='pearson')

similarity

{774: -0.09759000729485331,
 1250: 0.8703882797784891,
 1643: -0.9304340030613444,
 1658: -0.7745966692414834,
 4208: -0.5555555555555555,
 4415: 0.0,
 7734: -0.9428090415820632,
 9797: -0.7745966692414834,
 14572: -0.3779644730092272,
 16484: -0.4714045207910316,
 18506: 0.0,
 19817: -0.17407765595569782,
 22883: -0.3758230140014144,
 23216: 0.6622661785325219,
 26125: -0.13245323570650439,
 26706: -0.3333333333333333,
 31302: 0.5222329678670935,
 32417: 0.5222329678670935,
 33121: -0.13245323570650439,
 36946: 0.050443327230531826,
 37356: 0.0,
 38318: -0.2581988897471611,
 38754: 0.19245008972987526,
 17: -0.5,
 114: 0,
 558: -0.9707253433941511,
 707: 0.0,
 815: -0.5,
 1040: -0.18898223650461363,
 1361: -0.6933752452815365,
 1414: 0.5,
 1573: -0.5,
 1950: -0.5,
 2174: -0.4999999999999999,
 2204: 0.5,
 2372: -0.18898223650461363,
 2397: -1.0,
 2514: 0.3273268353539885,
 2711: 0.8660254037844387,
 2964: -0.4999999999999999,
 3220: -1.0,
 3353: -0.4999999999999999,
 3388: -0.277350098
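
To make the similarity values concrete, the sketch below recomputes the correlation for user 1250 by hand, using the four ratings shown in the outputs above (movies 356, 2394, 2431, 2445). The `pearson` helper is a hypothetical illustration, not the notebook's code. It mirrors the notebook's convention of returning 0 when a vector is constant.

```python
import numpy as np
import pandas as pd

# Ratings for movies 356, 2394, 2431, 2445, copied from the outputs above.
dummy = pd.Series([4.0, 4.0, 5.0, 4.0])
user_1250 = pd.Series([4.0, 3.0, 5.0, 3.0])

def pearson(x, y):
    """Sum of products of the centered vectors divided by the product of their norms."""
    dx, dy = x - x.mean(), y - y.mean()
    denom = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    # A constant vector has zero variance, so the correlation is undefined; use 0.0 instead.
    return 0.0 if denom == 0 else float((dx * dy).sum() / denom)

r_manual = pearson(dummy, user_1250)
r_pandas = dummy.corr(user_1250, method='pearson')
print(round(r_manual, 6), round(r_pandas, 6))
```

Both values should agree with the `0.870388...` reported for user 1250.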

In [13]:
df_similarity = pd.DataFrame()

df_similarity['userId'] = similarity.keys()
df_similarity['similarity'] = similarity.values()

# Keep the 11 most similar users.
df_similarity = df_similarity.sort_values(by='similarity', ascending=False).head(11)

df_similarity.head()

Unnamed: 0,userId,similarity
80,10243,1.0
1,1250,0.870388
59,6303,0.866025
38,2711,0.866025
72,8793,0.755929


### Building the recommender

In [14]:
df_similarity = pd.merge(df_similarity, df_ratings, on='userId')
df_similarity

Unnamed: 0,userId,similarity,movieId,rating
0,10243,1.0,1,5.0
1,10243,1.0,2,3.0
2,10243,1.0,5,3.0
3,10243,1.0,6,4.0
4,10243,1.0,10,4.0
...,...,...,...,...
9340,1414,0.5,61160,1.5
9341,1414,0.5,64993,4.0
9342,1414,0.5,67197,2.5
9343,1414,0.5,68319,4.0


In [15]:
df_similarity['weightedRating'] = df_similarity['similarity'] * df_similarity['rating']
df_similarity.head()

Unnamed: 0,userId,similarity,movieId,rating,weightedRating
0,10243,1.0,1,5.0,5.0
1,10243,1.0,2,3.0,3.0
2,10243,1.0,5,3.0,3.0
3,10243,1.0,6,4.0,4.0
4,10243,1.0,10,4.0,4.0


In [16]:
# Select the two columns before summing so unrelated columns are not aggregated.
tempsum_df = df_similarity.groupby('movieId')[['similarity', 'weightedRating']].sum()
tempsum_df.head()

movieId,similarity,weightedRating
1,6.955004,26.914734
2,6.428408,19.426614
3,2.900116,11.97843
4,1.278162,3.222825
5,4.196357,14.31973


In [17]:
df_recommendation = pd.DataFrame()

df_recommendation['recommendationScore'] = tempsum_df['weightedRating'] / tempsum_df['similarity']
df_recommendation.reset_index(inplace=True)

df_recommendation.head()

Unnamed: 0,movieId,recommendationScore
0,1,3.869838
1,2,3.021995
2,3,4.130327
3,4,2.521453
4,5,3.412419
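
The recommendation score is a similarity-weighted average of the neighbours' ratings. For movie 1 above, 26.914734 / 6.955004 ≈ 3.869838. The sketch below reproduces the same three steps (weight, group and sum, divide) on a small made-up table. All values in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical neighbours: similarity to the dummy user, movie, and rating.
toy = pd.DataFrame({
    'userId':     [1, 1, 2, 2, 3],
    'similarity': [1.0, 1.0, 0.5, 0.5, 0.8],
    'movieId':    [10, 20, 10, 20, 10],
    'rating':     [5.0, 3.0, 2.0, 4.0, 4.0],
})
toy['weightedRating'] = toy['similarity'] * toy['rating']

# Per movie: sum of similarities and sum of similarity-weighted ratings.
sums = toy.groupby('movieId')[['similarity', 'weightedRating']].sum()

# Weighted average rating per movie.
scores = (sums['weightedRating'] / sums['similarity']).rename('recommendationScore')
print(scores)
```

Movie 10 is rated by all three neighbours, so its score is (5.0 + 1.0 + 3.2) / 2.3 = 4.0. Movie 20 is rated by two, giving 5.0 / 1.5 ≈ 3.33.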


In [18]:
df_recommendation = df_recommendation.sort_values(by='recommendationScore', ascending=False).head(20)
df_recommendation

Unnamed: 0,movieId,recommendationScore
1808,3204,5.0
1316,2375,5.0
2246,4066,5.0
1050,1950,5.0
1047,1945,5.0
1044,1939,5.0
1043,1937,5.0
1926,3447,5.0
3342,30848,5.0
3324,27800,5.0


In [19]:
df_movies[df_movies['movieId'].isin(df_recommendation['movieId'])]

Unnamed: 0,movieId,title
464,468,Englishman Who Went Up a Hill But Came Down a ...
494,498,Mr. Jones
500,504,No Escape
1736,1814,"Price Above Rubies, A"
1854,1937,Going My Way
1856,1939,"Best Years of Our Lives, The"
1862,1945,On the Waterfront
1867,1950,In the Heat of the Night
2272,2356,Celebrity
2291,2375,"Money Pit, The"


---
<center>
    <h3>
        <i>
            This concludes the notebook. Feel free to reach out with any questions or suggestions!
        </i>
    </h3>
</center>