## Introduction
---
- We have two datasets: one contains information about movies, and the other contains users' ratings of those movies.
- The goal is to recommend movies to users based on their past ratings.

- The movies.csv dataset contains:
    - movieId: A unique ID assigned to each movie
    - title: The title of the movie
    - genres: The genres of the movie, separated by '|'

<hr style='width : 40%;' align='left'>

- The ratings.csv dataset contains:
    - userId: A unique ID assigned to each user
    - movieId: The ID of the rated movie
    - rating: The user's rating, which is between 1 and 5 (half-point values such as 2.5 also occur)
    - timestamp: The UNIX timestamp of the rating
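
The two files are linked through `movieId`. As a minimal sketch, assuming small hypothetical frames that mimic the two schemas (`movies_sample`, `ratings_sample`, and the user IDs are illustrative and not taken from the real files), the ratings can be joined to the titles and the UNIX timestamp turned into a readable date:

```python
import pandas as pd

# Hypothetical rows that mimic the schema of movies.csv and ratings.csv.
movies_sample = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Adventure|Children|Fantasy"],
})
ratings_sample = pd.DataFrame({
    "userId": [10, 10, 11],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
    "timestamp": [1204927694, 1204927438, 1436165433],
})

# movieId is the shared key between the two tables.
joined = ratings_sample.merge(movies_sample, on="movieId", how="left")

# Timestamps are seconds since the UNIX epoch.
joined["rated_at"] = pd.to_datetime(joined["timestamp"], unit="s")

# The genres string can be split on '|' into one list per row.
joined["genre_list"] = joined["genres"].str.split("|")

print(joined[["userId", "title", "rating", "rated_at"]])
```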

## Initial libraries and functions
---

In [1]:
import numpy as np
import pandas as pd

## Exploratory data analysis (EDA)
---
- Both datasets look clean (no missing values and no duplicate rows), so they need very little cleaning.

In [2]:
df_movies = pd.read_csv('movies.csv')
df_ratings = pd.read_csv('ratings.csv')

In [3]:
df_movies.info()
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34208 entries, 0 to 34207
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  34208 non-null  int64 
 1   title    34208 non-null  object
 2   genres   34208 non-null  object
dtypes: int64(1), object(2)
memory usage: 801.9+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3899999 entries, 0 to 3899998
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 119.0 MB


In [4]:
df_movies.isna().sum(), df_movies.duplicated().sum()

(movieId    0
 title      0
 genres     0
 dtype: int64,
 0)

In [5]:
df_ratings.isna().sum(), df_ratings.duplicated().sum()

(userId       0
 movieId      0
 rating       0
 timestamp    0
 dtype: int64,
 0)
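
The two checks above can be wrapped in a small helper. This is only a sketch: `quality_report` and the toy frame are hypothetical names introduced here. It also counts repeated `(userId, movieId)` pairs, which a full-row duplicate check would miss if the same user rated a movie twice with different values.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, key_cols: list) -> dict:
    """Return simple cleanliness counts for a DataFrame."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # Rows whose key columns repeat an earlier row, whatever the other values are.
        "repeated_keys": int(df.duplicated(subset=key_cols).sum()),
    }

# Hypothetical frame: one exact duplicate row, one missing rating,
# and one user rating the same movie twice with different values.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2],
    "movieId": [10, 10, 20, 30, 30],
    "rating": [4.0, 4.0, None, 3.5, 2.0],
})

report = quality_report(toy, key_cols=["userId", "movieId"])
print(report)
```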

In [6]:
df_movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [7]:
df_ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 169 | 2.5 | 1204927694 |
| 1 | 1 | 2471 | 3.0 | 1204927438 |
| 2 | 1 | 48516 | 5.0 | 1204927435 |
| 3 | 2 | 2571 | 3.5 | 1436165433 |
| 4 | 2 | 109487 | 4.0 | 1436165496 |


## Content-based recommendation system
---
- For a content-based recommendation system, we need to:
    1. One-hot encode the genres of each movie
    2. Pick a dummy user who has rated some of our movies
    3. Compute the dummy user's genre weights from their ratings
    4. Calculate the dummy user's scores for all movies
    5. Recommend the movies with the highest scores
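
The five steps above can be sketched end to end on toy data before applying them to the real files. Everything in this block (the three movies, the two ratings, and the variable names) is hypothetical and only illustrates the flow; it is not the notebook's exact code.

```python
import pandas as pd

# Hypothetical catalogue: three movies with '|'-separated genres.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A", "B", "C"],
    "genres": ["Comedy|Drama", "Action", "Drama"],
})
# Hypothetical ratings from a single user (step 2: the dummy user).
ratings = pd.DataFrame({"movieId": [1, 2], "rating": [5.0, 1.0]})

# Step 1: one-hot encode the genres, indexed by movieId.
genre_matrix = movies["genres"].str.get_dummies(sep="|")
genre_matrix.index = movies["movieId"]

# Step 3: genre weight = sum of the user's ratings over the rated movies in that genre.
rated = genre_matrix.loc[ratings["movieId"]]
weights = rated.T.dot(ratings["rating"].to_numpy())

# Step 4: score every movie as the normalized weighted sum of its genres.
scores = genre_matrix.dot(weights) / weights.sum()

# Step 5: rank movies from highest to lowest score.
ranking = scores.sort_values(ascending=False)
print(ranking)
```

Here the user's high rating for a Comedy|Drama movie pushes the unseen Drama movie (`movieId` 3) above the Action movie.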

### Preprocessing

In [8]:
df_ratings.drop('timestamp', axis=1, inplace=True)

# Remove a trailing "(year)" from the title; titles without a year are left unchanged.
df_movies['title'] = df_movies['title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)

# One-hot encoding takes three steps because each movie has a list of genres rather than a single one:
# split the '|'-separated string, turn each list into a row of 1s, then fill the missing genres with 0.
df_movies['genres'] = df_movies['genres'].str.split('|')
df_genres = df_movies['genres'].apply(lambda genres: pd.Series(1, index=genres)).fillna(0)
df_movies = pd.concat([df_movies.drop('genres', axis=1), df_genres], axis=1)

df_movies.head()

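| | movieId | title | Adventure | Animation | Children | Comedy | Fantasy | Romance | Drama | Action | ... | Horror | Mystery | Sci-Fi | IMAX | Documentary | War | Musical | Western | Film-Noir | (no genres listed) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2 | Jumanji | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 3 | Grumpier Old Men | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 4 | Waiting to Exhale | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 5 | Father of the Bride Part II | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |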
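
Building one `pd.Series` per row can be slow on a catalogue of this size. One possible alternative, shown here on two hypothetical rows in the raw `movies.csv` format, is `Series.str.get_dummies`, which splits and encodes in a single call. Note that it returns integer columns in alphabetical order, so the result would not match the float columns and column order shown above exactly.

```python
import pandas as pd

# Hypothetical rows in the raw movies.csv format (genres still a '|'-separated string).
raw = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
})

# Split on '|' and build one 0/1 indicator column per genre.
genre_dummies = raw["genres"].str.get_dummies(sep="|")

# Attach the indicator columns in place of the original genres column.
encoded = pd.concat([raw.drop(columns="genres"), genre_dummies], axis=1)
print(encoded)
```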


### Choosing a dummy user

In [9]:
user_ratings = df_ratings[df_ratings['userId'] == 3].reset_index(drop=True)
user_ratings

| | userId | movieId | rating |
|---|---|---|---|
| 0 | 3 | 356 | 4.0 |
| 1 | 3 | 2394 | 4.0 |
| 2 | 3 | 2431 | 5.0 |
| 3 | 3 | 2445 | 4.0 |


In [10]:
user_movies = df_movies[df_movies['movieId'].isin(user_ratings['movieId'])].copy()
print(user_movies.title)

user_movies.drop(['movieId', 'title'], axis=1, inplace=True)

user_movies.T

352             Forrest Gump
2310    Prince of Egypt, The
2347             Patch Adams
2361          At First Sight
Name: title, dtype: object


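| | 352 | 2310 | 2347 | 2361 |
|---|---|---|---|---|
| Adventure | 0.0 | 0.0 | 0.0 | 0.0 |
| Animation | 0.0 | 1.0 | 0.0 | 0.0 |
| Children | 0.0 | 0.0 | 0.0 | 0.0 |
| Comedy | 1.0 | 0.0 | 1.0 | 0.0 |
| Fantasy | 0.0 | 0.0 | 0.0 | 0.0 |
| Romance | 1.0 | 0.0 | 0.0 | 0.0 |
| Drama | 1.0 | 0.0 | 1.0 | 1.0 |
| Action | 0.0 | 0.0 | 0.0 | 0.0 |
| Crime | 0.0 | 0.0 | 0.0 | 0.0 |
| Thriller | 0.0 | 0.0 | 0.0 | 0.0 |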


### Computing the dummy user's genre weights

In [11]:
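# Sum of the user's ratings per genre. The dot product pairs movies and ratings by position,
# so user_movies and user_ratings must list the same movies in the same order.
weights = user_movies.T.dot(user_ratings['rating'].values)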
weights

Adventure              0.0
Animation              4.0
Children               0.0
Comedy                 9.0
Fantasy                0.0
Romance                4.0
Drama                 13.0
Action                 0.0
Crime                  0.0
Thriller               0.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 0.0
IMAX                   0.0
Documentary            0.0
War                    4.0
Musical                4.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
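
The dot product above pairs each movie with a rating by position, which works here because both frames list the four movies in the same order. When that ordering is not guaranteed, one option is to align on `movieId` explicitly. A minimal sketch on hypothetical data (the frames and names below are illustrative, not from the notebook):

```python
import pandas as pd

# Hypothetical one-hot genre table.
genres = pd.DataFrame(
    {"movieId": [10, 20, 30], "Comedy": [1, 0, 1], "Drama": [0, 1, 1]}
)
# Hypothetical ratings, deliberately listed in a different order than the genre table.
user = pd.DataFrame({"movieId": [30, 10], "rating": [5.0, 2.0]})

# Join on movieId so every rating sits next to the genres of the same movie.
aligned = user.merge(genres, on="movieId", how="inner")

# Multiply each genre indicator by that row's rating, then sum per genre.
genre_cols = ["Comedy", "Drama"]
weights = aligned[genre_cols].mul(aligned["rating"], axis=0).sum()
print(weights)
```

Because the join is keyed on `movieId`, the result does not depend on the row order of either frame.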

### Building the recommender

In [12]:
df_contentonly = df_movies.drop(['movieId', 'title'], axis=1)

In [13]:
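# Score each movie as the weighted sum of its genres, normalized by the total weight (range 0 to 1).
recommendation_scores = (df_contentonly * weights).sum(axis=1) / (weights.sum())
recommendation_scores = recommendation_scores.sort_values(ascending=False)

# Keep the ten highest scores; the 'index' column holds the row labels of df_movies.
top_recommends = pd.DataFrame({'score': recommendation_scores.head(10)}).reset_index()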
top_recommends

| | index | score |
|---|---|---|
| 0 | 10549 | 0.894737 |
| 1 | 8605 | 0.894737 |
| 2 | 1824 | 0.894737 |
| 3 | 2918 | 0.789474 |
| 4 | 20343 | 0.789474 |
| 5 | 26285 | 0.789474 |
| 6 | 5843 | 0.789474 |
| 7 | 20494 | 0.789474 |
| 8 | 8056 | 0.789474 |
| 9 | 5829 | 0.789474 |
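
To see where a score such as 0.894737 comes from, the formula can be checked by hand with the non-zero weights printed earlier, which sum to 38. The sketch below uses a hypothetical movie whose genre set was chosen for illustration; it does not claim to be the genre list of any title in the results.

```python
import pandas as pd

# Non-zero genre weights copied from the weights output earlier in the notebook.
weights = pd.Series(
    {"Animation": 4.0, "Comedy": 9.0, "Romance": 4.0, "Drama": 13.0, "War": 4.0, "Musical": 4.0}
)

# Hypothetical movie tagged with every weighted genre except War.
movie_genres = pd.Series(
    {"Animation": 1, "Comedy": 1, "Romance": 1, "Drama": 1, "War": 0, "Musical": 1}
)

# Score = sum of the weights of the movie's genres / sum of all weights = 34 / 38.
score = (movie_genres * weights).sum() / weights.sum()
print(round(score, 6))  # 0.894737
```

By the same arithmetic, a movie that misses two of the genres weighted 4 would score 30 / 38, which is about 0.789474.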


In [14]:
df_recommendation = pd.concat(
    [
        df_movies.loc[top_recommends['index'].values].reset_index()[['movieId', 'title']],
        top_recommends['score']
    ],
    axis=1
)

df_recommendation

| | movieId | title | score |
|---|---|---|---|
| 0 | 39768 | Life is a Miracle (Zivot je cudo) | 0.894737 |
| 1 | 26093 | Wonderful World of the Brothers Grimm, The | 0.894737 |
| 2 | 1907 | Mulan | 0.894737 |
| 3 | 3003 | Train of Life (Train de vie) | 0.789474 |
| 4 | 99731 | Dabangg 2 | 0.789474 |
| 5 | 122240 | Mister Magoo's Christmas Carol | 0.789474 |
| 6 | 5941 | Drumline | 0.789474 |
| 7 | 100208 | Music Man, The | 0.789474 |
| 8 | 8738 | Woman Is a Woman, A (femme est une femme, Une) | 0.789474 |
| 9 | 5927 | Best Little Whorehouse in Texas, The | 0.789474 |
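
The ranking above scores every movie in the catalogue, including any the user has already rated. A possible refinement, sketched here on hypothetical scores and IDs, drops those movies before taking the top results:

```python
import pandas as pd

# Hypothetical recommendation scores indexed by movieId.
scores = pd.Series({101: 0.9, 102: 0.8, 103: 0.7, 104: 0.6})
# Hypothetical set of movies the user has already rated.
already_rated = {101, 103}

# Keep only unseen movies, then take the k best.
unseen = scores[~scores.index.isin(already_rated)]
top_k = unseen.nlargest(2)
print(top_k)
```

In the notebook's setting, the set of rated movies could be taken from `user_ratings['movieId']`, provided the scores are indexed by `movieId` rather than by row position.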


---
<center>
    <h3>
        <i>
            This concludes the notebook. Feel free to reach out with any questions or suggestions!
        </i>
    </h3>
</center>