# Recommendation System - Movie Recommendation
## This notebook outlines the concepts involved in building a complete recommendation system that recommends movies to users
## Movie Recommender System - A very simple clone of Netflix

We use the **MovieLens dataset** to build a model that **recommends movies** to end users. The data was collected by the GroupLens Research Project at the University of Minnesota.

This dataset consists of:
- **100,000 ratings** (1-5) from **943 users** on **1682 movies**
- Demographic information of the users (age, gender, occupation, etc.)
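
These counts imply that most user/movie pairs have no rating. A quick back-of-the-envelope check, using only the figures quoted above (the variable names are illustrative):

```python
# Counts quoted in the dataset description above.
n_ratings = 100_000
n_users = 943
n_movies = 1682

# Every user/movie pair is a possible rating.
possible_ratings = n_users * n_movies
density = n_ratings / possible_ratings
sparsity = 1 - density

print(f"Possible ratings: {possible_ratings:,}")
print(f"Density:  {density:.2%}")
print(f"Sparsity: {sparsity:.2%}")
```

Only about 6% of the possible ratings exist, so a rating matrix built from this data is mostly empty.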

Dataset:

### Import the libraries

In [117]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import os

### Download the dataset

In [118]:
!wget https://raw.githubusercontent.com/subashgandyer/datasets/main/ml-100k/ml-100k.zip

--2024-03-27 20:01:50--  https://raw.githubusercontent.com/subashgandyer/datasets/main/ml-100k/ml-100k.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4953825 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip.2’


2024-03-27 20:01:50 (17.5 MB/s) - ‘ml-100k.zip.2’ saved [4953825/4953825]



In [119]:
#!unzip ml-100k.zip

Archive:  ml-100k.zip
replace allbut.pl? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

### Load the dataset
### Reading users file
- u.user

- Check the column names in the README file
- The file has no header row, so pass the column names explicitly when reading it with pandas
- Use the following columns
    - 'user_id', 'age', 'sex', 'occupation', 'zip_code'
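
To make the file layout concrete, here is a minimal standard-library sketch of parsing pipe-separated lines that have no header row. The two sample lines are an assumption: they are written in the `u.user` layout using values from the first two users, not read from the file itself.

```python
import csv
import io

# Column names from the README, as listed above.
user_columns = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

# Two sample lines in the assumed u.user layout (pipe-separated, no header row).
sample = "1|24|M|technician|85711\n2|53|F|other|94043\n"

# Attach the column names ourselves, since the file does not carry them.
reader = csv.reader(io.StringIO(sample), delimiter='|')
users = [dict(zip(user_columns, row)) for row in reader]

for user in users:
    print(user)
```

`pd.read_csv(..., sep='|', names=...)` does the same job and additionally infers numeric types; in this sketch every field stays a string.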

In [120]:
# Define column names for users file
user_columns = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

# Read the users file
users_df = pd.read_csv('u.user', sep='|', names=user_columns)



### Display the user data

In [121]:
# Display the first few rows of the users dataframe
print(users_df.head())

   user_id  age sex  occupation zip_code
0        1   24   M  technician    85711
1        2   53   F       other    94043
2        3   23   M      writer    32067
3        4   24   M  technician    43537
4        5   33   F       other    15213


### Reading ratings file
- u.data
- Use the following columns
    - 'user_id', 'movie_id', 'rating', 'unix_timestamp'
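
The `unix_timestamp` column stores seconds since 1970-01-01 UTC. A small sketch of converting one value to a readable date, using a timestamp from the first row of the ratings file as printed in this notebook:

```python
from datetime import datetime, timezone

# A unix_timestamp value from the first row of the ratings data.
unix_timestamp = 881250949

# Interpret the value as seconds since 1970-01-01 UTC.
rated_at = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)
print(rated_at.isoformat())
```

With pandas, `pd.to_datetime(ratings_df['unix_timestamp'], unit='s')` converts the whole column in one call.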

In [122]:
# Define column names for ratings file
ratings_columns = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

# Read the ratings file
ratings_df = pd.read_csv('u.data', sep='\t', names=ratings_columns)



### Display the Ratings data

In [123]:
# Display the first few rows of the ratings dataframe
print(ratings_df.head())

   user_id  movie_id  rating  unix_timestamp
0      196       242       3       881250949
1      186       302       3       891717742
2       22       377       1       878887116
3      244        51       2       880606923
4      166       346       1       886397596


### Reading items file
- u.item
- Use the following columns
    - 'movie_id', 'movie_title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown', 'Action', 'Adventure',
      'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
      'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'

In [124]:
# Define column names for items file
item_columns = ['movie_id', 'movie_title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown', 'Action',
                'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
                'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

# Read the items file
items_df = pd.read_csv('u.item', sep='|', names=item_columns, encoding='latin-1')



### Display the Items data

In [125]:
# Display the first few rows of the items dataframe
print(items_df.head())

   movie_id        movie_title release_date  video_release_date  \
0         1   Toy Story (1995)  01-Jan-1995                 NaN   
1         2   GoldenEye (1995)  01-Jan-1995                 NaN   
2         3  Four Rooms (1995)  01-Jan-1995                 NaN   
3         4  Get Shorty (1995)  01-Jan-1995                 NaN   
4         5     Copycat (1995)  01-Jan-1995                 NaN   

                                            IMDb_URL  unknown  Action  \
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...        0       0   
1  http://us.imdb.com/M/title-exact?GoldenEye%20(...        0       1   
2  http://us.imdb.com/M/title-exact?Four%20Rooms%...        0       0   
3  http://us.imdb.com/M/title-exact?Get%20Shorty%...        0       1   
4  http://us.imdb.com/M/title-exact?Copycat%20(1995)        0       0   

   Adventure  Animation  Children's  ...  Fantasy  Film-Noir  Horror  Musical  \
0          0          1           1  ...        0          0       0        0

### Reading Training and Testing Ratings data
- Training
    - ua.base
- Testing
    - ua.test
- Use the following columns
    - 'user_id', 'movie_id', 'rating', 'unix_timestamp'

In [126]:
# Define column names for training and testing ratings files
ratings_columns = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

# Read the training ratings file
train_ratings_df = pd.read_csv('ua.base', sep='\t', names=ratings_columns)

# Read the testing ratings file
test_ratings_df = pd.read_csv('ua.test', sep='\t', names=ratings_columns)



### Display the Training and Testing Ratings data

In [127]:
# Display the first few rows of the training ratings dataframe
print("Training Ratings:")
print(train_ratings_df.head())

# Display the first few rows of the testing ratings dataframe
print("\nTesting Ratings:")
print(test_ratings_df.head())

Training Ratings:
   user_id  movie_id  rating  unix_timestamp
0        1         1       5       874965758
1        1         2       3       876893171
2        1         3       4       878542960
3        1         4       3       876893119
4        1         5       3       889751712

Testing Ratings:
   user_id  movie_id  rating  unix_timestamp
0        1        20       4       887431883
1        1        33       4       878542699
2        1        61       4       878542420
3        1       117       3       874965739
4        1       155       2       878542201


### How many unique users?

In [128]:
# Calculate the number of unique users
unique_users = ratings_df['user_id'].nunique()

print("Number of unique users:", unique_users)

Number of unique users: 943


### How many unique items / movies?

In [129]:
# Calculate the number of unique items/movies
unique_movies = items_df['movie_id'].nunique()

print("Number of unique movies:", unique_movies)

Number of unique movies: 1682


### Create a User-Item Matrix
- pivot table
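
As a plain-Python picture of what the pivot does, the sketch below turns a few (user, movie, rating) triples into a zero-filled grid with movies as rows and users as columns. The triples are hypothetical toy values, not rows from the dataset.

```python
# (user_id, movie_id, rating) triples: a hypothetical toy sample, not real data.
ratings = [
    (1, 10, 5),
    (1, 20, 3),
    (2, 10, 4),
    (3, 30, 2),
]

user_ids = sorted({u for u, _, _ in ratings})
movie_ids = sorted({m for _, m, _ in ratings})

# Start from zeros, mirroring fill_value=0: a 0 means "no rating recorded".
matrix = {m: {u: 0 for u in user_ids} for m in movie_ids}
for user_id, movie_id, rating in ratings:
    matrix[movie_id][user_id] = rating

# One row per movie, one column per user.
for movie_id in movie_ids:
    print(movie_id, [matrix[movie_id][u] for u in user_ids])
```

Note that a filled-in 0 is indistinguishable from a genuinely low score for any downstream model, which is a known trade-off of this simple encoding.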

In [130]:
# Build the rating matrix with a pivot table: one row per movie, one column per user.
# fill_value=0 marks user/movie pairs that have no rating.
user_item_matrix = ratings_df.pivot_table(index='movie_id', columns='user_id', values='rating', fill_value=0)

# Display the User-Item matrix
print("User-Item Matrix:")
print(user_item_matrix)

User-Item Matrix:
user_id   1    2    3    4    5    6    7    8    9    10   ...  934  935  \
movie_id                                                    ...             
1           5    4    0    0    4    4    0    0    0    4  ...    2    3   
2           3    0    0    0    3    0    0    0    0    0  ...    4    0   
3           4    0    0    0    0    0    0    0    0    0  ...    0    0   
4           3    0    0    0    0    0    5    0    0    4  ...    5    0   
5           3    0    0    0    0    0    0    0    0    0  ...    0    0   
...       ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   
1678        0    0    0    0    0    0    0    0    0    0  ...    0    0   
1679        0    0    0    0    0    0    0    0    0    0  ...    0    0   
1680        0    0    0    0    0    0    0    0    0    0  ...    0    0   
1681        0    0    0    0    0    0    0    0    0    0  ...    0    0   
1682        0    0    0    0    0    0    0    0    0    0

In [131]:
# user_movies_data

## 1. Content Filtering

### Data Preparation
Build a text metadata field for each movie so that a TF-IDF vectorizer can be applied for content filtering.

In [132]:
# Select the movie id, the title, and the one-hot genre columns.
# .copy() gives an independent DataFrame, so the assignment below does not
# trigger SettingWithCopyWarning.
genre_columns = ['unknown', 'Action', 'Adventure', 'Animation', "Children's",
                 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical',
                 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
metadata_df = items_df[['movie_id', 'movie_title'] + genre_columns].copy()

# Encode the genre flags of each movie as one '|'-separated string
metadata_df['genres'] = metadata_df[genre_columns].astype(str).agg('|'.join, axis=1)

# Drop only the individual genre columns, keeping the new 'genres' column
metadata_df = metadata_df.drop(columns=genre_columns)

# Display the metadata dataframe
print("Metadata for TF-IDF Vectorizer:")
print(metadata_df.head())

In [133]:
# Define the items dataframe
items = pd.read_csv('u.item', sep='|', names=item_columns, encoding='latin-1')

# Create a metadata column by combining movie_title with the genre flags
# (item_columns[5:] are the genre columns, from 'unknown' to 'Western')
items['metadata'] = items['movie_title'] + ' ' + items[item_columns[5:]].astype(str).agg(' '.join, axis=1)

# Access the 'Action' column
print(items['Action'])

# Display the updated items dataframe with the metadata column
print(items)

0       0
1       1
2       0
3       1
4       0
       ..
1677    0
1678    0
1679    0
1680    0
1681    0
Name: Action, Length: 1682, dtype: int64
      movie_id                                movie_title release_date  \
0            1                           Toy Story (1995)  01-Jan-1995   
1            2                           GoldenEye (1995)  01-Jan-1995   
2            3                          Four Rooms (1995)  01-Jan-1995   
3            4                          Get Shorty (1995)  01-Jan-1995   
4            5                             Copycat (1995)  01-Jan-1995   
...        ...                                        ...          ...   
1677      1678                          Mat' i syn (1997)  06-Feb-1998   
1678      1679                           B. Monkey (1998)  06-Feb-1998   
1679      1680                       Sliding Doors (1998)  01-Jan-1998   
1680      1681                        You So Crazy (1994)  01-Jan-1994   
1681      1682  Scream of Stone (Sc

In [134]:
items.Action, type(items.Action)

(0       0
 1       1
 2       0
 3       1
 4       0
        ..
 1677    0
 1678    0
 1679    0
 1680    0
 1681    0
 Name: Action, Length: 1682, dtype: int64,
 pandas.core.series.Series)

In [135]:
def metadata_Action(x):
    if x == 1:
        return "Action"
    else:
        return " "

In [136]:
items['metadata_Action'] = items.Action.apply(metadata_Action)

In [137]:
items

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,...,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,metadata,metadata_Action
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,Toy Story (1995) 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0...,
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,1,0,0,GoldenEye (1995) 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0...,Action
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,1,0,0,Four Rooms (1995) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...,
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,Get Shorty (1995) 0 1 0 0 0 1 0 0 1 0 0 0 0 0 ...,Action
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,1,0,0,Copycat (1995) 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Mat' i syn (1997) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ...,
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,...,0,0,0,1,0,1,0,0,B. Monkey (1998) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1...,
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,...,0,0,0,1,0,0,0,0,Sliding Doors (1998) 0 0 0 0 0 0 0 0 1 0 0 0 0...,
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,You So Crazy (1994) 0 0 0 0 0 1 0 0 0 0 0 0 0 ...,


In [138]:
def metadata_Adventure(x):
    if x == 1:
        return " Adventure "
    else:
        return " "

items['metadata_Adventure'] = items.Adventure.apply(metadata_Adventure)

In [139]:
items

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,...,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,metadata,metadata_Action,metadata_Adventure
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,Toy Story (1995) 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0...,,
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,1,0,0,GoldenEye (1995) 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0...,Action,Adventure
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,1,0,0,Four Rooms (1995) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...,,
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,Get Shorty (1995) 0 1 0 0 0 1 0 0 1 0 0 0 0 0 ...,Action,
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,1,0,0,Copycat (1995) 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0...,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,...,0,0,0,0,0,0,0,Mat' i syn (1997) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ...,,
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,...,0,0,1,0,1,0,0,B. Monkey (1998) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1...,,
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,...,0,0,1,0,0,0,0,Sliding Doors (1998) 0 0 0 0 0 0 0 0 1 0 0 0 0...,,
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,...,0,0,0,0,0,0,0,You So Crazy (1994) 0 0 0 0 0 1 0 0 0 0 0 0 0 ...,,


In [140]:
genres = ['Action', 'Adventure',
'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

In [141]:
def create_metadata_column(df, genre):
    """
    Creates a metadata column for the specified genre in the dataframe.

    Args:
    - df: DataFrame
        The dataframe to which the metadata column will be added.
    - genre: str
        The genre for which the metadata column will be created.

    Returns:
    - None
    """
    def metadata_func(x):
        if x == 1:
            return f" {genre} "
        else:
            return " "

    column_name = f"metadata_{genre.replace('-', '').replace('/', '')}"  # Generate column name
    df[column_name] = df[genre].apply(metadata_func)

# List of genres
genres = ['Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

# Create metadata columns for each genre
for genre in genres:
    create_metadata_column(items, genre)

# Display the updated items dataframe with metadata columns
print(items.head())


   movie_id        movie_title release_date  video_release_date  \
0         1   Toy Story (1995)  01-Jan-1995                 NaN   
1         2   GoldenEye (1995)  01-Jan-1995                 NaN   
2         3  Four Rooms (1995)  01-Jan-1995                 NaN   
3         4  Get Shorty (1995)  01-Jan-1995                 NaN   
4         5     Copycat (1995)  01-Jan-1995                 NaN   

                                            IMDb_URL  unknown  Action  \
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...        0       0   
1  http://us.imdb.com/M/title-exact?GoldenEye%20(...        0       1   
2  http://us.imdb.com/M/title-exact?Four%20Rooms%...        0       0   
3  http://us.imdb.com/M/title-exact?Get%20Shorty%...        0       1   
4  http://us.imdb.com/M/title-exact?Copycat%20(1995)        0       0   

   Adventure  Animation  Children's  ...  metadata_Fantasy  metadata_FilmNoir  \
0          0          1           1  ...                                     

In [142]:
items

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,...,metadata_Fantasy,metadata_FilmNoir,metadata_Horror,metadata_Musical,metadata_Mystery,metadata_Romance,metadata_SciFi,metadata_Thriller,metadata_War,metadata_Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,,,,,,,,,,
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,,,,,,,,Thriller,,
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,,,,,,,,Thriller,,
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,,,,,,,,,,
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,,,,,,,,Thriller,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,...,,,,,,,,,,
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,...,,,,,,Romance,,Thriller,,
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,...,,,,,,Romance,,,,
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,...,,,,,,,,,,


In [143]:
# Create a list of available metadata columns
available_metadata_columns = [col for col in items.columns if col.startswith('metadata_')]

# Concatenate all available metadata columns for each genre
items['full_metadata'] = items[available_metadata_columns].apply(lambda x: ' '.join(x), axis=1)

# Display the updated items dataframe with the 'full_metadata' column
print(items[['movie_id', 'movie_title', 'full_metadata']].head())


   movie_id        movie_title  \
0         1   Toy Story (1995)   
1         2   GoldenEye (1995)   
2         3  Four Rooms (1995)   
3         4  Get Shorty (1995)   
4         5     Copycat (1995)   

                                       full_metadata  
0       Animation   Children's   Comedy          ...  
1   Action   Adventure                           ...  
2                                      Thriller       
3   Action         Comedy       Drama            ...  
4             Crime     Drama                 Thr...  
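
The per-genre columns above are padded with blanks, which is why the joined text contains long runs of spaces. As a compact illustration of the same idea, the sketch below maps one row of genre flags straight to a space-separated string. `flags_to_text` is a hypothetical helper, not part of the notebook, and the flag values are the Toy Story flags shown in the items table.

```python
# Genre names in the same order as the one-hot columns (excluding 'unknown').
genres = ['Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime',
          'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical',
          'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

def flags_to_text(flags):
    """Join the names of the genres whose flag is 1 (hypothetical helper)."""
    return ' '.join(name for name, flag in zip(genres, flags) if flag == 1)

# Toy Story (1995): Animation, Children's and Comedy are set, the rest are 0.
toy_story_flags = [0, 0, 1, 1, 1] + [0] * 13
print(flags_to_text(toy_story_flags))
```

Because a TF-IDF tokenizer splits on non-word characters by default, the extra blanks in `full_metadata` should not change the resulting features; the compact form is only easier to read.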


In [144]:
items

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,...,metadata_FilmNoir,metadata_Horror,metadata_Musical,metadata_Mystery,metadata_Romance,metadata_SciFi,metadata_Thriller,metadata_War,metadata_Western,full_metadata
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,,,,,,,,,,Animation Children's Comedy ...
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,,,,,,,Thriller,,,Action Adventure ...
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,,,,,,,Thriller,,,Thriller
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,,,,,,,,,,Action Comedy Drama ...
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,,,,,,,Thriller,,,Crime Drama Thr...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,...,,,,,,,,,,Drama
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,...,,,,,Romance,,Thriller,,,Romance Thrille...
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,...,,,,,Romance,,,,,Drama Romance
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,...,,,,,,,,,,Comedy


### TF-IDF Vectorizer on Metadata

In [145]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the metadata to obtain TF-IDF features
tfidf_features = tfidf_vectorizer.fit_transform(items['full_metadata'])

# Display the shape of the TF-IDF features
print("Shape of TF-IDF features:", tfidf_features.shape)

Shape of TF-IDF features: (1682, 20)
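
There are 18 genre names but 20 TF-IDF features. A likely explanation is tokenization: with default settings, `TfidfVectorizer` lowercases the text and keeps runs of two or more word characters, so hyphenated names split in two. The sketch below reproduces that step with the `re` module. It assumes the vectorizer's documented default `token_pattern` and that every genre occurs at least once in the data.

```python
import re

genres = ['Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime',
          'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical',
          'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

# Default token_pattern documented for scikit-learn's text vectorizers:
# tokens of two or more word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

vocabulary = set()
for genre in genres:
    vocabulary.update(token_pattern.findall(genre.lower()))

print(len(vocabulary))
print(sorted(vocabulary))
```

"Film-Noir" and "Sci-Fi" each contribute two tokens, and "Children's" contributes only "children" because the single-character "s" is discarded. If whole genre names should stay intact, one option is to pass a custom `token_pattern` or to replace hyphens and apostrophes before vectorizing.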


In [146]:
from sklearn.decomposition import TruncatedSVD
# Adjust the latent_dimension to be less than or equal to the number of features
latent_dimension = 20  # or any value <= 20

# Apply Truncated SVD to TF-IDF features with the updated latent_dimension
svd_model = TruncatedSVD(n_components=latent_dimension)
latent_matrix = svd_model.fit_transform(tfidf_features)

# Convert latent_matrix to a DataFrame (if needed) and set movie titles as index
latent_matrix_1_df = pd.DataFrame(latent_matrix, index=items['movie_title'])

# Display the shape of the latent matrix
print("Shape of Latent Matrix:", latent_matrix_1_df.shape)

Shape of Latent Matrix: (1682, 20)


## 2. Collaborative Filtering
- Use user_movies_data

In [147]:
# Reuse the rating matrix built earlier (movies as rows, users as columns)
user_movies_data = user_item_matrix
# Display the User-Item matrix
print("User-Item Matrix:")
print(user_movies_data)


User-Item Matrix:
user_id   1    2    3    4    5    6    7    8    9    10   ...  934  935  \
movie_id                                                    ...             
1           5    4    0    0    4    4    0    0    0    4  ...    2    3   
2           3    0    0    0    3    0    0    0    0    0  ...    4    0   
3           4    0    0    0    0    0    0    0    0    0  ...    0    0   
4           3    0    0    0    0    0    5    0    0    4  ...    5    0   
5           3    0    0    0    0    0    0    0    0    0  ...    0    0   
...       ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   
1678        0    0    0    0    0    0    0    0    0    0  ...    0    0   
1679        0    0    0    0    0    0    0    0    0    0  ...    0    0   
1680        0    0    0    0    0    0    0    0    0    0  ...    0    0   
1681        0    0    0    0    0    0    0    0    0    0  ...    0    0   
1682        0    0 

In [148]:
from sklearn.decomposition import TruncatedSVD

# Perform Singular Value Decomposition (SVD)
svd_model = TruncatedSVD(n_components=20, random_state=42)
latent_matrix = svd_model.fit_transform(user_item_matrix)

# Convert the latent matrix to a DataFrame
latent_matrix_2_df = pd.DataFrame(latent_matrix, index=user_item_matrix.index)

# Add movie IDs and titles to the DataFrame, looking each title up by its movie_id
movie_id_to_name = items.set_index('movie_id')['movie_title'].to_dict()
latent_matrix_2_df['movie_id'] = user_item_matrix.index
latent_matrix_2_df['movie_title'] = latent_matrix_2_df['movie_id'].map(movie_id_to_name)

# Set 'movie_title' as the index
latent_matrix_2_df.set_index('movie_title', inplace=True)

# Display the modified latent matrix with movie titles and IDs
print("Latent Matrix 2:")
print(latent_matrix_2_df)

Latent Matrix 2:
                                                   0          1          2  \
movie_title                                                                  
Toy Story (1995)                           61.469396  21.359469  -3.697663   
GoldenEye (1995)                           22.537180   1.719989 -13.616242   
Four Rooms (1995)                          12.767067   7.006769  -2.535841   
Get Shorty (1995)                          38.407419  -3.195163  -5.761062   
Copycat (1995)                             13.842238   3.748583  -5.764234   
...                                              ...        ...        ...   
Mat' i syn (1997)                           0.009730   0.054860   0.058059   
B. Monkey (1998)                            0.029189   0.164580   0.174176   
Sliding Doors (1998)                        0.019459   0.109720   0.116117   
You So Crazy (1994)                         0.212086  -0.025764  -0.098976   
Scream of Stone (Schrei aus Stein) (1991)   0.2

In [149]:
latent_matrix_2_df


movie_title,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,movie_id
Toy Story (1995),61.469396,21.359469,-3.697663,-2.579224,22.229725,15.964733,12.768127,-2.751672,6.330230,0.450451,...,-2.641969,-2.108442,-3.066645,-2.288769,-5.032221,1.251594,-0.731709,-2.287147,-3.095011,1
GoldenEye (1995),22.537180,1.719989,-13.616242,-0.514657,-6.448400,-2.409132,0.013745,-1.980483,-8.074766,-4.629508,...,-2.818239,0.239987,-1.350026,-1.274976,-1.364005,-0.799038,1.191902,-2.202397,-0.846734,2
Four Rooms (1995),12.767067,7.006769,-2.535841,-7.766709,-0.696378,-4.604121,-0.299571,-2.451378,2.596744,-3.185019,...,-1.504491,-2.872939,3.251849,-0.311786,0.058531,1.496349,-0.335863,2.803564,1.253793,3
Get Shorty (1995),38.407419,-3.195163,-5.761062,-6.843523,-6.046925,-6.969422,-0.931709,3.535384,-5.464466,-6.105469,...,5.975581,1.897681,-4.108657,0.701611,-3.361011,3.944109,-2.758970,-1.929996,5.167682,4
Copycat (1995),13.842238,3.748583,-5.764234,-2.849087,-2.476821,-4.737611,-6.224776,0.059064,2.041002,5.664593,...,1.835720,8.358653,-2.141676,-0.424077,1.090604,-2.510405,1.847846,-4.456771,-0.748373,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mat' i syn (1997),0.009730,0.054860,0.058059,0.075858,-0.094315,-0.020942,0.021506,0.003018,0.021657,-0.005961,...,0.002170,-0.008570,-0.019889,0.018225,-0.004673,0.002016,-0.000796,0.047324,-0.010024,1678
B. Monkey (1998),0.029189,0.164580,0.174176,0.227573,-0.282946,-0.062826,0.064518,0.009054,0.064971,-0.017883,...,0.006509,-0.025709,-0.059666,0.054674,-0.014019,0.006048,-0.002387,0.141971,-0.030073,1679
Sliding Doors (1998),0.019459,0.109720,0.116117,0.151716,-0.188631,-0.041884,0.043012,0.006036,0.043314,-0.011922,...,0.004339,-0.017139,-0.039777,0.036449,-0.009346,0.004032,-0.001592,0.094647,-0.020049,1680
You So Crazy (1994),0.212086,-0.025764,-0.098976,-0.012301,-0.071654,0.018887,-0.086156,0.047425,-0.173612,0.000666,...,-0.077680,-0.007501,0.054911,-0.088843,-0.261791,-0.169841,-0.104798,-0.017423,-0.125438,1681


In [150]:
# Strip whitespace characters from column names
items.columns = items.columns.str.strip()


### Plot variance explained to see what latent dimensions to use

In [151]:
import plotly.graph_objects as go

# Perform Singular Value Decomposition (SVD)
svd_model = TruncatedSVD(n_components=20, random_state=42)
svd_model.fit(user_item_matrix)

# Plot the explained variance ratio
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1, 21)), y=svd_model.explained_variance_ratio_,
                         mode='lines+markers', name='Explained Variance Ratio'))
fig.update_layout(title='Explained Variance Ratio by Number of Latent Dimensions',
                  xaxis_title='Number of Latent Dimensions',
                  yaxis_title='Explained Variance Ratio',
                  xaxis=dict(tickmode='linear', tick0=1, dtick=1),
                  yaxis=dict(tickformat='.2%'))
fig.show()
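
One common way to read such a plot is to accumulate the per-component ratios and pick the smallest number of components that reaches a target share of variance. The sketch below shows that calculation. The ratios are hypothetical stand-ins for `svd_model.explained_variance_ratio_`, and `components_needed` is an illustrative helper, not part of the notebook.

```python
from itertools import accumulate

# Hypothetical per-component ratios, standing in for
# svd_model.explained_variance_ratio_ (not values from this dataset).
explained_variance_ratio = [0.20, 0.10, 0.05, 0.03, 0.02]

# Running total of variance explained by the first k components.
cumulative = list(accumulate(explained_variance_ratio))

def components_needed(cumulative, threshold):
    """Return the smallest component count whose cumulative ratio
    reaches the threshold, or None if it is never reached."""
    for k, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return k
    return None

print(components_needed(cumulative, threshold=0.33))
print(components_needed(cumulative, threshold=0.90))
```

A result of `None` means the fitted components never reach the target, which would suggest refitting with a larger `n_components`.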

### Cosine Similarity

In [152]:
items

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,...,metadata_FilmNoir,metadata_Horror,metadata_Musical,metadata_Mystery,metadata_Romance,metadata_SciFi,metadata_Thriller,metadata_War,metadata_Western,full_metadata
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,,,,,,,,,,Animation Children's Comedy ...
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,,,,,,,Thriller,,,Action Adventure ...
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,,,,,,,Thriller,,,Thriller
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,,,,,,,,,,Action Comedy Drama ...
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,,,,,,,Thriller,,,Crime Drama Thr...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,...,,,,,,,,,,Drama
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,...,,,,,Romance,,Thriller,,,Romance Thrille...
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,...,,,,,Romance,,,,,Drama Romance
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,...,,,,,,,,,,Comedy


In [157]:
# Assuming 'movie_title' is the column containing movie titles in the 'items' DataFrame
# Set the index of the DataFrame to 'movie_title'
items.set_index('movie_title', inplace=True)
print(items)

                                           movie_id release_date  \
movie_title                                                        
Toy Story (1995)                                  1  01-Jan-1995   
GoldenEye (1995)                                  2  01-Jan-1995   
Four Rooms (1995)                                 3  01-Jan-1995   
Get Shorty (1995)                                 4  01-Jan-1995   
Copycat (1995)                                    5  01-Jan-1995   
...                                             ...          ...   
Mat' i syn (1997)                              1678  06-Feb-1998   
B. Monkey (1998)                               1679  06-Feb-1998   
Sliding Doors (1998)                           1680  01-Jan-1998   
You So Crazy (1994)                            1681  01-Jan-1994   
Scream of Stone (Schrei aus Stein) (1991)      1682  08-Mar-1996   

                                           video_release_date  \
movie_title                                       

In [158]:
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: Obtain the latent vectors for the selected movie
selected_movie = "Toy Story (1995)"
selected_movie_content_vector = latent_matrix_1_df.loc[selected_movie].values.reshape(1, -1)
selected_movie_collaborative_vector = latent_matrix_2_df.loc[selected_movie].values.reshape(1, -1)

# Step 2: Calculate cosine similarity with all other movies
content_similarity = cosine_similarity(latent_matrix_1_df.values, selected_movie_content_vector)
collaborative_similarity = cosine_similarity(latent_matrix_2_df.values, selected_movie_collaborative_vector)

# Step 3: Generate hybrid score
hybrid_score = (content_similarity + collaborative_similarity) / 2

# Step 4: Create DataFrame of similar movies
similar_movies_df = pd.DataFrame({
    'Movie': latent_matrix_1_df.index,
    'Content Similarity': content_similarity.flatten(),
    'Collaborative Similarity': collaborative_similarity.flatten(),
    'Hybrid Score': hybrid_score.flatten()
})

# Step 5: Sort DataFrame based on content similarity
similar_movies_df_sorted = similar_movies_df.sort_values(by='Content Similarity', ascending=False)

# Display the sorted DataFrame
print(similar_movies_df_sorted.head(10))

                                                  Movie  Content Similarity  \
0                                      Toy Story (1995)            1.000000   
421              Aladdin and the King of Thieves (1996)            1.000000   
945                       Fox and the Hound, The (1981)            0.936967   
1469                            Gumby: The Movie (1995)            0.936967   
1411  Land Before Time III: The Time of the Great Gi...            0.936967   
1408                          Swan Princess, The (1994)            0.936967   
624                      Sword in the Stone, The (1963)            0.936967   
1077                            Oliver & Company (1988)            0.936967   
1065                                       Balto (1995)            0.936967   
101                              Aristocats, The (1970)            0.936967   

      Collaborative Similarity  Hybrid Score  
0                     1.000000      1.000000  
421                   0.022765      

## 3. Hybrid Recommendation System




In [159]:
from sklearn.metrics.pairwise import cosine_similarity

def recommend_similar_movies(title):
    # Step 1: Obtain the latent vectors for the selected movie from both content and collaborative matrices
    selected_movie_content_vector = latent_matrix_1_df.loc[title].values.reshape(1, -1)
    selected_movie_collaborative_vector = latent_matrix_2_df.loc[title].values.reshape(1, -1)

    # Step 2: Calculate cosine similarity with all other movies for both content and collaborative vectors
    content_similarity = cosine_similarity(latent_matrix_1_df.values, selected_movie_content_vector)
    collaborative_similarity = cosine_similarity(latent_matrix_2_df.values, selected_movie_collaborative_vector)

    # Step 3: Generate hybrid score by averaging the content and collaborative similarities
    hybrid_score = (content_similarity + collaborative_similarity) / 2

    # Step 4: Create DataFrame of similar movies with their scores
    similar_movies_df = pd.DataFrame({
        'Movie': latent_matrix_1_df.index,
        'Content Similarity': content_similarity.flatten(),
        'Collaborative Similarity': collaborative_similarity.flatten(),
        'Hybrid Score': hybrid_score.flatten()
    })

    # Step 5: Sort DataFrame based on hybrid score
    similar_movies_df_sorted = similar_movies_df.sort_values(by='Hybrid Score', ascending=False)

    return similar_movies_df_sorted.head(10)


In [160]:
recommend_similar_movies("Toy Story (1995)")

Unnamed: 0,Movie,Content Similarity,Collaborative Similarity,Hybrid Score
0,Toy Story (1995),1.0,1.0,1.0
7,Babe (1995),0.616159,0.763757,0.689958
70,"Lion King, The (1994)",0.752183,0.457712,0.604948
94,Aladdin (1992),0.820798,0.377451,0.599124
24,"Birdcage, The (1996)",0.349419,0.771913,0.560666
98,Snow White and the Seven Dwarfs (1937),0.752183,0.278124,0.515154
101,"Aristocats, The (1970)",0.936967,0.092192,0.514579
421,Aladdin and the King of Thieves (1996),1.0,0.022765,0.511383
12,Mighty Aphrodite (1995),0.349419,0.653914,0.501666
403,Pinocchio (1940),0.936967,0.056583,0.496775


In [161]:
recommend_similar_movies("GoldenEye (1995)")

Unnamed: 0,Movie,Content Similarity,Collaborative Similarity,Hybrid Score
1,GoldenEye (1995),1.0,1.0,1.0
27,Apollo 13 (1995),0.692499,0.69562,0.69406
116,"Rock, The (1996)",1.0,0.369364,0.684682
78,"Fugitive, The (1993)",0.759103,0.585132,0.672118
23,Rumble in the Bronx (1995),0.653037,0.675139,0.664088
117,Twister (1996),1.0,0.290739,0.645369
28,Batman Forever (1995),0.611661,0.635115,0.623388
173,Raiders of the Lost Ark (1981),0.84373,0.382219,0.612975
3,Get Shorty (1995),0.381454,0.834772,0.608113
32,Desperado (1995),0.618665,0.579699,0.599182


In [162]:
recommend_similar_movies("Mission: Impossible (1996)")

Unnamed: 0,Movie,Content Similarity,Collaborative Similarity,Hybrid Score
404,Mission: Impossible (1996),1.0,1.0,1.0
808,Rising Sun (1993),0.780103,0.992085,0.886094
768,Congo (1995),0.758788,0.991896,0.875342
147,"Ghost and the Darkness, The (1996)",0.727201,0.997487,0.862344
430,Highlander (1986),0.727201,0.994996,0.861099
553,Waterworld (1995),0.727201,0.993699,0.86045
678,Conan the Barbarian (1981),0.727201,0.993428,0.860315
490,"Adventures of Robin Hood, The (1938)",0.727201,0.992685,0.859943
540,Mortal Kombat (1995),0.727201,0.992609,0.859905
828,Fled (1996),0.727201,0.991965,0.859583
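
`recommend_similar_movies` weights the two similarities equally, and the results include the query movie itself as the first row. In the Toy Story output above, Aladdin and the King of Thieves (1996) has a content similarity of 1.0 but a collaborative similarity of 0.022765, so the blend weight has a visible effect on the ranking. The block below is a minimal sketch of a weighted variant on made-up toy matrices. The names `cosine_to_query`, `weighted_hybrid`, `w_content`, and `top_n` are hypothetical and not part of the notebook, and cosine similarity is computed directly with NumPy rather than scikit-learn so the sketch stays self-contained.

```python
import numpy as np

def cosine_to_query(matrix, query_idx):
    """Cosine similarity between every row of `matrix` and row `query_idx`."""
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = matrix / norms[:, np.newaxis]
    return unit @ unit[query_idx]

def weighted_hybrid(content, collab, query_idx, w_content=0.5, top_n=3):
    """Blend two similarity vectors and return (row index, score) pairs,
    excluding the query row itself. `w_content` is a hypothetical weight in [0, 1]."""
    score = (w_content * cosine_to_query(content, query_idx)
             + (1.0 - w_content) * cosine_to_query(collab, query_idx))
    order = np.argsort(-score, kind="stable")   # highest score first
    order = order[order != query_idx][:top_n]   # drop the query movie
    return [(int(i), float(score[i])) for i in order]

# Toy latent matrices (4 movies x 2 latent dimensions); the values are made up.
content = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
collab = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

print(weighted_hybrid(content, collab, query_idx=0, w_content=0.5))
print(weighted_hybrid(content, collab, query_idx=0, w_content=0.9))
```

With `w_content=0.5` the sketch reproduces the equal-weight average used in the notebook. Raising the weight favours movies that match on content even when few users co-rated them.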


##### Turicreate - Python Library for easy recommendation engine building

## 4. Matrix Factorization Recommender

- R – The user-movie rating matrix
- K – Number of latent features
- alpha – Learning rate for stochastic gradient descent
- beta – Regularization parameter, applied to both the bias terms and the latent factors
- iterations – Number of iterations to perform stochastic gradient descent
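
Read against the `MF` class in the next cell, these parameters fit together as follows. The symbols $\mathbf{p}_i$, $\mathbf{q}_j$, $b^{u}_i$, and $b^{m}_j$ are notation chosen here for row $i$ of `P`, row $j$ of `Q`, `b_u[i]`, and `b_i[j]`; they do not appear in the notebook. The predicted rating for user $i$ and movie $j$ is

$$
\hat{r}_{ij} = b + b^{u}_i + b^{m}_j + \mathbf{p}_i^{\top}\mathbf{q}_j ,
\qquad e_{ij} = r_{ij} - \hat{r}_{ij},
$$

where $b$ is the mean of all observed ratings. For each observed rating, one stochastic gradient descent step updates

$$
\begin{aligned}
b^{u}_i &\leftarrow b^{u}_i + \alpha\,(e_{ij} - \beta\, b^{u}_i) \\
b^{m}_j &\leftarrow b^{m}_j + \alpha\,(e_{ij} - \beta\, b^{m}_j) \\
\mathbf{p}_i &\leftarrow \mathbf{p}_i + \alpha\,(e_{ij}\,\mathbf{q}_j - \beta\,\mathbf{p}_i) \\
\mathbf{q}_j &\leftarrow \mathbf{q}_j + \alpha\,(e_{ij}\,\mathbf{p}_i - \beta\,\mathbf{q}_j)
\end{aligned}
$$

Note that in the code the $\mathbf{q}_j$ update uses the already-updated $\mathbf{p}_i$, because the two assignments run one after the other.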


In [166]:
class MF():

    # Initializing the user-movie rating matrix, no. of latent features, alpha and beta.
    def __init__(self, R, K, alpha, beta, iterations):
        self.R = R
        self.num_users, self.num_items = R.shape
        self.K = K
        self.alpha = alpha
        self.beta = beta
        self.iterations = iterations

    # Initializing user-feature and movie-feature matrix
    def train(self):
        self.P = np.random.normal(scale=1./self.K, size=(self.num_users, self.K))
        self.Q = np.random.normal(scale=1./self.K, size=(self.num_items, self.K))

        # Initializing the bias terms
        self.b_u = np.zeros(self.num_users)
        self.b_i = np.zeros(self.num_items)
        self.b = np.mean(self.R[np.where(self.R != 0)])

        # List of training samples
        self.samples = [
        (i, j, self.R[i, j])
        for i in range(self.num_users)
        for j in range(self.num_items)
        if self.R[i, j] > 0
        ]

        # Stochastic gradient descent for given number of iterations
        training_process = []
        for i in range(self.iterations):
            np.random.shuffle(self.samples)
            self.sgd()
            mse = self.mse()
            training_process.append((i, mse))
            if (i+1) % 20 == 0:
                print("Iteration: %d ; error = %.4f" % (i+1, mse))

        return training_process

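    # Square root of the total squared error over all observed ratings.
    # Despite the name this is not a mean: divide the sum by the number of ratings to get the MSE.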
    def mse(self):
        xs, ys = self.R.nonzero()
        predicted = self.full_matrix()
        error = 0
        for x, y in zip(xs, ys):
            error += pow(self.R[x, y] - predicted[x, y], 2)
        return np.sqrt(error)

    # Stochastic gradient descent to get optimized P and Q matrix
    def sgd(self):
        for i, j, r in self.samples:
            prediction = self.get_rating(i, j)
            e = (r - prediction)

            self.b_u[i] += self.alpha * (e - self.beta * self.b_u[i])
            self.b_i[j] += self.alpha * (e - self.beta * self.b_i[j])

            self.P[i, :] += self.alpha * (e * self.Q[j, :] - self.beta * self.P[i,:])
            self.Q[j, :] += self.alpha * (e * self.P[i, :] - self.beta * self.Q[j,:])

    # Predicted rating for user i and movie j
    def get_rating(self, i, j):
        prediction = self.b + self.b_u[i] + self.b_i[j] + self.P[i, :].dot(self.Q[j, :].T)
        return prediction

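    # Full predicted user-movie rating matrix: global bias + user bias + movie bias + P.Q^T
    # (uses self rather than the global `mf`, so it also works during training)
    def full_matrix(self):
        return self.b + self.b_u[:, np.newaxis] + self.b_i[np.newaxis, :] + self.P.dot(self.Q.T)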

In [167]:
import numpy as np

# Build the user x movie rating matrix; movies a user has not rated become 0
R = np.array(ratings_df.pivot(index='user_id', columns='movie_id', values='rating').fillna(0))

In [168]:
mf = MF(R, K=20, alpha=0.001, beta=0.01, iterations=100)
training_process = mf.train()
print()
print("P x Q:")
print(mf.full_matrix())
print()

Iteration: 20 ; error = 296.1360
Iteration: 40 ; error = 291.0947
Iteration: 60 ; error = 287.7523
Iteration: 80 ; error = 282.4022
Iteration: 100 ; error = 273.0485

P x Q:
[[4.02138317 3.45282233 3.17568781 ... 3.41401606 3.49613245 3.45147471]
 [3.90580513 3.32376365 3.17295985 ... 3.36851916 3.50105629 3.43840348]
 [3.40198256 2.76943872 2.61245924 ... 2.83381968 2.93772547 2.92086465]
 ...
 [4.19495919 3.58683239 3.45895193 ... 3.65690478 3.80526987 3.75445368]
 [4.33305916 3.78944594 3.53427974 ... 3.78567232 3.89431655 3.89315614]
 [3.88057362 3.27121854 3.01158204 ... 3.27558837 3.403359   3.31992795]]
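
The `error` printed during training is the square root of the summed squared error over every observed rating, so it scales with the number of ratings and is not directly comparable to an RMSE. Dividing the sum by the number of observed ratings before taking the square root gives the RMSE. Assuming all 100,000 ratings are present in `R`, the final value corresponds to roughly 273.0485 / sqrt(100000) ≈ 0.863, measured on the training data. The helper below is a minimal sketch on made-up numbers; `rmse_on_observed` is a hypothetical name, not a method of the notebook's `MF` class.

```python
import numpy as np

def rmse_on_observed(R, predicted):
    """Root mean squared error over the observed (non-zero) entries of R."""
    mask = R != 0
    squared_errors = (R[mask] - predicted[mask]) ** 2
    return float(np.sqrt(squared_errors.mean()))

# Toy example: 2 users x 3 movies, 0 marks a missing rating (made-up numbers).
R_toy = np.array([[5.0, 0.0, 3.0],
                  [0.0, 4.0, 1.0]])
predicted_toy = np.array([[4.0, 2.0, 3.0],
                          [3.0, 4.0, 3.0]])

print(rmse_on_observed(R_toy, predicted_toy))
```

On the real data the same function could be called as `rmse_on_observed(R, mf.full_matrix())`.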



## Surprise

### Import the libraries

In [169]:
!pip install scikit-surprise



In [170]:
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split

In [171]:
ratings_df

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99995,880,476,3,880175444
99996,716,204,5,879795543
99997,276,1090,1,874795795
99998,13,225,2,882399156


In [172]:
items


movie_title,movie_id,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,Comedy,...,metadata_FilmNoir,metadata_Horror,metadata_Musical,metadata_Mystery,metadata_Romance,metadata_SciFi,metadata_Thriller,metadata_War,metadata_Western,full_metadata
Toy Story (1995),1,01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,...,,,,,,,,,,Animation Children's Comedy ...
GoldenEye (1995),2,01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,...,,,,,,,Thriller,,,Action Adventure ...
Four Rooms (1995),3,01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,...,,,,,,,Thriller,,,Thriller
Get Shorty (1995),4,01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,...,,,,,,,,,,Action Comedy Drama ...
Copycat (1995),5,01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,...,,,,,,,Thriller,,,Crime Drama Thr...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mat' i syn (1997),1678,06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,0,...,,,,,,,,,,Drama
B. Monkey (1998),1679,06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,0,...,,,,,Romance,,Thriller,,,Romance Thrille...
Sliding Doors (1998),1680,01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,0,...,,,,,Romance,,,,,Drama Romance
You So Crazy (1994),1681,01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,1,...,,,,,,,,,,Comedy


In [173]:
# Create an empty dictionary to store the mapping
Mapping_file = {}

# Iterate through the DataFrame rows
for index, row in items.iterrows():
    # Extract movie title and ID from the current row
    movie_title = index.strip()  # Remove leading and trailing whitespace
    movie_id = row['movie_id']

    # Add the title-ID pair to the dictionary
    Mapping_file[movie_title] = movie_id
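
A title-to-ID dictionary keeps only one ID per title, so if the same title appears on more than one row of `items`, the later row silently overwrites the earlier one. Whether `items` actually contains repeated titles is not checked in this notebook. The sketch below uses made-up pairs to show the effect and one way to detect it. The names `pairs`, `mapping`, and `duplicates` are hypothetical.

```python
from collections import Counter

# Toy (title, movie_id) pairs; the repeated title is made up for illustration.
pairs = [("Toy Story (1995)", 1), ("GoldenEye (1995)", 2), ("Toy Story (1995)", 99)]

mapping = {}
for title, movie_id in pairs:
    mapping[title.strip()] = movie_id  # a repeated title overwrites the earlier ID

# Count how often each title occurs and keep the ones that repeat
title_counts = Counter(title.strip() for title, _ in pairs)
duplicates = sorted(t for t, n in title_counts.items() if n > 1)

print(mapping["Toy Story (1995)"])
print(duplicates)
```

On the real data, `items.index.duplicated().any()` would report whether any title repeats.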

### Instantiate a reader and read in our rating data

In [174]:
from surprise import Dataset, Reader

# Define the rating scale (here, assuming ratings range from 1 to 5)
reader = Reader(rating_scale=(1, 5))

# Load the rating data
data = Dataset.load_from_df(ratings_df[['user_id', 'movie_id', 'rating']], reader)


### Train SVD on 75% of the known ratings

In [175]:
from surprise.model_selection import train_test_split
from surprise import SVD

# Split the data into train and test sets (75% train, 25% test)
trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

# Instantiate the SVD algorithm
svd = SVD()

# Train the model on the training set
svd.fit(trainset)

# Predict ratings for the test set
predictions = svd.test(testset)

# Evaluate the model: RMSE on the held-out test set
accuracy.rmse(predictions)


RMSE: 0.9437


0.9436507696853293

### Recommend the top 10 unrated movies for a user

In [176]:
def predict_user_ratings(user_id):
    """Return the 10 movies the user has not rated with the highest predicted ratings."""
    if user_id in ratings_df['user_id'].unique():
        # Movies the user has already rated (a set makes the membership test fast)
        user_movies = set(ratings_df[ratings_df['user_id'] == user_id]['movie_id'])
        movies_to_predict = {movie_title: movie_id for movie_title, movie_id in Mapping_file.items() if movie_id not in user_movies}

        # Predict a rating for every unrated movie with the trained SVD model
        predicted_ratings = []
        for movie_title, movie_id in movies_to_predict.items():
            predicted = svd.predict(user_id, movie_id)
            predicted_ratings.append((movie_title, predicted.est))  # .est is the estimated rating

        # Rank by predicted rating, highest first
        predicted_df = pd.DataFrame(predicted_ratings, columns=['movies', 'ratings'])
        predicted_df.sort_values('ratings', ascending=False, inplace=True)
        predicted_df.set_index('movies', inplace=True)

        return predicted_df.head(10)
    else:
        print("User ID does not exist in the list!")
        return None


In [177]:
ratings_df.columns

Index(['user_id', 'movie_id', 'rating', 'unix_timestamp'], dtype='object')

In [178]:
print(Mapping_file)


{'Toy Story (1995)': 1, 'GoldenEye (1995)': 2, 'Four Rooms (1995)': 3, 'Get Shorty (1995)': 4, 'Copycat (1995)': 5, 'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)': 6, 'Twelve Monkeys (1995)': 7, 'Babe (1995)': 8, 'Dead Man Walking (1995)': 9, 'Richard III (1995)': 10, 'Seven (Se7en) (1995)': 11, 'Usual Suspects, The (1995)': 12, 'Mighty Aphrodite (1995)': 13, 'Postino, Il (1994)': 14, "Mr. Holland's Opus (1995)": 15, 'French Twist (Gazon maudit) (1995)': 16, 'From Dusk Till Dawn (1996)': 17, 'White Balloon, The (1995)': 18, "Antonia's Line (1995)": 19, 'Angels and Insects (1995)': 20, 'Muppet Treasure Island (1996)': 21, 'Braveheart (1995)': 22, 'Taxi Driver (1976)': 23, 'Rumble in the Bronx (1995)': 24, 'Birdcage, The (1996)': 25, 'Brothers McMullen, The (1995)': 26, 'Bad Boys (1995)': 27, 'Apollo 13 (1995)': 28, 'Batman Forever (1995)': 29, 'Belle de jour (1967)': 30, 'Crimson Tide (1995)': 31, 'Crumb (1994)': 32, 'Desperado (1995)': 33, 'Doom Generation, The (1995)': 34, 'Fr

In [179]:
# Invert the title -> ID mapping once so each lookup is a single dictionary access
id_to_title = {movie_id: movie_title for movie_title, movie_id in Mapping_file.items()}

# Keep only the movie IDs that have a title in the mapping
valid_movie_ids = [movie_id for movie_id in average_ratings.index if movie_id in id_to_title]

# Select the average ratings of those movies and label them by title
valid_average_ratings = average_ratings[valid_movie_ids]
valid_average_ratings.index = [id_to_title[movie_id] for movie_id in valid_movie_ids]

# Sort the movies based on average ratings in descending order
sorted_ratings = valid_average_ratings.sort_values(ascending=False)

# Display the top-rated movies
print(sorted_ratings.head(10))

Prefontaine (1997)                                   5.0
Aiqing wansui (1994)                                 5.0
Marlene Dietrich: Shadow and Light (1996)            5.0
Star Kid (1997)                                      5.0
Great Day in Harlem, A (1994)                        5.0
Someone Else's America (1995)                        5.0
They Made Me a Criminal (1939)                       5.0
Entertaining Angels: The Dorothy Day Story (1996)    5.0
Santa with Muscles (1996)                            5.0
Saint of Fort Washington, The (1993)                 5.0
Name: rating, dtype: float64
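
Every title in this list has a perfect 5.0 average, which may mean these movies have very few ratings. The counts are not shown here, so that is an assumption. A common guard is to require a minimum number of ratings before ranking by average. The sketch below uses made-up ratings, and the threshold `min_ratings` is a hypothetical parameter.

```python
import pandas as pd

# Toy ratings (made-up values): movie 10 has a single 5-star rating, movie 20 has four ratings.
toy_ratings = pd.DataFrame({
    'movie_id': [10, 20, 20, 20, 20, 30, 30],
    'rating':   [5, 4, 5, 4, 5, 2, 3],
})

# Mean rating and number of ratings per movie
stats = toy_ratings.groupby('movie_id')['rating'].agg(['mean', 'count'])

# Keep only movies with at least `min_ratings` ratings, then rank by mean
min_ratings = 2
popular = stats[stats['count'] >= min_ratings].sort_values('mean', ascending=False)

print(popular)
```

Movie 10 drops out despite its 5.0 average because it has only one rating. On the real data the same grouping could be applied to `ratings_df`.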


In [184]:
predict_user_ratings(40)

movies,ratings
"Usual Suspects, The (1995)",4.137069
To Kill a Mockingbird (1962),3.951173
"Third Man, The (1949)",3.937327
"Close Shave, A (1995)",3.92351
"Wrong Trousers, The (1993)",3.918685
Star Wars (1977),3.907292
"Shawshank Redemption, The (1994)",3.885769
"Manchurian Candidate, The (1962)",3.872426
One Flew Over the Cuckoo's Nest (1975),3.871499
12 Angry Men (1957),3.858687


In [185]:
predict_user_ratings(50)

movies,ratings
"Shawshank Redemption, The (1994)",4.675778
"Close Shave, A (1995)",4.529891
L.A. Confidential (1997),4.460722
"Manchurian Candidate, The (1962)",4.435976
"Usual Suspects, The (1995)",4.424563
"Empire Strikes Back, The (1980)",4.418949
Good Will Hunting (1997),4.41818
"Silence of the Lambs, The (1991)",4.41501
Titanic (1997),4.374292
Richard III (1995),4.357705


In [190]:
predict_user_ratings(49)

movies,ratings
Blade Runner (1982),3.98201
Rear Window (1954),3.929438
12 Angry Men (1957),3.891878
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963),3.886122
"Third Man, The (1949)",3.882569
Boogie Nights (1997),3.882139
Schindler's List (1993),3.88068
Ran (1985),3.845404
Dead Man Walking (1995),3.838081
Strictly Ballroom (1992),3.766722


In [189]:
predict_user_ratings(1)

movies,ratings
North by Northwest (1959),4.725838
"Third Man, The (1949)",4.602182
Schindler's List (1993),4.546283
"Boot, Das (1981)",4.516794
Bringing Up Baby (1938),4.506274
"Manchurian Candidate, The (1962)",4.456308
Titanic (1997),4.45493
L.A. Confidential (1997),4.445271
Secrets & Lies (1996),4.436043
Rear Window (1954),4.425483


In [188]:
predict_user_ratings(915)

movies,ratings
To Kill a Mockingbird (1962),4.361981
Pulp Fiction (1994),4.305682
Rear Window (1954),4.292893
Apocalypse Now (1979),4.142235
"Close Shave, A (1995)",4.139565
One Flew Over the Cuckoo's Nest (1975),4.128509
Sunset Blvd. (1950),4.109944
Chinatown (1974),4.098562
"Wrong Trousers, The (1993)",4.091206
Casablanca (1942),4.088138
