The data used is the MovieLens 32M dataset, available from https://grouplens.org/datasets/movielens/. pandas is imported first so that the dataset's CSV files can be loaded into Python as pandas DataFrames.

In [16]:
import pandas as pd

The movies file (movies.csv) is loaded to get the genres of each movie.

In [27]:
movies_df=pd.read_csv("ml-32m/movies.csv")
print(movies_df)

       movieId                               title  \
0            1                    Toy Story (1995)   
1            2                      Jumanji (1995)   
2            3             Grumpier Old Men (1995)   
3            4            Waiting to Exhale (1995)   
4            5  Father of the Bride Part II (1995)   
...        ...                                 ...   
87580   292731           The Monroy Affaire (2022)   
87581   292737          Shelter in Solitude (2023)   
87582   292753                         Orca (2023)   
87583   292755              The Angry Breed (1968)   
87584   292757           Race to the Summit (2023)   

                                            genres  
0      Adventure|Animation|Children|Comedy|Fantasy  
1                       Adventure|Children|Fantasy  
2                                   Comedy|Romance  
3                             Comedy|Drama|Romance  
4                                           Comedy  
...                              

The genres will be used to recommend similar movies, but they are currently stored as a single string per movie with the genres separated by '|'. Splitting that string on '|' turns each genres field into a list.

In [18]:
movies_df['genres'] = movies_df.genres.str.split('|')
print(movies_df)

       movieId                               title  \
0            1                    Toy Story (1995)   
1            2                      Jumanji (1995)   
2            3             Grumpier Old Men (1995)   
3            4            Waiting to Exhale (1995)   
4            5  Father of the Bride Part II (1995)   
...        ...                                 ...   
87580   292731           The Monroy Affaire (2022)   
87581   292737          Shelter in Solitude (2023)   
87582   292753                         Orca (2023)   
87583   292755              The Angry Breed (1968)   
87584   292757           Race to the Summit (2023)   

                                                  genres  
0      [Adventure, Animation, Children, Comedy, Fantasy]  
1                         [Adventure, Children, Fantasy]  
2                                      [Comedy, Romance]  
3                               [Comedy, Drama, Romance]  
4                                               [Comedy]

By looping through the list of genres in each row, a table can be created that stores the genres in separate columns. The loop writes 1 into a genre's column when the movie has that genre, and the remaining gaps (NaN) are then filled with 0.

In [19]:
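# Set the genre's column to 1 for every genre listed on the row
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        movies_df.at[index, genre] = 1
# Genres a movie does not have are left as NaN; replace them with 0
movies_df = movies_df.fillna(0)
print(movies_df)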

       movieId                               title  \
0            1                    Toy Story (1995)   
1            2                      Jumanji (1995)   
2            3             Grumpier Old Men (1995)   
3            4            Waiting to Exhale (1995)   
4            5  Father of the Bride Part II (1995)   
...        ...                                 ...   
87580   292731           The Monroy Affaire (2022)   
87581   292737          Shelter in Solitude (2023)   
87582   292753                         Orca (2023)   
87583   292755              The Angry Breed (1968)   
87584   292757           Race to the Summit (2023)   

                                                  genres  Adventure  \
0      [Adventure, Animation, Children, Comedy, Fantasy]        1.0   
1                         [Adventure, Children, Fantasy]        1.0   
2                                      [Comedy, Romance]        0.0   
3                               [Comedy, Drama, Romance]        0.0
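
The row-by-row loop above is easy to follow but slow on a table of this size. As a minimal sketch of an alternative, and not the method used in this notebook, the same kind of indicator table can be built in one vectorised call with `Series.str.get_dummies` applied to the original pipe-delimited string. The `toy_movies` frame is a hypothetical three-row stand-in copied from the output above.

```python
import pandas as pd

# Hypothetical three-row stand-in for movies.csv (rows copied from the output above)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 column per genre, built from the pipe-delimited string in a single call
genre_flags = toy_movies["genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the identifying columns
toy_encoded = pd.concat([toy_movies[["movieId", "title"]], genre_flags], axis=1)
print(toy_encoded)
```

Two differences from the loop are worth noting: the genre columns come out in alphabetical order rather than in order of first appearance, and they hold integers rather than floats.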

A user profile is created from a list of movie titles and the ratings the user gave them. This profile is used later to recommend new movies to the user.

In [20]:
userInput = [
            {'title':'It Takes Two (1995)', 'rating':3},
            {'title':'Four Rooms (1995)', 'rating':4},
            {'title':'Amazing Panda Adventure, The (1995)', 'rating':5},
            {'title':"Miami Rhapsody (1995)", 'rating':3.5},
            {'title':'Strawberry and Chocolate (Fresa y chocolate) (1993)', 'rating':4}
         ] 
inputMovies = pd.DataFrame(userInput)
print(inputMovies)

                                               title  rating
0                                It Takes Two (1995)     3.0
1                                  Four Rooms (1995)     4.0
2                Amazing Panda Adventure, The (1995)     5.0
3                              Miami Rhapsody (1995)     3.5
4  Strawberry and Chocolate (Fresa y chocolate) (...     4.0


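The rows of the movies table whose titles match the user's movies are selected, which gives the movie id and genre columns for each rated movie. This table is then merged with the user ratings table. Because no key is specified, pandas joins on the only column the two tables share, title.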

In [21]:
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
inputMovies = pd.merge(inputId, inputMovies)
print(inputMovies)

   movieId                                              title  \
0       18                                  Four Rooms (1995)   
1       38                                It Takes Two (1995)   
2      146                Amazing Panda Adventure, The (1995)   
3      278                              Miami Rhapsody (1995)   
4      321  Strawberry and Chocolate (Fresa y chocolate) (...   

                  genres  Adventure  Animation  Children  Comedy  Fantasy  \
0               [Comedy]        0.0        0.0       0.0     1.0      0.0   
1     [Children, Comedy]        0.0        0.0       1.0     1.0      0.0   
2  [Adventure, Children]        1.0        0.0       1.0     0.0      0.0   
3               [Comedy]        0.0        0.0       0.0     1.0      0.0   
4                [Drama]        0.0        0.0       0.0     0.0      0.0   

   Romance  Drama  ...  Mystery  Sci-Fi  IMAX  Documentary  War  Musical  \
0      0.0    0.0  ...      0.0     0.0   0.0          0.0  0.0      0
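
Matching on the title string only works when the title is written exactly as it appears in movies.csv. A title that does not match is dropped by the merge without any warning. The sketch below shows one way to surface such titles, using the `how` and `indicator` parameters of `DataFrame.merge`. The `catalogue` and `ratings` frames are hypothetical stand-ins, and "Not In Catalogue (2000)" is an invented title used only to trigger a mismatch.

```python
import pandas as pd

# Hypothetical stand-ins for the movies table and the user's ratings
catalogue = pd.DataFrame({
    "movieId": [18, 38],
    "title": ["Four Rooms (1995)", "It Takes Two (1995)"],
})
ratings = pd.DataFrame({
    "title": ["It Takes Two (1995)", "Four Rooms (1995)", "Not In Catalogue (2000)"],
    "rating": [3.0, 4.0, 5.0],
})

# A left merge keeps every rated title; indicator=True adds a `_merge` column
# saying whether each row was found in both tables or only in the left one
checked = ratings.merge(catalogue, on="title", how="left", indicator=True)

# Titles the user rated that have no match in the catalogue
unmatched = checked.loc[checked["_merge"] == "left_only", "title"].tolist()

# Rows that matched, ready for the later steps
matched = checked.loc[checked["_merge"] == "both"].drop(columns="_merge")

print("Unmatched titles:", unmatched)
print(matched)
```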

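Only the individual genre columns are needed, so the movieId, title and genres columns are dropped. Note that the rating column is not dropped here, so it is still present in the resulting table.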

In [22]:
# Drop the identifying columns; 'rating' is not in the list, so it stays in the table
inputGenreTable = inputMovies.drop(columns=['movieId', 'title', 'genres'])
print(inputGenreTable)

   Adventure  Animation  Children  Comedy  Fantasy  Romance  Drama  Action  \
0        0.0        0.0       0.0     1.0      0.0      0.0    0.0     0.0   
1        0.0        0.0       1.0     1.0      0.0      0.0    0.0     0.0   
2        1.0        0.0       1.0     0.0      0.0      0.0    0.0     0.0   
3        0.0        0.0       0.0     1.0      0.0      0.0    0.0     0.0   
4        0.0        0.0       0.0     0.0      0.0      0.0    1.0     0.0   

   Crime  Thriller  ...  Mystery  Sci-Fi  IMAX  Documentary  War  Musical  \
0    0.0       0.0  ...      0.0     0.0   0.0          0.0  0.0      0.0   
1    0.0       0.0  ...      0.0     0.0   0.0          0.0  0.0      0.0   
2    0.0       0.0  ...      0.0     0.0   0.0          0.0  0.0      0.0   
3    0.0       0.0  ...      0.0     0.0   0.0          0.0  0.0      0.0   
4    0.0       0.0  ...      0.0     0.0   0.0          0.0  0.0      0.0   

   Western  Film-Noir  (no genres listed)  rating  
0      0.0      

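The ratings are used to weight the genres with a dot product: each genre's score is the sum of the ratings of the input movies that have that genre, so genres that appear in many highly rated movies score highest. Comedy (10.5) and Children (8.0) score highest, so movies with those genres can be expected to be recommended. The final rating entry (78.25) is not a genre. It appears because the rating column was still in the table, and it is simply the sum of the squared ratings.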

In [23]:
userProfile = inputGenreTable.transpose().dot(inputMovies['rating'])

print(userProfile)

Adventure              5.00
Animation              0.00
Children               8.00
Comedy                10.50
Fantasy                0.00
Romance                0.00
Drama                  4.00
Action                 0.00
Crime                  0.00
Thriller               0.00
Horror                 0.00
Mystery                0.00
Sci-Fi                 0.00
IMAX                   0.00
Documentary            0.00
War                    0.00
Musical                0.00
Western                0.00
Film-Noir              0.00
(no genres listed)     0.00
rating                78.25
dtype: float64
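
The profile values can be checked by hand. The sketch below rebuilds the dot product from the five rated movies, using the genres and ratings shown in the merged table above, restricted to the four genres that have a non-zero weight. Unlike the notebook cell, it leaves the rating column out of the genre table, so no rating entry appears in the result. The names `genre_flags`, `ratings` and `profile` and the shortened titles are illustrative, not part of the notebook.

```python
import pandas as pd

# Genre indicators for the five rated movies (only the non-zero genres are kept)
genre_flags = pd.DataFrame(
    {
        "Adventure": [0, 0, 1, 0, 0],
        "Children":  [0, 1, 1, 0, 0],
        "Comedy":    [1, 1, 0, 1, 0],
        "Drama":     [0, 0, 0, 0, 1],
    },
    index=[
        "Four Rooms",
        "It Takes Two",
        "Amazing Panda Adventure",
        "Miami Rhapsody",
        "Strawberry and Chocolate",
    ],
)

# The user's ratings, indexed by the same titles
ratings = pd.Series([4.0, 3.0, 5.0, 3.5, 4.0], index=genre_flags.index)

# Transposing puts genres on the rows, so each genre's score is the
# sum of the ratings of the movies that carry that genre
profile = genre_flags.T.dot(ratings)
print(profile)
```

For example, Comedy is carried by Four Rooms (4), It Takes Two (3) and Miami Rhapsody (3.5), which sums to the 10.5 shown above.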


A genre table covering all of the movies, indexed by movie id, is built so that every movie can be compared against the user profile.

In [24]:
# Index by movieId and keep only the genre indicator columns
genreTable = movies_df.set_index('movieId').drop(columns=['title', 'genres'])
print(genreTable)

         Adventure  Animation  Children  Comedy  Fantasy  Romance  Drama  \
movieId                                                                    
1              1.0        1.0       1.0     1.0      1.0      0.0    0.0   
2              1.0        0.0       1.0     0.0      1.0      0.0    0.0   
3              0.0        0.0       0.0     1.0      0.0      1.0    0.0   
4              0.0        0.0       0.0     1.0      0.0      1.0    1.0   
5              0.0        0.0       0.0     1.0      0.0      0.0    0.0   
...            ...        ...       ...     ...      ...      ...    ...   
292731         0.0        0.0       0.0     0.0      0.0      0.0    1.0   
292737         0.0        0.0       0.0     1.0      0.0      0.0    1.0   
292753         0.0        0.0       0.0     0.0      0.0      0.0    1.0   
292755         0.0        0.0       0.0     0.0      0.0      0.0    1.0   
292757         1.0        0.0       0.0     0.0      0.0      0.0    0.0   

         Ac

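A recommendation score is calculated for every movie as the sum of the user-profile weights of the genres it has, divided by the sum of all profile weights. The scores are then sorted from high to low. Because the profile sum still includes the rating entry (78.25), every score is scaled down by the same constant, which leaves the ranking unchanged.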

In [25]:
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
print(recommendationTable_df)

movieId
124519    0.260047
230667    0.260047
4683      0.260047
134853    0.260047
226208    0.260047
            ...   
89343     0.000000
184515    0.000000
184517    0.000000
184519    0.000000
165741    0.000000
Length: 87585, dtype: float64
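
The top score of 0.260047 can be reproduced by hand. The sketch below scores three movies from the tables above against the four non-zero profile weights, first normalising by the genre weights alone and then by the notebook's normaliser, which also includes the 78.25 rating entry. The variable names are illustrative.

```python
import pandas as pd

# Non-zero genre weights from the user profile above
profile = pd.Series({"Adventure": 5.0, "Children": 8.0, "Comedy": 10.5, "Drama": 4.0})

# Genre indicators for three movies taken from the tables above
candidates = pd.DataFrame(
    {
        "Adventure": [1, 0, 0],
        "Children":  [1, 1, 0],
        "Comedy":    [1, 1, 0],
        "Drama":     [1, 0, 1],
    },
    index=pd.Index([4683, 38, 321], name="movieId"),
)

# Weighted sum of each movie's genres (the Series aligns with the columns)
raw = (candidates * profile).sum(axis=1)

# Normalised by the genre weights only: a movie matching every weighted genre scores 1.0
scores = (raw / profile.sum()).sort_values(ascending=False)

# Normalised as in the notebook, where the profile sum also contains the rating entry
notebook_scores = raw / (profile.sum() + 78.25)

print(scores)
print(notebook_scores)
```

Movie 4683 has all four weighted genres, so its raw score is 27.5. Dividing by 27.5 + 78.25 = 105.75 gives about 0.260047, which matches the output above. Dividing by 27.5 alone would give 1.0, which is easier to read as "matches the whole profile".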


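The top recommended movies from this user profile all have Children and Comedy in their genre lists, together with Adventure and Drama, the other two genres with a non-zero weight. This is consistent with the user profile, although it is a sanity check rather than a measure of accuracy. Many movies tie on the top score, and the table below lists the selected rows in movies_df order, not in score order.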

In [26]:
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]

Unnamed: 0,movieId,title,genres,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
1818,1907,Mulan (1998),"[Adventure, Animation, Children, Comedy, Drama...",1.0,1.0,1.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2754,2846,"Adventures of Milo and Otis, The (Koneko monog...","[Adventure, Children, Comedy, Drama]",1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3574,3674,For the Love of Benji (1977),"[Adventure, Children, Comedy, Drama]",1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4579,4683,"Wizard, The (1989)","[Adventure, Children, Comedy, Drama]",1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4971,5076,"Adventures of Huck Finn, The (1993)","[Adventure, Children, Comedy, Drama]",1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6435,6557,Born to Be Wild (1995),"[Adventure, Children, Comedy, Drama]",1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8573,26093,"Wonderful World of the Brothers Grimm, The (1962)","[Adventure, Animation, Children, Comedy, Drama...",1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
17431,91335,"Gruffalo, The (2009)","[Adventure, Animation, Children, Comedy, Drama]",1.0,1.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25844,124519,Snow White and the Three Stooges (1961),"[Adventure, Children, Comedy, Drama, Fantasy]",1.0,0.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26473,125972,Halloweentown II: Kalabar's Revenge (2001),"[Adventure, Children, Comedy, Drama, Fantasy]",1.0,0.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
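
Selecting with `isin` returns the rows in the order they appear in the movies table and can include movies the user has already rated. The sketch below shows one possible way to list recommendations in score order while leaving out the rated movies. The `scores` values are illustrative numbers rather than results from this notebook, and `toy_movies`, `already_rated` and `top` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for the movies table
toy_movies = pd.DataFrame({
    "movieId": [38, 321, 4683, 5076],
    "title": [
        "It Takes Two (1995)",
        "Strawberry and Chocolate (Fresa y chocolate) (1993)",
        "Wizard, The (1989)",
        "Adventures of Huck Finn, The (1993)",
    ],
})

# Illustrative scores indexed by movieId (not values computed in the notebook)
scores = pd.Series({4683: 1.0, 5076: 1.0, 38: 0.67, 321: 0.15}).rename_axis("movieId")

# Movies the user has already rated should not be recommended back to them
already_rated = [38, 321]

# Drop the rated movies, sort by score, keep the top 2, then attach the titles;
# a left merge preserves the score order of the left-hand table
top = (
    scores.drop(index=already_rated)
    .sort_values(ascending=False)
    .head(2)
    .rename("score")
    .reset_index()
    .merge(toy_movies, on="movieId", how="left")
)
print(top)
```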
