# Content-Based Recommender Trial on IMDb Movies

While following the "Machine Learning with Python" course by IBM on Coursera, I decided to try this project.

Basically, content-based recommenders use the attributes that come with the content itself. The recommender builds a picture of your taste from the items you rated and recommends items with similar attributes.

In this project, movie genres are our content indicators. We get some movie ratings from our user, then try to recommend similar movies.

A simple explanation video can be found here: https://coursera.org/share/962d392cc93515cbf26e22058fc01cd2

The IMDb Movies data used here can be found on Kaggle: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset

Let's start with importing pandas...

In [1]:
import pandas as pd

Then read our 'IMDb movies.csv' data as a dataframe. The path must be specified.

In [2]:
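# Raw string keeps the backslash in the Windows path from being read as an escape.
# on_bad_lines='skip' replaces error_bad_lines/warn_bad_lines (pandas >= 1.3).
movies_df = pd.read_csv(r'<path_of_data>\IMDb movies.csv', sep=',', on_bad_lines='skip', low_memory=False)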
movies_df.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7.0,7.0
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0
3,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,$ 45000,,,,25.0,3.0
4,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0


The data has 22 columns, and in this project we will use only a few of them. The rest can be used for visualization or other projects. So, let's see what we have and drop the columns we don't need.

In [3]:
movies_df.columns.values

array(['imdb_title_id', 'title', 'original_title', 'year',
       'date_published', 'genre', 'duration', 'country', 'language',
       'director', 'writer', 'production_company', 'actors',
       'description', 'avg_vote', 'votes', 'budget', 'usa_gross_income',
       'worlwide_gross_income', 'metascore', 'reviews_from_users',
       'reviews_from_critics'], dtype=object)

In [4]:
dropColumns = ['imdb_title_id', 'title', 'date_published', 'duration',
                 'country', 'language', 'writer', 'production_company', 'actors',
                  'description', 'budget', 'usa_gross_income', 'worlwide_gross_income',
                   'metascore', 'reviews_from_users', 'reviews_from_critics']

# Passing the axis positionally (drop(dropColumns, 1)) no longer works in pandas 2.x
movies_df = movies_df.drop(columns=dropColumns)
movies_df.head()

Unnamed: 0,original_title,year,genre,director,avg_vote,votes
0,Miss Jerry,1894,Romance,Alexander Black,5.9,154
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",Charles Tait,6.1,589
2,Den sorte drøm,1911,Drama,Urban Gad,5.8,188
3,Cleopatra,1912,"Drama, History",Charles L. Gaskill,5.2,446
4,L'Inferno,1911,"Adventure, Drama, Fantasy","Francesco Bertolini, Adolfo Padovan",7.0,2237


Then we can make a list that contains all genres.

In [5]:
genresList = []
for i in movies_df['genre']:
    i = i.split(',')
    for j in i:
        j = j.lstrip()
        if j not in genresList:
            genresList.append(j)
            
print(genresList)

['Romance', 'Biography', 'Crime', 'Drama', 'History', 'Adventure', 'Fantasy', 'War', 'Mystery', 'Horror', 'Western', 'Comedy', 'Family', 'Action', 'Sci-Fi', 'Thriller', 'Sport', 'Animation', 'Musical', 'Music', 'Film-Noir', 'Adult', 'Documentary', 'Reality-TV', 'News']
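The same list can also be built without an explicit loop. Here is a minimal sketch, assuming the `genre` column holds comma-separated strings with no missing values. The toy `sample_df` is made up for illustration and stands in for `movies_df`.

```python
import pandas as pd

# Toy stand-in for movies_df (hypothetical frame, same 'genre' format)
sample_df = pd.DataFrame({
    'genre': ['Romance', 'Biography, Crime, Drama', 'Drama', 'Drama, History']
})

# Split each string on commas, flatten to one genre per row,
# trim whitespace, and keep unique values in order of first appearance
genres = (
    sample_df['genre']
    .str.split(',')
    .explode()
    .str.strip()
    .unique()
    .tolist()
)

print(genres)
```

On the full dataset this should give the same genres as the loop above, in the same order of first appearance.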


We want to add those genres as columns next to our dataframe. To do that, we work on a copy of movies_df. If a movie has a genre, that column must have the value 1; if it doesn't, the value must be 0.

For example, the movie Cleopatra has the Drama and History genres, so Drama and History get 1 and all other genres get 0 in the same row. Any NA/NaN values must also be filled with 0 using the fillna() method.

In [6]:
moviesWithGenres_df = movies_df.copy()

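for i in range(len(movies_df)):
    # Compare exact genre names; a plain substring test would also match 'Music' inside 'Musical'
    movieGenres = [g.strip() for g in movies_df['genre'][i].split(',')]
    for genre in genresList:
        if genre in movieGenres:
            moviesWithGenres_df.at[i, genre] = 1
        else:
            moviesWithGenres_df.at[i, genre] = 0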

moviesWithGenres_df = moviesWithGenres_df.fillna(0)

moviesWithGenres_df.head()

Unnamed: 0,original_title,year,genre,director,avg_vote,votes,Romance,Biography,Crime,Drama,...,Thriller,Sport,Animation,Musical,Music,Film-Noir,Adult,Documentary,Reality-TV,News
0,Miss Jerry,1894,Romance,Alexander Black,5.9,154,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",Charles Tait,6.1,589,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Den sorte drøm,1911,Drama,Urban Gad,5.8,188,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Cleopatra,1912,"Drama, History",Charles L. Gaskill,5.2,446,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,L'Inferno,1911,"Adventure, Drama, Fantasy","Francesco Bertolini, Adolfo Padovan",7.0,2237,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
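The row-by-row loop can be slow on a dataset of this size. As a rough alternative, pandas offers `Series.str.get_dummies`, which builds the 0/1 columns in one call. This is only a sketch: the two "Toy" titles are hypothetical, and it assumes the genres are always separated by a comma followed by a space.

```python
import pandas as pd

# Small illustrative frame; 'Toy Musical' and 'Toy Concert' are made-up titles
sample_df = pd.DataFrame({
    'original_title': ['Cleopatra', 'Toy Musical', 'Toy Concert'],
    'genre': ['Drama, History', 'Musical, Romance', 'Drama, Music'],
})

# One 0/1 column per exact genre name; sep must match the delimiter in the data
genre_flags = sample_df['genre'].str.get_dummies(sep=', ')

# Put the indicator columns next to the original ones
encoded_df = pd.concat([sample_df, genre_flags], axis=1)

print(encoded_df)
```

Because the strings are split on the separator, 'Musical' and 'Music' stay distinct columns. The resulting columns are integers and come out in alphabetical order rather than in order of first appearance.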


So, we now have the data shaped clearly for our project. A cell has the value 1 if the movie has that genre, and 0 if not.

Now we can get movie ratings from our user. More rated movies give better results.

In [7]:
userInput = [
    {'original_title': 'Kaybedenler Kulübü', 'vote': 6.9},
    {'original_title': 'Pulp Fiction', 'vote': 8.2},
    {'original_title': 'Portrait de la jeune fille en feu', 'vote': 7.2},
    {'original_title': 'The Avengers', 'vote': 6.8},
    {'original_title': 'The Invention of Lying', 'vote': 5.7}
]

inputMovies_df = pd.DataFrame(userInput)

Then we need to check whether the input movies exist in our movies dataframe, and merge the matches with the user's votes.

In [8]:
inputId_df = moviesWithGenres_df[moviesWithGenres_df['original_title'].isin(inputMovies_df['original_title'].tolist())]
inputMovies_df = pd.merge(inputId_df, inputMovies_df)

print(inputMovies_df)

                      original_title  year                      genre  \
0                       Pulp Fiction  1994               Crime, Drama   
1                       The Avengers  1998  Action, Adventure, Sci-Fi   
2                       The Avengers  2012  Action, Adventure, Sci-Fi   
3             The Invention of Lying  2009   Comedy, Fantasy, Romance   
4                 Kaybedenler Kulübü  2011     Comedy, Drama, Romance   
5  Portrait de la jeune fille en feu  2019             Drama, Romance   

                          director  avg_vote    votes  Romance  Biography  \
0                Quentin Tarantino       8.9  1780147      0.0        0.0   
1              Jeremiah S. Chechik       3.8    40374      0.0        0.0   
2                      Joss Whedon       8.0  1241220      0.0        0.0   
3  Ricky Gervais, Matthew Robinson       6.4   122986      1.0        0.0   
4                      Tolga Örnek       7.6    20898      1.0        0.0   
5                   Céline
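Note that the output above contains two rows for 'The Avengers' (1998 and 2012), so the single vote of 6.8 is applied to both films. One possible way to avoid this is to merge on the release year as well. This is a sketch, not the approach used in this notebook: the `year` field in the user input is a hypothetical addition, and the real `year` column may need type cleaning before such a merge works.

```python
import pandas as pd

# Two catalogue rows sharing a title, as in the merge output above
catalog_df = pd.DataFrame({
    'original_title': ['The Avengers', 'The Avengers', 'Pulp Fiction'],
    'year': [1998, 2012, 1994],
})

# Hypothetical input format: the 'year' field is an assumption, not part of the original input
user_input = [
    {'original_title': 'The Avengers', 'year': 2012, 'vote': 6.8},
    {'original_title': 'Pulp Fiction', 'year': 1994, 'vote': 8.2},
]
input_df = pd.DataFrame(user_input)

# Merging on title AND year keeps only the intended film
matched_df = pd.merge(catalog_df, input_df, on=['original_title', 'year'])

print(matched_df)
```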

Then we drop the unnecessary columns again. Next, we multiply the genre matrix by the input votes using a dot product; the resulting weighted genres form our user profile.

In [9]:
# Keep only the 0/1 genre columns
userWithGenres_df = inputMovies_df.drop(columns=['original_title', 'year', 'genre', 'director', 'avg_vote', 'vote', 'votes'])
userProfile = userWithGenres_df.transpose().dot(inputMovies_df['vote'])

print(userProfile)

Romance        19.8
Biography       0.0
Crime           8.2
Drama          22.3
History         0.0
Adventure      13.6
Fantasy         5.7
War             0.0
Mystery         0.0
Horror          0.0
Western         0.0
Comedy         12.6
Family          0.0
Action         13.6
Sci-Fi         13.6
Thriller        0.0
Sport           0.0
Animation       0.0
Musical         0.0
Music           0.0
Film-Noir       0.0
Adult           0.0
Documentary     0.0
Reality-TV      0.0
News            0.0
dtype: float64
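To make the dot product concrete, here is a small sketch that rebuilds the profile from the six matched rows shown in the merge output. It uses `str.get_dummies` instead of the loop, which is a substitution made only to keep the example short.

```python
import pandas as pd

# Genres and votes of the six matched rows shown in the merge output above
rated = pd.DataFrame({
    'genre': ['Crime, Drama', 'Action, Adventure, Sci-Fi', 'Action, Adventure, Sci-Fi',
              'Comedy, Fantasy, Romance', 'Comedy, Drama, Romance', 'Drama, Romance'],
    'vote': [8.2, 6.8, 6.8, 5.7, 6.9, 7.2],
})

# 0/1 genre matrix (rows = rated movies, columns = genres)
flags = rated['genre'].str.get_dummies(sep=', ')

# Transposed matrix times the vote vector:
# each genre sums the votes of the movies that carry it
profile = flags.T.dot(rated['vote'])

print(profile.round(1))
```

For example, Drama collects 8.2 + 6.9 + 7.2 = 22.3, and Action collects 6.8 twice (13.6) because both 'The Avengers' rows received the same vote.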


As we can see, Drama has the highest score, which means our user prefers drama movies over other genres.

Then we need to check all movies to see their relevance against our user profile.

In [10]:
genreTable = moviesWithGenres_df.set_index('original_title')
genreTable = genreTable.drop(columns=['year', 'genre', 'director', 'avg_vote', 'votes'])

# Weighted share of the user profile covered by each movie's genres
recTable_df = (genreTable * userProfile).sum(axis=1) / userProfile.sum()
print(recTable_df)

original_title
Miss Jerry                        0.180987
The Story of the Kelly Gang       0.278793
Den sorte drøm                    0.203839
Cleopatra                         0.203839
L'Inferno                         0.380256
                                    ...   
Le lion                           0.115174
De Beentjes van Sint-Hildegard    0.319013
Padmavyuhathile Abhimanyu         0.203839
Sokagin Çocuklari                 0.203839
La vida sense la Sara Amat        0.203839
Length: 85855, dtype: float64
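Each score is the sum of the profile weights for a movie's genres, divided by the sum of all profile weights. A small worked sketch using the non-zero weights printed in the user profile above (the helper name `score` is only for illustration):

```python
# Non-zero weights from the user profile printed above
profile = {
    'Romance': 19.8, 'Crime': 8.2, 'Drama': 22.3, 'Adventure': 13.6,
    'Fantasy': 5.7, 'Comedy': 12.6, 'Action': 13.6, 'Sci-Fi': 13.6,
}
total = sum(profile.values())  # about 109.4

def score(genres):
    """Share of the total profile weight covered by a movie's genres."""
    return sum(profile.get(g, 0.0) for g in genres) / total

miss_jerry = score(['Romance'])                        # 19.8 / 109.4
l_inferno = score(['Adventure', 'Drama', 'Fantasy'])   # 41.6 / 109.4

print(round(miss_jerry, 6), round(l_inferno, 6))
```

These reproduce the values 0.180987 and 0.380256 listed for Miss Jerry and L'Inferno. A movie would reach 1.0 only if it carried every genre with a non-zero weight.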


Our values range from 0 to 1, so we can sort them to see the best matches.

In [11]:
recTable_df = recTable_df.sort_values(axis=0, ascending=False)
recTable_df.head()

original_title
Decameron Nights                     0.509141
Tere Naam                            0.509141
Untamed                              0.509141
Bao chou                             0.509141
There's Something About a Soldier    0.509141
dtype: float64

More than one movie shares the highest score, because hundreds of movies have the same genre combinations. It is best to count how many top matches we have; then we can reorder that list in different ways.

In [13]:
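# Count how many entries share the top score (the series is sorted in descending order)
m = 0
for i in recTable_df:
    # .iloc[0] is explicit positional access; recTable_df[0] is deprecated on a title-indexed Series
    if recTable_df.iloc[0] == i:
        m += 1
    else:
        break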

topRecTable_df = recTable_df.head(m)

print(m)

topRecTable_df.head(10)

525


original_title
Decameron Nights                     0.509141
Tere Naam                            0.509141
Untamed                              0.509141
Bao chou                             0.509141
There's Something About a Soldier    0.509141
Never Let Me Go                      0.509141
Lord Jim                             0.509141
Jaanu                                0.509141
2046                                 0.509141
Last Goodbye                         0.509141
dtype: float64
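The count of top matches can also be computed without a loop. The sketch below uses made-up scores and `numpy.isclose`, so that tiny floating-point differences between equal-looking scores do not split the group; the tolerance-based comparison is an assumption of this example, not something the loop above does.

```python
import numpy as np
import pandas as pd

# Toy scores (hypothetical values), already sorted from best to worst
scores = pd.Series(
    [0.509141, 0.509141, 0.509141, 0.45, 0.20],
    index=['A', 'B', 'C', 'D', 'E'],
)

# Count entries equal to the best score, within floating-point tolerance
best = scores.max()
m = int(np.isclose(scores, best).sum())

# head(m) relies on the series being sorted in descending order
top = scores.head(m)

print(m)
```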

We have m=525 best-matching movies, and these are 10 of them. Sorting matters a lot here, and several sorting methods are possible.

For example, sorting by average IMDb score from highest to lowest is one alternative. The risk is that some movies have high scores even though they have few votes.

One way to avoid this is to sort by the number of votes, so that we recommend the most-watched movies.
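A middle ground between the two orderings is a weighted rating, which pulls the score of a film with few votes toward the overall mean. The sketch below uses a commonly cited Bayesian-style formula. It is not used in this notebook: the threshold `m`, the choice of `C`, and the 'Obscure Gem' row are assumptions made for illustration.

```python
import pandas as pd

# Four rows from the IMDb data plus one made-up low-vote film
films = pd.DataFrame({
    'original_title': ['Eternal Sunshine of the Spotless Mind', 'Her', 'Cast Away',
                       'The Beach', 'Obscure Gem (hypothetical)'],
    'avg_vote': [8.3, 8.0, 7.8, 6.7, 9.5],
    'votes': [889875, 523975, 510788, 219346, 40],
})

C = films['avg_vote'].mean()   # prior: mean rating over the candidates
m = 1000                       # assumed minimum-votes threshold (tunable)

# Weighted rating: films with few votes are pulled toward the mean C
v, R = films['votes'], films['avg_vote']
films['weighted'] = (v / (v + m)) * R + (m / (v + m)) * C

ranked = films.sort_values('weighted', ascending=False)

print(ranked[['original_title', 'avg_vote', 'votes', 'weighted']])
```

With these numbers the made-up film no longer ranks first despite its 9.5 average, while the heavily voted films keep almost exactly their original ratings.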

Here are the 10 best-matched movies that our user might like, sorted by the number of IMDb votes. In other words, this is our recommendation table...

In [14]:
recList_df = movies_df.loc[movies_df['original_title'].isin(topRecTable_df.keys())]
recList_df = recList_df.sort_values(by=['votes'], ascending=False)

recList_df.head(10)

Unnamed: 0,original_title,year,genre,director,avg_vote,votes
42569,Eternal Sunshine of the Spotless Mind,2004,"Drama, Romance, Sci-Fi",Michel Gondry,8.3,889875
57584,Suicide Squad,2016,"Action, Adventure, Fantasy",David Ayer,6.0,586474
62072,Her,2013,"Drama, Romance, Sci-Fi",Spike Jonze,8.0,523975
33760,Cast Away,2000,"Adventure, Drama, Romance",Robert Zemeckis,7.8,510788
17688,Rocky,1976,"Drama, Sport",John G. Avildsen,8.1,506246
49194,Wanted,2008,"Action, Crime, Fantasy",Timur Bekmambetov,6.7,355095
57306,Passengers,2016,"Drama, Romance, Sci-Fi",Morten Tyldum,7.0,340493
28492,Speed,1994,"Action, Adventure, Thriller",Jan de Bont,7.2,322275
33862,The Beach,2000,"Adventure, Drama, Romance",Danny Boyle,6.7,219346
31226,Deep Impact,1998,"Action, Drama, Romance",Mimi Leder,6.2,160722
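Some rows above, such as Rocky ("Drama, Sport"), do not have a top-scoring genre combination under the user profile. A likely cause is that the selection matches on `original_title`, so any film that shares a title with a top match is selected too. Below is a minimal sketch of an alternative that uses the row index as the key. The toy catalogue (including the second 'Rocky' row) is made up, and `str.get_dummies` stands in for the genre loop.

```python
import pandas as pd

# Toy catalogue: two different films share the title 'Rocky' (the second one is made up)
catalog_df = pd.DataFrame({
    'original_title': ['Rocky', 'Rocky', 'Her'],
    'genre': ['Drama, Sport', 'Drama, Romance, Sci-Fi', 'Drama, Romance, Sci-Fi'],
    'votes': [506246, 300, 523975],
})
profile = pd.Series({'Drama': 22.3, 'Romance': 19.8, 'Sci-Fi': 13.6, 'Sport': 0.0})

# Score every row; the result keeps catalog_df's row index instead of the title
flags = catalog_df['genre'].str.get_dummies(sep=', ')
scores = flags.mul(profile, axis=1).sum(axis=1) / profile.sum()

# Select the top rows by index so same-titled films are not mixed up
top_rows = catalog_df.loc[scores[scores == scores.max()].index]
top_rows = top_rows.sort_values('votes', ascending=False)

print(top_rows)
```

Here the "Drama, Sport" row is left out even though another film with the same title is a top match.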
