# Content Management
Xiaolan Li

This project uses TF-IDF features in a content-based filtering recommender system to recommend movies that a user has not yet seen.

The content consists of each movie's genres and its summary text, which is scraped from the movie's HTML page on the IMDb website.

### Reading Dataset

In [1]:
import pandas as pd

In [2]:
df_rating = pd.read_csv('https://raw.githubusercontent.com/xiaolancara/Recommender-System/main/data/Movie_Survey/MovieSurvey_Rating.csv')
df_rating

Unnamed: 0,userid,movieid,ratings
0,1,1,5
1,1,2,5
2,1,3,3
3,1,4,5
4,2,1,3
5,2,2,4
6,2,3,1
7,2,4,5
8,2,5,5
9,2,6,4


In [3]:
df_movie = pd.read_csv('https://raw.githubusercontent.com/xiaolancara/Recommender-System/main/data/Movie_Survey/MovieSurvey_Tag.csv')
df_movie.rename(columns={"tag": "genres"},inplace = True)

In [4]:
# replace the comma separators with spaces
df_movie['genres'] = df_movie['genres'].str.replace(',',' ')
df_movie.shape

(6, 3)

In [5]:
df_movie

Unnamed: 0,movieid,movietiltle,genres
0,1,Forrest Gump,Romance
1,2,Joker,Crime Thriller
2,3,Avengers: Endgame,Action Adventure
3,4,Spirited Away,Crime Animation
4,5,Parasite,Comedy Thriller
5,6,Soul,Animation Adventure Comedy


### Scrape Summary Text from HTML

The summary text comes from each movie's page on the IMDb website.

In [6]:
# IMDb page URL for each movie, in the same order as the rows of df_movie

ForrestGump_url = 'https://www.imdb.com/title/tt0109830/'
Joker_url = 'https://www.imdb.com/title/tt7286456/'
AvengersEndgame_url = 'https://www.imdb.com/title/tt4154796/'
SpiritedAway_url = 'https://www.imdb.com/title/tt0245429/'
Parasite_url = 'https://www.imdb.com/title/tt6751668/'
Soul_url = 'https://www.imdb.com/title/tt2948372/'
url_lst = [ForrestGump_url,Joker_url,AvengersEndgame_url,SpiritedAway_url,Parasite_url,Soul_url]

In [7]:
import requests
from bs4 import BeautifulSoup
import re

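def getMovieDetails(url):
    """Return the summary text of an IMDb title page, or '' if none is found."""
    r = requests.get(url=url)
    # Create a BeautifulSoup object
    soup = BeautifulSoup(r.text, 'html.parser')

    # get the summary block; this class name depends on IMDb's page layout
    summary_text = soup.find("div", {'class': 'summary_text'})
    if summary_text is None:
        return ''

    # .text already drops the tags; the regex only removes any leftover markup
    htmlTag = re.compile(r'<[^>]+>', re.S)
    Summary = htmlTag.sub('', summary_text.text).strip()
    return Summary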

In [8]:
summary_text_lst = []
for url in url_lst:
    summary_text_lst.append(getMovieDetails(url))
df_movie['summarytext'] = summary_text_lst
df_movie

Unnamed: 0,movieid,movietiltle,genres,summarytext
0,1,Forrest Gump,Romance,"The presidencies of Kennedy and Johnson, the e..."
1,2,Joker,Crime Thriller,"In Gotham City, mentally troubled comedian Art..."
2,3,Avengers: Endgame,Action Adventure,After the devastating events of Avengers: Infi...
3,4,Spirited Away,Crime Animation,"During her family's move to the suburbs, a sul..."
4,5,Parasite,Comedy Thriller,Greed and class discrimination threaten the ne...
5,6,Soul,Animation Adventure Comedy,"After landing the gig of a lifetime, a New Yor..."


### Sparsity Determination

In [9]:
num_Sur_movieUsers = df_rating.userid.unique().shape[0]
num_Sur_movies = df_movie.movieid.unique().shape[0]

# Calculate the sparsity of the user-movie rating matrix
Movie__matrixSparsity = 1 - len(df_rating) / (num_Sur_movieUsers * num_Sur_movies)

print('rating:{}, num_movieUsers:{}, num_movies:{}'.format(len(df_rating), num_Sur_movieUsers, num_Sur_movies))
print('Movie__matrixSparsity:', Movie__matrixSparsity)

rating:24, num_movieUsers:5, num_movies:6
Movie__matrixSparsity: 0.19999999999999996
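The sparsity figure can be checked by hand. The sketch below is a minimal illustration, assuming the per-user rating counts read off the pivot table shown later in this notebook; `ratings_per_user` and `matrix_sparsity` are names introduced here for illustration only.

```python
# Number of ratings each survey user gave, read off the pivot table
# later in this notebook (a 0 in the pivot means "not rated").
ratings_per_user = {1: 4, 2: 6, 3: 5, 4: 5, 5: 4}
num_movies = 6


def matrix_sparsity(ratings_per_user, num_movies):
    """Share of user-movie cells that hold no rating."""
    num_ratings = sum(ratings_per_user.values())
    num_cells = len(ratings_per_user) * num_movies
    return 1 - num_ratings / num_cells


sparsity = matrix_sparsity(ratings_per_user, num_movies)
print(round(sparsity, 4))  # 0.2
```

With 24 ratings out of 30 possible user-movie pairs, only 20% of the matrix is empty, so this survey data is dense compared with typical rating datasets.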


### Content Based Recommender System_Tf-idf

- Only based on genres

In [10]:
# tf-idf vectors
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_movie['genres'])

In [11]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray())
print(tfidf_df.shape)

(6, 7)


Since there are only 7 features, we don't need to reduce the dimensionality of the feature matrix.

In [12]:
df_genres_matrix_content = pd.DataFrame(tfidf_matrix.toarray(), index=df_movie.movietiltle.tolist())

In [13]:
df_genres_matrix_content

Unnamed: 0,0,1,2,3,4,5,6
Forrest Gump,0.0,0.0,0.0,0.0,0.0,1.0,0.0
Joker,0.0,0.0,0.0,0.0,0.707107,0.0,0.707107
Avengers: Endgame,0.773262,0.634086,0.0,0.0,0.0,0.0,0.0
Spirited Away,0.0,0.0,0.707107,0.0,0.707107,0.0,0.0
Parasite,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107
Soul,0.0,0.57735,0.57735,0.57735,0.0,0.0,0.0
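The weights in the table above can be reproduced without scikit-learn. The sketch below is an illustration, assuming the smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is what `TfidfVectorizer` applies with its default settings. The helper names (`docs`, `vocab`, `tfidf_row`) are introduced here and are not part of the notebook.

```python
import math

# Genre strings copied from df_movie above.
genres = {
    "Forrest Gump": "Romance",
    "Joker": "Crime Thriller",
    "Avengers: Endgame": "Action Adventure",
    "Spirited Away": "Crime Animation",
    "Parasite": "Comedy Thriller",
    "Soul": "Animation Adventure Comedy",
}

# Lower-case and split on whitespace; sorting gives the 7 column positions 0..6.
docs = {title: text.lower().split() for title, text in genres.items()}
vocab = sorted({term for terms in docs.values() for term in terms})
n_docs = len(docs)

# Smoothed inverse document frequency for every genre term.
idf = {}
for term in vocab:
    doc_freq = sum(term in terms for terms in docs.values())
    idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf_row(title):
    """Return the L2-normalised tf-idf weights of one movie, in vocab order."""
    terms = docs[title]
    raw = [terms.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]


avengers = dict(zip(vocab, tfidf_row("Avengers: Endgame")))
# Compare with 0.773262 and 0.634086 in the Avengers: Endgame row above.
print(round(avengers["action"], 6), round(avengers["adventure"], 6))
```

"Action" appears in only one movie while "Adventure" appears in two, so the rarer genre gets the larger weight. Movies whose genres are all equally common, such as Joker, end up with equal weights.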


- Based on summary text (used together with the genre scores in the recommender below)

In [14]:
# tf-idf vectors
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_movie['summarytext'])

In [15]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray())
print(tfidf_df.shape)

(6, 99)


In [17]:
df__text_matrix_content = pd.DataFrame(tfidf_matrix.toarray(), index=df_movie.movietiltle.tolist())
df__text_matrix_content

### Building Recommender System_Content Cosine Similarity

In [19]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def recommend(UserId, MovieTitle, NumRecommend):
    # take the vectors for the selected movie from both content matrices
    a_1 = np.array(df_genres_matrix_content.loc[MovieTitle]).reshape(1, -1)
    a_2 = np.array(df__text_matrix_content.loc[MovieTitle]).reshape(1, -1)

    # calculate the similarity of this movie to every movie in the list
    score_1 = cosine_similarity(df_genres_matrix_content, a_1).reshape(-1)
    score_2 = cosine_similarity(df__text_matrix_content, a_2).reshape(-1)

    # form a data frame of similarity scores, one row per movie
    dictDf = {'genres_content': score_1, 'text_content': score_2}
    similar = pd.DataFrame(dictDf, index=df_genres_matrix_content.index)

    # keep only the movies the user has not rated, then sort by genres_content
    movie = unratedMovie(UserId, similar)
    return movie.sort_values('genres_content', ascending=False).head(NumRecommend)

def unratedMovie(UserId, df_recommend):
    # map movie title to movie id
    Mapping_file = dict(zip(df_movie.movietiltle.tolist(), df_movie.movieid.tolist()))
    # ids of the movies the user has already rated (empty for an unknown user)
    ui_list = df_rating[df_rating.userid == UserId].movieid.tolist()
    # titles the user has not rated (not seen); a list, since newer pandas
    # versions reject a set as a .loc indexer
    d = [k for k, v in Mapping_file.items() if v not in ui_list]
    return df_recommend.loc[d]

### Recommend for Participant

In [20]:
# merge both datasets on the 'movieid' column
df_movie_rating = pd.merge(df_movie, df_rating, on="movieid").sort_values(by = 'userid')
df_movie_rating.tail(3)

Unnamed: 0,movieid,movietiltle,genres,summarytext,userid,ratings
4,1,Forrest Gump,Romance,"The presidencies of Kennedy and Johnson, the e...",5,5
16,4,Spirited Away,Crime Animation,"During her family's move to the suburbs, a sul...",5,5
23,6,Soul,Animation Adventure Comedy,"After landing the gig of a lifetime, a New Yor...",5,3


In [21]:
df_MovieRating_pivot = df_movie_rating.pivot(index ='movietiltle', columns ='userid', values = 'ratings').fillna(0)

In [22]:
print(df_MovieRating_pivot.shape)
df_MovieRating_pivot

(6, 5)


movietiltle \ userid,1,2,3,4,5
Avengers: Endgame,3.0,1.0,4.0,4.0,2.0
Forrest Gump,5.0,3.0,4.0,4.0,5.0
Joker,5.0,4.0,0.0,3.0,0.0
Parasite,0.0,5.0,4.0,3.0,0.0
Soul,0.0,4.0,2.0,4.0,3.0
Spirited Away,5.0,5.0,2.0,0.0,5.0


I will analyze the top recommendations for users 1 and 5 using the content-based filter.

For each user, a rating above 3 is taken to mean that the user likes the movie.
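The "likes" rule can be written out explicitly. The sketch below is an illustration, assuming the ratings are copied from the pivot table above; `LIKE_THRESHOLD`, `liked_movies`, and `unrated_movies` are hypothetical names that do not appear elsewhere in the notebook.

```python
# Ratings copied from the pivot table above; 0 means the user did not rate the movie.
ratings = {
    "Avengers: Endgame": {1: 3, 2: 1, 3: 4, 4: 4, 5: 2},
    "Forrest Gump": {1: 5, 2: 3, 3: 4, 4: 4, 5: 5},
    "Joker": {1: 5, 2: 4, 3: 0, 4: 3, 5: 0},
    "Parasite": {1: 0, 2: 5, 3: 4, 4: 3, 5: 0},
    "Soul": {1: 0, 2: 4, 3: 2, 4: 4, 5: 3},
    "Spirited Away": {1: 5, 2: 5, 3: 2, 4: 0, 5: 5},
}

LIKE_THRESHOLD = 3  # a rating strictly above this counts as "liked"


def liked_movies(user_id):
    """Titles the user rated above the threshold, in alphabetical order."""
    return sorted(t for t, row in ratings.items() if row[user_id] > LIKE_THRESHOLD)


def unrated_movies(user_id):
    """Titles the user has not rated, i.e. the candidates for recommendation."""
    return sorted(t for t, row in ratings.items() if row[user_id] == 0)


for user_id in (1, 5):
    print(user_id, liked_movies(user_id), unrated_movies(user_id))
```

The liked titles serve as seed movies for `recommend`, and the unrated titles are the only ones it can return for that user.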

In [23]:
# 1. For user 1,
# recommend the 2 movies most similar to 'Joker'
recommend(1,'Joker',2)

Unnamed: 0,genres_content,text_content
Parasite,0.5,0.0
Soul,0.0,0.0


In [24]:
# recommend the 2 movies most similar to 'Spirited Away'
recommend(1,'Spirited Away',2)

Unnamed: 0,genres_content,text_content
Soul,0.408248,0.0
Parasite,0.0,0.045936


In [25]:
# 2. For user 5,
# recommend the 2 movies most similar to 'Spirited Away'
recommend(5,'Spirited Away',2)

Unnamed: 0,genres_content,text_content
Joker,0.5,0.0
Parasite,0.0,0.045936


### Result

Using the recommender system:

For user 1, I recommend Parasite and Soul, based on 'Joker' and 'Spirited Away', which the user has seen and liked.

For user 5, I recommend Joker, based on 'Spirited Away', which the user has seen and liked.

### Tf-idf Method
![tf-idf](https://miro.medium.com/max/455/1*3Ig7VSgscBzXaYa0Q-UM1w.png)


### Cosine_similarity Method

![Cosine_similarity](https://miro.medium.com/max/871/1*Q4xQoV8k_7S7xB-NfvFdrw.png)
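As a small worked example of the formula, the sketch below recomputes two of the `genres_content` scores reported earlier from the rows of `df_genres_matrix_content`. It uses a plain-Python `cosine` helper introduced here for illustration, not scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Rows copied from df_genres_matrix_content (columns 0..6).
joker = [0.0, 0.0, 0.0, 0.0, 0.707107, 0.0, 0.707107]
parasite = [0.0, 0.0, 0.0, 0.707107, 0.0, 0.0, 0.707107]
spirited_away = [0.0, 0.0, 0.707107, 0.0, 0.707107, 0.0, 0.0]
soul = [0.0, 0.57735, 0.57735, 0.57735, 0.0, 0.0, 0.0]

print(round(cosine(joker, parasite), 6))      # 0.5
print(round(cosine(spirited_away, soul), 6))  # 0.408248
```

Joker and Parasite share one of their two genres, which gives 0.5. Spirited Away and Soul share one genre out of two and three respectively, which gives 1/sqrt(6), roughly 0.408.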

# Reference

https://heartbeat.fritz.ai/recommender-systems-with-python-part-i-content-based-filtering-5df4940bd831

https://codingnomads.co/blog/data-analysis-example-analyzing-movie-ratings-with-python/

https://dev.to/magesh236/scrape-imdb-movie-rating-and-details-3a7c