## Assignment - Content Management

-by Qi Sun

**Purpose: In this assignment, you’re asked to build a content management system to rate the unrated movies for at least one of your first assignment’s survey participants.**


Here is the survey that I designed for my first assignment.

<img src="https://raw.githubusercontent.com/susanqisun/DAV6300/main/Screen%20Shot%202021-02-02%20at%201.47.20%20PM.png" width="500">

The movie information was downloaded here:

https://www.imdb.com/chart/moviemeter/



## 1.1 Load Survey data

In [2]:
import pandas as pd
import numpy as np

# survey results
df_survey = pd.read_csv('https://raw.githubusercontent.com/susanqisun/DAV6300/main/movie%20recommender.csv')
df_survey


Unnamed: 0,UserID,The Little Things,The White Tiger,The Dig,Soul,Wonder Woman 1984,Promising Young Woman
0,1,3,4,5,Not Sure,1,3.0
1,2,Not Sure,Not Sure,4,5,5,4.0
2,3,5,4,5,5,3,4.0
3,4,Not Sure,Not Sure,Not Sure,3,4,5.0
4,5,5,5,4,3,4,2.0
5,6,3,2,,3,4,5.0
6,7,,1,,,4,


## 1.2 Clean Survey Data

In [32]:
# treat 'Not Sure' answers as missing ratings (np.nan; the np.NaN alias was removed in NumPy 2.0)
df_survey02 = df_survey.replace('Not Sure', np.nan)
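
Replacing 'Not Sure' with NaN leaves the affected columns as text, because pandas read them as strings. A minimal sketch of one way to finish the cleaning step is shown below; the two-row `sample` frame is a hypothetical stand-in for the survey data, not the real file.

```python
import pandas as pd

# Hypothetical two-row sample that mirrors the survey layout
sample = pd.DataFrame({
    'UserID': [1, 2],
    'The Dig': ['5', '4'],
    'Soul': ['Not Sure', '5'],
})

movie_cols = sample.columns.drop('UserID')
cleaned = sample.copy()
# errors='coerce' turns any non-numeric entry (such as 'Not Sure') into NaN
cleaned[movie_cols] = cleaned[movie_cols].apply(pd.to_numeric, errors='coerce')
print(cleaned.dtypes)
```

With numeric columns, later steps such as sorting or averaging ratings do not have to deal with mixed strings and numbers.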

## 1.3 Movies with top 3 ratings for each user



In [33]:
# https://stackoverflow.com/questions/28609667/pandas-find-column-name-and-value-with-max-and-second-max-value-for-each-row

df03 = df_survey02.copy()

def top(x):
    x.set_index('UserID', inplace=True)
    df03 = pd.DataFrame({'1st Max':[],'Max1Value':[],'2nd Max':[],'Max2Value':[],'3rd Max':[],'Max3Value':[]})
    df03.index.name='User'
    df03.loc[x.index.values[0],['1st Max', '2nd Max','3rd Max']] = x.sum().nlargest(3).index.tolist()
    df03.loc[x.index.values[0],['Max1Value', 'Max2Value','Max3Value']] = x.sum().nlargest(3).values
    return df03

df_top = df03.groupby('UserID').apply(top).reset_index(level=1, drop=True).reset_index()
df_top



Unnamed: 0,UserID,1st Max,Max1Value,2nd Max,Max2Value,3rd Max,Max3Value
0,1,The Dig,5.0,The White Tiger,4.0,The Little Things,3.0
1,2,Soul,5.0,Wonder Woman 1984,5.0,The Dig,4.0
2,3,The Little Things,5.0,The Dig,5.0,Soul,5.0
3,4,Promising Young Woman,5.0,Wonder Woman 1984,4.0,Soul,3.0
4,5,The Little Things,5.0,The White Tiger,5.0,The Dig,4.0
5,6,Promising Young Woman,5.0,Wonder Woman 1984,4.0,The Little Things,3.0
6,7,Wonder Woman 1984,4.0,The White Tiger,1.0,The Little Things,0.0
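
User 7 rated only two movies, yet the table lists a third pick with a value of 0.0, which appears to come from summing a column that holds no rating. The sketch below is an alternative (not the approach used above) that reshapes the ratings to long form and drops missing ratings before ranking. The `ratings` frame holds the answers of users 6 and 7 for a subset of the movies, and the names `long_form` and `top3` are illustrative.

```python
import numpy as np
import pandas as pd

# Ratings of users 6 and 7 for four of the six movies (NaN = not rated)
ratings = pd.DataFrame({
    'UserID': [6, 7],
    'The Little Things': [3, np.nan],
    'The White Tiger': [2, 1],
    'Wonder Woman 1984': [4, 4],
    'Promising Young Woman': [5, np.nan],
})

# One row per (user, movie) pair, keeping only movies the user actually rated
long_form = ratings.melt(id_vars='UserID', var_name='movie', value_name='rating')
long_form = long_form.dropna(subset=['rating'])

# Highest ratings first within each user, then keep at most three rows per user
top3 = (long_form.sort_values(['UserID', 'rating'], ascending=[True, False])
                 .groupby('UserID')
                 .head(3)
                 .reset_index(drop=True))
print(top3)
```

In this form a user with fewer than three ratings simply gets fewer rows, so unrated movies never appear among the top picks.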


## 2.1 Create Movie Genre dataset

Assignment requirement: Look up movie genres on IMDb. One movie can have multiple genres. Use this information to build a list of content-based recommendations. Indicate the top movie that you would recommend to each participant that you are analyzing.



In [6]:
# create table for movie genres
data = {'title':  ['The Little Things', 'The White Tiger','The Dig','Soul','Wonder Woman 1984','Promising Young Woman'],
        'genre': ['Crime, Drama, Thriller', 'Crime, Drama','Biography, Drama, History','Animation, Adventure, Comedy','Action, Adventure, Fantasy','Crime, Drama, Thriller']
        }

df = pd.DataFrame (data, columns = ['title','genre'])
df


Unnamed: 0,title,genre
0,The Little Things,"Crime, Drama, Thriller"
1,The White Tiger,"Crime, Drama"
2,The Dig,"Biography, Drama, History"
3,Soul,"Animation, Adventure, Comedy"
4,Wonder Woman 1984,"Action, Adventure, Fantasy"
5,Promising Young Woman,"Crime, Drama, Thriller"


## 2.2 Which are the most popular genres?

In [7]:
# split on ', ' so that genre names do not carry a leading space
genre_popularity = (df.genre.str.split(', ')
                      .explode()
                      .value_counts()
                      .sort_values(ascending=False))
genre_popularity.head(10)

Drama        4
Crime        3
Adventure    2
Thriller     2
Animation    1
Comedy       1
Action       1
Fantasy      1
History      1
Biography    1
Name: genre, dtype: int64

## 2.3 Build a content based recommender using genre

Code reference: https://towardsdatascience.com/content-based-recommender-systems-28a1dbd858f5

### tf-idf

To obtain the tf-idf vectors I'll be using sklearn's TfidfVectorizer.

In [8]:
#https://towardsdatascience.com/content-based-recommender-systems-28a1dbd858f5
from sklearn.feature_extraction.text import TfidfVectorizer
from itertools import combinations

# Custom analyzer: every combination of 1, 2 or 3 genres becomes one feature
# (range(1, 4) stops before 4).
tf = TfidfVectorizer(analyzer=lambda s: (c for i in range(1, 4)
                     for c in combinations(s.split(','), r=i)))
tfidf_matrix = tf.fit_transform(df['genre'])
tfidf_matrix.shape

(6, 26)
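
To make the 26 features more concrete, the sketch below lists the genre combinations that a single movie contributes. The helper `genre_combinations` is a hypothetical name, and unlike the analyzer above it strips the space after each comma, so its tokens read `'Drama'` rather than `' Drama'`.

```python
from itertools import combinations

def genre_combinations(s, max_size=3):
    """Return every combination of 1..max_size genres from a comma-separated string."""
    genres = [g.strip() for g in s.split(',')]
    return [c for r in range(1, max_size + 1) for c in combinations(genres, r)]

tokens = genre_combinations('Crime, Drama, Thriller')
for t in tokens:
    print(t)
```

A movie with three genres yields 3 singles, 3 pairs and 1 triple, so 7 features; a movie with two genres yields 3. Combinations shared between movies are counted once in the vocabulary.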

### tf-idf vectors 

In [10]:
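# get_feature_names() was removed in scikit-learn 1.2; rebuild the feature list in column order from vocabulary_
feature_names = sorted(tf.vocabulary_, key=tf.vocabulary_.get)
pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names, index=df.title).sample(8, axis=1).sample(6, axis=0)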


title,"(Crime, Drama, Thriller)","(Action, Adventure, Fantasy)","( Comedy,)","(Biography,)","(Action, Fantasy)","(Biography, Drama)","(Animation,)","( Drama, History)"
Promising Young Woman,0.409995,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Wonder Woman 1984,0.0,0.387131,0.0,0.0,0.387131,0.0,0.0,0.0
The Little Things,0.409995,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Dig,0.0,0.0,0.0,0.396777,0.0,0.396777,0.0,0.396777
The White Tiger,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Soul,0.0,0.0,0.387131,0.0,0.0,0.0,0.387131,0.0


### Similarity between vectors

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix)

In [12]:
cosine_sim_df = pd.DataFrame(cosine_sim, index=df['title'], columns=df['title'])
print('Shape:', cosine_sim_df.shape)
cosine_sim_df.sample(5, axis=1).round(2)

Shape: (6, 6)


title,The Little Things,Promising Young Woman,The White Tiger,The Dig,Wonder Woman 1984
The Little Things,1.0,1.0,0.57,0.07,0.0
The White Tiger,0.57,0.57,1.0,0.12,0.0
The Dig,0.07,0.07,0.12,1.0,0.0
Soul,0.0,0.0,0.0,0.0,0.1
Wonder Woman 1984,0.0,0.0,0.0,0.0,1.0
Promising Young Woman,1.0,1.0,0.57,0.07,0.0
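
For readers who want to see what the similarity score measures, here is a small sketch that computes cosine similarity by hand with NumPy. The vectors `a`, `b` and `c` are made-up toy values, not rows of the tf-idf matrix.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors: a and b use the same features, c uses a different one
a = np.array([1.0, 1.0, 0.0])
b = np.array([2.0, 2.0, 0.0])
c = np.array([0.0, 0.0, 1.0])

print(cosine(a, b))
print(cosine(a, c))
```

Vectors that point in the same direction score 1 regardless of their length, and vectors with no features in common score 0. This is why two movies with identical genre lists reach 1.0 in the table, while movies with no shared genre combination reach 0.0.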


Next we need some logic to find the movies most similar to a given movie. The function below takes a movie title `i`, the similarity matrix `M` and the `items` dataframe, and returns up to `k` recommendations:


In [None]:
def genre_recommendations(i, M, items, k=7):
    """
    Recommends movies based on a similarity dataframe

    Parameters
    ----------
    i : str
        Movie (index of the similarity dataframe)
    M : pd.DataFrame
        Similarity dataframe, symmetric, with movies as indices and columns
    items : pd.DataFrame
        Contains both the title and some other features used to define similarity
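    k : int
        Number of recommendations to return

    Returns
    -------
    pd.DataFrame
        Up to k rows of `items` for the movies most similar to `i`
    """
    # partition so that the k-1 largest similarities sit, sorted, at the end of ix
    ix = M.loc[:,i].to_numpy().argpartition(range(-1,-k,-1))
    # walk ix backwards so the closest titles come first
    closest = M.columns[ix[-1:-(k+2):-1]]
    # a movie is always most similar to itself, so drop it from its own results
    closest = closest.drop(i, errors='ignore')
    return pd.DataFrame(closest).merge(items).head(k)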
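
The same ranking can also be written with a plain sort, which may be easier to follow than `argpartition` on a dataset this small. This is a sketch of an alternative, not the referenced implementation; `recommend_by_sort` is a hypothetical name, and the 3x3 similarity frame uses the rounded values from the table above for three of the movies.

```python
import pandas as pd

def recommend_by_sort(title, sim_df, items, k=3):
    """Return the k movies most similar to `title`, most similar first."""
    # similarities to the chosen movie, without the movie itself
    scores = sim_df[title].drop(title, errors='ignore')
    ranked = scores.sort_values(ascending=False, kind='mergesort').head(k)
    out = ranked.rename('similarity').rename_axis('title').reset_index()
    # a left merge keeps the ranked order while attaching the genre column
    return out.merge(items, on='title', how='left')

titles = ['The Little Things', 'The White Tiger', 'The Dig']
# rounded cosine similarities taken from the table above
sim = pd.DataFrame([[1.00, 0.57, 0.07],
                    [0.57, 1.00, 0.12],
                    [0.07, 0.12, 1.00]], index=titles, columns=titles)
items = pd.DataFrame({'title': titles,
                      'genre': ['Crime, Drama, Thriller', 'Crime, Drama',
                                'Biography, Drama, History']})

result = recommend_by_sort('The Dig', sim, items, k=2)
print(result)
```

A full sort costs more than a partition on a large catalogue, but with six movies the difference is negligible, and the result also carries the similarity score.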

## Recommendations for User 4:

In [34]:
df_top

Unnamed: 0,UserID,1st Max,Max1Value,2nd Max,Max2Value,3rd Max,Max3Value
0,1,The Dig,5.0,The White Tiger,4.0,The Little Things,3.0
1,2,Soul,5.0,Wonder Woman 1984,5.0,The Dig,4.0
2,3,The Little Things,5.0,The Dig,5.0,Soul,5.0
3,4,Promising Young Woman,5.0,Wonder Woman 1984,4.0,Soul,3.0
4,5,The Little Things,5.0,The White Tiger,5.0,The Dig,4.0
5,6,Promising Young Woman,5.0,Wonder Woman 1984,4.0,The Little Things,3.0
6,7,Wonder Woman 1984,4.0,The White Tiger,1.0,The Little Things,0.0


In [35]:
df[df.title.eq('Promising Young Woman')]

Unnamed: 0,title,genre
5,Promising Young Woman,"Crime, Drama, Thriller"


In [36]:
genre_recommendations('Promising Young Woman', cosine_sim_df, df[['title', 'genre']])


Unnamed: 0,title,genre
0,The Little Things,"Crime, Drama, Thriller"
1,The White Tiger,"Crime, Drama"
2,The Dig,"Biography, Drama, History"
3,Wonder Woman 1984,"Action, Adventure, Fantasy"
4,Soul,"Animation, Adventure, Comedy"


As expected, the most similar movies are those which share the most genres.
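
Since the goal is to recommend movies a participant has not rated yet, the similarity scores can be restricted to that participant's unrated movies. The sketch below does this for User 4, whose top-rated movie is Promising Young Woman. The similarity values are the rounded ones from the table above, and `rank_unrated` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Rounded similarities to 'Promising Young Woman', copied from the table above
sim_to_top = pd.Series({
    'The Little Things': 1.0, 'The White Tiger': 0.57, 'The Dig': 0.07,
    'Soul': 0.0, 'Wonder Woman 1984': 0.0,
})

# User 4's survey answers (NaN stands for 'Not Sure')
user4 = pd.Series({
    'The Little Things': np.nan, 'The White Tiger': np.nan, 'The Dig': np.nan,
    'Soul': 3, 'Wonder Woman 1984': 4, 'Promising Young Woman': 5,
})

def rank_unrated(user_ratings, sim_to_favourite):
    """Rank the movies a user has not rated by similarity to their favourite."""
    unrated = user_ratings[user_ratings.isna()].index
    return sim_to_favourite.reindex(unrated).sort_values(ascending=False)

ranking = rank_unrated(user4, sim_to_top)
print(ranking)
```

Under these assumptions the top genre-based recommendation for User 4 is The Little Things, which has the same genre list as the user's favourite.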

## 3.1 Create Movie Description dataset

Assignment requirement: Scrape movie descriptions from a website. This could be anything from a single sentence to a full synopsis or a movie review. Using tools like TF-IDF and (if you want) LDA, again determine the top movie that you would recommend to each participant that you are analyzing.


In [39]:
# create table for movie description
data02 = {'movie':  ['The Little Things', 'The White Tiger','The Dig','Soul','Wonder Woman 1984','Promising Young Woman'],
        'description': ['Kern County Deputy Sheriff Joe Deacon is sent to Los Angeles for what should have been a quick evidence-gathering assignment. Instead, he becomes embroiled in the search for a serial killer who is terrorizing the city.', 
                        'An ambitious Indian driver uses his wit and cunning to escape from poverty and rise to the top. An epic journey based on the New York Times bestseller.',
                        'An archaeologist embarks on the historically important excavation of Sutton Hoo in 1938.',
                        'After landing the gig of a lifetime, a New York jazz pianist suddenly finds himself trapped in a strange land between Earth and the afterlife.',
                        'Diana must contend with a work colleague and businessman, whose desire for extreme wealth sends the world down a path of destruction, after an ancient artifact that grants wishes goes missing.',
                        'A young woman, traumatized by a tragic event in her past, seeks out vengeance against those who crossed her path.']
        }

df_desc = pd.DataFrame (data02, columns = ['movie','description'])
df_desc


Unnamed: 0,movie,description
0,The Little Things,Kern County Deputy Sheriff Joe Deacon is sent ...
1,The White Tiger,An ambitious Indian driver uses his wit and cu...
2,The Dig,An archaeologist embarks on the historically i...
3,Soul,"After landing the gig of a lifetime, a New Yor..."
4,Wonder Woman 1984,Diana must contend with a work colleague and b...
5,Promising Young Woman,"A young woman, traumatized by a tragic event i..."


In [40]:
# read movie ID
movie = pd.read_csv('https://raw.githubusercontent.com/susanqisun/DAV6300/main/movieID.csv')
movie

Unnamed: 0,movieID,movie
0,101,The Little Things
1,102,The White Tiger
2,103,The Dig
3,104,Soul
4,105,Wonder Woman 1984
5,106,Promising Young Woman


In [55]:
# merge movie IDs and descriptions on the shared 'movie' column
df_movie = pd.merge(left=movie, right=df_desc, on='movie', how='outer')

df_movie

Unnamed: 0,movieID,movie,description
0,101,The Little Things,Kern County Deputy Sheriff Joe Deacon is sent ...
1,102,The White Tiger,An ambitious Indian driver uses his wit and cu...
2,103,The Dig,An archaeologist embarks on the historically i...
3,104,Soul,"After landing the gig of a lifetime, a New Yor..."
4,105,Wonder Woman 1984,Diana must contend with a work colleague and b...
5,106,Promising Young Woman,"A young woman, traumatized by a tragic event i..."


## 3.2 Content based recommendation system : Using movie description

Code reference: https://github.com/jalajthanaki/Movie_recommendation_engine/blob/master/Movie_recommendation_engine.ipynb

In [42]:
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf02 = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix02 = tf02.fit_transform(df_movie['description'])


In [43]:
tfidf_matrix02.shape


(6, 162)

In [44]:
cosine_sim02 = cosine_similarity(tfidf_matrix02,tfidf_matrix02)

In [45]:
# We now have a pairwise cosine similarity matrix for all the movies in our dataset.
cosine_sim02[0]

array([1., 0., 0., 0., 0., 0.])

In [56]:
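def get_recommendations(title):
    # row position of the requested movie in the similarity matrix
    idx = indices[title]
    # pair every other movie's position with its similarity to the requested movie
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim02[idx]) if i != idx]
    # most similar first; sorted() is stable, so tied movies keep their original order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[:6]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]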

In [57]:
df_movie = df_movie.reset_index()
titles = df_movie['movie']
indices = pd.Series(df_movie.index, index=df_movie['movie'])
indices

movie
The Little Things        0
The White Tiger          1
The Dig                  2
Soul                     3
Wonder Woman 1984        4
Promising Young Woman    5
dtype: int64

In [58]:
get_recommendations('Promising Young Woman')


4    Wonder Woman 1984
0    The Little Things
1      The White Tiger
2              The Dig
3                 Soul
Name: movie, dtype: object

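We see that for Promising Young Woman, our system recommends 'Wonder Woman 1984' as its top recommendation. The match is weak, though: once stop words are removed, the two descriptions appear to share only the word "path". The remaining four movies are listed in index order, which suggests that their similarity scores are tied, most likely at zero.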

### Compare Movie Description of the results:

**'Promising Young Woman'**: 'A young woman, traumatized by a tragic event in her past, seeks out vengeance against those who crossed her path.'


**'Wonder Woman 1984'**: 'Diana must contend with a work colleague and businessman, whose desire for extreme wealth sends the world down a path of destruction, after an ancient artifact that grants wishes goes missing.'
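
To check which words drive this match, the two descriptions can be compared directly. The sketch below is a rough stand-in for what the vectorizer does: it lower-cases the text, keeps alphanumeric tokens and removes a small hand-picked stop-word set. The `STOP` set is an assumption made for this example and is much shorter than scikit-learn's built-in English list.

```python
import re

# Small hand-picked stop-word set (an assumption for this sketch)
STOP = {'a', 'an', 'the', 'of', 'in', 'by', 'and', 'with', 'for', 'her',
        'out', 'who', 'that', 'must', 'down', 'after', 'those', 'whose', 'against'}

def content_words(text):
    """Lower-case the text and return its set of non-stop-word tokens."""
    return {w for w in re.findall(r'[a-z0-9]+', text.lower()) if w not in STOP}

pyw = ('A young woman, traumatized by a tragic event in her past, '
       'seeks out vengeance against those who crossed her path.')
ww84 = ('Diana must contend with a work colleague and businessman, whose desire '
        'for extreme wealth sends the world down a path of destruction, after an '
        'ancient artifact that grants wishes goes missing.')

shared = content_words(pyw) & content_words(ww84)
print(shared)
```

With one-sentence descriptions, a single coincidental word can decide the ranking, so longer synopses or reviews would probably give more meaningful similarities.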


