In [1]:
# execute to import notebook styling for tables and width etc.
from IPython.display import HTML
import urllib.request
response = urllib.request.urlopen('https://raw.githubusercontent.com/DataScienceUWL/DS775v2/master/ds755.css')
HTML(response.read().decode("utf-8"));

In [2]:
import pandas as pd

<font size=18>Lesson 10 Homework</font>

# Build a Simple Recommender

Use the data set **tmdb_5000_movies.csv** to build a simple recommender system that selects movies with the following characteristics:

- movies in the top 50\% according to the popularity score
- movies with a runtime of 80 minutes or more
- movies with a budget between \\$10 million and \\$100 million
- movies with a genre of either action or comedy

Then rank the movies according to the IMDB weighted rating formula described in the Banik textbook and print the 10 highest-rated movies with the characteristics listed above.
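
For reference, the IMDB weighted rating blends a movie's own average with the overall mean: $WR = \frac{v}{v+m}R + \frac{m}{v+m}C$, where $v$ is the movie's vote count, $m$ is a vote-count threshold, $R$ is the movie's average rating, and $C$ is the mean rating across all movies considered. The following is a minimal sketch in plain Python on made-up numbers; the function name `weighted_rating` and the toy values are illustrative assumptions, not part of the data set or the textbook code.

```python
def weighted_rating(v, r, m, c):
    """IMDB weighted rating for a single movie.

    v: number of votes for the movie
    r: average rating of the movie
    m: vote-count threshold (a tuning choice, e.g. a quantile of vote counts)
    c: mean rating across all movies under consideration
    """
    return (v / (v + m)) * r + (m / (v + m)) * c


# Toy values, chosen only to show how few votes pull a rating toward the mean
m, c = 100, 6.0
many_votes = weighted_rating(v=900, r=8.0, m=m, c=c)  # stays close to its own rating
few_votes = weighted_rating(v=100, r=8.0, m=m, c=c)   # v == m, so halfway between r and c
print(round(many_votes, 2), round(few_votes, 2))
```

Both toy movies have the same average rating, but the one with fewer votes ends up with the lower weighted rating, which is the behavior the ranking relies on.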

### Code

In [3]:
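import ast

# load the TMDB data (the file requires ISO-8859-1 encoding)
tmdb_df = pd.read_csv("data/tmdb-5000-movie-dataset/tmdb_5000_movies.csv", encoding="ISO-8859-1")

movie_indices = []
for i in range(tmdb_df.shape[0]):
    # parse the stringified list of genre dicts; literal_eval does not execute arbitrary code
    movie_genres = ast.literal_eval(tmdb_df['genres'].iloc[i])
    # replace the raw string with a readable, comma-separated list of genre names
    tmdb_df.loc[i, 'genres'] = ', '.join([genre['name'] for genre in movie_genres])

    # remember the row if any of its genres is Action or Comedy
    for genre in movie_genres:
        if genre['name'] == 'Action' or genre['name'] == 'Comedy':
            movie_indices.append(i)

# drop duplicates (a movie can be both Action and Comedy)
movie_indices = set(movie_indices)

# filter on genres (action or comedy)
df_filtered = tmdb_df.filter(items=movie_indices, axis=0)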

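# filter on the remaining criteria: popularity, runtime, and budget
# (note: the popularity median here is computed on the genre-filtered frame, not the full data set)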
df_filtered = df_filtered[(df_filtered['popularity'] >= df_filtered['popularity'].quantile(.5)) & 
                      (df_filtered['runtime'] >= 80) & 
                      (df_filtered['budget'] >= 10000000) & 
                      (df_filtered['budget'] <= 100000000)]

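# create IMDB weighted rating: WR = (v/(v+m))*R + (m/(v+m))*C
m = df_filtered['vote_count'].quantile(.25)  # vote-count threshold (25th percentile)
r = df_filtered['vote_average']              # each movie's average rating
v = df_filtered['vote_count']                # each movie's number of votes
c = r.mean()                                 # mean rating across the filtered movies
wr_imdb = round(((v/(v+m))*r)+((m/(v+m))*c),2)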

# assign weighted rating as a column
df_filtered = df_filtered.assign(weighted_rating=wr_imdb)

# sort the filtered df by weighted rating
df_filtered = df_filtered.sort_values('weighted_rating', ascending=False).reset_index()

**Provide the following with your solution:**

(a) How many movies are in the original data frame?

In [4]:
tmdb_df.shape[0]

4803

(b) What is the median for popularity score?

In [5]:
tmdb_df['popularity'].median()

12.921594

(c) How many movies are in the reduced data frame (*i.e.* how many meet the selection criteria)?

In [6]:
df_filtered.shape[0]

935

(d) Display only the movie title and the requested characteristics as well as the vote count, vote average, and IMDB weighted rating with your recommendations.

In [7]:
df_filtered[['original_title','genres','weighted_rating','vote_count','vote_average']]

|  | original_title | genres | weighted_rating | vote_count | vote_average |
|---|---|---|---|---|---|
| 0 | Forrest Gump | Comedy, Drama, Romance | 8.09 | 7927 | 8.2 |
| 1 | The Empire Strikes Back | Adventure, Action, Science Fiction | 8.06 | 5879 | 8.2 |
| 2 | The Lord of the Rings: The Return of the King | Adventure, Fantasy, Action | 8.00 | 8064 | 8.1 |
| 3 | Star Wars | Adventure, Action, Science Fiction | 7.98 | 6624 | 8.1 |
| 4 | The Lord of the Rings: The Fellowship of the Ring | Adventure, Fantasy, Action | 7.91 | 8705 | 8.0 |
| ... | ... | ... | ... | ... | ... |
| 930 | Disaster Movie | Action, Comedy | 5.10 | 240 | 3.0 |
| 931 | Left Behind | Thriller, Action, Science Fiction | 5.05 | 392 | 3.7 |
| 932 | Jack and Jill | Comedy | 5.00 | 604 | 4.1 |
| 933 | Catwoman | Action, Crime | 4.92 | 808 | 4.2 |


This data set can be found in the presentation download for this lesson.  You will need to use the option **encoding = "ISO-8859-1"** in the **read_csv** function in order to open this file.  Use the examples given in Banik's book as a guide.

# Build a Knowledge-Based Recommender

Use the data set **tmdb_5000_movies.csv** to build a knowledge-based recommender system that solicits the information listed below and then ranks the movies according to the IMDB weighted rating formula. Use all available movies to begin with (*i.e.* don't restrict it to just the top 20%, for example). Print the top 5 highest rated movies for this recommendation.

Ask the user to enter answers to the following questions:

- Enter a preferred genre.
- Enter another preferred genre.
- Enter a minimum runtime (in minutes).
- Enter a maximum runtime (in minutes).

Print a list of genres for the user to choose from before asking for their inputs.

Create the recommender to select movies with either genre entered.  
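
One way to implement the "either genre" rule is to parse each movie's stringified genre list and check whether it shares at least one name with the user's choices. The sketch below uses made-up rows and a hypothetical helper, `matches_any_genre`. The case-insensitive comparison is an added assumption, so that an input such as "family" still matches "Family".

```python
import ast


def matches_any_genre(genres_field, wanted):
    """Return True if the movie has at least one of the wanted genres.

    genres_field: string such as '[{"id": 35, "name": "Comedy"}]'
    wanted: iterable of genre names typed by the user
    """
    names = {g["name"].strip().lower() for g in ast.literal_eval(genres_field)}
    return not names.isdisjoint(w.strip().lower() for w in wanted)


# Made-up rows in the same shape as the 'genres' column
rows = [
    '[{"id": 10751, "name": "Family"}, {"id": 16, "name": "Animation"}]',
    '[{"id": 18, "name": "Drama"}]',
    '[{"id": 10770, "name": "TV Movie"}]',
    '[]',
]
selected = [i for i, field in enumerate(rows)
            if matches_any_genre(field, ["family", "tv movie"])]
print(selected)
```

Collecting the matching row positions first and filtering the data frame afterwards keeps the genre logic separate from the runtime filter.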

In [8]:
def movie_recommender():
    tmdb_df = pd.read_csv("data/tmdb-5000-movie-dataset/tmdb_5000_movies.csv", encoding="ISO-8859-1")
    
    # calculate unique genres
    list_of_genres = []
    
    for i in range(tmdb_df.shape[0]):
        genre_dict = eval(tmdb_df['genres'].iloc[i])
        list_of_genres.extend(genre_dict)

    genres_df = pd.DataFrame.from_dict(list_of_genres, orient='columns')
    all_genres = ', '.join(genres_df['name'].unique().tolist())
    
    #present genres and get inputs
    print(f"See genres: {all_genres}")
    preferred_genre1 = input("Enter a preferred genre")
    preferred_genre2 = input("Enter another preferred genre")
    min_runtime = int(input("Enter min runtime"))
    max_runtime = int(input("Enter max runtime"))
    
    movie_indices = []
                             
    # filter df by genre
    for i in range(tmdb_df.shape[0]):
        movie_genres = eval(tmdb_df['genres'].iloc[i])
        tmdb_df.loc[i, 'genres'] = ', '.join([genre['name'] for genre in movie_genres])

        for genre in movie_genres:
            if genre['name'] == preferred_genre1 or genre['name'] == preferred_genre2:
                movie_indices.append(i)

    movie_indices = set(movie_indices)

    # filter on the two genres entered by the user
    df_filtered = tmdb_df.filter(items=movie_indices, axis=0)

    # filter on other requests
    df_filtered = df_filtered[(df_filtered['runtime'] >= min_runtime) & 
                          (df_filtered['runtime'] <= max_runtime)]

    # create imdb weighted rating
    m = df_filtered['vote_count'].quantile(.25)
    r = df_filtered['vote_average']
    v = df_filtered['vote_count']
    c = r.mean()
    wr_imdb = round(((v/(v+m))*r)+((m/(v+m))*c),2)

    # assign weighted rating as a column
    df_filtered = df_filtered.assign(weighted_rating=wr_imdb)

    # sort the filtered df by weighted rating
    df_filtered = df_filtered.sort_values('weighted_rating', ascending=False).reset_index()
    df_filtered = df_filtered[['original_title', 'genres', 'runtime', 'weighted_rating','vote_count','vote_average']]
    return df_filtered.head()

Have your recommender give recommendations for genres "family" and "tv movie" between 50 and 120 minutes long and display only the movie title and the requested characteristics as well as the vote count, vote average, and IMDB weighted rating with your recommendations.

Use the examples given in Banik's book as a guide.

In [9]:
movie_recommender()

See genres: Action, Adventure, Fantasy, Science Fiction, Crime, Drama, Thriller, Animation, Family, Western, Comedy, Romance, Horror, Mystery, History, War, Music, Documentary, Foreign, TV Movie
Enter a preferred genreTV Movie
Enter another preferred genreFamily
Enter min runtime50
Enter max runtime120


|  | original_title | genres | runtime | weighted_rating | vote_count | vote_average |
|---|---|---|---|---|---|---|
| 0 | Inside Out | Drama, Comedy, Animation, Family | 94.0 | 7.98 | 6560 | 8.0 |
| 1 | Back to the Future | Adventure, Comedy, Science Fiction, Family | 116.0 | 7.97 | 6079 | 8.0 |
| 2 | The Lion King | Family, Animation, Drama | 89.0 | 7.97 | 5376 | 8.0 |
| 3 | Big Hero 6 | Adventure, Family, Animation, Action, Comedy | 102.0 | 7.78 | 6135 | 7.8 |
| 4 | WALL·E | Animation, Family | 98.0 | 7.78 | 6296 | 7.8 |


# Build a Content-Based Recommender

Use the data set **tmdb_5000_movies.csv** to build a metadata-based recommender by creating a "soup" and using cosine similarity based on the following:

- all genres
- top three keywords
- top three production companies
- overview
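
To make the idea concrete, here is a small dependency-free sketch of the "soup" approach on invented movies. Each movie's metadata is flattened into one bag of tokens (keeping at most three keywords and three production companies), and movies are compared by the cosine similarity of their token counts. The helper names (`squash`, `make_soup`, `cosine`) and the sample data are assumptions for illustration only. The solution cells in this notebook use scikit-learn's `CountVectorizer` and `cosine_similarity` for the same purpose, and unlike `CountVectorizer(stop_words='english')`, this sketch does not remove stop words.

```python
import math
from collections import Counter


def squash(text):
    """Lowercase and remove spaces: 'Science Fiction' -> 'sciencefiction'."""
    return text.lower().replace(" ", "")


def make_soup(genres, keywords, companies, overview):
    """Flatten one movie's metadata into a list of tokens."""
    tokens = [squash(g) for g in genres]          # all genres
    tokens += [squash(k) for k in keywords[:3]]   # top three keywords only
    tokens += [squash(c) for c in companies[:3]]  # top three companies only
    tokens += overview.lower().split()            # overview words
    return tokens


def cosine(a, b):
    """Cosine similarity between two token lists, using raw counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(n * n for n in ca.values()))
            * math.sqrt(sum(n * n for n in cb.values())))
    return dot / norm if norm else 0.0


# Invented movies, for illustration only
soups = {
    "Space Saga": make_soup(["Action", "Science Fiction"], ["space war", "alien"],
                            ["Studio A"], "a marine fights aliens"),
    "Star Patrol": make_soup(["Action", "Science Fiction"], ["alien", "spaceship"],
                             ["Studio B"], "aliens attack a station"),
    "Quiet Town": make_soup(["Drama"], ["family"],
                            ["Studio C"], "a family moves to a small town"),
}

# Rank the other movies by similarity to the target, most similar first
target = "Space Saga"
ranked = sorted((t for t in soups if t != target),
                key=lambda t: cosine(soups[target], soups[t]), reverse=True)
print(ranked)
```

Squashing multi-word names into single tokens keeps "Science Fiction" from being counted as two unrelated words, which is the same reason the notebook strips spaces before building its soup.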

In [18]:
import ast
import string
import pandas as pd

tmdb_df = pd.read_csv("data/tmdb-5000-movie-dataset/tmdb_5000_movies.csv", encoding="ISO-8859-1")
new_df = []

# loop through df to create new variables of list form
for i in range(tmdb_df.shape[0]):
    new_row = {}
    title = tmdb_df['title'].iloc[i]
    # parse the stringified lists of dicts; literal_eval does not execute arbitrary code
    movie_genres = ast.literal_eval(tmdb_df['genres'].iloc[i])
    keywords = ast.literal_eval(tmdb_df['keywords'].iloc[i])
    prod_comps = ast.literal_eval(tmdb_df['production_companies'].iloc[i])
    
    # strip punctuation from the overview and keep its unique words (missing overviews are left as-is)
    if isinstance(tmdb_df['overview'].iloc[i], str):
        strip_string = tmdb_df['overview'].iloc[i].translate(str.maketrans('', '', string.punctuation)).strip()
        split_string = strip_string.split(" ")
        overview = list(set(split_string))
    else:
        overview = tmdb_df['overview'].iloc[i]
    
    # create fields
    new_row['title'] = title
    new_row['genres'] = [genre['name'] for genre in movie_genres]
    new_row['keywords'] = [keyword['name'] for keyword in keywords]
    new_row['prod_comps'] = [pc['name'] for pc in prod_comps]
    new_row['overview'] = overview
    
    if not new_row['genres'] or not new_row['keywords'] or not new_row['prod_comps'] or not new_row['overview']:
        continue
    
    new_df.append(new_row)
    
new_df = pd.DataFrame.from_dict(new_df, orient='columns')

In [19]:
# convert to lowercase and remove spaces so multi-word names become single tokens
def sanitize(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    elif isinstance(x, str):
        return str.lower(x.replace(" ", ""))
    else:
        return ''

for feature in ['genres', 'keywords', 'prod_comps', 'overview']:
    new_df[feature] = new_df[feature].apply(sanitize)

# eliminate NAs in overview
new_df = new_df[new_df['overview'] != '']

# create soup
new_df = new_df.assign(soup=new_df['genres'] + new_df['keywords'] + new_df['prod_comps'] + new_df['overview'])

def create_soup(x):
    return ' '.join(x['soup'])

new_df['soup'] = new_df.apply(create_soup, axis=1)

In [24]:
# Function that takes in movie title as input and gives recommendations 
def content_recommender(title, cosine_sim, df, indices):
    # Obtain the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    # and convert them into a list of (index, score) tuples
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies. Ignore the first movie.
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

In [25]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(new_df['soup'])
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)


new_df = new_df.reset_index()
indices2 = pd.Series(new_df.index, index=new_df['title'])

**Provide the following with your solution:**

(a) After you have constructed it, print the "soup" for the first entry [0].

In [26]:
new_df.iloc[0]['soup']

'action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment alien protecting civilization 22nd torn moon the mission dispatched in an between and a becomes following is paraplegic pandora century marine orders on to unique but'

(b) List the top 10 recommended movies that go with the movie Road House. Display only the movie titles of your recommendations.


Use the examples given in Banik's book as a guide.

In [27]:
content_recommender('Road House', cosine_sim2, new_df, indices2)

654              Proof of Life
3041                    Splash
2668                   Machete
525                    Killers
1535                    Looper
288          The Expendables 2
1167           The Other Woman
2348    Escobar: Paradise Lost
3626                    Oldboy
623             Need for Speed
Name: title, dtype: object