In [2]:
# execute to import notebook styling for tables and width etc.
from IPython.core.display import HTML
import urllib.request
response = urllib.request.urlopen('https://raw.githubusercontent.com/DataScienceUWL/DS775v2/master/ds755.css')
HTML(response.read().decode("utf-8"));

In [3]:
import pandas as pd

<font size=18>Lesson 10 Homework</font>

# Build a Simple Recommender

Use the data set **tmdb_5000_movies.csv** to build a simple recommender system that keeps only movies with the following characteristics:

- movies in the top 50\% according to the popularity score
- movies with runtime of 80 minutes or more
- movies with a budget between \$10 million and \$100 million
- movies with genre of either action or comedy

Then rank the movies according to the IMDB weighted rating formula described in the Banik textbook, and print the 10 highest-rated movies that have the characteristics listed above.
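As a quick illustration of the weighted rating used in the solution code, the sketch below applies the same expression, `(v/(v+m))*r + (m/(v+m))*c`, to made-up numbers. The function name `weighted_rating` and all values are hypothetical and are not taken from the data set.

```python
def weighted_rating(v, r, m, c):
    """IMDB-style weighted rating.

    v: number of votes for the movie
    r: the movie's average rating
    m: vote-count threshold (the solution uses a quantile of vote_count)
    c: mean rating across the movies being ranked
    """
    return (v / (v + m)) * r + (m / (v + m)) * c


# Hypothetical values chosen only to show the shrinkage effect
m = 100   # vote-count threshold
c = 6.0   # mean vote average

few_votes = weighted_rating(v=10, r=9.0, m=m, c=c)
many_votes = weighted_rating(v=10000, r=9.0, m=m, c=c)

# A movie with few votes is pulled toward the mean c;
# a movie with many votes keeps a score close to its own average r.
print(round(few_votes, 3), round(many_votes, 3))
```

The formula is a weighted average of `r` and `c`, so a movie with exactly `m` votes lands halfway between its own average and the overall mean.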

### Code

In [4]:
# make df
import ast

tmdb_df = pd.read_csv("data/tmdb-5000-movie-dataset/tmdb_5000_movies.csv", encoding="ISO-8859-1")

# collect the row positions of movies tagged Action or Comedy
movie_indices = set()
for i in range(tmdb_df.shape[0]):
    # 'genres' holds a string containing a list of dicts; parse it safely instead of using eval
    movie_genres = ast.literal_eval(tmdb_df['genres'].iloc[i])

    for genre in movie_genres:
        if genre['name'] in ('Action', 'Comedy'):
            movie_indices.add(i)

# filter on genres (action or comedy)
df_filtered = tmdb_df.filter(items=sorted(movie_indices), axis=0)

# filter on other requests
df_filtered = df_filtered[(df_filtered['popularity'] >= df_filtered['popularity'].quantile(.5)) & 
                      (df_filtered['runtime'] >= 80) & 
                      (df_filtered['budget'] >= 10000000) & 
                      (df_filtered['budget'] <= 100000000)]

# create imdb weighted rating
m = df_filtered['vote_count'].quantile(.25)
r = df_filtered['vote_average']
v = df_filtered['vote_count']
c = r.mean()
wr_imdb = ((v/(v+m))*r)+((m/(v+m))*c)

# assign weighted rating as a column
df_filtered = df_filtered.assign(weighted_rating=wr_imdb)

# sort the filtered df by weighted rating
df_filtered = df_filtered.sort_values('weighted_rating', ascending=False).reset_index()

**Provide the following with your solution:**

(a) How many movies are in the original data frame?

In [5]:
tmdb_df.shape[0]

4803

(b) What is the median for popularity score?

In [6]:
tmdb_df['popularity'].median()

12.921594

(c) How many movies are in the reduced data frame (*i.e.* how many meet the selection criteria)?

In [7]:
df_filtered.shape[0]

935

(d) Display only the movie title and the requested characteristics as well as the vote count, vote average, and IMDB weighted rating with your recommendations.

In [8]:
df_filtered[['original_title','weighted_rating','vote_count','vote_average']]

| | original_title | weighted_rating | vote_count | vote_average |
|---|---|---|---|---|
| 0 | Forrest Gump | 8.092498 | 7927 | 8.2 |
| 1 | The Empire Strikes Back | 8.057728 | 5879 | 8.2 |
| 2 | The Lord of the Rings: The Return of the King | 7.999546 | 8064 | 8.1 |
| 3 | Star Wars | 7.979106 | 6624 | 8.1 |
| 4 | The Lord of the Rings: The Fellowship of the Ring | 7.911524 | 8705 | 8.0 |
| ... | ... | ... | ... | ... |
| 930 | Son of the Mask | 5.095494 | 338 | 3.6 |
| 931 | Left Behind | 5.046314 | 392 | 3.7 |
| 932 | Jack and Jill | 5.004859 | 604 | 4.1 |
| 933 | Catwoman | 4.922550 | 808 | 4.2 |


This data set can be found in the presentation download for this lesson.  You will need to use the option **encoding = "ISO-8859-1"** in the **read_csv** function in order to open this file.  Use the examples given in Banik's book as a guide.

# Build a Knowledge-Based Recommender

Use the data set **tmdb_5000_movies.csv** to build a knowledge-based recommender system that solicits the information listed below and then ranks the movies according to the IMDB weighted rating formula. Start from all available movies (*i.e.* don't restrict the data to just the top 20%, for example). Print the top 5 highest rated movies for this recommendation.

Ask the user to enter answers to the following questions:

- Enter a preferred genre.
- Enter another preferred genre.
- Enter a minimum runtime (in minutes).
- Enter a maximum runtime (in minutes).

Print a list of genres for the user to choose from before asking for their inputs.
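One way to collect such a list is sketched below, assuming each `genres` cell is a string holding a list of dicts with a `name` key, as in the solution code. The helper name `unique_genres` and the sample strings are hypothetical, and the ids in them are only illustrative.

```python
import ast


def unique_genres(genre_strings):
    """Return genre names in order of first appearance, without duplicates."""
    names = []
    for raw in genre_strings:
        # literal_eval parses the list-of-dicts string without executing code
        names.extend(g["name"] for g in ast.literal_eval(raw))
    # dict.fromkeys keeps the first occurrence of each name and preserves order
    return list(dict.fromkeys(names))


# Made-up sample rows in the same shape as the 'genres' column
sample = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 12, "name": "Adventure"}, {"id": 10751, "name": "Family"}]',
    '[]',
]

genres = unique_genres(sample)
print(", ".join(genres))
```

With a data frame, the same function could be called on `tmdb_df['genres']`, since iterating over a pandas Series yields its values.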

In [34]:

all_genres.tolist()

['Action',
 'Adventure',
 'Fantasy',
 'Science Fiction',
 'Crime',
 'Drama',
 'Thriller',
 'Animation',
 'Family',
 'Western',
 'Comedy',
 'Romance',
 'Horror',
 'Mystery',
 'History',
 'War',
 'Music',
 'Documentary',
 'Foreign',
 'TV Movie']

Create the recommender to select movies with either genre entered.  
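The selection rule (either genre matches, and the runtime falls inside the requested range) can be sketched on plain dictionaries as follows. This is only an illustration under assumptions: the function `matches`, the toy movies, and the case-insensitive comparison are not part of the assignment or the solution, which compares genre names exactly as typed.

```python
import ast


def matches(movie, genre1, genre2, min_runtime, max_runtime):
    """True if the movie has either genre and its runtime is within range."""
    # Normalize case and whitespace so "family" also matches "Family"
    wanted = {genre1.strip().lower(), genre2.strip().lower()}
    names = {g["name"].lower() for g in ast.literal_eval(movie["genres"])}
    in_range = min_runtime <= movie["runtime"] <= max_runtime
    return bool(names & wanted) and in_range


# Hypothetical movies in the same shape as rows of the data set
movies = [
    {"title": "Toy Movie A", "genres": '[{"id": 10751, "name": "Family"}]', "runtime": 95},
    {"title": "Toy Movie B", "genres": '[{"id": 10770, "name": "TV Movie"}]', "runtime": 45},
    {"title": "Toy Movie C", "genres": '[{"id": 27, "name": "Horror"}]', "runtime": 100},
]

selected = [m["title"] for m in movies if matches(m, "family", "tv movie", 50, 120)]
print(selected)
```

Here "Toy Movie B" has a matching genre but is too short, and "Toy Movie C" has the right length but neither genre, so only "Toy Movie A" is kept.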

In [42]:
import ast

def movie_recommender():
    tmdb_df = pd.read_csv("data/tmdb-5000-movie-dataset/tmdb_5000_movies.csv", encoding="ISO-8859-1")
    
    # calculate unique genres
    list_of_genres = []
    
    for i in range(tmdb_df.shape[0]):
        # 'genres' holds a string containing a list of dicts; parse it safely instead of using eval
        genre_dicts = ast.literal_eval(tmdb_df['genres'].iloc[i])
        list_of_genres.extend(genre_dicts)

    genres_df = pd.DataFrame.from_dict(list_of_genres, orient='columns')
    all_genres = ', '.join(genres_df['name'].unique().tolist())
    
    # present genres and get inputs
    print(f"See genres: {all_genres}")
    preferred_genre1 = input("Enter a preferred genre")
    preferred_genre2 = input("Enter another preferred genre")
    min_runtime = int(input("Enter min runtime"))
    max_runtime = int(input("Enter max runtime"))
    
    movie_indices = set()
                             
    # collect the rows that match either preferred genre
    for i in range(tmdb_df.shape[0]):
        movie_genres = ast.literal_eval(tmdb_df['genres'].iloc[i])
        # replace the raw string with a readable, comma-separated list of genre names
        tmdb_df.loc[i, 'genres'] = ', '.join([genre['name'] for genre in movie_genres])

        for genre in movie_genres:
            if genre['name'] in (preferred_genre1, preferred_genre2):
                movie_indices.add(i)

    # filter on genres (either of the two genres the user entered)
    df_filtered = tmdb_df.filter(items=sorted(movie_indices), axis=0)

    # filter on other requests
    df_filtered = df_filtered[(df_filtered['runtime'] >= min_runtime) & 
                          (df_filtered['runtime'] <= max_runtime)]

    # create imdb weighted rating
    m = df_filtered['vote_count'].quantile(.25)
    r = df_filtered['vote_average']
    v = df_filtered['vote_count']
    c = r.mean()
    wr_imdb = round(((v/(v+m))*r)+((m/(v+m))*c),2)

    # assign weighted rating as a column
    df_filtered = df_filtered.assign(weighted_rating=wr_imdb)

    # sort the filtered df by weighted rating
    df_filtered = df_filtered.sort_values('weighted_rating', ascending=False).reset_index()
    df_filtered = df_filtered[['original_title', 'genres', 'runtime', 'weighted_rating','vote_count','vote_average']]
    return df_filtered.head()

Have your recommender give recommendations for the genres "family" and "tv movie" with runtimes between 50 and 120 minutes. Display only the movie title and the requested characteristics, as well as the vote count, vote average, and IMDB weighted rating, with your recommendations.

Use the examples given in Banik's book as a guide.

In [43]:
movie_recommender()

See genres: Action, Adventure, Fantasy, Science Fiction, Crime, Drama, Thriller, Animation, Family, Western, Comedy, Romance, Horror, Mystery, History, War, Music, Documentary, Foreign, TV Movie
Enter a preferred genreFamily
Enter another preferred genreTV Movie
Enter min runtime50
Enter max runtime120


| | original_title | genres | runtime | weighted_rating | vote_count | vote_average |
|---|---|---|---|---|---|---|
| 0 | Inside Out | Drama, Comedy, Animation, Family | 94.0 | 7.98 | 6560 | 8.0 |
| 1 | Back to the Future | Adventure, Comedy, Science Fiction, Family | 116.0 | 7.97 | 6079 | 8.0 |
| 2 | The Lion King | Family, Animation, Drama | 89.0 | 7.97 | 5376 | 8.0 |
| 3 | Big Hero 6 | Adventure, Family, Animation, Action, Comedy | 102.0 | 7.78 | 6135 | 7.8 |
| 4 | WALL·E | Animation, Family | 98.0 | 7.78 | 6296 | 7.8 |


# Build a Content-Based Recommender

Use the data set **tmdb_5000_movies.csv** to build a metadata-based recommender by creating a "soup" and using cosine similarity based on the following:

- all genres
- top three keywords
- top three production companies
- overview
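A minimal sketch of how the four ingredients listed above might be combined into a single "soup" string is shown below. It assumes the `genres`, `keywords`, and `production_companies` cells are strings holding lists of dicts with a `name` key. The helpers `top_names` and `make_soup`, the toy row, and the choice to lowercase names and strip their spaces are illustrative assumptions, not the required solution.

```python
import ast


def top_names(raw, limit=None):
    """Parse a list-of-dicts string and return cleaned names, optionally truncated."""
    names = [item["name"] for item in ast.literal_eval(raw)]
    # Lowercase and remove spaces so multi-word names act as single tokens
    names = [name.lower().replace(" ", "") for name in names]
    return names if limit is None else names[:limit]


def make_soup(row):
    """Join all genres, top 3 keywords, top 3 companies, and the overview."""
    parts = (
        top_names(row["genres"])
        + top_names(row["keywords"], 3)
        + top_names(row["production_companies"], 3)
    )
    return " ".join(parts) + " " + row["overview"].lower()


# Hypothetical row; the values are made up for illustration
row = {
    "genres": '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]',
    "keywords": '[{"id": 1, "name": "space war"}, {"id": 2, "name": "alien"}, '
                '{"id": 3, "name": "future"}, {"id": 4, "name": "marine"}]',
    "production_companies": '[{"name": "Example Studio", "id": 10}]',
    "overview": "A toy overview.",
}

soup = make_soup(row)
print(soup)
```

The fourth keyword is dropped because only the first three are kept. Real data may contain missing overviews, which would need to be replaced with an empty string before calling `.lower()`.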

**Provide the following with your solution:**

(a) After you have constructed it, print the "soup" for the first entry [0].

(b) List the top 10 recommended movies that go with the movie Road House. Display only the movie titles of your recommendations.


Use the examples given in Banik's book as a guide.
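To show how cosine similarity ranks movies once each one has a soup, here is a small pure-Python sketch. It uses plain word counts as a stand-in for a vectorizer and does no stop-word removal, so it illustrates the idea rather than a full solution. The titles, soups, and the functions `cosine` and `recommend` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(title, soups, n=10):
    """Return the n titles whose soups are most similar to the given title's soup."""
    vectors = {name: Counter(text.split()) for name, text in soups.items()}
    target = vectors[title]
    scored = [
        (cosine(target, vector), name)
        for name, vector in vectors.items()
        if name != title  # never recommend the movie itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:n]]


# Made-up soups for four made-up movies
soups = {
    "Movie A": "action bar fight bouncer",
    "Movie B": "action bar brawl",
    "Movie C": "romance paris",
    "Movie D": "action bouncer bar fight sequel",
}

top_two = recommend("Movie A", soups, n=2)
print(top_two)
```

"Movie D" shares all four of Movie A's tokens and ranks first, "Movie B" shares two and ranks second, and "Movie C" shares none.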