<p><font size=18>Lesson 10: Recommender Systems 1</font></p>

In [1]:
# execute to import notebook styling for tables and width etc.
from IPython.core.display import HTML
import urllib.request
response = urllib.request.urlopen('https://raw.githubusercontent.com/DataScienceUWL/DS775v2/master/ds755.css')
HTML(response.read().decode("utf-8"));

# Practice

Follow the examples and use the code files provided from chapters 1-4 in **Hands-On Recommendation Systems with Python** by Rounak Banik to do the following self-assessment exercises.  Learn by doing!

## Simple Recommender

### <font color = "blue"> Self-Assessment: Load and Display </font>

Load the data set **ted_main.csv** and display the first 5 rows. This data set can be found in the presentation download for this lesson.  More information about this data set is available <a href = https://www.kaggle.com/rounakbanik/ted-talks> here </a>.  

In [None]:
# enter your code here

### <font color = "blue"> Self-Assessment: Pandas </font>

How many talks are in the TED Talks data frame?

### <font color = "blue"> Self-Assessment: Prerequisites </font> 

Select TED talks with these prerequisites:

1. talks with duration of at least 5 minutes (i.e. 300 seconds)
2. talks with only 1 speaker
3. talks in the top 90\% of views (exclude the bottom 10\%)

Also inspect the number of talks that made the cut.
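
As a rough illustration of how the three prerequisites combine into one boolean mask, here is a minimal sketch on a made-up data frame. The column names `duration`, `num_speaker`, and `views` are assumed to match **ted_main.csv** (check `df.columns` on the real data), and computing the views cutoff on the unfiltered frame is just one reasonable reading of the third prerequisite.

```python
import pandas as pd

# Toy stand-in for the TED data; all values are invented for illustration.
talks = pd.DataFrame({
    'duration':    [120, 600, 900, 450, 1200, 800, 700, 950, 300, 1000],
    'num_speaker': [1, 1, 2, 1, 1, 1, 1, 1, 1, 1],
    'views':       [50, 500, 900, 100, 800, 700, 600, 400, 300, 200],
})

# Views value that separates the bottom 10% from the top 90%
views_cutoff = talks['views'].quantile(0.10)

# Each condition is a boolean Series; & keeps rows where all three hold
keep = (
    (talks['duration'] >= 300)
    & (talks['num_speaker'] == 1)
    & (talks['views'] >= views_cutoff)
)
filtered = talks[keep]

# Number of talks that made the cut
print(filtered.shape[0])
```

Note the parentheses around each condition: `&` binds more tightly than `>=`, so leaving them out raises an error or gives the wrong mask.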

In [None]:
# enter your code here

### <font color = "blue"> Self-Assessment: Compute a Metric, Sort and Print </font>  

In the absence of numerical ratings here, use the number of comments per 1000 views as the metric: sort the TED talks by this ratio and print the 10 with the highest ratios.  

Display only the description, the main speaker, and the number of views.
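
To make the metric concrete, the sketch below computes it on three invented talks. The column name `comments_per_1000_views` is a hypothetical choice, and the other column names are assumed to match the TED data.

```python
import pandas as pd

# Three invented talks, for illustration only
talks = pd.DataFrame({
    'main_speaker': ['A', 'B', 'C'],
    'description':  ['talk a', 'talk b', 'talk c'],
    'comments':     [50, 10, 300],
    'views':        [10_000, 1_000, 100_000],
})

# Comments per 1000 views: scale the raw comments/views ratio by 1000
talks['comments_per_1000_views'] = 1000 * talks['comments'] / talks['views']

# Highest ratio first, then show only the requested columns
top = talks.sort_values('comments_per_1000_views', ascending=False)
print(top[['description', 'main_speaker', 'views']].head(2))
```

Talk C has the most comments but the lowest ratio, which is why a per-view metric can rank talks very differently from raw counts.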

In [None]:
# enter your code here

### A Note about Banik's Metric
In the homework, we will ask you to use Banik's IMDB metric. Let's quickly review what his metric is doing. We will use different variable names that more accurately reflect what these variables represent. Pay close attention to what the variables represent.

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/movies_metadata.csv')

#Calculate the number of votes garnered by the 80th percentile movie of our UNFILTERED data
#This is Banik's chosen cut off for his metric. 
vote_count_quantile = df['vote_count'].quantile(0.80)

# Calculate the mean of the vote average for our UNFILTERED data
vote_avg_mean = df['vote_average'].mean()

# Function to compute the IMDB weighted rating for each movie
def weighted_rating(x, vcq=vote_count_quantile, vam=vote_avg_mean):
    v = x['vote_count']
    R = x['vote_average']
    # Compute the weighted score
    return (v/(v+vcq) * R) + (vcq/(vcq+v) * vam)

Note that the function requires two pieces of data that are determined before any filtering is done. The first is the 80th percentile (0.80 quantile) of the vote counts and the second is the mean of the vote averages. We also pass in "x", which is a row from our filtered dataframe. 

We then get the vote count and vote average from our row (x) and compute the metric. When we apply this to the filtered dataframe, it will add a new column to our filtered dataframe with the result.
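
To see what the formula does, here is a small worked example with made-up numbers. The helper `weighted_rating_value` is a hypothetical scalar version of the function above, and the values of `vcq` and `vam` are invented rather than taken from the movies data.

```python
def weighted_rating_value(v, R, vcq, vam):
    """Blend a movie's own average R with the overall mean vam.

    The weight on R is v / (v + vcq), so movies with few votes are
    pulled toward the overall mean and movies with many votes keep
    a score close to their own average.
    """
    return (v / (v + vcq)) * R + (vcq / (vcq + v)) * vam

vcq, vam = 50.0, 5.6  # invented cutoff and overall mean

few = weighted_rating_value(v=5, R=9.0, vcq=vcq, vam=vam)      # about 5.91
equal = weighted_rating_value(v=50, R=9.0, vcq=vcq, vam=vam)   # halfway between R and vam
many = weighted_rating_value(v=5000, R=9.0, vcq=vcq, vam=vam)  # about 8.97

print(few, equal, many)
```

All three movies have the same vote average of 9.0, but the one with only 5 votes scores close to the overall mean, while the one with 5000 votes keeps nearly all of its own average.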

Once we have our metric function set up, we can go ahead and filter our data.

In [None]:
#NOW we filter data
#Only consider movies that run at least 45 minutes and at most 300 minutes
q_movies = df[(df['runtime'] >= 45) & (df['runtime'] <= 300)]

#Only consider movies with at least as many votes as the 80th percentile movie
m = df['vote_count'].quantile(.8)
#.copy() gives an independent dataframe, so adding a column later does not trigger a SettingWithCopyWarning
q_movies = q_movies[q_movies['vote_count'] >= m].copy()

Note that it is purely a coincidence that Banik also chose to filter his dataframe by the same vote count quantile. That is not necessary for his weighted rating metric and is not what you will be doing in the homework.

In [4]:
# Compute the score using the weighted_rating function defined above
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

#Sort movies in descending order of their scores
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 25 movies
q_movies[['title', 'vote_count', 'vote_average', 'score', 'runtime']].head(25)

Unnamed: 0,title,vote_count,vote_average,score,runtime
314,The Shawshank Redemption,8358,8.5,8.427977,142.0
834,The Godfather,6024,8.5,8.401206,175.0
2843,Fight Club,9678,8.3,8.242686,139.0
292,Pulp Fiction,8670,8.3,8.236213,154.0
522,Schindler's List,4436,8.3,8.178643,195.0
2211,Life Is Beautiful,3643,8.3,8.153956,116.0
1178,The Godfather: Part II,3418,8.3,8.14501,200.0
351,Forrest Gump,8147,8.2,8.13528,142.0
1152,One Flew Over the Cuckoo's Nest,3001,8.3,8.125161,133.0
1154,The Empire Strikes Back,5998,8.2,8.113038,124.0


## Knowledge-Based Recommender

For this example we will use the TED Talks data set that you have already loaded to build a knowledge-based recommender by soliciting the desired publication year and word rating from the user.

### <font color = "blue"> Self-Assessment: Dealing with Dates </font>

Extract the year of the talk from the feature called **published_date** and put it in a new variable called **published_year**.  

First, convert the published dates to datetime objects, then extract the year from each one.  For the TED Talks data, include the argument *unit='s'* in the **to_datetime()** function so that the dates convert correctly (the values are stored as the number of seconds since the start of the Unix epoch).

Then convert **published_year** to an integer data type and be sure that there are no NaT values among them.
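
The sketch below shows the conversion on a small invented Series of epoch timestamps, with one missing value included to show where NaT comes from. Dropping the missing row is only one way to handle it, and the variable names are hypothetical.

```python
import pandas as pd

# Seconds since the Unix epoch; None stands in for a missing date
timestamps = pd.Series([1151367060, 1246406400, None])

# unit='s' tells pandas the numbers are seconds, not nanoseconds
dates = pd.to_datetime(timestamps, unit='s')

# Missing inputs become NaT, which cannot be cast to an integer year
print(dates.isna().sum())

# Drop NaT rows first, then take the year and cast to int
years = dates.dropna().dt.year.astype(int)
print(years.tolist())  # [2006, 2009]
```

Without `unit='s'`, pandas treats the integers as nanoseconds, so every date lands within a few seconds of 1 January 1970.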

In [None]:
# enter your code here

### <font color = "blue"> Self-Assessment: Stringified Dictionaries </font>

Since we will be asking the user to enter a descriptive word rating to select a talk and the feature  **ratings** is a stringified list of dictionaries, convert each list of dictionaries into a list of strings and explode the ratings column into a pandas Series. (Note: this is what the book does - just follow the book.) 

Do not add the series back to the dataframe like the book does. We'll show you a different way to use it in the next section. When you have your series, use the unique() function to preview a list of the unique ratings that were given. (*Hint* - you can do this with the line s.unique(), if your series name is s.)
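
Here is a minimal sketch of the idea on two invented rows. It assumes each entry looks like a stringified list of dictionaries with a `name` key, and it uses `Series.explode`, which may differ from the exact steps in the book.

```python
import ast
import pandas as pd

# Two invented rows in the assumed format of the ratings column
ratings = pd.Series([
    "[{'id': 7, 'name': 'Funny', 'count': 10}, {'id': 1, 'name': 'Beautiful', 'count': 3}]",
    "[{'id': 7, 'name': 'Funny', 'count': 2}]",
])

# Turn each string back into a real Python list of dictionaries
parsed = ratings.apply(ast.literal_eval)

# Keep only the descriptive word from each dictionary
names = parsed.apply(lambda dicts: [d['name'] for d in dicts])

# One row per (talk, word); the original index is repeated
s = names.explode()
print(s.unique())
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing strings read from a file.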

In [None]:
# enter your code here

### Filtering using list functions
The book explodes dataframes so that it's easy to use simple equality when selecting data (df['word_ratings'] == 'obnoxious'). But this isn't necessary. We can use pandas shortcuts to determine whether the information in a pandas column contains the value we are interested in. 

Let's look at a quick example. First we'll just create a very simple dataframe that contains a list of genres. We can't use these shortcuts on lists, so we'll first convert the list to a simple string.

In [39]:
#import and create some simple data
import pandas as pd

df = pd.DataFrame({'genres': [['horror,thriller'], ['family,animation,comedy'],['action']]},
                  index=['movie1', 'movie2', 'movie3'])
#convert the list to a string
df['genres'] = df['genres'].apply(', '.join)
#display what we have
df

Unnamed: 0,genres
movie1,"horror,thriller"
movie2,"family,animation,comedy"
movie3,action


If we want to select movies that have either the horror or the action genre, we can use the str.contains function with a regular expression.

In [42]:
# create the filter
filter1 = df["genres"].str.contains('horror|action', regex=True)

# filter for genres
df[filter1]

Unnamed: 0,genres
movie1,"horror,thriller"
movie3,action


This approach prevents the duplicate row problem that we see in the book. (For homework, we will accept either the book approach or this approach.)

### <font color = "blue"> Self-Assessment: Create the Knowledge-Based Recommender </font>

1. Print a list of the descriptive word ratings for the user to choose from. (*Hint:  you can do this with the line* print(gen_ted['word_ratings'].unique()))


2. Ask the user to enter answers to the following questions:

    - Enter a descriptive word for rating.
    - Enter the earliest year published for the talk (between 2006 and 2017).
    - Enter the latest year published for the talk (between 2006 and 2017).

3. Consider only talks in the top 90% of views (after filtering based on user preferences).

4. Display the top 5 recommended talks according to the "comments per 1000 views" ratio (calculated AFTER doing steps 2 & 3).

5. Display only the main speaker, the name of the talk, the year published, and the comments per thousand views ratio.

6.  Show the results for the word rating "obnoxious" and published years between 2009 and 2014.

In [None]:
# enter your code here

## Content-Based Recommender

For this example we will use the TED Talks data set that you have already loaded to build a content-based recommender based on the descriptions of the talks.  This will correspond to the **plot description-based recommender**.

### <font color = "blue"> Self-Assessment: TF-IDF Vectors </font>

From the original TED Talks data frame that we use in this lesson, create the TF-IDF (term frequency - inverse document frequency) matrix from the descriptions of the talks.  The TF-IDF value is high where a term that is rare across the collection is present or frequent in a document, and it is near zero where a term is absent from a document or abundant across all documents.

The feature name in the data frame is **description**.

Output the shape of the TF-IDF matrix you create. The number of rows corresponds to the number of TED talks in the data frame and the number of columns represents the number of unique terms. 
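
To show what the shape means, here is a minimal sketch on three invented descriptions using scikit-learn's `TfidfVectorizer`. Removing English stop words is an assumption about the settings, not a requirement of the exercise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented "descriptions" standing in for the TED data
docs = [
    "the ocean is deep",
    "the ocean is blue and the sky is blue",
    "stories about the brain",
]

# stop_words='english' drops common words such as "the" and "is"
tfidf = TfidfVectorizer(stop_words='english')
matrix = tfidf.fit_transform(docs)

# Rows = documents, columns = unique terms kept in the vocabulary
print(matrix.shape)
print(sorted(tfidf.vocabulary_))
```

The result is a sparse matrix, so most entries are stored implicitly as zeros. This matters for the real data, where the vocabulary runs to thousands of terms.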

In [None]:
# enter your code here

### <font color = "blue"> Self-Assessment: Create the Content-Based Recommender Based on Cosine Similarity </font>

Compute the cosine similarity score for every pair of TED talks in the data frame. Next build a recommender that takes the name of a TED talk in the data frame and returns the top 5 recommended talks, ranked by how similar their descriptions are to the description of the supplied talk.

Show that it works by getting the top 5 recommended talks that are similar to the talk named "Tyler Cowen: Be suspicious of simple stories" (from the `name` column of the data frame).
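
The ranking step can be sketched with plain NumPy on four invented document vectors. In the exercise the vectors would be the rows of your TF-IDF matrix; the names `vectors`, `names`, and `most_similar` here are hypothetical.

```python
import numpy as np

# Invented document vectors, one row per talk
vectors = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.9],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
])
names = ['talk A', 'talk B', 'talk C', 'talk D']

# Cosine similarity = dot product of unit-length vectors
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
cosine_sim = unit @ unit.T

def most_similar(name, top_n=2):
    """Return the top_n talks most similar to the named talk."""
    idx = names.index(name)
    # Sort by similarity, highest first
    order = np.argsort(-cosine_sim[idx])
    # Exclude the talk itself, which always has similarity 1
    order = [i for i in order if i != idx][:top_n]
    return [names[i] for i in order]

print(most_similar('talk A'))  # ['talk B', 'talk D']
```

Excluding the query talk by index, rather than by skipping the first result, avoids returning the talk itself when another talk happens to have an identical description.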

In [None]:
# enter your code here