In [None]:
# execute to import notebook styling for tables and width etc.
from IPython.display import HTML
import urllib.request
response = urllib.request.urlopen('https://raw.githubusercontent.com/DataScienceUWL/DS775v2/master/ds755.css')
HTML(response.read().decode("utf-8"));

<font size=18>Lesson 10 - Self-Assessment Solutions</font>

# <font color = "blue"> Self-Assessment: Load and Display - Solution</font>
There's nothing too new here; you've done this kind of work before. More important than the code is taking a minute or two to understand the data you're pulling in. What columns do you have available? Which columns contain simple values, and which contain lists? Think about how you could or couldn't use this data to make recommendations.

In [None]:
import pandas as pd
import numpy as np

ted = pd.read_csv('./data/ted-talks/ted_main.csv')
ted.head()

# <font color = "blue"> Self-Assessment: Pandas - Solution </font>
Remember that shape gives you the number of rows first, followed by the number of columns.

In [None]:
ted.shape

There are 2550 TED talks in this data frame.

# <font color = "blue"> Self-Assessment: Prerequisites - Solution </font> 

Remember that when you're calculating the quantile for some piece of data, you'll get different results if you calculate it before or after you do your other subsetting. First, let's calculate the views quantile before we figure the rest of our prerequisites.

In [None]:
#Calculate the number of views for the 10th percentile - calculated from the whole dataframe
m = ted['views'].quantile(0.10)

#Only consider talks of at least 5 minutes
q_talks = ted[(ted['duration'] >= 300)]

#Only consider talks with one speaker
q_talks = q_talks[q_talks['num_speaker']==1]

#Only consider talks in the top 90%
q_talks = q_talks[q_talks['views'] >= m]

#Inspect the number of talks that made the cut
q_talks.shape

Let's compare that with calculating the quantile after we subset.

In [None]:
#Only consider talks of at least 5 minutes
q_talks2 = ted[(ted['duration'] >= 300)]

#Only consider talks with one speaker
q_talks2 = q_talks2[q_talks2['num_speaker']==1]

#Calculate the number of views for the 10th percentile - calculated from the subsetted dataframe
m2 = q_talks2['views'].quantile(0.10)

#Only consider talks in the top 90%
q_talks2 = q_talks2[q_talks2['views'] >= m2]

#Inspect the number of talks that made the cut
q_talks2.shape

There is no universally "right" answer as to whether you should calculate the quantile before or after you've narrowed the initial dataset. It depends on what you're trying to accomplish. If you want the most viewed talks *that meet your criteria* you'd calculate it after you've subsetted. If you want the most viewed talks *overall* you'd calculate it before you've subsetted.

For our homework, we'll typically calculate the quantile on the entire data set, to keep consistent with the book.
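
To make the difference concrete, here is a minimal sketch on invented toy data (not the TED dataset). The column values and the 0.5 quantile are assumptions chosen only so the two thresholds are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data: six talks with a duration (seconds) and a view count
toy = pd.DataFrame({
    'duration': [100, 200, 400, 500, 600, 700],
    'views':    [10, 20, 30, 40, 50, 60],
})

# Threshold computed on the whole data frame
m_before = toy['views'].quantile(0.5)

# Threshold computed only on talks of at least 5 minutes
long_talks = toy[toy['duration'] >= 300]
m_after = long_talks['views'].quantile(0.5)

# Apply each threshold to the same subset of long talks
kept_before = long_talks[long_talks['views'] >= m_before]
kept_after = long_talks[long_talks['views'] >= m_after]

print(m_before, len(kept_before))
print(m_after, len(kept_after))
```

The threshold computed after subsetting is higher here because the short talks with few views no longer pull the quantile down, so fewer talks make the cut.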

# <font color = "blue"> Self-Assessment: Compute a Metric, Sort and Print - Solution </font>  

Note that here we are computing our metric on our narrowed data set. We could have created the metric on the entire dataset. But, if we know that we're only interested in a portion of the talks, we should narrow our dataset before computing the metric.

In [None]:
#create the metric of the comments to views ratio
q_talks['comments_per_1000views']=1000*q_talks['comments']/q_talks['views']

#Sort talks in descending order of comments per 1000 views
q_talks = q_talks.sort_values('comments_per_1000views', ascending=False)

#Print the top 10 talks
q_talks[['description', 'main_speaker', 'comments_per_1000views']].head(10)


# <font color = "blue"> Self-Assessment: Dealing with Dates - Solution </font>
This is straight out of the book. `apply` is a handy pandas method that lets you run a function on each row or column of your data. You're seeing examples here of using a lambda (inline) function as well as a separately defined function (convert_int).

The lambda function grabs the year from the published date. It does that by converting the date to a string and splitting it on the '-' character, which creates a list. We grab the first item in the list, which, for a valid date, is the year. For a missing date, we substitute np.nan.

In [None]:
#Convert published_date (Unix seconds) into pandas datetime format
ted['published_date'] = pd.to_datetime(ted['published_date'],
                                       errors='coerce', unit='s')

#see what the new date looks like
print("This is what the datetime string looks like:")
display(ted['published_date'].head())


#Extract year from the datetime
#(pd.notna is needed because a comparison such as x != np.nan is always True)
ted['published_year'] = ted['published_date'].apply(lambda x: str(x).split('-')[0] if pd.notna(x) else np.nan)

#Helper function to convert missing or non-numeric years to 0 and all other years to integers.
def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return 0

#Apply convert_int to the year feature
ted['published_year'] = ted['published_year'].apply(convert_int)
    
ted.head()
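
As an alternative to splitting the date string, the year can also be read through the pandas `.dt` accessor. The sketch below uses invented timestamps rather than the TED data, and it fills missing years with 0 only to mirror what `convert_int` does; it is an illustration, not the book's approach.

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds; None stands in for a missing date
stamps = pd.Series([1151367060, None, 1483228800])

# Missing values become NaT
dates = pd.to_datetime(stamps, errors='coerce', unit='s')

# .dt.year yields NaN for NaT, so fill with 0 and cast to int (mirrors convert_int)
years = dates.dt.year.fillna(0).astype(int)

print(years.tolist())
```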

# <font color = "blue"> Self-Assessment: Stringified Dictionaries - Solution </font>

This is also straight from the book. When we use the literal_eval function on the ratings column, each string becomes a list of dictionaries that we can manipulate. The "name" key in each dictionary holds the part of the rating that we care about. We convert these words to lower case and collect them in a list. We create an empty list if there were no ratings.


In [None]:
#Import the literal_eval function from ast
from ast import literal_eval

#Convert all NaN into stringified empty lists
ted['ratings'] = ted['ratings'].fillna('[]')

#Apply literal_eval to convert each stringified list into a list object
ted['ratings'] = ted['ratings'].apply(literal_eval)


#Convert list of dictionaries to a list of strings
ted['ratings'] = ted['ratings'].apply(lambda x: [i['name'].lower() for i in x] if isinstance(x, list) else [])

#See how 'ratings' has changed?
ted.head()
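
To see what literal_eval does to a single cell, here is a small sketch. The string below is invented in the style of the ratings column (the ids and counts are made up), so treat it as an illustration of the shape of the data rather than a real row.

```python
from ast import literal_eval

# Hypothetical cell value in the style of the ratings column
raw = "[{'id': 7, 'name': 'Funny', 'count': 10}, {'id': 1, 'name': 'Beautiful', 'count': 3}]"

# The string becomes a real Python list of dictionaries
parsed = literal_eval(raw)

# Keep only the lower-cased rating words
names = [d['name'].lower() for d in parsed]

print(type(parsed).__name__, names)
```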

"Exploding" creates a row for each word in our ratings list. This obviously creates duplicate data, and if you create more metrics after you've exploded, you're going to get strange results. In the case of our knowledge-based recommender, since we're only asking you to return the results for a single word, it's not that big of a deal. But we'll show you two approaches to accomplish the same thing. First, let's explode, but not join our exploded column back to our main dataframe.

In [None]:
#Create a new feature by exploding word ratings
s = ted.apply(lambda x: pd.Series(x['ratings']),axis=1).stack().reset_index(level=1, drop=True)

#look at what we have
s.unique()
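
pandas also provides a built-in `Series.explode` method that produces one row per list item. The sketch below uses it on invented toy data instead of the apply/stack approach shown above; it is an alternative under that assumption, not the notebook's method.

```python
import pandas as pd

# Hypothetical toy data: each talk has a list of rating words (one list is empty)
toy = pd.DataFrame({
    'name': ['talk A', 'talk B', 'talk C'],
    'ratings': [['funny', 'inspiring'], ['inspiring'], []],
})

# One row per word; the original index is repeated, and an empty list becomes NaN
exploded = toy['ratings'].explode()

# Drop the NaN from the empty list before collecting the unique words
words = exploded.dropna().unique()

print(exploded.index.tolist())
print(sorted(words))
```

The repeated index values are how you can tell which talk each word came from, and they are also why metrics computed after exploding can double count.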

# <font color = "blue"> Self-Assessment: Create the Knowledge-Based Recommender - Solution </font>

We're creating this as a function that takes in the dataframe and the percentile of views that we want to return. 
We'll first generate our list of unique words to present to users. We'll also stringify our list of ratings so we can use str.contains to filter.

In [None]:
def build_chart(gen_ted, percentile=0.1):
    #Show user the list of word ratings to choose from
    s = gen_ted.apply(lambda x: pd.Series(x['ratings']),axis=1).stack().reset_index(level=1, drop=True)
    print(s.unique())
    
    #Work on a copy so the caller's data frame is not modified, then convert our ratings to strings
    gen_ted = gen_ted.copy()
    gen_ted['ratings'] = gen_ted['ratings'].apply(', '.join)
    
    #Ask for preferred word rating
    print("Select a descriptive word from the list above for the 'word rating'")
    rating = input()
    
    #Ask for lower limit of published year
    print("Input earliest year published (2006 to 2017)")
    low_year = int(input())
    
    #Ask for upper limit of published year
    print("Input latest year published(2006 to 2017)")
    high_year = int(input())
    
    
    #Define a new talks variable to store the preferred talks. 
    #Copy the contents of gen_ted to talks
    talks = gen_ted.copy()
    
    #Filter based on the condition
    talks = talks[(talks['ratings'].str.contains(rating)) & 
                    (talks['published_year'] >= low_year) & 
                    (talks['published_year'] <= high_year)]
    
    #Calculate the views threshold at the requested percentile of the data frame passed in
    m = gen_ted['views'].quantile(percentile)

    #Only consider talks with at least m views. Save this in a new dataframe q_talks
    q_talks = talks.copy().loc[talks['views'] >= m]
    
    #create the metric of the comments to views ratio
    q_talks['comments_per_1000views']=1000*q_talks['comments']/q_talks['views']

    #Sort talks in descending order of comments per 1000 views
    q_talks = q_talks.sort_values('comments_per_1000views', ascending=False)
    
    return q_talks

In [11]:
#Generate the chart for top talks for these user preferences and display top 5.
gen_ted_final = build_chart(ted).head(5)

gen_ted_final[['main_speaker','name','published_year','comments_per_1000views']]

['funny' 'beautiful' 'ingenious' 'courageous' 'longwinded' 'confusing'
 'informative' 'fascinating' 'unconvincing' 'persuasive' 'jaw-dropping'
 'ok' 'obnoxious' 'inspiring']
Select a descriptive word from the list above for the 'word rating'
obnoxious
Input earliest year published (2006 to 2017)
2009
Input latest year published(2006 to 2017)
2014


      main_speaker           name                                               published_year  comments_per_1000views
803   David Bismark          David Bismark: E-voting without fraud              2010            1.534355
694   Sharmeen Obaid-Chinoy  Sharmeen Obaid-Chinoy: Inside a school for sui...  2010            1.420683
954   Janet Echelman         Janet Echelman: Taking imagination seriously       2011            1.359572
840   Lesley Hazleton        Lesley Hazleton: On reading the Koran              2011            1.285149
1787  David Chalmers         David Chalmers: How do you explain consciousness?  2014            1.235918
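
The function above stringifies the ratings so that `str.contains` can filter them. As a hedged alternative sketch, the filter can also test membership in the lists directly, which matches whole words rather than substrings. The helper name `filter_talks` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data with ratings kept as lists
toy = pd.DataFrame({
    'name': ['talk A', 'talk B', 'talk C'],
    'ratings': [['ok', 'funny'], ['jaw-dropping'], ['ok']],
    'published_year': [2010, 2012, 2016],
})

def filter_talks(df, rating, low_year, high_year):
    """Keep talks whose ratings list contains `rating` and whose year is in range."""
    has_rating = df['ratings'].apply(lambda words: rating in words)
    # between() is inclusive at both ends by default
    in_range = df['published_year'].between(low_year, high_year)
    return df[has_rating & in_range]

picked = filter_talks(toy, 'ok', 2009, 2014)
print(picked['name'].tolist())
```

Passing the preferences as arguments instead of calling `input()` also makes the filter easy to rerun and test.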


# <font color = "blue"> Self-Assessment: TF-IDF Vectors - Solution</font>
This is all straight from the book. More information about the TfidfVectorizer is available online here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [12]:
#Import TfidfVectorizer from the scikit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stopwords
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
ted['description'] = ted['description'].fillna('')

#Construct the required TF-IDF matrix by applying the fit_transform method on the description feature
tfidf_matrix = tfidf.fit_transform(ted['description'])

#Output the shape of tfidf_matrix (rows first, then columns)
tfidf_matrix.shape

(2550, 15162)

In [21]:
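#bonus - take a look at some of the individual words in the descriptions
#(get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it)
feature_names = tfidf.get_feature_names_out().tolist()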
feature_names[500:510]

['albatrosses',
 'albert',
 'alberta',
 'alberto',
 'albinism',
 'albright',
 'album',
 'albuquerque',
 'alchemists',
 'alcohol']

In [26]:
#bonus - for the first document, none of the 10 words shown above appear in its description, so every score is 0
tfidf_list = tfidf_matrix.toarray()
tfidf_list[0, 500:510]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

# <font color = "blue"> Self-Assessment: Create the Content-Based Recommender Based on Cosine Similarity - Solution </font>
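This is also straight from the book. We don't expect you to understand everything about linear kernels. The key point is that TfidfVectorizer scales each row to unit length by default, so the dot product computed by linear_kernel equals the cosine similarity. If you're interested, the documentation is here: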
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.linear_kernel.html
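
As a quick check of why a dot product can stand in for cosine similarity, the sketch below compares `linear_kernel` with `cosine_similarity` on an invented three-document corpus. It assumes the vectorizer's default L2 normalization is left in place.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini corpus
docs = ["robots and music", "music and oceans", "oceans and whales"]

# Rows are L2-normalized by default (norm='l2')
matrix = TfidfVectorizer(stop_words='english').fit_transform(docs)

dot = linear_kernel(matrix, matrix)      # plain dot products
cos = cosine_similarity(matrix, matrix)  # normalizes again, then dot products

same = np.allclose(dot, cos)
unit_diagonal = np.allclose(np.diag(dot), 1.0)
print(same, unit_diagonal)
```

Each document has similarity 1 with itself, which is why the recommender skips the first entry after sorting.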



In [28]:
# Import linear_kernel to compute the dot product
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

#Construct a reverse mapping from talk names to row indices, keeping the first row for any duplicated name
indices = pd.Series(ted.index, index=ted['name'])
indices = indices[~indices.index.duplicated(keep='first')]

In [29]:
# Function that takes in a talk name as input and gives recommendations
def content_recommender(name, cosine_sim=cosine_sim, df=ted, indices=indices):
    # Obtain the index of the talk that matches the name
    idx = indices[name]

    # Get the pairwise similarity scores of all talks with that talk
    # and convert them into a list of (index, score) tuples
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the talks based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar talks. Ignore the first talk, which is the talk itself.
    sim_scores = sim_scores[1:11]

    # Get the talk indices
    talk_indices = [i[0] for i in sim_scores]

    # Return the names of the top 10 most similar talks
    return df['name'].iloc[talk_indices]

In [30]:
#Get recommendations for Tyler Cowen: Be suspicious of simple stories
content_recommender('Tyler Cowen: Be suspicious of simple stories')

1910    Tom Wujec: Got a wicked problem? First, tell m...
451     Dan Ariely: Are we in control of our own decis...
1115         Mikko Hypponen: Three types of online attack
1405           Ronny Edry: Israel and Iran: A love story?
2361     Sisonke Msimang: If a story moves you, act on it
406                      Dan Ariely: Our buggy moral code
775         Julian Treasure: Shh! Sound health in 8 steps
335     Samantha Power: A complicated hero in the war ...
145                  John Maeda: Designing for simplicity
757     His Holiness the Karmapa: The technology of th...
Name: name, dtype: object