<a href="https://colab.research.google.com/github/villafue/Capstone_2_Netflix/blob/main/Springboard/Tutorial/DataCamp/Building%20Recommendation%20Engines%20in%20Python/2%20Content-Based%20Recommendations/2_Content_Based_Recommendations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Content-Based Recommendations

Discover how item attributes can be used to make recommendations. Create valuable comparisons between items with both categorical and text data. Generate profiles to recommend new items for users based on their past preferences.

# Intro to content-based recommendations

1. Intro to content-based recommendations

So far we have looked at making recommendations based solely on how the entire population feels about items. While these recommendations can be useful, they aren't personalized.
2. What are content-based recommendations?

In this chapter, we will move to more targeted models by recommending items based on their similarities to items a user has liked in the past. For example, if a user likes book A, and we calculate that book A and book B are similar, we believe the user will like book B. We will address how to calculate what items are similar and which ones are not. We can do so by comparing the attributes of our items. The recommendations made by finding items with similar attributes are called content-based recommendations.
3. Items' attributes or characteristics

For example, if we were looking at a dataset describing books, the attributes could be the author of the book, its publishing date, its length, or its genre: really any descriptive information. A big advantage of using an item's attributes over user feedback is that you can make recommendations for any item you have attribute data on, including brand-new items that users have not seen yet. Content-based models require us to use the available attributes to build a profile of each item in a way that allows us to compare items mathematically. This allows us, for example, to find the most similar items and recommend them.
4. Vectorizing your attributes

This is best done by encoding each item as a vector. Here we can see an example with a vector for each item stored as a row and each feature as a column. Why this shape, you might ask? Having your data in this format makes it easy to calculate distances and similarities between items, which is vital for generating recommendations. We'll discuss how to calculate distances and similarities between vectors later in the course. First, we will cover how to convert the most common data format for attributes to this shape. We will continue using the book dataset from chapter 1, but this time we introduce an additional book_genre table.
5. One to many relationships

This book_genre table, as seen here on the left, contains a one-to-many mapping of books to their genres. This type of one-to-many lookup is very common in relational databases. Remember that from this table we want to create a new table that contains a single row per item, with each column encoding whether or not the item has that attribute, as you see here on the right.
6. Crosstabulation

To transform this data we can use pandas' crosstab function. The crosstab function generates the cross-tabulation of two (or more) factors, and here we want to use it to find the cross-tabulation of the book titles and the genres they have been labeled with.
7. Crosstabulation

We call pd.crosstab, passing in the book titles as the first argument and the book genres as the second argument. The first argument becomes the rows, and the second becomes the columns. Here we can see the desired result.
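
As a minimal sketch of this call, assume a small hypothetical `book_genre_df` with `Title` and `Genre` columns. The column names and rows below are illustrative and are not the course dataset.

```python
import pandas as pd

# Hypothetical one-to-many table: one row per (book, genre) pair
book_genre_df = pd.DataFrame({
    "Title": ["The Hobbit", "The Hobbit", "A Game of Thrones",
              "A Game of Thrones", "The Great Gatsby"],
    "Genre": ["Fantasy", "Adventure", "Fantasy", "Drama", "Drama"],
})

# First argument becomes the rows, second becomes the columns
book_cross_table = pd.crosstab(book_genre_df["Title"], book_genre_df["Genre"])
print(book_cross_table)
```

Each cell counts how many times a (title, genre) pair appears, so with one row per pair the table holds only 0s and 1s.
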
8. Let's practice!

Great, now we have our data in a format that will allow us to calculate similarities and make recommendations. Time to try these data transformations yourself. 

# Why use content-based models?

Imagine you are working for a large retailer that has a constantly changing product line, with new items being added every day. Why might content-based models be a good choice to make recommendations on your data?

Possible Answers

1. You are always guaranteed better recommendations with content-based data.
 - Incorrect - Unfortunately not, the best model for the job can never be guaranteed. It will depend on the data you have available.

2. Content-based models always recommend the newest products; customers always like the newest products no matter what their past preferences were.
 - Incorrect. Content-based models make recommendations based on an item's attributes, not just how new it is.

3. As the recommendations are based on the item attributes rather than user feedback, recommendations can be made on never-before-purchased products.
 - Correct! Content-based models are ideal for creating recommendations for products that have no user feedback data such as reviews or purchases.

# Creating content-based data

As much as you might want to jump right to finding similar items and making recommendations, you first need to get your data in a usable format. In the next few exercises, you will explore your base data and work through how to format that data to be used for content-based recommendations.

As a reminder, the desired outcome is a row per movie with each column indicating whether a genre applies to the movie. You will be looking at movie_genre_df, which contains these columns:

 * name - Name of movie
 * genre_list - Genre that the movie has been labeled as

A movie may have multiple genres, and therefore multiple rows. In this exercise, you will particularly focus on one movie (Toy Story in this case) to be able to clearly see what is happening with the data.

Instructions

Question

1. How many different movies are contained in movie_genre_df?

Possible Answers

1. 50
 - Incorrect, this is the number of rows in the DataFrame, but as you can see, movies can have multiple rows, one for each of their genre labels.

2. 21
 - Correct
 
3. 11
 - Not quite, that is the number of unique genre labels, not the number of unique movies.

In [None]:
In [1]:
movie_genre_df
Out[1]:

                              name genre_list
0                        Toy Story  Adventure
1                        Toy Story  Animation
2                        Toy Story   Children
3                        Toy Story     Comedy
4                        Toy Story    Fantasy
5                          Jumanji  Adventure
6                          Jumanji   Children
7                          Jumanji    Fantasy
8                 Grumpier Old Men     Comedy
9                 Grumpier Old Men    Romance
10               Waiting to Exhale     Comedy
11               Waiting to Exhale      Drama
12               Waiting to Exhale    Romance
13     Father of the Bride Part II     Comedy
14                            Heat     Action
15                            Heat      Crime
16                            Heat   Thriller
17                         Sabrina     Comedy
18                         Sabrina    Romance
19                    Tom and Huck  Adventure
20                    Tom and Huck   Children
21                    Sudden Death     Action
22                       GoldenEye     Action
23                       GoldenEye  Adventure
24                       GoldenEye   Thriller
25         American President, The     Comedy
26         American President, The      Drama
27         American President, The    Romance
28     Dracula: Dead and Loving It     Comedy
29     Dracula: Dead and Loving It     Horror
30                           Balto  Adventure
31                           Balto  Animation
32                           Balto   Children
33                           Nixon      Drama
34                Cutthroat Island     Action
35                Cutthroat Island  Adventure
36                Cutthroat Island    Romance
37                          Casino      Crime
38                          Casino      Drama
39           Sense and Sensibility      Drama
40           Sense and Sensibility    Romance
41                      Four Rooms     Comedy
42  Ace Ventura: When Nature Calls     Comedy
43                     Money Train     Action
44                     Money Train     Comedy
45                     Money Train      Crime
46                     Money Train      Drama
47                     Money Train   Thriller
48                      Get Shorty     Comedy
49                      Get Shorty      Crime

In [None]:
movie_genre_df.describe()

'''
               name genre_list
count            50         50
unique           21         11
top     Money Train     Comedy
freq              5         11
'''

movie_genre_df.describe().T

'''
           count unique          top freq
name          50     21  Money Train    5
genre_list    50     11       Comedy   11
'''

 2. Get the rows in movie_genre_df which have a name equal to Toy Story and save this as toy_story_genres.

In [None]:
# Select only the rows with values in the name column equal to Toy Story
toy_story_genres = movie_genre_df[movie_genre_df.name == 'Toy Story']

# Inspect the subset
print(toy_story_genres)

'''
<script.py> output:
            name genre_list
    0  Toy Story  Adventure
    1  Toy Story  Animation
    2  Toy Story   Children
    3  Toy Story     Comedy
    4  Toy Story    Fantasy
'''

 3. Transform movie_genre_df to a table called movie_cross_table.

 4. Assign the subset of movie_cross_table that contains Toy Story to the variable toy_story_genres_ct and inspect the results.


In [None]:
# Select only the rows with values in the name column equal to Toy Story
toy_story_genres = movie_genre_df[movie_genre_df['name'] == 'Toy Story']

# Create cross-tabulated DataFrame from name and genre_list columns
movie_cross_table = pd.crosstab(movie_genre_df['name'], movie_genre_df['genre_list'])

# Select only the rows with Toy Story as the index
toy_story_genres_ct = movie_cross_table[movie_cross_table.index == 'Toy Story']
print(toy_story_genres_ct)

'''

<script.py> output:
    genre_list  Action  Adventure  Animation  Children  Comedy  ...  Drama  Fantasy  Horror  Romance  Thriller
    name                                                        ...                                           
    Toy Story        0          1          1         1       1  ...      0        1       0        0         0
    
    [1 rows x 11 columns]
'''

In [None]:
In [3]:
movie_cross_table
Out[3]:

genre_list                      Action  Adventure  Animation  Children  Comedy  ...  Drama  Fantasy  Horror  Romance  Thriller
name                                                                            ...                                           
Ace Ventura: When Nature Calls       0          0          0         0       1  ...      0        0       0        0         0
American President, The              0          0          0         0       1  ...      1        0       0        1         0
Balto                                0          1          1         1       0  ...      0        0       0        0         0
Casino                               0          0          0         0       0  ...      1        0       0        0         0
Cutthroat Island                     1          1          0         0       0  ...      0        0       0        1         0
Dracula: Dead and Loving It          0          0          0         0       1  ...      0        0       1        0         0
Father of the Bride Part II          0          0          0         0       1  ...      0        0       0        0         0
Four Rooms                           0          0          0         0       1  ...      0        0       0        0         0
Get Shorty                           0          0          0         0       1  ...      0        0       0        0         0
GoldenEye                            1          1          0         0       0  ...      0        0       0        0         1
Grumpier Old Men                     0          0          0         0       1  ...      0        0       0        1         0
Heat                                 1          0          0         0       0  ...      0        0       0        0         1
Jumanji                              0          1          0         1       0  ...      0        1       0        0         0
Money Train                          1          0          0         0       1  ...      1        0       0        0         1
Nixon                                0          0          0         0       0  ...      1        0       0        0         0
Sabrina                              0          0          0         0       1  ...      0        0       0        1         0
Sense and Sensibility                0          0          0         0       0  ...      1        0       0        1         0
Sudden Death                         1          0          0         0       0  ...      0        0       0        0         0
Tom and Huck                         0          1          0         1       0  ...      0        0       0        0         0
Toy Story                            0          1          1         1       1  ...      0        1       0        0         0
Waiting to Exhale                    0          0          0         0       1  ...      1        0       0        1         0

[21 rows x 11 columns]

Conclusion

Good work! This newly formatted table, with one row vector per movie and one column per feature, will allow you to calculate distances and similarities between movies.

# Understanding the content-based data

You are now able to convert common attribute data to a DataFrame containing a row per movie, and each of its attributes as columns. You will now take a closer look at the full DataFrame you just created to see if you understand the information within.

A subset of the DataFrame you have created in the last exercise has been loaded as movie_cross_table. As a reminder, the genres are stored as individual columns and the movie names are stored as the index.

Inspect the rows corresponding to 'Toy Story' and 'Yogi Bear' in movie_cross_table. How many genres do they have in common?



In [None]:
In [2]:
movie_cross_table[movie_cross_table.index == 'Toy Story']
Out[2]:

genre_list  Action  Adventure  Animation  Children  Comedy  Crime
name                                                             
Toy Story        0          1          1         1       1      0
In [3]:
movie_cross_table[movie_cross_table.index == 'Yogi Bear']
Out[3]:

genre_list  Action  Adventure  Animation  Children  Comedy  Crime
name                                                             
Yogi Bear        0          0          0         1       1      0

Possible Answers

1. 0 genres in common
 - Incorrect. Note that the Boolean value in each column signifies whether a movie has been labeled with that feature.

2. 2 genres in common
 - Correct! Yogi Bear and Toy Story both have the 'Children' and 'Comedy' attributes. The more genres that two movies have in common, the more likely it is that someone who liked one will like the other, so now we're going to apply this at a larger scale instead of just one pair of movies.

3. 4 genres in common
 - Incorrect. Although the two movies have 4 features with the same value (two 1s and two 0s), you only care about the genres that they both have.

4. 6 genres in common
 - Incorrect. Note that the Boolean value in each column signifies whether a movie has been labeled with that feature.
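
The difference between "genres in common" and "columns with the same value" can be checked in code. The sketch below rebuilds the two rows shown above by hand (the `subset` DataFrame is an assumption standing in for the preloaded `movie_cross_table`) and counts both quantities.

```python
import pandas as pd

genres = ["Action", "Adventure", "Animation", "Children", "Comedy", "Crime"]

# Rows copied from the movie_cross_table subset shown above
subset = pd.DataFrame(
    [[0, 1, 1, 1, 1, 0],
     [0, 0, 0, 1, 1, 0]],
    index=["Toy Story", "Yogi Bear"],
    columns=genres,
)

# A genre is shared only when both movies have a 1 in that column
both = (subset.loc["Toy Story"] == 1) & (subset.loc["Yogi Bear"] == 1)
shared_genres = both[both].index.tolist()
print(len(shared_genres), shared_genres)

# Counting equal values instead also counts the columns where both are 0
same_value = int((subset.loc["Toy Story"] == subset.loc["Yogi Bear"]).sum())
print(same_value)
```
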

# Making content-based recommendations

1. Making content-based recommendations

With our data formatted, we can begin making comparisons and recommendations, but to do so, we will need a way of calculating similarity between rows.
2. Introducing the Jaccard similarity

The metric we will use to measure similarity between items in our newly encoded dataset is called the Jaccard similarity. The Jaccard similarity is the number of attributes that two items have in common divided by the total number of distinct attributes across both items. These are respectively shown by the two orange shaded areas in the Venn diagrams here. It will always be between 0 and 1, and the larger the share of attributes the two items have in common, the higher the score.
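
To make the ratio concrete, here is a small sketch that computes it with plain Python sets. The genre labels for GoldenEye and Heat are taken from the movie_genre_df listing earlier in these notes; the helper name `jaccard_similarity` is hypothetical and not part of any library.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Shared attributes divided by all distinct attributes of both items."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty attribute sets
    return len(a & b) / len(union)

goldeneye = {"Action", "Adventure", "Thriller"}
heat = {"Action", "Crime", "Thriller"}

# 2 shared genres (Action, Thriller) out of 4 distinct genres
print(jaccard_similarity(goldeneye, heat))
```
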
3. Calculating Jaccard similarity between books

We will continue working on the book genre DataFrame created in the last video called genres_array_df. This contains one row for each item (books in this case) and a column for each genre.
4. Calculating Jaccard similarity between books

To calculate the Jaccard similarity between the books in the DataFrame, we first need to import jaccard_score from the sklearn.metrics module. This function takes two vectors (rows in our case) and calculates the similarity value. So we can take the row for The Hobbit and the row for A Game of Thrones and find the Jaccard score. While this is valuable for looking up individual similarities, it is often more useful to have the similarities of all your items calculated at once in an easy-to-access DataFrame.
5. Finding the distance between all items

To get all of these similarities at once for our data we will call upon two helpful functions from the scipy package. First pdist (short for pairwise distance) helps us find all the distances at once, using Jaccard as the metric argument. This returns a condensed matrix, which contains all the distances in a 1D array. We then use squareform to get this 1D data into the rectangular shape we need.
6. Finding the distance between all items

Note that pdist calculates the Jaccard distance which is a measure of how different rows are from each other. As we want the complement of this, the similarity, we subtract the values from 1.
7. Creating a usable distance table

We can now wrap this similarity array in a DataFrame for ease of use. We create a DataFrame with the newly generated jaccard_similarity_array as the main argument, set both the index and columns arguments to the book titles, and call the result distance_df (despite its name, it holds similarity scores). Let's take a look at the distance_df DataFrame we just created.
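
The last three steps (pdist, squareform, and the DataFrame wrapper) can be sketched together as follows. The three-book `genres_df` table is a hypothetical stand-in for the course data, and the values are cast to Boolean so that scipy's `'jaccard'` metric treats them as set membership.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical Boolean genre table: one row per book, one column per genre
genres_df = pd.DataFrame(
    {"Adventure": [1, 0, 0], "Drama": [0, 1, 1], "Fantasy": [1, 1, 0]},
    index=["The Hobbit", "A Game of Thrones", "The Great Gatsby"],
)

# Condensed 1D array with one distance per pair: n * (n - 1) / 2 values
condensed = pdist(genres_df.values.astype(bool), metric="jaccard")
print(condensed.shape)

# Square n x n matrix of distances, converted to similarities
similarity = 1 - squareform(condensed)
similarity_df = pd.DataFrame(similarity, index=genres_df.index, columns=genres_df.index)
print(similarity_df.round(2))
```

The diagonal is 1 because squareform fills it with a distance of 0, that is, every item is identical to itself.
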
8. Comparing books

This distance DataFrame can be used to look up any pairing of books to see how similar they are. Let's look up the similarity between The Hobbit and A Game of Thrones again by using the book titles to index into the distance_df DataFrame. This returns 0.75, a reasonable score, as they are both fun, action-packed fantasy books. If we perform a similar comparison between The Hobbit and The Great Gatsby, we get a much lower score of 0.15. This is not a huge surprise, as The Great Gatsby has very little in common with The Hobbit.
9. Finding the most similar books

Finally, while comparing two books is useful, the table is most valuable when you use it to find a new book that is similar to the one you just read and enjoyed. For this, we select the column containing the book we want to compare with and then sort the results using .sort_values(). The ascending argument must be set to False to show the highest ranked books first. Unsurprisingly, all the top recommendations are similar fantasy adventure books!
10. Let's practice!

This method of recommendation is valuable when you have good descriptive attributes for the items you want to compare. Let's generate recommendations using these techniques with the movie dataset from chapter one.

# Comparing individual movies with Jaccard similarity

In the last lesson, you built a DataFrame of movies, where each column represents a different genre. You can now use this DataFrame to compare movies by measuring the Jaccard similarity between rows. The higher the Jaccard similarity score, the more similar the two items are.

In this exercise, you will compare the movie GoldenEye with the movie Toy Story, and GoldenEye with Skyfall, and compare the results.

The DataFrame movie_cross_table containing all the movies as rows and the genres as Boolean columns that you created in the last lesson has been loaded.

Instructions

1. Import the Jaccard similarity score function from sklearn.metrics.

2. Convert the rows containing 'GoldenEye' and 'Toy Story' to numpy arrays and measure their similarity.

3. Convert the row containing Skyfall to a numpy array and measure its similarity to GoldenEye.


In [None]:
In [1]:
movie_cross_table.head()
Out[1]:

genre_list                            Action  Adventure  Animation  Children  Comedy  ...  Mystery  Romance  Sci-Fi  Thriller  War
name                                                                                  ...                                         
21 Jump Street                             1          0          0         0       1  ...        0        0       0         0    0
Alvin and the Chipmunks: Chipwrecked       0          0          1         0       1  ...        0        0       0         0    0
Another Earth                              0          0          0         0       0  ...        0        1       1         0    0
Beastly                                    0          0          0         0       0  ...        0        1       0         0    0
Bridesmaids                                0          0          0         0       1  ...        0        0       0         0    0

[5 rows x 14 columns]

In [None]:
# Import numpy and the distance metric
import numpy as np
from sklearn.metrics import jaccard_score

# Extract just the rows containing GoldenEye and Toy Story
goldeneye_values = movie_cross_table.loc['GoldenEye'].values
toy_story_values = movie_cross_table.loc['Toy Story'].values

# Find the similarity between GoldenEye and Toy Story
print(jaccard_score(goldeneye_values, toy_story_values))

# Repeat for GoldenEye and Skyfall
skyfall_values = movie_cross_table.loc['Skyfall'].values
print(jaccard_score(goldeneye_values, skyfall_values))

'''

<script.py> output:
    0.14285714285714285

<script.py> output:
    0.14285714285714285
    0.75

'''

Conclusion

Great! As you can see, based on Jaccard similarity, GoldenEye and Skyfall (both James Bond movies) are more similar than GoldenEye and Toy Story (a spy movie and an animated kids movie).

# Comparing all your movies at once

While finding the Jaccard similarity between any two individual movies in your dataset is great for small-scale analyses, computing similarities one pair at a time can be too slow for making recommendations on larger datasets.

In this exercise, you will find the similarities between all movies and store them in a DataFrame for quick and easy lookup.

When finding the similarities between the rows in a DataFrame, you could run through all pairs and calculate them individually, but it's more efficient to use the pdist() (pairwise distance) function from scipy.

The condensed 1D output of pdist() can be reshaped into the desired square matrix using squareform() from the same library. Since you want similarity values as opposed to distances, you should subtract the values from 1.

movie_cross_table has once again been loaded for you.

Instructions

1. Find the Jaccard distance measures between all movies and assign the results to jaccard_distances, then convert them to a square matrix of similarities called jaccard_similarity_array.

2. Create a DataFrame from the jaccard_similarity_array with movie_cross_table.index as its rows and columns.

3. Print the top 5 rows of the DataFrame and examine the similarity scores.


In [None]:
# Import functions from scipy
from scipy.spatial.distance import pdist, squareform

# Calculate all pairwise distances
jaccard_distances = pdist(movie_cross_table.values, metric='jaccard')

# Convert the distances to a square matrix
jaccard_similarity_array = 1 - squareform(jaccard_distances)

# Wrap the array in a pandas DataFrame
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index=movie_cross_table.index, columns=movie_cross_table.index)

# Print the top 5 rows of the DataFrame
print(jaccard_similarity_df.head())

'''
<script.py> output:
    name                                  21 Jump Street  Alvin and the Chipmunks: Chipwrecked  Another Earth  Beastly  Bridesmaids  ...    Cars 2  Green Lantern  Oldboy       Rio      Thor
    name                                                                                                                             ...                                                     
    21 Jump Street                              1.000000                                  0.25            0.0      0.0     0.333333  ...  0.142857            0.2     0.2  0.166667  0.142857
    Alvin and the Chipmunks: Chipwrecked        0.250000                                  1.00            0.0      0.0     0.500000  ...  0.400000            0.0     0.0  0.500000  0.000000
    Another Earth                               0.000000                                  0.00            1.0      0.5     0.000000  ...  0.000000            0.2     0.2  0.000000  0.142857
    Beastly                                     0.000000                                  0.00            0.5      1.0     0.000000  ...  0.000000            0.0     0.2  0.000000  0.333333
    Bridesmaids                                 0.333333                                  0.50            0.0      0.0     1.000000  ...  0.200000            0.0     0.0  0.250000  0.000000
    
    [5 rows x 12 columns]
'''

Conclusion

Correct! As you can see, the table has the movies as rows and columns, allowing you to quickly look up the similarity of any movie pairing.

# Making recommendations based on movie genres

Now that you have your data in a usable format and know how to compare two movies, the next step is to use this to generate recommendations. In this exercise, you will learn how to generate recommendations for any movie in your dataset. The similarity scores between all movies in the dataset that you calculated in the last exercise have been pre-loaded for you as jaccard_similarity_array. movie_cross_table containing the movies and their attributes is also available.

For ease of use, you will need to wrap the similarity scores in a DataFrame. Then you will use this new DataFrame to suggest a movie recommendation.

Instructions

1. Generate a DataFrame called jaccard_similarity_df from jaccard_similarity_array.

2. Store the similarity values between Thor and all other movies as a Series.

3. Sort these from largest to smallest in ordered_similarities.


In [None]:
# Wrap the preloaded array in a DataFrame
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index=movie_cross_table.index, columns=movie_cross_table.index)

# Find the values for the movie Thor
jaccard_similarity_series = jaccard_similarity_df.loc['Thor']

# Sort these values from highest to lowest
ordered_similarities = jaccard_similarity_series.sort_values(ascending=False)

# Print the results
print(ordered_similarities)

'''
<script.py> output:
    name
    Thor                                    1.000000
    Green Lantern                           0.333333
    Cars 2                                  0.250000
    Captain America: The First Avenger      0.250000
    Carnage                                 0.166667
    Another Earth                           0.142857
    21 Jump Street                          0.142857
    Rio                                     0.125000
    Bridesmaids                             0.000000
    Alvin and the Chipmunks: Chipwrecked    0.000000
    Name: Thor, dtype: float64
'''

Question

4. Based on your analysis, which movie in the dataset is most similar to Thor?

Possible Answers

1. Green Lantern
 - Correct. Green Lantern has the highest similarity value to Thor! This means that viewers that liked Thor are likely to enjoy Green Lantern also.
 
2. Cars 2
 - Incorrect: Cars 2 does not have the highest similarity score.

3. Alvin and the Chipmunks: Chipwrecked
 - Incorrect: Remember that a higher similarity score means a more similar movie.

4. Another Earth
 - Incorrect: Another Earth does not have the highest similarity score.


# Text-based similarities

1. Text-based similarities

You can now generate content-based recommendations when descriptive attributes are available.
2. Working without clear attributes

Unfortunately in the real world, this is often not the case as attribute labels such as book genres might not be available. Thankfully if there is text tied to an item then we may still be in luck. This could be a plot summary, an item description, or even the contents of a book itself. For this kind of data, we use "Term Frequency Inverse Document Frequency" or TF-IDF to transform the text into something usable.
3. Term frequency inverse document frequency

TF-IDF divides the number of times a word occurs in a document by a measure of the proportion of all documents in which the word occurs. This has the effect of reducing the value of common words while increasing the weight of words that do not occur in many documents. For example, if you were comparing the script of this course against the scripts of all the courses on DataCamp, the term "DataFrame" might get a low score because, although it occurs a lot, it is present in many DataCamp courses. The term "recommendation", on the other hand, would get a high score as it is not as common in other courses' scripts.
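
The ratio described above can be worked through by hand. This is a simplified sketch of that description only: the `documents` corpus and the `simple_tfidf` helper are hypothetical, and scikit-learn's TfidfVectorizer uses a smoothed logarithmic inverse document frequency and normalizes each row, so its numbers will differ.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for course scripts
documents = {
    "recommenders": "dataframe recommendation recommendation similarity",
    "pandas_basics": "dataframe index dataframe column",
    "plotting": "dataframe plot axis",
}

def simple_tfidf(term: str, doc_name: str) -> float:
    """Term count in one document divided by the share of documents containing the term."""
    term_count = Counter(documents[doc_name].split())[term]
    docs_with_term = sum(term in text.split() for text in documents.values())
    if docs_with_term == 0:
        return 0.0
    document_share = docs_with_term / len(documents)
    return term_count / document_share

# "dataframe" appears in every document, so its single occurrence scores 1 / (3/3)
print(round(simple_tfidf("dataframe", "recommenders"), 2))
# "recommendation" appears twice, and only in this document: 2 / (1/3)
print(round(simple_tfidf("recommendation", "recommenders"), 2))
```
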
4. Our data

In this video, we will be working with a dataset of books and their descriptions as seen here.
5. Instantiate the vectorizer

To transform our data we import TfidfVectorizer from sklearn. We assign an instance of it to a variable, tfidfvec in this case. By default, the vectorizer generates a feature for every word in every document, which is a lot of features. Thankfully we can specify constraints on the features being generated.
6. Filtering the data

First, we set the min_df argument to two. This limits our features to only those that occur in at least two documents. This is useful because a term that occurs in only one document cannot help in finding similarities between documents.
7. Filtering the data

We should also remove words that are too common using max_df. By setting this to 0.7, words that occur in more than 70% of the descriptions will be excluded.
8. Vectorizing the data

Once the vectorizer is instantiated, we call its fit_transform method on the text column. The vectorizer's get_feature_names method shows the features that were generated (in scikit-learn 1.0 and later this method is named get_feature_names_out, and get_feature_names was removed in version 1.2). vectorized_data, when converted to an array, has a row for each book and a column for each feature. Success! We have transformed unorganized text into usable features for our models.
9. Formatting the data

Let's wrap the array in a DataFrame, using the vectorizer's feature names as the columns and the titles from the original DataFrame as the index. The resulting DataFrame will look familiar to you from the previous exercises, with a row per item and a column per feature. Each score represents how prominent that word is in the text compared to other texts, which makes it a useful attribute. For example, the score for the term 'battle' is much higher for A Game of Thrones, which is understandable given its theme.
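
The steps from the last few slides can be sketched end to end as follows. The `df_books` DataFrame and its four one-line descriptions are hypothetical, chosen so that the effect of min_df and max_df is easy to follow. The feature names are read from the fitted `vocabulary_` attribute, which maps each kept term to its column position.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions; the titles and text are illustrative only
df_books = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "Description": [
        "the dragon guards the gold",
        "the knight fights the dragon",
        "the knight rides home",
        "the wizard sleeps",
    ],
})

# Keep terms found in at least 2 documents and in at most 70% of them
tfidfvec = TfidfVectorizer(min_df=2, max_df=0.7)
vectorized_data = tfidfvec.fit_transform(df_books["Description"])

# Order the kept terms by their column position in the output matrix
feature_names = sorted(tfidfvec.vocabulary_, key=tfidfvec.vocabulary_.get)

tfidf_df = pd.DataFrame(
    vectorized_data.toarray(), columns=feature_names, index=df_books["Title"]
)
print(tfidf_df)
```

Here "the" is dropped because it appears in all four descriptions, and words such as "gold" or "wizard" are dropped because they appear in only one. A description left with no kept terms, like Book D's, ends up as a row of zeros.
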
10. Cosine similarity

As we advance from Boolean features to continuous TF-IDF values, we will use a metric that is better suited to comparing items with more variation in their data: cosine similarity. We won't go into it in depth here, but mathematically it is the cosine of the angle between two document vectors in the high-dimensional feature space, as seen in this two-dimensional example. Because TF-IDF values are never negative, all values here are between 0 and 1, where 1 means the two vectors point in exactly the same direction.
11. Cosine similarity

Thankfully, sklearn has a premade cosine_similarity function that we can use to find the similarity between all rows by calling it on the DataFrame, or between two rows by reshaping their values into 2D arrays as seen here.
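
Both uses can be sketched on a tiny table. The `tfidf_df` values below are made up for illustration (they are not real TF-IDF output), and the final lines recompute one pair directly from the definition as a check.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF scores: one row per book, one column per term
tfidf_df = pd.DataFrame(
    [[0.8, 0.6, 0.0],
     [0.6, 0.8, 0.0],
     [0.0, 0.0, 1.0]],
    index=["Book A", "Book B", "Book C"],
    columns=["battle", "dragon", "party"],
)

# All pairwise similarities at once
cosine_similarity_array = cosine_similarity(tfidf_df)
cosine_similarity_df = pd.DataFrame(
    cosine_similarity_array, index=tfidf_df.index, columns=tfidf_df.index
)

# Two individual rows: each must be reshaped to 2D (1 row, n features)
row_a = tfidf_df.loc["Book A"].values.reshape(1, -1)
row_b = tfidf_df.loc["Book B"].values.reshape(1, -1)
pair_similarity = cosine_similarity(row_a, row_b)[0, 0]

# The same value computed directly: dot product over the product of the norms
a, b = row_a.ravel(), row_b.ravel()
by_hand = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(pair_similarity, 2), round(by_hand, 2))
```
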
12. Let's practice!

Now it's your turn to use these similarities to generate recommendations!



# Instantiate the TF-IDF model

TF-IDF by default generates a column for every word in all of your documents (movie summaries in our case). This creates a huge and unintuitive dataset as it will contain both very common words that appear in every document, and words that appear so rarely they provide no value in finding similarities between items.

In this exercise, you will work with the df_plots DataFrame. It contains movies' names in the Title column and their plots in the Plot column.

Using this DataFrame, you will generate the default TF-IDF scores and see if non-valuable columns are present.

You will go on to rerun the TF-IDF calculations, this time limiting the number of columns using the min_df and max_df arguments and hopefully see the improvement.

Instructions

1. Create a TfidfVectorizer and call it vectorizer.

2. Use vectorizer to transform the data in the Plot column of df_plots and assign the output to vectorized_data.

3. Inspect the features that have been generated by the transformation.


In [None]:
In [1]:
df_plots.head()
Out[1]:

                            Title                                               Plot
0  Ace Ventura: When Nature Calls  In the Himalayas, after a failed rescue missio...
1     Dracula: Dead and Loving It  Solicitor Thomas Renfield travels all the way ...
2     Father of the Bride Part II  The film begins five years after the events of...
3                      Four Rooms  The film is set on New Year's Eve, and starts ...
4                Grumpier Old Men  The feud between Max (Walter Matthau) and John...

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the vectorizer object to the vectorizer variable
vectorizer = TfidfVectorizer()

# Fit and transform the plot column
vectorized_data = vectorizer.fit_transform(df_plots['Plot'])

# Look at the features generated
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(vectorizer.get_feature_names_out().tolist())

'''
<script.py> output:
    ['000', '10', '100', '1869', '1969', '1986', '1995', '2nd', '309', '40', '404', '409', '500', 'abandoned', 'abandons', 'abbey', 'abbot', 'aboard', 'about', 'above', 'abraham', 'abusing', 'accepting', 'access', 'accident', 'accidentally', 'accompanied', 'accomplice', 'account', 'accounts', 'accuses', 'accusing', 'ace', 'achieved', 'acquainted', 'action', 'active', 'activity', 'actual', 'actually', 'ad', 'adding', 'addition', 'administrator', 'admiral', 'admiring', 'admits', 'adopt', 'adult', 'advances', 'advice', 'advised', 'advises', 'affectionately', 'africa', 'african', 'after', 'afterwards', 'again', 'against', 'agent', 'agents', 'ago', 'agonized', 'agrees', 'air', 'alan', 'alec', 'alibi', 'alive', 'all', 'allow', 'allows', 'alone', 'along', 'also', 'although', 'always', 'amanda', 'ambulance', 'american', 'among', 'an', 'and', 'andy', 'angela', 'angrily', 'angry', 'animal', 'animals', 'animated', 'ann', 'annie', 'announce', 'annual', 'another', 'answers', 'antenna', 'antonio', 'any', 'anyone', 'apologize', 'apologizes', 'apology', 'apparently', 'appears', 'appointed', 'approached', 'approves', 'archives', 'are', 'arena', 'argument', 'ariel', 'arkady', 'arkhangelsk', 'arm', 'armistice', 'arms', 'army', 'around', 'arrangement', 'arranges', 'arrested', 'arrivals', 'arrive', 'arrives', 'arriving', 'as', 'ashes', 'asked', 'asks', 'asleep', 'assigned', 'assigns', 'assistant', 'assists', 'assumes', 'astral', 'asylum', 'at', 'atmospheric', 'atop', 'attack', 'attacked', 'attacks', 'attempt', 'attempting', 'attempts', 'attend', 'attending', 'attic', 'attorney', 'audience', 'aunt', 'averting', 'awakens', 'away', 'baby', 'back', 'badly', 'bait', 'ball', 'banderas', 'bank', 'banks', 'barely', 'bargain', 'barry', 'basketball', 'bat', 'bathroom', 'bats', 'batteries', 'battlefield', 'battles', 'be', 'beach', 'beacon', 'beals', 'beauty', 'because', 'becky', 'become', 'becomes', 'becoming', 'bed', 'bedroom', 'been', 'before', 'befriends', 'begging', 'begin', 'beginning', 
'begins', 'behavior', 'behind', 'being', 'beings', 'belief', 'believe', 'believed', 'believing', 'bellhop', 'bender', 'bentley', 'bernadine', 'bernie', 'bet', 'betray', 'betrayed', 'betrays', 'betty', 'between', 'bicycle', 'big', 'bigger', 'bill', 'billions', 'billy', 'binder', 'birth', 'birthday', 'blackhawks', 'blend', 'blinds', 'block', 'blocked', 'blood', 'blow', 'blowing', 'bo', 'board', 'boarding', 'body', 'bodyguards', 'bombs', 'bond', 'bonding', 'booby', 'boothe', 'boris', 'born', 'borrow', 'boss', 'boston', 'both', 'bottle', 'box', 'boy', 'boys', 'brad', 'brantford', 'bravely', 'break', 'breakdown', 'breaking', 'brian', 'bride', 'briefing', 'bring', 'bringing', 'brings', 'britain', 'british', 'brothers', 'brown', 'bryan', 'bucket', 'bulldog', 'bullet', 'bullies', 'bunker', 'bureau', 'burgess', 'buried', 'burn', 'burns', 'bury', 'business', 'but', 'butt', 'buy', 'buzz', 'by', 'cadby', 'cadenet', 'cage', 'calamities', 'calderón', 'call', 'called', 'calls', 'came', 'camp', 'can', 'canada', 'canadian', 'cancel', 'captive', 'captured', 'captures', 'capuchin', 'car', 'care', 'career', 'carfax', 'carl', 'carla', 'carlo', 'carnivorous', 'carried', 'carrying', 'cartoons', 'case', 'casino', 'cast', 'castle', 'casts', 'cat', 'catch', 'catches', 'catering', 'catfish', 'caught', 'cauldron', 'causes', 'cave', 'caves', 'celibacy', 'center', 'century', 'chagrin', 'chairman', 'challenge', 'challenges', 'champagne', 'chance', 'change', 'channel', 'chaos', 'chapel', 'charge', 'charged', 'charges', 'charles', 'charming', 'chase', 'chases', 'chasing', 'chasm', 'cheating', 'chemical', 'chest', 'chester', 'chicago', 'chief', 'child', 'children', 'chiming', 'choose', 'chopper', 'chops', 'christmas', 'chronic', 'chuck', 'church', 'cia', 'cigarette', 'civic', 'civil', 'claim', 'clans', 'clarify', 'claude', 'clear', 'cliffhanger', 'climbs', 'clock', 'closed', 'closet', 'club', 'coast', 'coffin', 'coherent', 'cold', 'collaborated', 'colonel', 'colony', 'combustible', 'come', 'comes', 
'coming', 'commandeers', 'comment', 'committee', 'community', 'company', 'complete', 'completely', 'completes', 'complex', 'complicate', 'computer', 'concealing', 'conceals', 'concern', 'concludes', 'concluding', 'condition', 'confidence', 'confined', 'confirms', 'confrontations', 'confronts', 'confused', 'connecting', 'consciousness', 'constant', 'construction', 'consul', 'consulate', 'consults', 'contact', 'contacts', 'containing', 'contented', 'continue', 'continued', 'continues', 'control', 'conversation', 'converting', 'convince', 'convinced', 'convinces', 'cook', 'cooled', 'corpse', 'correspondent', 'cossack', 'cossacks', 'count', 'country', 'couple', 'court', 'coven', 'covered', 'cowboy', 'cradle', 'crane', 'crash', 'create', 'creations', 'credentials', 'credits', 'creeping', 'crew', 'crime', 'crisis', 'cross', 'crushes', 'crypt', 'cryptic', 'crystal', 'cuba', 'cup', 'currently', 'curtain', 'cut', 'cuts', 'cyberterrorism', 'cynical', 'damaged', 'damages', 'damme', 'dance', 'dangerous', 'dangling', 'daniel', 'danny', 'darren', 'dart', 'darts', 'daryl', 'dates', 'dating', 'daughter', 'david', 'davis', 'day', 'days', 'de', 'dead', 'deal', 'death', 'deaths', 'decent', 'decide', 'decides', 'declare', 'declared', 'decline', 'decrypt', 'deduces', 'defeated', 'defence', 'defending', 'delayed', 'delivered', 'delivery', 'demolish', 'demolition', 'demoted', 'department', 'departure', 'depleted', 'deprivation', 'derelict', 'descendant', 'describing', 'desire', 'desk', 'despair', 'despite', 'despondency', 'destroy', 'destroyed', 'destroyer', 'destroying', 'destroys', 'destruct', 'destruction', 'details', 'devil', 'devises', 'devotion', 'diana', 'dice', 'did', 'didn', 'died', 'dies', 'different', 'differing', 'digging', 'dimitri', 'dines', 'dinner', 'dinosaur', 'director', 'directs', 'disable', 'disappearance', 'disappeared', 'disapproves', 'disarm', 'disavowing', 'discover', 'discovered', 'discovers', 'discussion', 'diseases', 'dish', 'disk', 'disorder', 'display', 
'distraught', 'divert', 'divorce', 'do', 'dobermans', 'doc', 'doctor', 'does', 'dog', 'doll', 'dollars', 'door', 'dorian', 'dosage', 'doughnut', 'douglas', 'down', 'dowry', 'dr', 'dracula', 'draining', 'drains', 'drama', 'dramatically', 'dreams', 'dressed', 'drinking', 'drive', 'driver', 'driveway', 'driving', 'drop', 'dropped', 'drops', 'drove', 'drugged', 'drumbeats', 'drunk', 'due', 'dumping', 'dumps', 'duo', 'during', 'dusts', 'duty', 'dwight', 'dyes', 'each', 'earlier', 'easily', 'economy', 'edsion', 'eerie', 'eisenberg', 'either', 'elated', 'electromagnetic', 'electronic', 'elephant', 'eliminated', 'eliminating', 'elopes', 'embark', 'emergency', 'emerging', 'emily', 'emmett', 'emotional', 'employee', 'employer', 'empty', 'encounters', 'encourages', 'end', 'ends', 'enduring', 'enemas', 'energetic', 'engaged', 'engagement', 'england', 'enjoying', 'enough', 'ensued', 'ensues', 'ensuing', 'ensure', 'enter', 'enters', 'entire', 'entry', 'equalizer', 'equally', 'erasing', 'eric', 'escalations', 'escape', 'escapes', 'escaping', 'estate', 'eternity', 'eurocopter', 'europe', 'evaded', 'eve', 'even', 'events', 'eventually', 'ever', 'everyone', 'everything', 'evidence', 'ex', 'exact', 'exam', 'examinations', 'excellent', 'excuses', 'executed', 'executive', 'exhale', 'existence', 'expecting', 'experience', 'experiencing', 'expert', 'explains', 'explode', 'explodes', 'explore', 'explosion', 'explosives', 'express', 'extremely', 'eye', 'eyed', 'eyelids', 'face', 'facility', 'fact', 'factory', 'fail', 'failed', 'fails', 'faint', 'faked', 'fall', 'falling', 'falls', 'family', 'famous', 'fancy', 'fantasy', 'farrel', 'fast', 'father', 'favor', 'favorite', 'fear', 'fearfully', 'fearing', 'feature', 'features', 'feeds', 'feel', 'feelings', 'female', 'fertilizer', 'festival', 'feud', 'few', 'fiance', 'fictional', 'field', 'fight', 'figure', 'fills', 'film', 'final', 'finalize', 'finally', 'finals', 'financial', 'find', 'finding', 'finds', 'finish', 'finishes', 'finn', 'fire', 
'fired', 'firefight', 'firefighter', 'firework', 'first', 'fishing', 'five', 'flaherty', 'flee', 'flees', 'flies', 'flirts', 'flooding', 'floor', 'florida', 'fly', 'follow', 'following', 'follows', 'foot', 'for', 'forces', 'foreign', 'forget', 'form', 'formally', 'formed', 'former', 'forward', 'foss', 'found', 'four', 'fractured', 'frame', 'francesca', 'franck', 'frantically', 'free', 'french', 'friend', 'friends', 'friendship', 'frisky', 'from', 'front', 'frye', 'full', 'fully', 'fulton', 'fun', 'funeral', 'furiously', 'further', 'future', 'gain', 'game', 'gamekeeper', 'gang', 'gareth', 'garlic', 'gas', 'gay', 'gears', 'george', 'get', 'gets', 'getting', 'gift', 'girl', 'girls', 'give', 'given', 'gives', 'giving', 'gliding', 'glo', 'gloria', 'go', 'goal', 'goalie', 'goddess', 'goes', 'going', 'goldeneye', 'gone', 'good', 'gorilla', 'gorillas', 'got', 'government', 'grabs', 'grand', 'grandchild', 'grandfather', 'grappa', 'grave', 'graveyard', 'great', 'green', 'greenwald', 'greenwall', 'greeted', 'grenade', 'grew', 'griffin', 'grigorovich', 'grishenko', 'grounds', 'group', 'grow', 'growing', 'grown', 'guano', 'guard', 'guards', 'guest', 'guests', 'guilbert', 'gun', 'gunpoint', 'habib', 'habibs', 'hacked', 'had', 'hair', 'hallmark', 'hamm', 'hampshire', 'hand', 'handle', 'handling', 'hannah', 'happened', 'hard', 'harewood', 'harker', 'harris', 'has', 'hatchet', 'have', 'having', 'hawaii', 'he', 'head', 'headquarters', 'heads', 'hear', 'hearing', 'hears', 'heavy', 'helicopter', 'help', 'helsing', 'henchmen', 'her', 'herds', 'hero', 'heroes', 'hers', 'herself', 'hesitation', 'hesling', 'high', 'highlands', 'highly', 'him', 'himalayas', 'himself', 'hired', 'hires', 'his', 'hit', 'hitched', 'hits', 'hitu', 'hold', 'holding', 'holds', 'hole', 'hollywood', 'home', 'honeymoon', 'hope', 'horror', 'hospital', 'hostage', 'hostages', 'hostility', 'hotel', 'hour', 'house', 'how', 'howard', 'however', 'huck', 'huge', 'humans', 'humiliating', 'humilited', 'hundreds', 'hunter', 
'husband', 'hut', 'hyphenated', 'hypnotic', 'ice', 'iceburgh', 'identify', 'identity', 'if', 'ignites', 'impact', 'importance', 'important', 'impresses', 'in', 'inability', 'inadvertently', 'includes', 'including', 'increased', 'indeed', 'infant', 'infiltrate', 'informs', 'ingredient', 'initially', 'initiate', 'injun', 'innocent', 'inquiry', 'insensitivity', 'inside', 'insisting', 'inspired', 'instant', 'instead', 'instructs', 'insults', 'intelligence', 'intended', 'intending', 'intense', 'intent', 'interest', 'interrogates', 'intervenes', 'into', 'introduces', 'investigate', 'investigating', 'invited', 'invites', 'involvement', 'involves', 'ione', 'irritated', 'is', 'island', 'it', 'italian', 'its', 'itself', 'jack', 'jackson', 'jacob', 'james', 'janus', 'japan', 'jealousy', 'jean', 'jennifer', 'jessup', 'jim', 'job', 'joe', 'john', 'join', 'joining', 'joins', 'joke', 'jokes', 'jokingly', 'jonas', 'jonathan', 'joshua', 'joy', 'juancho', 'judy', 'jumanji', 'jungle', 'just', 'justify', 'kathy', 'keep', 'kevin', 'keys', 'kgb', 'kidnap', 'kidnapped', 'kids', 'kill', 'killed', 'killing', 'kills', 'kincade', 'king', 'kiss', 'kneel', 'knife', 'knocking', 'know', 'knowledge', 'labor', 'lake', 'lana', 'land', 'language', 'lanny', 'laptop', 'large', 'laser', 'last', 'late', 'later', 'laugh', 'launch', 'lawrence', 'leader', 'leads', 'league', 'learn', 'learning', 'learns', 'leave', 'leaves', 'leaving', 'led', 'left', 'leg', 'leigh', 'lemmon', 'let', 'lets', 'lies', 'life', 'lifeless', 'lift', 'light', 'lighter', 'lightyear', 'likely', 'lion', 'lipstick', 'liquor', 'list', 'listen', 'livid', 'living', 'll', 'local', 'locate', 'located', 'locks', 'lodged', 'london', 'lone', 'long', 'longer', 'look', 'looking', 'loren', 'loses', 'losing', 'lost', 'love', 'lover', 'luc', 'lucky', 'lucy', 'luggage', 'lunatic', 'luxury', 'macau', 'machine', 'mackenzies', 'mad', 'made', 'maine', 'make', 'makes', 'making', 'malinger', 'mallory', 'man', 'manage', 'manages', 'maniacally', 'manner', 
'mansion', 'many', 'map', 'marc', 'margaret', 'margret', 'maria', 'marian', 'marines', 'marisa', 'mark', 'market', 'marks', 'marriage', 'marriages', 'married', 'marries', 'marshal', 'martha', 'marvin', 'mascot', 'mass', 'massacre', 'master', 'masturbates', 'mate', 'mating', 'matthau', 'matthew', 'matthews', 'matty', 'max', 'mccord', 'mcdougal', 'mckissack', 'mcshane', 'meaning', 'meanwhile', 'medicine', 'meet', 'meeting', 'meets', 'megan', 'melanie', 'member', 'memories', 'men', 'menopause', 'mercenary', 'meredith', 'message', 'mi6', 'middle', 'might', 'mike', 'mildly', 'military', 'millions', 'mina', 'mind', 'minister', 'mirror', 'misbehave', 'mischievous', 'misha', 'mishap', 'mishkin', 'misplaces', 'misses', 'missing', 'mission', 'mississippi', 'mistake', 'mistaken', 'mistakenly', 'mistakes', 'mistress', 'mobile', 'moldavian', 'molly', 'moments', 'mon', 'monastery', 'money', 'moneypenny', 'monitors', 'monkey', 'monkeys', 'monster', 'monte', 'months', 'more', 'morgan', 'morning', 'mosquitoes', 'most', 'mother', 'mouth', 'move', 'moved', 'moves', 'movie', 'moving', 'mr', 'much', 'muff', 'murder', 'murders', 'murrell', 'must', 'mutant', 'nails', 'name', 'named', 'natalya', 'native', 'navy', 'nazi', 'near', 'nearby', 'neck', 'need', 'needs', 'neighbor', 'neighbors', 'nest', 'never', 'new', 'newly', 'newman', 'news', 'newspaper', 'next', 'nibia', 'night', 'nights', 'nina', 'nine', 'no', 'non', 'none', 'nora', 'norman', 'not', 'notice', 'notices', 'noticing', 'now', 'nubile', 'number', 'numerous', 'nursery', 'oath', 'obscure', 'obsessed', 'obstetrician', 'obvious', 'obviously', 'occupied', 'occurring', 'odd', 'of', 'off', 'offer', 'offered', 'offering', 'offers', 'office', 'officer', 'offices', 'ointment', 'old', 'on', 'onatopp', 'once', 'one', 'only', 'opening', 'opens', 'opera', 'operation', 'operative', 'opposite', 'or', 'orchestrated', 'order', 'ordered', 'orders', 'organizes', 'other', 'others', 'otherwise', 'ouda', 'ourumov', 'out', 'outcome', 'outing', 
'outline', 'over', 'overcome', 'overly', 'overpowers', 'overprotective', 'overrun', 'overtime', 'own', 'owned', 'owner', 'page', 'paid', 'pain', 'painting', 'pale', 'panic', 'panther', 'paramedics', 'paranoid', 'parents', 'park', 'parliament', 'parody', 'parrish', 'parrishes', 'part', 'partially', 'participate', 'party', 'passed', 'passes', 'passion', 'patch', 'patient', 'patients', 'patrice', 'patricia', 'paul', 'pay', 'peace', 'peep', 'pelican', 'pelt', 'pen', 'penguins', 'penthouse', 'people', 'perform', 'performs', 'period', 'personal', 'perspectives', 'persuade', 'persuades', 'peter', 'petersburg', 'petya', 'phillips', 'phone', 'phrases', 'physical', 'physician', 'piece', 'piggy', 'pills', 'pilot', 'pink', 'pinky', 'pistol', 'pittsburgh', 'pizza', 'place', 'placed', 'places', 'placing', 'plan', 'plane', 'planet', 'planned', 'planner', 'plans', 'plant', 'plants', 'platform', 'play', 'playing', 'pleads', 'plot', 'plummets', 'poachers', 'pocket', 'point', 'police', 'pollak', 'poorly', 'position', 'possession', 'posttraumatic', 'potato', 'potion', 'potter', 'powered', 'powers', 'ppk', 'practical', 'praised', 'pregnancies', 'pregnancy', 'pregnant', 'prepare', 'prepares', 'preparing', 'prescribing', 'presence', 'present', 'preserving', 'president', 'presses', 'pressure', 'presumed', 'pretend', 'prevent', 'preventing', 'previous', 'priest', 'prince', 'princess', 'prisoner', 'problem', 'problems', 'proceed', 'process', 'producer', 'program', 'programmed', 'programmer', 'programs', 'projection', 'prolonged', 'prolonging', 'promises', 'properly', 'property', 'prospect', 'prostate', 'prostitute', 'prototype', 'proval', 'prove', 'provide', 'provincial', 'prudish', 'psychiatrist', 'psychological', 'public', 'pull', 'pulled', 'punches', 'puncture', 'puppy', 'purchase', 'purchased', 'pursue', 'pursued', 'pursuers', 'pursues', 'pursuit', 'put', 'puts', 'putting', 'puzzled', 'quartermaster', 'quell', 'quest', 'quickly', 'quietly', 'quit', 'quits', 'raccoon', 'rachael', 
'radio', 'raft', 'ragetti', 'raise', 'range', 'ranger', 'raoul', 'raped', 'rappels', 'rather', 'raving', 'raymond', 'rc', 're', 'reads', 'ready', 'real', 'realize', 'realizes', 'realizing', 'reasons', 'rebel', 'recaptures', 'receive', 'received', 'receives', 'recognizes', 'reconcile', 'reconciles', 'reconnaissance', 'records', 'recover', 'recovered', 'redecoration', 'reflection', 'reformed', 'refuses', 'regaining', 'regains', 'rehired', 'reignite', 'relationship', 'release', 'releases', 'relieved', 'reluctant', 'reluctantly', 'remaining', 'reminds', 'remove', 'removed', 'removes', 'rendition', 'renfield', 'renfro', 'renovation', 'repeatedly', 'repel', 'replaced', 'replacement', 'report', 'requests', 'rescue', 'rescued', 'rescues', 'rescuing', 'resents', 'respectively', 'responds', 'responsibility', 'responsible', 'rest', 'restaurant', 'restored', 'restoring', 'result', 'results', 'retire', 'retired', 'retiring', 'retreats', 'retrieve', 'retrieves', 'return', 'returned', 'returns', 'reunite', 'reveal', 'revealed', 'revealing', 'reveals', 'revenge', 'reverse', 'reversed', 'rex', 'ride', 'rifle', 'right', 'rights', 'rises', 'ritual', 'rival', 'river', 'road', 'rob', 'robin', 'robinson', 'robitaille', 'rock', 'rocket', 'role', 'roll', 'rolling', 'rolls', 'roof', 'room', 'ross', 'rot', 'roth', 'routine', 'row', 'rules', 'run', 'running', 'runs', 'rush', 'russell', 'russian', 'sabotage', 'sabotages', 'sacred', 'sacrificing', 'safari', 'safely', 'safety', 'saint', 'salon', 'salvaged', 'sam', 'same', 'sand', 'sandwich', 'sarah', 'sarge', 'satellite', 'satellites', 'savannah', 'save', 'saved', 'saves', 'saving', 'sawyer', 'saying', 'scale', 'scare', 'scarf', 'scarred', 'scat', 'scenario', 'scene', 'school', 'schweig', 'score', 'scored', 'scores', 'scottish', 'scouting', 'scrapes', 'screaming', 'scribbled', 'scud', 'search', 'searched', 'searching', 'season', 'second', 'seconds', 'secret', 'secretary', 'security', 'seduce', 'seducing', 'see', 'seeing', 'seems', 'seen', 
'sees', 'segment', 'seldes', 'self', 'sell', 'sells', 'semen', 'send', 'sends', 'sentimental', 'separate', 'series', 'serve', 'servers', 'service', 'set', 'sets', 'setting', 'settlement', 'several', 'severely', 'severnaya', 'seward', 'sex', 'shanghai', 'share', 'shared', 'sharp', 'she', 'shepherd', 'shepherdess', 'shepherds', 'sheriff', 'shikaka', 'ship', 'shoe', 'shoot', 'shoots', 'shop', 'shore', 'shortly', 'shot', 'should', 'show', 'shower', 'shows', 'siberia', 'siberian', 'sid', 'side', 'sigfried', 'signed', 'signor', 'silva', 'simonova', 'simple', 'simultaneous', 'simultaneously', 'since', 'single', 'sinks', 'sister', 'site', 'situation', 'six', 'skemp', 'skeptical', 'ski', 'skye', 'skyfall', 'skyscraper', 'slam', 'slave', 'sleep', 'sleeping', 'slept', 'slinky', 'slowed', 'small', 'smile', 'snapping', 'snares', 'so', 'soar', 'sold', 'sole', 'solicitor', 'some', 'somehow', 'someone', 'something', 'son', 'song', 'soon', 'sophia', 'sound', 'soviet', 'space', 'spain', 'speaking', 'specifically', 'spell', 'spends', 'spike', 'spirits', 'spring', 'spy', 'squabble', 'st', 'stabs', 'staff', 'stage', 'stagecoach', 'stages', 'stampede', 'stand', 'standing', 'standoff', 'stanley', 'startled', 'starts', 'state', 'station', 'stay', 'steal', 'steals', 'step', 'stern', 'still', 'stokes', 'stolen', 'stop', 'stops', 'store', 'story', 'stowing', 'stranding', 'strange', 'strangely', 'street', 'stress', 'stressed', 'strike', 'string', 'strongly', 'stuck', 'stuffed', 'stumble', 'stunned', 'style', 'subsequently', 'succeed', 'succeeds', 'successful', 'successfully', 'succumbs', 'sucked', 'sucks', 'sudden', 'suddenly', 'suffered', 'suffering', 'suggested', 'suggestible', 'suggests', 'suite', 'summer', 'summon', 'summoned', 'summons', 'sunlight', 'support', 'sure', 'surprise', 'surprises', 'surround', 'survives', 'survivor', 'suspect', 'suspects', 'suspended', 'suspicion', 'suspicious', 'swarm', 'sweeps', 'swing', 'switches', 'symptoms', 'syndicate', 'syringe', 'sévérine', 'take', 
'taken', 'takes', 'taking', 'tamlyn', 'tank', 'tanner', 'tarantino', 'target', 'taunting', 'taunts', 'tavern', 'taylor', 'teach', 'team', 'tear', 'ted', 'television', 'telling', 'tells', 'temple', 'ten', 'tend', 'termites', 'terms', 'terrorists', 'than', 'that', 'thatcher', 'the', 'theft', 'their', 'them', 'then', 'theodore', 'there', 'thereafter', 'these', 'they', 'thighs', 'thing', 'things', 'thinking', 'thinks', 'third', 'this', 'thomas', 'though', 'thought', 'threat', 'threatens', 'three', 'throat', 'through', 'throughout', 'throw', 'throwing', 'throws', 'thugs', 'thunderstorm', 'thus', 'thwarts', 'tibetan', 'ticks', 'tied', 'tiger', 'tim', 'time', 'times', 'to', 'together', 'token', 'told', 'tom', 'tomei', 'tomita', 'too', 'took', 'toothbrushes', 'top', 'toss', 'touch', 'tow', 'toward', 'towards', 'town', 'toy', 'toys', 'trade', 'traders', 'trail', 'train', 'tranquilizer', 'translating', 'transported', 'transylvania', 'trap', 'traps', 'travel', 'travels', 'treasure', 'tree', 'trevelyan', 'trial', 'trials', 'triangulates', 'tribal', 'tribe', 'tribes', 'tribulations', 'tricks', 'tries', 'triggers', 'trio', 'trip', 'troop', 'troublesome', 'truck', 'true', 'trust', 'truth', 'try', 'tunnel', 'turned', 'turns', 'tv', 'twenty', 'twine', 'two', 'tyler', 'unable', 'uncertainty', 'uncovering', 'undead', 'under', 'undercover', 'undergoes', 'underground', 'unfortunately', 'unintentionally', 'unknowingly', 'unsettled', 'unsuccessfully', 'unsure', 'until', 'unusual', 'unusually', 'unwittingly', 'up', 'upcoming', 'upon', 'upset', 'us', 'use', 'used', 'uses', 'uttering', 'vacant', 'vacation', 'valentin', 'value', 'vampire', 'van', 'vandalize', 'vannah', 'various', 'vatsnik', 'vent', 'ventura', 'verduzco', 'vertigogo', 'via', 'vic', 'vice', 'vicious', 'victim', 'village', 'vincent', 'vips', 'virgin', 'visiting', 'visits', 'vomiting', 'vonne', 'vowed', 'voyage', 'wachati', 'wachootoo', 'wade', 'wait', 'waiting', 'walk', 'walking', 'walls', 'walter', 'walther', 'wanted', 
'wanting', 'wants', 'war', 'ward', 'warnings', 'warns', 'warts', 'was', 'wash', 'washed', 'watch', 'watching', 'water', 'way', 'weapon', 'weapons', 'wear', 'wed', 'wedding', 'week', 'weeks', 'welcome', 'well', 'what', 'when', 'where', 'whether', 'which', 'while', 'whilst', 'white', 'whittle', 'whittni', 'who', 'whom', 'whore', 'widow', 'widowed', 'wield', 'wife', 'wildlife', 'will', 'william', 'win', 'window', 'wings', 'wins', 'winter', 'wired', 'witches', 'with', 'without', 'witness', 'woman', 'women', 'wood', 'woody', 'word', 'words', 'work', 'working', 'world', 'worried', 'worries', 'worse', 'worth', 'would', 'wounded', 'wounds', 'wrecking', 'wright', 'wrong', 'xenia', 'yacht', 'year', 'years', 'yells', 'yes', 'yet', 'you', 'young', 'younger', 'yourself', 'zippo', 'zukovsky']
'''

4. Repeat the creation of the TfidfVectorizer, but this time, set the minimum document frequency to 2 and the maximum document frequency to 0.7.
    
5. Inspect the features that have been generated by the transformation.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the vectorizer object to the vectorizer variable
vectorizer = TfidfVectorizer(min_df=2, max_df=0.7)

# Fit and transform the plot column
vectorized_data = vectorizer.fit_transform(df_plots['Plot'])

# Look at the features generated
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(vectorizer.get_feature_names_out().tolist())

'''
<script.py> output:
    ['000', '100', 'abandoned', 'above', 'accidentally', 'accomplice', 'admits', 'adult', 'african', 'again', 'against', 'agent', 'agents', 'alive', 'all', 'allows', 'alone', 'also', 'although', 'animals', 'another', 'appears', 'approached', 'around', 'arrested', 'arrives', 'arriving', 'asks', 'assistant', 'assists', 'attack', 'attacked', 'attacks', 'attempting', 'attempts', 'attending', 'away', 'baby', 'back', 'ball', 'bank', 'bats', 'because', 'become', 'becomes', 'bed', 'been', 'before', 'begin', 'begins', 'being', 'between', 'blow', 'board', 'bond', 'boss', 'both', 'box', 'bride', 'bring', 'brings', 'britain', 'british', 'burns', 'business', 'call', 'called', 'calls', 'can', 'canadian', 'captured', 'captures', 'car', 'care', 'case', 'caves', 'chaos', 'chase', 'chest', 'child', 'children', 'christmas', 'cia', 'clock', 'closed', 'come', 'comes', 'containing', 'continue', 'control', 'convinces', 'country', 'couple', 'credits', 'crew', 'crime', 'dart', 'darts', 'daughter', 'day', 'dead', 'death', 'decides', 'declare', 'deduces', 'delivered', 'desk', 'despite', 'destroy', 'destroyed', 'devil', 'did', 'died', 'different', 'discover', 'discovered', 'discovers', 'discussion', 'do', 'doctor', 'does', 'door', 'down', 'dr', 'drive', 'drops', 'drove', 'due', 'during', 'each', 'earlier', 'electronic', 'embark', 'employee', 'empty', 'end', 'ends', 'england', 'ensues', 'entire', 'escape', 'escapes', 'eve', 'events', 'eventually', 'ever', 'executed', 'executive', 'expecting', 'explode', 'explodes', 'explosives', 'eyed', 'failed', 'fails', 'falling', 'falls', 'family', 'father', 'female', 'few', 'field', 'fight', 'film', 'finally', 'find', 'finds', 'finish', 'finishes', 'fire', 'first', 'five', 'flees', 'floor', 'follow', 'following', 'follows', 'foot', 'forces', 'former', 'four', 'free', 'french', 'friend', 'friends', 'friendship', 'front', 'full', 'funeral', 'further', 'game', 'gang', 'get', 'gets', 'getting', 'girl', 'gives', 'go', 'goes', 'going', 'grand', 'grave', 'great', 
'group', 'growing', 'guard', 'guests', 'had', 'hand', 'have', 'having', 'head', 'heads', 'helicopter', 'help', 'herself', 'high', 'himself', 'holding', 'holds', 'home', 'honeymoon', 'hospital', 'hostage', 'hotel', 'house', 'how', 'however', 'hunter', 'husband', 'ice', 'identity', 'if', 'inadvertently', 'including', 'indeed', 'initially', 'instead', 'instructs', 'introduces', 'involvement', 'its', 'itself', 'jack', 'james', 'job', 'join', 'joins', 'jonathan', 'jungle', 'just', 'keep', 'kill', 'killed', 'killing', 'kills', 'knife', 'lake', 'large', 'last', 'late', 'learns', 'leave', 'leaves', 'leaving', 'led', 'left', 'let', 'lies', 'life', 'light', 'local', 'locks', 'london', 'long', 'longer', 'looking', 'loses', 'love', 'made', 'make', 'makes', 'making', 'man', 'manage', 'manages', 'mansion', 'married', 'meanwhile', 'meet', 'meeting', 'meets', 'member', 'men', 'message', 'mi6', 'might', 'missing', 'mission', 'mistake', 'moments', 'money', 'more', 'morning', 'most', 'mother', 'move', 'movie', 'moving', 'mr', 'much', 'murder', 'must', 'named', 'need', 'neighbor', 'new', 'news', 'next', 'night', 'no', 'not', 'now', 'numerous', 'oath', 'off', 'old', 'once', 'only', 'opening', 'opens', 'operative', 'or', 'order', 'orders', 'other', 'others', 'out', 'over', 'overpowers', 'own', 'owned', 'owner', 'parents', 'part', 'party', 'people', 'phone', 'place', 'plan', 'plans', 'play', 'point', 'police', 'pregnant', 'pretend', 'previous', 'pursue', 'put', 'raise', 'rather', 'reads', 'ready', 'real', 'realizes', 'realizing', 'receives', 'reconcile', 'refuses', 'release', 'releases', 'relieved', 'reluctant', 'reluctantly', 'remove', 'repel', 'rescue', 'rescuing', 'responsible', 'restaurant', 'result', 'results', 'retrieve', 'return', 'returns', 'revealed', 'reveals', 'revenge', 'ride', 'right', 'river', 'role', 'roll', 'room', 'runs', 'sam', 'same', 'sarah', 'save', 'saying', 'screaming', 'search', 'season', 'second', 'secret', 'security', 'seduce', 'see', 'seeing', 'seen', 'sees', 
'self', 'sell', 'send', 'sends', 'set', 'sets', 'sex', 'share', 'sharp', 'she', 'shoot', 'shoots', 'shore', 'shortly', 'shot', 'should', 'six', 'so', 'some', 'son', 'soon', 'spell', 'spring', 'starts', 'stay', 'steal', 'steals', 'still', 'stolen', 'stop', 'stops', 'story', 'strange', 'stress', 'strike', 'stuck', 'successful', 'suggests', 'suite', 'suspicious', 'take', 'taken', 'takes', 'taking', 'target', 'team', 'television', 'tells', 'tend', 'than', 'their', 'then', 'there', 'thereafter', 'things', 'this', 'thomas', 'three', 'through', 'throughout', 'throw', 'time', 'times', 'together', 'toward', 'train', 'traps', 'travels', 'tree', 'tribal', 'tries', 'trip', 'truck', 'true', 'try', 'turns', 'unable', 'until', 'unusual', 'upon', 'uses', 'van', 'various', 'via', 'waiting', 'walking', 'wants', 'war', 'warns', 'was', 'watch', 'watching', 'water', 'way', 'weapons', 'wedding', 'what', 'where', 'white', 'whom', 'wife', 'window', 'without', 'women', 'word', 'words', 'work', 'working', 'world', 'worried', 'wounded', 'wrong', 'year', 'years', 'you', 'young']
'''


Conclusion

Great! You now have a way of transforming free bodies of text into structured arrays, with each relevant word being stored as a feature. This can be used to measure similarities between items and make recommendations, even for items that you have no structured attribute data for.

# Creating the TF-IDF DataFrame

Now that you have generated your TF-IDF features, you will need to get them into a format that you can use to make recommendations. You will once again leverage pandas for this and wrap the array in a DataFrame. As you will be using the movie titles to filter the data, you can assign the titles to the DataFrame's index.

The df_plots DataFrame has once again been loaded for you. It contains movies' names in the Title column and their plots in the Plot column.

Instructions

1. Create a TfidfVectorizer and fit and transform it as you did in the previous exercise.

2. Wrap the generated vectorized_data in a DataFrame. Use the names of the features generated during the fit and transform phase as its column names and assign your new DataFrame to tfidf_df.

3. Assign the original movie titles to the index of the newly created tfidf_df DataFrame.


In [None]:
In [2]:
df_plots.head()
Out[2]:

                            Title                                               Plot
0  Ace Ventura: When Nature Calls  In the Himalayas, after a failed rescue missio...
1     Dracula: Dead and Loving It  Solicitor Thomas Renfield travels all the way ...
2     Father of the Bride Part II  The film begins five years after the events of...
3                      Four Rooms  The film is set on New Year's Eve, and starts ...
4                Grumpier Old Men  The feud between Max (Walter Matthau) and John...

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the vectorizer object and transform the plot column
vectorizer = TfidfVectorizer(max_df=0.7, min_df=2)
vectorized_data = vectorizer.fit_transform(df_plots['Plot'])

# Create a DataFrame from the TF-IDF array
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
tfidf_df = pd.DataFrame(vectorized_data.toarray(), columns=vectorizer.get_feature_names_out())

# Assign the movie titles to the index and inspect
tfidf_df.index = df_plots['Title']
print(tfidf_df.head())

In [None]:
<script.py> output:
                                         000       100  abandoned     above  accidentally  ...     wrong      year     years       you     young
    Title                                                                                  ...                                                  
    Ace Ventura: When Nature Calls  0.000000  0.000000        0.0  0.000000      0.000000  ...  0.000000  0.000000  0.044595  0.000000  0.053863
    Dracula: Dead and Loving It     0.000000  0.000000        0.0  0.000000      0.000000  ...  0.000000  0.000000  0.000000  0.055645  0.000000
    Father of the Bride Part II     0.045850  0.045850        0.0  0.000000      0.000000  ...  0.045850  0.000000  0.030099  0.000000  0.072708
    Four Rooms                      0.039916  0.039916        0.0  0.079831      0.039916  ...  0.039916  0.079831  0.026203  0.000000  0.000000
    Grumpier Old Men                0.000000  0.000000        0.0  0.000000      0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.000000
    
    [5 rows x 527 columns]

Conclusion

Good work! You are now able to manipulate text data into DataFrames with each row representing an item and each column representing a word extracted from the texts. You will be able to use this in a similar way to the attribute DataFrames you generated previously to measure similarities between items and make recommendations.

# Comparing all your movies with TF-IDF

Now that you have put in the hard work of getting your TF-IDF data into a usable format, it's time to put it to work finding similarities and generating recommendations.

This time, because you are using TF-IDF scores (which are floats rather than Booleans), you will use the cosine similarity metric to find the similarities between items. In this exercise, you will generate a matrix of the cosine similarities between all movies and store it in a DataFrame for ease of lookup. This will allow you to compare movies and find recommendations quickly and easily.
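
To make the metric concrete, here is a small sketch that computes cosine similarity by hand and checks it against scikit-learn. The two vectors are made-up values, not rows from the course data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up TF-IDF rows (illustrative values only)
a = np.array([0.0, 0.3, 0.4])
b = np.array([0.5, 0.0, 0.4])

# cos(a, b) = (a . b) / (||a|| * ||b||)
manual = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn expects 2D inputs, so each vector is passed as a one-row matrix
from_sklearn = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual, from_sklearn)
```

Because TF-IDF scores are never negative, the similarity between two TF-IDF rows falls between 0 (no shared words) and 1 (same direction).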

The tfidf_df DataFrame you created in the last exercise containing a row for each movie has been loaded for you.

Instructions

1. Find the cosine similarity measures between all movies and assign the results to cosine_similarity_array.

2. Create a DataFrame from the cosine_similarity_array with tfidf_summary_df.index as its rows and columns.

3. Print the top five rows of the DataFrame and examine the similarity scores.


In [None]:
In [1]:
tfidf_df.head()
Out[1]:

   abducted  abin  able     about  above  ...   younger  zachary       zoe      zola  zündapp
0  0.022671   0.0   0.0  0.009336    0.0  ...  0.000000      0.0  0.000000  0.000000      0.0
1  0.000000   0.0   0.0  0.015369    0.0  ...  0.000000      0.0  0.213213  0.000000      0.0
2  0.000000   0.0   0.0  0.059691    0.0  ...  0.000000      0.0  0.000000  0.000000      0.0
3  0.000000   0.0   0.0  0.000000    0.0  ...  0.000000      0.0  0.000000  0.124908      0.0
4  0.000000   0.0   0.0  0.034712    0.0  ...  0.042149      0.0  0.000000  0.000000      0.0

[5 rows x 1000 columns]

In [None]:
# Import pandas and the cosine_similarity measure
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Create the array of cosine similarity values
cosine_similarity_array = cosine_similarity(tfidf_summary_df)

# Wrap the array in a pandas DataFrame
cosine_similarity_df = pd.DataFrame(cosine_similarity_array, index=tfidf_summary_df.index, columns=tfidf_summary_df.index)

# Print the top 5 rows of the DataFrame
print(cosine_similarity_df.head())

'''
<script.py> output:
                                                        The Adventures of Tintin: The Secret of the Unicorn  Alvin and the Chipmunks: Chipwrecked  Another Earth   Beastly  The Beaver  ...       Rio  \
    The Adventures of Tintin: The Secret of the Uni...                                           1.000000                                0.353153       0.264676  0.202191    0.283238  ...  0.313965   
    Alvin and the Chipmunks: Chipwrecked                                                         0.353153                                1.000000       0.297457  0.223095    0.289044  ...  0.345742   
    Another Earth                                                                                0.264676                                0.297457       1.000000  0.325769    0.382758  ...  0.270999   
    Beastly                                                                                      0.202191                                0.223095       0.325769  1.000000    0.271717  ...  0.204481   
    The Beaver                                                                                   0.283238                                0.289044       0.382758  0.271717    1.000000  ...  0.274623   
    
                                                            Thor  21 Jump Street  The Avengers    Oldboy  
    The Adventures of Tintin: The Secret of the Uni...  0.312927        0.282663      0.374425  0.248183  
    Alvin and the Chipmunks: Chipwrecked                0.323938        0.311788      0.400024  0.267687  
    Another Earth                                       0.304739        0.236896      0.229218  0.249804  
    Beastly                                             0.229194        0.187408      0.186539  0.207715  
    The Beaver                                          0.300383        0.238325      0.266592  0.253751  
    
    [5 rows x 18 columns]
'''
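
Two properties of the output above are worth checking on any similarity matrix: every movie has a similarity of 1 with itself, and the matrix is symmetric. The sketch below verifies both on a hypothetical three-movie table whose rows are chosen to have unit length so the values are easy to follow.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF rows (each has length 1, so similarities equal dot products)
toy_tfidf = pd.DataFrame(
    [[0.6, 0.8, 0.0],
     [0.0, 0.6, 0.8],
     [1.0, 0.0, 0.0]],
    index=["Movie A", "Movie B", "Movie C"],
    columns=["word1", "word2", "word3"],
)

# Passing a single matrix compares every row with every other row
sim = cosine_similarity(toy_tfidf)
sim_df = pd.DataFrame(sim, index=toy_tfidf.index, columns=toy_tfidf.index)

print(sim_df.round(2))
print(np.allclose(np.diag(sim), 1.0))  # each movie compared with itself
print(np.allclose(sim, sim.T))         # similarity(A, B) equals similarity(B, A)
```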

# Making recommendations with TF-IDF

In the last exercise you pre-calculated the similarity ratings between all movies in the dataset based on their plots transformed by TF-IDF. Now you will put these similarity ratings in a DataFrame for ease of use. Then you will use this new DataFrame to suggest a movie recommendation.

The cosine_similarity_array containing a matrix of the similarity values between all movies that you created in the last exercise has been loaded for you. The tfidf_summary_df DataFrame containing the movies and their TF-IDF features is also available.

Instructions

1. Generate a DataFrame from cosine_similarity_array.

2. Store the cosine similarity values between the movie Rio and all other movies as a Series.

3. Sort these from largest to smallest in ordered_similarities and print the ordered results.


In [None]:
In [1]:
cosine_similarity_array
Out[1]:

array([[1.        , 0.36636167, 0.274455  , 0.21179536, 0.24678459,
        0.23480675, 0.28537076, 0.32007898, 0.3202698 , 0.33226816,
        0.35832441, 0.32742156, 0.3255014 , 0.29467487, 0.38208505,
        0.25951764],
       [0.36636167, 1.        , 0.30908233, 0.2334096 , 0.29483809,
        0.26197948, 0.28458874, 0.34427044, 0.32566286, 0.33185446,
        0.37784412, 0.36117977, 0.33690584, 0.32586938, 0.40826124,
        0.28007357],
       [0.274455  , 0.30908233, 1.        , 0.33872454, 0.38356167,
        0.34676173, 0.22865073, 0.32255115, 0.26643419, 0.31810998,
        0.29380596, 0.2817803 , 0.31295709, 0.24644603, 0.2328075 ,
        0.26008199],
       [0.21179536, 0.2334096 , 0.33872454, 1.        , 0.25415718,
        0.21886163, 0.19642002, 0.23133099, 0.21382424, 0.23705468,
        0.23175731, 0.21350227, 0.23969702, 0.19449683, 0.19110445,
        0.21814833],
       [0.24678459, 0.29483809, 0.38356167, 0.25415718, 1.        ,
        0.27182559, 0.20680875, 0.27270782, 0.23875601, 0.2484392 ,
        0.27409456, 0.26275549, 0.25460214, 0.23384183, 0.21933973,
        0.22925186],
       [0.23480675, 0.26197948, 0.34676173, 0.21886163, 0.27182559,
        1.        , 0.19727284, 0.24548508, 0.21495101, 0.22993324,
        0.25240652, 0.23451379, 0.23274595, 0.2090821 , 0.20409183,
        0.20353307],
       [0.28537076, 0.28458874, 0.22865073, 0.19642002, 0.20680875,
        0.19727284, 1.        , 0.26384617, 0.26539562, 0.27202905,
        0.27971283, 0.2663581 , 0.28691374, 0.34541934, 0.32321859,
        0.21535199],
       [0.32007898, 0.34427044, 0.32255115, 0.23133099, 0.27270782,
        0.24548508, 0.26384617, 1.        , 0.29038668, 0.30791958,
        0.39843217, 0.31206741, 0.30782641, 0.31578155, 0.33640756,
        0.25951696],
       [0.3202698 , 0.32566286, 0.26643419, 0.21382424, 0.23875601,
        0.21495101, 0.26539562, 0.29038668, 1.        , 0.30241193,
        0.33201647, 0.30649858, 0.30833511, 0.27158278, 0.34767905,
        0.243493  ],
       [0.33226816, 0.33185446, 0.31810998, 0.23705468, 0.2484392 ,
        0.22993324, 0.27202905, 0.30791958, 0.30241193, 1.        ,
        0.3434828 , 0.31457016, 0.34990921, 0.28994308, 0.36589511,
        0.25749137],
       [0.35832441, 0.37784412, 0.29380596, 0.23175731, 0.27409456,
        0.25240652, 0.27971283, 0.39843217, 0.33201647, 0.3434828 ,
        1.        , 0.34440658, 0.33409947, 0.31204844, 0.41018388,
        0.27389096],
       [0.32742156, 0.36117977, 0.2817803 , 0.21350227, 0.26275549,
        0.23451379, 0.2663581 , 0.31206741, 0.30649858, 0.31457016,
        0.34440658, 1.        , 0.31821617, 0.29025159, 0.34486924,
        0.25232311],
       [0.3255014 , 0.33690584, 0.31295709, 0.23969702, 0.25460214,
        0.23274595, 0.28691374, 0.30782641, 0.30833511, 0.34990921,
        0.33409947, 0.31821617, 1.        , 0.2829944 , 0.33243958,
        0.26584196],
       [0.29467487, 0.32586938, 0.24644603, 0.19449683, 0.23384183,
        0.2090821 , 0.34541934, 0.31578155, 0.27158278, 0.28994308,
        0.31204844, 0.29025159, 0.2829944 , 1.        , 0.31787379,
        0.23477281],
       [0.38208505, 0.40826124, 0.2328075 , 0.19110445, 0.21933973,
        0.20409183, 0.32321859, 0.33640756, 0.34767905, 0.36589511,
        0.41018388, 0.34486924, 0.33243958, 0.31787379, 1.        ,
        0.20600951],
       [0.25951764, 0.28007357, 0.26008199, 0.21814833, 0.22925186,
        0.20353307, 0.21535199, 0.25951696, 0.243493  , 0.25749137,
        0.27389096, 0.25232311, 0.26584196, 0.23477281, 0.20600951,
        1.        ]])

In [None]:
In [2]:
tfidf_summary_df
Out[2]:

                                                        abin      able     about  academic   academy  ...     young   zachary       zoe      zola   zündapp
The Adventures of Tintin: The Secret of the Uni...  0.000000  0.000000  0.009708  0.000000  0.000000  ...  0.022231  0.000000  0.000000  0.000000  0.000000
Alvin and the Chipmunks: Chipwrecked                0.000000  0.000000  0.015985  0.000000  0.000000  ...  0.000000  0.000000  0.210159  0.000000  0.000000
Another Earth                                       0.000000  0.000000  0.061780  0.000000  0.000000  ...  0.035369  0.000000  0.000000  0.000000  0.000000
Beastly                                             0.000000  0.000000  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.122963  0.000000
The Twilight Saga: Breaking Dawn - Part 1           0.000000  0.000000  0.011388  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.000000
Bridesmaids                                         0.000000  0.000000  0.006414  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.000000
Captain America: The First Avenger                  0.000000  0.000000  0.009649  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.110484  0.000000
Carnage                                             0.000000  0.000000  0.044636  0.000000  0.000000  ...  0.000000  0.117371  0.000000  0.000000  0.000000
Cars 2                                              0.000000  0.000000  0.010126  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.106511
Green Lantern                                       0.109422  0.000000  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.000000
The Hangover: Part II                               0.000000  0.016289  0.017494  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.000000
Rio                                                 0.000000  0.016672  0.008953  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.000000
Thor                                                0.000000  0.000000  0.010729  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.000000
21 Jump Street                                      0.000000  0.016439  0.017655  0.046425  0.020215  ...  0.000000  0.000000  0.000000  0.000000  0.000000
The Avengers                                        0.000000  0.000000  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.000000
Oldboy                                              0.000000  0.015797  0.008483  0.000000  0.019426  ...  0.000000  0.000000  0.000000  0.000000  0.000000

[16 rows x 1000 columns]

In [None]:
# Wrap the preloaded array in a DataFrame
cosine_similarity_df = pd.DataFrame(cosine_similarity_array, index=tfidf_summary_df.index, columns=tfidf_summary_df.index)

# Find the values for the movie Rio
cosine_similarity_series = cosine_similarity_df.loc['Rio']

# Sort these values highest to lowest
ordered_similarities = cosine_similarity_series.sort_values(ascending=False)

# Print the results
print(ordered_similarities)

'''
<script.py> output:
    Rio                                                    1.000000
    Alvin and the Chipmunks: Chipwrecked                   0.361180
    The Avengers                                           0.344869
    The Hangover: Part II                                  0.344407
    The Adventures of Tintin: The Secret of the Unicorn    0.327422
    Thor                                                   0.318216
    Green Lantern                                          0.314570
    Carnage                                                0.312067
    Cars 2                                                 0.306499
    21 Jump Street                                         0.290252
    Another Earth                                          0.281780
    Captain America: The First Avenger                     0.266358
    The Twilight Saga: Breaking Dawn - Part 1              0.262755
    Oldboy                                                 0.252323
    Bridesmaids                                            0.234514
    Beastly                                                0.213502
    Name: Rio, dtype: float64
'''

Question

 4. Based on your analysis, which movie in the dataset is most similar to Rio?

Possible Answers

1. Cars 2
 - Incorrect: Cars 2 does not have the highest similarity score.

2. Beastly
 - Incorrect: Remember that a higher similarity score means a more similar movie.

3. Another Earth
 - Incorrect: Another Earth does not have the highest similarity score.

4. Alvin and the Chipmunks: Chipwrecked
 - Correct. Alvin and the Chipmunks: Chipwrecked has the highest similarity value to Rio! This means that viewers who liked Rio are also likely to enjoy Alvin and the Chipmunks: Chipwrecked. Since both are animated children's movies, this makes a lot of sense.
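
Note that the sorted output lists Rio itself first with a score of 1. A small helper can drop the queried title before ranking. This is a sketch only: `most_similar` is a hypothetical function name and the matrix values are made up.

```python
import pandas as pd

def most_similar(title, similarity_df, n=3):
    """Return the n titles most similar to `title`, excluding the title itself."""
    scores = similarity_df.loc[title].drop(title)
    return scores.sort_values(ascending=False).head(n)

# Made-up symmetric similarity matrix for four hypothetical movies
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_sim_df = pd.DataFrame(
    [[1.0, 0.4, 0.1, 0.3],
     [0.4, 1.0, 0.2, 0.5],
     [0.1, 0.2, 1.0, 0.6],
     [0.3, 0.5, 0.6, 1.0]],
    index=titles,
    columns=titles,
)

print(most_similar("Movie A", toy_sim_df, n=2))
```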

# User profile recommendations

1. User profile recommendations

In this chapter, you have learned how to use items' attributes to generate content-based recommendations by finding items that are similar to each other.
2. Item to item recommendations

This has many uses such as suggesting obscure books that are similar to your favorite, proposing the next movie to watch that is like the one you just finished, or even finding alternative options when items are out of stock.
3. User profiles

But people are not so one-dimensional that they only like one item. They may have read many different books and want to find one that aligns with their wide array of tastes. For example, in the tfidf_summary_df that we have used previously, we have a row per book and a column for each TF-IDF feature extracted from its description. For user-based recommendations, we need vectors to represent individual items as well as a vector to represent a user's likes. This will allow us to compare a user's likes to various items to see which items might suit them best.
4. Extract the user data

Let's take an example of a user who has read a set of books. The most straightforward way of creating a user profile is to first get the vectors corresponding to the books they have read by slicing tfidf_summary_df, which contains all the books, using the reindex method. Remember that tfidf_summary_df contains TF-IDF features for all books in our data set. This creates a DataFrame containing rows only for the books the user has read, along with their TF-IDF scores. This still has multiple rows, and we want a single vector for our user. To go from the full table to a summary of the user's tastes, we can simply find the average of each column, representing the average of the characteristics of the books the user liked.
5. Build the user profile

We find the average of each column by calling .mean() on the DataFrame. The average values in this Series represent the user profile, in other words a way of representing all of the user's preferences at once. For example, this user appears to enjoy books that have high values in the "ancient" TF-IDF feature. This implies that the word "ancient" is prominent in books they like. This profile can, with a bit of reshaping, be used as a vector to compare against other books.
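
The two steps described so far (select the rows for the books read, then average each column) can be sketched as follows. The table, book titles, and scores are hypothetical; only the "ancient" feature name echoes the example in the text.

```python
import pandas as pd

# Hypothetical TF-IDF table: one row per book, one column per word
toy_tfidf = pd.DataFrame(
    {"ancient": [0.8, 0.6, 0.0, 0.1],
     "space":   [0.0, 0.2, 0.9, 0.0],
     "love":    [0.1, 0.0, 0.1, 0.9]},
    index=["Book A", "Book B", "Book C", "Book D"],
)

books_read = ["Book A", "Book B"]

# Keep only the rows for books the user has read
books_read_df = toy_tfidf.reindex(books_read)

# Average each column to collapse the rows into a single profile vector
user_profile = books_read_df.mean()
print(user_profile)
```

The result is a Series with one value per feature. Here "ancient" has the highest average, so it dominates this user's profile.
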
6. Finding recommendations for a user

This user profile can then be used to find the most similar books that they have not yet read. We first must find the subset of books that have not been read by dropping those the user has already read (specifying the index axis by setting axis to 0). We then calculate the cosine similarity as we did in the previous lesson, but this time between the user profile vector you just created and the DataFrame of all the books the user has not read yet. Then we wrap the output in a DataFrame and sort the results once again so we can access and order the data easily.
7. Getting the top recommendations

After sorting the recommendation scores you will now be able to recommend items based on a user's full history, not just based on individual items. These top values are the items that are the most similar to the interests of the user based on their full background of interests, making them good suggestions for the user to read next.
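
A minimal end-to-end sketch of these steps, again on a hypothetical book table with made-up scores, might look like this:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF table: one row per book, one column per word
toy_tfidf = pd.DataFrame(
    {"ancient": [0.8, 0.6, 0.0, 0.1, 0.7],
     "space":   [0.0, 0.2, 0.9, 0.0, 0.1],
     "love":    [0.1, 0.0, 0.1, 0.9, 0.2]},
    index=["Book A", "Book B", "Book C", "Book D", "Book E"],
)
books_read = ["Book A", "Book B"]
user_profile = toy_tfidf.reindex(books_read).mean()

# Remove the books the user has already read
unread_df = toy_tfidf.drop(books_read, axis=0)

# cosine_similarity needs 2D input, so reshape the profile into a single row
scores = cosine_similarity(user_profile.values.reshape(1, -1), unread_df)

# scores has shape (1, n_unread); transpose so each unread book gets its own row
recommendations = pd.DataFrame(
    scores.T, index=unread_df.index, columns=["similarity_score"]
).sort_values(by="similarity_score", ascending=False)

print(recommendations.head())
```

Book E ranks first because its scores lean toward "ancient", the same feature that dominates the profile built from Books A and B.
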
8. Let's practice!

Great, let's work with the movie data set to build user profiles and create recommendations based on them.

# Build the user profiles

You are now able to generate suggestions for similar items based on their labeled features or based on their descriptions. But sometimes finding similar items might not be enough. In the next exercises, you will work through how one could create recommendations based on a user and all the items they liked as opposed to a singular item. You will first generate a profile for a user by aggregating all of the movies they have previously enjoyed.

The tfidf_summary_df you have been working on in the last few exercises has been loaded for you. This contains a row per movie with their titles as the index and a column for each feature containing their respective TF-IDF score.

Instructions

1. Create a subset of the tfidf_summary_df that contains only rows corresponding to the supplied list_of_movies_enjoyed list.


In [None]:
In [1]:
tfidf_summary_df.head()
Out[1]:

                                                    abducted  able     about  above  academy  ...    writes  years  yellow     young      zola
The Adventures of Tintin: The Secret of the Uni...  0.032346   0.0  0.014125    0.0      0.0  ...  0.000000    0.0     0.0  0.032346  0.000000
Alvin and the Chipmunks: Chipwrecked                0.000000   0.0  0.021498    0.0      0.0  ...  0.000000    0.0     0.0  0.000000  0.000000
Another Earth                                       0.000000   0.0  0.069575    0.0      0.0  ...  0.000000    0.0     0.0  0.039831  0.000000
Beastly                                             0.000000   0.0  0.000000    0.0      0.0  ...  0.048677    0.0     0.0  0.000000  0.194709
The Twilight Saga: Breaking Dawn - Part 1           0.000000   0.0  0.017968    0.0      0.0  ...  0.000000    0.0     0.0  0.000000  0.000000

[5 rows x 702 columns]

In [None]:
list_of_movies_enjoyed = ['Captain America: The First Avenger', 'Green Lantern', 'The Avengers']

# Create a subset of only the movies the user has enjoyed
movies_enjoyed_df = tfidf_summary_df.reindex(list_of_movies_enjoyed)

# Inspect the DataFrame
print(movies_enjoyed_df)

2. Generate the user profile by finding the average TF-IDF scores of each of the features of the movies contained in movies_enjoyed_df.

3. Inspect the results.


In [None]:
list_of_movies_enjoyed = ['Captain America: The First Avenger', 'Green Lantern', 'The Avengers']

# Create a subset of only the movies the user has enjoyed
movies_enjoyed_df = tfidf_summary_df.reindex(list_of_movies_enjoyed)

# Generate the user profile by finding the average scores of movies they enjoyed
user_prof = movies_enjoyed_df.mean()

# Inspect the results
print(user_prof)

'''
<script.py> output:
    abducted    0.000000
    able        0.000000
    about       0.005115
    above       0.011187
    academy     0.000000
                  ...   
    writes      0.000000
    years       0.015804
    yellow      0.040041
    young       0.000000
    zola        0.058561
    Length: 702, dtype: float64
'''

Conclusion

Good work! By aggregating the scores of the movies the user enjoyed, you have created a summary of the user's tastes that you can use to find new movies similar to what they usually enjoy.
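
One practical consequence of selecting rows with reindex is worth knowing: a title that is missing from the index produces a row of NaN rather than an error, and .mean() skips NaN by default. The sketch below illustrates this pandas behavior on hypothetical data; it is not a statement about why the course chose reindex.

```python
import pandas as pd

# Hypothetical two-movie TF-IDF table
toy_tfidf = pd.DataFrame(
    {"hero": [0.5, 0.3], "city": [0.1, 0.7]},
    index=["Movie A", "Movie B"],
)
enjoyed = ["Movie A", "Not In Dataset"]  # the second title is deliberately missing

# reindex keeps the missing title as a row of NaN instead of raising
subset = toy_tfidf.reindex(enjoyed)

# mean() skips NaN by default, so the profile uses the matched rows only
profile = subset.mean()

# .loc raises KeyError when any requested label is missing
try:
    toy_tfidf.loc[enjoyed]
    loc_raised = False
except KeyError:
    loc_raised = True

print(subset)
print(profile)
print(loc_raised)
```

If silently ignoring unknown titles is not what you want, checking `subset.isna().all(axis=1)` before averaging is one way to surface them.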

# User profile based recommendations

Now that you have built the user profile based on the aggregate of the individual movies they enjoyed, you can compare it to the larger tfidf_summary_df DataFrame that you have been working with to generate suggestions. As you would not want to suggest movies that the user has already watched, you will first find a subset of the tfidf_summary_df DataFrame that does not contain any of the previously watched movies.

The Series user_prof that you generated in the last exercise, which holds one averaged TF-IDF score per feature and represents the user, has been loaded for you. Similarly, list_of_movies_enjoyed has been loaded so you can exclude those movies from the predictions.

Instructions

1. Find the subset of tfidf_df that does not include movies in list_of_movies_enjoyed and assign it to tfidf_subset_df.


In [None]:
In [2]:
user_prof.head()
Out[2]:

abandoned    0.000000
abducted     0.000000
abilities    0.018469
ability      0.000000
able         0.000000
dtype: float64

In [3]:
list_of_movies_enjoyed
Out[3]:
['Captain America: The First Avenger', 'Green Lantern', 'The Avengers']

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Find subset of tfidf_df that does not include movies in list_of_movies_enjoyed
tfidf_subset_df = tfidf_df.drop(list_of_movies_enjoyed, axis=0)

 2. Calculate the cosine_similarity between the user profile contained in user_prof and all the movie profiles in tfidf_subset_df.

 3. Wrap the similarity_array in a DataFrame, assigning it the same index as tfidf_subset_df.


In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Find subset of tfidf_df that does not include movies in list_of_movies_enjoyed
tfidf_subset_df = tfidf_df.drop(list_of_movies_enjoyed, axis=0)

# Calculate the cosine_similarity and wrap it in a DataFrame
similarity_array = cosine_similarity(user_prof.values.reshape(1, -1), tfidf_subset_df)

similarity_df = pd.DataFrame(similarity_array.T, index=tfidf_subset_df.index, columns=["similarity_score"])

 4. Sort the results from high to low and take a look at the movies most similar to the user's likes.

In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Find subset of tfidf_df that does not include movies in list_of_movies_enjoyed
tfidf_subset_df = tfidf_df.drop(list_of_movies_enjoyed, axis=0)

# Calculate the cosine_similarity and wrap it in a DataFrame
similarity_array = cosine_similarity(user_prof.values.reshape(1, -1), tfidf_subset_df)
similarity_df = pd.DataFrame(similarity_array.T, index=tfidf_subset_df.index, columns=["similarity_score"])

# Sort the values from high to low by the values in the similarity_score
sorted_similarity_df = similarity_df.sort_values(by="similarity_score", ascending=False)

# Inspect the most similar to the user preferences
print(sorted_similarity_df.head())

'''
<script.py> output:
                                    similarity_score
    Title                                           
    21 Jump Street                          0.362488
    Thor                                    0.266075
    X-Men: First Class                      0.263540
    Transformers: Dark of the Moon          0.224254
    Beastly                                 0.179626
'''

Conclusion

Great job! As you can see, the top recommendations are all action-packed blockbusters, similar to those previously enjoyed by the user.