# Scikit-Learn Overview

(https://scikit-learn.org/stable/)

- Simple and efficient tools for predictive data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license

(https://scikit-learn.org/stable/user_guide.html)
1. Supervised Learning
        1.1.  Linear Models
        1.2.  Linear and Quadratic Discriminant Analysis
        1.3.  Kernel ridge regression
        1.4.  Support Vector Machines
        1.5.  Stochastic Gradient Descent
        1.6.  Nearest Neighbors
        1.7.  Gaussian Processes
        1.8.  Cross decomposition
        1.9.  Naive Bayes
        1.10. Decision Trees
        1.11. Ensemble methods
        1.12. Multiclass and multilabel algorithms
        1.13. Feature selection
        1.14. Semi-Supervised
        1.15. Isotonic regression
        1.16. Probability calibration
        1.17. Neural network models (supervised)
2. Unsupervised learning
        2.1. Gaussian mixture models
        2.2. Manifold learning
        2.3. Clustering
        2.4. Biclustering
        2.5. Decomposing signals in components (matrix factorization problems)
        2.6. Covariance estimation
        2.7. Novelty and Outlier Detection
        2.8. Density Estimation
        2.9. Neural network models (unsupervised)
3. Model selection and evaluation
        3.1. Cross-validation: evaluating estimator performance
        3.2. Tuning the hyper-parameters of an estimator
        3.3. Metrics and scoring: quantifying the quality of predictions
        3.4. Model persistence
        3.5. Validation curves: plotting scores to evaluate models
4. Dataset transformations
        4.1. Pipelines and composite estimators
        4.2. Feature extraction
        4.3. Preprocessing data
        4.4. Imputation of missing values
        4.5. Unsupervised dimensionality reduction
        4.6. Random Projection
        4.7. Kernel Approximation
        4.8. Pairwise metrics, Affinities and Kernels
        4.9. Transforming the prediction target (y)

## Typical Process for Implementing ML

1. Asking the Business Question (Most Important Step)
2. Define Success Criteria, Supervised vs Unsupervised Learning
3. Find/Collect Data Sources to help solve the problem
4. Import Data : Pandas
5. Exploratory Data Analysis <-> Cleaning Data : Pandas, matplotlib, seaborn
6. Feature Extraction (eg: process text/image data) : sklearn.feature_extraction
7. Feature Selection / Dimensionality Reduction (eg: NMF) : sklearn.feature_selection, sklearn.decomposition
8. Hold out data for model evaluation after training the model (10%-20% of the data) : sklearn.model_selection.train_test_split
9. Apply ML algorithms: Classification, Regression, Clustering
10. Evaluate model performance - Scoring, Hyperparameter tuning
11. Test model on hold-out data
12. Repeat steps 7-11 until the success criteria are met
13. Persist model
14. Deployment
15. Periodic Evaluation and updates
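
Steps 8 to 11 can be sketched in a few lines. This is a minimal illustration on synthetic data, not part of the movie example: the generated dataset, the choice of LogisticRegression, the 20% hold-out size and the 5-fold cross-validation are all assumptions made for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the output of steps 3-7 (assumption: 200 rows, 5 features)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 8: hold out 20% of the rows; they are not touched until step 11
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Steps 9-10: fit a model and score it with cross-validation on the training rows only
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Step 11: refit on all training rows and check accuracy on the hold-out rows
model.fit(X_train, y_train)
holdout_accuracy = model.score(X_test, y_test)

print("mean CV accuracy:", cv_scores.mean())
print("hold-out accuracy:", holdout_accuracy)
```

Keeping the hold-out rows out of the tuning loop is what makes the final score an honest estimate of how the model behaves on unseen data.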

## Use case: Movie Recommendation System (Unsupervised Learning)

We will use content-based recommendation for this example.

Further reading about recommendation systems: https://en.wikipedia.org/wiki/Recommender_system

Data Source: List of 5000 movies (https://www.kaggle.com/tmdb/tmdb-movie-metadata)


The data set consists of two files with the following columns.


The first dataset (tmdb_5000_credits.csv) contains the following features:

    movie_id - A unique identifier for each movie.
    cast - The name of lead and supporting actors.
    crew - The name of Director, Editor, Composer, Writer etc.

The second dataset (tmdb_5000_movies.csv) has the following features:

    budget - The budget in which the movie was made.
    genres - The genres of the movie, eg: Action, Comedy, Thriller.
    homepage - A link to the homepage of the movie.
    id - The same identifier as movie_id in the first dataset.
    keywords - The keywords or tags related to the movie.
    original_language - The language in which the movie was made.
    original_title - The title of the movie before translation or adaptation.
    overview - A brief description of the movie.
    popularity - A numeric quantity specifying the movie popularity.
    production_companies - The production house of the movie.
    production_countries - The country in which it was produced.
    release_date - The date on which it was released.
    revenue - The worldwide revenue generated by the movie.
    runtime - The running time of the movie in minutes.
    status - "Released" or "Rumored".
    tagline - Movie's tagline.
    title - Title of the movie.
    vote_average - The average rating the movie received.
    vote_count - The number of votes received.
    
We will try to recommend a movie based on information available in these datasets.

Full Workbook: https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system/notebook


### Importing Data

In [25]:
# Importing datasets
import pandas as pd 
import numpy as np

df_credits=pd.read_csv('tmdb_5000_credits.csv')
df_movies=pd.read_csv('tmdb_5000_movies.csv')
print(df_credits.shape)
df_credits.head()

(4803, 4)


Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [26]:
print(df_movies.shape)
df_movies.head()

(4803, 20)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [27]:
# Merging both datasets
df_movies= df_movies.merge(df_credits.rename(columns={"movie_id":"id"}), on="id")
df_movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title_x,vote_average,vote_count,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [28]:
#Inspecting the data frame
df_movies.info()
df_movies.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title_x                    0
vote_average               0
vote_count                 0
title_y                    0
cast                       0
crew                       0
dtype: int64

### Recommending movies based on cast, crew, keyword and genres

We are going to build a recommender based on the following metadata: the 3 top actors, the director, related genres and the movie plot keywords.

From the cast, crew and keywords features, we need to extract the three most important actors, the director and the keywords associated with each movie. Right now, the data is stored as "stringified" lists, so we first need to convert it into a safe and usable structure.
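
To make the conversion concrete, here is a minimal sketch on a single hand-written string shaped like the genres column. The sample string is an assumption for illustration, not a row copied from the dataset.

```python
from ast import literal_eval

# A "stringified" list: one Python string that merely looks like a list of dicts
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval parses Python literals only (lists, dicts, strings, numbers, ...)
# and refuses arbitrary expressions, which makes it safer than eval()
parsed = literal_eval(raw)
names = [item["name"] for item in parsed]

print(type(raw).__name__, "->", type(parsed).__name__)
print(names)
```

After parsing, each cell is a real list of dictionaries, so the name fields can be pulled out with an ordinary list comprehension.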

In [29]:
# Look at feature columns
features = ['cast', 'crew', 'keywords', 'genres']
df_movies[features]

Unnamed: 0,cast,crew,keywords,genres
0,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam..."
1,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""..."
2,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam..."
3,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam..."
4,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam..."
...,...,...,...,...
4798,"[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de...","[{""id"": 5616, ""name"": ""united states\u2013mexi...","[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam..."
4799,"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de...",[],"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""..."
4800,"[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de...","[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...","[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam..."
4801,"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de...",[],[]


In [30]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval
for feature in features:
    df_movies[feature] = df_movies[feature].apply(literal_eval)
df_movies[features]

Unnamed: 0,cast,crew,keywords,genres
0,"[{'cast_id': 242, 'character': 'Jake Sully', '...","[{'credit_id': '52fe48009251416c750aca23', 'de...","[{'id': 1463, 'name': 'culture clash'}, {'id':...","[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
1,"[{'cast_id': 4, 'character': 'Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de...","[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,"[{'cast_id': 1, 'character': 'James Bond', 'cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de...","[{'id': 470, 'name': 'spy'}, {'id': 818, 'name...","[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
3,"[{'cast_id': 2, 'character': 'Bruce Wayne / Ba...","[{'credit_id': '52fe4781c3a36847f81398c3', 'de...","[{'id': 849, 'name': 'dc comics'}, {'id': 853,...","[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam..."
4,"[{'cast_id': 5, 'character': 'John Carter', 'c...","[{'credit_id': '52fe479ac3a36847f813eaa3', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':...","[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
...,...,...,...,...
4798,"[{'cast_id': 1, 'character': 'El Mariachi', 'c...","[{'credit_id': '52fe44eec3a36847f80b280b', 'de...","[{'id': 5616, 'name': 'united states–mexico ba...","[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam..."
4799,"[{'cast_id': 1, 'character': 'Buzzy', 'credit_...","[{'credit_id': '52fe487dc3a368484e0fb013', 'de...",[],"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '..."
4800,"[{'cast_id': 8, 'character': 'Oliver O’Toole',...","[{'credit_id': '52fe4df3c3a36847f8275ecf', 'de...","[{'id': 248, 'name': 'date'}, {'id': 699, 'nam...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4801,"[{'cast_id': 3, 'character': 'Sam', 'credit_id...","[{'credit_id': '52fe4ad9c3a368484e16a36b', 'de...",[],[]


In [31]:
# Extract director's name
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

# Get top 3 cast, keywords and genres
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

df_movies['director'] = df_movies['crew'].apply(get_director)
features_top3 = ['cast', 'keywords', 'genres']
for feature in features_top3:
    df_movies[feature] = df_movies[feature].apply(get_list)

In [32]:
# Look at sample rows
df_movies[['original_title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,original_title,cast,director,keywords,genres
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]"
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes,"[spy, based on novel, secret agent]","[Action, Adventure, Crime]"


Clean up data and convert to lower case

In [33]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df_movies[feature] = df_movies[feature].apply(clean_data)
df_movies[['original_title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,original_title,cast,director,keywords,genres
0,Avatar,"[samworthington, zoesaldana, sigourneyweaver]",jamescameron,"[cultureclash, future, spacewar]","[action, adventure, fantasy]"
1,Pirates of the Caribbean: At World's End,"[johnnydepp, orlandobloom, keiraknightley]",goreverbinski,"[ocean, drugabuse, exoticisland]","[adventure, fantasy, action]"
2,Spectre,"[danielcraig, christophwaltz, léaseydoux]",sammendes,"[spy, basedonnovel, secretagent]","[action, adventure, crime]"


Create a metadata column ("soup") that joins all the text features ('cast', 'keywords', 'director', 'genres') into one string

In [34]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
df_movies['soup'] = df_movies.apply(create_soup, axis=1)
df_movies['soup']

0       cultureclash future spacewar samworthington zo...
1       ocean drugabuse exoticisland johnnydepp orland...
2       spy basedonnovel secretagent danielcraig chris...
3       dccomics crimefighter terrorist christianbale ...
4       basedonnovel mars medallion taylorkitsch lynnc...
                              ...                        
4798    unitedstates–mexicobarrier legs arms carlosgal...
4799     edwardburns kerrybishé marshadietlein edwardb...
4800    date loveatfirstsight narration ericmabius kri...
4801       danielhenney elizacoupe billpaxton danielhsia 
4802    obsession camcorder crush drewbarrymore brianh...
Name: soup, Length: 4803, dtype: object

Using CountVectorizer to convert the text to a matrix of token counts
(https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [35]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df_movies['soup'])
count_matrix

<4803x11520 sparse matrix of type '<class 'numpy.int64'>'
	with 42935 stored elements in Compressed Sparse Row format>

In [36]:
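# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
list(count.get_feature_names_out())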

['17thcentury',
 '18thcentury',
 '1910s',
 '1960s',
 '1970s',
 '1990s',
 '19thcentury',
 '3d',
 '50cent',
 'aaliyah',
 'aamirkhan',
 'aaranthomas',
 'aaronabrams',
 'aaroneckhart',
 'aaronhann',
 'aaronkwok',
 'aaronpaul',
 'aaronruell',
 'aaronschneider',
 'aaronstanford',
 'aarontaylor',
 'aaronyoo',
 'aasheekaabathija',
 'aasifmandvi',
 'abandonedhouse',
 'abbeylincoln',
 'abbiecornish',
 'abelferrara',
 'abhinayvaddi',
 'abhishekbachchan',
 'abigailbreslin',
 'abigailhargrove',
 'abigailspencer',
 'abo',
 'abrams',
 'abuse',
 'acceptingdeath',
 'accident',
 'accidentallove',
 'acemarrero',
 'action',
 'actors',
 'actress',
 'adambaldwin',
 'adambeach',
 'adambousdoukos',
 'adambrody',
 'adambrooks',
 'adambutcher',
 'adamcampbell',
 'adamcarolla',
 'adamclayton',
 'adamdriver',
 'adamg',
 'adamgarcia',
 'adamgoldberg',
 'adamgreaves',
 'adamgreen',
 'adamjayepstein',
 'adamlamberg',
 'adamlevine',
 'adammarcus',
 'adammckay',
 'adampascal',
 'adamrapp',
 'adamrayner',
 'adamrifkin'

In [37]:
print(count_matrix.shape)
count_matrix.toarray()

(4803, 11520)


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
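
The full matrix is too sparse to read, so here is a minimal sketch of the same step on two made-up "soup" strings (an assumption for illustration). It also shows why spaces were stripped from names earlier: the default tokenizer splits on whitespace, so "Daniel Craig" would otherwise become two unrelated tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents in the same shape as the 'soup' column
docs = [
    "spy secretagent danielcraig action",
    "spy danielcraig sammendes action action",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)   # sparse matrix, one row per document

vocab = vectorizer.vocabulary_            # dict: token -> column index
dense = matrix.toarray()                  # small enough to densify here

print(sorted(vocab, key=vocab.get))       # tokens in column order
print(dense)

# Without stripping the space, a name is split into separate tokens
split_vocab = sorted(CountVectorizer().fit(["Daniel Craig"]).vocabulary_)
print(split_vocab)
```

Each column is one token and each cell counts how often that token occurs in the document, so the second row holds a 2 in the "action" column.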

Using cosine similarity to get similar movies
(https://en.wikipedia.org/wiki/Cosine_similarity)
(https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)

In [38]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix, count_matrix)
print(cosine_sim.shape)
cosine_sim

(4803, 4803)


array([[1. , 0.3, 0.2, ..., 0. , 0. , 0. ],
       [0.3, 1. , 0.2, ..., 0. , 0. , 0. ],
       [0.2, 0.2, 1. , ..., 0. , 0. , 0. ],
       ...,
       [0. , 0. , 0. , ..., 1. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 1. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 1. ]])
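
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the direction of the count vectors rather than their size. A minimal sketch on two short, made-up count vectors (an assumption for illustration) checks a manual calculation against the scikit-learn function:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two toy token-count vectors over the same five-token vocabulary
a = np.array([1, 1, 0, 1, 1])
b = np.array([2, 1, 1, 0, 1])

# Manual formula: (a . b) / (|a| * |b|)
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine_similarity expects 2-D inputs of shape (n_samples, n_features)
library = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual, library)
```

Because count vectors are never negative, the score falls between 0 (no shared tokens) and 1 (same direction), which matches the 1s on the diagonal of the matrix above.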

In [39]:
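# Construct a reverse map from movie titles to row positions
# Note: Series.drop_duplicates() compares the values (the row positions), not the
# title index, so repeated titles are kept and indices[title] returns a Series for them
indices = pd.Series(df_movies.index, index=df_movies['original_title']).drop_duplicates()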
indices

original_title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64

In [40]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df_movies['original_title'].iloc[movie_indices]

In [41]:
get_recommendations('The Dark Knight Rises', cosine_sim)

65               The Dark Knight
119                Batman Begins
4638    Amidst the Devil's Wings
1196                The Prestige
3073           Romeo Is Bleeding
3326              Black November
1503                      Takers
1986                      Faster
303                     Catwoman
747               Gangster Squad
Name: original_title, dtype: object

In [42]:
get_recommendations('Tangled', cosine_sim)

1108                                  Pinocchio
1481                         The House of Magic
3670                            Running Forever
42                                  Toy Story 3
254                                  The Smurfs
390                          Hotel Transylvania
578     Alvin and the Chipmunks: The Squeakquel
1695                                    Aladdin
1426                                    Valiant
358                   Atlantis: The Lost Empire
Name: original_title, dtype: object

In [43]:
get_recommendations('Harry Potter and the Half-Blood Prince', cosine_sim)

113             Harry Potter and the Order of the Phoenix
114                   Harry Potter and the Goblet of Fire
197              Harry Potter and the Philosopher's Stone
276               Harry Potter and the Chamber of Secrets
191              Harry Potter and the Prisoner of Azkaban
37                             Oz: The Great and Powerful
63      The Chronicles of Narnia: The Lion, the Witch ...
109     The Chronicles of Narnia: The Voyage of the Da...
143                                                   Pan
2294                                             千と千尋の神隠し
Name: original_title, dtype: object
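
The lookup-and-rank logic can be tried in isolation on a toy similarity matrix. The sketch below is a variation, not the notebook's function: the titles, the matrix values and the helper name top_n_similar are all hypothetical, and it uses np.argsort instead of sorting a list of tuples. It also excludes the query row explicitly rather than assuming it is sorted first, and returns an empty list for unknown titles.

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a hand-written symmetric similarity matrix
titles = pd.Series(["A", "B", "C", "D"])
sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.5],
    [0.3, 0.0, 0.5, 1.0],
])

# Reverse map: title -> row position
title_to_row = pd.Series(titles.index, index=titles)

def top_n_similar(title, n=2):
    """Return the n titles most similar to `title`, best match first."""
    if title not in title_to_row.index:
        return []                              # unknown title: nothing to recommend
    row = title_to_row[title]
    order = np.argsort(-sim[row])              # row positions by descending similarity
    order = [i for i in order if i != row][:n] # drop the query movie itself
    return titles.iloc[order].tolist()

print(top_n_similar("A"))
print(top_n_similar("C"))
print(top_n_similar("Unknown"))
```

Dropping the query row by position avoids returning the movie itself when another movie has an identical feature string and therefore also scores 1.0.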

### Using Pipelines to streamline data processing

After basic sklearn modeling, we often need to repeat the same few steps over and over in the same analysis. Pipelines take this to the next level.

Pipelines streamline many of these routine steps by encapsulating small pieces of logic into a single function call, which makes it easier to focus on modeling instead of writing a lot of repetitive code.

(https://www.kaggle.com/dansbecker/pipelines)

(https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines)
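
As a small illustration of the idea, the sketch below chains two text-processing steps into one Pipeline object. The three toy strings are made up, and the TfidfTransformer step is an assumption added to give the pipeline a second stage; the recommender above uses raw counts only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import Pipeline

# Toy documents in the same shape as the 'soup' column
soups = [
    "spy secretagent danielcraig action",
    "spy danielcraig sammendes action",
    "ocean exoticisland johnnydepp adventure",
]

# Each step is a (name, transformer) pair; the output of one feeds the next
text_pipeline = Pipeline([
    ("counts", CountVectorizer()),   # text -> token counts
    ("tfidf", TfidfTransformer()),   # counts -> TF-IDF weights (assumed extra step)
])

# One call runs every step in order
features = text_pipeline.fit_transform(soups)
similarity = cosine_similarity(features)

print(features.shape)
print(similarity.round(2))
```

Once fitted, the same text_pipeline.transform call can be applied to new strings, which keeps the preprocessing identical between fitting and later use.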