Download https://files.grouplens.org/datasets/movielens/ml-25m.zip from MovieLens.

In [1]:
import os
import requests


GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens's research projects have explored a variety of fields including:

* recommender systems
* online communities
* mobile and ubiquitious technologies
* digital libraries
* local geographic information systems

GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens, which is the source of these data. We encourage you to visit <http://movielens.org> to try it out! If you have exciting ideas for experimental work to conduct on MovieLens, send us an email at <grouplens-info@cs.umn.edu> - we are always interested in working with external collaborators.


Formatting and Encoding
-----------------------

The dataset files are written as [comma-separated values](http://en.wikipedia.org/wiki/Comma-separated_values) files with a single header row. Columns that contain commas (`,`) are escaped using double-quotes (`"`). These files are encoded as UTF-8. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.


User Ids
--------

MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between `ratings.csv` and `tags.csv` (i.e., the same id refers to the same user across the two files).


Movie Ids
---------

Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id `1` corresponds to the URL <https://movielens.org/movies/1>). Movie ids are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same id refers to the same movie across these four data files).



In [2]:
# url = 'https://files.grouplens.org/datasets/movielens/ml-25m.zip'

# !pip install -q wget
# import wget
# filename = wget.download(url)

# import shutil
# destination_path = './'

# shutil.unpack_archive(filename)

Movies Data File Structure (movies.csv)
---------------------------------------

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)

In [3]:
import pandas as pd

movies = pd.read_csv('ml-25m/movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Extract year into another column.

In [4]:
#re.findall(r"\w+(?:\d)", 'Father of the Bride Part II  qdqf')


In [5]:
import re
import numpy as np
movies['year'] = movies['title'].apply(lambda x: re.findall(r"\w+(?:\d)\b", x) if x != np.nan else np.nan)

In [6]:
movies['year'] = movies['year'].apply(lambda x: x[0] if len(x) else np.nan)
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


In [7]:
movies.shape

(62423, 4)

Ratings Data File Structure (ratings.csv)
-----------------------------------------

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


In [8]:
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [9]:
#ratings_small = ratings.iloc[0:10000000]

In [10]:
#ratings_small.iloc[0]

In [11]:
ratings.shape

(25000095, 4)

In [12]:
md = movies.merge(ratings[["movieId","rating"]], on = 'movieId', how = 'left')

In [13]:
md.head()

Unnamed: 0,movieId,title,genres,year,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.5
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,4.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,4.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,4.0


In [14]:
md=md.groupby(['movieId', 'title', 'genres', 'year'])['rating'].mean().reset_index(name='rating')

In [15]:
md.head()

Unnamed: 0,movieId,title,genres,year,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.893708
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,3.251527
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,3.142028
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,2.853547
4,5,Father of the Bride Part II (1995),Comedy,1995,3.058434


In [16]:
md['rating']

0        3.893708
1        3.251527
2        3.142028
3        2.853547
4        3.058434
           ...   
62017    1.500000
62018    3.000000
62019    4.500000
62020    3.000000
62021    3.000000
Name: rating, Length: 62022, dtype: float64

Tags Data File Structure (tags.csv)
-----------------------------------

All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

    userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


In [17]:
tags = pd.read_csv('ml-25m/tags.csv')
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [18]:
md = md.merge(tags[['movieId','tag']], on = 'movieId', how = 'left')

In [19]:
md.head()

Unnamed: 0,movieId,title,genres,year,rating,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.893708,Owned
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.893708,imdb top 250
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.893708,Pixar
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.893708,Pixar
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.893708,time travel


In [20]:
 md=md.groupby(['movieId', 'title', 'genres', 'year', 'rating'])['tag'].agg(list).reset_index(name='tag')

In [21]:
md['tag']

0        [Owned, imdb top 250, Pixar, Pixar, time trave...
1        [Robin Williams, time travel, fantasy, based o...
2        [funny, best friend, duringcreditsstinger, fis...
3        [based on novel or book, chick flick, divorce,...
4        [aging, baby, confidence, contraception, daugh...
                               ...                        
58678                                                [nan]
58679                                                [nan]
58680                                                [nan]
58681                                                [nan]
58682                                                [nan]
Name: tag, Length: 58683, dtype: object

In [22]:
md.head()

Unnamed: 0,movieId,title,genres,year,rating,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.893708,"[Owned, imdb top 250, Pixar, Pixar, time trave..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,3.251527,"[Robin Williams, time travel, fantasy, based o..."
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,3.142028,"[funny, best friend, duringcreditsstinger, fis..."
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,2.853547,"[based on novel or book, chick flick, divorce,..."
4,5,Father of the Bride Part II (1995),Comedy,1995,3.058434,"[aging, baby, confidence, contraception, daugh..."


Links Data File Structure (links.csv)
---------------------------------------

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,imdbId,tmdbId

movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.

imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.

tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.


In [23]:
links = pd.read_csv('ml-25m/links.csv')
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


Tag Genome (genome-scores.csv and genome-tags.csv)
-------------------------------------------------

This data set includes a current copy of the Tag Genome.

[genome-paper]: http://files.grouplens.org/papers/tag_genome.pdf

The tag genome is a data structure that contains tag relevance scores for movies.  The structure is a dense matrix: each movie in the genome has a value for *every* tag in the genome.

As described in [this article][genome-paper], the tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews.

The genome is split into two files.  The file `genome-scores.csv` contains movie-tag relevance data in the following format:

    movieId,tagId,relevance

The second file, `genome-tags.csv`, provides the tag descriptions for the tag IDs in the genome file, in the following format:

    tagId,tag

The `tagId` values are generated when the data set is exported, so they may vary from version to version of the MovieLens data sets.



In [24]:
genome_scores = pd.read_csv('ml-25m/genome-scores.csv')
genome_scores.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


In [25]:
genome_scores=genome_scores[genome_scores['relevance']>0.75]
genome_scores.head()

Unnamed: 0,movieId,tagId,relevance
28,1,29,0.89375
62,1,63,0.94725
63,1,64,0.98425
185,1,186,0.95475
192,1,193,0.8145


In [26]:
genome_tags = pd.read_csv('ml-25m/genome-tags.csv')
genome_tags.head()

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [27]:
genome_scores = genome_scores.merge(genome_tags[['tagId','tag']], on = 'tagId', how = 'left')

In [28]:
genome_scores.head()

Unnamed: 0,movieId,tagId,relevance,tag
0,1,29,0.89375,adventure
1,1,63,0.94725,animated
2,1,64,0.98425,animation
3,1,186,0.95475,cartoon
4,1,193,0.8145,cgi


In [29]:
genome_scores2=genome_scores.groupby(['movieId'])['tag'].agg(list).reset_index(name='tag')

In [30]:
genome_scores2.head()

Unnamed: 0,movieId,tag
0,1,"[adventure, animated, animation, cartoon, cgi,..."
1,2,"[adventure, animals, based on a book, big budg..."
2,3,"[comedy, good sequel, original, sequel, sequels]"
3,4,"[chick flick, divorce, girlie movie, romantic,..."
4,5,"[comedy, family, father daughter relationship,..."


In [31]:
md2 = md.merge(genome_scores2[['movieId','tag']], on = 'movieId', how = 'left')

In [32]:
md2.head()

Unnamed: 0,movieId,title,genres,year,rating,tag_x,tag_y
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.893708,"[Owned, imdb top 250, Pixar, Pixar, time trave...","[adventure, animated, animation, cartoon, cgi,..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,3.251527,"[Robin Williams, time travel, fantasy, based o...","[adventure, animals, based on a book, big budg..."
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,3.142028,"[funny, best friend, duringcreditsstinger, fis...","[comedy, good sequel, original, sequel, sequels]"
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,2.853547,"[based on novel or book, chick flick, divorce,...","[chick flick, divorce, girlie movie, romantic,..."
4,5,Father of the Bride Part II (1995),Comedy,1995,3.058434,"[aging, baby, confidence, contraception, daugh...","[comedy, family, father daughter relationship,..."


In [33]:
md2 = md2.merge(links[['movieId','imdbId', 'tmdbId']], on = 'movieId', how = 'left')

In [34]:
md2.head()

Unnamed: 0,movieId,title,genres,year,rating,tag_x,tag_y,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.893708,"[Owned, imdb top 250, Pixar, Pixar, time trave...","[adventure, animated, animation, cartoon, cgi,...",114709,862.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,3.251527,"[Robin Williams, time travel, fantasy, based o...","[adventure, animals, based on a book, big budg...",113497,8844.0
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,3.142028,"[funny, best friend, duringcreditsstinger, fis...","[comedy, good sequel, original, sequel, sequels]",113228,15602.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,2.853547,"[based on novel or book, chick flick, divorce,...","[chick flick, divorce, girlie movie, romantic,...",114885,31357.0
4,5,Father of the Bride Part II (1995),Comedy,1995,3.058434,"[aging, baby, confidence, contraception, daugh...","[comedy, family, father daughter relationship,...",113041,11862.0


In [35]:
md2["tag"] = md2["tag_x"] + md2["tag_y"]

In [36]:
md2.drop('tag_x', axis=1, inplace=True)
md2.drop('tag_y', axis=1, inplace=True)
md2.head()

Unnamed: 0,movieId,title,genres,year,rating,imdbId,tmdbId,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.893708,114709,862.0,"[Owned, imdb top 250, Pixar, Pixar, time trave..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,3.251527,113497,8844.0,"[Robin Williams, time travel, fantasy, based o..."
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,3.142028,113228,15602.0,"[funny, best friend, duringcreditsstinger, fis..."
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,2.853547,114885,31357.0,"[based on novel or book, chick flick, divorce,..."
4,5,Father of the Bride Part II (1995),Comedy,1995,3.058434,113041,11862.0,"[aging, baby, confidence, contraception, daugh..."


In [37]:
md2['tag'] = md2['tag'].fillna('')

In [38]:
md2.tail()

Unnamed: 0,movieId,title,genres,year,rating,imdbId,tmdbId,tag
58678,209157,We (2018),Drama,2018,1.5,6671244,499546.0,
58679,209159,Window of the Soul (2001),Documentary,2001,3.0,297986,63407.0,
58680,209163,Bad Poems (2018),Comedy|Drama,2018,4.5,6755366,553036.0,
58681,209169,A Girl Thing (2001),(no genres listed),2001,3.0,249603,162892.0,
58682,209171,Women of Devil's Island (1962),Action|Adventure|Drama,1962,3.0,55323,79513.0,


## Cosine Similarity

In [39]:
md2['genres'] = md2['genres'].apply(lambda x: x.split('|'))

In [40]:
md2.head()

Unnamed: 0,movieId,title,genres,year,rating,imdbId,tmdbId,tag
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995,3.893708,114709,862.0,"[Owned, imdb top 250, Pixar, Pixar, time trave..."
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",1995,3.251527,113497,8844.0,"[Robin Williams, time travel, fantasy, based o..."
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",1995,3.142028,113228,15602.0,"[funny, best friend, duringcreditsstinger, fis..."
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",1995,2.853547,114885,31357.0,"[based on novel or book, chick flick, divorce,..."
4,5,Father of the Bride Part II (1995),[Comedy],1995,3.058434,113041,11862.0,"[aging, baby, confidence, contraception, daugh..."


In [41]:
md2['combination'] =  md2['genres']

In [42]:
md2.tail()

Unnamed: 0,movieId,title,genres,year,rating,imdbId,tmdbId,tag,combination
58678,209157,We (2018),[Drama],2018,1.5,6671244,499546.0,,[Drama]
58679,209159,Window of the Soul (2001),[Documentary],2001,3.0,297986,63407.0,,[Documentary]
58680,209163,Bad Poems (2018),"[Comedy, Drama]",2018,4.5,6755366,553036.0,,"[Comedy, Drama]"
58681,209169,A Girl Thing (2001),[(no genres listed)],2001,3.0,249603,162892.0,,[(no genres listed)]
58682,209171,Women of Devil's Island (1962),"[Action, Adventure, Drama]",1962,3.0,55323,79513.0,,"[Action, Adventure, Drama]"


In [43]:
md2['combination'] = md2['combination'] + md2['tag'].apply(lambda x : list(x))

In [44]:
md2.tail()

Unnamed: 0,movieId,title,genres,year,rating,imdbId,tmdbId,tag,combination
58678,209157,We (2018),[Drama],2018,1.5,6671244,499546.0,,[Drama]
58679,209159,Window of the Soul (2001),[Documentary],2001,3.0,297986,63407.0,,[Documentary]
58680,209163,Bad Poems (2018),"[Comedy, Drama]",2018,4.5,6755366,553036.0,,"[Comedy, Drama]"
58681,209169,A Girl Thing (2001),[(no genres listed)],2001,3.0,249603,162892.0,,[(no genres listed)]
58682,209171,Women of Devil's Island (1962),"[Action, Adventure, Drama]",1962,3.0,55323,79513.0,,"[Action, Adventure, Drama]"


In [45]:
md2['title'] = md2['title'].apply(lambda x: x[:x.find('(')].rstrip())

In [46]:
md2.head()

Unnamed: 0,movieId,title,genres,year,rating,imdbId,tmdbId,tag,combination
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,3.893708,114709,862.0,"[Owned, imdb top 250, Pixar, Pixar, time trave...","[Adventure, Animation, Children, Comedy, Fanta..."
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,3.251527,113497,8844.0,"[Robin Williams, time travel, fantasy, based o...","[Adventure, Children, Fantasy, Robin Williams,..."
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,3.142028,113228,15602.0,"[funny, best friend, duringcreditsstinger, fis...","[Comedy, Romance, funny, best friend, duringcr..."
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,2.853547,114885,31357.0,"[based on novel or book, chick flick, divorce,...","[Comedy, Drama, Romance, based on novel or boo..."
4,5,Father of the Bride Part II,[Comedy],1995,3.058434,113041,11862.0,"[aging, baby, confidence, contraception, daugh...","[Comedy, aging, baby, confidence, contraceptio..."


In [47]:
md2=md2.drop(columns=['tag'])

In [48]:
md2.head()

Unnamed: 0,movieId,title,genres,year,rating,imdbId,tmdbId,combination
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,3.893708,114709,862.0,"[Adventure, Animation, Children, Comedy, Fanta..."
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,3.251527,113497,8844.0,"[Adventure, Children, Fantasy, Robin Williams,..."
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,3.142028,113228,15602.0,"[Comedy, Romance, funny, best friend, duringcr..."
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,2.853547,114885,31357.0,"[Comedy, Drama, Romance, based on novel or boo..."
4,5,Father of the Bride Part II,[Comedy],1995,3.058434,113041,11862.0,"[Comedy, aging, baby, confidence, contraceptio..."


In [49]:
md2.tail()

Unnamed: 0,movieId,title,genres,year,rating,imdbId,tmdbId,combination
58678,209157,We,[Drama],2018,1.5,6671244,499546.0,[Drama]
58679,209159,Window of the Soul,[Documentary],2001,3.0,297986,63407.0,[Documentary]
58680,209163,Bad Poems,"[Comedy, Drama]",2018,4.5,6755366,553036.0,"[Comedy, Drama]"
58681,209169,A Girl Thing,[(no genres listed)],2001,3.0,249603,162892.0,[(no genres listed)]
58682,209171,Women of Devil's Island,"[Action, Adventure, Drama]",1962,3.0,55323,79513.0,"[Action, Adventure, Drama]"


In [50]:
# from ast import literal_eval

In [51]:
# def try_the_eval(row):
#     try:
#         literal_eval(row.cast)
#     except:
#         print('Found bad data at row: {}'.format(row))

In [52]:
#_ = md2['combination'].apply(try_the_eval)

In [53]:
# md2['combination'] = md2['combination'].str.replace('null', 'None')

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')

In [55]:
def convertToString(term):
    if type(term) is str:
        return term
    else:
        return str(term)

In [56]:
md2['combination'] = md2['combination'].apply(convertToString)

In [57]:
from sklearn.metrics.pairwise import cosine_similarity

In [58]:
count_matrix = count.fit_transform(md2['combination'])

In [59]:
print(type(count_matrix))

<class 'scipy.sparse.csr.csr_matrix'>


In [60]:
recommender = cosine_similarity(count_matrix)

In [61]:
recommender_df = pd.DataFrame(recommender, 
                                  columns=md2.index,
                                  index=md2.index)

In [62]:
recommender_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,58673,58674,58675,58676,58677,58678,58679,58680,58681,58682
0,1.0,0.07769,0.040341,0.016576,0.056595,0.014291,0.015125,0.045948,0.007202,0.02294,...,0.0,0.0,0.0,0.056552,0.050063,0.0,0.0,0.056552,0.0,0.039633
1,0.07769,1.0,0.000842,0.052019,0.037921,0.002843,0.021043,0.095626,0.02557,0.011234,...,0.0,0.006573,0.0,0.0,0.005092,0.0,0.0,0.0,0.0,0.035641
2,0.040341,0.000842,1.0,0.030023,0.211047,0.00423,0.064721,0.0,0.015565,0.021654,...,0.0,0.0,0.0,0.085358,0.132236,0.0,0.0,0.085358,0.0,0.0
3,0.016576,0.052019,0.030023,1.0,0.048313,0.013204,0.243644,0.084229,0.007126,0.006262,...,0.101535,0.0,0.0,0.175863,0.090815,0.101535,0.0,0.175863,0.0,0.045408
4,0.056595,0.037921,0.211047,0.048313,1.0,0.002888,0.056576,0.0,0.016698,0.015895,...,0.0,0.0,0.0,0.137361,0.106399,0.0,0.0,0.137361,0.0,0.0


In [63]:
recommender_df.shape

(58683, 58683)

In [64]:
# recommender_df.to_csv('cosine_similarity_recommender_df.csv')

In [65]:
for item in recommender_df.iterrows():
    print(f'type(item)={type(item)}')
    print(f'type(item[0]) = {type(item[0])}')
    print(f'type(item[1]) = {type(item[1])}')
    print(f'item[1].sort_values(ascending=False) = {item[1].sort_values(ascending=False)[:20]}')
    print(f'item[1].sort_values(ascending=False) = {item[1].sort_values(ascending=False)[1:20].keys()}')
    break

type(item)=<class 'tuple'>
type(item[0]) = <class 'int'>
type(item[1]) = <class 'pandas.core.series.Series'>
item[1].sort_values(ascending=False) = 0        1.000000
3021     0.883399
2264     0.792139
4780     0.727488
14803    0.674252
6258     0.638986
11359    0.631469
37262    0.627132
8246     0.623481
19816    0.612167
22440    0.612145
5110     0.610312
22439    0.609498
21083    0.606688
18276    0.597164
13357    0.584243
16630    0.576945
56144    0.573070
18195    0.569762
18277    0.562639
Name: 0, dtype: float64
item[1].sort_values(ascending=False) = Int64Index([ 3021,  2264,  4780, 14803,  6258, 11359, 37262,  8246, 19816,
            22440,  5110, 22439, 21083, 18276, 13357, 16630, 56144, 18195,
            18277],
           dtype='int64')


In [66]:
pivot_item_based = pd.DataFrame(md2, index=md2.title, columns=['similar'])

In [67]:
pivot_item_based.head()

Unnamed: 0_level_0,similar
title,Unnamed: 1_level_1
Toy Story,
Jumanji,
Grumpier Old Men,
Waiting to Exhale,
Father of the Bride Part II,


In [68]:
pivot_item_based.reset_index(level=0, inplace=True)

In [69]:
pivot_item_based.head()

Unnamed: 0,title,similar
0,Toy Story,
1,Jumanji,
2,Grumpier Old Men,
3,Waiting to Exhale,
4,Father of the Bride Part II,


In [70]:
for item in recommender_df.iterrows():
    #title = md2[md2['movieId'] == item[0]]['title']
    title = md2.iloc[item[0]]['title']
    #print(f'title={title}')
    list_of_movies= []
    for index in item[1].sort_values(ascending=False)[1:20].keys():
        movie_name = md2.iloc[index]['title']
        #print(f'movie_name={movie_name}')
        list_of_movies.append(movie_name)
    #print(f'list_of_movies={list_of_movies}')
    pivot_item_based.loc[pivot_item_based['title'] == title, ['similar']] = str(list_of_movies)
    #break

In [136]:
pivot_item_based.head()

Unnamed: 0,title,similar
0,Toy Story,"['Toy Story 2', ""Bug's Life, A"", 'Monsters, In..."
1,Jumanji,"['Flubber', 'Final Cut, The', 'Jack', 'Mrs. Do..."
2,Grumpier Old Men,"['House Party 2', 'Grumpy Old Men', 'F/X2', 'E..."
3,Waiting to Exhale,"['Violets Are Blue...', 'In Her Shoes', ""What'..."
4,Father of the Bride Part II,"['Father of the Bride', 'My Big Fat Greek Wedd..."


In [72]:
pivot_item_based.to_csv('cosine_similarity_recommender_df.csv')

In [74]:
recommender.shape

(58683, 58683)

In [75]:
type(recommender)

numpy.ndarray

In [76]:
# cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [54]:
cosine_sim.shape

(58683, 58683)

In [51]:
type(cosine_sim)

numpy.ndarray

In [55]:
# calculate sparsity
from numpy import count_nonzero
sparsity = 1.0 - count_nonzero(recommender) / recommender.size
print(f"sparsity={sparsity}")
# sparsity=0.36485166556248483

sparsity=0.36485166556248483


In [62]:
# calculate sparsity
from numpy import count_nonzero
sparsity = 1.0 - count_nonzero(cosine_sim) / cosine_sim.size
print(f"sparsity={sparsity}")
# sparsity=0.36485166556248483

sparsity=0.36485166556248483


In [53]:
#from scipy import sparse

# create scipy sparse from pivot tables
#data_sparse = sparse.csr_matrix(count_matrix)

In [54]:
# dense_matrix = count_matrix.todense()

In [55]:
#df = pd.DataFrame(dense_matrix, 
#                  columns=count.get_feature_names())
#df

In [56]:
#cosine_sim = cosine_similarity(data_sparse, data_sparse, dense_output=False)

In [57]:
#cos_sim_df = pd.DataFrame(cosine_sim, columns=count.get_feature_names())

In [58]:
#cosine_sim_sparse = cosine_similarity(count_matrix, count_matrix, dense_output=False)

In [59]:
#type(cosine_sim_sparse)

In [124]:
def get_recommendations(title):
    idx = md2.index[md2['title'] == title]
    print(f'type(idx)={type(idx)}, idx={idx[0]}')
    sim_scores = list(enumerate(recommender_df[idx[0]]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    print(f'sim_scores={sim_scores}')
    movie_indices = [i[0] for i in sim_scores]
    print(f'movie_indices={movie_indices}')
    return md2.iloc[movie_indices]

In [147]:
def get_recommendations_from_df(title):
    idx = md2.index[md2['title'] == title]
    print(f'type(idx)={type(idx)}, idx={idx[0]}')
    movie_list = pivot_item_based.loc[pivot_item_based['title'] == title, 'similar'][0]
    movie_list = eval(movie_list)
    print(f'movie_list={movie_list}')
    movie_indices = [md2.index[md2['title'] == movie][0] for movie in movie_list]
    print(f'movie_indices={movie_indices}')
    return md2.iloc[movie_indices]

In [125]:
get_recommendations('Toy Story').head(10)

type(idx)=<class 'pandas.core.indexes.numeric.Int64Index'>, idx=0
sim_scores=[(3021, 0.883399415956529), (2264, 0.7921391962750051), (4780, 0.7274877508821692), (14803, 0.6742522362519374), (6258, 0.6389862448178554), (11359, 0.6314693901255664), (37262, 0.6271322379939358), (8246, 0.62348084508426), (19816, 0.6121674205624531), (22440, 0.612144539925954), (5110, 0.6103120467429644), (22439, 0.6094980145717912), (21083, 0.6066883730431709), (18276, 0.5971638918092335), (13357, 0.5842433555864375), (16630, 0.5769446516399769), (56144, 0.5730695990683348), (18195, 0.5697620429632), (18277, 0.5626386261358628), (33934, 0.555913511163542), (10808, 0.5544959014775455), (2203, 0.5244667976307551), (2497, 0.5224014701224302), (13438, 0.5224014701224302), (13728, 0.5224014701224302), (14083, 0.5224014701224302), (14084, 0.5224014701224302), (15848, 0.5224014701224302), (17286, 0.5224014701224302), (17358, 0.5224014701224302)]
movie_indices=[3021, 2264, 4780, 14803, 6258, 11359, 37262, 8246, 19

Unnamed: 0,movieId,title,genres,year,rating,imdbId,tmdbId,combination
3021,3114,Toy Story 2,"[Adventure, Animation, Children, Comedy, Fantasy]",1999,3.811464,120363,863.0,"['Adventure', 'Animation', 'Children', 'Comedy..."
2264,2355,"Bug's Life, A","[Adventure, Animation, Children, Comedy]",1998,3.569156,120623,9487.0,"['Adventure', 'Animation', 'Children', 'Comedy..."
4780,4886,"Monsters, Inc.","[Adventure, Animation, Children, Comedy, Fantasy]",2001,3.84862,198781,585.0,"['Adventure', 'Animation', 'Children', 'Comedy..."
14803,78499,Toy Story 3,"[Adventure, Animation, Children, Comedy, Fanta...",2010,3.857757,435761,10193.0,"['Adventure', 'Animation', 'Children', 'Comedy..."
6258,6377,Finding Nemo,"[Adventure, Animation, Children, Comedy]",2003,3.833977,266543,12.0,"['Adventure', 'Animation', 'Children', 'Comedy..."
11359,50872,Ratatouille,"[Animation, Children, Drama]",2007,3.81114,382932,2062.0,"['Animation', 'Children', 'Drama', 'Owned', 't..."
37262,157296,Finding Dory,"[Adventure, Animation, Comedy]",2016,3.615559,2277860,127380.0,"['Adventure', 'Animation', 'Comedy', 'animatio..."
8246,8961,"Incredibles, The","[Action, Adventure, Animation, Children, Comedy]",2004,3.854885,317705,9806.0,"['Action', 'Adventure', 'Animation', 'Children..."
19816,103141,Monsters University,"[Adventure, Animation, Comedy]",2013,3.502423,1453405,62211.0,"['Adventure', 'Animation', 'Comedy', 'pre', 'O..."
22440,115879,Toy Story Toons: Small Fry,"[Adventure, Animation, Children, Comedy, Fantasy]",2011,3.092105,2033372,82424.0,"['Adventure', 'Animation', 'Children', 'Comedy..."


In [148]:
get_recommendations_from_df('Toy Story').head(10)

type(idx)=<class 'pandas.core.indexes.numeric.Int64Index'>, idx=0
movie_list=['Toy Story 2', "Bug's Life, A", 'Monsters, Inc.', 'Toy Story 3', 'Finding Nemo', 'Ratatouille', 'Finding Dory', 'Incredibles, The', 'Monsters University', 'Toy Story Toons: Small Fry', 'Ice Age', 'Toy Story Toons: Hawaiian Vacation', 'Your Friend the Rat', 'Knick Knack', 'Up', 'Cars 2', 'Toy Story 4', "Boundin'", 'For the Birds']
movie_indices=[3021, 2264, 4780, 14803, 6258, 11359, 37262, 8246, 19816, 22440, 5110, 22439, 21083, 18276, 13357, 16630, 56144, 18195, 18277]


Unnamed: 0,movieId,title,genres,year,rating,imdbId,tmdbId,combination
3021,3114,Toy Story 2,"[Adventure, Animation, Children, Comedy, Fantasy]",1999,3.811464,120363,863.0,"['Adventure', 'Animation', 'Children', 'Comedy..."
2264,2355,"Bug's Life, A","[Adventure, Animation, Children, Comedy]",1998,3.569156,120623,9487.0,"['Adventure', 'Animation', 'Children', 'Comedy..."
4780,4886,"Monsters, Inc.","[Adventure, Animation, Children, Comedy, Fantasy]",2001,3.84862,198781,585.0,"['Adventure', 'Animation', 'Children', 'Comedy..."
14803,78499,Toy Story 3,"[Adventure, Animation, Children, Comedy, Fanta...",2010,3.857757,435761,10193.0,"['Adventure', 'Animation', 'Children', 'Comedy..."
6258,6377,Finding Nemo,"[Adventure, Animation, Children, Comedy]",2003,3.833977,266543,12.0,"['Adventure', 'Animation', 'Children', 'Comedy..."
11359,50872,Ratatouille,"[Animation, Children, Drama]",2007,3.81114,382932,2062.0,"['Animation', 'Children', 'Drama', 'Owned', 't..."
37262,157296,Finding Dory,"[Adventure, Animation, Comedy]",2016,3.615559,2277860,127380.0,"['Adventure', 'Animation', 'Comedy', 'animatio..."
8246,8961,"Incredibles, The","[Action, Adventure, Animation, Children, Comedy]",2004,3.854885,317705,9806.0,"['Action', 'Adventure', 'Animation', 'Children..."
19816,103141,Monsters University,"[Adventure, Animation, Comedy]",2013,3.502423,1453405,62211.0,"['Adventure', 'Animation', 'Comedy', 'pre', 'O..."
22440,115879,Toy Story Toons: Small Fry,"[Adventure, Animation, Children, Comedy, Fantasy]",2011,3.092105,2033372,82424.0,"['Adventure', 'Animation', 'Children', 'Comedy..."


In [60]:
get_recommendations_from_df('Finding Nemo').head(10)

type(idx)=<class 'pandas.core.indexes.numeric.Int64Index'>, idx=6258
movie_indices=[2264, 0, 37262, 4780, 5110, 3021, 11359, 18276, 10375, 13357, 18277, 8246, 18195, 18194, 22439, 19816, 21083, 22440, 13750, 10808, 14803, 20990, 18241, 33934, 15871, 21084, 16630, 359, 13972, 2203]


Unnamed: 0,movieId,title,genres,year,rating,tag,combination
2264,2355,"Bug's Life, A (1998)","[Adventure, Animation, Children, Comedy]",1998,3.569156,"[Owned, Pixar, Animated, Pixar, animation, ant...","['Owned', 'Pixar', 'Animated', 'Pixar', 'anima..."
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995,3.893708,"[Owned, imdb top 250, Pixar, Pixar, time trave...","['Owned', 'imdb top 250', 'Pixar', 'Pixar', 't..."
37262,157296,Finding Dory (2016),"[Adventure, Animation, Comedy]",2016,3.615559,"[animation, fish, ocean, Pixar, pixar, amnesia...","['animation', 'fish', 'ocean', 'Pixar', 'pixar..."
4780,4886,"Monsters, Inc. (2001)","[Adventure, Animation, Children, Comedy, Fantasy]",2001,3.84862,"[Owned, imdb top 250, Katottava, funny, cute, ...","['Owned', 'imdb top 250', 'Katottava', 'funny'..."
5110,5218,Ice Age (2002),"[Adventure, Animation, Children, Comedy]",2002,3.540187,"[Owned, animated, Disney, Katottava, pixar, Ic...","['Owned', 'animated', 'Disney', 'Katottava', '..."
3021,3114,Toy Story 2 (1999),"[Adventure, Animation, Children, Comedy, Fantasy]",1999,3.811464,"[Disney, Owned, imdb top 250, original, animat...","['Disney', 'Owned', 'imdb top 250', 'original'..."
11359,50872,Ratatouille (2007),"[Animation, Children, Drama]",2007,3.81114,"[Owned, top funniest animation, animation, cle...","['Owned', 'top funniest animation', 'animation..."
18276,95856,Knick Knack (1989),"[Animation, Children]",1989,3.542918,"[Pixar, short, pixar animation, short, snowglo...","['Pixar', 'short', 'pixar animation', 'short',..."
10375,40339,Chicken Little (2005),"[Action, Adventure, Animation, Children, Comed...",2005,2.752399,"[aliens, bland, bullying, dumb, father-son rel...","['aliens', 'bland', 'bullying', 'dumb', 'fathe..."
13357,68954,Up (2009),"[Adventure, Animation, Children, Drama]",2009,3.963585,"[emotional, motivational, tear jerker, owned, ...","['emotional', 'motivational', 'tear jerker', '..."


In [61]:
get_recommendations_from_df('Jumanji').head(10)

type(idx)=<class 'pandas.core.indexes.numeric.Int64Index'>, idx=1
movie_indices=[1638, 8224, 749, 495, 3391, 3352, 5420, 13786, 10941, 11692, 2162, 10787, 5053, 2876, 2340, 9663, 5065, 5157, 10413, 1474, 1213, 1253, 11510, 580, 3015, 5280, 20481, 9264, 1640, 2206]


Unnamed: 0,movieId,title,genres,year,rating,tag,combination
1638,1702,Flubber (1997),"[Children, Comedy, Fantasy]",1997,2.570245,"[inventor, romance, flight, green, inventor, m...","['inventor', 'romance', 'flight', 'green', 'in..."
8224,8939,"Final Cut, The (2004)","[Sci-Fi, Thriller]",2004,3.208225,"[boss, dying and death, filmteam, microchip, s...","['boss', 'dying and death', 'filmteam', 'micro..."
749,765,Jack (1996),"[Comedy, Drama]",1996,3.021152,"[Francis Ford Coppola, age difference, appeara...","['Francis Ford Coppola', 'age difference', 'ap..."
495,500,Mrs. Doubtfire (1993),"[Comedy, Drama]",1993,3.38631,"[Robin Williams, cross dressing, deceit, divor...","['Robin Williams', 'cross dressing', 'deceit',..."
3391,3489,Hook (1991),"[Adventure, Comedy, Fantasy]",1991,3.214023,"[daughter, duel, fairy tale, fantasy, flying, ...","['daughter', 'duel', 'fairy tale', 'fantasy', ..."
3352,3448,"Good Morning, Vietnam (1987)","[Comedy, Drama, War]",1987,3.666639,"[cynic, dying and death, entertainer, explosiv...","['cynic', 'dying and death', 'entertainer', 'e..."
5420,5528,One Hour Photo (2002),"[Drama, Thriller]",2002,3.316268,"[birthday party, cheating, hotel room, imagina...","['birthday party', 'cheating', 'hotel room', '..."
13786,71429,World's Greatest Dad (2009),"[Comedy, Drama]",2009,3.423,"[adolescence, high school, independent film, l...","['adolescence', 'high school', 'independent fi..."
10941,46972,Night at the Museum (2006),"[Action, Comedy, Fantasy, IMAX]",2006,3.045935,"[based on children's book, chaos, dinosaur, du...","[""based on children's book"", 'chaos', 'dinosau..."
11692,53974,License to Wed (2007),"[Comedy, Romance]",2007,2.576402,"[bride, bridegroom, church, civil registry off...","['bride', 'bridegroom', 'church', 'civil regis..."


In [None]:
get_recommendations_from_df('Father of the Bride Part II').head(10)

In [None]:
#cosine_sim_sparse = sparse.csr_matrix(cosine_sim)

In [71]:
#np.savez_compressed('cosine_sim_25m.npz',cosine_sim)

In [72]:
#loaded = np.load('cosine_sim_25m.npz')

In [73]:
#loaded.keys()

KeysView(<numpy.lib.npyio.NpzFile object at 0x7fae23a93d00>)

In [74]:
#type(loaded)

numpy.lib.npyio.NpzFile

In [75]:
#loaded.files

['arr_0']

In [76]:
#type(loaded['arr_0'])

numpy.ndarray

In [77]:
#cosine_sim_loaded = loaded['arr_0']

## SVD

In [54]:
!pip install -q surprise
from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate


In [79]:
# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))
ratings_dataset = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [80]:
from surprise import SVD

# We'll use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results
cross_validate(algo, ratings_dataset, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.7778  0.7776  0.7772  0.7775  0.7781  0.7776  0.0003  
MAE (testset)     0.5868  0.5869  0.5866  0.5869  0.5875  0.5869  0.0003  
Fit time          1191.06 1122.46 1004.57 1021.97 1010.47 1070.11 74.22   
Test time         310.95  214.26  183.04  240.51  138.92  217.54  57.68   


{'test_rmse': array([0.77776536, 0.7775672 , 0.77723935, 0.77754661, 0.77809416]),
 'test_mae': array([0.58679315, 0.58687918, 0.58660995, 0.58685411, 0.58751339]),
 'fit_time': (1191.0620748996735,
  1122.4649488925934,
  1004.5693008899689,
  1021.9744300842285,
  1010.46510887146),
 'test_time': (310.9504041671753,
  214.25812196731567,
  183.03560280799866,
  240.5103621482849,
  138.92254281044006)}

Export algo.

In [82]:
type(algo)

surprise.prediction_algorithms.matrix_factorization.SVD

In [83]:
from surprise import dump
dump.dump('svd', algo=algo)

Load SVD model from the dump file and perform inference.

In [None]:
!pip install -q wget
import wget
wget.download('https://s3.amazonaws.com/movielens.data/ml-25m/svd-5', 'svd')
!pip install -q surprise
from surprise import dump
_, svd = dump.load('svd')

We can now predict ratings by directly calling the predict() method. Let’s say you’re interested in user 196 and item 302 (make sure they’re in the trainset!), and you know that the true rating 𝑟𝑢𝑖=4

In [None]:
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = svd.predict(uid, iid, r_ui=4, verbose=True)

In [None]:
#svd.get_neighbors(302, 10)

In [76]:
md2['title'] = md2['title'].apply(lambda x: x[:x.find('(')].rstrip())

In [77]:
md2.head()

Unnamed: 0,movieId,title,genres,year,rating,tag,combination
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,3.893708,"[Owned, imdb top 250, Pixar, Pixar, time trave...","['Owned', 'imdb top 250', 'Pixar', 'Pixar', 't..."
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,3.251527,"[Robin Williams, time travel, fantasy, based o...","['Robin Williams', 'time travel', 'fantasy', ""..."
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,3.142028,"[funny, best friend, duringcreditsstinger, fis...","['funny', 'best friend', 'duringcreditsstinger..."
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,2.853547,"[based on novel or book, chick flick, divorce,...","['based on novel or book', 'chick flick', 'div..."
4,5,Father of the Bride Part II,[Comedy],1995,3.058434,"[aging, baby, confidence, contraception, daugh...","['aging', 'baby', 'confidence', 'contraception..."


In [88]:
x='Fa'

In [108]:
starts_with = md2[md2['title'].apply(lambda y: y.lower().startswith(x.lower()))]

In [109]:
starts_with

Unnamed: 0,movieId,title,genres,year,rating,tag,combination
4,5,Father of the Bride Part II,[Comedy],1995,3.058434,"[aging, baby, confidence, contraception, daugh...","['aging', 'baby', 'confidence', 'contraception..."
70,71,Fair Game,[Action],1995,2.345674,"[based on novel or book, bomb, car chase, car ...","['based on novel or book', 'bomb', 'car chase'..."
235,238,Far From Home: The Adventures of Yellow Dog,"[Adventure, Children]",1995,3.165796,"[boy and dog, wilderness, dog, dogs, wildernes...","['boy and dog', 'wilderness', 'dog', 'dogs', '..."
239,242,Farinelli: il castrato,"[Drama, Musical]",1994,3.385500,"[18th century, brotherhood, opera, biography, ...","['18th century', 'brotherhood', 'opera', 'biog..."
385,390,Faster Pussycat! Kill! Kill!,"[Action, Crime, Drama]",1965,3.407253,"[cult classic, campy, hallucinatory, irreveren...","['cult classic', 'campy', 'hallucinatory', 'ir..."
...,...,...,...,...,...,...,...
57366,204846,Fair of the Dove,[Comedy],1963,3.000000,,
57704,205827,Fanatic,[Thriller],2019,1.500000,,
57773,206012,Fathers are People,[Animation],1951,3.000000,,
58053,206837,Falling Camellia,[Drama],2018,2.250000,,


In [46]:
tag='romance'

In [47]:
tag_match = md2[md2['combination'].apply(lambda y: tag in y)]

In [48]:
tag_match

Unnamed: 0,movieId,title,genres,year,rating,tag,combination
6,7,Sabrina,"[Comedy, Romance]",1995,3.363666,"[remake, chauffeur, fusion, long island, milli...","['remake', 'chauffeur', 'fusion', 'long island..."
10,11,"American President, The","[Comedy, Drama, Romance]",1995,3.657171,"[Romance, white house, new love, usa president...","['Romance', 'white house', 'new love', 'usa pr..."
16,17,Sense and Sensibility,"[Drama, Romance]",1995,3.948806,"[chick flick, British, Jane Austen, 19th centu...","['chick flick', 'British', 'Jane Austen', '19t..."
28,29,"City of Lost Children, The","[Adventure, Drama, Fantasy, Mystery, Sci-Fi]",1995,3.936725,"[dystopia, surreal, bleak, children, kidnappin...","['dystopia', 'surreal', 'bleak', 'children', '..."
38,39,Clueless,"[Comedy, Romance]",1995,3.418738,"[teen movie, teen movie, chick flick, quotable...","['teen movie', 'teen movie', 'chick flick', 'q..."
...,...,...,...,...,...,...,...
54230,197199,Isn't It Romantic,"[Comedy, Fantasy, Romance]",2019,2.835681,"[cliche, musical number, romance, Liam Hemswor...","['cliche', 'musical number', 'romance', 'Liam ..."
55699,200540,Aladdin,"[Adventure, Fantasy, Romance]",2019,3.401832,"[costumes, Arabia, Disney, fantasy, genie, mus...","['costumes', 'Arabia', 'Disney', 'fantasy', 'g..."
55844,200864,Batman: Hush,"[Action, Animation, Crime, Mystery]",2019,2.780488,"[Bane, Batgirl, Batman, Catwoman, DC Universe,...","['Bane', 'Batgirl', 'Batman', 'Catwoman', 'DC ..."
56220,201773,Spider-Man: Far from Home,"[Action, Adventure, Sci-Fi]",2019,3.712522,"[Marvel, Spiderman, tom holland, Michael Giacc...","['Marvel', 'Spiderman', 'tom holland', 'Michae..."


In [78]:
md2.to_csv('ml-25m.csv')

## Perform Inference

In [50]:
import json
import os
import pathlib
import pickle

import pandas as pd
import numpy as np
import sys
import traceback

In [151]:
# movie_db = pd.read_csv(os.path.join(pathlib.Path(__file__).parent.absolute(), 'ml-25m.csv'), header=0,
#                   index_col=0,
#                   squeeze=True)
movie_db = pd.read_csv('ml-25m.csv', header=0,
                   index_col=0,
                   squeeze=True)

In [152]:
movie_db.head()

Unnamed: 0,movieId,title,genres,year,rating,tag,combination
0,1,Toy Story,"['Adventure', 'Animation', 'Children', 'Comedy...",1995,3.893708,"['Owned', 'imdb top 250', 'Pixar', 'Pixar', 't...","['Owned', 'imdb top 250', 'Pixar', 'Pixar', 't..."
1,2,Jumanji,"['Adventure', 'Children', 'Fantasy']",1995,3.251527,"['Robin Williams', 'time travel', 'fantasy', ""...","['Robin Williams', 'time travel', 'fantasy', ""..."
2,3,Grumpier Old Men,"['Comedy', 'Romance']",1995,3.142028,"['funny', 'best friend', 'duringcreditsstinger...","['funny', 'best friend', 'duringcreditsstinger..."
3,4,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",1995,2.853547,"['based on novel or book', 'chick flick', 'div...","['based on novel or book', 'chick flick', 'div..."
4,5,Father of the Bride Part II,['Comedy'],1995,3.058434,"['aging', 'baby', 'confidence', 'contraception...","['aging', 'baby', 'confidence', 'contraception..."


In [153]:
cosine_sim_df = pd.read_csv('cosine_similarity_recommender_df.csv', header=0,
                         index_col=0,
                         squeeze=True)

In [154]:
cosine_sim_df.head()

Unnamed: 0,title,similar
0,Toy Story,"['Toy Story 2', ""Bug's Life, A"", 'Monsters, In..."
1,Jumanji,"['Flubber', 'Final Cut, The', 'Jack', 'Mrs. Do..."
2,Grumpier Old Men,"['House Party 2', 'Grumpy Old Men', 'F/X2', 'E..."
3,Waiting to Exhale,"['Violets Are Blue...', 'In Her Shoes', ""What'..."
4,Father of the Bride Part II,"['Father of the Bride', 'My Big Fat Greek Wedd..."


In [155]:
# Pre computed cosine similarity Numpy arrays.
# loaded = np.load(os.path.join(pathlib.Path(__file__).parent.absolute(), 'cosine_sim.npy'))
#loaded = np.load('cosine_sim_25m.npz')

In [156]:
#type(loaded)

In [157]:
#cosine_sim = loaded['arr_0']

In [158]:
#cosine_sim.shape

In [159]:
#type(cosine_sim)

In [160]:
#cos_sim_df = pd.DataFrame(cosine_sim, index=indices, columns=indices)
#cos_sim_df.to_csv("artist_similarities.csv")

In [161]:
def load(file_name):
    dump_obj = pickle.load(open(file_name, 'rb'))
    return dump_obj['algo']

In [163]:
import pickle
# svd = load(os.path.join(pathlib.Path(__file__).parent.absolute(), 'svd'))
svd = load('svd')

In [164]:
def lambda_handler(event, context):
    top_10_movies = ""
    try:
        print(f"event={event}")
        title = event['multiValueQueryStringParameters']['Title'][0]
        userId = event['multiValueQueryStringParameters']['UserId'][0]
        print(f'Received Userid={userId}, Title={title}')
        idx = movie_db.index[movie_db['title'] == title]

        if not len(idx.values):
            # STAGE 1:
            # Check if there is a keyword match for the starting name of the movie.
            print('Unable to find exact title match')
            print('Trying to search for the starting keyword in the title name.')
            movies = movie_db[movie_db['title'].apply(lambda y: str(y).lower().startswith(str(title).lower()))]

            # STAGE 2:
            # Check if there is a keyword match in the tags for the movies.
            if movies.shape[0] < 20:
                print('Trying to search for keywords in the tags column.')
                tag_match = movie_db[movie_db['combination'].apply(lambda y: str(title) in str(y))]
                movies = pd.concat([movies, tag_match], ignore_index=True, axis=0)
            print(movies[['title', 'rating']])
            print("Stage 2 completed. Successfully fetched the title, vote count, vote average, year and id for the "
                  "top matched movies. ")
        else:
            idx = idx.values[0]
            print("STAGE 0 complete - Found the corresponding idx={} for the title {}."
                  .format(idx, title))

            # STAGE 1: Filter out top 20 movies based on cosine similarity.
            movie_list = cosine_sim_df.loc[cosine_sim_df['title'] == title, 'similar'][0]
            movie_list = eval(movie_list)
            print(f'movie_list={movie_list}')
            movie_indices = [movie_db.index[movie_db['title'] == movie][0] for movie in movie_list]
            print(
                f"Stage 1 complete - Found movie indices = {movie_indices} for top 50 titles that share the same cosine "
                f"similarity with the passed in movie index.")

            # STAGE 2:
            # For each of the 20 top movies that share cosine similarity with the passed in movie title,
            # Fetch the title, vote count, vote average, year and id.
            movies = movie_db.loc[movie_indices]
            print(movies[['title', 'rating']])
            print("Stage 2 completed. Successfully fetched the title, vote count, vote average, year and id for the "
                  "top 20 movies. ")

        # STAGE 3: Use the SVD model that is built 62,000 movies by 162,000 users.
        # Apply SVD algorithm predict() API call to find the estimation for the top 50 movie titles.
        movies['est'] = movies['movieId'].apply(lambda x: svd.predict(int(userId), x).est)
        movies = movies.sort_values(['est', 'year'], ascending=False)
        # print('movies={}'.format(movies[['title', 'est']]))
        print(movies[['title', 'est', 'rating']])
        print("Stage 3 completed. Successfully applied SVD predict() on the list of 50 movie indices.")

        # STAGE 4: Pick top 10 movies from this list and send this back to the customer.
        top_10_movies = movies.head(20)['title'].tolist()
        print("Stage 4 completed. Successfully fetched top 10 movies from the list and returning this back to the "
              "customer.")
        print('TOP 10 movie recommendations={}'.format(top_10_movies))
    except Exception:
        # printing stack trace
        traceback.print_exception(*sys.exc_info())
        print(traceback.format_exc())
    finally:
        return {
            'statusCode': 200,
            'body': json.dumps(
                {
                    "top_10_movies": top_10_movies,
                }
            )
        }

In [166]:
event = {'multiValueQueryStringParameters': {'Title': ['Toy Story'], 'UserId': ['500']}}
lambda_response = lambda_handler(event, None)
print(f'lambda_response={lambda_response}')        

event={'multiValueQueryStringParameters': {'Title': ['Toy Story'], 'UserId': ['500']}}
Received Userid=500, Title=Toy Story
STAGE 0 complete - Found the corresponding idx=0 for the title Toy Story.
movie_list=['Toy Story 2', "Bug's Life, A", 'Monsters, Inc.', 'Toy Story 3', 'Finding Nemo', 'Ratatouille', 'Finding Dory', 'Incredibles, The', 'Monsters University', 'Toy Story Toons: Small Fry', 'Ice Age', 'Toy Story Toons: Hawaiian Vacation', 'Your Friend the Rat', 'Knick Knack', 'Up', 'Cars 2', 'Toy Story 4', "Boundin'", 'For the Birds']
Stage 1 complete - Found movie indices = [3021, 2264, 4780, 14803, 6258, 11359, 37262, 8246, 19816, 22440, 5110, 22439, 21083, 18276, 13357, 16630, 56144, 18195, 18277] for top 50 titles that share the same cosine similarity with the passed in movie index.
                                    title    rating
3021                          Toy Story 2  3.811464
2264                        Bug's Life, A  3.569156
4780                       Monsters, Inc.  3.

NameError: name 'json' is not defined