### In the following, I look for similar movies using only the ratings collected from the tweets.
#### (I'm aware that there are Python libraries designed specifically for building recommendation engines. They would run more sophisticated computations and give better recommendations, but I didn't want to `import antigravity` and fly.)

In [418]:
from IPython.display import Image
Image(url="https://image.slidesharecdn.com/programming-with-python-basic-130312114054-phpapp01/95/programming-with-python-basic-19-638.jpg?cb=1465001793")

In [1]:
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import coo_matrix
import pandas as pd
import scipy.spatial.distance  # makes scipy.spatial.distance.cdist available below
import numpy
import pickle

#### Preparing a matrix where every movie is a row, every user is a column, and the values are the ratings. A sparse matrix is used to speed up the computation

In [2]:
cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('latest/ratings.dat', sep='::',
                      index_col=False, names=cols,
                      encoding="UTF-8", engine='python')

In [3]:
ratings['user_id'] = ratings['user_id'].astype("category")
ratings['movie_id'] = ratings['movie_id'].astype("category")

In [4]:
pivoted = coo_matrix((ratings['rating'].astype(float), 
                   (ratings['movie_id'].cat.codes, 
                    ratings['user_id'].cat.codes)))
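
To make the category-code trick concrete, here is a minimal sketch on made-up ratings (the `toy` frame and its values are assumptions for illustration, not data from `ratings.dat`). Each distinct id gets a code from 0 to n-1 in sorted order, so the codes can serve directly as row and column indices.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Toy ratings (made-up values) with the same column names as in the notebook
toy = pd.DataFrame({
    'user_id':  [10, 10, 20, 30],
    'movie_id': [7, 9, 7, 8],
    'rating':   [8, 6, 9, 5],
})
toy['user_id'] = toy['user_id'].astype('category')
toy['movie_id'] = toy['movie_id'].astype('category')

# Rows are movies, columns are users, values are ratings
toy_matrix = coo_matrix((toy['rating'].astype(float),
                         (toy['movie_id'].cat.codes,
                          toy['user_id'].cat.codes)))

print(toy_matrix.shape)
print(toy_matrix.toarray())
```

Cells without a rating are simply not stored, which is what keeps the real movie-by-user matrix manageable.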

### To find the latent characteristics reflected in the users' ratings, I'm going to use the SVD algorithm
#### Choosing parameters for the SVD algorithm to explain most of the variance

In [59]:
svd = TruncatedSVD(n_components=100, algorithm='arpack')
svd.fit(pivoted)
svd.explained_variance_ratio_.sum()

0.34029293092040525

In [21]:
svd_default = TruncatedSVD(n_components=100)
svd_default.fit(pivoted)
svd_default.explained_variance_ratio_.sum()

0.33804611167474657

In [22]:
svd_500 = TruncatedSVD(n_components=500)
svd_500.fit(pivoted)
svd_500.explained_variance_ratio_.sum()

0.63398036486040477
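
Instead of refitting for every candidate value of `n_components`, one option is to fit once with the largest value and look at the cumulative sum of `explained_variance_ratio_`. This is only a sketch on a synthetic random matrix (the `demo` data and the sizes are assumptions), and it relies on the `arpack` solver returning the components ordered by singular value, so the first k components of a larger fit should match a fit with k components up to solver tolerance.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for the movie x user matrix
rng = np.random.RandomState(0)
demo = sparse_random(200, 300, density=0.05, format='csr', random_state=rng)

# Fit once with the largest number of components we want to inspect
svd_demo = TruncatedSVD(n_components=50, algorithm='arpack')
svd_demo.fit(demo)

# cumulative[k - 1] is the variance ratio explained by the first k components
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)
for k in (10, 25, 50):
    print(k, round(float(cumulative[k - 1]), 3))
```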

#### Try giving the answer only for movies rated by at least 10 users

In [6]:
movies_to_keep = ratings.groupby('movie_id').count()
movies_to_keep = movies_to_keep[movies_to_keep.rating > 9].reset_index()[['movie_id']]

In [7]:
filtered_ratings = ratings.merge(movies_to_keep, on='movie_id', how='inner')

In [8]:
len(ratings), len(filtered_ratings)

(583130, 534140)

#### We didn't lose too much data with this limitation: 534,140 of the 583,130 ratings (about 92%) remain
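
The same filter can also be written without the merge, by attaching the per-movie count to every row with `groupby(...).transform('count')`. A small sketch on made-up data (the `toy` frame and the lowered threshold are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id':  [1, 2, 3, 1, 2, 1],
    'movie_id': [100, 100, 100, 200, 200, 300],
    'rating':   [7, 8, 9, 6, 5, 10],
})
min_ratings = 2  # the notebook uses 10; lowered here for the toy data

# Number of ratings of the row's movie, aligned with the original rows
counts = toy.groupby('movie_id')['rating'].transform('count')
toy_filtered = toy[counts >= min_ratings]

print(len(toy), len(toy_filtered))
```

Movie 300 has a single rating here, so its row is the only one dropped.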

In [9]:
filtered_ratings['user_id'] = filtered_ratings['user_id'].astype("category")
filtered_ratings['movie_id'] = filtered_ratings['movie_id'].astype("category")

filtered_pivoted = coo_matrix((filtered_ratings['rating'].astype(float), 
                   (filtered_ratings['movie_id'].cat.codes, 
                    filtered_ratings['user_id'].cat.codes)))

In [57]:
filtered_svd = TruncatedSVD(n_components=100, algorithm='arpack')
filtered_svd.fit(filtered_pivoted)
filtered_svd.explained_variance_ratio_.sum()

0.35710184729416261

#### It did not help much: with 100 components the explained variance ratio rose only from 0.340 to 0.357. What if we also filter for users who rated at least 10 movies?

In [10]:
users_to_keep = ratings.groupby('user_id').count()
users_to_keep = users_to_keep[users_to_keep.rating > 9].reset_index()[['user_id']]

In [11]:
double_filtered_ratings = filtered_ratings.merge(users_to_keep, on='user_id', how='inner')

In [63]:
len(ratings), len(filtered_ratings), len(double_filtered_ratings)

(583130, 534140, 454543)
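
One thing to keep in mind: dropping users can push some movies back below 10 ratings. If both thresholds have to hold at the same time, the two filters can be repeated until nothing changes. This is an optional variant, not what the cells above do, and `filter_until_stable` is a hypothetical helper name.

```python
import pandas as pd

def filter_until_stable(df, min_count):
    """Drop rare movies and inactive users repeatedly until nothing changes."""
    while True:
        movie_counts = df.groupby('movie_id')['rating'].transform('count')
        user_counts = df.groupby('user_id')['rating'].transform('count')
        kept = df[(movie_counts >= min_count) & (user_counts >= min_count)]
        if len(kept) == len(df):
            return kept
        df = kept

# Made-up ratings: user 3 rated only one movie
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3],
    'movie_id': [10, 20, 10, 20, 10],
    'rating':   [8, 7, 9, 6, 5],
})
stable = filter_until_stable(toy, min_count=2)
print(len(toy), len(stable))
```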

In [13]:
double_filtered_ratings['user_id'] = double_filtered_ratings['user_id'].astype("category")
double_filtered_ratings['movie_id'] = double_filtered_ratings['movie_id'].astype("category")

double_filtered_pivoted = coo_matrix((double_filtered_ratings['rating'].astype(float), 
                   (double_filtered_ratings['movie_id'].cat.codes, 
                    double_filtered_ratings['user_id'].cat.codes)))

In [65]:
double_filtered_svd = TruncatedSVD(n_components=100, algorithm='arpack')
double_filtered_svd.fit(double_filtered_pivoted)
double_filtered_svd.explained_variance_ratio_.sum()

0.36497076339588913

In [14]:
double_filtered_svd_500 = TruncatedSVD(n_components=500, algorithm='arpack')
double_filtered_svd_500.fit(double_filtered_pivoted)
double_filtered_svd_500.explained_variance_ratio_.sum()

0.68026765749971452

#### That also didn't help much (0.365 with 100 components, 0.680 with 500), but let's continue with this

In [15]:
decomposed_data = double_filtered_svd_500.transform(double_filtered_pivoted)

In [74]:
decomposed_data.shape

(5770L, 500L)

### And let's find similar movies!
#### To be able to interpret the results, I join the movie titles to the IDs

In [16]:
movies = double_filtered_ratings[['movie_id']].drop_duplicates().sort_values('movie_id')
movies.index = range(0,len(movies))
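
The row order of the matrix comes from the category codes, so the same mapping can be read directly from `cat.categories`. A small sketch with made-up ids (the `toy` frame is an assumption), showing that row i of the matrix belongs to `categories[i]`:

```python
import pandas as pd

toy = pd.DataFrame({'movie_id': [30, 10, 20, 10]})
toy['movie_id'] = toy['movie_id'].astype('category')

# Codes are positions in the sorted list of distinct ids
print(toy['movie_id'].cat.codes.tolist())

# Row i of the sparse matrix belongs to categories[i]
row_to_movie = pd.DataFrame({'movie_id': toy['movie_id'].cat.categories})
print(row_to_movie)
```

This should give the same order as dropping duplicates and sorting by `movie_id`, as long as the ids sort the same way in both cases.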

In [17]:
cols = ['movie_id', 'movie_title', 'movie_genre']
all_movies = pd.read_csv('latest/movies.dat', sep='::',
                         index_col=False, names=cols,
                         encoding="UTF-8", engine='python')

In [18]:
filtered_movies = all_movies.merge(movies, on='movie_id', how='inner')

#### Let's create a class that calculates the cosine distance between the vectors representing the movies. It returns the X most similar movies

In [382]:
class movie_lookup(object):
    """
    This class calculates the cosine distance of one row from all other rows in a matrix.

    Args:
    dataframe (dataframe): a dataframe with the column 'movie_title' containing the movie titles
                           and indexed in the same order as the rows of the decomposed_data matrix
    decomposed_data (array): a numpy array with one row per movie and one column per component
                             obtained from the user ratings by dimensionality reduction
    """
    def __init__(self, dataframe, decomposed_data):
        self.dataframe = dataframe
        self.decomposed_data = decomposed_data

    def most_similar_movies(self, title, howmany):
        """
        This method connects the movie titles with the decomposed_data and finds the X rows
        with the smallest cosine distance from the given row.

        Args:
        title (str): a title from the movie_title column of the dataframe
        howmany (int): a number defining how many results should be returned

        Return:
        A list of (movie title, similarity) tuples for the most similar movies,
        where similarity is 1 minus the cosine distance from the given title
        """
        if title not in self.dataframe['movie_title'].tolist():
            raise ValueError('The given title is not among the movies')
        movie_index = self.dataframe[self.dataframe.movie_title == title].index[0]
        v = self.decomposed_data[movie_index, :].reshape(1, -1)
        distances = scipy.spatial.distance.cdist(self.decomposed_data, v, 'cosine').reshape(-1)
        # Position 0 of the sorted order is the movie itself (distance 0), so skip it
        most_similars = numpy.argsort(distances)[1:howmany + 1]
        return [(self.dataframe.loc[x]['movie_title'], 1 - round(distances[x], 3)) for x in most_similars]
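
To show what the lookup computes, here is a minimal standalone sketch on made-up 2-D vectors (the titles and numbers are assumptions for illustration). It uses `cosine_similarity` from scikit-learn instead of `cdist`, which gives the similarity directly rather than 1 minus the distance.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up 2-D "latent" vectors for four hypothetical titles
titles = ['A', 'A: Part II', 'B', 'C']
vectors = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0],
                    [-1.0, 0.0]])

query = titles.index('A')
sims = cosine_similarity(vectors[query].reshape(1, -1), vectors).ravel()

# Sort from most to least similar and skip the query itself
order = [i for i in np.argsort(-sims) if i != query][:2]
print([(titles[i], round(float(sims[i]), 3)) for i in order])
```

The vector pointing in nearly the same direction as the query comes first, the orthogonal one second, and the opposite one would come last.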


#### Collecting the most similar movie and its similarity for every movie

In [None]:
m = movie_lookup(filtered_movies, decomposed_data)
highests = []

for title in filtered_movies['movie_title']:
    most_similar = m.most_similar_movies(title,1)
    highests.append((most_similar[0][1],title, most_similar[0][0]))
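
The loop above calls the lookup once per title. A possible alternative, sketched here on made-up vectors (not the notebook's `decomposed_data`), is to compute the full similarity matrix once and take the best match per row. For a few thousand movies the dense matrix should still fit in memory, but that is an assumption about the machine.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up vectors: rows 0/1 and rows 2/3 point in similar directions
vectors = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0],
                    [0.1, 0.9]])

sim = cosine_similarity(vectors)
np.fill_diagonal(sim, -np.inf)             # a row must not match itself
nearest = sim.argmax(axis=1)               # index of the most similar other row
best = sim[np.arange(len(sim)), nearest]   # similarity of that best match

print(nearest.tolist())
print(np.round(best, 3).tolist())
```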

In [421]:
sorted(highests)[-10:]

[(0.945,
  'The Disappearance of Eleanor Rigby: Her (2013)',
  'The Disappearance of Eleanor Rigby: Him (2013)'),
 (0.945,
  'The Disappearance of Eleanor Rigby: Him (2013)',
  'The Disappearance of Eleanor Rigby: Her (2013)'),
 (0.949, 'Nymphomaniac (2013)', 'Nymphomaniac: Volume II (2013)'),
 (0.949, 'Nymphomaniac: Volume II (2013)', 'Nymphomaniac (2013)'),
 (0.949,
  'The Lord of the Rings: The Return of the King (2003)',
  'The Lord of the Rings: The Two Towers (2002)'),
 (0.949,
  'The Lord of the Rings: The Two Towers (2002)',
  'The Lord of the Rings: The Return of the King (2003)'),
 (0.954,
  'Harry Potter and the Goblet of Fire (2005)',
  'Harry Potter and the Prisoner of Azkaban (2004)'),
 (0.954,
  'Harry Potter and the Prisoner of Azkaban (2004)',
  'Harry Potter and the Goblet of Fire (2005)'),
 (0.96,
  'Harry Potter and the Chamber of Secrets (2002)',
  "Harry Potter and the Sorcerer's Stone (2001)"),
 (0.96,
  "Harry Potter and the Sorcerer's Stone (2001)",
  'Harry Po

In [423]:
sorted(highests)[:20]

[(0.21199999999999997,
  'Peaceful Warrior (2006)',
  'Escape from Planet Earth (2013)'),
 (0.21299999999999997,
  'Birlesen Gonuller (2014)',
  'Waiting for Forever (2010)'),
 (0.21399999999999997, 'Rise of the Footsoldier (2007)', 'Flight (2012)'),
 (0.22099999999999997, 'Life of Pi (2012)', 'Silver Linings Playbook (2012)'),
 (0.22299999999999998, 'Kod Adi K.O.Z. (2015)', 'Birdman (2014)'),
 (0.22699999999999998, 'Drive (2011)', 'Pusher (1996)'),
 (0.22999999999999998, 'Nights in Rodanthe (2008)', 'The Last Song (2010)'),
 (0.235, 'Zero Dark Thirty (2012)', 'Premium Rush (2012)'),
 (0.237, 'Pearl Jam Twenty (2011)', 'Les Mis\xc3\xa9rables (2012)'),
 (0.238, 'Flight (2012)', 'Trouble with the Curve (2012)'),
 (0.24, 'Lincoln (2012)', 'The New World (2005)'),
 (0.242, '127 Hours (2010)', 'Grown Ups (2010)'),
 (0.242, 'Transit (2012)', 'Dilwale (2015)'),
 (0.245, 'Searching for Bobby Fischer (1993)', "Fool's Gold (2008)"),
 (0.245, 'The Good Girl (2002)', 'Into the Woods (2014)'),
 (0.

### It looks like movies from the same series are liked very similarly. But movies with a unique atmosphere, like Life of Pi, Memento or Django, are not that similar to any other movie.

In [408]:
sorted(highests)[-70:-60]

[(0.876, 'Badlapur (2015)', 'Tanu Weds Manu Returns (2015)'),
 (0.876, 'The Bourne Ultimatum (2007)', 'The Bourne Supremacy (2004)'),
 (0.877, 'Udta Punjab (2016)', 'Shuddh Desi Romance (2013)'),
 (0.877, 'Wazir (2016)', 'Kapoor and Sons (2016)'),
 (0.88,
  'Ice Age: Dawn of the Dinosaurs (2009)',
  'Ice Age: The Meltdown (2006)'),
 (0.88,
  'Ice Age: The Meltdown (2006)',
  'Ice Age: Dawn of the Dinosaurs (2009)'),
 (0.881, 'Tanu Weds Manu Returns (2015)', 'Dum Laga Ke Haisha (2015)'),
 (0.883,
  "A Nightmare on Elm Street Part 2: Freddy's Revenge (1985)",
  'A Nightmare on Elm Street 3: Dream Warriors (1987)'),
 (0.884,
  "Freddy's Dead: The Final Nightmare (1991)",
  'A Nightmare on Elm Street 4: The Dream Master (1988)'),
 (0.884,
  'Friday the 13th Part VIII: Jason Takes Manhattan (1989)',
  'Friday the 13th: A New Beginning (1985)')]

In [417]:
sorted(highests)[-130:-120]

[(0.84,
  'Halloween 4: The Return of Michael Myers (1988)',
  'Halloween 5 (1989)'),
 (0.84,
  'Halloween 5 (1989)',
  'Halloween 4: The Return of Michael Myers (1988)'),
 (0.841,
  'Boruto: Naruto the Movie (2015)',
  'The Last: Naruto the Movie (2014)'),
 (0.841, 'The Hangover (2009)', 'The Hangover Part II (2011)'),
 (0.841, 'The Hangover Part II (2011)', 'The Hangover (2009)'),
 (0.841,
  'The Last: Naruto the Movie (2014)',
  'Boruto: Naruto the Movie (2015)'),
 (0.841, 'War Room (2015)', 'Woodlawn (2015)'),
 (0.841, 'Woodlawn (2015)', 'War Room (2015)'),
 (0.842, 'Ae Dil Hai Mushkil (2016)', 'Kapoor and Sons (2016)'),
 (0.843, 'Spider-Man 3 (2007)', 'Spider-Man 2 (2004)')]

### In fact, to find the first movie pair not coming from the same series, we have to skip the 67 most similar results. And the first pair that is neither from a series nor from Bollywood doesn't come up until we have gone through the first 120 results
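
Many results in the lists above appear twice, once in each direction (A with B, then B with A). If the counts should refer to distinct pairs, the mirrored entries can be collapsed first. A minimal sketch, using a few rows copied from the output above in the same `(similarity, title, most_similar_title)` layout as `highests`:

```python
sample = [
    (0.841, 'The Hangover (2009)', 'The Hangover Part II (2011)'),
    (0.841, 'The Hangover Part II (2011)', 'The Hangover (2009)'),
    (0.841, 'War Room (2015)', 'Woodlawn (2015)'),
    (0.841, 'Woodlawn (2015)', 'War Room (2015)'),
    (0.843, 'Spider-Man 3 (2007)', 'Spider-Man 2 (2004)'),
]

seen = set()
unique_pairs = []
for similarity, title, neighbour in sorted(sample, reverse=True):
    key = frozenset((title, neighbour))   # the same key for both directions
    if key not in seen:
        seen.add(key)
        unique_pairs.append((similarity, title, neighbour))

print(len(sample), len(unique_pairs))
```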

In [19]:
# Saving the dataframes for later use in another notebook
with open('filtered_movies.pkl', 'wb') as output:
    pickle.dump(filtered_movies, output)

with open('collaborative_decomposed_data.pkl', 'wb') as output:
    pickle.dump(decomposed_data, output)
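
For completeness, a sketch of the round trip the other notebook would need. It writes a stand-in array to a temporary file so the example runs on its own; the file name and `demo_array` are assumptions, and in practice the two `.pkl` files saved above would be opened instead.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in array; in the notebook this would be decomposed_data
demo_array = np.arange(6, dtype=float).reshape(3, 2)
path = os.path.join(tempfile.mkdtemp(), 'demo_decomposed_data.pkl')

with open(path, 'wb') as output:
    pickle.dump(demo_array, output)

# Files written in binary mode have to be read back in binary mode
with open(path, 'rb') as source:
    restored = pickle.load(source)

print(np.array_equal(demo_array, restored))
```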