# **Collaborative filtering implementation:**

**1. Library importing, data initialization and auxiliary function definitions:**

In [76]:
import pandas as pd
import numpy as np
import math

ratings_path = "./ml-latest-small/ratings.csv"
movies_path = "./ml-latest-small/movies.csv"
tags_path = "./ml-latest-small/tags.csv"
similar_users_path = "./similar_users.csv"
user_coefficients_path = "./user_coefficients.csv"
top_similar_items_path = "./top_similar_items.csv"
top_movie_coefficients_path = "./top_item_coefficients.csv"
user_recommendations_path = "./user_recommendations/"
item_recommendations_path = "./item_recommendations/"

The `load_dataset()` function tries to read a `.csv` file from the given location. It returns the dataset on success and `None` otherwise:

In [77]:
def load_dataset(dataset):
    # dataset is a (name, path) pair
    dataset_name = dataset[0]
    dataset_path = dataset[1]
    try:
        data = pd.read_csv(dataset_path)
        print("{} is successfully read from memory.".format(dataset_name))
        return data
    except Exception:
        print("CAUTION: {} cannot be read from memory.".format(dataset_name))
        return None

The `load_precalculated()` function tries to read precalculated data from disk. If that fails, it calls the given recalculating function and picks the requested element (`arg`) of its result. Either way it returns the requested data:

In [78]:
def load_precalculated(data, name, path, recalculator, arg):
    # The incoming `data` placeholder is overwritten by whatever is loaded or recalculated
    data = load_dataset((name, path))
    if data is None:
        print("{} data is being recalculated... It might take a while...".format(name))
        data = recalculator()[arg]
    return data
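
A minimal, self-contained sketch of the same load-or-recompute pattern. The helper `load_or_recompute()`, the `recompute` callback and the toy table are hypothetical names used only for illustration; they are not part of the notebook. Unlike `load_precalculated()`, this variant writes the rebuilt table itself and catches only file-related errors:

```python
import os
import tempfile

import pandas as pd


def load_or_recompute(path, recompute):
    """Return the cached CSV at `path`, or rebuild and cache it (hypothetical helper)."""
    try:
        return pd.read_csv(path)
    except (OSError, pd.errors.EmptyDataError):
        # Missing, unreadable or empty file: rebuild and store for the next run
        data = recompute()
        data.to_csv(path, index=False)
        return data


cache_path = os.path.join(tempfile.mkdtemp(), "squares.csv")
first = load_or_recompute(cache_path, lambda: pd.DataFrame({"x": [1, 2, 3], "x_squared": [1, 4, 9]}))
second = load_or_recompute(cache_path, lambda: pd.DataFrame())  # served from the cache
print(second.equals(first))
```

The second call never invokes its callback, because the file written by the first call is found.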

The `process_ratings()` function processes the raw data from `ratings.csv`. First, it calculates the mean rating of each user. Then it "shifts" the ratings by subtracting that mean from each of them. Next comes a step the authors came up with on their own: we thought it would be nicer if the adjusted ratings were also unskewed. The farther a rating lies from the user's mean, the more weight it carries, but how far it can lie depends heavily on the mean itself. For example, if a user with a mean of 4.0 also gave ratings of 3.0 and 5.0, the adjusted ratings would be -1.0 and 1.0 respectively. The authors felt it unfair that a user's ratings could be skewed in either direction: in this case the worst movie could get an adjusted rating of -3.5, but the best one would never get more than 1.0.

So the authors empirically came up with the following. First, they calculated the mean skewness, that is, how far each adjusted rating lies from the mean relative to the largest possible deviation in that direction (the rating scale runs from 0.5 to 5):
```python
Ratings['mean_skewness'] = (
    Ratings['rating_adjusted'] / (5 - Ratings['rating_mean'])
    * (Ratings['rating_adjusted'] > 0)
    - Ratings['rating_adjusted'] / (Ratings['rating_mean'] - 0.5)
    * (Ratings['rating_adjusted'] < 0)
)
```
and then used it to unskew the adjusted ratings:
```python
Ratings['rating_unskewed'] = Ratings['rating_adjusted'] * np.sqrt(1 + (Ratings['mean_skewness']**2) * 2)
```

Afterwards, the `PivotedUserMatrix` and `PivotedMoviesMatrix` matrices are created by pivoting the `Ratings` matrix so that users' ratings and movies' ratings, respectively, form the columns, which simplifies further operations on them. The function returns all three matrices:


In [79]:
def process_ratings(Ratings):
    MeanUserRating = Ratings.groupby(['userId'], as_index = False, sort = False).mean()\
    .rename(columns = {'rating': 'rating_mean'})[['userId','rating_mean']]
    
    Ratings = pd.merge(Ratings, MeanUserRating, on = 'userId', how = 'left', sort = False)
    Ratings['rating_adjusted'] = Ratings['rating'] - Ratings['rating_mean']
    Ratings['mean_skewness'] = (Ratings['rating_adjusted'])/(5-Ratings['rating_mean'])*(Ratings['rating_adjusted']>0).astype(float)\
    - (Ratings['rating_adjusted'])/(Ratings['rating_mean']-0.5)*(Ratings['rating_adjusted']<0).astype(float)
    
    Ratings['rating_unskewed'] = Ratings['rating_adjusted'] * np.sqrt(1+(Ratings['mean_skewness']**2)*2) 
    PivotedUserMatrix = Ratings.pivot_table(index='movieId', columns='userId', values='rating_unskewed', fill_value=0)
    PivotedMoviesMatrix = Ratings.pivot_table(index='userId', columns='movieId', values='rating_unskewed', fill_value=0)
    return Ratings, PivotedUserMatrix, PivotedMoviesMatrix
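
To see what the unskewing step does, it can be applied by hand to the single user from the example above (ratings 3.0, 4.0 and 5.0, mean 4.0). This is a sketch under assumptions: `unskew()` and its `low`/`high` parameters are hypothetical names, and the function mirrors the two formulas for one user rather than reproducing `process_ratings()`:

```python
import numpy as np


def unskew(ratings, low=0.5, high=5.0):
    """Mean-centre one user's ratings and stretch them by the skewness factor."""
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    adjusted = ratings - mean
    # Room available above and below the mean on the rating scale
    room_up, room_down = high - mean, mean - low
    skewness = np.zeros_like(adjusted)
    positive, negative = adjusted > 0, adjusted < 0
    # A positive (negative) deviation implies room_up (room_down) is non-zero
    skewness[positive] = adjusted[positive] / room_up
    skewness[negative] = -adjusted[negative] / room_down
    return adjusted * np.sqrt(1 + 2 * skewness ** 2)


unskewed = unskew([3.0, 4.0, 5.0])
print(unskewed)
```

The rating of 5.0 uses all the room above the mean (skewness 1), so its adjusted value 1.0 is stretched by `sqrt(3)`. The rating of 3.0 uses only 1/3.5 of the room below the mean, so its adjusted value -1.0 is stretched far less.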

Start of the program's execution. Here we just declare and initialize the global `pandas.DataFrame`s and create lists of distinct `userId`s and `movieId`s:

In [80]:
SimilarUsers = pd.DataFrame()
UserCoefficients = pd.DataFrame()
TopSimilarItems = pd.DataFrame()
TopMovieCoefficients = pd.DataFrame()

Ratings = load_dataset(("Ratings", ratings_path))
Ratings, PivotedUserMatrix, PivotedMoviesMatrix = process_ratings(Ratings)
Movies = load_dataset(("Movies", movies_path))

distinct_users = np.unique(Ratings['userId'])
distinct_movies = np.unique(Ratings['movieId'])

Ratings is successfully read from memory.
Movies is successfully read from memory.


The `calc_vectors_length()` function takes in a matrix and calculates the Euclidean lengths of its column vectors. It returns a one-column `pandas.DataFrame` with the vectors' lengths:

In [332]:
def calc_vectors_length(matrix):
    # Euclidean (L2) length of every column, in column order
    lengths = [math.sqrt((matrix[column].values ** 2).sum()) for column in matrix]
    return pd.DataFrame({'length': lengths}, dtype='float')
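
The column lengths can also be computed in a single vectorized call. A small sketch on a made-up matrix (the `toy` values are illustrative, not taken from the dataset), using `np.linalg.norm` along the row axis:

```python
import numpy as np
import pandas as pd

# Toy ratings matrix: rows are movies, columns are users (illustrative values)
toy = pd.DataFrame({10: [3.0, 4.0, 0.0], 20: [0.0, 0.0, 0.0], 30: [1.0, 2.0, 2.0]})

# Euclidean norm of every column in one call
lengths = pd.DataFrame({'length': np.linalg.norm(toy.values, axis=0)})
print(lengths)
```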

The `unskewed_pearson_similarity()` function takes in two vectors along with their lengths, calculates their dot product and returns their Pearson-style similarity (the dot product divided by both lengths), or `0` if the input vectors are equal or either of them has zero length:

In [343]:
def unskewed_pearson_similarity(v1, v2, v1_length, v2_length):
    dot_product = (v1*v2).sum()
    if v1_length < 0.0000001 or v2_length < 0.0000001 or (v1==v2).all():
        return 0
    else:
        return dot_product / v1_length / v2_length
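
A toy illustration of how this measure behaves on mean-adjusted vectors. The helper `cosine_similarity()` below is a hypothetical stand-in that computes the lengths itself and, unlike the notebook function, does not return 0 for identical vectors:

```python
import numpy as np


def cosine_similarity(v1, v2, eps=1e-7):
    """Cosine of the angle between two vectors; 0 if either is (near) zero."""
    n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
    if n1 < eps or n2 < eps:
        return 0.0
    return float(np.dot(v1, v2) / (n1 * n2))


a = np.array([1.0, 0.0, -1.0])
same_taste = cosine_similarity(a, np.array([2.0, 0.0, -2.0]))      # same direction
opposite_taste = cosine_similarity(a, np.array([-1.0, 0.0, 1.0]))  # opposite direction
no_overlap = cosine_similarity(a, np.array([0.0, 3.0, 0.0]))       # no movies in common
print(same_taste, opposite_taste, no_overlap)
```

Because unrated movies are stored as 0, two users with no rated movies in common get a similarity of 0.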

**2. Computation of User-User and Item-Item similarity matrices:** 

The `user_similarity_matrix()` function calculates how similar each pair of users is and stores the result in matrix form. For each user, the similarity values (i.e. the Pearson coefficients) are then sorted in descending order and saved in the `UserCoefficients` dataframe. The corresponding `userId`s are saved in the same order in the `SimilarUsers` dataframe. As the calculations proceed, a printed message reports the progress. At the very end, the function saves both the `UserCoefficients` and `SimilarUsers` dataframes to disk, so that the same data need not be recalculated at future program launches:

In [345]:
def user_similarity_matrix():
    SimilarUsers = pd.DataFrame(0, index=np.arange(distinct_users.size), columns=np.arange(distinct_users.size), dtype='float')
    UserCoefficients = pd.DataFrame(0, index=np.arange(distinct_users.size), columns=np.arange(distinct_users.size), dtype='float')
    user_vectors_length = calc_vectors_length(PivotedUserMatrix)

    for user in distinct_users:
        userIndex = np.searchsorted(distinct_users, user)

        for user2 in distinct_users:
            user2Index = np.searchsorted(distinct_users, user2)

            proximity = unskewed_pearson_similarity(PivotedUserMatrix.iloc[:,userIndex],\
                                                    PivotedUserMatrix.iloc[:,user2Index],\
                                                    user_vectors_length.values[userIndex][0],\
                                                    user_vectors_length.values[user2Index][0])
            SimilarUsers.iat[userIndex, user2Index] = proximity

        # Order this user's neighbours by descending similarity
        similarity_values = SimilarUsers.iloc[userIndex].values.copy()
        order = np.argsort(similarity_values)[::-1]
        similarity_values = similarity_values[order]
        UserCoefficients.iloc[userIndex] = np.where(similarity_values > 0, similarity_values, 0)
        # Store neighbours as userIds (0 marks "no neighbour"); later lookups map them
        # back to positions with np.searchsorted(distinct_users, ...)
        SimilarUsers.iloc[userIndex] = np.where(similarity_values > 0, distinct_users[order], 0)
        # Report progress every 20 users
        if (userIndex+1) % 20 == 0:
            print("Calculated for {} users out of {}.".format(userIndex+1, distinct_users.size))

    UserCoefficients.to_csv(user_coefficients_path, index=False)
    SimilarUsers.to_csv(similar_users_path, index=False)
    return SimilarUsers, UserCoefficients
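
The pairwise loop grows quadratically with the number of users. One possible alternative, sketched here under assumptions, normalizes every column once and obtains all similarities from a single matrix product. `all_pairs_similarity()` and `rank_neighbours()` are hypothetical helpers, and the sketch zeroes only the diagonal, whereas the notebook function returns 0 for any pair of identical vectors:

```python
import numpy as np


def all_pairs_similarity(matrix, eps=1e-7):
    """Cosine similarity between every pair of columns of `matrix`."""
    matrix = np.asarray(matrix, dtype=float)
    lengths = np.linalg.norm(matrix, axis=0)
    # Scale each column to unit length; near-zero columns become all zeros
    safe = np.where(lengths < eps, 1.0, lengths)
    unit = np.where(lengths < eps, 0.0, matrix / safe)
    similarity = unit.T @ unit
    np.fill_diagonal(similarity, 0.0)  # a column is not its own neighbour
    return similarity


def rank_neighbours(similarity, ids):
    """Per row: neighbour ids and coefficients, best first, 0 where not positive."""
    order = np.argsort(similarity, axis=1)[:, ::-1]
    coefficients = np.take_along_axis(similarity, order, axis=1)
    neighbours = np.where(coefficients > 0, np.asarray(ids)[order], 0)
    return neighbours, np.where(coefficients > 0, coefficients, 0.0)


# Rows are movies, columns are four made-up users with ids 1..4
toy = np.array([[1.0, 2.0, -1.0, 0.0],
                [-1.0, -2.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 0.0]])
similarity = all_pairs_similarity(toy)
neighbours, coefficients = rank_neighbours(similarity, ids=[1, 2, 3, 4])
print(neighbours)
```

In the toy data, user 1's only positively correlated neighbour is user 2, user 3 has the opposite taste, and user 4 rated nothing.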

The `item_similarity_matrix()` function calculates how similar each pair of movies is and stores the result in matrix form. For each movie, the similarity values (i.e. the Pearson coefficients) are then sorted in descending order and saved in the `MovieCoefficients` dataframe. The corresponding `movieId`s are saved in the same order in the `MoviesMatrix` dataframe. As the calculations proceed, a printed message reports the progress. For each movie, the function keeps the top 1000 most similar movies along with the corresponding coefficients in the `TopSimilarItems` and `TopMovieCoefficients` dataframes, and at the very end saves both to disk, so that the same data need not be recalculated at future program launches:

In [335]:
def item_similarity_matrix():
    MoviesMatrix = pd.DataFrame(0, index=np.arange(distinct_movies.size), columns=np.arange(distinct_movies.size), dtype='float')
    TopSimilarItems = pd.DataFrame(0, index=np.arange(distinct_movies.size), columns=np.arange(1000), dtype='float')
    MovieCoefficients = pd.DataFrame(0, index=np.arange(distinct_movies.size), columns=np.arange(distinct_movies.size), dtype='float')
    TopMovieCoefficients = pd.DataFrame(0, index=np.arange(distinct_movies.size), columns=np.arange(1000), dtype='float')
    movie_vectors_length = calc_vectors_length(PivotedMoviesMatrix)
    
    for movie in distinct_movies:
        movieIndex = np.searchsorted(distinct_movies, movie)

        for movie2 in distinct_movies:
            movie2Index = np.searchsorted(distinct_movies, movie2)
            
            proximity = unskewed_pearson_similarity(PivotedMoviesMatrix.iloc[:,movieIndex],\
                                                    PivotedMoviesMatrix.iloc[:,movie2Index],\
                                                    movie_vectors_length.values[movieIndex][0],\
                                                    movie_vectors_length.values[movie2Index][0])
            MoviesMatrix.iat[movieIndex, movie2Index] = proximity

        # Order this movie's neighbours by descending similarity
        similarity_values = MoviesMatrix.iloc[movieIndex].values.copy()
        order = np.argsort(similarity_values)[::-1]
        similarity_values = similarity_values[order]

        MovieCoefficients.iloc[movieIndex] = np.where(similarity_values > 0, similarity_values, 0)
        TopMovieCoefficients.iloc[movieIndex] = MovieCoefficients.iloc[movieIndex].values[:1000]
        # Store neighbours as movieIds (0 marks "no neighbour"); later lookups map them
        # back to positions with np.searchsorted(distinct_movies, ...)
        MoviesMatrix.iloc[movieIndex] = np.where(similarity_values > 0, distinct_movies[order], 0)
        TopSimilarItems.iloc[movieIndex] = MoviesMatrix.iloc[movieIndex].values[:1000]
        # Report progress every 20 movies
        if (movieIndex+1) % 20 == 0:
            print("Calculated for {} items out of {}.".format(movieIndex+1, distinct_movies.size))

    TopSimilarItems.to_csv(top_similar_items_path, index=False)
    TopMovieCoefficients.to_csv(top_movie_coefficients_path, index=False)
    return TopSimilarItems, TopMovieCoefficients
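
When only the top N neighbours are kept, a full sort of every row is not strictly needed. A sketch of one alternative, assuming a hypothetical helper `top_n_positive()`: `np.argpartition` first isolates the N largest scores, and only those are then sorted:

```python
import numpy as np


def top_n_positive(scores, ids, n):
    """Ids and scores of the `n` largest positive entries, best first."""
    scores = np.asarray(scores, dtype=float)
    ids = np.asarray(ids)
    n = min(n, scores.size)
    if n <= 0:
        return ids[:0], scores[:0]
    # Partition so that the n largest scores come first, then sort just those
    top = np.argpartition(-scores, n - 1)[:n]
    top = top[np.argsort(-scores[top])]
    top = top[scores[top] > 0]
    return ids[top], scores[top]


best_ids, best_scores = top_n_positive([0.2, 0.9, 0.0, -0.3, 0.5], [11, 22, 33, 44, 55], n=3)
print(best_ids, best_scores)
```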

**3. Loading the similarity matrices from the memory, or recalculating if missing:**

Here we are just initializing the global matrices with sorted user-based and item-based recommendations - either through loading from memory, or through recalculating them from scratch if missing:

In [380]:
SimilarUsers = load_precalculated(SimilarUsers, "SimilarUsers", similar_users_path, user_similarity_matrix, 0)
UserCoefficients = load_precalculated(UserCoefficients, "UserCoefficients", user_coefficients_path, user_similarity_matrix, 1)    
TopSimilarItems = load_precalculated(TopSimilarItems, "TopSimilarItems", top_similar_items_path, item_similarity_matrix, 0)
TopMovieCoefficients = load_precalculated(TopMovieCoefficients, "TopMovieCoefficients", top_movie_coefficients_path, item_similarity_matrix, 1)    

SimilarUsers is successfully read from memory.
UserCoefficients is successfully read from memory.
TopSimilarItems is successfully read from memory.
TopMovieCoefficients is successfully read from memory.


**4. Calculating top user-based and item-based collaborative recommendations for a particular user utilizing similarity matrices from the previous step:**

The `accumulate_user_recommendations()` function collects the recommendations for a particular `userId` from its similar users and adds them together, weighted by similarity, so that the more people recommend a movie, the higher its recommendation score:

In [183]:
def accumulate_user_recommendations(userId, recommenders):
    recommendations = np.zeros(distinct_movies.size)
    userIndex = np.searchsorted(distinct_users, userId)
    recommendersProximity = UserCoefficients.values[userIndex]

    for recommender, proximity in zip(recommenders, recommendersProximity):
        recommenderIndex = np.searchsorted(distinct_users, recommender)
        recommenderRatings = PivotedUserMatrix.iloc[:,recommenderIndex]
        recommendation_vector = (recommenderRatings*proximity).values
        recommendations += recommendation_vector

    return recommendations
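
The accumulation is a weighted sum of the neighbours' rating columns, which is the same as one matrix-vector product. A small sketch with made-up numbers (three movies, three hypothetical neighbours) comparing the two forms:

```python
import numpy as np

# Rows are movies, columns are three hypothetical neighbours (adjusted ratings)
neighbour_ratings = np.array([
    [1.0, 0.5, 0.0],
    [-1.0, 0.0, 2.0],
    [0.0, 1.5, -0.5],
])
proximities = np.array([0.8, 0.5, 0.1])

# Loop form, as in accumulate_user_recommendations()
looped = np.zeros(3)
for column, proximity in enumerate(proximities):
    looped += neighbour_ratings[:, column] * proximity

# Equivalent single matrix-vector product
scores = neighbour_ratings @ proximities
print(scores)
```

The second movie ends up with a negative score because the closest neighbour disliked it, even though a distant neighbour rated it highly.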

The `accumulate_item_recommendations()` function collects the recommendations for a particular `userId` from the items similar to those the user rated positively and adds them together, so that the more items recommend a movie, the higher its recommendation score:

In [184]:
def accumulate_item_recommendations(userId, user_preferences, user_rates):
    recommendations = np.zeros(distinct_movies.size)
    
    for preference, rate in zip(user_preferences, user_rates):
        preferenceIndex = np.searchsorted(distinct_movies, preference)
        preferenceTwins = TopSimilarItems.values[preferenceIndex].astype(int)
        twinsProximity = TopMovieCoefficients.values[preferenceIndex]
        
        for twinId, twinProximity in zip(preferenceTwins, twinsProximity):
            if twinId > 0:
                twinIndex = np.searchsorted(distinct_movies, twinId)
                recommendations[twinIndex] += rate*twinProximity

    return recommendations
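
A toy walk-through of the item-based accumulation. All ids, ratings and coefficients below are invented for illustration, and `np.add.at` replaces the inner loop of the notebook function:

```python
import numpy as np

movie_ids = np.array([10, 20, 30, 40])   # sorted ids, like distinct_movies
liked = {10: 1.5, 30: 0.5}               # movieId -> user's positive adjusted rating
similar = {10: [(20, 0.9), (30, 0.4)],   # movieId -> [(similar movieId, coefficient)]
           30: [(40, 0.6), (20, 0.2)]}

scores = np.zeros(movie_ids.size)
for movie, rate in liked.items():
    twin_ids, coefficients = zip(*similar[movie])
    positions = np.searchsorted(movie_ids, twin_ids)
    # add.at accumulates correctly even if a position repeats
    np.add.at(scores, positions, rate * np.array(coefficients))

print(dict(zip(movie_ids.tolist(), scores.tolist())))
```

Movie 20 is similar to both liked movies, so it collects contributions from each and ranks first.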

The `user_collaborative_recommendations()` function calculates user-based collaborative recommendations for a particular `userId` and returns up to 1000 of the most recommended `movieId`s in descending order:

In [234]:
def user_collaborative_recommendations(userId):
    userIndex = np.searchsorted(np.unique(Ratings['userId']), userId)
    recommenders = np.extract((SimilarUsers.values[userIndex])>0, SimilarUsers.values[userIndex])
    acc_recommendations = accumulate_user_recommendations(userId, recommenders)
    user_recommendations = distinct_movies[np.argsort(acc_recommendations)][::-1]
    acc_recommendations.sort()
    sorted_recommendations = acc_recommendations[::-1]
    user_recommendations = np.extract(sorted_recommendations>0, user_recommendations)
    return user_recommendations[:1000]

The `item_collaborative_recommendations()` function calculates item-based collaborative recommendations for a particular `userId` and returns up to 1000 of the most recommended `movieId`s in descending order:

In [186]:
def item_collaborative_recommendations(userId):
    userIndex = np.searchsorted(distinct_users, userId)
    user_rates = PivotedMoviesMatrix.values[userIndex]
    user_rates = np.where(user_rates>0, user_rates, 0)
    user_preferences = distinct_movies[np.argsort(user_rates)][::-1]
    
    user_rates.sort()
    user_rates = user_rates[::-1]
    user_rates = np.extract(user_rates>0, user_rates)
    user_preferences = user_preferences[:user_rates.size]
    
    acc_recommendations = accumulate_item_recommendations(userId, user_preferences, user_rates)
    item_recommendations = distinct_movies[np.argsort(acc_recommendations)][::-1]
    acc_recommendations.sort()
    acc_recommendations = acc_recommendations[::-1]
    item_recommendations = np.extract(acc_recommendations>0, item_recommendations)

    return item_recommendations[:1000]

The `generate_recommendation_files()` function generates, for each `userId`, a `pandas.DataFrame` of up to 1000 of the most recommended movies for both user-based and item-based collaborative recommendations, and saves each to the corresponding folder in a file named after the `userId`:

In [348]:
def generate_recommendation_files():
    for userId in PivotedMoviesMatrix.index:

        pd.DataFrame({'user_recommendation': user_collaborative_recommendations(userId)}, dtype='int')\
        .to_csv(user_recommendations_path + '/' + str(userId), index=False)

        pd.DataFrame({'item_recommendation': item_collaborative_recommendations(userId)}, dtype='int')\
        .to_csv(item_recommendations_path + '/' + str(userId), index=False)
        
        # Report progress every 20 users
        userIndex = np.searchsorted(distinct_users, userId)
        if (userIndex+1) % 20 == 0:
            print("Calculated for {} users out of {}.".format(userIndex+1, distinct_users.size))

generate_recommendation_files()

The `user_profile()` function generates a `pandas.DataFrame` of the movies rated by the given `userId`, along with this user's unskewed adjusted ratings:

In [91]:
def user_profile(userId):
    userIndex = np.searchsorted(distinct_users, userId)
    # Copy the row so that the in-place sort below does not alter PivotedMoviesMatrix
    user_rates = PivotedMoviesMatrix.values[userIndex].copy()
    userChoices = distinct_movies[np.argsort(user_rates)]
    user_rates.sort()
    userChoices = np.extract(user_rates!=0, userChoices)
    user_rates = np.extract(user_rates!=0, user_rates)
    userPreferences = pd.DataFrame(data={'movieId':userChoices, 'rating':user_rates})
    UserProfile = Movies[Movies['movieId'].isin(userChoices)].set_index('movieId')\
    .join(userPreferences.set_index('movieId')).sort_values('rating', ascending=False)
    return UserProfile

**Examples:**

In [19]:
userId=0
user_profile(userId)

movieId,title,genres,rating
1172,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama,1.89087
1953,"French Connection, The (1971)",Action|Crime|Thriller,1.89087
2105,Tron (1982),Action|Adventure|Sci-Fi,1.89087
1339,Dracula (Bram Stoker's Dracula) (1992),Fantasy|Horror|Romance|Thriller,1.083462
1029,Dumbo (1941),Animation|Children|Drama|Musical,0.464933
2150,"Gods Must Be Crazy, The (1980)",Adventure|Comedy,0.464933
3671,Blazing Saddles (1974),Comedy|Western,0.464933
1061,Sleepers (1996),Thriller,0.464933
3792,Duel in the Sun (1946),Drama|Romance|Western,-0.05003
3785,Scary Movie (2000),Comedy|Horror,-0.05003


In [154]:
users_recommend = pd.read_csv(user_recommendations_path + '/' + str(userId))
users_recommend = users_recommend.values.flatten()
Movies[Movies['movieId'].isin(users_recommend)]

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
5,6,Heat (1995),Action|Crime|Thriller
15,16,Casino (1995),Crime|Drama
16,17,Sense and Sensibility (1995),Drama|Romance
24,25,Leaving Las Vegas (1995),Drama|Romance
27,28,Persuasion (1995),Drama|Romance
28,29,"City of Lost Children, The (Cité des enfants p...",Adventure|Drama|Fantasy|Mystery|Sci-Fi
31,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
34,36,Dead Man Walking (1995),Crime|Drama
37,39,Clueless (1995),Comedy|Romance


In [155]:
movies_recommend = pd.read_csv(item_recommendations_path + '/' + str(1))
movies_recommend = movies_recommend.values.flatten()
Movies[Movies['movieId'].isin(movies_recommend)]

Unnamed: 0,movieId,title,genres
33,35,Carrington (1995),Drama|Romance
72,78,"Crossing Guard, The (1995)",Action|Crime|Drama|Thriller
97,105,"Bridges of Madison County, The (1995)",Drama|Romance
103,113,Before and After (1996),Drama|Mystery
115,129,Pie in the Sky (1996),Comedy|Romance
176,200,"Tie That Binds, The (1995)",Thriller
187,213,Burnt by the Sun (Utomlyonnye solntsem) (1994),Drama
192,218,Boys on the Side (1995),Comedy|Drama
238,266,Legends of the Fall (1994),Drama|Romance|War|Western
278,312,Stuart Saves His Family (1995),Comedy


In [403]:
in_common_common = np.array([])
for userId in PivotedMoviesMatrix.index:
    users_recommend = pd.read_csv(user_recommendations_path + '/' + str(userId))
    users_recommend = users_recommend.values.flatten()
    #print (userId, users_recommend[:10])
    movies_recommend = pd.read_csv(item_recommendations_path + '/' + str(userId))
    movies_recommend = movies_recommend.values.flatten()
    in_common = np.intersect1d(users_recommend, movies_recommend, assume_unique=False)
    in_common_common = np.append(in_common_common, in_common)
    #print (in_common.size)
# Count, over all users, how often each movie appears in both recommendation lists
unique_common, counts_common = np.unique(in_common_common, return_counts=True)
dict_common = dict(zip(unique_common, counts_common))
dict_common = sorted(dict_common.items(), key=lambda kv: kv[1], reverse=True)
print(np.unique(in_common_common).size, np.unique(in_common_common).size/671.0)
#Movies[Movies['movieId'].isin(movies_recommend[20:30])]
print(dict_common[:20])

(2325, 3.464977645305514)
[(858.0, 639), (2300.0, 623), (745.0, 614), (733.0, 610), (969.0, 596), (8961.0, 590), (1288.0, 588), (953.0, 584), (232.0, 573), (951.0, 561), (955.0, 556), (741.0, 551), (1060.0, 547), (2186.0, 522), (912.0, 521), (954.0, 512), (1580.0, 504), (994.0, 490), (3996.0, 487), (3307.0, 482)]
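
The same overlap count can be sketched with `collections.Counter`, which avoids growing one large array. The recommendation lists below are invented for illustration; in the notebook they would come from the per-user files:

```python
from collections import Counter

# Hypothetical per-user recommendation lists (movieIds), user-based vs item-based
user_based = {1: [858, 745, 912], 2: [858, 1580], 3: [745, 733]}
item_based = {1: [858, 912, 232], 2: [1580, 858], 3: [969]}

counts = Counter()
for user_id in user_based:
    # Movies recommended to this user by both methods
    counts.update(set(user_based[user_id]) & set(item_based[user_id]))

most_common = counts.most_common(2)
print(most_common)
```

Each count is the number of users for whom both methods recommend that movie.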


In [425]:
most_common = np.array([movie_id for movie_id, _ in dict_common[:40]], dtype='int')
Movies[Movies['movieId'].isin(most_common)]

Unnamed: 0,movieId,title,genres
204,232,Eat Drink Man Woman (Yin shi nan nu) (1994),Comedy|Drama|Romance
422,474,In the Line of Fire (1993),Action|Thriller
479,535,Short Cuts (1993),Drama
607,720,Wallace & Gromit: The Best of Aardman Animatio...,Adventure|Animation|Comedy
615,733,"Rock, The (1996)",Action|Adventure|Thriller
619,741,Ghost in the Shell (Kôkaku kidôtai) (1995),Animation|Sci-Fi
622,745,Wallace & Gromit: A Close Shave (1995),Animation|Children|Comedy
695,858,"Godfather, The (1972)",Crime|Drama
733,912,Casablanca (1942),Drama|Romance
736,915,Sabrina (1954),Comedy|Romance
