In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()">
<input type="submit" value="Click here to toggle on/off the raw code."></form>''')

# Week 4 Item-Based CF

In the [spreadsheet](https://d396qusza40orc.cloudfront.net/umntestsite/on-demand_files/A5/Assignment%205.xls), you will find 4 sheets: 
- `Ratings` contains the raw rating data
- `NormRatings` contains the normalized ratings (each rating adjusted with the user’s mean rating)
- You will fill out `Matrix` with the similarities between all items
- `FilterMatrix` is a filtered view of Matrix where all **negative similarities are clamped to 0** (to exclude negative similarities from the computation)
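
The normalization described for `NormRatings` can be sketched in a few lines. This is a minimal illustration on a made-up 3x3 matrix (the array `ratings` and its values are hypothetical, not taken from the spreadsheet), assuming that unrated cells are NaN and are set to 0 only after centering:

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are movies; NaN = not rated.
ratings = np.array([
    [4.0, np.nan, 2.0],
    [5.0, 3.0, np.nan],
    [np.nan, 1.0, 3.0],
])

# Each user's mean over the movies they actually rated.
user_means = np.nanmean(ratings, axis=1, keepdims=True)

# Subtract the user's mean from every rating; unrated cells stay NaN.
norm_ratings = ratings - user_means

# Replace missing entries with 0 so they contribute nothing to dot products.
norm_filled = np.nan_to_num(norm_ratings, nan=0.0)
print(norm_filled)
```

Filling with 0 after centering means an unrated movie is treated as "rated exactly at the user's mean", which is the usual reason for centering before filling.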

The Ratings and NormRatings sheets also contain the L2 norm (Euclidean length, or square root of sum of squares) of each movie’s ratings.

For each similarity variant, you will fill out the Matrix sheet; use the FilterMatrix sheet to compute recommendations.
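
The `FilterMatrix` step amounts to clamping a similarity matrix at 0. A minimal sketch, assuming a small made-up similarity matrix `sim` (the values are illustrative only):

```python
import numpy as np

# Hypothetical item-item similarities, including one negative pair.
sim = np.array([
    [1.0, -0.4, 0.5],
    [-0.4, 1.0, 0.2],
    [0.5, 0.2, 1.0],
])

# Clamp negative similarities to 0; non-negative entries are unchanged.
filtered = np.clip(sim, 0.0, None)
print(filtered)
```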

#### Similarity Functions
Your core task in this assignment is to compute item similarities using 2 different similarity functions:

- Cosine similarity between items using **raw (unnormalized)** ratings.
- Cosine similarity using **adjusted (normalized)** ratings.

For the numerator of the similarity, you will probably want to use the SUMPRODUCT function; the provided L2 norms will be useful for computing the denominators.
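
In NumPy terms, SUMPRODUCT corresponds to a dot product and the L2 norm to `np.linalg.norm`. A minimal sketch on two made-up rating vectors (the helper name `cosine_from_norms` and the vectors are hypothetical):

```python
import numpy as np

def cosine_from_norms(a, b):
    """Cosine similarity as dot product divided by the product of L2 norms."""
    numerator = np.dot(a, b)                              # spreadsheet SUMPRODUCT
    denominator = np.linalg.norm(a) * np.linalg.norm(b)   # product of L2 norms
    return numerator / denominator

# Hypothetical rating columns for two movies (0 = not rated).
a = np.array([3.0, 0.0, 4.0])
b = np.array([4.0, 3.0, 0.0])

sim = cosine_from_norms(a, b)
print(round(sim, 3))
```

Here the dot product is 12 and both norms are 5, so the similarity is 12 / 25 = 0.48.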

#### Examples
Here are some values to check your calculations:

- Similarity between Toy Story and Star Wars, raw: 0.645
- Similarity between Toy Story and Star Wars, normalized: -0.378

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter 

Ratings_df = pd.read_csv('Assignment5_Ratings.csv').fillna(0)
NormRatings_df = pd.read_csv('Assignment5_NormRatings.csv').fillna(0)

In [3]:
movie1 = '1: Toy Story (1995)'
movie2 = '1210: Star Wars: Episode VI - Return of the Jedi (1983)'
# movie2 = '260: Star Wars: Episode IV - A New Hope (1977)'

In [4]:
def cosine_calc(rating_df, movie1, movie2):
    """ Return the cosine similarity between 2 movies 
    Parameters:
    -----------
    rating_df: pd.DataFrame
        can be Ratings_df or NormRatings_df
        
    movie1, movie2: String
        movie names
    
    Returns:
    --------
    numerator/denominator: Float
        Cosine similarity 
    """
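    # Dot product over the rating rows only; the last two rows are excluded
    numerator = np.dot(rating_df[:-2][movie1].values, rating_df[:-2][movie2].values)
    # Row 20 is used as the precomputed L2 norm of each movie's column
    denominator = (rating_df.at[20, movie1]*rating_df.at[20, movie2])
    return numerator/denominator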

    
def neighbor_movie(rating_df):
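    """ Build the item-item similarity matrix for all movies.
    Relies on the global `all_movies` (defined in a later cell).
    Parameters:
    -----------
    rating_df: pd.DataFrame
        can be Ratings_df or NormRatings_df
    
    Returns:
    --------
    movie_matrix: pd.DataFrame
        similarity of every movie (rows) to every movie (columns)
    """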
    movie_matrix = pd.DataFrame(index=all_movies)
    for m1 in all_movies:
        rating_list=[]
        for m2 in all_movies:
            rating_list.append(cosine_calc(rating_df, m1, m2))
        movie_matrix[m1] = rating_list
    return movie_matrix

In [5]:
print(f'Similarity between {movie1} and {movie2}, Normed: {round(cosine_calc(NormRatings_df, movie1, movie2), 3)}')

Similarity between 1: Toy Story (1995) and 1210: Star Wars: Episode VI - Return of the Jedi (1983), Normed: -0.378


In [6]:
print(f'Similarity between {movie1} and {movie2}, raw: {round(cosine_calc(Ratings_df, movie1, movie2), 3)}')

Similarity between 1: Toy Story (1995) and 1210: Star Wars: Episode VI - Return of the Jedi (1983), raw: 0.645


#### Deliverables
The output you are supposed to turn in consists of 2 parts. For each part, submit both unnormalized and normalized results.

- Top 5 Toy Story

Provide 5 movies most similar to Toy Story, in decreasing order of similarity. Only provide the movie ID, not the title.

![similarity_matrix_mean-centered](http://latex.codecogs.com/gif.latex?w_%7Bij%7D%3D%5Cfrac%7B%5Csum_%7Bu%7D%28%5Chat%7Br_%7Bui%7D%7D-%5Cbar%7Br%7D_%7Bi%7D%29%28%5Chat%7Br_%7Buj%7D%7D-%5Cbar%7Br%7D_%7Bj%7D%29%7D%7B%5Csqrt%7B%5Csum_%7Bu%7D%5Cleft%20%28r_%7Bui%7D%20-%5Cbar%7Br%7D_%7Bi%7D%20%5Cright%20%29%5E%7B2%7D%7D%20%5Csqrt%7B%5Csum_%7Bu%7D%5Cleft%20%28r_%7Buj%7D%20-%5Cbar%7Br%7D_%7Bj%7D%5Cright%20%29%5E%7B2%7D%7D%7D)
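
Once the ratings are centered and missing entries are set to 0, the formula above is a cosine over item columns, so the whole matrix can be computed with one matrix product. A minimal sketch, assuming a made-up centered matrix `X` (hypothetical values) with users as rows and items as columns:

```python
import numpy as np

# Hypothetical centered ratings: rows are users, columns are items, 0 = not rated.
X = np.array([
    [1.0, -1.0, 0.0],
    [1.0, 0.0, -1.0],
    [0.0, 1.0, -1.0],
])

# L2 norm of each item column (the denominators).
norms = np.linalg.norm(X, axis=0)

# All pairwise dot products at once, divided by the matching pair of norms.
W = (X.T @ X) / np.outer(norms, norms)
print(np.round(W, 3))
```

This sketch assumes no item column is all zeros; such a column would have norm 0 and needs special handling.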

In [7]:
all_movies = Ratings_df.columns[1:-3] # 20 movies
all_users = Ratings_df['User'].values # 22 reviewers

In [11]:
def top_5_movie(rating_df, movie1):
    """Print the 5 movies most similar to movie1, in decreasing order of similarity."""
    rating_dic={}
    for m in all_movies:
        rating_dic.update({m:cosine_calc(rating_df, movie1, m)})
    # Skip the first entry, which is expected to be movie1 itself (similarity 1)
    for movie in sorted(rating_dic, key=rating_dic.get, reverse=True)[1:6]:
        print(movie, round(rating_dic[movie],3))

- Top 5 for User 5277

Provide the top 5 recommended movies for user 5277, using an average of the user’s ratings weighted by similarity to each candidate movie. You do not need to exclude movies they have rated. **Consider all movies with nonnegative similarities (do not limit neighborhood size)**.
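
The weighted average described above can be sketched as follows. This is an illustration on made-up data, not the notebook's implementation: the helper `predict_scores`, the vector `user_ratings`, and the matrix `sim` are all hypothetical, and the sketch scores every candidate movie rather than only the rated ones.

```python
import numpy as np

def predict_scores(user_ratings, sim):
    """Similarity-weighted average of a user's ratings for every candidate item.

    user_ratings: 1-D array, NaN where the user has not rated the item.
    sim: square item-item similarity matrix.
    """
    rated = ~np.isnan(user_ratings)
    # Keep only similarities to rated items and clamp negatives to 0.
    weights = np.clip(sim[:, rated], 0.0, None)
    numerator = weights @ user_ratings[rated]
    denominator = weights.sum(axis=1)
    return numerator / denominator

# Hypothetical user who rated items 0 and 2, and a made-up similarity matrix.
user_ratings = np.array([4.0, np.nan, 2.0])
sim = np.array([
    [1.0, 0.5, -0.2],
    [0.5, 1.0, 0.5],
    [-0.2, 0.5, 1.0],
])

scores = predict_scores(user_ratings, sim)
top = np.argsort(-scores)[:2]   # indices of the two highest-scoring items
print(scores, top)
```

A candidate whose similarities to all rated items are 0 would give a zero denominator; the sketch does not guard against that case.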



# Raw
Learnings:
- All raw (unnormalized) similarities are non-negative because the ratings themselves are non-negative, so there is no need for `max(0, weight)`
- Only the movies this user has rated enter the calculation

Most similar to Toy Story (raw):

In [14]:
top_5_movie(Ratings_df, movie1)

260: Star Wars: Episode IV - A New Hope (1977) 0.747
780: Independence Day (ID4) (1996) 0.691
296: Pulp Fiction (1994) 0.668
318: Shawshank Redemption, The (1994) 0.667
1265: Groundhog Day (1993) 0.661


In [15]:
def user_movie_score_raw(userid):
    """Predict a similarity-weighted score for each movie the user has rated
    Parameters:
    -----------
    userid: String
        user ID as it appears in the 'User' column, e.g. '5277'
    
    Returns: 
    --------
    pred_score_dict: dict
        maps each rated movie to its predicted score for this user
    """
    raw_rating_df = pd.read_csv('Assignment5_Ratings.csv')
    # calculate item-item similarity matrix
    movie_weight_df = neighbor_movie(raw_rating_df.fillna(0))
    # list of movies that user rated 
    rated = raw_rating_df.columns \
                    [raw_rating_df[raw_rating_df['User']==userid].notnull().any()].tolist()
    rated.remove('User')
    rated.remove('Mean') # cleanup rated movie list by removing non-movie cols
    pred_score_dict = {}
    for movie_j in rated:
        # this user's ratings for the movies they rated
        x = Ratings_df[Ratings_df['User']==userid][rated].values[0] 
        # similarity of movie_j to each rated movie - weights 
        y = movie_weight_df[movie_j][rated].values 
        # similarity-weighted average of the user's ratings
        pred_score_dict.update({movie_j: np.dot(x, y)/np.sum(y)})
    return pred_score_dict


Counter(user_movie_score_raw('5277')).most_common(5)

[("527: Schindler's List (1993)", 2.9738833825906523),
 ('1259: Stand by Me (1986)', 2.9288008978650555),
 ('260: Star Wars: Episode IV - A New Hope (1977)', 2.922240134356363),
 ('593: Silence of the Lambs, The (1991)', 2.883304159274999),
 ('2396: Shakespeare in Love (1998)', 2.852130587479499)]


# Norm

Most similar to Toy Story (normalized):

In [16]:
# normed
top_5_movie(NormRatings_df, movie1)

34: Babe (1995) 0.554
356: Forrest Gump (1994) 0.356
296: Pulp Fiction (1994) 0.295
318: Shawshank Redemption, The (1994) 0.216
2028: Saving Private Ryan (1998) 0.193


![norm_score](http://latex.codecogs.com/gif.latex?S%5Cleft%20%28%20u%2C%20i%20%5Cright%20%29%20%3D%20%5Cfrac%7B%5Csum_%7Bj%5Cin%20N%5Cleft%20%28%20i%3B%20u%20%5Cright%20%29%7Dw_%7Bij%7D%5Cleft%20%28%20r_%7Buj%7D%20-%20%5Cbar%7Br_%7Bj%7D%7D%20%5Cright%20%29%7D%7B%5Csum_%7Bj%5Cin%20N%5Cleft%20%28%20i%3B%20u%20%5Cright%20%29%7D%5Cleft%20%7Cw_%7Bij%7D%20%5Cright%20%7C%7D%20&plus;%20%5Cbar%7Br_%7Bi%7D%7D)
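
The scoring formula above can be sketched directly. This is a minimal illustration under stated assumptions, not the assignment's reference solution: the helper `predict_centered`, the item means, and the similarity matrix are all made up, and the neighborhood is taken to be every item the user has rated.

```python
import numpy as np

def predict_centered(user_ratings, item_means, sim):
    """Weighted average of mean-centered ratings, with the item mean added back.

    user_ratings: 1-D array, NaN where the user has not rated the item.
    item_means: 1-D array of per-item mean ratings.
    sim: square item-item similarity matrix.
    """
    rated = ~np.isnan(user_ratings)
    w = sim[:, rated]                                   # weights to rated items
    offsets = user_ratings[rated] - item_means[rated]   # r_uj minus the mean of j
    return (w @ offsets) / np.abs(w).sum(axis=1) + item_means

# Hypothetical inputs: the user rated items 0 and 2.
user_ratings = np.array([4.0, np.nan, 2.0])
item_means = np.array([3.5, 3.0, 2.5])
sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])

scores = predict_centered(user_ratings, item_means, sim)
print(scores)
```

For item 1 the two offsets (+0.5 and -0.5) cancel under equal weights, so its score falls back to the item mean of 3.0.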

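Note: the following attempt does not produce the correct answer. One visible difference from the formula above is that `user_movie_score_norm` weights the raw ratings in `Ratings_df` rather than mean-centered ratings, and it never adds a mean back.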

In [17]:
def user_movie_score_norm(userid):
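    """Predict a similarity-weighted score for each movie the user has rated,
    using similarities computed from the normalized ratings
    Parameters:
    -----------
    userid: String
        user ID as it appears in the 'User' column, e.g. '5277'
    
    Returns: 
    --------
    pred_score_dict: dict
        maps each rated movie to its predicted score for this user
    """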
    norm_rating_df = pd.read_csv('Assignment5_NormRatings.csv')
    # calculate item-item similarity matrix
    movie_weight_df = neighbor_movie(norm_rating_df)
    # Consider all movies with nonnegative similarities 
    movie_weight_df = movie_weight_df.applymap(lambda x:max(0,x))
    # list of movies that user rated 
    rated = norm_rating_df.columns \
                    [norm_rating_df[norm_rating_df['User']==userid].notnull().any()].tolist()
    # cleanup rated movie list by removing non-movie cols
    rated.remove('User')
    # rated.remove('Mean') 
    pred_score_dict = {}
    for movie_j in rated:
        # this user's raw ratings for the movies they rated
        x = Ratings_df[Ratings_df['User']==userid][rated].values[0] 
        # clamped similarity of movie_j to each rated movie - weights 
        y = movie_weight_df[movie_j][rated].values 
        # similarity-weighted average of the user's ratings
        pred_score_dict.update({movie_j: np.dot(x, y)/np.sum(y)})
    return pred_score_dict


Counter(user_movie_score_norm('5277')).most_common(5)

[('260: Star Wars: Episode IV - A New Hope (1977)', 3.386263932479771),
 ("527: Schindler's List (1993)", 3.1636949134875296),
 ('593: Silence of the Lambs, The (1991)', 2.9074757020377135),
 ('1259: Stand by Me (1986)', 2.33468867340751),
 ('2396: Shakespeare in Love (1998)', 2.330016493299407)]