
# Week 2: User-User Collaborative Filtering Part 2 Assignment and Quiz
In this assignment, you will implement user-user collaborative filtering using a spreadsheet. (We are using only basic spreadsheet operations that would work with Google's free Drive-based spreadsheet, Excel, or any other common spreadsheet program. We'll help you find the correct operations.)

## Part 1 - Without Normalization
### Step 1 - Start by downloading the [starting spreadsheet](https://drive.google.com/file/d/0BxANCLmMqAyIQ0ZWSy1KNUI4RWc/view).
This is a 25 user x 100 movie matrix of ratings selected from the class data set. The spreadsheet has three sheets in it (this is not supposed to be an exercise in spreadsheet tricks; as a result, we’ve already given you a significant start). 1) The first sheet is a ratings matrix with movies as rows and users as columns, 2) The second sheet is a ratings matrix with movies as columns and users as rows, and 3) The third sheet is the start of your correlations matrix.
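
To illustrate the two layouts, here is a minimal sketch with a made-up ratings table. The movie labels, user IDs, and ratings are invented for the example and are not taken from the spreadsheet. It shows that transposing the movies-as-rows layout gives the users-as-rows layout.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 3 movies x 2 users, NaN = not rated
movies_as_rows = pd.DataFrame(
    {"u1": [4.0, np.nan, 3.5], "u2": [5.0, 2.0, np.nan]},
    index=["m1", "m2", "m3"],
)

# Transposing swaps the axes: users become rows, movies become columns
users_as_rows = movies_as_rows.T

print(movies_as_rows.shape)
print(users_as_rows.shape)
```

Column-wise operations such as `DataFrame.corr()` compare columns, so the movies-as-rows layout is the convenient one for user-user correlations.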

In [1]:
import pandas as pd

In [237]:
moview_row = pd.read_csv('movie_row.csv')
moview_row.shape

(100, 26)

### Step 3 - Complete the user-by-user correlations matrix.
To check your math, note that the correlation between users 1648 and 5136 is 0.40298, and the correlation between users 918 and 2824 is -0.31706. All correlations should be between -1 and 1, and the diagonal should be all 1's (since they are self-correlations). <br>

By default, `DataFrame.corr(method='pearson', min_periods=1)` computes the **Pearson correlation**, using only the ratings that two users have in common. This is a problem when two users share very few ratings: with only two movies in common, for instance, the correlation comes out as exactly 1 or -1 (whenever it is defined), which doesn't mean the users are really similar.
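
The effect of a small overlap can be seen in a minimal sketch on made-up ratings. The user labels `a`, `b`, `c` and all values are hypothetical. Raising `min_periods` is one possible way to suppress correlations built on too few co-rated movies; it is not something the assignment asks for.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical): rows are movies, columns are users
toy = pd.DataFrame({
    "a": [5.0, 3.0, 4.0, 2.0, 1.0],
    "b": [4.0, 2.0, np.nan, np.nan, np.nan],  # shares only 2 movies with "a"
    "c": [4.5, 3.5, 4.0, 2.5, 1.0],           # shares all 5 movies with "a"
})

# Default min_periods=1: two co-rated movies are enough to produce a value
corr_default = toy.corr()

# Require at least 3 co-rated movies; pairs with fewer become NaN
corr_strict = toy.corr(min_periods=3)

print(corr_default.loc["a", "b"])  # close to 1, from just two shared movies
print(corr_strict.loc["a", "b"])   # NaN: overlap is below the threshold
```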

In [5]:
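# Drop the movie-title column first: newer pandas versions raise an error
# when corr() is given non-numeric columns.
corr_df = moview_row.drop(columns=['Unnamed: 0']).corr()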

In [226]:
corr_df.at['5136', '1648']

0.4029801884569963
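
The sanity checks mentioned in Step 3 (values between -1 and 1, diagonal of all 1's) can be sketched as follows. The data is a made-up, fully rated toy matrix, and the names `toy`, `toy_corr`, and `tol` are placeholders for this example.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical): every user has rated every movie
toy = pd.DataFrame({
    "u1": [5.0, 3.0, 4.0, 1.0],
    "u2": [4.0, 2.5, 4.5, 2.0],
    "u3": [1.0, 4.0, 2.0, 5.0],
})
toy_corr = toy.corr()

values = toy_corr.to_numpy()
tol = 1e-9  # tolerance for floating-point error

# Self-correlations on the diagonal should all be 1
diagonal_ok = np.allclose(np.diag(values), 1.0)

# Every correlation should lie in [-1, 1]
# (NaN entries, which do not occur in this toy data, would fail this check)
range_ok = bool(((values >= -1 - tol) & (values <= 1 + tol)).all())

print(diagonal_ok, range_ok)
```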

### Step 4 - Identify the top 5 neighbors

Identify the top 5 neighbors (the users with the 5 largest, positive correlations) for users 3867 and 89. For example, if the target user were #3712, the closest neighbors are 2824 (corr: 0.46291), 3867 (corr: 0.400275), 5062 (corr: 0.247693), 442 (corr: 0.22713), and 3853 (corr: 0.19366). Don't forget to exclude the target user (corr: 1.0) from your possible selections.

In [33]:
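def top_n_neighbors(user, top_n):
    """
    Return the users most correlated with `user`, excluding `user` itself.

    Parameters
    ----------
    user : str
        User ID as a column label, e.g. '3712'.
    top_n : int
        Number of rows to select before the target user is dropped.
        The user's self-correlation (1.0) is normally among them,
        so pass 6 to get 5 neighbors.

    Returns
    -------
    pandas.Series
        Correlations with `user`, indexed by neighbor ID, largest first.
    """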
    return corr_df.nlargest(top_n, [user])[user].drop(user)

In [225]:
top_n_neighbors('3712', 6)

2824    0.462910
3867    0.400275
5062    0.247693
442     0.227130
3853    0.193660
Name: 3712, dtype: float64
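
Step 4 asks for the largest *positive* correlations. A variant that drops the target user first and filters out non-positive values might look like the sketch below. The function name `top_positive_neighbors` and the toy correlation matrix are hypothetical and are not part of the notebook.

```python
import pandas as pd

def top_positive_neighbors(corr, user, n=5):
    """Return up to n users with the largest positive correlation to `user`."""
    candidates = corr[user].drop(user)        # exclude the target user
    candidates = candidates[candidates > 0]   # keep positive correlations only
    return candidates.nlargest(n)

# Toy symmetric correlation matrix (hypothetical values)
users = ["a", "b", "c", "d"]
toy_corr = pd.DataFrame(
    [[1.0, 0.6, -0.2, 0.3],
     [0.6, 1.0, 0.1, -0.4],
     [-0.2, 0.1, 1.0, 0.5],
     [0.3, -0.4, 0.5, 1.0]],
    index=users, columns=users,
)

neighbors = top_positive_neighbors(toy_corr, "a", n=5)
print(neighbors)
```

With this variant the caller asks for exactly `n` neighbors, and a user with fewer than `n` positive correlations simply gets a shorter list.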

### Step 5 - Compute the predictions for each movie for one user
Compute the predictions for each movie for users 3867 and 89 by taking the correlation-weighted average of the ratings of the top-five neighbors (for each target user) for each movie. The formal formula for correlation-weighted average is <br>

## $ \frac{\sum_{n=1}^{5} r_{n} w_{n}}{\sum_{n=1}^{5} w_{n}} $ <br> 

Remember, you will need to make sure that your weight for each contributed rating is the `user-user correlation` when that neighbor has rated the movie, but `0` when the neighbor has not rated the movie (note the same for weight in denominator).
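
As a worked example of this rule, the sketch below recomputes one prediction for user 3712 from the neighbor correlations and ratings reported elsewhere in this notebook. The helper name `weighted_average` is an assumption made for this illustration.

```python
import numpy as np

def weighted_average(ratings, weights):
    """Correlation-weighted average over the neighbors who rated the movie."""
    ratings = np.asarray(ratings, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rated = ~np.isnan(ratings)          # neighbors without a rating get weight 0
    denominator = weights[rated].sum()
    if denominator == 0:
        return np.nan                   # no neighbor rated this movie
    return (ratings[rated] * weights[rated]).sum() / denominator

# Neighbors of user 3712, in order: 2824, 3867, 5062, 442, 3853
weights = [0.462910, 0.400275, 0.247693, 0.227130, 0.193660]
# Their ratings for "105: Back to the Future (1985)"; NaN = not rated
ratings = [np.nan, np.nan, 4.5, 5.0, np.nan]

pred = weighted_average(ratings, weights)
print(f"{pred:.3f}")  # 4.739
```

Only neighbors 5062 and 442 rated the movie, so the prediction is (4.5 × 0.247693 + 5.0 × 0.22713) / (0.247693 + 0.22713).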


### Step 6 - Implement for quiz
Submit 12 values for this assignment as indicated below. You will be submitting the top three movie IDs and the predicted ratings (to three decimal places) for each user. In a real recommender system, we’d be excluding movies the user has already rated, but do not do this here. Indeed, you should look to see what the user’s rating (if any) is for the top-recommended movies. For example, if the user ID was 3712, the correct submission would be:
- Top Movie: 641 Prediction: 5.000
- 2nd Movie: 603 Prediction: 4.856
- 3rd Movie: 105 Prediction: 4.739 

And if the user ID were 3525, there would be a three-way tie among:
Movies: 238, 194, and 38 (all with a prediction of 5.000)
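
Ties like this are easy to miss, because `nlargest` returns exactly `n` rows by default. A small sketch on made-up predictions (the movie labels `m1` to `m5` are hypothetical) shows how `keep='all'` surfaces every tied entry:

```python
import pandas as pd

# Toy predictions (hypothetical movie labels), with a three-way tie at the top
preds = pd.Series({"m1": 5.0, "m2": 4.2, "m3": 5.0, "m4": 5.0, "m5": 3.9})

# keep='first' (the default) returns exactly n rows, hiding part of the tie
top_first = preds.nlargest(1)

# keep='all' returns every entry tied with the n-th largest value
top_all = preds.nlargest(1, keep="all")

print(len(top_first))
print(sorted(top_all.index))  # ['m1', 'm3', 'm4']
```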

In [258]:
def pred_rating(userid, moview_row):
    top_n = top_n_neighbors(userid, 6)
    pred_df = moview_row[['Unnamed: 0']+top_n.index.tolist()].copy()
    pred_df['numerator'] = pred_df[top_n.index.tolist()].fillna(0).dot(top_n)
    # include this neighbor's weight in denominator only when this neighbor rated this movie
    pred_df['denominator'] = pred_df[top_n.index.tolist()].notnull().dot(top_n)
    pred_df['pred'] = (pred_df['numerator']/pred_df['denominator']).round(3)
    return pred_df.nlargest(3, 'pred')

## Part 2 - Normalization
Next, you will repeat the computation but this time you will normalize the scores to adjust for different users’ rating scales.

### Step 1 - Repeat step 5 from part 1. This time, however, use the normalization formula: <br>

## $ \bar{r_{u}}+\frac{\sum_{n=1}^{5}(r_{n}-\bar{r_{n}})w_{n}}{\sum_{n=1}^{5}w_{n}} $ <br>

Note: The two means normalize the computation to adjust for different users’ rating scales.
## $ \bar{r_{u}} $ avg rating of the target user 
## $ \bar{r_{n}} $ avg rating of neighbor $ n $ 
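
The normalized prediction can be sketched on toy numbers as follows: each neighbor's rating is centered on that neighbor's own mean, the centered values are averaged with the correlations as weights, and the target user's mean is added back. The function name `normalized_prediction` and all values are hypothetical.

```python
import numpy as np

def normalized_prediction(target_mean, ratings, neighbor_means, weights):
    """Target mean plus the weighted average of mean-centered neighbor ratings."""
    ratings = np.asarray(ratings, dtype=float)
    neighbor_means = np.asarray(neighbor_means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rated = ~np.isnan(ratings)          # only neighbors who rated the movie count
    denominator = weights[rated].sum()
    if denominator == 0:
        return np.nan                   # no neighbor rated this movie
    offsets = ratings[rated] - neighbor_means[rated]
    return target_mean + (offsets * weights[rated]).sum() / denominator

# Toy example (hypothetical numbers): the target user's mean rating is 4.0
pred = normalized_prediction(
    target_mean=4.0,
    ratings=[5.0, np.nan, 3.0],         # the second neighbor has not rated the movie
    neighbor_means=[4.0, 3.5, 3.5],
    weights=[0.5, 0.4, 0.25],
)
print(f"{pred:.3f}")  # 4.500
```

Here the centered ratings are +1.0 and -0.5, so the weighted offset is (1.0 × 0.5 - 0.5 × 0.25) / 0.75 = 0.5, which is added to the target mean of 4.0.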


### Step 2 - Implement for quiz
Submit 12 values for this assignment as indicated below. You will be submitting the top three movie IDs and the predicted ratings (to three decimal places) for users 3867 and 89. In a real recommender system, we’d be excluding movies the user has already rated, but do not do this here. Remember, you do not need to re-compute the correlations, just use the existing correlations but normalize the ratings being averaged by subtracting each neighbor’s mean rating from each of their ratings (and add the target user’s mean back into the total). <br>
For example, if the user ID was 3712, the correct submission would be:
- Top Movie: 641 Prediction: 5.900
- 2nd Movie: 603 Prediction: 5.546
- 3rd Movie: 105 Prediction: 5.501

The top three movies for user `3712` are the same with and without normalization. **Note that it is possible to have movie predictions outside the five-star scale; in this case this is because `3712` rates movies very high to begin with, and their neighbors have wider ranges (and think these movies are very high in those ranges).** <br>

**Also, these top movie predictions are interesting in other ways. The top movie was only rated by one neighbor, but was so highly rated (1.4 above that neighbor's mean) that it stands out.**

In [280]:
def pred_rating_norm(userid, moview_row):
    top_n = top_n_neighbors(userid, 6)
    ru = moview_row[userid].mean()
    print(f'User {userid} has a mean rating {ru.round(2)}')
    pred_df = moview_row[['Unnamed: 0']+top_n.index.tolist()].copy()
    pred_df['numerator'] = pred_df[top_n.index.tolist()] \
                        .sub(pred_df[top_n.index.tolist()].mean(axis=0),axis=1).fillna(0).dot(top_n)
    # include this neighbor's weight in denominator only when this neighbor rated this movie
    pred_df['denominator'] = pred_df[top_n.index.tolist()].notnull().dot(top_n)
    pred_df['pred'] = (ru+(pred_df['numerator']/pred_df['denominator'])).round(3)
    return pred_df.nlargest(3, 'pred')

In [281]:
print(f'User 3712 w/o norm')
pred_rating('3712', moview_row)

User 3712 w/o norm


| Unnamed: 0.1 | Unnamed: 0 | 2824 | 3867 | 5062 | 442 | 3853 | numerator | denominator | pred |
|---|---|---|---|---|---|---|---|---|---|
| 55 | 641: Requiem for a Dream (2000) | | | | 5.0 | | 1.135649 | 0.22713 | 5.0 |
| 50 | 603: The Matrix (1999) | 5.0 | 5.0 | 4.5 | 5.0 | 4.5 | 7.43766 | 1.531667 | 4.856 |
| 11 | 105: Back to the Future (1985) | | | 4.5 | 5.0 | | 2.250269 | 0.474823 | 4.739 |


In [285]:
print(f'User 3712 w/ norm')
pred_rating_norm('3712', moview_row)

User 3712 w/ norm
User 3712 has a mean rating 4.5


| Unnamed: 0.1 | Unnamed: 0 | 2824 | 3867 | 5062 | 442 | 3853 | numerator | denominator | pred |
|---|---|---|---|---|---|---|---|---|---|
| 55 | 641: Requiem for a Dream (2000) | | | | 5.0 | | 0.317982 | 0.22713 | 5.9 |
| 50 | 603: The Matrix (1999) | 5.0 | 5.0 | 4.5 | 5.0 | 4.5 | 1.60146 | 1.531667 | 5.546 |
| 11 | 105: Back to the Future (1985) | | | 4.5 | 5.0 | | 0.475101 | 0.474823 | 5.501 |


Compare `3525` with and without normalization. Note how only one of the top three with normalization, `238: The Godfather (1972)`, also appears in the top three without normalization! And notice that the (perhaps surprisingly) strong prediction for `134: O Brother Where Art Thou? (2000)` is again the result of having just a single neighbor’s rating (and that being a neighbor who really liked the movie).

In [286]:
print(f'User 3525 w/o norm')
pred_rating('3525', moview_row)

User 3525 w/o norm


| Unnamed: 0.1 | Unnamed: 0 | 3556 | 89 | 860 | 918 | 5136 | numerator | denominator | pred |
|---|---|---|---|---|---|---|---|---|---|
| 6 | 38: Eternal Sunshine of the Spotless Mind (2004) | | 5.0 | | 5.0 | 5.0 | 6.015594 | 1.203119 | 5.0 |
| 25 | 194: Amelie (2001) | | 5.0 | | | 5.0 | 4.177756 | 0.835551 | 5.0 |
| 27 | 238: The Godfather (1972) | 5.0 | 5.0 | 5.0 | | 5.0 | 8.491973 | 1.698395 | 5.0 |


In [287]:
print(f'User 3525 w/ norm')
pred_rating_norm('3525', moview_row)

User 3525 w/ norm
User 3525 has a mean rating 3.71


| Unnamed: 0.1 | Unnamed: 0 | 3556 | 89 | 860 | 918 | 5136 | numerator | denominator | pred |
|---|---|---|---|---|---|---|---|---|---|
| 27 | 238: The Godfather (1972) | 5.0 | 5.0 | 5.0 | | 5.0 | 1.776457 | 1.698395 | 4.76 |
| 38 | 424: Schindler's List (1993) | 5.0 | | | | 4.5 | 0.793736 | 0.835767 | 4.663 |
| 17 | 134: O Brother Where Art Thou? (2000) | 4.5 | | | | | 0.414722 | 0.475711 | 4.585 |


In [288]:
print(f'User 3867 w/o norm')
pred_rating('3867', moview_row)

User 3867 w/o norm


| Unnamed: 0.1 | Unnamed: 0 | 2492 | 3853 | 2486 | 3712 | 2288 | numerator | denominator | pred |
|---|---|---|---|---|---|---|---|---|---|
| 77 | 1891: Star Wars: Episode V - The Empire Strike... | 5.0 | | 4.5 | | | 4.358878 | 0.915675 | 4.76 |
| 21 | 155: The Dark Knight (2008) | 5.0 | 5.0 | 4.5 | | 3.5 | 8.008926 | 1.759641 | 4.551 |
| 16 | 122: The Lord of the Rings: The Return of the ... | 5.0 | 4.0 | 4.5 | 5.0 | 4.0 | 9.736117 | 2.159916 | 4.508 |


In [289]:
print(f'User 3867 w/ norm')
pred_rating_norm('3867', moview_row)

User 3867 w/ norm
User 3867 has a mean rating 3.66


| Unnamed: 0.1 | Unnamed: 0 | 2492 | 3853 | 2486 | 3712 | 2288 | numerator | denominator | pred |
|---|---|---|---|---|---|---|---|---|---|
| 77 | 1891: Star Wars: Episode V - The Empire Strike... | 5.0 | | 4.5 | | | 1.450402 | 0.915675 | 5.246 |
| 21 | 155: The Dark Knight (2008) | 5.0 | 5.0 | 4.5 | | 3.5 | 2.103179 | 1.759641 | 4.857 |
| 8 | 77: Memento (2000) | 4.0 | 4.5 | 4.5 | | 5.0 | 1.964225 | 1.759641 | 4.778 |


In [290]:
print(f'User 89 w/o norm')
pred_rating('89', moview_row)

User 89 w/o norm


| Unnamed: 0.1 | Unnamed: 0 | 4809 | 5136 | 860 | 5062 | 3525 | numerator | denominator | pred |
|---|---|---|---|---|---|---|---|---|---|
| 27 | 238: The Godfather (1972) | 5.0 | 5.0 | 5.0 | | 4.5 | 10.98988 | 2.245525 | 4.894 |
| 33 | 278: The Shawshank Redemption (1994) | 5.0 | 5.0 | | 4.5 | 5.0 | 10.899255 | 2.23245 | 4.882 |
| 64 | 807: Seven (a.k.a. Se7en) (1995) | 5.0 | 5.0 | 4.5 | | 4.5 | 10.720347 | 2.245525 | 4.774 |


In [291]:
print(f'User 89 w/ norm')
pred_rating_norm('89', moview_row)

User 89 w/ norm
User 89 has a mean rating 4.4


| Unnamed: 0.1 | Unnamed: 0 | 4809 | 5136 | 860 | 5062 | 3525 | numerator | denominator | pred |
|---|---|---|---|---|---|---|---|---|---|
| 27 | 238: The Godfather (1972) | 5.0 | 5.0 | 5.0 | | 4.5 | 2.076166 | 2.245525 | 5.322 |
| 33 | 278: The Shawshank Redemption (1994) | 5.0 | 5.0 | | 4.5 | 5.0 | 1.92881 | 2.23245 | 5.261 |
| 32 | 275: Fargo (1996) | | 5.0 | | | 4.5 | 0.875687 | 1.037944 | 5.241 |


#### Selected Quiz Questions
Q: Which of the following would most indicate a situation where user-user collaborative filtering would be strongly preferable to content-based filtering (i.e., filtering based on user preferences of keywords or attributes)? <br>
A: The items being recommended don’t have good attributes or keywords to describe them (e.g., user-submitted children’s drawings without tags).<br>

Q: Resnick talked about resistance of collaborative filtering recommender systems to attacks from fake accounts (called sybils). Which of these statements about this problem is true?<br>
A: In order to be resistant to attacks from more sybils, you lose predictive power from genuine raters.<br>

Q: User-user collaborative filtering depends on certain assumptions. Which of the following IS NOT a requirement for a successful user-user collaborative filtering system?<br>
- [x] Users mostly have similar tastes on a set of popular items, though they may have individually different tastes on unpopular items.
- [√] User tastes must either be generally stable (individually) or if changing, they change in sync with other user’s tastes.
- [√] Past agreement between users is predictive of future agreement -- i.e., if you and I have agreed on items before, we mostly still do now.
- [√] The domain in which we are performing collaborative filtering is scoped such that people who agree within one part of that domain generally agree within other parts of the domain.

Q: Golbeck explained that trust-based recommenders differ from similarity-based collaborative filtering in all of the following ways EXCEPT which one? (Wrong answer...)
- [√] Trust-based systems have an underlying graph of user trust, while similarity-based systems don’t need a graph because they only use pairwise similarity scores.
- [√] Trust-based systems are harder to get going, because it is often challenging to get trust data.
- [√] Trust-based systems only consider ratings from users that the target user has a direct trust relationship with, and thus often use many fewer ratings in computing a prediction or recommendation.
- [x] Similarity-based collaborative filtering treats all rated items as roughly equivalent in evaluating neighbors; trust-based systems may give very strong weight to the items that a user is most passionate about.

