## Collaborative Filtering

- Relationships exist between products and user interests
- Collaborative filtering tries to find these relationships
- **User-based filtering**:
    - Based on a user's neighborhood (other users with similar tastes)
- **Item-based filtering**:
    - Based on items' similarity


### User-based Filtering
- If user A likes x1, x2 and x3, then user B, who is close to user A, will probably also like movies in the set [x1, x2, x3]
- Conversely, if user B likes x4 and user A hasn't watched it, the system recommends x4 to user A.

#### User-ratings Matrix

In [1]:
import numpy as np
import pandas as pd

ratings_dict = {
    "user": ["u1", "u2", "u3", "ActiveUser"],
    "m1": [9, 2, 5, np.nan],
    "m2": [6, 10, 9, 10],
    "m3": [8, 6, np.nan, 7],
    "m4": [4, np.nan, 10, 8],
    "m5": [np.nan, 8, 7, np.nan]
}

# ActiveUser hasn't watched m1 and m5. How can we decide whether to recommend them?

df = pd.DataFrame(ratings_dict)
df

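|   | user | m1 | m2 | m3 | m4 | m5 |
|---|------|----|----|----|----|----|
| 0 | u1 | 9.0 | 6 | 8.0 | 4.0 | NaN |
| 1 | u2 | 2.0 | 10 | 6.0 | NaN | 8.0 |
| 2 | u3 | 5.0 | 9 | NaN | 10.0 | 7.0 |
| 3 | ActiveUser | NaN | 10 | 7.0 | 8.0 | NaN |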


#### Calculating Similarity
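- To measure how similar two users are, compare their ratings on the movies that both of them have rated
- Let's assume that the AU-u1 similarity is 0.4, the AU-u2 similarity is 0.9 and the AU-u3 similarity is 0.7 (AU = ActiveUser)
- Each rating is multiplied by the similarity of the user who gave it. The resulting **Weighted Ratings Matrix** for m1 and m5 is built from these neighbor ratings: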

In [2]:
df = df[["user", "m1", "m5"]]
df = df.iloc[:-1]
df

|   | user | m1 | m5 |
|---|------|----|----|
| 0 | u1 | 9.0 | NaN |
| 1 | u2 | 2.0 | 8.0 |
| 2 | u3 | 5.0 | 7.0 |
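
The similarity values 0.4, 0.9 and 0.7 are assumed for illustration. As a rough sketch of how a similarity could be derived from co-rated movies, the snippet below uses cosine similarity. This is one common choice, not necessarily the measure behind the assumed values, so its results will not match them. The names `ratings` and `cosine_on_corated` are illustrative.

```python
import numpy as np
import pandas as pd

# Same ratings as the user-ratings matrix, indexed by user
ratings = pd.DataFrame(
    {
        "m1": [9, 2, 5, np.nan],
        "m2": [6, 10, 9, 10],
        "m3": [8, 6, np.nan, 7],
        "m4": [4, np.nan, 10, 8],
        "m5": [np.nan, 8, 7, np.nan],
    },
    index=["u1", "u2", "u3", "ActiveUser"],
)


def cosine_on_corated(ratings, user_a, user_b):
    """Cosine similarity using only the movies both users have rated."""
    pair = ratings.loc[[user_a, user_b]].dropna(axis=1)  # keep co-rated columns
    if pair.shape[1] == 0:
        return np.nan  # no overlap, so similarity is undefined
    a = pair.loc[user_a].to_numpy(dtype=float)
    b = pair.loc[user_b].to_numpy(dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


similarities = {
    u: cosine_on_corated(ratings, "ActiveUser", u) for u in ["u1", "u2", "u3"]
}
```

On this data all three values come out above 0.9, because raw cosine on all-positive ratings tends to be high. Mean-centering each user's ratings first (as in Pearson correlation) is a common way to spread the values out.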


In [4]:
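# Copy so that the unweighted ratings in df stay intact
weighted_df = df.copy()
# Multiply each user's ratings by that user's similarity to ActiveUser
similarity = [0.4, 0.9, 0.7]  # AU-u1, AU-u2, AU-u3
weighted_df["m1"] = (df["m1"] * similarity).round(1)
weighted_df["m5"] = (df["m5"] * similarity).round(1)
weighted_df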

|   | user | m1 | m5 |
|---|------|----|----|
| 0 | u1 | 3.6 | NaN |
| 1 | u2 | 1.8 | 7.2 |
| 2 | u3 | 3.5 | 4.9 |


In [8]:
# Add up the weighted ratings for each movie:
sum_dict = {
    "rated_by": [["u1", "u2", "u3"], ["u2", "u3"]],
    "m1": [8.9, np.nan],
    "m5": [np.nan, 12.1],
    "weight_sum": [2, 1.6]  # sum of the similarities of the users who rated this movie
}

sum_similarity_index = pd.DataFrame(sum_dict)
sum_similarity_index

|   | rated_by | m1 | m5 | weight_sum |
|---|----------|----|----|------------|
| 0 | [u1, u2, u3] | 8.9 | NaN | 2.0 |
| 1 | [u2, u3] | NaN | 12.1 | 1.6 |


In [13]:
# Normalize, because m1 was rated by 3 users while m5 was rated by only 2,
# so the raw sums are not comparable.
# Normalization: sum of weighted ratings / sum of similarities

m1_recommend = round(8.9 / 2, 1)
m5_recommend = round(12.1 / 1.6, 1)

recommendation_matrix = pd.DataFrame({"user": ["ActiveUser"], "m1": [m1_recommend], "m5": [m5_recommend]})
recommendation_matrix

|   | user | m1 | m5 |
|---|------|----|----|
| 0 | ActiveUser | 4.5 | 7.6 |
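
The hand-computed steps (weight, sum, normalize) can be folded into one small function. The following is a minimal sketch that reuses the assumed similarities 0.4, 0.9 and 0.7. The names `neighbor_ratings` and `predict_ratings` are illustrative and not part of any library.

```python
import numpy as np
import pandas as pd

# Ratings given by the neighbors for the movies ActiveUser has not rated
neighbor_ratings = pd.DataFrame(
    {"m1": [9, 2, 5], "m5": [np.nan, 8, 7]},
    index=["u1", "u2", "u3"],
)
# Assumed similarities to ActiveUser, taken from the text
similarity = pd.Series({"u1": 0.4, "u2": 0.9, "u3": 0.7})


def predict_ratings(neighbor_ratings, similarity):
    """Similarity-weighted average of the neighbors' ratings, per movie."""
    # rating * similarity, row by row; missing ratings stay NaN
    weighted = neighbor_ratings.mul(similarity, axis=0)
    weighted_sum = weighted.sum(axis=0, skipna=True)
    # Only count the similarity of users who actually rated the movie
    rated_mask = neighbor_ratings.notna().astype(float)
    weight_sum = rated_mask.mul(similarity, axis=0).sum(axis=0)
    return weighted_sum / weight_sum


predicted = predict_ratings(neighbor_ratings, similarity)
```

Before rounding, `predicted` holds 4.45 for m1 and 7.5625 for m5, which round to the 4.5 and 7.6 shown in the recommendation matrix.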


### Item-based Filtering
- Looks for similarity between items, not between users.
    - So it does not need to compare every choice of every user to calculate user-to-user similarity
- Assume users u1, u2 and u3
- u1 and u2 both like movies m1 and m3
- If u3 likes m3 and m4, the system will recommend m1 to u3, because m1 tends to be liked together with m3.
    - "People who liked m3 also liked m1", says Netflix.
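
The example can be sketched with a tiny like/no-like matrix. The snippet below is only an illustration under assumptions: the 0/1 encoding, the use of cosine similarity between item columns, and the names `likes`, `item_sim` and `recommend_for` are all choices made here rather than something the notes prescribe.

```python
import numpy as np
import pandas as pd

# 1 = user liked the movie, 0 = no signal (hypothetical encoding of the example)
likes = pd.DataFrame(
    {"m1": [1, 1, 0], "m3": [1, 1, 1], "m4": [0, 0, 1]},
    index=["u1", "u2", "u3"],
)

# Cosine similarity between item columns
item_matrix = likes.to_numpy(dtype=float)
norms = np.linalg.norm(item_matrix, axis=0)
item_sim = pd.DataFrame(
    item_matrix.T @ item_matrix / np.outer(norms, norms),
    index=likes.columns,
    columns=likes.columns,
)


def recommend_for(user, likes, item_sim):
    """Score each unseen movie by its summed similarity to the movies the user liked."""
    liked = likes.columns[(likes.loc[user] == 1).to_numpy()]
    unseen = likes.columns[(likes.loc[user] == 0).to_numpy()]
    scores = item_sim.loc[unseen, liked].sum(axis=1)
    return scores.sort_values(ascending=False)


u3_scores = recommend_for("u3", likes, item_sim)
```

For u3 the only unseen movie is m1, and its score comes entirely from its similarity to m3, since m1 and m4 share no users.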

### Challenges of Collaborative Filtering

#### Data Sparsity
- Occurs when users rate only a limited number of items
- If there is an insufficient number of rated items, how can we
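
A quick way to see how sparse a ratings matrix is: count the missing entries, and count how many movies each pair of users has in common. This sketch reuses the small user-ratings matrix from the user-based example. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame(
    {
        "m1": [9, 2, 5, np.nan],
        "m2": [6, 10, 9, 10],
        "m3": [8, 6, np.nan, 7],
        "m4": [4, np.nan, 10, 8],
        "m5": [np.nan, 8, 7, np.nan],
    },
    index=["u1", "u2", "u3", "ActiveUser"],
)

# Fraction of user-movie cells that have no rating
n_missing = int(ratings.isna().sum().sum())
sparsity = n_missing / ratings.size

# Number of movies each pair of users has both rated
rated = ratings.notna().astype(int)
co_rated = rated @ rated.T
```

Here 5 of the 20 cells are empty, so `sparsity` is 0.25, and ActiveUser shares only 2 or 3 rated movies with each neighbor. In a real catalog the overlap between two users is often much smaller, which makes similarity estimates less reliable.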

#### Cold start
- How do we recommend something to a new user who has no ratings yet?
- How do we recommend a new item that hasn't received any ratings?

#### Scalability
- The number of users or items keeps increasing
- Performance degrades because the number of similarity calculations grows along with them
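
To get a feel for the growth, assume similarity is computed for every pair of users (a simplifying assumption; real systems often avoid the full all-pairs computation). The number of pairs is then n(n-1)/2. `user_pairs` is an illustrative helper.

```python
from math import comb


def user_pairs(n_users):
    """Number of distinct user pairs, i.e. similarity values to compute."""
    return comb(n_users, 2)


pair_counts = {n: user_pairs(n) for n in (4, 1_000, 1_000_000)}
```

Four users need 6 comparisons, a thousand users need 499,500, and a million users need roughly 5 × 10^11.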