## Chapter 22. Recommender Systems

Another common data problem = producing recommendations of some sort. Netflix recommends movies, Amazon recommends products, Twitter recommends users to follow. We’ll look at a data set of `users_interests` + think about the problem of recommending new interests to a user based on their currently specified interests:

In [1]:
users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

### Manual Curation
Before the Internet, you’d go to the library for book recommendations, where a librarian was available to suggest books relevant to your interests or similar to books you liked. Given our limited # of users + interests, it would be easy to spend an afternoon manually recommending interests for each user. But this method doesn’t scale well + is limited by your personal knowledge + imagination.

### Recommending What’s Popular

1 easy approach = simply recommend what’s popular:

In [2]:
from collections import Counter

# count how many users list each interest, most common first
popular_interests = Counter(interest
                            for user_interests in users_interests
                            for interest in user_interests).most_common()

# entries include (order among equal counts may vary):
# ('Python', 4), ('R', 4), ('Java', 3), ('regression', 3),
# ('statistics', 3), ('probability', 3)

Having computed this, we can just suggest to a user the most popular interests that they’re not already interested in:

In [3]:
def most_popular_new_interests(user_interests, max_results=5):
    # keep popularity order, skipping anything the user already has
    suggestions = [(interest, frequency)
                   for interest, frequency in popular_interests
                   if interest not in user_interests]
    return suggestions[:max_results]

# new suggested interests for user 0
print(most_popular_new_interests(users_interests[0], 5))
# new suggested interests for user 2
print(most_popular_new_interests(users_interests[2], 5))

[('Python', 4), ('R', 4), ('statistics', 3), ('regression', 3), ('probability', 3)]
[('R', 4), ('Big Data', 3), ('HBase', 3), ('Java', 3), ('statistics', 3)]

“Lots of people are interested in Python, so maybe you should be too” is not a very compelling sales pitch. If someone is brand new to our site + we don’t know anything about them, that’s possibly the best we can do. Let’s see how we can do better by basing each user’s recommendations on their own interests.

### User-Based Collaborative Filtering
1 way of taking a user’s interests into account = look for users who are somehow similar to them, + then suggest the things those users are interested in. To do that, we need a way to measure how similar 2 users are, say w/ **cosine similarity**: given 2 vectors, `v` and `w`, it measures the “angle” between them.

In [4]:
import sys
import math

sys.path.insert(0, './../../../00_DataScience/DSFromScratch/code')

from linear_algebra import dot

def cosine_similarity(v,w):
    return dot(v,w) / math.sqrt(dot(v,v) * dot(w,w))

If `v` + `w` point in the same direction, the numerator + denominator are equal, + their cosine similarity = 1. If `v` + `w` point in opposite directions, their cosine similarity = -1. And if `v` = 0 wherever `w` is nonzero (+ vice versa), then `dot(v, w)` = 0 + so the cosine similarity = 0.
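A minimal sketch of those 3 cases on small hand-picked vectors. The `dot` defined here is a stand-in for the one imported from `linear_algebra`, assumed to be the usual sum of componentwise products:

```python
import math

def dot(v, w):
    """Sum of componentwise products (stand-in for linear_algebra.dot)."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def cosine_similarity(v, w):
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))

same = cosine_similarity([1, 2], [2, 4])        # same direction
opposite = cosine_similarity([1, 2], [-1, -2])  # opposite directions
orthogonal = cosine_similarity([1, 0], [0, 1])  # v is 0 wherever w is nonzero

print(same, opposite, orthogonal)
```

Note that the function divides by zero if either vector is all 0s, so a user w/ no interests would need special handling.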

We’ll apply this to vectors of 0s and 1s, each vector `v` representing 1 user’s interests. `v[i]` = 1 if the user specified the `i`th interest, 0 otherwise. Accordingly, “similar users” will mean *“users whose interest vectors most nearly point in the same direction.”*

Users w/ identical interests will have similarity = 1; users w/ no interests in common will have similarity = 0. Otherwise, similarity will fall in between, w/ #'s closer to 1 indicating “very similar” + #'s closer to 0 indicating “not very similar.”
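For 0/1 vectors the dot product just counts shared interests, so the same similarity can be computed directly from sets. A sketch under that observation (the helper name `binary_cosine` is made up for illustration; it assumes each list is non-empty + has no duplicates):

```python
import math

def binary_cosine(interests_a, interests_b):
    """Cosine similarity of the 0/1 vectors implied by two interest lists."""
    a, b = set(interests_a), set(interests_b)
    # shared interests / sqrt(#interests of a * #interests of b)
    return len(a & b) / math.sqrt(len(a) * len(b))

user_0 = ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"]
user_9 = ["Hadoop", "Java", "MapReduce", "Big Data"]
user_2 = ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"]

identical = binary_cosine(user_0, user_0)   # same interests
partial = binary_cosine(user_0, user_9)     # 3 shared: 3 / sqrt(7 * 4)
disjoint = binary_cosine(user_0, user_2)    # nothing shared

print(identical, partial, disjoint)
```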

A good place to start = collecting the *known* interests + (implicitly) assigning indices to them: use a **set comprehension** to find the unique interests, put them in a list, + then sort them. The 1st interest in the resulting list will be interest 0, and so on:

In [5]:
unique_interests = sorted(list({interest
                               for user_interests in users_interests
                               for interest in user_interests}))
print(unique_interests[:5])

['Big Data', 'C++', 'Cassandra', 'HBase', 'Hadoop']
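`sorted` compares strings by code point, so capitalized names come before lowercase ones + 'HBase' lands before 'Hadoop'. A small sketch on a sample of the interests; the `interest_index` lookup dict is a hypothetical helper, not something the notebook defines:

```python
sample = ["pandas", "Hadoop", "R", "HBase", "C++"]

# default ordering: uppercase letters sort before lowercase ones
ordered = sorted(sample)

# case-insensitive alternative, if a more "alphabetical" order is wanted
ordered_ignoring_case = sorted(sample, key=str.casefold)

# hypothetical helper: map each interest to its (implicit) index
interest_index = {interest: i for i, interest in enumerate(ordered)}

print(ordered)
print(ordered_ignoring_case)
print(interest_index["Hadoop"])
```

Either ordering works for the vectors that follow, as long as the same list is used everywhere.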


Next = produce an “interest” vector of 0s + 1s for each user by iterating over `unique_interests`, substituting a 1 if the user has each interest + a 0 if not:

In [6]:
def user_interest_vector(user_interests):
    """Given a list of interests, produce a vector whose ith element = 1
    if unique_interests[i] is in the list, 0 otherwise"""
    return [1 if interest in user_interests else 0
            for interest in unique_interests]

Then create a matrix of user interests simply by `map`-ping this function over the list of lists of interests:

In [7]:
user_interest_matrix = list(map(user_interest_vector,users_interests))
for row in user_interest_matrix:
    print(row,"\n")

[1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 

[0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0] 

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0] 

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] 

[0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0] 

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1] 

[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0] 

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 

[

B/c we have a small data set, it’s no problem to compute **pairwise similarities** between all users:

In [8]:
user_similarities = [[cosine_similarity(interest_vector_i,interest_vector_j)
                      for interest_vector_j in user_interest_matrix]
                     for interest_vector_i in user_interest_matrix]

`user_similarities[i][j]` gives similarity between users i + j.

In [9]:
print(user_similarities[0][9])
print(user_similarities[0][8])

0.5669467095138409
0.1889822365046136


Users 0 and 9 share interests in Hadoop, Java, + Big Data, while users 0 and 8 share only 1 interest, Big Data. In particular, `user_similarities[i]` = vector of user `i`’s similarities to every other user, which we can use to write a function that finds the users most similar to a given user (making sure not to include the user themself, nor any users w/ similarity = 0) + sorts the results from most to least similar:

In [10]:
def most_similar_users(user_id):
    # keep every other user whose similarity to user_id is nonzero
    pairs = [(other_user_id, similarity)
             for other_user_id, similarity
             in enumerate(user_similarities[user_id])
             if user_id != other_user_id and similarity > 0]

    # most similar first
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

In [11]:
print(most_similar_users(0))

[(9, 0.5669467095138409), (1, 0.3380617018914066), (8, 0.1889822365046136), (13, 0.1690308509457033), (5, 0.1543033499620919)]


How do we use this to suggest new interests to a user? For each interest, we can just add up the user similarities of the other users interested in it:


In [12]:
from collections import defaultdict

def user_based_suggestions(user_id,include_current_interests=False):
    # for each interest, add up the similarities of the other users who have it
    suggestions = defaultdict(float)
    for other_user_id, similarity in most_similar_users(user_id):
        for interest in users_interests[other_user_id]:
            suggestions[interest] += similarity
    
    # convert sums to sorted list
    suggestions = sorted(suggestions.items(),
                        key=lambda pair: pair[1], reverse = True)
    
    # include/exclude current interests
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion,weight)
                for suggestion,weight in suggestions
                if suggestion not in users_interests[user_id]]
    
for i in user_based_suggestions(0):
    print(i,"\n")

('MapReduce', 0.5669467095138409) 

('MongoDB', 0.50709255283711) 

('Postgres', 0.50709255283711) 

('NoSQL', 0.3380617018914066) 

('neural networks', 0.1889822365046136) 

('deep learning', 0.1889822365046136) 

('artificial intelligence', 0.1889822365046136) 

('databases', 0.1690308509457033) 

('MySQL', 0.1690308509457033) 

('Python', 0.1543033499620919) 

('R', 0.1543033499620919) 

('C++', 0.1543033499620919) 

('Haskell', 0.1543033499620919) 

('programming languages', 0.1543033499620919) 



These seem like pretty decent suggestions for someone whose stated interests are “Big Data” + database-related. (The weights aren’t intrinsically meaningful; they’re just used for ordering.)
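As a check on where a weight comes from: MongoDB is listed by users 1 + 13, so its weight for user 0 should be the sum of those 2 similarities. A sketch that recomputes it by hand from the interest counts (the `sim` helper is made up for illustration + uses the shared-count form of cosine similarity for 0/1 vectors):

```python
import math

def sim(shared, n_a, n_b):
    """Cosine similarity of two 0/1 vectors from their interest counts."""
    return shared / math.sqrt(n_a * n_b)

# user 0 has 7 interests; users 1 and 13 have 5 each
sim_user_1 = sim(2, 7, 5)    # shares Cassandra + HBase with user 0
sim_user_13 = sim(1, 7, 5)   # shares only HBase with user 0

mongodb_weight = sim_user_1 + sim_user_13
print(sim_user_1, sim_user_13, mongodb_weight)
```

The sum agrees w/ the `('MongoDB', 0.507...)` entry in the output above.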

This approach doesn’t work as well when the # of items gets very large. Recall the **curse of dimensionality**: in large-dimensional vector spaces, most vectors are very far apart (+ therefore point in very different directions). So when there are a large # of interests, the “most similar users” to a given user might not be similar at all.

* Ex: Amazon.com could attempt to ID similar users based on buying patterns, but most likely there’s no one in all the world whose purchase history looks even remotely like yours.
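A rough simulation of that effect, not taken from the notebook: it assumes every user picks the same # of interests uniformly at random (real users don’t), + all function names here are made up for illustration. As the # of possible interests grows, the average similarity between 2 random users shrinks toward 0:

```python
import math
import random

def random_interest_set(num_interests, interests_per_user, rng):
    """Pick interests_per_user distinct interest ids uniformly at random."""
    return set(rng.sample(range(num_interests), interests_per_user))

def binary_cosine(a, b):
    """Cosine similarity of the 0/1 vectors implied by two sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))

def average_similarity(num_interests, interests_per_user=5,
                       num_pairs=2000, seed=0):
    """Average similarity over num_pairs random pairs of users."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        a = random_interest_set(num_interests, interests_per_user, rng)
        b = random_interest_set(num_interests, interests_per_user, rng)
        total += binary_cosine(a, b)
    return total / num_pairs

for num_interests in (10, 100, 1000):
    print(num_interests, round(average_similarity(num_interests), 3))
```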

### Item-Based Collaborative Filtering
Alternative approach = compute similarities between interests *directly* + then generate suggestions for each user by aggregating interests similar to their current interests.

To start = transpose `user_interest_matrix` so rows = interests + columns = users:

In [13]:
interest_users_matrix = [[user_interest_vector[j]
                         for user_interest_vector in user_interest_matrix]
                        for j,_ in enumerate(unique_interests)]
for row in interest_users_matrix:
    print(row,"\n")

[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0] 

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0] 

[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 

[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] 

[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0] 

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] 

[1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0] 

[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0] 

[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0] 

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] 

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] 

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] 

[0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0] 

[0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0] 

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 

[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0] 

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] 

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 

[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 

Row `j` of `interest_users_matrix` = column `j` of `user_interest_matrix`: it has a 1 for each user w/ that interest + a 0 for each user w/out it. (Ex: `unique_interests[0]` = Big Data, so row 0 has 1s for users 0, 8, + 9.)
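The same transpose can also be written w/ `zip(*matrix)`. A sketch on a toy 2-user, 3-interest matrix (the toy values are made up), showing both forms give the same rows:

```python
# toy matrix: rows = users, columns = interests
toy_user_interest_matrix = [
    [1, 0, 1],  # user 0
    [0, 1, 1],  # user 1
]

# nested comprehension, same shape as the notebook's version
num_columns = len(toy_user_interest_matrix[0])
by_comprehension = [[row[j] for row in toy_user_interest_matrix]
                    for j in range(num_columns)]

# zip(*rows) yields one tuple per column; convert each to a list
by_zip = [list(column) for column in zip(*toy_user_interest_matrix)]

print(by_comprehension)
print(by_zip)
```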

Can now use `cosine_similarity` again: if precisely the same users are interested in 2 topics, their similarity = 1. If no user is interested in both topics, their similarity = 0:

In [14]:
interest_similarities = [[cosine_similarity(user_vector_i,user_vector_j)
                         for user_vector_j in interest_users_matrix]
                        for user_vector_i in interest_users_matrix]

# rank the other interests by similarity to a given one (interest 0 = Big Data)
def most_similar_interests(interest_id):
    similarities = interest_similarities[interest_id]
    pairs = [(unique_interests[other_interest_id], similarity)
            for other_interest_id, similarity in enumerate(similarities)
            if interest_id != other_interest_id and similarity > 0]
    
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

for i in most_similar_interests(0):
    print(i,"\n")

('Hadoop', 0.8164965809277261) 

('Java', 0.6666666666666666) 

('MapReduce', 0.5773502691896258) 

('Spark', 0.5773502691896258) 

('Storm', 0.5773502691896258) 

('Cassandra', 0.4082482904638631) 

('artificial intelligence', 0.4082482904638631) 

('deep learning', 0.4082482904638631) 

('neural networks', 0.4082482904638631) 

('HBase', 0.3333333333333333) 
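A few of these values can be reproduced by hand from which users list each interest (user ids read off `users_interests`). A sketch using the shared-count form of cosine similarity; `interest_cosine` + `users_with` are names made up for illustration:

```python
import math

# which users list each interest
users_with = {
    "Big Data": {0, 8, 9},
    "Hadoop": {0, 9},
    "Java": {0, 5, 9},
    "HBase": {0, 1, 13},
}

def interest_cosine(a, b):
    """Shared users / sqrt(#users of a * #users of b)."""
    users_a, users_b = users_with[a], users_with[b]
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))

hadoop = interest_cosine("Big Data", "Hadoop")  # 2 / sqrt(3 * 2)
java = interest_cosine("Big Data", "Java")      # 2 / sqrt(3 * 3)
hbase = interest_cosine("Big Data", "HBase")    # 1 / sqrt(3 * 3)

print(hadoop, java, hbase)
```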



Can now create recommendations for a user by summing up the similarities of the interests similar to their own:

In [15]:
from collections import defaultdict

def item_based_suggestions(user_id,include_current_interests=False):
    # add up the similarities of interests similar to the user's own
    suggestions = defaultdict(float)
    user_interest_vector = user_interest_matrix[user_id]
    
    for interest_id, is_interested in enumerate(user_interest_vector):
        if is_interested == 1:
            similar_interests = most_similar_interests(interest_id)
            for interest, similarity in similar_interests:
                suggestions[interest] += similarity
    
    # convert sums to sorted list by weight
    suggestions = sorted(suggestions.items(),
                        key=lambda pair: pair[1], reverse = True)
    
    # include/exclude current interests
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion,weight)
                for suggestion,weight in suggestions
                if suggestion not in users_interests[user_id]]
    
for i in item_based_suggestions(0):
    print(i,"\n")

('MapReduce', 1.861807319565799) 

('MongoDB', 1.3164965809277263) 

('Postgres', 1.3164965809277263) 

('NoSQL', 1.2844570503761732) 

('MySQL', 0.5773502691896258) 

('databases', 0.5773502691896258) 

('Haskell', 0.5773502691896258) 

('programming languages', 0.5773502691896258) 

('artificial intelligence', 0.4082482904638631) 

('deep learning', 0.4082482904638631) 

('neural networks', 0.4082482904638631) 

('C++', 0.4082482904638631) 

('Python', 0.2886751345948129) 

('R', 0.2886751345948129) 
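To see where the top weight comes from: only user 9 lists MapReduce, + user 9 shares Hadoop, Big Data, + Java w/ user 0, so MapReduce’s weight is the sum of its similarities to those 3 interests. A sketch recomputing it by hand (names here are made up for illustration):

```python
import math

# which users list each of user 0's interests, plus the candidate
users_with = {
    "Hadoop": {0, 9},
    "Big Data": {0, 8, 9},
    "HBase": {0, 1, 13},
    "Java": {0, 5, 9},
    "Spark": {0},
    "Storm": {0},
    "Cassandra": {0, 1},
    "MapReduce": {9},
}

def interest_cosine(a, b):
    """Shared users / sqrt(#users of a * #users of b)."""
    users_a, users_b = users_with[a], users_with[b]
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))

user_0_interests = ["Hadoop", "Big Data", "HBase", "Java",
                    "Spark", "Storm", "Cassandra"]

# contribution of each of user 0's interests to the MapReduce suggestion
contributions = {interest: interest_cosine(interest, "MapReduce")
                 for interest in user_0_interests}
mapreduce_weight = sum(contributions.values())

print(contributions)
print(mapreduce_weight)
```

Only Hadoop, Big Data, + Java contribute; the other 4 interests share no users w/ MapReduce.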



### For Further Exploration
* Crab is a framework for building recommender systems in Python.
* GraphLab also has a recommender toolkit.
* The Netflix Prize was a somewhat famous competition to build a better system for recommending movies to Netflix users.