# Recommender Systems

## Overview of a recommender

A recommender is a system that suggests what to show a user in order to encourage them to take action in your product or business.

## Basic recommenders

### Show me the code

Importing dependencies

In [4]:
from __future__ import division
import math, random
from collections import defaultdict, Counter

#### Data
Our dataset of Big Data interests:

In [1]:
users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

#### Some linear algebra utilities

Dot (scalar) product:

In [3]:
def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

Cosine similarity between two generic vectors:

In [11]:
def cosine_similarity(v, w):
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))
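
As a quick sanity check, here is a minimal sketch of how this similarity behaves on small 0/1 vectors like the ones used later. `dot` and `cosine_similarity` are repeated so the block runs on its own. `safe_cosine_similarity` is a hypothetical helper that is not part of the original code; it assumes an all-zero vector (a user with no interests) should count as similarity 0 rather than raise `ZeroDivisionError`.

```python
import math

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def cosine_similarity(v, w):
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))

def safe_cosine_similarity(v, w):
    """Hypothetical variant: return 0.0 when either vector is all zeros."""
    norm = math.sqrt(dot(v, v) * dot(w, w))
    return dot(v, w) / norm if norm else 0.0

same = cosine_similarity([1, 1, 0], [1, 1, 0])        # identical vectors
disjoint = cosine_similarity([1, 0, 0], [0, 1, 1])    # nothing in common
partial = cosine_similarity([1, 1, 0], [1, 0, 0])     # one shared dimension
empty = safe_cosine_similarity([0, 0, 0], [1, 0, 0])  # guarded zero vector
print(same)
print(disjoint)
print(partial)
print(empty)
```

Identical vectors score 1, vectors with nothing in common score 0, and everything else falls in between.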

## #1 Recommendation based on general hot topics
Suggest the most popular items that the user is not yet interested in.

Getting the most popular (frequent) interests:

In [35]:
popular_interests = Counter(
    interest
    for user_interests in users_interests
    for interest in user_interests
).most_common()

In [36]:
for interest in popular_interests: print(interest)

('Python', 4)
('R', 4)
('Java', 3)
('regression', 3)
('statistics', 3)
('probability', 3)
('HBase', 3)
('Big Data', 3)
('neural networks', 2)
('Hadoop', 2)
('deep learning', 2)
('pandas', 2)
('artificial intelligence', 2)
('libsvm', 2)
('C++', 2)
('Postgres', 2)
('MongoDB', 2)
('scikit-learn', 2)
('machine learning', 2)
('statsmodels', 2)
('Cassandra', 2)
('NoSQL', 1)
('Mahout', 1)
('Storm', 1)
('MySQL', 1)
('programming languages', 1)
('Haskell', 1)
('mathematics', 1)
('Spark', 1)
('numpy', 1)
('theory', 1)
('decision trees', 1)
('MapReduce', 1)
('scipy', 1)
('databases', 1)
('support vector machines', 1)


Finding recommendations for each user based on the most popular items:

In [8]:
def most_popular_new_interests(user_interests, max_results=5):
    suggestions = [
        (interest, frequency) 
        for interest, frequency in popular_interests
        if interest not in user_interests
    ]
    return suggestions[:max_results]

In [9]:
most_popular_new_interests(users_interests[1], 5)

[('Python', 4), ('R', 4), ('Java', 3), ('regression', 3), ('statistics', 3)]

In [10]:
most_popular_new_interests(users_interests[3], 5)

[('Java', 3),
 ('HBase', 3),
 ('Big Data', 3),
 ('neural networks', 2),
 ('Hadoop', 2)]

## #2 User-based collaborative filtering

Creating our vector space of interests (dimensions):

In [12]:
unique_interests = sorted(list({
    interest 
    for user_interests in users_interests
    for interest in user_interests
}))

In [13]:
unique_interests

['Big Data',
 'C++',
 'Cassandra',
 'HBase',
 'Hadoop',
 'Haskell',
 'Java',
 'Mahout',
 'MapReduce',
 'MongoDB',
 'MySQL',
 'NoSQL',
 'Postgres',
 'Python',
 'R',
 'Spark',
 'Storm',
 'artificial intelligence',
 'databases',
 'decision trees',
 'deep learning',
 'libsvm',
 'machine learning',
 'mathematics',
 'neural networks',
 'numpy',
 'pandas',
 'probability',
 'programming languages',
 'regression',
 'scikit-learn',
 'scipy',
 'statistics',
 'statsmodels',
 'support vector machines',
 'theory']

Representing each user as a vector in the dimensions above:

In [15]:
def make_user_interest_vector(user_interests):
    """given a list of interests, produce a vector whose i-th element is 1
    if unique_interests[i] is in the list, 0 otherwise"""
    return [1 if interest in user_interests else 0 for interest in unique_interests]

In [16]:
user_interest_matrix = list(map(make_user_interest_vector, users_interests))

In [18]:
user_similarities = [
    [cosine_similarity(interest_vector_i, interest_vector_j) for interest_vector_j in user_interest_matrix]
    for interest_vector_i in user_interest_matrix
]

The similarity between user `0` and `8`:

In [19]:
user_similarities[0][8]

0.1889822365046136
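
Because the vectors contain only 0s and 1s, this value can be checked by hand. The dot product counts the shared interests, and each squared norm counts one user's interests. Users `0` and `8` share only "Big Data" and have 7 and 4 interests, so the similarity is 1 / sqrt(7 * 4). Below is a small sketch of that check, where `binary_cosine` is a hypothetical helper written only for this illustration:

```python
import math

user_0 = ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"]
user_8 = ["neural networks", "deep learning", "Big Data", "artificial intelligence"]

def binary_cosine(a, b):
    """Cosine similarity of two 0/1 interest vectors, computed from sets."""
    a, b = set(a), set(b)
    return len(a & b) / math.sqrt(len(a) * len(b))

shared = set(user_0) & set(user_8)
similarity = binary_cosine(user_0, user_8)
print(shared)
print(similarity)
```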

Finding the most similar users to a specified user:

In [20]:
def most_similar_users_to(user_id):
    pairs = [(other_user_id, similarity)                      # find other
             for other_user_id, similarity in                 # users with
                enumerate(user_similarities[user_id])         # nonzero 
             if user_id != other_user_id and similarity > 0]  # similarity

    return sorted(pairs,                                      # sort them
                  key=lambda pair: pair[1],                   # most similar
                  reverse=True)                               # first

In [21]:
most_similar_users_to(0)

[(9, 0.5669467095138409),
 (1, 0.3380617018914066),
 (8, 0.1889822365046136),
 (13, 0.1690308509457033),
 (5, 0.1543033499620919)]

Getting suggestions from the most similar users by accumulating their similarities:


In [22]:
def user_based_suggestions(user_id, include_current_interests=False):
    # sum up the similarities
    suggestions = defaultdict(float)
    for other_user_id, similarity in most_similar_users_to(user_id):
        for interest in users_interests[other_user_id]:
            suggestions[interest] += similarity

    # convert them to a sorted list
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[1],
                         reverse=True)

    # and (maybe) exclude interests the user already has
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight) 
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]

In [25]:
users_interests[0]

['Hadoop', 'Big Data', 'HBase', 'Java', 'Spark', 'Storm', 'Cassandra']

In [23]:
user_based_suggestions(0)

[('MapReduce', 0.5669467095138409),
 ('MongoDB', 0.50709255283711),
 ('Postgres', 0.50709255283711),
 ('NoSQL', 0.3380617018914066),
 ('neural networks', 0.1889822365046136),
 ('deep learning', 0.1889822365046136),
 ('artificial intelligence', 0.1889822365046136),
 ('databases', 0.1690308509457033),
 ('MySQL', 0.1690308509457033),
 ('programming languages', 0.1543033499620919),
 ('Python', 0.1543033499620919),
 ('Haskell', 0.1543033499620919),
 ('C++', 0.1543033499620919),
 ('R', 0.1543033499620919)]

### Remarks
- Doesn't work well with a large number of users: the pairwise `user_similarities` matrix grows quadratically with the user count

## #3 Item-based collaborative filtering

Transposing the user-interest matrix so that each row is an interest and each column is a user:

In [26]:
interest_user_matrix = [[user_interest_vector[j]
                         for user_interest_vector in user_interest_matrix]
                        for j, _ in enumerate(unique_interests)]
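
An equivalent way to write this transpose is `zip(*rows)`, shown here on a tiny hypothetical matrix rather than the real `user_interest_matrix`. This is only a sketch of the idiom; the nested comprehension above builds the same rows.

```python
# Rows are users, columns are interests (toy data for illustration only).
toy_user_interest_matrix = [
    [1, 0, 1],
    [0, 1, 1],
]

# zip(*rows) groups the j-th element of every row, i.e. it builds the columns.
toy_interest_user_matrix = [list(column)
                            for column in zip(*toy_user_interest_matrix)]

# The same result using the nested comprehension style from this section.
by_comprehension = [[row[j] for row in toy_user_interest_matrix]
                    for j in range(len(toy_user_interest_matrix[0]))]

print(toy_interest_user_matrix)  # [[1, 0], [0, 1], [1, 1]]
```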

In [27]:
interest_similarities = [[cosine_similarity(user_vector_i, user_vector_j)
                          for user_vector_j in interest_user_matrix]
                         for user_vector_i in interest_user_matrix]

In [28]:
def most_similar_interests_to(interest_id):
    similarities = interest_similarities[interest_id]
    pairs = [(unique_interests[other_interest_id], similarity)
             for other_interest_id, similarity in enumerate(similarities)
             if interest_id != other_interest_id and similarity > 0]
    return sorted(pairs,
                  key=lambda pair: pair[1],
                  reverse=True)

The interests most similar to interest `0` ("Big Data"):

In [31]:
most_similar_interests_to(0)

[('Hadoop', 0.8164965809277261),
 ('Java', 0.6666666666666666),
 ('MapReduce', 0.5773502691896258),
 ('Spark', 0.5773502691896258),
 ('Storm', 0.5773502691896258),
 ('Cassandra', 0.4082482904638631),
 ('artificial intelligence', 0.4082482904638631),
 ('deep learning', 0.4082482904638631),
 ('neural networks', 0.4082482904638631),
 ('HBase', 0.3333333333333333)]

In [32]:
def item_based_suggestions(user_id, include_current_interests=False):
    suggestions = defaultdict(float)
    user_interest_vector = user_interest_matrix[user_id]
    for interest_id, is_interested in enumerate(user_interest_vector):
        if is_interested == 1:
            similar_interests = most_similar_interests_to(interest_id)
            for interest, similarity in similar_interests:
                suggestions[interest] += similarity

    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[1],
                         reverse=True)

    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight) 
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]

In [33]:
item_based_suggestions(0)

[('MapReduce', 1.861807319565799),
 ('Postgres', 1.3164965809277263),
 ('MongoDB', 1.3164965809277263),
 ('NoSQL', 1.2844570503761732),
 ('programming languages', 0.5773502691896258),
 ('MySQL', 0.5773502691896258),
 ('Haskell', 0.5773502691896258),
 ('databases', 0.5773502691896258),
 ('neural networks', 0.4082482904638631),
 ('deep learning', 0.4082482904638631),
 ('C++', 0.4082482904638631),
 ('artificial intelligence', 0.4082482904638631),
 ('Python', 0.2886751345948129),
 ('R', 0.2886751345948129)]

## Architecting a recommender

A recommender is a real system / product. The engineering side is as important as the intelligence side, and both should be tightly coupled with product and business.

![Recommender macro architecture](img/recommender.jpg)

### Very very important questions

- How should your users get their recommendations? Real time, online, batch, etc.;
- How long will a recommendation live?
- Do you have a cold start scenario?

#### Recommendation SLA & getting mode

- For passive recommendations (i.e. the user asks for a new suggestion), we usually use real-time and/or online strategies;
- Cache whenever possible;
- But remember that the cache is limited. So what then?
    - Will you build a very long cache that includes some long-tail (aka crap) recommendations?
    - Will you fill the cache with fresh recommendations processed during the user interaction?
    - Will you jump to an online strategy after reaching some threshold?
- If online, the SLA is your main concern;
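
One way to picture the cache questions above is a small bounded cache in front of an online computation. The sketch below is an illustration under stated assumptions, not a production design. `RecommendationCache`, `max_entries`, and `compute_online` are hypothetical names, and `compute_online` stands in for whichever recommender function you use (for example `user_based_suggestions`). It assumes `max_entries` is at least 1.

```python
from collections import OrderedDict

class RecommendationCache(object):
    """Hypothetical bounded cache: least recently used entries are evicted."""

    def __init__(self, compute_online, max_entries=1000):
        self.compute_online = compute_online  # fallback used on a cache miss
        self.max_entries = max_entries        # assumed to be >= 1
        self.entries = OrderedDict()
        self.misses = 0

    def get(self, user_id):
        if user_id in self.entries:
            # Cache hit: remove and re-insert so this user is most recent.
            suggestions = self.entries.pop(user_id)
        else:
            # Cache miss: pay the online cost, which is where the SLA matters.
            self.misses += 1
            suggestions = self.compute_online(user_id)
            if len(self.entries) >= self.max_entries:
                self.entries.popitem(last=False)  # evict least recently used
        self.entries[user_id] = suggestions
        return suggestions

# Toy stand-in for a real recommender function.
cache = RecommendationCache(lambda user_id: ["item-%d" % user_id], max_entries=2)
cache.get(0)
cache.get(1)
cache.get(0)  # hit: user 0 becomes the most recently used entry
cache.get(2)  # miss: evicts user 1, the least recently used entry
print(sorted(cache.entries))
print(cache.misses)
```

Counting misses like this is one simple way to decide when the cache is too small and an online strategy has to take over.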

#### Recommendation lifecycle

- What is the seasonality of your suggestions?
- Your offer strategy MUST be aligned with this seasonality;
    - When a user has not used your product for a long time, treat them as a new user;
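
Here is a minimal sketch of the "treat a long-absent user as new" rule, assuming you store a last-activity timestamp per user. `MAX_IDLE` and `is_effectively_new` are hypothetical names, and the 90-day window is an arbitrary placeholder that should follow the seasonality of your own product.

```python
from datetime import datetime, timedelta

MAX_IDLE = timedelta(days=90)  # placeholder; tune to your product's seasonality

def is_effectively_new(last_seen, now, max_idle=MAX_IDLE):
    """True if the user has never been seen or has been idle for too long."""
    return last_seen is None or now - last_seen > max_idle

now = datetime(2020, 1, 1)
never_seen = is_effectively_new(None, now)
recent = is_effectively_new(now - timedelta(days=10), now)
long_absent = is_effectively_new(now - timedelta(days=200), now)
print(never_seen)   # True
print(recent)       # False
print(long_absent)  # True
```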

#### Cold start

- To recommend based on similarities, you need some user preferences to compare with those of other users. But what if you don't have any?
    - You need a cold start strategy for newly subscribed users
    - Among our basic strategies, the general hot topics recommender (#1) is a good one
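
The fallback described above can be sketched as a thin wrapper: use a similarity-based recommender when the user has some interests, and fall back to the most popular interests otherwise. The toy data and the names `suggest`, `popular`, and `personalized` are hypothetical. `personalized` stands in for a function such as `user_based_suggestions`.

```python
from collections import Counter

# Toy data for illustration only; the last user has no interests yet.
toy_users_interests = [
    ["Python", "R"],
    ["Python", "Java"],
    [],
]

popular = Counter(interest
                  for interests in toy_users_interests
                  for interest in interests).most_common()

def suggest(user_id, personalized, max_results=5):
    """Use `personalized` when possible, else fall back to popular interests."""
    interests = toy_users_interests[user_id]
    if interests:
        suggestions = personalized(user_id)
        if suggestions:
            return suggestions[:max_results]
    # Cold start (or no similar users found): recommend general hot topics.
    return [(interest, count) for interest, count in popular
            if interest not in interests][:max_results]

# Stand-in recommender that finds nothing, to exercise the fallback path.
cold_start = suggest(2, personalized=lambda user_id: [])
# Stand-in recommender that returns one fixed suggestion.
warm = suggest(0, personalized=lambda user_id: [("Java", 0.5)])
print(cold_start)
print(warm)
```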