# Who To Follow: Recommending Brands

In this exercise, we consider a simple dataset: users following brands. We only know whether a user follows a brand or not, but not how much he or she likes this brand.  Given the brands the user is following, we would like to recommend similar brands that s/he might be interested in.  

This is an example of _item-based collaborative filtering_ (a form of _memory-based collaborative filtering_).  It's the approach known as _"because you liked this, we think you'd also like this."_  This is a neighborhood method: recommendations come from the items closest to the ones a user already likes, which makes it easy to understand.

### Import code and data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sb
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
path_to_repo = '/Users/ruben/repo/personal/ga/DAT-23-NYC/'

In [3]:
data = pd.read_csv(path_to_repo + 'data/brand_followers/user-brands.csv')
print("We have %d pairs of %s users and %s brands." %
      (len(data), data.id.nunique(), data.brand.nunique()))
data.head()

We have 23804 pairs of 3759 users and 198 brands.


,id,brand
0,80002,Target
1,80002,Home Depot
2,80010,Levi's
3,80010,Puma
4,80010,Cuisinart


### User-by-brand matrix

Note that our data above is in long format, with one row per (user, brand) pair. We could turn it into a user-by-brand matrix (mostly zeros), which might be easier to work with.  You could do this with `pd.pivot_table`:

    M = pd.pivot_table(data, index='id', columns='brand', aggfunc='size', fill_value=0)

Here we use a `groupby` statement instead, which gives us a multi-index series, and then we make an `unstack` call to transform it into a dataframe again.  

Note that these steps are not strictly necessary, as you could complete this exercise in several different ways.

In [3]:
M = data.groupby(['id', 'brand']).size().unstack().fillna(0)
n_users, n_brands = M.shape
brands = M.columns
M.head(3)

brand,6pm.com,Abercrombie & Fitch,Adidas,Aeropostale,Aldo,All Saints,Amazon.com,American Apparel,American Eagle,Ann Taylor,...,Walgreens,Walk-Over,Wet Seal,Windsor,YSL,Yves Saint Laurent,ZOO,Zara,Zipcar,vineyard vines
id,,,,,,,,,,,,,,,,,,,,,
80002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80010,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80011,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
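
As a minimal sketch of this long-to-wide step, the toy frame below goes through the same `groupby`/`size`/`unstack` chain. The user ids, brands, and the names `toy` and `toy_M` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical long-format data: one row per (user, brand) pair.
toy = pd.DataFrame({
    'id':    [1, 1, 2, 3, 3],
    'brand': ['Gap', 'Target', 'Gap', 'Puma', 'Target'],
})

# Count rows per (id, brand), then pivot the brand level into columns.
# Pairs that never occur become NaN after unstack, so fill them with 0.
toy_M = toy.groupby(['id', 'brand']).size().unstack().fillna(0)

print(toy_M)
```

Each row of `toy_M` is a user and each column a brand, with 1 where the pair occurred and 0 elsewhere.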


### Jaccard distance

Since we will use a neighborhood method, we need a definition of _distance_.  We'll use the _Jaccard distance_ for this. (Also see an earlier notebook on SVD which covered the Jaccard distance.)

The [_Jaccard index_](https://en.wikipedia.org/wiki/Jaccard_index) is a similarity metric between two sets.  It measures how many elements two sets have in common, as a fraction of the total number of distinct elements in both sets.  

$$\text{Jaccard index} = \frac{ |A \cap B | }{ |A \cup B| }$$

We could make a Jaccard matrix $J$, with pairwise similarities $J_{ij}$ as entries.
- `J[i, j]` = Jaccard similarity between sets _i_ and _j_ (between 0 and 1)
- `J[i, i]` = 1, obviously, and
- `J[i, j]` = `J[j, i]`, i.e., the matrix is symmetric.

We could also define the _Jaccard distance_, which has $D_{ii} = 0$ for identical sets, and bigger values as the sets have fewer elements in common.  We define $D = 1 - J$, which has values between 0 and 1.

Common applications of the Jaccard index include text clustering, but we can use it for brand clustering as well, counting the number of followers they have in common.
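
A small worked example may help here. The follower sets below are made up for illustration, and `jaccard_index` is a hypothetical helper, not part of this notebook's solution.

```python
def jaccard_index(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets empty: the index is undefined.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return len(a & b) / float(len(union))

# Hypothetical follower ids for two brands.
followers_x = {1, 2, 3, 4}
followers_y = {3, 4, 5}

similarity = jaccard_index(followers_x, followers_y)  # 2 shared / 5 distinct
distance = 1 - similarity
print("similarity=%.2f, distance=%.2f" % (similarity, distance))
```

The two brands share 2 of 5 distinct followers, so the index is 0.4 and the distance 0.6.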

<hr>
## Exercise


- Create a brand-by-brand matrix, with the Jaccard distance between two brands in each entry.
   - Obviously, you'd have $N_{ii} = 0$ for each brand $i$, and $N_{ij} = N_{ji}$ for each pair of brands.
   - You can create a 2-dimensional `np.array` for this, or a nested dictionary `N = {i: {j: distance}}`, or anything you like.
      
      
- For a few brands of your choice, show the most similar brands.  
   - Do your results make sense? Would you agree?
   
   
- For a few users, make a few top recommendations.
   - Per user, display the brands s/he's already following
   - For each of those brands, look up the distance to all other brands
   - Average these distances per candidate brand, and keep the few candidates with the shortest average distance
   - Make sure you exclude the brands the user is already following from the recommendations

In [4]:
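def jaccard_distance(M):
    """Pairwise Jaccard distance between the columns (brands) of a 0/1 user-by-brand matrix."""
    I = M.T.dot(M)  # I[i, j] = number of users following both brand i and brand j
    n_users_per_brand = np.diag(I)  # the diagonal holds the follower count of each brand
    # N[i, j] = follower count of brand i, repeated across columns
    # (sized from the input, so the function does not rely on the global n_brands)
    N = n_users_per_brand.reshape(-1, 1) * np.ones(len(n_users_per_brand))
    U = N + N.T - I  # total unique followers = n_users_i + n_users_j - users in common
    J = I / U.astype(float)  # similarity matrix
    D = 1 - J  # distance
    return D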

In [5]:
brand_distance = jaccard_distance(M)
brand_distance.head(3)

brand,6pm.com,Abercrombie & Fitch,Adidas,Aeropostale,Aldo,All Saints,Amazon.com,American Apparel,American Eagle,Ann Taylor,...,Walgreens,Walk-Over,Wet Seal,Windsor,YSL,Yves Saint Laurent,ZOO,Zara,Zipcar,vineyard vines
brand,,,,,,,,,,,,,,,,,,,,,
6pm.com,0.0,0.996255,0.996154,0.996241,1,1,1,0.996183,0.996441,0.992481,...,1,1,1.0,1,1,1,1,1.0,1,1
Abercrombie & Fitch,0.996255,0.0,1.0,0.928571,1,1,1,1.0,0.888889,1.0,...,1,1,0.9375,1,1,1,1,0.909091,1,1
Adidas,0.996154,1.0,0.0,1.0,1,1,1,1.0,1.0,1.0,...,1,1,1.0,1,1,1,1,1.0,1,1


Note that this is a _distance_ matrix: the lower the value, the more similar the two brands.  Hence we have zeros on the diagonal.
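
To see that the matrix arithmetic agrees with the set definition, here is a minimal sketch on a toy 0/1 matrix. The matrix and all variable names are made up for illustration, and plain NumPy arrays stand in for the dataframe.

```python
import numpy as np

# Hypothetical 0/1 matrix: rows are 4 users, columns are 3 brands.
toy = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
])

common = toy.T.dot(toy)   # common[i, j] = users following both brand i and j
counts = np.diag(common)  # followers per brand
# Size of the union: followers of i + followers of j - followers of both.
union = counts[:, None] + counts[None, :] - common
toy_D = 1 - common / union.astype(float)

# Cross-check one entry against plain set arithmetic.
set_0 = set(np.flatnonzero(toy[:, 0]))  # users following brand 0
set_1 = set(np.flatnonzero(toy[:, 1]))  # users following brand 1
by_sets = 1 - len(set_0 & set_1) / float(len(set_0 | set_1))

print(toy_D[0, 1], by_sets)
```

Both routes give the same distance for brands 0 and 1, and the diagonal of `toy_D` is zero because every brand shares all of its followers with itself.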

Let's show the most similar brands for some known brands.  Each brand appears first in its own list, since its distance to itself is zero.

In [6]:
top = 5
for brand in ['Home Depot', 'Armani', 'UNIQLO']:
    # Series.sort was removed from pandas; sort_values returns a sorted copy
    print("%-20s: %s" % (brand, ", ".join(brand_distance[brand].sort_values().index[:top])))

Home Depot          : Home Depot, Target, Kohl's, Old Navy, Crate & Barrel
Armani              : Armani, Ecco, Hugo Boss, Giorgio Armani, Horchow
UNIQLO              : UNIQLO, American Apparel, 6pm.com, Shoebuy, Columbia


And let's pick some other random brands.

In [7]:
n_show = 10  # show a few brands
print("Top %d similar brands for some random %d brands" % (top, n_show))
for brand in np.random.choice(brands, n_show, replace=False):
    # Series.sort was removed from pandas; sort_values returns a sorted copy
    print("%-20s: %s" % (brand, ", ".join(brand_distance[brand].sort_values().index[:top])))

Top 5 similar brands for some random 10 brands
Sephora             : Sephora, Billabong, Roxy, Rip Curl, O'Neill
John Varvatos       : John Varvatos, Kosta Boda, Nambe, Villeroy & Boch, Horchow
Carter's            : Carter's, The Limited, Justice, Sephora, JC Penney
Oakley              : Oakley, Under Armour, Nike, Columbia, Lacoste
Under Armour        : Under Armour, TOMS Shoes, Oakley, Keds, Nike
Brooks Brothers     : Brooks Brothers, Mikasa, Crocs, Lacoste, New Balance
MINKPINK            : MINKPINK, Bloomingdale's, Giorgio Armani, JC Penney, Aeropostale
Kohl's              : Kohl's, Old Navy, Target, Home Depot, Gap
Pottery Barn        : Pottery Barn, Villeroy & Boch, Lenox, Urban Outfitters, Serena and Lily
Windsor             : Windsor, Ethan Allen, Betsey Johnson, Columbia, Guess


### Recommendations
Given a user, return the recommended brands, ranked by their average distance to the brands the user already follows.

In [8]:
def recommend_brands_for_user(user, M, top=5):
    user_brands = M.loc[user][M.loc[user] > 0].index  # get brands of user
    # the distance matrix is recomputed on every call; compute it once outside if calling often
    brand_distance = jaccard_distance(M)
    # average distance from every brand to the user's brands, closest first
    recs = brand_distance[user_brands].mean(axis=1).sort_values(ascending=True).index
    # remove all top brands that are already on this user's list
    recs = [rec for rec in recs if rec not in user_brands]
    return recs[:top]

In [9]:
n_sample = 5  # number of random users to show (avoids overwriting n_users from above)
# for user in [90217, 86156, 89116, 89112]:
for user in np.random.choice(M.index, n_sample, replace=False):
    print("User %s" % user)
    print("Already following: " + ", ".join(brands[M.loc[user] > 0]))
    print("Recommended: " + ", ".join(recommend_brands_for_user(user, M)))
    print("")

User 90377
Already following: Banana Republic
Recommended: Gap, J.Crew, Nordstrom, Express, Crate & Barrel

User 86499
Already following: Kohl's, Target
Recommended: Old Navy, Home Depot, Gap, Crate & Barrel, KitchenAid

User 81036
Already following: Levi's
Recommended: Converse, Calvin Klein, Puma, Guess, KitchenAid

User 85458
Already following: Express, Home Depot, Kohl's, Old Navy, Target
Recommended: Gap, Crate & Barrel, KitchenAid, Nordstrom, Banana Republic

User 82989
Already following: Calvin Klein, Express, Guess, Nine West
Recommended: Steve Madden, DKNY, Banana Republic, BCBGMAXAZRIA, Kenneth Cole
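
The averaging step inside `recommend_brands_for_user` can be traced by hand on a toy distance matrix. The brands `A` to `D`, their distances, and the names `toy_distance`, `followed`, and `toy_recs` are all hypothetical.

```python
import pandas as pd

# Hypothetical symmetric distance matrix for four brands.
names = ['A', 'B', 'C', 'D']
toy_distance = pd.DataFrame(
    [[0.0, 0.2, 0.9, 0.6],
     [0.2, 0.0, 0.8, 0.4],
     [0.9, 0.8, 0.0, 0.7],
     [0.6, 0.4, 0.7, 0.0]],
    index=names, columns=names)

followed = ['A', 'B']  # brands the hypothetical user already follows

# Average each brand's distance to the followed brands, lowest first.
avg = toy_distance[followed].mean(axis=1).sort_values()

# Drop the brands the user already follows.
toy_recs = [b for b in avg.index if b not in followed]
print(toy_recs)
```

Brand `D` has an average distance of 0.5 to the followed brands and `C` has 0.85, so `D` is recommended first. The followed brands themselves get the lowest averages, which is why they have to be filtered out.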

