# Who To Follow: Recommending Brands

In this exercise, we consider a simple dataset: users following brands. We only know whether a user follows a brand or not, not how much he or she likes it.  Given the brands a user is following, we would like to recommend similar brands that s/he might be interested in.  

This is an example of _item-based collaborative filtering_, a form of _memory-based collaborative filtering_.  It's the approach known as _"because you liked this, we think you'd also like this."_  It is a neighborhood method: recommendations come from the items closest to the ones a user already likes, which makes it easy to understand.

### Import code and data

In [5]:
import numpy as np
import pandas as pd
import seaborn as sb
from matplotlib import pyplot as plt
%matplotlib inline

In [6]:
path_to_repo = '/Users/ruben/repo/personal/ga/DAT-23-NYC/'

In [7]:
data = pd.read_csv(path_to_repo + 'data/brand_followers/user-brands.csv')
print("We have %d pairs of %s users and %s brands." %
      (len(data), data.id.nunique(), data.brand.nunique()))
data.head()

We have 23804 pairs of 3759 users and 198 brands.


| | id | brand |
|---|---|---|
| 0 | 80002 | Target |
| 1 | 80002 | Home Depot |
| 2 | 80010 | Levi's |
| 3 | 80010 | Puma |
| 4 | 80010 | Cuisinart |


### User-by-brand matrix

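Note that our data above is in long format: one row per (user, brand) pair. It might be easier to work with a wide user-by-brand matrix, with a 1 where the user follows the brand and a 0 elsewhere (so the matrix is mostly zeros).  You could do this with `pd.pivot_table`: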

    M = pd.pivot_table(data, index='id', columns='brand', aggfunc='size', fill_value=0)

Below, we use a `groupby` statement instead, which gives us a multi-index series, and then we make an `unstack` call to transform it into a dataframe again.  

Note that these steps are not necessary as you could complete this exercise in several different ways.
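As a rough illustration that the two routes lead to the same matrix, here is a minimal sketch on a made-up toy dataset (the `toy` frame and its values are assumptions for illustration, not part of `user-brands.csv`):

```python
import pandas as pd

# Toy data in the same long format as the real file (values are made up).
toy = pd.DataFrame({
    'id':    [1, 1, 2, 3, 3, 3],
    'brand': ['Target', 'Puma', 'Puma', 'Target', 'Puma', "Levi's"],
})

# Route 1: pivot_table counts the rows for each (id, brand) pair.
M_pivot = pd.pivot_table(toy, index='id', columns='brand',
                         aggfunc='size', fill_value=0)

# Route 2: groupby + unstack, filling the missing pairs with 0.
M_group = toy.groupby(['id', 'brand']).size().unstack(fill_value=0)

# Both routes should give the same user-by-brand matrix.
same = (M_pivot.values == M_group.values).all()
print(M_group)
print("Same matrix:", same)
```

Passing `fill_value=0` to `unstack` keeps the counts as integers, whereas `.fillna(0)` after `unstack()` leaves them as floats; either works for this exercise.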

In [8]:
M = data.groupby(['id', 'brand']).size().unstack().fillna(0)
n_users, n_brands = M.shape
brands = M.columns
M.head(3)

| id | 6pm.com | Abercrombie & Fitch | Adidas | Aeropostale | Aldo | All Saints | Amazon.com | American Apparel | American Eagle | Ann Taylor | ... | Walgreens | Walk-Over | Wet Seal | Windsor | YSL | Yves Saint Laurent | ZOO | Zara | Zipcar | vineyard vines |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 80002 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 80010 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 80011 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Jaccard distance

Since we will use a neighborhood method, we need a definition of _distance_.  We'll use the _Jaccard distance_ for this. (Also see an earlier notebook on SVD which covered the Jaccard distance.)

The [_Jaccard index_](https://en.wikipedia.org/wiki/Jaccard_index) is a similarity metric between two sets.  It measures how many elements two sets have in common, as a fraction of the total number of distinct elements in both sets.  

$$\text{Jaccard index} = \frac{ |A \cap B | }{ |A \cup B| }$$
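
To make the formula concrete, here is a minimal sketch for two small sets. The helper name `jaccard_index` and the follower sets are made up for illustration, and returning 0 for two empty sets is an assumption, since the formula is undefined in that case.

```python
def jaccard_index(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets empty: the ratio is undefined, we assume 0 here.
        return 0.0
    return len(a & b) / len(union)

# Hypothetical user ids following two brands.
followers_x = {1, 2, 3, 4}
followers_y = {3, 4, 5}

# 2 shared followers out of 5 distinct followers.
result = jaccard_index(followers_x, followers_y)
print(result)  # 0.4
```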

We could make a Jaccard matrix $J$, with pairwise similarities $J_{ij}$ as entries.
- `J[i, j]` = Jaccard similarity between brands _i_ and _j_ (between 0 and 1)
- `J[i, i]` = 1, obviously, and
- `J[i, j]` = `J[j, i]`, i.e., the matrix is symmetric.

We could also define the _Jaccard distance_, which has $D_{ii} = 0$ for identical sets, and bigger values as the sets have fewer elements in common.  We define: $D = 1 - J,$ which has values between 0 and 1.
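
The properties of $J$ and $D$ listed above can be checked on a tiny example. This is only a sketch: the three brands and their follower sets are made up.

```python
import numpy as np

# Hypothetical follower sets for three brands (made-up data).
followers = {
    'A': {1, 2, 3},
    'B': {2, 3, 4},
    'C': {5},
}
names = sorted(followers)

# Pairwise Jaccard similarity matrix J.
J = np.array([[len(followers[i] & followers[j]) / len(followers[i] | followers[j])
               for j in names] for i in names])

# Jaccard distance: 0 on the diagonal, 1 for brands with no shared followers.
D = 1 - J

print(np.round(D, 2))
print("symmetric:", np.allclose(D, D.T))
```

Here brands A and B share two of four distinct followers, so their distance is 0.5, while C shares no followers with either and sits at distance 1 from both.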

Common applications of the Jaccard index include text clustering, but we can use it for brand clustering as well: each brand is treated as the set of users who follow it, so two brands are similar when they share many followers.

<hr>
## Exercise


- Create a brand-by-brand matrix, with the Jaccard distance between two brands in each entry.
   - Obviously, you'd have $N_{ii} = 0$ for each brand $i$, and $N_{ij} = N_{ji}$ for each pair of brands.
   - You can create a 2-dimensional `np.array` for this, or a nested dictionary `N = {i: {j: distance}}`, or anything you like.
      
      
- For a few brands of your choice, show the most similar brands.  
   - Do your results make sense? Would you agree?
   
   
- For a few users, make a few top recommendations.
   - Per user, display the brands s/he's already following
   - For each of those brands, look up the distance to all other brands
   - Average these distances per candidate brand, and pick the few brands with the shortest average distance
   - Make sure you exclude the brands the user is already following from the recommendations

In [1]:
# Your code here...

In [12]:
M.head()

| id | 6pm.com | Abercrombie & Fitch | Adidas | Aeropostale | Aldo | All Saints | Amazon.com | American Apparel | American Eagle | Ann Taylor | ... | Walgreens | Walk-Over | Wet Seal | Windsor | YSL | Yves Saint Laurent | ZOO | Zara | Zipcar | vineyard vines |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 80002 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 80010 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 80011 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 80015 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 80020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [34]:
%%time
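J = {}
for brand1 in M:
    J[brand1] = {}
    for brand2 in M:
        # number of users following both brands (intersection)
        in_common = M[brand1].dot(M[brand2])
        # number of users following at least one of the two (union)
        in_total = M[brand1].sum() + M[brand2].sum() - in_common
        J[brand1][brand2] = in_common / float(in_total)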

CPU times: user 4.16 s, sys: 8.36 ms, total: 4.17 s
Wall time: 4.17 s


In [51]:
J = pd.DataFrame(J)
# J.head()

In [50]:
%%time
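in_common = M.T.dot(M)  # shared followers for every pair of brands
N = M.sum().values      # number of users per brand
N = N.reshape(-1, 1)    # column vector, so that N + N.T broadcasts to a matrix
in_total = N + N.T - in_common.values  # size of the union, by inclusion-exclusion
JJ = in_common / in_total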

CPU times: user 11 ms, sys: 2.24 ms, total: 13.2 ms
Wall time: 6.24 ms
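
The vectorized cell relies on two facts: for a 0/1 matrix, entry (i, j) of `M.T.dot(M)` counts the users who follow both brands, and the size of the union is the two follower counts minus that overlap. A small sketch on a made-up matrix (`M_toy` and its values are assumptions) suggests that the loop and the vectorized version agree:

```python
import numpy as np
import pandas as pd

# Toy 0/1 user-by-brand matrix (made-up values): rows = users, columns = brands.
M_toy = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 0],
     [1, 1, 1],
     [0, 1, 1]],
    columns=['a', 'b', 'c'])

# Vectorized: shared followers via a matrix product, union via inclusion-exclusion.
in_common = M_toy.T.dot(M_toy)
n = M_toy.sum().values.reshape(-1, 1)   # followers per brand, as a column vector
J_fast = in_common / (n + n.T - in_common.values)

# Explicit loop over all brand pairs, for comparison.
J_slow = pd.DataFrame(index=M_toy.columns, columns=M_toy.columns, dtype=float)
for b1 in M_toy:
    for b2 in M_toy:
        both = (M_toy[b1] * M_toy[b2]).sum()
        either = M_toy[b1].sum() + M_toy[b2].sum() - both
        J_slow.loc[b1, b2] = both / either

agree = np.allclose(J_fast.values, J_slow.values)
print(J_fast.round(3))
print("Loop and vectorized agree:", agree)
```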


In [53]:
JJ.head()

| brand | 6pm.com | Abercrombie & Fitch | Adidas | Aeropostale | Aldo | All Saints | Amazon.com | American Apparel | American Eagle | Ann Taylor | ... | Walgreens | Walk-Over | Wet Seal | Windsor | YSL | Yves Saint Laurent | ZOO | Zara | Zipcar | vineyard vines |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6pm.com | 1.0 | 0.003745 | 0.003846 | 0.003759 | 0 | 0 | 0 | 0.003817 | 0.003559 | 0.007519 | ... | 0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| Abercrombie & Fitch | 0.003745 | 1.0 | 0.0 | 0.071429 | 0 | 0 | 0 | 0.0 | 0.111111 | 0.0 | ... | 0 | 0 | 0.0625 | 0 | 0 | 0 | 0 | 0.090909 | 0 | 0 |
| Adidas | 0.003846 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| Aeropostale | 0.003759 | 0.071429 | 0.0 | 1.0 | 0 | 0 | 0 | 0.0 | 0.115385 | 0.0 | ... | 0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| Aldo | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |


In [59]:
# J.loc['Aldo']

In [61]:
J['Aldo'].sort_values(ascending=False).head()

Aldo          1.000000
LOFT          0.125000
Nike          0.037037
Mikasa        0.006250
Lands' End    0.003584
Name: Aldo, dtype: float64

In [64]:
top = 5
for brand in ['Home Depot', 'Armani', 'UNIQLO']:
    # the brand itself always comes first, with similarity 1
    closest = J[brand].sort_values(ascending=False).head(top).index
    print("%-20s: %s" % (brand, ', '.join(closest)))


Home Depot          : Home Depot, Target, Kohl's, Old Navy, Crate & Barrel
Armani              : Armani, Ecco, Hugo Boss, Giorgio Armani, Horchow
UNIQLO              : UNIQLO, American Apparel, 6pm.com, Shoebuy, Columbia
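
The last part of the exercise, recommending brands per user, is not covered by the cells above. One possible approach is sketched below, following the steps listed in the exercise: average the distances from the followed brands to every candidate, drop the brands already followed, and keep the closest. The function name `recommend` and the toy distance matrix are assumptions for illustration only.

```python
import pandas as pd

def recommend(user_brands, D, top=3):
    """Rank unfollowed brands by average distance to the followed ones.

    user_brands: list of brand names the user already follows
    D: brand-by-brand distance DataFrame (for example 1 - Jaccard similarity)
    """
    avg_dist = D.loc[user_brands].mean(axis=0)  # mean distance to each candidate
    avg_dist = avg_dist.drop(user_brands)       # exclude brands already followed
    return avg_dist.sort_values().head(top)     # smallest distance first

# Made-up distances between four hypothetical brands.
brands = ['a', 'b', 'c', 'd']
D_toy = pd.DataFrame(
    [[0.0, 0.2, 0.9, 0.6],
     [0.2, 0.0, 0.8, 0.4],
     [0.9, 0.8, 0.0, 0.7],
     [0.6, 0.4, 0.7, 0.0]],
    index=brands, columns=brands)

recs = recommend(['a', 'b'], D_toy, top=2)
print(recs)
```

With the notebook's own objects, this could presumably be called as `recommend(list(M.columns[M.loc[user_id] > 0]), 1 - J)` for some `user_id` in `M.index`. Averaging distances is only one choice; brands with very few followers may need separate handling.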
