# Last.FM Recommendation System - An Introduction to Collaborative Filtering

* The dataset contains information about users, gender, age, and which artists they have listened to on Last.FM. In this notebook, we use only Germany's data and transform the data into a frequency matrix.

We are going to implement 2 types of collaborative filtering:

1. Item based: uses similarities between items' consumption histories
2. User based: uses similarities between users' consumption histories together with item similarities

In [1]:
import pandas as pd
from scipy.spatial.distance import cosine

# Disable jedi autocompleter
%config Completer.use_jedi = False

In [2]:
df = pd.read_csv('../Datasets/lastfm-matrix-germany.csv')
df.sample(5)

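| | user | a perfect circle | abba | ac/dc | adam green | aerosmith | afi | air | alanis morissette | alexisonfire | ... | timbaland | tom waits | tool | tori amos | travis | trivium | u2 | underoath | volbeat | yann tiersen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | 1022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 783 | 12274 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1113 | 17482 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 826 | 12978 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 480 | 7651 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |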


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1257 entries, 0 to 1256
Columns: 286 entries, user to yann tiersen
dtypes: int64(286)
memory usage: 2.7 MB


In [4]:
# Downcast the data types of all columns to save some memory
cols = df.columns
df[cols] = df[cols].apply(pd.to_numeric, downcast='unsigned')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1257 entries, 0 to 1256
Columns: 286 entries, user to yann tiersen
dtypes: uint16(1), uint8(285)
memory usage: 352.4 KB
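
The `dtypes` line above shows one `uint16` column and 285 `uint8` columns. A minimal sketch on a made-up frame (the column names and values are illustrative, not taken from the dataset) suggests why: the 0/1 flags fit into `uint8`, while user ids above 255 need `uint16`.

```python
import pandas as pd

# Toy frame in the same layout as the notebook: a user id plus 0/1 flags.
# Column names and values are made up for illustration.
toy = pd.DataFrame({
    "user": [1, 33, 12978],
    "abba": [0, 1, 0],
    "tool": [1, 0, 0],
})

# Downcast every column to the smallest unsigned integer type that fits its values
small = toy.apply(pd.to_numeric, downcast="unsigned")

print(small.dtypes)
print(toy.memory_usage().sum(), "->", small.memory_usage().sum(), "bytes")
```

The downcast only changes the storage type; the values themselves stay the same.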


## Item Based Collaborative Filtering

In item-based collaborative filtering we do not care about the user column, so let's drop it.

In [5]:
df_de = df.drop('user', axis=1)
df_de.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1257 entries, 0 to 1256
Columns: 285 entries, a perfect circle to yann tiersen
dtypes: uint8(285)
memory usage: 350.0 KB


In [6]:
df_de.head()

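| | a perfect circle | abba | ac/dc | adam green | aerosmith | afi | air | alanis morissette | alexisonfire | alicia keys | ... | timbaland | tom waits | tool | tori amos | travis | trivium | u2 | underoath | volbeat | yann tiersen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |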


In [7]:
# Before we calculate the similarities we need to create a placeholder as a pandas DataFrame
ibcf = pd.DataFrame(index=df_de.columns, columns=df_de.columns)

Now we can start filling in the similarities. We will use the `cosine` function from `scipy`, which returns the cosine distance, so the similarity is 1 minus that value.
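
As a quick sanity check, here is a small sketch with two made-up listening vectors (the names `artist_a` and `artist_b` are illustrative). It compares `1 - cosine(...)` with the textbook definition of cosine similarity, the dot product divided by the product of the vector norms.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Two made-up listening vectors: one entry per user, 1 = listened to the artist
artist_a = np.array([1, 1, 0, 1])
artist_b = np.array([1, 0, 0, 1])

# scipy returns the cosine *distance*, so subtract it from 1 to get the similarity
similarity = 1 - cosine(artist_a, artist_b)

# Same value from the definition: dot(a, b) / (||a|| * ||b||)
manual = artist_a @ artist_b / (np.linalg.norm(artist_a) * np.linalg.norm(artist_b))

print(round(similarity, 4), round(manual, 4))
```

For binary vectors like these the similarity lies between 0 (no listeners in common) and 1 (identical listener sets).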

In [8]:
# Let's fill in our placeholder with cosine similarities
# Loop through the columns
for i in range(ibcf.shape[1]):
    # Loop through the columns again for each column
    for j in range(ibcf.shape[1]):
        # scipy's cosine() is a distance, so 1 - distance is the similarity
        ibcf.iloc[i, j] = 1 - cosine(df_de.iloc[:, i], df_de.iloc[:, j])

# I don't like using loops in Python, and particularly not nested loops.
# This code is provisional until I find a more elegant solution.
# Sorry for that!
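
One possible way to avoid the nested loop is a single matrix product. The sketch below is not the notebook's method. It runs on a made-up listening matrix, and `toy` and `ibcf_fast` are hypothetical names. It also assumes every artist has at least one listener, because an all-zero column has norm 0 and would produce `NaN`.

```python
import numpy as np
import pandas as pd

# Made-up listening matrix: rows are users, columns are artists
toy = pd.DataFrame(
    [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]],
    columns=["abba", "tool", "u2"],
)

X = toy.to_numpy(dtype=float)
norms = np.linalg.norm(X, axis=0)          # one Euclidean norm per artist column
sim = (X.T @ X) / np.outer(norms, norms)   # all pairwise dot products, normalised

ibcf_fast = pd.DataFrame(sim, index=toy.columns, columns=toy.columns)
print(ibcf_fast.round(3))
```

Unlike the placeholder filled by the loop, which has `object` dtype, this frame holds plain floats, which makes later sorting and arithmetic cheaper.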

In [9]:
ibcf.head()

| | a perfect circle | abba | ac/dc | adam green | aerosmith | afi | air | alanis morissette | alexisonfire | alicia keys | ... | timbaland | tom waits | tool | tori amos | travis | trivium | u2 | underoath | volbeat | yann tiersen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a perfect circle | 1.0 | 0.0 | 0.0179172 | 0.0515539 | 0.0627765 | 0.0 | 0.0517549 | 0.0607177 | 0 | 0.0 | ... | 0.0473381 | 0.0811998 | 0.394709 | 0.125553 | 0.0303588 | 0.111154 | 0.0243975 | 0.06506 | 0.0521641 | 0.0 |
| abba | 0.0 | 1.0 | 0.0522788 | 0.0250706 | 0.0610563 | 0.0 | 0.0167789 | 0.0295269 | 0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0610563 | 0.0295269 | 0.0 | 0.0949158 | 0.0 | 0.0253673 | 0.0 |
| ac/dc | 0.0179172 | 0.0522788 | 1.0 | 0.113154 | 0.177153 | 0.0678942 | 0.0757299 | 0.0380762 | 0 | 0.0883332 | ... | 0.0445288 | 0.0678942 | 0.0582408 | 0.0393673 | 0.0 | 0.0871313 | 0.122398 | 0.0203997 | 0.130849 | 0.0 |
| adam green | 0.0515539 | 0.0250706 | 0.113154 | 1.0 | 0.0566365 | 0.0 | 0.0933859 | 0.0 | 0 | 0.0254164 | ... | 0.0 | 0.146516 | 0.0837892 | 0.0566365 | 0.0821687 | 0.0250706 | 0.0220113 | 0.0 | 0.023531 | 0.0880451 |
| aerosmith | 0.0627765 | 0.0610563 | 0.177153 | 0.0566365 | 1.0 | 0.0 | 0.113715 | 0.100056 | 0 | 0.0618984 | ... | 0.0520051 | 0.0297351 | 0.0255072 | 0.0689655 | 0.0333519 | 0.0 | 0.214423 | 0.0 | 0.0573068 | 0.0 |


With our similarity matrix filled in, we can sort each column separately and save the names of the 10 most similar artists for each column in a new DataFrame.

In [10]:
# Create a placeholder with 10 rows and the same columns as ibcf
top = 10
top10 = pd.DataFrame(index=range(top), columns=ibcf.columns)

In [11]:
for c in ibcf.columns:
    # Position 0 is the artist itself (similarity 1.0), so skip it
    top10[c] = ibcf[c].sort_values(ascending=False).index[1:top + 1]
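
Slicing from position 1 relies on the artist itself sorting first. If another artist also had similarity 1.0, the order would be ambiguous. A slightly more defensive sketch, on a made-up similarity matrix with hypothetical names `sim`, `n` and `neighbours`, removes the artist by label before picking the top entries:

```python
import pandas as pd

# Made-up similarity matrix for four artists (symmetric, 1.0 on the diagonal)
names = ["abba", "queen", "tool", "u2"]
sim = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.3],
     [0.6, 1.0, 0.2, 0.5],
     [0.1, 0.2, 1.0, 0.4],
     [0.3, 0.5, 0.4, 1.0]],
    index=names, columns=names,
)

n = 2  # number of neighbours to keep per artist
neighbours = pd.DataFrame({
    # Remove the artist itself by label, then take the n most similar others
    artist: sim[artist].drop(artist).nlargest(n).index.tolist()
    for artist in sim.columns
})
print(neighbours)
```

Note that `nlargest` needs a numeric dtype, so a similarity frame with `object` dtype would have to be converted with `.astype(float)` first.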

In [12]:
# Show the 10 most similar artists for the first 9 artists
top10.iloc[:,:9]

| | a perfect circle | abba | ac/dc | adam green | aerosmith | afi | air | alanis morissette | alexisonfire |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tool | madonna | red hot chili peppers | the libertines | u2 | funeral for a friend | massive attack | tori amos | atreyu |
| 1 | dredg | robbie williams | metallica | the strokes | led zeppelin | rise against | goldfrapp | alicia keys | underoath |
| 2 | deftones | elvis presley | iron maiden | babyshambles | metallica | fall out boy | morcheeba | red hot chili peppers | funeral for a friend |
| 3 | porcupine tree | michael jackson | the offspring | radiohead | ac/dc | anti-flag | thievery corporation | kelly clarkson | silverstein |
| 4 | nine inch nails | queen | black sabbath | franz ferdinand | lenny kravitz | sum 41 | jamiroquai | dido | killswitch engage |
| 5 | incubus | the beatles | die toten hosen | the kooks | the rolling stones | billy talent | nouvelle vague | coldplay | rise against |
| 6 | system of a down | kelly clarkson | rammstein | foo fighters | jack johnson | silverstein | coldplay | pearl jam | caliban |
| 7 | opeth | groove coverage | judas priest | the white stripes | red hot chili peppers | lostprophets | portishead | jack johnson | enter shikari |
| 8 | the smashing pumpkins | duffy | the beatles | the beatles | robbie williams | millencolin | daft punk | norah jones | three days grace |
| 9 | radiohead | mika | hammerfall | arctic monkeys | oasis | system of a down | moby | james blunt | billy talent |


## User Based Collaborative Filtering

The steps for creating a user based recommendation system are the following:

1. Generate an item based recommendation system
2. Check what products the user has consumed
3. For each item the user has consumed, get the top X neighbours
4. Get the consumption record of the user for each neighbour
5. Calculate a similarity score
6. Recommend the items with the highest score

We first need a formula to compute a similarity score. We take the sum of the products of the user's purchase history (here, which artists they have listened to) and the item similarities, then divide that figure by the sum of the similarities:

In [13]:
# Helper function to get similarity scores
def getScore(history, similarities):
    return sum(history*similarities)/sum(similarities)
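
To make the formula concrete, here is a small worked sketch. The function `get_score` restates the helper above using pandas' own `.sum()`, and the neighbour names and similarity values are invented for illustration.

```python
import pandas as pd

def get_score(history, similarities):
    # Similarity-weighted average of the user's 0/1 listening flags
    return (history * similarities).sum() / similarities.sum()

# Hypothetical neighbours of some artist, with made-up similarity values
similarities = pd.Series({"tool": 0.5, "dredg": 0.3, "deftones": 0.2})
# The user has listened to tool and deftones, but not dredg
history = pd.Series({"tool": 1, "dredg": 0, "deftones": 1})

# (1*0.5 + 0*0.3 + 1*0.2) / (0.5 + 0.3 + 0.2)
score = get_score(history, similarities)
print(score)
```

Because both inputs are pandas Series, the multiplication aligns entries by artist name rather than by position. The score is 1 when the user has listened to every neighbour and 0 when they have listened to none.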

Now we just have to apply this function to the data frames.

In [14]:
# Place holder
df_sim = pd.DataFrame(index=df.index, columns=df.columns)
df_sim.iloc[:, :1] = df.iloc[:, :1]

We now loop through the rows and columns filling in empty spaces with similarity scores.  
Note that we score items that the user has already consumed as 0, because there is no point in recommending them again.

In [33]:
# Loop through all rows, skip the user column, and fill with similarity scores
for i in range(len(df_sim.index)):
    for j in range(1, len(df_sim.columns)):
        user = df_sim.index[i]
        product = df_sim.columns[j]

        if df.iloc[i, j] == 1:
            # Already consumed: score 0 so it is never recommended
            df_sim.iloc[i, j] = 0
        else:
            product_top_names = top10[product]
            product_top_sims = ibcf.loc[product].sort_values(ascending=False).iloc[1:11]
            user_purchases = df_de.loc[user, product_top_names]

            # Use a single .iloc[i, j] call; chained df_sim.iloc[i][j] may assign to a copy
            df_sim.iloc[i, j] = getScore(user_purchases, product_top_sims)
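
The double loop calls the scoring helper once per user and artist, which gets slow as the matrix grows. The same scoring rule can, under some assumptions, be written as one matrix product. The sketch below is not the notebook's code. It uses a made-up listening matrix and hypothetical names (`toy`, `W`, `df_scores`), keeps 2 neighbours instead of 10, and assumes each artist has at least one neighbour with non-zero similarity. Ties between equally similar neighbours may be broken differently than by `sort_values`.

```python
import numpy as np
import pandas as pd

# Made-up listening matrix: rows are users, columns are artists
toy = pd.DataFrame(
    [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 1],
     [0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 1, 0]],
    columns=["abba", "queen", "tool", "u2"],
)
n = 2  # neighbours per artist (the notebook uses 10)

X = toy.to_numpy(dtype=float)
norms = np.linalg.norm(X, axis=0)
S = (X.T @ X) / np.outer(norms, norms)   # item-item cosine similarity

# Keep only each artist's n strongest neighbours, never the artist itself
W = S.copy()
np.fill_diagonal(W, -np.inf)
keep = np.argsort(-W, axis=0)[:n, :]     # row positions of the top n per column
mask = np.zeros_like(W, dtype=bool)
np.put_along_axis(mask, keep, True, axis=0)
W = np.where(mask, S, 0.0)

# Weighted sum of listening flags divided by the sum of neighbour similarities
scores = (X @ W) / W.sum(axis=0)
scores[X == 1] = 0.0   # do not recommend artists the user already listens to

df_scores = pd.DataFrame(scores, index=toy.index, columns=toy.columns)
print(df_scores.round(3))
```

Column `p` of `W` holds the similarities of artist `p`'s neighbours and zeros elsewhere, so `X @ W` computes the numerator of the score for every user and artist at once.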

In [35]:
df_sim.head()

| | user | a perfect circle | abba | ac/dc | adam green | aerosmith | afi | air | alanis morissette | alexisonfire | ... | timbaland | tom waits | tool | tori amos | travis | trivium | u2 | underoath | volbeat | yann tiersen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 0.0 | 0.204405 | 0.0 | 0.266175 | 0.0 | 0.11814 | 0.186879 | 0.0 | ... | 0.0 | 0.0 | 0 | 0.0 | 0.0954959 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| 1 | 33 | 0.0823426 | 0.0 | 0.0959115 | 0.0 | 0.0888852 | 0.0 | 0.190638 | 0.175416 | 0.0697204 | ... | 0.0821721 | 0.0865245 | 0 | 0.0860362 | 0.34953 | 0.0 | 0.0945433 | 0 | 0.0855854 | 0.090075 |
| 2 | 42 | 0.0 | 0.0897666 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0 | 0.167549 | 0.0 | 0.107222 | 0.0 | 0 | 0.0 | 0.0 |
| 3 | 51 | 0.0823426 | 0.0835681 | 0.0 | 0.0835018 | 0.0 | 0.0929544 | 0.0 | 0.0 | 0.0670885 | ... | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| 4 | 62 | 0.0 | 0.0 | 0.114305 | 0.0931969 | 0.0882214 | 0.0929544 | 0.0 | 0.102627 | 0.0670885 | ... | 0.296039 | 0.0 | 0 | 0.0 | 0.0771396 | 0.0 | 0.0 | 0 | 0.0918341 | 0.0 |


Instead of having the matrix filled with similarity scores, however, it would be nice to see the artist names.

In [46]:
# We can now produce a matrix of user-based recommendations as follows:

recommendations = pd.DataFrame(index=df_sim.index, columns=['user','1','2','3','4','5','6'])
recommendations.iloc[:, 0] = df_sim.iloc[:, 0]

for i in range(len(df_sim.index)):
    # Drop the user column before sorting so a user id is never ranked as a score,
    # then take the 6 artists with the highest scores
    recommendations.iloc[i, 1:] = df_sim.iloc[i, 1:].sort_values(ascending=False).iloc[:6].index
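
An alternative sketch for the same ranking step uses `nlargest` on a score frame that has the user ids as its index rather than as a column. The frame `scores`, its values and the name `recs` are made up for illustration. In the notebook, `df_sim` has `object` dtype, so its score columns would need `.astype(float)` before `nlargest` can be used.

```python
import pandas as pd

# Made-up score matrix in the layout of df_sim, without the user column
scores = pd.DataFrame(
    [[0.0, 0.62, 0.10, 0.35],
     [0.48, 0.0, 0.27, 0.51]],
    index=[1, 33],                       # user ids as the index
    columns=["abba", "queen", "tool", "u2"],
)

n = 3  # recommendations per user
recs = pd.DataFrame(
    # For each user, the n artist names with the highest scores, best first
    [row.nlargest(n).index.tolist() for _, row in scores.iterrows()],
    index=scores.index,
    columns=[str(k) for k in range(1, n + 1)],
)
print(recs)
```

Keeping the user id in the index means no id value can ever be ranked alongside the scores.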


In [49]:
recommendations.head()

| | user | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | flogging molly | coldplay | aerosmith | the beatles | moby | mando diao |
| 1 | 33 | peter fox | gentleman | red hot chili peppers | kings of leon | flyleaf | oasis |
| 2 | 42 | oomph! | lacuna coil | rammstein | schandmaul | sonata arctica | subway to sally |
| 3 | 51 | the subways | the kooks | the hives | franz ferdinand | jack johnson | bloc party |
| 4 | 62 | mando diao | the fratellis | jack johnson | incubus | peter fox | oasis |


## Reference

* S. Marafi, [Collaborative Filtering with Python](http://www.salemmarafi.com/code/collaborative-filtering-with-python/)