# Last.FM Recommendation System - An Introduction to Collaborative Filtering

* The dataset contains information about users (including gender and age) and which artists they have listened to on Last.FM. In this notebook, we use only the data for Germany, transformed into a frequency matrix with one row per user and one column per artist.

We are going to implement 2 types of collaborative filtering:

1. Item-based: considers similarities between items' consumption histories.
2. User-based: considers similarities between users' consumption histories, together with item similarities.

In [1]:
import pandas as pd
from scipy.spatial.distance import cosine

# Disable jedi autocompleter
%config Completer.use_jedi = False

In [2]:
df = pd.read_csv('../Datasets/lastfm-matrix-germany.csv')
df.sample(5)

,user,a perfect circle,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
1217,19049,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
172,2784,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
869,13751,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
621,9764,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
988,15656,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1257 entries, 0 to 1256
Columns: 286 entries, user to yann tiersen
dtypes: int64(286)
memory usage: 2.7 MB


In [8]:
# Downcast the data types of all columns to save some memory
cols = df.columns
df[cols] = df[cols].apply(pd.to_numeric, downcast='unsigned')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1257 entries, 0 to 1256
Columns: 286 entries, user to yann tiersen
dtypes: uint16(1), uint8(285)
memory usage: 352.4 KB
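To see what the downcast does in isolation, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration and are not part of the Last.FM data; only `pd.to_numeric(..., downcast='unsigned')` is taken from the cell above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the listening matrix
toy = pd.DataFrame({
    "user": [1, 42, 19049],   # ids need more than 8 bits
    "abba": [0, 1, 0],        # 0/1 flags fit into 8 bits
    "tool": [1, 0, 1],
})

before = toy.memory_usage(deep=True).sum()

# Each column is shrunk to the smallest unsigned integer type that holds its values
small = toy.apply(pd.to_numeric, downcast="unsigned")
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```

The `user` column ends up wider than the artist columns because its largest value does not fit into 8 bits, which matches the `uint16(1), uint8(285)` summary above.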


## Item Based Collaborative Filtering

In item-based collaborative filtering we do not care about the user column, so let's drop it.

In [11]:
df_de = df.drop('user', axis=1)
df_de.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1257 entries, 0 to 1256
Columns: 285 entries, a perfect circle to yann tiersen
dtypes: uint8(285)
memory usage: 350.0 KB


In [10]:
df_de.head()

,a perfect circle,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# Before we calculate the similarities we need to create a placeholder as a pandas DataFrame
ibcf = pd.DataFrame(index=df_de.columns, columns=df_de.columns)

Now we can start filling in the similarities. We will use the `cosine` function from `scipy`. It returns the cosine distance, so we subtract it from 1 to get the cosine similarity.
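As a quick check of how distance and similarity relate, the following sketch compares `1 - cosine(a, b)` with the textbook formula, the dot product divided by the product of the vector norms. The vectors `a` and `b` are hypothetical listening vectors, not taken from the dataset.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Two made-up 0/1 listening vectors (one entry per user)
a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])

dist = cosine(a, b)   # cosine distance
sim = 1 - dist        # cosine similarity

# The same similarity computed by hand: a.b / (|a| * |b|)
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(sim, manual)
```

Both values should agree up to floating-point error. Two artists with identical listeners get a similarity of 1, and two artists with no listener in common get 0.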

In [26]:
# Let's fill in our placeholder with cosine similarities
# Loop through the columns
for i in range(ibcf.shape[1]):
    # Loop through the columns again for each column
    for j in range(ibcf.shape[1]):
        # cosine() returns a distance, so 1 - distance is the similarity
        ibcf.iloc[i, j] = 1 - cosine(df_de.iloc[:, i], df_de.iloc[:, j])

# I don't like using loops in Python, particularly not nested loops.
# This code is provisional until I find a more elegant solution.
# Sorry for that!
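One possible way to avoid the nested loop is to scale every column to unit length and take a single matrix product. This is only a sketch under stated assumptions, not the notebook's method: the helper name `cosine_similarity_matrix` and the toy frame are hypothetical. All-zero columns are given similarity 0 here, whereas `scipy`'s `cosine` would not produce a meaningful value for them.

```python
import numpy as np
import pandas as pd

def cosine_similarity_matrix(frame):
    """Column-by-column cosine similarities of a numeric DataFrame."""
    # Cast to float first so small integer types such as uint8 cannot overflow
    x = frame.to_numpy(dtype=float)
    norms = np.linalg.norm(x, axis=0)
    # Assumption: avoid division by zero for columns without any listener
    norms[norms == 0] = 1.0
    unit = x / norms
    # The dot product of two unit vectors is their cosine similarity
    sim = unit.T @ unit
    return pd.DataFrame(sim, index=frame.columns, columns=frame.columns)

# Hypothetical toy data: rows are users, columns are artists
toy = pd.DataFrame({
    "abba":  [1, 0, 1, 1],
    "queen": [1, 1, 0, 1],
    "tool":  [0, 0, 1, 0],
})
toy_sim = cosine_similarity_matrix(toy)
print(toy_sim.round(3))
```

Under these assumptions, `cosine_similarity_matrix(df_de)` would return a float DataFrame with the same labels as `ibcf`, while the placeholder filled by the loop holds generic Python objects.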

In [25]:
ibcf.head()

,a perfect circle,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
a perfect circle,1.0,0.0,0.0179172,0.0515539,0.0627765,0.0,0.0517549,0.0607177,0,0.0,...,0.0473381,0.0811998,0.394709,0.125553,0.0303588,0.111154,0.0243975,0.06506,0.0521641,0.0
abba,0.0,1.0,0.0522788,0.0250706,0.0610563,0.0,0.0167789,0.0295269,0,0.0,...,0.0,0.0,0.0,0.0610563,0.0295269,0.0,0.0949158,0.0,0.0253673,0.0
ac/dc,0.0179172,0.0522788,1.0,0.113154,0.177153,0.0678942,0.0757299,0.0380762,0,0.0883332,...,0.0445288,0.0678942,0.0582408,0.0393673,0.0,0.0871313,0.122398,0.0203997,0.130849,0.0
adam green,0.0515539,0.0250706,0.113154,1.0,0.0566365,0.0,0.0933859,0.0,0,0.0254164,...,0.0,0.146516,0.0837892,0.0566365,0.0821687,0.0250706,0.0220113,0.0,0.023531,0.0880451
aerosmith,0.0627765,0.0610563,0.177153,0.0566365,1.0,0.0,0.113715,0.100056,0,0.0618984,...,0.0520051,0.0297351,0.0255072,0.0689655,0.0333519,0.0,0.214423,0.0,0.0573068,0.0


With our similarity matrix filled out, we can sort each column separately and save the names of the top 10 most similar artists for each column in a new DataFrame.

In [43]:
# Create a placeholder with 10 rows and the same columns as ibcf
top = 10
top10 = pd.DataFrame(index=range(top), columns=ibcf.columns)

In [44]:
for c in ibcf.columns:
    # Position 0 is the artist itself (similarity 1.0), so skip it
    top10[c] = ibcf[c].sort_values(ascending=False).index[1:top + 1]

In [57]:
# Show the top 10 most similar artists for the first 9 artists
top10.iloc[:,:9]

,a perfect circle,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire
0,tool,madonna,red hot chili peppers,the libertines,u2,funeral for a friend,massive attack,tori amos,atreyu
1,dredg,robbie williams,metallica,the strokes,led zeppelin,rise against,goldfrapp,alicia keys,underoath
2,deftones,elvis presley,iron maiden,babyshambles,metallica,fall out boy,morcheeba,red hot chili peppers,funeral for a friend
3,porcupine tree,michael jackson,the offspring,radiohead,ac/dc,anti-flag,thievery corporation,kelly clarkson,silverstein
4,nine inch nails,queen,black sabbath,franz ferdinand,lenny kravitz,sum 41,jamiroquai,dido,killswitch engage
5,incubus,the beatles,die toten hosen,the kooks,the rolling stones,billy talent,nouvelle vague,coldplay,rise against
6,system of a down,kelly clarkson,rammstein,foo fighters,jack johnson,silverstein,coldplay,pearl jam,caliban
7,opeth,groove coverage,judas priest,the white stripes,red hot chili peppers,lostprophets,portishead,jack johnson,enter shikari
8,the smashing pumpkins,duffy,the beatles,the beatles,robbie williams,millencolin,daft punk,norah jones,three days grace
9,radiohead,mika,hammerfall,arctic monkeys,oasis,system of a down,moby,james blunt,billy talent
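The same lookup can also be written as a small helper that returns the scores together with the names. This is a sketch only: the function `similar_artists` and the toy matrix are hypothetical. It drops the artist by label rather than by position, so it does not rely on the artist itself being sorted first when another artist also has a similarity of 1.0.

```python
import pandas as pd

def similar_artists(sim, artist, n=10):
    """Return the n artists most similar to `artist`, with their scores."""
    # The placeholder DataFrame may hold generic objects, so convert to numbers
    scores = pd.to_numeric(sim[artist])
    # Remove the artist's own entry by label before ranking
    scores = scores.drop(labels=artist)
    return scores.nlargest(n)

# Hypothetical similarity matrix for three artists
names = ["abba", "queen", "tool"]
toy_matrix = pd.DataFrame(
    [[1.0, 0.4, 0.1],
     [0.4, 1.0, 0.7],
     [0.1, 0.7, 1.0]],
    index=names, columns=names,
)

print(similar_artists(toy_matrix, "abba", n=2))
```

With the matrix from this notebook, a call such as `similar_artists(ibcf, 'abba')` would be expected to list the same artists as the `abba` column of `top10`, apart from the ordering of ties.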
