## CONTENT-BASED FILTERING

Recommendation systems are a collection of algorithms that recommend items to users based on information gathered from those users. In this notebook I implement a simple content-based recommendation system using Python and the pandas library.

### Import libraries

In [1]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Import Data
I created two tables: dijelovi (car parts) and rejting (ratings).

In [2]:
dijelovi = pd.read_csv('Tokic_ML_dijelovi.csv')
dijelovi

Unnamed: 0,Id,Naziv,Kategorija
0,1,Gume Sava zimske,Vanjski|Gume i felge
1,2,Gume Sava ljetne,Vanjski|Gume i felge
2,3,Felge Alesio,Vanjski|Gume i felge
3,4,Felge Alesio niskoprofilne,Vanjski|Gume i felge
4,5,Brisači,Vanjski|Brisači
5,6,Amortizeri Prednji,Upravljanje|Ovjes
6,7,Amortizeri Stražnji,Upravljanje|Ovjes
7,8,Zračni Filter,Motor|Filteri
8,9,Klinasti Remen,Motor|Remenje
9,10,Bregasta Osovina,Motor|Osovine


In [3]:
rejting = pd.read_csv('Tokic_ML_rejting.csv')
rejting

Unnamed: 0,User_Id,Id,Rating
0,1,1,4.0
1,1,2,4.0
2,1,3,5.0
3,1,4,3.0
4,5,5,2.0
5,2,6,3.5
6,2,7,4.5
7,4,8,4.0
8,4,9,3.0
9,4,10,4.0


I split the values in the __Kategorija__ (category) column into a __list of Kategorija__ values to simplify later use. This is done by applying pandas' string split method to that column.

In [4]:
#Every Kategorija is separated by a | so we simply have to call the split function on |
dijelovi['Kategorija'] = dijelovi.Kategorija.str.split('|')
dijelovi.head()

Unnamed: 0,Id,Naziv,Kategorija
0,1,Gume Sava zimske,"[Vanjski, Gume i felge]"
1,2,Gume Sava ljetne,"[Vanjski, Gume i felge]"
2,3,Felge Alesio,"[Vanjski, Gume i felge]"
3,4,Felge Alesio niskoprofilne,"[Vanjski, Gume i felge]"
4,5,Brisači,"[Vanjski, Brisači]"


Keeping the Kategorija values in a list isn't convenient for the content-based technique, so we use One Hot Encoding to convert each list into a vector in which every column corresponds to one possible Kategorija. This encoding is needed to feed categorical data into numerical computations. Here, every distinct Kategorija becomes a column that contains either 1 or 0.

In [6]:
#Copying dijelovi into a new dataframe so the original table stays without the one-hot columns
dijelovi_sa_kategorijama = dijelovi.copy()

#For every row in the dataframe, iterate through the list of Kategorija and place a 1 into the corresponding column
for index, row in dijelovi.iterrows():
    for Kategorija in row['Kategorija']:
        dijelovi_sa_kategorijama.at[index, Kategorija] = 1
#Filling in the NaN values with 0 to show that a car part doesn't have that column's Kategorija
dijelovi_sa_kategorijama = dijelovi_sa_kategorijama.fillna(0)
dijelovi_sa_kategorijama.head()

Unnamed: 0,Id,Naziv,Kategorija,Vanjski,Gume i felge,Brisači,Upravljanje,Ovjes,Motor,Filteri,Remenje,Osovine,Klipovi,Brtve,Pumpe,Elektronika,Akumulatori,Senzori,Rasvjeta
0,1,Gume Sava zimske,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Gume Sava ljetne,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Felge Alesio,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Felge Alesio niskoprofilne,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Brisači,"[Vanjski, Brisači]",1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
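
The loop above can also be written without explicit iteration. The following is a minimal sketch, assuming the Kategorija column still holds the original pipe-separated strings (that is, before the split step). The frame `dijelovi_demo` is a hypothetical three-row stand-in built from rows shown earlier, not the full CSV.

```python
import pandas as pd

# Hypothetical stand-in for the dijelovi table (three rows copied from the data above)
dijelovi_demo = pd.DataFrame({
    'Id': [1, 5, 8],
    'Naziv': ['Gume Sava zimske', 'Brisači', 'Zračni Filter'],
    'Kategorija': ['Vanjski|Gume i felge', 'Vanjski|Brisači', 'Motor|Filteri'],
})

# str.get_dummies splits each string on '|' and returns one 0/1 column per distinct value
one_hot = dijelovi_demo['Kategorija'].str.get_dummies(sep='|')

# Attach the indicator columns to the original rows
encoded = pd.concat([dijelovi_demo, one_hot], axis=1)
print(encoded)
```

This variant produces integer 0/1 columns directly, so no `fillna(0)` step is needed.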


Now let's look at how to implement Content-Based, or Item-Item, recommendation systems. This technique tries to figure out which aspects of an item a user likes most, and then recommends items that share those aspects. In our case, we try to figure out the input user's favourite Kategorija values from the dijelovi and the ratings given.

Let's begin by creating an input user to recommend dijelovi to:

In [7]:
userInput = [
            {'Naziv':'Gume Sava zimske', 'rating':5},
            {'Naziv':'Gume Sava ljetne', 'rating':3.5}
            ] 
inputKupovina = pd.DataFrame(userInput)
inputKupovina

Unnamed: 0,Naziv,rating
0,Gume Sava zimske,5.0
1,Gume Sava ljetne,3.5


### Add dijelovi Id to input user

With the input complete, let's look up the Id of each input kupovina (purchase) in the dijelovi dataframe and add it to the input.

We can achieve this by first selecting the rows whose Naziv (name) matches an input purchase and then merging this subset with the input dataframe. We also drop the columns the input doesn't need.

In [8]:
#Selecting the dijelovi whose Naziv appears in the input
inputId = dijelovi[dijelovi['Naziv'].isin(inputKupovina['Naziv'].tolist())]
#Then merging so we get the Id. With no key given, merge joins on the shared column, Naziv.
inputKupovina = pd.merge(inputId, inputKupovina)
#Dropping information we won't use from the input dataframe
#(the positional axis argument, drop('Kategorija', 1), is rejected by pandas 2.0 and later)
inputKupovina = inputKupovina.drop(columns='Kategorija')
#Final input dataframe
inputKupovina

Unnamed: 0,Id,Naziv,rating
0,1,Gume Sava zimske,5.0
1,2,Gume Sava ljetne,3.5
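
The merge above relies on Naziv being the only shared column. As a minimal sketch, the same lookup can name the key explicitly and report input names that have no match. The frames `dijelovi_demo` and `input_demo` are hypothetical stand-ins, and the unmatched-name check is an addition that is not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the parts table and the user input
dijelovi_demo = pd.DataFrame({
    'Id': [1, 2, 3],
    'Naziv': ['Gume Sava zimske', 'Gume Sava ljetne', 'Felge Alesio'],
})
input_demo = pd.DataFrame([
    {'Naziv': 'Gume Sava zimske', 'rating': 5},
    {'Naziv': 'Gume Sava ljetne', 'rating': 3.5},
])

# how='left' keeps every input row; indicator=True adds a '_merge' column
merged = pd.merge(input_demo, dijelovi_demo, on='Naziv', how='left', indicator=True)

# Rows marked 'left_only' are input names that were not found in the parts table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Naziv'].tolist()

result = merged.drop(columns='_merge')[['Id', 'Naziv', 'rating']]
print(result)
print('Unmatched names:', unmatched)
```

Naming the key with `on='Naziv'` keeps the join stable if another shared column is ever added to either table.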


In [10]:
#Selecting the one-hot encoded rows of the dijelovi that appear in the input
userKupovina = dijelovi_sa_kategorijama[dijelovi_sa_kategorijama['Id'].isin(inputKupovina['Id'].tolist())]
userKupovina

Unnamed: 0,Id,Naziv,Kategorija,Vanjski,Gume i felge,Brisači,Upravljanje,Ovjes,Motor,Filteri,Remenje,Osovine,Klipovi,Brtve,Pumpe,Elektronika,Akumulatori,Senzori,Rasvjeta
0,1,Gume Sava zimske,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Gume Sava ljetne,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
#Resetting the index so it lines up with the index of the input ratings
userKupovina = userKupovina.reset_index(drop=True)
#Dropping the columns that are not Kategorija indicators
userKategorijaTable = userKupovina.drop(columns=['Id', 'Naziv', 'Kategorija'])
userKategorijaTable

Unnamed: 0,Vanjski,Gume i felge,Brisači,Upravljanje,Ovjes,Motor,Filteri,Remenje,Osovine,Klipovi,Brtve,Pumpe,Elektronika,Akumulatori,Senzori,Rasvjeta
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we're ready to start learning the input's preferences!

To do this, we turn each Kategorija into a weight. We multiply the input's ratings into the input's Kategorija table and then sum the resulting table by column. This operation is a dot product between a matrix and a vector, so we can do it by calling pandas' "dot" function.

In [12]:
inputKupovina['rating']

0    5.0
1    3.5
Name: rating, dtype: float64

In [13]:
#Dot product to get weights
userProfile = userKategorijaTable.transpose().dot(inputKupovina['rating'])
#The user profile
userProfile

Vanjski         8.5
Gume i felge    8.5
Brisači         0.0
Upravljanje     0.0
Ovjes           0.0
Motor           0.0
Filteri         0.0
Remenje         0.0
Osovine         0.0
Klipovi         0.0
Brtve           0.0
Pumpe           0.0
Elektronika     0.0
Akumulatori     0.0
Senzori         0.0
Rasvjeta        0.0
dtype: float64
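
To make the dot product concrete, here is a small sketch that computes the weights both ways: first by scaling each row by its rating and summing the columns, then with a single `dot` call. The three-column table `kategorije` is a hypothetical cut-down version of the user's Kategorija table.

```python
import pandas as pd

# Hypothetical cut-down Kategorija table for the two rated parts
kategorije = pd.DataFrame(
    {'Vanjski': [1.0, 1.0], 'Gume i felge': [1.0, 1.0], 'Brisači': [0.0, 0.0]}
)
ratings = pd.Series([5.0, 3.5], name='rating')

# Step by step: scale each row by its rating, then sum down each column
by_hand = kategorije.mul(ratings, axis=0).sum(axis=0)

# The same result as a single matrix-vector product
via_dot = kategorije.transpose().dot(ratings)

print(by_hand)
```

Both routes give 5.0 + 3.5 = 8.5 for Vanjski and for Gume i felge, and 0.0 for Brisači, which matches the profile above.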

In [14]:
#Now let's get the Kategorija columns of every part in our original dataframe
KategorijaTable = dijelovi_sa_kategorijama.set_index(dijelovi_sa_kategorijama['Id'])
#And drop the unnecessary information
KategorijaTable = KategorijaTable.drop(columns=['Id', 'Naziv', 'Kategorija'])
KategorijaTable.head()

Id,Vanjski,Gume i felge,Brisači,Upravljanje,Ovjes,Motor,Filteri,Remenje,Osovine,Klipovi,Brtve,Pumpe,Elektronika,Akumulatori,Senzori,Rasvjeta
1,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
KategorijaTable.shape

(20, 16)

With the input's profile and the complete list of dijelovi and their Kategorija values in hand, we take the weighted average of each part's Kategorija indicators, using the input profile as the weights, and recommend the parts with the highest scores.

In [16]:
#Multiply the kategorija by the weights and then take the weighted average
Preporuke = ((KategorijaTable*userProfile).sum(axis=1))/(userProfile.sum())
Preporuke.head()

Id
1    1.0
2    1.0
3    1.0
4    1.0
5    0.5
dtype: float64
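
The scores above can be checked by hand. The sketch below repeats the weighted average for two parts, Id 1 and Id 5, using a hypothetical three-column slice of the profile and the Kategorija table.

```python
import pandas as pd

# Hypothetical slice of the user profile (weights per Kategorija)
user_profile = pd.Series({'Vanjski': 8.5, 'Gume i felge': 8.5, 'Brisači': 0.0})

# Hypothetical slice of the Kategorija table for Id 1 and Id 5
kategorija_demo = pd.DataFrame(
    {'Vanjski': [1.0, 1.0], 'Gume i felge': [1.0, 0.0], 'Brisači': [0.0, 1.0]},
    index=pd.Index([1, 5], name='Id'),
)

# Weight each indicator by the profile, sum per part, normalise by the total weight
scores = (kategorija_demo * user_profile).sum(axis=1) / user_profile.sum()
print(scores)
```

Part 1 matches both weighted categories, (8.5 + 8.5) / 17 = 1.0, while part 5 only shares Vanjski, 8.5 / 17 = 0.5.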

## Final Recommendation

In [17]:
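#Sort by score first, so that the five highest-scoring parts are selected rather than the first five rows;
#mergesort is stable, so parts with equal scores keep their Id order
dijelovi.loc[dijelovi['Id'].isin(Preporuke.sort_values(ascending=False, kind='mergesort').head(5).keys())]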

Unnamed: 0,Id,Naziv,Kategorija
0,1,Gume Sava zimske,"[Vanjski, Gume i felge]"
1,2,Gume Sava ljetne,"[Vanjski, Gume i felge]"
2,3,Felge Alesio,"[Vanjski, Gume i felge]"
3,4,Felge Alesio niskoprofilne,"[Vanjski, Gume i felge]"
4,5,Brisači,"[Vanjski, Brisači]"
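
The list above still contains the two parts the user has already rated. One possible refinement, sketched here under assumptions and not part of the notebook, is to rank by score and leave those parts out. The names `scores`, `already_rated` and `dijelovi_demo` are hypothetical stand-ins for Preporuke, the input Ids and the parts table.

```python
import pandas as pd

# Hypothetical stand-ins: scores per Id, Ids the user already rated, and part names
scores = pd.Series({1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 0.5}, name='score')
scores.index.name = 'Id'
already_rated = [1, 2]
dijelovi_demo = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5],
    'Naziv': ['Gume Sava zimske', 'Gume Sava ljetne', 'Felge Alesio',
              'Felge Alesio niskoprofilne', 'Brisači'],
})

# Remove rated parts, then sort by score; mergesort is stable, so ties keep Id order
top = (
    scores.drop(index=already_rated)
          .sort_values(ascending=False, kind='mergesort')
          .head(3)
)

# Look up the names in ranked order and attach the scores
recommended = dijelovi_demo.set_index('Id').loc[top.index].assign(score=top)
print(recommended)
```

Whether already purchased parts should be excluded depends on the shop: tyres may be bought again, so this filter is a design choice rather than a requirement.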
