## Content-Based Filtering

Recommender systems are a collection of algorithms used to recommend items to users based on information gathered from those users. Here I implement a simple version of a content-based recommender system using Python and the Pandas library.

### Import libraries

In [1]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Import Data
(I created two tables: `dijelovi` (parts) and `rejting` (ratings).)

In [2]:
dijelovi = pd.read_csv('Tokic_ML_dijelovi.csv')
dijelovi

Unnamed: 0,Id,Naziv,Kategorija
0,1,Gume Sava zimske,Vanjski|Gume i felge
1,2,Gume Sava ljetne,Vanjski|Gume i felge
2,3,Felge Alesio,Vanjski|Gume i felge
3,4,Felge Alesio niskoprofilne,Vanjski|Gume i felge
4,5,Brisači,Vanjski|Brisači
5,6,Amortizeri Prednji,Upravljanje|Ovjes
6,7,Amortizeri Stražnji,Upravljanje|Ovjes
7,8,Zračni Filter,Motor|Filteri
8,9,Klinasti Remen,Motor|Remenje
9,10,Bregasta Osovina,Motor|Osovine


In [3]:
rejting = pd.read_csv('Tokic_ML_rejting.csv')
rejting

Unnamed: 0,User_Id,Id,Rating
0,1,1,4.0
1,1,2,4.0
2,1,3,5.0
3,1,4,3.0
4,5,5,2.0
5,2,6,3.5
6,2,7,4.5
7,4,8,4.0
8,4,9,3.0
9,4,10,4.0


I split the values in the __Kategorija__ column into a __list of categories__ to simplify later use. This can be done by applying the string split function to that column.

In [4]:
#Every Kategorija is separated by a | so we simply have to call the split function on |
dijelovi['Kategorija'] = dijelovi.Kategorija.str.split('|')
dijelovi.head()

Unnamed: 0,Id,Naziv,Kategorija
0,1,Gume Sava zimske,"[Vanjski, Gume i felge]"
1,2,Gume Sava ljetne,"[Vanjski, Gume i felge]"
2,3,Felge Alesio,"[Vanjski, Gume i felge]"
3,4,Felge Alesio niskoprofilne,"[Vanjski, Gume i felge]"
4,5,Brisači,"[Vanjski, Brisači]"


Because keeping the categories in list format is not well suited to a content-based recommender, we use One Hot Encoding to convert each list of categories into a vector in which every column corresponds to one possible value of the feature. This encoding is needed to feed categorical data into the computation. Here, each distinct category is stored in its own column, which contains either 1 or 0.

In [6]:
#Copying the dijelovi into a new one since we won't need to use the Kategorija information in our first case.
dijelovi_sa_kategorijama = dijelovi.copy()

#For every row in the dataframe, iterate through the list of Kategorija and place a 1 into the corresponding column
for index, row in dijelovi.iterrows():
    for Kategorija in row['Kategorija']:
        dijelovi_sa_kategorijama.at[index, Kategorija] = 1
#Filling in the NaN values with 0 to show that a car part doesn't have that column's Kategorija
dijelovi_sa_kategorijama = dijelovi_sa_kategorijama.fillna(0)
dijelovi_sa_kategorijama.head()

Unnamed: 0,Id,Naziv,Kategorija,Vanjski,Gume i felge,Brisači,Upravljanje,Ovjes,Motor,Filteri,Remenje,Osovine,Klipovi,Brtve,Pumpe,Elektronika,Akumulatori,Senzori,Rasvjeta
0,1,Gume Sava zimske,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Gume Sava ljetne,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Felge Alesio,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Felge Alesio niskoprofilne,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Brisači,"[Vanjski, Brisači]",1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

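As an aside, the same 0/1 encoding can be sketched without an explicit loop. The example below is an illustration only, not the method used in this notebook: it assumes the category column is still a pipe-separated string (that is, before the split in cell 4) and uses a small made-up frame named `parts` in place of the real CSV. The column order it produces may differ from the loop above.

```python
import pandas as pd

# Toy data mirroring the structure of the parts table (illustrative values only)
parts = pd.DataFrame({
    "Id": [1, 5, 8],
    "Naziv": ["Gume Sava zimske", "Brisači", "Zračni Filter"],
    "Kategorija": ["Vanjski|Gume i felge", "Vanjski|Brisači", "Motor|Filteri"],
})

# str.get_dummies splits each string on the separator and
# returns one 0/1 indicator column per distinct category
one_hot = parts["Kategorija"].str.get_dummies(sep="|")

# Attach the indicator columns to the identifying columns
parts_encoded = pd.concat([parts[["Id", "Naziv"]], one_hot], axis=1)
print(parts_encoded)
```

Because every category gets a column up front, no `fillna(0)` step is needed in this variant.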

How do we implement a content-based (item-based) recommender system? The technique tries to determine which aspects of an item the user likes most, and then recommends items that share those aspects. In our case, we will try to determine the user's favourite categories from the parts they bought and the ratings they gave.

Let's start by creating an input user.

In [7]:
userInput = [
            {'Naziv':'Gume Sava zimske', 'rating':5},
            {'Naziv':'Gume Sava ljetne', 'rating':3.5}
            ] 
inputKupovina = pd.DataFrame(userInput)
inputKupovina

Unnamed: 0,Naziv,rating
0,Gume Sava zimske,5.0
1,Gume Sava ljetne,3.5


### Adding the part Id to the user input

With the input complete, we look up the Id of each purchased part in the `dijelovi` dataframe and add it to the input.

We do this by first filtering the rows whose name matches a name in the input and then merging that subset with the input dataframe. We also drop the columns the input does not need, to save memory.

In [8]:
#Filtering out the dijelovi by naziv
inputId = dijelovi[dijelovi['Naziv'].isin(inputKupovina['Naziv'].tolist())]
#Then merging it so we can get the dijeloviId. It's implicitly merging it by naziv.
inputKupovina = pd.merge(inputId, inputKupovina)
#Dropping information we won't use from the input dataframe
inputKupovina = inputKupovina.drop(columns='Kategorija')
#Final input dataframe
inputKupovina

Unnamed: 0,Id,Naziv,rating
0,1,Gume Sava zimske,5.0
1,2,Gume Sava ljetne,3.5

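The merge above silently drops any input row whose name does not appear in the catalogue. A minimal sketch of one way to surface such rows is shown below, assuming a toy catalogue and a hypothetical part name (`Nepoznati dio`) that is deliberately missing from it; the frame names `catalog` and `user_input` are made up for the illustration.

```python
import pandas as pd

catalog = pd.DataFrame({
    "Id": [1, 2, 3],
    "Naziv": ["Gume Sava zimske", "Gume Sava ljetne", "Felge Alesio"],
})
user_input = pd.DataFrame([
    {"Naziv": "Gume Sava zimske", "rating": 5},
    {"Naziv": "Nepoznati dio", "rating": 2},  # hypothetical name, not in the catalogue
])

# A left merge keeps every input row; indicator=True adds a "_merge"
# column saying whether a matching catalogue row was found
merged = pd.merge(user_input, catalog, on="Naziv", how="left", indicator=True)

# Names that could not be matched to an Id
unmatched = merged.loc[merged["_merge"] == "left_only", "Naziv"].tolist()

# Rows that did match, with Id restored to an integer type
matched = merged.loc[merged["_merge"] == "both", ["Id", "Naziv", "rating"]]
matched = matched.astype({"Id": int})

print(unmatched)  # ['Nepoznati dio']
```

Naming the key with `on="Naziv"` also makes the join column explicit instead of relying on whichever columns the two frames happen to share.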

In [10]:
#Filtering out the dijelovi from the input
userKupovina = dijelovi_sa_kategorijama[dijelovi_sa_kategorijama['Id'].isin(inputKupovina['Id'].tolist())]
userKupovina

Unnamed: 0,Id,Naziv,Kategorija,Vanjski,Gume i felge,Brisači,Upravljanje,Ovjes,Motor,Filteri,Remenje,Osovine,Klipovi,Brtve,Pumpe,Elektronika,Akumulatori,Senzori,Rasvjeta
0,1,Gume Sava zimske,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Gume Sava ljetne,"[Vanjski, Gume i felge]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
#Resetting the index to avoid future issues
userKupovina = userKupovina.reset_index(drop=True)
#Dropping unnecessary columns to save memory and to avoid issues
userKategorijaTable = userKupovina.drop(columns=['Id', 'Naziv', 'Kategorija'])
userKategorijaTable

Unnamed: 0,Vanjski,Gume i felge,Brisači,Upravljanje,Ovjes,Motor,Filteri,Remenje,Osovine,Klipovi,Brtve,Pumpe,Elektronika,Akumulatori,Senzori,Rasvjeta
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we are ready to start learning the input user's preferences!

To do this, we turn each category into a weight. We multiply the user's ratings into the category table of the input and then sum the resulting table by column. This operation is a dot product between a matrix and a vector, so we can compute it simply by calling Pandas' `dot` function.

In [12]:
inputKupovina['rating']

0    5.0
1    3.5
Name: rating, dtype: float64

In [13]:
#Dot product to get weights
userProfile = userKategorijaTable.transpose().dot(inputKupovina['rating'])
#The user profile
userProfile

Vanjski         8.5
Gume i felge    8.5
Brisači         0.0
Upravljanje     0.0
Ovjes           0.0
Motor           0.0
Filteri         0.0
Remenje         0.0
Osovine         0.0
Klipovi         0.0
Brtve           0.0
Pumpe           0.0
Elektronika     0.0
Akumulatori     0.0
Senzori         0.0
Rasvjeta        0.0
dtype: float64

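To see where the value 8.5 comes from, the same dot product can be written out with NumPy on just the first three category columns. This is a reduced illustration of the step above, with the matrix typed in by hand rather than read from the dataframes.

```python
import numpy as np

# Rows: the two rated parts; columns: Vanjski, Gume i felge, Brisači
category_matrix = np.array([
    [1.0, 1.0, 0.0],  # Gume Sava zimske
    [1.0, 1.0, 0.0],  # Gume Sava ljetne
])
ratings = np.array([5.0, 3.5])

# Transposed matrix times the rating vector: each category weight is
# the sum of the ratings of the parts that belong to that category
weights = category_matrix.T @ ratings

print(weights.tolist())  # [8.5, 8.5, 0.0]
```

Both rated parts belong to Vanjski and Gume i felge, so each of those weights is 5.0 + 3.5 = 8.5, and every category the user has not rated stays at 0.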
In [14]:
#Now let's get the kategorija of every dijelovi in our original dataframe
KategorijaTable = dijelovi_sa_kategorijama.set_index(dijelovi_sa_kategorijama['Id'])
#And drop the unnecessary information
KategorijaTable = KategorijaTable.drop(columns=['Id', 'Naziv', 'Kategorija'])
KategorijaTable.head()

Id,Vanjski,Gume i felge,Brisači,Upravljanje,Ovjes,Motor,Filteri,Remenje,Osovine,Klipovi,Brtve,Pumpe,Elektronika,Akumulatori,Senzori,Rasvjeta
1,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
KategorijaTable.shape

(20, 16)

With the input profile and the complete list of parts and their categories in hand, we take the weighted average of each part's categories using the input profile as weights, and recommend the parts that score best.

In [16]:
#Multiply the kategorija by the weights and then take the weighted average
Preporuke = ((KategorijaTable*userProfile).sum(axis=1))/(userProfile.sum())
Preporuke.head()

Id
1    1.0
2    1.0
3    1.0
4    1.0
5    0.5
dtype: float64

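The scores 1.0 and 0.5 can be checked by hand on a reduced example. The sketch below rebuilds the computation for two parts (Id 1 and Id 5) and three category columns only; the values are copied from the outputs above, and the variable names are chosen for the illustration.

```python
import pandas as pd

# User profile restricted to three categories (weights from the dot product)
user_profile = pd.Series({"Vanjski": 8.5, "Gume i felge": 8.5, "Brisači": 0.0})

# One-hot rows for part 1 (Gume Sava zimske) and part 5 (Brisači)
category_table = pd.DataFrame(
    {"Vanjski": [1.0, 1.0], "Gume i felge": [1.0, 0.0], "Brisači": [0.0, 1.0]},
    index=pd.Index([1, 5], name="Id"),
)

# Multiply each row by the weights (aligned on column names),
# sum across categories, and normalise by the total weight
scores = (category_table * user_profile).sum(axis=1) / user_profile.sum()

print(scores.to_dict())  # {1: 1.0, 5: 0.5}
```

Part 1 matches both weighted categories, giving (8.5 + 8.5) / 17 = 1.0, while part 5 matches only Vanjski, giving 8.5 / 17 = 0.5.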
## Final recommendation

In [17]:
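#Note: Preporuke is still in Id order here, so head(5) selects the first five Ids rather than the five highest scores
dijelovi.loc[dijelovi['Id'].isin(Preporuke.head(5).keys())]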

Unnamed: 0,Id,Naziv,Kategorija
0,1,Gume Sava zimske,"[Vanjski, Gume i felge]"
1,2,Gume Sava ljetne,"[Vanjski, Gume i felge]"
2,3,Felge Alesio,"[Vanjski, Gume i felge]"
3,4,Felge Alesio niskoprofilne,"[Vanjski, Gume i felge]"
4,5,Brisači,"[Vanjski, Brisači]"

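If the goal is the highest-scoring parts rather than the first five by Id, the scores can be sorted before taking the head. The following is a sketch under stated assumptions, not the notebook's own step: the score values are illustrative, and excluding parts the user has already rated is an optional design choice added here.

```python
import pandas as pd

# Illustrative scores keyed by part Id
scores = pd.Series({1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 0.5, 8: 0.0}, name="score")
scores.index.name = "Id"

# Parts the user has already rated (Ids 1 and 2 in the example input)
already_rated = [1, 2]

top = (
    scores.drop(index=already_rated)              # skip parts the user already has
    .sort_values(ascending=False, kind="stable")  # best first; ties keep Id order
    .head(3)
)

print(top.index.tolist())  # [3, 4, 5]
```

The resulting Ids can then be passed to `isin` in the same way as in the final cell above.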