# Recommendation System

- Top/Popular Based Filtering
- Content Based Filtering
- Collaborative Filtering

<img src="https://miro.medium.com/max/4056/1*yrkvweErbifbPFkBUyZlOw.png" 
     width="500px">

In [1]:
import numpy as np
import pandas as pd

<hr>

### 1. Content-based Filtering

__A. Cosine similarity__

- Measures how similar two data points are from their values and the pattern of those values (the direction of the vectors, not their magnitude)
- In sklearn, use ```cosine_similarity```
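
For reference, the score is the dot product of two vectors divided by the product of their Euclidean lengths. A minimal sketch in plain NumPy, assuming 1-D numeric inputs that are not all zeros (`cosine_manual` is a hypothetical helper name, not part of sklearn):

```python
import numpy as np

def cosine_manual(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # dot product divided by the product of the Euclidean norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

score = cosine_manual([1, 1, 1], [1, 2, 1])
print(round(score, 8))  # 0.94280904
```

The value should agree with the off-diagonal entries that ```cosine_similarity``` returns for the same pair of vectors.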

In [2]:
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
x = np.array([1, 2, 3])
y = np.array([1, 2, 3])

In [4]:
similarityScore = cosine_similarity([x, y])
similarityScore

array([[1., 1.],
       [1., 1.]])

In [5]:
simDf = pd.DataFrame(
    similarityScore,
    columns = ['x', 'y'],
    index = ['x', 'y']
)
simDf

|   | x | y |
|---|---|---|
| x | 1.0 | 1.0 |
| y | 1.0 | 1.0 |


In [6]:
# two vectors that differ in one element

x = np.array([1, 1, 1])
y = np.array([1, 2, 1])
similarityScore = cosine_similarity([x, y])
similarityScore

array([[1.        , 0.94280904],
       [0.94280904, 1.        ]])

In [7]:
# two vectors with different values, but within each vector all elements are equal

x = np.array([1, 1, 1])
y = np.array([2, 2, 2])
similarityScore = cosine_similarity([x, y])
similarityScore

array([[1., 1.],
       [1., 1.]])
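
The result above follows from cosine similarity comparing direction only: multiplying a vector by any positive constant leaves the score at 1, even though the Euclidean distance between the two vectors keeps growing. A small sketch of that contrast, where the helper `cosine` and the scale factors are assumptions made for illustration:

```python
import numpy as np

def cosine(a, b):
    # dot product over the product of Euclidean norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 1.0, 1.0])
for k in (2, 10, 100):  # arbitrary positive scale factors
    y = k * x
    # the cosine stays at 1 while the Euclidean distance increases with k
    print(k, round(float(cosine(x, y)), 6), round(float(np.linalg.norm(x - y)), 3))
```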

In [11]:
# vector 1 = 1,2,3 & vector 2 = 5,5,5

x = np.array([1, 2, 3])
y = np.array([5, 5, 5])
similarityScore = cosine_similarity([x, y])
similarityScore

array([[1.       , 0.9258201],
       [0.9258201, 1.       ]])

In [12]:
# vector 1 = 1,20,300 & vector 2 = 5,5,5

x = np.array([1, 20, 300])
y = np.array([5, 5, 5])
similarityScore = cosine_similarity([x, y])
similarityScore

array([[1.        , 0.61639313],
       [0.61639313, 1.        ]])

<hr>

__B. Count vectorizer__

- How do we compute cosine similarity for categorical data (string values)?
- Each text is split into words and the occurrences of each word are counted. Cosine similarity is then computed on those word counts.
- In sklearn, use ```CountVectorizer```
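
To make the counting step concrete, here is a rough standard-library sketch of the idea. It assumes the same defaults as ```CountVectorizer``` (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) but is not sklearn's actual implementation; `count_vectors` is a hypothetical helper name.

```python
import re
from collections import Counter

def count_vectors(docs):
    """Return (vocabulary, count matrix) for a list of strings (hypothetical helper)."""
    # lowercase, then keep tokens made of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # vocabulary = every distinct token, sorted alphabetically
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    # one row per document, one column per vocabulary word
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

vocab, matrix = count_vectors(["Andi Budi Caca", "Andi Budi Budi"])
print(vocab)   # ['andi', 'budi', 'caca']
print(matrix)  # [[1, 1, 1], [1, 2, 0]]
```

Each row of the matrix is a numeric vector, so it can be passed to ```cosine_similarity``` like any other numeric data.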

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

In [19]:
x = 'Andi Budi Caca'
y = 'Andi Budi Caca'

In [20]:
coVec = CountVectorizer()

In [23]:
c = coVec.fit_transform([x, y])
c

<2x3 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [25]:
# words found in the data (get_feature_names() was removed in scikit-learn 1.2)
coVec.get_feature_names_out().tolist()

['andi', 'budi', 'caca']

In [29]:
# frequency matrix: count of each word in each document
mf = c.toarray()
mf

# [[1, 1, 1], => document 1 => Andi:1x, Budi:1x, Caca:1x
# [1, 1, 1]]  => document 2 => Andi:1x, Budi:1x, Caca:1x

array([[1, 1, 1],
       [1, 1, 1]], dtype=int64)

In [28]:
similarityScore = cosine_similarity(mf)
similarityScore

#      X   Y
# X [[1., 1.],
# Y  [1., 1.]]

array([[1., 1.],
       [1., 1.]])

In [33]:
# slightly different data

x = 'Andi Budi Caca'
y = 'Andi Budi Budi'

coVec = CountVectorizer()
cm = coVec.fit_transform([x, y])
print(coVec.get_feature_names_out().tolist())
print(cm.toarray())

similarityScore = cosine_similarity(cm.toarray())
print(similarityScore)

['andi', 'budi', 'caca']
[[1 1 1]
 [1 2 0]]
[[1.         0.77459667]
 [0.77459667 1.        ]]
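
The off-diagonal value can be checked by hand from the two count vectors $x = (1, 1, 1)$ and $y = (1, 2, 0)$ printed above:

$$
\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
= \frac{1\cdot 1 + 1\cdot 2 + 1\cdot 0}{\sqrt{1^2 + 1^2 + 1^2}\,\sqrt{1^2 + 2^2 + 0^2}}
= \frac{3}{\sqrt{3}\,\sqrt{5}}
= \frac{3}{\sqrt{15}} \approx 0.7746
$$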


<hr>

__C. Case Study__

- Dataset: [USA Cars Dataset](https://www.kaggle.com/doaaalsenani/usa-cers-dataset)
- Content-based filtering: recommendations based on __*brand*__ & __*color*__

In [35]:
dfCars = pd.read_csv('USA_cars_datasets.csv')
dfCars.head(3)

|   | Unnamed: 0 | price | brand | model | year | title_status | mileage | color | vin | lot | state | country | condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6300 | toyota | cruiser | 2008 | clean vehicle | 274117.0 | black | jtezu11f88k007763 | 159348797 | new jersey | usa | 10 days left |
| 1 | 1 | 2899 | ford | se | 2011 | clean vehicle | 190552.0 | silver | 2fmdk3gc4bbb02217 | 166951262 | tennessee | usa | 6 days left |
| 2 | 2 | 5350 | dodge | mpv | 2018 | clean vehicle | 39590.0 | silver | 3c4pdcgg5jt346413 | 167655728 | georgia | usa | 2 days left |


In [37]:
# 1. create a column whose value combines "brand" & "color"

dfCars['B&C'] = dfCars['brand'].astype(str) + ' ' + dfCars['color'].astype(str)
dfCars.head(3)

|   | Unnamed: 0 | price | brand | model | year | title_status | mileage | color | vin | lot | state | country | condition | B&C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6300 | toyota | cruiser | 2008 | clean vehicle | 274117.0 | black | jtezu11f88k007763 | 159348797 | new jersey | usa | 10 days left | toyota black |
| 1 | 1 | 2899 | ford | se | 2011 | clean vehicle | 190552.0 | silver | 2fmdk3gc4bbb02217 | 166951262 | tennessee | usa | 6 days left | ford silver |
| 2 | 2 | 5350 | dodge | mpv | 2018 | clean vehicle | 39590.0 | silver | 3c4pdcgg5jt346413 | 167655728 | georgia | usa | 2 days left | dodge silver |


In [42]:
# 2. count vectorizer: count the words of each row in column "B&C"

cv = CountVectorizer()
cm = cv.fit_transform(dfCars['B&C'])

# all unique words in column B&C (get_feature_names() was removed in scikit-learn 1.2)
print(cv.get_feature_names_out().tolist())

# frequency matrix: count of each word in each row
print(cm.toarray())
# print(cm.toarray()[0])

['acura', 'audi', 'beige', 'benz', 'billet', 'black', 'blue', 'bmw', 'bright', 'brown', 'buick', 'burgundy', 'cadillac', 'cayenne', 'charcoal', 'chevrolet', 'chrysler', 'clearcoat', 'coat', 'color', 'competition', 'crimson', 'dark', 'davidson', 'dodge', 'ford', 'glacier', 'gmc', 'gold', 'gray', 'green', 'guard', 'harley', 'heartland', 'honda', 'hyundai', 'infiniti', 'ingot', 'jaguar', 'jazz', 'jeep', 'kia', 'kona', 'land', 'lexus', 'light', 'lightning', 'lincoln', 'magnetic', 'maroon', 'maserati', 'mazda', 'mercedes', 'metallic', 'morningsky', 'nissan', 'no_color', 'off', 'orange', 'oxford', 'pearl', 'pearlcoat', 'peterbilt', 'phantom', 'platinum', 'purple', 'ram', 'red', 'royal', 'ruby', 'shadow', 'silver', 'super', 'tan', 'tinted', 'toreador', 'toyota', 'tri', 'triple', 'turquoise', 'tuxedo', 'white', 'yellow']
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [43]:
# 3. Cosine similarity between every pair of rows in the frequency matrix

cosScore = cosine_similarity(cm.toarray())
cosScore

array([[1. , 0. , 0. , ..., 0. , 0.5, 0. ],
       [0. , 1. , 0.5, ..., 0.5, 0. , 0.5],
       [0. , 0.5, 1. , ..., 0.5, 0. , 0.5],
       ...,
       [0. , 0.5, 0.5, ..., 1. , 0.5, 1. ],
       [0.5, 0. , 0. , ..., 0.5, 1. , 0.5],
       [0. , 0.5, 0.5, ..., 1. , 0.5, 1. ]])

In [56]:
# 4. Use the cosine scores as the basis for recommendations
# suppose I like the car at index 21
sayaSuka = 21  # "saya suka" = "I like"

# list of all cars paired with their cosine score
similarCars = list(enumerate(cosScore[sayaSuka]))
# similarCars  # (index, similarity score)

In [57]:
# sort by similarity score, highest first
similarCars = sorted(similarCars, key=lambda x: x[1], reverse=True)
# similarCars

In [66]:
# top 6 rows: the liked car itself (score 1) plus the 5 most similar cars

# similarCars[:6]

# select the rows by position, keeping the sorted order
dfSim = dfCars.iloc[[i for i, _ in similarCars[:6]]]
dfSim

|   | Unnamed: 0 | price | brand | model | year | title_status | mileage | color | vin | lot | state | country | condition | B&C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 21 | 7300 | kia | forte | 2018 | clean vehicle | 38823.0 | black | 3kpfl4a79je272611 | 167801773 | north carolina | usa | 2 days left | kia black |
| 527 | 527 | 3810 | kia | door | 2016 | clean vehicle | 39408.0 | black | kndjx3ae3g7012477 | 167692796 | california | usa | 20 hours left | kia black |
| 623 | 623 | 14000 | kia | sorento | 2017 | clean vehicle | 53424.0 | black | 5xypgda51hg221253 | 167119104 | connecticut | usa | 5 hours left | kia black |
| 0 | 0 | 6300 | toyota | cruiser | 2008 | clean vehicle | 274117.0 | black | jtezu11f88k007763 | 159348797 | new jersey | usa | 10 days left | toyota black |
| 6 | 6 | 7300 | chevrolet | pk | 2010 | clean vehicle | 149050.0 | black | 1gcsksea1az121133 | 167753872 | georgia | usa | 22 hours left | chevrolet black |
| 9 | 9 | 5250 | ford | mpv | 2017 | clean vehicle | 63418.0 | black | 2fmpk3j92hbc12542 | 167656121 | texas | usa | 2 days left | ford black |
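
Because every item has a score of 1 with itself, the first row of the result is the liked car. One possible way to return only the other cars is sketched below, assuming a square similarity matrix like `cosScore`. `top_similar` is a hypothetical helper name and the toy matrix values are made up for illustration.

```python
import numpy as np

def top_similar(scores, liked, n=5):
    """Indices of the n items most similar to `liked`, excluding `liked` itself.

    `scores` is a square similarity matrix; hypothetical helper, not a library API.
    """
    row = np.asarray(scores)[liked]
    # stable sort on the negated scores: highest first, ties kept in index order
    order = np.argsort(-row, kind="stable")
    return [int(i) for i in order if i != liked][:n]

# toy 4x4 similarity matrix, values made up for illustration
toy = np.array([
    [1.0, 0.5, 0.0, 1.0],
    [0.5, 1.0, 0.5, 0.5],
    [0.0, 0.5, 1.0, 0.0],
    [1.0, 0.5, 0.0, 1.0],
])
print(top_similar(toy, liked=0, n=2))  # [3, 1]
```

With the notebook's variables, `dfCars.iloc[top_similar(cosScore, sayaSuka)]` would then give five recommended rows without the liked car.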
