# Recommendation System

- Top/Popular Based Filtering
- Content Based Filtering
- Collaborative Filtering

<img src="https://miro.medium.com/max/4056/1*yrkvweErbifbPFkBUyZlOw.png" 
     width="500px">

In [None]:
import numpy as np
import pandas as pd

<hr>

### 1. Content-based Filtering

__A. Cosine similarity__

- Measures how similar two vectors are by the angle between them, i.e. by the pattern (relative proportions) of their values rather than their magnitude
- In sklearn, use ```cosine_similarity```
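
As a minimal sketch of what `cosine_similarity` computes for a single pair of vectors: the cosine of the angle between them is their dot product divided by the product of their lengths. The helper name `cosine` below is hypothetical and uses only NumPy; it illustrates the formula and is not meant to replace the sklearn function.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between 1-D vectors a and b (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # dot product divided by the product of the Euclidean norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine([1, 1, 1], [2, 2, 2])   # parallel vectors
different = cosine([1, 2, 3], [5, 5, 5])        # different proportions
print(same_direction, different)
```

Parallel vectors score 1 no matter how long they are, while vectors with different proportions score lower. The sketch does not handle an all-zero vector, for which the formula is undefined.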

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
x = np.array([1, 2, 3])
y = np.array([1, 2, 3])

In [None]:
similarityScore = cosine_similarity([x, y])
similarityScore

In [None]:
simDf = pd.DataFrame(
    similarityScore,
    columns = ['x', 'y'],
    index = ['x', 'y']
)
simDf

In [None]:
# the two vectors differ in one element

x = np.array([1, 1, 1])
y = np.array([1, 2, 1])
similarityScore = cosine_similarity([x, y])
similarityScore

In [None]:
# the two vectors hold different values, but all elements within each vector are equal

x = np.array([1, 1, 1])
y = np.array([2, 2, 2])
similarityScore = cosine_similarity([x, y])
similarityScore

In [None]:
# vector 1 = 1, 2, 3 & vector 2 = 5, 5, 5

x = np.array([1, 2, 3])
y = np.array([5, 5, 5])
similarityScore = cosine_similarity([x, y])
similarityScore

In [None]:
# vector 1 = 1, 20, 300 & vector 2 = 5, 5, 5

x = np.array([1, 20, 300])
y = np.array([5, 5, 5])
similarityScore = cosine_similarity([x, y])
similarityScore

<hr>

__B. Count vectorizer__

- How do we compute cosine similarity for categorical data (string values)?
- Each text is split into words and the occurrences of every word are counted. Cosine similarity is then computed on the resulting word-count vectors.
- In sklearn, use ```CountVectorizer```
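
To make the counting step concrete, here is a rough pure-Python sketch of the idea. The function `count_words` is hypothetical and only approximates `CountVectorizer`: it lowercases and splits on whitespace, whereas the sklearn class applies its own tokenization rule (by default it also ignores one-character tokens and punctuation).

```python
from collections import Counter

def count_words(texts):
    """Return (vocabulary, count matrix) for a list of strings (hypothetical helper)."""
    tokenized = [text.lower().split() for text in texts]
    # vocabulary: every distinct word, sorted alphabetically
    vocab = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # one row per text, one column per vocabulary word
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

vocab, matrix = count_words(['Andi Budi Caca', 'Andi Budi Budi'])
print(vocab)   # ['andi', 'budi', 'caca']
print(matrix)  # [[1, 1, 1], [1, 2, 0]]
```

Each row of the matrix is a numeric vector, so strings can now be compared with cosine similarity just like the numeric arrays earlier.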

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
x = 'Andi Budi Caca'
y = 'Andi Budi Caca'

In [None]:
coVec = CountVectorizer()

In [None]:
c = coVec.fit_transform([x, y])
c

In [None]:
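# vocabulary learned from the data
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
coVec.get_feature_names_out()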

In [None]:
# frequency matrix: how often each word occurs in each text
mf = c.toarray()
mf

# [[1, 1, 1], => text 1 => Andi: 1x, Budi: 1x, Caca: 1x
# [1, 1, 1]]  => text 2 => Andi: 1x, Budi: 1x, Caca: 1x

In [None]:
similarityScore = cosine_similarity(mf)
similarityScore

#      X   Y
# X [[1., 1.],
# Y  [1., 1.]]

In [None]:
# the two texts differ slightly

x = 'Andi Budi Caca'
y = 'Andi Budi Budi'

coVec = CountVectorizer()
cm = coVec.fit_transform([x, y])
print(coVec.get_feature_names_out())
print(cm.toarray())

similarityScore = cosine_similarity(cm.toarray())
print(similarityScore)

<hr>

__C. Case Study__

- Dataset: [USA Cars Dataset](https://www.kaggle.com/doaaalsenani/usa-cers-dataset)
- Content-based filtering: recommend cars based on __*brand*__ & __*color*__

In [None]:
dfCars = pd.read_csv('USA_cars_datasets.csv')
dfCars.head(3)

In [None]:
# 1. create a column whose value combines "brand" and "color"

dfCars['B&C'] = dfCars['brand'].astype(str) + ' ' + dfCars['color'].astype(str)
dfCars.head(3)

In [None]:
# 2. count vectorizer: count the words of each row in column "B&C"

cv = CountVectorizer()
cm = cv.fit_transform(dfCars['B&C'])

# all unique words in column "B&C"
print(cv.get_feature_names_out())

# frequency matrix: count of each word in each row
print(cm.toarray())
# print(cm.toarray()[0])

In [None]:
# 3. cosine similarity between every pair of rows in the frequency matrix

cosScore = cosine_similarity(cm.toarray())
cosScore

In [None]:
# 4. use the cosine scores as the basis for recommendations
# suppose I like the car at row index 21
sayaSuka = 21

# every car paired with its cosine score relative to the liked car
similarCars = list(enumerate(cosScore[sayaSuka]))
# similarCars  # (index, similarity score)

In [None]:
# sort by similarity score, highest first
similarCars = sorted(similarCars, key=lambda x: x[1], reverse=True)
# similarCars

In [None]:
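# the 5 most similar cars, excluding the liked car itself
# (a plain similarCars[:6] may or may not contain the liked car,
# because other cars with the same brand & color also score 1.0)

top5 = [i for i, score in similarCars if i != sayaSuka][:5]
dfSim = dfCars.iloc[top5]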
dfSim
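
The steps above can be collected into one function. This is a sketch under assumptions: `recommend` and the five-row `toy` DataFrame are hypothetical stand-ins (the rows are made up, not taken from the Kaggle dataset), and the liked row is explicitly excluded from the results.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(df, liked, columns, n=5):
    """Return the n rows most similar to row position `liked` (hypothetical helper)."""
    # 1. join the chosen columns into one text per row
    text = df[columns].astype(str).agg(' '.join, axis=1)
    # 2. word counts per row, 3. cosine scores of every row against the liked row
    scores = cosine_similarity(CountVectorizer().fit_transform(text))[liked]
    # 4. rank row positions by descending score, dropping the liked row itself
    order = np.argsort(-scores, kind='stable')
    order = order[order != liked][:n]
    return df.iloc[order].assign(similarity=scores[order])

# made-up rows for illustration only
toy = pd.DataFrame({
    'brand': ['ford', 'ford', 'dodge', 'ford', 'bmw'],
    'color': ['black', 'white', 'black', 'black', 'red'],
})
result = recommend(toy, liked=0, columns=['brand', 'color'], n=3)
print(result)
```

With the real data, a call such as `recommend(dfCars, 21, ['brand', 'color'])` would follow the same path as the cells above. Cars that share both brand and color with the liked car tie at the top score, so their relative order simply follows their row order.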