# Sistem Rekomendasi Buku

DBS Coding Camp
- Zuhair Nashif Abdurrohim
- 1301223102
- MC012D5Y1127

# Import

Kode ini mengimpor pustaka untuk analisis data, pemrosesan file, dan pembelajaran mesin, termasuk TF-IDF untuk representasi teks dan cosine similarity untuk mengukur kesamaan antar teks.

In [49]:
import pandas as pd
import numpy as np
import os
import zipfile
from google.colab import files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import tensorflow as tf

# Data Loading

Mengambil data dari kaggle
- Upload kaggle.json untuk API kaggle
- Ekstract data
- Rubah menjadi dataframe

In [9]:
# Upload file kaggle.json
files.upload()

# Setup untuk API kaggle
os.makedirs("/root/.kaggle", exist_ok=True)
os.rename("kaggle.json", "/root/.kaggle/kaggle.json")
os.chmod("/root/.kaggle/kaggle.json", 600)

# Download dataset dari Kaggle
!kaggle datasets download -d arashnic/book-recommendation-dataset

Saving kaggle.json to kaggle.json
Dataset URL: https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset
License(s): CC0-1.0
book-recommendation-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


Kode ini mengekstrak file ZIP yang berisi dataset rekomendasi buku, kemudian membaca tiga file CSV—**Books.csv**, **Users.csv**, dan **Ratings.csv**—ke dalam **DataFrame** menggunakan **pandas** untuk analisis data lebih lanjut.

In [10]:
# Ekstrak file ZIP
with zipfile.ZipFile("/content/book-recommendation-dataset.zip", 'r') as zip_ref:
    zip_ref.extractall("book-recommendation-dataset")

# Import dataset ke DataFrame dan tampilkan
books  = pd.read_csv('/content/book-recommendation-dataset/Books.csv')
users  = pd.read_csv('/content/book-recommendation-dataset/Users.csv')
ratings = pd.read_csv('/content/book-recommendation-dataset/Ratings.csv')

  books  = pd.read_csv('/content/book-recommendation-dataset/Books.csv')


# Data Understanding
tahap awal proyek untuk mengetahui atau memahami data yang dimiliki

- **Users:** Berisi data pengguna. ID pengguna (User-ID) telah dianonimkan dan dikonversi menjadi angka. Data demografi seperti lokasi dan usia disertakan jika tersedia, tetapi jika tidak, nilainya akan **NULL**.

- **Books:** Setiap buku diidentifikasi berdasarkan ISBN-nya. ISBN yang tidak valid telah dihapus dari dataset. Informasi berbasis konten seperti **judul buku, nama penulis, tahun terbit, dan penerbit** diperoleh dari Amazon Web Services. Jika ada lebih dari satu penulis, hanya penulis pertama yang dicantumkan. URL yang mengarah ke sampul buku tersedia dalam tiga ukuran berbeda (**kecil, sedang, besar**) dan menunjuk ke situs Amazon.

- **Ratings:** Berisi informasi tentang rating buku. Rating (Book-Rating) bisa berupa **rating eksplisit** dalam skala **1-10** (semakin tinggi menunjukkan apresiasi lebih besar) atau **rating implisit** yang ditunjukkan dengan nilai **0**.

Kode ini mencetak jumlah baris dalam masing-masing **DataFrame** untuk dataset buku, pengguna, dan rating, memberikan gambaran tentang ukuran dataset yang digunakan dalam analisis.

In [11]:
print("Jumlah data pada file Books.csv:", books.shape[0])
print("Jumlah data pada file Users.csv:", users.shape[0])
print("Jumlah data pada file Ratings.csv:", ratings.shape[0])

Jumlah data pada file Books.csv: 271360
Jumlah data pada file Users.csv: 278858
Jumlah data pada file Ratings.csv: 1149780


# Univariate EDA
melakukan analisis dan eksplorasi setiap variabel data, memahami keterkaitan antar variable

Menampilkan informasi data users

In [12]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278858 non-null  int64  
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


Menampilkan deskripsi data users

In [13]:
users.describe()

Unnamed: 0,User-ID,Age
count,278858.0,168096.0
mean,139429.5,34.751434
std,80499.51502,14.428097
min,1.0,0.0
25%,69715.25,24.0
50%,139429.5,32.0
75%,209143.75,44.0
max,278858.0,244.0


Menampilkan informasi data book

In [14]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271358 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


Menampilkan deskripsi data buku

In [15]:
books.describe()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
count,271360,271360,271358,271360,271358,271360,271360,271357
unique,271360,242135,102022,202,16807,271044,271044,271041
top,020130998X,Selected Poems,Agatha Christie,2002,Harlequin,http://images.amazon.com/images/P/042509474X.0...,http://images.amazon.com/images/P/042509474X.0...,http://images.amazon.com/images/P/006091985X.0...
freq,1,27,632,13903,7535,2,2,2


Menampilkan informasi data rating

In [16]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


Menampilkan deskripsi data rating

In [17]:
ratings.describe()

Unnamed: 0,User-ID,Book-Rating
count,1149780.0,1149780.0
mean,140386.4,2.86695
std,80562.28,3.854184
min,2.0,0.0
25%,70345.0,0.0
50%,141010.0,0.0
75%,211028.0,7.0
max,278854.0,10.0


# Data Preprocessing
mempersiapkan data sebelum digunakan

Menghapus variabel yang tidak diperlukan (Image dari data books)

In [18]:
books.drop(['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis=1, inplace=True)

Tampilkan head data

In [19]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


Menghitung missing value pada data books

In [20]:
books.isnull().sum()

Unnamed: 0,0
ISBN,0
Book-Title,0
Book-Author,2
Year-Of-Publication,0
Publisher,2


Menghitung missing value pada data users

In [21]:
users.isnull().sum()

Unnamed: 0,0
User-ID,0
Location,0
Age,110762


Menghitung missing value pada data ratings

In [22]:
ratings.isnull().sum()

Unnamed: 0,0
User-ID,0
ISBN,0
Book-Rating,0


Menggabungkan data users, ratings dan books

In [23]:
# Gabungkan ratings dengan users
merge_df = pd.merge(ratings, users, on='User-ID', how='left')

In [24]:
# Gabungkan merge_df dengan books
merge_df = pd.merge(merge_df, books, on='ISBN', how='left')

In [25]:
merge_df.shape[0]

1149780

In [26]:
merge_df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Location,Age,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,276725,034545104X,0,"tyler, texas, usa",,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
1,276726,0155061224,5,"seattle, washington, usa",,Rites of Passage,Judith Rae,2001,Heinle
2,276727,0446520802,0,"h, new south wales, australia",16.0,The Notebook,Nicholas Sparks,1996,Warner Books
3,276729,052165615X,3,"rijeka, n/a, croatia",16.0,Help!: Level 1,Philip Prowse,1999,Cambridge University Press
4,276729,0521795028,6,"rijeka, n/a, croatia",16.0,The Amsterdam Connection : Level 4 (Cambridge ...,Sue Leather,2001,Cambridge University Press


# Data Preparation
mempersiapkan data, mengatasi missing value

Menghitung missing value

In [27]:
merge_df.isnull().sum()

Unnamed: 0,0
User-ID,0
ISBN,0
Book-Rating,0
Location,0
Age,309492
Book-Title,118644
Book-Author,118646
Year-Of-Publication,118644
Publisher,118646


Dari hasil jumalh missing value, maka dapat disipulkan bersal dari :
- Rating untuk ISBN tanpa metadata buku
- User tanpa data usia

Tangani missing value metadata buku

In [28]:
# Hapus data dengan metadata tidak lengkap
merge_df = merge_df.dropna(
    subset=[
      'Book-Title',
      'Book-Author',
      'Year-Of-Publication',
      'Publisher'
    ],
    how='any'
)

In [29]:
merge_df.isnull().sum()

Unnamed: 0,0
User-ID,0
ISBN,0
Book-Rating,0
Location,0
Age,277835
Book-Title,0
Book-Author,0
Year-Of-Publication,0
Publisher,0


Karena umur dirasa tidak begitu penting, maka akan dilakukan drop kolom Age

In [30]:
merge_df.drop(['Age', 'Location', 'Year-Of-Publication', 'Publisher'], axis=1, inplace=True)

In [31]:
merge_df.isnull().sum()

Unnamed: 0,0
User-ID,0
ISBN,0
Book-Rating,0
Book-Title,0
Book-Author,0


✅ Data bersih dari missing value

# Model Development Content Based Filtering
mengembangkan sistem rekomendasi dengan teknik content based filtering. Teknik content based filtering akan merekomendasikan item yang mirip dengan item yang disukai pengguna di masa lalu. Pada tahap ini, akan menemukan representasi fitur penting dari setiap kategori buku dengan tfidf vectorizer dan menghitung tingkat kesamaan dengan cosine similarity. Setelah itu, akan membuat sejumlah rekomendasi nuku untuk pelanggan berdasarkan kesamaan yang telah dihitung sebelumnya.

Mengambil 10.000 baris pertama dari merge_df dan menyimpannya dalam variabel data

In [32]:
data = merge_df.head(10000)

TF-IDF Vectorizer

Kode ini membuat objek **TfidfVectorizer** untuk mengubah teks menjadi representasi numerik berbasis **TF-IDF**. Kemudian, model dihitung menggunakan **judul buku** (`Book-Title`) sebagai fitur, dan hasilnya digunakan untuk mendapatkan daftar kata yang digunakan dalam proses pemetaan ke indeks numerik.

In [33]:
# Buat TFidfVektorizer
tf = TfidfVectorizer()

# Hitung idf pada title
tf.fit(data['Book-Title'])

# Mapping index integer ke nama
tf.get_feature_names_out()

array(['00', '000', '007', ..., 'â¼ã', 'ãµes', 'ã¼ber'], dtype=object)

Kode ini mengubah judul buku (`Book-Title`) menjadi **matriks TF-IDF** menggunakan `TfidfVectorizer`. Hasilnya disimpan dalam `tfidf_matrix`, yang merepresentasikan setiap judul buku sebagai vektor numerik berdasarkan bobot TF-IDF. **`tfidf_matrix.shape`** digunakan untuk melihat ukuran matriks, menunjukkan jumlah buku dan jumlah fitur unik dalam teks.

In [34]:
tfidf_matrix = tf.fit_transform(data['Book-Title'])

tfidf_matrix.shape

(10000, 10661)

Kode ini mengubah **matriks TF-IDF** menjadi bentuk **matriks densitas penuh**, yaitu merepresentasikan nilai-nilai TF-IDF dalam format matriks tanpa kompresi, sehingga lebih mudah untuk dianalisis atau divisualisasikan.

In [35]:
tfidf_matrix.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

Kode ini membuat **DataFrame** dari **matriks TF-IDF**, dengan kata-kata unik sebagai kolom dan penulis buku sebagai indeks. Kemudian, dilakukan pengambilan sampel acak terhadap **10.661 fitur (kata-kata)** dan **10 penulis**, sehingga hanya sebagian kecil data yang ditampilkan untuk analisis.

In [36]:
# Dataframe tf-idf matrix

pd.DataFrame(
    tfidf_matrix.todense(),
    columns=tf.get_feature_names_out(),
    index=data['Book-Author']
).sample(10661, axis=1).sample(10, axis=0)

Unnamed: 0_level_0,ian,43,arabia,andalucia,blackhawk,smoky,mujeres,rubinstein,vã,barbarism,...,carrot,origin,yellow,longstocking,school,assessment,bike,moving,window,melody
Book-Author,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Kaye Gibbons,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Lilian Jackson Braun,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jude Deveraux,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Onion,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
John Grisham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Barney Hoskyns,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Karen Amen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
John Le Carre,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Tracy Chevalier,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MITCH ALBOM,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Cosine Similarity

Kode ini menghitung kesamaan antar buku menggunakan **cosine similarity** berdasarkan matriks TF-IDF, menghasilkan matriks kesamaan di mana setiap nilai menunjukkan seberapa mirip satu buku dengan lainnya.

In [37]:
cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim

array([[1.       , 0.       , 0.       , ..., 0.       , 0.       ,
        0.       ],
       [0.       , 1.       , 0.       , ..., 0.       , 0.0295068,
        0.       ],
       [0.       , 0.       , 1.       , ..., 0.       , 0.       ,
        0.039179 ],
       ...,
       [0.       , 0.       , 0.       , ..., 1.       , 0.       ,
        0.       ],
       [0.       , 0.0295068, 0.       , ..., 0.       , 1.       ,
        0.       ],
       [0.       , 0.       , 0.039179 , ..., 0.       , 0.       ,
        1.       ]])

Kode ini membuat **DataFrame** dari matriks **cosine similarity**, dengan **Book-Author** sebagai indeks dan kolom. Ini memungkinkan analisis kesamaan antar buku berdasarkan penulisnya. Kemudian, ukuran DataFrame ditampilkan, dan sampel acak **5 kolom** serta **10 baris** diambil untuk melihat sebagian kecil data.

In [38]:
# Dataframe cosine_sim
cosine_sim_df = pd.DataFrame(
    cosine_sim,
    index=data['Book-Author'],
    columns=data['Book-Author']
)
print('Shape:', cosine_sim_df.shape)

cosine_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape: (10000, 10000)


Book-Author,Elspeth Josceline Huxley,Belva Plain,DEAN KOONTZ,Lavyrle Spencer,Marion Grafin Donhoff
Book-Author,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Jean Shinoda Bolen,0.022729,0.0,0.0,0.0,0.0
Elizabeth Graham,0.0,0.0,0.0,0.0,0.0
Andrei Codrescu,0.0,0.0,0.0,0.0,0.0
Marianne Willman,0.028657,0.0,0.0,0.0,0.0
MICHAEL CRICHTON,0.034203,0.0,0.0,0.0,0.0
Cathy Gillen Thacker,0.0,0.0,0.0,0.0,0.0
Magdalen Nabb,0.0,0.0,0.0,0.0,0.0
Linda Goodman,0.0,0.0,0.0,0.0,0.0
Carole Mortimer,0.0,0.0,0.0,0.0,0.0
Julia Alvarez,0.0,0.0,0.0,0.0,0.0


Mendapatkan Rekomendasi

Fungsi ini membuat sistem rekomendasi buku berbasis **cosine similarity** dengan langkah-langkah berikut:

- **Mengambil indeks kesamaan**: Menggunakan `argpartition` untuk menemukan `k` buku yang paling mirip dengan penulis yang diberikan.
- **Menentukan buku terdekat**: Memilih buku-buku dengan nilai kesamaan tertinggi berdasarkan hasil dari matriks **cosine similarity**.
- **Menghapus buku input dari hasil**: Menghindari rekomendasi buku yang sama dengan yang diberikan pengguna.
- **Menggabungkan hasil dengan informasi buku**: Mengembalikan DataFrame berisi judul dan penulis dari rekomendasi.



In [39]:
def book_recommendations(book_author, similarity_data=cosine_sim_df, items=data[['Book-Title', 'Book-Author']], k=10):

  index = similarity_data.loc[:,book_author].to_numpy().argpartition(
        range(-1, -k, -1))

  closest = similarity_data.columns[index[-1:-(k+2):-1]]

  closest = closest.drop(book_author, errors='ignore')

  return pd.DataFrame(closest).merge(items).head(k)

Kode ini mengambil semua baris dalam `data` yang memiliki nilai **'Tracey West'** di kolom **'Book-Author'**, memungkinkan analisis atau pemfilteran buku berdasarkan penulisnya.

In [40]:
data[data['Book-Author'].eq('Tracey West')]

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author
3021,277921,0439104645,9,I Choose You (Pokemon Chapter Book #1),Tracey West
6226,278418,0439137411,0,Pokemon the First Movie: Mewtwo Strikes Back (...,Tracey West
6248,278418,043920092X,0,Thundershock in Pummelo Stadium (Pokemon Chapt...,Tracey West
6249,278418,0439200938,0,"Go West, Young Ash (PokÃ©mon Chapter Book, 17)",Tracey West
6250,278418,0439200946,0,"Ash Ketchum, Pokemon Detective (PokÃ©mon Chapt...",Tracey West
6260,278418,0439220335,0,"Prepare for Trouble (PokÃ©mon Chapter Book, 19)",Tracey West


# Model Development Collaborative Filtering
Model merekomendasikan sejumlah buku berdasarkan rating yang telah diberikan sebelumnya. Dari data rating pengguna, kita akan mengidentifikasi buku-buku yang mirip dan belum pernah dibaca oleh pengguna untuk direkomendasikan.

# Train

Kode ini mendefinisikan `dc` sebagai **DataFrame** untuk sistem rekomendasi berbasis **Collaborative Filtering**, menggunakan dataset rating (`ratings`) yang berisi informasi tentang pengguna dan buku yang mereka nilai.

In [41]:
# dc = datafram collaborative
dc = ratings

Kode ini mengambil semua **User-ID unik** dari dataset `dc`, kemudian melakukan **encoding** dengan mengubah **User-ID** menjadi angka yang lebih mudah digunakan dalam model pembelajaran mesin. Selanjutnya, dibuat **mapping decoding** untuk mengubah kembali angka tersebut menjadi **User-ID asli**, memungkinkan konversi dua arah antara format numerik dan ID pengguna.

In [42]:
# Ambil semua user ID unik dari data
user_ids = dc['User-ID'].unique()

# Encoding: dari user ID asli ke angka
user_encoded = {user_id: idx for idx, user_id in enumerate(user_ids)}

# Decoding: dari angka ke user ID asli
user_decode = {idx: user_id for user_id, idx in user_encoded.items()}

Kode ini mengambil semua **ISBN unik** dari dataset `dc`, kemudian melakukan **encoding** dengan mengubah **ISBN** menjadi angka yang lebih mudah digunakan dalam model pembelajaran mesin. Selanjutnya, dibuat **mapping decoding** untuk mengubah kembali angka tersebut menjadi **ISBN asli**, memungkinkan konversi dua arah antara format numerik dan ISBN buku.

In [43]:
# Ambil semua ISBN unik dari data
book_ids = dc['ISBN'].unique()

# Encoding: dari ISBN asli ke angka
book_encoded = {isbn: idx for idx, isbn in enumerate(book_ids)}

# Decoding: dari angka ke ISBN asli
book_decode = {idx: isbn for isbn, idx in book_encoded.items()}

Kode ini melakukan pemetaan (mapping) **User-ID** ke indeks numerik dalam `dc` menggunakan `user_encoded`, serta memetakan **ISBN** ke indeks numerik menggunakan `book_encoded`. Ini bertujuan untuk menyederhanakan data sehingga dapat digunakan dalam model pembelajaran mesin untuk rekomendasi buku.

In [44]:
# Mapping User-ID ke dataframe user
dc['user'] = dc['User-ID'].map(user_encoded)

# Mapping ISBN ke dataframe buku
dc['book'] = dc['ISBN'].map(book_encoded)

Kode ini menghitung jumlah **pengguna** dan **buku** dalam dataset rekomendasi, kemudian mengonversi rating buku menjadi tipe data **float** untuk memastikan kompatibilitas dalam pemrosesan numerik. Selain itu, kode ini juga menentukan **nilai minimum** dan **maksimum** dari rating buku yang diberikan oleh pengguna, membantu memahami distribusi rating dalam sistem rekomendasi.

In [45]:
# Jumlah user
num_users = len(user_encoded)
print("Jumlah user:", num_users)

# Jumlah buku
num_books = len(book_encoded)
print("Jumlah buku:", num_books)

# Convert rating to float
dc['Book-Rating'] = dc['Book-Rating'].values.astype(np.float32)

# Minimum rating
min_rate = min(dc['Book-Rating'])
print("Minimum rating:", min_rate)

# Maximum rating
max_rate = max(dc['Book-Rating'])
print("Maximum rating:", max_rate)

Jumlah user: 105283
Jumlah buku: 340556
Minimum rating: 0.0
Maximum rating: 10.0


Kode ini mengacak urutan data dalam **DataFrame `dc`** dengan menggunakan `sample(frac=1)`, yang memastikan bahwa semua baris dipilih tetapi dalam urutan acak. Parameter **`random_state=42`** digunakan untuk memastikan hasil yang konsisten setiap kali kode dijalankan.

In [46]:
dc = dc.sample(frac=1, random_state=42)
dc

Unnamed: 0,User-ID,ISBN,Book-Rating,user,book
178554,38781,0373259131,0.0,15560,99291
533905,128835,0811805905,8.0,49582,59185
1091374,261829,037324486X,0.0,99796,121427
1036247,247747,0531303306,0.0,94309,320740
309523,74076,0316812404,0.0,28854,32411
...,...,...,...,...,...
110268,25458,0142000191,0.0,10260,69256
259178,60146,0060964049,8.0,23699,527
131932,30509,1857230655,0.0,12254,79598
671155,163307,0446314145,0.0,62388,243103


Kode ini mempersiapkan data untuk pelatihan model rekomendasi. **`x`** berisi pasangan **user** dan **book** dalam bentuk array numerik, sedangkan **`y`** berisi **Book-Rating** yang telah dinormalisasi ke rentang **0-1** berdasarkan nilai minimum dan maksimum dalam dataset. Selanjutnya, **90%** data digunakan sebagai **training set** (`x_train`, `y_train`), dan **10% sisanya** digunakan sebagai **validation set** (`x_val`, `y_val`), memungkinkan model untuk belajar dan diuji sebelum penerapan lebih lanjut.

In [47]:
x = dc[['user', 'book']].values
y = dc['Book-Rating'].apply(lambda x: (x - min_rate) / (max_rate - min_rate)).values

train_indices = int(0.9 * dc.shape[0])
x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:]
)
print(x, y)

[[ 15560  99291]
 [ 49582  59185]
 [ 99796 121427]
 ...
 [ 12254  79598]
 [ 62388 243103]
 [ 11319   1365]] [0.  0.8 0.  ... 0.  0.  0. ]


Training

Kode ini membangun model **Collaborative Filtering** menggunakan **Neural Network** dengan langkah-langkah berikut:

- **Embedding Layer:** Membangun representasi numerik pengguna (`user_embedding`) dan buku (`book_embedding`) dalam dimensi `embedding_dim = 32`.
- **Flattening:** Mengubah embedding pengguna dan buku menjadi vektor satu dimensi.
- **Dot Product:** Menghitung skor kesamaan antara pengguna dan buku menggunakan operasi dot product.
- **Model Kompilasi:** Model dibuat menggunakan **Keras Functional API**, dengan optimizer **Adam** dan loss function **Mean Squared Error (MSE)** untuk prediksi rating.
- **Training Model:** Data training (`x_train, y_train`) dan validasi (`x_val, y_val`) digunakan untuk melatih model dalam **5 epoch**.
- **Fungsi Rekomendasi:** `recommend_books()` mencari buku yang belum dinilai oleh pengguna, memprediksi rating menggunakan model, dan mengembalikan **10 buku terbaik** berdasarkan prediksi rating.

Sistem ini memungkinkan rekomendasi buku berdasarkan pola rating pengguna lain dengan pendekatan **latent factor model** menggunakan embedding.

In [50]:

embedding_dim = 32

user_input = tf.keras.layers.Input(shape=(1,), name='user_input')
user_embedding = tf.keras.layers.Embedding(num_users, embedding_dim, name='user_embedding')(user_input)
user_vec = tf.keras.layers.Flatten(name='FlattenUsers')(user_embedding)

book_input = tf.keras.layers.Input(shape=(1,), name='book_input')
book_embedding = tf.keras.layers.Embedding(num_books, embedding_dim, name='book_embedding')(book_input)
book_vec = tf.keras.layers.Flatten(name='FlattenBooks')(book_embedding)

prod = tf.keras.layers.dot([user_vec, book_vec], axes=1, normalize=False)
model = tf.keras.Model([user_input, book_input], prod)
model.compile('adam', 'mean_squared_error')

# Assuming x_train, y_train, x_val, y_val are defined from the previous code
history = model.fit([x_train[:, 0], x_train[:, 1]], y_train,
                    epochs=5,
                    verbose=1,
                    validation_data=([x_val[:, 0], x_val[:, 1]], y_val))


def recommend_books(user_id, dc_df, books_df, k=10):
    # Encode the user ID
    encoded_user_id = user_encoded.get(user_id)

    if encoded_user_id is None:
        print(f"User ID {user_id} not found in the training data.")
        return pd.DataFrame() # Return empty DataFrame

    # Get books already rated by the user
    books_rated_by_user = dc_df[dc_df['User-ID'] == user_id]['ISBN'].tolist()

    # Get all unique book ISBNs from the original books data
    all_book_isbns = books_df['ISBN'].unique()

    # Filter out books already rated by the user
    books_to_predict = [isbn for isbn in all_book_isbns if isbn not in books_rated_by_user]

    if not books_to_predict:
        print(f"User ID {user_id} has rated all available books or no books found to predict.")
        return pd.DataFrame()

    # Encode the books to predict
    encoded_books_to_predict = np.array([book_encoded.get(isbn) for isbn in books_to_predict if book_encoded.get(isbn) is not None])

    if encoded_books_to_predict.size == 0:
         print(f"Could not encode any books to predict for user ID {user_id}.")
         return pd.DataFrame()

    # Create user input array for prediction
    user_input_predict = np.full(len(encoded_books_to_predict), encoded_user_id)

    # Predict ratings for the books the user hasn't rated
    predictions = model.predict([user_input_predict, encoded_books_to_predict])

    # Get the indices of top k predicted ratings
    top_indices = np.argsort(predictions.flatten())[::-1][:k]

    # Get the encoded book IDs of the top recommendations
    top_encoded_book_ids = encoded_books_to_predict[top_indices]

    # Decode the book IDs to ISBNs
    recommended_book_isbns = [book_decode.get(encoded_id) for encoded_id in top_encoded_book_ids]

    # Get book information for the recommended ISBNs
    recommended_books_info = books_df[books_df['ISBN'].isin(recommended_book_isbns)]

    return recommended_books_info[['ISBN', 'Book-Title', 'Book-Author']]

Epoch 1/5
[1m32338/32338[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m191s[0m 6ms/step - loss: 0.2292 - val_loss: 0.2120
Epoch 2/5
[1m32338/32338[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m183s[0m 6ms/step - loss: 0.1553 - val_loss: 0.2079
Epoch 3/5
[1m32338/32338[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m201s[0m 6ms/step - loss: 0.0768 - val_loss: 0.2175
Epoch 4/5
[1m32338/32338[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m203s[0m 6ms/step - loss: 0.0464 - val_loss: 0.2191
Epoch 5/5
[1m32338/32338[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m200s[0m 6ms/step - loss: 0.0327 - val_loss: 0.2215


Kode ini memilih satu **User-ID** secara acak dari data rating (`dc`), lalu menggunakan fungsi **`recommend_books()`** untuk mendapatkan **10 rekomendasi buku** berdasarkan informasi rating yang diberikan pengguna lain. Hasil rekomendasi ditampilkan dengan format yang mencantumkan **penulis dan judul buku**, atau pesan alternatif jika tidak ditemukan rekomendasi untuk pengguna tersebut.

In [52]:
# Ambil salah satu user ID dari data rating yang sudah dimuat sebelumnya
# Pastikan user_id ini ada di data training (dc)
sample_user_id = dc['User-ID'].sample(1).iloc[0] # Ambil user ID acak dari data rating

# Panggil fungsi rekomendasi
# Gunakan DataFrame 'dc' untuk user's rated books and 'books' for book info
recommended_books_df = recommend_books(sample_user_id, dc, books, k=10)

# Tampilkan hasil rekomendasi
print(f"Rekomendasi Buku untuk User ID {sample_user_id}:")
print("===" * 10)

if not recommended_books_df.empty:
    for index, row in recommended_books_df.iterrows():
        print(f"{row['Book-Author']} : {row['Book-Title']}")
else:
    print("Tidak ada rekomendasi yang ditemukan untuk pengguna ini.")

[1m8339/8339[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 2ms/step
Rekomendasi Buku untuk User ID 76352:
Tanith Lee : The Silver Metal Lover
CHUCK PALAHNIUK : Survivor : A Novel
Jan Karon : Shepherds Abiding
Anthony Bourdain : Kitchen Confidential: Adventures in the Culinary Underbelly
T. A. Barron : Heartlight
JERRY SPINELLI : Stargirl
Anna Quindlen : How Reading Changed My Life (Library of Contemporary Thought)
Tim Lahaye : The Indwelling: The Beast Takes Possession (Left Behind No. 7)
Daphne du Maurier : Rebecca
Joyce Milton : Mummies (All Aboard Reading)


# Evaluasi Model

Pada bagian ini, kita akan mengevaluasi kinerja dari kedua model rekomendasi yang telah dibangun: Content-Based Filtering dan Collaborative Filtering.


In [51]:

# Fungsi untuk mendapatkan indeks buku berdasarkan judul
def get_book_index(title, data):
    # Menggunakan .str.contains() dengan case=False untuk pencarian tidak case-sensitive
    # dan mengembalikan indeks dari baris pertama yang cocok
    return data[data['Book-Title'].str.contains(title, case=False, na=False)].index.min()

# Fungsi untuk mendapatkan rekomendasi berdasarkan judul buku
def recommend_by_title(book_title, data=data, similarity_data=cosine_sim_df, k=10):
    book_index = get_book_index(book_title, data)

    if book_index is None:
        print(f"Buku dengan judul '{book_title}' tidak ditemukan dalam dataset terbatas ini.")
        return pd.DataFrame()

    # Dapatkan nilai kesamaan untuk buku yang dicari
    similarity_scores = cosine_sim[book_index]

    # Dapatkan indeks buku yang diurutkan berdasarkan skor kesamaan (descending)
    # Ambil k+1 buku pertama (termasuk buku itu sendiri)
    top_indices = similarity_scores.argsort()[::-1][1:k+1] # Mulai dari 1 untuk mengecualikan buku input

    # Dapatkan ISBN dari buku-buku yang direkomendasikan
    recommended_isbns = data.iloc[top_indices]['ISBN'].tolist()

    # Dapatkan informasi buku dari DataFrame asli (books)
    recommended_books_info = books[books['ISBN'].isin(recommended_isbns)]

    return recommended_books_info[['ISBN', 'Book-Title', 'Book-Author']]

# Contoh Evaluasi Content-Based Filtering
print("=== Evaluasi Content-Based Filtering ===")
sample_book_title = "The Lovely Bones" # Ganti dengan judul buku yang ada di dataset terbatas

# Dapatkan rekomendasi untuk buku contoh
recommended_books_cb = recommend_by_title(sample_book_title, data)

if not recommended_books_cb.empty:
    print(f"\nRekomendasi Berdasarkan Konten untuk Buku '{sample_book_title}':")
    print("---" * 10)
    for index, row in recommended_books_cb.iterrows():
        print(f"{row['Book-Author']} : {row['Book-Title']}")
else:
    print(f"\nTidak dapat memberikan rekomendasi berdasarkan konten untuk buku '{sample_book_title}'.")

# Menampilkan nilai Cosine Similarity (Contoh)
# Kita bisa menampilkan matriks kesamaan untuk buku contoh dengan buku-buku lain.
# Karena matriksnya besar, kita ambil contoh buku yang ditemukan
book_index_for_sim = get_book_index(sample_book_title, data)

if book_index_for_sim is not None:
    print(f"\nNilai Cosine Similarity untuk '{sample_book_title}' dengan beberapa buku lain (dataset terbatas):")
    print("---" * 10)

    # Ambil baris kesamaan untuk buku yang dicari
    similarity_row = cosine_sim[book_index_for_sim]

    # Buat DataFrame sementara untuk menampilkan nilai kesamaan
    sim_df = pd.DataFrame({'Book-Title': data['Book-Title'], 'Similarity': similarity_row})

    # Urutkan berdasarkan kesamaan (descending) dan tampilkan beberapa teratas (kecuali buku itu sendiri)
    print(sim_df.sort_values(by='Similarity', ascending=False).head(11).tail(10)) # Ambil 11, buang yang pertama
else:
    print(f"\nTidak dapat menampilkan nilai Cosine Similarity karena buku '{sample_book_title}' tidak ditemukan di dataset terbatas.")


### Evaluasi Collaborative Filtering

# Evaluasi model Collaborative Filtering dapat menggunakan metrik seperti Mean Squared Error (MSE) atau Root Mean Squared Error (RMSE) dari prediksi rating.
# Kita juga bisa melihat contoh rekomendasi untuk user dan membandingkannya dengan buku yang sudah dirating user.

print("\n=== Evaluasi Collaborative Filtering ===")

# Tampilkan loss (MSE) dari proses training dan validasi
print(f"Training Loss (MSE): {history.history['loss'][-1]:.4f}")
print(f"Validation Loss (MSE): {history.history['val_loss'][-1]:.4f}")

# RMSE adalah akar kuadrat dari MSE
train_rmse = np.sqrt(history.history['loss'][-1])
val_rmse = np.sqrt(history.history['val_loss'][-1])

print(f"Training RMSE: {train_rmse:.4f}")
print(f"Validation RMSE: {val_rmse:.4f}")

# Tampilkan kembali contoh rekomendasi untuk user (sudah dilakukan di bagian sebelumnya)
# Kita bisa mengambil user_id yang sama atau user_id lain
sample_user_id_eval = dc['User-ID'].sample(1).iloc[0]

print(f"\nRekomendasi Buku Collaborative Filtering untuk User ID {sample_user_id_eval}:")
print("---" * 10)
recommended_books_cf = recommend_books(sample_user_id_eval, dc, books, k=10)

if not recommended_books_cf.empty:
    for index, row in recommended_books_cf.iterrows():
        print(f"{row['Book-Author']} : {row['Book-Title']}")
else:
    print("Tidak ada rekomendasi Collaborative Filtering yang ditemukan untuk pengguna ini.")

# Untuk evaluasi yang lebih mendalam, kita bisa membagi data menjadi training, validasi, dan testing set.
# Kemudian menghitung metrik pada testing set.
# Contoh: Hitung MSE pada x_val, y_val menggunakan model.evaluate()
# mse_val = model.evaluate([x_val[:, 0], x_val[:, 1]], y_val, verbose=0)
# print(f"\nManual Validation MSE: {mse_val:.4f}")

=== Evaluasi Content-Based Filtering ===

Rekomendasi Berdasarkan Konten untuk Buku 'The Lovely Bones':
------------------------------
Bernard Werber : Les Fourmis
Michael Cunningham : The Hours: A Novel
Michael Cunningham : The Hours : A Novel
Greg Iles : 24 Hours
Sandra Steffen : Marriage By Contract  (36 Hours) (Harlequin 36 Hours)
Hugo : Les Orientales ; Les Feuilles d'automne
Virginia Woolf : Les Vagues
Michael Flynn : Father And Child Reunion  (36 Hours) (36 Hours)
Deaver J. : Dix-huit heures pour mourir: Roman

Nilai Cosine Similarity untuk 'The Lovely Bones' dengan beberapa buku lain (dataset terbatas):
------------------------------
                                              Book-Title  Similarity
10090                                 The Hours: A Novel    0.519876
11090                                The Hours : A Novel    0.519876
1690                                            24 Hours    0.397841
10143  Marriage By Contract  (36 Hours) (Harlequin 36...    0.337282
5709 