# Acknowledgements
---

This dataset is taken from Kaggle: [https://www.kaggle.com](https://www.kaggle.com/datasets/ruchi798/bookcrossing-dataset)

# Import Dataset from Kaggle
---

Download the dataset first, then unzip it.

In [1]:
# from google.colab import drive

# drive.mount('/content/drive')

In [2]:
# !mkdir ~/.kaggle
# !cp /content/drive/MyDrive/kaggle.json ~/.kaggle
# !chmod 600 ~/.kaggle/kaggle.json

In [3]:
# !kaggle datasets download -d ruchi798/bookcrossing-dataset

In [4]:
# !unzip bookcrossing-dataset.zip

# Import Library for Exploratory Data Analysis

---

Import the libraries that will be used for data analysis, data visualization, data preprocessing, and modeling.

In [5]:
# library for data loading and data analysis
import pandas as pd
import numpy as np

# library for data visualization
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

# Data Loading

In [6]:
path = 'Books Data with Category Language and Summary/Preprocessed_data.csv'
df = pd.read_csv(path, index_col=[0])
df.head()

Unnamed: 0,user_id,location,age,isbn,rating,book_title,book_author,year_of_publication,publisher,img_s,img_m,img_l,Summary,Language,Category,city,state,country
0,2,"stockton, california, usa",18.0,195153448,0,Classical Mythology,Mark P. O. Morford,2002.0,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,Provides an introduction to classical myths pl...,en,['Social Science'],stockton,california,usa
1,8,"timmins, ontario, canada",34.7439,2005018,5,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],timmins,ontario,canada
2,11400,"ottawa, ontario, canada",49.0,2005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],ottawa,ontario,canada
3,11676,"n/a, n/a, n/a",34.7439,2005018,8,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],,,
4,41385,"sudbury, ontario, canada",34.7439,2005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],sudbury,ontario,canada


In [7]:
# check the shape of dataframe
print(f'The data has {df.shape[0]} records and {df.shape[1]} columns.')

The data has 1031175 records and 18 columns.


# Exploratory Data Analysis
---

This dataset has 1,031,175 rows and 18 columns:
* user_id : the ID of the user
* location : where the user lives
* age : the age of the user
* isbn : the identifier code of the book
* rating : the rating the user gave the book
* book_title : the title of the book
* book_author : the author of the book
* year_of_publication : the year the book was published
* publisher : the publisher of the book
* img_s/img_m/img_l : links to the cover image of the book (small, medium, and large)
* Summary : the synopsis of the book
* Language : the language of the book
* Category : the category of the book
* city : the city part of the user's location
* state : the state or province part of the user's location
* country : the country part of the user's location

### Checking whether the data has missing values

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031175 entries, 0 to 1031174
Data columns (total 18 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   user_id              1031175 non-null  int64  
 1   location             1031175 non-null  object 
 2   age                  1031175 non-null  float64
 3   isbn                 1031175 non-null  object 
 4   rating               1031175 non-null  int64  
 5   book_title           1031175 non-null  object 
 6   book_author          1031175 non-null  object 
 7   year_of_publication  1031175 non-null  float64
 8   publisher            1031175 non-null  object 
 9   img_s                1031175 non-null  object 
 10  img_m                1031175 non-null  object 
 11  img_l                1031175 non-null  object 
 12  Summary              1031175 non-null  object 
 13  Language             1031175 non-null  object 
 14  Category             1031175 non-null  object 
 15

We can see that some columns in the dataset have different non-null counts. This indicates that the data contains missing values.

In [9]:
print('Total missing value in dataframe:', df.isnull().sum().sum(), 'records')

Total missing value in dataframe: 72275 records


In [10]:
col_with_missing = [col for col in df.columns if df[col].isnull().any()]
print('Column with missing value:', col_with_missing)

Column with missing value: ['city', 'state', 'country']


As we can see, the city, state, and country columns have missing values. There are many ways to handle missing values, but in this case we will drop these columns because they have little influence on book recommendations.
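As a hedged illustration of the options mentioned above, the sketch below uses a small made-up DataFrame (`toy`, not the Book-Crossing data) to compare dropping the affected columns with dropping the affected rows. All values and variable names here are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "book_title": ["a", "b", "c"],
    "city": ["ottawa", np.nan, "timmins"],
    "country": ["canada", np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_col = toy.isnull().sum()

# Option 1: drop every column that contains a missing value (keeps all rows).
cols_with_missing = [col for col in toy.columns if toy[col].isnull().any()]
dropped_cols = toy.drop(columns=cols_with_missing)

# Option 2: drop every row that contains a missing value (keeps all columns).
dropped_rows = toy.dropna(axis=0)

print(missing_per_col.to_dict())
print(dropped_cols.shape, dropped_rows.shape)
```

Dropping columns keeps every record but loses the location detail, while dropping rows keeps the columns but loses records. Which trade-off is acceptable depends on whether the columns matter for the model.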

In [11]:
df_no_missing = df.drop(col_with_missing, axis=1)
df_no_missing.head()

Unnamed: 0,user_id,location,age,isbn,rating,book_title,book_author,year_of_publication,publisher,img_s,img_m,img_l,Summary,Language,Category
0,2,"stockton, california, usa",18.0,195153448,0,Classical Mythology,Mark P. O. Morford,2002.0,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,Provides an introduction to classical myths pl...,en,['Social Science']
1,8,"timmins, ontario, canada",34.7439,2005018,5,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses']
2,11400,"ottawa, ontario, canada",49.0,2005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses']
3,11676,"n/a, n/a, n/a",34.7439,2005018,8,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses']
4,41385,"sudbury, ontario, canada",34.7439,2005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses']


In [12]:
df_no_missing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031175 entries, 0 to 1031174
Data columns (total 15 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   user_id              1031175 non-null  int64  
 1   location             1031175 non-null  object 
 2   age                  1031175 non-null  float64
 3   isbn                 1031175 non-null  object 
 4   rating               1031175 non-null  int64  
 5   book_title           1031175 non-null  object 
 6   book_author          1031175 non-null  object 
 7   year_of_publication  1031175 non-null  float64
 8   publisher            1031175 non-null  object 
 9   img_s                1031175 non-null  object 
 10  img_m                1031175 non-null  object 
 11  img_l                1031175 non-null  object 
 12  Summary              1031175 non-null  object 
 13  Language             1031175 non-null  object 
 14  Category             1031175 non-null  object 
dty

In [13]:
print('Total missing value in dataframe:', df_no_missing.isnull().sum().sum(), 'records')

Total missing value in dataframe: 0 records


### Explore Statistic Information

In general, each numeric column of a dataset has summary statistics, including:

*   Count : the number of values in the column
*   Mean : the average value of the column
*   Std : the standard deviation of the column
*   Min : the minimum value of the column
*   25% : the first quartile
*   50% : the second quartile, also known as the median (middle value)
*   75% : the third quartile
*   Max : the maximum value of the column
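To connect the list above to pandas, here is a minimal sketch on a made-up Series (`ratings` is hypothetical, not taken from the dataset). It checks that each entry returned by `describe()` agrees with the corresponding individual method.

```python
import pandas as pd

# Hypothetical ratings, made up for illustration.
ratings = pd.Series([0, 0, 5, 7, 8])

summary = ratings.describe()

# Each statistic can also be computed on its own.
manual = {
    "count": ratings.count(),
    "mean": ratings.mean(),
    "std": ratings.std(),          # sample standard deviation (ddof=1)
    "min": ratings.min(),
    "25%": ratings.quantile(0.25),
    "50%": ratings.median(),
    "75%": ratings.quantile(0.75),
    "max": ratings.max(),
}

for name, value in manual.items():
    print(f"{name}: describe={summary[name]:.4f} manual={value:.4f}")
```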



In [14]:
df_no_missing.describe()

Unnamed: 0,user_id,age,rating,year_of_publication
count,1031175.0,1031175.0,1031175.0,1031175.0
mean,140594.4,36.42902,2.839022,1995.283
std,80524.44,10.35354,3.854149,7.30934
min,2.0,5.0,0.0,1376.0
25%,70415.0,31.0,0.0,1992.0
50%,141210.0,34.7439,0.0,1997.0
75%,211426.0,41.0,7.0,2001.0
max,278854.0,99.0,10.0,2008.0


# Data Preparation
---

We will drop many columns at this stage because they do not provide information relevant to book recommendations, such as user_id, age, rating, etc.

In [15]:
col_to_drop = ['user_id', 'location', 'age', 'isbn', 'rating', 'year_of_publication', 'img_s', 'img_m', 'img_l', 'Summary', 'Language']
df_dropped = df_no_missing.drop(col_to_drop, axis=1)
df_dropped.head()

Unnamed: 0,book_title,book_author,publisher,Category
0,Classical Mythology,Mark P. O. Morford,Oxford University Press,['Social Science']
1,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,['Actresses']
2,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,['Actresses']
3,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,['Actresses']
4,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,['Actresses']


In [16]:
# check the shape of dataframe
print(f'The data has {df_dropped.shape[0]} records and {df_dropped.shape[1]} columns.')

The data has 1031175 records and 4 columns.


In [17]:
# dropping duplicate record
df_no_dup = df_dropped.drop_duplicates(['book_title'])
df_no_dup.head()

Unnamed: 0,book_title,book_author,publisher,Category
0,Classical Mythology,Mark P. O. Morford,Oxford University Press,['Social Science']
1,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,['Actresses']
15,Decision in Normandy,Carlo D'Este,HarperPerennial,['1940-1949']
18,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,Farrar Straus Giroux,['Medical']
29,The Mummies of Urumchi,E. J. W. Barber,W. W. Norton & Company,['Design']


In [18]:
# check the shape of dataframe
print(f'The data has {df_no_dup.shape[0]} records and {df_no_dup.shape[1]} columns.')

The data has 241090 records and 4 columns.


In [19]:
# assign the result instead of renaming in place to avoid SettingWithCopyWarning
df_no_dup = df_no_dup.rename(columns={'Category': 'category'})
df_no_dup.head()


Unnamed: 0,book_title,book_author,publisher,category
0,Classical Mythology,Mark P. O. Morford,Oxford University Press,['Social Science']
1,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,['Actresses']
15,Decision in Normandy,Carlo D'Este,HarperPerennial,['1940-1949']
18,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,Farrar Straus Giroux,['Medical']
29,The Mummies of Urumchi,E. J. W. Barber,W. W. Norton & Company,['Design']


### Data Preprocessing
Because the values in the category column are wrapped in square brackets, the data must be cleaned so that the model can process it properly.

In [20]:
import re

def clean_category(text):
  # remove the square brackets, then the single quotes
  text = re.sub(r'[\[\]]', '', text)
  text = text.replace("'", '')
  return text

In [21]:
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning
df_no_dup = df_no_dup.assign(clean_category=df_no_dup['category'].apply(clean_category))
df_no_dup.head()


Unnamed: 0,book_title,book_author,publisher,category,clean_category
0,Classical Mythology,Mark P. O. Morford,Oxford University Press,['Social Science'],Social Science
1,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,['Actresses'],Actresses
15,Decision in Normandy,Carlo D'Este,HarperPerennial,['1940-1949'],1940-1949
18,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,Farrar Straus Giroux,['Medical'],Medical
29,The Mummies of Urumchi,E. J. W. Barber,W. W. Norton & Company,['Design'],Design
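The category values look like the string form of a Python list, so another possible approach is to parse them with `ast.literal_eval` instead of stripping characters. The sketch below is an assumption about the data format rather than the method used in this notebook, and `parse_category` is a hypothetical helper name.

```python
import ast

def parse_category(text):
    """Turn a string such as "['Social Science']" into "Social Science"."""
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: keep the raw text unchanged.
        return text
    if isinstance(parsed, list):
        # Join multiple categories with a comma, like the bracket-stripping approach.
        return ", ".join(str(item) for item in parsed)
    return text

print(parse_category("['Social Science']"))
print(parse_category("['1940-1949']"))
print(parse_category("Medical"))
```

One difference worth noting: parsing keeps apostrophes that belong to the category name itself, whereas removing every `'` character also removes those.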


After removing the square brackets from the category column, we next unify categories that differ only by a trailing period, for example Voyages and travels. --> Voyages and travels

In [22]:
df_clean = df_no_dup.replace('Voyages and travels.', 'Voyages and travels')
df_clean = df_clean.replace('Documentary photography.', 'Documentary photography')
df_clean = df_clean.drop(['category'], axis=1)
df_clean.head()

Unnamed: 0,book_title,book_author,publisher,clean_category
0,Classical Mythology,Mark P. O. Morford,Oxford University Press,Social Science
1,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,Actresses
15,Decision in Normandy,Carlo D'Este,HarperPerennial,1940-1949
18,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,Farrar Straus Giroux,Medical
29,The Mummies of Urumchi,E. J. W. Barber,W. W. Norton & Company,Design
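The replacements above handle two specific categories. A more general variant, sketched below on made-up values, strips a trailing period from every category at once. This is an assumption about what the cleanup is meant to achieve, and it would also alter any category that legitimately ends in a period.

```python
import pandas as pd

# Hypothetical category values, made up for illustration.
categories = pd.Series([
    "Voyages and travels.",
    "Voyages and travels",
    "Medical",
    "Documentary photography.",
])

# Strip surrounding whitespace, then any trailing periods.
normalized = categories.str.strip().str.rstrip(".")

print(normalized.unique().tolist())
```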


In [23]:
# lowercase every column (all remaining columns hold strings)
for col in df_clean.columns:
  df_clean[col] = df_clean[col].str.lower()
df_clean.head()

Unnamed: 0,book_title,book_author,publisher,clean_category
0,classical mythology,mark p. o. morford,oxford university press,social science
1,clara callan,richard bruce wright,harperflamingo canada,actresses
15,decision in normandy,carlo d'este,harperperennial,1940-1949
18,flu: the story of the great influenza pandemic...,gina bari kolata,farrar straus giroux,medical
29,the mummies of urumchi,e. j. w. barber,w. w. norton & company,design


In [24]:
# check the shape of dataframe
print(f'The data has {df_clean.shape[0]} records and {df_clean.shape[1]} columns.')

The data has 241090 records and 4 columns.


In [25]:
df_no_dup = df_clean.drop_duplicates(['clean_category'])
df_no_dup.head()

Unnamed: 0,book_title,book_author,publisher,clean_category
0,classical mythology,mark p. o. morford,oxford university press,social science
1,clara callan,richard bruce wright,harperflamingo canada,actresses
15,decision in normandy,carlo d'este,harperperennial,1940-1949
18,flu: the story of the great influenza pandemic...,gina bari kolata,farrar straus giroux,medical
29,the mummies of urumchi,e. j. w. barber,w. w. norton & company,design


In [26]:
# check the shape of dataframe
print(f'The data has {df_no_dup.shape[0]} records and {df_no_dup.shape[1]} columns.')

The data has 6059 records and 4 columns.


# Modeling
---

For this model we will use Content-Based Filtering, where the goal is to find the similarity between books.
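To make "similarity" concrete, the sketch below computes cosine similarity between the word-count vectors of two category strings in plain Python. It is a simplification: it uses raw word counts rather than TF-IDF weights, and `cosine_similarity_text` is a hypothetical helper name.

```python
import math
from collections import Counter

def cosine_similarity_text(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a, counts_b = Counter(a.split()), Counter(b.split())
    # Dot product over the words that appear in both strings.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# One shared word out of two: similarity of about 0.5.
print(cosine_similarity_text("social science", "political science"))
# No shared words: similarity of 0.
print(cosine_similarity_text("social science", "medical"))
```

Two categories score above zero only when they share at least one word, so books whose categories have no words in common get a similarity of 0.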

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(df_no_dup['clean_category'])

In [28]:
from sklearn.metrics.pairwise import cosine_similarity
 
# Compute cosine similarity on the TF-IDF matrix
cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim_df = pd.DataFrame(cosine_sim, index=df_no_dup['book_title'], columns=df_no_dup['book_title'])
cosine_sim_df.sample(5, axis=1).sample(10, axis=0)

book_title,"on the edge (gold medal dreams , no 1)",the self-made snowman,li'l sis and uncle willie: a story based on the life and paintings of william h. johnson,everything you need to know about world history homework: a desk reference for students and parents/4th to 6th grades (scholastic homework reference series),owner manual life
"elizabeth the spy (sweet valley twins, 96)",0.0,0.0,0.0,0.0,0.0
reason and religious belief: an introduction to the philosophy of religion,0.0,0.0,0.0,0.0,0.0
excel,0.0,0.0,0.0,0.0,0.0
a change of skies (picador fiction),0.0,0.0,0.0,0.0,0.0
sloop of war,0.0,0.0,0.0,0.0,0.0
optionen und futures verstehen. grundlagen und neuere entwicklungen.,0.0,0.0,0.0,0.0,0.0
si vivre est tel: poèmes (collection contemporains),0.0,0.0,0.0,0.0,0.0
universe within,0.0,0.0,0.0,0.0,0.0
halloween echo,0.0,0.0,0.0,0.0,0.0
sun dog,0.0,0.0,0.0,0.0,0.0
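A similarity matrix like the one above can be turned into recommendations by sorting one column. The sketch below shows one possible way to do this on a made-up catalogue. The `recommend` function, the `books` data, and the parameter `k` are all hypothetical, and the sketch assumes book titles are unique.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue, made up for illustration.
books = pd.DataFrame({
    "book_title": ["book a", "book b", "book c", "book d"],
    "clean_category": ["social science", "political science", "medical", "science"],
})

# Same pipeline as in the notebook: TF-IDF on the category, then cosine similarity.
tfidf_matrix = TfidfVectorizer().fit_transform(books["clean_category"])
sim_df = pd.DataFrame(
    cosine_similarity(tfidf_matrix),
    index=books["book_title"],
    columns=books["book_title"],
)

def recommend(title, similarity, k=2):
    """Return the k titles most similar to `title`, excluding the title itself."""
    scores = similarity[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(k)

print(recommend("book a", sim_df))
```

Dropping the queried title before sorting matters because every book has a similarity of 1 with itself and would otherwise always rank first.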
