# Document Similarity

# Book Recommendation with Document Similarity

Book recommendation is closely related to movie recommendation: the goal is to recommend books to readers based on their interests. Three common approaches are described below.

Simple Rule-based Recommenders:
- Popularity-based recommendation: Show popular books that many other readers have read or rated highly.
- Genre-based recommendation: Suggest books in genres that match the reader's interests. For example, if someone likes science fiction, recommend books in that genre.

Content-based Recommenders:
- Attribute-based recommendation: Analyze book content attributes such as description, genre, author, and keywords, then recommend books whose attributes resemble those of books the reader already liked.
- Review sentiment analysis: Analyze the sentiment of reader reviews and recommend positively reviewed books that are similar to the books the reader liked.

Collaborative Filtering Recommenders:
- Reader preference analysis: Analyze reader preferences based on the books they have read or rated. If two readers have similar preferences, books that one of them has read can be recommended to the other.
- Rating-based filtering: Use the ratings readers give to specific books. If two readers give similar ratings to the same books, the system can recommend to each of them other books that the other one rated highly.
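
As a rough illustration of the rule-based idea, the sketch below ranks a small, made-up catalogue by rating count within one genre. The DataFrame contents, the column names, and the `top_by_genre` helper are hypothetical and are not part of the Goodreads data used later.

```python
import pandas as pd

# Hypothetical toy catalogue; titles and numbers are invented for illustration
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "genre": ["Science Fiction", "Romance", "Science Fiction", "Science Fiction"],
    "num_ratings": [1200, 5400, 300, 9800],
})

def top_by_genre(catalogue, genre, n=2):
    """Return the n most-rated titles within one genre."""
    in_genre = catalogue[catalogue["genre"] == genre]
    ranked = in_genre.sort_values("num_ratings", ascending=False)
    return ranked["title"].head(n).tolist()

print(top_by_genre(books, "Science Fiction"))
```

A rule like this needs no model at all, which is why it is often used as a baseline before content-based or collaborative methods.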


In [39]:
!pip install textsearch
!pip install contractions
import nltk
nltk.download('punkt')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [40]:
import pathlib
pathlib.Path().resolve()

WindowsPath('C:/Users/User')

# Load and view the data

In [41]:
import pandas as pd

df = pd.read_csv("goodreads_data.csv")
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   10000 non-null  int64  
 1   Book         10000 non-null  object 
 2   Author       10000 non-null  object 
 3   Description  9923 non-null   object 
 4   Genres       10000 non-null  object 
 5   Avg_Rating   10000 non-null  float64
 6   Num_Ratings  10000 non-null  object 
 7   URL          10000 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 625.1+ KB


Unnamed: 0.1,Unnamed: 0,Book,Author,Description,Genres,Avg_Rating,Num_Ratings,URL
0,0,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ...",4.27,5691311,https://www.goodreads.com/book/show/2657.To_Ki...
1,1,Harry Potter and the Philosopher’s Stone (Harr...,J.K. Rowling,Harry Potter thinks he is an ordinary boy - un...,"['Fantasy', 'Fiction', 'Young Adult', 'Magic',...",4.47,9278135,https://www.goodreads.com/book/show/72193.Harr...
2,2,Pride and Prejudice,Jane Austen,"Since its immediate success in 1813, Pride and...","['Classics', 'Fiction', 'Romance', 'Historical...",4.28,3944155,https://www.goodreads.com/book/show/1885.Pride...
3,3,The Diary of a Young Girl,Anne Frank,Discovered in the attic in which she spent the...,"['Classics', 'Nonfiction', 'History', 'Biograp...",4.18,3488438,https://www.goodreads.com/book/show/48855.The_...
4,4,Animal Farm,George Orwell,Librarian's note: There is an Alternate Cover ...,"['Classics', 'Fiction', 'Dystopia', 'Fantasy',...",3.98,3575172,https://www.goodreads.com/book/show/170448.Ani...


In [42]:
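# work on a copy so the assignments below do not trigger SettingWithCopyWarning
df = df[['Book', 'Author', 'Genres', 'Num_Ratings']].copy()
df['Author'] = df['Author'].fillna('')
# combine author and genres into one text field used for similarity
df['description'] = df['Author'].map(str) + ' ' + df['Genres']
df = df.dropna()
# note: Num_Ratings has dtype object (strings), so this sort is lexicographic, not numeric
df = df.sort_values(by=['Num_Ratings'], ascending=False)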
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1468 to 3747
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Book         10000 non-null  object
 1   Author       10000 non-null  object
 2   Genres       10000 non-null  object
 3   Num_Ratings  10000 non-null  object
 4   description  10000 non-null  object
dtypes: object(5)
memory usage: 468.8+ KB
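
The `description` column built above concatenates the author with the raw `Genres` string, which looks like a Python list, so brackets and quotes stay in the text until normalization strips them. If cleaner input is preferred, one option is to parse the string with `ast.literal_eval` first. This is only a sketch; the helper name `genres_to_text` is hypothetical and the notebook itself does not do this.

```python
import ast

def genres_to_text(raw):
    """Turn "['Romance', 'Historical Fiction']" into 'Romance Historical Fiction'."""
    genres = ast.literal_eval(raw)  # parses a literal list without executing code
    return " ".join(genres)

raw = "['Romance', 'Historical Fiction', 'Regency']"
text = "Georgette Heyer " + genres_to_text(raw)
print(text)
```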


In [43]:
df.head()

Unnamed: 0,Book,Author,Genres,Num_Ratings,description
1468,"Hometown Girl After All (Hometown, #2)",Kirsten Fullmer,"['Contemporary', 'Young Adult', 'New Adult', '...",999,"Kirsten Fullmer ['Contemporary', 'Young Adult'..."
3033,"Hometown Girl After All (Hometown, #2)",Kirsten Fullmer,"['Contemporary', 'Young Adult', 'New Adult', '...",999,"Kirsten Fullmer ['Contemporary', 'Young Adult'..."
4735,"Belonging (Temptation, #2)",Karen Ann Hopkins,"['Young Adult', 'Romance', 'Amish', 'Contempor...",998,"Karen Ann Hopkins ['Young Adult', 'Romance', '..."
7411,تشي,أحمد خالد توفيق,"['Fiction', 'Novels', 'Fantasy']",998,"أحمد خالد توفيق ['Fiction', 'Novels', 'Fantasy']"
2796,Living The Best Day Ever,Hendri Coetzee,"['Nonfiction', 'Adventure']",997,"Hendri Coetzee ['Nonfiction', 'Adventure']"
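
The rows above start at 999 rather than at the most-rated titles because `Num_Ratings` has dtype `object`: the values are compared as strings, so the sort is lexicographic. A possible fix is to convert the column to numbers before sorting. The sketch below assumes the raw strings are digits optionally separated by commas, which is an assumption about the file and is not confirmed by the output; the toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with rating counts stored as strings
toy = pd.DataFrame({
    "Book": ["A", "B", "C"],
    "Num_Ratings": ["999", "5,691,311", "12,345"],
})

# String sort: "999" comes first because "9" > "5" > "1"
string_order = toy.sort_values("Num_Ratings", ascending=False)["Book"].tolist()

# Strip thousands separators, then convert; unparseable values become NaN
toy["Num_Ratings"] = pd.to_numeric(
    toy["Num_Ratings"].str.replace(",", "", regex=False), errors="coerce"
)
numeric_order = toy.sort_values("Num_Ratings", ascending=False)["Book"].tolist()

print(string_order, numeric_order)
```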


In [44]:
df.iloc[4799].description

"Georgette Heyer ['Romance', 'Historical Fiction', 'Historical Romance', 'Historical', 'Fiction', 'Regency', 'Classics']"

# Build a Book Recommendation System

Steps
- Preprocessing
- Feature engineering
- Computing document similarity
- Retrieval
- Book recommendation


## Document Similarity

There are various ways to compute the similarity between two documents. One of the most widely used measures is __cosine similarity__.

### Cosine Similarity

Cosine similarity produces a numeric score that indicates how similar two text documents are. Mathematically, for two document vectors $A$ and $B$ with $n$ components, it is defined as follows:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
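
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal numerical sketch using NumPy only is given below; the vectors are made-up term counts and the function name `cosine_sim` is hypothetical. scikit-learn's `cosine_similarity`, used later in this notebook, computes the same quantity for whole matrices.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up term-count vectors over a three-word vocabulary
doc_a = np.array([1.0, 1.0, 0.0])
doc_b = np.array([2.0, 2.0, 0.0])  # same direction as doc_a, twice as long
doc_c = np.array([0.0, 0.0, 3.0])  # shares no terms with doc_a

print(cosine_sim(doc_a, doc_b))  # close to 1: same direction
print(cosine_sim(doc_a, doc_c))  # 0: no overlap
```

Because the score depends only on direction, a long document and a short one with the same word proportions are treated as equally similar.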


In [46]:
import nltk
import re
import numpy as np
import contractions

stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # remove special characters; flags must be passed by keyword,
    # because the fourth positional argument of re.sub is count
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    doc = contractions.fix(doc)
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    #filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['description']))
len(norm_corpus)

10000
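
One detail worth checking in normalization code like this: the fourth positional parameter of `re.sub` is `count`, not `flags`, so a flag passed positionally is read as a limit on the number of substitutions. The sketch below shows the difference on a made-up string. Passing flags by keyword avoids the problem.

```python
import re

text = "a-b-c-d-e"

# re.I has the integer value 2, so here it is interpreted as count=2
# and only the first two hyphens are removed
limited = re.sub(r"-", "", text, re.I)

# Keyword form: the flag is applied as a flag and every match is replaced
full = re.sub(r"-", "", text, flags=re.I)

print(limited, full)
```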

## Extract TF-IDF Features

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape

(10000, 10023)

## Compute Pairwise Document Similarity

In [48]:
from sklearn.metrics.pairwise import cosine_similarity

doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9990,9991,9992,9993,9994,9995,9996,9997,9998,9999
0,1.0,1.0,0.113022,0.008637,0.0,0.073988,0.007678,0.028422,0.055005,0.056849,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,0.113022,0.008637,0.0,0.073988,0.007678,0.028422,0.055005,0.056849,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.113022,0.113022,1.0,0.007605,0.0,0.07917,0.00676,0.025027,0.048434,0.063815,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.008637,0.008637,0.007605,1.0,0.0,0.047804,0.066411,0.042331,0.15702,0.036816,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.043812,0.0,0.0,0.0,0.042988,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Get the book titles

In [49]:
book_list = df['Book'].values
book_list, book_list.shape

(array(['Hometown Girl After All (Hometown, #2)',
        'Hometown Girl After All (Hometown, #2)',
        'Belonging (Temptation, #2)', ..., 'witchbird',
        'The Sense of a Deity',
        'Broken: The Failed Promise of Muslim Inclusion'], dtype=object),
 (10000,))

# Find the Top Similar Books for a Sample Book
Let's take Hometown Girl After All (Hometown, #2), the first row of the sorted DataFrame above, and try to find the most similar books that could be recommended. Note that `Num_Ratings` is stored as strings, so the sort is lexicographic and this book is not necessarily the one with the most ratings.

## Get the book ID

In [50]:
book_idx = np.where(book_list == 'Hometown Girl After All (Hometown, #2)')[0][0]
book_idx

0

## Get the similarities

In [51]:
book_similarities = doc_sim_df.iloc[book_idx].values
book_similarities

array([1.        , 1.        , 0.11302219, ..., 0.        , 0.        ,
       0.        ])

## Get top 5 similar book IDs

In [22]:
similar_book_idxs = np.argsort(-book_similarities)[1:6]
similar_book_idxs

array([   1,   69, 9931, 9932, 3872], dtype=int64)

## Get top 5 similar books

In [52]:
similar_books = book_list[similar_book_idxs]
similar_books

array(['Hometown Girl After All (Hometown, #2)',
       'Christmas in Smithville (Hometown, #4)',
       'Hometown Girl Forever (Hometown Series, #3)',
       'Hometown Girl Forever (Hometown, #3)',
       'Love on the Line (Women at Work, #1)'], dtype=object)

## Build a book recommender function to recommend the top 5 similar books for any book

In [53]:
def book_recommender(book_title, books=book_list, doc_sims=doc_sim_df):
    # find the row index of the book (first match if the title is duplicated)
    book_idx = np.where(books == book_title)[0][0]
    # get the similarity scores of that book against all books
    book_similarities = doc_sims.iloc[book_idx].values
    # get the indices of the top 5 scores, skipping position 0 (assumed to be the book itself)
    similar_book_idxs = np.argsort(-book_similarities)[1:6]
    # look up the titles for those indices
    similar_books = books[similar_book_idxs]
    # return the top 5 similar books
    return similar_books
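
Because the data contains duplicate titles, the `[1:6]` slice can return the query book itself, as happened for Hometown Girl After All (Hometown, #2) in the output above. The variant below is one possible way to handle this and is not the notebook's method: it skips rows whose title equals the query, drops repeated titles, and raises a clear error for unknown titles. The function name `recommend_distinct` and the toy data are hypothetical, and `sims` is assumed to be a plain NumPy array such as `doc_sim`.

```python
import numpy as np

def recommend_distinct(book_title, books, sims, top_n=5):
    """Return up to top_n distinct titles most similar to book_title."""
    matches = np.where(books == book_title)[0]
    if matches.size == 0:
        raise ValueError(f"Unknown title: {book_title!r}")
    scores = sims[matches[0]]
    # a stable sort keeps the original row order among tied scores
    order = np.argsort(-scores, kind="stable")
    picked = []
    for idx in order:
        title = books[idx]
        if title == book_title or title in picked:
            continue  # skip the query book and repeated titles
        picked.append(title)
        if len(picked) == top_n:
            break
    return picked

# Hypothetical toy data: "A" appears twice, like the duplicated rows above
titles = np.array(["A", "A", "B", "B (alt)", "C"], dtype=object)
sims = np.array([
    [1.0, 1.0, 0.8, 0.8, 0.1],
    [1.0, 1.0, 0.8, 0.8, 0.1],
    [0.8, 0.8, 1.0, 0.9, 0.2],
    [0.8, 0.8, 0.9, 1.0, 0.2],
    [0.1, 0.1, 0.2, 0.2, 1.0],
])
print(recommend_distinct("A", titles, sims, top_n=2))
```

Near-duplicate editions with slightly different titles, such as the two Hometown Girl Forever entries, would still both appear; handling those would need fuzzy title matching, which is outside this sketch.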

In [54]:
popular_books = ['To Kill a Mockingbird', 'Pride and Prejudice', 'The Diary of a Young Girl', 'Animal Farm', 
                  'Hometown Girl After All (Hometown, #2)', 'Christmas in Smithville (Hometown, #4)', 
                  'Love on the Line (Women at Work, #1)']

In [55]:
for Book in popular_books:
    print('Books:', Book )
    print('Top 5 recommended Books:', book_recommender(book_title=Book, books=book_list, doc_sims=doc_sim_df))
    print()

Books: To Kill a Mockingbird
Top 5 recommended Books: ['Go Set a Watchman' 'A Separate Peace'
 'The Adventures of Huckleberry Finn' 'The Scarlet Letter'
 'The Fig Orchard']

Books: Pride and Prejudice
Top 5 recommended Books: ['Persuasion' 'Emma' 'Sense and Sensibility' 'The Complete Novels'
 'Pride and Prejudice, Mansfield Park, Persuasion']

Books: The Diary of a Young Girl
Top 5 recommended Books: ['Rabbit-Proof Fence' 'The Auschwitz Chapter' 'The Complete Maus'
 'Twelve Years a Slave' 'The Fields of Home (Little Britches, #5)']

Books: Animal Farm
Top 5 recommended Books: ['Animal Farm / 1984' '1984' 'Homage to Catalonia'
 'The Road to Wigan Pier' 'Down and Out in Paris and London']

Books: Hometown Girl After All (Hometown, #2)
Top 5 recommended Books: ['Hometown Girl After All (Hometown, #2)'
 'Christmas in Smithville (Hometown, #4)'
 'Hometown Girl Forever (Hometown Series, #3)'
 'Hometown Girl Forever (Hometown, #3)'
 'Love on the Line (Women at Work, #1)']

Books: Christmas in