# Preprocessing

## Importing the dataset

Here we load the dataset and prepare it for use in the code.

In [48]:
import numpy as np
from nltk.tokenize import  word_tokenize 
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import string

In [49]:
data = pd.read_csv("reviews.csv")

data.head()

Unnamed: 0,Time_submitted,Review,Rating,Total_thumbsup,Reply
0,2022-07-09 15:00:00,"Great music service, the audio is high quality...",5,2,
1,2022-07-09 14:21:22,Please ignore previous negative rating. This a...,5,1,
2,2022-07-09 13:27:32,"This pop-up ""Get the best Spotify experience o...",4,0,
3,2022-07-09 13:26:45,Really buggy and terrible to use as of recently,1,1,
4,2022-07-09 13:20:49,Dear Spotify why do I get songs that I didn't ...,1,1,


In [50]:
data["Rating"].value_counts()

Rating
5    22095
1    17653
4     7842
2     7118
3     6886
Name: count, dtype: int64

## Feature selection

Next we choose which columns are needed. In this case the important columns are Review, which will serve as x, and Rating, which will serve as y (the label). All other columns are dropped. We also encode the rating as a binary label, where negative = 0 and positive = 1.

In [51]:
data.drop(columns=["Time_submitted", "Total_thumbsup", "Reply"], inplace=True)

data.head()

Unnamed: 0,Review,Rating
0,"Great music service, the audio is high quality...",5
1,Please ignore previous negative rating. This a...,5
2,"This pop-up ""Get the best Spotify experience o...",4
3,Really buggy and terrible to use as of recently,1
4,Dear Spotify why do I get songs that I didn't ...,1


We select the positive and negative reviews based on rating 5 and rating 1, because these two ratings represent the general cases of a positive and a negative review.

In [52]:
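# .copy() makes the filtered frame independent, so adding a column later does not write to a view
data = data[(data["Rating"] == 5) | (data["Rating"] == 1)].copy()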

data


Unnamed: 0,Review,Rating
0,"Great music service, the audio is high quality...",5
1,Please ignore previous negative rating. This a...,5
3,Really buggy and terrible to use as of recently,1
4,Dear Spotify why do I get songs that I didn't ...,1
6,I love the selection and the lyrics are provid...,5
...,...,...
61586,One day I was able to switch between songs and...,1
61587,It was my favourite app. I feel sorry for arti...,1
61588,Back to one frkng star. First of all there's t...,1
61589,Even though it was communicated that lyrics fe...,1


0 = negative

1 = positive

In [53]:
labels = [1 if i==5 else 0 for i in data["Rating"]]

data.loc[:, "labels"] = labels
data.head()

Unnamed: 0,Review,Rating,labels
0,"Great music service, the audio is high quality...",5,1
1,Please ignore previous negative rating. This a...,5,1
3,Really buggy and terrible to use as of recently,1,0
4,Dear Spotify why do I get songs that I didn't ...,1,0
6,I love the selection and the lyrics are provid...,5,1


To lighten the prediction and TF-IDF computations, we limit each class to 3000 rows (before cleaning), for a total of 6000 rows across both classes.

In [54]:
data = pd.concat([data[data.labels == 1].iloc[:3000, :], data[data.labels == 0].iloc[:3000, :]])


In [55]:
data.shape

(6000, 3)

## Duplicated data

Here we check whether any rows are duplicated, because duplicated data can contribute to overfitting.

In [56]:
data[data.duplicated()]

Unnamed: 0,Review,Rating,labels
2495,Good app for songs,5,1
2996,I love Spotify.,5,1
3261,Amazing music app,5,1
4457,Great app tons of music,5,1
5023,Best music experience ever,5,1
5827,The best music app ever,5,1
6246,The best music app ever,5,1
6297,Best music app by far,5,1
6302,I love Spotify!,5,1
6532,Great Music App.,5,1


Since the output above shows duplicated rows, we remove the extra copies and keep one of each.

In [57]:
data = data.drop_duplicates()

data[data.duplicated()]

Unnamed: 0,Review,Rating,labels


## Missing values

In this step we check whether any values are missing. Missing data would interfere with the modeling process.

In [58]:
data.isna().sum()

Review    0
Rating    0
labels    0
dtype: int64

In [59]:
data.isnull().sum()

Review    0
Rating    0
labels    0
dtype: int64

We can conclude that the data above has no missing values.

How can outliers be detected in a text-mining setting?
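One possible way to approach the outlier question, offered as a sketch rather than as this notebook's method, is to treat review length as a numeric feature and flag reviews whose word count falls outside the interquartile-range fences. The helper `length_outliers` and the sample reviews are hypothetical, and the 1.5 × IQR rule is a common convention, not something the dataset prescribes.

```python
from statistics import quantiles

def length_outliers(texts, k=1.5):
    """Return indices of texts whose word count lies outside the IQR fences."""
    lengths = [len(t.split()) for t in texts]
    q1, _, q3 = quantiles(lengths, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, n in enumerate(lengths) if n < low or n > high]

# Hypothetical sample: five short reviews and one very long one
sample_reviews = [
    "great app",
    "love the playlists",
    "too many ads lately",
    "best music app ever",
    "keeps crashing on my phone",
    "word " * 60,
]
outlier_idx = length_outliers(sample_reviews)
print(outlier_idx)
```

Whether very long or very short reviews should actually be dropped is a modeling decision; the sketch only shows how they could be found.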

In [60]:
data.head()

Unnamed: 0,Review,Rating,labels
0,"Great music service, the audio is high quality...",5,1
1,Please ignore previous negative rating. This a...,5,1
6,I love the selection and the lyrics are provid...,5,1
8,It's a great app and the best mp3 music app I ...,5,1
14,i hav any music that i like it is super🙌,5,1


In [61]:
data.shape

(5974, 3)

## Case folding

In this step we filter out extra characters such as mentions, hashtags, digits (because they carry no clear meaning here), links, and punctuation, and then convert the text to lowercase.

In [62]:
def cleaning_text(text):
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    text = re.sub(r'#\w+', '', text)
    text = re.sub(r"\d+", "", text)
    text = re.sub(r'RT[\s]+', '', text)
    text = re.sub(r'https?://\S+', '', text)
    text = text.translate(str.maketrans("","", string.punctuation))

    return text

data["Review"] = data["Review"].apply(cleaning_text).str.lower()

data.head()

Unnamed: 0,Review,Rating,labels
0,great music service the audio is high quality ...,5,1
1,please ignore previous negative rating this ap...,5,1
6,i love the selection and the lyrics are provid...,5,1
8,its a great app and the best mp music app i ha...,5,1
14,i hav any music that i like it is super🙌,5,1


Let's look at one of the results.

In [63]:
data.iloc[39].Review

'i have used this app for the last  years straight always use premium its an essential'
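The double space in the output above is left behind when the digits are removed. The sketch below replays the mention, hashtag, digit, link, and punctuation steps on a hypothetical sample string and then collapses repeated whitespace. The function name `clean_and_squeeze`, the sample text, the decision to strip links first, and the whitespace step are assumptions added for illustration; the `RT` step is omitted for brevity.

```python
import re
import string

def clean_and_squeeze(text):
    """Strip links, mentions, hashtags, digits, punctuation; squeeze spaces; lowercase."""
    text = re.sub(r'https?://\S+', '', text)      # links first, before digits are stripped
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)    # mentions
    text = re.sub(r'#\w+', '', text)              # hashtags
    text = re.sub(r'\d+', '', text)               # digits
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Collapse runs of whitespace left behind by the removals above
    return re.sub(r'\s+', ' ', text).strip().lower()

sample = "@spotify I have used this app for 5 years! #premium https://example.com"
cleaned = clean_and_squeeze(sample)
print(cleaned)
```

Because `word_tokenize` splits on whitespace anyway, the extra spaces do not change the tokens; squeezing them mainly makes the cleaned text easier to read.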

Is normalization necessary?

## Tokenization

In this step we split each sentence or document into individual words.

In [64]:
tokenized = data["Review"].apply(lambda x: word_tokenize(x))

tokenized

0        [great, music, service, the, audio, is, high, ...
1        [please, ignore, previous, negative, rating, t...
6        [i, love, the, selection, and, the, lyrics, ar...
8        [its, a, great, app, and, the, best, mp, music...
14       [i, hav, any, music, that, i, like, it, is, su...
                               ...                        
12014    [changing, my, review, from, stars, to, the, m...
12021    [cant, play, songs, in, order, have, to, liste...
12022    [everytime, i, play, a, song, another, song, g...
12024    [i, hope, spotify, will, fix, the, design, whe...
12025    [terrible, experience, when, using, it, to, st...
Name: Review, Length: 5974, dtype: object

## Stopwords

In this step we remove words that carry little meaning on their own, such as conjunctions, prepositions, and pronouns.

In [65]:
stops = set(stopwords.words('english'))

def stopword(text):
    text = [word for word in text if word not in stops]
    return text

tokenized = tokenized.apply(lambda x: stopword(x))

tokenized

0        [great, music, service, audio, high, quality, ...
1        [please, ignore, previous, negative, rating, a...
6        [love, selection, lyrics, provided, song, your...
8        [great, app, best, mp, music, app, ever, used,...
14                              [hav, music, like, super🙌]
                               ...                        
12014    [changing, review, stars, music, stops, repeat...
12021    [cant, play, songs, order, listen, playlist, c...
12022    [everytime, play, song, another, song, goes, t...
12024    [hope, spotify, fix, design, upload, instagram...
12025    [terrible, experience, using, stream, sonos, y...
Name: Review, Length: 5974, dtype: object

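## Stemming

In the stemming step, every word that carries an affix is reduced to its base form. For example, the Indonesian word membaca becomes baca, and driving becomes drive. The Porter stemmer produces stems rather than dictionary words, so outputs such as servic and qualiti are expected.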

In [66]:
stemmer = PorterStemmer()

def stemming(text):
    text = [stemmer.stem(token) for token in text]
    
    return text

stemmed_token = tokenized.apply(lambda x: stemming(x))

stemmed_token

0        [great, music, servic, audio, high, qualiti, a...
1        [pleas, ignor, previou, neg, rate, app, super,...
6        [love, select, lyric, provid, song, your, listen]
8        [great, app, best, mp, music, app, ever, use, ...
14                              [hav, music, like, super🙌]
                               ...                        
12014    [chang, review, star, music, stop, repeatedli,...
12021    [cant, play, song, order, listen, playlist, ca...
12022    [everytim, play, song, anoth, song, goe, tri, ...
12024    [hope, spotifi, fix, design, upload, instagram...
12025    [terribl, experi, use, stream, sono, youtub, m...
Name: Review, Length: 5974, dtype: object

# Term weighting

## Counting word occurrences

To prepare for TF-IDF, we build the vocabulary from all documents and then count, for each word, the number of documents in which it appears.

In [67]:
combined_text = [' '.join(text) for text in stemmed_token]
sentences = []
word_set = set()  # vocabulary

for sent in combined_text:
    # Keep alphabetic tokens only
    x = [i for i in word_tokenize(sent) if i.isalpha()]
    sentences.append(x)
    word_set.update(x)

# Total documents in our corpus
total_documents = len(sentences)

# Index of each vocabulary word, used as its position in the TF-IDF vector
index_dict = {word: i for i, word in enumerate(word_set)}


In [68]:
def count_dict(sentences):
    # Document frequency: number of documents that contain each word
    word_count = dict.fromkeys(word_set, 0)
    for sent in sentences:
        for word in set(sent):  # count each word once per document
            word_count[word] += 1
    return word_count
 
word_count = count_dict(sentences)

## Term Frequency

For term frequency (TF), we compute for each document d the number of occurrences of a term t in d divided by the total number of words n in d.

In [69]:
# Term frequency: occurrences of `word` divided by the number of tokens in `document`
def termfreq(document, word):
    N = len(document)
    occurrence = document.count(word)
    return occurrence/N

## Inverse Document Frequency

Next we compute the IDF: the logarithm of the ratio between the total number of documents n and the number of documents that contain term t. The code adds 1 to the document count in the denominator.

In [70]:
def inverse_doc_freq(word):
    # Document frequency plus one; words missing from word_count are treated as 0 + 1
    word_occurrence = word_count.get(word, 0) + 1
    return np.log(total_documents/word_occurrence)

## TF-IDF

In this step we combine TF and IDF into a single weight that is used as the feature value for each document.
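Written out, the weights computed by `termfreq` and `inverse_doc_freq` take the following form. The symbols are introduced here for illustration: $f_{t,d}$ is the number of occurrences of term $t$ in document $d$, $|d|$ is the number of tokens in $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$. The logarithm is natural, matching `np.log`, and the $+1$ mirrors the code.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t) + 1}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```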

In [71]:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence,word)
        idf = inverse_doc_freq(word)
         
        value = tf*idf
        tf_idf_vec[index_dict[word]] = value 
    return tf_idf_vec
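
As a worked example of the same arithmetic, the sketch below computes the weights by hand on a hypothetical three-document corpus using only the standard library. The names `docs`, `toy_tf`, `toy_idf`, and `toy_tf_idf` are illustrative and not part of the notebook.

```python
import math

# Hypothetical toy corpus of already tokenized, stemmed documents
docs = [
    ["great", "music", "app"],
    ["app", "crash", "app"],
    ["great", "sound"],
]

# Document frequency: number of documents containing each word
doc_freq = {}
for doc in docs:
    for word in set(doc):
        doc_freq[word] = doc_freq.get(word, 0) + 1

def toy_tf(doc, word):
    return doc.count(word) / len(doc)

def toy_idf(word):
    # +1 in the denominator, as in inverse_doc_freq
    return math.log(len(docs) / (doc_freq.get(word, 0) + 1))

def toy_tf_idf(doc, word):
    return toy_tf(doc, word) * toy_idf(word)

weight_crash = toy_tf_idf(docs[1], "crash")   # word found in 1 of 3 documents
weight_app = toy_tf_idf(docs[1], "app")       # word found in 2 of 3 documents
print(round(weight_crash, 4), round(weight_app, 4))
```

With this smoothing, a word that appears in all but one document gets a weight of exactly 0, and a word that appears in every document gets a slightly negative weight.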

In [72]:
len(word_set)

5193

Convert the TF-IDF result into a DataFrame so that it can be processed by the model class.

In [73]:
#TF-IDF Encoded text corpus
vectors = []
for sent in sentences:
    vec = tf_idf(sent)
    vectors.append(vec)
 
tfidf = pd.DataFrame(vectors, columns=list(word_set))
tfidf

Unnamed: 0,𝐝𝐢𝐬𝐚𝐩𝐩𝐨𝐢𝐧𝐭𝐞𝐝,recommand,suddenlyback,die,frame,prior,stress,stick,share,pan,...,kinda,consistli,dr,roll,pioneer,zuri,worser,loveplu,abound,structur
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5969,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5970,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5971,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5972,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


View the result as a NumPy array.

In [74]:
tfidf.to_numpy()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Next we look for terms whose TF-IDF values are close to 0. A term is treated as uncommon when its TF-IDF weights summed over all documents are below 0.2, since a near-zero total means the term rarely appears in the corpus.

In [75]:
uncommon_word = []
for col in tfidf.columns:
    if tfidf[col].sum() < 0.2:
        uncommon_word.append(col)


uncommon_word

['writer',
 'yn',
 'disjoint',
 'hoursnon',
 'detent',
 'likelihood',
 'induct',
 'semidec',
 'commercialfre',
 'gim',
 'weav',
 'prevu',
 'toolbar',
 'stuf',
 'span',
 'ware',
 'bsrebon',
 'starjust',
 'conceal',
 'serbia',
 'didi',
 '𝐭𝐡𝐞𝐫𝐞𝐬',
 'disk',
 'shuffel',
 'distort',
 'strength',
 'onlow',
 '𝐤𝐧𝐨𝐰',
 'undesir',
 'musicssong',
 'unusu',
 'mdiddisnefidhddndmdidnaj',
 'betteranoth',
 'influenc',
 'truth',
 'songlist',
 'fecal',
 'essex',
 'useaft',
 'breadth',
 'nervesif',
 'servicefeel',
 'deliber',
 'pref',
 'tought',
 'fame',
 'autogener',
 'fnk',
 'frustratingli',
 'nownot',
 'disallow',
 'nxt',
 'youtubepro',
 'blog',
 'ge',
 'nono',
 'nitpick',
 'listenit',
 'plete',
 'fixedihav',
 'camp',
 'halen',
 '𝐦𝐲',
 'sadden',
 'buggiest',
 'sayin',
 'krki',
 'platformsi',
 'lay',
 'tosort',
 'abandon',
 'sixth',
 'belong',
 'reba',
 'noooo',
 'grabber',
 '𝐬𝐨𝐮𝐧𝐝',
 'highr',
 'unconstitut',
 'bubbi',
 'crusti',
 'offlinedownload',
 'playlistpleasethat',
 'bubbl',
 '𝐭𝐡𝐢𝐬𝐚𝐩𝐩',
 'phonesp

In [76]:
tfidf.drop(columns=uncommon_word, inplace=True)

Then we split the data into a training set, used only for training, and a test set, used for testing.

In [77]:
def Train_Test_Split(x, y, random_seed=None, test_size=0.2):
    n = len(x)
    if random_seed is not None:  # a seed of 0 is valid and must not be skipped
        np.random.seed(random_seed)
    
    test_size = int(test_size * n)
    indices = np.random.permutation(n)
    train_indices, test_indices = indices[test_size:], indices[:test_size]
    return x.iloc[train_indices], x.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]


X_train, X_test, y_train, y_test = Train_Test_Split(tfidf, data["labels"], random_seed=42)
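
As a quick check of the split arithmetic, the sketch below reproduces the index logic with Python's `random` module instead of `np.random.permutation` (a swapped-in technique, so the example needs no third-party library and will not produce the same permutation). The function name `split_indices` is hypothetical. With the 5974 rows left after deduplication and `test_size=0.2`, it yields 1194 test rows, the same count that appears in the prediction progress output.

```python
import random

def split_indices(n, test_size=0.2, random_seed=None):
    """Shuffle range(n) and cut it into train and test index lists."""
    rng = random.Random(random_seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(test_size * n)  # truncates, so 0.2 * 5974 gives 1194
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5974, test_size=0.2, random_seed=42)
print(len(train_idx), len(test_idx))
```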

# Modeling

For modeling, I implement a class based on the KNN algorithm that provides training, prediction, and accuracy evaluation.

In [78]:
class KNN:
    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors
        self.x = None
        self.y = None
        self.total_iteration = 0
        self.iteration = 0

    def fit(self, x,  y):
        if x.shape[0] != y.shape[0]:
            raise ValueError(f"cannot fit with different sizes: x ({x.shape[0]}) and y ({y.shape[0]})")
        
        self.x = np.array(x)
        self.y = np.array(y)

    def _predict(self, x_predict):
        self.iteration += 1
        print(self.iteration, "/", self.total_iteration)
        print("counting distance . . .")
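        # Squared Euclidean distance to every training row (the square root is not needed for ranking)
        distance = np.array([np.sum((x_train-x_predict)**2) for x_train in self.x])

        # Indices of the k nearest neighbours
        nearest = np.argsort(distance)[:self.n_neighbors]
        print("labeling . . .")
        label = [self.y[i] for i in nearest]
        # Majority vote among the neighbours' labels
        return np.bincount(label).argmax()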

        
    def predict(self, x_predict):
        self.total_iteration = len(x_predict)
        self.iteration = 0
        x_predict = np.array(x_predict)
        predicted = [self._predict(x) for x in x_predict]
        return np.array(predicted)
    
    def accuracy(self, y_true, y_pred):
        intersection = 0
        for i in range(len(y_true)):
            if y_true[i] == y_pred[i]:
                intersection += 1

        return intersection / len(y_true)
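
To make the voting step concrete, here is a minimal, self-contained sketch of the same idea on hypothetical 2-D points, using squared Euclidean distance and a majority vote among the `k` nearest neighbours. The function `knn_predict` and the toy data are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of query by majority vote among its k nearest points."""
    # Squared Euclidean distance; the square root is not needed for ranking
    distances = [
        sum((a - b) ** 2 for a, b in zip(point, query))
        for point in train_x
    ]
    nearest = sorted(range(len(train_x)), key=lambda i: distances[i])[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: label 0 clusters near the origin, label 1 near (5, 5)
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(train_x, train_y, (0.5, 0.5), k=3)
pred_high = knn_predict(train_x, train_y, (5.5, 5.5), k=3)
print(pred_low, pred_high)
```

With two classes, an odd `k` avoids tied votes, which is one reason the values tried in this notebook are 3, 5, 7, and 9.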
        

## Training

In [79]:
model = KNN(n_neighbors=5)


In [80]:
model.fit(X_train, y_train)

## Testing

In [81]:
y_predict = model.predict(X_test)

y_predict

1 / 1194
counting distance . . .
labeling . . .
2 / 1194
counting distance . . .
labeling . . .
...

array([0, 0, 1, ..., 1, 1, 0], dtype=int64)

In [82]:
model.accuracy(y_test.to_numpy(), y_predict)

0.7839195979899497

In the experiments I ran while varying the value of k, the accuracies were as follows.

With a 20% test / 80% train split and 8000 rows (before cleaning):
- k = 3, accuracy = 80.6026 %
- k = 5, accuracy = 80.7909 %
- k = 7, accuracy = 80.2887 %
- k = 9, accuracy = 79.5354 %

So the value of k with the highest accuracy is k = 5.
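
The comparison can also be checked programmatically. The sketch below copies the four reported accuracies into a dictionary (the name `accuracy_by_k` is hypothetical) and picks the key with the largest value.

```python
# Accuracies (in percent) reported above for each k
accuracy_by_k = {3: 80.6026, 5: 80.7909, 7: 80.2887, 9: 79.5354}

# Key whose accuracy is largest
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
gap_to_k3 = accuracy_by_k[best_k] - accuracy_by_k[3]
print(best_k, round(gap_to_k3, 4))
```

The gap between k = 5 and k = 3 is under 0.2 percentage points, so the ranking of nearby values of k should be read with that margin in mind.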