# Preprocessing

## Importing the dataset

Here we load the dataset and prepare it for use in the code.

In [284]:
import numpy as np
from nltk.tokenize import word_tokenize
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import string

In [285]:
data = pd.read_csv("reviews.csv")

data.head()

Unnamed: 0,Time_submitted,Review,Rating,Total_thumbsup,Reply
0,2022-07-09 15:00:00,"Great music service, the audio is high quality...",5,2,
1,2022-07-09 14:21:22,Please ignore previous negative rating. This a...,5,1,
2,2022-07-09 13:27:32,"This pop-up ""Get the best Spotify experience o...",4,0,
3,2022-07-09 13:26:45,Really buggy and terrible to use as of recently,1,1,
4,2022-07-09 13:20:49,Dear Spotify why do I get songs that I didn't ...,1,1,


In [286]:
data["Rating"].value_counts()

Rating
5    22095
1    17653
4     7842
2     7118
3     6886
Name: count, dtype: int64

## Feature selection



Next we choose which columns to keep. In this case the important columns are Review, which becomes x, and Rating, which becomes y (the label). Every other column is dropped. We also encode the rating as a binary label: negative = 0, positive = 1.

In [287]:
data.drop(columns=["Time_submitted", "Total_thumbsup", "Reply"], inplace=True)

data.head()

Unnamed: 0,Review,Rating
0,"Great music service, the audio is high quality...",5
1,Please ignore previous negative rating. This a...,5
2,"This pop-up ""Get the best Spotify experience o...",4
3,Really buggy and terrible to use as of recently,1
4,Dear Spotify why do I get songs that I didn't ...,1


We take the positive and negative comments from ratings 5 and 1 respectively, because these two ratings are the most general, unambiguous cases.

In [288]:
data = data[(data["Rating"] == 5) | (data["Rating"] == 1)].copy()

data


Unnamed: 0,Review,Rating
0,"Great music service, the audio is high quality...",5
1,Please ignore previous negative rating. This a...,5
3,Really buggy and terrible to use as of recently,1
4,Dear Spotify why do I get songs that I didn't ...,1
6,I love the selection and the lyrics are provid...,5
...,...,...
61586,One day I was able to switch between songs and...,1
61587,It was my favourite app. I feel sorry for arti...,1
61588,Back to one frkng star. First of all there's t...,1
61589,Even though it was communicated that lyrics fe...,1


0 = negative

1 = positive

In [289]:
labels = [1 if i==5 else 0 for i in data["Rating"]]

data.loc[:, "labels"] = labels
data.head()

Unnamed: 0,Review,Rating,labels
0,"Great music service, the audio is high quality...",5,1
1,Please ignore previous negative rating. This a...,5,1
3,Really buggy and terrible to use as of recently,1,0
4,Dear Spotify why do I get songs that I didn't ...,1,0
6,I love the selection and the lyrics are provid...,5,1


To lighten the TF-IDF and prediction steps, we limit each class to 4000 rows (before cleaning), for a total of 8000 rows across both classes.

In [290]:
data = pd.concat([data[data.labels == 1].iloc[:4000, :], data[data.labels == 0].iloc[:4000, :]])


In [291]:
data.shape

(8000, 3)

## Duplicated data


Here we check whether the data contains duplicated rows, since duplicates can contribute to overfitting.

In [292]:
data[data.duplicated()]

Unnamed: 0,Review,Rating,labels
2495,Good app for songs,5,1
2996,I love Spotify.,5,1
3261,Amazing music app,5,1
4457,Great app tons of music,5,1
5023,Best music experience ever,5,1
5827,The best music app ever,5,1
6246,The best music app ever,5,1
6297,Best music app by far,5,1
6302,I love Spotify!,5,1
6532,Great Music App.,5,1


Since the data above contains duplicates, we drop the repeated rows and keep only the first occurrence of each.

In [293]:
data = data.drop_duplicates()

data[data.duplicated()]

Unnamed: 0,Review,Rating,labels


## Missing values



In this step we check for missing values, since missing data would interfere with modeling.

In [294]:
data.isna().sum()

Review    0
Rating    0
labels    0
dtype: int64

In [295]:
data.isnull().sum()

Review    0
Rating    0
labels    0
dtype: int64

We can conclude that the data above has no missing values.

How can outliers be detected in a text-mining setting?
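
One possible answer, offered only as a sketch and not as something this notebook applies: treat the length of each review (its token count) as the numeric quantity and flag reviews that fall outside the usual 1.5 × IQR fences. The toy reviews, the variable names, and the choice of the IQR rule are all assumptions made for illustration.

```python
from statistics import quantiles

# Toy reviews standing in for data["Review"] (assumption: not the real dataset)
reviews = [
    "great app",
    "love the playlists",
    "too many ads lately",
    "app keeps crashing after update",
    "good",
    "best music app " * 30,  # an unusually long review
]

# Whitespace token count as a simple numeric proxy for each review
lengths = [len(r.split()) for r in reviews]

# Quartiles with linear interpolation, then the usual 1.5 * IQR fences
q1, _, q3 = quantiles(lengths, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Positions of the reviews whose length falls outside the fences
outlier_idx = [i for i, n in enumerate(lengths) if n < lower or n > upper]
print(lengths, outlier_idx)
```

Only the last, very long review is flagged here. On the real DataFrame the same idea could be applied to a token-count column; whether such reviews should actually be removed is a separate modeling decision.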

In [296]:
data.head()

Unnamed: 0,Review,Rating,labels
0,"Great music service, the audio is high quality...",5,1
1,Please ignore previous negative rating. This a...,5,1
6,I love the selection and the lyrics are provid...,5,1
8,It's a great app and the best mp3 music app I ...,5,1
14,i hav any music that i like it is super🙌,5,1


In [297]:
data.shape

(7967, 3)

## Case folding



In this step we strip out extra characters such as mentions, hashtags, digits (which carry no clear meaning here), links, and punctuation, and then convert the text to lowercase.

In [298]:
def cleaning_text(text):
    # Strip @mentions and #hashtags
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    text = re.sub(r'#\w+', '', text)
    # Strip digits and retweet markers
    text = re.sub(r"\d+", "", text)
    text = re.sub(r'RT[\s]+', '', text)
    # Strip links, then every ASCII punctuation character
    text = re.sub(r'https?://\S+', '', text)
    text = text.translate(str.maketrans("","", string.punctuation))

    return text

data["Review"] = data["Review"].apply(cleaning_text).str.lower()

data.head()

Unnamed: 0,Review,Rating,labels
0,great music service the audio is high quality ...,5,1
1,please ignore previous negative rating this ap...,5,1
6,i love the selection and the lyrics are provid...,5,1
8,its a great app and the best mp music app i ha...,5,1
14,i hav any music that i like it is super🙌,5,1
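
One side effect of deleting punctuation outright is that words separated only by punctuation get fused together, which is where vocabulary entries such as `unfinishedbroken` come from. The sketch below shows a possible alternative that replaces punctuation with a space instead. The helper `strip_punct_keep_gaps` and the sample string are hypothetical and are not part of the pipeline above.

```python
import re
import string

# Map every punctuation character to a space instead of deleting it
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def strip_punct_keep_gaps(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    text = text.translate(PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", text).strip()

sample = "unfinished/broken app...crashes,always"

# Deleting punctuation (as cleaning_text does) fuses neighbouring words
deleted = sample.translate(str.maketrans("", "", string.punctuation))
# Replacing it with spaces keeps the word boundaries
spaced = strip_punct_keep_gaps(sample)

print(deleted)  # unfinishedbroken appcrashesalways
print(spaced)   # unfinished broken app crashes always
```

The trade-off is that contractions are split as well ("don't" becomes "don t"), so which variant works better depends on the later steps.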


Let's take a look at the result.

In [299]:
data["Review"]

0        great music service the audio is high quality ...
1        please ignore previous negative rating this ap...
6        i love the selection and the lyrics are provid...
8        its a great app and the best mp music app i ha...
14                i hav any music that i like it is super🙌
                               ...                        
15743    constantly cuts off on my z fold driving me cr...
15746    if you have a free account dont even bother do...
15748    the app was one of the best maybe the best mus...
15756                    app is not opening  dont know why
15757    the app was running smoothly before the last u...
Name: Review, Length: 7967, dtype: object

Is normalization necessary?
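
To make the question concrete, here is a minimal sketch of what a dictionary-based normalization step could look like. The mapping below is entirely hypothetical (a real one would have to be built from the corpus), and the notebook itself does not apply this step.

```python
# Hypothetical lookup table: informal spelling -> standard form (assumption)
NORMALIZATION_MAP = {
    "hav": "have",
    "dont": "do not",
    "cant": "cannot",
    "pls": "please",
}

def normalize(text, mapping=NORMALIZATION_MAP):
    """Replace each whitespace-separated token that has an entry in the mapping."""
    return " ".join(mapping.get(token, token) for token in text.split())

print(normalize("i hav any music that i like"))
print(normalize("app is not opening dont know why"))
```

Whether this helps depends on how often such variants occur. It merges spellings such as `hav` and `have` into a single vocabulary entry, at the cost of maintaining the mapping by hand.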

## Tokenization



In this step we split each sentence or document into individual words (tokens).

In [300]:
tokenized = data["Review"].apply(lambda x: word_tokenize(x))

tokenized

0        [great, music, service, the, audio, is, high, ...
1        [please, ignore, previous, negative, rating, t...
6        [i, love, the, selection, and, the, lyrics, ar...
8        [its, a, great, app, and, the, best, mp, music...
14       [i, hav, any, music, that, i, like, it, is, su...
                               ...                        
15743    [constantly, cuts, off, on, my, z, fold, drivi...
15746    [if, you, have, a, free, account, dont, even, ...
15748    [the, app, was, one, of, the, best, maybe, the...
15756             [app, is, not, opening, dont, know, why]
15757    [the, app, was, running, smoothly, before, the...
Name: Review, Length: 7967, dtype: object

## Stopwords



In this step we remove words that carry little meaning on their own, mostly function words such as conjunctions, prepositions, and pronouns.

In [301]:
stops = set(stopwords.words('english'))

def stopword(text):
    text = [word for word in text if word not in stops]
    return text

tokenized = tokenized.apply(lambda x: stopword(x))

tokenized

0        [great, music, service, audio, high, quality, ...
1        [please, ignore, previous, negative, rating, a...
6        [love, selection, lyrics, provided, song, your...
8        [great, app, best, mp, music, app, ever, used,...
14                              [hav, music, like, super🙌]
                               ...                        
15743    [constantly, cuts, z, fold, driving, crazy, ed...
15746    [free, account, dont, even, bother, downloadin...
15748    [app, one, best, maybe, best, music, app, late...
15756                           [app, opening, dont, know]
15757    [app, running, smoothly, last, update, updatin...
Name: Review, Length: 7967, dtype: object
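
Note that the NLTK English stopword list includes negations such as "not" and "no", which is why "app is not opening" ends up as `[app, opening, ...]` above. For sentiment data it may be worth keeping them. The sketch below illustrates the idea with a small stand-in stopword set (an assumption, so that the example runs without the NLTK corpus) and a hypothetical `NEGATIONS` set.

```python
# Small stand-in for stopwords.words('english')
# (assumption: the real NLTK list is much longer)
stops = {"is", "the", "why", "not", "no", "it", "a"}

# Negation words to keep, since they can flip the sentiment of a review
NEGATIONS = {"not", "no", "nor", "never"}
stops_keep_negation = stops - NEGATIONS

tokens = ["app", "is", "not", "opening", "dont", "know", "why"]

plain = [w for w in tokens if w not in stops]
kept = [w for w in tokens if w not in stops_keep_negation]

print(plain)  # ['app', 'opening', 'dont', 'know']
print(kept)   # ['app', 'not', 'opening', 'dont', 'know']
```

With the real list, the same set difference could be applied to `set(stopwords.words('english'))` before filtering.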

## Stemming



Stemming reduces every affixed word to its base form, for example membaca becomes baca (Indonesian) and driving becomes drive. Porter stems are not always dictionary words: service becomes servic and quality becomes qualiti.

In [302]:
stemmer = PorterStemmer()

def stemming(text):
    text = [stemmer.stem(token) for token in text]
    
    return text

stemmed_token = tokenized.apply(lambda x: stemming(x))

stemmed_token

0        [great, music, servic, audio, high, qualiti, a...
1        [pleas, ignor, previou, neg, rate, app, super,...
6        [love, select, lyric, provid, song, your, listen]
8        [great, app, best, mp, music, app, ever, use, ...
14                              [hav, music, like, super🙌]
                               ...                        
15743    [constantli, cut, z, fold, drive, crazi, edit,...
15746    [free, account, dont, even, bother, download, ...
15748    [app, one, best, mayb, best, music, app, lates...
15756                              [app, open, dont, know]
15757    [app, run, smoothli, last, updat, updat, newer...
Name: Review, Length: 7967, dtype: object

# Term weighting

## Counting word occurrences

To prepare for TF-IDF we first build the vocabulary and then count, for every word, how many documents contain it (its document frequency).

In [304]:
combined_text = [' '.join(text) for text in stemmed_token]
sentences = []
word_set = set()  # vocabulary

for sent in combined_text:
    # Keep purely alphabetic tokens only
    x = [i for i in word_tokenize(sent) if i.isalpha()]
    sentences.append(x)
    word_set.update(x)

# Total documents in our corpus
total_documents = len(sentences)

# Creating an index for each word in our vocab.
index_dict = {word: i for i, word in enumerate(word_set)}


In [305]:
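def count_dict(sentences):
    # Document frequency: number of documents that contain each word
    word_count = dict.fromkeys(word_set, 0)
    for sent in sentences:
        # set(sent) counts each word at most once per document
        for word in set(sent):
            word_count[word] += 1
    return word_count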
 
word_count = count_dict(sentences)

## Term Frequency

Term frequency (TF) is computed per document: the number of times term t occurs in document d, divided by the total number of words n in d.

In [306]:
# Term Frequency
def termfreq(document, word):
    N = len(document)
    occurrence = len([token for token in document if token == word])
    return occurrence/N
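
To make the definition concrete, here is a small worked example on a made-up five-token document. It is only a sketch: the tokens and the names `document`, `counts`, and `tf` are illustrative and do not come from the review data.

```python
from collections import Counter

# A toy stemmed document (assumption: illustrative tokens only)
document = ["great", "app", "best", "music", "app"]

counts = Counter(document)
n_tokens = len(document)

# TF(t, d) = occurrences of t in d / total tokens in d
tf = {term: count / n_tokens for term, count in counts.items()}
print(tf)
```

Here "app" occurs twice among five tokens, so its TF is 0.4, while every other term gets 0.2. The values for one document always sum to 1.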

## Inverse Document Frequency

Next we compute the IDF: the logarithm of the ratio between the total number of documents and the number of documents that contain term t. The code adds 1 to the document count so that unseen terms do not cause a division by zero.

In [307]:
def inverse_doc_freq(word):
    # Words missing from word_count get a document count of 0 before the +1
    word_occurrence = word_count.get(word, 0) + 1
    return np.log(total_documents/word_occurrence)
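
The effect of the `+ 1` in the denominator is easiest to see on a toy corpus. The sketch below assumes four documents and made-up document frequencies. `smoothed_idf` is a hypothetical stand-in that applies the same formula as `inverse_doc_freq`.

```python
import math

total_documents = 4
# Document frequency: number of documents containing each term (toy values)
doc_freq = {"app": 4, "music": 2, "lyric": 1}

def smoothed_idf(term):
    """log(N / (df + 1)), the same smoothing used by inverse_doc_freq."""
    return math.log(total_documents / (doc_freq.get(term, 0) + 1))

for term in ["app", "music", "lyric", "unseen"]:
    print(term, round(smoothed_idf(term), 4))
```

Rarer terms receive larger weights, and a term never seen in the corpus gets log(N). One consequence of this particular smoothing is that a term appearing in every document ("app" here) gets a slightly negative IDF, because the ratio N / (N + 1) is below 1.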

## TF-IDF

In this step we combine TF and IDF into a single weight for every term in every document.

In [308]:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence,word)
        idf = inverse_doc_freq(word)
         
        value = tf*idf
        tf_idf_vec[index_dict[word]] = value 
    return tf_idf_vec

In [309]:
len(word_set)

6046

We convert the TF-IDF result into a DataFrame so that it can be passed to the class later on.

In [310]:
#TF-IDF Encoded text corpus
vectors = []
for sent in sentences:
    vec = tf_idf(sent)
    vectors.append(vec)
 
tfidf = pd.DataFrame(vectors, columns=list(word_set))
tfidf

Unnamed: 0,button,increas,merg,omg,build,apk,butday,took,port,covid,...,perk,spotfyluv,tyriuggt,adopt,zomato,goodd,fors,unload,spotifly,profil
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7962,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7963,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7964,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7965,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
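
Almost every cell in this matrix is zero, which is part of why the dataset had to be capped earlier. As a rough back-of-the-envelope sketch, the block below estimates the size of the dense matrix from its shape and shows one way to keep only the non-zero entries of a row. `to_sparse_row` and the sample row are hypothetical, and in practice a library such as `scipy.sparse` would be the usual choice.

```python
n_documents, n_terms = 7967, 6046  # shape of the tfidf DataFrame above
bytes_per_float64 = 8

dense_bytes = n_documents * n_terms * bytes_per_float64
print(f"dense float64 matrix: {dense_bytes / 1e6:.0f} MB")

def to_sparse_row(dense_row):
    """Return {column_index: value} for the non-zero entries of one row."""
    return {j: v for j, v in enumerate(dense_row) if v != 0.0}

row = [0.0, 0.0, 0.24, 0.0, 0.11, 0.0]
print(to_sparse_row(row))  # {2: 0.24, 4: 0.11}
```

The dense matrix comes to roughly 385 MB, while a review with a handful of distinct stems only needs a handful of stored values.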


Viewing the result as a NumPy array:

In [312]:
tfidf.to_numpy()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Next we look for terms whose TF-IDF values, summed over all documents, are close to 0 (below 0.2 here). A total this small typically means that the term rarely appears in the corpus (it is uncommon).

In [315]:
uncommon_word = []
for col in tfidf.columns:
    if tfidf[col].sum() < 0.2:
        uncommon_word.append(col)


uncommon_word

['abandon',
 'unfinishedbroken',
 'transmiss',
 'ushsnsjsj',
 'buch',
 'goal',
 'endus',
 'awson',
 'truth',
 'drnosleep',
 'bsrebon',
 'ni',
 'isiwish',
 'expiri',
 'adapt',
 'luxuri',
 'baad',
 'clumsi',
 'popularhighest',
 'cube',
 'prop',
 'walter',
 'taunt',
 'bnda',
 'sustain',
 'fnk',
 'lastli',
 'zuissjddb',
 'peev',
 'evil',
 'wifigam',
 'amd',
 'intel',
 'ach',
 'gum',
 'quell',
 'lament',
 'stikl',
 'cowboy',
 'stupendouschang',
 'capitalist',
 'embarrassingli',
 'greatad',
 'clario',
 '𝐬𝐚𝐲𝐢𝐧𝐠',
 'dwell',
 'timeseem',
 '𝐭𝐡𝐞',
 'likelihood',
 'exchang',
 'platformsi',
 'fame',
 'ismvvisianddcixiasisjcànæufifjf',
 'woundsit',
 'placement',
 'disappointedbefor',
 'afficianado',
 'fazool',
 'persji',
 'lastlywhi',
 'grade',
 'gripe',
 'clickabl',
 'restartedreset',
 'getpin',
 'ifnejd',
 'queuefrom',
 'lotongo',
 'writer',
 'onof',
 'nooo',
 'freeload',
 'unquot',
 'ammount',
 'dusiskf',
 'secondth',
 'saniti',
 'elswher',
 'musicssong',
 'duh',
 'playlistthes',
 'crusti',
 'key

Next we split the data into a training set, used for training, and a test set, used for testing.

In [339]:
def Train_Test_Split(x, y, random_seed=None, test_size=0.2):
    n = len(x)
    if random_seed is not None:
        np.random.seed(random_seed)
    
    test_size = int(test_size * n)
    indices = np.random.permutation(n)
    train_indices, test_indices = indices[test_size:], indices[:test_size]
    return x.iloc[train_indices], x.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]


X_train, X_test, y_train, y_test = Train_Test_Split(tfidf, data["labels"], random_seed=42)

Unnamed: 0,button,increas,merg,omg,build,apk,butday,took,port,covid,...,perk,spotfyluv,tyriuggt,adopt,zomato,goodd,fors,unload,spotifly,profil
2716,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.24269,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7838,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7832,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6480,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3334,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5045,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2114,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7126,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6504,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
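
Two properties are worth checking for any hand-written split: the train and test positions must not overlap, and the same seed must give the same split. The sketch below checks both on ten toy positions. `split_indices` is a hypothetical variant that uses a local `np.random.default_rng` generator instead of the global seed, so it is not the function defined above.

```python
import numpy as np

def split_indices(n, test_size=0.2, random_seed=None):
    """Return (train_idx, test_idx) as disjoint shuffled positions in range(n)."""
    rng = np.random.default_rng(random_seed)  # local generator, global state untouched
    indices = rng.permutation(n)
    n_test = int(test_size * n)
    return indices[n_test:], indices[:n_test]

train_a, test_a = split_indices(10, test_size=0.2, random_seed=0)
train_b, test_b = split_indices(10, test_size=0.2, random_seed=0)

print(len(train_a), len(test_a))       # 8 2
print(np.array_equal(test_a, test_b))  # True
print(set(train_a) & set(test_a))      # set()
```

Because the seed is passed straight to the generator, a seed of 0 is honored like any other value, which a plain truthiness check on the seed would not do.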


# Modeling

For the modeling step I will write a class that implements the KNN algorithm, with methods for training (`fit`), prediction (`predict`), and computing accuracy (`accuracy`).

In [346]:
class KNN:
    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors
        self.x = None
        self.y = None

    def fit(self, x,  y):
        if x.shape[0] != y.shape[0]:
            raise ValueError(f"cannot fit with different sizes: x ({x.shape[0]}) and y ({y.shape[0]})")
        
        self.x = np.array(x)
        self.y = np.array(y)

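    def _predict(self, x_predict):
        # Squared Euclidean distance from x_predict to every training row
        distance = np.array([np.sum((x_train-x_predict)**2) for x_train in self.x])
        # Indices of the k nearest neighbours (smallest distances first)
        nearest = np.argsort(distance)[:self.n_neighbors]
        label = [self.y[i] for i in nearest]
        return np.bincount(label).argmax()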

        
    def predict(self, x_predict):
        x_predict = np.array(x_predict)
        predicted = [self._predict(x) for x in x_predict]
        return np.array(predicted)
    
    def accuracy(self, y_true, y_pred):
        intersection = 0
        for i in range(len(y_true)):
            if y_true[i] == y_pred[i]:
                intersection += 1

        return intersection / len(y_true)
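
The voting logic is easier to follow on a tiny two-dimensional example than on 6046-dimensional TF-IDF vectors. The sketch below is a compact, standalone version of the same idea with made-up points. `knn_predict` is a hypothetical helper, not a method of the class above, and it computes all distances in one vectorized step.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=3):
    """Majority vote among the k training points closest to x_query."""
    # Squared Euclidean distance to every training row in one vectorized step
    distances = np.sum((x_train - x_query) ** 2, axis=1)
    nearest = np.argsort(distances)[:k]
    # bincount tallies votes per integer label; argmax picks the winner
    return int(np.bincount(y_train[nearest]).argmax())

# Two small clusters: label 0 near the origin, label 1 near (5, 5)
x_train = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                    [5.0, 5.0], [5.2, 4.8], [4.7, 5.3]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(x_train, y_train, np.array([0.3, 0.3])))  # 0
print(knn_predict(x_train, y_train, np.array([4.9, 5.1])))  # 1
```

With two classes, an odd k avoids ties. When a tie does occur, `argmax` on the `bincount` result returns the smaller label.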
        

## Training

In [364]:
model = KNN(n_neighbors=9)


In [365]:
model.fit(X_train, y_train)

## Testing

In [366]:
y_predict = model.predict(X_test)

y_predict

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

In [367]:
model.accuracy(y_test.to_numpy(), y_predict)

0.7953546767106089

In my experiments, varying the value of k produced the following accuracies:
- k = 3, accuracy = 80.6026 %
- k = 5, accuracy = 80.7909 %
- k = 7, accuracy = 80.2887 %
- k = 9, accuracy = 79.5354 %

So the value of k with the highest accuracy is k = 5.
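
The comparison can also be written down directly from the numbers reported above. This is a small sketch: the accuracies are copied from the list, and `accuracy_by_k`, `best_k`, and `spread` are illustrative names.

```python
# Accuracies (in percent) reported above for each tested k
accuracy_by_k = {3: 80.6026, 5: 80.7909, 7: 80.2887, 9: 79.5354}

# k with the highest reported accuracy
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
# Gap between the best and the worst setting, in percentage points
spread = max(accuracy_by_k.values()) - min(accuracy_by_k.values())

print(best_k)            # 5
print(round(spread, 4))  # 1.2555
```

The best and worst settings differ by about 1.26 percentage points on a single train/test split, so the ranking of nearby k values may change with a different random seed.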