This notebook prepares the data for classification. It builds several datasets, each based on a different preprocessing method. The datasets range from the simplest, with almost no preprocessing, to more complex ones that combine several preprocessing methods.

In [5]:
import pandas as pd
import utility.preprocessing as preprocessing

In [6]:
# Reload the module so that changes to the preprocessing script are picked up quickly

import importlib
importlib.reload(preprocessing)

<module 'utility.preprocessing' from '/Users/boris_majic/Documents/ETF/OPJ/opj_avengers/utility/preprocessing.py'>

In [7]:
# Load the data and keep only the first row for each pair_id

df_merged = pd.read_csv('./data/annotation_merged.csv')
df_merged.drop_duplicates(subset="pair_id", keep='first', inplace=True)
df_merged.head(3)

Unnamed: 0,pair_id,comment,Komentar,code,pretvaranje int u string,red sa prioritetom,pretvaranje string u datum,sortiranje string liste,čuvanje liste u datoteku,postgresql konekcija,...,slanje binarnih podataka preko seriske veze,otpakovanje podataka iz tekstualne datoteke,pozicije podstingova u stringu,čitanje elemenata iz html-a - <td>,oduzimanje medijana iz svake kolone,uklanjanja zaglavlja prilikom spajanja nekoliko datoteka,parsiranje query stringa u url-u,rangiranje fazi članova na osnovu stepena podudaranja,izlaz u html datoteku,kako efikasno pročitati .csv datoteku
0,BookStackApp_BookStack_ActivityService_740,Get a new activity instance for the current u...,Daj novu instancu aktivnosti za trenutnog kori...,protected function newActivityForUser(stri...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,BookStackApp_BookStack_CommentRepo_753,Update an existing comment.\n,Osveži postojeći komentar.,"public function update(Comment $comment, s...",0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,BookStackApp_BookStack_CommentRepo_754,Delete a comment from the system.\n,Obriši komentar iz sistema.,public function delete(Comment $comment) ...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Converting the data into a form suitable for classification
The table used so far is convenient for several reasons, but it is not in the form required by the assignment. The data are transformed from a form where one row represents one comment-code block pair into a format where one row represents one comment-query pair.

In [8]:
# Transform the data into a form suitable for classification
df = preprocessing.transform_annotated_dataframe(df_merged)
df.head(3)

Unnamed: 0,PairID,QueryID,Comment,Query,Score
0,BookStackApp_BookStack_ActivityService_740,0,Daj novu instancu aktivnosti za trenutnog kori...,red sa prioritetom,0
1,BookStackApp_BookStack_ActivityService_740,1,Daj novu instancu aktivnosti za trenutnog kori...,pretvaranje string u datum,0
2,BookStackApp_BookStack_ActivityService_740,2,Daj novu instancu aktivnosti za trenutnog kori...,sortiranje string liste,0
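
The reshaping performed in the cell above can be sketched with `DataFrame.melt`. This is only an illustration on a made-up table: the implementation of `preprocessing.transform_annotated_dataframe` is not shown in this notebook and may differ, in particular in how `QueryID` is assigned (numbering queries by column position is an assumption here).

```python
import pandas as pd

# Toy wide table: one row per comment-code pair, one column per query.
# Pair IDs, comments and scores are invented for illustration.
wide = pd.DataFrame({
    "pair_id": ["pair_a", "pair_b"],
    "Komentar": ["Obriši komentar.", "Sortiraj listu."],
    "sortiranje string liste": [0.0, 3.0],
    "red sa prioritetom": [0.0, 0.0],
})

query_columns = ["sortiranje string liste", "red sa prioritetom"]
# Assumed numbering: each query gets its position in the column list.
query_ids = {query: i for i, query in enumerate(query_columns)}

# melt turns every (pair, query column) cell into its own row.
long = wide.melt(
    id_vars=["pair_id", "Komentar"],
    value_vars=query_columns,
    var_name="Query",
    value_name="Score",
)
long["QueryID"] = long["Query"].map(query_ids)
long = long.rename(columns={"pair_id": "PairID", "Komentar": "Comment"})
long = long.sort_values(["PairID", "QueryID"]).reset_index(drop=True)
```

Each of the two pairs now appears once per query, so the long table has four rows.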


Preparing and saving the data in the required format

In [17]:
df_to_save = df.copy()
df_to_save.rename(columns={"Comment": "CommentText", "Query": "QueryText", "Score": "SimilarityScore"}, inplace=True)
df_to_save['ProgrammingLanguageName'] = 'PHP'
df_to_save = df_to_save[['ProgrammingLanguageName', 'QueryID', 'PairID', 'QueryText', 'CommentText', 'SimilarityScore']]
df_to_save.to_csv('./data/AnnotatedData.csv', index=False)

### Building features without preprocessing

In [4]:
# Copy of the dataframe that we will use to train the model
df_nopreprocessing = df.copy()

In [5]:
# Add the base features shared by all preprocessing methods (number of common words, text lengths in words)
preprocessing.make_base_features(df_nopreprocessing, inplace=True)
df_nopreprocessing.head(3)

Unnamed: 0,PairID,QueryID,Comment,Query,Score,WordCountComment,WordCountQuery,MutualUnique,MutualWithRepetition
0,BookStackApp_BookStack_ActivityService_740,0,Daj novu instancu aktivnosti za trenutnog kori...,red sa prioritetom,0,7,3,0,0
1,BookStackApp_BookStack_ActivityService_740,1,Daj novu instancu aktivnosti za trenutnog kori...,pretvaranje string u datum,0,7,4,0,0
2,BookStackApp_BookStack_ActivityService_740,2,Daj novu instancu aktivnosti za trenutnog kori...,sortiranje string liste,0,7,3,0,0
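
As a rough guide to what the four new columns mean, here is a minimal sketch that computes them for a single pair. It assumes whitespace tokenization and assumes that `MutualWithRepetition` counts each shared word as many times as it occurs in both texts; `preprocessing.make_base_features` may define these differently, and `base_features` is a hypothetical helper, not part of the module.

```python
from collections import Counter


def base_features(comment: str, query: str) -> dict:
    """Hypothetical re-creation of the four base features for one pair."""
    comment_words = comment.split()
    query_words = query.split()
    # Counter intersection keeps the minimum count of every shared word.
    shared = Counter(comment_words) & Counter(query_words)
    return {
        "WordCountComment": len(comment_words),
        "WordCountQuery": len(query_words),
        "MutualUnique": len(shared),                 # distinct shared words
        "MutualWithRepetition": sum(shared.values()),  # shared words with repeats
    }


features = base_features("sortiraj listu i vrati listu", "sortiranje listu listu")
```

In this example only "listu" is shared, and it occurs twice in both texts, so `MutualUnique` is 1 and `MutualWithRepetition` is 2.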


In [6]:
# Compute the BOW representation for the current approach
# First build the dictionary:
nopreprocess_wds = preprocessing.make_word_dict(df_nopreprocessing)
nopreprocess_bow = preprocessing.make_bow_dfs(df = df_nopreprocessing, word_dict=nopreprocess_wds)

In [7]:
# Add the BOW cosine similarity as an additional feature
df_nopreprocessing['BOW'] = df_nopreprocessing.apply(
    lambda x: preprocessing.cosine_sim(nopreprocess_bow['Comment'].loc[x['PairID']], nopreprocess_bow['Query'].loc[x['QueryID']]),
    axis=1
)
df_nopreprocessing.head(3)

Unnamed: 0,PairID,QueryID,Comment,Query,Score,WordCountComment,WordCountQuery,MutualUnique,MutualWithRepetition,BOW
0,BookStackApp_BookStack_ActivityService_740,0,Daj novu instancu aktivnosti za trenutnog kori...,red sa prioritetom,0,7,3,0,0,0.0
1,BookStackApp_BookStack_ActivityService_740,1,Daj novu instancu aktivnosti za trenutnog kori...,pretvaranje string u datum,0,7,4,0,0,0.0
2,BookStackApp_BookStack_ActivityService_740,2,Daj novu instancu aktivnosti za trenutnog kori...,sortiranje string liste,0,7,3,0,0,0.0
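
The `BOW` feature is the cosine similarity between the bag-of-words vectors of the comment and the query. The sketch below shows the idea on two short texts using only the standard library. The helper names and the choice to return 0 for an empty vector are assumptions; the notebook's own `preprocessing.cosine_sim` and `preprocessing.make_bow_dfs` are not shown and may behave differently.

```python
import math
from collections import Counter


def bow_vector(text: str, vocabulary: list) -> list:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


comment = "sortiraj listu stringova"
query = "sortiranje listu stringova brzo"
vocabulary = sorted(set(comment.split()) | set(query.split()))
similarity = cosine_similarity(
    bow_vector(comment, vocabulary), bow_vector(query, vocabulary)
)
```

A pair with no shared words, like the three rows displayed above, has a dot product of zero and therefore a similarity of 0.0.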


In [9]:
# Save to file for classification
df_nopreprocessing.to_csv('no_preprocessing.csv', index=False)

### Building features with lowercase normalization

In [10]:
df_lowercase = preprocessing.normalize_to_lowercase(df, inplace=False)
df_lowercase.head(3)

Unnamed: 0,PairID,QueryID,Comment,Query,Score
0,BookStackApp_BookStack_ActivityService_740,0,daj novu instancu aktivnosti za trenutnog kori...,red sa prioritetom,0
1,BookStackApp_BookStack_ActivityService_740,1,daj novu instancu aktivnosti za trenutnog kori...,pretvaranje string u datum,0
2,BookStackApp_BookStack_ActivityService_740,2,daj novu instancu aktivnosti za trenutnog kori...,sortiranje string liste,0


In [11]:
# Repeat the same procedure as for the previous feature set
preprocessing.make_base_features(df_lowercase, inplace=True)

# Compute the BOW representation for the current approach
# First build the dictionary:
lowercase_wds = preprocessing.make_word_dict(df_lowercase)
lowercase_bow = preprocessing.make_bow_dfs(df = df_lowercase, word_dict=lowercase_wds)

# Add the BOW cosine similarity as an additional feature
df_lowercase['BOW'] = df_lowercase.apply(
    lambda x: preprocessing.cosine_sim(lowercase_bow['Comment'].loc[x['PairID']], lowercase_bow['Query'].loc[x['QueryID']]),
    axis=1
)
df_lowercase.head(3)

Unnamed: 0,PairID,QueryID,Comment,Query,Score,WordCountComment,WordCountQuery,MutualUnique,MutualWithRepetition,BOW
0,BookStackApp_BookStack_ActivityService_740,0,daj novu instancu aktivnosti za trenutnog kori...,red sa prioritetom,0,7,3,0,0,0.0
1,BookStackApp_BookStack_ActivityService_740,1,daj novu instancu aktivnosti za trenutnog kori...,pretvaranje string u datum,0,7,4,0,0,0.0
2,BookStackApp_BookStack_ActivityService_740,2,daj novu instancu aktivnosti za trenutnog kori...,sortiranje string liste,0,7,3,0,0,0.0


In [16]:
df_lowercase.to_csv('lowercase.csv', index=False)

### Building the feature set with stemming and stop-word removal
While building this feature set, lowercase normalization was also applied because the stemmer used does not support uppercase letters. Punctuation marks were removed as well.

In [14]:
df_stem = preprocessing.normalize_to_lowercase(df, inplace=False)
preprocessing.apply_stem(df_stem, inplace=True)
df_stem.head(3)

Unnamed: 0,PairID,QueryID,Comment,Query,Score
0,BookStackApp_BookStack_ActivityService_740,0,daj nov instanc aktivnost za trenutn korisnik,red sa prioritet,0
1,BookStackApp_BookStack_ActivityService_740,1,daj nov instanc aktivnost za trenutn korisnik,pretvaranj string u datum,0
2,BookStackApp_BookStack_ActivityService_740,2,daj nov instanc aktivnost za trenutn korisnik,sortiranj string list,0


In [15]:
# Repeat the same procedure, this time using word stems:
preprocessing.make_base_features(df_stem, inplace=True)

# Compute the BOW representation for the current approach
# First build the dictionary:
stem_wds = preprocessing.make_word_dict(df_stem)
stem_bow = preprocessing.make_bow_dfs(df = df_stem, word_dict=stem_wds)

# Add the BOW cosine similarity as an additional feature
df_stem['BOW'] = df_stem.apply(
    lambda x: preprocessing.cosine_sim(stem_bow['Comment'].loc[x['PairID']], stem_bow['Query'].loc[x['QueryID']]),
    axis=1
)
df_stem.head(3)

Unnamed: 0,PairID,QueryID,Comment,Query,Score,WordCountComment,WordCountQuery,MutualUnique,MutualWithRepetition,BOW
0,BookStackApp_BookStack_ActivityService_740,0,daj nov instanc aktivnost za trenutn korisnik,red sa prioritet,0,7,3,0,0,0.0
1,BookStackApp_BookStack_ActivityService_740,1,daj nov instanc aktivnost za trenutn korisnik,pretvaranj string u datum,0,7,4,0,0,0.0
2,BookStackApp_BookStack_ActivityService_740,2,daj nov instanc aktivnost za trenutn korisnik,sortiranj string list,0,7,3,0,0,0.0


In [17]:
df_stem.to_csv('stem.csv', index=False)

### Building the feature set with frequency-based word filtering
Frequency filtering was performed on texts that were neither normalized nor stemmed. However, punctuation marks were removed so that they would not affect the occurrence counts of words adjacent to them. This approach is a consequence of how dictionary construction and frequency filtering are implemented.

In [31]:
df_freq = df.copy()

# Remove punctuation marks
df_freq['Comment'] = df_freq['Comment'].apply(lambda x: preprocessing.remove_interpunction(x))
df_freq['Query'] = df_freq['Query'].apply(lambda x: preprocessing.remove_interpunction(x))

# Build the lists of common and rare words
freq_words = preprocessing.make_word_dict(df_freq)
common_words = preprocessing.get_common_words(freq_words)
rare_words = preprocessing.get_rare_words(freq_words)

# Frequency-filter the comments and queries
df_freq['Comment'] = df_freq['Comment'].apply(lambda x: preprocessing.remove_common_words(x, common_words))
df_freq['Query'] = df_freq['Query'].apply(lambda x: preprocessing.remove_common_words(x, common_words))
df_freq['Comment'] = df_freq['Comment'].apply(lambda x: preprocessing.remove_rare_words(x, rare_words))
df_freq['Query'] = df_freq['Query'].apply(lambda x: preprocessing.remove_rare_words(x, rare_words))

# Add the base features
preprocessing.make_base_features(df_freq, inplace=True)

# Compute the BOW vectors
# Build a new dictionary that will not contain the filtered words
freq_words = preprocessing.make_word_dict(df_freq)
freq_bow = preprocessing.make_bow_dfs(df = df_freq, word_dict=freq_words)

# Add the BOW cosine similarity as an additional feature
df_freq['BOW'] = df_freq.apply(
    lambda x: preprocessing.cosine_sim(freq_bow['Comment'].loc[x['PairID']], freq_bow['Query'].loc[x['QueryID']]),
    axis=1
)
df_freq.head(3)

Unnamed: 0,PairID,QueryID,Comment,Query,Score,WordCountComment,WordCountQuery,MutualUnique,MutualWithRepetition,BOW
0,BookStackApp_BookStack_ActivityService_740,0,aktivnosti za trenutnog,red sa prioritetom,0,3,3,0,0,0.0
1,BookStackApp_BookStack_ActivityService_740,1,aktivnosti za trenutnog,pretvaranje string u datum,0,3,4,0,0,0.0
2,BookStackApp_BookStack_ActivityService_740,2,aktivnosti za trenutnog,sortiranje string liste,0,3,3,0,0,0.0
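
The idea behind the frequency filtering above can be sketched as follows: count every word across the corpus, then drop the words that are too frequent or too rare. The absolute cut-offs `max_count` and `min_count` are assumptions chosen for this toy corpus; how `preprocessing.get_common_words` and `preprocessing.get_rare_words` pick their thresholds is not shown in this notebook.

```python
from collections import Counter

texts = [
    "daj novu listu za korisnika",
    "sortiraj novu listu za korisnika",
    "obriši listu za arhivu",
]

# Total number of occurrences of each word across all texts.
word_counts = Counter(word for text in texts for word in text.split())

# Assumed cut-offs, valid only for this toy corpus.
max_count = 3  # words seen at least this often are treated as common
min_count = 1  # words seen at most this often are treated as rare

common_words = {w for w, c in word_counts.items() if c >= max_count}
rare_words = {w for w, c in word_counts.items() if c <= min_count}


def remove_words(text: str, unwanted: set) -> str:
    """Drop every word that appears in the unwanted set."""
    return " ".join(w for w in text.split() if w not in unwanted)


filtered = [remove_words(t, common_words | rare_words) for t in texts]
```

Note that a text can end up empty after filtering, as the third one does here, which is worth keeping in mind when computing similarities on the filtered texts.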


In [32]:
df_freq.to_csv('freq_filt.csv', index=False)

### Using bigrams and trigrams

In [35]:
df_ngrams = df.copy()

# Add the base features
preprocessing.make_base_features(df_ngrams, inplace=True)

# Add the number of common bigrams and trigrams
df_ngrams['NCommonBigrams'] = df_ngrams.apply(
    lambda x: preprocessing.count_mutual_bigrams(x['Comment'], x['Query']),
    axis=1
)

df_ngrams['NCommonTrigrams'] = df_ngrams.apply(
    lambda x: preprocessing.count_mutual_trigrams(x['Comment'], x['Query']),
    axis=1
)

# Compute the BOW vectors
ngram_words = preprocessing.make_word_dict(df_ngrams)
ngram_bow = preprocessing.make_bow_dfs(df = df_ngrams, word_dict=ngram_words)

# Add the BOW cosine similarity as an additional feature
df_ngrams['BOW'] = df_ngrams.apply(
    lambda x: preprocessing.cosine_sim(ngram_bow['Comment'].loc[x['PairID']], ngram_bow['Query'].loc[x['QueryID']]),
    axis=1
)

df_ngrams.head(3)

Unnamed: 0,PairID,QueryID,Comment,Query,Score,WordCountComment,WordCountQuery,MutualUnique,MutualWithRepetition,NCommonBigrams,NCommonTrigrams,BOW
0,BookStackApp_BookStack_ActivityService_740,0,Daj novu instancu aktivnosti za trenutnog kori...,red sa prioritetom,0,7,3,0,0,0,0,0.0
1,BookStackApp_BookStack_ActivityService_740,1,Daj novu instancu aktivnosti za trenutnog kori...,pretvaranje string u datum,0,7,4,0,0,0,0,0.0
2,BookStackApp_BookStack_ActivityService_740,2,Daj novu instancu aktivnosti za trenutnog kori...,sortiranje string liste,0,7,3,0,0,0,0,0.0
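
For reference, counting shared n-grams can be sketched as below. The sketch assumes word-level n-grams and counts each distinct shared n-gram once; whether `preprocessing.count_mutual_bigrams` and `preprocessing.count_mutual_trigrams` work on words or characters, or count repeats, is not visible in this notebook.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of distinct word n-grams in the text."""
    words = text.split()
    # For texts shorter than n words the range is empty, giving an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def count_mutual_ngrams(comment: str, query: str, n: int) -> int:
    """Hypothetical helper: number of distinct n-grams found in both texts."""
    return len(ngrams(comment, n) & ngrams(query, n))


comment = "pretvaranje string u datum i vreme"
query = "pretvaranje string u datum"
n_bigrams = count_mutual_ngrams(comment, query, 2)
n_trigrams = count_mutual_ngrams(comment, query, 3)
```

Here the query is a prefix of the comment, so all three of its bigrams and both of its trigrams are shared.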


In [46]:
df_ngrams.to_csv('ngrams.csv', index=False)

### TF weighting

In [45]:
df_tf = df.copy()

# Add the base features
preprocessing.make_base_features(df_tf, inplace=True)

# Compute the TF vectors
tf_words = preprocessing.make_word_dict(df_tf)
tf_bow = preprocessing.make_bow_dfs(df = df_tf, word_dict=tf_words)
tf_tf = preprocessing.make_tf_dfs(tf_bow)

# Add the TF cosine similarity as an additional feature
df_tf['TF'] = df_tf.apply(
    lambda x: preprocessing.cosine_sim(tf_tf['Comment'].loc[x['PairID']], tf_tf['Query'].loc[x['QueryID']]),
    axis=1
)

df_tf.head(3)

Unnamed: 0,PairID,QueryID,Comment,Query,Score,WordCountComment,WordCountQuery,MutualUnique,MutualWithRepetition,TF
0,BookStackApp_BookStack_ActivityService_740,0,Daj novu instancu aktivnosti za trenutnog kori...,red sa prioritetom,0,7,3,0,0,0.0
1,BookStackApp_BookStack_ActivityService_740,1,Daj novu instancu aktivnosti za trenutnog kori...,pretvaranje string u datum,0,7,4,0,0,0.0
2,BookStackApp_BookStack_ActivityService_740,2,Daj novu instancu aktivnosti za trenutnog kori...,sortiranje string liste,0,7,3,0,0,0.0
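
A minimal sketch of term-frequency weighting, assuming TF is the raw count of a word divided by the number of words in the text. This is one common definition, not necessarily the one used by `preprocessing.make_tf_dfs`, and `tf_vector` is a hypothetical helper.

```python
from collections import Counter


def tf_vector(text: str, vocabulary: list) -> list:
    """Relative frequency of each vocabulary word in the text."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    if total == 0:
        return [0.0 for _ in vocabulary]  # empty text: all frequencies are zero
    return [counts[word] / total for word in vocabulary]


vocabulary = ["datum", "string", "u"]
tf = tf_vector("string u string datum", vocabulary)
```

Under this particular definition each TF vector is just a rescaled BOW vector, and cosine similarity ignores positive rescaling, so the `TF` feature would equal the `BOW` feature. A different TF variant, for example a log-scaled one, would not have this property.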


In [47]:
df_tf.to_csv('td_ponderisanje.csv', index=False)

### TF-IDF weighting

In [51]:
df_tfidf = df.copy()

# Add the base features
preprocessing.make_base_features(df_tfidf, inplace=True)

# Compute the TF vectors
tfidf_words = preprocessing.make_word_dict(df_tfidf)
tfidf_bow = preprocessing.make_bow_dfs(df = df_tfidf, word_dict=tfidf_words)
tfidf_tf = preprocessing.make_tf_dfs(tfidf_bow)

idf = preprocessing.calc_idf(df_tfidf, tfidf_bow)
tfidf_tfidf = preprocessing.make_tfidf_dfs(tfidf_tf, idf)


# Add the TF-IDF cosine similarity as an additional feature
df_tfidf['TFIDF'] = df_tfidf.apply(
    lambda x: preprocessing.cosine_sim(tfidf_tfidf['Comment'].loc[x['PairID']], tfidf_tfidf['Query'].loc[x['QueryID']]),
    axis=1
)

df_tfidf.head(3)

Unnamed: 0,PairID,QueryID,Comment,Query,Score,WordCountComment,WordCountQuery,MutualUnique,MutualWithRepetition,TFIDF
0,BookStackApp_BookStack_ActivityService_740,0,Daj novu instancu aktivnosti za trenutnog kori...,red sa prioritetom,0,7,3,0,0,0.0
1,BookStackApp_BookStack_ActivityService_740,1,Daj novu instancu aktivnosti za trenutnog kori...,pretvaranje string u datum,0,7,4,0,0,0.0
2,BookStackApp_BookStack_ActivityService_740,2,Daj novu instancu aktivnosti za trenutnog kori...,sortiranje string liste,0,7,3,0,0,0.0
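
The TF-IDF weighting can be sketched as follows, assuming the textbook form: relative term frequency multiplied by `log(N / df)`, where `N` is the number of documents and `df` the number of documents containing the word. The formula used by `preprocessing.calc_idf` and `preprocessing.make_tfidf_dfs` is not shown here and may include smoothing or a different logarithm.

```python
import math
from collections import Counter

documents = [
    "sortiranje string liste",
    "pretvaranje string u datum",
    "red sa prioritetom",
]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for words in tokenized for word in words})

# Document frequency: in how many documents each word occurs at least once.
doc_freq = Counter(word for words in tokenized for word in set(words))
n_docs = len(tokenized)
idf = {word: math.log(n_docs / doc_freq[word]) for word in vocabulary}


def tfidf_vector(words: list) -> list:
    """Relative term frequency times IDF for every vocabulary word."""
    counts = Counter(words)
    return [counts[word] / len(words) * idf[word] for word in vocabulary]


tfidf = [tfidf_vector(words) for words in tokenized]
```

The word "string" occurs in two of the three documents and therefore gets a lower IDF than words that occur in only one, which reduces its influence on the cosine similarity.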


In [54]:
df_tfidf.to_csv('tfidf_ponderisanje.csv', index=False)

### Combining several preprocessing methods

In [None]:
df_multi = preprocessing.normalize_to_lowercase(df, inplace=False)
preprocessing.apply_stem(df_multi, inplace=True)

# Remove punctuation marks
df_multi['Comment'] = df_multi['Comment'].apply(lambda x: preprocessing.remove_interpunction(x))
df_multi['Query'] = df_multi['Query'].apply(lambda x: preprocessing.remove_interpunction(x))

# Build the lists of common and rare words
multi_words = preprocessing.make_word_dict(df_multi)
common_words = preprocessing.get_common_words(multi_words, threshold=0.9)
rare_words = preprocessing.get_rare_words(multi_words, threshold=0.1)

# Frequency-filter the comments and queries
df_multi['Comment'] = df_multi['Comment'].apply(lambda x: preprocessing.remove_common_words(x, common_words))
df_multi['Query'] = df_multi['Query'].apply(lambda x: preprocessing.remove_common_words(x, common_words))
df_multi['Comment'] = df_multi['Comment'].apply(lambda x: preprocessing.remove_rare_words(x, rare_words))
df_multi['Query'] = df_multi['Query'].apply(lambda x: preprocessing.remove_rare_words(x, rare_words))

# Add the base features
preprocessing.make_base_features(df_multi, inplace=True)


