## Defining Bug Types for the commitpackft dataset

*Approach 1*

Description: Build a vocabulary of keywords and key phrases that serves as the basis for categorizing the samples. Programmatically, the commit message of each sample is checked for key phrases belonging to each bug type (a message may contain phrases for more than one bug type). Commit messages are categorized into five bug-type classes (general, functionality, performance/compatibility, network/security, ui-ux), each mapped to a set of keywords; a sample is assigned one or more classes whenever its commit message contains a keyword of the corresponding type.

Cons:
- The keyword list for each type is not exhaustive and can be extended.
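
As a rough illustration of the approach described above, the following minimal sketch assigns every matching class to a message. The keyword lists in `BUG_TYPE_KEYWORDS` and the helper name `classify` are assumptions made up for this example; the actual keywords are stored in the `bug_types` table and are not shown in this notebook.

```python
import re

# Hypothetical keyword vocabulary, for illustration only: the real lists live
# in the bug_types table and are not reproduced in this notebook.
BUG_TYPE_KEYWORDS = {
    "general": ["fix", "bug", "error"],
    "functionality": ["feature", "logic", "behavior"],
    "performance/compatibility": ["slow", "memory", "support"],
    "network/security": ["request", "auth", "token"],
    "ui-ux": ["button", "layout", "css"],
}


def classify(message, vocabulary):
    """Return every bug type that has at least one keyword among the message words."""
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    return [
        bug_type
        for bug_type, keywords in vocabulary.items()
        if any(keyword in words for keyword in keywords)
    ]


labels = classify("Fix login button layout", BUG_TYPE_KEYWORDS)
print(labels)  # ['general', 'ui-ux']
```

A message can receive several labels at once, and a message without any keyword receives none, which is why the quality of the result depends directly on how complete the vocabulary is.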

In [1]:
import json
import sqlite3
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from collections import Counter
from df2sql import sqlite2postgres


In [35]:
con = sqlite3.connect('commitpack-datasets.db')
df = pd.read_sql_query('SELECT * FROM commitpackft', con).set_index('index')

# Build the NLP helpers once instead of on every call
PUNCT_TABLE = str.maketrans('', '', ".,?!:;'\"`-()[]{}/\\•*^_<>")
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()


def preprocess_text(text: str, get_tokens: bool = True):
    """Normalize a commit message; return tokens, or a joined string if get_tokens is False."""
    # Convert to lowercase and delete punctuation marks
    text = text.lower().translate(PUNCT_TABLE)
    # Tokenization
    tokens = word_tokenize(text)
    # Remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Stemming, then lemmatization applied to the stems
    stemmed_tokens = [STEMMER.stem(word) for word in tokens]
    lemmatized_tokens = [LEMMATIZER.lemmatize(word) for word in stemmed_tokens]

    if get_tokens:
        return lemmatized_tokens
    return " ".join(lemmatized_tokens)
    

def create_word_frequency(commit_messages):
    all_words = []
    for message in commit_messages:
        processed_words = preprocess_text(message)
        all_words.extend(processed_words)
    word_counts = Counter(all_words)
    return word_counts


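def classify_message(msg: str, bugTypes: pd.DataFrame):
    """Return the comma-separated bug types whose keywords occur in msg (empty string if none)."""
    bTypes = []
    for _, bType in bugTypes.iterrows():
        # `word in msg` is a substring test on the processed message string
        if any(word in msg for word in bType['keywords']):
            bTypes.append(bType['type'])
    return ",".join(bTypes)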


print("# Create Word Count Dictionary")
wordsFreq = create_word_frequency(df['message'].tolist())
# One row per distinct word with its number of occurrences
wordCount = pd.DataFrame(list(wordsFreq.items()), columns=['word', 'count'])
# IF EXISTS keeps the first run from failing when the table is not there yet
con.cursor().execute("drop table if exists commitpackft_word_count")
wordCount.to_sql('commitpackft_word_count', con)
sqlite2postgres(wordCount, 'commitpackft_word_count')

# Process Commits and store 
print("# Processing Commits..")
df["processed_messages"] = df["message"].apply(lambda msg: preprocess_text(text=msg, get_tokens=False))
con.cursor().execute('drop table if exists commitpackft_processed_commits')
df.to_sql(name='commitpackft_processed_commits', con=con)
sqlite2postgres(df, 'commitpackft_processed_commits')

print("# Classify samples")
bugTypes = pd.read_sql("select * from bug_types", con)
bugTypes['keywords'] = bugTypes['keywords'].str.split(',')
df['bug_type'] = df["processed_messages"].apply(lambda msg: classify_message(msg=msg, bugTypes=bugTypes))
con.cursor().execute("drop table if exists commitpackft_classified")
df.to_sql(name='commitpackft_classified' ,con=con)
sqlite2postgres(df, 'commitpackft_classified')

# Create Word Count Dictionary
# Processing Commits..
# Classify samples
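
One detail of `classify_message` is worth illustrating: `word in msg` is evaluated against the processed message string, so it is a substring test. A short keyword can therefore match inside an unrelated word. The sketch below contrasts this with whole-token matching; the keyword `"ui"` and both helper names are hypothetical and chosen only to show the effect.

```python
def matches_substring(msg, keywords):
    """Same test as in classify_message: keyword found anywhere in the string."""
    return any(keyword in msg for keyword in keywords)


def matches_token(msg, keywords):
    """Stricter variant: keyword must equal one whitespace-separated token."""
    tokens = set(msg.split())
    return any(keyword in tokens for keyword in keywords)


msg = "updat build script"   # a processed (stemmed) message
keywords = ["ui"]            # hypothetical keyword

print(matches_substring(msg, keywords))  # True: "ui" occurs inside "build"
print(matches_token(msg, keywords))      # False
```

Token matching avoids such false positives but no longer finds multi-word key phrases, so which variant fits better depends on the vocabulary. In either case the keywords have to be stored in the same stemmed form as the processed messages.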


In [1]:
from modules.filters import identify

identify(query="select * from commitpackft", sqlite_path='commitpack-datasets.db')

# Create Word Count Dictionary
# Processing Commits
# Classifying samples


In [9]:
import sqlite3
con = sqlite3.connect('commitpack-datasets.db')
print(con.cursor().execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall())

import pandas as pd
train_df = pd.read_sql_query('select * from commitpackft_classified_train', con)
test_df = pd.read_sql_query('select * from commitpackft_classified_test', con)

print(len(train_df))
print(len(test_df))
train_df.head()

[('humanevalpack',), ('commitpackft',), ('commitpackft_eslintparsed',), ('CodeT5JS_3_ROUGE',), ('commitpackft-classified',), ('commitpackft-classified_v2',), ('bug_types',), ('commitpackft_word_count',), ('commitpackft_processed_commits',), ('commitpackft_classified',), ('commitpackft_classified_train',), ('commitpackft_classified_test',)]
19452
9582


Unnamed: 0,index,commit,old_file,new_file,old_contents,new_contents,subject,message,lang,license,repos,processed_messages,bug_type
0,11805,0c4b57fcd7f85581e7bb06e0e418cdb6015b5ffa,karma.conf.js,karma.conf.js,const webpackConfig = require('./webpack.confi...,const webpackConfig = require('./webpack.confi...,Include all of the files in test coverage report,Include all of the files in test coverage repo...,JavaScript,mit,"kbeloborodko/webpack-deep-dive,kbeloborodko/we...",includ file test coverag report,compatibility/performance
1,15343,077e8042f60676ca4644da8793a6c09e1abbc93c,public/mainCtrl.js,public/mainCtrl.js,"angular.module('myApp', ['uiGmapgoogle-maps'])...","angular.module('myApp', ['uiGmapgoogle-maps'])...",Change input class name for GMap input template.,Change input class name for GMap input templat...,JavaScript,apache-2.0,"kvasir/studenthem,kvasir/studenthem",chang input class name gmap input templat,funcionality
2,29332,c62998b16908b95a1904227e8f1d008fc80ce714,IPython/html/static/widgets/js/widget_float.js,IPython/html/static/widgets/js/widget_float.js,// Copyright (c) IPython Development Team.\n//...,// Copyright (c) IPython Development Team.\n//...,Add support to the float slider,Add support to the float slider\n,JavaScript,bsd-3-clause,"SylvainCorlay/ipywidgets,jupyter-widgets/ipywi...",add support float slider,compatibility/performance
3,19531,2ee8fba863a9302262a9d4adc53142200efe6bf6,src/views/game_over_view.js,src/views/game_over_view.js,import React from 'react'\nimport classnames f...,import ButtonView from './button_view'\nimport...,Refactor game over view to use button view,Refactor game over view to use button view\n,JavaScript,mit,"nullobject/hexgrid,nullobject/hexgrid",refactor game view use button view,ui-ux
4,31663,e6b3a08ec185b347b7d9c6b2a7fb2117fc1b951a,lib/modules/docker/__tests__/index.js,lib/modules/docker/__tests__/index.js,"import {describe, it} from 'mocha';\n\ndescrib...","import {describe, it} from 'mocha';\nimport as...",Add tests for 'mup docker setup',Add tests for 'mup docker setup'\n,JavaScript,mit,"arunoda/meteor-up,zodern/meteor-up,zodern/mete...",add test mup docker setup,compatibility/performance
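
Because `bug_type` holds a comma-separated string, counting samples per class requires splitting each value first. A minimal sketch, using only the five `bug_type` values displayed above as input (on the full table, the list would presumably come from `train_df['bug_type']`):

```python
from collections import Counter

# bug_type values copied from the five rows displayed above; a sample with
# several labels would look like "general,ui-ux" (hypothetical example).
bug_types = [
    "compatibility/performance",
    "funcionality",
    "compatibility/performance",
    "ui-ux",
    "compatibility/performance",
]

label_counts = Counter(
    label
    for value in bug_types
    for label in value.split(",")
    if label  # skip the empty string produced when no keyword matched
)
print(label_counts.most_common())
```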


