## Preprocessing with code-tokenize & Pygments

The code_tokenize package was selected to tokenize source code during the training phase. It is built on the tree-sitter package, which provides parsers for a large number of programming languages.

- code_tokenize: https://github.com/cedricrupb/code_tokenize
- tree-sitter: https://tree-sitter.github.io/tree-sitter/

Note that it is used only in the TRAINING part, because the tokenizer has to be told which language it is parsing, and that label is known only for the training data.

The idea is simple: clean the dataset by removing stop-types (token types) rather than stop-words. This approach is more suitable for source code, since programming languages have their own syntactic structure. For example, we can remove comments or some literals.
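
The stop-type idea can be illustrated without tree-sitter. The following is a minimal sketch, assuming Python input and using the standard library `tokenize` module as a stand-in for code_tokenize. The helper name `drop_stop_types` is hypothetical and is not part of codcat.

```python
import io
import tokenize

# Layout tokens carry no text of interest, so they are always skipped.
LAYOUT_TYPES = {
    tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER,
}


def drop_stop_types(source, stop_types=frozenset({tokenize.COMMENT})):
    """Return the token strings of Python `source`, skipping stop-types."""
    kept = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in stop_types or tok.type in LAYOUT_TYPES:
            continue
        kept.append(tok.string)
    return kept


snippet = "# setup\nx = 1  # inline comment\nprint('hi')\n"
# Default: only comments are treated as a stop-type.
print(drop_stop_types(snippet))
# Also treat string literals as a stop-type.
print(drop_stop_types(snippet, {tokenize.COMMENT, tokenize.STRING}))
```

Filtering by token type rather than by word means that a word such as `print` is kept when it is code and dropped when it appears inside a comment.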

In [None]:
!pip install code_tokenize pygments

In [1]:
from codcat.tok import preprocess

### Let's check preprocessing for some languages

In [2]:
print(preprocess('''
# This is a Bash comment.
echo "This is Code" # This is an inline Bash comment.
a=5
b=7
''', lang='bash'))

echo "This is Code" 
a=5
b=7


VB.NET is an exception: it is not supported by the tree-sitter package, so the text is returned in its original form.

In [3]:
print(preprocess('''
' This is a comment beginning at the left edge of the screen.
text1.Text = "Hi!"   ' This is an inline comment.
''', lang='vb.net'))


' This is a comment beginning at the left edge of the screen.
text1.Text = "Hi!"   ' This is an inline comment.



In [4]:
print(preprocess('''
# this is a print statement
# it prints Hello World

print("Hello World") 
''', lang='r'))

print("Hello World")


In [5]:
print(preprocess('''
// some c++ comment
/* Multi
Line 
Comment */

int main() {
    return 0;
}
''', lang='cpp'))

int main() {
    return 0;
}


In [6]:
print(preprocess('''
// some c comment
/* Multi
Line 
Comment */

int main() {
    return 0;
}
''', lang='c'))

int main() {
    return 0;
}


In [7]:
print(preprocess('''
#!/usr/bin/ruby -w
# This is a single line comment.

puts "Hello, Ruby!"
''', lang='ruby'))

puts "Hello, Ruby!"


In [8]:
print(preprocess('''
// This is a comment
System.out.println("Hello World");
''', lang='java'))

System.out.println("Hello World");


In [9]:
print(preprocess('''
<?php
    echo 'This is a test'; // This is a one-line c++ style comment
    /* This is a multi line comment
       yet another line of comment */
    echo 'This is yet another test';
    echo 'One Final Test'; # This is a one-line shell-style comment
?>
''', lang='php'))

<?php
    echo 'This is a test'; 
    
    echo 'This is yet another test';
    echo 'One Final Test'; 
?>


By default, only comments are deleted. To evaluate other settings, you can try deleting different token types from the source code.

In practice, this method does not always improve classification quality, but it can make classification more stable. Comments appear in virtually every language, yet they play little role in determining which language a snippet is written in.

## Pygments

Pygments is a good way to get rid of tokens that contribute little to classification. It can also be used to remove comments or certain literals. 
The approach suggested below cleans the data without relying on knowledge of the test labels.
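
As an illustration of token-type filtering with Pygments, here is a minimal sketch. It assumes the language name is already known (or predicted) and is a valid Pygments lexer alias. The helper `strip_comments` is hypothetical and is not the codcat implementation. Languages without a lexer fall back to the unchanged text.

```python
from pygments.lexers import get_lexer_by_name
from pygments.token import Comment
from pygments.util import ClassNotFound


def strip_comments(code, lang):
    """Drop every token whose type is Comment or one of its subtypes."""
    try:
        lexer = get_lexer_by_name(lang)
    except ClassNotFound:
        # No lexer for this language: return the text unchanged.
        return code
    return "".join(
        value for ttype, value in lexer.get_tokens(code) if ttype not in Comment
    )


print(strip_comments("x = 1  # inline comment\nprint('hi')\n", "python"))
```

Note that Pygments files preprocessor directives under `Comment.Preproc`, so a filter on the whole `Comment` family also drops lines such as `#include` in C. If such lines are useful signals, compare against specific subtypes like `Comment.Single` and `Comment.Multiline` instead.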

In general, the structure of the approach is as follows:

`Data -> Initial Clf -> Pygments -> Preprocessed Data -> Clf`

That is, we first train a reasonably simple classifier, or use a pre-trained one, that predicts the class label with good accuracy. We then use its predictions to choose the lexer and preprocess the text. Finally, we classify the preprocessed text. 
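
The flow above can be sketched in plain Python. This is only an outline under assumptions: `initial_predict`, `clean`, and `final_predict` are hypothetical callables standing in for the initial classifier, the Pygments-based preprocessing, and the final classifier. The `Prelude` step used later in this notebook may be organized differently.

```python
def two_stage_predict(texts, initial_predict, clean, final_predict):
    """Data -> Initial Clf -> cleaning -> Preprocessed Data -> Clf."""
    # Stage 1: guess a language for each text; no true labels are needed.
    guessed = initial_predict(texts)
    # Stage 2: clean each text using its guessed language.
    cleaned = [clean(text, lang) for text, lang in zip(texts, guessed)]
    # Stage 3: classify the cleaned texts.
    return final_predict(cleaned)


# Toy stand-ins, for illustration only.
def toy_initial(texts):
    return ["bash" if "echo" in t else "python" for t in texts]


def toy_clean(text, lang):
    # Both toy languages use '#' comments; drop everything after the marker.
    return text.split("#", 1)[0].strip()


def toy_final(texts):
    return ["bash" if t.startswith("echo") else "python" for t in texts]


print(two_stage_predict(["echo hi # greet", "print(1) # one"],
                        toy_initial, toy_clean, toy_final))
```

Because the cleaning step depends only on predicted labels, the same function can be applied to test data. The cost is that a wrong first-stage guess selects the wrong lexer, which may be one reason the approach does not always help.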

This approach does not always give good results: on the Code25 dataset (SCC) it degrades quality, and on the original dataset it neither degrades nor improves it.

In [10]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import lime
import lime.lime_text
from sklearn.metrics import classification_report
from codcat.pyg import get_lexer, Prelude
from nltk import TweetTokenizer

In [11]:
train = pd.read_json('train-code25-smol.json')
test = pd.read_json('test-code25-smol.json')

In [12]:
model = Pipeline(
    [
        ('cv', TfidfVectorizer()),  # TF-IDF features over word tokens
        ('nb', MultinomialNB()),    # multinomial naive Bayes classifier
    ]
)

In [13]:
model.fit(train['code'], train['language'])

In [14]:
print(classification_report(test['language'], model.predict(test['code'])))

              precision    recall  f1-score   support

        bash       0.65      0.74      0.70       129
           c       0.66      0.81      0.73       132
     c_sharp       0.76      0.48      0.59       143
         cpp       0.68      0.69      0.68       153
         css       0.67      0.86      0.75       145
     haskell       0.87      0.75      0.81       146
        java       0.82      0.61      0.70       169
  javascript       0.58      0.82      0.68       131
         lua       0.97      0.62      0.76       104
        objc       0.72      0.74      0.73       133
        perl       0.67      0.72      0.69       142
         php       0.70      0.53      0.60       150
      python       0.86      0.54      0.67       151
           r       0.59      0.79      0.68       136
        ruby       0.81      0.67      0.74       143
       scala       0.69      0.84      0.76       128
      sqlite       0.68      0.83      0.75       153
       swift       0.84    

In [15]:
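model_prelude = Pipeline(
    [
        ('prelude', Prelude(model)),  # preprocess using the fitted baseline model's predictions
        ('cv', TfidfVectorizer()),    # TF-IDF features over word tokens
        ('nb', MultinomialNB()),      # multinomial naive Bayes classifier
    ]
)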

In [16]:
model_prelude.fit(train['code'], train['language'])

In [17]:
print(classification_report(test['language'], model_prelude.predict(test['code'])))

              precision    recall  f1-score   support

        bash       0.64      0.73      0.68       129
           c       0.67      0.80      0.73       132
     c_sharp       0.78      0.46      0.58       143
         cpp       0.68      0.65      0.67       153
         css       0.66      0.85      0.74       145
     haskell       0.88      0.75      0.81       146
        java       0.83      0.62      0.71       169
  javascript       0.60      0.78      0.68       131
         lua       0.98      0.62      0.76       104
        objc       0.72      0.74      0.73       133
        perl       0.68      0.69      0.68       142
         php       0.65      0.53      0.59       150
      python       0.84      0.54      0.66       151
           r       0.60      0.79      0.68       136
        ruby       0.81      0.66      0.73       143
       scala       0.68      0.83      0.74       128
      sqlite       0.66      0.84      0.74       153
       swift       0.84    