## Preprocessing

Source code preprocessing has two distinguishing features:

1) Each programming language has its own grammar. The lexemes are usually individual code elements, such as variable names, function names, and operators.
2) It requires steps that are specific to code, such as removing comments, handling variable declarations and imports, and dealing with the peculiarities of each language.

For now we use `TweetTokenizer`; later we will describe preprocessing with the `code_tokenize` (tree-sitter) package.

In [2]:
from nltk.tokenize import NLTKWordTokenizer, TweetTokenizer

Removing some tokens (e.g., a string literal or a number) must be done at the tokenization stage, since only at this point do we know the types of tokens.

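In the following example, both tokenizers split the string literal `'bar'` into separate quote and word tokens, so it can no longer be recognized as a single literal afterwards.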

In [3]:
text = '''
def foo(bar: int) -> dict:
    return {'bar': bar}
'''

In [4]:
' '.join(NLTKWordTokenizer().tokenize(text))

"def foo ( bar : int ) - > dict : return { 'bar ' : bar }"

In [5]:
' '.join(TweetTokenizer().tokenize(text))

"def foo ( bar : int ) -> dict : return { ' bar ' : bar }"

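For comparison, here is a minimal sketch of type-aware tokenization for Python sources, using the standard-library `tokenize` module instead of NLTK. The helper name `python_tokens` and the set of dropped token types are assumptions made for illustration; they are not part of the pipeline used in this notebook.

```python
import io
import tokenize

# Token types to drop; this selection is an illustrative assumption.
DROP = {tokenize.STRING, tokenize.NUMBER, tokenize.COMMENT}
# Layout-only tokens that carry no useful text.
LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
          tokenize.DEDENT, tokenize.ENDMARKER}


def python_tokens(source):
    """Return the token strings of Python source, skipping literals and comments."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in DROP and tok.type not in LAYOUT
    ]


sample = '''
def foo(bar: int) -> dict:
    return {'bar': bar}  # build the result
'''
print(' '.join(python_tokens(sample)))
# def foo ( bar : int ) -> dict : return { : bar }
```

Because the token type is known here, the literal `'bar'` and the comment are dropped while the identifier `bar` is kept. Note that `tokenize` only handles Python and raises an error on source it cannot tokenize, so it is not a general replacement for `TweetTokenizer`.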
In [6]:
text = '''
#include <iostream>
using namespace std;

int main() { double n1, n2, n3; cout << "Hello, world!"; }
'''

In [7]:
' '.join(NLTKWordTokenizer().tokenize(text))

"# include < iostream > using namespace std ; int main ( ) { double n1 , n2 , n3 ; cout < < `` Hello , world ! '' ; }"

In [8]:
' '.join(TweetTokenizer().tokenize(text))

'#include <iostream> using namespace std ; int main ( ) { double n1 , n2 , n3 ; cout < < " Hello , world ! " ; }'
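Both NLTK tokenizers split the `<<` operator and the string literal in the C++ snippet above. A rough way to keep such tokens intact is a small regex-based lexer. The sketch below is an assumption-laden illustration: the pattern `TOKEN_RE` and the helper `c_like_tokens` are hypothetical names, the operator list is incomplete, and comments, character literals, and preprocessor directives are not handled specially.

```python
import re

# Earlier alternatives win, so literals and multi-character operators
# are tried before single characters.
TOKEN_RE = re.compile(r'''
    (?P<string>"(?:\\.|[^"\\])*")      # double-quoted string literal
  | (?P<number>\d+(?:\.\d+)?)          # integer or simple decimal literal
  | (?P<name>[A-Za-z_]\w*)             # identifier or keyword
  | (?P<op><<|>>|->|::|==|!=|<=|>=|&&|\|\||\+\+|--)
  | (?P<punct>\S)                      # any other non-space character
''', re.VERBOSE)


def c_like_tokens(source, drop=('string', 'number')):
    """Split C-like source into tokens, skipping the token kinds listed in `drop`."""
    return [
        match.group()
        for match in TOKEN_RE.finditer(source)
        if match.lastgroup not in drop
    ]


sample = '''
#include <iostream>
using namespace std;

int main() { double n1, n2, n3; cout << "Hello, world!"; }
'''
print(' '.join(c_like_tokens(sample)))
# # include < iostream > using namespace std ; int main ( ) { double n1 , n2 , n3 ; cout << ; }
```

Passing `drop=()` keeps `"Hello, world!"` as a single token instead of removing it. A grammar-based tokenizer such as tree-sitter is more reliable than this pattern for real code.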

## Save preprocessed data

In [9]:
import pandas as pd

In [10]:
train = pd.read_csv('train.csv')

In [11]:
test = pd.read_csv('test.csv')

In [12]:
tokenizer = TweetTokenizer()

In [13]:
train['code'] = train['code'].apply(lambda code: ' '.join(tokenizer.tokenize(code)))

In [14]:
test['code'] = test['code'].apply(lambda code: ' '.join(tokenizer.tokenize(code)))

In [15]:
train['code'] = train['code'].apply(lambda x: x.lower().strip())
test['code'] = test['code'].apply(lambda x: x.lower().strip())

In [16]:
train = train[(train['code'].map(len) > 0) & (train['code'] != 'null')]
test = test[(test['code'].map(len) > 0) & (test['code'] != 'null')]
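A boolean mask should be built from the same frame it is applied to; otherwise pandas has to realign the mask by index. One way to guarantee this is to wrap the filter in a helper. This is only a sketch: the name `drop_empty_code` and the `reset_index` call are assumptions added for illustration.

```python
import pandas as pd


def drop_empty_code(df, column='code'):
    """Keep rows whose `column` is a non-empty string other than 'null'."""
    values = df[column]
    # The mask is derived from `df` itself, so it always aligns with `df`.
    mask = (values.str.len() > 0) & (values != 'null')
    # Resetting the index is optional; it only tidies up the row labels.
    return df[mask].reset_index(drop=True)


demo = pd.DataFrame({'code': ['def foo ( ) :', '', 'null', 'int main ( )']})
print(drop_empty_code(demo)['code'].tolist())
# ['def foo ( ) :', 'int main ( )']
```

With such a helper, the same call can be applied to `train` and `test` without repeating the condition.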

In [17]:
train.to_csv('train-preprocessed.csv', index=False)
test.to_csv('test-preprocessed.csv', index=False)