Language Detector for Programming Languages

Import all used packages and modules:

In [44]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import utils

import os, PIL
from glob import glob

# Regular Expression Parsing
import re

# Natural Language Toolkit
import nltk; nltk.download("stopwords"); nltk.download("wordnet")

# Language Token Processing and Frequency Distribution Calculator
from textblob import Word
from collections import Counter

# Generalized Machine/Deep Learning Codependencies
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

import tensorflow as tf

print(tf.__version__)

2.7.0


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sammy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Sammy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Define short aliases for the Keras classes and functions we use:

In [82]:
# Sequential Model Architecture
Sequential = tf.keras.models.Sequential

# Connective Layers with Dropout
Dense = tf.keras.layers.Dense
Dropout = tf.keras.layers.Dropout

# Early Stopping Optimization
EarlyStopping = tf.keras.callbacks.EarlyStopping

# Natural Text-Based Language Processing Layers with RNN
Embedding = tf.keras.layers.Embedding
LSTM = tf.keras.layers.LSTM
SpatialDropout1D = tf.keras.layers.SpatialDropout1D
Conv2D = tf.keras.layers.Conv2D
MaxPool2D = tf.keras.layers.MaxPool2D

# Language Tokenization Filter
Tokenizer = tf.keras.preprocessing.text.Tokenizer

# Padding Function for Dataset Ingestion Preprocessing
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences

Load in the dataset from file and create a dataframe with categories:

In [46]:
# Columns: title, corpus, language
column_names = ["title", "corpus", "language"]
c_list = os.listdir('dataset/c')
cpp_list = os.listdir('dataset/c++')
cs_list = os.listdir('dataset/c#')

# Collect rows in a list and build the DataFrame once at the end;
# DataFrame.append is deprecated and was removed in pandas 2.0.
rows = []

# zip() stops at the shortest list, so the three languages stay balanced
# and the rows are interleaved as c, c++, c#.
for (c, cpp, cs) in zip(c_list, cpp_list, cs_list):
    for language, filename in (('c', c), ('c++', cpp), ('c#', cs)):
        # The with-block closes the file, so no explicit close() is needed
        with open(f'dataset/{language}/{filename}', encoding="utf8") as file:
            rows.append({'title': filename.split('.')[0],
                         'corpus': file.read(),
                         'language': language})

df = pd.DataFrame(rows, columns=column_names)
df.head()

   title                corpus                                             language
0  braswell             // SPDX-License-Identifier: GPL-2.0+\n/*\n * C...  c
1  eigensolver_complex  // This file is part of Eigen, a lightweight C...  c++
2  Bootstrap            #define USE_UPDATE_CHECKS\nusing System;\nusin...  c#
3  common               // SPDX-License-Identifier: GPL-2.0+\n/*\n * c...  c
4  homo_sm4_decrypt     #include <iostream>\n\n#include <helib/helib.h...  c++


Check for null counts:

In [47]:
features_to_check = ["corpus", "language"]
processed = df[features_to_check]
processed.isnull().sum()

corpus      0
language    0
dtype: int64

Helper function that removes stop words from our bodies of text and generally cleans them up

In [48]:
def clean_corpus(data, stopwords, dtype="frame", feature="corpus"):
  """ Function to lowercase text, collapse whitespace, remove digits,
  drop stop words, and lemmatize the remaining tokens. """
  if dtype == "frame":
    # Lowercase every token and collapse whitespace (including newlines)
    data[feature] = data[feature].apply(
        lambda corpus: " ".join(word.lower() for word in corpus.split())
    )
    # Remove digits; regex=True is needed because pandas 2.0+ defaults to literal matching
    data[feature] = data[feature].str.replace(
        r"\d+", "", regex=True
    )
    # Drop tokens that exactly match a stop word
    data[feature] = data[feature].apply(
        lambda corpus: " ".join(token for token in corpus.split() if token not in stopwords)
    )
    # Lemmatize each remaining token
    data[feature] = data[feature].apply(
        lambda corpus: " ".join([Word(token).lemmatize() for token in corpus.split()])
    )
  elif dtype == "list":
    # Same four steps, applied to a plain list of strings
    data = [" ".join(word.lower() for word in corpus.split()) for corpus in data]
    data = [re.sub(r"\d+", "", corpus) for corpus in data]
    data = [" ".join(token for token in corpus.split() if token not in stopwords) for corpus in data]
    data = [" ".join([Word(token).lemmatize() for token in corpus.split()]) for corpus in data]
  return data
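
To make the effect of each cleaning step concrete, here is a minimal, dependency-free sketch of the same pipeline without the lemmatization step. The function name `clean_text` and the sample string are hypothetical and exist only for illustration; the stop word list is the one defined later in this notebook.

```python
import re

def clean_text(text, stopwords):
    """Hypothetical stand-in for clean_corpus, minus lemmatization."""
    # Lowercase every token and collapse whitespace (including newlines)
    text = " ".join(word.lower() for word in text.split())
    # Remove all digits
    text = re.sub(r"\d+", "", text)
    # Drop tokens that exactly match a stop word
    return " ".join(token for token in text.split() if token not in stopwords)

stopwords = ["for", "while", "do", "goto", "if", "else", "{", "}",
             "\\n", "i", "n", "//", "/*", "*"]

# Made-up sample, not taken from the dataset
sample = "// SPDX-License-Identifier: GPL-2.0+\nfor (int i = 0; i < 10; i++) { total += i; }"
cleaned = clean_text(sample, stopwords)
print(cleaned)
# spdx-license-identifier: gpl-.+ (int = ; < ; i++) total += i;
```

Stop words are matched against whole whitespace-separated tokens, so the standalone `i` is removed while `i++)` and `i;` survive.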

Create a list of stop words specific to our dataset and purpose, then clean our data by removing those stop words and more

In [49]:
stopwords = ["for", "while", "do", "goto", "if", "else", "{", "}", "\\n", "i", "n", "//","/*","*"] #coding specific stopwords

processed = clean_corpus(data=processed, 
                         stopwords=stopwords,
                         dtype="frame",
                         feature="corpus")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[feature] = data[feature].apply(
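
The warning above is pandas' SettingWithCopyWarning: `processed` was created by selecting columns from `df`, so pandas cannot tell whether assigning to it should also modify `df`. A minimal sketch of one common way to avoid it, taking an explicit copy, is shown here; the two-row frame is hypothetical and stands in for the real dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
df = pd.DataFrame({
    "title": ["demo_a", "demo_b"],
    "corpus": ["INT Main", "USING System"],
    "language": ["c", "c#"],
})

# .copy() gives `processed` its own data, so column assignments are unambiguous
processed = df[["corpus", "language"]].copy()
processed["corpus"] = processed["corpus"].str.lower()

print(processed["corpus"].tolist())  # cleaned copy
print(df["corpus"].tolist())         # original frame is left untouched
```

With the copy in place, the cleaning function can assign to `data[feature]` freely without touching the original frame.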

Now check the head of the data to see a difference in bodies of text

In [50]:
processed.head(3)

   corpus                                             language
0  spdx-license-identifier: gpl-.+ copyright (c) ...  c
1  this file is part of eigen, a lightweight c++ ...  c++
2  #define use_update_checks using system; using ...  c#


Helper that uses a tokenizer to turn our words into padded sequences of numerical values that the model can interpret

In [54]:
def tokenize_dataset(tokenizer, data):
  """ Function to tokenize input data for model training/testing. """
  return pad_sequences(tokenizer.texts_to_sequences(data))

Initialize the tokenizer (num_words=500 limits the vocabulary to the most frequent words; it is not a per-document limit) and create an X variable using our preprocessed code text

In [56]:
tokenizer = Tokenizer(num_words=500, split=" ")
tokenizer.fit_on_texts(processed["corpus"].values)

X = tokenize_dataset(tokenizer, processed["corpus"].values)
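
As a rough illustration of what the tokenizer and padding steps produce, here is a simplified pure-Python stand-in. It is not the Keras implementation: it skips Keras' lowercasing and punctuation filtering and only mirrors three behaviours, namely that word indices start at 1 in order of descending frequency, that only indices below `num_words` are kept, and that shorter sequences are left-padded with zeros. The helper names `fit_vocab` and `encode_and_pad` and the sample texts are hypothetical.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Rank tokens by frequency; index 0 is reserved for padding."""
    counts = Counter(token for text in texts for token in text.split())
    ranked = [token for token, _ in counts.most_common()]
    # Keep only indices strictly below num_words
    return {token: index for index, token in enumerate(ranked, start=1)
            if index < num_words}

def encode_and_pad(texts, vocab):
    """Map tokens to indices, drop unknown tokens, left-pad with zeros."""
    sequences = [[vocab[t] for t in text.split() if t in vocab] for text in texts]
    width = max(len(seq) for seq in sequences)
    return [[0] * (width - len(seq)) + seq for seq in sequences]

# Made-up sample texts
texts = ["int main int main int", "using system", "int x"]
vocab = fit_vocab(texts, num_words=3)
padded = encode_and_pad(texts, vocab)

print(vocab)   # {'int': 1, 'main': 2}
print(padded)  # [[1, 2, 1, 2, 1], [0, 0, 0, 0, 0], [0, 0, 0, 0, 1]]
```

The second text contains no word from the capped vocabulary, so it becomes a row of padding. Rare tokens in the real corpus are dropped from the sequences in the same way.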

One-hot encode the language labels as y, then split X and y into training (70%) and testing (30%) sets

In [58]:
y = pd.get_dummies(processed['language'])
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.7, 
                                                    test_size=0.3)
X.shape

(30, 3271)
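
With only 30 samples, a purely random 30% split can leave the languages unevenly represented in the test set. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose. The sketch below shows the underlying idea in pure Python rather than the scikit-learn implementation. The helper name `stratified_split` is hypothetical, and the label list assumes the 30 rows are spread evenly, 10 per language.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Split indices so each class contributes the same share to the test set."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Assumption: 10 files per language, interleaved as in the loading loop
labels = ["c", "c++", "c#"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
test_counts = Counter(labels[i] for i in test_idx)

print(len(train_idx), len(test_idx))  # 21 9
print(sorted(test_counts.items()))    # [('c', 3), ('c#', 3), ('c++', 3)]
```

Fixing the seed also makes the split reproducible between runs.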

Define the layers for our RNN up front; this keeps the model-building code tidy when we add them to the model

In [84]:
# Embedding Layer for Token-Specific Vectorization
input_embedding_layer = Embedding(500, 120, input_length=X.shape[1])

# Dropout Regularizer for Text Embedding
embedding_dropout_layer = SpatialDropout1D(0.4)

# First Recurrent LSTM Cellular Architecture
first_recurrent_layer = LSTM(176, 
                             dropout=0.2, 
                             recurrent_dropout=0.2, 
                             return_sequences=True)

# Second Recurrent LSTM Cellular Architecture
second_recurrent_layer = LSTM(176, 
                              dropout=0.2, 
                              recurrent_dropout=0.2)

# Final Dense Layer for Output Extraction
output_connective_layer = Dense(3, activation="softmax")

Initialize the model, add our layers to it, and print a summary

In [65]:
# Sequential Model Architecture Design
model = Sequential()

# Add All Initialized Layers in Effective Sequence
model.add(input_embedding_layer)
model.add(embedding_dropout_layer)
model.add(first_recurrent_layer)
model.add(second_recurrent_layer)
model.add(output_connective_layer)

# Get Model Summary for Confirmation
model.summary()

Model: "sequential_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_7 (Embedding)     (None, 3271, 120)         60000     
                                                                 
 spatial_dropout1d_7 (Spatia  (None, 3271, 120)        0         
 lDropout1D)                                                     
                                                                 
 lstm_12 (LSTM)              (None, 3271, 176)         209088    
                                                                 
 lstm_13 (LSTM)              (None, 176)               248512    
                                                                 
 dense_12 (Dense)            (None, 3)                 531       
                                                                 
Total params: 518,131
Trainable params: 518,131
Non-trainable params: 0
_______________________________________________
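
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical, and the sizes are taken from the layer definitions above.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

counts = [
    embedding_params(500, 120),  # Embedding(500, 120)
    lstm_params(120, 176),       # first LSTM reads 120-dim embeddings
    lstm_params(176, 176),       # second LSTM reads 176-dim LSTM outputs
    dense_params(176, 3),        # Dense(3)
]
print(counts, sum(counts))
# [60000, 209088, 248512, 531] 518131
```

The dropout layer has no trainable weights, which is why it shows 0 parameters.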

Compile our model using categorical crossentropy as the loss function

In [66]:
# Compile Model with Specified Loss and Optimization Functions
model.compile(loss="categorical_crossentropy",
              optimizer="nadam",
              metrics=["accuracy"])
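
For intuition about the loss values reported later, here is a small worked example of categorical cross-entropy for a single one-hot label. It is a sketch of the formula only; Keras additionally clips probabilities to avoid log(0) and averages over the batch. The function and variable names are hypothetical.

```python
import math

def categorical_crossentropy(y_true, y_prob):
    """Cross-entropy for one sample: -sum(y_true * log(y_prob))."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t > 0)

one_hot = [0, 1, 0]  # the true class is index 1

confident_right = categorical_crossentropy(one_hot, [0.05, 0.90, 0.05])
uniform_guess = categorical_crossentropy(one_hot, [1 / 3, 1 / 3, 1 / 3])
confident_wrong = categorical_crossentropy(one_hot, [0.90, 0.05, 0.05])

print(round(confident_right, 4))  # 0.1054
print(round(uniform_guess, 4))    # 1.0986
print(round(confident_wrong, 4))  # 2.9957
```

A uniform guess over three classes costs ln(3), about 1.0986, which is a useful yardstick: an average loss above that value means the model is often confidently wrong.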

Set up early stopping and define the batch size and epochs we will use

In [67]:
# Define Early Stopping Callback Optimizer
callback = EarlyStopping(monitor="loss", patience=3)

# Define Batch Size and Epochs as Hyperparameters
batch_size, epochs = 32, 10
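
Note that `monitor="loss"` watches the training loss, not a validation loss, so this callback only stops training when the model stops fitting the training data better. The patience rule itself can be sketched as follows. This is a simplified stand-in that ignores options such as `min_delta`; the function name and the loss values are hypothetical.

```python
def epochs_run(losses, patience):
    """Return the number of epochs completed before early stopping triggers."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # New best loss: reset the patience counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(losses)

# Made-up loss curves for illustration
stalling = [1.10, 0.90, 0.95, 0.93, 0.91, 0.80]
improving = [1.10, 0.90, 0.70, 0.50, 0.30, 0.10]

print(epochs_run(stalling, patience=3))   # 5
print(epochs_run(improving, patience=3))  # 6
```

In the first curve the loss fails to beat 0.90 for three epochs in a row, so training stops after epoch 5 and never reaches the better value at epoch 6.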

Fit our training values to our model and allow for verbose output so we can see its progress over time

In [68]:
# Fit Learning Model Using Training Data and Configured Hyperparameters
history = model.fit(X_train, y_train,
                    epochs=epochs,
                    batch_size=batch_size,
                    callbacks=[callback],
                    verbose=True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Training accuracy reaches 1.000, which suggests the model has overfit the dataset or that the dataset is undersampled.
On an easy dataset with only a few features, an accuracy close to 100% is to be expected, but not on a dataset with hundreds of features.
Because this dataset is so restricted in size, the model has most likely overfit.
While 10-11 epochs is a typical training length, the selection of data we are using is too small for the model to learn how to handle data that lies outside of what it was trained on.

In [69]:
# Evaluate Learned Model Using Testing Data
model.evaluate(X_test, y_test)



[1.6683546304702759, 0.5555555820465088]

Evidently, the accuracy of the model is not 100% on unseen data. Despite training accuracy being 100% at the end of epoch 10, the testing accuracy is about 55.6% (5 of the 9 test programs), which is what we would expect from an overfit model. Even so, the model classifies more than half of our test code as its correct language. This is better than random guessing among three languages, which would have an average accuracy of about 33.3%.

In [79]:
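# Get Our Predicted Class Probabilities
# Rounding is not needed: argmax over the raw probabilities picks the most
# likely class, whereas rounding first would turn a row with no probability
# above 0.5 into all zeros, which argmax would report as class 0.
y_pred = pd.DataFrame(data=model.predict(X_test))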

Create a confusion matrix for further testing of our model

In [81]:
cmat = confusion_matrix(y_true=y_test.values.argmax(axis=1), 
                        y_pred=y_pred.values.argmax(axis=1))
cmat

array([[0, 0, 2],
       [0, 2, 2],
       [0, 0, 3]], dtype=int64)

From our confusion matrix we can see the results of our testing. Rows are true labels and columns are predicted labels, both in the column order of y; pd.get_dummies sorts the labels, which gives the order C, C#, C++:
- The two C programs we passed in were both evaluated as C++ programs
    - C and C++ share most of their syntax, so this is an easy mistake for the model to make.
- Two C# programs were evaluated correctly while two were evaluated as C++
    - Both C++ and C# are similar in nature. Some of the C# programs are also written for Unity; they are modular in nature and do not have many of the object-oriented constructs you would expect to see in a regular C# script.
- All three C++ programs tested were correctly identified

Our model leans heavily towards classifying programs as C++ (7 of the 9 predictions).
This is likely due to a combination of overfitting and the small sample size of our dataset.
All that being said, the model is more successful than a blind pick: random guessing would be only about 33% accurate, while our model is 55.6% accurate.
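
The headline numbers can be recomputed directly from the confusion matrix. The sketch below works on class indices only; which language each index stands for is given by the column order of `y` (`y.columns`), so check that mapping before naming the classes.

```python
# Confusion matrix from the cell above: rows are true classes, columns are predictions
cmat = [[0, 0, 2],
        [0, 2, 2],
        [0, 0, 3]]

n_classes = len(cmat)
total = sum(sum(row) for row in cmat)
correct = sum(cmat[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall per class index: correct predictions / true samples in that row
recall = [cmat[i][i] / sum(cmat[i]) for i in range(n_classes)]

# How often each class index was predicted (column sums)
predicted_counts = [sum(row[j] for row in cmat) for j in range(n_classes)]

print(round(accuracy, 4))  # 0.5556
print(recall)              # [0.0, 0.5, 1.0]
print(predicted_counts)    # [0, 2, 7]
```

The column sums make the bias explicit: one class receives 7 of the 9 predictions and another is never predicted at all.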