#Programming Language Classifier using Machine Learning

In [2]:
import pandas as pd
from scraper import scrape_clean_cut
# Star import: presumably provides make_pipe and the scikit-learn names used below
from feature_vectors import *

##Using Two Dataframes: 
###(1) Dataframe with various languages (>200 examples from Rosetta Code) 

In [3]:
df_700x200 = pd.read_pickle('scraper_700x200.pkl')

In [4]:
df_700x200.head(15)

Unnamed: 0,0,1
0,ada,with Ada.Text_IO; procedure Integers_In_Englis...
1,algol68,PROC number words = (INT n)STRING:( # returns...
2,algol68,"MODE EXCEPTION = STRUCT(STRING name, PROC VOID..."
4,autohotkey,Loop { ; TEST ...
5,awk,# syntax: GAWK -f NUMBER_NAMES.AWKBEGIN { ...
6,qbasic,DECLARE FUNCTION int2Text$ (number AS LONG) 's...
8,c,#include <stdio.h>#include <string.h> const ch...
9,cpp,#include <string>#include <iostream>using std:...
10,csharp,using System; class NumberNamer { static re...
11,clojure,"(clojure.pprint/cl-format nil ""~R"" 1234)=> ""on..."


In [5]:
y = df_700x200.loc[:, 0]
X = df_700x200.loc[:, 1]

###(2) Dataframe with only languages from test file (11 total languages)

The training data was scraped from Rosetta Code. 
It covers 700 example pages of code in total, restricted to the languages that appear in the test file.

In [6]:
filtered_df = pd.read_pickle('scraper_filtered_700x1.pkl')

In [7]:
filtered_y_train = filtered_df.loc[:, 0]
filtered_X_train = filtered_df.loc[:, 1]

#Testing only with dataset from Rosetta Code

The purpose of testing first with all of the Rosetta Code data and languages, and not just the test languages, is to see whether my estimator is overfitting the test file data. The held-out portion produced by train_test_split contains far more sample code than the test file, which has only 32 examples across 11 languages. 

In [8]:
rc_X_train, rc_X_test, rc_y_train, rc_y_test = train_test_split(X, y)
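
To make the split concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. It is not scikit-learn's implementation; `holdout_split`, `test_fraction`, and `seed` are names made up for illustration. The 25% test share mirrors the default of `train_test_split` when no size is given. Because the cell above does not fix `random_state`, the scores reported below will vary somewhat from run to run.

```python
import math
import random


def holdout_split(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle paired samples/labels and hold out a fraction for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = math.ceil(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [samples[i] for i in train_idx]
    X_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data standing in for the code and language columns of the DataFrame.
code_samples = [f"snippet {i}" for i in range(8)]
languages = ["ada", "c", "cpp", "d", "e", "awk", "bash", "cobol"]

X_tr, X_te, y_tr, y_te = holdout_split(code_samples, languages)
print(len(X_tr), "training examples,", len(X_te), "held-out examples")
```

Every sample ends up in exactly one of the two parts, and each label stays attached to its sample, which is the property the scores below rely on.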

In [9]:
rc_pipe_bayes = make_pipe(MultinomialNB())
rc_pipe_bayes.fit(rc_X_train, rc_y_train)
rc_pipe_bayes.score(rc_X_test, rc_y_test)

0.6182965299684543
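
`make_pipe` comes from the local `feature_vectors` module, whose source is not shown here. As a rough sketch only, a helper of that shape could chain a text vectorizer with whatever estimator it is given. Everything below (`make_pipe_sketch`, the token pattern, the toy snippets) is an assumption for illustration, and it leaves out the custom feature vectorizers that the real helper apparently includes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def make_pipe_sketch(estimator):
    """Hypothetical stand-in for make_pipe: token counts followed by an estimator."""
    return Pipeline([
        # Keep punctuation runs such as ';' and '{' as tokens, since they
        # carry a lot of signal for telling programming languages apart.
        ("vectorizer", CountVectorizer(token_pattern=r"\w+|[^\w\s]+",
                                       lowercase=False)),
        ("estimator", estimator),
    ])


# Made-up training snippets, not taken from the Rosetta Code scrape.
snippets = [
    "def add(a, b):\n    return a + b",
    "print('hello')",
    "function add(a, b) { return a + b; }",
    "console.log('hello');",
]
labels = ["python", "python", "javascript", "javascript"]

pipe = make_pipe_sketch(MultinomialNB())
pipe.fit(snippets, labels)
prediction = pipe.predict(["function sub(a, b) { return a - b; }"])[0]
print(prediction)
```

Wrapping the vectorizer and the estimator in one pipeline means `fit` and `score` can be called directly on raw code strings, which is how the cells in this notebook use `make_pipe`.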

In [10]:
rc_pipe_tree = make_pipe(DecisionTreeClassifier())
rc_pipe_tree.fit(rc_X_train, rc_y_train)
rc_pipe_tree.score(rc_X_test, rc_y_test)

0.76148676450418329

In [11]:
rc_pipe_forest = make_pipe(RandomForestClassifier())
rc_pipe_forest.fit(rc_X_train, rc_y_train)
rc_pipe_forest.score(rc_X_test, rc_y_test)

0.7679330681662323

In [12]:
print(classification_report(rc_pipe_forest.predict(rc_X_test), rc_y_test))

             precision    recall  f1-score   support

        ada       0.98      0.76      0.85       298
    algol68       0.83      0.82      0.82       104
 autohotkey       0.73      0.79      0.76       121
        awk       0.83      0.72      0.77        98
       bash       0.65      0.60      0.63       155
          c       0.81      0.74      0.77       285
    clojure       0.59      0.54      0.56       115
      cobol       0.89      0.79      0.84        71
coffeescript       0.75      0.67      0.71        63
        cpp       0.82      0.85      0.83       168
     csharp       0.79      0.65      0.71       165
          d       0.92      0.90      0.91       222
     delphi       0.71      0.77      0.74        88
          e       0.54      0.59      0.56        87
     erlang       0.83      0.79      0.81        98
   euphoria       0.67      0.72      0.69        46
    fortran       0.83      0.86      0.85       117
     fsharp       0.56      0.57      0.57  

The estimator classifies coffeescript, java5, and e poorly, even with the additional built-in feature vectorizers.
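
One caveat when reading these reports: scikit-learn's `classification_report` takes its arguments as `(y_true, y_pred)`, while the cells in this notebook pass the predictions first. With the arguments in that order, the precision and recall columns trade places and the support column counts predicted rather than true labels. The pure-Python sketch below illustrates the effect with made-up labels; `precision_recall` is a hypothetical helper, not part of scikit-learn.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one label, with 0.0 for empty denominators."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    n_predicted = sum(1 for p in y_pred if p == label)
    n_actual = sum(1 for t in y_true if t == label)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_actual if n_actual else 0.0
    return precision, recall


# Made-up labels: three true "e" samples, only one of them predicted as "e".
y_true = ["e", "e", "e", "java", "java", "ada"]
y_pred = ["e", "java", "java", "java", "ada", "ada"]

correct_order = precision_recall(y_true, y_pred, "e")
swapped_order = precision_recall(y_pred, y_true, "e")
print("y_true first:", correct_order)
print("y_pred first:", swapped_order)
```

The f1-score is symmetric in precision and recall, so that column is unaffected by the argument order.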

#Testing with test samples given in the test folder.

In [13]:
y_test = pd.read_pickle('test_y_values.pkl')
X_test = pd.read_pickle('test_X_values.pkl')

In [14]:
y_test = y_test.loc[:, 1]
X_test = X_test.loc[:, 0]

###Estimating with Multinomial Naive Bayes, Decision Tree, and Random Forest

I used my entire data frame (all of my code examples scraped from Rosetta Code) as my training data. 

In [15]:
pipe_mnb = make_pipe(MultinomialNB())
pipe_mnb.fit(X, y)
pipe_mnb.score(X_test, y_test)

0.5625

In [16]:
pipe_tree = make_pipe(DecisionTreeClassifier())
pipe_tree.fit(X, y)
pipe_tree.score(X_test, y_test)

0.59375

In [17]:
pipe_forest = make_pipe(RandomForestClassifier())
pipe_forest.fit(X, y)
pipe_forest.score(X_test, y_test)

0.78125

###Classification report to see which language is not well represented. 

In [18]:
print(classification_report(pipe_forest.predict(X_test), y_test))

             precision    recall  f1-score   support

        ada       0.00      0.00      0.00         1
    algol68       0.00      0.00      0.00         1
 autohotkey       0.00      0.00      0.00         1
        awk       0.00      0.00      0.00         1
    clojure       1.00      0.80      0.89         5
     fsharp       0.00      0.00      0.00         1
    haskell       0.33      1.00      0.50         1
       java       0.00      0.00      0.00         0
 javascript       0.75      1.00      0.86         3
       objc       0.00      0.00      0.00         1
      ocaml       1.00      1.00      1.00         2
        php       0.67      1.00      0.80         2
     python       1.00      1.00      1.00         4
       ruby       1.00      1.00      1.00         3
      scala       0.50      1.00      0.67         1
     scheme       1.00      1.00      1.00         3
        tcl       1.00      1.00      1.00         2

avg / total       0.73      0.78      0.74  



The precision and recall are worse than on the Rosetta Code hold-out sample because this test sample is much smaller, only 32 examples. When the estimator gets an example wrong, that single wrong answer carries much more relative weight. 
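
The effect of the small sample can be made concrete with a little arithmetic. Each accuracy score above is an exact multiple of 1/32, so a single misclassified example moves the score by about 3 percentage points. The sketch below recovers the number of correct predictions from the reported scores; `TEST_SIZE` and the dictionary keys are just illustrative names.

```python
from fractions import Fraction

TEST_SIZE = 32  # number of examples in the test file, as stated above

# Accuracy scores reported by the three pipelines on the test file.
scores = {
    "MultinomialNB": 0.5625,
    "DecisionTreeClassifier": 0.59375,
    "RandomForestClassifier": 0.78125,
}

# Convert each score back into a count of correctly classified examples.
correct_counts = {name: round(score * TEST_SIZE) for name, score in scores.items()}
one_error = Fraction(1, TEST_SIZE)

for name, correct in correct_counts.items():
    print(f"{name}: {correct}/{TEST_SIZE} correct")
print(f"One misclassification costs {float(one_error):.3%} of accuracy")
```

The same granularity applies per class: for a label with a support of 1, the ratio computed over that single example can only be 0.00 or 1.00, which is why the report contains so many extreme values.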

##Using Dataframe 2: only includes languages in the test sample. 

In [19]:
pipe_filtered_bayes = make_pipe(MultinomialNB())
pipe_filtered_bayes.fit(filtered_X_train, filtered_y_train)
pipe_filtered_bayes.score(X_test, y_test)

0.875

The score improves significantly when training only on the languages in the test sample. 

In [20]:
pipe_filtered_tree = make_pipe(DecisionTreeClassifier())
pipe_filtered_tree.fit(filtered_X_train, filtered_y_train)
pipe_filtered_tree.score(X_test, y_test)

0.84375

In [21]:
pipe_filtered_forest = make_pipe(RandomForestClassifier())
pipe_filtered_forest.fit(filtered_X_train, filtered_y_train)
pipe_filtered_forest.score(X_test, y_test)

0.84375

###Classification report to see which language is not well represented

In [22]:
print(classification_report(pipe_filtered_forest.predict(X_test), y_test))

             precision    recall  f1-score   support

    clojure       0.75      0.60      0.67         5
    haskell       0.67      0.67      0.67         3
       java       0.00      0.00      0.00         0
 javascript       1.00      0.80      0.89         5
      ocaml       1.00      1.00      1.00         2
        php       0.67      1.00      0.80         2
     python       1.00      0.80      0.89         5
       ruby       1.00      1.00      1.00         3
      scala       1.00      1.00      1.00         2
     scheme       1.00      1.00      1.00         3
        tcl       1.00      1.00      1.00         2

avg / total       0.91      0.84      0.87        32





This is a huge improvement over the results from training on many languages. However, these results are less representative of a "real world" scenario, mainly because you may want to allow more than 11 languages. Because I knew which languages to expect, I constrained my training data to better predict the test languages. 
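
As a rough illustration of that constraint, the sketch below keeps only the rows whose label is one of the 11 languages listed in the last classification report. It is not necessarily how `scraper_filtered_700x1.pkl` was produced; the rows are made up, and `TEST_LANGUAGES`, `scraped_rows`, and `filtered_rows` are hypothetical names. On a pandas DataFrame, a boolean mask built with `Series.isin` would do the same job.

```python
# The 11 labels that appear in the final classification report.
TEST_LANGUAGES = {
    "clojure", "haskell", "java", "javascript", "ocaml", "php",
    "python", "ruby", "scala", "scheme", "tcl",
}

# Made-up (language, code) rows standing in for the scraped data.
scraped_rows = [
    ("ada", "with Ada.Text_IO;"),
    ("python", "print('hello')"),
    ("cobol", "DISPLAY 'hello'."),
    ("ruby", "puts 'hello'"),
]

# Keep only the rows labelled with a language expected in the test file.
filtered_rows = [(lang, code) for lang, code in scraped_rows if lang in TEST_LANGUAGES]
filtered_labels = [lang for lang, _ in filtered_rows]
print(filtered_labels)
```

Restricting the label set this way removes every class the estimator could otherwise confuse with a test language, which is also why the resulting scores say little about performance on an open set of languages.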