# Linear Regression applied to Haskell Exercises from "Failed_Submissions" Collection


In this notebook we present the results of our first approach to training a machine learning model on a set of submissions to Haskell exercises from the Mumuki collection **failed_submissions**. The goal is to automatically classify submissions to programming exercises into four categories: non executable (dark red), executable with errors (light red), correct but quality could be improved (yellow), and good solution (green).

In section 1, we describe the datasets of Haskell exercise submissions and the filters we applied to them. In section 2 we present the model we trained and the results we obtained for different exercises. Finally, in section 3 we present different tokenizer options for the vectorizer and show how they tokenize the submission content.



## Technical Setup

The cells of this unnumbered section contain the code that must be executed to satisfy all the requirements for training the models for the different exercises.

In the code below we install the spaCy library, because we test different kinds of tokenizers for the vectorizer.



In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm

You are using pip version 10.0.0, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 699kB/s

    Linking successful
    /home/mrc/anaconda3/envs/mumuki/lib/python3.6/site-packages/en_core_web_sm
    -->
    /home/mrc/anaconda3/envs/mumuki/lib/python3.6/site-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')



In [1]:
import pandas as pd
import numpy as np
import os
import json
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelEncoder

## 1 The Failed Submissions Collection Dataset



### 1.1 The Failed Submissions Collection

We decided to work with submissions from the collection *failed_submissions*, a specific collection of submissions from Mumuki. The submissions in this collection were made by people who are not enrolled in formal education courses and who use the mumuki.io guides directly to learn by themselves. The mumuki.io platform has guides for different programming languages. Because the failed_submissions collection contains many submissions, we first filtered the submissions by programming language and chose Haskell. We then divided the Haskell submissions by exercise, so that we have a specific dataset per exercise.

### 1.2 Analyze the Dataset

Every submission has a field called **status**. Its value depends on the code that the student submitted:

1. **errored**: the code does not compile.
2. **failed**: the code does not pass all the tests defined by the teacher.
3. **passed_with_warnings**: the code passes all the tests, but the solution does not use a specific concept that the teacher asked for (expectations).
4. **passed**: the code passes all the tests and uses the concepts the teacher asked for.
5. **aborted**: server errors, or code that takes too long to execute.
6. **pending**: we still have to figure out what this status really means.
7. **manual_evaluation_pending**: we still have to figure out what this status really means.
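
The first four statuses appear to line up, in order, with the four feedback categories listed in the introduction. The sketch below writes that correspondence down explicitly. It is our reading of the text rather than a field stored in the dataset, and the names `STATUS_TO_CATEGORY` and `category_for` are hypothetical.

```python
# Hypothetical mapping from Mumuki status to the feedback categories
# listed in the introduction (assumed correspondence, not a dataset field).
STATUS_TO_CATEGORY = {
    'errored': 'non executable (dark red)',
    'failed': 'executable with errors (light red)',
    'passed_with_warnings': 'correct but quality could be improved (yellow)',
    'passed': 'good solution (green)',
}


def category_for(status):
    """Return the feedback category for a status, or None for statuses we ignore."""
    return STATUS_TO_CATEGORY.get(status)


print(category_for('failed'))
print(category_for('aborted'))
```

Statuses outside the four of interest, such as `aborted`, map to `None` here, which mirrors the decision to leave them out of the classification.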

In the code below we load the submissions per Haskell exercise to do a quantitative analysis in order to obtain status distribution by exercise.

In [2]:
dataset_home = 'datasets/haskell/'

# Load every JSON file in the dataset folder and collect the flattened submissions
frames = []
for file_name in os.listdir(dataset_home):
    with open(os.path.join(dataset_home, file_name), encoding='utf8') as json_file:
        json_data = json.load(json_file)
    # Newer pandas versions expose this function as pd.json_normalize
    frames.append(pd.io.json.json_normalize(json_data))

submissions_df = pd.concat(frames)

We analyze the distribution of submission status per exercise, shown in the table below. We use the *calculate_distribution* function to compute the distribution of the submissions by status.

In [3]:
def calculate_distribution(df, column_name, exercise):
    """
    df: Dataframe with the submissions of one exercise
    column_name: column with status
    exercise: name of the exercise

    Function to obtain the distribution of submission status.
    Returns a Series with the number of submissions per status,
    plus the exercise name and the total amount of submissions.
    """
    total_amount_submissions = df.shape[0]
    # Count the submissions in each status
    metrics = df.groupby([column_name]).size()
    metrics['exercise'] = str(exercise)
    metrics['submission_amount'] = total_amount_submissions
    return metrics


def weighted_mean(df, columns_to_mean, amount):
    """
    df: Dataframe with one row per exercise
    columns_to_mean: columns to average
    amount: column used as weight

    Appends a 'Weighted Mean' row with the weighted mean of
    columns_to_mean to the dataframe and returns it
    """
    df_mean = (df[columns_to_mean].astype(float).multiply(df[amount], axis="index")).sum()/(df[amount]).sum()
    df_mean['exercise'] = 'Weighted Mean'
    df_mean['submission_amount'] = df['submission_amount'].sum()
    df.loc[len(df)+1] = df_mean
    return df
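
To make the weighting concrete, here is a minimal sketch on made-up numbers. It computes the weighted mean of one column by hand, as `weighted_mean` does, and checks it against `np.average`. The `toy` dataframe and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy per-exercise table: 'passed' is a proportion, 'submission_amount' is the weight.
toy = pd.DataFrame({
    'exercise': ['a', 'b'],
    'submission_amount': [10, 30],
    'passed': [0.5, 0.9],
})

# Weighted mean = sum(value * weight) / sum(weight)
manual = (toy['passed'] * toy['submission_amount']).sum() / toy['submission_amount'].sum()

# np.average computes the same quantity when given weights
with_numpy = np.average(toy['passed'], weights=toy['submission_amount'])

print(manual, with_numpy)
```

Exercise `b` has three times as many submissions as `a`, so the result sits closer to 0.9 than the plain mean of 0.7 would.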

In [4]:
distribution_by_exercise = pd.DataFrame(
    columns=[
        'exercise', 'submission_amount', 'aborted','errored', 'failed',
        'passed', 'passed_with_warnings', 'pending', 'running'])

exercises = submissions_df['exercise.name'].unique()

for exercise in exercises:
    df_exercise = submissions_df.where(submissions_df['exercise.name']==exercise).dropna(axis=0, how='all')
    distribution_by_exercise = (
        distribution_by_exercise.append(
            calculate_distribution(df_exercise, 'status', exercise), ignore_index=True).fillna(value=0))

# Save a copy with the raw counts, without the statuses that we drop in section 1.3
status_by_exercises = distribution_by_exercise.drop(['aborted', 'pending','running'], axis=1, errors='ignore')

In [5]:
distribution_by_exercise.loc[:, 'aborted':] = distribution_by_exercise.loc[:, 'aborted':].div(distribution_by_exercise.iloc[:]['submission_amount'], axis=0)
distribution_by_exercise = weighted_mean(distribution_by_exercise, distribution_by_exercise.columns[2:], 'submission_amount')
distribution_by_exercise.sort_values('submission_amount')

| | exercise | submission_amount | aborted | errored | failed | passed | passed_with_warnings | pending | running |
|---|---|---|---|---|---|---|---|---|---|
| 197 | Persona con más de cuatro letras | 1 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 115 | Persona que tomó cantidad par de bebidas | 1 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 17 | Datos de una serie en base al nombre | 1 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 184 | Bebidas alcohólicas | 3 | 0.000000 | 0.333333 | 0.333333 | 0.333333 | 0.000000 | 0.000000 | 0.000000 |
| 121 | ejemploDeBusquedaOrdenada | 3 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 100 | Personas borrachas | 3 | 0.000000 | 0.000000 | 0.333333 | 0.666667 | 0.000000 | 0.000000 | 0.000000 |
| 119 | The numbers | 4 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 56 | Lo básico | 5 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 48 | Mantenimiento | 5 | 0.000000 | 0.800000 | 0.000000 | 0.200000 | 0.000000 | 0.000000 | 0.000000 |
| 94 | Sobre los posts | 5 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
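
The same kind of per-exercise status distribution can also be obtained in one step with `pd.crosstab`. This is an alternative sketch on made-up data, not the code used to produce the table above. The `toy_submissions` frame is an assumption for illustration.

```python
import pandas as pd

# Toy submissions: exercise 'a' has 3 submissions, exercise 'b' has 2
toy_submissions = pd.DataFrame({
    'exercise.name': ['a', 'a', 'a', 'b', 'b'],
    'status': ['passed', 'failed', 'passed', 'errored', 'passed'],
})

# Rows are exercises, columns are statuses, and each row sums to 1
toy_distribution = pd.crosstab(
    toy_submissions['exercise.name'],
    toy_submissions['status'],
    normalize='index',
)
print(toy_distribution)
```

Statuses that never occur for an exercise show up as 0 in its row, which matches the `fillna(value=0)` step in the notebook code.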


### 1.3 Cleaning the Datasets

We're interested specifically in four status values: **errored**, **failed**, **passed_with_warnings** and **passed**. We dropped all the submissions with 'manual_evaluation_pending', 'pending', 'aborted' or 'running' status, because they don't provide significant data about the code: their feedback is related to server or technical problems of the Mumuki platform, and they weren't statistically significant. We use the function *clean_submissions* to clean the dataset.

In the table below we can see the new status distribution of the exercises.

In [6]:
def clean_submissions(submissions_df, to_train=False):
    """
    submissions_df: Dataframe with submissions
    to_train: indicates whether the dataframe will be used for training

    Function to clean the dataset
    """
    # Drop the statuses related to server or technical problems
    ignored_statuses = ['aborted', 'pending', 'running', 'manual_evaluation_pending']
    submissions_df = submissions_df[~submissions_df['status'].isin(ignored_statuses)]

    if to_train:
        # The vectorizer cannot work with empty or null content
        submissions_df = submissions_df[submissions_df['content'] != ""]
        submissions_df = submissions_df[~submissions_df['content'].isnull()]
    return submissions_df

In [7]:
status_by_exercises['submission_amount'] = status_by_exercises[['errored', 'failed', 'passed', 'passed_with_warnings']].sum(axis=1)
status_by_exercises.loc[:, 'errored':'passed_with_warnings'] = status_by_exercises.loc[:, 'errored':'passed_with_warnings'].div(status_by_exercises.iloc[:]['submission_amount'], axis=0)
status_by_exercises = weighted_mean(status_by_exercises, status_by_exercises.columns[2:], 'submission_amount')
status_by_exercises.sort_values('submission_amount')

| | exercise | submission_amount | errored | failed | passed | passed_with_warnings |
|---|---|---|---|---|---|---|
| 17 | Datos de una serie en base al nombre | 1.0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 115 | Persona que tomó cantidad par de bebidas | 1.0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 197 | Persona con más de cuatro letras | 1.0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 184 | Bebidas alcohólicas | 3.0 | 0.333333 | 0.333333 | 0.333333 | 0.000000 |
| 121 | ejemploDeBusquedaOrdenada | 3.0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 100 | Personas borrachas | 3.0 | 0.000000 | 0.333333 | 0.666667 | 0.000000 |
| 119 | The numbers | 4.0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 94 | Sobre los posts | 5.0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 56 | Lo básico | 5.0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 48 | Mantenimiento | 5.0 | 0.800000 | 0.000000 | 0.200000 | 0.000000 |


Since the model we are going to train is Linear Regression with a vectorizer, we have to consider some specific cases. We dropped submissions with empty content because they would cause issues with the vectorizer. Submissions with empty content arise in two different cases:

1. Mumuki offers some informative exercises where students don't need to code and the content field is null. In this case, filtering the submissions with null content eliminates all the submissions of that particular exercise.
2. In any exercise a student could submit an empty solution. This is an edge case that we prefer not to consider in this first approach.
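
One way to see why empty content is unhelpful: an empty submission produces a feature row made only of zeros, so the model has nothing to learn from it. The sketch below uses two made-up submissions. The variable names are hypothetical and the Haskell snippet is only an example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One small Haskell-like submission and one empty submission
toy_contents = ["doble numero = numero * 2", ""]

toy_cv = CountVectorizer(lowercase=False)
toy_features = toy_cv.fit_transform(toy_contents).toarray()

# First row counts the tokens 'doble' and 'numero'; second row is all zeros
print(toy_features)
```

Note that the default token pattern of `CountVectorizer` keeps only tokens of two or more word characters, so `=`, `*` and `2` do not become features here.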


In [8]:
# clean submissions
submissions_df_cleaned = clean_submissions(submissions_df, to_train=True)

### 1.4 Train, Dev and Test Set

First of all, we made a technical decision: we train only on exercises that have more than 100 submissions, because we think that smaller datasets are not statistically significant.


In [9]:
# After sorting, the last row is the 'Weighted Mean' summary row, so we drop it with [:-1]
exercises_to_train = status_by_exercises[status_by_exercises['submission_amount'] > 100].sort_values('submission_amount')[:-1]
exercises_to_train

| | exercise | submission_amount | errored | failed | passed | passed_with_warnings |
|---|---|---|---|---|---|---|
| 225 | Descontrolarse | 107.0 | 0.205607 | 0.598131 | 0.196262 | 0.000000 |
| 290 | validarIguales, sobre validados | 120.0 | 0.200000 | 0.683333 | 0.083333 | 0.033333 |
| 249 | quienesPueden | 129.0 | 0.131783 | 0.488372 | 0.317829 | 0.062016 |
| 127 | Alcohol en Sangre con fold | 131.0 | 0.358779 | 0.366412 | 0.274809 | 0.000000 |
| 162 | Otro alto en el camino: Data | 138.0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 64 | positivosYNegativos | 149.0 | 0.201342 | 0.375839 | 0.395973 | 0.026846 |
| 112 | Pedir bebida | 149.0 | 0.161074 | 0.496644 | 0.342282 | 0.000000 |
| 211 | Composción | 152.0 | 0.105263 | 0.355263 | 0.526316 | 0.013158 |
| 148 | Funciones anónimas | 152.0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 244 | validarIguales | 165.0 | 0.363636 | 0.557576 | 0.072727 | 0.006061 |



In this particular case we decided to split each exercise dataset only into train and test sets. We don't use a dev set because the number of submissions per exercise is small.

The code below generates the train and test datasets for one exercise.


In [10]:
exercise_df = submissions_df_cleaned.where(submissions_df_cleaned['exercise.name']=='intersectar').dropna(axis=0, how='all')
X = exercise_df['content']
Y = exercise_df['status']

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.40, random_state=32)



We try to maintain the same status distribution in the train and test sets, as is good practice when training a machine learning model. Both sets should have at least one example of each status, so that the model is trained and tested with examples of every status.
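
The split shown earlier relies on random shuffling, which only keeps the proportions approximately equal. One possible way to enforce them is the `stratify` argument of `train_test_split`. The sketch below uses made-up data and is not the configuration used for the results in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset: 20 submissions with a 50% / 25% / 25% status distribution
toy_X = pd.Series(['code {}'.format(i) for i in range(20)])
toy_Y = pd.Series(['passed'] * 10 + ['failed'] * 5 + ['errored'] * 5)

# stratify=toy_Y keeps the class proportions the same in both splits
toy_X_train, toy_X_test, toy_Y_train, toy_Y_test = train_test_split(
    toy_X, toy_Y, test_size=0.40, random_state=32, stratify=toy_Y)

print(toy_Y_train.value_counts(normalize=True))
print(toy_Y_test.value_counts(normalize=True))
```

Stratification requires at least two examples of every status, so it may fail on exercises where a status appears only once.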

The code below shows the train and test distribution of one particular exercise as an example.

In [11]:
print("Train Distribution")
print(Y_train.value_counts()/Y_train.shape[0])

print("\n Test Distribution")
print(Y_test.value_counts()/Y_test.shape[0])

Train Distribution
failed                  0.381282
passed                  0.362218
errored                 0.225303
Name: status, dtype: float64

 Test Distribution
failed                  0.400000
passed                  0.314286
errored                 0.270130
Name: status, dtype: float64


## 2 Choose Model and Training

We chose Linear Regression as our first model because it is the easiest model to train. We use it as our baseline and will try to obtain the best performance at classifying submissions by tuning parameters and selecting a tokenizer. If we do not obtain good performance, we will change the ML model.

As we describe in the **Cleaning the Datasets** section, we use *CountVectorizer* to convert the submission content into features and *LabelEncoder* to convert a status into a number. In this first approach the **status** is the target for our machine learning model.
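
As a small illustration of these two transformations, the sketch below vectorizes two made-up submissions and encodes their statuses. The data and the `toy_` names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

toy_contents = ["doble numero = numero * 2", "doble = map"]
toy_statuses = ["passed", "errored"]

# Submission content -> matrix of token counts
toy_cv = CountVectorizer(lowercase=False)
toy_X = toy_cv.fit_transform(toy_contents)

# Status -> integer code
toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_statuses)

print(sorted(toy_cv.vocabulary_))  # learned tokens
print(toy_X.toarray())             # one row of token counts per submission
print(toy_le.classes_, toy_y)      # classes in sorted order, then their integer codes
```

`LabelEncoder` assigns codes following the sorted order of the class names, so `errored` gets 0 and `passed` gets 1 in this example.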


### 2.1 One vs Rest Algorithm
To train a multiclass classifier we need to use the [one-vs-rest algorithm](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest). In a few words, we create a binary classifier (Linear Regression) for each status type, score the example with all the classifiers, and take the status whose classifier gives the highest score as the predicted label.
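
The idea can be written out by hand in a few lines. The sketch below fits one `LinearRegression` per class on a 0/1 target and predicts the class with the highest score. It uses made-up, well-separated data and is meant to illustrate the algorithm, not to reproduce the internals of `OneVsRestClassifier`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy features: three well-separated groups, one per class
toy_X = np.array([[1, 0, 0], [2, 0, 0], [0, 1, 0], [0, 2, 0], [0, 0, 1], [0, 0, 2]])
toy_y = np.array([0, 0, 1, 1, 2, 2])
classes = np.unique(toy_y)

# One regressor per class, fitted on a 0/1 "is this class?" target
regressors = [LinearRegression().fit(toy_X, (toy_y == c).astype(float)) for c in classes]

# Score every example with every regressor and keep the class with the highest score
scores = np.column_stack([r.predict(toy_X) for r in regressors])
toy_predicted = classes[scores.argmax(axis=1)]
print(toy_predicted)
```

Because a linear regressor outputs an unbounded real number rather than a probability, only the ranking of the scores matters for the prediction.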

In [12]:
#Featurize submissions content
cv = CountVectorizer(lowercase=False)
X_transformed = cv.fit_transform(X_train)

#Featurize status submissions
le = LabelEncoder()
Y_transformed = le.fit_transform(Y_train)

#Train linear regression for multiclass classification
lr = OneVsRestClassifier(LinearRegression())
lr.fit(X_transformed, Y_transformed)

#classify Test set
predicted = lr.predict(cv.transform(X_test))

#obtain metrics avg/total
#shape of metrics
#precision    recall  f1-score   amount_tested
metrics_by_exercise = classification_report(le.transform(Y_test), predicted, target_names=le.classes_, digits=2)

print(metrics_by_exercise)

                      precision    recall  f1-score   support

             errored       0.88      0.14      0.25       104
              failed       0.62      0.69      0.65       154
              passed       0.55      0.88      0.67       121

         avg / total       0.66      0.59      0.54       385



### One vs Rest applied to all exercises

In [13]:
linear_performance = pd.DataFrame(columns=['exercise', 'precision', 'recall', 'f1-score', 'amount_tested', 'submission_amount'])
i = 0
exercises = exercises_to_train['exercise'].unique()
not_trained = []
for exercise in exercises:
    #Split dataset in train and test
    try:
        exercise_df = submissions_df_cleaned.where(submissions_df_cleaned['exercise.name']==exercise).dropna(axis=0, how='all')
        X = exercise_df['content']
        Y = exercise_df['status']

        X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.40, random_state=32)

        #Featurize submissions content
        cv = CountVectorizer(lowercase=False)
        X_transformed = cv.fit_transform(X_train)

        #Featurize status submissions
        le = LabelEncoder()
        Y_transformed = le.fit_transform(Y_train)

        #Train linear regression for multiclass classification
        lr = OneVsRestClassifier(LinearRegression())
        lr.fit(X_transformed, Y_transformed)

        #classify Test set
        predicted = lr.predict(cv.transform(X_test))

        #obtain metrics avg/total
        #shape of metrics
        #precision    recall  f1-score   amount_tested
        metrics_by_exercise = classification_report(
            le.transform(Y_test), predicted, target_names=le.classes_, digits=2).split()[-4:]
        metrics_by_exercise = [str(exercise)] + metrics_by_exercise + [int(X_train.shape[0] + X_test.shape[0])]
        linear_performance.loc[len(linear_performance)+1]= metrics_by_exercise
        i += 1
    except Exception:
        # Keep track of the exercises whose classifier could not be trained
        not_trained.append(exercise)
    
print("Trained {} classifier of {}".format(i, len(exercises)))


Trained 220 classifier of 246
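
The loop above reads the averaged metrics by splitting the text of `classification_report`. As an alternative sketch, assuming the "avg / total" row is the support-weighted average, the same kind of numbers can be requested directly from `precision_recall_fscore_support`. The toy labels below are made up for illustration.

```python
from sklearn.metrics import precision_recall_fscore_support

toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

# average='weighted' weights the metrics of each class by its number of true examples
precision, recall, f1, _ = precision_recall_fscore_support(
    toy_true, toy_pred, average='weighted')

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

This returns floats rather than strings, so no parsing or later conversion with `astype(float)` is needed.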


### 2.2 Metrics Results
Below we show the precision, recall and f1-score of each classifier.

In [14]:
linear_performance

| | exercise | precision | recall | f1-score | amount_tested | submission_amount |
|---|---|---|---|---|---|---|
| 1 | Descontrolarse | 0.50 | 0.49 | 0.49 | 43 | 107 |
| 2 | validarIguales, sobre validados | 0.58 | 0.65 | 0.60 | 48 | 119 |
| 3 | quienesPueden | 0.63 | 0.58 | 0.59 | 52 | 129 |
| 4 | Alcohol en Sangre con fold | 0.72 | 0.71 | 0.71 | 52 | 130 |
| 5 | positivosYNegativos | 0.61 | 0.52 | 0.52 | 60 | 148 |
| 6 | Pedir bebida | 0.53 | 0.57 | 0.55 | 60 | 149 |
| 7 | Composción | 0.63 | 0.52 | 0.52 | 61 | 151 |
| 8 | validarIguales | 0.71 | 0.70 | 0.70 | 66 | 163 |
| 9 | Haciendo functores | 0.54 | 0.58 | 0.56 | 66 | 165 |
| 10 | estadisticas | 0.51 | 0.54 | 0.52 | 72 | 178 |


In [22]:
weighted_mean(linear_performance, ['precision', 'recall', 'f1-score'],'submission_amount')

| | exercise | precision | recall | f1-score | amount_tested | submission_amount |
|---|---|---|---|---|---|---|
| 1 | Descontrolarse | 0.50 | 0.49 | 0.49 | 43 | 107 |
| 2 | validarIguales, sobre validados | 0.58 | 0.65 | 0.60 | 48 | 119 |
| 3 | quienesPueden | 0.63 | 0.58 | 0.59 | 52 | 129 |
| 4 | Alcohol en Sangre con fold | 0.72 | 0.71 | 0.71 | 52 | 130 |
| 5 | positivosYNegativos | 0.61 | 0.52 | 0.52 | 60 | 148 |
| 6 | Pedir bebida | 0.53 | 0.57 | 0.55 | 60 | 149 |
| 7 | Composción | 0.63 | 0.52 | 0.52 | 61 | 151 |
| 8 | validarIguales | 0.71 | 0.70 | 0.70 | 66 | 163 |
| 9 | Haciendo functores | 0.54 | 0.58 | 0.56 | 66 | 165 |
| 10 | estadisticas | 0.51 | 0.54 | 0.52 | 72 | 178 |


In [15]:
distribution_by_exercise.sort_values('submission_amount').to_csv('distribution_by_exercise.csv')

In [16]:
linear_performance.to_csv('linear_regression_performance.csv')