# Haskell Failed Submissions Analysis

In this notebook we analyse the Haskell submissions made by non-registered users on Mumuki.io. The main objective is to train an ML model that can predict feedback for submissions.

# Set Up
First we load the libraries needed by the code below. **json** parses the JSON files in which the dataset is stored. **pandas** is a Python library for handling tabular data in memory; it stores the data in structures called DataFrames. **os** provides portable access to the file system, so the notebook works on Ubuntu, Windows, etc. **scikit-learn** is the main Python library for machine learning. **sklearn_pandas** provides a bridge between scikit-learn's machine learning methods and pandas-style DataFrames.

In [1]:
import pandas as pd
import json
import os
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn_pandas import DataFrameMapper

# Load Database
Load only the Haskell submissions, divided by exercise. Loading the dataset this way will let us take a subset of exercises and ensure that both the training and test sets contain samples of every exercise (to be implemented). For now we use the train_test_split function provided by scikit-learn.


In [2]:
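dataset_home = "datasets/haskell"
files = os.listdir(dataset_home)
frames = []
for file in files:
    # Close each file as soon as it has been parsed.
    with open(os.path.join(dataset_home, file), encoding='utf8') as json_file:
        json_data = json.load(json_file)
    # Flatten nested JSON fields into dotted column names such as 'guide.slug'.
    frames.append(pd.json_normalize(json_data))
# Concatenate once; DataFrame.append was removed in pandas 2.0.
submissions = pd.concat(frames, ignore_index=True)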
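
As a small illustration of what the normalization step does, the sketch below flattens one made-up record. The field values are hypothetical; only the key names mirror the columns used later in this notebook.

```python
import pandas as pd

# Hypothetical record: values are invented, key names mirror the columns used below.
records = [
    {
        'content': 'intersectar xs ys = map id xs',
        'status': 'failed',
        'guide': {'slug': 'some-org/some-guide', 'language': {'name': 'haskell'}},
    }
]

# Nested dictionaries become dotted column names.
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```

This is why the later cells can select columns such as `guide.slug` and `guide.language.name` directly.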

# Clean Dataset
Remove submissions with blank content (the content is the solution written by the user).
Remove submissions with status aborted or pending, because those submissions were not evaluated due to network problems.

In [7]:
submissions = submissions[submissions['content'] != ""]
submissions = submissions[~submissions['content'].isnull()]
submissions = submissions[~(submissions['status'] == 'aborted')]
submissions = submissions[~(submissions['status'] == 'pending')]
# Build one label per submission from the field that matches its status.
def label_for(row):
    if row.status == 'failed':
        return str(row.test_results)
    if row.status == 'passed_with_warnings':
        return str(row.expectation_results)
    if row.status == 'passed':
        return 'passed'
    return row.result

submissions['result'] = submissions.apply(label_for, axis=1)
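
The labelling rule in the cell above could also be written with column-wise operations instead of a row-wise `apply`. The following is only a sketch on a toy frame: the rows and the `errored` status value are made up, and `Series.mask` is one possible choice, not necessarily the approach the notebook will adopt.

```python
import pandas as pd

# Toy frame with the columns the cleaning cell reads; all values are made up.
toy = pd.DataFrame({
    'status': ['passed', 'failed', 'passed_with_warnings', 'errored'],
    'result': ['', '', '', 'compile error'],
    'test_results': [None, [{'title': 't1'}], None, None],
    'expectation_results': [None, None, [{'binding': 'f'}], None],
})

# Start from the existing result and overwrite it wherever a status rule applies.
label = toy['result'].copy()
label = label.mask(toy['status'] == 'passed', 'passed')
label = label.mask(toy['status'] == 'failed', toy['test_results'].map(str))
label = label.mask(toy['status'] == 'passed_with_warnings',
                   toy['expectation_results'].map(str))
toy['result'] = label
print(toy['result'].tolist())
```

Rows whose status matches none of the three rules keep their original `result`, which matches the fallback branch of the row-wise version.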

# Narrow dataset
Keep only the columns that provide useful information.

In [18]:
X = submissions[['guide.slug', 'guide.language.name', 'content']]
Y = submissions['result']

In [19]:
X

Unnamed: 0,guide.slug,guide.language.name,content
0,pdep-utn/mumuki-guia-funcional-practica-expres...,haskell,intersectar l1 l2 = filter (\x -> elem x l2) l1
1,pdep-utn/mumuki-guia-funcional-practica-listas...,haskell,intersect xs ys = map id xs
2,pdep-utn/mumuki-guia-funcional-practica-listas...,haskell,intersectar xs ys = map id xs
3,pdep-utn/mumuki-guia-funcional-practica-listas...,haskell,"intersectar xs ys = [ x | x<-xs, elem x ys]"
4,pdep-utn/mumuki-guia-funcional-practica-expres...,haskell,intersectar lista1 lista2 = filter (\x -> elem...
5,pdep-utn/mumuki-guia-funcional-practica-expres...,haskell,"intersectar (x:xs) (y:ys) = filter (\x,y -> x ..."
6,pdep-utn/mumuki-guia-funcional-practica-expres...,haskell,intersectar l1 l2 = filter (\x y -> x == y)
7,pdep-utn/mumuki-guia-funcional-practica-expres...,haskell,intersectar l1 l2 = \r\n filter (\x y -> he...
8,pdep-utn/mumuki-guia-funcional-practica-expres...,haskell,intersectar l1 l2 = \r\n filter (\x y -> x ...
9,pdep-utn/mumuki-guia-funcional-practica-expres...,haskell,intersectar l1 l2 = \r\n filter (\x -> x ==...


In [10]:
Y

0                                                    passed
1         \n/tmp/mumuki.compile20160726-25582-6b7u07.hs:...
2         [{'title': 'intersectar intersectar [1, 2, 3, ...
3         [{'binding': 'intersect', 'inspection': 'HasCo...
4                                                    passed
5         \n/tmp/mumuki.compile20160728-3796-1atis45.hs:...
6         [{'title': 'intersectar intersectar [1, 2, 3, ...
7         [{'title': 'intersectar intersectar [1, 2, 3, ...
8         [{'title': 'intersectar intersectar [1, 2, 3, ...
9         [{'title': 'intersectar intersectar [1, 2, 3, ...
10        [{'title': 'intersectar intersectar [1, 2, 3, ...
11                                                   passed
12        [{'title': 'intersectar intersectar [1, 2, 3, ...
13        [{'title': 'intersectar intersectar [1, 2, 3, ...
14        [{'title': 'intersectar intersectar [1, 2, 3, ...
15        [{'title': 'intersectar intersectar [1, 2, 3, ...
16        [{'title': 'intersectar inters

# Split Dataset
Split the dataset into training and test sets in a proportion of roughly 67% / 33% (test_size=0.33).

In [21]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.33, random_state=42)
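
The loading section mentions, as future work, making sure every exercise appears in both sets. One possible way to get there is the `stratify` argument of train_test_split. The sketch below uses a toy frame, and the `exercise` column is a hypothetical name: the columns shown in this notebook do not include an exercise identifier, so the real column would have to be chosen from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; 'exercise' is a hypothetical column identifying the exercise.
toy = pd.DataFrame({
    'exercise': ['ex1'] * 6 + ['ex2'] * 6 + ['ex3'] * 6,
    'content': [f'solution {i}' for i in range(18)],
    'result': ['passed', 'failed'] * 9,
})

X_toy = toy[['exercise', 'content']]
Y_toy = toy['result']

# stratify keeps the per-exercise proportions equal in both splits.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.33, random_state=42, stratify=X_toy['exercise']
)

print(sorted(X_tr['exercise'].unique()), sorted(X_te['exercise'].unique()))
```

Note that stratification requires at least two samples per exercise, so exercises with a single submission would need to be dropped or grouped first.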

# Featurization
DataFrameMapper is used for mapping pandas data frame columns to different sklearn transformations.

In [80]:
mapper = DataFrameMapper([('guide.slug', LabelEncoder()),
                         ('guide.language.name', LabelEncoder()),
                         ('content', CountVectorizer()),
                         ])

* X_train shape = (313147, 3)
* X_test shape = (154238, 3)
* Y_train shape = (313147,)
* Y_test shape = (154238,)

The CountVectorizer output for content does not fit in memory :( We need to think of a better way to featurize the submission content.
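
One likely contributor to the memory problem is that DataFrameMapper returns a dense array unless it is created with `sparse=True`, so the bag-of-words matrix is fully materialized. A possible alternative, sketched below on toy data, is to keep everything sparse using scikit-learn's own ColumnTransformer with a HashingVectorizer, which also avoids storing a vocabulary. This swaps in different components from the ones used above; the toy rows, `n_features=2**18`, and the one-hot encoding of `guide.slug` are assumptions for illustration, not tuned choices.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for X; the slugs and rows here are made up.
X_toy = pd.DataFrame({
    'guide.slug': ['guide-a', 'guide-a', 'guide-b'],
    'content': [
        'intersectar xs ys = map id xs',
        'intersectar l1 l2 = filter (\\x -> elem x l2) l1',
        'f x = x + 1',
    ],
})

featurizer = ColumnTransformer(
    [
        # A list selector passes a 2-D frame, as OneHotEncoder expects.
        ('slug', OneHotEncoder(handle_unknown='ignore'), ['guide.slug']),
        # A scalar selector passes a 1-D column, as text vectorizers expect.
        ('content', HashingVectorizer(n_features=2**18, alternate_sign=False), 'content'),
    ],
    sparse_threshold=1.0,  # keep the stacked output sparse
)

features = featurizer.fit_transform(X_toy)
print(sparse.issparse(features), features.shape)
```

The default token pattern of the text vectorizers drops single-character tokens and operators, which matter in Haskell code, so a custom `token_pattern` or `analyzer` would probably be worth exploring before training on the result.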