In [1]:
import subprocess
import sys

try:
    import numpy as np
    import pandas as pd
except ImportError:
    # pip.main() was removed in pip 10; call pip through the current interpreter instead
    packages = ["numpy", "matplotlib", "ipython", "jupyter", "pandas"]
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "pip"])
    subprocess.check_call([sys.executable, "-m", "pip", "install"] + packages)
    import numpy as np
    import pandas as pd
    
%load_ext autoreload
%autoreload 2

## Importing the data
We use the pandas library for this. The data is split into 3 files:
- sample.csv :: A small sample submission we can look at to see what the bigger and slower-to-load data files look like
- test.csv :: Test data (features only) for which we predict 'y'; these predictions are what our model is 'graded' on
- train.csv :: Data we use to train our model (to find the 'optimal' parameters)

In [2]:
data_sample = pd.read_csv('sample.csv')
data_test = pd.read_csv('test.csv')
data_train = pd.read_csv('train.csv')

cases = 1
if False:
    print("Sample")
    print(data_sample.head(cases))
    print("Test")
    print(data_test.head(cases))
    print("Train")
    print(data_train.head(cases))

The data contains 'headers' (things like 'Id', 'y', 'x1', 'x2', etc.). For pure data processing we need to get rid of these (because they are not numbers) and retrieve only the numerical values within the matrices. CV stands for cross-validation. Printing the shapes is just to make sure the import was successful.

In [3]:
X = data_train.values
y_sample = data_sample.values
X_finaltest = data_test.values[:,1:]

if True:
    print("X_train")
    print(X.shape)
    print("y_sample")
    print(y_sample.shape)
    print("X_finaltest")
    print(X_finaltest.shape)

    print("And has types")

X_train
(10000, 12)
y_sample
(2000, 2)
X_finaltest
(2000, 10)
And has types
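To make the column layout explicit, here is a minimal sketch on a made-up DataFrame. It assumes the same ordering as the training file above ('Id' first, then 'y', then the features); `toy_train` and the other `toy_*` names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv: an Id column, the target y, then the features.
toy_train = pd.DataFrame({
    "Id": [0, 1, 2],
    "y": [1.5, -0.5, 2.0],
    "x1": [0.1, 0.2, 0.3],
    "x2": [1.0, 2.0, 3.0],
})

toy_values = toy_train.values      # plain 2-D NumPy array, headers dropped
toy_ids = toy_values[:, 0]         # column 0: Id
toy_y = toy_values[:, 1]           # column 1: target
toy_features = toy_values[:, 2:]   # remaining columns: x1, x2, ...

print(toy_values.shape, toy_features.shape)
```

The same slicing (`[:, 1]` for the target, `[:, 2:]` for the features) is what the cross-validation cell further down applies to `X`.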


## Function Definitions and Applications
Here we define the functions we will use to estimate the weights (normalEq, or maybe even stochastic gradient descent), and also the given error function (rms). We try out different algorithms and test them through cross-validation. See the external file (helper) for the algorithm implementations.

### Normal Equations

In [7]:
from helper import *

k = 5 #number of folds for cross validation
fold_size = int(X.shape[0] / k)
total_error = 0.0

#Apply cross validation
for i in range(k):
    X_train, X_cv, y_train, y_cv = train_test_split(X[:,2:], X[:,1], fold_size)
    
    #Add bias to X_train and X_cv
    X_train = np.column_stack((X_train, np.ones(X_train.shape[0])))
    X_cv = np.column_stack((X_cv, np.ones(X_cv.shape[0])))
    
    #Pass-Forward
    weights = normalEq(X_train, y_train)
    
    #Get Loss
    predictions = np.dot(X_cv, weights)
    loss = rms(predictions, y_cv)
    total_error += loss

total_error /= k
print("Absolute Error is:")
print(total_error)
    
    

Absolute Error is:
3.67773838308e-13
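Since the helper file is not shown here, the following is only a rough sketch of what k-fold cross-validation with a least-squares fit could look like. The names `normal_eq_sketch`, `rms_sketch` and `kfold_rms` are hypothetical stand-ins, not the actual helper functions, and the sketch uses `np.linalg.lstsq` (which solves the same least-squares problem) rather than inverting the normal-equations matrix explicitly.

```python
import numpy as np

def normal_eq_sketch(X, y):
    """Least-squares weights; lstsq solves the same problem as the normal equations."""
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

def rms_sketch(predictions, targets):
    """Root-mean-square error between predictions and targets."""
    return np.sqrt(np.mean((predictions - targets) ** 2))

def kfold_rms(features, targets, k=5):
    """Average RMS error over k folds, holding out a different fold each round."""
    n = features.shape[0]
    X_b = np.column_stack((features, np.ones(n)))  # append the bias column
    folds = np.array_split(np.arange(n), k)        # k disjoint index blocks
    errors = []
    for i in range(k):
        cv_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = normal_eq_sketch(X_b[train_idx], targets[train_idx])
        errors.append(rms_sketch(X_b[cv_idx] @ w, targets[cv_idx]))
    return float(np.mean(errors))

# Made-up, noiseless linear data purely for illustration.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 3))
toy_y = toy_X @ np.array([1.0, -2.0, 0.5]) + 3.0
print(kfold_rms(toy_X, toy_y))
```

In this sketch the loop index `i` selects the held-out fold, so every record is used for validation exactly once.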


### Gradient Descent

Before we can apply gradient descent, we must check that our derivative of the loss function is correct.
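One common way to do such a check is to compare the analytic gradient against a central finite-difference approximation. The sketch below assumes a halved mean-squared-error loss for a linear model; that loss, and the names `mse_loss`, `mse_grad` and `numerical_grad`, are assumptions for illustration and are not taken from the helper file.

```python
import numpy as np

def mse_loss(w, X, y):
    """Halved mean squared error of the linear model X @ w (assumed loss)."""
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2)

def mse_grad(w, X, y):
    """Analytic gradient of mse_loss with respect to w."""
    return X.T @ (X @ w - y) / X.shape[0]

def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Random toy problem purely for the check.
rng = np.random.default_rng(1)
X_chk = rng.normal(size=(20, 4))
y_chk = rng.normal(size=20)
w_chk = rng.normal(size=4)

analytic = mse_grad(w_chk, X_chk, y_chk)
numeric = numerical_grad(lambda w: mse_loss(w, X_chk, y_chk), w_chk)
print(np.max(np.abs(analytic - numeric)))  # should be close to zero
```

If the two gradients disagree by more than rounding error, the analytic derivative (or the loss it was derived from) most likely contains a mistake.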

In [8]:
#from helper import *
#
##keep parameters as above
#
#for i in range(k):
#    X_train, X_cv, y_train, y_cv = train_test_split(X[:,2:], X[:,1], fold_size)
#    
#    #Pass-Forward
#    weights = normalEq(X_train, y_train)
#    
#    #Get Loss
#    predictions = np.dot(X_cv, weights)
#    loss = rms(predictions, y_cv)
#    total_error += loss
#    
#    
#

## Bring the function into the submission format: Post-Processing
We have trained the weights on the training data (X_train). Remember that the sample submission holds one 'Id' and one 'y' per row (set the flag in the next cell to True to print it). That means we now need to predict 'y' for the test data.

In [9]:
if False:
    print("Sample")
    print(data_sample.head(cases))
    print(data_sample.tail(cases))
    print("Test")
    print(data_test.head(cases))
    print(data_test.tail(cases))
    print(X_finaltest.shape)
    print(weights.shape)
    print(y_pred_test.shape)
    print(y_pred_test)

First of all, calculate the predictions. Don't forget to stack a bias column. The submission format includes the IDs, which are taken from the test data. Each individual record has a predicted 'y' value as well.
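As a small illustration of the bias step, the sketch below stacks a column of ones onto a toy feature matrix and checks that the number of weights matches the number of columns (features plus bias) before predicting. `predict_with_bias` and the toy arrays are hypothetical and only meant to show the shape requirement.

```python
import numpy as np

def predict_with_bias(features, weights):
    """Append a bias column of ones and return the linear predictions."""
    X_b = np.column_stack((features, np.ones(features.shape[0])))
    if X_b.shape[1] != weights.shape[0]:
        raise ValueError(
            f"expected {X_b.shape[1]} weights (features + bias), got {weights.shape[0]}"
        )
    return X_b @ weights

toy_features = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_weights = np.array([0.5, -1.0, 2.0])  # last entry is the bias weight
print(predict_with_bias(toy_features, toy_weights))
```

The weights must therefore come from a fit that also used a bias column; otherwise the dot product fails with a shape mismatch.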

In [10]:
y_pred_test = np.dot(np.column_stack((X_finaltest, np.ones(X_finaltest.shape[0]))), weights)
sub_data = np.column_stack((data_test.values[:,0], y_pred_test))
print(sub_data.shape)
print(sub_data)

ValueError: shapes (2000,11) and (10,) not aligned: 11 (dim 1) != 10 (dim 0)

This looks alright... Let's wrap it in a pandas DataFrame (that's what the data structures carrying the 'Id' and 'y' headers are called).

In [78]:
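# Use the fully qualified option names; short forms can be ambiguous in newer pandas versions
pd.set_option('display.max_info_rows', 11)
pd.set_option('display.precision', 20)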
submission = pd.DataFrame(sub_data, columns = ["Id", "y"])
submission.Id = submission.Id.astype(int)
print(submission)

         Id                         y
0     10000  -66.00242349023130827845
1     10001  451.40650440115518904349
2     10002 -461.67641706029962733737
3     10003   40.50120875372320483621
4     10004 -126.74472245403632086891
5     10005 -342.53455181925158967715
6     10006 -396.55554211359054761488
7     10007  335.54127907908764427702
8     10008  -99.51242087062264829456
9     10009  304.81253980627650435054
10    10010   68.89453048978556637394
11    10011  412.36919545590848201755
12    10012   54.94237102476856193789
13    10013  -17.00555075478039768200
14    10014 -597.84995757226056412037
15    10015  443.50228878641490837254
16    10016  144.77448697804288713087
17    10017  -57.41116367253620467181
18    10018  134.38782757746042761937
19    10019 -108.98377598910846586477
20    10020  153.82482842506630049684
21    10021  611.73598360220182712510
22    10022  588.38965033842225693661
23    10023 -131.01679995802052758336
24    10024  -61.13562114123003965460
25    10025 

*** I'm not sure whether the predictions have to be rounded to one decimal place (e.g. the last record as 417.3 instead of 417.269155). ***

Now let's export the submission file as a CSV.

In [14]:
# index=False keeps only the 'Id' and 'y' columns, matching the two-column sample file
submission.to_csv('final_submission.csv', index=False)
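If one decimal place does turn out to be required, `to_csv` can round the floats on export through its `float_format` parameter. The sketch below writes a made-up two-row frame to an in-memory buffer instead of a file; `toy_submission` is hypothetical, and dropping the index is an assumption based on the two-column shape of the sample file.

```python
import io
import pandas as pd

# Made-up rows in the same 'Id' / 'y' layout as the submission.
toy_submission = pd.DataFrame({"Id": [10000, 10001], "y": [-66.0024234, 451.4065044]})

buffer = io.StringIO()
# float_format only affects float columns, so the integer Ids stay untouched.
toy_submission.to_csv(buffer, index=False, float_format="%.1f")
print(buffer.getvalue())
```

Leaving `float_format` out writes the values at full precision, which is the safer default when the required precision is unknown.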