In [1]:
import numpy as np
import pandas as pd

## Importing the data
We use the pandas library for this. The data is split into 3 files:
- sample.csv :: A sample submission file that shows the expected output format (columns 'Id' and 'y')
- test.csv :: Test data without 'y' values; our predictions for it are what the model is 'graded' on
- train.csv :: Data we use to train our model (to find the 'optimal' parameters)

In [2]:
data_sample = pd.read_csv('sample.csv')
data_test = pd.read_csv('test.csv')
data_train = pd.read_csv('train.csv')

To see what the data looks like, let us print the first 3 rows of each data set:

In [3]:
cases = 3  # number of rows to show per data set
print("Sample")
print(data_sample.head(cases))
print("Test")
print(data_test.head(cases))
print("Train")
print(data_train.head(cases))

Sample
      Id       y
0  10000  2000.0
1  10001  2001.0
2  10002  2002.0
Test
      Id           x1           x2          x3          x4          x5  \
0  10000  -483.797492  1288.057065 -129.878712 -198.078388 -334.487592   
1  10001  -316.407305    30.830556 -313.356726 -173.259184 -327.368719   
2  10002 -2448.558997  -561.988408  355.098820  634.378170 -392.450091   

           x6           x7          x8           x9          x10  
0 -391.443186  -612.406176 -676.523964  1327.229655  -448.695446  
1  944.368248  1122.017380  112.338731  1372.340221  2062.561842  
2 -813.156399  -232.873263  246.801210  -562.413197  -841.602015  
Train
   Id           y           x1           x2          x3           x4  \
0   0  738.023171  1764.052346   400.157208  978.737984  2240.893199   
1   1  400.646015   144.043571  1454.273507  761.037725   121.675016   
2   2  189.900156 -2552.989816   653.618595  864.436199  -742.165020   

            x5           x6           x7          x8        

The data contains 'headers' (things like 'Id', 'y', 'x1', 'x2', etc.). For pure data processing we need to get rid of these (because they are not numbers) and retrieve only the numerical values as matrices. We also hold out the last fifth of the training rows for cross-validation (CV).

In [4]:
train_size = (10000 // 5) * 4  # first four fifths of the hard-coded 10000 rows: training set
cv_size = (10000 // 5) * 1  # last fifth: cross-validation set (not used below)

y_sample = data_sample.values
X_test = data_test.values[:,1:]  # drop the Id column
X_train = data_train.values[:train_size,2:]  # drop the Id and y columns
y_train = data_train.values[:train_size,1]  # the y column
X_cv = data_train.values[train_size:,2:]
y_cv = data_train.values[train_size:,1]
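
The slicing above can also be wrapped in a small helper. This is only a minimal sketch in Python 3 syntax, not part of the original notebook: the function name `split_train_cv` and its `train_fraction` parameter are assumptions, and the toy frame merely mimics the `Id, y, x1, ...` column layout of train.csv with made-up values.

```python
import numpy as np
import pandas as pd


def split_train_cv(frame, train_fraction=0.8):
    """Split a frame laid out as [Id, y, x1, x2, ...] into train and CV arrays.

    Hypothetical helper: the first `train_fraction` of the rows becomes the
    training set, the remaining rows become the cross-validation set.
    """
    values = frame.values                  # plain numbers, without column labels
    n_train = int(len(values) * train_fraction)
    X, y = values[:, 2:], values[:, 1]     # column 0 is Id, column 1 is y
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]


# Toy data with the same column layout as train.csv (values are made up).
toy = pd.DataFrame({
    "Id": range(10),
    "y": np.arange(10) * 1.5,
    "x1": np.arange(10) * 2.0,
    "x2": np.arange(10) * -1.0,
})
X_tr, y_tr, X_val, y_val = split_train_cv(toy)
print(X_tr.shape, y_tr.shape, X_val.shape, y_val.shape)
```

Deriving the split point from the number of rows avoids hard-coding the size of the training file.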

Just to make sure the import was successful:

In [5]:
print("X_train")
print(X_train)
print("y_train")
print(y_train)

X_train
[[ 1764.05234597   400.15720837   978.73798411 ...,  -151.3572083
   -103.21885179   410.59850194]
 [  144.04357116  1454.27350696   761.03772515 ...,  -205.15826377
    313.06770165  -854.0957393 ]
 [-2552.98981583   653.61859544   864.43619886 ...,  -187.18385003
   1532.77921436  1469.3587699 ]
 ..., 
 [   27.45669686   -19.03954586   297.35599943 ...,   818.21746148
   1297.81973773   616.74435846]
 [  279.8876014  -1109.80653015  -428.07813862 ...,   855.35708635
   -535.80122196  -142.07524446]
 [  338.39140718 -1360.41534307  1059.45392275 ..., -1048.73452712
    374.75842004  -987.31143098]]
y_train
[ 738.02317073  400.64601516  189.9001559  ...,  315.45915294 -104.07126633
   97.97483287]


## Function Definitions and Applications
Here we define the functions we will use to estimate the weights (normalEq, or maybe even stochastic gradient descent), as well as the given error function (rms).

In [6]:
def rms(pred, y):
    out = (1./pred.shape[0]) * np.sum(np.square(pred - y), axis=0)
    return out**0.5
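
As a quick sanity check of `rms`, here is a small worked example with made-up numbers (the arrays are not from the data set). Only one of three predictions is wrong, by 2, so the expected value is sqrt(2² / 3).

```python
import numpy as np


def rms(pred, y):
    # Same formula as above: square root of the mean squared difference.
    out = (1. / pred.shape[0]) * np.sum(np.square(pred - y), axis=0)
    return out ** 0.5


pred = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 5.0])

error = rms(pred, y)
print(error)

# The same quantity written with np.mean, for comparison.
print(np.isclose(error, np.sqrt(np.mean((pred - y) ** 2))))
```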

In [7]:
def normalEq(X, y):
    # Solve the normal equation: weights = (X^T X)^-1 X^T y
    print("X:")
    print(X.shape)
    print("y:")
    print(y.shape)
    lhs = np.dot(X.transpose(), X)
    lhsinv = np.linalg.inv(lhs)
    rhs = np.dot(X.transpose(), y)
    return np.dot(lhsinv, rhs)
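
Explicitly inverting X^T X can become numerically fragile when columns are nearly collinear. As a hedged alternative (not the method used in this notebook), NumPy's `np.linalg.lstsq` solves the same least-squares problem without forming the inverse. The function name `normal_eq_lstsq` and the synthetic data below are assumptions for illustration only.

```python
import numpy as np


def normal_eq_lstsq(X, y):
    """Least-squares weights via np.linalg.lstsq instead of an explicit inverse."""
    weights, _residuals, _rank, _singular_values = np.linalg.lstsq(X, y, rcond=None)
    return weights


# Synthetic check (made-up data): y is an exact linear function of X.
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
true_w = np.array([0.1, -2.0, 3.5])
y = X.dot(true_w)

w = normal_eq_lstsq(X, y)
print(np.allclose(w, true_w))  # noiseless data: weights recovered up to rounding

# Compare with the explicit normal equation used in normalEq.
w_inv = np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))
print(np.allclose(w, w_inv))
```

On well-conditioned data like this, both approaches should agree; the difference only matters when X^T X is close to singular.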

We apply normalEq, which gives us the weights used to predict the y-data given the X-data.

In [8]:
weight = normalEq(X_train, y_train)
print("")
print("Weights:")
print(weight)

X:
(8000, 10)
y:
(8000,)

Weights:
[ 0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1]


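We now check our error residual (how close our prediction is to the actual values) on the cross-validation set. It is expressed as a relative error (not an absolute error): the absolute value of the summed residuals divided by the absolute value of the summed predictions.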

In [9]:
pred_cv = np.dot(X_cv, weight)  # predictions on the cross-validation set
u = y_cv - pred_cv  # residuals
sumu = np.abs(np.sum(u))
v = np.abs(np.sum(pred_cv))

print("Relative error, expressed as |y_actual - (X_cv * weights)|/|(X_cv * weights)|:")
print(sumu/v)

Relative error, expressed as |y_actual - (X_cv * weights)|/|(X_cv * weights)|:
1.69316590935e-16
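
Because the cell above sums the residuals before taking the absolute value, positive and negative errors can cancel, so a tiny value does not by itself prove a good fit. A norm-based relative error avoids this. The sketch below uses made-up numbers; `relative_error_norm` is a hypothetical helper name, and it normalizes by the actual values rather than by the predictions.

```python
import numpy as np


def relative_error_norm(pred, y):
    """||y - pred|| / ||y||, using the Euclidean norm (no cancellation)."""
    return np.linalg.norm(y - pred) / np.linalg.norm(y)


y = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([2.0, 1.0, 4.0, 3.0])   # every prediction is off by 1

# Sum-based measure as in the cell above: residuals of +1 and -1 cancel.
sum_based = np.abs(np.sum(y - pred)) / np.abs(np.sum(pred))
print(sum_based)                         # 0.0, although no prediction is right

# Norm-based measure: sqrt(4) / sqrt(30)
print(relative_error_norm(pred, y))
```

The `rms` function defined earlier is another cancellation-free option, for example `rms(np.dot(X_cv, weight), y_cv)`.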


## Bring the function into the submission format: Post-Processing
We have trained the weights using the training data (X_train). Recall what the sample submission looks like (printed below, next to the test data): it holds one 'y' value per test 'Id'. That means we need to predict 'y' for the test data.

In [10]:
print("Sample")
print(data_sample.head(cases))
print(data_sample.tail(cases))
print("Test")
print(data_test.head(cases))
print(data_test.tail(cases))

Sample
      Id       y
0  10000  2000.0
1  10001  2001.0
2  10002  2002.0
         Id       y
1997  11997  3997.0
1998  11998  3998.0
1999  11999  3999.0
Test
      Id           x1           x2          x3          x4          x5  \
0  10000  -483.797492  1288.057065 -129.878712 -198.078388 -334.487592   
1  10001  -316.407305    30.830556 -313.356726 -173.259184 -327.368719   
2  10002 -2448.558997  -561.988408  355.098820  634.378170 -392.450091   

           x6           x7          x8           x9          x10  
0 -391.443186  -612.406176 -676.523964  1327.229655  -448.695446  
1  944.368248  1122.017380  112.338731  1372.340221  2062.561842  
2 -813.156399  -232.873263  246.801210  -562.413197  -841.602015  
         Id          x1           x2           x3          x4           x5  \
1997  11997  199.864648   261.345779  -127.986805 -298.503216  -364.240174   
1998  11998 -151.673157 -1425.199620  1070.922114  938.800763  1373.176372   
1999  11999  -97.089983   780.444250   22

First of all, calculate the predictions:

In [11]:
print(X_test.shape)
print(weight.shape)
y_pred_test = np.dot(X_test, weight)  # one prediction per test row
print(y_pred_test.shape)
print(y_pred_test)

(2000, 10)
(10,)
(2000,)
[ -66.00242349  451.4065044  -461.67641706 ...,  -35.13540942 -131.67918453
  417.26915462]


The submission format includes the IDs taken from the test data. Each individual record has a predicted 'y' value as well.

In [12]:
sub_data = np.column_stack((data_test.values[:,0], y_pred_test))  # columns: Id, predicted y
print(sub_data.shape)
print(sub_data)

(2000, 2)
[[ 10000.            -66.00242349]
 [ 10001.            451.4065044 ]
 [ 10002.           -461.67641706]
 ..., 
 [ 11997.            -35.13540942]
 [ 11998.           -131.67918453]
 [ 11999.            417.26915462]]


This looks alright... Let's wrap it in a pandas DataFrame (that's what the data structures including the headers with 'Id' and 'y' are called).

In [13]:
submission = pd.DataFrame(sub_data, columns = ["Id", "y"])
submission.Id = submission.Id.astype(int)  # Ids were stored as floats in sub_data
print(submission)

         Id           y
0     10000  -66.002423
1     10001  451.406504
2     10002 -461.676417
3     10003   40.501209
4     10004 -126.744722
5     10005 -342.534552
6     10006 -396.555542
7     10007  335.541279
8     10008  -99.512421
9     10009  304.812540
10    10010   68.894530
11    10011  412.369195
12    10012   54.942371
13    10013  -17.005551
14    10014 -597.849958
15    10015  443.502289
16    10016  144.774487
17    10017  -57.411164
18    10018  134.387828
19    10019 -108.983776
20    10020  153.824828
21    10021  611.735984
22    10022  588.389650
23    10023 -131.016800
24    10024  -61.135621
25    10025 -374.248495
26    10026   74.157993
27    10027  137.438379
28    10028  420.086612
29    10029  196.765626
...     ...         ...
1970  11970 -228.376122
1971  11971  198.061607
1972  11972 -713.006477
1973  11973  -89.681865
1974  11974 -480.004570
1975  11975 -271.733396
1976  11976  565.323640
1977  11977   96.111483
1978  11978  178.613359
1979  11979  -61

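*** Note: I am not sure whether 'y' has to be given with one decimal place (the last record would then be 417.3 instead of 417.269155). Also, the Id column in sub_data still contains decimal points (e.g. 10000.), which does not match the sample submission form; the astype(int) cast above removes them in the final DataFrame. ***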
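
One way to approach both points, as a sketch only: building the frame from separate columns keeps 'Id' as integers from the start, and `round(1)` gives one decimal place if that turns out to be required. Whether rounding is actually required is not known here. The three rows are copied from the prediction output above.

```python
import numpy as np
import pandas as pd

# Three rows copied from the prediction output above, for illustration.
ids = np.array([10000.0, 10001.0, 11999.0])   # Ids arrive as floats from .values
y_pred = np.array([-66.00242349, 451.4065044, 417.26915462])

# Passing the columns separately avoids np.column_stack, which forces one dtype.
submission = pd.DataFrame({"Id": ids.astype(int), "y": y_pred})
print(submission.dtypes)

# Optional: one decimal place, only if the submission format demands it.
rounded = submission.assign(y=submission["y"].round(1))
print(rounded)
```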

Now let's export the submission file as a csv:

In [14]:
# index=False keeps the row index out of the file, so it only has the Id and y columns
submission.to_csv('final_submission.csv', index=False)
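
By default `to_csv` also writes the row index as an unnamed first column, which the sample submission (columns 'Id' and 'y' only) does not have; passing `index=False` suppresses it. A minimal Python 3 sketch that writes to an in-memory buffer instead of a file, so the result can be inspected; the two rows are copied from the output above.

```python
import io

import pandas as pd

submission = pd.DataFrame({"Id": [10000, 10001], "y": [-66.002423, 451.406504]})

# Default behaviour: the row index becomes an extra, unnamed first column.
with_index = io.StringIO()
submission.to_csv(with_index)
print(with_index.getvalue().splitlines()[0])     # ,Id,y

# index=False writes only the Id and y columns.
without_index = io.StringIO()
submission.to_csv(without_index, index=False)
print(without_index.getvalue().splitlines()[0])  # Id,y

# Reading the text back confirms the column layout.
check = pd.read_csv(io.StringIO(without_index.getvalue()))
print(list(check.columns))                        # ['Id', 'y']
```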
