# Numer.ai

We look into the [numer.ai](https://numer.ai/) data challenge using scikit-learn. Thanks to [Zygmunt Zając](https://github.com/zygmuntz/numer.ai) for posting examples.

In [None]:
import pandas as pd
# time.clock was removed in Python 3.8; perf_counter is the replacement for timing
from time import perf_counter as clock

We first need to load the data. We drop the validation flag, since it is not needed for the final training and prediction.

In [None]:
train_file = 'dataset/numerai_training_data.csv'
test_file = 'dataset/numerai_tournament_data.csv'
predict_file = 'predict.csv'

start = clock()
train_data = pd.read_csv(train_file)
test_data = pd.read_csv(test_file)
print('Loaded {:d} train and {:d} test entries in {:.0f} seconds.'.format( 
    len(train_data), len(test_data), clock() - start))

# No need for validation flag for final training and extrapolation
train_data.drop('validation', axis = 1 , inplace = True)

# Separate data and target label
train_target = train_data['target']
train_data.drop('target', axis = 1, inplace = True)

There is one categorical variable, `c1`. We use one-hot encoding to deal with it.

In [None]:
# Check train and test have the same categories
assert(set(train_data['c1'].unique()) == set(test_data['c1'].unique()))

# Encode column in train, then drop original column
train_dummies = pd.get_dummies(train_data['c1'])
train_data = pd.concat((train_data.drop('c1', axis = 1), train_dummies.astype(int)), axis = 1)

# Encode column in test, then drop original column
test_dummies = pd.get_dummies(test_data['c1'])
test_data = pd.concat((test_data.drop('c1', axis = 1), test_dummies.astype(int)), axis = 1)
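
To make explicit what one-hot encoding does to a column like `c1`, here is a minimal pure-Python sketch. The helper `one_hot` and the category labels are hypothetical and only for illustration; the notebook itself relies on `pd.get_dummies`. Reusing the training categories for the test set keeps the indicator columns in the same order in both tables.

```python
def one_hot(values, categories=None):
    """Return (categories, rows); each row is a list of 0/1 indicators."""
    # Fix the column order up front so train and test share the same layout
    if categories is None:
        categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # raises KeyError on an unseen category
        rows.append(row)
    return categories, rows


# Hypothetical category labels, for illustration only
train_c1 = ['c1_3', 'c1_1', 'c1_2', 'c1_1']
test_c1 = ['c1_2', 'c1_2', 'c1_3']

categories, train_rows = one_hot(train_c1)
_, test_rows = one_hot(test_c1, categories)
print(categories)
print(train_rows)
print(test_rows)
```

A category that appears only in the test set makes this sketch fail with a `KeyError`, which is the situation the assertion on `c1` above guards against.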

We can use ggplot to visualize the data and observe its structure.

In [None]:
from ggplot import ggplot, aes, geom_point
%pylab inline

f1f2 = ggplot(train_data, aes(x = 'f1', y = 'f2')) \
       + geom_point()

print(f1f2)

<img src="dataset_f1f2.png">

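# Prediction
We now simply have to select a classifier. Random forest does fine, and so does logistic regression with preprocessing. The two cells below are alternatives: each assigns `clf`, so the one run last is the one used.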

In [None]:
# Random forest

from sklearn.ensemble import RandomForestClassifier as RF

clf = RF(n_estimators = 1000, verbose = True)

In [None]:
# Logistic regression with preprocessor

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression as LR

clf = Pipeline([('MinMaxScaler', MinMaxScaler()), ('LR', LR())])
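
As a rough illustration of the preprocessing step in the pipeline above, min-max scaling maps each feature to `(x - min) / (max - min)`, with the minimum and maximum learned from the training data and then reused on the test data. The helpers `fit_min_max` and `apply_min_max` and the toy numbers are hypothetical; this is a sketch of the idea, not scikit-learn's implementation.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and range from a list of feature rows."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    # A constant column has zero range; fall back to 1.0 to avoid dividing by zero
    ranges = [(max(col) - lo) or 1.0 for col, lo in zip(columns, mins)]
    return mins, ranges


def apply_min_max(rows, mins, ranges):
    """Rescale rows with statistics learned elsewhere (usually the training set)."""
    return [[(x - lo) / rng for x, lo, rng in zip(row, mins, ranges)]
            for row in rows]


# Toy data, for illustration only
toy_train = [[1.0, 10.0], [3.0, 20.0], [2.0, 30.0]]
toy_test = [[4.0, 15.0]]

mins, ranges = fit_min_max(toy_train)
scaled_train = apply_min_max(toy_train, mins, ranges)
scaled_test = apply_min_max(toy_test, mins, ranges)
print(scaled_train)
print(scaled_test)
```

Because the statistics come from the training rows only, scaled test values can fall outside the interval from 0 to 1, as the first test feature does here.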

We now simply have to train, extrapolate, and save.

In [None]:
# Fit training data

start = clock()
clf.fit(train_data, train_target)
print("Fitted in {:.0f} seconds.".format(clock() - start))

# Extrapolate

start = clock()
# Don't forget to ignore the t_id column!
predict = clf.predict_proba(test_data.drop('t_id', axis = 1, inplace = False))
print("Extrapolated in {:.0f} seconds.".format(clock() - start))

# Save results

test_data['probability'] = predict[:,1]
# Write only the id and probability columns, without the row index
test_data.to_csv(predict_file, columns = ['t_id', 'probability'], index = False)

Scores are measured by ROC AUC. As of February 2016:
- RF (random forest) scores 0.5218;
- MMS+LR (`MinMaxScaler` followed by logistic regression) scores 0.5278.
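
For context on these numbers, the ROC AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher predicted probability, so 0.5 corresponds to random guessing. The sketch below computes it by brute force in pure Python; the function `roc_auc` and the toy labels and scores are hypothetical, and in practice `sklearn.metrics.roc_auc_score` would be the usual choice.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The double loop is quadratic,
    so this is for illustration on small inputs only.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError('AUC needs at least one positive and one negative label')
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities, for illustration only
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of 4 pairs are ranked correctly -> 0.75
```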