# T81-558: Applications of Deep Neural Networks
**Class 7: Kaggle Data Sets.**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Helpful Functions

These are the same feature-vector encoding functions used in [Class 3](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class3_training.ipynb).  They must be defined again for this class.  For more information, refer to Class 3.

In [1]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import os


# Encode text values to dummy variables (i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)


# Encode text values to a single dummy variable.  The new columns (which do not replace the old) will have a 1
# at every location where the original column (name) matches each of the target_values.  One column is added for
# each target value.
def encode_text_single_dummy(df, name, target_values):
    for tv in target_values:
        l = list(df[name].astype(str))
        l = [1 if str(x) == str(tv) else 0 for x in l]
        name2 = "{}-{}".format(name, tv)
        df[name2] = l


# Encode text values to indexes (i.e. [0],[1],[2] for blue,green,red; LabelEncoder sorts the classes and starts at 0).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_


# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd


# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)


# Convert all missing values in the specified column to the default
def missing_default(df, name, default_value):
    df[name] = df[name].fillna(default_value)


# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column.  Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification: one-hot encode the target
        dummies = pd.get_dummies(df[target])
        # .values replaces as_matrix(), which was removed in pandas 1.0
        return df[result].values.astype(np.float32), dummies.values.astype(np.float32)
    else:
        # Regression: keep the target as a single float column
        return df[result].values.astype(np.float32), df[[target]].values.astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)


# Regression chart.
def chart_regression(pred,y,sort=True):
    t = pd.DataFrame({'pred' : pred, 'y' : y.flatten()})
    if sort:
        t.sort_values(by=['y'],inplace=True)
    a = plt.plot(t['y'].tolist(),label='expected')
    b = plt.plot(t['pred'].tolist(),label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Remove all rows where the specified column is sd or more standard deviations away from the mean
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean()) >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)


# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low=-1, normalized_high=1,
                         data_low=None, data_high=None):
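    # Default each bound independently so that a caller may supply only one of them
    if data_low is None:
        data_low = min(df[name])
    if data_high is None:
        data_high = max(df[name])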

    df[name] = ((df[name] - data_low) / (data_high - data_low)) \
               * (normalized_high - normalized_low) + normalized_low
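
To make the effect of two of these encoders concrete, here is a minimal sketch on a made-up three-row data frame.  It uses plain pandas calls that mirror `encode_text_dummy` and `encode_numeric_zscore` rather than calling the helpers, so that it runs on its own.  The column names `color` and `size` are hypothetical.

```python
import pandas as pd

# Toy data; the column names are made up for illustration.
df = pd.DataFrame({"color": ["red", "green", "blue"],
                   "size": [1.0, 2.0, 3.0]})

# Same idea as encode_text_dummy: one 0/1 column per category.
dummies = pd.get_dummies(df["color"])
for value in dummies.columns:
    df["color-{}".format(value)] = dummies[value].astype(int)
df = df.drop("color", axis=1)

# Same idea as encode_numeric_zscore: subtract the mean and divide by the
# sample standard deviation (pandas uses ddof=1 by default).
df["size"] = (df["size"] - df["size"].mean()) / df["size"].std()

print(df)
```

Note that pandas orders the dummy columns alphabetically by category, so the new columns appear as `color-blue`, `color-green`, `color-red`.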

# What is Kaggle?

[Kaggle](http://www.kaggle.com) runs competitions in which data scientists compete to build the model that best fits the data. The capstone project of this chapter features Kaggle’s [Titanic data set](https://www.kaggle.com/c/titanic-gettingStarted). Before we get started with the Titanic example, it’s important to be aware of some Kaggle guidelines. First, most competitions end on a specific date. The organizers have currently scheduled the Titanic competition to end on December 31, 2016. However, they have already extended the deadline several times, and a further extension is possible. Second, the Titanic data set is considered a tutorial data set. In other words, there is no prize, and your score in the competition does not count toward becoming a Kaggle Master.

# Kaggle Ranks

Kaggle ranks are achieved by earning gold, silver and bronze medals.

* [Kaggle Top Users](https://www.kaggle.com/rankings)
* [Current Top Kaggle User's Profile Page](https://www.kaggle.com/stasg7)
* [Jeff Heaton's (your instructor) Kaggle Profile](https://www.kaggle.com/jeffheaton)
* [Current Kaggle Ranking System](https://www.kaggle.com/progression)

# Typical Kaggle Competition

A typical Kaggle competition will have several components.  Consider the Titanic tutorial:

* [Competition Summary Page](https://www.kaggle.com/c/titanic)
* [Data Page](https://www.kaggle.com/c/titanic/data)
* [Evaluation Description Page](https://www.kaggle.com/c/titanic/details/evaluation)
* [Leaderboard](https://www.kaggle.com/c/titanic/leaderboard)

## How Kaggle Competitions are Scored

Kaggle is provided with a data set by the competition sponsor.  This data set is divided up as follows:

* **Complete Data Set** - This is the complete data set.
    * **Training Data Set** - You are provided both the inputs and the outcomes for the training portion of the data set.
    * **Test Data Set** - You are provided the complete test data set; however, you are not given the outcomes.  Your submission consists of your predicted outcomes for this data set.
        * **Public Leaderboard** - You are not told what part of the test data set contributes to the public leaderboard.  Your public score is calculated based on this part of the data set.
        * **Private Leaderboard** - You are not told what part of the test data set contributes to the private leaderboard.  Your final score/rank is calculated based on this part.  You do not see your private leaderboard score until the end.

![How Kaggle Competitions are Scored](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_3_kaggle.png "How Kaggle Competitions are Scored")
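
The public/private split can be illustrated with a small simulation.  This is only a sketch: the 30% public fraction, the random row assignment, and the use of accuracy as the metric are assumptions made for the example.  Each competition chooses its own split and metric.

```python
import random

random.seed(42)

# Hidden outcomes for a made-up test set of 1,000 rows (known only to Kaggle).
actual = [random.randint(0, 1) for _ in range(1000)]

# A contestant's submission that is right about 80% of the time.
predicted = [a if random.random() < 0.8 else 1 - a for a in actual]

# Kaggle secretly assigns each test row to the public or private portion.
# The 30% public fraction is an assumption for this example.
rows = list(range(len(actual)))
random.shuffle(rows)
cut = int(0.3 * len(rows))
public_rows, private_rows = rows[:cut], rows[cut:]


def accuracy(row_ids):
    """Fraction of the given rows that were predicted correctly."""
    hits = sum(1 for i in row_ids if predicted[i] == actual[i])
    return hits / len(row_ids)


print("Public score:  {:.3f}".format(accuracy(public_rows)))
print("Private score: {:.3f}".format(accuracy(private_rows)))
```

The two scores usually differ slightly because they are computed on different rows.  That is why a model tuned heavily against the public leaderboard can change rank once the private scores are revealed.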

## Preparing a Kaggle Submission

Code need not be submitted to Kaggle.  For competitions, you are scored entirely on the accuracy of your submission file.  A Kaggle submission file is always a CSV file that contains the **Id** of the row you are predicting and the answer.  For the Titanic competition, a submission file looks something like this:

```
PassengerId,Survived
892,0
893,1
894,1
895,0
896,0
897,1
...
```

The above file states the prediction for each passenger listed.  You should only predict on IDs that are in the test file.  Likewise, you should render a prediction for every row in the test file.  Some competitions will have different formats for their answers.  For example, a multi-class classification competition will usually have a column for each class, holding your predicted probability for that class.
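
A minimal sketch of writing such a file with pandas and checking it against the test set follows.  The test IDs and predictions are made up, and the file is written to an in-memory buffer so that the example runs on its own.  In practice you would pass a file name to `to_csv`.

```python
import io
import pandas as pd

# Stand-ins for the competition's test file and a model's output (made up).
df_test = pd.DataFrame({"PassengerId": [892, 893, 894, 895]})
predictions = [0, 1, 1, 0]

# One row per test ID, with the column names the competition expects.
df_submit = pd.DataFrame({"PassengerId": df_test["PassengerId"],
                          "Survived": predictions})

# Sanity checks: every test ID appears exactly once and nothing is missing.
assert df_submit["PassengerId"].is_unique
assert set(df_submit["PassengerId"]) == set(df_test["PassengerId"])
assert not df_submit["Survived"].isnull().any()

# index=False keeps the pandas row index out of the file.
buffer = io.StringIO()
df_submit.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Running checks like these before uploading is cheap insurance against a submission with missing or duplicated rows.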

# Select Kaggle Competitions

There have been many interesting competitions on Kaggle.  These are some of my favorites.

## Predictive Modeling

* [Otto Group Product Classification Challenge](https://www.kaggle.com/c/otto-group-product-classification-challenge)
* [Galaxy Zoo - The Galaxy Challenge](https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge)
* [Practice Fusion Diabetes Classification](https://www.kaggle.com/c/pf2012-diabetes)
* [Predicting a Biological Response](https://www.kaggle.com/c/bioresponse)

## Computer Vision

* [Diabetic Retinopathy Detection](https://www.kaggle.com/c/diabetic-retinopathy-detection)
* [Cats vs Dogs](https://www.kaggle.com/c/dogs-vs-cats)
* [State Farm Distracted Driver Detection](https://www.kaggle.com/c/state-farm-distracted-driver-detection)

## Time Series

* [The Marinexplore and Cornell University Whale Detection Challenge](https://www.kaggle.com/c/whale-detection-challenge)

## Other

* [Helping Santa's Helpers](https://www.kaggle.com/c/helping-santas-helpers)


# Iris as a Kaggle Competition

If the Iris data were used as a Kaggle competition, you would be given the following three files:

* [kaggle_iris_test.csv](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/data/kaggle_iris_test.csv) - The data that Kaggle will evaluate you on.  Contains only inputs; you must provide the answers.  (contains x)
* [kaggle_iris_train.csv](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/data/kaggle_iris_train.csv) - The data that you will use to train. (contains x and y)
* [kaggle_iris_sample.csv](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/data/kaggle_iris_sample.csv) - A sample submission for Kaggle. (contains x and y)

Important features of the Kaggle iris files (that differ from how we've previously seen files):

* The iris species is already index encoded.
* Your training data is in a separate file.
* You will load the test data to generate a submission file.

The following program generates a submission file for "Iris Kaggle".  You can use it as a starting point for assignment 3.

In [2]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

path = "./data/"
    
filename_train = os.path.join(path,"kaggle_iris_train.csv")
filename_test = os.path.join(path,"kaggle_iris_test.csv")
filename_submit = os.path.join(path,"kaggle_iris_submit.csv")

df_train = pd.read_csv(filename_train,na_values=['NA','?'])

# Encode feature vector
encode_numeric_zscore(df_train,'petal_w')
encode_numeric_zscore(df_train,'petal_l')
encode_numeric_zscore(df_train,'sepal_w')
encode_numeric_zscore(df_train,'sepal_l')
df_train.drop('id', axis=1, inplace=True)

num_classes = df_train['species'].nunique()

print("Number of classes: {}".format(num_classes))

# Create x (feature vectors) and y (one-hot encoded species) for training
x, y = to_xy(df_train,'species')
    
# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=45)

model = Sequential()
model.add(Dense(10, input_dim=x.shape[1], activation='relu'))
model.add(Dense(1))
model.add(Dense(y.shape[1],activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto')

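# Train on the training split only, so that the held-out rows give an honest validation loss
model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor],verbose=0,epochs=1000)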

Using TensorFlow backend.


Number of classes: 3
Epoch 00240: early stopping


<keras.callbacks.History at 0x2e678d1ed68>

In [3]:
from sklearn import metrics

# Calculate multi log loss error
pred = model.predict(x_test)
score = metrics.log_loss(y_test, pred)
print("Log loss score: {}".format(score))


Log loss score: 0.47191062205703926
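
For reference, multi-class log loss is the average negative log of the probability that the model assigned to the correct class.  The sketch below computes it directly with NumPy on three made-up rows.  It omits the probability clipping that library implementations may apply, so treat it as an illustration rather than a replacement for `metrics.log_loss`.

```python
import numpy as np

# One-hot expected classes and predicted probabilities (made-up values).
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.6, 0.2],
                   [0.1, 0.4, 0.5]])

# Probability that each row assigned to its correct class.
p_correct = np.sum(y_true * y_pred, axis=1)

# Average negative log-likelihood; lower is better, 0 is a perfect score.
log_loss = -np.mean(np.log(p_correct))
print("Log loss: {:.4f}".format(log_loss))
```

Because of the logarithm, a confident wrong answer is penalized far more heavily than a hesitant one, so well-calibrated probabilities matter as much as picking the right class.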


In [4]:
# Generate Kaggle submit file

# Encode feature vector
df_test = pd.read_csv(filename_test,na_values=['NA','?'])

encode_numeric_zscore(df_test,'petal_w')
encode_numeric_zscore(df_test,'petal_l')
encode_numeric_zscore(df_test,'sepal_w')
encode_numeric_zscore(df_test,'sepal_l')
ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)

# .values replaces as_matrix(), which was removed in pandas 1.0
x = df_test.values.astype(np.float32)

# Generate predictions
pred = model.predict(x)
#pred

# Create submission data set

df_submit = pd.DataFrame(pred)
df_submit.insert(0,'id',ids)
df_submit.columns = ['id','species-0','species-1','species-2']

df_submit.to_csv(filename_submit, index=False)

print(df_submit)


     id  species-0  species-1  species-2
0   100   0.012182   0.531441   0.456378
1   101   0.016421   0.515493   0.468087
2   102   0.003597   0.591064   0.405339
3   103   0.985299   0.002708   0.011992
4   104   0.998779   0.000148   0.001074
5   105   0.990435   0.001642   0.007923
6   106   0.999897   0.000008   0.000095
7   107   0.009183   0.545952   0.444865
8   108   0.003574   0.591351   0.405075
9   109   0.025504   0.490506   0.483990
10  110   0.009660   0.543391   0.446950
11  111   0.999404   0.000063   0.000532
12  112   0.011159   0.536000   0.452841
13  113   0.073819   0.418346   0.507835
14  114   0.995753   0.000637   0.003611
15  115   0.016924   0.513839   0.469237
16  116   0.014479   0.522299   0.463222
17  117   0.987608   0.002220   0.010172
18  118   0.998098   0.000249   0.001654
19  119   0.999456   0.000057   0.000487
20  120   0.006542   0.562751   0.430707
21  121   0.005865   0.568043   0.426092
22  122   0.021952   0.499244   0.478804
23  123   0.0582
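
One refinement worth considering: the cell above z-scores the test file using the test file's own mean and standard deviation, so the inputs are scaled slightly differently from the training data.  `encode_numeric_zscore` already accepts `mean` and `sd` arguments for this purpose.  A minimal sketch follows, with made-up numbers and a copy of the helper so that it runs on its own:

```python
import pandas as pd


def encode_numeric_zscore(df, name, mean=None, sd=None):
    # Copy of the helper defined at the top of this notebook.
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd


# Made-up stand-ins for the training and test files.
df_train = pd.DataFrame({"petal_w": [0.2, 1.3, 2.1, 1.8]})
df_test = pd.DataFrame({"petal_w": [0.4, 1.5]})

# Record the training statistics BEFORE the column is overwritten.
petal_w_mean = df_train["petal_w"].mean()
petal_w_sd = df_train["petal_w"].std()

encode_numeric_zscore(df_train, "petal_w")
# Apply the training statistics to the test data.
encode_numeric_zscore(df_test, "petal_w", mean=petal_w_mean, sd=petal_w_sd)

print(df_test)
```

The same pattern would apply to each of the four iris measurement columns.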

# Programming Assignment 3

Kaggle competition site for the current semester (Fall 2017):
* [Fall 2017 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2017)
* [Assignment File](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/pdf/t81_559_program_3.pdf)

Previous Kaggle competition sites for this class (for your reference, do not use):
* [Spring 2017 Kaggle Assignment](https://inclass.kaggle.com/c/applications-of-deep-learning-wustl-spring-2017)
* [Fall 2016 Kaggle Assignment](https://inclass.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2016)



