In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns
# the 'seaborn-whitegrid' matplotlib style name is deprecated in newer matplotlib releases
sns.set_style('whitegrid')

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping




# A Quick Exercise

This is a 'quickie' on building a neural network with Keras, following on from the Kaggle course "Intro to Deep Learning".

In [None]:
data = pd.read_csv("../input/tabular-playground-series-nov-2021/train.csv")
data.head()

We'll build a naive model first, on all features, then we'll look at feature selection and possible feature engineering.

Okay, cool. It "works".

# Wide and Deep Modelling

Or is it 'modeling'? We can develop our model in two ways: widen it (add more units per layer) or deepen it (add more layers).

We'll try both, separately, and then together.
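To make the 'wider vs deeper' distinction concrete, here is a small sketch that counts the trainable parameters in a stack of fully connected layers. The helper `dense_param_count`, the layer sizes, and the input width of 100 are illustrative assumptions, not the models trained in this notebook.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases in a stack of fully connected (Dense) layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus one bias per unit
        fan_in = units
    return total

n_features = 100  # assumed input width, for illustration only

base = dense_param_count(n_features, [32, 1])          # one hidden layer
wide = dense_param_count(n_features, [128, 1])         # same depth, more units
deep = dense_param_count(n_features, [32, 32, 32, 1])  # same width, more layers

print("base:", base)
print("wide:", wide)
print("deep:", deep)
```

Widening mostly grows the first weight matrix, while deepening adds extra layer-to-layer transformations for comparatively few parameters.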

Looking at the past three runs, we can see that 'deepening' the model has improved it. But only 'just'. We need more!

# Batch Normalization

So far we have done nothing with the data. Jack-squat. One aspect that often proves helpful to note is "a large difference is more important than a small difference (but it depends on the difference)" - me. When data is not 'normalized', a feature measured on a large scale can dominate the model simply because its differences are numerically bigger, drowning out a relatively small (but very important) difference in another feature. A way to take this into account is to normalize the data: scale each feature so that every feature has the same 'spread' (for example zero mean and unit variance, or a common min/max range).

Keras has a `BatchNormalization` layer that we will use. We'll continue with our 'deep' model.
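As a rough illustration of what normalizing does, the sketch below standardizes two made-up features that live on very different scales. It uses plain NumPy rather than Keras, and the data is invented. A batch normalization layer does something similar per batch of data, then applies a learned scale and shift.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up features on very different scales
X = np.column_stack([
    rng.normal(loc=0.0, scale=1.0, size=1000),
    rng.normal(loc=500.0, scale=100.0, size=1000),
])

# Standardize each column: subtract its mean, divide by its standard deviation
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print("before:", X.mean(axis=0).round(2), X.std(axis=0).round(2))
print("after: ", X_scaled.mean(axis=0).round(2), X_scaled.std(axis=0).round(2))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.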

# Dropout

Another tool we can use is 'dropout'. During training, the model randomly 'drops' (zeroes out) a fraction of a layer's units on each step. Essentially, this prevents any single node from carrying too much importance. We'll use the deep model again.

Minor, perhaps negligible, improvements.
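Here is a minimal sketch of the idea, assuming the common 'inverted dropout' formulation, in which the surviving units are rescaled so the expected activation stays the same. The function `dropout` and the toy activations are hypothetical; in the notebook itself this is handled by `layers.Dropout`.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a random fraction of units and rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where the unit is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((4, 8))  # toy layer output: 4 samples, 8 units

dropped = dropout(activations, rate=0.25, rng=rng)
print(dropped)
```

Each entry is either zero (dropped) or scaled up by `1 / keep_prob` (kept). Dropout applies only during training; at prediction time all units are used.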

# Early Stopping

We have been running for 20 epochs (full passes over the training data), yet we can see that the model often stops improving after just a few epochs. We can set some 'early stopping' parameters to halt the training early if it looks like it is going nowhere.

This also helps with preventing 'overfitting'.
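The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the patience / minimum-improvement idea, not Keras's actual implementation; the function name and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience, min_delta):
    """Return the index of the epoch at which training would stop.

    Training stops once `patience` epochs in a row fail to beat the best
    validation loss so far by at least `min_delta`.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss  # a real improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: all epochs were run

# Made-up validation losses that plateau after the fourth epoch
val_losses = [0.70, 0.65, 0.62, 0.61, 0.611, 0.612, 0.610, 0.613, 0.614]
stop = early_stopping_epoch(val_losses, patience=3, min_delta=0.001)
print("stopped at epoch index", stop)
```

A larger `patience` tolerates longer plateaus before giving up, while a larger `min_delta` demands bigger gains before an epoch counts as an improvement.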

We'll add 'capacity' to our model by adding further layers.

# Features

We *should* inspect the data first when it comes to this. But, as mentioned, my focus was on exploring simple Keras neural net models. Anyhow, if we really want to improve the model, we will have to start thinking about both feature engineering and feature selection. As this data is not 'real' (it was generated), this may prove tricky, but we'll get creative.

Usually, this is how we gain the best improvements. 

## Correlations

Welp! There go our hopes (but not our dreams). This heatmap confirms that the features do not correlate strongly with each other. (Thus, 'feature engineering' may be of no help.) BUT we can see that some do correlate "more strongly" with the `id` (leftmost column) and `target` (bottom row).

*Note, the strengths of any correlations here are exceptionally weak if present at all. We will more than likely just skip this step altogether, but it's extra practice for myself*

Fortunately, we remove the `id` feature anyway, as it is just a label; any correlations with it are spurious and would only create misleading predictions. However, the correlation with `target` is precisely what we need for 'feature selection'. Let's collect these features.
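One way to collect such features is to rank them by their absolute correlation with `target` and keep those above a cut-off. The sketch below uses made-up data; the column names, the threshold of 0.1, and the way the target is generated are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Made-up data: only f0 and f2 actually influence the target
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["f0", "f1", "f2", "f3"])
signal = 1.5 * df["f0"] - 1.0 * df["f2"] + rng.normal(size=n)
df["target"] = (signal > 0).astype(int)

# Absolute correlation of every feature with the target, strongest first
corr_with_target = (
    df.corr()["target"]
      .drop("target")
      .abs()
      .sort_values(ascending=False)
)
print(corr_with_target)

# Keep the features above a (hypothetical) threshold
threshold = 0.1
selected = corr_with_target[corr_with_target > threshold].index.tolist()
print("selected:", selected)
```

With correlations as weak as the ones in this dataset, any threshold is a judgment call, so treat the selected list as a starting point rather than a verdict.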

## Summary features

Summary statistics of the features (say 'mean', 'median', 'range', etc.) can be very telling of the data. Luckily, these are not complicated to include.

We want these statistics to inform our decision about what the `target` should be. Thus, we need to include extra columns with these summaries as values, one value per row, computed across that row's features.

## Polynomial features

Polynomials of 'x' are like: ax^2 + bx + c, etc. to any power and any combination. It would be ridiculous to explore all possible combinations, though looking at just a possible range of powers, say x^2, or x^3, may help create something special. 

Usually, when it comes to choices, you would choose either the most significant features or the least significant features. Depends if you are investigating the most obvious patterns or least.

(*I suppose, but please correct me on this*)

For this test, we'll use 'x^2' as an extra feature for every feature, then normalize.
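A minimal sketch of that step, assuming a tiny made-up frame in place of the real features; the `sq_` prefix and the choice of standardization for the normalizing step are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real features
X = pd.DataFrame({"f0": [1.0, 2.0, 3.0, 4.0], "f1": [-2.0, 0.0, 2.0, 4.0]})

# Square every column and label the new columns with a prefix
X_sq = (X ** 2).add_prefix("sq_")
X_poly = pd.concat([X, X_sq], axis=1)

# Standardize every column (population standard deviation, ddof=0)
X_norm = (X_poly - X_poly.mean()) / X_poly.std(ddof=0)

print(X_poly.columns.tolist())
print(X_norm.round(3))
```

Squaring doubles the number of columns, so the model's input width has to be updated to match.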

## Non-linear features

Much like the polynomial features, we can also use non-linear functions.
The most common transformation is 'log'; whether base 2, base 10, or the natural log is a matter of choice. But really, it's whatever you need.

We can go crazy really, but experience and domain knowledge will best inform these decisions. In essence, you are attempting to transform the data to better reveal any underlying pattern.

In [None]:
d2 = data.drop(['id', 'target'], axis = 1)
d1 = d2.copy() # independent copy, so the stats below use only the original features
d4 = np.log(abs(d2)+1)

d4 = d4.add_prefix("log_")


# our summary stats of original data
d2['mean'] = d1.mean(axis=1)
d2['sd'] = d1.std(axis=1)
d2['var'] = d1.var(axis=1)
d2['max'] = d1.max(axis=1)
d2['min'] = d1.min(axis=1)
d2['median'] = d1.median(axis=1)
d2['kurt'] = d1.kurt(axis=1)

dataf = pd.concat([d2, d4], axis=1)


X_dataf = dataf
y_dataf = data['target']

X_train1, X_test1, y_train1, y_test1 = train_test_split(X_dataf, y_dataf, test_size = 0.2, random_state=37)
X_train1, X_val1, y_train1, y_val1 = train_test_split(X_train1, y_train1, test_size=0.25, random_state=37)

X_train2, X_test2, y_train2, y_test2 = train_test_split(X_dataf, y_dataf, test_size = 0.2, random_state=73)
X_train2, X_val2, y_train2, y_val2 = train_test_split(X_train2, y_train2, test_size=0.25, random_state=73)

X_train3, X_test3, y_train3, y_test3 = train_test_split(X_dataf, y_dataf, test_size = 0.2, random_state=173)
X_train3, X_val3, y_train3, y_val3 = train_test_split(X_train3, y_train3, test_size=0.25, random_state=173)


In [None]:
testdata = pd.read_csv("../input/tabular-playground-series-nov-2021/test.csv")
testid = testdata['id']

td2 = testdata.drop(['id'], axis = 1)
td1 = td2.copy() # independent copy, same as for the training data
td4 = np.log(abs(td2)+1)

td4 = td4.add_prefix("log_")


# our summary stats of original data
td2['mean'] = td1.mean(axis=1)
td2['sd'] = td1.std(axis=1)
td2['var'] = td1.var(axis=1)
td2['max'] = td1.max(axis=1)
td2['min'] = td1.min(axis=1)
td2['median'] = td1.median(axis=1) # use the test rows (td1), not the training rows (d1)
td2['kurt'] = td1.kurt(axis=1)

tdata = pd.concat([td2, td4], axis=1)

In [None]:
# Early stopping parameters
early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=30, # how many epochs to wait before stopping
    restore_best_weights=True,
)

# model structure
model = keras.Sequential([
    layers.BatchNormalization(input_dim = 207),
    layers.Dense(units = 256, activation = "swish"),
    layers.BatchNormalization(),
    
    layers.Dense(units = 256, activation = "relu"),
    layers.Dropout(0.25),
    layers.BatchNormalization(),
    
    layers.Dense(units = 64, activation = "relu"),
    layers.Dropout(0.125),
    layers.BatchNormalization(),
    
    layers.Dense(units = 16, activation = "relu"),
    layers.Dropout(0.125),
    layers.BatchNormalization(),

    layers.Dense(units = 1, activation = "sigmoid") # binary output
])

# evaluation methods
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = ['binary_accuracy']
)

# training
history = model.fit(X_train1, y_train1,
                    validation_data = (X_val1, y_val1),
                    batch_size = 100,
                    callbacks = [early_stopping],
                    epochs = 200)

In [None]:
# plotting evaluation steps
history_df = pd.DataFrame(history.history)
history_df[['loss', 'val_loss']].plot()
history_df[['binary_accuracy', 'val_binary_accuracy']].plot()

print(("Best Validation Loss: {:0.4f}" +\
      "\nBest Validation Accuracy: {:0.4f}")\
      .format(history_df['val_loss'].min(), 
              history_df['val_binary_accuracy'].max()))

In [None]:
# EarlyStopping only applies to training, so no callbacks are passed here
predictions = model.predict(tdata, verbose = 0).reshape(-1)


result1 = pd.DataFrame({'id':testid, 'target': predictions})
result1.to_csv('./submission1.csv', index=False)

In [None]:
# Early stopping parameters
early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=30, # how many epochs to wait before stopping
    restore_best_weights=True,
)

# model structure
model = keras.Sequential([
    layers.BatchNormalization(input_dim = 207),
    layers.Dense(units = 256, activation = "swish"),
    layers.BatchNormalization(),
    
    layers.Dense(units = 256, activation = "relu"),
    layers.Dropout(0.25),
    layers.BatchNormalization(),
    
    layers.Dense(units = 64, activation = "relu"),
    layers.Dropout(0.125),
    layers.BatchNormalization(),
    
    layers.Dense(units = 16, activation = "relu"),
    layers.Dropout(0.125),
    layers.BatchNormalization(),

    layers.Dense(units = 1, activation = "sigmoid") # binary output
])

# evaluation methods
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = ['binary_accuracy']
)

# training
history = model.fit(X_train2, y_train2,
                    validation_data = (X_val2, y_val2),
                    batch_size = 100,
                    callbacks = [early_stopping],
                    epochs = 200)

In [None]:
# plotting evaluation steps
history_df = pd.DataFrame(history.history)
history_df[['loss', 'val_loss']].plot()
history_df[['binary_accuracy', 'val_binary_accuracy']].plot()

print(("Best Validation Loss: {:0.4f}" +\
      "\nBest Validation Accuracy: {:0.4f}")\
      .format(history_df['val_loss'].min(), 
              history_df['val_binary_accuracy'].max()))

In [None]:
# EarlyStopping only applies to training, so no callbacks are passed here
predictions = model.predict(tdata, verbose = 0).reshape(-1)


result1 = pd.DataFrame({'id':testid, 'target': predictions})
result1.to_csv('./submission2.csv', index=False)

In [None]:
# Early stopping parameters
early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=30, # how many epochs to wait before stopping
    restore_best_weights=True,
)

# model structure
model = keras.Sequential([
    layers.BatchNormalization(input_dim = 207),
    layers.Dense(units = 256, activation = "swish"),
    layers.BatchNormalization(),
    
    layers.Dense(units = 256, activation = "relu"),
    layers.Dropout(0.25),
    layers.BatchNormalization(),
    
    layers.Dense(units = 64, activation = "relu"),
    layers.Dropout(0.125),
    layers.BatchNormalization(),
    
    layers.Dense(units = 16, activation = "relu"),
    layers.Dropout(0.125),
    layers.BatchNormalization(),

    layers.Dense(units = 1, activation = "sigmoid") # binary output
])

# evaluation methods
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = ['binary_accuracy']
)

# training
history = model.fit(X_train3, y_train3,
                    validation_data = (X_val3, y_val3),
                    batch_size = 100,
                    callbacks = [early_stopping],
                    epochs = 200)

In [None]:
# plotting evaluation steps
history_df = pd.DataFrame(history.history)
history_df[['loss', 'val_loss']].plot()
history_df[['binary_accuracy', 'val_binary_accuracy']].plot()

print(("Best Validation Loss: {:0.4f}" +\
      "\nBest Validation Accuracy: {:0.4f}")\
      .format(history_df['val_loss'].min(), 
              history_df['val_binary_accuracy'].max()))

In [None]:
# EarlyStopping only applies to training, so no callbacks are passed here
predictions = model.predict(tdata, verbose = 0).reshape(-1)


result1 = pd.DataFrame({'id':testid, 'target': predictions})
result1.to_csv('./submission3.csv', index=False)