## Welcome to DeepTables

DeepTables: Deep-learning Toolkit for Tabular Data

DeepTables (DT) is an easy-to-use toolkit that brings the power of deep learning to tabular data.

## Overview
MLPs (also known as fully connected neural networks) have been shown to be inefficient at learning distribution representations. The "add" operations of the perceptron layer perform poorly at capturing multiplicative feature interactions. In most cases manual feature engineering is therefore necessary, and this work is cumbersome and requires extensive domain knowledge. How to learn feature interactions efficiently in neural networks has become the key problem.
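To make "multiplicative feature interactions" concrete, here is a minimal NumPy sketch of the second-order interaction term used by factorization machines. It is not taken from DeepTables; the function names and the toy embedding matrix are assumptions chosen for illustration. The term sums the dot products of every pair of field embeddings, a signal that a single additive layer cannot express on its own.

```python
from itertools import combinations

import numpy as np


def fm_pairwise_interaction(embeddings):
    """Sum of dot products over all pairs of field embeddings.

    embeddings: array of shape (num_fields, embedding_dim).
    Uses the identity
        sum_{i<j} <v_i, v_j> = 0.5 * (||sum_i v_i||^2 - sum_i ||v_i||^2),
    which avoids the explicit loop over pairs.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    square_of_sum = np.square(embeddings.sum(axis=0))
    sum_of_squares = np.square(embeddings).sum(axis=0)
    return 0.5 * float((square_of_sum - sum_of_squares).sum())


def brute_force_interaction(embeddings):
    """Reference implementation: loop over every pair of fields explicitly."""
    embeddings = np.asarray(embeddings, dtype=float)
    return float(sum(np.dot(a, b) for a, b in combinations(embeddings, 2)))


# Hypothetical embeddings for three fields, each of dimension two.
toy_embeddings = [[1, 2], [3, 4], [5, 6]]
print(fm_pairwise_interaction(toy_embeddings))   # 67.0
print(brute_force_interaction(toy_embeddings))   # 67.0
```

The closed form costs time linear in the number of fields, which is one reason interaction layers of this kind are practical on wide tabular data.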

Many models have been proposed for CTR (click-through rate) prediction, and in recent years they have continued to outperform the previous state-of-the-art approaches. Well-known examples include FM, DeepFM, Wide&Deep, DCN, and PNN. Used appropriately, these models can also perform well on tabular data.

DT aims to use the latest research findings to give users an end-to-end toolkit for tabular data.

DT has been designed with these key goals in mind:

- Easy to use, even for non-experts.
- Good performance out of the box.
- Flexible architecture that users can easily extend.

DT follows these steps to build a neural network:

1. Categorical features -> Embedding layer.
2. Continuous features -> Dense layer, or Embedding layer after discretization/categorization.
3. Embedding/Dense layers -> feature interaction/extraction nets.
4. Stack (add/concat) the outputs of the nets to form the output of the model.
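The data flow in these steps can be sketched with plain NumPy. This is only an illustration of the shapes involved, not DeepTables' internal implementation: the weights are random, and the toy batch, layer sizes, and variable names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch of 4 rows: one categorical feature (3 categories) and two continuous features.
cat_ids = np.array([0, 2, 1, 2])
cont = rng.normal(size=(4, 2))

# Step 1: categorical feature -> embedding lookup.
embedding_table = rng.normal(size=(3, 4))    # 3 categories, embedding dim 4
cat_emb = embedding_table[cat_ids]           # shape (4, 4)

# Step 2: continuous features -> dense layer.
w_dense = rng.normal(size=(2, 4))
cont_dense = cont @ w_dense                  # shape (4, 4)

# Step 3: pass the combined representation to two small "nets".
features = np.concatenate([cat_emb, cont_dense], axis=1)   # shape (4, 8)
w_linear = rng.normal(size=(8, 1))
linear_out = features @ w_linear                           # linear net, shape (4, 1)
w_hidden = rng.normal(size=(8, 16))
w_out = rng.normal(size=(16, 1))
dnn_out = np.maximum(features @ w_hidden, 0.0) @ w_out     # one hidden ReLU layer, shape (4, 1)

# Step 4: stack the net outputs, either by adding or by concatenating them.
stacked_add = linear_out + dnn_out                               # shape (4, 1)
stacked_concat = np.concatenate([linear_out, dnn_out], axis=1)   # shape (4, 2)

print(stacked_add.shape, stacked_concat.shape)
```

With `add`, every net must emit outputs of the same shape; `concat` keeps them side by side, so a final layer is needed to reduce them to a single prediction.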

In [None]:
import numpy as np
import pandas as pd
import tensorflow
import tensorflow.keras.layers as layers
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

In [None]:
train = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv')

In [None]:
import math
def convert(df):
    """Add integer-part and fractional-part columns for cont1..cont14 (modifies df in place)."""
    cont_cols = [f'cont{i}' for i in range(1, 15)]

    # Integer part (truncated toward zero), e.g. cont3 -> cont3_int
    for col in cont_cols:
        df[f'{col}_int'] = df[col].astype(int)

    # Fractional part, e.g. cont3 -> cont3_in
    for col in cont_cols:
        df[f'{col}_in'] = df[col].apply(lambda x: math.modf(x)[0])
    return df
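As a quick illustration of what `convert` computes for each value, the sketch below splits a single number into the two parts stored in the `_int` and `_in` columns. The helper name `split_value` is an assumption introduced for this example only.

```python
import math


def split_value(x):
    """Return (integer part, fractional part) of x, mirroring one column of convert."""
    return int(x), math.modf(x)[0]


print(split_value(2.75))    # (2, 0.75)
print(split_value(-1.25))   # (-1, -0.25)
```

Both parts keep the sign of the input because truncation goes toward zero, so a negative value yields a negative fractional part.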

In [None]:
train = convert(train)
test = convert(test)

In [None]:
numerical_cols = [f'cont{i}' for i in range(1, 15)]
target_col = 'target'

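# Standardize each continuous column: fit the scaler on train only,
# then apply the same transformation to test.
for c in numerical_cols:
    prep = StandardScaler()
    train[c] = prep.fit_transform(train[[c]])
    test[c] = prep.transform(test[[c]])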

X_train = train.drop(['id', 'target'], axis=1)
y_train = train['target']
X_test = test.drop('id', axis=1)

In [None]:
X_train

In [None]:
# import seaborn as sns
# import matplotlib.pyplot as plt
# plt.figure(figsize = (20,6))
# sns.countplot(x = 'cont1', hue = 'target', data = train)

In [None]:
corr = train.corr(method = 'pearson')
corr = corr.abs()
corr.style.background_gradient(cmap='inferno')

In [None]:
from fastai import *
from fastai.tabular import *

In [None]:
train = train.sort_values(by='target', ascending=False)
train = train.reset_index(drop=True)

In [None]:
train

In [None]:
# train.target = np.log(train.target)

In [None]:
def RMSE_fn(y_true, y_pred):
    return np.sqrt(np.mean(np.power(np.array(y_true, float).reshape(-1, 1) - np.array(y_pred, float).reshape(-1, 1), 2)))
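A small sanity check helps confirm that a metric like this behaves as expected. The sketch below is a standalone re-implementation under the assumed name `rmse`, not the notebook's `RMSE_fn`. It flattens both inputs first, which serves the same purpose as the `reshape(-1, 1)` calls above: subtracting an array of shape `(n,)` from one of shape `(n, 1)` would otherwise broadcast to an `(n, n)` matrix and silently give a wrong score.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error; inputs are flattened so their shapes always match."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


print(rmse([1, 2, 3], [1, 2, 3]))     # 0.0
print(rmse([[0], [0]], [2, 2]))       # 2.0
```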

In [None]:
# cv = KFold(n_splits=5, shuffle=True, random_state=7)

# y_preds = []
# models = []
# oof_train = np.zeros((len(X_train),))

# for fold_id, (train_index, valid_index) in enumerate(cv.split(X_train, y_train)):
#     X_tr = X_train.loc[train_index, :]
#     X_val = X_train.loc[valid_index, :]
#     y_tr = y_train.loc[train_index]
#     y_val = y_train.loc[valid_index]

#     model = tensorflow.keras.Sequential([
#         layers.Dense(64, activation='relu'),
#         layers.Dense(16, activation='relu'),
#         layers.Dense(1, activation='linear'),
#     ])

#     model.compile(
#         optimizer='adam',
#         loss='mse',
#         metrics=[tensorflow.keras.metrics.RootMeanSquaredError()]
#     )

#     early_stopping = tensorflow.keras.callbacks.EarlyStopping(
#         patience=10,
#         min_delta=0.001,
#         restore_best_weights=True,
#     )

#     model.fit(
#         X_tr, y_tr,
#         validation_data=(X_val, y_val),
#         batch_size=30000,
#         epochs=1000,
#         callbacks=[early_stopping],
#     )

#     oof_train[valid_index] = model.predict(X_val).reshape(1, -1)[0]
#     y_pred = model.predict(X_test).reshape(1, -1)[0]

#     y_preds.append(y_pred)
#     models.append(model)


In [None]:
# print(f'CV: {mean_squared_error(y_train, oof_train, squared=False)}')

In [None]:
# sub = pd.read_csv('../input/tabular-playground-series-jan-2021/sample_submission.csv')
# y_sub = sum(y_preds) / len(y_preds)
# sub['target'] = y_sub
# sub.to_csv('submission.csv', index=False)
# sub.head()

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import datetime
# from kaggle.competitions import nflrush
import tqdm
import re
from string import punctuation
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import keras
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping
from keras.utils import plot_model
import keras.backend as K
import tensorflow as tf

sns.set_style('darkgrid')
mpl.rcParams['figure.figsize'] = [15,10]


## Why use DeepTables?
- Free preprocessing and processing.
  - Easy for both expert data scientists and business analysts without modeling experience.
  - Simpler than traditional machine learning algorithms, which depend heavily on manual feature engineering.
- Excellent performance out of the box.
  - Built-in neural network components (NETs) drawn from the best research results of recent years.
- Extremely easy to use.
  - Only 5 lines of code are needed to model any data set.
- Very open architecture design.
  - Supports plug-in extensions.

In [None]:
!pip install deeptables

In [None]:
import numpy as np
from deeptables.models import deeptable, deepnets
from deeptables.datasets import dsutils
from sklearn.model_selection import train_test_split

In [None]:
# df = dsutils.load_bank()
# df_train, df_test = train_test_split(train, test_size=0.2, random_state=42)

In [None]:
# Training
# Alternative: config = deeptable.ModelConfig(nets=deepnets.xDeepFM, earlystopping_patience=15, metrics=["RootMeanSquaredError"])
config = deeptable.ModelConfig(
    nets=['linear', 'cin_nets', 'dnn_nets'],  # nets whose outputs are combined
    stacking_op='add',                        # combine the nets by adding their outputs
    earlystopping_patience=15,
    metrics=["RootMeanSquaredError"],
)

dt = deeptable.DeepTable(config=config)
model, history = dt.fit(X_train, y_train, epochs=25)

In [None]:
X_test

In [None]:
X_test = test.drop('id', axis=1)


In [None]:
test

In [None]:
y_pred = dt.best_model.predict(X_test)

In [None]:
preds=dt.predict(test.iloc[:,1:])

In [None]:
sub=pd.read_csv("../input/tabular-playground-series-jan-2021/sample_submission.csv")
sub.target = preds
sub.to_csv("submission.csv", index=False)

In [None]:
sub