# We are going to compare two Deep Learning AutoML libraries on a very difficult classification dataset: German Credit Data. 
## The first AutoML library we will try is Autokeras. You can see their web site here:
https://autokeras.com/
## Autokeras was developed by DATA Lab at Texas A&M University. It has thousands of stars on GitHub and is maintained by a large community of contributors.

## The next AutoML library we will try is: Deep AutoViML.
<img src="https://github.com/AutoViML/deep_autoviml/raw/master/logo.jpg" alt="banner"/>

## Deep AutoViML is a brand-new library built from the ground up on the latest TensorFlow and Keras technology. It uses the recently released Keras preprocessing layers and is based on TensorFlow 2.5.

We will use the same train-test split for both libraries, with the same random_state. The final comparison will be made on the same held-out test set.
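
To make the "same split" idea concrete, here is a minimal sketch on hypothetical toy data (not the German Credit file). It assumes an integer seed of 1 and simply checks that two calls with the same seed return the same held-out rows, and that stratification preserves the class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features, 70/30 class balance.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 70 + [0] * 30)

def split(seed):
    # 80/20 stratified split with an explicit integer seed.
    return train_test_split(X, y, test_size=0.20, random_state=seed, stratify=y)

X_tr_a, X_te_a, y_tr_a, y_te_a = split(1)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(1)

# Same seed, same arguments: the held-out rows should be identical.
same_split = np.array_equal(X_te_a, X_te_b)

# Stratification keeps the share of positives in the test set close to 70%.
test_positive_rate = y_te_a.mean()
print(same_split, test_positive_rate)
```

Because both libraries are fed arrays produced by one such call, any difference in test accuracy should come from the models rather than from the split.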

# If you want to see more on German Credit, you can see another great notebook by Marilia here:

https://www.kaggle.com/mpwolke/creditability-deep-autoviml

In [None]:
!pip install deep_autoviml

In [None]:
from deep_autoviml import deep_autoviml as deepauto

In [None]:
!pip install git+https://github.com/keras-team/keras-tuner.git

In [None]:
!pip install autokeras

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import autokeras as ak

# Please make sure you install deep_autoviml first and then autokeras.
Otherwise, the next few steps will raise errors.
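
If the imports do fail, a quick look at which versions ended up installed can help. The helper below is a hypothetical convenience function (it is not part of either library) that uses only the standard library; the package names passed to it are the ones installed above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

print(installed_versions(["deep_autoviml", "autokeras", "keras-tuner", "tensorflow"]))
```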

In [None]:
import tensorflow as tf
tf.__version__

In [None]:
TRAIN_DATA_URL = '/kaggle/input/cusersmarildownloadsgermancsv/german.csv'

In [None]:
df=pd.read_csv(TRAIN_DATA_URL,encoding ='ISO-8859-1',sep=";")
print(df.shape)
df.head()

In [None]:
# Initialize the structured data classifier.
clf = ak.StructuredDataClassifier(
    overwrite=True, max_trials=5)

In [None]:
y=df[['Creditability']].to_numpy()
y[:5]

In [None]:
x=df.loc[:,'Duration_of_Credit_monthly':].to_numpy()
x[:2]

In [None]:
# 80/20 stratified split; the explicit integer seed keeps it reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=1, stratify=y)
x_train.shape,x_test.shape,y_train.shape,y_test.shape

In [None]:
y_train=y_train.reshape((-1))
y_train.shape

In [None]:
y_test=y_test.reshape((-1))
y_test.shape

In [None]:
clf

In [None]:
# Search over up to max_trials candidate models, training each for 5 epochs.
clf.fit(x_train, y_train, epochs=5)

# In autokeras, you can set the number of epochs and max_trials. We set both quite low but still got reasonable results on validation: the validation accuracy is 70%.
![image.png](attachment:7315612c-e9ac-46b6-91f7-805298b3406b.png)

In [None]:
# Predict with the best model.
predicted_y = clf.predict(x_test).ravel()
predicted_y[:4]

In [None]:
y_test[:4]

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print('Accuracy = %0.0f%%' %(100*accuracy_score(y_test, predicted_y)))
print('Confusion Matrix: \n%s' %confusion_matrix(y_test, predicted_y))
print('Classification Report: \n%s' %classification_report(y_test, predicted_y))

## The results on the test set are pretty good, with 68% accuracy.
![image.png](attachment:250524f5-83f9-4e8f-9b7d-8768e44a95f8.png)
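
To read a report like this, it helps to see how the headline numbers follow from the four cells of the confusion matrix. The sketch below uses hypothetical labels and predictions (not this notebook's results) and also computes the accuracy of always predicting the majority class, which is a useful reference point for any accuracy figure.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, NOT the results shown in this notebook.
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 1, 1, 0, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all rows predicted correctly
precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of true positives that were found

# Accuracy of a model that always predicts the more frequent class.
positive_rate = y_true.mean()
majority_baseline = max(positive_rate, 1 - positive_rate)

print(accuracy, precision, recall, majority_baseline)
```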

## We can see that autokeras provides good results. However, it has some major limitations:
1. You need to convert all your data to numeric values. If you use string or text or NLP columns, it will fail. This is a laborious step that you must do yourself.
2. You must clean your data before feeding it to autokeras.
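
As an illustration of that manual step, here is a minimal sketch of one common way to turn string columns into numbers with pandas before handing an array to a model. The frame and its column names are hypothetical and are not taken from german.csv.

```python
import pandas as pd

# Hypothetical raw frame; column names are illustrative only.
raw = pd.DataFrame({
    "purpose": ["car", "furniture", "car", "education"],
    "housing": ["own", "rent", "own", "free"],
    "amount": [1200, 3400, 560, 2100],
})

# One-hot encode the string columns; numeric columns pass through unchanged.
categorical = ["purpose", "housing"]
encoded = pd.get_dummies(raw, columns=categorical, dtype="float32")

# A purely numeric array, ready to be passed as x to a classifier.
x_numeric = encoded.to_numpy(dtype="float32")
print(encoded.columns.tolist())
print(x_numeric.shape)
```

The same encoding has to be reproduced at prediction time, which is exactly the kind of bookkeeping that has to be done by hand here.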

# Let us now compare the results to Deep_AutoViML, which we installed at the top of this notebook
1. The important advantage of using deep_autoviml is that you don't have to clean your data. You can feed it as it is.
2. You don't have to preprocess text or categorical or string / NLP columns. It will handle preprocessing automatically.
3. The best part is that your model comes with the preprocessing steps as Keras layers. So you can immediately deploy the model and predict on raw test data without any separate preprocessing, since the model does that automatically.
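
The idea in point 3, a model that carries its own preprocessing, can be illustrated with a scikit-learn `Pipeline`. To be clear, this is only an analogy: it is not how deep_autoviml works internally (that library uses Keras preprocessing layers), and the data and column names below are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with one string column and one numeric column.
train_raw = pd.DataFrame({
    "purpose": ["car", "furniture", "car", "education", "car", "furniture"],
    "amount": [1200.0, 3400.0, 560.0, 2100.0, 900.0, 4000.0],
    "label": [1, 0, 1, 0, 1, 0],
})

# Preprocessing is declared once and stored inside the fitted model object.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["purpose"]),
    ("num", StandardScaler(), ["amount"]),
])
bundled_model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
bundled_model.fit(train_raw[["purpose", "amount"]], train_raw["label"])

# Raw, unencoded rows go straight in; an unseen category ("boat") is tolerated.
new_rows = pd.DataFrame({"purpose": ["car", "boat"], "amount": [1000.0, 3000.0]})
bundled_preds = bundled_model.predict(new_rows)
print(bundled_preds)
```

The deployment benefit is the same in both cases: whoever calls `predict` does not need to know which encodings were applied during training.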

In [None]:
deepauto

In [None]:
################################################################################
keras_model_type =  "fast" ## always try "fast", then "fast1", "fast2" and "auto"
### always set early_stopping to True first and then change it to False
#### You always need 15 max_trials to get something decent #####
keras_options = {"early_stopping": True, 'lr_scheduler': ''}  
#### always set tuner to "storm" and then "optuna". 
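# nlp_char_limit: string columns longer than this many characters are treated as NLP text.
# cat_feat_cross_flag: turns on categorical feature crosses; leave it off at first.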
model_options = {'tuner':"storm", "max_trials": 5, 'nlp_char_limit':10,
                 'cat_feat_cross_flag':False, }
project_name = 'German_Credit' ### this is the folder where the model will be saved
################################################################################

In [None]:
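# Use the same feature columns that were sliced into x and the same target used for y.
preds = df.loc[:, 'Duration_of_Credit_monthly':].columns.tolist()
targets = ['Creditability']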
target = targets[0]
target

In [None]:
train = pd.DataFrame(np.c_[x_train,y_train], index=range(len(x_train)), columns = preds+targets)
test = pd.DataFrame(np.c_[x_test,y_test], index=range(len(x_test)), columns = preds+targets)
print(train.shape, test.shape)
train.head(2)
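
For readers unfamiliar with `np.c_`, the sketch below shows what the cell above does on hypothetical miniature arrays: the 1-D label vector is appended as a final column, and the combined array takes a single common dtype. The column names here are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature arrays standing in for x_train and y_train.
x_small = np.array([[6, 1049], [12, 2799], [24, 841]])
y_small = np.array([1, 0, 1])

# np.c_ appends the 1-D label vector as the last column.
stacked = np.c_[x_small, y_small]
frame = pd.DataFrame(stacked, columns=["duration", "amount", "target"])

# Caveat: a NumPy array has one dtype, so float features turn the labels into floats too.
mixed = np.c_[x_small.astype(float), y_small]
print(frame)
print(mixed.dtype)
```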

In [None]:
model, cat_vocab_dict = deepauto.fit(train, target, keras_model_type=keras_model_type,
                                     project_name=project_name, keras_options=keras_options,
                                     model_options=model_options, save_model_flag=True,
                                     use_my_model='', model_use_case='', verbose=0)

# You can see that precision is 76% and recall is 97% on the validation dataset, with an overall accuracy of 75%. This beats the autokeras validation accuracy of 70%.
![image.png](attachment:999013f1-aa2c-47c8-96ee-2378e6db8670.png)

# Here is the Deep Learning model that deep_autoviml built:
![image.png](attachment:b5c3b47a-e900-4570-b419-05ec068b7755.png)

In [None]:
predictions = deepauto.predict(model, project_name, test_dataset=test,
                                 keras_model_type=keras_model_type, 
                                 cat_vocab_dict=cat_vocab_dict)

In [None]:
y_test = test[target].values
y_test[:4]

In [None]:
y_preds = predictions[-1]
y_preds[:4]

## We will now test the model on the held-out test dataset.
Accuracy drops a bit, probably because the model overfit on such a small dataset. However, we can try other settings such as keras_model_type="fast1" or "fast2" and see whether we get better results.

In [None]:
from deep_autoviml import print_classification_model_stats
from sklearn.metrics import accuracy_score
print('Accuracy = %0.0f%%' %(100*accuracy_score(y_test, y_preds)))
print_classification_model_stats(y_test, y_preds)

## The results on the test set are similarly good: 70% accuracy
![image.png](attachment:1c3e1713-208a-492c-801e-28451600d3bf.png)
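
With a held-out set this small, it is worth checking how much an accuracy figure can move by chance alone. The sketch below uses the standard normal approximation for a proportion. It assumes 200 test rows, which is what a 20% split gives if the file is the usual 1,000-row German Credit data (the shape is printed earlier but not reproduced here).

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation interval (95% for z=1.96) for an accuracy measured on n rows."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error

# Assumption: 200 held-out rows (20% of a 1,000-row file).
low, high = accuracy_interval(0.70, 200)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval is several points wide on each side, so a gap of a couple of points between two models on this test set is best read as roughly equal performance.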

# So deep_autoviml has the same or better performance in less than 1 minute for German Credit Data. But the biggest advantages of deep_autoviml are the following:
1. The important advantage of using deep_autoviml is that you don't have to clean your data. You can feed it as it is.
2. You don't have to preprocess text or categorical or string / NLP columns. It will handle preprocessing automatically.
3. The best part is, your model comes with the preprocessing steps as keras layers. So you can immediately deploy the model and predict on raw test data without any preprocessing steps since the model will do that automatically

# Hope this notebook was helpful. If you liked it, please upvote it.

Please see more notebooks by my friend Marilia Prata for German Credit Data

XBNet Classifier on German Credit Data:
https://www.kaggle.com/mpwolke/xbnet-creditability

Deep_AutoViML on German Credit Data:
https://www.kaggle.com/mpwolke/creditability-deep-autoviml

