## AutoML (automating the search for the best model)
### For the classification problem of predicting credit risk, use AutoKeras to arrive at the best neural network architecture.


In [None]:
!pip install autokeras

In [None]:
from sklearn.datasets import fetch_openml
import autokeras as ak

d = fetch_openml("credit-g")
X = d["data"]
Y_raw = d["target"]

# AutoKeras accepts string labels, so we don't encode the target labels.


#TBD Split into train and test set
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer

X_train, X_test, Y_train, Y_test = train_test_split(X, Y_raw, test_size=0.25, random_state=101)

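# Rescale the inputs. Note: Normalizer scales each sample (row) to unit norm;
# it does not standardize the features column by column.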
scaler = Normalizer()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

classifier = ak.StructuredDataClassifier(overwrite=True, max_trials=3)

#TBD Fit on train set
classifier.fit(X_train, Y_train, epochs=100)
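
One caveat on the scaling step in the cell above: scikit-learn's `Normalizer` rescales each sample (row) to unit norm, whereas `StandardScaler` standardizes each feature (column). A minimal sketch on a made-up array (the values are illustrative and not taken from the credit data) shows the difference:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy data (hypothetical): two features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 600.0]])

# Normalizer works row by row: every sample ends up with L2 norm 1.
X_rows = Normalizer().fit_transform(X_toy)
row_norms = np.linalg.norm(X_rows, axis=1)

# StandardScaler works column by column: every feature ends up with mean 0 and std 1.
X_cols = StandardScaler().fit_transform(X_toy)
col_means = X_cols.mean(axis=0)
col_stds = X_cols.std(axis=0)

print(row_norms)
print(col_means, col_stds)
```

If the intent is to put features on a comparable scale before training, `StandardScaler` is probably the closer match; which one helps more on this dataset would need to be checked empirically.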

In [None]:
#TBD Evaluate on test set
classifier.evaluate(X_test, Y_test)

In [None]:
## TBD: Show the best architecture found by AutoKeras
model = classifier.export_model()
model.summary()

In [None]:
## TBD: Can you beat the evaluation score of the architecture above with any manually selected model (including non-neural-network classifiers)?

# It should be fairly easy to beat the classifier found by AutoKeras, for two reasons:
# first, we searched only a small solution space (max_trials=3); second, the search is limited to
# neural networks, while ensemble methods such as XGBoost often perform better on structured-data tasks.
# An accuracy of about 75% is not a great score, so it should be beatable, and with further
# feature engineering and a proper ML pipeline we could probably do much better.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

Y = LabelEncoder().fit_transform(Y_raw)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=101)

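# Rescale the inputs. Note: Normalizer scales each sample (row) to unit norm;
# it does not standardize the features column by column.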
scaler = Normalizer()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

model = GradientBoostingClassifier()
model.fit(X_train, Y_train)

Y_hat = model.predict(X_test)
accuracy_score(Y_test, Y_hat)
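
The "feature engineering and ML pipeline" idea mentioned in the comments could look roughly like the sketch below: one-hot encode the categorical columns, scale the numeric ones, and score the whole pipeline with cross-validation rather than a single split. To stay self-contained it runs on a small synthetic frame; the column names (`duration`, `amount`, `purpose`) and the target rule are made up for illustration and are not the credit-g schema.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(101)
n = 200

# Synthetic stand-in for a credit table (hypothetical columns and target rule).
df = pd.DataFrame({
    "duration": rng.integers(6, 60, size=n),
    "amount": rng.normal(3000, 1000, size=n),
    "purpose": rng.choice(["car", "education", "furniture"], size=n),
})
y = np.where(df["duration"] > 30, "bad", "good")

numeric_cols = ["duration", "amount"]
categorical_cols = ["purpose"]

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", GradientBoostingClassifier(random_state=101)),
])

# 5-fold cross-validated accuracy; the preprocessing is refit inside each fold.
scores = cross_val_score(pipeline, df, y, cv=5, scoring="accuracy")
print(scores.mean())
```

On the real data, `df` and `y` would be replaced by the credit-g features and labels, with the two column lists chosen from the actual column types. Because the test split here has only 250 rows, a cross-validated score is likely a steadier basis for comparing models than one accuracy number.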



## Auto Data Understanding
[TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv)

## Auto Data Exploration
[pandas-profiling](https://github.com/pandas-profiling/pandas-profiling)