### Introduction

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with predicting the biological response of molecules given various chemical properties. Although the features are anonymized, they have properties relating to real-world features.

In this notebook, we'll compare the performance of several algorithms and choose the best one among them.

#### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn import metrics
from sklearn.metrics import confusion_matrix, roc_auc_score, auc, matthews_corrcoef, f1_score 
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.ensemble import AdaBoostClassifier  # RandomForestClassifier is already imported above
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
import pickle
import tensorflow as tf
from sklearn.model_selection import RandomizedSearchCV

%matplotlib inline

#### Reading Data

In [None]:
train=pd.read_csv('../input/tabular-playground-series-oct-2021/train.csv')

In [None]:
train.head()

In [None]:
sns.countplot(x="target", data=train);

In [None]:
train.info()

In [None]:
train.dtypes.to_dict()

The data doesn't contain any categorical variables, so we can move ahead. Now let's check for null values.

In [None]:
[i for i in train.columns if train[i].isna().any()]

### Insights from data

- The complete data contains 1,000,000 rows and 287 columns, of which 285 are feature columns (the other two are `id` and `target`).
- The target column is either 0 or 1, and the two classes are approximately balanced.
- There are no null values
- There are no categorical variables

#### Reduce Memory usage

There are a million rows with 285 feature columns, which take around 2.1 GB of RAM. It is therefore worth reducing memory usage by converting the columns to data types with lower precision (for example, `float64` to `float32`).

Here's a quick comparison of integer data types and the range of values each can store:
- int8 can store integers from -128 to 127.
- int16 can store integers from -32768 to 32767.
- int64 can store integers from -9223372036854775808 to 9223372036854775807.
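
The ranges listed above do not have to be memorized; NumPy can report them. The following is a small sketch using `np.iinfo` and `np.finfo`; the helper name `dtype_range` is made up for illustration and is not part of the original notebook.

```python
import numpy as np

def dtype_range(dtype):
    """Return (min, max) for an integer or floating-point NumPy dtype."""
    dtype = np.dtype(dtype)
    # iinfo describes integer types, finfo describes floating-point types
    if np.issubdtype(dtype, np.integer):
        info = np.iinfo(dtype)
    else:
        info = np.finfo(dtype)
    return info.min, info.max

for name in ["int8", "uint8", "int16", "int64", "float32", "float64"]:
    low, high = dtype_range(name)
    print(f"{name:>8}: {low} to {high}")
```

Checking the observed minimum and maximum of a column against these limits before casting is one way to make sure a smaller type can actually hold the data.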

In [None]:
X = train.iloc[:,1:-1]
y = train.target

In [None]:
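def ReduceMem(data):
    """Downcast columns in place to smaller dtypes and report memory usage."""
    feature_cols = data.columns.tolist()

    print("Memory usage before: ", data.memory_usage(deep=True).sum()/(1024**3), "GB")
    for col in feature_cols:
        if data[col].dtype == 'float64':
            # float32 halves the memory at the cost of some precision
            data[col] = data[col].astype('float32')
        else:
            # Assumes the remaining (integer) columns only hold values in 0..255;
            # larger or negative values would wrap around silently
            data[col] = data[col].astype('uint8')

    print("Memory usage after: ", data.memory_usage(deep=True).sum()/(1024**3), "GB")

    return data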

In [None]:
X = ReduceMem(X)
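
To see what this kind of downcasting does without loading the full dataset, here is a minimal sketch on a tiny synthetic frame. The column names and values are invented for illustration; only the dtype conversions mirror `ReduceMem`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the competition data (shape and values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "f0": rng.random(1000),               # float64 by default
    "f1": rng.random(1000),               # float64 by default
    "f2": rng.integers(0, 2, size=1000),  # int64 by default, values 0 or 1
})

before = demo.memory_usage(deep=True).sum()
small = demo.astype({"f0": "float32", "f1": "float32", "f2": "uint8"})
after = small.memory_usage(deep=True).sum()

print("bytes before:", before)
print("bytes after: ", after)
# float32 keeps roughly 7 significant decimal digits, so small rounding errors appear
print("max rounding error in f0:", np.abs(demo["f0"] - small["f0"]).max())
```

The binary column survives the cast unchanged, while the float columns pick up a small rounding error; for tree-based models this is usually an acceptable trade, though that is a judgment call rather than a guarantee.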

### Data Split

The provided test set doesn't contain a target variable, so we'll hold out part of the training data as a test set to evaluate model performance.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)

In [None]:
X_train.shape

In [None]:
X_test.shape
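
The split above passes `stratify=y`, which keeps the class proportions the same in both parts. The sketch below illustrates this on made-up, deliberately imbalanced labels (the `*_demo` names are hypothetical); the competition data is roughly balanced, so the effect there is smaller.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels (made up for illustration)
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

# With stratification the positive rate is preserved in both parts
print("overall positive rate:", y_demo.mean())
print("train positive rate:  ", y_tr.mean())
print("test positive rate:   ", y_te.mean())
```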

### Feature Selection

As there are a lot of features, let's try selecting only the features that contribute most to the final output.

In [None]:
model = ExtraTreesClassifier()
model.fit(X_train,y_train)

In [None]:
filename = 'feature_model.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(model, f)

In [None]:
#model = pickle.load(open('feature_model.sav', 'rb'))

In [None]:
important_features = pd.Series(model.feature_importances_, index=X_train.columns)
important_features_sorted = important_features.sort_values(ascending=False)
important_features_sorted

Plotting the 10 most important features:

In [None]:
plt.figure(figsize=(10,8))
features = pd.Series(important_features_sorted[:10], index=important_features_sorted.index[:10])
features.plot(kind='barh')
plt.show()

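We use the `SelectFromModel` object from sklearn to automatically select the features. With a tree-based estimator and no explicit threshold, it keeps the features whose importance is at least the mean importance.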

In [None]:
model1 = SelectFromModel(model, prefit=True)

In [None]:
#Training
X_train_imp = model1.transform(X_train)
X_train_imp.shape

In [None]:
#Testing
X_test_imp = model1.transform(X_test)
X_test_imp.shape

Here is the list of features selected by `SelectFromModel`:

In [None]:
X_train.columns[model1.get_support()]

In [None]:
X_test.columns[model1.get_support()]
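
To make the selection rule concrete, the following sketch repeats the procedure on a small synthetic problem built with `make_classification`. The data, the `n_estimators=50` setting and all variable names are assumptions for illustration and are not taken from the notebook. It checks that the support mask matches "importance >= mean importance", which is the default threshold for estimators like `ExtraTreesClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Small synthetic problem: 20 features, only 4 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
selector = SelectFromModel(forest, prefit=True)  # threshold=None means "mean" here

importances = forest.feature_importances_
support = selector.get_support()  # boolean mask over the 20 features

print("threshold (mean importance):", importances.mean())
print("features kept:", int(support.sum()), "of", X_demo.shape[1])
# Boolean indexing keeps the same columns the selector marks as supported
print("reduced shape:", X_demo[:, support].shape)
```

Passing an explicit `threshold` (for example `"median"` or a float) or `max_features` to `SelectFromModel` changes how many features survive, which may be worth trying if the default keeps too few.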

## Classification

In [None]:
result = []

We are going to use different models and choose the best performing model among those.

In [None]:
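def print_roc_auc_score(model, test_data, label):
    """Plot the ROC curve for `model` on `test_data` and return the ROC AUC."""
    # Probability of the positive class (column 1 of predict_proba)
    y_pred = model.predict_proba(test_data)[:, 1]
    # Named auc_score so it does not shadow the imported sklearn.metrics.auc
    auc_score = metrics.roc_auc_score(label, y_pred)
    fpr, tpr, _ = metrics.roc_curve(label, y_pred)
    plt.plot(fpr, tpr, label="ROC curve (AUC={:.3f})".format(auc_score))
    plt.plot([0, 1], [0, 1], 'k--')  # chance-level diagonal
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(loc=4)
    plt.show()
    print("AUC Score :", auc_score)

    return auc_score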
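
As a sanity check on what the helper reports, the sketch below computes the ROC AUC three ways on a four-sample toy example: directly with `roc_auc_score`, as the area under the `roc_curve` points, and as the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The toy labels and scores are invented for illustration.

```python
import numpy as np
from sklearn import metrics

# Toy labels and predicted scores (made up for illustration)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# 1) Direct computation
area_direct = metrics.roc_auc_score(y_true, y_score)

# 2) Trapezoidal area under the ROC curve points
fpr, tpr, _ = metrics.roc_curve(y_true, y_score)
area_from_curve = metrics.auc(fpr, tpr)

# 3) Ranking view: share of positive/negative pairs ordered correctly
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(area_direct, area_from_curve, pairwise)
```

Three of the four positive/negative pairs are ordered correctly here, so all three values come out as 0.75. The ranking view also explains why the helper uses `predict_proba` rather than hard class predictions: AUC depends only on how the scores order the samples.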

### Logistic Regression

#### Selected features

In [None]:
#train the model
logistic_model = LogisticRegression()
logistic_model.fit(X_train_imp, y_train);
lr = print_roc_auc_score(logistic_model, X_test_imp, y_test)
result.append(["Logistic regression",lr])

#### All features

In [None]:
#train the model
logistic_model_all = LogisticRegression(max_iter=500)
logistic_model_all.fit(X_train, y_train);
lr_all = print_roc_auc_score(logistic_model_all, X_test, y_test)
result.append(["Logistic regression all features",lr_all])

### XGBoost

#### Selected features

In [None]:
# eval_metric is set on the estimator; recent XGBoost releases no longer accept it in fit()
xgb = XGBClassifier(random_state=10, use_label_encoder=False, eval_metric='aucpr')
xgb.fit(X_train_imp, y_train);
xgboost = print_roc_auc_score(xgb, X_test_imp, y_test)
result.append(["XGBoost",xgboost])

#### All features

In [None]:
# eval_metric is set on the estimator; recent XGBoost releases no longer accept it in fit()
xgb_all = XGBClassifier(random_state=10, use_label_encoder=False, eval_metric='aucpr')
xgb_all.fit(X_train, y_train);
xgboost_all = print_roc_auc_score(xgb_all, X_test, y_test)
result.append(["XGBoost all features",xgboost_all])

### AdaBoost

#### Selected features

In [None]:
adaBoost = AdaBoostClassifier()
adaBoost.fit(X_train_imp, y_train);
adaBoostAuc = print_roc_auc_score(adaBoost, X_test_imp, y_test)
result.append(["AdaBoost",adaBoostAuc])

#### All features

In [None]:
adaBoost_all = AdaBoostClassifier()
adaBoost_all.fit(X_train, y_train);
adaBoost_all_auc = print_roc_auc_score(adaBoost_all, X_test, y_test)
result.append(["AdaBoost all features",adaBoost_all_auc])

### CatBoost

#### Selected features

In [None]:
catBoost = CatBoostClassifier()
catBoost.fit(X_train_imp, y_train);
catBoostAuc = print_roc_auc_score(catBoost, X_test_imp, y_test)
result.append(["catBoost",catBoostAuc])

#### All features

In [None]:
catBoost_all = CatBoostClassifier()
catBoost_all.fit(X_train, y_train);
catBoostAucAll = print_roc_auc_score(catBoost_all, X_test, y_test)
result.append(["catBoost all features",catBoostAucAll])

### LightGBM

#### Selected features

In [None]:
lgbm = LGBMClassifier()
lgbm.fit(X_train_imp, y_train);
lgbm_auc = print_roc_auc_score(lgbm, X_test_imp, y_test)
result.append(["lightgbm",lgbm_auc])

#### All features

In [None]:
lgbm_all = LGBMClassifier()
lgbm_all.fit(X_train, y_train);
lgbm_all_auc = print_roc_auc_score(lgbm_all, X_test, y_test)
result.append(["lightgbm all features",lgbm_all_auc])

### Summary 

In [None]:
df_results = pd.DataFrame(result)
df_results = df_results.rename(columns={0:"ModelName",1:"Auc Score"})
df_results

In [None]:
df_results.iloc[df_results["Auc Score"].argmax()]

As we can see, the models trained on the features selected with ExtraTreesClassifier performed really well even after removing 276 of the 285 features, which shows the power of choosing the right features. Training on only 9 features takes far less time, but for the sake of submission we are going to use the model with the highest score, i.e. the CatBoost model trained on all of the features of the training data.

## Parameter tuning

In [None]:
lgb = LGBMClassifier(random_state = 42)
rs_params = dict(learning_rate = [0.05,0.03,0.01,0.1,0.3],
                 reg_lambda = [0, 20, 40],
                 n_estimators = [1000,2000,2500,3000],
                 max_depth = [1, 3, 4, 7, 10, 15],
                 subsample = [0.7, 0.8, 0.9, 1],
                 colsample_bytree = [0.7, 0.8, 0.9, 1],
                 reg_alpha = [0, 20, 40, 50, 60])

rs_lgb = RandomizedSearchCV(estimator = lgb,
                            param_distributions = rs_params,
                            scoring = 'roc_auc',
                            cv = 2,
                            random_state = 42)

rs_lgb.fit(X_train, y_train)
print(rs_lgb.best_params_)
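
It helps to know how much of the search space the random search actually visits. The sketch below counts the combinations in the `rs_params` grid defined above and draws the candidates that a search with the default `n_iter=10` would consider. It uses `ParameterGrid` and `ParameterSampler` from scikit-learn; that the sampled candidates match those of `rs_lgb` exactly is an assumption based on using the same `random_state`.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the tuning cell above
rs_params = dict(learning_rate=[0.05, 0.03, 0.01, 0.1, 0.3],
                 reg_lambda=[0, 20, 40],
                 n_estimators=[1000, 2000, 2500, 3000],
                 max_depth=[1, 3, 4, 7, 10, 15],
                 subsample=[0.7, 0.8, 0.9, 1],
                 colsample_bytree=[0.7, 0.8, 0.9, 1],
                 reg_alpha=[0, 20, 40, 50, 60])

# Size of the full grid: the product of the list lengths
n_combinations = len(ParameterGrid(rs_params))

# RandomizedSearchCV samples n_iter candidates (10 by default)
sampled = list(ParameterSampler(rs_params, n_iter=10, random_state=42))

print("combinations in the full grid:", n_combinations)
print(len(sampled), "candidates x 2 folds =", len(sampled) * 2, "fits")
print("first sampled candidate:", sampled[0])
```

The full grid has 28,800 combinations, so ten candidates cover only a tiny fraction of it. Raising `n_iter` widens the search at a proportional cost in training time, and the search also refits the best candidate once on the whole training split by default.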

## Submission

In [None]:
catBoost_final = CatBoostClassifier()
catBoost_final.fit(X, y)

In [None]:
lgb = LGBMClassifier(**rs_lgb.best_params_)
lgb.fit(X,y)

In [None]:
# Load the competition test set under its own name so the hold-out X_test is not overwritten
test = pd.read_csv("../input/tabular-playground-series-oct-2021/test.csv", index_col='id')
predict = lgb.predict_proba(test)[:, 1]
predictions = pd.DataFrame({"id": test.index, "target": predict})

In [None]:
predictions.to_csv('submission.csv', index=False)

## AutoKeras

Automated machine learning (AutoML) automates the selection, composition and parameterization of machine learning models.


work in progress...

In [None]:
#pip install git+git://github.com/keras-team/autokeras@master#egg=autokeras

In [None]:
#import autokeras as ak

#### Selected features

In [None]:
# Initialize the structured data classifier.
#clf = ak.StructuredDataClassifier(overwrite=True, max_trials=5)  # It tries up to 5 different models.

#clf.fit(X_train_imp, y_train, epochs=10)

#predicted_y = clf.predict(X_test_imp)
# Evaluate the best model with testing data.
#print(clf.evaluate(X_test_imp, y_test))

In [None]:
#model = clf.export_model()
#model.summary()

In [None]:
#model.save('model_keras.tf')

In [None]:
#y_pred = model.predict(X_test_imp)

#auc = metrics.roc_auc_score(y_test, y_pred)    
#fpr_logit, tpr_logit, _ = metrics.roc_curve(y_test, y_pred)    
#plt.plot(fpr_logit,tpr_logit,label="ROC curve (AUC={:.3f})".format(auc))
#plt.plot([0, 1], [0, 1], 'k--')
#plt.xlabel('False positive rate')
#plt.ylabel('True positive rate')
#plt.title('ROC curve')
#plt.legend(loc=4)
#plt.show()    
#print("AUC Score :", auc)

#### All features

work in progress...