## Improving model accuracy with synthetic data

Data augmentation refers to techniques that increase the amount of training data by adding slightly modified copies of existing records, or synthetic records generated from them. It acts as a regularizer and helps reduce overfitting when training a machine learning model, and it is closely related to oversampling.
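
To make "slightly modified copies" concrete, here is a minimal sketch using only the standard library. It is not what tabgan does internally. The rows, the `augment` helper, and the assumed valid range of 0 to 4 for `Age` are all hypothetical and chosen only to resemble the integer-coded features used in this notebook.

```python
import random

# Hypothetical integer-coded rows in the spirit of the features used here.
rows = [
    {"Pclass": 3, "Sex": 0, "Age": 1},
    {"Pclass": 1, "Sex": 1, "Age": 2},
    {"Pclass": 3, "Sex": 1, "Age": 1},
]
labels = [0, 1, 1]


def augment(rows, labels, column, low, high, copies=2, seed=42):
    """Return the original rows plus `copies` perturbed copies of each row."""
    rng = random.Random(seed)
    out_rows, out_labels = list(rows), list(labels)
    for row, label in zip(rows, labels):
        for _ in range(copies):
            new_row = dict(row)
            # Nudge one ordinal feature by -1, 0, or +1 and clip it to [low, high].
            shifted = new_row[column] + rng.choice([-1, 0, 1])
            new_row[column] = min(high, max(low, shifted))
            out_rows.append(new_row)
            out_labels.append(label)  # each copy keeps the label of its source row
    return out_rows, out_labels


# Assumption: Age bins run from 0 to 4.
aug_rows, aug_labels = augment(rows, labels, column="Age", low=0, high=4)
print(len(rows), "->", len(aug_rows))  # prints "3 -> 9"
```

The originals are kept at the front of the result, so the augmented set is a superset of the real data. Generator-based tools aim to produce less trivially correlated rows than this kind of jitter.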

**GANs for tabular data**
GANs are best known for generating realistic images, but they can also be applied to tabular data generation.

We use [<u>**tabgan**</u>](https://github.com/Diyago/GAN-for-tabular-data) to generate extra synthetic tabular data. The cells below use its `OriginalGenerator`; the `GANGenerator` import is left commented out.


In [None]:
!pip install tabgan
!pip install pycaret

In [15]:
from tabgan.sampler import OriginalGenerator #, GANGenerator
import pandas as pd
from pycaret import classification
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

In [17]:
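# Training set: keep the first eight columns (Survived plus the seven features)
train = pd.read_csv('./data/train.preprocessed.csv', usecols=range(0, 8))
# Test set: skip column 0 and keep the seven feature columns
test = pd.read_csv('./data/test.preprocessed.csv', usecols=range(1, 8))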
print(train.info())
display(train.head())
print(test.info())
display(test.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  891 non-null    int64
 1   Pclass    891 non-null    int64
 2   Sex       891 non-null    int64
 3   Age       891 non-null    int64
 4   Fare      891 non-null    int64
 5   Embarked  891 non-null    int64
 6   Title     891 non-null    int64
 7   IsAlone   891 non-null    int64
dtypes: int64(8)
memory usage: 55.8 KB
None


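| | Survived | Pclass | Sex | Age | Fare | Embarked | Title | IsAlone |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 2 | 3 | 1 | 3 | 0 |
| 2 | 1 | 3 | 1 | 1 | 1 | 0 | 2 | 1 |
| 3 | 1 | 1 | 1 | 2 | 3 | 0 | 3 | 0 |
| 4 | 0 | 3 | 0 | 2 | 1 | 0 | 1 | 1 |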


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Pclass    418 non-null    int64
 1   Sex       418 non-null    int64
 2   Age       418 non-null    int64
 3   Fare      418 non-null    int64
 4   Embarked  418 non-null    int64
 5   Title     418 non-null    int64
 6   IsAlone   418 non-null    int64
dtypes: int64(7)
memory usage: 23.0 KB
None


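| | Pclass | Sex | Age | Fare | Embarked | Title | IsAlone |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 2 | 0 | 2 | 1 | 1 |
| 1 | 3 | 1 | 2 | 0 | 0 | 3 | 0 |
| 2 | 2 | 0 | 3 | 1 | 2 | 1 | 1 |
| 3 | 3 | 0 | 1 | 1 | 0 | 1 | 1 |
| 4 | 3 | 1 | 1 | 1 | 0 | 3 | 0 |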


In [18]:
train.loc[:, 'Pclass':'IsAlone']  # feature columns; label-based slices include both ends

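| | Pclass | Sex | Age | Fare | Embarked | Title | IsAlone |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 2 | 3 | 1 | 3 | 0 |
| 2 | 3 | 1 | 1 | 1 | 0 | 2 | 1 |
| 3 | 1 | 1 | 2 | 3 | 0 | 3 | 0 |
| 4 | 3 | 0 | 2 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 2 | 0 | 1 | 1 | 0 | 5 | 1 |
| 887 | 1 | 1 | 1 | 2 | 0 | 2 | 1 |
| 888 | 3 | 1 | 1 | 2 | 0 | 2 | 0 |
| 889 | 1 | 0 | 1 | 2 | 1 | 1 | 1 |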


In [19]:
train[['Survived']]

| | Survived |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
| ... | ... |
| 886 | 0 |
| 887 | 1 |
| 888 | 0 |
| 889 | 1 |


In [21]:
# Generate synthetic rows from the training features, the target, and the test features
new_train, new_target = OriginalGenerator().generate_data_pipe(
    train.loc[:, 'Pclass':'IsAlone'], train[['Survived']], test
)

In [22]:
new_train

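| | Pclass | Sex | Age | Fare | Embarked | Title | IsAlone |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 2 | 0 | 0 | 1 | 1 |
| 1 | 3 | 0 | 2 | 0 | 0 | 1 | 1 |
| 2 | 3 | 0 | 2 | 0 | 0 | 1 | 1 |
| 3 | 3 | 0 | 2 | 0 | 0 | 1 | 1 |
| 4 | 3 | 0 | 2 | 0 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2661 | 2 | 1 | 1 | 1 | 2 | 2 | 1 |
| 2662 | 2 | 1 | 1 | 1 | 1 | 3 | 1 |
| 2663 | 2 | 1 | 1 | 1 | 2 | 2 | 1 |
| 2664 | 2 | 1 | 1 | 1 | 2 | 2 | 1 |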
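
The preview above contains several identical rows, which is plausible when every feature is a small integer code. As a hedged illustration of how one might measure this, the sketch below counts exact duplicates among only the nine rows visible in the truncated preview; it says nothing about the remaining rows of `new_train`.

```python
from collections import Counter

# The nine rows visible in the truncated new_train preview above, as
# (Pclass, Sex, Age, Fare, Embarked, Title, IsAlone) tuples.
preview = [
    (3, 0, 2, 0, 0, 1, 1),
    (3, 0, 2, 0, 0, 1, 1),
    (3, 0, 2, 0, 0, 1, 1),
    (3, 0, 2, 0, 0, 1, 1),
    (3, 0, 2, 0, 0, 1, 1),
    (2, 1, 1, 1, 2, 2, 1),
    (2, 1, 1, 1, 1, 3, 1),
    (2, 1, 1, 1, 2, 2, 1),
    (2, 1, 1, 1, 2, 2, 1),
]

counts = Counter(preview)               # how often each distinct row occurs
n_unique = len(counts)                  # number of distinct rows
n_repeats = len(preview) - n_unique     # rows that repeat an earlier row
print(f"{n_unique} unique rows, {n_repeats} repeats out of {len(preview)}")
# prints "3 unique rows, 6 repeats out of 9"
```

On the full DataFrame, `new_train.duplicated().sum()` should give the corresponding repeat count in pandas.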


In [23]:
new_target

0       0
1       0
2       0
3       0
4       0
       ..
2661    1
2662    1
2663    1
2664    1
2665    1
Name: Survived, Length: 2666, dtype: int64

In [24]:
# Attach the target column to the generated features (aligned on the index)
new_train = new_train.join(new_target)

In [28]:
classification_setup = classification.setup(data=new_train, target='Survived', silent=True, train_size=0.8)

| | Description | Value |
|---|---|---|
| 0 | session_id | 5568 |
| 1 | Target | Survived |
| 2 | Target Type | Binary |
| 3 | Label Encoded | 0: 0, 1: 1 |
| 4 | Original Data | (2666, 8) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 0 |
| 7 | Categorical Features | 7 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |
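
The setup call passes `train_size=0.8` on 2666 rows, and the cross-validation tables further down report ten folds. As a rough check of the row counts involved, the sketch below computes the split and fold sizes under two assumptions that are not confirmed by this notebook: the training share is rounded down, and the folds are made as even as possible. `split_sizes` and `fold_sizes` are illustrative helper names, not PyCaret functions.

```python
import math


def split_sizes(n_rows, train_size):
    """Rows in the training and hold-out parts, rounding the training share down."""
    n_train = math.floor(n_rows * train_size)
    return n_train, n_rows - n_train


def fold_sizes(n_train, n_folds=10):
    """Sizes of n_folds folds that differ by at most one row."""
    base, extra = divmod(n_train, n_folds)
    return [base + 1] * extra + [base] * (n_folds - extra)


n_train, n_holdout = split_sizes(2666, 0.8)
print(n_train, n_holdout)   # prints "2132 534"
print(fold_sizes(n_train))  # two folds of 214 rows, eight of 213
```

Under these assumptions each fold accuracy is computed on roughly 213 rows, so a single prediction moves a fold's accuracy by about half a percentage point.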



In [29]:
classification.compare_models(classification.models().index.tolist())

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| rf | Random Forest Classifier | 0.8565 | 0.9176 | 0.7705 | 0.8509 | 0.807 | 0.6932 | 0.697 | 0.521 |
| lightgbm | Light Gradient Boosting Machine | 0.855 | 0.917 | 0.762 | 0.8548 | 0.8039 | 0.6895 | 0.6942 | 0.116 |
| dt | Decision Tree Classifier | 0.8527 | 0.9158 | 0.7535 | 0.8543 | 0.7989 | 0.6835 | 0.6885 | 0.02 |
| et | Extra Trees Classifier | 0.8527 | 0.9164 | 0.7535 | 0.8544 | 0.799 | 0.6835 | 0.6885 | 0.503 |
| mlp | MLP Classifier | 0.849 | 0.917 | 0.7547 | 0.8447 | 0.7954 | 0.6763 | 0.6807 | 2.637 |
| knn | K Neighbors Classifier | 0.841 | 0.8898 | 0.744 | 0.8338 | 0.7849 | 0.6594 | 0.6635 | 0.132 |
| gpc | Gaussian Process Classifier | 0.8368 | 0.897 | 0.71 | 0.8499 | 0.7713 | 0.6461 | 0.6544 | 4.396 |
| gbc | Gradient Boosting Classifier | 0.8278 | 0.8882 | 0.6787 | 0.8523 | 0.7538 | 0.6241 | 0.6351 | 0.176 |
| rbfsvm | SVM - Radial Kernel | 0.8152 | 0.8409 | 0.6968 | 0.8076 | 0.7463 | 0.6021 | 0.6078 | 0.527 |
| lr | Logistic Regression | 0.8077 | 0.8494 | 0.7246 | 0.7705 | 0.7451 | 0.5911 | 0.5934 | 0.045 |


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=5568, verbose=0,
                       warm_start=False)

In [30]:
model = classification.create_model('rf')

| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8832 | 0.9117 | 0.759 | 0.9265 | 0.8344 | 0.7456 | 0.7544 |
| 1 | 0.8598 | 0.9201 | 0.7349 | 0.8841 | 0.8026 | 0.6954 | 0.7025 |
| 2 | 0.8498 | 0.9251 | 0.8193 | 0.8 | 0.8095 | 0.6855 | 0.6857 |
| 3 | 0.8545 | 0.924 | 0.8072 | 0.8171 | 0.8121 | 0.6934 | 0.6934 |
| 4 | 0.8216 | 0.9107 | 0.759 | 0.7778 | 0.7683 | 0.6233 | 0.6234 |
| 5 | 0.8216 | 0.8926 | 0.759 | 0.7778 | 0.7683 | 0.6233 | 0.6234 |
| 6 | 0.8732 | 0.9282 | 0.7952 | 0.8684 | 0.8302 | 0.7294 | 0.7312 |
| 7 | 0.8592 | 0.9206 | 0.7952 | 0.8354 | 0.8148 | 0.7013 | 0.7018 |
| 8 | 0.8779 | 0.9231 | 0.7317 | 0.9375 | 0.8219 | 0.7312 | 0.7442 |
| 9 | 0.8638 | 0.9196 | 0.7439 | 0.8841 | 0.8079 | 0.7037 | 0.71 |




In [31]:
tuned_model = classification.tune_model(model)

| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8598 | 0.9178 | 0.7831 | 0.8442 | 0.8125 | 0.7008 | 0.7021 |
| 1 | 0.8505 | 0.9216 | 0.8313 | 0.7931 | 0.8118 | 0.6878 | 0.6884 |
| 2 | 0.8169 | 0.9242 | 0.8434 | 0.7292 | 0.7821 | 0.6257 | 0.6306 |
| 3 | 0.8263 | 0.9217 | 0.8072 | 0.7614 | 0.7836 | 0.6387 | 0.6395 |
| 4 | 0.8169 | 0.8975 | 0.7952 | 0.75 | 0.7719 | 0.6192 | 0.6199 |
| 5 | 0.8216 | 0.8882 | 0.7952 | 0.7586 | 0.7765 | 0.6282 | 0.6287 |
| 6 | 0.8451 | 0.9241 | 0.8434 | 0.7778 | 0.8092 | 0.6792 | 0.6808 |
| 7 | 0.8357 | 0.9077 | 0.8313 | 0.7667 | 0.7977 | 0.6597 | 0.6613 |
| 8 | 0.8732 | 0.9144 | 0.7439 | 0.9104 | 0.8188 | 0.7228 | 0.7316 |
| 9 | 0.8451 | 0.9211 | 0.7927 | 0.8025 | 0.7975 | 0.6721 | 0.6721 |


INFO:logs:create_model_container: 18
INFO:logs:master_model_container: 18
INFO:logs:display_container: 4
INFO:logs:RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=8, max_features=1.0,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0001, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=9,
                       min_weight_fraction_leaf=0.0, n_estimators=260,
                       n_jobs=-1, oob_score=False, random_state=5568, verbose=0,
                       warm_start=False)
INFO:logs:tune_model() succesfully completed......................................
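
The two per-fold tables above (from `create_model('rf')` and `tune_model(model)`) are easier to compare once each is reduced to a mean and a spread. The sketch below does that with the accuracy columns copied by hand from those tables; nothing here is recomputed from the data itself.

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the create_model and tune_model tables above.
default_acc = [0.8832, 0.8598, 0.8498, 0.8545, 0.8216,
               0.8216, 0.8732, 0.8592, 0.8779, 0.8638]
tuned_acc = [0.8598, 0.8505, 0.8169, 0.8263, 0.8169,
             0.8216, 0.8451, 0.8357, 0.8732, 0.8451]

for name, acc in [("default rf", default_acc), ("tuned rf", tuned_acc)]:
    # Mean summarizes the level, sample standard deviation the fold-to-fold spread.
    print(f"{name}: mean={mean(acc):.4f} std={stdev(acc):.4f}")

# Positive means the tuned model scored lower on average.
gap = mean(default_acc) - mean(tuned_acc)
print(f"default minus tuned: {gap:.4f}")
```

The default model's mean of 0.8565 matches the `rf` row in the `compare_models` table, while the tuned model averages 0.8391. In this run, tuning lowered mean fold accuracy (while the Recall column is higher in most folds), so the tuned model is not automatically the better choice.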
