
# Data Preprocessing with ADS - Diabetes Dataset

## Overview of this Notebook

`Data is the most important element in any data science project`.

Preparing data for training an ML model involves transforming and manipulating it. This phase is often described as the most time-consuming part of the model life cycle, with figures of around 80% of the total effort commonly quoted. Real-world data is frequently incomplete: values can be missing because the data was unavailable or was recorded inconsistently, and the data may also contain errors and outliers.
Preprocessing typically involves several steps:

1. Combining attributes or columns  
2. Data Imputation
3. Data Cleaning
4. Dummy Variables Encoding
5. Outlier Detection
6. Feature Scaling
7. Feature Engineering
8. Feature Selection
9. Feature Extraction
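
Two of the steps above, data imputation and outlier detection, can be sketched with plain pandas. This is a minimal illustration on a hypothetical toy frame, not the method ADS uses internally; the column name `bmi`, the values, and the 1.5 x IQR rule are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry and one extreme value.
toy = pd.DataFrame({"bmi": [22.0, 25.0, np.nan, 27.0, 24.0, 90.0]})

# Data imputation: fill the missing value with the column median.
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].median())

# Outlier detection: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = toy["bmi"].quantile([0.25, 0.75])
iqr = q3 - q1
toy["bmi_outlier"] = (toy["bmi"] < q1 - 1.5 * iqr) | (toy["bmi"] > q3 + 1.5 * iqr)

print(toy)
```

The median is used for imputation here because, unlike the mean, it is not pulled toward the extreme value.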

This notebook demonstrates the core functionality of the `DatasetFactory` class and some of the many ways to clean and transform data in an `ADSDataset` object.
Datasets loaded with `DatasetFactory` can be transformed and manipulated easily with the built-in functions. Underneath, an `ADSDataset` object is a Pandas dataframe, so any operation that can be performed on a Pandas dataframe can also be applied to an ADS dataset.

In [22]:
import ads
import logging
import numpy as np
import os
import pandas as pd
import shutil
import tempfile
import warnings

from ads.dataset.dataset_browser import DatasetBrowser
from ads.dataset.factory import DatasetFactory
from ads.common.data import ADSData
from ads.common.model import ADSModel
from os import path
from sqlalchemy import create_engine
import seaborn as sns
from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
#from sklearn.model_selection import GridSearchCV
#from sklearn.metrics import get_scorer
from collections import defaultdict
from ads.evaluations.evaluator import ADSEvaluator
#from ads.explanations.explainer import ADSExplainer

#from ads.feature_engineering import feature_type_manager, FeatureType
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

In [2]:
print(ads.__version__)

2.8.8


In [3]:
diab = pd.read_csv('diabetes_prediction_dataset.csv')

In [4]:
# diab is already a DataFrame, so it can be passed to the factory directly.
diab_ds = DatasetFactory.from_dataframe(diab, target='diabetes')


In [5]:
type(diab_ds)

ads.dataset.classification_dataset.BinaryClassificationDataset

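An `ADSDataset` object cannot be used for classification or regression problems `until a target has been set`, either with `set_target()` or, as here, through the `target` argument of `from_dataframe()`.
Because the target `diabetes` has two distinct values, this dataset is a `BinaryClassificationDataset`.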

In [6]:
diab_ds.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


In [7]:
diab_ds.shape

(100000, 9)

In [8]:
diab_ds.summary()

Unnamed: 0,Feature,Datatype
0,diabetes,categorical/int64
1,gender,categorical/object
2,age,continuous/float64
3,hypertension,categorical/int64
4,heart_disease,categorical/int64
5,smoking_history,categorical/object
6,bmi,continuous/float64
7,HbA1c_level,continuous/float64
8,blood_glucose_level,ordinal/int64


In [9]:
diab_ds.feature_types

{'diabetes': {
   "type": "categorical",
   "low_level_type": "int64",
   "missing_percentage": 0.0,
   "stats": {
     "unique percentage": 0.02,
     "mode": 0,
     "count": 10000,
     "unique": 2,
     "top": 0,
     "freq": 9137
   },
   "feature_name": "diabetes"
 },
 'gender': {
   "type": "categorical",
   "low_level_type": "object",
   "missing_percentage": 0.0,
   "stats": {
     "unique percentage": 0.02,
     "mode": "Female",
     "count": 10000,
     "unique": 2,
     "top": "Female",
     "freq": 5919
   },
   "feature_name": "gender"
 },
 'age': {
   "type": "continuous",
   "missing_percentage": 0.0,
   "low_level_type": "float64",
   "stats": {
     "mode": 80.0,
     "median": 42.0,
     "kurtosis": -1.0117309856225845,
     "variance": 509.96217323972166,
     "skewness": -0.03795010636587615,
     "outlier_percentage": 0.0,
     "count": 10000.0,
     "mean": 41.49323999999999,
     "std": 22.582342067193157,
     "min": 0.08,
     "25%": 23.0,
     "50%": 42.0,
 

### Remove Duplicate Rows

Duplicate rows slow down model training without adding any information, so they should be removed. The cells below first check each column for missing values with `isna().sum()` and then call `drop_duplicates()`, which returns a dataset with all duplicates removed.

In [10]:
diab_ds.isna().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64

In [11]:
diab_without_dup = diab_ds.drop_duplicates()
diab_without_dup.shape

(96146, 9)
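
In the output above, the row count falls from 100,000 to 96,146, so 3,854 duplicate rows were removed. The same check can be sketched on a plain pandas dataframe; the toy rows below are hypothetical and only illustrate what counts as a duplicate (every column must match).

```python
import pandas as pd

# Hypothetical toy rows: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "age": [54.0, 28.0, 54.0, 28.0],
    "diabetes": [0, 0, 0, 1],
})

# duplicated() marks each row that repeats an earlier, fully identical row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduped.shape)
```

The last row is kept because its `diabetes` value differs, even though `gender` and `age` match the second row.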


## show_in_notebook()
ADS offers a smart visualization tool that detects the type of each data column and chooses a suitable way to plot it. You can also create custom visualizations with your preferred plotting libraries and packages. The correlation plot shows how strongly each pair of attributes is correlated. `Highly correlated attributes contribute less to decision making`, because one attribute of the pair carries little information that the other does not. It is therefore better to drop one attribute from each such pair, and the automated transformations drop these correlated columns for you. **`show_in_notebook()`** shows these correlations in the form of heatmaps.
The ADS show_in_notebook() method creates a comprehensive preview of all the basic information about a dataset including:

a. The predictive data type (for example, regression, binary classification, or multinomial classification).

b. The number of columns and rows.

c. Feature type information.

d. Summary visualization of each feature.

e. The correlation map.

f. Any warnings about data conditions that you should be aware of.

To improve plotting performance, the ADS show_in_notebook() method uses an optimized subset of the data. This smart sample is selected so that it is statistically representative of the full dataset. The correlation map is displayed only when all columns are numerical (continuous or ordinal).
show_in_notebook() may also show warnings about columns that contain zeros. In this dataset, zeros are legitimate values rather than missing data: `hypertension` and `heart_disease` are 0/1 indicator columns, and `diabetes` is the 0/1 target variable. Such warnings can therefore be ignored.
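
Before deciding whether a warning about zeros matters, it can help to see how large the share of zeros in each column actually is. The following is a small pandas sketch on hypothetical toy values; it is not how ADS generates its warnings.

```python
import pandas as pd

# Hypothetical toy values: two 0/1 indicator columns and one measurement.
toy = pd.DataFrame({
    "hypertension": [0, 0, 1, 0],
    "heart_disease": [1, 0, 0, 0],
    "bmi": [25.19, 27.32, 23.45, 20.14],
})

# Comparing to 0 gives a boolean frame; its column means are the zero shares.
zero_share = (toy == 0).mean()

print(zero_share)
```

A high zero share in an indicator column is expected, whereas zeros in a measurement such as `bmi` would more likely point to missing data.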

In [12]:
diab_ds.show_in_notebook()

Accordion(children=(HTML(value='<h1>Name: User Provided DataFrame</h1><h3>Type: BinaryClassificationDataset</h…
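
The idea behind the correlation map can be approximated outside the widget. The sketch below uses the Pearson correlation from pandas on synthetic data and lists one column from each highly correlated pair. The column names and the 0.95 threshold are hypothetical, and ADS may compute its correlations differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 2.0 * x + 1.0,          # exact linear function of x
    "noise": rng.normal(size=200),      # independent of x
})

# Absolute pairwise Pearson correlations.
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # hypothetical cut-off for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print(to_drop)
```

Only the later column of each pair is listed, so dropping `to_drop` still leaves one representative of every correlated group.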

## get_recommendations()
ADS provides built-in automatic transformation tools for datasets. These tools help detect issues with the data and recommend changes to improve the dataset. The recommended changes can be accepted by clicking a button in the drop-down menu. Once the changes are applied, the transformed dataset can be retrieved using the get_transformed_dataset() method.

To access the recommendations, you can use the get_recommendations() method on the ADSDataset object:

In [13]:
diab_ds.get_recommendations()

Output()

## get_transformed_dataset()

Return the transformed dataset with the recommendations applied.

This method should be called after applying the recommendations using the recommendation `show_in_notebook()` API.

In [14]:
diab_ds.get_transformed_dataset()



## Automated Transformations

Automatically chooses the most effective dataset transformation.

Alternatively, you can use auto_transform() to apply all the recommended transformations at once. auto_transform() returns a transformed dataset with several optimizations applied automatically. The optimizations include:

- Dropping constant and primary key columns, which have no predictive quality.

- Imputation to fill in missing values in noisy data.

- Dropping strongly co-correlated columns that tend to produce less generalizable models.

- Balancing a dataset using up or down sampling.
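
The first optimization in the list above can be sketched in pandas as follows. The heuristic (one distinct value means constant; all-unique non-float values mean key-like) and the toy columns `patient_id` and `country` are assumptions for illustration, not the rules ADS applies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one key-like and one constant column.
toy = pd.DataFrame({
    "patient_id": [101, 102, 103, 104],     # unique per row
    "country": ["US", "US", "US", "US"],    # same value everywhere
    "bmi": [25.19, np.nan, 23.45, 20.14],
    "diabetes": [0, 0, 1, 0],
})

n_unique = toy.nunique()

# Constant columns have exactly one distinct value.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Key-like columns: every value is distinct and the column is not a float
# measurement (a simple heuristic, assumed for this sketch).
key_like_cols = [
    col for col in toy.columns
    if n_unique[col] == len(toy) and not pd.api.types.is_float_dtype(toy[col])
]

cleaned = toy.drop(columns=constant_cols + key_like_cols)
print(cleaned.columns.tolist())
```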

One optional argument to auto_transform() is `fix_imbalance`, which is set to `True` by default. When True, auto_transform() corrects any imbalance between the classes. ADS downsamples the dominant class first unless there are too few data points. In that case, ADS upsamples the minority class.
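
Downsampling the dominant class, as described above, can be sketched in pandas like this. It is a simplified illustration on hypothetical toy data. The exact sampling strategy that ADS uses is not shown in this notebook.

```python
import pandas as pd

# Hypothetical imbalanced toy data: six negatives, two positives.
toy = pd.DataFrame({
    "bmi": [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 30.0, 31.0],
    "diabetes": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Size of the smallest class.
minority_size = int(toy["diabetes"].value_counts().min())

# Draw the same number of rows from every class (downsampling the majority).
balanced = pd.concat(
    [group.sample(n=minority_size, random_state=0)
     for _, group in toy.groupby("diabetes")]
).reset_index(drop=True)

print(balanced["diabetes"].value_counts().to_dict())
```

Downsampling discards data, which is why it is only sensible when enough rows remain afterwards; otherwise the minority class is upsampled instead.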

In [15]:
diab_ds.auto_transform()


diabetes,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level
0,Female,52.0,0,0,never,20.99,5.7,90
0,Female,25.0,0,0,No Info,27.32,6.2,159
0,Male,80.0,0,0,No Info,27.32,6.1,160
0,Female,18.0,0,0,No Info,24.75,6.1,90
0,Female,49.0,0,0,No Info,27.32,6.6,85


BinaryClassificationDataset(target: diabetes) 17,000 rows, 9 columns

## select_best_features()

Automatically chooses the best features and removes the rest.

In [16]:
diab_ds.select_best_features()

diabetes,blood_glucose_level,HbA1c_level,age,bmi,hypertension,heart_disease,smoking_history,gender
0,126,5.8,13.0,20.82,0,0,No Info,Female
0,145,5.0,3.0,21.0,0,0,No Info,Female
0,200,3.5,63.0,25.32,0,0,former,Male
0,126,6.1,2.0,17.43,0,0,never,Female
1,200,6.2,33.0,40.08,0,0,not current,Female


BinaryClassificationDataset(target: diabetes) 100,000 rows, 9 columns
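
To make the idea of ranking features concrete, the sketch below uses scikit-learn's `SelectKBest` with the ANOVA F-score as a stand-in. This is an assumption for illustration only: the notebook does not show which criterion `select_best_features()` uses, and the synthetic columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)

# One column that depends on the target and two that are pure noise.
X = pd.DataFrame({
    "informative": y * 3.0 + rng.normal(size=100),
    "noise_a": rng.normal(size=100),
    "noise_b": rng.normal(size=100),
})

# Score every feature against the target and keep the single best one.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = X.columns[selector.get_support()].tolist()

print(selected)
```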

## Building and Visualizing Models

In [25]:
# X = diab_ds[['blood_glucose_level', 'HbA1c_level', 'age', 'bmi', 'hypertension', 'heart_disease',
#                         'smoking_history', 'gender']]

# y = diab_ds[['diabetes']]

# X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.90, test_size = 0.10, random_state = 109)

# print("X_Train Shape:", X_train.shape)
# print("Y_Train Shape:", y_train.shape)
# print("X_Test Shape:", X_test.shape)
# print("Y_Test Shape:", y_test.shape)

In [26]:
train, test = diab_ds.train_test_split()
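
For comparison, the commented-out cell above relies on scikit-learn's `train_test_split`, which this notebook never imports. A minimal stratified version on hypothetical toy data might look like the following; the 25% test size is an arbitrary choice for the example and is not the ADS default.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 negatives and 4 positives.
toy = pd.DataFrame({
    "bmi": [float(v) for v in range(20, 40)],
    "diabetes": [0] * 16 + [1] * 4,
})
X = toy.drop(columns="diabetes")
y = toy["diabetes"]

# stratify=y keeps the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
```

Stratifying matters for an imbalanced target such as `diabetes`, because a purely random split could leave very few positive cases in the test set.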

In [31]:
train.X.smoking_history.unique()

array(['never', 'No Info', 'current', 'not current', 'former', 'ever'],
      dtype=object)
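
Most scikit-learn estimators expect numeric input, so the six `smoking_history` categories above need to be encoded before modelling. One option is dummy variable (one-hot) encoding, sketched here with `pandas.get_dummies`. The `smoking` prefix is a name chosen for this example.

```python
import pandas as pd

# The six categories shown in the output above.
smoking = pd.Series(
    ["never", "No Info", "current", "not current", "former", "ever"],
    name="smoking_history",
)

# One indicator column per category; each row has exactly one active column.
one_hot = pd.get_dummies(smoking, prefix="smoking")

print(one_hot.columns.tolist())
```

One-hot encoding avoids implying an order between categories, which an integer encoding of `smoking_history` would do.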

In [24]:
from sklearn.preprocessing import StandardScaler

# Take the features from the ADS split created in In [26].
# Scale only the numeric columns, since StandardScaler cannot handle strings.
num_cols = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']
sc = StandardScaler()
X_train = sc.fit_transform(train.X[num_cols])  # fit on the training split only
X_test = sc.transform(test.X[num_cols])        # reuse the training statistics
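
Standardization rescales each column to zero mean and unit variance, using statistics learned from the training split only. The NumPy sketch below reproduces that arithmetic on hypothetical values, so it is clear what `fit_transform` and `transform` compute. The array names are chosen for this example.

```python
import numpy as np

# Hypothetical numeric features (two columns) for a train and a test split.
train_values = np.array([[20.0, 80.0], [25.0, 140.0], [30.0, 200.0]])
test_values = np.array([[25.0, 140.0], [35.0, 260.0]])

# "fit": learn the per-column mean and population standard deviation
# (ddof=0, which is what StandardScaler uses) from the training data.
mean = train_values.mean(axis=0)
std = train_values.std(axis=0)

# "transform": apply the same training statistics to both splits.
train_scaled = (train_values - mean) / std
test_scaled = (test_values - mean) / std

print(train_scaled)
print(test_scaled)
```

Reusing the training mean and standard deviation for the test split keeps information about the test data out of the preprocessing step.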