
# Data Preprocessing with ADS - Diabetes Dataset

## Overview of this Notebook

*Data is the most important element in any data science project.*

Preparing data for training an ML model involves transforming and manipulating it. This is often the most time-consuming phase in the life cycle of an ML model and is commonly said to take about 80% of the time. Real-world data is mostly incomplete and often has missing values, which can result from data being unavailable or from inconsistencies in the data. The data may also contain errors and outliers.
Preprocessing of data involves several steps:

1. Combining attributes or columns  
2. Data Imputation
3. Data Cleaning
4. Dummy Variables Encoding
5. Outlier Detection
6. Feature Scaling
7. Feature Engineering
8. Feature Selection
9. Feature Extraction
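A few of the steps listed above (imputation, outlier detection, dummy variable encoding and feature scaling) can be sketched with plain pandas. This is only a minimal illustration on a toy dataframe whose values are hypothetical; the median fill, the 1.5 × IQR rule and min-max scaling are common choices assumed here, not necessarily what ADS applies.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the diabetes dataset).
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female"],
    "age": [80.0, 54.0, np.nan, 36.0, 76.0],
    "bmi": [25.19, 27.32, 27.32, 23.45, 95.0],
})

# Data imputation: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Outlier detection: flag bmi values more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["bmi"].quantile([0.25, 0.75])
iqr = q3 - q1
df["bmi_outlier"] = (df["bmi"] < q1 - 1.5 * iqr) | (df["bmi"] > q3 + 1.5 * iqr)

# Dummy variable encoding: one indicator column per gender category.
df = pd.get_dummies(df, columns=["gender"])

# Feature scaling: min-max scale age into the range [0, 1].
age_range = df["age"].max() - df["age"].min()
df["age_scaled"] = (df["age"] - df["age"].min()) / age_range

print(df)
```

Here only the `bmi` value of 95.0 is flagged as an outlier, and the missing age is replaced by the median of the remaining four ages.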

This notebook demonstrates the core functionality of the `DatasetFactory` class. You will learn some of the many ways to clean and transform data in an `ADSDataset` object.
When a dataset is loaded with `DatasetFactory`, it can be transformed and manipulated easily with the built-in functions. Under the hood, an `ADSDataset` object is backed by a Pandas dataframe, so any operation that can be performed on a Pandas dataframe can also be applied to an ADS dataset.

In [1]:
import ads
import logging
import numpy as np
import os
import pandas as pd
import shutil
import tempfile
import warnings

from ads.dataset.dataset_browser import DatasetBrowser
from ads.dataset.factory import DatasetFactory
from ads.common.data import ADSData
from ads.common.model import ADSModel
from os import path
from sqlalchemy import create_engine
import seaborn as sns
from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
#from sklearn.model_selection import GridSearchCV
#from sklearn.metrics import get_scorer
from collections import defaultdict
from ads.evaluations.evaluator import ADSEvaluator
#from ads.explanations.explainer import ADSExplainer

#from ads.feature_engineering import feature_type_manager, FeatureType
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

In [2]:
print(ads.__version__)

2.8.8


In [3]:
diab = pd.read_csv('diabetes_prediction_dataset.csv')

In [4]:
diab_ds = DatasetFactory.from_dataframe(pd.DataFrame(diab), target = 'diabetes')


In [17]:
type(diab_ds)

ads.dataset.classification_dataset.BinaryClassificationDataset

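An `ADSDataset` object cannot be used for classification or regression problems until a target has been set, either with `set_target` or, as above, through the `target` argument of `DatasetFactory.from_dataframe`.
With `diabetes` as the target, this dataset is of type `BinaryClassificationDataset`.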
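To make the link between the target column and the dataset type more concrete, here is a rough sketch of how a problem type could be guessed from a target. The helper `infer_problem_type` and its threshold of 10 distinct values are hypothetical and only illustrate the idea; they are not the rules ADS uses.

```python
import pandas as pd


def infer_problem_type(target: pd.Series) -> str:
    """Hypothetical helper: guess the learning task from a target column."""
    n_unique = target.nunique(dropna=True)
    if n_unique == 2:
        return "binary_classification"
    # Assumption: non-numeric targets, or numeric targets with few distinct
    # values, are treated as class labels. The cut-off of 10 is arbitrary.
    if not pd.api.types.is_numeric_dtype(target) or n_unique <= 10:
        return "multiclass_classification"
    return "regression"


# Toy target with the same two labels as the diabetes column.
diabetes = pd.Series([0, 0, 1, 0, 1], name="diabetes")
print(infer_problem_type(diabetes))  # binary_classification
```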

In [6]:
diab_ds.head()

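| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |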


In [7]:
diab_ds.shape

(100000, 9)

In [8]:
diab_ds.summary()

| | Feature | Datatype |
|---|---|---|
| 0 | diabetes | categorical/int64 |
| 1 | gender | categorical/object |
| 2 | age | continuous/float64 |
| 3 | hypertension | categorical/int64 |
| 4 | heart_disease | categorical/int64 |
| 5 | smoking_history | categorical/object |
| 6 | bmi | continuous/float64 |
| 7 | HbA1c_level | continuous/float64 |
| 8 | blood_glucose_level | ordinal/int64 |
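The summary pairs a semantic type (categorical, continuous, ordinal) with a storage dtype. The rules ADS uses to assign the semantic type are not shown in this notebook, but the two inputs such a decision would plausibly rely on, dtype and number of distinct values, can be inspected with plain pandas. The sketch below uses a few values copied from the `head()` output above.

```python
import pandas as pd

# A few columns and rows copied from the head() output of the dataset.
sample = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Female", "Male"],
    "age": [80.0, 54.0, 28.0, 36.0, 76.0],
    "hypertension": [0, 0, 0, 0, 1],
    "blood_glucose_level": [140, 80, 158, 155, 155],
})

# One row per feature: storage dtype and number of distinct values.
profile = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
})
print(profile)
```

A column such as `hypertension` is stored as an integer but takes only two values, which is consistent with it being reported as categorical rather than continuous.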


In [9]:
diab_ds.isna().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64

In [10]:
diab_without_dup = diab_ds.drop_duplicates()
diab_without_dup.shape

(96146, 9)
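Dropping duplicates reduces the row count from 100,000 to 96,146, so 3,854 rows were exact copies of an earlier row. Note that the result is assigned to `diab_without_dup`, so `diab_ds` itself still holds all 100,000 rows. The same behaviour can be seen on a small pandas dataframe (toy values, chosen only for illustration):

```python
import pandas as pd

# Toy frame in which the third row repeats the first (hypothetical values).
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "age": [54.0, 28.0, 54.0, 76.0],
    "diabetes": [0, 0, 0, 0],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
deduplicated = toy.drop_duplicates()        # returns a new frame; toy is unchanged

print(n_duplicates, toy.shape, deduplicated.shape)  # 1 (4, 3) (3, 3)
```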

In [11]:
diab_ds.show_in_notebook()

[Interactive widget output. Name: User Provided DataFrame; Type: BinaryClassificationDataset]

In [12]:
diab_ds.get_recommendations()


In [13]:
diab_ds

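| diabetes | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level |
|---|---|---|---|---|---|---|---|---|
| 0 | Female | 13.0 | 0 | 0 | No Info | 20.82 | 5.8 | 126 |
| 0 | Female | 3.0 | 0 | 0 | No Info | 21.0 | 5.0 | 145 |
| 0 | Male | 63.0 | 0 | 0 | former | 25.32 | 3.5 | 200 |
| 0 | Female | 2.0 | 0 | 0 | never | 17.43 | 6.1 | 126 |
| 1 | Female | 33.0 | 0 | 0 | not current | 40.08 | 6.2 | 200 |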


BinaryClassificationDataset(target: diabetes) 100,000 rows, 9 columns

In [14]:
diab_ds.select_best_features()

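| diabetes | blood_glucose_level | HbA1c_level | age | bmi | hypertension | heart_disease | smoking_history | gender |
|---|---|---|---|---|---|---|---|---|
| 0 | 126 | 5.8 | 13.0 | 20.82 | 0 | 0 | No Info | Female |
| 0 | 145 | 5.0 | 3.0 | 21.0 | 0 | 0 | No Info | Female |
| 0 | 200 | 3.5 | 63.0 | 25.32 | 0 | 0 | former | Male |
| 0 | 126 | 6.1 | 2.0 | 17.43 | 0 | 0 | never | Female |
| 1 | 200 | 6.2 | 33.0 | 40.08 | 0 | 0 | not current | Female |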


BinaryClassificationDataset(target: diabetes) 100,000 rows, 9 columns
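In the output above the columns come back in a different order, with `blood_glucose_level` and `HbA1c_level` first. The scoring method behind `select_best_features` is not stated in this notebook. As a rough illustration of feature ranking in general, the sketch below orders numeric features by the absolute value of their Pearson correlation with the target. The helper `rank_features` and the toy values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values); "noise" is unrelated to the target.
toy = pd.DataFrame({
    "blood_glucose_level": [126, 145, 200, 126, 200, 280],
    "age": [13.0, 3.0, 63.0, 2.0, 33.0, 60.0],
    "noise": [1, 2, 1, 2, 1, 2],
    "diabetes": [0, 0, 0, 0, 1, 1],
})


def rank_features(df: pd.DataFrame, target: str) -> pd.Series:
    """Hypothetical helper: rank numeric features by |correlation| with the target."""
    scores = df.drop(columns=[target]).corrwith(df[target]).abs()
    return scores.sort_values(ascending=False)


ranking = rank_features(toy, "diabetes")
print(ranking.index.tolist())  # ['blood_glucose_level', 'age', 'noise']
```

Correlation only captures linear relationships between numeric columns, so categorical features such as `smoking_history` would need encoding or a different score.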

In [15]:
diab_ds.auto_transform()


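| diabetes | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level |
|---|---|---|---|---|---|---|---|---|
| 0 | Male | 26.0 | 0 | 0 | No Info | 35.43 | 6.5 | 200 |
| 0 | Male | 65.0 | 0 | 1 | current | 36.46 | 5.7 | 85 |
| 0 | Male | 80.0 | 0 | 0 | not current | 22.94 | 6.6 | 145 |
| 0 | Female | 72.0 | 1 | 0 | former | 32.65 | 6.1 | 90 |
| 0 | Female | 47.0 | 0 | 0 | No Info | 18.86 | 5.0 | 159 |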


BinaryClassificationDataset(target: diabetes) 17,000 rows, 9 columns
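After `auto_transform()` the dataset has 17,000 rows instead of 100,000. The output does not say why. One possible explanation, stated here purely as an assumption, is that the transformation balances the two classes by down-sampling the majority class. The sketch below shows what such down-sampling could look like in plain pandas; `downsample_majority` is a hypothetical helper and the toy values are invented for illustration.

```python
import pandas as pd

# Toy imbalanced data (hypothetical values): 8 negatives, 2 positives.
toy = pd.DataFrame({
    "bmi": [20.8, 21.0, 25.3, 17.4, 40.1, 35.4, 36.5, 22.9, 32.7, 18.9],
    "diabetes": [0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
})


def downsample_majority(df: pd.DataFrame, target: str, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: sample every class down to the smallest class size."""
    minority_size = int(df[target].value_counts().min())
    parts = [
        group.sample(n=minority_size, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Shuffle so that the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


balanced = downsample_majority(toy, "diabetes")
print(len(balanced), balanced["diabetes"].value_counts().to_dict())
```

On the toy data this keeps both positive rows and two randomly chosen negative rows, so each class ends up with the same number of rows.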