# Lab 3: Data Preprocessing - Data Discretization, Binning and Feature Subset Selection

## Data Discretization

**Definition**: The process of converting continuous data into discrete intervals or categories.  
**Importance**: Simplifies data, makes patterns easier to identify, and helps some machine learning models perform better.

## Data Binning
  
**Definition**: A specific form of discretization that groups continuous values into "bins" or intervals.  
**Importance**: Reduces noise in the data and helps detect trends by organizing values into broader categories.

## Feature Subset Selection

**Definition**: The process of selecting a subset of relevant features from the dataset while eliminating irrelevant or redundant ones.  
**Importance**: Improves model performance, reduces overfitting, and speeds up computation by focusing on the most important data features.

#### Data binning can be done with the `cut()` method, which groups values into bins and can apply user-defined labels

Syntax: `pd.cut(x, bins, labels = None)`
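
As a minimal sketch of how `pd.cut` treats bin edges, the example below uses a small made-up series (not the lab dataset) and hypothetical labels `"low"` and `"high"`. By default each interval is closed on the right, so a value equal to an edge falls into the lower bin, and values outside all bins become `NaN`.

```python
import pandas as pd

# Made-up glucose-like values; 200.0 lies outside the bin edges on purpose
glucose = pd.Series([44.0, 85.0, 140.0, 141.0, 199.0, 200.0])

# Without labels, each value is mapped to an interval such as (0, 140]
binned = pd.cut(glucose, bins=[0, 140, 199])

# With labels, each interval is replaced by a user-defined name
# ("low" and "high" are hypothetical labels for this sketch)
labelled = pd.cut(glucose, bins=[0, 140, 199], labels=["low", "high"])

print(binned.tolist())
print(labelled.tolist())
```

If the lowest edge itself should be included in the first bin, `pd.cut` accepts `include_lowest=True`.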

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
check_glucose_level_data = pd.read_csv('dataset/imputed_data_diabetes.csv')
check_glucose_level_data.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic
0,1,85.0,66.0,29.0,125.0,26.6,0.351,31,0
1,8,183.0,64.0,29.142593,125.0,23.3,0.672,32,1
2,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
3,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
4,5,116.0,74.0,29.142593,125.0,25.6,0.201,30,0


In [3]:
print("Maximum Value:", check_glucose_level_data['glucose'].max())
print("Minimum Value:", check_glucose_level_data['glucose'].min())

Maximum Value: 199.0
Minimum Value: 44.0


#### Create two buckets for glucose values of 0-140 and 140-199

In [4]:
check_glucose_level_data['bin'] = pd.cut(check_glucose_level_data['glucose'], bins = [0,140,199])
check_glucose_level_data.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic,bin
0,1,85.0,66.0,29.0,125.0,26.6,0.351,31,0,"(0, 140]"
1,8,183.0,64.0,29.142593,125.0,23.3,0.672,32,1,"(140, 199]"
2,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0,"(0, 140]"
3,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1,"(0, 140]"
4,5,116.0,74.0,29.142593,125.0,25.6,0.201,30,0,"(0, 140]"


The code and table above show each continuous `glucose` value mapped to its bin interval, which appears in the new `bin` column on the far right.

#### After creating the two buckets for the glucose ranges, label the 0-140 range "Normal" and the 140-199 range "Prediabetic or Risky", as in the code and table below

In [5]:
check_glucose_level_data['bin'] = pd.cut(
    check_glucose_level_data['glucose'],
    bins=[0, 140, 199],
    labels=['Normal', 'Prediabetic or Risky']
)

check_glucose_level_data.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic,bin
0,1,85.0,66.0,29.0,125.0,26.6,0.351,31,0,Normal
1,8,183.0,64.0,29.142593,125.0,23.3,0.672,32,1,Prediabetic or Risky
2,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0,Normal
3,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1,Normal
4,5,116.0,74.0,29.142593,125.0,25.6,0.201,30,0,Normal


#### Show the glucose column alongside its corresponding categorical value from the bin column

In [6]:
dta_frm = check_glucose_level_data[['glucose', 'bin']]
dta_frm.head()

Unnamed: 0,glucose,bin
0,85.0,Normal
1,183.0,Prediabetic or Risky
2,89.0,Normal
3,137.0,Normal
4,116.0,Normal


#### Replace each glucose value with its corresponding categorical bin value

In [7]:
check_glucose_level_data['glucose'] = check_glucose_level_data['bin'].values
check_glucose_level_data.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic,bin
0,1,Normal,66.0,29.0,125.0,26.6,0.351,31,0,Normal
1,8,Prediabetic or Risky,64.0,29.142593,125.0,23.3,0.672,32,1,Prediabetic or Risky
2,1,Normal,66.0,23.0,94.0,28.1,0.167,21,0,Normal
3,0,Normal,40.0,35.0,168.0,43.1,2.288,33,1,Normal
4,5,Normal,74.0,29.142593,125.0,25.6,0.201,30,0,Normal


#### Drop the bin column from the dataframe `check_glucose_level_data`, using `axis=1` to target columns and `inplace=True` to modify the same dataframe in place

In [8]:
check_glucose_level_data.drop(['bin'], axis=1, inplace = True)
check_glucose_level_data.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic
0,1,Normal,66.0,29.0,125.0,26.6,0.351,31,0
1,8,Prediabetic or Risky,64.0,29.142593,125.0,23.3,0.672,32,1
2,1,Normal,66.0,23.0,94.0,28.1,0.167,21,0
3,0,Normal,40.0,35.0,168.0,43.1,2.288,33,1
4,5,Normal,74.0,29.142593,125.0,25.6,0.201,30,0


## Data Binarization

#### One binarization method is `One Hot Encoding`

- Convert each category value into a new column and assign a 1 or 0 (True/False) value to the column
- Use pandas
  - pd.get_dummies(obj_df, columns=["   "]) method to realize one hot encoding
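
The `columns=` form of `pd.get_dummies` can be sketched as follows. The dataframe `obj_df` here is a tiny made-up stand-in, not the lab dataset. Only the listed column is encoded, and the new columns are prefixed with the original column name.

```python
import pandas as pd

# Made-up stand-in data: one categorical column and one numeric column
obj_df = pd.DataFrame({
    "glucose": ["Normal", "Prediabetic or Risky", "Normal"],
    "age": [31, 32, 21],
})

# Encode only the "glucose" column; "age" is passed through unchanged
encoded = pd.get_dummies(obj_df, columns=["glucose"])

print(encoded.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns are boolean or integer; `encoded.astype(int)` gives 0/1 values either way.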


In [9]:
check_glucose_level_data.glucose.value_counts()

glucose
Normal                  576
Prediabetic or Risky    191
Name: count, dtype: int64

The code above counts the patients in each category of the glucose column. The code below then binarizes the labels: "Normal" becomes "10" (Normal = True, Prediabetic or Risky = False) and "Prediabetic or Risky" becomes "01".

In [10]:
pd.get_dummies(check_glucose_level_data.glucose)

Unnamed: 0,Normal,Prediabetic or Risky
0,True,False
1,False,True
2,True,False
3,True,False
4,True,False
...,...,...
762,True,False
763,True,False
764,True,False
765,True,False


## Feature Subset Selection

Feature subset selection applies when many of the attributes/features in a data set are redundant or irrelevant, so we don't need all of them.

>Filter Approach: The features in the subset are selected before the learning algorithm is run, and the selection is independent of that algorithm. A mathematical criterion is used to evaluate the most promising features. Alternatively, if we want variability in the reduced feature set, we select the features that are least related to one another, i.e. attributes whose pairwise correlation is as low as possible. For example, Age and DOB depend very strongly on each other, so don't select both; DOB and Medical History may have low correlation, so that pair is a reasonable choice out of the three.
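
The low-pairwise-correlation idea can be sketched as below. This is only an illustration under stated assumptions: the data are made up, the column names `age`, `birth_year` and `visits` are hypothetical, and the 0.9 cutoff is an arbitrary choice rather than a recommended value.

```python
import pandas as pd

# Made-up data: birth_year is fully determined by age, visits is unrelated
df = pd.DataFrame({
    "age": [25, 40, 33, 58, 47],
    "birth_year": [1999, 1984, 1991, 1966, 1977],
    "visits": [3, 1, 4, 1, 5],
})

corr = df.corr().abs()   # absolute pairwise Pearson correlations
threshold = 0.9          # assumed cutoff for "too strongly related"

# Walk the lower triangle; drop a column if it is strongly correlated
# with an earlier column that is still being kept
to_drop = []
cols = list(corr.columns)
for i in range(len(cols)):
    for j in range(i):
        if corr.iloc[i, j] > threshold and cols[j] not in to_drop:
            to_drop.append(cols[i])
            break

reduced = df.drop(columns=to_drop)
print("Dropped:", to_drop)
print("Kept:", list(reduced.columns))
```

Which member of a correlated pair is dropped depends on column order here; in practice that choice is usually made with domain knowledge.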

## Filter Approach

#### The code below shows an example of the "Filter Approach" to attribute selection using the chi-square test.

In the chi-square test we score how strongly each attribute is associated with the output attribute, and the attributes most strongly associated with the target variable are selected. Here we select the "m" attributes with the highest scores.
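
To make the score more concrete, here is a minimal NumPy sketch of one way a chi-squared feature score can be computed. It assumes non-negative feature values, and to my understanding it follows the same formulation as scikit-learn's `chi2`: per-class feature sums are compared with the sums expected if the feature were independent of the class. The function name `chi2_scores` and the toy arrays are hypothetical.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared score of each column of X against class labels y (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot class membership, shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                        # per-class sum of each feature
    class_prob = Y.mean(axis=0)               # share of samples in each class
    feature_total = X.sum(axis=0)             # overall sum of each feature
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Made-up toy data: two features, binary target
X_toy = np.array([[85, 66], [183, 64], [89, 66], [137, 40]])
y_toy = np.array([0, 1, 0, 1])

scores = chi2_scores(X_toy, y_toy)
print(scores)
```

A larger score means the feature's total is split across the classes less evenly than independence would predict, which is why high-scoring features are kept.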


In [11]:
from sklearn.feature_selection import SelectKBest, chi2
dbts_new = pd.read_csv('dataset/imputed_data_diabetes.csv')

new_dtaset = dbts_new.values
# Split the dataset into input and output variables; only the input (independent) variables are candidates for the subset
X = new_dtaset[:, 0:8]                  # the 8 input variables
Y = new_dtaset[:, 8]                    # the last column: the output variable
# Selector that keeps the k = 5 input features with the highest chi-squared scores
test = SelectKBest(score_func=chi2, k=5)
# Run the score function on (X, Y) and fit the selector
fit = test.fit(X, Y)

# Show the chi-squared score of each input attribute
for i, j in enumerate(fit.scores_):
    print('Input Feature: %0d, Score: %.5f' % (i, j))


# Reduce X from its 8 input features to the k = 5 features with the highest chi-squared scores
dbts_ftr_sbset = fit.transform(X)

# Show the first five rows of the selected input features (values come from the imputed dataset)
print(dbts_ftr_sbset[0:5, :])

Input Feature: 0, Score: 110.72718
Input Feature: 1, Score: 1413.68503
Input Feature: 2, Score: 42.71539
Input Feature: 3, Score: 93.44658
Input Feature: 4, Score: 1698.96289
Input Feature: 5, Score: 108.82932
Input Feature: 6, Score: 5.35636
Input Feature: 7, Score: 178.01076
[[  1.   85.  125.   26.6  31. ]
 [  8.  183.  125.   23.3  32. ]
 [  1.   89.   94.   28.1  21. ]
 [  0.  137.  168.   43.1  33. ]
 [  5.  116.  125.   25.6  30. ]]


## Interpreting the Chi-Square Scores

From the chi-square test, the following attributes have the lowest scores, i.e. they are the least associated with the output variable "Diabetic".

- Feature: 2, Score: 42.71539 is "bp"
- Feature: 3, Score: 93.44658 is "skin"
- Feature: 6, Score: 5.35636 is "pedigree"

Hence we can reduce the 8-feature set {pregnant, glucose, bp, skin, insulin, bmi, pedigree, age} to the 5 features {pregnant, glucose, insulin, bmi, age}, which are better suited to the learning algorithm.
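
Reading feature names off positional indices by hand is error-prone. As a hedged sketch, the fitted selector's `get_support()` mask can be used to recover the names directly. The dataframe `toy` below is made up (its column names mirror the lab dataset, its values do not), and `k=2` is chosen only to keep the example small.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up toy data; only the column names are borrowed from the lab dataset
toy = pd.DataFrame({
    "glucose": [85, 183, 89, 137, 116, 160],
    "bp": [66, 64, 66, 40, 74, 70],
    "age": [31, 32, 21, 33, 30, 45],
    "Diabetic": [0, 1, 0, 1, 0, 1],
})

feature_names = ["glucose", "bp", "age"]
X = toy[feature_names].values
y = toy["Diabetic"].values

# Keep the 2 features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# Boolean mask over the input columns: True where a feature was kept
mask = selector.get_support()
selected = [name for name, keep in zip(feature_names, mask) if keep]

# Pair every score with its feature name, highest first
scores = pd.Series(selector.scores_, index=feature_names).sort_values(ascending=False)

print(selected)
print(scores)
```

The same pattern should carry over to the lab dataset by taking `feature_names` from the dataframe's first eight column names.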
