In [1]:
# Loading data
import pandas as pd
df = pd.read_csv("data/Diab.csv")
df

Unnamed: 0,Pat_Id,Gender,OGTT,DBP,BMI,Age,Diabetic
0,101,Male,176,90,33.7,58,Yes
1,102,Male,150,66,34.7,42,No
2,103,Male,73,50,23.0,21,No
3,104,Female,187,68,37.7,41,Yes
4,105,Female,100,88,46.8,31,No
...,...,...,...,...,...,...,...
495,596,Male,130,96,22.6,21,No
496,597,Female,111,58,29.5,22,No
497,598,Female,98,60,34.7,22,No
498,599,Female,143,86,30.1,23,No


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pat_Id    500 non-null    int64  
 1   Gender    500 non-null    object 
 2   OGTT      500 non-null    int64  
 3   DBP       500 non-null    int64  
 4   BMI       500 non-null    float64
 5   Age       500 non-null    int64  
 6   Diabetic  500 non-null    object 
dtypes: float64(1), int64(4), object(2)
memory usage: 27.5+ KB


Step 2: Separate categorical and continuous columns

In [6]:
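# Categorical columns, selected by name
df_cat = df[['Gender','Diabetic']]
# Continuous columns, selected by position: 2 to 5 are OGTT, DBP, BMI, Age (the slice end is exclusive)
df_cont = df.iloc[:,2:6]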

df_cat contains categorical variables: Gender and Diabetic.

df_cont contains continuous/numerical variables: OGTT, DBP, BMI, Age.

df_cat → Gender, Diabetic (string labels)

df_cont → OGTT, DBP, BMI, Age (numbers, e.g. 150, 80, 25, 45)
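
The same split can also be made by dtype rather than by hard-coded positions, which is less brittle if the columns are ever reordered. The sketch below is only an illustration: `demo` is a hypothetical stand-in for `df`, built from the first three rows shown earlier, and `demo_cat` / `demo_cont` are names introduced here.

```python
import pandas as pd

# Hypothetical stand-in for df, using the first three rows displayed above
demo = pd.DataFrame({
    "Pat_Id": [101, 102, 103],
    "Gender": ["Male", "Male", "Male"],
    "OGTT": [176, 150, 73],
    "DBP": [90, 66, 50],
    "BMI": [33.7, 34.7, 23.0],
    "Age": [58, 42, 21],
    "Diabetic": ["Yes", "No", "No"],
})

# Non-numeric columns are treated as categorical; copy() gives an
# independent frame that is safe to add columns to later
demo_cat = demo.select_dtypes(exclude="number").copy()

# Numeric columns, minus the patient identifier, are treated as continuous
demo_cont = demo.select_dtypes(include="number").drop(columns="Pat_Id")

print(list(demo_cat.columns))
print(list(demo_cont.columns))
```

Pat_Id is numeric but is an identifier, not a measurement, so it is dropped explicitly before scaling.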

Step 3: Standardization

In [8]:
from sklearn.preprocessing import StandardScaler
SS = StandardScaler()
# fit_transform returns a NumPy array, so rebuild a DataFrame with the original column names
SS_X = pd.DataFrame(SS.fit_transform(df_cont), columns=df_cont.columns)

Standardization centers the data at mean 0 and scales it to unit variance.

Formula:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the column mean and $\sigma$ is the column standard deviation.


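Example: If OGTT values are [120, 150, 180], the mean is 150 and the sample standard deviation is 30, giving [-1, 0, 1]. StandardScaler divides by the population standard deviation instead (about 24.49 here), so it returns approximately [-1.22, 0, 1.22].

Useful for scale-sensitive algorithms such as KNN, SVM, and regularized Logistic Regression.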
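
The arithmetic for the three-value OGTT example can be checked with the standard library alone. This is a minimal sketch, not the notebook's own code: the variable names are introduced here, and the only claim about scikit-learn is that StandardScaler uses the population convention (dividing by n).

```python
from statistics import mean, pstdev, stdev

ogtt = [120, 150, 180]
mu = mean(ogtt)

# Sample standard deviation (divides by n - 1)
s_sample = stdev(ogtt)
z_sample = [(x - mu) / s_sample for x in ogtt]

# Population standard deviation (divides by n), the convention StandardScaler follows
s_pop = pstdev(ogtt)
z_pop = [(x - mu) / s_pop for x in ogtt]

print(z_sample)
print([round(z, 4) for z in z_pop])
```

The two conventions differ noticeably for three values, but the gap shrinks as the number of rows grows, so for the 500-row dataset it is small.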

Step 4: Min-Max Scaling (Normalization)

In [10]:
from sklearn.preprocessing import MinMaxScaler
MM = MinMaxScaler()
# fit_transform returns a NumPy array, so rebuild a DataFrame with the original column names
MM_X = pd.DataFrame(MM.fit_transform(df_cont), columns=df_cont.columns)

Scales each column to the range 0–1 using:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Example: OGTT values [120, 150, 180] → min=120, max=180 → scaled [0, 0.5, 1].

Useful for neural networks and distance-based algorithms.
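
The min-max formula can be worked through by hand for the same three OGTT values. The sketch below uses plain Python with names introduced here (`lo`, `hi`, `scaled`, `restored`, `unseen`); it illustrates the formula and is not the MinMaxScaler implementation.

```python
ogtt = [120, 150, 180]
lo, hi = min(ogtt), max(ogtt)

# Forward transform: map [lo, hi] onto [0, 1]
scaled = [(x - lo) / (hi - lo) for x in ogtt]

# Inverse transform: recover the original values
restored = [s * (hi - lo) + lo for s in scaled]

# A value outside the range seen when computing lo and hi lands outside [0, 1]
unseen = (210 - lo) / (hi - lo)

print(scaled)
print(restored)
print(unseen)
```

Because the minimum and maximum define the whole scale, a single extreme value compresses every other point toward one end of the range.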

Step 5: Label Encoding

In [12]:
from sklearn.preprocessing import LabelEncoder
# Take an explicit copy so that adding columns does not raise SettingWithCopyWarning
df_cat = df_cat.copy()
LE = LabelEncoder()
# fit_transform refits LE on every call, so LE.classes_ ends up describing the last column only
df_cat["Gender_LE"] = LE.fit_transform(df_cat["Gender"])
df_cat["Diabetic_LE"] = LE.fit_transform(df_cat["Diabetic"])
df_cat.head()


Unnamed: 0,Gender,Diabetic,Gender_LE,Diabetic_LE
0,Male,Yes,1,1
1,Male,No,1,0
2,Male,No,1,0
3,Female,Yes,0,1
4,Female,No,0,0


Converts categorical strings to integer labels, assigned in sorted (alphabetical) order of the class names.

Gender → Male=1, Female=0

Diabetic → Yes=1, No=0


Useful for algorithms that require numeric input.

Note: Label encoding assigns integers that a model may read as an ordering (e.g. 0 < 1 < 2). For unordered categories with more than two levels, One-Hot Encoding is preferred.
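
To see which integer was assigned to which label, and to map codes back to strings, the fitted encoder can be inspected. A minimal sketch, assuming a short hypothetical list `gender` in place of the full column; `classes_` and `inverse_transform` are part of scikit-learn's LabelEncoder.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample mirroring the first five Gender values shown above
gender = ["Male", "Male", "Male", "Female", "Female"]

le = LabelEncoder()
codes = le.fit_transform(gender)

# classes_ lists the labels in sorted order; the position is the integer code
print(list(le.classes_))

# Encoded values for the sample
print(codes.tolist())

# inverse_transform maps codes back to the original strings
print(list(le.inverse_transform(codes)))
```

Keeping one encoder object per column preserves each mapping, which matters if predictions need to be decoded later.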

Step 6: One-Hot Encoding

In [16]:
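from sklearn.preprocessing import OneHotEncoder
OHE = OneHotEncoder()

# fit_transform returns a sparse matrix by default; toarray() converts it to a dense array
df_g = pd.DataFrame(OHE.fit_transform(df_cat[['Gender']]).toarray())
# Output columns follow the sorted category order, so Female comes before Male
df_g.columns = ['Female','Male']

df_d = pd.DataFrame(OHE.fit_transform(df_cat[['Diabetic']]).toarray())
df_d.columns = ['No','Yes']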


In [17]:
df_g

Unnamed: 0,Female,Male
0,0.0,1.0
1,0.0,1.0
2,0.0,1.0
3,1.0,0.0
4,1.0,0.0
...,...,...
495,0.0,1.0
496,1.0,0.0
497,1.0,0.0
498,1.0,0.0


In [18]:
df_d

Unnamed: 0,No,Yes
0,0.0,1.0
1,1.0,0.0
2,1.0,0.0
3,0.0,1.0
4,1.0,0.0
...,...,...
495,1.0,0.0
496,1.0,0.0
497,1.0,0.0
498,1.0,0.0


One-hot encoding creates binary columns for each category.

Example (Gender), columns → Female, Male:

Male → [0,1]

Female → [1,0]

Example (Diabetic), columns → No, Yes:

Yes → [0,1]

No → [1,0]

One-hot encoding is preferable to label encoding when categories are nominal (no natural order), because it does not suggest any ranking between them.