**TASK 2**

1. Find the number of missing values in each feature; if any are found,
fill them according to the type of data (continuous/discrete)

In [None]:
### 1 Handling Missing Values ###
print("🔹 Missing values per column:\n", df.isnull().sum())

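# Fill missing values
# Assign the result back to the column: fillna(inplace=True) on df[col]
# is a chained assignment and does not reliably update df in recent pandas.
for col in df.columns:
    if df[col].isnull().any():  # If missing values exist
        if df[col].dtype == 'float64' or df[col].dtype == 'int64':  # Numeric data: fill with the mean
            df[col] = df[col].fillna(df[col].mean())
        else:  # Categorical/discrete data: fill with the most frequent value
            df[col] = df[col].fillna(df[col].mode()[0])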

print("\n✅ Missing values handled!")


🔹 Missing values per column:
 gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64

✅ Missing values handled!
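Since every column reports 0 missing values, the fill logic above is never actually exercised on this dataset. The sketch below runs the same idea on a small made-up frame (the `toy` data and its values are hypothetical, not taken from the dataset) to show what mean and mode imputation would do.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one continuous and one categorical column with gaps
toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 30.0, 26.0],
    "smoking_history": ["never", "never", None, "former"],
})

missing_before = toy.isnull().sum()

for col in toy.columns:
    if toy[col].isnull().any():
        if pd.api.types.is_float_dtype(toy[col]):
            # Continuous column: fill with the mean of the observed values
            toy[col] = toy[col].fillna(toy[col].mean())
        else:
            # Categorical column: fill with the most frequent value
            toy[col] = toy[col].fillna(toy[col].mode()[0])

print(missing_before)
print(toy)
```

Here the missing `bmi` becomes the mean of 22, 30 and 26, and the missing `smoking_history` becomes `never`, the most frequent label.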


2. Remove duplicated samples from the dataset

In [None]:
before = df.shape[0]
df.drop_duplicates(inplace=True)
after = df.shape[0]
print(f"\n🔹 Removed {before - after} duplicate rows.")



🔹 Removed 3854 duplicate rows.
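Before dropping that many rows it can help to look at what counts as a duplicate. A minimal sketch on made-up data (the `toy` frame is hypothetical) showing how `duplicated(keep=False)` exposes whole duplicate groups, and how the removed count relates to them:

```python
import pandas as pd

# Hypothetical toy data: rows 0, 1 and 3 are identical
toy = pd.DataFrame({
    "age": [54, 54, 33, 54],
    "bmi": [27.3, 27.3, 21.0, 27.3],
})

# keep=False marks every member of a duplicate group, not only the repeats
dup_groups = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group by default
deduped = toy.drop_duplicates().reset_index(drop=True)
removed = len(toy) - len(deduped)

print(dup_groups)
print(f"Removed {removed} duplicate rows.")
```

Note that `drop_duplicates` keeps the original index labels; the `reset_index(drop=True)` call is optional and only renumbers the remaining rows.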


3. Normalize `blood_glucose_level` (Min-Max Normalization)

In [None]:
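from sklearn.preprocessing import MinMaxScaler

# Rescale the column to [0, 1] using (x - min) / (max - min)
scaler = MinMaxScaler()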
df['blood_glucose_level'] = scaler.fit_transform(df[['blood_glucose_level']])
print("\n✅ blood_glucose_level normalized to range 0-1.")



✅ blood_glucose_level normalized to range 0-1.
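With its default `feature_range=(0, 1)`, `MinMaxScaler` applies `(x - min) / (max - min)` per column. The sketch below does the same arithmetic by hand on a few made-up glucose readings (the values are hypothetical) so the transformation is easy to verify.

```python
import pandas as pd

# Hypothetical glucose readings, not taken from the dataset
glucose = pd.Series([80.0, 140.0, 200.0, 300.0], name="blood_glucose_level")

lo, hi = glucose.min(), glucose.max()

# Min-max normalization: the minimum maps to 0 and the maximum to 1
scaled = (glucose - lo) / (hi - lo)

print(scaled)
```

Because the result depends only on the column minimum and maximum, a single extreme reading compresses all other values toward 0.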


4. Convert Categorical Data to Ordinal

In [None]:
category_mappings = {
    'gender': {'Male': 0, 'Female': 1, 'Other': 2},
    'smoking_history': {'never': 0, 'former': 1, 'current': 2, 'not known': 3}
}
# Note: replace() leaves any value that is not listed in a mapping unchanged
df.replace(category_mappings, inplace=True)
print("\n✅ Categorical data mapped to ordinal values.")


✅ Categorical data mapped to ordinal values.
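A hard-coded mapping only helps if its keys match the labels actually present in the column. The following sketch, on made-up data, shows one way to check that before replacing; the `toy` frame and the `No Info` label are assumptions for illustration, so the real column should be inspected with `df['smoking_history'].unique()`.

```python
import pandas as pd

# Hypothetical labels; the real column may contain different ones
toy = pd.DataFrame({"smoking_history": ["never", "former", "current", "No Info"]})
mapping = {"never": 0, "former": 1, "current": 2, "not known": 3}

observed = set(toy["smoking_history"].dropna().unique())

# Labels in the data that the mapping would leave untouched
unmapped = sorted(observed - set(mapping))
# Mapping keys that never occur in the data
unused = sorted(set(mapping) - observed)

print("Unmapped values:", unmapped)
print("Unused mapping keys:", unused)
```

If `unmapped` is not empty, those rows keep their string labels after `replace`, so the column stays non-numeric and is skipped by any later step that selects numeric columns.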




5. Detect Outliers using IQR

In [None]:
outliers = {}
for col in df.select_dtypes(include=[np.number]).columns:  # Only for numeric columns
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers[col] = df[(df[col] < lower_bound) | (df[col] > upper_bound)].shape[0]
    print(f"\n🔹 Outlier range for {col}: Lower Bound = {lower_bound}, Upper Bound = {upper_bound}")

print("\n✅ Outlier counts for each column:", outliers)
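To make the IQR rule concrete, here is a small worked example on made-up numbers (the `values` series is hypothetical). It computes the same bounds as the loop above and then filters the flagged points, which is one possible follow-up and not something the task itself asks for.

```python
import pandas as pd

# Hypothetical values with one obvious extreme point
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1 = values.quantile(0.25)   # 3.0 with pandas' default linear interpolation
q3 = values.quantile(0.75)   # 7.0
iqr = q3 - q1                # 4.0

lower_bound = q1 - 1.5 * iqr  # -3.0
upper_bound = q3 + 1.5 * iqr  # 13.0

# Boolean mask of points outside the fences
is_outlier = (values < lower_bound) | (values > upper_bound)
n_outliers = int(is_outlier.sum())

# Optional: keep only the points inside the fences
inliers = values[~is_outlier]

print(f"Bounds: [{lower_bound}, {upper_bound}], outliers: {n_outliers}")
```

For binary columns such as `hypertension` or `diabetes`, the IQR is often 0, so the rule flags every minority-class row; those counts are better read as class counts than as outliers.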