# Machine Learning Lab - BCSE209P
# Assessment – 1
**Name: Siddhartha Pathak**

**Reg No. 21BCE3930**

# Q.No.a) Demonstrate the possible approaches to handling missing values in real-world data, and justify each.

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

In [3]:
data = {
    'age': [25, 30, np.nan, 40, 35, np.nan, 50],
    'salary': [50000, 60000, 70000, np.nan, 90000, 100000, np.nan],
    'gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Female', 'Male'],
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Display the dataset
print("Original Dataset:")
print(df)

Original Dataset:
    age    salary  gender
0  25.0   50000.0    Male
1  30.0   60000.0  Female
2   NaN   70000.0     NaN
3  40.0       NaN  Female
4  35.0   90000.0    Male
5   NaN  100000.0  Female
6  50.0       NaN    Male


# Listwise Deletion
Remove every row that contains a missing value using dropna().

In [4]:
# Listwise deletion (dropping rows with any missing values)
df_listwise = df.dropna()

print("\nAfter Listwise Deletion:")
print(df_listwise)



After Listwise Deletion:
    age   salary  gender
0  25.0  50000.0    Male
1  30.0  60000.0  Female
4  35.0  90000.0    Male
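
Before dropping rows, it can help to check how much data listwise deletion would cost. The following is a minimal sketch that rebuilds the sample dataset and reports the missing count per column and the share of rows that survive dropna(). The names `sample`, `missing_per_column`, and `retained_fraction` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same sample data as the notebook, rebuilt so this block runs on its own
sample = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35, np.nan, 50],
    'salary': [50000, 60000, 70000, np.nan, 90000, 100000, np.nan],
    'gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Female', 'Male'],
})

# Number of missing entries in each column
missing_per_column = sample.isna().sum()

# Fraction of rows that remain after listwise deletion
retained_fraction = len(sample.dropna()) / len(sample)

print(missing_per_column)
print(f"Rows retained: {retained_fraction:.0%}")
```

In this dataset only 3 of the 7 rows are complete, so listwise deletion discards more than half of the data.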


# 1. Mean/Median/Mode Imputation

In [5]:
# Create a copy of the dataset to perform imputation
df_imputed = df.copy()

# Impute numerical columns 'age' and 'salary' with their column means
# (SimpleImputer computes one mean per column, so both can be fitted together)
num_cols = ['age', 'salary']
imputer_mean = SimpleImputer(strategy='mean')
df_imputed[num_cols] = imputer_mean.fit_transform(df[num_cols])

print("\nAfter Mean Imputation for 'age' and 'salary':")
print(df_imputed)

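# Impute categorical 'gender' column using mode
# Note: 'Male' and 'Female' each occur three times here; when values tie,
# SimpleImputer uses the smallest one, so the gap is filled with 'Female'
imputer_mode = SimpleImputer(strategy='most_frequent')
df_imputed['gender'] = imputer_mode.fit_transform(df[['gender']]).ravel()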

print("\nAfter Mode Imputation for 'gender':")
print(df_imputed)



After Mean Imputation for 'age' and 'salary':
    age    salary  gender
0  25.0   50000.0    Male
1  30.0   60000.0  Female
2  36.0   70000.0     NaN
3  40.0   74000.0  Female
4  35.0   90000.0    Male
5  36.0  100000.0  Female
6  50.0   74000.0    Male

After Mode Imputation for 'gender':
    age    salary  gender
0  25.0   50000.0    Male
1  30.0   60000.0  Female
2  36.0   70000.0  Female
3  40.0   74000.0  Female
4  35.0   90000.0    Male
5  36.0  100000.0  Female
6  50.0   74000.0    Male
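
The heading also names median imputation, which the cell above does not run. As a minimal sketch, the same `SimpleImputer` class can be used with `strategy='median'`; the names `numeric`, `imputer_median`, and `numeric_median` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numerical columns of the sample dataset, rebuilt so this block runs on its own
numeric = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35, np.nan, 50],
    'salary': [50000, 60000, 70000, np.nan, 90000, 100000, np.nan],
})

# Fill each column's gaps with that column's median
imputer_median = SimpleImputer(strategy='median')
numeric_median = pd.DataFrame(
    imputer_median.fit_transform(numeric),
    columns=numeric.columns,
)

# statistics_ holds the fill value learned for each column
print(imputer_median.statistics_)
print(numeric_median)
```

For this sample the medians are 35 for age and 70000 for salary, compared with means of 36 and 74000. The median is less sensitive to extreme values, which matters more on skewed columns such as salaries.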


# 2. K-Nearest Neighbors (KNN) Imputation

In [6]:
# Create a new sample dataset with missing values
df_knn = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35, np.nan, 50],
    'salary': [50000, 60000, 70000, np.nan, 90000, 100000, np.nan],
})

# KNN imputation
knn_imputer = KNNImputer(n_neighbors=2)
df_knn_imputed = knn_imputer.fit_transform(df_knn)

# Convert back to DataFrame
df_knn_imputed = pd.DataFrame(df_knn_imputed, columns=['age', 'salary'])

print("\nAfter KNN Imputation:")
print(df_knn_imputed)



After KNN Imputation:
    age    salary
0  25.0   50000.0
1  30.0   60000.0
2  27.5   70000.0
3  40.0   75000.0
4  35.0   90000.0
5  32.5  100000.0
6  50.0   75000.0
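
To see where a value such as 32.5 in row 5 comes from, the neighbour search can be traced by hand. This is a simplified illustration, not KNNImputer's internal code: KNNImputer uses a NaN-aware Euclidean distance, which in this two-column case ranks candidate rows in the same order as the absolute salary difference used below. The names `rows`, `target`, `donors`, and `manual_age` are illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the KNN cell, rebuilt so this block runs on its own
rows = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35, np.nan, 50],
    'salary': [50000, 60000, 70000, np.nan, 90000, 100000, np.nan],
})

target = 5  # row whose age is missing

# Candidate donors need a known age (the value to borrow)
# and a known salary (the feature used to measure similarity)
donors = rows.dropna(subset=['age', 'salary'])

# Distance on the only feature the target row has observed
distance = (donors['salary'] - rows.loc[target, 'salary']).abs()

# Average the age of the two closest donors
nearest = distance.nsmallest(2).index
manual_age = rows.loc[nearest, 'age'].mean()

print(list(nearest), manual_age)
```

The two closest donors are rows 4 and 1 (ages 35 and 30), and their average reproduces the imputed age of 32.5.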


# 3. Handling Categorical Data: One-Hot Encoding and Imputation

In [8]:
# Create a DataFrame with categorical data
df_cat = pd.DataFrame({
    'gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Female', 'Male'],
})

# Impute missing values using the most frequent (mode)
imputer_mode = SimpleImputer(strategy='most_frequent')

# Since SimpleImputer returns a 2D array, we need to flatten it to 1D
df_cat['gender'] = imputer_mode.fit_transform(df_cat[['gender']]).ravel()

print("\nAfter Mode Imputation for Categorical Data:")
print(df_cat)

# One-Hot Encoding of categorical data
df_encoded = pd.get_dummies(df_cat, drop_first=True)

print("\nAfter One-Hot Encoding:")
print(df_encoded)



After Mode Imputation for Categorical Data:
   gender
0    Male
1  Female
2  Female
3  Female
4    Male
5  Female
6    Male

After One-Hot Encoding:
   gender_Male
0         True
1        False
2        False
3        False
4         True
5        False
6         True
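
An alternative to imputing the mode is to keep missingness as its own category, which avoids guessing a value. The sketch below uses the `dummy_na=True` option of `pd.get_dummies`; whether this is preferable depends on the task, and the names `genders` and `encoded_with_na` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same categorical data as above, rebuilt so this block runs on its own
genders = pd.DataFrame({
    'gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Female', 'Male'],
})

# dummy_na=True adds an extra indicator column for missing values,
# so no imputation is needed before encoding
encoded_with_na = pd.get_dummies(genders, dummy_na=True)

print(encoded_with_na)
```

Row 2, whose gender is missing, is flagged only in the extra indicator column, so a downstream model can treat "unknown" differently from either observed category.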


# 4. Iterative (Multivariate) Imputation using IterativeImputer (Advanced)

In [9]:
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

# Create a new sample dataset with missing values
df_multi = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35, np.nan, 50],
    'salary': [50000, 60000, 70000, np.nan, 90000, 100000, np.nan],
})

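# Iterative imputation: each column with missing values is modelled from the
# other columns. With the default sample_posterior=False, a single run returns
# one completed dataset rather than multiple imputations.
iter_imputer = IterativeImputer(random_state=0)
df_multi_imputed = iter_imputer.fit_transform(df_multi)

# Convert back to DataFrame
df_multi_imputed = pd.DataFrame(df_multi_imputed, columns=['age', 'salary'])

print("\nAfter Iterative Imputation:")
print(df_multi_imputed)



After Iterative Imputation: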
         age         salary
0  25.000000   50000.000000
1  30.000000   60000.000000
2  32.576063   70000.000000
3  40.000000   92019.586436
4  35.000000   90000.000000
5  44.112334  100000.000000
6  50.000000  119066.598468
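
A single run of IterativeImputer with default settings yields one completed dataset. To approximate multiple imputation, one option is to set `sample_posterior=True` and repeat the fit with different seeds, then look at how much the imputed values vary. The sketch below assumes five repetitions, an arbitrary choice, and averages the completed datasets only for illustration; a full multiple-imputation analysis would instead fit the downstream model on each dataset and pool those results. The names `frame`, `completed`, `stacked`, `pooled_mean`, and `between_std` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Same data as the iterative-imputation cell, rebuilt so this block runs on its own
frame = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35, np.nan, 50],
    'salary': [50000, 60000, 70000, np.nan, 90000, 100000, np.nan],
})

# Draw several completed datasets; sample_posterior=True samples each imputed
# value from the predictive distribution instead of using its mean
completed = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed.append(imputer.fit_transform(frame))

stacked = np.stack(completed)  # shape: (n_imputations, n_rows, n_columns)

# Average across imputations, and the spread between them
pooled_mean = pd.DataFrame(stacked.mean(axis=0), columns=frame.columns)
between_std = pd.DataFrame(stacked.std(axis=0), columns=frame.columns)

print(pooled_mean)
print(between_std)
```

Observed entries are identical in every completed dataset, so their spread is zero; a non-zero spread appears only where a value was imputed and indicates how uncertain that imputation is.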




# Conclusion

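The approaches demonstrated above each suit different situations:

1. **Listwise Deletion**: Appropriate when only a small fraction of rows is affected and values are missing completely at random (MCAR). In the sample dataset it discards 4 of 7 rows, which shows how quickly it can shrink a small dataset.
2. **Mean/Median/Mode Imputation**: Simple and fast. Mean or median suits numerical columns (the median is more robust to outliers) and mode suits categorical columns, but filling every gap with one value reduces the column's variance.
3. **KNN Imputation**: Suitable when similar rows tend to have similar values, so that a row's nearest neighbours can predict its missing entries.
4. **Iterative Imputation**: Models each feature with missing values as a function of the other features. A single run, as above, produces one completed dataset; capturing the uncertainty of the imputed values requires drawing several imputations (multiple imputation).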

Choosing the right method depends on the nature of the dataset, the proportion of missing data, and the specific requirements of the machine learning task.
