**Handling missing values** is a crucial part of data preprocessing in data analysis and machine learning. There are several ways to deal with missing data, and the right choice depends on the nature of the data, how much of it is missing, and the problem at hand. Here’s how you can handle missing values in Python with pandas and scikit-learn.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

In [4]:
df = sns.load_dataset('titanic')

In [5]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


### Identifying Missing Values

In [7]:
# Check for missing values
print("Missing values in DataFrame:")
print(df.isnull().sum())

Missing values in DataFrame:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
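
Raw counts are easier to judge as a share of the rows. The following is a minimal sketch on a made-up frame (`toy` and its values are hypothetical stand-ins, not the Titanic data) that reports the percentage of missing entries per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull().mean() is the fraction of missing entries in each column
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that need the most attention at the top.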


### Handling Missing Values

#### Dropping Rows or Columns

In [10]:
# Drop rows with any missing values
print("Shape before deleting: ",df.shape)
df_drop_rows = df.dropna()
print("\nShape after deleting rows with any missing values: ",df_drop_rows.shape)

# Drop columns with any missing values
df_drop_cols = df.dropna(axis=1)
print("\nShape after dropping columns with any missing values: ",df_drop_cols.shape)

Shape before deleting:  (891, 15)

Shape after deleting rows with any missing values:  (182, 15)

Shape after dropping columns with any missing values:  (891, 11)


- **Dropping Rows**: Simple and introduces no imputed values, but can discard most of the data (here 709 of 891 rows) and can bias the results if the dropped rows differ systematically from the rest.
- **Dropping Columns**: Retains all observations but loses potentially valuable features and information.
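
A middle ground between the two options is to drop only the columns that are mostly empty and then drop rows based on the columns that matter. This is a sketch on a hypothetical frame; the 50% cutoff and the choice of `embarked` as a required column are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embarked": ["S", "C", None, "S"],
})

# Keep columns where at most half of the values are missing (assumed cutoff)
keep = toy.columns[toy.isnull().mean() <= 0.5]
reduced = toy[keep]

# Drop only the rows that lack a value in a column we require
cleaned = reduced.dropna(subset=["embarked"])
print(cleaned.shape)
```

Rows with a missing `age` survive here, so they can still be imputed later instead of being thrown away.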

### Imputation Techniques

In [17]:
# Separate numeric and categorical columns
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = df.select_dtypes(include=['object']).columns

print("\nNumeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)



Numeric columns: Index(['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare'], dtype='object')
Categorical columns: Index(['sex', 'embarked', 'who', 'embark_town', 'alive'], dtype='object')
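
Selecting only `float64`, `int64`, and `object` leaves out columns stored with other dtypes, such as pandas categoricals and booleans. A hedged sketch of a broader split on a hypothetical frame (column names are borrowed from the dataset, values are made up):

```python
import pandas as pd

# Hypothetical toy frame with one column of each kind
toy = pd.DataFrame({
    "age": [22.0, 38.0],
    "sex": pd.Series(["male", "female"], dtype=object),
    "deck": pd.Categorical(["C", None]),
    "alone": [False, True],
})

# "number" matches every integer and float dtype, but not bool
numeric = toy.select_dtypes(include="number").columns.tolist()
# Treat both object and category columns as categorical
categorical = toy.select_dtypes(include=["object", "category"]).columns.tolist()
# Whatever is left (here the boolean column) needs its own decision
other = toy.columns.difference(numeric + categorical).tolist()

print(numeric, categorical, other)
```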


Imputation techniques are used to handle missing data by filling in the gaps with plausible values. Here are some of the most commonly used imputation techniques:

### 1. Mean Imputation

**Description**: Replace missing values with the mean of the non-missing values in the same column.

**Advantages**:
- Simple to implement.
- Preserves the mean of the data.

**Disadvantages**:
- Reduces the variability in the data (underestimates variance).
- Can introduce bias if the data is not missing completely at random (MCAR).
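
The variance shrinkage is easy to see on a small made-up series (the numbers below are illustrative only): the mean is unchanged after filling, while the standard deviation drops because every filled value sits exactly at the center.

```python
import numpy as np
import pandas as pd

# Made-up series with two missing entries
s = pd.Series([10.0, 20.0, np.nan, 30.0, np.nan, 40.0])
filled = s.fillna(s.mean())

# The mean is preserved, but the spread shrinks
print("mean before/after:", s.mean(), filled.mean())
print("std before/after: ", s.std(), filled.std())
```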

In [23]:
# Mean Imputation
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
df_num_col_mean = pd.DataFrame(imputer.fit_transform(df[numeric_cols]),columns= numeric_cols)
df_num_col_mean

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
0,0.0,3.0,22.000000,1.0,0.0,7.2500
1,1.0,1.0,38.000000,1.0,0.0,71.2833
2,1.0,3.0,26.000000,0.0,0.0,7.9250
3,1.0,1.0,35.000000,1.0,0.0,53.1000
4,0.0,3.0,35.000000,0.0,0.0,8.0500
...,...,...,...,...,...,...
886,0.0,2.0,27.000000,0.0,0.0,13.0000
887,1.0,1.0,19.000000,0.0,0.0,30.0000
888,0.0,3.0,29.699118,1.0,2.0,23.4500
889,1.0,1.0,26.000000,0.0,0.0,30.0000


### 2. Median Imputation

**Description**: Replace missing values with the median of the non-missing values in the same column.

**Advantages**:
- Robust to outliers.
- Preserves the median of the data.

**Disadvantages**:
- Similar to mean imputation, it can underestimate the variance.
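
To illustrate the robustness claim, here is a sketch with a made-up fare-like series containing one extreme value. The mean is pulled far toward the outlier, while the median stays with the bulk of the data:

```python
import numpy as np
import pandas as pd

# Made-up values with a single large outlier and one missing entry
s = pd.Series([7.25, 8.05, 7.93, 512.33, np.nan])

filled_mean = s.fillna(s.mean())
filled_median = s.fillna(s.median())

# Compare the value each strategy writes into the gap
print("mean fill:  ", filled_mean.iloc[-1])
print("median fill:", filled_median.iloc[-1])
```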

In [25]:
# Median Imputation
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
df_num_col_median = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)
df_num_col_median

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
0,0.0,3.0,22.0,1.0,0.0,7.2500
1,1.0,1.0,38.0,1.0,0.0,71.2833
2,1.0,3.0,26.0,0.0,0.0,7.9250
3,1.0,1.0,35.0,1.0,0.0,53.1000
4,0.0,3.0,35.0,0.0,0.0,8.0500
...,...,...,...,...,...,...
886,0.0,2.0,27.0,0.0,0.0,13.0000
887,1.0,1.0,19.0,0.0,0.0,30.0000
888,0.0,3.0,28.0,1.0,2.0,23.4500
889,1.0,1.0,26.0,0.0,0.0,30.0000


### 3. Mode Imputation (Most Frequent)

**Description**: Replace missing values with the most frequent value (mode) in the same column.

**Advantages**:
- Simple to implement.
- Useful for categorical data.

**Disadvantages**:
- Can introduce bias if the most frequent value is not representative.

In [27]:
# Mode (most frequent) Imputation
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
df_cat_col_mode = pd.DataFrame(imputer.fit_transform(df[categorical_cols]), columns=categorical_cols)
df_cat_col_mode

Unnamed: 0,sex,embarked,who,embark_town,alive
0,male,S,man,Southampton,no
1,female,C,woman,Cherbourg,yes
2,female,S,woman,Southampton,yes
3,female,S,woman,Southampton,yes
4,male,S,man,Southampton,no
...,...,...,...,...,...
886,male,S,man,Southampton,no
887,female,S,woman,Southampton,yes
888,female,S,woman,Southampton,no
889,male,C,man,Cherbourg,yes


### 4. Constant Imputation

**Description**: Replace missing values with a constant value specified by the user.

**Advantages**:
- Useful for creating a specific category for missing values in categorical data.
- Simple to implement.

**Disadvantages**:
- May introduce bias if the constant value is not representative.
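
For a categorical column, the constant can be an explicit "missing" label. One detail to watch: if the column uses the pandas `category` dtype, filling with a label that is not already a category raises an error, so the label has to be registered first. A minimal sketch on a made-up series (the label `"Unknown"` is an arbitrary choice):

```python
import pandas as pd

# Made-up categorical series with gaps
deck = pd.Series(["C", None, "E", None], dtype="category")

# Register the new label as a category, then fill with it
deck_filled = deck.cat.add_categories("Unknown").fillna("Unknown")
print(deck_filled.value_counts())
```

Keeping missingness as its own category lets a downstream model learn from the fact that a value was absent.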

In [29]:
# Constant Imputation: fill every missing numeric value with 0
imputer = SimpleImputer(strategy='constant', fill_value=0)
df_num_col_const = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)
df_num_col_const

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
0,0.0,3.0,22.0,1.0,0.0,7.2500
1,1.0,1.0,38.0,1.0,0.0,71.2833
2,1.0,3.0,26.0,0.0,0.0,7.9250
3,1.0,1.0,35.0,1.0,0.0,53.1000
4,0.0,3.0,35.0,0.0,0.0,8.0500
...,...,...,...,...,...,...
886,0.0,2.0,27.0,0.0,0.0,13.0000
887,1.0,1.0,19.0,0.0,0.0,30.0000
888,0.0,3.0,0.0,1.0,2.0,23.4500
889,1.0,1.0,26.0,0.0,0.0,30.0000


### 5. K-Nearest Neighbors Imputation (KNN)

**Description**: Fill each missing value with the average of that feature over the k most similar rows, where similarity is measured on the features that both rows have observed.

**Advantages**:
- Can capture the local structure of the data.
- More sophisticated and potentially more accurate.

**Disadvantages**:
- Computationally intensive.
- Requires careful selection of the number of neighbors (k).
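
Because neighbors are found by distance, features on larger scales can dominate the search. A common precaution, sketched here under assumptions (made-up values, `n_neighbors=2` chosen arbitrarily), is to standardize before imputing and map the result back to the original units:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Made-up values; one age is missing
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08],
})

# StandardScaler ignores NaNs when fitting and passes them through
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Impute in the scaled space, then return to the original units
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)
imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled), columns=toy.columns)
print(imputed)
```

The missing age is filled with the mean age of the two rows whose fares are closest to that row's fare.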

In [32]:
from sklearn.impute import KNNImputer
# KNN Imputation
imputer = KNNImputer(n_neighbors=3)
df_num_col_knn = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)
df_num_col_knn

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
0,0.0,3.0,22.000000,1.0,0.0,7.2500
1,1.0,1.0,38.000000,1.0,0.0,71.2833
2,1.0,3.0,26.000000,0.0,0.0,7.9250
3,1.0,1.0,35.000000,1.0,0.0,53.1000
4,0.0,3.0,35.000000,0.0,0.0,8.0500
...,...,...,...,...,...,...
886,0.0,2.0,27.000000,0.0,0.0,13.0000
887,1.0,1.0,19.000000,0.0,0.0,30.0000
888,0.0,3.0,25.333333,1.0,2.0,23.4500
889,1.0,1.0,26.000000,0.0,0.0,30.0000


### 6. Iterative Imputation (Multivariate Imputation by Chained Equations)

**Description**: Impute missing values by modeling each feature as a function of other features iteratively.

**Advantages**:
- Can capture complex relationships between features.
- More sophisticated and potentially more accurate.

**Disadvantages**:
- Computationally intensive.
- Requires careful convergence checking.
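
One simple convergence check is to compare the number of rounds actually run with the allowed maximum. This sketch uses synthetic data (two correlated columns generated below, not the Titanic data) and scikit-learn's `n_iter_` attribute; the `max_iter` and `tol` values are arbitrary choices:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data: the second column is roughly twice the first
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])
X[rng.choice(200, size=30, replace=False), 1] = np.nan

imputer = IterativeImputer(max_iter=25, tol=1e-3, random_state=0)
X_filled = imputer.fit_transform(X)

# n_iter_ below max_iter means the early-stopping criterion was reached
converged = imputer.n_iter_ < imputer.max_iter
print("rounds run:", imputer.n_iter_, "converged:", converged)
```

If the imputer uses every allowed round, raising `max_iter` or loosening `tol` is worth trying before trusting the result.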

In [34]:
# IterativeImputer is experimental; this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Iterative Imputation
imputer = IterativeImputer(max_iter=10, random_state=0)
df_num_col_iter = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)
df_num_col_iter

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
0,0.0,3.0,22.000000,1.0,0.0,7.2500
1,1.0,1.0,38.000000,1.0,0.0,71.2833
2,1.0,3.0,26.000000,0.0,0.0,7.9250
3,1.0,1.0,35.000000,1.0,0.0,53.1000
4,0.0,3.0,35.000000,0.0,0.0,8.0500
...,...,...,...,...,...,...
886,0.0,2.0,27.000000,0.0,0.0,13.0000
887,1.0,1.0,19.000000,0.0,0.0,30.0000
888,0.0,3.0,23.259681,1.0,2.0,23.4500
889,1.0,1.0,26.000000,0.0,0.0,30.0000


### 7. Forward and Backward Fill (for Time Series Data)

**Description**: Use the previous (forward fill) or next (backward fill) value to fill missing values.

**Advantages**:
- Simple to implement.
- Preserves temporal continuity in time series data.

**Disadvantages**:
- Not suitable for data that is not time-dependent.
- Can introduce bias if the missing values are not randomly distributed.

**Example Code**:
```python
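# Forward fill: carry the last observed value forward
# (fillna(method='ffill') is deprecated in recent pandas versions)
df_ffill = df.ffill()
print(df_ffill)

# Backward fill: pull the next observed value backward
df_bfill = df.bfill()
print(df_bfill)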
```
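
On an actual time-indexed series, it often helps to cap how far a value may be carried. The sketch below uses made-up daily readings; the `limit=2` cap is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

# Made-up daily readings with a three-day gap and a trailing gap
idx = pd.date_range("2024-01-01", periods=6, freq="D")
temps = pd.Series([20.0, np.nan, np.nan, np.nan, 24.0, np.nan], index=idx)

# Carry a value forward at most two steps; longer gaps stay missing
ffilled = temps.ffill(limit=2)

# Backward fill cannot fill a trailing gap, since no later value exists
bfilled = temps.bfill()

print(ffilled)
print(bfilled)
```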

### Choosing the Right Imputation Technique
The choice of imputation technique depends on the nature of the data and the specific requirements of the analysis:

- **Mean/Median Imputation**: Suitable for numerical data with a relatively small number of missing values.
- **Mode Imputation**: Useful for categorical data.
- **Constant Imputation**: Helpful for creating a specific category for missing values or when a domain-specific constant is appropriate.
- **KNN and Iterative Imputation**: Best when a feature's missing values can be predicted from the other features, for example when missingness depends on observed variables rather than being completely random.
- **Forward/Backward Fill**: Ideal for time series data where temporal continuity is important.

Each technique has its own strengths and weaknesses, and the best choice often requires understanding the data and the context of the analysis.
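
In practice, different columns usually get different strategies. One way to combine them, sketched here on a hypothetical frame with made-up values, is scikit-learn's `ColumnTransformer`, which applies one imputer to the numeric columns and another to the categorical ones:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; values are made up
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, np.nan, 53.10],
    "embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})

# Median for numeric columns, most frequent value for the categorical one
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "fare"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["embarked"]),
])

# Output columns follow the order of the transformers above
filled = pd.DataFrame(
    preprocess.fit_transform(toy), columns=["age", "fare", "embarked"]
).astype({"age": "float64", "fare": "float64"})
print(filled)
```

When a model is trained afterwards, fitting the transformer on the training split only and reusing it on the test split keeps test information out of the imputed values.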

In [36]:
# For a roughly normally distributed numerical column
df["age"].fillna(df["age"].mean())

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: age, Length: 891, dtype: float64

In [39]:
# For a skewed (not normally distributed) numerical column
df["age"].fillna(df["age"].median())

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

In [46]:
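# Mode of a categorical column.
# mode() returns a Series, so take its first entry to get a scalar;
# passing the Series itself to fillna would align on index and leave gaps unfilled.
mode = df["embarked"].mode()[0]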

In [47]:
df['embarked'].fillna(mode)

0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: embarked, Length: 891, dtype: object
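
Note that `fillna` returns a new Series, so the calls above leave `df` unchanged unless the result is assigned back. A minimal sketch of assigning and then verifying, on a hypothetical frame with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "embarked": pd.Series(["S", None, "S"], dtype=object),
})

# assign() returns a new frame with the filled columns; toy itself is untouched
toy_filled = toy.assign(
    age=toy["age"].fillna(toy["age"].median()),
    embarked=toy["embarked"].fillna(toy["embarked"].mode()[0]),
)

# Confirm that nothing is left to fill
print(toy_filled.isnull().sum())
```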