## Handling Missing Values in Python During the Data Analysis Process
Missing values are common in real-world datasets, and how they are handled directly affects the accuracy of any insight drawn from the data. This guide walks through the main techniques for handling missing values in Python with pandas and scikit-learn, with an implementation for each.

### 1. Understanding Missing Values
#### Causes of Missing Values:
- Data entry errors
- Non-responses in surveys
- Data corruption
- Merging datasets with unmatched keys

#### Types of Missing Data:
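- Missing Completely at Random (MCAR): Missingness is unrelated to any observed or unobserved value; there is no systematic pattern.
- Missing at Random (MAR): Missingness depends on other observed variables, but not on the missing value itself.
- Missing Not at Random (MNAR): Missingness depends on the unobserved value itself (e.g., high earners declining to report their income).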
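
One informal way to probe these categories is to compare an observed variable between rows where a value is missing and rows where it is present. The sketch below uses a hypothetical `survey` frame (not part of the dataset in this guide). A clear difference between the groups is consistent with MAR, but it cannot rule out MNAR, which is not detectable from the observed data alone.

```python
import pandas as pd

# Hypothetical survey data: older respondents skip the income question more often.
survey = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 60, 67],
    "income": [30000, 32000, 41000, 50000, None, 58000, None, None],
})

# Label each row by whether 'income' is missing, then compare 'age' across the two groups.
group = survey["income"].isna().map({True: "missing", False: "observed"})
mean_age = survey.groupby(group)["age"].mean()
print(mean_age)
```

If the two group means were close, that would be compatible with MCAR for this pair of variables, though a small sample like this one cannot settle the question.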

#### Identifying Missing Values:

Before handling missing values, identify them in the dataset.

In [1]:
import pandas as pd

# Example dataset
data = {
    'Name': ['Sajjad', 'Noor', 'Sameer', None],
    'Age': [25, None, 30, 22],
    'Salary': [50000, 60000, None, 45000]
}
df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())  # Boolean mask for missing values
print(df.isnull().sum())  # Count of missing values per column

    Name    Age  Salary
0  False  False   False
1  False   True   False
2  False  False    True
3   True  False   False
Name      1
Age       1
Salary    1
dtype: int64
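
Counts are easier to judge as a share of the data. A minimal sketch, rebuilding the same example dataset so it runs on its own, that reports the percentage of missing values per column and lists the affected rows (the names `missing_pct` and `rows_with_missing` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sajjad", "Noor", "Sameer", None],
    "Age": [25, None, 30, 22],
    "Salary": [50000, 60000, None, 45000],
})

# Share of missing values per column, as a percentage
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```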


### 2. Techniques to Handle Missing Values

#### 2.1 Dropping Missing Values
- When to use: When only a small share of rows or columns is affected and removing them will not bias the analysis.

In [2]:
# Drop rows with any missing values
df_dropped_rows = df.dropna()

# Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)

# Drop rows where specific columns have missing values
df_dropped_specific = df.dropna(subset=['Age', 'Salary'])
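
`dropna` can also be made less aggressive. The sketch below, on a small hypothetical frame with one fully empty row, shows `how="all"` (drop a row only if every value is missing) and `thresh` (keep rows with at least that many non-missing values).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 is entirely missing, row 3 has a single value.
df = pd.DataFrame({
    "Name": ["Sajjad", "Noor", None, None],
    "Age": [25, np.nan, np.nan, 22],
    "Salary": [50000, 60000, np.nan, np.nan],
})

# Drop rows only when every column is missing
df_drop_all_missing = df.dropna(how="all")

# Keep rows that have at least 2 non-missing values
df_drop_thresh = df.dropna(thresh=2)

print(df_drop_all_missing)
print(df_drop_thresh)
```

Recent pandas versions do not allow `how` and `thresh` in the same call, so they are applied separately here.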

#### 2.2 Imputation (Filling Missing Values)
- When to use: When dropping would discard too much data, so the missing values are estimated instead.

##### 2.2.1 Fill with a Constant Value

In [3]:
# Fill every missing value with a specific constant
# Note: this also writes 0 into the text column 'Name'
df_filled_constant = df.fillna(0)
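
A single constant rarely suits every column. As a sketch, `fillna` also accepts a dictionary that maps column names to fill values, so text and numeric columns can get different constants (the constants chosen here are only examples):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sajjad", "Noor", "Sameer", None],
    "Age": [25, None, 30, 22],
    "Salary": [50000, 60000, None, 45000],
})

# One constant per column; columns not listed are left untouched
df_filled_per_column = df.fillna({"Name": "Unknown", "Age": 0, "Salary": 0})
print(df_filled_per_column)
```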

##### 2.2.2 Fill with Statistical Measures
- Mean or Median: Suitable for numerical data (the median is more robust to outliers).
- Mode: Suitable for categorical data.

In [5]:
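# Fill with mean (this modifies df in place, so later cells see the filled values)
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Alternative: fill with median. Run after the mean fill above, this line is a
# no-op because 'Age' has no missing values left; choose one of the two.
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill with mode (if several values tie, mode() returns them sorted and [0] takes the first)
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])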
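
The cell above overwrites `df`, which changes what every later cell sees. A non-mutating variant, sketched here on a fresh copy of the example data, collects one statistic per column and applies them in a single `fillna` call. The names `fill_values` and `df_stats` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sajjad", "Noor", "Sameer", None],
    "Age": [25, None, 30, 22],
    "Salary": [50000, 60000, None, 45000],
})

# One statistic per column, computed from the non-missing values
fill_values = {
    "Age": df["Age"].median(),      # robust to outliers
    "Salary": df["Salary"].mean(),
    "Name": df["Name"].mode()[0],   # all names are unique here, so this is just the first in sorted order
}

# fillna returns a new DataFrame; the original df keeps its missing values
df_stats = df.fillna(fill_values)
print(df_stats)
```

Because every name occurs once, the mode here is arbitrary, which is a reminder that mode imputation only makes sense when a category genuinely dominates.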

##### 2.2.3 Forward Fill or Backward Fill
- When to use: For time-series or sequential data.

In [7]:
# Forward fill
df_filled_ffill = df.ffill()

# Backward fill
df_filled_bfill = df.bfill()

# Check for missing values after fill
print("Missing values after forward fill:")
print(df_filled_ffill.isna().sum())

print("\nMissing values after backward fill:")
print(df_filled_bfill.isna().sum())


Missing values after forward fill:
Name      0
Age       0
Salary    0
dtype: int64

Missing values after backward fill:
Name      0
Age       0
Salary    0
dtype: int64
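
Two details of directional filling are easy to miss: a forward fill cannot fill a gap at the very start of a series, and long gaps are filled without limit unless `limit` is set. A minimal sketch on a hypothetical daily series:

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a leading gap and a three-day gap
readings = pd.Series(
    [np.nan, 10.0, np.nan, np.nan, np.nan, 14.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

ffilled = readings.ffill()                 # the leading NaN has no earlier value and stays missing
ffilled_limited = readings.ffill(limit=1)  # fill at most one consecutive missing value per gap
bfilled = readings.bfill()                 # backward fill covers the leading gap instead

print(ffilled)
print(ffilled_limited)
print(bfilled)
```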


##### 2.2.4 Interpolation
- When to use: For ordered numerical data where neighboring values change smoothly (e.g., a time series).

In [9]:
# Convert 'Age' and 'Salary' columns to numeric, forcing errors to NaN
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')

# Interpolate only numeric columns ('Age' and 'Salary') using linear interpolation
df_interpolated = df.copy()
df_interpolated[['Age', 'Salary']] = df[['Age', 'Salary']].interpolate(method='linear')

# Display the interpolated DataFrame
# ('Name' and 'Age' were already filled in section 2.2.2, so only 'Salary' is interpolated here)
print(df_interpolated)

     Name        Age   Salary
0  Sajjad  25.000000  50000.0
1    Noor  25.666667  60000.0
2  Sameer  30.000000  52500.0
3    Noor  22.000000  45000.0
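
Linear interpolation treats rows as equally spaced. When observations are unevenly spaced in time, `method="time"` weights by the actual time gaps instead. A small sketch with a hypothetical temperature series indexed by date:

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the gap after the missing point is twice as long as the gap before it
temps = pd.Series(
    [10.0, np.nan, 16.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
)

linear = temps.interpolate(method="linear")  # ignores the index: midpoint of 10 and 16
by_time = temps.interpolate(method="time")   # uses the DatetimeIndex: one third of the way from 10 to 16

print(linear)
print(by_time)
```

Here the linear method gives 13.0 for the missing point while the time-based method gives 12.0, because 2 January is one day into a three-day interval.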


#### 2.3 Using Predictive Models
- When to use: When other columns are predictive of the column with missing values, so a model can estimate them better than a single summary statistic.

In [10]:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Example using SimpleImputer (fit_transform returns a 2D array, so flatten it for a single column)
imputer = SimpleImputer(strategy='mean')
df['Age'] = imputer.fit_transform(df[['Age']]).ravel()

# Using regression for imputation: fit on the rows where 'Salary' is known
known = df.dropna(subset=['Salary'])
X = known[['Age']]  # Features for prediction
y = known['Salary']  # Target
model = LinearRegression().fit(X, y)

# Predict 'Salary' only for the rows where it is missing
missing_indices = df['Salary'].isnull()
df.loc[missing_indices, 'Salary'] = model.predict(df.loc[missing_indices, ['Age']])

### 3. Advanced Techniques

#### 3.1 K-Nearest Neighbors (KNN) Imputation
- When to use: For small to medium datasets where similar rows tend to have similar values; distance computations become expensive on large datasets.

In [12]:
from sklearn.impute import KNNImputer

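# 1. Select only numeric columns (Age and Salary)
# Note: earlier cells already filled both columns in df, so the imputer has
# nothing left to fill and the output below repeats the values from section 2.3.
df_numeric = df[['Age', 'Salary']]

# 2. Initialize KNNImputer and apply it on numeric columns (keeping the original index)
imputer = KNNImputer(n_neighbors=5)
df_imputed_numeric = pd.DataFrame(imputer.fit_transform(df_numeric), columns=df_numeric.columns, index=df_numeric.index)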

# 3. Reattach the non-numeric columns (e.g., 'Name')
df_imputed = df.copy()
df_imputed[['Age', 'Salary']] = df_imputed_numeric[['Age', 'Salary']]

# Display the imputed DataFrame
print(df_imputed)

     Name        Age        Salary
0  Sajjad  25.000000  50000.000000
1    Noor  25.666667  60000.000000
2  Sameer  30.000000  71019.417476
3    Noor  22.000000  45000.000000
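
Because `df` had already been filled by the earlier cells, the output above does not show KNN imputation doing any work. The sketch below applies `KNNImputer` to a fresh, hypothetical frame (`people`) that still has a missing value, with `n_neighbors=2` chosen so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: Salary is missing for the 26-year-old in row 1
people = pd.DataFrame({
    "Age": [25.0, 26.0, 28.0, 22.0, 40.0],
    "Salary": [50000.0, np.nan, 62000.0, 45000.0, 90000.0],
})

# The two nearest rows by Age (ages 25 and 28) donate their salaries, averaged with equal weight
imputer = KNNImputer(n_neighbors=2)
people_imputed = pd.DataFrame(
    imputer.fit_transform(people),
    columns=people.columns,
    index=people.index,
)
print(people_imputed)
```

The imputed salary is the mean of 50000 and 62000, i.e. 56000. With more than one observed feature per row, it is usually worth scaling the features first, since the distance is otherwise dominated by whichever column has the largest range.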


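#### 3.2 Multiple Imputation
- When to use: For datasets requiring robust statistical treatment, where the uncertainty of the imputed values matters.
- Note: `IterativeImputer` models each feature with missing values as a function of the other features. With default settings it returns a single imputed dataset; multiple imputation requires repeating it with `sample_posterior=True` and different random seeds.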

In [16]:
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

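# 1. Select only numeric columns (Age and Salary)
# Note: as in 3.1, df has no missing values left at this point, so the
# output below is unchanged from the previous section.
df_numeric = df[['Age', 'Salary']]

# 2. Initialize IterativeImputer and apply it on numeric columns (keeping the original index)
imputer = IterativeImputer()
df_imputed_numeric = pd.DataFrame(imputer.fit_transform(df_numeric), columns=df_numeric.columns, index=df_numeric.index)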

# 3. Reattach the non-numeric columns (e.g., 'Name')
df_imputed = df.copy()
df_imputed[['Age', 'Salary']] = df_imputed_numeric[['Age', 'Salary']]

# Display the imputed DataFrame
print(df_imputed)

     Name        Age        Salary
0  Sajjad  25.000000  50000.000000
1    Noor  25.666667  60000.000000
2  Sameer  30.000000  71019.417476
3    Noor  22.000000  45000.000000
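
To get several plausible imputations rather than one, `IterativeImputer` can be run repeatedly with `sample_posterior=True` and a different `random_state` each time. The following is a rough sketch on a hypothetical frame (`people`), not a full multiple-imputation workflow: it only collects the draws for one missing cell and summarizes their spread.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: Salary is missing in row 1
people = pd.DataFrame({
    "Age": [25.0, 26.0, 28.0, 22.0, 40.0, 35.0],
    "Salary": [50000.0, np.nan, 62000.0, 45000.0, 90000.0, 78000.0],
})

draws = []
for seed in range(5):
    # sample_posterior=True draws each imputed value from the model's predictive distribution
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(people)
    draws.append(completed[1, 1])  # imputed Salary for row 1 in this run

pooled_estimate = float(np.mean(draws))         # average across the imputed datasets
between_draw_std = float(np.std(draws, ddof=1))  # spread reflects imputation uncertainty
print(pooled_estimate, between_draw_std)
```

In a complete analysis, the downstream model would be fitted on each completed dataset and the resulting estimates pooled, rather than averaging the imputed values themselves.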


### 4. Handling Categorical Missing Values
- Replace with the mode.
- Replace with a placeholder (e.g., "Unknown").
- Use one-hot encoding with an extra category.

In [18]:
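# Replace NaN values with 'Unknown'
# (a no-op if 'Name' was already filled with the mode in section 2.2.2)
df['Name'] = df['Name'].fillna('Unknown')

# One-hot encoding. dummy_na=True adds an indicator column for missing values;
# after the fillna above no NaN remain, so that column has no positive entries.
# In practice, use either the placeholder or dummy_na, not both.
encoded_df = pd.get_dummies(df, columns=['Name'], dummy_na=True)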
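
The placeholder and the extra one-hot category are two alternative routes to the same result. A side-by-side sketch on a hypothetical `City` column that still contains a missing value:

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
cities = pd.DataFrame({"City": ["Lahore", None, "Karachi", "Lahore"]})

# Option A: fill with an explicit placeholder, then encode it like any other category
with_placeholder = pd.get_dummies(cities.fillna({"City": "Unknown"}), columns=["City"])

# Option B: leave the NaN in place and let get_dummies add an indicator column for it
with_nan_column = pd.get_dummies(cities, columns=["City"], dummy_na=True)

print(with_placeholder)
print(with_nan_column)
```

Both frames have three indicator columns; they differ only in how the missing-value column is named.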

### 5. Best Practices
- Always analyze the extent and pattern of missing data.
- Avoid arbitrary imputation without understanding data context.
- Compare model performance before and after handling missing data to confirm that the chosen method helps.
- Document all transformations for reproducibility.
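
One lightweight way to follow these practices is to record summary statistics before and after an imputation step. The sketch below, on a hypothetical `ages` series, shows the known side effect of mean imputation: the mean is preserved while the standard deviation shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
ages = pd.Series([25.0, np.nan, 30.0, 22.0, np.nan, 41.0], name="Age")
ages_mean_filled = ages.fillna(ages.mean())

# Side-by-side summary to keep alongside the analysis as documentation
summary = pd.DataFrame({
    "before": ages.describe(),
    "after_mean_fill": ages_mean_filled.describe(),
})
print(summary.loc[["count", "mean", "std"]])
```

Keeping a table like this for each transformation makes it easier to spot when an imputation method has distorted a column's distribution.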