# Identifying and Handling Missing Data

## 1.1 Detecting Missing Data

Detecting missing data is the first step in handling it. Pandas provides several functions to detect missing values in a dataset.

### Using `isnull()`
The `isnull()` method (an alias of `isna()`) returns a DataFrame of the same shape as the original, with `True` where a value is missing (`None` or `NaN`) and `False` elsewhere.

In [23]:
import pandas as pd

# Example DataFrame
data = {'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [24, None, 22, 29],
        'Score': [85.5, 90.0, None, 88.0]}

df = pd.DataFrame(data)

# Detecting missing values
missing_data = df.isnull()
print(missing_data)

    Name    Age  Score
0  False  False  False
1  False   True  False
2   True  False   True
3  False  False  False


### Using `notnull()`

The `notnull()` method (an alias of `notna()`) is the inverse of `isnull()`: it returns a DataFrame of the same shape as the original, with `True` where a value is present and `False` where it is missing.

In [24]:
# Detecting non-missing values
non_missing_data = df.notnull()
print(non_missing_data)

    Name    Age  Score
0   True   True   True
1   True  False   True
2  False   True  False
3   True   True   True
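
A boolean mask like this is often used to filter rows. The following is a minimal sketch that rebuilds the example DataFrame so it runs on its own; the variable name `df_with_age` is introduced here for illustration only.

```python
import pandas as pd

# Same example data as above, repeated so the snippet is self-contained
df = pd.DataFrame({'Name': ['Alice', 'Bob', None, 'David'],
                   'Age': [24, None, 22, 29],
                   'Score': [85.5, 90.0, None, 88.0]})

# Keep only the rows where 'Age' is present
df_with_age = df[df['Age'].notnull()]
print(df_with_age)
```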


### Summary Statistics for Missing Data

Because `True` counts as 1 when summed, chaining `sum()` onto `isnull()` gives the number of missing values in each column.

In [25]:
# Count of missing values per column
missing_count = df.isnull().sum()
print(missing_count)

Name     1
Age      1
Score    1
dtype: int64
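
The same boolean mask supports a few other quick summaries. The sketch below rebuilds the example DataFrame so it is self-contained; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same example data as above, repeated so the snippet is self-contained
df = pd.DataFrame({'Name': ['Alice', 'Bob', None, 'David'],
                   'Age': [24, None, 22, 29],
                   'Score': [85.5, 90.0, None, 88.0]})

# Fraction of missing values per column (the mean of a boolean mask)
missing_fraction = df.isnull().mean()

# Number of missing values in each row
missing_per_row = df.isnull().sum(axis=1)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_fraction)
print(missing_per_row)
print(rows_with_missing)
```

Looking at the fraction rather than the raw count can make it easier to compare columns when deciding between deletion and imputation.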


## 1.2 Techniques for Handling Missing Data

Once missing data is detected, there are several ways to handle it. Common techniques include deletion and imputation.

**Deletion**
* **Dropping Rows**: Removing rows with missing values can be useful when the dataset is large and the amount of missing data is relatively small.

In [26]:
# Dropping rows with missing values
df_dropped_rows = df.dropna()
print(df_dropped_rows)

    Name   Age  Score
0  Alice  24.0   85.5
3  David  29.0   88.0


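* **Dropping Columns**: Removing columns with missing values is an option when a column has a significant amount of missing data. In this example every column contains at least one missing value, so `dropna(axis=1)` removes all of them and returns an empty DataFrame.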

In [27]:
# Dropping columns with missing values
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
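
Dropping everything that has any missing value is often too aggressive, as the empty result shows. A less drastic option is to use the `subset` and `thresh` parameters of `dropna()`. This is a minimal sketch on the same example data; the thresholds are arbitrary choices for illustration.

```python
import pandas as pd

# Same example data as above, repeated so the snippet is self-contained
df = pd.DataFrame({'Name': ['Alice', 'Bob', None, 'David'],
                   'Age': [24, None, 22, 29],
                   'Score': [85.5, 90.0, None, 88.0]})

# Drop a row only when 'Score' is missing; other columns are ignored
df_score_known = df.dropna(subset=['Score'])

# Keep rows that have at least 2 non-missing values
df_min_two = df.dropna(thresh=2)

# Keep columns that have at least 3 non-missing values
df_dense_columns = df.dropna(axis=1, thresh=3)

print(df_score_known)
print(df_min_two)
print(df_dense_columns)
```

Here each column has exactly three observed values, so `thresh=3` keeps all of them, whereas the plain `dropna(axis=1)` call removes all three.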


**Imputation**
* **Filling with a Specific Value**: Replacing every missing value with one specified constant, such as zero. The same constant is applied to all columns, so in the output below the missing `Name` also becomes `0`.

In [28]:
# Filling missing values with a specific value
df_filled_zero = df.fillna(0)
print(df_filled_zero)

    Name   Age  Score
0  Alice  24.0   85.5
1    Bob   0.0   90.0
2      0  22.0    0.0
3  David  29.0   88.0
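
To avoid putting a number into a text column, `fillna()` also accepts a dictionary that maps column names to fill values. The sketch below assumes `'Unknown'` is an acceptable placeholder for a missing name; that label is a hypothetical choice, not something taken from the original notebook.

```python
import pandas as pd

# Same example data as above, repeated so the snippet is self-contained
df = pd.DataFrame({'Name': ['Alice', 'Bob', None, 'David'],
                   'Age': [24, None, 22, 29],
                   'Score': [85.5, 90.0, None, 88.0]})

# One fill value per column; 'Unknown' is an illustrative placeholder
fill_values = {'Name': 'Unknown', 'Age': 0, 'Score': 0.0}
df_filled_per_column = df.fillna(fill_values)
print(df_filled_per_column)
```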


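* **Filling with the Mean/Median of the Column**: This is a common technique for numerical data. Passing `numeric_only=True` restricts the statistics to `Age` and `Score`, so the missing `Name` is left as `None`.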

In [29]:
# Filling missing values with the mean of the column
df_filled_mean = df.fillna(df.mean(numeric_only=True))
print(df_filled_mean)

    Name   Age      Score
0  Alice  24.0  85.500000
1    Bob  25.0  90.000000
2   None  22.0  87.833333
3  David  29.0  88.000000


In [30]:
# Filling missing values with the median of the column
df_filled_median = df.fillna(df.median(numeric_only=True))
print(df_filled_median)

    Name   Age  Score
0  Alice  24.0   85.5
1    Bob  24.0   90.0
2   None  22.0   88.0
3  David  29.0   88.0
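
On the example data the mean and median give similar fills, but they can diverge sharply when a column contains an outlier. The sketch below uses made-up ages (the value 95 is invented for illustration) to show the difference.

```python
import pandas as pd

# Hypothetical ages with one outlier (95) and one missing value
ages = pd.Series([24, None, 22, 29, 95], name='Age')

mean_fill = ages.mean()      # pulled upward by the outlier
median_fill = ages.median()  # middle of the observed values

ages_filled_mean = ages.fillna(mean_fill)
ages_filled_median = ages.fillna(median_fill)

print(mean_fill, median_fill)
```

With these numbers the mean is 42.5 and the median is 26.5, so the median is usually the safer fill for skewed columns.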


## 1.3 Advanced Imputation Methods

Advanced imputation methods involve using models to predict and fill in missing values. Two common methods are K-Nearest Neighbors (KNN) and Multiple Imputation by Chained Equations (MICE).

### K-Nearest Neighbors (KNN)
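KNN imputation replaces a missing value with the mean (or median) of that feature across the **`k`** nearest rows in the dataset. scikit-learn's `KNNImputer`, used below, averages the neighbors' values and measures distance using only the features that both rows have observed.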

In [31]:
from sklearn.impute import KNNImputer

# Example DataFrame with numerical data
data_numeric = {'Age': [24, None, 22, 29],
                'Score': [85.5, 90.0, None, 88.0]}
df_numeric = pd.DataFrame(data_numeric)

In [32]:
# Applying KNN imputation
knn_imputer = KNNImputer(n_neighbors=2)
df_knn_imputed = pd.DataFrame(knn_imputer.fit_transform(df_numeric), columns=df_numeric.columns)
print(df_knn_imputed)

    Age  Score
0  24.0  85.50
1  26.5  90.00
2  22.0  86.75
3  29.0  88.00
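
To see where the imputed `Age` of 26.5 for row 1 comes from, the calculation can be reproduced by hand. This is a sketch of the idea rather than the library's internal code: it uses `nan_euclidean_distances` from scikit-learn, treats rows with no shared observed feature as unusable, and averages the two closest donors.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

# Same numeric example as above, repeated so the snippet is self-contained
df_numeric = pd.DataFrame({'Age': [24, None, 22, 29],
                           'Score': [85.5, 90.0, None, 88.0]})

# Pairwise distances that skip missing coordinates and rescale the rest
distances = nan_euclidean_distances(df_numeric.to_numpy())

# Candidate donors for row 1's Age: rows where Age is observed
donors = np.flatnonzero(df_numeric['Age'].notnull().to_numpy())

# Distance from row 1 to each donor; NaN means no shared observed feature
donor_distances = pd.Series(distances[1, donors], index=donors).dropna()

# Average the Age of the 2 nearest donors
nearest = donor_distances.nsmallest(2).index
age_estimate = df_numeric.loc[nearest, 'Age'].mean()
print(age_estimate)
```

Row 2 shares no observed feature with row 1, so only rows 0 and 3 can act as donors, and their ages (24 and 29) average to 26.5.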


### Multiple Imputation by Chained Equations (MICE)

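MICE is a more sophisticated imputation method that fills in missing data multiple times to create several complete datasets. Each dataset is then analyzed, and the results are pooled to account for the uncertainty in the imputations. scikit-learn's `IterativeImputer`, used below, is inspired by MICE but by default returns a single completed dataset: it models each feature with missing values as a function of the other features and repeats this in rounds.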

In [33]:
# IterativeImputer is experimental; this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Applying MICE imputation
mice_imputer = IterativeImputer()
df_mice_imputed = pd.DataFrame(mice_imputer.fit_transform(df_numeric), columns=df_numeric.columns)
print(df_mice_imputed)

         Age     Score
0  24.000000  85.50000
1  25.002583  90.00000
2  22.000000  87.83034
3  29.000000  88.00000
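
The "multiple" part of MICE can be approximated by running `IterativeImputer` several times with `sample_posterior=True` and different seeds, then pooling a statistic across the completed datasets. The sketch below is one possible way to do this, not a full MICE workflow: the number of imputations (5) and the pooled statistic (mean `Age`) are arbitrary choices for illustration, and a four-row dataset is far too small for the results to be meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Same numeric example as above, repeated so the snippet is self-contained
df_numeric = pd.DataFrame({'Age': [24, None, 22, 29],
                           'Score': [85.5, 90.0, None, 88.0]})

# Draw several completed datasets, each from a different random seed
imputed_sets = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(df_numeric)
    imputed_sets.append(pd.DataFrame(completed, columns=df_numeric.columns))

# Analyze each dataset, then pool: here the statistic is the mean Age
age_means = [d['Age'].mean() for d in imputed_sets]
pooled_age_mean = np.mean(age_means)

# Spread between imputations reflects the uncertainty of the fills
between_imputation_var = np.var(age_means, ddof=1)

print(pooled_age_mean, between_imputation_var)
```

The observed values are identical in every completed dataset; only the imputed cells vary from run to run.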
