<a href="https://colab.research.google.com/github/sumanthkrishna/genAIPath/blob/main/GAI_Mod1_4_3_DataProcessing_MissingValues.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Cleaning: Missing Values in Exploratory Data Analysis (EDA)**

Missing values are a common challenge in real-world datasets, and addressing them is a crucial step in the exploratory data analysis (EDA) phase of data science. They can arise from data collection errors, system failures, or simply because certain information was never available. This section covers how to identify missing values across different variable types, the main approaches for handling them, and Python code snippets that apply those approaches to the well-known Iris dataset.

### **1. Identifying Missing Values:**

#### **1.1 Across Different Variable Types:**

**1.1.1 Numerical Variables:**
For numerical variables, the pandas library makes missing values easy to count. The `isnull()` method returns a Boolean mask that marks each missing entry, and chaining `sum()` gives the number of missing values in each column.

```python
import pandas as pd

# Assuming df is your dataframe; include='number' selects every numeric dtype,
# not only float64 and int64
missing_numerical = df.select_dtypes(include='number').isnull().sum()
```

**1.1.2 Categorical Variables:**
For categorical variables, the same approach applies. Pass `object` (and `category`, if the dataframe uses the pandas categorical dtype) to the `include` parameter of `select_dtypes`.

```python
# 'object' covers string columns; 'category' covers the pandas categorical dtype
missing_categorical = df.select_dtypes(include=['object', 'category']).isnull().sum()
```
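The counts above can also be gathered for every column at once, together with the share of rows affected. The following is a minimal sketch on a small made-up dataframe; `df_demo`, its column names, and its values are hypothetical and only serve to illustrate the technique.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
df_demo = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 47.0],
    "income": [52000.0, 61000.0, np.nan, np.nan],
    "city": ["Pune", None, "Delhi", "Mumbai"],
})

# isnull() gives a Boolean mask; sum() counts the True entries per column,
# and mean() gives the fraction of rows that are missing
missing_summary = pd.DataFrame({
    "n_missing": df_demo.isnull().sum(),
    "pct_missing": df_demo.isnull().mean() * 100,
})
print(missing_summary)
```

A percentage is often easier to act on than a raw count, because it shows directly whether a column is mostly complete or mostly empty.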

### **2. Approaches to Handle Missing Values:**

#### **2.1 Removal:**
If only a small fraction of the values is missing, removing the affected rows or columns might be an option. Note that `dropna()` returns a new dataframe and leaves `df` unchanged.

```python
# Removing rows with any missing values
df_cleaned_rows = df.dropna()

# Removing columns with any missing values
df_cleaned_columns = df.dropna(axis=1)
```
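Dropping every row or column that contains any missing value can discard a lot of data. As a hedged sketch, the `subset`, `thresh`, and `how` parameters of `dropna` allow more selective removal; the dataframe `df_demo` and its columns below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
df_demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": [10.0, np.nan, 30.0, 40.0],
})

# Drop a row only when column "a" is missing
rows_by_subset = df_demo.dropna(subset=["a"])

# Keep rows that have at least 2 non-missing values
rows_by_thresh = df_demo.dropna(thresh=2)

# Drop a row only when every value in it is missing
rows_not_all_missing = df_demo.dropna(how="all")

print(len(rows_by_subset), len(rows_by_thresh), len(rows_not_all_missing))
```

Which variant fits depends on the analysis: `subset` suits cases where only a few key columns must be complete, while `thresh` keeps partially filled rows that can still be imputed later.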

#### **2.2 Imputation:**
For numerical variables, imputation is often done by replacing missing values with the mean, median, or a custom value.

```python
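# Option 1: impute missing values with the mean.
# Assign the result back: fillna(..., inplace=True) on a selected column
# does not reliably update df in recent pandas versions.
df['numerical_column'] = df['numerical_column'].fillna(df['numerical_column'].mean())

# Option 2: impute missing values with the median instead.
# Use one option, not both: after the first fill nothing is left to impute.
df['numerical_column'] = df['numerical_column'].fillna(df['numerical_column'].median())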
```

For categorical variables, common imputation strategies include replacing missing values with the mode (most frequent category) or a custom value.

```python
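# Option 1: impute missing values with the mode (mode() returns a Series, so take the first entry).
# Assign the result back instead of using inplace=True on a selected column.
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])

# Option 2: impute missing values with a custom value instead (use one option, not both)
df['categorical_column'] = df['categorical_column'].fillna('Not Available')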
```
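When several columns need different fill values, `fillna` also accepts a dictionary that maps column names to values. The sketch below combines a median for a numerical column with a mode for a categorical one; `df_demo`, `height`, and `color` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the value 100.0 is a deliberate outlier
df_demo = pd.DataFrame({
    "height": [1.0, np.nan, 3.0, 100.0],
    "color": ["red", "blue", None, "red"],
})

# One fill value per column: median for numbers, mode for categories
fill_values = {
    "height": df_demo["height"].median(),
    "color": df_demo["color"].mode()[0],
}

# fillna returns a new dataframe; df_demo itself is left unchanged
df_filled = df_demo.fillna(value=fill_values)
print(df_filled)
```

The median is used here because it is less sensitive to the outlier than the mean would be, which is a common reason to prefer it for skewed columns.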

#### **2.3 Advanced Imputation:**
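scikit-learn provides imputers that fit into a modeling workflow. `SimpleImputer` applies the same statistics-based strategies shown above (mean, median, most frequent, or a constant), but it learns the fill value during `fit`, so the same value can be reused on new data. For imputation that is actually model-based, scikit-learn also offers `KNNImputer` and `IterativeImputer`, which estimate each missing value from the other features.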

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2D array of shape (n_rows, 1); ravel() flattens it to one column
df['numerical_column'] = imputer.fit_transform(df[['numerical_column']]).ravel()
```
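As an illustration of model-based imputation, here is a minimal sketch using scikit-learn's `KNNImputer`, which fills each missing value with the average of that feature over the nearest rows. The dataframe `df_demo`, its columns, and the choice of `n_neighbors=2` are assumptions made for this example, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: y is missing in the second row
df_demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, np.nan, 30.0, 40.0],
})

# Each missing value is replaced by the mean of that feature
# across the 2 nearest rows (distance is computed on the observed features)
knn_imputer = KNNImputer(n_neighbors=2)
imputed_array = knn_imputer.fit_transform(df_demo)

# fit_transform returns a NumPy array, so rebuild the dataframe
df_knn = pd.DataFrame(imputed_array, columns=df_demo.columns, index=df_demo.index)
print(df_knn)
```

Unlike a single column-wide mean, the filled value here depends on which rows are most similar, which can preserve relationships between features.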

### **3. Applying Concepts to Iris Dataset:**

Now, let's apply these concepts to the Iris dataset:

```python
# Load the Iris dataset
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target

# Introduce missing values
import numpy as np

# Set a few fixed cells to NaN: rows 3-4 of sepal width, rows 2-3 of petal length
df_iris.iloc[3:5, 1] = np.nan
df_iris.iloc[2:4, 2] = np.nan
```

With these missing values in place, you can follow the steps outlined earlier to identify and handle them. All four feature columns in the Iris dataset are numerical, so the numerical strategies apply.
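Putting the pieces together, the sketch below rebuilds the Iris dataframe with the same missing cells, counts the missing values per column, and fills them with each column's median. The median is one reasonable choice among those discussed earlier, picked here only for illustration; `df_iris_imputed` is a name introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the dataframe and introduce the same missing cells as above
iris = load_iris()
df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df_iris["target"] = iris.target
df_iris.iloc[3:5, 1] = np.nan
df_iris.iloc[2:4, 2] = np.nan

# Step 1: identify which columns contain missing values
missing_counts = df_iris.isnull().sum()
print(missing_counts[missing_counts > 0])

# Step 2: impute each feature column with its own median
feature_cols = list(iris.feature_names)
df_iris_imputed = df_iris.copy()
df_iris_imputed[feature_cols] = df_iris[feature_cols].fillna(
    df_iris[feature_cols].median()
)

# Step 3: confirm that nothing is missing any more
print(df_iris_imputed.isnull().sum().sum())
```

Working on a copy keeps the version with missing values available, which makes it easy to compare alternative strategies such as row removal or mean imputation.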

In conclusion, handling missing values is a critical aspect of data cleaning in the EDA phase of data science. The approach chosen depends on the nature of the data and the specific requirements of the analysis or modeling task. The examples provided offer a practical guide to tackling missing values with Python and pandas, using the Iris dataset with artificially introduced gaps as a worked example.