<a href="https://colab.research.google.com/github/sumanthkrishna/genAIPath/blob/main/GAI_Mod1_4_2_DataProcessing_Variables.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Cleaning: Variable Types in Exploratory Data Analysis (EDA)**

Data cleaning is a critical stage of Exploratory Data Analysis (EDA): it identifies and fixes problems in the dataset so that later analysis rests on reliable data. A fundamental part of cleaning is recognizing each variable's type and handling it accordingly. Variables fall broadly into numerical and categorical types, and each calls for different cleaning and preprocessing methods.

### **1. Numerical Variables:**

Numerical variables represent measurable quantities and can further be classified into two subtypes: discrete and continuous.

**1.1. Discrete Numerical Variables:**
Discrete variables take distinct, separate values and are often counted in whole numbers. Examples include the count of items, number of people, etc.

**Cleaning Steps:**
- Check for missing values.
- Handle outliers if present.
- Validate that values are within the expected range.

```python
# Example: Checking for missing values in a discrete numerical column
print(df['count_of_items'].isnull().sum())
```
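The range validation step can be sketched in the same way. The block below is a minimal illustration on a made-up `count_of_items` column (the data and the rule "counts are non-negative whole numbers" are assumptions for the example, not part of any real dataset); it flags missing, negative, and fractional values and keeps only valid rows.

```python
import pandas as pd

# Hypothetical data: one missing value, one negative count, one fractional count
df = pd.DataFrame({"count_of_items": [3, 0, 7, None, -2, 2.5, 4]})
col = df["count_of_items"]

# 1. Missing values
n_missing = col.isnull().sum()

# 2. Range check: a count cannot be negative
out_of_range = col < 0

# 3. Whole-number check: a count should have no fractional part
not_whole = col.notnull() & (col % 1 != 0)

# Keep rows that are present, non-negative, and whole, then store them as integers
valid = col.notnull() & ~out_of_range & ~not_whole
df_clean = df[valid].astype({"count_of_items": "int64"})

print(f"missing={n_missing}, negative={out_of_range.sum()}, fractional={not_whole.sum()}")
print(df_clean)
```

Dropping invalid rows is only one option; depending on the analysis, it may be more appropriate to correct or impute them instead.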

**1.2. Continuous Numerical Variables:**
Continuous variables can take any real value within a given range. Examples include height, weight, temperature, etc.

**Cleaning Steps:**
- Address missing values.
- Detect and handle outliers.
- Normalize or standardize if needed.

```python
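# Example: Handling outliers in a continuous numerical column
from scipy.stats import zscore

# nan_policy='omit' ignores missing values when computing the mean and standard deviation;
# with the default ('propagate'), a single NaN would make every z-score NaN
df['height_zscore'] = zscore(df['height'], nan_policy='omit')

# Keep rows within 3 standard deviations of the mean (rows with a missing height are dropped too)
df_no_outliers = df[df['height_zscore'].abs() < 3]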
```
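The other two steps, addressing missing values and standardizing, might look like the following. This is a sketch on invented height values; median imputation is one common choice among several, and the column names `height_filled` and `height_std` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in centimetres, with one missing value
df = pd.DataFrame({"height": [160.0, 172.5, np.nan, 181.0, 168.0]})

# Fill the missing value with the median, which is less sensitive to outliers than the mean
median_height = df["height"].median()
df["height_filled"] = df["height"].fillna(median_height)

# Standardize: subtract the mean and divide by the (population) standard deviation,
# so the resulting column has mean 0 and standard deviation 1
mean = df["height_filled"].mean()
std = df["height_filled"].std(ddof=0)
df["height_std"] = (df["height_filled"] - mean) / std

print(df)
```

Whether to impute before or after outlier handling depends on the data; extreme values can distort an imputed mean, which is one reason the median is used here.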

### **2. Categorical Variables:**

Categorical variables represent categories and can be further divided into nominal and ordinal types.

**2.1. Nominal Categorical Variables:**
Nominal variables represent categories with no inherent order. Examples include colors, species, etc.

**Cleaning Steps:**
- Check for missing values.
- Convert to numerical format using one-hot encoding.

```python
# Example: One-hot encoding a nominal categorical column
import pandas as pd

df_encoded = pd.get_dummies(df, columns=['color'])
```
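In practice, nominal columns often need their labels tidied before encoding, otherwise variants such as `"Red"` and `"red "` become separate categories. The sketch below uses an invented `color` column to show label normalization, a missing-value count, and one-hot encoding with `dummy_na=True`, which adds an indicator column for missing values.

```python
import pandas as pd

# Hypothetical data with inconsistent casing, stray whitespace, and a missing value
df = pd.DataFrame({"color": ["Red", "red ", "Blue", None, "green"]})

# Normalize labels so that spelling variants collapse into one category
df["color"] = df["color"].str.strip().str.lower()

# Count missing values before encoding
n_missing = df["color"].isnull().sum()

# One-hot encode; dummy_na=True adds a 'color_nan' indicator for missing values
df_encoded = pd.get_dummies(df, columns=["color"], dummy_na=True)

print(df_encoded)
```

Keeping missing values as their own indicator column is a design choice; dropping or imputing them first is equally valid when missingness carries no information.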

**2.2. Ordinal Categorical Variables:**
Ordinal variables have a meaningful order among categories, but the distance between adjacent categories is not necessarily equal. Examples include education levels and customer satisfaction ratings.

**Cleaning Steps:**
- Handle missing values.
- Map ordinal values to numerical representations.

```python
# Example: Mapping ordinal values to numerical representations
education_mapping = {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}
df['education_numeric'] = df['education'].map(education_mapping)
```
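One thing to watch with `Series.map` is that any label missing from the dictionary silently becomes `NaN`, which is indistinguishable from a value that was missing to begin with. A small check, sketched here on invented data (the misspelled `"Masters"` is deliberate), separates the two cases.

```python
import pandas as pd

# Hypothetical data: one label not covered by the mapping, one genuinely missing value
df = pd.DataFrame({"education": ["Bachelor", "PhD", "High School", "Masters", None]})

education_mapping = {"High School": 1, "Bachelor": 2, "Master": 3, "PhD": 4}
df["education_numeric"] = df["education"].map(education_mapping)

# Rows where a label was present but had no entry in the mapping
unmapped = df["education"].notnull() & df["education_numeric"].isnull()

# Rows where the original value was missing
originally_missing = df["education"].isnull()

print("Unmapped labels:", df.loc[unmapped, "education"].tolist())
print("Originally missing rows:", int(originally_missing.sum()))
```

Unmapped labels usually point to typos or categories that need to be added to the mapping, whereas originally missing values call for a separate imputation or removal decision.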

### **Using the Iris Dataset for Illustration:**

Let's apply these concepts to the Iris dataset, a well-known dataset in machine learning.

```python
# Load the Iris dataset
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target
```

In this dataset, the four measurement columns are continuous numerical variables, and `target` is a nominal categorical variable encoded as the integers 0, 1, and 2. Each can now be explored and cleaned with the steps outlined above, so that the dataset is ready for subsequent analysis and modeling.
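As a rough sketch of what that could look like, the block below reloads the data so it runs on its own, counts missing values, converts the integer `target` into a labeled categorical column, and flags rows whose measurements lie more than 3 standard deviations from the column mean. The `species` column name and the threshold of 3 are choices made for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df_iris["target"] = iris.target

# Numerical (continuous) features: count missing values per column
missing_per_column = df_iris.isnull().sum()

# Categorical (nominal) target: replace integer codes with species names
df_iris["species"] = pd.Categorical.from_codes(
    iris.target, categories=list(iris.target_names)
)

# Flag rows where any feature is more than 3 standard deviations from its mean
features = df_iris[iris.feature_names]
z_scores = (features - features.mean()) / features.std(ddof=0)
outlier_rows = (z_scores.abs() > 3).any(axis=1)

print(missing_per_column)
print(df_iris["species"].value_counts())
print("Rows flagged as outliers:", int(outlier_rows.sum()))
```

Flagged rows are candidates for inspection rather than automatic removal; in a small, curated dataset like Iris they may well be legitimate measurements.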

In conclusion, identifying each variable's type and applying the matching cleaning procedure is fundamental to the EDA phase of a data science project. Careful handling of numerical and categorical variables is what makes the analyses that follow reliable.