# Lesson 4: Data Cleaning and Transformation

---

## Introduction  
This lesson covers **Data Cleaning and Transformation**. Real-world data is rarely tidy: it contains inconsistencies, missing values, outliers, and outright errors. To extract meaningful insights or build reliable machine learning models, we first have to clean and transform the data.

In this session, we handle inconsistencies and outliers and apply data transformations that prepare a dataset for analysis. Let's get started!  

---

## Why is Data Cleaning and Transformation Necessary?  
Why clean and transform data? Simple: **unclean or inconsistent data can skew analysis or predictions**. For instance, weather data with missing temperatures can lead to misleading climate predictions. The real world is full of such examples of analysis gone awry due to unclean data.  

---

## Recognizing Inconsistencies in Data  
Let's look at how to spot inconsistencies. For example, `XL`, `X-L`, and `xl` all represent the same clothing size but are recorded differently.

Python's `pandas` library comes in handy here:  

```python
import pandas as pd

# Hypothetical dataset of clothing sizes
sizes = ['XL', 'S', 'M', 'X-L', 'xl', 'S', 'L', 'XL', 'M']
df = pd.DataFrame(sizes, columns=['Size'])

# Use value_counts() to spot inconsistent values
print(df['Size'].value_counts())
```

**Output**:
```
XL     2
S      2
M      2
X-L    1
xl     1
L      1
Name: Size, dtype: int64
```

---

### Dealing with Inconsistencies in Data  
To sort out inconsistencies, replace them with a standard value:  

```python
df.replace(['X-L', 'xl'], 'XL', inplace=True)
print(df['Size'].value_counts())
```

**Output**:
```
XL    4  
S     2  
M     2  
L     1  
Name: Size, dtype: int64
```
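
Listing every variant by hand works for a small dataset. As a sketch of an alternative, the same cleanup can be expressed as a rule with pandas string methods. The rule used here (uppercase everything, then drop hyphens) is an assumption that fits this sample data and may not suit every dataset:

```python
import pandas as pd

# Same hypothetical dataset of clothing sizes as above
sizes = ['XL', 'S', 'M', 'X-L', 'xl', 'S', 'L', 'XL', 'M']
df = pd.DataFrame(sizes, columns=['Size'])

# Assumed rule: uppercase every label, then remove hyphens,
# so 'xl' and 'X-L' both become 'XL'
df['Size'] = df['Size'].str.upper().str.replace('-', '', regex=False)

counts = df['Size'].value_counts()
print(counts)
```

A rule like this also catches variants that never made it into a replacement list, at the cost of being less explicit about what gets changed.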

---

## Detecting and Filtering Outliers  
Outliers, or exceptional values, can distort analytical outcomes. A common method to detect outliers is using the **Interquartile Range (IQR)**.  

**Reminder**:  
- Any value below \( Q1 - 1.5 \cdot IQR \) or above \( Q3 + 1.5 \cdot IQR \) is considered an outlier.  
- Where:  
  - \( Q1 \): First quartile  
  - \( Q3 \): Third quartile  
  - \( IQR \): Interquartile Range, \( IQR = Q3 - Q1 \)  

Example of identifying and filtering outliers:  

```python
import pandas as pd

# Dataset with an outlier
data = [1, 1.2, 1.1, 1.05, 1.5, 1.4, 9]
df = pd.DataFrame(data, columns=['Values'])

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)

# Calculate IQR
IQR = Q3 - Q1

# Define the acceptable range (1.5 * IQR rule)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
no_outliers_df = df[(df['Values'] >= lower_bound) & (df['Values'] <= upper_bound)]
print(no_outliers_df)
```

**Output**:
```
   Values  
0    1.00  
1    1.20  
2    1.10  
3    1.05  
4    1.50  
5    1.40  
```

The value `9` is considered an outlier and is excluded from the filtered dataset.  
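
Dropping rows is not always what you want. Sometimes it is more useful to flag suspicious values and inspect them first. The sketch below wraps the same 1.5 × IQR rule in a helper; the function name `iqr_outlier_mask` and its `k` parameter are illustrative choices, not part of pandas:

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean Series that is True where a value lies outside the IQR bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

df = pd.DataFrame({'Values': [1, 1.2, 1.1, 1.05, 1.5, 1.4, 9]})

# Flag outliers in a new column instead of removing them
df['IsOutlier'] = iqr_outlier_mask(df['Values'])
print(df)
```

Filtering with `df[~df['IsOutlier']]` then gives the same rows as the example above.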

---

## Data Transformation  
Data transformation adjusts data to suit a specific analysis or model, for instance by normalizing skewed data or bringing variables with different scales onto a common one.  

Using the **scikit-learn** library:  

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Creating a dataframe
data = {'Feature1': [0.5, 0.6, 0.9], 'Feature2': [10, 12, 15]}
df = pd.DataFrame(data)

# Initializing StandardScaler
scaler = StandardScaler()

# Scaling just 'Feature2'
df['Feature2_scaled'] = scaler.fit_transform(df[['Feature2']])

print("\nOriginal Data with Scaled Feature2 Column:")
print(df)
```

**Output**:
```
Original Data with Scaled Feature2 Column:  
   Feature1  Feature2  Feature2_scaled  
0       0.5        10        -1.135550  
1       0.6        12        -0.162221  
2       0.9        15         1.297771  
```
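
To see what "a common scale" means in practice, the following sketch scales both columns at once and checks the result. It assumes the same two-feature dataframe as above; `scaled` is just an illustrative variable name:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Feature1': [0.5, 0.6, 0.9], 'Feature2': [10, 12, 15]})

# Scale every column; fit_transform returns a NumPy array,
# so we wrap it back into a DataFrame with the original column names
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# After scaling, each column should have mean ~0 and population std ~1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Note the `ddof=0`: `StandardScaler` divides by the population standard deviation, while pandas' `std()` defaults to the sample version (`ddof=1`).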

---

## Lesson Summary and Practice  
Kudos! You've completed the **Data Cleaning and Transformation** lesson.  
Key takeaways:  
- Handle **data inconsistencies** using Python's `pandas` library.  
- Detect and filter **outliers** using the **IQR method**.  
- Perform **data transformations** such as standardization with **scikit-learn**'s `StandardScaler`.  

Up next: **Practice exercises** to reinforce your understanding and hone your skills. Ready, set, explore! 🚀  

--- 

## Standardizing T-Shirt Sizes in Data Analysis

Suppose you're analyzing t-shirt size data for a fashion retail company and need to standardize inconsistent size labels such as `sm`, `lg`, `S`, and `Large`. How can you clean up the data so it is consistent for analysis? The code below replaces these variations with standard size names using the `.replace()` method from the lesson.

```py
import pandas as pd

# Dataset of t-shirt sizes with inconsistencies
sizes = ['small', 'Large', 'medium', 'sm', 'lg', 'S', 'l', 'M']
df = pd.DataFrame(sizes, columns=['TShirtSize'])

# Standardize sizes using a dictionary with replacements
size_replacements = {'sm': 'small', 'S': 'small', 'lg': 'large', 'Large': 'large', 'l': 'large', 'M': 'medium'}
df['TShirtSize'] = df['TShirtSize'].replace(size_replacements)
print(df['TShirtSize'].value_counts())
```

Here's an explanation of how the code works to standardize inconsistent size labels in the dataset:

---

### Code Breakdown:

```python
import pandas as pd

# Dataset of t-shirt sizes with inconsistencies
sizes = ['small', 'Large', 'medium', 'sm', 'lg', 'S', 'l', 'M']
df = pd.DataFrame(sizes, columns=['TShirtSize'])

# Standardize sizes using a dictionary with replacements
size_replacements = {
    'sm': 'small', 
    'S': 'small', 
    'lg': 'large', 
    'Large': 'large', 
    'l': 'large', 
    'M': 'medium'
}
df['TShirtSize'] = df['TShirtSize'].replace(size_replacements)
print(df['TShirtSize'].value_counts())
```

---

### Explanation:

1. **Create the Dataset**:
   - A list of t-shirt sizes (`sizes`) with inconsistent labeling is created. For example:
     - `sm` and `S` refer to `small`.
     - `lg`, `Large`, and `l` refer to `large`.
     - `M` refers to `medium`.

2. **Create a DataFrame**:
   - This list is converted into a DataFrame with a column named `TShirtSize`.

3. **Define a Replacement Dictionary**:
   - A dictionary, `size_replacements`, maps inconsistent labels to their standardized names:
     - `'sm'` and `'S'` → `'small'`
     - `'lg'`, `'Large'`, `'l'` → `'large'`
     - `'M'` → `'medium'`

4. **Replace Inconsistent Labels**:
   - The `.replace()` method is applied to the `TShirtSize` column, replacing each inconsistent label with its corresponding standardized label based on the dictionary.

5. **Output the Result**:
   - The `value_counts()` method shows the frequency of each standardized size in the dataset.

---

### Output:
After running the code, the output would be:

```
large     3
small     3
medium    2
Name: TShirtSize, dtype: int64
```
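
The dictionary above has to list each casing variant separately (`'Large'`, `'l'`, and so on). One possible way to keep it shorter, sketched here under the assumption that case never carries meaning in this column, is to lowercase the labels first and map only the abbreviations. The name `abbreviations` is hypothetical:

```python
import pandas as pd

sizes = ['small', 'Large', 'medium', 'sm', 'lg', 'S', 'l', 'M']
df = pd.DataFrame(sizes, columns=['TShirtSize'])

# Hypothetical mapping keyed on lowercase abbreviations only
abbreviations = {'sm': 'small', 's': 'small', 'lg': 'large', 'l': 'large', 'm': 'medium'}

# Lowercase first, then expand abbreviations; full names pass through unchanged
df['TShirtSize'] = df['TShirtSize'].str.lower().replace(abbreviations)
print(df['TShirtSize'].value_counts())
```

Because a dictionary passed to `Series.replace()` matches whole values, `'small'` is left alone even though it begins with `'s'`.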

---

### Key Insights:
- This approach ensures that all t-shirt sizes are consistent and ready for analysis.
- The dictionary-based replacement method is both flexible and easy to scale for larger datasets.

## Normalize Apparel Sizes in Data Set

Galactic Pioneer, aligning the values in our dataset is crucial for data analysis. Can you replace the lowercase sizes with their uppercase equivalents in our retail dataset? Use your knowledge of the replace method from the lesson to standardize these sizes.

```py
import pandas as pd

# Apparel sizes for a fashion retail dataset
sizes_data = ['M', 'L', 's', 'XL', 'm', 'l', 'S', 'xl', 'M']
df_sizes = pd.DataFrame(sizes_data, columns=['ApparelSize'])

# TODO: Use the replace method to convert size values to uppercase
print(df_sizes['ApparelSize'].value_counts())
```

One way to standardize the sizes is to map each lowercase label to its uppercase equivalent with the `.replace()` method:

---

### Code Solution:

```python
import pandas as pd

# Apparel sizes for a fashion retail dataset
sizes_data = ['M', 'L', 's', 'XL', 'm', 'l', 'S', 'xl', 'M']
df_sizes = pd.DataFrame(sizes_data, columns=['ApparelSize'])

# Replace lowercase size values with their uppercase equivalents
df_sizes['ApparelSize'] = df_sizes['ApparelSize'].replace({
    's': 'S', 
    'm': 'M', 
    'l': 'L', 
    'xl': 'XL'
})

# Display the value counts for standardized sizes
print(df_sizes['ApparelSize'].value_counts())
```

---

### Explanation:
1. **Dataset Setup**:
   - The list `sizes_data` contains a mix of uppercase and lowercase apparel sizes.
   - The sizes are converted into a DataFrame column named `ApparelSize`.

2. **Use `.replace()`**:
   - A dictionary is passed to the `.replace()` method to map lowercase size labels (`'s'`, `'m'`, `'l'`, `'xl'`) to their uppercase equivalents (`'S'`, `'M'`, `'L'`, `'XL'`).

3. **Standardize and Count**:
   - The `.value_counts()` method displays the frequency of each size in its standardized form.

---

### Output:
After running the code, the output will be:

```
M     3
L     2
S     2
XL    2
Name: ApparelSize, dtype: int64
```
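
After a cleanup like this, it can help to confirm that no unexpected labels remain. The sketch below assumes the canonical labels are `S`, `M`, `L`, and `XL`; the names `allowed_sizes`, `unexpected_before`, and `all_clean` are illustrative:

```python
import pandas as pd

sizes_data = ['M', 'L', 's', 'XL', 'm', 'l', 'S', 'xl', 'M']
df_sizes = pd.DataFrame(sizes_data, columns=['ApparelSize'])

# Assumed set of canonical labels for this dataset
allowed_sizes = {'S', 'M', 'L', 'XL'}

# Labels that are not canonical before cleaning
unexpected_before = df_sizes.loc[~df_sizes['ApparelSize'].isin(allowed_sizes), 'ApparelSize']
print(sorted(unexpected_before.unique()))

# Apply the same replacement as in the solution, then re-check
df_sizes['ApparelSize'] = df_sizes['ApparelSize'].replace(
    {'s': 'S', 'm': 'M', 'l': 'L', 'xl': 'XL'}
)
all_clean = df_sizes['ApparelSize'].isin(allowed_sizes).all()
print(all_clean)
```

A check like this makes it obvious when new data introduces a variant the replacement dictionary does not cover.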

---

### Why This Matters:
- Standardizing the data ensures consistency and avoids errors in analysis.
- It simplifies downstream tasks like grouping, filtering, or visualization.

## Outlier Detection in Fashion Retail Prices

Imagine you're analyzing the prices of various apparel items at a fashion retail store and stumble upon an unusually high-priced item. To maintain consistency in your analysis, it's crucial to remove such outliers. How can we filter out these anomalies? The provided code tackles this challenge. Click Run to see how outliers are removed from a dataset of clothing prices using the Interquartile Range!

```py
import pandas as pd

# Dataset of apparel prices where there might be outliers
prices = [19.99, 25.99, 22.50, 27.99, 250.00, 23.49, 19.45]
df = pd.DataFrame(prices, columns=['Price'])

# Identifying and filtering out potential outliers
Q1, Q3 = df['Price'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df = df[~((df['Price'] < (Q1 - 1.5 * IQR)) | (df['Price'] > (Q3 + 1.5 * IQR)))]
print(df)
```

### Code Explanation: Removing Outliers Using IQR Method

This code demonstrates how to detect and remove outliers from a dataset of clothing prices using the **Interquartile Range (IQR)** method.

---

### Code Breakdown:
```python
import pandas as pd

# Dataset of apparel prices where there might be outliers
prices = [19.99, 25.99, 22.50, 27.99, 250.00, 23.49, 19.45]
df = pd.DataFrame(prices, columns=['Price'])

# Identifying and filtering out potential outliers
Q1, Q3 = df['Price'].quantile([0.25, 0.75])  # Calculate Q1 (25th percentile) and Q3 (75th percentile)
IQR = Q3 - Q1                               # Calculate the Interquartile Range (IQR)

# Define lower and upper bounds for acceptable range
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out rows where 'Price' is outside the acceptable range
df = df[~((df['Price'] < lower_bound) | (df['Price'] > upper_bound))]
print(df)
```

---

### Step-by-Step Explanation:

1. **Dataset**:
   - The `prices` list contains a set of prices, including an outlier (`250.00`), which is unusually high compared to the others.

2. **Calculate Quartiles**:
   - `Q1` (25th percentile) and `Q3` (75th percentile) are computed using the `.quantile()` method.
   - These quartiles represent the boundaries of the middle 50% of the data.

3. **Compute IQR**:
   - `IQR = Q3 - Q1` measures the spread of the middle 50% of the data.

4. **Define Bounds**:
   - **Lower Bound**: `Q1 - 1.5 * IQR`
   - **Upper Bound**: `Q3 + 1.5 * IQR`
   - Any value outside these bounds is considered an outlier.

5. **Filter Outliers**:
   - The code uses a logical condition to filter rows where `Price` falls outside the defined bounds:
     ```python
     df[~((df['Price'] < lower_bound) | (df['Price'] > upper_bound))]
     ```
   - `~` negates the condition, keeping only rows that are within the acceptable range.

6. **Output**:
   - The outlier (`250.00`) is removed from the DataFrame, leaving only the reasonable prices.

---

### Output:
The filtered dataset will be:

```
   Price
0  19.99
1  25.99
2  22.50
3  27.99
5  23.49
6  19.45
```
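
The negated condition can be hard to read. As a sketch, the same filter can be written with `Series.between()`, which is inclusive on both ends by default. The variable names `negated` and `within` are illustrative:

```python
import pandas as pd

prices = [19.99, 25.99, 22.50, 27.99, 250.00, 23.49, 19.45]
df = pd.DataFrame(prices, columns=['Price'])

Q1, Q3 = df['Price'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Form used in this section: negate "below lower OR above upper"
negated = df[~((df['Price'] < lower_bound) | (df['Price'] > upper_bound))]

# Alternative: keep values between the bounds (inclusive)
within = df[df['Price'].between(lower_bound, upper_bound)]

print(within.equals(negated))
```

The two forms agree on this dataset. They differ when the column contains missing values: comparisons with `NaN` are `False`, so the negated form keeps those rows while `between()` drops them.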

---

### Why This Matters:
- Removing outliers ensures that the dataset reflects realistic pricing patterns.
- It prevents skewed analysis or misleading insights caused by anomalies.


## Fashion Size Outlier Removal

Stellar Navigator, your next mission is to identify suitable clothing for our space voyage by filtering out unusually sized articles. Use the interquartile range to remove the size outliers from the dataframe.

```py
import pandas as pd

# Fashion retail sizes with an outlier
sizes_data = [38, 42, 40, 39, 38, 130, 40]
sizes_df = pd.DataFrame(sizes_data, columns=['Size'])

# Pre-calculated first and third quartile values
Q1 = sizes_df['Size'].quantile(0.25)
Q3 = sizes_df['Size'].quantile(0.75)

# TODO: Calculate the IQR and filter the dataset by removing outliers using the IQR method
```

Here's how you can complete the mission of identifying suitable sizes by removing outliers using the **Interquartile Range (IQR)** method. Follow the code below:

```python
import pandas as pd

# Fashion retail sizes with an outlier
sizes_data = [38, 42, 40, 39, 38, 130, 40]
sizes_df = pd.DataFrame(sizes_data, columns=['Size'])

# Pre-calculated first and third quartile values
Q1 = sizes_df['Size'].quantile(0.25)  # Calculate Q1 (25th percentile)
Q3 = sizes_df['Size'].quantile(0.75)  # Calculate Q3 (75th percentile)

# Calculate the Interquartile Range (IQR)
IQR = Q3 - Q1

# Define lower and upper bounds for acceptable range
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out the outliers
filtered_sizes_df = sizes_df[(sizes_df['Size'] >= lower_bound) & (sizes_df['Size'] <= upper_bound)]

# Print the filtered DataFrame
print(filtered_sizes_df)
```

### Explanation:
1. **Calculate Quartiles**:
   - `Q1` (25th percentile) and `Q3` (75th percentile) are computed from the dataset.

2. **Interquartile Range (IQR)**:
   - `IQR = Q3 - Q1` measures the spread of the middle 50% of the data.

3. **Define Bounds**:
   - **Lower Bound**: `Q1 - 1.5 * IQR`
   - **Upper Bound**: `Q3 + 1.5 * IQR`

4. **Filter Outliers**:
   - Use logical conditions to retain only those sizes within the bounds:
     ```python
     sizes_df[(sizes_df['Size'] >= lower_bound) & (sizes_df['Size'] <= upper_bound)]
     ```
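
For this dataset, the bounds can also be worked out by hand. The calculation below assumes pandas' default linear interpolation for `quantile`, under which quantile \( q \) of \( n \) sorted values sits at zero-based position \( q(n-1) \). The sorted sizes are 38, 38, 39, 40, 40, 42, 130, so the two positions are 1.5 and 4.5:

\[
Q1 = \frac{38 + 39}{2} = 38.5, \qquad Q3 = \frac{40 + 42}{2} = 41, \qquad IQR = Q3 - Q1 = 2.5
\]

\[
\text{lower} = 38.5 - 1.5 \cdot 2.5 = 34.75, \qquad \text{upper} = 41 + 1.5 \cdot 2.5 = 44.75
\]

Only `130` lies outside the interval \( [34.75, 44.75] \), so it is the single value removed.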

---

### Output:
The filtered dataset will exclude the outlier (`130`):

```
   Size
0    38
1    42
2    40
3    39
4    38
6    40
```

This leaves a realistic set of sizes for your space voyage apparel! 🌌

## Scaling Sizes in Fashion Retail

Ever wondered how fashion retailers manage different sizes for online shopping? The code below demonstrates how feature sizes are standardized to facilitate easy comparison among different items. Click Run to observe the transformation of sizes using a StandardScaler.

```py
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Dataframe representing feature sizes in the Fashion Retail Industry
df = pd.DataFrame({'FeatureSize': [8, 10, 12, 14, 18, 20]})

# Apply the scaler to the feature sizes
scaler = StandardScaler()
df['ScaledSize'] = scaler.fit_transform(df[['FeatureSize']])
print(df)
```

Here's what the code does, step-by-step, to standardize the feature sizes in the dataset:

### **Code Walkthrough**
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Dataframe representing feature sizes in the Fashion Retail Industry
df = pd.DataFrame({'FeatureSize': [8, 10, 12, 14, 18, 20]})

# Apply the scaler to the feature sizes
scaler = StandardScaler()  # Initialize the StandardScaler
df['ScaledSize'] = scaler.fit_transform(df[['FeatureSize']])  # Standardize FeatureSize
print(df)
```

### **Output**
When the code runs, it prints a dataframe whose `ScaledSize` column holds `FeatureSize` rescaled to a **mean of 0** and a **standard deviation of 1**:
```
   FeatureSize  ScaledSize
0            8   -1.339788
1           10   -0.866921
2           12   -0.394055
3           14    0.078811
4           18    1.024544
5           20    1.497410
```

### **Explanation**
1. **StandardScaler**:
   - The `StandardScaler` scales features so they have:
     - A **mean of 0**.
     - A **standard deviation of 1**.

2. **Transformation**:
   - Each value in the column `FeatureSize` is scaled using the formula:
     \[
     z = \frac{x - \mu}{\sigma}
     \]
     Where:
     - \(x\): The original value.
     - \(\mu\): The mean of the column.
     - \(\sigma\): The standard deviation of the column (`StandardScaler` uses the population standard deviation, i.e. `ddof=0`).

3. **Why Standardize?**
   - Standardization is essential in machine learning and analysis to:
     - Ensure features are on the same scale.
     - Improve the performance of models that rely on distance metrics or optimization techniques.
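
The z-score formula can be checked directly against `StandardScaler`. This is a minimal sketch; the variable names `mu`, `sigma`, `manual`, and `from_sklearn` are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'FeatureSize': [8, 10, 12, 14, 18, 20]})

# Manual z-score: subtract the mean, divide by the population std
mu = df['FeatureSize'].mean()
sigma = df['FeatureSize'].std(ddof=0)  # ddof=0 matches StandardScaler; pandas defaults to ddof=1
manual = (df['FeatureSize'] - mu) / sigma

# Same transformation via scikit-learn, flattened to a 1-D array
from_sklearn = StandardScaler().fit_transform(df[['FeatureSize']]).ravel()

print(np.allclose(manual, from_sklearn))
```

Using the pandas default `std()` here would give slightly smaller magnitudes, because the sample standard deviation divides by \(n - 1\) rather than \(n\).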

---

### **Key Benefits**
- Facilitates **easy comparison** of feature sizes across various items.
- Helps machine learning models converge faster and perform better.
