# Week 3: Data Cleaning and Transformation


## Objectives:
This week, you will:
1. Learn how to identify and clean dirty data.
2. Practice transforming data to prepare it for analysis.
3. Understand how to handle missing data, duplicates, and outliers.




## 1. Data Cleaning Basics
Data cleaning is an essential step in data preparation. It involves fixing or removing incorrect, corrupted, or incomplete records from a dataset. This section introduces the most common issues and the pandas methods used to address them.

### Common Data Issues:
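- **Missing Data**: Missing values can bias summary statistics and cause errors in calculations.
- **Duplicate Data**: Repeated records give some observations extra weight and produce misleading statistics.
- **Outliers**: Extreme values that deviate markedly from the rest of the data and can distort averages.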

### Techniques:
- Missing data: `.isnull()`, `.dropna()`, and `.fillna()`
- Duplicates: `.duplicated()` and `.drop_duplicates()`
- Outliers: detection and handling with the IQR or Z-score.

Let's start by loading a dataset and identifying these common issues.


In [1]:

# Importing pandas and loading a sample dataset
import pandas as pd

# Sample dataset with some missing, duplicate, and outlier values
data = {
    'Visitor ID': [1, 2, 2, 3, 4],
    'Visit Date': ['2024-01-01', '2024-01-02', None, '2024-01-03', '2024-01-04'],
    'Location': ['Park A', 'Museum B', 'Museum B', 'Park A', 'Beach C'],
    'Visitors': [200, 150, 150, 300, 5000],  # Notice the outlier in Visitors
    'Revenue': [1000, 750, 750, 1500, None]  # Missing value in Revenue
}

df = pd.DataFrame(data)

# Display the dataset
df


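|   | Visitor ID | Visit Date | Location | Visitors | Revenue |
|---|------------|------------|----------|----------|---------|
| 0 | 1 | 2024-01-01 | Park A | 200 | 1000.0 |
| 1 | 2 | 2024-01-02 | Museum B | 150 | 750.0 |
| 2 | 2 | None | Museum B | 150 | 750.0 |
| 3 | 3 | 2024-01-03 | Park A | 300 | 1500.0 |
| 4 | 4 | 2024-01-04 | Beach C | 5000 | NaN |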
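
Before fixing anything, it helps to measure how widespread each issue is. The following is a minimal sketch, assuming the same sample data as the cell above (rebuilt here so the snippet runs on its own). The variable names `missing_counts` and `repeated_ids` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Same sample data as the cell above, repeated so this sketch is self-contained
df = pd.DataFrame({
    'Visitor ID': [1, 2, 2, 3, 4],
    'Visit Date': ['2024-01-01', '2024-01-02', None, '2024-01-03', '2024-01-04'],
    'Location': ['Park A', 'Museum B', 'Museum B', 'Park A', 'Beach C'],
    'Visitors': [200, 150, 150, 300, 5000],
    'Revenue': [1000, 750, 750, 1500, None],
})

# Count missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)

# Count rows whose 'Visitor ID' already appeared in an earlier row
repeated_ids = df.duplicated(subset=['Visitor ID']).sum()
print("Repeated Visitor IDs:", repeated_ids)

# Summary statistics make the extreme 'Visitors' value easy to spot
print(df['Visitors'].describe())
```

Here `.isnull().sum()` reports one missing value each in `Visit Date` and `Revenue`, and the maximum in the `.describe()` output sits far above the other quartiles.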



## 2. Handling Missing Data
Missing data is common in real datasets. You must decide whether to remove rows with missing values, which discards everything else in those rows, or to fill the gaps with estimates, which keeps the rows but introduces approximated values.

### Methods:
- **Drop missing values**: Use `.dropna()`
- **Fill missing values**: Use `.fillna()` with mean, median, or other statistics.

Let's handle the missing values in our dataset.


In [2]:

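# Drop rows with missing values
df_cleaned = df.dropna()

# Alternatively, fill missing numeric values with the column mean.
# numeric_only=True skips text columns such as 'Visit Date' and 'Location',
# which have no mean and otherwise raise a TypeError.
df_filled = df.fillna(df.mean(numeric_only=True))

# Show the filled dataset ('Visit Date' still contains a missing value)
df_filled
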
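
The mean is only one possible fill value, and it does not apply to text columns. As a rough sketch of filling each column with its own strategy, the snippet below uses the median for `Revenue` and a placeholder label for `Visit Date`. The `'Unknown'` label and the name `df_filled_custom` are assumptions made for illustration. In practice you might prefer to drop or investigate rows with a missing date.

```python
import pandas as pd

# Same sample data as earlier, repeated so this sketch is self-contained
df = pd.DataFrame({
    'Visitor ID': [1, 2, 2, 3, 4],
    'Visit Date': ['2024-01-01', '2024-01-02', None, '2024-01-03', '2024-01-04'],
    'Location': ['Park A', 'Museum B', 'Museum B', 'Park A', 'Beach C'],
    'Visitors': [200, 150, 150, 300, 5000],
    'Revenue': [1000, 750, 750, 1500, None],
})

# Choose a fill value per column: the median for the numeric column,
# and a hypothetical placeholder label for the date column
fill_values = {
    'Revenue': df['Revenue'].median(),
    'Visit Date': 'Unknown',
}

# fillna accepts a dict that maps column names to fill values
df_filled_custom = df.fillna(value=fill_values)
print(df_filled_custom)
```

The median is less sensitive to extreme values than the mean, which matters when a column also contains outliers.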


## 3. Handling Duplicate Data
Duplicate records can skew your results by giving extra weight to certain data points. Let's identify and remove duplicates from the dataset.

### Methods:
- **Detect duplicates**: Use `.duplicated()`
- **Remove duplicates**: Use `.drop_duplicates()`


In [None]:

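# Check for fully identical rows. None are flagged here, because the two
# rows for Visitor ID 2 differ in 'Visit Date'.
print(df.duplicated())

# Treat rows that share a 'Visitor ID' as duplicates and keep the first one
df_no_duplicates = df.drop_duplicates(subset=['Visitor ID'], keep='first')

# Show dataset after removing duplicates
df_no_duplicates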
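
By default, `.duplicated()` compares entire rows, so the two rows for Visitor ID 2 are not flagged: one of them has a missing `Visit Date`. The sketch below assumes that `Visitor ID` is the column that defines a duplicate, and shows how the `subset` and `keep` parameters change what is flagged and retained. The names `dup_mask` and `df_keep_last` are illustrative.

```python
import pandas as pd

# Same sample data as earlier, repeated so this sketch is self-contained
df = pd.DataFrame({
    'Visitor ID': [1, 2, 2, 3, 4],
    'Visit Date': ['2024-01-01', '2024-01-02', None, '2024-01-03', '2024-01-04'],
    'Location': ['Park A', 'Museum B', 'Museum B', 'Park A', 'Beach C'],
    'Visitors': [200, 150, 150, 300, 5000],
    'Revenue': [1000, 750, 750, 1500, None],
})

# keep=False marks every row in a duplicate group, not only the later copies,
# which is useful for inspecting the rows side by side before deleting any
dup_mask = df.duplicated(subset=['Visitor ID'], keep=False)
print(df[dup_mask])

# keep='last' retains the final occurrence of each 'Visitor ID' instead of the first
df_keep_last = df.drop_duplicates(subset=['Visitor ID'], keep='last')
print(df_keep_last)
```

Inspecting the group first matters here: keeping the last copy would retain the row with the missing date, so `keep='first'` is the safer choice for this dataset.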



## 4. Handling Outliers
Outliers are extreme values that differ significantly from other observations. They can impact the results of your analysis, so it is important to detect and handle them properly.

### Methods:
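- **Interquartile Range (IQR)**: Flag values that fall more than 1.5 × IQR below the 1st quartile or above the 3rd quartile, where the IQR is the distance between those two quartiles.
- **Z-Score**: Measure how many standard deviations each value lies from the mean and flag values beyond a chosen threshold.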

Let's detect and handle outliers in our dataset.


In [None]:

# Detect outliers using the Interquartile Range (IQR)
Q1 = df['Visitors'].quantile(0.25)
Q3 = df['Visitors'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier limits
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
df_no_outliers = df[(df['Visitors'] >= lower_bound) & (df['Visitors'] <= upper_bound)]

# Show dataset after removing outliers
df_no_outliers
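
The Z-score method can be sketched in a similar way. The example below uses a hypothetical series of 20 daily visitor counts instead of the five-row sample, because the Z-score is unreliable on very small samples: with five values and the sample standard deviation, no Z-score can exceed roughly 1.79, so a threshold of 3 would never flag anything. The data, the `threshold` value, and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical daily visitor counts (not from the dataset above):
# 19 ordinary days and one extreme day
visitors = pd.Series(
    [200, 150, 180, 300, 220, 210, 190, 250, 170, 230,
     160, 240, 205, 195, 215, 185, 225, 175, 260, 5000],
    name='Visitors',
)

# Z-score: distance from the mean in units of the sample standard deviation
z_scores = (visitors - visitors.mean()) / visitors.std()

# Flag values more than 3 standard deviations from the mean (a common convention)
threshold = 3
outliers = visitors[z_scores.abs() > threshold]
print(outliers)

# Keep only the values within the threshold
visitors_no_outliers = visitors[z_scores.abs() <= threshold]
```

Because the mean and standard deviation are themselves pulled toward extreme values, the IQR approach is often the more robust choice for small or heavily skewed samples.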



## 5. Summary
This week, you learned how to:
1. Identify common data issues such as missing values, duplicates, and outliers.
2. Apply techniques to clean and transform your data for analysis.

### Homework:
- Apply these techniques to a real-world dataset and practice cleaning it.
- Explore more advanced techniques such as outlier detection using Z-score or robust statistics.

