# <center>Data Cleaning Part 1</center> 
___

## 1. Removing Duplicate or Irrelevant Observations

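When you combine datasets from multiple sources or receive data from clients, duplicate entries are common. Removing them ensures each record is counted once, so counts, totals, and averages are not inflated. Irrelevant observations, meaning rows that fall outside the scope of the analysis, should be filtered out for the same reason.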

### Techniques

#### De-duplication
- **Identify and remove duplicate entries** based on one or more columns.
- Use methods like `drop_duplicates()` in pandas.

#### Filtering
- **Filter out irrelevant data** based on specific criteria.
- Use conditional filtering in pandas.
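
As a minimal sketch of conditional filtering, assume a hypothetical customer table in which only `Sales` rows with a positive `OrderTotal` are relevant. The column names, values, and criteria are illustrative assumptions, not part of any real dataset.

```python
import pandas as pd

# Hypothetical data: column names and values are made up for illustration
df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Department': ['Sales', 'HR', 'Sales', 'IT', 'Sales'],
    'OrderTotal': [120.0, 80.0, 0.0, 45.5, 310.0],
})

# Build a boolean mask; wrap each condition in parentheses and
# combine them with & (and) or | (or)
mask = (df['Department'] == 'Sales') & (df['OrderTotal'] > 0)

# Keep only the rows where the mask is True
df_relevant = df[mask].reset_index(drop=True)

print(df_relevant)
```

Building the mask as a separate variable makes it easy to check how many rows a criterion removes before committing to it.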

### Detailed Example with Code

A company has received customer data from two different departments, resulting in duplicate entries for some customers. The goal is to remove these duplicates so each customer is represented only once.

#### Step-by-Step Process
```python
# Step 1: Import Libraries
import pandas as pd

# Step 2: Create Sample Data
# Sample data representing customer entries from two departments
data = {
    'CustomerID': [1, 2, 3, 4, 2, 5, 1],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob', 'Eva', 'Alice'],
    'Department': ['Sales', 'HR', 'Sales', 'IT', 'HR', 'Marketing', 'Sales']
}

# Create DataFrame
df = pd.DataFrame(data)

# Step 3: Remove Duplicates
# Remove duplicates based on 'CustomerID' and keep the first occurrence
df_cleaned = df.drop_duplicates(subset='CustomerID', keep='first')

# Step 4: View Cleaned Data
print("Cleaned Data:")
print(df_cleaned)
```
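
Before dropping anything, it can help to look at which rows pandas treats as duplicates. The sketch below reuses the same sample data and relies on `duplicated()`; passing `keep=False` flags every member of a duplicate group rather than only the repeats.

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 2, 5, 1],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob', 'Eva', 'Alice'],
    'Department': ['Sales', 'HR', 'Sales', 'IT', 'HR', 'Marketing', 'Sales'],
})

# keep=False marks every row that shares its CustomerID with another row
dupes = df[df.duplicated(subset='CustomerID', keep=False)]
print(dupes.sort_values('CustomerID'))

# keep='first' marks only the later repeats, i.e. the rows that
# drop_duplicates(keep='first') would remove
n_removed = df.duplicated(subset='CustomerID', keep='first').sum()
print(f"Rows that would be removed: {n_removed}")
```

If rows in the same group disagree on other columns, that is a sign the duplicates need to be reconciled rather than simply dropped.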

___

## 2. Fixing Structural Errors

Structural errors occur when data entries have inconsistent formats, incorrect naming conventions, or typographical errors. Fixing these errors ensures data consistency and accuracy, leading to more reliable analysis.

### 1. Data Validation
- **Purpose**: Ensure data meets specified formats and constraints.
- **Method**: Use validation libraries and techniques to check data integrity.
- **Example**: Using Python's `pandas` library to validate date formats.
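
One possible way to validate a date column with pandas is sketched below. It assumes the expected format is `YYYY-MM-DD`; the `OrderID` and `OrderDate` columns and the `IsValidDate` flag are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical data with two invalid entries
df = pd.DataFrame({
    'OrderID': [101, 102, 103, 104],
    'OrderDate': ['2021-03-15', '2021-13-01', 'not a date', '2021-04-30'],
})

# errors='coerce' turns values that do not match the format into NaT
parsed = pd.to_datetime(df['OrderDate'], format='%Y-%m-%d', errors='coerce')

# Flag rows whose date could not be parsed
df['IsValidDate'] = parsed.notna()
invalid_rows = df[~df['IsValidDate']]

print(invalid_rows)
```

Flagging invalid rows instead of dropping them right away leaves room to inspect and correct them at the source.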

### 2. Correcting Typos
- **Purpose**: Fix spelling errors and incorrect entries.
- **Method**: Use spell check libraries or manual corrections.
- **Example**: Using `FuzzyWuzzy` library to correct misspellings in text data.

### 3. Standardizing Formats
- **Purpose**: Ensure uniformity in data presentation.
- **Method**: Convert data entries to a standard format.
- **Example**: Standardizing date formats from "MM/DD/YYYY" to "YYYY-MM-DD".
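
The conversion named in the example above might look like the following sketch. It assumes every value really is month-first; the sample dates are made up.

```python
import pandas as pd

# Assumption: all values use the "MM/DD/YYYY" format
dates = pd.Series(['12/25/2020', '01/15/2021', '07/04/2021'])

# Parse with an explicit format so nothing is guessed
parsed = pd.to_datetime(dates, format='%m/%d/%Y')

# Render the parsed dates as "YYYY-MM-DD" strings
standardized = parsed.dt.strftime('%Y-%m-%d')

print(standardized.tolist())
```

Keeping the parsed datetime column around is often more useful than the formatted strings, since it supports sorting and date arithmetic directly.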

### Detailed Example with Code

#### Scenario
A dataset mixes two date formats in one column ("MM/DD/YYYY" and "DD/MM/YYYY"). The goal is to parse every entry into a single, consistent datetime representation.

#### Step-by-Step Process

```python
# Step 1: Import Libraries
import pandas as pd

# Step 2: Create Sample Data
# Sample data with inconsistent date formats
data = {
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'JoinDate': ['12/25/2020', '25/12/2020', '01/15/2021', '15/01/2021']
}

# Create DataFrame
df = pd.DataFrame(data)

# Step 3: View Initial Data
print("Initial Data:")
print(df)

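# Step 4: Standardize Date Formats
# Try each known format in order and return the first successful parse.
# Caveat: a date such as "03/04/2021" is valid in both formats, so it is
# always read as month-first here; only the data source can settle that.
def standardize_date(date_str):
    for fmt in ("%m/%d/%Y", "%d/%m/%Y"):
        try:
            return pd.to_datetime(date_str, format=fmt)
        except ValueError:
            pass
    return pd.NaT  # Return NaT if no format matches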

# Apply the function to the 'JoinDate' column
df['JoinDate'] = df['JoinDate'].apply(standardize_date)

# Step 5: View Cleaned Data
print("Cleaned Data:")
print(df)
```

#### Scenario: Correcting Typos
A dataset contains customer names with potential misspellings. The goal is to map each name to its closest match in a list of known correct names.

#### Step-by-Step Process
```python
# Step 1: Import Libraries
# fuzzywuzzy is a third-party package (the project is now maintained
# under the name thefuzz)
import pandas as pd
from fuzzywuzzy import process

# Step 2: Create Sample Data
# Sample data with misspellings
data = {
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alicee', 'Bobb', 'Charli', 'Davidd']
}

# Create DataFrame
df = pd.DataFrame(data)

# Step 3: View Initial Data
print("Initial Data:")
print(df)

# Step 4: Define Correct Names List
correct_names = ['Alice', 'Bob', 'Charlie', 'David']

# Step 5: Correct Typos Using Fuzzy Matching
def correct_typo(name, correct_names):
    # extractOne returns the best candidate and its similarity score (0-100)
    match = process.extractOne(name, correct_names)
    # Accept the match only above a score of 80; otherwise keep the original
    return match[0] if match[1] > 80 else name

df['CorrectedName'] = df['Name'].apply(lambda x: correct_typo(x, correct_names))
print("Data with Corrected Typos:")
print(df)
```
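
Where installing a third-party package is not an option, Python's standard-library `difflib` offers a comparable closest-match lookup. This is a sketch of that alternative, not the method used above; the `cutoff` of 0.8 is an assumed threshold that would need tuning on real data.

```python
import difflib
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alicee', 'Bobb', 'Charli', 'Davidd'],
})
correct_names = ['Alice', 'Bob', 'Charlie', 'David']

def correct_typo(name, candidates, cutoff=0.8):
    # get_close_matches returns up to n candidates whose similarity
    # ratio is at least `cutoff`, best match first
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    # Fall back to the original value when nothing is close enough
    return matches[0] if matches else name

df['CorrectedName'] = df['Name'].apply(lambda x: correct_typo(x, correct_names))
print(df)
```

Note that `difflib` scores similarity on a 0 to 1 scale, so thresholds are not directly comparable with the 0 to 100 scores from fuzzy-matching libraries.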
___


## 3. Handling Missing Data

Missing values can bias estimates, shrink the usable sample, and break algorithms that do not accept nulls. The main strategies are imputation, deletion, and interpolation; which one fits depends on how much data is missing and why.

### 1. Imputation
**Purpose**: Replace missing values with substituted values.

**Types**:
- **Mean/Median/Mode Imputation**: Replace missing values with the mean, median, or mode of the column.
- **K-Nearest Neighbors (KNN) Imputation**: Use the values of k-nearest neighbors to impute missing values.
- **Regression Imputation**: Predict missing values using a regression model.
- **Multiple Imputation**: Use multiple predictions to account for the uncertainty of the missing values.
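
To make regression imputation concrete, here is a minimal sketch that fits a straight line of `LotArea` on `Price` using the complete rows and predicts the missing value from it. The data is hypothetical and deliberately linear, and `np.polyfit` stands in for a full regression model.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data with one missing LotArea
df = pd.DataFrame({
    'Price': [250000, 300000, 350000, 400000, 450000],
    'LotArea': [5000.0, 6000.0, np.nan, 8000.0, 9000.0],
})

# Fit LotArea = slope * Price + intercept on rows where LotArea is known
known = df['LotArea'].notna()
slope, intercept = np.polyfit(
    df.loc[known, 'Price'], df.loc[known, 'LotArea'], deg=1
)

# Predict LotArea only for the rows where it is missing
df.loc[~known, 'LotArea'] = slope * df.loc[~known, 'Price'] + intercept

print(df)
```

A single regression prediction ignores the uncertainty around the fitted line, which is the gap multiple imputation is meant to address.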

### 2. Deletion
**Purpose**: Remove missing data entries.

**Types**:
- **Listwise Deletion**: Remove rows with any missing values.
- **Pairwise Deletion**: Exclude a row only from the analyses that need a variable it is missing; the row is still used wherever its available values suffice.
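
The difference between the two deletion types can be sketched as follows, assuming a hypothetical survey table and an analysis that only involves `Age` and `Income`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each of rows 1, 2, and 3 is missing one value
df = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40],
    'Income': [50000, np.nan, 62000, 58000, 70000],
    'Rating': [5, 2, 3, np.nan, 4],
})

# Listwise deletion: a row with any missing value is dropped for all analyses
listwise = df.dropna()

# Pairwise deletion for an Age/Income analysis: a row is dropped only if
# Age or Income is missing; a missing Rating does not matter here
age_income = df.dropna(subset=['Age', 'Income'])

print(len(listwise), len(age_income))
```

Pairwise deletion keeps more data per analysis, at the cost that different analyses end up running on different subsets of rows.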

### 3. Interpolation
**Purpose**: Estimate missing values within the range of available data.

**Types**:
- **Linear Interpolation**: Use linear relationships to estimate missing values.
- **Polynomial Interpolation**: Use polynomial relationships for estimation.
- **Spline Interpolation**: Use spline functions for smooth estimation.
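
As a rough comparison of linear and polynomial interpolation, the sketch below fills the same gaps both ways. pandas also accepts `method='polynomial'` and `method='spline'` (with an `order` argument) in `interpolate()`, but those rely on SciPy, so this sketch fits the polynomial with NumPy instead. The sales figures are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps
sales = pd.Series([200.0, np.nan, 260.0, np.nan, 400.0])

# Linear: each gap is filled on the straight line between its neighbors
linear = sales.interpolate(method='linear')

# Quadratic: fit a degree-2 polynomial through the three known points
known = sales.notna().to_numpy()
x = np.arange(len(sales))
coeffs = np.polyfit(x[known], sales[known], deg=2)

# Evaluate the polynomial at the missing positions only
quadratic = sales.copy()
quadratic[~known] = np.polyval(coeffs, x[~known])

print(linear.tolist())
print(quadratic.round(2).tolist())
```

The two methods agree at the known points and differ only in how they curve through the gaps, so the choice matters most when the underlying trend is clearly non-linear.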

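### Code Examples

#### Mean and Mode Imputation

```python
import pandas as pd
import numpy as np

# Sample data
data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Gender': ['Male', 'Female', np.nan, 'Female', 'Male'],
    'Age': [25, 30, 45, 35, np.nan]
}

df = pd.DataFrame(data)

# Mean imputation for Age
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may not update df in recent pandas versions
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Mode imputation for Gender (mode() returns a Series; take the first value)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

print(df)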
```

#### K-Nearest Neighbors (KNN) Imputation

```python
# Step 1: Import Libraries
import pandas as pd
from sklearn.impute import KNNImputer

# Step 2: Create Sample Data
# Sample data with missing 'LotArea' values
data = {
    'HouseID': [1, 2, 3, 4, 5],
    'LotArea': [5000, 6000, None, 7000, 5500],
    'Price': [250000, 300000, 350000, 400000, 320000]
}

# Create DataFrame
df = pd.DataFrame(data)

# Step 3: View Initial Data
print("Initial Data:")
print(df)

# Step 4: Impute Missing Values with KNN
# Initialize KNN Imputer
imputer = KNNImputer(n_neighbors=2)

# KNN needs at least one other feature to measure distance between rows.
# With 'LotArea' alone it effectively falls back to the column mean, so
# 'Price' is included to find each row's nearest neighbors.
imputed = imputer.fit_transform(df[['LotArea', 'Price']])
df['LotArea'] = imputed[:, 0]

# Step 5: View Cleaned Data
print("Cleaned Data:")
print(df)
```

#### Listwise Deletion

```python
# Step 1: Import Libraries
import pandas as pd

# Step 2: Create Sample Data
# Sample data with missing values
data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Feedback': ['Good', 'Bad', None, 'Average', 'Excellent'],
    'Rating': [5, 2, 3, None, 4]
}

# Create DataFrame
df = pd.DataFrame(data)

# Step 3: View Initial Data
print("Initial Data:")
print(df)

# Step 4: Remove Rows with Missing Values
# Drop rows with any missing values
df_cleaned = df.dropna()

# Step 5: View Cleaned Data
print("Cleaned Data:")
print(df_cleaned)
```

#### Linear Interpolation

```python
# Step 1: Import Libraries
import pandas as pd

# Step 2: Create Sample Data
# Sample time series data with missing values
data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
    'Sales': [200, None, 250, None, 300]
}

# Create DataFrame
df = pd.DataFrame(data)

# Step 3: View Initial Data
print("Initial Data:")
print(df)

# Step 4: Interpolate Missing Values
# Interpolate missing values linearly
df['Sales'] = df['Sales'].interpolate(method='linear')

# Step 5: View Cleaned Data
print("Cleaned Data:")
print(df)
```
___
