# Data Cleaning with Pandas

This notebook focuses on **data cleaning** in Pandas to handle messy, incomplete, or incorrect data. You'll learn how to identify and address missing values, duplicates, inconsistent data types, and string issues to prepare your dataset for analysis.

## Core Concepts

- **Data Cleaning**: The process of identifying and correcting errors, inconsistencies, or missing values in a dataset to ensure data quality.
- **Common Issues**:
  - Missing values (e.g., `NaN`, `None`).
  - Duplicate rows or entries.
  - Inconsistent data types (e.g., numbers stored as strings).
  - Inconsistent string formats (e.g., mixed case, extra spaces).
- **Goal**: Transform raw data into a clean, consistent format suitable for analysis or modeling.

## Key Methods & Functions

Below are the essential methods for data cleaning in Pandas:

- **`.isnull()` / `.isna()`**: Detect missing values (returns `True` for `NaN`, `None`, or `NaT`); the two names are aliases.
- **`.notnull()` / `.notna()`**: Detect non-missing values.
- **`.dropna()`**: Remove rows or columns with missing values.
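- **`.fillna()`**: Fill missing values with a constant or a computed statistic such as the mean or median; `.ffill()` / `.bfill()` handle forward/backward fill.
- **`.drop_duplicates()`**: Remove duplicate rows, comparing all columns by default or only those listed in `subset`.
- **`.duplicated()`**: Flag duplicate rows with a boolean mask (by default, every occurrence after the first).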
- **`.str` methods**: Manipulate strings (e.g., `.str.lower()`, `.str.strip()`, `.str.replace()`).
- **`.astype()`**: Convert column data types (e.g., string to integer).
- **`.replace()`**: Replace specific values in a DataFrame or Series.
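
As a quick illustration of the detection and forward/backward fill methods listed above, here is a minimal sketch on a small made-up Series. The `readings` name and its values are assumptions for illustration only and are not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps (illustrative values only)
readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

missing_mask = readings.isna()      # True where a value is missing
present_mask = readings.notna()     # exact complement of the mask above

forward_filled = readings.ffill()   # carry the last valid value forward
backward_filled = readings.bfill()  # pull the next valid value backward

print(missing_mask.sum())           # number of missing entries
print(forward_filled.tolist())
print(backward_filled.tolist())
```

Note that a backward fill cannot fill a trailing gap, because there is no later value to copy from.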

## Learning Objectives

- Identify and analyze patterns of missing data.
- Apply different strategies for handling missing values (e.g., drop, fill).
- Detect and remove duplicate rows.
- Clean and standardize string data (e.g., remove spaces, standardize case).
- Convert data types to ensure consistency and enable calculations.

### 1. Setting Up a Sample Dataset

In [8]:
import pandas as pd
import numpy as np

# Creating a sample DataFrame with messy data
data = {
    'name': ['Alice ', 'BOB', '  Charlie', 'David', 'Eve', 'Alice ', None],
    'age': [25, '30', 35, None, 22, 25, '28'],
    'salary': [50000, 60000, np.nan, 52000, 48000, 50000, 55000],
    'department': ['HR', 'IT', 'IT', 'Marketing', 'hr', 'HR', 'IT'],
    'years_experience': [2, 5, 10, 3, 1, 2, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:
        name   age   salary department  years_experience
0     Alice     25  50000.0         HR               2.0
1        BOB    30  60000.0         IT               5.0
2    Charlie    35      NaN         IT              10.0
3      David  None  52000.0  Marketing               3.0
4        Eve    22  48000.0         hr               1.0
5     Alice     25  50000.0         HR               2.0
6       None    28  55000.0         IT               NaN


### 2. Detecting Missing Values

**Issues in the Dataset** (created in step 1):
- Missing values in `name`, `age`, `salary`, and `years_experience`.
- Duplicate rows (index 5 repeats index 0).
- Inconsistent string formatting in `name` (extra spaces, mixed case).
- Inconsistent `department` values (e.g., 'HR' vs 'hr').
- Mixed data types in `age` (integers and numeric strings in the same column).
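
One way to confirm the type problem in `age` is to count the Python type of each element. The sketch below rebuilds just that column as a standalone Series, so the `age` variable here is a stand-in for `df['age']`.

```python
import pandas as pd

# Same values as the 'age' column of the sample DataFrame
age = pd.Series([25, '30', 35, None, 22, 25, '28'], name='age')

# Count the Python type of each element in the column
type_counts = age.map(lambda value: type(value).__name__).value_counts()

print(age.dtype)      # the column dtype pandas assigned to the mixed values
print(type_counts)    # how many ints, strings, and None values it holds
```

A column that mixes types this way cannot be used reliably for arithmetic until it is converted.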

In [9]:
# Checking for missing values
print("Missing values in each column:")
print(df.isnull().sum())

# Checking for non-missing values
print("\nNon-missing values in each column:")
print(df.notnull().sum())

Missing values in each column:
name                1
age                 1
salary              1
department          0
years_experience    1
dtype: int64

Non-missing values in each column:
name                6
age                 6
salary              6
department          7
years_experience    6
dtype: int64
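
Column totals are only one view of missingness. As a hedged sketch, the snippet below uses a small stand-in frame (the `people` name and its values are assumptions) to compute the share of missing values per column, the number of gaps per row, and the subset of incomplete rows.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with the same kinds of gaps as the sample data
people = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'salary': [50000.0, np.nan, 55000.0],
    'years_experience': [2.0, 5.0, np.nan],
})

# Share of missing values per column, as a percentage
missing_pct = people.isna().mean() * 100

# Number of missing fields in each row
missing_per_row = people.isna().sum(axis=1)

# Rows that contain at least one missing value
incomplete_rows = people[people.isna().any(axis=1)]

print(missing_pct.round(1))
print(missing_per_row)
print(incomplete_rows)
```

Looking at rows as well as columns helps show whether gaps are scattered or concentrated in a few records.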


### 3. Handling Missing Values


In [10]:
# Dropping rows with any missing values (returns a new DataFrame; df is unchanged)
df_dropped = df.dropna()
print("DataFrame after dropping rows with missing values:")
print(df_dropped)

# Filling missing values in 'salary' with the mean
df['salary'] = df['salary'].fillna(df['salary'].mean())
print("\nDataFrame after filling missing salary with mean:")
print(df[['name', 'salary']])

# Filling missing values in 'name' with a default value
df['name'] = df['name'].fillna('Unknown')
print("\nDataFrame after filling missing names:")
print(df[['name', 'salary']])

DataFrame after dropping rows with missing values:
     name age   salary department  years_experience
0  Alice   25  50000.0         HR               2.0
1     BOB  30  60000.0         IT               5.0
4     Eve  22  48000.0         hr               1.0
5  Alice   25  50000.0         HR               2.0

DataFrame after filling missing salary with mean:
        name   salary
0     Alice   50000.0
1        BOB  60000.0
2    Charlie  52500.0
3      David  52000.0
4        Eve  48000.0
5     Alice   50000.0
6       None  55000.0

DataFrame after filling missing names:
        name   salary
0     Alice   50000.0
1        BOB  60000.0
2    Charlie  52500.0
3      David  52000.0
4        Eve  48000.0
5     Alice   50000.0
6    Unknown  55000.0
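
Dropping every incomplete row or filling with a global mean are not the only options. The sketch below, on an assumed stand-in frame named `staff`, shows `dropna` restricted by `subset` and `thresh`, plus a group-wise median fill. The choice of median and of `department` as the grouping key is illustrative, not a recommendation from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', None],
    'department': ['HR', 'IT', 'IT', 'IT'],
    'salary': [50000.0, 60000.0, np.nan, 55000.0],
    'years_experience': [2.0, 5.0, np.nan, np.nan],
})

# Drop a row only when a required column is missing
with_names = staff.dropna(subset=['name'])

# Keep rows that have at least 3 non-missing fields
mostly_complete = staff.dropna(thresh=3)

# Fill missing salaries with the median salary of the same department
dept_median = staff.groupby('department')['salary'].transform('median')
salary_filled = staff['salary'].fillna(dept_median)

print(with_names)
print(mostly_complete)
print(salary_filled)
```

A group-wise fill tends to distort the data less than a single global value when groups differ noticeably.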


### 4. Detecting and Removing Duplicates


In [11]:
# Identifying duplicate rows
print("Duplicate rows:")
print(df[df.duplicated()])

# Removing duplicate rows (keeping the first occurrence)
# The result is stored separately, so df itself still contains the duplicate
df_no_duplicates = df.drop_duplicates(keep='first')
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)

Duplicate rows:
     name age   salary department  years_experience
5  Alice   25  50000.0         HR               2.0

DataFrame after removing duplicates:
        name   age   salary department  years_experience
0     Alice     25  50000.0         HR               2.0
1        BOB    30  60000.0         IT               5.0
2    Charlie    35  52500.0         IT              10.0
3      David  None  52000.0  Marketing               3.0
4        Eve    22  48000.0         hr               1.0
6    Unknown    28  55000.0         IT               NaN
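
Exact comparison misses rows that differ only in spacing or letter case. The following sketch, on an assumed `records` frame, builds normalized comparison keys before calling `duplicated` with `subset`, and uses `keep=False` to flag every member of a duplicate group.

```python
import pandas as pd

# Hypothetical records that differ only in formatting
records = pd.DataFrame({
    'name': ['Alice ', 'alice', 'Bob', 'Bob'],
    'department': ['HR', 'hr', 'IT', 'Sales'],
})

# Exact comparison sees no duplicates because of spacing and case
exact_dupes = records.duplicated().sum()

# Build normalized comparison keys, then flag repeats on those keys
keys = records.assign(
    name=records['name'].str.strip().str.lower(),
    department=records['department'].str.strip().str.lower(),
)
is_repeat = keys.duplicated(subset=['name', 'department'], keep='first')
deduplicated = records[~is_repeat]

# keep=False marks every member of a duplicate group, which helps review
all_members = keys.duplicated(subset=['name'], keep=False)

print(exact_dupes)
print(deduplicated)
print(all_members)
```

This is one reason to consider cleaning strings before removing duplicates rather than after.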


### 5. String Cleaning and Standardization


In [12]:
# Removing leading/trailing spaces and standardizing case in 'name'
df['name'] = df['name'].str.strip().str.title()
print("DataFrame after cleaning 'name' column:")
print(df[['name', 'salary']])

# Standardizing 'department' values (e.g., 'hr' to 'HR')
# Note: .str.title() also changes 'IT' to 'It', as the output shows
df['department'] = df['department'].str.title().replace({'Hr': 'HR'})
print("\nDataFrame after standardizing 'department':")
print(df[['name', 'department']])

DataFrame after cleaning 'name' column:
      name   salary
0    Alice  50000.0
1      Bob  60000.0
2  Charlie  52500.0
3    David  52000.0
4      Eve  48000.0
5    Alice  50000.0
6  Unknown  55000.0

DataFrame after standardizing 'department':
      name department
0    Alice         HR
1      Bob         It
2  Charlie         It
3    David  Marketing
4      Eve         HR
5    Alice         HR
6  Unknown         It
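
Because `.str.title()` rewrites acronyms such as 'IT', an alternative is to normalize to a lower-case key and map it to a canonical label. The `canonical` lookup below is a hypothetical mapping written for this sketch. The second part shows collapsing repeated internal whitespace with a regular expression.

```python
import pandas as pd

departments = pd.Series(['HR', 'hr ', ' it', 'IT', 'marketing', 'Marketing'])

# Hypothetical lookup from lower-case labels to canonical spellings
canonical = {'hr': 'HR', 'it': 'IT', 'marketing': 'Marketing'}

# Normalize to a lower-case key first, then map to the canonical label
cleaned = departments.str.strip().str.lower().map(canonical)

# Collapse repeated internal whitespace in free-text names
names = pd.Series(['  mary   ann ', 'BOB'])
tidy_names = names.str.strip().str.replace(r'\s+', ' ', regex=True).str.title()

print(cleaned.tolist())
print(tidy_names.tolist())
```

Labels that are absent from the lookup come back as missing values from `.map()`, which makes unexpected categories easy to spot.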


### 6. Converting Data Types


In [13]:
# Converting 'age' to numeric; the missing entry becomes NaN, so the result is float64
df['age'] = pd.to_numeric(df['age'], errors='coerce')
print("Data types after converting 'age':")
print(df.dtypes)

# Converting 'salary' to integer (possible because its missing value was filled in step 3)
df['salary'] = df['salary'].astype(int)
print("\nDataFrame after converting 'salary' to integer:")
print(df[['name', 'salary']])

Data types after converting 'age':
name                 object
age                 float64
salary              float64
department           object
years_experience    float64
dtype: object

DataFrame after converting 'salary' to integer:
      name  salary
0    Alice   50000
1      Bob   60000
2  Charlie   52500
3    David   52000
4      Eve   48000
5    Alice   50000
6  Unknown   55000
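
When a column must stay integer-valued but still contains gaps, pandas' nullable `Int64` dtype is one option. The sketch below uses an assumed `raw_age` Series that also includes an unparseable entry, to show what `errors='coerce'` does with it.

```python
import pandas as pd

# Hypothetical raw column with numbers, numeric strings, a gap, and bad text
raw_age = pd.Series([25, '30', None, 'unknown', '28'])

# Unparseable entries become NaN, so the result is float64
age_float = pd.to_numeric(raw_age, errors='coerce')

# The nullable 'Int64' dtype keeps whole numbers and marks gaps as <NA>
age_int = age_float.astype('Int64')

print(age_float.dtype)
print(age_int.dtype)
print(age_int.isna().sum())
```

Since `errors='coerce'` silently turns bad text into missing values, it is worth comparing the missing count before and after conversion.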


### 7. Replacing Specific Values


In [14]:
# Replacing specific values in 'department'
# ('It' is the result of .str.title() in step 5; map it back to 'IT')
df['department'] = df['department'].replace({'Marketing': 'MKT', 'It': 'IT'})
print("DataFrame after replacing 'Marketing' with 'MKT' and 'It' with 'IT':")
print(df[['name', 'department']])

DataFrame after replacing 'Marketing' with 'MKT' and 'It' with 'IT':
      name department
0    Alice         HR
1      Bob         IT
2  Charlie         IT
3    David        MKT
4      Eve         HR
5    Alice         HR
6  Unknown         IT
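
`.replace()` is also handy for turning placeholder values into real missing values so that `.isna()`, `.dropna()`, and `.fillna()` can see them. The `survey` frame and its placeholder markers ('N/A', '?', -999) are assumptions made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data where gaps were recorded with placeholder markers
survey = pd.DataFrame({
    'city': ['Paris', 'N/A', 'Lima', '?'],
    'score': [7, -999, 9, 5],
})

# Replace text placeholders with NaN in the string column
city_clean = survey['city'].replace(['N/A', '?'], np.nan)

# Replace the numeric sentinel with NaN so it no longer skews statistics
score_clean = survey['score'].replace(-999, np.nan)

print(city_clean.isna().sum())
print(score_clean.mean())
```

Left in place, a sentinel such as -999 would be averaged in like any other number.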


## Key Takeaways

- **Missing Values**: Use `.isnull()` to detect missing data and `.dropna()` or `.fillna()` to handle it, depending on your analysis needs.
- **Duplicates**: Identify duplicates with `.duplicated()` and remove them with `.drop_duplicates()` to avoid skewed results.
- **String Cleaning**: Use `.str` methods (e.g., `.str.strip()`, `.str.title()`) to standardize text data, and check acronyms afterward, since `.str.title()` turns 'IT' into 'It'.
- **Data Types**: Convert columns to appropriate types using `.astype()` or `pd.to_numeric()` for accurate calculations.
- **Value Replacement**: Use `.replace()` for targeted value substitutions to ensure consistency.
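
The takeaways above can be combined into a single function. The sketch below is a hypothetical helper (`clean_people` is not part of the notebook) that applies the steps in one possible order: standardize strings, convert types, remove duplicates, then fill gaps. Upper-casing `department` and filling `age` with the median are illustrative assumptions.

```python
import pandas as pd

def clean_people(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper chaining the steps covered in this notebook."""
    out = frame.copy()
    # Standardize strings first so formatting variants compare as equal
    out['name'] = out['name'].str.strip().str.title()
    out['department'] = out['department'].str.strip().str.upper()
    # Convert numeric strings; unparseable or missing entries become NaN
    out['age'] = pd.to_numeric(out['age'], errors='coerce')
    # Remove duplicates only after values have been normalized
    out = out.drop_duplicates().reset_index(drop=True)
    # Fill the remaining gaps
    out['name'] = out['name'].fillna('Unknown')
    out['age'] = out['age'].fillna(out['age'].median())
    return out

# Hypothetical raw data with the issues discussed above
raw = pd.DataFrame({
    'name': ['Alice ', 'alice', 'BOB', None],
    'age': [25, '25', '30', None],
    'department': ['HR', 'hr', 'IT', 'it'],
})
cleaned = clean_people(raw)
print(cleaned)
```

Working on a copy leaves the raw data untouched, so the cleaned result can always be compared against the original.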