# Data Wrangling with Python and Pandas

Data wrangling, also known as data munging, is the process of transforming raw data into a format suited to analysis. It involves cleaning, structuring, and enriching the data so that analysis is faster and the decisions based on it rest on more reliable inputs.

## Why Data Wrangling?

1. **Data Quality Improvement**: Helps in dealing with missing values, removing duplicates, and handling inconsistencies.
2. **Data Consistency**: Ensures that data across different sources is consistent and harmonized.
3. **Data Formatting**: Converts data into a more usable format that aligns with analysis needs.
4. **Error Reduction**: Reduces the chances of errors in downstream data analysis and machine learning tasks.
5. **Enhanced Analysis**: Facilitates complex analysis by structuring data into a suitable form.

## How to Perform Data Wrangling?

Data wrangling typically involves the following steps:

1. **Data Cleaning**: Handling missing values, removing duplicates, fixing errors.
2. **Data Transformation**: Normalization, standardization, binning, and creating new features.
3. **Data Integration**: Combining data from different sources.
4. **Data Reduction**: Aggregation, filtering, and dimensionality reduction.
5. **Data Enrichment**: Adding additional information to the dataset.
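
As a minimal sketch of steps 3 and 4 in the list above, the snippet below combines two small tables with `merge` and then aggregates the result with `groupby`. The `employees` and `departments` tables, their column names, and their values are hypothetical and exist only for illustration. Adding the `Location` column from a second source can also be read as a simple form of enrichment.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only
employees = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Department': ['HR', 'Finance', 'IT', 'IT'],
    'Salary': [50000, 54000, 61000, 62000],
})
departments = pd.DataFrame({
    'Department': ['HR', 'Finance', 'IT'],
    'Location': ['Boston', 'Chicago', 'Austin'],
})

# Data integration: a left join keeps every employee row and
# attaches the matching department attributes
combined = employees.merge(departments, on='Department', how='left')

# Data reduction: aggregate to one row per department
summary = (
    combined.groupby('Department', as_index=False)
    .agg(Headcount=('Name', 'count'), Mean_Salary=('Salary', 'mean'))
)

print(combined)
print(summary)
```

A left join is used here so that employees whose department is missing from the lookup table are kept (with `NaN` in `Location`) rather than silently dropped.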

## When to Use Data Wrangling?

Data wrangling should be performed whenever you receive raw data for analysis. It is especially crucial when:

- The data contains missing or null values.
- There are inconsistencies or duplicates in the data.
- Data is not in a format suitable for analysis.
- You need to combine multiple data sources.
- You are preparing data for machine learning models.
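
The conditions above can be checked quickly before any cleaning starts. The following is a rough sketch, not a complete audit: `wrangling_report` is a hypothetical helper name, and the `raw` table is made-up data containing one missing value and one duplicated row.

```python
import numpy as np
import pandas as pd

def wrangling_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and distinct values per column."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'missing': frame.isna().sum(),
        'unique': frame.nunique(),
    })

# Hypothetical raw data, for illustration only
raw = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'city': ['Paris', 'Rome', 'Rome', np.nan],
})

report = wrangling_report(raw)
duplicate_rows = raw.duplicated().sum()

print(report)
print("Duplicate rows:", duplicate_rows)
```

Columns with a non-zero `missing` count, an unexpected `dtype`, or a non-zero duplicate count are the first candidates for the cleaning steps that follow.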

In this guide, we'll use Python and the Pandas library to demonstrate data wrangling techniques with practical examples.



# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Create a sample DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'James', 'Emily', np.nan, 'Michael', 'Sara', 'David'],
    'Age': [28, 22, 35, 32, np.nan, 27, 45, np.nan, 30, 52],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', np.nan, 'Male'],
    'Salary': [50000, 54000, np.nan, 62000, 58000, 52000, 60000, 62000, 58000, 60000],
    'Department': ['HR', 'Finance', 'IT', 'Marketing', 'Finance', 'IT', np.nan, 'IT', 'HR', 'Marketing']
}

df = pd.DataFrame(data)

# Display the first 5 rows of the DataFrame
print("Sample DataFrame:")
print(df.head())

# Describe the DataFrame to see basic statistics
print("\nDataFrame Description:")
print(df.describe(include='all'))


Sample DataFrame:
    Name   Age  Gender   Salary Department
0   John  28.0    Male  50000.0         HR
1   Anna  22.0  Female  54000.0    Finance
2  Peter  35.0    Male      NaN         IT
3  Linda  32.0  Female  62000.0  Marketing
4  James   NaN    Male  58000.0    Finance

DataFrame Description:
        Name        Age Gender        Salary Department
count      9   8.000000      9      9.000000          9
unique     9        NaN      2           NaN          4
top     John        NaN   Male           NaN         IT
freq       1        NaN      6           NaN          3
mean     NaN  33.875000    NaN  57333.333333        NaN
std      NaN   9.963326    NaN   4358.898944        NaN
min      NaN  22.000000    NaN  50000.000000        NaN
25%      NaN  27.750000    NaN  54000.000000        NaN
50%      NaN  31.000000    NaN  58000.000000        NaN
75%      NaN  37.500000    NaN  60000.000000        NaN
max      NaN  52.000000    NaN  62000.000000        NaN


## Step 1: Data Cleaning

Data cleaning involves handling missing values, removing duplicates, and correcting errors or inconsistencies in the dataset.

### Handling Missing Values

Missing values can distort the analysis and should be handled appropriately. Common techniques include:

- **Removing rows or columns** with missing values.
- **Imputing missing values** using the mean, median, mode, or a fixed value.

Let's clean our sample dataset.


# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())

# Handling missing values
# Option 1: Drop rows with missing values
df_cleaned = df.dropna()

# Option 2: Impute missing values
df_filled = df.copy()
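# Assign the result back to the column: calling fillna(inplace=True) on a
# column selection is deprecated and does not update the DataFrame under
# pandas' Copy-on-Write behavior
df_filled['Age'] = df_filled['Age'].fillna(df['Age'].mean())
df_filled['Salary'] = df_filled['Salary'].fillna(df['Salary'].median())
df_filled['Gender'] = df_filled['Gender'].fillna('Unknown')
df_filled['Department'] = df_filled['Department'].fillna('General')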

print("\nDataFrame after dropping missing values:")
print(df_cleaned.head())

print("\nDataFrame after imputing missing values:")
print(df_filled.head())
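
Step 1 also names duplicates and inconsistencies, which the sample dataset happens not to contain. As a minimal sketch of those two tasks, the snippet below uses a small hypothetical `records` table (invented for illustration) with inconsistent casing and whitespace in `Department` and a row that repeats once the formatting is fixed.

```python
import pandas as pd

# Hypothetical records: inconsistent casing/whitespace, and two rows
# that describe the same person once the formatting is normalized
records = pd.DataFrame({
    'Name': ['John', 'Anna', 'Anna', 'Peter'],
    'Department': ['HR', ' finance', 'Finance', 'it '],
})

# Fix inconsistencies first: trim whitespace and unify casing
records['Department'] = records['Department'].str.strip().str.upper()

# Then remove exact duplicate rows, keeping the first occurrence
deduplicated = records.drop_duplicates().reset_index(drop=True)

print(deduplicated)
```

The order matters in this sketch: normalizing the text before calling `drop_duplicates` lets rows that differ only in formatting be recognized as duplicates.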


## Step 2: Data Transformation and Visualization

Data transformation involves converting data into a more suitable format for analysis. This includes normalization, standardization, binning, and creating new features.
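
As a rough sketch of normalization, standardization, and feature creation, the snippet below rescales a salary column in two ways and derives a new column from it. The `salaries` table and the `Salary_Norm`, `Salary_Z`, and `Salary_K` column names are hypothetical, chosen for illustration only.

```python
import pandas as pd

# Hypothetical salary values, for illustration only
salaries = pd.DataFrame({'Salary': [50000, 54000, 58000, 62000]})

# Min-max normalization: rescale values to the range [0, 1]
salary_min = salaries['Salary'].min()
salary_max = salaries['Salary'].max()
salaries['Salary_Norm'] = (salaries['Salary'] - salary_min) / (salary_max - salary_min)

# Standardization (z-score): zero mean and unit standard deviation.
# Series.std() uses the sample standard deviation (ddof=1) by default.
salary_mean = salaries['Salary'].mean()
salary_std = salaries['Salary'].std()
salaries['Salary_Z'] = (salaries['Salary'] - salary_mean) / salary_std

# New feature derived from an existing column: salary in thousands
salaries['Salary_K'] = salaries['Salary'] / 1000

print(salaries)
```

Min-max scaling is sensitive to outliers because a single extreme value stretches the range, while z-scores keep the shape of the distribution and only shift and scale it.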

### Data Visualization

Visualizations help in understanding the data better and in identifying patterns, correlations, and outliers. We will create a few charts with Matplotlib and Seaborn: a salary histogram, an age-versus-salary scatter plot, and a count plot of the age groups.


# Example of Data Transformation: Binning Ages into Categories
df_filled['Age_Group'] = pd.cut(df_filled['Age'], bins=[0, 25, 35, 60], labels=['Young', 'Adult', 'Senior'])

# Visualization: Distribution of Salary
plt.figure(figsize=(10, 6))
sns.histplot(df_filled['Salary'], kde=True)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.show()

# Visualization: Age vs. Salary
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='Salary', hue='Gender', data=df_filled)
plt.title('Age vs. Salary by Gender')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()

# Visualization: Count plot of Age Group
plt.figure(figsize=(10, 6))
sns.countplot(x='Age_Group', data=df_filled, palette='Set2')
plt.title('Count of Age Groups')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.show()


## Conclusion

Data wrangling is a critical step in the data analysis pipeline. By cleaning, transforming, and visualizing data, we can gain better insights and prepare the data for more advanced analysis or machine learning models.

In this guide, we used Python and Pandas to inspect a small dataset, handle its missing values by dropping and by imputing, bin ages into categories, and visualize the results. The same steps apply when preparing larger raw datasets for analysis.

Remember, the quality of your analysis is only as good as the quality of your data. Happy wrangling!
