<a href="https://github.com/zia207/python-colab/blob/main/NoteBook/Python_for_Beginners/01-03-00-data-wrangling-introduction-python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1IFEWet-Aw4DhkkVe1xv_2YYqlvRe9m5_)

# 3. Introduction to Data Wrangling in Python

**Data Wrangling** (also known as *data munging* or *data preprocessing*) is a crucial process in data science that involves transforming and cleaning raw data into a usable format. It includes converting data structures, handling missing or duplicated values, managing outliers, standardizing formats, and ensuring data integrity. This process is foundational — it ensures that subsequent analysis, modeling, or visualization is built on accurate, reliable data.

In environmental science, finance, healthcare, and many other domains, clean, well-wrangled data enables the creation of precise maps, predictive models, and actionable insights. Consistent and accurate data wrangling is not optional — it’s essential for trustworthy decision-making.

This section provides an overview of data wrangling, its importance, key steps, and the most powerful **Python packages** designed to make wrangling efficient and reproducible.

## Why is Data Wrangling Important?

Here are key reasons why data wrangling is indispensable:

1.  **Data Quality Improvement**: Identifies and corrects errors, inconsistencies, missing values, and outliers to enhance reliability.
2.  **Compatibility**: Harmonizes data from disparate sources (CSV, JSON, APIs, databases) into a unified structure.
3.  **Handling Missing Values**: Applies imputation (mean, median, forward-fill), interpolation, or strategic removal to prevent bias.
4.  **Data Transformation**: Converts data types, reshapes structures (melt/pivot), normalizes scales, and derives new features.
5.  **Feature Engineering**: Creates meaningful predictors (features) to boost machine learning model performance.
6.  **Outlier Detection & Handling**: Flags or adjusts extreme values using statistical or ML-based methods.
7.  **Data Reduction**: Simplifies large datasets through sampling, aggregation, or dimensionality reduction (e.g., PCA).
8.  **Improved Efficiency**: Clean data reduces debugging time and accelerates modeling and visualization workflows.
9.  **Data Exploration**: Wrangling and exploration are iterative — revealing patterns, anomalies, and hypotheses.
10. **Reproducibility**: Scripted, documented wrangling (e.g., in Jupyter Notebooks or `.py` files) ensures others can replicate your steps.
11. **Regulatory Compliance**: Critical in healthcare, finance, and government to meet data privacy standards (e.g., GDPR, HIPAA).
12. **Better Decision-Making**: High-quality data → accurate insights → confident, data-driven decisions.

> Data wrangling is not a preliminary chore; it is the backbone of robust analytics and AI. A commonly cited estimate is that data scientists spend 60–80% of their time on this stage.
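
The reshaping mentioned under **Data Transformation** (melt/pivot) can be illustrated with a minimal sketch. The table, its column names, and the values below are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical wide table: one row per station, one column per year.
wide = pd.DataFrame({
    "station": ["A", "B"],
    "2021": [10.5, 8.2],
    "2022": [11.0, 7.9],
})

# Wide -> long: one row per (station, year) observation.
long = wide.melt(id_vars="station", var_name="year", value_name="rainfall")
long["year"] = long["year"].astype(int)  # turn the year labels into integers

# Long -> wide again: the years become columns once more.
back = long.pivot(index="station", columns="year", values="rainfall")

print(long)
print(back)
```

Long format tends to suit grouping and plotting, while wide format is often easier to read; which one is appropriate depends on the analysis.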

## Steps of Data Wrangling

The process typically follows six iterative stages [(Source: FavTutor)](https://favtutor.com/blogs/data-wrangling):

1.  **Discovering**: Explore and understand the raw data — use `.head()`, `.info()`, `.describe()` to identify patterns and anomalies.
2.  **Structuring**: Reorganize data to fit analytical needs — reshape with `melt()`/`pivot()`, set proper indices, or engineer features.
3.  **Cleaning**:
    - **Outliers**: Detect using boxplots, Z-scores, or IQR; handle by capping, transformation, or removal.
    - **Missing Values**: Use `.fillna()`, interpolation, or model-based imputation (e.g., `sklearn.impute`).
4.  **Enriching**: Enhance data through:
    - **Feature Engineering**: Create new columns (e.g., `df['BMI'] = df['weight'] / (df['height']/100)**2`, assuming weight in kg and height in cm).
    - **Merging**: Combine datasets using `pd.merge()` or `pd.concat()`.
    - **Encoding**: Convert categories to numbers (One-Hot, Label Encoding).
5.  **Validating**: Apply automated checks — ensure data types, value ranges, and distributions meet expectations.
6.  **Publishing**: Export cleaned data for consumption — save as CSV, Parquet, Feather, or push to a database.
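
As a rough illustration of the outlier handling named in the **Cleaning** stage, the sketch below flags values with the IQR rule and caps them. The series and its values are made up, and the 1.5 multiplier is the conventional choice rather than a requirement:

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value.
values = pd.Series([10, 12, 11, 13, 12, 95], name="reading")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
capped = values.clip(lower=lower, upper=upper)  # cap instead of dropping rows

print(values[is_outlier])
print(capped)
```

Capping keeps every row, which can matter when the other columns of an outlying record are still useful; removal or a transformation may be a better fit in other cases.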



![alt text](http://drive.google.com/uc?export=view&id=1nsgpcRuh9QhKoQ49VVlbXXXviynq9daX)

source: [FavTutor](https://favtutor.com/blogs/data-wrangling)
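
The **Enriching** stage (merging and encoding) might look like the following sketch. Both tables and the `floor` column are hypothetical; only `pd.merge()` comes from the list above, and `pd.get_dummies()` is used here as one possible way to one-hot encode:

```python
import pandas as pd

# Hypothetical employee and department tables.
employees = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "department": ["Engineering", "HR", "Engineering"],
})
departments = pd.DataFrame({
    "department": ["Engineering", "HR"],
    "floor": [3, 1],
})

# A left join keeps every employee, even if a department has no match.
merged = pd.merge(employees, departments, on="department", how="left")

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(merged, columns=["department"], dtype=int)

print(encoded)
```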

## Important Python Packages for Data Wrangling

| Package         | Purpose                                                                 | Installation & Docs                                                                 |
|-----------------|-------------------------------------------------------------------------|-------------------------------------------------------------------------------------|
| `pandas`        | Core library for data manipulation — DataFrames, Series, I/O, cleaning   | `pip install pandas` — [pandas.pydata.org](https://pandas.pydata.org)               |
| `numpy`         | Numerical computing — arrays, math ops, random numbers                  | `pip install numpy` — [numpy.org](https://numpy.org)                                |
| `polars`        | Blazing-fast DataFrame library (Rust-based) for big data                | `pip install polars` — [pola.rs](https://www.pola.rs)                               |
| `pyjanitor`     | Clean, method-chaining API for common cleaning tasks                    | `pip install pyjanitor` — [pyjanitor.readthedocs.io](https://pyjanitor.readthedocs.io) |
| `dataprep`      | Automated EDA and cleaning (e.g., `clean_email`, `clean_country`)       | `pip install dataprep` — [dataprep.ai](https://dataprep.ai)                         |
| `missingno`     | Visualize missing data patterns                                         | `pip install missingno` — [GitHub](https://github.com/ResidentMario/missingno)       |
| `scikit-learn`  | Imputation, encoding, scaling, feature engineering                      | `pip install scikit-learn` — [scikit-learn.org](https://scikit-learn.org)           |
| `dask`          | Parallel computing — scales pandas to larger-than-memory datasets       | `pip install dask` — [dask.org](https://dask.org)                                   |
| `fuzzywuzzy`    | Fuzzy string matching for deduplication (the project is now maintained under the name `thefuzz`) | `pip install thefuzz` ([GitHub](https://github.com/seatgeek/thefuzz)) |
| `unidecode`     | Convert Unicode text to ASCII (e.g., “José” → “Jose”)                   | `pip install unidecode` — [PyPI](https://pypi.org/project/Unidecode/)               |
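
To give a feel for what the text-oriented packages in the table do, here is a sketch that uses only the Python standard library as a stand-in. It is not the `unidecode` or `fuzzywuzzy` API: `strip_accents` and `similarity` are hypothetical helper names, and `difflib` scores strings differently from dedicated fuzzy-matching libraries.

```python
import difflib
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two strings, ignoring case."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical names, two of which differ only by an accent.
names = ["José Garcia", "Jose Garcia", "Maria Lopez"]
cleaned = [strip_accents(n) for n in names]

print(cleaned)
print(similarity(cleaned[0], cleaned[1]))  # identical after normalization
print(similarity(cleaned[0], cleaned[2]))  # clearly different names
```

Normalizing text before comparing it usually makes near-duplicate detection more reliable, whichever matching library is used afterwards.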

## Books Focused on Data Wrangling in Python

1.  **Python for Data Analysis** by Wes McKinney (Creator of Pandas)
    - [O’Reilly](https://www.oreilly.com/library/view/python-for-data/9781098104023/)
    - Covers `pandas`, `numpy`, data cleaning, and visualization.
    - The definitive guide — perfect for beginners and pros.

2.  **Effective Data Wrangling with Python** by Tirthajyoti Sarkar
    - Practical, hands-on examples using real-world datasets.
    - Covers `pandas`, `numpy`, `scikit-learn`, and automation.

3.  **Hands-On Data Analysis with Pandas** by Stefanie Molin
    - Step-by-step projects — from importing to advanced wrangling.
    - Includes time series, text data, and performance tips.

## General/Language-Agnostic Books on Data Wrangling & Cleaning

4.  **Data Wrangling with Python** by Jacqueline Kazil & Katharine Jarmul
    - Focuses on practical workflows — scraping, cleaning, storing.
    - Great for building end-to-end pipelines.

5.  **The Data Wrangling Workshop** by Brian Lipp, et al.
    - Project-based learning — fix messy data, automate cleaning.
    - Covers Python tools and best practices.

6.  **Data Science from Scratch** by Joel Grus
    - Builds tools from the ground up — great for understanding fundamentals.
    - Includes chapters on cleaning and preprocessing.

7.  **Feature Engineering and Selection** by Max Kuhn & Kjell Johnson
    - Deep dive into creating and selecting predictive features.
    - Examples use R, but the concepts carry over to Python.

## Quick Example: Data Wrangling with Pandas

Let’s walk through a simple example using `pandas`:

In [1]:
import pandas as pd
import numpy as np

# Create sample messy dataset
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, np.nan, 35, 45, 29],
    'salary': [50000, 60000, np.nan, 80000, 55000],
    'department': ['Engineering', 'HR', 'Engineering', 'Marketing', 'HR']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
display(df)

Original DataFrame:


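|   | name    | age  | salary  | department  |
|---|---------|------|---------|-------------|
| 0 | Alice   | 25.0 | 50000.0 | Engineering |
| 1 | Bob     | NaN  | 60000.0 | HR          |
| 2 | Charlie | 35.0 | NaN     | Engineering |
| 3 | David   | 45.0 | 80000.0 | Marketing   |
| 4 | Eve     | 29.0 | 55000.0 | HR          |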


In [2]:
# Step 1: Discover
print("\n--- Step 1: Discover ---")
df.info()  # info() prints its report directly and returns None, so print() is not needed
print(df.describe())


--- Step 1: Discover ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   name        5 non-null      object 
 1   age         4 non-null      float64
 2   salary      4 non-null      float64
 3   department  5 non-null      object 
dtypes: float64(2), object(2)
memory usage: 288.0+ bytes
             age        salary
count   4.000000      4.000000
mean   33.500000  61250.000000
std     8.698659  13149.778198
min    25.000000  50000.000000
25%    28.000000  53750.000000
50%    32.000000  57500.000000
75%    37.500000  65000.000000
max    45.000000  80000.000000


In [None]:
# Step 2 & 3: Clean - Handle missing values
print("\n--- Step 2 & 3: Clean ---")
df_clean = df.copy()
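# Assign the result back instead of calling fillna(inplace=True) on a column selection;
# that chained in-place pattern is deprecated in recent pandas versions.
df_clean['age'] = df_clean['age'].fillna(df_clean['age'].mean())
df_clean['salary'] = df_clean['salary'].fillna(df_clean['salary'].median())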
display(df_clean)
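
Mean and median fills are only one option. For ordered data such as a time series, forward fill or interpolation may be more appropriate. The sketch below uses a hypothetical series of daily readings, not the employee data above:

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with two gaps.
readings = pd.Series(
    [1.0, np.nan, 3.0, np.nan, 5.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward_filled = readings.ffill()      # carry the last observation forward
interpolated = readings.interpolate()  # linear interpolation between neighbours

print(forward_filled.tolist())  # [1.0, 1.0, 3.0, 3.0, 5.0]
print(interpolated.tolist())    # [1.0, 2.0, 3.0, 4.0, 5.0]
```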

In [None]:
# Step 4: Enrich - Feature Engineering
print("\n--- Step 4: Enrich ---")
df_clean['age_group'] = pd.cut(df_clean['age'], bins=[0, 30, 40, 100], labels=['Young', 'Middle', 'Senior'])
display(df_clean)

In [None]:
# Step 5: Validate
print("\n--- Step 5: Validate ---")
print("No missing values:", df_clean.isnull().sum().sum() == 0)
print("Valid age range:", df_clean['age'].between(0, 100).all())

In [None]:
# Step 6: Publish (save to CSV)
print("\n--- Step 6: Publish ---")
df_clean.to_csv('cleaned_data.csv', index=False)
print("Data saved to 'cleaned_data.csv'")
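
One way to gain confidence in a published file is a round-trip check: write the data, read it back, and compare. The sketch below is an assumption-laden illustration that uses a small hypothetical frame and an in-memory buffer instead of `cleaned_data.csv`:

```python
import io

import pandas as pd

# Hypothetical cleaned frame, similar in shape to df_clean above.
cleaned = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25.0, 33.5],
    "salary": [50000.0, 60000.0],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if values, dtypes, or column names differ.
pd.testing.assert_frame_equal(cleaned, reloaded)
print("Round trip preserved", len(reloaded), "rows")
```

CSV does not store data types, so columns such as dates or categories generally need to be re-parsed on load; formats like Parquet keep that information.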

## Summary and Conclusions

Data wrangling is more than a preliminary step; it is the backbone of robust data science. Python's ecosystem of libraries, including pandas, NumPy, Polars, and DataPrep, provides powerful, flexible, and scalable tools for tackling messy datasets.

Mastering data wrangling will dramatically improve your efficiency, accuracy, and confidence in any data-driven project.
