# 🗂️ Data Wrangling – Cleaning & Preparing Data 🧹


In [1]:

# 1️⃣ Definition
definition = """
Data Wrangling (or Data Munging) is the process of transforming and cleaning 
raw data into a structured, usable format for analysis or modeling.
It ensures data is accurate, consistent, and ready for insights.
"""
print(definition)



Data Wrangling (or Data Munging) is the process of transforming and cleaning 
raw data into a structured, usable format for analysis or modeling.
It ensures data is accurate, consistent, and ready for insights.



In [2]:

# 2️⃣ Steps in Data Wrangling 🛠️
steps = [
    "Data Cleaning: Handle missing values, duplicates, and errors",
    "Data Transformation: Convert data types, normalize, scale",
    "Data Standardization: Consistent naming, formats, units",
    "Data Integration: Combine multiple sources into one dataset",
    "Feature Engineering: Create new features to enhance analysis"
]

print("Main Steps in Data Wrangling:")
for idx, step in enumerate(steps, 1):
    print(f"{idx}. {step}")


Main Steps in Data Wrangling:
1. Data Cleaning: Handle missing values, duplicates, and errors
2. Data Transformation: Convert data types, normalize, scale
3. Data Standardization: Consistent naming, formats, units
4. Data Integration: Combine multiple sources into one dataset
5. Feature Engineering: Create new features to enhance analysis
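
The steps listed above beyond cleaning (transformation, standardization, integration, feature engineering) can be sketched with pandas as follows. This is a minimal illustration, not a prescribed workflow: the `people` and `countries` DataFrames, their values, and the `age_scaled` and `is_30_or_older` columns are all hypothetical and invented for this example.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch
people = pd.DataFrame({
    "Name ": ["Alice", "Bob", "Emma"],      # note the trailing space in the header
    "Age": ["25", "30", "28"],              # ages stored as strings
    "City": ["paris", "LONDON", "Paris"],   # inconsistent casing
})
countries = pd.DataFrame({
    "city": ["Paris", "London"],
    "country": ["France", "United Kingdom"],
})

# Data Standardization: consistent column names and text formats
people.columns = people.columns.str.strip().str.lower()
people["city"] = people["city"].str.strip().str.title()

# Data Transformation: convert the type, then min-max scale to [0, 1]
people["age"] = people["age"].astype(int)
age_min, age_max = people["age"].min(), people["age"].max()
people["age_scaled"] = (people["age"] - age_min) / (age_max - age_min)

# Data Integration: combine the two sources on the shared "city" key
merged = people.merge(countries, on="city", how="left")

# Feature Engineering: derive a new column from an existing one
merged["is_30_or_older"] = merged["age"] >= 30

print(merged)
```

A left merge keeps every row of `people` even when a city has no match in `countries`; unmatched rows would get a missing `country` value, which then needs the same missing-value handling as any other column.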


In [4]:

# 3️⃣ Python Example – Pandas Data Wrangling
import pandas as pd

# Sample raw data
data = {
    "Name": ["Alice", "Bob", None, "Emma", "John"],
    "Age": [25, 30, 22, None, 28],
    "City": ["Paris", "London", "New York", "Paris", None]
}

df = pd.DataFrame(data)
print("\nRaw Data:")
print(df)

# a) Handle missing values
df["Age"] = df["Age"].fillna(df["Age"].mean())      # Fill missing age with the column mean
df["Name"] = df["Name"].fillna("Unknown")           # Fill missing names
df["City"] = df["City"].fillna("Unknown")           # Fill missing cities

# b) Remove duplicates
df.drop_duplicates(inplace=True)

# c) Standardize text
df["City"] = df["City"].str.title()                 # Title-case city names (e.g. "new york" -> "New York")

print("\nCleaned Data:")
print(df)

# 🔹 Quick Tips:
# - Wrangling is crucial before analysis or modeling
# - Always check for missing, inconsistent, or duplicate data
# - Well-chosen feature engineering can improve model performance
# - Use Pandas for efficient data wrangling in Python



Raw Data:
    Name   Age      City
0  Alice  25.0     Paris
1    Bob  30.0    London
2   None  22.0  New York
3   Emma   NaN     Paris
4   John  28.0      None

Cleaned Data:
      Name    Age      City
0    Alice  25.00     Paris
1      Bob  30.00    London
2  Unknown  22.00  New York
3     Emma  26.25     Paris
4     John  28.00   Unknown


Note: calling `df[col].fillna(value, inplace=True)` raises a FutureWarning in recent pandas 2.x releases. The behavior will change in pandas 3.0, where this inplace call will never work because the intermediate object `df[col]` always behaves as a copy. Use `df.fillna({col: value}, inplace=True)` or `df[col] = df[col].fillna(value)` instead to update the original DataFrame.
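
The quick tip about always checking for missing or duplicate data can be made concrete before any cleaning starts. The sketch below rebuilds the notebook's raw sample data under the hypothetical name `raw` and counts the problems per column; the variable names are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Same values as the notebook's raw sample data, under a hypothetical name
raw = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Emma", "John"],
    "Age": [25, 30, 22, None, 28],
    "City": ["Paris", "London", "New York", "Paris", None],
})

# Count missing values in each column
missing_per_column = raw.isna().sum()

# Count rows that exactly repeat an earlier row
duplicate_rows = int(raw.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

Running the checks first shows which cleaning steps are actually needed. For this sample each column has one missing value and there are no duplicate rows, so `drop_duplicates` leaves the data unchanged.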