
# **Day 31 — Introduction to Data Wrangling in Python**

### **What is Data Wrangling?**

Data wrangling, sometimes called data munging and closely related to data preprocessing, is the process of cleaning, structuring, and enriching raw data so that it is ready for analysis. Whether you're working with a small dataset or a large data pipeline, wrangling is a crucial step: errors and gaps left in the raw data carry straight through to your results.

The key steps involved in data wrangling include:
1. **Data Collection**: Gathering raw data from various sources.
2. **Data Cleaning**: Handling missing data, fixing errors, and dealing with inconsistencies.
3. **Data Transformation**: Reshaping, filtering, and aggregating data for analysis.
4. **Data Enrichment**: Adding new information or context by merging and joining datasets.
5. **Data Validation**: Ensuring the data is accurate, consistent, and ready for analysis.
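
As a small illustration of the first step, data often arrives as a CSV file or export. The sketch below is only an assumption about what such a source might look like: `raw_csv` is a hypothetical in-memory stand-in for a real file, and `collected` is an illustrative name.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file or export
raw_csv = io.StringIO(
    "Name,Age,City,Income\n"
    "Alice,25,New York,70000\n"
    "Bob,,Los Angeles,80000\n"
    "Charlie,35,,\n"
)

# read_csv turns empty fields into missing values (NaN) by default
collected = pd.read_csv(raw_csv)
print(collected)
```

Empty fields come back as missing values, which is exactly what the cleaning step has to deal with.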


### **Step 1: Importing the Necessary Libraries**

In [1]:

import pandas as pd


### **Step 2: Loading a Sample Dataset**

In [2]:

# Loading a sample dataset
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, None, 35, 45, 55],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Houston'],
    'Income': [70000, 80000, None, 50000, 95000]
})

# Displaying the dataset
print(data)


      Name   Age         City   Income
0    Alice  25.0     New York  70000.0
1      Bob   NaN  Los Angeles  80000.0
2  Charlie  35.0         None      NaN
3    David  45.0      Chicago  50000.0
4      Eva  55.0      Houston  95000.0
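
Before cleaning, it helps to see where the gaps are. A minimal sketch, assuming the same sample dataset as above (rebuilt here so the block runs on its own); `missing_counts` and `rows_with_gaps` are illustrative names:

```python
import pandas as pd

# Rebuild the sample dataset so this block is self-contained
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, None, 35, 45, 55],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Houston'],
    'Income': [70000, 80000, None, 50000, 95000]
})

# Count missing values per column
missing_counts = data.isna().sum()
print(missing_counts)

# Select the rows that contain at least one missing value
rows_with_gaps = data[data.isna().any(axis=1)]
print(rows_with_gaps)
```

Here Bob is missing an age, and Charlie is missing both a city and an income.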


### **Step 3: Cleaning the Data**

In [3]:

# Handling missing values

# Filling missing Age with the mean (assign the result back to the column)
data['Age'] = data['Age'].fillna(data['Age'].mean())

# Dropping rows where City or Income is missing
data.dropna(subset=['City', 'Income'], inplace=True)

# Displaying the cleaned data
print(data)


    Name   Age         City   Income
0  Alice  25.0     New York  70000.0
1    Bob  40.0  Los Angeles  80000.0
3  David  45.0      Chicago  50000.0
4    Eva  55.0      Houston  95000.0


**Note**: The chained form `data['Age'].fillna(data['Age'].mean(), inplace=True)` triggers a `FutureWarning` in recent pandas 2.x releases, and the behavior will change in pandas 3.0. The intermediate object `data['Age']` always behaves as a copy, so an inplace call on it will no longer update the original DataFrame. Use `df[col] = df[col].method(value)` as above, or `df.method({col: value}, inplace=True)`, to perform the operation on the original object.
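
Missing values in several columns can also be filled in one call by passing a dictionary to `DataFrame.fillna`. The sketch below uses a smaller copy of the sample data; the `'Unknown'` placeholder for City is an assumption made for illustration, not a recommendation for every dataset.

```python
import pandas as pd

# Smaller copy of the sample data so this block is self-contained
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, None, 35],
    'City': ['New York', 'Los Angeles', None],
})

# Per-column fill values: the mean for Age, a placeholder label for City
# ('Unknown' is a hypothetical placeholder chosen for this example)
fill_values = {'Age': data['Age'].mean(), 'City': 'Unknown'}

# fillna with a dict returns a new DataFrame; the original is left unchanged
filled = data.fillna(value=fill_values)
print(filled)
```

Whether to fill or drop depends on the analysis: filling keeps the row but introduces an estimated value, while dropping loses the row entirely.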


### **Step 4: Transforming the Data**

In [4]:

# Adding a new column for Age Group
# Bins are right-inclusive by default: (0, 30], (30, 50], (50, 100]
data['Age Group'] = pd.cut(
    data['Age'],
    bins=[0, 30, 50, 100],
    labels=['Young', 'Middle-aged', 'Senior']
)

# Filtering the data to include only high-income individuals
high_income_data = data[data['Income'] > 60000]

# Displaying the transformed data
print(high_income_data)


    Name   Age         City   Income    Age Group
0  Alice  25.0     New York  70000.0        Young
1    Bob  40.0  Los Angeles  80000.0  Middle-aged
4    Eva  55.0      Houston  95000.0       Senior
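
Transformation also covers aggregation, which the cell above does not show. A minimal sketch, assuming the cleaned data from Step 3 (rebuilt here so the block is self-contained), that summarizes income per age group with `groupby`; `summary` is an illustrative name:

```python
import pandas as pd

# Cleaned sample data from the previous step, rebuilt for a self-contained block
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'David', 'Eva'],
    'Age': [25.0, 40.0, 45.0, 55.0],
    'Income': [70000.0, 80000.0, 50000.0, 95000.0]
})
data['Age Group'] = pd.cut(
    data['Age'],
    bins=[0, 30, 50, 100],
    labels=['Young', 'Middle-aged', 'Senior']
)

# Average income and number of people per age group
# observed=True keeps only the categories that actually occur in the data
summary = (
    data.groupby('Age Group', observed=True)['Income']
        .agg(['mean', 'count'])
)
print(summary)
```

With this data the Middle-aged group contains two people (Bob and David), so its mean income is the average of 80000 and 50000.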


### **Step 5: Enriching the Data**

In [5]:

# Sample dataset for city populations
city_populations = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Population': [8500000, 4000000, 2700000, 2300000]
})

# Merging datasets to add population data
enriched_data = pd.merge(data, city_populations, on='City', how='left')

# Displaying the enriched data
print(enriched_data)


    Name   Age         City   Income    Age Group  Population
0  Alice  25.0     New York  70000.0        Young     8500000
1    Bob  40.0  Los Angeles  80000.0  Middle-aged     4000000
2  David  45.0      Chicago  50000.0  Middle-aged     2700000
3    Eva  55.0      Houston  95000.0       Senior     2300000
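
A left join silently leaves `NaN` in the new columns when a key has no match. One way to surface that, sketched here under the assumption of a small `people` table with a made-up unmatched row ('Frank' in 'Boston' is hypothetical), is the `indicator` and `validate` parameters of `pd.merge`:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Frank'],
    # 'Boston' is a made-up city with no entry in the lookup table
    'City': ['New York', 'Los Angeles', 'Boston']
})
city_populations = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Population': [8500000, 4000000, 2700000, 2300000]
})

# indicator=True adds a '_merge' column ('both' or 'left_only' for a left join)
# validate='many_to_one' raises an error if a city appears twice in the lookup table
checked = pd.merge(
    people, city_populations,
    on='City', how='left',
    indicator=True, validate='many_to_one'
)

# Rows whose city was not found in city_populations
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['Name', 'City']])
```

Checking for unmatched keys right after the merge is usually easier than tracing unexpected `NaN` values later.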


### **Step 6: Validating the Data**

In [6]:

# Checking for any remaining missing values
print(enriched_data.isnull().sum())

# Dropping exact duplicate rows, if any
enriched_data.drop_duplicates(inplace=True)

# Displaying the validated data
print(enriched_data)


Name          0
Age           0
City          0
Income        0
Age Group     0
Population    0
dtype: int64
    Name   Age         City   Income    Age Group  Population
0  Alice  25.0     New York  70000.0        Young     8500000
1    Bob  40.0  Los Angeles  80000.0  Middle-aged     4000000
2  David  45.0      Chicago  50000.0  Middle-aged     2700000
3    Eva  55.0      Houston  95000.0       Senior     2300000
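
Validation can go beyond counting missing values. The sketch below collects a few rule-based checks into one helper; `validate` is a hypothetical function written for this post, and the allowed ranges (Age between 0 and 120, non-negative Income) are assumptions you would adapt to your own data.

```python
import pandas as pd

# Enriched sample data, rebuilt so this block is self-contained
enriched_data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'David', 'Eva'],
    'Age': [25.0, 40.0, 45.0, 55.0],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Income': [70000.0, 80000.0, 50000.0, 95000.0],
    'Population': [8500000, 4000000, 2700000, 2300000]
})


def validate(df):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if df.isna().any().any():
        problems.append('missing values present')
    if df.duplicated().any():
        problems.append('duplicate rows present')
    # Assumed plausible range for ages; adjust for your data
    if not df['Age'].between(0, 120).all():
        problems.append('Age outside 0-120')
    if (df['Income'] < 0).any():
        problems.append('negative Income')
    return problems


print(validate(enriched_data))
```

Returning a list of problems rather than raising on the first failure makes it possible to report everything that is wrong with a dataset in one pass.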



### **Conclusion**

In today's post, we introduced the concept of data wrangling and provided an overview of its key steps, including data cleaning, transformation, enrichment, and validation.

**Key Takeaways**:
- **Data Wrangling** is essential for preparing raw data for analysis by cleaning, transforming, and enriching datasets.
- **Pandas** offers powerful tools for performing all steps of data wrangling, from handling missing values to merging datasets.
- **Clean Data** leads to more accurate, reliable, and actionable insights.
