<a href="https://github.com/zia207/python-colab/blob/main/NoteBook/Python_for_Beginners/01-03-01-data-wrangling-pandas-python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1IFEWet-Aw4DhkkVe1xv_2YYqlvRe9m5_)

# 3.1 Data Wrangling with Pandas

[Pandas](https://pandas.pydata.org/) is a powerful, open-source Python library for data manipulation and analysis. Development began in 2008 at AQR Capital Management, and the library was open-sourced in 2009. It has since grown through contributions from a global community and has been a NumFOCUS-sponsored project since 2015. Key milestones include the 2012 publication of *Python for Data Analysis* and the first core developer sprint in 2018. Pandas offers a fast, flexible DataFrame for tasks such as data alignment, missing-data handling, reshaping, merging, and time series analysis, with performance-critical code written in Cython/C. Used across domains such as finance and neuroscience, pandas aims to be the most accessible, powerful, and user-friendly open-source data analysis tool, and to foster an inclusive community for all users and contributors.

![alt text](http://drive.google.com/uc?export=view&id=1cVOaeig7WVBVOYpYgwVJoxUUooDsaaOR)

## Overview

In this section, you will delve deeper into data manipulation using pandas. By mastering pandas, you will be able to transform, clean, reshape, and analyze data efficiently, much like `{dplyr}` and `{tidyr}` in R, while staying integrated with the broader Python scientific stack.

>  **Note**: This tutorial assumes you have **Python 3.8+** installed.


## Prerequisites

Install the required packages:

In [None]:
import importlib.util
import subprocess
import sys

# List of required packages
packages = ['scipy', 'numpy', 'pandas', 'matplotlib', 'seaborn']

# Check and install missing packages
for package in packages:
    if importlib.util.find_spec(package) is None:
        # pip.main() is not a supported API; run pip through the current interpreter instead
        try:
            subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
        except subprocess.CalledProcessError:
            print(f"Failed to install {package}.")

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Verify package availability
for package in packages:
    print(f"{package} installed: {bool(importlib.util.find_spec(package))}")

scipy installed: True
numpy installed: True
pandas installed: True
matplotlib installed: True
seaborn installed: True


In [1]:
# Optional: for better display in Jupyter
from IPython.display import display
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)



### Load Data

In this exercise we will use the following CSV files:

1.  `usa_division.csv`: USA division names with IDs

2.  `usa_state.csv`: USA state names with IDs and division IDs

3.  `usa_corn_production.csv`: USA grain crop production by state from 2012-2022

4.  `gp_soil_data.csv`: Soil carbon with covariates from four states in the Great Plains region of the USA

5.  `usa_geochemical_raw.csv`: Soil geochemical data for the USA (raw, not cleaned)

All datasets used in this exercise can be downloaded from my [Dropbox](https://www.dropbox.com/scl/fo/fohioij7h503duitpl040/h?rlkey=3voumajiklwhgqw75fe8kby3o&dl=0) or [GitHub](https://github.com/zia207/r-colab/tree/main/Data/R_Beginners) accounts.

In [2]:
# Load datasets
div = pd.read_csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/usa_division.csv")
state = pd.read_csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/usa_state.csv")
corn = pd.read_csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/usa_corn_production.csv")
soil = pd.read_csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/gp_soil_data.csv")
geo_raw = pd.read_csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/usa_geochemical_raw.csv")

# Display first few rows of soil data
display(soil.head())

Unnamed: 0,ID,FIPS,STATE_ID,STATE,COUNTY,Longitude,Latitude,SOC,DEM,Aspect,Slope,TPI,KFactor,MAP,MAT,NDVI,SiltClay,NLCD,FRG
0,1,56041,56,Wyoming,Uinta County,-111.01186,41.0563,15.763,2229.078613,159.187744,5.671615,-0.085724,0.32,468.324493,4.595169,0.413939,64.842697,Shrubland,Fire Regime Group IV
1,2,56023,56,Wyoming,Lincoln County,-110.982973,42.883497,15.883,1889.400146,156.878555,8.913812,4.559132,0.261212,536.352173,3.859924,0.693953,72.004547,Shrubland,Fire Regime Group IV
2,3,56039,56,Wyoming,Teton County,-110.80649,44.53497,18.142,2423.04834,168.61235,4.774805,2.605887,0.2162,859.550903,0.8855,0.546603,57.187,Forest,Fire Regime Group V
3,4,56039,56,Wyoming,Teton County,-110.734417,44.432886,10.745,2484.282715,198.353622,7.121811,5.146931,0.181667,869.472412,0.470781,0.619101,54.991665,Forest,Fire Regime Group V
4,5,56029,56,Wyoming,Park County,-110.73079,44.80635,10.479,2396.19458,201.321487,7.949864,3.755706,0.12551,802.974304,0.758827,0.584472,51.228573,Forest,Fire Regime Group V


In [6]:
# Load datasets
div = pd.read_csv("https://github.com/zia207/python-colab/raw/refs/heads/main/Data/Python_for_Beginners/Data/usa_division.csv")
state = pd.read_csv("https://github.com/zia207/python-colab/raw/refs/heads/main/Data/Python_for_Beginners/Data/usa_state.csv")
corn = pd.read_csv("https://github.com/zia207/python-colab/raw/refs/heads/main/Data/Python_for_Beginners/Data/usa_corn_production.csv")
soil = pd.read_csv("https://github.com/zia207/python-colab/raw/refs/heads/main/Data/Python_for_Beginners/Data/gp_soil_data.csv")
geo_raw = pd.read_csv("https://github.com/zia207/python-colab/raw/refs/heads/main/Data/Python_for_Beginners/Data/usa_geochemical_raw.csv")

# Display first few rows of soil data
display(soil.head())





### Method Chaining & Pipe Equivalent in Python

Python doesn’t have a native pipe operator like R’s `%>%`, but **pandas supports method chaining**, and you can use `.pipe()` for custom functions or complex pipelines.

In [7]:
# Example of method chaining
result = (
    corn
    .merge(state, on='STATE_ID', how='inner')
    .merge(div, on='DIVISION_ID', how='inner')
    .query('MT > MT.mean()')
    .groupby('DIVISION_NAME')
    .agg(avg_prod=('MT', 'mean'))
    .reset_index()
)
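The chain above uses only built-in methods. As a minimal sketch of the `.pipe()` half of that statement, the example below inserts a custom function into a chain. The `above_mean` helper and the toy values are hypothetical and are not part of the tutorial data.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for the corn table
toy = pd.DataFrame({
    "DIVISION_NAME": ["East", "East", "West", "West"],
    "MT": [10.0, 30.0, 20.0, 60.0],
})

def above_mean(df, column):
    """Keep rows whose value in `column` exceeds the column mean."""
    return df[df[column] > df[column].mean()]

pipe_result = (
    toy
    .pipe(above_mean, "MT")                      # custom step inside the chain
    .groupby("DIVISION_NAME", as_index=False)
    .agg(avg_prod=("MT", "mean"))
)
print(pipe_result)
```

`.pipe(func, *args)` calls `func(df, *args)`, so any function that takes a DataFrame and returns one can be slotted into a chain this way.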

### Joining DataFrames (like `inner_join`)

In [8]:
# Join corn with state, then with division
df_corn = (
    corn
    .merge(state, on='STATE_ID', how='inner')  # equivalent to inner_join
    .merge(div, on='DIVISION_ID', how='inner')
)

# Display structure
df_corn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   STATE_ID       465 non-null    int64  
 1   YEAR           465 non-null    int64  
 2   MT             465 non-null    float64
 3   STATE_NAME     465 non-null    object 
 4   DIVISION_ID    465 non-null    int64  
 5   DIVISION_NAME  465 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 21.9+ KB
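An inner join silently drops rows whose key has no partner. One way to check for that, sketched here on hypothetical toy tables, is a left join with `indicator=True` (adds a `_merge` column) and `validate=` (raises if the key relationship is not what you expect).

```python
import pandas as pd

# Toy tables (hypothetical values) mirroring the corn/state key structure
corn_toy = pd.DataFrame({"STATE_ID": [1, 4, 99], "MT": [100.0, 200.0, 300.0]})
state_toy = pd.DataFrame({
    "STATE_ID": [1, 4, 5],
    "STATE_NAME": ["Alabama", "Arizona", "Arkansas"],
})

checked = corn_toy.merge(
    state_toy,
    on="STATE_ID",
    how="left",
    validate="many_to_one",  # raise if STATE_ID is duplicated in state_toy
    indicator=True,          # add a _merge column: 'both' or 'left_only'
)

# Rows that found no partner in state_toy
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

If `unmatched` is empty, switching to `how='inner'` loses nothing.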


### Relocate Columns (like `relocate`)

Pandas doesn’t have `relocate()`, but you can reorder columns easily:

In [9]:
# Columns to move to the front, in the desired order
front_cols = [
    'DIVISION_ID', 'DIVISION_NAME', 'STATE_ID', 'STATE_NAME',
    'YEAR', 'MT'
]
# Append any remaining columns in their original order
cols_order = front_cols + [col for col in df_corn.columns if col not in front_cols]

df_corn = df_corn[cols_order]
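If you reorder columns often, the same logic can be wrapped in a small helper. The `relocate` function below is a hypothetical convenience written for this sketch, not a pandas API, and the toy values are made up.

```python
import pandas as pd

def relocate(df, front):
    """Return df with the columns in `front` moved to the left; the rest keep their order."""
    rest = [c for c in df.columns if c not in front]
    return df[list(front) + rest]

toy = pd.DataFrame({
    "YEAR": [2020],
    "MT": [1.5],
    "STATE_NAME": ["Iowa"],
    "DIVISION_NAME": ["West North Central"],
})

reordered = toy.pipe(relocate, ["DIVISION_NAME", "STATE_NAME"])
print(reordered.columns.tolist())
```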

###  Rename Columns (like `rename`)

In [None]:
df_corn = df_corn.rename(columns={'STATE_ID': 'STATE_FIPS'})
print(df_corn.columns.tolist())

### Combine Join + Relocate + Rename in One Chain

In [10]:
df_corn = (
    corn
    .merge(state, on='STATE_ID', how='inner')
    .merge(div, on='DIVISION_ID', how='inner')
    .rename(columns={'STATE_ID': 'STATE_FIPS'})
    # Reorder columns
    .pipe(lambda df: df[[
        'DIVISION_ID', 'DIVISION_NAME', 'STATE_FIPS', 'STATE_NAME',
        'YEAR', 'MT'
    ] + [c for c in df.columns if c not in [
        'DIVISION_ID', 'DIVISION_NAME', 'STATE_FIPS', 'STATE_NAME', 'YEAR', 'MT'
    ]]])
)

# Optional: Use .info() or .head() instead of glimpse
df_corn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   DIVISION_ID    465 non-null    int64  
 1   DIVISION_NAME  465 non-null    object 
 2   STATE_FIPS     465 non-null    int64  
 3   STATE_NAME     465 non-null    object 
 4   YEAR           465 non-null    int64  
 5   MT             465 non-null    float64
dtypes: float64(1), int64(3), object(2)
memory usage: 21.9+ KB


### Select Columns (like `select`)

In [11]:
df_state = df_corn[['STATE_NAME', 'YEAR', 'MT']].copy()
df_state.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   STATE_NAME  465 non-null    object 
 1   YEAR        465 non-null    int64  
 2   MT          465 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 11.0+ KB
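Listing column names is the most direct way to select, but pandas also lets you select by label pattern or by dtype, roughly in the spirit of dplyr's selection helpers. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "STATE_NAME": ["Iowa", "Ohio"],
    "DIVISION_NAME": ["West North Central", "East North Central"],
    "YEAR": [2020, 2020],
    "MT": [1.0, 2.0],
})

name_cols = toy.filter(like="_NAME")        # columns whose label contains "_NAME"
numeric_cols = toy.select_dtypes("number")  # numeric columns only
no_year = toy.drop(columns=["YEAR"])        # everything except YEAR

print(name_cols.columns.tolist())
print(numeric_cols.columns.tolist())
print(no_year.columns.tolist())
```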


### Filter Rows (like `filter`)

In [12]:
# Filter by one condition
df_01 = df_corn[df_corn['DIVISION_NAME'] == 'East North Central']

# Filter by multiple conditions with .isin()
df_02 = df_corn[df_corn['DIVISION_NAME'].isin([
    'East North Central', 'East South Central', 'Middle Atlantic'
])]

# Filter with OR (|)
df_03 = df_corn[
    (df_corn['DIVISION_NAME'] == 'East North Central') |
    (df_corn['DIVISION_NAME'] == 'Middle Atlantic')
]

# Filter NY in Middle Atlantic
df_ny = df_corn[
    (df_corn['DIVISION_NAME'] == 'Middle Atlantic') &
    (df_corn['STATE_NAME'] == 'New York')
]

# Filter where MT > global mean
global_mean = df_corn['MT'].mean()
mean_prod = df_corn[df_corn['MT'] > global_mean]

# Filter for 2017 AND above mean
mean_prod_2017 = df_corn[
    (df_corn['MT'] > global_mean) &
    (df_corn['YEAR'] == 2017)
]

# Filter states starting with 'A' (using str.startswith)
state_a = df_corn[df_corn['STATE_NAME'].str.startswith('A')]
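The boolean-mask filters above can also be written with `.query()`, which some readers find closer to dplyr's `filter()`. The sketch below uses hypothetical toy values and shows that the two forms select the same rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "STATE_NAME": ["Alabama", "Iowa", "Ohio", "Arizona"],
    "YEAR": [2017, 2017, 2018, 2017],
    "MT": [1.0, 9.0, 6.0, 2.0],
})

threshold = toy["MT"].mean()  # 4.5 for these toy values

# Boolean-mask version
mask_version = toy[(toy["MT"] > threshold) & (toy["YEAR"] == 2017)]

# Equivalent .query() version; @threshold refers to the Python variable
query_version = toy.query("MT > @threshold and YEAR == 2017")

print(query_version)
```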

### Summarize Data (like `summarise`)

In [13]:
# Single summary
print("Mean:", df_corn['MT'].mean())
print("Median:", df_corn['MT'].median())

# Multiple summaries in one go
summary_stats = df_corn['MT'].agg(['mean', 'median', 'std', 'min', 'max'])
print(summary_stats)

# Grouped summaries
division_summary = (
    df_corn
    .groupby('DIVISION_NAME', as_index=False)
    .agg(Prod_MT=('MT', 'mean'))
)

# For 2020 only
division_2020_summary = (
    df_corn[df_corn['YEAR'] == 2020]
    .groupby('DIVISION_NAME', as_index=False)
    .agg(Prod_MT=('MT', 'mean'))
    .assign(Prod_1000_MT = lambda x: x['Prod_MT'] / 1000)
)

# Summarize multiple numeric columns by group
soil_summary = (
    soil
    .groupby('STATE', as_index=False)
    [['SOC', 'NDVI', 'MAP', 'MAT']]
    .mean()
)

Mean: 8360401.602795699
Median: 2072749.4
mean      8.360402e+06
median    2.072749e+06
std       1.440890e+07
min       2.691000e+02
max       6.961238e+07
Name: MT, dtype: float64
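Named aggregation is not limited to one statistic per group. As a sketch on hypothetical toy values, several output columns can be computed in a single `.agg()` call, each given as `new_name=(source_column, function)`:

```python
import pandas as pd

toy = pd.DataFrame({
    "DIVISION_NAME": ["East", "East", "West", "West", "West"],
    "MT": [10.0, 30.0, 20.0, 40.0, 60.0],
})

division_stats = (
    toy
    .groupby("DIVISION_NAME", as_index=False)
    .agg(
        n_obs=("MT", "count"),   # non-missing observations per group
        mean_MT=("MT", "mean"),
        max_MT=("MT", "max"),
    )
)
print(division_stats)
```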


### Create New Columns (like `mutate`)

In [14]:
df_corn = df_corn.assign(MT_1000 = df_corn['MT'] / 1000)  # production in thousands of MT
df_corn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   DIVISION_ID    465 non-null    int64  
 1   DIVISION_NAME  465 non-null    object 
 2   STATE_FIPS     465 non-null    int64  
 3   STATE_NAME     465 non-null    object 
 4   YEAR           465 non-null    int64  
 5   MT             465 non-null    float64
 6   MT_1000        465 non-null    float64
dtypes: float64(2), int64(3), object(2)
memory usage: 25.6+ KB
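Inside a chain, `.assign()` is usually given lambdas so that each new column is computed from the DataFrame as it exists at that step. Later keyword arguments can refer to columns created by earlier ones in the same call. A minimal sketch with hypothetical toy values and a made-up `size_class` rule:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "STATE_NAME": ["Iowa", "Ohio", "Utah"],
    "MT": [5000.0, 2000.0, 500.0],
})

toy = toy.assign(
    MT_1000=lambda d: d["MT"] / 1000,  # rescale to thousands
    # Uses MT_1000, which was created one line above
    size_class=lambda d: np.where(d["MT_1000"] >= 2, "large", "small"),
)
print(toy)
```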


### Group By + Summarize + Mutate Chain

In [19]:
result = (
    df_corn
    .query('YEAR == 2020')
    .groupby('DIVISION_NAME', as_index=False)
    .agg(Prod_MT=('MT', 'mean'))
    .assign(Prod_1000_MT = lambda x: x['Prod_MT'] / 1000)
)
print(result)

        DIVISION_NAME       Prod_MT  Prod_1000_MT
0  East North Central  2.274695e+07  22746.951840
1  East South Central  3.350119e+06   3350.119375
2     Middle Atlantic  1.929554e+06   1929.553633
3            Mountain  6.512214e+05    651.221443
4             Pacific  3.917310e+05    391.731033
5      South Atlantic  1.223075e+06   1223.074588
6  West North Central  2.828109e+07  28281.091243
7  West South Central  3.009964e+06   3009.963650


### Pivoting Data (like `pivot_wider`, `pivot_longer`)

Pivoting reorganizes a DataFrame between *long* format (one row per observation) and *wide* format (one column per level of a variable) without changing the underlying values. In R this is done with `pivot_wider()` and `pivot_longer()` from the tidyr package; in pandas the closest equivalents are `.pivot()` (or `.pivot_table()` when duplicate entries need aggregating) for long-to-wide and `.melt()` for wide-to-long. Switching formats often makes patterns easier to see: a wide table is convenient for comparing years side by side, while a long table is typically what grouping and plotting functions expect.
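Before applying this to the corn data, here is a minimal round trip on a hypothetical four-row table, so the shape change is easy to follow:

```python
import pandas as pd

# Long format: one row per state-year (toy values)
long_df = pd.DataFrame({
    "STATE_NAME": ["Iowa", "Iowa", "Ohio", "Ohio"],
    "YEAR": [2020, 2021, 2020, 2021],
    "MT": [1.0, 2.0, 3.0, 4.0],
})

# Long -> wide: one column per year
wide_df = long_df.pivot(index="STATE_NAME", columns="YEAR", values="MT").reset_index()
wide_df.columns.name = None  # drop the leftover "YEAR" label on the column index

# Wide -> long again
back_to_long = wide_df.melt(id_vars="STATE_NAME", var_name="YEAR", value_name="MT")

print(wide_df)
print(back_to_long)
```

The round trip returns the same values, although the row order differs from the original (melt stacks one year column after another).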

In [25]:
# First filter out states with incomplete years
complete_states = (
    df_corn
    .groupby('STATE_NAME')
    .filter(lambda x: len(x) >= 11)  # keep if 11+ years
)

corn_wider = (
    complete_states
    [['STATE_FIPS', 'STATE_NAME', 'YEAR', 'MT']]
    .pivot(index=['STATE_FIPS', 'STATE_NAME'], columns='YEAR', values='MT')
    .reset_index()
)
print(corn_wider.head())

YEAR  STATE_FIPS  STATE_NAME       2012       2013       2014       2015  \
0              1     Alabama   734352.8  1101529.2  1151061.8   914829.3   
1              4     Arizona   158504.4   229297.9   149359.9   192034.1   
2              5    Arkansas  3142399.9  4110445.0  2517526.9  2045951.0   
3              6  California   823003.5   873298.1   398166.0   239280.6   
4              8    Colorado  3412162.2  3261024.2  3745681.8  3426640.9   

YEAR       2016       2017       2018       2019       2020       2021  \
0      960170.7   996875.6   970839.3  1138869.1  1284291.8  1407742.3   
1      273064.4   158504.4   111765.9   217105.3   148801.1    82757.6   
2     3236003.9  2765825.0  2965479.6  3267247.5  2827677.3  3879292.8   
3      469924.8   339361.9   285638.1   256045.5   285003.1   238772.6   
4     4071581.0  4722109.3  3929587.5  4061674.5  3123348.9  3768289.0   

YEAR       2022  
0      869233.9  
1      223531.8  
2     3054130.3  
3       89920.8  
4     30


In [26]:
# pivot_longer equivalent
year_cols = [str(year) for year in range(2012, 2023)]

corn_wider.columns = corn_wider.columns.astype(str)

corn_longer = (
    corn_wider
    .melt(
        id_vars=['STATE_FIPS', 'STATE_NAME'],
        value_vars=year_cols,
        var_name='YEAR',
        value_name='MT'
    )
    .astype({'YEAR': int})  # Convert YEAR back to int
)
print(corn_longer)

     STATE_FIPS     STATE_NAME  YEAR          MT
0             1        Alabama  2012    734352.8
1             4        Arizona  2012    158504.4
2             5       Arkansas  2012   3142399.9
3             6     California  2012    823003.5
4             8       Colorado  2012   3412162.2
..          ...            ...   ...         ...
446          51       Virginia  2022   1442288.2
447          53     Washington  2022    419122.1
448          54  West Virginia  2022    149359.9
449          55      Wisconsin  2022  13853891.5
450          56        Wyoming  2022    217638.7

[451 rows x 4 columns]


In [30]:
# Summary Stats + Pivot Longer (Advanced)
summary_stats = (
    soil[['SOC', 'DEM', 'NDVI', 'MAP', 'MAT']]
    .agg(['min', lambda x: x.quantile(0.25), 'median', lambda x: x.quantile(0.75), 'max', 'mean', 'std'])
    .T
    .reset_index()
    .melt(id_vars='index', var_name='stat', value_name='value')
    .assign(variable = lambda x: x['index'])
    [['variable', 'stat', 'value']]
)
print(summary_stats.head())

  variable stat       value
0      SOC  min    0.408000
1      DEM  min  258.648804
2     NDVI  min    0.142433
3      MAP  min  193.913223
4      MAT  min   -0.591064


##  Additional Powerful pandas Functions for Data Wrangling


|  TASK | Pandas Function  |
|----|----|
| Handle missing values | `.fillna()`,`.dropna()`,`.interpolate()`,`.bfill()`,`.ffill()` |
| String manipulation | `.str.replace()`,`.str.contains()`,`.str.extract()`,`.str.split()` |
| Datetime handling | `pd.to_datetime()`,`.dt.year`,`.dt.month`, etc. |
| Binning / Categorizing | `pd.cut()`,`pd.qcut()`,`.astype('category')` |
| Duplicates | `.duplicated()`,`.drop_duplicates()` |
| Value counts | `.value_counts()`,`.nunique()` |
| Cross-tabulation | `pd.crosstab()` |
| Reshaping | `.stack()`,`.unstack()`,`.explode()`(for lists) |
| Conditional assignment | `np.where()`,`pd.Series.mask()`,`pd.Series.where()`,`pd.cut()` |
| Advanced groupby | `.groupby().apply()`,`.transform()`,`.filter()` |
| Memory optimization | `.astype()`,`pd.read_csv(..., dtype=...)`,`pd.Categorical` |
| Sampling | `.sample(n=100)`,`.sample(frac=0.1)` |
| Rolling windows | `.rolling()`,`.expanding()`,`.ewm()` |
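A few of the functions in the table (duplicates, value counts, cross-tabulation) can be sketched on a small toy frame. The values are hypothetical, with column names borrowed from the soil data.

```python
import pandas as pd

toy = pd.DataFrame({
    "NLCD": ["Forest", "Forest", "Shrubland", "Forest", "Shrubland"],
    "STATE": ["Wyoming", "Wyoming", "Wyoming", "Colorado", "Colorado"],
})

n_dupes = toy.duplicated().sum()                # rows identical to an earlier row
deduped = toy.drop_duplicates()                 # keep the first of each duplicate set
counts = toy["NLCD"].value_counts()             # frequency of each land-cover class
table = pd.crosstab(toy["STATE"], toy["NLCD"])  # state x land-cover contingency table

print(n_dupes)
print(counts)
print(table)
```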

### Advanced Missing Data Handling
Beyond `.fillna()` and `.dropna()`, pandas offers nuanced control:

In [34]:
df = pd.DataFrame({
    'col': [1, 2, np.nan, 4, 5, np.nan, 7, 8],
    'category': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']
})
print(df)


# Forward fill (last observation carried forward)
col_ffill = df['col'].ffill()

# Backward fill
col_bfill = df['col'].bfill()

# Interpolate missing values (linear, time, spline, etc.)
col_interp = df['col'].interpolate(method='linear')

# Conditional replacement; results are kept separate so the NaNs in 'col' survive for the next step
neg_to_nan = df['col'].mask(df['col'] < 0)      # Set negatives to NaN
neg_to_zero = df['col'].mask(df['col'] < 0, 0)  # Set negatives to 0 (NaN stays NaN)

# Fill with grouped mean (transform keeps original index)
df['col_filled'] = df.groupby('category')['col'].transform(lambda x: x.fillna(x.mean()))

   col category
0  1.0        A
1  2.0        A
2  NaN        A
3  4.0        B
4  5.0        B
5  NaN        B
6  7.0        A
7  8.0        A
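To see how the strategies differ, it can help to put their results side by side. This sketch rebuilds the same small frame and compares forward fill, linear interpolation, and the grouped-mean fill. With these particular values both group means happen to be 4.5.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "col": [1, 2, np.nan, 4, 5, np.nan, 7, 8],
    "category": ["A", "A", "A", "B", "B", "B", "A", "A"],
})

comparison = pd.DataFrame({
    "original": toy["col"],
    "ffill": toy["col"].ffill(),
    "interpolate": toy["col"].interpolate(method="linear"),
    "group_mean": toy.groupby("category")["col"].transform(lambda x: x.fillna(x.mean())),
})
print(comparison)
```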


### Efficient String Operations with `.str` Accessor

In [37]:
data = {'full_name': ['Alice Smith', 'Bob Johnson', 'Charlie Brown'],
    'email': ['alice@example.com', 'bob.johnson@domain.net', 'charlie'],
    'text': ['  Leading and trailing spaces  ', 'Multiple   internal   spaces', 'No spaces'],
    'price_str': ['$10.50', '20', '30.75 USD']}
df = pd.DataFrame(data)

# Extract numbers from string
df['price_str'].str.extract(r'(\d+\.?\d*)').astype(float)

# Split into multiple columns
df[['first', 'last']] = df['full_name'].str.split(' ', n=1, expand=True)

# Check patterns
df['email'].str.contains('@', na=False)  # Handle NaN safely

# Replace with regex
df['text'].str.replace(r'\s+', ' ', regex=True)  # Normalize whitespace

# Case conversion, stripping, etc.
df['text'].str.lower().str.strip().str.title()

0     Leading And Trailing Spaces
1    Multiple   Internal   Spaces
2                       No Spaces
Name: text, dtype: object
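`.str.extract()` can also return several named columns at once when the pattern uses named groups. A minimal sketch on the same three email strings; the regular expression is a deliberately simple illustration, not a full email validator:

```python
import pandas as pd

emails = pd.Series(["alice@example.com", "bob.johnson@domain.net", "charlie"])

# Each named group becomes a column; rows that do not match get NaN
parts = emails.str.extract(r"^(?P<user>[^@]+)@(?P<domain>.+)$")
is_valid = parts["domain"].notna()

print(parts)
print(is_valid)
```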

###  Datetime & Time-Series Power Tools

In [46]:
#  CREATE SAMPLE DATAFRAME
np.random.seed(42)  # For reproducible results
date_range = pd.date_range(start='2023-01-01', end='2023-03-31', freq='D')
df = pd.DataFrame({
    'date_str': date_range.strftime('%Y-%m-%d'),  # Simulate string dates
    'sales': np.random.randint(100, 1000, size=len(date_range)),
    'visitors': np.random.randint(50, 500, size=len(date_range))
})

print(" Original DataFrame (first 5 rows):")
print(df.head())
print("\n" + "="*60 + "\n")

#  1. Convert to datetime (auto-infer format)
df['date'] = pd.to_datetime(df['date_str'])
print(" After converting 'date_str' to datetime 'date':")
print(df[['date_str', 'date']].head(3))
print("\n" + "="*60 + "\n")

#  2. Extract components
df['year'] = df['date'].dt.year
df['month_name'] = df['date'].dt.month_name()
df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6

print(" After extracting datetime components:")
print(df[['date', 'year', 'month_name', 'dayofweek']].head(5))
print("\n" + "="*60 + "\n")

#  3. Resample time series (daily to monthly)
# Note: pandas 2.2+ deprecates the 'M' alias here in favor of 'ME' (month end)
monthly_summary = df.set_index('date').resample('M').sum(numeric_only=True)
print(" Monthly Resample (sum of sales & visitors):")
print(monthly_summary[['sales', 'visitors']])
print("\n" + "="*60 + "\n")

#  4. Rolling windows with time-based offsets (7-day rolling average)
# set_index() returns a new DataFrame, so the original df is left unchanged
df_rolling = df.set_index('date')
df_rolling['sales_7d_rolling'] = df_rolling['sales'].rolling('7D').mean()

print(" 7-Day Rolling Average of Sales (last 5 rows):")
print(df_rolling[['sales', 'sales_7d_rolling']].tail())
print("\n" + "="*60 + "\n")

#  5. Timezone handling
# Localize to UTC, then convert to US/Eastern
df['date_utc'] = df['date'].dt.tz_localize('UTC')
df['date_eastern'] = df['date_utc'].dt.tz_convert('US/Eastern')

print(" After timezone conversion (first 3 rows):")
print(df[['date', 'date_utc', 'date_eastern']].head(3))

 Original DataFrame (first 5 rows):
     date_str  sales  visitors
0  2023-01-01    202       240
1  2023-01-02    535       451
2  2023-01-03    960       267
3  2023-01-04    370        93
4  2023-01-05    206       211


 After converting 'date_str' to datetime 'date':
     date_str       date
0  2023-01-01 2023-01-01
1  2023-01-02 2023-01-02
2  2023-01-03 2023-01-03


 After extracting datetime components:
        date  year month_name  dayofweek
0 2023-01-01  2023    January          6
1 2023-01-02  2023    January          0
2 2023-01-03  2023    January          1
3 2023-01-04  2023    January          2
4 2023-01-05  2023    January          3


 Monthly Resample (sum of sales & visitors):
            sales  visitors
date                       
2023-01-31  14891      8990
2023-02-28  17081      7113
2023-03-31  15903      8523


 7-Day Rolling Average of Sales (last 5 rows):
            sales  sales_7d_rolling
date                               
2023-03-27    665        507.714
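As an alternative to `resample()` for monthly totals, you can group by a monthly period derived from the date column. To my knowledge the period alias `'M'` is still the accepted spelling for periods in recent pandas releases, but treat that as an assumption and check it against your installed version. The toy values below are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-15", "2023-01-20", "2023-02-01", "2023-03-05"]),
    "sales": [100, 200, 300, 400],
})

# Group rows by calendar month (a Period such as 2023-01) and sum sales
monthly = toy.groupby(toy["date"].dt.to_period("M"))["sales"].sum()
print(monthly)
```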



### Advanced Reshaping & Nesting

`explode()` — Unnest list-like columns

In [47]:
# If a column contains lists
df = pd.DataFrame({
    'team': ['A', 'B'],
    'members': [['Alice', 'Bob'], ['Charlie', 'Dana']]
})
df.explode('members')  # Each list element becomes a row

  team  members
0    A    Alice
0    A      Bob
1    B  Charlie
1    B     Dana
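Note that `explode()` repeats the original index label for every element that came from the same row. The sketch below resets the index after exploding and then reverses the operation by collecting the values back into lists per group.

```python
import pandas as pd

teams = pd.DataFrame({
    "team": ["A", "B"],
    "members": [["Alice", "Bob"], ["Charlie", "Dana"]],
})

# Explode, then give each row its own index label
exploded = teams.explode("members").reset_index(drop=True)

# Reverse: gather members back into one list per team
renested = exploded.groupby("team")["members"].agg(list).reset_index()

print(exploded)
print(renested)
```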


`melt()` + `pivot_table()` — Flexible reshaping

In [49]:
# CREATE SAMPLE DATASET (WIDE FORMAT)
# Each row = one student, each subject column = their score
df = pd.DataFrame({
    'id': ['S001', 'S002', 'S003', 'S004', 'S005'],
    'Math': [88, 92, 75, 85, 90],
    'Science': [90, 85, 80, 95, 88],
    'English': [82, 88, 78, 85, 92],
    'History': [85, 80, 85, 90, 87]
})

# MELT: Go from Wide to Long
# 'id' stays as the identifier; the other column headers go into 'subject' and their values into 'score'
df_melted = df.melt(
    id_vars=['id'],           # Keep these columns as-is
    var_name='subject',       # Name for the former column headers
    value_name='score'        # Name for the values
)


#  PIVOT_TABLE: Go back to Wide (with aggregation)
# Simulate duplicate entries by appending a modified copy
df_melted_duplicate = df_melted.copy()
df_melted_duplicate['score'] += 5  # Add 5 points to all scores
df_long_with_duplicates = pd.concat([df_melted, df_melted_duplicate], ignore_index=True)


# Pivot back to wide format, aggregating duplicates with mean
df_pivoted = df_long_with_duplicates.pivot_table(
    index='id',               # Rows
    columns='subject',        # Columns
    values='score',           # Values to aggregate
    aggfunc='mean',           # How to aggregate (handles duplicates!)
    fill_value=0              # Replace NaN with 0 if any
).reset_index()               # Move 'id' from index back to column

print(" Pivoted Back to Wide Format (Mean Scores):")
print(df_pivoted)

 Pivoted Back to Wide Format (Mean Scores):
subject    id  English  History  Math  Science
0        S001     84.5     87.5  90.5     92.5
1        S002     90.5     82.5  94.5     87.5
2        S003     80.5     87.5  77.5     82.5
3        S004     87.5     92.5  87.5     97.5
4        S005     94.5     89.5  92.5     90.5


### Conditional Logic & Binning
`np.select()` — Multi-condition assignment

In [55]:
conditions = [
    df_pivoted['Math'] >= 90,
    df_pivoted['Math'] >= 80,
    df_pivoted['Math'] >= 70
]
choices = ['A', 'B', 'C']
df_pivoted['grade'] = np.select(conditions, choices, default='F')
print(df_pivoted)

subject    id  English  History  Math  Science grade
0        S001     84.5     87.5  90.5     92.5     A
1        S002     90.5     82.5  94.5     87.5     A
2        S003     80.5     87.5  77.5     82.5     C
3        S004     87.5     92.5  87.5     97.5     B
4        S005     94.5     89.5  92.5     90.5     A
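`np.select()` assigns the choice for the *first* condition that is true, so the conditions must be listed from most to least restrictive. The sketch below, using three hypothetical scores, shows what goes wrong when the order is reversed.

```python
import numpy as np
import pandas as pd

math = pd.Series([90.5, 77.5, 65.0])

# Correct order: strictest threshold first
good = np.select(
    [math >= 90, math >= 80, math >= 70],
    ["A", "B", "C"],
    default="F",
)

# Reversed order: ">= 70" matches first, so 90.5 is wrongly graded "C"
bad = np.select(
    [math >= 70, math >= 80, math >= 90],
    ["C", "B", "A"],
    default="F",
)

print(good.tolist())
print(bad.tolist())
```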


`pd.cut()` / `pd.qcut()` — Binning continuous variables

In [None]:
#  CREATE SAMPLE DATAFRAME
np.random.seed(42)  # For reproducible results
df = pd.DataFrame({
    'name': [f'Person_{i}' for i in range(1, 21)],
    'age': np.random.randint(5, 95, size=20),      # Ages between 5 and 94
    'income': np.random.randint(20000, 200000, size=20)  # Income between 20K and 200K
})


# Custom Bins: Categorize age into groups
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 60, 100],
    labels=['Child', 'Young', 'Middle', 'Senior'],
    include_lowest=True  # Ensures 0 is included
)

print(" After applying pd.cut() for age groups:")
print(df[['name', 'age', 'age_group']])
print("\n" + "="*60 + "\n")

# Quantile-Based Bins: Divide income into quartiles
df['income_quantile'] = pd.qcut(
    df['income'],
    q=4,
    labels=['Q1 (Lowest)', 'Q2 (Lower-Mid)', 'Q3 (Upper-Mid)', 'Q4 (Highest)'],
    duplicates='drop'  # Safeguard in case of duplicate quantile edges
)

print("After applying pd.qcut() for income quartiles:")
print(df[['name', 'income', 'income_quantile']])
print("\n" + "="*60 + "\n")

#  Optional: Show distribution
print(" Age Group Distribution:")
print(df['age_group'].value_counts().sort_index())
print("\n Income Quartile Distribution:")
print(df['income_quantile'].value_counts().sort_index())

 After applying pd.cut() for age groups:
         name  age age_group
0    Person_1   56    Middle
1    Person_2   19     Young
2    Person_3   76    Senior
3    Person_4   65    Senior
4    Person_5   25     Young
5    Person_6   87    Senior
6    Person_7   91    Senior
7    Person_8   79    Senior
8    Person_9   79    Senior
9   Person_10   92    Senior
10  Person_11   28     Young
11  Person_12    7     Child
12  Person_13   26     Young
13  Person_14   57    Middle
14  Person_15    6     Child
15  Person_16   92    Senior
16  Person_17   34     Young
17  Person_18   42    Middle
18  Person_19    6     Child
19  Person_20   68    Senior


After applying pd.qcut() for income quartiles:
         name  income income_quantile
0    Person_1  123355  Q2 (Lower-Mid)
1    Person_2  105305  Q2 (Lower-Mid)
2    Person_3  179765    Q4 (Highest)
3    Person_4  150608    Q4 (Highest)
4    Person_5  176730    Q4 (Highest)
5    Person_6  104478  Q2 (Lower-Mid)
6    Person_7  142537  Q3 (Upper-M

### Performance Optimization

When working with large datasets, performance can become a bottleneck. Here are some tips to optimize your pandas code:

In [58]:
# CREATE SAMPLE DATAFRAME — Simulate 100K Sales Records
np.random.seed(42)
n = 100_000

df = pd.DataFrame({
    'transaction_id': range(1, n + 1),
    'product_category': np.random.choice(
        ['Electronics', 'Clothing', 'Home & Kitchen', 'Books', 'Sports'], 
        size=n, 
        p=[0.3, 0.25, 0.2, 0.15, 0.1]  # Uneven probabilities
    ),
    'region': np.random.choice(
        ['North', 'South', 'East', 'West'], 
        size=n
    ),
    'price': np.round(np.random.uniform(10.0, 500.0, size=n), 2),
    'quantity': np.random.randint(1, 6, size=n),
    'discount_pct': np.random.choice([0.0, 0.05, 0.1, 0.15, 0.2], size=n, p=[0.5, 0.2, 0.15, 0.1, 0.05])
})

# Create a flag column
df['is_high_value'] = (df['price'] > 300).astype(int)

print(" Original DataFrame Info:")
print(df.info())
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\n" + "="*70 + "\n")

# 1. Use eval() for complex expressions (avoids intermediate copies)
# Calculate total revenue and net revenue in one efficient step
df = df.eval('''
    total_revenue = price * quantity
    discount_amount = total_revenue * discount_pct
    net_revenue = total_revenue - discount_amount
''')

print(" After using .eval() to create 3 new columns:")
print(df[['price', 'quantity', 'discount_pct', 'total_revenue', 'net_revenue']].head(3))
print("\n" + "="*70 + "\n")

#  2. Use query() for filtering (cleaner syntax; can be faster on large frames when numexpr is installed)
# Filter: High-value items in Electronics or Books, with discount, in East or West
df_filtered = df.query(
    "is_high_value == 1 and "
    "product_category in ['Electronics', 'Books'] and "
    "discount_pct > 0 and "
    "region in ['East', 'West']"
)

print(f" After .query(): {len(df_filtered):,} rows match criteria")
print(df_filtered[['product_category', 'region', 'price', 'discount_pct']].head())
print("\n" + "="*70 + "\n")

#  3. Convert to category dtype for low-cardinality string columns
# Check unique values first
print("  Unique values before conversion:")
for col in ['product_category', 'region']:
    print(f"  {col}: {df[col].nunique()} unique values → {list(df[col].unique())}")

# Record memory before conversion so the saving can be computed
mem_before_mb = df.memory_usage(deep=True).sum() / 1024**2

# Convert to category
df['product_category'] = df['product_category'].astype('category')
df['region'] = df['region'].astype('category')
df['is_high_value'] = df['is_high_value'].astype('category')  # Also good for binary flags

mem_after_mb = df.memory_usage(deep=True).sum() / 1024**2
print("\n After converting to 'category' dtype:")
print(df.dtypes)
print(f"\nNew memory usage: {mem_after_mb:.2f} MB")
print(f"Memory saved: {mem_before_mb - mem_after_mb:.2f} MB vs. before conversion")

# Show memory usage per column
print("\n Memory usage per column (after optimization):")
mem_usage = df.memory_usage(deep=True) / 1024
for col, mem_kb in mem_usage.items():
    print(f"  {col}: {mem_kb:.1f} KB")

print("\n" + "="*70 + "\n")

#  Combine all optimizations in a realistic workflow
print(" REALISTIC WORKFLOW: Filter + Compute + GroupBy")

result = (
    df
    .query("net_revenue > 1000 and product_category != 'Books'")
    .eval('profit_margin = (net_revenue - 50) / net_revenue')  # Assume $50 fixed cost
    .groupby(['product_category', 'region'], observed=True)  # observed=True for categories
    .agg(
        avg_net_revenue=('net_revenue', 'mean'),
        total_transactions=('transaction_id', 'count'),
        avg_profit_margin=('profit_margin', 'mean')
    )
    .round(2)
    .reset_index()
)

print("Final Aggregated Result:")
print(result)

 Original DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   transaction_id    100000 non-null  int64  
 1   product_category  100000 non-null  object 
 2   region            100000 non-null  object 
 3   price             100000 non-null  float64
 4   quantity          100000 non-null  int64  
 5   discount_pct      100000 non-null  float64
 6   is_high_value     100000 non-null  int64  
dtypes: float64(2), int64(3), object(2)
memory usage: 5.3+ MB
None

Memory usage: 16.02 MB


 After using .eval() to create 3 new columns:
    price  quantity  discount_pct  total_revenue  net_revenue
0   43.91         4          0.15         175.64      149.294
1  296.99         5          0.00        1484.95     1484.950
2  401.45         4          0.10        1605.80     1445.220


 After .query(): 4,566 rows match criteria
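
The cell above converts low-cardinality columns to `category` and passes `observed=True` to `groupby`. The sketch below illustrates both ideas on a small synthetic table; the `demo` DataFrame and its column names are made up for illustration and are not the tutorial's transaction data.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical; not the tutorial's transaction table)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "region": rng.choice(["North", "South", "East", "West"], size=10_000),
    "sales": rng.uniform(10, 500, size=10_000),
})

# Memory of the string column before and after conversion
bytes_before = demo["region"].memory_usage(deep=True)
demo["region"] = demo["region"].astype("category")
bytes_after = demo["region"].memory_usage(deep=True)
print(f"before: {bytes_before / 1024:.1f} KB, after: {bytes_after / 1024:.1f} KB")

# Filtering rows does not drop unused categories: 'West' remains a known level
subset = demo[demo["region"] != "West"]

# observed=False keeps every category, including those with no rows left
with_empty = subset.groupby("region", observed=False)["sales"].count()
# observed=True keeps only categories that actually appear in the data
only_seen = subset.groupby("region", observed=True)["sales"].count()

print(with_empty)
print(only_seen)
```

With `observed=False` the filtered-out region still appears with a count of zero, which is why `observed=True` is usually the more convenient choice after filtering categorical data.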
 

###  Method Chaining Enhancements with `.pipe()`

In [59]:
#  SAMPLE DATAFRAME
np.random.seed(42)  # For reproducible results
n = 1000

# Generate base data
df = pd.DataFrame({
    'household_id': range(1, n + 1),
    'region': np.random.choice(
        ['North', 'South', 'East', 'West'], 
        size=n,
        p=[0.25, 0.25, 0.25, 0.25]
    ),
    'household_size': np.random.choice(
        [1, 2, 3, 4, 5], 
        size=n,
        p=[0.15, 0.35, 0.30, 0.15, 0.05]
    ),
    'income': np.random.lognormal(mean=10.5, sigma=0.8, size=n).round(2)
})

# Intentionally add some extreme outliers
outlier_indices = np.random.choice(df.index, size=30, replace=False)
df.loc[outlier_indices, 'income'] = df['income'].max() * np.random.uniform(2, 5, size=30)

# Define outlier removal function
def remove_outliers(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

# Define feature engineering function
def add_features(df):
    return df.assign(
        log_income = np.log1p(df['income']),           # log(1 + income) to handle zeros
        income_per_capita = df['income'] / df['household_size']
    )

# Chain everything together
result = (
    df
    .pipe(remove_outliers, 'income')
    .pipe(add_features)
    .groupby('region')
    .agg({
        'log_income': 'mean', 
        'income_per_capita': 'median'
    })
    .round(2)
    .reset_index()
)

print(" Final Result after Pipeline:")
print(result)
print(f"\nShape after removing outliers: {df.pipe(remove_outliers, 'income').shape}")

# Show outlier removal statistics
original_count = len(df)
cleaned_count = len(df.pipe(remove_outliers, 'income'))
removed_count = original_count - cleaned_count

print(f"\n Outlier Removal Summary:")
print(f"Original records: {original_count:,}")
print(f"Records after removal: {cleaned_count:,}")
print(f"Outliers removed: {removed_count:,} ({removed_count/original_count:.1%})")

#  Show IQR bounds for income
q1 = df['income'].quantile(0.25)
q3 = df['income'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print(f"\n Income IQR Bounds:")
print(f"Q1: ${q1:,.2f}")
print(f"Q3: ${q3:,.2f}")
print(f"IQR: ${iqr:,.2f}")
print(f"Lower bound: ${lower_bound:,.2f}")
print(f"Upper bound: ${upper_bound:,.2f}")
print(f"Outliers are values < ${lower_bound:,.2f} or > ${upper_bound:,.2f}")

 Final Result after Pipeline:
  region  log_income  income_per_capita
0   East       10.41           14296.69
1  North       10.44           14335.13
2  South       10.40           13410.24
3   West       10.42           12879.74

Shape after removing outliers: (920, 4)

 Outlier Removal Summary:
Original records: 1,000
Records after removal: 920
Outliers removed: 80 (8.0%)

 Income IQR Bounds:
Q1: $22,142.19
Q3: $64,728.53
IQR: $42,586.35
Lower bound: $-41,737.33
Upper bound: $128,608.05
Outliers are values < $-41,737.33 or > $128,608.05
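
To make the role of `.pipe()` in the pipeline above more concrete, here is a minimal sketch showing that `df.pipe(f, ...)` is equivalent to calling `f(df, ...)`, only written in reading order. The helper functions `keep_between` and `add_ratio` and the `toy` data are hypothetical and exist only for this illustration.

```python
import pandas as pd

def keep_between(frame, col, low, high):
    """Keep rows whose `col` value lies in [low, high] (inclusive)."""
    return frame[frame[col].between(low, high)]

def add_ratio(frame, num, den, name):
    """Return a copy with a new column `name` = frame[num] / frame[den]."""
    return frame.assign(**{name: frame[num] / frame[den]})

toy = pd.DataFrame({
    "income": [20_000, 45_000, 60_000, 900_000],
    "household_size": [1, 3, 2, 4],
})

# Nested function calls read inside-out
nested = add_ratio(
    keep_between(toy, "income", 0, 100_000),
    "income", "household_size", "income_per_capita",
)

# The same steps with .pipe() read top to bottom;
# extra positional and keyword arguments are forwarded to the function
piped = (
    toy
    .pipe(keep_between, "income", low=0, high=100_000)
    .pipe(add_ratio, num="income", den="household_size", name="income_per_capita")
)

print(piped)
print(piped.equals(nested))
```

Because each helper takes a DataFrame and returns a DataFrame, the steps can be reordered or tested in isolation without touching the rest of the chain.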


### Validation & Quality Checks

In [60]:
# Simulate Product Sales with Duplicates
np.random.seed(42)
n = 1000

# Generate base data
df = pd.DataFrame({
    'id': np.random.randint(1, 800, size=n),  # IDs drawn from 1-799 for 1000 rows → guarantees repeated IDs
    'category': np.random.choice(
        ['Electronics', 'Clothing', 'Home & Kitchen', 'Books', 'Sports'], 
        size=n,
        p=[0.3, 0.25, 0.2, 0.15, 0.1]  # Uneven probabilities
    ),
    'price': np.round(np.random.lognormal(mean=3.5, sigma=0.8, size=n), 2),
    'quantity': np.random.randint(1, 6, size=n),
    'region': np.random.choice(['North', 'South', 'East', 'West'], size=n),
    'sale_date': pd.date_range(start='2023-01-01', periods=n, freq='h')  # lowercase 'h'; uppercase 'H' is deprecated in recent pandas
})

# Intentionally create duplicates by appending first 50 rows
df_with_duplicates = pd.concat([df, df.head(50)], ignore_index=True)


# Check for duplicates
print("Check for duplicates:")
duplicate_count = df_with_duplicates.duplicated().sum()
print(f"Total duplicate rows: {duplicate_count}")

# Show what duplicates look like
print("\nSample of duplicate rows (first 5):")
duplicates = df_with_duplicates[df_with_duplicates.duplicated(keep=False)]
print(duplicates.head(10)[['id', 'category', 'price']])  # Show first 10 including originals

print("\n" + "="*70 + "\n")

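# Drop rows that repeat an id (keep last occurrence).
# Because subset=['id'] is used, this removes more than the exact duplicate rows counted above.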
print("Drop duplicates (keeping last occurrence):")
df_clean = df_with_duplicates.drop_duplicates(subset=['id'], keep='last')
print(f"Shape after dropping duplicates: {df_clean.shape}")
print(f"Rows removed: {len(df_with_duplicates) - len(df_clean)}")

# Show sample of cleaned data
print("\nSample of cleaned data (first 5 rows):")
print(df_clean.head()[['id', 'category', 'price', 'quantity']])

print("\n" + "="*70 + "\n")

# Value counts with percentages
print("Value counts with percentages:")
category_percentages = (
    df_clean['category']
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
    .sort_values(ascending=False)
)

print("Category Distribution (%):")
for category, percentage in category_percentages.items():
    print(f"  {category}: {percentage}%")

print("\n" + "="*70 + "\n")

# Check data types and memory usage
print("Check data types and memory usage:")
df_clean.info(memory_usage='deep')
print(f"\nTotal memory usage: {df_clean.memory_usage(deep=True).sum() / 1024:.1f} KB")

print("\n" + "="*70 + "\n")

# Describe with custom percentiles
print("Describe with percentiles [0.01, 0.25, 0.5, 0.75, 0.99]:")
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
description = df_clean[numeric_cols].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99])

# Round for cleaner display
description = description.round(2)
print(description)

# Optional: Show only price column for focus
print("\nPrice column detailed statistics:")
print(df_clean['price'].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99]).round(2))

Check for duplicates:
Total duplicate rows: 50

Sample of duplicate rows (first 5):
    id        category  price
0  103     Electronics  38.61
1  436           Books  25.86
2  271  Home & Kitchen  36.85
3  107  Home & Kitchen  29.31
4   72           Books  58.35
5  701        Clothing  71.19
6   21     Electronics  17.66
7  615        Clothing  11.42
8  122  Home & Kitchen   7.62
9  467  Home & Kitchen  49.72


Drop duplicates (keeping last occurrence):
Shape after dropping duplicates: (556, 6)
Rows removed: 494

Sample of cleaned data (first 5 rows):
     id        category  price  quantity
56   14        Clothing  37.30         1
58  777        Clothing  20.27         4
65  428  Home & Kitchen  85.84         4
66  509           Books  70.78         2
69  206        Clothing  69.93         3


Value counts with percentages:
Category Distribution (%):
  Electronics: 30.4%
  Clothing: 23.6%
  Home & Kitchen: 23.2%
  Books: 14.4%
  Sports: 8.5%


Check data types and memory usage:
<clas
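
The run above reports 50 fully duplicated rows, yet `drop_duplicates(subset=['id'], keep='last')` removes 494 rows. The difference arises because `subset=['id']` treats any rows sharing an `id` as duplicates, even when their other columns differ. The toy `orders` table below is hypothetical and only illustrates the distinction.

```python
import pandas as pd

# Toy data: id 2 appears as an exact duplicate, id 3 repeats with a different price
orders = pd.DataFrame({
    "id":    [1, 2, 2, 3, 3],
    "price": [10.0, 20.0, 20.0, 30.0, 35.0],
})

# Rows identical to an earlier row in every column
exact_dupes = orders.duplicated().sum()

# Rows whose id has already been seen, regardless of the other columns
id_dupes = orders.duplicated(subset=["id"]).sum()

# Dropping on full rows keeps both prices for id 3
by_row = orders.drop_duplicates()

# Dropping on id keeps one row per id (here, the last one seen)
by_id = orders.drop_duplicates(subset=["id"], keep="last")

print(f"exact duplicates: {exact_dupes}, repeated ids: {id_dupes}")
print(by_row)
print(by_id)
```

Which behaviour is appropriate depends on whether `id` is meant to be a unique key; if it is not, deduplicating on the full row is the safer default.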



##  Cleaning Messy Data (Geochemical Example)

This exercise gives you a more in-depth look at how to clean and prepare data for analysis. Data cleaning is the set of steps that make your data accurate, consistent, and complete. You will work through techniques for three common problems: missing values, values below detection limits, and special characters.

Missing values occur when data is not collected or recorded for a particular variable. They need to be addressed because they can bias your analysis and lead to inaccurate results. You will learn how to identify missing values and how to handle them through imputation or deletion.

Data below detection limits, also known as censored data, are values that fall below the limit of detection for a particular measurement. These values can skew your analysis if they are ignored or treated as exact measurements. You will learn how to identify censored data and how to handle it using techniques such as substitution and regression.

Special characters are non-numeric symbols or codes embedded in otherwise numeric fields, such as a `<` prefix marking a value below the detection limit or text codes like `N.D.`, `N.S.`, and `INS`. They force a column to be read as text, so they must be removed or recoded before the column can be converted to numbers. You will learn how to identify and clean these characters so that your data is accurate and consistent.

By the end of this exercise, you will have a solid understanding of how to clean and prepare your data for analysis, which will help you to obtain more accurate and reliable results.
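
As a small illustration of the detection-limit handling described above, the sketch below parses a text column, records which entries were reported as below the limit, and substitutes half the reported limit for those entries only. The sample values are invented, and half-limit substitution is just one common convention, not the only option.

```python
import numpy as np
import pandas as pd

# Invented sample: one censored value ("<0.6"), one measured 0.6, one non-numeric code
raw = pd.Series(["1.2", "<0.6", "0.6", "N.D.", "3.4"], name="Arsenic")

# Flag censored entries before stripping the "<" marker
is_censored = raw.str.startswith("<")

# Strip the marker and convert; anything non-numeric (e.g. "N.D.") becomes NaN
values = pd.to_numeric(raw.str.replace("<", "", regex=False), errors="coerce")

# Halve only the flagged entries; measured values equal to the limit stay unchanged
cleaned = values.where(~is_censored, values / 2)

print(pd.DataFrame({"raw": raw, "censored": is_censored, "cleaned": cleaned}))
```

Keeping the `is_censored` flag separates censored values from real measurements that happen to equal the detection limit, which matters once the `<` marker has been removed.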

We will use the National Geochemical Database from the United States Geological Survey (USGS), compiled as part of the USGS Geochemical Landscapes Project (Smith et al., 2011).

In [62]:
# Select and rename columns from the raw data (geo_raw)
mf_geo = (
    geo_raw
    [['A_LabID', 'StateID', 'LandCover1', 'A_Depth', 'A_C_Tot', 'A_C_Inorg', 'A_C_Org', 'A_As', 'A_Cd', 'A_Pb', 'A_Cr']]
    .rename(columns={
        'A_LabID': 'LAB_ID',
        'LandCover1': 'LandCover',
        'A_Depth': 'Soil_depth',
        'A_C_Tot': 'Total_Carbon',
        'A_C_Inorg': 'Inorg_Carbon',
        'A_C_Org': 'Organic_Carbon',
        'A_As': 'Arsenic',
        'A_Cd': 'Cadmium',
        'A_Pb': 'Lead',
        'A_Cr': 'Chromium'
    })
)

# Filter out 'N.S.' and 'INS'
mf_geo = mf_geo[
    (mf_geo['Soil_depth'] != 'N.S.') &
    (mf_geo['Total_Carbon'] != 'INS')
]

# Replace 'N.D.' with NaN
mf_geo = mf_geo.replace('N.D.', np.nan)

# Clean and convert metal columns
metal_cols = ['Arsenic', 'Cadmium', 'Lead', 'Chromium']
for col in metal_cols:
    mf_geo[col] = (
        mf_geo[col]
        .astype(str)
        .str.replace('<', '', regex=False)
        .replace('', np.nan)
        .astype(float)
    )

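# Replace values equal to the detection limit with half the limit.
# Caveat: the '<' marker was stripped above, so a measured value that
# happens to equal the limit is halved as well.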
detection_limits = {'Arsenic': 0.6, 'Cadmium': 0.1, 'Lead': 0.5, 'Chromium': 1.0}
half_limits = {k: v/2 for k, v in detection_limits.items()}

for col, limit in detection_limits.items():
    mf_geo.loc[mf_geo[col] == limit, col] = half_limits[col]

# Summarize missingness and stats
summary_list = []
for col in metal_cols:
    summary_list.append({
        'Element': col,
        'N': len(mf_geo[col]),
        'Missing': mf_geo[col].isna().sum(),
        'Min': mf_geo[col].min(),
        'Max': mf_geo[col].max()
    })

geo_sum = pd.DataFrame(summary_list).set_index('Element')
print(geo_sum)

# Option 1: Drop rows with any missing (like na.omit)
newdata = mf_geo.dropna()
print(f"Rows after dropping all missing: {len(newdata)}")

# Option 2: Impute missing with mean
mf_geo_new = mf_geo.copy()
for col in metal_cols:
    mf_geo_new[col] = mf_geo_new[col].fillna(mf_geo_new[col].mean())

# Verify no missing in metals
print("Missing after imputation:")
print(mf_geo_new[metal_cols].isna().sum())

# Optional: Save cleaned data
# mf_geo_new.to_csv("usa_geochemical_clean.csv", index=False)

             N  Missing   Min     Max
Element                              
Arsenic   4809        0  0.30  1110.0
Cadmium   4809        0  0.05    46.6
Lead      4809        0  0.25  2200.0
Chromium  4809        0  0.50  3850.0
Rows after dropping all missing: 1178
Missing after imputation:
Arsenic     0
Cadmium     0
Lead        0
Chromium    0
dtype: int64
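
The summary table above shows maxima far larger than the minima (for example, Lead ranges from 0.25 to 2200), which suggests strongly skewed distributions. In that situation the mean can be pulled upward by a few extreme values, so median imputation is sometimes preferred. The sketch below compares the two on invented values; whether either is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Invented values with one extreme observation and one missing entry
lead = pd.Series([12.0, 15.0, np.nan, 18.0, 2200.0], name="Lead")

# Mean imputation: the fill value is dominated by the extreme observation
mean_filled = lead.fillna(lead.mean())

# Median imputation: the fill value stays close to the typical observations
median_filled = lead.fillna(lead.median())

print(f"mean fill value:   {lead.mean():.2f}")
print(f"median fill value: {lead.median():.2f}")
print(pd.DataFrame({"mean_filled": mean_filled, "median_filled": median_filled}))
```

Both `mean()` and `median()` skip missing values by default, so the fill value is computed from the observed entries only.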


## Summary and Conclusion

This Python tutorial mirrors the power of R’s `{dplyr}` and `{tidyr}` using **pandas**, arguably the most essential library for data manipulation in Python. You’ve learned how to:

- Load, join, filter, select, rename, and relocate columns.
- Create new variables, group data, and compute summaries.
- Pivot data between long and wide formats.
- Clean messy real-world datasets (handling missing values, detection limits, string cleaning).
- Use method chaining for readable, efficient workflows.

While Python doesn’t have a native pipe operator, **method chaining** and `.pipe()` provide equally powerful — and often more flexible — alternatives.

As you advance, explore:
- `polars` for faster DataFrame operations.
- `pyjanitor` for even more R-like data cleaning verbs.
- `siuba` for dplyr-style syntax in Python.
- Integration with `scikit-learn` for modeling and `plotly`/`seaborn` for visualization.

**Data wrangling is iterative and creative — practice with diverse datasets, and you’ll become fluent in transforming raw data into actionable insights.**

## Resources

1. [pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html)
2. [Python for Data Analysis (Book by Wes McKinney)](https://wesmckinney.com/book/)
3. [Real Python — pandas Tutorials](https://realpython.com/learning-paths/pandas-data-science/)
4. [Kaggle pandas Course](https://www.kaggle.com/learn/pandas)
5. [YouTube: Data Analysis with Python — freeCodeCamp](https://www.youtube.com/watch?v=r-uOLxNrNk8)