# Module 2 — Pandas Foundations (Data Wrangling)

This notebook covers Pandas basics for tabular data analysis across Lessons 2.1–2.5.
The examples use the provided dataset `eda_course_dataset_100rows.csv`.

## Lesson 2.1 — Introduction to Pandas

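Pandas provides two core data structures built on NumPy arrays: `Series` (a one-dimensional labeled array) and `DataFrame` (a two-dimensional table whose columns are `Series`). It is well suited to cleaning, transforming, and analyzing tabular data.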

In [1]:
import pandas as pd
pd.Series([10,20,30], index=['a','b','c'])

Unnamed: 0,0
a,10
b,20
c,30


In [2]:
pd.DataFrame({'Name':['Alice','Bob'],'Age':[25,30]})

Unnamed: 0,Name,Age
0,Alice,25
1,Bob,30
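To make the relationship between `DataFrame`, `Series`, and NumPy concrete, the sketch below pulls a column out of a small frame and then the underlying array out of that column. The frame and the variable names (`people`, `ages`, `age_values`) are illustrative only and are not part of the course dataset.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

ages = people['Age']          # selecting one column yields a Series
age_values = ages.to_numpy()  # the Series exposes its values as a NumPy array

print(type(ages).__name__)        # Series
print(type(age_values).__name__)  # ndarray
print(ages.mean(), age_values.mean())  # both compute the same average
```

Most Pandas operations are vectorized in this way, so they work on whole columns at once rather than row by row.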


## Lesson 2.2 — Importing & Inspecting
Read the CSV with `pd.read_csv()`, then inspect it with `.head()`, `.info()`, `.describe()`, `.shape`, and `.dtypes`.

In [4]:
from google.colab import drive
drive.mount('/content/drive')


# Load dataset
file_path = '/content/drive/My Drive/Numpy_pandas/eda_course_dataset_100rows.csv'
df = pd.read_csv(file_path, parse_dates=['order_date'])
display(df.head())
print('shape:', df.shape)
df.info()  # info() prints its summary and returns None, so wrapping it in display() only adds a stray "None"
display(df.describe(include='all').T)

Mounted at /content/drive


Unnamed: 0,order_id,customer_id,order_date,gender,age,region,channel,product_category,price,quantity,total_amount,discount_pct,coupon_used,payment_method,customer_tenure_months,is_churn,product_rating,returned
0,ORD10083,1137,2025-08-28,Male,46,West,Store,Clothing,59.58,1,59.58,0,0,Credit Card,30,0,4.8,0
1,ORD10053,1076,2023-09-15,Male,18,East,Online,Home,135.43,1,135.43,5,0,Credit Card,153,0,5.0,0
2,ORD10070,1019,2025-03-20,Male,50,South,Online,Clothing,60.92,3,182.76,0,1,Debit Card,16,0,2.7,0
3,ORD10045,1173,2024-12-29,Other,22,Central,Store,Toys,99.32,1,99.32,5,0,Credit Card,236,1,3.3,0
4,ORD10044,1160,2024-01-10,Female,27,North,Online,Clothing,64.04,1,64.04,5,0,Debit Card,193,0,3.9,0


shape: (100, 18)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   order_id                100 non-null    object        
 1   customer_id             100 non-null    int64         
 2   order_date              100 non-null    datetime64[ns]
 3   gender                  100 non-null    object        
 4   age                     100 non-null    int64         
 5   region                  97 non-null     object        
 6   channel                 100 non-null    object        
 7   product_category        100 non-null    object        
 8   price                   100 non-null    float64       
 9   quantity                100 non-null    int64         
 10  total_amount            100 non-null    float64       
 11  discount_pct            100 non-null    int64         
 12  coupon_used             100 non-nu

None

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
order_id,100.0,100.0,ORD10083,1.0,,,,,,,
customer_id,100.0,,,,1095.56,1003.0,1048.75,1092.5,1149.5,1195.0,57.061958
order_date,100.0,,,,2024-08-16 20:52:48,2023-09-03 00:00:00,2024-02-13 18:00:00,2024-08-11 00:00:00,2025-01-19 00:00:00,2025-08-31 00:00:00,
gender,100.0,3.0,Female,60.0,,,,,,,
age,100.0,,,,43.4,18.0,29.75,44.5,54.0,69.0,15.283185
region,97.0,5.0,South,27.0,,,,,,,
channel,100.0,3.0,Online,62.0,,,,,,,
product_category,100.0,5.0,Clothing,32.0,,,,,,,
price,100.0,,,,85.4234,5.0,57.6525,93.205,108.0675,201.98,44.472109
quantity,100.0,,,,1.73,1.0,1.0,1.0,2.0,5.0,1.126853
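If the course file is not at hand, the same inspection steps can be tried on an inline CSV. This is a minimal sketch: the three rows in `csv_text` are hypothetical stand-ins, not rows from `eda_course_dataset_100rows.csv`, and `sample` is just an illustrative name.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the course file.
csv_text = """order_id,order_date,region,price,quantity
ORD1,2024-01-10,North,10.50,1
ORD2,2024-06-15,,20.00,2
ORD3,2025-03-20,South,7.25,3
"""

# parse_dates converts the named column to a datetime dtype while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['order_date'])

print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # order_date should show a datetime dtype
sample.info()               # prints directly; no display() or print() needed
print(sample.isna().sum())  # the empty region field is read as a missing value
```

Comparing the non-null counts from `.info()` with the row count from `.shape` is a quick way to spot columns that contain missing values.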


## Lesson 2.3 — Indexing & Selection
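Use `df['col']` for a single column, `df[['c1','c2']]` for several columns, `.loc[]` for label-based selection, and `.iloc[]` for position-based selection. Avoid chained indexing (for example `df[mask]['col'] = value`), because the assignment may land on a temporary copy instead of `df`.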

In [None]:
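display(df['region'].head())                 # single column -> Series
display(df[['customer_id','price']].head())  # list of columns -> DataFrame
display(df.loc[0])                           # row with index label 0
display(df.iloc[0])                          # row at position 0 (same row here because of the default RangeIndex)
display(df[df['price'] > 200].head())        # boolean mask keeps rows where price exceeds 200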
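The difference between label and position, and the single-step `.loc` alternative to chained indexing, can be sketched on a tiny frame. The `orders` frame, its string index, and the `'VIP'` label are made up for illustration.

```python
import pandas as pd

# Made-up frame for illustration only; note the non-numeric index labels.
orders = pd.DataFrame(
    {'region': ['West', 'East', 'South'], 'price': [59.58, 135.43, 210.00]},
    index=['x', 'y', 'z'],
)

# .loc selects by label, .iloc by integer position.
first_by_label = orders.loc['x', 'price']
first_by_position = orders.iloc[0, 1]

# Chained indexing such as orders[orders['price'] > 200]['region'] = 'VIP'
# may assign into a temporary copy. A single .loc call with a row mask and a
# column label performs the selection and the assignment in one step.
orders.loc[orders['price'] > 200, 'region'] = 'VIP'
print(orders)
```

With a non-default index like this one, `orders.loc[0]` would raise a `KeyError`, while `orders.iloc[0]` still returns the first row.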

## Lesson 2.4 — Cleaning Data
This lesson covers handling missing values, removing duplicates, converting types, and dropping columns.

In [None]:
display(df.isna().sum())
df['product_rating'] = df['product_rating'].fillna(df['product_rating'].median())
df_clean = df.drop_duplicates().copy()  # explicit copy so later column assignments do not trigger SettingWithCopyWarning
print('duplicates removed, new shape:', df_clean.shape)
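Type conversion and column dropping are listed for this lesson but can be hard to demonstrate on an already tidy dataset, so here is a minimal sketch on a made-up frame. The `raw` frame, the `notes` column, and the `'Unknown'` placeholder are assumptions for illustration; the right imputation choice depends on the column.

```python
import pandas as pd

# Made-up frame with one duplicate row, one missing region, and numbers stored as text.
raw = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A2', 'A3'],
    'region': ['North', 'East', 'East', None],
    'quantity': ['1', '3', '3', '2'],
    'notes': ['', '', '', ''],
})

cleaned = raw.drop_duplicates().copy()                     # remove the repeated A2 row
cleaned['region'] = cleaned['region'].fillna('Unknown')    # placeholder label for a missing category
cleaned['quantity'] = cleaned['quantity'].astype('int64')  # text -> integer
cleaned = cleaned.drop(columns=['notes'])                  # drop a column that carries no information

print(cleaned.dtypes)
print(cleaned)
```

For numeric columns a median fill, as used for `product_rating` above, is a common alternative to a placeholder label.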

## Lesson 2.5 — Transformation
Create new columns, apply functions, run string operations, and extract date parts.

In [None]:
df_clean['price_per_item'] = df_clean['total_amount'] / df_clean['quantity']
df_clean['age_group'] = df_clean['age'].apply(lambda x: 'Youth' if x < 30 else ('Senior' if x>=65 else 'Adult'))
df_clean['payment_method'] = df_clean['payment_method'].astype(str).str.lower()
df_clean['order_year'] = df_clean['order_date'].dt.year
display(df_clean[['price_per_item','age_group','payment_method','order_year']].head())
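As an alternative to `apply` with a lambda, binning can be done with `pd.cut`, which is vectorized and returns a categorical column. The sketch below reproduces the same age boundaries (under 30, 30 to 64, 65 and over) on made-up data; the `demo` frame and the extra `order_month` and `order_weekday` columns are illustrative additions.

```python
import pandas as pd

# Made-up ages and dates for illustration.
demo = pd.DataFrame({
    'age': [18, 29, 30, 64, 65],
    'order_date': pd.to_datetime(
        ['2023-09-15', '2024-01-10', '2024-12-29', '2025-03-20', '2025-08-28']
    ),
})

# right=False makes each bin include its left edge: [0, 30), [30, 65), [65, inf)
demo['age_group'] = pd.cut(
    demo['age'],
    bins=[0, 30, 65, float('inf')],
    labels=['Youth', 'Adult', 'Senior'],
    right=False,
)

# The .dt accessor exposes other date parts besides the year.
demo['order_month'] = demo['order_date'].dt.month
demo['order_weekday'] = demo['order_date'].dt.day_name()
print(demo)
```

Keeping the bin edges in one list makes the boundaries easier to audit and change than nested conditionals.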

**Homework / Practice:**
- Identify 3 columns with missing values and propose imputation strategies.
- Create a `high_value` boolean column that is `True` where `total_amount` exceeds the median of `total_amount`.