# 02.5 — Pandas: Creating & Transforming Columns

This notebook covers common operations in Pandas for **creating and transforming columns**:

- Adding new columns (derived features)
- Applying functions to rows/columns
- String operations on text columns
- Date/time operations with `.dt`

Dataset used: **Titanic** (loaded from GitHub). This notebook is Google Colab-ready.

---

In [1]:
import pandas as pd

url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Adding New Columns

- Use `df['new'] = ...` to create a new derived column.
- Example: Fare per person, family size, etc.


In [2]:
# Add a new column: Family size (SibSp + Parch + 1 for self)
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Add Fare per person
df['FarePerPerson'] = df['Fare'] / df['FamilySize']

df[['SibSp','Parch','FamilySize','Fare','FarePerPerson']].head()

,SibSp,Parch,FamilySize,Fare,FarePerPerson
0,1,0,2,7.25,3.625
1,1,0,2,71.2833,35.64165
2,0,0,1,7.925,7.925
3,1,0,2,53.1,26.55
4,0,0,1,8.05,8.05
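
The same derived columns can also be built without modifying the original DataFrame by using `DataFrame.assign()`. The sketch below is a minimal illustration on a tiny made-up frame (`demo` and its values are assumptions, not rows from the Titanic data).

```python
import pandas as pd

# Tiny stand-in for the Titanic columns used above (values are made up).
demo = pd.DataFrame({
    'SibSp': [1, 0, 3],
    'Parch': [0, 0, 2],
    'Fare': [7.25, 8.05, 31.5],
})

# assign() returns a new DataFrame and leaves `demo` unchanged.
# A lambda lets a later column refer to one created earlier in the same call.
result = demo.assign(
    FamilySize=lambda d: d['SibSp'] + d['Parch'] + 1,
    FarePerPerson=lambda d: d['Fare'] / d['FamilySize'],
)

print(result)
```

This style is handy in method chains, where each step should return a new object rather than change `df` in place.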


## Apply Functions

- Use `.apply()` to transform rows/columns with a function or lambda.
- Great for custom calculations.


In [4]:
# Example: apply a named function to create AgeGroup

def age_group(age):
    """Bucket an age into Unknown / Child / Adult / Senior."""
    if pd.isna(age):
        return 'Unknown'
    elif age < 18:
        return 'Child'
    elif age < 60:
        return 'Adult'
    else:
        return 'Senior'

# Apply the function to each value in the Age column
# (an equivalent lambda would work, but a named function is easier to read)
df['AgeGroup'] = df['Age'].apply(age_group)
df[['Age','AgeGroup']].head(10)

,Age,AgeGroup
0,22.0,Adult
1,38.0,Adult
2,26.0,Adult
3,35.0,Adult
4,35.0,Adult
5,,Unknown
6,54.0,Adult
7,2.0,Child
8,27.0,Adult
9,14.0,Child
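
`.apply()` can also work row by row when a calculation needs more than one column. The sketch below assumes a small made-up frame (`demo`) and a hypothetical helper `describe`; passing `axis=1` hands each row to the function as a Series.

```python
import pandas as pd

# Made-up rows that mimic the Sex and Age columns of the Titanic data.
demo = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, None, 4.0],
})

def describe(row):
    """Combine two columns of a single row into one label."""
    if pd.isna(row['Age']):
        return f"{row['Sex']} (age unknown)"
    label = 'child' if row['Age'] < 18 else 'adult'
    return f"{row['Sex']} {label}"

# axis=1 means "call describe once per row"
demo['Description'] = demo.apply(describe, axis=1)

print(demo)
```

Row-wise `apply` runs a Python function for every row, so it tends to be slow on large frames. It is best kept for logic that is hard to express with column arithmetic.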


## String Operations

- Use `.str` accessor for string transformations.
- Common operations:
  - `.str.lower()` → lowercase
  - `.str.upper()` → uppercase
  - `.str.contains('text')` → search for substring


In [5]:
# Convert names to lowercase
df['Name_lower'] = df['Name'].str.lower()

# Find passengers with 'miss' in their name (case-insensitive)
miss = df[df['Name'].str.contains('miss', case=False, na=False)]

print('Number of passengers with "Miss" in name:', len(miss))
miss[['Name','Sex','Age']].head()

Number of passengers with "Miss" in name: 182


,Name,Sex,Age
2,"Heikkinen, Miss. Laina",female,26.0
10,"Sandstrom, Miss. Marguerite Rut",female,4.0
11,"Bonnell, Miss. Elizabeth",female,58.0
14,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0
22,"McGowan, Miss. Anna ""Annie""",female,15.0
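
A few more `.str` methods, shown on a small made-up Series so the results are easy to check by eye. The names and the variable `names` are illustrative assumptions. The missing value is included on purpose to show why `na=False` matters when the result is used as a filter.

```python
import pandas as pd

# Made-up names in the same "Surname, Title. Given" shape as the dataset.
names = pd.Series(['Braund, Mr. Owen', 'Heikkinen, Miss. Laina', None])

# Uppercase every name; the missing entry stays missing.
upper = names.str.upper()

# na=False treats a missing name as "no match", so the mask is purely boolean.
has_miss = names.str.contains('Miss', na=False)

# Split on the comma and keep the first piece (the surname).
surname = names.str.split(',').str[0]

print(names[has_miss])
print(surname)
```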


## Date Operations

- Convert date-like strings to `datetime` using `pd.to_datetime()`.
- Use `.dt` accessor to extract parts of a datetime column.
- Examples: `.dt.year`, `.dt.month`, `.dt.day_name()`.


In [6]:
# The Titanic dataset has no date column,
# so we create a small demo DataFrame with dates instead

sample_dates = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=5, freq='D')
})

# date_range() already yields datetime64 values;
# a column of date strings would need pd.to_datetime(...) first

# Extract year, month, day name
sample_dates['year'] = sample_dates['date'].dt.year
sample_dates['month'] = sample_dates['date'].dt.month
sample_dates['day_name'] = sample_dates['date'].dt.day_name()

sample_dates

,date,year,month,day_name
0,2020-01-01,2020,1,Wednesday
1,2020-01-02,2020,1,Thursday
2,2020-01-03,2020,1,Friday
3,2020-01-04,2020,1,Saturday
4,2020-01-05,2020,1,Sunday
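
When dates arrive as text, as they usually do after `read_csv`, they have to be converted before `.dt` will work. The sketch below assumes a made-up column named `joined` with one unparseable entry. `errors='coerce'` turns values that cannot be parsed into `NaT` instead of raising an error.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately invalid.
raw = pd.DataFrame({'joined': ['2020-01-01', '2020-02-15', 'not a date']})

# Convert text to datetime64; unparseable values become NaT.
raw['joined'] = pd.to_datetime(raw['joined'], errors='coerce')

# The .dt accessor is available once the column is a datetime type.
raw['year'] = raw['joined'].dt.year
raw['day_name'] = raw['joined'].dt.day_name()

print(raw)
```

Checking `raw['joined'].isna().sum()` after the conversion is a quick way to see how many values failed to parse.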


## Best Practices

- Keep derived columns meaningful and well-named.
- Use `.apply()` only when vectorized Pandas operations are not possible (vectorized is faster).
- Always convert date strings to `datetime` for reliable analysis.
- For string operations, use `.str` accessor instead of Python loops.
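
As an illustration of the vectorized-first advice, the age bucketing from the Apply Functions section can be sketched with `pd.cut` instead of `.apply()`. The ages below are made up, and the bin edges mirror the thresholds used earlier (under 18, under 60, otherwise senior).

```python
import pandas as pd

# Made-up ages, including a missing value and the boundary case 18.
ages = pd.Series([22.0, 2.0, 61.0, None, 18.0])

# right=False makes each bin include its left edge: [0, 18), [18, 60), [60, inf)
groups = pd.cut(
    ages,
    bins=[0, 18, 60, float('inf')],
    labels=['Child', 'Adult', 'Senior'],
    right=False,
)

# cut() leaves missing ages as NaN; add the extra category, then fill it in.
groups = groups.cat.add_categories('Unknown').fillna('Unknown')

print(groups)
```

One difference from the function-based version: values outside the bins (for example a negative age) would end up as `Unknown` here rather than `Child`.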


## Exercises

1. Create a new column `IsMinor` that is True if `Age < 18`.
2. Create `Title` by extracting part of the Name (e.g., Mr, Mrs, Miss).
3. Count how many passengers have the substring `'Dr'` in their name.
4. Create a small DataFrame with custom dates and extract `.dt.year`, `.dt.month_name()`, `.dt.weekday`.
