### What is Data Wrangling?
Data wrangling (also called data munging) is the process of cleaning, transforming, and preparing raw data into a format that is easier to analyze.

When you get real-world data, it’s usually:

- Messy
- Incomplete
- Inconsistent
- Containing errors or duplicates

So before you can run analysis or build models, you have to wrangle (fix/reshape) the data.

### Steps in Data Wrangling (with examples)

**Data Cleaning**

- Handle missing values → fill them (imputation) or drop them.
- Fix inconsistent categories → e.g., "Male", "male", "M" → make all "male".
- Remove typos and formatting issues.
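
As a rough sketch of the category fix described above, the snippet below trims whitespace, lowercases, and maps abbreviations onto one canonical label. The sample values and the `sex_map` dictionary are made up for illustration.

```python
import pandas as pd

# Hypothetical raw values with mixed case, padding, and abbreviations
raw = pd.Series(["Male", "male ", "M", "Female", " F", "female"])

# Assumed mapping from abbreviations to canonical labels
sex_map = {"m": "male", "f": "female"}

# Strip padding, lowercase, then expand abbreviations;
# values that are not keys in sex_map pass through unchanged
clean = raw.str.strip().str.lower().replace(sex_map)

print(clean.tolist())
```

Stripping and lowercasing first keeps the mapping small, because it only has to cover abbreviations rather than every spelling variant.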

**Data Transformation**

- Change column types → e.g., "Age" from string → numeric.
- Scale or normalize numbers (if needed).
- Standardize formats (dates, currency, etc.).
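
A minimal sketch of type conversion, assuming a text column of ages and a text column of dates. The `raw` frame and its `Joined` column are hypothetical and do not come from the Titanic data.

```python
import pandas as pd

# Hypothetical raw columns stored as text
raw = pd.DataFrame({
    "Age": ["22", "38", "unknown", "35"],
    "Joined": ["2021-01-05", "2021-02-05", "not a date", "2021-03-15"],
})

# errors="coerce" turns unparseable entries into NaN / NaT instead of raising
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")
raw["Joined"] = pd.to_datetime(raw["Joined"], format="%Y-%m-%d", errors="coerce")

print(raw.dtypes)
print(raw)
```

Coercing bad entries to missing values lets the missing-data steps handle them later, rather than failing at load time.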

**Data Enrichment / Feature Engineering**

- Add new useful features → e.g., FamilySize = SibSp + Parch + 1.
- Extract information → e.g., title (Mr, Mrs) from Name.

**Data Reduction**

- Remove irrelevant columns.
- Remove duplicate rows.
- Aggregate or group data (summarize).
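
The three reduction steps above can be sketched on a tiny made-up frame. The `Note` column is a hypothetical stand-in for an irrelevant column.

```python
import pandas as pd

# Hypothetical sample with one exact duplicate row and one column we do not need
raw = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [71.0, 71.0, 8.0, 7.0, 9.0],
    "Note": ["a", "a", "b", "c", "d"],
})

# Remove the irrelevant column, then remove exact duplicate rows
reduced = raw.drop(columns=["Note"]).drop_duplicates()

# Summarize: mean fare and row count per passenger class
summary = reduced.groupby("Pclass")["Fare"].agg(["mean", "count"])
print(summary)
```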

**Handling Outliers**

- Detect extreme/unrealistic values (e.g., negative age).
- Decide whether to cap, transform, or remove them.
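
One common way to detect and cap outliers is the 1.5 × IQR rule. It is a heuristic rather than the only option, and the ages below are invented for illustration.

```python
import pandas as pd

# Hypothetical ages, including one impossible value and one extreme value
ages = pd.Series([22, 38, 26, 35, 35, 27, 19, -4, 150], dtype="float64")

# Rule-based check: an age can never be negative
impossible = ages < 0

# IQR fences: anything beyond 1.5 * IQR from the quartiles is flagged
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (ages < lower) | (ages > upper)

# One option: cap (clip) values to the fences instead of dropping rows
capped = ages.clip(lower=lower, upper=upper)

print(ages[is_outlier].tolist())
print(capped.tolist())
```

Capping keeps the row count intact. Setting flagged values to NaN, or dropping the rows, may be more appropriate when the value is clearly a data-entry error.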

**Example before cleaning data**
Name: "  SMITH, Mr. John  "
Sex: "Male "
Age: NaN
Fare: 0
Embarked: "S "

**After Data Cleaning**
Name: "Smith, Mr. John"
Sex: "male"
Age: 32  (filled with median)
Fare: 14.45 (fixed 0 → median)
Embarked: "S"


### Handling Missing Data

Find out how many missing values are in each column.

Drop rows where Age is missing.

Fill missing Age values with the mean age.

Fill missing Embarked values with the most frequent port.

Replace missing Cabin values with "Unknown".

**Removing Duplicates**

Check if there are any duplicate rows. If yes, remove them.

How many duplicate Name entries exist?

**String Operations**

Extract passenger titles (Mr, Mrs, Miss, etc.) from the Name column.

Standardize Sex column values so they’re exactly "Male" or "Female" (capitalize them).

Create a new column Deck by taking the first letter from Cabin (e.g., C85 → C).

**Feature Engineering**

Create a new column FamilySize = SibSp + Parch + 1.

Create a new column IsAlone = 1 if FamilySize == 1 else 0.

Bin ages into groups (Child <12, Teen 12–18, Adult 19–59, Senior >=60).

Create a column FarePerPerson = Fare / FamilySize.

**Filtering & Conditional Updates**

Replace fares of 0 with the median fare.

Replace age outliers (say <1 or >80) with NaN.

Create a column HighFare = 1 if fare > 100, else 0.

**Merging & Joining**

Suppose you have a small dataset of Titanic passengers with their country of origin. Merge it with the main dataset on PassengerId.

Split Name into FirstName and LastName.

**Final Cleanup**

Reorder columns logically: PassengerId, Name, Sex, Age, Pclass, Embarked, ... etc.

# Find out how many missing values are in each column.

In [64]:
import pandas as pd
import numpy as np

In [2]:
df=pd.read_csv('train.csv')

In [4]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
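
Raw counts are easier to judge next to the column length. A minimal sketch, using a small stand-in frame rather than `train.csv`, that reports the share of missing values per column:

```python
import numpy as np
import pandas as pd

# Small stand-in frame; with the real data you would use df directly
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives booleans, so mean() is the fraction missing in each column
missing_pct = (sample.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```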

# Drop rows where Age is missing.

In [5]:
# dropna returns a new DataFrame; df keeps all its rows unless the result is assigned back
df.dropna(subset=['Age'])

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


# Fill missing Age values with the mean age.

In [6]:
# fillna returns a new Series; assign it to df['Age'] to keep the change
df.Age.fillna(df.Age.mean())

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64
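
Filling every gap with one global mean ignores differences between groups. One alternative, sketched here on made-up rows, fills each missing age with the median age of the passenger's class. The `sample` frame and the `AgeFilled` column are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two known ages and one missing age per class
sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [38.0, 54.0, np.nan, 22.0, 26.0, np.nan],
})

# transform("median") returns the class median aligned to every original row
class_median = sample.groupby("Pclass")["Age"].transform("median")

# Only the missing entries are replaced; known ages are left untouched
sample["AgeFilled"] = sample["Age"].fillna(class_median)
print(sample)
```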

# Fill missing Embarked values with the most frequent port.

In [11]:
df.Embarked.value_counts().idxmax()

'S'

In [80]:
df['Embarked'].mode()

0    S
Name: Embarked, dtype: object

In [12]:
# Preview the filled column; df is unchanged until the result is assigned
df.Embarked.fillna(df.Embarked.value_counts().idxmax())
# To keep the change: df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])


0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object

# Replace missing Cabin values with "Unknown".

In [13]:
# Preview only; use df['Cabin'] = df.Cabin.fillna('Unknown') to store the result
df.Cabin.fillna('Unknown')

0      Unknown
1          C85
2      Unknown
3         C123
4      Unknown
        ...   
886    Unknown
887        B42
888    Unknown
889       C148
890    Unknown
Name: Cabin, Length: 891, dtype: object

# Check if there are any duplicate rows. If yes, remove them.

In [18]:
df.duplicated().sum()

0

In [14]:
# No duplicates were found above, so this returns an identical copy; assign it back to keep it
df.drop_duplicates()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


# How many duplicate Name entries exist?


In [16]:
df['Name'].duplicated().sum()

0
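
When the count is not zero, it usually helps to look at the duplicated rows before removing anything. A small sketch with invented names and ticket numbers:

```python
import pandas as pd

# Hypothetical passengers: the same name appears twice with different tickets
sample = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Miss. Jane", "Doe, Mr. John"],
    "Ticket": ["T-001", "T-002", "T-003"],
})

# keep=False flags every member of a duplicate group, not just the later ones
dupes = sample[sample.duplicated(subset=["Name"], keep=False)]
print(dupes)

# Keep the first occurrence of each name and drop the rest
deduped = sample.drop_duplicates(subset=["Name"], keep="first")
print(deduped)
```

Two passengers can legitimately share a name, so inspecting `dupes` first avoids deleting distinct people.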

# Extract passenger titles (Mr, Mrs, Miss, etc.) from the Name column.

In [28]:
# Title without the trailing period, e.g. "Mr"
df['title'] = df['Name'].str.extract(r',\s*([^.]+)\.')

# Title with the trailing period, e.g. "Mr."
df['Title'] = df['Name'].str.extract(r',\s*([^.]+\.)')

In [29]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,title,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,Mr,Mr.
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Mrs.
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss,Miss.
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Mrs,Mrs.
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,Mr,Mr.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,Rev,Rev.
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Miss,Miss.
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,Miss,Miss.
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Mr,Mr.


# Standardize Sex column values so they’re exactly "Male" or "Female" (capitalize them).

In [33]:
# Trim whitespace, then capitalize only the first letter: "male " becomes "Male"
df['Sex'] = df.Sex.str.strip().str.capitalize()

In [34]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,title,Title
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.2500,,S,Mr,Mr.
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Mrs.
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss,Miss.
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1000,C123,S,Mrs,Mrs.
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.0500,,S,Mr,Mr.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Male,27.0,0,0,211536,13.0000,,S,Rev,Rev.
887,888,1,1,"Graham, Miss. Margaret Edith",Female,19.0,0,0,112053,30.0000,B42,S,Miss,Miss.
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Female,,1,2,W./C. 6607,23.4500,,S,Miss,Miss.
889,890,1,1,"Behr, Mr. Karl Howell",Male,26.0,0,0,111369,30.0000,C148,C,Mr,Mr.


# Create a new column Deck by taking the first letter from Cabin (e.g., C85 → C).

In [40]:
# Capture the first character of Cabin; missing cabins stay NaN
df['Deck'] = df.Cabin.str.extract(r"^(.)")
# Simpler alternative: df['Deck'] = df.Cabin.str[0]

In [41]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,title,Title,Deck
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.2500,,S,Mr,Mr.,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Mrs.,C
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss,Miss.,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1000,C123,S,Mrs,Mrs.,C
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.0500,,S,Mr,Mr.,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Male,27.0,0,0,211536,13.0000,,S,Rev,Rev.,
887,888,1,1,"Graham, Miss. Margaret Edith",Female,19.0,0,0,112053,30.0000,B42,S,Miss,Miss.,B
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Female,,1,2,W./C. 6607,23.4500,,S,Miss,Miss.,
889,890,1,1,"Behr, Mr. Karl Howell",Male,26.0,0,0,111369,30.0000,C148,C,Mr,Mr.,C


# Create a new column FamilySize = SibSp + Parch + 1.

In [42]:
# Siblings/spouses + parents/children + the passenger themselves
df['FamilySize'] = df.SibSp + df.Parch + 1

In [43]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,title,Title,Deck,FamilySize
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.2500,,S,Mr,Mr.,,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Mrs.,C,2
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss,Miss.,,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1000,C123,S,Mrs,Mrs.,C,2
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.0500,,S,Mr,Mr.,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Male,27.0,0,0,211536,13.0000,,S,Rev,Rev.,,1
887,888,1,1,"Graham, Miss. Margaret Edith",Female,19.0,0,0,112053,30.0000,B42,S,Miss,Miss.,B,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Female,,1,2,W./C. 6607,23.4500,,S,Miss,Miss.,,4
889,890,1,1,"Behr, Mr. Karl Howell",Male,26.0,0,0,111369,30.0000,C148,C,Mr,Mr.,C,1


# Create a new column IsAlone = 1 if FamilySize == 1 else 0.

In [45]:
# Vectorized comparison; astype(int) turns True/False into 1/0
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

In [46]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,title,Title,Deck,FamilySize,IsAlone
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.2500,,S,Mr,Mr.,,2,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Mrs.,C,2,0
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss,Miss.,,1,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1000,C123,S,Mrs,Mrs.,C,2,0
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.0500,,S,Mr,Mr.,,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Male,27.0,0,0,211536,13.0000,,S,Rev,Rev.,,1,1
887,888,1,1,"Graham, Miss. Margaret Edith",Female,19.0,0,0,112053,30.0000,B42,S,Miss,Miss.,B,1,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Female,,1,2,W./C. 6607,23.4500,,S,Miss,Miss.,,4,0
889,890,1,1,"Behr, Mr. Karl Howell",Male,26.0,0,0,111369,30.0000,C148,C,Mr,Mr.,C,1,1


# Bin ages into groups (Child <12, Teen 12–18, Adult 19–59, Senior >=60).

In [52]:
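bins = [0, 12, 19, 60, 100]
labels = ['Child', 'Teen', 'Adult', 'Senior']
# right=False makes each bin include its left edge: [0, 12), [12, 19), [19, 60), [60, 100)
df['AgeGroup'] = pd.cut(df.Age, bins=bins, labels=labels, right=False)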

In [53]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,title,Title,Deck,FamilySize,IsAlone,AgeGroup
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.2500,,S,Mr,Mr.,,2,0,Adult
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Mrs.,C,2,0,Adult
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss,Miss.,,1,1,Adult
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1000,C123,S,Mrs,Mrs.,C,2,0,Adult
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.0500,,S,Mr,Mr.,,1,1,Adult
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Male,27.0,0,0,211536,13.0000,,S,Rev,Rev.,,1,1,Adult
887,888,1,1,"Graham, Miss. Margaret Edith",Female,19.0,0,0,112053,30.0000,B42,S,Miss,Miss.,B,1,1,Adult
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Female,,1,2,W./C. 6607,23.4500,,S,Miss,Miss.,,4,0,
889,890,1,1,"Behr, Mr. Karl Howell",Male,26.0,0,0,111369,30.0000,C148,C,Mr,Mr.,C,1,1,Adult
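
To check how `pd.cut` treats ages that sit exactly on a bin edge, the sketch below runs the same bins and labels on a handful of boundary values. The `ages` series is made up for the test.

```python
import pandas as pd

# Boundary ages chosen to sit on or next to each bin edge
ages = pd.Series([11, 12, 18, 19, 59, 60])
bins = [0, 12, 19, 60, 100]
labels = ["Child", "Teen", "Adult", "Senior"]

# right=False closes each bin on the left: [0, 12), [12, 19), [19, 60), [60, 100)
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
```

With these bins an age of 100 or more falls outside the last interval and becomes NaN, so the upper edge may need raising for other datasets.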


# Create a column FarePerPerson = Fare / FamilySize.

In [54]:
df['FarePerPerson'] = df['Fare'] / df['FamilySize']

In [55]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,title,Title,Deck,FamilySize,IsAlone,AgeGroup,FarePerPerson
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.2500,,S,Mr,Mr.,,2,0,Adult,3.62500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Mrs.,C,2,0,Adult,35.64165
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss,Miss.,,1,1,Adult,7.92500
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1000,C123,S,Mrs,Mrs.,C,2,0,Adult,26.55000
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.0500,,S,Mr,Mr.,,1,1,Adult,8.05000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Male,27.0,0,0,211536,13.0000,,S,Rev,Rev.,,1,1,Adult,13.00000
887,888,1,1,"Graham, Miss. Margaret Edith",Female,19.0,0,0,112053,30.0000,B42,S,Miss,Miss.,B,1,1,Adult,30.00000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Female,,1,2,W./C. 6607,23.4500,,S,Miss,Miss.,,4,0,,5.86250
889,890,1,1,"Behr, Mr. Karl Howell",Male,26.0,0,0,111369,30.0000,C148,C,Mr,Mr.,C,1,1,Adult,30.00000


# Replace fares of 0 with the median fare.

In [56]:
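# The median is computed over all fares, including the zeros that are about to be replaced
median = df.Fare.median()

In [57]:
# Boolean mask selects the rows with Fare == 0; only the Fare column is overwritten
df.loc[df.Fare == 0, "Fare"] = median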

In [58]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,title,Title,Deck,FamilySize,IsAlone,AgeGroup,FarePerPerson
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.2500,,S,Mr,Mr.,,2,0,Adult,3.62500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Mrs.,C,2,0,Adult,35.64165
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss,Miss.,,1,1,Adult,7.92500
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1000,C123,S,Mrs,Mrs.,C,2,0,Adult,26.55000
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.0500,,S,Mr,Mr.,,1,1,Adult,8.05000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Male,27.0,0,0,211536,13.0000,,S,Rev,Rev.,,1,1,Adult,13.00000
887,888,1,1,"Graham, Miss. Margaret Edith",Female,19.0,0,0,112053,30.0000,B42,S,Miss,Miss.,B,1,1,Adult,30.00000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Female,,1,2,W./C. 6607,23.4500,,S,Miss,Miss.,,4,0,,5.86250
889,890,1,1,"Behr, Mr. Karl Howell",Male,26.0,0,0,111369,30.0000,C148,C,Mr,Mr.,C,1,1,Adult,30.00000


# Replace age outliers (say <1 or >80) with NaN.

In [65]:
# Each condition needs its own parentheses because | binds tighter than the comparisons
df.loc[(df.Age < 1) | (df.Age > 80), "Age"] = np.nan

In [67]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,title,Title,Deck,FamilySize,IsAlone,AgeGroup,FarePerPerson
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.2500,,S,Mr,Mr.,,2,0,Adult,3.62500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Mrs.,C,2,0,Adult,35.64165
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss,Miss.,,1,1,Adult,7.92500
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1000,C123,S,Mrs,Mrs.,C,2,0,Adult,26.55000
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.0500,,S,Mr,Mr.,,1,1,Adult,8.05000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Male,27.0,0,0,211536,13.0000,,S,Rev,Rev.,,1,1,Adult,13.00000
887,888,1,1,"Graham, Miss. Margaret Edith",Female,19.0,0,0,112053,30.0000,B42,S,Miss,Miss.,B,1,1,Adult,30.00000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Female,,1,2,W./C. 6607,23.4500,,S,Miss,Miss.,,4,0,,5.86250
889,890,1,1,"Behr, Mr. Karl Howell",Male,26.0,0,0,111369,30.0000,C148,C,Mr,Mr.,C,1,1,Adult,30.00000


# Create a column HighFare = 1 if fare > 100, else 0.

In [68]:
# Vectorized comparison; astype(int) turns True/False into 1/0
df['HighFare'] = (df.Fare > 100).astype(int)

In [70]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,title,Title,Deck,FamilySize,IsAlone,AgeGroup,FarePerPerson,HighFare
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.2500,,S,Mr,Mr.,,2,0,Adult,3.62500,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Mrs.,C,2,0,Adult,35.64165,0
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss,Miss.,,1,1,Adult,7.92500,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1000,C123,S,Mrs,Mrs.,C,2,0,Adult,26.55000,0
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.0500,,S,Mr,Mr.,,1,1,Adult,8.05000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Male,27.0,0,0,211536,13.0000,,S,Rev,Rev.,,1,1,Adult,13.00000,0
887,888,1,1,"Graham, Miss. Margaret Edith",Female,19.0,0,0,112053,30.0000,B42,S,Miss,Miss.,B,1,1,Adult,30.00000,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Female,,1,2,W./C. 6607,23.4500,,S,Miss,Miss.,,4,0,,5.86250,0
889,890,1,1,"Behr, Mr. Karl Howell",Male,26.0,0,0,111369,30.0000,C148,C,Mr,Mr.,C,1,1,Adult,30.00000,0


# Split Name into FirstName and LastName.

In [72]:
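# Names look like "Last, Title. Given names", so split once on the first comma
df[['LastName', 'FirstName']] = df['Name'].str.split(',', n=1, expand=True)
# FirstName still carries the title, e.g. "Mr. Owen Harris"
df['FirstName'] = df['FirstName'].str.strip()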


In [73]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,title,Title,Deck,FamilySize,IsAlone,AgeGroup,FarePerPerson,HighFare,LastName,FirstName
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.2500,...,Mr,Mr.,,2,0,Adult,3.62500,0,Braund,Mr. Owen Harris
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,...,Mrs,Mrs.,C,2,0,Adult,35.64165,0,Cumings,Mrs. John Bradley (Florence Briggs Thayer)
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.9250,...,Miss,Miss.,,1,1,Adult,7.92500,0,Heikkinen,Miss. Laina
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1000,...,Mrs,Mrs.,C,2,0,Adult,26.55000,0,Futrelle,Mrs. Jacques Heath (Lily May Peel)
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.0500,...,Mr,Mr.,,1,1,Adult,8.05000,0,Allen,Mr. William Henry
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Male,27.0,0,0,211536,13.0000,...,Rev,Rev.,,1,1,Adult,13.00000,0,Montvila,Rev. Juozas
887,888,1,1,"Graham, Miss. Margaret Edith",Female,19.0,0,0,112053,30.0000,...,Miss,Miss.,B,1,1,Adult,30.00000,0,Graham,Miss. Margaret Edith
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Female,,1,2,W./C. 6607,23.4500,...,Miss,Miss.,,4,0,,5.86250,0,Johnston,"Miss. Catherine Helen ""Carrie"""
889,890,1,1,"Behr, Mr. Karl Howell",Male,26.0,0,0,111369,30.0000,...,Mr,Mr.,C,1,1,Adult,30.00000,0,Behr,Mr. Karl Howell
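
The comma split leaves the title inside `FirstName`. If the three parts are wanted separately, one option is a single regular expression with named groups, sketched here on three names from the dataset. The `GivenNames` column name is an assumption made for this example.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Named groups become the column names of the extracted DataFrame:
# everything before the comma, the title up to the first period, then the rest
pattern = r"^(?P<LastName>[^,]+),\s*(?P<Title>[^.]+)\.\s*(?P<GivenNames>.*)$"
parts = names.str.extract(pattern)
print(parts)
```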


# Suppose you have a small dataset of Titanic passengers with their country of origin. Merge it with the main dataset on PassengerId.

In [74]:
country_data = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Country": ["UK", "USA", "France", "Germany"]
})

In [75]:
merged_df = df.merge(country_data, on="PassengerId", how="left")

In [76]:
merged_df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Title,Deck,FamilySize,IsAlone,AgeGroup,FarePerPerson,HighFare,LastName,FirstName,Country
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.2500,...,Mr.,,2,0,Adult,3.62500,0,Braund,Mr. Owen Harris,UK
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,...,Mrs.,C,2,0,Adult,35.64165,0,Cumings,Mrs. John Bradley (Florence Briggs Thayer),USA
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.9250,...,Miss.,,1,1,Adult,7.92500,0,Heikkinen,Miss. Laina,France
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1000,...,Mrs.,C,2,0,Adult,26.55000,0,Futrelle,Mrs. Jacques Heath (Lily May Peel),Germany
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.0500,...,Mr.,,1,1,Adult,8.05000,0,Allen,Mr. William Henry,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Male,27.0,0,0,211536,13.0000,...,Rev.,,1,1,Adult,13.00000,0,Montvila,Rev. Juozas,
887,888,1,1,"Graham, Miss. Margaret Edith",Female,19.0,0,0,112053,30.0000,...,Miss.,B,1,1,Adult,30.00000,0,Graham,Miss. Margaret Edith,
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Female,,1,2,W./C. 6607,23.4500,...,Miss.,,4,0,,5.86250,0,Johnston,"Miss. Catherine Helen ""Carrie""",
889,890,1,1,"Behr, Mr. Karl Howell",Male,26.0,0,0,111369,30.0000,...,Mr.,C,1,1,Adult,30.00000,0,Behr,Mr. Karl Howell,


In [77]:
merged_df.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Title,Deck,FamilySize,IsAlone,AgeGroup,FarePerPerson,HighFare,LastName,FirstName,Country
0,1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.25,...,Mr.,,2,0,Adult,3.625,0,Braund,Mr. Owen Harris,UK
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,...,Mrs.,C,2,0,Adult,35.64165,0,Cumings,Mrs. John Bradley (Florence Briggs Thayer),USA
2,3,1,3,"Heikkinen, Miss. Laina",Female,26.0,0,0,STON/O2. 3101282,7.925,...,Miss.,,1,1,Adult,7.925,0,Heikkinen,Miss. Laina,France
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Female,35.0,1,0,113803,53.1,...,Mrs.,C,2,0,Adult,26.55,0,Futrelle,Mrs. Jacques Heath (Lily May Peel),Germany
4,5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.05,...,Mr.,,1,1,Adult,8.05,0,Allen,Mr. William Henry,
5,6,0,3,"Moran, Mr. James",Male,,0,0,330877,8.4583,...,Mr.,,1,1,,8.4583,0,Moran,Mr. James,
6,7,0,1,"McCarthy, Mr. Timothy J",Male,54.0,0,0,17463,51.8625,...,Mr.,E,1,1,Adult,51.8625,0,McCarthy,Mr. Timothy J,
7,8,0,3,"Palsson, Master. Gosta Leonard",Male,2.0,3,1,349909,21.075,...,Master.,,5,0,Child,4.215,0,Palsson,Master. Gosta Leonard,
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",Female,27.0,0,2,347742,11.1333,...,Mrs.,,3,0,Adult,3.7111,0,Johnson,Mrs. Oscar W (Elisabeth Vilhelmina Berg),
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",Female,14.0,1,0,237736,30.0708,...,Mrs.,,2,0,Teen,15.0354,0,Nasser,Mrs. Nicholas (Adele Achem),
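
A left merge silently leaves `Country` empty for passengers without a match. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` to make that visible. The `passengers` frame is a tiny stand-in for the main dataset.

```python
import pandas as pd

# Stand-in for the main dataset (five passengers, placeholder names)
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Name": ["A", "B", "C", "D", "E"],
})
country_data = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Country": ["UK", "USA", "France", "Germany"],
})

merged = passengers.merge(
    country_data,
    on="PassengerId",
    how="left",
    validate="one_to_one",  # raises MergeError if PassengerId repeats on either side
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# Rows from the left frame that found no partner in country_data
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```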


# Reorder columns logically: PassengerId, Name, Sex, Age, Pclass, Embarked, ... etc.

In [79]:
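print(df.columns.tolist())
# Note: selecting with a list keeps only the listed columns; every column not in new_order is dropped
new_order = ['PassengerId', 'Name', 'Sex', 'Age', 'Pclass', 'Embarked', 'Fare', 'Survived']
df = df[new_order]
print(df.head())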


['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'title', 'Title', 'Deck', 'FamilySize', 'IsAlone', 'AgeGroup', 'FarePerPerson', 'HighFare', 'LastName', 'FirstName']
   PassengerId                                               Name     Sex  \
0            1                            Braund, Mr. Owen Harris    Male   
1            2  Cumings, Mrs. John Bradley (Florence Briggs Th...  Female   
2            3                             Heikkinen, Miss. Laina  Female   
3            4       Futrelle, Mrs. Jacques Heath (Lily May Peel)  Female   
4            5                           Allen, Mr. William Henry    Male   

    Age  Pclass Embarked     Fare  Survived  
0  22.0       3        S   7.2500         0  
1  38.0       1        C  71.2833         1  
2  26.0       3        S   7.9250         1  
3  35.0       1        S  53.1000         1  
4  35.0       3        S   8.0500         0  
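
To move a few columns to the front while keeping all the others, one option is to build the new order from the existing column list. This is a sketch on a small stand-in frame; `front` and `rest` are illustrative names.

```python
import pandas as pd

# Stand-in frame with columns in an arbitrary order
sample = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 3],
    "PassengerId": [1, 3],
    "Fare": [7.25, 7.925],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
})

# Columns to move to the front, in the desired order
front = ["PassengerId", "Name"]

# Every remaining column, kept in its original relative order
rest = [col for col in sample.columns if col not in front]

reordered = sample[front + rest]
print(reordered.columns.tolist())
```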
