# Titanic Dataset Analysis with Pandas

This notebook demonstrates essential pandas methods using the famous Titanic dataset. We'll explore various data manipulation and analysis techniques including:

- **groupby**: Grouping data for aggregation
- **query**: Filtering data with expressions
- **agg**: Applying multiple aggregation functions
- **sort_values**: Sorting data
- **iloc**: Integer-based indexing
- **loc**: Label-based indexing
- And many more pandas methods!

Let's start by loading the necessary libraries and the dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Loading and Initial Exploration of the Dataset

First, let's load the Titanic dataset and get familiar with its structure.

In [2]:
# Load the Titanic dataset
df = pd.read_csv('datasets/Titanic-Dataset.csv')

print("Dataset loaded successfully!")
print(f"Shape of the dataset: {df.shape}")

Dataset loaded successfully!
Shape of the dataset: (891, 12)


In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head() 

First 5 rows of the dataset:


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [5]:
# Get basic information about the dataset
print("Dataset Info:")
df.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
# Statistical summary
print("Statistical Summary:")
df.describe()

Statistical Summary:


|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [7]:
# Check for missing values
print("Missing values in each column:")
df.isnull().sum()

Missing values in each column:


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

## 2. GroupBy Method - Grouping Data for Analysis

The `groupby()` method is one of the most powerful features in pandas. It allows us to split data into groups based on some criteria, apply a function to each group, and combine the results.
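
To make the split-apply-combine idea concrete, here is a minimal sketch on a small hand-made frame. The `toy` data is invented for illustration and is not taken from the Titanic file. It computes the per-class mean once with an explicit loop over the groups and once with a single `groupby()` call.

```python
import pandas as pd

# Hypothetical toy data, not rows from the Titanic CSV
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Split: one sub-frame per class; apply: mean of Survived; combine: collect in a dict
manual = {}
for pclass, group in toy.groupby("Pclass"):
    manual[pclass] = group["Survived"].mean()

# The same three steps in a single call
grouped = toy.groupby("Pclass")["Survived"].mean()

print(manual)
print(grouped.to_dict())
```

Both printouts should describe the same mapping from class to mean; the one-line form is simply the loop with the bookkeeping handled by pandas.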

### Example 1: Basic GroupBy - Survival Rate by Class

In [8]:
# Group by passenger class and calculate survival rate
survival_by_class = df.groupby('Pclass')['Survived'].mean()

print("Survival rate by passenger class:")
print(survival_by_class)
print("\nInterpretation: Higher class passengers had better survival rates")

Survival rate by passenger class:
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

Interpretation: Higher class passengers had better survival rates


### Example 2: Multiple Grouping Variables - Survival by Class and Gender

In [9]:
# Group by multiple columns
survival_by_class_gender = df.groupby(['Pclass', 'Sex'])['Survived'].mean()

print("Survival rate by class and gender:")
print(survival_by_class_gender)
print("\nInterpretation: Females had higher survival rates across all classes")

Survival rate by class and gender:
Pclass  Sex   
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
        male      0.135447
Name: Survived, dtype: float64

Interpretation: Females had higher survival rates across all classes


### Example 3: Multiple Statistics with GroupBy

In [10]:
# Get multiple statistics for grouped data
age_stats_by_class = df.groupby('Pclass')['Age'].agg(['count', 'mean', 'median', 'std'])

print("Age statistics by passenger class:")
print(age_stats_by_class.round(2))
print("\nInterpretation: First class passengers were generally older")

Age statistics by passenger class:
        count   mean  median   std
Pclass                            
1         186  38.23    37.0  14.8
2         173  29.88    29.0  14.0
3         355  25.14    24.0  12.5

Interpretation: First class passengers were generally older


### Example 4: GroupBy with Transform

In [11]:
# Add a column with mean age by class
df['Mean_Age_By_Class'] = df.groupby('Pclass')['Age'].transform('mean')

print("First 10 rows showing original age and mean age by class:")
print(df[['Pclass', 'Age', 'Mean_Age_By_Class']].head(10))
print("\nInterpretation: Transform keeps the original dataframe structure while adding group statistics")

First 10 rows showing original age and mean age by class:
   Pclass   Age  Mean_Age_By_Class
0       3  22.0          25.140620
1       1  38.0          38.233441
2       3  26.0          25.140620
3       1  35.0          38.233441
4       3  35.0          25.140620
5       3   NaN          25.140620
6       1  54.0          38.233441
7       3   2.0          25.140620
8       3  27.0          25.140620
9       2  14.0          29.877630

Interpretation: Transform keeps the original dataframe structure while adding group statistics
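
Because `transform` returns one value per original row, its result can be used directly to fill gaps such as the missing age in row 5 above. The sketch below assumes that filling with the class mean is acceptable for the analysis at hand; the notebook itself does not impute ages, and the `toy` frame and the `Age_Filled` column are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with one missing age per class
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 36.0, None, 20.0, None, 30.0],
})

# transform returns a Series aligned with the original rows,
# so it can be passed straight to fillna
class_mean_age = toy.groupby("Pclass")["Age"].transform("mean")
toy["Age_Filled"] = toy["Age"].fillna(class_mean_age)

print(toy)
```

Rows that already have an age keep it; only the missing entries take the mean of their own class.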


## 3. Query Method - Powerful Data Filtering

The `query()` method provides a concise way to filter data using string expressions. It's often more readable than boolean indexing.
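
As a quick comparison, the sketch below runs the same filter with `query()` and with a boolean mask on an invented `toy` frame (not the Titanic data) and checks that both return identical rows.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 1],
    "Age": [38.0, 22.0, 14.0, 54.0],
    "Sex": ["female", "male", "female", "male"],
})

# String expression: column names are written bare, conditions joined with 'and'
via_query = toy.query('Age < 40 and Sex == "female"')

# Boolean mask: each condition needs parentheses and '&'
via_mask = toy[(toy["Age"] < 40) & (toy["Sex"] == "female")]

print(via_query.equals(via_mask))
```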

### Example 1: Simple Query - First Class Passengers

In [12]:
# Filter first class passengers
first_class = df.query('Pclass == 1')

print(f"Number of first class passengers: {len(first_class)}")
print(f"Survival rate among first class: {first_class['Survived'].mean():.3f}")
print("\nFirst few first class passengers:")
print(first_class[['Name', 'Sex', 'Age', 'Survived']].head())

Number of first class passengers: 216
Survival rate among first class: 0.630

First few first class passengers:
                                                 Name     Sex   Age  Survived
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0         1
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0         1
6                             McCarthy, Mr. Timothy J    male  54.0         0
11                           Bonnell, Miss. Elizabeth  female  58.0         1
23                       Sloper, Mr. William Thompson    male  28.0         1


### Example 2: Complex Query with Multiple Conditions

In [13]:
# Filter young female survivors
young_female_survivors = df.query('Age < 30 and Sex == "female" and Survived == 1')

print(f"Number of young female survivors: {len(young_female_survivors)}")
print("\nSample of young female survivors:")
print(young_female_survivors[['Name', 'Age', 'Pclass']].head())

Number of young female survivors: 105

Sample of young female survivors:
                                                 Name   Age  Pclass
2                              Heikkinen, Miss. Laina  26.0       3
8   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  27.0       3
9                 Nasser, Mrs. Nicholas (Adele Achem)  14.0       2
10                    Sandstrom, Miss. Marguerite Rut   4.0       3
22                        McGowan, Miss. Anna "Annie"  15.0       3


### Example 3: Query with Variables

In [14]:
# Using variables in queries
min_age = 30
max_age = 50
target_class = 2

filtered_passengers = df.query('Age >= @min_age and Age <= @max_age and Pclass == @target_class')

print(f"Passengers aged {min_age}-{max_age} in class {target_class}: {len(filtered_passengers)}")
print(f"Their survival rate: {filtered_passengers['Survived'].mean():.3f}")

Passengers aged 30-50 in class 2: 70
Their survival rate: 0.457


### Example 4: Query with String Methods

In [15]:
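# Find passengers with titles in their names
# regex=False matches the literal text, so the period needs no backslash escape;
# engine='python' may be needed for .str methods, depending on the pandas version
passengers_with_mr = df.query('Name.str.contains("Mr.", regex=False, na=False)', engine='python')
passengers_with_mrs = df.query('Name.str.contains("Mrs.", regex=False, na=False)', engine='python')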

print(f"Passengers with 'Mr.': {len(passengers_with_mr)}")
print(f"Passengers with 'Mrs.': {len(passengers_with_mrs)}")
print(f"Survival rate for Mr.: {passengers_with_mr['Survived'].mean():.3f}")
print(f"Survival rate for Mrs.: {passengers_with_mrs['Survived'].mean():.3f}")

Passengers with 'Mr.': 517
Passengers with 'Mrs.': 125
Survival rate for Mr.: 0.157
Survival rate for Mrs.: 0.792


## 4. Agg Method - Multiple Aggregations

The `agg()` method allows you to apply multiple aggregation functions to your data in a single operation.

### Example 1: Basic Aggregation on Multiple Columns

In [16]:
# Apply different aggregations to different columns
basic_agg = df.agg({
    'Age': ['mean', 'median', 'std'],
    'Fare': ['mean', 'max', 'min'],
    'Survived': ['sum', 'mean']
})

print("Basic aggregations:")
print(basic_agg.round(2))

Basic aggregations:
          Age    Fare  Survived
mean    29.70   32.20      0.38
median  28.00     NaN       NaN
std     14.53     NaN       NaN
max       NaN  512.33       NaN
min       NaN    0.00       NaN
sum       NaN     NaN    342.00


### Example 2: Custom Aggregation Functions

In [18]:
# Define custom aggregation functions
def age_range(series):
    return series.max() - series.min()

def survival_rate(series):
    return f"{series.mean():.1%}"

custom_agg = df.agg({
    'Age': [age_range, 'mean'],
    'Survived': [survival_rate, 'sum']
})

print("Custom aggregations:")
print(custom_agg)

Custom aggregations:
                     Age Survived
age_range      79.580000      NaN
mean           29.699118      NaN
survival_rate        NaN    38.4%
sum                  NaN      342
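
Note that `survival_rate` returns a formatted string, so the `Survived` column of the result mixes text with numbers and can no longer be used in arithmetic. One alternative, sketched here on an invented `toy` frame, is to keep the aggregate numeric and apply the percent formatting only when printing.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Survived": [1, 0, 0, 1, 0]})

# Keep the aggregates as numbers ...
summary = toy.agg({"Survived": ["mean", "sum"]})
rate = summary.loc["mean", "Survived"]

# ... and format only at the point of display
print(f"Survival rate: {rate:.1%}")
print(f"Survivors: {int(summary.loc['sum', 'Survived'])}")
```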


### Example 3: Aggregation with GroupBy

In [19]:
# Combine groupby with agg for detailed analysis
detailed_stats = df.groupby('Sex').agg({
    'Age': ['count', 'mean', 'std'],
    'Fare': ['mean', 'median'],
    'Survived': ['sum', 'mean'],
    'SibSp': 'mean',
    'Parch': 'mean'
})

print("Detailed statistics by gender:")
print(detailed_stats.round(2))

Detailed statistics by gender:
         Age                 Fare        Survived       SibSp Parch
       count   mean    std   mean median      sum  mean  mean  mean
Sex                                                                
female   261  27.92  14.11  44.48   23.0      233  0.74  0.69  0.65
male     453  30.73  14.68  25.52   10.5      109  0.19  0.43  0.24


### Example 4: Named Aggregations (pandas >= 0.25)

In [20]:
# Use named aggregations for cleaner column names
named_agg = df.groupby('Pclass').agg(
    avg_age=('Age', 'mean'),
    median_fare=('Fare', 'median'),
    total_passengers=('PassengerId', 'count'),
    survivors=('Survived', 'sum'),
    survival_rate=('Survived', 'mean')
)

print("Named aggregations by class:")
print(named_agg.round(2))

Named aggregations by class:
        avg_age  median_fare  total_passengers  survivors  survival_rate
Pclass                                                                  
1         38.23        60.29               216        136           0.63
2         29.88        14.25               184         87           0.47
3         25.14         8.05               491        119           0.24


## 5. Sort_values Method - Sorting Data

The `sort_values()` method is used to sort dataframes by one or more columns.

### Example 1: Simple Sorting by Single Column

In [None]:
# Drop rows with a missing age, then sort by age (ascending)
sorted_by_age = df.dropna(subset=['Age']).sort_values('Age')

print("Youngest passengers:")
print(sorted_by_age[['Name', 'Age', 'Sex', 'Pclass']].head())

print("\nOldest passengers:")
print(sorted_by_age[['Name', 'Age', 'Sex', 'Pclass']].tail())

### Example 2: Sorting in Descending Order

In [21]:
# Sort by fare (descending)
sorted_by_fare = df.sort_values('Fare', ascending=False)

print("Passengers who paid the highest fares:")
print(sorted_by_fare[['Name', 'Fare', 'Pclass', 'Survived']].head())

print("\nInterpretation: Higher fares were generally associated with first class")

Passengers who paid the highest fares:
                                   Name      Fare  Pclass  Survived
258                    Ward, Miss. Anna  512.3292       1         1
737              Lesurer, Mr. Gustave J  512.3292       1         1
679  Cardeza, Mr. Thomas Drake Martinez  512.3292       1         1
88           Fortune, Miss. Mabel Helen  263.0000       1         1
27       Fortune, Mr. Charles Alexander  263.0000       1         0

Interpretation: Higher fares were generally associated with first class


### Example 3: Sorting by Multiple Columns

In [22]:
# Sort by class (ascending) then by fare (descending)
multi_sort = df.sort_values(['Pclass', 'Fare'], ascending=[True, False])

print("Passengers sorted by class, then by fare within each class:")
print(multi_sort[['Name', 'Pclass', 'Fare', 'Survived']].head(10))

print("\nInterpretation: Within each class, passengers are ordered by highest fare first")

Passengers sorted by class, then by fare within each class:
                                      Name  Pclass      Fare  Survived
258                       Ward, Miss. Anna       1  512.3292         1
679     Cardeza, Mr. Thomas Drake Martinez       1  512.3292         1
737                 Lesurer, Mr. Gustave J       1  512.3292         1
27          Fortune, Mr. Charles Alexander       1  263.0000         0
88              Fortune, Miss. Mabel Helen       1  263.0000         1
341         Fortune, Miss. Alice Elizabeth       1  263.0000         1
438                      Fortune, Mr. Mark       1  263.0000         0
311             Ryerson, Miss. Emily Borie       1  262.3750         1
742  Ryerson, Miss. Susan Parker "Suzette"       1  262.3750         1
118               Baxter, Mr. Quigg Edmond       1  247.5208         0

Interpretation: Within each class, passengers are ordered by highest fare first
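
The three passengers tied at 512.3292 appear in a different order in Examples 2 and 3. Rows with equal sort keys are not guaranteed to keep any particular order under the default sort algorithm, so when a reproducible order matters, one option is to add an explicit tie-breaker column. The sketch below assumes `PassengerId` is a suitable tie-breaker and uses invented `toy` values.

```python
import pandas as pd

# Hypothetical toy data with tied fares
toy = pd.DataFrame({
    "PassengerId": [3, 1, 2, 4],
    "Fare": [512.33, 512.33, 263.0, 512.33],
})

# Sort by fare (descending), breaking ties by PassengerId (ascending)
ordered = toy.sort_values(["Fare", "PassengerId"], ascending=[False, True])

print(ordered)
```

For a single sort column, passing `kind='stable'` to `sort_values()` is another way to keep tied rows in their original order.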


### Example 4: Sorting with Missing Values

In [23]:
# Sort by age, handling missing values
sort_with_na_first = df.sort_values('Age', na_position='first')
sort_with_na_last = df.sort_values('Age', na_position='last')

print("With NaN values first:")
print(sort_with_na_first[['Name', 'Age']].head())

print("\nWith NaN values last:")
print(sort_with_na_last[['Name', 'Age']].tail())

With NaN values first:
                             Name  Age
5                Moran, Mr. James  NaN
17   Williams, Mr. Charles Eugene  NaN
19        Masselmani, Mrs. Fatima  NaN
26        Emir, Mr. Farred Chehab  NaN
28  O'Dwyer, Miss. Ellen "Nellie"  NaN

With NaN values last:
                                         Name  Age
859                          Razi, Mr. Raihed  NaN
863         Sage, Miss. Dorothy Edith "Dolly"  NaN
868               van Melkebeke, Mr. Philemon  NaN
878                        Laleff, Mr. Kristo  NaN
888  Johnston, Miss. Catherine Helen "Carrie"  NaN


## 6. iloc Method - Integer-based Indexing

The `iloc` accessor provides integer-based indexing for selection by position.

### Example 1: Selecting Rows by Position

In [24]:
# Select first 5 rows
first_five = df.iloc[:5]
print("First 5 passengers:")
print(first_five[['Name', 'Age', 'Sex', 'Survived']])

# Select last 3 rows
last_three = df.iloc[-3:]
print("\nLast 3 passengers:")
print(last_three[['Name', 'Age', 'Sex', 'Survived']])

First 5 passengers:
                                                Name   Age     Sex  Survived
0                            Braund, Mr. Owen Harris  22.0    male         0
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0  female         1
2                             Heikkinen, Miss. Laina  26.0  female         1
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0  female         1
4                           Allen, Mr. William Henry  35.0    male         0

Last 3 passengers:
                                         Name   Age     Sex  Survived
888  Johnston, Miss. Catherine Helen "Carrie"   NaN  female         0
889                     Behr, Mr. Karl Howell  26.0    male         1
890                       Dooley, Mr. Patrick  32.0    male         0


### Example 2: Selecting Specific Rows and Columns

In [25]:
# Select specific rows and columns by position
# Rows 10-15, columns 1-4
subset = df.iloc[10:16, 1:5]
print("Rows 10-15, columns 1-4:")
print(subset)

# Select every 100th row
every_100th = df.iloc[::100]
print("\nEvery 100th passenger:")
print(every_100th[['Name', 'Age', 'Sex']])

Rows 10-15, columns 1-4:
    Survived  Pclass                                  Name     Sex
10         1       3       Sandstrom, Miss. Marguerite Rut  female
11         1       1              Bonnell, Miss. Elizabeth  female
12         0       3        Saundercock, Mr. William Henry    male
13         0       3           Andersson, Mr. Anders Johan    male
14         0       3  Vestrom, Miss. Hulda Amanda Adolfina  female
15         1       2      Hewlett, Mrs. (Mary D Kingcome)   female

Every 100th passenger:
                                                  Name   Age     Sex
0                              Braund, Mr. Owen Harris  22.0    male
100                            Petranec, Miss. Matilda  28.0  female
200                     Vande Walle, Mr. Nestor Cyriel  28.0    male
300           Kelly, Miss. Anna Katherine "Annie Kate"   NaN  female
400                                 Niskanen, Mr. Juha  39.0    male
500                                   Calic, Mr. Petar  17.0    male

### Example 3: Selecting Random Samples

In [26]:
# Select random rows using iloc
import random
random.seed(42)

random_indices = random.sample(range(len(df)), 5)
random_passengers = df.iloc[random_indices]

print("5 random passengers:")
print(random_passengers[['Name', 'Age', 'Sex', 'Pclass', 'Survived']])

5 random passengers:
                                                  Name   Age     Sex  Pclass  \
654                       Hegarty, Miss. Hanora "Nora"  18.0  female       3   
114                              Attalah, Miss. Malake  17.0  female       3   
25   Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...  38.0  female       3   
759  Rothes, the Countess. of (Lucy Noel Martha Dye...  33.0  female       1   
281                   Olsson, Mr. Nils Johan Goransson  28.0    male       3   

     Survived  
654         0  
114         0  
25          1  
759         1  
281         0  
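
pandas also has a built-in way to draw random rows, `DataFrame.sample()`, which avoids building the index list by hand. The sketch below uses an invented `toy` frame; with the same `random_state` the draw is reproducible, although it will not pick the same rows as `random.sample` with seed 42.

```python
import pandas as pd

# Hypothetical toy data: 20 passenger ids
toy = pd.DataFrame({"PassengerId": range(1, 21)})

# random_state plays the role that random.seed plays for the random module
picked = toy.sample(n=5, random_state=42)
picked_again = toy.sample(n=5, random_state=42)

print(picked)
print(picked.equals(picked_again))
```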


### Example 4: Boolean Indexing with iloc

In [27]:
# Filter with a boolean mask first, then take the first 10 matches by position
survivors = df[df['Survived'] == 1]
first_10_survivors = survivors.iloc[:10]

print("First 10 survivors in the dataset:")
print(first_10_survivors[['Name', 'Age', 'Sex', 'Pclass']])

First 10 survivors in the dataset:
                                                 Name   Age     Sex  Pclass
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0  female       1
2                              Heikkinen, Miss. Laina  26.0  female       3
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0  female       1
8   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  27.0  female       3
9                 Nasser, Mrs. Nicholas (Adele Achem)  14.0  female       2
10                    Sandstrom, Miss. Marguerite Rut   4.0  female       3
11                           Bonnell, Miss. Elizabeth  58.0  female       1
15                   Hewlett, Mrs. (Mary D Kingcome)   55.0  female       2
17                       Williams, Mr. Charles Eugene   NaN    male       2
19                            Masselmani, Mrs. Fatima   NaN  female       3


## 7. loc Method - Label-based Indexing

The `loc` accessor provides label-based indexing for selection by labels/conditions.

### Example 1: Selecting by Index Labels

In [28]:
# Select specific rows by index
specific_passengers = df.loc[0:4, ['Name', 'Age', 'Sex', 'Survived']]
print("Passengers with index 0-4:")
print(specific_passengers)

Passengers with index 0-4:
                                                Name   Age     Sex  Survived
0                            Braund, Mr. Owen Harris  22.0    male         0
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0  female         1
2                             Heikkinen, Miss. Laina  26.0  female         1
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0  female         1
4                           Allen, Mr. William Henry  35.0    male         0
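
One detail that is easy to miss: a `loc` slice includes its end label, whereas an `iloc` slice excludes its end position. With the default `RangeIndex` the two look almost alike, so the sketch below uses an invented `toy` frame whose index labels differ from the row positions.

```python
import pandas as pd

# Hypothetical toy frame whose index labels differ from row positions
toy = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0, 35.0]},
    index=[10, 11, 12, 13],
)

by_label = toy.loc[10:12]     # labels 10, 11 and 12 (end label included)
by_position = toy.iloc[0:2]   # positions 0 and 1 (end position excluded)

print(len(by_label), len(by_position))
```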


### Example 2: Boolean Indexing with loc

In [29]:
# Select all female passengers
female_passengers = df.loc[df['Sex'] == 'female', ['Name', 'Age', 'Pclass', 'Survived']]
print(f"Total female passengers: {len(female_passengers)}")
print("\nFirst 10 female passengers:")
print(female_passengers.head(10))

Total female passengers: 314

First 10 female passengers:
                                                 Name   Age  Pclass  Survived
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0       1         1
2                              Heikkinen, Miss. Laina  26.0       3         1
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0       1         1
8   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  27.0       3         1
9                 Nasser, Mrs. Nicholas (Adele Achem)  14.0       2         1
10                    Sandstrom, Miss. Marguerite Rut   4.0       3         1
11                           Bonnell, Miss. Elizabeth  58.0       1         1
14               Vestrom, Miss. Hulda Amanda Adolfina  14.0       3         0
15                   Hewlett, Mrs. (Mary D Kingcome)   55.0       2         1
18  Vander Planke, Mrs. Julius (Emelia Maria Vande...  31.0       3         0


### Example 3: Complex Boolean Conditions

In [30]:
# Select first class passengers who survived
first_class_survivors = df.loc[
    (df['Pclass'] == 1) & (df['Survived'] == 1), 
    ['Name', 'Age', 'Sex', 'Fare']
]
print(f"First class survivors: {len(first_class_survivors)}")
print("\nSample of first class survivors:")
print(first_class_survivors.head())

First class survivors: 136

Sample of first class survivors:
                                                 Name   Age     Sex      Fare
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0  female   71.2833
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0  female   53.1000
11                           Bonnell, Miss. Elizabeth  58.0  female   26.5500
23                       Sloper, Mr. William Thompson  28.0    male   35.5000
31     Spencer, Mrs. William Augustus (Marie Eugenie)   NaN  female  146.5208


### Example 4: Modifying Data with loc

In [31]:
# Create a copy to avoid modifying original data
df_copy = df.copy()

# Create age groups using loc
df_copy.loc[df_copy['Age'] < 18, 'Age_Group'] = 'Child'
df_copy.loc[(df_copy['Age'] >= 18) & (df_copy['Age'] < 65), 'Age_Group'] = 'Adult'
df_copy.loc[df_copy['Age'] >= 65, 'Age_Group'] = 'Senior'

print("Age group distribution:")
print(df_copy['Age_Group'].value_counts())

print("\nSurvival rate by age group:")
print(df_copy.groupby('Age_Group')['Survived'].mean())

Age group distribution:
Age_Group
Adult     590
Child     113
Senior     11
Name: count, dtype: int64

Survival rate by age group:
Age_Group
Adult     0.386441
Child     0.539823
Senior    0.090909
Name: Survived, dtype: float64
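
The same three age groups can also be produced in one step with `pd.cut` and explicit bin edges. This is a sketch on invented ages rather than a drop-in replacement: the result is a categorical column instead of plain strings, and `right=False` is needed so that the boundaries match the `loc` conditions above (18 counts as Adult, 65 as Senior).

```python
import pandas as pd

# Hypothetical ages, including the boundary values and one missing entry
toy = pd.DataFrame({"Age": [4.0, 17.0, 18.0, 64.0, 65.0, None]})

# right=False makes each bin closed on the left: [0, 18), [18, 65), [65, inf)
toy["Age_Group"] = pd.cut(
    toy["Age"],
    bins=[0, 18, 65, float("inf")],
    labels=["Child", "Adult", "Senior"],
    right=False,
)

print(toy)
```

As with the `loc` version, rows with a missing age get no group.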


## 8. Additional Pandas Methods

Let's explore more essential pandas methods for data analysis.

### Value_counts - Frequency Analysis

In [32]:
# Analyze categorical variables
print("Passenger class distribution:")
print(df['Pclass'].value_counts())

print("\nGender distribution:")
print(df['Sex'].value_counts())

print("\nEmbarkation port distribution:")
print(df['Embarked'].value_counts(dropna=False))

# Normalized counts (percentages)
print("\nSurvival rate (as percentages):")
print(df['Survived'].value_counts(normalize=True) * 100)

Passenger class distribution:
Pclass
3    491
1    216
2    184
Name: count, dtype: int64

Gender distribution:
Sex
male      577
female    314
Name: count, dtype: int64

Embarkation port distribution:
Embarked
S      644
C      168
Q       77
NaN      2
Name: count, dtype: int64

Survival rate (as percentages):
Survived
0    61.616162
1    38.383838
Name: proportion, dtype: float64


### Crosstab - Cross Tabulation

In [33]:
# Create cross-tabulations
# (a new variable name, so the groupby result from Section 2 is not overwritten)
class_gender_crosstab = pd.crosstab(df['Pclass'], [df['Sex'], df['Survived']], margins=True)
print("Cross-tabulation: Class vs Gender and Survival:")
print(class_gender_crosstab)

# Normalized crosstab (percentages)
print("\nNormalized cross-tabulation (percentages):")
normalized_crosstab = pd.crosstab(df['Sex'], df['Survived'], normalize='index') * 100
print(normalized_crosstab.round(1))

Cross-tabulation: Class vs Gender and Survival:
Sex      female      male       All
Survived      0    1    0    1     
Pclass                             
1             3   91   77   45  216
2             6   70   91   17  184
3            72   72  300   47  491
All          81  233  468  109  891

Normalized cross-tabulation (percentages):
Survived     0     1
Sex                 
female    25.8  74.2
male      81.1  18.9


### Pivot_table - Summarizing Data

In [34]:
# Create pivot tables for analysis
pivot_survival = pd.pivot_table(df, 
                               values='Survived', 
                               index='Pclass', 
                               columns='Sex', 
                               aggfunc='mean')
print("Pivot table: Survival rate by class and gender:")
print(pivot_survival.round(3))

# More complex pivot table
pivot_complex = pd.pivot_table(df, 
                              values=['Survived', 'Age', 'Fare'], 
                              index='Pclass', 
                              columns='Sex', 
                              aggfunc={'Survived': 'mean', 'Age': 'mean', 'Fare': 'median'})
print("\nComplex pivot table:")
print(pivot_complex.round(2))

Pivot table: Survival rate by class and gender:
Sex     female   male
Pclass               
1        0.968  0.369
2        0.921  0.157
3        0.500  0.135

Complex pivot table:
          Age          Fare        Survived      
Sex    female   male female   male   female  male
Pclass                                           
1       34.61  41.28  82.66  41.26     0.97  0.37
2       28.72  30.74  22.00  13.00     0.92  0.16
3       21.75  26.51  12.48   7.92     0.50  0.14
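
The first pivot table contains the same numbers as the two-column `groupby` from Section 2, only laid out with one column per gender. The sketch below checks that relationship on an invented `toy` frame by reshaping the grouped result with `unstack()`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "male", "male", "female", "female", "male"],
    "Survived": [1, 0, 1, 1, 0, 0],
})

pivoted = pd.pivot_table(toy, values="Survived", index="Pclass",
                         columns="Sex", aggfunc="mean")

# Group by both keys, then move the inner key (Sex) from the index to the columns
unstacked = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack()

print(pivoted)
print(unstacked)
```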


### Cut and qcut - Binning Data

In [None]:
# Create age bins using cut (equal-width bins)
df_copy['Age_Bins'] = pd.cut(df_copy['Age'], bins=5, labels=['Very Young', 'Young', 'Middle', 'Older', 'Senior'])
print("Age bins distribution (equal-width):")
print(df_copy['Age_Bins'].value_counts())

# Create fare quantiles using qcut (equal-frequency bins)
df_copy['Fare_Quantiles'] = pd.qcut(df_copy['Fare'], q=4, labels=['Low', 'Medium-Low', 'Medium-High', 'High'])
print("\nFare quantiles distribution (equal-frequency):")
print(df_copy['Fare_Quantiles'].value_counts())

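# Analyze survival by fare quantiles
# Fare_Quantiles is categorical; passing observed explicitly avoids the
# FutureWarning that recent pandas versions raise for categorical groupers
print("\nSurvival rate by fare quantiles:")
print(df_copy.groupby('Fare_Quantiles', observed=True)['Survived'].mean().round(3))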
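
The difference between the two binning functions shows most clearly on skewed data. In the sketch below, the `fares` values are invented to mimic a few cheap tickets and one very expensive one. `retbins=True` returns the bin edges alongside the binned values so they can be inspected.

```python
import pandas as pd

# Hypothetical skewed fares: many cheap tickets, one very expensive one
fares = pd.Series([5, 7, 8, 10, 12, 15, 30, 500], dtype="float64")

# cut: four bins of equal width across the range of the data
equal_width, width_edges = pd.cut(fares, bins=4, retbins=True)

# qcut: four bins holding (roughly) the same number of values
equal_freq, freq_edges = pd.qcut(fares, q=4, retbins=True)

print(width_edges)
print(equal_width.value_counts(sort=False))
print(freq_edges)
print(equal_freq.value_counts(sort=False))
```

With equal-width bins nearly every value lands in the lowest bin, while the quantile-based bins spread the values evenly at the cost of very unequal bin widths.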

### Apply and Map - Custom Functions

In [None]:
# Extract titles from names using apply
def extract_title(name):
    return name.split(',')[1].split('.')[0].strip()

df_copy['Title'] = df_copy['Name'].apply(extract_title)
print("Titles extracted from names:")
print(df_copy['Title'].value_counts())

# Map titles to broader categories
title_mapping = {
    'Mr': 'Mr',
    'Mrs': 'Mrs',
    'Miss': 'Miss',
    'Master': 'Master',
    'Dr': 'Officer',
    'Rev': 'Officer',
    'Col': 'Officer',
    'Major': 'Officer',
    'Mlle': 'Miss',
    'the Countess': 'Mrs',  # extract_title() returns 'the Countess', not 'Countess'
    'Ms': 'Miss',
    'Lady': 'Mrs',
    'Jonkheer': 'Officer',
    'Don': 'Mr',
    'Dona': 'Mrs',
    'Mme': 'Mrs',
    'Capt': 'Officer',
    'Sir': 'Mr'
}

df_copy['Title_Group'] = df_copy['Title'].map(title_mapping)
print("\nGrouped titles:")
print(df_copy['Title_Group'].value_counts())

print("\nSurvival rate by title group:")
print(df_copy.groupby('Title_Group')['Survived'].mean().round(3))
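
`map()` returns `NaN` for any value that is missing from the dictionary, so it is worth checking which titles fell through. The sketch below uses three names (the third shortened for illustration), a vectorized `str.extract` in place of `apply`, and a deliberately incomplete, hypothetical `mapping` to show the check.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of",  # shortened for illustration
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical, deliberately incomplete mapping
mapping = {"Mr": "Mr", "Miss": "Miss"}
groups = titles.map(mapping)

# Titles missing from the mapping come back as NaN, so list them explicitly
unmapped = titles[groups.isna()].unique().tolist()
print(unmapped)
```

An empty list here would mean that every extracted title has a group.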

### String Methods - Text Processing

In [None]:
# Working with string data
print("Names containing 'William':")
williams = df[df['Name'].str.contains('William', na=False)]
print(williams[['Name', 'Age', 'Survived']])

# Note: this is a substring match, so surnames such as 'Williams' are included
print(f"\nPassengers whose name contains 'William': {len(williams)}")
print(f"Survival rate for these passengers: {williams['Survived'].mean():.3f}")

# Extract cabin deck (first letter of cabin)
df_copy['Cabin_Deck'] = df_copy['Cabin'].str[0]
print("\nCabin deck distribution:")
print(df_copy['Cabin_Deck'].value_counts(dropna=False))
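
The `str.contains('William')` filter above is a plain substring match, so it also picks up the surname "Williams". If only the standalone name is wanted, a regular expression with word boundaries (`\b`) is one way to narrow it down, sketched here on three names from the dataset.

```python
import pandas as pd

names = pd.Series([
    "Allen, Mr. William Henry",
    "Williams, Mr. Charles Eugene",
    "Sloper, Mr. William Thompson",
])

# Substring match: also hits "Williams"
substring_hits = names.str.contains("William", na=False)

# Whole-word match: \b requires a word boundary on both sides
whole_word_hits = names.str.contains(r"\bWilliam\b", na=False)

print(substring_hits.sum(), whole_word_hits.sum())
```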

### Merge and Concat - Combining DataFrames

In [None]:
# Create separate dataframes for demonstration
passengers_basic = df[['PassengerId', 'Name', 'Sex', 'Age']].head(5)
passengers_travel = df[['PassengerId', 'Pclass', 'Fare', 'Embarked']].head(5)

# Merge dataframes
merged_data = pd.merge(passengers_basic, passengers_travel, on='PassengerId')
print("Merged passenger data:")
print(merged_data)

# Concatenate dataframes (first and last five rows, stacked vertically)
first_rows = df.head(5)
last_rows = df.tail(5)
concatenated = pd.concat([first_rows, last_rows], ignore_index=True)
print(f"\nConcatenated dataframe shape: {concatenated.shape}")
print("First few rows of concatenated data:")
print(concatenated[['Name', 'Age', 'Survived']].head())
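
In the merge above both frames contain the same five passengers, so every row finds a match. When the keys only partly overlap, the `how` parameter decides which rows survive, and `indicator=True` adds a `_merge` column that records where each row came from. The sketch below uses two invented frames, `basic` and `travel`.

```python
import pandas as pd

# Hypothetical frames whose PassengerId values only partly overlap
basic = pd.DataFrame({"PassengerId": [1, 2, 3], "Sex": ["male", "female", "female"]})
travel = pd.DataFrame({"PassengerId": [2, 3, 4], "Fare": [71.28, 7.93, 53.10]})

# Default (how="inner"): only ids present in both frames
inner = pd.merge(basic, travel, on="PassengerId")

# how="outer" keeps unmatched rows too; indicator=True labels their origin
outer = pd.merge(basic, travel, on="PassengerId", how="outer", indicator=True)

print(inner)
print(outer)
```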

## 9. Data Visualization with Pandas

Pandas integrates well with matplotlib for quick visualizations.

In [None]:
# Set up matplotlib
plt.style.use('default')
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Survival by class
df.groupby('Pclass')['Survived'].mean().plot(kind='bar', ax=axes[0,0], title='Survival Rate by Class')
axes[0,0].set_ylabel('Survival Rate')
axes[0,0].set_xlabel('Passenger Class')

# Plot 2: Age distribution
df['Age'].hist(bins=30, ax=axes[0,1])
axes[0,1].set_title('Age Distribution')  # Series.hist() does not accept a title argument
axes[0,1].set_xlabel('Age')
axes[0,1].set_ylabel('Frequency')

# Plot 3: Fare distribution by class
df.boxplot(column='Fare', by='Pclass', ax=axes[1,0])
fig.suptitle('')  # remove the automatic "Boxplot grouped by Pclass" figure title
axes[1,0].set_title('Fare Distribution by Class')
axes[1,0].set_xlabel('Passenger Class')

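# Plot 4: Survival by gender
# (bar chart: the two rates are not shares of one whole, so a pie chart would mislead)
survival_by_gender = df.groupby('Sex')['Survived'].mean()
survival_by_gender.plot(kind='bar', ax=axes[1,1], title='Survival Rate by Gender')
axes[1,1].set_ylabel('Survival Rate')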

plt.tight_layout()
plt.show()

print("Visualizations created successfully!")

## 10. Summary and Key Insights

Let's summarize our analysis with key findings about the Titanic dataset.

In [None]:
# Final summary statistics
print("=== TITANIC DATASET ANALYSIS SUMMARY ===")
print(f"\nTotal passengers: {len(df)}")
print(f"Overall survival rate: {df['Survived'].mean():.1%}")

print("\n1. Survival by Gender:")
gender_survival = df.groupby('Sex')['Survived'].agg(['count', 'sum', 'mean'])
gender_survival.columns = ['Total', 'Survivors', 'Survival_Rate']
print(gender_survival)

print("\n2. Survival by Class:")
class_survival = df.groupby('Pclass')['Survived'].agg(['count', 'sum', 'mean'])
class_survival.columns = ['Total', 'Survivors', 'Survival_Rate']
print(class_survival)

print("\n3. Age Statistics by Survival:")
age_by_survival = df.groupby('Survived')['Age'].agg(['count', 'mean', 'median', 'std'])
age_by_survival.index = ['Died', 'Survived']
print(age_by_survival.round(2))

print("\n4. Key Insights:")
print("   • Women had a much higher survival rate than men (74.2% vs 18.9%)")
print("   • First-class passengers had the highest survival rate (63.0%)")
print("   • Children under 18 survived more often than adults aged 18-64 (54.0% vs 38.6%)")
print("   • Higher fare generally correlated with better survival chances")
print("\n=== END OF ANALYSIS ===")

## Conclusion

This notebook demonstrated essential pandas methods using the Titanic dataset:

- **`groupby()`**: Powerful for aggregating data by categories
- **`query()`**: Intuitive filtering with string expressions
- **`agg()`**: Applying multiple aggregation functions
- **`sort_values()`**: Sorting data by one or multiple columns
- **`iloc`**: Integer-based indexing for position-based selection
- **`loc`**: Label-based indexing, including selection with boolean conditions
- **Additional methods**: `value_counts()`, `crosstab()`, `pivot_table()`, `cut()`, `qcut()`, `apply()`, `map()`, string methods, and more

These methods form the foundation of data analysis with pandas and can be combined in countless ways to extract insights from your data. The Titanic dataset provided an excellent real-world example to practice these techniques.

### Next Steps:
1. Try applying these methods to your own datasets
2. Experiment with combining different methods
3. Explore more advanced pandas functionality like `resample()` and `rolling()`
4. Practice with time series data and multi-index dataframes