In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
ar = np.array([2, 4, 5, 6, 2, 4, 5, 6, 3, 5, 6])

In [None]:
np.sum(ar)

In [None]:
len(ar)

**Measures of Central Tendency**

A measure of central tendency (also called a measure of centre or central location) is a summary measure that describes a whole dataset with a single value representing the middle or centre of its distribution. There are three main measures of central tendency: mean, median, and mode.

Mean = sum of all values / number of values

In [None]:
mean=np.sum(ar)/len(ar)
mean

In [None]:
np.sort(ar)

In [None]:
np.mean(ar)

***1. What is the typical age of a passenger?***

In [None]:
df=pd.read_csv("Titanic-Dataset.csv")
df.head(3)

In [None]:
df["Age"].mean()

In [None]:
mn=np.mean(df["Age"])
mn

In [None]:
# Histogram of ages in 10-year bins, with the mean marked by a vertical line
sn.histplot(data=df, x="Age", bins=list(range(0, 81, 10)))
plt.axvline(mn, c="orange")
plt.show()

The mean age is about 29.7, so the orange line falls in the 20-30 bin.

**Median**

The median is the middle value of a dataset when the values are arranged in order.
With an even number of values, the median is the average of the two middle values.
Because it depends only on the middle of the ordered data, the median is much less influenced by extreme values or outliers than the mean, so it is often the better measure of the central value when outliers are present.
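To make the effect of an outlier concrete, here is a minimal sketch using Python's built-in `statistics` module. The ages are made-up values for illustration, not taken from the Titanic data.

```python
import statistics

# Hypothetical ages (illustration only, not from the Titanic dataset)
ages = [22, 24, 26, 28, 30]
ages_with_outlier = ages + [80]   # same ages plus one extreme value

mean_before = statistics.mean(ages)                   # 26
median_before = statistics.median(ages)               # 26
mean_after = statistics.mean(ages_with_outlier)       # 35
median_after = statistics.median(ages_with_outlier)   # (26 + 28) / 2 = 27.0

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

One extreme value moves the mean from 26 to 35, while the median only moves from 26 to 27. The second list has an even number of values, so its median is the average of the two middle values.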



In [None]:
# Fill missing ages with the mean; assign back instead of using inplace=True on a column
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Age"].isnull().sum()

In [None]:
np.median(df["Age"])

In [None]:
md= df["Age"].median()
md

In [None]:
# Histogram of ages in 10-year bins, with the median marked by a vertical line
sn.histplot(data=df, x="Age", bins=list(range(0, 81, 10)))
plt.axvline(md, c="orange")
plt.show()

*Therefore, we can say that the typical age of the passengers on the Titanic was around 29 to 30 years.*

**Mode**

The mode is the value that occurs most often in a dataset (the most frequent value). A dataset can have more than one mode if several values share the highest count.
The mode is particularly useful for categorical data, where we want to know which category occurs most often.
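As a small illustration on categorical data, the sketch below finds the most frequent value with the standard library. The port codes are hypothetical sample values, not the real passenger list.

```python
from collections import Counter
import statistics

# Hypothetical embarkation codes (illustration only)
ports = ["S", "C", "S", "Q", "S", "C"]

most_common_port = statistics.mode(ports)   # the single most frequent value
port_counts = Counter(ports)                # how often each value occurs

# When several values tie for the highest count, multimode returns all of them
tied_modes = statistics.multimode([1, 1, 2, 2, 3])

print(most_common_port)      # S
print(port_counts["S"])      # 3
print(tied_modes)            # [1, 2]
```

On a pandas column, `Series.mode()` plays the same role and likewise returns every tied value.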



In [None]:
mode= df['Embarked'].mode()
mode

***2. Which port did most passengers on the Titanic embark from?***

In [None]:
# Count the frequency of each embarkation port
embarked_counts = df['Embarked'].value_counts()
embarked_counts

In [None]:
sn.set_style("whitegrid")
colors = sn.color_palette("pastel")

plt.figure(figsize=(6, 4))
sn.barplot(x=embarked_counts.index, y=embarked_counts.values, palette=colors)

plt.title("Passenger Count by Embarkation Port", fontsize=14)
plt.xlabel("Embarkation Port")
plt.ylabel("Number of Passengers")

# Show exact numbers on top of bars
for i, value in enumerate(embarked_counts.values):
    plt.text(i, value + 5, str(value), ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

*The most frequent embarkation port was 'S', which stands for Southampton.*



**Summary:**

Mean, median, and mode share one purpose: each measures the central tendency of a dataset, describing what a “typical” or “central” value looks like.
However, they are used in different situations because each one responds differently to the distribution of the data:

*   Mean: Best for numerical data without outliers. It considers all values but is sensitive to extreme values.
*   Median: Best for numerical data with outliers or skewed distributions. It shows the middle value and is much less affected by outliers.
*   Mode: Best for categorical data or when you want to find the most frequent value in any type of data.

So, while their purpose is the same (describing central tendency), the measure you choose depends on the nature and distribution of the data.











**Measures of Variability (Dispersion, Spread)**

Measures of dispersion tell us how spread out or scattered the data is.

Even if two datasets have the same average (mean), they can be very different in how their values are spread. Measures of dispersion help us understand that difference.
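A minimal sketch of this idea, with two made-up datasets that share the same mean but are spread very differently. The values are assumptions chosen for illustration.

```python
import statistics

# Two hypothetical datasets with the same mean (illustration only)
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

mean_tight = statistics.mean(tight)     # 50
mean_wide = statistics.mean(wide)       # 50

# Population standard deviation: typical distance of the values from the mean
spread_tight = statistics.pstdev(tight)
spread_wide = statistics.pstdev(wide)

print(mean_tight, mean_wide)
print(f"{spread_tight:.2f} vs {spread_wide:.2f}")   # 1.41 vs 28.28
```

The means are identical, so only a measure of dispersion reveals that the second dataset is far more scattered.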

**Range**

Range is the difference between the maximum and minimum values in the dataset: Range = max - min. It provides a simple measure of the spread of the data, but it has some limitations:

*   Sensitive to outliers: If there's one very high or low value, the range can become misleading.
*   Doesn’t show distribution: It only uses the minimum and maximum, ignoring everything in between.
*   Better for small datasets: In small datasets, it gives a quick idea of spread. In large datasets, it can be misleading.
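The outlier sensitivity is easy to see on a tiny example. This is a sketch with hypothetical ages; `value_range` is a helper name introduced here for illustration.

```python
def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

# Hypothetical ages (illustration only)
typical = [22, 25, 27, 30, 31]
with_outlier = typical + [80]     # one extreme value added

range_typical = value_range(typical)            # 31 - 22 = 9
range_with_outlier = value_range(with_outlier)  # 80 - 22 = 58

print(range_typical, range_with_outlier)
```

A single added value changes the range from 9 to 58, even though the other five values are unchanged.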










***3. In the Titanic dataset, what age range of people were traveling?***

In [None]:
min_r = df['Age'].min()
min_r

In [None]:
max_r = df['Age'].max()
max_r

In [None]:
age_range_value = max_r - min_r
age_range_value

In [None]:
sn.set_style("whitegrid")
plt.figure(figsize=(6, 4))

plt.bar(['Min Age', 'Max Age', 'Range'], [min_r, max_r, age_range_value], color=sn.color_palette("pastel"))

for i, val in enumerate([min_r, max_r, age_range_value]):
    plt.text(i, val + 2, f'{val:.1f}', ha='center', va='bottom')

plt.title('Titanic Passenger Age Range')
plt.ylabel('Age')
plt.tight_layout()
plt.show()

This shows that people of all age groups from infants to elderly were on board the Titanic.


*   The youngest passenger was approximately 0.42 years old.
*   The oldest passenger was 80 years old.
*   This gives us an age range of about 79.6 years.











**Quartiles**

Quartiles divide a sorted dataset into 4 equal parts:

*   Q1 (First Quartile) → 25% of the data falls below this value
*   Q2 (Second Quartile / Median) → 50% of the data falls below this value
*   Q3 (Third Quartile) → 75% of the data falls below this value










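For a sorted dataset of n values, the quartiles sit at these positions (counting from 1):

*   Q1 position = (n + 1) × 1/4
*   Q2 (Median) position = (n + 1) × 2/4
*   Q3 position = (n + 1) × 3/4

If a position is not a whole number, the quartile is interpolated between the two neighbouring values.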
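The position formulas can be sketched directly in Python. The helper `value_at_position` is a name made up for this illustration, and it assumes the position lies between 1 and n. The data is the small array `ar` from the start of this notebook, with n = 11.

```python
import statistics

def value_at_position(sorted_values, position):
    """Value at a 1-based position; a fractional position is interpolated
    linearly between its two neighbours. Assumes 1 <= position <= n."""
    lower = int(position)                      # whole part of the position
    fraction = position - lower                # fractional part
    upper = min(lower + 1, len(sorted_values))
    low_val = sorted_values[lower - 1]
    high_val = sorted_values[upper - 1]
    return low_val + fraction * (high_val - low_val)

data = sorted([2, 4, 5, 6, 2, 4, 5, 6, 3, 5, 6])
n = len(data)                                  # 11

q1 = value_at_position(data, (n + 1) * 1 / 4)  # position 3 -> 3.0
q2 = value_at_position(data, (n + 1) * 2 / 4)  # position 6 -> 5.0
q3 = value_at_position(data, (n + 1) * 3 / 4)  # position 9 -> 6.0
print(q1, q2, q3)

# The standard library's default ("exclusive") method uses the same (n + 1) rule
exclusive = statistics.quantiles(data, n=4)
# The "inclusive" method uses an (n - 1)-based rule and can differ slightly
inclusive = statistics.quantiles(data, n=4, method="inclusive")
print(exclusive, inclusive)   # [3.0, 5.0, 6.0] [3.5, 5.0, 5.5]
```

Several quartile conventions exist. The default linear interpolation in pandas' `quantile` follows the (n - 1)-based rule, so its results can differ a little from the (n + 1) formula on small datasets.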






***4. What are the age quartiles of Titanic passengers, and what do they tell us about age distribution?***

In [None]:
ages = df['Age']  # the column is named 'Age' in this CSV

# Calculate quartiles
Q1 = ages.quantile(0.25)
Q2 = ages.quantile(0.50)
Q3 = ages.quantile(0.75)


In [None]:
print(f"Q1 (25th percentile): {Q1:.1f}")
print(f"Q2 (Median): {Q2:.1f}")
print(f"Q3 (75th percentile): {Q3:.1f}")

In [None]:
sn.set_style("whitegrid")
plt.figure(figsize=(6, 3))
sn.boxplot(x=ages, color='lightblue')

plt.axvline(Q1, color='green', linestyle='--', label='Q1 (25%)')
plt.axvline(Q2, color='orange', linestyle='--', label='Q2 (Median)')
plt.axvline(Q3, color='red', linestyle='--', label='Q3 (75%)')

# Labels and legend
plt.title('Age Quartiles of Titanic Passengers')
plt.xlabel('Age')
plt.legend()
plt.tight_layout()
plt.show()

I analyzed the Age column in the Titanic dataset to find its quartiles, which help me understand how the ages are spread across the passengers:

*   Q1 (25%): Around 20 years
*   Q2 (Median): Around 28 years
*   Q3 (75%): Around 38 years

This means:

*   25% of passengers were younger than 20
*   50% were younger than 28
*   75% were younger than 38






The boxplot shows that most passengers were young to middle-aged, and it also helps identify any outliers, like very old or very young passengers.



**Percentile**

A percentile is a number that tells you what percentage of the data falls below a certain value. Percentiles divide a dataset into 100 equal parts. The p-th percentile is the value below which p% of the data falls. In a sorted dataset of n values, its position is approximately p/100 × (n + 1).

Percentiles are useful for understanding where a value stands relative to the whole dataset.


For example:


*   25th percentile → 25% of values are below it (same as Q1)
*   50th percentile → 50% of values are below it (median)
*   90th percentile → 90% of values are below it

Percentiles are used when we want to:


*   Understand position: Where a particular value lies compared to the rest of the data.
*   Analyze distributions: Especially useful for skewed data or when data is not evenly spread.
*   Detect outliers: By comparing values to the 25th (Q1) and 75th (Q3) percentiles.
*   Summarize large data: Without looking at every individual value.
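To illustrate the "understand position" use, the sketch below computes what percentage of a dataset lies below a given value, and where the 90th percentile sits using the p/100 × (n + 1) position rule. The ages and the helper name `percent_below` are assumptions made for this example.

```python
def percent_below(values, x):
    """Percentage of values that are strictly below x."""
    return 100 * sum(v < x for v in values) / len(values)

# Hypothetical ages, already sorted (illustration only)
ages_sample = [4, 18, 21, 22, 25, 28, 30, 35, 40, 62]
n = len(ages_sample)                       # 10

# Where does a 30-year-old stand? 6 of the 10 values are below 30.
rank_of_30 = percent_below(ages_sample, 30)
print(rank_of_30)                          # 60.0

# Position of the 90th percentile in the sorted data (counting from 1)
p = 90
position = p / 100 * (n + 1)
print(round(position, 1))                  # 9.9, between the 9th and 10th values
```

A position of 9.9 means the 90th percentile lies between the 9th value (40) and the 10th value (62), close to the 10th.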




















***5. What are the 25th, 50th, and 90th percentiles of passenger age in the Titanic dataset?***

In [None]:
ages = df['Age']  # the column is named 'Age' in this CSV

# Calculate percentiles
p25 = ages.quantile(0.25)
p50 = ages.quantile(0.50)
p90 = ages.quantile(0.90)

print(f"25th Percentile: {p25:.1f}")
print(f"50th Percentile (Median): {p50:.1f}")
print(f"90th Percentile: {p90:.1f}")

In [None]:
sn.set_style("whitegrid")
plt.figure(figsize=(7, 2.5))
sn.boxplot(x=ages, color='lightgreen')

# Add percentile lines
plt.axvline(p25, color='green', linestyle='--', label='25th Percentile (Q1)')
plt.axvline(p50, color='orange', linestyle='--', label='50th Percentile (Median)')
plt.axvline(p90, color='red', linestyle='--', label='90th Percentile')

plt.title('Age Percentiles (Boxplot) of Titanic Passengers')
plt.xlabel('Age')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()

I analyzed the age of Titanic passengers to understand how age values are spread using percentiles.



*   25th percentile: About 20 years → 25% of passengers were younger than 20
*   50th percentile (Median): About 28 years → 50% of passengers were younger than 28
*   90th percentile: About 51 years → 90% of passengers were younger than 51










The boxplot helps me see that most passengers were young adults, with fewer very old passengers.


**IQR**