
# Exploratory Data Analysis (EDA) on Titanic Dataset

This notebook performs **Task 2: EDA** on the Titanic dataset using pandas, Matplotlib, and seaborn.  
We'll generate summary statistics, visualize distributions and correlations, and identify patterns in the data.


In [None]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
file_path = "Titanic-Dataset.csv"
titanic_df = pd.read_csv(file_path)

titanic_df.head()


In [None]:

# Dataset info
titanic_df.info()


In [None]:

# Summary statistics
titanic_df.describe(include="all")


In [None]:

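# Histograms for numeric features (PassengerId is an identifier, so it is excluded)
numeric_df = titanic_df.drop(columns=["PassengerId"], errors="ignore")
numeric_df.hist(bins=20, figsize=(14, 10))
plt.suptitle("Histograms of Numeric Features")
plt.tight_layout()
plt.show()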


In [None]:

# Boxplots for numeric features, one panel each because the scales differ widely
titanic_df[["Age", "Fare", "SibSp", "Parch"]].plot(
    kind="box", subplots=True, layout=(1, 4), sharey=False, figsize=(14, 6)
)
plt.suptitle("Boxplots of Numeric Features")
plt.show()


In [None]:

# Correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(titanic_df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap of Numeric Features")
plt.show()



## 📈 Insights

- **Class & Survival:** First-class passengers (`Pclass` = 1) survived at a higher rate than those in second or third class.  
- **Gender:** Females survived at a much higher rate than males (see the `Sex` column).  
- **Fare:** Passengers who paid higher fares were more likely to survive, consistent with the class effect.  
- **Age:** Young children had relatively better survival chances than older passengers.  
- **Missing Data:** The `Age`, `Cabin`, and `Embarked` columns have missing values, with `Cabin` missing for many passengers.
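
The class and gender insights above rest on group-wise survival rates, which the numeric-only heatmap does not show directly. The following is a minimal sketch of that comparison, assuming the standard Titanic column names `Survived`, `Pclass`, and `Sex`. The six-row `toy_df` is made up purely so the snippet runs on its own and is not real passenger data; in the notebook, `titanic_df` would take its place.

```python
import pandas as pd

# Made-up rows for illustration only; substitute titanic_df in the notebook.
toy_df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Pclass":   [1, 3, 1, 3, 2, 2],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
})

# Survived is coded 0/1, so the mean within each group is that group's survival rate.
rate_by_sex = toy_df.groupby("Sex")["Survived"].mean()
rate_by_class = toy_df.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

The same pattern extends to two keys at once, for example `groupby(["Pclass", "Sex"])`, to check whether the gender gap holds within each class.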

---



## 🎤 Interview Questions & Answers

**1. What is the purpose of EDA?**  
EDA helps us understand the dataset by summarizing its key characteristics, spotting missing values, detecting anomalies, and identifying patterns that can guide feature engineering and model building.

**2. How do boxplots help in understanding a dataset?**  
Boxplots summarize a distribution through its median, quartiles, and whiskers, and mark points beyond the whiskers as potential outliers. They are useful for detecting skewness and extreme values.
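
To make the outlier rule concrete, here is a small sketch of the 1.5 × IQR convention that Matplotlib's boxplots use by default for the whiskers. The `fares` values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up fare values for illustration only.
fares = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 500.0])

q1 = fares.quantile(0.25)   # lower edge of the box
q3 = fares.quantile(0.75)   # upper edge of the box
iqr = q3 - q1               # interquartile range (height of the box)

# Points outside these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = fares[(fares < lower_fence) | (fares > upper_fence)]
print(outliers)
```

A point flagged this way is only a candidate outlier; whether to cap, transform, or keep it depends on the feature and the model.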

**3. What is correlation and why is it useful?**  
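Correlation measures the strength and direction of the relationship between two variables on a scale from -1 to 1; Pearson's coefficient, the pandas default, captures linear relationships. It is useful for identifying redundant features (highly correlated with each other) and candidate predictors (strongly correlated with the target).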
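
As a rough illustration, the sketch below ranks features by their Pearson correlation with a target column. It assumes the Titanic column names `Survived`, `Pclass`, and `Fare`; the rows in `toy_df` are made up so the snippet is self-contained.

```python
import pandas as pd

# Made-up rows for illustration only; substitute titanic_df in the notebook.
toy_df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Pclass":   [1, 3, 1, 3, 2, 2],
    "Fare":     [80.0, 7.5, 60.0, 8.0, 13.0, 26.0],
})

# Full pairwise correlation matrix (Pearson is the default method).
corr = toy_df.corr(method="pearson")

# Correlation of each feature with the target, excluding the target itself.
target_corr = corr["Survived"].drop("Survived").sort_values()
print(target_corr)
```

Correlation only captures pairwise association, so a feature with a weak coefficient can still matter through non-linear effects or interactions.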

**4. How do you detect skewness in data?**  
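Inspect histograms for a long tail on one side, compute a skewness coefficient (positive for right skew, negative for left skew), or look for an off-center median and unequal whiskers in a boxplot.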
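
The numeric check can be sketched with pandas' `Series.skew()`, which returns the sample skewness. Both series below are invented for illustration: one has a long right tail, the other is symmetric around its mean.

```python
import pandas as pd

# Made-up values for illustration only.
fares = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 500.0])  # long right tail
ages = pd.Series([22.0, 25.0, 28.0, 30.0, 32.0, 35.0, 38.0])        # symmetric

fare_skew = fares.skew()  # clearly positive: right-skewed
age_skew = ages.skew()    # approximately zero: symmetric

# A mean well above the median is another quick sign of right skew.
mean_above_median = fares.mean() > fares.median()

print(fare_skew, age_skew, mean_above_median)
```

For a strongly right-skewed feature, a log transform such as `np.log1p` is a common next step before modeling.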

**5. What is multicollinearity?**  
It occurs when independent variables are highly correlated with each other, which makes coefficient estimates in linear models unstable and harder to interpret.
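
One common diagnostic is the variance inflation factor (VIF). The sketch below computes it from the diagonal of the inverse correlation matrix of the predictors, which equals 1 / (1 - R²) for each predictor regressed on the others. The column names follow the Titanic dataset, but the rows in `X` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up predictor values for illustration only.
X = pd.DataFrame({
    "Pclass": [1, 3, 1, 3, 2, 2, 3, 1],
    "Fare":   [80.0, 7.5, 60.0, 8.0, 13.0, 26.0, 9.0, 52.0],
    "Age":    [38.0, 22.0, 35.0, 27.0, 54.0, 14.0, 4.0, 58.0],
})

# Correlation matrix of the predictors only (the target is not included).
corr = X.corr().to_numpy()

# The j-th diagonal entry of the inverse correlation matrix is the VIF of predictor j.
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns, name="VIF")
print(vif)
```

A VIF near 1 indicates little overlap with the other predictors. Thresholds such as 5 or 10 are rules of thumb rather than strict cutoffs.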

**6. What tools do you use for EDA?**  
Commonly: Pandas, NumPy, Matplotlib, Seaborn, Plotly. For large datasets: Spark or Dask.

**7. Can you explain a time when EDA helped you find a problem?**  
(Example) In Titanic data, EDA revealed many missing values in `Cabin`. This insight helps decide whether to drop the column or engineer new features from it.
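
A check like this might look as follows. The sketch assumes the Titanic column names `Age`, `Cabin`, and `Embarked`; the four rows in `toy_df` are made up, and `HasCabin` is a hypothetical engineered feature, not something defined in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; substitute titanic_df in the notebook.
toy_df = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, 35.0],
    "Cabin":    [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Count and share of missing values per column.
missing_count = toy_df.isna().sum()
missing_share = toy_df.isna().mean()
print(missing_share.sort_values(ascending=False))

# Hypothetical alternative to dropping the column: keep only whether a cabin was recorded.
toy_df["HasCabin"] = toy_df["Cabin"].notna().astype(int)
```

Keeping an indicator like `HasCabin` preserves the information that a value was recorded, which may itself relate to the target.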

**8. What is the role of visualization in ML?**  
Visualization makes patterns clear, highlights anomalies, and helps in communicating insights effectively to both technical and non-technical audiences.
