# Titanic Dataset: Exploratory Data Analysis (EDA)

Welcome to the Titanic dataset EDA notebook! In this notebook, you will explore the Titanic dataset, clean the data, and perform some basic visualizations. Follow the instructions in each cell and complete the tasks where needed.

Let's get started!

In [11]:
# Task: Import the necessary libraries for data analysis and visualization
# Libraries to import: pandas, numpy, matplotlib, seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
# Use %matplotlib inline if you are using Jupyter Notebook for inline plotting

In [12]:
# Task: Load the Titanic dataset into a pandas DataFrame
# You can use 'titanic.csv' if the dataset is available locally or provide a URL to load it.

# Example: df = pd.read_csv('titanic.csv')

# Load your dataset here
# df = pd.read_csv('titanic3.csv')
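# or load seaborn's copy of the data (fetched online on first use, then cached)
# Note: seaborn's version uses lowercase column names ('age', 'fare', 'pclass');
# some CSV versions use capitalized names ('Age', 'Fare', 'Pclass') instead.
df = sns.load_dataset('titanic')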
# Preview the first few rows of the dataset using df.head()

In [13]:
# Task: Inspect the dataset to understand its structure
# Instructions: Use df.info() to check the data types and non-null values for each column.
df.info()
# Add df.describe() to get a summary of the numerical features.
df.describe().round(3)
# Inspect data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


       survived   pclass      age    sibsp    parch     fare
count   891.000  891.000  714.000  891.000  891.000  891.000
mean      0.384    2.309   29.699    0.523    0.382   32.204
std       0.487    0.836   14.526    1.103    0.806   49.693
min       0.000    1.000    0.420    0.000    0.000    0.000
25%       0.000    2.000   20.125    0.000    0.000    7.910
50%       0.000    3.000   28.000    0.000    0.000   14.454
75%       1.000    3.000   38.000    1.000    0.000   31.000
max       1.000    3.000   80.000    8.000    6.000  512.329


In [14]:
# Task: Check for missing data in the dataset
# Instructions: Use isnull().sum() to count missing values per column.

# Explore how to handle missing values (fillna or dropna).

# Example:
df.isnull().sum()

# Now handle missing values by either dropping or filling them

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
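
The counts above suggest three different situations: a numeric column with many gaps (`age`), a categorical column with very few gaps (`embarked`), and a column that is mostly empty (`deck`). The following is a minimal sketch of one possible treatment for each, run on a small made-up frame (`toy`) rather than the real data; the choice of median and mode is an assumption, not the only reasonable strategy.

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the missing-value pattern above (values are made up).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "embarked": ["S", "C", None, "S", "S"],
    "deck": [None, "C", None, None, "E"],
})

cleaned = toy.copy()
# Numeric column: fill gaps with the median, which is less sensitive to outliers than the mean.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
# Categorical column with few gaps: fill with the most frequent value.
cleaned["embarked"] = cleaned["embarked"].fillna(cleaned["embarked"].mode()[0])
# Mostly-empty column: drop it instead of imputing.
cleaned = cleaned.drop(columns=["deck"])

missing_after = cleaned.isnull().sum()
print(missing_after)
```

The same three calls should carry over to `df` by replacing `toy` with `df`, provided the column names match.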

In [15]:
# Task: Perform additional data cleaning steps if needed
# Instructions: Look for duplicates, adjust data types, and rename columns where necessary.

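# Remove duplicate rows, if any
# Note: this version of the dataset has no passenger ID or name column, so rows that
# look identical may belong to different passengers. Dropping them leaves 784 of the 891 rows.
df.drop_duplicates(inplace=True)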

# Check if any columns need renaming or converting to appropriate data types
# Example: df['column_name'] = df['column_name'].astype('new_dtype') if you need to change the data type of a column
# Example: df.rename(columns={'old_name': 'new_name'}, inplace=True) if you need to rename a column
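
The three cleaning steps named in this cell (duplicates, data types, renaming) can be sketched on a small made-up frame as follows. The frame `toy` and the new column name `passenger_class` are hypothetical and only serve as illustration.

```python
import pandas as pd

# Made-up rows; rows 2 and 3 repeat rows 1 and 0 exactly.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "sex": ["male", "female", "female", "male"],
    "pclass": [3, 1, 1, 3],
})

# Count duplicates before removing them, so you know how much data is lost.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

# Convert a text column with few distinct values to the category dtype.
deduped["sex"] = deduped["sex"].astype("category")

# Rename a column (the new name here is just an example).
deduped = deduped.rename(columns={"pclass": "passenger_class"})

print(n_duplicates, deduped.dtypes, sep="\n")
```

Checking `duplicated().sum()` first is a cheap way to decide whether dropping duplicates is appropriate at all.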

In [17]:
# Task: Generate descriptive statistics for the dataset
# Instructions: Use df.describe() to get an overview of the numerical columns.
# Use value_counts() for categorical columns to see the distribution of values.

# Example:
df.describe()

# Now calculate statistics for specific columns (e.g., mean, median)
df['age'].mean(), df['survived'].value_counts()

(29.869351032448375,
 survived
 0    461
 1    323
 Name: count, dtype: int64)
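
Beyond raw counts, it is often useful to look at proportions and at statistics per group. This is a minimal sketch on made-up data (`toy`), assuming the lowercase column names of the seaborn version of the dataset.

```python
import pandas as pd

# Made-up passengers, only for illustration.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "female"],
    "survived": [0, 1, 1, 0, 1, 0],
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
})

# Share of each category instead of absolute counts.
sex_share = toy["sex"].value_counts(normalize=True)

# Because 'survived' is 0/1, its mean per group is the survival rate of that group.
survival_by_sex = toy.groupby("sex")["survived"].mean()

# The median complements the mean for skewed columns.
median_age = toy["age"].median()

print(sex_share, survival_by_sex, median_age, sep="\n")
```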

In [None]:
# Task: Create visualizations to explore the data
# Instructions: Create a bar chart or a histogram to visualize the distribution of certain features.

# Example: Use matplotlib or seaborn to plot the distribution of 'age' or 'fare'
# (column names are lowercase in the seaborn version of the dataset)
sns.histplot(df['age'].dropna(), bins=30)


In [None]:

# Now create a bar chart for categorical data such as 'pclass' or 'sex'
#sns.countplot(x='pclass', data=df)
#sns.barplot(x='pclass', y='survived', data=df)

In [None]:
# Task: Explore relationships between variables
# Instructions: Create a correlation matrix to see how features are related to each other.
# Use a heatmap from seaborn to visualize the correlations.

# Example: sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
# sns.heatmap(df.corr(), annot=True, cmap='coolwarm')

# In pandas 2.0 and later the call above raises an error, because df.corr() no longer
# skips non-numeric columns such as 'sex' or 'embarked' by default.
# We need to convert them to numeric or drop them before computing the correlation.


In [None]:
# Find the non-numeric columns (both object and category dtypes)
df.select_dtypes(include=['object', 'category'])

In [None]:
# Find the numeric columns
df.select_dtypes(include=['int64', 'float64'])

In [None]:
# First we can look at only the numeric columns to see the correlation
sns.heatmap(df.select_dtypes(include=['int64', 'float64']).corr(), annot=True, cmap='coolwarm')
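
A minimal sketch of the idea in this cell, on made-up data (`toy`): select the numeric columns first, then compute the correlation matrix. Passing `include="number"` is an alternative to listing `int64` and `float64` explicitly; in pandas 1.5 and later, `toy.corr(numeric_only=True)` should give the same result.

```python
import pandas as pd

# Made-up rows with one text column that cannot be correlated directly.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "pclass": [3, 1, 1, 3, 2],
    "fare": [7.25, 71.28, 53.10, 8.05, 13.00],
    "sex": ["male", "female", "female", "male", "female"],
})

# Keep only integer and float columns; 'sex' is left out.
numeric = toy.select_dtypes(include="number")

# Pearson correlation between every pair of numeric columns.
corr = numeric.corr()
print(corr.round(2))
```

The resulting `corr` frame can be passed to `sns.heatmap` exactly as in the cell above.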

In [None]:
# Now we can convert the non-numeric columns to numeric
# using one-hot encoding (column names are lowercase in the seaborn version)
df_dummies = pd.get_dummies(df, columns=['sex', 'embarked'], drop_first=True)

In [None]:
df_dummies.head()
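
To see what one-hot encoding with `drop_first=True` produces, here is a minimal sketch on three made-up rows. The `dtype=int` argument is an optional choice made here so the indicator columns show as 0/1 rather than booleans.

```python
import pandas as pd

# Three made-up passengers.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
    "fare": [7.25, 71.28, 7.92],
})

# Each listed column is replaced by one indicator column per category.
# drop_first=True removes the first category (alphabetically) of each column,
# since it is implied when all the remaining indicators are 0.
encoded = pd.get_dummies(toy, columns=["sex", "embarked"], drop_first=True, dtype=int)
print(encoded)
```

Here `sex_female` and `embarked_C` are the dropped baseline categories, so a row with `embarked_Q == 0` and `embarked_S == 0` embarked at `C`.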

In [None]:
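# Now we can look at the correlation matrix
# But first let's drop the remaining text and category columns.
# The seaborn version has no name, ticket, or cabin columns. Instead, 'class',
# 'embark_town', and 'alive' repeat 'pclass', 'embarked', and 'survived',
# 'who' is another text column, and 'deck' is mostly missing.
df_dummies.drop(['class', 'who', 'deck', 'embark_town', 'alive'], axis=1, inplace=True)

sns.heatmap(df_dummies.corr(), annot=True, cmap='coolwarm')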

In [None]:
# Task: Explore further! 
# Instructions: Feel free to explore other relationships in the data using scatter plots or other visualizations.

# For example, plot 'fare' vs 'age' to see if there's any trend:
plt.scatter(df['fare'], df['age'])

# Try exploring any other interesting relationships in the dataset.

In [None]:
# and again using seaborn
sns.scatterplot(x='fare', y='age', data=df)

# Task: Summarize your findings
Now that you’ve completed the EDA, take a moment to reflect on the key insights you gathered from the Titanic dataset. What interesting patterns did you observe? Are there any obvious outliers or trends?

Feel free to add your conclusions here.
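
One common way to check for outliers, which the reflection above asks about, is the 1.5 × IQR rule. This is a minimal sketch on a handful of made-up fare values, not the notebook's actual data; the factor of 1.5 is a widely used convention, not something this notebook prescribes.

```python
import pandas as pd

# Made-up fares; the last value is deliberately extreme.
fares = pd.Series([7.25, 7.91, 14.45, 31.0, 512.33], name="fare")

# Interquartile range: distance between the 25th and 75th percentiles.
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Values above q3 + 1.5 * IQR are flagged as potential high outliers.
upper = q3 + 1.5 * iqr
outliers = fares[fares > upper]
print(outliers)
```

Applied to the real data, the same steps would use `df['fare']` in place of `fares`.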

Other built-in datasets you can practice on:

1. Seaborn: Includes datasets like tips, iris, titanic, and diamonds.
   Example: sns.load_dataset('tips')
2. Scikit-learn: Offers various datasets such as iris, digits, wine, and breast_cancer.
   Example: from sklearn.datasets import load_iris