# Titanic Survival Predictions
This project explores the Titanic dataset using Matplotlib and Seaborn to identify patterns in passenger survival. The analysis includes bar plots, histograms, box plots, scatter plots, and a correlation heatmap. Visualizations are customized with titles, labels, and colors, and provide insights into survival differences by gender, passenger class, age, and other factors.

### Contents:
1. Import Necessary Libraries
2. Read In and Explore the Data
3. Data Analysis
4. Cleaning Data
5. Data Visualization

Any and all feedback is welcome! 

## 1. Import Necessary Libraries
We need to import several Python libraries such as numpy, pandas, matplotlib and seaborn.

In [None]:
#data analysis libraries 
import numpy as np
import pandas as pd

#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 2. Read In and Explore the Data
Read the data into a DataFrame using `pd.read_csv`, and take a first look at it with the `describe()` method.

In [35]:
#read the Titanic CSV file
df = pd.read_csv("Titanic-Dataset.csv")

#summarize every column, numeric and non-numeric
df.describe(include="all")

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891 | 891.0 | 204 | 889 |
| unique | | | | 891 | 2 | | | | 681 | | 147 | 3 |
| top | | | | Dooley, Mr. Patrick | male | | | | 347082 | | G6 | S |
| freq | | | | 1 | 577 | | | | 7 | | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | | | 29.699118 | 0.523008 | 0.381594 | | 32.204208 | | |
| std | 257.353842 | 0.486592 | 0.836071 | | | 14.526497 | 1.102743 | 0.806057 | | 49.693429 | | |
| min | 1.0 | 0.0 | 1.0 | | | 0.42 | 0.0 | 0.0 | | 0.0 | | |
| 25% | 223.5 | 0.0 | 2.0 | | | 20.125 | 0.0 | 0.0 | | 7.9104 | | |
| 50% | 446.0 | 0.0 | 3.0 | | | 28.0 | 0.0 | 0.0 | | 14.4542 | | |
| 75% | 668.5 | 1.0 | 3.0 | | | 38.0 | 1.0 | 0.0 | | 31.0 | | |


## 3. Data Analysis
Take a quick look at the data: list the features in the dataset and check how complete they are.

In [None]:
#get a list of the features within the dataset
print(df.columns, df.shape)
#see a sample of the dataset to get an idea of the variables
df.sample(5)

* **Numerical Features:** Age (Continuous), Fare (Continuous), SibSp (Discrete), Parch (Discrete)
* **Categorical Features:** Survived, Sex, Embarked, Pclass
* **Alphanumeric Features:** Ticket, Cabin
* **Alphabetical Features:** Name

#### What are the data types for each feature?
* PassengerId: int (unique)
* Survived: int
* Pclass: int
* Name: string
* Sex: string
* Age: float
* SibSp: int
* Parch: int
* Ticket: string
* Fare: float
* Cabin: string
* Embarked: string


In [None]:
#see a summary of the dataset
print(df.describe(include = ['O']), df.describe())

In [None]:
#count the missing values in each column
pd.isnull(df).sum()

#### Some Observations:
* There are a total of 891 passengers in our dataset.
* The Age feature is missing approximately 19.9% of its values (177 of 891). Age is likely to matter for survival, so we will attempt to fill the gaps.
* The Cabin feature is missing approximately 77.1% of its values (687 of 891).
* The Embarked feature is missing 0.22% of its values (2 of 891). These can be filled easily with the most common value.
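
The percentages above can be computed in one step rather than by hand. The following is a minimal sketch on a tiny made-up frame (`toy` and its values are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the Titanic data (illustration only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "Q"],
})

# isnull() marks missing cells; the mean of those booleans is the
# fraction of missing values in each column
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```

Applied to the notebook's `df`, the same expression should give the missing-value percentage for every column at once.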

#### Some Predictions:
* Sex: Females are more likely to survive.
* SibSp/Parch: People traveling alone are more likely to survive.
* Age: Young children are more likely to survive.
* Pclass: People of higher socioeconomic class are more likely to survive.

## 4. Cleaning Data
Clean the data to account for missing values and unnecessary information!

* We have a total of 891 passengers.
* 2 values of the Embarked feature are missing. These are easy to handle.
* Around 19.9% of the Age feature is missing. We will need to fill that in.
* Around 77.1% of the Cabin feature is missing. The column adds nothing once the HasCabin flag has been created.

### Cabin and Ticket Feature

In [None]:
# Create HasCabin (1 = cabin recorded, 0 = cabin missing), which is all we need from Cabin
df['HasCabin'] = df['Cabin'].notnull().astype(int)

# Drop the Cabin feature since it adds nothing once HasCabin exists,
# and drop the Ticket feature since it's unlikely to yield any useful information.
df = df.drop(['Cabin', 'Ticket'], axis=1)

### Embarked Feature

In [None]:
# Fill in the missing values in the Embarked feature with the most common Embarked
embarked_count = df.groupby('Embarked')['Survived'].count().sort_values(ascending=False)
df['Embarked'] = df['Embarked'].fillna(embarked_count.reset_index().iloc[0, 0])

print(embarked_count)


It's clear that the majority of people embarked in Southampton (S), so the code above fills the missing values with S.
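
The same fill can be written more directly with `Series.mode()`. This is a minimal sketch under the assumption that a single most common value exists; the `ports` series is made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up port values for illustration; not the real passenger list
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

# mode() ignores NaN by default and returns the most frequent value(s)
most_common = ports.mode()[0]

# Replace every missing entry with that value
filled = ports.fillna(most_common)
print(most_common, filled.isnull().sum())
```

If two values tie for most frequent, `mode()` returns both in sorted order, so `[0]` picks the first alphabetically.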

### Name Feature

Extract the title from each Name to use later in the cleaning.

In [None]:
#extract a title for each Name in the dataset
df['Title'] = df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

#replace some titles 
df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
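
To see what the regular expression captures, here is a minimal sketch on a few made-up names written in the dataset's `Surname, Title. Given names` format (the names are illustrative, not taken from the data):

```python
import pandas as pd

# Hypothetical names in the same "Surname, Title. Given names" layout
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mlle. Anne",
    "Poe, Mme. Claire",
    "Smith-Jones, Ms. Emily",
])

# Capture the run of letters that follows a space and ends with a period
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Fold the rarer spellings into the common ones
titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
print(titles.tolist())
```

The pattern relies on the title being the first word that ends in a period, which holds for names in this layout.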

### Age Feature

Next we'll fill in the missing values in the Age feature. Since a much higher percentage of Age values is missing (about 19.9%), it would be illogical to fill all of them with the same value (as we did with Embarked). Instead, we fill each gap with the median age of passengers who share the same class, sex, and title.

In [None]:
# Convert any -0.5 placeholder ages back to NaN (a no-op when the raw CSV is loaded as above)
df['Age'] = df['Age'].replace(-0.5, np.nan)

# Fill missing ages with the median age of passengers sharing Pclass, Sex, and Title
df['Age'] = df['Age'].fillna(df.groupby(['Pclass', 'Sex', 'Title'])['Age'].transform('median'))
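
A group whose ages are all missing has no median, so its rows would stay empty after a group-wise fill. The sketch below shows the group-median fill on made-up rows and adds an overall-median fallback; the fallback is an assumption for illustration and is not part of the notebook's own cleaning:

```python
import numpy as np
import pandas as pd

# Made-up passengers; the lone "Dr" has no age and no group mates
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 2],
    "Sex": ["male"] * 3 + ["female"] * 3 + ["male"],
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss", "Dr"],
    "Age": [40.0, 50.0, np.nan, 4.0, np.nan, 10.0, np.nan],
})

# transform('median') returns one value per row: the median of that row's group
group_median = toy.groupby(["Pclass", "Sex", "Title"])["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(group_median)

# Hypothetical fallback: rows still missing get the overall median
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy["Age"].tolist())
```

Running `df['Age'].isnull().sum()` after the notebook's fill is a quick way to check whether such a fallback is needed at all.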

## 5. Data Visualization
Visualize our data so we can see whether our predictions were accurate! 

### Sex Feature

In [None]:
#bar plot of survival by sex
sns.barplot(x="Sex", y="Survived", data=df)

# percentages of females vs. males who survived
survive_by_sex = df.groupby('Sex')['Survived'].mean()

print(f"Percentage of females who survived: {survive_by_sex.loc['female']*100:.1f} %")
print(f"Percentage of males who survived: {survive_by_sex.loc['male']*100:.1f} %")

### Pclass Feature

In [None]:
#bar plot of survival by Pclass
sns.barplot(x="Pclass", y="Survived", data=df)

#percentage of people in each Pclass who survived
survive_by_pclass = df.groupby('Pclass')['Survived'].mean()

print(f"Percentage of Pclass = 1 who survived: {survive_by_pclass.loc[1]*100:.1f} %")
print(f"Percentage of Pclass = 2 who survived: {survive_by_pclass.loc[2]*100:.1f} %")
print(f"Percentage of Pclass = 3 who survived: {survive_by_pclass.loc[3]*100:.1f} %")

As predicted, passengers in a higher class had a higher survival rate (62.9% in first class vs. 47.3% in second vs. 24.2% in third).

### SibSp Feature

In [None]:
#bar plot for SibSp vs. survival
sns.barplot(x="SibSp", y="Survived", data=df)

# print the survival rate for every SibSp value at once instead of one print per value
print(df.groupby('SibSp')['Survived'].mean().sort_values(ascending=False)*100)

In general, it's clear that people with more siblings or spouses aboard were less likely to survive. However, contrary to expectations, people with no siblings or spouses were less likely to survive than those with one or two (34.5% vs. 53.6% vs. 46.4%).

### Parch Feature

In [None]:
#bar plot for Parch vs. survival
sns.barplot(x="Parch", y="Survived", data=df)
print(df.groupby('Parch')['Survived'].mean().sort_values(ascending=False)*100)
plt.show()

People with fewer than four parents or children aboard are more likely to survive than those with four or more. Again, people traveling alone are less likely to survive than those with one to three parents or children.
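
The rates for SibSp and Parch come from taking the mean of the 0/1 `Survived` column within each group. Here is a minimal sketch that also reports how many passengers each rate rests on; the `toy` rows are made up for illustration:

```python
import pandas as pd

# Made-up rows for illustration; the notebook itself would use df
toy = pd.DataFrame({
    "Parch":    [0, 0, 0, 0, 1, 1, 3],
    "Survived": [0, 1, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate;
# the count shows how many passengers that rate is based on
rates = toy.groupby("Parch")["Survived"].agg(rate="mean", n="count")
rates["pct"] = (rates["rate"] * 100).round(1)
print(rates)
```

A rate based on a single passenger, like the `Parch = 3` row in this toy frame, says very little, which is a reason to read the bars for rare SibSp and Parch values with caution.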

### Age Feature

In [None]:
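#sort ages into right-closed bins; the 'Unknown' bin (-1, 0] stays empty because no age is <= 0
bins = [-1, 0, 5, 12, 18, 24, 35, 60, 80]
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
df['AgeGroup'] = pd.cut(df["Age"], bins, labels = labels)

#observed=False keeps empty categories in the result and avoids a FutureWarning in recent pandas
print(df.groupby('AgeGroup', observed=False)['Survived'].mean().sort_values(ascending=False)*100)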

#draw a bar plot of Age vs. survival
sns.barplot(x="AgeGroup", y="Survived", data=df)
plt.show()

Babies are more likely to survive than any other age group. 
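
Which group an age on a bin edge falls into depends on how `pd.cut` treats the edges. The following minimal sketch reuses the bins and labels from the cell above on a few hand-picked ages (the ages are chosen for illustration):

```python
import pandas as pd

bins = [-1, 0, 5, 12, 18, 24, 35, 60, 80]
labels = ["Unknown", "Baby", "Child", "Teenager", "Student",
          "Young Adult", "Adult", "Senior"]

# Hand-picked ages, several of them exactly on a bin edge
ages = pd.Series([0.42, 5, 5.5, 18, 60, 80])

# By default each bin includes its right edge and excludes its left edge
groups = pd.cut(ages, bins, labels=labels)
print(groups.astype(str).tolist())
```

With the default `right=True`, an age of exactly 5 counts as a Baby and an age of exactly 18 as a Teenager.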

### Cabin Feature

In [None]:
#calculate percentages of HasCabin vs. survived
print(df.groupby('HasCabin')['Survived'].mean().sort_values(ascending=False)*100)

#draw a bar plot of HasCabin vs. survival
sns.barplot(x="HasCabin", y="Survived", data=df).set_xticks([0, 1], ['No', 'Yes'])
plt.show()

People with a recorded Cabin number are, in fact, more likely to survive (66.7% vs. 30.0%).

I think the idea here is that people with recorded cabin numbers are of higher socioeconomic class, and thus more likely to survive.
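
One way to check this idea would be to look at the share of passengers with a recorded cabin in each class. The sketch below does this with `pd.crosstab` on made-up rows; the `toy` values are assumptions for illustration and say nothing about the real proportions:

```python
import pandas as pd

# Made-up rows for illustration; the notebook itself would use df
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "HasCabin": [1, 1, 0, 1, 0, 0, 0, 0],
})

# normalize="index" turns each Pclass row into shares that sum to 1
share = pd.crosstab(toy["Pclass"], toy["HasCabin"], normalize="index")
print(share)
```

If the `HasCabin = 1` column shrinks sharply from first to third class on the real data, that would support the explanation; if not, the cabin effect would need another reading.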