**TITANIC SURVIVAL PREDICTION**

**TASK**

- Use the Titanic dataset to build a model that predicts whether a
passenger on the Titanic survived or not. This is a classic beginner
project with readily available data.

- The dataset typically used for this project contains information
about individual passengers, such as their age, gender, ticket
class, fare, cabin, and whether or not they survived.

**INTRODUCTION**

The sinking of the Titanic is one of the most infamous maritime disasters in history. The Titanic dataset is widely used in the data science community for teaching and learning purposes. It provides information about individual passengers on board the Titanic, including details such as age, gender, ticket class, fare, cabin, and survival status. The goal of this project is to build a predictive model that can determine whether a passenger survived or not based on these features.

**BUSINESS UNDERSTANDING**

Understanding the factors that influenced the survival of passengers on the Titanic is not only historically intriguing but also has practical implications for predictive modeling and data analysis. By developing a model that can predict survival, we gain insights into the characteristics that made individuals more likely to survive the disaster. This knowledge could be valuable for designing future safety measures or understanding human behavior in emergency situations.

**PROBLEM STATEMENT**

The problem at hand is a binary classification task: given a set of features for each passenger, we want to predict whether that passenger survived or not. The dataset contains a labeled target variable, 'Survived' (1 if the passenger survived, 0 if not), and various features that can be used for prediction.
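As a rough illustration of this setup, the sketch below separates the 'Survived' target from the candidate features and checks the class balance. The five-row `sample` frame and the names `X`, `y`, and `class_share` are assumptions made for this example; it does not use the full dataset.

```python
import pandas as pd

# Hypothetical five-row sample for illustration only (not the full dataset).
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Survived": [0, 1, 1, 1, 0],
})

# Separate the binary target from the candidate features.
y = sample["Survived"]
X = sample.drop(columns=["Survived"])

# Share of each class in the target (0 = did not survive, 1 = survived).
class_share = y.value_counts(normalize=True)
print(class_share)
```

Checking the class shares early is useful because a strongly imbalanced target affects how accuracy should be interpreted later.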

**DATA UNDERSTANDING**

The dataset for this project was obtained from [Kaggle Titanic Dataset](https://www.kaggle.com/datasets/yasserh/titanic-dataset?resource=download).
It has 891 rows and 12 columns. Each column is described below:

1. **PassengerId** - this column denotes the passenger identification number


2. **Survived** - this column shows whether the passenger survived or not: 0 = No (the passenger did not survive), 1 = Yes (the passenger survived). This is our target variable for classification.


3. **Pclass** - Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd


4. **Name** - this column indicates the name of the passenger


5. **Sex** - this column shows the sex of the passenger (male or female)


6. **Age** - this column shows the age in years of the passenger


7. **SibSp** - this column shows the number of siblings or spouses aboard the Titanic


8. **Parch** - this column shows the number of parents or children aboard the Titanic


9. **Ticket** - this column shows the ticket number


10. **Fare** - this column shows the passenger fare


11. **Cabin** - this column provides information about the passenger's assigned cabin


12. **Embarked** - this column represents the port of embarkation, i.e., the port where the passenger boarded the Titanic. There are three possible values for "Embarked":
    - C: Cherbourg
    - Q: Queenstown
    - S: Southampton
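The single-letter port codes can be translated into readable names when inspecting or plotting the data. The following is a minimal sketch, assuming a hypothetical `PORT_NAMES` dictionary and a small hand-made `embarked` series rather than the real column:

```python
import pandas as pd

# Port codes as described in the column list above.
PORT_NAMES = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Hypothetical sample values; None stands in for a missing entry.
embarked = pd.Series(["S", "C", "S", None, "Q"], name="Embarked")

# map() leaves missing or unrecognised codes as NaN.
embarked_port = embarked.map(PORT_NAMES)
print(embarked_port)
```

On the real DataFrame the same idea would apply to the "Embarked" column, with any missing entries remaining missing after the mapping.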

In the code cell below, we import the Python packages that will be used in this project.


In [1]:
#Basic data manipulation and analysis
import pandas as pd
import numpy as np

#Data visualisation libraries
import matplotlib.pyplot as plt
%matplotlib inline

# To ensure a more organized and tidy output, we suppress potential warnings that may arise during the execution of the code.
import warnings
warnings.filterwarnings('ignore')

We will first load Titanic-Dataset.csv into a pandas DataFrame.

In [2]:
# The Titanic dataset is in the same directory as the Jupyter notebook
file_path = 'Titanic-Dataset.csv'

# Read the CSV file into a pandas DataFrame and display its first five rows
titanic_df = pd.read_csv(file_path)
titanic_df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# Display the last five rows of the DataFrame
titanic_df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


**DATA CLEANING**

To facilitate a comprehensive exploration of our dataset, we will define the following custom functions:

**data_shape(titanic_df)**: This function reveals the shape of the DataFrame, providing the number of rows and columns in the dataset.

**data_info(titanic_df)**: Offering valuable insights, this function presents information about the data, such as column names, data types, and the count of non-null values in each column.

**data_missing(titanic_df)**: With a focus on data completeness, this function counts the null entries in each column and reports each count as a percentage of the rows, followed by the total number and overall percentage of missing cells.

**identify_duplicates(titanic_df)**: This function finds fully duplicated rows in the dataset and prints them. If no duplicates exist, the printed DataFrame is empty.

**unique_column_duplicates(titanic_df, column)**: Specifically designed to handle duplicates within a designated column, this function calculates the number and percentage of duplicated rows for that specific column. If no duplicates are detected in the column, a message stating their absence is exhibited.

**data_describe(titanic_df)**: To gain a deeper understanding of numerical columns, this function showcases descriptive statistics such as count, mean, standard deviation, minimum, quartiles, and maximum.

We use these custom functions to inspect the structure of the data, identify missing values and duplicates, and obtain summary statistics before moving on to modeling.
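The description of `unique_column_duplicates` above could be implemented along the following lines. This is a minimal sketch rather than the notebook's own definition: the parameter name `df`, the returned tuple, the message wording, and the `toy` frame are all assumptions made for illustration.

```python
import pandas as pd

def unique_column_duplicates(df, column):
    """Report how many rows repeat an earlier value in `column` (illustrative sketch)."""
    # duplicated() marks every occurrence after the first as True.
    dup_mask = df[column].duplicated()
    dup_count = int(dup_mask.sum())
    dup_pct = dup_count / len(df) * 100 if len(df) else 0.0
    if dup_count == 0:
        print(f"No duplicates found in column '{column}'.")
    else:
        print(f"Column '{column}': {dup_count} duplicated rows ({dup_pct:.2f}%)")
    return dup_count, dup_pct

# Hypothetical toy frame, not taken from the Titanic data.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["A1", "B2", "A1", "A1"],
})
id_dups = unique_column_duplicates(toy, "PassengerId")
ticket_dups = unique_column_duplicates(toy, "Ticket")
```

A check like this is most informative on a column that is expected to be unique, such as an identifier; repeated values in other columns may be perfectly legitimate.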

In [4]:
# Function to print the shape of the DataFrame
def data_shape(titanic_df):
    print("Data Shape:")
    print(f"Number of Rows: {titanic_df.shape[0]}")
    print(f"Number of Columns: {titanic_df.shape[1]}\n")

# Function to display information about the data
def data_info(titanic_df):
    print("Data Information:")
    print(titanic_df.info())
    
# Function to check for missing values and display missing value percentage
def data_missing(titanic_df):
    print("\nMissing Values:")
    missing_values = titanic_df.isnull().sum()
    total_cells = titanic_df.size
    total_missing = missing_values.sum()

    # Calculate the percentage of missing values for each column
    missing_percentage = (missing_values / titanic_df.shape[0]) * 100

    # Display the missing values and percentage for each column
    missing_info = pd.DataFrame({
        'Total Missing Values': missing_values,
        'Missing Value Percentage': missing_percentage
    })
    print(missing_info)
    print(f"\nTotal Missing Values: {total_missing}")
    print(f"Total Missing Value Percentage: {total_missing / total_cells * 100:.2f}%")
        
# Function to identify and display duplicate rows
def identify_duplicates(titanic_df):
    duplicates = titanic_df[titanic_df.duplicated()]
    print("\nDuplicate Rows:")
    print(duplicates)
    
# Function to display descriptive statistics of numerical columns
def data_describe(titanic_df):
    print("\nDescriptive Statistics:")
    print(titanic_df.describe())
    
# Function to explore the dataset
def explore_dataset(titanic_df):
    data_shape(titanic_df)
    data_info(titanic_df)
    data_missing(titanic_df)
    identify_duplicates(titanic_df)
    data_describe(titanic_df)
# Call the function to explore the dataset
explore_dataset(titanic_df)

Data Shape:
Number of Rows: 891
Number of Columns: 12

Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

Missing Values:
             Total Missing Values  Missing Value Percentage
PassengerId                     0                  0.000000
Survived                     