Data Manipulation, Analysis, and Modeling Tasks in Python

In [None]:
## Data Loading

In [1]:
## Import required libraries

import pandas as pd

In [2]:
## Read the dataset with the 'pd.read_csv()' function, which loads a CSV file into a pandas DataFrame

titanic_df = pd.read_csv(r"C:\Users\ext.carmen.salazar\OneDrive - DSV\Desktop\titanic_dataset.csv")

In [3]:
## Exploring the dataset
## Use 'titanic_df.head()' to view the first few rows of the dataframe and get an overview of the data

titanic_df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
# total number of rows in the dataframe (one row per observation in the dataset)
# the number of observations tells us the dataset size, i.e. the sample size

total_observations = titanic_df.shape[0]
print("Total number of observations:", total_observations)

Total number of observations: 891


In [11]:
# number of non-missing values in each column of the dataframe

column_counts = titanic_df.count()
print("Column counts:")
print(column_counts)


Column counts:
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64


## Data Cleaning

In [12]:
# total number of missing values in each column

missing_values = titanic_df.isnull().sum()
print("Missing values per column:")
print(missing_values)

Missing values per column:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
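Raw counts are easier to judge as a share of all rows. From the output above, Age is missing in 177 of 891 rows (about 19.9%), Cabin in 687 of 891 (about 77.1%), and Embarked in 2 of 891 (about 0.2%). A minimal sketch of computing such percentages, using a hypothetical toy frame (`toy_df`) in place of `titanic_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# isnull() yields booleans; their mean is the fraction of missing entries per column
missing_pct = toy_df.isnull().mean() * 100
print(missing_pct.round(1))
```

The same expression should work unchanged on `titanic_df`.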


In [26]:
# check for duplicates in dataframe
duplicates = titanic_df.duplicated()

# count the number of duplicates
number_duplicates = duplicates.sum()

print("Number of duplicate rows:", number_duplicates)


Number of duplicate rows: 0


In [27]:
# determine whether any data types need to be converted for proper analysis
# retrieve the data types of each column

data_types = titanic_df.dtypes
print("Data types of each column:")
print(data_types)

Data types of each column:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


Expected types for each column in the Titanic dataset:

PassengerId: The assigned type int64 aligns with the expected type for an identifier column.

Survived: The assigned type int64 aligns with the expected type for a binary indicator column (0 or 1 for not survived or survived, respectively).

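Pclass: The assigned type int64 is workable for passenger class, which is an ordinal categorical variable stored as integer codes (1, 2, 3). The codes are labels rather than quantities, so arithmetic on them (for example, a mean) is not meaningful.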

Name: The assigned type object is appropriate for a column containing string values.

Sex: The assigned type object is appropriate for a column containing categorical values (male or female).

Age: The assigned type float64 is suitable for a column representing numerical values for age.

SibSp: The assigned type int64 aligns with the expected type for a numerical column representing the number of siblings/spouses aboard.

Parch: The assigned type int64 aligns with the expected type for a numerical column representing the number of parents/children aboard.

Ticket: The assigned type object is appropriate for a column containing alphanumeric values.

Fare: The assigned type float64 is appropriate for a numerical column representing the ticket fare.

Cabin: The assigned type object is suitable for a column containing string values representing cabin information.

Embarked: The assigned type object is appropriate for a column containing categorical values representing the port of embarkation.

Overall, the assigned data types seem to align well with the expected types for each column in the Titanic dataset.

Handle Missing Values: Missing values can affect the data types that pandas infers automatically. For example, a numeric column that contains missing values is read as float64 rather than int64, because NaN is a floating-point value. Handle missing values (by imputation or removal) before converting data types.
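The advice above can be sketched on a small stand-in frame. This is only an illustration, not the cleaning applied in this notebook: `toy_df` and `cleaned` are hypothetical names, the rows are made up, and median/mode imputation is one assumed strategy among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column names as the Titanic data
toy_df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

cleaned = toy_df.copy()

# Assumed strategy: fill numeric gaps with the median, categorical gaps with the mode
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode().iloc[0])

# Once the gaps are filled, convert the categorical columns explicitly
for col in ["Pclass", "Sex", "Embarked"]:
    cleaned[col] = cleaned[col].astype("category")

print(cleaned.dtypes)
```

Whether to impute or drop depends on the column: with 687 of 891 Cabin values missing, imputing that column would mostly invent data, so dropping it may be the more defensible choice.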

In [13]:
# insights into the Titanic dataset

In [14]:
# average age of passengers on the Titanic (about 29.7 years); this is a measure of the central tendency of age

mean_age = titanic_df['Age'].mean()
print("Average age of passengers:", mean_age)

Average age of passengers: 29.69911764705882
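Note that `mean()` skips missing values by default, so the average above is taken over the 714 passengers with a recorded age, not all 891. A small sketch of that behaviour on a hypothetical series (`ages` is made-up data):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry
ages = pd.Series([22.0, 38.0, np.nan, 35.0], name="Age")

mean_default = ages.mean()               # skipna=True by default: NaN is ignored
mean_manual = ages.sum() / ages.count()  # same result: sum over the non-missing count
mean_strict = ages.mean(skipna=False)    # NaN, because a missing value is included

print(mean_default, mean_manual, mean_strict)
```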


In [24]:
# minimum & maximum age of passengers on the Titanic
# this provides age range of passengers. The minimum age is 0.42 years (about 5 months) and the maximum age is 80 years. 

min_age = titanic_df['Age'].min()
max_age = titanic_df['Age'].max()

print("Minimum age of passengers:", min_age)
print("Maximum age of passengers:", max_age)

Minimum age of passengers: 0.42
Maximum age of passengers: 80.0


In [15]:
# median fare paid by passengers on the Titanic ($14.45). This demonstrates that half of the passengers paid less than $14.45
# while the other half paid more than $14.45. This provides a central tendency that is less affected by extreme fare values.

median_fare = titanic_df['Fare'].median()
print("Median fare paid by passengers:", median_fare)

Median fare paid by passengers: 14.4542


In [25]:
# calculating the quartiles (25th, 50th, and 75th percentile) of "Fare" column
# the 25th percentile of the fare variable is $7.91 and the 75th percentile is $31.00
# this means that 25% of passengers paid less than $7.91, while 25% paid more than $31.00
# this provides insight into the fare distribution and the spread of prices paid
quartiles = titanic_df['Fare'].quantile([0.25, 0.5, 0.75])
print("Quartiles of fare:")
print(quartiles)

Quartiles of fare:
0.25     7.9104
0.50    14.4542
0.75    31.0000
Name: Fare, dtype: float64
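The quartiles above also give the interquartile range (IQR), a measure of spread for the middle 50% of fares. The sketch below reuses the reported quartile values; the 1.5 × IQR fences are the common Tukey rule of thumb for flagging unusually high or low values, which is an assumption added here and not part of the original analysis.

```python
# Quartile values copied from the output above
q1, q3 = 7.9104, 31.0

# Interquartile range: width of the middle 50% of fares
iqr = q3 - q1

# Tukey fences (assumed rule of thumb, not from the original notebook)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print("IQR:", round(iqr, 4))
print("Fences:", round(lower_fence, 4), round(upper_fence, 4))
```

The IQR comes to about 23.09. Under this rule, fares above roughly 65.63 would be flagged as unusually high, and the lower fence is negative, so no fare would be flagged as unusually low.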


In [16]:
# mode: most frequent value(s) in passenger class variable to determine most common class among the passengers

mode_passenger_class = titanic_df['Pclass'].mode()
print("Mode of passenger class:")
print(mode_passenger_class)

Mode of passenger class:
0    3
Name: Pclass, dtype: int64


In [21]:
# mode() returns a Series: in the output above, 0 is the index label and 3 is the single mode; to get the most common passenger class as a scalar, we use indexing

common_passenger_class = mode_passenger_class[0]
print("most common passenger class:", common_passenger_class)

most common passenger class: 3


In [22]:
# double-check whether several values share the highest frequency
# to retrieve all modes, convert the Series to a list as below
# the list contains only one value (3), so the mode of the passenger class variable is third class,
# meaning that the most common class among the passengers was third class

mode_passenger_class.tolist()

[3]
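To show what a genuinely multi-modal result looks like, and how `value_counts()` exposes the frequencies behind the mode, here is a short sketch on hypothetical series (`tied` and `single` are made-up data, not drawn from the Titanic file):

```python
import pandas as pd

# Hypothetical class labels: one series with a tie, one with a clear winner
tied = pd.Series([1, 1, 3, 3, 2], name="Pclass")
single = pd.Series([3, 3, 1, 2], name="Pclass")

# mode() returns every value that shares the highest frequency, in sorted order
tied_modes = tied.mode().tolist()
single_modes = single.mode().tolist()

# value_counts() lists the frequency of each value; idxmax() picks the most frequent one
counts = single.value_counts()
most_common = counts.idxmax()

print(tied_modes, single_modes, most_common)
```

Running `titanic_df['Pclass'].value_counts()` in the same way would show how far ahead third class is, which the mode alone does not reveal.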