In [12]:
# Import libraries
import arff
import pandas as pd
import numpy as np

In [13]:
# Load the arff file
with open('titanic_data.arff', 'r') as f:
    dataset = arff.load(f)

In [14]:
# Load the data and column names
data = dataset["data"]
column_names = [attr[0] for attr in dataset["attributes"]]

In [15]:
# Create the DataFrame
df = pd.DataFrame(data, columns=column_names)

# Display the first 20 rows
df.head(20)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
5,1.0,1,"Anderson, Mr. Harry",male,48.0,0.0,0.0,19952,26.55,E12,S,3,,"New York, NY"
6,1.0,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1.0,0.0,13502,77.9583,D7,S,10,,"Hudson, NY"
7,1.0,0,"Andrews, Mr. Thomas Jr",male,39.0,0.0,0.0,112050,0.0,A36,S,,,"Belfast, NI"
8,1.0,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2.0,0.0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1.0,0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


This dataset provides detailed information about passengers on the Titanic, including personal data (such as name, age, and sex), ticket details, and survival status. The dataset includes both numerical and categorical variables and contains missing values in several fields.

There are 14 features:
1. pclass – passenger ticket class (1/2/3)
2. survived – whether the passenger survived (0/1)
3. name – passenger’s full name
4. sex – passenger’s gender (female/male)
5. age – passenger’s age
6. sibsp – number of siblings and spouses aboard
7. parch – number of parents and children aboard
8. ticket – passenger’s ticket number
9. fare – fare paid for the ticket
10. cabin – cabin number
11. embarked – port where the passenger boarded the ship (C/Q/S)
12. boat – lifeboat number the passenger boarded
13. body – body identification number
14. home.dest – passenger’s home and destination

In [16]:
missing_values = df.isnull().sum()
print(missing_values)

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64


Using df.isnull().sum(), I identified that several columns contain missing data. Specifically, there are 263 missing values in the “age” column, 1 in “fare”, 1014 in “cabin”, 2 in “embarked”, 823 in “boat”, 1188 in “body”, and 564 in “home.dest”.

In [17]:
missing_values_percentage = df.isnull().mean()
print(missing_values_percentage)

pclass       0.000000
survived     0.000000
name         0.000000
sex          0.000000
age          0.200917
sibsp        0.000000
parch        0.000000
ticket       0.000000
fare         0.000764
cabin        0.774637
embarked     0.001528
boat         0.628724
body         0.907563
home.dest    0.430863
dtype: float64


Using df.isnull().mean(), we observe that approximately 20% of the values in the “age” column and 43% in the “home.dest” column are missing. This suggests that not all passengers provided complete personal information, possibly due to privacy concerns, lack of documentation, or inconsistent data collection procedures.

About 91% of the “body” values are missing. Survivors have no body identification number at all, and many of the passengers who perished were never recovered or identified. A similar situation applies to the “boat” column, where 63% of the data is missing: not every passenger was able to board a lifeboat.

What stands out the most is that 77% of the “cabin” values are missing, and this is more difficult to explain. It could be due to lower-class passengers not being assigned specific cabins, or inconsistent record-keeping during boarding.
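One way to probe the ticket-class explanation is to compare the share of missing “cabin” values across “pclass”. The sketch below uses a small made-up frame called toy (a hypothetical stand-in for df, with illustrative values only), so it shows the technique rather than any result for the real data.

```python
import pandas as pd

# Toy stand-in for df; the values are made up for illustration only.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "cabin": ["B5", "C22", None, "D7", None, None, None, "F2"],
})

# Share of missing cabin values within each ticket class.
cabin_missing_by_class = toy["cabin"].isnull().groupby(toy["pclass"]).mean()
print(cabin_missing_by_class)
```

Running the same groupby on df would show whether the missing share actually rises from first to third class.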

Before using functions like isnull().sum() or performing any transformations, it’s important to ensure that the data types are appropriate. If necessary, we can convert non-numeric columns to numeric types using pd.to_numeric() or other preprocessing techniques.
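As a minimal sketch of such a conversion, the example below applies pd.to_numeric() to a toy column. It assumes, purely for illustration, that numbers arrived as strings and that “?” marks a missing entry; the real columns loaded from the ARFF file may already have suitable types, which df.dtypes would show.

```python
import pandas as pd

# Toy column; the string values and the "?" marker are assumptions for this sketch.
raw = pd.Series(["29.0", "0.9167", None, "?"], name="age")

# errors="coerce" turns unparseable entries into NaN instead of raising an error.
age_numeric = pd.to_numeric(raw, errors="coerce")

print(age_numeric.dtype)
print(age_numeric.isnull().sum())
```

Coercing first means placeholders such as “?” are counted as missing by isnull() instead of being treated as valid strings.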

In [18]:
# df.shape

In [19]:
df['boat_null'] = np.where(df['boat'].isnull(), 1, 0)
result_boat = df.groupby('survived')['boat_null'].mean()
print(result_boat)

survived
0    0.988875
1    0.046000
Name: boat_null, dtype: float64


The result above shows that about 99% of the passengers who did not survive have missing values in the “boat” column, compared with only 4.6% of those who survived. This suggests a strong relationship between the missingness in the “boat” field and the “survived” status. Therefore, we can reasonably assume that the missing values in the “boat” column are Missing Not at Random (MNAR).

In [20]:
df['body_null'] = np.where(df['body'].isnull(), 1, 0)
result_body = df.groupby('survived')['body_null'].mean()
print(result_body)

survived
0    0.850433
1    1.000000
Name: body_null, dtype: float64


The result above indicates that 85% of the passengers who did not survive have missing values in the “body” column. Moreover, 100% of the survivors have a missing value there, since a body identification number exists only for recovered victims. This strongly supports the assumption that the missing values in the “body” column are Missing Not at Random (MNAR), as their presence is directly linked to survival.

In [21]:
# Age
df['age_null'] = np.where(df['age'].isnull(), 1, 0)
result_age = df.groupby('survived')['age_null'].mean()
print(result_age)

survived
0    0.234858
1    0.146000
Name: age_null, dtype: float64


The result above shows that 23% of passengers who did not survive and 15% of those who survived have missing values in the “age” column. The gap is modest, so I treat these values as Missing Completely at Random (MCAR) as a working assumption: the missingness in the “age” column is taken to be unrelated to the other variables in the dataset, including the outcome of survival. Strictly speaking, the 9-point difference hints at a weak dependence on survival, which would make this a weak MAR case instead.
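A chi-square test of independence is one way to check whether the gap between the two groups is larger than chance alone would produce. The sketch below is hedged: the counts are reconstructed from the proportions printed above under the assumption of 809 non-survivors and 500 survivors, and the statistic is computed by hand without a continuity correction.

```python
import math
import numpy as np

# Counts reconstructed from the printed proportions, assuming
# 809 non-survivors and 500 survivors (group sizes are an assumption).
# Columns: age present, age missing.
observed = np.array([[619, 190],   # survived == 0
                     [427,  73]])  # survived == 1

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Expected counts if missingness were independent of survival.
expected = row_totals @ col_totals / observed.sum()

# Pearson chi-square statistic (no continuity correction).
chi2 = ((observed - expected) ** 2 / expected).sum()

# For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")
```

A small p-value would argue against strict MCAR with respect to “survived”, while a large one would be compatible with it.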

In [22]:
# Cabin
df['cabin_null'] = np.where(df['cabin'].isnull(), 1, 0)
result_cabin = df.groupby('survived')['cabin_null'].mean()
print(result_cabin)

survived
0    0.873918
1    0.614000
Name: cabin_null, dtype: float64


The result above shows that 87% of passengers who did not survive and 61% of those who survived have missing values in the “cabin” column. Since the probability of missingness in this column appears to depend on the outcome variable “survived”, we can classify these missing values as Missing at Random (MAR). This suggests that the missingness in “cabin” is related to other observed variables (like survival status), rather than being completely random or directly tied to unobserved data.

Handling Missing Values

As our features fall into two categories, numerical and categorical, we need to handle their missing values separately. We can replace the missing values in categorical columns with the word "Unknown". For numerical columns, if the missing values are MCAR or weak MAR, we can replace them with the median or mean, or build a model to predict them. If the missing values are MNAR, the missingness itself is meaningful, so it is better to create a new binary column that captures this information.
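The strategy described above can be sketched as follows. The frame toy and the column name boat_missing are hypothetical and the values are made up; the choice of the median for “age” is just one of the options mentioned, not a recommendation for the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are made up for illustration only.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "embarked": ["S", None, "C", "S"],
    "boat": ["2", None, None, "11"],
})

# Informative missingness: record it as a binary flag before filling anything.
toy["boat_missing"] = toy["boat"].isnull().astype(int)

# Numerical column: fill the gaps with the median of the observed values.
toy["age"] = toy["age"].fillna(toy["age"].median())

# Categorical columns: fill the gaps with an explicit "Unknown" label.
for col in ["embarked", "boat"]:
    toy[col] = toy[col].fillna("Unknown")

print(toy)
```

Creating the indicator column before filling matters, because once “boat” is filled with "Unknown" the original missingness can no longer be read off with isnull().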