# Titanic Dataset Description
- **PassengerId** - The ID of each passenger
- **Survived** - Survival (0 = No; 1 = Yes)
- **Pclass** - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- **Name** - Name
- **Sex** - Sex
- **Age** - Age
- **SibSp** - Number of Siblings/Spouses Aboard
- **Parch** - Number of Parents/Children Aboard
- **Ticket** - Ticket Number
- **Fare** - Passenger Fare
- **Embarked** - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
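
The coded values in the list above can be decoded into readable labels when inspecting the data. The following is a minimal sketch, assuming a hypothetical three-row frame (`sample`) that only mimics the coded columns; the lookup dictionaries are copied from the descriptions above, and their names are illustrative.

```python
import pandas as pd

# Hypothetical three-row sample; it only mimics the coded columns described above.
sample = pd.DataFrame({
    "Survived": [0, 1, 0],
    "Pclass": [3, 1, 2],
    "Embarked": ["Q", "S", "C"],
})

# Lookup tables taken from the column descriptions.
survived_labels = {0: "No", 1: "Yes"}
pclass_labels = {1: "1st", 2: "2nd", 3: "3rd"}
port_labels = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Series.map replaces each code with its label; unknown codes would become NaN.
decoded = sample.assign(
    Survived=sample["Survived"].map(survived_labels),
    Pclass=sample["Pclass"].map(pclass_labels),
    Embarked=sample["Embarked"].map(port_labels),
)
print(decoded)
```

Keeping the codes in the working frame and decoding only for display leaves the numeric columns usable for later calculations.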

In [1]:
import pandas as pd

titanic_data = pd.read_csv("Titanic.csv")
titanic_data

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,S
414,1306,1,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C
415,1307,0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,S
416,1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,S


### 2.1 Please convert the categorical variables into R factors or Python Pandas Categoricals

In [2]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 36.1+ KB


In [3]:
# Convert to category types
titanic_data['Survived'] = titanic_data['Survived'].astype('category')
titanic_data['Pclass'] = titanic_data['Pclass'].astype('category')
titanic_data['SibSp'] = titanic_data['SibSp'].astype('category')
titanic_data['Parch'] = titanic_data['Parch'].astype('category')
titanic_data['Sex'] = titanic_data['Sex'].astype('category')
titanic_data['Ticket'] = titanic_data['Ticket'].astype('category')
titanic_data['Embarked'] = titanic_data['Embarked'].astype('category')

In [4]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  418 non-null    int64   
 1   Survived     418 non-null    category
 2   Pclass       418 non-null    category
 3   Name         418 non-null    object  
 4   Sex          418 non-null    category
 5   Age          332 non-null    float64 
 6   SibSp        418 non-null    category
 7   Parch        418 non-null    category
 8   Ticket       418 non-null    category
 9   Fare         417 non-null    float64 
 10  Embarked     418 non-null    category
dtypes: category(7), float64(2), int64(1), object(1)
memory usage: 17.4+ KB


In [5]:
# Set the "PassengerId" column as the index of the Titanic dataset
titanic_data.set_index("PassengerId", inplace=True)

# Display the modified DataFrame with the new index
titanic_data

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,Q
893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,S
894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,Q
895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,S
896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,S
...,...,...,...,...,...,...,...,...,...,...
1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,S
1306,1,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C
1307,0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,S
1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,S
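
The column-by-column conversion in section 2.1 can also be written as a single `astype` call that takes a dict of column names. The sketch below assumes a hypothetical toy frame (`toy`) rather than the real file, and it additionally treats `Pclass` as an ordered categorical, which is an optional design choice and not something the original cells do.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
    "Pclass": [3, 3, 2],
    "Sex": ["male", "female", "male"],
    "Embarked": ["Q", "S", "Q"],
})

# Assumption: Pclass codes are ordered 1 < 2 < 3, so an ordered dtype is one option.
pclass_dtype = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)

# One astype call converts every listed column; the dict maps column -> dtype.
dtype_map = {col: "category" for col in ["Survived", "Sex", "Embarked"]}
dtype_map["Pclass"] = pclass_dtype
toy = toy.astype(dtype_map)

print(toy.dtypes)
print(toy["Pclass"].cat.categories.tolist())
```

Building the mapping from a list of column names makes it harder to assign the wrong source column to a target by accident.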


### 2.2 Calculate the number of NA values in each column with any “apply” function in R or Python, and remove those records (rows) with NA values.


In [6]:
# Calculate the number of NA values in each column using apply function
na_count = titanic_data.apply(lambda x: x.isna().sum())

# Print NA counts
na_count

Survived     0
Pclass       0
Name         0
Sex          0
Age         86
SibSp        0
Parch        0
Ticket       0
Fare         1
Embarked     0
dtype: int64

In [7]:
# Remove rows with NA values
df_no_na = titanic_data.dropna()

# Confirm that no NA values remain in any column
df_no_na.apply(lambda x: x.isna().sum())

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Ticket      0
Fare        0
Embarked    0
dtype: int64
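
For comparison, the `apply`-based count used above gives the same result as the vectorized `isna().sum()`, and `dropna` can be limited to selected columns. This is a minimal sketch on a hypothetical four-row frame (`toy`); the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values.
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, np.nan],
    "Fare": [7.8292, 7.0, np.nan, 8.6625],
    "Sex": ["male", "female", "male", "male"],
})

# Column-wise NA count with apply, as the exercise asks for.
na_by_apply = toy.apply(lambda col: col.isna().sum())

# Equivalent vectorized form without apply.
na_vectorized = toy.isna().sum()

# Drop rows with an NA in any column, or only in chosen columns.
complete_rows = toy.dropna()
age_only = toy.dropna(subset=["Age"])

print(na_by_apply)
print(len(complete_rows), len(age_only))
```

Restricting `dropna` with `subset` keeps more rows when only some columns matter for a later step.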

### 2.3 Calculate the average fare (‘Fare’) among the different classes (‘Pclass’).
- Please sort the average fare in ascending order and show the result.

In [8]:
# Calculate average fare by Pclass and sort in ascending order
avg_fare = titanic_data.groupby("Pclass")["Fare"].mean().sort_values(ascending=True)
# Print the result
avg_fare

Pclass
3    12.459678
2    22.202104
1    94.280297
Name: Fare, dtype: float64
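
The same group-then-sort pattern can be sketched on a small hypothetical frame (`toy`) whose fares are made up for illustration. The `observed=True` argument is an addition that the original cell does not use: it limits the result to classes that actually occur, and it avoids a deprecation warning that some recent pandas versions raise when grouping on a categorical column.

```python
import pandas as pd

# Hypothetical toy data; the fares are illustrative, not from the dataset.
toy = pd.DataFrame({
    "Pclass": pd.Categorical([3, 3, 2, 1, 1]),
    "Fare": [8.0, 12.0, 20.0, 90.0, 110.0],
})

# Mean fare per class, sorted from cheapest to most expensive.
avg_fare = (
    toy.groupby("Pclass", observed=True)["Fare"]
    .mean()
    .sort_values(ascending=True)
)
print(avg_fare)
```

Note that `mean()` skips missing fares by default, so a row with a missing `Fare` does not need to be removed first.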

### 2.4 Calculate the correlation matrix between numeric variables.
- Which two features have the strongest positive correlation? Which two have the strongest negative correlation? Please explain your answer.

In [9]:
# Drop the free-text columns ("Name" and "Ticket") from the NA-free DataFrame
df = df_no_na.drop(["Name", "Ticket"], axis=1)

# Perform one-hot encoding for categorical variables
df = pd.get_dummies(df, columns=['Sex', 'Embarked'])

# Print the resulting DataFrame
df

PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
892,0,3,34.5,0,0,7.8292,False,True,False,True,False
893,1,3,47.0,1,0,7.0000,True,False,False,False,True
894,0,2,62.0,0,0,9.6875,False,True,False,True,False
895,0,3,27.0,0,0,8.6625,False,True,False,False,True
896,1,3,22.0,1,1,12.2875,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
1301,1,3,3.0,1,1,13.7750,True,False,False,False,True
1303,1,1,37.0,1,0,90.0000,True,False,False,True,False
1304,1,3,28.0,0,0,7.7750,True,False,False,False,True
1306,1,1,39.0,0,0,108.9000,True,False,True,False,False


In [10]:
# Calculate the correlation matrix of numerical variables
correlation_matrix = df.corr()

# Print correlation matrix
print("\ncorrelation matrix:")
correlation_matrix


correlation matrix:


,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
Survived,1.0,-0.117886,0.005104,0.07545,0.16371,0.192672,1.0,-1.0,0.079696,0.088764,-0.121749
Pclass,-0.117886,1.0,-0.502919,0.00115,-0.003279,-0.585726,-0.117886,0.117886,-0.386316,0.199175,0.252389
Age,0.005104,-0.502919,1.0,-0.088679,-0.058506,0.337932,0.005104,-0.005104,0.185669,-0.016352,-0.163895
SibSp,0.07545,0.00115,-0.088679,1.0,0.350736,0.151836,0.07545,-0.07545,-0.045157,-0.078218,0.083968
Parch,0.16371,-0.003279,-0.058506,0.350736,1.0,0.246088,0.16371,-0.16371,0.028491,-0.116345,0.035935
Fare,0.192672,-0.585726,0.337932,0.151836,0.246088,1.0,0.192672,-0.192672,0.35777,-0.122554,-0.266957
Sex_female,1.0,-0.117886,0.005104,0.07545,0.16371,0.192672,1.0,-1.0,0.079696,0.088764,-0.121749
Sex_male,-1.0,0.117886,-0.005104,-0.07545,-0.16371,-0.192672,-1.0,1.0,-0.079696,-0.088764,0.121749
Embarked_C,0.079696,-0.386316,0.185669,-0.045157,0.028491,0.35777,0.079696,-0.079696,1.0,-0.153123,-0.84782
Embarked_Q,0.088764,0.199175,-0.016352,-0.078218,-0.116345,-0.122554,0.088764,-0.088764,-0.153123,1.0,-0.394211
Embarked_S,-0.121749,0.252389,-0.163895,0.083968,0.035935,-0.266957,-0.121749,0.121749,-0.84782,-0.394211,1.0


In [11]:
# Calculate the correlation between "Survived" and other numerical variables
survival_correlation = correlation_matrix['Survived']

# Print the correlation between "Survived" and other numerical variables
print("\nRelevance to 'survival':")
survival_correlation.sort_values()


Relevance to 'survival':


Sex_male     -1.000000
Embarked_S   -0.121749
Pclass       -0.117886
Age           0.005104
SibSp         0.075450
Embarked_C    0.079696
Embarked_Q    0.088764
Parch         0.163710
Fare          0.192672
Survived      1.000000
Sex_female    1.000000
Name: Survived, dtype: float64
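
Sorting a single column only ranks correlations with `Survived`. To rank every pair of features, one option is to read the upper triangle of the matrix, which skips the diagonal and the mirrored duplicates. The sketch below uses a hypothetical numeric frame (`toy`) with neutral column names; its values are illustrative and unrelated to the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric toy data; the values are illustrative only.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],
    "d": [1.0, 3.0, 2.0, 5.0, 4.0],
})
corr = toy.corr()

# Indices of the strict upper triangle: each unordered pair appears exactly once.
values = corr.to_numpy()
rows, cols = np.triu_indices_from(values, k=1)
labels = [f"{corr.index[i]} / {corr.columns[j]}" for i, j in zip(rows, cols)]
pairs = pd.Series(values[rows, cols], index=labels)

strongest_positive = pairs.idxmax()
strongest_negative = pairs.idxmin()
print(pairs.sort_values())
print(strongest_positive, strongest_negative)
```

Applied to a one-hot encoded frame, this ranking will also surface pairs of dummy columns from the same variable, which are negatively correlated by construction.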

### Conclusion

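#### The two strongest negative correlations with `Survived` are:
1. `Sex_male` with `Survived`: -1.0. A coefficient of exactly -1 between two binary columns means that in this file every male passenger is recorded as not surviving and every female passenger as surviving. `Survived` therefore carries the same information as `Sex` here, which is stronger than saying that men had a lower survival rate.
2. `Embarked_S` with `Survived`: -0.121749, a weak negative correlation. Passengers who boarded at Southampton were slightly less likely to be recorded as survivors.

#### The two strongest positive correlations with `Survived` are:
1. `Sex_female` with `Survived`: 1.0. This is the mirror image of `Sex_male`, because the two dummy columns are complements of each other.
2. `Fare` with `Survived`: 0.192672, a weak positive correlation between higher fares and survival.

Across the whole matrix, leaving aside the pairs that involve the `Sex_*` dummies, the strongest positive pair is `Embarked_C` with `Fare` (0.35777), closely followed by `SibSp` with `Parch` (0.350736). The strongest negative pair is `Embarked_C` with `Embarked_S` (-0.84782), which largely reflects the one-hot encoding, followed by `Pclass` with `Fare` (-0.585726): higher class numbers go with cheaper tickets.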
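
The ±1.0 coefficients between `Sex_female` and `Sex_male` follow from one-hot encoding a two-level variable: each dummy column is the complement of the other. The sketch below, on a hypothetical six-row frame (`toy`), shows this and uses `drop_first=True` to keep only one dummy per variable; `drop_first` and `dtype=int` are choices made for this illustration and are not used in the original cells.

```python
import pandas as pd

# Hypothetical toy data; survival here is deliberately not determined by sex.
toy = pd.DataFrame({
    "Survived": [0, 1, 0, 1, 1, 0],
    "Sex": ["male", "female", "male", "female", "male", "female"],
})

# Full one-hot encoding: Sex_female and Sex_male are complements.
full = pd.get_dummies(toy, columns=["Sex"], dtype=int)

# drop_first=True removes the first level (female), leaving a single dummy.
reduced = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)

dummy_corr = full["Sex_female"].corr(full["Sex_male"])
survival_corr = full["Survived"].corr(full["Sex_female"])
print(list(reduced.columns))
print(dummy_corr, survival_corr)
```

In this toy frame the dummies still correlate at -1 with each other, while `Survived` correlates only moderately with `Sex_female`. A coefficient of exactly ±1 against `Survived` appears only when the two columns match or mirror each other row for row.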
