## Modelling

The aim of this project is to classify whether passengers on the Titanic survived or died.
In the last part of the project we identified three _features_ we want to add to our model.
- Age
- Gender
- Class

I've built and evaluated two models that you could use, and drawn conclusions from both. Note that the code below trains on Sex, Fare and Pclass, so it uses Fare rather than Age.

## Random Forest Classifier 

Why you might choose this model for this problem:

1. **Handling Non-Linear Relationships**: The Titanic dataset likely contains non-linear relationships between features (e.g., fare, sex, and class) and the target variable (survival). Random Forest can capture these non-linear relationships effectively by combining multiple decision trees.

2. **Ensemble Learning**: Random Forest is an ensemble learning method that aggregates predictions from multiple decision trees. Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split; the trees' predictions are then combined (by majority vote for classification) to produce a more robust final prediction. This helps reduce overfitting and improves generalisation.

3. **Dealing with Missing Values**: Some Random Forest implementations can handle missing values without imputation, but whether scikit-learn's RandomForestClassifier accepts them depends on the library version. To stay on the safe side, this notebook drops rows with missing values before training.

4. **Feature Importance**: Random Forest provides a feature importance measure, which can help identify which features (gender, fare, class) are most influential in predicting survival. This information can be valuable for understanding the underlying factors affecting the outcome.

5. **Robustness to Outliers**: Random Forest is less sensitive to outliers in the data compared to some other machine learning algorithms, making it more reliable when dealing with noisy datasets.

Overall, a Random Forest classifier is a versatile and powerful algorithm that is well-suited for both binary classification tasks like predicting survival and dealing with a mix of numeric and categorical features like those present in the Titanic dataset. It strikes a good balance between performance, interpretability, and ease of use, making it a popular choice for many classification problems.
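
As a rough illustration of the feature importance point above, the sketch below fits a small forest and reads its `feature_importances_` attribute. The data is synthetic and purely hypothetical (it is not the Titanic dataset): the label is built to depend mostly on `Sex`, so that column should rank first. The names `X_demo`, `y_demo` and `forest` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the Titanic dataset), for illustration only
rng = np.random.default_rng(0)
n = 400
X_demo = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),
    "Fare": rng.uniform(5, 100, n),
    "Pclass": rng.integers(1, 4, n),
})

# The label copies 'Sex' but is flipped for roughly 10% of rows (label noise)
flip = rng.random(n) < 0.1
y_demo = np.where(flip, 1 - X_demo["Sex"], X_demo["Sex"])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances are normalised to sum to 1
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favour continuous or high-cardinality features such as `Fare`, so they are best treated as a first indication rather than a definitive ranking.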

In [33]:
# If `import sklearn` fails, install the package with: pip install scikit-learn

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [45]:
# Load the Titanic dataset (same as before)
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

#### Data Preprocessing

In [35]:
# Select relevant features and the target variable (same as before)
features = ['Sex', 'Fare', 'Pclass']
target = 'Survived'

# Preprocess data: Convert 'Sex' to numeric (0 for male, 1 for female)
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Drop rows with missing values for simplicity (this is not the best practice in a real-world scenario)
df = df.dropna(subset=features + [target])
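
Since dropping rows is flagged above as not being best practice, here is a minimal sketch of one common alternative: filling gaps with the column median using scikit-learn's `SimpleImputer`. The tiny `toy` DataFrame is hypothetical and only there to make the example self-contained; this is not what the notebook itself does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical four-row frame with one missing fare
toy = pd.DataFrame({
    "Sex": [0, 1, 1, 0],
    "Fare": [7.25, np.nan, 8.05, 53.10],
    "Pclass": [3, 1, 3, 1],
})

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled["Fare"].tolist())  # the missing fare becomes 8.05, the median of the other three
```

In a real workflow the imputer would be fitted on the training set only and then applied to the test set, so that no information leaks from the test data.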

**Note on splitting data** 

The data is split into training and testing sets to evaluate the performance of the machine learning model effectively. The purpose of this splitting process is to train the model on one portion of the data (training set) and then assess its performance on another, previously unseen portion (testing set). This ensures that the model's ability to generalize to new, unseen data is evaluated, which is crucial to avoid overfitting.

Here's an explanation of the parameters used in the train_test_split function:

df[features]: This selects the feature columns (X) from the DataFrame 'df'. These are the input variables used to predict the target variable.

df[target]: This selects the target column (y) from the DataFrame 'df'. This is the variable we want to predict.

test_size=0.2: This parameter specifies the proportion of the data that should be used for testing. In this case, 20% of the data will be reserved for testing, while the remaining 80% will be used for training.

random_state=19: This parameter sets a seed for the random number generator, ensuring reproducibility. Every time you run the code with the same random_state value, you will get the same random split, which is useful for debugging and comparing results. The Random Forest split below uses 19; the Logistic Regression section later uses 42, so the two models are evaluated on different test sets.

In [36]:
# Split the data into training and testing sets (same as before)
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=19)
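
To make the effect of `test_size` and `random_state` concrete, the sketch below splits a small made-up array three times. The array contents and the helper name `split` are assumptions for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 rows, 2 feature columns, alternating 0/1 labels
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

def split(seed):
    """Split the toy data 80/20 using the given seed."""
    return train_test_split(X, y, test_size=0.2, random_state=seed)

X_train_a, X_test_a, _, _ = split(19)
X_train_b, X_test_b, _, _ = split(19)
X_train_c, X_test_c, _, _ = split(42)

print(X_train_a.shape, X_test_a.shape)  # 40 training rows, 10 test rows

# Same seed gives the same test rows; a different seed gives a different selection
same_seed_matches = np.array_equal(X_test_a, X_test_b)
different_seed_matches = np.array_equal(X_test_a, X_test_c)
print(same_seed_matches, different_seed_matches)
```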

#### Model Build

In [37]:
# Create and train the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=19)
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

#### Model Evaluation

In [38]:
# Evaluate the Random Forest classifier
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Random Forest Classifier Metrics:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Random Forest Classifier Metrics:
Accuracy: 0.8603351955307262
Precision: 0.8181818181818182
Recall: 0.8059701492537313
F1 Score: 0.8120300751879699


**In-depth look at the key metrics**

1. **Accuracy**: Accuracy measures the overall correctness of the model's predictions. It is calculated as the ratio of correct predictions to the total number of predictions. In this case, the accuracy is approximately 0.860, which means the model correctly predicted the survival status of about 86.0% of the passengers in the test set.

2. **Precision**: Precision measures the model's ability to correctly identify positive instances (survived passengers) among all the instances it predicted as positive. It is calculated as the ratio of true positive predictions to the total number of positive predictions (both true positives and false positives). In this case, the precision is around 0.818, indicating that when the model predicts that a passenger survived, it is correct about 81.8% of the time.

3. **Recall**: Recall, also known as sensitivity or true positive rate, measures the model's ability to find the positive instances among all the actual positive instances. It is calculated as the ratio of true positive predictions to the total number of actual positive instances (both true positives and false negatives). In this case, the recall is approximately 0.806, which means the model correctly identified about 80.6% of the actual survivors.

4. **F1 Score**: The F1 score is the harmonic mean of precision and recall, giving a single metric that balances the two. It is calculated as 2 * (precision * recall) / (precision + recall). In this case, the F1 score is around 0.812, indicating a good balance between precision and recall.
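
The four formulas above can be checked by hand. In the sketch below, the confusion-matrix counts are not printed anywhere in the notebook; they are inferred from the printed Random Forest metrics under the assumption of a 179-row test set, so treat them as illustrative.

```python
# Counts inferred from the printed metrics (assumption: 179 test rows)
tp, fp, fn, tn = 54, 12, 13, 100

accuracy_manual = (tp + tn) / (tp + fp + fn + tn)   # correct / all predictions
precision_manual = tp / (tp + fp)                   # correct positives / predicted positives
recall_manual = tp / (tp + fn)                      # correct positives / actual positives
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"Accuracy:  {accuracy_manual:.3f}")   # 0.860
print(f"Precision: {precision_manual:.3f}")  # 0.818
print(f"Recall:    {recall_manual:.3f}")     # 0.806
print(f"F1 Score:  {f1_manual:.3f}")         # 0.812
```

On the real data, `sklearn.metrics.confusion_matrix(y_test, y_pred)` would give the actual counts to plug in.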

Overall, the Random Forest classifier performs fairly well, with an accuracy of 86.0% and closely balanced precision and recall scores. However, it is good practice to compare these metrics with other models or baselines to confirm that the model provides useful predictions for the problem at hand. Further tuning and optimization may also improve the model's performance.

## Logistic Regression

Using Logistic Regression for the Titanic survival prediction problem has its own set of advantages:

1. **Interpretability**: Logistic Regression provides interpretable results, making it easier to understand the impact of each feature on the prediction. The coefficients of the logistic regression model represent the direction and magnitude of the influence of each feature.

2. **Efficiency**: Logistic Regression is computationally efficient and can handle large datasets at a relatively low computational cost. It is particularly useful when computational resources are limited.

3. **Probabilistic Predictions**: Logistic Regression outputs probabilities of class membership, which allows for setting different classification thresholds, depending on the specific needs of the problem. For instance, we can adjust the threshold to prioritize precision or recall, depending on the application.

4. **Works Well for Linearly Separable Data**: Logistic Regression performs well when the classes are relatively linearly separable. Although the Titanic dataset may not be perfectly linearly separable, it is worth trying logistic regression as an initial approach.

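5. **Feature Scaling**: Logistic Regression can be trained without scaling when there are only a few features, which keeps the preprocessing simple here. Note, though, that scikit-learn's LogisticRegression applies L2 regularisation by default, so feature scale does affect the fitted coefficients and how quickly the solver converges; standardising features is generally advisable.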
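
The interpretability and probabilistic-prediction points can be sketched as follows. The data is synthetic (not the Titanic dataset) and generated from a hypothetical rule in which `Sex` raises the odds of the positive class and a higher `Pclass` lowers them; the 0.7 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data generated from a known (hypothetical) rule
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),
    "Pclass": rng.integers(1, 4, n),
})
logit = 2.0 * X_demo["Sex"] - 1.0 * (X_demo["Pclass"] - 2)
prob = 1 / (1 + np.exp(-logit))
y_demo = (rng.random(n) < prob).astype(int)

model_demo = LogisticRegression()
model_demo.fit(X_demo, y_demo)

# Sign gives the direction of each feature's effect on the log-odds
coefs = pd.Series(model_demo.coef_[0], index=X_demo.columns)
print(coefs)

# Probability of the positive class, then two different decision thresholds
proba = model_demo.predict_proba(X_demo)[:, 1]
pred_default = (proba >= 0.5).astype(int)
pred_strict = (proba >= 0.7).astype(int)  # fewer positives, typically higher precision
print(pred_default.sum(), pred_strict.sum())
```

Raising the threshold trades recall for precision; which direction is preferable depends on the cost of each kind of error.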

In [39]:
from sklearn.linear_model import LogisticRegression

In [40]:
# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

#### Data Preprocessing

In [41]:
# Select relevant features and the target variable
features = ['Sex', 'Fare', 'Pclass']  # 'Sex' is the dataset's gender column; 'Pclass' is the passenger class (1, 2 or 3)
target = 'Survived'

# Preprocess data: Convert 'Sex' to numeric (0 for male, 1 for female)
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Drop rows with missing values for simplicity (this is not the best practice in a real-world scenario)
df = df.dropna(subset=features + [target])

In [42]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=42)

#### Model Build

In [43]:
# Build the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

#### Evaluation Metrics 

In [44]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Accuracy: 0.7821229050279329
Precision: 0.7536231884057971
Recall: 0.7027027027027027
F1 Score: 0.7272727272727273


## Conclusion

To compare the results of the two models (Random Forest and Logistic Regression) and determine which model should be used, let's analyse their performance metrics:

**Random Forest Classifier:**
- Accuracy: 0.860
- Precision: 0.818
- Recall: 0.806
- F1 Score: 0.812

**Logistic Regression:**
- Accuracy: 0.782
- Precision: 0.754
- Recall: 0.703
- F1 Score: 0.727

**Overall Comparison:**
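Both models show reasonably good performance, but the Random Forest model scores higher on every metric: accuracy, precision, recall and F1 score. Bear in mind that the two models were evaluated on different test splits (random_state=19 for Random Forest, random_state=42 for Logistic Regression), so part of the gap may come from the split rather than the model.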

**Comparison and Justification:**

1. **Accuracy**: The Random Forest model has a higher accuracy (0.860) than the Logistic Regression model (0.782), a gap of about 8 percentage points.

2. **Precision**: The Random Forest model has a higher precision (0.818) than the Logistic Regression model (0.754). This means that a larger share of the passengers it predicts as survivors actually survived.

3. **Recall**: The Random Forest model has a higher recall (0.806) than the Logistic Regression model (0.703). This means that it finds a larger share of the actual survivors.

4. **F1 Score**: The Random Forest model has the higher F1 score, at 0.812 against 0.727 for Logistic Regression. The F1 score is a balanced metric that considers both precision and recall.

**Justification and Decision:**
Considering the overall performance, the Random Forest model leads on accuracy, precision, recall and F1 score, which suggests that it both identifies more of the actual survivors and makes fewer false positive predictions.

Therefore, based on the metrics, the Random Forest model seems more suitable for this specific Titanic survival prediction problem. However, it's essential to take into account other factors such as model complexity, interpretability, and computational resources when making the final decision. If model interpretability is a priority and the dataset is relatively small, the Logistic Regression model might still be a valid choice due to its simplicity and ease of interpretation. On the other hand, if predictive performance is the primary concern, the Random Forest model is the preferred option. Note that neither model had its hyperparameters tuned, and tuning may change the results.
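
One way to make the comparison fairer is to score both models on identical cross-validation folds rather than on separate single splits (the notebook uses `random_state=19` for one split and `random_state=42` for the other). The sketch below is one possible approach, not the method used above: `compare_models` is a hypothetical helper, and the demo data comes from `make_classification` rather than the Titanic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def compare_models(X, y, seed=0):
    """Return mean 5-fold accuracy per model, using the same folds for both."""
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    candidates = {
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        for name, model in candidates.items()
    }

# Synthetic three-feature data, only to keep the example self-contained
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
scores = compare_models(X_demo, y_demo)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

To apply this to the notebook's data, pass `df[features]` and `df[target]` in place of the synthetic arrays.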