# Module 3: Training and Prediction

## Objectives
1. 
2. **Simple Model Training**:
   - Construct a basic predictive model using logistic regression to determine how well the 'survived' outcome can be predicted from other features.
   - Evaluate the performance of the model to understand the predictability of survival based on the dataset.
3. **First Submission**:
   - Use the trained model to make predictions on new, unseen test data.
   - Prepare and submit the prediction results, ensuring proper formatting and adherence to the submission guidelines.

## Dataset Description
The Titanic dataset describes passengers aboard the Titanic. It combines demographic information, ticket details, and survival outcomes, which makes it well suited to binary classification tasks such as predicting survival. The features are:

- **survived**: Survival status (0 = No, 1 = Yes)
- **pclass**: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
- **sex**: Sex of the passenger (0 = female, 1 = male)
- **age**: Age of the passenger
- **sibsp**: Number of siblings/spouses aboard
- **parch**: Number of parents/children aboard
- **fare**: Passenger fare
- **cabin**: Whether the passenger had a cabin (0 = No, 1 = Yes)
- **embarked**: Port of embarkation (encoded as integers)

## Expected Outcomes
By the end of this module, we expect to:
- Develop a preliminary predictive model that estimates each passenger's chance of survival.
- Generate and submit a set of predictions for the unseen test data.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

### Data Collection

In [2]:
import requests

# URLs of the files
train_data_url = 'https://www.raphaelcousin.com/modules/module3/course/module3_course_train.csv'
test_data_url = 'https://www.raphaelcousin.com/modules/module3/course/module3_course_test.csv'

# Function to download a file
def download_file(url, file_name):
    response = requests.get(url)
    response.raise_for_status()  # Ensure we notice bad responses
    with open(file_name, 'wb') as file:
        file.write(response.content)
    print(f'Downloaded {file_name} from {url}')

# Downloading the files
download_file(train_data_url, 'module3_course_train.csv')
download_file(test_data_url, 'module3_course_test.csv')

Downloaded module3_course_train.csv from https://www.raphaelcousin.com/modules/module3/course/module3_course_train.csv
Downloaded module3_course_test.csv from https://www.raphaelcousin.com/modules/module3/course/module3_course_test.csv


In [3]:
df = pd.read_csv("module3_course_train.csv", sep=",", index_col='id')

## Model Building

In [4]:
# Import necessary libraries for model building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report



In [5]:
# Preparing the data
y = df['survived']
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'cabin', 'embarked']]


In [6]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
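
The split above is purely random. With a binary target it can help to keep the class ratio the same in both parts, which `train_test_split` supports through its `stratify` argument. The sketch below uses a small synthetic target rather than the course data; `toy_X`, `toy_y`, and the 40/60 class ratio are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course data (hypothetical values):
# 100 rows, 40% of them in the positive class.
toy_X = pd.DataFrame({"feature": np.arange(100)})
toy_y = pd.Series([1] * 40 + [0] * 60, name="survived")

# stratify=toy_y keeps the 40/60 class ratio in both subsets.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print("Train positive rate:", y_tr.mean())
print("Validation positive rate:", y_val.mean())
```

Whether stratification matters for this dataset depends on how imbalanced `survived` is; it is shown here only as an option.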


In [7]:
from sklearn.ensemble import RandomForestClassifier

In [8]:
# Initialize and train the Random Forest model
model = RandomForestClassifier()
model.fit(X_train, y_train)

In [9]:
# Predict on the test data
y_pred = model.predict(X_test)

In [10]:
# Predict on the training data (separate variable, so the test predictions in y_pred are kept)
y_pred_train = model.predict(X_train)

In [11]:
# Evaluate the model on the training set
accuracy = accuracy_score(y_train, y_pred_train)
conf_matrix = confusion_matrix(y_train, y_pred_train)
class_report = classification_report(y_train, y_pred_train)
print("Accuracy of the model:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

In [12]:
# Evaluate the model on the held-out test set
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Accuracy of the model:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

In [None]:
# Initialize and train the Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [None]:
# Predict on the test data
y_pred = model.predict(X_test)

In [None]:
# Predict on the training data (separate variable, so the test predictions in y_pred are kept)
y_pred_train = model.predict(X_train)

In [None]:
# Evaluate the model on the training set
accuracy = accuracy_score(y_train, y_pred_train)
conf_matrix = confusion_matrix(y_train, y_pred_train)
class_report = classification_report(y_train, y_pred_train)
print("Accuracy of the model:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Accuracy of the model:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
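
Comparing the score on the training split with the score on the held-out split is a quick way to see how much a model overfits. A minimal sketch of that comparison follows, assuming synthetic data from `make_classification` in place of the Titanic features; the helper name `compare_train_test` is hypothetical and not part of the course code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_train_test(model, X_train, X_test, y_train, y_test):
    """Fit the model and return (train accuracy, test accuracy)."""
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc


# Synthetic data standing in for the Titanic features (an assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
# train_test_split returns X_train, X_test, y_train, y_test in that order.
split = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = {}
for name, clf in [
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
    ("RandomForest", RandomForestClassifier(random_state=0)),
]:
    results[name] = compare_train_test(clf, *split)
    train_acc, test_acc = results[name]
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between the two numbers suggests the model has memorized the training data rather than learned patterns that generalize.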

In [None]:
import seaborn as sns  # needed for sns.heatmap below

conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.xticks(np.arange(len(np.unique(y_test))) + 0.5, ['Not Survived', 'Survived'])
plt.yticks(np.arange(len(np.unique(y_test))) + 0.5, ['Not Survived', 'Survived'], rotation=0)
plt.show()
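
The entries of a binary confusion matrix are enough to recompute the headline metrics by hand. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and uses made-up counts, not results from this notebook.

```python
import numpy as np

# Hypothetical counts: rows = true label, columns = predicted label.
cm = np.array([[50, 10],
               [5, 35]])

# For a 2x2 matrix in this layout, ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted survivors who survived
recall = tp / (tp + fn)           # share of actual survivors who were found

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```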

## Predict test.csv

### Data Collection

In [None]:
df_train = pd.read_csv("module3_course_train.csv", sep=",", index_col='id')
X_test = pd.read_csv("module3_course_test.csv", sep=",", index_col='id')

### Data Analysis
Before making predictions, it's crucial to ensure that the test data is consistent with the training data. You can use similar EDA techniques as applied to the training set to validate and explore the test dataset.
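
One way to make that consistency check concrete is to compare column names, dtypes, and missing values between the two frames. The helper below is a sketch: `compare_frames` is a hypothetical name, and the tiny frames are invented for illustration rather than taken from the course files.

```python
import pandas as pd


def compare_frames(train, test, target="survived"):
    """Report column, dtype, and missing-value differences between train and test."""
    expected = [c for c in train.columns if c != target]
    null_counts = test.isnull().sum()
    return {
        "missing_columns": [c for c in expected if c not in test.columns],
        "extra_columns": [c for c in test.columns if c not in expected],
        "dtype_mismatches": {
            c: (str(train[c].dtype), str(test[c].dtype))
            for c in expected
            if c in test.columns and train[c].dtype != test[c].dtype
        },
        # Keep only the test columns that actually contain missing values.
        "test_nulls": null_counts[null_counts > 0].to_dict(),
    }


# Tiny hypothetical frames, not the course data.
toy_train = pd.DataFrame({"age": [22.0, 38.0], "fare": [7.25, 71.28], "survived": [0, 1]})
toy_test = pd.DataFrame({"age": [26.0, None], "fare": [8, 53]})

report = compare_frames(toy_train, toy_test)
print(report)
```

On the real files, the same call would be `compare_frames(df_train, X_test)`, assuming the training frame holds the target in a `survived` column.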

In [None]:
# Example EDA on test data
print(X_test.describe())
print(X_test.info())
print("\nMissing values in test data:")
print(X_test.isnull().sum())

### Model Training and Prediction

In [None]:
# Preparing the data
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'cabin', 'embarked']
X_train = df_train[features]
y_train = df_train['survived']

# Train on all the labeled data available
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions on the test set, using the same feature columns in the same order
y_pred = model.predict(X_test[features])



### Generating Submission File

In [None]:
submission = pd.DataFrame({
    'id': X_test.index,
    'Survived': y_pred 
})

submission.to_csv('submission.csv', index=False, sep=',')
submission.head()
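
Before uploading, it can be worth checking the file's shape programmatically. The sketch below assumes the expected format is exactly the two columns written above (`id` and `Survived`) with 0/1 labels; the actual submission guidelines are not stated here, and `check_submission` is a hypothetical helper.

```python
import pandas as pd


def check_submission(submission, expected_ids, label_col="Survived"):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(submission.columns) != ["id", label_col]:
        return [f"unexpected columns: {list(submission.columns)}"]
    problems = []
    if submission["id"].duplicated().any():
        problems.append("duplicate ids")
    if set(submission["id"]) != set(expected_ids):
        problems.append("ids do not match the test set")
    if not submission[label_col].isin([0, 1]).all():
        problems.append("labels other than 0/1")
    return problems


# Hypothetical examples, not real predictions.
good = pd.DataFrame({"id": [1, 2, 3], "Survived": [0, 1, 1]})
bad = pd.DataFrame({"id": [1, 2, 3], "Survived": [0, 2, 1]})

print(check_submission(good, expected_ids=[1, 2, 3]))
print(check_submission(bad, expected_ids=[1, 2, 3]))
```

With the notebook's variables, the equivalent call would be `check_submission(submission, X_test.index)`.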