
**Importing Libraries and Loading Data**

In [5]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

In [6]:
# Load the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [7]:
# Basic overview
print(train.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


First, we import the Python libraries used for data manipulation and machine learning, then load the Titanic training and test sets used throughout the analysis and model building. The files are assumed to be stored locally as train.csv and test.csv. The preview above shows that the data mixes numeric columns (Age, Fare), categorical columns (Sex, Embarked), and free-text columns (Name, Ticket), and that Cabin already contains missing values (NaN) within the first five rows.

**Feature Engineering**

In [8]:
def feature_engineering(df):
    # Extract titles from names
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

    # Simplify titles
    df['Title'] = df['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                       'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    df['Title'] = df['Title'].replace('Mlle', 'Miss')
    df['Title'] = df['Title'].replace('Ms', 'Miss')
    df['Title'] = df['Title'].replace('Mme', 'Mrs')

    # Creating FamilySize
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

    # IsAlone feature
    df['IsAlone'] = 0
    df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1

    return df

# Apply feature engineering to both train and test sets
train = feature_engineering(train)
test = feature_engineering(test)

In this step, we create new features that may improve the model's predictive performance. Feature engineering applies domain knowledge to turn raw columns into signals a model can use. We extract each passenger's title from the Name column, which hints at social status, gender, and marital status. Uncommon titles are grouped under Rare, and the variants Mlle, Ms, and Mme are mapped to Miss, Miss, and Mrs respectively. We also create FamilySize by adding SibSp and Parch, plus one for the passenger, to capture the size of the family aboard, and an IsAlone flag that is 1 when FamilySize equals 1.
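To make the extraction concrete, the sketch below applies the same regular expression to three names taken from the preview printed earlier, then derives FamilySize and IsAlone for them. The `sample` frame is built by hand purely for illustration and is not part of the notebook's data flow.

```python
import pandas as pd

# Hand-built sample using three rows from the train.head() preview (illustrative only)
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 0],
})

# The title is the run of letters that follows a space and precedes a period
sample["Title"] = sample["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Family size counts siblings/spouses, parents/children, and the passenger
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# A passenger is alone when the family consists of just that passenger
sample["IsAlone"] = (sample["FamilySize"] == 1).astype(int)

print(sample[["Title", "FamilySize", "IsAlone"]])
```

The extracted titles are Mr, Miss, and Mrs, with family sizes 2, 1, and 2, so only the second passenger is flagged as alone.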

**Data Preprocessing and Model Pipeline**

In [None]:
# Selecting features for the model
features = ["Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked", "Title", "FamilySize", "IsAlone"]
X = train[features]
y = train["Survived"]
X_test = test[features]

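# Preprocessing for numerical and categorical data
# Fare is listed here: any feature left out of both lists is dropped by ColumnTransformer
numerical_cols = ["SibSp", "Parch", "Fare", "FamilySize"]
categorical_cols = ["Pclass", "Sex", "Embarked", "Title", "IsAlone"]

# Impute missing numerical values; strategy="constant" fills with 0 by default
numerical_transformer = SimpleImputer(strategy="constant")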

# One-hot encode categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Model pipeline
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('model', RandomForestClassifier(n_estimators=100, random_state=0))])

Now we prepare the data for modeling by handling missing values and encoding categorical variables. SimpleImputer fills missing values: a constant for the numerical columns and the most frequent category for the categorical ones. OneHotEncoder then converts the categorical columns into indicator columns, and handle_unknown='ignore' keeps categories that appear only in the test set from raising an error. ColumnTransformer applies each transformation to the appropriate columns, and the outer Pipeline chains this preprocessing with a RandomForestClassifier of 100 trees. Because the preprocessing lives inside the pipeline, it is fitted on the training data and then applied unchanged to the test data.
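As a rough illustration of how such a pipeline behaves end to end, the sketch below fits a reduced version of it on a tiny invented frame. The names `toy_train`, `toy_labels`, `toy_test`, and `clf` are hypothetical, the values are made up, and only three columns are used. It is meant to show the mechanics of imputation and unknown-category handling, not to reproduce the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data invented for illustration; not taken from the Titanic files
toy_train = pd.DataFrame({
    "SibSp": [1, 0, 0, 1, 0, 3],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", np.nan, "S", "Q", "S"],  # one missing category
})
toy_labels = [0, 1, 1, 1, 0, 0]

toy_test = pd.DataFrame({
    "SibSp": [0, np.nan],        # missing number, imputed with the constant 0
    "Sex": ["female", "male"],
    "Embarked": ["C", "X"],      # "X" never appears in the training data
})

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), ["SibSp"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Sex", "Embarked"]),
])

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Fitting learns the imputation values and the one-hot categories from toy_train only
clf.fit(toy_train, toy_labels)

# The same fitted transformations are reused on the unseen rows
predictions = clf.predict(toy_test)
print(predictions)
```

The unseen category "X" does not raise an error: with handle_unknown="ignore", its Embarked indicator columns are simply all zero for that row.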