**In this project, we will tackle the Titanic dataset. The goal is to train a classifier that can predict the Survived column based on the other columns.**

First, we will download the dataset from a GitHub repository.

In [1]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

In [2]:
def load_titanic_data():
    """Download and extract the Titanic tarball if needed, then load both CSV files."""
    tarball_path = Path("datasets/titanic.tgz")
    if not tarball_path.is_file():
        # First run only: fetch the archive and unpack it under datasets/
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/titanic.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as titanic_tarball:
            titanic_tarball.extractall(path="datasets")
    # Returns [train DataFrame, test DataFrame]
    return [pd.read_csv(Path("datasets/titanic") / filename)
            for filename in ("train.csv", "test.csv")]

In [3]:
train_data, test_data = load_titanic_data()

In [4]:
train_data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


The goal is to predict whether or not a passenger survived.

Let's explicitly set the PassengerId column as the index column:

In [5]:
train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")

In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [9]:
train_data[train_data["Sex"]=="female"]["Age"].median()

27.0

In [10]:
train_data[train_data["Sex"]=="male"]["Age"].median()

29.0

Cabin, Age and Embarked have some null values. About 20% of the Age values are missing (177 of 891), so we need to decide what to do with them: we will replace them with the median age.
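The share of missing values per column can be checked from the `info()` output alone. The snippet below is a minimal sketch that only uses the non-null counts printed above; the variable names are illustrative.

```python
# Non-null counts copied from the train_data.info() output above.
n_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Share of missing values per column, as a percentage.
missing_pct = {
    column: round(100 * (n_rows - count) / n_rows, 1)
    for column, count in non_null.items()
}
print(missing_pct)

# With the DataFrame loaded, train_data.isna().mean() gives the same fractions directly.
```

Cabin is missing for most passengers, which is one reason it is left out of the attributes used later on.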

In [11]:
train_data.describe()

|       | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699113 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526507 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.4167 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |




*   Only 38% of the passengers **survived**. That's horrible!

*   The mean **Fare** was £32.20.

*   The mean **Age** was less than 30 years old.

In [12]:
train_data["Survived"].value_counts()

| Survived | count |
|---|---|
| 0 | 549 |
| 1 | 342 |


In [13]:
train_data["Pclass"].value_counts()

| Pclass | count |
|---|---|
| 3 | 491 |
| 1 | 216 |
| 2 | 184 |


In [14]:
train_data["Sex"].value_counts()

| Sex | count |
|---|---|
| male | 577 |
| female | 314 |


In [15]:
train_data["Embarked"].value_counts()

| Embarked | count |
|---|---|
| S | 644 |
| C | 168 |
| Q | 77 |


The Embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton.

Now let's build our preprocessing pipeline, starting with the numerical attributes:

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [24]:
numerical_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
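To make the two steps concrete, here is a minimal sketch on a made-up column of ages (the values are hypothetical, not taken from the dataset). It fills the gap with the median and standardizes by hand, then checks that the same two-step pipeline gives the same result.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column with one missing value.
ages = np.array([[22.0], [np.nan], [38.0], [26.0]])

# By hand: fill the gap with the median of the observed values, then standardize.
median = np.nanmedian(ages)  # median of 22, 26 and 38
filled = np.where(np.isnan(ages), median, ages)
by_hand = (filled - filled.mean()) / filled.std()  # population std, as StandardScaler uses

# Same two steps through a pipeline.
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
from_pipeline = toy_pipeline.fit_transform(ages)

same = np.allclose(by_hand, from_pipeline)
print(median, same)
```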

Now let's build the pipeline for the categorical attributes:

In [25]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

In [27]:
categorical_pipeline = Pipeline(
    [
        # Encode categories as numbers first; missing values stay NaN for the imputer
        ("ordinal_encoder", OrdinalEncoder()),
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("cat_encoder", OneHotEncoder(sparse_output=False)),
    ]
)

Let's join the two pipelines

In [28]:
from sklearn.compose import ColumnTransformer

numerical_attribs = ["Age", "SibSp", "Parch", "Fare"]
categorical_attribs = ["Pclass", "Sex", "Embarked"]

preprocess_pipeline = ColumnTransformer([
        ("num", numerical_pipeline, numerical_attribs),
        ("cat", categorical_pipeline, categorical_attribs),
    ])

In [29]:
X_train = preprocess_pipeline.fit_transform(train_data)
X_train

array([[-0.56573582,  0.43279337, -0.47367361, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.6638609 ,  0.43279337, -0.47367361, ...,  1.        ,
         0.        ,  0.        ],
       [-0.25833664, -0.4745452 , -0.47367361, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.10463705,  0.43279337,  2.00893337, ...,  0.        ,
         0.        ,  1.        ],
       [-0.25833664, -0.4745452 , -0.47367361, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.20276213, -0.4745452 , -0.47367361, ...,  0.        ,
         1.        ,  0.        ]])

In [30]:
y_train = train_data["Survived"]

We are now ready to train our classifier. Let's choose the **RandomForestClassifier**:

In [32]:
from sklearn.ensemble import RandomForestClassifier

In [33]:
rfc_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rfc_clf.fit(X_train, y_train)

Our model is trained, so let's use it to make predictions on the test set:

In [34]:
# Reuse the statistics learned on the training set: transform(), not fit_transform()
X_test = preprocess_pipeline.transform(test_data)
y_pred = rfc_clf.predict(X_test)
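It is worth seeing what separates `transform` from `fit_transform` on held-out data. The sketch below uses two made-up columns (hypothetical numbers) and a `StandardScaler`: reusing the fitted scaler keeps the test values on the training scale, while refitting recenters them on the test set's own mean, which is not what the trained model saw.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])  # hypothetical training column, mean 20
test = np.array([[30.0], [40.0], [50.0]])   # hypothetical test column, mean 40

scaler = StandardScaler().fit(train)

# Scaled with the training mean and std: the shift between the sets is preserved.
test_reused = scaler.transform(test)

# Scaled with the test set's own statistics: the shift disappears.
test_refit = StandardScaler().fit_transform(test)

print(test_reused.ravel())
print(test_refit.ravel())
```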

Now we need to see how good our model is. For that, we will use cross-validation:

In [37]:
from sklearn.model_selection import cross_val_score

In [39]:
rfc_scores = cross_val_score(rfc_clf, X_train, y_train, cv=10)
rfc_scores.mean()

0.8137578027465668
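For intuition, here is a rough sketch of what `cross_val_score` does for a classifier when `cv` is an integer: it splits the data into stratified folds, trains a fresh copy of the model on all folds but one, and scores it on the held-out fold. The data below is synthetic (`make_classification`), and the sizes are arbitrary assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data, not the Titanic features.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=42)

# Manual 10-fold loop: fit on nine folds, measure accuracy on the remaining one.
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X_toy, y_toy):
    fold_clf = clone(clf)  # fresh, unfitted copy for every fold
    fold_clf.fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_clf.score(X_toy[val_idx], y_toy[val_idx]))

library_scores = cross_val_score(clf, X_toy, y_toy, cv=10)
print(np.mean(manual_scores), library_scores.mean())
```

Looking at the spread of the individual fold scores, and not only their mean, also helps when comparing two models whose averages are close.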

Not too bad. Let's try an SVC:

In [40]:
from sklearn.svm import SVC

svm_clf = SVC(gamma="auto")
svm_scores = cross_val_score(svm_clf, X_train, y_train, cv=10)
svm_scores.mean()

0.8249313358302123

This model scores slightly higher (about 0.825 versus 0.814)!

Let's compare more models and tune their hyperparameters using cross-validation and grid search.

Grid search for the **RandomForestClassifier** and four other models:

In [44]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Define the models and hyperparameter grids
model_params = {
    "RandomForest": {
        "model": RandomForestClassifier(random_state=42),
        "params": {
            "n_estimators": [50, 100, 200],
            "max_depth": [None, 10, 20, 30],
            "min_samples_split": [2, 5, 10],
        },
    },
    "SVC": {
        "model": SVC(random_state=42),
        "params": {
            "C": [0.1, 1, 10],
            "kernel": ["linear", "rbf"],
            "gamma": ["scale", "auto"],
        },
    },
    "GradientBoosting": {
        "model": GradientBoostingClassifier(random_state=42),
        "params": {
            "n_estimators": [50, 100, 200],
            "learning_rate": [0.01, 0.1, 0.2],
            "max_depth": [3, 5, 7],
        },
    },
    "LogisticRegression": {
        "model": LogisticRegression(random_state=42, max_iter=500),
        "params": {
            "C": [0.1, 1, 10],
            "penalty": ["l1", "l2"],
            "solver": ["liblinear", "saga"],
        },
    },
    "KNN": {
        "model": KNeighborsClassifier(),
        "params": {
            "n_neighbors": [3, 5, 7],
            "weights": ["uniform", "distance"],
            "metric": ["euclidean", "manhattan"],
        },
    },
}

# Use GridSearchCV for each model
best_models = {}
for model_name, model_info in model_params.items():
    grid_search = GridSearchCV(
        model_info["model"],
        model_info["params"],
        cv=5,
        scoring="accuracy",
        return_train_score=False,
        n_jobs=-1,
    )
    grid_search.fit(X_train, y_train)
    best_models[model_name] = {
        "best_model": grid_search.best_estimator_,
        "best_params": grid_search.best_params_,
        "best_score": grid_search.best_score_,
    }

# Print the best model for each algorithm
for model_name, model_info in best_models.items():
    print(f"{model_name}:")
    print(f"  Best Params: {model_info['best_params']}")
    print(f"  Best CV Score: {model_info['best_score']:.4f}")


RandomForest:
  Best Params: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 100}
  Best CV Score: 0.8350
SVC:
  Best Params: {'C': 1, 'gamma': 'auto', 'kernel': 'rbf'}
  Best CV Score: 0.8283
GradientBoosting:
  Best Params: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
  Best CV Score: 0.8339
LogisticRegression:
  Best Params: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}
  Best CV Score: 0.7969
KNN:
  Best Params: {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'uniform'}
  Best CV Score: 0.8070
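Grid search tries every combination in each grid, so its cost grows multiplicatively with the number of values per hyperparameter. The sketch below just counts the work implied by the grids above; the sizes are copied from `model_params` and the names are illustrative.

```python
from math import prod

# Number of values tried per hyperparameter, copied from the grids above.
grid_shapes = {
    "RandomForest": {"n_estimators": 3, "max_depth": 4, "min_samples_split": 3},
    "SVC": {"C": 3, "kernel": 2, "gamma": 2},
    "GradientBoosting": {"n_estimators": 3, "learning_rate": 3, "max_depth": 3},
    "LogisticRegression": {"C": 3, "penalty": 2, "solver": 2},
    "KNN": {"n_neighbors": 3, "weights": 2, "metric": 2},
}
cv_folds = 5  # the cv value passed to GridSearchCV above

# Candidates per model = product of the grid sizes; each one is fitted once per fold.
candidates = {name: prod(shape.values()) for name, shape in grid_shapes.items()}
total_fits = sum(candidates.values()) * cv_folds

for name, n in candidates.items():
    print(f"{name}: {n} candidates, {n * cv_folds} cross-validation fits")
print(f"Total cross-validation fits: {total_fits}")
```

Each search also refits its best candidate once on the full training set, since `refit` defaults to `True` in `GridSearchCV`.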


Finally, let's pick the best model and use it to make predictions on the test data:

In [45]:
best_model_name = max(best_models, key=lambda x: best_models[x]["best_score"])
best_model = best_models[best_model_name]["best_model"]

# Predict on the test set
y_pred = best_model.predict(X_test)




In [49]:
best_model

So the best model in our tests is the RandomForestClassifier, with a best cross-validation accuracy of 0.8350.