To accomplish the task described, we will follow these steps:

1. Data Preprocessing:

Load the dataset.

Handle missing values.

Encode categorical variables.

2. Exploratory Data Analysis (EDA):

Explore the data to understand its structure and key features.

3. Model Training:

Select a machine learning model suitable for the problem (e.g., Logistic Regression for classification).

Split the data into training and testing sets.

Train the model on the training set.

4. Model Evaluation:

Evaluate the model using relevant metrics (e.g., accuracy for classification).

Present the evaluation results.

5. Model Optimization:

Perform basic hyperparameter tuning on the model.

Document the tuning process and present the results of the optimized model compared to the initial model performance.

6. Model Deployment:

Export the optimized model in a deployable format (e.g., using joblib for scikit-learn models).

Write a simple script that demonstrates loading the model and making predictions on new data.

Let's start by loading and preprocessing the dataset.

# Data Preprocessing

Load the dataset.

In [1]:
import numpy as np 
import pandas as pd 

In [2]:
df = pd.read_csv("C:\\Users\\GOURAV NEGI\\Downloads\\archive (1)\\Titanic Dataset.csv")
df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.00,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.00,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.50,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.50,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.00,0,0,2670,7.2250,,C,,,


In [3]:
df.drop(['boat','body','home.dest'],axis = 1,inplace = True)

In [4]:
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S


# Handling Missing Values

In [5]:
df.isnull().sum()

pclass         0
survived       0
name           0
sex            0
age          263
sibsp          0
parch          0
ticket         0
fare           1
cabin       1014
embarked       2
dtype: int64

In [6]:
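# Fill missing ages with the median and missing ports with the most frequent value.
# Assigning the result back avoids chained in-place modification, which newer pandas versions may ignore.
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
# Drop 'cabin' (1014 missing values) and the single row with a missing fare.
df.drop(columns=['cabin'], inplace=True)
df.dropna(subset=['fare'], inplace=True)
df.isnull().sum()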

pclass      0
survived    0
name        0
sex         0
age         0
sibsp       0
parch       0
ticket      0
fare        0
embarked    0
dtype: int64
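
The imputation strategy above (median for a numeric column, mode for a categorical one) can be checked on a tiny frame. This is a minimal sketch; the `toy` DataFrame and its values are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in a numeric and one in a categorical column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 40.0],
    "embarked": ["S", "C", None, "S"],
})

# Median of the observed ages replaces the missing age.
toy["age"] = toy["age"].fillna(toy["age"].median())
# mode() returns a Series of the most frequent values; [0] picks the first one.
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])

print(toy)
print(toy.isnull().sum())
```

The median is less sensitive to outliers than the mean, which is one common reason to prefer it for a skewed column such as age.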

# Encoding Categorical Variables
We'll encode the categorical variables using one-hot encoding.

In [8]:
df = pd.get_dummies(df,columns=['sex','embarked'],drop_first=True)
df.head()

Unnamed: 0,pclass,survived,name,age,sibsp,parch,ticket,fare,sex_male,embarked_Q,embarked_S
0,1,1,"Allen, Miss. Elisabeth Walton",29.0,0,0,24160,211.3375,0,0,1
1,1,1,"Allison, Master. Hudson Trevor",0.92,1,2,113781,151.55,1,0,1
2,1,0,"Allison, Miss. Helen Loraine",2.0,1,2,113781,151.55,0,0,1
3,1,0,"Allison, Mr. Hudson Joshua Creighton",30.0,1,2,113781,151.55,1,0,1
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",25.0,1,2,113781,151.55,0,0,1

Confirm that no missing values remain after encoding.

In [9]:
df.isnull().sum()

pclass        0
survived      0
name          0
age           0
sibsp         0
parch         0
ticket        0
fare          0
sex_male      0
embarked_Q    0
embarked_S    0
dtype: int64
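
To see what `drop_first=True` does, the one-hot encoding step can be reproduced on a small frame. This is an illustrative sketch with made-up rows. The `dtype=int` argument is an addition not used in the notebook: recent pandas versions return boolean dummy columns by default, and `dtype=int` yields the 0/1 values shown in the table above.

```python
import pandas as pd

# Hypothetical rows covering every category of both columns.
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "embarked": ["S", "C", "Q"],
})

# drop_first=True drops the first category of each column (alphabetically:
# 'female' and 'C'), so those categories are encoded as all zeros.
encoded = pd.get_dummies(toy, columns=["sex", "embarked"], drop_first=True, dtype=int)

print(encoded)
```

Dropping one level per variable avoids perfectly collinear columns, which matters for linear models such as logistic regression.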


# Splitting the Data

In [10]:
df.columns

Index(['pclass', 'survived', 'name', 'age', 'sibsp', 'parch', 'ticket', 'fare',
       'sex_male', 'embarked_Q', 'embarked_S'],
      dtype='object')

In [11]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['survived','name','ticket',])
y = df['survived']

X_train,X_test, y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
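
The split above is purely random. When the classes are imbalanced, a stratified split is a common alternative that keeps the class ratio the same in both parts. The sketch below uses made-up arrays (`X_toy`, `y_toy`) and the `stratify` parameter of `train_test_split`; it is a variation on the notebook's split, not what the notebook itself ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 60 of class 0 and 40 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy preserves the 60/40 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```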

# Model Training
We use Logistic Regression as our model.

In [12]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000,random_state=42)

model.fit(X_train,y_train)

LogisticRegression(max_iter=1000, random_state=42)
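
The features here sit on very different scales (for example `fare` versus the 0/1 indicator columns), which can slow solver convergence. One possible refinement, not used in the notebook, is to standardize the features inside a pipeline. The sketch below runs on synthetic data from `make_classification`, so its score says nothing about the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 8 numeric features, as in the notebook's X).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# The scaler is fit on the training data only and reused at prediction time.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_model.fit(X_tr, y_tr)

print(f"Test accuracy: {scaled_model.score(X_te, y_te):.2f}")
```

Because the scaler lives inside the pipeline, exporting the pipeline with joblib would carry the preprocessing along with the model.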

# Model Evaluation
Evaluate the model using accuracy as the main metric, together with a classification report and a confusion matrix.

In [13]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test,y_pred)
print(f'Accuracy: {accuracy:.2f}')

print(classification_report(y_test,y_pred))

print(confusion_matrix(y_test,y_pred))

Accuracy: 0.76
              precision    recall  f1-score   support

           0       0.77      0.85      0.81       156
           1       0.74      0.63      0.68       106

    accuracy                           0.76       262
   macro avg       0.76      0.74      0.75       262
weighted avg       0.76      0.76      0.76       262

[[133  23]
 [ 39  67]]
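
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy and the class 1 (survived) precision, recall, and F1 from the four counts printed above, using scikit-learn's layout of rows as true labels and columns as predicted labels.

```python
import numpy as np

# Confusion matrix from the baseline model above: rows = true, columns = predicted.
cm = np.array([[133, 23],
               [39, 67]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # 200 / 262
precision_1 = tp / (tp + fp)         # 67 / 90
recall_1 = tp / (tp + fn)            # 67 / 106
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
# prints: accuracy=0.76 precision=0.74 recall=0.63 f1=0.68
```

The recall of 0.63 means that 39 of the 106 actual survivors in the test set were predicted as non-survivors, which is the model's main weakness.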


# Model Optimization
Perform basic hyperparameter tuning using GridSearchCV.

In [14]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1,1,10,100],
    'solver': ['liblinear','saga']
}

grid_search = GridSearchCV(LogisticRegression(max_iter=1000,random_state=42),param_grid, cv=5,scoring='accuracy')

grid_search.fit(X_train,y_train)

# Fitted results carry a trailing underscore: best_params_ and best_score_.
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Score: {grid_search.best_score_:.2f}')
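
To document the tuning process, the fitted search object exposes the winning configuration and the per-candidate scores. The sketch below shows how to read them. It uses synthetic data and a smaller grid than the notebook (only `C`, with the default solver), so the values it prints are unrelated to the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, for illustration only).
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=42)

search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    {"C": [0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_syn, y_syn)

# Best configuration and its mean cross-validated accuracy.
print(f"Best Parameters: {search.best_params_}")
print(f"Best Score: {search.best_score_:.2f}")

# One row per candidate: useful for a tuning log.
for params, mean_score in zip(search.cv_results_["params"],
                              search.cv_results_["mean_test_score"]):
    print(params, f"{mean_score:.3f}")
```

Note that `best_score_` is a cross-validation average over the training folds, so it is not directly comparable to accuracy on the held-out test set.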




Re-evaluate the model with the best parameters:

Retrieve the optimized model (GridSearchCV has already refit it on the full training set)

Make predictions on the test set

Calculate accuracy

Display the classification report

Display the confusion matrix

In [16]:
optimized_model = grid_search.best_estimator_

y_pred_optimized = optimized_model.predict(X_test)

optimized_accuracy = accuracy_score(y_test,y_pred_optimized)
print(f'Optimized Accuracy: {optimized_accuracy:.2f}')

print(classification_report(y_test,y_pred_optimized))

print(confusion_matrix(y_test,y_pred_optimized))

Optimized Accuracy: 0.77
              precision    recall  f1-score   support

           0       0.78      0.85      0.81       156
           1       0.75      0.64      0.69       106

    accuracy                           0.77       262
   macro avg       0.76      0.75      0.75       262
weighted avg       0.77      0.77      0.76       262

[[133  23]
 [ 38  68]]
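
To compare the optimized model with the initial one, the two confusion matrices can be set side by side. The sketch below recomputes both accuracies from the counts printed in this notebook; `accuracy_from_cm` is a helper name introduced here for illustration.

```python
import numpy as np

# Confusion matrices reported above (rows = true, columns = predicted).
baseline_cm = np.array([[133, 23], [39, 67]])
tuned_cm = np.array([[133, 23], [38, 68]])

def accuracy_from_cm(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    return np.trace(cm) / cm.sum()

baseline_acc = accuracy_from_cm(baseline_cm)
tuned_acc = accuracy_from_cm(tuned_cm)
extra_correct = int(np.trace(tuned_cm) - np.trace(baseline_cm))

print(f"baseline={baseline_acc:.4f} tuned={tuned_acc:.4f} "
      f"extra correct predictions={extra_correct}")
# prints: baseline=0.7634 tuned=0.7672 extra correct predictions=1
```

The gain from 0.76 to 0.77 corresponds to a single additional correct prediction out of 262 test samples, so the tuned and initial models perform essentially the same on this split.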


# Model Deployment

In [18]:
import joblib

joblib.dump(optimized_model,'titanic_model.joblib')

['titanic_model.joblib']

The following script loads the model and makes predictions on new data.

In [19]:
import joblib
import pandas as pd

model = joblib.load('titanic_model.joblib')

# Column names and order must match the features used during training.
new_data = pd.DataFrame({
    'pclass': [3,1],
    'age': [22, 38],
    'sibsp': [1,1],
    'parch': [0,0],
    'fare': [7.25, 71.2833],
    'sex_male': [1,0],
    'embarked_Q': [0,0],
    'embarked_S': [1,0]
})
predictions = model.predict(new_data)
print(predictions)

[0 1]
