# Classification with titanic csv

***

Notes on pytorch dataloader 

Data Handling: TensorDataset provides a convenient way to combine input features and labels into a single dataset object. It allows you to store and manipulate your data in a unified format, making it easier to handle and pass to the model.

Batch Processing: DataLoader enables batch processing of your data during training and evaluation. It automatically creates mini-batches of data, shuffles the data (if specified), and provides an iterable object that you can loop over in your training loop. This is especially useful when you have large datasets that cannot fit entirely into memory, as it allows you to load and process data in smaller batches.

Data Parallelism: DataLoader supports parallel data loading and processing. You can set the num_workers parameter to load and preprocess data in parallel using multiple worker processes. This can significantly speed up data loading and improve the overall training time, especially when working with large datasets.

Randomization: DataLoader provides the option to shuffle the data by setting the shuffle parameter to True. This helps in introducing randomness during training, which can be beneficial for better model generalization and avoiding overfitting.

Efficient Memory Usage: By using DataLoader with TensorDataset, you can efficiently manage memory usage. Instead of loading the entire dataset into memory at once, DataLoader loads and processes data in small batches, reducing the memory footprint and allowing you to work with larger datasets.

***

In [171]:
import pandas as pd 
import numpy as np
from tqdm import tqdm
import json

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
labler = preprocessing.LabelEncoder()
oh = preprocessing.OneHotEncoder(sparse=False)

# Pytorch 
import torch
import torch.nn as nn 
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

Problem Statement: Evaluating whether a person survived 

In [232]:
df = pd.read_csv('./data/titanic.csv').dropna().reset_index(drop=True)
df.head()
features = df[['Age','Sex']].copy()
labels = df[['Survived']].copy()

features.drop('Sex',axis=1,inplace=True)
sex = oh.fit_transform(df[['Sex']])
features_encoded_df = pd.DataFrame(sex, columns=oh.get_feature_names(['Sex']))
features = pd.concat([features, features_encoded_df], axis=1)

# train test split with shuffle 
X_train, X_test, y_train, y_test = train_test_split(features, labels, shuffle=True, test_size=0.3)
data_list = [X_train, y_train, X_test, y_test]

# Turning data into tensor 
for i in tqdm(range(0,4)):
    data_list[i] = torch.tensor(data_list[i].values,dtype=torch.float32)
    
X_train, y_train, X_test, y_test  = data_list

# Storing and using tensor_dataset : 
train_dataset  = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)

# hyperparameters 
batch_size = 16

# __init__ parameters:
class Params(object):
    def __init__(self,input_dim):
        self.input_dim = input_dim
        
# Initializations
args = Params(3)        
        
# Create dataloader 
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

100%|██████████| 4/4 [00:00<?, ?it/s]


# Pytorch Model 

In [281]:
class LogisticRegression(nn.Module):
    def __init__(self,input_dim):
        super(LogisticRegression,self).__init__()
        # framework 
        self.linear = nn.Linear(input_dim,1)
    
    def forward(self,x):
        out = torch.sigmoid(self.linear(x))
        return out 

model = LogisticRegression(args.input_dim)

# define loss function: 
criterion = nn.BCELoss()

# Optimizer for back-propagation 
optimizer = optim.SGD(model.parameters(),lr=0.01)

# Epochs 
e = 25

# early stopp criteria
best_loss = float('inf')

# Training for loop
for epoch in tqdm(range(e)):
    model.train()
    train_loss = 0
    for batch_input, batch_labels in train_loader:
        # forward
        outputs = model(batch_input)
        loss = criterion(outputs,batch_labels)
        # back-propagation and reset optimizer
        optimizer.zero_grad()
        loss.backward
        optimizer.step()
        train_loss += loss.item()
        
    # Validation Phase 
    model.eval()
    val_loss = 0 
    total_correct = 0
    total_examples = 0
    
    with torch.no_grad():
        for batch_input, batch_labels in test_loader:
            outputs = model(batch_input)
            loss = criterion(outputs,batch_labels)
            val_loss += loss.item()
            
            prediction = torch.round(outputs)
            correct = (prediction==batch_labels).sum().item()
            total_correct += correct
            total_examples += batch_input.size(0)
            
    accuracy = (total_correct / total_examples) * 100
    print(f"Accuracy: {accuracy:.2f}%")
            
    # Calculate average lostt
    train_loss /= len(train_loader)
    val_loss /= len(test_loader)
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{e}], Loss: {loss.item():.4f}")
    
    # Check if validation loss improves
    if val_loss < best_loss:
        best_loss = val_loss
        epochs_without_improvement = 0 
        # saving the model 
        torch.save(model.state_dict(),'best_model.pth')
    else:
        epochs_without_improvement  += 1
    
    # Check early stopping 
    if epochs_without_improvement >= 10:
        print(f'Early Stopping triggered. Trainning stopped at {epochs_without_improvement}')
        break

 40%|████      | 10/25 [00:00<00:00, 269.83it/s]

Accuracy: 63.64%
Accuracy: 63.64%
Accuracy: 63.64%
Accuracy: 63.64%
Accuracy: 63.64%
Accuracy: 63.64%
Accuracy: 63.64%
Accuracy: 63.64%
Accuracy: 63.64%
Accuracy: 63.64%
Epoch [10/25], Loss: 4.3991
Accuracy: 63.64%
Early Stopping triggered. Trainning stopped at 10





***

Prediction

In [282]:
df.iloc[0:10,:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
1,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
2,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
3,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
4,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
5,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S
6,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S
7,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
8,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
9,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C


In [283]:
[data['Age']]

[4]

In [289]:
model = LogisticRegression(3)
model.load_state_dict(torch.load('./best_mode.pth'))
model.eval()
json_data = """{
    "Age":4,
    "Sex_female": 0,
    "Sex_male": 1
}
"""
data = json.loads(json_data)
input_tensor = torch.tensor([[data['Age'],data['Sex_female'],data['Sex_male']]],dtype=torch.float32)
with torch.no_grad():
    output = model(input_tensor)
    predicted_class = torch.argmax(output, dim=1)

class_labels = ['Not Survived', 'Survived']
class_labels[predicted_class.item()]

'Not Survived'