# Definitions

Batch Size: In machine learning, the training process involves updating the model's parameters based on the gradients computed from a batch of training samples. The batch size determines the number of samples that are processed together before the model's parameters are updated.

For example, with a dataset of 1000 samples and a batch size of 100, each epoch (one full pass over the data) consists of 10 iterations, that is, 10 updates of the model's parameters.
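The arithmetic in this example can be sketched in a few lines. The helper name `updates_per_epoch` is made up for illustration, and the sketch assumes the final, possibly smaller, batch is kept rather than dropped.

```python
import math


def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of parameter updates in one full pass over the data.

    Assumes the last batch is kept even if it is smaller than
    batch_size, hence the ceiling.
    """
    if n_samples <= 0 or batch_size <= 0:
        raise ValueError("n_samples and batch_size must be positive")
    return math.ceil(n_samples / batch_size)


print(updates_per_epoch(1000, 100))  # 10
print(updates_per_epoch(1000, 128))  # 8 (seven full batches plus one partial)
```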

Why a batch size is needed:
* Memory efficiency: with a batch smaller than the entire dataset, only part of the data has to fit in memory at once.
* Computational efficiency: Batch processing can take advantage of parallelism in modern hardware, such as GPUs. By processing multiple samples simultaneously, the computations can be distributed across multiple cores or devices, leading to faster training times.
* More frequent updates: the parameters are updated several times per epoch instead of once, which can help the model converge toward a (local) minimum sooner.

Trade-offs:
* A larger batch size gives more stable gradient estimates, but each update is computationally more expensive.
* A smaller batch size updates the model parameters more frequently and can potentially converge faster, but it introduces more noise into the gradient estimates.
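To make the role of the batch size concrete, here is a minimal sketch of mini-batch training with `torch.utils.data.DataLoader`. The values `batch_size = 2` and `epochs = 5` are illustrative assumptions, and the four-sample toy dataset is the one used in the cells of this notebook. It only counts parameter updates; it is not meant as a tuned training setup.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # reproducible shuffling and weight initialization

# Toy data: four samples with two features each
X = torch.tensor([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
Y = torch.tensor([[0.0], [0.0], [1.0], [1.0]])

batch_size = 2  # illustrative value (assumption)
epochs = 5      # illustrative value (assumption)

# The DataLoader splits the dataset into batches of `batch_size` samples
loader = DataLoader(TensorDataset(X, Y), batch_size=batch_size, shuffle=True)

model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

n_updates = 0
for epoch in range(epochs):
    for xb, yb in loader:           # one mini-batch at a time
        optimizer.zero_grad()       # clear gradients from the previous batch
        loss = criterion(model(xb), yb)
        loss.backward()             # gradients estimated from this batch only
        optimizer.step()            # one parameter update per batch
        n_updates += 1

# 4 samples / batch size 2 = 2 updates per epoch, so 10 updates over 5 epochs
print(n_updates)
```

With `batch_size = 4` the same loop would perform a single update per epoch, which is the full-batch case.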

***

In [5]:
import torch
import torch.nn as nn

In [6]:
# Model Framework
class LogisticRegression(nn.Module):
    def __init__(self,input_size):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(input_size,1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self,x):
        out = self.linear(x)
        out = self.sigmoid(out) # probabilities
        return out 

In [7]:
# Sample dataset
X = torch.tensor([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
Y = torch.tensor([[0.0], [0.0], [1.0], [1.0]])

# Model hyperparameters
class Params(object):
    def __init__(self,input_size,learning_rate,epochs, threshold):
        self.input_size = input_size
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.threshold = threshold
        
args = Params(2,0.01,100,0.7)

# Initialize the model
model = LogisticRegression(args.input_size)


# Define Loss Function and Optimizer 
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(),lr=args.learning_rate)

# Training loop (full batch: all four samples are used for every update)
for e in range(args.epochs):
    # forward pass
    outputs = model(X)
    loss = criterion(outputs, Y)
    
    # Backward pass and optimization

    # PyTorch accumulates gradients by default, so zero out the gradients
    # left over from the previous iteration; otherwise they would add up
    # and interfere with subsequent parameter updates.
    optimizer.zero_grad()

    # Compute the gradients of the loss with respect to every tensor in the
    # computational graph that requires gradients (automatic differentiation).
    # The results are accumulated in each parameter's .grad attribute.
    loss.backward()

    # Apply the computed gradients to the model's parameters using the chosen
    # optimization algorithm (here SGD). This moves the parameters in the
    # direction that reduces the loss.
    optimizer.step()

    # Print the progress
    if (e + 1) % 10 == 0:
        print(f"Epoch [{e+1}/{args.epochs}], Loss: {loss.item():.4f}")

Epoch [10/100], Loss: 0.6180
Epoch [20/100], Loss: 0.5844
Epoch [30/100], Loss: 0.5715
Epoch [40/100], Loss: 0.5654
Epoch [50/100], Loss: 0.5616
Epoch [60/100], Loss: 0.5588
Epoch [70/100], Loss: 0.5562
Epoch [80/100], Loss: 0.5538
Epoch [90/100], Loss: 0.5515
Epoch [100/100], Loss: 0.5492


In [49]:
# Test the model
test_input = torch.tensor([[5.0, 6.0]])
predicted = model(test_input)
print(f"Predicted probability: {predicted.item():.4f}")

Predicted probability: 0.8169


The classification threshold is commonly set to 0.5 by default, but you can adjust it according to your specific needs and the trade-off between precision and recall.

#### Changing the threshold

In [5]:
# `predicted` holds the predicted probability for the test input
# args.threshold = 0.7  # Uncomment to set a new threshold value (it is already 0.7 here)
(predicted >= args.threshold).float()

tensor([[1.]])

A higher threshold tends to increase precision (reducing false positives), but it may lead to lower recall (missing some true positives). Conversely, a lower threshold increases recall (capturing more true positives) but may reduce precision (increasing false positives).

In [35]:
# Predicted Probabilities
model(X).detach().numpy()

array([[0.4770312 ],
       [0.57564396],
       [0.6685808 ],
       [0.7500033 ]], dtype=float32)
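The precision/recall trade-off described above can be checked on these four probabilities and the labels in `Y`. The sketch below uses plain Python; the helper `precision_recall` is a hypothetical name, and returning 0.0 when a ratio is undefined is a convention chosen here, not something the notebook prescribes.

```python
# Predicted probabilities and true labels for the four training samples
probs = [0.4770312, 0.57564396, 0.6685808, 0.7500033]
labels = [0, 0, 1, 1]


def precision_recall(probs, labels, threshold):
    """Return (precision, recall) for the positive class at a threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    # Convention (assumption): report 0.0 when the denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for t in (0.5, 0.7):
    p, r = precision_recall(probs, labels, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.67, recall=1.00
# threshold=0.7: precision=1.00, recall=0.50
```

Raising the threshold from 0.5 to 0.7 removes the one false positive (the second sample) but also misses the third sample, so precision goes up while recall goes down.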

#### Predicted Probabilities with Blackbox

In [51]:
# Explainers from the interpret-community package
from interpret.ext.blackbox import KernelExplainer
from interpret.ext.blackbox import TabularExplainer

X_np = X.numpy()

# Use the blackbox explainer to explain the model's predictions

explainer = KernelExplainer(
        model,
        initialization_examples =  X_np,
        classes = [0,1],
        # features = X_train.columns.tolist(),
        model_task='classification')  

global_explanation = explainer.explain_global(X_np)
local_explanation = explainer.explain_local(X_np)

# Print the feature importance values
print(global_explanation.get_feature_importance_dict())
sorted_local_importance_names = local_explanation.get_ranked_local_names()
sorted_local_importance_values = local_explanation.get_ranked_local_values()
sorted_local_importance_names

{0: 0.09291041176766157, 1: 0.001433192752301693}


[[[0, 1], [0, 1], [1, 0], [1, 0]], [[1, 0], [1, 0], [0, 1], [0, 1]]]

In [52]:
X_np

array([[1., 2.],
       [2., 3.],
       [3., 4.],
       [4., 5.]], dtype=float32)

Reference: https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-machine-learning-interpretability-aml?view=azureml-api-1

***

Local Importance: Local importance refers to the interpretability of a model's predictions for a specific instance or observation in the dataset. It aims to explain why a particular prediction was made by highlighting the features or factors that influenced the model's decision for that specific instance.

Local interpretability methods provide insights into the importance or contribution of individual features for a particular prediction. These methods often involve examining feature attributions, such as the magnitude and direction of feature contributions or feature importance scores for a specific instance.

Examples of local interpretability methods include LIME (Local Interpretable Model-Agnostic Explanations), SHAP (SHapley Additive exPlanations), and individual conditional expectation (ICE) plots. Permutation importance and partial dependence plots, by contrast, summarize behavior across the whole dataset and are better classed as global methods.

Local importance helps provide a detailed understanding of the model's behavior for specific instances and can be valuable for debugging, error analysis, and gaining insights into individual predictions.


***

Global Importance: Global importance, on the other hand, focuses on understanding the overall behavior of the model and the relative importance of features across the entire dataset. It aims to provide a broader perspective on the model's performance and feature relevance by considering the aggregate impact of features on the model's predictions.

Global interpretability methods analyze the model as a whole and examine the general patterns and trends in feature importance across the dataset. They aim to identify the most influential features or factors for the model's predictions on average.

These methods typically provide feature importance rankings or metrics that indicate the relative contribution of each feature to the model's predictions. They can help identify the most influential features, detect biases, or reveal the overall patterns captured by the model.

Global importance techniques include permutation feature importance, mean decrease in impurity or gain in tree-based models, coefficients in linear models, and model-agnostic measures such as feature importance aggregated from Shapley values.
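The relationship between the two views can be sketched for a linear model. Under a feature-independence assumption, the Shapley-style contribution of feature j to one sample's score is the weight times the feature's deviation from its mean, and a common global summary is the mean absolute local contribution. The weights below are hypothetical values chosen for illustration, not the parameters of the model trained in this notebook, so the numbers are not expected to match the explainer output shown earlier.

```python
# Feature matrix from the notebook
X = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]

# Hypothetical weights of a linear model (NOT the trained model's values)
weights = [0.8, 0.1]

n_samples = len(X)
n_features = len(weights)

# Mean of each feature over the dataset
feature_means = [sum(row[j] for row in X) / n_samples for j in range(n_features)]

# Local importance: contribution of feature j to sample i's score, relative
# to the average sample (for logistic regression this is on the log-odds scale)
local = [
    [weights[j] * (row[j] - feature_means[j]) for j in range(n_features)]
    for row in X
]

# Global importance: mean absolute local contribution per feature
global_importance = [
    sum(abs(local[i][j]) for i in range(n_samples)) / n_samples
    for j in range(n_features)
]

for i, row in enumerate(local):
    print(f"sample {i}: {[round(v, 2) for v in row]}")
print("global:", [round(v, 2) for v in global_importance])  # approximately [0.8, 0.1]
```

Each row of `local` explains a single prediction, while `global_importance` collapses those rows into one ranking for the whole dataset.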