#### Forward Propagation:

1. **Input Layer:**
   - Input features: $ x = [x_1, x_2, ..., x_n] $
   
2. **Hidden Layer:**
   - Linear transformation: $ z = W_1 \cdot x + b_1 $
   - Activation function (e.g., ReLU): $ h = \text{ReLU}(z) $

3. **Output Layer:**
   - Linear transformation: $ y = W_2 \cdot h + b_2 $
   - Activation function (e.g., Sigmoid for binary classification): $ \hat{y} = \text{Sigmoid}(y) $
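
The three steps above can be sketched in NumPy for a single sample. This is a minimal illustration, not the model trained later in the notebook: the layer sizes (3 inputs, 4 hidden units, 1 output), the random weights, and the helper names `relu`, `sigmoid`, and `forward` are assumptions chosen for the example.

```python
import numpy as np

def relu(z):
    # ReLU keeps positive values and zeroes out the rest
    return np.maximum(z, 0.0)

def sigmoid(y):
    # Squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-y))

def forward(x, W1, b1, W2, b2):
    z = W1 @ x + b1        # hidden-layer linear transformation
    h = relu(z)            # hidden-layer activation
    y = W2 @ h + b2        # output-layer linear transformation
    y_hat = sigmoid(y)     # predicted probability
    return z, h, y, y_hat

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # hypothetical input with n = 3 features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hypothetical hidden layer
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # hypothetical output layer

z, h, y, y_hat = forward(x, W1, b1, W2, b2)
print(h.shape, y_hat.shape)
```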
   
   
#### Backward Propagation (Gradient Descent):

1. **Compute Loss:**
   - Compute the loss between predicted ($ \hat{y} $) and actual ($ y_{\text{true}} $) outputs using a loss function $ L(\hat{y}, y_{\text{true}}) $.
   
2. **Compute Gradients:**
   - Compute the gradient of the loss with respect to the output layer parameters:
     $ \frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial y} \cdot \frac{\partial y}{\partial W_2} $
     $ \frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial y} \cdot \frac{\partial y}{\partial b_2} $
   - Compute the gradient of the loss with respect to the hidden layer parameters:
     $ \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial W_1} $
     $ \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial b_1} $

3. **Update Parameters:**
   - Update the parameters using gradient descent:
     $ W_1 := W_1 - \alpha \frac{\partial L}{\partial W_1} $
     $ b_1 := b_1 - \alpha \frac{\partial L}{\partial b_1} $
     $ W_2 := W_2 - \alpha \frac{\partial L}{\partial W_2} $
     $ b_2 := b_2 - \alpha \frac{\partial L}{\partial b_2} $
   where $ \alpha $ is the learning rate.
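
The chain-rule products above collapse into a few matrix expressions once a loss is fixed. The sketch below assumes binary cross-entropy as $ L $, for which the sigmoid and loss derivatives combine into $ \frac{\partial L}{\partial y} = \hat{y} - y_{\text{true}} $. The layer sizes, the random data, and the function names are illustrative assumptions. A finite-difference check on one entry of $ W_2 $ is included as a sanity test.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    z = W1 @ x + b1
    h = np.maximum(z, 0.0)                 # ReLU
    y = W2 @ h + b2
    y_hat = 1.0 / (1.0 + np.exp(-y))       # Sigmoid
    return z, h, y_hat

def bce(y_hat, y_true):
    # Binary cross-entropy for one sample (assumed loss function)
    return float(-(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))[0])

def backward(x, y_true, W1, b1, W2, b2):
    z, h, y_hat = forward(x, W1, b1, W2, b2)
    dy = y_hat - y_true                    # dL/dy: sigmoid and BCE derivatives combined
    dW2 = np.outer(dy, h)                  # dL/dW2 = dL/dy * dy/dW2
    db2 = dy                               # dy/db2 = 1
    dh = W2.T @ dy                         # dL/dh = dL/dy * dy/dh
    dz = dh * (z > 0)                      # dh/dz is 1 where z > 0, else 0
    dW1 = np.outer(dz, x)                  # dL/dW1 = dL/dz * dz/dW1
    db1 = dz                               # dz/db1 = 1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
x, y_true = rng.normal(size=3), np.array([1.0])
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

dW1, db1, dW2, db2 = backward(x, y_true, W1, b1, W2, b2)

# Finite-difference estimate of dL/dW2[0, 0] for comparison
eps = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 0] += eps
W2_minus[0, 0] -= eps
numeric = (bce(forward(x, W1, b1, W2_plus, b2)[2], y_true)
           - bce(forward(x, W1, b1, W2_minus, b2)[2], y_true)) / (2 * eps)
print(dW2[0, 0], numeric)

# One gradient descent step with learning rate alpha
alpha = 0.1
W1_new, b1_new = W1 - alpha * dW1, b1 - alpha * db1
W2_new, b2_new = W2 - alpha * dW2, b2 - alpha * db2
```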

# L1 and L2: Regularization techniques applied during backward propagation to restrict the size of the weights
## $ L1 $ Regularization (Lasso Regression): Use When

- You suspect that many features are irrelevant or redundant, and you want the model to perform feature selection by driving the coefficients of irrelevant features to zero.
- You want a more interpretable model that only includes a subset of the most relevant features.
- You want to reduce the complexity of the model by encouraging sparsity.

The $ L1 $ regularization term adds a penalty to the cost function proportional to the absolute value of the coefficients' magnitude. It encourages sparsity in the model by driving some of the coefficients to zero, effectively performing feature selection. The resulting optimization problem becomes:

$ \text{Cost} = \text{Original Cost} + \lambda \sum_{i=1}^{n} |w_i| $

Where:
- $ \text{Original Cost} $ is the original cost function (e.g., mean squared error).
- $ \lambda $ is the regularization parameter controlling the strength of the penalty.
- $ w_i $ are the model coefficients.
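
A small pure-Python sketch of the penalty term, together with the soft-thresholding step that is commonly used to explain why $ L1 $ produces exact zeros. The coefficient values, `lam`, and both function names are made up for illustration; soft-thresholding is brought in here only as an explanatory device and is not something this notebook uses directly.

```python
def l1_penalty(weights, lam):
    # lambda * sum_i |w_i|
    return lam * sum(abs(w) for w in weights)

def soft_threshold(w, t):
    # Shrinks w toward zero by t and returns exactly zero when |w| <= t
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

weights = [0.8, -0.05, 0.02, -1.5]   # hypothetical coefficients
lam = 0.1                            # hypothetical regularization strength

penalty = l1_penalty(weights, lam)
shrunk = [soft_threshold(w, lam) for w in weights]
print(penalty)
print(shrunk)
```

The two small coefficients fall inside the threshold and become exactly zero, while the two large ones are only reduced in magnitude.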

## $ L2 $ Regularization (Ridge Regression): Use When

- You believe that all features are potentially useful, but you want to mitigate the impact of multicollinearity (high correlation among features) by reducing the magnitude of large coefficients.
- You prioritize predictive accuracy over model interpretability.
- You want to prevent overfitting by penalizing large coefficients without necessarily eliminating any features.

The $ L2 $ regularization term adds a penalty to the cost function proportional to the square of the coefficients' magnitude. It doesn't lead to sparsity but instead penalizes large coefficients, effectively reducing their impact on the model. The resulting optimization problem becomes:

$ \text{Cost} = \text{Original Cost} + \lambda \sum_{i=1}^{n} w_i^2 $

Where:
- $ \text{Original Cost} $ is the original cost function (e.g., mean squared error).
- $ \lambda $ is the regularization parameter controlling the strength of the penalty.
- $ w_i $ are the model coefficients.
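
For contrast, a sketch of the $ L2 $ term and of one gradient step on the penalty alone. Because the gradient of $ \lambda w_i^2 $ is $ 2 \lambda w_i $, each step multiplies a coefficient by $ (1 - 2 \alpha \lambda) $, so coefficients shrink but do not land exactly on zero. The values of `weights`, `lam`, and `alpha` and the function names are illustrative assumptions.

```python
def l2_penalty(weights, lam):
    # lambda * sum_i w_i^2
    return lam * sum(w * w for w in weights)

def l2_shrink_step(weights, lam, alpha):
    # w <- w - alpha * d/dw (lam * w^2) = w * (1 - 2 * alpha * lam)
    return [w - alpha * 2.0 * lam * w for w in weights]

weights = [0.8, -0.05, 0.02, -1.5]   # hypothetical coefficients
lam, alpha = 0.1, 0.5                # hypothetical penalty strength and learning rate

penalty = l2_penalty(weights, lam)
stepped = l2_shrink_step(weights, lam, alpha)
print(penalty)
print(stepped)
```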


In [1]:
import torch



In [2]:
import numpy as np

import torch.nn as nn
import torch.optim as optim

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv('survey lung cancer.csv')

In [5]:
df

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304,F,56,1,1,1,2,2,2,1,1,2,2,2,2,1,YES
305,M,70,2,1,1,1,1,2,2,2,2,2,2,1,2,YES
306,M,58,2,1,1,1,1,1,2,2,2,2,1,1,2,YES
307,M,67,2,1,2,1,1,2,2,1,2,2,2,1,2,YES


In [6]:
# Survey answers: 1 = NO, 2 = YES -> recode as YES, NO = 1, 0
# GENDER -> M, F = 0, 1 respectively
# LUNG_CANCER -> YES, NO = 1, 0 respectively
df.head()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO


### Feature engineering - labeling for categorical values

In [7]:
list_answer_cols = [i for i in df.columns if i not in ['GENDER','AGE','LUNG_CANCER']]

In [8]:
df_new = df.copy()

In [9]:
for i in list_answer_cols:
    df_new[i] = df_new[i].apply(lambda x: 1 if x == 2 else 0).astype(int)

In [10]:
df_new['GENDER'] = df_new['GENDER'].apply(lambda x: 0 if x =='M' else 1)

In [11]:
df_new['LUNG_CANCER'] = df_new['LUNG_CANCER'].apply(lambda x: 1 if x =='YES' else 0)

In [12]:
import re
df_new.columns = [re.sub(' ','_',i) for i in df_new.columns]

In [13]:
import sklearn

In [14]:
from sklearn import linear_model#.LogisticRegression

In [15]:
from sklearn.metrics import classification_report, confusion_matrix

In [16]:
df_new

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC_DISEASE,FATIGUE_,ALLERGY_,WHEEZING,ALCOHOL_CONSUMING,COUGHING,SHORTNESS_OF_BREATH,SWALLOWING_DIFFICULTY,CHEST_PAIN,LUNG_CANCER
0,0,69,0,1,1,0,0,1,0,1,1,1,1,1,1,1
1,0,74,1,0,0,0,1,1,1,0,0,0,1,1,1,1
2,1,59,0,0,0,1,0,1,0,1,0,1,1,0,1,0
3,0,63,1,1,1,0,0,0,0,0,1,0,0,1,1,0
4,1,63,0,1,0,0,0,0,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304,1,56,0,0,0,1,1,1,0,0,1,1,1,1,0,1
305,0,70,1,0,0,0,0,1,1,1,1,1,1,0,1,1
306,0,58,1,0,0,0,0,0,1,1,1,1,0,0,1,1
307,0,67,1,0,1,0,0,1,1,0,1,1,1,0,1,1


In [17]:
target_var = 'LUNG_CANCER'
ftrs = [i for i in df_new.columns if i != target_var]

In [18]:
model = linear_model.LogisticRegression(solver='liblinear', random_state=0)
model_l1 = linear_model.LogisticRegression(solver='liblinear', random_state=0,C=0.1,penalty='l1')
model_l2 = linear_model.LogisticRegression(solver='liblinear', random_state=0,C=0.1,penalty='l2')

In [19]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test =\
    train_test_split(df_new[ftrs],df_new[target_var], test_size=0.2, random_state=0,stratify=df_new[target_var])

In [20]:
y_train

23     1
61     0
180    1
202    1
79     1
      ..
40     1
21     1
158    1
216    1
177    1
Name: LUNG_CANCER, Length: 247, dtype: int64

In [21]:
y_train.sum()

216

In [22]:
print('Original Dataset: ',df_new.groupby(['LUNG_CANCER'])['GENDER'].count())
# count() - sum() is the number of 0 labels; sum() is the number of 1 labels
print('Train Dataset: ','0:',y_train.count()-y_train.sum(),'1:',y_train.sum())
print('Test Dataset: ','0:',y_test.count()-y_test.sum(),'1:',y_test.sum())



Original Dataset:  LUNG_CANCER
0     39
1    270
Name: GENDER, dtype: int64
Train Dataset:  0: 31 1: 216
Test Dataset:  0: 8 1: 54


In [23]:
model.fit(x_train,y_train)
model_l1.fit(x_train,y_train)
model_l2.fit(x_train,y_train)

LogisticRegression(C=0.1, random_state=0, solver='liblinear')

In [24]:
from sklearn.metrics import confusion_matrix



In [25]:
model.predict(x_test)

array([1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [26]:
# L2 is the preferred choice here: given the context, the features are likely multicollinear, and we want to keep every feature because the data is confidential and not easily reproducible.
print('-----------------------L1--------------------------')
print(classification_report(y_test, model_l1.predict(x_test)))
print(pd.crosstab(y_test, model_l1.predict(x_test), rownames=['True'], colnames=['Predicted'], margins=True))
print('-----------------------L2--------------------------')
print(classification_report(y_test, model_l2.predict(x_test)))
print(pd.crosstab(y_test, model_l2.predict(x_test), rownames=['True'], colnames=['Predicted'], margins=True))
print('-----------------------No R--------------------------')
print(classification_report(y_test, model.predict(x_test)))
print(pd.crosstab(y_test, model.predict(x_test), rownames=['True'], colnames=['Predicted'], margins=True))


-----------------------L1--------------------------
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.87      1.00      0.93        54

    accuracy                           0.87        62
   macro avg       0.44      0.50      0.47        62
weighted avg       0.76      0.87      0.81        62

Predicted   1  All
True              
0           8    8
1          54   54
All        62   62
-----------------------L2--------------------------
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.87      1.00      0.93        54

    accuracy                           0.87        62
   macro avg       0.44      0.50      0.47        62
weighted avg       0.76      0.87      0.81        62

Predicted   1  All
True              
0           8    8
1          54   54
All        62   62
-----------------------No R-------------------------

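
The figures in the L1 and L2 reports follow directly from the confusion counts: both models predict class 1 for all 62 test rows. The sketch below recomputes them by hand. The function name is an assumption, and treating an undefined ratio (0/0) as 0.0 mirrors how `classification_report` displays it by default.

```python
def precision_recall_f1(tp, fp, fn):
    # Undefined ratios (0/0) are reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts from the crosstab: every test row is predicted as class 1
n_true0_pred1, n_true1_pred1 = 8, 54

# Class 1 as the positive label
p1, r1, f1_1 = precision_recall_f1(tp=n_true1_pred1, fp=n_true0_pred1, fn=0)
# Class 0 as the positive label: it is never predicted
p0, r0, f1_0 = precision_recall_f1(tp=0, fp=0, fn=n_true0_pred1)

accuracy = n_true1_pred1 / (n_true0_pred1 + n_true1_pred1)
macro_f1 = (f1_0 + f1_1) / 2
print(round(p1, 2), round(r1, 2), round(f1_1, 2))
print(round(accuracy, 2), round(macro_f1, 2))
```

The 0.87 accuracy is therefore just the share of class 1 in the test set, which is why the macro averages are a more informative summary for this imbalanced target.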


In [27]:

X_train_tensor = torch.from_numpy(np.array(x_train)).to(torch.float)
X_test_tensor = torch.from_numpy(np.array(x_test)).to(torch.float)

print('train data : ',
X_train_tensor.shape
)

print('test data : ',
X_test_tensor.shape
)

train data :  torch.Size([247, 15])
test data :  torch.Size([62, 15])


In [28]:
Y_train_tensor = torch.from_numpy(np.array(y_train)).view(-1,1).to(torch.float)
Y_test_tensor = torch.from_numpy(np.array(y_test)).view(-1,1).to(torch.float)


print('train data : ',
Y_train_tensor.shape
)

print('test data : ',
Y_test_tensor.shape
)

train data :  torch.Size([247, 1])
test data :  torch.Size([62, 1])


In [29]:
# Each output dimension of a layer corresponds to one neuron.
# The result of each linear layer is a hidden layer, also written Z_i.
# The output layer has the same linear structure; the sigmoid activation on top of it is what turns the model into a logistic-regression-style classifier.

class PimaClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden1 = nn.Linear(15, 12)
        self.act1 = nn.ReLU()
        self.hidden2 = nn.Linear(12, 8)
        self.act2 = nn.ReLU()
        self.output = nn.Linear(8, 1)
        self.act_output = nn.Sigmoid()
 
    def forward(self, x):
        x = self.hidden1(x)
        x = self.act1(x)
        x = self.act2(self.hidden2(x))
        x = self.act_output(self.output(x))
        return x
 




### Model Summary

In [30]:
model_nn = PimaClassifier()
print(model_nn)

PimaClassifier(
  (hidden1): Linear(in_features=15, out_features=12, bias=True)
  (act1): ReLU()
  (hidden2): Linear(in_features=12, out_features=8, bias=True)
  (act2): ReLU()
  (output): Linear(in_features=8, out_features=1, bias=True)
  (act_output): Sigmoid()
)


### forward - layer dimensions

In [31]:
for i in model_nn.named_parameters():
    print(i[0])
    print(i[1].shape)


hidden1.weight
torch.Size([12, 15])
hidden1.bias
torch.Size([12])
hidden2.weight
torch.Size([8, 12])
hidden2.bias
torch.Size([8])
output.weight
torch.Size([1, 8])
output.bias
torch.Size([1])


### NN Model
$ Z = \text{Bias} + W_1X_1 + W_2X_2 + \ldots + W_nX_n $

where,
- $ Z $ is the symbol denoting the output of the neuron in the artificial neural network.
- $ W_i $ are the weights or the beta coefficients associated with the input features $ X_i $.
- $ X_i $ are the independent variables or the inputs to the neuron.
- Bias or intercept ($ W_0 $) is added as a constant term.

## Feed Forward Math Expression
#### The core idea of this neural network task is model capacity: the network learns complex data through dimension transformations and non-linear activations, which real-world problems require.

- 0. $ X $ is (247x15)
- 1. $ Z_1 = X W_1 + b_{1} $, where $ W_{1} $ is (15x12) and $ Z_1 $ is (247x12)
- 1-1. $ Z_1 = AF(Z_1) $, where $ AF $ is the activation function (ReLU here)
- 2. $ Z_2 = Z_1 W_2 + b_{2} $, where $ W_{2} $ is (12x8) and $ Z_2 $ is (247x8)
- 2-1. $ Z_2 = AF(Z_2) $
- 3. $ Z_3 = \hat{Y} = Sigmoid(Z_2 W_3 + b_{3}) $, where $ W_{3} $ is (8x1) and $ Z_3 $ is (247x1)
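
The dimensions at each step can be verified by pushing a dummy batch through randomly initialized layers. This NumPy sketch assumes the convention PyTorch uses, where `nn.Linear` stores its weight as (out_features x in_features) and computes `X @ W.T + b`, so the (15x12) matrix in the math is the transpose of the stored (12x15) weight. The random inputs, the 0.1 scaling, and the variable names are placeholders, not the notebook's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(247, 15))                    # dummy batch with the training-set shape

# Weight shapes as printed by named_parameters(): (out_features, in_features)
W1, b1 = 0.1 * rng.normal(size=(12, 15)), np.zeros(12)
W2, b2 = 0.1 * rng.normal(size=(8, 12)), np.zeros(8)
W3, b3 = 0.1 * rng.normal(size=(1, 8)), np.zeros(1)

Z1 = np.maximum(X @ W1.T + b1, 0.0)               # linear layer + ReLU
Z2 = np.maximum(Z1 @ W2.T + b2, 0.0)              # linear layer + ReLU
Z3 = 1.0 / (1.0 + np.exp(-(Z2 @ W3.T + b3)))      # linear layer + Sigmoid

for name, arr in [("X", X), ("Z1", Z1), ("Z2", Z2), ("Z3", Z3)]:
    print(name, arr.shape)
```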

In [32]:
X_train_tensor.dtype

torch.float32

In [33]:

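loss_fn = nn.BCELoss()  # binary cross entropy
# weight_decay adds an L2 penalty on the parameters, so this first model is already lightly regularized
optimizer = optim.Adam(model_nn.parameters(), lr=0.001,weight_decay=1e-5)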


## BCELoss: binary cross-entropy
1. $ \mathcal{L}_{\text{BCE}}(\hat{Y}, Y) = -\left(Y \cdot \log(\hat{Y}) + (1 - Y) \cdot \log(1 - \hat{Y})\right) $
2. $ \mathcal{L}_{\text{total}} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{\text{BCE}}(\hat{Y}_i, Y_i) $
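
The two formulas can be evaluated by hand on a few predictions. The sketch is pure Python; the sample values are invented, and the `eps` clamp is an assumption added here to keep `log(0)` out of the computation (PyTorch's `BCELoss` guards against this differently, by clamping the log terms).

```python
import math

def bce_per_sample(y_hat, y, eps=1e-12):
    # Clamp the prediction away from 0 and 1 so the logs stay finite
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def bce_total(y_hats, ys):
    # Mean of the per-sample losses over N samples
    return sum(bce_per_sample(p, t) for p, t in zip(y_hats, ys)) / len(ys)

y_hats = [0.9, 0.2, 0.7]   # hypothetical predicted probabilities
ys = [1, 0, 1]             # hypothetical true labels

total = bce_total(y_hats, ys)
confident_right = bce_per_sample(0.99, 1)   # small loss
confident_wrong = bce_per_sample(0.01, 1)   # large loss
print(total, confident_right, confident_wrong)
```

A confident wrong prediction is penalized far more heavily than a confident correct one is rewarded, which is what pushes the predicted probabilities toward the true labels during training.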

In [34]:

n_epochs = 5000
batch_size = 10

for epoch in range(n_epochs):
    for i in range(0, len(X_train_tensor), batch_size):
        Xbatch = X_train_tensor[i:i+batch_size]
        y_pred = model_nn(Xbatch)
        ybatch = Y_train_tensor[i:i+batch_size]
        loss = loss_fn(y_pred, ybatch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (((epoch % 500) == 0) or ((n_epochs)==(epoch+1))):
        print(f'Finished epoch {epoch}, latest loss {loss}')

Finished epoch 0, latest loss 0.877433717250824
Finished epoch 500, latest loss 0.017450978979468346
Finished epoch 1000, latest loss 0.005272890441119671
Finished epoch 1500, latest loss 0.00186156143900007
Finished epoch 2000, latest loss 0.00014145381283015013
Finished epoch 2500, latest loss 2.9728467779932544e-05
Finished epoch 3000, latest loss 1.3709694030694664e-05
Finished epoch 3500, latest loss 3.865837697958341e-06
Finished epoch 4000, latest loss 2.077662429655902e-06
Finished epoch 4500, latest loss 1.0047675687019364e-06
Finished epoch 4999, latest loss 1.8732900741724734e-07
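
One detail worth keeping in mind when reading these numbers: `loss` holds only the loss of the last mini-batch processed in the epoch, not an average over the epoch. The pure-Python sketch below reproduces the slicing for the 247 training rows with `batch_size = 10`; the variable names are illustrative.

```python
n_rows = 247
batch_size = 10

# Same start indices as range(0, len(X_train_tensor), batch_size) in the loop
starts = list(range(0, n_rows, batch_size))
# Slicing past the end is truncated, so the final batch is smaller
sizes = [min(start + batch_size, n_rows) - start for start in starts]

print(len(starts))   # number of mini-batches per epoch -> 25
print(sizes[-1])     # size of the final mini-batch -> 7
```

The printed value is therefore computed on just 7 rows; averaging the per-batch losses over the epoch would likely give a steadier curve.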


## Backward propagation
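- We use a gradient-based optimizer (Adam in this notebook, a variant of gradient descent) to reduce the total loss.
- `loss.backward()` computes the gradients layer by layer, from the output back to the input, and `optimizer.step()` applies the weight update.
- The basic gradient descent update is $ \mathbf{w}_{\text{new}} = \mathbf{w}_{\text{old}} - \alpha \cdot \nabla f(\mathbf{w}_{\text{old}}) $, where $ \alpha $ is the learning rate and $ f $ is the total loss.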

In [35]:

# compute accuracy (no_grad is optional)
with torch.no_grad():
    y_pred_test = model_nn(X_test_tensor)
 
accuracy = (y_pred_test.round() == Y_test_tensor).float().mean()
print(f"Accuracy {accuracy}")
print(pd.crosstab(y_test, y_pred_test.round().view(-1).numpy(), rownames=['True'], colnames=['Predicted'], margins=True))

Accuracy 0.9032257795333862
Predicted  0.0  1.0  All
True                    
0            3    5    8
1            1   53   54
All          4   58   62


In [36]:
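model_nn_l2 = PimaClassifier()
loss_fn_l2 = nn.BCELoss()  # binary cross entropy
# L2 regularization via weight_decay; note that both lr and weight_decay are smaller than in the first model
optimizer_l2 = optim.Adam(model_nn_l2.parameters(), lr=0.0001,weight_decay=1e-7)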


In [37]:

n_epochs = 5000
batch_size = 10
 
for epoch in range(n_epochs):
    for i in range(0, len(X_train_tensor), batch_size):
        Xbatch = X_train_tensor[i:i+batch_size]
        y_pred_l2 = model_nn_l2(Xbatch)
        ybatch = Y_train_tensor[i:i+batch_size]
        loss_l2 = loss_fn_l2(y_pred_l2, ybatch)
        optimizer_l2.zero_grad()
        loss_l2.backward()
        optimizer_l2.step()
    if (((epoch % 500) == 0) or ((n_epochs)==(epoch+1))):
        print(f'Finished epoch {epoch}, latest loss {loss_l2}')

Finished epoch 0, latest loss 2.048273801803589
Finished epoch 500, latest loss 0.07291598618030548
Finished epoch 1000, latest loss 0.045427702367305756
Finished epoch 1500, latest loss 0.03533283248543739
Finished epoch 2000, latest loss 0.029551969841122627
Finished epoch 2500, latest loss 0.02623988874256611
Finished epoch 3000, latest loss 0.024152493104338646
Finished epoch 3500, latest loss 0.022745486348867416
Finished epoch 4000, latest loss 0.021725906059145927
Finished epoch 4500, latest loss 0.021003922447562218
Finished epoch 4999, latest loss 0.02047090046107769


In [38]:

# compute accuracy (no_grad is optional)
with torch.no_grad():
    y_pred_test = model_nn_l2(X_test_tensor)
 
accuracy = (y_pred_test.round() == Y_test_tensor).float().mean()
print(f"Accuracy {accuracy}")
print(pd.crosstab(y_test, y_pred_test.round().view(-1).numpy(), rownames=['True'], colnames=['Predicted'], margins=True))

Accuracy 0.9193548560142517
Predicted  0.0  1.0  All
True                    
0            4    4    8
1            1   53   54
All          5   57   62


In [39]:
model_nn_l1 = PimaClassifier()
loss_fn_l1 = nn.BCELoss()  # binary cross entropy
optimizer_l1 = optim.Adam(model_nn_l1.parameters(), lr=0.0001)


In [40]:
# Regularization coefficient for L1 regularization
l1_lambda = 0.001

In [41]:

n_epochs = 5000
batch_size = 10
 
for epoch in range(n_epochs):
    for i in range(0, len(X_train_tensor), batch_size):
        Xbatch = X_train_tensor[i:i+batch_size]
        y_pred_l1 = model_nn_l1(Xbatch)
        ybatch = Y_train_tensor[i:i+batch_size]
        loss_l1 = loss_fn_l1(y_pred_l1, ybatch)
        # Apply L1 regularization manually
        l1_reg = torch.tensor(0., requires_grad=True)
        for name, param in model_nn_l1.named_parameters():
            if 'weight' in name:
                l1_reg = l1_reg + torch.norm(param, p=1)
        loss_l1 = loss_l1 + l1_lambda * l1_reg
        optimizer_l1.zero_grad()
        loss_l1.backward()
        optimizer_l1.step()
    if (((epoch % 500) == 0) or ((n_epochs)==(epoch+1))):
        print(f'Finished epoch {epoch}, latest loss {loss_l1}')

Finished epoch 0, latest loss 0.9273009300231934
Finished epoch 500, latest loss 0.12008371949195862
Finished epoch 1000, latest loss 0.0819053202867508
Finished epoch 1500, latest loss 0.07156948745250702
Finished epoch 2000, latest loss 0.06547922641038895
Finished epoch 2500, latest loss 0.061503469944000244
Finished epoch 3000, latest loss 0.058417633175849915
Finished epoch 3500, latest loss 0.056181393563747406
Finished epoch 4000, latest loss 0.0545554980635643
Finished epoch 4500, latest loss 0.05319460853934288
Finished epoch 4999, latest loss 0.05202367529273033


In [42]:

# compute accuracy (no_grad is optional)
with torch.no_grad():
    y_pred_test = model_nn_l1(X_test_tensor)
 
accuracy = (y_pred_test.round() == Y_test_tensor).float().mean()
print(f"Accuracy {accuracy}")
print(pd.crosstab(y_test, y_pred_test.round().view(-1).numpy(), rownames=['True'], colnames=['Predicted'], margins=True))


Accuracy 0.9193548560142517
Predicted  0.0  1.0  All
True                    
0            4    4    8
1            1   53   54
All          5   57   62


In [43]:

confusion_matrix(y_test,y_pred_test.round().view(-1).numpy()).ravel()

array([ 4,  4,  1, 53])
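
For binary labels, `confusion_matrix(...).ravel()` returns the counts in the order tn, fp, fn, tp. A short sketch, using the array printed above, of the rates that can be read off it; the variable names are assumptions.

```python
tn, fp, fn, tp = 4, 4, 1, 53            # values from the ravel() output

accuracy = (tp + tn) / (tn + fp + fn + tp)
sensitivity = tp / (tp + fn)            # recall for class 1 (YES)
specificity = tn / (tn + fp)            # recall for class 0 (NO)

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```

A specificity of 0.5 shows that half of the NO cases are still misclassified, which the overall accuracy hides because of the class imbalance.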

In [None]:
### Further Study
- Compare train loss and test loss per epoch to find the best epoch for each model.
- Check for multicollinearity to test the hypothesis that the features are highly correlated.
- Plot the ROC curve, compute the AUC, and determine which decision threshold gives the best performance for the model.
- Try with pure python(By Math Equation) to 