# üß† ML Model Training Cheat Sheet

## üîß Optimization & Training Dynamics
- **Gradient Descent**: Core algorithm for minimizing loss by updating weights  
- **Learning Rate**: Controls step size during optimization  
- **Optimizer**: Strategy for updating weights (e.g., Adam, SGD)  
- **Epoch**: One full pass through the training data  
- **Batch Size**: Number of samples per gradient update  
- **Learning Rate Scheduler**: Adjusts learning rate over time  
- **Backpropagation**: Computes gradients for weight updates  
- **Weight Initialization**: Starting values for model parameters  
- **Gradient Clipping**: Prevents exploding gradients in deep networks  

## üìä Model Behavior & Evaluation
- **Bias**: Error from overly simplistic assumptions (underfitting)  
- **Variance**: Error from sensitivity to training data (overfitting)  
- **Loss Function**: Measures prediction error (e.g., MSE, CrossEntropy)  
- **Regularization**: Penalizes complexity to reduce overfitting (L1, L2, Dropout)  
- **Early Stopping**: Halts training when validation performance stalls  
- **Cross-Validation**: Tests generalization by splitting data  
- **Ablation Study**: Tests impact of removing components/features  

## üß¨ Architecture & Layers
- **Activation Function**: Adds non-linearity (ReLU, Sigmoid, Tanh)  
- **Embedding**: Maps discrete inputs to dense vectors  
- **Batch Normalization**: Normalizes activations across batches  
- **Dropout**: Randomly disables neurons during training  
- **Residual Connection**: Shortcut path to help train deep networks  
- **Receptive Field**: Input area a CNN neuron ‚Äúsees‚Äù  
- **Stride**: Step size in convolution or pooling  

## ‚è±Ô∏è Time Series & Sequential Modeling
- **Autocorrelation**: Correlation of a signal with itself over time  
- **Sliding Window**: Creates overlapping input segments  
- **Time Lag Features**: Past values used as predictors  
- **Padding**: Equalizes sequence lengths for batching  
- **Exploding/Vanishing Gradients**: Common issues in deep sequence models  
- **Teacher Forcing**: Uses ground truth instead of predictions during training  
- **Sequence-to-Sequence (Seq2Seq)**: Maps input sequence to output sequence  
- **Model Drift**: Performance degrades over time due to changing data  

## üß† Data & Workflow
- **Feature Engineering**: Creating informative input features  
- **Label Encoding / One-Hot Encoding**: Converts categorical data to numeric  
- **Normalization / Scaling**: Adjusts feature ranges for stability  
- **Data Leakage**: Test data influencing training (bad!)  
- **Transfer Learning**: Reusing pretrained models on new tasks  


# CNN vs LSTM comparison in timeseries data.

In [33]:
import pandas as pd
import numpy as np
import torch

## Prepara data set

### Read CSV 

In [2]:
df = pd.read_csv("../data_sets/bejieng_pm2.5.csv")


In [3]:
df.shape

(43824, 13)

In [4]:
df_clean = df.dropna(subset=["pm2.5"])

In [5]:
nan_count = df_clean["pm2.5"].isna().sum()
print(nan_count)

0


In [6]:
y = df_clean['pm2.5'].copy()
y = y.to_frame()

In [7]:
X = df_clean.copy()
X.drop(columns=['pm2.5'], inplace=True)
X.head()

Unnamed: 0,No,year,month,day,hour,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
24,25,2010,1,2,0,-16,-4.0,1020.0,SE,1.79,0,0
25,26,2010,1,2,1,-15,-4.0,1020.0,SE,2.68,0,0
26,27,2010,1,2,2,-11,-5.0,1021.0,SE,3.57,0,0
27,28,2010,1,2,3,-7,-5.0,1022.0,SE,5.36,1,0
28,29,2010,1,2,4,-7,-5.0,1022.0,SE,6.25,2,0


In [8]:
X.shape

(41757, 12)

In [9]:
y.head()

Unnamed: 0,pm2.5
24,129.0
25,148.0
26,159.0
27,181.0
28,138.0


### Prepare Labels with awesome PD utilities

In [10]:
# replace NaN with mean
y.fillna(y.mean(), inplace=True)

In [11]:
# Encode direction
X['cbwd'] = pd.factorize(X['cbwd'])[0]

In [12]:
X.fillna(X.mean(), inplace=True)

In [13]:
# Normalize features
X['DEWP'] = (X['DEWP'] - X['DEWP'].min()) / (X['DEWP'].max() - X['DEWP'].min())
X['PRES'] = (X['PRES'] - X['PRES'].min()) / (X['PRES'].max() - X['PRES'].min())
X['Iws'] = (X['Iws'] - X['Iws'].min()) / (X['Iws'].max() - X['Iws'].min())
X['TEMP'] = (X['TEMP'] - X['TEMP'].min()) / (X['TEMP'].max() - X['TEMP'].min())
X['Is'] = (X['Is'] - X['Is'].min()) / (X['Is'].max() - X['Is'].min())
X['Ir'] = (X['Ir'] - X['Ir'].min()) / (X['Ir'].max() - X['Ir'].min())
X['cbwd'] = (X['cbwd'] - X['cbwd'].min()) / (X['cbwd'].max() - X['cbwd'].min())

In [14]:
X = X.astype(np.float16)
display(X.head())

  has_large_values = (abs_vals > 1e6).any()


Unnamed: 0,No,year,month,day,hour,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
24,25.0,2010.0,1.0,2.0,0.0,0.353027,0.24585,0.527344,0.0,0.002371,0.0,0.0
25,26.0,2010.0,1.0,2.0,1.0,0.367676,0.24585,0.527344,0.0,0.003948,0.0,0.0
26,27.0,2010.0,1.0,2.0,2.0,0.426514,0.229492,0.54541,0.0,0.00552,0.0,0.0
27,28.0,2010.0,1.0,2.0,3.0,0.485352,0.229492,0.563477,0.0,0.00869,0.037048,0.0
28,29.0,2010.0,1.0,2.0,4.0,0.485352,0.229492,0.563477,0.0,0.010262,0.074097,0.0


In [15]:
# Create shifted features for multi-step forecasting
target_steps = 4
X_shifted = pd.concat([X.shift(-i) for i in range(target_steps)], axis=1)
X_shifted.dropna(inplace=True)

In [16]:
X_shifted.shape

(41754, 48)

In [17]:
X_shifted_timeseries = X_shifted.values.reshape(-1, target_steps, 12)

In [18]:
X_shifted_timeseries.shape

(41754, 4, 12)

### Prepare Target

In [19]:
y['pm2.5'] = (y['pm2.5'] - y['pm2.5'].min()) / (y['pm2.5'].max() - y['pm2.5'].min())

In [20]:
target_steps = 4
y_shifted = pd.concat([y.shift(-i) for i in range(target_steps)], axis=1)
y_shifted.dropna(inplace=True)

In [21]:
y_shifted_timeseries = y_shifted.values.reshape(-1, target_steps, 1)
y_shifted_timeseries.shape

(41754, 4, 1)

### Setup Torch tensor for train test

In [22]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [23]:
all_data_X = torch.tensor(X_shifted_timeseries, dtype= torch.float32 , device=device)
print(all_data_X.shape)
train_size = int(len(all_data_X) * 0.8)
train_X = all_data_X[:train_size, :]
test_X = all_data_X[train_size:, :]

torch.Size([41754, 4, 12])


In [24]:
all_data_y = torch.tensor(y_shifted_timeseries, dtype= torch.float32 , device=device)
print(all_data_y.shape)
train_size = int(len(all_data_y) * 0.8)
train_y = all_data_y[:train_size, :]
test_y = all_data_y[train_size:, :]

torch.Size([41754, 4, 1])


In [25]:
print('train_X.shape ', train_X.shape)
print('train_y.shape ', train_y.shape)

print('test_X.shape', test_X.shape)
print('test_y.shape', test_y.shape)

train_X.shape  torch.Size([33403, 4, 12])
train_y.shape  torch.Size([33403, 4, 1])
test_X.shape torch.Size([8351, 4, 12])
test_y.shape torch.Size([8351, 4, 1])


### Setup Model

In [26]:
import torch.optim as optim
import torch.utils.data as data
import torch.nn as nn

In [27]:
class LSTMForecaster(nn.Module):
    def __init__(self, input_size=4, hidden_size=50, num_layers=1, output_size=4):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.lstm(x)
        # print('out.shape ', out.shape)
        out = self.fc(out)  # last time step
        return out


In [38]:
import torch.optim as optim
import torch.utils.data as data

model = LSTMForecaster(input_size= train_X.shape[2], hidden_size=20,num_layers=3, output_size=1).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = nn.MSELoss()
loader = data.DataLoader(data.TensorDataset(train_X, train_y), shuffle=True, batch_size=5)

In [39]:
y_pred = model(train_X)
print('y_pred.shape ', y_pred.shape)
loss = loss_fn(y_pred, train_y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
y_pred_test = model(test_X)
loss = loss_fn(y_pred_test, test_y)
rmse = torch.sqrt(loss_fn(y_pred_test, test_y))
print('Test RMSE: %.3f' % rmse)

y_pred.shape  torch.Size([33403, 4, 1])
Test RMSE: 0.126


#### Train

In [None]:
n_epochs = 200
train_rmse_list = []
test_rmse_list = []
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        optimizer.step()
    # Validation
    if epoch % 10 != 0:
        continue
    model.eval()
    with torch.no_grad():
        y_pred = model(train_X)
        train_rmse = torch.sqrt(loss_fn(y_pred, train_y))
        y_pred = model(test_X)
        test_rmse = torch.sqrt(loss_fn(y_pred, test_y))
        train_rmse_list.append(train_rmse.item())
        test_rmse_list.append(test_rmse.item())
    print("Epoch %d: train RMSE %.4f, test RMSE %.4f" % (epoch, train_rmse, test_rmse))

Epoch 0: train RMSE 0.1992, test RMSE 0.1996
Epoch 5: train RMSE 0.1992, test RMSE 0.1996
Epoch 10: train RMSE 0.1992, test RMSE 0.1996
Epoch 15: train RMSE 0.1992, test RMSE 0.1996


### CNN Model

In [42]:
from torchviz import make_dot

#### Define Model

In [83]:
class CNN1DForecaster(nn.Module):
    def __init__(self, time_steps=4, output_size=1, mid_channels=64):
        super().__init__()
        self.cnn1d1 = nn.Conv1d(in_channels=time_steps, out_channels=mid_channels, kernel_size=1).to(device)
        self.cnn1d2 = nn.Conv1d(in_channels=mid_channels, out_channels=4, kernel_size=1).to(device)
        self.AvgPool1d = nn.AvgPool1d(kernel_size=1).to(device)
        self.relu = nn.ReLU(inplace=True).to(device)
        self.Linear = nn.Linear(12, output_size).to(device)
    
    def forward(self, x):
        x = self.cnn1d1(x)
        x = self.relu(x)
        x = self.cnn1d2(x)
        x = self.AvgPool1d(x)
        x = self.relu(x)
        # x = x.view(x.size(0), -1)  # Flatten the tensor
        x = self.Linear(x)
        return x
    
    

In [84]:
model = CNN1DForecaster(time_steps=4, output_size=1).to(device)

In [85]:
train_X.shape

torch.Size([33403, 4, 12])

In [87]:
y_pred = model(train_X)
print('y_pred.shape ', y_pred.shape)
loss = loss_fn(y_pred, train_y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
y_pred_test = model(test_X)
loss = loss_fn(y_pred_test, test_y)
rmse = torch.sqrt(loss_fn(y_pred_test, test_y))
print('Test RMSE: %.3f' % rmse)

y_pred.shape  torch.Size([33403, 4, 1])
Test RMSE: 0.117


In [88]:
n_epochs = 20
train_rmse_list = []
test_rmse_list = []
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        optimizer.step()
    # Validation
    if epoch % 5 != 0:
        continue
    model.eval()
    with torch.no_grad():
        y_pred = model(train_X)
        train_rmse = torch.sqrt(loss_fn(y_pred, train_y))
        y_pred = model(test_X)
        test_rmse = torch.sqrt(loss_fn(y_pred, test_y))
        train_rmse_list.append(train_rmse.item())
        test_rmse_list.append(test_rmse.item())
    print("Epoch %d: train RMSE %.4f, test RMSE %.4f" % (epoch, train_rmse, test_rmse))

Epoch 0: train RMSE 0.1135, test RMSE 0.1166
Epoch 5: train RMSE 0.1135, test RMSE 0.1166
Epoch 10: train RMSE 0.1135, test RMSE 0.1166
Epoch 15: train RMSE 0.1135, test RMSE 0.1166


In [60]:
op = model(train_X)
print(op.shape)


torch.Size([33403, 60, 1])


### CNN1D and LSTM model

In [None]:
class CNN1DLSTMForecaster(nn.Module):
    def __init__(self, time_steps=4, output_size=1,hidden_size=20,num_layers = 2, mid_channels=64,features=12):
        super().__init__()
        self.cnn1d1 = nn.Conv1d(in_channels=time_steps, out_channels=mid_channels, kernel_size=1).to(device)
        self.AvgPool1d = nn.AvgPool1d(kernel_size=1).to(device)
        self.relu = nn.ReLU(inplace=True).to(device)
        self.lstm = nn.LSTM(input_size=12, hidden_size=hidden_size, num_layers=num_layers, output_size=1).to(device)
        self.Linear = nn.Linear(12, output_size).to(device)
    
    def forward(self, x):
        x = self.cnn1d1(x)
        x = self.AvgPool1d(x)
        x = self.relu(x)
        x = self.lstm(x)
        # x = x.view(x.size(0), -1)  # Flatten the tensor
        x = self.Linear(x)
        return x

In [None]:


model = CNN1DLSTMForecaster(time_steps= train_X.shape[1], hidden_size=20,num_layers=3, output_size=1).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = nn.MSELoss()
loader = data.DataLoader(data.TensorDataset(train_X, train_y), shuffle=True, batch_size=5)