## Summary
- **Objective**:
    - Earlier models such as Random Forest and XGBoost required manual feature engineering. To reduce that effort while still handling raw text effectively, I turned to transformer models, particularly BERT. The Hugging Face `transformers` library provides pre-trained models such as BERT for PyTorch, and they work directly on tokenized text without extensive feature engineering. These models capture context better than hand-engineered features, and their output can be combined with static features such as upvotes.

- **Transformer Models**:
    - Transformer models are highly effective for NLP tasks because self-attention lets them capture context and relationships within text. However, they are computationally intensive. To manage this, I switched from the base BERT model to a lighter variant, **distilbert-base-uncased**, which is more efficient.

- **Tuning Parameters**:
    To decrease the computational load, the following parameters can be tuned:

    - **Max Sequence Length**: Reducing the maximum sequence length (`max_length`) decreases the number of tokens processed per example, which directly reduces computation.
    - **Batch Size**: Smaller batch sizes reduce peak memory usage, at the cost of more optimizer steps per epoch.
    - **Learning Rate**: The learning rate does not change the cost of a single step, but it affects how many epochs are needed to converge and the final model quality.
    - **Number of Layers**: Using a lighter version of BERT such as DistilBERT, which has 6 transformer layers instead of the 12 in BERT-base, reduces computation.
    - Even with these adjustments, training transformer models on a local machine without a powerful GPU is challenging. A GPU is practically essential for reasonable training times.

- **Comparison with Tree-Based Models**:
    - Due to computational limitations (**Apple M2, 16GB memory**), I couldn't directly compare the transformer model's performance with the previous tree-based models. I expect transformers to perform better given a larger dataset and more compute, but I have not been able to verify this.
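
To get a feel for how much the tuning parameters above matter, here is a rough back-of-the-envelope sketch. It assumes the common estimate of about 12·d² multiply-accumulates per token per layer for the linear projections and feed-forward block, plus 2·n·d for attention over n tokens, and it ignores embeddings, layer norms, and softmax. The function name `encoder_macs` is hypothetical, and the result is an approximation rather than a measurement.

```python
def encoder_macs(seq_len, num_layers, hidden=768, batch_size=1):
    """Rough multiply-accumulate count for one forward pass of a transformer encoder.

    Assumes a feed-forward width of 4 * hidden, as in BERT-base and DistilBERT.
    """
    # Q, K, V and output projections: 4 * hidden^2; feed-forward block: 8 * hidden^2
    per_token_linear = 12 * hidden * hidden
    # Attention scores (QK^T) and the weighted sum over values: 2 * seq_len * hidden
    per_token_attention = 2 * seq_len * hidden
    return batch_size * num_layers * seq_len * (per_token_linear + per_token_attention)


bert_base = encoder_macs(seq_len=512, num_layers=12)  # BERT-base at full sequence length
distil_64 = encoder_macs(seq_len=64, num_layers=6)    # DistilBERT with max_length=64

print(f"BERT-base, 512 tokens: {bert_base / 1e9:.1f} GMACs per example")
print(f"DistilBERT, 64 tokens: {distil_64 / 1e9:.1f} GMACs per example")
print(f"Reduction factor:      {bert_base / distil_64:.0f}x")
```

Under this estimate the cost grows linearly with the number of layers and the batch size, and slightly faster than linearly with the sequence length, which is why shortening `max_length` and switching to DistilBERT are the two most effective levers.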

The following code should run successfully given adequate computational resources:

In [None]:
import torch
from torch.utils.data import DataLoader, Dataset
from torch import nn, optim
from transformers import BertTokenizer, BertModel
import pandas as pd

In [None]:
# Initialize the tokenizer that matches the DistilBERT checkpoint
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

### Helper Functions

In [None]:
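# Encode one text into fixed-length token IDs and an attention mask
def encode_text(text, max_length=64):
    encoded = tokenizer(
        text,
        max_length=max_length,
        truncation=True,       # cut texts longer than max_length
        padding='max_length',  # pad shorter texts up to max_length
        return_tensors='pt'
    )
    # Drop the batch dimension: (1, max_length) -> (max_length,)
    return encoded['input_ids'].squeeze(0), encoded['attention_mask'].squeeze(0)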

# Preprocess the dataset: tokenize "title + body" for every row
def preprocess_data(data):
    # Treat a missing body as an empty string
    text = data['title'] + " " + data['body'].fillna('')
    encoded = [encode_text(t) for t in text]
    # Store one 1-D tensor per row in each column
    data['input_ids'] = [ids for ids, _ in encoded]
    data['attention_mask'] = [mask for _, mask in encoded]
    return data

# Dataset class to handle input_ids, attention_mask, and scores
class ScoreDataset(Dataset):
    def __init__(self, input_ids, attention_mask, scores):
        # Accept either a pre-stacked tensor or a sequence (list or Series) of 1-D tensors
        self.input_ids = input_ids if isinstance(input_ids, torch.Tensor) else torch.stack(list(input_ids))
        self.attention_mask = attention_mask if isinstance(attention_mask, torch.Tensor) else torch.stack(list(attention_mask))
        # Store targets as float32 so they match the model output in the loss
        self.scores = scores if isinstance(scores, torch.Tensor) else torch.tensor(list(scores), dtype=torch.float32)

    def __len__(self):
        return len(self.scores)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attention_mask[idx], self.scores[idx]

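# Define the model
from transformers import AutoModel  # resolves to the DistilBERT model class for this checkpoint

class BertRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        # Load DistilBERT with its matching architecture (BertModel expects a different one)
        self.bert = AutoModel.from_pretrained('distilbert-base-uncased')
        # 768 is the hidden size of distilbert-base-uncased
        self.regressor = nn.Linear(768, 1)

    def forward(self, input_ids, attention_mask):
        # The encoder is frozen: no gradients flow into DistilBERT, only the linear head is trained
        with torch.no_grad():
            outputs = self.bert(input_ids, attention_mask=attention_mask)
        # DistilBERT has no pooler_output; use the hidden state of the first ([CLS]) token instead
        pooled_output = outputs.last_hidden_state[:, 0]
        score = self.regressor(pooled_output)
        return score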

# Training loop
def train(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for input_ids, attention_mask, scores in loader:
        input_ids, attention_mask, scores = input_ids.to(device), attention_mask.to(device), scores.to(device)

        optimizer.zero_grad()
        predictions = model(input_ids, attention_mask)
        loss = criterion(predictions, scores.float().unsqueeze(1))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    return total_loss / len(loader)
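
The training loop above has no evaluation counterpart. A minimal sketch of one, assuming the same `(input_ids, attention_mask, scores)` batch layout, might look like the following. `evaluate` and `TinyRegressor` are hypothetical names; `TinyRegressor` is a small stand-in with the same forward signature as `BertRegressor`, used here only so the sketch runs on synthetic data without downloading any weights.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, criterion, device):
    """Return the mean per-batch loss without updating any weights."""
    model.eval()  # disable dropout and other training-only behavior
    total_loss = 0.0
    with torch.no_grad():
        for input_ids, attention_mask, scores in loader:
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            scores = scores.to(device)
            predictions = model(input_ids, attention_mask)
            total_loss += criterion(predictions, scores.float().unsqueeze(1)).item()
    return total_loss / len(loader)


class TinyRegressor(nn.Module):
    """Hypothetical stand-in model: embedding, masked mean pooling, linear head."""

    def __init__(self, vocab_size=100, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.regressor = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()
        # Average token embeddings over non-padding positions only
        pooled = (self.embed(input_ids) * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.regressor(pooled)


# Synthetic data: 8 examples of 64 token IDs each, with random target scores
torch.manual_seed(0)
ids = torch.randint(0, 100, (8, 64))
mask = torch.ones(8, 64, dtype=torch.long)
scores = torch.rand(8)
val_loader = DataLoader(TensorDataset(ids, mask, scores), batch_size=4)

val_loss = evaluate(TinyRegressor(), val_loader, nn.MSELoss(), torch.device("cpu"))
print(f"Validation loss: {val_loss:.4f}")
```

Calling such a function on a held-out split after each epoch would show whether the training loss is dropping because the model generalizes or only because it memorizes the training set.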


### Training

In [None]:
data = pd.read_csv('askscience_data.csv')
data = preprocess_data(data)

# Scores are assumed to be stored in data['score']
dataset = ScoreDataset(data['input_ids'], data['attention_mask'], data['score'])
loader = DataLoader(dataset, batch_size=4, shuffle=True)

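# Setup the device, model, loss function, and optimizer
# Prefer CUDA, then the Apple-silicon MPS backend (e.g. an M2), then fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")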
model = BertRegressor().to(device)
criterion = nn.MSELoss()
# Only the regression head receives gradients; the encoder runs under torch.no_grad()
optimizer = optim.Adam(model.regressor.parameters(), lr=2e-5)

# Training process
num_epochs = 3
for epoch in range(num_epochs):
    loss = train(model, loader, optimizer, criterion, device)
    print(f"Epoch {epoch+1}, Loss: {loss:.4f}")
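
The summary notes that transformer outputs can be combined with static features. One possible way to do this, which is not part of the code above, is to concatenate the text embedding with the numeric features before the final linear layer. In the sketch below, `StaticFeatureHead`, `text_embedding`, and `static_features` are hypothetical names, and random tensors stand in for the DistilBERT [CLS] embedding and for whatever numeric columns the dataset provides.

```python
import torch
from torch import nn


class StaticFeatureHead(nn.Module):
    """Hypothetical regression head over a text embedding plus numeric features."""

    def __init__(self, text_dim=768, num_static=2):
        super().__init__()
        self.regressor = nn.Linear(text_dim + num_static, 1)

    def forward(self, text_embedding, static_features):
        # Concatenate along the feature dimension: (batch, text_dim + num_static)
        combined = torch.cat([text_embedding, static_features.float()], dim=1)
        return self.regressor(combined)


# Stand-in inputs: 4 examples, each with a 768-d text embedding and 2 numeric features
text_embedding = torch.randn(4, 768)
static_features = torch.randn(4, 2)

head = StaticFeatureHead(text_dim=768, num_static=2)
predictions = head(text_embedding, static_features)
print(predictions.shape)
```

Numeric features on very different scales would normally be standardized before being concatenated, so that they do not dominate or vanish next to the embedding values.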