## Data Loading and Preprocessing

**Loading and Combining Data:**

* The code loads the training and testing datasets for fall and non-fall events from .npy files.
* The data is combined to form comprehensive training and testing sets.

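**Shuffling Data:**

* The fall and non-fall arrays are concatenated back to back, so each combined set initially holds all fall samples followed by all non-fall samples.
* Shuffling removes this ordering so the model does not learn order-based biases. Features and labels are indexed with the same permutation, which keeps every sample paired with its label.

**Converting Data to Strings:**

* Since BERT models expect text input, the numerical accelerometer data is converted into string format: each sample is flattened and its values are joined with spaces.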

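The shuffling step can be sketched on toy data as follows. The array shapes and the seed are assumptions chosen for illustration; the notebook itself shuffles without a fixed seed, so its order differs between runs. The sketch only shows that indexing features and labels with one shared permutation keeps them aligned.

```python
import numpy as np

# Toy stand-ins for the real arrays (shapes are assumptions for illustration).
X = np.arange(12).reshape(6, 2)   # 6 samples, 2 features; X[i, 0] // 2 == i
y = np.array([1, 1, 1, 0, 0, 0])  # falls first, then non-falls, as after concatenation

# A seeded generator makes the shuffle reproducible across runs.
rng = np.random.default_rng(seed=42)
perm = rng.permutation(len(X))

# Indexing both arrays with the same permutation keeps each row paired with its label.
X_shuffled, y_shuffled = X[perm], y[perm]

# Recover each row's original index and check that its label is unchanged.
original_index = X_shuffled[:, 0] // 2
labels_from_index = (original_index < 3).astype(int)
print(np.array_equal(labels_from_index, y_shuffled))  # True
```

Fixing the seed is optional, but it makes later comparisons between training runs easier to interpret.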


In [2]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from transformers import EarlyStoppingCallback
import torch
from torch.utils.data import Dataset, DataLoader

# Load data
def load_np_array(file_name):
    X_array = np.load('data/X_' + file_name + '_array.npy')
    y_array = np.load('data/y_' + file_name + '_array.npy')
    return X_array, y_array

X_train_fall, y_train_fall = load_np_array("train_fall")
X_train_notfall, y_train_notfall = load_np_array("train_notfall")
X_test_fall, y_test_fall = load_np_array("test_fall")
X_test_notfall, y_test_notfall = load_np_array("test_notfall")

# Combine data
X_train = np.concatenate((X_train_fall, X_train_notfall), axis=0)
y_train = np.concatenate((y_train_fall, y_train_notfall), axis=0)
X_test = np.concatenate((X_test_fall, X_test_notfall), axis=0)
y_test = np.concatenate((y_test_fall, y_test_notfall), axis=0)

# Shuffle data
train_indices = np.arange(X_train.shape[0])
np.random.shuffle(train_indices)
X_train = X_train[train_indices]
y_train = y_train[train_indices]

test_indices = np.arange(X_test.shape[0])
np.random.shuffle(test_indices)
X_test = X_test[test_indices]
y_test = y_test[test_indices]

# Convert numerical data to strings
def convert_to_string(data):
    return [" ".join(map(str, seq.flatten())) for seq in data]

X_train_text = convert_to_string(X_train)
X_test_text = convert_to_string(X_test)

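To make the string conversion concrete, here is a small sketch on a hypothetical batch of two windows with four time steps and three axes each. The real window shape is not stated in this notebook, so the numbers below are purely illustrative.

```python
import numpy as np

def convert_to_string(data):
    # Same approach as the notebook: flatten each window and join values with spaces.
    return [" ".join(map(str, seq.flatten())) for seq in data]

# Hypothetical batch: 2 windows, 4 time steps, 3 axes (x, y, z).
windows = np.array([
    [[0.1, 0.2, 9.8], [0.0, 0.3, 9.7], [1.5, -2.0, 3.1], [0.1, 0.1, 9.8]],
    [[0.0, 0.0, 9.8], [0.0, 0.1, 9.8], [0.1, 0.0, 9.8], [0.0, 0.0, 9.8]],
])

texts = convert_to_string(windows)
print(texts[0])  # 0.1 0.2 9.8 0.0 0.3 9.7 1.5 -2.0 3.1 0.1 0.1 9.8

# Each window contributes time_steps * axes numbers to its string.
values_per_window = len(texts[0].split())
print(values_per_window)  # 12
```

A WordPiece tokenizer usually splits a decimal such as 0.1 into several tokens, so the token count per sample tends to be a multiple of the number of values. It may be worth comparing that count against the max_length chosen later to see how much of each window survives truncation.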
## Dataset Class Definition
**Custom Dataset Class:**
* This class wraps the string-encoded samples and their labels as a PyTorch Dataset.
* It uses the tokenizer to encode each string into the input IDs and attention mask required by BERT, padding or truncating to max_length.
* It also returns the label as a tensor, which is necessary for supervised learning.

In [3]:
# Create Dataset class
class FallDetectionDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

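    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Calling the tokenizer directly is the current idiom; encode_plus is a legacy alias
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',  # pad every sample to the same length
            truncation=True,       # cut off tokens beyond max_length
            return_tensors='pt'
        )
        # return_tensors='pt' adds a batch dimension of size 1; flatten() removes it
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }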


## Tokenizer and Model Initialization
**Tokenizer Initialization:**

* A tokenizer specific to the distilbert-base-uncased model is initialized to preprocess the text data.
* Tokenization converts text into token IDs that the model can consume, and also handles padding and truncation.

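The effect of padding to a fixed length with truncation can be sketched in plain Python. The helper pad_or_truncate below is hypothetical and is not part of the transformers library; it also ignores special tokens, whereas the real tokenizer adds [CLS] and [SEP] and keeps them when truncating. The token IDs are made up.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper mimicking padding='max_length' with truncation=True."""
    ids = token_ids[:max_length]       # truncation: drop tokens past max_length
    mask = [1] * len(ids)              # 1 marks real tokens
    n_pad = max_length - len(ids)      # number of padding positions still needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

short_ids, short_mask = pad_or_truncate([5, 7, 8, 9], max_length=6)
long_ids, long_mask = pad_or_truncate([5, 7, 8, 9, 10, 11, 12, 13], max_length=6)

print(short_ids, short_mask)  # [5, 7, 8, 9, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [5, 7, 8, 9, 10, 11] [1, 1, 1, 1, 1, 1]
```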
**Model Initialization:**

* The DistilBertForSequenceClassification model is initialized with two output labels (fall and not fall).
* This architecture adds a classification head on top of DistilBERT, which makes it suitable for sequence classification tasks.

**DataLoader Creation:**

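* DataLoaders are created for both training and testing datasets.
* DataLoaders handle batching, and the training loader also shuffles the samples each epoch.
* The Trainer used later builds its own DataLoaders from the Dataset objects, so these two loaders are only needed for a manual training loop.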

In [4]:
# Tokenizer and Model
model_name = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Create DataLoader
max_length = 32  # Maximum tokens per sample, including special tokens; longer inputs are truncated
train_dataset = FallDetectionDataset(X_train_text, y_train, tokenizer, max_length)
test_dataset = FallDetectionDataset(X_test_text, y_test, tokenizer, max_length)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Training Arguments and Trainer Initialization

**Defining Training Arguments:**

* Training arguments specify the configuration for training, such as the number of epochs, batch sizes, learning rate, and logging settings.
* Mixed precision training (fp16) is enabled to speed up training and reduce memory usage; it requires a supported GPU.
* Gradient accumulation simulates a larger batch size by accumulating gradients over multiple steps before each optimizer update.

**Trainer Initialization:**

* The Trainer class from Hugging Face simplifies the training loop.
* It handles the training and evaluation processes, model saving, and logging.
* An early stopping callback stops training if the evaluation metric does not improve for a specified number of consecutive evaluations (here 3, with one evaluation per epoch), which limits overfitting and saves computational resources.

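The arithmetic behind gradient accumulation can be worked through as a rough sketch. The batch size and accumulation steps come from the training arguments in this notebook; the device count and dataset size are assumptions, since neither is stated here. The exact number of optimizer steps per epoch depends on how the Trainer handles a trailing partial accumulation group, so the last figure is an estimate.

```python
import math

per_device_train_batch_size = 64   # from the TrainingArguments in this notebook
gradient_accumulation_steps = 2    # from the TrainingArguments in this notebook
num_devices = 1                    # assumption: a single GPU
num_train_samples = 10_000         # hypothetical; the real dataset size is not stated

# Gradients from several micro-batches are summed before one optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Forward/backward passes per epoch, then optimizer updates per epoch (rounded up).
micro_batches_per_epoch = math.ceil(
    num_train_samples / (per_device_train_batch_size * num_devices)
)
optimizer_steps_per_epoch = math.ceil(
    micro_batches_per_epoch / gradient_accumulation_steps
)

print(effective_batch_size)       # 128
print(micro_batches_per_epoch)    # 157
print(optimizer_steps_per_epoch)  # 79
```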
In [5]:
# Training Arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,  # Increased number of epochs
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
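    evaluation_strategy='epoch',  # Newer transformers releases name this argument eval_strategy
    save_strategy='epoch',
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),  # Mixed precision training; only enabled when a CUDA GPU is present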
    gradient_accumulation_steps=2,  # Simulate larger batch size
    learning_rate=2e-5  # Adjusted learning rate
)

# Trainer with EarlyStoppingCallback
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

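The patience mechanism behind early stopping can be illustrated with a simplified model. The function epochs_until_stop is hypothetical and is not the library's implementation; it assumes validation loss is the monitored metric and that one evaluation runs per epoch. The loss values are made up.

```python
def epochs_until_stop(eval_losses, patience=3):
    """Simplified patience logic: stop after `patience` consecutive evaluations
    without a new best (lowest) loss. Returns the 1-based epoch where training stops."""
    best = float("inf")
    bad_evals = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss        # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement at this evaluation
            if bad_evals >= patience:
                return epoch
    return len(eval_losses)    # early stopping never triggered

# Made-up validation losses for 10 epochs.
losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39, 0.38, 0.37, 0.36]
print(epochs_until_stop(losses))  # 6
```

In this toy run the best loss so far is reached at epoch 3 and training stops at epoch 6, so the later improvements are never seen. With load_best_model_at_end=True, the checkpoint from the best evaluation is the one restored after training.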



## Training and Evaluation
**Training the Model:**

* The trainer.train() method starts the training process using the specified training arguments and datasets.

**Evaluating the Model:**

* The trainer.evaluate() method evaluates the trained model on the test dataset and returns performance metrics.
* Predictions are generated for the test set, and accuracy, precision, recall, and F1 score are calculated; precision, recall, and F1 are weighted averages over both classes.
* A classification report provides detailed performance metrics for each class (fall and not fall).
* Because the test set also serves as the evaluation set for early stopping and best-checkpoint selection, the reported metrics may be optimistic.

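The step from model outputs to metrics can be sketched on made-up logits. The values and labels below are invented for illustration and say nothing about this model's performance. The sketch computes precision, recall, and F1 for the Fall class only, which is a different view from the weighted averages used in the notebook and can be more informative when missed falls are the main concern.

```python
import numpy as np

# Made-up logits for 8 test samples; columns are [Not Fall, Fall].
logits = np.array([
    [2.0, -1.0], [1.5, 0.2], [0.3, 0.9], [1.1, -0.4],
    [-0.5, 1.2], [0.8, 0.1], [-1.0, 2.0], [1.7, -0.3],
])
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# The predicted class is the column with the larger logit.
y_pred = np.argmax(logits, axis=1)

# Confusion counts with Fall (label 1) as the positive class.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

precision = tp / (tp + fp)   # share of predicted falls that were real falls
recall = tp / (tp + fn)      # share of real falls that were detected
f1 = 2 * precision * recall / (precision + recall)
accuracy = float(np.mean(y_pred == y_true))

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.625 precision=0.667 recall=0.500 f1=0.571
```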
In [6]:
# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()
print(results)

# Generate predictions
predictions = trainer.predict(test_dataset)
y_pred = np.argmax(predictions.predictions, axis=1)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

# Print classification report
report = classification_report(y_test, y_pred, target_names=['Not Fall', 'Fall'])
print("\nClassification Report:\n", report)

