# Theragun Contextual Advertising Model (Using ktrain + DistilBERT)




## 🔍 Step 1: Install Required Libraries
Install pinned versions of keras, tensorflow, transformers, and ktrain so that the four libraries stay compatible with each other.
This sets up your deep learning environment.

In [None]:
!pip install keras==2.12.0 tensorflow==2.12.0 transformers==4.28.1 ktrain==0.37.2
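
After the install cell finishes, it can help to confirm that the running environment really has the pinned versions, since Colab ships its own preinstalled packages. The sketch below uses only the standard library; the `PINNED` dictionary and the `check_pins` helper are names introduced here for illustration and are not part of the original notebook.

```python
from importlib import metadata

# Pins copied from the install cell above.
PINNED = {
    "keras": "2.12.0",
    "tensorflow": "2.12.0",
    "transformers": "4.28.1",
    "ktrain": "0.37.2",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Print one line per package whose installed version differs from the pin.
for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, every pin matches. In Colab, a mismatch usually means the runtime needs a restart after installation.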



## 📥 Step 2: Import Libraries
Import essential Python libraries and modules used for data manipulation, model building, and evaluation.

In [None]:
import os
import ktrain
import pandas as pd
import numpy as np
from ktrain import text
from sklearn.model_selection import train_test_split

## 🔗 Step 3: Mount Google Drive
Mount Google Drive to access your dataset directly within Colab, avoiding upload/download hassles.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## ⚙️ Step 4: Check GPU Access
Verify that a GPU is available to speed up training, since Transformer models are computationally expensive.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Jun 10 21:27:40 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   34C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
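
The cell above relies on the IPython `!` syntax, so it only works inside a notebook. As a rough plain-Python equivalent, the sketch below looks for `nvidia-smi` on the PATH and checks its exit code. The `gpu_available` helper is a name introduced here, and it only reports whether the NVIDIA driver tools respond, not whether TensorFlow can actually use the device.

```python
import shutil
import subprocess


def gpu_available():
    """Return True if `nvidia-smi` exists and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA driver tools on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # the tool could not be run or hung
    return result.returncode == 0


print("GPU detected" if gpu_available() else "Not connected to a GPU")
```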
                                                

## 📂 Step 5: Load Dataset
Load the JSON file containing HuffPost headlines and their categories into a DataFrame. This data includes multiple categories such as WELLNESS, POLITICS, etc.

In [None]:
data = pd.read_json("drive/MyDrive/news_category_trainingdata.json")
data.head()


| Unnamed: 0 | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |
| 3 | ENTERTAINMENT | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | Ron Dicker | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... | 2018-05-26 |
| 4 | ENTERTAINMENT | Julianna Margulies Uses Donald Trump Poop Bags... | Ron Dicker | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... | 2018-05-26 |


## 📊 Step 6: Data Preprocessing
Create a binary target label identifying articles related to health and wellness. Here, articles categorized as "WELLNESS" or "HEALTHY LIVING" get a label 1, others 0. *This step transforms a multi-class problem into a binary classification problem.*

In [None]:
data['label'] = data['category'].apply(lambda x: 1 if x in ['WELLNESS', 'HEALTHY LIVING'] else 0)
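
Because only two of the many categories map to label 1, the positive class is likely to be a minority, and it is worth measuring how small it is before training. The sketch below applies the same labeling rule in plain Python so it runs without the dataset; `to_label`, `positive_share`, and the toy category list are illustrative names and values, not data from the notebook.

```python
from collections import Counter

WELLNESS_CATEGORIES = {"WELLNESS", "HEALTHY LIVING"}


def to_label(category):
    """Same rule as the cell above: 1 for wellness categories, else 0."""
    return 1 if category in WELLNESS_CATEGORIES else 0


def positive_share(categories):
    """Fraction of rows that would receive label 1."""
    counts = Counter(to_label(c) for c in categories)
    total = sum(counts.values())
    return counts[1] / total if total else 0.0


# Hypothetical toy categories, not taken from the dataset.
toy = ["CRIME", "WELLNESS", "ENTERTAINMENT", "HEALTHY LIVING", "POLITICS"]
print(positive_share(toy))
```

On the real DataFrame, `data['label'].value_counts(normalize=True)` reports the same shares directly.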

## ✂️ Step 7: Sample Dataset for Training
To reduce training time while prototyping, take a random 20% sample of the dataset. This balances the need for training data with quicker iterations.

In [None]:
small_data = data.sample(frac=0.2, random_state=42)
texts = small_data['headline']
labels = small_data['label']

## 🔀 Step 8: Train/Validation Split
Split your sampled dataset into training and validation sets. This lets you evaluate model performance on unseen data during training.

In [None]:
train_texts, val_texts, train_labels, val_labels = train_test_split(
    small_data['headline'],
    small_data['label'],
    test_size=0.2,
    stratify=small_data['label'],
    random_state=42
)
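
The `stratify` argument matters here because the wellness class is small: without it, a random split could leave the validation set with noticeably more or fewer positives than the training set. The sketch below is a simplified, standard-library illustration of the idea (split each class separately, then combine). It is not scikit-learn's implementation, and the label counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each label keeps about the same share in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # per-class validation count
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


def share_of_positives(labels, indices):
    """Fraction of the selected rows that carry label 1."""
    return sum(labels[i] for i in indices) / len(indices)


# Hypothetical labels: 100 positives out of 1,000 rows.
labels = [1] * 100 + [0] * 900
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
print(share_of_positives(labels, train_idx), share_of_positives(labels, val_idx))
```

Both parts keep the same 10% share of positives, which is the property `stratify=small_data['label']` is meant to provide in the split above.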

## 🧹 Step 9: Text Preprocessing with DistilBERT
Use ktrain's built-in DistilBERT preprocessing to tokenize and encode the text for the transformer model.
The maximum sequence length (maxlen) is set to 30 tokens to speed up training. The output below reports a 99th-percentile length of 17, so very few headlines are truncated, and the value can be increased later for longer text.

In [None]:
train_data, val_data, preproc = text.texts_from_array(
    x_train=train_texts.to_numpy(),
    y_train=train_labels.to_numpy(),
    x_test=val_texts.to_numpy(),
    y_test=val_labels.to_numpy(),
    class_names=[0, 1],
    preprocess_mode='distilbert',
    maxlen=30
)

preprocessing train...
language: en
train sequence lengths:
	mean : 10
	95percentile : 15
	99percentile : 17


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 10
	95percentile : 15
	99percentile : 16


task: text classification
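
To choose `maxlen` for a different text field, one option is to look at length percentiles before preprocessing. The sketch below computes a nearest-rank percentile of whitespace word counts. This is only a rough proxy: DistilBERT's subword tokenizer can split one word into several tokens and adds special tokens, so some headroom is advisable. The `length_percentile` helper is a name introduced here, and the third sample headline is made up.

```python
import math


def length_percentile(texts, pct):
    """Nearest-rank percentile of whitespace word counts."""
    lengths = sorted(len(t.split()) for t in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    rank = max(1, math.ceil(pct / 100 * len(lengths)))
    return lengths[rank - 1]


# Two headlines from this notebook plus one hypothetical headline.
sample = [
    "Hugh Grant Marries For The First Time At Age 57",
    "The best vitamins for winter health",
    "Five stretches to try after a long run",
]
print(length_percentile(sample, 50), length_percentile(sample, 99))
```

With the real data, `train_texts.tolist()` could be passed in place of `sample`.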


## 🧠 Step 10: Build and Train the Model
Create a text classification model using DistilBERT. Train it for one epoch initially with the one-cycle policy, using a maximum learning rate of 2e-5 and a batch size of 32.
This is a good starting point for fast experimentation.

In [None]:
model = text.text_classifier('distilbert', train_data=train_data, preproc=preproc)
learner = ktrain.get_learner(model, train_data=train_data, val_data=val_data, batch_size=32)

# Train just 1 epoch for speed
learner.fit_onecycle(2e-5, 1)


Is Multi-Label? False
maxlen is 30
done.


begin training using onecycle policy with max lr of 2e-05...


<keras.callbacks.History at 0x7eedb5303190>
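
The log line mentions a "onecycle policy with max lr of 2e-05". The general idea of a one-cycle schedule is that the learning rate ramps up to the maximum and then back down within a single run. The sketch below is a simplified triangular version for intuition only; it is not ktrain's implementation, and `start_div` and the step count of 101 are assumptions chosen for the example.

```python
def triangular_one_cycle(step, total_steps, max_lr, start_div=10.0):
    """Linear ramp from max_lr / start_div up to max_lr, then back down."""
    if total_steps < 2:
        return max_lr
    min_lr = max_lr / start_div
    peak = (total_steps - 1) / 2  # midpoint of the cycle
    distance = abs(step - peak) / peak  # 1.0 at both ends, 0.0 at the peak
    return max_lr - (max_lr - min_lr) * distance


# Hypothetical run of 101 optimizer steps with the notebook's maximum rate.
schedule = [triangular_one_cycle(s, 101, 2e-5) for s in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

The rate starts low, peaks at 2e-5 in the middle of the run, and returns to its starting value at the end.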

## 📈 Step 11: Evaluate the Model
After training the model, we evaluate it on the validation dataset to see how well it distinguishes between health/wellness-related headlines (label 1) and non-wellness headlines (label 0).

In [None]:
learner.validate()

              precision    recall  f1-score   support

           0       0.95      0.94      0.94       703
           1       0.60      0.66      0.63       101

    accuracy                           0.90       804
   macro avg       0.77      0.80      0.79       804
weighted avg       0.91      0.90      0.90       804



array([[658,  45],
       [ 34,  67]])

### 🔎 Interpretation:
 - Strong performance on class 0 (non-wellness): precision is 0.95 and recall is 0.94. This is expected, since class 0 is the majority class (703 of 804 validation headlines). Always predicting class 0 would already reach about 87% accuracy, so the 90% overall accuracy says little on its own.

 - Moderate performance on class 1 (wellness): the model correctly identified 67 of 101 actual wellness headlines, giving 66% recall. The precision of 60% means that 60% of the headlines the model predicted as wellness (67 of 112) were truly wellness-related.

 - F1-score of 0.63 for class 1: this is the harmonic mean of precision and recall, and a useful single summary for the minority class.
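
The class 1 figures can be re-derived from the confusion matrix printed by `learner.validate()`, where rows are true classes and columns are predicted classes. The sketch below is a small worked calculation; `binary_metrics` is a helper name introduced here, and it assumes at least one predicted and one actual positive so that no division by zero occurs.

```python
def binary_metrics(confusion):
    """Class 1 metrics from [[TN, FP], [FN, TP]] (rows = true, cols = predicted)."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / total,
        # Accuracy of always predicting the larger class.
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }


# Confusion matrix reported by learner.validate() above.
metrics = binary_metrics([[658, 45], [34, 67]])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# precision: 0.598
# recall: 0.663
# f1: 0.629
# accuracy: 0.902
# majority_baseline: 0.874
```

The gap between accuracy (0.902) and the majority baseline (0.874) is small, which is why the class 1 precision, recall, and F1 are the figures to watch.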

## 🤖 Step 12: Make Predictions
Use the trained model to predict the class and the class probabilities for new headlines, to help find health and wellness-related news for Theragun.

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict(["The best vitamins for winter health"])
predictor.predict_proba(["The best vitamins for winter health"])

array([[0.29979125, 0.7002088 ]], dtype=float32)

### 🔎 Interpretation:
 - The model predicted class 1, meaning it believes the headline is about health/wellness.

 - The predicted probability for class 1 is about 70%, which is a moderately confident prediction.

This single example suggests the model can classify an unseen, relevant headline as wellness-related; the validation metrics in Step 11 remain the more reliable measure of performance.
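
For ad placement, the class 1 probability can be compared against a threshold instead of relying on the default predicted class. A higher threshold should place ads on fewer headlines but with fewer false positives. The sketch below assumes probability rows shaped like the `predict_proba` output above, with column 1 holding the wellness probability; `select_wellness` and the threshold values are illustrative choices, not part of ktrain.

```python
def select_wellness(headlines, probabilities, threshold=0.5):
    """Keep headlines whose class 1 probability meets the threshold."""
    selected = []
    for headline, row in zip(headlines, probabilities):
        p_wellness = float(row[1])  # column 1 = class 1 (wellness)
        if p_wellness >= threshold:
            selected.append((headline, p_wellness))
    return selected


headlines = ["The best vitamins for winter health"]
probabilities = [[0.29979125, 0.7002088]]  # predict_proba output shown above

print(select_wellness(headlines, probabilities, threshold=0.5))
print(select_wellness(headlines, probabilities, threshold=0.8))
```

The example headline passes a 0.5 threshold but not a stricter 0.8 one. A suitable value would need to be tuned against validation precision and recall.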

## 💾 Step 13: Save the Model
Save your predictor for future use, so you can deploy it or reload it without retraining.

In [None]:
predictor.save("theragun_predictor")
# Load it later:
# predictor = ktrain.load_predictor("theragun_predictor")