## This notebook requires GPU

This lab must be run in Google Colab in order to use GPU acceleration for model training. Click the button below to open this notebook in Colab, then set your runtime to GPU:

**Runtime > Change Runtime Type > T4 GPU**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/scott2b/coursera-msds-public/blob/main/notebooks/1_ktrain_deep_learning.ipynb)

# Text Classification of News Media Content Categories with Deep Learning

In this notebook, we will build a deep learning model to classify news articles into content categories, with a focus on identifying articles related to health and wellness. This task is crucial for contextual advertising, where ads are displayed alongside content relevant to their target audience.

## Objectives:
- Load and preprocess a dataset of news headlines and short descriptions for classification.
- Address class imbalance in the dataset.
- Use a pretrained DistilBERT model (a smaller, distilled variant of BERT) for text classification.
- Train and validate the deep learning model.
- Interpret the model's predictions using SHAP values.
- Discuss the implications of using probabilities vs binary classifications.

## Key Points:
- Handling imbalanced datasets to prevent model bias.
- Using pretrained contextual representations from DistilBERT for nuanced text understanding.
- Fine-tuning a deep learning model with TensorFlow and Keras.
- Evaluating model performance with precision, recall, and F1-score.
- Model interpretability with SHAP values.

## Prerequisites:
- Familiarity with Python programming.
- Basic understanding of machine learning and deep learning concepts.
- Experience with pandas, TensorFlow, and Keras libraries.

Let's begin by setting up our environment and importing the necessary libraries.

## Data Setup

The data file for this lab is available in the Coursera lab environment in the `data/` directory. To use it in Colab:

1. In the Coursera lab, click the **Files** tab and navigate to `lab2 / data`
2. Select the checkbox next to `News_Category_Dataset_v3.json`, then click the **Download** button in the toolbar
3. In your Google Drive, create a folder called `coursera-msds` (if you haven't already)
4. Upload the downloaded file to `coursera-msds/`

The code below will mount your Google Drive and load the data automatically.

In [None]:
# ktrain requires TensorFlow-based model classes (TFAutoModel) which
# were removed in transformers v5. We must uninstall the Colab default
# and install a v4.x release.
!pip uninstall -y transformers
!pip install "transformers>=4.43,<5" ktrain shap tf-keras eli5

# Verify the correct version was installed
import transformers
print(f'transformers version: {transformers.__version__}')
assert int(transformers.__version__.split('.')[0]) == 4, (
    f'Expected transformers v4.x but got {transformers.__version__}'
)

In [None]:
# Restart the runtime so all packages load cleanly.
# After the restart, skip the install cell and this cell, then continue from the next cell.
import os
os.kill(os.getpid(), 9)

> **After the runtime restarts**, skip the install and restart cells above and continue running from here.

In [None]:
import os
os.environ['TF_USE_LEGACY_KERAS'] = 'True'

import numpy as np
import pandas as pd
import ktrain
from ktrain import text
import shap

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Dataset source: https://www.kaggle.com/datasets/rmisra/news-category-dataset
DATA_FILENAME = 'News_Category_Dataset_v3.json'
DRIVE_PATH = f'/content/drive/MyDrive/coursera-msds/{DATA_FILENAME}'
LOCAL_PATH = f'data/{DATA_FILENAME}'

if os.path.exists(DRIVE_PATH):
    data_path = DRIVE_PATH
    print(f'Loading data from Google Drive: {DRIVE_PATH}')
elif os.path.exists(LOCAL_PATH):
    data_path = LOCAL_PATH
    print(f'Loading data from local path: {LOCAL_PATH}')
else:
    raise FileNotFoundError(
        f'Data file not found. Please either:\n'
        f'  1. Upload {DATA_FILENAME} to Google Drive at: MyDrive/coursera-msds/\n'
        f'  2. Place {DATA_FILENAME} in a local data/ directory'
    )

reviews = pd.read_json(data_path, lines=True)
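
The `lines=True` argument tells pandas that the file is in JSON Lines format: one JSON object per line rather than a single JSON array. As a minimal sketch of what that means, the snippet below parses two made-up records with only the standard library. The sample text and the helper name `read_json_lines` are hypothetical, and only the `headline`, `short_description`, and `category` fields used later in this notebook are shown.

```python
import io
import json

# Two made-up records in JSON Lines form: one JSON object per line.
sample_file = io.StringIO(
    '{"headline": "5 Ways to Sleep Better", '
    '"short_description": "Small habits help.", '
    '"category": "HEALTHY LIVING"}\n'
    '{"headline": "Markets Close Higher", '
    '"short_description": "Stocks rallied.", '
    '"category": "BUSINESS"}\n'
)


def read_json_lines(handle):
    """Parse a JSON Lines stream into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


records = read_json_lines(sample_file)
print(len(records), sorted(records[0]))
```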

Now, let's preprocess the data by combining the 'headline' and 'short_description' columns and creating a binary 'healthy' label.

In [None]:
# Combine headline and short_description into one column
reviews['combined_text'] = reviews['headline'] + ' ' + reviews['short_description']

# Create a binary label for health-related articles
reviews['healthy'] = np.where((reviews['category'] == 'HEALTHY LIVING'), 1, 0)

To address class imbalance, we downsample the majority class so that both classes have the same number of rows.


In [None]:
# Balance the dataset by sampling equal instances of both classes
healthy = reviews[reviews['healthy'] == 1]
sample_amount = len(healthy)
# A fixed random_state makes the sample reproducible across runs
not_healthy = reviews[reviews['healthy'] == 0].sample(n=sample_amount, random_state=42)
review_sample = pd.concat([healthy, not_healthy])
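
As a hedged illustration of the same downsampling idea on a toy DataFrame, the sketch below cuts every class down to the size of the rarest one and then shuffles the rows. The helper name `downsample_majority`, the toy data, and the fixed `random_state` are assumptions added for illustration and are not part of the lab's dataset or code.

```python
import pandas as pd


def downsample_majority(df, label_col, random_state=0):
    """Return a shuffled frame with every class cut down to the rarest class size."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are interleaved rather than stacked in blocks.
    combined = pd.concat(parts).sample(frac=1, random_state=random_state)
    return combined.reset_index(drop=True)


# Toy data: 2 positive rows and 8 negative rows (made up for illustration).
toy = pd.DataFrame({
    "combined_text": [f"article {i}" for i in range(10)],
    "healthy": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})
balanced = downsample_majority(toy, "healthy")
print(balanced["healthy"].value_counts().to_dict())
```

Downsampling discards majority-class rows, so it trades data volume for balance; that is usually acceptable when the majority class is large.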

Next, we prepare the data for the DistilBERT model, setting a maximum sequence length of 512 tokens and splitting the data into training and validation sets.


In [None]:
# Initialize the DistilBERT preprocessor with ktrain.
# Class names are listed in label-index order: 0 = not_healthy, 1 = healthy.
t = text.Transformer('distilbert-base-uncased', maxlen=512, class_names=['not_healthy', 'healthy'])

# Preprocess the text data and split into training and validation sets
train, val, preprocess = text.texts_from_df(
    review_sample,
    'combined_text',
    label_columns=['healthy'],
    val_df=None,
    max_features=20000,
    maxlen=512,
    val_pct=0.1,
    preprocess_mode='distilbert',
    verbose=1
)
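
Setting `val_pct=0.1` holds out roughly 10% of the rows for validation. The sketch below shows one way a random hold-out split of this kind can work. It illustrates the idea only and is not ktrain's internal implementation; the function name `holdout_split` and the seed are assumptions.

```python
import random


def holdout_split(items, val_pct=0.1, seed=0):
    """Shuffle a copy of items and split off a val_pct fraction for validation."""
    if not 0 < val_pct < 1:
        raise ValueError("val_pct must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_pct))
    # The first n_val shuffled items become validation; the rest are training.
    return shuffled[n_val:], shuffled[:n_val]


train_items, val_items = holdout_split(range(100), val_pct=0.1)
print(len(train_items), len(val_items))
```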

We now train the model with ktrain's `autofit` method, enabling early stopping to limit overfitting.

In [None]:
# Create a learner object with the preprocessed data and model
learner = ktrain.get_learner(t.get_classifier(), train_data=train, val_data=val, batch_size=6)

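# Train the model with early stopping. early_stopping is the patience:
# the number of epochs without validation improvement before training stops.
history = learner.autofit(1e-4, checkpoint_folder='checkpoint', epochs=12, early_stopping=1)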
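
Early stopping halts training once the validation loss has stopped improving for a set number of epochs (the patience). The following is a simplified sketch of that rule in plain Python, not ktrain's or Keras's actual callback. The function name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a patience-based rule.

    Training stops at the first epoch where the validation loss has failed to
    improve on the best value seen so far for `patience` consecutive epochs.
    If that never happens, stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.52, 0.41, 0.38, 0.40, 0.39, 0.45]
print(early_stopping_epoch(losses, patience=2))
```

A larger patience tolerates more noise in the validation loss at the cost of extra training time.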

After training, we evaluate the model's performance on the validation set.

In [None]:
# Evaluate the model's performance
validation = learner.validate(val_data=val, print_report=True)
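
The classification report lists precision, recall, and F1 per class. As a sketch of how these follow from confusion-matrix counts, the helper below computes them for a single class. The counts are invented for illustration and are not results from this model; the function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Invented counts for the "healthy" class: 80 true positives,
# 20 false positives, and 10 false negatives.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```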

Finally, we can interpret the model's predictions using SHAP values.

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc=t)
# Example article for interpretation
article = "Boulder, CO – The University of Colorado, Boulder’s own Professor Pat Ferrucci was honored this week with a prestigious award recognizing him as the top fitness enthusiast on campus."


In [None]:
predictor.explain(article)

The SHAP values will help us understand the contribution of each word to the classification decision.
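
One of this notebook's objectives is to weigh probabilities against binary classifications. A predicted probability becomes a yes/no decision only after a threshold is applied, and moving that threshold trades precision against recall. The sketch below uses invented probabilities and labels. In practice the probabilities could come from the trained predictor (ktrain predictors offer a `predict_proba` method), but wiring that in is an assumption here rather than something the lab code does.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0/1 decisions."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_and_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented probabilities that an article is health-related, with true labels.
probs = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0]

for threshold in (0.5, 0.7):
    decisions = apply_threshold(probs, threshold)
    precision, recall = precision_and_recall(labels, decisions)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

For contextual advertising, a higher threshold would place ads next to fewer articles but with more confidence that each one is health-related; keeping the raw probability lets that trade-off be tuned later without retraining.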

In conclusion, we built and evaluated a deep learning model that classifies news text, with a focus on identifying health and wellness content for contextual advertising. SHAP values made the model's decisions easier to interpret by showing how individual words contributed to a prediction. The same approach can be used to place ads alongside content that matches their target audience.