# 📊 Machine Learning Project Lifecycle

A complete ML project lifecycle was followed to troubleshoot and deploy an improved model for a news article classification task.

---

## 🧭 Scoping

### 🔍 Business Problem Identification
- A news classifier was required for integration into a news aggregation application.
- Manual sorting was inefficient; rule-based systems underperformed due to data scale.

### 🧪 Existing Solution Review
- A lightweight ML prototype using article titles existed.
- Performance during internal evaluation was suboptimal.

### ✅ Feasibility & Value Assessment
- ML was found suitable for text classification tasks.

### 🗓️ Milestones & Budget Planning
- Success metrics were established.
- A 3-week proof-of-concept based on the prototype was scheduled.
- Timelines and resource needs for future deployment were estimated.

---

## 📂 Data

- A data-centric approach was applied.
- Model code was held fixed (left unchanged) during the initial stages.
- Data quality was iteratively improved to enhance model performance.

---

## 🤖 Modeling

- Model performance was evaluated using defined metrics.
- Weak points were identified and addressed.
- Iterative improvements were made to both data and model.

---

## 🚀 Deployment

- A plan was created for model deployment.
- Monitoring and maintenance procedures were defined.
- Setup checkpoints were used to resume progress without re-running prior sections:
  - Each major stage (Data, Modeling, Deployment) had a `Setup` subsection to load saved results.

---

## 🧩 Support Features

- Inline hints and solutions were available for coding exercises.
- Progress continuity was enabled via setup cells in each section.

---

## 📰 Use Case Summary

- A start-up aimed to improve a news classifier ML model.
- An existing prototype was assessed, refined, and redeployed.
- The full lifecycle was followed: Scoping → Data → Modeling → Deployment.


In [1]:
# Mount Google Drive so the project files are accessible

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# List the contents of the directory to visually confirm lab_utils.py exists
# Make sure this output confirms lab_utils.py is directly in this directory
!ls /content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/

import sys
import os

# Ensure the path is added to sys.path
# The path should point to the directory containing lab_utils.py
lab_utils_dir = '/content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/'

# Add the directory to sys.path if it's not already there
if lab_utils_dir not in sys.path:
    sys.path.insert(0, lab_utils_dir) # Use insert(0, ...) to prioritize this path

# Now attempt the import. We can also try reloading the module if it was previously attempted and failed.
try:
    import lab_utils
except ModuleNotFoundError:
    # If the import still fails after adding the path,
    # it might be a caching issue. A more robust way is to ensure
    # the path is in sys.path *before* the first import attempt in the notebook session.
    # If running this cell after a failed import, try restarting the kernel
    # and running this cell first.
    print(f"Failed to import lab_utils from {lab_utils_dir}. Please ensure the path is correct and try restarting the kernel if the file exists.")
    raise # Re-raise the error to make it visible

import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

assets	E2			FinalProject.ipynb  news_articles.db  serving
E1	experiment-tracker.csv	lab_utils.py	    __pycache__


## 📂 Data

The available data was examined to assess its suitability for the application. Existing prototype files were stored in the `E1` (Experiment 1) folder, structured as follows:

### 📁 Folder Structure

- `data/`  
  - Contains `train_data.csv` and `test_data.csv`

- `model/`  
  - Holds a TensorFlow `SavedModel` trained on `train_data.csv`

- `vocab/`  
  - Includes two key text files:
    - `labels.txt` – 8 target topics for classification
    - `vocabulary.txt` – Top 10,000 most frequent words from training titles

### 🧰 Setup

Folder paths were saved to variables using the `set_experiment_dirs()` helper function from the `lab_utils.py` file.  
This utility file also includes reusable functions used throughout the notebook.  
Reading through these functions is recommended to understand what each one does.
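
The source of `set_experiment_dirs()` is not shown in this notebook, so the following is only a minimal sketch of what such a helper might look like, assuming it returns the `data`, `model`, and `vocab` subfolders of the base directory. The name `set_experiment_dirs_sketch` and the choice to create missing folders are assumptions, not the actual `lab_utils` implementation.

```python
import os
import tempfile


def set_experiment_dirs_sketch(base_dir):
    """Hypothetical stand-in for lab_utils.set_experiment_dirs()."""
    # Assumed layout: <base_dir>/data, <base_dir>/model, <base_dir>/vocab
    subdirs = tuple(os.path.join(base_dir, name) for name in ("data", "model", "vocab"))
    for path in subdirs:
        # Assumption: missing folders are created rather than raising an error
        os.makedirs(path, exist_ok=True)
    return subdirs


# Try it in a throwaway directory instead of the real Drive folder
with tempfile.TemporaryDirectory() as tmp:
    data_dir, model_dir, vocab_dir = set_experiment_dirs_sketch(os.path.join(tmp, "E1"))
    created = all(os.path.isdir(p) for p in (data_dir, model_dir, vocab_dir))

print(os.path.basename(data_dir), os.path.basename(model_dir), os.path.basename(vocab_dir))
```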


In [3]:
# Working folder for the experiment
BASE_DIR = '/content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E1'

# Get the subdirectories that contain the experiment files
data_dir, model_dir, vocab_dir = lab_utils.set_experiment_dirs(BASE_DIR)

print(
    f'base directory: {BASE_DIR}\n\n'
    f'data: {data_dir}\n'
    f'model: {model_dir}\n'
    f'vocab: {vocab_dir}\n'
)

base directory: /content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E1

data: /content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E1/data
model: /content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E1/model
vocab: /content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E1/vocab



## 📄 Dataset Loading & Preview

The datasets were loaded into Pandas DataFrames for inspection and manipulation.

### 📊 Columns Overview

Each dataset contains 6 columns. However, only the following 2 were used by the prototype model:

- `title` – Article title (used as model input)
- `topic` – Article category (used as the prediction label, with 8 classes):
  - entertainment, health, technology, world, business, sports, nation, science

### 🗂️ Other Columns (Used Elsewhere in the App)

- `link`
- `domain`
- `published_date`

### 🔍 Preview

The first 10 rows of the training dataset were displayed for initial examination.


In [4]:

pd.set_option('display.max_colwidth', None)

# Load the datasets into dataframes
train_df = pd.read_csv(f'{data_dir}/train_data.csv')
test_df = pd.read_csv(f'{data_dir}/test_data.csv')

# Preview the first 10 rows of the training set
train_df[:10]


Unnamed: 0,title,link,domain,published_date,topic
0,"NSW remains on COVID-19 high alert | The Star | Newcastle, NSW",https://www.newcastlestar.com.au/story/6878465/nsw-remains-on-covid-19-high-alert/?cs=7,newcastlestar.com.au,2020-08-13 17:32:28,HEALTH
1,Missing people: How does someone just disappear?,https://www.bbc.co.uk/news/uk-england-cambridgeshire-53648599,bbc.co.uk,2020-08-04 15:56:20,NATION
2,"An American Pickle review: In a dual role, Seth Rogen’s drama wins over comedy",https://www.polygon.com/2020/8/6/21357120/american-pickle-review-movie-seth-rogen-simon-rich-hbo-max,polygon.com,2020-08-06 15:58:22,ENTERTAINMENT
3,"Katie Spotz to Run 130 Miles to Bring Clean Water to Tanzania, Shares Testimony of Change",http://www.christianitydaily.com/articles/9647/20200807/katie-spotz-run-130-miles-bring-clean-water-tanzania-shares.htm,christianitydaily.com,2020-08-07 19:45:00,NATION
4,CDC Director Says This Fall Could Be ‘the Worst’ We’ve Ever Had Thanks to COVID-19,https://www.self.com/story/worst-fall-ever-covid,self.com,2020-08-13 19:37:52,HEALTH
5,Free water and electricity may be counterproductive – Economist,https://www.ghanaweb.com/GhanaHomePage/NewsArchive/Free-water-and-electricity-may-be-counterproductive-Economist-1031098,ghanaweb.com,2020-08-11 13:25:37,NATION
6,State places Tazewell County on COVID-19 warning list,https://week.com/2020/08/07/state-places-tazewell-county-on-covid-19-warning-list/,week.com,2020-08-07 18:00:24,HEALTH
7,Over 100 people quarantined in Mississippi school district after several test positive for coronavirus,https://www.nbcnews.com/news/us-news/over-100-people-quarantined-mississippi-school-district-after-several-test-n1236012,nbcnews.com,2020-08-06 16:35:00,NATION
8,How global warming is wiping out Africa's oldest baobab trees,http://www.capetalk.co.za/articles/307750/how-global-warming-is-wiping-out-africa-s-oldest-baobab-trees,capetalk.co.za,2018-06-14 12:00:53,WORLD
9,"Delay routine dental checkups, WHO urges, until COVID risk is known","https://www.thepeninsulaqatar.com/article/11/08/2020/Delay-routine-dental-checkups,-WHO-urges,-until-COVID-risk-is-known",thepeninsulaqatar.com,2020-08-11 17:13:00,HEALTH


## 🧾 Define the Data

Effective data preparation was recognized as essential for strong model performance.

### ⚠️ Label Quality Considerations

If human labelers received unclear guidelines, inconsistencies in labeling may have occurred. Such ambiguity can reduce the model's ability to learn the true input-output relationships.

### 🔍 Key Points to Examine

- **Input Feature (`title`)**
  - Are titles clear, representative, and informative enough for classification?

- **Label (`topic`)**
  - Do any labels seem inconsistent or unclear?
  - Are there overlapping classes or subjective interpretations?

### ❓ Potential Questions

- Are all 8 topic classes clearly distinct?
- Could the same title reasonably fit into multiple categories?
- Are there patterns in mislabeled or vague entries?

### 🔬 Action

The dataset was explored for anomalies, ambiguity, or inconsistencies.  
No corrections were made at this stage—issues were only identified for later review.
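
One way to make the search for inconsistencies more systematic is to look for titles that appear with more than one label. The sketch below uses plain Python on `(title, topic)` pairs so that it stays self-contained; the helper name and the toy rows (including the conflicting label) are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def titles_with_conflicting_labels(rows):
    """Return {normalized title: sorted labels} for titles seen with more than one label."""
    labels_by_title = defaultdict(set)
    for title, topic in rows:
        # Normalize lightly so trivial case/whitespace differences do not hide duplicates
        labels_by_title[title.strip().lower()].add(topic)
    return {t: sorted(labels) for t, labels in labels_by_title.items() if len(labels) > 1}


# Toy rows; the second label for the first title is made up to show a conflict
sample_rows = [
    ("Sniffer dogs begin trials to detect Covid", "HEALTH"),
    ("Sniffer dogs begin trials to detect Covid", "NATION"),
    ("Forex auction bolsters productive sector", "NATION"),
]

conflicts = titles_with_conflicting_labels(sample_rows)
print(conflicts)
```

With the real data, the pairs could be produced from `train_df[['title', 'topic']]`, for example with `itertuples(index=False)`.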


In [5]:
# Print 10 titles and label them manually to look for patterns

# Indices of the dataframe to use
start_index = 30
end_index = 40

# Sample titles to label (displayed only when this is the last expression in the cell)
train_df[['title']][start_index:end_index]

# The same rows together with their 'true' labels
train_df[['title', 'topic']][start_index:end_index]

Unnamed: 0,title,topic
30,"28% Fall in Mumbai's Home Isolation Cases, 62% Drop in Institutional Quarantine as Covid Curve Plateaus",NATION
31,Sniffer dogs begin trials to detect Covid,HEALTH
32,Seth Rogen says Marvel blockbusters make it difficult for comedy films,ENTERTAINMENT
33,"Farewell to Pete Way, the debauched bassist with a frontman's swagger",ENTERTAINMENT
34,Poo detectives on the COVID-19 frontline,HEALTH
35,Fortnite maker challenges Apple and Google's app store rules through direct-payment discounts,TECHNOLOGY
36,Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story,BUSINESS
37,Forex auction bolsters productive sector,NATION
38,Intel Tiger Lake could outrun AMD Ryzen 4000 in single-core performance,TECHNOLOGY
39,Rappi delivery workers in Colombia's Bogota strike over conditions,TECHNOLOGY


## ❓ Example Questions to Consider (Post-Inspection)

After inspecting the data, the following questions may be raised to assess data quality and improve model performance:

### 🧠 Labeling Process
- How were the labels assigned?
- How many annotators were involved, and how were their labels aggregated?
- Was inter-rater reliability measured?

### 🔍 Topic Definitions
- How are the topics defined?
  - For instance, what distinguishes *Science* from *Technology*?

### 🔄 Feature Exploration
- Can any unused columns (`link`, `domain`, `published_date`) serve as predictive features?

### 📈 Data Sufficiency
- Is the title alone a strong enough signal for classification?
- If re-collection were possible, what additional features could improve consistency?

### 🧩 Category Optimization
- Would merging or redefining topics improve accuracy?

> These questions were noted for further exploration. Not all require immediate action.
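
If a second set of labels were ever collected, inter-rater reliability could be estimated with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python version of that standard formula; the two annotator lists are invented, since the notebook contains no multi-annotator data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations on the Science vs. Technology boundary
annotator_1 = ["SCIENCE", "TECHNOLOGY", "TECHNOLOGY", "SCIENCE"]
annotator_2 = ["SCIENCE", "TECHNOLOGY", "SCIENCE", "SCIENCE"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```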


In [17]:
from tensorflow import keras
from tensorflow.keras import layers

# Define the model architecture
model = keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=24, input_length=20),
    layers.Dense(24, activation='relu'),
    layers.Flatten(),
    layers.Dense(8, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])

# Print the model summary to verify it matches the desired structure
model.summary()



In [8]:
# Create a lookup list for the labels
topic_lookup = tf.keras.layers.StringLookup(vocabulary=f'{vocab_dir}/labels.txt', num_oov_indices=0)

# Check the list of labels
topic_lookup.get_vocabulary()

[np.str_('ENTERTAINMENT'),
 np.str_('HEALTH'),
 np.str_('TECHNOLOGY'),
 np.str_('WORLD'),
 np.str_('BUSINESS'),
 np.str_('SPORTS'),
 np.str_('NATION'),
 np.str_('SCIENCE')]

## 🔢 Preprocessing for Evaluation

Before evaluation, the input and output data needed to be transformed into numeric formats compatible with the model.

### 🏷️ Label Encoding

- The `topic` column in `test_df` contained string labels.
- A `StringLookup()` layer was used to map each topic string to a unique integer index.

### 🧾 Title Tokenization

- The `title` column also used string data.
- A `TextVectorization()` layer was employed to tokenize and convert titles into integer sequences.

### 🧠 Vocabulary Settings

- Vocabulary was loaded from `vocab_dir/vocabulary.txt`, which included the top 10,000 most frequent words from the training set.
- Sequences were padded or truncated to exactly 20 tokens, as titles rarely exceeded this length.

### ⚙️ Configuration Summary

| Component              | Setting / Source                              |
|------------------------|-----------------------------------------------|
| Label Encoding         | `StringLookup()` layer                        |
| Title Vectorization    | `TextVectorization()` layer                   |
| Vocabulary File        | `vocab_dir/vocabulary.txt`                    |
| Sequence Length        | Padded or truncated to 20 tokens per title    |

> These preprocessing steps prepared the test data for model evaluation.
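
To make the two transformations concrete, here is a plain-Python approximation of what the lookup and vectorization layers do. It is a sketch, not the Keras implementation: the toy vocabulary is invented, and the punctuation handling only approximates the default standardization of `TextVectorization`. Index 0 is used for padding and index 1 for out-of-vocabulary tokens, as in the layer's integer output mode.

```python
import string

MAX_LENGTH = 20

# Label order as listed in labels.txt (no OOV index is reserved for labels)
LABELS = ["ENTERTAINMENT", "HEALTH", "TECHNOLOGY", "WORLD",
          "BUSINESS", "SPORTS", "NATION", "SCIENCE"]

# Toy vocabulary for illustration; the real file holds 10,000 entries
TOY_VOCAB = ["", "[UNK]", "in", "to", "cases", "rise"]

label_to_index = {label: i for i, label in enumerate(LABELS)}
token_to_index = {token: i for i, token in enumerate(TOY_VOCAB)}


def encode_title(title, max_length=MAX_LENGTH):
    """Lowercase, strip punctuation, map tokens to indices, then pad or truncate."""
    cleaned = title.lower().translate(str.maketrans("", "", string.punctuation))
    ids = [token_to_index.get(token, 1) for token in cleaned.split()]  # 1 = unknown token
    ids = ids[:max_length]                                             # truncate long titles
    return ids + [0] * (max_length - len(ids))                         # pad short titles with 0


encoded = encode_title("Dengue fever cases in Laos rise to 4256")
print(encoded)
print(label_to_index["HEALTH"])
```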


In [9]:
# Title length and vocabulary size used by the team for the prototype
MAX_LENGTH = 20
VOCAB_SIZE = 10000

# Instantiate a layer for text preprocessing
title_preprocessor = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE, output_sequence_length=MAX_LENGTH)

# Load the vocabulary file
title_preprocessor.load_assets(vocab_dir)

# Check the vocabulary size
print(f'vocabulary size: {title_preprocessor.vocabulary_size()}')

# Get a sample title
sample_title = train_df['title'][10]

# Sample title in string format
print(f"sample text: {sample_title}")

# Sample title represented as an integer sequence
print(f"sample text (preprocessed): {title_preprocessor(sample_title)}")

vocabulary size: 10000
sample text: Dengue fever cases in Laos rise to 4256
sample text (preprocessed): [4040 1979   30    2 9339  282    3    1    0    0    0    0    0    0
    0    0    0    0    0    0]


## 📊 Evaluating the Prototype Model

Once preprocessing layers were defined, the test set was converted into a format compatible with the model.

### 🔄 Dataset Conversion

- The `df_to_tfdata()` utility was used to convert `test_df` into a `tf.data.Dataset`.
- This utility applied the `title_preprocessor` and `topic_lookup` layers to:
  - Transform article titles into integer sequences.
  - Convert string labels into integer indices.

### 📈 Running Evaluation

- The model's `evaluate()` method was then called on the processed dataset.
- Performance metrics such as **accuracy** were computed and recorded.

> These metrics served as the official baseline for future improvements.


## 📉 Baseline Accuracy and Experiment Tracking

### ✅ Evaluation Summary

- The prototype model achieved **78% accuracy** on the test dataset.
- This was measured using `SparseCategoricalAccuracy()`, which compares each integer label with the index of the highest predicted class probability.
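
The metric itself is simple enough to sketch in plain Python: take the argmax of each predicted probability vector and count how often it equals the integer label. The function below is an illustration with a made-up 3-class example, not the Keras implementation.

```python
def sparse_categorical_accuracy(true_indices, predicted_probs):
    """Fraction of samples whose highest-probability class equals the integer label."""
    if len(true_indices) != len(predicted_probs) or not true_indices:
        raise ValueError("labels and predictions must be non-empty and the same length")
    correct = 0
    for label, probs in zip(true_indices, predicted_probs):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(true_indices)


# Toy example with 3 classes instead of the 8 topics
y_true = [1, 2, 0, 2]
y_prob = [
    [0.1, 0.7, 0.2],  # predicts class 1 (correct)
    [0.2, 0.5, 0.3],  # predicts class 1 (wrong, label is 2)
    [0.8, 0.1, 0.1],  # predicts class 0 (correct)
    [0.1, 0.2, 0.7],  # predicts class 2 (correct)
]

accuracy = sparse_categorical_accuracy(y_true, y_prob)
print(accuracy)
```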

### 🗂️ Experiment Tracking

- An experiment tracking system was encouraged.
- The `experiment-tracker.csv` file was recommended for recording:
  - Accuracy metrics.
  - Experiment configurations.
  - Model directory references.

> Regular experiment logging was emphasized to improve reproducibility and project management.
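
A lightweight way to keep such a log is to append one row per experiment to the CSV file. The sketch below uses the standard `csv` module; the column names are hypothetical, since the actual layout of `experiment-tracker.csv` is not shown here, and the example writes to a temporary folder rather than the real tracker.

```python
import csv
import os
import tempfile

# Hypothetical column layout for the tracker file
FIELDS = ["experiment", "test_accuracy", "notes", "model_dir"]


def log_experiment(tracker_path, row):
    """Append one experiment record, writing the header if the file is new."""
    is_new = not os.path.exists(tracker_path)
    with open(tracker_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp:
    tracker = os.path.join(tmp, "experiment-tracker.csv")
    log_experiment(tracker, {
        "experiment": "E1",
        "test_accuracy": 0.78,             # baseline accuracy reported above
        "notes": "prototype, original splits",
        "model_dir": "E1/model",
    })
    with open(tracker, newline="") as f:
        logged_rows = list(csv.DictReader(f))

print(logged_rows)
```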

---

## 🧹 Label and Organize the Data

### 📁 Prototype File Review

- A quick evaluation of the prototype files was performed.
- Current model performance was measured.
- Next, the focus was shifted to organizing the dataset according to data-centric AI principles.

---

## 🔄 Data Pipeline Documentation

### 🔧 Purpose

- Documenting each stage in the data pipeline was suggested.
- Ensures reproducibility and consistency during deployment.

### 📝 Pipeline Tasks

- Record each transformation from raw input to model predictions.
- Track resource dependencies:
  - `vocabulary.txt` (top 10,000 words).
  - `labels.txt` (8 topic categories).
- A pipeline diagram was referenced for visual structure:

  <img src='/content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/assets/pipeline_diagram.png'>

---

## ⚖️ Balanced Dataset Splits

### 📊 Goal

- Ensuring a **balanced distribution** of labels across:
  - Train set
  - Development (dev) set
  - Test set

### 🧪 Method

- Category distribution was checked using `value_counts()` on the train and test sets.

> Balanced splits are crucial for performance consistency, especially on smaller datasets.


In [10]:
# Print the percentage of each class in the train set
train_df.topic.value_counts(normalize=True).sort_index().mul(100).round(1).astype(str) + '%'

Unnamed: 0_level_0,proportion
topic,Unnamed: 1_level_1
BUSINESS,13.2%
ENTERTAINMENT,14.1%
HEALTH,15.1%
NATION,13.2%
SCIENCE,0.7%
SPORTS,13.6%
TECHNOLOGY,15.3%
WORLD,14.8%


In [11]:
# Print the percentage of each class in the test set
test_df.topic.value_counts(normalize=True).sort_index().mul(100).round(1).astype(str) + '%'

Unnamed: 0_level_0,proportion
topic,Unnamed: 1_level_1
BUSINESS,15.9%
ENTERTAINMENT,12.7%
HEALTH,9.3%
NATION,15.9%
SCIENCE,13.2%
SPORTS,14.3%
TECHNOLOGY,8.6%
WORLD,10.1%


## 📊 Addressing Class Imbalance in Dataset Splits

### 🔍 Observation

- The distribution of topics in the train and test datasets is **uneven**.
- The **SCIENCE** category makes up only 0.7% of the train set but 13.2% of the test set, so it is heavily overrepresented in the test set relative to training.
- This imbalance may lead to:
  - Weaker learning for underrepresented classes (like SCIENCE in training).
  - Biased model performance.
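
The size of the mismatch can be quantified directly from the two `value_counts()` outputs above. The sketch below copies those percentages into dictionaries and flags topics whose share differs by more than a threshold; the 5-point threshold is an arbitrary assumption chosen for illustration.

```python
# Percentages copied from the train and test value_counts() outputs
train_pct = {"BUSINESS": 13.2, "ENTERTAINMENT": 14.1, "HEALTH": 15.1, "NATION": 13.2,
             "SCIENCE": 0.7, "SPORTS": 13.6, "TECHNOLOGY": 15.3, "WORLD": 14.8}
test_pct = {"BUSINESS": 15.9, "ENTERTAINMENT": 12.7, "HEALTH": 9.3, "NATION": 15.9,
            "SCIENCE": 13.2, "SPORTS": 14.3, "TECHNOLOGY": 8.6, "WORLD": 10.1}

GAP_THRESHOLD = 5.0  # percentage points; an assumed cut-off, not a standard value

# Positive gap: the topic is more common in the test set than in the train set
gaps = {topic: round(test_pct[topic] - train_pct[topic], 1) for topic in train_pct}

# Topics whose share differs by more than the threshold, largest gap first
flagged = sorted(
    (topic for topic, gap in gaps.items() if abs(gap) > GAP_THRESHOLD),
    key=lambda topic: -abs(gaps[topic]),
)

for topic in flagged:
    print(f"{topic}: {gaps[topic]:+.1f} points")
```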

---

## 🧪 Goal: Balanced Train/Dev/Test Splits

### 🎯 Target Split

- Combine `train_df` and `test_df` into one unified dataset.
- Split into:
  - **60%** for `train_df`
  - **20%** for `dev_df`
  - **20%** for `test_df`

> ✅ You will now have a **dev set** for tracking validation accuracy during training.

---


- The `train_test_split()` function from `sklearn.model_selection` supports stratified splitting.
- Use the `stratify=` argument to preserve the same topic proportions in each set (proportional, not equal, representation).
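
To show what stratification means, here is a plain-Python sketch that splits each topic separately into 60/20/20 and then merges the pieces. It is not how `train_test_split()` is implemented, only an illustration of the idea; the function name, the fixed seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return (train, dev, test) index lists that keep per-label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, dev, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_dev = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        dev.extend(indices[n_train:n_train + n_dev])
        test.extend(indices[n_train + n_dev:])  # remainder goes to the test set
    return train, dev, test


# Toy labels with one rare class, mimicking the SCIENCE situation
toy_labels = ["HEALTH"] * 50 + ["SCIENCE"] * 10
train_idx, dev_idx, test_idx = stratified_split(toy_labels)

for name, idx in (("train", train_idx), ("dev", dev_idx), ("test", test_idx)):
    share = sum(toy_labels[i] == "SCIENCE" for i in idx) / len(idx)
    print(f"{name}: {len(idx)} rows, SCIENCE share {share:.3f}")
```

Every split ends up with the same SCIENCE share as the full toy dataset, which is the property the `stratify=` argument provides.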

---

> 🚫 *Do not perform any data cleaning or transformations yet.*  
You will focus on splitting first, then process each dataset appropriately later.


In [12]:
# Load the train and test sets
train_df = pd.read_csv(f'{data_dir}/train_data.csv')
test_df = pd.read_csv(f'{data_dir}/test_data.csv')

# Combine the two datasets. Set ignore_index to False.
combined_df = pd.concat([train_df,test_df], ignore_index=False)
# Split the combined dataset to 60% train, 20% dev, and 20% test set. Produce a balanced split along the `topic` column.
train_df, test_df = train_test_split(combined_df, test_size=0.2, stratify=combined_df['topic'])
train_df, dev_df = train_test_split(train_df, test_size=0.25, stratify=train_df['topic'])


In [13]:
# Print the percentage of each class in the train set
train_df.topic.value_counts(normalize=True).sort_index().mul(100).round(1).astype(str) + '%'

# Print the percentage of each class in the test set
test_df.topic.value_counts(normalize=True).sort_index().mul(100).round(1).astype(str) + '%'

Unnamed: 0_level_0,proportion
topic,Unnamed: 1_level_1
BUSINESS,13.8%
ENTERTAINMENT,13.8%
HEALTH,13.8%
NATION,13.8%
SCIENCE,3.5%
SPORTS,13.8%
TECHNOLOGY,13.8%
WORLD,13.8%


In [14]:
# Print the percentage of each class in the dev set
dev_df.topic.value_counts(normalize=True).sort_index().mul(100).round(1).astype(str) + '%'


Unnamed: 0_level_0,proportion
topic,Unnamed: 1_level_1
BUSINESS,13.8%
ENTERTAINMENT,13.8%
HEALTH,13.8%
NATION,13.8%
SCIENCE,3.5%
SPORTS,13.8%
TECHNOLOGY,13.8%
WORLD,13.8%


## ⚠️ Note on Class Imbalance

- The **SCIENCE** topic has **fewer examples** than other categories in the dataset.
- This type of imbalance is common in real-world scenarios.
  - 📍 Example: In medical imaging, there are usually more **negative (healthy)** samples than **positive (diseased)** ones.

---

### 🧪 What Can Be Done?

- **Option 1:** Find or collect more data for the underrepresented class (SCIENCE).
- **Option 2:** Use **data augmentation** to synthetically generate samples.
- **Option 3:** Track **per-class metrics** during evaluation to ensure fairness across all labels.

> ⏳ For now, we'll **skip data augmentation** and continue with our current splits to iterate quickly.

---

## 💾 Save the Current State of the Data

To keep your workflow reproducible and track progress:

- Create a new directory: `E2/` (Experiment 2)
- Save the new `train_df`, `dev_df`, and `test_df` splits into this folder.
- This allows you to:
  - Track which datasets were used in which experiments
  - Reuse or reproduce results in future iterations


In [15]:
# Set the base directory for the second experiment
BASE_DIR = '/content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E2'

# Set the subdirectories that will contain the experiment files
data_dir, model_dir, vocab_dir = lab_utils.set_experiment_dirs(BASE_DIR)

# Save the datasets
lab_utils.save_data(train_df, data_dir, 'train_data.csv')
lab_utils.save_data(dev_df, data_dir, 'dev_data.csv')
lab_utils.save_data(test_df, data_dir, 'test_data.csv')

# Save the labels
lab_utils.save_labels(topic_lookup, vocab_dir)

# 🧠 Modeling

In the **Modeling** stage, you'll:

- ✅ Select a model to learn patterns from your data.
- 📊 Evaluate its performance.
- 🧪 Perform **error analysis** to identify areas of improvement.

This is the **third step** in the ML development lifecycle:

📍 *You are here (Modeling):* `Scoping → Data → Modeling → Deployment`

![ML Lifecycle - Modeling](ml_lifecycle_modeling.png)

---

## ⚙️ Setup

To begin modeling, make sure you've:

- 📂 Created a new `E2/` folder for Experiment 2.
- 💾 Saved the updated, balanced splits: `train_df`, `dev_df`, and `test_df`.
- 🔁 Re-initialize or reload your preprocessing layers (`TextVectorization` and `StringLookup`) if needed for the new dataset.
- 🧱 Optionally define or reuse a base model architecture (e.g., a simple neural network using `tf.keras.Sequential`).

You're now ready to train and evaluate a new model using the improved data.


In [21]:
# Convert the string datasets to Tensorflow datasets
# Ensure lab_utils, train_df, dev_df, topic_lookup, and title_preprocessor are defined before this
import lab_utils # Ensure lab_utils is imported
train_ds = lab_utils.df_to_tfdata(train_df, topic_lookup, title_preprocessor, shuffle=True)
dev_ds = lab_utils.df_to_tfdata(dev_df, topic_lookup, title_preprocessor)
test_ds = lab_utils.df_to_tfdata(test_df, topic_lookup, title_preprocessor)

NUM_EPOCHS = 2 # Or any number of epochs you want to train for
model.fit(train_ds, epochs=NUM_EPOCHS, validation_data=dev_ds, verbose=1)

Epoch 1/2
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 11s 4ms/step - loss: 1.2358 - sparse_categorical_accuracy: 0.5361 - val_loss: 0.5475 - val_sparse_categorical_accuracy: 0.8097
Epoch 2/2
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - loss: 0.4702 - sparse_categorical_accuracy: 0.8367 - val_loss: 0.5171 - val_sparse_categorical_accuracy: 0.8218


<keras.src.callbacks.history.History at 0x7fcea44d4110>

In [28]:
results = model.evaluate(test_ds, verbose=1)
print(f"Test Loss: {results[0]:.4f}")
print(f"Test Sparse Categorical Accuracy: {results[1]*100:.2f}%")

680/680 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.5235 - sparse_categorical_accuracy: 0.8239
Test Loss: 0.5266
Test Sparse Categorical Accuracy: 82.25%


In [31]:
# Make sure the model_dir variable is set to the correct path for E1
BASE_DIR = '/content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E1'
data_dir, model_dir, vocab_dir = lab_utils.set_experiment_dirs(BASE_DIR)

# Save the model to the specified directory
model.save(model_dir + '/my_model.keras') # Added

In [33]:
import lab_utils
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

# Working folder for the experiment
BASE_DIR = '/content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E1'

# Get the subdirectories that contain the experiment files
_, model_dir, _ = lab_utils.set_experiment_dirs(BASE_DIR)

# Load the model
model = tf.keras.models.load_model(model_dir + '/my_model.keras') # Load the .keras file

# Working folder for the experiment
BASE_DIR = '/content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E2'

# Title length and vocabulary size used by the team for the prototype
MAX_LENGTH = 20
VOCAB_SIZE = 10000

# Get the subdirectories that contain the experiment files
data_dir, model_dir, vocab_dir = lab_utils.set_experiment_dirs(BASE_DIR)

# Load the train, dev, and test sets
train_df = pd.read_csv(f'{data_dir}/train_data.csv')
dev_df = pd.read_csv(f'{data_dir}/dev_data.csv')
test_df = pd.read_csv(f'{data_dir}/test_data.csv')

# Instantiate a layer for text preprocessing
title_preprocessor = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE, output_sequence_length=MAX_LENGTH)

# Create a lookup list for the labels
topic_lookup = tf.keras.layers.StringLookup(vocabulary=f'{vocab_dir}/labels.txt', num_oov_indices=0)

## 🏗️ Select and Train the Model

Now that you’ve improved the dataset, let’s analyze how the **prototype model** performs with the **new balanced splits**.

You're still practicing **data-centric AI**, so before changing any model architecture, your first step is to **retrain the same model** using the updated data.

### ⚠️ Why retrain?

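The previous model was trained on an **imbalanced dataset**. That means:
- The `TextVectorization` vocabulary was built from the old training titles, where SCIENCE was almost absent (0.7%), so words typical of that topic may be missing.
- The label encoding (`StringLookup`) does not depend on class frequencies: it reads the fixed list in `labels.txt`, which was copied to `E2`, so it can be reused as-is.

To get reliable results, you need to **regenerate the vocabulary** of the `TextVectorization` layer using the new `train_df`.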

---

## 🔄 Update the Vocabulary

Use the new `train_df["title"]` column to adapt a fresh `TextVectorization` layer:

```python
# Rebuild the title preprocessor with a vocabulary from the new train set
title_preprocessor = tf.keras.layers.TextVectorization(
    max_tokens=10000,
    output_mode="int",
    output_sequence_length=20
)

title_preprocessor.adapt(train_df["title"])
```


In [34]:
# Extract the titles from the new training set
train_inputs = train_df['title']

# Generate a new vocabulary
title_preprocessor.adapt(train_inputs)

# Save the new vocabulary
lab_utils.save_vocab(title_preprocessor, vocab_dir)
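
Conceptually, generating the vocabulary means counting tokens in the training titles and keeping the most frequent ones. The sketch below shows that idea in plain Python; it approximates rather than reproduces `adapt()` (tie-breaking and punctuation handling may differ), and the helper name and toy titles are invented. Two slots are reserved for the padding and unknown tokens, which count toward `max_tokens` in integer output mode.

```python
import string
from collections import Counter


def build_vocabulary(titles, max_tokens):
    """Return ['', '[UNK]'] followed by the most frequent tokens, up to max_tokens entries."""
    strip_punctuation = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for title in titles:
        counts.update(title.lower().translate(strip_punctuation).split())
    # Reserve index 0 for padding and index 1 for out-of-vocabulary tokens
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


# Invented titles, used only to show the ranking by frequency
toy_titles = ["Covid cases rise", "Covid cases fall", "Covid vaccine trials"]
toy_vocab = build_vocabulary(toy_titles, max_tokens=4)
print(toy_vocab)
```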

## 🧠 Train and Evaluate the Model on Balanced Dataset

Now that the preprocessing layers have been updated, it's time to **train the prototype model from scratch** using the new balanced train/dev/test splits.

To do that, you can either:
- Rebuild the same model architecture manually, or
- Use the `model_reset_weights()` helper from `lab_utils.py` to reset an already loaded model.

### 🔁 Reset the Model Weights

If you've already loaded the prototype model earlier, reset it like this:

```python
# Reset model weights to allow training from scratch
model = lab_utils.model_reset_weights(model)
```


In [35]:
NUM_EPOCHS = 5

# Convert the string datasets to Tensorflow datasets
train_ds = lab_utils.df_to_tfdata(train_df, topic_lookup, title_preprocessor, shuffle=True)
dev_ds = lab_utils.df_to_tfdata(dev_df, topic_lookup, title_preprocessor)
test_ds = lab_utils.df_to_tfdata(test_df, topic_lookup, title_preprocessor)

# Reset the model weights
model = lab_utils.model_reset_weights(model)

# Train the model. Use the dev set to check if your model is overfitting.
model.fit(train_ds, epochs=NUM_EPOCHS, validation_data=dev_ds, verbose=1)

Epoch 1/5
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 8s 3ms/step - loss: 1.4291 - sparse_categorical_accuracy: 0.4354 - val_loss: 0.5936 - val_sparse_categorical_accuracy: 0.7910
Epoch 2/5
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - loss: 0.5138 - sparse_categorical_accuracy: 0.8224 - val_loss: 0.5202 - val_sparse_categorical_accuracy: 0.8203
Epoch 3/5
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 0.3751 - sparse_categorical_accuracy: 0.8720 - val_loss: 0.5365 - val_sparse_categorical_accuracy: 0.8224
Epoch 4/5
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 0.3068 - sparse_categorical_accuracy: 0.8954 - val_loss: 0.5764 - val_sparse_categorical_accuracy: 0.8174
Epoch 5/5
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 0.2512 - sparse_categorical_accuracy: 0.9154 - val_loss: 0.6292 - val_sparse_categorical_accuracy:

<keras.src.callbacks.history.History at 0x7fcea389c7d0>

In [36]:
# Compute the loss and metrics
model.evaluate(test_ds)

680/680 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 0.6454 - sparse_categorical_accuracy: 0.8146


[0.6458833813667297, 0.8113996982574463]

In [40]:
# Save the model
model.save(model_dir + '/my_model.keras')

- **Model Saving**
  - A backup of the trained model was saved to the `E2` model directory:
    ```python
    model.save(model_dir + '/my_model.keras')
    ```

- **Performance Overview**
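  - Test accuracy rose from approximately **78%** to **81%** after the dataset changes. The two figures come from different test splits (original versus re-split), so the comparison is indicative rather than exact.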

- **Next Step**
  - Accuracy will be further analyzed based on **data slices** to gain deeper insights.

### 🔍 Error Analysis

- **Purpose**
  - Initial model performance may be suboptimal. Error analysis was performed to identify areas with the highest potential for improvement.

- **Benefits**
  - Key components affecting performance were prioritized.
  - Unnecessary optimizations were avoided by focusing only on impactful elements.

### 🎯 Prioritization Strategy

- **Approach**
  - Model performance was analyzed across different data categories.
  - Each of the 8 target classes was evaluated to guide further improvements.
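
The per-class breakdown amounts to grouping predictions by their true topic and computing accuracy within each group. The sketch below shows that computation in plain Python on invented labels; it illustrates the idea and is not the code behind `print_metric_per_topic()`.

```python
from collections import defaultdict


def accuracy_per_topic(true_topics, predicted_topics):
    """Return {topic: accuracy in percent}, computed over samples whose true label is that topic."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, prediction in zip(true_topics, predicted_topics):
        totals[truth] += 1
        correct[truth] += int(truth == prediction)
    return {topic: 100.0 * correct[topic] / totals[topic] for topic in totals}


# Invented predictions that confuse WORLD and NATION
y_true = ["WORLD", "WORLD", "NATION", "NATION", "SPORTS"]
y_pred = ["WORLD", "NATION", "NATION", "WORLD", "SPORTS"]

per_topic = accuracy_per_topic(y_true, y_pred)
for topic, acc in per_topic.items():
    print(f"{topic}: {acc:.2f}")
```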


In [41]:
# Get the list of topics
topics = topic_lookup.get_vocabulary()

# Evaluate the model's performance for each topic
lab_utils.print_metric_per_topic(dev_df, topics, topic_lookup, title_preprocessor, model)

ACCURACY PER TOPIC:

ENTERTAINMENT: 83.93 
HEALTH: 80.50 
TECHNOLOGY: 88.77 
WORLD: 66.10 
BUSINESS: 100.00 
SPORTS: 85.73 
NATION: 63.00 
SCIENCE: 81.59 


### 📊 Result Interpretation

- **Outlier Detection**
  - Categories with unusually high or low performance were identified for further investigation.

- **Comparison with Baseline**
  - Where available, human-level performance (HLP) was used as a reference.
  - Gaps between model performance and HLP guided optimization focus.

### ⚠️ Anomaly Identification

- **Example Observation**
  - The BUSINESS category showed unexpectedly high performance.
  - Possible data leakage or imbalance was considered and examined.
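
A quick check for this kind of anomaly is to measure, per topic, how much of the data is taken up by the single most frequent title. A share close to 1.0 suggests duplicated or overwritten values. The sketch below is plain Python on `(title, topic)` pairs; the helper name and the toy rows are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def dominant_title_share(rows):
    """Return {topic: fraction of that topic's rows that share its most common title}."""
    titles_by_topic = defaultdict(list)
    for title, topic in rows:
        titles_by_topic[topic].append(title)
    shares = {}
    for topic, titles in titles_by_topic.items():
        _, count = Counter(titles).most_common(1)[0]
        shares[topic] = count / len(titles)
    return shares


# Toy rows: one topic with a repeated title, one with distinct titles
toy_rows = [
    ("Repeated headline", "BUSINESS"),
    ("Repeated headline", "BUSINESS"),
    ("Repeated headline", "BUSINESS"),
    ("Sniffer dogs begin trials to detect Covid", "HEALTH"),
    ("Poo detectives on the COVID-19 frontline", "HEALTH"),
]

shares = dominant_title_share(toy_rows)
print(shares)
```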


In [42]:
# filter the training dataframe's business articles
train_df[train_df.topic=='BUSINESS']

Unnamed: 0,title,link,domain,published_date,topic
4,Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story,https://uk.sports.yahoo.com/news/stocks-markets-europe-ftse-latest-us-china-coronavirus-074214470.html,uk.sports.yahoo.com,2020-08-17 07:42:00,BUSINESS
5,Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story,https://www.reuters.com/article/us-russia-banks-vtb/russias-vtb-bank-raises-retail-lending-forecast-amid-record-low-rates-idUSKCN25112D,reuters.com,2020-08-05 08:45:00,BUSINESS
17,Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story,https://gnghockey.com/2020/08/13/global-filters-and-particle-monitoring-market-industry-analysis-and-forecast-2020-2027/,gnghockey.com,2020-08-13 11:53:28,BUSINESS
43,Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story,https://www.bbc.co.uk/news/uk-england-stoke-staffordshire-53640494,bbc.co.uk,2020-08-03 20:11:04,BUSINESS
46,Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story,https://www.producebluebook.com/2020/08/03/thomson-international-recalls-onions-in-salmonella-outbreak/,producebluebook.com,2020-08-03 13:11:00,BUSINESS
...,...,...,...,...,...
65197,Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story,https://www.oxfordmail.co.uk/news/18630087.oxfords-cheapest-property-rentals/,oxfordmail.co.uk,2020-08-08 04:00:00,BUSINESS
65208,Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story,https://www.forbes.com/sites/greatspeculations/2020/08/12/johnson-controls-stock-is-back-at-40-but-can-it-gain-any-more/,forbes.com,2020-08-12 13:30:00,BUSINESS
65221,Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story,https://www.indiatoday.in/business/story/explained-how-govt-is-attracting-foreign-firms-to-manufacture-locally-in-india-1712015-2020-08-17,indiatoday.in,2020-08-17 05:01:22,BUSINESS
65251,Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story,https://www.njmmanews.com/coronavirus-impact-on-3d-and-virtual-realityfacebook-oculus-microsofthololens-google-samsung-htc-vive/,njmmanews.com,2020-08-15 15:11:33,BUSINESS


### 🛠️ Data Quality Issue Identified

- **Problem**
  - All BUSINESS article titles in the training set were found to be identical:  
    *"Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story"*

- **Impact**
  - For BUSINESS, the model learned only this repeated title, which likely explains the 100% accuracy and reduces its ability to generalize to real-world data.

- **Root Cause**
  - Column values were accidentally overwritten during a character encoding fix by a previous developer.
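
A check like the following could catch this kind of corruption before training. It is a minimal sketch, not part of the original lab: the helper name `dominant_title_share`, the toy rows, and the 50% cutoff are all assumptions made for illustration.

```python
from collections import Counter

def dominant_title_share(rows):
    """Return {topic: (most_common_title, share_of_rows)} for (title, topic) pairs."""
    titles_by_topic = {}
    for title, topic in rows:
        titles_by_topic.setdefault(topic, []).append(title)
    report = {}
    for topic, titles in titles_by_topic.items():
        title, count = Counter(titles).most_common(1)[0]
        report[topic] = (title, count / len(titles))
    return report

# Toy rows standing in for the (title, topic) columns of train_df (illustrative data only)
REPEATED = "Why iPhone 12 Will Be Another 'Defining Chapter' In Apple's Growth Story"
rows = [
    (REPEATED, "BUSINESS"),
    (REPEATED, "BUSINESS"),
    ("Local team wins the cup", "SPORTS"),
    ("Transfer window opens", "SPORTS"),
]

report = dominant_title_share(rows)
for topic, (title, share) in report.items():
    # Hypothetical cutoff: flag a topic when one title covers more than half its rows
    flag = "SUSPICIOUS" if share > 0.5 else "ok"
    print(f"{topic}: {share:.0%} of rows share one title [{flag}]")
```

A topic where a single title accounts for most rows is worth inspecting by hand, as was done for BUSINESS here.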

### 💾 Recovery and Reprocessing

- **Solution**
  - Original data was restored using a backup file containing correct values.

- **Next Steps**
  - New train, dev, and test sets were generated.
  - These datasets were saved to a new folder named `E3` for a fresh experiment.


In [46]:
# Set the experiment folder
BASE_DIR = '/content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E3'

# Set the subdirectories that will contain the experiment files
data_dir, model_dir, vocab_dir = lab_utils.set_experiment_dirs(BASE_DIR)

# Load the backup CSV
combined_df = pd.read_csv('/content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/backup.csv')

# Generate train, dev, and test sets as before: 20% for test, then 25% of the remaining 80% for dev (a 60/20/20 split).
train_df, test_df = train_test_split(combined_df, test_size=0.2, stratify=combined_df['topic'])
train_df, dev_df = train_test_split(train_df, test_size=0.25, stratify=train_df['topic'])

# Save the datasets under the E3 folder
lab_utils.save_data(train_df, data_dir, 'train_data.csv')
lab_utils.save_data(dev_df, data_dir, 'dev_data.csv')
lab_utils.save_data(test_df, data_dir, 'test_data.csv')

### 🧱 Vocabulary Update

- **Requirement**
  - The title preprocessor was updated and a new vocabulary was generated using the corrected training data.

- **Steps Taken**
  - Titles were extracted from the `train_df`.
  - The `adapt()` method was used on `title_preprocessor` to build the vocabulary.
  - The new vocabulary was saved using the `save_vocab()` utility from `lab_utils`.

- **Storage**
  - The updated vocabulary was saved in the vocabulary subfolder of `E3` (the path held in `vocab_dir`).


In [47]:
# Generate a new vocabulary based on the new training set
train_inputs = train_df['title']
title_preprocessor.adapt(train_inputs)

# Save the new vocabulary and labels
lab_utils.save_vocab(title_preprocessor, vocab_dir)
lab_utils.save_labels(topic_lookup, vocab_dir)

### 🔄 Data Conversion and Model Retraining

- **Data Preparation**
  - DataFrames were converted into `tf.data.Dataset` format using `lab_utils.df_to_tfdata`.

- **Model Training**
  - The model was retrained using the updated datasets.
  - Training was performed for **5 epochs** with `dev_df` used for validation.


In [48]:
NUM_EPOCHS = 5

# Convert the dataframes to numeric features. Remember to shuffle the training set.
train_ds = lab_utils.df_to_tfdata(train_df, topic_lookup, title_preprocessor, shuffle=True)
dev_ds = lab_utils.df_to_tfdata(dev_df, topic_lookup, title_preprocessor)
test_ds = lab_utils.df_to_tfdata(test_df, topic_lookup, title_preprocessor)

# Reset the model weights
model = lab_utils.model_reset_weights(model)

# Train the model
model.fit(train_ds, epochs=NUM_EPOCHS, validation_data=dev_ds, verbose=1)

Epoch 1/5
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - loss: 1.7103 - sparse_categorical_accuracy: 0.3495 - val_loss: 0.7938 - val_sparse_categorical_accuracy: 0.7361
Epoch 2/5
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 9s 3ms/step - loss: 0.6907 - sparse_categorical_accuracy: 0.7663 - val_loss: 0.7310 - val_sparse_categorical_accuracy: 0.7573
Epoch 3/5
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - loss: 0.5240 - sparse_categorical_accuracy: 0.8257 - val_loss: 0.7377 - val_sparse_categorical_accuracy: 0.7638
Epoch 4/5
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 0.4388 - sparse_categorical_accuracy: 0.8525 - val_loss: 0.7783 - val_sparse_categorical_accuracy: 0.7599
Epoch 5/5
2040/2040 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 0.3720 - sparse_categorical_accuracy: 0.8757 - val_loss: 0.8389 - val_sparse_categorical_accuracy: 

<keras.src.callbacks.history.History at 0x7fce0e23f650>

In [49]:
# Evaluate the model on the test set and write the results on the experiment tracker
model.evaluate(test_ds)

680/680 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.7942 - sparse_categorical_accuracy: 0.7687


[0.8204027414321899, 0.759411633014679]

In [53]:
# Save the model to model_dir
model.save(model_dir + '/my_model.keras')

In [54]:
lab_utils.print_metric_per_topic(dev_df, topics, topic_lookup, title_preprocessor, model)

ACCURACY PER TOPIC:

ENTERTAINMENT: 83.50 
HEALTH: 79.30 
TECHNOLOGY: 85.23 
WORLD: 59.80 
BUSINESS: 69.13 
SPORTS: 87.53 
NATION: 63.73 
SCIENCE: 79.60 


### 📉 Topic-wise Evaluation

- **Evaluation**
  - The model was evaluated on each topic using `lab_utils.print_metric_per_topic()` with `dev_df`.

- **Expected Outcome**
  - Accuracy on **BUSINESS** articles dropped from 100% to 69.13%, as the model now has to learn a broader range of relevant words.

- **Insight**
  - More realistic performance was observed across all topics, indicating improved generalization.
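
The internals of `lab_utils.print_metric_per_topic()` are not shown in this notebook. As a rough illustration of what a per-topic accuracy computes, the sketch below groups examples by their true label and reports the share predicted correctly in each group. The function name `accuracy_per_topic` and the toy labels are assumptions for illustration.

```python
from collections import defaultdict

def accuracy_per_topic(true_labels, predicted_labels):
    """Percentage of correctly predicted examples within each true topic."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {topic: 100.0 * correct[topic] / total[topic] for topic in total}

# Toy labels (illustrative only, not taken from dev_df)
y_true = ["BUSINESS", "BUSINESS", "WORLD", "WORLD", "WORLD", "NATION"]
y_pred = ["BUSINESS", "WORLD", "WORLD", "NATION", "WORLD", "NATION"]

per_topic = accuracy_per_topic(y_true, y_pred)
for topic, acc in per_topic.items():
    print(f"{topic}: {acc:.2f}")
```

Because each topic is scored only on its own examples, a small or corrupted class can show an extreme value without moving the overall accuracy much.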


### 🔍 Label Ambiguity and Improving Human-Level Performance (HLP)

- **Identifying Low Accuracy Topics**
  - After fixing the BUSINESS topic, attention was shifted to categories with lower accuracy, such as **WORLD** and **NATION**.
  - Errors for NATION predictions were retrieved using:
    ```python
    lab_utils.get_errors(model, dev_df, title_preprocessor, topic_lookup, 'NATION')
    ```

- **Ambiguity in Labels**
  - Some titles, e.g., *"COVID-19 hospital admissions up slightly across St. Louis area"*, could belong to multiple categories (HEALTH and NATION).
  - Labelers may have received inconsistent instructions, which creates ambiguous labels and hurts model performance.

- **Possible Solutions**
  - Allow multi-label annotations where multiple topics can be selected for a single article:
    | Title   | ENTERTAINMENT | HEALTH | TECHNOLOGY | WORLD | BUSINESS | SPORTS | NATION | SCIENCE |
    |---------|---------------|--------|------------|-------|----------|--------|--------|---------|
    | Title 1 |               |        |            |       |          |        |        |         |
    | Title 2 |               |        |            |       |          |        |        |         |
  
  - Alternatively, merge related categories to reduce class count (e.g., merge SCIENCE and TECHNOLOGY).

- **Business Considerations**
  - Changes require approval from product/business owners due to impact on app features and backend systems.

- **Model Adjustment for Multi-Label Evaluation**
  - To count a prediction as correct when the ground-truth label is among the model's top two predictions, the model was recompiled with the `SparseTopKCategoricalAccuracy` metric:
    ```python
    from tensorflow.keras.metrics import SparseTopKCategoricalAccuracy
    
    model.compile(
      optimizer=model.optimizer,
      loss=model.loss,
      metrics=[SparseTopKCategoricalAccuracy(k=2)]
    )
    ```
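
To make the metric concrete, top-k accuracy can be written out in plain Python: a prediction counts as a hit when the true class index is among the k highest-probability classes. This is a minimal sketch with made-up probabilities, not the Keras implementation.

```python
def top_k_accuracy(true_indices, probabilities, k=2):
    """Share of examples whose true class is among the k most probable classes."""
    hits = 0
    for truth, probs in zip(true_indices, probabilities):
        # Class indices sorted from most to least probable
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_indices)

# Toy softmax outputs for three examples and three classes (illustrative only)
probs = [
    [0.50, 0.40, 0.10],  # true class 1 is ranked second
    [0.10, 0.20, 0.70],  # true class 2 is ranked first
    [0.60, 0.30, 0.10],  # true class 2 is ranked third
]
truths = [1, 2, 2]

top1 = top_k_accuracy(truths, probs, k=1)
top2 = top_k_accuracy(truths, probs, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

Top-2 accuracy can never be lower than top-1 accuracy on the same data, so the comparison mainly shows how often the correct label is the runner-up.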


In [55]:
# Set the top-K accuracy to 2
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=2)]
             )

# Check the accuracy
model.evaluate(dev_ds)

680/680 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 0.8285 - sparse_top_k_categorical_accuracy: 0.8876


[0.8388991951942444, 0.8832911849021912]

In [57]:
# Print the accuracy per topic
lab_utils.print_metric_per_topic(dev_df, topics, topic_lookup, title_preprocessor, model)

ACCURACY PER TOPIC:

ENTERTAINMENT: 91.70 
HEALTH: 91.20 
TECHNOLOGY: 92.43 
WORLD: 83.40 
BUSINESS: 81.70 
SPORTS: 92.13 
NATION: 86.10 
SCIENCE: 86.89 


### 📈 Improved Accuracy with Top-2 Predictions

- **Observations**
  - Accuracy improved across all categories by considering the top two model predictions.
  - WORLD and NATION topics showed the largest gains, exceeding 20 percentage points.
  - Further analysis of differing top predictions can help evaluate multi-label or category-merging approaches.
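
One possible form of that further analysis is to count which (true label, top-1 prediction) pairs are "rescued" by the second prediction. Frequent pairs point at categories that overlap and are candidates for multi-label annotation or merging. The helper name `rescued_pairs` and the toy data below are assumptions for illustration.

```python
from collections import Counter

def rescued_pairs(true_labels, ranked_predictions):
    """Count (true, top-1) pairs where top-1 is wrong but top-2 matches the label."""
    pairs = Counter()
    for truth, ranked in zip(true_labels, ranked_predictions):
        if ranked[0] != truth and ranked[1] == truth:
            pairs[(truth, ranked[0])] += 1
    return pairs

# Toy data: each inner list holds the model's top two topics, best first (illustrative only)
y_true = ["NATION", "NATION", "WORLD", "SCIENCE"]
ranked = [
    ["HEALTH", "NATION"],
    ["HEALTH", "NATION"],
    ["NATION", "WORLD"],
    ["SCIENCE", "TECHNOLOGY"],
]

pairs = rescued_pairs(y_true, ranked)
for (truth, top1), count in pairs.most_common():
    print(f"true={truth}, predicted first={top1}: {count}")
```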

---

### ➕ Adding More Data

- **Rationale**
  - Increasing training data is more likely to enhance model performance than degrade it.

- **Category Imbalance**
  - SCIENCE has fewer examples, possibly causing lower accuracy despite top-2 prediction consideration.

- **Data Augmentation Techniques**
  - Synonym replacement  
  - Random insertion  
  - Random swap  
  - Random deletion  
  - *Note:* Only apply augmentation to the training set, not to dev or test sets.

- **Additional Data Sources**
  - Incorporate datasets such as:
    - AG News Classification Dataset  
    - Huffington Post Headlines Dataset  
  - Aim: Increase category diversity and vocabulary coverage.
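
Two of the augmentation techniques listed above, random swap and random deletion, can be sketched with the standard library alone. Synonym replacement and random insertion need a synonym source and are left out. The function names, the deletion probability, and the seed are assumptions for illustration.

```python
import random

def random_swap(tokens, rng):
    """Return a copy of tokens with two randomly chosen positions swapped."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

rng = random.Random(0)  # fixed seed so the augmentation is repeatable
title_tokens = "COVID-19 hospital admissions up slightly across St. Louis area".split()

swapped = random_swap(title_tokens, rng)
thinned = random_deletion(title_tokens, p=0.2, rng=rng)
print(" ".join(swapped))
print(" ".join(thinned))
```

Augmented titles keep the label of the original title and are added to the training set only.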

---

### ⚙️ Model Parameter Modification

- **Next Steps**
  - Experiment with architecture by:
    - Adding layers  
    - Changing embedding and dense layer dimensionality  

- **Prototype Model Example**
  ```python
  def build_model():
      # Prototype code to build model
      ...
      return model
  ```


### ⚙️ Further Preprocessing and Model Improvement

- **TextVectorization Layer Tuning**
  - Experiment with parameters such as `max_tokens` to optimize vocabulary size.
  - Improve vocabulary quality by removing stop words (e.g., *a, the, is, and, are*) to focus on meaningful words.

- **Alternative Approaches**
  - Consider different model architectures or pre-trained models for text classification.
  - Note: Compute constraints in the current environment may require training on machines with GPU support.

---

### 🚀 Deployment Considerations

- **Post-Retraining**
  - The cleaned and retrained model prototype is ready for deployment.

- **Challenges**
  - Deployment involves both software engineering and machine learning aspects.
  - Continuous monitoring is essential to detect issues and maintain model performance during operation.


In [59]:
# Packages needed in the Deployment Stage

import lab_utils
import tensorflow as tf
import pandas as pd
import os
import shutil
import requests
import json
import sqlite3
import numpy as np
import matplotlib.pyplot as plt

### 🚀 Deploying the Model in Production

- **System Overview**
  - The prototype model processes news articles in hourly batches on a cloud platform.
  - A high-level system diagram (`network.png`) illustrates the setup.

---

### 🔧 Local Deployment Setup

- **Deployment Framework**
  - TensorFlow Serving is used for local model deployment.
  - Models are stored in the `serving` folder within the workspace.

- **Getting Started**
  - TensorFlow Serving is pre-installed in the lab environment.
  - For local setups outside the lab:
    - Install TensorFlow Serving following [official instructions](https://www.tensorflow.org/tfx/serving/setup).
    - Use `tensorflow-model-server` binary installed via APT or Docker container for serving.

- **Commands**
  - Docker and native TensorFlow Serving commands are provided in the lab's collapsible sections.
  - Copy-paste commands into a new cell to start serving the model.

- **Next Step**
  - Perform sample predictions to verify the deployment.


In [60]:
SERVING_DIR = f'{os.getcwd()}/serving'
os.environ["SERVING_DIR"] = SERVING_DIR

print(f'SERVING_DIR: {SERVING_DIR}')
print(f'os.environ["SERVING_DIR"]: {os.environ["SERVING_DIR"]}')

SERVING_DIR: /content/serving
os.environ["SERVING_DIR"]: /content/serving


Copy the model in E2 into the serving directory under a folder named `1`.

In [62]:
os.makedirs(f'{SERVING_DIR}/1', exist_ok=True)
shutil.copytree('/content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E2/model', f'{SERVING_DIR}/1', dirs_exist_ok=True)

'/content/serving/1'

Run TensorFlow Serving in the background; it will detect the model available for serving.

In [63]:
%%bash --bg
nohup tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=newsapp_model \
  --model_base_path="${SERVING_DIR}" > ./serving/server.log 2>&1

### 🖥️ Model Server Testing

- **Server Status**
  - Model server is running and listening on port **8501**.

- **Testing Prediction Endpoint**
  - Send a test request with a dummy input array of shape `(1, 20)` representing a title.
  - The response will return 8 values corresponding to the softmax output layer predictions.

- **Usage**
  - The test command can be run inside this notebook (with `!`) or directly in a Terminal:
    - Open Terminal via `File -> New Launcher -> Other -> Terminal`.
    - Run the command without the `!` to confirm external accessibility.

- **Troubleshooting**
  - If an error appears regarding "expecting a string input":
    - Multiple models may exist in the `serving` directory from previous runs.
    - Remove existing model folders by uncommenting the cleanup line in the Deploy section.
    - Copy the correct model to the `serving` directory again before retrying.

In [64]:
command = (
    'curl -d \'{"instances": [[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]]}\' '
    '-X POST http://localhost:8501/v1/models/newsapp_model:predict'
)

os.system(command)

1792
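
The value `1792` is the return value of `os.system`, not the model's prediction. On Unix-like systems that return value is a wait status that encodes the command's exit code, and it can be decoded as sketched below (this assumes Python 3.9 or later, where `os.waitstatus_to_exitcode` is available).

```python
import os

status = 1792  # value returned by os.system for the curl command above

# Convert the wait status into the exit code of the command that ran
exit_code = os.waitstatus_to_exitcode(status)
print(f"curl exit code: {exit_code}")

# A non-zero exit code means the request did not complete successfully
request_succeeded = exit_code == 0
print(f"request succeeded: {request_succeeded}")
```

The decoded exit code is 7, which curl documents as a failure to connect to the host. This suggests the server was not yet reachable when the cell ran, so checking `./serving/server.log` and retrying after a short wait would be a reasonable next step.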

### 🤖 Programmatic Requests to the Model Server

- **Process**
  - Load the title preprocessor to convert raw text strings into numeric arrays compatible with the model input.
  - Send these processed inputs as requests to the model server.

- **Potential Issue**
  - The preprocessing step happens outside the serving environment.
  - This can cause inconsistencies if the preprocessor changes or differs from what the deployed model expects.
  - Also, it requires the client to have the preprocessing logic and artifacts, increasing complexity and maintenance overhead.

- **Best Practice Suggestion**
  - Integrate preprocessing into the deployed model pipeline or API service to ensure consistent input handling.


In [None]:
# Used by the team on their prototype
MAX_LENGTH = 20
VOCAB_SIZE = 10000

# Instantiate a layer for text preprocessing
title_preprocessor = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE, output_sequence_length=MAX_LENGTH)

# Load the vocabulary file
title_preprocessor.load_assets('/content/drive/MyDrive/News_Filtering_Production/News_Filtering_Production/E2/vocab')

# Sample input
sample_input = 'Sample title'

# Preprocess the string
sample_input_ds = title_preprocessor(sample_input)

# Add a batch dimension
sample_input_ds = tf.expand_dims(sample_input_ds, axis=0)

# Compose the data
data = json.dumps({"instances": sample_input_ds.numpy().tolist()})

# Define the headers
headers = {"content-type": "application/json"}

# Send the request
json_response = requests.post('http://localhost:8501/v1/models/newsapp_model:predict', data=data, headers=headers)

# Get the predictions
predictions = json.loads(json_response.text)['predictions']

print(predictions)

### 🔁 Ensuring Consistent Preprocessing During Serving

- **Issue Identified**
  - Preprocessing (e.g. vocabulary loading) occurs outside the model pipeline.
  - There's a risk of accidentally loading the wrong vocabulary from another experiment (e.g., E3 instead of E2).
  - This mismatch can cause the model to produce unreliable predictions despite being well-trained.

- **Solution**
  - Integrate the `title_preprocessor` directly into the model's input pipeline before exporting.
  - This ensures:
    - The correct vocabulary is always used.
    - You can feed raw strings to the model at inference time.
    - External clients don't need access to the preprocessing logic or files.

- **Benefits**
  - Reduces risk of human error.
  - Simplifies deployment and inference pipeline.
  - Improves robustness and reproducibility of predictions.


### 🧩 Preprocessing Integration for Deployment

- **Best Practice**
  - Attach the `title_preprocessor` **after** training, just before deployment.
  - Reason: Attaching it during training can slow down performance, especially on GPU setups.

- **Deployment-Ready Export**
  - Use `model.export()` instead of `model.save()` for serving purposes.
  - `model.export()` generates a lighter version optimized for TensorFlow Serving.

- **Action Taken**
  - The integrated model (with preprocessing attached) was exported to the `serving` directory under folder `2`.
  - This version is now ready for deployment and inference with raw string inputs.

- **Outcome**
  - Ensures reliable predictions by embedding the preprocessing pipeline into the model itself.
  - Simplifies serving architecture by reducing dependency on external preprocessing logic.
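
The server restart that follows reads a `models.config` file whose contents are not shown in this notebook. The sketch below writes one plausible version. The mapping of the label `deprecated` to version `1` and `stable` to version `2` is inferred from the request URLs used later, and the helper name `write_models_config` is an assumption for illustration.

```python
import os
import tempfile

# TensorFlow Serving model config in protobuf text format.
# Doubled braces are literal braces; {base_path} is filled in by str.format.
MODELS_CONFIG_TEMPLATE = """model_config_list {{
  config {{
    name: 'newsapp_model'
    base_path: '{base_path}'
    model_platform: 'tensorflow'
    model_version_policy {{
      specific {{
        versions: 1
        versions: 2
      }}
    }}
    version_labels {{
      key: 'deprecated'
      value: 1
    }}
    version_labels {{
      key: 'stable'
      value: 2
    }}
  }}
}}
"""

def write_models_config(serving_dir):
    """Write models.config into serving_dir and return its path."""
    path = os.path.join(serving_dir, "models.config")
    with open(path, "w") as f:
        f.write(MODELS_CONFIG_TEMPLATE.format(base_path=serving_dir))
    return path

# Demonstrated against a temporary folder; in the notebook this would be SERVING_DIR
demo_dir = tempfile.mkdtemp()
config_path = write_models_config(demo_dir)
with open(config_path) as f:
    config_text = f.read()
print(config_text)
```

Serving both versions side by side lets existing clients keep sending integer sequences to the older version while new clients move to the one that accepts raw strings.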


In [75]:
!kill $(ps aux | grep 'tensorflow_model_server' | awk '{print $2}')

^C


In [76]:
%%bash --bg
nohup tensorflow_model_server \
  --rest_api_port=8501 \
  --model_config_file="${SERVING_DIR}/models.config" \
  --model_config_file_poll_wait_seconds=10 \
  --allow_version_labels_for_unavailable_models=true \
  --model_base_path="${SERVING_DIR}" > ./serving/server.log 2>&1

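Requests can then be sent to the first model (label `deprecated`), which takes integer sequences, and to the second model (label `stable`), which accepts string inputs: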

In [None]:
data = json.dumps({"instances": [[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]]})

headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/newsapp_model/labels/deprecated:predict', data=data, headers=headers)
predictions = json.loads(json_response.text)['predictions']

print(predictions)

In [None]:
data = json.dumps({"instances": ["sample title 1", "sample title 2"]})

headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/newsapp_model/labels/stable:predict', data=data, headers=headers)
predictions = json.loads(json_response.text)['predictions']

print(predictions)

In [None]:
# To do: test the model through a canary deployment behind a Flask API

### ⚙️ Degrees of Automation in ML Deployment

Rather than deploying the model for full automation, it's often more practical to adopt a **partial automation** approach—especially when model performance isn't yet ideal.

#### 🧠 Human-in-the-Loop Strategy

- **Observation from Error Analysis:**
  - If the model's top class prediction has a **confidence score ≥ 60%**, it is typically correct.
  - If **< 60%**, it's less reliable and should be **reviewed manually**.

- **Proposed Automation Flow:**
  1. Predict topic for each article.
  2. Check the probability of the top prediction.
     - **If ≥ 60%** → Accept prediction automatically.
     - **If < 60%** → Flag for human review.

- **Proof-of-Concept:**
  - Apply this logic to **100 rows** from the `dev_df` dataset.
  - Count how many predictions are auto-accepted vs. sent to human reviewers.
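
The proposed flow can be sketched as a small routing function that splits predictions into auto-accepted and flagged groups and makes the two counts easy to report. The function name `route_predictions` and the toy probabilities are assumptions for illustration; in the notebook the inputs would be the server's predictions for the 100 dev rows.

```python
def route_predictions(titles, predictions, labels, threshold=0.6):
    """Split predictions into auto-accepted and needs-review lists."""
    auto_accepted, needs_review = [], []
    for title, probs in zip(titles, predictions):
        # Index of the most probable class
        best = max(range(len(probs)), key=lambda i: probs[i])
        record = (title, labels[best], probs[best])
        if probs[best] >= threshold:
            auto_accepted.append(record)
        else:
            needs_review.append(record)
    return auto_accepted, needs_review

# Toy inputs (illustrative only)
labels = ["HEALTH", "NATION", "SPORTS"]
titles = ["Title A", "Title B", "Title C"]
predictions = [
    [0.10, 0.15, 0.75],  # confident: accepted as SPORTS
    [0.45, 0.50, 0.05],  # uncertain: flagged for review
    [0.60, 0.30, 0.10],  # exactly at the threshold: accepted as HEALTH
]

auto_accepted, needs_review = route_predictions(titles, predictions, labels)
print(f"auto-accepted: {len(auto_accepted)}, sent to review: {len(needs_review)}")
```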

#### ✅ Benefits

- Builds trust with stakeholders before full automation.
- Improves quality while gradually increasing automation confidence.
- Allows model refinement over time with real-world feedback.

> ⚠️ Tip: Always log flagged items for auditing and possible retraining to improve future accuracy.


In [None]:
# working folder for the experiment
BASE_DIR = './E2'

# get the subdirectories that contain the experiment files
data_dir, _, vocab_dir = lab_utils.set_experiment_dirs(BASE_DIR)

# Load the dev set
dev_df = pd.read_csv(f'{data_dir}/dev_data.csv')

# Create a lookup list for the labels
topic_lookup = tf.keras.layers.StringLookup(vocabulary=f'{vocab_dir}/labels.txt', num_oov_indices=0)

# Get 100 rows from the dev set and prepare the titles as a list
title_df = dev_df['title'][:100].reset_index(drop=True)

# Convert to a list
dev_np = title_df.to_numpy().tolist()

# Convert to a JSON object
data = json.dumps({"instances": dev_np})

# Set the header
headers = {"content-type": "application/json"}

# Send to the deployed model
json_response = requests.post('http://localhost:8501/v1/models/newsapp_model/labels/stable:predict', data=data, headers=headers)

# Get the predictions
predictions = json.loads(json_response.text)['predictions']

### Partial Automation with Confidence Threshold

To optimize classification reliability, a filtering mechanism was implemented:

- **Confidence Threshold**:  
  - If the probability of the model's top prediction is **at least 60%**, the article is:
    - Automatically classified  
    - Forwarded directly to the application's frontend  

  - If the prediction falls **below 60%**, the article is:
    - Flagged for manual review  
    - Handled by a human-in-the-loop process


In [None]:
# Preview the results

# Show the predictions as percentages
pd.options.display.float_format = '{:.2%}'.format

# Format the results as a dataframe
pred_df = pd.DataFrame(predictions, columns=topic_lookup.get_vocabulary())
pred_df = pd.concat([title_df, pred_df], axis=1)
pred_df

In [None]:
# Threshold to meet for automation
THRESHOLD = 0.6

# List for manual checking
below_threshold = []

# Collect rows that don't meet the threshold
for i, prediction in enumerate(predictions):
    if max(prediction) < THRESHOLD:
        prediction = prediction.copy()
        prediction.insert(0, dev_np[i])
        below_threshold.append(prediction)

# Preview the results. These will be classified manually.
columns = topic_lookup.get_vocabulary()
columns.insert(0, 'title')
pd.DataFrame(below_threshold, columns=columns)

In [None]:
# Examine the flagged rows and add more filters for manual checking, so that rechecked predictions improve over time

### Monitoring and Maintaining the ML System

Deployed ML systems require continuous monitoring to ensure long-term reliability. Over time, **model performance degrades** due to changes in data, necessitating maintenance strategies.

---

### Types of Drift

- **Concept Drift**  
  Occurs when the relationship between input and output shifts.  
  *Example*: An athlete retires and becomes a celebrity, but the model continues to label their articles as SPORTS.

- **Data Drift**  
  Occurs when the distribution of input data changes.  
  *Example*: A new celebrity emerges post-training, leading to poor predictions due to unseen vocabulary.

---

### Establishing Metrics and Alarms

To detect performance issues early, multiple metrics should be tracked and thresholds defined. Alarms can then be triggered for investigation.

#### 1. Software Metrics
- Track system capacity and performance:
  - Available memory
  - CPU/GPU utilization
  - Latency
  - Throughput
  - Server load  
- Helps determine when to **scale resources**.

#### 2. Input Metrics
- Analyze incoming data characteristics:
  - Average title length
  - Count of unknown or out-of-vocabulary tokens
  - Input request frequency per data source  
- Useful for spotting **shifts in data distribution**.

#### 3. Output Metrics
- Monitor prediction results:
  - Distribution across categories
  - Ratio of automated vs. manual classifications
  - Confidence scores of predictions  
- Highlights potential **model bias or degradation**.
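
As an illustration of output metrics with alarms, the sketch below summarizes a batch of predictions and compares two of the metrics against limits. The record layout, the function names, and both limits (40% manual ratio, 0.6 mean confidence) are hypothetical values chosen for the example, not figures from the project.

```python
from collections import Counter

def output_metrics(records):
    """records: list of (predicted_topic, confidence, was_automated) tuples."""
    n = len(records)
    topic_share = {t: c / n for t, c in Counter(r[0] for r in records).items()}
    manual_ratio = sum(1 for r in records if not r[2]) / n
    mean_confidence = sum(r[1] for r in records) / n
    return topic_share, manual_ratio, mean_confidence

def check_alarms(manual_ratio, mean_confidence,
                 max_manual_ratio=0.4, min_mean_confidence=0.6):
    """Return a list of alarm messages for metrics outside their limits."""
    alarms = []
    if manual_ratio > max_manual_ratio:
        alarms.append(f"manual ratio {manual_ratio:.0%} above {max_manual_ratio:.0%}")
    if mean_confidence < min_mean_confidence:
        alarms.append(f"mean confidence {mean_confidence:.2f} below {min_mean_confidence:.2f}")
    return alarms

# Toy batch of predictions (illustrative only)
records = [
    ("SPORTS", 0.9, True),
    ("SPORTS", 0.8, True),
    ("NATION", 0.5, False),
    ("WORLD", 0.4, False),
]

topic_share, manual_ratio, mean_confidence = output_metrics(records)
alarms = check_alarms(manual_ratio, mean_confidence)
print(topic_share, manual_ratio, mean_confidence)
print(alarms)
```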

---

### Example: Monitoring Unknown Tokens

A high number of unknown (OOV) tokens in titles may signal **vocabulary insufficiency**. This is monitored by:

- Collecting processed input data from the system database
- Counting and plotting the occurrence of unknown tokens
- Using trends to decide whether vocabulary updates are required
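
Besides plotting, the trend can be checked automatically with a rolling average over the per-title counts. The sketch below is one simple way to do that; the window size, the limit, and the toy counts are assumptions, and a production system would tune them on historical data.

```python
def rolling_mean(values, window):
    """Mean of each consecutive run of `window` values."""
    means = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        means.append(sum(chunk) / window)
    return means

def first_alarm_index(values, window, limit):
    """Index of the last item in the first window whose mean exceeds limit, else None."""
    for offset, mean in enumerate(rolling_mean(values, window)):
        if mean > limit:
            return offset + window - 1
    return None

# Toy per-title unknown-token counts with a spike at the end (illustrative only)
toy_unk_counts = [0, 1, 0, 2, 1, 0, 1, 9, 11, 12, 10]

alarm_at = first_alarm_index(toy_unk_counts, window=3, limit=5)
print(f"alarm raised at entry index: {alarm_at}")
```

Averaging over a window avoids raising an alarm for a single unusual title while still reacting quickly to a sustained spike.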


In [None]:
# Connect to the database
con = sqlite3.connect("news_articles.db")
cur = con.cursor()

# Preview some entries
for row in cur.execute("SELECT id,title FROM news_articles WHERE id < 5"):
    print(row)

You will then load the title preprocessor in Experiment 2 so you can transform the string inputs into integer sequences.

In [None]:
VOCAB_SIZE=10000
MAX_LENGTH=20

# working folder for the experiment
BASE_DIR = './E2'  # Experiment 2, whose model and vocabulary are the ones being served

# get the subdirectories that contain the experiment files
_, model_dir, vocab_dir = lab_utils.set_experiment_dirs(BASE_DIR)

# Load the model
model = tf.keras.models.load_model(model_dir)

# Instantiate a layer for text preprocessing
title_preprocessor = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE, output_sequence_length=MAX_LENGTH)

# Load the vocabulary file
title_preprocessor.load_assets(vocab_dir)

For each integer sequence, count the number of `1`s: in the integer output of `TextVectorization`, index 1 is reserved for out-of-vocabulary (OOV) tokens, while index 0 is padding.

In [None]:
# Initialize list containing the number of `1`'s per title
unk_counts = []

# Iterate over the titles in the database
for row in cur.execute("SELECT title FROM news_articles"):

    # Convert the title to an integer sequence
    sequence = title_preprocessor(row[0])

    # Count the number of `1`
    unk_count = np.count_nonzero(sequence == 1)

    # Append to the list
    unk_counts.append(unk_count)

Now plot the results.

Notice that there is a spike in the later entries. These rows can be inspected to see what might be the problem.

In [None]:
# print rows after id=100
for row in cur.execute("SELECT id,title FROM news_articles WHERE ID > 100"):
    print(row)

### Input Filtering

The system encountered **input errors** due to non-English articles. Since the model was trained exclusively on English data, it failed to recognize the words, leading to unreliable predictions.  
→ Actionable insight: **Identify and filter non-English sources** to prevent irrelevant entries from affecting the application.
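
One crude way to act on this is to reject titles whose share of out-of-vocabulary words is too high. This is a sketch under stated assumptions, not real language detection: the whitespace tokenizer, the tiny vocabulary, and the 50% cutoff are illustrative, and a dedicated language-identification tool would be more robust.

```python
def oov_fraction(title, vocabulary):
    """Share of whitespace-separated, lowercased tokens missing from vocabulary."""
    tokens = title.lower().split()
    if not tokens:
        return 1.0  # treat empty titles as fully unknown
    unknown = sum(1 for t in tokens if t not in vocabulary)
    return unknown / len(tokens)

def filter_titles(titles, vocabulary, max_oov_fraction=0.5):
    """Split titles into kept and rejected lists based on their OOV fraction."""
    kept, rejected = [], []
    for title in titles:
        if oov_fraction(title, vocabulary) <= max_oov_fraction:
            kept.append(title)
        else:
            rejected.append(title)
    return kept, rejected

# Toy vocabulary and titles (illustrative only)
vocabulary = {"hospital", "admissions", "up", "slightly", "across", "area", "the", "in"}
titles = [
    "Hospital admissions up slightly across the area",
    "Las admisiones hospitalarias aumentan ligeramente",
]

kept, rejected = filter_titles(titles, vocabulary)
print("kept:", kept)
print("rejected:", rejected)
```

Rejected titles could be logged per source, which would also feed the input metrics described earlier.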

---

### Project Wrap-Up

This project implemented the **entire Machine Learning Lifecycle** on a real-world problem—from scoping and data preparation to modeling, deployment, and monitoring. Challenges such as data imbalance, labeling ambiguity, and production constraints were effectively addressed.

---

### Key Insights & Recommendations

#### ✅ Label Ambiguity
- **Improvement**: Define stricter labeling guidelines and use multiple annotators with consensus checking to reduce inconsistency.

#### ✅ Model Accuracy
- **Best Accuracy Achieved**: _[insert highest accuracy observed]_  
- **Assessment**: Predictions whose top-class confidence is at least 60% can be accepted automatically, so partial automation can be adopted (with human-in-the-loop review for low-confidence predictions).

#### ✅ Next Steps
- **Resources Required**:  
  - Access to additional labeled English data  
  - Integration of text augmentation techniques  
  - Model retraining and periodic evaluation  
- **Timeline**: _[insert estimated timeline, e.g., 2–4 weeks for next iteration]_

#### ✅ App Modifications
- Enable articles to belong to **multiple categories**, using multi-label classification.
- Incorporate **language detection** and filter non-supported entries.
- Improve UI feedback by showing model confidence to end-users.

---

### Final Note

This system is now ready for partial deployment with proper monitoring. Future iterations should focus on expanding data diversity, improving classification accuracy, and adapting to changes in input data.
