
In [None]:
!pip install wikipedia
!pip install --upgrade datasets fsspec

**Getting and preparing data**

In [None]:
import re
import random
from datasets import load_dataset
from itertools import islice

# 1) Configuration
C4_SAMPLES    = 1000
WIKI_SAMPLES  = 1000
RAW_FILE      = "combined.txt"
SHUFFLED_FILE = "combined_shuffled.txt"
TRAIN_FILE    = "train.txt"
VALID_FILE    = "valid.txt"
BAD_LABEL     = "__label__bad"
GOOD_LABEL    = "__label__good"

def clean_text(text):
    """
    1) strip whitespace
    2) lowercase
    3) replace every character other than a-z and whitespace with a space
    4) collapse all whitespace to single spaces
    """
    text = text.strip().lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    text = " ".join(text.split())
    return text

# 2) Write raw, labeled examples to RAW_FILE
with open(RAW_FILE, "w", encoding="utf-8") as out:
    # 2a) C4 "bad" examples
    c4_iter = load_dataset(
        "allenai/c4", "en.noclean", streaming=True
    )["train"]
    for rec in islice(c4_iter, C4_SAMPLES):
        txt = clean_text(rec["text"])
        if txt:
            out.write(f"{BAD_LABEL} {txt}\n")

    # 2b) Wikipedia "good" examples via HF snapshot
    # Use config "20220301.en" for English wiki
    wiki_stream = load_dataset(
        "wikipedia",
        "20220301.en",
        streaming=True,
        trust_remote_code=True
    )["train"]
    # shuffle a buffer of 10k, then take WIKI_SAMPLES
    wiki_shuf = wiki_stream.shuffle(buffer_size=10_000, seed=42)
    for rec in islice(wiki_shuf, WIKI_SAMPLES):
        # take only the first paragraph
        first_para = rec["text"].split("\n\n", 1)[0]
        txt = clean_text(first_para)
        if txt:
            out.write(f"{GOOD_LABEL} {txt}\n")

# 3) Read all lines back and shuffle in memory
with open(RAW_FILE, "r", encoding="utf-8") as f:
    lines = f.readlines()
random.seed(42)  # fixed seed so the shuffle and the train/valid split are reproducible
random.shuffle(lines)

# 4) Write out the shuffled dataset
with open(SHUFFLED_FILE, "w", encoding="utf-8") as f:
    f.writelines(lines)

# 5) Split 80/20 into train.txt and valid.txt
split_idx = int(len(lines) * 0.8)
with open(TRAIN_FILE, "w", encoding="utf-8") as f_train:
    f_train.writelines(lines[:split_idx])
with open(VALID_FILE, "w", encoding="utf-8") as f_valid:
    f_valid.writelines(lines[split_idx:])

# 6) Sanity check: print a few samples
print("=== TRAIN SAMPLE ===")
for ln in lines[:5]:
    print(ln.strip())
print("\n=== VALID SAMPLE ===")
for ln in lines[split_idx : split_idx + 5]:
    print(ln.strip())


**Model training**

In [None]:
!pip install --upgrade fasttext
!pip install numpy==1.26.4

In [None]:
import fasttext
import numpy as np

# print(f"Using NumPy version: {np.__version__}")

MODEL_OUTPUT_FILE = "quality_classifier.bin"

print("Starting model training...")
model = fasttext.train_supervised(
    input=TRAIN_FILE,
    lr=0.5,
    epoch=25,
    wordNgrams=2,
    dim=150,
    # autotuneValidationFile=VALID_FILE,
    # autotuneDuration=600,             # e.g., 10 minutes for autotune
    # autotuneMetric="f1",
    verbose=2 # To see training progress
)
print("Training complete.")

# Save the trained model
model.save_model(MODEL_OUTPUT_FILE)
print(f"Model saved to {MODEL_OUTPUT_FILE}")

# Evaluate on your validation set
print("\nEvaluating on the validation set:")
N, P, R = model.test(VALID_FILE)
print(f"Validation N: {N}")
print(f"Validation P@1: {P:.4f}")
print(f"Validation R@1: {R:.4f}")

# If you enable autotune, model.test(VALID_FILE) above already scores the tuned model.
# get_best_f1_score() / get_best_hyperparameters() do not appear to be part of the
# fastText Python API. The selected values should instead be readable as model
# attributes (assumption: verify against your installed fastText version):
# print(model.lr, model.epoch, model.wordNgrams, model.dim)


**Evaluation on "RealNewsLike" C4**

In [None]:
REALNEWS_SAMPLES = 10000

try:
    realnewslike = load_dataset("allenai/c4", "realnewslike", streaming=True)["train"]
except Exception as e:
    # Stop here: the loop below cannot run without the dataset.
    raise RuntimeError(f"Error loading realnewslike: {e}") from e

predictions_on_realnews = []
processed_count = 0

# Read at most 2x the target so records that are empty after cleaning can be skipped
for rec in islice(realnewslike, REALNEWS_SAMPLES * 2):
    if processed_count >= REALNEWS_SAMPLES:
        break

    original_text = rec.get("text", "")
    cleaned_text = clean_text(original_text)
    if not cleaned_text:
        continue  # nothing left to classify after cleaning

    predicted_labels, probabilities = model.predict(cleaned_text)
    predictions_on_realnews.append({
        "original_text": original_text[:500] + "...",  # store a snippet
        "cleaned_text": cleaned_text[:500] + "...",
        "predicted_label": predicted_labels[0],
        "probability": float(probabilities[0]),
    })
    processed_count += 1  # count only records that were actually classified

print(f"\nMade predictions on {len(predictions_on_realnews)} 'realnewslike' samples.")

# Print a few sample predictions
print("\n--- Sample Predictions on 'RealNewsLike' C4 ---")
for i, pred in enumerate(predictions_on_realnews[:10]): # Print first 10
    print(f"\nSample {i+1}:")
    # print(f"Original Snippet: {pred['original_text']}")
    print(f"Predicted Label: {pred['predicted_label']} (Confidence: {pred['probability']:.4f})")
    if i < 3: # Print more details for the very first few
        print(f"Cleaned Snippet for Prediction: {pred['cleaned_text'][:200]}...")


# Count the distribution of predictions
good_count = sum(1 for p in predictions_on_realnews if p['predicted_label'] == GOOD_LABEL)
bad_count = sum(1 for p in predictions_on_realnews if p['predicted_label'] == BAD_LABEL)

if predictions_on_realnews:
    print(f"\nDistribution on 'RealNewsLike' C4 ({len(predictions_on_realnews)} samples):")
    print(f"Predicted as GOOD: {good_count} ({good_count/len(predictions_on_realnews)*100:.2f}%)")
    print(f"Predicted as BAD:  {bad_count} ({bad_count/len(predictions_on_realnews)*100:.2f}%)")
else:
    print("No predictions were made on realnews data.")





Steps completed in this notebook:

1.  **Created a quality classifier:**
    *   Positive examples from Wikipedia.
    *   Negative examples from the unclean version of C4.
2.  **Trained the classifier** using fastText.
3.  **Fed documents from the `realnewslike` subset of C4 to this classifier.**
4.  **Gathered results** from this classification.

The evaluation cell reported:

*   **Made predictions on 10000 'realnewslike' samples.**
*   **Distribution on 'RealNewsLike' C4 (10000 samples):**
    *   Predicted as GOOD: 877 (8.77%)
    *   Predicted as BAD: 9123 (91.23%)

This is the crucial piece of information the exercise asked you to find.

Now, let's briefly touch upon the final question of the exercise: **"Is this classifier able to do a good job?"**

Based on your results:

*   Your classifier, trained to see Wikipedia as the gold standard for "good" and very messy C4 (`en.noclean`) as "bad", considers the vast majority (91.23%) of the `realnewslike` C4 subset to be "bad."
*   This suggests that, according to the features your model learned (unigrams and bigrams over lowercased, letters-only text, since training used `wordNgrams=2` and `clean_text` strips everything else), the `realnewslike` C4 data is significantly different from Wikipedia and more similar to the noisy C4 data it was trained to identify as "bad."
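
One way to probe why the two classes look so different to the model is to compare simple surface statistics of the training data. The data-preparation cell keeps only the first paragraph of each Wikipedia article but the whole text of each C4 record, so example length is a plausible confound. The sketch below is a minimal check under that assumption; `token_counts_by_label` and `summarize` are hypothetical helper names, not part of the original notebook.

```python
import os
from collections import defaultdict
from statistics import median


def token_counts_by_label(path):
    """Map each fastText label to the token counts of its examples."""
    counts = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            label, tokens = parts[0], parts[1:]
            counts[label].append(len(tokens))
    return dict(counts)


def summarize(counts):
    """Return {label: (number_of_examples, median_token_count)}."""
    return {label: (len(c), median(c)) for label, c in counts.items()}


# Only runs if the notebook's train.txt is present in the working directory.
if os.path.exists("train.txt"):
    for label, (n, med) in summarize(token_counts_by_label("train.txt")).items():
        print(f"{label}: {n} examples, median {med} tokens")
```

If the median lengths differ by a large factor, part of what the classifier calls "quality" may simply be document length and opening-paragraph phrasing.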

Whether this constitutes a "good job" is interpretive and depends on the goal:

*   **If the goal was to create a very strict filter that only accepts text of extremely high, Wikipedia-like quality:** Then one might argue it's doing a "good job" by being highly selective. It's effectively saying that `realnewslike` C4, while cleaner than `en.noclean`, still doesn't meet the bar set by Wikipedia.
*   **If the goal was to identify most reasonably well-written news-like articles as "good":** Then it might not be doing a "good job," as it's rejecting a large portion of the `realnewslike` dataset. This could mean your definition of "bad" (based on `en.noclean`) is too broad or that `realnewslike` C4 has characteristics that your model, trained on the extremes, flags as low quality.
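
The two goals above differ mainly in how strict the cut-off is. Rather than taking only the top label, one option is to read the probability assigned to the "good" label and sweep a threshold over it. The sketch below works on fastText-style `(labels, probabilities)` pairs; the helper names and the demo predictions are made up for illustration and are not output from the notebook's model.

```python
GOOD_LABEL = "__label__good"  # same label string as in the data-preparation cell


def good_probability(labels, probs, good_label=GOOD_LABEL):
    """Return P(good) from one prediction given as parallel label/probability sequences."""
    for label, prob in zip(labels, probs):
        if label == good_label:
            return float(prob)
    return 0.0  # the good label was not among the returned labels


def acceptance_rate(good_probs, threshold):
    """Fraction of documents whose P(good) is at least `threshold`."""
    if not good_probs:
        return 0.0
    return sum(p >= threshold for p in good_probs) / len(good_probs)


def sweep(good_probs, thresholds=(0.5, 0.3, 0.1, 0.05)):
    """Acceptance rate at each threshold, from strict to lenient."""
    return {t: acceptance_rate(good_probs, t) for t in thresholds}


# Made-up predictions standing in for real model output
demo = [
    (("__label__bad", "__label__good"), (0.9, 0.1)),
    (("__label__good", "__label__bad"), (0.7, 0.3)),
    (("__label__bad", "__label__good"), (0.6, 0.4)),
    (("__label__bad", "__label__good"), (0.98, 0.02)),
]
demo_probs = [good_probability(labels, probs) for labels, probs in demo]
print(sweep(demo_probs))
```

With the trained model, calling `model.predict(cleaned_text, k=2)` returns both labels with their probabilities, which can be passed straight to `good_probability`. A threshold of 0.5 reproduces the top-label decision used in the evaluation cell; lower thresholds give a more lenient filter.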

**In conclusion:**

*   **Exercise Flow:** **Yes, you have completed all the steps outlined in the exercise.**
*   **Classifier Performance ("Good Job?"):** The classifier is performing *consistently* with its training. It has learned to differentiate Wikipedia from very noisy text, and it's applying that learning to the `realnewslike` subset. The high "bad" rate for `realnewslike` is an interesting finding and provides insight into how different these datasets are, at least from the perspective of your model.
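
To check how well the classifier separates the two classes it was actually trained on, per-label precision and recall on the validation set are more informative than the single P@1/R@1 pair: with exactly one label per example and one prediction per example, those two numbers are both equal to accuracy. The sketch below computes per-label metrics from `(true_label, predicted_label)` pairs; `per_label_metrics` and `read_labeled` are hypothetical helpers, and the demo pairs are invented.

```python
def per_label_metrics(pairs):
    """Compute precision and recall per label from (true_label, predicted_label) pairs."""
    pairs = list(pairs)
    labels = {t for t, _ in pairs} | {p for _, p in pairs}
    metrics = {}
    for label in sorted(labels):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        predicted = sum(1 for _, p in pairs if p == label)
        actual = sum(1 for t, _ in pairs if t == label)
        metrics[label] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return metrics


def read_labeled(path):
    """Yield (label, text) from a fastText-format file, skipping lines with no text."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, _, text = line.strip().partition(" ")
            if text:
                yield label, text


# Invented pairs, for illustration only
demo_pairs = [
    ("__label__good", "__label__good"),
    ("__label__good", "__label__bad"),
    ("__label__bad", "__label__bad"),
    ("__label__bad", "__label__bad"),
]
for label, m in per_label_metrics(demo_pairs).items():
    print(f"{label}: precision={m['precision']:.2f} recall={m['recall']:.2f}")
```

With the notebook's objects, the pairs could be built as `((label, model.predict(text)[0][0]) for label, text in read_labeled(VALID_FILE))`. Low recall on `__label__good` here would point to a problem in the classifier itself, whereas high recall would suggest the 91% "bad" rate mostly reflects how different `realnewslike` is from the Wikipedia examples.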