<a href="https://colab.research.google.com/github/sarthakbiswas97/design-llm-apps-exercises/blob/main/language_detect.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from newspaper import Article

url = 'https://timesofindia.indiatimes.com/india/save-your-lives-first-when-pakistan-army-commander-abandoned-post-during-operation-sindoor/articleshow/121364227.cms'
article = Article(url)
article.download()
article.parse()

print("Title:", article.title)
print("Text:", article.text)
# newspaper3k also has other useful attributes like article.authors, article.publish_date, etc.


C4 is an English language dataset, constructed by filtering out text from the raw dataset with less than 0.99 probability of being English according to langdetect. However, a lot of non-English data persists in this dataset. If you know a second language, then use the realnewslike subset of C4 to find instances in which text from that language appears. In what contexts do these non-English text fragments appear? Could an LLM learn these languages using these leftover fragments?

In [None]:
!pip install --upgrade datasets
!pip install langdetect


In [None]:
from datasets import load_dataset
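from langdetect import DetectorFactory, detect_langs

# langdetect is non-deterministic by default; fix the seed so reruns give the same result
DetectorFactory.seed = 0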

articles_to_process = 100
confidence_threshold = 0.99

try:
  realnewslike = load_dataset("allenai/c4", "realnewslike", split="train", streaming=True)

  for i, news_text in enumerate(realnewslike):
    if i >= articles_to_process:
      break

    article_text = news_text.get("text")

    if not article_text or not article_text.strip():
      continue

    try:
      detected_lang_list = detect_langs(article_text)

      if not detected_lang_list:
        continue

      # Get the top language detection (most probable)
      top_detection = detected_lang_list[0]
      detected_lang_code = top_detection.lang  # e.g., 'en', 'es'
      detected_lang_prob = top_detection.prob  # e.g., 0.999


      # Condition 1: Detected language is English, but confidence is below the threshold
      is_low_confidence_english = (detected_lang_code == 'en' and detected_lang_prob < confidence_threshold)

      # Condition 2: Detected language is NOT English
      is_not_english = (detected_lang_code != 'en')

      if is_low_confidence_english or is_not_english:
        print("Found low confidence English or non-English article")
        print("------------------------------------")
        if is_low_confidence_english:
          print("Type: Low Confidence English")
        else: # is_not_english
          print("Type: Detected as Non-English")

        # Report details only for flagged articles, not for every article
        print(f"Detected Language: {detected_lang_code} (Probability: {detected_lang_prob:.4f})")
        print("Snippet (first 300 chars):")
        print(article_text[:300] + "...") # Print a snippet for context
        print("------------------------------------")

    except Exception as e:
      print("got error in language detection", e)

except Exception as e:
  print("Got some error: ",e)
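
The cell above scores whole articles, so a short foreign-language fragment inside a mostly English article will usually go unnoticed. A minimal sketch of a complementary check, assuming that non-Latin script is a usable proxy for non-English text: it scans each line with the standard-library `unicodedata` module and reports lines that contain letters from other scripts. The helper names `script_of` and `non_latin_fragments` and the `min_letters` cutoff are illustrative, not part of langdetect or datasets. Languages written in Latin script (Spanish, French, and so on) will not be caught by this heuristic.

```python
import unicodedata
from collections import Counter

def script_of(ch):
    """Return a rough script name for a letter, or None for non-letters."""
    if not ch.isalpha():
        return None
    try:
        # Unicode character names usually start with the script,
        # e.g. "DEVANAGARI LETTER KA" or "CYRILLIC SMALL LETTER A".
        return unicodedata.name(ch).split(" ")[0]
    except ValueError:
        return None  # code point without a name

def non_latin_fragments(text, min_letters=3):
    """Return a list of (line, script_counts) for lines with non-Latin letters."""
    results = []
    for line in text.splitlines():
        counts = Counter(
            s for s in map(script_of, line) if s is not None and s != "LATIN"
        )
        # Ignore lines with only one or two stray symbols (hypothetical cutoff)
        if sum(counts.values()) >= min_letters:
            results.append((line.strip(), dict(counts)))
    return results

sample = "The minister said the talks went well.\nПривет мир\nMore at example.com"
for line, scripts in non_latin_fragments(sample):
    print(scripts, "|", line)
```

Inside the streaming loop, `non_latin_fragments(article_text)` could be called on every article regardless of the langdetect score, which makes it easier to see the context (bylines, quotes, place names) in which the leftover fragments appear.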







# New section

Create a quality classifier using [fasttext](https://fasttext.cc/docs/en/support.html). Your positive examples can be drawn from Wikipedia, and the negative examples can be randomly drawn from the [unclean version of C4](https://huggingface.co/datasets/allenai/c4). Once trained, feed documents from the realnewslike subset of C4 to this classifier. Is this classifier able to do a good job?
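
Before training, the Wikipedia and C4 documents have to be written in the format fastText expects for supervised learning: one example per line, prefixed with `__label__<name>`. The following is a minimal sketch of that step, assuming the documents are already available as Python lists of strings. The label names `hq` and `lq`, the helper names, and the output file name `quality.train` are hypothetical choices for illustration.

```python
import random
import re

def to_fasttext_line(text, label):
    """Format one document as a single fastText supervised-training line."""
    # fastText reads one example per line, so collapse newlines and runs of spaces
    flat = re.sub(r"\s+", " ", text).strip()
    return f"__label__{label} {flat}"

def write_training_file(path, positives, negatives, seed=0):
    """Write shuffled labelled examples to `path`; return the number of lines."""
    lines = [to_fasttext_line(d, "hq") for d in positives if d and d.strip()]
    lines += [to_fasttext_line(d, "lq") for d in negatives if d and d.strip()]
    # Shuffle so that a later head/tail split does not put one class in each file
    random.Random(seed).shuffle(lines)
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    return len(lines)

# Placeholder documents; in the exercise these would come from Wikipedia and unclean C4
wiki_docs = ["Paris is the capital of France.\nIt lies on the Seine."]
c4_docs = ["click here buy now!!! best deals", "   "]

written = write_training_file("quality.train", wiki_docs, c4_docs)
print(written, "examples written")
```

The resulting file can be passed to `fasttext.train_supervised(input="quality.train")`, and `model.predict` then returns `__label__hq` or `__label__lq` with a probability for each realnewslike document.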

In [None]:
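# Build and install the fastText Python bindings from source
!git clone https://github.com/facebookresearch/fastText.git
%cd fastText
!pip install .
# NumPy is pinned to a 1.x release, presumably for compatibility with the fastText bindings
!pip install numpy==1.26.4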


In [None]:
# Getting and preparing data
!wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
!head cooking.stackexchange.txt
!wc cooking.stackexchange.txt
!head -n 12404 cooking.stackexchange.txt > cooking.train
!tail -n 3000 cooking.stackexchange.txt > cooking.valid

In [None]:
# Import and use fastText
import fasttext
import numpy as np # Import to check version

print(f"Using NumPy version: {np.__version__}") # Should show 1.26.4

# Train the model
model = fasttext.train_supervised(input="cooking.train")
model.save_model("model_cooking.bin")

# Make predictions
# Note: model.predict returns a tuple (labels, probabilities)
# We usually care about the labels for simple prediction.
predictions1 = model.predict("Which baking dish is best to bake a banana bread ?")
print(f"Prediction 1: {predictions1}")

predictions2 = model.predict("Why not put knives in the dishwasher?")
print(f"Prediction 2: {predictions2}")

# Test the model
test_results = model.test("cooking.valid")
print(f"Test results (N, P@1, R@1): {test_results}")

# To get precision and recall at k=5:
test_results_k5 = model.test("cooking.valid", k=5)
print(f"Test results (N, P@5, R@5): {test_results_k5}")
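
To make the numbers returned by `model.test` easier to interpret, here is a minimal sketch of precision and recall at k computed by hand. It assumes the pooled definition (correct predictions summed over all examples, then divided by the total number of predictions or of true labels), which is my reading of how fastText reports these values. The function name `precision_recall_at_k` and the toy labels are illustrative only.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k=1):
    """Compute pooled precision@k and recall@k.

    true_labels: list of sets of gold labels, one set per example
    ranked_predictions: list of label lists, best prediction first
    """
    correct = predicted = relevant = 0
    for truth, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        correct += len(set(top_k) & set(truth))  # hits among the top k
        predicted += len(top_k)                  # labels the model proposed
        relevant += len(truth)                   # labels it should have found
    return correct / predicted, correct / relevant

# Toy data: two questions with their gold tags and ranked model guesses
truth = [{"baking", "bread"}, {"knives"}]
ranked = [
    ["baking", "food-safety", "bread"],
    ["equipment", "knives", "cleaning"],
]

for k in (1, 2):
    p, r = precision_recall_at_k(truth, ranked, k=k)
    print(f"k={k}: P@{k}={p:.3f}, R@{k}={r:.3f}")
```

Raising k typically raises recall, because more gold labels get a chance to appear, while precision tends to fall when most examples have only a few true labels.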
