# 💬 Lesson 2: AI Reads & Understands Text!

**Time:** 60 minutes

**Goal:** Build an AI that can detect whether a review or comment is positive or negative!

---

## 🤔 Why Learn This?

Did you know:
- **Shopee/Lazada** - Auto-detect whether a review is good or bad
- **Twitter/X** - Trending topics analysis
- **YouTube** - Filter spam comments
- **ChatGPT** - Understands what you ask!

All of these use **NLP (Natural Language Processing)**! 🔥

## 🧠 How Does AI Understand Language?

1. **Language Model Pretraining:** The model learns the structure of the language (grammar, vocabulary)
2. **Fine-tuning:** We teach it a specific task (positive vs negative)

It's like someone who already knows English: we just need to teach them "this is positive, this is negative"!

In [None]:
# If you're using Google Colab:
# !pip install -Uqq fastai

from fastai.text.all import *

## Step 1: Download Dataset

We'll start with IMDB movie reviews because the dataset is ready to use.

After that we'll test with **Manglish**! 🇲🇾

In [None]:
# Download dataset (movie reviews)
path = untar_data(URLs.IMDB_SAMPLE)
print(f"Dataset: {path}")
print(f"Contents: {path.ls()}")

In [None]:
# Read the data
df = pd.read_csv(path/'texts.csv')
print(f"Total reviews: {len(df)}")
df.head()

In [None]:
# Look at one example review
print("Example review:")
print(df.iloc[0]['text'][:300] + "...")
print(f"\nLabel: {df.iloc[0]['label']}")

## Step 2: Prepare the Data

### 🧠 Concept: Tokenization

AI doesn't understand text the way we do. We have to turn it into numbers!

```
"Best gila movie ni!" 
     ↓
["best", "gila", "movie", "ni", "!"]
     ↓
[234, 567, 89, 12, 5]
```
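
The two arrows above can be sketched in plain Python. This is a toy version only: the helper names `tokenize`, `build_vocab` and `numericalize` are made up for this example, and the IDs are just handed out in order of first appearance, so they won't match the numbers in the diagram or the ones fastai assigns.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(token_lists):
    """Give each distinct token an integer ID, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def numericalize(tokens, vocab):
    """Swap each token for its ID from the vocabulary."""
    return [vocab[tok] for tok in tokens]

tokens = tokenize("Best gila movie ni!")
vocab = build_vocab([tokens])   # toy vocab built from this one sentence
ids = numericalize(tokens, vocab)

print(tokens)  # ['best', 'gila', 'movie', 'ni', '!']
print(ids)     # [0, 1, 2, 3, 4]
```

A real vocabulary is built from the whole training set, so common words end up with stable IDs across every review.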

In [None]:
# Build the DataLoaders
dls = TextDataLoaders.from_df(
    df,
    path=path,
    text_col='text',
    label_col='label',
    valid_col='is_valid'
)

In [None]:
# See how the text gets processed
dls.show_batch(max_n=3)

### 👀 Notice the Special Tokens!

- `xxbos` = Beginning of a text (marks where each review starts)
- `xxmaj` = Next word starts with a capital letter
- `xxunk` = Unknown word (not in the vocabulary)

This is how extra information gets "encoded" into the text for the AI!
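
To get a feel for these three tokens, here is a simplified imitation in plain Python. It is not fastai's actual tokenizer (which applies more rules than this), and both `mark_special_tokens` and the tiny `vocab` set are hypothetical names invented for the example.

```python
import re

BOS, MAJ, UNK = "xxbos", "xxmaj", "xxunk"

def mark_special_tokens(text, vocab):
    """Toy imitation: add xxbos, flag capitalised words, replace unknown words."""
    out = [BOS]  # every text starts with the beginning-of-text marker
    for word in re.findall(r"\w+|[^\w\s]", text):
        if word[0].isupper():
            out.append(MAJ)  # remember the capital letter, then lowercase the word
        lowered = word.lower()
        out.append(lowered if lowered in vocab else UNK)
    return out

# Hypothetical tiny English vocabulary (a real one has thousands of words)
vocab = {"this", "movie", "is", "great", "!"}

marked = mark_special_tokens("This movie is gila great!", vocab)
print(marked)
# ['xxbos', 'xxmaj', 'this', 'movie', 'is', 'xxunk', 'great', '!']
```

Notice that "gila" turns into `xxunk` because it isn't in the vocabulary, so the model loses whatever that word meant.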

## Step 3: Build the Model

In [None]:
# AWD-LSTM = a type of neural network for text
learn = text_classifier_learner(dls, AWD_LSTM, metrics=accuracy)
print("Model ready! 🚀")

## Step 4: Find the Best Learning Rate
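
The idea behind a learning rate finder is roughly this: take a training step, make the learning rate a bit bigger, repeat, and stop once the loss blows up. Below is a minimal sketch of that idea on a made-up one-number "model" with a simple quadratic loss. The function `lr_range_test`, the loss, the stopping threshold and the "divide by 10" rule of thumb are all assumptions for illustration, not what fastai does internally.

```python
def loss_fn(w):
    """Toy loss: lowest when w equals 3."""
    return (w - 3.0) ** 2

def grad_fn(w):
    """Gradient of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def lr_range_test(start_lr=1e-4, end_lr=10.0, num_steps=50):
    """Take one gradient step per learning rate, growing the rate exponentially."""
    w = 0.0
    best_loss = float("inf")
    history = []
    for i in range(num_steps):
        lr = start_lr * (end_lr / start_lr) ** (i / (num_steps - 1))
        w -= lr * grad_fn(w)
        loss = loss_fn(w)
        history.append((lr, loss))
        if loss > 4 * best_loss:  # loss is blowing up, so stop the sweep
            break
        best_loss = min(best_loss, loss)
    return history

history = lr_range_test()
lr_at_min, min_loss = min(history, key=lambda pair: pair[1])
suggested_lr = lr_at_min / 10  # rule of thumb: stay well below the blow-up point

print(f"Steps run before stopping: {len(history)}")
print(f"Loss was lowest at lr={lr_at_min:.3f}, so try around lr={suggested_lr:.3f}")
```

The plot you get from the real finder shows the same story: loss drops as the rate grows, then shoots up once the rate is too big.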

In [None]:
# Learning rate finder - find the "sweet spot" for training
learn.lr_find()

## Step 5: TRAIN! 🏋️

In [None]:
# Fine-tune: 1 epoch with the pretrained layers frozen, then 2 epochs with everything unfrozen
learn.fine_tune(2, 1e-2)

### 🎉 Check Accuracy!

If you get around 85% or higher, our AI can already detect sentiment quite well! This is a small sample dataset, so the exact number will vary from run to run.

## Step 6: Look at the Results

In [None]:
# Show predictions vs actual
learn.show_results()

In [None]:
# Confusion matrix
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
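
A confusion matrix is just a count of every (actual, predicted) pair. Here is a minimal sketch in plain Python, using six made-up labels rather than real model output; `confusion_counts` is a hypothetical helper for this example.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Made-up labels for illustration only (not real model output)
actual    = ["positive", "positive", "negative", "negative", "positive", "negative"]
predicted = ["positive", "negative", "negative", "positive", "positive", "negative"]

counts = confusion_counts(actual, predicted)
labels = ["negative", "positive"]

# Rows are the actual label, columns are the predicted label
for a in labels:
    row = [counts[(a, p)] for p in labels]
    print(f"actual {a:<8} -> {row}")

# Accuracy is the diagonal (correct guesses) divided by the total
correct = sum(counts[(lab, lab)] for lab in labels)
accuracy = correct / len(actual)
print(f"Accuracy: {accuracy:.1%}")
```

The off-diagonal numbers tell you which kind of mistake the model makes more often: calling good reviews bad, or bad reviews good.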

## Step 7: Test with Your Own Reviews! 🎮

Let's test with English first, then try Manglish!
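
The cells in this step unpack three values from `learn.predict`: the predicted label, its index, and one probability per class. Here is a rough sketch of how a label and a confidence can come out of a model's raw scores. The scores are made up, and the `softmax` helper is written by hand just for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that add up to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ["negative", "positive"]
raw_scores = [0.3, 2.1]  # made-up scores, one per class

probs = softmax(raw_scores)
pred_idx = max(range(len(probs)), key=probs.__getitem__)  # index of the biggest probability
pred = labels[pred_idx]

print(f"{pred.upper()} ({probs[pred_idx]:.1%})")
```

So when the output says something like "POSITIVE (90%)", the percentage is the probability of the winning class, not a guarantee that the answer is right.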

In [None]:
# Test with custom reviews (English)
english_reviews = [
    "This movie was absolutely fantastic! Best film I've seen all year.",
    "Terrible waste of time. The plot made no sense and acting was awful.",
    "It was okay, nothing special but not bad either."
]

print("=== ENGLISH REVIEWS ===")
for review in english_reviews:
    pred, pred_idx, probs = learn.predict(review)
    emoji = "👍" if pred == 'positive' else "👎"
    print(f"\n{emoji} {pred.upper()} ({probs[pred_idx]:.1%})")
    print(f"   \"{review[:50]}...\"")

In [None]:
# üá≤üáæ SEKARANG CUBA MANGLISH!
# Model ni trained on English, tapi let's see!

manglish_reviews = [
    "Wah best gila movie ni! Highly recommend!",
    "Boring la cerita ni, waste of time je",
    "Oklah, not bad but not great also",
    "Gila babeng sedap makanan dia! Must try!",
    "Terrible service, waited 1 hour for food. Never again!",
    "Packaging cantik, delivery laju. 5 stars!",
    "Barang sampai rosak, seller tak reply. Very bad"
]

print("=== MANGLISH REVIEWS ===")
for review in manglish_reviews:
    pred, pred_idx, probs = learn.predict(review)
    emoji = "👍" if pred == 'positive' else "👎"
    print(f"\n{emoji} {pred.upper()} ({probs[pred_idx]:.1%})")
    print(f"   \"{review}\"")

### 🤔 Discussion: Can AI Understand Manglish?

**Try it with your own reviews!**

In [None]:
# YOUR TURN!
# Replace the text in the quotes with your own review

my_review = "Replace this text with your own review!"

pred, pred_idx, probs = learn.predict(my_review)
emoji = "👍" if pred == 'positive' else "👎"
print(f"{emoji} AI says: {pred.upper()}")
print(f"Confidence: {probs[pred_idx]:.1%}")

---

# 🏆 CHALLENGE: Review Classifier Competition!

### Activity (15 minutes)

1. **Each student writes 3 reviews** (about anything you like):
   - 1 positive review
   - 1 negative review
   - 1 tricky/neutral review

2. **Test them with the AI** - did it predict correctly?

3. **Share the funniest or most surprising one!**

### Example Topics:
- Review of the school canteen
- Review of a game like Mobile Legends/PUBG
- Review of a movie or Korean drama
- Review of a restaurant near Shah Alam
- Review of your favourite teacher (joking! 😂)

In [None]:
# COMPETITION TIME!
# Paste your reviews here

student_reviews = [
    # Examples - replace these with your own reviews!
    "Kantin sekolah best! Nasi goreng dia sedap gila",
    "WiFi sekolah slow sangat, nak submit assignment pun susah",
    "Cikgu matematik explain okay je, kadang-kadang faham kadang-kadang tak"
]

print("🏆 STUDENT REVIEWS COMPETITION 🏆\n")
for i, review in enumerate(student_reviews, 1):
    pred, pred_idx, probs = learn.predict(review)
    emoji = "👍" if pred == 'positive' else "👎"
    print(f"Review #{i}: {emoji} {pred.upper()} ({probs[pred_idx]:.1%})")
    print(f"   \"{review}\"\n")

---

## 💡 Discussion Questions

1. **Why does the AI sometimes get Manglish wrong?**
   - It was trained mostly on English
   - "Best gila": "gila" literally means "crazy" in Malay, but here it's slang for "extremely". A model trained on English has probably never seen the word, so it can't use it as a clue!

2. **How can Shopee/Lazada detect fake reviews?**
   - Same concept: train an AI with examples of real and fake reviews

3. **Privacy concerns?**
   - Shopee knows what you like based on the reviews you read!
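
For question 1, one way to see the problem is to count how many words in a review the model has never seen. This is a rough sketch: `oov_rate` is a hypothetical helper, and `english_vocab` is a tiny made-up vocabulary standing in for the model's real one.

```python
import re

def oov_rate(text, vocab):
    """Return the share of tokens that are missing from the vocabulary, plus the tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    if not tokens:
        return 0.0, []
    unknown = [t for t in tokens if t not in vocab]
    return len(unknown) / len(tokens), unknown

# Hypothetical tiny English vocabulary (a real model's vocabulary is much bigger)
english_vocab = {"best", "movie", "this", "highly", "recommend",
                 "boring", "waste", "of", "time", "!", ","}

rate, unknown = oov_rate("Wah best gila movie ni! Highly recommend!", english_vocab)
print(f"Unknown words: {unknown}")   # ['wah', 'gila', 'ni']
print(f"Out-of-vocabulary rate: {rate:.0%}")
```

The higher that rate, the more the model is guessing from only the few words it does recognise.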

---

## 🌟 Real-World Applications

| Platform | Use Case |
|----------|----------|
| Shopee/Lazada | Filter fake reviews, sentiment analysis |
| TikTok | Detect harmful comments |
| Grab | Analyze driver/rider feedback |
| Banks (Maybank etc) | Customer complaint classification |
| News sites | Fake news detection |

---

## 🏠 Homework Ideas

1. **Build a Malay Sentiment Classifier**
   - Collect reviews in Bahasa Malaysia
   - Train a new model!

2. **Analyze Twitter Trending**
   - Scrape tweets about a trending topic
   - What's the overall sentiment?

3. **K-pop Fanwar Analysis** 😂
   - Collect tweets about BLACKPINK vs BTS
   - Which fandom is more positive?

---

## 🔥 Bonus: AI That Can Write!

If there's time, try building a **Language Model** that can generate text!
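
A language model's basic job is to guess the next word. Here is a toy sketch of that idea using simple word-pair counts (a "bigram" model) in plain Python. It is far simpler than AWD-LSTM, and the three-sentence corpus and the function names are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count which word follows which across all the sentences."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            model[current][nxt] += 1
    return model

def generate(model, start, n_words=5):
    """Keep appending the most common follower of the last word."""
    words = [start]
    for _ in range(n_words):
        followers = model.get(words[-1])
        if not followers:  # nothing ever followed this word, so stop
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

# Made-up mini corpus for illustration
corpus = [
    "this movie was great",
    "this movie was great fun",
    "this movie was boring",
]

model = train_bigram_model(corpus)
print(generate(model, "this", n_words=3))  # this movie was great
```

A neural language model does the same kind of next-word guessing, but it looks at much more context than just the previous word.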

In [None]:
# BONUS: Language Model (text generator)
# Uncomment to try it!

# dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True, valid_pct=0.1)
# learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()])
# learn_lm.fine_tune(1)

# # Generate text!
# print(learn_lm.predict("This movie was", n_words=20))