What you’re seeing is a classic example of **misclassification due to a very small dataset** and the limitations of a simple model like `MultinomialNB` on text features. Let me break it down step by step.

---

### 1. **How the Model Sees the Sentences**

The model doesn’t “understand” language. It only sees **numbers representing word importance** (TF-IDF scores).

For example:

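* `'I am learning neural networks.'` → after stop-word removal, the tokens are `learning`, `neural`, and `networks`. Of these, only `learning` appears in your ten example sentences, and only in a Tech one ("Machine learning models require careful tuning."); `neural` and `networks` are unseen and contribute nothing. Provided that sentence landed in the training split, this single word is enough to push the prediction toward **Tech**, which is correct.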

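* `'It's raining heavily.'` → after stop-word removal, the tokens are `raining` and `heavily`. Neither appears in your training data, so the vectorizer simply drops them. There is no "closest word" matching: the model gets no usable features for this sentence at all.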

---

### 2. **Why `'It's raining heavily.'` was classified as Tech**

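* With only **10 sentences** (7 of them used for training after the 70/30 split), the vocabulary is tiny and contains **none of the words in this sentence**.
* `TfidfVectorizer` **ignores words it did not see during fitting**, so the vector for this sentence is **all zeros**.
* With an all-zero input, the word likelihoods in `MultinomialNB` contribute nothing and the prediction is decided by the **class priors** alone, i.e. by whichever class is more frequent in the training split. The full dataset is balanced (5 Tech, 5 Non-Tech), but 7 training sentences cannot split evenly, and a Tech prediction implies that Tech ended up with 4 of the 7.

This is why it incorrectly predicted **Tech**.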
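
You can check this fallback directly. The sketch below assumes a hypothetical 4-vs-3 training split (your real split depends on `random_state`) and compares the predicted probabilities for the unseen sentence with the class priors the model learned.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training split: 4 Tech and 3 Non-Tech sentences
train_sentences = [
    "Python is a great programming language for AI.",
    "Machine learning models require careful tuning.",
    "Data visualization helps understand trends.",
    "Artificial intelligence is transforming industries.",
    "The weather today is sunny and warm.",
    "I need to buy groceries after work.",
    "My car broke down on the way home.",
]
train_labels = ["Tech"] * 4 + ["Non-Tech"] * 3

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_sentences)
model = MultinomialNB().fit(X_train, train_labels)

# The unseen sentence has no known words, so its vector is all zeros
X_new = vectorizer.transform(["It's raining heavily."])
print("Non-zero features:", X_new.nnz)

# Class priors learned from label frequencies in the training split
priors = np.exp(model.class_log_prior_)
print("Class priors:", dict(zip(model.classes_, priors)))

# For an all-zero vector, the predicted probabilities equal the priors
proba = model.predict_proba(X_new)[0]
print("Predicted probabilities:", dict(zip(model.classes_, proba)))

prediction = model.predict(X_new)[0]
print("Prediction:", prediction)
```

Any sentence made up entirely of unseen words gets the same answer here, regardless of what it is about.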

---

### 3. **Limitations of this approach**

* **Tiny dataset**: 10 sentences is far too few to generalize.
* **Simple features**: TF-IDF captures how often a word occurs, weighted by how rare it is across documents; it ignores word order and context.
* **Simple model**: Naive Bayes assumes that **words are independent** given the class, so it cannot model phrases or sentence meaning.
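
A quick way to see the "no context" limitation: two sentences with the same words in a different order get identical TF-IDF vectors. The sentence pair below is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Same words, opposite meaning (illustrative pair, not from the dataset)
pair = ["The model beat the baseline.", "The baseline beat the model."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(pair).toarray()

# Bag-of-words features discard word order, so both rows are identical
same_vector = np.allclose(X[0], X[1])
print("Identical vectors:", same_vector)
```

Any classifier trained on these features, Naive Bayes included, has no way to tell the two sentences apart.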

---

### 4. **How to improve**

* **Add more labeled data**: At least hundreds of sentences per class.
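* **Use richer representations**: Pretrained word embeddings (Word2Vec, GloVe) or transformers (BERT) place related words close together, so “raining” can be recognized as weather-related even if it never appears in your training data.
* **Handle unknown words**: More training data shrinks the number of out-of-vocabulary words, so fewer inputs end up as all-zero vectors.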
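
Until you have more data, one stopgap is to at least surface the zero-feature case instead of silently falling back on the prior. The sketch below is one possible approach, not part of your notebook: `predict_or_flag` and the `"Unknown"` label are hypothetical names, and the four training sentences are an arbitrary subset of your examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def predict_or_flag(vectorizer, model, sentences, unknown_label="Unknown"):
    """Predict labels, returning unknown_label for sentences with no known words."""
    X = vectorizer.transform(sentences)
    nnz_per_row = X.getnnz(axis=1)  # non-zero features per sentence
    predictions = model.predict(X)
    return [
        str(pred) if nnz > 0 else unknown_label
        for pred, nnz in zip(predictions, nnz_per_row)
    ]


# Arbitrary subset of the example sentences, used only for illustration
train_sentences = [
    "Machine learning models require careful tuning.",
    "Python is a great programming language for AI.",
    "The weather today is sunny and warm.",
    "I need to buy groceries after work.",
]
train_labels = ["Tech", "Tech", "Non-Tech", "Non-Tech"]

vectorizer = TfidfVectorizer(stop_words="english")
model = MultinomialNB().fit(vectorizer.fit_transform(train_sentences), train_labels)

results = predict_or_flag(
    vectorizer, model, ["I am learning neural networks.", "It's raining heavily."]
)
print(results)
```

This does not make the model smarter; it only separates "the model decided" from "the model had nothing to go on".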

---

If you want, I can write a **small BERT-based classifier** that would likely classify `'It's raining heavily.'` as Non-Tech even with a small dataset, since pretrained representations tend to handle unseen words better than TF-IDF + Naive Bayes.


In [None]:
#---------------------------------------------
# NLP Exploration: Sentence Classification
#---------------------------------------------

# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# ----------------- DATA -----------------
# Example sentences (replace with your actual sentences)
sentences = [
    "I love reading books on data science.",
    "The weather today is sunny and warm.",
    "Python is a great programming language for AI.",
    "I need to buy groceries after work.",
    "The new movie was fantastic!",
    "Machine learning models require careful tuning.",
    "My car broke down on the way home.",
    "Data visualization helps understand trends.",
    "I enjoy hiking during the weekends.",
    "Artificial intelligence is transforming industries."
]

# Example labels for classification (mock labels for demonstration)
# Suppose we want to classify sentences as 'Tech' or 'Non-Tech'
labels = [
    "Tech", "Non-Tech", "Tech", "Non-Tech", "Non-Tech",
    "Tech", "Non-Tech", "Tech", "Non-Tech", "Tech"
]

# Convert to a pandas DataFrame for easier handling
df = pd.DataFrame({"sentence": sentences, "label": labels})
print("Data preview:\n", df.head())

# ----------------- SPLIT DATA -----------------
# Split the raw sentences first so the vectorizer never sees the test set
train_sents, test_sents, y_train, y_test = train_test_split(
    df['sentence'], df['label'], test_size=0.3, random_state=42, stratify=df['label']
)

# ----------------- PREPROCESSING -----------------
# Convert text into numerical features using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')  # Remove common words like 'the', 'is'
X_train = vectorizer.fit_transform(train_sents)     # Learn vocabulary and IDF from training data only
X_test = vectorizer.transform(test_sents)           # Reuse the fitted vocabulary for the test data

print("\nFeature names (sample):", vectorizer.get_feature_names_out()[:10])
print("Shape of training features matrix:", X_train.shape)

# ----------------- MODEL TRAINING -----------------
# Use a simple Naive Bayes classifier suitable for text data
model = MultinomialNB()
model.fit(X_train, y_train)

# ----------------- PREDICTION -----------------
y_pred = model.predict(X_test)

# ----------------- EVALUATION -----------------
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

# ----------------- PREDICT NEW SENTENCES -----------------
new_sentences = [
    "I am learning neural networks.",
    "It's raining heavily."
]
X_new = vectorizer.transform(new_sentences)
predictions = model.predict(X_new)

for sent, pred in zip(new_sentences, predictions):
    print(f"Sentence: '{sent}' -> Predicted Label: {pred}")
