<a href="https://colab.research.google.com/github/samiha-mahin/NLP/blob/main/NLP_Fundamentals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## **What is NLP?**

**NLP (Natural Language Processing)** is a part of AI that helps computers understand human language (like English, Bangla, etc.).

In simple words:

NLP = Computer learning to read, understand, and reply to human text or speech.

---

## Easy Real-Life Examples

### Example 1 — Chatbot

When you talk to **ChatGPT**, it understands your message and replies.

You type:

> "How are you?"

Computer understands meaning → replies.

That understanding = NLP.

---

### Example 2 — Google Translate

You write:

> "I love you"

It translates to Bangla.

Understanding + translating language = NLP

---

### Example 3 — YouTube Recommendations

When you search:

> "sad songs"

It understands your words and shows related videos.

Understanding your text = NLP

---

## Super Simple Example

Sentence:

> "I love cats"

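A computer cannot work with raw words directly.

So NLP converts the sentence into a form the computer can process, for example by splitting it into words and labelling each one:

* "I" → pronoun (the speaker)
* "love" → emotion
* "cats" → animal

Now the computer has a structured representation of the sentence that it can work with.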
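
As a rough sketch of the very first step, the sentence can be split into tokens and each token mapped to an integer ID. The whitespace split and the ID scheme below are simplifying assumptions for illustration, not how any particular NLP library works.

```python
# Minimal sketch: turn a sentence into tokens, then into integer IDs.
sentence = "I love cats"

# Lowercase and split on whitespace (a deliberately naive tokenizer).
tokens = sentence.lower().split()

# Give each unique token an ID in order of first appearance.
token_to_id = {}
for token in tokens:
    if token not in token_to_id:
        token_to_id[token] = len(token_to_id)

ids = [token_to_id[token] for token in tokens]

print(tokens)  # ['i', 'love', 'cats']
print(ids)     # [0, 1, 2]
```

The IDs carry no meaning by themselves; they only give later steps something numeric to work with.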

---


## Simple Definition (Exam Style)

NLP is a field of Artificial Intelligence that enables computers to understand, interpret, and generate human language.

---


## **What is Bag of Words (BoW)?**

Bag of Words = A way to convert text into numbers by counting words.

Computers don’t understand sentences, so we turn words into numbers.

It only cares about:

* Which words appear
* How many times they appear

It does NOT care about:

* Grammar
* Word order
* Meaning

That’s why it’s called a “bag” — words are just thrown inside like items in a bag.

---

## Easy Example

Sentences:

1. "I love cats"
2. "I love dogs"

### Step 1 — Make Vocabulary (all unique words)

Vocabulary = I, love, cats, dogs

---

### Step 2 — Count words in each sentence

Sentence 1: "I love cats"

| I | love | cats | dogs |
| - | ---- | ---- | ---- |
| 1 | 1    | 1    | 0    |

Sentence 2: "I love dogs"

| I | love | cats | dogs |
| - | ---- | ---- | ---- |
| 1 | 1    | 0    | 1    |

Now each sentence is a vector of numbers that the computer can work with.
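
The two steps above can be sketched in plain Python. This is a minimal illustration that splits on whitespace and keeps the original casing, which is an assumption made to match the tables above; real libraries tokenize differently.

```python
from collections import Counter

sentences = ["I love cats", "I love dogs"]

# Step 1: build the vocabulary in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Step 2: count how often each vocabulary word occurs in each sentence.
vectors = []
for sentence in sentences:
    counts = Counter(sentence.split())  # missing words count as 0
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['I', 'love', 'cats', 'dogs']
print(vectors)     # [[1, 1, 1, 0], [1, 1, 0, 1]]
```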

---

## Why We Use Bag of Words

To train ML models for:

* Spam detection
* Sentiment analysis
* Text classification

---

## Super Simple Definition (Exam Style)

Bag of Words is a technique in NLP that represents text as numerical vectors based on word frequency, ignoring grammar and word order.

---

## One Problem of BoW

These two sentences get exactly the same BoW vector (once the text is lowercased):

* "Dog bites man"
* "Man bites dog"

This happens because BoW only counts words and ignores their order.
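
A quick way to check this is to count the words in both sentences and compare the counts. The sketch below assumes simple lowercasing and whitespace splitting.

```python
from collections import Counter

# Count the words in each sentence, ignoring case and order.
first = Counter("Dog bites man".lower().split())
second = Counter("Man bites dog".lower().split())

print(first)            # Counter({'dog': 1, 'bites': 1, 'man': 1})
print(first == second)  # True
```

The two sentences mean very different things, yet their word counts are identical.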

---



**What does CountVectorizer do?**

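CountVectorizer converts text into numbers by counting how many times each word appears.

It is used to create the **Bag of Words** representation.

It comes from the Python machine learning library scikit-learn.

By default it lowercases the text and ignores one-character tokens such as "I", so those words do not show up in the vocabulary.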

**Example 1**

```python
from sklearn.feature_extraction.text import CountVectorizer

# Our sentences (documents)
sentences = [
    "I love cats",
    "I love dogs",
    "Cats and dogs both are cute but I love cats most"
]

# Create Bag of Words model
vectorizer = CountVectorizer()

# Convert text → numbers
bow = vectorizer.fit_transform(sentences)

# Show vocabulary (unique words)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Show BoW matrix
print("BoW Matrix:\n", bow.toarray())
```

```
Vocabulary: ['and' 'are' 'both' 'but' 'cats' 'cute' 'dogs' 'love' 'most']
BoW Matrix:
 [[0 0 0 0 1 0 0 1 0]
 [0 0 0 0 0 0 1 1 0]
 [1 1 1 1 2 1 1 1 1]]
```


**Example 2**

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample dataset
X = [
    "Win money now",
    "Free prize waiting",
    "Claim your reward",
    "Congratulations you won",
    "Call me later",
    "Let's meet tomorrow",
    "Are you coming today",
    "Dinner tonight?",
    "Important update for you",
    "Limited offer just for you"
]

y = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]  # 1 = spam, 0 = not spam

# Bag of Words: learn the vocabulary and count the words in every message.
# Fitting on all messages before splitting is fine for a demo, but it lets
# words from the test messages influence the vocabulary.
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(X)

# Train-test split (30% of the messages are held out for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X_bow, y, test_size=0.3, random_state=42
)

# Model training
model = MultinomialNB()
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
```

```
Accuracy: 0.6666666666666666
```

With only 10 messages, the test set holds just 3 of them, so this score is a rough illustration and not a reliable estimate.
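
One way to classify new messages, and to keep the vocabulary from seeing the test messages, is to split the raw text first and wrap the vectorizer and the model in a scikit-learn pipeline. The sketch below reuses the same toy dataset; the names `texts`, `labels`, `spam_model`, and `new_messages` are illustrative, and no output is shown because predictions on such a tiny dataset are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "Win money now", "Free prize waiting", "Claim your reward",
    "Congratulations you won", "Call me later", "Let's meet tomorrow",
    "Are you coming today", "Dinner tonight?",
    "Important update for you", "Limited offer just for you",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]  # 1 = spam, 0 = not spam

# Split the raw text first, so the vocabulary is learned from training data only.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

# The pipeline runs CountVectorizer, then MultinomialNB, as a single model.
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(train_texts, train_labels)

# New, unseen messages are vectorized with the training vocabulary automatically.
new_messages = ["Free money prize", "See you at dinner"]
predictions = spam_model.predict(new_messages)
print(predictions)
```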


**Example 3**

```python
# Category labels used as classification targets
class Category:
    BOOKS = "BOOKS"
    CLOTHING = "CLOTHING"

# Training sentences and the category of each one
train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]
train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]
```

```python
from sklearn.feature_extraction.text import CountVectorizer

# binary=True records only whether a word is present (1) or absent (0)
vectorizer = CountVectorizer(binary=True)
train_x_vectors = vectorizer.fit_transform(train_x)

print(vectorizer.get_feature_names_out())
print(train_x_vectors.toarray())
```

```
['book' 'fit' 'great' 'is' 'love' 'shoes' 'the' 'this']
[[1 0 0 0 1 0 1 0]
 [1 0 1 1 0 0 0 1]
 [0 1 1 1 0 0 1 0]
 [0 0 0 0 1 1 1 0]]
```
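
To see what `binary=True` changes, it helps to vectorize a sentence that repeats a word. The one-sentence document below is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great great great book"]

# Default behaviour: store how many times each word occurs.
counts = CountVectorizer().fit_transform(docs).toarray()

# binary=True: store only 1 (present) or 0 (absent).
flags = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Vocabulary order is alphabetical: ['book', 'great']
print(counts)  # [[1 3]]
print(flags)   # [[1 1]]
```

Binary vectors can be a reasonable choice for short texts, where the presence of a word tends to matter more than how often it repeats.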


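```python
from sklearn import svm

# Train a linear Support Vector Machine on the BoW vectors
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)
```

```python
# Vectorize the new sentence with the vocabulary learned during training
test_x = vectorizer.transform(['i like the book'])
clf_svm.predict(test_x)
```

```
array(['BOOKS'], dtype='<U8')
```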
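
The test sentence contains "like", which never appeared in the training sentences. `transform` only uses the vocabulary learned during `fit`, so unseen words are silently ignored. The sketch below refits the same vectorizer to show what the model actually receives.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]

vectorizer = CountVectorizer(binary=True)
vectorizer.fit(train_x)

# "like" is not in the learned vocabulary, so it contributes nothing.
print("like" in vectorizer.vocabulary_)  # False

test_vector = vectorizer.transform(["i like the book"]).toarray()
print(test_vector)  # [[1 0 0 0 0 0 1 0]]
```

Only "book" and "the" are set, so the prediction rests on those two words alone.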