# 🧠 Natural Language Processing (NLP)

---

## 🌍 **1. Introduction to NLP**

### 🔹 What is NLP?

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that enables machines to understand, interpret, and generate human language.

It bridges **human communication and computers**, allowing systems to process both text and speech.

### 🔹 Why NLP?

* Automates language-based tasks (translation, summarization, chatbots, etc.)
* Enables voice assistants such as Alexa and Siri
* Helps analyze opinions in reviews and social media (sentiment analysis)

### 🔹 Real-world Applications:

| Application         | Description                            | Example                              |
| ------------------- | -------------------------------------- | ------------------------------------ |
| Chatbots            | Conversational AI for customer support | ChatGPT, Replika                     |
| Sentiment Analysis  | Determine emotion behind text          | Classify tweets as Positive/Negative |
| Machine Translation | Translate text between languages       | Google Translate                     |
| Text Summarization  | Generate concise summaries             | Summarize news articles              |
| Spam Detection      | Classify emails as spam/non-spam       | Gmail spam filter                    |

---

## 🧹 **2. Text Preprocessing**

Before feeding text into models, we clean and prepare it.

### 🔹 Common Steps:

| Step                 | Purpose                               | Example                             |
| -------------------- | ------------------------------------- | ----------------------------------- |
| Lowercasing          | Normalize case                        | “Hello” → “hello”                   |
| Tokenization         | Split text into words/tokens          | “I love NLP” → [‘I’, ‘love’, ‘NLP’] |
| Removing Punctuation | Clean unwanted symbols                | “Hello!!!” → “Hello”                |
| Stopword Removal     | Remove common words                   | “is, the, and…”                     |
| Lemmatization        | Reduce to dictionary form (lemma)     | “running” → “run”                   |
| Stemming             | Reduce to word stem                   | “studies” → “studi”                 |
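
The first few steps in the table can be sketched with the standard library alone. This is a minimal illustration rather than a substitute for NLTK or spaCy: the `STOPWORDS` set is a tiny hand-picked assumption, and the regular expression is a simplified tokenizer that keeps only letters and digits.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = {"is", "an", "of", "the", "and", "a"}


def preprocess(text):
    """Lowercase, tokenize, drop punctuation, and remove stopwords."""
    text = text.lower()                      # lowercasing
    tokens = re.findall(r"[a-z0-9]+", text)  # tokenization; punctuation is dropped
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal


print(preprocess("NLP is an amazing field of Artificial Intelligence!"))
# ['nlp', 'amazing', 'field', 'artificial', 'intelligence']
```

Lemmatization and stemming are left out here because they need a dictionary or a rule set, which is what libraries such as NLTK provide.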

### 🔹 Example (using NLTK)


In [7]:
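import os

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Also search a local "nltk_data" folder for downloaded resources
nltk.data.path.append(os.path.abspath("nltk_data"))

# Download the tokenizer model, the stopword list, and the WordNet dictionary
# (recent NLTK releases may additionally require 'punkt_tab' for word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')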




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:


text = "NLP is an amazing field of Artificial Intelligence!"

# Lowercase
text = text.lower()

# Tokenize
tokens = word_tokenize(text)

# Remove punctuation (keep only alphanumeric tokens)
tokens = [word for word in tokens if word.isalnum()]

# Remove stopwords (build the set once for fast lookups)
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(word) for word in tokens]

print(tokens)


['nlp', 'amazing', 'field', 'artificial', 'intelligence']


## 🔠 **3. Text Vectorization and Classification**

Machine learning models understand **numbers**, not text — so we must convert text into numeric form.

### 🔹 1. Bag of Words (BoW)

Represents each text as a vector of word counts, ignoring word order.

| Text           | love | movie | bad |
| -------------- | ---- | ----- | --- |
| “I love movie” | 1    | 1     | 0   |
| “bad movie”    | 0    | 1     | 1   |
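
The table above can be reproduced in a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization and a fixed, hand-chosen vocabulary; the helper name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(texts, vocabulary):
    """Return one row of word counts per text, in vocabulary order."""
    rows = []
    for text in texts:
        counts = Counter(text.lower().split())  # missing words count as 0
        rows.append([counts[word] for word in vocabulary])
    return rows


vocabulary = ["love", "movie", "bad"]
rows = bag_of_words(["I love movie", "bad movie"], vocabulary)
print(rows)  # [[1, 1, 0], [0, 1, 1]]
```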


In [9]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love NLP", "NLP loves Python", "Python is great for NLP"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(X.toarray())

['for' 'great' 'is' 'love' 'loves' 'nlp' 'python']
[[0 0 0 1 0 1 0]
 [0 0 0 0 1 1 1]
 [1 1 1 0 0 1 1]]


### 🔹 2. TF-IDF (Term Frequency - Inverse Document Frequency)

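Weights each word by how often it appears in a document (term frequency), scaled down for words that appear in many documents (inverse document frequency), so distinctive words score higher than common ones.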

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["I love NLP", "NLP loves Python", "Python is great for NLP"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(X.toarray())

['for' 'great' 'is' 'love' 'loves' 'nlp' 'python']
[[0.         0.         0.         0.861037   0.         0.50854232
  0.        ]
 [0.         0.         0.         0.         0.72033345 0.42544054
  0.54783215]
 [0.50461134 0.50461134 0.50461134 0.         0.         0.29803159
  0.38376993]]
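
To see where these numbers come from, the weights can be recomputed by hand. The sketch below assumes scikit-learn's default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and the default tokenization that drops one-character tokens such as "I". The helper names are hypothetical.

```python
import math
import re

sentences = ["I love NLP", "NLP loves Python", "Python is great for NLP"]

# Lowercase and keep tokens of two or more characters (so "I" is dropped)
docs = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)


def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(word in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    """Raw term count times idf, then scaled to unit L2 length."""
    weights = [doc.count(word) * idf(word) for word in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


matrix = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in matrix:
    print([round(v, 4) for v in row])
```

"nlp" occurs in every sentence, so its idf is the minimum value of 1, which is why it receives the lowest weight in each row.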


### 🔹 3. Text Classification Example

Classify movie reviews as Positive or Negative using Logistic Regression.

In [11]:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy dataset: 1 = positive review, 0 = negative review
X = ["I love this movie", "I hate this film", "This is great", "This is terrible"]
y = [1, 0, 1, 0]

# Convert each review into a vector of word counts
vectorizer = CountVectorizer()
X_vec = vectorizer.fit_transform(X)

# Hold out 25% of the data (one review) for testing
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.25)
model = LogisticRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))

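Accuracy: 0.0

With only four reviews, the test set is a single example, and the three remaining training reviews always include two of the opposite class. The score can therefore only be 0.0 or 1.0 and says little about the method. The split is also unseeded, so the result may differ between runs; a meaningful evaluation needs far more data.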
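
For intuition about what logistic regression does with these word counts, here is a from-scratch sketch on the same four reviews. It is not scikit-learn's solver: it uses plain batch gradient descent without regularization, whitespace tokenization, and hypothetical hyperparameters (`lr = 0.5`, 200 steps). It also predicts on the training data only, to show the mechanics rather than to evaluate the model.

```python
import math

reviews = ["I love this movie", "I hate this film", "This is great", "This is terrible"]
labels = [1, 0, 1, 0]

# Bag-of-words features over a sorted vocabulary
docs = [r.lower().split() for r in reviews]
vocab = sorted({w for d in docs for w in d})
X = [[d.count(w) for w in vocab] for d in docs]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Batch gradient descent on the log loss
weights, bias, lr = [0.0] * len(vocab), 0.0, 0.5
for _ in range(200):
    errors = [predict_proba(x, weights, bias) - y for x, y in zip(X, labels)]
    for j in range(len(vocab)):
        weights[j] -= lr * sum(e * x[j] for e, x in zip(errors, X))
    bias -= lr * sum(errors)

train_pred = [int(predict_proba(x, weights, bias) >= 0.5) for x in X]
print(train_pred)  # [1, 0, 1, 0]
```

Words that appear in only one class ("love", "hate", "great", "terrible") end up with large positive or negative weights, while words shared by both classes ("this", "is") stay near zero.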


## 🧩 **4. Embedding Layer in NLP**

### 🔹 What is an Embedding?

An **embedding** represents words as **dense vectors** in a continuous vector space, capturing **semantic meaning**.

Example:

* “king” and “queen” → similar vectors
* “apple” and “banana” → close in vector space
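
"Similar" and "close" are usually measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors invented for illustration; real embeddings are learned from data and have many more dimensions.

```python
import math

# Hypothetical toy embeddings, not taken from any trained model
embeddings = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.90, 0.15],
    "apple":  [0.10, 0.20, 0.90],
    "banana": [0.15, 0.10, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity(embeddings["king"], embeddings["queen"]))    # close to 1
print(cosine_similarity(embeddings["apple"], embeddings["banana"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["apple"]))    # much lower
```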

### 🔹 Why Use It?

* Captures **semantic relationships** between words
* Reduces **dimensionality** compared with one-hot vectors
* Often improves **model accuracy**

### 🔹 Example (Using Keras Embedding Layer)

In [13]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense

model = Sequential([
    Input(shape=(5,)),                      # sequences of 5 token ids
    Embedding(input_dim=50, output_dim=8),  # 50-word vocabulary, 8-dimensional vectors
    Flatten(),
    Dense(1, activation='sigmoid')
])

model.summary()


* `input_dim` = vocabulary size
* `output_dim` = vector size for each word
* `Input(shape=(5,))` = length of each input sequence (this replaces the `input_length` argument, which recent Keras versions deprecate)
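
Under the hood, an embedding layer is a lookup table. The sketch below mimics a layer with vocabulary size 50, vector size 8, and sequence length 5 using plain Python lists; the random weights and the `token_ids` are hypothetical stand-ins for values a real model would learn or receive.

```python
import random

random.seed(0)  # reproducible toy weights

vocab_size, embed_dim, seq_len = 50, 8, 5

# The embedding matrix: one row of embed_dim numbers per token id
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

token_ids = [4, 17, 0, 23, 9]  # hypothetical encoded sentence of length 5

# Embedding: replace each id with its row, giving shape (5, 8)
embedded = [embedding_matrix[i] for i in token_ids]

# Flatten: concatenate the rows into 5 * 8 = 40 values for the Dense layer
flat = [value for row in embedded for value in row]

print(len(embedded), len(embedded[0]), len(flat))  # 5 8 40
print(vocab_size * embed_dim)                      # 400 embedding weights
```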

## 😊 **5. Sentiment Analysis**

Sentiment analysis detects emotional tone (positive/negative/neutral) in text.
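
The simplest way to get all three labels is to count sentiment words. The sketch below is a lexicon-based baseline, a different and much cruder technique than a trained classifier; both word lists are hypothetical and invented for this example.

```python
# Hypothetical mini-lexicons, invented for this example
POSITIVE = {"love", "great", "fantastic", "good"}
NEGATIVE = {"hate", "worst", "terrible", "bad"}


def lexicon_sentiment(text):
    """Count positive and negative words and compare the totals."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("I love this product"))     # positive
print(lexicon_sentiment("This is the worst ever"))  # negative
print(lexicon_sentiment("The package arrived"))     # neutral
```

A word counter like this ignores negation ("Not good at all" would score as positive), which is one reason learned models are preferred.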

### 🔹 Example using Logistic Regression

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["I love this product", "This is the worst ever", "Absolutely fantastic", "Not good at all"]
labels = [1, 0, 1, 0]

model = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

model.fit(texts, labels)

test_text = ["I really love this!", "Terrible experience"]
print(model.predict(test_text))


✅ Output:

```
[1 0]
```

(1 → Positive, 0 → Negative)



## 🔁 **6. Sequence Models (RNN, LSTM)**

Language is **sequential**: word order matters.
Sequence models (RNNs, LSTMs, GRUs) process data where the **previous context** affects the current output.
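
The core idea can be shown with a one-unit recurrent cell. This is a minimal sketch with scalar inputs and hypothetical fixed weights (`w_in`, `w_rec`); a real RNN learns weight matrices, and LSTMs and GRUs add gates on top of this recurrence.

```python
import math


def simple_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Scalar recurrence h_t = tanh(w_in * x_t + w_rec * h_{t-1}), with h_0 = 0."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)  # new state depends on the previous state
    return h


forward = simple_rnn([1.0, 2.0, 3.0])
backward = simple_rnn([3.0, 2.0, 1.0])
print(forward, backward)  # same inputs, different order, different final state
```

A bag-of-words model would treat both sequences identically, whereas the recurrent state distinguishes them.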

### 🔹 Example: Sentiment Classification with LSTM

In [16]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
import numpy as np

sentences = ["I love NLP", "I hate this movie", "NLP is great", "This is bad"]
labels = [1, 0, 1, 0]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
X = tokenizer.texts_to_sequences(sentences)
X = pad_sequences(X, padding='post')

# Convert labels to a NumPy array
labels = np.array(labels)


model = Sequential([
    # input_dim must exceed the largest token index; 50 is ample for this tiny vocabulary
    Embedding(input_dim=50, output_dim=8),
    LSTM(16),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, labels, epochs=10)

Epoch 1/10




[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step - accuracy: 0.5000 - loss: 0.6933
Epoch 2/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step - accuracy: 0.5000 - loss: 0.6927
Epoch 3/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step - accuracy: 0.7500 - loss: 0.6922
Epoch 4/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 35ms/step - accuracy: 1.0000 - loss: 0.6918
Epoch 5/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step - accuracy: 1.0000 - loss: 0.6913
Epoch 6/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 35ms/step - accuracy: 1.0000 - loss: 0.6908
Epoch 7/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 36ms/step - accuracy: 1.0000 - loss: 0.6903
Epoch 8/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step - accuracy: 1.0000 - loss: 0.6898
Epoch 9/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms



## 🧩 **8. Introduction to Large Language Models (LLMs)**

### 🔹 What are LLMs?

Large Language Models (LLMs) such as **GPT, BERT, and LLaMA** are deep learning models trained on **massive text datasets** to understand and generate human-like language.

### 🔹 Key Features:

* Trained on billions of words
* Understand context, tone, and semantics
* Can perform many NLP tasks with little or no task-specific training
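
GPT-style models generate text by repeatedly predicting the next token. The sketch below swaps in a far simpler technique, a bigram counter over a hypothetical ten-word corpus, only to make the idea of "learn from text, then predict what comes next" concrete; it is not how a transformer works internally.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; real LLMs train on billions of words
corpus = "the cat sat on the mat and the cat slept".split()

# Count which word follows which
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1


def next_word(word):
    """Return the most frequent continuation seen in the corpus, or None."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]


print(next_word("the"))  # cat
```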

### 🔹 Example using Hugging Face

In [17]:

from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("I love learning NLP!")
print(result[0]['translation_text'])


J'adore apprendre le NLP !


In [18]:
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("NLP is a fascinating field because", max_length=30))

Note: in this run the pipeline's default `max_new_tokens` (256) took precedence over `max_length=30`, so the generated text below is much longer than 30 tokens. Passing `max_new_tokens=30` instead limits the length of the continuation.


[{'generated_text': 'NLP is a fascinating field because it captures not just the diversity of the animal kingdom, but also the diversity of its living community.\n\nSome of the animals, including elephants and giraffes, are also diverse, and some of the animals, including elephants and giraffes, are also diverse, and some of the animals, including elephants and giraffes, are also diverse, say some of the researchers.\n\nFor instance, the researchers say the animals that are most closely related to elephants, lions and elephants, are also the most closely related to people, but that elephants are most closely related to humans.\n\nThey say that this diversity is reflected in the animals\' social behaviors, but that other animals may also have different social behaviors.\n\nThe researcher also found that a significant proportion of the animals that are closest to humans, e.g., elephants and lions, are also close to other animals.\n\nIn addition, the researchers found that people and othe

---

## 🧠 **9. Assignment Section**

### 🔹 **Classwork**

1. Preprocess a given paragraph (tokenize, remove stopwords, lemmatize).
2. Create a TF-IDF matrix for 5 sample sentences.
3. Build a sentiment classifier using Logistic Regression.
4. Implement an LSTM model for text classification.
5. Translate 5 English sentences to French using Hugging Face.

---

### 🔹 **Practice Assignment**

| Task | Description                                                                   |
| ---- | ----------------------------------------------------------------------------- |
| 1    | Collect 100 tweets and perform sentiment analysis                             |
| 2    | Build a spam detection model using TF-IDF                                     |
| 3    | Build a chatbot using a trained LSTM or pre-trained model                     |
| 4    | Perform machine translation using Hugging Face                                |
| 5    | Fine-tune a pre-trained LLM (BERT or DistilBERT) for sentiment classification |