## Level 3: Intermediate NLP Models

Welcome to **Level 3**!  
Now we are building **real-world NLP models** that are smarter and more useful.

# 1. Named Entity Recognition (NER)

### ➔ Definition:  
NER locates spans of text that name real-world entities (**people, locations, organizations, dates, etc.**) and assigns each span a label.

### ➔ Why use NER?  
- Extract important information automatically.  
- Useful in **search engines**, **chatbots**, **news summarization**.

---


In [1]:
# !pip install spacy
# !python -m spacy download en_core_web_sm


In [2]:
## Code for NER:

import spacy

# Load English NLP model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Google plans to open a new office in Paris by 2025."

# Process text
doc = nlp(text)

# Print entities
for ent in doc.ents:
    print(ent.text, "->", ent.label_)

Google -> ORG
Paris -> GPE
2025 -> DATE


### ✅ Explanation:
- `en_core_web_sm` is a **small English language model**.
- It identifies entities like `ORG` (organization), `GPE` (geopolitical entity), and `DATE`.
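
Once entities are extracted, a common next step is to group them by label. The sketch below is a minimal illustration using only the standard library. `group_entities` is a hypothetical helper (not part of spaCy), and the hard-coded pairs mirror the NER output shown earlier. With spaCy loaded, the same pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

def group_entities(pairs):
    """Group (text, label) pairs into {label: [texts...]}, keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:  # skip repeated mentions of the same entity
            grouped[label].append(text)
    return dict(grouped)

# Pairs copied from the NER output for the sample sentence
pairs = [("Google", "ORG"), ("Paris", "GPE"), ("2025", "DATE")]
entities_by_label = group_entities(pairs)
print(entities_by_label)
```

Grouping by label makes it easy to answer questions such as "which organizations does this article mention?" without scanning the entity list again.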

---

# 2. Language Modeling (Predict Next Word)

### ➔ Definition:  
Language modeling estimates a probability for each candidate next word given the preceding words; the prediction is the **most likely next word**.

### ➔ Why use Language Modeling?  
- Used in **autocomplete**, **text generation**, **chatbots**.
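
To make the idea concrete before reaching for a neural network, here is a minimal count-based sketch of next-word prediction. It is a bigram model (it looks only at the single previous word), which is a simpler technique than the LSTM used in this level. The names `train_bigram_model` and `predict_next` are hypothetical, and the three sentences are illustrative sample data.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for every word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            following[current_word][next_word] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = [
    "I love deep learning",
    "I love natural language processing",
    "Deep learning is amazing",
]
bigram_model = train_bigram_model(corpus)
print(predict_next(bigram_model, "I"))     # "love" follows "i" in both sentences that contain it
print(predict_next(bigram_model, "deep"))  # "learning" follows "deep" both times
```

A neural language model plays the same role as the count table here, but it can use a longer context and generalize to word combinations it has not seen verbatim.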

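Version match rule of thumb (the `keras` package version that pairs with each TensorFlow release):

| TensorFlow | Keras |
|---|---|
| 2.11 | 2.11 |
| 2.12 | 2.12 |
| 2.13 | 2.13 |
| 2.15 | 2.15 |
| 2.16 and later | 3.x |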

In [None]:

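# Check the installed TensorFlow version before choosing a Keras version.
# To pin an older pair, e.g. TensorFlow 2.12 with Keras 2.12:
# !pip install tensorflow==2.12 keras==2.12
import tensorflow as tf

tf.__version__

'2.19.0'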
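
The version check can be turned into a small guard. The sketch below assumes that Keras 3 is the default from TensorFlow 2.16 onward and that earlier 2.x releases pair with Keras 2.x. `expected_keras_major` is a hypothetical helper, not a TensorFlow or Keras function. In a notebook it could be called as `expected_keras_major(tf.__version__)`.

```python
def expected_keras_major(tf_version):
    """Return the Keras major version that a TensorFlow 2.x release uses by default."""
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    if major != 2:
        raise ValueError(f"this sketch only covers TensorFlow 2.x, got {tf_version}")
    # Assumption: TensorFlow 2.16 and later default to Keras 3; earlier 2.x use Keras 2.x
    return 3 if minor >= 16 else 2

print(expected_keras_major("2.12.0"))
print(expected_keras_major("2.19.0"))
```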

In [2]:
## Code for Language Modeling:

# Install keras if not installed
# pip install keras tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import numpy as np

# Sample data
corpus = ["I love deep learning", "I love natural language processing", "Deep learning is amazing"]

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# Create input sequences
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences on the left so every sequence has the same length
max_seq_len = max(len(x) for x in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_len, padding='pre')

# Split input and labels
X = input_sequences[:,:-1]
y = input_sequences[:,-1]

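# One-hot encode the labels (one class per vocabulary index)
from tensorflow.keras.utils import to_categorical
y = to_categorical(y, num_classes=total_words)

# Build model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

model = Sequential()
# An explicit Input layer replaces Embedding's `input_length` argument, which Keras 3 deprecates
model.add(Input(shape=(max_seq_len - 1,)))
model.add(Embedding(total_words, 10))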
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

# Train model
model.fit(X, y, epochs=200, verbose=2)



Epoch 1/200
1/1 - 7s - 7s/step - loss: 2.3035
Epoch 2/200
1/1 - 0s - 123ms/step - loss: 2.3005
Epoch 3/200
1/1 - 0s - 135ms/step - loss: 2.2975
Epoch 4/200
1/1 - 0s - 106ms/step - loss: 2.2946
Epoch 5/200
1/1 - 0s - 122ms/step - loss: 2.2915
Epoch 6/200
1/1 - 0s - 123ms/step - loss: 2.2884
Epoch 7/200
1/1 - 0s - 107ms/step - loss: 2.2851
Epoch 8/200
1/1 - 0s - 115ms/step - loss: 2.2817
Epoch 9/200
1/1 - 0s - 101ms/step - loss: 2.2780
Epoch 10/200
1/1 - 0s - 196ms/step - loss: 2.2742
Epoch 11/200
1/1 - 0s - 225ms/step - loss: 2.2701
Epoch 12/200
1/1 - 0s - 223ms/step - loss: 2.2656
Epoch 13/200
1/1 - 0s - 134ms/step - loss: 2.2609
Epoch 14/200
1/1 - 0s - 381ms/step - loss: 2.2557
Epoch 15/200
1/1 - 0s - 295ms/step - loss: 2.2502
Epoch 16/200
1/1 - 0s - 127ms/step - loss: 2.2441
Epoch 17/200
1/1 - 0s - 211ms/step - loss: 2.2376
Epoch 18/200
1/1 - 0s - 186ms/step - loss: 2.2305
Epoch 19/200
1/1 - 0s - 164ms/step - loss: 2.2228
Epoch 20/200
1/1 - 0s - 134ms/step - loss: 2.2145
Epoch 21/200

<keras.src.callbacks.history.History at 0x1698a8f5040>

### ✅ Explanation:
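- We **tokenize sentences** and turn every prefix of two or more tokens into a training sequence: the last token is the label and the tokens before it are the input.
- The LSTM reads the input tokens and outputs a probability for each vocabulary word being the **next word**.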
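
The sequence-building step is easier to see without Keras. The following sketch reproduces the prefix and pre-padding logic in plain Python. `build_vocab` and `make_training_pairs` are hypothetical helpers, and the word ids are assigned in order of first appearance, which is an assumption made for readability (the Keras `Tokenizer` orders ids by word frequency instead).

```python
def build_vocab(sentences):
    """Assign each lowercase word an integer id starting at 1 (0 is reserved for padding)."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            word_index.setdefault(word, len(word_index) + 1)
    return word_index

def make_training_pairs(sentences, word_index):
    """Turn each sentence into (left-padded context, next-word id) pairs."""
    prefixes = []
    for sentence in sentences:
        ids = [word_index[w] for w in sentence.lower().split()]
        # Every prefix of length >= 2: the last id is the label, the rest is the context
        prefixes.extend(ids[: i + 1] for i in range(1, len(ids)))
    max_len = max(len(p) for p in prefixes)
    pairs = []
    for prefix in prefixes:
        padded = [0] * (max_len - len(prefix)) + prefix  # 'pre' padding with zeros
        pairs.append((padded[:-1], padded[-1]))
    return pairs

corpus = ["I love deep learning", "Deep learning is amazing"]
word_index = build_vocab(corpus)
training_pairs = make_training_pairs(corpus, word_index)
for context, label in training_pairs:
    print(context, "->", label)
```

Each four-word sentence yields three training pairs, so even a tiny corpus produces several examples. Padding on the left keeps the most recent words next to the label position, which is where the LSTM finishes reading.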

# 3. Text Summarization (Extractive)

### ➔ Definition:  
Extract **important sentences** from a paragraph to create a summary.

### ➔ Why use Summarization?  
- Save time reading long articles.  
- Get **quick key points** automatically.


In [3]:
## Code for Extractive Summarization:

# Install scikit-learn if not installed
# pip install scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Sample paragraph
text = """Machine learning is a branch of artificial intelligence. 
It enables machines to learn from experience without being explicitly programmed. 
Applications of machine learning are everywhere — from healthcare to finance to autonomous vehicles."""

# Split into sentences (naive split on periods) and drop empty fragments
sentences = [s.strip() for s in text.split('.') if s.strip()]

# Vectorize sentences
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Cluster sentences
n_clusters = 2
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans.fit(X)

# Take the first sentence assigned to each cluster as its representative
summary = []
for i in range(n_clusters):
    idx = (kmeans.labels_ == i).nonzero()[0][0]
    summary.append(sentences[idx])

# Print summary
print("Summary:")
for sent in summary:
    print(sent.strip())

Summary:
It enables machines to learn from experience without being explicitly programmed
Machine learning is a branch of artificial intelligence


### ✅ Explanation:
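- We **vectorize** sentences using **TF-IDF**, so each sentence becomes a vector of term weights.
- We **cluster** the vectors with K-Means and **pick the first sentence** from each cluster, so the summary covers different topics.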
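
To show what the TF-IDF step computes, here is a hand-rolled sketch in plain Python. It is intended to mirror the default behavior of `TfidfVectorizer` as I understand it (tokens of two or more word characters, smoothed IDF, L2 normalization), but treat it as an illustration and not a drop-in replacement. `tfidf_vectors` is a hypothetical helper and the second sentence is made-up sample data.

```python
import math
import re
from collections import Counter

def tfidf_vectors(sentences):
    """Return (vocabulary, vectors): one L2-normalized TF-IDF vector per sentence."""
    tokenized = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(sentences)
    # Document frequency: in how many sentences each term appears
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger values
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        vectors.append([v / norm for v in raw])
    return vocabulary, vectors

sentences = [
    "Machine learning is a branch of artificial intelligence",
    "Machine learning powers many applications",
]
vocabulary, vectors = tfidf_vectors(sentences)
weights = dict(zip(vocabulary, vectors[0]))
# "branch" appears in one sentence and "machine" in both, so "branch" gets the higher weight
print(round(weights["branch"], 3), round(weights["machine"], 3))
```

Because terms shared by every sentence are down-weighted, sentences end up close together only when they share distinctive vocabulary, which is what lets K-Means separate them by topic.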

---

# 📚 Mini Assignments

➔ 1. Apply NER on a news article.  
➔ 2. Train a small next-word prediction model on your own quotes.  
➔ 3. Summarize a Wikipedia article using extractive summarization.

---

# ✅ Done!
