### <div align="center">Introduction to NLP</div>

#### 1.2: Three Techniques of doing NLP
- NLP Techniques
  1. Rules & Heuristics
  2. Machine Learning
  3. Deep Learning

#### 1.3: NLP Tasks
- Converting text into numbers is called embedding; it is needed because computers work only with numbers.
- The Naive Bayes classifier works well on text problems.
- Below are the major NLP tasks:
  1. Text Classification
  2. Language Translation
  3. Text Similarity
  4. Language Modeling
  5. Information Extraction
  6. Text Summarization
  7. Information Retrieval
  8. Topic Modeling
  9. Chat Bots
  10. Voice Assistants
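
As a minimal sketch of turning text into numbers, the snippet below builds bag-of-words count vectors with only the Python standard library. This is the simplest kind of vectorization rather than a learned embedding, and the helper names (`build_vocabulary`, `to_count_vector`) and sample texts are made up for illustration.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map every distinct lowercase word to a column index (sorted for a stable order)."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(text, vocabulary):
    """Return one count per vocabulary word; words outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical sample texts
texts = ["free prize inside", "meeting at noon", "free free meeting"]
vocabulary = build_vocabulary(texts)
vectors = [to_count_vector(text, vocabulary) for text in texts]

print(vocabulary)  # {'at': 0, 'free': 1, 'inside': 2, 'meeting': 3, 'noon': 4, 'prize': 5}
print(vectors)     # [[0, 1, 1, 0, 0, 1], [1, 0, 0, 1, 1, 0], [0, 2, 0, 1, 0, 0]]
```

Each text becomes a fixed-length list of numbers, which is the form a classifier such as Naive Bayes can consume.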

#### 1.4: NLP Pipeline
- The pipeline (steps) of an NLP application differs depending on the type of task (classification, summarization, chatbot, etc.).
- Steps in a classification task (as well as many other NLP tasks) are:
  1. Data Acquisition
  2. Preprocessing
  3. Feature Extraction
  4. Parsing & Syntax Analysis
  5. Model Building
  6. Post-processing and Evaluation (if the results are not good, repeat the earlier steps)
  7. Deployment
  8. Monitor & Update
- Building a good-quality NLP application requires a thoughtful and iterative approach.
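
The classification steps can be sketched end to end on a toy problem. The example below is an illustration under stated assumptions: the dataset is hand-made, the model is a small multinomial Naive Bayes with add-one smoothing written from scratch, and parsing, deployment, and monitoring are left out. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Data acquisition: a tiny hand-made labelled dataset (hypothetical examples).
train = [
    ("win a free prize now", "spam"),
    ("free cash offer", "spam"),
    ("claim your free prize", "spam"),
    ("meeting moved to noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
    ("notes from the meeting", "ham"),
]
test = [("free prize offer", "spam"), ("meeting at noon", "ham")]

def preprocess(text):
    """Preprocessing: lowercase and split on whitespace."""
    return text.lower().split()

def fit(examples):
    """Feature extraction and model building: per-class word counts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(preprocess(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary

def predict(text, model):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in preprocess(text):
            score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Evaluation: accuracy on held-out examples; if it is poor, revisit earlier steps.
model = fit(train)
predictions = [predict(text, model) for text, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(predictions, accuracy)  # ['spam', 'ham'] 1.0
```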

#### 1.5: Tools Overview
- General guidelines for choosing a tool
  - Hugging Face, LLM APIs, Langchain – used to build state-of-the-art, modern NLP and Gen AI applications.
  - Spacy, Gensim – used to build classical NLP applications with low latency pipelines.
  - NLTK – mainly used in research, teaching, and experimentation.
  - The above are loose guidelines. Many industrial NLP applications combine several of these tools in a single application.
- Spacy
  - Provides the most efficient NLP algorithm for a given task. Hence, if you care about the end result, go with Spacy.
  - It is object oriented.
  - It is user friendly.
  - Perfect for app developers.
  - It is a newer library with a very active user community.
- NLTK
  - Provides access to many algorithms. If you care about a specific algorithm and customization, go with NLTK.
  - It is mainly a string processing library.
  - It is less user friendly compared to Spacy.
  - Perfect for researchers.
  - It is an older library and not as active as Spacy.
- NLP is a huge multidisciplinary field where a variety of tools and libraries are used to build NLP applications.

#### 1.6: MCQ
- What is Hugging Face best known for?
  - Hosting and providing pre-trained transformer models
  - Hugging Face is best known for its Transformers library and Model Hub, which host and provide access to thousands of pre-trained models for NLP and other AI tasks.
- What is the purpose of sentiment analysis?
  - Classifying the emotional tone of a text
- What is the benefit of LangChain?
  - It makes building GenAI applications easier and faster
- What is the primary goal of Named Entity Recognition?
  - Identifying proper nouns and classifying them into categories like names, locations, and organizations

### <div align="center">Text Pre-Processing</div>

#### 2.1: Tokenization Techniques
- Pre-Processing stage
  - Sentence Tokenization -> Word Tokenization -> Stop word removal -> Stemming, Lemmatization
- Tokenization is the foundational step in NLP, breaking down text into smaller, processable units.
- Word tokenization splits text into words, making it simple but prone to out-of-vocabulary issues.
- Character tokenization breaks text into individual characters, providing flexibility for unknown words.
- Subword tokenization efficiently handles rare words by splitting them into frequently used sub-parts.
- Subword tokenization is a popular approach used by models such as BERT, GPT, etc., due to the advantages it offers.
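
To make the three granularities concrete, here is a small sketch in plain Python. The subword function uses a greedy longest-match over a hand-written toy vocabulary with a `##` continuation marker borrowed from WordPiece-style notation. It is a simplification for illustration, not the tokenizer that BERT or GPT actually ship, and all names and vocabulary entries are hypothetical.

```python
def word_tokenize(text):
    """Split on whitespace; a deliberately naive word tokenizer."""
    return text.split()

def char_tokenize(text):
    """Every non-space character becomes a token."""
    return [ch for ch in text if not ch.isspace()]

def subword_tokenize(word, vocabulary):
    """Greedy longest-match split of one word into known sub-parts.

    Pieces after the first carry a '##' prefix. Returns ['[UNK]'] when the
    word cannot be covered by the vocabulary.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocabulary:
                pieces.append(candidate)
                break
            end -= 1
        else:
            # No prefix of the remaining text is in the vocabulary.
            return ["[UNK]"]
        start = end
    return pieces

# Hypothetical toy vocabulary of frequent sub-parts
toy_vocabulary = {"token", "##ization", "##ize", "##s", "un", "##believ", "##able"}

print(word_tokenize("tokenization is unbelievable"))     # ['tokenization', 'is', 'unbelievable']
print(char_tokenize("NLP"))                              # ['N', 'L', 'P']
print(subword_tokenize("tokenization", toy_vocabulary))  # ['token', '##ization']
print(subword_tokenize("unbelievable", toy_vocabulary))  # ['un', '##believ', '##able']
```

Neither "tokenization" nor "unbelievable" is in the vocabulary as a whole word, yet both are represented from known pieces, which is how subword schemes sidestep the out-of-vocabulary problem.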

#### 2.2: Tokenization in Spacy

- spacy.blank("en") is used to create a blank language object. Using this we can tokenize a sentence.
- Tokens in Spacy have attributes such as like_num, is_currency, is_alpha, etc., that help identify the type of token.
- Spacy provides support for defining custom tokenizers.
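
One lightweight way to customize tokenization is a special-case rule on the existing tokenizer. The sketch below assumes spaCy v3 is installed and uses `Tokenizer.add_special_case`; the slang word "gimme" and its split are only an illustrative choice.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Default behaviour: "gimme" stays a single token.
before = [token.text for token in nlp("gimme that")]

# Special-case rule: always split "gimme" into "gim" + "me".
# The pieces must concatenate back to the original string.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [token.text for token in nlp("gimme that")]

print(before)  # ['gimme', 'that']
print(after)   # ['gim', 'me', 'that']
```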

Create a blank language object and tokenize the words in a sentence

In [3]:
import spacy
nlp = spacy.blank("en")

doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate.")

for token in doc:
    print(token)

Dr.
Strange
loves
pav
bhaji
of
mumbai
as
it
costs
only
2
$
per
plate
.


Creating a blank language object gives a tokenizer and an empty pipeline. We will look more into language pipelines in the next tutorial.

In [4]:
# Using index to grab tokens
doc[0]
token = doc[1]
token.text

'Strange'

In [5]:
type(nlp)

spacy.lang.en.English

In [6]:
type(doc)

spacy.tokens.doc.Doc

In [7]:
type(token)

spacy.tokens.token.Token

In [8]:
nlp.pipe_names

[]

##### Token attributes

In [9]:
doc = nlp("Tony gave two $ to Peter.")

In [10]:
token0 = doc[0]
token0

Tony

In [11]:
token0.is_alpha

True

In [12]:
token0.like_num

False

In [13]:
token2 = doc[2]
token2

two

In [14]:
token2.like_num

True

In [15]:
for token in doc:
    print(token, "==>", "index: ", token.i, "is_alpha:", token.is_alpha, 
          "is_punct:", token.is_punct, 
          "like_num:", token.like_num,
          "is_currency:", token.is_currency,
         )

Tony ==> index:  0 is_alpha: True is_punct: False like_num: False is_currency: False
gave ==> index:  1 is_alpha: True is_punct: False like_num: False is_currency: False
two ==> index:  2 is_alpha: True is_punct: False like_num: True is_currency: False
$ ==> index:  3 is_alpha: False is_punct: False like_num: False is_currency: True
to ==> index:  4 is_alpha: True is_punct: False like_num: False is_currency: False
Peter ==> index:  5 is_alpha: True is_punct: False like_num: False is_currency: False
. ==> index:  6 is_alpha: False is_punct: True like_num: False is_currency: False


#### 2.3: Language Processing Pipeline in Spacy

- SpaCy provides many pre-trained pipelines, e.g., spacy.load("en_core_web_sm") loads a small pipeline for the English language.
- This pipeline has components such as tok2vec, tagger, parser, ner, lemmatizer, etc.
- token.pos_ gives the part-of-speech tag and token.lemma_ returns the lemmatized base word.
- doc.ents returns the named entities found by the Named Entity Recognition (NER) component.

##### Blank nlp pipeline

In [16]:
nlp = spacy.blank("en")

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token)

Captain
america
ate
100
$
of
samosa
.
Then
he
said
I
can
do
this
all
day
.


<img height=300 width=400 src="../../screenshots/spacy_blank_pipeline.jpg" />

In [18]:
nlp.pipe_names

[]

nlp.pipe_names is an empty list, indicating that there are no components in the pipeline. A pipeline always starts with a tokenizer, and the components run after it.
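
A blank pipeline can be extended one component at a time. As a small sketch (assuming spaCy v3), the snippet below adds the rule-based `sentencizer` component, which needs no trained model, and then uses it for sentence tokenization.

```python
import spacy

nlp = spacy.blank("en")
names_before = list(nlp.pipe_names)  # empty: only the tokenizer runs

# Add a rule-based sentence splitter after the tokenizer.
nlp.add_pipe("sentencizer")
names_after = list(nlp.pipe_names)

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")
sentences = [sent.text for sent in doc.sents]

print(names_before)  # []
print(names_after)   # ['sentencizer']
for sentence in sentences:
    print(sentence)
```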

A more general diagram of an NLP pipeline may look something like the one below

<img height=300 width=400 src="../../screenshots/spacy_loaded_pipeline.jpg" />

In [None]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

In [None]:
nlp.pipeline

sm in en_core_web_sm means small. There are other models available as well such as medium, large etc. Check this: https://spacy.io/usage/models#quickstart

In [None]:
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token, " | ", spacy.explain(token.pos_), " | ", token.lemma_)

##### Named Entity Recognition

In [None]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
from spacy import displacy

displacy.render(doc, style="ent")

#### 2.4: Stemming & Lemmatization

- Stemming: uses fixed rules, such as removing "ing", "able", etc., to derive a base word.
- Lemmatization: uses knowledge of the language (a.k.a. linguistic knowledge) to derive a base word (a.k.a. lemma).
- Stemming reduces words to a root form by stripping suffixes. This works for "running" → "run" but often leads to incomplete or less meaningful roots (e.g., the Porter stemmer maps "ability" → "abil").
- Lemmatization reduces words to their base or dictionary form (lemma) by considering the word's meaning and context (e.g., "running" → "run").
- Lemmatization is more accurate than stemming but computationally heavier, making it preferable for tasks requiring linguistic precision.
- Stemming is faster and simpler, useful for quick text normalization when perfect accuracy isn't critical.
- Both techniques improve text preprocessing by reducing word variations, aiding in tasks like search, classification, and sentiment analysis.
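
The contrast can be shown with a toy sketch in plain Python. The stemmer below strips a fixed list of suffixes, and the "lemmatizer" is only a dictionary lookup standing in for real linguistic knowledge. Both are hypothetical simplifications and not how NLTK or Spacy implement these steps.

```python
SUFFIXES = ("ing", "able", "ed", "s")  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary; a real lemmatizer also uses part of speech and context.
LEMMA_LOOKUP = {"ate": "eat", "eating": "eat", "eats": "eat", "better": "good"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_LOOKUP.get(word, word)

words = ["eating", "eats", "ate", "adjustable", "ability", "meeting", "better"]
for word in words:
    print(word, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The fixed rules handle "eating" and "adjustable" but cannot relate "ate" to "eat" or "better" to "good", and they turn the noun "meeting" into "meet"; those cases need language knowledge.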

In [None]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("Mando talked for 3 hours although talking isn't his thing")
doc = nlp("eating eats eat ate adjustable rafting ability meeting better")
for token in doc:
    print(token, " | ", token.lemma_)

In [None]:
nlp.pipe_names

In [None]:
ar = nlp.get_pipe('attribute_ruler')

ar.add([[{"TEXT":"Bro"}],[{"TEXT":"Brah"}]],{"LEMMA":"Brother"})

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

In [None]:
doc[6]

In [None]:
doc[6].lemma_

#### 2.5: Part of Speech (POS) Tagging

POS tags

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon flew to mars yesterday. He carried biryani masala with him")

for token in doc:
    print(token," | ", token.pos_, " | ", spacy.explain(token.pos_))

In [None]:
doc = nlp("Wow! Dr. Strange made 265 million $ on the very first day")

for token in doc:
    print(token," | ", token.pos_, " | ", spacy.explain(token.pos_))

Tags

In [None]:
doc = nlp("Wow! Dr. Strange made 265 million $ on the very first day")

for token in doc:
    print(token," | ", token.pos_, " | ", spacy.explain(token.pos_), " | ", token.tag_, " | ", spacy.explain(token.tag_))

In the sentences below, Spacy distinguishes the present and past tense of "quit"

In [None]:
doc = nlp("He quits the job")

print(doc[1].text, "|", doc[1].tag_, "|", spacy.explain(doc[1].tag_))

In [None]:
doc = nlp("He quit the job")

print(doc[1].text, "|", doc[1].tag_, "|", spacy.explain(doc[1].tag_))

Removing all SPACE, PUNCT and X tokens from the text

In [None]:
earnings_text="""Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:

·         Revenue was $51.7 billion and increased 20%
·         Operating income was $22.2 billion and increased 24%
·         Net income was $18.8 billion and increased 21%
·         Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft."""

doc = nlp(earnings_text)

filtered_tokens = []

for token in doc:
    if token.pos_ not in ["SPACE", "PUNCT", "X"]:
        filtered_tokens.append(token)

In [None]:
filtered_tokens[:10]

In [None]:
count = doc.count_by(spacy.attrs.POS)
count

In [None]:
doc.vocab[96].text

In [None]:
for k,v in count.items():
    print(doc.vocab[k].text, "|",v)

#### 2.6: Stop Words

- Stop words are common words (e.g., "the", "and", "is") that add little value to text analysis and are often removed to improve model efficiency.
- Removing stop words reduces dimensionality and processing time, allowing models to focus on meaningful terms.
- Stop word removal enhances performance in tasks like text classification, sentiment analysis, and information retrieval.
- Careful selection of stop words is essential, as removing critical words can sometimes alter the meaning of text.
- Popular NLP libraries like NLTK, SpaCy, and Scikit-learn offer built-in stop word lists that can be customized based on task requirements.
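
The caution about removing critical words is easy to demonstrate. The sketch below uses a tiny hand-picked stop word set (an assumption for illustration, not any library's list) and shows how dropping "not" flips the apparent sentiment of a review.

```python
# Hypothetical toy stop word list
TOY_STOP_WORDS = {"the", "is", "a", "and", "this", "not", "of"}

def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop every word in stop_words."""
    return " ".join(word for word in text.lower().split() if word not in stop_words)

review = "This movie is not good"

print(remove_stop_words(review, TOY_STOP_WORDS))            # movie good
print(remove_stop_words(review, TOY_STOP_WORDS - {"not"}))  # movie not good
```

For tasks such as sentiment analysis, customizing the list so that negations survive is usually the safer choice.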

In [None]:
import spacy

from spacy.lang.en.stop_words import STOP_WORDS

len(STOP_WORDS)

In [None]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("We just opened our wings, the flying part is coming soon")

for token in doc:
    if token.is_stop:
        print(token)

In [None]:
def preprocess(text):
    doc = nlp(text)
    
    no_stop_words = [token.text for token in doc if not token.is_stop]
    return " ".join(no_stop_words)

In [None]:
preprocess("Musk wants time to prepare for a trial over his")

In [None]:
preprocess("The other is not other but your divine brother")

In [22]:
import pandas as pd
df = pd.read_json("../../data/doj_press.json",lines=True)
df.shape

(13087, 6)

In [23]:
df.head(5)

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| 0 | | Convicted Bomb Plotter Sentenced to 30 Years | PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,... | 2014-10-01T00:00:00-04:00 | [] | [National Security Division (NSD)] |
| 1 | 12-919 | $1 Million in Restitution Payments Announced t... | WASHINGTON – North Carolina’s Waccamaw River... | 2012-07-25T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 2 | 11-1002 | $1 Million Settlement Reached for Natural Reso... | BOSTON– A $1-million settlement has been... | 2011-08-03T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 3 | 10-015 | 10 Las Vegas Men Indicted \r\nfor Falsifying V... | WASHINGTON—A federal grand jury in Las Vegas... | 2010-01-08T00:00:00-05:00 | [] | [Environment and Natural Resources Division] |
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |


In [24]:
# Filter out those rows that do not have any topics associated with the case
df = df[df["topics"].str.len() != 0]
df.head()

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |
| 7 | 14-1412 | 14 Indicted in Connection with New England Com... | A 131-count criminal indictment was unsealed t... | 2014-12-17T00:00:00-05:00 | [Consumer Protection] | [Civil Division] |
| 19 | 17-1419 | 2017 Southeast Regional Animal Cruelty Prosecu... | The United States Attorney’s Office for the Mi... | 2017-12-14T00:00:00-05:00 | [Environment] | [Environment and Natural Resources Division, U... |
| 22 | 15-1562 | 21st Century Oncology to Pay $19.75 Million to... | 21st Century Oncology LLC, has agreed to pay $... | 2015-12-18T00:00:00-05:00 | [False Claims Act, Health Care Fraud] | [Civil Division] |
| 23 | 17-1404 | 21st Century Oncology to Pay $26 Million to Se... | 21st Century Oncology Inc. and certain of its ... | 2017-12-12T00:00:00-05:00 | [Health Care Fraud, False Claims Act] | [Civil Division, USAO - Florida, Middle] |


In [25]:
df.shape

(4688, 6)

In [26]:
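# Keep the first 100 rows; .copy() avoids pandas' SettingWithCopyWarning when a column is added later
df = df.head(100).copy()
df.shape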

(100, 6)

In [27]:
# preprocess() is the stop word removal function defined earlier in this section; run that cell first, otherwise this line raises NameError
df["contents_new"] = df.contents.apply(preprocess)

In [None]:
df

#### 2.7: Named Entity Recognition (NER)

- Ways to build your own NER
  1. Simple lookup: manually add entities to a database, then tokenize the text and compare the tokens against it.
  2. Rule-based NER: Spacy provides the EntityRuler class for specifying such rules.
  3. Machine learning: e.g., CRF (Conditional Random Fields).
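
The simple lookup approach can be sketched in a few lines of plain Python. The entity "database" below is a hypothetical dictionary, and matching is a longest-match scan over whitespace tokens, so this is an illustration of the idea rather than a production design.

```python
# Hypothetical entity "database": lowercase phrase -> label
ENTITY_LOOKUP = {
    "tesla inc": "ORG",
    "twitter": "ORG",
    "michael bloomberg": "PERSON",
}
MAX_PHRASE_LENGTH = max(len(phrase.split()) for phrase in ENTITY_LOOKUP)

def lookup_entities(text):
    """Return (phrase, label) pairs found by longest-match lookup over word n-grams."""
    tokens = text.split()
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(MAX_PHRASE_LENGTH, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i : i + size])
            label = ENTITY_LOOKUP.get(phrase.lower())
            if label:
                entities.append((phrase, label))
                i += size
                break
        else:
            i += 1  # no entity starts at this token
    return entities

print(lookup_entities("Tesla Inc is going to acquire twitter for $45 billion"))
# [('Tesla Inc', 'ORG'), ('twitter', 'ORG')]
print(lookup_entities("Michael Bloomberg founded Bloomberg in 1982"))
# [('Michael Bloomberg', 'PERSON')]
```

The second call shows the main weakness of lookup: the company "Bloomberg" is missed because it is not in the database, which is where rule-based and machine learning approaches come in.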

In [None]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

In [None]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

In [None]:
from spacy import displacy

displacy.render(doc, style="ent")

In [None]:
# List all the entity labels supported by the NER component
nlp.pipe_labels['ner']

In [None]:
doc = nlp("Michael Bloomberg founded Bloomberg in 1982")
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

In [None]:
doc = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", ent.start_char, "|", ent.end_char)

In [None]:
doc = nlp("Tesla is going to acquire Twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

In [None]:
s = doc[2:5]
s

In [None]:
type(s)

In [None]:
from spacy.tokens import Span

s1 = Span(doc, 0, 1, label="ORG")
s2 = Span(doc, 5, 6, label="ORG")

doc.set_ents([s1, s2], default="unmodified")

In [None]:
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

#### 2.8: Regular Expressions (Regex)
- Regular expressions (regex) are powerful tools for pattern matching and text manipulation in NLP pipelines.
- They enable efficient searching, extraction, and replacement of text based on specific patterns (e.g., emails, phone numbers).
- Regex helps in data cleaning, tokenization, and text preprocessing by identifying and handling structured text.
- Mastering regex simplifies complex text parsing tasks, reducing the need for extensive manual text processing.
- Regular expressions are widely supported in Python (re module) and other programming languages, making them essential for text-based projects.
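
As a brief illustration with Python's `re` module, the sketch below extracts emails and phone numbers and then masks the emails. The patterns are deliberately simple assumptions that cover the sample text. Real-world email and phone formats are more varied.

```python
import re

# Hypothetical sample text
text = "Contact support at help@example.com or sales@example.org, or call (555) 123-4567 / 555-987-6543."

# Simplified patterns: good enough for this sample, not a full specification.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[\s-]?\d{3}-\d{4}")

emails = EMAIL_PATTERN.findall(text)         # extraction
phones = PHONE_PATTERN.findall(text)
masked = EMAIL_PATTERN.sub("<EMAIL>", text)  # replacement

print(emails)  # ['help@example.com', 'sales@example.org']
print(phones)  # ['(555) 123-4567', '555-987-6543']
print(masked)
```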

#### 2.9: MCQ
- What does doc.ents return in SpaCy?
  - The named entities recognized in the document
- What are stop words in NLP?
  - Common words that are often removed from the text
- Why are stop words removed during text preprocessing?
  - To improve model efficiency
- What is the benefit of character-level tokenization?
  - Provides flexibility for unknown words
- What does spacy.blank("en") do?
  - Creates a blank language object
- What is the main advantage of subword tokenization?
  - It helps in handling rare words
- Why is lemmatization preferred over stemming in NLP?
  - More linguistic accuracy