# NLP Pipeline
## This notebook outlines the main concepts and phases of an NLP pipeline

![NLP Pipeline](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_Pipeline.png)

## NLP Pipeline
- Data Acquisition
- Text Extraction
- Text Cleaning
- Text Pre-processing
- Feature Engineering**
- Modeling***
- Evaluation***
- Deployment****
- Monitoring & Model Improvement****




- **   Covered in detail in the next lecture

- ***  Covered in the **Deep Learning I** course

- ****   Covered in the **Full Stack Data Science Systems** course

### 1.  Data Acquisition

#### Use a public dataset

- Easily available
- If you find a similar dataset that fits the problem at hand
    - Download it, build a model, and evaluate the model

#### Scrape data

- Wiki pages
- Articles
- Webpages

    - Scraped data is unlabeled, so it has to be annotated later

#### Product Intervention

- You have to work with the product team
    - collect more data
    - very important

Pros
- Accurate

Cons
- Takes a long time

#### Data Augmentation

- Use a small dataset to create more data
    - Synonym replacement
        - Randomly choose "k" words in a sentence that are not stop words
        - Replace these words with their synonyms
    - Back Translation
    - TF-IDF-based word replacement
    - Bigram Flipping
    - Entity replacement
    - Noise addition
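
The synonym replacement steps above can be sketched as follows. This is a minimal illustration, not a production augmenter: the `SYNONYMS` thesaurus and the `STOP_WORDS` set are hypothetical toy data invented for this example, whereas a real pipeline would more likely draw on a resource such as WordNet.

```python
import random

# Hypothetical toy thesaurus (assumption); not taken from any library.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "joyful"],
    "movie": ["film"],
}
# Tiny illustrative stop-word set (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to"}


def synonym_replace(sentence, k, rng):
    """Replace up to k non-stop words that have a known synonym."""
    words = sentence.split()
    # Positions of words that are eligible for replacement.
    candidates = [
        i for i, w in enumerate(words)
        if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS
    ]
    # Randomly choose at most k eligible positions and swap in a synonym.
    for i in rng.sample(candidates, min(k, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)


rng = random.Random(0)  # fixed seed so repeated runs give the same result
original = "the movie was quick and happy"
augmented = synonym_replace(original, k=2, rng=rng)
print(augmented)
```

Each call produces a sentence of the same length in which exactly `k` eligible words have changed, so one labeled sentence can yield several new training examples.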

### 2. Text Extraction

- Process of extracting raw text from input data source
    - Remove all unwanted non-textual information
        - Markup data
        - Metadata

![Source formats](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_Text-formats.png)
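
As a rough illustration of removing markup, the sketch below strips HTML tags using only the standard library's `html.parser`. The `TextExtractor` class, the `extract_text` helper, and the sample HTML are assumptions made for this example; they are not part of the original notebook.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>NLP Pipeline</h1><p>Raw <b>text</b> only.</p></body></html>"
)
print(extract_text(sample))
```

For real pages a dedicated parser such as BeautifulSoup is usually more convenient, but the principle is the same: keep the text nodes and discard everything else.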

### Web scraping
Scrape the following URL and extract the text

URL: https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python


![Stack overflow page](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_Stackoverflow.png)



Task
- Look at the URL
- Extract the question
- Extract the answer
- Display them as shown below

This is **Text extraction from webpages**

`urllib` is part of the Python standard library, so only BeautifulSoup needs to be installed.

In [2]:
pip install beautifulsoup4


In [5]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl

myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"
# The default context verifies the server certificate
gcontext = ssl.create_default_context()
html = urlopen(myurl, context=gcontext).read()
soupified = BeautifulSoup(html, "html.parser")

# The first post body on the page is the question
question_text = soupified.find("div", {"class": "s-prose js-post-body"})
print(f"Question = \n {question_text.get_text().strip()}")
print("\n\n\n")

# The first div with class "answer" is the top answer; its body uses the same class
answer_text = soupified.find("div", {"class": "answer"})
answer = answer_text.find("div", {"class": "s-prose js-post-body"})
print(f"Answer = \n {answer.get_text().strip()}")

Question = 
 What is the module/method used to get the current time?




Answer = 
 Use:
>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)

>>> print(datetime.datetime.now())
2009-01-06 15:08:24.789150

And just the time:
>>> datetime.datetime.now().time()
datetime.time(15, 8, 24, 78915)

>>> print(datetime.datetime.now().time())
15:08:24.789150

See the documentation for more information.
To save typing, you can import the datetime object from the datetime module:
>>> from datetime import datetime

Then remove the leading datetime. from all of the above.


### Extraction from PDF documents
- Use one of the following PDF-to-text conversion libraries
    - PyPDF
    - PDFMiner
    - PyPDF2
    - Fitz (PyMuPDF)
    - ...

In [8]:
import wget
wget.download("https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_scanned_image.png")

'NLP_scanned_image.png'

### Extraction from scanned images
- Use the Tesseract OCR library (through pytesseract)
- Use wget to download the PNG

Task: extract the text from the image at this URL: https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_scanned_image.png

Input:

![Scanned Image](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_scanned_image.png)


Output:


’in the nineteenth century the only Kind of linguistics considered\nseriously
was this comparative and historical study of words in languages\nknown or
believed to Fe cognate—say the Semitic languages, or the Indo-\nEuropean
languages. It is significant that the Germans who really made\nthe subject what
it was, used the term Indo-germanisch. Those who know\nthe popular works of 
Otto Jespersen will remember how fitmly he\ndeclares that linguistic 
science is historical. And those who have noticed’

In [11]:
pip install pytesseract

Note: you may need to restart the kernel to use updated packages.




In [2]:
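from PIL import Image
import pytesseract

# Windows only: point pytesseract at the Tesseract executable if it is not on PATH
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
filename = "NLP_scanned_image.png"
# Run OCR on the downloaded image and get the recognized text back as a string
text = pytesseract.image_to_string(Image.open(filename))
print(text)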

In the nineteenth century the only kind of linguistics considered
seriously was this comparative and historical study of words in languages
known or believed to be cognate—say the Semitic languages, or the Indo-
European languages. It is significant that the Germans who really made
the subject what it was, used the term Indo-germanisch. Those who know
the popular works of Otto Jespersen will remember how firmly he
declares that linguistic science is historical. And those who have noticed



### 3. Text Cleaning

#### Unicode Removal from text
- Remove non-textual symbols and special characters such as emoji
- Use string.encode("ascii", "ignore").decode("ascii") to drop every non-ASCII character

I love this!!! 😊  Let's all be happy !😊

In [3]:
text = "I love this!!! 😊  Let's all be happy !😊"

In [4]:
# "ignore" silently drops characters that cannot be encoded as ASCII
cleaned_text = text.encode("ascii", "ignore").decode("ascii")

In [5]:
cleaned_text

"I love this!!!   Let's all be happy !"
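
Stripping non-ASCII characters outright also discards accented letters such as "é". A gentler variant, sketched here on the assumption that plain ASCII output is acceptable for the task, first decomposes each character with `unicodedata.normalize` so that the base letter survives. The `to_ascii` helper name and the sample string are illustrative only.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop whatever is still non-ASCII."""
    # NFKD splits "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks and emoji are not ASCII, so "ignore" removes them.
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = "Café déjà vu 😊"
print(to_ascii(sample))
```

Emoji have no ASCII decomposition and are still removed, while "Café" becomes "Cafe" rather than "Caf".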

#### Spelling Correction
- textblob
- pyspellchecker
- Microsoft's API Spell Checker

#### textblob
- pip install textblob
- Import TextBlob from textblob
- Use TextBlob(string).correct()

In [6]:
pip install textblob

Collecting textblob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
Installing collected packages: textblob
Successfully installed textblob-0.17.1
Note: you may need to restart the kernel to use updated packages.




In [21]:
from textblob import TextBlob

# correct() returns a new TextBlob with each word replaced by its most likely spelling.
# Very short inputs can be "corrected" to an unrelated word, as the output shows.
x = "gngr"
y = TextBlob(x)
print(y.correct())

# Wrap the result in str() when a plain string is needed
z = str(TextBlob(x).correct())
print(z)

nor
nor


#### pyspellchecker : https://pyspellchecker.readthedocs.io/en/latest/
- pip install pyspellchecker
- Import SpellChecker from spellchecker
- Use SpellChecker.unknown(list_of_words) to find the words that may be misspelled
- Iterate through the misspelled words using
    - SpellChecker.correction(word)
    - SpellChecker.candidates(word)

In [3]:
pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.6.3-py3-none-any.whl (2.7 MB)
     ---------------------------------------- 2.7/2.7 MB 14.5 MB/s eta 0:00:00
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.6.3
Note: you may need to restart the kernel to use updated packages.


In [7]:
from spellchecker import SpellChecker
spell = SpellChecker()
 
# find those words that may be misspelled
misspelled = spell.unknown(["cmputr", "watr", "study", "wrte"])
 
for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))
 
    # Get a list of `likely` options
    print(spell.candidates(word))



water
{'atr', 'wanr', 'watt', 'water', 'waar', 'wath', 'war', 'wate', 'wats', 'watc', 'wat', 'wart'}
computer
{'impute', 'caputo', 'compute', 'caput', 'computer'}


#### Microsoft's API Spell Checker

In [2]:
import requests
import json

api_key = "<ENTER-KEY-HERE>"
# Spell Check endpoint URL for your Azure resource
endpoint = "<ENTER-ENDPOINT-HERE>"
example_text = "Hollo, wrld"

data = {'text': example_text}
params = {
    'mkt':'en-us',
    'mode':'proof'
    }
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Ocp-Apim-Subscription-Key': api_key,
    }
response = requests.post(endpoint, headers=headers, params=params, data=data)
json_response = response.json()
print(json.dumps(json_response, indent=4))

### 4. Text Pre-processing
### Process of preparing raw text and extracting knowledge from it
- Sentence Segmentation
- Word Tokenization
- Stop word removal
- Stemming & Lemmatization
- Contraction expansion
- Whitespace
- POS tagging
- Parsing
- Entity recognition

#### Sentence Segmentation : Breaking a large document into sentences
- nltk
- sent_tokenize()

##### Task: Given a piece of text, split it into sentences and print them one by one

In [8]:
mytext = """In the previous chapter, we saw examples of some common NLP applications that we might encounter in everyday life. If we were asked to build such an application, think about how we would approach doing so at our organization. We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them. Since language processing is involved, we would also list all the forms of text processing needed at each step. This step-by-step processing of text is known as pipeline. """

In [9]:
from nltk.tokenize import sent_tokenize
my_sentences = sent_tokenize(mytext)
for idx, sent in enumerate(my_sentences):
    print(f"Sentence {idx+1} \n {sent}\n\n")

Sentence 1 
 In the previous chapter, we saw examples of some common NLP applications that we might encounter in everyday life.


Sentence 2 
 If we were asked to build such an application, think about how we would approach doing so at our organization.


Sentence 3 
 We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them.


Sentence 4 
 Since language processing is involved, we would also list all the forms of text processing needed at each step.


Sentence 5 
 This step-by-step processing of text is known as pipeline.




#### Word Tokenization : Breaking a sentence into words (tokens)
- word_tokenize()

##### Task: Given a sentence, split it into words and print them

In [10]:
sentence = "This step-by-step processing of text is known as pipeline."

In [11]:
from nltk.tokenize import word_tokenize
print(word_tokenize(sentence))

['This', 'step-by-step', 'processing', 'of', 'text', 'is', 'known', 'as', 'pipeline', '.']


#### Stop Word Removal : Removing words that contribute little to our processing
- Import stopwords from nltk.corpus
- Check each word
    - if it is not a stop word, keep it
    - else discard it

In [12]:
from nltk.corpus import stopwords
from string import punctuation
from nltk import sent_tokenize, word_tokenize

In [14]:
stop_words = set(stopwords.words('english'))
filtered = [word for word in word_tokenize(my_sentences[0]) if word not in stop_words]
print(f"Sentence = {my_sentences[0]}\n\n")
print(f"Cleaned text = {filtered}\n")

Sentence = In the previous chapter, we saw examples of some common NLP applications that we might encounter in everyday life.


Cleaned text = ['In', 'previous', 'chapter', ',', 'saw', 'examples', 'common', 'NLP', 'applications', 'might', 'encounter', 'everyday', 'life', '.']
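
In the output above, "In" and the punctuation tokens survive because the comparison is case-sensitive and punctuation is not in the stop-word list. A possible refinement is sketched below. To keep it self-contained it uses a small hand-written `STOP_WORDS` set (an assumption made for illustration) in place of NLTK's English list, and `remove_stop_words` is a hypothetical helper name.

```python
from string import punctuation

# Small illustrative stop-word set (assumption); NLTK's English list is larger.
STOP_WORDS = {"in", "the", "we", "of", "some", "that"}
PUNCTUATION = set(punctuation)  # single punctuation characters


def remove_stop_words(tokens):
    """Drop stop words (case-insensitively) and bare punctuation tokens."""
    return [
        tok for tok in tokens
        if tok.lower() not in STOP_WORDS and tok not in PUNCTUATION
    ]


tokens = ["In", "the", "previous", "chapter", ",", "we", "saw", "examples", "."]
print(remove_stop_words(tokens))
```

Lower-casing only for the comparison keeps the original casing of the surviving tokens, which matters for acronyms such as "NLP".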



### Stemming : Reduces a word to a base form by stripping suffixes (the result need not be a dictionary word)
- PorterStemmer.stem()

In [15]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
word1, word2 = "cars", "revolution" 
print(stemmer.stem(word1), stemmer.stem(word2))

car revolut


### Lemmatization : Reduces a word to its dictionary base form (lemma)
- WordNetLemmatizer.lemmatize()

In [18]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("worse", pos="a"))

bad


### Which is better: Stemming or Lemmatization?
- Try some samples and make a decision
    - well-dressed
    - better
    - airliner
    - was
    - meeting
    - uncomfortable

In [19]:
word_list = ['well-dressed', 'airliner', 'better', 'was', 'meeting', 'uncomfortable']

In [23]:
print("Stemmer vs Lemmatizer results\n")
for word in word_list:
    print(f" Stem of {word} = {stemmer.stem(word)}")
    print(f" Lemma of {word} = {lemmatizer.lemmatize(word, pos='a')}\n")

Stemmer vs Lemmatizer results

 Stem of well-dressed = well-dress
 Lemma of well-dressed = well-dressed

 Stem of airliner = airlin
 Lemma of airliner = airliner

 Stem of better = better
 Lemma of better = good

 Stem of was = wa
 Lemma of was = was

 Stem of meeting = meet
 Lemma of meeting = meeting

 Stem of uncomfortable = uncomfort
 Lemma of uncomfortable = uncomfortable



### Contractions : Expanding contractions
- Use Regular expressions
    - don't --> do not
    - isn't --> is not
    - aren't --> are not
    - we're --> we are
    - they're --> they are

In [24]:
test_sentence = "Everything we’re doing now is great. However, we don't want to relax now. And this isn't the time to relax at all."

In [25]:
import re
pattern = r'we[\’\']re'
replacement = 'we are'
expanded_sentence = re.sub(pattern,replacement,test_sentence)
print(expanded_sentence)

Everything we are doing now is great. However, we don't want to relax now. And this isn't the time to relax at all.


#### Generalize the contraction expansion

In [38]:
# Capture the word before the apostrophe so that any "...'re" form expands, not only "we're"
pattern = r"(\w+)[’']re\b"
replacement = r"\1 are"
expanded_sentence = re.sub(pattern,replacement,test_sentence)
print(expanded_sentence)

Everything we are doing now is great. However, we don't want to relax now. And this isn't the time to relax at all.


#### Write one regular expression for don't types

In [36]:

# Match the "n't" suffix; irregular forms such as "can't" and "won't" need their own rules
pattern = r"n[’']t\b"
replacement = ' not'
expanded_sentence = re.sub(pattern,replacement,test_sentence)
print(expanded_sentence)

Everything we’re doing now is great. However, we do not want to relax now. And this is not the time to relax at all.
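
Single regular expressions struggle with irregular contractions, so a lookup table is a common alternative. The sketch below is one way to do it, not the notebook's own method: the `CONTRACTIONS` mapping is a small illustrative sample and `expand_contractions` is a hypothetical helper.

```python
import re

# Illustrative mapping (assumption); extend it with the contractions you need.
CONTRACTIONS = {
    "don't": "do not",
    "isn't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "won't": "will not",
    "we're": "we are",
    "they're": "they are",
}

# One alternation over all keys; \b keeps matches on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    # Normalize the typographic apostrophe so both spellings match.
    text = text.replace("\u2019", "'")
    # Look up each match in lower case; the expansion itself is lower case.
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("We won’t stop, and they're sure it isn't over."))
```

Note that a capitalized contraction such as "Don't" is expanded to the lower-case "do not"; restoring the capital would need an extra step.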


### POS Tagging : Finding the Parts-Of-Speech of words
- Use spaCy
- pip install spacy
- python -m spacy download en_core_web_sm

In [49]:
%pip install spacy
!python -m spacy download en_core_web_sm

In [43]:
import spacy

#### Load spaCy Language Model

In [50]:
nlp = spacy.load("en_core_web_sm")

In [51]:
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")

In [53]:
doc = nlp(text)
doc

When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously. “I can tell you very senior CEOs of major American car companies would shake my hand and turn away because I wasn’t worth talking to,” said Thrun, in an interview with Recode earlier this week.

#### Find noun phrases in the document

In [54]:
[chunk.text for chunk in doc.noun_chunks]

['Sebastian Thrun',
 'self-driving cars',
 'Google',
 'few people',
 'the company',
 'him',
 'I',
 'you',
 'very senior CEOs',
 'major American car companies',
 'my hand',
 'I',
 'Thrun',
 'an interview',
 'Recode']

#### Verbs

In [55]:
[token.lemma_ for token in doc if token.pos_ == "VERB"]

['start', 'work', 'drive', 'take', 'tell', 'shake', 'turn', 'talk', 'say']

#### Adjectives

In [56]:
[token.lemma_ for token in doc if token.pos_ == "ADJ"]

['few', 'senior', 'major', 'american', 'worth']

#### Entities

In [57]:
for entity in doc.ents:
    print(entity.text, entity.label_)

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE


### Parsing

In [58]:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

Sebastian Thrun Thrun nsubj started
self-driving cars cars pobj on
Google Google pobj at
few people people nsubj took
the company company pobj of
him him dobj took
I I nsubj tell
you you dative tell
very senior CEOs CEOs dobj tell
major American car companies companies pobj of
my hand hand dobj shake
I I nsubj was
Thrun Thrun nsubj said
an interview interview pobj in
Recode Recode pobj with


In [59]:
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

When advmod started VERB []
Sebastian amod Thrun PROPN []
Thrun nsubj started VERB [Sebastian]
started advcl took VERB [When, Thrun, working]
working xcomp started VERB [on, at, in]
on prep working VERB [cars]
self npadvmod driving VERB []
- punct driving VERB []
driving amod cars NOUN [self, -]
cars pobj on ADP [driving]
at prep working VERB [Google]
Google pobj at ADP []
in prep working VERB [2007]
2007 pobj in ADP []
, punct took VERB []
few amod people NOUN []
people nsubj took VERB [few, outside]
outside advmod people NOUN [of]
of prep outside ADV [company]
the det company NOUN []
company pobj of ADP [the]
took ROOT took VERB [started, ,, people, him, seriously, .]
him dobj took VERB []
seriously advmod took VERB []
. punct took VERB []
“ punct tell VERB []
I nsubj tell VERB []
can aux tell VERB []
tell ccomp said VERB [“, I, can, you, CEOs, shake]
you dative tell VERB []
very advmod senior ADJ []
senior amod CEOs NOUN [very]
CEOs dobj tell VERB [senior, of]
of prep CEOs NOUN [com...
(output truncated)

In [60]:
from spacy import displacy
displacy.render(doc, style='dep')

### Entity Recognition

In [61]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Sebastian Thrun 5 20 PERSON
Google 61 67 ORG
2007 71 75 DATE
American 173 181 NORP
Thrun 271 276 PERSON
Recode 299 305 ORG
earlier this week 306 323 DATE


In [62]:
displacy.render(doc, style='ent')

### Coreference Resolution
- Identifying which pronouns and noun phrases across a set of sentences refer to the same person or thing

- Use neuralcoref (it targets spaCy 2.x, so it may not install alongside spaCy 3)

In [None]:
pip install neuralcoref

In [None]:
import spacy
import neuralcoref

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
neuralcoref.add_to_pipe(nlp)

In [None]:
elon_text = """Musk was born to a Canadian mother and South African father and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada when he was 17 to attend Queen's University. He transferred to the University of Pennsylvania two years later, where he received dual bachelor's degrees in economics and physics. He moved to California in 1995 to begin a Ph.D. in applied physics and material sciences at Stanford University but dropped out after two days to pursue a business career, co-founding web software company Zip2 with his brother Kimbal. The start-up was acquired by Compaq for $307 million in 1999. Musk co-founded online bank X.com that same year, which merged with Confinity in 2000 to form the company PayPal and was subsequently bought by eBay in 2002 for $1.5 billion."""

In [None]:
doc = nlp(elon_text)

In [None]:
resolved_doc = doc._.coref_resolved
print(resolved_doc)

### 5. Feature Engineering
- Set of methods that extract features from text for model building
- Converts pieces of text into numeric vectors

#### Two major categories
- Classical NLP / ML Pipeline
- Deep Learning Pipeline

#### Classical NLP
![Classical NLP](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_Classical_FE.png)


- Converts the raw data into a format that can be consumed by a machine
- Convert text into **numerical vectors**
- In Classical NLP, feature extraction is **handcrafted or hand-engineered** by domain experts who are solving the problem
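
One of the simplest hand-engineered representations is a bag-of-words count vector. The sketch below builds one with the standard library only; the two sample documents and the `vectorize` helper are made up for illustration, and libraries such as scikit-learn offer fuller implementations.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({tok for doc in docs for tok in doc.split()})


def vectorize(doc):
    """Return the count of each vocabulary word in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


vectors = [vectorize(doc) for doc in docs]
print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a fixed-length list of integers, one position per vocabulary word, which is the kind of numeric input a classical model expects.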

### Deep Learning NLP
![DL NLP](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_DL_FE.png)

- Feature Extraction happens automatically as part of the model training process
- The network's **neurons** learn the features directly from the data

### 6. Modeling
- Process of building a model with the data
    - Simple Heuristics
        - Regular Expressions
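        - Rule-based approaches
    - Probabilistic models
        - Hidden Markov Models (HMM)
        - Conditional Random Fields (CRF)
    - Neural Network models
        - Recurrent Neural Networks (RNN)
        - Long Short-Term Memory networks (LSTM)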
    - Ensemble models
    - Transfer learning

![Modeling Principles](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_Modeling.png)
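
To make the "simple heuristics" option concrete, here is a minimal rule-based classifier built from regular expressions. The labels, patterns, and example messages are all hypothetical; the point is only that a handful of rules can serve as a first model before any training data exists.

```python
import re

# Hypothetical keyword rules (assumption); the first matching rule wins.
RULES = [
    ("refund", re.compile(r"\b(refund|money back|reimburse)\b", re.IGNORECASE)),
    ("shipping", re.compile(r"\b(deliver(y|ed)?|shipping|track(ing)?)\b", re.IGNORECASE)),
]


def classify(text, default="other"):
    """Return the label of the first rule whose pattern matches the text."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return default


for msg in ["Where is my delivery?", "I want my money back", "Hello there"]:
    print(msg, "->", classify(msg))
```

Such rules are easy to inspect and debug, and they can later provide a baseline against which learned models are compared.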

### 7. Evaluation : Measuring how good the model is
- Use the right metric
- Follow the right evaluation process

Types of Evaluation
- Intrinsic evaluation
- Extrinsic evaluation

#### Intrinsic Evaluation

![Intrinsic Evaluation_1](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_Intrinsic_Evaluation1.png)


![Intrinsic Evaluation_2](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_Intrinsic_Evaluation2.png)

![Intrinsic Evaluation_3](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_Intrinsic_Evaluation3.png)
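
For classification tasks, intrinsic evaluation commonly relies on precision, recall, and F1. The sketch below computes them from scratch for a binary task; the label lists are hypothetical values chosen for illustration, and `precision_recall_f1` is not a library function.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical gold labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here three of the four predicted positives are correct and three of the four actual positives are found, so all three metrics come out equal.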

#### Extrinsic Evaluation

- Measures business metrics that are owned outside the AI/ML team

Takeaway
- First, check that the model achieves good intrinsic evaluation metrics
- Then, move on to extrinsic evaluation

### 8. Deployment : Serving the built models to the customers
Major cloud providers
- Google Cloud Platform (GCP)
- Amazon Web Services (AWS)
- Microsoft Azure

### 9. Monitoring & Model Improvement
- Model performance must be monitored continuously, in real time where possible
- Include performance dashboards in the project

![Monitoring](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/NLP_Monitoring.png)