# Chapter 2. NLP pipeline

The key stages in the NLP pipeline:

1. Data acquisition
2. Text cleaning
3. Pre-processing
4. Feature engineering
5. Modeling
6. Evaluation
7. Deployment
8. Monitoring and model updating

![Untitled](Chapter%202%20NLP%20pipeline%20c140380e660240a9a622b98490b31b28/Untitled.png)

## 1. Data Acquisition

Ways to collect data:

- Use a public dataset
- Scrape data: crawl data from the internet
- Product intervention: collect data from real products running in production
- Data augmentation:
    - Synonym replacement: randomly choose `k` words in a sentence that are not stop words and replace them with their synonyms. Synsets in WordNet can supply the synonyms [3,4]
    - Back translation: suppose we have a sentence S1 in English. We can use Google Translate to translate it into German as S2, then translate S2 back into English. The resulting sentence has the same meaning as S1 but a different structure.

    ![Untitled](Chapter%202%20NLP%20pipeline%20c140380e660240a9a622b98490b31b28/Untitled%201.png)
    - TF-IDF-based word replacement
    - Bigram flipping
    - Replacing entities
    - Adding noise to data
    - Advanced techniques:
        - Snorkel
        - Easy Data Augmentation
        - Active Learning
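
The synonym replacement and bigram flipping ideas above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `SYNONYMS` table and `STOP_WORDS` set are hypothetical toy data standing in for WordNet synsets and a real stop-word list, and the function names are made up for this example.

```python
import random

# Hypothetical toy synonym table; in practice the synonyms could come
# from WordNet synsets, as the note on synonym replacement suggests.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "car": ["automobile", "vehicle"],
}
# Hypothetical tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to"}


def synonym_replacement(tokens, k, rng):
    """Replace up to k randomly chosen non-stop words with a synonym."""
    candidates = [i for i, tok in enumerate(tokens)
                  if tok.lower() not in STOP_WORDS and tok.lower() in SYNONYMS]
    chosen = rng.sample(candidates, min(k, len(candidates)))
    out = list(tokens)
    for i in chosen:
        out[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return out


def bigram_flip(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


rng = random.Random(0)  # fixed seed so runs are repeatable
sentence = "the quick driver is happy in the car".split()
augmented = synonym_replacement(sentence, k=2, rng=rng)
flipped = bigram_flip(sentence, rng)
print(augmented)
print(flipped)
```

Both functions return new lists and leave the input untouched, so the original sentence can be kept next to its augmented variants.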

## 2. Text extraction and cleanup

![Untitled](Chapter%202%20NLP%20pipeline%20c140380e660240a9a622b98490b31b28/Untitled%202.png)

### HTML Parsing and Cleanup

### Unicode Normalization



    text = 'I love 🍕! Shall we book a 🚗 to gizza?'
    # Encode the string as UTF-8 bytes; each emoji becomes a 4-byte sequence.
    Text = text.encode("utf-8")

    Text

Output:

    b'I love \xf0\x9f\x8d\x95! Shall we book a \xf0\x9f\x9a\x97 to gizza?'
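
Beyond encoding, normalization also matters because the same visible character can be stored as different code point sequences. The sketch below uses only the standard-library `unicodedata` module; the example word is an arbitrary choice for illustration.

```python
import unicodedata

composed = "caf\u00e9"      # "é" as one code point (U+00E9)
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent (U+0301)

# The two strings render identically but are different code point sequences.
print(composed == decomposed)

# After normalizing both to NFC they compare equal.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)
print(len(decomposed), len(nfc_b))
```

Normalizing all incoming text to one form (NFC here) before tokenization avoids treating such variants as different words.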

### Spelling Correction

Microsoft's spell check REST API can be used for spelling correction.

    import requests
    import json

    # Left empty here; a valid subscription key is required for a real response.
    api_key = ""
    example_text = "Hollo, wrld"
    endpoint = "https://api.cognitive.microsoft.com/bing/v7.0/SpellCheck"

    # Form-encoded request body carrying the text to check.
    data = {
        'text': example_text
    }

    # Query parameters: market and spell-check mode.
    params = {
        'mkt': 'en-us',
        'mode': 'proof'
    }

    headers = {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Ocp-Apim-Subscription-Key': api_key,
    }

    response = requests.post(endpoint, headers=headers, params=params, data=data)

    json_response = response.json()
    print(json.dumps(json_response, indent=4))

Output (the request was sent without a valid key, so the service returned an error):

    {
        "error": {
            "code": "401",
            "message": "Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource."
        }
    }


### System-Specific Error Correction

    from PIL import Image
    from pytesseract import image_to_string
    import pytesseract

    # Path to the Tesseract executable (Windows install location used here).
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
    filename = "./test_img.png"

    # Run OCR on the image and print the recognized text.
    text = image_to_string(Image.open(filename))
    print(text)

Output:

    In the nineteenth century the only kind of linguistics considered
    seriously was this comparative and historical stady of word in angus

    known or believed to be cagnate—say the Semitic languages, or the Indo-
    European languages. It is significant that the Germans who really made
    the subject what it was, used the term Indo-germanisch, Those who know
    the popular works of Otto Jespersen will remember how firmly he
    declares that linguistic science is historical. And those who have noticed
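
The OCR output contains errors such as `stady` and `cagnate`. One possible way to repair them is to snap each token to the closest word in a domain vocabulary. The sketch below does this with the standard-library `difflib`. The `VOCABULARY` list, the `correct_token` helper, and the `0.75` cutoff are assumptions chosen for illustration, not part of any OCR toolkit.

```python
import difflib

# Hypothetical domain vocabulary; a real system would use a much larger
# word list drawn from in-domain text.
VOCABULARY = ["study", "cognate", "languages", "historical", "linguistics"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest vocabulary word, or the token itself if none is close."""
    if token.lower() in vocabulary:
        return token
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


ocr_tokens = ["comparative", "and", "historical", "stady", "cagnate"]
print([correct_token(tok) for tok in ocr_tokens])
```

A high cutoff keeps ordinary words that are simply missing from the vocabulary from being rewritten, at the cost of leaving heavier OCR damage uncorrected.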







## 3. Pre-Processing

Common pre-processing steps used in NLP software:

- Preliminaries:
    - Sentence segmentation and word tokenization
- Frequent steps:
    - Stop word removal
    - Stemming and lemmatization
    - Removing digits/punctuation
    - Lowercasing
- Other steps:
    - Normalization, language detection, code mixing, transliteration
- Advanced processing:
    - POS tagging, parsing, coreference resolution

### Sentence Segmentation


    from nltk.tokenize import sent_tokenize, word_tokenize

    mytext = """
    In the previous chapter, we saw examples of some common NLP
    applications that we might encounter in everyday life. If we were asked to
    build such an application, think about how we would approach doing so at our
    organization. We would normally walk through the requirements and break the
    problem down into several sub-problems, then try to develop a step-by-step
    procedure to solve them. Since language processing is involved, we would also
    list all the forms of text processing needed at each step. This step-by-step
    processing of text is known as pipeline. It is the series of steps involved in
    building any NLP model. These steps are common in every NLP project, so it
    makes sense to study them in this chapter. Understanding some common procedures
    in any NLP pipeline will enable us to get started on any NLP problem encountered
    in the workplace. Laying out and developing a text-processing pipeline is seen
    as a starting point for any NLP application development process. In this
    chapter, we will learn about the various steps involved and how they play
    important roles in solving the NLP problem and we’ll see a few guidelines
    about when and how to use which step. In later chapters, we’ll discuss
    specific pipelines for various NLP tasks (e.g., Chapters 4–7).
    """

    # Split the raw text into a list of sentences.
    my_sentences = sent_tokenize(mytext)

    my_sentences

Output:

    ['\nIn the previous chapter, we saw examples of some common NLP\napplications that we might encounter in everyday life.',
     'If we were asked to\nbuild such an application, think about how we would approach doing so at our\norganization.',
     'We would normally walk through the requirements and break the\nproblem down into several sub-problems, then try to develop a step-by-step\nprocedure to solve them.',
     'Since language processing is involved, we would also\nlist all the forms of text processing needed at each step.',
     'This step-by-step\nprocessing of text is known as pipeline.',
     'It is the series of steps involved in\nbuilding any NLP model.',
     'These steps are common in every NLP project, so it\nmakes sense to study them in this chapter.',
     'Understanding some common procedures\nin any NLP pipeline will enable us to get started on any NLP problem encountered\nin the workplace.',
     'Laying out and developing a text-processing pipeline is seen\nas a starting point for any NLP applicat

### Word tokenization

    for sentence in my_sentences:
        print(sentence)
        print(word_tokenize(sentence))

Output:

    In the previous chapter, we saw examples of some common NLP
    applications that we might encounter in everyday life.
    ['In', 'the', 'previous', 'chapter', ',', 'we', 'saw', 'examples', 'of', 'some', 'common', 'NLP', 'applications', 'that', 'we', 'might', 'encounter', 'in', 'everyday', 'life', '.']
    If we were asked to
    build such an application, think about how we would approach doing so at our
    organization.
    ['If', 'we', 'were', 'asked', 'to', 'build', 'such', 'an', 'application', ',', 'think', 'about', 'how', 'we', 'would', 'approach', 'doing', 'so', 'at', 'our', 'organization', '.']
    We would normally walk through the requirements and break the
    problem down into several sub-problems, then try to develop a step-by-step
    procedure to solve them.
    ['We', 'would', 'normally', 'walk', 'through', 'the', 'requirements', 'and', 'break', 'the', 'problem', 'down', 'into', 'several', 'sub-problems', ',', 'then', 'try', 'to', 'develop', 'a', 'step-by-step', 'procedure', 'to', 'solve', 'them', '.']
    Sinc

![](./Chapter%202%20NLP%20pipeline%20c140380e660240a9a622b98490b31b28/Untitled%203.png)

### Frequent steps

- After sentence segmentation and word tokenization we have an array of tokens. The next step is to filter out words that are not needed for the analysis, such as `a, an, the, of, in...`, known as `stop words`. Other words that are not really relevant to the context may also need to be filtered out.
- Another issue is letter case. Text is usually converted to all uppercase or all lowercase, and in most cases to lowercase.
- Remove punctuation and any digits that are not needed.

    from string import punctuation

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    punctuation = list(punctuation)

    def preprocess_corpus(texts):
        mystopwords = set(stopwords.words("english"))

        def remove_stops_digits(tokens):
            # Lowercase first so capitalized stop words ("In", "We") are removed too.
            lowered = [token.lower() for token in tokens]
            return [token for token in lowered if token not in mystopwords
                    and not token.isdigit() and token not in punctuation]

        return [remove_stops_digits(word_tokenize(text)) for text in texts]

    preprocess_corpus(my_sentences)

Output:

    [['previous',
      'chapter',
      'saw',
      'examples',
      'common',
      'nlp',
      'applications',
      'might',
      'encounter',
      'everyday',
      'life'],
     ['asked',
      'build',
      'application',
      'think',
      'would',
      'approach',
      'organization'],
     ['would',
      'normally',
      'walk',
      'requirements',
      'break',
      'problem',
      'several',
      'sub-problems',
      'try',
      'develop',
      'step-by-step',
      'procedure',
      'solve'],
     ['since',
      'language',
      'processing',
      'involved',
      'would',
      'also',
      'list',
      'forms',
      'text',
      'processing',
      'needed',
      'step'],
     ['step-by-step', 'processing', 'text', 'known', 'pipeline'],
     ['series', 'steps', 'involved', 'building', 'nlp', 'model'],
     ['steps',
      'common',
      'every',
      'nlp',
      'project',
      'makes',
      'sense',
      'study',
      'chapter'],
     ['understanding',
      'common',
      'procedures',
      'nlp',
      'pipeline',
      'enable',
      'us',
      'get',
      'started',
      'nlp',
      'problem',
      'encountered',
      'workplace'],
     [

### Stemming and lemmatization



**Stemming** means removing prefixes and suffixes so that different forms of a word reduce to the same form. For example:

`'car'` and `'cars'` are both stemmed to `car`. This is done by applying a fixed set of rules, for example: if a word ends in `es`, drop the `es`.

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()

    word1, word2 = 'car', 'cars'

    print(stemmer.stem(word1), stemmer.stem(word2))

Output:

    car car
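
To make the "apply a set of rules" idea concrete, here is a toy suffix stripper. It is only a sketch: the rules in `SUFFIX_RULES` and the `toy_stem` function are illustrative assumptions and are not the Porter rule set used by `PorterStemmer`.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# These rules are illustrative assumptions, not the Porter rule set.
SUFFIX_RULES = [
    ("sses", "ss"),   # classes -> class
    ("ies", "y"),     # parties -> party
    ("es", ""),       # boxes   -> box
    ("s", ""),        # cars    -> car
]


def toy_stem(word):
    """Strip one suffix using the first rule that matches."""
    if word.endswith("ss"):      # leave words like "class" untouched
        return word
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least two letters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["car", "cars", "boxes", "classes", "parties", "class"]:
    print(w, "->", toy_stem(w))
```

Rule order matters: longer suffixes are listed first so that `classes` is handled by the `sses` rule rather than the generic `es` rule.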


Some words have variants that are not formed simply by adding a suffix or prefix, for example good and better, yet they share the same meaning, so they also need to be mapped to a single base form. This process is called `Lemmatization`.

![](./Chapter%202%20NLP%20pipeline%20c140380e660240a9a622b98490b31b28/Untitled%204.png)

    import nltk
    from nltk.stem import WordNetLemmatizer

    # Download the WordNet data the lemmatizer relies on.
    nltk.download('wordnet')
    nltk.download('omw-1.4')

    lemmatizer = WordNetLemmatizer()

    # pos="a" marks the word as an adjective.
    print(lemmatizer.lemmatize("better", pos="a"))

Output (download messages omitted):

    good


![](./Chapter%202%20NLP%20pipeline%20c140380e660240a9a622b98490b31b28/Untitled%205.png)

Vietnamese does not need stemming or lemmatization, because Vietnamese words do not inflect the way English words do. These two techniques are typically used for English.

### Other Pre-processing Steps

#### Text normalization

#### Language Detection

When processing text data, for example data crawled from an e-commerce site, the crawl may return text in many different languages because users from different countries leave comments. It is therefore important to identify the language before processing. The Polyglot library can be used for this.
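
As a rough illustration of the idea (and not of how Polyglot works internally), a language can be guessed by counting how many of its common function words appear in the text. The word lists in `LANGUAGE_STOP_WORDS` and the `guess_language` helper are hypothetical and far too small for real use.

```python
# Hypothetical, tiny stop-word samples per language, for illustration only.
LANGUAGE_STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "this"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ein"},
    "vi": {"và", "là", "của", "không", "này", "có", "cho"},
}


def guess_language(text):
    """Return the language whose stop words overlap most with the text, or None."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in LANGUAGE_STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


reviews = [
    "This product is great and the delivery was fast",
    "Das Produkt ist nicht gut",
    "Sản phẩm này không tốt",
]
for review in reviews:
    print(guess_language(review), "|", review)
```

Once each review is labeled, it can be routed to a language-specific pre-processing pipeline.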

#### Code mixing and transliteration

When a person knows several languages, they may unintentionally use multiple languages within one sentence while speaking or writing. This is called `code mixing`.

In writing, when that person spells words from their own language phonetically using English letters (for example, Vietnamese `chào` written as `chao`), it is called transliteration.

![](./Chapter%202%20NLP%20pipeline%20c140380e660240a9a622b98490b31b28/Untitled%206.png)
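
One simple normalization that helps match text like `chao` against `chào` is to fold diacritics away. The sketch below uses the standard-library `unicodedata` module; `fold_diacritics` is a hypothetical helper name. It only covers this one direction and is not a general transliteration system.

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining marks so accented letters map to their base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "đ"/"Đ" have no canonical decomposition, so map them explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")


print(fold_diacritics("chào"))
print(fold_diacritics("chào") == "chao")
```

Folding both the stored text and the incoming text lets accented and unaccented spellings be compared directly, at the cost of merging words that differ only by tone marks.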

### Advanced Processing

## 4. Feature engineering

After pre-processing the raw text, the next step is to transform the processed data into a form that AI models can work with, namely matrices and vectors. This is also called `feature extraction`.

![](./Chapter%202%20NLP%20pipeline%20c140380e660240a9a622b98490b31b28/Untitled%207.png)
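
As a minimal sketch of turning pre-processed text into vectors, the example below builds a bag-of-words count matrix in plain Python. The toy `docs` corpus and the `to_vector` helper are assumptions for illustration; a real project would normally rely on a library vectorizer.

```python
from collections import Counter

# Toy corpus of already pre-processed (tokenized, lowercased) documents.
docs = [
    ["nlp", "pipeline", "steps"],
    ["text", "processing", "pipeline"],
    ["nlp", "text", "nlp"],
]

# Vocabulary: every distinct token, in sorted order; position = column index.
vocab = sorted({token for doc in docs for token in doc})


def to_vector(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocab]


matrix = [to_vector(doc) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary word, which is the matrix form that classical ML models expect as input.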

### Classical NLP / ML Pipeline



### DL Pipeline

## 5. Modeling

Summary:

- Besides ML and DL models, heuristics (rule-based logic such as if/else conditions on the data) can be used alongside them. Which approach to use depends on how large the problem is and how much data is available; neither heuristics nor models are always the right choice.

- When using models, model stacking (several models in sequence, where the output of one model is the input to the next) or model ensembling (several models run in parallel whose predictions are pooled into a final prediction) is encouraged because it tends to give better results, provided the performance requirements can still be met.

Combining stacking and ensembling can be illustrated as follows:

![](../assets/images/Screenshot%20from%202023-10-04%2009-59-50.png)

- Apply better feature engineering
- Use transfer learning
- Use heuristics to handle specific cases that the model cannot handle
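
A small sketch of two of these ideas, ensembling by majority vote and a heuristic override for a case the models should not decide, is given below. The three "models" are hypothetical rule-based stand-ins for trained classifiers, and the allow-list address is made up. Stacking is not shown.

```python
from collections import Counter


# Three hypothetical stand-in "models": each maps a text to a label.
def model_a(text):
    return "spam" if "free" in text.lower() else "ham"


def model_b(text):
    return "spam" if "winner" in text.lower() else "ham"


def model_c(text):
    return "spam" if text.count("!") >= 2 else "ham"


def ensemble_predict(text, models):
    """Majority vote over the labels predicted by independent models."""
    votes = Counter(model(text) for model in models)
    return votes.most_common(1)[0][0]


def predict(text, models, allow_list=("newsletter@example.com",)):
    """Apply a heuristic override first, then fall back to the ensemble."""
    if any(sender in text for sender in allow_list):
        return "ham"
    return ensemble_predict(text, models)


models = [model_a, model_b, model_c]
print(predict("You are a WINNER! Claim your free prize!!", models))
print(predict("Free tips from newsletter@example.com", models))
print(predict("Meeting moved to 3pm", models))
```

Using an odd number of models with two labels avoids ties in the vote.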

For how to approach the problem depending on the available data, see the following table:

![](../assets/images/Screenshot%20from%202023-10-04%2010-03-19.png)

## 5. Evaluation

There are two types of evaluation:

- Intrinsic: focuses on intermediary objectives
- Extrinsic: focuses on the final objectives


For example, for a spam email classification system:

- Intrinsic: measure metrics such as precision and recall
- Extrinsic: measure the time users waste because of misclassified email, that is, spam reaching the inbox and legitimate email landing in the spam folder

In other words, intrinsic evaluation assesses the model during development, while extrinsic evaluation is the final assessment once the model is deployed in the business.
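
For the intrinsic side of the spam example, precision and recall can be computed directly from true and predicted labels. The labels below are made up for illustration, and `precision_recall` is a hypothetical helper, not a library function.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Compute precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for six emails, purely for illustration.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here a false positive is a legitimate email sent to the spam folder and a false negative is spam that reaches the inbox, the same two error types the extrinsic measure counts in wasted user time.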

### Intrinsic Evaluation

![](../assets/images/Screenshot%20from%202023-10-04%2010-05-37.png)
![](../assets/images/Screenshot%20from%202023-10-04%2010-05-58.png)

### Extrinsic Evaluation

## 6. Post-modeling phases

Once we have a model, we need to:

- Deploy model
- Monitoring
- Model updating

![](../assets/images/Screenshot%20from%202023-10-04%2010-54-42.png)

## 7. Working with Other Languages

![](../assets/images/Screenshot%20from%202023-10-04%2011-05-14.png)

## 8. Case study

Consider the case study of Uber's tool for improving customer support: Customer Obsession Ticketing Assistant (COTA).

Uber operates in more than 400 cities worldwide, so the number of tickets raised every day for various issues is huge. Each ticket can also have many possible solutions. COTA's goal is to rank these solutions and pick the best one for the ticket.

![](../assets/images/Screenshot%20from%202023-10-04%2011-16-16.png)

### Breaking it down:

- First, the information needed to identify the ticket issue and choose a solution comes from 3 sources:

    - Ticket info (metadata)
    - Ticket text: the content of the ticket
    - Trip data: data about the customer's trip

- The `Ticket text` data then goes through pre-processing.

- After pre-processing, LSI and TF-IDF are used to extract features. This process is called topic modeling:

    - In detail, Uber collects the historical tickets for each solution from its database, creates a bag-of-words vector representation for each solution, and builds a topic model on these representations. An incoming ticket is then scored by cosine similarity against each solution, which yields the ticket text's similarity to every possible solution.
    - The next step combines the cosine similarity results from the previous step with Ticket info and Trip data to rank the solutions and select the 3 best.
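
The similarity step can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not Uber's implementation: it uses TF-IDF vectors only (the LSI step and the combination with Ticket info and Trip data are left out), and the solution names and ticket texts in `solution_docs` are invented.

```python
import math
from collections import Counter

# Hypothetical historical ticket text grouped by solution (already pre-processed).
solution_docs = {
    "refund_fare": "charged twice fare refund payment card charged",
    "lost_item": "left phone bag lost item driver contact",
    "update_payment": "change card payment method update expired",
}


def tf_idf_vectors(docs):
    """Build TF-IDF vectors (as dicts) for a mapping of name -> text."""
    tokenized = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized.values() for tok in set(toks))
    # Smoothed IDF so that no weight is zero or undefined.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    vectors = {
        name: {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        for name, toks in tokenized.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_solutions(ticket_text, vectors, idf, top_k=3):
    """Score every solution against the ticket and return the top_k."""
    counts = Counter(tok for tok in ticket_text.split() if tok in idf)
    ticket_vec = {tok: count * idf[tok] for tok, count in counts.items()}
    scores = {name: cosine(ticket_vec, vec) for name, vec in vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


vectors, idf = tf_idf_vectors(solution_docs)
ranking = rank_solutions("driver charged my card twice", vectors, idf)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In the pipeline described above, these similarity scores would be one group of features among several, with the final ranking produced after combining them with the ticket metadata and trip data.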