# **1. Introduction to NLP**
![](https://i.imgur.com/aRMVbVe.jpg)

**Natural language processing (NLP)** refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.

NLP is used to translate text from one language to another, summarize large amounts of text, and respond to customer queries through chatbots and digital assistants. It is also used in voice-operated GPS systems and other consumer conveniences. Companies increasingly adopt NLP in business solutions that enhance customer experience, streamline operations, and increase profit.



# **2. Task of NLP**

The task of NLP is complex because natural language is ambiguous and uncertain. Natural language contains several types of ambiguity:

1. **Lexical Ambiguity:** The ambiguity associated with the meaning of a single word. A single word can have different meanings, and it can also belong to different parts of speech. For example, the word “bank” can mean a financial institution or a riverbank. Similarly, the word “clean” can be a noun, adverb, adjective, or verb.

2. **Syntactic Ambiguity:** The ambiguity associated with the way the words of a sentence are parsed. For example, the sentence “Visiting relatives can be boring.” has two readings. In one, the act of going to visit relatives can be boring. In the other, relatives who come to visit you can be boring.

3. **Semantic Ambiguity:** The ambiguity that arises when the meaning of the words themselves can be interpreted in more than one way. For example, in the sentence “Mary knows a little French.”, the phrase “a little French” is ambiguous: we don’t know whether it refers to the French language or to a French person.


Several NLP tasks break down human text and voice data in ways that help the computer make sense of what it's ingesting. Some of these tasks include the following:

- **Speech recognition**, also called **speech-to-text**, is the task of reliably converting voice data into text data. Speech recognition is required for any application that follows voice commands or answers spoken questions. What makes speech recognition especially challenging is the way people talk—quickly, slurring words together, with varying emphasis and intonation, in different accents, and often using incorrect grammar.
- **Part of speech tagging**, also called **grammatical tagging**, is the process of determining the part of speech of a particular word or piece of text based on its use and context. Part of speech tagging identifies ‘make’ as a verb in ‘I can make a paper plane,’ and as a noun in ‘What make of car do you own?’
- **Word sense disambiguation** is the selection of the meaning of a word with multiple meanings through a process of semantic analysis that determines the meaning that makes the most sense in the given context. For example, word sense disambiguation helps distinguish the meaning of the verb 'make' in ‘make the grade’ (achieve) vs. ‘make a bet’ (place).
- **Named entity recognition, or NER**, identifies words or phrases as useful entities. NER identifies ‘Kentucky’ as a location or ‘Fred’ as a person's name.
- **Co-reference resolution** is the task of identifying if and when two words refer to the same entity. The most common example is determining the person or object to which a certain pronoun refers (e.g., ‘she’ = ‘Mary’), but it can also involve identifying a metaphor or an idiom in the text (e.g., an instance in which 'bear' isn't an animal but a large hairy person).
- **Sentiment analysis** attempts to extract subjective qualities—attitudes, emotions, sarcasm, confusion, suspicion—from text.
- **Natural language generation** is sometimes described as the opposite of speech recognition or speech-to-text; it's the task of putting structured information into human language. 
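To make word sense disambiguation more concrete, here is a minimal sketch in the spirit of the simplified Lesk approach: it picks the sense whose definition shares the most content words with the sentence. The two glosses for “bank”, the stop list, and the function names are all assumptions made up for this illustration; a real system would draw senses from a lexical database.

```python
import re

# Hypothetical mini sense inventory for "bank" (invented for this sketch).
SENSES = {
    "financial institution": "an institution that accepts deposits of money and makes loans",
    "river edge": "the sloping land beside a river or lake",
}

# Small hand-picked stop list so that function words do not count as overlap.
STOP = {"a", "an", "the", "of", "and", "or", "that", "to", "on", "in", "i", "we"}

def content_words(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP}

def disambiguate(sentence, senses=SENSES):
    """Return the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda s: len(context & content_words(senses[s])))

print(disambiguate("I went to the bank to deposit money"))  # financial institution
print(disambiguate("We sat on the bank of the river"))      # river edge
```

The overlap count is a crude signal: it only works when the sentence happens to reuse words from a gloss, which is one reason practical systems rely on richer context models.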


## 2.1. Phases of Natural Language Processing

There are roughly five phases of Natural language processing:


1. **Lexical Analysis:**

The first phase is lexical analysis, also called morphological processing. In this phase, the source text is scanned and broken into paragraphs, sentences, and finally tokens, the smallest meaningful units of text (lexemes). For example, the sentence “He goes to college.” is divided into [‘He’, ‘goes’, ‘to’, ‘college’, ‘.’], which gives five tokens.

2. **Syntactic Analysis/Parsing:** 

The second phase is syntactic analysis. In this phase, the parser checks whether the sentence is well-formed: it studies the arrangement of the words, checks it against the rules of grammar, and determines the syntactic relationships between the words. For example, the sentence “Delhi goes to him” is rejected by the syntactic parser.

3. **Semantic Analysis:**  

The third phase is semantic analysis. In this phase, the sentence is checked for the literal meaning of each word and of the words in combination. For example, the sentence “I ate hot ice cream” is rejected by the semantic analyzer because it doesn’t make sense.

4. **Discourse Integration:** 

The fourth phase is discourse integration. In this phase, the impact of the preceding sentences on a particular sentence, and the effect of that sentence on the sentences that follow, are determined.

5. **Pragmatic Analysis:** 

The last phase of natural language processing is pragmatic analysis. The actual effect of the text is discovered by applying the set of rules that characterize cooperative dialogues. Sometimes the discourse integration and pragmatic analysis phases are combined.
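Of the five phases, lexical analysis is the easiest to try out directly. The following is a minimal sketch using only Python's standard library; the function name `lex` and the regular expression are choices made for this illustration, not a standard lexer.

```python
import re

def lex(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = lex("He goes to college.")
print(tokens)       # ['He', 'goes', 'to', 'college', '.']
print(len(tokens))  # 5
```

This reproduces the five tokens from the “He goes to college.” example in the lexical analysis phase.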

# **3. What is text analysis?**

Text analysis is the process of using computer systems to read and understand human-written text for business insights. Text analysis software can independently classify, sort, and extract information from text to identify patterns, relationships, sentiments, and other actionable knowledge. You can use text analysis to efficiently and accurately process multiple text-based sources such as emails, documents, social media content, and product reviews, like a human would.

## How does text analysis work?

The core of text analysis is training computer software to associate words with specific meanings and to understand the semantic context of unstructured data. This is similar to how humans learn a new language by associating words with objects, actions, and emotions. 

Text analysis software works on the principles of deep learning and natural language processing.

**Deep learning**

Artificial intelligence is the field of data science that teaches computers to think like humans. Machine learning is a technique within artificial intelligence that uses specific methods to teach or train computers. Deep learning is a highly specialized machine learning method that uses neural networks or software structures that mimic the human brain. Deep learning technology powers text analysis software so these networks can read text in a similar way to the human brain.

**Natural language processing (NLP)** 

NLP uses linguistic models and statistics to train the deep learning technology to process and analyze text data, including images of handwritten text. Methods such as optical character recognition (OCR) convert text images into text documents by finding and understanding the words in the images.




# **4. Tokenization**

By tokenizing, you can conveniently split up text by word or by sentence. This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful even outside of the context of the rest of the text. It’s your first step in turning unstructured data into structured data, which is easier to analyze.

When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence. Here’s what both types of tokenization bring to the table:

1. **Tokenizing by word:** Splitting a sentence into words.

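```
# Assumes NLTK, which this guide uses later; needs NLTK's Punkt tokenizer data.
from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to Party."
print(word_tokenize(text))

# Output: ['Hello', 'everyone', '.', 'Welcome', 'to', 'Party', '.']
```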

2. **Tokenizing by sentence:** Splitting a text into sentences.

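```
# Assumes NLTK; needs NLTK's Punkt tokenizer data.
from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to NLP Lecture. You are studying NLP."
print(sent_tokenize(text))

# Output: ['Hello everyone.',
#  'Welcome to NLP Lecture.',
#  'You are studying NLP.']
```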

3. **RegexpTokenizer:** Splits the sentence into words based on a regular expression.

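```
from nltk.tokenize import RegexpTokenizer

# Keep runs of word characters and apostrophes; other punctuation is dropped.
tokenizer = RegexpTokenizer(r"[\w']+")
text = "Let's see how it's working."
print(tokenizer.tokenize(text))

# Output: ["Let's", 'see', 'how', "it's", 'working']
```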
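If you want to see the idea behind sentence tokenization without any library, a rough approximation fits in a few lines of standard-library Python. This is only a sketch under a simplifying assumption: every `.`, `!`, or `?` followed by whitespace ends a sentence. The function name `split_sentences` is made up for this example.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(split_sentences("Hello everyone. Welcome to NLP Lecture. You are studying NLP."))
# ['Hello everyone.', 'Welcome to NLP Lecture.', 'You are studying NLP.']
```

The rule breaks on abbreviations such as “Dr. Smith”, which is one reason trained sentence tokenizers are preferred in practice.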


# **5. Stop words**

Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. 

Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.
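Filtering stop words amounts to a membership test against a list. The sketch below uses a tiny hand-picked stop list purely for illustration; libraries such as NLTK ship much fuller lists, and the names `STOP_WORDS` and `remove_stop_words` are assumptions of this example.

```python
# Small hand-picked stop list for illustration only.
STOP_WORDS = {"in", "is", "an", "a", "the", "of", "to", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep the tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["NLP", "is", "an", "area", "of", "study", "in", "computer", "science"]
print(remove_stop_words(tokens))
# ['NLP', 'area', 'study', 'computer', 'science']
```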

# **6. Stemming**

Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. 

![](https://i.imgur.com/ugo42th.jpg)

For example, the words **“helping” and “helper” share the root “help.”**

Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. 


**Understemming and overstemming are two ways stemming can go wrong:**

- **Understemming** happens when two related words should be reduced to the same stem but aren’t. This is a false negative.
- **Overstemming** happens when two unrelated words are reduced to the same stem even though they shouldn’t be. This is a false positive.

The Porter stemming algorithm dates from 1979, so it’s a little on the older side. The Snowball stemmer, which is also called Porter2, is an improvement on the original and is also available through NLTK, so you can use that one in your own projects. 

It’s also worth noting that the purpose of the Porter stemmer is not to produce complete words but to find variant forms of a word.

Fortunately, you have some other ways to reduce words to their core meaning, such as lemmatizing, which you’ll see later.
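To see how understemming and overstemming arise, consider a deliberately naive suffix stripper. This is not the Porter or Snowball algorithm; the four-suffix rule set and the function name `naive_stem` are invented for this sketch.

```python
# Hypothetical rule set, far smaller than what the Porter stemmer uses.
SUFFIXES = ("ing", "er", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("helping"), naive_stem("helper"))        # help help
print(naive_stem("discovery"), naive_stem("discovered"))  # discovery discover
print(naive_stem("number"), naive_stem("numb"))           # numb numb
```

The second line shows understemming: two related words end up with different stems. The third shows overstemming: “number” and “numb” are unrelated but collapse to the same stem.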

# **7. Lemmatizing**

Like stemming, lemmatizing reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word.

![](https://i.imgur.com/WySc1xi.png)
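Conceptually, a lemmatizer looks words up rather than chopping off suffixes. The toy version below makes that visible with a four-entry lookup table that is entirely hypothetical; real lemmatizers consult a full dictionary and usually take the word's part of speech into account.

```python
# Hypothetical lookup table, invented for this sketch.
LEMMAS = {"scarves": "scarf", "went": "go", "mice": "mouse", "helping": "help"}

def lemmatize(word, lemmas=LEMMAS):
    """Return the dictionary form if the word is known, else the word unchanged."""
    word = word.lower()
    return lemmas.get(word, word)

print([lemmatize(w) for w in ["Scarves", "went", "mice", "table"]])
# ['scarf', 'go', 'mouse', 'table']
```

Note that “went” maps to “go”, something no suffix-stripping stemmer could produce.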

# **8. Tagging Parts of Speech**

Part of speech is a grammatical term that deals with the roles words play when you use them together in sentences. **Tagging parts of speech, or POS tagging,** is the task of labeling the words in your text according to their part of speech.

In English, there are eight parts of speech:

![](https://i.imgur.com/cLrLYZJ.jpg)


**Here’s a summary that you can use to get started with NLTK’s POS tags:**

![](https://i.imgur.com/Ht9ct7x.jpg)

# **9. Feature Generation using TF-IDF**

In **Term Frequency (TF)**, you count the number of times each word occurs in each document. The main issue with raw term frequency is that it gives more weight to longer documents. Term frequency is essentially the output of the BoW model.

**Inverse Document Frequency (IDF)** measures how much information a given word provides across the documents. It is the logarithm of the total number of documents divided by the number of documents that contain the word.

TF-IDF (Term Frequency-Inverse Document Frequency) normalizes the document-term matrix. It is the product of TF and IDF. A word with a high TF-IDF score in a document occurs frequently in that document and is mostly absent from the other documents, so it acts as a signature word for that document.

![](https://i.imgur.com/U7V1mIr.jpg)

where 

d refers to a document, 

N is the total number of documents, 

df is the number of documents with term t.

## **Bag of Words (BoW) Model**

The Bag of Words (BoW) model is the simplest form of text representation in numbers. Like the term itself, we can represent a sentence as a bag of words vector (a string of numbers).

Consider these three short movie reviews:

- Review 1: This movie is very scary and long
- Review 2: This movie is not scary and is slow
- Review 3: This movie is spooky and good

We will first build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’,  ‘slow’, ‘spooky’,  ‘good’.

We can now take each of these words and mark their occurrence in the three movie reviews above with 1s and 0s. This will give us 3 vectors for 3 reviews:

![](https://i.imgur.com/2VK6OKb.jpg)
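The vocabulary and vectors above can be reproduced with a short script. This is a sketch under two assumptions: words are lowercased and split on whitespace, and the vocabulary is ordered by first appearance. The `binary` parameter is an addition for this example: it marks presence with 1s and 0s as described above, while `binary=False` returns raw counts instead.

```python
from collections import Counter

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

# Build the vocabulary in order of first appearance, ignoring case.
vocabulary = []
for review in reviews:
    for word in review.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

def bow_vector(text, vocab, binary=True):
    """Mark each vocabulary word as present (1) or absent (0), or count it."""
    counts = Counter(text.lower().split())
    return [min(counts[w], 1) if binary else counts[w] for w in vocab]

print(len(vocabulary))  # 11
for review in reviews:
    print(bow_vector(review, vocabulary))
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0]
# [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
```

With `binary=False`, the third entry of the Review 2 vector becomes 2 because ‘is’ occurs twice in that review.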


We will again use the same vocabulary we had built in the Bag-of-Words model to show how to calculate the TF for Review #2:

Review 2: This movie is not scary and is slow

Here,

- Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’,  ‘slow’, ‘spooky’,  ‘good’
- Number of words in Review 2 = 8
- TF for the word ‘this’ = (number of times ‘this’ appears in review 2)/(number of terms in review 2) = 1/8

Similarly,

![](https://i.imgur.com/VpLgmYX.jpg)

We can calculate the IDF values for all the words in Review 2:

IDF(‘this’) =  log(number of documents/number of documents containing the word ‘this’) = log(3/3) = log(1) = 0

Similarly,

![](https://i.imgur.com/zT5furz.jpg)

Hence, we see that words like “is”, “this”, “and”, etc., are reduced to 0 and have little importance; while words like “scary”, “long”, “good”, etc. are words with more importance and thus have a higher value.

We can now compute the TF-IDF score for every word in Review 2. Words with a higher score are more important, and those with a lower score are less important.

TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0

Similarly,

![](https://i.imgur.com/9y8xQoV.jpg)
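The whole calculation for Review 2 can be checked with a few lines of Python. This sketch follows the definitions used above (TF as count divided by document length, IDF as the log of total documents over documents containing the term). The base of the logarithm is not stated in the text, so base 10 is an assumption here; a different base would rescale the non-zero scores but leave the zeros unchanged.

```python
import math

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]
docs = [r.lower().split() for r in reviews]

def tf(term, doc):
    """Occurrences of the term divided by the number of words in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log10(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

review2 = docs[1]
for term in dict.fromkeys(review2):  # unique words, in order of appearance
    print(f"{term:6} tf={tf(term, review2):.3f} "
          f"idf={idf(term, docs):.3f} tf-idf={tf_idf(term, review2, docs):.3f}")
```

Words that occur in all three reviews, such as ‘this’, get a score of 0, while words unique to Review 2, such as ‘not’ and ‘slow’, get the highest scores.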



# **10. Naïve Bayes Algorithm**

Naive Bayes classifiers are built on Bayesian classification methods. These rely on Bayes's theorem, which is an equation describing the relationship of conditional probabilities of statistical quantities. In Bayesian classification, we're interested in finding the probability of a label given some observed features, which we can write as P(L | features). Bayes's theorem tells us how to express this in terms of quantities we can compute more directly:


![](https://i.imgur.com/s8VSM3x.png)
![](https://i.imgur.com/O9fMgga.png)

If we are trying to decide between two labels—let's call them L1 and L2—then one way to make this decision is to compute the ratio of the posterior probabilities for each label:
![](https://i.imgur.com/Oezwnrk.jpg)

All we need now is some model by which we can compute P(features | Li) for each label. Such a model is called a generative model because it specifies the hypothetical random process that generates the data. Specifying this generative model for each label is the main piece of the training of such a Bayesian classifier. 

![](https://i.imgur.com/NiXudcR.png)

![](https://i.imgur.com/YpLt2Rr.png)
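As a rough illustration of how such a classifier can be trained and applied to text, here is a minimal sketch of a multinomial naive Bayes model written from scratch. Several things are assumptions of this example rather than part of the discussion above: the toy training data is invented, words are treated as independent given the label (the "naive" assumption), and add-one (Laplace) smoothing is used so that an unseen word does not produce a zero probability.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled toy data, invented for this sketch.
train = [
    ("this movie is good", "pos"),
    ("good and spooky fun", "pos"),
    ("this movie is slow", "neg"),
    ("long and slow and boring", "neg"),
]

def fit(examples):
    """Collect label counts, per-label word counts, and the vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def log_posterior(text, label, model):
    """log P(L) + sum of log P(word | L), omitting the shared term log P(features)."""
    label_counts, word_counts, vocab = model
    total_words = sum(word_counts[label].values())
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for word in text.lower().split():
        if word in vocab:  # words never seen in training are skipped
            # Add-one smoothing keeps every probability above zero.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score

def predict(text, model):
    """Pick the label with the highest (unnormalized) log posterior."""
    return max(model[0], key=lambda label: log_posterior(text, label, model))

model = fit(train)
print(predict("a good movie", model))   # pos
print(predict("slow and long", model))  # neg
```

Comparing the two log posteriors is equivalent to checking whether the ratio of posterior probabilities for the two labels is above or below 1, and P(features) cancels in that ratio, which is why it is never computed.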