## Week 1 - Introduction to NLP and Deep Learning

**Learning Agenda of this Notebook:**

- Overview of Natural Language Processing (NLP) and its applications
- Introduction to deep learning and its application in NLP
- NLP tasks such as text classification, sentiment analysis, language translation and named entity recognition
- Text data preprocessing techniques such as tokenization, stemming, lemmatization, stop word removal, text normalization and text standardization
- Understanding of Parsing and its role in NLP: dependency parsing and constituency parsing
- Introduction to regular expressions and pattern matching in text data preprocessing
- Familiarization with commonly used NLP datasets and their respective tasks
- Introduction to popular NLP libraries like NLTK, spaCy and TextBlob
- Understanding of Text similarity and Text distance
- Understanding of Corpus and Corpus Linguistics
- Understanding of N-Grams and its importance in NLP
- Introduction to data annotation and data labeling in NLP tasks
- Understanding of Text data characteristics: structured and unstructured data



### Machine learning and deep learning




<figure>
    
    
<img src="images/ai-vs-machine-learning-vs-deep-learning.png" width ="400px" height ="700px">
    
Image source: [Link to source](https://docs.microsoft.com/en-us/azure/machine-learning/media/concept-deep-learning-vs-machine-learning/ai-vs-machine-learning-vs-deep-learning.png)
</figure>



### Overview of Natural Language Processing (NLP) and its applications

<figure>
<img src="images/j1.jpg">
    
Image source: [Link to source](https://www.google.com/url?sa=i&url=https%3A%2F%2Fsubscription.packtpub.com%2Fbook%2Fdata%2F9781838550295%2F1%2Fch01lvl1sec04%2Fapplications-of-natural-language-processing&psig=AOvVaw1jTunC3MeeYfHfFcWzuozV&ust=1674721897007000&source=images&cd=vfe&ved=0CBIQ3YkBahcKEwjg-u-wp-L8AhUAAAAAHQAAAAAQAw)
</figure>


### Introduction to deep learning and its application in NLP

<img src="images/DL_applications.png" height=400px width=400px>


Image Source: [Link to source](https://www.google.com/url?sa=i&url=https%3A%2F%2Fjournalofbigdata.springeropen.com%2Farticles%2F10.1186%2Fs40537-021-00444-8&psig=AOvVaw03mY82rWFCsvu-GsSQzlYd&ust=1675844987191000&source=images&cd=vfe&ved=0CBIQ3YkBahcKEwig7cvI_4L9AhUAAAAAHQAAAAAQGQ)

### Visual recognition using deep learning
<img src ="images/4aff4eece55dedcc202f316e15ef037a.jpg">

# NLP TASKS

<img src="images/nlp t.png">

Image source: [Link to source](https://medium.com/nlplanet/what-tasks-can-i-solve-with-nlp-today-1b1823cc8cdf)



## Text Classification

Text classification is the task of automatically assigning a text document to one or more pre-defined categories (a.k.a. classes), based on its content.


### Sentiment Analysis

Sentiment analysis is a text classification task that determines whether a text expresses a positive sentiment (e.g. “the dinner was nice”) or a negative sentiment (e.g. “the dinner was awful”).
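To make the idea concrete, here is a minimal sketch of a lexicon-based sentiment classifier. It is only an illustration: the word lists `POSITIVE_WORDS` and `NEGATIVE_WORDS` and the function `classify_sentiment` are hypothetical names invented for this example, and practical systems usually learn from labeled data instead of relying on a hand-written lexicon.

```python
# Toy lexicon-based sentiment classifier (illustrative only).
# The word lists are hypothetical and far too small for real use.
POSITIVE_WORDS = {"nice", "good", "great", "excellent", "delicious"}
NEGATIVE_WORDS = {"awful", "bad", "terrible", "poor", "disgusting"}


def classify_sentiment(text):
    """Return 'positive', 'negative' or 'neutral' based on lexicon hits."""
    # Lowercase each word and strip surrounding punctuation.
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    positive_hits = sum(w in POSITIVE_WORDS for w in words)
    negative_hits = sum(w in NEGATIVE_WORDS for w in words)
    score = positive_hits - negative_hits
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_sentiment("The dinner was nice"))   # positive
print(classify_sentiment("The dinner was awful"))  # negative
```

A sketch like this ignores negation ("not nice") and context, which is one reason learned classifiers are generally preferred.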

<img src="images/text classify1.png" width ="700px" height ="1000px">

<img src="images/text classify 2.png" width ="700px" height ="1000px">

Image source: [Link to source](https://medium.com/nlplanet/what-tasks-can-i-solve-with-nlp-today-1b1823cc8cdf)




<img src="images/Sentiment.png" width ="400px" height ="400px">




### Information Retrieval and Semantic Search


Information retrieval (IR) is focused on understanding the user’s intent (typically expressed with a query) and providing the most relevant results. Searches can run over the full text of documents or over their metadata. Traditional information retrieval systems work by (1) efficiently matching the text of queries against documents, and (2) weighting words by importance, so that rare, informative terms count for more than very common ones.
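The two ingredients of a traditional system, matching query words against documents and weighting words by importance, can be sketched with a small TF-IDF style ranker. This is a simplified illustration under stated assumptions: the three documents are made up, the helper names (`tokenize`, `idf`, `score`, `search`) are hypothetical, and the smoothed IDF formula is just one common variant.

```python
import math
from collections import Counter

# Hypothetical mini-collection, invented for illustration.
documents = [
    "the dinner was nice",
    "the dinner was awful",
    "the weather is nice today",
]


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


doc_tokens = [tokenize(d) for d in documents]


def idf(term):
    """Smoothed inverse document frequency: rarer terms weigh more."""
    document_frequency = sum(term in tokens for tokens in doc_tokens)
    return math.log((1 + len(doc_tokens)) / (1 + document_frequency)) + 1.0


def score(query, tokens):
    """Sum of term frequency times IDF over the query terms."""
    counts = Counter(tokens)
    return sum(counts[term] * idf(term) for term in tokenize(query))


def search(query):
    """Return the documents ordered from most to least relevant."""
    order = sorted(
        range(len(documents)),
        key=lambda i: score(query, doc_tokens[i]),
        reverse=True,
    )
    return [documents[i] for i in order]


print(search("awful dinner")[0])  # the dinner was awful
```

Because "the" appears in every document, its weight is lower than that of "awful", which appears in only one.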

<img src="images/information retriving.png" width ="700px" height ="1000px">

Image source: [Link to source](https://medium.com/nlplanet/what-tasks-can-i-solve-with-nlp-today-1b1823cc8cdf)



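Semantic search is related to information retrieval in that it is also concerned with finding the best match for a user’s query. Rather than relying on exact keyword overlap, it compares the meaning of the query with the meaning of the documents, typically through vector embeddings.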

<img src="images/semantic search.png" width ="700px" height ="1000px">

Image source: [Link to source](https://medium.com/nlplanet/what-tasks-can-i-solve-with-nlp-today-1b1823cc8cdf)





### Text Summarization

Text summarization is the process of generating a short, accurate, and representative summary of a longer text document. The goal of text summarization is to create a condensed version of the original document that captures its essential information while being significantly shorter.

Examples of text summarization use cases are:

- Extracting key information from public news articles and producing insights such as trends and news spotlights.
- Allowing clustering of documents by their relevant content.


<img src="images/summary.png" width ="700px" height ="1000px">

Image source: [Link to source](https://turbolab.in/types-of-text-summarization-extractive-and-abstractive-summarization-basics/)





### Question Answering


Question Answering (QA) is focused on techniques to automatically answer questions posed in natural language.
Broadly speaking, there are two types of QA systems:

- extractive
- generative

#### Extractive 

Extractive Question Answering takes a question as input and selects the answer from existing text, for example a span of a given context passage or the most relevant entry in a large database of potential answers.


<img src="images/extraction.png" width ="700px" height ="1000px">

Image source: [Link to source](https://medium.com/nlplanet/what-tasks-can-i-solve-with-nlp-today-1b1823cc8cdf)

#### Generative

Generative Question Answering, on the other hand, generates an answer from scratch based on the question and sometimes also on additional context information.



<img src="images/generative.png" width ="700px" height ="1000px">

Image source: [Link to source](https://medium.com/nlplanet/what-tasks-can-i-solve-with-nlp-today-1b1823cc8cdf)





###  Named-Entity Recognition

Named-Entity Recognition is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations and locations, time expressions, quantities, monetary values, percentages, etc.
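For a few of these categories, such as monetary values and percentages, simple pattern matching with regular expressions already gives a feel for the task. The sketch below is a rule-based illustration only, not how modern NER systems work (those are typically statistical or neural models). The patterns, the label names `MONEY` and `PERCENT`, and the function `find_entities` are assumptions made for this example.

```python
import re

# Hypothetical patterns for two entity types; real NER models are learned.
ENTITY_PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
}


def find_entities(text):
    """Return (entity_text, label) pairs in order of appearance."""
    found = []
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    found.sort()  # order by position in the text
    return [(entity, label) for _start, entity, label in found]


print(find_entities("Apple paid $1,500.50 for a 12.5% stake."))
# [('$1,500.50', 'MONEY'), ('12.5%', 'PERCENT')]
```

Names of persons, organizations and locations cannot be captured reliably with patterns like these, which is where trained models come in.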


<img src="images/named reco.png" width ="700px" height ="1000px">

Image source: [Link to source](https://medium.com/nlplanet/what-tasks-can-i-solve-with-nlp-today-1b1823cc8cdf)







### Knowledge Graphs

<img src="images/text graphs.png" width ="700px" height ="1000px">

Image source: [Link to source](https://medium.com/nlplanet/what-tasks-can-i-solve-with-nlp-today-1b1823cc8cdf)






### Text Generation

Text Generation is the task of automatically creating natural language text similar to that produced by humans.


<img src="images/text generation.png" width ="700px" height ="1000px">

Image source: [Link to source](https://medium.com/nlplanet/what-tasks-can-i-solve-with-nlp-today-1b1823cc8cdf)



### Speech-to-Text and Text-to-Speech

- In Speech-to-Text (STT), also called Automatic Speech Recognition (ASR), the computer listens to a person speaking and converts the sounds into written words. 
- In Text-to-Speech (TTS), the computer reads written text and converts it into spoken words.

<img src="images/speech -text.png" width ="400px" height ="400px">

Image source: [Link to source](https://medium.com/nlplanet/what-tasks-can-i-solve-with-nlp-today-1b1823cc8cdf)

<img src="images/speech to text.png" width ="700px" height ="1000px">

Image source: [Link to source](https://www.kardome.com/blog-posts/difference-speech-and-voice-recognition)


<img src="images/speech.png" width ="700px" height ="1000px">

Image source: [Link to source](https://www.google.com/search?q=Speech-to-Text+and+Text-to-Speech&sxsrf=AJOqlzVmL0dFLGQbjABkyMH3F9oxTTdOxw:1677599011146&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjA4Ympx7j9AhXIs6QKHR-uCE8Q_AUoAXoECAEQAw&biw=1536&bih=735&dpr=1.25)



### Image Search


Image search is the task of finding images based on their visual content using textual queries. It is a multimodal task because it involves data in two modalities: text and image.

Modern image search engines are built in much the same way as semantic text search engines:

- All the images are embedded and represented as vectors.
- The query is embedded as well.
- The best results are the images with the highest vector similarity to the query.
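The three steps above can be sketched with cosine similarity over small vectors. The embeddings here are hypothetical 3-dimensional values chosen by hand purely for illustration; in a real system both the image vectors and the query vector would come from a trained multimodal model. The names `cosine_similarity` and `rank_images` are likewise invented for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical image embeddings (a real model would produce these).
image_embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.7, 0.6, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a text query such as "a photo of a cat".
query_embedding = [1.0, 0.0, 0.1]


def rank_images(query, embeddings):
    """Return image names sorted by decreasing similarity to the query."""
    return sorted(
        embeddings,
        key=lambda name: cosine_similarity(query, embeddings[name]),
        reverse=True,
    )


print(rank_images(query_embedding, image_embeddings))
# ['cat.jpg', 'dog.jpg', 'car.jpg']
```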

<img src="images/image search.png" width ="700px" height ="1000px">

Image source: [Link to source](https://medium.com/nlplanet/what-tasks-can-i-solve-with-nlp-today-1b1823cc8cdf)





### Text data preprocessing techniques


<img src="images/text processing.png" width ="500px" height ="500px">

Image source: [Link to source](https://basilkjose.medium.com/data-preprocessing-natural-language-competition-processing-dcbbf9d014e8)






###  Tokenization, Stemming, Lemmatization, stop word removal, Text normalization and Text standardization


#### Text normalization 

Text normalization reduces variations in word forms to a common form when the variations mean the same thing. For example, US and U.S.A both become USA, while Product, product and products all become product.
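A minimal sketch of this idea is shown below. The lookup table `CANONICAL_FORMS` and the crude plural rule are assumptions made for illustration; real pipelines usually combine larger dictionaries with stemming or lemmatization.

```python
# Hypothetical table mapping known variants to one canonical form.
# Lookups are case-sensitive so the pronoun "us" is not turned into "USA".
CANONICAL_FORMS = {"US": "USA", "U.S.A": "USA", "U.S.A.": "USA"}


def normalize_token(token):
    """Map a token to a common form (toy rules, illustration only)."""
    if token in CANONICAL_FORMS:
        return CANONICAL_FORMS[token]
    word = token.lower()
    # Crude plural handling: drop a final "s" from longer words,
    # but leave words such as "class" untouched.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


for token in ["US", "U.S.A", "Product", "product", "products"]:
    print(token, "->", normalize_token(token))
```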

<img src="images/text normal.png" width ="700px" height ="700px">

Image source: [Link to source](https://devopedia.org/text-normalization)


#### Tokenization

Tokenization is the process of breaking a text into its smallest meaningful units, called tokens. Words, numbers, and punctuation marks can all be considered tokens.
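As a rough illustration, a regular expression is enough to split a sentence into word, number and punctuation tokens. This is a sketch using only Python's standard `re` module; the pattern and the name `tokenize` are choices made for this example, and library tokenizers such as those in NLTK or spaCy handle many more edge cases (for instance contractions like "don't", which this pattern would split into three tokens).

```python
import re

# A run of word characters, or any single non-space, non-word character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The dinner cost 25 dollars, and it was nice!"))
# ['The', 'dinner', 'cost', '25', 'dollars', ',', 'and', 'it', 'was', 'nice', '!']
```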

<img src="images/token.png"  width ="500px" height ="500px">

Image source: [Link to source](https://medium.com/mlearning-ai/nlp-tokenization-stemming-lemmatization-and-part-of-speech-tagging-9088ac068768)


Once a text has been split into tokens, each token can be reduced to a base form. There are two common approaches:

- Stemming
- Lemmatization

#### Stemming

Stemming is the simpler of the two approaches. With stemming, words are reduced to their word stems, usually by stripping affixes with heuristic rules. A word stem need not be identical to the dictionary-based morphological root; it is simply a form equal to or shorter than the original word.

#### Lemmatization 

The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead, it uses lexical knowledge bases to get the correct base forms of words.
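The difference between the two approaches can be sketched in a few lines. Both functions below are toy stand-ins written for this example: `toy_stem` strips a handful of suffixes and is not the Porter algorithm, and `toy_lemmatize` looks words up in a tiny hypothetical table (`LEMMA_TABLE`) in place of a real lexical knowledge base such as WordNet.

```python
# Suffixes are tried in this order; the list is a hypothetical minimum.
SUFFIXES = ("ing", "ies", "ed", "es", "s")


def toy_stem(word):
    """Chop off the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Tiny stand-in for a lexical knowledge base.
LEMMA_TABLE = {"studies": "study", "running": "run", "better": "good", "was": "be"}


def toy_lemmatize(word):
    """Look up the dictionary base form; fall back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for word in ["studies", "running", "better", "was"]:
    print(word, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The stemmer produces forms such as "stud" and "runn" that are not dictionary words, while the lookup returns "study" and "run", and can also relate irregular forms such as "better" and "good".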


<img src="images/library s and t.png"  width ="500px" height ="500px">

Image source: [Link to source](https://medium.com/mlearning-ai/nlp-tokenization-stemming-lemmatization-and-part-of-speech-tagging-9088ac068768)




<img src="images/steming.JPG"  width ="700px" height ="700px">

Image source: [Link to source](https://medium.com/mlearning-ai/nlp-tokenization-stemming-lemmatization-and-part-of-speech-tagging-9088ac068768)














<img src="images/text 3.png"  width ="500px" height ="500px">

<img src="images/text 2.png"  width ="500px" height ="500px">


###  Parsing and its role in NLP

Parsing is the task of converting a sentence into a tree that captures syntactic relations among words.

<img src="images/Parsing process.png" width ="400px" height ="400px">

Image source: [Link to source](http://www.warse.org/IJACST/static/pdf/file/ijacst02432015.pdf)

We will review two distinct approaches to parsing:

- Constituent-based parsing: The output is a tree of constituents; leaves are words; nodes directly above the leaves are pre-terminal nodes (tagged with part-of-speech labels); interior nodes are constituents called phrases.

- Dependency-based parsing: All the nodes in the tree are words; links between words are labeled by syntactic function.


The following figures illustrate the two representations:

<img src="images/dvsc.png" width ="600px" height ="600px">

Image source: [Link to source](https://linguistics.stackexchange.com/questions/7280/why-is-constituency-needed-since-dependency-gets-the-job-done-more-easily-and-e)




<img src="images/constituent-dependency.png" width ="600px" height ="600px">

Image source: [Link to source](https://www.cs.bgu.ac.il/~elhadad/nlp13/nlp03.html)











 

### A closer look at constituency parsing and dependency parsing

#### Constituency Parsing

The constituency parse tree is based on the formalism of context-free grammars. In this type of tree, the sentence is divided into constituents, that is, sub-phrases that belong to a specific category in the grammar.

In English, for example, the phrases “a dog”, “a computer on the table” and “the nice sunset” are all noun phrases, while “eat a pizza” and “go to the beach” are verb phrases.

The grammar provides a specification of how to build valid sentences, using a set of rules. As an example, the rule VP → V NP means that we can form a verb phrase (VP) from a verb (V) followed by a noun phrase (NP).

While we can use these rules to generate valid sentences, we can also apply them the other way around, in order to extract the syntactical structure of a given sentence according to the grammar.
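Applying grammar rules "the other way around" can be sketched with a small top-down parser. The grammar and lexicon below are a hypothetical toy covering only the sentence "I saw a fox"; the category names (`PRON`, `DET`, `N`, `V`) and the functions `parse`, `expand` and `parse_sentence` are assumptions for this example and need not match the labels used by any particular treebank or figure.

```python
# Toy context-free grammar: each symbol maps to its possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["PRON"], ["DET", "N"]],
    "VP": [["V", "NP"]],
}
# Toy lexicon: word -> part-of-speech tag (pre-terminal symbol).
LEXICON = {"i": "PRON", "saw": "V", "a": "DET", "fox": "N"}


def parse(symbol, tokens, start):
    """Yield (tree, next_position) for each way symbol matches at start."""
    if symbol not in GRAMMAR:  # pre-terminal: must match one word
        if start < len(tokens) and LEXICON.get(tokens[start].lower()) == symbol:
            yield (symbol, tokens[start]), start + 1
        return
    for rule in GRAMMAR[symbol]:
        yield from expand(rule, tokens, start, [symbol])


def expand(rule, tokens, position, partial):
    """Match the symbols of one rule left to right, collecting subtrees."""
    if not rule:
        yield tuple(partial), position
        return
    for subtree, next_position in parse(rule[0], tokens, position):
        yield from expand(rule[1:], tokens, next_position, partial + [subtree])


def parse_sentence(sentence):
    """Return the first tree that covers the whole sentence, or None."""
    tokens = sentence.split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


print(parse_sentence("I saw a fox"))
```

The result is a nested tuple whose first element is the constituent label, so the verb phrase appears as `("VP", ("V", "saw"), ("NP", ...))`. This naive strategy would loop forever on left-recursive rules, which is one reason practical parsers use chart-based algorithms.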


Let’s dive straight into an example of a constituency parse tree for the simple sentence, “I saw a fox”:

<img src="images/constituency_parse_tree-1.png" width ="500px" height ="500px">

Image source: [Link to source](https://www.baeldung.com/wp-content/uploads/sites/4/2020/06/constituency_parse_tree-1.png)


#### Dependency Parsing

As opposed to constituency parsing, dependency parsing doesn’t make use of phrasal constituents or sub-phrases. Instead, the syntax of the sentence is expressed in terms of dependencies between words — that is, directed, typed edges between words in a graph.

More formally, a dependency parse tree is a graph G = (V, E) where the set of vertices V contains the words in the sentence, and each edge in E connects two words. The graph must satisfy three conditions:

- There has to be a single root node with no incoming edges.
- For each node v in V, there must be a path from the root R to v.
- Each node except the root must have exactly 1 incoming edge.

Additionally, each edge in E has a type, which defines the grammatical relation that occurs between the two words.
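The three conditions can be checked mechanically. In the sketch below, edges are written as `(head, dependent, relation)` triples using word positions; this representation, the function name `is_valid_dependency_tree`, and the relation labels `nsubj`, `dobj` and `det` chosen for "I saw a fox" are assumptions made for illustration.

```python
def is_valid_dependency_tree(words, edges):
    """Check the three tree conditions for (head, dependent, relation) edges."""
    incoming = {i: 0 for i in range(len(words))}
    children = {i: [] for i in range(len(words))}
    for head, dependent, _relation in edges:
        incoming[dependent] += 1
        children[head].append(dependent)

    # Condition 1: exactly one root, i.e. one node without incoming edges.
    roots = [i for i, count in incoming.items() if count == 0]
    if len(roots) != 1:
        return False
    root = roots[0]

    # Condition 3: every non-root node has exactly one incoming edge.
    if any(count != 1 for i, count in incoming.items() if i != root):
        return False

    # Condition 2: every node is reachable from the root.
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children[node])
    return len(seen) == len(words)


words = ["I", "saw", "a", "fox"]
# Assumed analysis: "saw" is the root; labels are illustrative.
edges = [(1, 0, "nsubj"), (1, 3, "dobj"), (3, 2, "det")]
print(is_valid_dependency_tree(words, edges))  # True
```

The reachability check matters even when the other two conditions hold: two words pointing at each other form a cycle in which each has one incoming edge, yet neither is connected to the root.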

Let’s see what the previous example looks like if we perform dependency parsing:

<img src="images/dependency_parse_tree.png" width ="700px" height ="700px">

Image source: [Link to source](https://www.baeldung.com/wp-content/uploads/sites/4/2020/06/dependency_parse_tree.png)




### Understanding of N-Grams and its importance in NLP

An N-gram is a contiguous sequence of N items from a given sample of text or speech. In Natural Language Processing, N-grams are widely used for text analysis. An N-gram of size 1 is referred to as a “unigram”, size 2 as a “bigram”, and size 3 as a “trigram”.

The number of N-grams in a sentence is:

Number of N-grams = X - (N - 1)

where X is the total number of words in the sentence. For example, a 5-word sentence contains 5 - (2 - 1) = 4 bigrams.
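The following sketch generates word-level N-grams with a sliding window and checks the count against the formula X - (N - 1). The function name `ngrams` and the example sentence are choices made for this illustration; libraries such as NLTK provide their own helpers for the same purpose.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the dinner was very nice".split()  # X = 5 words

for n in (1, 2, 3):
    grams = ngrams(tokens, n)
    # The count matches X - (N - 1) whenever N <= X.
    assert len(grams) == len(tokens) - (n - 1)
    print(n, grams)
```

When N is larger than the number of words, the function returns an empty list, so in that case the formula should be read as giving zero rather than a negative count.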

<img src="images/n_gram_ex.png">