  
  
# Chapter 2: NLP tools and libraries

<img src="https://user-images.githubusercontent.com/7065401/55025843-7d99a280-4fe0-11e9-938a-4879d95c4130.png"
    style="width:150px; float: right; margin: 0 40px 40px 40px;"></img>
    
<img src="https://www.searchenginejournal.com/wp-content/uploads/2020/08/an-introduction-to-natural-language-processing-with-python-for-seos-5f3519eeb8368-1520x800.webp" style="width:300px; float: left; margin: 0 40px 40px 40px;"></img>

    


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# 1- Introduction

## a- Context

![image.png ](attachment:image.png)


![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Background knowledge is needed to correctly understand what is being said.
### ==> Semantics

## b- Specializations within NLP

**NLP** serves as the overarching umbrella term that encompasses various subfields, types, or branches of natural language processing and understanding. NLP includes tasks related to text analysis, language generation, and language understanding.

![image.png](attachment:image.png)

**NLU (Natural Language Understanding)** is a specialized area of NLP that emphasizes the ability to understand and interpret the meaning, intent, and context of human language. NLU focuses on tasks such as sentiment analysis, chatbot interactions, and named entity recognition.

**NLG (Natural Language Generation)** is another specialized area of NLP that focuses on the generation of human-like text or spoken language by computers. NLG involves tasks such as text summarization, content generation, and automated report writing.

**Both? Yes, it's possible:**

Examples:
- Chatbots and Virtual Assistants:
Chatbots need to understand user queries or statements (NLU) and generate coherent and contextually relevant responses (NLG).
- Conversational Recommender Systems:
Systems that engage users in a dialogue to recommend products, services, or content must understand user preferences (NLU) and present personalized recommendations (NLG).
- Content Summarization and Paraphrasing:
Systems that summarize long articles or rephrase sentences must grasp the content (NLU) and generate concise summaries or alternative phrasings (NLG).
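
As a rough illustration of how the two halves fit together, here is a toy rule-based chatbot in plain Python. It is only a sketch: the intent names, keyword sets, and the functions `understand`, `generate`, and `reply` are hypothetical, and real assistants use trained models rather than keyword matching.

```python
# Toy rule-based chatbot: an NLU step followed by an NLG step.
# All names below are hypothetical and chosen for illustration only.

INTENT_KEYWORDS = {
    "greeting": {"hello", "hi", "hey"},
    "weather": {"weather", "rain", "sunny"},
}

RESPONSE_TEMPLATES = {
    "greeting": "Hello! How can I help you today?",
    "weather": "I cannot check the forecast, but I hope it is sunny!",
    "unknown": "Sorry, I did not understand that.",
}


def understand(message: str) -> str:
    """NLU step: map a user message to an intent label via keyword matching."""
    words = {word.strip(".,!?").lower() for word in message.split()}
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "unknown"


def generate(intent: str) -> str:
    """NLG step: turn an intent label into a text reply using a template."""
    return RESPONSE_TEMPLATES[intent]


def reply(message: str) -> str:
    """Full round trip: understand the message, then generate a response."""
    return generate(understand(message))


print(reply("Hi there!"))
print(reply("Will it rain tomorrow?"))
```

The split into two functions mirrors the NLU/NLG distinction: either half could be swapped for a learned model without touching the other.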

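**Is ChatGPT NLU or NLG?**

When asked, ChatGPT answered: "I am primarily an NLU (Natural Language Understanding) model. My main function is to understand and generate human-like text based on the input and queries I receive. While I can perform some NLG (Natural Language Generation) tasks, my primary strength lies in NLU, which enables me to comprehend and provide informative responses."

Treat this self-description with caution: ChatGPT is a generative language model, so every reply it writes is also an NLG task. In practice it belongs to the "both" category.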

## 2- End-to-end NLP pipeline

A Natural Language Processing (NLP) pipeline is a series of data processing steps or components used to convert *raw text data* into a structured format that can be analyzed by NLP models and algorithms. These pipelines are designed to perform various tasks related to understanding and extracting information from text data. Here are the typical components of an NLP pipeline:

![image.png](attachment:image.png)

1. **Text Preprocessing**:

*a. Standard steps:*
   - Tokenization: Splitting text into individual words or tokens.
   - Lowercasing: Converting all text to lowercase for consistency.
   - Stopword Removal: Removing common words (e.g., "and," "the") that do not carry significant meaning.
   - Punctuation Removal: Removing punctuation marks.
   - Spell Correction: Correcting common spelling errors.

*b. Text Cleaning:*
   - Removing HTML tags, special characters, or unwanted symbols.
   - Handling contractions (e.g., "I'm" to "I am").
   - Removing or replacing URLs, email addresses, and other sensitive information.

*c. Text Normalization:*
   - Stemming: Reducing words to their root form (e.g., "running" to "run").
   - Lemmatization: Reducing words to their base or dictionary form (e.g., "better" to "good").

2. **Feature Extraction**:
   - Bag of Words (BoW): Creating a matrix of word frequencies or presence/absence.
   - Term Frequency-Inverse Document Frequency (TF-IDF): Assigning weights to words based on their importance in a document.
   - Word Embeddings: Representing words as dense vector representations (e.g., Word2Vec, GloVe).
   - Named Entity Recognition (NER):  Identifying and classifying entities (e.g., names of people, organizations, locations) in text.
   - Part-of-Speech Tagging (POS): Assigning grammatical tags (e.g., noun, verb, adjective) to words in a sentence. These POS tags can be used as features, especially in syntactic or grammatical analysis tasks.

3. **Modeling**:  
    - Building and training machine learning or deep learning models to perform a specific NLP task.
    - Selecting an appropriate model architecture.
    - Training the model on labeled data.
    - Tuning hyperparameters.
    - Evaluating performance using suitable metrics.
    - The goal is a model that accurately handles the task at hand, such as text classification, sentiment analysis, machine translation, or text generation.
    - This is not a linear process: evaluation results often send you back to earlier steps such as preprocessing or feature extraction.
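
The preprocessing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not a production cleaner: the stopword set, the contraction map, and the function name `preprocess` are assumptions made for the example, and the libraries introduced later in this chapter handle these steps far more robustly.

```python
import re
import string

# Tiny illustrative stopword list (assumption); real projects use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "i", "am", "at", "it"}

# Small hand-written contraction map (assumption, far from exhaustive).
CONTRACTIONS = {"i'm": "i am", "it's": "it is", "don't": "do not"}


def preprocess(text: str) -> list[str]:
    """Clean, normalize, and tokenize a raw string."""
    text = re.sub(r"<[^>]+>", " ", text)       # text cleaning: strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # text cleaning: strip URLs
    text = text.lower()                        # lowercasing
    for short, full in CONTRACTIONS.items():   # expand contractions (naive substring replace)
        text = text.replace(short, full)
    # punctuation removal
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                      # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal


raw = "<p>I'm reading the NLP guide at https://example.com, and it's great!</p>"
print(preprocess(raw))
```

Note that the order of the steps matters: contractions are expanded before punctuation is removed, otherwise the apostrophes they depend on would already be gone.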



The specific components and order of these steps in an NLP pipeline can vary depending on the task and the goals of the analysis. NLP pipelines are commonly used in applications like text classification, information retrieval, chatbots, and more. They play a crucial role in extracting meaningful insights from unstructured text data.
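
To make the feature-extraction step more concrete, the sketch below computes a Bag of Words and a basic TF-IDF score by hand on three tiny, already tokenized documents. The documents and the helper name `tf_idf` are made up for the example, and the formula is the plain `tf * log(N / df)` variant; libraries such as scikit-learn apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Three toy documents, already preprocessed into tokens (illustrative data).
docs = [
    ["nlp", "is", "fun"],
    ["nlp", "is", "useful"],
    ["python", "is", "fun"],
]

# Bag of Words: one term-count mapping per document.
bow = [Counter(doc) for doc in docs]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tf_idf(term: str, doc_index: int) -> float:
    """Term frequency times inverse document frequency (unsmoothed log variant)."""
    tf = bow[doc_index][term] / len(docs[doc_index])
    idf = math.log(len(docs) / doc_freq[term])
    return tf * idf


for term in sorted(doc_freq):
    print(term, [round(tf_idf(term, i), 3) for i in range(len(docs))])
```

A word such as "is", which occurs in every document, gets a weight of zero, while a word unique to one document gets the highest weight there. That is the sense in which TF-IDF measures importance.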

## 3- NLP libraries in Python: SpaCy vs NLTK

### a- What is NLTK?

![image.png](attachment:image.png)
**NLTK (Natural Language Toolkit)**: NLTK is a comprehensive library for NLP that provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet. It also includes a suite of text processing libraries for tasks such as tokenization, stemming, tagging, and parsing.





<br><br>
<br><br>
<br><br>
<br><br>



### b- What is spaCy?

![image.png](attachment:image.png)
**spaCy**: spaCy is a fast and efficient NLP library that offers pre-trained models for several languages. It's known for its ease of use and speed in tokenization, part-of-speech tagging, named entity recognition, and more.
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>


### c- spaCy vs NLTK: who wins the battle? 😉

spaCy is an object-oriented library. It is designed around the concept of processing pipelines: you create an `nlp` object that contains NLP components such as a tokenizer, a part-of-speech tagger, and a named entity recognizer, and calling it on a text returns a document object whose tokens carry their annotations.
NLTK, on the other hand, is a more modular, procedural, string-processing library: its functions take strings as input and return strings or lists of strings.

Here is an example to show the difference between them.

### Example:

#### i- Package Installation: SpaCy

In [None]:
# Install spaCy and download the small English model used below
!pip install spacy
!python -m spacy download en_core_web_sm



In [None]:
import spacy

In [None]:
# Check the installation by printing the installed spaCy version
print(spacy.__version__)


In [None]:
# Load the English language model
nlp = spacy.load("en_core_web_sm")

#### ii- Tokenization with SpaCy

In [None]:
# Sentence segmentation with spaCy: doc.sents yields one span per sentence
text = "Nlp is very enjoyable.It obviously has a fun part too. think of what makes NLP cool ?"
doc = nlp(text)

for sentence in doc.sents:
  print(sentence)

Nlp is very enjoyable.
It obviously has a fun part too.
think of what makes NLP cool ?


In [None]:
# Let's try to confuse spaCy: add "Mrs." and see whether it treats the abbreviation as the end of a sentence
text =" Mrs. Tasnime likes to tokenize texts! she is having a lot of fun."
doc = nlp(text)

for sentence in doc.sents:
  print(sentence)

 Mrs. Tasnime likes to tokenize texts!
she is having a lot of fun.


In [None]:
# Same text, now split into word-level tokens: "Mrs." should stay a single token
text =" Mrs. Tasnime likes to tokenize texts! she is having a lot of fun."
doc = nlp(text)

for token in doc:
  print(token.text)

 
Mrs.
Tasnime
likes
to
tokenize
texts
!
she
is
having
a
lot
of
fun
.


#### iii- NLTK

In [None]:
import nltk
# Punkt is a pre-trained sentence tokenizer provided by NLTK for various languages.
# Note: recent NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

#### iv-  Tokenization with NLTK

In [None]:
# Tip: press TAB after `nltk.tokenize.` to list the many available tokenizers.
# They offer customization options and let you select specific settings.
from nltk.tokenize import sent_tokenize
sent_tokenize("Nlp is very enjoyable. It obviously has a fun part too. think of what makes NLP cool ?")

['Nlp is very enjoyable.',
 'It obviously has a fun part too.',
 'think of what makes NLP cool ?']

Strings go in, strings come out ==> NLTK is a string-processing library.

In [None]:
from nltk.tokenize import word_tokenize
word_tokenize("Nlp is very enjoyable. It obviously has a fun part too. think of what makes NLP cool ?")

['Nlp',
 'is',
 'very',
 'enjoyable',
 '.',
 'It',
 'obviously',
 'has',
 'a',
 'fun',
 'part',
 'too',
 '.',
 'think',
 'of',
 'what',
 'makes',
 'NLP',
 'cool',
 '?']
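
The contrast between the two styles can be shown without either library. The sketch below is purely illustrative: `tokenize`, `Token`, and `Doc` are toy stand-ins written for this example and do not reproduce the real spaCy or NLTK implementations.

```python
from dataclasses import dataclass


def tokenize(text: str) -> list[str]:
    """Procedural style: a string goes in, a list of strings comes out."""
    return text.split()


@dataclass
class Token:
    """Object-oriented style: each token carries its own attributes."""
    text: str
    index: int

    @property
    def is_title(self) -> bool:
        # True when the token starts with an uppercase letter followed by lowercase.
        return self.text.istitle()


class Doc:
    """Container that builds Token objects once and exposes them for iteration."""

    def __init__(self, text: str):
        self.text = text
        self.tokens = [Token(word, i) for i, word in enumerate(text.split())]

    def __iter__(self):
        return iter(self.tokens)


sentence = "Tasnime likes to tokenize texts"
print(tokenize(sentence))
for token in Doc(sentence):
    print(token.index, token.text, token.is_title)
```

With plain strings, any extra information has to be computed by calling another function. With objects, the information travels with the token, which is the design choice the `nlp` pipeline builds on.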

#### Conclusion :
**SpaCy** is often preferred for production-level NLP applications due to its speed, efficiency, and ease of use. **NLTK** is more versatile and educational, making it a good choice for research and experimentation.

## 4- Other libraries for NLP in Python

![image.png](attachment:image.png)
1. **Gensim**: Gensim is a library for topic modeling and document similarity analysis. It's particularly useful for training word embeddings using Word2Vec and Doc2Vec models.


💡 *Word embeddings* are vector representations of words in a high-dimensional space, typically in the form of real-valued vectors.
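
A small numeric sketch may help here. The three-dimensional vectors below are invented by hand for illustration (real embeddings are learned from data and usually have hundreds of dimensions), and `cosine_similarity` is a helper written for this example.

```python
import math

# Hand-made toy vectors, for illustration only; real embeddings are learned.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print("king vs queen:", round(cosine_similarity(embeddings["king"], embeddings["queen"]), 3))
print("king vs apple:", round(cosine_similarity(embeddings["king"], embeddings["apple"]), 3))
```

Words that are used in similar contexts end up with vectors pointing in similar directions, so their cosine similarity is higher.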

💡 Gensim is considered fast because it is optimized for memory efficiency and performance. It relies on techniques such as streaming and incremental processing, which let it handle large datasets and train models quickly.

💡 How it works: streaming means processing data as it becomes available rather than loading the entire dataset into memory at once. The data is read and processed sequentially in smaller chunks.
<br><br><br><br><br><br><br>
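
The streaming idea can be sketched with a plain Python generator. This is not Gensim's implementation, only an illustration of the principle: `stream_lines` and `count_words` are hypothetical helpers, and `io.StringIO` stands in for a large file on disk.

```python
import io
from collections import Counter


def stream_lines(handle):
    """Yield one cleaned line at a time instead of reading the whole file."""
    for line in handle:
        line = line.strip()
        if line:  # skip empty lines
            yield line


def count_words(handle) -> Counter:
    """Update word counts incrementally, holding only one line in memory."""
    counts = Counter()
    for line in stream_lines(handle):
        counts.update(line.lower().split())
    return counts


# io.StringIO stands in for a large file opened with open(path).
fake_file = io.StringIO("NLP is fun\n\nNLP is useful\nPython is fun\n")
print(count_words(fake_file).most_common(2))
```

Because the generator hands over one line at a time, memory use stays roughly constant no matter how large the input file is.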

![image.png](attachment:image.png)

2. **scikit-learn**: While scikit-learn is primarily a general-purpose machine learning library, it includes classes for text feature extraction (`CountVectorizer`, `TfidfVectorizer`, `TfidfTransformer`) and for text classification (`LogisticRegression`, `MultinomialNB`, etc.), making it a valuable tool for NLP tasks.

<br><br><br><br><br><br><br>

![image.png](attachment:image.png)

3. **Transformers (Hugging Face)**: Transformers is a library by Hugging Face that provides pre-trained models for a wide range of NLP tasks, including BERT, GPT-2, and more. It's widely used for tasks like text classification, translation, and text generation.

💡 Hugging Face is a company and open-source community that has made significant contributions to the field of Natural Language Processing (NLP).

💡 BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model for natural language understanding and representation. It was introduced by Google AI researchers in a 2018 paper titled "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."



<br><br><br><br><br><br><br>

![image.png](attachment:image.png)

4. **TextBlob**: TextBlob is a simple NLP library that provides a high-level API for diving into common NLP tasks.


💡 TextBlob is an easy-to-use library for natural language processing (NLP) that offers a user-friendly way to perform common NLP tasks without needing in-depth knowledge of NLP techniques.

💡 An API (Application Programming Interface) is a set of rules and protocols that allows different software applications to communicate with each other.

<br><br><br><br><br><br><br>

![image.png](attachment:image.png)

5. **fastText**: fastText, developed by Facebook, is an open-source, free, lightweight library that allows users to learn text representations and perform text classification tasks.

<br><br><br><br><br><br><br>

![image.png](attachment:image.png)

6 and 7. **PyTorch and TensorFlow**: While not NLP libraries per se, these deep learning frameworks are commonly used for building and training neural network models for various NLP tasks. They offer flexibility for custom model development.
<br><br><br><br><br><br><br>


💡 PyTorch is a popular ML library for Python based on Torch, an earlier ML library implemented in C with a Lua interface. It was originally developed by **Facebook**, and is now used by Twitter, Salesforce, and many other major organizations and businesses.


💡 TensorFlow: Originally developed by **Google**, TensorFlow is an open-source library for high-performance numerical computation using data flow graphs.
Under the hood, it is a framework for creating and running computations involving tensors. Its principal application is in neural networks.




These libraries cover a wide range of NLP tasks, from basic text processing to advanced machine learning and deep learning tasks. The choice of library often depends on the specific NLP task you're working on and your preference for ease of use, performance, and available pre-trained models.