#  Natural Language Processing Part 1

## Yahia Chammami

## I- Natural Language Toolkit (NLTK)

### 1. Introduction to NLTK

**NLTK is a versatile library that is commonly used for various NLP tasks, including text classification, sentiment analysis, information extraction, and text generation. It serves as a valuable resource for researchers, developers, and data scientists working with natural language data in Python.**

**Corpora**: NLTK gives access to a large collection of linguistic corpora, such as a sample of the Penn Treebank, WordNet, and various text collections. These corpora serve as valuable resources for linguistic research and NLP tasks.

**Text Processing Libraries**: NLTK provides a wide range of text processing tools and modules, including tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and more. These tools enable you to preprocess and analyze text data effectively.

**Machine Learning**: NLTK includes utilities and algorithms for text classification, sentiment analysis, and other machine learning-based NLP tasks.

**Linguistic Resources**: The library offers access to lexical resources like WordNet, which is a large lexical database of English, and various language grammars and parsers.

**Text Corpora and Lexical Resources**: NLTK provides access to a variety of text corpora and lexical resources, including dictionaries and thesauri.

**Natural Language Processing and Linguistics Algorithms**: It includes various algorithms for tasks such as parsing, semantic reasoning, and machine translation.

**Visualization and Tools**: NLTK offers tools for visualization and exploration of linguistic data.

**Community and Resources**: NLTK has a large and active user community, along with extensive documentation and educational resources. It’s widely used in academia and industry for NLP research and applications.



### 2. Corpora (Corpus)
**In NLTK (Natural Language Toolkit), a corpus (plural: corpora) refers to a large and structured
collection of text or speech data. Corpora in NLTK are used for various natural language processing
(NLP) tasks, including linguistic research, text analysis, and the development of NLP models and
algorithms. These corpora are often used as training and testing data for NLP tasks, and they
provide researchers and practitioners with a broad range of textual resources for different languages
and domains.**

NLTK includes various built-in **corpora** that cover different domains, languages, and types of text
data. Some of the most commonly used corpora in NLTK include:

**Gutenberg Corpus**: A collection of classic literary texts, such as novels and essays, from the Project Gutenberg digital library.

**Brown Corpus**: A corpus of American English text from diverse sources, classified into numerous genres and used for linguistic research.

**Inaugural Address Corpus**: A collection of U.S. presidential inaugural addresses, useful for studying the language used in political speeches.

**WordNet**: While not a text corpus, WordNet is a lexical database that NLTK provides access to. It’s a resource for looking up word meanings, synonyms, antonyms, and other lexical information.

**Penn Treebank Corpus**: A sample of Wall Street Journal newspaper text with part-of-speech tags, syntactic parses, and other linguistic annotations.

**Reuters Corpus**: A collection of news articles from the Reuters news agency, often used for text classification and information retrieval tasks.

**Movie Reviews Corpus**: A collection of movie reviews categorized as positive or negative, frequently used for sentiment analysis and text classification.

**NPS Chat Corpus**: A corpus of chat-room posts. (NLTK’s Chat-80 data, by contrast, is a geographical knowledge base from a natural language query system, not a collection of conversations.)

**Web Text Corpus**: A corpus containing text from a variety of web sources.

**These corpora serve different purposes in linguistic research and NLP tasks, such as text classification, language modeling, sentiment analysis, and more. NLTK provides tools and methods
to access and work with these corpora, making it a valuable resource for NLP practitioners and
researchers.**

In [101]:
import nltk

# This will open the NLTK downloader GUI
nltk.download()


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

**The nltk.download() command opens the NLTK Data downloader, allowing you to download various datasets and resources that NLTK uses.**

In [102]:
from nltk.corpus import movie_reviews
# Get the categories (labels)
movie_reviews.categories()


['neg', 'pos']

**The movie_reviews corpus in NLTK contains movie reviews categorized as positive and negative. The categories() function is used to obtain the categories or labels associated with the reviews. In the case of the movie_reviews corpus, there are two categories: 'pos' for positive reviews and 'neg' for negative reviews.**

In [103]:
# Get the words from the movie_reviews corpus
all_words = movie_reviews.words()

# Print the first 10 words
print(all_words[:10])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party']


**The movie_reviews.words() function provides a flat list of words, and you can use it to perform various text processing tasks, such as frequency analysis, feature extraction, and more.**
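
As a small illustration of the frequency analysis mentioned above, the sketch below counts tokens with Python's `collections.Counter`. The `sample_words` list is a hypothetical stand-in so that the snippet runs without downloading the corpus; with the corpus installed, `movie_reviews.words()` could be used in its place.

```python
from collections import Counter

# Hypothetical stand-in for movie_reviews.words(); any iterable of tokens works.
sample_words = ["plot", ":", "two", "teen", "couples", "go", "to", "a",
                "church", "party", ",", "drink", "and", "then", "go", "to",
                "a", "party"]

# Keep alphabetic tokens only and normalise case before counting.
word_counts = Counter(w.lower() for w in sample_words if w.isalpha())

# The three most frequent words with their counts.
top_three = word_counts.most_common(3)
print(top_three)
```

Filtering with `isalpha()` keeps punctuation tokens such as `':'` out of the counts; whether that is desirable depends on the task.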

## II- Text Pre-Processing

Text pre-processing converts raw text into a cleaner, more uniform form that is easier for a
machine to process. Some standard practices for doing that are:

### 1. Tokenization

**Tokenization is the process of breaking a text into individual units called “tokens”, typically
words or sentences. NLTK (Natural Language Toolkit) provides various methods for tokenizing text
in Python. Here’s how you can tokenize text using NLTK:**

#### a) Using NLTK’s Default Tokenizer:
NLTK comes with a default tokenizer called word_tokenize, which can be used to split text
into words.

In [104]:
from nltk.tokenize import word_tokenize
text = "Yahia knows that Tokenization is the process of breaking down text into words or phrases."
tokens = word_tokenize(text)
print(tokens)

['Yahia', 'knows', 'that', 'Tokenization', 'is', 'the', 'process', 'of', 'breaking', 'down', 'text', 'into', 'words', 'or', 'phrases', '.']


#### b) Using NLTK’s Sentence Tokenizer:
If you want to split text into sentences, you can use NLTK’s sent_tokenize.

In [105]:
from nltk.tokenize import sent_tokenize
text = "This is the first sentence. And this is the second one!"
sentences = sent_tokenize(text)
print(sentences)

['This is the first sentence.', 'And this is the second one!']


#### c) Custom Tokenization:
You can also create a custom tokenizer using regular expressions to split text based on specific
patterns. For example, you can tokenize text based on spaces and punctuation.

In [106]:
import re
text = "Custom tokenization can be done with regular expressions. For example, split text based on spaces and punctuation!"
# Tokenize based on spaces and punctuation
tokens = re.split(r'\s+|[,;.!]', text)
# Remove empty strings
tokens = [token for token in tokens if token]
print(tokens)

['Custom', 'tokenization', 'can', 'be', 'done', 'with', 'regular', 'expressions', 'For', 'example', 'split', 'text', 'based', 'on', 'spaces', 'and', 'punctuation']


The choice of tokenizer depends on your specific NLP task and the characteristics of the text you
are working with. NLTK provides flexibility and allows you to use the tokenizer that best fits
your needs, whether it’s the default tokenizer, a sentence tokenizer, or a custom tokenizer based on
regular expressions.
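
The custom tokenizer above discards punctuation entirely. If punctuation marks should survive as separate tokens, as they do with word_tokenize, one possible alternative is `re.findall` with a pattern that matches either a run of word characters or a single punctuation character. This is a minimal sketch; the function name `regex_tokenize` is made up for illustration.

```python
import re

def regex_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+      : a run of letters, digits, or underscores
    # [^\w\s]  : any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = regex_tokenize("For example, split text on spaces and punctuation!")
print(tokens)
```

Unlike word_tokenize, this pattern knows nothing about contractions or abbreviations, so "don't" would come out as three tokens.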

In [107]:
from nltk.tokenize import word_tokenize
data = "I pledge to be a data scientist one day"
tokenized_text=word_tokenize(data)
print(tokenized_text)
print(type(tokenized_text))

['I', 'pledge', 'to', 'be', 'a', 'data', 'scientist', 'one', 'day']
<class 'list'>


In [108]:
from nltk.tokenize import sent_tokenize
para="""Cake is a form of sweet food made from flour sugar ,and other ingredients,
that is usually baked.In their oldest forms, cakes were modifications of bread, 
but cakes now cover a wide range of preparationsthat can be simple or elaborate,
and that share features with other dessertssuch as pastries, meringues, custards,
and pies.The most commonly used cakeingredients include flour,
sugar, eggs, butter or oil or margarine, a liquid, and leavening agents,
such as baking soda or baking powder. Common additional ingredients and flavourings include dried, candied, or fresh
fruit, nuts, cocoa, and extracts such as vanilla, with numerous
substitutions for the primary ingredients.Cakes can also be filled with
fruit preserves, nuts or dessert sauces (like pastry cream), iced with
buttercream or other icings, and decorated with marzipan, piped borders, or
candied fruit."""
tokenized_para=sent_tokenize(para)
print(tokenized_para)
print(type(tokenized_para))


['Cake is a form of sweet food made from flour sugar ,and other ingredients,\nthat is usually baked.In their oldest forms, cakes were modifications of bread, \nbut cakes now cover a wide range of preparationsthat can be simple or elaborate,\nand that share features with other dessertssuch as pastries, meringues, custards,\nand pies.The most commonly used cakeingredients include flour,\nsugar, eggs, butter or oil or margarine, a liquid, and leavening agents,\nsuch as baking soda or baking powder.', 'Common additional ingredients and flavourings include dried, candied, or fresh\nfruit, nuts, cocoa, and extracts such as vanilla, with numerous\nsubstitutions for the primary ingredients.Cakes can also be filled with\nfruit preserves, nuts or dessert sauces (like pastry cream), iced with\nbuttercream or other icings, and decorated with marzipan, piped borders, or\ncandied fruit.']
<class 'list'>
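
The list above holds only two sentences although the paragraph contains five: wherever a period is not followed by a space (for example `baked.In`), no sentence boundary is detected. One possible repair, sketched below under the assumption that a period directly followed by a capital letter always ends a sentence, is to re-insert the missing space before tokenizing. The helper `naive_sentences` is a hypothetical regex stand-in so that the snippet runs without NLTK data; in practice the cleaned text would be passed to sent_tokenize.

```python
import re

def restore_sentence_spacing(text):
    """Insert a space after a period that is directly followed by a capital letter."""
    return re.sub(r"\.(?=[A-Z])", ". ", text)

def naive_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

raw = ("Cake is usually baked.In their oldest forms, cakes were "
       "modifications of bread. Cakes can be filled.")
cleaned = restore_sentence_spacing(raw)
sentences = naive_sentences(cleaned)
print(sentences)
```

The assumption breaks on abbreviations such as "U.S.A", so this kind of cleanup is best checked against a sample of the real data first.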


### 2. Punctuation Removal
Removing punctuation from text is a common preprocessing step in natural language processing
(NLP) tasks. Punctuation removal helps simplify text and can be useful for various NLP tasks
like text classification, text analysis, and text mining. You can remove punctuation from text in
Python using various methods, including regular expressions and string manipulation.


**Here’s how to remove punctuation from text using Python:**

#### a) Using Regular Expressions:
You can use the re library to remove punctuation using regular expressions. In this example,
we’ll remove every character that is not an ASCII letter, a digit, or a space:


In [109]:
import re
text = "Hello, World! This is an example text with some punctuation."
# Remove non-alphanumeric characters using regular expression
text_without_punctuation = re.sub(r'[^A-Za-z0-9 ]+', '', text)
print(text_without_punctuation)

Hello World This is an example text with some punctuation


#### b) Using String Manipulation:
You can also remove punctuation by iterating through each character in the text and keeping
only the characters that are letters, digits, or whitespace:

In [110]:
# Input text containing punctuation
text = "Hello, World! This is an example text with some punctuation."

# Remove punctuation using string manipulation
text_without_punctuation = ''.join(char for char in text if char.isalnum() or char.isspace())

# Print the result (text without punctuation)
print(text_without_punctuation)


Hello World This is an example text with some punctuation


#### c) Using the string Module:
Python’s string module provides string.punctuation, a string of all ASCII punctuation characters.
You can use it to remove punctuation from text:

In [111]:
import string

# Input text containing punctuation
text = "Hello, World! This is an example text with some punctuation."

# Create a translator using str.maketrans to remove punctuation
translator = str.maketrans('', '', string.punctuation)

# Remove punctuation from the text using the translator
text_without_punctuation = text.translate(translator)

# Print the result (text without punctuation)
print(text_without_punctuation)


Hello World This is an example text with some punctuation


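#### d) Using NLTK’s RegexpTokenizer:
NLTK’s RegexpTokenizer with the pattern r'\w+' keeps only runs of word characters, so it
tokenizes the text and drops punctuation in a single step:

In [112]: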
from nltk.tokenize import RegexpTokenizer

# Initialize a RegexpTokenizer with a regular expression pattern
tokenizer = RegexpTokenizer(r'\w+')

# Input text
text = "Wow! I am excited to learn data science"

# Tokenize the text using the defined regular expression pattern
result = tokenizer.tokenize(text)

# Print the result (list of words)
print(result)



['Wow', 'I', 'am', 'excited', 'to', 'learn', 'data', 'science']
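
The regular-expression, string-manipulation, and string-module approaches above give identical results on plain ASCII text, but they can differ on other input. The sketch below applies each to one made-up sentence containing an accented letter and a typographic apostrophe (U+2019) to show where they diverge.

```python
import re
import string

# Made-up sample with an accented letter and a typographic apostrophe.
text = "Café costs $5, doesn\u2019t it?"

# a) ASCII-only regex: the accented letter is removed along with punctuation.
regex_result = re.sub(r"[^A-Za-z0-9 ]+", "", text)

# b) str.isalnum is Unicode-aware: accented letters are kept.
isalnum_result = "".join(ch for ch in text if ch.isalnum() or ch.isspace())

# c) string.punctuation covers ASCII punctuation only: U+2019 is kept.
translate_result = text.translate(str.maketrans("", "", string.punctuation))

print(regex_result)
print(isalnum_result)
print(translate_result)
```

Which behaviour is preferable depends on the language and source of the text being cleaned.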


### 3. Stop Words Removal
Stop words are words that occur very frequently in a language but carry little meaning on their
own, e.g. a, an, the, in. They are often removed from the corpus as part of text normalization.

In [113]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Get the set of English stopwords
to_be_removed = set(stopwords.words('english'))

# Input paragraph containing multiple sentences
para = """Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations
that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards,
and pies."""

# Tokenize the paragraph into words
tokenized_para = word_tokenize(para)
print(tokenized_para)

# Remove stopwords from the tokenized list
modified_token_list = [word for word in tokenized_para if word not in to_be_removed]

# Print the modified token list (without stopwords)
print(modified_token_list)


['Cake', 'is', 'a', 'form', 'of', 'sweet', 'food', 'made', 'from', 'flour', ',', 'sugar', ',', 'and', 'other', 'ingredients', ',', 'that', 'is', 'usually', 'baked', '.', 'In', 'their', 'oldest', 'forms', ',', 'cakes', 'were', 'modifications', 'of', 'bread', ',', 'but', 'cakes', 'now', 'cover', 'a', 'wide', 'range', 'of', 'preparations', 'that', 'can', 'be', 'simple', 'or', 'elaborate', ',', 'and', 'that', 'share', 'features', 'with', 'other', 'desserts', 'such', 'as', 'pastries', ',', 'meringues', ',', 'custards', ',', 'and', 'pies', '.']
['Cake', 'form', 'sweet', 'food', 'made', 'flour', ',', 'sugar', ',', 'ingredients', ',', 'usually', 'baked', '.', 'In', 'oldest', 'forms', ',', 'cakes', 'modifications', 'bread', ',', 'cakes', 'cover', 'wide', 'range', 'preparations', 'simple', 'elaborate', ',', 'share', 'features', 'desserts', 'pastries', ',', 'meringues', ',', 'custards', ',', 'pies', '.']
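
In the second list above, 'In' survives because the English stop-word list is lowercase and the membership test is case-sensitive; punctuation tokens survive as well. A possible refinement is to lowercase each token for the lookup and to keep alphabetic tokens only. In this sketch the small `stop_words` set is a hand-written stand-in (an assumption, so that the snippet runs without NLTK data); in practice it would be `set(stopwords.words('english'))`.

```python
# Hand-written stand-in for set(stopwords.words('english')) (assumption).
stop_words = {"is", "a", "of", "from", "and", "other", "that", "in", "their", "were"}

tokens = ["In", "their", "oldest", "forms", ",", "cakes", "were",
          "modifications", "of", "bread", "."]

# Compare in lowercase and keep alphabetic tokens only.
filtered = [tok for tok in tokens
            if tok.isalpha() and tok.lower() not in stop_words]
print(filtered)
```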


### 4. Stemming
Stemming reduces inflected words to a common base form (the stem) by stripping affixes. Words
with the same origin are reduced to the same form, which may or may not be a real word.
#### a) Porter Stemmer

In [114]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Initialize the Porter Stemmer
stemmer = PorterStemmer()

# Input text containing multiple sentences
content = """Cake is a form of sweet food made from flour, sugar, and other ingredients,
that is usually baked. In their oldest forms, cakes were modifications of bread,
but cakes now cover a wide range of preparations that can be simple or elaborate,
and that share features with other desserts such as pastries, meringues, custards, and pies."""

# Tokenize the text into words
tokenized_content = word_tokenize(content)

# Apply stemming to each word
stemmed_words = [stemmer.stem(word) for word in tokenized_content]

# Print the stemmed words
print(stemmed_words)


['cake', 'is', 'a', 'form', 'of', 'sweet', 'food', 'made', 'from', 'flour', ',', 'sugar', ',', 'and', 'other', 'ingredi', ',', 'that', 'is', 'usual', 'bake', '.', 'in', 'their', 'oldest', 'form', ',', 'cake', 'were', 'modif', 'of', 'bread', ',', 'but', 'cake', 'now', 'cover', 'a', 'wide', 'rang', 'of', 'prepar', 'that', 'can', 'be', 'simpl', 'or', 'elabor', ',', 'and', 'that', 'share', 'featur', 'with', 'other', 'dessert', 'such', 'as', 'pastri', ',', 'meringu', ',', 'custard', ',', 'and', 'pie', '.']


#### b) Lancaster Stemmer

In [115]:
from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize

# Initialize the Lancaster Stemmer
stemmer = LancasterStemmer()

# Input text containing multiple sentences
content = """Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations
that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards,
and pies."""

# Tokenize the text into words
tokenized_content = word_tokenize(content)

# Apply stemming to each word
stemmed_words = [stemmer.stem(word) for word in tokenized_content]

# Print the stemmed words
print(stemmed_words)


['cak', 'is', 'a', 'form', 'of', 'sweet', 'food', 'mad', 'from', 'flo', ',', 'sug', ',', 'and', 'oth', 'ingredy', ',', 'that', 'is', 'us', 'bak', '.', 'in', 'their', 'oldest', 'form', ',', 'cak', 'wer', 'mod', 'of', 'bread', ',', 'but', 'cak', 'now', 'cov', 'a', 'wid', 'rang', 'of', 'prep', 'that', 'can', 'be', 'simpl', 'or', 'elab', ',', 'and', 'that', 'shar', 'feat', 'with', 'oth', 'dessert', 'such', 'as', 'pastry', ',', 'meringu', ',', 'custard', ',', 'and', 'pie', '.']
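
Comparing the two outputs above suggests that the Lancaster stemmer cuts more aggressively than the Porter stemmer. The sketch below puts a few words from the sample text side by side; it uses only the two stemmer classes already imported above, which need no extra data downloads.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["ingredients", "usually", "baked", "modifications"]

# Map each word to its (Porter stem, Lancaster stem) pair.
comparison = {w: (porter.stem(w), lancaster.stem(w)) for w in words}

for word, (porter_stem, lancaster_stem) in comparison.items():
    print(f"{word:15} porter={porter_stem:10} lancaster={lancaster_stem}")
```

A shorter stem merges more word forms together, which can help recall but also conflates unrelated words, so the choice is a trade-off rather than a matter of one stemmer being better.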


### 5. Lemmatization
Lemmatization is another way of reducing inflected words to a base form. It differs from stemming
in that it reduces each word to its lemma, a dictionary form with an actual meaning, whereas
stemming sometimes produces strings that are not words at all.

In [116]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Input text containing multiple sentences
content = """Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations
that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards,
and pies."""

# Tokenize the text into words
tokenized_content = word_tokenize(content)

# Lemmatize each word
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokenized_content]

# Print the lemmatized words
print(lemmatized_words)


['Cake', 'is', 'a', 'form', 'of', 'sweet', 'food', 'made', 'from', 'flour', ',', 'sugar', ',', 'and', 'other', 'ingredient', ',', 'that', 'is', 'usually', 'baked', '.', 'In', 'their', 'oldest', 'form', ',', 'cake', 'were', 'modification', 'of', 'bread', ',', 'but', 'cake', 'now', 'cover', 'a', 'wide', 'range', 'of', 'preparation', 'that', 'can', 'be', 'simple', 'or', 'elaborate', ',', 'and', 'that', 'share', 'feature', 'with', 'other', 'dessert', 'such', 'a', 'pastry', ',', 'meringue', ',', 'custard', ',', 'and', 'pie', '.']
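
In the output above, 'were' and 'made' are unchanged and 'as' has become 'a', because `lemmatize` treats every word as a noun unless a `pos` argument is given. One common remedy is to POS-tag the tokens first and convert each Penn Treebank tag to a single-letter WordNet code ('n', 'v', 'a', 'r'). The helper names below are illustrative and not part of NLTK; the lemmatizing function is passed in as a parameter so that the mapping logic runs without the WordNet data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # nouns and everything else

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(word, pos) for each (word, penn_tag) pair."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Show the POS code chosen for a few Penn Treebank tags.
for tag in ["NNS", "VBD", "JJ", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With the WordNet data installed, a call such as `lemmatize_tagged(nltk.pos_tag(tokenized_content), lemmatizer.lemmatize)` should then lemmatize 'were' as a verb instead of a noun.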


### 6. POS Tagging
POS (part-of-speech) tagging is the process of identifying the part of speech of each word in a
sentence. It identifies nouns, pronouns, adjectives, etc. and assigns a POS tag to each word.
There are different tagsets, but we will be using the universal tagset.

In [117]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Input text containing multiple sentences
content = """Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations
that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards,
and pies."""

# Tokenize the text into sentences
sentences = sent_tokenize(content)

# Tokenize each sentence into words
words = [word_tokenize(sentence) for sentence in sentences]

# Perform part-of-speech tagging using the 'universal' tagset
pos_tags = [nltk.pos_tag(sentence, tagset="universal") for sentence in words]

# Print the result (list of sentences, each containing words with their POS tags)
print(pos_tags)


[[('Cake', 'NOUN'), ('is', 'VERB'), ('a', 'DET'), ('form', 'NOUN'), ('of', 'ADP'), ('sweet', 'ADJ'), ('food', 'NOUN'), ('made', 'VERB'), ('from', 'ADP'), ('flour', 'NOUN'), (',', '.'), ('sugar', 'NOUN'), (',', '.'), ('and', 'CONJ'), ('other', 'ADJ'), ('ingredients', 'NOUN'), (',', '.'), ('that', 'DET'), ('is', 'VERB'), ('usually', 'ADV'), ('baked', 'VERB'), ('.', '.')], [('In', 'ADP'), ('their', 'PRON'), ('oldest', 'ADJ'), ('forms', 'NOUN'), (',', '.'), ('cakes', 'NOUN'), ('were', 'VERB'), ('modifications', 'NOUN'), ('of', 'ADP'), ('bread', 'NOUN'), (',', '.'), ('but', 'CONJ'), ('cakes', 'NOUN'), ('now', 'ADV'), ('cover', 'VERB'), ('a', 'DET'), ('wide', 'ADJ'), ('range', 'NOUN'), ('of', 'ADP'), ('preparations', 'NOUN'), ('that', 'DET'), ('can', 'VERB'), ('be', 'VERB'), ('simple', 'ADJ'), ('or', 'CONJ'), ('elaborate', 'ADJ'), (',', '.'), ('and', 'CONJ'), ('that', 'ADP'), ('share', 'NOUN'), ('features', 'NOUN'), ('with', 'ADP'), ('other', 'ADJ'), ('desserts', 'NOUN'), ('such', 'ADJ'), ('
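
Tagged output like the above is a list of (word, tag) pairs per sentence, so it can be summarised with ordinary Python tools. The sketch below counts the tags and collects the nouns; `tagged_sentence` is copied by hand from the beginning of the output above.

```python
from collections import Counter

# First tokens of the tagged output shown above, copied by hand.
tagged_sentence = [("Cake", "NOUN"), ("is", "VERB"), ("a", "DET"),
                   ("form", "NOUN"), ("of", "ADP"), ("sweet", "ADJ"),
                   ("food", "NOUN"), ("made", "VERB"), ("from", "ADP"),
                   ("flour", "NOUN")]

# How often each universal tag occurs.
tag_counts = Counter(tag for _, tag in tagged_sentence)

# All words tagged as nouns, in order of appearance.
nouns = [word for word, tag in tagged_sentence if tag == "NOUN"]

print(tag_counts.most_common())
print(nouns)
```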

### 7. Chunking
Chunking, also known as shallow parsing, is a method applied to POS-tagged data to gain further
insight from it. Words are grouped into phrases on the basis of a pre-defined rule (a grammar
over POS tags), and the text is then parsed according to that rule.

In [118]:
import nltk
from nltk.tokenize import word_tokenize

# Input text
content = "Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked."

# Tokenize the input text into words
tokenized_text = nltk.word_tokenize(content)

# Perform part-of-speech tagging on the tokenized words
tagged_token = nltk.pos_tag(tokenized_text)

# Define a simple grammar for NP (Noun Phrase) extraction
grammar = "NP: {<DT>?<JJ>*<NN>}"

# Create a regular expression parser based on the defined grammar
phrases = nltk.RegexpParser(grammar)

# Apply the parser to the tagged tokens to extract noun phrases
result = phrases.parse(tagged_token)

# Print the result (the parsed tree structure)
print(result)

# Visualize the result by drawing the parsed tree
result.draw()


(S
  Cake/NNP
  is/VBZ
  (NP a/DT form/NN)
  of/IN
  (NP sweet/JJ food/NN)
  made/VBN
  from/IN
  (NP flour/NN)
  ,/,
  (NP sugar/NN)
  ,/,
  and/CC
  other/JJ
  ingredients/NNS
  ,/,
  that/DT
  is/VBZ
  usually/RB
  baked/VBN
  ./.)
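
In the tree above, 'Cake' and 'ingredients' are left outside any NP chunk because the rule only accepts the tag NN, while these words are tagged NNP and NNS. One possible broadening is the tag pattern `<NN.*>+`, which matches one or more nouns of any kind. In this sketch the tokens are tagged by hand (copied from the tree above) so that it runs without the tagger model.

```python
import nltk

# Hand-tagged tokens (Penn Treebank tags), copied from the tree above.
tagged = [("Cake", "NNP"), ("is", "VBZ"), ("a", "DT"), ("form", "NN"),
          ("of", "IN"), ("sweet", "JJ"), ("food", "NN"), ("made", "VBN"),
          ("from", "IN"), ("flour", "NN"), (",", ","), ("sugar", "NN"),
          (",", ","), ("and", "CC"), ("other", "JJ"), ("ingredients", "NNS")]

# <NN.*>+ matches one or more nouns of any kind (NN, NNS, NNP, NNPS).
parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = parser.parse(tagged)

# Collect the text of every NP chunk.
noun_phrases = [" ".join(word for word, _ in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(noun_phrases)
```

Extracting the chunks as plain strings, rather than drawing the tree, makes the result easy to feed into later processing steps.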


### 8. Word Embeddings
Word embeddings are an NLP technique that tries to capture the context, semantic meaning, and
interrelation of words. Each word is represented by a vector, and the positions of word vectors
in the vector space reflect similarity between words. The embedding technique discussed here
is Word2Vec, which we will use with the help of Gensim, another useful library alongside
NLTK.
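
Similarity between word vectors is commonly measured with cosine similarity, the cosine of the angle between two vectors. The sketch below computes it in plain Python; the three-dimensional vectors are made up purely for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings", for illustration only.
vectors = {
    "cake":   [0.9, 0.1, 0.0],
    "pastry": [0.8, 0.2, 0.1],
    "engine": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(vectors["cake"], vectors["pastry"])
sim_unrelated = cosine_similarity(vectors["cake"], vectors["engine"])
print(sim_related, sim_unrelated)
```

A value near 1 means the vectors point in almost the same direction. Vectors trained on a corpus of only a few sentences carry very little signal, so similarity scores from such a model should not be read as meaningful.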

In [119]:
from gensim.models import Word2Vec
# Sample corpus
corpus = [
["I", "love", "machine", "learning"],
["Word", "embeddings", "are", "important"],
["NLTK", "is", "a", "natural", "language", "toolkit"]
]
# Train Word2Vec model
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)
# Save the model for later use
model.save("word2vec.model")


In [120]:
from gensim.models import Word2Vec
# Load the Word2Vec model
model = Word2Vec.load("word2vec.model")
# Explore the embeddings
vector = model.wv["machine"]
similar_words = model.wv.most_similar("machine", topn=3)
print("Vector for 'machine':", vector)
print("Most similar words to 'machine':", similar_words)


Vector for 'machine': [ 9.7702928e-03  8.1651136e-03  1.2809718e-03  5.0975787e-03
  1.4081288e-03 -6.4551616e-03 -1.4280510e-03  6.4491653e-03
 -4.6173059e-03 -3.9930656e-03  4.9244044e-03  2.7130984e-03
 -1.8479753e-03 -2.8769434e-03  6.0107317e-03 -5.7167388e-03
 -3.2367026e-03 -6.4878250e-03 -4.2346325e-03 -8.5809948e-03
 -4.4697891e-03 -8.5112294e-03  1.4037776e-03 -8.6181965e-03
 -9.9166557e-03 -8.2016252e-03 -6.7726658e-03  6.6805850e-03
  3.7845564e-03  3.5616636e-04 -2.9579818e-03 -7.4283206e-03
  5.3341867e-04  4.9989222e-04  1.9561886e-04  8.5259555e-04
  7.8633073e-04 -6.8160298e-05 -8.0070542e-03 -5.8702733e-03
 -8.3829118e-03 -1.3120425e-03  1.8206370e-03  7.4171280e-03
 -1.9634271e-03 -2.3252917e-03  9.4871549e-03  7.9704521e-05
 -2.4045217e-03  8.6048469e-03  2.6870037e-03 -5.3439722e-03
  6.5881060e-03  4.5101536e-03 -7.0544672e-03 -3.2317400e-04
  8.3448651e-04  5.7473574e-03 -1.7176545e-03 -2.8065301e-03
  1.7484308e-03  8.4717153e-04  1.1928272e-03 -2.6342822e-03
 -