#**Natural Language Processing**
A branch of computer science and artificial intelligence that deals with human languages.

**Applications**
1. Spell Checking
2. Keyword Search
3. Information Extraction
4. Advertisement Matching
5. Sentiment Analysis
6. Speech Recognition
7. Chatbot Implementation
8. Machine Translation

**Components**
1. Natural Language Understanding
  - Mapping input to Useful Representations
  - Analyzing different aspects of the language

2. Natural Language Generation
  - Text Planning - Retrieving the relevant content from the knowledge base.
  - Sentence Planning - Choosing the required words, forming meaningful phrases, and setting the tone of the sentence.
  - Text Realization - Mapping the sentence plan into sentence structure.
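
The three generation stages can be pictured as a small pipeline. The sketch below is only a toy illustration: the knowledge base, the function names, and the sentence template are all hypothetical, and a real NLG system would be far richer.

```python
# Toy illustration of the three NLG stages; all names and data are hypothetical.
KNOWLEDGE_BASE = {
    "paris": {"country": "France"},
    "tokyo": {"country": "Japan"},
}

def plan_text(topic):
    """Text planning: retrieve the relevant content from the knowledge base."""
    return {"city": topic.title(), **KNOWLEDGE_BASE[topic]}

def plan_sentence(facts, formal=True):
    """Sentence planning: choose the words and set the tone."""
    verb = "is located in" if formal else "is in"
    return {"subject": facts["city"], "verb": verb, "object": facts["country"]}

def realize(plan):
    """Text realization: map the sentence plan onto a sentence structure."""
    return f"{plan['subject']} {plan['verb']} {plan['object']}."

sentence = realize(plan_sentence(plan_text("paris")))
print(sentence)
```

Passing `formal=False` to `plan_sentence` changes only the wording, which is the part of the pipeline responsible for tone.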

**Ambiguities**
  - Lexical Ambiguity - A single word has two or more possible meanings.
  - Syntactical Ambiguity - A sentence or sequence of words has two or more possible meanings (also called structural or grammatical ambiguity).
  - Referential Ambiguity - A word or phrase can be interpreted as referring to more than one item.

**NLTK**

  

----
#Tokenization

In [2]:
import os
import nltk
import nltk.corpus

In [3]:
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


True

In [4]:
#print(os.listdir(nltk.data.find("corpora")))

In [5]:
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [6]:
from nltk.corpus import brown
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [7]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [8]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

The next cell uses NLTK to load the words of Shakespeare's play 'Hamlet' from the Gutenberg corpus. It retrieves a list of words from the specified text file.

In [9]:
hamlet = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
hamlet

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', ...]

The variable 'hamlet' holds the words of the play 'Hamlet', so we can run various NLTK text analysis tasks on this text.
For example, we could analyze word frequencies, perform sentiment analysis, or carry out other NLP tasks on the text of 'Hamlet'.

In [10]:
# Print 15 consecutive tokens, one per line
for word in hamlet[500:515]:
  print(word)

thou
that
vsurp
'
st
this
time
of
night
,
Together
with
that
Faire
and


In [11]:
AI = """According to the father of Artificial Intelligence, John McCarthy, it is “The science and engineering of making intelligent machines, especially intelligent computer programs”.
Artificial Intelligence is a way of making a computer, a computer-controlled robot, or a software think intelligently, in the similar manner the intelligent humans think.
AI is accomplished by studying how human brain thinks, and how humans learn, decide, and work while trying to solve a problem, and then using the outcomes of this study as a basis of developing intelligent software and systems.
Philosophy of AI

While exploiting the power of the computer systems, the curiosity of human, lead him to wonder, “Can a machine think and behave like humans do?” Thus, the development of AI started with the intention of creating similar
intelligence in machines that we find and regard high in humans.

Goals of AI
• To Create Expert Systems − The systems which exhibit intelligent behavior, learn, demonstrate, explain, and advice its users.
• To Implement Human Intelligence in Machines − Creating systems that understand, think, learn, and behave like humans.

What Contributes to AI?
Artificial intelligence is a science and technology based on disciplines such as Computer Science, Biology, Psychology, Linguistics, Mathematics, and Engineering. A major thrust of AI is in the development of computer functions associated with human intelligence, such as reasoning, learning, and problem solving.
Out of the following areas, one or multiple areas can contribute to build an intelligent system."""

In [12]:
type(AI)

str

**word_tokenize Function**

An NLTK function that splits a text into word and punctuation tokens.

In [13]:
from nltk.tokenize import word_tokenize   # Tokenization is the process of breaking down a text into smaller units, such as sentences or words.

In [14]:
nltk.download('punkt')    # Punkt: a pre-trained sentence tokenizer built with an unsupervised algorithm.
# It learns the most likely sentence boundaries from the distribution of words in a given language.
# In NLTK it is used for sentence segmentation; word_tokenize relies on it to split a text into sentences before splitting each sentence into words.
# Note: recent NLTK releases may ask for the 'punkt_tab' resource instead.

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [15]:
AI_tokens = word_tokenize(AI)     # returns the tokens as a list
AI_tokens[5:20]

['Artificial',
 'Intelligence',
 ',',
 'John',
 'McCarthy',
 ',',
 'it',
 'is',
 '“',
 'The',
 'science',
 'and',
 'engineering',
 'of',
 'making']

In [16]:
print(type(AI_tokens))
print(len(AI_tokens))

<class 'list'>
286
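
To make the idea of word tokenization concrete, here is a minimal regular-expression sketch in plain Python. It is an assumption-laden approximation, not how `word_tokenize` is implemented: the function name `simple_word_tokenize` and the pattern are illustrative only, and NLTK handles contractions, quotes, and abbreviations with more care.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      matches a run of letters, digits, or underscores
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "AI is a science, and a technology."
tokens = simple_word_tokenize(sample)
print(tokens)
print(len(tokens))
```

As with the NLTK tokens, punctuation marks come out as separate tokens, which is why they show up in token counts.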


**Frequency Distribution Class (FreqDist)**

An NLTK class that counts how often each element occurs in a dataset, for example the frequency of each word in a text.

In [17]:
from nltk.probability import FreqDist
fdist = FreqDist()

In [18]:
for word in AI_tokens:
  fdist[word.lower()] += 1
fdist

FreqDist({',': 30, 'of': 14, 'the': 13, 'and': 12, '.': 9, 'a': 9, 'to': 7, 'intelligence': 6, 'intelligent': 6, 'ai': 6, ...})

In [19]:
# Checking the frequency of a single word
print(fdist['intelligence'])

6


In [20]:
print(len(fdist))

126


In [21]:
#Top ten tokens with highest frequency

fdist_top10 = fdist.most_common(10)
print(fdist_top10)

[(',', 30), ('of', 14), ('the', 13), ('and', 12), ('.', 9), ('a', 9), ('to', 7), ('intelligence', 6), ('intelligent', 6), ('ai', 6)]
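
The counting done above can be sketched with the standard library alone. The example below uses `collections.Counter` (which `FreqDist` builds on) with a small made-up token list; `count_tokens` and `sample_tokens` are hypothetical names used only for illustration.

```python
from collections import Counter

def count_tokens(tokens):
    """Count case-insensitive token frequencies."""
    return Counter(token.lower() for token in tokens)

# Hypothetical token list standing in for the output of a tokenizer
sample_tokens = ["AI", "is", "a", "science", ",", "and",
                 "AI", "is", "a", "technology", "."]

counts = count_tokens(sample_tokens)
print(counts["ai"])            # frequency of one token
print(len(counts))             # number of distinct tokens
print(counts.most_common(3))   # the three most frequent tokens
```

Lowercasing before counting merges "AI" and "ai" into one entry, which is the same choice made in the loop above.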


**blankline_tokenize**
- Tokenizing a text into chunks based on blank lines.
- Designed to split a text into sections separated by one or more blank lines.

In [22]:
from nltk.tokenize import blankline_tokenize
AI_blank = blankline_tokenize(AI)
len(AI_blank)

4

In [23]:
AI_blank[0]

'According to the father of Artificial Intelligence, John McCarthy, it is “The science and engineering of making intelligent machines, especially intelligent computer programs”.\nArtificial Intelligence is a way of making a computer, a computer-controlled robot, or a software think intelligently, in the similar manner the intelligent humans think.\nAI is accomplished by studying how human brain thinks, and how humans learn, decide, and work while trying to solve a problem, and then using the outcomes of this study as a basis of developing intelligent software and systems.\nPhilosophy of AI'
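
Splitting on blank lines can be approximated with a single regular expression. This is a plain-Python sketch under the assumption that a "blank line" means two newlines separated only by whitespace; the helper name `split_on_blank_lines` is hypothetical and the code is not NLTK's implementation.

```python
import re

def split_on_blank_lines(text):
    """Split text into chunks separated by one or more blank lines."""
    chunks = re.split(r"\s*\n\s*\n\s*", text)
    # Drop empty strings that appear when the text starts or ends with blank lines
    return [chunk for chunk in chunks if chunk]

sample = "First paragraph,\nsecond line.\n\nSecond paragraph.\n\n\nThird paragraph."
chunks = split_on_blank_lines(sample)
print(len(chunks))
print(chunks[0])
```

A single newline does not start a new chunk, which is why the first chunk keeps its internal line break.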

#Tokenization
- Bigrams - Sequences of two consecutive tokens.
- Trigrams - Sequences of three consecutive tokens.
- Ngrams - Sequences of any number (N) of consecutive tokens.

**Bigrams** provide context by capturing relationships between adjacent words. They help show which words are often used together. Example use: predicting the next word in a sequence.

**Trigrams** capture relationships between three consecutive words, offering more information about the structure and flow of the text. Example uses: Part-of-Speech tagging and Named Entity Recognition.
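
As a rough illustration of next-word prediction with bigrams, the sketch below counts which word follows which and returns the most frequent follower. The function names and the tiny sample sentence are hypothetical, and a practical model would need much more text plus smoothing for unseen words.

```python
from collections import Counter, defaultdict

def build_bigram_model(tokens):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        model[first][second] += 1
    return model

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it has none."""
    followers = model.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

tokens = "the cat sat on the mat and the cat slept".split()
model = build_bigram_model(tokens)
print(predict_next(model, "the"))
print(predict_next(model, "slept"))
```

Here "the" is followed by "cat" twice and "mat" once, so "cat" is predicted; "slept" is the last token and has no follower.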

In [24]:
from nltk.util import bigrams, trigrams, ngrams

In [25]:
string = "The best and most beautiful things in the world cannot be seen or even touched, they must be felt with the heart"
quotes_tokens = nltk.word_tokenize(string)
quotes_tokens

['The',
 'best',
 'and',
 'most',
 'beautiful',
 'things',
 'in',
 'the',
 'world',
 'can',
 'not',
 'be',
 'seen',
 'or',
 'even',
 'touched',
 ',',
 'they',
 'must',
 'be',
 'felt',
 'with',
 'the',
 'heart']

In [26]:
# Create a Bigram
quotes_bigrams = list(nltk.bigrams(quotes_tokens))
print(quotes_bigrams)

[('The', 'best'), ('best', 'and'), ('and', 'most'), ('most', 'beautiful'), ('beautiful', 'things'), ('things', 'in'), ('in', 'the'), ('the', 'world'), ('world', 'can'), ('can', 'not'), ('not', 'be'), ('be', 'seen'), ('seen', 'or'), ('or', 'even'), ('even', 'touched'), ('touched', ','), (',', 'they'), ('they', 'must'), ('must', 'be'), ('be', 'felt'), ('felt', 'with'), ('with', 'the'), ('the', 'heart')]


In [29]:
# Create a Trigram
quotes_trigrams = list(nltk.trigrams(quotes_tokens))
print(quotes_trigrams)

[('The', 'best', 'and'), ('best', 'and', 'most'), ('and', 'most', 'beautiful'), ('most', 'beautiful', 'things'), ('beautiful', 'things', 'in'), ('things', 'in', 'the'), ('in', 'the', 'world'), ('the', 'world', 'can'), ('world', 'can', 'not'), ('can', 'not', 'be'), ('not', 'be', 'seen'), ('be', 'seen', 'or'), ('seen', 'or', 'even'), ('or', 'even', 'touched'), ('even', 'touched', ','), ('touched', ',', 'they'), (',', 'they', 'must'), ('they', 'must', 'be'), ('must', 'be', 'felt'), ('be', 'felt', 'with'), ('felt', 'with', 'the'), ('with', 'the', 'heart')]


#Ngrams

```
quotes_ngrams = list(nltk.ngrams(quotes_tokens, N))
```



In [34]:
quotes_ngrams = list(nltk.ngrams(quotes_tokens, 5))
quotes_ngrams

[('The', 'best', 'and', 'most', 'beautiful'),
 ('best', 'and', 'most', 'beautiful', 'things'),
 ('and', 'most', 'beautiful', 'things', 'in'),
 ('most', 'beautiful', 'things', 'in', 'the'),
 ('beautiful', 'things', 'in', 'the', 'world'),
 ('things', 'in', 'the', 'world', 'can'),
 ('in', 'the', 'world', 'can', 'not'),
 ('the', 'world', 'can', 'not', 'be'),
 ('world', 'can', 'not', 'be', 'seen'),
 ('can', 'not', 'be', 'seen', 'or'),
 ('not', 'be', 'seen', 'or', 'even'),
 ('be', 'seen', 'or', 'even', 'touched'),
 ('seen', 'or', 'even', 'touched', ','),
 ('or', 'even', 'touched', ',', 'they'),
 ('even', 'touched', ',', 'they', 'must'),
 ('touched', ',', 'they', 'must', 'be'),
 (',', 'they', 'must', 'be', 'felt'),
 ('they', 'must', 'be', 'felt', 'with'),
 ('must', 'be', 'felt', 'with', 'the'),
 ('be', 'felt', 'with', 'the', 'heart')]
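
An n-gram list is just a sliding window over the tokens, so a sequence of L tokens yields L - N + 1 n-grams; this matches the 24-token quote producing 20 five-grams above. The helper below is a plain-Python sketch with a hypothetical name, `sliding_ngrams`, and is not the NLTK implementation.

```python
def sliding_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # zip stops at the shortest slice, so exactly len(tokens) - n + 1 windows are produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "they must be felt with the heart".split()
print(sliding_ngrams(tokens, 2)[:2])
print(len(sliding_ngrams(tokens, 5)))
print(sliding_ngrams(tokens, 8))
```

When N is larger than the number of tokens, the result is an empty list.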

#**Stemming**
Stemming normalizes words to their base or root form.

PorterStemmer is a stemming algorithm implemented in the NLTK library.
It strips suffixes using a fixed set of rules, so the resulting stem is not always a dictionary word.

In [35]:
from nltk.stem import PorterStemmer
pst = PorterStemmer()

In [36]:
pst.stem("having")

'have'

In [37]:
words_to_stem = ['giving', 'give', 'given', 'gave']
for words in words_to_stem:
  print(words, ":", pst.stem(words))

giving : give
give : give
given : given
gave : gave
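
The output above shows that irregular forms such as "given" and "gave" are left untouched. A deliberately simplified suffix stripper makes the reason visible: stemming only applies suffix rules. The rule list and the name `toy_stem` below are hypothetical, and this is not the Porter algorithm, which has many more rules and conditions.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule list, far smaller than Porter's

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["walking", "walked", "walks", "gave"]:
    print(w, ":", toy_stem(w))
```

"gave" ends in none of the listed suffixes, so no rule can connect it to "give"; that takes dictionary knowledge rather than suffix stripping.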


**Lancaster Stemmer**

More aggressive than the Porter Stemmer. Tends to be more liberal in its stemming and may produce shorter stems.

In [39]:
from nltk.stem import LancasterStemmer
lanstem = LancasterStemmer()

In [40]:
words_to_stem = ['giving', 'give', 'given', 'gave']
for words in words_to_stem:
  print(words,":",lanstem.stem(words))

giving : giv
give : giv
given : giv
gave : gav


In [41]:
words_to_stem = ['bringing', 'brought', 'bring', 'brinjal']
for words in words_to_stem:
  print(words, ":", lanstem.stem(words))

bringing : bring
brought : brought
bring : bring
brinjal : brind


**Snowball Stemmer**

The Snowball framework supports multiple languages, so the language must be specified when the stemmer is created.

In [43]:
from nltk.stem import SnowballStemmer
sbst = SnowballStemmer('english')

In [44]:
sbst.stem("Generation")

'generat'

In [45]:
words_to_stem = ['giving', 'gave','bringing', 'brought','having', 'have']
for words in words_to_stem:
  print(words, ":", sbst.stem(words))

giving : give
gave : gave
bringing : bring
brought : brought
having : have
have : have


#**Lemmatization**
- Groups together the different inflected forms of a word under a common base form, called the lemma.
- Somewhat similar to stemming, as it maps several words to one common root.
- The output of lemmatization is a proper word.
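
The difference from stemming can be sketched with a dictionary lookup. The mini-dictionary and the name `toy_lemmatize` below are hypothetical and exist only for illustration; real lemmatizers, such as NLTK's `WordNetLemmatizer`, consult a full lexical database and usually take the part of speech into account.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemma
IRREGULAR_FORMS = {
    "gave": "give",
    "given": "give",
    "giving": "give",
    "brought": "bring",
    "bringing": "bring",
}

def toy_lemmatize(word):
    """Look the word up; fall back to the lowercased word when it is unknown."""
    word = word.lower()
    return IRREGULAR_FORMS.get(word, word)

for w in ["gave", "given", "brought", "heart"]:
    print(w, ":", toy_lemmatize(w))
```

Because the result comes from a dictionary, every output is a proper word, and irregular forms such as "gave" map to "give".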