# Natural Language Processing

![](NLP.png)
![](NLP1.png)

### Install NLTK

In [1]:
!pip install nltk



### Import Libraries

In [2]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

### Downloads

In [3]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Error loading punkt: <urlopen error [Errno -3] Temporary
[nltk_data]     failure in name resolution>
[nltk_data] Error loading wordnet: <urlopen error [Errno -3] Temporary
[nltk_data]     failure in name resolution>
[nltk_data] Error loading stopwords: <urlopen error [Errno -3]
[nltk_data]     Temporary failure in name resolution>


False

### Initializers

In [4]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

### Let's Start

In [5]:
paragraph = """
Elon Reeve Musk (/ˈiːlɒn/ EE-lon; born June 28, 1971) is a business magnate and investor. Musk is the founder, chairman, CEO and chief technology officer of SpaceX; angel investor, CEO, product architect and former chairman of Tesla, Inc.; owner, chairman and CTO of X Corp.; founder of the Boring Company; co-founder of Neuralink and OpenAI; and president of the Musk Foundation. He is the wealthiest person in the world, with an estimated net worth of US$232 billion as of September 2023, according to the Bloomberg Billionaires Index, and $253 billion according to Forbes, primarily from his ownership stakes in both Tesla and SpaceX.[4][5]

Musk was born in Pretoria, South Africa, and briefly attended the University of Pretoria before immigrating to Canada at age 18, acquiring citizenship through his Canadian-born mother. Two years later, he matriculated at Queen's University in Kingston, Ontario. Musk later transferred to the University of Pennsylvania, and received bachelor's degrees in economics and physics there. He moved to California in 1995 to attend Stanford University. However, Musk dropped out after two days and, with his brother Kimbal, co-founded online city guide software company Zip2. The startup was acquired by Compaq for $307 million in 1999, and with $12 million of the money he made, that same year Musk co-founded X.com, a direct bank. X.com merged with Confinity in 2000 to form PayPal. 
"""

In [6]:
paragraph

"\nElon Reeve Musk (/ˈiːlɒn/ EE-lon; born June 28, 1971) is a business magnate and investor. Musk is the founder, chairman, CEO and chief technology officer of SpaceX; angel investor, CEO, product architect and former chairman of Tesla, Inc.; owner, chairman and CTO of X Corp.; founder of the Boring Company; co-founder of Neuralink and OpenAI; and president of the Musk Foundation. He is the wealthiest person in the world, with an estimated net worth of US$232 billion as of September 2023, according to the Bloomberg Billionaires Index, and $253 billion according to Forbes, primarily from his ownership stakes in both Tesla and SpaceX.[4][5]\n\nMusk was born in Pretoria, South Africa, and briefly attended the University of Pretoria before immigrating to Canada at age 18, acquiring citizenship through his Canadian-born mother. Two years later, he matriculated at Queen's University in Kingston, Ontario. Musk later transferred to the University of Pennsylvania, and received bachelor's degree

# PreProcessing

### Tokenization

![](Token.png)

In [7]:
sentences = nltk.sent_tokenize(paragraph)

In [8]:
corpus = []
for sentence in sentences:
    # keep letters only: every other character becomes a space
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    review = review.lower()
    corpus.append(review)

In [9]:
corpus

[' elon reeve musk    i l n  ee lon  born june           is a business magnate and investor ',
 'musk is the founder  chairman  ceo and chief technology officer of spacex  angel investor  ceo  product architect and former chairman of tesla  inc   owner  chairman and cto of x corp   founder of the boring company  co founder of neuralink and openai  and president of the musk foundation ',
 'he is the wealthiest person in the world  with an estimated net worth of us     billion as of september       according to the bloomberg billionaires index  and      billion according to forbes  primarily from his ownership stakes in both tesla and spacex ',
 '        musk was born in pretoria  south africa  and briefly attended the university of pretoria before immigrating to canada at age     acquiring citizenship through his canadian born mother ',
 'two years later  he matriculated at queen s university in kingston  ontario ',
 'musk later transferred to the university of pennsylvania  and receive
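The cleaning step can be tried on a single sentence to see exactly what the regular expression removes. The sketch below is a minimal, standalone illustration; `clean_sentence` is a hypothetical helper name that does not appear in the notebook. It also suggests why stray tokens such as `i l n` show up in the output: the phonetic spelling `/ˈiːlɒn/` contains characters outside `a-zA-Z`, so only the plain ASCII letters survive.

```python
import re

def clean_sentence(sentence):
    """Replace every character that is not an ASCII letter with a space, then lowercase."""
    return re.sub('[^a-zA-Z]', ' ', sentence).lower()

sample = "He moved to California in 1995 to attend Stanford University."
cleaned = clean_sentence(sample)
print(repr(cleaned))  # digits and punctuation are now spaces

# Optional extra step (not in the notebook): collapse the leftover runs of spaces.
collapsed = ' '.join(cleaned.split())
print(repr(collapsed))

# Non-ASCII phonetic symbols are dropped too, leaving isolated letters behind.
print(clean_sentence("/ˈiːlɒn/").split())
```

Each removed character is replaced by exactly one space, so the cleaned string keeps the original length; that is where the long gaps in the corpus output come from.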


# Introduction
<h4>

In Natural Language Processing (NLP), stemming and lemmatization are text normalization techniques. They are used to prepare words, text, and documents for further processing.

Languages such as English and Hindi contain many words that are derived from one another. A language that contains such derived words is called an inflected language. For instance, the word “historical” is derived from the word “history”.

All inflected forms of a word share a common root form. The degree of inflection varies from low to high depending on the language.

To sum up, stemming and lemmatization are used to obtain the root form of derived or inflected words.

The nltk.stem package provides several classes for stemming. We import PorterStemmer from nltk.stem for this task.

For instance, ran, runs, and running are all derived from one word, run, so the lemma of all three words is run. Lemmatization yields valid words because it returns an actual dictionary word, whereas a stem need not be a real word. </h4>
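To make the contrast concrete, here is a toy sketch in plain Python. It is deliberately not the Porter algorithm and not WordNet: `toy_stem`, `toy_lemmatize`, `SUFFIXES`, and `LEMMA_TABLE` are hypothetical names invented for this illustration, and the suffix list and dictionary are tiny assumptions chosen only to cover the run example.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "s")  # assumed mini suffix list
LEMMA_TABLE = {"ran": "run", "runs": "run", "running": "run"}  # assumed mini dictionary

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for word in ("ran", "runs", "running"):
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The rule-based stemmer produces `runn` for `running` and leaves `ran` unchanged, while the dictionary lookup maps all three forms to `run`. Real stemmers and lemmatizers are far more careful, but the trade-off is the same: stemming is cheap and rule-driven, lemmatization needs vocabulary knowledge.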

#### Stemming

In the code given below, one sentence is taken at a time and word tokenization is applied, i.e., the sentence is converted to words. After that, stopwords (such as "the" and "and") are ignored and stemming is applied to all other words. Finally, the stemmed words are joined to make a sentence.

Note: Stopwords are words that do not add any value to the sentence.

![](stemming1.png)![](stemming2.png)

#### What is Lemmatization

![Image description](Stemming_Lemmatization.png)

### Let's Do

#### Stemming

In [10]:
stop_words = set(stopwords.words('english'))  # build the stopword set once, not per word
for sentence in corpus:   # the stems are only printed here, not saved
    words = nltk.word_tokenize(sentence)
    for word in words:
        if word not in stop_words:
            print(stemmer.stem(word))

elon
reev
musk
l
n
ee
lon
born
june
busi
magnat
investor
musk
founder
chairman
ceo
chief
technolog
offic
spacex
angel
investor
ceo
product
architect
former
chairman
tesla
inc
owner
chairman
cto
x
corp
founder
bore
compani
co
founder
neuralink
openai
presid
musk
foundat
wealthiest
person
world
estim
net
worth
us
billion
septemb
accord
bloomberg
billionair
index
billion
accord
forb
primarili
ownership
stake
tesla
spacex
musk
born
pretoria
south
africa
briefli
attend
univers
pretoria
immigr
canada
age
acquir
citizenship
canadian
born
mother
two
year
later
matricul
queen
univers
kingston
ontario
musk
later
transfer
univers
pennsylvania
receiv
bachelor
degre
econom
physic
move
california
attend
stanford
univers
howev
musk
drop
two
day
brother
kimbal
co
found
onlin
citi
guid
softwar
compani
zip
startup
acquir
compaq
million
million
money
made
year
musk
co
found
x
com
direct
bank
x
com
merg
confin
form
paypal
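The loop above only prints each stem. A minimal sketch of keeping the result instead, one stemmed sentence per input sentence, could look like the following. `stem_sentence` is a hypothetical helper, and the stemmer and stopword set are passed in as arguments so the sketch runs without NLTK data; `demo_stem` and `demo_stop_words` are placeholders, not the Porter stemmer or NLTK's stopword list.

```python
def stem_sentence(sentence, stem, stop_words):
    """Drop stopwords, stem the remaining words, and join them back into a sentence."""
    words = sentence.split()  # assumes the text is already cleaned to letters and spaces
    return ' '.join(stem(word) for word in words if word not in stop_words)

# Stand-ins so the sketch is self-contained (assumptions, not NLTK objects).
demo_stop_words = {'is', 'a', 'and'}

def demo_stem(word):
    """Placeholder stemmer: strips a single trailing 's'. Not the Porter algorithm."""
    return word[:-1] if word.endswith('s') else word

stemmed = stem_sentence('musk is a business magnate and investor', demo_stem, demo_stop_words)
print(stemmed)
```

In the notebook, the same helper could presumably be called as `stem_sentence(s, stemmer.stem, set(stopwords.words('english')))` for each `s` in `corpus`, collecting the results in a list.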


#### Lemmatization

In [11]:
# Now applying lemmatization
stop_words = set(stopwords.words('english'))  # build the stopword set once
corpus = []
for sentence in sentences:
    # replace every character other than a-z and A-Z with a space
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [12]:
corpus

['elon reeve musk l n ee lon born june business magnate investor',
 'musk founder chairman ceo chief technology officer spacex angel investor ceo product architect former chairman tesla inc owner chairman cto x corp founder boring company co founder neuralink openai president musk foundation',
 'wealthiest person world estimated net worth u billion september according bloomberg billionaire index billion according forbes primarily ownership stake tesla spacex',
 'musk born pretoria south africa briefly attended university pretoria immigrating canada age acquiring citizenship canadian born mother',
 'two year later matriculated queen university kingston ontario',
 'musk later transferred university pennsylvania received bachelor degree economics physic',
 'moved california attend stanford university',
 'however musk dropped two day brother kimbal co founded online city guide software company zip',
 'startup acquired compaq million million money made year musk co founded x com direct bank

### Bag of Words

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary=True, ngram_range=(2, 3))  # binary=True records presence (0/1) instead of counts; ngram_range=(2, 3) extracts both bigrams and trigrams

In [14]:
X = cv.fit_transform(corpus)

In [15]:
cv.vocabulary_  # maps each n-gram to its column index in X

{'elon reeve': 95,
 'reeve musk': 197,
 'musk ee': 158,
 'ee lon': 93,
 'lon born': 135,
 'born june': 32,
 'june business': 126,
 'business magnate': 41,
 'magnate investor': 139,
 'elon reeve musk': 96,
 'reeve musk ee': 198,
 'musk ee lon': 159,
 'ee lon born': 94,
 'lon born june': 136,
 'born june business': 33,
 'june business magnate': 127,
 'business magnate investor': 42,
 'musk founder': 161,
 'founder chairman': 110,
 'chairman ceo': 53,
 'ceo chief': 49,
 'chief technology': 59,
 'technology officer': 212,
 'officer spacex': 169,
 'spacex angel': 205,
 'angel investor': 12,
 'investor ceo': 124,
 'ceo product': 51,
 'product architect': 191,
 'architect former': 14,
 'former chairman': 102,
 'chairman tesla': 57,
 'tesla inc': 214,
 'inc owner': 120,
 'owner chairman': 175,
 'chairman cto': 55,
 'cto corp': 83,
 'corp founder': 81,
 'founder boring': 108,
 'boring company': 30,
 'company co': 74,
 'co founder': 68,
 'founder neuralink': 112,
 'neuralink openai': 167,
 'open
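The vocabulary contains `'musk ee'` even though the lemmatized sentence reads `musk l n ee lon`. This is consistent with `CountVectorizer`'s default `token_pattern`, `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters, so `l`, `n`, and `x` are dropped before n-grams are built. The plain-Python sketch below mimics that behaviour for illustration; `word_ngrams` is a hypothetical helper, not part of scikit-learn, and the alphabetical index assignment is an assumption that matches the indices shown above.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, n_min=2, n_max=3):
    """Return all word n-grams of length n_min..n_max, skipping one-character tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams("elon reeve musk l n ee lon born")
print(grams)

# Assumed indexing scheme: position of each n-gram in alphabetical order.
vocab = {gram: idx for idx, gram in enumerate(sorted(set(grams)))}
print(vocab)
```

If single-letter tokens matter for a task, passing a different `token_pattern` to `CountVectorizer` is one way to keep them.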

### TF-IDF

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(ngram_range=(3, 3))  # trigrams only; note that this rebinds cv from the CountVectorizer above
X2 = cv.fit_transform(corpus)

In [21]:
X2[0].toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.35355339, 0.        , 0.        , 0.        ,
        0.35355339, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.35355339, 0.35355339, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.35355339, 0.        , 0.        , 0.        ,
        0.35355339, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  
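The repeated value 0.35355339 is 1/√8. A plausible explanation: after one-character tokens are dropped, the first sentence has ten words and therefore eight trigrams, each occurring once and only in that sentence, so all eight get the same weight and L2 normalization scales each to 1/√8. The sketch below reproduces this in plain Python, assuming `TfidfVectorizer`'s default settings (smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 row normalization); `tfidf_rows` is a hypothetical helper and the two-document corpus is a cut-down stand-in for the real one.

```python
import math
from collections import Counter

def tfidf_rows(docs_as_terms):
    """TF-IDF with smoothed IDF and L2-normalized rows; one dict of weights per document."""
    n_docs = len(docs_as_terms)
    doc_freq = Counter(term for terms in docs_as_terms for term in set(terms))
    rows = []
    for terms in docs_as_terms:
        counts = Counter(terms)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty docs
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

def trigrams(text):
    """Word trigrams of an already tokenized, space-separated string."""
    words = text.split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

docs = [
    trigrams("elon reeve musk ee lon born june business magnate investor"),
    trigrams("musk founder chairman ceo"),
]
rows = tfidf_rows(docs)
print(len(rows[0]), sorted(set(round(w, 8) for w in rows[0].values())))
```

Because every trigram in a row is equally rare, the IDF factor cancels out during normalization and the weight depends only on how many trigrams the sentence has.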