# Text Preprocessing

## Tokenization

#### Tokenization, in Natural Language Processing (NLP) and machine learning, is the process of splitting a sequence of text into smaller units called tokens, such as sentences, words, or punctuation marks.

In [1]:
!pip install nltk

Defaulting to user installation because normal site-packages is not writeable


In [12]:
corpus = """Hello, my name is Sijan.
Hi there! beautiful's world."""

In [13]:
print(corpus)

Hello, my name is Sijan.
Hi there! beautiful's world.


In [14]:
## Tokenization
## Paragraph --> Sentences
from nltk.tokenize import sent_tokenize

In [15]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/sijan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [16]:
documents = sent_tokenize(corpus)

In [17]:
documents

['Hello, my name is Sijan.', 'Hi there!', "beautiful's world."]

In [18]:
for sentence in documents:
    print(sentence)

Hello, my name is Sijan.
Hi there!
beautiful's world.
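
To make the idea concrete, here is a minimal standard-library sketch of sentence splitting. It is not how `sent_tokenize` works internally (Punkt is a trained model that also copes with abbreviations); the `naive_sent_split` helper and its regular expression are assumptions made for illustration only.

```python
import re

corpus = """Hello, my name is Sijan.
Hi there! beautiful's world."""

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sent_split(corpus))
# ['Hello, my name is Sijan.', 'Hi there!', "beautiful's world."]
```

A rule this simple would also split after abbreviations such as "Dr.", which is one reason a trained tokenizer is usually preferred.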


In [19]:
## Tokenization
## Paragraph-->words
## sentence-->words
from nltk.tokenize import word_tokenize

In [20]:
word_tokenize(corpus)

['Hello',
 ',',
 'my',
 'name',
 'is',
 'Sijan',
 '.',
 'Hi',
 'there',
 '!',
 'beautiful',
 "'s",
 'world',
 '.']

In [21]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', ',', 'my', 'name', 'is', 'Sijan', '.']
['Hi', 'there', '!']
['beautiful', "'s", 'world', '.']


In [22]:
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(corpus)

['Hello',
 ',',
 'my',
 'name',
 'is',
 'Sijan',
 '.',
 'Hi',
 'there',
 '!',
 'beautiful',
 "'",
 's',
 'world',
 '.']
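
The split of `'s` into `'` and `s` follows from how `wordpunct_tokenize` works: it is a regular-expression tokenizer that matches either a run of word characters or a run of non-word, non-space characters. The sketch below reproduces that behaviour with the standard `re` module; the function name `regex_wordpunct` is made up for this example.

```python
import re

def regex_wordpunct(text):
    # Either a run of word characters, or a run of punctuation characters
    return re.findall(r"\w+|[^\w\s]+", text)

print(regex_wordpunct("Hi there! beautiful's world."))
# ['Hi', 'there', '!', 'beautiful', "'", 's', 'world', '.']
```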

In [23]:
from nltk.tokenize import TreebankWordTokenizer

In [24]:
tokenizer = TreebankWordTokenizer()

In [25]:
tokenizer.tokenize(corpus)

['Hello',
 ',',
 'my',
 'name',
 'is',
 'Sijan.',
 'Hi',
 'there',
 '!',
 'beautiful',
 "'s",
 'world',
 '.']

# Stemming

#### Stemming, in Natural Language Processing (NLP), is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). The resulting stem is not necessarily a valid dictionary word.

In [28]:
words = ["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

### PorterStemmer

In [29]:
from nltk.stem import PorterStemmer

In [30]:
stemming = PorterStemmer()

In [31]:
for word in words:
    print(word+"--->"+stemming.stem(word))

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
writes--->write
programming--->program
programs--->program
history--->histori
finally--->final
finalized--->final


In [32]:
stemming.stem("congratulations")

'congratul'

In [34]:
stemming.stem("sitting")

'sit'

### RegexpStemmer Class
##### NLTK provides the RegexpStemmer class for regular-expression-based stemming. It takes a single regular expression and removes any part of the word that matches it; the `min` argument sets the minimum word length for stemming to apply.

In [35]:
from nltk.stem import RegexpStemmer

In [38]:
reg_stemmer = RegexpStemmer("ing$|s$|e$|able$", min=4)

In [39]:
reg_stemmer.stem("eating")

'eat'

In [43]:
reg_stemmer.stem("class")

'clas'
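
The `'clas'` result shows the limitation of this approach: the pattern is applied blindly. A rough standard-library equivalent, assuming the stemmer simply deletes every regex match from words of at least `min` characters, might look like this (`regexp_stem` is a hypothetical helper, not NLTK's implementation):

```python
import re

def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_len=4):
    # Words shorter than min_len are returned unchanged
    if len(word) < min_len:
        return word
    # Delete whatever the pattern matches; '$' anchors each alternative to the end
    return re.sub(pattern, "", word)

for w in ["eating", "class", "readable", "is"]:
    print(w, "--->", regexp_stem(w))
# eating ---> eat
# class ---> clas
# readable ---> read
# is ---> is
```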

### Snowball Stemmer
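#### Often gives better stems than PorterStemmer: the English Snowball stemmer implements Porter2, a refinement of the original Porter algorithm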

In [44]:
from nltk.stem import SnowballStemmer

In [46]:
snowball_stemmer = SnowballStemmer("english")

In [48]:
for word in words:
    print(word+"--->"+snowball_stemmer.stem(word))

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
writes--->write
programming--->program
programs--->program
history--->histori
finally--->final
finalized--->final


#### Why is the Snowball stemmer better than the Porter stemmer?

In [49]:
stemming.stem("fairly"), stemming.stem("sportingly")

('fairli', 'sportingli')

In [51]:
snowball_stemmer.stem("fairly"), snowball_stemmer.stem("sportingly")

('fair', 'sport')

## Lemmatization

### Wordnet Lemmatizer
#### Lemmatization is similar to stemming, but its output, called a "lemma", is a valid root word rather than a root stem. After lemmatization we get a dictionary word with the same meaning.

#### NLTK provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus. It uses the morphy() function of the WordNet CorpusReader class to find a lemma.

In [53]:
from nltk.stem import WordNetLemmatizer

In [54]:
lemmatizer = WordNetLemmatizer()

In [56]:
import nltk
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /home/sijan/nltk_data...


True

In [61]:
# POS values accepted by lemmatize():
#   noun -> 'n', verb -> 'v', adjective -> 'a', adverb -> 'r'
lemmatizer.lemmatize("going", pos='v')

'go'

In [62]:
words = ["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

In [64]:
for word in words:
    print(word+"--->"+lemmatizer.lemmatize(word, pos="v"))

eating--->eat
eats--->eat
eaten--->eat
writing--->write
writes--->write
programming--->program
programs--->program
history--->history
finally--->finally
finalized--->finalize


In [65]:
lemmatizer.lemmatize("fairly",pos="v"), lemmatizer.lemmatize("sportingly")

('fairly', 'sportingly')
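
The `pos` argument matters: a word is only reduced when it is known as an inflected form for the requested part of speech, and otherwise it is returned unchanged. The toy below mimics that lookup-with-fallback behaviour using a hand-written table; the table entries and the `toy_lemmatize` function are illustrative assumptions, not WordNet data.

```python
# Toy lookup table keyed by (word, pos); entries are hand-written for illustration
TOY_LEMMAS = {
    ("going", "v"): "go",
    ("eaten", "v"): "eat",
    ("finalized", "v"): "finalize",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when no entry exists for this part of speech
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("going", pos="v"))   # go
print(toy_lemmatize("going"))            # going (no noun entry in the toy table)
print(toy_lemmatize("fairly", pos="v"))  # fairly
```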

## Stopwords with NLTK

In [69]:
## Hacking
paragraph = '''A commonly used hacking definition is the act of compromising digital devices and networks through unauthorized access to an account or computer system. Hacking is not always a malicious act, but it is most commonly associated with illegal activity and data theft by cyber criminals. 

But what is hacking in a cyber security context? 

Hacking in cyber security refers to the misuse of devices like computers, smartphones, tablets, and networks to cause damage to or corrupt systems, gather information on users, steal data and documents, or disrupt data-related activity.

A traditional view of hackers is a lone rogue programmer who is highly skilled in coding and modifying computer software and hardware systems. But this narrow view does not cover the true technical nature of hacking. Hackers are increasingly growing in sophistication, using stealthy attack methods designed to go completely unnoticed by cybersecurity software and IT teams. They are also highly skilled in creating attack vectors that trick users into opening malicious attachments or links and freely giving up their sensitive personal data.

As a result, modern-day hacking involves far more than just an angry kid in their bedroom. It is a multibillion-dollar industry with extremely sophisticated and successful techniques.'''

In [70]:
from nltk.stem import PorterStemmer

In [71]:
from nltk.corpus import stopwords

In [72]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /home/sijan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [73]:
stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [74]:
from nltk.stem import PorterStemmer

In [75]:
stemmer = PorterStemmer()

In [77]:
from nltk.tokenize import sent_tokenize

In [104]:
sentences = sent_tokenize(paragraph)

In [80]:
type(sentences)

list

In [81]:
## Remove stopwords, then apply Porter stemming

stop_words = set(stopwords.words("english"))  # build the set once, not once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # join the stemmed words back into a sentence

In [82]:
sentences

['a commonli use hack definit act compromis digit devic network unauthor access account comput system .',
 'hack alway malici act , commonli associ illeg activ data theft cyber crimin .',
 'but hack cyber secur context ?',
 'hack cyber secur refer misus devic like comput , smartphon , tablet , network caus damag corrupt system , gather inform user , steal data document , disrupt data-rel activ .',
 'a tradit view hacker lone rogu programm highli skill code modifi comput softwar hardwar system .',
 'but narrow view cover true technic natur hack .',
 'hacker increasingli grow sophist , use stealthi attack method design go complet unnot cybersecur softwar it team .',
 'they also highli skill creat attack vector trick user open malici attach link freeli give sensit person data .',
 'as result , modern-day hack involv far angri kid bedroom .',
 'it multibillion-dollar industri extrem sophist success techniqu .']
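
Note that `'a'`, `'but'`, `'they'`, `'as'` and `'it'` survive in the output even though they are stopwords. The NLTK stopword list is lowercase, so capitalized tokens such as `'A'` or `'But'` pass the filter, and the Porter stemmer then lowercases them. A minimal sketch of a case-insensitive filter, using a small hand-picked subset as a stand-in for `stopwords.words("english")`:

```python
# Small stand-in for stopwords.words("english"); the real list is much longer
STOP_WORDS = {"a", "is", "the", "of", "but", "it", "not", "what", "in"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    # Compare the lowercased token so that "A" and "But" are filtered too
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["But", "what", "is", "hacking", "in", "a", "cyber", "security", "context", "?"]
print(remove_stopwords(tokens))
# ['hacking', 'cyber', 'security', 'context', '?']
```

One trade-off of lowercasing before the comparison is that acronyms such as "IT" would be dropped along with the pronoun "it".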

In [84]:
## Remove stopwords, then apply Snowball stemming

stop_words = set(stopwords.words("english"))  # build the set once, not once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [snowball_stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # join the stemmed words back into a sentence

In [85]:
sentences

['a common use hack definit act compromis digit devic network unauthor access account comput system .',
 'hack alway malici act , common associ illeg activ data theft cyber crimin .',
 'but hack cyber secur context ?',
 'hack cyber secur refer misus devic like comput , smartphon , tablet , network caus damag corrupt system , gather inform user , steal data document , disrupt data-rel activ .',
 'a tradit view hacker lone rogu programm high skill code modifi comput softwar hardwar system .',
 'but narrow view cover true technic natur hack .',
 'hacker increas grow sophist , use stealthi attack method design go complet unnot cybersecur softwar it team .',
 'they also high skill creat attack vector trick user open malici attach link freeli give sensit person data .',
 'as result , modern-day hack involv far angri kid bedroom .',
 'it multibillion-dollar industri extrem sophist success techniqu .']

In [94]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [105]:
## Remove stopwords, then apply lemmatization (treating every word as a verb)

stop_words = set(stopwords.words("english"))  # build the set once, not once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word.lower(), pos="v") for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # join the lemmatized words back into a sentence

In [106]:
sentences

['a commonly use hack definition act compromise digital devices network unauthorized access account computer system .',
 'hack always malicious act , commonly associate illegal activity data theft cyber criminals .',
 'but hack cyber security context ?',
 'hack cyber security refer misuse devices like computers , smartphones , tablets , network cause damage corrupt systems , gather information users , steal data document , disrupt data-related activity .',
 'a traditional view hackers lone rogue programmer highly skilled cod modify computer software hardware systems .',
 'but narrow view cover true technical nature hack .',
 'hackers increasingly grow sophistication , use stealthy attack methods design go completely unnoticed cybersecurity software it team .',
 'they also highly skilled create attack vectors trick users open malicious attachments link freely give sensitive personal data .',
 'as result , modern-day hack involve far angry kid bedroom .',
 'it multibillion-dollar industr

In [103]:
lemmatizer.lemmatize("techniques", pos="v")

'techniques'

## Parts of Speech (pos) tagging

In [107]:
import nltk
sentences = nltk.sent_tokenize(paragraph)

In [109]:
from nltk.corpus import stopwords

In [108]:
sentences

['A commonly used hacking definition is the act of compromising digital devices and networks through unauthorized access to an account or computer system.',
 'Hacking is not always a malicious act, but it is most commonly associated with illegal activity and data theft by cyber criminals.',
 'But what is hacking in a cyber security context?',
 'Hacking in cyber security refers to the misuse of devices like computers, smartphones, tablets, and networks to cause damage to or corrupt systems, gather information on users, steal data and documents, or disrupt data-related activity.',
 'A traditional view of hackers is a lone rogue programmer who is highly skilled in coding and modifying computer software and hardware systems.',
 'But this narrow view does not cover the true technical nature of hacking.',
 'Hackers are increasingly growing in sophistication, using stealthy attack methods designed to go completely unnoticed by cybersecurity software and IT teams.',
 'They are also highly skil

In [111]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/sijan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [112]:
## Find the POS tag of each remaining word
stop_words = set(stopwords.words("english"))  # build the set once, not once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [word for word in words if word not in stop_words]
    # sentences[i] = " ".join(words)  # would join the words back into a sentence
    pos_tag = nltk.pos_tag(words)
    print(pos_tag)

[('A', 'DT'), ('commonly', 'RB'), ('used', 'VBN'), ('hacking', 'VBG'), ('definition', 'NN'), ('act', 'NN'), ('compromising', 'VBG'), ('digital', 'JJ'), ('devices', 'NNS'), ('networks', 'NNS'), ('unauthorized', 'JJ'), ('access', 'NN'), ('account', 'NN'), ('computer', 'NN'), ('system', 'NN'), ('.', '.')]
[('Hacking', 'VBG'), ('always', 'RB'), ('malicious', 'JJ'), ('act', 'NN'), (',', ','), ('commonly', 'RB'), ('associated', 'JJ'), ('illegal', 'JJ'), ('activity', 'NN'), ('data', 'NNS'), ('theft', 'NN'), ('cyber', 'NN'), ('criminals', 'NNS'), ('.', '.')]
[('But', 'CC'), ('hacking', 'VBG'), ('cyber', 'JJ'), ('security', 'NN'), ('context', 'NN'), ('?', '.')]
[('Hacking', 'VBG'), ('cyber', 'JJ'), ('security', 'NN'), ('refers', 'NNS'), ('misuse', 'VBP'), ('devices', 'NNS'), ('like', 'IN'), ('computers', 'NNS'), (',', ','), ('smartphones', 'NNS'), (',', ','), ('tablets', 'NNS'), (',', ','), ('networks', 'NNS'), ('cause', 'VBP'), ('damage', 'NN'), ('corrupt', 'JJ'), ('systems', 'NNS'), (',', ','

In [119]:
x = "Taj Mahal is a beautiful Monument".split()
nltk.pos_tag(x)

[('Taj', 'NNP'),
 ('Mahal', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('beautiful', 'JJ'),
 ('Monument', 'NN')]
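
`nltk.pos_tag` returns a list of `(token, tag)` tuples using Penn Treebank tags, so ordinary list operations are enough to post-process it. As a small sketch, the snippet below takes the tagged output shown above as a literal and keeps only the nouns (tags starting with `NN`); the variable names are illustrative.

```python
tagged = [("Taj", "NNP"), ("Mahal", "NNP"), ("is", "VBZ"),
          ("a", "DT"), ("beautiful", "JJ"), ("Monument", "NN")]

# NN, NNS, NNP and NNPS all start with "NN"
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)  # ['Taj', 'Mahal', 'Monument']
```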

In [117]:
"Taj Mahal is a beautiful Monument".split()

['Taj', 'Mahal', 'is', 'a', 'beautiful', 'Monument']

## Named Entity Recognition

In [122]:
sentence = '''The Eiffel Tower was built from 1887 to 1889 by French engineer Gustave Eiffel, whose company specialized in building metal frameworks and structures.'''

In [123]:
sentence

'The Eiffel Tower was built from 1887 to 1889 by French engineer Gustave Eiffel, whose company specialized in building metal frameworks and structures.'

In [124]:
import nltk
words = nltk.word_tokenize(sentence)

In [127]:
tag_elements = nltk.pos_tag(words)

In [134]:
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/sijan/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [135]:
nltk.download('words')

[nltk_data] Downloading package words to /home/sijan/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [137]:
nltk.ne_chunk(tag_elements).draw()

# Word2vec Implementation

In [1]:
!pip install gensim

Defaulting to user installation because normal site-packages is not writeable
Collecting gensim
  Downloading gensim-4.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.5 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-7.0.1-py3-none-any.whl (60 kB)
Collecting wrapt
  Downloading wrapt-1.16.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (80 kB)
Installing collected packages: wrapt, smart-open

In [1]:
import gensim

In [2]:
from gensim.models import Word2Vec, KeyedVectors

In [1]:
import gensim.downloader as api

In [2]:
wv = api.load("word2vec-google-news-300")

In [3]:
vec_king = wv["king"]

In [4]:
vec_king

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [7]:
vec_king.shape

(300,)

In [None]:
wv.most_similar("cricket")

In [None]:
wv.most_similar("happy")

In [7]:
wv.similarity("hockey", "sports")

0.5354152
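
`wv.similarity` reports the cosine similarity between the two word vectors. As a sketch of that computation on plain Python lists (the short vectors here are made-up stand-ins for the 300-dimensional embeddings):

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Values close to 1 mean the vectors point in nearly the same direction, and values near 0 mean they are unrelated.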

In [5]:
vec = wv["king"]-wv["man"]+wv["woman"]

In [6]:
vec

array([ 4.29687500e-02, -1.78222656e-01, -1.29089355e-01,  1.15234375e-01,
        2.68554688e-03, -1.02294922e-01,  1.95800781e-01, -1.79504395e-01,
        1.95312500e-02,  4.09919739e-01, -3.68164062e-01, -3.96484375e-01,
       -1.56738281e-01,  1.46484375e-03, -9.30175781e-02, -1.16455078e-01,
       -5.51757812e-02, -1.07574463e-01,  7.91015625e-02,  1.98974609e-01,
        2.38525391e-01,  6.34002686e-02, -2.17285156e-02,  0.00000000e+00,
        4.72412109e-02, -2.17773438e-01, -3.44726562e-01,  6.37207031e-02,
        3.16406250e-01, -1.97631836e-01,  8.59375000e-02, -8.11767578e-02,
       -3.71093750e-02,  3.15551758e-01, -3.41796875e-01, -4.68750000e-02,
        9.76562500e-02,  8.39843750e-02, -9.71679688e-02,  5.17578125e-02,
       -5.00488281e-02, -2.20947266e-01,  2.29492188e-01,  1.26403809e-01,
        2.49023438e-01,  2.09960938e-02, -1.09863281e-01,  5.81054688e-02,
       -3.35693359e-02,  1.29577637e-01,  2.41699219e-02,  3.48129272e-02,
       -2.60009766e-01,  

In [None]:
wv.most_similar([vec])
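
The analogy query can be pictured with toy vectors. In the sketch below the 3-dimensional embeddings are invented for illustration (axes loosely meaning royalty, male, female), and `nearest` is a hypothetical helper. It skips the query words themselves, since the raw `king - man + woman` vector often stays closest to `king`. gensim applies the same exclusion when the query is written as `wv.most_similar(positive=["king", "woman"], negative=["man"])`.

```python
import math

# Invented 3-d embeddings: [royalty, male, female]
toy_wv = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(vec, exclude=()):
    # Rank every word except the excluded ones by cosine similarity to vec
    candidates = [w for w in toy_wv if w not in exclude]
    return max(candidates, key=lambda w: cosine(toy_wv[w], vec))

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(toy_wv["king"], toy_wv["man"], toy_wv["woman"])]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```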