In [None]:
# NLTK provides stop words, stemming, and lemmatization, so install it first
!pip install nltk

In [1]:
# import libraries
import nltk # Natural Language Tool Kit
nltk.download('punkt')
#punkt - This tokenizer divides a text into a list of sentences by using an unsupervised algorithm
# to build a model for abbreviation words, collocations, and words that start sentences.
nltk.download('stopwords')
nltk.download('wordnet')
# wordnet - WordNet is a lexical database of English in which nouns, verbs, adjectives, and adverbs
# are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
# The omw-1.4 package downloaded next adds the Open Multilingual Wordnet data for other languages.
nltk.download('omw-1.4')
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
# PorterStemmer - for stemming
# WordNetLemmatizer - for lemmatization
import re # for regular expression

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [2]:
# text data :para="""paste whole paragraph here"""
para = """— Tensor and Tensorflow: A powerful combo 💪
The Google Brain team developed an advanced AI framework named Tensorflow years back. After that, Google designed its own processing unit named Tensor Processing Unit or TPU to perform more efficiently with the Tensorflow. The invention of TPU was a revolution in AI that has significantly expedited the training of huge machine learning models with millions (or, billions) of parameters. Nevertheless, that technology could not be used in low-power devices such as smartphones in Edge AI. The entrance of Google into the AI chip manufacturing club for low-power devices can be the next revolution in this industry. Many companies such as FogHorn and BlinkAI are working in Edge AI using currently existing AI chips in the market. However, the efficacy that Google can create by the combination of TensorFlow and Tensor will be game-changing. Welcome to the club, Google!

— Tensor is an AI chip designed by AI! 😲
Isn’t that cool? The story is started from an article published in Nature titles “A graph placement methodology for fast chip design”. To design a processing chip, there is a crucial step referred to as “floor planning” where the engineering team must place a large number of components such that a series of physical requirements including power consumption and performance get satisfied. I don’t go further into its details as I am also not an expert in hardware engineering. However, when you have a large series of choices to make with a series of constraints AI can kick in. You may remember how the AlphaGo project defeated a professional human Go player. This is exactly the same. Tensor is the real outcome of this project that is a new milestone in the AI industry. Kudos, Google!

— Tensor helps us build ethical AI. 💡
This is a double-edged sword statement. Ethical AI has various aspects from data privacy to AI for all. Tensor helps many users have the opportunity to try the latest AI advancement while they have no concern about their privacy. Why? Because the AI engine is running on the chip, and no data is sent to the cloud for further computation. On the other hand, the more tightly Google binds AI software and hardware, the harder it will be for other companies to compete. I don’t want to see days that other companies can not even compete on performing AI inference, i.e., compete on using AI. We almost lost the game of model training to giant tech companies. It would be a nightmare if we lose the game on AI inference to them as well. That is why I believe “Tensor helps us build ethical AI” is a double-edged sword."""

In [3]:
para

'— Tensor and Tensorflow: A powerful combo 💪\nThe Google Brain team developed an advanced AI framework named Tensorflow years back. After that, Google designed its own processing unit named Tensor Processing Unit or TPU to perform more efficiently with the Tensorflow. The invention of TPU was a revolution in AI that has significantly expedited the training of huge machine learning models with millions (or, billions) of parameters. Nevertheless, that technology could not be used in low-power devices such as smartphones in Edge AI. The entrance of Google into the AI chip manufacturing club for low-power devices can be the next revolution in this industry. Many companies such as FogHorn and BlinkAI are working in Edge AI using currently existing AI chips in the market. However, the efficacy that Google can create by the combination of TensorFlow and Tensor will be game-changing. Welcome to the club, Google!\n\n— Tensor is an AI chip designed by AI! 😲\nIsn’t that cool? The story is started

In [4]:
len(para) # number of characters in the paragraph

2602

In [5]:
# This paragraph contains special symbols, quotation marks, punctuation, and emojis,
# none of which are needed for the analysis.
# Tokenization
# Word Tokenization

document = "We are learning tokenization in NLP"
nltk.word_tokenize(document)

['We', 'are', 'learning', 'tokenization', 'in', 'NLP']

In [6]:
len(document)

35

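Word tokenization applied directly to the whole paragraph would give one flat list of tokens with no sentence boundaries (note that \n marks a new line in the paragraph). So we first split the paragraph into sentences using sentence tokenization.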

In [7]:
#Sentence Tokenization
sent = nltk.sent_tokenize(para)

In [8]:
len(sent) # length is 29 sentences i.e. 29 documents

29

In [9]:
sent[0] # shows first document

'— Tensor and Tensorflow: A powerful combo 💪\nThe Google Brain team developed an advanced AI framework named Tensorflow years back.'

In [10]:
# Text Cleaning - remove unnecessary things - punctuation marks, symbols, emojis, etc. using sub()
# Text normalization - convert each word in lower case
#  sub() returns a string where all matching occurrences of the specified pattern are replaced by the replace string.
corpus = []

for s in sent:
  txt = re.sub(r'[^a-zA-Z]', ' ', s) # replace every character that is not a letter with a space
  txt = txt.lower()
  corpus.append(txt)
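
Replacing each unwanted character with a space leaves runs of spaces in the cleaned sentences. A minimal sketch of one way to tidy that up, assuming a helper called `clean_text` (a hypothetical name, not part of this notebook) that also collapses repeated whitespace:

```python
import re

def clean_text(sentence):
    """Hypothetical helper: keep letters only, lower-case, collapse whitespace."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', sentence)  # same pattern as the cleaning loop
    lowered = letters_only.lower()
    # split() with no arguments drops leading, trailing, and repeated whitespace
    return ' '.join(lowered.split())

example = "Tensor is an AI chip designed by AI! 😲"
print(clean_text(example))
```

Whether the extra spaces matter depends on the next step: `nltk.word_tokenize` and the scikit-learn vectorizers ignore them, so this is mostly cosmetic.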

In [11]:
corpus

['  tensor and tensorflow  a powerful combo   the google brain team developed an advanced ai framework named tensorflow years back ',
 'after that  google designed its own processing unit named tensor processing unit or tpu to perform more efficiently with the tensorflow ',
 'the invention of tpu was a revolution in ai that has significantly expedited the training of huge machine learning models with millions  or  billions  of parameters ',
 'nevertheless  that technology could not be used in low power devices such as smartphones in edge ai ',
 'the entrance of google into the ai chip manufacturing club for low power devices can be the next revolution in this industry ',
 'many companies such as foghorn and blinkai are working in edge ai using currently existing ai chips in the market ',
 'however  the efficacy that google can create by the combination of tensorflow and tensor will be game changing ',
 'welcome to the club  google ',
 '  tensor is an ai chip designed by ai ',
 '  isn t

In [None]:
# now we can perform stemming and lemmatization
#Stemming
stemmer = PorterStemmer()

In [None]:
stemmer.stem('goes')

'goe'

In [None]:
stemmer.stem('history')

'histori'

In [None]:
stemmer.stem('finally')

'final'

In [None]:
stemmer.stem("developed")

'develop'
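
The Porter stemmer strips suffixes by rule instead of looking words up in a dictionary, which is why results such as 'goe' and 'histori' are not real words. As a small sketch (the word list is chosen only for illustration), related forms of a word usually collapse to one shared stem:

```python
from nltk.stem import PorterStemmer  # the stemmer itself needs no nltk.download() call

stemmer = PorterStemmer()

# Illustrative word family, not taken from the paragraph above
family = ['develop', 'developed', 'developing', 'develops']
stems = {word: stemmer.stem(word) for word in family}
print(stems)
```

This collapsing is the point of stemming: all four forms end up in the same column when features are extracted later.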

In [None]:

for i in corpus:
  words = nltk.word_tokenize(i)
  print(words)
  # we get separate list for each sentence

['tensor', 'and', 'tensorflow', 'a', 'powerful', 'combo', 'the', 'google', 'brain', 'team', 'developed', 'an', 'advanced', 'ai', 'framework', 'named', 'tensorflow', 'years', 'back']
['after', 'that', 'google', 'designed', 'its', 'own', 'processing', 'unit', 'named', 'tensor', 'processing', 'unit', 'or', 'tpu', 'to', 'perform', 'more', 'efficiently', 'with', 'the', 'tensorflow']
['the', 'invention', 'of', 'tpu', 'was', 'a', 'revolution', 'in', 'ai', 'that', 'has', 'significantly', 'expedited', 'the', 'training', 'of', 'huge', 'machine', 'learning', 'models', 'with', 'millions', 'or', 'billions', 'of', 'parameters']
['nevertheless', 'that', 'technology', 'could', 'not', 'be', 'used', 'in', 'low', 'power', 'devices', 'such', 'as', 'smartphones', 'in', 'edge', 'ai']
['the', 'entrance', 'of', 'google', 'into', 'the', 'ai', 'chip', 'manufacturing', 'club', 'for', 'low', 'power', 'devices', 'can', 'be', 'the', 'next', 'revolution', 'in', 'this', 'industry']
['many', 'companies', 'such', 'as',

In [None]:
# Perform tokenization, stop word removal, and stemming

stop_words = set(stopwords.words('english')) # build the English stop word set once, not on every token
for sentence in corpus:
  words = nltk.word_tokenize(sentence) # tokenize each sentence in the corpus into words
  for word in words: # go through every token in the sentence
    if word not in stop_words: # keep only the tokens that are not stop words
      print(stemmer.stem(word)) # print the stem of each remaining token using stem()
      # e.g. powerful -> power, google -> googl: stems are not always meaningful words

In [None]:
# Lemmatization
lemma = WordNetLemmatizer()

In [None]:
lemma.lemmatize('google')

'google'

In [None]:
lemma.lemmatize('historical')

'historical'

In [None]:
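lemma.lemmatize('coming') # the default pos='n' treats the word as a noun, so it comes back unchanged; pos='v' would give 'come'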

'coming'

In [None]:
stop_words = set(stopwords.words('english')) # build the English stop word set once
for sentence in corpus:
  words = nltk.word_tokenize(sentence)
  for word in words:
    if word not in stop_words:
      print(lemma.lemmatize(word))
      # lemmatization returns meaningful dictionary words such as google and powerful

# **Feature Extraction**

In [None]:
# convert text data to BoW (Bag of Words) or TFIDF (Term Frequency - Inverse Document Frequency)
# CountVectorizer - for BoW, TfidfVectorizer - for TFIDF
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer


In [None]:
cv = CountVectorizer() # BoW - frequency will be displayed
# Bag of Words (BoW) simply counts the frequency of words in a document.
#cv = CountVectorizer(binary=True) # only binary weight will be displayed i.e. present or not
x = cv.fit_transform(corpus) # learn the vocabulary and build the document-term counts
cv.vocabulary_ # maps each word (column) to its column index
#print(max(cv.vocabulary_))

In [None]:
# Let us inspect the BoW that was created. BoW gives the frequency of each term in a document; binary weights only tell whether a word is present.
# x is a sparse matrix, so convert it to a dense array to see the counts
x[0].toarray() # BoW for the 1st sentence. 'tensorflow' has index 181 and occurs 2 times in the 1st sentence. Check corpus[0].

array([[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

In [None]:
corpus[0] # the BoW above was created for this sentence. Why does it contain more values than the number of words in this sentence?
# Because there is one column per unique word in the whole corpus, not just in this sentence.

'  tensor and tensorflow  a powerful combo   the google brain team developed an advanced ai framework named tensorflow years back '
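
To see why each row is as wide as the whole vocabulary, here is a rough pure-Python sketch of the counting idea behind a bag of words. It is not scikit-learn's implementation: the function name `bag_of_words` and the tiny `docs` list are made up for illustration, and tokens are simply split on whitespace.

```python
from collections import Counter

def bag_of_words(docs, binary=False):
    """Hypothetical helper: return (vocabulary, rows) for whitespace-split docs."""
    tokenized = [doc.split() for doc in docs]
    # One column per unique word in the whole corpus, in sorted order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        if binary:
            # presence/absence only, like binary weights
            row = [1 if word in counts else 0 for word in vocabulary]
        else:
            row = [counts[word] for word in vocabulary]
        rows.append(row)
    return vocabulary, rows

docs = ['tensor and tensorflow named tensorflow', 'welcome to the club google']
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
print(rows[0])  # as wide as the vocabulary, with zeros for words not in docs[0]
```

One difference worth knowing: with its default settings, `CountVectorizer` also drops single-character tokens, which is why a word like 'a' has no column in the vocabulary.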

In [None]:
# convert the array into a DataFrame: the DTM (Document-Term Matrix)
x = pd.DataFrame(x.toarray(),columns=cv.get_feature_names_out())
x
# the word 'advanced' has index 1, so it appears at column index 1
# in the document at row index 5, the word 'ai' occurs 2 times
# BoW - gives you the frequency
# Binary weights - tell whether a word is present or not
# If you want presence/absence instead of frequency in the DataFrame, use 'cv = CountVectorizer(binary=True)'

Unnamed: 0,about,advanced,advancement,after,ai,all,almost,alphago,also,am,...,when,where,while,why,will,with,working,would,years,you
0,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# We can start model building now. We have 216 columns i.e. 216 unique words are there.
# 29 rows - 29 documents / sentences are there.

# **TF-IDF**
Term frequency (TF) measures how often a term occurs in a given document.

Inverse document frequency (IDF) measures how rare a term is across the corpus: the fewer documents contain it, the higher its IDF.

A term's TF-IDF score is the product of the two, so the highest-scoring terms in a document are the ones most characteristic of that document.
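
As a worked sketch of these definitions, the snippet below computes TF-IDF by hand on a toy corpus. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The helper name `tfidf_rows` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Hypothetical helper: TF-IDF weights for whitespace-split documents."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(tokenized)
    # document frequency: number of documents containing each word
    df = {w: sum(w in tokens for tokens in tokenized) for w in vocabulary}
    # smoothed inverse document frequency (assumed formula, see the text above)
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm for x in weights])  # L2-normalize the row
    return vocabulary, rows

docs = ['google designed tensor', 'google designed tpu', 'google welcome']
vocabulary, rows = tfidf_rows(docs)
for word, weight in zip(vocabulary, rows[0]):
    print(f'{word:10s} {weight:.3f}')
```

In the first toy document, 'tensor' gets the largest weight because no other document contains it, while 'google' appears in all three documents and gets the smallest non-zero weight.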



In [None]:
tf = TfidfVectorizer()
x = tf.fit_transform(corpus)

In [None]:
# TF-IDF gives real-valued weights rather than integer counts
x.toarray()

array([[0.        , 0.26420014, 0.        , ..., 0.        , 0.26420014,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.29867304, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [None]:
# convert it into a DataFrame
x = pd.DataFrame(x.toarray(),columns = tf.get_feature_names_out()) # get_feature_names_out() gives the list of unique words in the corpus
x # this DTM (Document-Term Matrix) holds TF-IDF weights, so word frequency information is retained

Unnamed: 0,about,advanced,advancement,after,ai,all,almost,alphago,also,am,...,when,where,while,why,will,with,working,would,years,you
0,0.0,0.2642,0.0,0.0,0.11172,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2642,0.0
1,0.0,0.0,0.0,0.230918,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.187753,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.093004,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.178827,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.120739,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.113623,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.219569,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.259624,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.26467,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.419187,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
x.shape

(29, 216)

In [None]:
tf.get_feature_names_out()

In [None]:
len(tf.get_feature_names_out())

216