<a href="https://colab.research.google.com/github/vincent4u/CE807_Text_Analytics/blob/main/week2/lab02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Processing and Representation

## Normalizing Text

In [1]:
text = "Hey Amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX ASAP! @AmazonHelp"

In [2]:
text = text.lower()

print(text)

hey amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first please fix asap! @amazonhelp


## Removing Unicode Characters

In [4]:
import re

In [5]:
# Remove @mentions, URLs, and any remaining non-alphanumeric characters.
text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)

print(text)

hey amazon  my package never arrived  please fix asap


**Your Turn: 1. Remove only URLs. 2. Remove only numbers. 3. Remove only special characters such as $ and @. Do this for the given text, then try different text inputs.**

Hint: https://docs.python.org/3/howto/regex.html
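
One possible starting point for the three tasks is sketched below. The `sample` string and the three patterns are illustrative assumptions rather than the only correct answers; for example, the URL pattern only covers links that start with `http://` or `https://`.

```python
import re

# Hypothetical input chosen to contain a URL, digits, and special characters.
sample = "Order #12345 cost $30, see https://example.com/orders?id=7 @AmazonHelp"

# 1. Remove only URLs: a scheme followed by a run of non-whitespace characters.
no_urls = re.sub(r"https?://\S+", "", sample)

# 2. Remove only numbers: every run of digits.
no_numbers = re.sub(r"\d+", "", sample)

# 3. Remove only special characters: anything that is neither a word
#    character (letter, digit, underscore) nor whitespace.
no_special = re.sub(r"[^\w\s]", "", sample)

print(no_urls)
print(no_numbers)
print(no_special)
```

Because each pattern is applied to the original `sample`, the three results can be compared side by side. Note that pattern 3 also strips the punctuation inside the URL, which is why the order of the steps matters when they are combined.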

## Removing Stopwords

In [6]:
import nltk.corpus
nltk.download('stopwords')
from nltk.corpus import stopwords


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [7]:
stop = stopwords.words('english')
text = "my package from amazon never arrived fix this asap"
text = " ".join([word for word in text.split() if word not in (stop)])

print(text)

package amazon never arrived fix asap


## Stemming and Lemmatization

Stemming

In [8]:
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer


In [None]:
words = ["jump", "jumped", "jumps", "jumping"]
stemmer = PorterStemmer()
for word in words:
  print(word + " = " + stemmer.stem(word))

jump = jump
jumped = jump
jumps = jump
jumping = jump


Lemmatization

In [None]:
import nltk
nltk.download('wordnet')
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
words = ["jump", "jumped", "jumps", "jumping"]
lemmatizer = WordNetLemmatizer()
for word in words:
  print(word + " = " + lemmatizer.lemmatize(word))

jump = jump
jumped = jumped
jumps = jump
jumping = jumping


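Play with different stemming and lemmatization examples and algorithms to build intuition. Note that `lemmatize` treats every word as a noun unless told otherwise, which is why `jumped` and `jumping` come back unchanged above; passing `pos='v'` lemmatizes them as verbs.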

## Part of Speech (POS) Tagging

There are eight main parts of speech, and using NLTK to tag each within our data allows us to glean further useful insight from our text.

For instance, by tagging and grouping our adjectives, we can calculate the most and least used descriptors, which points us towards our products’ strengths and weaknesses.

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
tokens = nltk.word_tokenize("amazon package never arrived fix asap")

print(tokens)

['amazon', 'package', 'never', 'arrived', 'fix', 'asap']


In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')



[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
tokens = ['amazon', 'package', 'never', 'arrived', 'fix', 'asap']
pos = nltk.pos_tag(tokens)

print(pos)

[('amazon', 'JJ'), ('package', 'NN'), ('never', 'RB'), ('arrived', 'VBD'), ('fix', 'JJ'), ('asap', 'NN')]
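
As a rough sketch of the "tag and group" idea described at the start of this section, the tagged pairs can be counted and grouped with the standard library. The `pos` list is copied from the output above; the names `tag_counts`, `words_by_tag`, and `adjectives` are made up for illustration. The tags follow the Penn Treebank convention (`JJ` adjective, `NN` noun, `RB` adverb, `VBD` past-tense verb).

```python
from collections import Counter, defaultdict

# (word, tag) pairs in the format returned by nltk.pos_tag.
pos = [('amazon', 'JJ'), ('package', 'NN'), ('never', 'RB'),
       ('arrived', 'VBD'), ('fix', 'JJ'), ('asap', 'NN')]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in pos)

# Which words were assigned to each tag?
words_by_tag = defaultdict(list)
for word, tag in pos:
    words_by_tag[tag].append(word)

# Frequency of each word tagged as an adjective.
adjectives = Counter(words_by_tag['JJ'])

print(tag_counts.most_common())
print(adjectives.most_common())
```

Here `amazon` and `fix` end up among the adjectives, which suggests that tagging lowercased, stopword-stripped text can be noisy; it may be worth comparing against tags produced from the original sentence.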


**Your turn: Identify the different POS tags and consider which ones would be useful.**

## Text Representation

How can we represent a corpus with one-hot encoding using sklearn?

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [None]:
corpus = ["How to format my hard disk problem", " Hard disk format problems ", "My amazon review is available at"]

We need to get the words in all sentences. We could apply various pre-processing steps first, but we are not going to do that here.

The easiest way to get the words in a sentence is to split on whitespace, so let's do that.
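
A minimal sketch of that split, using only built-in Python. The variable names `tokenized` and `vocabulary` are illustrative. Since no pre-processing is applied, `Hard` and `hard` are kept as two separate words.

```python
corpus = ["How to format my hard disk problem",
          " Hard disk format problems ",
          "My amazon review is available at"]

# str.split() with no argument splits on any run of whitespace and
# ignores leading and trailing spaces.
tokenized = [sentence.split() for sentence in corpus]

# The set of distinct words across all sentences, sorted for readability.
vocabulary = sorted({word for sentence in tokenized for word in sentence})

print(tokenized)
print(vocabulary)
```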

### Bag of words representation

You already saw this in the last lab.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
vectorizer = CountVectorizer(stop_words=['to', 'at'])


In [None]:
X = vectorizer.fit_transform(corpus)

In [None]:
vectorizer.get_feature_names_out()

array(['amazon', 'available', 'disk', 'format', 'hard', 'how', 'is', 'my',
       'problem', 'problems', 'review'], dtype=object)

In [None]:
len(vectorizer.get_feature_names_out())

11

Let's get the bag-of-words representation of the `corpus` sentences.

In [None]:
X.toarray()

array([[0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1]])

In [None]:
test = [' testing bow representation amazon amazon']

In [None]:
y = vectorizer.transform(test)

In [None]:
y.toarray()

array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [None]:
X.toarray().shape

(3, 11)

CountVectorizer has a number of very useful options, documented at:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
Spend some time now familiarizing yourself with them.

**Your turn: Take some test sentences and create their bag-of-words representations.**

**Your turn: How will you deal with unknown words that were not seen when the vectorizer was fitted?**
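
In the test example above, the unseen words `testing`, `bow`, and `representation` are silently dropped and only `amazon` is counted. One common option, sketched here in plain Python rather than with `CountVectorizer`, is to reserve an extra column for unknown words. The `<unk>` token and the `bow_with_unk` helper are hypothetical names, and the sketch skips the stop-word filtering used earlier.

```python
# Vocabulary copied from the fitted vectorizer above.
vocabulary = ['amazon', 'available', 'disk', 'format', 'hard', 'how', 'is', 'my',
              'problem', 'problems', 'review']

# Hypothetical placeholder that collects every out-of-vocabulary word.
UNK = '<unk>'
index = {word: i for i, word in enumerate(vocabulary + [UNK])}


def bow_with_unk(sentence):
    """Count known words and route every unseen word to the <unk> column."""
    vector = [0] * len(index)
    for token in sentence.lower().split():
        vector[index.get(token, index[UNK])] += 1
    return vector


print(bow_with_unk(' testing bow representation amazon amazon'))
```

The last column now records that three words fell outside the vocabulary, so a downstream model can at least see how much of the sentence was unknown.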

### tf-idf based representation

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


Creating an instance of TfidfVectorizer.



In [None]:
tfidf = TfidfVectorizer()


Let’s transform the data now.



In [None]:
transformed = tfidf.fit_transform(corpus)


**Your turn: Find a way to get the word/token names, and view the tf-idf representation of the `corpus`.**
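
To see where the weights come from, the computation can be worked through by hand. The sketch below assumes scikit-learn's documented defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and replaces the vectorizer's tokenizer with a simple lowercase-and-split, so treat it as an approximation to compare against `transformed.toarray()` rather than a drop-in replacement. On the fitted vectorizer itself, `get_feature_names_out()` works the same way as it did for `CountVectorizer`.

```python
import math
from collections import Counter

corpus = ["How to format my hard disk problem",
          " Hard disk format problems ",
          "My amazon review is available at"]

# Lowercase and split on whitespace (a simplification of the real tokenizer).
docs = [sentence.lower().split() for sentence in corpus]
vocab = sorted({word for doc in docs for word in doc})

# Document frequency: in how many sentences does each word occur?
n_docs = len(docs)
df = Counter(word for doc in docs for word in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}


def tfidf_row(doc):
    """Return the L2-normalised tf-idf weights of one tokenised sentence."""
    counts = Counter(doc)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


matrix = [tfidf_row(doc) for doc in docs]

# Show the non-zero weights of the first sentence.
for word, weight in zip(vocab, matrix[0]):
    if weight > 0:
        print(f"{word}: {weight:.3f}")
```

Words that occur in only one sentence, such as `how`, receive a higher idf than words shared between sentences, such as `disk`.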

### One hot vector representation

**Your turn: Use a one-hot vector representation of words to represent any sentence.** You might need to use `OneHotEncoder` and `LabelEncoder`.
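
Before reaching for the sklearn encoders, it can help to build the vectors by hand. The sketch below is a plain-Python illustration under simple assumptions (lowercasing, whitespace tokenization, and a vocabulary taken from the sentence itself); the `one_hot` helper is a made-up name. Mapping each word to an integer index and then to a vector with a single 1 is roughly the two-step job that `LabelEncoder` and `OneHotEncoder` perform.

```python
sentence = "How to format my hard disk problem"
tokens = sentence.lower().split()

# Step 1: give every distinct word an integer index.
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[index[word]] = 1
    return vector


# Step 2: the sentence becomes a sequence of one-hot vectors, one per token.
encoded = [one_hot(token) for token in tokens]

for token, vector in zip(tokens, encoded):
    print(f"{token:>8} {vector}")
```

Each row has exactly one non-zero entry, and the vector length equals the vocabulary size, which is why one-hot representations grow quickly with larger corpora.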

### word2vec Representation

We will see how to load word2vec vectors and perform some basic operations on them.

There are many word2vec models available; we will use the one provided through Gensim. The model has to be downloaded first, which takes time (it is a 2GB download), so while it is downloading, we can play with the online demos and try to understand how word2vec works.

Demos:

*   https://turbomaze.github.io/word2vecjson/
*   http://epsilon-it.utu.fi/wv_demo/
*   http://nlp.polytechnique.fr/word2vec
*   http://vectors.nlpl.eu/explore/embeddings/en/



In [None]:
import gensim.downloader as api


In [None]:
wv = api.load('word2vec-google-news-300')



**Your Turn: Find out how to save and load the word2vec model so that you don't need to download it again and again.**


Let's retrieve the vocabulary of a model

In [None]:
for index, word in enumerate(wv.index_to_key):
    if index == 50:  # For brevity, print only the first 50 words
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")


word #0/3000000 is </s>
word #1/3000000 is in
word #2/3000000 is for
word #3/3000000 is that
word #4/3000000 is is
word #5/3000000 is on
word #6/3000000 is ##
word #7/3000000 is The
word #8/3000000 is with
word #9/3000000 is said
word #10/3000000 is was
word #11/3000000 is the
word #12/3000000 is at
word #13/3000000 is not
word #14/3000000 is as
word #15/3000000 is it
word #16/3000000 is be
word #17/3000000 is from
word #18/3000000 is by
word #19/3000000 is are
word #20/3000000 is I
word #21/3000000 is have
word #22/3000000 is he
word #23/3000000 is will
word #24/3000000 is has
word #25/3000000 is ####
word #26/3000000 is his
word #27/3000000 is an
word #28/3000000 is this
word #29/3000000 is or
word #30/3000000 is their
word #31/3000000 is who
word #32/3000000 is they
word #33/3000000 is but
word #34/3000000 is $
word #35/3000000 is had
word #36/3000000 is year
word #37/3000000 is were
word #38/3000000 is we
word #39/3000000 is more
word #40/3000000 is ###
word #41/3000000 is up
[output truncated]

Let's get the vector representation of a word

In [None]:
word = 'king'
vec_king = wv[word]

Let's calculate Word Similarity

In [None]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'car'	'minivan'	0.69
'car'	'bicycle'	0.54
'car'	'airplane'	0.42
'car'	'cereal'	0.14
'car'	'communism'	0.06
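
The scores above are cosine similarities between word vectors. To make the measure concrete, here is a small standard-library sketch; the three-dimensional vectors are invented for illustration and are not taken from the word2vec model, whose vectors have 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors: "car" and "minivan" point in a similar direction,
# while "cereal" points elsewhere.
toy_vectors = {
    'car': [0.9, 0.1, 0.0],
    'minivan': [0.8, 0.3, 0.1],
    'cereal': [0.0, 0.2, 0.9],
}

for other in ('minivan', 'cereal'):
    score = cosine_similarity(toy_vectors['car'], toy_vectors[other])
    print(f"car vs {other}: {score:.2f}")
```

A value near 1 means the vectors point in almost the same direction, and a value near 0 means they are close to orthogonal, which mirrors the drop from `minivan` to `communism` in the real output.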


Let's find the nearest words to the given words

In [None]:
words = ['king']
print(wv.most_similar(positive=words, topn=5))

[('kings', 0.7138045430183411), ('queen', 0.6510956883430481), ('monarch', 0.6413194537162781), ('crown_prince', 0.6204220056533813), ('prince', 0.6159993410110474)]


**Your Turn: Play with different words and build intuition about the representation**