# Text Representation

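This example shows frequency-based text representations (bag of words, TF-IDF, and n-grams) built with scikit-learn vectorizers, together with the NLTK preprocessing steps that usually precede them.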

## Import packages 

* **punkt** - tokenizer models used by `word_tokenize()`
* **wordnet**, **omw-1.4** - data for the WordNet lemmatizer
* **stopwords** - stop word lists
* **averaged_perceptron_tagger** - part-of-speech (POS) tagger

In [1]:
import nltk
nltk.download('punkt') 
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger') 

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ytan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ytan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ytan\AppData\Roaming\nltk_data...
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ytan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ytan\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

### Data

List of documents

In [1]:
d1 = "He is a good guy, he is not bad"
d2 = "feet wolves cooked boys girls ,!<@!"
d3 = "He is not a good guy, he is bad"

c1 = [d1, d2, d3]

## Tokenizers

`word_tokenize()` and `WhitespaceTokenizer()`

In [2]:
token_d1 = nltk.word_tokenize(d1)
print(token_d1)

tokenizer2 = nltk.tokenize.WhitespaceTokenizer()
token_d12 = tokenizer2.tokenize(d1)
print(token_d12)

['He', 'is', 'a', 'good', 'guy', ',', 'he', 'is', 'not', 'bad']
['He', 'is', 'a', 'good', 'guy,', 'he', 'is', 'not', 'bad']
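
The difference between the two tokenizers lies in how punctuation is handled. The sketch below illustrates it without NLTK: `str.split()` behaves like `WhitespaceTokenizer()` on this sentence, while the regular expression is only a rough stand-in for `word_tokenize()` (an assumption made for illustration, not NLTK's actual algorithm).

```python
import re

d1 = "He is a good guy, he is not bad"

# Splitting on whitespace leaves punctuation attached to the neighbouring word
whitespace_tokens = d1.split()

# Rough stand-in for a punctuation-aware tokenizer:
# runs of word characters, or single non-space punctuation characters
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", d1)

print(whitespace_tokens)
print(punct_aware_tokens)
```

The whitespace version keeps `'guy,'` as one token, whereas the punctuation-aware version yields `'guy'` and `','` separately.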

## Bag of Words (BOW)

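* use `CountVectorizer()`
* `fit()` learns the vocabulary from the documents and assigns each term a column index in alphabetical order
* with the default settings, text is lowercased and single-character tokens such as "a" are dropped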

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer1 = CountVectorizer()
vectorizer1.fit(c1)

print(vectorizer1.vocabulary_)

{'he': 7, 'is': 8, 'good': 5, 'guy': 6, 'not': 9, 'bad': 0, 'feet': 3, 'wolves': 10, 'cooked': 2, 'boys': 1, 'girls': 4}


* `transform()` creates a vector of word counts for each document in the space spanned by the vocabulary

In [5]:
v1 = vectorizer1.transform(c1)
print(v1.toarray())

[[1 0 0 0 0 1 1 2 2 1 0]
 [0 1 1 1 1 0 0 0 0 0 1]
 [1 0 0 0 0 1 1 2 2 1 0]]
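
To make explicit what `fit()` and `transform()` compute here, the following pure-Python sketch rebuilds the same count matrix. It assumes `CountVectorizer`'s default settings (lowercasing and the token pattern `\b\w\w+\b`, which keeps tokens of two or more characters). The names `tokenize`, `vocabulary`, and `counts` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

d1 = "He is a good guy, he is not bad"
d2 = "feet wolves cooked boys girls ,!<@!"
d3 = "He is not a good guy, he is bad"
c1 = [d1, d2, d3]

def tokenize(doc):
    # Mimic the default analyzer: lowercase, keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# "fit": collect the distinct tokens and index them alphabetically
terms = sorted({token for doc in c1 for token in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(terms)}

# "transform": one row per document, one column per vocabulary term
counts = []
for doc in c1:
    term_counts = Counter(tokenize(doc))
    counts.append([term_counts[term] for term in terms])

print(vocabulary)
for row in counts:
    print(row)
```

The rows for `d1` and `d3` are identical: a bag of words discards word order, so "not bad" and "not a good guy" cannot be told apart from counts alone.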


## Stemming

* use the Porter stemmer
* **token_d2** contains a list of tokens
* for each **token**, apply the stemmer only if `isalpha()` is true, which drops the punctuation tokens

In [6]:
token_d2 = nltk.word_tokenize(d2.lower())

stemmer = nltk.stem.PorterStemmer()
stemmed_token_d2 = [stemmer.stem(token) for token in token_d2 if token.isalpha()]

# alternative way, written as an explicit loop
# stemmed_token_d2 = []
# for token in token_d2:
#     if token.isalpha():
#         stemmed_token_d2.append(stemmer.stem(token))

print(stemmed_token_d2)
print(token_d2)

['feet', 'wolv', 'cook', 'boy', 'girl']
['feet', 'wolves', 'cooked', 'boys', 'girls', ',', '!', '<', '@', '!']


## Lemmatizing

In [7]:
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatized_token_d2 = [lemmatizer.lemmatize(token) for token in token_d2 if token.isalpha()]
print(lemmatized_token_d2)

['foot', 'wolf', 'cooked', 'boy', 'girl']


## Remove stop words

In [8]:
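from nltk.corpus import stopwords

# build the stop word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
stop_words_removed = [token for token in token_d1
                      if token not in stop_words and token.isalpha()]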

print(token_d1)
print(stop_words_removed)

['He', 'is', 'a', 'good', 'guy', ',', 'he', 'is', 'not', 'bad']
['He', 'good', 'guy', 'bad']
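
Note that `'He'` survives the filter while `'he'` is removed: the membership test is case-sensitive and the stop word list is lowercase. The sketch below shows the effect of lowercasing before the lookup. The four-word `stop_words` set is a hypothetical stand-in for the NLTK English list, used so that the example runs without downloading any data.

```python
token_d1 = ['He', 'is', 'a', 'good', 'guy', ',', 'he', 'is', 'not', 'bad']

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
stop_words = {'he', 'is', 'a', 'not'}

# Case-sensitive lookup: 'He' is not in the lowercase set, so it is kept
case_sensitive = [t for t in token_d1 if t not in stop_words and t.isalpha()]

# Lowercasing before the lookup removes 'He' as well
case_insensitive = [t for t in token_d1 if t.lower() not in stop_words and t.isalpha()]

print(case_sensitive)
print(case_insensitive)
```

Removing "not" also removes the negation, which may matter for tasks such as sentiment analysis.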


## TF-IDF Vectorizer

* `fit()` learns the vocabulary from the documents and assigns each term a column index in alphabetical order
* `transform()` creates a vector of TF-IDF weights for each document in the space spanned by the vocabulary

In [9]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer2 = TfidfVectorizer()
vectorizer2.fit(c1)
print(vectorizer2.vocabulary_)

v2 = vectorizer2.transform(c1)
print(v2.toarray())


{'he': 7, 'is': 8, 'good': 5, 'guy': 6, 'not': 9, 'bad': 0, 'feet': 3, 'wolves': 10, 'cooked': 2, 'boys': 1, 'girls': 4}
[[0.28867513 0.         0.         0.         0.         0.28867513
  0.28867513 0.57735027 0.57735027 0.28867513 0.        ]
 [0.         0.4472136  0.4472136  0.4472136  0.4472136  0.
  0.         0.         0.         0.         0.4472136 ]
 [0.28867513 0.         0.         0.         0.         0.28867513
  0.28867513 0.57735027 0.57735027 0.28867513 0.        ]]
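
The weights above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper `tokenize` and the variable names are illustrative only.

```python
import math
import re
from collections import Counter

d1 = "He is a good guy, he is not bad"
d2 = "feet wolves cooked boys girls ,!<@!"
d3 = "He is not a good guy, he is bad"
c1 = [d1, d2, d3]

def tokenize(doc):
    # Default analyzer assumption: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [tokenize(doc) for doc in c1]
terms = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain the term
df = {term: sum(term in doc for doc in docs) for term in terms}

# Smoothed IDF, as used when smooth_idf=True
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in terms}

tfidf = []
for doc in docs:
    tf = Counter(doc)
    row = [tf[term] * idf[term] for term in terms]
    norm = math.sqrt(sum(x * x for x in row))  # L2 norm of the row
    tfidf.append([x / norm for x in row])

for row in tfidf:
    print([round(x, 8) for x in row])
```

In this corpus every term of a given document has the same document frequency, so the IDF factor cancels during normalization and the rows are simply the normalized counts.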


## Bag of n-grams

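* **ngram_range** sets the smallest and largest n-gram size; `(1, 2)` extracts unigrams and bigrams
* **min_df** sets the minimum document frequency; with `min_df=2`, terms that appear in fewer than two documents are dropped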

In [10]:
vectorizer3 = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
vectorizer3.fit(c1)
v3 = vectorizer3.transform(c1)

print(v3.toarray())
print(vectorizer3.vocabulary_)


[[0.22941573 0.22941573 0.22941573 0.22941573 0.22941573 0.45883147
  0.45883147 0.45883147 0.22941573 0.22941573]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.        ]
 [0.22941573 0.22941573 0.22941573 0.22941573 0.22941573 0.45883147
  0.45883147 0.45883147 0.22941573 0.22941573]]
{'he': 5, 'is': 7, 'good': 1, 'guy': 3, 'not': 9, 'bad': 0, 'he is': 6, 'good guy': 2, 'guy he': 4, 'is not': 8}
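
As a rough illustration of how `ngram_range=(1, 2)` and `min_df=2` interact, the sketch below builds the unigrams and bigrams by hand and keeps only the terms found in at least two documents. It assumes the default tokenization (lowercase, tokens of two or more characters); `tokenize` and `ngrams` are hypothetical helpers, not scikit-learn functions.

```python
import re
from collections import Counter

d1 = "He is a good guy, he is not bad"
d2 = "feet wolves cooked boys girls ,!<@!"
d3 = "He is not a good guy, he is bad"
c1 = [d1, d2, d3]

def tokenize(doc):
    # Default analyzer assumption: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def ngrams(tokens, n_min, n_max):
    # All contiguous token sequences of length n_min..n_max, joined by spaces
    result = []
    for n in range(n_min, n_max + 1):
        result.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return result

doc_terms = [ngrams(tokenize(doc), 1, 2) for doc in c1]

# Document frequency: count each term once per document
df = Counter(term for terms in doc_terms for term in set(terms))

# Keep terms that occur in at least two documents, indexed alphabetically
kept = sorted(term for term, count in df.items() if count >= 2)
vocabulary = {term: index for index, term in enumerate(kept)}

print(vocabulary)
```

Every term of `d2` occurs in one document only, which is why its row is all zeros. The bigrams that distinguish `d1` from `d3` ("not bad", "is good", "not good", "is bad") are also filtered out by `min_df=2`, so the two rows remain identical; a lower `min_df` would keep them.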


## Part of speech (POS) tag

* **`pos_tag()`** returns a list of (token, POS_tag) 2-tuples

In [11]:
d4 = "I drink water in parties"
d5 = "I grab a drink in parties"
token4 = nltk.word_tokenize(d4)

POS_token4 = nltk.pos_tag(token4)

print(POS_token4)

[('I', 'PRP'), ('drink', 'VBP'), ('water', 'NN'), ('in', 'IN'), ('parties', 'NNS')]


* join each (token, POS_tag) 2-tuple with an underscore
* join the POS-tagged tokens back into a document

In [12]:
c2 = [d4, d5]
POS_c2 = []
for doc in c2:
    token_doc = nltk.word_tokenize(doc)
    POS_token_doc = nltk.pos_tag(token_doc)
    POS_token_temp = []
    for i in POS_token_doc:
        POS_token_temp.append(i[0] + "_" + i[1])
    POS_c2.append(" ".join(POS_token_temp))

print(POS_c2)

['I_PRP drink_VBP water_NN in_IN parties_NNS', 'I_PRP grab_VBP a_DT drink_NN in_IN parties_NNS']
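
The inner loop can also be written more compactly by unpacking each tuple in a generator expression. The sketch below hard-codes the tagger output shown earlier for "I drink water in parties" so that it runs without NLTK; `tagged` and `joined` are illustrative names.

```python
# Tagger output copied from the pos_tag() example above
tagged = [('I', 'PRP'), ('drink', 'VBP'), ('water', 'NN'), ('in', 'IN'), ('parties', 'NNS')]

# Unpack each (token, tag) pair and join with an underscore, then with spaces
joined = " ".join(f"{token}_{tag}" for token, tag in tagged)

print(joined)
```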


* apply a vectorizer to the POS-tagged documents; the verb `drink_vbp` and the noun `drink_nn` become separate features

In [15]:
vectorizer4 = TfidfVectorizer()
vectorizer4.fit(POS_c2)
print(vectorizer4.vocabulary_)

POS_v3 = vectorizer4.transform(POS_c2)
print(POS_v3.toarray())

{'i_prp': 4, 'drink_vbp': 2, 'water_nn': 7, 'in_in': 5, 'parties_nns': 6, 'grab_vbp': 3, 'a_dt': 0, 'drink_nn': 1}
[[0.         0.         0.53309782 0.         0.37930349 0.37930349
  0.37930349 0.53309782]
 [0.47042643 0.47042643 0.         0.47042643 0.33471228 0.33471228
  0.33471228 0.        ]]
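
The tagged tokens pass through the vectorizer intact because the underscore counts as a word character. The sketch below applies the default token pattern `\b\w\w+\b` (an assumption about the vectorizer's default settings) to the tagged documents to show which features the two sentences share.

```python
import re

POS_c2 = ['I_PRP drink_VBP water_NN in_IN parties_NNS',
          'I_PRP grab_VBP a_DT drink_NN in_IN parties_NNS']

# Lowercase, then extract tokens of 2+ word characters; "_" is a word character
tokens = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in POS_c2]

# Features present in both documents
shared = sorted(set(tokens[0]) & set(tokens[1]))

print(tokens)
print(shared)
```

The article "a" would be dropped on its own as a single-character token, but `a_dt` is long enough to be kept. The two uses of "drink" do not appear among the shared features, since they carry different tags.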
