# Tokenisation

Tokenisation is the process of splitting a stream of characters into tokens.
It can be applied at two levels:
- Sentence Level (sent_tokenize)
- Word Level (word_tokenize)

Important terms:
- Features: The smallest units the text is broken into for analysis. Here they are the individual words in a document.
- Document: A single text datapoint, for example a tweet.
- Lexicon: The complete vocabulary, containing all the words and their meanings.
- Corpus (plural: corpora): A collection of documents. This is the complete universe of data, e.g. the whole of Wikipedia or an encyclopedia.
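
To make these terms concrete, here is a minimal sketch on a toy corpus. The two documents and the whitespace tokenisation are assumptions chosen only to keep the example dependency-free; they are not part of the notebook's data.

```python
# A toy corpus: each string is one document (for example, one tweet).
corpus = [
    "the cookie is awesome",
    "the weather is great",
]

# Naive whitespace tokenisation, used here only to avoid external dependencies.
tokenised = [document.split() for document in corpus]

# The vocabulary is the set of distinct tokens seen anywhere in the corpus.
vocabulary = sorted({token for tokens in tokenised for token in tokens})

print(len(corpus), "documents")
print(vocabulary)
```

Each entry of `vocabulary` is a candidate feature; a real lexicon would also carry meanings, which this sketch leaves out.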


In [22]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [42]:
text = "Hello Mr. Smith, how is your cookie ? The weather is great, and the cookie is awesome. The cookie is pinkish-blue. You shouldn't eat so many cookie."
text = text.lower()
print("Sentence Tokeniser:", sent_tokenize(text))
print("Word Tokeniser:", word_tokenize(text))

Sentence Tokeniser: ['hello mr. smith, how is your cookie ?', 'the weather is great, and the cookie is awesome.', 'the cookie is pinkish-blue.', "you shouldn't eat so many cookie."]
Word Tokeniser: ['hello', 'mr.', 'smith', ',', 'how', 'is', 'your', 'cookie', '?', 'the', 'weather', 'is', 'great', ',', 'and', 'the', 'cookie', 'is', 'awesome', '.', 'the', 'cookie', 'is', 'pinkish-blue', '.', 'you', 'should', "n't", 'eat', 'so', 'many', 'cookie', '.']


In [43]:
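from nltk.corpus import stopwords
# With no argument, stopwords.words() returns the stop words of every language NLTK ships.
# Use stopwords.words("english") to restrict the list to English.
stpwrds = stopwords.words()
# Also treat punctuation and the "n't" token produced by word_tokenize as stop words.
stpwrds.extend([".", ",", "?", "n't"])
# Unique tokens that are not stop words.
features = set(word_tokenize(text)).difference(stpwrds)
print(features, len(features))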

{'cookie', 'awesome', 'smith', 'pinkish-blue', 'eat', 'great', 'hello', 'weather', 'mr.', 'many'} 10


# Feature Vectorisation

Word embedding, or word vectorisation, is an NLP technique that maps words or phrases from a vocabulary to vectors of real numbers, which can then be used for word prediction and for measuring word similarity or semantics.

The process of converting words into numbers is called vectorisation.
Embeddings are the end product of vectorisation.


## Types of Encoding

- Frequency Based
    - One hot Encoding
    - Count Vectoriser
    - Bag of Words

- Prediction Based
    - 
    - 
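
As an illustration of the simplest scheme in the list, one-hot encoding, the sketch below maps each word to a vector with a single 1 at the word's index in the vocabulary. The four-word `vocabulary` and the helper `one_hot` are hypothetical names introduced for this example.

```python
# Hypothetical vocabulary; a real one would be built from the corpus.
vocabulary = ["awesome", "cookie", "great", "weather"]

# Map each word to its position in the vocabulary.
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

print(one_hot("cookie"))  # [0, 1, 0, 0]
```

Every vector has the same length as the vocabulary, so this representation grows quickly and carries no notion of similarity between words.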


In [44]:
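from sklearn.feature_extraction.text import CountVectorizer

sentences = sent_tokenize(text)
count_vectoriser = CountVectorizer(stop_words=stpwrds)
X = count_vectoriser.fit_transform(sentences)
counts = X.toarray()  # dense (n_sentences, n_features) matrix of term counts

# Show the first three sentences with their count vectors and the vocabulary size.
for i in range(3):
    print(sentences[i])
    print(counts[i], counts.shape[1])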

hello mr. smith, how is your cookie ?
[0 0 1 0 0 1 0 1 0 1 0] 11
the weather is great, and the cookie is awesome.
[1 0 1 0 1 0 0 0 0 0 1] 11
the cookie is pinkish-blue.
[0 1 1 0 0 0 0 0 1 0 0] 11
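
To read these rows, it helps to know that CountVectorizer orders its columns alphabetically by vocabulary term. The sketch below rebuilds the first and third rows by hand. It assumes the 11-term vocabulary printed in the TF-IDF cell further down, approximates the default token pattern (words of two or more characters) with a regular expression, and uses a small hand-picked stop-word set as a stand-in for the NLTK list.

```python
import re
from collections import Counter

# Vocabulary in alphabetical order: one column per term.
vocabulary = ["awesome", "blue", "cookie", "eat", "great", "hello",
              "many", "mr", "pinkish", "smith", "weather"]

# Hand-picked stand-in for the NLTK stop-word list (an assumption for this sketch).
stop_words = {"how", "is", "your", "the", "and", "you", "shouldn", "so"}

def count_vector(sentence):
    """Count how often each vocabulary term occurs in the sentence."""
    # Keep only tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", sentence.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return [counts[word] for word in vocabulary]

print(count_vector("hello mr. smith, how is your cookie ?"))
print(count_vector("the cookie is pinkish-blue."))
```

In the first row the 1s sit in the columns for cookie, hello, mr and smith. Note that "pinkish-blue" is split into two terms here, unlike with word_tokenize.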




In [45]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

print(sent_tokenize(text))
tfidf_vectoriser = TfidfVectorizer(stop_words=stpwrds)
# tfidf_vectoriser = TfidfVectorizer()
tf_idf = tfidf_vectoriser.fit_transform(sent_tokenize(text))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
feature_names = tfidf_vectoriser.get_feature_names_out().tolist()
print(feature_names)

# One row per sentence, one column per vocabulary term.
list_dense = pd.DataFrame(tf_idf.toarray(), columns=feature_names)
print(list_dense)

['hello mr. smith, how is your cookie ?', 'the weather is great, and the cookie is awesome.', 'the cookie is pinkish-blue.', "you shouldn't eat so many cookie."]
['awesome', 'blue', 'cookie', 'eat', 'great', 'hello', 'many', 'mr', 'pinkish', 'smith', 'weather']
    awesome      blue    cookie       eat     great     hello      many  \
0  0.000000  0.000000  0.288477  0.000000  0.000000  0.552805  0.000000   
1  0.552805  0.000000  0.288477  0.000000  0.552805  0.000000  0.000000   
2  0.000000  0.663385  0.346182  0.000000  0.000000  0.000000  0.000000   
3  0.000000  0.000000  0.346182  0.663385  0.000000  0.000000  0.663385   

         mr   pinkish     smith   weather  
0  0.552805  0.000000  0.552805  0.000000  
1  0.000000  0.000000  0.000000  0.552805  
2  0.000000  0.663385  0.000000  0.000000  
3  0.000000  0.000000  0.000000  0.000000  
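
The weights in row 0 can be reproduced by hand. The sketch below assumes the default TfidfVectorizer settings (raw term counts, smoothed IDF computed as ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row). The variable names are illustrative, and the document frequencies are read off the four sentences: "cookie" appears in all four, the other three terms in one each.

```python
import math

n_documents = 4

# Number of sentences that contain each term of the first sentence.
document_frequency = {"cookie": 4, "hello": 1, "mr": 1, "smith": 1}

# Raw term counts within the first sentence.
term_counts = {"cookie": 1, "hello": 1, "mr": 1, "smith": 1}

def smoothed_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

# Unnormalised weight: term count times IDF.
raw = {term: count * smoothed_idf(document_frequency[term], n_documents)
       for term, count in term_counts.items()}

# L2-normalise so the row has unit length.
norm = math.sqrt(sum(weight * weight for weight in raw.values()))
weights = {term: weight / norm for term, weight in raw.items()}

for term, weight in weights.items():
    print(f"{term}: {weight:.6f}")
```

Because "cookie" occurs in every sentence, its IDF is the minimum value of 1, which is why it gets the lowest weight in each row of the table.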


## Cosine Similarity vs Euclidean Distance

Different methods to compare two text vectors:
- Cosine Similarity
    - 
- Euclidean Distance
    - 

Difference:
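
Cosine similarity depends only on the angle between two vectors, while Euclidean distance also depends on their lengths. A minimal sketch of both measures follows; the example vectors are made up, with `long_doc` pointing in the same direction as `short_doc` but three times as long (as if the same text were repeated three times).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical count vectors over a three-word vocabulary.
short_doc = [1, 1, 0]
long_doc = [3, 3, 0]
other_doc = [0, 0, 1]

print(cosine_similarity(short_doc, long_doc), euclidean_distance(short_doc, long_doc))
print(cosine_similarity(short_doc, other_doc), euclidean_distance(short_doc, other_doc))
```

The cosine similarity of `short_doc` and `long_doc` is (up to rounding) 1 even though they are far apart in Euclidean terms, which is one reason cosine similarity is often preferred when documents differ in length.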


# Stemming & Lemmatisation

# POS Tagging

In [46]:
import nltk
from nltk.tokenize import word_tokenize

print(nltk.pos_tag(word_tokenize(text)))


[('hello', 'NN'), ('mr.', 'NN'), ('smith', 'NNP'), (',', ','), ('how', 'WRB'), ('is', 'VBZ'), ('your', 'PRP$'), ('cookie', 'NN'), ('?', '.'), ('the', 'DT'), ('weather', 'NN'), ('is', 'VBZ'), ('great', 'JJ'), (',', ','), ('and', 'CC'), ('the', 'DT'), ('cookie', 'NN'), ('is', 'VBZ'), ('awesome', 'JJ'), ('.', '.'), ('the', 'DT'), ('cookie', 'NN'), ('is', 'VBZ'), ('pinkish-blue', 'JJ'), ('.', '.'), ('you', 'PRP'), ('should', 'MD'), ("n't", 'RB'), ('eat', 'VB'), ('so', 'RB'), ('many', 'JJ'), ('cookie', 'NN'), ('.', '.')]
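
The tags follow the Penn Treebank tag set (for example NN for a singular noun, JJ for an adjective, VBZ for a third-person singular present verb). A minimal sketch of how the tagged output might be used is to group tokens by tag; the `tagged` list below is copied from part of the output above so that the example runs without NLTK.

```python
from collections import defaultdict

# Excerpt of the pos_tag output for the second sentence.
tagged = [("the", "DT"), ("weather", "NN"), ("is", "VBZ"), ("great", "JJ"),
          (",", ","), ("and", "CC"), ("the", "DT"), ("cookie", "NN"),
          ("is", "VBZ"), ("awesome", "JJ"), (".", ".")]

# Collect the tokens under each tag, preserving their order of appearance.
by_tag = defaultdict(list)
for token, tag in tagged:
    by_tag[tag].append(token)

nouns = by_tag["NN"]
adjectives = by_tag["JJ"]
print(nouns, adjectives)
```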
