# Word Tokenization



> **Tokenize text into words.**

Tokenization is a way to split text into tokens. These tokens could be paragraphs, sentences, or individual words.
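To make the three granularities concrete, here is a minimal sketch using only the standard library. The sample text and the sentence-splitting regular expression are assumptions made for illustration; the regex is deliberately naive and would mis-split abbreviations such as "Dr.".

```python
import re

# Hypothetical sample text: two paragraphs separated by a blank line.
text = "Tokenization splits text. It works at several levels.\n\nWords are one level."

# Paragraph tokens: split on blank lines.
paragraphs = [p for p in text.split("\n\n") if p.strip()]

# Sentence tokens: naive split after '.', '!' or '?' followed by whitespace.
sentences = [s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p) if s]

# Word tokens: whitespace split of the first sentence.
words = sentences[0].split()

print(len(paragraphs), len(sentences), words)
```

Each level simply feeds the next: paragraphs are split into sentences, and sentences into words.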


Many libraries, such as NLTK, spaCy, and Gensim, provide built-in functions for tokenization and related text-processing tasks.


---

 > Tokenizing text well requires more string manipulation than `str.split()` provides.
 




In [26]:
# Built-in string method in Python
# This is a crude split: the '.' stays attached to the last word ('NLU.')

senten = """This is introduction to NLP & NLU."""

senten.split()

['This', 'is', 'introduction', 'to', 'NLP', '&', 'NLU.']

In [27]:
str.split(senten)

['This', 'is', 'introduction', 'to', 'NLP', '&', 'NLU.']

In [28]:
# Tokenizing using NLTK

import nltk
nltk.download('punkt')
nltk.word_tokenize(senten)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['This', 'is', 'introduction', 'to', 'NLP', '&', 'NLU', '.']

*The major difference is that the built-in `split()` leaves the '.' attached to the preceding word, whereas NLTK/spaCy separate it into its own token.*
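As a middle ground between the two, a regular expression can split punctuation off without any external library. The pattern below is an assumption chosen for illustration, not what NLTK does internally; it reproduces the NLTK result for this sentence but handles harder cases differently.

```python
import re

sentence = "This is introduction to NLP & NLU."

# \w+ matches a run of letters, digits, or underscores;
# [^\w\s] matches any single character that is neither of those nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['This', 'is', 'introduction', 'to', 'NLP', '&', 'NLU', '.']
```

This simple pattern treats every punctuation mark as its own token, so a contraction such as "don't" comes out as three tokens (`don`, `'`, `t`).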

**One-hot Vectors**


In [29]:
# importing numpy
import numpy as np

# Build a vocabulary that keeps track of all unique words
token_sequence = str.split(senten)
# Sorted lexicographically, so numbers come before letters,
# and capital letters come before lowercase letters
vocab = sorted(set(token_sequence))
num_tokens = len(token_sequence)
vocab_size = len(vocab)

# For each word in the sentence, mark the column for that
# word in the vocabulary with a 1
onehot_vectors = np.zeros((num_tokens, vocab_size), int)
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

# Print once, after every row has been filled in
print(onehot_vectors)


[[0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0]
 [0 0 0 0 1 0 0]
 [0 0 0 0 0 0 1]
 [0 1 0 0 0 0 0]
 [1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0]]
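The same matrix can also be built without an explicit loop, and decoded back into words. The following is a sketch under the assumption that a dictionary lookup (`word_to_col`, a name introduced here) replaces the repeated `vocab.index()` calls; it is one possible formulation, not the only one.

```python
import numpy as np

tokens = "This is introduction to NLP & NLU.".split()
vocab = sorted(set(tokens))

# Map each word to its column once, instead of calling vocab.index() per token.
word_to_col = {word: col for col, word in enumerate(vocab)}
cols = [word_to_col[word] for word in tokens]

# Row i of the identity matrix is the one-hot vector for column i,
# so indexing with cols picks one row per token.
onehot = np.eye(len(vocab), dtype=int)[cols]

# Decoding: the position of the single 1 in each row is the vocabulary index.
decoded = [vocab[col] for col in onehot.argmax(axis=1)]
print(decoded == tokens)
```

Because every row contains exactly one 1, the encoding loses no information: the original token sequence can always be recovered from the matrix and the vocabulary.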



* The zeros make the representation above hard to read; pandas can make it easier.

* A pandas DataFrame labels each column with its vocabulary word, which makes the matrix more informative.

* Pandas wraps a 1D array with some helper functionality in an object called a Series.

* Pandas is useful for tables of numbers such as lists of lists, 2D NumPy arrays, 2D NumPy matrices, arrays of arrays, and dictionaries of dictionaries.
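To illustrate the Series idea from the list above, here is a small sketch that wraps a single one-hot row. The vocabulary and row are typed in by hand so the snippet is self-contained; they mirror the sentence used earlier in this notebook.

```python
import pandas as pd

vocab = ['&', 'NLP', 'NLU.', 'This', 'introduction', 'is', 'to']
# One-hot row for the token 'This'.
row = [0, 0, 0, 1, 0, 0, 0]

# A Series pairs each value in a 1D array with a label from its index.
series = pd.Series(row, index=vocab)

print(series['This'])   # look up by word instead of by position
print(series.idxmax())  # label of the largest value, i.e. the encoded word
```

A DataFrame can be thought of as several such Series sharing one index, which is why it suits the full one-hot matrix.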





In [30]:
# One-hot vector sequence
import pandas as pd 
pd.DataFrame(onehot_vectors, columns=vocab)

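|   | & | NLP | NLU. | This | introduction | is | to |
|---|---|-----|------|------|--------------|----|----|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |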


In [31]:
# Replace the zeros in the matrix above with empty strings so the ones stand out

df = pd.DataFrame(onehot_vectors, columns=vocab)
df[df == 0] = ''
df

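|   | & | NLP | NLU. | This | introduction | is | to |
|---|---|-----|------|------|--------------|----|----|
| 0 |   |   |   | 1 |   |   |   |
| 1 |   |   |   |   |   | 1 |   |
| 2 |   |   |   |   | 1 |   |   |
| 3 |   |   |   |   |   |   | 1 |
| 4 |   | 1 |   |   |   |   |   |
| 5 | 1 |   |   |   |   |   |   |
| 6 |   |   | 1 |   |   |   |   |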


# Normalizing the Text

Lowercasing is a common preprocessing step for text data.

During analysis, we want `USA` and `usa` to be treated as the same token.
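A quick sketch of why this matters for vocabulary size; the list of variants is a made-up example.

```python
variants = ["USA", "usa", "Usa"]

# Without normalization, every casing counts as a distinct token.
distinct_raw = len(set(variants))
print(distinct_raw)    # 3

# After lowercasing, they collapse into a single vocabulary entry.
distinct_lower = len({word.lower() for word in variants})
print(distinct_lower)  # 1
```

The trade-off is that lowercasing also erases distinctions that may matter, such as `US` (the country) versus `us` (the pronoun).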

In [32]:
# Converting a list to a DataFrame

import pandas as pd

text = ['This is introduction to NLP & NLU',
        'It is likely to be useful to students',
        'Deep learning is the new electrcity',
        'python is the best language!',
        'I like this note-book',
        'I want to learn more from these note-books']

df = pd.DataFrame({'tweet': text})
print(df)

                                        tweet
0           This is introduction to NLP & NLU
1       It is likely to be useful to students
2         Deep learning is the new electrcity
3                python is the best language!
4                       I like this note-book
5  I want to learn more from these note-books


In [33]:
# Lowercase every tweet in the DataFrame

df['tweet'] = df['tweet'].apply(
    lambda tweet: " ".join(word.lower() for word in tweet.split()))
df['tweet']

0             this is introduction to nlp & nlu
1         it is likely to be useful to students
2           deep learning is the new electrcity
3                  python is the best language!
4                         i like this note-book
5    i want to learn more from these note-books
Name: tweet, dtype: object
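For comparison, pandas also offers a vectorized string accessor that lowercases a whole column in one call. The sketch below uses a two-row DataFrame built here for illustration and checks that both approaches agree on it.

```python
import pandas as pd

df = pd.DataFrame({'tweet': ['This is introduction to NLP & NLU',
                             'python is the best language!']})

# Element-wise lowercasing through the Series string accessor.
lowered = df['tweet'].str.lower()

# The split/join version, written out for comparison.
via_apply = df['tweet'].apply(lambda t: " ".join(w.lower() for w in t.split()))

print(lowered.equals(via_apply))
```

The two differ only on irregular whitespace: the split/join version collapses runs of spaces into one, while `.str.lower()` leaves spacing untouched.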