## Building your vocabulary with a tokenizer

### The simplest way: word tokenization

In [19]:
sentence = """Hello, I would like to order 2 medium pizza from the nearest store before 6pm."""
sentence.split()

['Hello,',
 'I',
 'would',
 'like',
 'to',
 'order',
 '2',
 'medium',
 'pizza',
 'from',
 'the',
 'nearest',
 'store',
 'before',
 '6pm.']
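
Note that `split()` only breaks on whitespace, so punctuation stays attached to the neighboring word (`'Hello,'`, `'6pm.'`). A minimal sketch of one alternative, assuming a simple regular expression is good enough for this sentence; the pattern and the `tokens` name are illustrative choices, not part of the original notebook:

```python
import re

sentence = """Hello, I would like to order 2 medium pizza from the nearest store before 6pm."""

# \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character,
# so punctuation marks become tokens of their own instead of sticking to words.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
```

Whether punctuation should be kept as separate tokens or dropped entirely depends on the downstream task; this sketch keeps it.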

### Bag-of-words (BOW) vectorization

In [20]:
import pandas as pd

# Mark each token as present (1), then transpose so the sentence becomes a single row
df = pd.DataFrame(pd.Series({token: 1 for token in sentence.split()}), columns=['sent']).T

The result is a table of vectors, one row per text in the corpus (here, a single sentence):

In [21]:
df

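| | Hello, | I | would | like | to | order | 2 | medium | pizza | from | the | nearest | store | before | 6pm. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sent | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |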


Let's add more texts to the corpus and show the sparse bag-of-words vectors.

In [22]:
sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentences += """Construction was done mostly by local masons and carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

Building the BOW vectors manually:

In [25]:
corpus = {}

for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())

df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T

In [27]:
corpus

{'sent0': {'Thomas': 1,
  'Jefferson': 1,
  'began': 1,
  'building': 1,
  'Monticello': 1,
  'at': 1,
  'the': 1,
  'age': 1,
  'of': 1,
  '26.': 1},
 'sent1': {'Construction': 1,
  'was': 1,
  'done': 1,
  'mostly': 1,
  'by': 1,
  'local': 1,
  'masons': 1,
  'and': 1,
  'carpenters.': 1},
 'sent2': {'He': 1,
  'moved': 1,
  'into': 1,
  'the': 1,
  'South': 1,
  'Pavilion': 1,
  'in': 1,
  '1770.': 1},
 'sent3': {'Turning': 1,
  'Monticello': 1,
  'into': 1,
  'a': 1,
  'neoclassical': 1,
  'masterpiece': 1,
  'was': 1,
  "Jefferson's": 1,
  'obsession.': 1}}

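The resulting binary bag-of-words vectors, one row per sentence (each row marks which vocabulary tokens occur in that sentence):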

In [29]:
df

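| | Thomas | Jefferson | began | building | Monticello | at | the | age | of | 26. | ... | South | Pavilion | in | 1770. | Turning | a | neoclassical | masterpiece | Jefferson's | obsession. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sent0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |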
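
The table is mostly zeros because each sentence uses only a small part of the shared vocabulary. A rough pure-Python sketch of the same construction, without pandas, that also measures this sparsity; the names `tokenized`, `vocab`, `vectors`, and `density` are made up for the illustration:

```python
sentences = (
    "Thomas Jefferson began building Monticello at the age of 26.\n"
    "Construction was done mostly by local masons and carpenters.\n"
    "He moved into the South Pavilion in 1770.\n"
    "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."
)

# Whitespace tokenization, one token list per sentence.
tokenized = [sent.split() for sent in sentences.split("\n")]

# Vocabulary in order of first appearance: one column per distinct token.
vocab = list(dict.fromkeys(tok for sent in tokenized for tok in sent))

# One binary row per sentence: 1 if the token occurs in it, else 0.
vectors = []
for sent in tokenized:
    present = set(sent)
    vectors.append([int(tok in present) for tok in vocab])

# Fraction of cells that are nonzero.
nonzero = sum(sum(row) for row in vectors)
density = nonzero / (len(vectors) * len(vocab))
print(len(vocab), nonzero, round(density, 2))
```

With a larger corpus the vocabulary grows much faster than the length of any single sentence, so the fraction of nonzero cells shrinks further; this is why BOW matrices are usually stored in a sparse format.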


BOW using sklearn's `CountVectorizer`:

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(sentences.split('\n'))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_bow_sklearn = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()

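| | 1770 | 26 | age | and | at | began | building | by | carpenters | construction | ... | moved | neoclassical | obsession | of | pavilion | south | the | thomas | turning | was |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |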
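
The sklearn columns differ from the manual ones: tokens are lowercased, punctuation is gone, and the one-letter word `a` is missing. This follows from `CountVectorizer`'s defaults (`lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`, which only keeps tokens of two or more word characters). A minimal sketch that approximates this default tokenization using only the standard library; the helper name `sklearn_like_tokens` is hypothetical, and the sketch ignores other options such as stop words or n-grams:

```python
import re

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def sklearn_like_tokens(text):
    """Lowercase the text, then keep runs of at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())

last_sentence = "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."
print(sklearn_like_tokens(last_sentence))
```

Note how `Jefferson's` collapses to `jefferson`: the apostrophe splits the word, and the leftover single `s` is too short to count as a token. Also keep in mind that `CountVectorizer` stores counts rather than presence flags unless `binary=True` is passed; in this corpus no token repeats within a sentence, so the two coincide.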
