https://towardsdatascience.com/5-simple-ways-to-tokenize-text-in-python-92c6804edfc4

Table of Contents
1. Simple tokenization with .split
2. Tokenization with NLTK
3. Convert a corpus to a vector of token counts with Count Vectorizer (sklearn)
4. Tokenize text in different languages with spaCy
5. Tokenization with Gensim

In [None]:

text = "Here’s to the crazy ones, the misfits, the rebels, the troublemakers, \
the round pegs in the square holes. The ones who see things differently — they’re \
not fond of rules. You can quote them, disagree with them, glorify or vilify them,\
 but the only thing you can’t do is ignore them because they change things. They push \
 the human race forward, and while some may see them as the crazy ones, we see genius, \
 because the ones who are crazy enough to think that they can change the world, are the ones who do."



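**1. Simple tokenization with .split**

The split() method does not treat punctuation marks as separate tokens, which might change your project results. For example, the last token of the text above is "do." with the period still attached.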
 

In [None]:

#text.split()
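
To make this concrete, here is a minimal sketch on a short made-up sentence. The names `sample` and `tokens` are illustrative and not part of the original notebook. It shows that punctuation stays glued to the neighbouring word when splitting on whitespace.

```python
# Hypothetical sample sentence, used only for illustration.
sample = "They change things, and they are the ones who do."

# str.split() with no arguments splits on runs of whitespace only.
tokens = sample.split()

print(tokens)
print(tokens[-1])  # the final period stays attached to the last word
```

If downstream steps count word frequencies, "do." and "do" would be counted as two different tokens.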


**2. Tokenization with NLTK**

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize


In [None]:

text_sentence=sent_tokenize(text)
text_sentence


['Here’s to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes.',
 'The ones who see things differently — they’re not fond of rules.',
 'You can quote them, disagree with them, glorify or vilify them, but the only thing you can’t do is ignore them because they change things.',
 'They push  the human race forward, and while some may see them as the crazy ones, we see genius,  because the ones who are crazy enough to think that they can change the world, are the ones who do.']

In [None]:

len(text_sentence)


4

With word_tokenize, punctuation marks become tokens of their own: the apostrophe (’) in “Here’s” and the comma (,) after “ones” are each counted as separate tokens.

In [None]:

text_word = word_tokenize(text)
#text_word


In [None]:
len(text_word)

115
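
The difference between whitespace splitting and punctuation-aware tokenization can be sketched with the standard library alone. The regular expression below is a rough approximation chosen for illustration; it is not the algorithm NLTK uses, and `word_tokenize` applies more elaborate rules. The names `sample`, `pattern`, `whitespace_tokens` and `punct_tokens` are assumptions of this sketch.

```python
import re

# Short excerpt of the quote, used only for illustration.
sample = "Here’s to the crazy ones, the misfits."

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
pattern = re.compile(r"\w+|[^\w\s]")

whitespace_tokens = sample.split()       # punctuation stays attached
punct_tokens = pattern.findall(sample)   # punctuation becomes its own token

print(len(whitespace_tokens), whitespace_tokens)
print(len(punct_tokens), punct_tokens)
```

Separating punctuation raises the token count, which is why a punctuation-aware tokenizer reports more tokens than there are whitespace-separated words.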

**3. Convert a corpus to a vector of token counts with Count Vectorizer (sklearn)**

The previous methods become less useful when dealing with a large corpus, because you’ll need a numeric representation of the tokens. CountVectorizer converts a collection of text documents into a matrix of token counts, with one row per document and one column per distinct token.
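
Before turning to sklearn, the idea of a token-count matrix can be sketched in plain Python. This is an illustration of the concept only, not how CountVectorizer is implemented: among other differences, CountVectorizer lowercases text and ignores single-character tokens by default. The documents and the names `docs`, `counts`, `vocabulary` and `matrix` are made up for this sketch.

```python
from collections import Counter

# Two tiny made-up documents (illustrative only).
docs = ["the crazy ones change things", "lazy person easy way lazy job"]

# Count tokens per document, using whitespace splitting for simplicity.
counts = [Counter(doc.split()) for doc in docs]

# Shared, alphabetically sorted vocabulary: one column per distinct token.
vocabulary = sorted(set().union(*counts))

# One row per document; a Counter returns 0 for tokens a document lacks.
matrix = [[c[token] for token in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, so documents of different lengths end up as vectors of equal size.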

In [None]:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd


In [None]:

texts = ["""Here’s to the crazy ones, the misfits, the rebels, the troublemakers, \
the round pegs in the square holes. The ones who see things differently — they’re \
not fond of rules. You can quote them, disagree with them, glorify or vilify them, \
but the only thing you can’t do is ignore them because they change things. They push \
the human race forward, and while some may see them as the crazy ones, we see genius, \
because the ones who are crazy enough to think that they can change the world, are the ones who do.""" ,
 
'I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.']

texts


['Here’s to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes. The ones who see things differently — they’re not fond of rules. You can quote them, disagree with them, glorify or vilify them, but the only thing you can’t do is ignore them because they change things. They push the human race forward, and while some may see them as the crazy ones, we see genius, because the ones who are crazy enough to think that they can change the world, are the ones who do.',
 'I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.']

In [None]:

df = pd.DataFrame({'author': ['jobs', 'Gates'], 'text': texts})
df


,author,text
0,jobs,"Here’s to the crazy ones, the misfits, the reb..."
1,Gates,I choose a lazy person to do a hard job. Becau...


In [None]:

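# Build the vocabulary and count tokens, dropping English stop words
cv = CountVectorizer(stop_words='english')
cv_matrix = cv.fit_transform(df['text'])

# Create the document-term matrix: one row per author, one column per token.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
df_dtm = pd.DataFrame(cv_matrix.toarray(), index=df['author'].values, columns=cv.get_feature_names_out())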
df_dtm



,change,choose,crazy,differently,disagree,easy,fond,forward,genius,glorify,...,round,rules,square,thing,things,think,troublemakers,vilify,way,world
jobs,2,0,3,1,1,0,1,1,1,1,...,1,1,1,1,2,1,1,1,0,1
Gates,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
