# Tokenization

- Tokenization extracts tokens (terms or words) from sentences.
- Packages such as NLTK (Natural Language Toolkit) and spaCy provide tokenizers for this.


In [21]:
import pandas as pd, spacy, nltk, re

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/sylvia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Each document is represented as a single string.

In [3]:
doc = 'I visited my grandparents last week; We had a good time together'

There are two ways to extract words: do it manually with string operations, or use a library such as NLTK.

# Manual Process


In [4]:
# use lower and split the doc
tokens = doc.lower().split()
tokens

['i',
 'visited',
 'my',
 'grandparents',
 'last',
 'week;',
 'we',
 'had',
 'a',
 'good',
 'time',
 'together']

In [5]:
# Remove punctuation: delete every character that is neither a word character nor whitespace
doc_cleaned = re.sub(r'[^\w\s]', '', doc.lower())  # drops the semicolon
doc_cleaned

'i visited my grandparents last week we had a good time together'

In [6]:
# split the cleaned doc using space 
tokens = doc_cleaned.split(' ')
tokens

['i',
 'visited',
 'my',
 'grandparents',
 'last',
 'week',
 'we',
 'had',
 'a',
 'good',
 'time',
 'together']
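The manual steps above can be folded into one small helper. This is a minimal sketch, not part of the original notebook: the name `manual_tokenize` is hypothetical, and it swaps in two small changes as assumptions. Punctuation is replaced with a space rather than removed, so that text like `week;we` does not fuse into one word, and `split()` is called without an argument, so the resulting double spaces do not produce empty tokens.

```python
import re

def manual_tokenize(text):
    """Lowercase, replace punctuation with spaces, then split on whitespace."""
    # Replacing with a space (not '') keeps 'week;we' from fusing into 'weekwe'
    cleaned = re.sub(r'[^\w\s]', ' ', text.lower())
    # split() with no argument collapses runs of whitespace,
    # whereas split(' ') would return empty strings between them
    return cleaned.split()

doc = 'I visited my grandparents last week; We had a good time together'
tokens = manual_tokenize(doc)
print(tokens)

print('a  b'.split(' '))  # ['a', '', 'b']
print('a  b'.split())     # ['a', 'b']
```

The last two lines show why the choice of `split()` matters once punctuation is replaced by spaces.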

# Automated process of tokenization

## Three NLTK tokenizers are covered here: word_tokenize, RegexpTokenizer and TweetTokenizer

In [7]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(doc.lower())
tokens

['i',
 'visited',
 'my',
 'grandparents',
 'last',
 'week',
 ';',
 'we',
 'had',
 'a',
 'good',
 'time',
 'together']

In the list above, ';' appears as a separate token, whereas the manual process removed it.

# **RegexpTokenizer** - extract only the tokens that match a regular expression

In [8]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') # the pattern keeps only sequences of word characters; everything else is dropped
tokens = tokenizer.tokenize(doc.lower()) 
tokens

['i',
 'visited',
 'my',
 'grandparents',
 'last',
 'week',
 'we',
 'had',
 'a',
 'good',
 'time',
 'together']
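Because `\w+` matches only runs of word characters, it also splits hyphenated words and contractions. The sketch below compares it with an alternative pattern that keeps internal hyphens and apostrophes. The alternative pattern and the sample sentence are illustrative assumptions, not something the original notebook uses.

```python
from nltk.tokenize import RegexpTokenizer

text = "a slow-moving film that isn't bad"

# Pattern used above: runs of word characters only
basic = RegexpTokenizer(r'\w+')

# Illustrative alternative: allow internal hyphens or apostrophes.
# The group is non-capturing, (?:...), because a capturing group would make
# the underlying findall return the group instead of the whole match.
keep_joined = RegexpTokenizer(r"\w+(?:[-']\w+)*")

basic_tokens = basic.tokenize(text)
joined_tokens = keep_joined.tokenize(text)

print(basic_tokens)   # 'slow-moving' and "isn't" are split apart
print(joined_tokens)  # 'slow-moving' and "isn't" stay whole
```

Which pattern is preferable depends on whether hyphenated terms should count as one word in the downstream analysis.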

# TweetTokenizer

In [9]:
doc2 = '@john This product is really cool!!!😀😃😄😁😆😅 #awesome'

If you tokenize a tweet with a general-purpose tokenizer such as word_tokenize, the @ and # symbols are split from the words they belong to, and each ! becomes its own token. That is not helpful here, because we want the username @john and the hashtag #awesome to stay intact. For tweets, use TweetTokenizer from the nltk library instead.

In [10]:
tokens = word_tokenize(doc2)
tokens

['@',
 'john',
 'This',
 'product',
 'is',
 'really',
 'cool',
 '!',
 '!',
 '!',
 '😀😃😄😁😆😅',
 '#',
 'awesome']

* @john and #awesome should each have stayed a single token, but word_tokenize split off the @ and the #.
* Every other token is separated correctly, except the run of smileys, which came out as one token.

In [11]:
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer()
token1 = tweet_tokenizer.tokenize(doc2) # doc here will be 1 tweet message at a time
token1

['@john',
 'This',
 'product',
 'is',
 'really',
 'cool',
 '!',
 '!',
 '!',
 '😀',
 '😃',
 '😄',
 '😁',
 '😆',
 '😅',
 '#awesome']

* Now @john and #awesome each appear as a single token, as they should, and every smiley is its own token.
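`TweetTokenizer` also accepts a few constructor options that are handy when cleaning tweets. The sketch below uses `preserve_case=False` and `strip_handles=True` on the same tweet; treating usernames as noise to be removed is an assumption made for illustration, not a step from the original notebook.

```python
from nltk.tokenize import TweetTokenizer

doc2 = '@john This product is really cool!!!😀😃😄😁😆😅 #awesome'

# preserve_case=False lowercases ordinary words
# strip_handles=True removes @username mentions before tokenizing
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True)
tokens = tweet_tokenizer.tokenize(doc2)
print(tokens)
```

A third option, `reduce_len=True`, shortens long runs of a repeated character, which can help when users write elongated words.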

# Tokenization - CSV file read


In [12]:
# !pip install -U -q PyDrive
# from pydrive.auth import GoogleAuth
# from pydrive.drive import GoogleDrive
# from google.colab import auth
# from oauth2client.client import GoogleCredentials

In [13]:
# auth.authenticate_user()
# gauth = GoogleAuth()
# gauth.credentials = GoogleCredentials.get_application_default()
# drive = GoogleDrive(gauth)

In [14]:
# downloaded = drive.CreateFile({'id':'12CUjW29tTTxYAcPhxuKb_qSn0UTzc4BR'}) # replace the id with id of file you want to access
# downloaded.GetContentFile('imdb_sentiment.csv') 

In [15]:
pwd

'/Users/sylvia/Desktop/IITR/M6_text_analytics/Times_Demo_Videos'

## Extract tokens from a text column in a csv file

In [16]:
import pandas as pd
data = pd.read_csv('/Users/sylvia/Desktop/datasets/imdb_sentiment.csv')
data.head()

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


When extracting tokens from a file such as a CSV, each row is treated as one document.

In [17]:
docs = data['review'].str.lower()  # take the review column and convert it to lowercase
tokenizer = RegexpTokenizer(r'\w+')  # keep only sequences of word characters

# loop through the first 5 docs in the dataset

for x in docs.head():
  tokens = tokenizer.tokenize(x)   # tokenizer here is the RegexpTokenizer defined above
  print(x)
  print(tokens,'\n')
  #print('-'*50)

a very, very, very slow-moving, aimless movie about a distressed, drifting young man.  
['a', 'very', 'very', 'very', 'slow', 'moving', 'aimless', 'movie', 'about', 'a', 'distressed', 'drifting', 'young', 'man'] 

not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  
['not', 'sure', 'who', 'was', 'more', 'lost', 'the', 'flat', 'characters', 'or', 'the', 'audience', 'nearly', 'half', 'of', 'whom', 'walked', 'out'] 

attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.  
['attempting', 'artiness', 'with', 'black', 'white', 'and', 'clever', 'camera', 'angles', 'the', 'movie', 'disappointed', 'became', 'even', 'more', 'ridiculous', 'as', 'the', 'acting', 'was', 'poor', 'and', 'the', 'plot', 'and', 'lines', 'almost', 'non', 'existent'] 

very little music or anything to speak of.  
['very', 'little', 'music', 'or', 'anything', 'to', 'speak', 'of']

In [18]:
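docs = data['review'].str.lower()  # lowercase every review
docs_cleaned = []                  # will hold one token list per document
tokenizer = RegexpTokenizer(r'\w+')
for x in docs.head():
  tokens = tokenizer.tokenize(x)
  docs_cleaned.append(tokens)      # collect the tokens instead of printing them
docs_cleaned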

[['a',
  'very',
  'very',
  'very',
  'slow',
  'moving',
  'aimless',
  'movie',
  'about',
  'a',
  'distressed',
  'drifting',
  'young',
  'man'],
 ['not',
  'sure',
  'who',
  'was',
  'more',
  'lost',
  'the',
  'flat',
  'characters',
  'or',
  'the',
  'audience',
  'nearly',
  'half',
  'of',
  'whom',
  'walked',
  'out'],
 ['attempting',
  'artiness',
  'with',
  'black',
  'white',
  'and',
  'clever',
  'camera',
  'angles',
  'the',
  'movie',
  'disappointed',
  'became',
  'even',
  'more',
  'ridiculous',
  'as',
  'the',
  'acting',
  'was',
  'poor',
  'and',
  'the',
  'plot',
  'and',
  'lines',
  'almost',
  'non',
  'existent'],
 ['very', 'little', 'music', 'or', 'anything', 'to', 'speak', 'of'],
 ['the',
  'best',
  'scene',
  'in',
  'the',
  'movie',
  'was',
  'when',
  'gerardo',
  'is',
  'trying',
  'to',
  'find',
  'a',
  'song',
  'that',
  'keeps',
  'running',
  'through',
  'his',
  'head']]

* The result is a list of lists with one inner list per row: the 1st row in docs became the 1st token list, the 2nd row became the 2nd token list, and so on.
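The explicit loop can also be written as a single column operation in pandas. The following is a minimal sketch under stated assumptions: it builds a tiny stand-in DataFrame from two of the reviews shown above instead of reading the CSV, and it uses `Series.str.findall` with the same `\w+` pattern in place of `RegexpTokenizer`.

```python
import pandas as pd

# Two reviews copied from the output above; the DataFrame itself is a
# stand-in for the real CSV so that the example runs without the file
data = pd.DataFrame({
    'review': [
        'A very, very, very slow-moving, aimless movie about a distressed, drifting young man.',
        'Very little music or anything to speak of.',
    ],
    'sentiment': [0, 0],
})

# str.findall applies re.findall to each row; r'\w+' is the same pattern
# that RegexpTokenizer was given earlier
data['tokens'] = data['review'].str.lower().str.findall(r'\w+')

# Convert the column to the same list-of-lists shape as docs_cleaned
docs_cleaned = data['tokens'].tolist()
print(docs_cleaned)
```

Storing the tokens as a new column keeps each token list aligned with its review and its sentiment label.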

# Use the spaCy library to get individual tokens

In [19]:
nw_doc = 'I visited my grandparents last week; We had a good time together'


In [20]:
import spacy
nlp = spacy.load('en_core_web_sm') # load spaCy's small English pipeline (install it first with: python -m spacy download en_core_web_sm)

spacy_doc = nlp(nw_doc.lower())

for x in spacy_doc:
  print(x)

i
visited
my
grandparents
last
week
;
we
had
a
good
time
together


* Iterating over the spaCy document yields the individual tokens automatically; as with word_tokenize, ';' comes out as its own token.
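If only tokenization is needed, a blank English pipeline may be enough, and the punctuation token can be filtered out through a token attribute. This is a sketch under stated assumptions rather than the notebook's method: it uses `spacy.blank('en')` in place of `en_core_web_sm`, and it treats punctuation as unwanted.

```python
import spacy

# spacy.blank('en') builds an English pipeline containing only the tokenizer,
# so no trained model download is needed for plain tokenization
nlp = spacy.blank('en')

nw_doc = 'I visited my grandparents last week; We had a good time together'
spacy_doc = nlp(nw_doc.lower())

# token.text is the string form; token.is_punct flags punctuation such as ';'
all_tokens = [token.text for token in spacy_doc]
word_tokens = [token.text for token in spacy_doc if not token.is_punct]

print(all_tokens)
print(word_tokens)
```

Filtering on `is_punct` gives the same word list that `RegexpTokenizer(r'\w+')` produced for this sentence.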