<a href="https://colab.research.google.com/github/rahiakela/practical-natural-language-processing/blob/master/8-social-media/2_effect_of_different_tokenizers_on_smtd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Effect of different tokenizers on Social Media Text Data

A key step in the preceding process is tokenizing the text correctly. For this, we used twokenize, a tokenizer built specifically for the text of tweets. It is part of a set of NLP tools designed to work with social media text data (SMTD).

Now, we might ask: why do we need a specialized tokenizer, and why not use the standard tokenizer available in NLTK?

The answer is that the tokenizer available in NLTK is designed to work with standard English, in which words are separated by spaces. This is not necessarily true of the English used on Twitter.

This suggests that a tokenizer that uses space as a way to identify word boundaries might not do well on SMTD. Let’s understand this with an example.
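As a quick illustration of the point above, here is a minimal sketch that splits text on whitespace only, using plain `str.split()`. The second string, `run_on`, is a made-up example (not from any dataset) of the kind of spacing often seen in social media text.

```python
# Whitespace-only tokenization with plain str.split().
tweet = "Hey @NLPer! This is a #NLProc tweet :-D"
print(tweet.split())
# ['Hey', '@NLPer!', 'This', 'is', 'a', '#NLProc', 'tweet', ':-D']

# Hypothetical run-on text where spaces are missing between words.
run_on = "omg!!so happy#blessed"
print(run_on.split())
# ['omg!!so', 'happy#blessed']
```

Punctuation stays glued to the mention in the first case, and in the second case two words and a hashtag end up fused into single tokens.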

## Setup

In [None]:
!pip install twikenizer
!pip install emoji

In [7]:
import twikenizer as twk
from nltk.tokenize import TweetTokenizer, word_tokenize
import nltk

In [8]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Tweet tokenizer

Consider the following tweet:

In [3]:
tweet1 = "Hey @NLPer! This is a #NLProc tweet :-D"

# Use a distinct name so the `twk` module alias is not overwritten.
twk_tokenizer = twk.Twikenizer()
print(twk_tokenizer.tokenize(tweet1))

['Hey', '@NLPer', '!', 'This', 'is', 'a', '#NLProc', 'tweet', ':', '-', 'D']


Using a tokenizer designed for the English language, like nltk.tokenize.word_tokenize, we’ll get the following tokens:

In [9]:
print(word_tokenize(tweet1))

['Hey', '@', 'NLPer', '!', 'This', 'is', 'a', '#', 'NLProc', 'tweet', ':', '-D']


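Clearly, the tokens given by word_tokenize are not correct for this tweet: the mention @NLPer and the hashtag #NLProc are each split into two tokens. It’s important to use a tokenizer that keeps such units intact. Tokenizers such as twokenize, and the twikenizer package used above, are specifically designed to deal with SMTD.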
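To make concrete what keeping mentions and hashtags intact can look like, here is a minimal regex-based sketch. It is only an illustration and is not how twokenize or twikenizer are implemented; the pattern `TWEET_TOKEN` and the helper `simple_tweet_tokenize` are hypothetical names introduced here, and the emoticon rule covers only a few simple cases.

```python
import re

# Alternatives are tried left to right, so Twitter-specific units win
# over the generic word and symbol rules.
TWEET_TOKEN = re.compile(
    r"[@#]\w+"                # @mentions and #hashtags stay in one piece
    r"|[:;=][-o^]?[)(DPp]"    # a few simple emoticons such as :-D or ;)
    r"|\w+"                   # runs of letters, digits, and underscores
    r"|[^\w\s]"               # any other single non-space symbol
)


def simple_tweet_tokenize(text):
    """Return the tokens matched by TWEET_TOKEN, in order of appearance."""
    return TWEET_TOKEN.findall(text)


print(simple_tweet_tokenize("Hey @NLPer! This is a #NLProc tweet :-D"))
# ['Hey', '@NLPer', '!', 'This', 'is', 'a', '#NLProc', 'tweet', ':-D']
```

A real SMTD tokenizer handles many more cases (URLs, emoji, unusual punctuation), so this sketch should not be used in place of one.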

Once we have the correct set of tokens, frequency counting is straightforward. A number of specialized tokenizers are available for SMTD; the cells below compare two of them, twikenizer and NLTK's TweetTokenizer, on a noisier tweet.
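The frequency counting mentioned above can be sketched with `collections.Counter`. The token list is copied from the Twikenizer output for `tweet1`; lowercasing tokens before counting is an assumption made for this sketch, not something the pipeline above prescribes.

```python
from collections import Counter

# Tokens copied from the Twikenizer output for tweet1 shown earlier.
tokens = ['Hey', '@NLPer', '!', 'This', 'is', 'a', '#NLProc', 'tweet', ':', '-', 'D']

# Count all tokens, case-insensitively (an assumption for this sketch).
token_counts = Counter(token.lower() for token in tokens)

# Count hashtags and mentions separately, keeping their original casing.
hashtag_counts = Counter(t for t in tokens if t.startswith('#') and len(t) > 1)
mention_counts = Counter(t for t in tokens if t.startswith('@') and len(t) > 1)

print(token_counts.most_common(3))
print(hashtag_counts)
print(mention_counts)
```

With a tokenizer that splits `#NLProc` into `#` and `NLProc`, the hashtag counter would stay empty, which is why tokenization quality matters for this step.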

In [10]:
tweet2 = 'Tw33t a_!aa&!a?b #%lol # @dude_really #b3st_day $ad (b@e) (beep#d) @dude. 😀😀 !😀abc %😀lol #loveit #love.it $%&/ d*ck-'

In [12]:
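# Import the class directly so this cell does not depend on the `twk` alias.
from twikenizer import Twikenizer

twk_tokenizer = Twikenizer()
print(twk_tokenizer.tokenize(tweet2))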

['Tw33t', 'a', '_', '!', 'aa', '&', '!', 'a', '?', 'b', '#%lol', '#', '@dude_really', '#b3st_day', '$ad', '(', 'b', '@', 'e', ')', '(', 'beep', '#', 'd', ')', '@dude', '.', '😀', '😀', '!', '😀', 'abc', '%', '😀', 'lol', '#loveit', '#love', '.', 'it', '$', '%', '&', '/', 'd*ck', '-']


In [13]:
tknzr = TweetTokenizer()
print(tknzr.tokenize(tweet2))

['Tw33t', 'a_', '!', 'aa', '&', '!', 'a', '?', 'b', '#', '%', 'lol', '#', '@dude_really', '#b3st_day', '$', 'ad', '(', 'b', '@e', ')', '(', 'beep', '#', 'd', ')', '@dude', '.', '😀', '😀', '!', '😀', 'abc', '%', '😀', 'lol', '#loveit', '#love', '.', 'it', '$', '%', '&', '/', 'd', '*', 'ck', '-']
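The two outputs are long, so the differences are easy to miss by eye. One way to surface them is to align the token lists with `difflib.SequenceMatcher` from the standard library. This is a sketch: `token_diff` is a hypothetical helper, and the lists below contain only the leading tokens copied from the two outputs above.

```python
from difflib import SequenceMatcher


def token_diff(tokens_a, tokens_b):
    """Return (span_a, span_b) pairs where the two token sequences disagree."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Leading tokens copied from the Twikenizer and TweetTokenizer outputs for tweet2.
twikenizer_tokens = ['Tw33t', 'a', '_', '!', 'aa', '&', '!', 'a', '?', 'b',
                     '#%lol', '#', '@dude_really', '#b3st_day', '$ad']
tweet_tokenizer_tokens = ['Tw33t', 'a_', '!', 'aa', '&', '!', 'a', '?', 'b',
                          '#', '%', 'lol', '#', '@dude_really', '#b3st_day', '$', 'ad']

for left, right in token_diff(twikenizer_tokens, tweet_tokenizer_tokens):
    print(left, "vs", right)
```

Each printed pair shows a stretch of tokens where the tokenizers split the same text differently, for example around `a_`, `#%lol`, and `$ad`.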
