# Tweets Tokenization

The goal of the assignment is to write a tweet tokenizer. The input of the code will be a set of tweet texts and the output will be the tokens in each tweet. The assignment is made up of four tasks.

The [data](https://drive.google.com/file/d/15x_wPAflvYQ2Xh38iNQGrqUIWLj5l5Nw/view?usp=share_link) contains 5 files, each with 44 tweets. Tweets are separated by newlines. Use only one file for manual tokenization.

Grading:
- 30 points - Tokenize tweets by hand
- 30 points - Implement 4 tokenizers
- 20 points - Stemming and Lemmatization
- 20 points - Explain sentencepiece (for masters only)


Remarks: 
- Use Python 3 or greater
- Max is 80 points for bachelors, 100 points for masters

## Tokenize tweets by hand

As a first task you need to tokenize 15 tweets by hand. This will allow you to understand the problem from a linguistic point of view. The guidelines for tweet tokenization are as follows:

- Each smiley is a separate token
- Each hashtag is an individual token. Each user reference is an individual token
- If a word is written with spaces between its letters, it is converted to a single token
- If a sentence ends with a word that legitimately has a full stop (abbreviations, for example), add a final full stop
- All punctuation marks are individual tokens, including double and single quotes
- A URL is a single token

Example of output

    Input tweet
    @xfranman Old age has made N A T O!

    Tokenized tweet (separated by comma)
    @xfranman , Old , age , has , made , NATO , !


    1. Input tweet
    ...
    1. Tokenized tweet
    ...

    2. Input tweet
    ...
    2. Tokenized tweet
    ...

1. `Moonfruit's inventive use of twitter for self-promotion. http://bit.ly/16jVaV`

    `Moonfruit , 's , inventive , use , of , twitter , for , self-promotion , . , http://bit.ly/16jVaV`



2. `Being A Work At Home Mom (WAHM) Is A 24/7 Job » Messing With My Mind http://bit.ly/17rLra`

    `Being , A , Work , At , Home , Mom , ( , WAHM , ) , Is , A , 24 , / , 7 , Job , » , Messing , With , My , Mind , http://bit.ly/17rLra`



3. `@BadAstronomer Don't say I never resurrected a lost child for you: http://bit.ly/PlaitUnscrewed (I'd save that - G4's axed a lot of archive)`

    `@BadAstronomer , Do , n't , say , I , never , resurrected , a , lost , child , for , you , : , http://bit.ly/PlaitUnscrewed , ( , I , 'd , save , that , - , G4 , 's , axed , a , lot , of , archive , )`


4. `Can we make sure #lovewins for babies too? Or nah... #SemST`
   
   `Can , we , make , sure , #lovewins , for , babies , too , ? , Or , nah , ... , #SemST`


5. `@ArkBuilder17 we may disagree but I love ya man I don't want hard feelings #SemST`
   
   `@ArkBuilder17 , we , may , disagree , but , I , love , ya , man , I , do , n't , want , hard , feelings , #SemST`


6. `The Indie Artist X Project – Awareness http://ff.im/-4V4ap`
   
   `The , Indie , Artist , X , Project , – , Awareness , http://ff.im/-4V4ap`


7.  `" I just took the ""are you a true mcfly fan?"" quiz and got: yes your totally fan!!!! Try it: http://bit.ly/1pwAbT"`
   
    `" , I , just , took , the , " , " , are , you , a , true , mcfly , fan , ? , " , " , quiz , and , got , : , yes , your , totally , fan , ! , ! , ! , ! , Try , it , : , http://bit.ly/1pwAbT , "`


8.  `Get 400 followers a day using http://www.tweeterfollow.com`
    
   `Get , 400 , followers , a , day , using , http://www.tweeterfollow.com`

   
9.  `Fucked up on a CD-R. http://yfrog.com/9vc29j`
    
    `Fucked , up , on , a , CD-R , . , http://yfrog.com/9vc29j`


10. `We chase these days down with talks of the places we will go`
    
    `We , chase , these , days , down , with , talks , of , the , places , we , will , go`


11. `I truly believe that #BlackLivesMatter and that's why I'm against Planned Parenthood. #ycot #SemST`
    
    `I , truly , believe , that , #BlackLivesMatter , and , that , 's , why , I , 'm , against , Planned , Parenthood , . , #ycot , #SemST`


12. `They haven't eliminated #murder. They just call it by a different #name. #prolifeyouth #prolifegen #SemST`
    
    `They , have , n't , eliminated , #murder , . , They , just , call , it , by , a , different , #name , . , #prolifeyouth , #prolifegen , #SemST`


13. `@ProWomanChoice forcing women to change their body's normal function is the epitome of controlling women. #SemST`
    
    `@ProWomanChoice , forcing , women , to , change , their , body , 's , normal , function , is , the , epitome , of , controlling , women , . , #SemST`


14. `@EmilyBeaulieu1 maybe that's what he wants #SemST`
    
    `@EmilyBeaulieu1 , maybe , that , 's , what , he , wants , #SemST`
    

15. `I refuse to let people like you shame and insult women for accessing healthcare. Sorry. Just no.`
    
    `I , refuse , to , let , people , like , you , shame , and , insult , women , for , accessing , healthcare , . , Sorry , . , Just , no , .`




## Implement 4 tokenizers

Your task is to implement 4 different tokenizers, each taking a list of tweets on a topic and outputting the tokens of each tweet:

- White Space Tokenization
- Sentencepiece
- Tokenizing text using regular expressions
- NLTK TweetTokenizer

For the regular-expression tokenizer, use the rules from task 1: combine them into a single regular expression and build a tokenizer from it.

In [7]:
import re
from typing import List

In [8]:
def white_space_tokenizer(text: str) -> List[str]:
    return re.findall(r"\S+", text)

In [31]:
import sentencepiece as spm
import os

# Build a single training file from all tweet files (only once).
if not os.path.exists("./Assignment1_data/train.txt"):
    # List the inputs before creating train.txt so it is not read into itself.
    fnames = os.listdir("./Assignment1_data/")
    with open("./Assignment1_data/train.txt", "w", encoding="utf-8") as train:
        for fname in fnames:
            with open(os.path.join("./Assignment1_data/", fname), encoding="utf-8") as input_file:
                train.write(input_file.read())
                train.write("\n\n")

# Train a small subword model; this writes m.model and m.vocab to the working directory.
spm.SentencePieceTrainer.train(input="./Assignment1_data/train.txt", vocab_size=500, model_prefix="m")
model = spm.SentencePieceProcessor(model_file="m.model")

In [32]:
def sentencepiece_wrapper(text: str) -> List[str]:
    return model.encode(text, out_type=str)

In [123]:
FRAG_PATTERN = r"(?=\w)(?<!ca)(n't)|('d)|('s)|('ll)|('ve)|('re)"
LINK_PATTERN = r"(\w+:\/\/[\w\.\#\?\/]+)"
WORD_PATTERN = r"(\w(?:\s\w)+)\s?|(\w+(?=n't))|(\w+)"
ALIAS_PATTERN = r"((?:@|#)?\w+)"
PUNCTUATION_PATTERN = r"([\?\.!:\-;,\)\(\]\[\}\{]{1}|\*\&\%)|(\"|\')"
SMILEY_PATTERN = r"(?=[\w'\&\-\.\/\(\)=:;]+)|((?::|;|=)(?:-)?(?:\)|D|P))"

PATTERN = re.compile("|".join([FRAG_PATTERN, LINK_PATTERN, WORD_PATTERN, ALIAS_PATTERN, PUNCTUATION_PATTERN, SMILEY_PATTERN]))

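def exclude_empty(groups: tuple) -> str:
    # findall returns one tuple per match with a slot for every capture group;
    # at most one group is non-empty, so joining them yields the token.
    return "".join(g for g in groups if g)

def re_tokenizer(text: str) -> List[str]:
    tokens = [exclude_empty(groups) for groups in PATTERN.findall(text)]
    # Drop the empty strings produced by zero-width matches.
    return [tok for tok in tokens if tok]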

In [70]:
import nltk
from nltk.tokenize import TweetTokenizer

tk = TweetTokenizer()

def nltk_tweet_tokenizer(text: str) -> List[str]:
    return tk.tokenize(text)
    

Run your implementations on the data. Compare the results and decide which tokenizer is best. List the advantages of the best tokenizer.

In [26]:
with open("./Assignment1_data/file1", "r", encoding="utf-8") as file:
    data = file.readlines()
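
Note that `readlines()` keeps the trailing newline of every tweet, which is why `'\n\n'` tokens appear in the lemmatizer output later on. A minimal sketch of a loader that strips them, assuming one tweet per line as described in the introduction; `load_tweets` is a hypothetical helper name, not part of the assignment.

```python
from typing import List


def load_tweets(path: str) -> List[str]:
    """Read one tweet per line, dropping trailing newlines and blank lines."""
    # load_tweets is a hypothetical helper introduced for illustration only.
    with open(path, encoding="utf-8") as file:
        return [line.rstrip("\n") for line in file if line.strip()]
```

With such a loader the tweets could be joined with a single `"\n"` before lemmatization without producing empty-line tokens.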


In [122]:
print("White-space tokenizer:\n\n", *map(white_space_tokenizer, data), sep='\n')


White-space tokenizer:


['@anitapuspasari', 'waduh..']
['"', 'Could', 'journos', 'please', 'stop', 'putting', 'the', 'word', '""gate""', 'after', 'everything', 'they', 'write...', 'gate."']
['20%', 'More', 'Ridiculous', 'Sale', '@20x200', 'ends', 'tonight!', '-', 'get', '20%', 'off', 'by', 'entering', "'RIDONK'", 'at', 'checkout.', 'More', 'info:', 'http://bit.ly/ridonktues']
['@Studio85', 'I', 'have', 'a', 'pair', 'of', 'those', 'shoes.', 'They', 'are', 'comfy.', 'Like', 'being', 'barefoot.', 'Okay', 'for', 'running,', 'but', 'not', 'on', 'concrete,', 'as', "I've", 'discovered.']
['RT', '@twilightus', 'Team', 'Carlisle', 'is', 'a', 'Trending', 'Topic-', 'help', 'him', 'out', 'RT', 'Follow', '@peterfacinelli', 'see', 'a', 'grown', 'man', 'n', 'a', 'bikini', 'dance', 'Hollywood', 'Blvd']
['@karenrubin', 'you', 'might', 'have', 'to', 'reinstall', '-', 'that', 'happened', 'to', 'me', 'a', 'few', 'months', 'ago,', 'now', 'I', 'use', 'Nambu', 'on', 'my', 'Mac']
['Just', 'Posted:', 'Redneck

In [33]:
print("Sentencepiece tokenizer:\n\n", *map(sentencepiece_wrapper, data), sep='\n')


Sentencepiece tokenizer:


['▁@', 'an', 'it', 'a', 'p', 'u', 'spa', 's', 'ar', 'i', '▁wa', 'd', 'u', 'h', '.', '.']
['▁"', '▁C', 'ou', 'ld', '▁', 'jo', 'ur', 'no', 's', '▁pleas', 'e', '▁sto', 'p', '▁put', 'ting', '▁the', '▁', 'w', 'ord', '▁""', 'gate', '""', '▁a', 'f', 'ter', '▁every', 'th', 'ing', '▁the', 'y', '▁', 'w', 'r', 'ite', '...', '▁', 'gate', '.', '"']
['▁2', '0', '%', '▁Mo', 're', '▁R', 'id', 'icul', 'ous', '▁Sale', '▁@', '2', '0', 'x', '200', '▁', 'end', 's', '▁tonight', '!', '▁-', '▁get', '▁2', '0', '%', '▁of', 'f', '▁b', 'y', '▁', 'enter', 'ing', '▁', "'", 'R', 'I', 'D', 'ON', 'K', "'", '▁at', '▁check', 'out', '.', '▁Mo', 're', '▁in', 'f', 'o', ':', '▁', 'ht', 'tp', '://', 'bit', '.', 'ly', '/', 'ri', 'd', 'on', 'k', 't', 'ues']
['▁@', 'S', 't', 'udio', '8', '5', '▁I', '▁hav', 'e', '▁a', '▁p', 'air', '▁of', '▁', 'th', 'ose', '▁sho', 'es', '.', '▁The', 'y', '▁are', '▁', 'com', 'f', 'y', '.', '▁Li', 'ke', '▁be', 'ing', '▁', 'ba', 're', 'f', 'o', 'o', 't', '.', '▁O', 'k', 'a

In [124]:
print("Regexp tokenizer:\n\n", *map(re_tokenizer, data), sep='\n')


Regexp tokenizer:


['@anitapuspasari', 'waduh', '.', '.']
['"', 'Could', 'journos', 'please', 'stop', 'putting', 'the', 'word', '"', '"', 'gate', '"', '"', 'after', 'everything', 'they', 'write', '.', '.', '.', 'gate', '.', '"']
['20', 'More', 'Ridiculous', 'Sale', '@20x200', 'ends', 'tonight', '!', '-', 'get', '20', 'off', 'by', 'entering', "'", 'RIDONK', "'", 'at', 'checkout', '.', 'More', 'info', ':', 'http://bit.ly/ridonktues']
['@Studio85', 'I h', 'ave', 'a p', 'air', 'of', 'those', 'shoes', '.', 'They', 'are', 'comfy', '.', 'Like', 'being', 'barefoot', '.', 'Okay', 'for', 'running', ',', 'but', 'not', 'on', 'concrete', ',', 'as', 'I', "'ve", 'discovered', '.']
['RT', '@twilightus', 'Team', 'Carlisle', 'is', 'a T', 'rending', 'Topic', '-', 'help', 'him', 'out', 'RT', 'Follow', '@peterfacinelli', 'see', 'a g', 'rown', 'man', 'n a b', 'ikini', 'dance', 'Hollywood', 'Blvd']
['@karenrubin', 'you', 'might', 'have', 'to', 'reinstall', '-', 'that', 'happened', 'to', 'me', 'a f', 'ew', '

In [101]:
re_tokenizer("@xfranman Old age has made N A T O!")

['@xfranman', 'Old', 'age', 'has', 'made', 'N A T O', '!']
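
The hand-tokenization example in the first task expects `NATO`, while the output above keeps the internal spaces (`'N A T O'`). One possible post-processing step is sketched below. It assumes that only runs of single uppercase letters should be joined; that restriction and the helper name `collapse_spaced_letters` are assumptions for illustration, not rules from the assignment.

```python
import re
from typing import List

# A run of two or more single uppercase letters separated by single spaces.
SPACED_WORD = re.compile(r"[A-Z](?: [A-Z])+")


def collapse_spaced_letters(tokens: List[str]) -> List[str]:
    """Join tokens such as 'N A T O' into 'NATO'; leave other tokens as-is."""
    return [
        tok.replace(" ", "") if SPACED_WORD.fullmatch(tok) else tok
        for tok in tokens
    ]


print(collapse_spaced_letters(["made", "N A T O", "!"]))
```

Tokens such as `'I h'` in the regexp output are deliberately left untouched, since they come from a different failure (a one-letter word merged with the next word).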

The whitespace tokenizer is the simplest and least effective, but it provides a useful baseline, since roughly 90% of tokens may already be separated by spaces. The regexp tokenizer is a flexible tool that handles most language constructs and Twitter-specific items such as mentions, hashtags and URLs. It is not perfect, though: in the output above it merges a one-letter word with the first letter of the next word (`'I h', 'ave'`) and drops the `%` sign. Writing regular expressions is not trivial, which makes this approach error-prone. SentencePiece is a completely different kind of tokenizer: it is a model that learns its vocabulary from the text. It fills the vocabulary with frequent pieces that are not limited to whole words; as the output above shows, many of them are fragments of words.
The most useful tokenizer is the NLTK TweetTokenizer, which consists of large, complex regex patterns that are much more accurate than the ones I wrote.
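
To make the comparison less subjective, tokenizer output could be scored against the hand-tokenized tweets from the first task. The sketch below assumes that token-level precision, recall and F1 over multisets (ignoring token order) is an acceptable measure; `token_prf` is a hypothetical helper, and the example reuses the `N A T O` tweet from the guidelines with plain whitespace splitting.

```python
from collections import Counter
from typing import List, Tuple


def token_prf(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Multiset precision, recall and F1 of predicted tokens against gold tokens."""
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


gold = ["@xfranman", "Old", "age", "has", "made", "NATO", "!"]
whitespace = "@xfranman Old age has made N A T O!".split()
print(token_prf(whitespace, gold))
```

Averaging these scores over the 15 hand-tokenized tweets would give one number per tokenizer, although the choice of metric remains a judgment call.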

## Stemming and Lemmatization

Your task is to write two functions: stem and lemmatize. Input is a text, so you need to tokenize it first.

In [105]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

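def stem(texts: List[str]) -> List[str]:
    # Takes a list of tweets (as returned by readlines), tokenizes each one
    # and returns the stems of all tokens in a single flat list.
    all_tokens = []
    for text in texts:
        all_tokens.extend(nltk_tweet_tokenizer(text))
    return list(map(stemmer.stem, all_tokens))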

In [119]:
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatize(text: str) -> List[str]:
    # spaCy tokenizes the text itself; each token carries its lemma.
    return [token.lemma_ for token in nlp(text)]

In [106]:
stem(data)

['@anitapuspasari',
 'waduh',
 '..',
 '"',
 'could',
 'journo',
 'pleas',
 'stop',
 'put',
 'the',
 'word',
 '"',
 '"',
 'gate',
 '"',
 '"',
 'after',
 'everyth',
 'they',
 'write',
 '...',
 'gate',
 '.',
 '"',
 '20',
 '%',
 'more',
 'ridicul',
 'sale',
 '@20x200',
 'end',
 'tonight',
 '!',
 '-',
 'get',
 '20',
 '%',
 'off',
 'by',
 'enter',
 "'",
 'ridonk',
 "'",
 'at',
 'checkout',
 '.',
 'more',
 'info',
 ':',
 'http://bit.ly/ridonktu',
 '@studio85',
 'i',
 'have',
 'a',
 'pair',
 'of',
 'those',
 'shoe',
 '.',
 'they',
 'are',
 'comfi',
 '.',
 'like',
 'be',
 'barefoot',
 '.',
 'okay',
 'for',
 'run',
 ',',
 'but',
 'not',
 'on',
 'concret',
 ',',
 'as',
 "i'v",
 'discov',
 '.',
 'rt',
 '@twilightus',
 'team',
 'carlisl',
 'is',
 'a',
 'trend',
 'topic',
 '-',
 'help',
 'him',
 'out',
 'rt',
 'follow',
 '@peterfacinelli',
 'see',
 'a',
 'grown',
 'man',
 'n',
 'a',
 'bikini',
 'danc',
 'hollywood',
 'blvd',
 '@karenrubin',
 'you',
 'might',
 'have',
 'to',
 'reinstal',
 '-',
 'that

In [120]:
lemmatize("\n".join(data))

['@anitapuspasari',
 'waduh',
 '..',
 '\n\n',
 '"',
 'could',
 'journos',
 'please',
 'stop',
 'put',
 'the',
 'word',
 '"',
 '"',
 'gate',
 '"',
 '"',
 'after',
 'everything',
 'they',
 'write',
 '...',
 'gate',
 '.',
 '"',
 '\n\n',
 '20',
 '%',
 'More',
 'Ridiculous',
 'Sale',
 '@20x200',
 'end',
 'tonight',
 '!',
 '-',
 'get',
 '20',
 '%',
 'off',
 'by',
 'enter',
 "'",
 'ridonk',
 "'",
 'at',
 'checkout',
 '.',
 'More',
 'info',
 ':',
 'http://bit.ly/ridonktue',
 '\n\n',
 '@studio85',
 'I',
 'have',
 'a',
 'pair',
 'of',
 'those',
 'shoe',
 '.',
 'they',
 'be',
 'comfy',
 '.',
 'like',
 'be',
 'barefoot',
 '.',
 'okay',
 'for',
 'run',
 ',',
 'but',
 'not',
 'on',
 'concrete',
 ',',
 'as',
 'I',
 "'ve",
 'discover',
 '.',
 '\n\n',
 'RT',
 '@twilightus',
 'Team',
 'Carlisle',
 'be',
 'a',
 'trending',
 'topic-',
 'help',
 'he',
 'out',
 'RT',
 'Follow',
 '@peterfacinelli',
 'see',
 'a',
 'grown',
 'man',
 'n',
 'a',
 'bikini',
 'dance',
 'Hollywood',
 'Blvd',
 '\n\n',
 '@karenrubin'

## Explain sentencepiece (for masters only)

For this task you will have to use the sentencepiece text tokenizer. Your task is to read how it works and write an explanation of at least 10 sentences describing how the tokenizer works.

...
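
As a possible starting point for the explanation, the idea of learning a subword vocabulary from raw text can be illustrated with a toy byte-pair-encoding (BPE) loop. This is a sketch only: SentencePiece uses a unigram language model by default and offers BPE as an alternative model type, and its real training procedure is far more involved. The function `learn_bpe_merges` and the tiny word list are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def learn_bpe_merges(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a sequence of characters, with "▁" marking the word
    # start (the same marker that is visible in the SentencePiece output).
    corpus = Counter(tuple("▁" + w) for w in words)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        merged: Counter = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges


print(learn_bpe_merges(["the", "they", "then", "gate"], num_merges=2))
```

Each merge adds one new piece to the vocabulary, which gives an intuition for why frequent fragments such as `▁the` end up as single tokens while rare words are split into many pieces.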

## Resources

1. [Regular Expressions 1](https://realpython.com/regex-python/)
2. [Regular Expressions 2](https://realpython.com/regex-python-part-2/)
3. [Spacy Lemmatizer](https://spacy.io/api/lemmatizer)
4. [NLTK Stem](https://www.nltk.org/howto/stem.html)
5. [SentencePiece](https://github.com/google/sentencepiece)
6. [sentencepiece tokenizer](https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15)