# **My Tokenizer**

In this assignment, you are asked to create your own word tokenizer without the help of external tokenizers. Steps to the assignment:
1. Choose one of the corpora from the given nltk.corpus list and assign its name to corpus_name
1. Create your tokenizer in the code block and tokenize the selected corpus into token_list
1. Give the raw corpus text, corpus_raw, and your token list, my_tokenized_list, to the evaluation block

Splitting on whitespace alone is not enough. Try at least two other improvements to the tokenization. Please write sufficient comments to show your reasoning.

## Rules
### Allowed:
 - Choosing a top-down tokenizer or bottom-up tokenizer
 - Using the regular expressions library (import re)
 - Adding additional code blocks
 - Using an additional dataset if you are creating a bottom-up tokenizer, as long as the code can still be run standalone.

### Not allowed:
 - Using tokenizer libraries such as nltk.tokenize, or any other external libraries to tokenize.
 - Changing the contents of the evaluation block at the end of the notebook.

## Assignment Report
Please write a short assignment report at the end of the notebook (max 500 words). Please include all of the following points in the report:
 - Corpus name and the selection reason
 - Design of the tokenizer and reasoning
 - Challenges you have faced while writing the tokenizer and challenges with the specific corpus
 - Limitations of your approach
 - Possible improvements to the system

## Grading
You will be graded according to the following criteria:
 - running complete code (0.5),
 - tokenizer algorithm (2),
 - clear commenting (0.5),
 - evaluation score - comparison with nltk word tokenizer (at most 1 point),
 - assignment report (1).

## Submission

Submission will be made to SUCourse. Please submit your file using the following naming convention.


`studentid_studentname_tokenizer.ipynb  - ex. 26744_aysegulrana_tokenizer.ipynb`


**Deadline is October 22nd, 5pm.**

In [227]:
import re

def my_tokenizer(corpus_raw):
    '''
    type corpus_raw: string
    param corpus_raw: The raw output of the corpus to be tokenized
    rtype: list
    return: a list of tokens extracted from the corpus_raw
    '''

    # write your tokenizer here and apply to corpus_raw. Return the resulting token_list.
    # you are NOT allowed to use external tokenizers such as word_tokenize from nltk.
    # Only splitting on whitespace is not enough. At least try two other improvements on the tokenization.

    # Split the corpus into lines on newline characters
    lines = corpus_raw.split('\n')

    # An empty list for tokens and a list of punctuation characters.
    # The stripping loop below compares one trailing character at a time,
    # so "--" and "..." are peeled off character by character.
    words = []
    punkt = [".", ",", "'", "!", "?", ";", ":", "$", "&", '"', "--", "...", "-"]

    # Iterate over the lines
    for line in lines:

        # Get words
        for word in line.split():

            # Lowercase the word
            word = word.lower()

            # Strip trailing punctuation from the word and collect it in temp_punkt_list;
            # it is appended to the token list after the word itself has been processed
            temp_punkt_list = []
            while word and word[-1] in punkt:
                temp_punkt_list.insert(0, word[-1])  # Insert at the front to keep the original order
                word = word[:-1]

            # Remove every character other than letters, digits and - . , ' : ; !
            word = re.sub(r"[^a-zA-Z0-9\-.,':;!]", "", word)

            # If there is an apostrophe, handle it similarly to NLTK
            if re.search(r"'", word):

                # If there is a numeric character
                if re.search(r"\d", word):

                    # Handle abbreviated years like '99: keep the part before the
                    # apostrophe and the two digits; the apostrophe itself is dropped.
                    # (A trailing apostrophe, as in 87', was already stripped above.)
                    if re.search(r"'\d{2}$", word):
                        words.append(word[:-3])
                        words.append(word[-2:])

                    # Handle decade formats like 1980's
                    elif re.search(r"^\d+'s$", word):
                        words.extend([word[:-2], "'s"])

                    # Note: any other word containing both a digit and an
                    # apostrophe matches neither branch and is not added.

                # If there is no numeric character
                else:

                    # Handle clitics

                    # Keep names such as o'connor or d'angelo whole. Only the
                    # first letter is tested, so any word starting with "o" or
                    # "d" (for example don't) also takes this branch.
                    if word[0] == 'o' or word[0] == 'd':
                        words.append(word)

                    # For everything else, split the word at the first apostrophe
                    else:
                        found = re.search("'", word)
                        apostrophe_loc = found.start()
                        words.append(word[:apostrophe_loc])
                        words.append(word[apostrophe_loc:])

            # Add the regular words
            else:
                words.append(word)

            # Add the punctuation collected earlier
            for punc in temp_punkt_list:
                words.append(punc)

    token_list = words

    return token_list

You are allowed to add code blocks above to use for your tokenizer or evaluate it.



In [228]:
#main code to run your tokenizer.

#import your libraries here
import nltk

#select the corpus name from the list below
#gutenberg, webtext, reuters, product_reviews_2

corpus_name = "reuters"

#download the corpus and import it.
nltk.download(corpus_name)
nltk.download('punkt')
from nltk.corpus import reuters

#get the raw text output of the corpus to the corpus_raw variable.
corpus_raw = reuters.raw()

#call your tokenizer method
my_tokenized_list = my_tokenizer(corpus_raw)

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Please do not touch the code below that will evaluate your tokenizer with the nltk word tokenizer. You will get zero points from evaluation if you do so.

In [229]:
def similarity_score(set_a, set_b):
    '''
    type set_a: set
    param set_a: The first set to be compared
    type set_b: set
    param set_b: The tokens extracted from the corpus_raw
    rtype: float
    return: similarity score with two sets using Jaccard similarity.
    '''

    jaccard_similarity = float(len(set_a.intersection(set_b)) / len(set_a.union(set_b)))

    return jaccard_similarity

In [230]:
from nltk import word_tokenize
nltk.download('punkt')
from nltk import punkt

def evaluation(corpus_raw, token_list):
    '''
    type corpus_raw: string
    param corpus_raw: The raw output of the corpus
    type token_list: list
    param token_list: The tokens extracted from the corpus_raw
    rtype: float"
    return: comparison score with the given token list and the nltk tokenizer.
    '''

    #The comparison score only looks at the tokens but not the frequencies of the tokens.
    #we assume case folding is already applied to the token_list
    corpus_raw = corpus_raw.lower()
    nltk_tokens = word_tokenize(corpus_raw, language='english')

    score = similarity_score(set(token_list), set(nltk_tokens))

    return score

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [231]:
#Evaluation

eval_score = evaluation(corpus_raw, my_tokenized_list)

print('The similarity score is {:.2f}'.format(eval_score))

The similarity score is 0.80
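
To make the score easier to interpret, here is a small worked example of the Jaccard similarity used by the evaluation block. The two token sets are made up for illustration and are not taken from the corpus or from NLTK's output; `jaccard` is a hypothetical helper that mirrors the formula in `similarity_score`.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical token sets (illustration only)
mine = {"the", "bank", "'s", "profit", "rose", "."}
reference = {"the", "bank", "'s", "profits", "rose", "."}

score = jaccard(mine, reference)
print("{:.2f}".format(score))  # 5 shared tokens out of 7 distinct tokens -> 0.71

# Because sets are compared, repeated tokens do not change the score
assert set(["rose", "rose", "."]) == {"rose", "."}
```

Since the comparison works on sets, a token that is produced once counts the same as one produced thousands of times, so the score reflects vocabulary overlap rather than per-occurrence accuracy.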


Please write your report below using clear headlines with markdown syntax.

# Report

## Corpus name and the selection reason

I selected Reuters as the corpus for two reasons:


*   I am interested in working with political data, so I thought this would be a good start.
*   I liked the mix of challenges and conveniences it brings: typos are not a concern, but news-specific expressions have to be handled.

## Design of the tokenizer and reasoning
I started by investigating NLTK, since it is used as the benchmark. I tried several sentences and edge cases and noted how it behaves.

I noticed that it has a rather interesting approach to clitics, which became the main focus of my approach.

First, I split the corpus into lines and words and lowercase each word immediately. Since trailing punctuation marks are easiest to check at this stage, I strip them from the word and collect them in a list, which is added to the tokens after the preceding word has been handled.

After the punctuation step, I remove every character that is neither alphanumeric nor punctuation.

The main part starts with checking for apostrophes, as they are the hardest to handle. I check whether the word contains numerical characters and handle it accordingly; there are two branches for this, covering the two most common ways that numbers occur with apostrophes in Reuters. If there is no numerical character, I first check whether the word starts with "o" or "d", which is meant to catch names such as "O'Connor" and "d'Angelo", and keep it whole; if not, I find the apostrophe, split the word there, and add both parts as tokens.

A regular word without an apostrophe is added as it is.

Lastly, the punctuation marks are added as mentioned.

The reason for this approach is the nature of a news-oriented dataset, with apostrophes being the hardest part to handle. I therefore built a tokenizer that puts apostrophes at the center and adds some news-specific rules around them.
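
The two central steps of this design, peeling trailing punctuation and splitting at the apostrophe, can be sketched in isolation as follows. This is a simplified illustration rather than the full tokenizer: the helper names `peel_trailing_punctuation` and `split_at_apostrophe` are hypothetical, and the number and name branches are left out.

```python
# Single punctuation characters that may be peeled off the end of a word
PUNCT = {".", ",", "'", "!", "?", ";", ":", "$", "&", '"', "-"}

def peel_trailing_punctuation(word):
    """Return the word without trailing punctuation, plus the peeled marks in order."""
    trailing = []
    while word and word[-1] in PUNCT:
        trailing.insert(0, word[-1])  # insert at the front to keep the original order
        word = word[:-1]
    return word, trailing

def split_at_apostrophe(word):
    """Split a word at its first apostrophe; words without one stay whole."""
    loc = word.find("'")
    if loc == -1:
        return [word]
    return [word[:loc], word[loc:]]

core, trailing = peel_trailing_punctuation("company's,")
tokens = split_at_apostrophe(core) + trailing
print(tokens)  # ['company', "'s", ',']
```

Peeling the punctuation first means the apostrophe logic only ever sees the core of the word, which keeps the later rules short.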

## Challenges you have faced while writing the tokenizer and challenges with the specific corpus
There were some news-specific expressions that were hard to handle.

Since there were many dates, I added two separate branches just for them; coming up with this idea required some investigation.

Another issue was names like "O'Connor" and "d'Angelo" that originate from other countries. I added another branch to handle these, as I noticed them while investigating.
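
The tokenizer cell tests only the first letter of the word for this case. A stricter test, shown here as a sketch and not as the code that produced the reported score, would require the apostrophe to follow the "o" or "d" directly. The pattern `NAME_PREFIX` and the helper `is_prefixed_name` are hypothetical names, and the input is assumed to be lowercased already.

```python
import re

# "o" or "d", then an apostrophe, then at least one more letter
NAME_PREFIX = re.compile(r"^[od]'[a-z]")

def is_prefixed_name(word):
    """Return True for lowercased words such as o'connor or d'angelo."""
    return bool(NAME_PREFIX.match(word))

for candidate in ["o'connor", "d'angelo", "don't", "didn't"]:
    print(candidate, is_prefixed_name(candidate))
```

With this check, contractions such as "don't" fall through to the general apostrophe split instead of being kept whole. Words like "o'clock" would still match and stay in one piece.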

The corpus has other news-specific expressions that were really hard to handle with a rule-based approach, such as '\&lt;national'.

## Limitations of your approach
As mentioned, the corpus has many interesting cases for which top-down, rule-based tokenization is not easy to apply. For example, it has expressions like '\&lt;national' or 'us...nation'. Handling irregular punctuation in the middle of a word is really hard, and my approach does not behave as NLTK does in these cases.

## Possible improvements to the system
Some code could be added to handle punctuation in the middle of words and some news-specific expressions like intl'.
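
One possible starting point for the mid-word punctuation case is sketched below. It splits a word on internal runs of two or more dots or on a double hyphen and keeps the delimiter as its own token. The names `INTERNAL_PUNCT` and `split_internal_punctuation` are hypothetical, and whether the result agrees with NLTK on the Reuters corpus has not been checked.

```python
import re

# Runs of two or more dots, or a double hyphen; the capturing group keeps
# the delimiter in the output of re.split
INTERNAL_PUNCT = re.compile(r"(\.{2,}|--)")

def split_internal_punctuation(word):
    """Split a word around internal '...' or '--', dropping empty pieces."""
    return [part for part in INTERNAL_PUNCT.split(word) if part]

print(split_internal_punctuation("us...nation"))   # ['us', '...', 'nation']
print(split_internal_punctuation("profit--loss"))  # ['profit', '--', 'loss']
print(split_internal_punctuation("u.s."))          # ['u.s.'] (single dots are left alone)
```

Single dots are deliberately not treated as delimiters so that abbreviations stay intact. Expressions such as intl' would still need a rule of their own.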
