# **My Tokenizer**

In this assignment, you are asked to create your own word tokenizer without the help of external tokenizers. Steps of the assignment:
1. Choose one of the corpora from the given nltk.corpus list and assign its name to corpus_name.
1. Create your tokenizer in the code block and use it to tokenize the selected corpus into token_list.
1. Give the raw corpus text (corpus_raw) and your token list (my_tokenized_list) to the evaluation block.

Splitting only on whitespace is not enough. Try at least two other improvements to the tokenization. Please write sufficient comments to show your reasoning.

## Rules
### Allowed:
 - Choosing a top-down tokenizer or bottom-up tokenizer
 - Using the regular expressions library (import re)
 - Adding additional code blocks
 - Using an additional dataset if you are creating a bottom-up tokenizer, as long as the code can still be run standalone.

### Not allowed:
 - Using tokenizer libraries such as nltk.tokenize, or any other external libraries to tokenize.
 - Changing the contents of the evaluation block at the end of the notebook.

## Assignment Report
Please write a short assignment report at the end of the notebook (max 500 words). Please include all of the following points in the report:
 - Corpus name and the selection reason
 - Design of the tokenizer and reasoning
 - Challenges you have faced while writing the tokenizer and challenges with the specific corpus
 - Limitations of your approach
 - Possible improvements to the system

## Grading
You will be graded on the following criteria:
 - running complete code (0.5),
 - tokenizer algorithm (2),
 - clear commenting (0.5),
 - evaluation score - comparison with nltk word tokenizer (at most 1 point),
 - assignment report (1).

## Submission

Submission will be made to SUCourse. Please submit your file using the following naming convention.


`studentid_studentname_tokenizer.ipynb  - ex. 26744_aysegulrana_tokenizer.ipynb`


**Deadline is October 22nd, 5pm.**

In [1]:
import re

In [17]:
def my_tokenizer(corpus_raw):
    '''
    type corpus_raw: string
    param corpus_raw: The raw output of the corpus to be tokenized
    rtype: list
    return: a list of tokens extracted from the corpus_raw
    '''

    # write your tokenizer here and apply to corpus_raw. Return the resulting token_list.
    # you are NOT allowed to use external tokenizers such as word_tokenize from nltk.
    # Only splitting on whitespace is not enough. At least try two other improvements on the tokenization.


    # Convert text to lowercase so that tokenization is case-insensitive
    corpus_raw = corpus_raw.lower()

    # First, handle URLs and emails:
    #   https?://[^\s]+           "http://" or "https://" followed by non-whitespace characters
    #   www\.[^\s]+               "www." followed by non-whitespace characters
    #   \b[\w.-]+?@\w+?\.\w+?\b   a simple email address such as name@host.tld
    url_tokens = re.findall(r'https?://[^\s]+|www\.[^\s]+|\b[\w.-]+?@\w+?\.\w+?\b', corpus_raw)

    for token in url_tokens:
        corpus_raw = corpus_raw.replace(token, '')  # Remove URL and email tokens from the text

    # Handle emoticons such as :) ;-) =d
    #   [:;=8xX]          the eyes: ':', ';', '=', '8' or 'x'
    #   [-^o*]?           an optional nose: '-', '^', 'o' or '*'
    #   [\)\(dDpP/\|O3]   the mouth: ')', '(', 'd', 'p', '/', '|' or '3'
    # The text is already lowercased, so the uppercase alternatives never match.
    emotions = re.findall(r'[:;=8xX][-^o*]?[\)\(dDpP/\|O3]', corpus_raw)
    # Hashtags, e.g. #nlp
    hashtags = re.findall(r'#\w+', corpus_raw)
    # Mentions, e.g. @yagmurdolunay
    mentions = re.findall(r'@\w+', corpus_raw)
    # Normalize repeated punctuation, e.g. "!!!!" -> "!"
    normalized_text = re.sub(r'([!?]){2,}', r'\1', corpus_raw)
    # Tokenize the normalized text: words (\w+) and single punctuation marks ([^\w\s]) become separate tokens
    token_list = re.findall(r'\w+|[^\w\s]', normalized_text)

    # Add back the special tokens (URLs, emails, emoticons, hashtags, mentions)
    token_list.extend(url_tokens + emotions + hashtags + mentions)

    return token_list

You are allowed to add code blocks above to use for your tokenizer or evaluate it.



In [18]:
#main code to run your tokenizer.

#import your libraries here
import nltk
from nltk.corpus import webtext
import re

#select the corpus name from the list below
#gutenberg, webtext, reuters, product_reviews_2

corpus_name = 'webtext'

#download the corpus and import it.
nltk.download('webtext')

#get the raw text output of the corpus to the corpus_raw variable.
corpus_raw = webtext.raw()

#call your tokenizer method
my_tokenized_list = my_tokenizer(corpus_raw)



[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!


## Please do not touch the code below that will evaluate your tokenizer with the nltk word tokenizer. You will get zero points from evaluation if you do so.

In [19]:
def similarity_score(set_a, set_b):
    '''
    type set_a: set
    param set_a: The first set to be compared
    type set_b: set
    param set_b: The tokens extracted from the corpus_raw
    rtype: float
    return: similarity score with two sets using Jaccard similarity.
    '''

    jaccard_similarity = float(len(set_a.intersection(set_b)) / len(set_a.union(set_b)))

    return jaccard_similarity

In [20]:
from nltk import word_tokenize
nltk.download('punkt')
from nltk import punkt

def evaluation(corpus_raw, token_list):
    '''
    type corpus_raw: string
    param corpus_raw: The raw output of the corpus
    type token_list: list
    param token_list: The tokens extracted from the corpus_raw
    rtype: float
    return: comparison score with the given token list and the nltk tokenizer.
    '''

    #The comparison score only looks at the tokens but not the frequencies of the tokens.
    #we assume case folding is already applied to the token_list
    corpus_raw = corpus_raw.lower()
    nltk_tokens = word_tokenize(corpus_raw, language='english')

    score = similarity_score(set(token_list), set(nltk_tokens))

    return score

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [21]:
#Evaluation

eval_score = evaluation(corpus_raw, my_tokenized_list)

print('The similarity score is {:.2f}'.format(eval_score))

The similarity score is 0.81
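
To make the score easier to interpret, here is a small illustration of how Jaccard similarity behaves on token sets. The helper `jaccard` is a hypothetical stand-in that mirrors `similarity_score` above, and both token sets are written by hand for the text "don't stop!": NLTK's `word_tokenize` splits the contraction Treebank-style into `do` and `n't`, while a `\w+|[^\w\s]` pattern yields `don`, `'` and `t`.

```python
def jaccard(set_a, set_b):
    # Hypothetical helper mirroring similarity_score: |A ∩ B| / |A ∪ B|
    return len(set_a & set_b) / len(set_a | set_b)

# Hand-written token sets for the text "don't stop!"
regex_tokens = {"don", "'", "t", "stop", "!"}   # produced by r"\w+|[^\w\s]"
treebank_tokens = {"do", "n't", "stop", "!"}    # Treebank-style contraction split

# Only "stop" and "!" are shared: 2 common tokens out of 7 distinct ones.
score = jaccard(regex_tokens, treebank_tokens)
print(round(score, 2))  # 0.29

# Duplicates do not matter, because the comparison is made on sets.
print(jaccard(set(["!", "!", "stop"]), {"!", "stop"}))  # 1.0
```

Because only distinct tokens are compared, every contraction or special token that is split differently from the reference adds entries to the union without adding to the intersection, which lowers the score.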


Please write your report below using clear headlines with markdown syntax.

# Report

**Corpus Name: WebText**

* *Selection Reason*

    I chose the WebText corpus because it represents modern, internet-style text, including forum posts, overheard conversations, and personal ads, similar to the informal writing found on platforms such as Reddit and Twitter. I also believe that language is used more dynamically on the web and social media, which makes WebText more relevant and interesting than other corpora such as Gutenberg or Reuters. These web-based sources reflect how language has evolved over time, especially in informal conversation, which makes the corpus well suited to tasks such as sentiment analysis and web text mining.

* *Design of the Tokenizer and Reasoning*

    The tokenizer is designed to capture the informal, internet-specific elements found in the WebText corpus. It starts by converting the text to lowercase to ensure case insensitivity. Special attention is paid to preserving URLs, emails, hashtags, and mentions, which are necessary to understand the structure of online discourse. Emoticons are then detected using a pattern that captures common ASCII forms such as ":)" and ";-)". In addition, exaggerated punctuation such as "!!!" is reduced to a single mark, which helps to avoid noise. The remaining text is tokenized by splitting on whitespace and punctuation, keeping each punctuation mark as a separate token.
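
The same design can also be expressed as a single regular expression whose alternatives are tried in priority order, so that special tokens stay in their original position instead of being appended at the end of the list. The following is a minimal sketch under that assumption, not the code used for the reported score. The names `TOKEN_PATTERN` and `tokenize_in_order` are hypothetical, and the sub-patterns are simplified versions of the ones in `my_tokenizer`.

```python
import re

# Alternatives are tried left to right at each position, so the more
# specific patterns (URLs, emails) must come before the generic ones.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+            # URLs starting with http:// or https://
  | www\.\S+                # URLs starting with www.
  | [\w.-]+@\w+\.\w+        # simple email addresses
  | [#@]\w+                 # hashtags and mentions
  | [:;=][-^o*]?[)(dp/|3]   # a few common ASCII emoticons (lowercased text)
  | \w+                     # ordinary words and numbers
  | [^\w\s]                 # any remaining punctuation mark, one at a time
    """,
    re.VERBOSE,
)

def tokenize_in_order(text):
    # Lowercase, collapse runs of the same '!' or '?' into one, then scan left to right.
    text = re.sub(r"([!?])\1+", r"\1", text.lower())
    return TOKEN_PATTERN.findall(text)

tokens = tokenize_in_order("Wow!!! See https://example.com :) #nlp @user")
print(tokens)
# ['wow', '!', 'see', 'https://example.com', ':)', '#nlp', '@user']
```

A single pass avoids removing and re-adding tokens, at the cost of one larger pattern that has to be ordered carefully.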

* *Challenges Faced*

    One of the primary challenges in tokenizing WebText was handling the wide variety of text types present in informal online conversations, such as URLs, social media handles, and informal grammar. The greatest difficulty was identifying the characters used to form emoticons and capturing them with regular expressions without breaking the structure of the surrounding text. Since WebText contains a large number of irregular tokens, such as hashtags, mentions, and hyperlinks, a significant part of the development effort went into handling these tokens robustly.
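
The emoticon difficulty can be seen on a small example. The `loose` pattern below is the one used in `my_tokenizer`; the `strict` variant is an assumed refinement (not part of the submitted tokenizer) that uses lookarounds so that an emoticon is only accepted when no word character touches it. The sample sentence is made up for illustration.

```python
import re

text = "i expect 1983 results :) xd"

# Pattern used in my_tokenizer: matches anywhere, even inside words and numbers.
loose = re.compile(r"[:;=8xX][-^o*]?[\)\(dDpP/\|O3]")
# Assumed refinement: reject matches that are glued to a word character.
strict = re.compile(r"(?<!\w)[:;=8xX][-^o*]?[\)\(dDpP/\|O3](?!\w)")

loose_matches = loose.findall(text)
strict_matches = strict.findall(text)

print(loose_matches)   # ['xp', '83', ':)', 'xd']
print(strict_matches)  # [':)', 'xd']
```

The stricter pattern removes the false positives inside "expect" and "1983", but it would also miss an emoticon written directly after a word, as in "hello:)", so it is a trade-off rather than a complete fix.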

* *Limitations of my approach*

    One of the biggest limitations of my tokenizer is that it lacks semantic understanding: it works at the surface level, splitting and normalizing tokens without deeper language processing. It does not perform lemmatization or stemming, so words are not reduced to their root forms, which can affect tasks such as text classification. In addition, while the tokenizer captures common emoticons, it struggles with more complex emoji patterns because it does not use specialized libraries. Finally, normalizing punctuation (for example, "!!!" to "!") and lowercasing can eliminate emphasis and context that are important in online communication.

* *Possible improvements*

    1.   Incorporating stemming and lemmatization
    2.   Removing stopwords
    3.   Enhancing emoji detection, possibly with a dedicated library
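
As an illustration of the second item, stopword removal can be done as a post-processing step on the token list without any external library. In the sketch below, the `STOPWORDS` set is a tiny hand-picked sample chosen for illustration only, and `remove_stopwords` is a hypothetical helper; a real system would need a much fuller list.

```python
# Tiny illustrative stopword set; an assumption, not a standard list.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def remove_stopwords(tokens):
    # Keep tokens that are not in the stopword set (tokens are assumed lowercased).
    return [tok for tok in tokens if tok not in STOPWORDS]

tokens = ["the", "tokenizer", "is", "a", "sketch", "of", "an", "idea", "!"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['tokenizer', 'sketch', 'idea', '!']
```

Filtering like this would be applied after tokenization and only for downstream tasks, since removing tokens would lower the similarity score against the reference tokenizer.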


