# Introduction to Natural Language Processing: Assignment 1

In this assignment we'll practice tokenization, lemmatization and stemming.

- Please comment your code
- Submissions are due Thursday at 23:59 and should be submitted **ONLY** on eCampus: **Assignments >> Student Submissions >> Assignment 1 (Deadline: 14.11.2023, at 23:59)**
- Name the file appropriately: "Assignment_1_\<Your_Name\>.ipynb".
- Please submit **ONLY** the Jupyter Notebook file.
- Please use a relative path; your code should work on my computer if the Jupyter Notebook and the file are both in the same directory.

Example: file_name = lemmatization-en.txt >> **DON'T use:** /Users/ComputerName/Username/Documents/.../lemmatization-en.txt

In [1]:
import re
import nltk
import pandas as pd

# Tokenization

$\textbf{Tokenization}$ is the process of creating $\textbf{Tokens}$.

$\textbf{Tokens}$ are the building units of a text sequence.

Simply said, the units that build up a text corpus are $\textbf{tokens}$, and the process of splitting a text sequence into its tokens is $\textbf{tokenization}$.

$\textbf{Tokens}$ can be (depending on our task and goal):
   - characters
   - words (individual words or sets of multiple words together)
   - parts of words
   - punctuation marks
   - sentences

## Types of Tokenization

There are 3 main types of tokenization:
   - Word Tokenization - splits a text corpus into words
   - Sentence Tokenization - splits a text corpus into sentences
   - Character Tokenization - splits a text corpus into individual characters
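
A minimal sketch of the three types, using only the standard `re` module. The regular expressions are simplifying assumptions for illustration (for example, the sentence splitter treats every `.`, `!` or `?` followed by whitespace as a sentence boundary, which breaks on abbreviations); they are not how NLTK tokenizes.

```python
import re

text = "Tokenization is fun. It splits text!"

# Word tokenization: runs of letters, digits and underscores (\w+)
word_tokens = re.findall(r"\w+", text)

# Sentence tokenization: split after ., ! or ? when followed by whitespace
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Character tokenization: every character, including spaces and punctuation
char_tokens = list(text)

print(word_tokens)       # ['Tokenization', 'is', 'fun', 'It', 'splits', 'text']
print(sentence_tokens)   # ['Tokenization is fun.', 'It splits text!']
print(len(char_tokens))  # 36
```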

### Task 1.1 (3 points)

Write a function `extract_words_tokens(any_string)` that takes a string as input and returns two numbers:
1. num_words: The number of words in the string
2. num_tokens: The number of tokens in the string (Please use character-based tokenization.)

**Hint:** The string can be a single word or a sentence and can contain some special characters, such as: "!", ",", ":"

The function that creates my tokens is given below. I use the `re` module and its `split` function, so the sentence is split wherever the provided regex matches. The regex works as follows:
   - `[ ]` - a character class; it groups together all the characters that we want to split on
   - `\s!@.:;,#` - the characters that we want to split on; `\s` matches any single whitespace character (space, tab, newline)
   - `+` - one or more occurrences of the preceding element, which here is the whole character class (this also shows what `[ ]` is for)

So the text is split on every run of one or more of the characters that appear between the square brackets. A more complex regex could cover many more cases and special characters, but it would only be an extension of the one written here.

The only problem with this regex is the so-called $\textbf{trailing dot}$, that is, a dot that appears as the last character of the sentence. `re.split` splits on this dot as well, but since nothing follows it, it returns `''` (an empty string) as the final token. Because this empty string can only come from a trailing delimiter and is always the last element of the returned list, we can simply drop that last element.
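
The behaviour described above can be checked directly. This is a small illustration with made-up sentences; the last lines show a possible alternative (filtering out empty strings), which is an assumption of this sketch and not part of the solution.

```python
import re

regex = r'[\s!@.:;,#]+'

# A trailing delimiter leaves an empty string at the end of the result
with_dot = re.split(regex, "Hello, world.")
print(with_dot)      # ['Hello', 'world', '']

# Without a trailing delimiter there is no empty string to remove
without_dot = re.split(regex, "Hello, world")
print(without_dot)   # ['Hello', 'world']

# Alternative: drop empty strings wherever they occur
words = [w for w in re.split(regex, "Hello, world.") if w]
print(words)         # ['Hello', 'world']
```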

In [2]:
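def extract_words_tokens(any_string, regex):
    # Split the string wherever the delimiter regex matches
    word_list = re.split(regex, any_string)
    # A trailing delimiter leaves '' as the last element; drop it only if present
    if word_list and word_list[-1] == '':
        word_list = word_list[:-1]
    # Tokens here are the unique words, sorted for a stable order
    tokens = sorted(set(word_list))
    # Return the lists; the caller takes len() of each to get the counts
    return word_list, tokens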

In [3]:
regex = r'[\s!@.:;,#]+'
words, tokens = extract_words_tokens("This is is is a sample text for. testing RegexpTokenizer in NLTK.NLTK.NLTK.", regex)
print('Number of words: ', len(words))
print('Number of tokens: ', len(tokens))

Number of words:  14
Number of tokens:  10
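
The task statement asks for two numbers and for character-based tokenization, while the cell above returns lists and counts unique words. A minimal sketch of the more literal reading follows. The name `count_words_and_chars` is hypothetical, and counting every non-whitespace character as one token is an assumption about what "character-based" means here.

```python
import re

def count_words_and_chars(any_string, regex=r'[\s!@.:;,#]+'):
    # Words: split on the delimiter set and ignore empty strings
    words = [w for w in re.split(regex, any_string) if w]
    # Character tokens: every non-whitespace character (assumption)
    char_tokens = [c for c in any_string if not c.isspace()]
    return len(words), len(char_tokens)

num_words, num_tokens = count_words_and_chars("Hello, world!")
print(num_words, num_tokens)  # 2 12
```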


# Lemmatization

### Task 1.2 (4 points)

Write a function `lemmatize(any_string, file_name)` that takes as input any string and a file-name: `lemmatization-en.txt` (please download the file [here](https://github.com/michmech/lemmatization-lists/blob/master/lemmatization-en.txt). It's a tab separated corpus) and returns a dictionary with all words as keys and the lemma of the words as values.

**Hint:** To tokenize the string, please use whitespace as the separator. The string doesn't contain any special characters.

### Note:
The task asks for a dictionary of all $\textbf{words}$. However, the same word can occur several times in the string (all occurrences being represented by the same token), and dictionary keys must be unique. I therefore assume the intended meaning is a dictionary with all $\textbf{tokens}$ as keys and the $\textbf{lemma}$ of each $\textbf{token}$ as its value.

In [4]:
def lemmatize(any_string, file_name):
    df = pd.read_csv(file_name, sep='\t', header=None, names=['lemma', 'token'])
    # Use the 'token' column as the index for easier lookup (one lemma is shared by many tokens)
    df.set_index('token', inplace=True)
    
    # Regex meaning: split on either
    # 1) [\s]+ -> one or more occurrences of whitespace, or (|)
    # 2) $ -> the end of the string, which yields a trailing '' that extract_words_tokens drops
    regex = r'[\s]+|$'
    words, tokens = extract_words_tokens(any_string, regex)
    print(words)
    
    dictionary_of_lemmatized_tokens = {}
    for token in tokens:
        # Assumes every token occurs in the file; an unknown token raises a KeyError
        dictionary_of_lemmatized_tokens[token] = df.loc[token]['lemma']
        
    return dictionary_of_lemmatized_tokens

In [5]:
file_name = "lemmatization-en.txt"

In [6]:
dictionary_of_lemmatized_tokens = lemmatize("bustards busies acclimated acclimates acclimating", file_name)

['bustards', 'busies', 'acclimated', 'acclimates', 'acclimating']


In [7]:
dictionary_of_lemmatized_tokens

{'acclimated': 'acclimate',
 'acclimates': 'acclimate',
 'acclimating': 'acclimate',
 'busies': 'busy',
 'bustards': 'bustard'}
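
The lookup above raises a `KeyError` for any word that is missing from the file. A possible variant that falls back to the word itself is sketched below. It is an illustration only: the names `build_lemma_lookup` and `lemmatize_with_fallback` are hypothetical, the three-line `sample` stands in for `lemmatization-en.txt`, and returning unknown words unchanged is an assumption rather than something the task specifies.

```python
import csv
import io

# Tiny stand-in for lemmatization-en.txt: "<lemma>\t<inflected form>" per line
sample = "busy\tbusies\nbustard\tbustards\nacclimate\tacclimated\n"

def build_lemma_lookup(lines):
    # Map each inflected form to its lemma
    reader = csv.reader(lines, delimiter="\t")
    return {form: lemma for lemma, form in reader}

def lemmatize_with_fallback(any_string, lookup):
    # Unknown words map to themselves instead of raising a KeyError
    return {word: lookup.get(word, word) for word in any_string.split()}

lookup = build_lemma_lookup(io.StringIO(sample))
result = lemmatize_with_fallback("bustards busies unknownword", lookup)
print(result)
# {'bustards': 'bustard', 'busies': 'busy', 'unknownword': 'unknownword'}
```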

### Task 1.3 (3 points)

Write a function `stemmer(string)` that takes a string as input and returns a string containing only its stem.

Create rules for the following forms of the verbs. Here is one example:

- (Infinitive form) >> study - studi
- (Present simple tense: Third person) >> studies - studi
- (Continuous tense) >> studying - studi
- (Past simple tense) >> studied - studi

**Hint:** The string can be a single word or a sentence and can contain some special characters, such as: "!", ",", ":"

As in lemmatization, the goal is to get the base form of a word. Stemming is a simpler process than lemmatization, since the stem does not need to be a meaningful word itself (whereas in lemmatization the base form must be a meaningful word). The main step in stemming is therefore to find a set of rules that maps a word to its stem. This is not a simple process, so here we only mimic it with a set of naive rules, to show that we understand the main idea of how stemming works.

For that purpose, I wrote a set of naive rules: given one word as input, the function outputs its stem. I print examples for a few words.

In [8]:
def stemmer(any_string):
    # Stemming rules as regex: suffix pattern -> replacement.
    # Each pattern is a dictionary key, so it can appear only once.
    rules = {
        "y$" : 'i',
        "ies$" : 'i',
        "ying$" : 'i',
        "ing$": "",
        "ed$": "",
        "ves$": "f",
        "ied$": "y",
        "er$": "",
        "est$": "",
        "en$": "",
        "ly$": "",
        "ful$": "",
        "ment$": "",
        "ness$": "",
        "able$": "",
        "ize$": "",
        "ise$": "",
        "ation$": "",
        "ator$": "",
        "ative$": "",
        "al$": "",
        "ence$": "",
        "ance$": "",
        "tion$": "",
        "ion$": "",
        "ity$": "",
        "ous$": "",
        "ify$": "",
        "ible$": "",
        "ism$": "",
        "ist$": "",
        "ite$": "",
        "ship$": "",
        "hood$": ""
    }

    
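    stemmed_string = any_string
    # Every rule is tested against the original string, so when several
    # rules match, the one listed last in `rules` determines the result.
    for rule, replacement in rules.items():
        pattern = re.compile(rule)
        if re.search(pattern, any_string):
            stemmed_string = re.sub(pattern, replacement, any_string)
            
    return stemmed_string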

In [9]:
print(stemmer('studies'))
print(stemmer('neighbourhood'))
print(stemmer('likelihood'))
print(stemmer('fence'))
print(stemmer('stance'))
print(stemmer('crazy'))

print(stemmer('play'))
print(stemmer('plays'))
print(stemmer('playing'))
print(stemmer('played'))

studi
neighbour
likeli
f
st
crazi
plai
plays
play
play


The results are not very good, but the logic of how stemming works is implemented. With a proper set of rules, we would be able to convert each word to its corresponding stem.
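
One way to make the outcome less dependent on the order of the rules is to try longer suffixes first and stop at the first match. The sketch below is only an illustration under stated assumptions, not a real stemming algorithm: the name `simple_stemmer` and the six rules are hypothetical and chosen only to cover the `study` example from the task.

```python
import re

# (suffix pattern, replacement) pairs; hypothetical rules for illustration
RULES = [
    (r"ying$", "i"),
    (r"ied$", "i"),
    (r"ies$", "i"),
    (r"ing$", ""),
    (r"ed$", ""),
    (r"y$", "i"),
]

def simple_stemmer(word):
    # Try longer suffix patterns first and apply only the first rule that matches
    ordered = sorted(RULES, key=lambda rule: len(rule[0]), reverse=True)
    for pattern, replacement in ordered:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    # No rule matched: return the word unchanged
    return word

for w in ["study", "studies", "studying", "studied", "played"]:
    print(w, "->", simple_stemmer(w))
```

All four forms of `study` map to `studi`, and `played` maps to `play`. The rules still overgeneralize: `playing` ends in `ying`, so it becomes `plai`, which does not match the stem of `played`.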