*Part One:*
---
### 1_1
Named entity recognition (NER) is an NLP technique that locates named entities in unstructured text and classifies them into predefined categories such as persons, organizations, times, and locations. A typical pipeline first splits the input into sentences and tokens, using cues such as punctuation and capitalization to find sentence boundaries. It then detects candidate entity spans and assigns each one a category. In ML-based systems, a model trained over multiple iterations on labeled data recognizes entities in new text, and its accuracy improves as training proceeds. NER methods include lexicon-based, rule-based, ML-based, and DL-based approaches. In Python, NER is available through the spaCy library.
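
As a rough illustration of the simplest family listed above, the lexicon-based approach, the sketch below looks up entity names in a small hand-written gazetteer. The `GAZETTEER` entries, the `lexicon_ner` helper, and the sample sentence are all made up for illustration and are not taken from the assignment data; a statistical pipeline would rely on context rather than exact string matches.

```python
import re

# Hypothetical gazetteer: surface form -> entity label (illustrative only).
GAZETTEER = {
    "Tim Cook": "PERSON",
    "Apple": "ORG",
    "California": "LOC",
}

def lexicon_ner(text):
    """Return (start, end, surface, label) for every gazetteer hit, sorted by position."""
    entities = []
    for surface, label in GAZETTEER.items():
        # \b keeps "Apple" from matching inside a longer word such as "Apples".
        for match in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            entities.append((match.start(), match.end(), surface, label))
    return sorted(entities)

sample = "Tim Cook said Apple will hire more engineers in California."
for start, end, surface, label in lexicon_ner(sample):
    print(f"{surface!r} [{start}:{end}] -> {label}")
```

The lookup only finds names that are already in the gazetteer, which is the main weakness of this family of methods.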

### 1_3
Using regular expressions (regex) for NER has both benefits and drawbacks.

Because regex gives direct control over the patterns matched in the text, no training is required, and matching simple patterns is fast and efficient. However, regex has no contextual understanding and limited flexibility, so it may miss important entities; it is therefore a poor fit for difficult tasks or large-scale datasets.
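
The lack of contextual understanding can be seen in a small sketch. The `month_pattern` below and the two sentences are hypothetical examples, not part of the assignment data: the pattern treats a capitalized month name as a date entity, and it cannot tell the month "May" from the modal verb "May".

```python
import re

# Hypothetical pattern: a capitalized month name is treated as a DATE entity.
month_pattern = r"\b(?:January|March|May|June)\b"

sentences = [
    "The conference starts in May.",   # "May" is a month here
    "May I borrow your notes?",        # "May" is a modal verb here
]

for sentence in sentences:
    # The pattern sees only characters, so both sentences produce a match.
    print(sentence, "->", re.findall(month_pattern, sentence))
```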

In [None]:
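# 1_2:
import re

file_path = r'data\ner.txt'

with open(file_path, 'r') as file:
    text = file.read()

# The local part may contain dots, underscores, etc.; `{2,}` captures the whole TLD.
# A bare `\.[a-zA-Z]` stops after one letter (e.g. 'info@vrinnovations.c').
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
# Country code, then digit groups separated by spaces or hyphens. The hyphen is
# placed last in the class so it is a literal: in `[\s -0-9]` the ` -0` part is
# parsed as a character range that leaves out the digits 1-8, which cuts
# matches short (e.g. '+1-800').
phone_pattern = r'\+[0-9]+[ 0-9-]*[0-9]+'
# Matches only URLs that start with http(s)://www. and captures no path or query.
url_pattern = r'https?://www\.[a-zA-Z0-9.]*[a-zA-Z0-9]+\.[a-zA-Z]+'

emails = re.findall(email_pattern, text)
phones = re.findall(phone_pattern, text)
urls = re.findall(url_pattern, text)

print(emails, phones, urls)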




['info@vrinnovations.c', 'martinez@techadvisors.n', 'support@innovateads.c', 'p@mailmasters.c', 'sales@mailmasters.c', 'director@socialbuzz.o', 'support@deepad.a', 'info@legaleagle.o', 'partners@influenceme.i', 'contact@brandvision.c', 'hello@mobilemarketersinc.c', 'jenkins@cxinnovators.c', 'info@digitalfuture2025.c'] ['+1-800', '+44 20', '+1-310', '+1-212', '+1-718', '+49 30', '+33 1', '+61 3', '+1-202', '+1-415', '+1-646', '+1-888'] ['https://www.digitalfuture2025.com', 'http://www.techadvisors.net', 'https://www.innovateads.co', 'https://www.innovateads.co', 'http://www.mailmasters.com', 'https://www.socialbuzz.org', 'http://www.deepad.ai', 'http://www.deepad.ai', 'https://www.legaleagle.org', 'http://www.influenceme.io', 'https://www.brandvision.com', 'http://www.mobilemarketersinc.com', 'https://www.cxinnovators.com', 'https://www.cxinnovators.com', 'https://www.digitalfuture2025.com']


*Part Two:*
---
### 2_1
Rule-based tokenizers split the input text into tokens using predefined linguistic rules and grammars, mostly based on whitespace, punctuation, and capitalization. They perform well on well-structured text with predictable patterns, and because no training is required they are fast and efficient. However, their dependence on hand-written rules makes them inflexible.

Machine learning-based tokenizers learn patterns from training data, mostly drawing on morphology and context. They are therefore typically accurate on complex datasets, cope with informal real-world text, and adapt readily to new domains. However, they require computing resources, data preparation, and a training process, and they are usually slower at tokenization time.

Rule-based tokenizers are best for applications where the text follows strict patterns (like formal writing).

ML-based tokenizers are ideal for NLP tasks where context is important.
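
The rule-based side of this comparison can be sketched in a few lines. The `rule_based_tokenize` helper and its two rules are assumptions chosen for illustration, not a tokenizer from any particular library; the example shows both how cheap such a tokenizer is and how its fixed rules mishandle an abbreviation and a contraction.

```python
import re

def rule_based_tokenize(text):
    """Apply two hand-written rules: a run of word characters, or one punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", text)

print(rule_based_tokenize("Dr. Smith isn't here, sadly."))
# → ['Dr', '.', 'Smith', 'isn', "'", 't', 'here', ',', 'sadly', '.']
```

Handling "Dr." or "isn't" correctly would need extra rules for each case, which is the inflexibility described above.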


### 2_2
A whitespace tokenizer is unsuitable for languages that do not separate words with spaces (e.g., Chinese and Japanese). It also cannot split words into subwords, so it has no way to represent an unseen word as a combination of known pieces.
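
Both limitations can be shown with a short sketch. The Chinese sentence, the tiny `vocabulary` set, and the English phrase are made-up examples used only for illustration.

```python
# Hypothetical vocabulary of known whole words (illustrative only).
vocabulary = {"the", "model", "token"}

chinese = "我喜欢自然语言处理"   # "I like natural language processing"
print(chinese.split())           # no spaces, so the sentence comes back as one token

tokens = "the model tokenizes".split()
unknown = [t for t in tokens if t not in vocabulary]
print(unknown)                   # "tokenizes" stays one unit and is not in the vocabulary
```

A subword tokenizer could instead relate "tokenizes" to the known piece "token", which whitespace splitting cannot do.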

In [None]:
# 2_2_ Whitespace Tokenization Question

whitespace_tokens = text.split() # built-in str.split(): splits on runs of whitespace


def whitespace_tokenizer(text): # custom function
    tokens = []
    token = ""
    
    for char in text:
        if not char.isspace():
            token += char
        else:
            if token:
                tokens.append(token)
                token = ""
    
    if token:
        tokens.append(token)
    
    return tokens

tokens = whitespace_tokenizer(text)

print(tokens == whitespace_tokens) # compare built-in and custom functions



True


*Part Three:*
---
### 3_1
Levenshtein distance is the minimum number of single-character operations required to transform one string into another. The allowed operations are insertion, deletion, and substitution.

Damerau-Levenshtein distance extends Levenshtein distance with one extra operation, transposition, which swaps two adjacent characters. Because swapped neighboring letters are a common typing mistake, this distance is well suited to correcting human typing errors.
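
Both definitions can be written as one dynamic-programming recurrence. The notation here is an assumption introduced for illustration: $a$ and $b$ are the two strings, $D(i,j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$, and $c_{\text{sub}}$ is the substitution cost (1 in the unit-cost definition above; the code cell 3_2 sets it to 2).

$$
D(i,0) = i, \qquad D(0,j) = j
$$

$$
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{deletion} \\
D(i,j-1) + 1 & \text{insertion} \\
D(i-1,j-1) + c(i,j) & \text{match or substitution} \\
D(i-2,j-2) + 1 & \text{transposition, if } i,j > 1,\ a_i = b_{j-1},\ a_{i-1} = b_j
\end{cases}
\qquad
c(i,j) =
\begin{cases}
0 & a_i = b_j \\
c_{\text{sub}} & a_i \neq b_j
\end{cases}
$$

The transposition case belongs only to Damerau-Levenshtein distance; dropping it gives plain Levenshtein distance. The final answer is $D(|a|,|b|)$.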

In [None]:
# 3_2:
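def levenshtein_distance(str1, str2):
    """Edit distance with insertion/deletion cost 1 and substitution cost 2."""
    distance_matrix = [[0 for _ in range(len(str2)+1)] for _ in range(len(str1)+1)]

    # Base cases: converting a prefix to or from the empty string costs one edit per character
    for i in range(len(str1)+1):
        distance_matrix[i][0] = i
    for j in range(len(str2)+1):
        distance_matrix[0][j] = j

    for i in range(1, len(str1)+1):
        for j in range(1, len(str2)+1):
            if str1[i-1] == str2[j-1]:
                cost = 0
            else:
                # Substitution is weighted 2 (one deletion plus one insertion);
                # unit-cost Levenshtein distance would use 1 here.
                cost = 2

            distance_matrix[i][j] = min(
                distance_matrix[i-1][j] + 1,    # Deletion
                distance_matrix[i][j-1] + 1,    # Insertion
                distance_matrix[i-1][j-1] + cost  # Match (cost 0) or substitution
            )
    return distance_matrix[len(str1)][len(str2)]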


def damerau_levenshtein_distance(str1, str2):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Costs: insertion/deletion 1, substitution 2, adjacent transposition 1.
    """
    distance_matrix = [[0 for _ in range(len(str2) + 1)] for _ in range(len(str1) + 1)]
    
    for i in range(len(str1) + 1):
        distance_matrix[i][0] = i
    for j in range(len(str2) + 1):
        distance_matrix[0][j] = j

    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if str1[i - 1] == str2[j - 1]:
                cost = 0
            else:
                cost = 2

            distance_matrix[i][j] = min(
                distance_matrix[i - 1][j] + 1,      # Deletion
                distance_matrix[i][j - 1] + 1,      # Insertion
                distance_matrix[i - 1][j - 1] + cost  # Match (cost 0) or substitution
            )

            # Check the bounds first so that i - 2 and j - 2 never wrap around to the end of the string
            if i > 1 and j > 1 and str1[i - 1] == str2[j - 2] and str1[i - 2] == str2[j - 1]:
                distance_matrix[i][j] = min(
                    distance_matrix[i][j],
                    distance_matrix[i - 2][j - 2] + 1  # Transposition
                )

    return distance_matrix[len(str1)][len(str2)]


texts = [('kitten', 'sitting'), ('saturday', 'sunday'), ('book', 'back'), ('algorithm', 'logarithm'),
         ('', 'test'), ('abc', 'acb')]

for str1, str2 in texts:
    levenshtein = levenshtein_distance(str1, str2)
    damerau_levenshtein = damerau_levenshtein_distance(str1, str2)

    print(f'for {str1} and {str2}:')
    print(f'\tlevenshtein_distance is {levenshtein}')
    print(f'\tdamerau_levenshtein_distance is {damerau_levenshtein}')



for kitten and sitting:
	levenshtein_distance is 5
	damerau_levenshtein_distance is 5
for saturday and sunday:
	levenshtein_distance is 4
	damerau_levenshtein_distance is 4
for book and back:
	levenshtein_distance is 4
	damerau_levenshtein_distance is 4
for algorithm and logarithm:
	levenshtein_distance is 4
	damerau_levenshtein_distance is 3
for  and test:
	levenshtein_distance is 4
	damerau_levenshtein_distance is 4
for abc and acb:
	levenshtein_distance is 2
	damerau_levenshtein_distance is 1
