In [None]:
!pip install levenshtein
!pip install snowballstemmer
!pip install zeyrek

## Regex

When we need to modify text before further processing, we can use Python's built-in string methods. These methods allow us to perform simple operations on our input strings.

In [None]:
dum_text = "look_here"
dum_text.split(r"k")

['loo', '_here']

In [None]:
dum_text.lower()

'look_here'

In [None]:
dum_text.find("_")

4

Python's built-in string methods cover basic modifications, but more flexible pattern matching calls for regular expressions. For this purpose, we will use the `re` library to apply regular expression patterns to an input text. The `re` library offers several functions for this, such as `match()`, `search()`, `findall()`, `sub()`, and `split()`. Let's check a few of them.

In [None]:
import re
print(re.match("my","findmyphone"))
re.match("find","findmyphone")

None


<re.Match object; span=(0, 4), match='find'>

In [None]:
re.search(r'n.*r', 'number 6r').group()

'number 6r'

In [None]:
re.findall(r'[0-9]','2 times 3 is equalto 6')

['2', '3', '6']

In [None]:
re.sub("small","big",'This tree is small.')

'This tree is big.'
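The calls above leave two points implicit: `re.match()` only matches at the start of the string while `re.search()` scans the whole string, and `re.split()` splits on a pattern rather than on a fixed substring. The following is a minimal sketch of both, plus a reusable compiled pattern; the sample strings are made up for illustration.

```python
import re

# match() is anchored at the start of the string; search() scans all of it.
print(re.match("my", "findmyphone"))           # None
print(re.search("my", "findmyphone").span())   # (4, 6)

# split() on a pattern: one or more underscores, commas, or whitespace characters.
parts = re.split(r"[_,\s]+", "look_here, over  there")
print(parts)                                   # ['look', 'here', 'over', 'there']

# A compiled pattern can be reused; re.IGNORECASE makes it case-insensitive.
pattern = re.compile(r"tree", re.IGNORECASE)
trees = pattern.findall("Tree by tree, the TREES were counted.")
print(trees)                                   # ['Tree', 'tree', 'TREE']
```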

> The company's database is critical to their business operations. They have a dedicated team that ensures the accuracy and security of the database. However, there was an issue that caused the loss of some important data. The team is currently working to recover the lost ***data*** and improve the overall database system.

Considering the above text, create a regex that matches only the bold word.

Tip: use `findall` function

In [None]:
text = '''The company's database is critical to their business operations.
          They have a dedicated team that ensures the accuracy and security
          of the database. However, there was an issue that caused the loss
          of some important data. The team is currently working to recover
          the lost data and improve the overall database system.'''

myregex = r"data"
print(re.findall(myregex,text))
re.search(myregex,text)

['data', 'data', 'data', 'data', 'data']


<re.Match object; span=(14, 18), match='data'>

In [None]:
# myregex = ## FILL IN HERE ##
print(re.findall(myregex,text))
re.search(myregex,text)

[' data ']


<re.Match object; span=(270, 276), match=' data '>

>"The function takes two arguments (x and y) and returns their sum. The output is then printed to the console using the print() function. The parentheses ensure that the arguments are passed to the function correctly and that the output is displayed as intended."

Considering the above text, create a regex that extracts the text between parentheses.

In [None]:
text = '''The function takes two arguments (x and y) and returns their sum.
        The output is then printed to the console using the print('asd') 
        function. The parentheses ensure that the arguments are passed to 
        the function correctly and that the output is displayed as intended.'''

# myregex = ## FILL IN HERE ##
re.findall(myregex,text)

['x and y', "'asd'"]

### Discussion Question

* What other use cases are there for regex?

* Come up with a simple rule of your own (pseudo-code is fine) for sentiment analysis.<br> (Figuring out how positive or negative the emotion expressed in the sentence is.)<br> What would be the advantages and risks of using such rules?

> Answers can be written here by double clicking and editing the text.

## Part-of-speech<a id="pos"></a>

Before we move along, we should take a look at part-of-speech (commonly referred to as "POS") tagging. Sometimes we need to classify words according to their function (part of speech) in the sentence, so that we can extract certain information. For example, if we need to analyze the verbs in a long text, we can use the words' POS tags to filter out everything that is not a verb, which significantly simplifies the process.

POS tags are especially useful when a word can have different functions in a sentence with the exact same form, so we cannot just look at the word itself and draw conclusions. For example, "type" can be a noun (a category) or a verb (to type). For these reasons, we have POS taggers. A POS tagger classifies each unit's syntactic function in the sentence. There are different types of POS taggers. The one we will use is NLTK's pre-trained machine learning classifier, a perceptron model trained on a [treebank](https://en.wikipedia.org/wiki/Treebank) (a corpus with annotated POS tags).

In [None]:
from nltk import pos_tag

sentence = "Which type of typewriter would you like to type with?"

# Uppercase letters can confuse POS tagging, so we need to lowercase everything. This
# is automatically handled by our tokenizer anyway. Note that a truecasing approach or
# a more simplified approach such as only touching the first letter of a sentence could
# potentially yield better results in POS tagging. You can check
# https://en.wikipedia.org/wiki/Truecasing and
# https://towardsdatascience.com/truecasing-in-natural-language-processing-12c4df086c21
# to read more about truecasing.
sentence_tokens = tokenizer.tokenize(sentence)

# POS tagging:
sentence_tokens_pos = pos_tag(sentence_tokens)

print(sentence_tokens_pos)

[('which', 'WDT'), ('type', 'NN'), ('of', 'IN'), ('typewriter', 'NN'), ('would', 'MD'), ('you', 'PRP'), ('like', 'VB'), ('to', 'TO'), ('type', 'VB'), ('with', 'IN'), ('?', '.')]


As you can see, each term is now classified. However, the tags are not very clear for us. We can check the documentation or use a dictionary to read the explanation and see some examples:

In [None]:
from nltk.data import load
tag_dict = load('help/tagsets/upenn_tagset.pickle')

for token in sentence_tokens_pos:
    print("Token:",token[0],"\nPOS tag:",token[1],"\nExplanation:",tag_dict[token[1]][0],"\nExample:",tag_dict[token[1]][1],"\n")

Token: which 
POS tag: WDT 
Explanation: WH-determiner 
Example: that what whatever which whichever  

Token: type 
POS tag: NN 
Explanation: noun, common, singular or mass 
Example: common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ...  

Token: of 
POS tag: IN 
Explanation: preposition or conjunction, subordinating 
Example: astride among uppon whether out inside pro despite on by throughout below within for towards near behind atop around if like until below next into if beside ...  

Token: typewriter 
POS tag: NN 
Explanation: noun, common, singular or mass 
Example: common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ...  

Token: would 
POS tag: MD 
Explanation: modal auxiliary 
Example: can cannot could couldn't dare may might must need ought shall should shouldn't will would  

...

See that the first "type" is classified as a noun ("NN") while the last one is classified as a verb ("VB"). That is not bad for general purposes. This will come in handy later.
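The verb-filtering idea mentioned at the start of this section can be sketched as follows. To keep the snippet runnable without NLTK, the tagged tokens are copied from the output above; `filter_by_tag_prefix` is a hypothetical helper name, not an NLTK function.

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [('which', 'WDT'), ('type', 'NN'), ('of', 'IN'), ('typewriter', 'NN'),
          ('would', 'MD'), ('you', 'PRP'), ('like', 'VB'), ('to', 'TO'),
          ('type', 'VB'), ('with', 'IN'), ('?', '.')]

def filter_by_tag_prefix(tagged_tokens, prefix):
    """Keep the tokens whose Penn Treebank tag starts with `prefix`."""
    return [token for token, tag in tagged_tokens if tag.startswith(prefix)]

# All verb tags (VB, VBD, VBG, ...) share the "VB" prefix; noun tags share "NN".
verbs = filter_by_tag_prefix(tagged, "VB")
nouns = filter_by_tag_prefix(tagged, "NN")

print(verbs)  # ['like', 'type']
print(nouns)  # ['type', 'typewriter']
```

Matching on a prefix instead of an exact tag is a design choice: it also catches inflected forms such as past-tense verbs ("VBD") or plural nouns ("NNS").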

### Discussion Question

* Can a universal POS tagging algorithm be created? (Consider English versus Turkish.)

## Levenshtein distance<a id="levenshtein"></a>

As mentioned above, a simple autocorrection process can be applied using the Levenshtein distance. Let us see what that means.

Consider the words `cup` and `cap`. Their Levenshtein distance is 1, because we can obtain one from the other by substituting a single character. For two given strings, we can calculate the number of operations required (the edit distance) to obtain one from the other. These operations are:

* Insertion: Inserting a character to a specific location in the string.
    * "up" becomes "**c**up"
* Deletion: Deleting a character from a specific location in the string.
    * "cu**s**p" becomes "cup"
* Substitution: Substituting a character in a specific location in the string with another character.
    * "c**u**p" becomes "c**a**p"
* Transposition (introduced later by an extension, the Damerau-Levenshtein distance): Switching the positions of two adjacent characters.
    * "c**pu**" becomes "c**up**"
    
Using these, for a given word, we can find the word in a dictionary that requires the fewest changes. This is a costly process, so the search is usually limited to a certain number of changes. See [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) for more information.
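To make the operations above concrete, here is a minimal pure-Python sketch of the classic dynamic-programming approach to the plain Levenshtein distance (insertion, deletion, and substitution only, so a transposition costs 2). The names `levenshtein_distance` and `closest_word`, the toy vocabulary, and the `max_distance` default are assumptions for illustration; in practice, an optimized library is much faster.

```python
def levenshtein_distance(source: str, target: str) -> int:
    """Count the insertions, deletions, and substitutions turning source into target."""
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def closest_word(word, vocabulary, max_distance=2):
    """Return the vocabulary entry with the fewest edits, or None if it is too far."""
    best = min(vocabulary, key=lambda candidate: levenshtein_distance(word, candidate))
    return best if levenshtein_distance(word, best) <= max_distance else None

print(levenshtein_distance("cup", "cap"))        # 1 (substitution)
print(levenshtein_distance("up", "cup"))         # 1 (insertion)
print(levenshtein_distance("cpu", "cup"))        # 2 (no transposition operation here)
print(closest_word("cusp", ["cap", "cup", "dog"]))  # cup
```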

We can also use this distance to calculate the similarity between two strings. This is handy for fuzzy string matching, when the same thing can be represented in similar yet different forms. This is quite common in neighborhood or street names in Turkey. By setting a similarity threshold and looking at their similarity, we can match addresses like `Kemalpaşa Mah.` and `Kemal Paşa Mahallesi`.

In [None]:
import Levenshtein

string_a = "Kemalpaşa Mah."
string_b = "Kemal Paşa Mahallesi"

Levenshtein.ratio(string_a, string_b)
# Note that you would probably want to remove "mahallesi" or "mah." when your task is
# address matching. It would significantly increase your success.

0.7058823529411764
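Following the note in the cell above, stripping the neighborhood marker before comparing can be sketched as follows. This is only a sketch under assumptions: `normalize_address`, `is_same_address`, the suffix pattern, and the 0.9 threshold are all hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for `Levenshtein.ratio` so that the snippet runs without extra packages. The two scores are similar in spirit but not identical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical pattern for a trailing neighborhood marker; extend it for real data.
NEIGHBORHOOD_SUFFIX = re.compile(r"\s+(?:mahallesi|mah\.?)$")

def normalize_address(address: str) -> str:
    """Lowercase, trim, and drop a trailing 'Mah.' or 'Mahallesi'."""
    # Note: str.lower() is not Turkish-aware, which matters for names containing "I".
    return NEIGHBORHOOD_SUFFIX.sub("", address.strip().lower())

def is_same_address(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two addresses as the same if their normalized forms are similar enough."""
    similarity = SequenceMatcher(None, normalize_address(a), normalize_address(b)).ratio()
    return similarity >= threshold

print(normalize_address("Kemalpaşa Mah."))        # kemalpaşa
print(normalize_address("Kemal Paşa Mahallesi"))  # kemal paşa
print(is_same_address("Kemalpaşa Mah.", "Kemal Paşa Mahallesi"))  # True
```

The threshold trades missed matches against false matches, so it is worth tuning on a handful of labeled address pairs from your own data.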

A use case for our dataset could be finding tweets that are similar to each other. Let us search for a tweet pair that has the highest similarity without being exactly the same:

In [None]:
pair = None
highest = 0

tweets_to_compare = 100
# To compare all the tweets, uncomment this line. Note that it would take much longer.
# tweets_to_compare = dataset.shape[0]

# This compares each tweet only with the ones that come after it, so every pair is
# checked exactly once. It still takes some time.
for i in range(tweets_to_compare - 1):
    for j in range(i + 1, tweets_to_compare):
        # If the tweets are not the same:
        if dataset.loc[i,"text"] != dataset.loc[j,"text"]:
            similarity = Levenshtein.ratio(dataset.loc[i,"text"], dataset.loc[j,"text"])
            # If their similarity is higher than the previous similarities:
            if similarity > highest:
                highest = similarity
                pair = (dataset.loc[i,"text"], dataset.loc[j,"text"])

print(f"The most similar tweet pair: {pair}")
print(f"Similarity: {highest}")

The most similar tweet pair: ('395 new cases and 3 new deaths in Uzbekistan [13:22 GMT] #coronavirus #CoronaVirusUpdate #COVID19 #CoronavirusPandemic', '1,005 new cases and 18 new deaths in the United States [13:16 GMT] #coronavirus #CoronaVirusUpdate #COVID19 #CoronavirusPandemic')
Similarity: 0.8699186991869918


## Bonus: NLP in Turkish<a id="tr"></a>

From a linguistic perspective, Turkish is a fascinating language with its rather strict grammatical rules. However, due to its agglutinative nature, words can easily become complex with many affixes and inflections. This can make morphological analyses harder compared to English.

### Stemming<a id="tr-stem"></a>

[snowballstemmer](https://pypi.org/project/snowballstemmer/) has a Turkish stemmer:

In [None]:
from snowballstemmer import TurkishStemmer

stemmer_tr = TurkishStemmer()

sentence = "Gözleme gözleyen gözlüklü gözcü gözden düştü."

[stemmer_tr.stemWord(token) for token in tokenizer.tokenize(sentence)]

['gözle', 'gözleye', 'gözlüklü', 'gözcü', 'göz', 'düş', '.']

### POS tagging and lemmatization<a id="tr-pos-lemma"></a>

It looks like the morphological analyzer of [Zemberek](https://github.com/ahmetaa/zemberek-nlp), the famous Turkish NLP tool for Java, has been unofficially ported to Python as [zeyrek](https://github.com/obulat/zeyrek/). It does not have all of its original features (like disambiguation and more), but we can still use it to morphologically analyze a word, sentence, or sentences:

In [None]:
import functools
import logging.config
import operator

import pandas as pd
from zeyrek import MorphAnalyzer, rulebasedanalyzer

# Disables redundant error messages from zeyrek
logging.config.dictConfig({
    'version': 1,
    'disable_existing_loggers': True,
})

analyzer = MorphAnalyzer()

def format_groups(groups):
    # This function flattens the list of lists,
    # removes redundant columns and, outputs a dataframe
    out = functools.reduce(operator.iconcat, groups, [])
    lines = []
    for item in out:
        item = item._asdict()
        item.pop('morphemes')
        lines.append(item)

    return pd.DataFrame(lines)

format_groups(analyzer.analyze(sentence))

# This also explicitly returns tokenized sentences if you prefer:
# analyzer._analyze_text(sentence)

Unnamed: 0,word,lemma,pos,formatted
0,Gözleme,gözlemek,Verb,[gözlemek:Verb] gözle:Verb+me:Neg+Imp+A2sg
1,Gözleme,gözlem,Noun,[gözlem:Noun] gözlem:Noun+A3sg+e:Dat
2,Gözleme,Gözlem,Noun,"[Gözlem:Noun,Prop] gözlem:Noun+A3sg+e:Dat"
3,Gözleme,gözleme,Noun,[gözleme:Noun] gözleme:Noun+A3sg
4,Gözleme,gözlemek,Noun,[gözlemek:Verb] gözle:Verb|me:Inf2→Noun+A3sg
5,gözleyen,gözlemek,Adj,[gözlemek:Verb] gözle:Verb|yen:PresPart→Adj
6,gözlüklü,gözlük,Adj,[gözlük:Noun] gözlük:Noun+A3sg|lü:With→Adj
7,gözlüklü,göz,Adj,[göz:Noun] göz:Noun+A3sg|lük:Ness→Noun+A3sg|lü:With→Adj
8,gözcü,gözcü,Noun,[gözcü:Noun] gözcü:Noun+A3sg
9,gözcü,göz,Noun,[göz:Noun] göz:Noun+A3sg|cü:Agt→Noun+A3sg


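Since some words can have different morphological analyses, every alternative is retrieved. From this, we can obtain lemmas and POS tags, as you can see. However, simply using the first analysis of a word may not yield the correct result: for example, the first analysis of `Gözleme` above is a negative imperative verb, although the word is used as a noun in this sentence.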

We can also simply use `analyzer.lemmatize()` to lemmatize the words. Again, it returns all possible lemmas without disambiguation:

In [None]:
analyzer.lemmatize(sentence)

[('Gözleme', ['gözlemek', 'gözlem', 'Gözlem', 'gözleme']),
 ('gözleyen', ['gözlemek']),
 ('gözlüklü', ['gözlük', 'göz']),
 ('gözcü', ['gözcü', 'göz']),
 ('gözden', ['Gözde', 'göz', 'gözde']),
 ('düştü', ['düş', 'düşmek']),
 ('.', ['.'])]

### Dealing with Turkish characters<a id="tr-encoding"></a>

A past Turkish government's short-sightedness in the 1980s is haunting programmers who work with Turkish texts to this day. To keep Turkish characters from being garbled, we need to read and write files using the "UTF-8" encoding. However, this may not be enough if your data was not saved as "UTF-8" in the first place. Sometimes, you may notice that certain characters in the dataset itself are not properly represented. For example, instead of the word `kılıç`, you may see `kÄ±lÄ±Ã§`. This suggests that UTF-8 bytes were at some point decoded as "Latin-1" (also known as "ISO 8859-1"), a legacy encoding that is still widespread in the Western world and the default in many protocols. To fix it, you can encode the string back to bytes as "latin-1" (or "ISO 8859-1") and then decode those bytes as "UTF-8", as shown below:

In [None]:
badly_encoded = "kÄ±lÄ±Ã§"
encoding_fixed = badly_encoded.encode("latin-1").decode("utf-8")

encoding_fixed

'kılıç'

Hopefully, this will solve your problem. If not, you can also find the deformed characters and write a function that replaces all of those characters with the correct ones.
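To see why the round trip works, and to avoid damaging strings that were fine to begin with, the repair can be wrapped in a guard. This is a minimal sketch; `fix_mojibake` is a hypothetical helper name, and it only covers the Latin-1 case described above.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Latin-1, if possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 mojibake (or already correct): leave the text untouched.
        return text

# How the damage happens: UTF-8 bytes are read back with the wrong codec.
garbled = "kılıç".encode("utf-8").decode("latin-1")

print(garbled)                 # kÄ±lÄ±Ã§
print(fix_mojibake(garbled))   # kılıç
print(fix_mojibake("kılıç"))   # kılıç (already correct, returned unchanged)
```

The guard matters when only part of a dataset is affected: correctly encoded Turkish text cannot be encoded as Latin-1, so it raises an error and is returned as-is instead of crashing the whole run.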

Keep in mind that Turkish characters can also cause some packages or languages to raise an error. You may also see that the fonts used by some packages may not support these characters and plot `□` instead. Therefore, you may want to anglicize Turkish characters to prevent these. Here is a function that does that for you.

In [None]:
# Replaces Turkish characters with their closest ASCII equivalents:
def anglicize_turkish(text):
    table = str.maketrans("ĞğÜüŞşİıÖöÇç", "GgUuSsIiOoCc")
    return text.translate(table)

anglicize_turkish("kılıç")

'kilic'

Another problem is that the built-in functions that lowercase or uppercase your text may not work properly with Turkish text, even if your system locale is Turkish. For example, the built-in lowercase of `KILIÇ` is `kiliç` instead of `kılıç`, since "I" lowercases to "i" in most languages that use the Latin alphabet. Instead of dealing with locales (which may not solve your problem anyway), a simple solution is to manually convert the problematic letters first and then use the built-in function:

In [None]:
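def turkish_lowercase(text):
    # "İ" is mapped explicitly as well: the built-in lower() would turn it into
    # "i" followed by a combining dot (U+0307).
    return text.translate(str.maketrans({"I": "ı", "İ": "i"})).lower()

def turkish_uppercase(text):
    return text.translate(str.maketrans({"i": "İ"})).upper()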

print("Built-in lowercase for KILIÇ:", "KILIÇ".lower())
print("Custom lowercase for KILIÇ:", turkish_lowercase("KILIÇ"))
print("Built-in uppercase for isim:", "isim".upper())
print("Custom uppercase for isim:", turkish_uppercase("isim"))

Built-in lowercase for KILIÇ: kiliç
Custom lowercase for KILIÇ: kılıç
Built-in uppercase for isim: ISIM
Custom uppercase for isim: İSİM


### Discussion Question

* [HuggingFace](https://huggingface.co) is a platform where organizations and people share open-weight deep learning models, datasets and more.<br><br> Do a quick search on HuggingFace for the current representation of Turkish in NLP. (Note that this won't be representative of actual research.) <br> How recent is the most downloaded model? What datasets were used for training or fine-tuning in general?


* What is the benefit of having NLP models specific to Turkish?

> Answers can be written here by double clicking and editing the text.