# Application 2: Normalization of Tweets
Normalizing data, i.e. bringing the text into a format that the model can handle well, is a crucial part of almost any NLP task.
We would like to remove text that the model has not been trained on or that is simply irrelevant to our task. For this, we will need to do a lot of string manipulation. 

Regular expressions (sequences of characters that specify a search pattern in text) are key for this, so we will review them first. Without prior experience with regular expressions, the Tasks in Exercise 0 and Exercise 1 may take quite some time, which would take time away from our machine learning applications.  
Therefore, we suggest that you quickly read through the Exercises and Tasks to get an idea of the concepts, but **skip the Tasks in Exercise 0 and Exercise 1** and focus on the Tasks for Exercise 2 for now. You are very welcome to look at them by yourself later on.

## Exercise 0: Regular expressions
[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) are a very powerful tool to find and replace patterns in text-based data like our Tweets. While very complex expressions can be formed, we will try to provide a beginner-friendly introduction here. Some basic operators are introduced below. You can of course use Google to solve the tasks, but also try to understand why the suggested operation works. Note that expressions are usually more or less independent of the programming language used, but if you want to google a certain operation, adding "python" to the search query will usually give you exactly what you need, with no minor "translations" necessary.

Python uses the package `re` (import via `import re`), which is part of the standard library. Usually, we would like to replace a string with a different one or delete certain strings, which is a replacement with an empty string. For this we use `re.sub(pattern, repl, string, count=0, flags=0)`:
* pattern: Is the regular expression determining what should be replaced.
* repl: Is the string you want to replace the matches of `pattern` with (use `repl=''` to delete). It may reference capture groups, which are introduced in the examples further down.
* string: Is the text you want to apply your replacement to, so usually a single Tweet in our case.
* count: Optional maximum number of replacements. The default `0` replaces all matches.
* flags: Optional flags that can be used to alter the behavior of the function. For example, `re.IGNORECASE` can be useful if you want to do a case-insensitive replacement. A list of the flags is given [here](https://docs.python.org/3/library/re.html#contents-of-module-re).
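
As a small illustration of these arguments, the following sketch applies `re.sub` to a made-up sentence (the text is an assumption and not taken from the Tweet dataset) and shows the effect of an empty `repl`, of `count` and of `re.IGNORECASE`:

```python
import re

# Hypothetical example text, not taken from the dataset
tweet = "Rain rain RAIN all day"

# Delete every lower case "rain " by replacing it with an empty string
deleted = re.sub(r"rain ", "", tweet)

# count=1 limits the substitution to the first match
first_only = re.sub(r"rain", "sun", tweet, count=1, flags=re.IGNORECASE)

# re.IGNORECASE matches "Rain", "rain" and "RAIN" alike
all_cases = re.sub(r"rain", "sun", tweet, flags=re.IGNORECASE)

print(deleted)     # Rain RAIN all day
print(first_only)  # sun rain RAIN all day
print(all_cases)   # sun sun sun all day
```

Passing `count` and `flags` by keyword avoids mixing up the two optional arguments.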


Special characters are crucial to match broader search terms. Find a list at the beginning of [their documentation](https://docs.python.org/3/library/re.html#contents-of-module-re). Don't bother to memorize them initially; usually it makes sense to look up the correct one depending on your use case. Let's now take a look at some small examples taken from the project:
* `re.sub(r"We're", "We are", text)`: Here, we are replacing the colloquial form "We're" with the (in written form) more common "We are" to ease training. This is a regular expression that directly matches any occurrence of exactly these characters.
* `re.sub(r"_+", " ", text)`: The plus symbol "+" is a special character to match one or more occurrences of the preceding expression. Here, it will match any run of one or more underscore symbols "_" and replace it with a single space. Sometimes people use underscores to highlight parts of their text. However, the training set for our model consists of huge text databases like Wikipedia, where using underscores in this form is quite unlikely.
* `re.sub(r"\s+", " ", text)`: The special character "\s" denotes any whitespace character (single spaces, tabs, new lines, ...). Tweets can have multiple whitespace characters in a row, while we just want a single space between words. Therefore, this expression replaces one or more consecutive whitespace characters with a single space. The pattern is written as a raw string (`r"..."`) so that Python does not try to interpret the backslash itself.
* `re.sub(r"@[a-z0-9]", "", text)`: Using square brackets, e.g. `[abc]`, you can define a so-called "character class". This means "any one character out of a, b or c" is matched (a character class may use ranges, e.g. [a-d] = [abcd]). Our example therefore matches any occurrence of the at sign followed by a single digit or lower case letter and removes both the at sign and that one character.
* `re.sub("@([a-z0-9])", "\\1", text)`: This example is very similar to the previous one. However, we introduce parentheses, which are used for "capture groups". These allow you to reference your matched strings and use them later on. In this example, we again match an at sign and a single following lower case letter or digit. However, we assign that following character to a capture group and reference our first capture group via `\\1` (when using a raw string, only use a single backslash: `r'\1'`) to effectively remove the at sign but leave the digit/letter untouched.
* `re.sub(r"@?… https:\S*$", "", text)`: Let's end on a more complex expression. Some Tweets are shortened, where removed characters are replaced with the `…` character (a single character, *not* three periods in a row), and contain links at the end.

    (It's probably related to users sending Instagram messages when their Instagram and Twitter accounts are linked. If the message is too long, the Tweet will be cut short and the removed words are replaced by a single `…`. The same message will be posted on Twitter with the link to the Instagram message appended to the end of the Tweet.)

    This suffix is sometimes preceded by an at sign "@". To match zero or one occurrence of a character, we follow it with the special character `?` (consequently, you need to *escape* the character if you want to match a literal question mark, i.e. `\?`). The characters `… https:` always follow. Finally, we would like to match any number of non-whitespace characters until the end of the text. We use the special character `\S` to match a non-whitespace character. The special character `*` is appended to match zero or more repetitions of the preceding expression. The special character `$` then matches the end of the text.
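
To see the patterns discussed above in action, the following sketch runs each of them on a short made-up string (all inputs are assumptions for illustration, not Tweets from the dataset):

```python
import re

# All input strings are made-up examples, not Tweets from the dataset.

# "+" matches one or more underscores in a row
underscores = re.sub(r"_+", " ", "so___cool")
# "\s" matches spaces, tabs and new lines
whitespace = re.sub(r"\s+", " ", "too \t many\n\nspaces")
# Character class: removes "@" plus exactly one following character
char_class = re.sub(r"@[a-z0-9]", "", "hi @bob")
# Capture group: removes only the "@" and keeps the captured character
capture = re.sub(r"@([a-z0-9])", r"\1", "hi @bob")
# Optional "@", the "…" character, a link, then the end of the text
suffix = re.sub(r"@?… https:\S*$", "", "long text @… https://t.co/xyz")

# repr() makes leftover whitespace visible
for result in (underscores, whitespace, char_class, capture, suffix):
    print(repr(result))
```

Comparing `char_class` (`'hi ob'`) with `capture` (`'hi bob'`) shows the difference between deleting the whole match and keeping the captured character. The last result, `'long text '`, still ends with a space, which is one reason to combine several substitutions in practice.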

## Tasks (skip!):
Now it is time for you to get started with this very useful toolbox, which is used by most programmers independent of their field of work. Some of the tasks may be a bit overwhelming if you have no prior experience with regular expressions. You are always encouraged to cooperate on tasks, and I would like to stress that you should feel free to reach out to your colleagues or any instructor for hints and tips if you get stuck!

Write a function using `re.sub` that ...
* ... takes a string as an input and returns a sentence where all occurrences of 'Hello!' are replaced with 'Bye!' 

    Example: 'Hello! Have a great day!' -> 'Bye! Have a great day!'
* ... removes a hashtag, written here as an at sign "@" followed by any letters of the English alphabet independent of capitalization, e.g. "@HASHTAGTEXT" or "@hashTagText"
    
    Example: 'This is my hashtag @mycoolHASHtag. Do you like it too?' -> 'This is my hashtag . Do you like it too?'
    
    Hint: Try to find a way to match any single letter of the alphabet first, and then any number of repetitions of such letters.  
    
* ... removes a URL that uses either the application layer protocol "http" or the newer variant "https". (Names can include numbers and characters)

    Example: "https://en.wikipedia.org/", "http://en.wikipedia.org/"


## Exercise 1: Tweet normalization 
We would now like to apply our insights into regular expressions to normalize our data. As a first step, we will write our own simple function based on the regular expressions you just came up with and the examples given above. Afterwards, we will use a more comprehensive pipeline with a larger variety of functions tackling a wider variety of small "issues" in our data. 


## Tasks (skip!):

* Compose a function that normalizes a single Tweet (`ds_tweets['text_original']`). Use the examples provided above (in Exercise 0) that appear useful for normalizing our Tweets, together with the regular expressions you wrote for the previous tasks.
* Write a function that normalizes an array of the first 100 Tweets
* Add an additional function to further normalize the data. Discuss your reasoning behind introducing your function with the group!
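
One possible starting point for these tasks, sketched under assumptions: the function names `normalize_tweet` and `normalize_tweets` are hypothetical, the sample list only stands in for the first entries of `ds_tweets['text_original']`, and the order of the substitutions is a choice made for this illustration. Only example patterns from Exercise 0 are used; your own expressions would be added in the same way.

```python
import re

def normalize_tweet(text):
    """Apply a few of the example substitutions from Exercise 0 to one Tweet."""
    text = re.sub(r"We're", "We are", text)      # expand the contraction
    text = re.sub(r"@?… https:\S*$", "", text)   # drop a truncated-link suffix
    text = re.sub(r"_+", " ", text)              # underscores -> single space
    text = re.sub(r"\s+", " ", text)             # collapse whitespace runs
    return text.strip()                          # assumption: also trim the ends

def normalize_tweets(tweets):
    """Normalize every Tweet in an iterable of strings."""
    return [normalize_tweet(tweet) for tweet in tweets]

# Hypothetical stand-in for the first Tweets of ds_tweets['text_original']
sample = ["We're __here__\nnow", "so  windy… https://t.co/abc"]
normalized = normalize_tweets(sample)
print(normalized)
```

The whitespace substitution runs last on purpose, so that spaces introduced by earlier replacements are collapsed as well.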

## Exercise 2: Tweet normalization pipeline
To actually convert our text into a format that is more appropriate for training, we will use functions provided in the file `scripts/normalize_text_bootcamp.py`, as finding undesired features in our Tweets is quite cumbersome (and may still be incomplete even in our best current version of the pipeline).

In the file, you will find the class `Normalizer` and its method `normalize`, which we will use to normalize the text. As it is a priori unclear whether some formatting options actually help training the model, there are optional keyword arguments in the `normalize` function. 

*Hint*: When using imported scripts that may change while you are working on your notebook, you may find it useful to add the following [*magic commands*](https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html) 

```python
# allows update of external libraries without need to reload package
%load_ext autoreload
%autoreload 2
```

at the beginning of your notebook to reload packages.

## Tasks:

* Write a function that normalizes an array of the first 100 Tweets (quicker to debug than using the whole dataset) with the default options given in `normalize`.
* Use `normalize_text_bootcamp.normalize_text_dataset` to normalize the whole dataset. Use default values.   
    *Hint*: Use `ds_tweets.sel(index=slice(0,99))` to only select the first 100 entries in the dataset for more efficient debugging and testing.
* The function `normalize_text_bootcamp.normalize_filter_dataset` also filters out 'unwanted' Tweets in addition to normalizing the Tweets via `normalize_text_bootcamp.normalize_text_dataset` (see previous Task). Use it (default values) to normalize and filter your Tweets.
* The function `normalize_slang_stopwords` in `scripts/normalize_text_bootcamp.py` hosts a collection of substitutions for colloquial phrases. With the knowledge you gained on regular expressions, add a new substitution to the function that helps us out in the future and share it with the group.  
    *Optionally, you could base this substitution on a phrase you find in the "cleaned" data*.
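
As a rough sketch of what such a substitution could look like: the table `SLANG` and the helper `replace_slang` below are hypothetical and do not reflect the internals of `normalize_slang_stopwords`, which are not shown here. The special sequence `\b` (not introduced above) matches a word boundary, so that only whole words are replaced.

```python
import re

# Hypothetical substitution table; the real entries live in
# normalize_slang_stopwords, whose internals are not shown here.
SLANG = {
    r"\bgonna\b": "going to",
    r"\bu\b": "you",
}

def replace_slang(text):
    """Replace whole-word slang terms, ignoring capitalization."""
    for pattern, repl in SLANG.items():
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    return text

example = replace_slang("U gonna see the flood, but umbrellas help")
print(example)  # you going to see the flood, but umbrellas help
```

Without the word boundaries, the pattern `u` would also change the "u" inside "but" and "umbrellas". Note that this sketch does not preserve the capitalization of the replaced word.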

```python
# allows update of external libraries without need to reload package
%load_ext autoreload
%autoreload 2
```