# Seminar *Python for Linguists*: Final Assignment 
<font color="grey">Instructor: Qi Yu (University of Konstanz)  |  ZHAW, March 03-04, 2022</font>

## General Information

This assignment consists of 4 tasks covering basic NLP operations that researchers in NLP and computational linguistics often need to perform. Please follow the guidelines below to complete the assignment, and do all your implementation in this notebook.

To pass the seminar and receive credits, you need at least **50** of the 100 points.

**Rules of Submission:**
1. When submitting your assignment, please rename the notebook using the following format: ```Lastname_Firstname.ipynb```

    E.g., a person named Jane Smith would name her submission ```Smith_Jane.ipynb```


2. Please submit the assignment **via E-mail to ```qi.yu@uni-konstanz.de```** by **March 31, 2022, 00:00**. 


## Guideline

In this assignment, you will work with the file ```peterpan_cleaned.txt```, which you have already created in the exercise ```exercise_regex_fileIO.ipynb```. If you do not have this file yet, please re-run the notebook ```solution_regex_fileIO.ipynb``` to generate it.

## Task 1: Text Preprocessing 

1. Please read in the text ```peterpan_cleaned.txt```. You should read in the text as one chunk (**not** line by line). <font color="blue">(2 points)</font>
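The difference between reading a file as one chunk and reading it line by line can be sketched on a toy file. This is only an illustration under assumptions: the file name `toy.txt`, its content, and the UTF-8 encoding are made up for the example and are not part of the assignment data.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# "toy.txt" is a hypothetical stand-in, not peterpan_cleaned.txt.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy.txt")
with open(toy_path, "w", encoding="utf-8") as f:
    f.write("All children, except one, grow up.\n")
    f.write("They soon know that they will grow up.\n")

# read() returns the whole file content as one single string.
with open(toy_path, encoding="utf-8") as f:
    text = f.read()

# readlines() instead returns a list with one string per line.
with open(toy_path, encoding="utf-8") as f:
    lines = f.readlines()

print(type(text))
print(type(lines), len(lines))
```

Using `with` closes the file automatically once the indented block ends.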

In [None]:
# ADD YOUR CODE HERE. Please feel free to split your code into multiple cells when necessary.



2. Please use NLTK to tokenize the text into words. <font color="blue">(3 points)</font>

In [None]:
# ADD YOUR CODE HERE. Please feel free to split your code into multiple cells when necessary.



## Task 2: Named Entities

In computational linguistics research, one is often interested in which **named entities** occur most frequently in a text. Roughly speaking, a *named entity* is anything that can be referred to by a proper name, e.g., ```Peter```, ```Switzerland```, ```United Airlines```.

We can approximate the search for named entities by first POS-tagging the text and then searching for tokens bearing the POS tags ```NNP``` or ```NNPS```. Please check the [*Penn Treebank Tagset*](https://www.cs.upc.edu/~nlp/SVMTool/PennTreebank.html) to find out what they stand for.

(**NB:** Some named entities consist of more than one token, such as the example ```United Airlines``` above. For simplicity, we will ignore such cases.)

Now, please follow the steps below to find out the most frequent named entities in ```peterpan_cleaned.txt```:

1. Please use NLTK to POS-tag the tokenized text resulting from Task 1. <font color="blue">(7 points)</font>

In [None]:
# ADD YOUR CODE HERE. Please feel free to split your code into multiple cells when necessary.



2. We can use a dictionary to record tokens that are POS-tagged as ```NNP``` or ```NNPS``` together with their respective frequencies (i.e., how many times they occur in ```peterpan_cleaned.txt```). To be exact, we can build a dictionary with such tokens as keys and their frequencies as values. 

    Please extend the dictionary ```ne_freq_dict``` given below so that it ends up in the following form:

    ```{'Wendy': 341, 'Peter': 394, 'Brussels': 1, ...}``` 
    (which means: in the text, ```Wendy``` occurs 341 times in all, ```Peter``` occurs 394 times in all, ```Brussels``` occurs 1 time in all, etc.)
    
    **Tips:** Recall that the value of a key can be overwritten, i.e., the values in a dictionary are modifiable (see ```data_types.ipynb```, Section 6). Thus, you can build the ```ne_freq_dict``` in the following way:
    1. When a proper noun P1 is found in the text, and P1 already exists as a key in the dictionary, the value of P1 should increase by 1. 
    2. When a proper noun P2 is found in the text, but P2 does not yet exist as a key in the dictionary, you should add a new item to the dictionary with P2 as key and 1 as value. This means: when P2 is found again in the text later, step 1 will be carried out, as P2 is now an existing key.
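
The two steps above can be sketched on a handful of hand-written pairs. The names `toy_tagged` and `toy_freq` are hypothetical, and the tags are typed in by hand for illustration rather than produced by a tagger; the only assumption about NLTK is that its POS-tagger returns a list of `(token, tag)` tuples.

```python
# Hand-written (token, tag) pairs in the shape of POS-tagger output.
# The tags are made up for illustration, not real tagger output.
toy_tagged = [
    ("Peter", "NNP"), ("flew", "VBD"), ("to", "TO"), ("Wendy", "NNP"),
    ("and", "CC"), ("Peter", "NNP"), ("laughed", "VBD"), ("Darlings", "NNPS"),
]

toy_freq = {}
for token, tag in toy_tagged:
    # Only proper nouns (singular or plural) are counted.
    if tag in ("NNP", "NNPS"):
        if token in toy_freq:
            # Step 1: the key already exists, so increase its value by 1.
            toy_freq[token] += 1
        else:
            # Step 2: first occurrence, so add the key with value 1.
            toy_freq[token] = 1

print(toy_freq)  # {'Peter': 2, 'Wendy': 1, 'Darlings': 1}
```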
    
    
    
**NB:** You will notice that some tokens, such as ```Oh``` or ```Mr```, are wrongly tagged by the POS-tagger of NLTK as proper nouns. Every POS-tagger makes such mistakes. For simplicity, please just ignore these mistakes and pretend that these tokens are proper nouns.

<font color="blue">(20 points)</font>

In [None]:
ne_freq_dict = {}

# ADD YOUR CODE HERE. You are free to split your code into multiple cells when necessary.



3. Now that you have built the ```ne_freq_dict``` with the proper nouns as keys and their respective frequencies as values, you can sort the dictionary by value in descending order by running the line given below. Please then print out the top 5 most frequent proper nouns together with their frequencies.

    **NB:** Take care of the data type of ```ne_freq_dict_sorted```. It is not a dictionary any more.
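
A small sketch of what the sorting line returns, using a made-up dictionary `toy_freq` with invented counts (not the real frequencies from the text): the result is a list of `(key, value)` tuples, so it supports slicing but not key lookup.

```python
# Invented counts, for illustration only.
toy_freq = {"Wendy": 3, "Peter": 5, "Brussels": 1}

# items() yields (key, value) pairs; x[1] selects the value as sort key.
toy_sorted = sorted(toy_freq.items(), key=lambda x: x[1], reverse=True)

print(type(toy_sorted))  # <class 'list'>

# A list can be sliced, and each element unpacks into key and value.
for name, freq in toy_sorted[:2]:
    print(name, freq)
```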

<font color="blue">(3 points)</font>

In [None]:
ne_freq_dict_sorted = sorted(ne_freq_dict.items(), key=lambda x: x[1], reverse=True)

# ADD YOUR CODE HERE. You are free to split your code into multiple cells when necessary.



## Task 3: Token Frequency and Type-Token Ratio

Another common task in computational linguistics research is to investigate the frequency of each token. To this end, stop words and punctuation marks are usually removed from the text, as they mainly serve grammatical functions and are not informative with regard to the content.

1. Please remove stop words and punctuation marks from the token list that you obtained from the tokenization in Task 1. To this end, please go through each token in the list, and append all tokens that are neither stop words nor punctuation marks to the new list ```tokenized_cleaned``` given below.

    For removing stop words, please use the stop word list provided by NLTK, which is already given below. For removing punctuation marks, please consider using regular expressions.
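
One possible way to combine both filters, sketched on toy data. Everything here is an assumption for illustration: `toy_stop_words` is a tiny hand-written stand-in for the NLTK list, the pattern treats a token as punctuation if it contains no word characters and no whitespace, and lowercasing a token before the stop-word check is a design choice rather than a requirement of the task.

```python
import re

# Hypothetical mini stop word list (the real one comes from NLTK).
toy_stop_words = ["the", "and", "to"]
toy_tokens = ["Peter", "and", "Wendy", "flew", "to", "the",
              "island", ",", "!", "''", "grown-up"]

# Matches tokens made up entirely of non-word, non-space characters.
punct_pattern = re.compile(r"[^\w\s]+")

toy_cleaned = []
for token in toy_tokens:
    # Skip stop words (compared in lower case, an assumption).
    if token.lower() in toy_stop_words:
        continue
    # Skip tokens that consist of punctuation only.
    if punct_pattern.fullmatch(token):
        continue
    toy_cleaned.append(token)

print(toy_cleaned)  # ['Peter', 'Wendy', 'flew', 'island', 'grown-up']
```

Because `fullmatch` requires the whole token to match, a hyphenated word such as `grown-up` is kept.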
    
<font color="blue">(10 points)</font>

In [None]:
from nltk.corpus import stopwords

In [None]:
stop_words = stopwords.words('english')
tokenized_cleaned = []

# ADD YOUR CODE HERE. You are free to split your code into multiple cells when necessary.



2. To investigate the frequency of each token, we will again apply the method used in Task 2 for checking the most frequent proper nouns. 

    An empty dictionary ```token_freq_dict``` is defined below. Please extend this dictionary so that it contains the tokens in ```tokenized_cleaned``` as keys, and their respective frequencies as values.
    
    As we would like to consider ```Apple``` and ```apple``` as the same token, please first convert all tokens in ```tokenized_cleaned``` to lower case.
    
<font color="blue">(7 points)</font>

In [None]:
token_freq_dict = {}

# ADD YOUR CODE HERE. You are free to split your code into multiple cells when necessary.



3. Please use the method you learned in Task 2 to sort ```token_freq_dict``` by values in descending order, and print out the top 5 most frequent tokens together with their frequencies.

<font color="blue">(3 points)</font>

In [None]:
# ADD YOUR CODE HERE. You are free to split your code into multiple cells when necessary.



4. The vocabulary richness of a text is often measured by **type-token ratio** (TTR). TTR is defined as the total number of *unique* tokens (i.e., *types*) divided by the total number of tokens in a given text.

    E.g., for the text ```John likes apple and Mary likes apple too```, the total number of types (unique tokens) would be 6: ```{'John', 'likes', 'apple', 'and', 'Mary', 'too'}``` (note that ```likes``` and ```apple``` each appear twice). The total number of tokens would be 8. Thus, the TTR is 6/8 = 0.75.

    Please calculate the TTR of the text ```peterpan_cleaned.txt```.

    **Tips:** You can get the total number of types by checking how many keys the ```token_freq_dict``` contains.
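
The worked example above can be reproduced in a few lines. The names `toy_tokens` and `toy_freq` are hypothetical, and splitting on whitespace is used here only because the example sentence contains no punctuation.

```python
# The example sentence from the definition, split on whitespace.
toy_tokens = "John likes apple and Mary likes apple too".split()

# Count each token; get() returns 0 for keys not seen before.
toy_freq = {}
for token in toy_tokens:
    toy_freq[token] = toy_freq.get(token, 0) + 1

n_types = len(toy_freq)     # number of keys = number of unique tokens
n_tokens = len(toy_tokens)  # total number of tokens
ttr = n_types / n_tokens

print(n_types, n_tokens, ttr)  # 6 8 0.75
```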
    
<font color="blue">(10 points)</font>

In [None]:
# ADD YOUR CODE HERE. You are free to split your code into multiple cells when necessary.



## Task 4: N-grams

Frequencies can be computed not only for single tokens. Researchers are also often interested in the frequencies of so-called **N-grams**, i.e., sequences of *N* consecutive tokens. Here we will work on *bigrams*, i.e., sequences of 2 tokens.

E.g., All bigrams of the list ```['Winterthur', 'is', 'a', 'city', 'in', 'Switzerland']``` are: 

```['Winterthur is', 'is a', 'a city', 'city in', 'in Switzerland']```
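
To make the definition concrete, the example list can be paired up as follows. This sketch deliberately uses `zip` to pair each token with its successor, which is a different technique from the explicit loop over positions asked for in the task; `toy_tokens` and `toy_bigrams` are hypothetical names.

```python
toy_tokens = ['Winterthur', 'is', 'a', 'city', 'in', 'Switzerland']

# zip pairs the list with a copy shifted by one position:
# (token 0, token 1), (token 1, token 2), and so on.
toy_bigrams = [first + " " + second
               for first, second in zip(toy_tokens, toy_tokens[1:])]

print(toy_bigrams)
# ['Winterthur is', 'is a', 'a city', 'city in', 'in Switzerland']
```

A list of *n* tokens yields *n* - 1 bigrams, since the last token has no successor.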

1. Please use a **while-loop** or a **for-loop** to get all bigrams of the list ```tokenized_cleaned``` resulting from Task 3, and store them in the list ```bigrams``` given below.

    **NB:** NLTK also provides off-the-shelf methods for getting N-grams from a list. However, for the sake of practice, please DO use a while-loop or a for-loop to complete this task.
    
<font color="blue">(25 points)</font>

In [None]:
bigrams = []

# ADD YOUR CODE HERE. You are free to split your code into multiple cells when necessary.



2. Next, please find out the frequency of each bigram, and print out the top 5 most frequent bigrams together with their frequencies.

<font color="blue">(5 points)</font>

In [None]:
# ADD YOUR CODE HERE. You are free to split your code into multiple cells when necessary.



## Additional Criterion: Programming Style

Please comment your code sufficiently. <font color="blue">(5 points)</font>

---**END**--- 