# Your name: <type it here, please>

# Homework 1
In this homework, you will analyze the statistics of the Wikipedia article about the famous match between Lee Sedol, a South Korean professional Go player and 18-time Go world champion, and AlphaGo, a machine learning system developed by Google DeepMind (https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol). You will use the spaCy library (https://spacy.io/) for text processing, analysis, and visualization. You will learn how to extract individual sentences and words from raw text, annotate them with part-of-speech tags, detect entities, and build parse trees.

Please follow the instructions in this notebook to complete the assignment.

Submit your homework as follows:

```sh
$ submit arum hw1 <your_notebook_filename>
```


# 0. Import libraries and load data
Please make sure to put the input text file "corpus.txt" into the same folder as this notebook.

In [1]:
# Make sure you have spacy and matplotlib installed
from collections import defaultdict, Counter

import matplotlib.pyplot as plt
import spacy
from spacy import displacy

%matplotlib inline

In [2]:
# Loading the English model from spaCy. 
# You can read the documentation here (https://spacy.io/models/en#en_core_web_lg)
# Make sure to run the "$ python3 -m spacy download en_core_web_lg" command in your terminal before executing this cell
nlp = spacy.load('en_core_web_lg')

In [None]:
# Reading data
with open('corpus.txt') as f:
    text = f.read()

# 1. Input text statistics
This part of the homework is dedicated to tokenization. Tokenization is the process of segmenting text into individual linguistic units: characters, words, sentences, etc. It is the first step of the text processing workflow after data collection and cleaning. Given an input text, the goal of tokenization is to return the sequence of tokens that it consists of. We'll use spaCy to look into how tokenization is done.
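
To make the idea concrete, here is a minimal sketch of word-level tokenization that uses only a regular expression. It is a naive stand-in for illustration, not how spaCy tokenizes; the pattern and the `naive_tokenize` helper are assumptions made for this example.

```python
import re

def naive_tokenize(text):
    # Hypothetical helper: a run of word characters is one token,
    # and every other non-space character is a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_tokenize("AlphaGo won. Lee Sedol resigned!")
print(tokens)
```

Real tokenizers such as spaCy's also handle contractions, abbreviations, and URLs, which this pattern does not.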

Your task is to complete the missing code in the function `process_text`.
In this step you will have to implement the following:
- Tokenize the input text into sentences using spaCy sentence tokenization
- Tokenize the input text into tokens and count the frequency of each one. Note: your implementation should disregard punctuation tokens. You can use the `is_punct` attribute to check whether a given token is a punctuation token.
- Lemmatize the derived tokens if `lemmatize=True`
- Lowercase the derived tokens if `lowercase=True`
- Filter out the most frequent tokens, i.e. those whose count exceeds `high_thr`
- Filter out the least frequent tokens, i.e. those whose count is below `low_thr`
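
The two filtering steps can be sketched on a toy `Counter`, independently of spaCy. The counts and the `filter_by_count` helper below are made up for illustration; the sketch assumes that both thresholds are inclusive bounds, i.e. a token is kept when its count is neither above `high_thr` nor below `low_thr`.

```python
from collections import Counter

# Toy counts, made up for illustration (not taken from corpus.txt)
toy_counter = Counter({"the": 50, "go": 12, "match": 7, "komi": 2})

def filter_by_count(counter, high_thr=None, low_thr=None):
    # Hypothetical helper: keep tokens whose count lies in [low_thr, high_thr];
    # a threshold of None means "no limit on that side".
    return {
        tok: cnt
        for tok, cnt in counter.items()
        if (high_thr is None or cnt <= high_thr)
        and (low_thr is None or cnt >= low_thr)
    }

filtered = filter_by_count(toy_counter, high_thr=20, low_thr=5)
print(filtered)
```

Building a new dictionary, rather than deleting keys while iterating, avoids modifying the counter during iteration.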

When finished, run the cells below to answer the questions. __For every cell make sure to call the completed function with the correct parameters.__

In [None]:
def process_text(text, lemmatize=False, lowercase=False, high_thr=None, low_thr=None):
    """
    Tokenize text at a sentence- and at a word-level. Punctuation is ignored.
    Return a list of sentences and a dictionary of tokens' counts.
    Args:
        text (str): The input text
        lemmatize (bool, optional): A flag for whether derived tokens should be lemmatized or not.
                                    Note: spaCy's lemmatization also lowercases a given token
        lowercase (bool, optional): A flag for whether derived tokens should be lowercased or not
        high_thr (int, optional): The threshold frequency for cutting off the tokens if their count exceeds high_thr
        low_thr (int, optional): The threshold frequency for cutting off the tokens if their count is below low_thr

    Returns:
        sentences (list): A list of sentences in the text.
        tok_counter (dict): Maps tokens to their counts in the text ({'word': count}). 
                            If lemmatize=True, the returned dictionary's keys should be lemmatized.
    """
    doc = nlp(text)
    
    # Tokenize text into sentences
    ### YOUR CODE BELOW ###
    sentences = None
    ### YOUR CODE ABOVE ###
    
    # Tokenize text into words
    ### YOUR CODE BELOW ###
    tok_counter = None

    
    
    
    ### YOUR CODE ABOVE ###
    
    # Filtration if high_thr or low_thr are set
    if high_thr is not None:
        ### YOUR CODE BELOW ###
        pass
        # change tok_counter here
        ### YOUR CODE ABOVE ###
        
    if low_thr is not None:
        ### YOUR CODE BELOW ###
        pass
        # change tok_counter here
        ### YOUR CODE ABOVE ###
    
    return sentences, tok_counter

## Problem set 1

### 1. How many sentences are there in the text? Print out the 61st sentence.

In [None]:
# Customize the function input parameters if needed
sentences, tok_counter = process_text(text)

# Print the answer
print("There are {} sentences in total.\n".format(len(sentences)))
print("The 61st sentence is: \n{}".format(sentences[60]))

### 2. How many times is the word "go" used?
Make sure your implementation is NOT case sensitive.

In [None]:
# Customize the function input parameters if needed
sentences, tok_counter = process_text(text)

# Print the answer
print("The word 'go' is used {} times".format(tok_counter["go"]))

### 3. How many times is any form of the word "go" used? 
("go", "went", "goes", "Going", etc.)

In [None]:
# Customize the function input parameters if needed
sentences, tok_counter = process_text(text)

# Print the answer
print("The forms of the word 'go' are used {} times".format(tok_counter["go"]))

### 4. How many tokens appear in the text no more than 20 times and no fewer than 5 times?
Your implementation should be case-sensitive.

In [None]:
# Customize the function input parameters if needed
sentences, tok_counter = process_text(text)

# Print the answer
print("There are {} tokens within the [5, 20] frequency range".format(len(tok_counter)))

### 5. Visualize Zipf's law (https://en.wikipedia.org/wiki/Zipf%27s_law)
Zipf's law states that the frequency of a word in a corpus is roughly inversely proportional to its rank in the frequency table, so a few words are very frequent while most words are rare.

For visualization purposes (and speed), you can consider only tokens that appear in the text at least 10 times.
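
As a quick check of what to expect from the plot: under an idealized Zipf distribution, the product of a word's rank and its count stays constant. The counts below are invented so that they follow this pattern exactly; real text only approximates it.

```python
from collections import Counter

# Invented counts that follow Zipf's law exactly: count = 120 / rank
toy_counts = Counter({"the": 120, "of": 60, "go": 40, "game": 30})

# most_common() sorts by descending count, so the position gives the rank
ranked = [
    (rank, tok, cnt)
    for rank, (tok, cnt) in enumerate(toy_counts.most_common(), start=1)
]
products = [rank * cnt for rank, _, cnt in ranked]

for rank, tok, cnt in ranked:
    print(rank, tok, cnt, rank * cnt)
```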

In [None]:
# Customize the function input parameters if needed
sentences, tok_counter = process_text(text)

# Plot the output
tok_counter = sorted(tok_counter.items(), key=lambda x: x[1], reverse=True)
words = [tok[0] for tok in tok_counter]
counts = [tok[1] for tok in tok_counter]
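# Set the style before creating the figure so that it takes effect.
# Matplotlib 3.6+ renamed the seaborn styles to 'seaborn-v0_8-*'.
style = 'seaborn-v0_8-paper' if 'seaborn-v0_8-paper' in plt.style.available else 'seaborn-paper'
plt.style.use(style)
plt.figure(figsize=(16, 8))
plt.bar(words, counts)
plt.xticks(rotation=70, fontsize=12)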

# 2. Part of speech tagging
Part-of-speech (POS) tags help to understand the role of a word in a sentence. Knowing a given word's part of speech provides information about the overall syntactic structure of a sentence and helps to disambiguate the meaning of the word. SpaCy has built-in automatic POS tagging, which you'll be using in this exercise. As in the previous function, punctuation should be ignored in this task.

Your task is to complete the missing code in the function `get_pos` and in the cells below. In this step you will have to implement the following:

- Detect the part of speech of every non-punctuation token in the text
- Construct a dictionary of POS tags and corresponding words. The keys of the dictionary are POS tags, and the values are lists of all the words with that tag
- Similarly, construct a dictionary where the keys are lemmatized words and values are lists of all detected POS tags

You might find the `defaultdict` class from the `collections` library useful for this task.
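
As a reminder of how `defaultdict` behaves, here is a small sketch that groups hand-tagged words in both directions. The `(word, tag)` pairs and the variable names are made up for illustration; in the homework the tags would come from spaCy instead.

```python
from collections import defaultdict

# Hand-tagged (word, tag) pairs, made up for illustration
tagged = [("cotton", "NOUN"), ("shirt", "NOUN"), ("cotton", "ADJ"), ("run", "VERB")]

word2tags = defaultdict(list)  # a missing key starts out as an empty list
tag2words = defaultdict(list)
for word, tag in tagged:
    word2tags[word].append(tag)
    tag2words[tag].append(word)

print(dict(word2tags))
print(dict(tag2words))
```

With a plain `dict`, each `append` would first need a check for whether the key already exists.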

The two dictionaries will be used for further analysis.
When finished, __complete the code in the cells below__ to answer the questions. Make sure to call the completed function with the correct input parameters.

In [None]:
def get_pos(text, lemmatize=False):
    """
    Build a mapping between every token (ignoring punctuation) and every detected POS tag within the given text.
    
    Args:
        text (str): The input text.
        lemmatize (bool, optional): A flag for whether or not tokens should be lemmatized.

    Returns:
        token2pos (dict): The token-to-list of corresponding POS tags mapping ({'cotton':['NOUN', 'ADJ']})
        pos2token (dict): The POS_tag-to-list of corresponding lemmatized tokens mapping ({'NOUN':['apple', 'pear']})
    """
    doc = nlp(text)

    ### YOUR CODE BELOW ###
    token2pos = None
    pos2token = None
    
    
    
    
    
    ### YOUR CODE ABOVE ###
        
    return token2pos, pos2token

## Problem set 2

### 1. How many adjectives are there in the text?

In [None]:
### YOUR CODE BELOW ###
token2pos, pos2token = None, None
num_adjectives = None
### YOUR CODE ABOVE ###

# Print output
print("There are {} adjectives in the text.".format(num_adjectives))

### 2. What is the 3rd most popular verb in the text?
Note: only verbs are considered in this case

In [None]:
### YOUR CODE BELOW ###
token2pos, pos2token = None, None
verbs_cnt = None
### YOUR CODE ABOVE ###

verbs_sorted = sorted(verbs_cnt.items(), key=lambda x: x[1], reverse=True)

# Print output
print("The 3rd most popular verb is '{}'".format(verbs_sorted[2][0]))
    

### 3. If you only count the canonical forms of words, what is the 3rd most popular verb form in the text?
Note: "go", "going", "went", etc. refer to the same canonical form "go"

In [None]:
### YOUR CODE BELOW ###
token2pos, pos2token = None, None
verb_forms_cnt = None
### YOUR CODE ABOVE ###

verb_forms_sorted = sorted(verb_forms_cnt.items(), key=lambda x: x[1], reverse=True)

# Print output
print("The 3rd most popular verb is 'to {}'".format(verb_forms_sorted[2][0]))

### 4. Which words are the most "ambiguous"? (i.e. which words have more than 2 syntactic roles depending on the context?)

In [None]:
### YOUR CODE BELOW ###

### YOUR CODE ABOVE ###

# 3. NER analysis
Named entity recognition (NER) is an important information extraction task in NLP. The goal of NER is to identify named entities (such as places, people, organizations, etc.) in unstructured text. NER helps to avoid ambiguity, resolve coreferences, represent meaning, and find relations between individual documents. It is widely used for question answering, news search, textual entailment, and other NLP tasks. As you might have guessed, spaCy automatically detects named entities in a given text, and you will use this functionality in this exercise.


Your task is to complete the missing code in the function `get_entities`. In this step you will have to implement the following:

- Detect all the entities in the text
- For every entity type, store all the detected tokens of this type

Note that in this case the input text shouldn't be lowercased, as capital letters help with detecting named entities.

When finished, __complete the code in the cells below__ to answer the questions.

In [None]:
def get_entities(text):
    """
    Build a mapping between entity labels and tokens with the assigned labels.
    
    Args:
        text (str): The input text.

    Returns:
        entity_dict (dict): The dictionary containing entity labels as keys and lists of corresponding tokens as values.
                            {'PERSON':['Barack Obama', 'George Bush']} 
    """
    doc = nlp(text)
    
    ### YOUR CODE BELOW ###
    entity_dict = dict()

    
    ### YOUR CODE ABOVE ###
    return entity_dict

In [None]:
entity_dict = get_entities(text)

## Problem set 3

### 1. How many unique labels are detected in the text? What are they?

In [None]:
### YOUR CODE BELOW ###
labels = None
### YOUR CODE ABOVE ###

# Print output
print("There are {} unique labels identified".format(len(labels)))
print(list(labels))

### 2. How many dates are detected in the text?

In [None]:
### YOUR CODE BELOW ###
date_cnt = None
### YOUR CODE ABOVE ###

# Print output
print("There are {} dates identified".format(date_cnt))

### 3. Visualize all the entities detected in the first five sentences.
Use `displacy.render` to visualize how the entities in the first 5 sentences of the text are detected. You can use the sentences detected in the 1st part of the homework.
See https://spacy.io/usage/visualizers for reference.
__Note__: for proper visualization, it is recommended that you first convert each sentence to the `str` type, join the resulting strings into a single text, convert that text back to the `Doc` type, and then call the `displacy.render()` method.
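
To see what `displacy.render` expects, here is a minimal sketch on a made-up sentence. It assumes a blank English pipeline (so no model download is needed) and assigns the entity spans by hand instead of predicting them; the sentence, the labels, and the `nlp_blank` and `toy_doc` names are illustrative only.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# A blank pipeline only tokenizes; it has no NER component
nlp_blank = spacy.blank("en")
toy_doc = nlp_blank("Lee Sedol played AlphaGo in Seoul.")

# Entities are set manually here (token indices: Lee=0, Sedol=1, Seoul=5)
toy_doc.ents = [
    Span(toy_doc, 0, 2, label="PERSON"),
    Span(toy_doc, 5, 6, label="GPE"),
]

# style="ent" selects the entity visualizer
html = displacy.render(toy_doc, style="ent", jupyter=False)
print(html[:200])
```

Passing `jupyter=False` makes `render` return the HTML markup as a string rather than displaying it.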

In [None]:
### YOUR CODE BELOW ###
displacy.render()
### YOUR CODE ABOVE ###


# 4. Dependency parsing
Dependency parsing helps to identify the grammatical structure of a sentence by establishing dependency relations between individual words. Dependency parsing has been shown to improve NLP systems in certain languages; the best long paper at EMNLP 2018, a top NLP conference, integrated syntactic parsing into its system (https://arxiv.org/pdf/1804.08199.pdf). Although there are many methods for incorporating syntactic features into a model, we will look at spaCy's parser in this part of the homework.

This step doesn't require you to complete a function. Instead, you will visualize the parse tree reflecting dependency relations in the 61st sentence from the text (refer to the first part of the homework). You should use the `displacy.render` function to complete this step.

In [None]:
# Convert to str first: nlp() expects text, and the sentences may be spaCy Span objects
doc = nlp(str(sentences[60]))

### YOUR CODE BELOW ###
displacy.render()
### YOUR CODE ABOVE ###