<div class="alert alert-block alert-info">
    <h1>Natural Language Processing</h1>
    <h3>General Information:</h3>
    <p>Please do not add or delete any cells. Answers belong into the corresponding cells (below the question). If a function is given (either as a signature or a full function), you should not change the name, arguments or return value of the function.<br><br> If you encounter empty cells underneath the answer that can not be edited, please ignore them, they are for testing purposes.<br><br>When editing an assignment there can be the case that there are variables in the kernel. To make sure your assignment works, please restart the kernel and run all cells before submitting (e.g. via <i>Kernel -> Restart & Run All</i>).</p>
    <p>Code cells where you are supposed to give your answer often include the line  ```raise NotImplementedError```. This makes it easier to automatically grade answers. If you edit the cell please outcomment or delete this line.</p>
    <h3>Submission:</h3>
    <p>Please submit your notebook via the web interface (in the main view -> Assignments -> Submit). The assignments are due on <b>Wednesday at 15:00</b>.</p>
    <h3>Group Work:</h3>
    <p>You are allowed to work in groups of up to two people. Please enter the UID (your username here) of each member of the group into the next cell. We apply plagiarism checking, so do not submit solutions from other people except your team members. If an assignment has a copied solution, the task will be graded with 0 points for all people with the same solution.</p>
    <h3>Questions about the Assignment:</h3>
    <p>If you have questions about the assignment please post them in the LEA forum before the deadline. Don't wait until the last day to post questions.</p>
    
</div>

In [1]:
'''
Group Work:
Enter the UID of each team member into the variables. 
If you work alone please leave the second variable empty.
'''
member1 = 'Syed Mushrraf Ali (sali2s, 9040658)'
member2 = 'Shalaka Satheesh (ssathe2s, 9040760)'


# Introduction to spaCy

SpaCy is a tool that does tokenization, parsing, tagging and named entity regocnition (among other things).

When we parse a document via spaCy, we get an object that holds sentences and tokens, as well as their POS tags, dependency relations and so on.

Look at the next cell for an example.


In [20]:
import spacy

# Load the English language model
nlp = spacy.load('/srv/shares/NLP/en_core_web_sm')

# Our sample input
text = 'SpaCy is capable of    tagging, parsing and annotating text. It recognizes sentences and stop words.'

# Parse the sample input
doc = nlp(text)

# For every sentence
for sent in doc.sents:
    # For every token
    for token in sent:
        # Print the token itself, the pos tag, 
        # dependency tag and whether spacy thinks this is a stop word
        print(token, token.pos_, token.dep_, token.is_stop)
        
print('-'*30)
print('The nouns and proper nouns in this text are:')
# Print only the nouns:
for token in doc:
    if token.pos_ in ['NOUN', 'PROPN']:
        print(token)

SpaCy PROPN nsubj False
is AUX ROOT True
capable ADJ acomp False
of ADP prep True
    SPACE  False
tagging NOUN pobj False
, PUNCT punct False
parsing VERB conj False
and CCONJ cc True
annotating VERB conj False
text NOUN dobj False
. PUNCT punct False
It PRON nsubj True
recognizes VERB ROOT False
sentences NOUN dobj False
and CCONJ cc True
stop VERB conj False
words NOUN dobj False
. PUNCT punct False
------------------------------
The nouns and proper nouns in this text are:
SpaCy
tagging
text
sentences
words


## SpaCy A) [5 points]
### Splitting text into sentences

You are given the text in the next cell.

```
text = '''
This is a sentence. 
Mr. A. said this was another! 
But is this a sentence? 
The abbreviation Merch. means merchant(s).
At certain univ. in the U.S. and U.K. they study NLP.
'''
```

Use spaCy to split this into sentences. Store the resulting sentences (each as a **single** string) in the list ```sentences```. Make sure to convert the tokens to strings (e.g. via str(token)).

In [3]:
import spacy
nlp = spacy.load('/srv/shares/NLP/en_core_web_sm')

text = '''
This is a sentence. Mr. A. said this was another! 
But is this a sentence? The abbreviation Merch. means merchant(s).
At certain Univ. in the U.S. and U.K. they study NLP.
'''
sentences = []
tokens = []

doc = nlp(text)
for sentence in (doc.sents):
    sentences.append(str(sentence))
    for token in sentence:
        tokens.append(str(token))
    
for sentence in sentences:
    print(sentence)
    print('.')
    assert type(sentence) == str, 'You need to convert this to a single string!'


This is a sentence.
.
Mr. A. said this was another! 

.
But is this a sentence?
.
The abbreviation Merch. means merchant(s).

.
At certain Univ.
.
in the U.S. and U.K. they study NLP.

.


In [4]:
# This is a test cell, please ignore it!

## SpaCy B) [5 points]

### Cluster the text by POS tag

Next we want to cluster the text by the corresponding part-of-speech (POS) tags. 

The result should be a dictionary ```pos_tags``` where the keys are the POS tags and the values are lists of words with those POS tags. Make sure your words are converted to **strings**.

*Example:*

```
pos_tags['VERB'] # Output: ['said', 'means', 'study']
pos_tags['ADJ']  # Output: ['certain']
...
```

In [5]:
import spacy
nlp = spacy.load('/srv/shares/NLP/en_core_web_sm')

text = '''
This is a sentence. Mr. A. said this was another! 
But is this a sentence? The abbreviation Merch. means merchant(s).
At certain Univ. in the U.S. and U.K. they study NLP.
'''

pos_tags = dict()

doc = nlp(text)
        
for sentence in doc.sents:
    for token in sentence:
        if token.pos_ not in pos_tags:
            pos_tags[token.pos_] = []
        pos_tags[token.pos_].append(str(token))

for key in pos_tags:
    print('The words with the POS tag {} are {}.'.format(key, pos_tags[key]))
    for token in pos_tags[key]:
        assert type(token) == str, 'Each token should be a string'

The words with the POS tag SPACE are ['\n', '\n', '\n', '\n'].
The words with the POS tag DET are ['This', 'a', 'this', 'another', 'this', 'a', 'The', 'the'].
The words with the POS tag AUX are ['is', 'was', 'is'].
The words with the POS tag NOUN are ['sentence', 'sentence', 'abbreviation'].
The words with the POS tag PUNCT are ['.', '!', '?', '.', ')', '.', '.', '.'].
The words with the POS tag PROPN are ['Mr.', 'A.', 'Merch', 'merchant(s', 'Univ', 'U.S.', 'U.K.', 'NLP'].
The words with the POS tag VERB are ['said', 'means', 'study'].
The words with the POS tag CCONJ are ['But', 'and'].
The words with the POS tag ADP are ['At', 'in'].
The words with the POS tag ADJ are ['certain'].
The words with the POS tag PRON are ['they'].


In [6]:
# This is a test cell, please ignore it!

# SpaCy C) [5 points]

### Stop word removal

Stop words are words that appear often in a language and don't hold much meaning for a NLP task. Examples are the words ```a, to, the, this, has, ...```. This depends on the task and domain you are working on.

SpaCy has its own internal list of stop words. Use spaCy to remove all stop words from the given text. Store your result as a **single string** in the variable ```stopwords_removed```.

In [7]:
import spacy
nlp = spacy.load('/srv/shares/NLP/en_core_web_sm')

text = '''
This is a sentence. Mr. A. said this was another! 
But is this a sentence? The abbreviation Merch. means merchant(s).
At certain Univ. in the U.S. and U.K. they study NLP.
'''

stopwords_removed = ''

doc = nlp(text)

for sentence in doc.sents:
    for token in sentence:
        if token.is_stop:
            stopwords_removed += str(token)
            stopwords_removed += ' '

print(stopwords_removed)
assert type(stopwords_removed) == str, 'Your answer should be a single string!'

This is a this was another But is this a The At in the and they 


In [8]:
# This is a test cell, please ignore it!

# SpaCy D) [2 points]

### Dependency Tree

We now want to use spaCy to visualize the dependency tree of a certain sentence. Look at the Jupyter Example on the [spaCy website](https://spacy.io/usage/visualizers/). Render the tree.

In [9]:
import spacy
from spacy import displacy

nlp = spacy.load('/srv/shares/NLP/en_core_web_sm')

text = 'Dependency Parsing is helpful for many tasks.'

doc = nlp(text)
displacy.render(doc, style="dep")

# SpaCy E) [5 points]

### Dependency Parsing

Use spaCy to extract all subjects and objects from the text. We define a subject as any word that has ```subj``` in its dependency tag (e.g. ```nsubj```, ```nsubjpass```, ...). Similarly we define an object as any token that has ```obj``` in its dependency tag (e.g. ```dobj```, ```pobj```, etc.).

For each sentence extract the subject, root node ```ROOT``` of the tree and object and store them as a single string in a list. Name this list ```subj_obj```.

*Example:*

```
text = 'Learning multiple ways of representing text is cool. We can access parts of the sentence with dependency tags.'

subj_obj = ['Learning ways text is', 'We access parts sentence tags']

In [13]:
import spacy
import re
nlp = spacy.load('/srv/shares/NLP/en_core_web_sm')

text = '''
This is a sentence. Mr. A. said this was another! 
But is this a sentence? The abbreviation Merch. means merchant(s).
At certain Univ. in the U.S. and U.K. they study NLP.
'''

subj_obj = []

doc = nlp(text)

# USE re.compile ????

for sentence in doc.sents:
    sentence_string = ''
    for token in sentence:
        if re.findall('\w*subj\w*', str(token.dep_)):
            sentence_string += str(token)
            sentence_string += ' '
        elif re.findall('ROOT', str(token.dep_)):
            sentence_string += str(token)
            sentence_string += ' '
        elif re.findall('\w*obj\w*', str(token.dep_)):
            sentence_string += str(token)
            sentence_string += ' '
    subj_obj.append(sentence_string)

for cleaned_sent in subj_obj:
    print(cleaned_sent)
    assert type(cleaned_sent) == str, 'Each cleaned sentence should be a string!'

This is 
A. said this 
is this 
abbreviation means 
At Univ 
U.S. they study NLP 


In [14]:
# This is a test cell, please ignore it!

# Keyword Extraction

In this assignment we want to write a keyword extractor. There are several methods of which we want to explore a few.

We want to extract keywords from our Yelp reviews.

##  POS tag based extraction

When we look at keywords we realize that they are often combinations of nouns and adjectives. The idea is to find all sequences of nouns and adjectives in a corpus and count them. The $n$ most frequent ones are then our keywords.

A keyword (or keyphrase) by this definition is any combination of nouns (NOUN) and adjectives (ADJ) that ends in a noun. We also count proper nouns (PROPN) as nouns.

## POS tag based extraction A) [35 points]

### POSKeywordExtractor

Please complete the function ```keywords``` in the class ```POSKeywordExtractor```.

You are given the file ```wiki_nlp.txt```, which has the raw text from all top-level Wikipedia pages under the category ```Natural language processing```. Use this for extracting your keywords.

*Example:*

Let us look at the definition of an index term or keyword from Wikipedia. Here I highlighted all combinations of nouns and adjectives that end in a noun. All the highlighted words are potential keywords.

An **index term**, **subject term**, **subject heading**, or **descriptor**, in **information retrieval**, is a **term** that captures the **essence** of the **topic** of a **document**. **Index terms** make up a **controlled vocabulary** for **use** in **bibliographic records**.

*Rules:*

- A keyphrase is a sequence of nouns, adjectives and proper nouns ending in a noun or proper noun.
- Keywords / Keyphrases can not go over sentence boundaries.
- We always take the longest sequence of nouns, adjectives and proper nouns
  - Consider the sentence ```She studies natural language processing.```. The only extracted keyphrase here will be ```('natural', 'language', 'processing')```.
- Consider the sentence ```neural networks massively increased the performance.```:
  - Here our keyphrase would be ```neural networks```, not ```neural networks massively```.
  - Our keyphrases are always the longest sequence of nouns and adjectives ending in a noun

In [26]:
%%time
from typing import List, Tuple, Iterable
from collections import Counter
import spacy
from spacy.tokens import Token
import pickle
import numpy as np

class POSKeywordExtractor:
    
    def __init__(self):
        # Set up SpaCy in a more efficient way by disabling what we do not need
        # This is the dependency parser (parser) and the named entity recognizer (ner)
        self.nlp = spacy.load(
            '/srv/shares/NLP/en_core_web_sm', 
            disable=['ner', 'parser']
        )
        # Add the sentencizer to quickly split our text into sentences
        self.nlp.add_pipe(self.nlp.create_pipe('sentencizer'))
        # Increase the maximum length of text SpaCy can parse in one go
        self.nlp.max_length = 1500000
        
    def validate_keyphrase(self, candidate: Iterable[Token]) -> Iterable[Token]:
        '''
        Takes in a list of tokens which are all proper nouns, nouns or adjectives
        and returns the longest sequence that ends in a proper noun or noun
        
        Args:
            candidate         -- List of spacy tokens
        Returns:
            longest_keyphrase -- The longest sequence that ends in a noun
                                 or proper noun
                                 
        Example:
            candidate = [neural, networks, massively]
            longest_keyphrase = [neural, networks]
        '''
        # If the candidate list is not empty
        if candidate:
            # Check if the last word is an adjective
            if candidate[-1].pos_ == 'ADJ':
                candidate = candidate[:-1]
            # Check if the last word is a noun or pronoun
            for i in range(len(candidate)-1, -1, -1):
                if candidate[i].pos_ in ['NOUN', 'PROPN']:
                    return (candidate[:i+1])    
        return candidate
        
    def keywords(self, text: str, n_keywords: int, min_words: int) -> List[Tuple[Tuple[str], int]]:
        '''
        Extract the top n most frequent keywords from the text.
        Keywords are sequences of adjectives and nouns that end in a noun
        
        Arguments:
            text       -- the raw text from which to extract keywords
            n_keywords -- the number of keywords to return
            min_words  -- the number of words a potential keyphrase has to include
                          if this is set to 2, then only keyphrases consisting of 2+ words are counted
        Returns:
            keywords   -- List of keywords and their count, sorted by the count
        '''
        doc = self.nlp(text)
        keywords = []
        
        # Make a dictionary of sentence-indices with
        # their respective tokens
        all_tokens = []  
        sentences = dict()
        c = 0
        for sentence in doc.sents:
            tokens = []
            for token in sentence:
                all_tokens.append(token)
                tokens.append(token)
            sentences[c] = tokens
            c = c + 1
        
        # For each sentence, go through their tokens,
        # append the tokens which are ADJ/NOUN/PROPN to 
        # a dictionary. The key of this dictionary is the 
        # position that the token occurs in the sentence
        for key, sentence in sentences.items():
            possible_keyword = dict()
            for index, token in enumerate(sentence):
                if token.pos_ in ['ADJ','NOUN', 'PROPN']:
                    if index not in possible_keyword:
                        possible_keyword[index] = token
            
            # Get all the positions of the valid tokens
            # and check which positions are consecutive. 
            # Consecutively occuring keys will give keyphrases.
            data = list(possible_keyword.keys())
            
            # REFERENCE: https://stackoverflow.com/questions/
            # 7352684/how-to-find-the-groups-of-consecutive-elements-in-a-numpy-array
            for array in np.split(data, np.where(np.diff(data) != 1)[0]+1):
                candidate = []
                for index in array:
                    candidate.append(possible_keyword[index])
                    
                # Check if the candidate is valid and if it is
                # convert each token to string and convert the 
                # list that contains each candidate into a tuple.
                candidate = self.validate_keyphrase(candidate)
                if candidate:
                    candidate = [str(i) for i in candidate]
                    keywords.append(tuple(candidate))
        
        # Check if the keywords/keyphrases are 
        # more than min_words limit.
        for keyword in keywords[:]:
            if len(keyword) < min_words:
                keywords.remove(keyword)
        
        # Make a list of tuples. Each tuple has the
        # keyword and it's number of occurances
        keywords_return = []
        for key, value in Counter(keywords).items():
            keywords_return.append((key, value))
        
        return sorted(keywords_return, key=lambda x: x[1], reverse=True)[:n_keywords]

    
with open('/srv/shares/NLP/wiki_nlp.txt', 'r') as corpus_file:
    text = corpus_file.read()
    
keywords = POSKeywordExtractor().keywords(text.lower(), n_keywords=15, min_words=1)

'''
Expected output:
The keyword ('words',) appears 353 times.
The keyword ('text',) appears 342 times.
The keyword ('example',) appears 263 times.
The keyword ('word',) appears 231 times.
The keyword ('natural', 'language', 'processing') appears 184 times.
...
'''
for keyword in keywords:
    print('The keyword {} appears {} times.'.format(*keyword))

The keyword ('words',) appears 353 times.
The keyword ('text',) appears 342 times.
The keyword ('example',) appears 263 times.
The keyword ('word',) appears 231 times.
The keyword ('natural', 'language', 'processing') appears 184 times.
The keyword ('documents',) appears 162 times.
The keyword ('language',) appears 148 times.
The keyword ('information',) appears 137 times.
The keyword ('n',) appears 136 times.
The keyword ('set',) appears 133 times.
The keyword ('system',) appears 120 times.
The keyword ('t',) appears 117 times.
The keyword ('number',) appears 112 times.
The keyword ('sentence',) appears 112 times.
The keyword ('context',) appears 110 times.
CPU times: user 7.33 s, sys: 3.1 s, total: 10.4 s
Wall time: 11.6 s


In [146]:
# This is a test cell, please ignore it!

### POS tag based extraction B) [4 points]

Rerun the keyword extrator with a minimum word count of ```min_words=2``` and a keyword count of ```n_keywords=15```.

Store this in the variable ```keywords_2```. Print the result.

Make sure to convert the input text to lower case!

In [18]:
keywords_2 = POSKeywordExtractor().keywords(text.lower(), n_keywords=15, min_words=2)
for keyword in keywords_2:
    print('The keyword {} appears {} times.'.format(*keyword))

The keyword ('natural', 'language', 'processing') appears 184 times.
The keyword ('computational', 'linguistics') appears 105 times.
The keyword ('external', 'links') appears 101 times.
The keyword ('machine', 'translation') appears 94 times.
The keyword ('information', 'retrieval') appears 70 times.
The keyword ('natural', 'language') appears 66 times.
The keyword ('sentiment', 'analysis') appears 58 times.
The keyword ('=', 'references') appears 56 times.
The keyword ('text', 'mining') appears 55 times.
The keyword ('artificial', 'intelligence') appears 49 times.
The keyword ('word', 'sense', 'disambiguation') appears 47 times.
The keyword ('computer', 'science') appears 36 times.
The keyword ('machine', 'learning') appears 33 times.
The keyword ('information', 'extraction') appears 33 times.
The keyword ('speech', 'recognition') appears 31 times.


In [None]:
# This is a test cell, please ignore it!


# Stop word based keyword extraction

One approach to extract keywords is by splitting the text at the stop words. Then we count these potential keywords and output the top $n$ keywords. Make sure to only include words proper words. Here we define proper words as those words that match the regular expression ```r'\b(\W+|\w+)\b'```. 



## Stop word based keyword extraction A) [35 points]

Complete the function ```keywords``` in the class ```StopWordKeywordExtractor```.

In [76]:
%%time
from typing import List, Tuple
from collections import Counter
import re
import spacy

class StopWordKeywordExtractor:
    
    def __init__(self):
        # Set up SpaCy in a more efficient way by disabling what we do not need
        # This is the dependency parser (parser) and the named entity recognizer (ner)
        self.nlp = spacy.load('/srv/shares/NLP/en_core_web_sm', disable=['ner', 'parser'])
        # Add the sentencizer to quickly split our text into sentences
        self.nlp.add_pipe(self.nlp.create_pipe('sentencizer'))
        # Increase the maximum length of text SpaCy can parse in one go
        self.nlp.max_length = 1500000
        
    def is_proper_word(self, token:str) -> bool:
        '''
        Checks if the word is a proper word by our definition
        
        Arguments:
            token     -- The token as a string
        Return:
            is_proper -- True / False
        '''
        match = re.search(r'\b(\W+|\w+)\b', token)
        return match and token == match[0] 
    
    def keywords(self, text: str, n_keywords: int, min_words: int) -> List[Tuple[Tuple[str], int]]:
        '''
        Extract the top n most frequent keywords from the text.
        Keywords are sequences of adjectives and nouns that end in a noun
        
        Arguments:
            text       -- the raw text from which to extract keywords
            n_keywords -- the number of keywords to return
            min_words  -- the number of words a potential keyphrase has to include
                          if this is set to 2, then only keyphrases consisting of 2+ words are counted
        Returns:
            keywords   -- List of keywords and their count, sorted by the count
                          Example: [(('potato'), 12), (('potato', 'harvesting'), 9), ...]
        '''
        doc = self.nlp(text)
        keywords = []   
        keywords_sub = []
        # Split each sentence once a stop word
        # or a word which is not proper is encountered.
        # This split sentence stored as keyword or a keyphrase
        for sentence in doc.sents:
            for token in sentence:
                if token.is_stop or not self.is_proper_word(str(token)):
                    keywords.append(tuple(keywords_sub))
                    keywords_sub = []
                else:
                    keywords_sub.append(str(token))
                    
        # Check if the keywords/keyphrases are 
        # more than min_words limit.
        for keyword in keywords[:]:
            if len(keyword) < min_words:
                keywords.remove(keyword)
                
        # Make a list of tuples. Each tuple has the
        # keyword and it's number of occurances
        keywords_return = []
        for key, value in Counter(keywords).items():
            if key:
                keywords_return.append((key, value))
        
        return sorted(keywords_return, key=lambda x: x[1], reverse=True)[:n_keywords]
        
with open('/srv/shares/NLP/wiki_nlp.txt', 'r') as corpus_file:
    text = corpus_file.read()
    
keywords = StopWordKeywordExtractor().keywords(text.lower(), n_keywords=15, min_words=1)

'''
Expected output:
The keyword ('words',) appears 273 times.
The keyword ('text',) appears 263 times.
The keyword ('example',) appears 257 times.
The keyword ('word',) appears 201 times.
The keyword ('references',) appears 184 times.
The keyword ('natural', 'language', 'processing') appears 165 times.
...
'''
for keyword in keywords:
    print('The keyword {} appears {} times.'.format(*keyword))

The keyword ('words',) appears 273 times.
The keyword ('text',) appears 263 times.
The keyword ('example',) appears 257 times.
The keyword ('word',) appears 201 times.
The keyword ('references',) appears 184 times.
The keyword ('natural', 'language', 'processing') appears 165 times.
The keyword ('n',) appears 160 times.
The keyword ('use',) appears 151 times.
The keyword ('set',) appears 144 times.
The keyword ('language',) appears 123 times.
The keyword ('t',) appears 120 times.
The keyword ('documents',) appears 118 times.
The keyword ('based',) appears 115 times.
The keyword ('1',) appears 115 times.
The keyword ('number',) appears 106 times.
CPU times: user 25.1 s, sys: 3.05 s, total: 28.2 s
Wall time: 28.4 s


In [None]:
# This is a test cell, please ignore it!

## Stop word based keyword extraction B) [4 points]

Rerun the keyword extrator with a minimum word count of ```min_words=2``` and a keyword count of ```n_keywords=15```.

Store this in the variable ```keywords_2```. Print the result.

Make sure to convert the input text to lower case!

In [77]:
keywords_2 = StopWordKeywordExtractor().keywords(text.lower(), n_keywords=15, min_words=2)

for keyword in keywords_2:
    print('The keyword {} appears {} times.'.format(*keyword))

The keyword ('natural', 'language', 'processing') appears 165 times.
The keyword ('computational', 'linguistics') appears 103 times.
The keyword ('external', 'links') appears 101 times.
The keyword ('machine', 'translation') appears 68 times.
The keyword ('information', 'retrieval') appears 67 times.
The keyword ('natural', 'language') appears 47 times.
The keyword ('text', 'mining') appears 47 times.
The keyword ('sentiment', 'analysis') appears 45 times.
The keyword ('word', 'sense', 'disambiguation') appears 45 times.
The keyword ('artificial', 'intelligence') appears 45 times.
The keyword ('machine', 'learning') appears 42 times.
The keyword ('computer', 'science') appears 34 times.
The keyword ('speech', 'recognition') appears 29 times.
The keyword ('information', 'extraction') appears 29 times.
The keyword ('customer', 'inserts') appears 29 times.


In [None]:
# This is a test cell, please ignore it!