<a href="https://colab.research.google.com/github/shreyanshu28/text-analysis/blob/main/Assignment4/Assignment_04_students.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 4: Keyphrase Extraction, Named Entity Recognition & Neural Models

Due: Monday, February 06, 2023, at 2pm via Moodle

**Team Members** `1.Yanxin Jia, 3769165
2.Shreyansu Vyas, 3769429
3.Smaran Nair , 3771609
4.AnuReddy , 3768482`

Please note that this assignment comes with quite a number of artifacts, totaling somewhere around 5 GB of necessary disk space. In case you run into issues or want to keep your environment "clean", we suggest using [Google Colab](https://colab.research.google.com/).

In [None]:
%%bash
. ~/.bashrc
python3 -m pip install keybert
python3 -m pip install git+https://github.com/LIAAD/yake
python3 -m pip install transformers
python3 -m pip install datasets
python3 -m pip install nltk
python3 -m pip install spacy
# Install necessary packages for all questions

## Task 1: Keyphrase Extraction (5 + 3 + 3 + 5 = 16 Points)

In this task, we will implement our own unsupervised keyphrase extraction (KPE) module utilizing a simple grammatical ruling system, which we apply to a Sherlock Holmes novel.
To generate TF-IDF-weighted phrases, we will be using the entire collection of Sir Arthur Conan Doyle's Sherlock Holmes stories to calculate document frequencies.

Finally, we compare the results to general-purpose KPE libraries.

### Sub Task 1: Unsupervised Keyphrase Extraction System (5 Points)

#### 1. Candidate Generation
We first need to generate a set of suitable candidate phrases, which can then be ranked as keyphrases later on. To do this, we will again be using spaCy, this time its rule-based [`Matcher` class](https://spacy.io/api/matcher).

The syntactic pattern of a keyphrase candidate should satisfy the following rules:

1. An optional adjective, noun, proper noun
2. An optional adjective, noun, proper noun
3. A mandatory noun or proper noun.

Add a second pattern, which recognizes the pattern

1. A noun or proper noun
2. An adposition
3. Another noun or proper noun

Note that the first pattern will match any phrase of between 1 and 3 tokens, which is a suitable approximation for our task at hand, whereas the second pattern is slightly more specific, always matching exactly three tokens.
Examples of valid matched phrases would be "Sherlock Holmes" ([PROPN, PROPN]) for the first pattern, and "Hounds of Baskervilles" ([NOUN, ADP, PROPN]) for the second pattern.
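
To make the two rules concrete, the sketch below re-implements them over a plain list of POS tags, without spaCy. The tagging of "the art of war" is an assumption, and `candidate_spans` is a hypothetical helper rather than part of the `Matcher` API; it only illustrates which token spans the patterns are meant to cover, including overlapping ones.

```python
from typing import List, Tuple

MODIFIER_TAGS = {"ADJ", "NOUN", "PROPN"}  # allowed in the two optional slots
HEAD_TAGS = {"NOUN", "PROPN"}             # required for the final token


def candidate_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, covered by either pattern."""
    spans = set()
    for i, tag in enumerate(tags):
        if tag not in HEAD_TAGS:
            continue
        # Pattern 1: up to two optional modifiers followed by a mandatory head at position i.
        for length in (1, 2, 3):
            start = i - length + 1
            if start >= 0 and all(t in MODIFIER_TAGS for t in tags[start:i]):
                spans.add((start, i + 1))
        # Pattern 2: head, adposition, head (exactly three tokens), starting at position i.
        if i + 2 < len(tags) and tags[i + 1] == "ADP" and tags[i + 2] in HEAD_TAGS:
            spans.add((i, i + 3))
    return sorted(spans)


# Assumed tagging of "the art of war": DET NOUN ADP NOUN
tokens = ["the", "art", "of", "war"]
tags = ["DET", "NOUN", "ADP", "NOUN"]
for start, end in candidate_spans(tags):
    print(" ".join(tokens[start:end]))
```

Under that assumed tagging, the helper yields "art", "art of war", and "war": two single-token matches from the first rule and one three-token match from the second.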

In [2]:
import spacy
from spacy.matcher import Matcher

In [3]:
# load language model
nlp = spacy.load("en_core_web_sm", disable=["ner"])
matcher = Matcher(nlp.vocab)

# Define the above patterns
pattern = [
    [{"POS":{"IN":["ADJ","NOUN","PROPN"]},"OP":"?"},{"POS":{"IN":["ADJ","NOUN","PROPN"]},"OP":"?"},
    {"POS":{"IN":["NOUN","PROPN"]}}],
    [{"POS":{"IN":["NOUN","PROPN"]}},{"POS":"ADP"},{"POS":{"IN":["NOUN","PROPN"]}}]

]

    

# Add the pattern to the Matcher
matcher.add("PATTERN", pattern)

To verify whether your pattern is correct, use the below example.
If you have done everything correctly, your matcher will identify **13 phrases**.

In [4]:
doc = nlp("This is a simple test. It should return 'simple', and 'test', among other phrases. Maybe we can also see if it can recognize the art of war. Would it recognize integer linear programming, too?")
matches = matcher(doc)

print(len(matches))

13


#### 2. Applying Your System

Once you have matched the correct number of keyphrase candidates on the above example, apply your rule-based matcher to an actual data sample. We are going to use the Sherlock Holmes novel "The Hound of the Baskervilles". You can find the raw text file at the following URL:

https://sherlock-holm.es/stories/plain-text/houn.txt

Download the text from this URL and apply your spaCy model and matcher on it.  
**Hint:** Make sure you properly decode your input, since some libraries return binary strings.

In [5]:
from urllib.request import urlopen
def load_txt_from_url(url: str = "https://sherlock-holm.es/stories/plain-text/houn.txt") -> str:
  response = urlopen(url)
  data = response.read()
  text = data.decode("utf-8")
  return text

text = load_txt_from_url()

# Apply the spacy model to the loaded text and extract the phrases with the Matcher
doc = nlp(text)
matches = matcher(doc)
print(text)





                          THE HOUND OF THE BASKERVILLES

                               Arthur Conan Doyle







                                Table of contents
        Mr. Sherlock Holmes
        The Curse of the Baskervilles
        The Problem
        Sir Henry Baskerville
        Three Broken Threads
        Baskerville Hall
        The Stapletons of Merripit House
        First Report of Dr. Watson
        Second Report of Dr. Watson
        Extract from the Diary of Dr. Watson
        The Man on the Tor
        Death on the Moor
        Fixing the Nets
        The Hound of the Baskervilles
        A Retrospection















          CHAPTER I
          Mr. Sherlock Holmes


     Mr. Sherlock Holmes, who was usually very late in the mornings, save
     upon those not infrequent occasions when he was up all night, was
     seated at the breakfast table. I stood upon the hearth-rug and picked
     up the stick which our visitor had left behind him the night before.
     

In [6]:
print(matches)

[(11920309760829426267, 2, 3), (11920309760829426267, 5, 6), (11920309760829426267, 7, 8), (11920309760829426267, 7, 9), (11920309760829426267, 8, 9), (11920309760829426267, 7, 10), (11920309760829426267, 8, 10), (11920309760829426267, 9, 10), (11920309760829426267, 11, 12), (11920309760829426267, 11, 14), (11920309760829426267, 13, 14), (11920309760829426267, 15, 16), (11920309760829426267, 15, 17), (11920309760829426267, 16, 17), (11920309760829426267, 15, 18), (11920309760829426267, 16, 18), (11920309760829426267, 17, 18), (11920309760829426267, 20, 21), (11920309760829426267, 23, 24), (11920309760829426267, 26, 27), (11920309760829426267, 28, 29), (11920309760829426267, 28, 30), (11920309760829426267, 29, 30), (11920309760829426267, 28, 31), (11920309760829426267, 29, 31), (11920309760829426267, 30, 31), (11920309760829426267, 32, 33), (11920309760829426267, 32, 34), (11920309760829426267, 33, 34), (11920309760829426267, 32, 35), (11920309760829426267, 33, 35), (1192030976082942626

We will now investigate which phrase candidates are the most frequently appearing in this novel, simply based on the phrase frequency. Therefore, convert your abstract match objects into actual strings, lowercase them, and return the 20 most frequently occurring phrase candidates and their respective frequencies.  
**Hint:** For counting occurrences, you may look at `collections.Counter`.

In [7]:
from collections import Counter

candidates = []
# Lowercase and add the extracted candidate matches to `candidates`
for match_id, start, end in matches:
    span = doc[start:end]
    candidates.append(span.text.lower())
    

candidate_phrases = Counter(candidates)

# Print the most frequently occurring phrases, together with the respective frequencies
print(candidate_phrases.most_common(20))



[('sir', 350), ('man', 214), ('holmes', 192), ('moor', 159), ('henry', 156), ('sir henry', 135), ('watson', 117), ('baskerville', 116), ('dr.', 109), ('charles', 94), ('stapleton', 93), ('mortimer', 89), ('night', 88), ('time', 86), ('sir charles', 86), ('house', 75), ('face', 75), ('hound', 72), ('barrymore', 72), ('eyes', 71)]


#### 3. Briefly summarize the quality of your top 20 candidates:

[('sir', 350), ('man', 214), ('holmes', 192), ('moor', 159), ('henry', 156), ('sir henry', 135), ('watson', 117), ('baskerville', 116), ('dr.', 109), ('charles', 94), ('stapleton', 93), ('mortimer', 89), ('night', 88), ('time', 86), ('sir charles', 86), ('house', 75), ('face', 75), ('hound', 72), ('barrymore', 72), ('eyes', 71)]

### Sub Task 2: Generating Document Frequency Values (3 Points)

To compare the previously generated terms with a more refined model, we are going to extract document frequencies from the collection of all Sherlock Holmes works. Since the books are relatively long documents, we are instead going to split based on a simple heuristic in the input document, which should allow a decent approximation by taking into account individual chapters of each novel.

1. Start by loading the Sherlock Holmes canon from https://sherlock-holm.es/stories/plain-text/cnus.txt  
Afterwards, split the full document into individual chapters. For this, use three consecutive line breaks `\n\n\n` as a splitting condition to approximate the chapters.

In [8]:
df_texts = load_txt_from_url("https://sherlock-holm.es/stories/plain-text/cnus.txt")

chapter_split = "\n\n\n"
split_df_texts = df_texts.split(chapter_split)

print(len(split_df_texts))

353


After splitting, you should have 353 individual "documents" to work with.

2. Now, create a dictionary containing each phrase encountered in the larger corpus, and its associated document frequency. Again, ensure that phrase strings are lowercased for consistency with the previous transformation.  
**Hint:** Since the processing of 353 documents might take a while, incorporate [`tqdm.tqdm`](https://tqdm.github.io/) to visualize progress on the task.

In [9]:
from tqdm import tqdm
from typing import List

def return_occurring_phrases(doc_text: str) -> List[str]:
  # process text with spaCy and apply the Matcher
  doc = nlp(doc_text)
  matches = matcher(doc)

  # Candidates can be a set, since we only care about the occurrence *once* for IDF values.
  # Again, extract the lower-cased text of a matched span.
  candidates = set()
  for match_id, start, end in matches:
    span = doc[start:end]
    candidates.add(span.text.lower())

  return list(candidates)

all_document_phrases = []
# Iterate through the individual documents and extract phrases for them. Use `tqdm` to visualize progress
## YOUR CODE
for chapter in tqdm(split_df_texts):
    all_document_phrases.extend(return_occurring_phrases(chapter))

# Once again, count the frequency of term occurrences across all documents
## YOUR CODE
phrase_freq = Counter(all_document_phrases)



100%|██████████| 353/353 [01:13<00:00,  4.81it/s]


3. Output the 20 most frequently appearing document phrases that your system detected:

In [10]:
from collections import Counter

most_common_phrases = phrase_freq.most_common(20)
print(most_common_phrases)

[('man', 112), ('holmes', 107), ('time', 104), ('eyes', 104), ('night', 104), ('face', 102), ('hand', 102), ('sherlock holmes', 101), ('sherlock', 101), ('way', 101), ('matter', 100), ('room', 99), ('mr.', 98), ('hands', 97), ('day', 97), ('house', 96), ('case', 96), ('life', 96), ('door', 95), ('one', 95)]


### Sub Task 3: Generating Weighted Keyphrases (3 Points)

We can now incorporate the extracted keyphrases to calculate `tf-idf` scores, and return a hopefully improved version of our keyphrases for the original "The Hound of the Baskervilles" document.

1. Iterate over all phrases occurring in the novel "The Hound of the Baskervilles", and re-score phrases according to the definition of TF-IDF. Use the smoothed definition of idf:

$ idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}| + 1} + 1 $

In [11]:

import math
from typing import Dict

def tf_idf(tf: int, df_count: int) -> float:
  """
  Computes the TF-IDF scores according to the formula.
  """
  num_docs = 353
  idf = math.log(num_docs / (df_count + 1)) + 1
  return tf * idf

tf_idf_weighted_candidates = {}


# Iterate through all candidate phrase/frequency pairs and compute the TF-IDF scores for each phrase
# Store the phrase together with its TF-IDF score in `tf_idf_weighted_candidates`
for candidate, tf in candidate_phrases.items():
  # `phrase_freq[candidate]` is the number of corpus documents that contain the candidate phrase
  tf_idf_weighted_candidates[candidate] = tf_idf(tf, phrase_freq[candidate])



2. Now print the top 20 candidate phrases by TF-IDF weight, and compare the results to your previous output. 

In [12]:
print(sorted(tf_idf_weighted_candidates.items(), key = lambda kv:(kv[1], kv[0]),reverse=True)[0:20])

[('sir', 836.1959348592817), ('moor', 623.6026233649302), ('henry', 588.966394157697), ('sir henry', 552.6737101836245), ('baskerville', 461.2271706883073), ('man', 457.7631709792846), ('holmes', 419.3926713233428), ('stapleton', 400.0412390508737), ('mortimer', 358.9596694460602), ('sir charles', 357.6239356014735), ('charles', 354.8900067360482), ('barrymore', 321.7372404577147), ('dr.', 303.9934368200059), ('watson', 279.5283553672456), ('dr. mortimer', 275.29719676619266), ('hound', 257.0854457468857), ('baronet', 206.4728975746445), ('night', 194.70067819626806), ('hall', 191.75335703217434), ('time', 190.27566278271652)]
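
As a sanity check on the smoothed formula, the sketch below recomputes one score by hand. It takes the phrase frequency of 'man' in the novel (214) and its document frequency over the 353 canon chunks (112), both as reported in the earlier outputs; `smoothed_tf_idf` is a hypothetical standalone helper that restates the formula with the corpus size as an explicit parameter.

```python
import math


def smoothed_tf_idf(tf: int, df: int, num_docs: int) -> float:
    """tf * (log(|D| / (df + 1)) + 1), using the natural logarithm."""
    idf = math.log(num_docs / (df + 1)) + 1
    return tf * idf


# 'man': 214 occurrences in the novel, present in 112 of the 353 canon chunks
score = smoothed_tf_idf(tf=214, df=112, num_docs=353)
print(round(score, 2))
```

The result should agree with the weight listed for 'man' in the TF-IDF output. Because 'man' occurs in roughly a third of all chunks, its idf factor is only about 2.14, which is why it drops behind rarer, novel-specific phrases such as 'moor'.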


3. Write your insights on the comparison of the results below. Try to theorize why some of the phrases still appear, or why other phrases are no longer present:

YOUR ANSWER HERE

4. Give two examples of how you could further improve the list of keyphrase values.

YOUR ANSWER HERE



### Sub Task 4: Apply off-the-shelf Keyphrase Extraction Tools (5 Points)

To put the findings of your system into context, compare them with two popular open-source libraries, namely [YAKE!](https://github.com/LIAAD/yake) and [KeyBERT](https://github.com/MaartenGr/KeyBERT).

1. First, start by running the document with YAKE!; you may use the default parameters. Print the resulting keyphrases, which by default returns 20 phrases.

In [None]:
!pip3 install git+https://github.com/LIAAD/yake

In [14]:

import yake

# NOTE: `df_texts` holds the full canon; use `text` to restrict the extraction to the novel.
extractor = yake.KeywordExtractor()
keywords = extractor.extract_keywords(df_texts)

# YAKE! scores are "lower is better", so sort in ascending order to list the most relevant phrases first
sorted_result = sorted(keywords, key=lambda x: (x[1], x[0]))
print(sorted_result[:20])


[('SHERLOCK HOLMES', 1.244220900143271e-06), ('HOLMES', 3.5475912431521295e-06), ('friend Sherlock Holmes', 1.4905164542368326e-05), ('asked Sherlock Holmes', 1.8636307219539084e-05), ('Sherlock Holmes sat', 2.019031281138435e-05), ('Sir Henry Baskerville', 2.290271021192336e-05), ('Watson', 2.56607499254069e-05), ('Sir Henry', 2.645741307299696e-05), ('Man', 3.064333636646172e-05), ('Sherlock Holmes left', 4.2728733416566765e-05), ('asked Holmes', 4.4127246582873294e-05), ('Sir Charles Baskerville', 4.435050135434968e-05), ('Sherlock Holmes sprang', 4.4530755662024974e-05), ('Sherlock Holmes put', 4.92944444985635e-05), ('dear Watson', 5.0746857680460824e-05), ('Baker Street', 5.216905012344322e-05), ('Sir Charles', 5.330414318372243e-05), ('sir', 5.5356217452982246e-05), ('Sherlock Holmes face', 5.8356658052643885e-05), ('remarked Sherlock Holmes', 5.949394080324617e-05)]


In [None]:
!pip3 install rake-nltk

2. Compare both runtime efficiency and the extracted phrases with your own system.

YOUR ANSWER HERE

3. Now use the KeyBERT library to extract keyphrases. Importantly, you will need to split the document into separate paragraphs, as the underlying neural model will be unable to handle the complete document as input.  
Use the pattern of `\n\n` to separate the text into smaller paragraphs, and filter out any empty lines after. An "empty line" also constitutes all inputs that only contain newline (`\n`) or whitespace ` ` characters.


In [16]:
# Split the input text according to the specified criteria and filter empty lines out.
split_text = [p for p in df_texts.split("\n\n") if p.strip() != ""]

4. To ensure consistency between the tools when extracting keyphrases, set the *n*-gram range to `(1,3)`.
Otherwise, leave all parameters at the default value, and extract the keyphrases from each paragraph.

In [None]:
from keybert import KeyBERT

# This might take a while to install
model = KeyBERT("all-MiniLM-L6-v2")

# Extract the keyphrases from each split, using the adjusted keyphrase ngram range
# Hint: You may pass a list to the extraction function and KeyBERT will automatically handle iteration.
# Set the n-gram range to (1, 3)
ngram_range = (1, 3)

# Extract the keyphrases from each paragraph
extracted_phrases = model.extract_keywords(split_text, keyphrase_ngram_range=ngram_range)
print(extracted_phrases)

5. Combine the predictions of all individual splits into a single list. For this, sum up the prediction scores across all splits.  
**Hint:** `collections.defaultdict` makes aggregations like this much easier.

In [18]:
from typing import List, Tuple
from collections import defaultdict

def merge_predictions(list_of_predictions: List[List[Tuple]]) -> List[Tuple]:
    """
    Combines lists of predictions into a single list with added scores.
    """
    phrase_dict = defaultdict(float)

    # Iterate through all the lists of predictions and add the scores to the correct dict entry
    for word_list in list_of_predictions:
        for word, score in word_list:
            phrase_dict[word] += score

    # Extract the 20 keyphrases with the highest weights from `phrase_dict`
    phrase_list = sorted(phrase_dict.items(), key=lambda x: x[1], reverse=True)[:20]

    return phrase_list

In [None]:
print(merge_predictions(extracted_phrases))
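
The aggregation step can be sketched on toy data. The scores below are made up for illustration, and `sum_scores` is a hypothetical helper; it assumes each split yields a list of `(phrase, score)` tuples, matching the `List[List[Tuple]]` signature used above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def sum_scores(predictions: List[List[Tuple[str, float]]], top_n: int = 20) -> List[Tuple[str, float]]:
    """Sum the scores of identical phrases across splits and return the top_n phrases."""
    totals: Dict[str, float] = defaultdict(float)
    for split_predictions in predictions:
        for phrase, score in split_predictions:
            totals[phrase] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up scores for two splits
toy = [
    [("sherlock holmes", 0.5), ("moor", 0.25)],
    [("sherlock holmes", 0.25), ("hound", 0.125)],
]
print(sum_scores(toy, top_n=2))
```

Summing, rather than averaging, favors phrases that recur across many paragraphs, which is worth keeping in mind when comparing the merged list with the frequency-based rankings.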

6. Again, evaluate the result and compare it to the other two approaches in terms of extraction quality and extraction speed.

YOUR ANSWER HERE

## Task 2: Named Entity Recognition (4 + 5 + 5 = 14 Points)

Slightly different, but still operating on the sequence level, is the task of Named Entity Recognition (NER).
In this task, we will evaluate the NER capabilities of some more open-source libraries.
Particularly, we will also evaluate the utility of NER as a stand-in for Keyphrase Extraction.

### Sub Task 1: Using spaCy NER (4 Points)

So far, when using spaCy models, we have primarily disabled the NER component, as it requires significant extra compute.
In this task, we will explicitly leave the component enabled, to see what results it can produce on the text from the previous question.

In [1]:
!pip3 install spacy 


[notice] A new release of pip available: 22.2.2 -> 23.0
[notice] To update, run: pip install --upgrade pip


In [2]:
import spacy

# Load the en_core_web_sm model, but with NER enabled.
nlp = spacy.load("en_core_web_sm",enable='ner')
nlp.pipe_names

['ner']

1. Re-load the text of the novel "The Hound of the Baskervilles", and run it through the spaCy model.

In [3]:
# Re-use the function from the previous exercise.
import requests
url = "https://sherlock-holm.es/stories/plain-text/houn.txt"
response = requests.get(url)
text = response.text
# Run the text through the model
doc = nlp(text)

2. Similar to the previous exercise, count the number of occurrences, however, this time for the extracted entities instead of phrases. Print the top 20 most frequently occurring entities.  
Make sure to lowercase the text again during your aggregation.

In [4]:
# Count the number of occurrences of particular entities
from collections import defaultdict
# Create a defaultdict to store the entity counts
entity_counts = defaultdict(int)

# Iterate over the entities in the doc
for ent in doc.ents:
    # Lowercase the entity text
    ent_text = ent.text.lower()
    # Increment the count for the entity
    entity_counts[ent_text] += 1

# Print the top 20 most frequently occurring entities.
for ent_text,count in sorted(entity_counts.items(),key=lambda x:x[1],reverse=True)[:20]:
    print(f'{ent_text} --> {count}')

holmes --> 151
henry --> 117
watson --> 104
one --> 96
charles --> 74
stapleton --> 73
mortimer --> 72
two --> 61
first --> 51
london --> 49
barrymore --> 43
baskerville --> 34
baskerville hall --> 28
sherlock holmes --> 24
half --> 21
henry baskerville --> 20
night --> 17
coombe tracey --> 16
second --> 15
three --> 15


You might have noticed some unwanted results in the list, such as "night". Upon closer inspection, it turns out that the NER module further differentiates between different entity *categories*, such as PERSON (referencing, as expected, a physical person) or ORG (organizations, such as companies, NGOs, etc.), but also TIME (under which "night" falls). For reference, you can find the full list of supported NER labels by this particular model [here](https://spacy.io/models/en#en_core_web_sm-labels).

3. Refine the list of most common entities by printing out the top three occurring entities in the category `PERSON`, `ORG` and `GPE` (physical locations) instead.

In [5]:
from collections import Counter 
from typing import List,Tuple
def get_top_entities_by_class(doc: spacy.tokens.Doc, class_name: str, n: int = 3) -> List[Tuple[str, int]]:
    """
    Returns the `n` most frequent entities (and their frequencies)
    of entity type `class_name` from `doc`.
    """
    # Extract phrase and frequency of a particular entity class
    counter = Counter([ent.text.lower() for ent in doc.ents if ent.label_ == class_name])
    # Return the top `n` entities and frequencies
    return counter.most_common(n)

# Print the results for "PERSON", "ORG" and "GPE"
print("Top 3 entities in category PERSON:", get_top_entities_by_class(doc, "PERSON"))
print("Top 3 entities in category ORG:", get_top_entities_by_class(doc, "ORG"))
print("Top 3 entities in category GPE:", get_top_entities_by_class(doc, "GPE"))

Top 3 entities in category PERSON: [('holmes', 141), ('henry', 117), ('watson', 104)]
Top 3 entities in category ORG: [('baker street', 10), ('times', 10), ('i.', 10)]
Top 3 entities in category GPE: [('london', 49), ('baskerville', 16), ('devonshire', 14)]


### Sub Task 2: Financial Bank Statements of Deutsche Bank (5 Points)

Instead of using the Sherlock Holmes Novels, we will now compare the functionality of spaCy and NLTK's NER modules on the financial statements of Deutsche Bank from 2021. For this, see the file available on Moodle.

1. Download it and convert the PDF document into text, by using the `pdftotext` command-line utility. In particular, run with the `-layout` option enabled.

In [None]:
%%bash
. ~/.bashrc
## YOUR SHELL COMMAND HERE
# If you have to execute this command through your shell, still paste the command you ran in here.

In [6]:
import subprocess
pdf_file = 'Assignment4/DB_annual_report.pdf'
text_file = 'Assignment4/DB_annual_report.txt'
subprocess.run(['pdftotext', '-layout', pdf_file, text_file])

CompletedProcess(args=['pdftotext', '-layout', '/Users/anureddy/Desktop/Sem01/DataScience_for_text_analytics/Assignments/Assignment04/DB_annual_report.pdf', '/Users/anureddy/Desktop/Sem01/DataScience_for_text_analytics/Assignments/Assignment04/DB_annual_report.txt'], returncode=0)

2. Given that the document is extremely long, split the input into chunks of 500,000 characters and process them separately.

In [7]:
def load_long_text_in_chunks(fp: str, chunk_size: int = 500_000):
    """
    Loads a text file (located at `fp`) and splits it into chunks of at most `chunk_size` characters.
    Note that the last chunk might be significantly shorter.
    """
    # Load the text file
    with open(fp,'r') as f1:
        text = f1.read()

    # Split the text into segments of at most `chunk_size` characters
    chunks = [text[i:i+chunk_size]for i in range(0,len(text),chunk_size)]
    return chunks

In [9]:
db_chunks = load_long_text_in_chunks(text_file)

3. Print the top 5 occurring `ORG` entities that are not referencing Deutsche Bank itself, both by using spaCy's NER module and the NER function of NLTK.  
To exclude "Deutsche Bank" entities, filter out all entities that contain both "deutsche" and "bank" in their name, irrespective of the actual upper-/lowercasing.
**Hint:** For more information on how to run NER with NLTK, see [here](https://nanonets.com/blog/named-entity-recognition-with-nltk-and-spacy/#performing-ner-with-nltk-and-spacy)

In [10]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

org_entities_spacy = []
org_entities_nltk = []

def is_deutsche_bank_entity(name: str) -> bool:
    """
    Returns True if the entity name contains "deutsche" and "bank" in some upper-/lowercased version.
    This means both "Deutsche Bank" and "deutsche bank's" should be recognized.
    """
    names = name.lower()
    return "deutsche" in names and "bank" in names

for chunk in db_chunks:
    # Process the chunk with spaCy
    doc = nlp(chunk)
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    # And also with NLTK
    words = nltk.word_tokenize(chunk)
    tags = nltk.pos_tag(words)
    chunks = nltk.ne_chunk(tags)
    

    # Add all the extracted "ORG" entities to `org_entities`, except those referencing Deutsche Bank
    org_entities_spacy.extend([ent[0] for ent in entities if ent[1] == "ORG" and not is_deutsche_bank_entity(ent[0])])
    org_entities_nltk.extend([c[0][0] for c in chunks if hasattr(c, 'label') and c.label() == 'ORGANIZATION' and not is_deutsche_bank_entity(c[0][0])])

[nltk_data] Downloading package punkt to /Users/anureddy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/anureddy/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/anureddy/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/anureddy/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [11]:
# Return the top 5 entities by frequency
from collections import Counter
entity_counts_spacy = Counter(org_entities_spacy)
entity_counts_nltk = Counter(org_entities_nltk)
print("Top 5 ORG entities wrt spaCy: ", entity_counts_spacy.most_common(5))
print("Top 5 ORG entities wrt NLTK : ", entity_counts_nltk.most_common(5))

Top 5 ORG entities wrt spaCy:  [('Group', 891), ('the Management Board', 262), ('the Supervisory Board', 189), ('Bank', 136), ('Management Board', 132)]
Top 5 ORG entities wrt NLTK :  [('Deutsche', 1102), ('Group', 791), ('Management', 511), ('Supervisory', 392), ('Credit', 222)]
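
The NLTK results above are single tokens because `c[0][0]` keeps only the first token of each entity subtree. A possible refinement, sketched below, joins all tokens of a subtree before filtering. `Subtree` is a minimal stand-in that mimics the `label()` and `leaves()` methods of `nltk.Tree`, so the sketch runs without NLTK or its models; `full_entity_names` is a hypothetical helper, and the sample data is invented.

```python
from typing import List, Tuple, Union

Token = Tuple[str, str]  # (word, POS tag), as produced by nltk.pos_tag


class Subtree:
    """Stand-in for nltk.Tree: exposes label() and leaves() only."""

    def __init__(self, label: str, leaves: List[Token]):
        self._label = label
        self._leaves = leaves

    def label(self) -> str:
        return self._label

    def leaves(self) -> List[Token]:
        return list(self._leaves)


def full_entity_names(chunked: List[Union[Token, Subtree]], label: str) -> List[str]:
    """Join every token of each subtree with the given label into one entity string."""
    names = []
    for node in chunked:
        # Plain (word, tag) tuples have no label(); only entity subtrees do.
        if hasattr(node, "label") and node.label() == label:
            names.append(" ".join(word for word, _tag in node.leaves()))
    return names


# Invented sample resembling the top level of an nltk.ne_chunk result
sample = [
    Subtree("ORGANIZATION", [("Deutsche", "NNP"), ("Bank", "NNP")]),
    ("reported", "VBD"),
    Subtree("ORGANIZATION", [("Supervisory", "NNP"), ("Board", "NNP")]),
    Subtree("PERSON", [("Paul", "NNP"), ("Achleitner", "NNP")]),
]
names = full_entity_names(sample, "ORGANIZATION")
filtered = [n for n in names if not ("deutsche" in n.lower() and "bank" in n.lower())]
print(filtered)
```

With full names available, the Deutsche Bank filter can match on the NLTK side as well, which should make the two rankings more directly comparable.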


4. Compare and analyze the different results between the two methods.

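Both tools return `ORG` entities for the same chunks, but the results differ considerably. spaCy returns multi-word entities such as "the Management Board" and "the Supervisory Board", although it counts variants with and without the article separately and also picks up generic terms such as "Group" and "Bank".
The NLTK list consists of single tokens ("Deutsche", "Management", "Supervisory") because the code above keeps only the first token of each NLTK entity (`c[0][0]`); for the same reason, the Deutsche Bank filter never matches on the NLTK side, and "Deutsche" tops that list. On this evidence, spaCy yields the more usable entities here, but a fair comparison would require joining all tokens of each NLTK entity first.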

In [12]:
# RAWCOUNTS of PERSON 
entity_counts = {}
for chunk in db_chunks:
    doc = nlp(chunk)
    
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            if ent.text not in entity_counts:
                entity_counts[ent.text]=1
            else:
                entity_counts[ent.text]+=1

most_freq_ent = max(entity_counts,key = entity_counts.get)


In [13]:
# Print the most frequent 'PERSON' entity and its count
print("The most frequent 'PERSON' entity is:", most_freq_ent)

The most frequent 'PERSON' entity is: MREL


### Sub Task 3: Co-Occurrence Counts of Entities (5 Points)

As is becoming apparent, the *raw* occurrence counts of entities might not be meaningful on their own, especially if we are interested in less frequently occurring entities.

Instead, we will "investigate" the entities that are most frequently mentioned in association with "Deutsche Bank". For this purpose, we will look at the textual co-occurrences of two named entities. The basic idea is that entities that frequently appear together are likely related.

1. For each text chunk, extract all mentions of the entity `('Deutsche Bank', 'ORG')`, as well as all `PERSON` entity mentions in the text using spaCy. Store the respective entity name and the text position. Unlike the previous question, you do *not* need to check for different spellings of the "Deutsche Bank" entity.  
**Hint:** Entities are represented as a [`Span`](https://spacy.io/api/span) element in spaCy, which has access to text position.


In [14]:
entity_mentions_with_start_position = []

for chunk in db_chunks:
    chunk_mentions = []
    # Process the doc with spaCy
    doc = nlp(chunk)
    
    # Extract only entity mentions of "Deutsche Bank" (ORG) or any PERSON mention.
    # Append each mention, including the text and its starting position, to `chunk_mentions`
    for e in doc.ents:
        if e.label_ == 'ORG' and e.text == 'Deutsche Bank':
            chunk_mentions.append((e.label_,e.text,e.start_char))
        elif e.label_ == 'PERSON':
            chunk_mentions.append((e.label_,e.text,e.start_char))
    
    # Append the chunk's entities to the aggregate list
    entity_mentions_with_start_position.append(chunk_mentions)


2. Within each chunk, for each mention of `Deutsche Bank`, search for `PERSON` entities that have a starting position within 200 characters before/after the starting position of the `Deutsche Bank` mention. Count for each `PERSON` entity how many times it occurs near a mention of `Deutsche Bank`.  
Aggregate the co-occurrences across all chunks. 

In [None]:
co_occurrences = []

for chunk_mentions in entity_mentions_with_start_position:
    # Iterate through the entities. If the entity is a "Deutsche Bank" mention, extract nearby
    # PERSON references (less than +/- 200 character difference in the starting position)

    ## YOUR CODE


3. Return the number of co-occurrences and the name of the top 5 frequently occurring `PERSON` entities.


In [15]:
import collections

co_occurrences = collections.defaultdict(int)
for chunk_mentions in entity_mentions_with_start_position:
    for mention in chunk_mentions:
        if mention[1]=='Deutsche Bank':
            db_start = mention[2]
            for other_mention in chunk_mentions:
                if other_mention[0]=='PERSON' and abs(other_mention[2]-db_start)<=200:
                    co_occurrences[other_mention[1]] += 1
                    
print(co_occurrences)

defaultdict(<class 'int'>, {'Frank Kuhnke': 2, 'Rebecca Short': 2, 'Frank Werneke': 1, 'Frank Bsirske': 1, 'Olivier Vigneron': 3, 'Paul Achleitner': 1, 'Dagmar Valcárcel': 1, 'Paul Achl': 1, 'PB GY': 3, 'Main': 10, 'Zurich Italy': 1, '2021                                               Key': 1, '2021                              Risk': 1, 'Risk Type': 1, 'Risk Types': 1, 'Significant Increase': 1, '‚': 1, 'Leverage Ratio': 6, 'Eurosystem': 1, 'MREL': 15, 'Consent Order': 2, 'Jeffrey Epstein': 8, 'DB': 1, 'KGaA': 4, 'Datenträgerverfahren': 2, 'Nichtzulassungsbeschwerde': 1, 'Warburg Invest\nKapitalanlagegesellschaft': 1, 'Warburg Invest': 5, 'Schwab': 3, 'Epstein': 2, 'George Town            Other Enterprise                   ': 2, 'Spólka Akcyjna                                 ': 1, 'Governance': 1, 'Fabrizio Campelli': 1, 'Alexander von zur Mühlen': 2, 'Stefan Simon': 1, 'Durin': 1, 'Sewing': 5, 'Generalbevollmächtigter': 1, 'Paul': 1, 'Achleitner': 2, 'Wirtschaftsprüfer': 1, 'Wirtsch
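
The window logic can be checked on a small invented example. The mentions and positions below are made up, and `count_nearby_persons` is a hypothetical helper; it assumes the same `(label, text, start_char)` tuple layout as `chunk_mentions`.

```python
from collections import Counter
from typing import List, Tuple

Mention = Tuple[str, str, int]  # (label, text, start_char)


def count_nearby_persons(mentions: List[Mention], anchor: str = "Deutsche Bank", window: int = 200) -> Counter:
    """Count PERSON mentions starting within `window` characters of each anchor mention."""
    anchor_starts = [start for label, text, start in mentions if label == "ORG" and text == anchor]
    counts: Counter = Counter()
    for label, text, start in mentions:
        if label != "PERSON":
            continue
        # One count per (anchor mention, person mention) pair inside the window.
        counts[text] += sum(1 for a in anchor_starts if abs(start - a) <= window)
    # Drop persons that never fall inside a window.
    return Counter({name: n for name, n in counts.items() if n > 0})


# Invented positions within a single chunk
toy_mentions = [
    ("ORG", "Deutsche Bank", 100),
    ("PERSON", "Person A", 250),    # 150 characters after the first anchor: counted
    ("PERSON", "Person B", 600),    # more than 200 characters from both anchors: ignored
    ("ORG", "Deutsche Bank", 1000),
    ("PERSON", "Person A", 1050),   # 50 characters after the second anchor: counted
]
print(count_nearby_persons(toy_mentions))
```

Note that a person mention lying within 200 characters of two different `Deutsche Bank` mentions is counted twice, once per pair; whether that is desirable depends on how co-occurrence is meant to be defined.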

In [16]:
aggregated_co_occurrences = dict(co_occurrences)

In [17]:
top_5_persons = sorted(aggregated_co_occurrences.items(), key=lambda x: x[1], reverse=True)[:5]

# Print the number of co-occurrences and the name of the top 5 frequently occurring persons
print("Number of co-occurrences: ", sum(aggregated_co_occurrences.values()))
for person in top_5_persons:
    print("Name: ", person[0])

Number of co-occurrences:  129
Name:  MREL
Name:  Main
Name:  Jeffrey Epstein
Name:  Leverage Ratio
Name:  Warburg Invest
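
The windowed counting step can be sketched on toy data so that each name is reported together with its count. The tuple layout `(label, text, start)` mirrors how the cell above indexes each mention; the sample mentions and the helper name `count_person_cooccurrences` are made up for illustration and are not part of the assignment code.

```python
from collections import Counter


def count_person_cooccurrences(chunks, anchor="Deutsche Bank", window=200):
    """Count PERSON mentions whose start offset lies within `window`
    characters of an `anchor` mention in the same chunk."""
    counts = Counter()
    for mentions in chunks:
        # Start offsets of every anchor mention in this chunk
        anchor_starts = [start for _, text, start in mentions if text == anchor]
        for anchor_start in anchor_starts:
            for label, text, start in mentions:
                if label == "PERSON" and abs(start - anchor_start) <= window:
                    counts[text] += 1
    return counts


# Hypothetical mentions: (label, text, start offset)
toy_chunks = [
    [("ORG", "Deutsche Bank", 0), ("PERSON", "Alice Example", 50),
     ("PERSON", "Bob Example", 500)],
    [("PERSON", "Alice Example", 10), ("ORG", "Deutsche Bank", 100),
     ("PERSON", "Carol Example", 150)],
]

counts = count_person_cooccurrences(toy_chunks)
for name, n in counts.most_common(5):
    print(f"{name}: {n}")
```

`Counter.most_common(5)` returns `(name, count)` pairs, so the name and its number of co-occurrences can be printed side by side.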


4. Look back at the results of your previous task. Are the `PERSON` entities returned by your co-occurrence method the same ones that appear most frequently by raw counts?

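Yes. Under both methods, the most frequent `PERSON` entity is 'MREL'. Note that 'MREL' is a regulatory term (Minimum Requirement for own funds and Eligible Liabilities) rather than a person, so this agreement reflects a tagging error by the NER model.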
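
One way to make such a comparison explicit is to compare the top-k lists of both methods directly. The sketch below uses invented counts and a hypothetical helper `top_k_names`; the real inputs would be the raw-frequency counts from the previous task and the co-occurrence counts from this one.

```python
from collections import Counter


def top_k_names(counts, k=5):
    """Return the k names with the highest counts, most frequent first."""
    return [name for name, _ in Counter(counts).most_common(k)]


# Hypothetical counts, for illustration only
raw_counts = {"A": 40, "B": 30, "C": 20, "D": 10}
cooccurrence_counts = {"B": 9, "A": 7, "E": 3}

top_raw = top_k_names(raw_counts, k=3)
top_cooccurring = top_k_names(cooccurrence_counts, k=3)

# Names that appear in both top-k lists, in raw-count order
shared = [name for name in top_raw if name in top_cooccurring]
# Whether both methods agree on the single most frequent entity
same_leader = top_raw[0] == top_cooccurring[0]

print("Shared:", shared)
print("Same top entity:", same_leader)
```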

## 3. Neural Models with Huggingface (3 + 5 + 2 = 10 Points)

For state-of-the-art performance, most text-related tasks nowadays use some variation of the Transformer architecture. A particular advantage is the ready availability of weights for models that have been pre-trained on large general-purpose datasets, which reduces the amount of domain-specific labeled training data that is needed.

In this task, we will explore the [Huggingface](https://hf.co/) ecosystem to see in which way Transformer models can be used.
One of the central aspects of the Huggingface platform is the so-called [Model Hub](https://huggingface.co/models), where you can find many different models uploaded by community members for a variety of tasks.

Because neural models are generally very expensive to run, this exercise is limited to less data than the previous questions.

### Sub Task 1: Loading Transformer Models (3 Points)

1. Install the `transformers` library and load the model `cardiffnlp/twitter-roberta-base-sentiment-latest` to classify a sequence.
2. Report the result of the prediction on the test sequence.

In [19]:
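from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig

model = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(model)
config = AutoConfig.from_pretrained(model)
classifier = AutoModelForSequenceClassification.from_pretrained(model)

input_text = "Das ist ein Test."

# Tokenize the input, run a forward pass, and map the top logit to its label
encoded = tokenizer(input_text, return_tensors="pt")
logits = classifier(**encoded).logits
prediction = config.id2label[logits.argmax(dim=-1).item()]
print(prediction)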
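
A sequence-classification model returns one raw score (logit) per class, and the reported prediction is usually the class with the highest softmax probability. The following is a minimal sketch of that post-processing step in plain Python. The logits and the `id2label` mapping are hypothetical values chosen for illustration, not actual output of the model above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtract the maximum before exponentiating for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values, not actual model output
id2label = {0: "negative", 1: "neutral", 2: "positive"}
logits = [-1.2, 2.3, 0.1]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```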

### Sub Task 2: Using Pipelines (5 Points)

The most succinct way of using a Transformer model is the [`transformers.pipeline`](https://huggingface.co/docs/transformers/pipeline_tutorial). You can check out the linked tutorial for more information on the topic, but essentially, `pipeline` provides a lightweight wrapper around a number of popular NLP tasks.
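
Conceptually, such a wrapper chains three steps: preprocessing (tokenization), a forward pass through the model, and postprocessing into labels and scores. The toy class below illustrates only that structure. `ToyPipeline`, `toy_tokenizer`, and `toy_model` are invented stand-ins and do not reflect how `transformers` implements its pipelines.

```python
class ToyPipeline:
    """Conceptual stand-in for a task pipeline: preprocess -> forward -> postprocess."""

    def __init__(self, tokenizer, model, id2label):
        self.tokenizer = tokenizer
        self.model = model
        self.id2label = id2label

    def __call__(self, text):
        features = self.tokenizer(text)   # preprocess
        scores = self.model(features)     # forward pass
        best = max(range(len(scores)), key=scores.__getitem__)
        # postprocess into a list of {"label", "score"} dicts
        return [{"label": self.id2label[best], "score": scores[best]}]


def toy_tokenizer(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def toy_model(tokens):
    """Hypothetical scoring rule: share of tokens that end in '!'."""
    excited = sum(tok.endswith("!") for tok in tokens) / max(len(tokens), 1)
    return [1.0 - excited, excited]


pipe = ToyPipeline(toy_tokenizer, toy_model, {0: "CALM", 1: "EXCITED"})
result = pipe("Covid cases are increasing fast!")
print(result)
```

The caller only passes raw text and receives a ready-to-read prediction, which is the convenience that a pipeline offers over wiring the tokenizer and model together by hand.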

1. Instead of manually defining a pipeline, now load a model through a `"text-classification"` pipeline. Look up the neural model that is loaded by default, and post the link to its [model card](https://huggingface.co/docs/hub/model-cards) below.


In [None]:
from transformers import pipeline

# No model is passed, so the pipeline falls back to the library's default for this task
sentiment_task = pipeline("text-classification")
sentiment_task("Covid cases are increasing fast!")


2. Now, instead, load a pipeline for `"text-classification"`, but with a custom model and tokenizer. Use the Model Hub platform to find the most popular model for the German language (by number of downloads) and pass this model (and its tokenizer) to the pipeline explicitly. Re-run the previous example, and report the prediction result.


In [28]:
model = 'xlm-roberta-large'
tokenizer = AutoTokenizer.from_pretrained(model)

# Instantiate the pipeline with custom components (both model and tokenizer)
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer)

# Output the prediction by your pipe on the test sample.
print(pipe("das ist ein test"))

Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.den

[{'label': 'LABEL_1', 'score': 0.621100127696991}]
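
The warning above states that some classifier weights were newly initialized, which suggests that the reported label and score come from an untrained classification head rather than from learned behavior. The sketch below illustrates the effect with a randomly initialized linear head over a fixed feature vector. The feature values, the weight scale, and the helper name `random_head_prediction` are assumptions made for illustration.

```python
import math
import random


def random_head_prediction(features, num_labels=2, seed=0):
    """Apply a randomly initialized linear head followed by softmax."""
    rng = random.Random(seed)
    # Small random weights stand in for an untrained classification layer
    weights = [[rng.gauss(0.0, 0.02) for _ in features] for _ in range(num_labels)]
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(num_labels), key=probs.__getitem__)
    return {"label": f"LABEL_{best}", "score": probs[best]}


features = [0.5, -1.0, 0.25, 2.0]  # hypothetical encoder output
for seed in range(3):
    print(seed, random_head_prediction(features, seed=seed))
```

Because the head depends only on the random seed, the label carries no task-specific meaning. A checkpoint that was fine-tuned for classification would be needed for an interpretable prediction.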


3. Keeping in line with the previous exercises, let us now try and actually predict something with the model. Re-load a pipeline, this time for Named Entity Recognition, using the default model.

In [None]:
## YOUR CODE

4. Run the pipeline with the text from the Deutsche Bank report from Question 2 and output the results.

In [None]:
## YOUR CODE

print( ## YOUR CODE
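
Transformer encoders accept only a limited number of tokens per input, so a full report usually has to be split before it is passed to a pipeline. The following sketch splits text on whitespace into chunks of bounded character length. The helper name `chunk_text` and the limit are assumptions, and character count is only a rough proxy for the model's token limit.

```python
def chunk_text(text, max_chars=1000):
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for word in text.split():
        # Length added by this word, including a joining space if needed
        extra = len(word) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Deutsche Bank published its annual report. " * 5
chunks = chunk_text(sample, max_chars=60)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Each chunk could then be passed to the pipeline separately and the per-chunk results concatenated. Note that an entity spanning a chunk boundary would be split by this simple approach.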

5. Look at the results. Something looks strange here; why is it not working properly? Elaborate your answer.

YOUR ANSWER HERE

### Sub Task 3: Using Datasets through Huggingface (2 Points)

Besides the `transformers` library for model training and inference, Huggingface offers other libraries that can be used without any neural models.
In particular, the `datasets` library provides a centralized and streamlined way of accessing a variety of different datasets.

1. Using the `datasets` library, load the `imdb` dataset.

In [None]:
## YOUR CODE

2. Report the mean length of `text` column for the training, validation and test split, respectively.


In [None]:
## YOUR CODE
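
The per-split statistic can be sketched independently of the download step. The example below computes the mean character length of a `text` column for each split of a dictionary of row lists. `toy_splits` is a hypothetical stand-in for the object returned by `datasets.load_dataset("imdb")`, and the helper name `mean_text_length` is an assumption.

```python
from statistics import mean


def mean_text_length(splits, column="text"):
    """Return the mean character length of `column` for every split."""
    return {
        name: mean(len(row[column]) for row in rows)
        for name, rows in splits.items()
    }


# Hypothetical stand-in for the splits of a loaded dataset.
# With the real library, one would presumably write:
#   from datasets import load_dataset
#   splits = load_dataset("imdb")
toy_splits = {
    "train": [{"text": "Great film.", "label": 1}, {"text": "Dull.", "label": 0}],
    "test": [{"text": "Fine, I guess.", "label": 1}],
}

lengths = mean_text_length(toy_splits)
for split_name, value in lengths.items():
    print(f"{split_name}: {value}")
```

Iterating over the returned splits by name also shows which splits actually exist. To our knowledge, the `imdb` dataset on the Hub provides `train`, `test`, and `unsupervised` splits but no dedicated validation split, which is worth checking before reporting.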