# Assignment 2: Theses

---

## Task 2) Theses Inspiration

Imagine you'd have to write another thesis, and you just can't find a good topic to work on.
Well, n-grams to the rescue!
Download the `theses.txt` data set from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approx. 1,000 theses topics chosen by students in the past.

In this assignment, you will be sampling from n-grams to generate new potential thesis topics.
Pay extra attention to preprocessing: How would you handle hyphenated words and acronyms/abbreviations?

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [173]:
# Dependencies
import re
from collections import Counter

import pandas as pd
import numpy as np
import nltk

N = 5

### Prepare the Data

1.1 Spend some time on pre-processing. How would you handle hyphenated words and abbreviations/acronyms?

In [174]:
def load_theses_titles(filepath):
    """Loads all theses titles and returns them as a list."""
    ### YOUR CODE HERE
    colnames=["year", "organisation", "type", "title"] 

    df = pd.read_csv(filepath, sep="\t", names=colnames, header=None)
    
    return list(df["title"])
    ### END YOUR CODE

In [175]:
TOKENS_PATTERN = r"""(
    [A-Za-zÄÖÜäöüß0-9]+(?:[-'][A-Za-zÄÖÜäöüß0-9]+)*   # Wörter inkl. Umlaute, Bindestrich, Apostroph
  | \d+,\d+                                           # deutsche Kommazahlen (z. B. 3,14)
  | [.,!?;:()\[\]„“"«»]                               # Satzzeichen (inkl. deutsche Anführungszeichen)
  | \S                                                # sonstige Einzelzeichen
)"""


def tokenize(text, pattern=TOKENS_PATTERN):
    """Tokenizes a single Tweet."""
    ### YOUR CODE HERE
    return re.findall(pattern, text, re.VERBOSE)
    ### END YOUR CODE

def preprocess(data):
    """Preprocesses and tokenizes the given theses titles for further use."""
    ### YOUR CODE HERE
    return [tokenize(title) for title in data]
    ### END YOUR CODE

In [176]:
theses_data = preprocess(load_theses_titles("C:\\Users\\Felix\\PythonProjects\\seqlrn_assignments\\2-markov-chains\\data\\theses.tsv"))
print(theses_data)

[['EMail', 'am', 'Beispiel', 'SMTP', 'im', 'Internet'], ['Einführung', 'des', 'Configuration', 'Management-Systems', 'PCMS', 'zur', 'strukturierten', 'Versions-Release', '-', 'und', 'Änderungskontrolle', 'in', 'Projekten', 'der', 'Abteilung', 'Information-Systems', 'der', 'Firma', 'Motorola'], ['Analyse', 'und', 'Leistungsvergleich', 'von', 'zwei', 'Echtzeitsystemen', 'für', 'eingebettete', 'Anwendungen'], ['Erfassung', 'und', 'automatische', 'Zuordnung', 'von', 'Auftragsdaten', 'für', 'ein', 'Dienstleistungsunternehmen', 'mit', 'Hilfe', 'von', 'Standardsoftware', '-', 'Konzeption', 'und', 'Realisierung'], ['Organisationskonzept', 'zur', 'Administration', 'von', 'Lehrgangsrechnern', 'für', 'eine', 'DV-Fortbildungsinstitution'], ['Monte', 'Carlo-Simulation', 'für', 'ein', 'gekoppeltes', 'Round-Robin-System'], ['Untersuchung', 'der', 'elektrischen', 'Eigenschaften', 'von', 'Supraleitern', 'mit', 'Hilfe', 'eines', 'Gas-Kryosystems'], ['Prioritätsfreies', 'Scheduling', 'in', 'verteilten', 

### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5]. What about \<s> and \</s>?

In [177]:
def build_n_gram_models(n, data):
    """This method does calculate all n-grams up to the given n."""
    ### YOUR CODE HERE
    n_gram_models = []

    for i in range(1, n+1):
        i_gram = []
        for title in data:
            i_gram.extend(nltk.ngrams(sequence=title, n=i))
            
        n_gram_models.append(i_gram)

    return n_gram_models
    ### END YOUR CODE

In [178]:
n_gram_models = build_n_gram_models(N, theses_data)

### Generate the Titles

3.1 Write a generator that provides thesis titles of desired length. Please do not use the available `lm.generate` method but write your own.

3.2 How can you incorporate seed words?

3.3 How do you handle </s> tokens (w.r.t. the desired length?)

3.4 If you didn't just copy what nltk's lm.generate does: compare the outputs.

In [179]:
# Notice: If you fix the seed in numpy.random.choice, you get reproducible results.
np.random.seed(666)

def sample_next_token(prev, n_gram_model):
    """Samples the next word for the given n_grams."""
    ### YOUR CODE HERE
    n = len(n_gram_model[0])
             
    if len(prev) > n - 1:
        raise ValueError(f"{n}-gram needs {n} tokens for prediction.")

    ngram_counts = Counter(n_gram_model)
    all_sequences_count = sum(ngram_counts.values())
    candidates = [(sequence, count / all_sequences_count) for sequence, count in ngram_counts.items() 
                  if prev == list(sequence[:-1])]
    
    if len(candidates) < 1:
        random_suggestion = np.random.choice(list(ngram_counts.keys()))
        print("No candidate found, returning random token!")
        return random_suggestion[-1]

    candidates.sort(key=lambda c: c[1], reverse=True)

    return candidates[0][0][-1]  
    ### END YOUR CODE


def generate(n, n_gram_models, seed, title_length):
    """Generates a thesis title using the n-grams, a seed word, and desired title length."""

    seq = ["<s>", seed]

    for i in range(2, n+1):
        context = seq[-(i-1):]
        suggestion = sample_next_token(context, n_gram_models[i-1])
        seq.append(suggestion)
        if suggestion in ["</s>", "."]:
            break

    while seq[-1] not in ["</s>", "."] and len(seq) - 1 < title_length:
        context = seq[-(n-1):]
        suggestion = sample_next_token(context, n_gram_models[n-1])
        seq.append(suggestion)

    return " ".join([t for t in seq if t not in ("<s>", "</s>")])

    ### END YOUR CODE

In [188]:
title_length = 20
seed_word =  "Entwicklung"
thesis_title = generate(N, n_gram_models, seed_word, title_length)
print(thesis_title)

seed_word =  "Cloud"
thesis_title = generate(N, n_gram_models, seed_word, title_length)
print(thesis_title)

No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
Entwicklung eines Konzepts zur Implementierung einer " On-Screen-Tastatur " in Panels und PC basierenden Systemen unter der der : Tablet-Umgebung
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found