# 1. The Norwegian UD treebank: A Nynorsk text dataset


# Summary

The Norwegian UD treebank is based on the Nynorsk section of the Norwegian
Dependency Treebank (NDT), which is a syntactic treebank of Norwegian.
NDT has been automatically converted to the Universal Dependencies (UD)
scheme by Lilja Øvrelid at the University of Oslo.

# Introduction

NDT was developed from 2011 to 2014 at the National Library of Norway in collaboration
with the Text Laboratory and the Department of Informatics at the
University of Oslo.
NDT contains around 300,000 tokens taken from a variety of genres.
The treebank texts have been manually annotated for morphosyntactic
information. The morphological annotation mainly follows
the [Oslo-Bergen Tagger](http://tekstlab.uio.no/obt-ny/).  The syntactic
annotation follows, to a large extent, the Norwegian Reference
Grammar, as well as a dependency annotation scheme formulated at the
outset of the annotation project and iteratively refined throughout
the construction of the treebank. For more information, see the
references below.

## Run the code below to load the dataset and words:

In [None]:
from utils import read_data

nynorskCorpus, nynorskTokens = read_data("../datasets/UD_Norwegian-Nynorsk-master/no_nynorsk-ud-train.conllu")
# nynorskCorpus, nynorskTokens = read_data("../datasets/UD_Norwegian-Nynorsk-master/no_nynorsk-ud-test.conllu")
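
The `read_data` helper lives in `utils` and is not shown in this notebook. As a rough sketch of what reading a `.conllu` file involves, the function below assumes the standard CoNLL-U layout: `#` comment lines, ten tab-separated columns per token (ID first, word form second), and a blank line between sentences. The name `parse_conllu` and the three-token sample are made up for illustration and are not taken from the treebank or from `utils`.

```python
# Illustrative sketch only: `parse_conllu` is a hypothetical helper,
# not the notebook's actual `read_data` implementation.
def parse_conllu(text):
    """Return (sentences, tokens) extracted from CoNLL-U formatted text."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        fields = line.split("\t")
        token_id, form = fields[0], fields[1]
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        current.append(form)
    if current:
        sentences.append(current)
    tokens = [tok for sent in sentences for tok in sent]
    return sentences, tokens


# Made-up sample in CoNLL-U layout (not a real treebank entry).
sample = (
    "# text = Eg les.\n"
    "1\tEg\teg\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tles\tlese\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t$.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)
sentences, tokens = parse_conllu(sample)
print(sentences)
print(tokens)
```

Whether `read_data` lowercases tokens, drops punctuation, or keeps other columns is not stated here, so the sketch keeps every word form unchanged.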

## 1.1 Visualize the text corpus

Let's take a closer look at all the words that are available in the dataset.

### Word cloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(nynorskTokens))
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Word Cloud")
plt.show()

### Frequency of words

Check which words are the most frequent in the dataset.

In [None]:
import pandas as pd
from nltk import FreqDist
import seaborn as sns

freq_dist = FreqDist(nynorskTokens)
df_freq_dist = pd.DataFrame(list(freq_dist.items()), columns=["Token", "Frequency"])
df_freq_dist = df_freq_dist.sort_values(by="Frequency", ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x="Token", y="Frequency", data=df_freq_dist.head(20))
plt.title("Top 20 Most Frequent Tokens")
plt.xticks(rotation=45, ha="right")
plt.show()

### Word lengths

How long are the words in the dataset?

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(range(len(nynorskTokens)), [len(token) for token in nynorskTokens], alpha=0.5)
plt.title("Scatter Plot of Word Lengths")
plt.xlabel("Token Index")
plt.ylabel("Word Length")
plt.show()

# 2. Let's train a simple language model for sentence generation

<div style="display: flex; flex-direction: row;">
    <img src="resources/ngram.png" alt="Image 1" width="500px" height="150px" style="margin-left: 300px;">
</div>

N-gram models are fundamental language models for text generation. They provide a probabilistic framework for capturing the structure and patterns in a sequence of words. In natural language processing, an n-gram is a contiguous sequence of n items, typically words. An n-gram model estimates the likelihood of a word from its context, that is, the preceding n-1 words. The key assumption is that the probability of a word depends only on this limited history, which keeps the computation feasible. These models are simple and efficient, which makes them foundational in language processing tasks such as text generation, machine translation, and speech recognition. While n-gram models capture local dependencies well, they struggle with long-range dependencies and miss the broader semantic context that more advanced models such as transformers can capture.
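
For the bigram case (n = 2), the estimate described above reduces to P(word | prev) = count(prev, word) / count(prev as a context). The sketch below shows that computation on a tiny made-up token list. The function names `train_bigram_counts` and `bigram_prob` are hypothetical and are not the `train_ngram_model` implementation from `utils`, which is not shown here.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def bigram_prob(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never a context."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0


# Made-up toy corpus, not taken from the treebank.
toy_tokens = "eg les ei bok og eg les ei avis".split()
counts = train_bigram_counts(toy_tokens)

print(bigram_prob(counts, "les", "ei"))   # 1.0: "les" is always followed by "ei"
print(bigram_prob(counts, "ei", "bok"))   # 0.5: "ei" is followed by "bok" or "avis"
print(bigram_prob(counts, "bok", "avis")) # 0.0: this pair never occurs
```

Generating text then amounts to repeatedly sampling the next word from the distribution conditioned on the current last word.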

In [None]:
# Train a simple bigram model on the training data loaded above.

from utils import train_ngram_model
ngram_model = train_ngram_model(nynorskTokens)

In [None]:
# Generate text starting with a seed
from utils import generate_text

seed = 'Hallo hvordan går det'  # The initial prompt for generating a sentence.
generated_text = generate_text(ngram_model, seed, length=50)  # Generate sample text using the given function.
print(generated_text)

# 3. Large Language Models

<div style="display: flex; flex-direction: row;">
    <img src="resources/lstm.png" alt="Image 1" width="500px" height="150px" style="margin-right: 10px;">
    <img src="resources/transformer.png" alt="Image 2" width="500px" height="100" style="margin-right: 10px;">
</div>

Neural language models, such as those based on recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and transformer architectures, have shown superior performance compared to n-gram models in various natural language processing (NLP) tasks. Here are some reasons why neural language models are generally considered better:

***Long-range Dependencies:***

N-gram models capture dependencies up to a fixed number of preceding words (the "n" in n-gram). Neural language models, especially transformer architectures, can capture long-range dependencies in a sequence of words, allowing them to model more complex relationships.
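
To make the fixed window concrete, the sketch below extracts the history an n-gram model would condition on. The helper `ngram_context` and the two English example phrases are made up for illustration. Both phrases differ only in an early word, so a bigram or trigram model sees exactly the same context and cannot use that difference to choose between "is" and "are".

```python
def ngram_context(tokens, n):
    """Return the last n-1 tokens: the only history an n-gram model conditions on."""
    if n < 2:
        return ()  # a unigram model uses no history at all
    return tuple(tokens[-(n - 1):])


plural = "the keys to the cabinet".split()
singular = "the key to the cabinet".split()

for n in (2, 3):
    # The contexts are identical, although the subject differs in number.
    print(n, ngram_context(plural, n), ngram_context(singular, n))
```

Only with n = 5 would the window reach back far enough to include "keys" or "key", and longer phrases would push the subject out of range again.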

***Parameter Efficiency:***

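An n-gram model stores a separate count for every n-gram it observes, so its table grows rapidly with the vocabulary size and with n. Neural language models instead share parameters across contexts, which lets them represent and learn from large amounts of data with relatively few parameters. This is crucial given the vast amount of information present in natural language.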

***Continuous Embeddings:***

Neural models represent words as continuous embeddings in a high-dimensional space. This continuous representation allows the model to capture semantic relationships between words, which is challenging for discrete representations used in n-grams.
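
A common way to compare embeddings is cosine similarity. The sketch below uses hand-picked three-dimensional toy vectors, which are an assumption for illustration only: real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-picked toy vectors, NOT learned embeddings.
toy_embeddings = {
    "bok": [0.9, 0.1, 0.0],      # "book"
    "avis": [0.8, 0.2, 0.1],     # "newspaper"
    "springe": [0.0, 0.1, 0.9],  # "run"
}

sim_related = cosine_similarity(toy_embeddings["bok"], toy_embeddings["avis"])
sim_unrelated = cosine_similarity(toy_embeddings["bok"], toy_embeddings["springe"])
print(sim_related > sim_unrelated)  # True
```

In an n-gram model, by contrast, "bok" and "avis" are just two different table keys with no notion of closeness between them.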

***Generalization:***

Neural models generalize better to unseen or rare words because they learn continuous representations that can capture similarities between words. N-gram models struggle with out-of-vocabulary words and may not generalize well to unseen contexts.

***Adaptability to Task Complexity:***

Neural models can adapt to the complexity of different NLP tasks by fine-tuning or adjusting hyperparameters. N-gram models have limited capacity to adapt to different tasks without modifying the n-gram order, which may not be practical.

***Handling Variable-Length Contexts:***

Neural models can handle variable-length contexts, making them more flexible in processing sequences of different lengths. In contrast, n-gram models require fixed-length contexts, which can be limiting.

***Contextual Information:***

Models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) build a context-dependent representation of each word. BERT processes the entire sequence bidirectionally, while GPT processes it unidirectionally, from left to right. In both cases the representation of a word reflects the words around it, which gives richer context than a fixed n-gram window.

***State-of-the-Art Performance:***

Neural language models have achieved state-of-the-art performance on various NLP benchmarks, including tasks such as language modeling, machine translation, text summarization, and sentiment analysis.

Despite the advantages of neural language models, n-gram models can still be useful in certain scenarios, especially when dealing with limited resources or when a simple model is sufficient for the task at hand. The choice between n-gram models and neural language models often depends on the specific requirements of the task, the amount of available data, and computational resources.

## Tokenization

Check out this link for a tokenization playground for ChatGPT: https://platform.openai.com/tokenizer

<div style="display: flex; flex-direction: row;">
    <img src="resources/tokenizer.png" alt="Image 1" width="700px" height="300px" style="margin-left: 200px;">
</div>

In [None]:
from utils import tokenize_text
tokens = tokenize_text("Supercalifragilisticexpialidocious")
print(tokens)


tokens = tokenize_text("ordbokstavrimkonkurranse")
print(tokens)


tokens = tokenize_text("Enter your word here")
print(tokens)
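
The `tokenize_text` helper is not shown here, and GPT models use byte-pair encoding (BPE) with a learned vocabulary. As a simpler stand-in that still shows how a long word breaks into subword pieces, the sketch below uses greedy longest-match against a tiny vocabulary. Both the function `greedy_subword_tokenize` and the vocabulary are made up for illustration, so the split it prints is not what a real GPT tokenizer would produce.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` by repeatedly taking the longest prefix found in `vocab`."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first and shrink until one matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Nothing in the vocabulary matches: fall back to one character.
            tokens.append(word[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary, not a real tokenizer vocabulary.
toy_vocab = {"ord", "bok", "stav", "rim", "konkurranse"}
pieces = greedy_subword_tokenize("ordbokstavrimkonkurranse", toy_vocab)
print(pieces)  # ['ord', 'bok', 'stav', 'rim', 'konkurranse']
```

The single-character fallback means any input can be tokenized, which mirrors how subword tokenizers avoid out-of-vocabulary words.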

## Running an LLM locally requires substantial resources and time, so in the next section we will use the OpenAI API for remote access to a GPT model.