# Introduction to Language Models

Language models are machine learning models trained to learn a probability distribution over sequences of words. They have been used for various applications, including text generation, question answering, text summarization, language translation, and speech recognition.

These models have existed since the 1980s and fall mainly into two kinds: (1) **statistical models**, which use statistical techniques such as N-grams and hidden Markov models and are interpretable and better suited to cases where the amount of training data is small, and (2) **neural models**, which use neural networks trained on large amounts of data.

Humans are inherently good at estimating the probability of the next word. For example, if asked which of the two sentences below we are more likely to encounter, we know that the probability of the first sentence is greater than that of the second.

$$
  p(\text{jupiter is the largest planet}) > p(\text{jupiter is the largest moon})
$$

Language modeling aims to train models that can make comparisons like the one above.



## N-Gram

> An N-gram is a sequence of N words (or tokens).

Take the sentence: `The Sun is the Solar System's star and by far its most massive component.`

The unigrams (1-grams) of the above sentence, with punctuation removed, are: `"The", "Sun", "is", "the", "Solar", "Systems", "star", "and", "by", "far", "its", "most", "massive", "component"`

The bigrams (2-grams) of the above sentence are: `"The Sun", "Sun is", "is the", "the Solar", "Solar Systems", "Systems star", "star and", "and by", "by far", "far its", "its most", "most massive", "massive component"`

N-gram models estimate the probability of a word given the (N-1) previous words. For example, trigram models estimate the probability of a word given the two previous words.
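The unigram and bigram lists above can be reproduced with a few lines of plain Python. This is a minimal sketch. The helper name `ngrams` is an assumption made for illustration and is not part of the tutorial's code.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


# Tokens of the example sentence, with punctuation already removed.
tokens = "The Sun is the Solar Systems star and by far its most massive component".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(bigrams[:3])  # [('The', 'Sun'), ('Sun', 'is'), ('is', 'the')]
```

A sentence with 14 tokens yields 14 unigrams and 13 bigrams, since each bigram needs one extra neighbouring token.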
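Written out for the trigram case, this corresponds to the standard N-gram approximation. The notation is introduced here for illustration: $w_i$ is the $i$-th word of a sentence, and $C(\cdot)$ is the number of times a word sequence occurs in the training text. The probability of a sentence is approximated by a product of conditional probabilities, and each conditional probability can be estimated from counts:

$$
  p(w_1, \dots, w_n) \approx p(w_1)\, p(w_2 \mid w_1) \prod_{i=3}^{n} p(w_i \mid w_{i-2}, w_{i-1}),
  \qquad
  p(w_i \mid w_{i-2}, w_{i-1}) \approx \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}
$$

The count ratio is the maximum likelihood estimate: of all the times the two-word context was seen, it is the fraction of times the context was followed by $w_i$.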

## Building an N-Gram Language Model

To demystify how language models work in principle, we will **build an n-gram language model** from scratch in Python.

We will be using the [arXiv Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv?resource=download) for this tutorial. In particular, we will use the abstracts of all papers classified under the astrophysics category, i.e., with a category value of `astro-ph`.

The raw dataset has already been cleaned and stored in a pickle file for you. 

In [1]:
from collections import Counter, defaultdict
import nltk
import random
import re
import numpy as np
import pandas as pd

In [2]:
from ssec_tutorials import ASTROPH_ARXIV_ABSTRACTS, download_astroph_arxiv_abstracts

In [3]:
download_astroph_arxiv_abstracts()

astro-ph arXiv abstracts already exist at /Users/lsetiawan/.cache/ssec_tutorials/astro-ph-arXiv-abstracts.pkl


PosixPath('/Users/lsetiawan/.cache/ssec_tutorials/astro-ph-arXiv-abstracts.pkl')

In [4]:
ASTROPH_ARXIV_ABSTRACTS

PosixPath('/Users/lsetiawan/.cache/ssec_tutorials/astro-ph-arXiv-abstracts.pkl')

In [5]:
astro_ph_df = pd.read_pickle(ASTROPH_ARXIV_ABSTRACTS)

In [6]:
display(astro_ph_df.head())
f"There are {astro_ph_df.shape[0]} papers in our dataset."

| | id | title | abstract |
|---|---|---|---|
| 0 | 704.0009 | The Spitzer c2d Survey of Large, Nearby, Inste... | We discuss the results from the combined IRA... |
| 1 | 704.0017 | Spectroscopic Observations of the Intermediate... | Results from spectroscopic observations of t... |
| 2 | 704.0023 | ALMA as the ideal probe of the solar chromosphere | The very nature of the solar chromosphere, i... |
| 3 | 704.0044 | Astrophysical gyrokinetics: kinetic and fluid ... | We present a theoretical framework for plasm... |
| 4 | 704.0048 | Inference on white dwarf binary systems using ... | We report on the analysis of selected single... |


'There are 338658 papers in our dataset.'

For demonstration purposes, we won't use the entire dataset. Instead, we will sample 10,000 papers from it.

In [7]:
sampled_astro_ph_df = astro_ph_df.sample(10000)
assert sampled_astro_ph_df.shape[0] == 10000

It's important to clean raw text data because computers, unlike humans, do not know that the words `many` and `Many` mean the same thing. We will use regular expressions and the `nltk` library to clean our raw text as part of data preprocessing.

In [8]:
# sampled_astro_ph_df.head()

### Data Preprocessing

In [9]:
# Stitch all abstracts together
abstracts = " ".join(list(sampled_astro_ph_df.abstract))

In [10]:
# Collapse runs of whitespace into a single space
abstracts = re.sub(r"\s+", " ", abstracts)

In [11]:
# Remove punctuation: keep only digits, letters, and spaces
abstracts = re.sub(r"[^0-9A-Za-z ]", "", abstracts)

In [12]:
# Lowercase all data
abstracts = abstracts.lower()
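The three cleaning steps above can be bundled into one function and checked on a short string. This is a sketch under assumptions: the helper name `clean_text` is hypothetical, and the character class assumes the intent is to keep only digits, letters, and spaces.

```python
import re


def clean_text(text: str) -> str:
    """Collapse whitespace, drop punctuation, and lowercase the text."""
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    # Assumption: keep only digits, ASCII letters, and spaces.
    text = re.sub(r"[^0-9A-Za-z ]", "", text)
    return text.lower()


cleaned = clean_text("Many  galaxies,\n many stars!")
print(cleaned)  # many galaxies many stars
```

After cleaning, `Many` and `many` map to the same token, which is the point of the lowercasing step.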


An important part of any text processing pipeline is the tokenizer, which breaks unstructured natural language text into discrete elements (tokens). For this tutorial, we will use nltk's word tokenizer, which separates words and punctuation.

In [13]:
# nltk word tokenizer requires Punkt sentence tokenization models.
nltk.download("punkt")

[nltk_data] Downloading package punkt to /Users/lsetiawan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
abstract_tokens = nltk.tokenize.word_tokenize(abstracts)

In [15]:
abstract_tokens[:10]

['the',
 'parkes',
 'pulsar',
 'data',
 'archive',
 'currently',
 'provides',
 'access',
 'to',
 'data']

Let's also remove stop words, which are very common words (such as "the," "a," "an," or "in") that carry little meaning on their own.

In [16]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lsetiawan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
# Remove stop words
stopwords = nltk.corpus.stopwords.words("english")

In [18]:
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [19]:
abstract_tokens = [token for token in abstract_tokens if token not in stopwords]
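With a large token list, the filtering step may run faster if the stop words are stored in a `set`, because membership tests on a set take constant time on average while a list is scanned element by element. The sketch below uses a tiny, hypothetical stop word set so that it runs without downloading anything.

```python
# Hypothetical, tiny stop word set used only for this illustration.
stop_words = {"the", "a", "an", "in", "is", "of"}

tokens = ["the", "sun", "is", "a", "star", "in", "the", "milky", "way"]

# Keep only the tokens that are not stop words.
filtered = [token for token in tokens if token not in stop_words]
print(filtered)  # ['sun', 'star', 'milky', 'way']
```

In the tutorial's setting, the same idea would amount to wrapping the nltk stop word list in `set(...)` before filtering.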

### Trigram model

The `nltk` package provides a helper to build trigrams out of a given sequence of tokens.

In [20]:
list(
    nltk.trigrams(
        ["Sun", "is", "the", "largest", "star", "our", "solar", "system", "."]
    )
)

[('Sun', 'is', 'the'),
 ('is', 'the', 'largest'),
 ('the', 'largest', 'star'),
 ('largest', 'star', 'our'),
 ('star', 'our', 'solar'),
 ('our', 'solar', 'system'),
 ('solar', 'system', '.')]

In [21]:
# Initialize a model.
model = defaultdict(lambda: defaultdict(lambda: 0))

In [22]:
# Compute the frequency of the word given its two previous words.
for w1, w2, w3 in nltk.trigrams(abstract_tokens):
    model[(w1, w2)][w3] += 1

Print 10 randomly chosen words that follow "black hole" in our corpus, along with their counts. Note that once these counts are normalized, the trigram model can tell you the probability of a word given the two previous words.

In [23]:
random.sample(list(model[("black", "hole")].items()), 10)

[('large', 2),
 ('event', 2),
 ('following', 2),
 ('keeping', 1),
 ('embh', 1),
 ('strong', 1),
 ('centers', 1),
 ('source', 1),
 ('interiors', 1),
 ('harris', 1)]

In [24]:
# Transform frequency to probabilities
for key in model:
    total_count = float(sum(model[key].values()))
    for word in model[key]:
        model[key][word] /= total_count

Given "black hole" as a prompt, these are the five most probable next words according to our model.

In [25]:
sorted(model["black", "hole"].items(), key=lambda item: item[1], reverse=True)[:5]

[('mass', 0.11749116607773852),
 ('bh', 0.04328621908127209),
 ('binaries', 0.03268551236749117),
 ('masses', 0.03180212014134275),
 ('spin', 0.030035335689045935)]
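The counting and normalization steps above can be checked by hand on a toy corpus. This is a minimal sketch that uses only the standard library. The nine-word corpus and the variable names `counts` and `probabilities` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus, already tokenized.
tokens = "black hole mass black hole spin black hole mass".split()

# Count how often each word follows each pair of preceding words.
counts = defaultdict(Counter)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    counts[(w1, w2)][w3] += 1

# Divide each count by the total count for its two-word context.
probabilities = {
    context: {word: count / sum(followers.values()) for word, count in followers.items()}
    for context, followers in counts.items()
}

print(counts[("black", "hole")])
print(probabilities[("black", "hole")])
```

Here "black hole" is followed by "mass" twice and by "spin" once, so the estimated probabilities are 2/3 and 1/3, and the probabilities for every context sum to 1.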

### Text Generation

We can use our model iteratively to generate sentences from a starting piece of text.

In [26]:
def text_generator(text: list[str]) -> str:
    sentence_finished = False

    while not sentence_finished:

        # Collect the words seen after the last two words in text,
        # sorted from most to least probable.
        model_sorted_probabilities = sorted(
            model[tuple(text[-2:])].items(), key=lambda item: item[1], reverse=True
        )
        # Pick one of these candidates uniformly at random
        # (the probabilities are not used as sampling weights).
        word = model_sorted_probabilities[
            np.random.choice(np.arange(len(model_sorted_probabilities)))
        ][0]

        text.append(word)

        # Finish once the text, including the prompt, reaches 20 words.
        if len(text) >= 20:
            sentence_finished = True

    return " ".join(text)

In [27]:
text_generator(["black", "hole"])

'black hole prediction general relativity must recovered match tight constraints possible lorentz invariance greatly improved snr latter index allowing present'
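Note that `text_generator` chooses among the candidate next words with equal chance, regardless of their probabilities. A common alternative is to sample each word in proportion to its probability. The sketch below shows one way to do that with the standard library. The function name `sample_next_word` and the probabilities in `candidates` are hypothetical.

```python
import random

# Hypothetical next-word probabilities for a single two-word context.
candidates = {"mass": 0.6, "spin": 0.3, "binaries": 0.1}


def sample_next_word(candidates: dict[str, float], rng: random.Random) -> str:
    """Draw one word, with each word's chance proportional to its probability."""
    words = list(candidates)
    weights = list(candidates.values())
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
samples = [sample_next_word(candidates, rng) for _ in range(1000)]

# Frequent words should be drawn more often than rare ones.
print({word: samples.count(word) for word in candidates})
```

With weighted sampling, likely continuations such as "mass" appear far more often than rare ones, which tends to make the generated text resemble the training text more closely.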

N-grams are simple models, but they become computationally expensive as N grows, because the number of possible contexts to track increases rapidly. This makes it hard to capture the long-range context of the words in a sequence.

Cold start is also a serious problem for such models. For example, if our model had never encountered `black hole` in the raw text, it would have no candidate next words to offer for that prompt.
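One simple way to cope with an unseen context is to stop generating as soon as the model has no candidates. The sketch below illustrates this on a hypothetical toy model. The names `toy_model` and `generate` are assumptions for illustration, and stopping early is only one possible choice (others, such as falling back to a shorter context, are not shown here).

```python
import random

# Hypothetical toy model: (w1, w2) -> {next word: probability}.
toy_model = {
    ("black", "hole"): {"mass": 1.0},
    ("hole", "mass"): {"function": 1.0},
}


def generate(prompt: list[str], model: dict, max_words: int = 20, seed: int = 0) -> str:
    """Extend the prompt word by word, stopping early on an unseen context."""
    rng = random.Random(seed)
    text = list(prompt)  # copy so the caller's list is not modified
    while len(text) < max_words:
        candidates = model.get(tuple(text[-2:]))
        if not candidates:
            break  # unseen context: stop instead of raising an error
        words = list(candidates)
        weights = list(candidates.values())
        text.append(rng.choices(words, weights=weights, k=1)[0])
    return " ".join(text)


print(generate(["black", "hole"], toy_model))   # black hole mass function
print(generate(["dark", "matter"], toy_model))  # dark matter
```

Using `model.get(...)` on a plain dictionary also avoids silently adding empty entries for unseen contexts, which is what a lookup on a `defaultdict` would do.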

#### Your turn 😎

Try sending different inputs to the text generator model and reflect on what happens. 

In [None]:
text_generator(["<add your first word>", "<add your second word>"])