# Introduction to Language Models

Language models are a type of machine learning model trained to learn a probability distribution over sequences of words. They have been used for various applications, including text generation, question answering, text summarization, language translation, and speech recognition.

These models have existed since the 1980s and fall into two main categories: (1) **statistical models**, which use statistical techniques such as N-grams and hidden Markov models, and (2) **neural models**, which are built on neural networks.

Humans are inherently good at judging how likely a word is to come next. For example, if asked which of the two sentences below you are more likely to encounter, you would know that the first is more probable than the second.

$$
  p(\text{jupiter is the largest planet}) > p(\text{jupiter is the largest moon})
$$

Language modeling aims to train models that can make this kind of judgment: assign higher probability to the word sequences that are more likely to occur.
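
To make the comparison concrete, the standard chain rule of probability can be used to break the probability of a sentence into next-word probabilities. The notation is introduced here for illustration: $w_i$ is the $i$-th word and $n$ is the sentence length.

$$
  p(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \dots, w_{i-1})
$$

The two sentences above share their first four words, so the comparison reduces to $p(\text{planet} \mid \text{jupiter is the largest}) > p(\text{moon} \mid \text{jupiter is the largest})$.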



## N-Gram

> An N-gram is a sequence of N words (or tokens).

Take the sentence: `The Sun is the Solar System's star and by far its most massive component.`

The unigrams (1-grams) of this sentence are: `"The", "Sun", "is", "the", "Solar", "Systems", "star", "and", "by", "far", "its", "most", "massive", "component"`

The bigrams (2-grams) of this sentence are: `"The Sun", "Sun is", "is the", "the Solar", "Solar Systems", "Systems star", "star and", "and by", "by far", "far its", "its most", "most massive", "massive component"`
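
The lists above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `ngrams` is made up for illustration, and the sentence is split on whitespace only, so punctuation stays attached to the words (for example `component.`).

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The Sun is the Solar System's star and by far its most massive component."
# Naive whitespace tokenization; punctuation is not stripped here.
tokens = sentence.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(bigrams[:3])
```

A sentence with 14 tokens yields 14 unigrams and 13 bigrams; in general, a sequence of `len(tokens)` tokens has `len(tokens) - n + 1` N-grams.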

An N-gram model estimates the probability of a word given the (N-1) words that precede it. A trigram model, for example, estimates the probability of a word given the two previous words.
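
Written as a formula, with $w_i$ denoting the $i$-th word of a sequence (notation introduced here for illustration), an N-gram model approximates the full history of a word by its last $N-1$ words:

$$
  p(w_i \mid w_1, \dots, w_{i-1}) \approx p(w_i \mid w_{i-N+1}, \dots, w_{i-1})
$$

For a trigram model ($N = 3$), the right-hand side is $p(w_i \mid w_{i-2}, w_{i-1})$.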

## Building an N-Gram Language Model

To demystify how language models work in principle, we will **build an n-gram language model** from scratch in Python.

We will use the [arXiv Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv?resource=download) for this tutorial. In particular, we will use the abstracts of all papers classified under the astrophysics category, i.e., those with a category value of `astro-ph`.

The raw dataset has already been cleaned and stored in a pickle file for you. 

In [250]:
from collections import Counter, defaultdict
import nltk
import random
import re
import numpy as np
import pandas as pd


In [169]:
astro_ph_df = pd.read_pickle("../../resources/astro-ph-arXiv-abstracts.pkl")

In [170]:
display(astro_ph_df.head())
f"There are {astro_ph_df.shape[0]} papers in our dataset."

| Unnamed: 0 | id | title | abstract |
|---|---|---|---|
| 0 | 712.2086 | On weak and strong magnetohydrodynamic turbulence | Recent numerical and observational studies c... |
| 1 | 712.2103 | Hilltop Curvatons | We study ``hilltop'' curvatons that evolve o... |
| 2 | 712.211 | Near-field cosmology with the VLT | With the arrival of wide-field imagers on me... |
| 3 | 712.2111 | The prototype colliding-wind pinwheel WR 104 | Results from the most extensive study of the... |
| 4 | 712.2116 | X-ray spectral evolution of TeV BL Lac objects... | Many of the extragalactic sources detected i... |


'There are 331564 papers in our dataset.'

For demonstration purposes, we won't use the entire dataset; instead, we will sample 10,000 papers from it.

In [171]:
sampled_astro_ph_df = astro_ph_df.sample(10000)
assert sampled_astro_ph_df.shape[0] == 10000

It's important to clean raw text data because computers, unlike humans, do not know that the words `many` and `Many` mean the same thing. We will clean the raw text with regular expressions and then use the `nltk` library for tokenization and stop-word removal as part of data preprocessing.

### Data Preprocessing

In [172]:
# Stitch all abstracts together
abstracts = " ".join(list(sampled_astro_ph_df.abstract))

In [173]:
# Collapse runs of whitespace into a single space (raw string avoids an invalid-escape warning)
abstracts = re.sub(r"\s+", " ", abstracts)

In [174]:
# Remove punctuation: keep only letters, digits, hyphens, and spaces
abstracts = re.sub(r"[^-0-9A-Za-z ]", "", abstracts)

In [175]:
# Lowercase all data
abstracts = abstracts.lower()
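
To see what these cleaning steps do, it can help to run them on a short made-up string. The sketch below wraps the three steps in a hypothetical helper, `clean_text`; the character class that keeps letters, all digits, hyphens, and spaces is an assumption about the intended behavior.

```python
import re


def clean_text(text: str) -> str:
    """Collapse whitespace, drop punctuation, and lowercase (illustrative helper)."""
    text = re.sub(r"\s+", " ", text)
    # Assumed pattern: keep letters, digits, hyphens, and spaces only.
    text = re.sub(r"[^-0-9A-Za-z ]", "", text)
    return text.lower()


print(clean_text("Many  X-ray sources\n(e.g., Cyg X-1) were detected in 2019!"))
# many x-ray sources eg cyg x-1 were detected in 2019
```

Note that abbreviations such as `e.g.` collapse into `eg`, so some tokens in the cleaned text are not ordinary words.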


An important part of any text processing pipeline is the tokenizer, which breaks unstructured natural-language text into discrete elements (tokens). For this tutorial, we will use nltk's word tokenizer, which separates words and punctuation marks.

In [176]:
# nltk's word tokenizer requires the Punkt sentence tokenization models.
# Newer NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/a42/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [115]:
abstract_tokens = nltk.tokenize.word_tokenize(abstracts)

In [116]:
abstract_tokens[:10]

['understanding',
 'the',
 'properties',
 'of',
 'the',
 'hot',
 'corona',
 'is',
 'very',
 'important']

Let's also remove stop words, which are very common words (such as "the", "a", "an", or "in") that carry little content on their own.

In [117]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/a42/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [118]:
# Load nltk's list of English stop words
stopwords = nltk.corpus.stopwords.words('english')

In [119]:
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [120]:
# Remove stop words; a set makes each membership test fast
stopword_set = set(stopwords)
abstract_tokens = [token for token in abstract_tokens if token not in stopword_set]

### Trigram model

The `nltk` package provides a helper, `nltk.trigrams`, that builds trigrams from a sequence of tokens.

In [121]:
list(nltk.trigrams(["Sun", "is", "the", "largest", "star", "our", "solar", "system", "."]))

[('Sun', 'is', 'the'),
 ('is', 'the', 'largest'),
 ('the', 'largest', 'star'),
 ('largest', 'star', 'our'),
 ('star', 'our', 'solar'),
 ('our', 'solar', 'system'),
 ('solar', 'system', '.')]

In [122]:
# Initialize the model: it maps a (w1, w2) context to the counts of each following word w3.
model = defaultdict(lambda: defaultdict(lambda: 0))

In [126]:
# Count how often each word follows each pair of preceding words.
for w1, w2, w3 in nltk.trigrams(abstract_tokens):
    model[(w1, w2)][w3] += 1

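Let's print the words that can follow "black hole" when it is given as a prompt to our language model. Note that a trigram model gives the probability of a word given the two preceding words; at this stage, the model still stores raw counts rather than probabilities.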

In [132]:
model[("black", "hole")]

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'x-ray': 69,
             'candidate': 51,
             'populations': 3,
             'population': 12,
             'spin': 108,
             'expanding': 3,
             'background': 12,
             'present': 21,
             'put': 3,
             'address': 3,
             'observational': 3,
             'candidates': 72,
             'space-time': 3,
             'mergers': 57,
             'work': 3,
             'nut': 3,
             'finally': 3,
             'integrable': 3,
             'accommodate': 3,
             'few-percent': 3,
             'review': 3,
             'bbh': 30,
             'systems': 42,
             'one': 9,
             'accretion': 102,
             'spins': 24,
             'components': 3,
             'mass': 429,
             'properties': 15,
             'average': 3,
             'sun': 3,
             'progenitor': 6,
             'massive': 15,
             'b

In [133]:
# Transform frequency to probabilities
for key in model:
    total_count = float(sum(model[key].values()))
    for word in model[key]:
        model[key][word] /= total_count
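
As a small illustration of this normalization, the sketch below converts made-up counts for a single context into probabilities. The helper `normalize` is hypothetical; unlike the loop above, it returns a new dictionary instead of overwriting the counts.

```python
def normalize(counts: dict[str, int]) -> dict[str, float]:
    """Turn raw follow-up counts into probabilities that sum to 1."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


toy_counts = {"mass": 2, "spin": 1, "accretion": 1}  # made-up counts
toy_probs = normalize(toy_counts)
print(toy_probs)
# {'mass': 0.5, 'spin': 0.25, 'accretion': 0.25}
```

Returning a new dictionary keeps the raw counts available, which can be useful if you later want to inspect or combine them.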

Given "black hole" as a prompt, these are the five most probable next words according to our model.

In [186]:
sorted(model['black', 'hole'].items(), key = lambda item: item[1], reverse=True)[:5]

[('mass', 0.11550888529886914),
 ('masses', 0.03715670436187399),
 ('bh', 0.034733441033925685),
 ('spin', 0.029079159935379646),
 ('accretion', 0.027463651050080775)]
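
These values are relative frequencies. Writing $C(\cdot)$ for a trigram count (notation introduced here), the estimate is:

$$
  p(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3)}{\sum_{w} C(w_1, w_2, w)}
$$

The top entry can be checked by hand. The denominator 3714 is not printed anywhere above; it is inferred from the listed count for `mass` (429) and its reported probability, and it agrees with the printed count and probability for `spin`.

$$
  p(\text{mass} \mid \text{black}, \text{hole}) = \frac{429}{3714} \approx 0.1155, \qquad
  p(\text{spin} \mid \text{black}, \text{hole}) = \frac{108}{3714} \approx 0.0291
$$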

### Text Generation

We can use our model iteratively to generate text from a two-word prompt: at each step, the last two words of the text determine the candidates for the next word.

In [257]:
def text_generator(text: list[str], max_words: int = 20) -> str:
    text = list(text)  # Copy so the caller's list is not modified.

    while len(text) < max_words:
        # Look up the words seen after the last two words in text.
        candidates = model.get(tuple(text[-2:]))
        if not candidates:
            break  # Unseen context: the model has nothing to suggest.

        # Sample the next word in proportion to its probability.
        words = list(candidates)
        probabilities = np.array([candidates[word] for word in words], dtype=float)
        word = np.random.choice(words, p=probabilities / probabilities.sum())

        text.append(str(word))

    return ' '.join(text)

In [258]:
text_generator(["black", "hole"])

'black hole higher dimensional representations eg models pure glue ordinary quarks confined two flux tubes implied triangular configuration baryons within'
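
The generator has to choose one word among the candidates at each step. One common option is to draw words in proportion to their probabilities, so that likelier continuations appear more often. The sketch below illustrates this with the standard library's `random.choices`; the probabilities and the helper name `sample_next_word` are made up for illustration.

```python
import random

# Hypothetical next-word probabilities for a single context.
toy_next_word_probs = {"mass": 0.6, "spin": 0.3, "accretion": 0.1}


def sample_next_word(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one word, with likelier words drawn more often."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_next_word(toy_next_word_probs, rng) for _ in range(1000)]
print({word: draws.count(word) for word in toy_next_word_probs})
```

Over many draws, the frequency of each word should approach its probability, whereas always taking the single most probable word would produce the same text every time.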

N-gram models are simple, but they become expensive in memory and computation as N grows, because capturing longer-range context means the number of distinct contexts to store increases rapidly.
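
A rough way to see this growth: with a vocabulary of $V$ distinct words, there are at most $V^N$ possible N-grams. The vocabulary size below is an illustrative assumption, not a figure measured on this dataset.

$$
  V = 10^4: \qquad V^2 = 10^8 \text{ possible bigrams}, \qquad V^3 = 10^{12} \text{ possible trigrams}
$$

Only a small fraction of these ever occur in real text, but the number of observed N-grams, and the amount of data needed to estimate them reliably, still grows quickly with $N$.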

Cold start is also a serious problem with such models. For example, what should the model do if it has never encountered `black hole` in the training text?
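
One common way to soften this problem, which this tutorial does not implement, is to back off to a shorter context when the full one is unseen. The sketch below is a deliberately simple version of that idea (it is not one of the discounted back-off schemes used in practice); all names and the toy token list are hypothetical.

```python
from collections import Counter


def next_word_counts(tokens: list[str]) -> tuple[dict, dict, Counter]:
    """Count follow-up words for two-word, one-word, and empty contexts."""
    trigram_counts: dict[tuple[str, str], Counter] = {}
    bigram_counts: dict[str, Counter] = {}
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counts.setdefault((w1, w2), Counter())[w3] += 1
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram_counts.setdefault(w1, Counter())[w2] += 1
    return trigram_counts, bigram_counts, Counter(tokens)


def most_likely_next(context: tuple[str, str], tri: dict, bi: dict, uni: Counter) -> str:
    """Back off from the trigram table to bigrams, then to unigrams."""
    if context in tri:
        return tri[context].most_common(1)[0][0]
    if context[1] in bi:
        return bi[context[1]].most_common(1)[0][0]
    return uni.most_common(1)[0][0]


tokens = "black hole mass dark matter halo mass black hole mass function".split()
tri, bi, uni = next_word_counts(tokens)
print(most_likely_next(("neutron", "matter"), tri, bi, uni))
# halo
```

The pair `("neutron", "matter")` never occurs in the toy text, so the lookup falls back to what follows `matter` alone; a context whose last word is also unseen falls back to the most frequent word overall.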