# Learning linguistic regularities

While language data are highly unstructured, language itself _is_ highly structured. Generally, if we are thinking statistically, we can think of how language is _distributed_.

A _distribution_ describes how often linguistic events -- such as sounds, words, elements of a story or narrative -- appear and when they appear.

<img src="https://upload.wikimedia.org/wikipedia/commons/3/3e/Dirichlet_distributions.png" alt="Picture of four different Dirichlet distributions" width="400" />

Language is structured in a systematic way. That is, words do not appear in random locations, and words do not occur with just any other words. The non-randomness of language is what allows us to learn from text data with statistical models.

Here are some of the ways that language is structured that we can leverage for natural language processing applications:

* Grammatical rules
  * There are grammatical dependencies between words (e.g., **subject-verb agreement**; we say "bananas are" and "banana is", but not "bananas is")
  * In some languages like Spanish, there are other agreement systems, including between nouns (things) and adjectives (modifiers). For example, "**la** cas**a** pequeñ**a**" versus "**el** gat**o** pequeñ**o**."
* Word forms
  * Words often sound like other words that mean similar things (e.g., "clap", "clack", "clang")
  * The way we pronounce words (e.g., louder, with a higher pitch, by emphasizing certain sounds) can tell us what the speaker means (e.g., "Click on the BLUE bat" versus "Click on the blue bat.")
* Word orders are often very predictable
  * Some languages, like English, mostly rely on word order to express grammatical relationships (e.g., who did what to whom)
  * Other languages, like Turkish, use word _forms_ in the form of _morphology_ to express grammatical relationships, in addition to word order
* Words typically occur with related words
  * If we are talking about cats, other types of pets may come up
  * The types of ways we talk about cats and dogs are similar

## Order in language data really matters

Compare the following two sequences of words:

1. all of a sudden I notice three of my pals standing on the sidewalk
2. on pals all I of notice sidewalk of three a sudden my standing the

It is easy to show that the count statistics of individual tokens will be identical between (1) and (2), even though (2) is scrambled. 

```python
from collections import Counter
from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data

# The two of these will have identical word frequency distributions
normal = "all of a sudden I notice three of my pals standing on the sidewalk"
scrambled = "on pals all I of notice sidewalk of three a sudden my standing the"

# frequencies of normal sentence == frequencies of scrambled sentence
Counter(word_tokenize(normal)) == Counter(word_tokenize(scrambled))
```

However, one of the texts is sensical and a "better" instance of an English sentence. Therefore, we want a **language model** that can best account for language data. 
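One way to see that order separates the two sequences is to count adjacent word pairs instead of single words. The sketch below is only illustrative: it assumes plain whitespace splitting as a stand-in for a real tokenizer, and `bigram_counts` is a hypothetical helper name, not something defined earlier in these notes.

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs; whitespace splitting is a simplifying assumption."""
    tokens = text.split()
    return Counter(zip(tokens, tokens[1:]))

normal = "all of a sudden I notice three of my pals standing on the sidewalk"
scrambled = "on pals all I of notice sidewalk of three a sudden my standing the"

# Single-word counts cannot tell the two sequences apart...
same_unigrams = Counter(normal.split()) == Counter(scrambled.split())
# ...but counts of adjacent pairs can
same_bigrams = bigram_counts(normal) == bigram_counts(scrambled)
# Pairs that survive the scrambling
shared_bigrams = set(bigram_counts(normal)) & set(bigram_counts(scrambled))

print(same_unigrams, same_bigrams, shared_bigrams)
```

Only one adjacent pair is common to both sequences, so even a very simple order-sensitive statistic distinguishes them.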

One of the major approaches in NLP over the last 30 years has been to characterize how much a computer can "understand" by focusing on its ability to "model" the language it experiences. Usually, this is quantified using _probabilities_.

* <u>Linking assumption</u>: The greater the _probability_ (likelihood) a model assigns to an utterance, the better it knows the language we are asking it to process.

#### This means that we need to learn or know as many factors as possible that affect why language data looks like it does.

# Language Modeling

Broadly, a **language model** is any system that can accurately characterize the way languages look. A language model is NOT a model of language per se -- it will always be incomplete. Most commonly, language models are _statistical_ and learn from observations of words in text corpora. However, language models can be trained over all kinds of linguistic entities, such as the sounds within words, and have [even been applied to genetic sequences](https://github.com/rostlab/SeqVec).

Remember the dictionaries that we created? These stored counts of individual words and bigrams and are among the simplest kinds of language model. This is because we can easily translate the **counts** into **probabilities** as below:

$p(word==w) = \large \frac{count(w)}{\sum_{w_i}count(w_i)}$

This statistic is known as the **unigram probability** of a word, or its **unconditional** probability -- that is, when we don't know _anything_ about when that word occurs, we say that it is not **conditioned** on anything.
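As a minimal sketch of the formula above, the counts-to-probabilities step might look like the following. The toy sentence and the names `unigram_counts` and `unigram_probs` are assumptions made for illustration; they are not the dictionaries built earlier.

```python
from collections import Counter

# Toy corpus (an assumption for illustration, not the abstracts data)
tokens = "the cat sat on the mat and the dog sat too".split()

# count(w) for every word w
unigram_counts = Counter(tokens)

# Denominator: the sum of the counts of all words
total_tokens = sum(unigram_counts.values())

# p(word == w) = count(w) / total number of tokens
unigram_probs = {w: c / total_tokens for w, c in unigram_counts.items()}

print(unigram_probs["the"])
```

Because every count is divided by the same total, the resulting probabilities sum to 1.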

When we calculate more complex statistics -- such as the frequency of two words co-occurring one after the other (e.g., _parsing algorithms_) -- we are interested in the **joint probability** because we have multiple events that we are combining. The joint probability of two words co-occurring in a particular order is known as a **bigram probability** and can be characterized as follows:

$p(w_1==\text{``parsing"} \cap w_2==\text{``algorithms"}) = \large \frac{count(w_1 w_2)}{\sum_{w_{i}} \sum_{w_{i + 1}} count(w_i w_{i + 1})}$

So, if we count up the number of pairs of words we observe, we can get a _bigram probability_. This is analogous to the classic problems in probability in which a coin is tossed twice and we want to know the chances of getting "heads" followed by "tails", or in which two dice are rolled and we want to know how often we see a 1 and a 6.

To get the unconditional probability of a bigram (i.e., the frequency of a bigram relative to all bigrams), you need to know the total number of bigrams in your corpus:

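```python
from collections import defaultdict
from nltk.tokenize import word_tokenize

# we *can* compute bigram probabilities the same way as single-word frequencies
# (`abstracts` is assumed to be the collection of abstract strings from earlier)
abstract_bigrams_dict = defaultdict(int)
for abstract in abstracts:
  tokenized_abstract = word_tokenize(abstract)
  for i, word in enumerate(tokenized_abstract):
    # the last token has no following word, so it starts no bigram
    if i < (len(tokenized_abstract) - 1):
      next_word = tokenized_abstract[i + 1]
      bigram = (word, next_word)
      abstract_bigrams_dict[bigram] += 1

# denominator for the unconditional bigram probability
total_number_of_bigrams = sum(abstract_bigrams_dict.values())
```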
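Once the total is known, each bigram count can be turned into an unconditional (joint) probability by dividing by it. The following is a small sketch under stated assumptions: `toy_abstracts` is a made-up stand-in for the real abstracts, and whitespace splitting replaces the tokenizer so that the example runs on its own.

```python
from collections import Counter

# Hypothetical toy data standing in for the real abstracts
toy_abstracts = [
    "parsing algorithms are fast",
    "parsing algorithms are useful",
    "fast algorithms are useful",
]

bigram_counts = Counter()
for abstract in toy_abstracts:
    tokens = abstract.split()  # simplifying assumption: whitespace tokens
    bigram_counts.update(zip(tokens, tokens[1:]))

total_bigrams = sum(bigram_counts.values())

# p(w1 w2) = count(w1 w2) / total number of bigrams
bigram_probs = {bg: c / total_bigrams for bg, c in bigram_counts.items()}

print(bigram_probs[("parsing", "algorithms")])
```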

# Quiz yourself:

What do we need to change about the above code to handle trigrams and even larger sequences?

## Computing conditional probabilities

Now that you have a dictionary of word and bigram counts, you can compute conditional probabilities. Formally, for a pair of words $w_1$ and $w_2$ that occur in sequence in a sentence, these are defined as:

$\large p(w_2|w_1) = \large \frac{p(w_1 w_2)}{p(w_1)}$

We already know $p(w_1)$. To compute the probability of a _single_ word, all we have to do is find the frequency of $w_1$ and divide it by the sum of the counts of all of the words.

We also already know the count of $w_1 w_2$ because we computed the frequencies of the words occurring together. For example, the frequency of "parsing algorithms" is 21 (from the above).

In order to compute the _unconditional probability of a bigram_ ($p(w_1 w_2)$), all we have to do is divide that count by the count of all of the pairs of words in our dataset (i.e., the total number of bigrams we have seen).
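In a large corpus the total number of tokens and the total number of bigrams are nearly equal, so the ratio in the definition of $p(w_2|w_1)$ above is approximately $count(w_1 w_2) / count(w_1)$. The sketch below uses that count ratio directly, which keeps the values between 0 and 1 even on a tiny example. The toy sentences and the helper name `conditional_probability` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical toy data; whitespace splitting stands in for a real tokenizer
toy_sentences = [
    "parsing algorithms are fast",
    "parsing algorithms are useful",
    "fast algorithms are useful",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in toy_sentences:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def conditional_probability(w1, w2):
    """Estimate p(w2 | w1) as count(w1 w2) / count(w1)."""
    if unigram_counts[w1] == 0:
        return 0.0  # w1 was never observed, so there is nothing to condition on
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(conditional_probability("are", "useful"))
print(conditional_probability("parsing", "algorithms"))
```

In this toy data "are" is followed by "useful" in two of its three occurrences, while "parsing" is always followed by "algorithms".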

## Zipf's Law and sparsity

Language data are distributed in a very "concentrated" or predictable way. While we do not always know what someone is going to say next, we can generally make a pretty good guess and not be too far off. Language is much more predictable than, say, coin flips or dice rolls. The non-randomness of language can be seen in how "peaky" many of the distributions that we use to analyze corpora are.

A peaky distribution also has a long tail of extremely rare events. These rare events can cause a major problem for NLP because sometimes:

1. They are a nuisance (for example, they may be typos)
2. They are exactly what we want to know (they uniquely characterize a text)

The most important of these distributions is the [Zipf-Mandelbrot distribution](https://en.wikipedia.org/wiki/Zipf%E2%80%93Mandelbrot_law), under which a word's frequency is roughly inversely proportional to its frequency rank.

In keeping with this, very few bigrams occur very many times, and many bigrams occur very few times. This pattern is commonly seen with co-occurrence data across fields, such as the prevalence of different species in a natural area, what songs appear together on playlists, etc.
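The "few frequent, many rare" pattern can be inspected by ranking items from most to least frequent and looking at how many occur only once. The following is a rough sketch on a made-up sentence, which is far too small to show a true Zipfian curve; running the same steps over a real corpus would be needed to see the characteristic shape.

```python
from collections import Counter

# Toy text (an assumption for illustration; too small for a real Zipfian curve)
text = "the cat and the dog and the bird saw the fish in the pond near the old mill"
word_counts = Counter(text.split())  # whitespace splitting as a stand-in tokenizer

# Rank words from most to least frequent
ranked = word_counts.most_common()
for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count)

# Words that occur exactly once make up the long tail
singletons = [word for word, count in word_counts.items() if count == 1]
print(len(singletons), "of", len(word_counts), "word types occur only once")
```

The same ranking can be applied to a dictionary of bigram counts to compare how quickly the two kinds of counts fall off.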

# Quiz yourself:

* How do you think Zipf's law interacts with n-grams of different sizes?
* How do you think the distribution of bigram frequencies is different from a distribution of all pairs if they co-occurred _independently_?