# Chapter 0: Basic concepts required to understand LLMs

Before you start, I would like to clarify that this notebook serves as a refresher on the background concepts required to understand LLMs. It doesn't delve into everything in NLP; rather, it focuses on a limited set of concepts at a bird's-eye level.

If you want to dive deeper into NLP, here are some resources I recommend:

That being said, let's dive in and lay the foundations before Chapter 1, which introduces LLMs!

## What is Natural Language Processing?


**Natural language processing (NLP)** is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. 

The result is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Basically, NLP is a way of communicating with computers in the natural language that humans use.

## Approaches in NLP

The different approaches used to solve NLP problems commonly fall
into three categories: 

<ul> 1. Heuristics </ul>
<ul> 2. Machine learning </ul>
<ul> 3. Deep learning </ul>

## 1. Heuristics-based NLP

Similar to other early AI systems, early attempts at designing NLP systems were based on building rules for the task at hand. 

This required that the developers had some expertise in the domain to formulate rules that could be incorporated into a program. Such systems also required resources like dictionaries and thesauruses, typically compiled and digitized over a period of time. Examples of such heuristics-based NLP include lexicon-based sentiment analysis, regular expressions (regex), and context-free grammars.
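
To make the idea concrete, here is a minimal sketch of lexicon-based sentiment analysis. The `LEXICON` dictionary, its scores, and the `lexicon_sentiment` function are hypothetical and written only for illustration; a real system would rely on a much larger, carefully compiled lexicon.

```python
import re

# Hypothetical, hand-written lexicon: word -> sentiment score (an assumption).
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

def lexicon_sentiment(text):
    """Sum the lexicon scores of the words in `text` and map the total to a label."""
    # A regex acts as a simple rule-based tokenizer: runs of lowercase letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(LEXICON.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I love this great coffee"))        # positive
print(lexicon_sentiment("The service was terrible"))        # negative
print(lexicon_sentiment("I had coffee at the coffeeshop"))  # neutral
```

Notice that every rule here was written by hand: the program knows nothing about words that are missing from the lexicon, which is the main weakness of this approach.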


## 2. Machine Learning-based NLP

Machine learning techniques are applied to textual data just as they’re used on other forms of data, such as images, speech, and structured data.

Supervised machine learning techniques such as classification and regression methods are heavily used for various NLP tasks. As an example, an NLP classification task would be to classify news articles into a set of news topics like sports or politics. 

On the other hand, regression techniques, which give a numeric prediction, can be used to estimate the price of a stock based on processing the social media discussion about that stock. 

Similarly, unsupervised clustering algorithms can be used to group similar text documents together.

Some commonly used algorithms:

<ul> 1. Naive Bayes Classifier </ul>
<ul> 2. Support Vector Machine </ul>
<ul> 3. Hidden Markov model </ul>
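
As a rough illustration of the first algorithm in the list, the sketch below trains a tiny Naive Bayes classifier for the news-topic example mentioned earlier. The training headlines in `TRAIN` and the helper names are made up for this notebook, and add-one smoothing is my assumption; libraries such as scikit-learn provide production-ready versions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set of (headline, topic) pairs.
TRAIN = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the new bill", "politics"),
    ("the minister won the election", "politics"),
]

def train_naive_bayes(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(TRAIN)
print(predict("the striker won the match", *model))     # sports
print(predict("the minister passed the bill", *model))  # politics
```

Unlike the heuristic approach, nothing here is a hand-written rule: the word statistics are learned from the labeled examples.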

## 3. Deep Learning for NLP

In the last few years, we have seen a huge surge in using neural networks to deal with complex, unstructured data. Language is inherently complex and unstructured.

Therefore, we need models with better representation and learning
capability to understand and solve language tasks. 

A few popular deep neural network architectures that can be used:

<ul> 1. RNN </ul>
<ul> 2. LSTM </ul>
<ul> 3. CNN </ul>

## Transformers: Talk of the town

The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. The Transformer was proposed in the paper "Attention Is All You Need". It is recommended reading for anyone interested in NLP.

Transformers model the textual context, but not in a sequential manner. Given a word in the input, a Transformer looks at all the words around it (a mechanism known as self-attention) and represents each word with respect to its context. For example, the word “bank” can have different meanings depending on the context in which it appears. If the context talks about finance, then “bank” probably denotes a financial institution.

On the other hand, if the context mentions a river, then it probably indicates a bank of the river. Transformers can model such context and hence have been used heavily in NLP tasks due to this higher representation capacity as compared to other deep networks.
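
The "bank" example can be sketched with a heavily simplified version of self-attention. This is only an illustration under strong assumptions: the 2-dimensional vectors in `EMBEDDINGS` are invented, and the queries, keys, and values are the input vectors themselves, whereas a real Transformer learns separate projection matrices for them and uses multiple attention heads.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: queries, keys, and values are the inputs themselves."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Scaled dot product between this word and every word in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # New representation: attention-weighted average of all word vectors.
        mixed = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-d embeddings, made up for illustration only.
EMBEDDINGS = {"river": [1.0, 0.0], "money": [0.0, 1.0], "bank": [0.5, 0.5]}

river_ctx, _ = self_attention([EMBEDDINGS["river"], EMBEDDINGS["bank"]])
money_ctx, _ = self_attention([EMBEDDINGS["money"], EMBEDDINGS["bank"]])
print("bank near river:", river_ctx[1])
print("bank near money:", money_ctx[1])
```

The same input vector for "bank" ends up with a different output vector depending on its neighbor, which is the sense in which attention produces context-dependent representations.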

## What is a language model?

This course is about large language models, but before we jump into them, it's important to understand what a language model is.

**From Wikipedia**: A language model is a probabilistic model of a natural language that can generate probabilities of a series of words, based on text corpora in one or multiple languages it was trained on. If that made your head spin, don't worry! I am going to simplify it for you.

In any language, we have vocabulary and grammar that help us communicate effectively and clearly. Those rules aren't clear to computers, which understand only numbers, so language models convert everything into numbers and create meaningful sentences using probability. Let's look at an example:

I had coffee at coffeeshop -- looks good, makes sense \
I had wine at coffeeshop -- possible, but the above one makes more sense \
Wine had coffee at coffeeshop -- what?

Now, a machine will assign a probability to each of these sequences of words after being trained on a plethora of English-language data. It will assign a higher probability to the first statement, a slightly lower one to the second, and a very low one to the third. For example, with illustrative values:

p(I, had, coffee, at, coffeeshop) = 0.02 \
p(I, had, wine, at, coffeeshop) = 0.015 \
p(Wine, had, coffee, at, coffeeshop) = 0.00001

Using a language model, we can perform a host of tasks such as speech recognition, handwriting recognition, machine translation, information retrieval, and natural language generation.

## How is the probability calculated?

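This is done using an autoregressive model, in which each word is predicted from the words that come before it. Such a model can be implemented with a neural network, for example a feed-forward neural network.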

The probability here is calculated based on the chain rule of probability:

$p(x_{1:k}) = p(x_{1}) \cdot p(x_{2} \mid x_{1}) \cdot p(x_{3} \mid x_{1}, x_{2}) \cdots p(x_{k} \mid x_{1:k-1})$

For our example, it will translate to:

p(I, had, coffee, at, coffeeshop) = p(I) * \
                                    p(had | I) * p(coffee | I, had) * \
                                    p(at | I, had, coffee) * \
                                    p(coffeeshop | I, had, coffee, at)
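
The chain rule is just a running product of conditional probabilities, which a few lines of Python can show. The numbers in `CONDITIONALS` are hypothetical values I made up for illustration; a trained model would supply them. The log-probability variant is a common trick to avoid numerical underflow on long sequences.

```python
import math

# Hypothetical conditional probabilities, invented purely for illustration:
# (token, context it is conditioned on, p(token | context)).
CONDITIONALS = [
    ("I", (), 0.2),
    ("had", ("I",), 0.1),
    ("coffee", ("I", "had"), 0.05),
    ("at", ("I", "had", "coffee"), 0.3),
    ("coffeeshop", ("I", "had", "coffee", "at"), 0.1),
]

def chain_rule_probability(conditionals):
    """Multiply p(x_i | x_1 .. x_{i-1}) over all positions in the sequence."""
    probability = 1.0
    for _token, _context, p in conditionals:
        probability *= p
    return probability

def chain_rule_log_probability(conditionals):
    """Same quantity in log space: a sum of logs instead of a product."""
    return sum(math.log(p) for _token, _context, p in conditionals)

print(chain_rule_probability(CONDITIONALS))
print(math.exp(chain_rule_log_probability(CONDITIONALS)))
```

Each factor conditions on a longer context than the previous one, which is where the cost comes from: the model needs an estimate for every possible history.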
                                    
Does that look computationally expensive? It is! 

While we now have deep learning models (such as the feed-forward neural networks mentioned above) that can compute this full form, a more computationally efficient method was used historically: the n-gram model.

**N-gram model**

In an n-gram model, the prediction of a token $x_{i}$ depends only on the last $n-1$ tokens $x_{i-(n-1):i-1}$, not on the entire preceding context $x_{1:i-1}$ as before. The probability now becomes:

$p(x_{i} \mid x_{1:i-1}) = p(x_{i} \mid x_{i-(n-1):i-1})$

In our example, for a 2-gram (bigram) model, each word depends only on the word immediately before it:

p(I, had, coffee, at, coffeeshop) = p(I) * \
                                    p(had | I) * p(coffee | had) * \
                                    p(at | coffee) * \
                                    p(coffeeshop | at)
                                    
Building an n-gram model is computationally feasible and scalable, which made n-grams popular before we had enough computational power to fit neural network models that could capture more information and enable text generation.
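
Here is a minimal sketch of how a bigram model could be estimated by counting, assuming a toy three-sentence corpus that I made up. The `START` marker is also an assumption: it lets the first word be treated like any other, so the p(I) factor becomes p(I | `<s>`). No smoothing is applied, so any unseen word pair gets probability zero.

```python
from collections import Counter

# Hypothetical toy corpus; a real model would be estimated from far more text.
CORPUS = [
    "I had coffee at coffeeshop",
    "I had tea at home",
    "I had coffee at home",
]

START = "<s>"  # assumed start-of-sentence marker

def train_bigram(sentences):
    """Count each context token and each (previous, current) pair."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = [START] + sentence.split()
        for prev, curr in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, curr)] += 1
    return context_counts, bigram_counts

def sentence_probability(sentence, context_counts, bigram_counts):
    """Multiply the maximum-likelihood estimates p(curr | prev) over the sentence."""
    tokens = [START] + sentence.split()
    probability = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        probability *= bigram_counts[(prev, curr)] / context_counts[prev]
    return probability

context_counts, bigram_counts = train_bigram(CORPUS)
print(sentence_probability("I had coffee at coffeeshop", context_counts, bigram_counts))
print(sentence_probability("Wine had coffee at coffeeshop", context_counts, bigram_counts))
```

The model only ever stores counts of word pairs, which is why it is cheap to build, and also why it cannot use any context beyond the previous word.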

**RNNs and Transformers**

RNNs made it possible to take the entire context $x_{1:i-1}$ into account, which means implementing the chain rule of probability above as is, without simplifying it as n-grams do. But they were again quite computationally expensive to train. Transformers, described above, reduced the computational expense without compromising the results by returning to a fixed context length of $n$ tokens, which makes them easier to train in parallel. The good thing is that you can make $n$ large enough for most purposes.

## What makes Large Language Model (LLM) different? 

Now that we have a clear understanding of language models, let's understand what a "large" language model is. LLMs are the most advanced form of language models: they are trained on very large amounts of data (a large portion of the public internet!) and are built on neural networks, most commonly Transformers. "Large" can refer either to the number of parameters in the model or, sometimes, to the number of words in the training dataset. This scale has enabled LLMs to show greater language understanding, generate human-like text, answer questions, and carry out other language-related tasks. If you have played with ChatGPT, you know what I am talking about.

As we saw while reviewing n-grams, RNNs, and Transformers, earlier language models could predict the probability of a single word; modern large language models can predict the probability of sentences and paragraphs, and with the changes announced at OpenAI Dev Day, we might be able to predict the probability of whole documents!

<!-- They also overcome limitation of transformers based language models is them being task-specific. They require task-specific datasets and task-specific fine-tuning  -->

Some very popular LLMs are:
<li> GPT series: Large language models developed by OpenAI. GPT-4 is a multimodal model, which means it can work with both images and text </li>
<li> BERT: Developed by Google, BERT is another very popular LLM </li>
<li> XLNet: Developed by Google + CMU </li>
<li> T5: Developed by Google </li>

While LLMs sound amazing, they are not free from limitations and ill effects. I will dedicate a whole chapter to the limitations and opportunities of LLMs, but to summarize here: the limitations include lack of interpretability, bias, and hallucination, and the risks include loss of jobs and of human creativity. I will be discussing everything in detail, so stay tuned.