# Intro to Large Language Model Data

This course teaches techniques for preparing text and language data for large language models. Before we do this, we should understand the landscape of natural language processing, large language models, and how computers see text.

## What is Natural Language Processing? 

**Natural language processing (NLP)** is a broad collection of ways computers process speech and text. A field that is more than 50 years old, NLP tries to find efficient and accurate ways to convert language into a form computers can work with and perform meaningful tasks with it. For example, it can transcribe speech from audio or summarize several text documents into a single paragraph. We are also seeing natural language processing used to generate images from language prompts. Even simple applications exist, like predicting an emoji response from the sentiment of a text message. Given how much we humans use spoken and written language every day, the applications are seemingly endless. With advances in machine learning and generative AI, the field has only gotten more interesting and exciting.

Linguistics, the study of language, has increasingly married itself to computer science to take advantage of powerful processing and statistical methods that only make sense with the growing availability of language data. Practically the entire world's library has been transcribed to digital form, on top of articles and social media outlets. Data-driven approaches have now become the preferred means of research in linguistics, although traditional heuristics still have their place. After all, one must consider that language is much more than empirically answering the question "what word has the highest probability of coming next?" There is a complex structure to human language with a high volume of rules, as well as a high volume of exceptions to those rules.

Naturally, some language tasks come easier in natural language processing, such as predicting a happy or angry sentiment in a customer review. Others can be difficult and largely unsolved, such as determining if the reviews are fake or generated by bots. A paradox even exists in generative AI research: the more an "AI" improves and sounds human-like, the harder it becomes for another AI to identify its inauthenticity. Take another example of a robotic call operator. The robotic operator can pretty easily help a customer on the helpline by processing their voice feedback, as long as the possible actions are narrow and defined (e.g. asking about a shipment update, or processing a return). But creating a customer service bot for a large company that can resolve every customer issue without ever involving a human is likely impossible.

We are not going to address these open questions. Instead, we will focus on this: even with breakthroughs like ChatGPT, we are still not without challenges in natural language processing. Natural language is messy and full of double meanings and nuance. Typos, grammar issues, and inconsistent formatting alone are time-consuming to deal with. Within natural language processing, we are going to bring our attention to large language models or, more specifically, the preparation of data going into LLMs. Let's talk about large language models next.

## What are (Large) Language Models?

You have probably heard of **large language models (LLMs)**, which are massive mathematical algorithms, trained on an enormous corpus of text data, that can generate and seemingly "understand" language. GPT models (including ChatGPT), Gemini, and LLaMA are several large language models that have made generative AI a mainstream cultural phenomenon. Breakthrough models like these were enabled by the recent innovation of the transformer architecture, further building on previous paradigms like recurrent neural networks and long short-term memory neural networks. While we will not learn how to build these large language models in this course, we will learn some important groundwork in preparing and interpreting language data as understood by these models.


Let's look at an interesting example. Go to [Google Gemini](https://gemini.google.com/) and ask it to rewrite this email to be more diplomatic.

> Hey Sam! You pushed code to the production branch that was full of bugs and failed several unit tests. You need to do better than this!

When you observe its output, consider in amazement how it took that input, tokenized and vectorized it in some way, and then after a number of mathematical operations gave this output. The idea of these large language models is to have an algorithm that can learn the rules and structures of language without being explicitly coded for them. This gives LLMs enormous flexibility to adapt and take on new tasks, particularly in taking a sequence as input and then providing another sequence as output. As we get more compute power and learn to use it more efficiently, it will be interesting to see what benchmarks are met and what new benchmarks will emerge.

It is probably worth attempting a distinction between a **language model** and a large language model. It is much like the distinction between *data* and *big data*, in that one is a much trendier subset of the other. Because of this, there is always a degree of conflating different technologies so they can ride the coattails of the latest thing. For example, a lot of small and medium data problems were treated as big data problems. SQL databases and web API services became caught in the swirl of big data and many capitalized on this. However, you can probably say this has not quite happened to language model applications like speech recognition, handwriting recognition, optical character recognition, and information retrieval. But things like recurrent neural networks and n-gram language models have been superseded by the transformer and feed forward neural network, which are fed by massive amounts of internet data.

## How Computers See Text

When you look at a sentence, a paragraph, an email, a text message, a web page, an article, a novel, what do you see? Let's take a look at the opening paragraph of the timeless Charles Dickens book _A Tale of Two Cities_. 

> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way--in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.

When you see this piece of classic literature, what do you see? What does your mind do? What meaning do you attach to the words? What context?

If you did not know already, this novel took place during the French Revolution. Does that change how you read the paragraph again? 

Now ask yourself this: what does a computer see? Does it extrapolate context about the French Revolution? Qualities of Charles Dickens as an author? The purpose of the contradictions in the prose? Let me simplify things and make it a little less ambitious. Does the computer even know what words are, much less the meaning of the words "best" and "times"?

Well, if we are going to get technical, the computer sees this paragraph as bytes of 1's and 0's in an ASCII/UTF-8 encoding.

```
01001001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01100010 01100101 01110011 01110100 00100000 01101111 01100110 00100000 01110100 01101001 01101101 01100101 01110011 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01110111 01101111 01110010 01110011 01110100 00100000 01101111 01100110 00100000 01110100 01101001 01101101 01100101 01110011 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01100001 01100111 01100101 00100000 01101111 01100110 00100000 01110111 01101001 01110011 01100100 01101111 01101101 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01100001 01100111 01100101 00100000 01101111 01100110 00100000 01100110 01101111 01101111 01101100 01101001 01110011 01101000 01101110 01100101 01110011 01110011 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01100101 01110000 01101111 01100011 01101000 00100000 01101111 01100110 00100000 01100010 01100101 01101100 01101001 01100101 01100110 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01100101 01110000 01101111 01100011 01101000 00100000 01101111 01100110 00100000 01101001 01101110 01100011 01110010 01100101 01100100 01110101 01101100 01101001 01110100 01111001 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01110011 01100101 01100001 01110011 01101111 01101110 00100000 01101111 01100110 00100000 01001100 01101001 01100111 01101000 01110100 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01110011 01100101 01100001 01110011 01101111 01101110 00100000 01101111 01100110 00100000 01000100 01100001 
01110010 01101011 01101110 01100101 01110011 01110011 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01110011 01110000 01110010 01101001 01101110 01100111 00100000 01101111 01100110 00100000 01101000 01101111 01110000 01100101 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01110111 01101001 01101110 01110100 01100101 01110010 00100000 01101111 01100110 00100000 01100100 01100101 01110011 01110000 01100001 01101001 01110010 00101100 00100000 01110111 01100101 00100000 01101000 01100001 01100100 00100000 01100101 01110110 01100101 01110010 01111001 01110100 01101000 01101001 01101110 01100111 00100000 01100010 01100101 01100110 01101111 01110010 01100101 00100000 01110101 01110011 00101100 00100000 01110111 01100101 00100000 01101000 01100001 01100100 00100000 01101110 01101111 01110100 01101000 01101001 01101110 01100111 00100000 01100010 01100101 01100110 01101111 01110010 01100101 00100000 01110101 01110011 00101100 00100000 01110111 01100101 00100000 01110111 01100101 01110010 01100101 00100000 01100001 01101100 01101100 00100000 01100111 01101111 01101001 01101110 01100111 00100000 01100100 01101001 01110010 01100101 01100011 01110100 00100000 01110100 01101111 00100000 01001000 01100101 01100001 01110110 01100101 01101110 00101100 00100000 01110111 01100101 00100000 01110111 01100101 01110010 01100101 00100000 01100001 01101100 01101100 00100000 01100111 01101111 01101001 01101110 01100111 00100000 01100100 01101001 01110010 01100101 01100011 01110100 00100000 01110100 01101000 01100101 00100000 01101111 01110100 01101000 01100101 01110010 00100000 01110111 01100001 01111001 00101101 00101101 01101001 01101110 00100000 01110011 01101000 01101111 01110010 01110100 00101100 00100000 01110100 01101000 01100101 00100000 01110000 01100101 01110010 01101001 01101111 01100100 00100000 01110111 01100001 01110011 00100000 01110011 
01101111 00100000 01100110 01100001 01110010 00100000 01101100 01101001 01101011 01100101 00100000 01110100 01101000 01100101 00100000 01110000 01110010 01100101 01110011 01100101 01101110 01110100 00100000 01110000 01100101 01110010 01101001 01101111 01100100 00101100 00100000 01110100 01101000 01100001 01110100 00100000 01110011 01101111 01101101 01100101 00100000 01101111 01100110 00100000 01101001 01110100 01110011 00100000 01101110 01101111 01101001 01110011 01101001 01100101 01110011 01110100 00100000 01100001 01110101 01110100 01101000 01101111 01110010 01101001 01110100 01101001 01100101 01110011 00100000 01101001 01101110 01110011 01101001 01110011 01110100 01100101 01100100 00100000 01101111 01101110 00100000 01101001 01110100 01110011 00100000 01100010 01100101 01101001 01101110 01100111 00100000 01110010 01100101 01100011 01100101 01101001 01110110 01100101 01100100 00101100 00100000 01100110 01101111 01110010 00100000 01100111 01101111 01101111 01100100 00100000 01101111 01110010 00100000 01100110 01101111 01110010 00100000 01100101 01110110 01101001 01101100 00101100 00100000 01101001 01101110 00100000 01110100 01101000 01100101 00100000 01110011 01110101 01110000 01100101 01110010 01101100 01100001 01110100 01101001 01110110 01100101 00100000 01100100 01100101 01100111 01110010 01100101 01100101 00100000 01101111 01100110 00100000 01100011 01101111 01101101 01110000 01100001 01110010 01101001 01110011 01101111 01101110 00100000 01101111 01101110 01101100 01111001 00101110
```

Binary is like the decimal system we know and love, except we have only the digits 0 and 1 instead of 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. In a byte, which holds eight binary digits, the number 0 would be expressed as `00000000`, and then 1 would be `00000001`. However, 2 would be `00000010` and 3 would be `00000011`. By establishing an agreed protocol like UTF-8, we can express text characters as binary numbers. The character "I" is `01001001`, "t" is `01110100`, and a space is `00100000`. Those are the first three characters of the paragraph.
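If you want to check these values yourself, here is a minimal Python sketch that uses only built-in string and byte methods. The variable names are just for illustration. It encodes the first three characters of the paragraph as UTF-8 and prints each byte as eight binary digits.

```python
# Encode the first three characters of the paragraph as UTF-8 bytes
text = "It "
encoded = text.encode("utf-8")

# Format each byte as an 8-digit binary number
bits = [format(byte, "08b") for byte in encoded]
print(bits)  # ['01001001', '01110100', '00100000']

# The integers 0 through 3, each written as one byte
small_numbers = [format(n, "08b") for n in range(4)]
print(small_numbers)  # ['00000000', '00000001', '00000010', '00000011']

# Going the other way: binary strings back to text
decoded = bytes(int(b, 2) for b in bits).decode("utf-8")
print(decoded == text)  # True

# Characters outside ASCII take more than one byte in UTF-8
print(len("é".encode("utf-8")))  # 2
```

The last line is a reminder that one character does not always equal one byte, which matters once your text contains accents, emoji, or non-Latin scripts.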


Now before you get excited and say "oh so this is how text data is seen by LLMs," hold your horses. This is just one way computers see text data differently than we humans do. While it is important to understand how computers store data as 1's and 0's, this is just a basic piece of knowledge, as LLMs use many other mathematical representations of language. Different types of mathematical transformations are applied to different language problems. But the big idea to extract for now is this: **to computers and large language models, language is nothing more than numbers.** What mathematical representations we use to express those numbers depends on the task and modeling strategy we want to employ.

For example, a simple `CountVectorizer` from scikit-learn will build a vocabulary from one or more documents. Then we can use it to count how many times each vocabulary word occurs in new documents. Words that are not in the vocabulary, like "was" in the example here, are ignored. Note how each vocabulary word is assigned a positional index in the output array.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build the vocabulary from a single document
text = ["The dog and cat were very happy today."]
vectorizer = CountVectorizer()
vectorizer.fit(text)

# encode a new document
vector = vectorizer.transform(["The dog was happy, and the cat was happy."])

# summarize encoded vector
print(vectorizer.vocabulary_)
print(vector.toarray())
```

```
{'the': 4, 'dog': 2, 'and': 0, 'cat': 1, 'were': 7, 'very': 6, 'happy': 3, 'today': 5}
[[1 1 1 2 2 0 0 0]]
```
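To see what is going on under the hood, here is a rough pure-Python sketch of the same idea. It is not scikit-learn's actual implementation. The function names `build_vocabulary` and `count_vector` are made up for this illustration, and the regular expression is assumed to approximate the default tokenization (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

# Assumption: tokens are runs of two or more word characters, lowercased
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def build_vocabulary(documents):
    """Map each distinct word to a position, in alphabetical order."""
    words = set()
    for doc in documents:
        words.update(TOKEN_PATTERN.findall(doc.lower()))
    return {word: index for index, word in enumerate(sorted(words))}

def count_vector(document, vocabulary):
    """Count vocabulary words in a document; unknown words are ignored."""
    counts = Counter(TOKEN_PATTERN.findall(document.lower()))
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector

vocabulary = build_vocabulary(["The dog and cat were very happy today."])
vector = count_vector("The dog was happy, and the cat was happy.", vocabulary)

print(vocabulary)
print(vector)  # [1, 1, 1, 2, 2, 0, 0, 0]
```

Each position in the list belongs to one vocabulary word, so position 3 holds the count for "happy" and position 4 holds the count for "the". The word "was" never appears in the vocabulary, so it contributes nothing to the vector.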


There are many ways we can vectorize text data, or turn it into numbers. We will discuss many of these techniques later. 

## Strengths and Limits of Large Language Models

What are the strengths and limits of large language models? And what applications are they appropriate for? 

Let's start off discussing the strengths. Trained on a large corpus of data, large language models can take a sequence of human-prompted text and output a sequence of human-sounding text. With attention-based mechanisms, some powerful tasks can be achieved. We saw how Gemini can rewrite an email to be diplomatic. We could also use an LLM to waste scammers' time or create marketing materials. LLMs can even write code.

But to truly find the strengths of large language models, we have to understand the weaknesses. Generally speaking, large language models have no concept of **ground truth**, that is, knowing what is actually true. The internet is filled with bad data and poor information, and when a large language model consumes this data it can regurgitate the nonsense.

Let's say you happen to have high-quality data that is verified and accurate, such as the biographies of famous book authors. Now imagine that data exists in a high-dimensional space as a scatterplot showing which words frequently occur in proximity to each other in context. We might see J.R.R. Tolkien and C.S. Lewis frequently associated with fantasy. If you ask the large language model questions that come directly out of the data, such as "tell me about J.R.R. Tolkien," you can likely get reliable answers. But let's say you start asking questions that push outside the edge of the data, and ask it about an obscure fantasy author named "Jane Farlow" with only two sentences in her biography. The LLM can start to **hallucinate**, or make up information, to compensate for a lack of data. It suddenly starts to make Jane sound like she wrote *The Lord of the Rings*, *The Lion, the Witch and the Wardrobe*, and *Game of Thrones*, which it has plenty of data on.

Hallucination is a problem of **extrapolation**, where the LLM has to infer about areas it has no data on. 


Another example of hallucination is LLM-generated code. Below I have some SQLAlchemy code that was generated by Gemini. It may not be obvious, but it made up library calls that do not exist in SQLAlchemy. What likely happened is it started to conflate SQLAlchemy with other SQL libraries from other languages outside of Python, such as Java.  

*(Figure: SQLAlchemy code generated by Gemini, containing library calls that do not exist.)*

Hallucination is still an open problem with large language models, but let's address some other practical concerns when it comes to procuring data. 

## Concerns in Procuring LLM Data

Let's say you are tasked with creating a large language model for your organization. It will ingest documents about your products, legal paperwork, and marketing materials so it can aid in information retrieval as well as customer service requests. Here are some tasks your boss wants the LLM to do:

* Help employees retrieve information from company knowledge
* Assist customers via chatbot that has access to EULA, product, and warranty information
* Develop marketing posts for social media
* Provide code assistance based on company code repositories
  
**Stop and think about this for a moment. What about this scope makes sense? What could be problematic?**


It makes sense to use an LLM that is tailored specifically to a business and the data it holds in a corpus repository of some kind. By keeping the scope controlled to some internal use cases, it is easier to manage what the LLM can see and learn from. However, it has to be considered whether the LLM will be used internally or externally. We do not want information leaked to customers because the chatbot has access to internal documents. Therefore, it might make sense to maintain separate models, where one is customer-facing and another is internal-facing.

There also needs to be discretion on what documents and data are sourced into the model. There are many concerns here: 

* What should be done about dated or deprecated data, and how is it removed from the model?
* What if marketing outputs from the LLM become inputs and create a reinforcement loop?
* What if code generated by the LLM becomes training input and reinforces the security vulnerabilities it contains?
* If data entry or prompt training needs to be done, [who is going to do that labor](https://time.com/6247678/openai-chatgpt-kenya-workers/)?
* Do we store customer data and correspondence in the LLM?

As you can see, it may be tempting to throw the kitchen sink at the LLM, but it is prudent to consider what should go into it, who has access to it, and when data should be phased out. Consider carefully what the **operating domain** of the LLM should be, that is, the tasks and conditions it is sanctioned to operate on. Employees need to be trained on the definition of this operating domain and should, for example, know it is okay to use the LLM to create social media posts but not okay to use it for press releases. They also have to be trained, for example, to not include customer information in the documents they submit into the repository.
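One way to make these decisions concrete is to tag every document with metadata and filter on it before anything reaches a model. The sketch below is only an illustration under assumed conventions. The `Document` class, its fields, the corpus labels, the sample titles, and the fixed date are all hypothetical, and a real pipeline would need far more careful review than a few boolean flags.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Document:
    title: str
    audience: str                 # hypothetical labels: "public" or "internal"
    expires: Optional[date]       # None means the document does not expire
    llm_generated: bool           # True if an LLM produced this document
    contains_customer_data: bool  # True if customer information is present

def eligible(doc, corpus, today):
    """Decide whether a document may enter the given corpus."""
    if doc.contains_customer_data:
        return False  # keep customer information out of every corpus
    if doc.llm_generated:
        return False  # avoid outputs becoming inputs
    if doc.expires is not None and doc.expires <= today:
        return False  # phase out dated material
    if corpus == "customer_facing":
        return doc.audience == "public"
    return True  # the internal corpus may see public and internal documents

documents = [
    Document("Warranty terms", "public", None, False, False),
    Document("Internal pricing memo", "internal", None, False, False),
    Document("2019 product sheet", "public", date(2020, 1, 1), False, False),
    Document("Generated social post", "public", None, True, False),
    Document("Support ticket export", "internal", None, False, True),
]

today = date(2024, 1, 1)  # fixed date so the example is reproducible
customer_facing = [d.title for d in documents if eligible(d, "customer_facing", today)]
internal = [d.title for d in documents if eligible(d, "internal", today)]

print(customer_facing)  # ['Warranty terms']
print(internal)         # ['Warranty terms', 'Internal pricing memo']
```

Keeping the rules in one small function makes the operating domain something you can read, review, and test, rather than a policy that lives only in a training slide.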

## Exercise

You run a software company and you want to enhance your developers' productivity by developing an LLM-based code assistant. You would like to draw on outside open-source code through a third-party API, but you also want to augment the internal LLM to look at internal code so it can provide domain-specific help. For example, if you ask it "How do I implement Carl's prioritization algorithm?" it will look at Carl's code commits and figure out which algorithm you are referring to.

What are some things you should be mindful of when procuring this data? What concerns might you have to be aware of? 


### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 


**Here are some things to be aware of when procuring code repository data and scoping the LLM appropriately:**

* Confidential and proprietary code, with sensitive security requirements, should likely not be ingested by the LLM.
* Deprecated and dead code should have a way of being removed from the LLM so it is never recommended as an output.
* Code analyses should be done to see if feedback loops are forming, where outputs become inputs and code quality declines because of these self-perpetuating biases.
* Regular code review should still be done to ensure security and best practices are being followed, and the LLM should not be 100% trusted with this task.
