## Foundation Model
A foundation model is an AI model trained on a very large and broad dataset, which allows it to perform a wide range of tasks with minimal additional training. Such models serve as a base for many applications: because they learn from extensive data, they can generalize and perform well on tasks they were not explicitly trained for.
- Trained on vast datasets.
- Designed to generalize across many tasks, allowing them to perform well on examples they have never seen before.
- Require significant computational resources due to their size and complexity.
- Examples include GPT and DALL-E from OpenAI and Bard from Google.

Generalize: The ability of a model to apply what it has learned from its training data to new, unseen data.

## Traditional Models
- Typically trained on smaller, task-specific datasets.
- Developed from scratch based on meticulously curated data, making them task-specific and domain-specific.
- Generally smaller in size and require less computational power.
- Examples include linear regression, decision trees, and convolutional neural networks.


### Key Differences

Foundation models are versatile and can adapt to various tasks, while traditional models are specialized and limited to specific applications.
The emergence of foundation models represents a significant shift in AI, focusing on large-scale training and broad applicability.

## Transformer Architecture
The transformer architecture has revolutionized how machines handle sequential data, making it possible to train large models efficiently.

### Self-attention mechanism
The self-attention mechanism in a transformer is a process where each element in a sequence computes its representation by attending to and weighing the importance of all elements in the sequence, allowing the model to capture complex relationships and dependencies. This is particularly useful in tasks like language modeling.
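
The weighing described above can be illustrated with a minimal sketch in plain Python. It makes a simplifying assumption: the input vectors are used directly as queries, keys, and values, whereas a real transformer first applies learned projection matrices. The toy embeddings and function names are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention over a list of equal-length vectors.

    Assumption: queries, keys, and values are the inputs themselves;
    the learned projections of a real transformer are omitted.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score the query against every element, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New representation: weighted average of all elements.
        output = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token embeddings (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one element attends to every element in the sequence, including itself; the rows sum to 1 because of the softmax.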


## Benchmark

Why do they matter?
Benchmarks matter because they are the standards that help us measure and accelerate progress in AI. They offer a common ground for comparing different AI models and encouraging innovation, providing important stepping stones on the path to more advanced AI technologies.

Importance of Benchmark Datasets: Benchmark datasets serve as standardized testbeds for algorithms, providing clear, objective, and quantifiable metrics for evaluation.

Benefits of Benchmark Datasets:
Comparability: They allow for direct comparison of different algorithms and models.  
Reproducibility: They create a shared foundation for reproducing and verifying results, crucial for scientific progress.  
Focus: They concentrate research efforts on specific problems, leading to innovation.  
Democratization: Open access to high-quality datasets levels the playing field for researchers worldwide.  
Acceleration of Progress: As models surpass benchmarks, datasets evolve to present new challenges, driving further advancements in AI.  

### GLUE - General Language Understanding Evaluation
The GLUE benchmark is designed to assess the capabilities of AI models across a variety of linguistic tasks, from grammar checking to complex sentence relationship analysis, serving as a litmus test for their understanding of human language. By putting AI models through these varied linguistic challenges, we can gauge their readiness for real-world tasks and uncover potential weaknesses.

#### The GLUE Tasks / Benchmarks
##### CoLA (Corpus of Linguistic Acceptability) - Grammatical Acceptability
    Measures the ability to determine if an English sentence is linguistically acceptable. 

##### SST-2 (Stanford Sentiment Treebank) - Sentiment Analysis
    Consists of sentences from movie reviews with human annotations of their sentiment; the task is to predict that sentiment.

##### MRPC (Microsoft Research Paraphrase Corpus) - Paraphrase Identification
    Focuses on identifying whether two sentences are paraphrases of each other.  

##### STS-B (Semantic Textual Similarity Benchmark) - Semantic Textual Similarity
    Involves determining how similar two sentences are in terms of semantic content.  

##### QQP (Quora Question Pairs) - Question Pairs Equivalence
    Aims to identify whether two questions asked on Quora are semantically equivalent.

##### MNLI (Multi-Genre Natural Language Inference) - Natural Language Inference
    Consists of sentence pairs from multiple genres of text, labeled for textual entailment (entailment, contradiction, or neutral); the task is to determine how the second sentence relates to the first.

##### QNLI (Question Natural Language Inference) - Question Answering Inference
    Involves determining whether the content of a paragraph or a context sentence contains the answer to a question.

##### RTE (Recognizing Textual Entailment) - Textual Entailment Recognition
    Requires understanding whether one sentence entails another.

##### WNLI (Winograd Natural Language Inference) - Pronoun Disambiguation
    Tests a system's reading comprehension by having it determine the correct referent of a pronoun in a sentence, where understanding depends on contextual information provided by specific words or phrases.


Semantic Equivalence: When different phrases or sentences convey the same meaning or idea.

Textual Entailment: The relationship between text fragments where one fragment follows logically from the other.

#### Importance of Evaluation 
The GLUE benchmark allows researchers to compare the performance of different models on standardized tasks, facilitating advancements in natural language processing.
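
As a rough illustration of such a comparison, the sketch below scores two hypothetical models on the same small labeled set and ranks them. The labels, predictions, and model names are invented for the example, and plain accuracy is used only for simplicity; GLUE itself reports task-specific metrics.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical gold labels for a paraphrase-style task
# (1 = paraphrase, 0 = not a paraphrase).
gold = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical predictions from two models on the same examples.
model_predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 1],
    "model_b": [1, 0, 1, 1, 0, 1, 1, 0],
}

# Score every model on the identical data, then rank best-first.
scores = {name: accuracy(preds, gold) for name, preds in model_predictions.items()}
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in leaderboard:
    print(f"{name}: {score:.3f}")
```

Because both models are evaluated on exactly the same examples with the same metric, their scores are directly comparable, which is the point of a shared benchmark.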

### SuperGLUE
SuperGLUE is the successor to the original GLUE benchmark. It is a more advanced benchmark that presents more challenging language understanding tasks for AI models, and it was created as models began to reach human parity on GLUE. It also features a public leaderboard, facilitating the direct comparison of models and enabling the tracking of progress over time.

#### SuperGLUE Tasks / Benchmarks
##### BoolQ (Boolean Questions)
Involves answering a yes/no question based on a short passage.

##### CB (Commitment Bank)
    Tests three-way classification (entailment, contradiction, or neutral) between a short premise text and a hypothesis.

##### COPA (Choice of Plausible Alternatives)
    Measures causal reasoning by asking for the cause/effect of a given sentence.

##### MultiRC (Multi-Sentence Reading Comprehension)
    Involves answering questions about a paragraph where each question may have multiple correct answers.

##### ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset)
    Requires selecting the correct named entity from a passage to fill in the blank of a question.

##### RTE (Recognizing Textual Entailment)
    Involves identifying whether one sentence entails another (a two-way decision: entailment or not entailment).

##### WiC (Words in Context)
    Tests word sense disambiguation: whether a word is used with the same meaning in two different sentences.

##### WSC (Winograd Schema Challenge)
    Focuses on resolving coreference within a sentence, often requiring commonsense reasoning.

##### AX-b (Broad Coverage Diagnostics)
    A diagnostic set to evaluate model performance on a broad range of linguistic phenomena.
    
##### AX-g (Winogender Schema Diagnostics)
    Tests for the presence of gender bias in automated coreference resolution systems.


Coreference Resolution: Figuring out when different words or phrases in a text, such as the pronoun "she" and the phrase "the president", refer to the same person or thing.

## Data used for Training LLMs
Generative AI models, and Large Language Models (LLMs) in particular, rely on a rich mosaic of data sources to develop their linguistic skills. These sources include web content, academic writing, literary works, and multilingual texts, among others. By training on a variety of data types, such as scientific papers, social media posts, legal documents, and conversational dialogues, LLMs become adept at comprehending and generating language across many contexts, which improves their ability to provide relevant and accurate information.

### Diverse Data Sources
- Websites: Content from various online sources, including articles and blogs, helps models learn both formal and informal language.
- Scientific Papers: Academic texts provide technical language and complex concepts, which are useful for expert-level queries.
- Encyclopedias: Factual entries give models a basis for general knowledge across many topics.
- Books and Literature: Classic and modern literature enriches the model's vocabulary and understanding of complex sentence structures.
- Conversational Data: Transcripts from dialogues and chatbots help models grasp nuances in dialogue and colloquial speech.
- Social Media Posts: This data helps models understand current linguistic trends and informal communication styles.
- Legal Documents: These texts train models to comprehend formal language and complex structures.
- Multilingual Texts: Including texts in various languages helps models understand and generate language across different linguistic contexts.


Preprocessing: This is the process of preparing and cleaning data to ensure quality before it is used to train a machine learning model. It might involve removing errors and irrelevant information, anonymizing personal information, or formatting the data so that the model can easily learn from it.
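
A minimal sketch of such a cleaning step is shown below, using only the Python standard library. The specific rules (stripping HTML-like tags, masking e-mail addresses, dropping very short or duplicate documents) and the `min_words` threshold are illustrative assumptions, not a description of how any particular LLM's data is prepared.

```python
import html
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
TAG_PATTERN = re.compile(r"<[^>]+>")

def clean_text(raw):
    """Strip markup, anonymize e-mail addresses, and normalize whitespace."""
    text = TAG_PATTERN.sub(" ", raw)           # remove HTML-like tags
    text = html.unescape(text)                 # decode entities such as &amp;
    text = EMAIL_PATTERN.sub("[EMAIL]", text)  # mask e-mail addresses
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

def preprocess(documents, min_words=3):
    """Clean each document, drop very short ones, and remove exact duplicates."""
    seen, cleaned = set(), []
    for raw in documents:
        text = clean_text(raw)
        if len(text.split()) < min_words or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

# Made-up raw documents: one with markup, one duplicate, one too short.
raw_documents = [
    "<p>Contact  me at jane.doe@example.com for details.</p>",
    "Contact me at jane.doe@example.com for details.",
    "ok",
    "Cats &amp; dogs are   common pets.",
]
print(preprocess(raw_documents))
```

Real pipelines typically add many more steps, such as language filtering and near-duplicate detection, but the overall shape (clean, filter, deduplicate) is similar.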

Fine-tuning: After a model has been pre-trained on a large dataset, fine-tuning is an additional training step where the model is further refined with specific data to improve its performance on a particular type of task.
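
The two-stage idea can be sketched with a deliberately tiny stand-in for a real model: a straight line fitted by gradient descent. This is only an analogy under stated assumptions; the datasets, learning rate, and step counts are invented, and an actual LLM is fine-tuned with far larger models and data.

```python
def mse(weight, bias, data):
    """Mean squared error of the line y = weight * x + bias on (x, y) pairs."""
    return sum((weight * x + bias - y) ** 2 for x, y in data) / len(data)

def train(weight, bias, data, steps, learning_rate):
    """Run plain gradient descent on the mean squared error."""
    for _ in range(steps):
        grad_w = sum(2 * (weight * x + bias - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (weight * x + bias - y) for x, y in data) / len(data)
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias

# "Pre-training": a larger, general dataset following y = 2x.
general_data = [(x / 10, 2 * x / 10) for x in range(-50, 51)]
# "Fine-tuning": a small task-specific dataset following y = 2x + 1.
task_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# Stage 1: learn from the general data, starting from scratch.
pretrained = train(0.0, 0.0, general_data, steps=200, learning_rate=0.05)
# Stage 2: continue training from the pre-trained parameters on task data.
finetuned = train(*pretrained, task_data, steps=50, learning_rate=0.05)

print("task error before fine-tuning:", round(mse(*pretrained, task_data), 4))
print("task error after fine-tuning: ", round(mse(*finetuned, task_data), 4))
```

The key design point is that the second call to `train` starts from the pre-trained parameters rather than from zero, so only a few additional steps on a small dataset are needed to adapt to the task.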

### Data Scale and Volume
The scale of data used to train Large Language Models (LLMs) is vast, involving datasets that could equate to millions of books. This sheer size is pivotal for the model's mastery of language, because it exposes the model to diverse words and structures.

Common Crawl: An open repository of web crawl data. Essentially, it is a large collection of content from the internet that is gathered by automatically scraping the web.

### Biases in Training Data
Biases in training data deeply influence the outcomes of AI models, reflecting societal issues that require attention. Ways to approach this challenge include promoting diversity in development teams, seeking diverse data sources, and ensuring continued vigilance through bias detection and model monitoring.

#### Types of Bias
##### Selection Bias - Biased data selection
    When the data used to train an AI model does not accurately represent the whole population or situation because of how it was selected, leading to skewed results. For example, those choosing the data tend to pick datasets they are already aware of.
    
##### Historical Bias - Caused by historical prejudices in data
    Prejudices and societal inequalities of the past that are reflected in the data, influencing the AI in a way that perpetuates these outdated beliefs.
    
##### Confirmation Bias - Caused by selecting data that fits pre-existing beliefs
    Arises when data is selected to confirm pre-existing beliefs, further skewing the model's understanding.

#### Effects of Biased Data
##### Discriminatory Outcomes
    Unfair results produced by AI that disadvantage certain groups, for example in hiring decisions or loan approvals, often due to biases in the training data or to malicious actors.
    
##### Echo Chambers
    Situations where biased AI reinforces and amplifies existing biases, creating feedback loops that narrow and distort the sphere of information and limit diverse perspectives.

##### Misrepresentation
    Certain groups may be underrepresented or misrepresented in the outputs of AI models.

#### Mitigating Bias
##### Organizational Diversity
    Ensuring that the teams involved are diverse.
    - Fair models go hand in hand with fair teams, companies, and society.

##### Diverse Data Collection
    Actively seeking out diverse data sources.

##### Bias Detection and Correction - Employing algorithms and human oversight
    Processes and algorithms designed to identify and remove biases from data before it's used to train AI models.
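
One simple detection check can be sketched as follows: compare the rate of positive labels across groups in the training data and flag large gaps for human review. The records, group names, and the 0.2 threshold are hypothetical, and this is just one of many possible checks rather than a complete bias audit.

```python
from collections import defaultdict

def selection_rates(records):
    """Share of positive outcomes per group, from (group, outcome) pairs."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        positives[group] += outcome
    return {group: positives[group] / totals[group] for group in totals}

def parity_gap(rates):
    """Largest difference in selection rate between any two groups."""
    return max(rates.values()) - min(rates.values())

# Hypothetical training labels from past hiring decisions:
# (group, 1 = selected / 0 = rejected).
decisions = [
    ("group_a", 1), ("group_a", 1), ("group_a", 0), ("group_a", 1),
    ("group_b", 0), ("group_b", 1), ("group_b", 0), ("group_b", 0),
]

rates = selection_rates(decisions)
gap = parity_gap(rates)
THRESHOLD = 0.2  # illustrative cut-off, not a standard value
print(rates, "gap:", gap, "flagged:", gap > THRESHOLD)
```

A flagged gap does not prove the data is unfair, but it indicates where human oversight should look before the data is used for training.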

##### Transparency and Accountability - Being transparent about sources and nature of training data
    Openness about how AI models are trained and the nature of their data, ensuring that developers are answerable for their AI's performance and impact.

##### Continuous Monitoring
    Regularly testing and updating the models.
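
As a small sketch of what regular testing might look like in practice, the snippet below compares each new evaluation score against a baseline and reports the periods where quality dropped noticeably. The scores, dates, and the 0.05 tolerance are invented for illustration.

```python
def find_regressions(history, baseline, tolerance=0.05):
    """Return the periods whose score fell more than `tolerance` below baseline."""
    return [period for period, score in history if baseline - score > tolerance]

# Hypothetical accuracy on a fixed evaluation set after each release.
history = [
    ("2024-01", 0.91),
    ("2024-02", 0.90),
    ("2024-03", 0.84),
    ("2024-04", 0.89),
]

alerts = find_regressions(history, baseline=0.91)
print(alerts)
```

The same pattern can be applied to fairness metrics as well as accuracy, so that a model update which reintroduces a bias is caught early.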

