LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning large language models (LLMs) for specific tasks. Instead of retraining all parameters, it adds small low-rank matrices to the frozen, pre-trained model and trains only those[1]. This significantly reduces the number of trainable parameters, which lowers the memory footprint, speeds up training, and makes fine-tuning feasible on less powerful hardware[2][3]. LoRA represents each weight update matrix as the product of two much smaller matrices, reducing computational overhead and allowing one base model to be shared across multiple tasks, each with its own small LoRA module[3][4]. Because the update is linear, it can be merged into the frozen weights after training and so introduces no inference latency; LoRA can also be applied across various types of neural networks, though it is predominantly showcased within NLP[2][3]. It can be combined with other training techniques[6]. Overall, it offers a balance between adaptation and computational efficiency, making it a practical approach for adapting large pre-trained models[3][4][5].
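
To make the low-rank idea concrete, here is a minimal sketch in plain Python. The sizes `d` and `k`, the rank `r`, the scaling factor `alpha`, and the helper `matmul` are illustrative assumptions, not values from any particular model. The frozen weight `W` is never changed; only the small factors `A` and `B` would be trained.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

random.seed(0)
d, k, r, alpha = 6, 4, 2, 4   # hypothetical sizes: W is d x k, with rank r < min(d, k)

# Frozen pre-trained weight (not updated during fine-tuning)
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]

# Trainable low-rank factors; B starts at zero so the initial update is zero
B = [[0.0] * r for _ in range(d)]                                   # d x r
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # r x k

def effective_weight(W, B, A, alpha, r):
    """Return W + (alpha / r) * (B @ A), the weight after merging the adapter."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

full_params = d * k          # parameters in a full update of W
lora_params = r * (d + k)    # parameters in B and A together
W_merged = effective_weight(W, B, A, alpha, r)
print(full_params, lora_params)
```

At this toy size the saving is small, but the same arithmetic for a 4096 x 4096 weight with `r = 8` gives 16,777,216 parameters for a full update versus 65,536 for the two factors.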


---
---

Corpus: A large and structured set of texts used for linguistic analysis and research.  
Example: "The Brown Corpus is commonly used for NLP research."

Document: An individual piece of text within a corpus, ranging from a single sentence to an entire book.  
Example: "Each news article in a corpus is considered a separate document."

Vocabulary: The set of unique words present in a corpus or document.  
Example: "In the sentence 'I love NLP,' the vocabulary is {I, love, NLP}."

Term: A word or group of words that conveys a specific meaning within a particular context.  
Example: "In a machine learning document, terms like 'algorithm' and 'data' are important."

Token: An individual unit of text, such as a word or punctuation mark, obtained by breaking down a sentence during tokenization.  
Example: "'NLP is fun!' → ['NLP', 'is', 'fun', '!']."

Semantic Analysis: The process of understanding the meaning of words and sentences to derive insights from text.  
Example: "'The bank can refuse to lend money' differentiates 'bank' as a financial institution."

Pragmatics: The study of how context influences the interpretation of meaning in language.  
Example: "'Can you pass the salt?' is understood as a request."

Word Sense Disambiguation (WSD): The task of determining which meaning of a word is being used in a given context.  
Example: "'Bark' can refer to a dog's sound or a tree's outer layer."

Lexicon: A collection of words and their meanings, including usage and grammatical information.  
Example: "A thesaurus lists synonyms for words."

Topic Modeling: A technique to identify underlying topics in a collection of documents.  
Example: "Topic modeling reveals that documents about 'sports' frequently discuss 'teams' and 'players.'"

Dependency Parsing: A syntactic analysis that identifies relationships between words in a sentence.  
Example: "'The cat sat on the mat' shows 'sat' as the main verb, 'cat' as the subject."

Text Classification: Assigning predefined categories or labels to text based on its content.  
Example: "A review 'This product is great!' could be classified as positive sentiment."

Ensemble Learning: A technique that combines multiple models to improve overall performance.  
Example: "Using both a decision tree and a logistic regression model for email classification."

---
https://youtu.be/ENLEjGozrio?si=94jkx5ZiBvN0HVAc

---



# Data Preprocessing Steps in NLP

## Text Cleaning and Normalization
1. **Text Cleaning**: The process of removing unwanted characters and noise from text data to improve analysis.  
   Example: "Hello, World! 123" → "hello world"

2. **Lowercasing**: Converting all text to lowercase to ensure uniformity and avoid case sensitivity issues.  
   Example: "This is a Sample Text" → "this is a sample text"

3. **Removing Punctuation**: Eliminating punctuation marks that do not contribute to the meaning of the text.  
   Example: "Hello, world!" → "Hello world"

4. **Removing Numbers**: Depending on the context, this step involves removing or retaining numeric values in the text.  
   Example: "I have 2 apples" → "I have apples"

5. **Removing Whitespace**: Stripping extra spaces from the text to ensure clean and consistent input.  
   Example: "  Hello   world!  " → "Hello world!"

6. **Text Normalization**: Converting text to a standard format to ensure consistency, which may include lowercasing, removing diacritics, and expanding contractions.  
   Example: "can't" → "cannot"
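
The cleaning steps above can be chained into one small function. This is a minimal sketch using only the standard library; the function name `clean_text` and the order of the steps are assumptions, and whether numbers or punctuation should really be dropped depends on the task.

```python
import re
import string

def clean_text(text):
    """Lowercase, drop digits and punctuation, and collapse whitespace."""
    text = text.lower()                                               # lowercasing
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return " ".join(text.split())                                     # strip extra whitespace

print(clean_text("Hello, World! 123"))        # hello world
print(clean_text("  I have 2   apples.  "))   # i have apples
```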

## Filtering and Transformation
7. **Stop Words Removal**: The process of eliminating common words (e.g., "and", "the") that do not add significant meaning.  
   Example: "The cat sat on the mat." → "cat sat mat."

8. **Stemming**: Reducing words to their base or root form by removing suffixes or prefixes.  
   Example: "running" → "run"

9. **Lemmatization**: A more advanced form of stemming that reduces words to their base form, considering their meaning and context.  
   Example: "better" → "good"
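
Stop word removal can be sketched in a few lines. The stop word list below is a tiny hypothetical one chosen for the example; real projects usually take a fuller list from a library such as NLTK or spaCy, which also provide stemmers and lemmatizers.

```python
# Tiny illustrative stop word list (an assumption, not a standard list)
STOP_WORDS = {"the", "on", "and", "a", "an", "is", "of"}

def remove_stop_words(sentence):
    """Split on whitespace, trim basic punctuation, and drop stop words."""
    tokens = [tok.strip(".,!?").lower() for tok in sentence.split()]
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

print(remove_stop_words("The cat sat on the mat."))   # ['cat', 'sat', 'mat']
```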

## Tokenization and Structuring
10. **Tokenization**: Breaking down text into individual units (tokens), which can be words or sentences.  
    Example: "Hello world!" → ["Hello", "world", "!"]

11. **Part-of-Speech (POS) Tagging**: Assigning parts of speech to each word in the text (e.g., noun, verb, adjective).  
    Example: "The cat sat" → [("The", "DT"), ("cat", "NN"), ("sat", "VBD")]

12. **Named Entity Recognition (NER)**: Identifying and classifying named entities (e.g., people, organizations, locations) in the text.  
    Example: "Barack Obama was the president." → [("Barack Obama", "PERSON"), ("president", "TITLE")]

## Analysis and Representation
13. **Word Frequency Analysis**: Analyzing the frequency of words to identify common terms and insights within the dataset.  
    Example: In "apple apple banana", the word frequency is {"apple": 2, "banana": 1}.

14. **N-grams Analysis**: Generating contiguous sequences of n items (words or characters) from a given text to understand word patterns.  
    Example: For the sentence "I love NLP", bigrams are ["I love", "love NLP"].

15. **Bag of Words (BoW)**: A representation of text that describes the occurrence of words within a document without considering the order.  
    Example: "I love apples" becomes [1, 1, 1, 0, 0] for ["I", "love", "apples", "bananas", "oranges"].

16. **Personal Frequency Distribution**: Analyzing how frequently specific terms or entities appear in the text, often useful for personalized recommendations or analyses.
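
Steps 13 to 15 can be reproduced with the standard library. This is a minimal sketch; the helper names `ngrams` and `bag_of_words` are assumptions made for the example, and tokens are obtained by a plain whitespace split.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens, vocabulary):
    """Count each vocabulary word in the tokens, ignoring word order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

word_freq = Counter("apple apple banana".split())
bigrams = ngrams("I love NLP".split(), 2)
bow = bag_of_words("I love apples".split(),
                   ["I", "love", "apples", "bananas", "oranges"])

print(dict(word_freq))   # {'apple': 2, 'banana': 1}
print(bigrams)           # ['I love', 'love NLP']
print(bow)               # [1, 1, 1, 0, 0]
```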
 

## Visualization and Final Steps
17. **Visualizations**: Using graphical representations (e.g., word clouds, bar charts) to understand the distribution and characteristics of text data.  
    Example: A bar chart showing the frequency of top words like "apple", "banana".

18. **Feature Extraction**: Transforming text data into numerical representations for use in machine learning algorithms.  
    Example: Using TF-IDF to convert a document into a numerical vector.

19. **Handling Imbalanced Data**: Techniques to address class imbalance in datasets, ensuring unbiased model training.  
    Example: Oversampling the minority class "yes" in a churn prediction dataset.



---
---

## Embedding Techniques
![image.png](attachment:image.png)
# Frequency-Based Embedding

## 1. **Count Vectors**
- **Definition**: Count vectors represent text data as a matrix where each element is the count of a term in the document. Each unique word in the corpus becomes a dimension in the vector space.
- **Example**: Given two sentences:
  - Sentence 1: "I love NLP"
  - Sentence 2: "I enjoy NLP"
  
  The count vector representation might look like this:
  
  | Word | Count in Sentence 1 | Count in Sentence 2 |
  |------|----------------------|----------------------|
  | I    | 1                    | 1                    |
  | love | 1                    | 0                    |
  | enjoy| 0                    | 1                    |
  | NLP  | 1                    | 1                    |
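
The table above can be built programmatically. This is a minimal sketch in plain Python; the function name `count_vectors` is an assumption, and the vocabulary is ordered by first appearance, so the column order differs from the table's row order.

```python
from collections import Counter

def count_vectors(sentences):
    """Return the vocabulary and one count vector per sentence."""
    tokenized = [s.split() for s in sentences]
    vocabulary = []
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocabulary:   # keep first-appearance order
                vocabulary.append(tok)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vectors = count_vectors(["I love NLP", "I enjoy NLP"])
print(vocab)     # ['I', 'love', 'NLP', 'enjoy']
print(vectors)   # [[1, 1, 1, 0], [1, 0, 1, 1]]
```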

## 2. **TF-IDF (Term Frequency-Inverse Document Frequency)**
- **Definition**: TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus.
- **Formula**:
  
  $$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$

  Where:
  
  - $ \text{TF}(t, d) = \frac{\text{Number of times term t appears in document d}}{\text{Total number of terms in document d}} $
  
  - $ \text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term t}}\right) $
  
- **Example**: If "NLP" appears 3 times in a document of 100 words and appears in 5 out of 100 documents (using a base-10 logarithm):
  
  $$ \text{TF}(NLP) = \frac{3}{100} = 0.03 $$
  
  $$ \text{IDF}(NLP) = \log_{10}\left(\frac{100}{5}\right) = \log_{10}(20) \approx 1.301 $$
  
  Thus, $ \text{TF-IDF}(NLP) \approx 0.03 \times 1.301 \approx 0.03903 $.
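
The same arithmetic can be checked in code. This is a minimal sketch that assumes a base-10 logarithm, which is what the value 1.301 corresponds to; libraries such as scikit-learn use a natural logarithm with smoothing, so their numbers will differ.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document's terms that are this term."""
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    """Inverse document frequency with a base-10 logarithm (an assumption)."""
    return math.log10(total_docs / docs_with_term)

tf_nlp = tf(3, 100)          # "NLP" appears 3 times in 100 words
idf_nlp = idf(100, 5)        # "NLP" appears in 5 of 100 documents
tfidf_nlp = tf_nlp * idf_nlp

print(round(tf_nlp, 2), round(idf_nlp, 3), round(tfidf_nlp, 5))   # 0.03 1.301 0.03903
```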

## 3. **Co-Occurrence Matrix**
- **Definition**: A co-occurrence matrix counts how often pairs of words occur together in a given context (e.g., within the same sentence or window). This matrix helps to identify relationships and associations between words.
- **Example**: For the sentences "I love NLP" and "I enjoy NLP", counting pairs of distinct words that appear in the same sentence gives the following co-occurrence matrix (no word occurs twice in a sentence, so the diagonal is 0):

  |       | I | love | enjoy | NLP |
  |-------|---|------|-------|-----|
  | I     | 0 | 1    | 1     | 2   |
  | love  | 1 | 0    | 0     | 1   |
  | enjoy | 1 | 0    | 0     | 1   |
  | NLP   | 2 | 1    | 1     | 0   |
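
A sentence-level co-occurrence count can be sketched as follows. The function name `cooccurrence` is an assumption, and the context is taken to be the whole sentence; a fixed-size sliding window is another common choice. A word is only paired with other token positions, so a word that appears once per sentence never co-occurs with itself.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(sentences):
    """Count ordered pairs of tokens that share a sentence."""
    counts = defaultdict(int)
    for sentence in sentences:
        words = sentence.split()
        # permutations pairs every token position with every other position
        for a, b in permutations(words, 2):
            counts[(a, b)] += 1
    return dict(counts)

counts = cooccurrence(["I love NLP", "I enjoy NLP"])
print(counts.get(("I", "NLP"), 0))        # 2
print(counts.get(("I", "love"), 0))       # 1
print(counts.get(("love", "enjoy"), 0))   # 0
```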

# Prediction-Based Embedding

## 1. **Word2Vec**
- **Definition**: Word2Vec is a prediction-based model that generates word embeddings by predicting words in a given context (CBOW) or predicting context words given a target word (Skip-gram).

### a. **CBOW (Continuous Bag of Words)**
- **Definition**: In CBOW, the model predicts the target word based on the context words surrounding it.
- **Example**: In "the cat sat on the mat" with a window of two words on each side, the context words "cat", "sat", "the", and "mat" are used to predict the target word "on".

### b. **Skip-Gram**
- **Definition**: In the Skip-gram model, the model predicts the context words given a target word.
- **Example**: In the same sentence, given the target word "on", Skip-gram would predict the surrounding context words "cat", "sat", "the", and "mat".
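
The difference between the two setups is easiest to see in the training examples each one generates. This is a minimal sketch that only builds the (input, output) pairs; it does not train a model. The sentence, the symmetric window of two words, and the function name `context_windows` are assumptions made for the example.

```python
def context_windows(tokens, window=2):
    """Yield (context_words, target_word) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

tokens = "the cat sat on the mat".split()

# CBOW: many context words in, one target word out
cbow_examples = list(context_windows(tokens))

# Skip-gram: one target word in, one context word out (one pair per context word)
skipgram_pairs = [(target, ctx) for context, target in cbow_examples for ctx in context]

print(cbow_examples[3])                                  # (['cat', 'sat', 'the', 'mat'], 'on')
print([pair for pair in skipgram_pairs if pair[0] == "on"])
```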

## 2. **GloVe (Global Vectors for Word Representation)**
- **Definition**: GloVe is an embedding model that uses aggregated global word-word co-occurrence statistics from a corpus. It learns word vectors such that the dot product of two word vectors approximates the logarithm of how often the two words co-occur.
- **Formula**:
  
  $$ P_{ij} = \frac{X_{ij}}{X_j} $$

  Where:
  
  - $ P_{ij} $ is the probability of word $ i $ occurring in the context of word $ j $.
  - $ X_{ij} $ is the co-occurrence count of word $ i $ and word $ j $.
  - $ X_j = \sum_k X_{kj} $ is the total number of co-occurrences involving word $ j $.
  
- **Example**: For the co-occurrence matrix generated from a corpus, GloVe learns embeddings by factorizing the matrix to obtain word vectors that represent words based on their global context.

---
---

# Text normalization

Text normalization is an often-overlooked step in text pre-processing. It is the process of transforming text into a canonical (standard) form. For example, "ok" and "k" can both be transformed to "okay", their canonical form. Another example is mapping near-identical words such as "preprocessing", "pre-processing", and "pre processing" to just "preprocessing".

Text normalization is especially useful for noisy text such as social media comments, comments on blog posts, and text messages, where abbreviations, misspellings, and out-of-vocabulary (OOV) words are prevalent.


### Effects of normalization

Text normalization has even been effective for analyzing highly unstructured clinical texts where physicians take notes in non-standard ways. We have also found it useful for topic extraction where near synonyms and spelling differences are common (like 'topic modelling', 'topic modeling', 'topic-modeling', 'topic-modelling').

Unlike stemming and lemmatization, there is no standard way to normalize texts. It typically depends on the task. For example, the way you would normalize clinical texts would arguably be different from how you normalize text messages.

Some of the common approaches to text normalization include dictionary mappings, statistical machine translation (SMT) and spelling-correction based approaches.
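
Of these approaches, a dictionary mapping is the simplest to sketch. The mapping below is a small hypothetical one built from the examples in this section; a real mapping would be curated for the task at hand.

```python
import re

# Hypothetical variant -> canonical form mapping
NORMALIZATION_MAP = {
    "ok": "okay",
    "k": "okay",
    "pre-processing": "preprocessing",
    "pre processing": "preprocessing",
}

def normalize(text, mapping=NORMALIZATION_MAP):
    """Replace whole-word variants with their canonical form."""
    # Longest keys first so multi-word variants are tried before short ones
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

print(normalize("k, the pre-processing step is ok"))
# okay, the preprocessing step is okay
```

The word boundaries (`\b`) keep the short key "k" from matching inside longer words such as "okay".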

---
---
To compare two strings and determine if they are similar, there are several methods depending on the level of similarity and context you need. Here are some common techniques:

### 1. **Exact Match**

   - Simply use the equality operator `==` in most programming languages (e.g., `string1 == string2`). This checks if both strings are identical, character by character.

### 2. **Case-Insensitive Comparison**

   - Convert both strings to lowercase (or uppercase) and then compare. This method helps when case differences are irrelevant.

     ```python
     string1.lower() == string2.lower()
     ```

### 3. **Levenshtein Distance (Edit Distance)**

   - The **Levenshtein Distance** measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. A smaller distance indicates more similarity.
   - This can be computed using libraries like `python-Levenshtein` or `editdistance` in Python.

     ```python
     import Levenshtein
     distance = Levenshtein.distance(string1, string2)
     ```
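
To see what the library computes, here is a dependency-free sketch of the standard dynamic-programming approach. The function name `levenshtein` is an assumption made for the example; it keeps only one previous row of the table at a time.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))   # 3
```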

### 4. **Jaccard Similarity**

   - Treat each string as a set of characters or words and calculate the Jaccard similarity, which is the ratio of the intersection of the sets to the union.

     $$
     \text{Jaccard Similarity} = \frac{|A \cap B|}{|A \cup B|}
     $$

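   - For example, for `string1 = "apple"` and `string2 = "appeal"`, the character sets are both {a, p, l, e}, so the Jaccard similarity is 1.0 even though the strings differ. Set-based comparison ignores character order and repetition.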
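
A minimal sketch of the formula, usable with either characters or words. The function name `jaccard` is an assumption, and treating two empty inputs as identical (similarity 1.0) is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0   # convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)

print(jaccard("apple", "appeal"))                             # 1.0 (character sets)
print(jaccard("I love NLP".split(), "I enjoy NLP".split()))   # 0.5 (word sets)
```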

### 5. **Cosine Similarity (TF-IDF)**

   - For longer strings, convert each string into a vector (such as TF-IDF) and calculate the **cosine similarity**, which measures the cosine of the angle between two vectors. Values close to 1 indicate higher similarity.
   - This approach is commonly used in text mining for comparing larger bodies of text.

     ```python
     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.metrics.pairwise import cosine_similarity

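     # Fit TF-IDF on both strings; each row of the matrix is one string's vector
     tfidf_matrix = TfidfVectorizer().fit_transform([string1, string2])
     # cosine_similarity returns a 1x1 array here, so take the single value
     similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0, 0]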
     ```

### 6. **N-gram Similarity**

   - Break each string into consecutive `n`-length character sequences (n-grams), then compare them for overlap. This method captures partial matches effectively and is useful in detecting similar substrings.
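
One way to turn n-gram overlap into a score is to apply the Jaccard ratio to the two sets of character n-grams. That choice, and the names `char_ngrams` and `ngram_similarity`, are assumptions of this sketch; other overlap measures such as the Dice coefficient are also used.

```python
def char_ngrams(text, n=2):
    """Return the set of contiguous n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

print(sorted(char_ngrams("night")))                   # ['gh', 'ht', 'ig', 'ni']
print(round(ngram_similarity("night", "nacht"), 3))   # 0.143 (only 'ht' is shared)
```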

### 7. **Fuzzy Matching (Token Set or Token Sort)**

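   - Libraries like `fuzzywuzzy` in Python (now maintained under the name `thefuzz`) provide various fuzzy matching functions based on Levenshtein distance, such as `fuzz.ratio()` and `fuzz.partial_ratio()`. The token-based variants `fuzz.token_sort_ratio()` and `fuzz.token_set_ratio()` compare strings as sorted word lists or as word sets, which makes the match insensitive to word order.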

     ```python
     from fuzzywuzzy import fuzz

     similarity = fuzz.ratio(string1, string2)
     ```

### 8. **Jaro-Winkler Distance**

   - This metric is based on the number of matching characters and transpositions, and is particularly useful for short strings with small typographical errors. It gives extra weight to strings that share a common prefix, making it effective for name matching.

### Choosing the Right Method

The method you choose depends on the context:
- **Exact matches** or **case-insensitive checks** work well for strict comparisons.
- **Levenshtein**, **Jaccard**, or **Cosine similarity** work well for partial matches or text comparisons with minor errors.
- **Fuzzy matching** methods (like those in `fuzzywuzzy`) are helpful when comparing names, addresses, or strings with possible typographical errors.

---
---

**Sampling Novel Sequences:** A method used in generating new text or sequences by selecting from a probability distribution over possible next items (words, characters, etc.) based on a trained model. This approach aims to create diverse outputs rather than repeating known sequences, often leveraging randomness or temperature settings to control creativity.

**Example:** Given the model output probabilities for the next word after "The cat is," sampling might yield:

1. "The cat is sitting on the mat." (common continuation)
2. "The cat is a magician." (novel but plausible)
3. "The cat is flying to the moon!" (unexpected and creative)

By adjusting the decoding method (such as greedy decoding, temperature sampling, or top-k sampling), the generated text can range from conservative to highly creative.
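
These strategies can be sketched over a toy next-word distribution. The vocabulary and logits below are made-up values for the words that might follow "The cat is", and the function names are assumptions; a real model would supply the logits.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next(logits):
    """Greedy decoding: always pick the most likely item."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample an index, optionally restricted to the top_k highest logits."""
    indices = list(range(len(logits)))
    if top_k is not None:
        indices = sorted(indices, key=lambda i: logits[i], reverse=True)[:top_k]
    probs = softmax([logits[i] for i in indices], temperature)
    return rng.choices(indices, weights=probs, k=1)[0]

vocab = ["sitting", "a", "flying", "sleeping"]   # hypothetical continuations
logits = [2.5, 1.0, 0.2, 2.0]                    # hypothetical model scores

rng = random.Random(0)                           # seeded for repeatability
greedy_word = vocab[greedy_next(logits)]
sampled_words = [vocab[sample_next(logits, temperature=1.5, top_k=3, rng=rng)]
                 for _ in range(5)]
print(greedy_word)     # sitting
print(sampled_words)   # varies with the seed; never "flying" because top_k=3 excludes it
```

Lower temperatures concentrate probability on the top choice and approach greedy decoding, while higher temperatures and larger `top_k` values allow less likely continuations.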

---
---
---

 

### 1. **Performance Metrics**:
   - **Perplexity**: Measures how well a probability model predicts a sample. Lower perplexity indicates better performance.
   - **BLEU Score**: Used for evaluating machine translation, comparing the model output with one or more reference translations. Higher scores indicate better performance.
   - **ROUGE Score**: Commonly used for summarization tasks, it measures the overlap between the model-generated output and reference summaries. Higher scores indicate better performance.
   - **Accuracy**: For classification tasks, the proportion of correct predictions.
   - **F1 Score**: The harmonic mean of precision and recall, especially useful for imbalanced datasets.
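
Two of these metrics are simple enough to compute by hand. This is a minimal sketch; the per-token probabilities and the confusion counts are made-up inputs, and the function names are assumptions.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability of the tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among 4 options, so its perplexity is about 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
f1 = f1_score(tp=8, fp=2, fn=4)   # precision 0.8, recall 2/3
print(round(ppl, 3), round(f1, 3))
```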

### 2. **Context Length Handling**:
   - **Maximum Context Length**: Evaluate how much longer a sequence the new model can handle compared to the previous model. 
   - **Performance on Long Contexts**: Assess the performance metrics (e.g., perplexity, BLEU, ROUGE) specifically on longer input sequences to see if the new model maintains or improves performance as context length increases.

### 3. **Computational Efficiency**:
   - **Training Time**: Measure the time taken to train the model. A more efficient model can handle longer contexts without significantly increasing training time.
   - **Inference Time**: Measure how quickly the model generates outputs for given input lengths.

### 4. **Generalization Ability**:
   - **Evaluation on Diverse Datasets**: Test the new model on a wide range of datasets, especially those that include longer sequences. Improved performance across various domains indicates better generalization.

### 5. **Qualitative Evaluation**:
   - **Human Evaluation**: Gather human judgments on the quality of the outputs generated by the models, especially for tasks like text generation, translation, or summarization. Assess aspects such as coherence, relevance, and fluency.

### 6. **Robustness**:
   - **Stress Testing**: Evaluate how well the model performs under various input conditions, including noisy, incomplete, or misleading inputs. Check if the model can still provide meaningful outputs with longer contexts.

### Example Comparison

| Criteria                     | Previous Model               | New Model                    |
|------------------------------|------------------------------|------------------------------|
| Maximum Context Length        | 512 tokens                   | 2048 tokens                  |
| Perplexity (on validation set)| 35.0                         | 30.5                         |
| BLEU Score (on translation task)| 25.2                      | 28.7                         |
| Training Time                 | 10 hours                     | 12 hours                     |
| F1 Score (for classification) | 0.85                       | 0.88                         |
| Human Evaluation (score out of 5) | 3.5                     | 4.2                          |

 

---
---
---
---

## Web Scraping with Python

Suppose you need to pull a large amount of data from websites, and you want to fetch it as quickly as possible. How would you do it? Visiting each website manually and collecting the data would be tedious work. This is where web scraping helps: it makes the job easier and faster.

Here, we will do web scraping with Python, starting with the following questions.

1. Why do we do web scraping?
Web scraping is used to collect large amounts of data from websites. But why would someone need to collect that much data from websites? To answer this, let's look at some applications of web scraping:

- __Price comparison:__ Services such as ParseHub use web scraping to collect data from online shopping websites and compare the prices of products across them.
- __Gathering emails:__ Many companies that use email as a marketing medium use web scraping to collect email IDs and send bulk emails.
- __Social media scraping:__ Web scraping is used to collect data from social media websites such as Twitter to find out what is trending.
- __Research and development:__ Web scraping is used to collect large data sets (statistics, general information, temperature, etc.) from websites, which are then analyzed and used to carry out surveys or R&D.


2. What is web scraping, and is it legal or not?
Web scraping is an automated method for extracting large amounts of data from websites. Website data is unstructured most of the time; web scraping helps you collect that unstructured data and store it in a structured form. There are different ways to scrape websites, such as online services, APIs, or writing your own code. Here, we'll see how to implement web scraping with Python.

Coming to the question: is scraping legal or not? Some websites allow web scraping and some do not. To find out whether a website allows scraping, look at its "robots.txt" file. You can find this file by appending "/robots.txt" to the URL of the site you want to scrape. Here, we're scraping the Flipkart website, so the "robots.txt" file is at www.flipkart.com/robots.txt.
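
The robots.txt check can be automated with Python's standard library. The sketch below parses a small made-up robots.txt from a string so that it runs offline; the rules and the example.com URLs are hypothetical and are not Flipkart's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(robots_txt.splitlines())

allowed = robots.can_fetch("*", "https://example.com/products")
blocked = robots.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)   # True False
```

For a live site, `set_url()` followed by `read()` downloads and parses the file instead of `parse()`.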

3. How does web scraping work?

When we run the web scraping code, a request is sent to the URL mentioned in the code. In response, the server sends the data and allows you to read the HTML or XML page. Our code then parses the HTML or XML page, finds the data, and extracts it.

To extract data using web scraping with Python, follow these basic steps:

  1. Find the URL that you want to scrape.
  2. Inspect the page.
  3. Find the data you want to extract.
  4. Write the code for scraping.
  5. Run the code and extract the data.
  6. Store the data in the required format.
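
The parse, extract, and store steps can be sketched offline. So that it runs without a browser, network access, or third-party packages, this sketch swaps in the standard library's `html.parser` and `csv` modules for BeautifulSoup and Pandas, and works on a made-up HTML snippet. The tag structure, class names, and product data are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical product listing standing in for a downloaded page
HTML = """
<div class="product"><span class="name">Laptop A</span><span class="price">50000</span></div>
<div class="product"><span class="name">Laptop B</span><span class="price">65000</span></div>
"""

class ProductParser(HTMLParser):
    """Collect the text inside <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None   # which field the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class
            if css_class == "name":
                self.rows.append({})   # a name starts a new record

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()
            self._field = None

# Parse the page and extract the data
parser = ProductParser()
parser.feed(HTML)

# Store the data in a structured format (CSV text here)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(parser.rows)
print(buffer.getvalue())
```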
  
Now let's see how to extract data from the Flipkart website using Python.

4. Libraries used for web scraping

As we already know, Python is used for many applications, and there are different libraries for different purposes. Here, we're using the following libraries:

- **Selenium:** A library for web testing. We will use it to automate browser activities.
- **BeautifulSoup4:** Generally used for parsing HTML and XML documents. It creates parse trees that make it easy to extract data.
- **Pandas:** A Python library for data manipulation and analysis. We use it to store the extracted data in the desired format.

5. For demo purposes: scraping the Flipkart website

Pre-requisites:

 - Python 3.x with the Selenium, BeautifulSoup4, and Pandas libraries installed.
 - Google Chrome Browser
 
You can go through this [link](https://github.com/iNeuronai/webscrappper_text.git) for more details.

---
---

### REGEX

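```python
import re

lists = ['icecream images', 'i immitated', 'inner peace']

for i in lists:
    # Match two words that both start with "i", separated by one non-word character
    q = re.match(r"(i\w+)\W(i\w+)", i)

    if q:
        print(q.groups())
```

Output:

```
('icecream', 'images')
```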


### Finding a Pattern in the Text (re.search())

A RegEx is commonly used to search for a pattern in text. `re.search()` takes a RegEx pattern and a string, and searches for that pattern within the string.

To use `re.search()`, you need to import `re` first. The function takes the pattern and the text to scan, and returns a match object if the pattern is found anywhere in the text, or `None` otherwise.

```python
import re

pattern = ["playing", "iNeuron"]
text = "Raju is playing outside."

for p in pattern:
    print("You're looking for '%s' in '%s'" % (p, text), end=' ')

    # re.search() returns a match object (truthy) or None (falsy)
    if re.search(p, text):
        print('Found match!')
    else:
        print("no match found!")
```

Output:

```
You're looking for 'playing' in 'Raju is playing outside.' Found match!
You're looking for 'iNeuron' in 'Raju is playing outside.' no match found!
```


In the above example, we look for two literal strings, "playing" and "iNeuron", in the text "Raju is playing outside.". For "playing" we get a match, so the output is "Found match!"; for "iNeuron" there is no match, so the output is "no match found!".

## Using re.findall() for text

`re.findall()` is useful when you want to collect every match at once, for example while iterating over the lines of a file: it returns all matches as a list in one go. In the example below, we want to fetch all email addresses from a string, so we use `re.findall()`.

```python
import re

kgf = "Gaurav@iNeuron.ai, Nilesh@iNeuron.ai, Jay@iNeuron.ai, Vikash@iNeuron.ai"

# One or more word characters, dots, or hyphens on each side of "@"
emails = re.findall(r'[\w\.-]+@[\w\.-]+', kgf)

for e in emails:
    print(e)
```

Output:

```
Gaurav@iNeuron.ai
Nilesh@iNeuron.ai
Jay@iNeuron.ai
Vikash@iNeuron.ai
```


## Python Flags

Many Python RegEx methods and functions take an optional `flags` argument. A flag can modify the meaning of the given RegEx pattern.

Various flags used in Python include:
<img src=".\Images\14.png">

## Example: the re.M (Multiline) Flag

With the multiline flag, the pattern character "^" matches at the start of the string and at the beginning of each line. `\w` matches a single word character (a letter, digit, or underscore). When you run the code, the first variable "q1" contains only the character "i", while "q2", which uses the multiline flag, contains the first character of every line.

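```python
import re

aa = """iNeuron13
Machine
Learning"""

q1 = re.findall(r"^\w", aa)                 # "^" matches only at the start of the string
q2 = re.findall(r"^\w", aa, re.MULTILINE)   # "^" also matches at the start of each line
print(q1)
print(q2)
```

Output:

```
['i']
['i', 'M', 'L']
```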


Likewise, you can use other Python flags such as re.U (Unicode), re.L (locale-dependent matching), re.X (verbose mode, which allows comments in the pattern), etc.
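
As a short illustration of two more flags, here is a sketch using `re.I` (ignore case) and `re.X` (verbose). The sample text is made up for the example; the email pattern is the one used earlier.

```python
import re

text = "iNeuron\nINEURON\nineuron"

# re.I (IGNORECASE): match regardless of letter case
any_case = re.findall(r"ineuron", text, re.I)

# re.X (VERBOSE): whitespace and "#" comments inside the pattern are ignored
email = re.compile(r"""
    [\w.-]+   # local part
    @         # separator
    [\w.-]+   # domain
""", re.X)
found = email.findall("Gaurav@iNeuron.ai, Nilesh@iNeuron.ai")

# Flags can be combined with the | operator
line_starts = re.findall(r"^i\w+", text, re.I | re.M)

print(any_case)      # ['iNeuron', 'INEURON', 'ineuron']
print(found)         # ['Gaurav@iNeuron.ai', 'Nilesh@iNeuron.ai']
print(line_starts)   # ['iNeuron', 'INEURON', 'ineuron']
```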

---
---