# Unit 2

## Dataset Deduplication and Redundancy Removal

When preparing data for large language models (LLMs), the quality and uniqueness of your dataset are crucial. Duplicates and near-duplicates can skew training: repeated examples are effectively over-weighted, and compute is spent on text the model has already seen. This lesson focuses on **deduplication**, a key step in data preparation that keeps your dataset clean and efficient. By the end of this lesson, you'll understand how to remove both exact and near-duplicates from your dataset, setting a strong foundation for building robust LLMs.

### Recall: Basic Concepts of Hashing

Before diving into deduplication, let's briefly revisit the concept of **hashing**. A hash function converts data of any size into a fixed-size value, called a hash code or digest. Identical inputs always produce the same hash code, so comparing hash codes is a fast way to check whether two pieces of data are the same. Hash codes are not strictly unique: different inputs can occasionally collide, although for cryptographic hash functions such as SHA-256 this is vanishingly unlikely in practice. In previous lessons, we introduced the `hashlib` library in Python, which provides a simple way to generate hash codes. Remember, hashing is a fundamental tool in data processing, especially when dealing with large datasets.
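
As a quick refresher, here is a minimal sketch of hashing text with `hashlib`. The helper name `text_digest` is an assumption made for this illustration; it is not part of the library.

```python
import hashlib

def text_digest(text: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = text_digest("This is a duplicate sentence.")
b = text_digest("This is a duplicate sentence.")
c = text_digest("This is a duplicate sentence!")

print(a == b)   # identical inputs give identical digests
print(a == c)   # changing one character gives a different digest
print(len(a))   # a SHA-256 hex digest is always 64 characters long
```

Comparing short digests instead of full texts is what makes hash-based duplicate checks cheap on large datasets.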

### Exact Deduplication Using Hashing

Exact deduplication involves removing identical entries from your dataset. Python's `set` data structure handles this efficiently because it uses hashing internally to detect entries it has already seen. Let's walk through the steps:

1.  **Identify Duplicates:** Start with a list of texts, some of which may be duplicates.

    ```python
    texts = [
        "Large language models require diverse datasets.",
        "Language models need large and diverse datasets.",
        "This is a duplicate sentence.",
        "This is a duplicate sentence."
    ]
    ```

2.  **Remove Duplicates:** Use a `set` to automatically filter out duplicate entries.

    ```python
    unique_texts = list(set(texts))
    ```

    By converting the list to a set and back to a list, you remove any duplicate entries, because a `set` cannot contain duplicates. Note that a set is unordered, so the original order of the texts is not preserved.

3.  **Result:** The `unique_texts` list now contains only unique entries.

    ```python
    print(unique_texts)
    # Possible output (set iteration order is not guaranteed):
    # ['This is a duplicate sentence.', 'Large language models require diverse datasets.', 'Language models need large and diverse datasets.']
    ```
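
If the original order of the texts matters, a `set` alone is not enough. The sketch below is one possible alternative, not part of the lesson's code: it records a hash of each text it has seen and keeps only first occurrences. The function name `dedupe_preserving_order` is hypothetical.

```python
import hashlib

def dedupe_preserving_order(texts):
    """Drop exact duplicates while keeping the first occurrence of each text."""
    seen = set()    # digests of the texts already kept
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

texts = [
    "Large language models require diverse datasets.",
    "Language models need large and diverse datasets.",
    "This is a duplicate sentence.",
    "This is a duplicate sentence.",
]

unique_texts = dedupe_preserving_order(texts)
print(unique_texts)
```

Storing digests rather than full strings keeps memory use bounded when individual texts are long. For short strings, `list(dict.fromkeys(texts))` gives the same order-preserving result with less code.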

### Near-Duplicate Detection with MinHash

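MinHash is a technique for approximating the Jaccard similarity between sets (the size of their intersection divided by the size of their union). Because it compares compact signatures instead of the full sets, it is useful for detecting near-duplicates in large datasets.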

1.  **Setup MinHash:** Use the `datasketch` library to implement MinHash.

    ```python
    from datasketch import MinHash
    minhash_dict = {}
    ```

2.  **Create MinHash Signatures:** For each unique text, create a MinHash signature.

    ```python
    for i, text in enumerate(unique_texts):
        minhash = MinHash(num_perm=128)
        for word in text.split():
            minhash.update(word.encode('utf8'))
        minhash_dict[i] = minhash
    ```

      * `num_perm=128`: This parameter specifies the number of permutations used in the MinHash algorithm. A higher number of permutations increases the accuracy of the similarity estimation but also increases the computational cost. In this context, `num_perm=128` strikes a balance between accuracy and efficiency, providing a reliable approximation of the Jaccard similarity between sets.
      * `encode('utf8')`: The `encode('utf8')` method is used to convert each word in the text into a byte format, which is necessary for the MinHash object to process the data. UTF-8 is a standard encoding that supports a wide range of characters, ensuring that the text is correctly encoded regardless of its content.
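
To build intuition for what a signature represents, the following is a toy, pure-Python sketch of the MinHash idea. It is not how `datasketch` is implemented: the seeded SHA-1 hashing scheme and the function names are assumptions chosen for readability, and the token sets are assumed to be non-empty.

```python
import hashlib

def minhash_signature(tokens, num_perm=128):
    """Toy MinHash: for each seed, keep the smallest hash value over all tokens."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a different salt simulates a different hash function
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + tok.encode("utf8")).digest()[:8], "big")
            for tok in tokens
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minimum values agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

a = set("Large language models require diverse datasets.".split())
b = set("Language models need large and diverse datasets.".split())

exact = exact_jaccard(a, b)
estimate = estimated_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact: {exact:.3f}, estimated: {estimate:.3f}")
```

The estimate should land close to the exact value, and increasing `num_perm` narrows the gap at the cost of more hashing work.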

### Near-Duplicate Detection with Locality-Sensitive Hashing (LSH)

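Locality-Sensitive Hashing (LSH) efficiently finds similar items in large datasets by hashing similar items into the same buckets, so that only items sharing a bucket need to be compared instead of every possible pair.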

1.  **Setup LSH:** Initialize LSH with a similarity threshold.

    ```python
    from datasketch import MinHashLSH
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    ```

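    Here, `MinHashLSH` is initialized with a threshold of 0.8, so it aims to return items whose estimated Jaccard similarity is about 0.8 or higher. Because LSH is approximate, some pairs just below the threshold may be returned and some just above it may be missed.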

2.  **Insert MinHash Signatures into LSH:** Insert each MinHash signature into the LSH.

    ```python
    for i, minhash in minhash_dict.items():
        lsh.insert(f"text_{i}", minhash)
    ```

3.  **Query for Near-Duplicates with LSH:** Use the LSH to find near-duplicates.

    ```python
    for i, minhash in minhash_dict.items():
        print(f"Near duplicates for '{unique_texts[i]}':", lsh.query(minhash))
    ```

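    This code queries the LSH with each text's MinHash signature. Each query returns a list of the keys (such as `text_0`) of similar texts. Since every text was inserted into the index, the result includes the queried text's own key.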
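
One common way to build LSH on top of MinHash is banding: the signature is split into `bands` groups of `rows` values, and two items become candidates if they agree on every value in at least one band. Under that scheme, a pair with Jaccard similarity `s` becomes a candidate with probability `1 - (1 - s**rows)**bands`. The sketch below evaluates this formula; the band settings are illustrative assumptions, since libraries such as `datasketch` choose their own settings from `threshold` and `num_perm`.

```python
def candidate_probability(s, bands, rows):
    """Probability that two items with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

# Hypothetical split of a 128-value signature: 16 bands of 8 rows (16 * 8 = 128)
bands, rows = 16, 8

for s in (0.2, 0.5, 0.7, 0.8, 0.9):
    p = candidate_probability(s, bands, rows)
    print(f"similarity {s:.1f} -> candidate probability {p:.3f}")
```

The probability rises steeply over a narrow range of similarities, which is why an LSH threshold behaves like a soft cutoff and not an exact one.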

### Recall: Basic Concepts of TF-IDF

Before diving into near-duplicate detection using cosine similarity, let's briefly revisit the concept of **TF-IDF** (Term Frequency-Inverse Document Frequency). TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (or corpus). It is often used in text mining and information retrieval to convert text data into numerical vectors, which can then be used for various analyses, including similarity measurements.

  * **Term Frequency (TF):** Measures how frequently a term appears in a document. It is calculated as the number of times a term appears in a document divided by the total number of terms in the document.
  * **Inverse Document Frequency (IDF):** Measures how rare, and therefore how informative, a term is across the corpus. It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term.

The TF-IDF value is the product of these two metrics, providing a weight that indicates the importance of a term in a document relative to the entire corpus. This weighting helps in identifying the most relevant words for distinguishing between documents, making it a powerful tool for text vectorization.
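
The definitions above can be worked through by hand. The sketch below follows them literally on a tiny, made-up corpus; the function names are assumptions for this illustration, and `idf` assumes the term occurs in at least one document.

```python
import math
from collections import Counter

docs = [
    "large language models require diverse datasets",
    "language models need large and diverse datasets",
    "this is a duplicate sentence",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    """Occurrences of the term divided by the total number of terms in the document."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "require" appears in one document, "models" in two, so "require" gets the higher weight
print(f"require: {tf_idf('require', tokenized[0], tokenized):.4f}")
print(f"models:  {tf_idf('models', tokenized[0], tokenized):.4f}")
```

scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2-normalizes each document vector by default, so its values will differ from this textbook calculation even though the idea is the same.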

### Near-Duplicate Detection with Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors. In general the value lies between -1 and 1; because TF-IDF weights are non-negative, scores for TF-IDF vectors fall between 0 and 1, with values near 1 indicating near-duplicates. The formula for cosine similarity between two vectors $A$ and $B$ is:

$$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}$$

where:

  * $A \cdot B$ is the dot product of the vectors.
  * $\|A\|$ and $\|B\|$ are the magnitudes (or lengths) of the vectors.
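
The formula can be checked on small vectors before applying it to TF-IDF output. This is a minimal pure-Python sketch; the function name `cosine` is an assumption, and both vectors are assumed to be non-zero.

```python
import math

def cosine(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine([1, 2, 0], [2, 4, 0]), 2))  # same direction, different length
print(round(cosine([1, 0, 0], [0, 1, 0]), 2))  # no shared components
print(round(cosine([1, 1, 0], [1, 0, 1]), 2))  # partial overlap
```

Because only the direction of the vectors matters, a short text and a longer text with the same word proportions still score close to 1.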

1.  **Vectorize Texts:** Convert texts into numerical vectors using techniques like TF-IDF.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(unique_texts)
    ```

    The `TfidfVectorizer` converts the text data into a matrix of TF-IDF features.

2.  **Calculate Cosine Similarity:** Use the cosine similarity function to find similar texts and round the similarity matrix for better readability.

    ```python
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np

    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    print(np.round(cosine_sim, 2))
    ```

    This computes the cosine similarity between each pair of texts in the dataset and prints the similarity matrix rounded to two decimal places.

3.  **Identify Near-Duplicates:** Set a threshold to determine near-duplicates based on cosine similarity.

    ```python
    threshold = 0.8
    for i in range(len(unique_texts)):
        # Start from i + 1 so each unordered pair (i, j) is visited exactly once
        for j in range(i + 1, len(unique_texts)):
            if cosine_sim[i, j] > threshold:
                print(f"Near duplicates: '{unique_texts[i]}' and '{unique_texts[j]}'")
    ```

    Because the inner loop starts at `i + 1`, each pair of near-duplicates is visited, and therefore reported, only once.

### Handling and Removing Duplicates: When and How

#### When to Remove Duplicates

  * **Data Quality Improvement:** Remove duplicates to enhance the quality of your dataset, ensuring that the model learns from diverse and unique examples.
  * **Bias Reduction:** Duplicates can introduce bias, as repeated data points may skew the model's understanding. Removing them helps maintain a balanced dataset.
  * **Efficiency:** Reducing redundancy decreases the dataset size, leading to faster processing and training times.

#### How to Handle Duplicates

  * **Exact Duplicates:** Use Python's `set` data structure to remove exact duplicates efficiently.
  * **Near-Duplicates:** Implement MinHash, LSH, and Cosine Similarity to detect and handle near-duplicates, ensuring that similar but not identical entries are identified and managed.
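
Detection only tells you which texts are similar; you still need a rule for which ones to keep. One simple option, sketched below under stated assumptions, is a greedy pass that keeps a text only if it is not too similar to anything already kept. The function name and the hand-written similarity scores are hypothetical.

```python
def drop_near_duplicates(texts, similarity, threshold=0.8):
    """Keep a text only if its similarity to every already-kept text is at most the threshold."""
    kept = []  # indices of the texts that survive
    for i in range(len(texts)):
        if all(similarity[i][j] <= threshold for j in kept):
            kept.append(i)
    return [texts[i] for i in kept]

texts = ["a b c d", "a b c d e", "x y z"]

# Hypothetical pairwise scores, written by hand for illustration
similarity = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

print(drop_near_duplicates(texts, similarity))
# ['a b c d', 'x y z']
```

This keeps the first member of each group of similar texts. Which member to keep (the first, the longest, the most recent) is a design choice that depends on the dataset.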

#### Cases to Consider

  * **Domain-Specific Needs:** In some domains, duplicates might be necessary for emphasis or context. Evaluate the importance of duplicates based on your specific use case.
  * **Data Augmentation:** If duplicates are part of a data augmentation strategy, consider their role in enhancing model robustness before removal.
  * **Threshold Tuning:** Adjust the similarity thresholds used with MinHash LSH and cosine similarity to match the level of similarity you want to detect. A threshold that is too low removes distinct entries; one that is too high leaves near-duplicates in the dataset.

### Summary and Preparation for Practice

In this lesson, you learned how to perform both exact and near-duplicate deduplication on datasets, a crucial step in preparing data for large language models. You now understand how to use Python's `set` for exact deduplication, the `datasketch` library for detecting near-duplicates with MinHash and LSH, and TF-IDF cosine similarity as a complementary similarity measure. As you move on to the practice exercises, apply these techniques to different datasets and experiment with various parameters to deepen your understanding. This hands-on practice will reinforce your learning and prepare you for more advanced data preparation tasks.

## Removing Exact Duplicates Efficiently

You've learned about exact deduplication using Python's set. Now, let's put that into practice. Your task is to create a function that removes exact duplicates from a list of text strings.

  * Use Python's `set` to filter out duplicates.
  * Return a list of unique texts.
  * Calculate and print:
      * The number of duplicates removed.
      * The percentage of the original dataset that consisted of duplicates.

This exercise will solidify your understanding of basic deduplication. Dive in and see how efficiently you can clean up the dataset!

```python
def deduplicate_texts(texts):
    # TODO: Convert list to set to remove duplicates
    
    # TODO: Calculate number of duplicates removed
    
    # TODO: Calculate percentage of duplicates
    
    # TODO: Print the number and percentage of duplicates
    print(f"Number of duplicates removed: {______}")
    print(f"Percentage of duplicates: {______:.2f}%")
    
    return unique_texts

# Sample dataset with duplicates
texts = [
    "Natural language processing is an interesting field.",
    "Natural language processing is an interesting field.",
    "Natural langauge processing is an interesting field.",  # Typo in "language"
    "Natural language processing is a fascinating field.",   # Word change: "interesting" → "fascinating"
    "Processing of natural language is interesting.",        # Word order changed
    "Deep learning is used in natural language processing."  # Different but related sentence
]

# Call the function and print unique texts
unique_texts = deduplicate_texts(texts)
print("Unique texts:", unique_texts)

```

One possible solution:

```python
def deduplicate_texts(texts):
    # Convert the list to a set to remove duplicates (original order is not preserved)
    initial_count = len(texts)
    unique_texts = list(set(texts))
    final_count = len(unique_texts)
    
    # Number of duplicates removed
    duplicates_removed = initial_count - final_count
    
    # Percentage of the original dataset that was duplicates (guard against an empty list)
    percentage_duplicates = (duplicates_removed / initial_count) * 100 if initial_count else 0.0
    
    # Report the number and percentage of duplicates
    print(f"Number of duplicates removed: {duplicates_removed}")
    print(f"Percentage of duplicates: {percentage_duplicates:.2f}%")
    
    return unique_texts

# Sample dataset with duplicates
texts = [
    "Natural language processing is an interesting field.",
    "Natural language processing is an interesting field.",
    "Natural langauge processing is an interesting field.",  # Typo in "language"
    "Natural language processing is a fascinating field.",   # Word change: "interesting" → "fascinating"
    "Processing of natural language is interesting.",        # Word order changed
    "Deep learning is used in natural language processing."  # Different but related sentence
]

# Call the function and print unique texts
unique_texts = deduplicate_texts(texts)
print("Unique texts:", unique_texts)

```
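
The counts above say how many duplicates were removed but not which texts were repeated. As an optional extension, the sketch below uses `collections.Counter` to report the repeated texts; the shortened sample list is an assumption for brevity.

```python
from collections import Counter

texts = [
    "Natural language processing is an interesting field.",
    "Natural language processing is an interesting field.",
    "Natural langauge processing is an interesting field.",  # Typo, so not an exact duplicate
    "Deep learning is used in natural language processing.",
]

counts = Counter(texts)

# Texts that occur more than once, with their occurrence counts
repeated = {text: n for text, n in counts.items() if n > 1}

# Every occurrence beyond the first is a removable duplicate
duplicates_removed = sum(n - 1 for n in repeated.values())

print(repeated)
print(duplicates_removed)
```

Note that the sentence containing the typo is counted as unique, which is exactly the gap that near-duplicate detection is meant to close.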

## Creating MinHash Signatures

Nice job on understanding exact deduplication! Now, let's take it a step further. Your task is to implement a function that creates MinHash signatures for a list of unique texts.

  * Use the `datasketch` library to generate a MinHash signature for each text.
  * Split each text into words, convert them to bytes, and update the MinHash object.
  * Return a dictionary mapping text indices to their MinHash signatures.

This exercise will help you grasp the first step in detecting near-duplicates. Dive in and see how you can apply MinHash to your dataset!

```python
from datasketch import MinHash

def create_minhash_signatures(texts):
    minhash_dict = {}
    for i, text in enumerate(texts):
        minhash = MinHash(num_perm=128)
        # TODO: Split the text into words
        # TODO: Convert each word to bytes and update the MinHash object
        minhash_dict[i] = minhash
    return minhash_dict

# Example usage
unique_texts = [
    "Natural language processing is an interesting field.",
    "Natural langauge processing is an interesting field.",  # Typo in "language"
    "Natural language processing is a fascinating field.",   # Word change: "interesting" → "fascinating"
    "Processing of natural language is interesting.",        # Word order changed
    "Deep learning is used in natural language processing."  # Different but related sentence
]

minhash_signatures = create_minhash_signatures(unique_texts)
for index, minhash in minhash_signatures.items():
    print(f"Text {index}: MinHash signature created.")
```

One possible solution:

```python
from datasketch import MinHash

def create_minhash_signatures(texts):
    minhash_dict = {}
    for i, text in enumerate(texts):
        minhash = MinHash(num_perm=128)
        # Split the text into whitespace-separated words
        words = text.split()
        # Encode each word as UTF-8 bytes and update the MinHash object
        for word in words:
            minhash.update(word.encode('utf8'))
        minhash_dict[i] = minhash
    return minhash_dict

# Example usage
unique_texts = [
    "Natural language processing is an interesting field.",
    "Natural langauge processing is an interesting field.",  # Typo in "language"
    "Natural language processing is a fascinating field.",   # Word change: "interesting" → "fascinating"
    "Processing of natural language is interesting.",        # Word order changed
    "Deep learning is used in natural language processing."  # Different but related sentence
]

minhash_signatures = create_minhash_signatures(unique_texts)
for index, minhash in minhash_signatures.items():
    print(f"Text {index}: MinHash signature created.")
```
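
Because `text.split()` keeps capitalization and trailing punctuation, tokens such as `interesting` and `interesting.` are treated as different words. The sketch below shows how much this can change the exact Jaccard similarity that MinHash approximates. The normalization step (lowercasing and stripping ASCII punctuation) is an assumption for illustration and is not part of the exercise.

```python
import string

def raw_tokens(text):
    """Word set from a plain whitespace split, as used in the exercise."""
    return set(text.split())

def normalized_tokens(text):
    """Lowercase the text and strip ASCII punctuation before splitting."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def jaccard(a, b):
    return len(a & b) / len(a | b)

first = "Natural language processing is an interesting field."
second = "Processing of natural language is interesting."

raw_score = jaccard(raw_tokens(first), raw_tokens(second))
normalized_score = jaccard(normalized_tokens(first), normalized_tokens(second))

print(f"raw tokens:        {raw_score:.3f}")
print(f"normalized tokens: {normalized_score:.3f}")
```

Whether to normalize depends on the dataset: it helps catch reordered or recapitalized sentences, but it also hides differences that may matter, such as case-sensitive identifiers.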

## Detect Near-Duplicates with LSH

Well done on mastering MinHash signatures! Now, let's explore Locality-Sensitive Hashing (LSH) to detect near-duplicates in a dataset. Your task is to implement a function that:

  * Initializes a `MinHashLSH` object with the given similarity threshold. You will call the function twice, with thresholds of 0.8 and 0.5.
  * Inserts MinHash signatures into the LSH.
  * Queries the LSH to find near-duplicates for each text.
  * Returns a dictionary mapping text indices to their near-duplicates.

Also, print examples of identified near-duplicate text indices along with the similarity threshold used. This exercise will deepen your understanding of near-duplicate detection and show how thresholds affect the detection process. Let's see how effectively you can apply LSH!

```python
from datasketch import MinHash, MinHashLSH

def detect_near_duplicates(minhash_dict, texts, threshold=0.8):
    # TODO: Initialize MinHashLSH with the given similarity threshold
    # num_perm=128 specifies the number of permutations used in MinHash, affecting the accuracy of similarity estimation
    
    # TODO: Insert MinHash signatures into LSH
    
    # Query the LSH to find near-duplicates for each text
    near_duplicates = {}
    for i, minhash in minhash_dict.items():
        # TODO: Query the LSH with the current MinHash signature to find similar items        
        # TODO: Store the result in the near_duplicates dictionary with the current index as the key        
        # TODO: Print examples of identified near-duplicate text indices along with their similarity thresholds
        
    return near_duplicates

# Example usage
texts = [
    "Natural language processing is an interesting field.",
    "Natural langauge processing is an interesting field.",  # Typo in "language"
    "Natural language processing is a fascinating field.",   # Word change: "interesting" → "fascinating"
    "Processing of natural language is interesting.",        # Word order changed
    "Deep learning is used in natural language processing."  # Different but related sentence
]

# Create MinHash signatures
minhash_dict = {}
for i, text in enumerate(texts):
    minhash = MinHash(num_perm=128)
    for word in text.split():
        minhash.update(word.encode('utf8'))
    minhash_dict[i] = minhash

# Detect near-duplicates
print("Results for near-duplicates with a similarity threshold of 0.8:")
detect_near_duplicates(minhash_dict, texts)
print("-----------------------------------------------------------------")
# TODO: Detect near-duplicates with threshold = 0.5
print("Results for near-duplicates with a similarity threshold of 0.5:")

```

One possible solution:

```python
from datasketch import MinHash, MinHashLSH

def detect_near_duplicates(minhash_dict, texts, threshold=0.8):
    # Initialize MinHashLSH with the given similarity threshold
    # num_perm=128 specifies the number of permutations used in MinHash, affecting the accuracy of similarity estimation
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    
    # Insert MinHash signatures into LSH
    for i, minhash in minhash_dict.items():
        lsh.insert(f"text_{i}", minhash)
    
    # Query the LSH to find near-duplicates for each text
    near_duplicates = {}
    for i, minhash in minhash_dict.items():
        # Query the LSH with the current MinHash signature to find similar items
        result = lsh.query(minhash)
        # Store the result under the current index
        near_duplicates[i] = result
        # A result contains the text's own key, so only report texts with additional matches
        if len(result) > 1:
            print(f"Near duplicates for text_{i} (threshold={threshold}): {result}")
            
    return near_duplicates

# Example usage
texts = [
    "Natural language processing is an interesting field.",
    "Natural langauge processing is an interesting field.",  # Typo in "language"
    "Natural language processing is a fascinating field.",   # Word change: "interesting" → "fascinating"
    "Processing of natural language is interesting.",        # Word order changed
    "Deep learning is used in natural language processing."  # Different but related sentence
]

# Create MinHash signatures
minhash_dict = {}
for i, text in enumerate(texts):
    minhash = MinHash(num_perm=128)
    for word in text.split():
        minhash.update(word.encode('utf8'))
    minhash_dict[i] = minhash

# Detect near-duplicates
print("Results for near-duplicates with a similarity threshold of 0.8:")
detect_near_duplicates(minhash_dict, texts, threshold=0.8)
print("-----------------------------------------------------------------")
# TODO: Detect near-duplicates with threshold = 0.5
print("Results for near-duplicates with a similarity threshold of 0.5:")
detect_near_duplicates(minhash_dict, texts, threshold=0.5)
```
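
Since LSH results are approximate, a common follow-up is to verify each candidate with an exact similarity check. The sketch below is one way to do that with word-set Jaccard similarity. The function names and the hand-written `candidates` dictionary are assumptions; the keys follow the `text_{i}` pattern used above.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def verify_candidates(candidates, texts, threshold):
    """Keep only candidates whose exact word-set Jaccard similarity meets the threshold."""
    token_sets = [set(text.split()) for text in texts]
    verified = {}
    for i, keys in candidates.items():
        verified[i] = []
        for key in keys:
            j = int(key.split("_")[1])  # recover the index from a key like "text_3"
            if j != i and jaccard(token_sets[i], token_sets[j]) >= threshold:
                verified[i].append(j)
    return verified

texts = [
    "Natural language processing is an interesting field.",
    "Natural langauge processing is an interesting field.",  # Typo in "language"
    "Deep learning is used in natural language processing.",
]

# Hypothetical LSH output, written by hand for illustration
candidates = {0: ["text_0", "text_1", "text_2"], 1: ["text_1", "text_0"], 2: ["text_2"]}

print(verify_candidates(candidates, texts, threshold=0.5))
# {0: [1], 1: [0], 2: []}
```

The exact check is only run on the candidates LSH returns, so it stays cheap even when the dataset is large.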

## Detecting Near-Duplicates with Cosine Similarity

You've done well with MinHash! Now, let's explore Cosine Similarity for detecting near-duplicates. Your task is to complete the missing parts in the starter code to:

  * Use `TfidfVectorizer` to convert texts into TF-IDF vectors.
  * Calculate the cosine similarity matrix and print it.
  * Identify text pairs with scores above two different thresholds.
  * Return two dictionaries mapping each text index to a list of its near-duplicates for each threshold.

This exercise will deepen your understanding of different deduplication techniques. Dive in and see how cosine similarity can enhance your dataset analysis!

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def find_near_duplicates(texts, threshold1, threshold2):
    # TODO: Use TfidfVectorizer to convert texts into TF-IDF vectors
    vectorizer = _______________
    tfidf_matrix = _______.fit_transform(texts)
    
    # TODO: Calculate the cosine similarity matrix and print it
    cosine_sim = cosine_similarity(_________, __________)
    print("Cosine Similarity Matrix:")
    print(np.round(cosine_sim, 2))

    near_duplicates1 = {}
    near_duplicates2 = {}
    for i in range(len(texts)):
        # TODO: Identify text pairs with similarity scores above the first threshold
        similar_indices1 = [j for j in range(len(texts)) if i != j and _______ > threshold1]
        near_duplicates1[i] = similar_indices1
        
        # TODO: Identify text pairs with similarity scores above the second threshold
        similar_indices2 = [j for j in range(len(texts)) if i != j and  _______ > threshold2]
        near_duplicates2[i] = similar_indices2
    return near_duplicates1, near_duplicates2

# Sample dataset with duplicates
texts = [
    "Natural language processing is an interesting field.",
    "Natural language processing is an interesting field.",  # This is a duplicate
    "Natural langauge processing is an interesting field.",  # Typo in "language"
    "Natural language processing is a fascinating field.",   # Word change: "interesting" → "fascinating"
    "Processing of natural language is interesting.",        # Word order changed
    "Deep learning is used in natural language processing."  # Different but related sentence
]

# Example usage
threshold1 = 0.8
threshold2 = 0.5
near_duplicates1, near_duplicates2 = find_near_duplicates(texts, threshold1, threshold2)
print("Near duplicates with threshold 0.8:")
for index, duplicates in near_duplicates1.items():
    print(f"Text {index}: {duplicates}")

print("Near duplicates with threshold 0.5:")
for index, duplicates in near_duplicates2.items():
    print(f"Text {index}: {duplicates}")
```