# Unit 3

## Dataset Filtering and Toxicity Detection

**Introduction and Context Setting**

Welcome to the lesson on Dataset Filtering and Toxicity Detection. In the previous lessons, we explored efficient data storage and deduplication techniques for preparing datasets for large language models (LLMs). Now, we will focus on filtering datasets to remove non-English and toxic content. This step helps ensure the quality and safety of the data used to train LLMs. By the end of this lesson, you will be able to implement a function that filters out unwanted content from a dataset.

### Language Detection with `langdetect`

To filter out non-English content, we will use the `langdetect` library. This library helps identify the language of a given text.

**Step-by-Step Explanation**

1.  **Import the Library:** First, we need to import the `detect` function from the `langdetect` library.

    ```python
    from langdetect import detect
    ```

2.  **Detect Language:** Use the `detect` function to identify the language of a text. It returns a language code, such as "en" for English.

    ```python
    text = "Je ne parle pas anglais."
    language = detect(text)
    print("Detected Language:", language)
    ```

    Output:

    ```
    Detected Language: fr
    ```

    Here, the text is in French, so the detected language code is "fr".
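
Two practical details are worth keeping in mind when calling `detect` on real data. First, `langdetect` is non-deterministic by default, so short or ambiguous texts can yield different codes between runs; setting `DetectorFactory.seed = 0` (with `DetectorFactory` imported from `langdetect`) before detecting makes results repeatable. Second, `detect` raises an exception for empty text or text without usable features, such as digits and punctuation only. The following is a minimal sketch of a defensive wrapper. The names `safe_detect` and `stub_detector` are hypothetical, and the stub only stands in for `langdetect.detect` so that the example runs without the library installed.

```python
def safe_detect(text, detector, default="unknown"):
    """Return the detected language code, or `default` if detection fails."""
    if not text or not text.strip():
        return default  # Nothing to analyze
    try:
        return detector(text)
    except Exception:
        # langdetect raises LangDetectException when it finds no usable features
        return default


def stub_detector(text):
    """Hypothetical stand-in for langdetect.detect (assumption, not the real model)."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "fr" if "parle" in text else "en"


print(safe_detect("Je ne parle pas anglais.", stub_detector))  # fr
print(safe_detect("12345 !!!", stub_detector))                 # unknown
print(safe_detect("", stub_detector))                          # unknown
```

In a real pipeline you would pass `detect` itself as the `detector` argument.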

### Toxicity Detection with `Detoxify`

Next, we will use the `Detoxify` library to detect toxic language in the text. This library provides a model that predicts toxicity scores.

**Step-by-Step Explanation**

1.  **Import the Library:** Import the `Detoxify` class from the `detoxify` library.

    ```python
    from detoxify import Detoxify
    ```

2.  **Predict Toxicity:** Use the `Detoxify` model to predict the toxicity score of a text. A higher score indicates more toxic content.

    ```python
    text = "I hate this group of people!"
    toxicity_scores = Detoxify("original").predict(text)
    print("Toxicity Score:", toxicity_scores["toxicity"])
    ```

    Output:

    ```
    Toxicity Score: 0.85
    ```

    In this example, the text is considered highly toxic with a score of 0.85. The value shown is rounded for illustration; the exact score depends on the model checkpoint.
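
The dictionary returned by `predict` contains more than the `"toxicity"` key. As a rough sketch, the helper below lists every label whose score exceeds a threshold. The label names are assumed to match those produced by the `original` checkpoint, the numbers are illustrative values rather than real model output, and `flagged_labels` is a hypothetical helper name.

```python
def flagged_labels(scores, threshold=0.7):
    """Return the labels whose score exceeds the threshold, highest score first."""
    over = [(label, score) for label, score in scores.items() if score > threshold]
    over.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in over]


# Illustrative values only, not real model output
example_scores = {
    "toxicity": 0.85,
    "severe_toxicity": 0.02,
    "obscene": 0.10,
    "threat": 0.01,
    "insult": 0.74,
    "identity_attack": 0.31,
}

print(flagged_labels(example_scores))                 # ['toxicity', 'insult']
print(flagged_labels(example_scores, threshold=0.9))  # []
```

Inspecting several labels can help when choosing which categories a filter should act on, instead of relying on the overall toxicity score alone.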

### Implementing the Filtering Function

Now, let's implement a function that combines language and toxicity detection to filter a dataset.

**Step-by-Step Explanation**

1.  **Define the Function:** Create a function `filter_text` that takes a text as input and returns `None` if the text is non-English or highly toxic.

    ```python
    # Load the model once; constructing Detoxify inside the function would reload it on every call
    toxicity_model = Detoxify("original")

    def filter_text(text):
        # Language detection
        if detect(text) != "en":
            return None  # Remove non-English text
        # Toxicity detection
        toxicity_scores = toxicity_model.predict(text)
        if toxicity_scores["toxicity"] > 0.7:
            return None  # Remove highly toxic content
        return text  # Keep clean text
    ```

    The function first checks if the text is in English. If not, it returns `None`. It then checks the toxicity score. If the score is above 0.7, it returns `None`. If the text passes both checks, it is returned as clean text.
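
One way to make this logic easier to test and tune is to pass the language detector and the toxicity scorer in as arguments, so the checks can be exercised without downloading model weights. The sketch below assumes hypothetical names (`build_filter`, `stub_language`, `stub_toxicity`); the stubs are simple keyword rules, not real models.

```python
def build_filter(detect_language, score_toxicity, threshold=0.7):
    """Create a filter_text-style function from two injected callables."""
    def text_filter(text):
        if detect_language(text) != "en":
            return None  # Remove non-English text
        if score_toxicity(text) > threshold:
            return None  # Remove highly toxic content
        return text  # Keep clean text
    return text_filter


# Hypothetical stand-ins so the sketch runs without langdetect or Detoxify
def stub_language(text):
    return "fr" if text.startswith("Je ") else "en"


def stub_toxicity(text):
    return 0.9 if "hate" in text.lower() else 0.05


text_filter = build_filter(stub_language, stub_toxicity)
print(text_filter("This is a normal sentence."))    # This is a normal sentence.
print(text_filter("I hate this group of people!"))  # None
print(text_filter("Je ne parle pas anglais."))      # None
```

With the real libraries, `detect_language` would be `detect`, and `score_toxicity` would be a small function that returns the `"toxicity"` entry from a Detoxify model created once up front.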

### Applying the Filtering Function to a Dataset

Finally, we will apply the `filter_text` function to a list of texts using a list comprehension.

**Step-by-Step Explanation**

1.  **Sample Dataset:** Define a list of sample texts.

    ```python
    texts = [
        "I hate this group of people!",  # Toxic statement
        "This is a normal sentence.",
        "Je ne parle pas anglais."  # Non-English
    ]
    ```

2.  **Apply Filtering:** Use a list comprehension to filter the dataset.

    ```python
    # Call filter_text once per text and keep the texts that were not rejected
    filtered_texts = [text for text in texts if filter_text(text) is not None]
    print("Filtered Dataset:", filtered_texts)
    ```

    Output:

    ```
    Filtered Dataset: ['This is a normal sentence.']
    ```

    The list comprehension iterates over each text, applies the `filter_text` function, and includes only the texts that are not `None`.
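
For larger datasets, scoring texts one at a time can be slow. Detoxify's `predict` also accepts a list of strings, in which case each label maps to a list of scores in the same order as the input. The sketch below shows how such a batched result could be used for filtering. The names `filter_batch` and `stub_score_batch` are hypothetical, and the stub imitates only the shape of the batched result with a keyword rule.

```python
def filter_batch(texts, score_batch, threshold=0.7):
    """Keep texts whose batched toxicity score is at or below the threshold."""
    scores = score_batch(texts)["toxicity"]  # One score per input text
    return [text for text, score in zip(texts, scores) if score <= threshold]


def stub_score_batch(texts):
    """Hypothetical stand-in that mimics the shape of a batched prediction."""
    return {"toxicity": [0.9 if "hate" in t.lower() else 0.05 for t in texts]}


sample = ["I hate this group of people!", "This is a normal sentence."]
print(filter_batch(sample, stub_score_batch))  # ['This is a normal sentence.']
```

With the real library, `score_batch` would be the `predict` method of a Detoxify model that has already been loaded.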

### Summary and Preparation for Practice

In this lesson, you learned how to filter a dataset by removing non-English and toxic content using the `langdetect` and `Detoxify` libraries. We implemented a function that combines these checks and applied it to a sample dataset. This filtering process is essential for maintaining the quality and safety of data used in training large-scale language models.

As you move on to the practice exercises, try experimenting with different datasets and filtering criteria. This hands-on practice will reinforce your understanding and help you apply these techniques to real-world scenarios. Congratulations on reaching this point in the course, and keep up the great work!



## Language Detection and Reporting

Now that you've learned about the importance of filtering datasets for LLM training, let's put your knowledge into practice! In this exercise, you'll focus specifically on the language detection aspect we just covered.

Your task is to complete the `detect_languages` function, which processes a list of text samples and returns a list of tuples, each containing a text and its detected language. You'll use the `detect` function from the `langdetect` library to identify each text's language.

To complete this exercise:

- Implement the logic to detect the language of each text.
- Return a list of tuples, each containing the text and its detected language.

This exercise builds a foundation for the more complex filtering we'll do later, when we combine language detection with toxicity filtering. Successfully implementing this function will give you the skills needed for real-world dataset preparation tasks!

```python
from langdetect import detect

# Sample dataset with texts in different languages
texts = [
    "Hello, this is a sample text in English.",  # English
    "Hola, este es un ejemplo de texto en español.",  # Spanish
    "Bonjour, ceci est un exemple de texte en français.",  # French
    "Hallo, dies ist ein Beispieltext auf Deutsch.",  # German
    "This is another English text sample.",  # English
    "Questo è un esempio di testo in italiano.",  # Italian
    "The quick brown fox jumps over the lazy dog.",  # English
    "私はあなたの言語を話せません。",  # Japanese
    "English is a West Germanic language."  # English
]

def detect_languages(text_list):
    """
    Detect the language of each text in the list and return the results.
    
    Args:
        text_list (list): A list of text strings in various languages
    
    Returns:
        list: A list of tuples containing each text and its detected language
    """
    results = []
    for text in text_list:
        # TODO: Detect the language of the text
        # TODO: Append the text and its detected language to the results list
        pass
    return results

# Get the detection results
detection_results = detect_languages(texts)

# Print the results
for text, language in detection_results:
    print(f"Text: {text} | Detected Language: {language}")
```

**Solution:**

```python
from langdetect import detect

# Sample dataset with texts in different languages
texts = [
    "Hello, this is a sample text in English.",  # English
    "Hola, este es un ejemplo de texto en español.",  # Spanish
    "Bonjour, ceci est un exemple de texte en français.",  # French
    "Hallo, dies ist ein Beispieltext auf Deutsch.",  # German
    "This is another English text sample.",  # English
    "Questo è un esempio di testo in italiano.",  # Italian
    "The quick brown fox jumps over the lazy dog.",  # English
    "私はあなたの言語を話せません。",  # Japanese
    "English is a West Germanic language."  # English
]

def detect_languages(text_list):
    """
    Detect the language of each text in the list and return the results.
    
    Args:
        text_list (list): A list of text strings in various languages
    
    Returns:
        list: A list of tuples containing each text and its detected language
    """
    results = []
    for text in text_list:
        # Detect the language of the text
        try:
            language = detect(text)
        except Exception:
            # langdetect raises LangDetectException for empty or feature-less text
            language = "unknown"
        # Append the text and its detected language to the results list
        results.append((text, language))
    return results

# Get the detection results
detection_results = detect_languages(texts)

# Print the results
for text, language in detection_results:
    print(f"Text: {text} | Detected Language: {language}")
```

## Filter English Texts with Langdetect

You've done well learning about language detection! Now, let's put that knowledge to use. Your task is to create a function that uses the `langdetect` library to filter a list of text samples.

- Import the `detect` function from `langdetect`.
- Use it to identify the language of each text.
- Return only the English texts from the list.

Dive in and see how effectively you can filter out non-English content!

```python
from langdetect import detect

# Sample dataset
texts = [
    "Hello, how are you?",  # English
    "Hola, ¿cómo estás?",   # Spanish
    "Bonjour, comment ça va?",  # French
    "This is a test sentence.",  # English
    "C'est une phrase de test."  # French
]

# Function to filter English texts
def filter_english_texts(texts):
    # TODO: Use list comprehension to filter only English texts (en)
    return english_texts

# Apply filtering
filtered_texts = filter_english_texts(texts)
print("Filtered English Texts:", filtered_texts)
```

**Solution:**

```python
from langdetect import detect

# Sample dataset
texts = [
    "Hello, how are you?",  # English
    "Hola, ¿cómo estás?",   # Spanish
    "Bonjour, comment ça va?",  # French
    "This is a test sentence.",  # English
    "C'est une phrase de test."  # French
]

# Function to filter English texts
def filter_english_texts(texts):
    # Use list comprehension to filter only English texts (en)
    english_texts = [text for text in texts if detect(text) == 'en']
    return english_texts

# Apply filtering
filtered_texts = filter_english_texts(texts)
print("Filtered English Texts:", filtered_texts)
```

**Output:**

```
Filtered English Texts: ['Hello, how are you?', 'This is a test sentence.']
```
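
Short texts like these are the hardest case for language detection, so a stricter filter may want to require a minimum confidence. `langdetect` also offers `detect_langs`, which returns candidate languages with `.lang` and `.prob` attributes. The sketch below shows one possible confidence check. `Candidate` is a hypothetical stand-in that mirrors those two attributes so the example runs without the library, and `is_confident_english` and the 0.9 cutoff are illustrative assumptions.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the `.lang` and `.prob` attributes of detect_langs results
Candidate = namedtuple("Candidate", ["lang", "prob"])


def is_confident_english(candidates, min_prob=0.9):
    """Return True if the most probable candidate is English with enough confidence."""
    if not candidates:
        return False
    best = max(candidates, key=lambda c: c.prob)
    return best.lang == "en" and best.prob >= min_prob


print(is_confident_english([Candidate("en", 0.99)]))                         # True
print(is_confident_english([Candidate("en", 0.57), Candidate("nl", 0.43)]))  # False
```

A higher `min_prob` discards more borderline texts, trading recall for a cleaner English subset.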

## Detect and Filter Toxic Texts

Nice progress in understanding toxicity detection! Now, let's focus on filtering English texts based on their toxicity levels.

- Use the `Detoxify` library to analyze a list of texts.
- Calculate a toxicity score for each text.
- Remove texts with scores above 0.7.

This exercise will help you refine your skills in ensuring data quality. Dive in and see how well you can clean up the dataset!

```python
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # Suppress TensorFlow logs

from detoxify import Detoxify

# Sample dataset
texts = [
    "I hate this group of people!",  # Toxic statement
    "This is a normal sentence.",
    "I love everyone here!"  # Positive statement
]

# Function to filter content based on toxicity
def filter_toxic_texts(texts, threshold=0.7):
    filtered_texts = []
    for text in texts:
        # TODO: Calculate toxicity scores using Detoxify
        # TODO: Append text to filtered_texts if toxicity score is below or equal to threshold
        pass  # Placeholder so the starter code parses; remove once implemented
    return filtered_texts

# Apply filtering
filtered_texts = filter_toxic_texts(texts)
print("Filtered Dataset:", filtered_texts)

```

**Solution:**

```python
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # Suppress TensorFlow logs

from detoxify import Detoxify

# Sample dataset
texts = [
    "I hate this group of people!",  # Toxic statement
    "This is a normal sentence.",
    "I love everyone here!"  # Positive statement
]

# Function to filter content based on toxicity
def filter_toxic_texts(texts, threshold=0.7):
    detoxifier = Detoxify('unbiased')
    filtered_texts = []
    for text in texts:
        # Calculate toxicity scores using Detoxify
        results = detoxifier.predict(text)
        toxicity_score = results['toxicity']
        
        # Append text to filtered_texts if toxicity score is below or equal to threshold
        if toxicity_score <= threshold:
            filtered_texts.append(text)
            
    return filtered_texts

# Apply filtering
filtered_texts = filter_toxic_texts(texts)
print("Filtered Dataset:", filtered_texts)

```

**Output:**

```
Filtered Dataset: ['This is a normal sentence.', 'I love everyone here!']
```
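
The 0.7 threshold is a choice, not a fixed rule. One way to see its effect is to count how many texts would survive at several thresholds before committing to one. The sketch below assumes the scores have already been computed. The numbers are illustrative, not real model output, and `survivors_by_threshold` is a hypothetical helper name.

```python
def survivors_by_threshold(scores, thresholds):
    """Map each threshold to the number of texts whose score is at or below it."""
    return {t: sum(1 for s in scores if s <= t) for t in thresholds}


# Illustrative toxicity scores only, not real model output
example_scores = [0.92, 0.01, 0.003, 0.65, 0.71]

print(survivors_by_threshold(example_scores, [0.5, 0.7, 0.9]))
# {0.5: 2, 0.7: 3, 0.9: 4}
```

A lower threshold removes more borderline content at the cost of discarding more data, so the right value depends on the dataset and its intended use.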

## Filter English and Non-Toxic Texts

You've learned how to detect languages and filter toxic content. Now, let's combine these skills! Your task is to create a `filter_dataset` function that processes a list of texts and returns only English, non-toxic content.

- Use `langdetect` to check whether each text is in English.
- Use `Detoxify` to evaluate toxicity, filtering out texts with scores above 0.7.
- Implement the filtering pipeline to ensure that only clean texts are returned.

This exercise will solidify your understanding of dataset filtering. Dive in and see how effectively you can clean up the dataset!

```python
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # Suppress TensorFlow logs

from detoxify import Detoxify
from langdetect import detect

# Sample dataset
texts = [
    "I hate this group of people!",  # Toxic statement
    "This is a normal sentence.",
    "Je ne parle pas anglais."  # Non-English
]

# Function to filter content
def filter_dataset(texts):
    def filter_text(text):
        # TODO: Detect the language of the text
        # TODO: Compare the detected language with "en" to check if it's English
            return None  # Remove non-English text
        
        # TODO: Initialize Detoxify model
        # TODO: Predict the toxicity score of the text
        # TODO: Check if the toxicity score is greater than 0.7
            return None  # Remove highly toxic content
        
        # TODO: Return the text if it passes both language and toxicity checks
        return text  # Keep clean text

    # TODO: Create a list comprehension to filter texts using filter_text function
    return [filter_text(text) for text in texts if filter_text(text)]

# TODO: Execute the filter_dataset function with the sample dataset
# TODO: Print the filtered dataset to verify the results
filtered_texts = filter_dataset(texts)
print("Filtered Dataset:", filtered_texts)

```


If you try to avoid the duplicate `filter_text` call by using the walrus operator `:=`, Python raises a `SyntaxError`. The error message `assignment expression cannot rebind comprehension iteration variable 'text'` means that an assignment expression cannot reassign the variable being iterated over within a list comprehension.

The problem lies in this line:

```python
return [text for text in texts if (text := filter_text(text)) is not None]
```

To fix this, the result of `filter_text(text)` must be bound to a different name than the iteration variable. That works inside a comprehension, but it packs the filtering logic into a single dense expression.

A clearer alternative is to refactor the code into a standard `for` loop, which is more readable, avoids this syntax limitation, and leaves room for error handling.

Here is the corrected and more Pythonic code for the `filter_dataset` function:

-----

### Corrected Code

```python
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # Suppress TensorFlow logs

from detoxify import Detoxify
from langdetect import detect

# Sample dataset
texts = [
    "I hate this group of people!",  # Toxic statement
    "This is a normal sentence.",
    "Je ne parle pas anglais."  # Non-English
]

# Initialize Detoxify once to avoid reloading the model for each text
detoxify_model = Detoxify('unbiased')

# Function to filter content
def filter_dataset(texts):
    filtered_list = []
    for text in texts:
        # Check language
        try:
            if detect(text) != "en":
                continue # Skip to the next text if not English
        except Exception:
            continue # Skip if language detection fails

        # Check toxicity
        results = detoxify_model.predict(text)
        if results['toxicity'] > 0.7:
            continue # Skip if highly toxic

        # Add to the filtered list if it passes all checks
        filtered_list.append(text)

    return filtered_list

# Execute the filter_dataset function with the sample dataset
filtered_texts = filter_dataset(texts)

# Print the filtered dataset to verify the results
print("Filtered Dataset:", filtered_texts)
```

By switching to a standard `for` loop, the logic becomes straightforward: iterate through each `text`, apply your checks, and if a text passes, add it to a new list. This avoids the `SyntaxError` and makes the code easier to understand and debug.
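
If you still prefer a comprehension, the same error can be avoided by binding the result to a name other than the iteration variable. This is a minimal sketch that requires Python 3.8 or later. The names `filter_dataset_walrus` and `stub_filter` are hypothetical, and the stub replaces the real language and toxicity checks with a keyword rule so the example runs on its own.

```python
def filter_dataset_walrus(texts, filter_text):
    """Call filter_text once per text and keep the non-None results."""
    # `kept` is a new name, so the iteration variable `text` is never rebound
    return [kept for text in texts if (kept := filter_text(text)) is not None]


def stub_filter(text):
    """Hypothetical stand-in for filter_text (keyword rule, not a real model)."""
    if text.startswith("Je ") or "hate" in text.lower():
        return None
    return text


sample = [
    "I hate this group of people!",
    "This is a normal sentence.",
    "Je ne parle pas anglais.",
]
print(filter_dataset_walrus(sample, stub_filter))  # ['This is a normal sentence.']
```

The `for` loop version remains easier to extend with error handling, while this form is compact when `filter_text` already handles failures internally.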