# Unit 3

## Converting and Storing Text Chunks in JSONL Format

-----

# Welcome to the Final Lesson

Welcome to the final lesson of our course on data processing for **Large Language Models (LLMs)**. In this lesson, we will focus on converting text chunks into **JSONL** format and storing them for efficient retrieval and processing.

This skill is crucial for managing text data in LLM applications, allowing for streamlined data handling and processing. By the end of this lesson, you will be able to convert text chunks into JSONL format and store them for later use.

-----

### **Recall: Text Chunking Basics**

Before we dive into JSONL, let's briefly recall the concept of **text chunking**. In previous lessons, we discussed how breaking down large text into smaller, manageable chunks is essential for efficient processing in LLMs. This process helps maintain context and ensures that the model can handle the data effectively. Remember, chunking can be done by sentences, characters, or tokens, depending on the specific requirements of your task.
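As a quick refresher, here is a minimal sketch of two of the chunking strategies mentioned above. It uses only the standard library, and the helper names `chunk_by_characters` and `chunk_by_sentences` are illustrative rather than taken from earlier lessons. The regex-based sentence splitter is a deliberately naive stand-in for a real tokenizer.

```python
import re

def chunk_by_characters(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_by_sentences(text):
    """Naive sentence split: break at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "JSONL is simple. Each line is one record. That makes it easy to stream."

print(chunk_by_characters(sample, 30))
print(chunk_by_sentences(sample))
```

Character chunks have a predictable size but can cut words in half, while sentence chunks keep each unit readable at the cost of uneven lengths.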

-----

### **Understanding JSONL Format**

**JSONL**, or **JSON Lines**, is a text format that stores one JSON value per line, with lines separated by newline characters. In practice each line is usually a complete JSON object, so a large dataset can be processed one line at a time without loading the whole file into memory. This makes the format well suited to streaming data and to appending new records to an existing file.
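To make the format concrete, the sketch below parses a small JSONL payload with Python's built-in `json` module. The string `jsonl_text` is a made-up example that stands in for the contents of a file.

```python
import json

# Three records, one JSON object per line (stands in for a file's contents).
jsonl_text = (
    '{"id": 0, "text": "This is the first sentence."}\n'
    '{"id": 1, "text": "Here is the second sentence."}\n'
    '{"id": 2, "text": "And this is the third sentence."}\n'
)

# Each non-empty line is parsed on its own; no line depends on any other.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(records[1]["text"])  # Here is the second sentence.
```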

-----

### **Why JSONL?**

  * **Efficiency**: JSONL allows for line-by-line processing, which is memory-efficient.
  * **Simplicity**: Each line is a complete JSON object, making it easy to parse and manipulate.
  * **Scalability**: Ideal for large datasets, as it supports incremental processing.
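The points above can be sketched with the standard library alone. The helpers `append_record` and `iter_records` are hypothetical names chosen for this illustration, and the temporary file path is only a placeholder.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Append one record as a single JSON line; existing lines are left untouched."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def iter_records(path):
    """Yield records one at a time so only the current line is held in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")  # placeholder location
append_record(demo_path, {"id": 0, "text": "first"})
append_record(demo_path, {"id": 1, "text": "second"})

# Aggregate over the file without building a list of all records.
total_chars = sum(len(rec["text"]) for rec in iter_records(demo_path))
print(total_chars)  # 11
```

Because new records are simply appended as new lines, a file can grow incrementally without rewriting what is already stored.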

-----

### **Converting Text Chunks to JSONL**

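Let's start by converting text chunks into JSONL format using Python. We'll build each record as a plain Python dictionary and write the records with the third-party `jsonlines` library; Python's built-in `json` module can do the same job if you write one line at a time yourself.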

#### **Step 1: Chunk Your Text**

Before converting text into JSONL format, we need to chunk the text. Let's assume we have a large text that we want to break into smaller chunks. We'll use sentence-based chunking for this example.

```python
import nltk
nltk.download('punkt_tab', quiet=True)  # sentence tokenizer data; older NLTK releases use 'punkt'
from nltk.tokenize import sent_tokenize

large_text = "This is the first sentence. Here is the second sentence. And this is the third sentence."
chunks = sent_tokenize(large_text)
```

In this code, we use the `sent_tokenize` function from the `nltk` library to split the `large_text` into individual sentences, which will serve as our text chunks.

#### **Step 2: Create JSON Objects**

Next, we'll create a list of JSON objects, where each object contains an `id` and the corresponding `text` chunk.

```python
import json

chunk_data = [{"id": i, "text": chunk} for i, chunk in enumerate(chunks)]
```

In this code, we use a list comprehension to create a list of dictionaries. Each dictionary has an `id` (the index of the chunk) and the `text` (the chunk itself).

#### **Step 3: Convert to JSONL Format**

Now, let's convert these JSON objects into JSONL format and write them to a file using the `jsonlines` library.

Install the `jsonlines` library if you haven't already:

```bash
pip install jsonlines
```

Use the `jsonlines` library to write the JSON objects to a JSONL file:

```python
import jsonlines

with jsonlines.open("chunked_data.jsonl", mode='w') as writer:
    writer.write_all(chunk_data)
```

In this code, `jsonlines.open()` is used to open the file in write mode, and `writer.write_all()` writes all the JSON objects from `chunk_data` to the file in JSONL format. This approach eliminates the need for a manual loop to write each line.
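For comparison, here is a sketch of the manual loop that `write_all()` replaces, using only the built-in `json` module. The function name `write_jsonl` and the temporary output path are assumptions made for this example, not part of the lesson's code.

```python
import json
import os
import tempfile

chunk_data = [
    {"id": 0, "text": "This is the first sentence."},
    {"id": 1, "text": "Here is the second sentence."},
]

def write_jsonl(records, path):
    """Write each record as one JSON document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII characters readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

out_path = os.path.join(tempfile.mkdtemp(), "chunked_data.jsonl")  # placeholder location
write_jsonl(chunk_data, out_path)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines))
```

Either approach should produce an equivalent file; the library version is shorter, while the manual version avoids an extra dependency.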

-----

### **Storing and Retrieving JSONL Data**

Once we have stored our data in JSONL format, we need to know how to retrieve it for further processing.

#### **Step 4: Read JSONL Data**

To read the stored JSONL data, we open the file in read mode and load each line as a JSON object.

```python
with jsonlines.open("chunked_data.jsonl", mode='r') as reader:
    stored_chunks = [obj for obj in reader]
```

In this snippet, the `jsonlines` reader parses each line of the file, and the list comprehension collects the resulting Python dictionaries into `stored_chunks`.

#### **Step 5: Verify the Output**

Finally, let's print the first two chunks to verify that our data has been correctly stored and retrieved.

```python
print("Stored Chunks:", stored_chunks[:2])
```

The output will be:

```
Stored Chunks: [{'id': 0, 'text': 'This is the first sentence.'}, {'id': 1, 'text': 'Here is the second sentence.'}]
```
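Printing a sample is a quick visual check. For larger files, a programmatic check may be more practical. The sketch below assumes the record layout used in this lesson (`id` equal to the chunk's position, `text` a non-empty string); `validate_chunks` is a hypothetical helper name.

```python
def validate_chunks(records):
    """Return a list of problems found; an empty list means the records look consistent."""
    problems = []
    for position, record in enumerate(records):
        if record.get("id") != position:
            problems.append(
                f"record {position}: expected id {position}, got {record.get('id')!r}"
            )
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"record {position}: missing or empty text")
    return problems

# Sample data in the same shape as the lesson's stored chunks.
stored_chunks = [
    {"id": 0, "text": "This is the first sentence."},
    {"id": 1, "text": "Here is the second sentence."},
]

print(validate_chunks(stored_chunks))
```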

-----

### **Summary and Next Steps**

In this lesson, you learned how to convert text chunks into JSONL format and store them for efficient retrieval. We covered the benefits of JSONL, how to build JSON-serializable records in Python, and how to write and read JSONL files with the `jsonlines` library.

Congratulations on reaching the end of this course! You've gained valuable skills in text processing for LLMs, and I encourage you to apply these skills in the practice exercises that follow. Well done on your progress and dedication!

## Convert Text Chunks to JSONL

You've done well learning about the JSONL format! Now, let's put that knowledge into practice. Your task is to complete a function that takes a list of text chunks and converts them into JSON objects with an ID and text content.

Use the `jsonlines` library to write these objects to a JSONL file.

This exercise will help you solidify your understanding of converting text data to the JSONL format. Dive in and see how efficiently you can manage text data!

```python
import nltk
import jsonlines

# Step 1: Chunk Your Text
nltk.download('punkt_tab', quiet=True)
from nltk.tokenize import sent_tokenize

large_text = "This is the first sentence. Here is the second sentence. And this is the third sentence."
chunks = sent_tokenize(large_text)

# TODO: Define the function convert_chunks_to_jsonl(chunks, filename)
def convert_chunks_to_jsonl(chunks, filename):
    # TODO: Create a list of dictionaries with 'id' and 'text' for each chunk
    chunk_data = []
    
    with jsonlines.open(filename, mode='w') as writer:
        # TODO: Use jsonlines to write chunk_data to a JSONL file
        pass  # placeholder so the skeleton parses; replace with your write call

# Use the function to write to a file
convert_chunks_to_jsonl(chunks, "chunked_data.jsonl")

# Step 4: Read JSONL Data
with jsonlines.open("chunked_data.jsonl", mode='r') as reader:
    stored_chunks = [obj for obj in reader]

# Step 5: Verify the Output
print("Stored Chunks:", stored_chunks)
```

In the completed solution, `convert_chunks_to_jsonl` first builds a list of dictionaries, where each dictionary contains an `"id"` (the index) and the `"text"` for each chunk. It then uses the `jsonlines` library to write this data to the specified file.

```python
import nltk
import jsonlines

# Step 1: Chunk Your Text
nltk.download('punkt_tab', quiet=True)
from nltk.tokenize import sent_tokenize

large_text = "This is the first sentence. Here is the second sentence. And this is the third sentence."
chunks = sent_tokenize(large_text)

# Define the function convert_chunks_to_jsonl(chunks, filename)
def convert_chunks_to_jsonl(chunks, filename):
    # Create a list of dictionaries with 'id' and 'text' for each chunk
    chunk_data = [{"id": i, "text": chunk} for i, chunk in enumerate(chunks)]
    
    with jsonlines.open(filename, mode='w') as writer:
        # Use jsonlines to write chunk_data to a JSONL file
        writer.write_all(chunk_data)

# Use the function to write to a file
convert_chunks_to_jsonl(chunks, "chunked_data.jsonl")

# Step 4: Read JSONL Data
with jsonlines.open("chunked_data.jsonl", mode='r') as reader:
    stored_chunks = [obj for obj in reader]

# Step 5: Verify the Output
print("Stored Chunks:", stored_chunks)
```

## Filter Text Chunks with JSONL

Nice job on learning how to store text chunks in JSONL format! Now, let's focus on retrieving and filtering this data. Your task is to implement a function that reads from a JSONL file containing text chunks and filters them based on specific criteria.

  * Use the `jsonlines` library to read the file.
  * Filter chunks by minimum length or by checking for specific keywords.
  * Return the filtered results.

This exercise will reinforce your skills in handling JSONL data. Dive in and see how effectively you can process and filter text data!

```python
import jsonlines

def filter_chunks(file_path, min_length=0, keyword=None):
    filtered_chunks = []
    # TODO: Open the JSONL file in read mode using jsonlines
        # TODO: Iterate over each JSON object in the file
        for ____ in _____:
            # TODO: Extract the 'text' field from the JSON object
            text = obj[_____]
            # TODO: Check if the text meets the minimum length requirement
            # TODO: Check if the keyword is present in the text (if a keyword is provided)
            if len(text) >= _________ and (___________ is None or ________ in text):
                # TODO: If both conditions are met, add the object to the filtered list
                filtered_chunks.append(_____)
    # TODO: Return the list of filtered chunks
    return ___________

# Example usage
filtered = filter_chunks("chunked_data.jsonl", min_length=20, keyword="second")
print("Filtered Chunks:", filtered)

```

The completed `filter_chunks` function reads the JSONL file one object at a time and applies two filters: a **minimum length** and an optional **keyword**. Only the chunks that satisfy both conditions are added to the `filtered_chunks` list, which is then returned.

The completed code is shown below.

```python
import jsonlines

def filter_chunks(file_path, min_length=0, keyword=None):
    filtered_chunks = []
    # Open the JSONL file in read mode using jsonlines
    with jsonlines.open(file_path, mode='r') as reader:
        # Iterate over each JSON object in the file
        for obj in reader:
            # Extract the 'text' field from the JSON object
            text = obj['text']
            # Check if the text meets the minimum length requirement
            # and if the keyword is present (if a keyword is provided)
            if len(text) >= min_length and (keyword is None or keyword in text):
                # If both conditions are met, add the object to the filtered list
                filtered_chunks.append(obj)
    # Return the list of filtered chunks
    return filtered_chunks

# Example usage
filtered = filter_chunks("chunked_data.jsonl", min_length=20, keyword="second")
print("Filtered Chunks:", filtered)
```

Now you have a flexible function that can filter your text data based on different criteria. This is a very useful skill for managing and processing large datasets for LLMs.
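If the criteria keep changing, one option is to pass the condition in as a function. The sketch below is a generator-based variant that uses the built-in `json` module instead of `jsonlines`; `iter_filtered` and the demo file are assumptions made for illustration.

```python
import json
import os
import tempfile

def iter_filtered(path, predicate):
    """Yield each record from a JSONL file for which predicate(record) is true."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if predicate(record):
                yield record

# Build a small demo file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "chunked_data.jsonl")
texts = ["Short.", "Here is the second sentence.", "And this is the third sentence."]
with open(demo_path, "w", encoding="utf-8") as f:
    for i, text in enumerate(texts):
        f.write(json.dumps({"id": i, "text": text}) + "\n")

# Same criteria as the exercise: at least 20 characters and contains "second".
matches = list(
    iter_filtered(demo_path, lambda r: len(r["text"]) >= 20 and "second" in r["text"])
)
print(matches)
```

Because the function yields matches lazily, the caller decides whether to collect them into a list or process them one by one.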

## Text Processing Pipeline with JSONL

You've done a fantastic job learning about JSONL! Now, let's build a complete text processing pipeline. Your task is to:

  * Chunk a large text from `text.txt` using sentence tokenization.
  * Enrich each chunk with metadata:
      * `char_count` (length of the text)
      * `source` (set this to `"example_text"` or the filename, e.g., `"text.txt"`)
      * (Optional but recommended) an `id` field that uniquely identifies the chunk (e.g., the index of the chunk)
  * Store these enriched chunks in a JSONL file.
  * Implement a `search_chunks(query)` function that takes a string and returns all chunk objects whose `"text"` field contains the query string. For example, `search_chunks("Artificial Intelligence")` should return all chunks where that phrase appears in the text.

This exercise will help you integrate all the skills you've learned. Dive in and see how effectively you can manage and search text data!


```python
import nltk
import jsonlines

# Step 1: Chunk Your Text
nltk.download('punkt_tab', quiet=True)
from nltk.tokenize import sent_tokenize

# Load the sample text from text.txt
# TODO: Read the text from 'text.txt' and store it in 'large_text'

# TODO: Tokenize the text into chunks using sentence tokenization

# TODO: Create JSON Objects with Metadata
# Add metadata such as 'char_count' and 'source' to each chunk

# Step 3: Convert to JSONL Format
# TODO: Write the JSON objects to a JSONL file

# Step 4: Read JSONL Data
# TODO: Read the JSONL file and load the data into a list

# TODO: Implement a `search_chunks` function to find chunks containing specific terms

# Verify the Output
# TODO: Print the first two stored chunks and search results for verification
# TODO: Find and print chunks where "Artificial Intelligence" exists

```

Here is the completed pipeline. It chunks the text, enriches each chunk with metadata, stores the result in a JSONL file, and includes a function to search for specific terms.

To get the pipeline working, you'll need to create a `text.txt` file in the same directory as your Python script. The example text below includes the phrase "Artificial Intelligence" so you can test the `search_chunks` function.

**text.txt**

```
Artificial Intelligence is a field of computer science that focuses on creating intelligent machines. It has many applications, from self-driving cars to natural language processing. The goal is to build systems that can think and learn like humans.
```

Here is the completed Python code.

```python
import nltk
import jsonlines

# Step 1: Chunk Your Text
nltk.download('punkt_tab', quiet=True)
from nltk.tokenize import sent_tokenize

# Load the sample text from text.txt
with open('text.txt', 'r') as file:
    large_text = file.read()

# Tokenize the text into chunks using sentence tokenization
chunks = sent_tokenize(large_text)

# Create JSON Objects with Metadata
# Add metadata such as 'char_count' and 'source' to each chunk
enriched_chunks = []
for i, chunk in enumerate(chunks):
    enriched_chunk = {
        'id': i,
        'text': chunk,
        'char_count': len(chunk),
        'source': 'text.txt'
    }
    enriched_chunks.append(enriched_chunk)

# Step 3: Convert to JSONL Format
# Write the JSON objects to a JSONL file
with jsonlines.open('enriched_chunks.jsonl', mode='w') as writer:
    writer.write_all(enriched_chunks)

# Step 4: Read JSONL Data
# Read the JSONL file and load the data into a list
def load_chunks(file_path):
    with jsonlines.open(file_path, mode='r') as reader:
        return [obj for obj in reader]

# Implement a `search_chunks` function to find chunks containing specific terms
def search_chunks(query, file_path='enriched_chunks.jsonl'):
    chunks_data = load_chunks(file_path)
    results = [chunk for chunk in chunks_data if query.lower() in chunk['text'].lower()]
    return results

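# Verify the Output
# Read the chunks back from disk so the check covers what was actually stored
stored_chunks = load_chunks('enriched_chunks.jsonl')
print("First two stored chunks:", stored_chunks[:2])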
print("\n---")
print("Search results for 'Artificial Intelligence':")
search_results = search_chunks("Artificial Intelligence")
for result in search_results:
    print(result)
```
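The solution above re-reads the JSONL file on every call to `search_chunks`. When many queries run against the same data, one possible alternative is to load the records once and search them in memory. The sketch below assumes records shaped like the enriched chunks; `make_searcher` is a hypothetical helper, and the sample records are hand-written rather than read from a file.

```python
def make_searcher(records):
    """Return a search function over records that were loaded once up front."""
    # Lowercase each text a single time so repeated queries stay cheap.
    lowered = [(record["text"].lower(), record) for record in records]

    def search(query):
        needle = query.lower()
        return [record for text, record in lowered if needle in text]

    return search

sample_records = [
    {"id": 0, "text": "Artificial Intelligence is a field of computer science.",
     "char_count": 55, "source": "text.txt"},
    {"id": 1, "text": "It has many applications.",
     "char_count": 25, "source": "text.txt"},
]

search = make_searcher(sample_records)
print([record["id"] for record in search("artificial intelligence")])
```

The trade-off is memory: this keeps every record in RAM, whereas re-reading the file keeps memory use low at the cost of repeated disk access.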

-----

You can watch a YouTube Shorts video on [writing to a text file in Python](https://www.youtube.com/shorts/Dw85RIvQlc8). This video is relevant because it provides a quick, visual guide on a core Python skill needed for this exercise: writing data to a file.