# Unit 1

## Efficient Data Storage for Large-Scale LLMs

### Introduction and Context Setting

Welcome to the first lesson of our course on "Optimized Data Preparation for Large-Scale LLMs". In this lesson, we will explore the importance of **efficient data storage for large-scale language models (LLMs)**. As you may know, LLMs require vast amounts of data to train effectively. Therefore, choosing the right data storage format is crucial for handling these large datasets efficiently.

We will focus on two popular storage formats: **JSONL** and **Parquet**. JSONL is a simple, line-oriented text format that is easy to read and append to, while Parquet is a compressed, columnar binary format designed for fast reads. By the end of this lesson, you will understand how to load, stream, and save large datasets using these formats, setting a strong foundation for your journey in data preparation for LLMs.

-----

### Loading Large Datasets with the `datasets` Library

To handle large datasets efficiently, we will use the `datasets` library. This library is designed to work with large datasets by allowing you to **stream data**, which means you can process data in chunks rather than loading the entire dataset into memory at once.

Let's start by loading a large dataset. In this example, we'll use the Wikipedia dataset:

```python
from datasets import load_dataset

# Load large dataset (Wikipedia)
dataset = load_dataset("wikipedia", "20220301.en", split="train", streaming=True, trust_remote_code=True)
```

**Detailed Explanation of Parameters:**

  * `load_dataset`: This function from the `datasets` library is used to load a dataset. It supports a wide range of datasets and provides options for customization.
  * `"wikipedia"`: This is the name of the dataset you want to load. In this case, it specifies that we are loading the Wikipedia dataset.
  * `"20220301.en"`: This parameter specifies the configuration or version of the dataset. Here, `"20220301.en"` indicates that we are using the English Wikipedia dump from March 1, 2022.
  * `split="train"`: This parameter specifies which subset of the dataset to load. Common splits include `"train"`, `"test"`, and `"validation"`. In this example, we are loading the training split of the dataset.
  * `streaming=True`: This parameter enables **streaming mode**. Instead of downloading and loading the entire dataset up front, `load_dataset` returns an iterable dataset that fetches and yields examples lazily as you iterate over it. This is particularly useful for handling large datasets that may not fit into memory or on disk.
  * `trust_remote_code=True`: This parameter allows the execution of the loading script stored in the dataset's repository. It is necessary when the dataset requires custom processing or transformations defined in its repository. Use this option with caution, as it executes code from an external source. Support for script-based datasets depends on the installed version of `datasets`, so newer releases may reject this argument.

-----

### Streaming and Structuring Data

Once we have the dataset loaded, the next step is to stream and structure the data. We will extract the text data and organize it into a list of dictionaries.

```python
# Stream the first 10,000 examples and keep only their text
data_list = [{"text": example["text"]} for example in dataset.take(10000)]
```

In this code snippet, we use a list comprehension to iterate over the first 10,000 examples in the dataset. For each example, we extract the `"text"` field and store it in a dictionary with the key `"text"`. This results in a list of dictionaries, where each dictionary contains a single text entry.
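To make the lazy behavior concrete without downloading anything, here is a minimal sketch that imitates a streamed dataset with a plain Python generator. The `fake_stream` function and its fields are hypothetical stand-ins, and `itertools.islice` plays the role that `take` plays in the snippet above. It is an illustration of the idea, not how the `datasets` library is implemented.

```python
from itertools import islice


def fake_stream(n):
    """Hypothetical stand-in for a streamed dataset: yields one record at a time."""
    for i in range(n):
        yield {"id": i, "title": f"Article {i}", "text": f"Body of article {i}"}


# islice stops after 3 records, so the remaining records are never produced.
# This mirrors taking the first k examples from a streamed dataset.
first_three = [{"text": ex["text"]} for ex in islice(fake_stream(1_000_000), 3)]

print(first_three)
```

Only the three requested records are ever generated, which is why streaming lets you sample from a dataset far larger than memory.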

-----

### Saving Data in JSONL Format

Now that we have our data structured, we can save it in the **JSONL** format. JSONL, or JSON Lines, is a format where each line is a valid JSON object. This format is particularly useful for storing large text datasets.

```python
import json

# Save as JSONL
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for record in data_list:
        json.dump(record, f)
        f.write("\n")
```

Here, we open a file named `"dataset.jsonl"` in write mode. We then iterate over our `data_list`, using `json.dump` to write each dictionary as a JSON object to the file, followed by a newline character. This creates a JSONL file where each line represents a single text entry.
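The approach above first collects every record in `data_list` and then writes the file. When the source is a stream, one possible variation is to write each record as it arrives, so that only one record is held in memory at a time. The sketch below assumes a hypothetical `write_jsonl` helper and uses a small generator in place of the streamed dataset.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line.

    `records` may be a generator, so records are written as they arrive.
    Returns the number of records written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable instead of \u-escaped.
            # json.dumps escapes newlines inside strings, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical stand-in for the streamed dataset
sample_stream = ({"text": f"Article {i}: café"} for i in range(5))

out_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
written = write_jsonl(sample_stream, out_path)
print(written)
```

With the real dataset, the same helper could be given a generator expression over `dataset.take(10000)` instead of `sample_stream`.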

**When to Use JSONL**

JSONL is ideal for datasets where each entry is independent and can be processed line by line. It is particularly useful for text data, logs, or any data that can be represented as a series of JSON objects. JSONL is easy to read and write, making it a good choice for data that needs to be human-readable or easily parsed by other systems.
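Because every line is an independent JSON object, a JSONL file can also be read back one record at a time. The following sketch shows one way to do that with a generator; the `read_jsonl` name and the sample file contents are assumptions made for illustration.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny sample file; the second record contains an escaped newline
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write('{"text": "first"}\n{"text": "second line\\nwith a break"}\n')

records = list(read_jsonl(sample_path))
print(len(records))
```

Iterating over `read_jsonl(...)` directly, rather than wrapping it in `list`, keeps memory use flat even for very large files.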

-----

### Saving Data in Parquet Format

Another efficient format for storing large datasets is **Parquet**. Parquet is a **columnar storage file format**: it stores the values of each column together, which generally compresses well and lets readers load only the columns they need.

```python
import pandas as pd

# Save as Parquet
df = pd.DataFrame(data_list)
df.to_parquet("dataset.parquet", engine="pyarrow")
```

In this example, we first convert our `data_list` into a Pandas DataFrame. We then use the `to_parquet` method to save the DataFrame as a Parquet file named `"dataset.parquet"`. The `engine="pyarrow"` parameter specifies the use of the PyArrow library, which is commonly used for handling Parquet files.

**When to Use Parquet**

Parquet is best suited for datasets that benefit from columnar storage, such as tables with many columns or workloads that run analytical queries over only a few of them. It is also a good choice when the data needs to be compressed on disk or loaded with its column types preserved, which makes it a common format for large-scale data processing.

-----

### Summary and Preparation for Practice

In this lesson, we covered the essential steps for efficiently storing large-scale datasets for LLMs. We learned how to load and stream data using the `datasets` library, structure the data into a list of dictionaries, and save it in both **JSONL** and **Parquet** formats. These skills are crucial for managing large datasets and will serve as a foundation for more advanced data preparation techniques.

As you move on to the practice exercises, you'll have the opportunity to apply these concepts and solidify your understanding. Remember, choosing the right data storage format is key to handling large-scale datasets efficiently. Keep exploring and experimenting with different datasets and formats to enhance your skills further.



## Efficient Streaming of Wikipedia Dataset

You've done well learning about streaming and structuring data. Now, let's put that knowledge to use by loading the Wikipedia dataset in streaming mode, similar to what we did in the lesson.

  * Load the Wikipedia dataset using the Hugging Face `datasets` library:
      * Use `"wikipedia"` as the dataset name.
      * Use `"20220301.en"` as the configuration to specify the English Wikipedia dump from March 1, 2022.
      * Set `split="train"` to load the training subset of the dataset.
      * Enable `streaming=True` to process data in chunks.
      * Set `trust_remote_code=True` to allow execution of code from the dataset's repository.
  * Extract the first 100 examples from the `"train"` split:
      * Use a list comprehension to iterate over the first 100 examples.
      * For each example, extract the `"text"` field and store it in a dictionary with the key `"text"`.
      * Store these examples in a list, with each item containing the text content.
  * Print the first example to verify the output.

This exercise will help you practice efficient data handling. Dive in and see how smoothly you can manage large datasets!

```python
from datasets import load_dataset

# Load the Wikipedia dataset in streaming mode
# - "wikipedia": specifies the dataset to load
# - "20220301.en": specifies the version/configuration of the dataset (English Wikipedia dump from March 1, 2022)
# - split="train": specifies the subset of the dataset to load (training split)
# - streaming=True: enables streaming mode to process data in chunks
# - trust_remote_code=True: allows execution of code from the dataset's repository

# Extract the first 100 examples
# Use a list comprehension to iterate over the first 100 examples in the dataset
# For each example, extract the "text" field and store it in a dictionary with the key "text"

# Print the first example to verify
# This will output the text content of the first example in the data_list
print(data_list[0])

```

Here is the completed code that efficiently loads the Wikipedia dataset in streaming mode and extracts the first 100 examples as requested.

```python
from datasets import load_dataset

# Load the Wikipedia dataset in streaming mode
# - "wikipedia": specifies the dataset to load
# - "20220301.en": specifies the version/configuration of the dataset (English Wikipedia dump from March 1, 2022)
# - split="train": specifies the subset of the dataset to load (training split)
# - streaming=True: enables streaming mode to process data in chunks
# - trust_remote_code=True: allows execution of code from the dataset's repository
dataset = load_dataset("wikipedia", "20220301.en", split="train", streaming=True, trust_remote_code=True)

# Extract the first 100 examples
# Use a list comprehension to iterate over the first 100 examples in the dataset
# For each example, extract the "text" field and store it in a dictionary with the key "text"
data_list = [{"text": example["text"]} for example in dataset.take(100)]

# Print the first example to verify
# This will output the text content of the first example in the data_list
print(data_list[0])
```


## Saving Wikipedia Dataset in JSONL Format

Great job learning about JSONL and Parquet formats! Now, let's apply that knowledge to a practical task. Your goal is to create a function that saves the Wikipedia dataset examples into a JSONL file.

  * Use the pre-loaded Wikipedia dataset.
  * Create a function to save the data into a JSONL file.
  * Each line should be a JSON object with the text data.

This exercise will help you understand how to format and serialize data efficiently. Dive in and see how well you can manage this task!

```python
from datasets import load_dataset
import json

# Load Wikipedia dataset
dataset = load_dataset("wikipedia", "20220301.en", split="train", streaming=True, trust_remote_code=True)

def save_to_jsonl(data_list, filename):
    # TODO: Open the file in write mode and save each example as a JSON object
    pass

# Prepare data list with text
data_list = [{"text": example["text"]} for example in dataset.take(10000)]

# TODO: Call the function to save data_list to "wikipedia_texts.jsonl"

print("Wikipedia texts saved in JSONL format.")

```

Here is the completed code with the `save_to_jsonl` function and the function call filled in. This solution will create a file named `wikipedia_texts.jsonl` with 10,000 JSON objects, each on its own line.

```python
from datasets import load_dataset
import json

# Load Wikipedia dataset
dataset = load_dataset("wikipedia", "20220301.en", split="train", streaming=True, trust_remote_code=True)

def save_to_jsonl(data_list, filename):
    # Open the file in write mode and save each example as a JSON object
    with open(filename, "w", encoding="utf-8") as f:
        for item in data_list:
            json.dump(item, f)
            f.write("\n")

# Prepare data list with text
data_list = [{"text": example["text"]} for example in dataset.take(10000)]

# Call the function to save data_list to "wikipedia_texts.jsonl"
save_to_jsonl(data_list, "wikipedia_texts.jsonl")

print("Wikipedia texts saved in JSONL format.")
```

## Saving Wikipedia Data as Parquet

Great job on understanding JSONL and Parquet formats! Now, let's take it a step further by working with the Wikipedia dataset.

  * Load the Wikipedia dataset using the `datasets` library.
  * Stream the data and extract the first 10,000 examples.
  * Create a function to convert these examples into a pandas DataFrame.
  * Save the DataFrame as a Parquet file named `"wikipedia_data.parquet"` using the PyArrow engine.

This task will help you see the benefits of using Parquet for large datasets. Let's get started and see how efficiently you can handle this data!

```python
from datasets import load_dataset
import pandas as pd

# Load large dataset (Wikipedia)
dataset = load_dataset("wikipedia", "20220301.en", split="train", streaming=True, trust_remote_code=True)

# Stream and store in efficient formats
data_list = [{"text": example["text"]} for example in dataset.take(10000)]

# TODO: Implement the function to convert the list of dictionaries (data_list) into a Pandas DataFrame
def convert_to_df(data_list):
    # TODO: Create a DataFrame from the data_list

    # TODO: Save the DataFrame as a Parquet file using the PyArrow engine
    pass  # Placeholder so the starter code parses; remove once implemented

# TODO: Call the function to save the data as Parquet

print("Dataset saved in Parquet format.")

```

Here is the completed code that implements the function to convert the data to a Pandas DataFrame and save it as a Parquet file. This method is highly efficient for handling and storing large-scale structured data.

```python
from datasets import load_dataset
import pandas as pd

# Load large dataset (Wikipedia)
dataset = load_dataset("wikipedia", "20220301.en", split="train", streaming=True, trust_remote_code=True)

# Stream and store in efficient formats
data_list = [{"text": example["text"]} for example in dataset.take(10000)]

# Implement the function to convert the list of dictionaries (data_list) into a Pandas DataFrame
def convert_to_df(data_list):
    # Create a DataFrame from the data_list
    df = pd.DataFrame(data_list)
    
    # Save the DataFrame as a Parquet file using the PyArrow engine
    df.to_parquet("wikipedia_data.parquet", engine="pyarrow")

# Call the function to save the data as Parquet
convert_to_df(data_list)

print("Dataset saved in Parquet format.")
```