# Use BulkWriter for Data Import (1): Use LocalBulkWriter

This notebook shows how to use PyMilvus' LocalBulkWriter to prepare a dataset for import into Zilliz Cloud.

## Before you start
Ensure that you have:

- Installed the dependencies, including PyMilvus (2.2.16) and the MinIO Python Client.
- Created an output folder to store the BulkWriter output.

In [None]:
!pip install pymilvus==2.2.16 minio

# Create the output folder
!mkdir processed_dataset

Collecting pymilvus==2.2.16
  Downloading pymilvus-2.2.16-py3-none-any.whl (159 kB)
Collecting minio
  Downloading minio-7.1.16-py3-none-any.whl (77 kB)
Collecting environs<=9.5.0 (from pymilvus==2.2.16)
  Downloading environs-9.5.0-py2.py3-none-any.whl (12 kB)
Collecting ujson>=2.0.0 (from pymilvus==2.2.16)
  Downloading ujson-5.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (53 kB)
Collecting marshmallow>=3.0.0 (from environs<=9.5.0->pymilvus==2.2.16)
  Downloading marshmallow-3.20.1-py3-none-any.whl (49 kB)

## Import the dependencies

In this part, import the dependencies required to run this notebook: PyMilvus for working with Zilliz Cloud clusters, Pandas for processing the dataset, and a few standard libraries.

In [None]:
from pathlib import Path
from urllib.parse import urlparse
import os, json

import pandas as pd

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility,
    LocalBulkWriter,
    RemoteBulkWriter,
    BulkFileType,
    BulkInsertState,
    bulk_import,
    get_import_progress,
    list_import_jobs,
)

## Determine collection schema

You need to derive a collection schema from your dataset. This demo uses [this example dataset](https://drive.google.com/file/d/12RkoDPAlk-sclXdjeXT6DMFVsQr4612w/view?usp=drive_link), and the collection schema is as follows.

In [None]:
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="link", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="reading_time", dtype=DataType.INT64),
    FieldSchema(name="publication", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="claps", dtype=DataType.INT64),
    FieldSchema(name="responses", dtype=DataType.INT64)
]

schema = CollectionSchema(fields)
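
Before writing rows, it can help to check that each record matches the schema. The following is a minimal sketch in plain Python and is not part of PyMilvus: `FIELD_SPECS` and `check_row` are hypothetical names that mirror the field definitions above, and the VARCHAR check compares UTF-8 byte length as a conservative assumption.

```python
import numbers

# Hypothetical, PyMilvus-independent mirror of the schema above.
# Each entry maps a field name to (kind, limit), where limit is
# max_length for VARCHAR fields and dim for the vector field.
FIELD_SPECS = {
    "id": ("int", None),
    "title": ("varchar", 512),
    "vector": ("float_vector", 768),
    "link": ("varchar", 512),
    "reading_time": ("int", None),
    "publication": ("varchar", 512),
    "claps": ("int", None),
    "responses": ("int", None),
}


def check_row(row):
    """Return a list of human-readable problems found in one row dict."""
    problems = []
    for name, (kind, limit) in FIELD_SPECS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        if kind == "int":
            # numbers.Integral also accepts NumPy integers; bool is excluded on purpose.
            if isinstance(value, bool) or not isinstance(value, numbers.Integral):
                problems.append(f"{name}: expected an integer")
        elif kind == "varchar":
            if not isinstance(value, str):
                problems.append(f"{name}: expected a string")
            elif len(value.encode("utf-8")) > limit:
                problems.append(f"{name}: longer than {limit} bytes")
        elif kind == "float_vector":
            if not isinstance(value, list) or len(value) != limit:
                problems.append(f"{name}: expected a list of {limit} floats")
    return problems
```

An empty list means the row passed every check in this sketch. It does not guarantee that the writer will accept the row.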

## Rewrite your dataset

Once the schema is ready, you can rewrite your data into a format that Zilliz Cloud understands and save it to a local output folder.

To do so, you need to create a LocalBulkWriter with the following parameters:

- `schema`: Schema of the target collection.
- `local_path`: Path to the folder to hold the output file.
- `segment_size`: Maximum size, in bytes, of a generated file or set of files. If the size of your dataset exceeds the specified value, multiple files or sets of files are generated.
- `file_type`: Format of the generated file or files. Possible values are `pymilvus.BulkFileType.JSON_RB` and `pymilvus.BulkFileType.NPY`.
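
To get a feel for how `segment_size` affects the number of output files, here is a rough back-of-the-envelope sketch. `estimate_file_count` is a hypothetical helper, not a PyMilvus function. The figure of about 3,255 bytes per row comes from the flush log later in this notebook (4195575 bytes for 1289 rows), and the row count of 5,000 is made up for illustration. The real count may differ slightly, depending on when the writer flushes its buffer.

```python
import math


def estimate_file_count(total_rows, bytes_per_row, segment_size):
    """Rough number of output files: total buffered bytes / segment_size, rounded up."""
    if total_rows <= 0:
        return 0
    return math.ceil(total_rows * bytes_per_row / segment_size)


# Assumed figures: 5,000 rows at about 3,255 bytes each, with 4 MiB segments.
print(estimate_file_count(5000, 3255, 4 * 1024 * 1024))
```

A smaller `segment_size` yields more, smaller files. A larger one yields fewer, larger files.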

In [None]:
# Extract the ID from the share link of the dataset file.
# For a file at https://drive.google.com/file/d/12RkoDPAlk-sclXdjeXT6DMFVsQr4612w/view?usp=drive_link, the ID should be 12RkoDPAlk-sclXdjeXT6DMFVsQr4612w.
# Concatenate the file ID to the end of the url as follows:

url = 'https://drive.google.com/uc?id=12RkoDPAlk-sclXdjeXT6DMFVsQr4612w'
dataset = pd.read_csv(url)

local_writer = LocalBulkWriter(
    schema=schema,
    local_path=Path("processed_dataset").joinpath('json'),
    segment_size=4*1024*1024,
    file_type=BulkFileType.JSON_RB
)

for i in range(len(dataset)):
    row = dataset.iloc[i].to_dict()
    # The CSV stores each vector as a JSON string; parse it into a list of floats.
    row["vector"] = json.loads(row["vector"])
    local_writer.append_row(row)

local_writer.commit()
print("test local writer done!")
print(local_writer.data_path)

INFO:local_bulk_writer:Data path created: processed_dataset/json
INFO:local_bulk_writer:Data path created: processed_dataset/json/13b17c01-d81b-4cb6-ba17-b484276ed8f3
INFO:local_bulk_writer:Prepare to flush buffer, row_count: 1289, size: 4195575
INFO:local_bulk_writer:Flush thread begin, name: Thread-15 (_flush)
INFO:local_bulk_writer:Commit done with async=True
INFO:local_bulk_writer:Previous flush action is not finished, MainThread is waiting...
INFO:local_bulk_writer:Previous flush action is not finished, MainThread is waiting...
INFO:bulk_buffer:Successfully persist row-based file processed_dataset/json/13b17c01-d81b-4cb6-ba17-b484276ed8f3/1.json
INFO:local_bulk_writer:Flush thread done, name: Thread-15 (_flush)
INFO:local_bulk_writer:Prepare to flush buffer, row_count: 1289, size: 4194835
INFO:local_bulk_writer:Flush thread begin, name: Thread-16 (_flush)
INFO:local_bulk_writer:Commit done with async=True
INFO:local_bulk_writer:Previous flush action is not finished, MainThread is 

test local writer done!
processed_dataset/json/13b17c01-d81b-4cb6-ba17-b484276ed8f3
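
The cell above hard-codes the download URL. As a sketch, the conversion from a share link to a download URL described in its comments could be done with `urlparse`. `drive_download_url` is a hypothetical helper, and it assumes the share link has the `/file/d/<ID>/view` shape shown above.

```python
from urllib.parse import urlparse


def drive_download_url(share_link):
    """Build a direct-download URL from a Google Drive share link (/file/d/<ID>/view)."""
    parts = urlparse(share_link).path.strip("/").split("/")
    # Expected path segments: ["file", "d", "<ID>", "view"]
    if len(parts) < 3 or parts[0] != "file" or parts[1] != "d":
        raise ValueError(f"unrecognized share link: {share_link}")
    return f"https://drive.google.com/uc?id={parts[2]}"
```

The result can be passed to `pd.read_csv` in the same way as the hard-coded `url`.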


In [None]:
# Files in the output folder are as follows:

os.listdir(local_writer.data_path)

['2.json', '1.json', '4.json', '5.json', '3.json']
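
As a quick sanity check, you could count the rows in each generated file. This is a minimal sketch: `count_rows` is a hypothetical helper, and it assumes each row-based JSON file holds its records under a top-level `rows` key (falling back to a bare list if not).

```python
import json
from pathlib import Path


def count_rows(data_path):
    """Map each JSON file name in data_path to the number of rows it holds."""
    counts = {}
    # Sort by (length, name) so that 10.json comes after 9.json.
    files = sorted(Path(data_path).glob("*.json"), key=lambda p: (len(p.stem), p.stem))
    for path in files:
        with path.open() as f:
            payload = json.load(f)
        # Assumption: records sit under a top-level "rows" key; otherwise treat as a list.
        rows = payload["rows"] if isinstance(payload, dict) else payload
        counts[path.name] = len(rows)
    return counts
```

In this notebook, you would call it as `count_rows(local_writer.data_path)`. The per-file counts should add up to `len(dataset)`.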