# Use BulkWriter for Data Import (1): Use LocalBulkWriter

This notebook shows how to use PyMilvus' LocalBulkWriter to prepare your dataset for import into Zilliz Cloud.

## Before you start
Ensure that you have:

- Installed the dependencies, including PyMilvus (2.2.16) and the MinIO Python Client.
- Created an output folder to store the BulkWriter output.

In [1]:
%pip install pymilvus minio

# Create the output folder
!mkdir processed_dataset

Note: you may need to restart the kernel to use updated packages.


## Import the dependencies

Import the dependencies this notebook needs: PyMilvus for working with Zilliz Cloud clusters, pandas for processing your dataset, and a few standard libraries.

In [1]:
from pathlib import Path
import os, json

import pandas as pd

DATASET_PATH = "../New_Medium_Data.csv"
OUTPUT_PATH = "../output"

from pymilvus import (
    FieldSchema, CollectionSchema, DataType,
    LocalBulkWriter,
    BulkFileType
)

## Determine collection schema

You need to work out a collection schema from your dataset. This demo uses [this example dataset](https://drive.google.com/file/d/12RkoDPAlk-sclXdjeXT6DMFVsQr4612w/view?usp=drive_link), and the collection schema is as follows.

In [2]:
# Define the collection fields to match the dataset.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="link", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="reading_time", dtype=DataType.INT64),
    FieldSchema(name="publication", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="claps", dtype=DataType.INT64),
    FieldSchema(name="responses", dtype=DataType.INT64)
]

schema = CollectionSchema(fields)
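
One way to sanity-check values such as `dim` and `max_length` is to profile a sample of the CSV before fixing the schema. The sketch below is only an illustration: `profile_columns` and `SAMPLE_CSV` are hypothetical names, the sample rows are made up, and it assumes that the vector column holds a JSON-encoded list and that string lengths should be measured in UTF-8 bytes.

```python
import csv
import io
import json

# Hypothetical two-row sample standing in for the real CSV file.
SAMPLE_CSV = '''id,title,vector
0,First post,"[0.1, 0.2, 0.3]"
1,A somewhat longer title,"[0.4, 0.5, 0.6]"
'''


def profile_columns(csv_text, vector_column="vector"):
    """Return the set of vector dimensions seen and the longest value per other column."""
    dims = set()
    max_lengths = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            if name == vector_column:
                # The vector is stored as a JSON string, so parse it before measuring.
                dims.add(len(json.loads(value)))
            else:
                # Assumption: measure in UTF-8 bytes to stay on the conservative side.
                size = len(value.encode("utf-8"))
                max_lengths[name] = max(max_lengths.get(name, 0), size)
    return dims, max_lengths


dims, max_lengths = profile_columns(SAMPLE_CSV)
print(dims)         # more than one value here means inconsistent vectors
print(max_lengths)  # compare these against the max_length you plan to set
```

If `dims` contains more than one value, the vectors are inconsistent and the dataset needs cleaning before you define the vector field.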

## Rewrite your dataset

Once the schema is ready, you can rewrite your data into a format that Zilliz Cloud understands and save it to a local output folder.

To do so, you need to create a LocalBulkWriter with the following parameters:

- `schema`: Schema of the target collection.
- `local_path`: Path to the folder to hold the output file.
- `segment_size`: Maximum size of a generated file or set of files. If your dataset exceeds this size, multiple files or sets of files are generated.
- `file_type`: Format of the generated file or files. Possible values are `pymilvus.BulkFileType.JSON_RB` and `pymilvus.BulkFileType.NPY`.
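
To build intuition for how `segment_size` leads to multiple output files, here is a simplified model of size-based segmentation. It is not PyMilvus's actual implementation: `split_into_segments` is a hypothetical helper, the per-row size is a rough assumption (a 768-dimensional float32 vector at 4 bytes per value, scalar fields ignored), and the 5,000-row count is made up.

```python
def split_into_segments(row_sizes, segment_size):
    """Group row sizes (in bytes) into segments, closing one once it reaches segment_size."""
    segments, current, current_bytes = [], [], 0
    for size in row_sizes:
        current.append(size)
        current_bytes += size
        if current_bytes >= segment_size:
            segments.append(current)
            current, current_bytes = [], 0
    if current:
        # Whatever is left over forms the last, smaller segment.
        segments.append(current)
    return segments


# Rough assumption: only the vector contributes to the row size.
row_size = 768 * 4
segments = split_into_segments([row_size] * 5000, 4 * 1024 * 1024)
print(len(segments), [len(s) for s in segments])
```

The real number of files depends on the file format and on how the writer accounts for each field, so treat this only as an order-of-magnitude estimate.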

In [4]:
# Load the dataset
dataset = pd.read_csv(Path(DATASET_PATH))

# Rewrite the dataset into row-based JSON files
local_writer = LocalBulkWriter(
    schema=schema,
    local_path=Path(OUTPUT_PATH).joinpath('json'),
    segment_size=4*1024*1024,
    file_type=BulkFileType.JSON_RB
)

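for i in range(len(dataset)):
    row = dataset.iloc[i].to_dict()
    # The CSV stores each vector as a JSON-encoded string; parse it into a list
    row["vector"] = json.loads(row["vector"])
    local_writer.append_row(row)

# Flush any rows still buffered in memory to the output folder
local_writer.commit()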
print("test local writer done!")

test local writer done!
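
Before appending rows, it can help to check each one so that malformed records surface early. The following is a minimal sketch under stated assumptions: `prepare_row`, `EXPECTED_FIELDS`, and `VECTOR_DIM` are hypothetical names, the field names and dimension are copied from the schema above, and the sample row uses made-up values.

```python
import json

# Field names copied from the schema above; 768 is the `dim` of the vector field.
EXPECTED_FIELDS = {
    "id", "title", "vector", "link",
    "reading_time", "publication", "claps", "responses",
}
VECTOR_DIM = 768


def prepare_row(raw_row):
    """Parse the vector and run basic checks before the row reaches the writer."""
    row = dict(raw_row)
    missing = EXPECTED_FIELDS - row.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # The vector arrives as a JSON-encoded string, as in the loop above.
    row["vector"] = json.loads(row["vector"])
    if len(row["vector"]) != VECTOR_DIM:
        raise ValueError(f"expected {VECTOR_DIM} dimensions, got {len(row['vector'])}")
    return row


# Hypothetical row with made-up values, shaped like one CSV record.
sample = {
    "id": 0, "title": "Example", "vector": json.dumps([0.0] * 768),
    "link": "https://example.com", "reading_time": 5,
    "publication": "Example Publication", "claps": 10, "responses": 1,
}
checked = prepare_row(sample)
print(len(checked["vector"]))
```

In the loop, you could pass `prepare_row(dataset.iloc[i].to_dict())` to `append_row` instead of building the row inline.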


In [5]:
print(os.path.relpath(local_writer.data_path))

../output/json/5abfc289-1702-4b55-9b24-d6d916ca48f3


In [6]:
# Check what you have in the `output` folder
print(os.listdir(local_writer.data_path))

['1.json', '2.json', '3.json', '4.json', '5.json']
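
As a final check, you may want to confirm how many rows ended up in each generated file. The sketch below uses a hypothetical `count_rows` helper and demo files with made-up content. It assumes each file is either a JSON object with a top-level `rows` list or a plain JSON list of rows; the exact layout may differ between PyMilvus versions, so inspect one of your own files first.

```python
import json
import tempfile
from pathlib import Path


def count_rows(folder):
    """Return a {file name: row count} mapping for the JSON files in a folder."""
    counts = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open() as f:
            data = json.load(f)
        # Assumption: rows sit under a "rows" key, or the file is a bare list.
        rows = data["rows"] if isinstance(data, dict) else data
        counts[path.name] = len(rows)
    return counts


# Demo with two made-up files, one per assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1.json").write_text(json.dumps({"rows": [{"id": 0}, {"id": 1}]}))
    Path(tmp, "2.json").write_text(json.dumps([{"id": 2}]))
    demo_counts = count_rows(tmp)
print(demo_counts)
```

On real output, you would call `count_rows(local_writer.data_path)` and compare the sum of the counts with `len(dataset)`.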
