# Use BulkWriter for Data Import (2): Use RemoteBulkWriter

This notebook shows how to use PyMilvus' RemoteBulkWriter to prepare your dataset for import into Zilliz Cloud.

## Before you start
Make sure that you have:

- Installed the dependencies, including PyMilvus and the MinIO Python Client.
- Created an output folder to store the BulkWriter output.

In [1]:
%pip install pymilvus minio


Note: you may need to restart the kernel to use updated packages.


## Import the dependencies

In this part, you import the dependencies required to run this notebook: PyMilvus for working with Zilliz Cloud clusters, MinIO for working with your object storage bucket, Pandas for processing your dataset, and some standard libraries.

In [3]:
from pathlib import Path
import os, json

import pandas as pd
from minio import Minio

from pymilvus import (
    FieldSchema, CollectionSchema, DataType,
    RemoteBulkWriter,
)


ACCESS_KEY = "YOUR_OBJECT_STORAGE_ACCESS_KEY"
SECRET_KEY = "YOUR_OBJECT_STORAGE_SECRET_KEY"
BUCKET_NAME = "YOUR_OBJECT_STORAGE_BUCKET_NAME"
REMOTE_PATH = "DATA_FILES_PATH_IN_BLOCK_STORAGE"
DATASET_PATH = "../New_Medium_Data.csv"
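
If you prefer not to keep credentials in the notebook, the placeholders above could be filled from environment variables instead. The following is a minimal sketch, not part of the original notebook: the helper `load_storage_settings` and the environment variable names are hypothetical, chosen only to mirror the constants above.

```python
import os

def load_storage_settings(env=os.environ) -> dict:
    """Read object storage settings from a mapping of environment variables.

    The variable names below are assumptions for illustration only.
    """
    names = ("ACCESS_KEY", "SECRET_KEY", "BUCKET_NAME")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}

# Demo with a made-up mapping; pass nothing to read the real environment.
demo_env = {"ACCESS_KEY": "key", "SECRET_KEY": "secret", "BUCKET_NAME": "bucket"}
settings = load_storage_settings(demo_env)
print(sorted(settings))
```

Failing early on a missing variable makes a misconfigured environment easier to spot than an authentication error raised later by the storage client.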


## Determine collection schema

You need to work out a collection schema from your dataset. This demo uses [this example dataset](https://drive.google.com/file/d/12RkoDPAlk-sclXdjeXT6DMFVsQr4612w/view?usp=drive_link), and the collection schema is as follows.

In [4]:
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="link", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="reading_time", dtype=DataType.INT64),
    FieldSchema(name="publication", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="claps", dtype=DataType.INT64),
    FieldSchema(name="responses", dtype=DataType.INT64)
]

schema = CollectionSchema(fields)

## Rewrite your dataset

Once the schema is ready, you can rewrite your data into a format that Zilliz Cloud understands and store it in an object storage bucket.

To do so, you need to:

- Create a `ConnectParam` for the connection to your object storage bucket.
- Create a `RemoteBulkWriter` with the following parameters:
  - `schema`: Schema of the target collection.
  - `remote_path`: Path to the folder that holds the output files in the specified bucket.
  - `segment_size`: Maximum size of a generated file or set of files. If the size of your dataset exceeds the specified value, multiple files or sets of files are generated.
  - `connect_param`: Connection parameters for the connection to your object storage.
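
To get a feel for how `segment_size` relates to the number of generated files or sets of files, the rough estimate below may help. It is only a back-of-the-envelope sketch: the helper `estimate_file_sets` is hypothetical, the per-row VARCHAR size is an assumed figure, and the actual number depends on how the writer measures its buffer and on the output file format.

```python
import math

def estimate_file_sets(num_rows: int, dim: int,
                       scalar_bytes_per_row: int, segment_size: int) -> int:
    """Roughly estimate how many files (or sets of files) would be generated."""
    # A float32 vector takes 4 bytes per dimension.
    row_bytes = dim * 4 + scalar_bytes_per_row
    total_bytes = num_rows * row_bytes
    return max(1, math.ceil(total_bytes / segment_size))

# Assumptions: 768-dimensional vectors, four INT64 fields (8 bytes each),
# and about 300 bytes of VARCHAR data per row.
segment_size = 50 * 1024 * 1024
scalar_bytes = 4 * 8 + 300

print(estimate_file_sets(5_000, 768, scalar_bytes, segment_size))
print(estimate_file_sets(100_000, 768, scalar_bytes, segment_size))
```

Under these assumptions, a few thousand rows fit in a single set of files, while a dataset of a hundred thousand rows would be split into several.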

In [5]:
# Load the dataset from the local CSV file specified by DATASET_PATH.
# If you use the example dataset, download it from the share link first
# and save it to that path.

dataset_file = Path(DATASET_PATH)
dataset = pd.read_csv(dataset_file)

connect_param = RemoteBulkWriter.ConnectParam(
    endpoint="storage.googleapis.com", # use 's3.amazonaws.com' for AWS
    access_key=ACCESS_KEY,
    secret_key=SECRET_KEY,
    bucket_name=BUCKET_NAME,
    secure=True
)

remote_writer = RemoteBulkWriter(
    schema=schema,
    remote_path=REMOTE_PATH,
    segment_size=50*1024*1024,
    connect_param=connect_param,
)

for i in range(0, len(dataset)):
  row = dataset.iloc[i].to_dict()
  row["vector"] = json.loads(row["vector"])
  remote_writer.append_row(row)

remote_writer.commit()

print(remote_writer.data_path)

/numpy-files/f6edff70-b5ca-467d-b5ee-981a98979743
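
The loop above relies on the `vector` column holding a JSON-encoded list in each CSV record. The sketch below shows one way to prepare and check a single record before appending it, using only the standard library. The helper `prepare_row` is hypothetical, the sample record is made up, and the field names are taken from the schema defined earlier.

```python
import json

def prepare_row(raw: dict, dim: int) -> dict:
    """Convert one CSV record (all values are strings) into typed values."""
    vector = json.loads(raw["vector"])
    if len(vector) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vector)}")
    return {
        "id": int(raw["id"]),
        "title": raw["title"],
        "vector": [float(x) for x in vector],
        "link": raw["link"],
        "reading_time": int(raw["reading_time"]),
        "publication": raw["publication"],
        "claps": int(raw["claps"]),
        "responses": int(raw["responses"]),
    }

# Made-up record with a 4-dimensional vector to keep the example short.
sample = {
    "id": "1", "title": "Example title", "vector": "[0.1, 0.2, 0.3, 0.4]",
    "link": "https://example.com/post", "reading_time": "5",
    "publication": "Example Publication", "claps": "12", "responses": "3",
}
row = prepare_row(sample, dim=4)
print(row["vector"], row["claps"])
```

Checking the vector length up front surfaces malformed records while they are being read, rather than later during the import.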


In [6]:
# To check the files in the remote folder

client = Minio(
    endpoint="storage.googleapis.com", # use 's3.amazonaws.com' for AWS
    access_key=ACCESS_KEY,
    secret_key=SECRET_KEY,
    secure=True)

objects = client.list_objects(
    bucket_name=BUCKET_NAME,
    prefix=str(remote_writer.data_path)[1:],
    recursive=True
)

print([obj.object_name for obj in objects])

['numpy-files/f6edff70-b5ca-467d-b5ee-981a98979743/1/claps.npy', 'numpy-files/f6edff70-b5ca-467d-b5ee-981a98979743/1/id.npy', 'numpy-files/f6edff70-b5ca-467d-b5ee-981a98979743/1/link.npy', 'numpy-files/f6edff70-b5ca-467d-b5ee-981a98979743/1/publication.npy', 'numpy-files/f6edff70-b5ca-467d-b5ee-981a98979743/1/reading_time.npy', 'numpy-files/f6edff70-b5ca-467d-b5ee-981a98979743/1/responses.npy', 'numpy-files/f6edff70-b5ca-467d-b5ee-981a98979743/1/title.npy', 'numpy-files/f6edff70-b5ca-467d-b5ee-981a98979743/1/vector.npy']
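
Given a listing like the one above, you may want to confirm that every numbered folder contains one file per schema field. The following sketch does this on plain object names. It assumes the layout shown in the output above (one `.npy` file per field under a numbered folder); the helper `missing_fields_per_segment` is hypothetical, and other output file types may be laid out differently.

```python
from collections import defaultdict
from pathlib import PurePosixPath

FIELD_NAMES = ["id", "title", "vector", "link",
               "reading_time", "publication", "claps", "responses"]

def missing_fields_per_segment(object_names, field_names):
    """Map each numbered folder to the schema fields that have no .npy file."""
    segments = defaultdict(set)
    for name in object_names:
        path = PurePosixPath(name)
        if path.suffix == ".npy":
            # The parent folder is the segment; the file stem is the field name.
            segments[path.parent.name].add(path.stem)
    return {seg: sorted(set(field_names) - found)
            for seg, found in segments.items()}

# Object names built to match the listing printed above.
prefix = "numpy-files/f6edff70-b5ca-467d-b5ee-981a98979743/1/"
names = [prefix + field + ".npy" for field in FIELD_NAMES]

print(missing_fields_per_segment(names, FIELD_NAMES))
```

An empty list for every folder suggests the output is complete; any field name that shows up points to a file that was not written or not uploaded.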
