# Use BulkWriter for Data Import (2): Use RemoteBulkWriter

This notebook shows how to use PyMilvus's RemoteBulkWriter to prepare your dataset for import into Zilliz Cloud.

## Before you start
Make sure that you have:

- Installed the dependencies, including PyMilvus (2.2.16) and the MinIO Python client.
- Created an output folder to store the BulkWriter output.

In [None]:
!pip install pymilvus==2.2.16 minio

Collecting pymilvus==2.2.16
  Downloading pymilvus-2.2.16-py3-none-any.whl (159 kB)
Collecting minio
  Downloading minio-7.1.16-py3-none-any.whl (77 kB)
Collecting environs<=9.5.0 (from pymilvus==2.2.16)
  Downloading environs-9.5.0-py2.py3-none-any.whl (12 kB)
Collecting ujson>=2.0.0 (from pymilvus==2.2.16)
  Downloading ujson-5.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (53 kB)
Collecting marshmallow>=3.0.0 (from environs<=9.5.0->pymilvus==2.2.16)
  Downloading marshmallow-3.20.1-py3-none-any.whl (49 kB)

## Import the dependencies

In this part, import the dependencies that this notebook needs: PyMilvus for working with Zilliz Cloud clusters, MinIO for working with your object storage bucket, pandas for processing your dataset, and a few standard libraries.

In [None]:
from pathlib import Path
from urllib.parse import urlparse
import sys, time, json
import threading

import pandas as pd
from minio import Minio

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility,
    LocalBulkWriter,
    RemoteBulkWriter,
    BulkFileType,
    BulkInsertState,
    bulk_import,
    get_import_progress,
    list_import_jobs,
)

## Determine collection schema

You need to derive a collection schema from your dataset. This demo uses [this example dataset](https://drive.google.com/file/d/12RkoDPAlk-sclXdjeXT6DMFVsQr4612w/view?usp=drive_link), and the collection schema is as follows.

In [None]:
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="link", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="reading_time", dtype=DataType.INT64),
    FieldSchema(name="publication", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="claps", dtype=DataType.INT64),
    FieldSchema(name="responses", dtype=DataType.INT64)
]

schema = CollectionSchema(fields)
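
Before writing rows against this schema, it can help to check that a raw row actually fits it. The following is a minimal sketch in plain Python and is not part of PyMilvus: `FIELD_SPECS` and `check_row` are hypothetical names that mirror the field definitions above, the sample row is made up for illustration, and measuring `max_length` in UTF-8 bytes is an assumption.

```python
import json
from numbers import Integral

# Hypothetical plain-Python mirror of the schema above: field name -> (kind, limit).
# For "varchar" the limit is max_length; for "float_vector" it is dim.
FIELD_SPECS = {
    "id": ("int64", None),
    "title": ("varchar", 512),
    "vector": ("float_vector", 768),
    "link": ("varchar", 512),
    "reading_time": ("int64", None),
    "publication": ("varchar", 512),
    "claps": ("int64", None),
    "responses": ("int64", None),
}


def check_row(row):
    """Return a list of problems found in one row; an empty list means the row fits."""
    problems = []
    for name, (kind, limit) in FIELD_SPECS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        if kind == "int64":
            # Integral also covers NumPy integer types; bool is excluded on purpose.
            if isinstance(value, bool) or not isinstance(value, Integral):
                problems.append(f"{name}: expected an integer")
        elif kind == "varchar":
            # Assumption: the length limit is counted in UTF-8 bytes.
            if not isinstance(value, str) or len(value.encode("utf-8")) > limit:
                problems.append(f"{name}: expected a string of at most {limit} bytes")
        elif kind == "float_vector":
            if not isinstance(value, list) or len(value) != limit:
                problems.append(f"{name}: expected a list of {limit} numbers")
    return problems


# Made-up sample row; the vector arrives as a JSON string, as in the CSV dataset.
sample = {
    "id": 0,
    "title": "Example title",
    "vector": json.dumps([0.0] * 768),
    "link": "https://example.com/post",
    "reading_time": 5,
    "publication": "Example Publication",
    "claps": 10,
    "responses": 1,
}
sample["vector"] = json.loads(sample["vector"])
print(check_row(sample))  # []
```

A check like this could run on the first few rows before a long write, so that a mismatched column name or vector dimension surfaces early.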

## Rewrite your dataset

Once the schema is ready, you can rewrite your data into a format that Zilliz Cloud understands and store it in an object storage bucket.

To do so, you need to:

- Create a `ConnectParam` for the connection to your object storage bucket.
- Create a `RemoteBulkWriter` with the following parameters:
  - `schema`: Schema of the target collection.
  - `remote_path`: Path to the folder that holds the output files in the specified bucket.
  - `segment_size`: Maximum size of a generated file or set of files. If the size of your dataset exceeds the specified value, multiple files or sets of files are generated.
  - `connect_param`: Connection parameters for the connection to your object storage.
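
The connection settings in the list above can also be collected from environment variables instead of being typed into the notebook. The following is a minimal sketch under stated assumptions: the environment variable names, the provider labels `"s3"` and `"gcs"`, and the helper `build_connect_kwargs` are hypothetical; only the two endpoints and the keyword names (`endpoint`, `access_key`, `secret_key`, `bucket_name`, `secure`) come from this notebook.

```python
import os

# Endpoints mentioned in this notebook; the keys "s3" and "gcs" are hypothetical labels.
ENDPOINTS = {"s3": "s3.amazonaws.com", "gcs": "storage.googleapis.com"}

# Hypothetical environment variable names, one per connection setting.
ENV_KEYS = {
    "access_key": "OBJECT_STORAGE_ACCESS_KEY",
    "secret_key": "OBJECT_STORAGE_SECRET_KEY",
    "bucket_name": "OBJECT_STORAGE_BUCKET_NAME",
}


def build_connect_kwargs(provider, env=None):
    """Collect connection settings into a dict of keyword arguments."""
    env = os.environ if env is None else env
    if provider not in ENDPOINTS:
        raise ValueError(f"unknown provider: {provider}")
    missing = [var for var in ENV_KEYS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing settings: {', '.join(missing)}")
    kwargs = {key: env[var] for key, var in ENV_KEYS.items()}
    kwargs["endpoint"] = ENDPOINTS[provider]
    kwargs["secure"] = True
    return kwargs


# Demo with a fake environment so that the sketch runs without real credentials.
demo_env = {
    "OBJECT_STORAGE_ACCESS_KEY": "demo-access",
    "OBJECT_STORAGE_SECRET_KEY": "demo-secret",
    "OBJECT_STORAGE_BUCKET_NAME": "demo-bucket",
}
print(build_connect_kwargs("gcs", demo_env)["endpoint"])  # storage.googleapis.com
```

Because the dictionary keys match the keyword names used in this notebook, the result could be unpacked as `RemoteBulkWriter.ConnectParam(**build_connect_kwargs("s3"))`.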

In [None]:
YOUR_OBJECT_STORAGE_ACCESS_KEY = ""
YOUR_OBJECT_STORAGE_SECRET_KEY = ""
YOUR_OBJECT_STORAGE_BUCKET_NAME = ""


# Extract the ID from the share link of the dataset file.
# For a file at https://drive.google.com/file/d/12RkoDPAlk-sclXdjeXT6DMFVsQr4612w/view?usp=drive_link, the ID should be 12RkoDPAlk-sclXdjeXT6DMFVsQr4612w.
# Concatenate the file ID to the end of the url as follows:

url = 'https://drive.google.com/uc?id=12RkoDPAlk-sclXdjeXT6DMFVsQr4612w'
dataset = pd.read_csv(url)

connect_param = RemoteBulkWriter.ConnectParam(
    endpoint="s3.amazonaws.com", # use 'storage.googleapis.com' for GCS
    access_key=YOUR_OBJECT_STORAGE_ACCESS_KEY,
    secret_key=YOUR_OBJECT_STORAGE_SECRET_KEY,
    bucket_name=YOUR_OBJECT_STORAGE_BUCKET_NAME,
    secure=True
)

remote_writer = RemoteBulkWriter(
    schema=schema,
    remote_path="medium_articles",
    segment_size=50*1024*1024,
    connect_param=connect_param,
)

for i in range(0, len(dataset)):
  row = dataset.iloc[i].to_dict()
  row["vector"] = json.loads(row["vector"])
  remote_writer.append_row(row)

remote_writer.commit()
print("test local writer done!")
print(remote_writer.data_path)

INFO:local_bulk_writer:Data path created: /usr/local/lib/python3.10/dist-packages/bulk_writer
INFO:local_bulk_writer:Data path created: /usr/local/lib/python3.10/dist-packages/bulk_writer/a1f14844-1722-47e1-990e-2b4a735e1ed2
INFO:remote_bulk_writer:Remote buffer writer initialized, target path: /medium_articles/a1f14844-1722-47e1-990e-2b4a735e1ed2
INFO:local_bulk_writer:Prepare to flush buffer, row_count: 5979, size: 19463443
INFO:local_bulk_writer:Flush thread begin, name: Thread-10 (_flush)
INFO:local_bulk_writer:Wait flush to finish
INFO:bulk_buffer:Successfully persist column-based file /usr/local/lib/python3.10/dist-packages/bulk_writer/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/id.npy
INFO:bulk_buffer:Successfully persist column-based file /usr/local/lib/python3.10/dist-packages/bulk_writer/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/title.npy
INFO:bulk_buffer:Successfully persist column-based file /usr/local/lib/python3.10/dist-packages/bulk_writer/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/

test local writer done!
/medium_articles/a1f14844-1722-47e1-990e-2b4a735e1ed2
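
The flush log above reports the buffered row count and size, which makes it possible to reason about `segment_size`. The following is a rough sketch, assuming the writer starts a new set of files whenever the buffered data exceeds `segment_size`; `estimate_segments` is a hypothetical helper, and the real writer may split at slightly different points.

```python
import math

SEGMENT_SIZE = 50 * 1024 * 1024  # bytes, as passed to RemoteBulkWriter above
BUFFERED_BYTES = 19_463_443      # "size" from the flush log above
ROW_COUNT = 5979                 # "row_count" from the flush log above


def estimate_segments(total_bytes, segment_size):
    """Rough count of file sets needed for total_bytes (assumption: split by size)."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return max(1, math.ceil(total_bytes / segment_size))


print(SEGMENT_SIZE)                                          # 52428800
print(estimate_segments(BUFFERED_BYTES, SEGMENT_SIZE))       # 1
print(estimate_segments(10 * BUFFERED_BYTES, SEGMENT_SIZE))  # 4

# Average buffered bytes per row, useful for sizing larger datasets.
avg_row_bytes = BUFFERED_BYTES / ROW_COUNT
```

An estimate of 1 is consistent with the single target folder reported for this dataset. A dataset ten times larger would, under the same assumption, be split into about four sets of files.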


In [None]:
# List the files in the remote folder to verify the upload

client = Minio(
    endpoint="s3.amazonaws.com", # use 'storage.googleapis.com' for GCS
    access_key=YOUR_OBJECT_STORAGE_ACCESS_KEY,
    secret_key=YOUR_OBJECT_STORAGE_SECRET_KEY,
    secure=True)

objects = client.list_objects(
    bucket_name=YOUR_OBJECT_STORAGE_BUCKET_NAME,
    prefix=str(remote_writer.data_path)[1:],
    recursive=True
)

for obj in objects:
    print(obj.object_name)

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-east-1.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://s3.us-east-1.amazonaws.com:443 "GET /doc-demo-1?location= HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): doc-demo-1.s3.us-west-2.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://doc-demo-1.s3.us-west-2.amazonaws.com:443 "GET /?delimiter=&encoding-type=url&list-type=2&max-keys=1000&prefix=medium_articles%2Fa1f14844-1722-47e1-990e-2b4a735e1ed2 HTTP/1.1" 200 None


medium_articles/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/claps.npy
medium_articles/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/id.npy
medium_articles/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/link.npy
medium_articles/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/publication.npy
medium_articles/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/reading_time.npy
medium_articles/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/responses.npy
medium_articles/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/title.npy
medium_articles/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/vector.npy
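
The listing above can be checked against the schema. The following is a minimal sketch, assuming each numbered folder holds one `.npy` file per schema field, as the output above suggests; `group_by_segment` and `missing_fields` are hypothetical helpers, and `OBJECT_NAMES` is copied from the listing above.

```python
from collections import defaultdict
from pathlib import PurePosixPath

PREFIX = "medium_articles/a1f14844-1722-47e1-990e-2b4a735e1ed2/1/"

# Object names copied from the listing above.
OBJECT_NAMES = [
    PREFIX + "claps.npy",
    PREFIX + "id.npy",
    PREFIX + "link.npy",
    PREFIX + "publication.npy",
    PREFIX + "reading_time.npy",
    PREFIX + "responses.npy",
    PREFIX + "title.npy",
    PREFIX + "vector.npy",
]

# Field names from the collection schema defined earlier.
EXPECTED_FIELDS = {
    "id", "title", "vector", "link",
    "reading_time", "publication", "claps", "responses",
}


def group_by_segment(object_names):
    """Map each numbered folder to the set of field names found in it."""
    segments = defaultdict(set)
    for name in object_names:
        path = PurePosixPath(name)
        segments[path.parent.name].add(path.stem)
    return dict(segments)


def missing_fields(object_names, expected_fields):
    """Return {folder: sorted missing field names} for incomplete folders only."""
    report = {}
    for segment, found in group_by_segment(object_names).items():
        missing = expected_fields - found
        if missing:
            report[segment] = sorted(missing)
    return report


print(missing_fields(OBJECT_NAMES, EXPECTED_FIELDS))  # {}
```

An empty report means every schema field has a file in each folder. In the notebook, the names could instead be collected from the `objects` iterator with `[obj.object_name for obj in objects]`.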
