<div id="singlestore-header" style="display: flex; background-color: rgba(124, 195, 235, 0.25); padding: 5px;">
    <div id="icon-image" style="width: 90px; height: 90px;">
        <img width="100%" height="100%" src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/database.png" />
    </div>
    <div id="text" style="padding: 5px; margin-left: 10px;">
        <div id="badge" style="display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%">SingleStore Notebooks</div>
        <h1 style="font-weight: 500; margin: 8px 0 0 4px;">Learn How to Ingest JSON Files in S3 into SingleStoreDB</h1>
    </div>
</div>

This notebook walks you through two scenarios for ingesting JSON files from an AWS S3 location:
* Ingest JSON files in AWS S3 using wildcards into a table with a predefined schema
* Ingest JSON files in AWS S3 using wildcards into a JSON column

## Create a Pipeline from JSON files in AWS S3 using wildcards

In this example, we create a pipeline from two JSON files, **actors1.json** and **actors2.json**, stored in the AWS S3 bucket **studiotutorials** under the folder **sample_dataset/json_files/wildcard_demo**. The bucket is located in **us-east-1**.

Each file has the following shape with nested objects and arrays:
```json
{
  "Actors": [
    {
      "name": "Tom Cruise",
      "age": 56,
      "Born At": "Syracuse, NY",
      "Birthdate": "July 3, 1962",
      "photo": "https://jsonformatter.org/img/tom-cruise.jpg",
      "wife": null,
      "weight": 67.5,
      "hasChildren": true,
      "hasGreyHair": false,
      "children": [
        "Suri",
        "Isabella Jane",
        "Connor"
      ]
    },
    {
      "name": "Robert Downey Jr.",
      "age": 53,
      "Born At": "New York City, NY",
      "Birthdate": "April 4, 1965",
      "photo": "https://jsonformatter.org/img/Robert-Downey-Jr.jpg",
      "wife": "Susan Downey",
      "weight": 77.1,
      "hasChildren": true,
      "hasGreyHair": false,
      "children": [
        "Indio Falconer",
        "Avri Roel",
        "Exton Elias"
      ]
    }
  ]
}
```
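
To make the target schema concrete, here is a minimal Python sketch of how one document of this shape could be flattened into one row per actor. It is an illustration only, not how the pipeline works internally: the `KEY_TO_COLUMN` mapping and the `to_row` helper are hypothetical names, and the sample is trimmed to a few fields.

```python
import json

# Sample document in the shape shown above, trimmed to a few fields.
doc = json.loads("""
{
  "Actors": [
    {"name": "Tom Cruise", "age": 56, "Born At": "Syracuse, NY",
     "wife": null, "children": ["Suri", "Isabella Jane", "Connor"]},
    {"name": "Robert Downey Jr.", "age": 53, "Born At": "New York City, NY",
     "wife": "Susan Downey",
     "children": ["Indio Falconer", "Avri Roel", "Exton Elias"]}
  ]
}
""")

# Hypothetical key-to-column mapping: "Born At" is renamed because a column
# name with a space is awkward to work with.
KEY_TO_COLUMN = {
    "name": "name",
    "age": "age",
    "Born At": "born_at",
    "wife": "wife",
    "children": "children",
}


def to_row(actor):
    """Flatten one actor object: scalars stay as-is, arrays become JSON text."""
    row = {}
    for key, column in KEY_TO_COLUMN.items():
        value = actor.get(key)
        row[column] = json.dumps(value) if isinstance(value, list) else value
    return row


rows = [to_row(actor) for actor in doc["Actors"]]
for row in rows:
    print(row)
```

Scalar fields map to ordinary columns, a JSON `null` becomes a missing value, and the nested `children` array is kept as JSON rather than being split into separate rows.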

### Create a Table

We first create a table called **actors** in the database **demo_database**.

In [11]:
%%sql
CREATE DATABASE IF NOT EXISTS demo_database;
USE demo_database;
CREATE TABLE IF NOT EXISTS demo_database.actors (
    name text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    age int NOT NULL,
    born_at text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    Birthdate text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    photo text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    wife text CHARACTER SET utf8 COLLATE utf8_general_ci,
    weight float NOT NULL,
    haschildren boolean,
    hasGreyHair boolean,
    children JSON COLLATE utf8_bin NOT NULL,
    SHARD KEY ()
);



### Create a pipeline

We then create a pipeline called **actors** in the database **demo_database**. Because the files are small, `BATCH_INTERVAL` matters little here and `MAX_PARTITIONS_PER_BATCH` is set to 1. For larger datasets, we recommend increasing the maximum partitions per batch for faster ingestion.
Note that since the bucket is publicly accessible, you do not need to provide an access key and secret, so the `CREDENTIALS` clause is commented out.

In [13]:
%%sql
CREATE PIPELINE if not exists demo_database.actors
AS LOAD DATA S3 'studiotutorials/sample_dataset/json_files/wildcard_demo/*.json'
CONFIG '{"region": "us-east-1"}'
/*
CREDENTIALS '{"aws_access_key_id": "<Key to Enter>",  
    "aws_secret_access_key": "<Key to Enter>"}'
*/
BATCH_INTERVAL 2500
MAX_PARTITIONS_PER_BATCH 1
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `actors`
FORMAT JSON
(
    actors.name <- name,
    actors.age <- age,
    actors.born_at <- `Born At`,
    actors.Birthdate <- Birthdate,
    actors.photo <- photo,
    actors.wife <- wife,
    actors.weight <- weight,
    actors.haschildren <- hasChildren,
    actors.hasGreyHair <- hasGreyHair,
    actors.children <- children
);
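
As a rough illustration of what the `*.json` wildcard in the pipeline path selects, the Python sketch below filters a list of object keys with the standard-library `fnmatch` module. The keys other than the two actor files are hypothetical, and `fnmatch` is only a stand-in: SingleStore's own glob matching may differ in edge cases (for example, in how `*` treats `/`).

```python
from fnmatch import fnmatchcase

# Path pattern taken from the pipeline definition (bucket name first).
PATTERN = "studiotutorials/sample_dataset/json_files/wildcard_demo/*.json"

# Hypothetical bucket listing used only for this illustration.
keys = [
    "studiotutorials/sample_dataset/json_files/wildcard_demo/actors1.json",
    "studiotutorials/sample_dataset/json_files/wildcard_demo/actors2.json",
    "studiotutorials/sample_dataset/json_files/wildcard_demo/readme.txt",
    "studiotutorials/sample_dataset/json_files/other_demo/actors3.json",
]

# Keep only the keys that match the wildcard pattern (case-sensitive).
matched = [key for key in keys if fnmatchcase(key, PATTERN)]
for key in matched:
    print(key)
```

Only the two `.json` files under `wildcard_demo` match. The text file and the file in the other folder are skipped.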



### Start and monitor the pipeline

In [15]:
%%sql
Start pipeline demo_database.actors;



Check `information_schema.pipelines_errors` for errors or warnings. If the pipeline ran cleanly, the query returns no rows.

In [17]:
%%sql
select * from information_schema.pipelines_errors
where pipeline_name = 'actors' ;

DATABASE_NAME,PIPELINE_NAME,ERROR_UNIX_TIMESTAMP,ERROR_TYPE,ERROR_CODE,ERROR_MESSAGE,ERROR_KIND,STD_ERROR,LOAD_DATA_LINE,LOAD_DATA_LINE_NUMBER,BATCH_ID,ERROR_ID,BATCH_SOURCE_PARTITION_ID,BATCH_EARLIEST_OFFSET,BATCH_LATEST_OFFSET,HOST,PORT,PARTITION


### Query the table

In [18]:
%%sql
select * from demo_database.actors;

name,age,born_at,Birthdate,photo,wife,weight,haschildren,hasGreyHair,children
Robert Downey Jr.,53,"New York City, NY","April 4, 1965",https://jsonformatter.org/img/Robert-Downey-Jr.jpg,Susan Downey,77.1,1,0,"['Indio Falconer', 'Avri Roel', 'Exton Elias']"
Tom Cruise,56,"Syracuse, NY","July 3, 1962",https://jsonformatter.org/img/tom-cruise.jpg,,67.5,1,0,"['Suri', 'Isabella Jane', 'Connor']"


### Clean up resources

In [20]:
%%sql
Drop pipeline if exists demo_database.actors;
Drop table if exists demo_database.actors;



## Ingest JSON files in AWS S3 using wildcards into a JSON column

Because the schema of your files might change, you may prefer to keep ingestion flexible by loading each record into a single JSON column, which we name **json_data**. The table we create is named **actors_json**.

### Create a Table

In [14]:
%%sql
Create database if not exists demo_database;
Use demo_database;
CREATE TABLE if not exists demo_database.actors_json (
json_data JSON NOT NULL ,
SHARD KEY ()
);

### Create a pipeline

In [21]:
%%sql
CREATE PIPELINE if not exists demo_database.actors_json
AS LOAD DATA S3 'studiotutorials/sample_dataset/json_files/wildcard_demo/*.json'
CONFIG '{"region": "us-east-1"}'
/*
CREDENTIALS '{"aws_access_key_id": "<Key to Enter>",  
    "aws_secret_access_key": "<Key to Enter>"}'
*/
BATCH_INTERVAL 2500
MAX_PARTITIONS_PER_BATCH 1
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `actors_json`
FORMAT JSON
(json_data <- %);
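
In the mapping above, `%` stands for the whole JSON value of each record, so nothing is dropped when a file gains or loses a field. The Python sketch below mimics that idea under stated assumptions: the `awards` field is hypothetical, and the `extract` helper is an illustrative name, not part of any SingleStore API.

```python
import json

# Two records; the second carries an extra, hypothetical "awards" field.
records = [
    {"name": "Tom Cruise", "age": 56, "wife": None},
    {"name": "Robert Downey Jr.", "age": 53, "wife": "Susan Downey", "awards": 3},
]

# Analogue of `json_data <- %`: store each whole object as one JSON value.
table = [{"json_data": json.dumps(record)} for record in records]


def extract(row, key):
    """Read one field at query time; a missing key yields None."""
    return json.loads(row["json_data"]).get(key)


names = [extract(row, "name") for row in table]
awards = [extract(row, "awards") for row in table]
print(names)
print(awards)
```

Both records load without a schema change, and the field that only one of them carries can still be read later. The trade-off is that types and required fields are no longer enforced at ingestion time.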

### Start and monitor the pipeline

In [22]:
%%sql
Start pipeline demo_database.actors_json;

In [23]:
%%sql
# Monitor and see if there is any error or warning
select * from information_schema.pipelines_errors
where pipeline_name = 'actors_json' ;

DATABASE_NAME,PIPELINE_NAME,ERROR_UNIX_TIMESTAMP,ERROR_TYPE,ERROR_CODE,ERROR_MESSAGE,ERROR_KIND,STD_ERROR,LOAD_DATA_LINE,LOAD_DATA_LINE_NUMBER,BATCH_ID,ERROR_ID,BATCH_SOURCE_PARTITION_ID,BATCH_EARLIEST_OFFSET,BATCH_LATEST_OFFSET,HOST,PORT,PARTITION


### Query the table

In [25]:
%%sql
select * from demo_database.actors_json;

json_data
"{'Birthdate': 'July 3, 1962', 'Born At': 'Syracuse, NY', 'age': 56, 'children': ['Suri', 'Isabella Jane', 'Connor'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Tom Cruise', 'photo': 'https://jsonformatter.org/img/tom-cruise.jpg', 'weight': 67.5, 'wife': None}"
"{'Birthdate': 'April 4, 1965', 'Born At': 'New York City, NY', 'age': 53, 'children': ['Indio Falconer', 'Avri Roel', 'Exton Elias'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Robert Downey Jr.', 'photo': 'https://jsonformatter.org/img/Robert-Downey-Jr.jpg', 'weight': 77.1, 'wife': 'Susan Downey'}"
"{'Birthdate': 'April 4, 1965', 'Born At': 'New York City, NY', 'age': 53, 'children': ['Indio Falconer', 'Avri Roel', 'Exton Elias'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Robert Downey Jr.', 'photo': 'https://jsonformatter.org/img/Robert-Downey-Jr.jpg', 'weight': 77.1, 'wife': 'Susan Downey'}"
"{'Birthdate': 'July 3, 1962', 'Born At': 'Syracuse, NY', 'age': 56, 'children': ['Suri', 'Isabella Jane', 'Connor'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Tom Cruise', 'photo': 'https://jsonformatter.org/img/tom-cruise.jpg', 'weight': 67.5, 'wife': None}"


### Clean up resources

In [27]:
%%sql

Drop pipeline if exists demo_database.actors_json;
Drop table if exists demo_database.actors_json;

<div id="singlestore-footer" style="background-color: rgba(194, 193, 199, 0.25); height:2px; margin-bottom:10px"></div>
<div><img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-logo-grey.png" style="padding: 0px; margin: 0px; height: 24px"/></div>