
# Data Loading

## Create Table

```sql
DROP SCHEMA IF EXISTS demo CASCADE;

CREATE SCHEMA demo;

DROP TABLE IF EXISTS demo.amzn_reviews2;

CREATE TABLE demo.amzn_reviews2(
  marketplace TEXT, 
  customer_id BIGINT, 
  review_id TEXT,
  product_id TEXT, 
  product_parent BIGINT, 
  product_title TEXT, 
  product_category TEXT, 
  star_rating INTEGER, 
  helpful_votes INTEGER, 
  total_votes INTEGER, 
  vine TEXT, 
  verified_purchase TEXT, 
  review_headline TEXT, 
  review_body TEXT, 
  review_date DATE) 
DISTRIBUTED BY (review_id);
```



## Check That the Table Exists

```sql
SELECT COUNT(*) FROM demo.amzn_reviews2;
```

## Load the Input Dataset using the gpload Utility

**gpload** is a data loading utility that acts as an interface to the Greenplum Database external table parallel loading feature. Using a load specification defined in a YAML-formatted control file, gpload executes a load in three steps: it invokes the Greenplum Database parallel file server (**gpfdist**), creates an external table definition based on the source data defined, and executes an *INSERT*, *UPDATE*, or *MERGE* operation to load the source data into the target table in the database.

You can declare more than one input file, as long as all the files share the same format. Files compressed with **gzip** or **bzip2** (with a .gz or .bz2 file extension) are uncompressed automatically, provided that gunzip or bunzip2 is in your path. The control file can also declare the schema of the source data files, apply basic transformations, and define custom delimiter and escape characters, among other options. For the full list of available options, see the gpload utility reference in the Pivotal Greenplum Database documentation (Pivotal Greenplum Documentation > Utility Guide > Management Utility Reference > gpload).

The operation, including any SQL commands specified in the SQL collection of the YAML control file, is performed as a single transaction to prevent inconsistent data when multiple load operations run simultaneously against a target table.

For our demo, we have prepared the YAML control file gpload-amzn-reviews2.yaml, shown here:



```bash
cat btpn/gpload-amzn-reviews2.yaml
```

```yaml
VERSION: 1.0.0.1
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - /data1/tmp_s3_data/will_use/amazon_reviews_us_*.tsv.gz
    - FORMAT: text
    - HEADER: true
    - LOG_ERRORS: true
    - MAX_LINE_LENGTH: 100000
    - ERROR_LIMIT: 50000
   OUTPUT:
    - TABLE: demo.amzn_reviews2
    - MODE: insert
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true
```
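
To make the mechanism described earlier more concrete, here is a rough, hand-written sketch of SQL that approximates what gpload automates for the control file above. It is not the exact set of statements gpload generates. The external table name `demo.ext_amzn_reviews2`, the host `etlhost`, and the port `8081` are hypothetical placeholders, and the sketch assumes a gpfdist process is already running and serving the directory that holds the source files.

```sql
-- Hypothetical external table: the name, host, and port are placeholders.
-- Assumes gpfdist is already serving the directory with the source files.
CREATE EXTERNAL TABLE demo.ext_amzn_reviews2 (LIKE demo.amzn_reviews2)
LOCATION ('gpfdist://etlhost:8081/amazon_reviews_us_*.tsv.gz')
FORMAT 'TEXT' (HEADER)                       -- TEXT format is tab-delimited by default
LOG ERRORS SEGMENT REJECT LIMIT 50000 ROWS;  -- reject limit applies per segment

-- Run the load as one transaction so a failure leaves the target unchanged.
BEGIN;
TRUNCATE demo.amzn_reviews2;                 -- mirrors PRELOAD: TRUNCATE: true
INSERT INTO demo.amzn_reviews2
SELECT * FROM demo.ext_amzn_reviews2;
COMMIT;

-- Inspect rows that were rejected because of formatting errors.
SELECT linenum, errmsg, rawdata
FROM gp_read_error_log('demo.ext_amzn_reviews2')
ORDER BY linenum;
```

In this sketch the error log is attached to the external table, so that is the name passed to `gp_read_error_log`. When gpload itself runs the load, it manages the external table for you.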

## Clear Existing Error Log Entries

```sql
SELECT gp_truncate_error_log('demo.amzn_reviews2');
```

## Run gpload

```sh
gpload -d dev -f gpload-amzn-reviews2.yaml -l ./gpload_amzn_reviews.log 2>&1
```

## Check gpload execution

```sql
SELECT COUNT(*) FROM demo.amzn_reviews2;
```
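
Beyond the total row count, a couple of quick checks can help confirm that the load looks sane. The queries below are only a sketch; they rely on the columns defined in the table above and on `gp_segment_id`, a system column available on Greenplum tables.

```sql
-- Rows per segment: a roughly even spread suggests the distribution key
-- (review_id) is not causing data skew.
SELECT gp_segment_id, COUNT(*) AS row_count
FROM demo.amzn_reviews2
GROUP BY gp_segment_id
ORDER BY gp_segment_id;

-- Row counts and review date range per category, which can help spot
-- an input file that was skipped or only partially loaded.
SELECT product_category,
       COUNT(*)         AS reviews,
       MIN(review_date) AS first_review,
       MAX(review_date) AS last_review
FROM demo.amzn_reviews2
GROUP BY product_category
ORDER BY reviews DESC;
```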