# Understanding Hive with Retail Dataset (EMR + HDFS + Hive)


## Pre-Requisites

### Connect to EMR primary node (SSH)
Connect using PuTTY (Windows) or Terminal (macOS/Linux).

### Upload `retail_db` to the EMR node
Use FileZilla / WinSCP / scp to copy the dataset folder onto the EMR node.
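
If you choose `scp`, the command has roughly the shape printed by the sketch below. The key file name and host name are placeholders (assumptions, not values from this walkthrough), and `build_scp_command` is a hypothetical helper that only prints the command so you can review it before running it. `hadoop` is the default SSH user on EMR nodes.

```shell
# Placeholders (assumptions): replace with your own key pair file and
# the public DNS name of your EMR primary node.
KEY_FILE="my-emr-key.pem"
EMR_HOST="your-primary-node-dns"

# Hypothetical helper: prints (does not run) the scp command.
# $1 = key file, $2 = host, $3 = local folder to copy recursively
build_scp_command() {
  printf 'scp -i %s -r %s hadoop@%s:~/\n' "$1" "$3" "$2"
}

build_scp_command "$KEY_FILE" "$EMR_HOST" "retail_db"
```

Run the printed command from the directory that contains the `retail_db` folder.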

### Upload `retail_db` into HDFS
Create your HDFS home folder and upload the dataset.


In [None]:
hadoop fs -mkdir -p /user/hadoop/nagabhushan

In [None]:
hadoop fs -put retail_db /user/hadoop/nagabhushan

## Working with `retail_db` using Hive

### Validate datasets are available in HDFS
Confirm the expected folders exist:
- customers
- categories
- orders
- order_items
- products
- departments


In [None]:
hadoop fs -ls -R /user/hadoop/nagabhushan/retail_db/

### Launch Hive
Launch the Hive CLI on the EMR node, then set the execution engine to MapReduce at the `hive>` prompt.


In [None]:
hive

hive> set hive.execution.engine=mr;

### Create a database for retail data

Create and switch to a database named `retail_db`.


In [None]:
SHOW DATABASES;

In [None]:
CREATE DATABASE IF NOT EXISTS retail_db;

In [None]:
SHOW DATABASES;

In [None]:
USE retail_db;

In [None]:
SHOW TABLES;

## Create `orders` table

Before creating the table, validate that order data exists in HDFS.

> `dfs` commands run from inside the Hive CLI and allow you to interact with HDFS.


In [None]:
dfs -tail /user/hadoop/nagabhushan/retail_db/orders/part-00000;

### Create table: `orders`

The table reads CSV (comma-separated) files stored as text.

**Important:** Ensure the `LOCATION` path is correct and contains no unintended leading or trailing spaces.
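
A stray space inside the quoted path is easy to miss by eye. The sketch below is one way to check a path string from the shell before pasting it into the `CREATE TABLE` statement. `check_location` is a hypothetical helper name, not a Hive or Hadoop command.

```shell
# Hypothetical helper: succeeds (exit 0) when the path has no leading
# or trailing whitespace, fails (exit 1) otherwise.
check_location() {
  case "$1" in
    [[:space:]]*|*[[:space:]]) return 1 ;;
    *) return 0 ;;
  esac
}

# The path used in this walkthrough passes the check.
if check_location "/user/hadoop/nagabhushan/retail_db/orders"; then
  echo "LOCATION looks clean"
fi

# The same path with a trailing space is rejected.
if ! check_location "/user/hadoop/nagabhushan/retail_db/orders "; then
  echo "LOCATION has stray whitespace"
fi
```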


In [None]:
CREATE TABLE IF NOT EXISTS orders (
  order_id INT,
  order_date TIMESTAMP,
  order_customer_id INT,
  order_status STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hadoop/nagabhushan/retail_db/orders';

### Validate `orders` table metadata and sample data

In [None]:
SHOW TABLES;

In [None]:
DESC orders;

In [None]:
DESC FORMATTED orders;

In [None]:
dfs -ls /user/hadoop/nagabhushan/retail_db/orders;

In [None]:
SELECT * FROM orders LIMIT 10;

In [None]:
SELECT COUNT(*) FROM orders;

### Quick Exercises on `orders`

#### Retrieve `order_status`
First, retrieve all statuses (with duplicates), then distinct statuses.


In [None]:
SELECT order_status FROM orders;

In [None]:
SELECT DISTINCT order_status FROM orders;

#### Count of orders per status

In [None]:
SELECT order_status, COUNT(1) FROM orders GROUP BY order_status;
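
To see what the `GROUP BY` computes, the same tally can be reproduced outside Hive with `awk`. This is only a sketch: the rows below are made up for illustration (they are not taken from the dataset), and `sample_orders` and `count_by_status` are hypothetical helper names.

```shell
# Made-up rows in the orders layout:
# order_id,order_date,order_customer_id,order_status
sample_orders() {
  cat <<'EOF'
1,2014-01-05 00:00:00.0,101,COMPLETE
2,2014-01-06 00:00:00.0,102,CLOSED
3,2014-02-10 00:00:00.0,101,COMPLETE
4,2014-03-01 00:00:00.0,103,PENDING
EOF
}

# Rough equivalent of:
#   SELECT order_status, COUNT(1) FROM orders GROUP BY order_status;
# Reads CSV rows on stdin and prints "status<TAB>count", sorted by status.
count_by_status() {
  awk -F',' '{ n[$4]++ } END { for (s in n) print s "\t" n[s] }' | LC_ALL=C sort
}

sample_orders | count_by_status
```

To cross-check the real data, the same function can read from HDFS, for example `hadoop fs -cat /user/hadoop/nagabhushan/retail_db/orders/part-00000 | count_by_status`, assuming all order rows sit in that one part file.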

#### COMPLETE and CLOSED orders placed in Jan and Feb 2014

In [None]:
SELECT *
FROM retail_db.orders
WHERE order_status IN ('COMPLETE', 'CLOSED')
  AND (CAST(order_date AS STRING) LIKE '2014-01%'
       OR CAST(order_date AS STRING) LIKE '2014-02%');
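
The filter combines two conditions: the status must be one of two values, and the date must start with one of two year-month prefixes. A minimal sketch of the same logic in `awk`, run on made-up rows (not taken from the dataset), shows which rows survive. `sample_orders` and `jan_feb_2014_done` are hypothetical helper names.

```shell
# Made-up rows in the orders layout:
# order_id,order_date,order_customer_id,order_status
sample_orders() {
  cat <<'EOF'
1,2013-12-31 00:00:00.0,101,COMPLETE
2,2014-01-05 00:00:00.0,102,CLOSED
3,2014-01-20 00:00:00.0,103,PENDING
4,2014-02-10 00:00:00.0,101,COMPLETE
5,2014-03-01 00:00:00.0,104,CLOSED
EOF
}

# Keep rows whose status is COMPLETE or CLOSED and whose date
# starts with 2014-01 or 2014-02.
jan_feb_2014_done() {
  awk -F',' '($4 == "COMPLETE" || $4 == "CLOSED") && $2 ~ /^2014-0[12]-/'
}

sample_orders | jan_feb_2014_done
```

Only orders 2 and 4 are printed: order 1 and order 5 fall outside the date range, and order 3 has the wrong status.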

## Create `customers` table

Create an **EXTERNAL** table on the customers data stored in HDFS.


In [None]:
CREATE EXTERNAL TABLE IF NOT EXISTS retail_db.customers (
  customer_id INT,
  customer_fname STRING,
  customer_lname STRING,
  customer_email STRING,
  customer_password STRING,
  customer_street STRING,
  customer_city STRING,
  customer_state STRING,
  customer_zipcode STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hadoop/nagabhushan/retail_db/customers';

### Preview customers

In [None]:
SELECT * FROM retail_db.customers LIMIT 10;

### Exercise: Customers who have not placed any orders

Use a **LEFT OUTER JOIN** from customers to orders and keep only the rows with no matching order, which are the rows where the order columns come back as `NULL`.


In [None]:
SELECT c.customer_id, c.customer_fname, c.customer_lname
FROM retail_db.customers c
LEFT OUTER JOIN retail_db.orders o
  ON o.order_customer_id = c.customer_id
WHERE o.order_customer_id IS NULL;
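
The join-then-filter pattern above is often called an anti-join. The sketch below reproduces the idea with `awk` on made-up rows (not taken from the dataset) so the result can be checked by hand. All function names are hypothetical, and the customer rows are trimmed to the first three columns.

```shell
# Made-up rows, trimmed to: customer_id,customer_fname,customer_lname
sample_customers() {
  cat <<'EOF'
101,Mary,Smith
102,John,Lee
103,Ana,Diaz
EOF
}

# Made-up rows in the orders layout:
# order_id,order_date,order_customer_id,order_status
sample_orders() {
  cat <<'EOF'
1,2014-01-05 00:00:00.0,101,COMPLETE
2,2014-01-06 00:00:00.0,102,CLOSED
3,2014-02-10 00:00:00.0,101,COMPLETE
EOF
}

# Anti-join: $1 = orders file, $2 = customers file.
# While reading the orders file, record every order_customer_id;
# while reading the customers file, print customers that were never recorded.
customers_without_orders() {
  awk -F',' '
    FILENAME == ARGV[1] { has_order[$3] = 1; next }
    !($1 in has_order)  { print $1 "," $2 "," $3 }
  ' "$1" "$2"
}

# Run the anti-join on the sample rows using temporary files.
demo_customers_without_orders() {
  tmp_dir=$(mktemp -d) || return 1
  sample_orders > "$tmp_dir/orders.csv"
  sample_customers > "$tmp_dir/customers.csv"
  customers_without_orders "$tmp_dir/orders.csv" "$tmp_dir/customers.csv"
  rm -rf "$tmp_dir"
}

demo_customers_without_orders
```

Customer 103 is the only one printed, because 101 and 102 each appear as an `order_customer_id` in the sample orders.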