## Managed Tables - Exercise

Let us use NYSE data and see how we can create tables in Spark Metastore.

In [1]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/Ag7tkdhewcM?rel=0&amp;controls=1&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [2]:
val username = System.getProperty("user.name")

username = itv001477


itv001477

In [3]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - Basic DDL and DML").
    master("yarn").
    getOrCreate

username = itv001477
spark = org.apache.spark.sql.SparkSession@2d988ea


org.apache.spark.sql.SparkSession@2d988ea

If you prefer to work from a CLI, you can launch Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* Duration: **30 Minutes**
* Data Location (Local): /data/nyse_all/nyse_data
* Create a database with the name - YOUR_OS_USER_NAME_nyse
* Table Name: nyse_eod
* File Format: TEXTFILE (default)
* Review the files by running Linux commands before using the data sets. The data is compressed, and we can load the files as is.
* Copy one of the zip files to your home directory and preview the data. There should be 7 fields. You need to determine the delimiter.
* Field Names: stockticker, tradedate, openprice, highprice, lowprice, closeprice, volume
* Determine the correct data types based on the values. For example, you need to use `BIGINT` for volume, not `INT`.
* Create a managed table with the default delimiter.
> As the delimiters in the data and in the table are not the same, you need to figure out how to get the data into the target table.
* Make sure the data is copied into the table as per the structure defined, and validate it.
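
The preview step in the list above (finding the delimiter and checking the data types) can be sketched in plain Scala. This is only an illustration: `DelimiterCheck`, `Eod`, the candidate delimiter list, and the sample line are hypothetical. The sample line is reconstructed from the query output shown later in this notebook, on the assumption that the raw files are comma separated.

```scala
import java.util.regex.Pattern

// Hypothetical record type mirroring the 7 fields named in the exercise.
final case class Eod(
  stockticker: String, tradedate: Int,
  openprice: Double, highprice: Double, lowprice: Double, closeprice: Double,
  volume: Long // Long corresponds to BIGINT in the table definition
)

object DelimiterCheck {
  val expectedFields = 7

  // Assumed candidates; '\u0001' (Ctrl-A) is the default field delimiter of Hive text tables.
  val candidates: Seq[Char] = Seq(',', '\t', '|', ';', '\u0001')

  // Assumed sample line, reconstructed from the first row of the SELECT output.
  val sample = "AA,19980101,52.77,52.77,52.77,52.77,0"

  // A delimiter fits if it occurs exactly (expectedFields - 1) times in the line.
  def detect(line: String): Option[Char] =
    candidates.find(d => line.count(_ == d) == expectedFields - 1)

  // Parse one line with the chosen delimiter; throws if a value does not match its type.
  def parse(line: String, delimiter: Char): Eod = {
    val f = line.split(Pattern.quote(delimiter.toString), -1)
    Eod(f(0), f(1).toInt, f(2).toDouble, f(3).toDouble, f(4).toDouble, f(5).toDouble, f(6).toLong)
  }
}
```

Running `DelimiterCheck.detect` against a few lines copied from a real file, and then `DelimiterCheck.parse` on the same lines, is a quick way to confirm both the delimiter and the chosen column types before writing the `CREATE TABLE` statement.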

In [3]:
%%sql
CREATE DATABASE IF NOT EXISTS itv001477_nyse

Waiting for a Spark session to start...

++
||
++
++



In [4]:
%%sql
USE itv001477_nyse

Waiting for a Spark session to start...

++
||
++
++



In [5]:
%%sql
DROP TABLE IF EXISTS nyse_eod

++
||
++
++



In [6]:
%%sql
CREATE TABLE nyse_eod(
    stockticker STRING, 
    tradedate INT, 
    openprice DOUBLE, 
    highprice DOUBLE, 
    lowprice DOUBLE, 
    closeprice DOUBLE, 
    volume BIGINT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

++
||
++
++



In [7]:
%%sql
LOAD DATA LOCAL INPATH '/home/${username}/NYSE_dataFiles/NYSE_*.txt' INTO TABLE nyse_eod

++
||
++
++
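
The cells above declare a comma delimiter on `nyse_eod` itself, whereas the exercise asks for a managed table with the default delimiter. One possible way to reconcile the two, sketched here under assumptions, is to load the files into a comma-delimited stage table and then copy the rows into the target table with `INSERT`. The names `StageLoad` and `nyse_eod_stage` are hypothetical, the file path is the one used in the `LOAD DATA` cell, and the statements are built as plain strings so that they can be reviewed before anything is executed.

```scala
object StageLoad {
  // Column list taken from the CREATE TABLE cell in this notebook.
  val columns: String =
    "stockticker STRING, tradedate INT, openprice DOUBLE, highprice DOUBLE, " +
    "lowprice DOUBLE, closeprice DOUBLE, volume BIGINT"

  // Builds the statements for a given OS user name, in execution order.
  def statements(username: String): Seq[String] = {
    val db = s"${username}_nyse"
    Seq(
      s"CREATE DATABASE IF NOT EXISTS ${db}",
      // Stage table (hypothetical name) whose delimiter matches the files.
      s"DROP TABLE IF EXISTS ${db}.nyse_eod_stage",
      s"CREATE TABLE ${db}.nyse_eod_stage (${columns}) " +
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE",
      s"LOAD DATA LOCAL INPATH '/home/${username}/NYSE_dataFiles/NYSE_*.txt' " +
        s"INTO TABLE ${db}.nyse_eod_stage",
      // Target table: no ROW FORMAT clause, so the default delimiter applies.
      // Note that this drops any existing nyse_eod table first.
      s"DROP TABLE IF EXISTS ${db}.nyse_eod",
      s"CREATE TABLE ${db}.nyse_eod (${columns}) STORED AS TEXTFILE",
      // Rows are rewritten with the target table's delimiter during the insert.
      s"INSERT INTO TABLE ${db}.nyse_eod SELECT * FROM ${db}.nyse_eod_stage"
    )
  }
}
```

In the notebook, assuming the `spark` session and `username` value created earlier, the statements could be run with `StageLoad.statements(username).foreach(s => spark.sql(s))`. The stage table can be dropped once the validation queries return the expected results.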



### Validation

Run the following queries to ensure that you will be able to read the data.

```
DESCRIBE FORMATTED YOUR_OS_USER_NAME_nyse.nyse_eod;
SELECT * FROM YOUR_OS_USER_NAME_nyse.nyse_eod LIMIT 10;
SELECT count(1) FROM YOUR_OS_USER_NAME_nyse.nyse_eod;
```

In [8]:
// The output should not list a field delimiter, as the requirement is to use the default delimiter
spark.sql("DESCRIBE FORMATTED itversity_nyse.nyse_eod").show(200, false)

+----------------------------+-------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                |comment|
+----------------------------+-------------------------------------------------------------------------+-------+
|stockticker                 |string                                                                   |null   |
|tradedate                   |string                                                                   |null   |
|openprice                   |double                                                                   |null   |
|highprice                   |double                                                                   |null   |
|lowprice                    |double                                                                   |null   |
|closeprice                  |double                                                            

In [9]:
%%sql

SELECT * FROM itv001477_nyse.nyse_eod LIMIT 10

+-----------+---------+---------+---------+--------+----------+------+
|stockticker|tradedate|openprice|highprice|lowprice|closeprice|volume|
+-----------+---------+---------+---------+--------+----------+------+
|         AA| 19980101|    52.77|    52.77|   52.77|     52.77|     0|
|        ABC| 19980101|     7.28|     7.28|    7.28|      7.28|     0|
|        ABM| 19980101|    15.28|    15.28|   15.28|     15.28|     0|
|        ABT| 19980101|    32.75|    32.75|   32.75|     32.75|     0|
|        ABX| 19980101|    18.62|    18.62|   18.62|     18.62|     0|
|        ACP| 19980101|     9.75|     9.75|    9.75|      9.75|     0|
|        ACV| 19980101|    21.37|    21.37|   21.37|     21.37|     0|
|        ADC| 19980101|    21.75|    21.75|   21.75|     21.75|     0|
|        ADM| 19980101|    17.84|    17.84|   17.84|     17.84|     0|
|        ADX| 19980101|    16.12|    16.12|   16.12|     16.12|     0|
+-----------+---------+---------+---------+--------+----------+------+



In [10]:
%%sql

SELECT count(1) FROM itv001477_nyse.nyse_eod

+--------+
|count(1)|
+--------+
| 9384739|
+--------+

