# Data Collection and Exploration with the Iguazio Data Science Platform

Learn how to quickly start using the Iguazio Data Science Platform to collect, ingest, and explore data.

- [Overview](#gs-overview)
  - [Platform Data Containers](#platform-data-containers)
- [Collecting and Ingesting Data](#gs-data-collection-and-ingestion)
  - [Ingesting Data From an External Database to a NoSQL Table Using V3IO Frames](#ingest-from-external-db-to-no-sql-using-frames)
  - [Ingesting Files from Amazon S3](#ingest-from-amazon-s3)
  - [Streaming Data From an External Streaming Engine Using Nuclio](#streaming-data-from-an-external-streaming-engine-using-nuclio)
- [Exploring and Processing Data](#gs-data-exploration-and-processing)
  - [Exploring Data Using Spark DataFrames](#data-exploration-spark)
  - [Exploring Data Using V3IO Frames and pandas DataFrames](#data-exploration-v3io-frames-n-pandas)
  - [Exploring Data Using SQL](#data-exploration-sql)
- [Data Collection and Exploration Getting-Started Example](#getting-started-example)

<a id="gs-overview"></a>
## Overview

This tutorial explains and demonstrates how to collect, ingest, and explore data with the Iguazio Data Science Platform (**"the platform"**).<br>
For an overview of the platform and how it can be used to implement a full data science workflow, see the [**welcome**](../welcome.ipynb) tutorial notebook.<br>
For full end-to-end platform use-case application demos, see the [**demos**](../demos/README.ipynb) tutorial-notebooks directory.

<a id="platform-data-containers"></a>
### Platform Data Containers

Data is stored within data containers in the platform's distributed file system (DFS).
All platform clusters have two predefined containers:

- <a id="default-container"></a> The default **"bigdata"** container.
- <a id="users-container"></a>The **"users"** container, which is designed to contain **&lt;username&gt;** directories that provide individual development environments for storing user-specific data.
  The platform's Jupyter Notebook, Zeppelin, and web-based shell "command-line services" automatically create such a directory for the running user of the service and set it as the home directory of the service environment.
  You can leverage the following environment variables, which are predefined in the platform's command-line services, to access this running-user directory from your code:

  - `V3IO_USERNAME` &mdash; set to the username of the running user of the Jupyter Notebook service.
  - `V3IO_HOME` &mdash; set to the running-user directory in the "users" container &mdash; **users/&lt;running user&gt;**.
  - `V3IO_HOME_URL` &mdash; set to the fully qualified `v3io` path to the running-user directory &mdash; `v3io://users/<running user>`.

The data containers and their contents are referenced differently depending on the programming interface.
For example:

- In local file-system (FS) commands you use the predefined `v3io` root data mount &mdash; `/v3io/<container name>[/<data path>]`.
  There's also a predefined local-FS `User` mount to the **users/&lt;running user&gt;** directory, and you can use the aforementioned environment variables when setting data paths.
  For example, `/v3io/users/$V3IO_USERNAME`, `/v3io/$V3IO_HOME`, and `/User` are all valid ways of referencing the **users/&lt;running user&gt;** directory from a local FS command.
- In Hadoop FS or Spark DataFrame commands you use a fully qualified path of the format `v3io://<container name>/<data path>`.
  You can also use environment variables with these interfaces.
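
As a rough illustration of the two addressing styles, the following sketch builds both forms of the running-user path from the environment variables listed above. The `running_user_paths` helper and its fallback values are hypothetical and exist only for this example; in the platform's services the variables are already defined.

```python
import os

def running_user_paths(env=None):
    """Return the local-FS and fully qualified v3io paths of the running-user directory."""
    env = os.environ if env is None else env
    # The fallbacks are placeholders for running outside the platform (assumption of this sketch)
    username = env.get("V3IO_USERNAME", "<running user>")
    home = env.get("V3IO_HOME", "users/" + username)
    home_url = env.get("V3IO_HOME_URL", "v3io://" + home)
    return {
        "local_fs": "/v3io/" + home,  # for local file-system commands
        "v3io_url": home_url,         # for Hadoop FS and Spark DataFrame commands
    }

paths = running_user_paths()
print(paths["local_fs"])
print(paths["v3io_url"])
```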

For detailed information and examples on how to set the data path for each interface, see [Setting Data Paths](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/fundamentals/#data-paths) and the examples in the platform's tutorial Jupyter notebooks.

<a id="gs-data-collection-and-ingestion"></a>
## Collecting and Ingesting Data

The platform supports various alternative methods for collecting and ingesting data into its data containers (i.e., its data store).
For more information, see the [**welcome**](../welcome.ipynb#data-collection-and-ingestion) platform tutorial Jupyter notebook.
The data collection and ingestion can be done as a one-time operation, using different platform APIs &mdash; which can be run from your preferred programming interface, such as an interactive web-based Jupyter or Zeppelin notebook &mdash; or as an ongoing ingestion stream, using Nuclio serverless functions.
This section explains and demonstrates how to collect and ingest (import) data into the platform using code that's run from a Jupyter notebook.

<a id="ingest-from-external-db-to-no-sql-using-frames"></a>
### Ingesting Data From an External Database to a NoSQL Table Using V3IO Frames

For an example of how to collect data from an external database &mdash; such as MySQL, Oracle, or PostgreSQL &mdash; and ingest (write) it into a NoSQL table in the platform, using the V3IO Frames API, see the [read-external-db](read-external-db.ipynb) getting-started tutorial.

<a id="ingest-from-amazon-s3"></a>
### Ingesting Files from Amazon S3

<a id="ingest-from-amazon-s3-using-curl"></a>
#### Ingesting Files from Amazon S3 to the Platform File System Using curl

You can use a simple [curl](https://curl.haxx.se/) command to ingest a file (object) from an external web data source, such as an Amazon S3 bucket, to the platform's distributed file system (i.e., into the platform's data store).
This is demonstrated in the following code example and in the [getting-started example](#getting-started-example) in this notebook.
The [spark-sql-analytics](spark-sql-analytics.ipynb) getting-started tutorial notebook demonstrates a similar ingestion using [Botocore](https://github.com/boto/botocore).

The example in the following cells uses curl to read a CSV file from the [Iguazio sample data-sets](http://iguazio-sample-data.s3.amazonaws.com/) public Amazon S3 bucket and save it to an **examples** directory in the running-user directory of the predefined "users" data container (`/v3io/users/$V3IO_USERNAME` = `/v3io/$V3IO_HOME` = `/User`).

In [1]:
!mkdir -p /v3io/${V3IO_HOME}/examples

In [2]:
!curl -L "iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv" > /v3io/${V3IO_HOME}/examples/stocks.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  861k  100  861k    0     0   599k      0  0:00:01  0:00:01 --:--:--  599k
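
If you prefer to stay in Python rather than shell out to curl, the same download can be sketched with the standard library. This is a minimal illustration, not the method the tutorial relies on: `download_file` is a hypothetical helper, and the commented-out call spells out the `http://` scheme because `urlopen` requires it.

```python
import os
import shutil
import urllib.request

def download_file(url, dest_path):
    """Stream the object at `url` into `dest_path` and return the number of bytes written."""
    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)  # same effect as `mkdir -p`
    # urlopen follows HTTP redirects by default, similar to `curl -L`
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return os.path.getsize(dest_path)

# Example call (needs network access and the platform's /v3io mount):
# download_file(
#     "http://iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv",
#     os.path.join("/v3io", os.environ["V3IO_HOME"], "examples", "stocks.csv"),
# )
```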


<a id="ingest-from-amazon-s3-to-nosql-table-using-v3io-frames-n-pandas"></a>
#### Ingesting Data from Amazon S3 to a NoSQL Table Using V3IO Frames and pandas

For an example of how to import data from Amazon S3 and save it into a NoSQL table in the platform's data store by using V3IO Frames and pandas DataFrames, see the [frames](frames.ipynb) getting-started tutorial notebook.

<a id="streaming-data-from-an-external-streaming-engine-using-nuclio"></a>
### Streaming Data From an External Streaming Engine Using Nuclio

To read data from an external streaming engine &mdash; such as Kafka, Kinesis, or RabbitMQ &mdash; create a Nuclio function that listens on the stream and writes the stream data to a NoSQL or time-series database (TSDB) table:

1. In the dashboard's side navigation menu, select **Functions** to display the Nuclio serverless functions dashboard.
2. Create a new Nuclio project or select an existing project.
3. In the action toolbar, select **Create Function**.
4. Enter a function name, select an appropriate template, such as **kafka-to-tsdb**, configure the required template parameters, and apply your changes.
5. Select **Deploy** from the action toolbar to deploy your function.
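
For orientation, a Python function deployed this way is built around a `handler(context, event)` entry point that Nuclio calls for each incoming message. The sketch below only decodes a message and logs it. It assumes JSON-encoded messages, and the write to a NoSQL or TSDB table is left as a placeholder because the template, such as **kafka-to-tsdb**, provides that logic.

```python
import json

def handler(context, event):
    """Nuclio entry point: invoked once per message received from the stream trigger."""
    body = event.body
    if isinstance(body, (bytes, bytearray)):
        body = body.decode("utf-8")
    # Assumption: each message is a JSON object; an already-parsed body is used as is
    record = json.loads(body) if isinstance(body, str) else body
    context.logger.info("Received a record with %d fields" % len(record))
    # Placeholder: a real function would write `record` to a NoSQL or TSDB table here
    return json.dumps({"fields": sorted(record)})
```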

<a id="gs-data-exploration-and-processing"></a>
## Exploring and Processing Data

After you have ingested data into the platform's data containers, you can use various alternative methods and tools to explore and analyze the data.
Data scientists typically use Jupyter Notebook to run the exploration phase.
As outlined in the [**welcome**](../welcome.ipynb#data-exploration-and-processing) tutorial notebook, the platform's Jupyter Notebook service has a wide range of pre-deployed popular data science tools (such as Spark and Presto) and allows installation of additional tools and packages, enabling you to use different APIs to access the same data from a single Jupyter notebook.
This section explains and demonstrates how to explore data in the platform from a Jupyter notebook.

<a id="data-exploration-spark"></a>
### Exploring Data Using Spark DataFrames

Spark is a distributed computing framework for data analytics.
You can easily run distributed Spark jobs on your platform cluster that use Spark DataFrames to access data files (objects), tables, or streams in the platform's data store.
For more information and examples, see the [spark-sql-analytics](spark-sql-analytics.ipynb) getting-started tutorial notebook.

<a id="data-exploration-v3io-frames-n-pandas"></a>
### Exploring Data Using V3IO Frames and pandas DataFrames

Iguazio's V3IO Frames open-source data-access library <font color="#00BCF2">\[Tech Preview\]</font> provides a unified high-performance DataFrames API for accessing NoSQL, stream, and time-series data in the platform's data store.
These DataFrames can also be used to analyze the data with pandas. 
For details and examples, see the [frames](frames.ipynb) getting-started tutorial notebook.

<a id="data-exploration-sql"></a>
### Exploring Data Using SQL

You can run SQL statements (`SELECT` only) on top of NoSQL tables in the platform's data store.
To do this, use the `%sql` (line) or `%%sql` (cell) IPython magic, followed by an SQL statement.
The platform supports standard ANSI SQL semantics.
Under the hood, the SQL statements are executed via [Presto](https://prestodb.github.io/), which is a distributed SQL engine designed from the ground up for fast analytics queries.

In the example in the following cell, as a preparation for the SQL query, the **stocks.csv** file that was ingested to the **users/&lt;running user&gt;/examples** platform data-container directory in the previous [Ingesting Files from Amazon S3 to the Platform](#ingest-from-amazon-s3) example is written to a **stocks_example_tab** NoSQL table in the same directory.
Then, an SQL `SELECT` query is run on this table.
You can also find a similar example in the [getting-started example](#getting-started-example) in this notebook.

In [3]:
# Convert the CSV file that was ingested in the AWS S3 data-collection example into a NoSQL table by using the V3IO Frames library.
# NOTE: Make sure to first create a V3IO Frames service from the "Services" page of the platform dashboard, and run the
# "Ingesting Files from Amazon S3 to the Platform File System Using curl" example to create users/$V3IO_USERNAME/examples/stocks.csv.
import pandas as pd
import v3io_frames as v3f
import os
client = v3f.Client('framesd:8081', container='users')

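# Read the ingested CSV file through the local-FS v3io mount into a pandas DataFrame
df = pd.read_csv(os.path.join('/v3io/users', os.getenv('V3IO_USERNAME'), 'examples', 'stocks.csv'))

# The table path is relative to the "users" container that was set for the client
table_path = os.path.join(os.getenv('V3IO_USERNAME'), 'examples', 'stocks_example_tab')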
client.write('kv', table_path, df)

In [4]:
# Use Presto to query the stocks NoSQL table that was created in the previous step
table_path = 'v3io.users."' + os.getenv('V3IO_USERNAME') + '/examples/stocks_example_tab"'
%sql select * from $table_path limit 10

Done.


securitydesc,securitytype,time,isin,minprice,date,endprice,numberoftrades,mnemonic,currency,securityid,maxprice,tradedvolume,startprice
BNP PARIBAS INH. EO 2,Common stock,08:34,FR0000131104,59.3,2018-03-26,59.3,1,BNP,EUR,2505185,59.3,25,59.3
I2-I.MSCI USA QD.UETF DLD,ETF,08:18,IE00BKM4H312,26.225,2018-03-26,26.225,1,QDVD,EUR,2505429,26.225,100,26.225
COMST.-F.A.Z.IDX U.ETF I,ETF,08:55,LU0650624025,27.495,2018-03-26,27.495,1,C006,EUR,2506038,27.495,431,27.495
VANG.FTSE A.-WO.U.ETF DL,ETF,08:53,IE00B3RBWM25,67.39,2018-03-26,67.39,1,VGWL,EUR,2749247,67.39,348,67.39
LEONI AG NA O.N.,Common stock,08:56,DE0005408884,52.5,2018-03-26,52.52,9,LEO,EUR,2504929,52.56,538,52.56
BAYWA AG VINK.NA. O.N.,Common stock,08:41,DE0005194062,28.35,2018-03-26,28.35,1,BYW6,EUR,2504903,28.35,22,28.35
TLG IMMOBILIEN AG,Common stock,08:55,DE000A12B8Z4,22.7,2018-03-26,22.7,6,TLG,EUR,2504555,22.7,446,22.7
"RIO TINTO PLC LS-,10",Common stock,08:59,GB0007188757,41.285,2018-03-26,41.285,1,RIO1,EUR,2505378,41.285,12,41.285
"NOVARTIS NAM. SF 0,50",Common stock,08:49,CH0012005267,65.0,2018-03-26,65.0,1,NOT,EUR,2504217,65.0,26,65.0
IS EO H.Y.CO.BD U.ETF EOD,ETF,08:50,IE00B66F4759,104.795,2018-03-26,104.795,1,EUNW,EUR,2505762,104.795,100,104.795
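
The table identifier passed to `%sql` in the cell above follows the pattern `v3io.<container>."<table path>"`, as can be read off the query itself. A small helper can keep that quoting in one place; `presto_table` is a hypothetical name used only for this sketch.

```python
import os

def presto_table(container, table_dir):
    """Build the Presto identifier of a NoSQL table in a platform data container."""
    # The table path contains slashes, so it is wrapped in double quotes
    return 'v3io.{}."{}"'.format(container, table_dir.strip("/"))

# The fallback is a placeholder for running outside the platform
username = os.getenv("V3IO_USERNAME", "<running user>")
table_path = presto_table("users", username + "/examples/stocks_example_tab")
print(table_path)
```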


<a id="getting-started-example"></a>
## Data Collection and Exploration Getting-Started Example

This section demonstrates a data collection, ingestion, and exploration flow.
Follow the tutorial by running the code cells in order of appearance:
- [Step #1](#getting-started-example-step-ingest-csv) &mdash; a CSV file is read from an Amazon S3 bucket and saved into an examples data-container directory using curl.
  The examples directory is first created by using a file-system command.
- [Step #2](#getting-started-example-step-convert-csv-to-nosql-table) &mdash; the ingested file is converted into a NoSQL table by using Spark DataFrames.
- [Step #3](#getting-started-example-step-run-sql-queries) &mdash; a Presto SQL query is run on the NoSQL table.
- [Step #4](#getting-started-example-step-convert-data-to-parquet) &mdash; the ingested CSV file is converted into a Parquet table by using Spark DataFrames.
- [Step #5](#getting-started-example-step-browse-the-examples-dir) &mdash; the examples container directory is browsed by using local and Hadoop file-system commands.
- At the end of the flow, you can optionally [delete](#getting-started-example-deleting-data) the examples directory using a file-system command.

You can find more information about this sample flow in the [Converting a CSV File to a NoSQL Table](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/ingest-n-consume-files/#convert-csv-to-nosql) platform quick-start tutorial.

> **Tip:** You can also browse the files and directories that you write to the "users" container in this tutorial from the platform dashboard: in the side navigation menu, select **Data**, and then select the **users** container from the table. On the container data page, select the **Browse** tab, and then use the side directory-navigation tree to browse the directories. Selecting a file or directory in the browse table displays its metadata.

<a id="getting-started-example-step-ingest-csv"></a>
### Step 1: Ingest a sample CSV file from Amazon S3

Use `curl` to download a sample stocks-data CSV file from the [Iguazio sample data-set](http://iguazio-sample-data.s3.amazonaws.com/) public Amazon S3 bucket.
For additional public data sets, check out [Registry of Open Data on AWS](https://registry.opendata.aws/).

> **NOTE:** All the platform tutorial notebook examples store the data in an **examples** directory in the running-user directory of the predefined "users" container &mdash; **users/&lt;running user&gt;/examples**.
> The running-user directory is automatically created by the Jupyter Notebook service.
> The `V3IO_HOME` environment variable is used to reference the **users/&lt;running user&gt;** directory.
> To save the data to a different root container directory or to a different container, you need to specify the data path in the local file-system commands as `/v3io/<container name>/<data path>`, and in Spark DataFrames or Hadoop FS commands as a fully qualified path of the format `v3io://<container name>/<table path>`.
> For more information, see the [v3io-mount](#v3io-mount) and [running-user directory](#running-user-dir) information in the [Jupyter Notebook Basics](#jupyter-notebook-basics) section of this notebook.

In [5]:
%%sh 
mkdir -p /v3io/${V3IO_HOME}/examples

# Download a sample stocks CSV file from the Iguazio sample data-set Amazon S3 bucket
curl -L "iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv" > /v3io/${V3IO_HOME}/examples/stocks.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  861k  100  861k    0     0   626k      0  0:00:01  0:00:01 --:--:--  627k
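
Before converting the file, it can be useful to check that the download produced the expected columns. The following sketch reads only the header and the first few rows with the standard `csv` module; `peek_csv` is a hypothetical helper, and the commented-out path assumes the platform's `/v3io` mount.

```python
import csv
import itertools

def peek_csv(path, rows=5):
    """Return the header and the first `rows` data rows of a CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list for an empty file
        return header, list(itertools.islice(reader, rows))

# Example call on the platform:
# import os
# header, first_rows = peek_csv(os.path.join("/v3io", os.environ["V3IO_HOME"], "examples", "stocks.csv"))
```

The `csv` module handles quoted fields, so values that contain commas stay in one column.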


<a id="getting-started-example-step-convert-csv-to-nosql-table"></a>
### Step 2: Convert the sample CSV file to a NoSQL table

Read the sample **stocks.csv** file that you downloaded and ingested in the previous step into a Spark DataFrame, and write the data in NoSQL format to a new "stocks_tab" table in the same container directory (**users/&lt;running user&gt;/examples/stocks_tab**). 

> **Note**
> - To use the Iguazio Spark Connector, set the DataFrame's data-source format to `io.iguaz.v3io.spark.sql.kv`.
> - The data path in the Spark DataFrame is specified by using the `V3IO_HOME_URL` environment variable, which is set to `v3io://users/<running user>`.
>   See the [running-user directory](#running-user-dir) information.

In [6]:
import os
from pyspark.sql import SparkSession

# Create a new Spark session
spark = SparkSession.builder.appName("Iguazio data collection and exploration getting-started example").getOrCreate()

file_path = os.path.join(os.getenv('V3IO_HOME_URL'), 'examples')

# Read the sample stocks.csv file into a Spark DataFrame, using the first line of the file as the column names
# (without the "inferSchema" option, Spark reads all columns as strings)
df = spark.read.option("header", "true").csv(os.path.join(file_path, 'stocks.csv'))

# Show the DataFrame data
df.show()

# Write the DataFrame data to a stocks_tab table in the users/<running user>/examples container directory,
# and define the "ISIN" column (attribute) as the table's primary key
df.write.format("io.iguaz.v3io.spark.sql.kv").mode("append").option("key", "ISIN").option("allow-overwrite-schema", "true").save(os.path.join(file_path)+'/stocks_tab/')

+------------+--------+--------------------+------------+--------+----------+----------+-----+----------+--------+--------+--------+------------+--------------+
|        ISIN|Mnemonic|        SecurityDesc|SecurityType|Currency|SecurityID|      Date| Time|StartPrice|MaxPrice|MinPrice|EndPrice|TradedVolume|NumberOfTrades|
+------------+--------+--------------------+------------+--------+----------+----------+-----+----------+--------+--------+--------+------------+--------------+
|CH0038389992|    BBZA|BB BIOTECH NAM.  ...|Common stock|     EUR|   2504244|2018-03-26|08:00|      56.4|    56.4|    56.4|    56.4|         320|             4|
|CH0038863350|    NESR|NESTLE NAM.      ...|Common stock|     EUR|   2504245|2018-03-26|08:00|     63.04|   63.06|      63|   63.06|         314|             3|
|LU0378438732|    C001|COMSTAGE-DAX UCIT...|         ETF|     EUR|   2504271|2018-03-26|08:00|    113.42|  113.42|  113.42|  113.42|         100|             1|
|LU0411075020|    DBPD|XTR.SHORTDA

<a id="getting-started-example-step-run-sql-queries"></a>
### Step 3: Run interactive SQL queries

Use the `%sql` Jupyter magic to run SQL queries on the "stocks_tab" table that was created in the previous step.
(The queries are processed by Presto.)
The example runs a `SELECT` query that reads the first ten table items.

In [7]:
table_path = 'v3io.users."' + os.getenv('V3IO_USERNAME') + '/examples/stocks_tab"'
%sql select * from $table_path limit 10

 * presto://iguazio:***@presto-api-presto.default-tenant.app.dev34.lab.iguazeng.com:443/v3io?protocol=https
Done.


securitydesc,securitytype,time,isin,minprice,date,endprice,numberoftrades,mnemonic,currency,securityid,maxprice,tradedvolume,startprice
AT+S AUSTR.T.+SYSTEMT.,Common stock,08:04,AT0000969985,22.3,2018-03-26,22.4,3,AUS,EUR,2504191,22.4,761,22.3
"SENVION S.A. EUR -,01",Common stock,08:05,LU1377527517,9.95,2018-03-26,9.95,1,SEN,EUR,2506162,9.95,11,9.95
ENVITEC BIOGAS O.N.,Common stock,08:11,DE000A0MVLS8,7.25,2018-03-26,7.25,1,ETG,EUR,2504388,7.25,45,7.25
LY.MSCI AL.CO.WO.UETF CEO,ETF,08:27,FR0011079466,220.65,2018-03-26,220.65,1,LYY0,EUR,2505321,220.65,200,220.65
"ZURICH INSUR.GR.NA.SF0,10",Common stock,08:14,CH0011075394,259.1,2018-03-26,259.1,1,ZFIN,EUR,2504215,259.1,45,259.1
HOCHTIEF AG,Common stock,08:02,DE0006070006,147.9,2018-03-26,147.9,1,HOT,EUR,2505009,147.9,13,147.9
ISHS IV-AUTO.+ROBOTIC.ETF,ETF,08:01,IE00BYZK4552,6.423,2018-03-26,6.423,1,2B76,EUR,2505551,6.423,6,6.423
X(IE)-GERM.MITTEL.MCAP 1D,ETF,08:55,IE00B9MRJJ36,24.665,2018-03-26,24.665,1,XDGM,EUR,2505788,24.665,1102,24.665
IS.DJ CHINA OFFS.50 U.ETF,ETF,08:00,DE000A0F5UE8,47.52,2018-03-26,47.52,1,EXXU,EUR,2504302,47.52,420,47.52
A.SPRINGER SE VNA,Common stock,08:03,DE0005501357,68.0,2018-03-26,68.05,3,SPR,EUR,2504947,68.05,143,68.0


<a id="getting-started-example-step-convert-data-to-parquet"></a>
### Step 4: Convert the data to a Parquet table

Use a Spark DataFrame `write` command to write the data in the Spark DataFrame &mdash; which was created from the CSV file and used to create the NoSQL table in [Step 2](#getting-started-example-step-convert-csv-to-nosql-table) &mdash; to a new **users/&lt;running user&gt;/examples/stocks_prqt** Parquet table.

In [8]:
df.write.mode('overwrite').parquet(os.path.join(file_path)+'/stocks_prqt')

<a id="getting-started-example-step-browse-the-examples-dir"></a>
### Step 5: Browse the example container directory

Use a file-system bash-shell command to list the contents of the **users/&lt;running user&gt;/examples** data-container directory to which all the ingested data in the previous steps were saved.
You should see in this directory the **stocks.csv** file, **stocks_tab** NoSQL table directory, and **stocks_prqt** Parquet table directory that you created in the previous steps.
The following cells demonstrate how to issue the same command using the local file system and using Hadoop FS.

In [9]:
# List the contents of the users/<running user>/examples directory using a local file-system command
!ls -lrt /v3io/${V3IO_HOME}/examples

total 0
drwxrwxr-x 2 iguazio iguazio      0 Apr  4 17:04 stocks_example_tab
-rw-r--r-- 1 iguazio iguazio 882055 Apr  4 17:21 stocks.csv
drwxrwxrwx 2 iguazio iguazio      0 Apr  4 17:23 stocks_tab
drwxr-xr-x 2 iguazio iguazio      0 Apr  4 17:25 stocks_prqt


In [10]:
%%sh

# List the contents of the users/<running user>/examples directory using a Hadoop FS command
hadoop fs -ls ${V3IO_HOME_URL}/examples

Found 4 items
-rw-r--r--   1 iguazio iguazio     882055 2019-04-04 17:21 v3io://users/iguazio/examples/stocks.csv
drwxrwxr-x   - iguazio iguazio          0 2019-04-04 17:04 v3io://users/iguazio/examples/stocks_example_tab
drwxr-xr-x   - iguazio iguazio          0 2019-04-04 17:25 v3io://users/iguazio/examples/stocks_prqt
drwxrwxrwx   - iguazio iguazio          0 2019-04-04 17:23 v3io://users/iguazio/examples/stocks_tab


19/04/04 17:26:19 INFO slf_4j.Slf4jLogger: Slf4jLogger started
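
The same listing can also be produced from Python through the local-FS mount, which is handy when later code needs the entry names. This is a minimal sketch with a hypothetical `list_dir` helper; the commented-out call assumes the `/v3io` mount and the `V3IO_HOME` variable of the platform's services.

```python
from pathlib import Path

def list_dir(path):
    """Return (name, kind) pairs for the entries of `path`, sorted by name."""
    return sorted(
        (entry.name, "dir" if entry.is_dir() else "file")
        for entry in Path(path).iterdir()
    )

# Example call on the platform:
# import os
# for name, kind in list_dir(os.path.join("/v3io", os.environ["V3IO_HOME"], "examples")):
#     print(kind, name)
```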


<a id="getting-started-example-deleting-data"></a>
### Deleting Data

When you are done, you can delete any of the directories or files that you created.
See the instructions in the [Creating and Deleting Container Directories](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/containers/#create-delete-container-dirs) tutorial.
The following example uses a local file-system bash-shell command to delete the entire contents of the **users/&lt;running user&gt;/examples** directory that was created in this example, but not the directory itself.

In [11]:
# Delete all files under my example directory
!rm -rf /v3io/${V3IO_HOME}/examples/*
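
A Python equivalent of this cleanup is sketched below, for cases where the deletion is part of a larger script. `clear_directory` is a hypothetical helper. Unlike the shell glob `*`, it also removes hidden entries (names that start with a dot), so review the target path before calling it.

```python
import os
import shutil

def clear_directory(path):
    """Delete everything inside `path` but keep the directory itself."""
    # Collect the entries first so that nothing is deleted while the directory is being scanned
    with os.scandir(path) as it:
        entries = list(it)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)

# Example call on the platform:
# clear_directory(os.path.join("/v3io", os.environ["V3IO_HOME"], "examples"))
```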

<a id="getting-started-example-release-spark-resources"></a>
### Releasing Spark Resources

When you are done, run the following command to stop your Spark session and release its computation and memory resources:

In [12]:
spark.stop()