# Getting started
## For Data collection and exploration
1. [Overview](#Overview)
2. [Platform's built-in tools for data collection and exploration](#Platform's-built-in-tools-for-data-collection-and-exploration)
3. [Data collection](#Data-collection)
   1. [Importing from an external database](#Importing-data-from-external-database)
   2. [Reading from S3](#Reading-files-from-S3-and-store-it-in-Iguazio-file-system)
   3. [Reading from an external stream](#Reading-data-from-external-streaming-engine-on-an-ongoing-basis)
   4. [Ingesting data using the REST API](#Injesting-data-using-RestAPI)
4. [Exploration](#Exploration)
   1. [Using Pandas DataFrames and the Iguazio Frames library](#Using-Pandas-dataframe-and-Iguazio-frames-library)
   2. [Using Spark](#Using-spark)
   3. [Using SQL](#Using-SQL)
5. [Getting started example](#Getting-Started-Example)
6. [Demo applications](#Demo-Applications)
   1. [Stocks demo](#Stocks-demo)
   2. [Network operation](#Network-operation)

# Overview

Iguazio delivers a data platform that speeds up the delivery of time-sensitive ("real-time"), data-driven, and AI applications. It provides a fully integrated and secure data science PaaS with:

- A data science workbench (Jupyter with integrated analytics engines)
- Managed services over Kubernetes (Presto, Prometheus, Grafana)
- A fast data layer that supports SQL, NoSQL, and time-series workloads
- A real-time serverless functions framework (Nuclio)

Customers can ingest, enrich, analyze, and serve data, all in one simple, fast, and secure platform. The platform accelerates the deployment of a variety of analytics services, eliminating data-pipeline complexities and reducing the time to market for new applications with machine learning and AI capabilities.


This notebook contains code examples for common tasks to help you get started with the Iguazio platform.<br>
It provides guidance on various methods of collecting and exploring data, as well as a simple getting-started flow.<br>
The Data collection and Exploration sections link to detailed notebooks on each topic (for example, working with Spark).




## Notebook basics

The JupyterLab interface has a file browser on the left and content tabs on the right. The tabs let you view, edit, and run notebooks, interactive shells, and common file types.

To create a new code or shell tab, click the `+` icon in the upper-left corner or use the top menu.

Under the home (root) directory of Jupyter you will find the following directories:

- `v3io`: provides access to the shared data containers on the Iguazio platform. Data can be ingested and updated through multiple APIs, and the same data can be retrieved and queried through multiple tools and services. The administrator can restrict a user's access to the `v3io` data containers through the data-container access policies. The data containers under the Jupyter "home/v3io" directory can also be viewed in the Iguazio dashboard under the "Data" tab.
- `Getting started`: contains getting-started tutorials for basic operations, mainly data collection and analytics (see below for more details).
- `Demos`: contains built-in demo applications (see below for more details).

Iguazio has a default data container called "users" where customers can store their data. The best practice is to create a folder with the name of the user and use it as the user's "home" directory. The following environment variables are available:<br>
`$V3IO_HOME` points to /users/&lt;running user&gt; <br>
`$V3IO_USERNAME` = &lt;running user&gt; <br>
`$V3IO_HOME_URL` = v3io://users/&lt;running user&gt; <br>
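
As a minimal sketch, the variables above can be read from Python to build the path of a per-user working directory. The `examples_dir` helper and the `/v3io` mount-point default are illustrative assumptions based on the paths used in the shell cells of this notebook; they are not part of any platform API.

```python
import os


def examples_dir(env=None, mount="/v3io"):
    """Return the local path of the running user's "examples" directory.

    Assumes V3IO_HOME holds the home directory relative to the mount point
    (for example "users/<running user>"), as the shell cells in this
    notebook do when they use /v3io/${V3IO_HOME}/examples.
    """
    env = os.environ if env is None else env
    home = env.get("V3IO_HOME")
    if not home:
        raise KeyError("V3IO_HOME is not set")
    # Strip slashes so that "users/x" and "/users/x/" give the same result
    return "/".join([mount.rstrip("/"), home.strip("/"), "examples"])
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try outside the platform, for example `examples_dir({"V3IO_HOME": "users/iguazio"})`.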

# Platform's built-in tools for data collection and exploration

Iguazio provides various ways of collecting data from different sources, such as databases, files, and streaming engines.<br>
Data can be collected as a one-time operation (for example, from a Jupyter or Zeppelin notebook) or on an ongoing basis by using Nuclio functions.<br>
The examples below link to notebooks that explain how to import data into the system from Jupyter.<br>


## Data collection

### Importing data from external database
In this notebook you'll learn how to collect data from different databases, such as MySQL, Oracle, and PostgreSQL.<br>
[Reading from external databases](ReadingFromExternalDB.ipynb)

### Reading files from S3 and store it in Iguazio file system
A file can be imported into the system with a simple `curl` command.<br>
In this case we take a CSV file from the Iguazio public sample-data bucket and store it under the home directory of the running user.

In [None]:
!mkdir -p /v3io/${V3IO_HOME}/examples

In [None]:
!curl -L "iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv" > /v3io/${V3IO_HOME}/examples/stocks.csv
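
The same download can also be sketched in Python with only the standard library, which may be handy when the step has to run inside a script or function and not in a shell cell. `download_file` is a hypothetical helper, and the explicit `https://` scheme in the usage comment is an assumption (the `curl` command above omits the scheme).

```python
import os
import shutil
import urllib.request


def download_file(url, target_path):
    """Stream the content at `url` into `target_path`, creating parent directories."""
    os.makedirs(os.path.dirname(target_path) or ".", exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target_path, "wb") as out:
        # Copy in chunks so that large files are not loaded into memory at once
        shutil.copyfileobj(response, out)
    return target_path


# Example usage on the platform (not executed here):
# download_file(
#     "https://iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv",
#     "/v3io/" + os.environ["V3IO_HOME"] + "/examples/stocks.csv",
# )
```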

### Reading files from S3 and store it as a NoSQL table
In this notebook you'll learn how to import data using a pandas DataFrame and how to store it in a NoSQL table.<br>
[Reading from S3 and writing to a NoSQL table](frames.ipynb)

### Reading data from external streaming engine on an ongoing basis

To read data from an external streaming engine (for example Kafka, Kinesis, or RabbitMQ), you need to create a Nuclio function that listens to the stream <br>
and writes the data to an Iguazio NoSQL or time-series table.<br>
Step 1: Go to the Functions page and create a project, or use an existing one.<br>
Step 2: Click Create function.<br>
Step 3: Select a template (e.g. kafka to tsdb) and fill in the properties.<br>
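
For orientation, the following is a minimal sketch of what the Python code of such a function might look like. It relies on the Nuclio convention of a `handler(context, event)` entry point, with the message payload in `event.body` and a logger on `context`. The `parse_message` helper and the JSON message layout are hypothetical. The write to the NoSQL or time-series table is deliberately left out, because it depends on the template you select.

```python
import json


def parse_message(body):
    """Decode a stream message body (bytes, str, or already-parsed dict) into a dict."""
    if isinstance(body, (bytes, bytearray)):
        body = body.decode("utf-8")
    if isinstance(body, str):
        body = json.loads(body)
    return body


def handler(context, event):
    """Nuclio entry point: called once for every message read from the stream."""
    record = parse_message(event.body)
    context.logger.info("received a stream record")
    # A real function would write `record` to a NoSQL or time-series table here.
    # That call is omitted because it depends on the template and client library.
    return json.dumps(record)
```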

<a id='Injesting-data-using-RestAPI'></a>
### Ingesting data using the REST API

Users can ingest and fetch data using the REST API.<br>
To get the REST endpoint URL, go to the Services screen and look for the value under the API column of the WebAPI service.<br>
You can then use this REST endpoint to execute HTTP requests that ingest or access data in Iguazio.<br>
For detailed documentation, see https://www.iguazio.com/docs/reference/latest-release/api-reference/web-apis/
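
The sketch below shows one way to assemble such a request with the Python standard library. It only builds the request and does not send it. Everything specific to the web API is an assumption to check against the reference linked above: the `X-v3io-function` and `X-v3io-session-key` header names, the `PutItem` operation, the `PUT` method, and the typed attribute format in the payload. The endpoint URL, table path, and key are hypothetical.

```python
import json
import urllib.request


def build_webapi_request(webapi_url, container, resource, operation, payload,
                         session_key, method="PUT"):
    """Build (but do not send) an HTTP request for a web-API operation.

    Assumptions to verify against the web-API reference: the operation name
    goes in the X-v3io-function header, the access key in the
    X-v3io-session-key header, and PUT is an accepted method.
    """
    url = "/".join([webapi_url.rstrip("/"), container.strip("/"), resource.strip("/")])
    headers = {
        "Content-Type": "application/json",
        "X-v3io-function": operation,       # assumed header name
        "X-v3io-session-key": session_key,  # assumed header name
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method=method)


# Hypothetical values, for illustration only
request = build_webapi_request(
    "https://webapi.example.com",
    "users",
    "iguazio/examples/stocks_example_tab/RIB",
    "PutItem",
    {"Item": {"mnemonic": {"S": "RIB"}, "endprice": {"N": "25.04"}}},
    session_key="<access key>",
)
# Sending is left to the caller, for example: urllib.request.urlopen(request)
```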

<a id='exploration'></a>
# Exploration 

Once the data resides in Iguazio, users can leverage various techniques and tools to explore and analyze it.<br>
Users can choose their favorite open-source tools for working with the data, for example Spark, Presto for SQL, or pandas DataFrames.<br>
All the analytics services are integrated with Jupyter, so users typically run the exploration phase from a Jupyter notebook, running Spark jobs or SQL queries <br>
on the same dataset without needing to move the data.<br>
Iguazio's multi-model data layer enables users to store and analyze different data types, such as key-value, time-series, streaming, files, and objects, and to <br>
leverage different tools for accessing and manipulating the data from a single interface.<br>
The notebooks below show a couple of ways of exploring data from Jupyter with different tools.

### Using Pandas dataframe and Iguazio frames library

Iguazio provides a library (Frames) for reading and writing data in its NoSQL, streaming, and time-series formats through a single DataFrame-based interface.<br>
The resulting DataFrame can then be used with pandas for further analysis.<br>
[Reading and writing data using Frames](frames.ipynb)


### Using spark
Spark is a distributed computing framework for analytics. Users can run distributed Spark jobs on the Iguazio cluster.<br>
Spark users can access files, tables, or streams stored on the Iguazio data platform through the native Spark DataFrame interfaces.<br>
[Analyze data with Spark](SparkSQLAnalytics.ipynb)

### Using SQL
Users can run SQL statements (`SELECT` only) on top of Iguazio NoSQL tables.<br>
To do so, prefix the SQL statement with the Jupyter `%sql` magic.<br>
In this example, as a preparation step, we take the stocks CSV file and write it to an Iguazio NoSQL table.<br>
Once the data resides in a NoSQL table, we simply run a SQL statement.<br>
Under the hood, the SQL statement runs via Presto, a distributed SQL engine designed from the ground up for fast analytics queries.<br>
Note that Iguazio SQL supports standard ANSI SQL semantics.

In [6]:
# Write the CSV file that was downloaded in the "Reading files from S3" section to a NoSQL table using Frames.
# Run that section first so that stocks.csv exists.
import os
import pandas as pd
import v3io_frames as v3f

client = v3f.Client('framesd:8081', container='users')
username = os.getenv('V3IO_USERNAME')

df = pd.read_csv(os.path.join('/v3io/users', username, 'examples', 'stocks.csv'))

# Table path, relative to the "users" container
tablename = os.path.join(username, 'examples', 'stocks_example_tab')
client.write('kv', tablename, df)

In [7]:
table_path = 'v3io.users."' + os.getenv('V3IO_USERNAME') + '/examples/stocks_example_tab"'
%sql select * from $table_path limit 10

 * presto://iguazio:***@presto-api-presto.default-tenant.app.dev34.lab.iguazeng.com:443/v3io?protocol=https
Done.


| securitydesc | securitytype | time | isin | minprice | date | endprice | numberoftrades | mnemonic | currency | securityid | maxprice | tradedvolume | startprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ETFS COM.SEC.DZ06/UN.NICK | ETC | 08:04 | DE000A0KRJ44 | 10.133 | 2018-03-26 | 10.133 | 1 | OD7M | EUR | 2504364 | 10.133 | 1000 | 10.133 |
| RIB SOFTWARE SE NA EO 1 | Common stock | 08:59 | DE000A0Z2XN6 | 24.98 | 2018-03-26 | 25.04 | 9 | RIB | EUR | 2504436 | 25.04 | 4255 | 24.98 |
| PAYPAL HDGS INC.DL-,0001 | Common stock | 08:42 | US70450Y1038 | 62.77 | 2018-03-26 | 62.77 | 1 | 2PP | EUR | 2506551 | 62.77 | 40 | 62.77 |
| ISHSIV-E.MSCI USA VAL.FA. | ETF | 08:34 | IE00BD1F4M44 | 5.224 | 2018-03-26 | 5.224 | 2 | QDVI | EUR | 2505410 | 5.224 | 50008 | 5.224 |
| XTR.EU.ST.50 SH.DA.SW. 1C | ETF | 08:46 | LU0292106753 | 14.722 | 2018-03-26 | 14.722 | 1 | DXSP | EUR | 2506403 | 14.722 | 3650 | 14.722 |
| ISHSVII-MSCI USA SC DL AC | ETF | 08:11 | IE00B3VWM098 | 254.05 | 2018-03-26 | 254.05 | 2 | SXRG | EUR | 2505637 | 254.55 | 7 | 254.55 |
| ETFS COM.SEC.DZ06/UN.IDX | ETC | 08:52 | DE000A0KRKB8 | 3.844 | 2018-03-26 | 3.844 | 1 | OD7U | EUR | 2504370 | 3.844 | 3000 | 3.844 |
| ISHSII-JPM DL EM BD DLDIS | ETF | 08:27 | IE00B2NPKV68 | 88.76 | 2018-03-26 | 88.76 | 1 | IUS7 | EUR | 2505604 | 88.76 | 340 | 88.76 |
| DK DAX | ETF | 08:59 | DE000ETFL011 | 109.32 | 2018-03-26 | 109.32 | 1 | EL4A | EUR | 2504258 | 109.32 | 4 | 109.32 |
| ISV-E.C.B.I.R.H.U.ETF EOD | ETF | 08:02 | IE00B6X2VY59 | 97.134 | 2018-03-26 | 97.134 | 1 | IS0Y | EUR | 2505745 | 97.134 | 218 | 97.134 |


# Getting Started Example

Follow the tutorial by running the code cells in order of appearance.

> **Tip:** You can also browse the files and directories that you write to the "users" container in this tutorial from the platform dashboard: in the side navigation menu, select **Data**, and then select the **users** container from the table. On the container data page, select the **Browse** tab, and then use the side directory-navigation tree to browse the directories. Selecting a file or directory in the browse table displays its metadata.


## Step 1: Load a sample CSV file from S3
Use `curl` to download a sample stocks data file from the Iguazio public S3 bucket.<br>
This file belongs to the Deutsche Börse public dataset.<br>
For additional public datasets, see https://registry.opendata.aws/ <br>
<br>
Note that each user in the system has their own home directory (similar to a Linux home directory), which resides in a default container called "users".<br>
The environment variable `V3IO_HOME` points to the home directory of the running user.<br>
All the notebook examples store their data under the "examples" directory in the user's home directory.<br>
Iguazio's best practice is to use the user's home directory for keeping personal experiments and data in a private workspace.<br>
However, to work in other folders and share data with other users, you need to specify the exact path using the convention /v3io/"data container name"/"path".<br>
`v3io` is the name of the Iguazio data-source library; it is used to define Iguazio as the storage layer for the read or write operation.<br>
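
The two path styles used in this notebook, the local mount (`/v3io/<container>/<path>`) and the Spark/Hadoop URL (`v3io://<container>/<path>`), differ only in their prefix. As a small sketch, the hypothetical helpers below convert between them. The `/v3io` mount point is taken from the cells in this notebook, and the helper names are illustrative, not platform functions.

```python
def to_v3io_url(local_path, mount="/v3io"):
    """Convert /v3io/<container>/<path> to v3io://<container>/<path>."""
    prefix = mount.rstrip("/") + "/"
    if not local_path.startswith(prefix):
        raise ValueError("not a path under " + prefix)
    return "v3io://" + local_path[len(prefix):]


def to_local_path(url, mount="/v3io"):
    """Convert v3io://<container>/<path> to /v3io/<container>/<path>."""
    scheme = "v3io://"
    if not url.startswith(scheme):
        raise ValueError("not a v3io:// URL")
    return mount.rstrip("/") + "/" + url[len(scheme):]
```

For example, `to_v3io_url("/v3io/users/iguazio/examples/stocks.csv")` gives the URL form that Spark and `hadoop fs` expect for the same file.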


In [6]:
%%sh 
mkdir -p /v3io/${V3IO_HOME}/examples

# Download a sample stocks file from Iguazio demo bucket in S3
curl -L "iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv" > /v3io/${V3IO_HOME}/examples/stocks.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  861k  100  861k    0     0   574k      0  0:00:01  0:00:01 --:--:--  575k


## Step 2: Convert the sample CSV file to a NoSQL table

Read the sample stocks.csv file that you downloaded in Step 1 into a Spark DataFrame, and write the data in NoSQL format to a new stocks_tab table.

Note: To use the Iguazio Spark Connector, set the data-source format to "io.iguaz.v3io.spark.sql.kv".<br>
`V3IO_HOME_URL` is an environment variable that points to the home directory of the user in Spark/Hadoop URL format.

In [None]:
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Iguazio getting started").getOrCreate()

# URL of the examples directory under the running user's home directory
file_path = os.getenv('V3IO_HOME_URL') + '/examples'

# Read the sample stocks.csv file into a Spark DataFrame, using the first row as the header.
# The inferSchema option is not set, so Spark reads every column as a string.
df = spark.read.option("header", "true").csv(file_path + '/stocks.csv')

# Show the DataFrame data
df.show()

# Write the DataFrame data to a stocks_tab table under the "users" container and define the "ISIN" column as the key
(df.write.format("io.iguaz.v3io.spark.sql.kv")
    .mode("append")
    .option("key", "ISIN")
    .option("allow-overwrite-schema", "true")
    .save(file_path + '/stocks_tab/'))


## Step 3: Run interactive SQL queries

In [None]:
table_path = 'v3io.users."' + os.getenv('V3IO_USERNAME') + '/examples/stocks_tab"'
%sql select * from $table_path limit 10

## Step 4: Convert the stocks_tab data to a Parquet file

In [6]:
df.write.mode('overwrite').parquet(file_path + '/stocks_prqt')

## Step 5: Display the content of the example container directory
Use `ls` on the local file-system mount, or `hadoop fs`, to list the contents of the examples directory under the running user's home directory in the "users" container, where all the example files are located.<br>
You should see in this directory the stocks.csv file and the stocks_tab and stocks_prqt table directories.
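
If you prefer to stay in Python, a listing can also be sketched with the standard library, because the containers are mounted as a regular directory tree. `list_directory` is a hypothetical helper. Reporting a size of 0 for directories is a simplification that mirrors the listings in this notebook.

```python
import os


def list_directory(path):
    """Return (name, kind, size_in_bytes) tuples for the entries in `path`, sorted by name."""
    entries = []
    with os.scandir(path) as it:
        for entry in it:
            kind = "dir" if entry.is_dir() else "file"
            # Only files report a size; directories are listed with size 0
            size = entry.stat().st_size if kind == "file" else 0
            entries.append((entry.name, kind, size))
    return sorted(entries)


# Example usage on the platform (not executed here):
# for name, kind, size in list_directory("/v3io/" + os.environ["V3IO_HOME"] + "/examples"):
#     print(kind, size, name)
```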

In [6]:
!ls -lrt /v3io/${V3IO_HOME}/examples

total 0
drwxrwxrwx 2 51 nogroup      0 Feb 26 08:54 stocks_tab
drwxr-xr-x 2 51 nogroup      0 Feb 26 08:55 stocks_tab.parquet
-rw-r--r-- 1 51 nogroup 882055 Feb 26 09:27 stocks.csv
drwxrwxrwx 2 51 nogroup      0 Feb 26 09:27 mytable
drwxrwxrwx 2 51 nogroup      0 Feb 26 09:27 weather
drwxrwxrwx 2 51 nogroup      0 Feb 26 09:27 cars
drwxr-xr-x 2 51 nogroup      0 Feb 26 09:31 stocks_prqt


In [7]:
%%sh

# List the files and directories in the running user's examples directory by using hadoop fs
hadoop fs -ls ${V3IO_HOME_URL}/examples

Found 7 items
drwxrwxrwx   - 51 nogroup          0 2019-02-26 09:27 v3io://users/iguazio/examples/cars
drwxrwxrwx   - 51 nogroup          0 2019-02-26 09:27 v3io://users/iguazio/examples/mytable
-rw-r--r--   1 51 nogroup     882055 2019-02-26 09:27 v3io://users/iguazio/examples/stocks.csv
drwxr-xr-x   - 51 nogroup          0 2019-02-26 09:31 v3io://users/iguazio/examples/stocks_prqt
drwxrwxrwx   - 51 nogroup          0 2019-02-26 08:54 v3io://users/iguazio/examples/stocks_tab
drwxr-xr-x   - 51 nogroup          0 2019-02-26 08:55 v3io://users/iguazio/examples/stocks_tab.parquet
drwxrwxrwx   - 51 nogroup          0 2019-02-26 09:27 v3io://users/iguazio/examples/weather


19/02/26 09:31:22 INFO slf_4j.Slf4jLogger: Slf4jLogger started


## Remove Data

In [8]:
# Delete all files under my example directory
!rm -rf /v3io/${V3IO_HOME}/examples/*
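
As an alternative sketch, the cleanup can be done from Python with an explicit check on the target directory. This avoids expanding an unexpected path if `V3IO_HOME` is not set. `clear_directory` is a hypothetical helper. Unlike `rm -rf <dir>/*`, it also removes hidden entries.

```python
import os
import shutil


def clear_directory(path):
    """Delete everything inside `path` but keep the directory itself.

    Returns the number of top-level entries that were removed.
    """
    if not os.path.isdir(path):
        return 0
    removed = 0
    # Materialize the listing first so that deletion does not disturb iteration
    for entry in list(os.scandir(path)):
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)
        removed += 1
    return removed


# Example usage on the platform (not executed here):
# clear_directory("/v3io/" + os.environ["V3IO_HOME"] + "/examples")
```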

To release the compute and memory resources taken by Spark, we recommend running the following command:

In [9]:
spark.stop()

# Demo Applications

The platform comes with several end-to-end application demos.<br>

### Stocks demo
The application demonstrates how to read stocks data, analyze the market sentiment on <br>
specific stocks in real time, and store the results in the Iguazio database for generating reports and analytics.<br>
[Stocks demo](../demos/stocks/read_stocks.ipynb)

### Network operation
The application demonstrates failure prediction for network devices.<br>
It demonstrates how to build, train, and deploy a machine learning model for predictive analytics.<br>
[Network operation demo](../demos/netops/generator.ipynb)