# Loading data from external SQL databases
This notebook demonstrates how to load data from external SQL databases.<br>
It covers two methods:
1. A generic example of reading data in chunks using the Python library SQLAlchemy<br>
2. A specific example of reading a table from MySQL as a bulk operation<br>

## Reading from a MySQL database and writing to a NoSQL table in chunks

This example uses SQLAlchemy, a Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.<br>
It can be used to read data from various databases, including MySQL, PostgreSQL, Oracle, MSSQL, and SQLite.<br>
<br>
Below is an example of working with MySQL.<br>
To work with a different database, change the engine setting (the connection URL passed to `create_engine`).<br>
For more details, see https://docs.sqlalchemy.org/en/latest/core/engines.html#mysql<br>
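
As a rough illustration of what changing the engine setting involves, the sketch below parses a few connection URLs with SQLAlchemy's `make_url` helper, without opening any connection. Only the MySQL URL comes from this notebook. The PostgreSQL and SQLite URLs (host `db.example.com`, user `reader`, database `analytics`, file `local.db`) are hypothetical placeholders.

```python
from sqlalchemy.engine.url import make_url

# The MySQL URL is the one used in this notebook; the other two are
# hypothetical placeholders that only illustrate the URL format.
urls = {
    'mysql': 'mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam',
    'postgresql': 'postgresql+psycopg2://reader@db.example.com:5432/analytics',
    'sqlite': 'sqlite:///local.db',
}

# make_url() only parses the string; it does not import a driver or connect.
parsed = {name: make_url(url) for name, url in urls.items()}
for name, url in parsed.items():
    print(name, url.drivername, url.host, url.port, url.database)
```

Passing any of these strings to `create_engine()` selects the corresponding dialect and driver, provided the driver package is installed.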

The example below uses a public MySQL database called Rfam (https://rfam.readthedocs.io/en/latest/database.html).<br>
The idea is to read the data in chunks and write each chunk to a NoSQL table in iguazio.<br>
Working in chunks is useful for large datasets that do not fit into available memory.<br>

In [1]:
import pandas as pd
from sqlalchemy.engine import create_engine
import v3io_frames as v3f
import os
 
client = v3f.Client('framesd:8081', container='users')
 
engine = create_engine('mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam')
 
query = """
select rfam_acc,rfam_id,auto_wiki,description,author,seed_source FROM family
"""
tablename = os.getenv('V3IO_USERNAME') + '/examples/family_tab'
# Read the MySQL query result in chunks and write each chunk to a NoSQL table
all_df = pd.read_sql(query, engine, chunksize=100000)
for df in all_df:
    df = df.reset_index()
    out = client.write('kv', tablename, df)

In [None]:
table_path = 'v3io.users."' + os.getenv('V3IO_USERNAME') + '/examples/family_tab"'
%sql select * from $table_path limit 10

## Reading from MySQL as a bulk operation using a pandas DataFrame

### Setting the database connection and running the query

Pandas has native support for reading from and writing to various SQL databases.<br>
First create a database connection using the database-specific library, then use `pd.read_sql()` or `pd.read_sql_query()` to read the database table into a DataFrame. Once you have a DataFrame object, you can manipulate it and store it in the iguazio database or in time-series tables.
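
The connect, query, and manipulate pattern can be sketched without any external service. The example below is a minimal illustration that uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the external database. The table contents and the names `conn_demo` and `demo_df` are made up for this sketch.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the external database (an assumption for
# illustration only); the rows below are made-up sample values.
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute('CREATE TABLE family (rfam_acc TEXT, rfam_id TEXT)')
conn_demo.executemany(
    'INSERT INTO family VALUES (?, ?)',
    [('ACC001', 'alpha'), ('ACC002', 'beta'), ('ACC003', 'gamma')],
)
conn_demo.commit()

# Read the query result into a DataFrame, then manipulate it like any other.
demo_df = pd.read_sql_query('SELECT rfam_acc, rfam_id FROM family', conn_demo)
demo_df['rfam_id_upper'] = demo_df['rfam_id'].str.upper()
print(demo_df)
conn_demo.close()
```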

The following example demonstrates working with MySQL.

Set the database connection. This example uses a public MySQL database called Rfam (https://rfam.readthedocs.io/en/latest/database.html).<br>
Then run the SQL query and store the result in a DataFrame.

In [2]:
import os
import pymysql
import pandas as pd 

# Connection settings can be overridden through environment variables
conn = pymysql.connect(
    host=os.getenv('DB_HOST', 'mysql-rfam-public.ebi.ac.uk'),
    port=4497,
    user=os.getenv('DB_USER', 'rfamro'),
    password=os.getenv('DB_PASSWORD', ''),
    database=os.getenv('DB_NAME', 'Rfam'),
    charset='utf8mb4')

df = pd.read_sql_query("select rfam_acc,rfam_id,auto_wiki,description,author,seed_source FROM family",
    conn) 

df.tail(10)

| | rfam_acc | rfam_id | auto_wiki | description | author | seed_source |
|---|---|---|---|---|---|---|
| 3006 | RF03106 | RT-11 | 2572 | RT-11 RNA | Weinberg Z | Weinberg Z |
| 3007 | RF03107 | saliva-tongue-1 | 2696 | saliva-tongue-1 RNA | Weinberg Z | Weinberg Z |
| 3008 | RF03108 | Methylosinus-1 | 2697 | Methylosinus-1 RNA | Weinberg Z | Weinberg Z |
| 3009 | RF03109 | Thermales-rpoB | 2698 | Thermales-rpoB RNA | Weinberg Z | Weinberg Z |
| 3010 | RF03110 | throat-1 | 2699 | throat-1 RNA | Weinberg Z | Weinberg Z |
| 3011 | RF03111 | Zeta-pan | 2700 | Zeta-pan RNA | Weinberg Z | Weinberg Z |
| 3012 | RF03112 | Staphylococcus-1 | 2701 | Staphylococcus-1 RNA | Weinberg Z | Weinberg Z |
| 3013 | RF03113 | Poribacteria-1 | 2702 | Poribacteria-1 RNA | Weinberg Z | Weinberg Z |
| 3014 | RF03114 | RT-1 | 2572 | RT-1 RNA | Weinberg Z | Weinberg Z |
| 3015 | RF03115 | KDPG-aldolase | 2703 | KDPG-aldolase RNA | Weinberg Z | Weinberg Z |


### Writing the results to iguazio Key/Value Database
The following section demonstrates establishing a connection with the iguazio high-performance DataFrames service (v3io_frames) and writing the data from the SQL database.<br>
The iguazio database supports multiple models (KV/NoSQL, time-series, stream, object). The model is specified in the first argument of the call. Read more in: `TBD Frames link`

In [3]:
import pandas as pd
import v3io_frames as v3f
import os
client = v3f.Client('framesd:8081', container='users')
tablename = os.getenv('V3IO_USERNAME') + '/examples/family'

Ingest the data into the database using the NoSQL API.

In [4]:
client.write('kv', tablename, df)

### Using Pandas streaming capabilities to copy large datasets 
Many pandas inputs and outputs, including SQL, CSV, and iguazio, support chunking. With chunking, the driver forms a continuous iterator and data is read or written chunk by chunk.
When you specify `chunksize` (the number of rows per chunk), the reader returns a DataFrame iterator, which can be passed as is to a DataFrame writer such as iguazio frames.
The following example streams data from MySQL to the iguazio NoSQL API.

In [5]:
# Use a separate variable so that `tablename` still refers to the first table
tablename2 = os.getenv('V3IO_USERNAME') + '/examples/family2'
df_iter = pd.read_sql("select rfam_acc,rfam_id,auto_wiki,description,author,seed_source  FROM family", conn, chunksize=1000)
client.write('kv', tablename2, df_iter)
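
To see what the `chunksize` iterator yields, here is a minimal self-contained sketch. It assumes an in-memory SQLite database (via the built-in `sqlite3` module) as a stand-in for MySQL, with a made-up `items` table of 2,500 rows. The names `demo_conn`, `chunk_iter`, `chunk_sizes`, and `total_value` are illustrative only.

```python
import sqlite3
import pandas as pd

# In-memory SQLite with 2,500 made-up rows stands in for a large table.
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE items (id INTEGER, value REAL)')
demo_conn.executemany(
    'INSERT INTO items VALUES (?, ?)',
    [(i, i * 0.5) for i in range(2500)],
)
demo_conn.commit()

# With chunksize set, read_sql returns an iterator of DataFrames
# instead of a single DataFrame.
chunk_iter = pd.read_sql('SELECT id, value FROM items', demo_conn, chunksize=1000)

chunk_sizes = []
total_value = 0.0
for chunk in chunk_iter:
    # pandas materializes one chunk (at most 1000 rows) per iteration.
    chunk_sizes.append(len(chunk))
    total_value += chunk['value'].sum()

print(chunk_sizes)
print(total_value)
demo_conn.close()
```

Aggregating per chunk, as done here with the running sum, is one way to process a table without ever building the full DataFrame.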

## Remove Data

In [6]:
client.delete('kv', tablename)
client.delete('kv', tablename2)