# Access a Database with Python - Iris Dataset

The Iris dataset is a popular dataset, especially in the Machine Learning community. It records four measurements (sepal and petal length and width) for 150 Iris flowers, 50 from each of 3 species, together with the species label.
It is often used to introduce Machine Learning classification algorithms.

First let's download the dataset in `SQLite` format from Kaggle:

<https://www.kaggle.com/uciml/iris/>

Download `database.sqlite` and save it in the `data/iris` folder. The code below reads the file from `C:/ml/iris`; adjust that path to wherever you saved it.

<p><img   src="https://upload.wikimedia.org/wikipedia/commons/4/49/Iris_germanica_%28Purple_bearded_Iris%29%2C_Wakehurst_Place%2C_UK_-_Diliff.jpg" alt="Iris germanica (Purple bearded Iris), Wakehurst Place, UK - Diliff.jpg" height="145" width="114"></p>

<p><br> From <a href="https://commons.wikimedia.org/wiki/File:Iris_germanica_(Purple_bearded_Iris),_Wakehurst_Place,_UK_-_Diliff.jpg#/media/File:Iris_germanica_(Purple_bearded_Iris),_Wakehurst_Place,_UK_-_Diliff.jpg">Wikimedia</a>, by <a href="//commons.wikimedia.org/wiki/User:Diliff" title="User:Diliff">Diliff</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="http://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=33037509">Link</a></p>

First let's check that the SQLite database file is available and display an error message if it is not (**`assert`** checks that the expression is `True`; otherwise it raises an **`AssertionError`** with the provided error message string):

In [2]:
# list the contents of the data folder
import os
data_iris_folder_content = os.listdir("C:/ml/iris")

In [3]:
# no error message displayed = file is there
error_message = "Error: sqlite file not available, check instructions above to download it"
assert "database.sqlite" in data_iris_folder_content, error_message
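As an alternative to the `assert` check above, here is a minimal sketch that tests for the file with `pathlib` and raises `FileNotFoundError`. The helper name `require_file` is hypothetical, and the demo uses a temporary folder rather than the real `C:/ml/iris` path.

```python
import tempfile
from pathlib import Path


def require_file(folder, filename):
    """Return the full path of filename inside folder, or raise FileNotFoundError."""
    path = Path(folder) / filename
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found, check the download instructions")
    return path


# Demo on a temporary folder that stands in for the real data folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "database.sqlite").touch()  # create an empty placeholder file
    found = require_file(tmp, "database.sqlite")
    print(found.name)
```

One reason to prefer an explicit exception is that `assert` statements are skipped when Python runs with the `-O` option, so a check written with `assert` can silently disappear.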

## Access the Database with the sqlite3 Package

We can use the `sqlite3` package from the Python standard library to connect to the `sqlite` database:

In [4]:
import sqlite3

conn = sqlite3.connect('C:/ml/iris/database.sqlite')
cursor = conn.cursor()
type(cursor)

sqlite3.Cursor

A **`sqlite3.Cursor`** object is our interface to the database, mostly through the **`execute`** method, which allows us to run any `SQL` query on the database.

First of all, we can get a list of all tables in the database by reading the column **`name`** from the **`sqlite_master` metadata table** with:

    SELECT name FROM sqlite_master
    
The **`execute`** method returns an iterator that can be used in a `for` loop to print each row.
* `cursor.execute('SQL statement')` on its own returns an iterator, NOT a list of results, so we have to loop over that iterator (or fetch from it) to see the rows

In [5]:
# print each row (a tuple) returned by a SQL query on the sqlite_master table, run through our cursor
#   - the cursor belongs to the connection created with sqlite3.connect()
for row in cursor.execute("SELECT name FROM sqlite_master"):
    print(row)

('Iris',)


So we only have 1 table.

A shortcut to directly execute a query and gather the results is the **`fetchall`** method:

In [6]:
cursor.execute("SELECT name FROM sqlite_master").fetchall()

[('Iris',)]

**Notice**: this way of finding the available tables in a database is *specific to **`sqlite`***.

Other databases like `MySQL` or `PostgreSQL` have different syntax.
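In SQLite, `sqlite_master` also lists objects such as indexes and views, so a query for table names is usually restricted with `type = 'table'`. The sketch below illustrates this on a throwaway in-memory database; the connection name `conn_demo` and the index `idx_species` are made up for the example and are not part of the Kaggle file.

```python
import sqlite3

# In-memory database used only for illustration
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE Iris (Id INTEGER PRIMARY KEY, Species TEXT)")
conn_demo.execute("CREATE INDEX idx_species ON Iris (Species)")

# Every object recorded in sqlite_master (tables and indexes here)
all_names = [row[0] for row in conn_demo.execute(
    "SELECT name FROM sqlite_master ORDER BY name")]

# Only the tables
table_names = [row[0] for row in conn_demo.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

print(all_names)
print(table_names)
conn_demo.close()
```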

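We can execute standard `SQL` queries on the database. Because SQL is standardized, most of the commands below work on other databases with little or no change, although each database has its own dialect (for example, `LIMIT` is not supported by every database).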

If you need to connect to another database, you would use another package instead of `sqlite3`, for example:

* [MySQL Connector](https://dev.mysql.com/doc/connector-python/en/) for MySQL
* [Psycopg](http://initd.org/psycopg/docs/install.html) for PostgreSQL
* [pymssql](http://pymssql.org/en/stable/) for Microsoft SQL Server

Then you'd connect to the database using a specific host, port and authentication credentials, and execute essentially the same `SQL` statements.

Let's take a look for example at the first 3 rows in the Iris table:

In [9]:
sample_data = cursor.execute("SELECT * FROM Iris LIMIT 3").fetchall()

In [10]:
print(type(sample_data))
sample_data

<class 'list'>


[(1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
 (2, 4.9, 3, 1.4, 0.2, 'Iris-setosa'),
 (3, 4.7, 3.2, 1.3, 0.2, 'Iris-setosa')]

So the query result is returned as a list of tuples, where each tuple is a record (row) of the table.

In [11]:
# get the names of the columns (headers)
[row[0] for row in cursor.description]

['Id',
 'SepalLengthCm',
 'SepalWidthCm',
 'PetalLengthCm',
 'PetalWidthCm',
 'Species']
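Column names and row tuples can be combined by hand, or `sqlite3` can be asked to return rows that support access by column name through `sqlite3.Row`. The following is a small sketch on an in-memory table with a single made-up row and only three of the columns; `conn_demo` is a hypothetical name.

```python
import sqlite3

# In-memory stand-in for the Iris table, with one illustrative row
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE Iris (Id INTEGER, SepalLengthCm REAL, Species TEXT)")
conn_demo.execute("INSERT INTO Iris VALUES (1, 5.1, 'Iris-setosa')")

# Option 1: pair the names from cursor.description with each row tuple
cur = conn_demo.execute("SELECT * FROM Iris")
columns = [col[0] for col in cur.description]
records = [dict(zip(columns, row)) for row in cur.fetchall()]
print(records)

# Option 2: let sqlite3 build rows that can be indexed by column name
conn_demo.row_factory = sqlite3.Row
row = conn_demo.execute("SELECT * FROM Iris").fetchone()
species = row["Species"]
row_columns = row.keys()
print(species, row_columns)
conn_demo.close()
```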

It is evident that the interface provided by `sqlite3` is low-level.

For data exploration purposes, we would like to directly import data into a more user-friendly library like `pandas`.

## Import Data from a Database into `pandas`

In [12]:
import pandas as pd

# use the sqlite3 connection from earlier to read a SQL query result table into a pandas DataFrame
iris_data = pd.read_sql_query("SELECT * FROM Iris", conn)
iris_data.head()

|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [13]:
iris_data.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

`pandas.read_sql_query` takes a `SQL` query and a **connection object** and imports the data into a `DataFrame`. The column data types are inferred from the returned values, and here they match the types of the database columns.

`pandas` provides a lot of the same functionality as `SQL` with a more user-friendly interface.

* However, running `SQL` through `sqlite3` is extremely useful for downselecting data *before* importing it into `pandas`.
* For example, you might have 1 TB of data in a table stored in a database on a server machine. You're interested in working on a subset of the data based on some criterion, and it would be impossible to first load all the data into `pandas` and then filter it.
* Therefore we should tell the database to perform the filtering with a `SQL` query and then load only the downsized dataset into `pandas`.

In [14]:
# get all rows of the Iris table for the setosa species (standard SQL uses a single = for equality)
iris_setosa_data = pd.read_sql_query("SELECT * FROM Iris WHERE Species = 'Iris-setosa'", conn)

# compare with original dataset
print(iris_setosa_data.shape)
print(iris_data.shape)

(50, 6)
(150, 6)


So 1/3 of the records in the dataset belong to the setosa species.
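The query above hard-codes the species name in the SQL string. When the filter value comes from a variable, a `?` placeholder lets `sqlite3` substitute it safely, and the database can also do the counting itself with `GROUP BY`. The sketch below uses a tiny made-up in-memory table, not the real Iris data; `conn_demo` and the four toy rows are assumptions for illustration.

```python
import sqlite3

# Toy in-memory table: four made-up rows, not the real dataset
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE Iris (Id INTEGER, Species TEXT)")
toy_rows = [
    (1, "Iris-setosa"),
    (2, "Iris-setosa"),
    (3, "Iris-versicolor"),
    (4, "Iris-virginica"),
]
conn_demo.executemany("INSERT INTO Iris VALUES (?, ?)", toy_rows)

# Parameterized filter: the value is passed separately from the SQL text
species = "Iris-setosa"
setosa_rows = conn_demo.execute(
    "SELECT * FROM Iris WHERE Species = ?", (species,)
).fetchall()

# Let the database count the records per species
counts = conn_demo.execute(
    "SELECT Species, COUNT(*) FROM Iris GROUP BY Species ORDER BY Species"
).fetchall()

print(setosa_rows)
print(counts)
conn_demo.close()
```

`pandas.read_sql_query` accepts a `params` argument for the same purpose, so the placeholder style carries over when loading the result into a `DataFrame`.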

In [15]:
iris_setosa_data.head(10)

|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
