# ClickHouse: how to access from JupyterHub

[ClickHouse®](https://clickhouse.tech/) is a fast open-source OLAP database management system. It is column-oriented and can generate analytical reports from SQL queries in real time.

We have a local ClickHouse installation that collects logs from the [GSOM site](https://gsom.spbu.ru/). It is available in read-only mode so that you can learn how to use ClickHouse in your tasks.

## Import libraries and set access parameters

In [None]:
import os
import json
import requests
import pandas as pd
pd.set_option('display.max_columns', None)

Set the host and port of the database. ClickHouse has no external IP address, so it is reachable only from JupyterHub notebooks:

In [None]:
CH_HOST = 'http://10.129.0.30'
CH_PORT = '8123'
SSL_VERIFY = True

## How to access

Our main function for accessing ClickHouse takes the query, the database host, a username, and a password, and returns the data from the database:

In [None]:
def get_data(query, host, user_name, user_passwd):
    # The query text is sent in the POST body; the response comes back as plain text.
    if (user_name == '') and (user_passwd == ''):
        # No credentials supplied: send the request without authentication.
        r = requests.post(host, data=query, verify=SSL_VERIFY)
    else:
        r = requests.post(host, data=query,
                          auth=(user_name, user_passwd), verify=SSL_VERIFY)
    print('request status code:', r.status_code)
    return r.text

ClickHouse uses [SQL syntax](https://clickhouse.tech/docs/en/sql-reference/syntax/) for its queries, so let's define our first query:

In [None]:
query = 'SHOW DATABASES'

...and pass it to the function:

In [None]:
data = get_data(query=query, 
                host=':'.join([CH_HOST, CH_PORT]),
                user_name=os.environ['CLICKHOUSE_USER'],
                user_passwd=os.environ['CLICKHOUSE_PASSWORD'])

__NOTE:__ The username (login) and password are stored in the environment variables `CLICKHOUSE_USER` and `CLICKHOUSE_PASSWORD` for safety and can be read with the `os` library as `os.environ['<ENV_VARIABLE_NAME>']`.
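If either variable is unset, `os.environ[...]` raises a bare `KeyError`. The sketch below shows one way to fail with a clearer message. The helper name `read_credentials` and its `env` parameter are hypothetical and are not part of the notebook.

```python
import os


def read_credentials(env=None):
    """Return (user, password) read from the environment variables.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    names = ('CLICKHOUSE_USER', 'CLICKHOUSE_PASSWORD')
    missing = [name for name in names if name not in env]
    if missing:
        raise RuntimeError('missing environment variables: ' + ', '.join(missing))
    return env['CLICKHOUSE_USER'], env['CLICKHOUSE_PASSWORD']
```

The returned pair can then be passed to `get_data` as `user_name` and `user_passwd`.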

The data from the database is returned as a string, so postprocessing is needed:

In [None]:
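# Rows are separated by '\n' and columns by '\t'; skip the empty string left after the final newline.
data = [x.split('\t') for x in data.split('\n') if x]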
pd.DataFrame(data)
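The split above yields a table without column names. As a minimal sketch, assuming the query accepts a trailing `FORMAT` clause, you can request ClickHouse's `TabSeparatedWithNames` output format and treat the first line of the response as the header. The helper names and the sample response text below are made up for illustration.

```python
def with_column_names(query):
    """Append a FORMAT clause so the first response line holds column names."""
    return query.rstrip().rstrip(';') + ' FORMAT TabSeparatedWithNames'


def parse_tsv_with_names(text):
    """Split a TabSeparatedWithNames response into (header, rows)."""
    lines = [line for line in text.split('\n') if line]
    if not lines:
        return [], []
    header = lines[0].split('\t')
    rows = [line.split('\t') for line in lines[1:]]
    return header, rows


# Made-up response text; the column names and values are illustrative only.
sample = 'name\tengine\nexample_table\tMergeTree\n'
header, rows = parse_tsv_with_names(sample)
```

With pandas, `pd.DataFrame(rows, columns=header)` then gives a named table. All values remain strings, so numeric or date columns still need converting.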

## Example queries

Now we know which databases exist in ClickHouse. Let's build a query that lists the names of all tables in a database:

In [None]:
query = 'SHOW TABLES FROM gsomlogs'

In [None]:
data = get_data(query=query, 
                host=':'.join([CH_HOST, CH_PORT]),
                user_name=os.environ['CLICKHOUSE_USER'],
                user_passwd=os.environ['CLICKHOUSE_PASSWORD'])
data = [x.split('\t') for x in data.split('\n') if x]  # skip the empty trailing line
pd.DataFrame(data)

Let's get the fields of a selected table:

In [None]:
query = 'SHOW CREATE TABLE gsomlogs.hits_all'

In [None]:
data = get_data(query=query, 
                host=':'.join([CH_HOST, CH_PORT]),
                user_name=os.environ['CLICKHOUSE_USER'],
                user_passwd=os.environ['CLICKHOUSE_PASSWORD'])
print()
# The statement comes back as a single value in which line breaks are escaped
# as a backslash followed by "n", so split on that two-character sequence.
for x in data.split('\\n'):
    print(x)
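Splitting on the escaped `\n` works for display, but the tab-separated output also escapes other characters such as tabs, backslashes, and single quotes. The sketch below decodes a subset of these escapes in one pass. The function name `unescape_tsv` and the sample string are hypothetical.

```python
import re

# Subset of the backslash escapes used in tab-separated output.
_ESCAPES = {'n': '\n', 't': '\t', 'r': '\r', '\\': '\\', "'": "'"}


def unescape_tsv(value):
    """Decode backslash escapes in a single tab-separated field.

    One regex pass handles each backslash pair exactly once, so an escaped
    backslash followed by "n" is not mistaken for a line break.
    Unknown escapes are left unchanged.
    """
    return re.sub(r'\\(.)',
                  lambda m: _ESCAPES.get(m.group(1), m.group(0)),
                  value)


# Made-up escaped text standing in for a SHOW CREATE TABLE response.
sample = 'CREATE TABLE t\\n(\\n    `a` String\\n)'
print(unescape_tsv(sample))
```

A chain of `str.replace` calls would be shorter, but it can decode the same characters twice. The single pass avoids that.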

Now we are ready to dive into the data in the tables (for the demo, we limit each query to 5 rows):

In [None]:
query = 'SELECT * FROM gsomlogs.visits_all ORDER BY DateTime DESC LIMIT 5'

In [None]:
data = get_data(query=query, 
                host=':'.join([CH_HOST, CH_PORT]),
                user_name=os.environ['CLICKHOUSE_USER'],
                user_passwd=os.environ['CLICKHOUSE_PASSWORD'])
data = [x.split('\t') for x in data.split('\n') if x]  # skip the empty trailing line
df = pd.DataFrame(data)
df.head()

In [None]:
query = 'SELECT * FROM gsomlogs.hits_all ORDER BY DateTime DESC LIMIT 5'

In [None]:
data = get_data(query=query, 
                host=':'.join([CH_HOST, CH_PORT]),
                user_name=os.environ['CLICKHOUSE_USER'],
                user_passwd=os.environ['CLICKHOUSE_PASSWORD'])
data = [x.split('\t') for x in data.split('\n') if x]  # skip the empty trailing line
df = pd.DataFrame(data)
df.head()
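Every example above repeats the same host and credential arguments and the same split. One possible way to reduce the repetition is to bind those arguments once with `functools.partial`. This is a sketch, not part of the original notebook: `make_runner` is a hypothetical helper, and `get_data_stub` is a stand-in that returns canned text so the example runs without a database.

```python
from functools import partial


def get_data_stub(query, host, user_name, user_passwd):
    """Stand-in for get_data that returns made-up tab-separated text."""
    return 'db_one\ndb_two\n'


def make_runner(get_data_func, host, user_name, user_passwd):
    """Bind the connection arguments once and return a one-argument runner."""
    fetch = partial(get_data_func, host=host,
                    user_name=user_name, user_passwd=user_passwd)

    def run(query):
        text = fetch(query=query)
        # Same postprocessing as in the cells above, minus empty lines.
        return [line.split('\t') for line in text.split('\n') if line]

    return run


# Placeholder host and credentials; the stub ignores them.
run = make_runner(get_data_stub, 'http://example.invalid:8123', 'user', 'secret')
rows = run('SHOW DATABASES')
```

In the notebook you would pass the real `get_data`, `':'.join([CH_HOST, CH_PORT])`, and the credentials from the environment, then wrap each result with `pd.DataFrame(rows)`.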