# SQL Parser Tutorial

- [https://github.com/mozilla/moz-sql-parser](https://github.com/mozilla/moz-sql-parser)
- [http://g.gravizo.com/](http://g.gravizo.com/)
- [https://github.com/TLmaK0/gravizo](https://github.com/TLmaK0/gravizo)
- [https://gist.github.com/svenevs/ce05761128e240e27883e3372ccd4ecd](https://gist.github.com/svenevs/ce05761128e240e27883e3372ccd4ecd)

```
$ pip install moz-sql-parser
```

## SQL Parsing

References: http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/5-query-opt/intro-parsing.html

```
SQL query -> [SQL language parser] -> parse tree (with virtual tables) -> Pre-processor -> parse tree (without virtual tables)
```

- Parsing: converting an (SQL) query into a query parse tree.
- Parser: a computer program that translates statements ("sentences") in a programming language (e.g., SQL) into a parse tree.
- Parse tree: a tree whose nodes correspond to
    1. atoms of the programming language
    2. syntactic categories of the programming language
- Atom: a lexical element of a (programming) language that cannot be expressed in terms of more elementary lexical elements; atoms are the leaf nodes.
    - Keywords: SELECT, FROM, WHERE, etc.
    - Identifiers: table and field names
    - Constants: numbers, strings
    - Operators: +, -, LIKE, ...
    - Tokens: (, ;, ...
- Syntactic category: a construct of a (programming) language that is built from other lexical elements following syntactic rules; syntactic categories are the internal nodes.
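
To make the notion of an atom concrete, the sketch below splits a query into atoms and labels each one with the categories listed above. It is only an illustration: the keyword and operator sets are a small assumed subset, and `classify_atoms` is a hypothetical helper, not part of `moz-sql-parser`.

```python
import re

# Small illustrative subsets (assumptions), not a complete SQL lexicon.
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND"}
WORD_OPERATORS = {"LIKE"}

TOKEN_RE = re.compile(
    r"""
    (?P<constant>'[^']*'|\d+(?:\.\d+)?)   # quoted strings or numbers
  | (?P<word>[A-Za-z_][A-Za-z_0-9]*)      # keywords, word operators, identifiers
  | (?P<operator><=|>=|<>|[=<>+\-*/])     # symbolic operators
  | (?P<token>[(),;])                     # punctuation tokens
    """,
    re.VERBOSE,
)


def classify_atoms(query):
    """Return a list of (category, text) pairs, one per atom in `query`."""
    atoms = []
    for match in TOKEN_RE.finditer(query):
        kind, text = match.lastgroup, match.group()
        if kind == "word":
            upper = text.upper()
            if upper in KEYWORDS:
                kind = "keyword"
            elif upper in WORD_OPERATORS:
                kind = "operator"
            else:
                kind = "identifier"
        atoms.append((kind, text))
    return atoms


query = "SELECT movieTitle FROM StarsIn, MovieStar WHERE starName = name AND birthdate LIKE '%1960'"
for kind, text in classify_atoms(query):
    print(f"{kind:10s} {text}")
```

Every pair produced here would be a leaf of the parse tree; the internal nodes (`<Query>`, `<SelList>`, `<Cond>`, ...) only appear once a grammar groups these atoms.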

### Example Query:

```SQL
SELECT movieTitle
FROM StarsIn, MovieStar
WHERE starName = name
AND birthdate LIKE '%1960'
```

<p align="center">
    <img alt="Alt Text" src="https://g.gravizo.com/svg?digraph%20G%20%7B%0A%20%201%20%5Blabel%3D%22%3CQuery%3E%22%2C%20fontcolor%3Dblue%5D%3B%0A%20%202%20%5Blabel%3D%22SELECT%22%2C%20fontcolor%3Dblue%2C%20fontname%3D%22times-bold%22%5D%3B%0A%20%203%20%5Blabel%3D%22%3CSelList%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%204%20%5Blabel%3D%22%3CAttribute%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%205%20%5Blabel%3D%22movieTitle%22%2C%20fontcolor%3Dblue%5D%3B%0A%20%206%20%5Blabel%3D%22FROM%22%2C%20fontcolor%3Dblue%2C%20fontname%3D%22times-bold%22%5D%3B%20%20%20%20%0A%20%207%20%5Blabel%3D%22%3CFromList%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%208%20%5Blabel%3D%22%3CRelName%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%209%20%5Blabel%3D%22StarsIn%22%2C%20fontcolor%3Dblue%5D%3B%0A%20%2010%20%5Blabel%3D%22%2C%22%2C%20fontcolor%3Dblue%2C%20fontname%3D%22times-bold%22%5D%3B%0A%20%2011%20%5Blabel%3D%22%3CFromList%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%2012%20%5Blabel%3D%22%3CRelName%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%2013%20%5Blabel%3D%22MovieStar%22%2C%20fontcolor%3Dblue%5D%3B%0A%20%2014%20%5Blabel%3D%22WHERE%22%2C%20fontcolor%3Dblue%2C%20fontname%3D%22times-bold%22%5D%3B%0A%20%2015%20%5Blabel%3D%22%3CCond%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%2016%20%5Blabel%3D%22%3CCond%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%2017%20%5Blabel%3D%22%3CAttr%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%2018%20%5Blabel%3D%22starName%22%2C%20fontcolor%3Dblue%5D%3B%0A%20%2019%20%5Blabel%3D%22%3D%22%2C%20fontcolor%3Dblue%2C%20fontname%3D%22times-bold%22%5D%3B%0A%20%2020%20%5Blabel%3D%22%3CAttr%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%2021%20%5Blabel%3D%22name%22%2C%20fontcolor%3Dblue%5D%3B%0A%20%2022%20%5Blabel%3D%22AND%22%2C%20fontcolor%3Dblue%2C%20fontname%3D%22times-bold%22%5D%3B%0A%20%2023%20%5Blabel%3D%22%3CCond%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%2024%20%5Blabel%3D%22%3CAttr%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%2025%20%5Blabel%3D%22birthdate%22%2C%20fontcolor%3Dblue%5D%3B%0A%20%2026%20%5Blabel%3D%22LIKE%22%2C%2
0fontcolor%3Dblue%2C%20fontname%3D%22times-bold%22%5D%3B%0A%20%2027%20%5Blabel%3D%22%3CPattern%3E%22%2C%20fontcolor%3Dred%5D%3B%0A%20%2028%20%5Blabel%3D%22%27%251960%27%22%2C%20fontcolor%3Dblue%5D%3B%0A%20%201%20-%3E%202%3B%0A%20%201%20-%3E%203%20-%3E%204%20-%3E%205%3B%0A%20%201%20-%3E%206%3B%0A%20%201%20-%3E%207%20-%3E%208%20-%3E%209%3B%0A%20%207%20-%3E%2010%3B%0A%20%207%20-%3E%2011%20-%3E%2012%20-%3E%2013%3B%0A%20%201%20-%3E%2014%3B%0A%20%201%20-%3E%2015%20-%3E%2016%20-%3E%2017%20-%3E%2018%3B%0A%20%2016%20-%3E%2019%3B%0A%20%2016%20-%3E%2020%20-%3E%2021%3B%0A%20%2015%20-%3E%2022%3B%0A%20%2015%20-%3E%2023%20-%3E%2024%20-%3E%2025%3B%0A%20%2023%20-%3E%2026%3B%0A%20%2023%20-%3E%2027%20-%3E%2028%3B%0A%7D" />
</p>

<details>
<summary>How to create graph in markdown?</summary>

```python
from urllib.parse import quote
raw = """digraph G {
  1 [label="<Query>", fontcolor=blue];
  2 [label="SELECT", fontcolor=blue, fontname="times-bold"];
  3 [label="<SelList>", fontcolor=red];
  4 [label="<Attribute>", fontcolor=red];
  5 [label="movieTitle", fontcolor=blue];
  6 [label="FROM", fontcolor=blue, fontname="times-bold"];    
  7 [label="<FromList>", fontcolor=red];
  8 [label="<RelName>", fontcolor=red];
  9 [label="StarsIn", fontcolor=blue];
  10 [label=",", fontcolor=blue, fontname="times-bold"];
  11 [label="<FromList>", fontcolor=red];
  12 [label="<RelName>", fontcolor=red];
  13 [label="MovieStar", fontcolor=blue];
  14 [label="WHERE", fontcolor=blue, fontname="times-bold"];
  15 [label="<Cond>", fontcolor=red];
  16 [label="<Cond>", fontcolor=red];
  17 [label="<Attr>", fontcolor=red];
  18 [label="starName", fontcolor=blue];
  19 [label="=", fontcolor=blue, fontname="times-bold"];
  20 [label="<Attr>", fontcolor=red];
  21 [label="name", fontcolor=blue];
  22 [label="AND", fontcolor=blue, fontname="times-bold"];
  23 [label="<Cond>", fontcolor=red];
  24 [label="<Attr>", fontcolor=red];
  25 [label="birthdate", fontcolor=blue];
  26 [label="LIKE", fontcolor=blue, fontname="times-bold"];
  27 [label="<Pattern>", fontcolor=red];
  28 [label="'%1960'", fontcolor=blue];
  1 -> 2;
  1 -> 3 -> 4 -> 5;
  1 -> 6;
  1 -> 7 -> 8 -> 9;
  7 -> 10;
  7 -> 11 -> 12 -> 13;
  1 -> 14;
  1 -> 15 -> 16 -> 17 -> 18;
  16 -> 19;
  16 -> 20 -> 21;
  15 -> 22;
  15 -> 23 -> 24 -> 25;
  23 -> 26;
  23 -> 27 -> 28;
}"""
txt = quote(raw)
```
    
Append `txt` to `https://g.gravizo.com/svg?` and use the resulting URL as the `src` of the image.
</details>
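
The URL-building step can also be wrapped in a small helper, sketched below. The function names `gravizo_url` and `gravizo_img_tag` are made up for this example, and the DOT source is a two-node excerpt of the graph above.

```python
from urllib.parse import quote, unquote

GRAVIZO_SVG = "https://g.gravizo.com/svg?"


def gravizo_url(dot_source):
    """Percent-encode a DOT graph and append it to the Gravizo SVG endpoint."""
    return GRAVIZO_SVG + quote(dot_source)


def gravizo_img_tag(dot_source, alt="parse tree"):
    """Build an <img> tag that can be pasted into a Markdown file."""
    return f'<img alt="{alt}" src="{gravizo_url(dot_source)}" />'


dot = 'digraph G {\n  1 [label="<Query>"];\n  2 [label="SELECT"];\n  1 -> 2;\n}'
url = gravizo_url(dot)
print(url)

# Decoding the part after the endpoint gives back the original DOT source.
assert unquote(url[len(GRAVIZO_SVG):]) == dot
```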

In [9]:
import json
from moz_sql_parser import parse

In [10]:
query = """
SELECT movieTitle
FROM StarsIn, MovieStar
WHERE starName = name
AND birthdate LIKE '%1960'
"""
jsonStr = json.dumps(parse(query), indent=2)
print(jsonStr)

{
  "select": {
    "value": "movieTitle"
  },
  "from": [
    "StarsIn",
    "MovieStar"
  ],
  "where": {
    "and": [
      {
        "eq": [
          "starName",
          "name"
        ]
      },
      {
        "like": [
          "birthdate",
          {
            "literal": "%1960"
          }
        ]
      }
    ]
  }
}
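
The nested dictionary is easy to walk recursively. The sketch below uses the output printed above as a literal, so it runs without `moz-sql-parser` installed, and collects every leaf comparison of the `where` clause. `iter_conditions` is a hypothetical helper, and treating only `and`/`or` as connectives is an assumption.

```python
# The parse result shown above, written out as a plain dict.
parsed = {
    "select": {"value": "movieTitle"},
    "from": ["StarsIn", "MovieStar"],
    "where": {
        "and": [
            {"eq": ["starName", "name"]},
            {"like": ["birthdate", {"literal": "%1960"}]},
        ]
    },
}


def iter_conditions(node):
    """Yield (operator, left, right) for every leaf comparison under `node`."""
    for op, args in node.items():
        if op in ("and", "or"):
            # Connectives hold a list of child conditions.
            for child in args:
                yield from iter_conditions(child)
        else:
            left, right = args
            # String constants are wrapped as {"literal": ...}; bare strings are column names.
            if isinstance(right, dict) and "literal" in right:
                right = right["literal"]
            yield (op, left, right)


conditions = list(iter_conditions(parsed["where"]))
print(conditions)
```

Note that unwrapping `{"literal": ...}` discards the distinction between a column name (`name`) and a string constant (`'%1960'`); keep the wrapper if the SQL has to be rebuilt later.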


# Custom Database parsing to jsonl

## WikiSQL format

from: https://github.com/salesforce/WikiSQL/blob/master/README.md

### Question, query and table ID

Questions, queries, and table IDs are stored in the `*.jsonl` files. A line looks like the following:

```json
{
   "phase":1,
   "question":"who is the manufacturer for the order year 1998?",
   "sql":{
      "conds":[
         [
            0,
            0,
            "1998"
         ]
      ],
      "sel":1,
      "agg":0
   },
   "table_id":"1-10007452-3"
}
```

The fields represent the following:

- `phase`: the phase in which the dataset was collected. We collected WikiSQL in two phases.
- `question`: the natural language question written by the worker.
- `table_id`: the ID of the table to which this question is addressed.
- `sql`: the SQL query corresponding to the question. This has the following subfields:
  - `sel`: the numerical index of the column that is being selected. You can find the actual column from the table.
  - `agg`: the numerical index of the aggregation operator that is being used. You can find the actual operator from `Query.agg_ops` in `lib/query.py`.
  - `conds`: a list of triplets `(column_index, operator_index, condition)` where:
    - `column_index`: the numerical index of the condition column that is being used. You can find the actual column from the table.
    - `operator_index`: the numerical index of the condition operator that is being used. You can find the actual operator from `Query.cond_ops` in `lib/query.py`.
    - `condition`: the comparison value for the condition, in either `string` or `float` type.
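
As a rough illustration of how these indices decode, the sketch below turns the `sql` object of the example line back into a readable query string. The operator lists are the ones defined in the `Query` class further down; `wikisql_to_sql` is a hypothetical helper, and the symbolic header `col0`, `col1` is used because the real column names of that table are not shown here.

```python
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<", "OP"]


def wikisql_to_sql(sql, header, table="table"):
    """Render a WikiSQL `sql` object as a human-readable SQL string (display only)."""
    column = header[sql["sel"]]
    agg = AGG_OPS[sql["agg"]]
    select = f"{agg}({column})" if agg else column
    where = " AND ".join(
        f"{header[col]} {COND_OPS[op]} {value!r}" for col, op, value in sql["conds"]
    )
    query = f"SELECT {select} FROM {table}"
    return f"{query} WHERE {where}" if where else query


example = {"conds": [[0, 0, "1998"]], "sel": 1, "agg": 0}
header = ["col0", "col1"]  # assumption: symbolic names stand in for the real header
print(wikisql_to_sql(example, header))
```

The string is meant for inspection only; values are inlined with `repr`, so it should not be executed against a database as is.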

### Tables

Tables are stored in the `*.tables.jsonl` files. A line looks like the following:

```json
{
   "id":"1-1000181-1",
   "header":[
      "State/territory",
      "Text/background colour",
      "Format",
      "Current slogan",
      "Current series",
      "Notes"
   ],
   "types":[
      "text",
      "text",
      "text",
      "text",
      "text",
      "text"
   ],
   "rows":[
      [
         "Australian Capital Territory",
         "blue/white",
         "Yaa\u00b7nna",
         "ACT \u00b7 CELEBRATION OF A CENTURY 2013",
         "YIL\u00b700A",
         "Slogan screenprinted on plate"
      ],
      [
         "New South Wales",
         "black/yellow",
         "aa\u00b7nn\u00b7aa",
         "NEW SOUTH WALES",
         "BX\u00b799\u00b7HI",
         "No slogan on current series"
      ],
      [
         "New South Wales",
         "black/white",
         "aaa\u00b7nna",
         "NSW",
         "CPX\u00b712A",
         "Optional white slimline series"
      ],
      [
         "Northern Territory",
         "ochre/white",
         "Ca\u00b7nn\u00b7aa",
         "NT \u00b7 OUTBACK AUSTRALIA",
         "CB\u00b706\u00b7ZZ",
         "New series began in June 2011"
      ],
      [
         "Queensland",
         "maroon/white",
         "nnn\u00b7aaa",
         "QUEENSLAND \u00b7 SUNSHINE STATE",
         "999\u00b7TLG",
         "Slogan embossed on plate"
      ],
      [
         "South Australia",
         "black/white",
         "Snnn\u00b7aaa",
         "SOUTH AUSTRALIA",
         "S000\u00b7AZD",
         "No slogan on current series"
      ],
      [
         "Victoria",
         "blue/white",
         "aaa\u00b7nnn",
         "VICTORIA - THE PLACE TO BE",
         "ZZZ\u00b7562",
         "Current series will be exhausted this year"
      ]
   ]
}
```

The fields represent the following:
- `id`: the table ID.
- `header`: a list of column names in the table.
- `rows`: a list of rows. Each row is a list of row entries.

Tables are also contained in a corresponding `*.db` file.
This is a SQL database with the same information.
Note that due to the flexible format of HTML tables, the column names of tables in the database have been symbolized.
For example, for a table with the columns `['foo', 'bar']`, the columns in the database are actually `col0` and `col1`.
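
A minimal sketch of reading such a file and deriving the symbolized names is given below. It reads from an in-memory string rather than a real `*.tables.jsonl` file, the two-column table is a shortened version of the example above, and the `table_...` naming rule is taken from the `DBEngine.execute` code later in this document. `load_tables` and `db_names` are hypothetical helpers.

```python
import io
import json

# Stand-in for an open `*.tables.jsonl` file: one JSON object per line.
sample = io.StringIO(
    '{"id": "1-1000181-1", "header": ["State/territory", "Format"], '
    '"types": ["text", "text"], "rows": [["Victoria", "aaa\\u00b7nnn"]]}\n'
)


def load_tables(lines):
    """Parse JSON-lines text into a dict keyed by table id."""
    tables = {}
    for line in lines:
        line = line.strip()
        if line:
            table = json.loads(line)
            tables[table["id"]] = table
    return tables


def db_names(table):
    """Return the database table name and a col<N> -> header-name mapping."""
    table_name = "table_" + table["id"].replace("-", "_")
    columns = {f"col{i}": name for i, name in enumerate(table["header"])}
    return table_name, columns


tables = load_tables(sample)
print(db_names(tables["1-1000181-1"]))
```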


In [11]:
from pathlib import Path
import re
import records
from babel.numbers import parse_decimal, NumberFormatError

schema_re = re.compile(r'\((.+)\)')
num_re = re.compile(r'[-+]?\d*\.\d+|\d+')

db_path = Path("./private")

In [7]:
db = records.Database(f"sqlite:///{db_path / 'samsung_new.db'}")

In [8]:
db.get_table_names()

[]

In [53]:
table_id = "receipts"
table_info = db.query('SELECT sql from sqlite_master WHERE tbl_name = :name', name=table_id).all()[0].sql
schema_str = schema_re.findall(table_info.replace("\n", ""))[0]

In [54]:
schema = {}
for tup in schema_str.split(', '):
    c, t = tup.split()
    schema[c.strip('"')] = t

In [55]:
schema

{'index': 'INTEGER',
 'rcept_no': 'TEXT',
 'reprt_code': 'TEXT',
 'bsns_year': 'INTEGER',
 'corp_code': 'TEXT',
 'stock_code': 'TEXT',
 'fs_div': 'TEXT',
 'fs_nm': 'TEXT',
 'sj_div': 'TEXT',
 'sj_nm': 'TEXT',
 'account_nm': 'TEXT',
 'thstrm_nm': 'TEXT',
 'thstrm_dt': 'TEXT',
 'thstrm_amount': 'INTEGER',
 'frmtrm_nm': 'TEXT',
 'frmtrm_dt': 'TEXT',
 'frmtrm_amount': 'INTEGER',
 'bfefrmtrm_nm': 'TEXT',
 'bfefrmtrm_dt': 'TEXT',
 'bfefrmtrm_amount': 'INTEGER'}
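
Splitting the `CREATE TABLE` text on `', '` works for this table, but it would likely break on column definitions that contain spaces or commas (constraints, `DECIMAL(10, 2)`, and so on). As an alternative sketch, SQLite's `PRAGMA table_info` returns the same information already split into fields. The example below uses the standard-library `sqlite3` module and an in-memory table with only four of the columns, so it is an assumption-laden stand-in for the real `samsung_new.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE receipts ("index" INTEGER, rcept_no TEXT, bsns_year INTEGER, thstrm_amount INTEGER)'
)


def read_schema(conn, table):
    """Return {column_name: declared_type} in column order."""
    # PRAGMA arguments cannot be bound as parameters, so validate the name first.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return {name: col_type for _cid, name, col_type, *_rest in rows}


schema = read_schema(conn, "receipts")
col2idx = {name: i for i, name in enumerate(schema)}
print(schema)
print(col2idx)
```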

In [11]:
df_sample = db.query("SELECT * FROM 'receipts'").export("df")
df_sample

Unnamed: 0,index,rcept_no,reprt_code,bsns_year,corp_code,stock_code,fs_div,fs_nm,sj_div,sj_nm,account_nm,thstrm_nm,thstrm_dt,thstrm_amount,frmtrm_nm,frmtrm_dt,frmtrm_amount,bfefrmtrm_nm,bfefrmtrm_dt,bfefrmtrm_amount
0,0,20160330003536,11011,2015,00126380,005930,CFS,연결재무제표,BS,재무상태표,유동자산,제 47 기,2015.12.31 현재,124814725000000,제 46 기,2014.12.31 현재,115146026000000,제 45 기,2013.12.31 현재,110760271000000
1,1,20160330003536,11011,2015,00126380,005930,CFS,연결재무제표,BS,재무상태표,비유동자산,제 47 기,2015.12.31 현재,117364796000000,제 46 기,2014.12.31 현재,115276932000000,제 45 기,2013.12.31 현재,103314747000000
2,2,20160330003536,11011,2015,00126380,005930,CFS,연결재무제표,BS,재무상태표,자산총계,제 47 기,2015.12.31 현재,242179521000000,제 46 기,2014.12.31 현재,230422958000000,제 45 기,2013.12.31 현재,214075018000000
3,3,20160330003536,11011,2015,00126380,005930,CFS,연결재무제표,BS,재무상태표,유동부채,제 47 기,2015.12.31 현재,50502909000000,제 46 기,2014.12.31 현재,52013913000000,제 45 기,2013.12.31 현재,51315409000000
4,4,20160330003536,11011,2015,00126380,005930,CFS,연결재무제표,BS,재무상태표,비유동부채,제 47 기,2015.12.31 현재,12616807000000,제 46 기,2014.12.31 현재,10320857000000,제 45 기,2013.12.31 현재,12743599000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,7,20210309000744,11011,2020,00126380,005930,CFS,연결재무제표,BS,재무상태표,이익잉여금,제 52 기,2020.12.31 현재,271068211000000,제 51 기,2019.12.31 현재,254582894000000,제 50 기,2018.12.31 현재,242698956000000
68,8,20210309000744,11011,2020,00126380,005930,CFS,연결재무제표,BS,재무상태표,자본총계,제 52 기,2020.12.31 현재,275948016000000,제 51 기,2019.12.31 현재,262880421000000,제 50 기,2018.12.31 현재,247753177000000
69,9,20210309000744,11011,2020,00126380,005930,CFS,연결재무제표,IS,손익계산서,매출액,제 52 기,2020.01.01 ~ 2020.12.31,236806988000000,제 51 기,2019.01.01 ~ 2019.12.31,230400881000000,제 50 기,2018.01.01 ~ 2018.12.31,243771415000000
70,10,20210309000744,11011,2020,00126380,005930,CFS,연결재무제표,IS,손익계산서,영업이익,제 52 기,2020.01.01 ~ 2020.12.31,35993876000000,제 51 기,2019.01.01 ~ 2019.12.31,27768509000000,제 50 기,2018.01.01 ~ 2018.12.31,58886669000000


In [12]:
list(enumerate(df_sample.columns))

[(0, 'index'),
 (1, 'rcept_no'),
 (2, 'reprt_code'),
 (3, 'bsns_year'),
 (4, 'corp_code'),
 (5, 'stock_code'),
 (6, 'fs_div'),
 (7, 'fs_nm'),
 (8, 'sj_div'),
 (9, 'sj_nm'),
 (10, 'account_nm'),
 (11, 'thstrm_nm'),
 (12, 'thstrm_dt'),
 (13, 'thstrm_amount'),
 (14, 'frmtrm_nm'),
 (15, 'frmtrm_dt'),
 (16, 'frmtrm_amount'),
 (17, 'bfefrmtrm_nm'),
 (18, 'bfefrmtrm_dt'),
 (19, 'bfefrmtrm_amount')]

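## Data example

First define the terminology and the `[시간]` (time) and `[항목]` (item) tokens; after that, it is probably best to generate the unique natural-language questions first and then write the SQL for them.

- Natural-language question: 제 51 기에 삼성전자의 이익잉여금은 어떻게 돼? ("What were Samsung Electronics' retained earnings in the 51st fiscal term?")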
- SQL: 
    ```SQL
    /* Possible answer 1 */
    SELECT frmtrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND bsns_year = 2020
    /* Possible answer 2 */
    SELECT thstrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND bsns_year = 2019
    /* Possible answer 3 */
    SELECT thstrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND thstrm_nm = '제 51 기'
    /* Possible answer 4 */
    SELECT frmtrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND frmtrm_nm = '제 51 기'
    ```
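
The four queries are equivalent because each filing stores the current term (`thstrm_*`) next to the previous term (`frmtrm_*`), so the 51st term appears as the current term of the 2019 filing and as the previous term of the 2020 filing. The sketch below reproduces this with an in-memory SQLite table reduced to six columns. The 2020 row is copied from the sample above; the 2019 row is an assumption reconstructed from that row and from the query results, since the sample output elides it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE receipts (bsns_year INTEGER, account_nm TEXT, "
    "thstrm_nm TEXT, thstrm_amount INTEGER, frmtrm_nm TEXT, frmtrm_amount INTEGER)"
)
conn.executemany(
    "INSERT INTO receipts VALUES (?, ?, ?, ?, ?, ?)",
    [
        # 2020 filing: values taken from the sample rows above.
        (2020, "이익잉여금", "제 52 기", 271068211000000, "제 51 기", 254582894000000),
        # 2019 filing: assumed row, inferred from the 2020 filing's previous-term columns.
        (2019, "이익잉여금", "제 51 기", 254582894000000, "제 50 기", 242698956000000),
    ],
)

queries = [
    "SELECT frmtrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND bsns_year = 2020",
    "SELECT thstrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND bsns_year = 2019",
    "SELECT thstrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND thstrm_nm = '제 51 기'",
    "SELECT frmtrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND frmtrm_nm = '제 51 기'",
]
answers = [conn.execute(q).fetchone()[0] for q in queries]
print(answers)
```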

<details>
<summary>JSONL form:</summary>

```json
{
   "phase":1,
   "question":"제 51 기에 삼성전자의 이익잉여금은 어떻게 돼?",
   "sql":{
      "conds":[
         [10, 0, "이익잉여금"], [3, 0, 2020]
      ],
      "sel":16,
      "agg":0
   },
   "table_id":"receipts"
}
{
   "phase":1,
   "question":"제 51 기에 삼성전자의 이익잉여금은 어떻게 돼?",
   "sql":{
      "conds":[
         [10, 0, "이익잉여금"], [3, 0, 2019]
      ],
      "sel":13,
      "agg":0
   },
   "table_id":"receipts"
}
...
```
</details>


In [96]:
sqls = [
    "SELECT frmtrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND bsns_year = 2020",
    "SELECT thstrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND bsns_year = 2019",
    "SELECT thstrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND thstrm_nm = '제 51 기'",
    "SELECT frmtrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND frmtrm_nm = '제 51 기'"
]
for sql in sqls:
    res = db.query(sql)
    print(res.as_dict()[0])

{'frmtrm_amount': 254582894000000}
{'thstrm_amount': 254582894000000}
{'thstrm_amount': 254582894000000}
{'frmtrm_amount': 254582894000000}


In [167]:
import json
from typing import Union
from moz_sql_parser import parse as sql_parser
schema_re = re.compile(r'\((.+)\)')

class Generator:
    
    agg_ops = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
    cond_ops = ["=", ">", "<", "OP", ">=", "<="]
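    # Maps moz_sql_parser operator names to SQL symbols. "<>" has no entry in `cond_ops`,
    # so a "neq" condition currently fails at `cond_ops.index` with a ValueError.
    cond_ops_dict = {"eq": "=", "lt": "<",  "lte": "<=", "gt": ">", "gte": ">=", "neq": "<>"}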

    syms = ["SELECT", "WHERE", "AND", "COL", "TABLE", "CAPTION", "PAGE", "SECTION", "OP", "COND", "QUESTION", "AGG", "AGGOPS", "CONDOPS"]
    
    def __init__(self, db_path: Union[Path, str]) -> None:
        self.db = records.Database(f"sqlite:///{db_path}")
        self._reset()
        
    def _reset(self) -> None:
        self.table_id = None
        self.schema = None
        self.col2idx = None
        
    def get_schema_info(self, table_id: str) -> None:
        table_info = self.db.query('SELECT sql from sqlite_master WHERE tbl_name = :name', name=table_id).all()[0].sql
        schema_str = schema_re.findall(table_info.replace("\n", ""))[0]
        schema = {}
        for tup in schema_str.split(', '):
            c, t = tup.split()
            schema[c.strip('"')] = t
        col2idx = {c: i for i, c in enumerate(schema.keys())}
        
        self.table_id = table_id
        self.schema = schema
        self.col2idx = col2idx
    
    def to_jsonl(self, sql: str, question: str) -> dict:
        r"""
        # Only 1 agg and select
        example:
        - sql: "SELECT frmtrm_amount FROM receipts WHERE account_nm = '이익잉여금' AND bsns_year = 2020",
        - question: "제 51 기에 삼성전자의 이익잉여금은 어떻게 돼?"
        
        return:
        {
           "phase":1,
           "question":"제 51 기에 삼성전자의 이익잉여금은 어떻게 돼?",
           "sql":{
              "conds":[
                 {"AND": [[10, 0, "이익잉여금"], [3, 0, 2020]]}
              ],
              "sel":16,
              "agg":0
           },
           "table_id":"receipts"
        }
        
        """
        
        parsed = sql_parser(sql)
        table_id = parsed["from"]
        jsonl = {"phase": 1, "question": question, "table_id": table_id, "sql": {}}
        
        if (self.table_id is None) or (self.table_id != table_id):
            self.get_schema_info(table_id)
#         else:
#             raise AttributeError("No schema information, please make sure to call `self.get_schema_info`")
        
        select_parsed = parsed["select"]["value"]
        if isinstance(select_parsed, dict):
            # Only 1 agg and select
            agg_name = list(select_parsed)[0]
            agg = self.agg_ops.index(agg_name.upper())
            select_name = select_parsed[agg_name]
        elif isinstance(select_parsed, str):
            agg = 0
            select_name = select_parsed
        else:
            raise TypeError(f"Parsed in select clause should be `str` or `dict` type, Current is {select_parsed}")
        select = self.col2idx.get(select_name)
        
        conds_parsed = parsed["where"]
        conds = []
        for operator, conditions in conds_parsed.items():
            cond = {operator.upper(): []}
            for condition in conditions:
                key, values = tuple(condition.items())[0]

                if self.cond_ops_dict.get(key) is None:
                    raise KeyError(f"No operator: {key}")
                else:
                    op = self.cond_ops_dict.get(key)
                    op_idx = self.cond_ops.index(op)
                    
                if self.col2idx.get(values[0]) is None:
                    raise KeyError(f"No column name: {values[0]}")
                else:
                    col_idx = self.col2idx.get(values[0])
                    
                
                if isinstance(values[1], dict):
                    # moz_sql_parser wraps string constants as {"literal": ...}; unwrap here and re-quote them when rebuilding SQL
                    cond_value = values[1]["literal"]
                else:
                    cond_value = values[1]
                cond[operator.upper()].append([col_idx, op_idx, cond_value])
            conds.append(cond)
        
        jsonl["sql"]["sel"] = select
        jsonl["sql"]["agg"] = agg
        jsonl["sql"]["conds"] = conds
        return jsonl
        

In [168]:
sql_gen = Generator(db_path=db_path / "samsung_new.db")

In [169]:
sql_gen.schema

In [170]:
jsons = []
for sql in sqls:
    jsons.append(sql_gen.to_jsonl(sql, "제 51 기에 삼성전자의 이익잉여금은 어떻게 돼?"))

In [171]:
jsons

[{'phase': 1,
  'question': '제 51 기에 삼성전자의 이익잉여금은 어떻게 돼?',
  'table_id': 'receipts',
  'sql': {'sel': 16,
   'agg': 0,
   'conds': [{'AND': [[10, 0, '이익잉여금'], [3, 0, 2020]]}]}},
 {'phase': 1,
  'question': '제 51 기에 삼성전자의 이익잉여금은 어떻게 돼?',
  'table_id': 'receipts',
  'sql': {'sel': 13,
   'agg': 0,
   'conds': [{'AND': [[10, 0, '이익잉여금'], [3, 0, 2019]]}]}},
 {'phase': 1,
  'question': '제 51 기에 삼성전자의 이익잉여금은 어떻게 돼?',
  'table_id': 'receipts',
  'sql': {'sel': 13,
   'agg': 0,
   'conds': [{'AND': [[10, 0, '이익잉여금'], [11, 0, '제 51 기']]}]}},
 {'phase': 1,
  'question': '제 51 기에 삼성전자의 이익잉여금은 어떻게 돼?',
  'table_id': 'receipts',
  'sql': {'sel': 16,
   'agg': 0,
   'conds': [{'AND': [[10, 0, '이익잉여금'], [14, 0, '제 51 기']]}]}}]
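
The generated `conds` are nested under an `AND` key, whereas the WikiSQL format described earlier uses a flat list of triplets. One possible way to bridge the two and write an actual `.jsonl` file is sketched below. `flatten_conds` and `dump_jsonl` are hypothetical helpers, the record is a copy of the first generated entry (with the question from the data example), and an in-memory buffer stands in for a real file.

```python
import io
import json


def flatten_conds(conds):
    """Turn [{'AND': [[col, op, val], ...]}] into WikiSQL's flat [[col, op, val], ...]."""
    flat = []
    for group in conds:
        for connective, triples in group.items():
            if connective != "AND":
                # WikiSQL conditions are implicitly joined with AND only.
                raise ValueError(f"unsupported connective: {connective}")
            flat.extend(triples)
    return flat


def dump_jsonl(examples, fp):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    for example in examples:
        sql = {**example["sql"], "conds": flatten_conds(example["sql"]["conds"])}
        fp.write(json.dumps({**example, "sql": sql}, ensure_ascii=False) + "\n")


examples = [
    {
        "phase": 1,
        "question": "제 51 기에 삼성전자의 이익잉여금은 어떻게 돼?",
        "table_id": "receipts",
        "sql": {"sel": 16, "agg": 0, "conds": [{"AND": [[10, 0, "이익잉여금"], [3, 0, 2020]]}]},
    }
]

buffer = io.StringIO()  # replace with open("train.jsonl", "w", encoding="utf-8") for a real file
dump_jsonl(examples, buffer)
print(buffer.getvalue())
```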

In [17]:
import records
import re
from babel.numbers import parse_decimal, NumberFormatError
# from wikisql.lib.query import Query

# Jan 3, 2019. Wonseok modify the lib. path

class DBEngine:

    def __init__(self, fdb):
        self.db = records.Database('sqlite:///{}'.format(fdb))

    def execute_query(self, table_id, query, *args, **kwargs):
        return self.execute(table_id, query.sel_index, query.agg_index, query.conditions, *args, **kwargs)

    def execute(self, table_id, select_index, aggregation_index, conditions, lower=True):
        if not table_id.startswith('table'):
            table_id = 'table_{}'.format(table_id.replace('-', '_'))
        table_info = self.db.query('SELECT sql from sqlite_master WHERE tbl_name = :name', name=table_id).all()[0].sql
        schema_str = schema_re.findall(table_info)[0]
        schema = {}
        for tup in schema_str.split(', '):
            c, t = tup.split()
            schema[c] = t
        select = 'col{}'.format(select_index)
        agg = Query.agg_ops[aggregation_index]
        if agg:
            select = '{}({})'.format(agg, select)
        where_clause = []
        where_map = {}
        for col_index, op, val in conditions:
            if lower and isinstance(val, str):
                val = val.lower()
            if schema['col{}'.format(col_index)] == 'real' and not isinstance(val, (int, float)):
                try:
                    val = float(parse_decimal(val))
                except NumberFormatError as e:
                    val = float(num_re.findall(val)[0])
            where_clause.append('col{} {} :col{}'.format(col_index, Query.cond_ops[op], col_index))
            where_map['col{}'.format(col_index)] = val
        where_str = ''
        if where_clause:
            where_str = 'WHERE ' + ' AND '.join(where_clause)
        query = 'SELECT {} AS result FROM {} {}'.format(select, table_id, where_str)
        out = self.db.query(query, **where_map)
        return [o.result for o in out]
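
The core of `execute` is the translation of indices into a parameterized query. The sketch below isolates that step and runs it with the standard-library `sqlite3` module instead of `records`; type coercion and lowercasing are left out. `build_query` is a hypothetical helper, and the table contents are invented for the demonstration.

```python
import sqlite3

AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<", "OP"]


def build_query(table_id, sel, agg, conds):
    """Return (sql, params) using symbolic col<N> names and named placeholders."""
    if not table_id.startswith("table"):
        table_id = "table_{}".format(table_id.replace("-", "_"))
    select = f"col{sel}"
    if AGG_OPS[agg]:
        select = f"{AGG_OPS[agg]}({select})"
    clauses, params = [], {}
    for col, op, val in conds:
        clauses.append(f"col{col} {COND_OPS[op]} :col{col}")
        params[f"col{col}"] = val
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    return f"SELECT {select} AS result FROM {table_id}{where}", params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1_10007452_3 (col0 TEXT, col1 TEXT)")
# Invented rows, only to have something to query.
conn.executemany(
    "INSERT INTO table_1_10007452_3 VALUES (?, ?)",
    [("1998", "maker A"), ("2001", "maker B")],
)

sql, params = build_query("1-10007452-3", sel=1, agg=0, conds=[[0, 0, "1998"]])
print(sql, params)
print([row[0] for row in conn.execute(sql, params)])
```

Because the placeholder name is derived from the column index, two conditions on the same column would overwrite each other's value; the same holds for the `where_map` in `execute`.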

In [18]:
from collections import defaultdict
from copy import deepcopy
import re


re_whitespace = re.compile(r'\s+', flags=re.UNICODE)

def detokenize(tokens):
    ret = ''
    for g, a in zip(tokens['gloss'], tokens['after']):
        ret += g + a
    return ret.strip()

class Query:

    agg_ops = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
    cond_ops = ['=', '>', '<', 'OP']
    syms = ['SELECT', 'WHERE', 'AND', 'COL', 'TABLE', 'CAPTION', 'PAGE', 'SECTION', 'OP', 'COND', 'QUESTION', 'AGG', 'AGGOPS', 'CONDOPS']

    def __init__(self, sel_index, agg_index, conditions=tuple(), ordered=False):
        self.sel_index = sel_index
        self.agg_index = agg_index
        self.conditions = list(conditions)
        self.ordered = ordered

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            indices = self.sel_index == other.sel_index and self.agg_index == other.agg_index
            if other.ordered:
                conds = [(col, op, str(cond).lower()) for col, op, cond in self.conditions] == [(col, op, str(cond).lower()) for col, op, cond in other.conditions]
            else:
                conds = set([(col, op, str(cond).lower()) for col, op, cond in self.conditions]) == set([(col, op, str(cond).lower()) for col, op, cond in other.conditions])

            return indices and conds
        return NotImplemented

    def __ne__(self, other):
        if isinstance(other, self.__class__):
            return not self.__eq__(other)
        return NotImplemented

    def __hash__(self):
        return hash(tuple(sorted(self.__dict__.items())))

    def __repr__(self):
        rep = 'SELECT {agg} {sel} FROM table'.format(
            agg=self.agg_ops[self.agg_index],
            sel='col{}'.format(self.sel_index),
        )
        if self.conditions:
            rep +=  ' WHERE ' + ' AND '.join(['{} {} {}'.format('col{}'.format(i), self.cond_ops[o], v) for i, o, v in self.conditions])
        return rep

    def to_dict(self):
        return {'sel': self.sel_index, 'agg': self.agg_index, 'conds': self.conditions}

    def lower(self):
        conds = []
        for col, op, cond in self.conditions:
            conds.append([col, op, cond.lower()])
        return self.__class__(self.sel_index, self.agg_index, conds)

    @classmethod
    def from_dict(cls, d, ordered=False):
        return cls(sel_index=d['sel'], agg_index=d['agg'], conditions=d['conds'], ordered=ordered)

    @classmethod
    def from_tokenized_dict(cls, d):
        conds = []
        for col, op, val in d['conds']:
            conds.append([col, op, detokenize(val)])
        return cls(d['sel'], d['agg'], conds)

    @classmethod
    def from_generated_dict(cls, d):
        conds = []
        for col, op, val in d['conds']:
            end = len(val['words'])
            conds.append([col, op, detokenize(val)])
        return cls(d['sel'], d['agg'], conds)

    @classmethod
    def from_sequence(cls, sequence, table, lowercase=True):
        sequence = deepcopy(sequence)
        if 'symend' in sequence['words']:
            end = sequence['words'].index('symend')
            for k, v in sequence.items():
                sequence[k] = v[:end]
        terms = [{'gloss': g, 'word': w, 'after': a} for  g, w, a in zip(sequence['gloss'], sequence['words'], sequence['after'])]
        headers = [detokenize(h) for h in table['header']]

        # lowercase everything and truncate sequence
        if lowercase:
            headers = [h.lower() for h in headers]
            for i, t in enumerate(terms):
                for k, v in t.items():
                    t[k] = v.lower()
        headers_no_whitespcae = [re.sub(re_whitespace, '', h) for h in headers]

        # get select
        if 'symselect' != terms.pop(0)['word']:
            raise Exception('Missing symselect operator')

        # get aggregation
        if 'symagg' != terms.pop(0)['word']:
            raise Exception('Missing symagg operator')
        agg_op = terms.pop(0)['word']

        if agg_op == 'symcol':
            agg_op = ''
        else:
            if 'symcol' != terms.pop(0)['word']:
                raise Exception('Missing aggregation column')
        try:
            agg_op = cls.agg_ops.index(agg_op.upper())
        except Exception as e:
            raise Exception('Invalid agg op {}'.format(agg_op))
        
        def find_column(name):
            return headers_no_whitespcae.index(re.sub(re_whitespace, '', name))

        def flatten(tokens):
            ret = {'words': [], 'after': [], 'gloss': []}
            for t in tokens:
                ret['words'].append(t['word'])
                ret['after'].append(t['after'])
                ret['gloss'].append(t['gloss'])
            return ret
        where_index = [i for i, t in enumerate(terms) if t['word'] == 'symwhere']
        where_index = where_index[0] if where_index else len(terms)
        flat = flatten(terms[:where_index])
        try:
            agg_col = find_column(detokenize(flat))
        except Exception as e:
            raise Exception('Cannot find aggregation column {}'.format(flat['words']))
        where_terms = terms[where_index+1:]

        # get conditions
        conditions = []
        while where_terms:
            t = where_terms.pop(0)
            flat = flatten(where_terms)
            if t['word'] != 'symcol':
                raise Exception('Missing conditional column {}'.format(flat['words']))
            try:
                op_index = flat['words'].index('symop')
                col_tokens = flatten(where_terms[:op_index])
            except Exception as e:
                raise Exception('Missing conditional operator {}'.format(flat['words']))
            cond_op = where_terms[op_index+1]['word']
            try:
                cond_op = cls.cond_ops.index(cond_op.upper())
            except Exception as e:
                raise Exception('Invalid cond op {}'.format(cond_op))
            try:
                cond_col = find_column(detokenize(col_tokens))
            except Exception as e:
                raise Exception('Cannot find conditional column {}'.format(col_tokens['words']))
            try:
                val_index = flat['words'].index('symcond')
            except Exception as e:
                raise Exception('Cannot find conditional value {}'.format(flat['words']))

            where_terms = where_terms[val_index+1:]
            flat = flatten(where_terms)
            val_end_index = flat['words'].index('symand') if 'symand' in flat['words'] else len(where_terms)
            cond_val = detokenize(flatten(where_terms[:val_end_index]))
            conditions.append([cond_col, cond_op, cond_val])
            where_terms = where_terms[val_end_index+1:]
        q = cls(agg_col, agg_op, conditions)
        return q

    @classmethod
    def from_partial_sequence(cls, agg_col, agg_op, sequence, table, lowercase=True):
        sequence = deepcopy(sequence)
        if 'symend' in sequence['words']:
            end = sequence['words'].index('symend')
            for k, v in sequence.items():
                sequence[k] = v[:end]
        terms = [{'gloss': g, 'word': w, 'after': a} for  g, w, a in zip(sequence['gloss'], sequence['words'], sequence['after'])]
        headers = [detokenize(h) for h in table['header']]

        # lowercase everything and truncate sequence
        if lowercase:
            headers = [h.lower() for h in headers]
            for i, t in enumerate(terms):
                for k, v in t.items():
                    t[k] = v.lower()
        headers_no_whitespcae = [re.sub(re_whitespace, '', h) for h in headers]

        def find_column(name):
            return headers_no_whitespcae.index(re.sub(re_whitespace, '', name))

        def flatten(tokens):
            ret = {'words': [], 'after': [], 'gloss': []}
            for t in tokens:
                ret['words'].append(t['word'])
                ret['after'].append(t['after'])
                ret['gloss'].append(t['gloss'])
            return ret
        where_index = [i for i, t in enumerate(terms) if t['word'] == 'symwhere']
        where_index = where_index[0] if where_index else len(terms)
        where_terms = terms[where_index+1:]

        # get conditions
        conditions = []
        while where_terms:
            t = where_terms.pop(0)
            flat = flatten(where_terms)
            if t['word'] != 'symcol':
                raise Exception('Missing conditional column {}'.format(flat['words']))
            try:
                op_index = flat['words'].index('symop')
                col_tokens = flatten(where_terms[:op_index])
            except Exception as e:
                raise Exception('Missing conditional operator {}'.format(flat['words']))
            cond_op = where_terms[op_index+1]['word']
            try:
                cond_op = cls.cond_ops.index(cond_op.upper())
            except Exception as e:
                raise Exception('Invalid cond op {}'.format(cond_op))
            try:
                cond_col = find_column(detokenize(col_tokens))
            except Exception as e:
                raise Exception('Cannot find conditional column {}'.format(col_tokens['words']))
            try:
                val_index = flat['words'].index('symcond')
            except Exception as e:
                raise Exception('Cannot find conditional value {}'.format(flat['words']))

            where_terms = where_terms[val_index+1:]
            flat = flatten(where_terms)
            val_end_index = flat['words'].index('symand') if 'symand' in flat['words'] else len(where_terms)
            cond_val = detokenize(flatten(where_terms[:val_end_index]))
            conditions.append([cond_col, cond_op, cond_val])
            where_terms = where_terms[val_end_index+1:]
        q = cls(agg_col, agg_op, conditions)
        return q
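
The comparison rule in `Query.__eq__` is worth spelling out: two queries match when `sel` and `agg` are equal and their conditions agree after converting values to lowercase strings, with order ignored unless `ordered` is set. The sketch below restates that rule on plain dictionaries; `same_query` is a hypothetical stand-alone function, not a replacement for the class.

```python
def normalize(conds):
    """Make condition values comparable: stringified and lowercased, as tuples."""
    return [(col, op, str(value).lower()) for col, op, value in conds]


def same_query(a, b, ordered=False):
    """Compare two WikiSQL `sql` dicts the way Query.__eq__ does."""
    if (a["sel"], a["agg"]) != (b["sel"], b["agg"]):
        return False
    left, right = normalize(a["conds"]), normalize(b["conds"])
    return left == right if ordered else set(left) == set(right)


gold = {"sel": 16, "agg": 0, "conds": [[10, 0, "ABC"], [3, 0, 2020]]}
pred = {"sel": 16, "agg": 0, "conds": [[3, 0, "2020"], [10, 0, "abc"]]}

print(same_query(gold, pred))                # order and case are ignored
print(same_query(gold, pred, ordered=True))  # order matters here
```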