## Construct Smaller DFs by Selecting Relevant Columns

### Classifications

| Field  | Meaning  | Values |
|---|---|---|
| ClassificationID  | Used to link entries in `ClassificationXRefs`  | int, unique |
| Classification  | Actual class values | (string, 8 levels, most common 'Documentatie')  |
| AATCN |  Classification into types of objects | (string, 38 levels, uniform, e.g. 'Boeken') |
| SubClassification | Extra classifications | (string, 16 levels, most common 'Audiovisueel') |
| SubClassification2 | More extra classifications | (string, 14 levels, uniform, e.g. 'Drukwerk') |

<br>

 - `SubClassification3` is empty
 - idea: create tuple of `(Classification, SubClassification, SubClassification2)` as the single classification feature
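
The tuple idea above could look roughly like the following. This is a minimal sketch on a tiny hand-made frame: only the column names come from the table above, the rows are invented for illustration, and missing sub-levels are replaced by empty strings so that the tuples stay comparable.

```python
import pandas as pd

# Toy stand-in for the Classifications table; the values are illustrative only.
classifications = pd.DataFrame({
    "ClassificationID": [1, 2, 3],
    "Classification": ["Documentatie", "Documentatie", "Objecten"],
    "SubClassification": ["Audiovisueel", "Audiovisueel", None],
    "SubClassification2": ["Drukwerk", None, None],
})

levels = ["Classification", "SubClassification", "SubClassification2"]

# Replace missing sub-levels by "" and combine the three levels
# into one hashable tuple per row.
filled = classifications[levels].fillna("")
classifications["ClassTuple"] = pd.Series(
    list(filled.itertuples(index=False, name=None)),
    index=classifications.index,
)

print(classifications[["ClassificationID", "ClassTuple"]])
```

Because tuples are hashable, the new column can be used directly in `value_counts()` or `groupby()`.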
 

### ClassificationXRefs

| Field  | Meaning  | Values |
|---|---|---|
| ClassificationXRefID  | Only used in this table?  | int, unique |
| ClassificationID  | Used to link entries in `Classifications` | correspond to values of <br> `Classifications.ClassificationID`, non-uniform counts  |
| ID |  Used to link entries in `Objects` | correspond to values of `Objects.ID`, uniform |
| TableID | ID of this table (for other contexts) | single value: 108 |
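
To illustrate how these keys could be used to attach classifications to objects, here is a small sketch with invented rows. It assumes that `ClassificationXRefs.ID` refers to `ObjectID` in the `Objects` table described further down (the notes are not conclusive on this), and it joins through the cross-reference table with `DataFrame.merge`.

```python
import pandas as pd

# Invented rows; only the column names follow the tables in this notebook.
objects = pd.DataFrame({"ObjectID": [10, 11], "Title": ["A", "B"]})
xrefs = pd.DataFrame({
    "ClassificationXRefID": [1, 2, 3],
    "ClassificationID": [100, 101, 100],
    "ID": [10, 10, 11],  # ASSUMPTION: references Objects.ObjectID
    "TableID": [108, 108, 108],
})
classifications = pd.DataFrame({
    "ClassificationID": [100, 101],
    "Classification": ["Documentatie", "Objecten"],
})

# Objects -> cross references -> classifications
linked = (
    objects
    .merge(xrefs, left_on="ObjectID", right_on="ID", how="left")
    .merge(classifications, on="ClassificationID", how="left")
)

print(linked[["ObjectID", "Title", "Classification"]])
```

An object with several cross-reference entries appears once per classification in `linked`, so row counts can grow after the join.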



 
### Departments

| Field  | Meaning  | Values |
|---|---|---|
| DepartmentID  | Used to link entries in Objects  | int, unique |
| Department  | Actual department names | (int, 18 levels, most not assigned, others uniform)  |
| Mnemonic |  Shorthand for the department name (field `Department`) | (same as Department) |

<br>

 - is `MainTableID` the ID of the departments in the main table? -> would be useful in that case for unification


### Objects


| Field  | Meaning  | Values |
|---|---|---|
| ObjectID  | Linked to entries in `ClassificationXRefs`  | int, unique |
| DepartmentID  | Linked to entries in `Departments` | corresponds to `Departments.DepartmentID`  |
| ClassificationID |  Linked to entries in `Classifications` <br>and `ClassificationXRefs` | corresponds to `ClassificationID` in both tables |
| ObjectName | Name of the type of object | (string, 163 levels, most common 'Foto') |
| Title | The object's title | string |
| Description | The object's description | string |
| Provenance | Description of the object's history | string |


<br>

 - `ObjectNumber` seems to be an external ID for objects (is unique, has prefixes such as "TM", "RV", "NL")
 - `SortNumber` is similar to `ObjectNumber`
 - what does `ObjectCount` indicate? does `ObjectCount > 1` imply that entries should be merged?
 - what do `DateBegin` and `DateEnd` refer to? (most objects have `DateEnd == DateBegin`, latest date is 1990)
 - same for `Dated` -> which date is this?
 - technical properties:
     - `Medium`: the object's material
     - `Dimensions`
     - `Signed`, `Inscribed`, `Markings`
     - `CreditLine`: by whom the object was given
 - `Exhibitions`: title of the exhibition the object was displayed at
 - `Provenance`: description of how the object was acquired by the museum
 - **incomplete, TODO: go through the remaining columns**









#### Other Notes

 - `EnteredDate` is in all tables, earliest values around 1995, not uniformly distributed

In [1]:
import glob
from tqdm import tqdm

import pandas as pd
import numpy as np

import pyodbc

In [2]:
server = 'tcp:azuredfserv.database.windows.net' 
database = 'Azuredf' 
username = 'Demouser' 
password = 'Knxdde#77' 
driver='{ODBC Driver 17 for SQL Server}'
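
Since the settings above are stored in plain text, one option would be to read them from environment variables instead. The following is only a sketch: the function `load_db_settings` and the variable names (`DB_SERVER`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`, `DB_DRIVER`) are hypothetical and not part of the original setup.

```python
import os


def load_db_settings(env=os.environ):
    """Collect connection settings from environment variables.

    All variable names are hypothetical; adapt them to the actual deployment.
    """
    return {
        "server": env.get("DB_SERVER", ""),
        "database": env.get("DB_NAME", ""),
        "username": env.get("DB_USER", ""),
        "password": env.get("DB_PASSWORD", ""),
        # Fall back to the driver used in this notebook.
        "driver": env.get("DB_DRIVER", "{ODBC Driver 17 for SQL Server}"),
    }


settings = load_db_settings()
print(sorted(settings))
```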

In [3]:
def table_to_DataFrame(connection, table_name, keys=None, until=None, random_n=None):
    """Read (part of) a SQL Server table into a DataFrame.

    keys: columns to select (all columns if None); until: TOP row limit;
    random_n: approximate number of rows to sample via TABLESAMPLE.
    """
    columns = ",".join(keys) if keys else "*"
    top = f"TOP {until}" if until else ""
    sample = f"TABLESAMPLE ({random_n} ROWS)" if random_n else ""
    query = f"SELECT {top} {columns} FROM {table_name} {sample};"
    print(query)
    return pd.read_sql(query, connection)


def connect_to_DB():
    return pyodbc.connect('DRIVER='+driver+';SERVER='+server+';PORT=1433;DATABASE='+database+';UID='+username+';PWD='+ password)
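
To see which SQL `table_to_DataFrame` sends without opening a connection, the string-building part can be reproduced on its own. The function `build_query` below is a hypothetical stand-alone copy of that logic for illustration; it is not part of the notebook.

```python
def build_query(table_name, keys=None, until=None, random_n=None):
    """Mirror the query construction of table_to_DataFrame (illustrative only)."""
    columns = ",".join(keys) if keys else "*"
    top = f"TOP {until}" if until else ""
    sample = f"TABLESAMPLE ({random_n} ROWS)" if random_n else ""
    return f"SELECT {top} {columns} FROM {table_name} {sample};"


# Column subset with random sampling, as used for the table dump
print(build_query("Departments", keys=("DepartmentID", "Department"), random_n=10000))

# First five rows, all columns
print(build_query("Objects", until=5))
```

Note that `TABLESAMPLE (n ROWS)` in SQL Server returns an approximate number of rows, so the size of the result can differ from `random_n`.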

### get table names

Not strictly necessary, since the tables of interest are defined manually below.

In [4]:
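from time import time
t0 = time()
with connect_to_DB() as conn:
    # list all user tables with their last modification date
    q = "SELECT t.name, t.modify_date FROM sys.tables t"
    tables = pd.read_sql(q, conn)
    tables = tables[tables.name != "Person"]  # skip the Person table
    print(tables)
print(f"elapsed: {time() - t0:.1f}s")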

OperationalError: ('HYT00', '[HYT00] [Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired (0) (SQLDriverConnect)')

### define table column keys & dump via connection

In [None]:
keys = {"Classifications": ("ClassificationID", "Classification", "AATCN", "SubClassification", "SubClassification2"),
       "ClassificationXRefs": ("ClassificationXRefID", "ClassificationID", "ID", "TableID"),
       "Departments": ("DepartmentID", "Department", "Mnemonic"),
       "Objects": ("ObjectID", "DepartmentID", "ClassificationID", "ObjectName", "Title", "Description", "Provenance")}

tables = {}
with connect_to_DB() as conn:
    for table_name, key_ls in keys.items():
        print(table_name)
        tables[table_name] = table_to_DataFrame(conn, table_name, keys=key_ls, random_n=10000)

### TODO: clean up, process, etc tables

### save tables

In [None]:
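import os
os.makedirs("tables", exist_ok=True)  # to_csv does not create missing directories
for key, tbl in tables.items():
    tbl.to_csv(f"tables/{key}.csv.gz", index=False)  # gzip inferred from the extension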
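
The saved files can later be loaded back with `pd.read_csv`, which infers gzip compression from the `.gz` extension. A small round-trip sketch, using a temporary directory and an invented frame rather than the real tables:

```python
import os
import tempfile

import pandas as pd

# Invented rows standing in for one of the dumped tables.
df = pd.DataFrame({"DepartmentID": [1, 2], "Department": ["A", "B"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Departments.csv.gz")
    df.to_csv(path, index=False)   # compressed because of the .gz suffix
    restored = pd.read_csv(path)   # decompressed automatically

print(restored)
```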