Add schema option to "read_xlsx" #11046


Open · 1 task done
omdaniel opened this issue Mar 26, 2025 · 7 comments
Labels
feature Features or general enhancements

Comments


omdaniel commented Mar 26, 2025

Is your feature request related to a problem?

Using the "spatial" extension workaround to read Excel files, I was able to run

con.read_geo("test.xlsx", table_name="newtable", layer='sheetname', open_options = ['HEADERS=AUTO','FIELD_TYPES=STRING'])
table = con.table("newtable")
table.try_cast(schema)

and then use try_cast on "newtable" to apply a schema; however, con.read_xlsx with all_varchar=True creates a table whose data is less malleable when using

table = con.read_xlsx("test.xlsx", header=True, sheet="sheetname", ignore_errors=True, all_varchar=True)
table.try_cast(schema)
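
For reference, the schema used in these snippets is an ibis.Schema; a hypothetical definition (column names taken from the test file shown later in this thread) might look like:

import ibis

# Hypothetical schema for illustration; the real names and types
# depend on the spreadsheet being read.
schema = ibis.schema(
    {
        "date_type": "date",
        "int_type": "int64",
        "time0": "time",
        "time1": "time",
        "bool0": "boolean",
        "bool1": "boolean",
    }
)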

Many of the "date" and "time" values do not cast correctly (auto-magically, as desired) when using read_xlsx.

Add an option to read_xlsx that takes a schema and runs raw SQL like the following under the hood:

CREATE TABLE test (a DOUBLE, b DOUBLE);
COPY test FROM 'test.xlsx' WITH (FORMAT xlsx, HEADER);
SELECT * FROM test;

where the Python ibis.Schema generates the raw SQL for the CREATE TABLE field types.
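
For the hypothetical schema sketched above, the generated field types would read something like CREATE TABLE test (date_type DATE, int_type BIGINT, time0 TIME, time1 TIME, bool0 BOOLEAN, bool1 BOOLEAN).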

Alternatively, create an ergonomic way to do this if the desire is to limit the API complexity of the read_xlsx method.

What is the motivation behind your request?

The workflow example can be solved with con.load_extension("excel") and a con.raw_sql, but the ergonomics would be cleaner if there were a way to get the same result within ibis in some form.

Describe the solution you'd like

Add a schema parameter to the read_xlsx method:

table = con.read_xlsx("test.xlsx", header=True, sheet="sheetname", ignore_errors=True, schema=schema)

As an alternative, this may be a support issue for the DuckDB native excel extension, but I assume their implementation choices were made for robustness, even if all_varchar=True renders data that is less malleable than what read_geo returns with FIELD_TYPES=STRING.

Below is not an implementation recommendation, just a hack to scratch my particular itch:

import pathlib

import ibis


def schema2fieldtypes(schema: ibis.Schema) -> str:
    fields = []
    for k, v in schema.as_struct().items():
        fields.append(f"{k} {v.name.upper()}")

    return ", ".join(fields).replace("FLOAT64", "FLOAT8")


def read_excel(excel_file: pathlib.Path, connection: ibis.BaseBackend, sheet_name: str = "Sheet1", schema: ibis.Schema = None, sheet_range: str = None):
    from uuid import uuid4

    table_name = f"read_excel_{uuid4()}".replace("-", "")
    connection.load_extension("excel")

    if not schema:
        return connection.read_xlsx(excel_file, sheet=sheet_name, ignore_errors=True, range=sheet_range)

    fields = f"({schema2fieldtypes(schema)})"
    sheet_range = f", RANGE '{sheet_range}'" if sheet_range else ""

    # Note the quoting: the file path and sheet name are SQL string literals.
    raw_sql = ""
    raw_sql += f"CREATE TABLE {table_name} {fields}; "
    raw_sql += f"COPY {table_name} FROM '{excel_file}' WITH (FORMAT xlsx, SHEET '{sheet_name}', IGNORE_ERRORS{sheet_range}); "
    raw_sql += f"SELECT * FROM {table_name};"

    connection.raw_sql(raw_sql)

    return connection.table(table_name)
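
A minimal usage sketch for the hack above, assuming an in-memory DuckDB connection (the file name, sheet name, and schema are placeholders, not part of the original example):

con = ibis.duckdb.connect()  # in-memory DuckDB
schema = ibis.schema({"date_type": "date", "int_type": "int64"})
t = read_excel(pathlib.Path("test.xlsx"), connection=con, sheet_name="sheetname", schema=schema)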

Here is a more robust version of my read_excel hack, if someone stumbles on this post later and finds it useful:

import ibis
import ibis.expr.schema as sch
import pathlib
from uuid import uuid4


def read_excel(excel_file: pathlib.Path, connection: ibis.BaseBackend, sheet_name: str = "Sheet1", schema: sch.Schema = None, sheet_range: str = None) -> ibis.expr.types.Table:
    """
    Reads data from an Excel sheet into an Ibis table expression.

    This function provides two modes:
    1. If no schema is provided, it uses the backend's native `read_xlsx`.
    2. If a schema is provided, it constructs and executes raw SQL to:
       a. Create a temporary table with the specified schema.
       b. Use the backend's COPY command (requires 'excel' extension)
          to load data from the Excel file into the temporary table.
       c. Return an Ibis table expression referencing the temporary table.
       This method allows for explicit schema enforcement during loading.

    :param excel_file: Path to the Excel file (.xlsx).
    :type excel_file: pathlib.Path
    :param connection: Active Ibis backend connection.
    :type connection: ibis.BaseBackend
    :param sheet_name: Name of the sheet to read within the Excel file.
    :type sheet_name: str
    :param schema: Optional Ibis schema to define table structure and types.
                   If provided, triggers the raw SQL CREATE/COPY method.
    :type schema: ibis.Schema, optional
    :param sheet_range: Optional specific cell range to read (e.g., "A1:D10").
                       Only used when a schema is provided (for the COPY command).
    :type sheet_range: str, optional
    :return: An Ibis table expression representing the data from the Excel sheet.
    :rtype: ibis.expr.types.Table
    """

    table_name = f"read_excel_{uuid4()}".replace("-", "")
    connection.load_extension("excel")

    if not schema:
        return connection.read_xlsx(excel_file, sheet=sheet_name, ignore_errors=True, range=sheet_range)

    # Get the dialect string from the connection (e.g., 'duckdb', 'postgres')
    # Using connection.name often works as the dialect identifier for sqlglot
    dialect_name = connection.name
    try:
        # Convert ibis.Schema to list of sqlglot ColumnDef objects
        column_defs = schema.to_sqlglot(dialect=dialect_name)
        # Format each ColumnDef object as SQL string and join
        fields_str = ", ".join(col_def.sql(dialect=dialect_name) for col_def in column_defs)
        fields = f"({fields_str})"
    except ImportError:
         # Fallback or raise error if sqlglot isn't available, though it usually is.
         # For now, let's raise an error if it's missing.
         raise ImportError("sqlglot is required for schema.to_sqlglot functionality.")



    sheet_range_clause = f", RANGE '{sheet_range}'" if sheet_range else ""

    raw_sql = ""
    raw_sql += f"CREATE OR REPLACE TABLE {table_name} {fields}; "
    raw_sql += f"COPY {table_name} FROM '{str(excel_file)}' WITH (FORMAT xlsx, SHEET '{sheet_name}', IGNORE_ERRORS TRUE{sheet_range_clause}); "

    connection.raw_sql(raw_sql)

    return connection.table(table_name)
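
A usage sketch with an explicit range, assuming the same placeholder names as before; the uuid4-based table name avoids collisions across repeated loads, and CREATE OR REPLACE makes re-runs safe:

con = ibis.duckdb.connect()
schema = ibis.schema(
    {
        "date_type": "date",
        "int_type": "int64",
        "time0": "time",
        "time1": "time",
        "bool0": "boolean",
        "bool1": "boolean",
    }
)
t = read_excel(
    pathlib.Path("test.xlsx"),
    connection=con,
    sheet_name="Sheet1",
    schema=schema,
    sheet_range="A1:F11",
)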

What version of ibis are you running?

10.3.1

What backend(s) are you using, if any?

DuckDB

Code of Conduct

  • I agree to follow this project's Code of Conduct
omdaniel added the feature label on Mar 26, 2025
cpcloud (Member) commented Mar 27, 2025

Thanks for the issue!

Do you have a file or some example data that doesn't work as you'd like it to that you could upload here?

omdaniel (Author) commented Mar 27, 2025

I will have to make a separate example xlsx file that surfaces the same issues I encountered with my data; I should be able to provide that in the next few days.

omdaniel (Author) commented Apr 2, 2025

test.xlsx

Below is some IPython REPL output using the test.xlsx file for these tests:

read_geo (w/ FIELD_TYPES=STRING)
cast the read_geo result

read_xlsx (w/ all_varchar=True)
cast the read_xlsx result

read_xlsx (w/ all_varchar=False)
cast the read_xlsx result

read_xlsx (w/ my custom function and a schema)

Issues encountered:

read_geo is tolerant of a blank first line after the header and will read past the end of the file; read_xlsx stops reading at a blank first line after the header and needs the "range" parameter set.

read_geo (w/ FIELD_TYPES=STRING) interprets values into a string form that is more flexible for "try_cast".

Both read_geo and read_xlsx may have parity at guessing the types with a more strictly formatted Excel file, but this should be tested (I have not done any exhaustive testing of this).

castings = dict(
    date_type="date",
    int_type="int",
    time0="time",
    time1="time",
    bool0="bool",
    bool1="bool",
)

In [87]: geo = con.read_geo("test.xlsx", open_options = ['HEADERS=AUTO','FIELD_TYPES=STRING'])

In [88]: geo
Out[88]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0  ┃ bool1  ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ string     │ string   │ string   │ string   │ string │ string │
├────────────┼──────────┼──────────┼──────────┼────────┼────────┤
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL   │ NULL   │
│ 2012/03/18 │ 125      │ NA       │ NA       │ PASS   │ NA     │
│ 2012/03/18 │ 203      │ NA       │ NA       │ PASS   │ NA     │
│ 2012/03/18 │ 404      │ 14:05:00 │ 14:08:47 │ Y      │ Y      │
│ 2012/03/18 │ 235      │ 14:05:00 │ 14:07:25 │ Y      │ Y      │
│ 2012/03/18 │ 913      │ 14:05:00 │ 14:07:58 │ Y      │ Y      │
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL   │ NULL   │
│ 2012/03/19 │ 203      │ 15:07:00 │ 15:09:30 │ Y      │ N      │
│ 2012/03/19 │ 537      │ 15:07:00 │ 15:11:03 │ Y      │ Y      │
│ 2012/03/19 │ 613      │ 15:07:00 │ ND       │ Y      │ Y      │
│ …          │ …        │ …        │ …        │ …      │ …      │
└────────────┴──────────┴──────────┴──────────┴────────┴────────┘

In [89]: geo.try_cast(castings)
Out[89]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ int64    │ time     │ time     │ boolean │ boolean │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL       │     NULL │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │      125 │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │      203 │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │      404 │ 14:05:00 │ 14:08:47 │ True    │ True    │
│ 2012-03-18 │      235 │ 14:05:00 │ 14:07:25 │ True    │ True    │
│ 2012-03-18 │      913 │ 14:05:00 │ 14:07:58 │ True    │ True    │
│ NULL       │     NULL │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-19 │      203 │ 15:07:00 │ 15:09:30 │ True    │ False   │
│ 2012-03-19 │      537 │ 15:07:00 │ 15:11:03 │ True    │ True    │
│ 2012-03-19 │      613 │ 15:07:00 │ NULL     │ True    │ True    │
│ …          │        … │ …        │ …        │ …       │ …       │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘

In [90]: xlsx = con.read_xlsx("test.xlsx", header=True, ignore_errors=True, all_varchar=True, range = "A1:F11")

In [91]: xlsx
Out[91]:
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0               ┃ time1               ┃ bool0  ┃ bool1  ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ string    │ string   │ string              │ string              │ string │ string │
├───────────┼──────────┼─────────────────────┼─────────────────────┼────────┼────────┤
│ NULL      │ NULL     │ NULL                │ NULL                │ NULL   │ NULL   │
│ 40986     │ 125      │ NA                  │ NA                  │ PASS   │ NA     │
│ 40986     │ 203      │ NA                  │ NA                  │ PASS   │ NA     │
│ 40986     │ 404      │ 0.58680555555555558 │ 0.5894328703703704  │ Y      │ Y      │
│ 40986     │ 235      │ 0.58680555555555558 │ 0.58848379629629632 │ Y      │ Y      │
│ 40986     │ 913      │ 0.58680555555555558 │ 0.58886574074074072 │ Y      │ Y      │
│ NULL      │ NULL     │ NULL                │ NULL                │ NULL   │ NULL   │
│ 40987     │ 203      │ 0.62986111111111109 │ 0.63159722222222225 │ Y      │ N      │
│ 40987     │ 537      │ 0.62986111111111109 │ 0.63267361111111109 │ Y      │ Y      │
│ 40987     │ 613      │ 0.62986111111111109 │ ND                  │ Y      │ Y      │
└───────────┴──────────┴─────────────────────┴─────────────────────┴────────┴────────┘
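
For context: 40986 is the Excel serial number for 2012-03-18, and 0.58680555… is the fractional-day encoding of 14:05:00 (0.58680555… × 24 ≈ 14.083 hours), which is why these raw strings cannot be cast directly to date or time.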

In [92]: xlsx.try_cast(castings)
Out[92]:
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date      │ int64    │ time  │ time  │ boolean │ boolean │
├───────────┼──────────┼───────┼───────┼─────────┼─────────┤
│ NULL      │     NULL │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │      125 │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │      203 │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │      404 │ NULL  │ NULL  │ True    │ True    │
│ NULL      │      235 │ NULL  │ NULL  │ True    │ True    │
│ NULL      │      913 │ NULL  │ NULL  │ True    │ True    │
│ NULL      │     NULL │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │      203 │ NULL  │ NULL  │ True    │ False   │
│ NULL      │      537 │ NULL  │ NULL  │ True    │ True    │
│ NULL      │      613 │ NULL  │ NULL  │ True    │ True    │
└───────────┴──────────┴───────┴───────┴─────────┴─────────┘
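
The NULL-heavy result above follows from try_cast semantics (a failed conversion yields NULL rather than an error) combined with the serial-number strings. A minimal sketch, assuming a DuckDB connection, of why one string casts and the other does not:

import ibis

con = ibis.duckdb.connect()

# DuckDB can parse this date string, so the cast succeeds.
ok = ibis.literal("2012/03/18").try_cast("date")
# An Excel serial number is not a parseable date string,
# so try_cast yields NULL instead of raising.
bad = ibis.literal("40986").try_cast("date")

print(con.execute(ok))   # 2012-03-18
print(con.execute(bad))  # None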

In [93]: xlsx_type_guess = con.read_xlsx("test.xlsx", header=True, ignore_errors=True, range = "A1:F11")

In [94]: xlsx_type_guess
Out[94]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ float64  │ float64  │ time     │ float64 │ float64 │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL       │     NULL │     NULL │ NULL     │    NULL │    NULL │
│ 2012-03-18 │    125.0 │     NULL │ NULL     │    NULL │    NULL │
│ 2012-03-18 │    203.0 │     NULL │ NULL     │    NULL │    NULL │
│ 2012-03-18 │    404.0 │ 0.586806 │ 14:08:47 │    NULL │    NULL │
│ 2012-03-18 │    235.0 │ 0.586806 │ 14:07:25 │    NULL │    NULL │
│ 2012-03-18 │    913.0 │ 0.586806 │ 14:07:58 │    NULL │    NULL │
│ NULL       │     NULL │     NULL │ NULL     │    NULL │    NULL │
│ 2012-03-19 │    203.0 │ 0.629861 │ 15:09:30 │    NULL │    NULL │
│ 2012-03-19 │    537.0 │ 0.629861 │ 15:11:03 │    NULL │    NULL │
│ 2012-03-19 │    613.0 │ 0.629861 │ NULL     │    NULL │    NULL │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘
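
Note the mixed inference here: time0 is read as the raw fractional-day float64 while time1 is recognized as time, so a single castings mapping cannot recover both columns.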

In [95]: xlsx_type_guess.try_cast(castings)
Out[95]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0 ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ int64    │ time  │ time     │ boolean │ boolean │
├────────────┼──────────┼───────┼──────────┼─────────┼─────────┤
│ NULL       │     NULL │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │      125 │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │      203 │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │      404 │ NULL  │ 14:08:47 │ NULL    │ NULL    │
│ 2012-03-18 │      235 │ NULL  │ 14:07:25 │ NULL    │ NULL    │
│ 2012-03-18 │      913 │ NULL  │ 14:07:58 │ NULL    │ NULL    │
│ NULL       │     NULL │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-19 │      203 │ NULL  │ 15:09:30 │ NULL    │ NULL    │
│ 2012-03-19 │      537 │ NULL  │ 15:11:03 │ NULL    │ NULL    │
│ 2012-03-19 │      613 │ NULL  │ NULL     │ NULL    │ NULL    │
└────────────┴──────────┴───────┴──────────┴─────────┴─────────┘

And using my custom reader defined in my earlier comment:

In [96]: excel_hack = read_excel("test.xlsx", connection=con, schema = geo.try_cast(castings).schema(), sheet_range = "A1:F11")
In [97]: excel_hack
Out[97]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ int64    │ time     │ time     │ boolean │ boolean │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL       │     NULL │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │      125 │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │      203 │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │      404 │ 14:05:00 │ 14:08:47 │ True    │ True    │
│ 2012-03-18 │      235 │ 14:05:00 │ 14:07:25 │ True    │ True    │
│ 2012-03-18 │      913 │ 14:05:00 │ 14:07:58 │ True    │ True    │
│ NULL       │     NULL │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-19 │      203 │ 15:07:00 │ 15:09:30 │ True    │ False   │
│ 2012-03-19 │      537 │ 15:07:00 │ 15:11:03 │ True    │ True    │
│ 2012-03-19 │      613 │ 15:07:00 │ NULL     │ True    │ True    │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘
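
With the schema supplied up front, the CREATE TABLE + COPY path recovers the time and boolean columns intact, matching the read_geo + try_cast result in Out[89].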


NickCrews (Contributor) commented

My first thought is that we should try to fix this upstream in duckdb. I would hope that, in duckdb, FROM read_xlsx(all_varchar=true) and FROM st_read(open_options = ['HEADERS=AUTO','FIELD_TYPES=STRING']) would result in the same data.

As an alternative, this may be a support issue for the DuckDB native excel extension, but I assume their implementation choices were made for robustness, even if all_varchar=True renders data that is less malleable than what read_geo returns with FIELD_TYPES=STRING.

Are you saying you don't think this is a good idea? Can you explain this further?

omdaniel (Author) commented Apr 2, 2025

I would think that FROM read_xlsx(all_varchar=true) and FROM st_read(open_options = ['HEADERS=AUTO','FIELD_TYPES=STRING']) should produce the same output; however, I do not know whether producing the same output was a goal of the DuckDB excel extension, as the st_read implementation may handle many edge cases while read_xlsx perhaps expects the Excel file to be formatted a specific way (just conjecture, as I haven't communicated with the DuckDB folks). The DuckDB API does provide a more structured option where the user creates the table with field types followed by COPY ..., and this seems to wrangle the poorly formatted Excel file into a usable form consistent with st_read (using STRINGS) followed by try_cast.

NickCrews (Contributor) commented

@omdaniel can you file an issue with duckdb and link to it here? Let's try to solve it there before we start to be hacky on our end.

omdaniel (Author) commented Apr 4, 2025

I opened an issue with the DuckDB excel extension: #40

Projects
Status: backlog