Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL queries with pd.read_sql drop first two records #10

Closed
Marigold opened this issue Dec 1, 2017 · 5 comments
Closed

SQL queries with pd.read_sql drop first two records #10

Marigold opened this issue Dec 1, 2017 · 5 comments

Comments

@Marigold
Copy link

Marigold commented Dec 1, 2017

For some reason (unknown to me), using pd.read_sql drops the second record and uses the first one as a header. Any idea why is it happening? Here's a sample code used on the Rate example

import pandas as pd
query = """
select day, value from rate limit 2;
"""
df = pd.read_sql(query, engine)
print(df)

returns

Empty DataFrame
Columns: [2017-12-01, 200]
Index: []

Thanks!

@xzkostyan
Copy link
Owner

Hi!

Can you provide more code including engine configuration and data that been inserted to rate table?

@Marigold
Copy link
Author

Marigold commented Dec 4, 2017

I followed the tuturial on the main page...

Create the table

from sqlalchemy import create_engine, Column, MetaData, literal

from clickhouse_sqlalchemy import Table, make_session, get_declarative_base, types, engines

uri = 'clickhouse://default:@localhost/test'

engine = create_engine(uri)
session = make_session(engine)
metadata = MetaData(bind=engine)

Base = get_declarative_base(metadata=metadata)

class Rate(Base):
    day = Column(types.Date, primary_key=True)
    value = Column(types.Int32)

    __table_args__ = (
        engines.Memory(),
    )

table = Rate.__table__
table.create()

insert some data

from datetime import date, timedelta
from sqlalchemy import func

today = date.today()
rates = [{'day': today - timedelta(i), 'value': 200 - i} for i in range(100)]

# Emits single INSERT statement.
session.execute(table.insert(), rates)

and read it with pandas

import pandas as pd
query = """
select day, value from rate limit 2;
"""
df = pd.read_sql(query, engine)
print(df)

let me know if you need anything else. Thanks!

@xzkostyan
Copy link
Owner

Hi, @Marigold!

Sorry for late response.

Short answer: use native interface uri = 'clickhouse+native://default:@localhost/test' and it'll be fine.

Long answer. Pandas read_sql method actually emits two queries:

  • EXISTS TABLE select day, value from rate limit 2;
  • select day, value from rate limit 2;

There are two interfaces supported by clickhouse-sqlalchemy:

  • HTTP. It is the first one, but it has some limitations. For example we need to add FORMAT TabSeparatedWithNamesAndTypes for parsing on the python side. This interface is default now. This behavior will be changed in future releases.
  • TCP. This is the same interface used by clickhouse-client. This interface is more flexible. It appeared few moths after HTTP interface was implemented in this library.

In case of HTTP interface following SQL is rendered: EXISTS TABLE select day, value from rate limit 2; FORMAT TabSeparatedWithNamesAndTypes. Trailing semicolon doesn't make any sense, query will fail even without it. See ClickHouse logs /var/log/clickhouse-server/clickhouse-server.log. In read_sql terms this table doesn't exist if EXISTS query is failed.

You should use TCP interface. It hasn't problems with FORMAT clause.

@Marigold
Copy link
Author

Awesome response, thanks a lot for putting your time into this project!

@amoskaliov
Copy link

amoskaliov commented Nov 1, 2019

For those who have googled this issue:

  1. You can use pd.read_sql_query because it doesn't emit first query EXISTS ... but only executes your query.
  2. To keep first rows in place you have to add FORMAT TabSeparatedWithNamesAndTypes to your query explicitly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants