Properly reading results from Hive queries in Pandas in Python 3 #94
Comments
Hi Amelio, I am taking a look into this and will recommend a way to achieve it. BTW, do you have any command id to share? Thanks.
OK, got the issue. Can you change the delimiter to chr(9), which is the correct ASCII code for tab (\t), and see if that helps?
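To illustrate the suggestion above: chr(9) is just the tab character, and pandas will happily accept it as a separator. A minimal sketch (the column names and data here are made up, not from the actual query):

```python
import io

import pandas as pd

# chr(9) is the ASCII tab character, identical to the '\t' escape.
assert chr(9) == '\t'

# Illustrative tab-delimited output parsed with pandas.
raw = "id\tname\n1\talice\n2\tbob\n"
df = pd.read_csv(io.StringIO(raw), sep=chr(9))
print(df.shape)  # (2, 2)
```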
Thanks @msumit. I changed the delimiter to
You can find the query here.
While we get to the root of the issue, one temporary way to handle it could be like this:
Thanks @msumit, I tried that command without luck. See this thread on SO for more info. Also, even with the accepted solution on that question, the resulting text is not directly readable by Pandas (
One more note: if possible, it would be great to be able to work directly with delimiters like
So, I tried the following:
Then I was able to read it with Pandas. (Note that I haven't used Pandas before, so I don't know what a dataframe is. :-) )
Then I tried specifying the delimiter in Pandas:
Then I went and removed the first line from
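The steps above (treating tab as the delimiter and dropping the unusable first line) can be reconstructed roughly as follows; the file contents and column names here are assumptions for illustration:

```python
import io

import pandas as pd

# Assumed raw output: the first line is junk, fields are tab-separated,
# and there is no usable header row.
raw = "some junk first line\n1\talice\n2\tbob\n"
df = pd.read_csv(io.StringIO(raw), sep='\t', skiprows=1, header=None,
                 names=['id', 'name'])
print(df['name'].tolist())  # ['alice', 'bob']
```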
Right, thanks @mindprince. That's exactly what I (and others here) currently do to load DataFrames in Pandas (i.e. the data structure that holds tables in Pandas). See my comment at the top of my post:
And keep in mind that even in the last step you posted (the only one that extracted the columns correctly), you still don't have the DataFrame properly loaded: the column names are missing, since they are in the first line that you discard. I do have code to do all this and parse the top line separately in Python, so that I can invoke
I really appreciate your help on this. One more note: considering that Pandas is by far the most widely used and de facto standard library for reading, processing and manipulating tabular data in Python, it would be really helpful to have
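Parsing the header line separately and handing the recovered names to pandas, as described above, can be sketched like this (the data is assumed, not the poster's actual code):

```python
import io

import pandas as pd

# Read the first line as the header, then let pandas parse the rest.
raw = "id\tname\n1\talice\n2\tbob\n"
buf = io.StringIO(raw)
columns = buf.readline().rstrip('\n').split('\t')
df = pd.read_csv(buf, sep='\t', header=None, names=columns)
print(list(df.columns))  # ['id', 'name']
```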
#95 should fix the issue you are facing. After that change is in, you would be able to do the following:
Thanks for reporting this!
Fantastic. Thanks @mindprince for the quick response! Looking forward to testing it.
Thanks @mindprince and @msumit, I tried it and it works great. One question though: why are we restricted to
I agree that we shouldn't be restricted to
When you specify
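On the pandas side, at least, there is no restriction to tab: read_csv accepts any single-character separator, and even a regular expression (which switches to the slower python engine). A small sketch with made-up pipe-delimited data:

```python
import io

import pandas as pd

# Any single-character separator works with read_csv.
raw = "1|alice\n2|bob\n"
df = pd.read_csv(io.StringIO(raw), sep='|', header=None, names=['id', 'name'])
print(df['name'].tolist())  # ['alice', 'bob']
```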
I need to know whether I can use a pandas method like read_sql or read_sql_query for an application where I need to run queries like SET ...;, USE database;, and then a SELECT query from some table of that database. It's giving me an error.
@Geetanjli015 How is this question related to qds-sdk-py? I am not sure how we would be calling read_sql without a driver that can connect to a service in QDS. Could you elaborate a bit? The current issue was about the CSV file; the read_sql methods in pandas work quite differently from a normal file read, so this seems unrelated to this particular issue.
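To make the distinction concrete: pandas.read_sql needs a live DB-API connection (or an SQLAlchemy engine); it does not read a results file from disk. A minimal sketch using sqlite3 (chosen only for illustration, since qds-sdk does not provide such a driver):

```python
import sqlite3

import pandas as pd

# read_sql queries a live connection, unlike read_csv which parses a file.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (id INTEGER, name TEXT)')
conn.execute("INSERT INTO t VALUES (1, 'alice')")
df = pd.read_sql('SELECT * FROM t', conn)
print(df.shape)  # (1, 2)
```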
What is the best way to read the output from disk with Pandas after using cmd.get_results? (e.g. from a Hive command). For example, consider the following:

If, after successfully running the query, I then inspect the first few bytes of results.csv, I see the following:

When I try to open this in Pandas:

df = pd.read_csv('results.csv')

it obviously doesn't work (I get an empty DataFrame), since it isn't properly formatted as a CSV file.

While I could try to open results.csv and post-process it (to remove b', etc.) before I open it in Pandas, this would be a quite hacky way to load it. Am I using the interface correctly? This is using the very latest version of qds_sdk: 1.4.2, from three hours ago.
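The post-processing mentioned above could look something like the following; this is one hypothetical workaround, not the library's intended interface, and the sample file contents are assumed. It strips the b'...' byte-literal wrappers from each line and decodes escape sequences such as \t before handing the text to pandas:

```python
import io

import pandas as pd

def clean_line(line: str) -> str:
    """Strip a b'...' bytes-repr wrapper and decode escapes like \\t."""
    line = line.rstrip('\n')
    if line.startswith("b'") and line.endswith("'"):
        line = line[2:-1].encode().decode('unicode_escape')
    return line

# Assumed raw lines, mimicking a bytes repr written to the results file.
raw_lines = ["b'1\\talice'", "b'2\\tbob'"]
text = '\n'.join(clean_line(l) for l in raw_lines)
df = pd.read_csv(io.StringIO(text), sep='\t', header=None, names=['id', 'name'])
print(df['name'].tolist())  # ['alice', 'bob']
```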