
Properly reading results from Hive queries in Pandas in Python 3 #94

Closed
amelio-vazquez-reina opened this issue Mar 26, 2015 · 13 comments

@amelio-vazquez-reina

What is the best way to read the output from disk with Pandas after using cmd.get_results (e.g. from a Hive command)?

For example, consider the following:

from qds_sdk.qubole import Qubole
from qds_sdk.commands import HiveCommand

out_file = 'results.csv'
delimiter = chr(1)
....

Qubole.configure(qubole_key)
hc_params = ['--query', query]
hive_args = HiveCommand.parse(hc_params)
cmd = HiveCommand.run(**hive_args)
if HiveCommand.is_success(cmd.status):
    with open(out_file, 'wt') as writer:
        cmd.get_results(writer, delim=delimiter, inline=False)

If, after successfully running the query, I then inspect the first few bytes of results.csv, I see the following:

$ head -c 300 results.csv
b'flight_uid\twinning_price\tbid_price\timpressions_source_timestamp\n'b'0FY6ZsrnMy\x012000\x012270.0\x011427243278000\n0FamrXG9AW\x01710\x01747.0\x011427243733000\n0FY6ZsrnMy\x012000\x012270.0\x011427245266000\n0FY6ZsrnMy\x012000\x012270.0\x011427245088000\n0FamrXG9AW\x01330\x01747.0\x011427243407000\n0FamrXG9AW\x01710\x01747.0\x011427243981000\n0FamrXG9AW\x01490\x01747.0\x011427245289000\n

When I try to open this in Pandas:

df = pd.read_csv('results.csv')

it obviously doesn't work (I get an empty DataFrame), since the file isn't properly formatted as CSV.

While I could open results.csv and post-process it (to remove the b' prefixes, etc.) before loading it in Pandas, that would be quite a hacky way to do it.
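Concretely, the post-processing I have in mind would look something like this (just a sketch; the regex is naive and would break if a field ever contained a single quote):

import ast
import re

# The file is a concatenation of Python bytes reprs like b'...'b'...';
# re-parse each repr back into real bytes, then rewrite as plain text.
with open('results.csv') as f:
    raw = f.read()

chunks = re.findall(r"b'[^']*'", raw)  # naive: breaks on embedded quotes
data = b''.join(ast.literal_eval(c) for c in chunks)

with open('results_clean.csv', 'w') as f:
    f.write(data.decode('utf-8').replace('\x01', '\t'))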

Am I using the interface correctly? This is with the very latest version of qds_sdk: 1.4.2, from three hours ago.

@amelio-vazquez-reina amelio-vazquez-reina changed the title Properly reading the data in Pandas in Python 3 Properly reading results from Hive queries in Pandas in Python 3 Mar 26, 2015
@msumit
Contributor

msumit commented Mar 26, 2015

Hi Amelio,

I am looking into this and will recommend a way to achieve it. BTW, do you have a command id to share?

Thanks,
Sumit

@msumit
Contributor

msumit commented Mar 26, 2015

OK, got the issue. Can you change the delimiter to chr(9), which is the correct ASCII code for tab (\t), and see if that helps?

@amelio-vazquez-reina
Author

Thanks @msumit. I changed the delimiter to chr(9) and got the following, which is still not readable by pd.read_csv() or other Pandas I/O methods (pd.read_csv will read it, but it won't detect and parse the rows and columns properly, putting everything in one massive row):

b'flight_uid\twinning_price\tbid_price\timpressions_source_timestamp\n'b'0FY6ZsrnMy\x012000\x012270.0\x011427243278000\n0FamrXG9AW\x01710\x01747.0\x011427243733000\n0FY6ZsrnMy\x012000\x012270.0\x011427245266000\n0FY6ZsrnMy\x012000\x012270.0\x011427245088000\n0FamrXG9AW\x01330\x01747.0\x011427243407000\n0FamrXG9AW\x01710\x01747.0\x011427243981000\n0FamrXG9AW\x01490\x01747.0\x011427245289000\n0FamrXG9AW\x01735\x01747.0\x011427244634000\n0FamrXG9AW\x01420\x01747.0\x011427245595000\n0FamrXG9AW\x01470\x01747.0\x011427242443000\n0FK9yvBt9B\x011050\x011295.0\x011427242253000\n0FK9yvBt9B\x011050\x0112%

You can find the query here.

@msumit
Contributor

msumit commented Mar 27, 2015

While we get to the root of the issue, one temporary way to handle it could be like this:

cat results.csv | tr $'\x01' \\t > result.csv
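Or, equivalently, in Python (a sketch of the same byte substitution):

with open('results.csv', 'rb') as src, open('result.csv', 'wb') as dst:
    # replace every chr(1) byte with a tab, like the tr command above
    dst.write(src.read().replace(b'\x01', b'\t'))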

@amelio-vazquez-reina
Author

Thanks @msumit, I tried that command without luck. See this thread on SO for more info. Also, even with the accepted solution on that question, the resulting text is not directly readable by Pandas (the b' prefix and the quotes confuse it).

One more note: if possible, it would be great to be able to work directly with delimiters like chr(1) (or other general delimiters) in Pandas, i.e. not having to use \t. It's not unusual for columns (not necessarily in Qubole / Hive) to contain \t (e.g. if they contain fields from text forms).
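(For what it's worth, pd.read_csv already accepts arbitrary single-character separators, so something like this should work once the file itself is consistently delimited - a sketch:)

import pandas as pd

# assuming results.csv is consistently chr(1)-delimited, header included
df = pd.read_csv('results.csv', sep=chr(1))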

@rohitagarwal003
Contributor

Hi @amelio-vazquez-reina

So, I tried the following:

>>> import qds_sdk
>>> from qds_sdk.qubole import Qubole
>>> Qubole.configure(api_token)
>>> from qds_sdk.commands import Command
>>> cmd = Command.find(5193554)
>>> out_file = 'results-py3.csv'
>>> with open(out_file, 'wb') as writer:  # notice the binary mode here - only needed in Python 3; with Python 2 I used text mode
...     cmd.get_results(writer, inline=False)

Then I was able to read it with Pandas. (Note that I haven't used Pandas before, so I don't know what a dataframe is. :-) )

>>> import pandas
>>> x = pandas.read_csv('results-py3.csv')
>>> x
      flight_uid\twinning_price\tbid_price\timpressions_source_timestamp
0                   0FY6ZsrnMy20002270.01427243278000
1                     0FamrXG9AW710747.01427243733000
2                   0FY6ZsrnMy20002270.01427245266000
...                                                  ...
28838                 0FamrXG9AW630747.01427225265000
28839                 0FamrXG9AW694747.01427228053000

[28840 rows x 1 columns]

Then I tried specifying the delimiter in Pandas:

>>> x = pandas.read_csv('results-py3.csv', delimiter=chr(1))
>>> x
                      flight_uid\twinning_price\tbid_price\timpressions_source_timestamp
0FY6ZsrnMy 2000 2270                                      1427243278000
0FamrXG9AW 710  747                                       1427243733000
0FY6ZsrnMy 2000 2270                                      1427245266000
                2270                                      1427245088000
...                                                                 ...
                272                                       1427225392000
0FamrXG9AW 60   747                                       1427225264000
           630  747                                       1427225265000
           694  747                                       1427228053000

[28840 rows x 1 columns]

Then I removed the first line from results-py3.csv, which contained the tab-delimited header.

>>> x = pandas.read_csv('results.csv', delimiter=chr(1))
>>> x
       0FY6ZsrnMy  2000  2270.0  1427243278000
0      0FamrXG9AW   710     747  1427243733000
1      0FY6ZsrnMy  2000    2270  1427245266000
...           ...   ...     ...            ...
28837  0FamrXG9AW   630     747  1427225265000
28838  0FamrXG9AW   694     747  1427228053000

[28839 rows x 4 columns]

@amelio-vazquez-reina
Author

Right, thanks @mindprince. That's exactly what I (and others here) currently do to load DataFrames in Pandas (i.e. the data structure that holds tables in Pandas). See my comment at the top of my post:

While I could try to open results.csv and post-process it (to remove b', etc.) before I open it in Pandas, this would be a quite hacky way to load it.

And keep in mind that even in the last step you posted (the only one that extracted the columns correctly), the DataFrame is still not properly loaded: the column names are missing, since they are in the first line that you discarded. I do have code that does all this, parsing the top line separately in Python so that I can invoke pd.read_csv with the field names (roughly the sketch below), but again, this is just too cumbersome.
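A sketch of that workaround (same idea as my actual code; the file name is just illustrative):

import pandas as pd

# Parse the tab-delimited header line separately, then read the
# chr(1)-delimited body with those column names.
with open('results-py3.csv') as f:
    header = f.readline().rstrip('\n').split('\t')

df = pd.read_csv('results-py3.csv', delimiter=chr(1), skiprows=1, names=header)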

I really appreciate your help on this. One more note: considering that Pandas is by far the most widely used, de facto standard library for reading, processing and manipulating tabular data in Python, it would be really helpful to have qds-sdk store the results in a way that can be directly and properly opened in Pandas. Just a suggestion.

@rohitagarwal003
Contributor

Hi @amelio-vazquez-reina

#95 should fix the issue you are facing.

After that change is in, you would be able to do the following:

>>> with open(out_file, 'wb') as writer:
...     cmd.get_results(writer, delim='\t', inline=False)
>>> x = pandas.read_csv(out_file, delimiter='\t')
>>> x
       flight_uid  winning_price  bid_price  impressions_source_timestamp
0      0FY6ZsrnMy           2000       2270                 1427243278000
1      0FamrXG9AW            710        747                 1427243733000
...           ...            ...        ...                           ...
17996  0FY6ZsrnMy           2000       2270                 1427183672000
17997  0F4MfJS6JH            190        544                 1427183624000

[17998 rows x 4 columns]

Thanks for reporting this!

@amelio-vazquez-reina
Author

Fantastic. Thanks @mindprince for the quick response! Looking forward to testing it.

@amelio-vazquez-reina
Author

Thanks @mindprince and @msumit, I tried it and it works great. One question though: why are we restricted to '\t' (i.e. chr(9))? IIRC, Hive used to support only '\t' as the field separator, but I thought we had upgraded to a newer version with support for other delimiters.

@rohitagarwal003
Contributor

Hi @amelio-vazquez-reina

I agree that we shouldn't be restricted to \t.

When you specify get_results(delim=...), the SDK replaces all occurrences of chr(1) in the result with delim. Currently, the results coming from the API have \t as the delimiter in header and chr(1) as the delimiter in all other rows. I have started a discussion internally to see why that is the case. Once it is fixed (if it is indeed a bug), then you would be able to use any delimiter.
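To illustrate the effect (this is not the SDK's actual implementation, just the substitution it performs):

>>> raw = b'0FY6ZsrnMy\x012000\x012270.0\x011427243278000\n'
>>> delim = '\t'
>>> raw.replace(b'\x01', delim.encode())
b'0FY6ZsrnMy\t2000\t2270.0\t1427243278000\n'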

@Geetanjli015

Geetanjli015 commented Mar 22, 2018

I need to know if I can use Pandas methods like read_sql or read_sql_query for an application where I need to run queries like SET ...;, USE database;, and then a SELECT query from some table of that database.

It's giving me an error.

@adityaka
Contributor

@Geetanjli015 How is the question related to qds-sdk-py? I am not sure how we are calling read_sql without a driver that can connect to a service in QDS. Could you elaborate a bit?

The current issue was about reading results from a CSV file; the read_sql methods in Pandas work a bit differently from a normal file read, so this seems unrelated to this particular issue.
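For context, pd.read_sql expects a live DB-API connection or SQLAlchemy engine, along these lines (a hypothetical setup using PyHive, which is unrelated to qds-sdk-py; host, database and table names are placeholders):

import pandas as pd
from pyhive import hive  # hypothetical driver choice, not part of qds-sdk-py

conn = hive.Connection(host='your-hive-host', port=10000)

# SET/USE are statements, not queries - run them on a cursor first.
cur = conn.cursor()
cur.execute('USE some_database')

df = pd.read_sql('SELECT * FROM some_table LIMIT 10', conn)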
