
Properly reading results from Hive queries in Pandas in Python 3 #94

Closed
amelio-vazquez-reina opened this issue Mar 26, 2015 · 13 comments

@amelio-vazquez-reina

What is the best way to read the output from disk with Pandas after using cmd.get_results (e.g. from a Hive command)?

For example, consider the following:

from qds_sdk.qubole import Qubole
from qds_sdk.commands import HiveCommand

out_file = 'results.csv'
delimiter = chr(1)
....

Qubole.configure(qubole_key)
hc_params = ['--query', query]
hive_args = HiveCommand.parse(hc_params)
cmd = HiveCommand.run(**hive_args)
if HiveCommand.is_success(cmd.status):
    with open(out_file, 'wt') as writer:
        cmd.get_results(writer, delim=delimiter, inline=False)

If, after successfully running the query, I then inspect the first few bytes of results.csv, I see the following:

$ head -c 300 results.csv
b'flight_uid\twinning_price\tbid_price\timpressions_source_timestamp\n'b'0FY6ZsrnMy\x012000\x012270.0\x011427243278000\n0FamrXG9AW\x01710\x01747.0\x011427243733000\n0FY6ZsrnMy\x012000\x012270.0\x011427245266000\n0FY6ZsrnMy\x012000\x012270.0\x011427245088000\n0FamrXG9AW\x01330\x01747.0\x011427243407000\n0FamrXG9AW\x01710\x01747.0\x011427243981000\n0FamrXG9AW\x01490\x01747.0\x011427245289000\n

When I try to open this in Pandas:

df = pd.read_csv('results.csv')

it obviously doesn't work (I get an empty DataFrame), since the file isn't properly formatted as CSV.

While I could open results.csv and post-process it (to remove the b' prefixes, etc.) before loading it in Pandas, that would be quite a hacky way to do it.
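Concretely, the post-processing I have in mind would look something like this (just a sketch; the regex is naive and would break if a field ever contained a single quote):

import ast
import re

# The file is a concatenation of Python bytes reprs like b'...'b'...';
# re-parse each repr back into real bytes, then rewrite as plain text.
with open('results.csv') as f:
    raw = f.read()

chunks = re.findall(r"b'[^']*'", raw)  # naive: breaks on embedded quotes
data = b''.join(ast.literal_eval(c) for c in chunks)

with open('results_clean.csv', 'w') as f:
    f.write(data.decode('utf-8').replace('\x01', '\t'))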

Am I using the interface correctly? This is with the very latest version of qds_sdk: 1.4.2, from three hours ago.

@amelio-vazquez-reina amelio-vazquez-reina changed the title Properly reading the data in Pandas in Python 3 Properly reading results from Hive queries in Pandas in Python 3 Mar 26, 2015
@msumit
Contributor

msumit commented Mar 26, 2015

Hi Amelio,

I am looking into this and will recommend a way to achieve it. BTW, do you have a command id to share?

Thanks,
Sumit

@msumit
Contributor

msumit commented Mar 26, 2015

OK, got the issue. Can you change the delimiter to chr(9), which is the correct ASCII code for tab (\t), and see if that helps?

@amelio-vazquez-reina
Author

Thanks @msumit. I changed the delimiter to chr(9) and got the following, which is still not readable by pd.read_csv() or other Pandas I/O methods (pd.read_csv will read it, but it won't detect and parse the rows and columns properly, putting everything in one massive row):

b'flight_uid\twinning_price\tbid_price\timpressions_source_timestamp\n'b'0FY6ZsrnMy\x012000\x012270.0\x011427243278000\n0FamrXG9AW\x01710\x01747.0\x011427243733000\n0FY6ZsrnMy\x012000\x012270.0\x011427245266000\n0FY6ZsrnMy\x012000\x012270.0\x011427245088000\n0FamrXG9AW\x01330\x01747.0\x011427243407000\n0FamrXG9AW\x01710\x01747.0\x011427243981000\n0FamrXG9AW\x01490\x01747.0\x011427245289000\n0FamrXG9AW\x01735\x01747.0\x011427244634000\n0FamrXG9AW\x01420\x01747.0\x011427245595000\n0FamrXG9AW\x01470\x01747.0\x011427242443000\n0FK9yvBt9B\x011050\x011295.0\x011427242253000\n0FK9yvBt9B\x011050\x0112%

You can find the query here.

@msumit
Contributor

msumit commented Mar 27, 2015

While we get to the root of the issue, one temporary way to handle it could be like this:

cat results.csv | tr $'\x01' \\t > result.csv
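Or, equivalently, in Python (a sketch of the same byte substitution):

with open('results.csv', 'rb') as src, open('result.csv', 'wb') as dst:
    # replace every chr(1) byte with a tab, like the tr command above
    dst.write(src.read().replace(b'\x01', b'\t'))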

@amelio-vazquez-reina
Author

Thanks @msumit, I tried that command without luck. See this thread on SO for more info. Also, even with the accepted solution on that question, the resulting text is not directly readable by Pandas (the b' prefix and the quotes confuse it).

One more note: if possible, it would be great to be able to work directly with delimiters like chr(1) (or other general delimiters) in Pandas, i.e. not having to use \t. It's not unusual for columns (not necessarily in Qubole / Hive) to contain \t (e.g. if they contain fields from text forms).
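(For what it's worth, pd.read_csv already accepts arbitrary single-character separators, so something like this should work once the file itself is consistently delimited - a sketch:)

import pandas as pd

# assuming results.csv is consistently chr(1)-delimited, header included
df = pd.read_csv('results.csv', sep=chr(1))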

@rohitagarwal003
Contributor

Hi @amelio-vazquez-reina

So, I tried the following:

>>> import qds_sdk
>>> from qds_sdk.qubole import Qubole
>>> Qubole.configure(api_token)
>>> from qds_sdk.commands import Command
>>> cmd = Command.find(5193554)
>>> out_file = 'results-py3.csv'
>>> with open(out_file, 'wb') as writer:  # notice the binary mode here - only needed in Python 3; with Python 2 I used text mode
...     cmd.get_results(writer, inline=False)

Then I was able to read it with Pandas. (Note that I haven't used Pandas before, so I don't know what a dataframe is. :-) )

>>> import pandas
>>> x = pandas.read_csv('results-py3.csv')
>>> x
      flight_uid\twinning_price\tbid_price\timpressions_source_timestamp
0                   0FY6ZsrnMy20002270.01427243278000
1                     0FamrXG9AW710747.01427243733000
2                   0FY6ZsrnMy20002270.01427245266000
...                                                  ...
28838                 0FamrXG9AW630747.01427225265000
28839                 0FamrXG9AW694747.01427228053000

[28840 rows x 1 columns]

Then I tried specifying the delimiter in Pandas:

>>> x = pandas.read_csv('results-py3.csv', delimiter=chr(1))
>>> x
                      flight_uid\twinning_price\tbid_price\timpressions_source_timestamp
0FY6ZsrnMy 2000 2270                                      1427243278000
0FamrXG9AW 710  747                                       1427243733000
0FY6ZsrnMy 2000 2270                                      1427245266000
                2270                                      1427245088000
...                                                                 ...
                272                                       1427225392000
0FamrXG9AW 60   747                                       1427225264000
           630  747                                       1427225265000
           694  747                                       1427228053000

[28840 rows x 1 columns]

Then I removed the first line from results-py3.csv, which contained the tab-delimited header.

>>> x = pandas.read_csv('results.csv', delimiter=chr(1))
>>> x
       0FY6ZsrnMy  2000  2270.0  1427243278000
0      0FamrXG9AW   710     747  1427243733000
1      0FY6ZsrnMy  2000    2270  1427245266000
...           ...   ...     ...            ...
28837  0FamrXG9AW   630     747  1427225265000
28838  0FamrXG9AW   694     747  1427228053000

[28839 rows x 4 columns]

@amelio-vazquez-reina
Author

Right, thanks @mindprince. That's exactly what I (and others here) currently do to load DataFrames in Pandas (i.e. the data structure that holds tables in Pandas). See my comment at the top of my post:

While I could try to open results.csv and post-process it (to remove b', etc.) before I open it in Pandas, this would be a quite hacky way to load it.

And keep in mind that even in the last step you posted (the only one that extracted the columns correctly), the DataFrame is still not properly loaded: the column names are missing, since they are in the first line that you discarded. I do have code that does all this, parsing the top line separately in Python so that I can invoke pd.read_csv with the field names (roughly the sketch below), but again, this is just too cumbersome.
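A sketch of that workaround (same idea as my actual code; the file name is just illustrative):

import pandas as pd

# Parse the tab-delimited header line separately, then read the
# chr(1)-delimited body with those column names.
with open('results-py3.csv') as f:
    header = f.readline().rstrip('\n').split('\t')

df = pd.read_csv('results-py3.csv', delimiter=chr(1), skiprows=1, names=header)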

I really appreciate your help on this. One more note: considering that Pandas is by far the most widely used, de facto standard library for reading, processing and manipulating tabular data in Python, it would be really helpful to have qds-sdk store the results in a way that can be directly and properly opened in Pandas. Just a suggestion.

@rohitagarwal003
Contributor

Hi @amelio-vazquez-reina

#95 should fix the issue you are facing.

After that change is in, you would be able to do the following:

>>> with open(out_file, 'wb') as writer:
...     cmd.get_results(writer, delim='\t', inline=False)
>>> x = pandas.read_csv(out_file, delimiter='\t')
>>> x
       flight_uid  winning_price  bid_price  impressions_source_timestamp
0      0FY6ZsrnMy           2000       2270                 1427243278000
1      0FamrXG9AW            710        747                 1427243733000
...           ...            ...        ...                           ...
17996  0FY6ZsrnMy           2000       2270                 1427183672000
17997  0F4MfJS6JH            190        544                 1427183624000

[17998 rows x 4 columns]

Thanks for reporting this!

@amelio-vazquez-reina
Author

Fantastic. Thanks @mindprince for the quick response! Looking forward to testing it.

@amelio-vazquez-reina
Author

Thanks @mindprince and @msumit, I tried it and it works great. One question though: why are we restricted to '\t' (i.e. chr(9))? IIRC, Hive used to support only '\t' as the field separator, but I thought we had upgraded to a newer version with support for other delimiters.

@rohitagarwal003
Contributor

Hi @amelio-vazquez-reina

I agree that we shouldn't be restricted to \t.

When you specify get_results(delim=...), the SDK replaces all occurrences of chr(1) in the result with delim. Currently, the results coming from the API have \t as the delimiter in header and chr(1) as the delimiter in all other rows. I have started a discussion internally to see why that is the case. Once it is fixed (if it is indeed a bug), then you would be able to use any delimiter.
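To illustrate the effect (this is not the SDK's actual implementation, just the substitution it performs):

>>> raw = b'0FY6ZsrnMy\x012000\x012270.0\x011427243278000\n'
>>> delim = '\t'
>>> raw.replace(b'\x01', delim.encode())
b'0FY6ZsrnMy\t2000\t2270.0\t1427243278000\n'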

@Geetanjli015

Geetanjli015 commented Mar 22, 2018

I need to know if I can use Pandas methods like read_sql or read_sql_query for an application where I need to run queries like SET ...;, USE database;, and then a SELECT query from some table of that database.

It's giving me an error.

@adityaka
Contributor

@Geetanjli015 How is the question related to qds-sdk-py? I am not sure how we are calling read_sql without a driver that can connect to a service in QDS. Could you elaborate a bit?

The current issue was about reading results from a CSV file; the read_sql methods in Pandas work a bit differently from a normal file read, so this seems unrelated to this particular issue.
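For context, pd.read_sql expects a live DB-API connection or SQLAlchemy engine, along these lines (a hypothetical setup using PyHive, which is unrelated to qds-sdk-py; host, database and table names are placeholders):

import pandas as pd
from pyhive import hive  # hypothetical driver choice, not part of qds-sdk-py

conn = hive.Connection(host='your-hive-host', port=10000)

# SET/USE are statements, not queries - run them on a cursor first.
cur = conn.cursor()
cur.execute('USE some_database')

df = pd.read_sql('SELECT * FROM some_table LIMIT 10', conn)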
