
Cannot convert numbers to strings when reading an Excel spreadsheet #5891

Closed
deanjones opened this issue Jan 9, 2014 · 34 comments
Labels: Bug · Dtype Conversions (unexpected or buggy dtype conversions) · IO CSV (read_csv, to_csv) · IO Excel (read_excel, to_excel)
@deanjones

I'm reading some Excel spreadsheets (xlsx format) into pandas using read_excel, which generally works great. The problem I have is that when a column contains numbers, pandas converts these to float64 type, and I would like them to be treated as strings. After reading them in, I can convert the column to str:

 my_frame.my_col = my_frame.my_col.astype('str') 

This works as far as assigning the right type to the column, but when I view the values in this column, the strings are formatted in scientific notation, e.g. 8.027770e+14, which is not what I want. I'd like to be able to tell pandas to read the columns as strings. My current solution involves dropping down to xlrd to read the spreadsheet.

See stackoverflow question: http://stackoverflow.com/q/20970483/690890
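For reference, here is a minimal sketch of the behaviour described above and one possible workaround (the value and column name are made up for illustration):

import pandas as pd

df = pd.DataFrame({'my_col': [802777000000000.0]})  # read in as float64

# astype(str) stringifies the float repr, which (in the pandas version in use
# here) comes out in scientific notation, e.g. '8.02777e+14'.
as_str = df['my_col'].astype(str)

# One workaround: format each float explicitly instead of relying on astype.
fixed = df['my_col'].map('{:.0f}'.format)  # '802777000000000'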

@jreback (Contributor) commented Jan 9, 2014

Have you tried passing the dtype argument? (It's documented in read_csv.)

@deanjones (Author)

Thanks, but the dtype argument isn't available for read_excel

@jreback (Contributor) commented Jan 9, 2014

Did you try it? It passes other keywords through.

@deanjones (Author)

Yes. I get 'ValueError: dtype is not supported with python parser'

@jreback (Contributor) commented Jan 9, 2014

Ah, sorry. This is related to #5232, which is currently an open bug.

Not sure if there is a workaround at the moment.

@jtratner ?

@jmcnamara (Contributor)

I'd like to be able to tell pandas to read the columns as strings. My current solution involves dropping down to xlrd to read the spreadsheet.

How exactly are you doing that?

As far as I can see from the xlrd code, when it reads a number from an xlsx file it immediately converts it to a float before returning the cell object to the user.

So unless I am missing something I don't see how Pandas can be made to read the columns as strings. It will always get a float from xlrd for an xls or xlsx file if Excel has stored them as numbers.

The CSV readers behave differently and can have types specified because they return all read data as strings, regardless of whether it looks like a number or not.

Perhaps @jreback or @jtratner can suggest a different way to convert a column from float64 to formatted strings.

But apart from a workaround I don't think this issue is fixable in the way that you would like.

@jtratner (Contributor)

Have you tried this in 0.13? If all your numbers are ints, pandas will read them in as ints (recent change in 0.13) and you can just do x.astype(str) to get what you want, albeit not necessarily in a particularly fast way. That said, if you're converting them to strings, they'll all be object dtype, so you're going to be using more memory and space. All numbers are stored as float in Excel, so you can't really get around this...
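For illustration, a small sketch of that suggestion (assuming the columns really do come in as int64 in 0.13+):

import pandas as pd

s = pd.Series([8027770000000001, 8027770000000002])  # int64, not float64
s_str = s.astype(str)  # exact digits, e.g. '8027770000000001', no scientific notation
# Note: the result is object dtype, so it uses more memory than the int64 column.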

@jreback (Contributor) commented Feb 18, 2014

closing as fixed in 0.13

@jreback closed this as completed Feb 18, 2014

@manugarri

Why is this fixed? There is still no way of reading Excel file columns as objects.

@jreback (Contributor) commented Mar 17, 2014

See @jmcnamara's comment.

Unless there is an easy way to tell xlrd to read it as not a number (e.g. to not do any conversions).

@manugarri

Would it be possible then to keep every column as a float? Because this means datetime columns in Excel are very hard to handle (they cannot be read as text, nor converted to text).

@jreback (Contributor) commented Mar 17, 2014

I believe that xlrd converts datetimes properly (IIRC you have to specify which columns are dates).

Is there a problem with some conversion?

@manugarri

Well, in my case I don't know the file structure beforehand, and hence I cannot specify which columns are dates at the moment of import.

Maybe there is another way I can detect the date columns before using the read_excel function?

@jmcnamara (Contributor)

Maybe there is another way I can detect the date columns before using the read_excel function

Not easily.

Dates in Excel are stored as real numbers. The only thing that distinguishes them from other real numbers is the number format that is applied to them. These number formats aren't stored directly with the number in the Excel file, and in general some level of heuristic is required to determine whether a cell contains a date.

Xlrd does a lot of work around this so you don't have to.
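For example, with xlrd directly a date cell can be detected and decoded roughly like this (a sketch; the file name and cell coordinates are placeholders):

import datetime
import xlrd

wb = xlrd.open_workbook('input.xls')   # placeholder file name
sheet = wb.sheet_by_index(0)
cell = sheet.cell(1, 0)                # placeholder coordinates

if cell.ctype == xlrd.XL_CELL_DATE:
    # The raw value is an Excel serial number; xlrd decodes it using the
    # workbook's date mode (1900 vs 1904 epoch).
    y, mo, d, h, mi, s = xlrd.xldate_as_tuple(cell.value, wb.datemode)
    value = datetime.datetime(y, mo, d, h, mi, s)
else:
    value = cell.value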

However, could we take a step back and ask what you are trying to do?

@manugarri

Well, my company receives several files of client data (normally Excel files) on a constant basis. I am in charge of ensuring that data is formatted according to our main import logic.

I built a Flask app to help our account managers format the files. This app relies on pandas heavily, because pandas is amazing.

However, since these files are coming straight from clients, my app doesn't know what is in each file, so each account manager uses my app to decide if a column should be an int, or a date, etc.

The way I built the app, it turns every column into an object dtype column and then formats every column according to the requirements.

This works great with CSV files (because of the dtype argument), but with Excel files the date columns are giving me a hard time.

@jreback (Contributor) commented Mar 17, 2014

read_json does this type of inference: either looking at likely columns by name, or simply trying various conversion frequencies (e.g. the values may be stored as seconds/ms/ns since the epoch). You can try to convert columns yourself; just use pd.to_datetime with a try/except (and errors set to raise).
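A rough sketch of that kind of per-column inference (the helper name and the choice to try only object columns are my own assumptions):

import pandas as pd

def coerce_datetime_columns(df):
    """Try to parse each object column as datetimes; leave it unchanged on failure."""
    out = df.copy()
    for col in out.select_dtypes(include=['object']).columns:
        try:
            out[col] = pd.to_datetime(out[col], errors='raise')
        except (ValueError, TypeError):
            pass  # not date-like, keep the original column
    return out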

@andrea-lascola

I have the same problem with numbers and strings when trying to convert an xls/xlsx to CSV.
In my case there are columns with codes like "07453", "000314"...

Reading from Excel, pandas reads them as integers and strips all the leading zeros, which results in "7453", "314"... (wrong in my case).
At this point I think there is no workaround for my problem.

@jmcnamara (Contributor)

@andrea-lascola: I have the same problem with numbers and strings when trying to convert an xls/xlsx to CSV. In my case there are columns with codes like "07453", "000314"...

Pandas uses the xlrd module for reading Excel files and xlrd preserves Excel's types when there is a corresponding type in Python.

Therefore if 07453 is read as a number and not a string it is likely that it is stored in the Excel file as the number 7453 and then has a format applied to it like 00000.

You can check that yourself by selecting that cell in Excel and changing the format to General.

If the data is actually a string but is being read as a number then that would be an issue.

If not then there isn't any workaround when reading the data. Perhaps it would be possible to reformat the data prior to writing it to the CSV file. @jreback might have some suggestions there.

@andrea-lascola

The data is stored as "Text", as shown in the attached screenshot ("Testo" means "Text" in Italian :)). I'll continue the investigation, thanks for the support.

@andrea-lascola

I'm debugging the pandas xl.parse function and I found that the data array changes from:

[u'010170' u'010185' u'010204' u'010291' u'010605' u'010787', ....]

to :

[ 10170 10185 10204 10291 10605 10787 15049 15387 ....]

after the pandas parser's call to lib.maybe_convert_numeric (shown in the attached screenshot).

It would be awesome if xl.parse exposed "try_num_bool" as an input parameter, to avoid the conversion when it isn't needed.
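One workaround along those lines, assuming a pandas version where read_excel accepts a converters mapping (and that, as with read_csv, a converted column is kept as object dtype rather than re-coerced); the file and column names are made up:

import pandas as pd

# Apply str to each cell of the 'code' column so values such as '010170'
# are kept as text instead of being coerced to integers.
df = pd.read_excel('codes.xlsx', converters={'code': str})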

@deanjones (Author)

Just to say that the solution I adopted was to use xlrd to inspect the type and then convert if necessary. This worked for my use case (which was avoiding large number-like strings being converted to scientific notation; I don't think it will help with the removal of leading zeros):

import xlrd

wb = xlrd.open_workbook(file)
sheet = wb.sheet_by_name('Sheet1')
column = sheet.col(1)
# if values are extracted as numbers (ctype == 2), convert to int then str;
# otherwise just convert the value to str (also filtering headers)
numbers = [str(int(x.value)) if x.ctype == 2 else str(x.value) for x in column]
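Extending that idea to a whole sheet (a sketch; the file name, sheet name, and the assumption that the first row is a header are mine):

import pandas as pd
import xlrd

wb = xlrd.open_workbook('input.xls')    # placeholder file name
sheet = wb.sheet_by_name('Sheet1')

def cell_to_str(cell):
    # ctype 2 is XL_CELL_NUMBER; render integral floats without the trailing '.0'
    if cell.ctype == xlrd.XL_CELL_NUMBER and cell.value == int(cell.value):
        return str(int(cell.value))
    return str(cell.value)

header = [c.value for c in sheet.row(0)]
rows = [[cell_to_str(c) for c in sheet.row(i)] for i in range(1, sheet.nrows)]
df = pd.DataFrame(rows, columns=header)  # every column is object/str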

@manugarri

Maybe it would be possible to use xlrd to create a CSV tempfile that pandas would then import with the same arguments as read_csv?
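Something along those lines might look like this (a sketch; it assumes a simple rectangular sheet and just writes each cell's xlrd value out to a temporary CSV):

import csv
import tempfile

import pandas as pd
import xlrd

wb = xlrd.open_workbook('input.xls')    # placeholder file name
sheet = wb.sheet_by_index(0)

with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    writer = csv.writer(tmp)
    for i in range(sheet.nrows):
        writer.writerow([cell.value for cell in sheet.row(i)])
    tmp_path = tmp.name

# All of read_csv's keywords (dtype, converters, parse_dates, ...) are now available.
df = pd.read_csv(tmp_path, dtype=str)

Note that number cells still come out of xlrd as floats, so this only preserves values that Excel actually stored as text.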

@jreback (Contributor) commented Mar 20, 2014

The 'real' solution here is to support a dtype={} argument similar to read_csv, see issue #4932, which would allow you to specify that certain columns are retained as object dtype (in other words, they wouldn't be coerced).

PR anyone?
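For reference, the proposed API would mirror read_csv, roughly like this (hypothetical at the time of this comment; the file and column names are made up):

import pandas as pd

# Keep the given columns as object/str so their values are never coerced to numbers.
df = pd.read_excel('clients.xlsx', dtype={'account_code': str, 'zip': str})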

@manugarri

Exactly, that would be the solution.
But you mentioned that converting data types is not possible straight from xlrd, so we would need to add an intermediate processing step, is that correct?

@jreback (Contributor) commented Mar 20, 2014

Not sure, but there is some type inference done by xlrd, further inference in excel.py, and finally some when the resulting DataFrame is actually constructed.

@naoko commented Jun 17, 2014

I vote for the suggested solution, which is to support dtype for Excel files.

@andrea-lascola

I can't believe it: I have the same problem right now and I have to use something else to read Excel files because of this issue.

@spearsem commented Oct 8, 2014

I'm having an issue with this too. @jreback -- there is an issue with xlrd: python-excel/xlrd#77.

It appears to clip the millisecond information off the column when reading a time. My fallback was going to be reading the column as a string and then manually using strptime or something, but since pandas elects to punt entirely to xlrd, that doesn't seem possible.

I suppose my best option right now is to open the files manually and re-save them as CSV, then load them with read_csv so I can use the dtype argument. Is that correct?

@jreback (Contributor) commented Oct 8, 2014

@spearsem I think that is right

@jmcnamara (Contributor)

@spearsem I fixed the millisecond issue in #5945.

@jmcnamara (Contributor)

@spearsem Try upgrading to the latest version of xlrd.

@paupereira

Here is a workaround to the problem of leading zeros. If you know that your column has a fixed width, you can first turn the column into a string and then add the missing leading zeros back with the string method zfill, as shown here: http://stackoverflow.com/questions/339007/nicest-way-to-pad-zeroes-to-string
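For example (the column name and the width of 5 are assumptions):

import pandas as pd

df = pd.DataFrame({'code': [7453, 314]})  # leading zeros already lost on read
# Re-pad to the known fixed width of 5 characters.
df['code'] = df['code'].astype(int).astype(str).str.zfill(5)
# -> '07453', '00314'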

Overall, I agree that read_excel should have a dtype option.

@aburnettnf

Where are we on getting read_excel to have a dtype option? I'm reading in an excessively large Excel file which has over 1,000 columns and 33,000 rows. The cell values consist of ints, floats, and strings, but it seems that read_excel is treating all the numerics (ints and floats) as floats, so a 99 becomes 99.0. Having the ability to specify that everything be treated as a string would be ideal.

@jreback (Contributor) commented Mar 25, 2016

See the open issue, #5342; pull requests are welcome for that. It's actually pretty easy. Though you realize that you generally don't need it, only to specifically make columns object dtype; almost all other types (including dates) are supplied via Excel already.

Furthermore, treating ints/floats as strings is NOT good practice at all, and can lead to completely non-performant code.
