ValueError: invalid literal for int() with base 10: in clean_osm_data #153

oayana · 2021-11-19T18:19:33Z

Some values in df_all_lines["cables"] cannot be converted to int (e.g. "3;6").

Workaround you can use:
df_all_lines["cables"] = df_all_lines["cables"].astype(str).str.replace(";", ".").astype(float).astype(int)

pz-max · 2021-11-19T20:18:33Z

Hi @oayana,
thanks for opening this issue :)
What script did you use?
Did it lead to an error?

At some point we split a line whose voltage information had the format (110,220) into two lines, one with 110 and one with 220. So we probably missed splitting the cables as well. I think we need a different solution than the workaround. Do you want to create a PR and think about a solution?

oayana · 2021-11-19T20:35:00Z

First of all, you're welcome :)

While running the analysis for Turkey, I got the following error on line 368 of the osm_data_cleaning script.
Error message:
Traceback (most recent call last):
File "C:\Users\muh1\Desktop\pypsa\Pypsa_data\pypsa-africa.snakemake\scripts\tmpb0bbyowa.osm_data_cleaning.py", line 611, in
clean_data(
File "C:\Users\muh1\Desktop\pypsa\Pypsa_data\pypsa-africa.snakemake\scripts\tmpb0bbyowa.osm_data_cleaning.py", line 493, in clean_data
df_all_lines = integrate_lines_df(df_all_lines)
File "C:\Users\muh1\Desktop\pypsa\Pypsa_data\pypsa-africa.snakemake\scripts\tmpb0bbyowa.osm_data_cleaning.py", line 371, in integrate_lines_df
df_all_lines["cables"] = df_all_lines["cables"].astype("int")
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\generic.py", line 5815, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\internals\managers.py", line 418, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\internals\managers.py", line 327, in apply
applied = getattr(b, f)(**kwargs)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\internals\blocks.py", line 591, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\dtypes\cast.py", line 1309, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\dtypes\cast.py", line 1257, in astype_array
values = astype_nansafe(values, dtype, copy=copy)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\dtypes\cast.py", line 1174, in astype_nansafe
return lib.astype_intsafe(arr, dtype)
File "pandas_libs\lib.pyx", line 679, in pandas._libs.lib.astype_intsafe
ValueError: invalid literal for int() with base 10: '3;2'

I tried a workaround as a solution. But if you want to find a better solution, of course I would like to help.

pz-max · 2021-11-19T21:16:43Z

Hi @oayana,
Thanks for the error message. So the problem seems to be in line 371. Your workaround currently fixes the bug but introduces a small error: it sets both lines to, e.g., 'cable' = 2, although one of them needs 'cable' = 3. Let me explain.

The number of cables can only be an integer (1, 2, 3, ...). The problematic entry describes 2 lines running in parallel, one with 2 cables and one with 3 cables. This is what leads to the error: (2;3) cannot be converted to an integer.
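To make the point concrete, here is a minimal sketch with made-up cable values (not taken from the real OSM data). It reproduces the failing conversion and shows what the workaround does to a "2;3" entry: the value becomes 2.3 and is then truncated to 2, so the 3 cables of the second line are lost.

```python
import pandas as pd

# Made-up example values; "2;3" stands for two parallel lines with 2 and 3 cables
cables = pd.Series(["2", "2;3", "6"], dtype=object)

# The plain conversion fails on the semicolon entry
try:
    cables.astype(int)
    raised = False
except ValueError:
    raised = True

# The workaround turns "2;3" into 2.3 and truncates it to 2
workaround = (
    cables.astype(str)
    .str.replace(";", ".", regex=False)
    .astype(float)
    .astype(int)
)
print(raised, workaround.tolist())
```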

https://github.com/pypsa-meets-africa/pypsa-africa/blob/509d2ba691c622c2245df63631c7f09c91089355/scripts/osm_data_cleaning.py#L362-L366

As you can see in the linked code, we split the voltages because they also contained a semicolon.
The split_cells function is used in the voltage case to break the lines apart. But let's look at split_cells in more detail:
https://github.com/pypsa-meets-africa/pypsa-africa/blob/509d2ba691c622c2245df63631c7f09c91089355/scripts/osm_data_cleaning.py#L238-L244

As the docstring explains, this function separates the voltage values and creates an identical line carrying all the other data. So while we now have two lines with separated voltages and no semicolon, the cable information is still not fixed. Each line now has something like Line1['cable'] -> (2;3), Line2['cable'] -> (2;3).

https://github.com/pypsa-meets-africa/pypsa-africa/blob/509d2ba691c622c2245df63631c7f09c91089355/scripts/osm_data_cleaning.py#L218-L235

Solution:

check row by row whether the voltage and cable cells contain a semicolon
if yes, split both cells
then save the information into a new line
Hope it makes sense

Goal (simplified):

That's how data was stored: 
Line['voltage'] = (100;200)
Line['cable'] = (1;2)

That's how we need it:
Line1['voltage'] = (100)
Line1['cable'] = (1)
Line2['voltage'] = (200)
Line2['cable'] = (2)
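
One possible way to get this pairwise result, sketched under assumptions: the helper name split_pairwise and the example frame are hypothetical, and the sketch relies on pandas' multi-column DataFrame.explode (available since pandas 1.3) rather than on the project's split_cells function.

```python
import pandas as pd

# Hypothetical example data in the "stored" format
lines = pd.DataFrame(
    {
        "line_id": [1, 2],
        "voltage": ["100;200", "300"],
        "cables": ["1;2", "3"],
    }
)


def split_pairwise(df, cols=("voltage", "cables")):
    """Split ';'-separated cells in cols and pair the pieces by position."""
    out = df.copy()
    for col in cols:
        # "100;200" -> ["100", "200"]; "300" -> ["300"]
        out[col] = out[col].astype(str).str.split(";")
    # Exploding both columns together keeps element k of each list in the same row
    return out.explode(list(cols), ignore_index=True)


result = split_pairwise(lines)
result["cables"] = result["cables"].astype(int)
print(result)
```

This only works when both cells of a row split into the same number of pieces. A row with a semicolon in only one of the two columns would need separate handling, because the multi-column explode expects matching element counts.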

Do you want to be on the contributor list with such a fix? :)

oayana · 2021-11-20T08:08:03Z

I understood the problem and would like to offer you a solution. I will handle it as soon as possible and get back to you :)

oayana · 2021-11-20T15:27:35Z

Hi Max,
I have now had a chance to look at the df_all_lines dataframe and, as you mentioned, some cables and voltage values contain semicolons. However, cables and voltage never both contain a semicolon in the same row (only one of them does).

You have already produced the solution for this: the split_cells function, which you already apply to the voltage values.
All you have to do is call the split function before converting the cables values to int:

     df_all_lines = split_cells(df_all_lines, lst_col="cables")

Then,
df_all_lines["cables"] = df_all_lines["cables"].astype("int")

pz-max · 2021-11-20T16:34:31Z

Hi @oayana,
I would suggest following these steps 1-7 to create a pull request that can be accepted in PyPSA-Africa (it also helps to check your code and to stream it in): https://www.tomasbeuzen.com/post/git-fork-branch-pull/

I have only one doubt which we need to check to be ready for an accepted PR.

Let's assume one line has semicolon-separated values for voltage and cable at the same time:

Line['voltage'] = (100;200)
Line['cable'] = (1;2)

Then, I believe applying split cells two times could lead to the following:

Line1['voltage'] = (100)
Line1['cable'] = (1)
Line2['voltage'] = (100)
Line2['cable'] = (2)
Line3['voltage'] = (200)
Line3['cable'] = (1)
Line4['voltage'] = (200)
Line4['cable'] = (2)

That is not the desired solution I mentioned above (#153 (comment)). I would suggest writing a small Jupyter notebook to test whether a single line with semicolons in both columns is handled correctly by the split_cells function.
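
Such a test could look roughly like the sketch below. Note that split_one_column is a hypothetical stand-in that splits a single column at a time; it is not the project's split_cells, so whether the real function behaves the same way is exactly what would need checking.

```python
import pandas as pd


def split_one_column(df, col):
    """Hypothetical stand-in for a single-column split (not the real split_cells)."""
    out = df.copy()
    out[col] = out[col].astype(str).str.split(";")
    # One output row per piece; all other columns are copied unchanged
    return out.explode(col, ignore_index=True)


# One line with semicolons in both columns
line = pd.DataFrame({"voltage": ["100;200"], "cables": ["1;2"]})

# Splitting the columns one after the other
twice = split_one_column(split_one_column(line, "voltage"), "cables")
pairs = list(zip(twice["voltage"], twice["cables"]))
print(pairs)
```

With this stand-in, the two sequential splits produce every voltage/cable combination (four rows) instead of the two paired rows.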

Almost there :)

oayana · 2021-11-21T09:26:08Z

Hi Max,
Maybe this code can provide the solution you want (be careful, it only works for cables and voltage).

import pandas as pd

# Example data
dic = {'line_id': {0: 557828023, 1: 176542356, 2: 305004661, 3: 233127728}, 'voltage': {0: '154000;380000', 1: '380000', 2: '154000;380000', 3: '154000'}, 'cables': {0: '3;2', 1: '0', 2: '3;6', 3: '0'}}
data = pd.DataFrame(dic)

def split_cells_multiple(df, list_col=['cables', 'voltage']):
    """Split rows where both columns contain ';' into two rows, paired by position."""
    df = df.reset_index(drop=True)  # make sure labels 0..n-1 exist for .loc
    new_rows = []
    for i in range(df.shape[0]):
        sub = df.loc[i, list_col]  # cables and voltage of this row
        if sub.notnull().all():  # skip rows with a missing value
            if all(";" in s for s in sub):  # both cells contain ";"
                d = [s.split(';') for s in sub]  # split them
                r = df.loc[i, :].copy()
                df.loc[i, list_col[0]] = d[0][0]  # first parts stay in the original row
                df.loc[i, list_col[1]] = d[1][0]
                r[list_col[0]] = d[0][1]  # second parts go into the copied row
                r[list_col[1]] = d[1][1]
                new_rows.append(r)
    if new_rows:
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
    return df  # return new frame


data = split_cells_multiple(data)

pz-max mentioned this issue Nov 21, 2021

Restyle Update osm_data_cleaning.py #158

Merged

pz-max closed this as completed Nov 21, 2021
