ValueError: invalid literal for int() with base 10: in clean_osm_data #153

Closed
oayana opened this issue Nov 19, 2021 · 7 comments

Comments

@oayana
Contributor

oayana commented Nov 19, 2021

Some values in df_all_lines["cables"] cannot be converted to int (e.g. 3;6).

Workaround you can use: df_all_lines["cables"] = df_all_lines["cables"].astype(str).str.replace(";", ".").astype(float).astype(int)

@pz-max
Member

pz-max commented Nov 19, 2021

Hi @oayana,
thanks for opening this issue :)
Which script did you use?
Did it lead to an error?

At some point we split one line that had voltage information in the format (110;220) into two lines, 110 and 220. So we probably missed splitting the cables as well. I think we need a proper solution rather than the workaround. Do you want to create a PR and think about a solution?

@oayana
Contributor Author

oayana commented Nov 19, 2021

First of all, you're welcome :)

While running the analysis for Turkey, I got the following error on line 368 of the osm_data_cleaning script.
Error message:
Traceback (most recent call last):
File "C:\Users\muh1\Desktop\pypsa\Pypsa_data\pypsa-africa.snakemake\scripts\tmpb0bbyowa.osm_data_cleaning.py", line 611, in <module>
clean_data(
File "C:\Users\muh1\Desktop\pypsa\Pypsa_data\pypsa-africa.snakemake\scripts\tmpb0bbyowa.osm_data_cleaning.py", line 493, in clean_data
df_all_lines = integrate_lines_df(df_all_lines)
File "C:\Users\muh1\Desktop\pypsa\Pypsa_data\pypsa-africa.snakemake\scripts\tmpb0bbyowa.osm_data_cleaning.py", line 371, in integrate_lines_df
df_all_lines["cables"] = df_all_lines["cables"].astype("int")
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\generic.py", line 5815, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\internals\managers.py", line 418, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\internals\managers.py", line 327, in apply
applied = getattr(b, f)(**kwargs)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\internals\blocks.py", line 591, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\dtypes\cast.py", line 1309, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\dtypes\cast.py", line 1257, in astype_array
values = astype_nansafe(values, dtype, copy=copy)
File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\dtypes\cast.py", line 1174, in astype_nansafe
return lib.astype_intsafe(arr, dtype)
File "pandas_libs\lib.pyx", line 679, in pandas._libs.lib.astype_intsafe
ValueError: invalid literal for int() with base 10: '3;2'

I used the workaround as a quick fix, but if you want to find a better solution, I would of course like to help.

@pz-max
Member

pz-max commented Nov 19, 2021

Hi @oayana,
Thanks for the error message. So it seems the problem is in line 371. Your workaround currently fixes the bug but introduces a small error: it sets both lines to e.g. 'cable' = 2, although one line needs 'cable' = 3. Let me explain.

The number of cables can only be an integer (1, 2, 3, ...). The problematic entry describes 2 lines running in parallel, one with 2 cables and one with 3 cables. This leads to the error, because (2;3) cannot be converted to an integer.
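To make the point concrete, here is a minimal sketch on a toy Series (the values are made up for illustration, not taken from the OSM data). It reproduces the ValueError and shows how the workaround silently keeps only the first number:

```python
import pandas as pd

# Toy stand-in for df_all_lines["cables"]; the values are illustrative only.
cables = pd.Series(["2", "2;3"])

# Direct conversion fails on the semicolon-separated entry.
try:
    cables.astype("int")
    raised = False
except ValueError:
    raised = True

# The workaround turns "2;3" into 2.3 and then truncates it to 2,
# so the information about the 3-cable line is lost.
workaround = (
    cables.astype(str)
    .str.replace(";", ".", regex=False)
    .astype(float)
    .astype(int)
)
print(raised, workaround.tolist())
```

The conversion no longer crashes, but every entry ends up with the number before the semicolon.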

https://github.com/pypsa-meets-africa/pypsa-africa/blob/509d2ba691c622c2245df63631c7f09c91089355/scripts/osm_data_cleaning.py#L362-L366

As the linked snippet shows, we split the voltages because they also contained a semicolon.
The split_cells function is used in the voltage case to break the lines apart. But let's look at the split_cells function in more detail:
https://github.com/pypsa-meets-africa/pypsa-africa/blob/509d2ba691c622c2245df63631c7f09c91089355/scripts/osm_data_cleaning.py#L238-L244

The docstring explains that this function separates the voltage and creates an identical line with all the previous data. So while we now have two lines with separated voltages and no semicolon, the cable information is still not fixed. Each line now has something like Line1['cable'] -> (2;3), Line2['cable'] -> (2;3).

https://github.com/pypsa-meets-africa/pypsa-africa/blob/509d2ba691c622c2245df63631c7f09c91089355/scripts/osm_data_cleaning.py#L218-L235

Solution:

  • check row by row whether the voltage and cable cells contain a semicolon
  • if yes, split both cells
  • then save the information into a new line

Hope it makes sense

Goal (simplified):

That's how data was stored: 
Line['voltage'] = (100;200)
Line['cable'] = (1;2)

That's how we need it:
Line1['voltage'] = (100)
Line1['cable'] = (1)
Line2['voltage'] = (200)
Line2['cable'] = (2)
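A rough sketch of this goal, not the project's implementation: the helper name split_paired, its parameters, and the choice to copy a single value onto every new line when only one column has a semicolon are all assumptions made for illustration.

```python
import pandas as pd


def split_paired(df, cols=("voltage", "cable"), sep=";"):
    """Hypothetical helper: split semicolon-separated cells pairwise.

    The k-th voltage is paired with the k-th cable value. If only one
    column contains the separator, the single value of the other column
    is copied to every new line (an assumption, see the lead-in).
    """
    rows = []
    for _, row in df.iterrows():
        parts = [str(row[c]).split(sep) for c in cols]
        n = max(len(p) for p in parts)
        # Refuse to guess when the counts disagree, e.g. "1;2;3" vs "100;200".
        if any(len(p) not in (1, n) for p in parts):
            raise ValueError(f"cannot pair values in row: {dict(row)}")
        for k in range(n):
            new_row = row.copy()
            for c, p in zip(cols, parts):
                new_row[c] = p[k] if len(p) == n else p[0]
            rows.append(new_row)
    return pd.DataFrame(rows).reset_index(drop=True)


line = pd.DataFrame({"voltage": ["100;200", "380"], "cable": ["1;2", "3"]})
out = split_paired(line)
print(out)
```

The first input row becomes two lines (100 with 1, 200 with 2) and the second row passes through unchanged. The values stay strings, so the conversion to int still has to happen afterwards.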

Do you want to be on the contributor list with such a fix? :)

@oayana
Contributor Author

oayana commented Nov 20, 2021

I understood the problem and would like to offer you a solution. I will handle it as soon as possible and get back to you :)

@oayana
Contributor Author

oayana commented Nov 20, 2021

Hi Max,
I had a chance to look at the df_all_lines dataframe now and, as you mentioned, some cables and voltage values contain semicolons. However, cables and voltage never contain a semicolon in the same row (only one of them does).

You have already produced the solution for this: the split_cells function, as you applied to the voltage values.
All you have to do is call the split function before converting the cables values to int:

     df_all_lines = split_cells(df_all_lines, lst_col="cables")

Then,
df_all_lines["cables"] = df_all_lines["cables"].astype("int")

@pz-max
Member

pz-max commented Nov 20, 2021

Hi @oayana,
I would suggest following these steps 1-7 to create a pull request that can be accepted in PyPSA-Africa (it also helps to check your code and to stream it in): https://www.tomasbeuzen.com/post/git-fork-branch-pull/

I have only one doubt which we need to check to be ready for an accepted PR.

Let's assume one line has semicolon-separated values for voltage and cable at the same time:

Line['voltage'] = (100;200)
Line['cable'] = (1;2)

Then, I believe applying split cells two times could lead to the following:

Line1['voltage'] = (100)
Line1['cable'] = (1)
Line2['voltage'] = (100)
Line2['cable'] = (2)
Line3['voltage'] = (200)
Line3['cable'] = (1)
Line4['voltage'] = (200)
Line4['cable'] = (2)

That is not the desired solution which I mentioned above (#153 (comment)). I would suggest writing a small Jupyter notebook to test whether a single line with a semicolon in both columns is handled correctly by the split_cells function.
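Such a check could look roughly like the sketch below. It does not call the real split_cells; it uses pandas str.split plus explode as a stand-in for "split one column and duplicate the rest of the row", which is an assumption about how split_cells behaves.

```python
import pandas as pd

# One line with a semicolon in both columns, as in the example above.
line = pd.DataFrame({"voltage": ["100;200"], "cable": ["1;2"]})

# Stand-in for split_cells(df, lst_col="voltage"): split, then one row per value.
step1 = (
    line.assign(voltage=line["voltage"].str.split(";"))
    .explode("voltage")
    .reset_index(drop=True)
)

# The same independent split on the cable column.
step2 = (
    step1.assign(cable=step1["cable"].str.split(";"))
    .explode("cable")
    .reset_index(drop=True)
)

result = step2
print(result)
```

Splitting the two columns independently produces all four combinations, i.e. the cross product rather than the two paired lines. If split_cells works the same way, calling it twice would show this behaviour whenever both columns contain a semicolon in the same row.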

Almost there :)

@oayana
Contributor Author

oayana commented Nov 21, 2021

Hi Max,
Maybe this code provides the solution you want (be careful: it only works for cables and voltage).

import pandas as pd

# Example data: rows 0 and 2 contain ";" in both voltage and cables
dic = {'line_id': {0: 557828023, 1: 176542356, 2: 305004661, 3: 233127728}, 'voltage': {0: '154000;380000', 1: '380000', 2: '154000;380000', 3: '154000'}, 'cables': {0: '3;2', 1: '0', 2: '3;6', 3: '0'}}
data = pd.DataFrame(dic)

def split_cells_multiple(df, list_col=('cables', 'voltage')):
    """Split rows where both columns contain ';' into two paired rows."""
    list_col = list(list_col)
    df = df.reset_index(drop=True)  # new frame with unique positional labels
    new_rows = []
    for i in range(df.shape[0]):
        sub = df.loc[i, list_col]  # cables and voltage of the current row
        # split only if both cells are non-null and both contain ";"
        if sub.notnull().all() and all(";" in s for s in sub):
            d = [s.split(';') for s in sub]  # split them
            r = df.loc[i, :].copy()
            df.loc[i, list_col[0]] = d[0][0]  # original row keeps the first parts
            df.loc[i, list_col[1]] = d[1][0]
            r[list_col[0]] = d[0][1]  # new row gets the second parts
            r[list_col[1]] = d[1][1]
            new_rows.append(r)
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    if new_rows:
        df = pd.concat([df, pd.DataFrame(new_rows).infer_objects()], ignore_index=True)
    return df  # return new frame


data = split_cells_multiple(data)

2 participants