
Colombia issue #109

Merged
xmnlab merged 2 commits into thegraphnetwork:main from eduardocorrearaujo:colombia_issue on Jun 9, 2022

Conversation

@eduardocorrearaujo
Contributor

This PR aims to solve the problem with the Colombia scripts reported in #103.

The problem was using the column id_ as a unique constraint. I did this following the foph.py script, but since we are reading the Colombia data in chunks, the id_ values repeat in each chunk, which was causing an error. So I changed it to the column id_de_caso, which must be unique. To use id_de_caso as a unique constraint I also needed to run this code in the SQL editor:

ALTER TABLE colombia.positive_cases_covid_d
ADD UNIQUE (id_de_caso);
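As a sketch of why the unique constraint fixes the chunked loading, the snippet below inserts two overlapping chunks with ON CONFLICT ... DO NOTHING. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and a hypothetical two-column subset of the real table; the same SQL works against the colombia.positive_cases_covid_d table.

```python
# Stand-in demo: a UNIQUE constraint plus ON CONFLICT lets overlapping
# chunks be inserted without duplicate-key errors aborting the load.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positive_cases_covid_d ("
    "  id_de_caso INTEGER UNIQUE,"  # the column made unique in the PR
    "  estado TEXT"
    ")"
)

def insert_chunk(rows):
    # Rows whose id_de_caso already exists are silently skipped,
    # so a repeated id no longer raises an insertion error.
    conn.executemany(
        "INSERT INTO positive_cases_covid_d (id_de_caso, estado) VALUES (?, ?) "
        "ON CONFLICT (id_de_caso) DO NOTHING",
        rows,
    )

insert_chunk([(1, "Leve"), (2, "Leve")])
insert_chunk([(2, "Leve"), (3, "Grave")])  # id_de_caso 2 repeats across chunks
n = conn.execute("SELECT COUNT(*) FROM positive_cases_covid_d").fetchone()[0]
print(n)  # 3 rows: the duplicate was skipped
```

Without the UNIQUE constraint, ON CONFLICT has no arbiter column to check and every repeated row would be inserted again.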

In the future when we migrate the scripts to Apache Airflow, I think that would be great for us to write a good tutorial about how to create the scripts to upload data and make it easier for other people to help us.

@fccoelho
Contributor

@eduardocorrearaujo ADD UNIQUE only works if the column is truly unique in the data. If, in the future, you get a repeated id_de_caso, it will raise an insertion error.

The best solution is to make the id_ column an auto-increment column. Then you don't have to provide values for it; PostgreSQL will take care of incrementing it automatically when you insert new rows.

The reason you are getting repetitions of id_ is that you are creating it in the DataFrame instead of letting the database take care of it.

To do that, you need to create the table in SQL before starting to insert data:

CREATE TABLE colombia.positive_cases_covid_d (
    id_ BIGSERIAL PRIMARY KEY,
    fecha_inicio_sintomas TIMESTAMP WITHOUT TIME ZONE,
    <other columns>
);

But this can be a bit cumbersome to do for a table with many columns.
So you can simply run the following code, after inserting at least the first chunk of data:

ALTER TABLE colombia.positive_cases_covid_d ADD COLUMN id_ BIGSERIAL PRIMARY KEY;

This is much simpler, but you need to be sure that you don't have any duplicates in the database at this point.

After you do this, you can continue to append rows to your table, and PostgreSQL will increment id_ automatically for you.
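The auto-increment behaviour described above can be sketched as follows. SQLite's INTEGER PRIMARY KEY plays the role of PostgreSQL's BIGSERIAL here, and the table is a trimmed-down stand-in; the point is that no chunk ever supplies id_.

```python
# Stand-in demo: the database fills id_ itself, so chunk DataFrames
# never need to carry (or regenerate) that column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positive_cases_covid_d ("
    "  id_ INTEGER PRIMARY KEY,"  # BIGSERIAL PRIMARY KEY in PostgreSQL
    "  id_de_caso INTEGER,"
    "  estado TEXT"
    ")"
)
# Insert two chunks without ever supplying id_:
conn.executemany(
    "INSERT INTO positive_cases_covid_d (id_de_caso, estado) VALUES (?, ?)",
    [(10, "Leve"), (11, "Grave")],
)
conn.executemany(
    "INSERT INTO positive_cases_covid_d (id_de_caso, estado) VALUES (?, ?)",
    [(12, "Leve")],
)
ids = [r[0] for r in conn.execute("SELECT id_ FROM positive_cases_covid_d ORDER BY id_")]
print(ids)  # [1, 2, 3] — assigned automatically, strictly increasing
```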

@eduardocorrearaujo
Contributor Author

> @eduardocorrearaujo ADD UNIQUE only works if the column is truly unique in the data. If, in the future, you get a repeated id_de_caso, it will raise an insertion error.
>
> The best solution is to make the id_ column an auto-increment column. Then you don't have to provide values for it; PostgreSQL will take care of incrementing it automatically when you insert new rows.
>
> The reason you are getting repetitions of id_ is that you are creating it in the DataFrame instead of letting the database take care of it.
>
> To do that, you need to create the table in SQL before starting to insert data:
>
>     CREATE TABLE colombia.positive_cases_covid_d (
>         id_ BIGSERIAL PRIMARY KEY,
>         fecha_inicio_sintomas TIMESTAMP WITHOUT TIME ZONE,
>         <other columns>
>     );
>
> But this can be a bit cumbersome to do for a table with many columns. So you can simply run the following code, after inserting at least the first chunk of data:
>
>     ALTER TABLE colombia.positive_cases_covid_d ADD COLUMN id_ BIGSERIAL PRIMARY KEY;
>
> This is much simpler, but you need to be sure that you don't have any duplicates in the database at this point.
>
> After you do this, you can continue to append rows to your table, and PostgreSQL will increment id_ automatically for you.

OK, thank you for the great explanation. But I still have a doubt: don't I need to define the id_ column in each chunk for the upsert to be able to deduplicate and add the new rows?

@eduardocorrearaujo
Contributor Author

@fccoelho, I made a commit adding the following lines:

df_new.replace(
    to_replace={
        'ubicacion': {'casa': 'Casa', 'CASA': 'Casa'},
        'estado': {'leve': 'Leve', 'LEVE': 'Leve'},
        'sexo': {'f': 'F', 'm': 'M'},
    },
    inplace=True,
)

These typos were interfering with the Colombia dashboard, so I would like to discuss whether I should replace the values using pandas (as I did), or whether it would be faster to run a SQL query to make the changes after we upload the new data to the database.
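For comparison, the SQL-side alternative being discussed could look like the sketch below. It runs against an in-memory SQLite database as a stand-in, but the same UPDATE statements would work on the PostgreSQL table after each load.

```python
# Stand-in demo of normalizing the casing variants in SQL, after loading,
# instead of with pandas replace() on each chunk.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positive_cases_covid_d (ubicacion TEXT, estado TEXT, sexo TEXT)")
conn.executemany(
    "INSERT INTO positive_cases_covid_d VALUES (?, ?, ?)",
    [("casa", "leve", "f"), ("CASA", "LEVE", "m"), ("Casa", "Leve", "F")],
)
# One UPDATE per column collapses the casing variants in place:
conn.execute("UPDATE positive_cases_covid_d SET ubicacion = 'Casa' WHERE ubicacion IN ('casa', 'CASA')")
conn.execute("UPDATE positive_cases_covid_d SET estado = 'Leve' WHERE estado IN ('leve', 'LEVE')")
conn.execute("UPDATE positive_cases_covid_d SET sexo = UPPER(sexo) WHERE sexo IN ('f', 'm')")
rows = conn.execute("SELECT DISTINCT ubicacion, estado FROM positive_cases_covid_d").fetchall()
print(rows)  # [('Casa', 'Leve')] — all variants collapsed
```

Either approach gives the same end state; the pandas version cleans each chunk before it reaches the database, while the SQL version cleans everything in one pass afterwards.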

@fccoelho
Contributor

> OK, thank you for the great explanation. But I still have a doubt: don't I need to define the id_ column in each chunk for the upsert to be able to deduplicate and add the new rows?

No: after you create the id_ column, it will continue to exist in the table during the following upserts. It does not need to be added to the chunk DataFrame, because it will be filled automatically by PostgreSQL.
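A minimal sketch of both points together: id_ is auto-filled by the database while the upsert deduplicates on id_de_caso, so the chunks never carry an id_ column. (SQLite stand-in again; in PostgreSQL the sequence behind BIGSERIAL may skip values on conflicts, but the ids stay unique and increasing.)

```python
# Stand-in demo: auto-increment id_ + upsert keyed on id_de_caso.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positive_cases_covid_d ("
    "  id_ INTEGER PRIMARY KEY,"    # BIGSERIAL in PostgreSQL
    "  id_de_caso INTEGER UNIQUE,"  # arbiter column for the upsert
    "  estado TEXT"
    ")"
)
upsert = (
    "INSERT INTO positive_cases_covid_d (id_de_caso, estado) VALUES (?, ?) "
    "ON CONFLICT (id_de_caso) DO UPDATE SET estado = excluded.estado"
)
conn.executemany(upsert, [(100, "Leve"), (101, "Leve")])
conn.executemany(upsert, [(101, "Grave"), (102, "Leve")])  # 101 repeats: updated, not duplicated
rows = conn.execute(
    "SELECT id_, id_de_caso, estado FROM positive_cases_covid_d ORDER BY id_"
).fetchall()
print(rows)  # [(1, 100, 'Leve'), (2, 101, 'Grave'), (3, 102, 'Leve')]
```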

@fccoelho
Contributor

fccoelho commented May 31, 2022

> These typos were interfering with the Colombia dashboard, so I would like to discuss whether I should replace the values using pandas (as I did), or whether it would be faster to run a SQL query to make the changes after we upload the new data to the database.

I think you can do this "replace" on every chunk; it shouldn't slow things down too much.

@xmnlab
Member

xmnlab commented Jun 1, 2022

@eduardocorrearaujo I am not super familiar with the code here, but in order to test it, I think it depends on #103.
I will work on that today; I need to fix an issue with the Celery service, and then it should be ready to go.

@fccoelho
Contributor

fccoelho commented Jun 9, 2022

@eduardocorrearaujo let's try to close this PR

@xmnlab
Member

xmnlab commented Jun 9, 2022

@eduardocorrearaujo could you rebase your branch pls? the CI should work now 🤞

@xmnlab
Member

xmnlab commented Jun 9, 2022

rebased 🤞

@xmnlab xmnlab merged commit 83eb0dd into thegraphnetwork:main Jun 9, 2022
@xmnlab
Member

xmnlab commented Jun 9, 2022

thanks @eduardocorrearaujo for working on that!
