
Colombia issue #109

Merged
xmnlab merged 2 commits into thegraphnetwork:main from eduardocorrearaujo:colombia_issue on Jun 9, 2022

Conversation

@eduardocorrearaujo
Contributor

This PR aims to solve the problem with the Colombia scripts reported in #103.

The problem was using the column id_ as a unique constraint. I did this following the foph.py script, but since we are reading the Colombia data in chunks, the id_ values repeat in each chunk, which was causing an error. So I changed it to the column id_de_caso, which must be unique. To use id_de_caso as a unique constraint I also needed to run this code in the SQL editor:

ALTER TABLE colombia.positive_cases_covid_d
ADD UNIQUE (id_de_caso);
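As a sketch of why the unique constraint fixes the chunked loading, the snippet below inserts two overlapping chunks with ON CONFLICT ... DO NOTHING. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and a hypothetical two-column subset of the real table; the same SQL works against the colombia.positive_cases_covid_d table.

```python
# Stand-in demo: a UNIQUE constraint plus ON CONFLICT lets overlapping
# chunks be inserted without duplicate-key errors aborting the load.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positive_cases_covid_d ("
    "  id_de_caso INTEGER UNIQUE,"  # the column made unique in the PR
    "  estado TEXT"
    ")"
)

def insert_chunk(rows):
    # Rows whose id_de_caso already exists are silently skipped,
    # so a repeated id no longer raises an insertion error.
    conn.executemany(
        "INSERT INTO positive_cases_covid_d (id_de_caso, estado) VALUES (?, ?) "
        "ON CONFLICT (id_de_caso) DO NOTHING",
        rows,
    )

insert_chunk([(1, "Leve"), (2, "Leve")])
insert_chunk([(2, "Leve"), (3, "Grave")])  # id_de_caso 2 repeats across chunks
n = conn.execute("SELECT COUNT(*) FROM positive_cases_covid_d").fetchone()[0]
print(n)  # 3 rows: the duplicate was skipped
```

Without the UNIQUE constraint, ON CONFLICT has no arbiter column to check and every repeated row would be inserted again.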

In the future when we migrate the scripts to Apache Airflow, I think that would be great for us to write a good tutorial about how to create the scripts to upload data and make it easier for other people to help us.

@fccoelho
Contributor

@eduardocorrearaujo ADD UNIQUE only works if the column is truly unique in the data. If, in the future, you get a repeated id_de_caso, it will raise an insertion error.

The best solution is to make the id_ column an auto-increment column. Then you don't have to provide values for it; PostgreSQL will take care of incrementing it automatically when you insert new rows.

The reason you are getting repetitions of id_ is that you are creating it in the DataFrame instead of letting the database take care of it.

To do that, you need to create the table in SQL before starting to insert data:

CREATE TABLE colombia.positive_cases_covid_d (
    id_ BIGSERIAL PRIMARY KEY,
    fecha_inicio_sintomas TIMESTAMP WITHOUT TIME ZONE,
    <other columns>
);

But this can be a bit cumbersome to do for a table with many columns.
So you can simply run the following code, after inserting at least the first chunk of data:

ALTER TABLE colombia.positive_cases_covid_d ADD COLUMN id_ BIGSERIAL PRIMARY KEY;

This is much simpler, but you need to be sure that you don't have any duplicates in the database at this point.

After you do this, you can continue to append rows to your table, and PostgreSQL will increment id_ automatically for you.
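The auto-increment behaviour described above can be sketched as follows. SQLite's INTEGER PRIMARY KEY plays the role of PostgreSQL's BIGSERIAL here, and the table is a trimmed-down stand-in; the point is that no chunk ever supplies id_.

```python
# Stand-in demo: the database fills id_ itself, so chunk DataFrames
# never need to carry (or regenerate) that column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positive_cases_covid_d ("
    "  id_ INTEGER PRIMARY KEY,"  # BIGSERIAL PRIMARY KEY in PostgreSQL
    "  id_de_caso INTEGER,"
    "  estado TEXT"
    ")"
)
# Insert two chunks without ever supplying id_:
conn.executemany(
    "INSERT INTO positive_cases_covid_d (id_de_caso, estado) VALUES (?, ?)",
    [(10, "Leve"), (11, "Grave")],
)
conn.executemany(
    "INSERT INTO positive_cases_covid_d (id_de_caso, estado) VALUES (?, ?)",
    [(12, "Leve")],
)
ids = [r[0] for r in conn.execute("SELECT id_ FROM positive_cases_covid_d ORDER BY id_")]
print(ids)  # [1, 2, 3] — assigned automatically, strictly increasing
```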

@eduardocorrearaujo
Contributor Author

> @eduardocorrearaujo ADD UNIQUE only works if the column is truly unique in the data. If, in the future, you get a repeated id_de_caso, it will raise an insertion error.
>
> The best solution is to make the id_ column an auto-increment column. Then you don't have to provide values for it; PostgreSQL will take care of incrementing it automatically when you insert new rows.
>
> The reason you are getting repetitions of id_ is that you are creating it in the DataFrame instead of letting the database take care of it.
>
> To do that, you need to create the table in SQL before starting to insert data:
>
>     CREATE TABLE colombia.positive_cases_covid_d (
>         id_ BIGSERIAL PRIMARY KEY,
>         fecha_inicio_sintomas TIMESTAMP WITHOUT TIME ZONE,
>         <other columns>
>     );
>
> But this can be a bit cumbersome to do for a table with many columns. So you can simply run the following code, after inserting at least the first chunk of data:
>
>     ALTER TABLE colombia.positive_cases_covid_d ADD COLUMN id_ BIGSERIAL PRIMARY KEY;
>
> This is much simpler, but you need to be sure that you don't have any duplicates in the database at this point.
>
> After you do this, you can continue to append rows to your table, and PostgreSQL will increment id_ automatically for you.

OK, thank you for the great explanation. But I still have a doubt: don't I need to define the id_ column in each chunk for the upsert to be able to deduplicate and add the new rows?

@eduardocorrearaujo
Contributor Author

@fccoelho, I made a commit adding the following lines:

df_new.replace(
    to_replace={
        'ubicacion': {'casa': 'Casa', 'CASA': 'Casa'},
        'estado': {'leve': 'Leve', 'LEVE': 'Leve'},
        'sexo': {'f': 'F', 'm': 'M'},
    },
    inplace=True,
)

These typos were interfering with the Colombia dashboard, so I would like to discuss whether I should replace the values using pandas (as I did), or whether it would be faster to run a SQL query to make the changes after we upload the new data to the database.
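For comparison, the SQL-side alternative being discussed could look like the sketch below. It runs against an in-memory SQLite database as a stand-in, but the same UPDATE statements would work on the PostgreSQL table after each load.

```python
# Stand-in demo of normalizing the casing variants in SQL, after loading,
# instead of with pandas replace() on each chunk.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positive_cases_covid_d (ubicacion TEXT, estado TEXT, sexo TEXT)")
conn.executemany(
    "INSERT INTO positive_cases_covid_d VALUES (?, ?, ?)",
    [("casa", "leve", "f"), ("CASA", "LEVE", "m"), ("Casa", "Leve", "F")],
)
# One UPDATE per column collapses the casing variants in place:
conn.execute("UPDATE positive_cases_covid_d SET ubicacion = 'Casa' WHERE ubicacion IN ('casa', 'CASA')")
conn.execute("UPDATE positive_cases_covid_d SET estado = 'Leve' WHERE estado IN ('leve', 'LEVE')")
conn.execute("UPDATE positive_cases_covid_d SET sexo = UPPER(sexo) WHERE sexo IN ('f', 'm')")
rows = conn.execute("SELECT DISTINCT ubicacion, estado FROM positive_cases_covid_d").fetchall()
print(rows)  # [('Casa', 'Leve')] — all variants collapsed
```

Either approach gives the same end state; the pandas version cleans each chunk before it reaches the database, while the SQL version cleans everything in one pass afterwards.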

@fccoelho
Contributor

> OK, thank you for the great explanation. But I still have a doubt: don't I need to define the id_ column in each chunk for the upsert to be able to deduplicate and add the new rows?

No: after you create the id_ column, it will continue to exist in the table during the following upserts. It does not need to be added to the chunk DataFrame, because it will be filled automatically by PostgreSQL.
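A minimal sketch of both points together: id_ is auto-filled by the database while the upsert deduplicates on id_de_caso, so the chunks never carry an id_ column. (SQLite stand-in again; in PostgreSQL the sequence behind BIGSERIAL may skip values on conflicts, but the ids stay unique and increasing.)

```python
# Stand-in demo: auto-increment id_ + upsert keyed on id_de_caso.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positive_cases_covid_d ("
    "  id_ INTEGER PRIMARY KEY,"    # BIGSERIAL in PostgreSQL
    "  id_de_caso INTEGER UNIQUE,"  # arbiter column for the upsert
    "  estado TEXT"
    ")"
)
upsert = (
    "INSERT INTO positive_cases_covid_d (id_de_caso, estado) VALUES (?, ?) "
    "ON CONFLICT (id_de_caso) DO UPDATE SET estado = excluded.estado"
)
conn.executemany(upsert, [(100, "Leve"), (101, "Leve")])
conn.executemany(upsert, [(101, "Grave"), (102, "Leve")])  # 101 repeats: updated, not duplicated
rows = conn.execute(
    "SELECT id_, id_de_caso, estado FROM positive_cases_covid_d ORDER BY id_"
).fetchall()
print(rows)  # [(1, 100, 'Leve'), (2, 101, 'Grave'), (3, 102, 'Leve')]
```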

@fccoelho
Contributor

fccoelho commented May 31, 2022

> These typos were interfering with the Colombia dashboard, so I would like to discuss whether I should replace the values using pandas (as I did), or whether it would be faster to run a SQL query to make the changes after we upload the new data to the database.

I think you can do this "replace" on every chunk; it shouldn't slow things down too much.

@xmnlab
Member

xmnlab commented Jun 1, 2022

@eduardocorrearaujo I am not super familiar with the code here, but in order to test it, I think it depends on #103.
I will work on that today; I need to fix an issue with the Celery service, and then it should be ready to go.

@fccoelho
Contributor

fccoelho commented Jun 9, 2022

@eduardocorrearaujo let's try to close this PR

@xmnlab
Member

xmnlab commented Jun 9, 2022

@eduardocorrearaujo could you rebase your branch pls? the CI should work now 🤞

@xmnlab
Member

xmnlab commented Jun 9, 2022

rebased 🤞

@xmnlab xmnlab merged commit 83eb0dd into thegraphnetwork:main Jun 9, 2022
@xmnlab
Member

xmnlab commented Jun 9, 2022

thanks @eduardocorrearaujo for working on that!
