New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transition from comma to pipe separated values #49

Closed
riceissa opened this Issue Feb 25, 2018 · 18 comments

Comments

Projects
None yet
2 participants
@riceissa
Collaborator

riceissa commented Feb 25, 2018

@riceissa riceissa self-assigned this Feb 25, 2018

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Mar 31, 2018

It looks like author in documents is also a CSV

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Mar 31, 2018

I'm working with this restricted list of SQL files which I think excludes all the script-generated SQL files: https://gist.github.com/riceissa/cffe01a35f92ce9d060ff0c89ad484a8

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Mar 31, 2018

So I'm doing the "Review all manually written SQL files and update them" step right now. My current process looks like this:

  1. I run a query like select affected_donors from documents where affected_donors REGEXP ','; to find all the places a comma appears in one of the fields listed above.
  2. I search in the SQL for where the string appears, and replace the commas with pipes.

Once I run out of commas to replace, I will go and insert in commas to entities (like "80,000 Hours") that require them.

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Mar 31, 2018

Queries to run (check mark indicates I've already replaced commas with pipes, so that the query returns an empty set; hash is from the commit that fixes that query):

  • select affected_donors from documents where affected_donors REGEXP ','; 07c6927
  • select affected_donees from documents where affected_donees REGEXP ','; 85e72a1
  • select author from documents where author REGEXP ','; fb6afeb
  • select affected_countries from donations where affected_countries REGEXP ','; 6c01f4c
  • select affected_states from donations where affected_states REGEXP ','; 014bc8c
  • select affected_cities from donations where affected_cities REGEXP ','; d1f1550
  • select affected_regions from donations where affected_regions REGEXP ',';
  • select affiliated_orgs from donors where affiliated_orgs REGEXP ','; 7f6308c
  • select identified_donees from money_moved where identified_donees REGEXP ',';
  • select identified_donors from money_moved where identified_donors REGEXP ',';
  • select key_people from donees where key_people REGEXP ','; d68f1ef
  • select recommenders from donees where recommenders REGEXP ','; 4ffcbc2

The ones with a check mark but no hash were empty to begin with.

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Mar 31, 2018

(ETA: see comment below for a bigger version of this list)

mysql> select distinct(dlw_data_processing_script) from donors;
+--------------------------------------------------------------+                        
| dlw_data_processing_script                                   |                        
+--------------------------------------------------------------+
| NULL                                                         |                        
| https://github.com/riceissa/michaeldello-donations           |                        
| https://github.com/riceissa/gates-foundation-iati-processing |                        
| https://github.com/riceissa/arnold-foundation-grants         |                        
| https://github.com/riceissa/public-welfare-foundation        |                        
| https://github.com/riceissa/ea-grants-processing             |                        
| https://github.com/riceissa/macarthur-foundation             |                        
| https://github.com/riceissa/hewlett-foundation               |                        
| https://github.com/riceissa/ford-foundation                  |                        
| https://github.com/riceissa/wellcome-trust                   |                        
| https://github.com/riceissa/bauman-foundation                |                        
| https://github.com/riceissa/unbound-philanthropy             |                        
| https://github.com/riceissa/templeton-foundation             |                        
| https://github.com/riceissa/mcdowell-foundation              |                        
| https://github.com/riceissa/nathan-cummings-foundation       |                        
| https://github.com/riceissa/pineapple-fund                   |                        
+--------------------------------------------------------------+                        
16 rows in set (0.00 sec)
@riceissa

This comment has been minimized.

Collaborator

riceissa commented Mar 31, 2018

Ok, I looked through all of the processing scripts listed above, and it looks like none of them removed the comma in the first place (oops), which means less work now (yay).

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Apr 1, 2018

So actually for a lot of the scripts, I've mostly been using the location information verbatim, which means some of them might need to be split into multiple locations (and separated using pipes).

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Apr 1, 2018

Social media mass grab touches neither key_people nor recommenders for donees so nothing to do there.

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Apr 3, 2018

Nathan Cummings and Gates IATI need to be regenerated because the SQL file is now different.

Otherwise, this is basically done.

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Apr 3, 2018

By the way, the code changes from this issue (that are part of this repo) are in https://github.com/vipulnaik/donations/tree/pipe-separated-values

@vipulnaik

This comment has been minimized.

Owner

vipulnaik commented Apr 5, 2018

I have merged the changes in, but have not regenerated the entire SQL (just wanted to merge them in so that I can add stuff to the SQL without creating merge conflicts).

@riceissa, can you take care of regenerating the data for Gates and Cummings and copying the regenerated data over to the relevant locations in the donations repo?

Thanks!

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Apr 5, 2018

Ok done (see 3368750 and 0950708).

@riceissa riceissa closed this Apr 5, 2018

@vipulnaik

This comment has been minimized.

Owner

vipulnaik commented Apr 11, 2018

@riceissa I think we missed the influencer column in the donations table.

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Apr 13, 2018

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Apr 13, 2018

I fixed the ones listed above in 1bb2d48

Now checking the PHP code.

I don't think the data generation code needs to be checked because the influencer was never touched. (All the examples flagged in the above Gist are from manual entry files.)

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Apr 13, 2018

PHP code fix is done in 7e530b6.

@riceissa

This comment has been minimized.

Collaborator

riceissa commented Apr 13, 2018

@vipulnaik So I think this is done again, unless you find some other thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment