
Add touch table script #43

Merged: 7 commits, Sep 23, 2019

Conversation

wgranger (Contributor) commented May 23, 2019

This PR includes a script to parse a project's subject export from Panoptes and create a new CSV that will be used when preparing a local database (.db) file for the Galaxy Zoo touch table app to use.

After creating a SQL table according to the schema in the touch table repo, the parsed CSV produced by this script can be loaded into that table, populating it with the rows that become the touch table's subjects.

# Read the local CSV subject export
classifications=pd.read_csv("classification-export.csv")

classifications['metadata']=classifications['metadata'].map(lambda x: json.loads(x))

chelseatroy (Member) commented Sep 17, 2019

Typically when working with pandas you'd want to use apply rather than map, unless there's a specific reason to stick with map. That frees pandas to optimize those operations for speed where it can.

Fun fact: folks will tell you that pandas "parallelizes" row operations where it can. Technically, this is not quite true: pandas is written in Python, which has a Global Interpreter Lock that prevents running an operation on separate threads at once. However, there are situations in which pandas can drop into C (which both loops faster than Python and sidesteps that lock) or use vectorized routines (which delegate work more efficiently from a single thread) to speed things up.
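As a sketch of the distinction (toy data, not the project's actual export): for element-wise work on a Series like this, map and apply behave the same, with apply being the more general method.

```python
import json
import pandas as pd

# Toy stand-in for the subject export's metadata column (the JSON
# here is illustrative, not the project's real schema).
df = pd.DataFrame({"metadata": ['{"ra": 1.5}', '{"ra": 2.5}']})

# Series.map and Series.apply both call json.loads element-wise and
# produce the same parsed dicts in this case.
parsed_map = df["metadata"].map(json.loads)
parsed_apply = df["metadata"].apply(json.loads)

assert list(parsed_map) == list(parsed_apply)  # identical results here
```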

import pandas as pd
import json
import string
from tqdm import tqdm

chelseatroy (Member) commented Sep 17, 2019

I'd probably do from tqdm import tqdm as progress_bar, so folks unfamiliar with this library (or with the Arabic word it's named for) will understand what it is :P

classifications['star'] = 0;

# Copy data from metadata into correct/new CSV headers with progress bar
with tqdm(total=len(classifications)) as pbar:

chelseatroy (Member) commented Sep 17, 2019

apropos of the above, with progress_bar(total=len(classifications)) as current_progress:

# Copy data from metadata into correct/new CSV headers with progress bar
with tqdm(total=len(classifications)) as pbar:
    for index, row in classifications.iterrows():
        pbar.update(1)

chelseatroy (Member) commented Sep 17, 2019

apropos of above, current_progress.update(1)

        pbar.update(1)
        for column in row['metadata']:
            if column in ['ra', 'dec', '!ra', '!dec']:
                strippedPunctuation = column.strip(string.punctuation)

chelseatroy (Member) commented Sep 17, 2019

Python sticks to snake_case rather than camelCase, so here the pythonic name would be stripped_punctuation. Clearly this doesn't affect either functionality or legibility, but Python people will sometimes get their feathers ruffled over it, so I figured I'd mention.

classifications=pd.read_csv("classification-export.csv")

classifications['metadata']=classifications['metadata'].map(lambda x: json.loads(x))
classifications['locations']=classifications['locations'].map(lambda x: json.loads(x))

chelseatroy (Member) commented Sep 17, 2019

Excellent. Assigning back like this saves space in large dataframes; I don't know how big this one gets but it's a good practice to follow, provided you don't need the original column. Bravo.
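A minimal sketch of the pattern (illustrative column contents, not the project's real schema): overwriting the column under the same name means the frame ends up holding only the parsed values, not both the raw strings and the parsed dicts.

```python
import json
import pandas as pd

df = pd.DataFrame({"metadata": ['{"ra": 1.5, "dec": -0.3}']})

# Reassigning under the same name replaces the raw JSON strings with
# the parsed dicts; storing the result under a new column name would
# keep both versions alive in memory.
df["metadata"] = df["metadata"].map(json.loads)

assert df.loc[0, "metadata"] == {"ra": 1.5, "dec": -0.3}
```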


# Include in subject_set_ids all subject sets you want to keep
subject_set_ids = [];
classifications = classifications.loc[classifications['subject_set_id'].isin(subject_set_ids)]

chelseatroy (Member) commented Sep 17, 2019

Kudos on using .loc correctly. A lot of times I see this done with plain chained indexing instead, which can operate on a copy of the dataframe rather than the original. That hurts your space efficiency, and if you later try to modify the subset, you'll get a warning that you are attempting to modify a copy of a dataframe. Unless you really know what you're looking at, it can be tough to track down where the copy was made, because the warning appears at the point of modification rather than at the point of copying.

chelseatroy (Member) commented Sep 17, 2019

The aforementioned warning, explained with pictures that make it really clear what this is: https://www.dataquest.io/blog/settingwithcopywarning/
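A minimal illustration of the difference, on a hypothetical mini-frame: assigning through .loc in a single step writes into the original frame, whereas chained indexing would write into a copy and trigger the warning.

```python
import pandas as pd

df = pd.DataFrame({"subject_set_id": [1, 2, 3], "star": [0, 0, 0]})

# Chained indexing such as df[df["subject_set_id"] == 1]["star"] = 1
# would assign into a copy and raise SettingWithCopyWarning; .loc
# selects the rows and the target column on the original frame at once.
df.loc[df["subject_set_id"] == 1, "star"] = 1

assert df["star"].tolist() == [1, 0, 0]  # original frame was updated
```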

            if column in ['ra', 'dec', '!ra', '!dec']:
                strippedPunctuation = column.strip(string.punctuation)
                classifications.loc[index, strippedPunctuation] = row['metadata'][column]
            if column in ['iauname', '!iauname']:

chelseatroy (Member) commented Sep 17, 2019

What do each of these column names mean?

wgranger (Author, Contributor) commented Sep 18, 2019

This is specific to the metadata of the Galaxy Zoo project. ra and dec are both astronomical terms for location of an item in space, kinda like a latitude and longitude. iauname is what this metadata uses for a filename. I can add comments if helpful.

chelseatroy (Member) commented Sep 23, 2019

I think it might be, but I won't block merge on it :)

                classifications.loc[index, 'image'] = row['locations'][column]

# Drop unnecessary columns and rearrange
classifications = classifications[['subject_id', 'classifications_count', 'ra', 'dec', 'image', 'filename', 'smooth', 'features', 'star']]

chelseatroy (Member) commented Sep 17, 2019

you can also use the .drop method here rather than assigning back, if you like
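For comparison, a sketch of the .drop alternative on toy columns: drop names what to discard rather than what to keep, and returns a new frame unless inplace=True is passed.

```python
import pandas as pd

df = pd.DataFrame({"subject_id": [1], "extra_column": [2]})

# .drop lists the columns to discard; the reassignment style above
# instead lists the columns to retain.
trimmed = df.drop(columns=["extra_column"])

assert list(trimmed.columns) == ["subject_id"]
```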

wgranger (Author, Contributor) commented Sep 18, 2019

I think I'll keep the reassignment so it's stated explicitly which columns are used in the parsed csv.

classifications = classifications[['subject_id', 'classifications_count', 'ra', 'dec', 'image', 'filename', 'smooth', 'features', 'star']]

# Create a parsed CSV for DB import
classifications.to_csv('parsed-subject-set.csv',sep=',',index = False,encoding='utf-8')

chelseatroy (Member) commented Sep 17, 2019

sep in this method defaults to a comma, so you don't need to include that parameter (though you certainly can). Removing it, you would have:

classifications.to_csv('parsed-subject-set.csv', index=False, encoding='utf-8')

chelseatroy (Member) left a review

See comments, but nothing major!

wgranger (Author, Contributor) commented Sep 18, 2019

Nice. Thanks for the thorough review and explanations. I think I covered everything in the recent commits.

willettk (Contributor) commented Sep 20, 2019

Apparently I'm still subscribed to this repo for updates. Just wanted to say how much I liked seeing such a good example of constructive code review - thanks very much, @chelseatroy!

chelseatroy merged commit 8686b01 into master on Sep 23, 2019

chelseatroy (Member) commented Sep 23, 2019

> Apparently I'm still subscribed to this repo for updates. Just wanted to say how much I liked seeing such a good example of constructive code review - thanks very much, @chelseatroy!

@willettk , thank you :) This made my day!
