### Handling our GZ3D Sample

Before we begin collecting a useful sample, let us first get a rough idea of what's going on. We'll use a simple helper function to gather the paths of all GZ3D files.

In [1]:
import sys
import numpy as np

In [2]:
sys.path.insert(0, '../')
from sa_utils import append_files

In [3]:
gz3d_dir = '/home/sshamsi/sas/mangawork/manga/sandbox/galaxyzoo3d/v3_0_0/'

all_gz3d_paths = append_files(gz3d_dir, ext='.fits.gz', ret_path=True)

In [4]:
print(len(all_gz3d_paths))

29813
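
`append_files` comes from the local `sa_utils` module, whose source is not shown here. As a rough idea of what such a helper might do, here is a minimal sketch that walks a directory tree and collects files with a given extension. The name `find_files` and its exact behaviour are assumptions for illustration, not the actual `sa_utils` implementation.

```python
import os

def find_files(root, ext='.fits.gz', ret_path=True):
    """Hypothetical stand-in for sa_utils.append_files (assumed behaviour).

    Recursively collects every file under `root` whose name ends with `ext`.
    Returns full paths if `ret_path` is True, otherwise bare filenames.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(ext):
                found.append(os.path.join(dirpath, name) if ret_path else name)
    return found
```

Calling `find_files(gz3d_dir)` would then return a plain Python list of paths, which can be passed to `len` exactly as above.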


### Filtering with MaNGA data

There are a lot of files here (~30,000)! Sadly, not all of them can be used, mainly because not all GZ3D galaxies have been observed by MaNGA. We'll filter the GZ3D galaxies by checking whether they exist in the current (v3_1_1-3.1.0) DAPALL file. But first, we'll save the `all_gz3d_paths` array for future convenience.

In [5]:
np.save('all_gz3d_paths.npy', all_gz3d_paths, allow_pickle=True)

In [6]:
from astropy.io import fits

In [7]:
dapall_path = '/home/sshamsi/sas/mangawork/manga/spectro/analysis/v3_1_1/3.1.0/dapall-v3_1_1-3.1.0.fits'

dapall = fits.open(dapall_path)
dapall_mangaids = dapall[1].data['MANGAID']
dapall_mangaids.size

10782

In [8]:
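matched_paths = []  # build a list; np.append would copy the array on every call

for path in all_gz3d_paths:
    # the MaNGA ID is the part of the filename before the first underscore
    gz3d_mangaid = path.split('/')[-1].split('_')[0]

    if gz3d_mangaid in dapall_mangaids:
        matched_paths.append(path)

manga_gz3d_paths = np.array(matched_paths)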

In [9]:
manga_gz3d_paths.size

9314
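
The loop above can also be written as a vectorised membership test. The sketch below uses `np.isin` on made-up filenames and IDs (the part of each filename after the first underscore is invented for illustration). It is an alternative to the loop, not the code that produced the count above.

```python
import os
import numpy as np

# Made-up example data; the real paths and IDs come from the GZ3D
# directory and the DAPALL file.
example_paths = np.array([
    '/data/gz3d/1-26306_example.fits.gz',
    '/data/gz3d/1-99999_example.fits.gz',
    '/data/gz3d/1-458301_example.fits.gz',
])
example_dapall_ids = np.array(['1-458301', '1-26306', '1-178542'])

# MaNGA ID = part of the filename before the first underscore.
example_ids = np.array([os.path.basename(p).split('_')[0] for p in example_paths])

# Boolean mask: True where the file's ID appears among the DAPALL IDs.
in_dapall = np.isin(example_ids, example_dapall_ids)
matched_example_paths = example_paths[in_dapall]

print(matched_example_paths.size)  # 2
```

The boolean mask keeps the original order of the paths, so the result lines up with what the loop would produce on the same inputs.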

In [10]:
np.save('manga_gz3d_paths.npy', manga_gz3d_paths, allow_pickle=True)

Perfect! To recap the steps above: we extracted all MaNGA IDs from the DAPALL file (10782 entries, so roughly 10,000 MaNGA galaxies), kept only the GZ3D files whose MaNGA ID appears among them, and ended up with 9314 galaxies that we can use with our MaNGA data. We saved this list too.

### Filtering with spiral arms

Furthermore, not all GZ3D galaxies have a `spiral_arm` array. Let's find out how many.

In [11]:
sys.path.insert(0, '../../GZ3D_production/')
from gz3d_fits import gz3d_fits

[INFO]: No release version set. Setting default to MPL-11


In [12]:
#this function returns the percentage of pixels in a galaxy's image that at least one person identified as a spiral arm.
#we'll find that for many MaNGA galaxies this is 0! We can filter those out.

def get_pc_spiral_pixels(path):
    data = gz3d_fits(path)
    
    image_spiral_mask = data.spiral_mask
    pixels_above_threshold = (image_spiral_mask > 0).sum()
    
    return (pixels_above_threshold * 100) / image_spiral_mask.size
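
To make the returned number concrete, here is the same calculation on a small made-up array standing in for `data.spiral_mask`. We assume, as the comments above suggest, that each pixel holds the number of people who marked it as a spiral arm; the values are invented and are not real GZ3D data.

```python
import numpy as np

# Invented 3x4 "vote count" mask; not real GZ3D data.
toy_mask = np.array([
    [0, 0, 1, 0],
    [0, 3, 2, 0],
    [0, 0, 0, 0],
])

pixels_marked = (toy_mask > 0).sum()             # 3 pixels have at least one vote
pc_marked = pixels_marked * 100 / toy_mask.size  # 3 of 12 pixels

print(pc_marked)  # 25.0
```

A galaxy whose mask is all zeros gives 0 here, which is the case we want to filter out.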

In [13]:
import pandas as pd #we'll filter galaxies with Pandas

In [14]:
#we'll now form a list of dictionaries, each containing some information (the filepath and MaNGA ID) for
#the galaxy. We'll also calculate what percent of pixels in the galaxy's image have been classified as spiral arms.
#This will help us drop galaxies with no spiral classifications.
galdict_array = []
manga_gz3d_paths_len = len(manga_gz3d_paths)

for idx, path in enumerate(manga_gz3d_paths):
    mangaid = path.split('/')[-1].split('_')[0]
    percent = get_pc_spiral_pixels(path)
    
    galdict = {
        'filepath': path,
        'mangaid': mangaid,
        'pc_spiral_pixels': percent
    }
    
    galdict_array.append(galdict)
    
    if (idx + 1) % 25 == 0: #just to keep track of processing
        #idx + 1 galaxies have been processed at this point
        print(manga_gz3d_paths_len - (idx + 1), 'galaxies left')

[Progress output omitted: one "N galaxies left" line is printed every 25 galaxies.]

In [15]:
df = pd.DataFrame(galdict_array)

This is our DataFrame, ready to work with!

In [16]:
df

| | filepath | mangaid | pc_spiral_pixels |
|---|---|---|---|
| 0 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-458301 | 0.000000 |
| 1 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-26306 | 6.989932 |
| 2 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-289729 | 0.000000 |
| 3 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-387106 | 0.000000 |
| 4 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-604878 | 0.000000 |
| ... | ... | ... | ... |
| 9309 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-384554 | 0.000000 |
| 9310 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-117091 | 26.577415 |
| 9311 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-383608 | 0.000000 |
| 9312 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-419301 | 0.000000 |


Now we remove all galaxies without any spiral arm classifications.

In [17]:
df = df[df.pc_spiral_pixels > 0]

In [18]:
df

| | filepath | mangaid | pc_spiral_pixels |
|---|---|---|---|
| 1 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-26306 | 6.989932 |
| 8 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-178542 | 4.912109 |
| 10 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-91339 | 23.261678 |
| 21 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-51315 | 15.471383 |
| 23 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-94066 | 53.655147 |
| ... | ... | ... | ... |
| 9292 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-2604 | 9.803900 |
| 9293 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-71763 | 2.464943 |
| 9296 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-548639 | 24.729615 |
| 9305 | /home/sshamsi/sas/mangawork/manga/sandbox/gala... | 1-352635 | 42.394195 |


Thus we can see that the number of galaxies in which at least one pixel was classified as a spiral arm is quite low (2296). We'll gather the file paths of these galaxies and save them for future convenience.

In [19]:
manga_gz3d_spirals = df.filepath.to_numpy()

In [20]:
np.save('manga_gz3d_spirals.npy', manga_gz3d_spirals, allow_pickle=True)
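
When reloading a saved path array later, keep in mind that an object-dtype array is pickled on disk, so `np.load` has to opt in with `allow_pickle=True` as well. The sketch below shows the round trip on invented paths written to a temporary directory; the file name `example_paths.npy` is a placeholder, and whether the real array is object-dtype depends on how pandas stores the `filepath` column.

```python
import os
import tempfile
import numpy as np

# Stand-in for the saved path array (object dtype, invented paths).
example_paths = np.array(['/data/a.fits.gz', '/data/b.fits.gz'], dtype=object)

with tempfile.TemporaryDirectory() as tmp:
    out_file = os.path.join(tmp, 'example_paths.npy')
    np.save(out_file, example_paths, allow_pickle=True)

    # np.load defaults to allow_pickle=False, which raises ValueError
    # for object arrays, so we pass allow_pickle=True explicitly.
    reloaded = np.load(out_file, allow_pickle=True)

print(reloaded.tolist())
```

Only load pickled `.npy` files that you created yourself, since unpickling can execute arbitrary code.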

Now that we have this file listing all galaxies with at least some spiral arm identification, the process of cleaning the data should be easier. Later, we will also experiment with different thresholds on the number of classifications a pixel needs to count as part of a spiral arm.
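
As a preview of that experiment, here is a minimal sketch that generalises the `> 0` test to `>= threshold` and shows how the percentage falls as more votes are required. The helper name `pc_pixels_at_least` and the toy mask are hypothetical, and we assume the mask values are per-pixel vote counts.

```python
import numpy as np

def pc_pixels_at_least(mask, threshold):
    """Percentage of pixels whose value is at least `threshold`."""
    return (mask >= threshold).sum() * 100 / mask.size

# Invented vote-count mask; not real GZ3D data.
toy_mask = np.array([
    [0, 0, 1, 0],
    [0, 3, 2, 0],
    [0, 0, 0, 0],
])

for threshold in (1, 2, 3):
    print(threshold, pc_pixels_at_least(toy_mask, threshold))
```

With `threshold=1` this reproduces the "at least one person" criterion used above; higher thresholds keep only pixels that more classifiers agreed on.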