# Code to verify the data extraction process
Feb 27, 2020

### This code verifies that the second part of the extraction, i.e. storing the information in .npy files, is correct
It accomplishes this by:
- Reading the .npy files (x, y, idx) for a sample of IDs
- Looking up the file locations for those IDs and decoding the .gif files directly into numpy arrays
- Comparing the images and labels from both sources for each ID


In [1]:
import numpy as np
import pandas as pd

import time
import matplotlib.pyplot as plt

### Steps:
- Getting data from .npy files
    - Read the .npy files (x, y, idx)
    - Extract a subsample of these using a random index array
- Getting data directly: 
    - Read summary_label_files.csv to extract the entire dataframe.
    - Use the IDs extracted before to slice the dataframe for only those IDs

- Compare
    - For each ID, extract the labels from the dataframe and the contents of the .gif files.
    - Compare these with the contents of the .npy files (images of all 3 file types: temp, srch, diff)
    - Done!
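The comparison step above can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the 4x4 image size and the random pixel values are made up, and the channel order (temp=0, srch=1, diff=2) is assumed from the comment in the comparison cell.

```python
import numpy as np

# Hypothetical stand-ins for the three decoded .gif images of one ID.
rng = np.random.default_rng(0)
images = {name: rng.integers(0, 256, size=(4, 4), dtype=np.uint8)
          for name in ("temp", "srch", "diff")}

# Assumed channel order of the stored array: temp=0, srch=1, diff=2.
channel_order = ("temp", "srch", "diff")
stored = np.stack([images[name] for name in channel_order], axis=-1)

# Verification: each channel of the stored array must equal its source image.
mismatches = [name for ch, name in enumerate(channel_order)
              if not np.array_equal(stored[:, :, ch], images[name])]
print("mismatched channels:", mismatches)
```

An empty `mismatches` list means every channel matches its source image.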

## Read data from .npy files

In [2]:
## Read ID array
fname='/global/project/projectdirs/dasrepo/vpa/supernova_cnn/data/gathered_data/full_idx.npy'
a_id=np.load(fname)
## Read label array
fname='/global/project/projectdirs/dasrepo/vpa/supernova_cnn/data/gathered_data/full_y.npy'
a_y=np.load(fname)
## Read image array
fname='/global/project/projectdirs/dasrepo/vpa/supernova_cnn/data/gathered_data/full_x.npy'
a_x=np.load(fname)
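If the image array is too large to load comfortably, `np.load` accepts `mmap_mode="r"`, which maps the file read-only so that only the rows actually indexed are read from disk. The sketch below uses a small temporary file as a stand-in; the file name `demo_x.npy` and the array shape are hypothetical.

```python
import os
import tempfile
import numpy as np

# Write a small stand-in array to a temporary .npy file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo_x.npy")
np.save(path, np.arange(24, dtype=np.uint8).reshape(6, 2, 2))

# mmap_mode="r" maps the file read-only instead of reading it all into memory.
mapped = np.load(path, mmap_mode="r")

# Indexing with a list of rows copies only those rows into an in-memory array.
subset = np.asarray(mapped[[0, 3, 5]])
print(subset.shape)
```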


In [3]:
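### Pick a sample of indices for the .npy arrays
### Note: np.random.choice samples with replacement by default, so an index can appear more than once
num_samples=5000
full_size=a_id.shape[0]
np.random.seed(323389)
sample_idxs=np.random.choice(np.arange(full_size),size=num_samples)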
### Grab slices of the numpy arrays
arr_x,arr_y,arr_IDs=a_x[sample_idxs],a_y[sample_idxs],a_id[sample_idxs]
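`np.random.choice` draws with replacement unless told otherwise, so the sample can contain repeated indices. If distinct IDs are wanted, one option is to pass `replace=False`. The sketch below uses NumPy's `default_rng` generator and small hypothetical sizes; it will not reproduce the indices drawn by the seeded call in the cell above.

```python
import numpy as np

full_size = 20      # hypothetical array length
num_samples = 10    # hypothetical sample size

rng = np.random.default_rng(323389)
# replace=False guarantees that no index is drawn twice.
sample_idxs = rng.choice(full_size, size=num_samples, replace=False)
print(np.sort(sample_idxs))
```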

## Read data directly from files

### Get IDs and labels

In [4]:
f2='/global/project/projectdirs/dasrepo/vpa/supernova_cnn/data/gathered_data/summary_label_files.csv'
df=pd.read_csv(f2,sep=',',comment='#')

In [5]:
### Get the subset of the big dataframe
df=df[df.ID.isin(arr_IDs)]
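Before comparing, it can help to confirm that every sampled ID is present in the filtered dataframe and has exactly three rows (temp, srch, diff). The sketch below runs on a toy dataframe; the IDs and file names in it are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the summary dataframe: ID 2 is missing its diff file.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3],
    "filename": ["temp1.gif", "srch1.gif", "diff1.gif",
                 "temp2.gif", "srch2.gif", "temp3.gif"],
})
sampled_ids = [1, 2]

# Keep only the rows whose ID was sampled.
subset = df[df.ID.isin(sampled_ids)]

# IDs that do not have exactly three files, and IDs that are absent entirely.
rows_per_id = subset.groupby("ID").size()
incomplete_ids = rows_per_id[rows_per_id != 3].index.tolist()
missing_ids = sorted(set(sampled_ids) - set(subset.ID))
print(incomplete_ids, missing_ids)
```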

In [6]:
df.head()

| Unnamed: 0 | ID | filename | file path | Label |
|---|---|---|---|---|
| 141 | 148663 | temp148663.gif | /global/project/projectdirs/dasrepo/vpa/supern... | 1 |
| 142 | 148663 | srch148663.gif | /global/project/projectdirs/dasrepo/vpa/supern... | 1 |
| 143 | 148663 | diff148663.gif | /global/project/projectdirs/dasrepo/vpa/supern... | 1 |
| 207 | 78883 | diff78883.gif | /global/project/projectdirs/dasrepo/vpa/supern... | 1 |
| 208 | 78883 | srch78883.gif | /global/project/projectdirs/dasrepo/vpa/supern... | 1 |


### Compare data values from .npy files vs directly from .gif files

In [7]:

for count,iD in enumerate(arr_IDs):
    df2=df[df.ID==iD]
#     display(df2)
    
    ### Get image arrays and labels
    ### original order of data stored in .npy files is 'temp'=0,'srch'=1,'diff'=2 
    img={}
    for prefix,loc in zip(['diff','temp','srch'],[2,0,1]):        
        fle=prefix+str(iD)+'.gif'
        fname=df2[df2.filename==fle]['file path'].values[0]
        img[prefix]=plt.imread(fname)
        
        ### Compare image arrays
        if not np.array_equal(img[prefix],arr_x[count,:,:,loc]):
            print(img[prefix])
            ### ValueError rather than SystemError, which is reserved for interpreter-internal errors
            raise ValueError("Image arrays are not equal for count {0}, ID {1}".format(count,iD))

        ### Compare labels
        label=df2[df2.filename==fle].Label.values[0]
        if label!=arr_y[count]:
            print(label,arr_y[count])
            raise ValueError("Labels do not match for count {0}, ID {1}".format(count,iD))
            
            
print("All images and labels match!")

All images and labels match!
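If a mismatch does occur, printing the whole image is hard to read. A possible alternative is to locate the differing pixels with `np.argwhere`, or to use `np.testing.assert_array_equal`, which raises an `AssertionError` with a summary of the difference. The sketch below uses a made-up 3x3 array with one deliberately corrupted pixel.

```python
import numpy as np

expected = np.zeros((3, 3), dtype=np.uint8)
actual = expected.copy()
actual[1, 2] = 7  # deliberately corrupt one pixel

# Coordinates (row, column) of every pixel that differs.
mismatch_coords = np.argwhere(actual != expected)
print(mismatch_coords)

# assert_array_equal raises AssertionError with a readable report on mismatch.
try:
    np.testing.assert_array_equal(actual, expected)
    report = ""
except AssertionError as err:
    report = str(err)
print(report)
```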
