# get-data-from-cdcs-test-v3

Sandbox for developing documentation resources (example notebooks and demo videos) for the SciServer AMBench project.

This notebook includes some methods for querying and obtaining results from the ambench.nist.gov website.

# Setup: prepare the notebook to do what we need

## Install and import required packages

First, we need to import packages that we will use to do the processing. Each import statement includes a comment describing what that package does.

In [None]:
# !pip install git+https://github.com/lmhale99/pycdcs
# print('ok')

In [1]:
### Data processing packages: 
import pandas # data processing
import math # math operations
import numpy as np # math operations


### JORDAN NOTE TO SELF: Call it "AMBench repository" instead of / in addition to "CDCS"
### JORDAN NOTE TO SELF: Add a description here about what CDCS is and why you want to use it
### Packages to download data from the Configurable Data Curation System (CDCS):
##### CDCS website: https://www.nist.gov/itl/ssd/information-systems-group/configurable-data-curation-system-cdcs
from cdcs import CDCS         # query CDCS data
import lxml.etree as et       # parse the XML file returned by CDCS to find image download locations
import requests               # download images
from urllib import request    # download images
import os                     # save and load image files

### Packages to do image processing once we have the images downloaded:
##### These are from the scikit-image package, but we only import the parts of the package that we need 
##### scikit-image website: https://scikit-image.org/
##### JORDAN NOTE TO SELF: Be more specific about what kind of processing each sub-package does
##### JORDAN NOTE TO SELF: Try it as "from skimage import io, filters, segmentation, morphology, measure"
from skimage import io           # read/write images
from skimage import filters      # process images
from skimage import segmentation # process images
from skimage import morphology   # process images
from skimage import measure      # process images

### Packages to plot results:
import matplotlib.pyplot as plt     # make plots
import matplotlib.image as mpimg    # make plots related to images, including displaying images in colorscale ("False Color")


print('Packages imported!')

Packages imported!


## Create functions

For some complex operations that we need to do repeatedly, it's easier to move the operations into a function. The overhead of a function call is negligible here, and it makes the notebook much easier to read and follow.

In [2]:
#### xml_url_find:
# Parameters:
# - XML file for one of the ambench.nist.gov datasets in string format
# - search phrase contained in the file name
# - file type

# Returns:
# - a list containing (0) the name of a file, (1) the download url for the file, (2) the laser track number, (3) the case ID
# - None if no file in the dataset matches the search phrase and file type
def xml_url_find(xml,searchphrase,mtype):
    print('\tin xml_url_find...')
    root=et.fromstring(xml)
    print('\t\t{0:}'.format(root))
    caseid=root.find('.//TraceID')[0].tag[5]
    track=root.find('.//TrackNumber')
    for element in root.iter('downloadURL'):
        u=request.urlopen(element.text)
        if searchphrase in u.info().get_filename() and mtype in u.info().get_content_type(): #looking for searchphrase in
            #filename and mtype in file type
            name=u.info().get_filename()
            url=element.text
            return [name,url,track.text,caseid]

#Given an xml string like the one returned by CDCS_xml, will return a dataframe of download links for the files on the AMBench
#page (includes name, description, comments, etc.)
def CDCS_downloads(xml):
    root=et.fromstring(xml)
    result=[]
    for element in root.iter('downloadURL'):
        entry={}
        for e in element.getparent():
            entry[e.tag]=e.text
        result.append(entry)
    df=pandas.DataFrame(result) # one row per download link; pandas is imported above under its full name
    return df        


#### draw_box_opt
# This function:
#   (1) crops the melt pool image (to reduce runtime),
#   (2) applies the Felzenszwalb segmentation algorithm to the cropped image,
#   (3) removes small regions (to eliminate some background noise),
#   (4) builds a table of centroid coordinates for the remaining regions,
#   (5) measures the distance from each centroid to the center of the cropped image,
#   (6) finds the index of the closest region,
#   (7) returns information about that closest region.

# Parameters:
# - image to be analyzed

# Returns:
# - the RegionProperties object for the region closest to the image center (assumed to be the melt pool)
def draw_box_opt(image):
    cropim=image[900:1800,800:3400] # crop so there are fewer pixels to process - segmentation is very slow at the original size
    segments=segmentation.felzenszwalb(cropim,scale=270,sigma=0.8,min_size=1000)
    isolateim=morphology.remove_small_objects(segments,100000)
    center=measure.regionprops_table(isolateim,properties=['centroid']) #table of information about the center points of regions
    distances=[]
    for n in range(len(center['centroid-0'])):
        distances.append(math.dist([center['centroid-0'][n],center['centroid-1'][n]],[450,1300]))
    index=distances.index(min(distances)) #Looking for the region closest to the center of the image - should be the melt pool
    object_features=measure.regionprops(isolateim)
    return object_features[index] 

print('ok')

ok
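
As a rough illustration of what `CDCS_downloads` does, the sketch below collects every `downloadURL` element together with its sibling fields. It is only a sketch under stated assumptions: the sample record and its `file`, `name`, and `description` elements are made up for illustration and are not the real AM-Bench schema, and it uses the standard-library `xml.etree.ElementTree` instead of `lxml`, so it builds its own parent lookup in place of `getparent()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: every element name except downloadURL is invented for this example.
SAMPLE_XML = """
<record>
  <TrackNumber>3</TrackNumber>
  <files>
    <file>
      <name>track3_BF.tif</name>
      <description>Bright-field cross section</description>
      <downloadURL>https://example.org/files/1</downloadURL>
    </file>
    <file>
      <name>track3_notes.txt</name>
      <description>Operator notes</description>
      <downloadURL>https://example.org/files/2</downloadURL>
    </file>
  </files>
</record>
"""

def list_downloads(xml_text):
    """Return one dict per downloadURL element, holding all of its sibling fields."""
    root = ET.fromstring(xml_text)
    # ElementTree has no getparent(), so build a child -> parent lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    entries = []
    for url_element in root.iter("downloadURL"):
        container = parent_of[url_element]
        entries.append({field.tag: field.text for field in container})
    return entries

downloads = list_downloads(SAMPLE_XML)
for entry in downloads:
    print(entry["name"], "->", entry["downloadURL"])
```

Passing the resulting list of dictionaries to `pandas.DataFrame` would give one row per downloadable file.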


# 1. Import records from CDCS

The first step is to download the data we need from NIST. We search the Configurable Data Curation System (CDCS) for a keyword that will find the data we are looking for. In this example, the keyword is "MP" (for "melt pool"), which will return datasets associated with the Melt Pool Challenge.

Another option is to use the keyword "GS", which will retrieve scanning electron microscope (SEM) images of melt pools taken as part of the grain size (GS) AM-Bench Challenge.

The notebook returns all public datasets with your selected keyword. You can see information about each returned dataset by running the code block with `describe_datasets = True`.

JORDAN NOTE TO SELF: Replace "CDCS" with "AMBench repository"

JORDAN NOTE TO SELF: Look at Lyle's data management diagram to make sure we are using the same terms

The code cell below uses the CDCS REST API client to search the ambench.nist.gov instance for datasets related to AM-Bench challenges, using one of these keywords:
    - MP: search for Melt Pool challenges
    - GS: retrieve scanning electron microscope (SEM) images of melt pools taken as part of the grain size (GS) AM-Bench challenge


In [None]:
keyword='MP'                # keyword to search for: MP = melt pool, GS = grain size
describe_datasets = True    # show name and basic metadata for each dataset
show_dataset_xml = False    # parse xml to show full metadata for each dataset


print('querying CDCS for keyword {0:}...'.format(keyword))
curator = CDCS('https://ambench.nist.gov/', username='') # query CDCS, accessing anonymously
datasets_df = curator.query(template='AM-Bench-2018',keyword=keyword) # searching for results with the keyword given; returns a pandas dataframe with a row for each dataset

### convert all date fields to pandas datetime format (values that cannot be parsed become NaT)
### (setting the dataset ID as the sorted index is currently disabled, see the commented-out lines below)
for thiscol in ['creation_date', 'last_modification_date', 'last_change_date']:
    datasets_df[thiscol] = pandas.to_datetime(datasets_df[thiscol], errors='coerce')
# datasets_df = datasets_df.set_index('id')
# datasets_df = datasets_df.sort_index()

print('Found {0:,.0f} datasets matching keyword {1:}!'.format(len(datasets_df), keyword))
print('\n')

if (describe_datasets):
    for ix, thisrow in datasets_df.iterrows():
        print('Dataset id = {0:}'.format(thisrow['id']))
        print('\tTitle: {0:}'.format(thisrow['title']))
        print('\tCreated: {0:}'.format(thisrow['creation_date'].strftime('%Y-%m-%d')))
#         print('\tLast modified: {0:}'.format(thisrow['last_modification_date'].strftime('%Y-%m-%d')))
#         print('\tLast changed: {0:}'.format(thisrow['last_change_date'].strftime('%Y-%m-%d')))
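        if (show_dataset_xml):
            # NOTE: show_xml is only defined in the commented-out cell at the end of this notebook;
            # uncomment and run that cell before setting show_dataset_xml = True
            show_xml(thisrow['xml_content'])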
        print('\n')

datasets_df
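
As a minimal sketch of the date handling in the cell above, the example below applies `pandas.to_datetime(..., errors='coerce')` to a small made-up table (the IDs and dates are invented, not real AM-Bench records). It also shows one possible way to guard against missing dates before calling `strftime`; the `format_date` helper is hypothetical and not part of this notebook.

```python
import pandas

# Made-up records: the second creation date is deliberately not parseable.
records = pandas.DataFrame({
    "id": ["a1", "a2", "a3"],
    "creation_date": ["2021-03-04 12:00:00", "not a date", "2022-11-30 08:30:00"],
})

# errors="coerce" turns values that cannot be parsed into NaT instead of raising.
records["creation_date"] = pandas.to_datetime(records["creation_date"], errors="coerce")

def format_date(timestamp):
    """Format a timestamp as YYYY-MM-DD, or return 'unknown' for a missing date (NaT)."""
    return timestamp.strftime("%Y-%m-%d") if pandas.notna(timestamp) else "unknown"

labels = [format_date(ts) for ts in records["creation_date"]]
print(labels)
```

Checking for NaT first matters because calling `strftime` on a missing timestamp raises an error rather than returning an empty string.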

# 2. Download and process images from those datasets

In addition to the metadata for each dataset, CDCS returns the dataset itself - but in XML format, which is not intended to be human readable. The dataset XML is given in the `xml_content` column.

This code cell uses the function `xml_url_find`, created in the setup step, to search for the names of the TIFF image files associated with these datasets. For the purposes of this demo notebook, consider that function as a black box; if you want to learn how it works, see the function documentation in the setup step above.

Each dataset includes several images (TIFF files), each associated with a specific track and case number. The code cell below returns these track and case numbers, as well as the URL where the associated image can be downloaded.
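
For readers who do want a look inside the black box, `xml_url_find` picks a file by checking the response headers of each download URL: the file name must contain the search phrase and the content type must contain the requested type. The sketch below reproduces that check offline. It builds `email.message.Message` objects by hand, since the header object returned by `urlopen(...).info()` is a subclass of that class; the file names and the `make_headers` and `matches` helpers are made up for illustration.

```python
from email.message import Message

def make_headers(filename, content_type):
    """Build a stand-in for the headers a download URL might return (hypothetical values)."""
    headers = Message()
    headers["Content-Type"] = content_type
    headers["Content-Disposition"] = 'attachment; filename="{0}"'.format(filename)
    return headers

def matches(headers, searchphrase, mtype):
    """Same test as in xml_url_find: phrase in the file name and mtype in the content type."""
    filename = headers.get_filename() or ""  # get_filename() returns None if no name is given
    return searchphrase in filename and mtype in headers.get_content_type()

candidates = [
    make_headers("track1_BF.tif", "image/tiff"),
    make_headers("track1_BF.png", "image/png"),
    make_headers("track1_DF.tif", "image/tiff"),
]

selected = [h.get_filename() for h in candidates if matches(h, "BF", "image/tiff")]
print(selected)
```

Only the first candidate passes both conditions, which mirrors how the real function returns the first matching file it finds.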

The code cell then downloads the image from that URL, one at a time, and runs an image processing algorithm to measure the width and depth of the melt pool in the image. Set `show_images = True` to see each image as it is downloaded from the NIST server (note that the images themselves are not saved, only the download URL and the measured parameters).

All the image processing happens in the `draw_box_opt` function, created in the setup step.

The code cell outputs a list of measured melt pool parameters for each dataset.
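
The width and depth come from the bounding box of the melt pool region: scikit-image reports a bounding box as `(min_row, min_col, max_row, max_col)`, and this notebook's measurement code multiplies the pixel extents by 0.062 microns per pixel. A minimal sketch of that arithmetic, using a made-up bounding box and a hypothetical helper name:

```python
MICRONS_PER_PIXEL = 0.062  # scale factor used in this notebook's measurement code

def bbox_to_size(bbox, microns_per_pixel=MICRONS_PER_PIXEL):
    """Convert a (min_row, min_col, max_row, max_col) bounding box to (width, depth) in microns."""
    min_row, min_col, max_row, max_col = bbox
    width = (max_col - min_col) * microns_per_pixel  # horizontal extent
    depth = (max_row - min_row) * microns_per_pixel  # vertical extent
    return width, depth

# Made-up bounding box: 1600 pixels wide and 500 pixels deep.
width, depth = bbox_to_size((200, 500, 700, 2100))
print("width = {0:.1f} microns, depth = {1:.1f} microns".format(width, depth))
```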

JORDAN NOTE TO SELF: Better distinguish between metadata and data


### Previous text to integrate into the above:

General approach to image analysis:
- using scikit-image to perform image analysis and matplotlib to display images
- uses the felzenszwalb image segmentation algorithm to help obtain widths and depths of the melt pools
- displays plots of width and depth
- stores measurements in a csv file

#### draw_box_opt
This function (1) crops the melt pool image (to reduce runtime), (2) applies the Felzenszwalb segmentation algorithm to the cropped image, (3) removes small regions (to eliminate some background noise), (4) builds a table of centroid coordinates for the remaining regions, (5) measures the distance from each centroid to the center of the cropped image, (6) finds the index of the closest region, and (7) returns information about that closest region.

Parameters:
- image to be analyzed

Returns:
- the RegionProperties object for the region closest to the image center (assumed to be the melt pool)
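
Steps (4) to (6) can be tried without any image data. The sketch below takes a made-up list of region centroids and picks the one nearest the center of a 900 x 2600 pixel crop, the same crop size `draw_box_opt` uses. The centroid values and the `closest_region` helper are assumptions for illustration only.

```python
import math

def closest_region(centroids, image_shape):
    """Return the index of the centroid nearest the center of an image of the given (rows, cols) shape."""
    center = (image_shape[0] / 2, image_shape[1] / 2)
    distances = [math.dist(centroid, center) for centroid in centroids]
    return distances.index(min(distances))

# Made-up (row, col) centroids for three segmented regions.
centroids = [(120.0, 300.0), (455.0, 1280.0), (800.0, 2400.0)]

index = closest_region(centroids, (900, 2600))
print("closest region index:", index)
```

This relies on the melt pool being roughly centered in the crop; a region that is nearer the center than the melt pool would be selected instead.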


In [None]:
searchphrase='BF'     # phrase that must appear in the image file name; passed to the xml_url_find function
show_images = True   # set to True to see what the images look like as they are found

print('finding...')
datasets_df = datasets_df.assign(filename = np.nan, url = np.nan, track = np.nan, case = np.nan, melt_pool_width_microns = np.nan, melt_pool_depth_microns = np.nan)


for ix, thisrow in datasets_df.iterrows():
#    print(thisrow['xml_content'])
    print(ix)
    print('---------')
    res=xml_url_find(thisrow['xml_content'],searchphrase,'image/tiff') # pass each xml string to xml_url_find, searching for TIFF files whose name contains searchphrase
    print('---------')

    
#     datasets_df.loc[ix, ['filename', 'url', 'track', 'case']] = res
    
#     print('Getting image for track {0:,.0f} case {1:}...'.format(thisrow['track'], thisrow['case']))
    
#     this_image = io.imread(thisrow['url'])
    
#     if (show_images):
#         print('showing marked image...')
#         fig, ax = plt.subplots(1,1)
#         ax.imshow(this_image)
    
#     print('\tmeasuring melt pool in image...')
#     minr,minc,maxr,maxc = draw_box_opt(this_image).bbox #using the bbox drawn around the melt pool to find width and depth
    
#     datasets_df.loc[ix, 'melt_pool_width_microns'] = (maxc-minc)*(0.062)
#     datasets_df.loc[ix, 'melt_pool_depth_microns'] = (maxr-minr)*(0.062)


# datasets_df.loc[:, 'track'] = pandas.to_numeric(datasets_df['track'], downcast='integer', errors='coerce')
# datasets_df = datasets_df[['title', 'track', 'case', 'melt_pool_width_microns', 'melt_pool_depth_microns', 'template', 'workspace',  'creation_date', 'last_modification_date', 'last_change_date', 'filename', 'url', 'xml_content', 'template_title', 'user_id']]
# datasets_df = datasets_df.sort_values(by='track')


# datasets_df[['filename', 'track', 'case']]


# 3. Make plots of melt pool sizes

Bar plot of melt pool widths

In [None]:
datasets_df[['melt_pool_width_microns', 'track']].plot(y='melt_pool_width_microns',x='track',kind='bar')

Bar plot of melt pool depths

In [None]:
datasets_df[['melt_pool_depth_microns', 'track']].plot(y='melt_pool_depth_microns',x='track',kind='bar')

In [None]:
datasets_df.plot('melt_pool_width_microns', 'melt_pool_depth_microns', kind='scatter') # scatter plot of depth against width

Example of what the image looks like with the bbox marked

In [None]:
# fig, ax = plt.subplots()
# minr,minc,maxr,maxc=bboxes[4]
# ax.imshow(result[4][900:1800,800:3400])
# bx = (minc, maxc, maxc, minc, minc)
# by = (minr, minr, maxr, maxr, minr)
# ax.plot(bx, by, '-b', linewidth=2.5)
# plt.show()

In [None]:


# depths=[]
# widths=[]
# bboxes=[]
# for im in dfres['image']:
#     objf=draw_box_opt(im)
#     bboxes.append(objf.bbox)
#     minr,minc,maxr,maxc=objf.bbox #using the bbox drawn around the melt pool to find width and depth
#     widths.append((maxc-minc)*(0.062)) #using the conversion given in the CDCS comment above the images to convert pixels to microns (1 pixel=0.062 microns)
#     depths.append((maxr-minr)*(0.062))
# dat={'width':widths,'depth':depths,'track':trace,'case':case}
# dfmp=pandas.DataFrame(dat)
# dfmp

In [None]:

# fig = plt.subplots(1,1)

# for ix, thisrow in datasets_df.iterrows():
    
#     print('Getting image for track {0:,.0f} case {1:}...'.format(thisrow['track'], thisrow['case']))
#     this_image = io.imread(thisrow['url'])

#     print('\tmeasuring melt pool in image...')
#     minr,minc,maxr,maxc = draw_box_opt(this_image).bbox #using the bbox drawn around the melt pool to find width and depth

#     datasets_df.loc[ix, 'melt_pool_width_microns'] = (maxc-minc)*(0.062)
#     datasets_df.loc[ix, 'melt_pool_depth_microns'] = (maxr-minr)*(0.062)
    
#     print('\tshowing marked image...')    
#     fig, ax = plt.subplots(1,1)
    
#     bx = (minc, maxc, maxc, minc, minc)
#     by = (minr, minr, maxr, maxr, minr)
    
#     ax.imshow(this_image[900:1800,800:3400])
#     ax.plot(bx, by, '-b', linewidth=2.5)
#     ax.set_title('Track {0:.0f} case {1:} (width = {2:.1f} µm, depth = {3:.1f} µm)'.format(thisrow['track'], thisrow['case'], datasets_df.loc[ix]['melt_pool_width_microns'], datasets_df.loc[ix]['melt_pool_depth_microns']))
#     plt.show()
    
    
#     #using the conversion given in the CDCS comment above the images to convert pixels to microns (1 pixel=0.062 microns)
# # objf
# #datasets_df.head(2)
# #plt.show()
# print('Done')

In [None]:
# def show_xml(xml):
#     print('xml tree:')
#     print('---------')
#     root = et.fromstring(xml)
#     for i in range(0, len(root)):
#         if (len(root[i]) <= 1):
#             print('{0:.0f}. {1:}: {2:}'.format(i, root[i].tag, root[i].text))
#         else:
#             print('{0:.0f}. {1:} has {2:.0f} elements!'.format(i, root[i].tag, len(root[i])))
#             for j in range(0, len(root[i])):
#                 if (len(root[i][j]) <= 1):
#                     print('\t{0:.0f}.{1:.0f}. {2:}: {3:}'.format(i, j, root[i][j].tag, root[i][j].text))
#                 else:
#                     print('\t{0:.0f}.{1:.0f}. {2:} has {3:.0f} elements!'.format(i, j, root[i][j].tag, len(root[i][j])))
#                     for k in range(0, len(root[i][j])):
#                         if (len(root[i][j][k]) <= 1):
#                             print('\t\t{0:.0f}.{1:.0f}.{2:.0f}. {3:}: {4:}'.format(i, j, k, root[i][j][k].tag, root[i][j][k].text))
#                         else:
#                             print('\t\t\t{0:.0f}.{1:.0f}.{2:.0f}. {3:} has {4:.0f} elements!'.format(i, j, k, root[i][j][k].tag, len(root[i][j][k])))
#                             for l in range(0, len(root[i][j][k])):
#                                 if (len(root[i][j][k][l]) <= 1):
#                                     print('\t\t\t{0:.0f}.{1:.0f}.{2:.0f}.{3:.0f}. {4:}: {5:}'.format(i, j, k, l, root[i][j][k][l].tag, root[i][j][k][l].text))
#                                 else:
#                                     for m in range(0, len(root[i][j][k][l])):
#                                         if (len(root[i][j][k][l][m]) <= 1):
#                                             print('\t\t\t\t{0:.0f}.{1:.0f}.{2:.0f}.{3:.0f}.{4:.0f}. {5:}: {6:}'.format(i, j, k, l, m, root[i][j][k][l][m].tag, root[i][j][k][l][m].text))
#                                         else:
#                                             print('\t\t\t\t{0:.0f}.{1:.0f}.{2:.0f}.{3:.0f}.{4:.0f}. {5:} has {6:.0f} elements!'.format(i, j, k, l, m, root[i][j][k][l][m].tag, len(root[i][j][k][l][m])))