# Custom Image Analysis

## Table of Contents (TOC) <a class="anchor" id="toc"></a>

- [1. Imports](#first-bullet)
- [2. Zegami Client, Workspace, and Collection](#second-bullet)
- [3. Dropping rows from the datafile that are not useful](#third-bullet)
- [4. Adding chest x-rays of healthy individuals](#fourth-bullet)
- [5. Selecting data based on image similarity](#fifth-bullet)
- [6. Custom analysis of data](#sixth-bullet)

## 1. Imports <a class="anchor" id="first-bullet"></a> <span style="font-size:0.5em;">[(Back to TOC)](#toc)</span>

In [1]:
from zegami_sdk.client import ZegamiClient
from zegami_sdk.source import UploadableSource

from skimage.transform import resize
from skimage.feature import graycomatrix, graycoprops

from matplotlib import pyplot as plt

import os
import pandas as pd
import numpy as np

from torch.utils.data import Dataset, DataLoader

## 2. Zegami Client, Workspace, and Collection <a class="anchor" id="second-bullet"></a> <span style="font-size:0.5em;">[(Back to TOC)](#toc)</span>

In [2]:
zc = ZegamiClient()

Used token from '/home/martim-zegami/zegami_com.zegami.token'.
Client initialized successfully, welcome .



In [3]:
workspaces_lst = zc.workspaces
# print(workspaces_lst)

In [4]:
# Get workspace using the ID
WORKSPACE_ID = 'GUDe4kRY' # zc.workspaces[1].id - select ID after printing workspaces
workspace = zc.get_workspace_by_id(WORKSPACE_ID)

In [5]:
workspace.show_collections()


Collections in 'Martim Chaves' (3):
6271698f81e4bccb640d6e24 : Flags of the world
62850bd97e2168af00fef191 : X-ray-analysis
629f7935295f96ab09285a8a : Xray-analysis


In [6]:
collection = workspace.get_collection_by_name('X-ray-analysis')
print(collection)

<CollectionV1 id=62850bd97e2168af00fef191 name=X-ray-analysis>


## 3. Dropping rows from the datafile that are not useful <a class="anchor" id="third-bullet"></a> <span style="font-size:0.5em;">[(Back to TOC)](#toc)</span>

<img src="./images/wrong_filenames.png" width="1000"/>

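Some rows in the datafile pointed to files that do not exist. To clean this up, those rows were given the Tag 'wrong_file_name', which makes removing them simple: fetch the tagged rows, keep every row whose filename is not among them, and replace the datafile with the result.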

In [30]:
collec_remove_tag = collection.get_rows_by_tags(['wrong_file_name']).copy()

In [47]:
temp_coll_rows = collection.rows[~collection.rows.filename.isin(collec_remove_tag.filename)]

In [49]:
collection.replace_data(temp_coll_rows)
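The filtering step above can be tried out on a toy table first. The following is a minimal sketch that uses a small made-up DataFrame in place of `collection.rows`; the filenames and findings are illustrative assumptions, not data from the collection.

```python
import pandas as pd

# Toy stand-in for collection.rows (values are made up for illustration)
rows = pd.DataFrame({
    'filename': ['a.png', 'b.png', 'c.png', 'd.png'],
    'finding': ['COVID-19', 'healthy', 'COVID-19', 'healthy'],
})

# Toy stand-in for the rows returned for the 'wrong_file_name' Tag
tagged_rows = rows.iloc[[1, 3]]

# Keep only the rows whose filename is NOT among the tagged ones
kept_rows = rows[~rows['filename'].isin(tagged_rows['filename'])]
print(kept_rows['filename'].tolist())
```

`isin` builds a boolean mask and `~` inverts it, so the rows that survive are exactly the untagged ones.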

## 4. Adding chest x-rays of healthy individuals <a class="anchor" id="fourth-bullet"></a> <span style="font-size:0.5em;">[(Back to TOC)](#toc)</span>

To get a better idea of the differences between individuals with COVID-19 and healthy individuals, we need some x-rays of healthy chests.

First, let's focus on the COVID-19 positive images and count how many we have.

In [58]:
pa_view_df = collection.get_rows_by_tags(['pa_sim_view'])

In [59]:
pa_covid_df = pa_view_df[pa_view_df['finding'] == 'Pneumonia/Viral/COVID-19']

In [60]:
len(pa_covid_df)

481

This quickly shows that there are 481 COVID-19 positive images in a view equal or similar to PA. Let's get a similar number of healthy chest x-ray images. For that, we used a Kaggle dataset (https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia).
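The notebook does not record how the healthy images were picked from the Kaggle dataset. One possible way to draw a fixed number of files reproducibly is sketched below; `sample_filenames` and the candidate names are hypothetical and only serve as an illustration.

```python
import random

def sample_filenames(filenames, n, seed=0):
    """Return n distinct filenames chosen at random, reproducibly."""
    if n > len(filenames):
        raise ValueError('not enough files to sample from')
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return sorted(rng.sample(filenames, n))

# Hypothetical candidate names, standing in for a directory listing
candidates = [f'IM-{i:04d}-0001.jpeg' for i in range(1000)]
chosen = sample_filenames(candidates, 481)
print(len(chosen))
```

Matching the class sizes keeps later comparisons between the two groups from being dominated by the larger one.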

Besides the images, we only need to provide a supplementary data file containing two columns: 'filename', with the name of each image file, and 'finding', set to 'healthy'.

In [70]:
file_names = os.listdir('./data/healthy_x_rays')
healthy_findings = ['healthy'] * len(file_names)  # one 'healthy' label per image file

In [76]:
len(file_names)

481

In [74]:
sup_data = pd.DataFrame({
    'finding': healthy_findings,
    'filename': file_names
    })

In [75]:
sup_data.head()

|   | finding | filename |
|---|---------|----------|
| 0 | healthy | IM-0332-0001.jpeg |
| 1 | healthy | IM-0273-0001.jpeg |
| 2 | healthy | IM-0162-0001.jpeg |
| 3 | healthy | IM-0335-0001.jpeg |
| 4 | healthy | IM-0472-0001.jpeg |


In [78]:
sup_data.to_excel('./data/sup_data_healthy.xlsx')
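Before uploading, it can be worth confirming that the data file and the image folder agree, since mismatched filenames were the problem cleaned up in section 3. The sketch below is one way to do that; `missing_and_extra` is a hypothetical helper, and the temporary folder and file names exist only for the demonstration.

```python
import os
import tempfile

import pandas as pd

def missing_and_extra(data, image_dir, column='filename'):
    """Compare filenames listed in the data with the files found on disk."""
    listed = set(data[column])
    on_disk = set(os.listdir(image_dir))
    # missing: listed in the data but not on disk; extra: on disk but not listed
    return sorted(listed - on_disk), sorted(on_disk - listed)

with tempfile.TemporaryDirectory() as image_dir:
    # Create two empty placeholder files to stand in for images
    for name in ['a.jpeg', 'b.jpeg']:
        open(os.path.join(image_dir, name), 'w').close()

    data = pd.DataFrame({
        'finding': ['healthy', 'healthy'],
        'filename': ['a.jpeg', 'c.jpeg'],
    })
    missing, extra = missing_and_extra(data, image_dir)

print(missing, extra)
```

Two empty lists mean every row has an image and every image has a row.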

In [82]:
supplementary_data_file = './data/sup_data_healthy.xlsx'

images = './data/healthy_x_rays'

upload = UploadableSource('X-ray-analysis', images, column_filename='filename')

UploadableSource "X-ray-analysis" found 481 images in "./data/healthy_x_rays"


In [85]:
collection.add_images(upload, supplementary_data_file)

- Checking data matches uploadable sources



- Uploadable source 0 "X-ray-analysis" beginning upload


100%|██████████| 490/490 [00:28<00:00, 17.23image/s]


## 5. Selecting data based on image similarity <a class="anchor" id="fifth-bullet"></a> <span style="font-size:0.5em;">[(Back to TOC)](#toc)</span>

<img src="./images/multiple_clusters.png" width="1000"/>

Looking at the clusters and colouring the samples by finding, we can see that the large main cluster in the centre corresponds to samples with a view similar to posteroanterior (PA). The healthy images we added are also in that view, and they form the other main cluster. We could create a subset containing only these two clusters.

<img src="./images/clusters_by_finding.png" width="1000"/>

Looking at the clusters coloured by finding, clustering on the extracted features already appears to separate healthy and non-healthy x-rays quite well. We start by focusing on the cluster in the centre, selecting it using the scatter plot filter.

<img src="./images/focus_centre_cluster.png" width="1000"/>

Afterwards, we select all of the images in it and create a Tag called 'pneumonia_pa', for images that show some sort of pneumonia in a view equal or similar to PA.

<img src="./images/pneumonia_pa.png" width="1000"/>

We can do a similar thing with the other main cluster, containing healthy images, creating a 'healthy_pa' Tag.

In [8]:
collection.tags.keys()

dict_keys(['pa_sim_view', 'pneumonia_pa', 'wrong_file_name', 'healthy_pa'])

## 6. Custom Analysis of Data <a class="anchor" id="sixth-bullet"></a> <span style="font-size:0.5em;">[(Back to TOC)](#toc)</span>

### 6.1. Creating a data generator

In [9]:
class CovidDataset(Dataset):

    def __init__(self, collection, class_column='class', image_size = (450,450), tag = ''):
        
        
        self._collection = collection
        self._class_column = class_column

        if len(tag) > 0:
            self.subject_ids = list(collection.get_rows_by_tags([tag]).index)
        else:
            self.subject_ids = list(collection.rows.index)
        self.image_size = image_size

    @property
    def collection(self):
        # Read-only access to the underlying Zegami collection
        return self._collection

    def __len__(self):
        return len(self.subject_ids)

    def rgb2gray(self,img):
        
        if len(img.shape) == 2: return img
        
        img = np.copy(img.astype(np.float32))
        gray = np.add(img[0::,0::,0],np.add(img[0::,0::,1],img[0::,0::,2]))
        gray = np.divide(gray,3)
        gray = gray.astype(np.uint8)
        
        return gray

    def __getitem__(self,idx):
        subject_id = int(self.subject_ids[idx])
        class_name = self._collection.rows.at[subject_id, self._class_column]

        if 'healthy' in class_name:
            class_id = 0
        else:
            class_id = 1

        url = self.collection.get_image_urls(subject_id)[0]
        img = self.collection.download_image(url)
        img = np.array(img, dtype='uint8')
        img = self.rgb2gray(img)
        img = resize(img, self.image_size)
        
        return img, class_id, subject_id
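The `rgb2gray` method averages the three colour channels. A compact equivalent can be written with `mean(axis=2)`, as in the sketch below; `to_gray` is a hypothetical standalone function, and the pixel values are made up so the result can be checked by hand.

```python
import numpy as np

def to_gray(img):
    """Average the first three channels; pass 2-D images through unchanged."""
    if img.ndim == 2:
        return img
    # Convert to float first so the sum of three uint8 values cannot overflow
    return img[..., :3].astype(np.float32).mean(axis=2).astype(np.uint8)

rgb = np.array([[[10, 20, 30], [255, 255, 255]],
                [[1, 1, 2], [0, 0, 0]]], dtype=np.uint8)
gray = to_gray(rgb)
print(gray.tolist())
```

As in the method above, the final cast to `uint8` truncates, so a mean of 1.33 becomes 1.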

In [11]:
healthy_data = CovidDataset(collection, class_column='finding', tag = 'healthy_pa')
pathologic_data = CovidDataset(collection, class_column='finding', tag = 'pneumonia_pa')

In [12]:
healthy_generator = DataLoader(healthy_data, batch_size=4, shuffle=True, num_workers=0, pin_memory=True)
pathologic_generator = DataLoader(pathologic_data, batch_size=4, shuffle=True, num_workers=0, pin_memory=True)

### 6.2. Custom Analysis (homogeneity)

In [13]:
def convert(img, target_type_min, target_type_max, target_type):
    imin = img.min()
    imax = img.max()

    a = (target_type_max - target_type_min) / (imax - imin)
    b = target_type_max - a * imax
    new_img = (a * img + b).astype(target_type)
    return new_img
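`convert` is a linear min-max rescaling: the smallest pixel maps to the lower bound of the target range and the largest to the upper bound. It divides by `imax - imin`, so a completely uniform image would cause a division by zero. The sketch below shows the same idea with a guard for that case. `rescale` is a hypothetical variant, not the notebook's function, and it rounds to the nearest integer instead of truncating.

```python
import numpy as np

def rescale(img, new_min=0, new_max=255, dtype=np.uint8):
    """Linearly map img from [img.min(), img.max()] to [new_min, new_max]."""
    img = np.asarray(img, dtype=np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # A constant image has no range to stretch; return a flat image
        return np.full(img.shape, new_min, dtype=dtype)
    scaled = (img - lo) / (hi - lo) * (new_max - new_min) + new_min
    return np.rint(scaled).astype(dtype)

example = rescale(np.array([[0.0, 0.2], [0.6, 1.0]]))
flat = rescale(np.full((2, 2), 0.5))
print(example.tolist(), flat.tolist())
```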

In [14]:
def calculate_homogeneity(img, offsetdist=[1], offsetang = [7*np.pi/4], imgvals = 256):
    img = convert(img, 0, 255, np.uint8)
    glcm = graycomatrix(img, distances=offsetdist, angles=offsetang, levels=imgvals, symmetric=False, normed=True)
    return graycoprops(glcm, 'homogeneity')[0, 0]
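To make the homogeneity value less of a black box, here is a small hand-rolled version for a single pixel offset. It counts how often grey level `i` is followed by grey level `j` at that offset, normalises the counts into a matrix `P`, and sums `P[i, j] / (1 + (i - j)**2)`, which is the homogeneity definition given in the scikit-image documentation. The function name and the explicit `(d_row, d_col)` offset are assumptions made for this illustration; it is not a replacement for `graycomatrix`.

```python
import numpy as np

def glcm_homogeneity(img, d_row=0, d_col=1, levels=None):
    """Homogeneity of a non-symmetric, normalised GLCM for one pixel offset."""
    img = np.asarray(img)
    if levels is None:
        levels = int(img.max()) + 1
    glcm = np.zeros((levels, levels), dtype=np.float64)
    n_rows, n_cols = img.shape
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + d_row, c + d_col
            # Only count pairs whose second pixel lies inside the image
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                glcm[img[r, c], img[r2, c2]] += 1
    total = glcm.sum()
    if total == 0:
        raise ValueError('offset leaves no pixel pairs inside the image')
    glcm /= total
    i, j = np.indices(glcm.shape)
    # Pairs with equal grey levels get weight 1, distant levels get less
    return float((glcm / (1.0 + (i - j) ** 2)).sum())

tiny = np.array([[0, 0, 1],
                 [1, 1, 1]])
print(glcm_homogeneity(tiny))
```

For `tiny`, the horizontal pairs are (0,0), (0,1), (1,1) and (1,1), giving 0.25/1 + 0.25/2 + 0.5/1 = 0.875. A perfectly uniform image scores 1.0, the maximum.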

In [15]:
imgs_index = []
imgs_homogeneity = []

for counter, (imgs, _, idxs) in enumerate(healthy_generator):
    
    imgs = imgs.detach().cpu().numpy()
    idxs = idxs.detach().cpu().numpy()

    for img, idx in zip(imgs, idxs):
        homogeneity_val = calculate_homogeneity(img)

        imgs_index.append(idx)
        imgs_homogeneity.append(homogeneity_val)

In [16]:
for counter, (imgs, _, idxs) in enumerate(pathologic_generator):
    
    imgs = imgs.detach().cpu().numpy()
    idxs = idxs.detach().cpu().numpy()

    for img, idx in zip(imgs, idxs):
        homogeneity_val = calculate_homogeneity(img)

        imgs_index.append(idx)
        imgs_homogeneity.append(homogeneity_val)
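The two loops above differ only in the generator they iterate over, so they could share one helper. The sketch below shows that idea with plain NumPy batches standing in for the `DataLoader` output; `collect_feature` and the fake batches are hypothetical and only illustrate the pattern.

```python
import numpy as np

def collect_feature(batches, feature_fn):
    """Apply feature_fn to every image in every (imgs, labels, idxs) batch."""
    indices, values = [], []
    for imgs, _, idxs in batches:
        for img, idx in zip(np.asarray(imgs), np.asarray(idxs)):
            indices.append(int(idx))
            values.append(feature_fn(img))
    return indices, values

# Fake batches: (images, class ids, row indices), values made up
fake_batches = [
    (np.ones((2, 4, 4)), [0, 0], [10, 11]),
    (np.zeros((1, 4, 4)), [1], [12]),
]
indices, values = collect_feature(fake_batches, feature_fn=lambda img: float(img.mean()))
print(indices, values)
```

With the real generators, `feature_fn` would be `calculate_homogeneity`, and the results of both calls would be concatenated.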

### 6.3. Creating a new feature

In [17]:
homogeneity_feature = pd.DataFrame({'index': imgs_index,
                                    'homogeneity': imgs_homogeneity})

In [20]:
homogeneity_feature.set_index('index', inplace=True)

### 6.4. Adding the feature to the datafile

In [7]:
datafile_df = collection.rows.copy()

In [32]:
# Add homogeneity data to the datafile
datafile_df = datafile_df.join(homogeneity_feature)
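`DataFrame.join` aligns on the index, so the order in which the images were processed does not matter, and rows without a computed value end up as `NaN`. A small sketch with made-up numbers:

```python
import pandas as pd

# Toy datafile with a default integer index (values are illustrative)
datafile = pd.DataFrame({'finding': ['healthy', 'COVID-19', 'COVID-19']},
                        index=[0, 1, 2])

# Feature computed for rows 2 and 0 only, in shuffled order
feature = pd.DataFrame({'index': [2, 0],
                        'homogeneity': [0.31, 0.42]}).set_index('index')

joined = datafile.join(feature)  # left join on the index by default
print(joined)
```

Row 1 keeps its place in the table but has no homogeneity value, which is what happens to any row outside the two tagged subsets.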

In [33]:
# Replace datafile to include new data
collection.replace_data(datafile_df)

In [12]:
# Check status of collection (status has to be completed before continuing)
collection.status

{'changed_at': 'Wed, 10 Aug 2022 11:04:35 GMT',
 'progress': 1.0,
 'status': 'completed'}

Once the status is 'completed', the new homogeneity feature is available in the collection datafile.
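Instead of re-running the status cell by hand, the check can be wrapped in a small polling loop. The sketch below is a generic helper, not part of the Zegami SDK: `wait_until_completed` is a hypothetical name, the 'processing' status value in the fake data is made up, and it assumes that each call to `get_status` returns a fresh dictionary shaped like the output above.

```python
import time

def wait_until_completed(get_status, timeout_s=600.0, poll_s=5.0, sleep=time.sleep):
    """Call get_status() until it reports 'completed' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status.get('status') == 'completed':
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f'collection not ready, last status: {status}')
        sleep(poll_s)

# Fake status sequence standing in for repeated reads of collection.status
_fake = iter([
    {'status': 'processing', 'progress': 0.5},
    {'status': 'completed', 'progress': 1.0},
])
final = wait_until_completed(lambda: next(_fake), sleep=lambda s: None)
print(final)
```

With a real collection, the call might look like `wait_until_completed(lambda: collection.status)`, assuming the property is refreshed on each access.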