This Jupyter notebook serves as accompanying material to this repository's README, a report for the “Big Data Engineering” subject of UPM’s Master in Computational Biology. It is therefore only intended as a compilation of the code used; for the full discussion, please refer to the README.

In [1]:
# First, we will create a Conda Environment to do all our processing
# in python3.7, the only one orca supports

#!conda create -n bigdataenv python=3.7
#!source activate bigdataenv
!python --version # We can check everything went successfully

Python 3.7.11


In [2]:
# Now, we will install Java8

#!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Set environment variable JAVA_HOME.
#os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
#!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

!java -version # And check it went correctly, too

openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
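
The comment in the cell above mentions Java 8, while the captured output reports OpenJDK 11.0.13. A minimal sketch for checking the major version programmatically follows. `java_major_version` is a hypothetical helper, not part of the original notebook, and it only parses the banner text that `java -version` prints.

```python
import re

def java_major_version(banner: str) -> int:
    """Extract the Java major version from the first line of `java -version`.

    Handles both the legacy scheme ("1.8.0_292" -> 8) and the
    modern scheme ("11.0.13" -> 11).
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        raise ValueError("no version string found in banner")
    major = int(match.group(1))
    # Before Java 9 the version was written as 1.<major>.x
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major

print(java_major_version('openjdk version "11.0.13" 2021-10-19'))  # 11
print(java_major_version('openjdk version "1.8.0_292"'))           # 8
```

In a notebook, the banner could be captured with `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`, since `java -version` writes to standard error.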


In [3]:
# And the latest pre-release version of BigDL Orca 
# Installing BigDL Orca from pip will automatically install pyspark, bigdl, and their dependencies.
#!pip install --pre --upgrade bigdl-orca

import findspark; findspark.init()
from bigdl.orca import init_orca_context, stop_orca_context
from bigdl.orca import OrcaContext

In [4]:
# Install python dependencies

#!pip3 install torch==1.7.1 torchvision==0.8.2
#!pip install six cloudpickle
#!pip install jep==3.9.0

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torchvision
import torchvision.models as models
import torchvision.transforms as T

In [5]:
import sys; import os
from os import listdir
from os.path import isfile, join
import numpy as np

In [6]:
# Define the path to the different folders:
test_path = './chest_xray/test'
train_path = './chest_xray/train'
validation_path = './chest_xray/val'

We want to apply the following preprocessing:

* CenterCrop - crops the central 299 x 299 region of the image (no resizing is done)
* RandomHorizontalFlip - flips each image horizontally with probability 0.5
* ColorJitter - scales the brightness of each image by a random factor between 0.5 and 1.5
* Normalize - normalizes each channel with mean 0.5 and standard deviation 0.5


In [7]:
def custom_transform(sample):
  transformer = torchvision.transforms.Compose([T.CenterCrop(size=(299, 299)), T.ToTensor(), 
                                                T.RandomHorizontalFlip(p=0.5),
                                                T.ColorJitter(brightness=0.5, hue=0), 
                                                T.Normalize((0.5,), (0.5,)),])
  return transformer(sample["image"]), sample["label"]
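
As a sanity check, the arithmetic behind `T.Normalize((0.5,), (0.5,))` and `T.ColorJitter(brightness=0.5)` can be sketched in plain Python. The helper names below are hypothetical and not part of torchvision; the sketch only reproduces the formulas, not the library calls.

```python
def normalize(value, mean=0.5, std=0.5):
    """Per-pixel formula applied by Normalize: (x - mean) / std."""
    return (value - mean) / std

def brightness_range(brightness):
    """Interval from which ColorJitter samples the brightness factor."""
    return (max(0.0, 1.0 - brightness), 1.0 + brightness)

# ToTensor scales pixels to [0, 1]; Normalize then maps them to [-1, 1].
print(normalize(0.0), normalize(0.5), normalize(1.0))  # -1.0 0.0 1.0
print(brightness_range(0.5))                           # (0.5, 1.5)
```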

We create the datasets:

In [8]:
from PIL import Image
#from skimage import io

class CustomDataset(Dataset):
    """Chest X-ray dataset read from NORMAL and PNEUMONIA subfolders."""

    def __init__(self, root_dir, transform=None):
        """
        Args:
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.root_dir = root_dir
        self.transform = transform
        normal_names = ["NORMAL/" + f for f in listdir(join(root_dir, "NORMAL")) if isfile(join(root_dir, "NORMAL", f))]
        labels_normal = [0]*len(normal_names)
        pneumonia_names = ["PNEUMONIA/" + f for f in listdir(join(root_dir, "PNEUMONIA")) if isfile(join(root_dir, "PNEUMONIA", f))]
        labels_pneumonia = [1]*len(pneumonia_names)
        self.labels = labels_normal
        self.labels.extend(labels_pneumonia)
        self.labels = np.asarray(self.labels, dtype=np.float32)
        # labelling done

        self.filenames = normal_names
        self.filenames.extend(pneumonia_names)

        

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        img_name = os.path.join(self.root_dir,
                                self.filenames[idx])
        image = Image.open(img_name).convert("RGB")
        #print(image.shape)
        label = torch.Tensor([self.labels[idx]])
        sample = {"image": image, "label": label}

        if self.transform:
            sample = self.transform(sample)

        return sample
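
The labelling logic of the class above (0 for files under `NORMAL`, 1 for files under `PNEUMONIA`) can be tried without any images. The sketch below is an illustration under assumptions: `list_labelled_files` is a hypothetical helper, the directory tree is a toy stand-in for `./chest_xray/train`, and the file names are sorted, which the original class does not do.

```python
import os
import tempfile

def list_labelled_files(root_dir, classes=("NORMAL", "PNEUMONIA")):
    """Return (relative_path, label) pairs; the label is the class index."""
    samples = []
    for label, class_name in enumerate(classes):
        class_dir = os.path.join(root_dir, class_name)
        # sorted() makes the order reproducible; os.listdir order is arbitrary
        for name in sorted(os.listdir(class_dir)):
            if os.path.isfile(os.path.join(class_dir, name)):
                samples.append((os.path.join(class_name, name), label))
    return samples

# Toy directory tree with empty files, for illustration only
toy_tree = {"NORMAL": ["a.jpeg"], "PNEUMONIA": ["b.jpeg", "c.jpeg"]}
with tempfile.TemporaryDirectory() as root:
    for class_name, files in toy_tree.items():
        os.makedirs(os.path.join(root, class_name))
        for name in files:
            open(os.path.join(root, class_name, name), "w").close()
    samples = list_labelled_files(root)

print([label for _, label in samples])  # [0, 1, 1]
```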

In [9]:
train_data = CustomDataset(train_path, transform=custom_transform)
val_data = CustomDataset(validation_path, transform=custom_transform)
test_data = CustomDataset(test_path, transform=custom_transform)

We load the data:

In [10]:
import torchvision
batch_size = 32
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True, num_workers=2)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=batch_size, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=False, num_workers=2)

In [11]:
# Check number of images 
print('Number of images for training: ', len(train_data))
print('Number of images for testing: ', len(test_data))
print('Number of images for validation: ', len(val_data))

Number of images for training:  5216
Number of images for testing:  624
Number of images for validation:  16
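
With the counts above and `batch_size = 32`, the number of batches per pass over each split follows directly. This is a small arithmetic sketch; it assumes the `DataLoader` default `drop_last=False`, which the notebook does not override.

```python
import math

batch_size = 32
dataset_sizes = {"train": 5216, "test": 624, "val": 16}

# With drop_last=False the final, smaller batch is kept,
# so the number of batches is the ceiling of size / batch_size.
batches = {name: math.ceil(n / batch_size) for name, n in dataset_sizes.items()}
print(batches)  # {'train': 163, 'test': 20, 'val': 1}
```

Note that the validation split fills only half of a single batch.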


## 5. NEURAL NETWORK STRUCTURE

### INTEGRATED STACKING NETWORK

In [12]:
# import necessary libraries and modules
from __future__ import print_function
import os
import argparse

In [21]:
# recommended to set it to True when running BigDL in Jupyter notebook. 
OrcaContext.log_output = True # (this will display terminal's stdout and stderr in the Jupyter notebook).
init_orca_context()

In [22]:
# check that stacking works as expected
a = torch.rand(size=(8, 32))
b = torch.rand(size=(8, 32))
c = torch.rand(size=(8, 32))
print(a.shape)
d = torch.cat((a, b, c), axis=-1)
print(d.shape)

torch.Size([8, 32])
torch.Size([8, 96])


In the end, we could not stack the 5 neural networks because of a RAM limitation (12 GB in the free tier of Google Colab): the resources used in the paper are larger than the ones we have.
It is simply not possible to define the same neural network they did, because the session crashes when it hits the RAM limit, but we define the code and check that the network works anyway:

```
# Define the network
class IntegratedNet(nn.Module):
  def __init__(self):
    super(IntegratedNet, self).__init__()

    self.resnet18 = models.resnet18(pretrained=True)
    self.resnet18.fc = nn.Linear(512, 32)

    self.densenet = models.densenet161(pretrained=True)
    self.densenet.classifier = nn.Linear(2208, 32)

    self.inception = models.inception_v3(pretrained=True)
    self.inception.fc = nn.Linear(2048, 32)
    
    self.mnasnet = models.mnasnet1_0(pretrained=True)
    self.mnasnet.classifier = nn.Sequential(nn.Dropout(0.2, inplace=True),
                                            nn.Linear(1280, 32))

    self.mobilenet_v2 = models.mobilenet_v2(pretrained=True)
    self.mobilenet_v2.classifier = nn.Sequential(nn.Dropout(0.2), nn.Linear(1280, 32))
    

    self.fc_out = nn.Linear(5*32, 1)  # five 32-dim outputs are concatenated; single output for binary classification

  def forward(self, x):
    x_res = self.resnet18(x)
    x_dense = self.densenet(x.detach())
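    out = self.inception(x)
    # in training mode inception_v3 returns (logits, aux_logits); in eval mode, a plain tensor
    x_inception = out.logits if self.training else out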
    x_mnas = self.mnasnet(x)
    x_mobilenet = self.mobilenet_v2(x)
    
    #Concatenate the outputs
    x = torch.cat((x_res, x_dense, x_inception, x_mnas, x_mobilenet), axis=-1)
    x = self.fc_out(x)
    return x



```
With the resources currently available we cannot use it, so we define another
neural network that stacks only two models, which is what Google Colab can handle.

In [23]:
# Define the network
class IntegratedNet(nn.Module):
  def __init__(self):
    super(IntegratedNet, self).__init__()

    self.resnet18 = models.resnet18(pretrained=True)
    self.resnet18.fc = nn.Linear(512, 32)

    #self.densenet = models.densenet161(pretrained=True)
    #self.densenet.classifier = nn.Linear(2208, 32)

    #self.inception = models.inception_v3(pretrained=True)
    #self.inception.fc = nn.Linear(2048, 32)
    
    self.mnasnet = models.mnasnet1_0(pretrained=True)
    self.mnasnet.classifier = nn.Sequential(nn.Dropout(0.2, inplace=True),
                                            nn.Linear(1280, 32))

    #self.mobilenet_v2 = models.mobilenet_v2(pretrained=True)
    #self.mobilenet_v2.classifier = nn.Sequential(nn.Dropout(0.2), nn.Linear(1280, 32))
    

    self.fc_out = nn.Linear(2*32, 1)  # for binary classification, use single output

  def forward(self, x):
    x_res = self.resnet18(x)
    #x_dense = self.densenet(x.detach())
    #x_inception = self.inception(x)[0]
    x_mnas = self.mnasnet(x)
    #x_mobilenet = self.mobilenet_v2(x)
    #x_mobilenet = self.mobilenet_v2(x.detach())
    
    #Concatenate the outputs
    #x = torch.cat((x_res, x_dense, x_inception, x_mnas, x_mobilenet), axis=-1)
    x = torch.cat((x_res, x_mnas), axis=-1)
    x = self.fc_out(x)
    return x
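
The input width of `fc_out` has to equal the sum of the widths of the concatenated branch outputs, as the `torch.cat` check earlier showed for three 32-wide tensors (96). The arithmetic is sketched below for both variants of the network; `stacked_width` is a hypothetical helper used only for illustration.

```python
def stacked_width(branch_outputs):
    """Width of the tensor obtained by concatenating branch outputs on the last axis."""
    return sum(branch_outputs.values())

two_models = {"resnet18": 32, "mnasnet": 32}
five_models = {"resnet18": 32, "densenet": 32, "inception": 32,
               "mnasnet": 32, "mobilenet_v2": 32}

print(stacked_width(two_models))   # 64
print(stacked_width(five_models))  # 160
```

So `nn.Linear(2*32, 1)` fits the two-model network, while the five-model variant would need 160 input features.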

In [24]:
net = IntegratedNet()
optimizer = optim.Adam(net.parameters(), lr=0.001)

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/pablo/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth


  0%|          | 0.00/44.7M [00:00<?, ?B/s]

Downloading: "https://download.pytorch.org/models/mnasnet1.0_top1_73.512-f206786ef8.pth" to /home/pablo/.cache/torch/hub/checkpoints/mnasnet1.0_top1_73.512-f206786ef8.pth


  0%|          | 0.00/16.9M [00:00<?, ?B/s]

In [25]:
# training loss vs. epochs
criterion = nn.BCEWithLogitsLoss()
batch_size = 32
epochs = 1
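
`nn.BCEWithLogitsLoss` applies a sigmoid to the raw network output and then computes binary cross-entropy, which is why the network ends in a single linear output without an activation. The per-sample value can be sketched in plain Python with the usual numerically stable form; `bce_with_logits` is a hypothetical helper for illustration, not the PyTorch implementation.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit.

    Equivalent to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

print(round(bce_with_logits(0.0, 1.0), 4))  # 0.6931
print(round(bce_with_logits(2.0, 1.0), 4))  # 0.1269
print(round(bce_with_logits(2.0, 0.0), 4))  # 2.1269
```

A confident logit on the correct side gives a small loss, and the same logit with the opposite label is penalised heavily.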

In [27]:
from bigdl.orca.learn.pytorch import Estimator
from bigdl.orca.learn.metrics import Accuracy

est = Estimator.from_torch(model=net, optimizer=optimizer, loss=criterion, metrics=[Accuracy()])

creating: createTorchLoss


  assert(len(new_weight) == 1, "TorchModel's weights should be one tensor")


TypeError: 'JavaPackage' object is not callable

In [None]:
from bigdl.orca.learn.trigger import EveryEpoch 

est.fit(data=train_loader, epochs=1, validation_data=test_loader,
        checkpoint_trigger=EveryEpoch())

In [None]:
result = est.evaluate(data=test_loader)
for r in result:
    print(r, ":", result[r])

In [None]:
# stop orca context when program finishes
stop_orca_context()

## 6. VALIDATION OF THE MODEL

This section should visualize the loss and the accuracy, the model graph, the ROC curve, the histograms and the confusion matrix.
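
As a starting point for the confusion matrix, the counts can be derived from the raw logits of the network and the true labels. This is a minimal sketch under assumptions: `confusion_matrix` is a hypothetical helper, and the inputs are toy values, not results obtained with the model.

```python
def confusion_matrix(logits, labels, threshold=0.0):
    """Count TN, FP, FN, TP; a logit above `threshold` predicts class 1.

    A logit threshold of 0 corresponds to a sigmoid probability of 0.5.
    """
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for logit, label in zip(logits, labels):
        predicted = 1 if logit > threshold else 0
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        elif predicted == 0 and label == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Toy values for illustration only; these are not results from the model.
toy_logits = [2.1, -0.3, 0.8, -1.5, 0.2]
toy_labels = [1, 0, 0, 0, 1]
print(confusion_matrix(toy_logits, toy_labels))
# {'tn': 2, 'fp': 1, 'fn': 0, 'tp': 2}
```

Sweeping `threshold` over a range of values and recording the true and false positive rates at each step would give the points of the ROC curve.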

## 7. CONCLUSION

## 8. REFERENCES

TODO: add the citations from the introduction here, in correct APA format.