![dphi banner](https://dphi-courses.s3.ap-south-1.amazonaws.com/Datathons/dphi_banner.png)

# **Getting Started Code For [Pneumonia Classification in CXRs](https://dphi.tech/challenges/pneumonia-classification-in-cxrs/76/overview/about) on DPhi**

## Download the images
The '[Data](https://dphi.tech/challenges/pneumonia-classification-in-cxrs/76/data)' section of the problem page provides a Google Drive link that contains all the required train images (to build the model) and test images (to predict labels for and submit the predictions on the [DPhi platform](https://dphi.tech/challenges/pneumonia-classification-in-cxrs/76/submit)).

We can use **GoogleDriveDownloader** from the **google_drive_downloader** library in Python to download the shared files from the shared Google Drive link: https://drive.google.com/file/d/1d_93d9oFNRBK9Vg6BRxs9wvRbKtNTylY/view?usp=sharing

The file id in the above link is: **1d_93d9oFNRBK9Vg6BRxs9wvRbKtNTylY**
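
The file id can also be pulled out of the sharing link programmatically. The following is a minimal sketch, assuming the link has the `/file/d/<id>/` shape shown above; the helper name `extract_drive_file_id` is hypothetical and not part of any library.

```python
import re

def extract_drive_file_id(url):
    """Return the id from a Google Drive 'file/d/<id>/...' sharing URL, or None."""
    # Assumption: the id is the path segment that follows '/file/d/'.
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", url)
    return match.group(1) if match else None

share_url = "https://drive.google.com/file/d/1d_93d9oFNRBK9Vg6BRxs9wvRbKtNTylY/view?usp=sharing"
file_id = extract_drive_file_id(share_url)
print(file_id)
```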

In [5]:
#%pip install googledrivedownloader
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id='1d_93d9oFNRBK9Vg6BRxs9wvRbKtNTylY',
                                    dest_path='content/pneumonia_dataset.zip',
                                    unzip=True)

We have all the files from the shared Google drive link downloaded in the colab environment.

## Loading Libraries
Not all Python capabilities are loaded into the working environment by default (even if they are already installed on your system). So, we import each library that we want to use.

We choose alias names for our libraries for convenience (numpy --> np, pandas --> pd, tensorflow --> tf).

Note: You can import all the libraries that you think will be required up front, or import them as you go along.

In [2]:
!apt-get update
!apt-get install ffmpeg libsm6 libxext6 -y
%pip install pandas numpy opencv-python scikit-learn keras
import pandas as pd                                     # Data analysis and manipulation tool
import numpy as np                                      # Fundamental package for linear algebra and multidimensional arrays
import tensorflow as tf                                 # Deep Learning Tool
import os                                               # OS module in Python provides a way of using operating system dependent functionality
import cv2                                              # Library for image processing
from sklearn.model_selection import train_test_split    # For splitting the data into train and validation set
from sklearn.metrics import accuracy_score
from tensorflow.keras.layers import BatchNormalization  # Same layer via tf.keras (the keras.layers.normalization path can fail on newer Keras)

Reading package lists... Done
E: Could not open lock file /var/lib/apt/lists/lock - open (13: Permission denied)
E: Unable to lock directory /var/lib/apt/lists/
W: Problem unlinking the file /var/cache/apt/pkgcache.bin - RemoveCaches (13: Permission denied)
W: Problem unlinking the file /var/cache/apt/srcpkgcache.bin - RemoveCaches (13: Permission denied)
E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Loading and preparing training data
The train and test images are given in two different folders - 'train' and 'test'.
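
Before loading the pixels, it can help to confirm that the download unpacked as expected. This is a small sketch that counts files per class folder; it assumes the `normal` and `pneumonia` subfolders that the loading code reads from, and the helper name `count_files_per_class` is made up for illustration.

```python
import os

def count_files_per_class(train_dir, class_names=("normal", "pneumonia")):
    """Count the files directly inside each class subfolder of train_dir."""
    counts = {}
    for name in class_names:
        class_dir = os.path.join(train_dir, name)
        if os.path.isdir(class_dir):
            counts[name] = sum(
                1 for entry in os.listdir(class_dir)
                if os.path.isfile(os.path.join(class_dir, entry))
            )
        else:
            counts[name] = 0   # a missing folder is reported as zero files
    return counts

print(count_files_per_class('./content/pneumonia_dataset/train'))
```

If either count is zero, the download or the unzip step probably did not finish.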

In [6]:
data = []
img_size = 100

def create_data():
    for item in ['normal', 'pneumonia']:
        path = './content/pneumonia_dataset/train/' + item + "/"
        label = 0 if item == 'normal' else 1     # normal -> 0, pneumonia -> 1

        for img in os.listdir(path):             # os.listdir returns the names of all files located in the given path
            try:
                img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)    # reads the image as a grayscale pixel array
                new_img_array = cv2.resize(img_array, (img_size, img_size))              # resizes it to img_size x img_size
                data.append([new_img_array, label])                                      # stores the pixels with their target value
            except Exception as e:
                pass                             # skips files that cannot be read or resized (cv2.imread returns None for them)

create_data()

In [7]:
len(data)

2425

In [9]:
# pixel array and label of one image
data[2422]

[array([[ 9,  9,  9, ...,  0,  0,  0],
        [ 9,  9, 11, ...,  0,  0,  0],
        [10,  9,  9, ...,  0,  0,  0],
        ...,
        [32,  9,  9, ...,  0,  0,  0],
        [26,  9, 11, ...,  0,  0,  0],
        [18, 10, 13, ...,  0,  0,  0]], dtype=uint8),
 1]

#### Shuffle the data

In [10]:
np.random.shuffle(data)
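
`np.random.shuffle` gives a different order on every run unless a seed is set. As an alternative sketch (not what the notebook does), a seeded permutation can shuffle images and labels together and reproducibly; the seed value and the small stand-in arrays below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)                         # arbitrary seed, assumed for reproducibility
images = np.stack([np.full((2, 2), i) for i in range(5)])    # stand-in: image i is filled with the value i
labels = np.array([0, 1, 0, 1, 1])                           # stand-in labels

order = rng.permutation(len(images))    # one random ordering of the indices 0..4
shuffled_images = images[order]         # the same ordering is applied to the images...
shuffled_labels = labels[order]         # ...and to the labels, so pairs stay aligned
print(order)
```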

#### Separating the images and labels


In [11]:
x = []
y = []
for image in data:
  x.append(image[0])
  y.append(image[1])

# converting x & y to numpy array as they are list
x = np.array(x)
y = np.array(y)

In [12]:
np.unique(y, return_counts=True)

(array([0, 1]), array([1280, 1145]))
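
The counts above show that the two classes are roughly balanced. A short sketch of the same check expressed as fractions, using a stand-in label array built from those counts:

```python
import numpy as np

y_demo = np.array([0] * 1280 + [1] * 1145)     # stand-in labels with the counts shown above
classes, counts = np.unique(y_demo, return_counts=True)
fractions = counts / counts.sum()
majority_baseline = fractions.max()            # accuracy of always predicting the larger class

for cls, cnt, frac in zip(classes, counts, fractions):
    print(f"class {cls}: {cnt} images ({frac:.1%})")
print(f"majority-class baseline: {majority_baseline:.1%}")
```

The majority-class baseline (about 52.8% on the full training data) is a useful reference point when reading the accuracy of a trained model.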

#### Splitting the data into Train and Validation Set
We want to check the performance of the model that we build. For this purpose, we split the given data (both the images and their labels) into a training set, which is used to train the model, and a validation set, which is used to check how accurately the model predicts outcomes on images it has not seen.

For this purpose we use the 'train_test_split' function from the 'sklearn.model_selection' module.

In [13]:
## Convert into 4D Array
x =  x.reshape(-1, 100, 100, 1)
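
The reshape adds a trailing channel axis of size 1, because the convolution layers expect input of shape (samples, height, width, channels). A minimal sketch on a stand-in batch of three blank images:

```python
import numpy as np

img_size = 100
batch = np.zeros((3, img_size, img_size), dtype=np.uint8)   # stand-in for 3 grayscale images

batch_4d = batch.reshape(-1, img_size, img_size, 1)         # -1 lets NumPy infer the number of images
print(batch_4d.shape)                                       # (3, 100, 100, 1)

same = batch[..., np.newaxis]                               # equivalent way to add the channel axis
```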

In [14]:
# split the data
X_train, X_val, y_train, y_val = train_test_split(x,y,test_size=0.2, random_state = 42)

In [15]:
X_train.shape

(1940, 100, 100, 1)
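
The shape above follows from the split arithmetic. A quick sketch, assuming the 20% validation share is rounded to a whole number of images:

```python
n_samples = 2425      # number of loaded training images, from len(data)
test_size = 0.2       # validation share passed to train_test_split

n_val = round(n_samples * test_size)   # images held out for validation
n_train = n_samples - n_val            # images left for training
print(n_train, n_val)                  # 1940 485
```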

In [16]:
X_train =  X_train.reshape(-1, 100, 100, 1)

In [17]:
X_train.shape

(1940, 100, 100, 1)

In [18]:
X_train[0]

array([[[164],
        [115],
        [ 85],
        ...,
        [114],
        [136],
        [159]],

       [[ 27],
        [  0],
        [  0],
        ...,
        [  0],
        [  0],
        [110]],

       [[  0],
        [  0],
        [  0],
        ...,
        [  0],
        [  0],
        [  6]],

       ...,

       [[  2],
        [  0],
        [  0],
        ...,
        [  0],
        [  0],
        [  5]],

       [[  3],
        [  0],
        [  0],
        ...,
        [  0],
        [  0],
        [  5]],

       [[  9],
        [  0],
        [  0],
        ...,
        [  0],
        [  0],
        [ 11]]], dtype=uint8)

## Building Model
Now we are finally ready to train the model.

There are many machine learning and deep learning models, such as Random Forest, Decision Tree, Multi-Layer Perceptron (MLP) and Convolutional Neural Network (CNN), to name a few. Here we build a CNN.

We then feed the model both the data (X_train) and the answers for that data (y_train).

In [75]:
cnn = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu', padding='same', input_shape=(100, 100, 1)),
#     tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu', padding='same'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    
    tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'),
#     tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    
    tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same'),
#     tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D((2, 2)),
    
    tf.keras.layers.Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same'),
#     tf.keras.layers.Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D((2, 2)),
#     tf.keras.layers.Dropout(rate=0.2),
    
#      tf.keras.layers.Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same'),
#     tf.keras.layers.Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same'),
#      tf.keras.layers.BatchNormalization(),
#      tf.keras.layers.MaxPooling2D((2, 2)),
#     tf.keras.layers.Dropout(rate=0.2),
    
    # tf.keras.layers.Flatten(input_shape=(100, 100, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
#     tf.keras.layers.Dropout(rate=0.7),
    tf.keras.layers.Dense(128, activation='relu'),
#     tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(64, activation='relu'),
#     tf.keras.layers.Dropout(rate=0.3),
    tf.keras.layers.Dense(1, activation='sigmoid')    
])

In [76]:
cnn.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [77]:
print(cnn.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_23 (Conv2D)           (None, 100, 100, 16)      160       
_________________________________________________________________
max_pooling2d_14 (MaxPooling (None, 50, 50, 16)        0         
_________________________________________________________________
conv2d_24 (Conv2D)           (None, 50, 50, 32)        4640      
_________________________________________________________________
max_pooling2d_15 (MaxPooling (None, 25, 25, 32)        0         
_________________________________________________________________
conv2d_25 (Conv2D)           (None, 25, 25, 64)        18496     
_________________________________________________________________
batch_normalization_8 (Batch (None, 25, 25, 64)        256       
_________________________________________________________________
max_pooling2d_16 (MaxPooling (None, 12, 12, 64)       
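
The parameter counts and output shapes in the visible part of the summary can be reproduced by hand. The sketch below uses the standard formulas for a convolution with bias, for batch normalization, and for 2x2 max pooling with its default stride; the helper names are made up for illustration.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # Each filter has kernel_h * kernel_w * in_channels weights plus one bias.
    return (kernel_h * kernel_w * in_channels + 1) * filters

def batchnorm_params(channels):
    # Per channel: gamma, beta, moving mean and moving variance.
    return 4 * channels

def pooled_size(size, pool=2):
    # 2x2 max pooling with stride 2 and no padding floors the halved size.
    return size // pool

print(conv2d_params(3, 3, 1, 16))    # 160
print(conv2d_params(3, 3, 16, 32))   # 4640
print(conv2d_params(3, 3, 32, 64))   # 18496
print(batchnorm_params(64))          # 256

size = 100
for _ in range(3):
    size = pooled_size(size)
    print(size)                      # 50, then 25, then 12
```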

In [78]:
cnn.fit(X_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fb57c2d96a0>

## Validate the model
Wonder🤔 how well your model learned! Let's check its performance on the X_val data.

In [79]:
cnn.evaluate(X_val, y_val)



[0.7576945424079895, 0.6103093028068542]
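
`cnn.evaluate` returns the loss followed by the compiled metrics, so the output above is a loss of about 0.76 and an accuracy of about 0.61. The accuracy can be sketched by hand as the share of predictions that match the labels after thresholding the sigmoid output at 0.5; the function name and the demo values below are made up for illustration.

```python
import numpy as np

def binary_accuracy(y_true, probabilities, threshold=0.5):
    """Share of samples whose thresholded probability matches the true label."""
    predicted = (np.asarray(probabilities).ravel() > threshold).astype(int)
    return float(np.mean(predicted == np.asarray(y_true).ravel()))

y_true_demo = np.array([1, 0, 1, 1])                  # made-up labels
probs_demo = np.array([[0.9], [0.2], [0.4], [0.7]])   # made-up sigmoid outputs, shaped like cnn.predict output
print(binary_accuracy(y_true_demo, probs_demo))       # 0.75
```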

## Predict The Output For Testing Dataset 😅
We have trained our model, evaluated it and now finally we will predict the output/target for the testing data.

#### Load Test Set
Load the test data on which final submission is to be made.

In [80]:
# Loading the order of the image's name that has been provided
test_image_order = pd.read_csv("./content/pneumonia_dataset/test.csv")
test_image_order.head()

Unnamed: 0,filename
0,CXR_test_519.png
1,CXR_test_578.png
2,CXR_test_359.png
3,CXR_test_573.png
4,CXR_test_471.png


#### Getting images file path

In [81]:
file_paths = [[fname, './content/pneumonia_dataset/test/' + fname] for fname in test_image_order['filename']]

#### Confirm if number of file paths is same as number of image names in 'test.csv'

In [82]:
# Confirm if number of images is same as number of labels given
if len(test_image_order) == len(file_paths):
    print('Number of image names i.e. ', len(test_image_order), 'matches the number of file paths i.e. ', len(file_paths))
else:
    print('Number of image names does not match the number of filepaths')

Number of image names i.e.  606 matches the number of file paths i.e.  606
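
Because `file_paths` is built from `test_image_order`, the two lengths always match. A stricter check, sketched below, is whether each path actually exists on disk; the helper name `find_missing_files` is hypothetical.

```python
import os

def find_missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

# Example with one path taken from the listing above; an empty list means nothing is missing.
missing = find_missing_files(['./content/pneumonia_dataset/test/CXR_test_519.png'])
print(len(missing), 'missing file(s)')
```

With the notebook's variables, this could be called as `find_missing_files([path for _, path in file_paths])`.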


#### Converting the file_paths to dataframe

In [83]:
test_images = pd.DataFrame(file_paths, columns=['filename', 'filepaths'])
test_images.head()

Unnamed: 0,filename,filepaths
0,CXR_test_519.png,./content/pneumonia_dataset/test/CXR_test_519.png
1,CXR_test_578.png,./content/pneumonia_dataset/test/CXR_test_578.png
2,CXR_test_359.png,./content/pneumonia_dataset/test/CXR_test_359.png
3,CXR_test_573.png,./content/pneumonia_dataset/test/CXR_test_573.png
4,CXR_test_471.png,./content/pneumonia_dataset/test/CXR_test_471.png


## Data Pre-processing on test_data


In [84]:
test_pixel_data = []     # initialize an empty list (converted to a numpy array below)
for i in range(len(test_images)):
  
  img_array = cv2.imread(test_images['filepaths'][i], cv2.IMREAD_GRAYSCALE)   # converting the image to gray scale
  new_img_array=cv2.resize(img_array,(img_size,img_size))
  test_pixel_data.append(new_img_array)

In [85]:
test_pixel_data = np.asarray(test_pixel_data)

In [86]:
test_pixel_data =  test_pixel_data.reshape(-1, 100, 100, 1)

In [87]:
test_pixel_data

array([[[[  3],
         [  3],
         [  3],
         ...,
         [  5],
         [  6],
         [  7]],

        [[  3],
         [  2],
         [  2],
         ...,
         [  5],
         [  6],
         [  6]],

        [[  4],
         [  2],
         [  2],
         ...,
         [  5],
         [  5],
         [  6]],

        ...,

        [[  7],
         [  5],
         [  3],
         ...,
         [100],
         [ 85],
         [ 64]],

        [[  7],
         [  4],
         [  3],
         ...,
         [106],
         [ 97],
         [ 71]],

        [[  5],
         [  3],
         [  3],
         ...,
         [103],
         [ 94],
         [ 72]]],


       [[[  0],
         [  0],
         [  0],
         ...,
         [  0],
         [  0],
         [  0]],

        [[  0],
         [  0],
         [  0],
         ...,
         [  0],
         [  0],
         [  0]],

        [[  0],
         [  0],
         [  0],
         ...,
         [  0],
         [

### Make Prediction on Test Dataset
Time to make a submission!!!

In [88]:
pred = cnn.predict(test_pixel_data)

In [89]:
pred

array([[7.79612839e-01],
       [9.88206863e-01],
       [7.48752236e-01],
       [2.62503982e-01],
       [8.83125663e-01],
       [8.02066207e-01],
       [1.95445806e-01],
       [3.56617868e-02],
       [1.73045576e-01],
       [1.18305475e-01],
       [3.11128378e-01],
       [4.80713546e-02],
       [2.10706443e-01],
       [5.81347346e-01],
       [4.76379305e-01],
       [8.73991013e-01],
       [7.76166022e-01],
       [8.41259837e-01],
       [3.48188758e-01],
       [2.16569036e-01],
       [1.39822721e-01],
       [8.38483393e-01],
       [6.77076280e-02],
       [8.02899539e-01],
       [9.29692090e-01],
       [7.72473097e-01],
       [9.93752003e-01],
       [9.48668361e-01],
       [5.73262632e-01],
       [2.52656579e-01],
       [7.33826160e-02],
       [5.65483630e-01],
       [7.12723851e-01],
       [2.58764237e-01],
       [6.44357443e-01],
       [2.21185774e-01],
       [5.58236122e-01],
       [5.82157969e-01],
       [4.20467794e-01],
       [7.32667744e-01],


Convert the predicted probabilities to the corresponding class labels: values up to 0.5 become 'normal' and values above 0.5 become 'pneumonia'.

In [90]:
predictions = []
for item in pred:
  if item <= 0.5:
    predictions.append('normal')
  else:
    predictions.append('pneumonia')
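
The same thresholding can be written without a loop. A minimal sketch using `np.where` on the first four probabilities shown above:

```python
import numpy as np

# First four values from the prediction output above.
pred_demo = np.array([[7.79612839e-01], [9.88206863e-01], [7.48752236e-01], [2.62503982e-01]])

# ravel() flattens the (n, 1) output to (n,); values <= 0.5 map to 'normal'.
labels_demo = np.where(pred_demo.ravel() <= 0.5, 'normal', 'pneumonia').tolist()
print(labels_demo)   # ['pneumonia', 'pneumonia', 'pneumonia', 'normal']
```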

In [91]:
predictions

['pneumonia',
 'pneumonia',
 'pneumonia',
 'normal',
 'pneumonia',
 'pneumonia',
 'normal',
 'normal',
 'normal',
 'normal',
 'normal',
 'normal',
 'normal',
 'pneumonia',
 'normal',
 'pneumonia',
 'pneumonia',
 'pneumonia',
 'normal',
 'normal',
 'normal',
 'pneumonia',
 'normal',
 'pneumonia',
 'pneumonia',
 'pneumonia',
 'pneumonia',
 'pneumonia',
 'pneumonia',
 'normal',
 'normal',
 'pneumonia',
 'pneumonia',
 'normal',
 'pneumonia',
 'normal',
 'pneumonia',
 'pneumonia',
 'normal',
 'pneumonia',
 'pneumonia',
 'normal',
 'pneumonia',
 'normal',
 'normal',
 'pneumonia',
 'normal',
 'pneumonia',
 'pneumonia',
 'pneumonia',
 'normal',
 'pneumonia',
 'normal',
 'pneumonia',
 'pneumonia',
 'pneumonia',
 'normal',
 'pneumonia',
 'normal',
 'pneumonia',
 'pneumonia',
 'normal',
 'pneumonia',
 'pneumonia',
 'normal',
 'normal',
 'normal',
 'pneumonia',
 'pneumonia',
 'pneumonia',
 'pneumonia',
 'normal',
 'normal',
 'normal',
 'normal',
 'pneumonia',
 'normal',
 'pneumonia',
 'normal',
 '

## **How to save prediction results locally via jupyter notebook?**
If you are working in a Jupyter notebook, execute the code block below. A file named 'submission.csv' will be created in your current working directory.

In [92]:
res = pd.DataFrame({'filename': test_images['filename'], 'label': predictions})  # predictions holds the model's final labels for the unseen test images
res.to_csv("submission.csv", index = False)      # the csv file is saved in the current working directory, usually the folder where this notebook is located
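
Before uploading, the saved file can be given a quick sanity check. The sketch below uses the standard `csv` module rather than pandas and only checks what this notebook itself produces: the `filename` and `label` columns, the row count, and the two label strings. The platform's actual requirements are not stated here, and the helper name `validate_submission` is hypothetical.

```python
import csv
import io

def validate_submission(csv_text, expected_rows, allowed_labels=("normal", "pneumonia")):
    """Return a list of problems found in a 'filename,label' CSV (empty list if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["filename", "label"]:
        return [f"unexpected columns: {reader.fieldnames}"]
    rows = list(reader)
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    bad = [row["filename"] for row in rows if row["label"] not in allowed_labels]
    if bad:
        problems.append(f"unexpected labels for: {bad}")
    return problems

demo_csv = "filename,label\nCXR_test_519.png,pneumonia\nCXR_test_578.png,pneumonia\n"
print(validate_submission(demo_csv, expected_rows=2))   # []
```

For the real file, the text could be read with `open('submission.csv').read()` and checked against the 606 test images counted earlier.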

# **OR,**
**If you are working on Google Colab, use the code below to save the prediction results locally.**

## **How to save prediction results locally via colab notebook?**
If you are working in a Google Colab notebook, execute the code block below. A file named 'submission.csv' will be downloaded to your system.

In [None]:
res = pd.DataFrame({'filename': test_images['filename'], 'label': predictions})  # predictions holds the model's final labels for the unseen test images
res.to_csv("submission.csv", index = False)

# To download the csv file locally
from google.colab import files        
files.download('submission.csv')

# **Well Done! 👍**
You are all set to make a submission. Let's head to the **[challenge page](https://dphi.tech/challenges/pneumonia-classification-challenge-by-segmind/76/overview/about)** to make the submission.