<a href="https://colab.research.google.com/github/tcivie/Bone_Marrow_Cells_Classification/blob/main/AI_Project_SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Begin monitoring memory and make the necessary imports for the project
The cells below install the sysstat package, a set of command-line utilities for monitoring system performance, start logging memory usage once per second with the sar command, and import the libraries used in the project.

In [None]:
!sudo apt-get install sysstat

In [None]:
!nohup sar -r 1 -o sar-mem-rbf.log &

In [None]:
import sklearn as sk
from sklearn import datasets
import tensorflow as tf
from sklearn import base as Bunch
import os
import random
from sklearn.datasets import load_files
import shutil
import re

##Mount Google drive
This cell mounts the user's Google Drive to the Colab notebook file system.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

##Configurations
This cell sets the path to the dataset and the classes that the model will be trained on. It also sets the number of images to use per class and the path for the representative data.

In [None]:
# DATASET = 'gs://bm_cytomorphology_data_tf/BM_cytomorphology_data/*'
DATASET = '/content/drive/MyDrive/BM_cytomorphology_data'
CLASSES = ["ABE", "ART", "BAS", "BLA", "EBO", "EOS", "FGC", "HAC", "KSC", "LYI", "LYT", "MMZ", "MON", "MYB", "NGB",
              "NGS", "NIF", "OTH", "PEB", "PLM", "PMO"]
DATASET_SIZE = 72 # Number of images sampled per class (72 is the size of the smallest class)
REPRESENTIVE_DATA_PATH = '/content/BM_cytomorphology_data_rep'

#Data loading and configuration
This cell creates a directory for the representative data with one subdirectory per class. For each class, it selects a random sample of DATASET_SIZE images from the original dataset folder and copies them to the matching subdirectory. It uses the os library to create and list directories and to join paths, the random library to draw the sample, and the shutil library to copy the images. If a folder already exists, the cell prints a message saying so and continues.

In [None]:
try:
  os.mkdir(REPRESENTIVE_DATA_PATH)
except FileExistsError:
  print('folder already exists:' + REPRESENTIVE_DATA_PATH)
for class_name in CLASSES:
  class_path = os.path.join(DATASET,class_name)
  local_class_path = os.path.join(REPRESENTIVE_DATA_PATH,class_name)
  try:
    os.mkdir(local_class_path)
  except FileExistsError:
    print('folder already exists:' + local_class_path)

  images_list = os.listdir(class_path)
  selected_images = random.sample(images_list,DATASET_SIZE)
  print(selected_images)

  for image in selected_images:
    orig_image = os.path.join(class_path,image) # Full path to the source image
    dest_image_path = os.path.join(local_class_path,image)
    shutil.copy(orig_image, dest_image_path)
    print('Copied: ' + image) # Report only after the copy has succeeded
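
The sampling step above can also be written in a repeatable form. The following is a minimal sketch, not the notebook's own code: the helper name `sample_class_images`, its `seed` parameter, and the choice to take every image when a class holds fewer than requested are assumptions made for illustration. The file names in the demonstration are made up.

```python
import os
import random
import shutil
import tempfile


def sample_class_images(src_dir, dst_dir, n_images, seed=0):
    """Copy up to n_images randomly chosen files from src_dir to dst_dir.

    Hypothetical helper: a seeded random.Random makes the selection
    repeatable, and min() avoids the ValueError that random.sample raises
    when a class holds fewer files than requested.
    """
    os.makedirs(dst_dir, exist_ok=True)
    # Sort first so the same seed always picks the same files
    files = sorted(os.listdir(src_dir))
    rng = random.Random(seed)
    selected = rng.sample(files, min(n_images, len(files)))
    for name in selected:
        shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return selected


# Demonstration on a throwaway directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src", "ABE")
    dst = os.path.join(tmp, "rep", "ABE")
    os.makedirs(src)
    for i in range(10):
        with open(os.path.join(src, f"img_{i}.jpg"), "w") as fh:
            fh.write("placeholder")
    first = sample_class_images(src, dst, n_images=4, seed=42)
    second = sample_class_images(src, dst, n_images=4, seed=42)
    copied = sorted(os.listdir(dst))
```

Calling the helper twice with the same seed selects the same four files, so rerunning the cell would not change which images end up in the representative set.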


#SVM running

The first code cell lists the package dependencies required to run the code.
The commands below, which are commented out in this notebook, install the Python packages glob3, numpy, and opencv-python. glob3 is a library for matching file paths against a pattern, numpy is a library for scientific computing in Python, and opencv-python is a library for image processing. The loading code in the next section imports `glob`, which is also available in the Python standard library.

In [None]:
# !pip3 install glob3
# !pip3 install numpy
# !pip3 install opencv-python 

##Loading Data

In this section, we are loading the data that we will use to train and test our model. We set the base directory where the images are stored, and then initialize empty lists X and y to store the image data and labels respectively. We then use `os.listdir()` to get a list of subdirectories in the base directory, which will represent the class names. We loop over the subdirectories, using `glob.glob()` to get a list of image files in the subdirectory with the extension .jpg. We then load the image data using OpenCV, append the image data to the X list, and append the class label (i.e., the subdirectory name) to the y list. Finally, we convert the X and y lists to numpy arrays so that we can use them to train and test our model.

In [None]:
import os
import glob
import numpy as np
import cv2

# Set the base directory where the images are stored
base_dir = '/content/drive/MyDrive/BM_cytomorphology_data_rep'

# Initialize lists to store the image data and labels
X = []
y = []

# Get the list of subdirectories (i.e., the class names)
subdirs = [d for d in os.listdir(base_dir) if os.path.isdir(os.path.join(base_dir, d))]

# Loop over the subdirectories
for subdir in subdirs:
    # Get the list of image files in the subdirectory
    image_files = glob.glob(os.path.join(base_dir, subdir, '*.jpg'))

    # Loop over the image files
    for image_path in image_files:
        print('loading: ' + image_path)
        # Load the image data using OpenCV (imread returns None if the file cannot be read)
        image = cv2.imread(image_path)
        if image is None:
            print('skipping unreadable file: ' + image_path)
            continue
        # Append the image data to the X list
        X.append(image)
        # Append the class label (i.e., the subdirectory name) to the y list
        y.append(subdir)

# Convert the X and y lists to numpy arrays
X = np.array(X)
y = np.array(y)
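
Converting the list of images to a single array only gives a clean four-dimensional result when every image has the same shape. The sketch below illustrates a check for that, using small synthetic arrays in place of real images. The helper name `stack_images` is hypothetical and not part of the notebook; the `None` entry imitates what `cv2.imread` returns for a file it cannot read.

```python
import numpy as np


def stack_images(images, labels):
    """Stack equally sized images into one array, dropping unreadable ones.

    Hypothetical helper: entries that are None are skipped together with
    their labels, and a mix of image shapes raises a clear error.
    """
    kept = [(img, lab) for img, lab in zip(images, labels) if img is not None]
    shapes = {img.shape for img, _ in kept}
    if len(shapes) != 1:
        raise ValueError(f"expected one image shape, found {sorted(shapes)}")
    X = np.stack([img for img, _ in kept])
    y = np.array([lab for _, lab in kept])
    return X, y


# Synthetic stand-ins: two valid 4x4 three-channel images and one failed read
fake_images = [
    np.zeros((4, 4, 3), dtype=np.uint8),
    None,
    np.ones((4, 4, 3), dtype=np.uint8),
]
fake_labels = ["ABE", "ART", "BAS"]
X_demo, y_demo = stack_images(fake_images, fake_labels)
print(X_demo.shape, y_demo.tolist())
```

The result has shape (number of images, height, width, channels), and the label array stays aligned with the images that were kept.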

##Train the model
In this section, we train our model.

We start by updating the scikit-learn package to the latest version. Next, we import the necessary libraries and modules: numpy, sklearn, SVC (Support Vector Classification) from sklearn.svm, OneVsRestClassifier from sklearn.multiclass, accuracy_score from sklearn.metrics, and train_test_split from sklearn.model_selection. We also print the version of scikit-learn in use.

In [None]:
!pip install -U scikit-learn

In [None]:
import numpy as np
import sklearn
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
print("Sklearn version " + sklearn.__version__)

We begin by splitting our data into training and test sets, holding out 20% of the samples for testing and fixing the random state at 42 so that the split is repeatable. Then we create an SVM classifier with a radial basis function (rbf) kernel, setting the class weight to 'balanced' and verbose to True. We wrap it in OneVsRestClassifier, which reduces a multi-class problem to one binary classification problem per class, each separating the samples of that class from all the others. We then flatten each image into a one-dimensional feature vector, because the classifier expects a two-dimensional array of shape (samples, features). Finally, we fit the classifier to the training data.

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an SVM classifier with an RBF kernel
clf = SVC(kernel='rbf', class_weight='balanced', verbose=True)

clf = OneVsRestClassifier(clf, n_jobs=-1, verbose=10)

# Flatten the data
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)
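
The effect of the flattening step can be seen on a small made-up batch. This is a sketch on random data, assuming images of 8x6 pixels with 3 channels; the real images in the dataset will have a different size.

```python
import numpy as np

# Made-up batch: 5 images, 8 rows, 6 columns, 3 colour channels
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 8, 6, 3), dtype=np.uint8)

# Keep the sample axis and merge all remaining axes into one feature axis
flat = images.reshape(images.shape[0], -1)
print(flat.shape)  # one row per image, 8 * 6 * 3 = 144 features each

# Reshaping back recovers the original pixel layout
restored = flat.reshape(images.shape)
```

Flattening discards nothing: each row still contains every pixel value of its image, only without the spatial arrangement.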


In [None]:
# Train the classifier on the training data
clf.fit(X_train, y_train)
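
The one-vs-rest reduction can be observed on a toy problem. The sketch below uses three synthetic two-dimensional clusters rather than the image data; the cluster centres and sample counts are arbitrary assumptions, and only the three class names are borrowed from the notebook.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three well-separated synthetic clusters, 30 points each
rng = np.random.default_rng(0)
centers = {"ABE": (0.0, 0.0), "ART": (10.0, 10.0), "BAS": (-10.0, 10.0)}
X_toy = np.vstack(
    [rng.normal(loc=c, scale=1.0, size=(30, 2)) for c in centers.values()]
)
y_toy = np.repeat(list(centers.keys()), 30)

# Same construction as in the notebook, without the logging options
ovr = OneVsRestClassifier(SVC(kernel="rbf", class_weight="balanced"))
ovr.fit(X_toy, y_toy)

# One fitted binary SVC per class
print(len(ovr.estimators_))
print(list(ovr.classes_))

pred = ovr.predict(np.array([[0.0, 0.0], [10.0, 10.0]]))
```

After fitting, `estimators_` holds one binary classifier per class, so the 21-class problem in this notebook trains 21 separate SVMs.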

In this section, we use our trained model to make predictions on the test data. We then print the accuracy of the classifier using `accuracy_score()`. We also import classification_report from sklearn.metrics and print out the report, which includes precision, recall, f1-score and support for each class. Then, we use `seaborn` and `matplotlib` to create a confusion matrix visualization of the classifier's performance.

In [None]:
# Make predictions on the test data
y_pred = clf.predict(X_test)

# Print the accuracy of the classifier
print("Accuracy:", accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import classification_report
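# Passing labels keeps the rows aligned with CLASSES even if a class is missing from the test split
print(classification_report(y_test, y_pred,
                            labels=CLASSES,
                            target_names=CLASSES))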
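
The per-class figures in the report can be reproduced by hand from counts of true positives, false positives, and false negatives. The following is a worked sketch on six made-up labels; the helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support


# Made-up labels for illustration
y_true_demo = ["ABE", "ABE", "ART", "BAS", "ABE", "ART"]
y_pred_demo = ["ABE", "ART", "ART", "BAS", "ABE", "ABE"]

precision, recall, f1, support = per_class_metrics(y_true_demo, y_pred_demo, "ABE")
accuracy = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
```

For class ABE there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and the support is 3. Accuracy counts every correct prediction regardless of class, here 4 out of 6.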

In [None]:
# use seaborn plotting defaults
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt

In [None]:
from sklearn.metrics import confusion_matrix
plt.figure(figsize=(15,10))
plt.rcParams.update({'font.size': 16})
mat = confusion_matrix(y_test, y_pred, labels=CLASSES)  # fix the label order so it matches the tick labels
sns.heatmap(mat.T, annot=False, cmap="crest",
            xticklabels=CLASSES,
            yticklabels=CLASSES)
plt.xlabel('true label')
plt.ylabel('predicted label');
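
The heatmap plots the transposed matrix, which is why true labels run along the x-axis. A small sketch with made-up labels shows the orientation; `confusion_counts` is a hypothetical helper that follows the same convention as scikit-learn's `confusion_matrix` (rows are true labels, columns are predicted labels).

```python
import numpy as np


def confusion_counts(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns predicted."""
    index = {lab: i for i, lab in enumerate(labels)}
    mat = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        mat[index[t], index[p]] += 1
    return mat


labels_demo = ["ABE", "ART", "BAS"]
y_true_cm = ["ABE", "ABE", "ART", "BAS"]
y_pred_cm = ["ABE", "ART", "ART", "ART"]

mat_demo = confusion_counts(y_true_cm, y_pred_cm, labels_demo)
# After transposing, rows are predicted labels and columns are true labels
mat_t = mat_demo.T
print(mat_demo)
print(mat_t)
```

In the transposed matrix, the entry in row ART and column BAS counts samples whose true class is BAS but which were predicted as ART.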

In [None]:
!killall sar

##Store Model
In this section, we are storing the trained model to a file.
We start by importing the pickle library, which allows us to serialize and save the model to a file. We then use the `open()` function to create a new file in the specified directory with the name 'svm_rbf.pkl' and the mode 'wb' (write binary). We then use the `pickle.dump()` function to write the clf object (our trained model) to the file. This way we can use it later on.

Once the model is stored, it can be loaded and used to make predictions on new data. The loaded model can also be fitted again on new data; note that calling `fit` trains the SVM from scratch rather than updating the stored one.

In [None]:
import pickle
# Save the model to a file
with open('/content/drive/MyDrive/svm_rbf.pkl', 'wb') as f:
    pickle.dump(clf, f)

##Load Model
Load the stored model from the file so that it can be used to make predictions.

In [None]:
import pickle
with open('/content/drive/MyDrive/svm_rbf.pkl', 'rb') as f:
    clf = pickle.load(f)
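
The save and load steps form a round trip that can be checked in isolation. The sketch below uses a plain dictionary as a stand-in for the trained classifier and a temporary directory instead of Google Drive; both are assumptions made so the example runs anywhere.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model (the notebook pickles the fitted clf instead)
model_stand_in = {"kernel": "rbf", "classes": ["ABE", "ART", "BAS"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "svm_rbf.pkl")
    # Write in binary mode, exactly as for the real model
    with open(path, "wb") as f:
        pickle.dump(model_stand_in, f)
    # Read it back in binary mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stand_in)
```

The restored object is an equal copy rather than the same object. Since unpickling can execute arbitrary code, only load pickle files from a source you trust.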

In [None]:
!sudo apt-get install gnuplot gawk

In [None]:
!!sadf -d /content/drive/MyDrive/sar-mem-rbf.log -- -r| gawk -F";" '{print $3 " " $6}' | gawk '{gsub(/ UTC/,"",$2); print}' | gnuplot -persist -e "set terminal png; set output 'memstats.png';set xdata time; set timefmt '%Y-%m-%d %H:%M:%S'; set xlabel 'Time'; set ylabel 'Memory Usage'; plot '-' using 1:4 with lines;"