# Comparison of a CNN and a Vision Transformer Trained on HiRISE Mars Satellite Images
### by Aniruddha Prasad and Andrew Hartnett

This notebook compares the accuracy of a Convolutional Neural Network (CNN) and a Vision Transformer (ViT) trained on Mars satellite images from the HiRISE dataset. The goal is to determine whether a ViT, which has been reported as state of the art for image classification in certain circumstances, outperforms the CNN when it is pre-trained on a large dataset and then fine-tuned on this data. We then train three versions of each model on progressively larger subsets of the data to see how accuracy trends with dataset size. This should indicate which model is likely to perform best as more images accumulate over the years.

### Table of Contents

1. Prepare the Training Data
2. Define and Train the CNN - **WIP**
3. Define and Train the Vision Transformer (ViT) - **WIP**
4. Evaluate CNN vs ViT - **WIP**
5. Retrain CNN and ViT on small, medium, and full HiRISE - **WIP**
6. Compare three CNNs vs three ViTs - **WIP**

## 1. Prepare the training data

In [1]:
# TensorFlow imports
import tensorflow as tf

# Helper libraries
import math
import numpy as np
import matplotlib
matplotlib.use('PS') #prevent import error due to venv
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from PIL import Image

# Imports for dataset separation
from sklearn.model_selection import train_test_split
from keras.preprocessing.image import ImageDataGenerator

# Improve progress bar display
import tqdm
import tqdm.auto
tqdm.tqdm = tqdm.auto.tqdm

Using TensorFlow backend.


## How niehusst preprocesses the data:

In [2]:
data_images = []
data_labels = []
rel_img_path = 'hirise-map-proj-v3/map-proj/' # add path of folder to image name for later loading

# open up the labeled data file
with open('hirise-map-proj-v3/labels-map-proj.txt') as labels:
  for line in labels:
    file_name, label = line.split(' ')
    data_images.append(rel_img_path + file_name)
    data_labels.append(int(label))

# divide data into testing and training (total len 3820)
train_images, test_images, train_labels, test_labels = train_test_split(
    data_images, data_labels, test_size=0.15, random_state=666)
test_len = len(test_images)   # 573
train_len = len(train_images) # 3247

# label translations
class_labels = ['other','crater','dark_dune','streak',
                'bright_dune','impact','edge']


In [3]:
# Print length of training set (should be 3247)
print(train_len)

3247


In [4]:
# Print length of testing set (should be 573)
print(test_len)

573
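
The split above is random rather than stratified, so the class proportions in the training and test sets are not guaranteed to match. A minimal sketch of a per-class count is shown below. The helper name `class_distribution` and the example labels are hypothetical; in the notebook the function would be called on the integer `train_labels` and `test_labels` before they are one-hot encoded.

```python
from collections import Counter

# Same label order as class_labels in the notebook
class_labels = ['other', 'crater', 'dark_dune', 'streak',
                'bright_dune', 'impact', 'edge']

def class_distribution(labels, names):
    """Return {class name: (count, fraction)} for a list of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    result = {}
    for index, name in enumerate(names):
        count = counts.get(index, 0)
        result[name] = (count, count / total if total else 0.0)
    return result

# Hypothetical labels for illustration only (not taken from the dataset)
example_labels = [0, 0, 0, 1, 2, 2, 6]
distribution = class_distribution(example_labels, class_labels)
for name, (count, fraction) in distribution.items():
    print(f"{name:12s} {count:4d} {fraction:6.1%}")
```

If the counts turn out to be heavily skewed toward one class, overall accuracy alone can be misleading, which is worth keeping in mind when comparing the two models later.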


### Data Preprocessing

In [5]:
#convert image paths into numpy matrices
def parse_image(filename):
  img_obj = Image.open(filename)
  img = np.asarray(img_obj).astype(np.float32)
  #normalize image to 0-1 range
  img /= 255.0
  return img

train_images = np.array(list(map(parse_image, train_images)))
test_images = np.array(list(map(parse_image, test_images)))

# Add 4th dimension to image arrays to allow for model.fit to take them as inputs
train_images = np.reshape(train_images, (-1, 227, 227, 1))
test_images = np.reshape(test_images, (-1, 227, 227, 1))

In [6]:
# Check that each training image is now a 227x227 array with a single channel
np.shape(train_images[0])

(227, 227, 1)

In [7]:
print(train_images[0])

[[[0.4       ]
  [0.4       ]
  [0.40392157]
  ...
  [0.32156864]
  [0.30980393]
  [0.29803923]]

 [[0.4       ]
  [0.4       ]
  [0.4       ]
  ...
  [0.30588236]
  [0.2901961 ]
  [0.27058825]]

 [[0.4       ]
  [0.4       ]
  [0.39607844]
  ...
  [0.29803923]
  [0.26666668]
  [0.23921569]]

 ...

 [[0.4117647 ]
  [0.4117647 ]
  [0.4117647 ]
  ...
  [0.654902  ]
  [0.6862745 ]
  [0.6784314 ]]

 [[0.40392157]
  [0.40392157]
  [0.40784314]
  ...
  [0.70980394]
  [0.75686276]
  [0.76862746]]

 [[0.4       ]
  [0.4       ]
  [0.4       ]
  ...
  [0.74509805]
  [0.80784315]
  [0.8352941 ]]]


### Convert labels to one-hot encoding

In [8]:
train_labels[0]

2

In [9]:
def to_one_hot(label):
  encoding = [0 for _ in range(len(class_labels))]
  encoding[label] = 1
  return np.array(encoding).astype(np.float32)

train_labels = np.array(list(map(to_one_hot, train_labels)))
test_labels = np.array(list(map(to_one_hot, test_labels)))

train_labels = np.reshape(train_labels, (-1, 7))
test_labels = np.reshape(test_labels, (-1, 7))

In [10]:
# Print one-hot encoding of labels (train_images[0] is classified as "dark_dune")
train_labels[0]

array([0., 0., 1., 0., 0., 0., 0.], dtype=float32)
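
The per-label `to_one_hot` function above can also be written as a single NumPy indexing step. The sketch below is an alternative, not the notebook's method. The names `to_one_hot_batch` and `from_one_hot_batch` are hypothetical, and the number of classes is assumed to be 7 to match `class_labels`.

```python
import numpy as np

NUM_CLASSES = 7  # assumed to match len(class_labels) in the notebook

def to_one_hot_batch(labels, num_classes=NUM_CLASSES):
    """One-hot encode a sequence of integer labels by indexing an identity matrix."""
    labels = np.asarray(labels, dtype=np.int64)
    return np.eye(num_classes, dtype=np.float32)[labels]

def from_one_hot_batch(encoded):
    """Recover the integer labels from one-hot rows."""
    return np.argmax(encoded, axis=1)

example = to_one_hot_batch([2, 0, 6])
print(example[0])
print(from_one_hot_batch(example))
```

The inverse function is handy later, since metrics such as a confusion matrix need integer labels rather than one-hot rows.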

## 2. Define and Train the CNN - WIP

#### Code Source: https://github.com/niehusst/HiRISE-Net

This GitHub project by **niehusst** sought to emulate the HiRISENet model from the "Deep Mars" paper. The CNN defined in that project served as our base model, which we tuned to improve test-set accuracy.

In [11]:
# make a generator to train the model with
generator = ImageDataGenerator(rotation_range=0, zoom_range=0,
    width_shift_range=0, height_shift_range=0, shear_range=0,
    horizontal_flip=False, fill_mode="nearest")

In [12]:
###             BUILD SHAPE OF THE MODEL              ###

# increase kernel size and stride??
model = tf.keras.Sequential([
  tf.keras.layers.Conv2D(32, (3,3), padding='same', activation=tf.nn.relu,
      input_shape=(227,227,1)),
  tf.keras.layers.MaxPooling2D((2,2), strides=2),
  tf.keras.layers.Conv2D(64, (3,3), padding='same', activation=tf.nn.relu),
  tf.keras.layers.MaxPooling2D((2,2), strides=2),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation=tf.nn.relu),
  tf.keras.layers.Dense(7, activation=tf.nn.softmax), # final layer with node for each classification
])

# specify the loss function and the optimizer (Adam)
model.compile(optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 227, 227, 32)      320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 113, 113, 32)      0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 113, 113, 64)      18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 56, 56, 64)        0         
_________________________________________________________________
flatten (Flatten)            (None, 200704)            0         
_________________________________________________________________
dense (Dense)                (None, 128)               25690240  
_________________________________________________________________
dense_1 (Dense)              (None, 7)                 903       
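
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer definitions, assuming the standard formulas: a convolution has kernel height × kernel width × input channels × filters weights plus one bias per filter, and a dense layer has inputs × units weights plus one bias per unit. The helper names are hypothetical.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(in_features, units):
    """Weights plus one bias per unit."""
    return in_features * units + units

def pooled_size(size, pool=2, stride=2):
    """Output side length of max pooling without padding."""
    return (size - pool) // stride + 1

side = 227
conv1 = conv2d_params(3, 3, 1, 32)     # first Conv2D, 'same' padding keeps 227x227
side = pooled_size(side)               # 227 -> 113
conv2 = conv2d_params(3, 3, 32, 64)    # second Conv2D
side = pooled_size(side)               # 113 -> 56
flat = side * side * 64                # Flatten output length
dense1 = dense_params(flat, 128)       # Dense(128)
dense2 = dense_params(128, 7)          # Dense(7) output layer
total = conv1 + conv2 + dense1 + dense2

print(conv1, conv2, flat, dense1, dense2, total)
```

Almost all of the parameters sit in the first dense layer, because it is connected to every one of the 200,704 flattened features. Adding another pooling stage or a strided convolution before `Flatten` would be one way to shrink the model.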

In [13]:
###                 TRAIN THE MODEL                   ###

#specify training metadata
BATCH_SIZE = 32

# train the model on the training data
num_epochs = 5
model.fit(generator.flow(train_images, train_labels, batch_size=BATCH_SIZE), epochs=num_epochs)

Train for 102 steps
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x203364ece08>
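
The "Train for 102 steps" line follows from the training set size and the batch size: each epoch needs enough batches to cover all 3247 images, and the last batch is smaller than the rest. A short sketch of that arithmetic, with hypothetical helper names:

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

def last_batch_size(num_samples, batch_size):
    """Size of the final, possibly partial, batch in an epoch."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size

print(steps_per_epoch(3247, 32))
print(last_batch_size(3247, 32))
```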

## 3. Define and Train the Vision Transformer (ViT) - WIP
Source used for this section: https://theaisummer.com/hugging-face-vit/

TODO: replace the code below with the approach from this notebook: https://github.com/google-research/vision_transformer/blob/main/vit_jax_augreg.ipynb

In [None]:
# Import files from repository.

import sys
if './vision_transformer' not in sys.path:
  sys.path.append('./vision_transformer')

%load_ext autoreload
%autoreload 2

from vit_jax import checkpoint
from vit_jax import models
from vit_jax import train
from vit_jax.configs import augreg as augreg_config
from vit_jax.configs import models as models_config

# DISREGARD

In [None]:
# TBR
from datasets import load_dataset
train_ds, test_ds = load_dataset('cifar10', split=['train[:5000]', 'test[:2000]'])

# TBR
splits = train_ds.train_test_split(test_size=0.1)
train_ds = splits['train']
val_ds = splits['test']

In [None]:
# Metric used to evaluate the model (accuracy)
from datasets import load_metric
metric = load_metric("accuracy")

# Instantiate ViT model
from transformers import ViTForImageClassification
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
model.eval()

In [None]:
# Feature extraction
from transformers import ViTFeatureExtractor

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')

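def preprocess_images(examples):
    # Convert each image to a channels-first (C, H, W) uint8 array
    images = examples['img']
    images = [np.array(image, dtype=np.uint8) for image in images]
    images = [np.moveaxis(image, source=-1, destination=0) for image in images]
    # Run the feature extractor to resize and normalize the images
    inputs = feature_extractor(images=images)
    examples['pixel_values'] = inputs['pixel_values']

    return examples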

from datasets import Features, ClassLabel, Array3D

features = Features({
    'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']),
    'img': Array3D(dtype="int64", shape=(3,32,32)),
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
})

preprocessed_train_ds = train_ds.map(preprocess_images, batched=True, features=features)
preprocessed_val_ds = val_ds.map(preprocess_images, batched=True, features=features)
preprocessed_test_ds = test_ds.map(preprocess_images, batched=True, features=features)

In [None]:
# Data collator - Used for forming batches from the dataset when training the model
from transformers import default_data_collator

data_collator = default_data_collator

In [None]:
# Defining the model - Part 1
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224-in21k')

model.train()

In [None]:
# Defining the model - Part 2
import torch.nn as nn
from transformers import ViTModel
from transformers.modeling_outputs import SequenceClassifierOutput

class ViTForImageClassification2(nn.Module):
    
    def __init__(self, num_labels=10):
        
        super(ViTForImageClassification2, self).__init__()
        self.vit = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
        self.classifier = nn.Linear(self.vit.config.hidden_size, num_labels)
        self.num_labels = num_labels
        
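    def forward(self, pixel_values, labels=None):
        
        outputs = self.vit(pixel_values=pixel_values)
        # Classify from the final hidden state of the [CLS] token (position 0)
        logits = self.classifier(outputs.last_hidden_state[:, 0])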
        loss = None
        
        if labels is not None:
            
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            
        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

In [None]:
# Calculate the metrics during evaluation (CUSTOM - May need to change)
def compute_metrics(eval_pred):
    
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
# Training arguments (defined before the Trainer that uses them)
from transformers import Trainer, TrainingArguments

metric_name = "accuracy"

args = TrainingArguments(
    "test-cifar-10",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",  # must match evaluation_strategy when load_best_model_at_end=True
    learning_rate=2e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    logging_dir='logs',
)

In [None]:
# Training the model
trainer = Trainer(
    model,
    args,
    train_dataset = preprocessed_train_ds,
    eval_dataset = preprocessed_val_ds,
    data_collator = data_collator,
    compute_metrics = compute_metrics,
)

trainer.train()

In [None]:
# Callbacks - This cell is not complete
from transformers.integrations import WandbCallback
callbacks = [WandbCallback(...)]

## 4. Evaluate CNN vs ViT - WIP

In [14]:
# Evaluate the CNN on the test set (best result so far: 81% after 5 epochs)
test_loss, test_accuracy = model.evaluate(test_images, test_labels)
print("Final loss was {}.\nAccuracy of model was {}".format(test_loss, test_accuracy))

Final loss was 0.6362588601170618.
Accuracy of model was 0.7870855331420898


In [None]:
# Evaluating the ViT
outputs = trainer.predict(preprocessed_test_ds)
y_pred = outputs.predictions.argmax(1)
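
To compare the ViT with the CNN on equal terms, the predicted classes need to be turned into an accuracy figure, and a confusion matrix shows which classes get mixed up. The sketch below uses plain Python lists and hypothetical helper names. In the notebook, the true labels would come from `outputs.label_ids` and the predictions from `y_pred`, both converted with `.tolist()`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        return 0.0
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels for illustration only (not model output)
true_labels = [0, 1, 2, 2]
pred_labels = [0, 2, 2, 2]
print(accuracy(true_labels, pred_labels))
print(confusion_matrix(true_labels, pred_labels, num_classes=3))
```

The same two functions can be applied to the CNN by taking the argmax of `model.predict(test_images)` and of the one-hot `test_labels`.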

In [None]:
# Make a plot for the clout


#### Short response to our findings:
Was the output expected? What optimizations did we apply? Is the model overfit or underfit?

## 5. Retrain CNN and ViT on small, medium, and full HiRISE - WIP
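
One way to build the small, medium, and full training sets is to shuffle the training indices once and take nested prefixes, so that each larger subset contains the smaller one and the test set stays fixed. This is a sketch under assumptions: the 25% and 50% fractions and the helper name `nested_subsets` are placeholders, not values chosen in this notebook. The seed reuses the `random_state` from the original split.

```python
import random

def nested_subsets(num_samples, fractions=(0.25, 0.5, 1.0), seed=666):
    """Return {fraction: index list}, where each larger subset contains the smaller ones."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # one shuffle shared by all subsets
    subsets = {}
    for fraction in sorted(fractions):
        size = max(1, round(num_samples * fraction))
        subsets[fraction] = indices[:size]
    return subsets

# 3247 is the training set size reported earlier in the notebook
subsets = nested_subsets(3247)
for fraction, idx in subsets.items():
    print(f"{fraction:.0%} of training set -> {len(idx)} images")
```

Each index list can then be used to select rows, for example `train_images[idx]` and `train_labels[idx]`, before fitting a fresh copy of each model.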

## 6. Compare three CNNs vs three ViTs - WIP

In [None]:
# Plot for the clout
