<html>
<body>
    <center> 
        <h1><u>Drosophila Species Categorizer</u></h1>
        <h3> Distinguishes D. mel from D. americanus and D. virilis</h3>
    </center>
</body>
</html>

In [1]:
### General libraries useful for python ###

import os
import sys
from tqdm.notebook import tqdm
import json
import random
import pickle
import copy
from IPython.display import display
import ipywidgets as widgets
import numpy as np

In [2]:
### Finding where you clone your repo, so that code upstream paths can be specified programmatically ####
git_dir = 'C:/Users/alisc/Documents/GitHub/Harvard_BAI'
print('Your github directory is :%s'%git_dir)

Your github directory is :C:/Users/alisc/Documents/GitHub/Harvard_BAI


In [3]:
project_folder = "%s/Final_Project"%git_dir

In [4]:
os.chdir(project_folder)

In [5]:
### Libraries for visualizing our results and data ###
from PIL import Image
import matplotlib.pyplot as plt

In [6]:
### Import PyTorch and its components ###
import torch
import torchvision
from torchvision import transforms
import torch.nn as nn
import torch.optim as optim

#### Let's load our flexible code-base which you will build on for your research projects in future assignments.

Like assignment 1, we are loading in our code-base for convenient dataloading/model loading etc.

In [7]:
### Making helper code under the folder res available. This includes loaders, models, etc. ###
sys.path.append('%s/res/'%git_dir)
from models.models import get_model
from loader.loader import get_loader

Models are being loaded from: C:\Users\alisc\Documents\GitHub\Harvard_BAI\res\models
Loaders are being loaded from: C:\Users\alisc\Documents\GitHub\Harvard_BAI\res\loader


#### See those paths printed above?

As earlier, models, i.e. architectures, are loaded from `res/models`. This notebook uses the ResNet152 architecture (see `model_arch` in the settings below), which is loaded from the script `res/ResNet.py`.

### Specifying settings/hyperparameters for our code below ###

By changing the settings below, different experiments can be run. For example, you can specify which model architecture to load, which dataset you will be loading, and so on.

In [8]:
wandb_config = {}
wandb_config['batch_size'] = 10
wandb_config['base_lr'] = 0.01
wandb_config['model_arch'] = 'ResNet152'
wandb_config['num_classes'] = 3
wandb_config['run_name'] = 'Final_Project'

### If you are using a CPU, please set wandb_config['use_gpu'] = 0 below. However, if you are using a GPU, leave it unchanged ####
wandb_config['use_gpu'] = 1

wandb_config['num_epochs'] = 100
wandb_config['git_dir'] = git_dir

### Data Loading ###

Running a script on a new dataset is one of the most common tasks in a project. In PyTorch this is done using data loaders, and it is extremely important to understand how they work. For built-in datasets such as MNIST, PyTorch provides easy loading functions; a custom dataset like the one used here needs its own data loader.

### Let's load our own custom dataset. We will be using images of three Drosophila species.

The `cats_dogs_loader` used later takes its name from the Cats vs Dogs dataset on Kaggle (https://www.kaggle.com/c/dogs-vs-cats/data), which was stored under Harvard_BAI/assignment_2/data/dogs-vs-cats/train/.

In this notebook the images are read from `Final_Project/data/insect_species/` instead.

Every folder under that path that contains files is treated as one species.

Data transforms tell PyTorch how to pre-process your data. Recall that images are usually stored with values between 0 and 255. One very common pre-processing step is to normalize each channel to zero mean and unit standard deviation, which makes the task easier for neural networks. There are many kinds of normalization in deep learning; the most basic is the one applied to the image data while loading it.
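
To make the arithmetic concrete, here is a small sketch of what happens to a single pixel value. It assumes the usual torchvision behavior, where `transforms.ToTensor()` scales 8-bit values from 0-255 to 0-1 and `transforms.Normalize(mean, std)` then computes `(x - mean) / std` per channel. The helper name `normalize_pixel` is made up for illustration; the mean and standard deviation are the red-channel values passed to `transforms.Normalize` in this notebook.

```python
def normalize_pixel(value_0_255, mean, std):
    """Mimic ToTensor followed by Normalize for one pixel of one channel."""
    scaled = value_0_255 / 255.0   # ToTensor: 0-255 -> 0.0-1.0
    return (scaled - mean) / std   # Normalize: subtract mean, divide by std

# Red-channel statistics used in this notebook's transforms.
RED_MEAN, RED_STD = 0.485, 0.229

darkest = normalize_pixel(0, RED_MEAN, RED_STD)
brightest = normalize_pixel(255, RED_MEAN, RED_STD)
print(round(darkest, 3), round(brightest, 3))
```

After normalization the values are no longer confined to 0-1: dark pixels become negative and bright pixels positive, roughly centered around zero.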

### Creates file lists for training (50%), validation (30%), and testing (20%) images

In [9]:
total_paths = {}
total_files = {}
iterator = 0
for path, directories, files in os.walk('%s/data/insect_species/'%project_folder): 
    if files != []:
        total_paths[iterator] = path
        total_files[iterator] = files
        iterator += 1

In [10]:
total_points = np.array(range(len(total_files)))
total_file_paths = {0:[], 1:[], 2:[]}
train_file_paths = {0:[], 1:[], 2:[]}
val_file_paths = {0:[], 1:[], 2:[]}
test_file_paths = {0:[], 1:[], 2:[]}

for species in range(len(total_files)):
    total_points[species] = len(total_files[species])
    ### Boundaries of the split: first 50% train, next 30% val, last 20% test.
    train_end = int(total_points[species]*0.5)
    val_end = int(total_points[species]*0.8)
    for file, file_name in enumerate(total_files[species]):
        file_path = total_paths[species] + '/' + file_name
        total_file_paths[species].append(file_path)
        ### Each file goes into exactly one subset, so no file is skipped at a boundary.
        if file < train_end:
            train_file_paths[species].append(file_path)
        elif file < val_end:
            val_file_paths[species].append(file_path)
        else:
            test_file_paths[species].append(file_path)
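
The same 50/30/20 partition can also be written with list slicing. This is only a sketch: `split_50_30_20` and the file names are hypothetical and not part of the notebook's code base. Slicing at shared boundaries makes it easy to check that every file ends up in exactly one subset.

```python
def split_50_30_20(items):
    """Split a list into contiguous train/val/test parts (50% / 30% / 20%)."""
    n = len(items)
    train_end = int(n * 0.5)   # end of the training slice
    val_end = int(n * 0.8)     # end of the validation slice
    return items[:train_end], items[train_end:val_end], items[val_end:]

example_files = ['img_%d.jpg' % i for i in range(10)]  # made-up file names
train, val, test = split_50_30_20(example_files)
print(len(train), len(val), len(test))

# The three parts together reproduce the original list, with nothing dropped.
print(train + val + test == example_files)
```

Note that this split is positional: without shuffling first, the subsets follow whatever order the files were listed in.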

In [11]:
labels_dictionary = {}

for species in range(len(total_file_paths)):
    for file_path in total_file_paths[species]:
        labels_dictionary[file_path] = species


Dumps the created labels_dictionary and training, val, and test data file paths for later reference.

In [12]:
with open('%s/data/labels_dictionary.p'%project_folder, 'wb') as F:
    pickle.dump(labels_dictionary, F)

with open('%s/data/train_file_list.txt'%project_folder, 'w') as filehandle:
    for i in range(3):
        for listitem in train_file_paths[i]:
            filehandle.write('%s\n' % listitem)

with open('%s/data/val_file_list.txt'%project_folder, 'w') as filehandle:
    for i in range(3):
        for listitem in val_file_paths[i]:
            filehandle.write('%s\n' % listitem)
            
with open('%s/data/test_file_list.txt'%project_folder, 'w') as filehandle:
    for i in range(3):
        for listitem in test_file_paths[i]:
            filehandle.write('%s\n' % listitem)
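
A quick way to check dumps like these is to read them back and confirm that every listed path has a label. The sketch below does a round trip on made-up data in a temporary directory, so it does not touch the project files; `read_file_list` and the example paths are assumptions for illustration.

```python
import os
import pickle
import tempfile

def read_file_list(path):
    """Return the non-empty lines of a file list written one path per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_labels = {'species_a/img_0.jpg': 0, 'species_b/img_1.jpg': 1}  # made-up data
    labels_path = os.path.join(tmp_dir, 'labels_dictionary.p')
    list_path = os.path.join(tmp_dir, 'train_file_list.txt')

    # Write the labels dictionary and the file list, as in the cell above.
    with open(labels_path, 'wb') as F:
        pickle.dump(demo_labels, F)
    with open(list_path, 'w') as filehandle:
        for file_path in demo_labels:
            filehandle.write('%s\n' % file_path)

    # Read both back.
    with open(labels_path, 'rb') as F:
        loaded_labels = pickle.load(F)
    loaded_paths = read_file_list(list_path)

# Paths that appear in the list but have no label.
missing = [p for p in loaded_paths if p not in loaded_labels]
print(len(loaded_paths), len(missing))
```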

# Using our custom data-loader. 

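We use the `cats_dogs_loader` for now. The name is a holdover from the Cats vs Dogs dataset; the loader only needs a file list and a labels dictionary, so the formatting of our dataset is absolutely fine.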

In [13]:
file_list_loader = get_loader('cats_dogs_loader')

In [14]:
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
     'test': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}

In assignment 1 we just used `torchvision.datasets.MNIST` to load MNIST data. We can't rely on that function now, because we have a custom dataset. Here we learn how to handle such custom data.

We use our custom file_list_loader to load data that we have downloaded and unzipped. Above, we created file lists which contain the paths to our train, validation and test images; here we pass these file lists to the file_list_loader.

### Open the file cats_dogs_loader.py; you will see a class FileListFolder.

In the file loader.py, the function get_loader returns this class FileListFolder. So, when we run get_loader("cats_dogs_loader") above, what is returned is the class FileListFolder. When we then call file_list_loader(...), the arguments are passed to the class FileListFolder as defined in cats_dogs_loader.py.
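
The real `get_loader` in loader.py is not shown in this notebook, so the following is only a guess at the general pattern: a lookup from a name to a class. Everything here (`FileListFolderSketch`, `LOADER_REGISTRY`, `get_loader_sketch`) is hypothetical and exists only to show the difference between getting the class and creating an object from it.

```python
class FileListFolderSketch:
    """Stand-in for a dataset class; it only records its arguments."""
    def __init__(self, file_list, labels_dictionary, transform=None):
        self.file_list = file_list
        self.labels_dictionary = labels_dictionary
        self.transform = transform

# Hypothetical registry mapping loader names to classes.
LOADER_REGISTRY = {'cats_dogs_loader': FileListFolderSketch}

def get_loader_sketch(name):
    """Return the class registered under `name` (the class, not an instance)."""
    if name not in LOADER_REGISTRY:
        raise ValueError('Unknown loader: %s' % name)
    return LOADER_REGISTRY[name]

loader_class = get_loader_sketch('cats_dogs_loader')                   # this is a class
dataset = loader_class('train_file_list.txt', 'labels_dictionary.p')   # __init__ runs here
print(loader_class.__name__, type(dataset).__name__)
```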

Each such call runs the __init__ function, i.e. an object of that class is initialized. As you can see, the __init__ function in cats_dogs_loader.py requires 3 arguments: a file list, a labels dictionary and a PyTorch transform object. To create a new data loader, we need to create a file like cats_dogs_loader.py and make the necessary changes to it.

The file lists contain paths to the train/val/test files. The labels dictionary stores the category number for each of these files, and the transforms are the pre-processing PyTorch applies to our loaded images before training starts.
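
For readers who want to write their own loader, here is a minimal sketch of a dataset built from those three ingredients. It is not the actual FileListFolder from cats_dogs_loader.py, which is not shown here. The class name `FileListDatasetSketch` and the extra `image_loader` argument are assumptions of this sketch; by default images are opened with PIL.

```python
import pickle

class FileListDatasetSketch:
    """Map-style dataset: index -> (transformed image, label)."""

    def __init__(self, file_list_path, labels_path, transform=None, image_loader=None):
        # One image path per line in the file list.
        with open(file_list_path) as handle:
            self.file_paths = [line.strip() for line in handle if line.strip()]
        # Pickled dictionary mapping each path to its category number.
        with open(labels_path, 'rb') as handle:
            self.labels = pickle.load(handle)
        self.transform = transform
        # image_loader is an assumption of this sketch; PIL is the default.
        self.image_loader = image_loader or self._load_with_pil

    @staticmethod
    def _load_with_pil(path):
        from PIL import Image  # imported lazily, only needed for real images
        return Image.open(path).convert('RGB')

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, index):
        path = self.file_paths[index]
        image = self.image_loader(path)
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[path]
```

An object like this can be handed to `torch.utils.data.DataLoader`, which works with any map-style dataset that implements `__getitem__` and `__len__`.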

In [15]:
dsets = {}
dsets['train'] = file_list_loader('%s/data/train_file_list.txt'%project_folder, '%s/data/labels_dictionary.p'%project_folder, data_transforms['train'])
dsets['val'] = file_list_loader('%s/data/val_file_list.txt'%project_folder, '%s/data/labels_dictionary.p'%project_folder, data_transforms['val'])
dsets['test'] = file_list_loader('%s/data/test_file_list.txt'%project_folder, '%s/data/labels_dictionary.p'%project_folder, data_transforms['test'])

In [16]:
### Above, we created datasets. Now, we will pass them into pytorch's inbuild dataloaders, 
### these will help us load batches of data for training.
dset_loaders = {}
dset_loaders['train'] = torch.utils.data.DataLoader(dsets['train'], batch_size=wandb_config['batch_size'], shuffle = True, num_workers=2,drop_last=False)
dset_loaders['val'] = torch.utils.data.DataLoader(dsets['val'], batch_size=wandb_config['batch_size'], shuffle = False, num_workers=2,drop_last=False)
dset_loaders['test'] = torch.utils.data.DataLoader(dsets['test'], batch_size=wandb_config['batch_size'], shuffle = True, num_workers=2,drop_last=False)
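
With `drop_last=False`, a data loader keeps the final, possibly smaller batch. The sketch below works through the arithmetic for a made-up dataset of 57 images and the batch size of 10 used here; `num_batches` and `last_batch_size` are hypothetical helpers, not PyTorch functions, and they assume a non-empty dataset.

```python
import math

def num_batches(dataset_size, batch_size, drop_last=False):
    """Number of batches one pass over the data yields."""
    if drop_last:
        return dataset_size // batch_size      # incomplete final batch is discarded
    return math.ceil(dataset_size / batch_size)  # incomplete final batch is kept

def last_batch_size(dataset_size, batch_size):
    """Size of the final batch when drop_last=False."""
    remainder = dataset_size % batch_size
    return remainder if remainder else batch_size

print(num_batches(57, 10), last_batch_size(57, 10))
print(num_batches(57, 10, drop_last=True))
```

This matters for bookkeeping during training: per-batch quantities should not be assumed to cover exactly `batch_size` images.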

In [17]:
data_sizes = {}
data_sizes['train'] = len(dsets['train'])
data_sizes['val'] = len(dsets['val'])
data_sizes['test'] = len(dsets['test'])

# Load a ResNet152 model

In [18]:
### We load a ResNet152 model through get_model (see `res/models/models.py`), 
### then replace its final fully connected layer so that it outputs 
### one score per species (wandb_config['num_classes']).

model = get_model('ResNet152', 1000)
in_filters = model.fc.in_features
model.fc = nn.Linear(in_features=in_filters, out_features=wandb_config['num_classes'])
### Only move the model to the GPU when use_gpu is set.
if wandb_config['use_gpu']:
    model.cuda();

#### Below we have the function which trains, tests and returns the best model weights.

In [19]:
def model_pipeline(model, criterion, optimizer, dset_loaders, dset_sizes, hyperparameters):
    #with wandb.init(project="HARVAR_BAI", config=hyperparameters):
        #if hyperparameters['run_name']:
            #wandb.run.name = hyperparameters['run_name']
    ### Use the settings passed in by the caller rather than the global wandb_config.
    config = hyperparameters
    best_model = model
    best_acc = 0.0

    print(config)

    for epoch_num in tqdm(range(config['num_epochs'])):
        #wandb.log({"Current Epoch": epoch_num})
        model = train_model(model, criterion, optimizer, dset_loaders, dset_sizes, config)
        best_acc, best_model = val_model(model, best_acc, best_model, dset_loaders, dset_sizes, config)

    return best_model

#### The different steps of the train model function are annotated below inside the function. Read them step by step

In [20]:
def train_model(model, criterion, optimizer, dset_loaders, dset_sizes, configs):
    print('Starting training epoch...')
    best_model = model
    best_acc = 0.0

    
    ### This puts the model in training mode, which changes how layers such as dropout and batch norm behave.
    ### Gradient tracking is on by default, so nothing else is needed for the weight updates below.
    model.train() 
    running_loss = 0.0
    running_corrects = 0
    iters = 0
    
    
    ### We loop over the data loader we created above. Simply using a for loop.
    for data in enumerate(dset_loaders['train']):
        inputs, labels = data[1][0], data[1][1]
        
        ### If you are using a gpu, then script will move the loaded data to the GPU. 
        ### If you are not using a gpu, ensure that wandb_config['use_gpu'] is set to 0 above.
        if wandb_config['use_gpu']:
            inputs = inputs.float().cuda()
            labels = labels.long().cuda()
        else:
            print('WARNING: NOT USING GPU!')
            inputs = inputs.float()
            labels = labels.long()

        
        ### We set the gradients to zero, then calculate the outputs, and the loss function. 
        ### Gradients for this process are automatically calculated by PyTorch.
        
        optimizer.zero_grad()
        outputs = model(inputs)
        _, preds = torch.max(outputs.data, 1)

        loss = criterion(outputs, labels)
        
        
        ### At this point, the program has calculated gradient of loss w.r.t. weights of our NN model.
        loss.backward()
        optimizer.step()
        
        ### optimizer.step() updated the models weights using calculated gradients.
        
        ### Let's store these and log them using wandb. They will be displayed in a nice online
        ### dashboard for you to see.
        
        iters += 1
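        ### criterion returns the mean loss over the batch, so we weight it by the batch size.
        ### Dividing by the dataset size after the loop then gives a per-image loss.
        running_loss += loss.item() * inputs.size(0)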
        running_corrects += torch.sum(preds == labels.data)
        #wandb.log({"train_running_loss": running_loss/float(iters*len(labels.data))})
        #wandb.log({"train_running_corrects": running_corrects/float(iters*len(labels.data))})

    epoch_loss = float(running_loss) / dset_sizes['train']
    epoch_acc = float(running_corrects) / float(dset_sizes['train'])
    #wandb.log({"train_accuracy": epoch_acc})
    #wandb.log({"train_loss": epoch_loss})
    return model
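
A note on the epoch loss: `nn.CrossEntropyLoss` returns the mean over the batch by default (`reduction='mean'`), so a per-image loss for the whole epoch weights each batch mean by its batch size before dividing by the dataset size. The sketch below shows that arithmetic on made-up numbers; `epoch_mean_loss` is a hypothetical helper.

```python
def epoch_mean_loss(batch_mean_losses, batch_sizes):
    """Per-sample loss over an epoch, from per-batch mean losses."""
    total = sum(loss * size for loss, size in zip(batch_mean_losses, batch_sizes))
    return total / sum(batch_sizes)

# Made-up values: two full batches of 10 and a final batch of 5.
batch_mean_losses = [1.0, 0.5, 0.2]
batch_sizes = [10, 10, 5]

weighted = epoch_mean_loss(batch_mean_losses, batch_sizes)
# Summing the batch means and dividing by the dataset size understates the loss.
unweighted = sum(batch_mean_losses) / sum(batch_sizes)
print(weighted, unweighted)
```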



In [21]:
def val_model(model, best_acc, best_model, dset_loaders, dset_sizes, configs):
    print('Starting testing epoch...')
    model.eval() ### switches layers such as dropout and batch norm to evaluation mode.

    running_corrects = 0
    iters = 0   
    ### torch.no_grad() turns off gradient tracking, as we won't be updating weights while validating.
    with torch.no_grad():
        for data in enumerate(dset_loaders['val']):
            inputs, labels = data[1][0], data[1][1]

            if wandb_config['use_gpu']:
                inputs = inputs.float().cuda()
                labels = labels.long().cuda()
            else:
                print('WARNING: NOT USING GPU!')
                inputs = inputs.float()
                labels = labels.long()

            outputs = model(inputs)
            _, preds = torch.max(outputs.data, 1)

            iters += 1
            running_corrects += torch.sum(preds == labels.data)
            #wandb.log({"val_running_corrects": running_corrects/float(iters*len(labels.data))})


    epoch_acc = float(running_corrects) / float(dset_sizes['val'])

    #wandb.log({"test_accuracy": epoch_acc})
    
    ### Code is very similar to train set. One major difference, we don't update weights. 
    ### We only check the performance is best so far, if so, we save this model as the best model so far.
    
    if epoch_acc > best_acc:
        best_acc = epoch_acc
        best_model = copy.deepcopy(model)
    #wandb.log({"best_accuracy": best_acc})
    
    return best_acc, best_model
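
The reason for `copy.deepcopy` is that training keeps changing the model in place, so a plain reference would silently follow those changes. Here is a minimal sketch of the same bookkeeping, using a dictionary as a stand-in for a model; `keep_best` and the accuracy values are made up for illustration.

```python
import copy

def keep_best(model, accuracy, best_accuracy, best_model):
    """Return the updated (best_accuracy, best_model) pair."""
    if accuracy > best_accuracy:
        # Snapshot the model so later training cannot alter the saved copy.
        return accuracy, copy.deepcopy(model)
    return best_accuracy, best_model

model = {'weight': 1.0}   # stand-in for a network's parameters
best_accuracy, best_model = 0.0, model

best_accuracy, best_model = keep_best(model, 0.70, best_accuracy, best_model)
model['weight'] = 2.0     # "training" continues and mutates the model in place
best_accuracy, best_model = keep_best(model, 0.65, best_accuracy, best_model)

print(best_accuracy, best_model['weight'])
```

The second epoch is worse, so the snapshot from the first epoch is kept, with its original weight.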
    

# Make sure your runtime is GPU. If you changed your run time, make sure to run your code again from the top.

In [23]:
### Criterion is simply specifying what loss to use. Here we choose cross entropy loss. 
criterion = nn.CrossEntropyLoss()

### tells what optimizer to use. There are many options, we here choose Adam.
### the main difference between optimizers is that they vary in how weights are updated based on calculated gradients.
optimizer_ft = optim.Adam(model.parameters(), lr = wandb_config['base_lr'])

if wandb_config['use_gpu']:
    criterion.cuda()
    model.cuda()
    

### Creating the folder where our models will be saved.
if not os.path.isdir("%s/saved_models/"%wandb_config['git_dir']):
    os.mkdir("%s/saved_models/"%wandb_config['git_dir'])
    
### Let's run it all, and save the final best model.
best_final_model = model_pipeline(model, criterion, optimizer_ft, dset_loaders, data_sizes, wandb_config)


save_path = '%s/saved_models/%s_final.pt'%(wandb_config['git_dir'], wandb_config['run_name'])
with open(save_path,'wb') as F:
    torch.save(best_final_model,F)
print('Best model saved in %s'%save_path)

{'batch_size': 10, 'base_lr': 0.01, 'model_arch': 'ResNet152', 'num_classes': 3, 'run_name': 'Final_Project', 'use_gpu': 1, 'num_epochs': 100, 'git_dir': 'C:/Users/alisc/Documents/GitHub/Harvard_BAI'}



Starting training epoch...
Starting testing epoch...
Starting training epoch...



KeyboardInterrupt: 

### Save final model 