# A simple CNN for Poverty dataset

In this HW, you will be working on implementing models / algorithms for performing classification on the [WILDS Poverty Map](https://wilds.stanford.edu/datasets/#povertymap) dataset.
To help you get started with this, we will build a simple CNN for binary classification on the given dataset. In this notebook, I'll be covering the following things:
1. Loading data
2. Designing a CNN
3. Training the model
4. Evaluating the model

**The code given here assumes you have access to an NVIDIA GPU and have CUDA installed.**

## Imports

In [1]:
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from pytorch_lightning.loggers import CSVLogger
from sklearn.model_selection import train_test_split
import torchmetrics.functional as metrics
from resnet import *
import ntpath
import collections


## 1. Loading data

The dataset consists of 8-channel satellite images from 23 countries, each paired with a wealth index. The wealth index has been thresholded to provide binary labels: wealthy (1) or poor (0). You can read more about the dataset and its characteristics [here](https://www.nature.com/articles/s41467-020-16185-w).

### Dataset format
The images are stored as NumPy archives (`.npz`) and are available in `/datasets/cs255-sp22-a00-public/poverty/anon_images` on datahub. We have partitioned the data into train and test sets. The partitions, and the mapping from each file to its label, are provided as CSV files:
- `train.csv`: metadata for the training split
- `random_test_reduct.csv`: test images sampled from all countries included in the training set
- `country_test_reduct.csv`: test images sampled from countries NOT in the training set

We provide 2 test sets as we want you to work on 2 different problems. The `random_test_reduct` partition is for in-domain testing and the `country_test_reduct` partition is meant for out-of-domain testing. The former problem is the easier one to start and the focus of this notebook.
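
To make the `.npz` format concrete, here is a minimal sketch that writes and reads back one synthetic archive. It assumes, as the dataset class later in this notebook does via `image.f.x`, that each file stores its array under the key `x`. The file name, the random values and the `float32` dtype are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for one satellite tile: 8 channels, 224 x 224 pixels.
# Values and dtype are assumptions; only the key "x" mirrors the notebook.
fake_image = np.random.rand(8, 224, 224).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "image0.npz")  # hypothetical file name
    np.savez(path, x=fake_image)

    with np.load(path) as archive:
        keys = archive.files      # names of the arrays stored in the archive
        loaded = archive["x"]     # same array that `archive.f.x` returns

print(keys, loaded.shape, loaded.dtype)
```

Reading `archive.files` on one of the real files is a quick way to confirm which keys it actually contains.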

In [2]:
csv_path = '/home/bitwiz/codeden/ta/cse255/Poverty_Analysis/public_tables/'
train_csv_path = os.path.join(csv_path, 'train.csv')
test_csv_path = os.path.join(csv_path, 'random_test_reduct.csv')
image_path = '/home/bitwiz/codeden/data/wilds/anon_images'

In [3]:
train_df = pd.read_csv(train_csv_path, index_col = 0)
train_df.head()

Unnamed: 0,filename,country,wealthpooled,urban,label,nl_mean
0,image14517.npz,6,-1.019361,False,0,-0.086633
2,image7407.npz,6,-1.143002,False,0,-0.141589
3,image390.npz,6,1.056769,True,0,15.228898
4,image7980.npz,6,1.454064,True,1,11.082343
5,image13397.npz,6,1.708446,True,1,12.646744


### Dataloading pipeline

To work with this dataset, we need to convert the data to a format compatible with PyTorch. PyTorch exposes a set of data utilities in `torch.utils.data`. The two important classes you need to know about are `torch.utils.data.Dataset` and `torch.utils.data.DataLoader`. The `Dataset` class represents a map-style dataset, which lets you access individual samples by index. The `DataLoader` class loads the data in batches and feeds them to your training or evaluation loop.

For our Poverty dataset, we will implement a custom `Dataset` class that will load the csv file and appropriately configure the data. The `Dataset` class needs to implement 2 methods: `__len__` and `__getitem__` as a bare minimum.
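
Before writing the real class, it may help to see the two required methods on a toy example. The sketch below uses a hypothetical `ToyDataset` filled with random tensors (not the Poverty data) and wraps it in a `DataLoader` to show how indexing turns into batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical map-style dataset of 10 random 8-channel 'images'."""

    def __init__(self, n_items=10):
        self.images = torch.randn(n_items, 8, 4, 4)
        self.labels = torch.arange(n_items) % 2  # alternating 0/1 labels

    def __len__(self):
        # Number of samples; the DataLoader uses this to know when to stop.
        return len(self.images)

    def __getitem__(self, idx):
        # One (image, label) pair; the DataLoader stacks these into batches.
        return self.images[idx], self.labels[idx]


toy_ds = ToyDataset()
toy_loader = DataLoader(toy_ds, batch_size=4, shuffle=False)

batch_shapes = [tuple(x.shape) for x, _ in toy_loader]
print(len(toy_ds), batch_shapes)
```

With 10 samples and a batch size of 4, the last batch is smaller than the others because `drop_last` defaults to `False`.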


In [4]:
class WildsDataset(Dataset):
    '''
    Custom Dataset class for wilds poverty dataset
    input:
        image_paths: list of paths to the .npz image files
        idx_to_class: a dictionary mapping each image filename to its label
    '''
    def __init__(self, image_paths, idx_to_class = None, transform = None):
        super().__init__()
        self.image_paths = image_paths
        self.idx_to_class = idx_to_class
        self.transform = transform
    
    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        
        image = np.load(self.image_paths[idx])
        image = image.f.x
        
        if self.transform:
            image = self.transform(image)
        
        if self.idx_to_class:
            index = self.image_paths[idx].split('/')[-1]
            label = self.idx_to_class[index]
        else:
            return image
        
        return image, label

#### Test the dataset

Our `WildsDataset` takes two main arguments: a list of image paths and a mapping from image filenames to labels.
We will generate these image paths and the mapping from `train.csv`.

In [5]:
csv_rows = train_df.loc[:, ['filename', 'label']].to_dict(orient='records')
label_map = {x['filename']: x['label'] for x in csv_rows}

train_image_paths = [os.path.join(image_path, csv_rows[index]['filename']) for index in range(len(train_df))]

In [6]:
ds = WildsDataset(train_image_paths, label_map)
x, y = ds[0]
print(x.shape, y)

(8, 224, 224) 0


### Convert `Dataset` to `DataLoader`

Now that we have the dataset class implemented, we will wrap it in the `DataLoader` class to generate batches of images to train the model on.
To make life easier, we use a package called `pytorch-lightning`, which abstracts away much of the boilerplate code of PyTorch. You do not need to know much about this package to use it. We will walk through the relevant functions in this notebook.

We will implement a `LightningDataModule` which provides APIs to fetch the required DataLoader. Below is the template for creating a dataModule. We will define the `_create` function next which will initialize the train, val and test splits of the data.

In [7]:
class WildsDM(pl.LightningDataModule):
    def __init__(self, csv_path, image_path, test_csv_path, batch_size = 256, train_val_split_ratio = 0.9, test = True):
        super().__init__()
        self.csv_path = csv_path
        self.image_path = image_path
        self.batch_size = batch_size
        self.train_val_split_ratio = train_val_split_ratio
        self.test_csv_path = test_csv_path
        
        if test:
            self._create()
    
    def setup(self, stage):
        self._create()
    
    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size = self.batch_size, shuffle = True, num_workers = 8)
    
    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size = self.batch_size, shuffle = False, num_workers = 8)
    
    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size = self.batch_size, shuffle = False, num_workers = 8)

We will now implement the `_create` function. It implements the same logic we wrote in `Test the dataset`.

In [8]:
class WildsDM(WildsDM):
    def _create(self):
        csv_df = pd.read_csv(self.csv_path, index_col = 0)
        # Use the dataframe loaded from self.csv_path, not the notebook-level train_df
        csv_rows = csv_df.loc[:, ['filename', 'label']].to_dict(orient='records')
        
        train_indices, val_indices = train_test_split(range(len(csv_rows)), train_size = self.train_val_split_ratio)
        
        train_image_paths = [os.path.join(self.image_path, csv_rows[index]['filename']) for index in train_indices]
        val_image_paths = [os.path.join(self.image_path, csv_rows[index]['filename']) for index in val_indices]
        label_map = {x['filename']: x['label'] for x in csv_rows}
        
        self.train_set = WildsDataset(train_image_paths, label_map)
        self.val_set = WildsDataset(val_image_paths, label_map)
        
        test_df = pd.read_csv(self.test_csv_path, index_col = 0)
        self.test_image_paths = [os.path.join(self.image_path, row['filename']) for index,row in test_df.iterrows()]
        self.test_set = WildsDataset(self.test_image_paths)
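
Note that `_create` runs both in `__init__` (when `test = True`) and again in `setup`, and `train_test_split` is called without a seed, so each call draws a different train/val split. If a stable validation set matters for your experiments, one option is to pass `random_state`. The sketch below illustrates this on a hypothetical row count; the seed value 42 is arbitrary.

```python
from sklearn.model_selection import train_test_split

n_rows = 100  # hypothetical number of rows in the training CSV

# Same call as in `_create`, plus a fixed seed so repeated calls agree.
train_a, val_a = train_test_split(range(n_rows), train_size=0.9, random_state=42)
train_b, val_b = train_test_split(range(n_rows), train_size=0.9, random_state=42)

same_split = list(train_a) == list(train_b) and list(val_a) == list(val_b)
print(len(train_a), len(val_a), same_split)
```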

#### Test dataloaders

In [None]:
dm = WildsDM(train_csv_path, image_path, test_csv_path)

for batch in dm.train_dataloader():
    x, y = batch
    print(x.shape, y.shape)
    break

for batch in dm.val_dataloader():
    x, y = batch
    print(x.shape, y.shape)
    break
    
for batch in dm.test_dataloader():
    x = batch
    print(x.shape)
    break

## 2. Model Architecture and Loss

Next we will look at defining a CNN architecture and setting up the loss functions and optimizer to train the CNN. Again we will make use of `pytorch-lightning` to abstract the boilerplate code for PyTorch training.

The lightning framework exposes a `LightningModule` which instantiates functions for training, validation and testing.
First we instantiate the model, loss function and define the forward pass for the model. For our model, we will be using a ResNet-18 architecture. ResNet is a residual CNN architecture that was introduced in this [paper](https://arxiv.org/pdf/1512.03385.pdf).
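
For intuition about what "residual" means, here is a minimal sketch of a ResNet-style basic block. It is not the `resnet.py` shipped with this assignment, and the class name, channel counts and input size are chosen only for illustration. The key idea is that the block's input is added back to its output through a shortcut connection.

```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    """Sketch of a ResNet-style basic block (illustrative, not the course code)."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()
        # Identity shortcut when shapes match; otherwise a 1x1 projection.
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # residual (skip) connection


block = BasicBlock(in_channels=8, out_channels=16, stride=2)
out = block(torch.randn(2, 8, 32, 32))
print(out.shape)
```

The 8 input channels here mirror the 8-channel satellite images, which is also why the model below is constructed with `num_channels = 8`.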

In [9]:
class baseline_module(pl.LightningModule):
    def __init__(self, lr = 0.001, weight_decay = 1e-4):
        super().__init__()
        self.model = ResNet18(num_classes = 2, num_channels = 8)
        self.lr = lr
        self.loss = nn.CrossEntropyLoss()
        self.weight_decay = weight_decay
        
        self.save_hyperparameters()
    
    def forward(self, x):
        out = self.model(x)
        
        return out

Next we define the operations performed during each step of training, validation and testing, using the hooks provided by `LightningModule`. It provides 3 functions, `training_step`, `validation_step` and `test_step`, in which you define the operations that happen on one batch of data. Internally, Lightning loads batches of data from the `dataModule` and, for each batch, calls the appropriate step function depending on whether the model is being trained or evaluated. Each step function takes a single batch as input, and `training_step` is expected to return the loss value for backpropagation.

In [10]:
class baseline_module(baseline_module):
    def training_step(self, batch, batch_idx):
        # process a single batch
        loss, acc = self.single_step(batch)
        
        #log the values and display them on the progress bar
        self.log('tloss', loss, on_epoch=True, on_step=False, logger=True, prog_bar=True)
        self.log('tacc', acc, on_epoch=True, on_step=False, logger=True, prog_bar=True)
        
        return loss
    
    def validation_step(self, batch, batch_idx):
        loss, acc = self.single_step(batch)
        
        self.log('vloss', loss, on_epoch=True, on_step=False, logger=True, prog_bar=True)
        self.log('vacc', acc, on_epoch=True, on_step=False, logger=True, prog_bar=True)
    
#     def test_step(self, batch, batch_idx):
#         loss = self.single_step(test = True)
        
#         self.log('test_loss', loss, on_epoch=True, on_step=False, logger=True, prog_bar=True)

### Forward pass (Process a single batch)

In [11]:
class baseline_module(baseline_module):
    def single_step(self, batch, test = False):
        x, y = batch
        y_hat = self(x)
        
        loss = self.loss(y_hat, y)
        _, preds = torch.max(y_hat, dim = 1)
        
        acc = metrics.accuracy(preds, y)
        
        return loss, acc
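
To see what `single_step` computes, here is a small worked example on hand-written logits (the numbers are made up). It checks that `nn.CrossEntropyLoss` equals the mean negative log-softmax probability of the true class, and computes accuracy from the arg-max predictions in plain PyTorch rather than through `torchmetrics`.

```python
import torch
import torch.nn as nn

# Three samples, two classes; values are illustrative only.
logits = torch.tensor([[2.0, 0.0], [0.0, 1.0], [1.0, 3.0]])
labels = torch.tensor([0, 0, 1])

loss = nn.CrossEntropyLoss()(logits, labels)

# Manual version: -log softmax(logits)[true class], averaged over the batch.
log_probs = torch.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(len(labels)), labels].mean()

# torch.max(..., dim=1) returns (values, indices); the indices are the predictions.
_, preds = torch.max(logits, dim=1)
acc = (preds == labels).float().mean()

print(loss.item(), manual_loss.item(), preds.tolist(), acc.item())
```

The second sample is misclassified, so two of the three predictions are correct.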

Now that we have the model and have implemented the forward passes, we need to configure the optimizer for training. The `LightningModule` takes care of backpropagation and moving the data to GPUs.

In [12]:
class baseline_module(baseline_module):
    def configure_optimizers(self):
        optim = torch.optim.SGD(self.parameters(), lr=self.lr, weight_decay=self.weight_decay, momentum=0.9)
        
        return {'optimizer': optim}
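
As a rough illustration of what this optimizer does to each parameter, the sketch below applies the SGD update with weight decay and momentum to a single scalar. It follows the update rule documented for `torch.optim.SGD` with its default `dampening = 0` and without Nesterov momentum. The helper name and the toy loss `p**2` are assumptions made for this example.

```python
def sgd_momentum_step(param, grad, buf, lr=0.001, momentum=0.9, weight_decay=1e-4):
    """One scalar SGD step with weight decay and momentum (dampening = 0)."""
    g = grad + weight_decay * param                  # weight decay adds an L2 term to the gradient
    buf = g if buf is None else momentum * buf + g   # velocity; the first step just copies g
    return param - lr * buf, buf


# Toy problem: minimise p**2, whose gradient is 2 * p.
p, buf = 1.0, None
history = []
for _ in range(3):
    p, buf = sgd_momentum_step(p, 2.0 * p, buf)
    history.append(p)

print(history)
```

The default arguments match the `lr`, `weight_decay` and `momentum` values used in `baseline_module`. Because the velocity accumulates past gradients, each step here is larger than the previous one.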

## 3. Training

We have implemented the data pipeline and the model training pipeline. The last step is to start training. `pytorch_lightning` provides a `Trainer` class which wraps the model and data module into an end-to-end pipeline and begins training. We will also add a logger and a checkpoint callback to save our best model along with its hyperparameters and the training and validation logs.

In [13]:
csv_logger = CSVLogger(save_dir = './logs', name = 'resnet', version = 'test')
ckpt = pl.callbacks.ModelCheckpoint(dirpath = './checkpoints', monitor = 'vloss', mode = 'min')

In [14]:
model = baseline_module()
dm = WildsDM(train_csv_path, image_path, test_csv_path)

trainer = pl.Trainer(
    gpus = 1, accelerator = 'gpu', max_epochs = 5, precision = 16,
    strategy = 'dp', logger = [csv_logger], callbacks = [ckpt]
)

Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [15]:
trainer.fit(model, datamodule = dm)

  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name  | Type             | Params
-------------------------------------------
0 | model | ResNet18         | 11.2 M
1 | loss  | CrossEntropyLoss | 0     
-------------------------------------------
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
22.386    Total estimated model params size (MB)

### Loading a pre-trained model

`LightningModule` also provides APIs to easily load pre-trained models and resume training from a checkpoint.
For us to load a pre-trained model, the model's module/class needs to be imported in the namespace.

In [None]:
ckpt_path = ckpt.best_model_path

# Any parameters you want to change while loading the model can be passed along as well
pre_trained = baseline_module.load_from_checkpoint(ckpt_path)

In [None]:
del(pre_trained)
del(model)

## 4. Testing

Now that we have trained our model, we will write the code to feed data from our test dataloader through it and collect the predictions.

In [17]:
ckpt_path = ckpt.best_model_path
best_model = baseline_module.load_from_checkpoint(ckpt_path).to('cuda')
test_image_names = list(map(lambda x: ntpath.basename(x), dm.test_image_paths))

name_labels_nn = collections.defaultdict(list)
name_scores_nn = collections.defaultdict(list)

best_model.eval()  # use running BatchNorm statistics during inference

# Disable autograd: otherwise every stored score keeps its batch's
# computation graph alive on the GPU, which can exhaust GPU memory.
with torch.no_grad():
    for batch_idx, batch in enumerate(dm.test_dataloader()):
        start_index = batch_idx * dm.batch_size
        x = batch.cuda()
        # class probabilities, moved to the CPU before being stored
        y_hat = best_model(x).softmax(dim = 1).cpu()
        preds = y_hat.argmax(dim = 1)

        for offset, (pred, score) in enumerate(zip(preds, y_hat)):
            name = test_image_names[start_index + offset]
            name_labels_nn[name].append(pred.item())
            name_scores_nn[name].append(score)

In [None]:
test_df = pd.read_csv(test_csv_path, index_col = 0)
test_df.head()

In [None]:
def get_preds(threshold=0.6, use_score=True):
    name_preds = []
    for index, row in test_df.iterrows():
        filename = row['filename']
        
        score = name_scores_nn[filename][0]  # class probabilities for this image
        curr_pred = name_labels_nn[filename][0]
        # abstain (-1) unless the predicted class's probability exceeds the threshold
        if score[curr_pred] > threshold:
            pred = curr_pred
        else:
            pred = -1                
        
        name_preds.append([filename, pred, curr_pred])
        
    preds_df = pd.DataFrame(name_preds, columns=['filename', 'pred_with_abstention', 'pred_wo_abstention'])
    
    return preds_df
        
preds_df = get_preds()
outputs_df = test_df[['filename', 'urban']].merge(preds_df, on='filename')
outputs_df = outputs_df.astype({'urban': int})
outputs_df.head()
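
The abstention rule in `get_preds` can be sketched on its own in plain Python: convert logits to probabilities with a softmax, take the most likely class, and return `-1` when its probability does not exceed the threshold. The helper names and the example logits below are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_abstention(logits, threshold=0.6):
    """Return the arg-max class, or -1 if its probability is not above threshold."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred if probs[pred] > threshold else -1


confident_poor = predict_with_abstention([2.0, 0.0])     # class 0 is clearly ahead
confident_wealthy = predict_with_abstention([0.0, 3.0])  # class 1 is clearly ahead
uncertain = predict_with_abstention([0.1, 0.0])          # nearly a tie, so abstain

print(confident_poor, confident_wealthy, uncertain)
```

Raising the threshold makes the model abstain more often, and with two classes any threshold below 0.5 never triggers an abstention.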