# CSE527 Programming Assignment 4
**Due date: 23:59 on November 26th 2021**

This semester, we will use Google Colab for the assignments, which gives us access to resources, such as GPUs, that some of us might not have on our local machines. You will need to use your Stony Brook (*.stonybrook.edu) account for coding and Google Drive to save your results.

## Google Colab Tutorial
---
Go to https://colab.research.google.com/notebooks/ and you will see a tutorial file named "Welcome to Colaboratory", where you can learn the basics of using Google Colab.

Settings used for assignments: ***Edit -> Notebook Settings -> Runtime Type (Python 3)***.


## Description
---
You can train a deep network from scratch if you have enough data (it is not always obvious whether you do). If you do not, you can instead fine-tune a pre-trained network, as in this problem.

In Problem 1, you will fine-tune a pretrained ResNet and use it to classify JPL interaction video frames.

For Problem 2, you are going to use temporal pooling/convolution to classify the video files, using a similar pretrained network as your baseline model.



There are 2 problems in this homework with a total of 120 points including 20 bonus points. Be sure to read **Submission Guidelines** below. They are important. For the problems requiring text descriptions, you might want to add a markdown block for that.

## Dataset
---
Save the dataset(click me) into your working folder in your Google Drive for this homework. <br>
Under your root folder, there should be a folder named "data" (i.e. XXX/Surname_Givenname_SBUID/data) containing the images.
**Do not upload** the data subfolder before submitting on blackboard due to size limit. There should be only one .ipynb file under your root folder Surname_Givenname_SBUID.

## Some Tutorials (PyTorch)
---
- You will be using PyTorch as the deep learning toolbox (follow the [link](http://pytorch.org) for installation).
- For PyTorch beginners, please read this [tutorial](http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) before doing your homework.
- Feel free to study more tutorials at http://pytorch.org/tutorials/.
- Find cool visualization here at http://playground.tensorflow.org.




In [None]:
# import packages here
import cv2
import numpy as np
import matplotlib.pyplot as plt
import glob
import random 
import time

import torch
import torchvision
import torchvision.transforms as transforms

from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F

In [None]:
# Mount your google drive where you've saved your assignment folder
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
# Set your working directory (in your google drive)
#   change it to your specific homework directory.
%cd '/content/gdrive/MyDrive/Kamat_SaahilSuhas_114360951_PA4'

/content/gdrive/MyDrive/Kamat_SaahilSuhas_114360951_PA4


## Problem 1 Data Preparation and Fine-tuning
## First-Person Activity Recognition: What Are They Doing to Me?
In this part of the assignment, you will implement an Activity Classifier using the JPL dataset. You will use an ImageNet pre-trained CNN that serves as a feature extractor.
## About JPL dataset
This first-person dataset contains videos of interactions between humans and the observer. We attached a GoPro2 camera to the head of our humanoid model, and asked human participants to interact with the humanoid by performing activities. In order to emulate the mobility of a real robot, we also placed wheels below the humanoid and had an operator move the humanoid by pushing it from behind. Videos were recorded continuously during human activities, where each video sequence contains 0 to 3 activities. The videos are in 320*240 resolution at 30 fps.

There are 7 different types of activities in the dataset, including :
<ol>

### 4 positive (i.e., friendly) interactions with the observer:

<li> 'Shaking hands with the observer', <li> 'hugging the observer', <li> 'petting the observer', and <li> 'waving a hand to the observer' 

### 1 neutral interaction:
<li>  the situation where two persons have a conversation about the observer while occasionally pointing it.

### 2 negative (i.e., hostile) interactions:
<li>  'Punching the observer' and <li> 'throwing objects to the observer'
</ol>
We will thus assign a label to each action, for example:


```
{
  'Shaking hands with the observer': 1, 
  'hugging the observer': 2, 
  'petting the observer': 3, 
  'waving a hand to the observer': 4,
  'the situation where two persons have a conversation about the observer while occasionally pointing it': 5,
  'Punching the observer': 6,
  'throwing objects to the observer': 7
}
```
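Note that `nn.CrossEntropyLoss` expects class indices in the range 0 to C-1, so the 1-based labels above have to be shifted before the loss is computed (the training loops later in this notebook do this with `labels - 1`). The following is a minimal sketch of that conversion. The short keys in `ACTIVITY_LABELS` and the two helper functions are illustrative names assumed here, not identifiers from the dataset.

```python
# Illustrative short names (assumed, not dataset identifiers) for the 7 activities.
ACTIVITY_LABELS = {
    "shake": 1, "hug": 2, "pet": 3, "wave": 4,
    "point": 5, "punch": 6, "throw": 7,
}

def to_class_index(label):
    """Map a 1-based activity label (1..7) to a 0-based class index (0..6)."""
    if not 1 <= label <= len(ACTIVITY_LABELS):
        raise ValueError(f"label out of range: {label}")
    return label - 1

def to_activity_label(class_index):
    """Inverse mapping, useful when reporting predictions."""
    return class_index + 1

print([to_class_index(v) for v in ACTIVITY_LABELS.values()])
```

Keeping the shift in one place makes it harder to apply it twice or to forget it when converting predictions back to activity labels.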



### Problem 1.0
## Loading the JPL dataset: 5 points
Check the segmented version from [here](https://drive.google.com/file/d/1eivyF3gPbS3ejea-NYebMBzS40xsRrqF/view?usp=sharing). 
Save the videos into your working folder in your Google Drive.
Under your root folder, there should be a folder named "data" (i.e. XXX/Surname_Givenname_SBUID/data) containing the jpl_vid directory where you should extract the jpl dataset. Do not upload the data subfolder before submitting on blackboard due to size limit. There should be only one .ipynb file under your root folder Surname_Givenname_SBUID. 
In the first part of data preparation, we will convert the videos into images. We will use either all frames or a fixed number of equally spaced frames from each video and store them as .jpg files. The data folder will then contain two directories: jpl_vid and jpl_img. **We will delete the jpl_img directory from your data folder before evaluating.**
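One way to pick equally spaced frames is to compute the frame indices first and only then seek to them in the video. The sketch below shows that index computation on its own, without OpenCV. The function name `sample_frame_indices` is a hypothetical helper, and the fixed-step rule is only one reasonable choice.

```python
def sample_frame_indices(total_frames, n_frames=-1):
    """Return the frame indices to keep from a video with total_frames frames.

    n_frames == -1 keeps every frame; otherwise n_frames indices are taken
    at a fixed integer step, starting from frame 0.
    """
    if total_frames <= 0:
        return []
    if n_frames == -1 or n_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames // n_frames
    return [i * step for i in range(n_frames)]

# A 3-second clip at 30 fps sampled down to 16 frames.
print(sample_frame_indices(90, 16))
```

With an integer step the last few frames of a clip can be skipped. Spreading the indices up to the final frame (for example with rounding) is an alternative if the end of the action matters.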


In [None]:
# !rmdir ./data/jpl_img/

In [None]:
import os

def get_frames(filename, n_frames= -1):
#--------------------------------------------------
#       Given the filename of a video, read its frames and return them
#       together with the number of frames read.
#       Example: the path /data/jpl_img/10_1/ should contain the frames from 10_1.avi
#
#       If n_frames is -1, keep all the frames of the video.
#       Otherwise keep only n_frames frames that are equally spaced across the entire video.
#       We expect you to use the cv2 library to read video frames.
#--------------------------------------------------
    vidcap = cv2.VideoCapture(filename)
    # CAP_PROP_FRAME_COUNT is returned as a float, so cast it before using it in range()
    total_frames = int(vidcap.get(cv2.CAP_PROP_FRAME_COUNT))
    if n_frames == -1 or n_frames > total_frames:
        n_frames = total_frames
    frames_step = max(total_frames // max(n_frames, 1), 1)
    frames = []
    for i in range(n_frames):
        vidcap.set(cv2.CAP_PROP_POS_FRAMES, i * frames_step)
        success, image = vidcap.read()
        if not success:
            # skip frames that could not be decoded instead of storing None
            continue
        # OpenCV decodes to BGR; convert to RGB so that store_frames can convert back
        frames.append(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    vidcap.release()
    v_len = len(frames)
    return frames, v_len

def store_frames(frames, path2store):
    # Write each RGB frame to path2store as frame<index>.jpg
    for ii, frame in enumerate(frames):
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)
        path2img = os.path.join(path2store, "frame"+str(ii)+".jpg")
        cv2.imwrite(path2img, frame)

In [None]:
import os

path2data = "./data"
sub_folder = "jpl_vid"
sub_folder_jpg = "jpl_img"
path2video = os.path.join(path2data, sub_folder)
listOfCategories = os.listdir(path2video)
# listOfCategories, len(listOfCategories)

In [None]:
n_frames = 16

In [None]:
extension = ".avi"
#--------------------------------------------------
#choose a value for n_frames below to optimize your solution. We might randomly choose n_frames while evaluating 
#--------------------------------------------------
n_frames = 16
for root, dirs, files in os.walk(path2video, topdown=False):
    for name in files:
        if extension not in name:
            continue
        path2vid = os.path.join(root, name)
        frames, vlen = get_frames(path2vid, n_frames= n_frames)
        path2store = path2vid.replace(sub_folder, sub_folder_jpg)
        path2store = path2store.replace(extension, "")
        print(path2store)
        os.makedirs(path2store, exist_ok= True)
        store_frames(frames, path2store)
    print("-"*50) 

## Training, Test and Validation set
**Training set:**
Participant 1-9


**Test set:**
Participant 10 - 12

In [None]:
def prepare_sets(path2ajpgs):
    listOfCats = os.listdir(path2ajpgs)
    train_id = []
    train_label = []
    test_id = []
    test_label = []
    for i in range(len(listOfCats)):
      x = listOfCats[i].split("_")
      if(int(x[0]) > 9):
        test_id.append(listOfCats[i])
        test_label.append(int(x[1]))
      else:
        train_id.append(listOfCats[i])
        train_label.append(int(x[1]))
      
    return train_id, train_label, test_id, test_label

In [None]:
path2jpg = os.path.join(path2data, sub_folder_jpg)

In [None]:
train_ids, train_labels, test_ids, test_labels = prepare_sets(path2jpg)

In [None]:
from torch.utils.data import Dataset, DataLoader, Subset
import glob
from PIL import Image
import torch
import numpy as np
import random
np.random.seed(2020)
random.seed(2020)
torch.manual_seed(2020)
class VideoDataset(Dataset):
    def __init__(self, ids, labels,transform):      
        self.transform = transform
        self.ids = ids
        self.labels = labels
    def __len__(self):
        return len(self.ids)
    def __getitem__(self, idx):
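        path2imgs = glob.glob(path2jpg+"/"+self.ids[idx]+"/*.jpg")
        # glob returns files in arbitrary order; sort by frame index to keep temporal order
        path2imgs = sorted(path2imgs, key=lambda p: int(os.path.basename(p)[len("frame"):-len(".jpg")]))
        path2imgs = path2imgs[:n_frames]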
        label = self.labels[idx]
        frames = []
        for p2i in path2imgs:
            frame = Image.open(p2i)
            frames.append(frame)
        
        seed = np.random.randint(1e9)        
        frames_tr = []
        for frame in frames:
            random.seed(seed)
            np.random.seed(seed)
            frame = self.transform(frame)
            frames_tr.append(frame)
        if len(frames_tr)>0:
            frames_tr = torch.stack(frames_tr)
        return frames_tr.to(device), label

### Problem 1.1
## Dataloader: 5 points
We now need to create data loaders that wrap the `VideoDataset` class we just defined and serve its items in batches.
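As a rough illustration of what batching does to the tensor shapes, the sketch below wraps a toy dataset of random "videos" in a `DataLoader`. `ToyVideoDataset` and `toy_dl` are hypothetical stand-ins with small frames so the example runs quickly; they are not the dataset used in this assignment.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyVideoDataset(Dataset):
    """Hypothetical stand-in: each item is (frames [T, C, H, W], label in 1..7)."""
    def __init__(self, n_videos=10, n_frames=16, size=(3, 24, 32)):
        self.videos = torch.rand(n_videos, n_frames, *size)
        self.labels = torch.randint(1, 8, (n_videos,))

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        return self.videos[idx], int(self.labels[idx])

# The default collate function stacks items along a new leading batch axis.
toy_dl = DataLoader(ToyVideoDataset(), batch_size=4, shuffle=True)
xb, yb = next(iter(toy_dl))
print(xb.shape, yb.shape)
```

Each batch therefore has the layout `[batch, frames, channels, height, width]`. Passing `shuffle=True` for the training loader is a common choice, shown here as an option rather than a requirement of the assignment.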

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = "cpu"
train_ds = VideoDataset(ids= train_ids, labels= train_labels,transform=transforms.ToTensor())
print(len(train_ds))

63


In [None]:
test_ds = VideoDataset(ids= test_ids, labels= test_labels, transform= transforms.ToTensor())
print(len(test_ds))

21


In [None]:
#--------------------------------------------------
#create dataloader for all the datasets(train and test) 
#--------------------------------------------------

batch_size = 8
train_dl = DataLoader(train_ds,batch_size=batch_size)
test_dl = DataLoader(test_ds,batch_size=batch_size)
dataloaders = {'train':train_dl,'test':test_dl}
data_siz = {'train':len(train_ds),'test':len(test_ds)}
print(data_siz)


{'train': 63, 'test': 21}


In [None]:
for xb,yb in train_dl:
    print(xb.shape, yb.shape)
    break

torch.Size([8, 16, 3, 240, 320]) torch.Size([8])


## Problem 1.2: Fine Tuning a Pre-Trained Deep Network: 40 points
The representations learned by deep convolutional networks generalize surprisingly well to other recognition tasks. 

But how do we use an existing deep network for a new recognition task? Take for instance,  [ResNet network](https://pytorch.org/docs/stable/_modules/torchvision/models/resnet.html) [(paper)](https://arxiv.org/abs/1512.03385).


**Hints**:
- Many pre-trained models are available in PyTorch at [here](http://pytorch.org/docs/master/torchvision/models.html).
- For fine-tuning pretrained network using PyTorch, please read this [tutorial](http://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html).

### Problem 1.2.1

## *Fine-tune* an existing network: 30 points
 In this scenario you take an existing network, replace the final layer (or more) with random weights, and train the entire network again with images and ground truth labels for your recognition task. You are effectively treating the pre-trained deep network as a better initialization than the random weights used when training from scratch. When you don't have enough training data to train a complex network from scratch (e.g. with the 7 classes) this is an attractive option. In [this paper](http://www.cc.gatech.edu/~hays/papers/deep_geo.pdf) from CVPR 2015, there wasn't enough data to train a deep network from scratch, but fine tuning led to 4 times higher accuracy than using off-the-shelf networks directly.
 You are required to implement the above strategy to fine-tune a pre-trained **ResNet** for this video frame classification task with 7 classes.



In [None]:
def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'test']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = torch.swapaxes(inputs,1,2)
                inputs = inputs.to(device)
                labels = labels-1
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    # print(preds)
                    # print(labels)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            if phase == 'train':
                scheduler.step()

            epoch_loss = running_loss / data_siz[phase]
            epoch_acc = running_corrects.double() / data_siz[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'test' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

### Problem 1.2.2
### Training and Testing your fine-tuned Network: 10 points
You will fine-tune your network using every frame in the video as a sample with the class label. Use train_dl and test_dl and feed it to your fine-tuned network. Please provide detailed descriptions of:<br>
(1) which layers of Resnet have been replaced<br>
(2) the architecture of the new layers added including activation methods <br>
(3) the final accuracy on test set <br>
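Treating every frame as its own sample amounts to folding the time axis into the batch axis and repeating each video's label once per frame. A minimal sketch, assuming batches shaped `[B, T, C, H, W]`; `frames_as_samples` is a hypothetical helper name:

```python
import torch

def frames_as_samples(videos, labels):
    """Flatten [B, T, C, H, W] videos into [B*T, C, H, W] frames and
    repeat each video's label once per frame."""
    t = videos.shape[1]
    frames = videos.flatten(0, 1)                # [B*T, C, H, W]
    frame_labels = labels.repeat_interleave(t)   # [B*T]
    return frames, frame_labels

videos = torch.rand(2, 16, 3, 24, 32)
labels = torch.tensor([3, 5])
frames, frame_labels = frames_as_samples(videos, labels)
print(frames.shape, frame_labels.shape)
```

When accuracy is computed per frame, the dataset size used for normalisation should be the number of videos times the number of frames per video.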

In [None]:
#--------------------------------------------------
#       Fine-Tune Pretrained Network
#--------------------------------------------------
from torchvision import datasets, models, transforms
import torch.optim as optim
from torch.optim import lr_scheduler
import copy

def set_parameter_requires_grad(model, feature_extracting):
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False


num_classes = 7

model_ft = models.video.r3d_18(pretrained=True)
model_ft.to(device)
set_parameter_requires_grad(model_ft,True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, num_classes).to(device)
input_size = 224


criterion = nn.CrossEntropyLoss()

# All parameters are passed to the optimizer, but the backbone was frozen above,
# so only the new fc layer (requires_grad=True) is actually updated
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.005, momentum=0.9)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler,
                       num_epochs=25)




Downloading: "https://download.pytorch.org/models/r3d_18-b3b3357e.pth" to /root/.cache/torch/hub/checkpoints/r3d_18-b3b3357e.pth



Epoch 0/24
----------
train Loss: 2.0468 Acc: 0.1429
test Loss: 1.8197 Acc: 0.3333

Epoch 1/24
----------
train Loss: 1.8369 Acc: 0.2698
test Loss: 1.5515 Acc: 0.4286

Epoch 2/24
----------
train Loss: 1.4682 Acc: 0.6190
test Loss: 1.3637 Acc: 0.5714

Epoch 3/24
----------
train Loss: 1.2647 Acc: 0.7460
test Loss: 1.1806 Acc: 0.7619

Epoch 4/24
----------
train Loss: 1.0504 Acc: 0.8254
test Loss: 1.1075 Acc: 0.7143

Epoch 5/24
----------
train Loss: 0.9323 Acc: 0.8413
test Loss: 1.0213 Acc: 0.6667

Epoch 6/24
----------
train Loss: 0.8097 Acc: 0.8889
test Loss: 0.9819 Acc: 0.6667

Epoch 7/24
----------
train Loss: 0.7128 Acc: 0.9524
test Loss: 0.9716 Acc: 0.7143

Epoch 8/24
----------
train Loss: 0.7038 Acc: 0.9524
test Loss: 0.9606 Acc: 0.7143

Epoch 9/24
----------
train Loss: 0.6957 Acc: 0.9683
test Loss: 0.9546 Acc: 0.7143

Epoch 10/24
----------
train Loss: 0.6882 Acc: 0.9683
test Loss: 0.9507 Acc: 0.7143

Epoch 11/24
----------
train Loss: 0.6812 Acc: 0.9683
test Loss: 0.9475 Acc

## Problem 2: Video Classification
### Previous Implementation
This dataset was released as a part of the [paper](http://michaelryoo.com/papers/cvpr2013_ryoo.pdf) in CVPR 2013. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. As stated in the paper, *We first introduce video features designed to capture global motion (Subsection 3.1) and local motion (Subsection 3.2) observed during humans’ various interactions with the observer. Next, in Subsection 3.3, we cluster features to form visual words and obtain histogram representations. In Subsection 3.4, multi-channel kernels are described.* These features were prepared as input to the SVM.

### Using CNNs
In this approach to video classification, we use an image classifier on every single frame of the video. We then have to merge the feature vectors obtained per frame using a fusion layer. This needs to be built into the network itself. A fusion layer is used to merge the output of separate networks that operate on temporally distant frames. It is normally implemented using max pooling, average pooling, or flattening. We then define a fully connected layer to provide the output.



### Problem 2.1
### Temporal Pooling: 20 points
As suggested in this [paper](https://arxiv.org/abs/1503.08909), we position the temporal pooling layer right before the first fully connected layer as illustrated. This layer performs either mean-pooling or max-pooling across all video frames. The structure of the CNN component is identical to the single-frame model. This network is able to collect all the spatial features in a given time window. However, the order of the temporal events is lost due to the nature of pooling across frames.
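Pooling across frames can be sketched as a reduction over the time axis of a `[B, T, D]` tensor of per-frame features. The snippet below is one possible formulation, not necessarily the implementation expected for this problem; `pool_over_time` is a hypothetical name and the feature size 4096 is only an example.

```python
import torch

def pool_over_time(features, mode="max"):
    """Reduce per-frame features [B, T, D] to one vector per video [B, D]."""
    if mode == "max":
        return features.max(dim=1).values
    if mode == "mean":
        return features.mean(dim=1)
    raise ValueError(f"unknown mode: {mode}")

feats = torch.rand(8, 16, 4096)
print(pool_over_time(feats, "max").shape, pool_over_time(feats, "mean").shape)

# Shuffling the frames leaves the max-pooled result unchanged,
# which is why the temporal order is lost.
perm = torch.randperm(feats.shape[1])
print(torch.equal(pool_over_time(feats[:, perm]), pool_over_time(feats)))
```

Because the reduction touches only the time axis, the feature dimension D is preserved and the pooled vector can be fed directly into a fully connected layer.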


In [None]:
#--------------------------------------------------
#       Utilities for Temporal pooling
#--------------------------------------------------

def temporal_pooling(feature_vectors, pooling_shape):
  ##########--WRITE YOUR CODE HERE--##########
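  # feature_vectors: [batch, n_frames, feature_dim]. MaxPool2d treats the first
  # axis as channels, so a (n_frames, k) window takes the max over all frames
  # and over k adjacent feature entries, giving [batch, 1, feature_dim - k + 1].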
  maxpool = nn.MaxPool2d(pooling_shape, stride=(1, 1))
  feature_vectors_pooled = maxpool(feature_vectors)

  ##########-------END OF CODE-------##########
  return feature_vectors_pooled



In [None]:
#above method is used after the feature extraction part below

### Problem 2.2
### Network Definition: 20 points
### Feature Extraction using an ImageNet pre-trained CNN 
Use the fine-tuned ResNet model from Part 1 to extract features from every video frame.

Here, a new AlexNet is fine-tuned on individual 2D frames and then used for feature extraction.



In [None]:
def train_model_for_imgs(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'test']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                lim = min(batch_size,len(inputs))
                inp = [inputs[i][j] for i in range(lim) for j in range(n_frames)]
                lab = [labels[i] for i in range(lim) for j in range(n_frames)]
                inputs = torch.stack(inp)
                labels = torch.stack(lab)
                # print(inputs.shape)
                # print(labels.shape)
                inputs = inputs.to(device)
                labels = labels-1
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    # print(preds)
                    # print(labels)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            if phase == 'train':
                scheduler.step()

            epoch_loss = running_loss / data_siz[phase]
            epoch_acc = running_corrects.double() / data_siz[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'test' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

In [None]:
#--------------------------------------------------
#       Fine-Tune Pretrained Network
#--------------------------------------------------
from torchvision import datasets, models, transforms
import torch.optim as optim
from torch.optim import lr_scheduler
import copy

num_classes = 7

def removeLayer(alexnet):    
    temp = nn.Sequential(*list(alexnet.classifier.children())[:-1])
    alexnet.classifier = temp
    return alexnet

model_ft_img = models.alexnet(pretrained=True)
# for param in model_ft_img.parameters():
#     param.requires_grad = False




# model_ft_img = removeLayer(model_ft_img)
# print(model_ft_img)



num_ftrs = model_ft_img.classifier[6].in_features
layers = list(model_ft_img.classifier.children())[:-1]
layers.extend([nn.Linear(num_ftrs,num_classes)])
model_ft_img.classifier = nn.Sequential(*layers)
model_ft_img.to(device)
print(model_ft_img.classifier)
# input_size = 224


criterion = nn.CrossEntropyLoss()

# Observe that all parameters are being optimized
optimizer_ft = optim.SGD(model_ft_img.parameters(), lr=0.001, momentum=0.9)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

data_siz = {'train':len(train_ds)*n_frames,'test':len(test_ds)*n_frames}
model_ft_img = train_model_for_imgs(model_ft_img, criterion, optimizer_ft, exp_lr_scheduler,
                       num_epochs=15)




Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /root/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth



Sequential(
  (0): Dropout(p=0.5, inplace=False)
  (1): Linear(in_features=9216, out_features=4096, bias=True)
  (2): ReLU(inplace=True)
  (3): Dropout(p=0.5, inplace=False)
  (4): Linear(in_features=4096, out_features=4096, bias=True)
  (5): ReLU(inplace=True)
  (6): Linear(in_features=4096, out_features=7, bias=True)
)
Epoch 0/14
----------
train Loss: 2.0488 Acc: 0.1151
test Loss: 1.6601 Acc: 0.3958

Epoch 1/14
----------
train Loss: 1.5446 Acc: 0.4395
test Loss: 1.3645 Acc: 0.4970

Epoch 2/14
----------
train Loss: 1.0445 Acc: 0.6290
test Loss: 1.2395 Acc: 0.3899

Epoch 3/14
----------
train Loss: 0.7472 Acc: 0.7421
test Loss: 1.2000 Acc: 0.4554

Epoch 4/14
----------
train Loss: 0.5327 Acc: 0.8204
test Loss: 1.2453 Acc: 0.4613

Epoch 5/14
----------
train Loss: 0.3778 Acc: 0.8780
test Loss: 1.2835 Acc: 0.5238

Epoch 6/14
----------
train Loss: 0.2851 Acc: 0.9067
test Loss: 1.2981 Acc: 0.5744

Epoch 7/14
----------
train Loss: 0.2245 Acc: 0.9365
test Loss: 1.2713 Acc: 0.5952

Epoch

In [None]:
import pickle
pickle.dump(model_ft_img, open("/content/gdrive/MyDrive/Kamat_SaahilSuhas_114360951_PA4/model_trained_on_frames.pkl", "wb"))

In [None]:
import pickle
with open("/content/gdrive/MyDrive/Kamat_SaahilSuhas_114360951_PA4/model_trained_on_frames.pkl", "rb") as f:
  model_ft_img = pickle.load(f)

In [None]:
def removeLayer(alexnet):    
    temp = nn.Sequential(*list(alexnet.classifier.children())[:-1])
    alexnet.classifier = temp
    return alexnet

In [None]:
alexnet = removeLayer(model_ft_img)
alexnet.eval()  # disable dropout so the extracted features are deterministic
train_features = []
test_features = []
vidp = []
# no gradients are needed for feature extraction
with torch.no_grad():
  for inputs, labels in dataloaders['train']:
    out = []
    for i in range(min(len(inputs),batch_size)):
      # inputs[i] holds the frames of one video: [n_frames, 3, H, W]
      out.append(alexnet(inputs[i]))
    out = torch.stack(out)
    train_features.append(temporal_pooling(out,(16,3)))

  for inputs, labels in dataloaders['test']:
    out = []
    for i in range(min(len(inputs),batch_size)):
      out.append(alexnet(inputs[i]))
    out = torch.stack(out)
    test_features.append(temporal_pooling(out,(16,3)))

# temporalp = 
# print(temporalp.shape)
# train_features = torch.stack(train_features)
# vidp = temporal_pooling(train_features,(3,3))

# print(vidp.shape)
train_features = torch.cat(train_features,dim=0)
test_features = torch.cat(test_features,dim=0)

train_features = torch.squeeze(train_features)
test_features = torch.squeeze(test_features)
print(train_features.shape)
print(test_features.shape)

torch.Size([63, 4094])
torch.Size([21, 4094])


### Problem 2.2
**Train and Test:10 points**



Training and testing SVC classifier on the features extracted using alexnet.

In [None]:
from sklearn import svm
from sklearn.svm import SVC



#14 - 3.445
#C=11.0115
train_SVC = SVC(C=10,max_iter=5000)
cpu = 'cpu'
train_data = train_features.to(cpu)
train_data = train_data.data.numpy()
labels = np.array(train_labels)
print(labels)
train_SVC.fit(train_data,labels)

[7 5 2 5 7 4 4 3 7 5 3 2 6 6 2 1 3 3 7 1 6 4 5 7 1 3 5 1 3 4 2 1 1 7 3 5 4
 1 2 4 6 2 4 6 7 2 5 7 1 2 3 5 7 6 6 6 4 6 4 1 3 5 2]


SVC(C=10, max_iter=5000)

In [None]:
test_data = test_features.to(cpu)
test_data = test_data.data.numpy()
test_lab = np.array(test_labels)
test_label_pred_SVC = train_SVC.predict(test_data)
  
#Evaluation

#The prediction is test_label_pred
print("Predicted:\t\t" + str(test_label_pred_SVC))
print("Ground truth labels:\t" + str(test_lab))
accuracy = sum(np.array(test_label_pred_SVC) == test_lab) / float(len(test_lab))
print("The accuracy of SVM model on temporal pooled features is {:.2f}%".format(accuracy*100))


Predicted:		[5 1 4 6 7 6 2 5 3 5 6 4 1 2 5 2 5 6 4 1 6]
Ground truth labels:	[5 1 4 3 7 6 2 5 3 7 6 4 1 3 5 2 7 6 4 1 2]
The accuracy of SVM model on temporal pooled features is 76.19%


### Problem 2.2
**Fusion based implementation: 20 bonus points**

--->Implemented late pooling

In [None]:
#--------------------------------------------------
#Define your  Vid_Classifier
#you may add extra parameters here.
#remember to define :
  # a base model which is an ImageNet pre-trained CNN : Extract One feature vector per frame
  # a max pooling layer that finds the maximum feature map over a local temporal neighborhood
  # a fully connected layer to unify the feature maps
#You may also want to include other parameters for your module
#if you are using a knn classifer, please indicate it well with your code.
#--------------------------------------------------
  
from torchvision import models
from torch import nn
class Vid_Classifier(nn.Module):
    def __init__(self, params_model):
        super(Vid_Classifier, self).__init__()
        num_classes = params_model["num_classes"]
        dr_rate= params_model["dr_rate"] #drop out rate
        pretrained = params_model["pretrained"]
        #--------------------------------------------------
        #Your code here
        #late pooling
        resnet = models.resnet18(pretrained)
        for param in resnet.parameters():
            param.requires_grad = False
        self.resnet = resnet
        self.fc1 = nn.Linear(1000, 800)
        self.fc2 = nn.Linear(800,612)
        # temporal_pooling with a (16, 3) window shrinks the 612 features to 612 - 3 + 1 = 610
        self.fc3 = nn.Linear(610,num_classes)
        #--------------------------------------------------
             
    def forward(self, x):
        #--------------------------------------------------
        #Your code here
        #--------------------------------------------------
        b_z, ts, c, h, w = x.shape

        out = []
        #late pooling - passing frames for each video through resnet and 2 fully connected layers
        for i in range(len(x)):
          im = self.resnet(x[i])
          im = self.fc1(im)
          im = self.fc2(im)
          out.append(im)
        out = torch.stack(out)
        out = temporal_pooling(out,(16,3))
        out = torch.squeeze(out)
        out = self.fc3(out)
        return out





In [None]:
num_classes = 7

In [None]:
params_model={
        "num_classes": num_classes,
        "dr_rate": 0.1,
        "pretrained" : True,}
model = Vid_Classifier(params_model) 

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum = 0.9)

In [None]:
num_epochs = 25
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(dataloaders['train']):
        labels-=1
        labels = labels.to(device)
        outputs = model(images)
        # print(outputs)
        loss = criterion(outputs, labels)

        # Backprop and perform optimisation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Track the accuracy
        total = labels.size(0)
        _, predicted = torch.max(outputs.data, 1)
        correct = (predicted == labels).sum().item()

        if(i+1)%8==0:
          print('Epoch [{}/{}], Loss: {:.4f}, Accuracy: {:.2f}%'
                  .format(epoch + 1, num_epochs, loss.item(),
                          (correct / total) * 100))

Epoch [1/25], Loss: 2.0151, Accuracy: 14.29%
Epoch [2/25], Loss: 1.7441, Accuracy: 42.86%
Epoch [3/25], Loss: 1.5363, Accuracy: 85.71%
Epoch [4/25], Loss: 1.3783, Accuracy: 100.00%
Epoch [5/25], Loss: 1.2065, Accuracy: 100.00%
Epoch [6/25], Loss: 1.0508, Accuracy: 100.00%
Epoch [7/25], Loss: 0.9330, Accuracy: 100.00%
Epoch [8/25], Loss: 0.8266, Accuracy: 100.00%
Epoch [9/25], Loss: 0.7370, Accuracy: 100.00%
Epoch [10/25], Loss: 0.6620, Accuracy: 100.00%
Epoch [11/25], Loss: 0.5955, Accuracy: 100.00%
Epoch [12/25], Loss: 0.5407, Accuracy: 100.00%
Epoch [13/25], Loss: 0.4909, Accuracy: 100.00%
Epoch [14/25], Loss: 0.4449, Accuracy: 100.00%
Epoch [15/25], Loss: 0.4077, Accuracy: 100.00%
Epoch [16/25], Loss: 0.3743, Accuracy: 100.00%
Epoch [17/25], Loss: 0.3422, Accuracy: 100.00%
Epoch [18/25], Loss: 0.3150, Accuracy: 100.00%
Epoch [19/25], Loss: 0.2903, Accuracy: 100.00%
Epoch [20/25], Loss: 0.2665, Accuracy: 100.00%
Epoch [21/25], Loss: 0.2475, Accuracy: 100.00%
Epoch [22/25], Loss: 0.22

In [None]:
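correct = 0
total = 0
model.eval()  # use running BatchNorm statistics instead of per-batch statistics
with torch.no_grad():  # no gradients are needed for evaluation
    for images, labels in dataloaders['test']:
        labels = (labels - 1).to(device)  # shift labels from 1..7 to 0..6
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print('Accuracy on test set: {:.2f}%'.format((correct/total)*100))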

Accuracy on test set: 42.86%


**Answer**:

Accuracy on test set: 42.86%

In [None]:
path2weights = "/content/gdrive/MyDrive/Kamat_SaahilSuhas_114360951_PA4/fusion_cnn_weights.pt"
torch.save(model.state_dict(), path2weights)

In [None]:
#define your training function

## Submission guidelines
---
You need to submit a single zip file to Blackboard as described below.

Please generate a pdf file that includes a ***google shared link*** (explained in the next paragraph). This pdf file should be named as ***Surname_Givenname_SBUID_pa*\*.pdf** (example: Jordan_Michael_111234567_pa4.pdf for this assignment).

To generate the ***google shared link***, first create a folder named ***Surname_Givenname_SBUID_pa**** in your Google Drive with your Stony Brook account. The structure of the files in the folder should be exactly the same as the one you downloaded. For instance in this homework:

```
Surname_Givenname_SBUID_pa4
        |---data
        |---CSE527-PA4-fall21.ipynb
```
Note that this folder should be in your Google Drive with your Stony Brook account.

Then right-click this folder, click ***Get shareable link***, and in the People text field enter the TAs' emails: ***bjha@cs.stonybrook.edu***, ***li.wenchen@stonybrook.edu***, ***yifeng.huang@stonybrook.edu***. Make sure that TAs who have the link **can edit**, ***not just*** **can view**, and also **UNCHECK** the **Notify people** box.

Note that in google colab, we will only grade the version of the code right before the timestamp of the submission made in blackboard. 

To submit to Blackboard, zip ***Surname_Givenname_SBUID_pa*\*.pdf** and ***Surname_Givenname_SBUID_pa**** folder together and name your zip file as ***Surname_Givenname_SBUID_pa*\*.zip**. 

**DO NOT upload the datasets to Blackboard.**

The input and output paths are predefined and **DO NOT** change them, (we assume that 'Surname_Givenname_SBUID_pa4' is your working directory, and all the paths are relative to this directory).  The image read and write functions are already written for you. All you need to do is to fill in the blanks as indicated to generate proper outputs.


-- DO NOT change the folder structure, please just fill in the blanks. <br>

You are encouraged to post and answer questions on Piazza. Based on the amount of email that we have received in past years, there might be delays in replying to personal emails. Please ask questions on Piazza and send emails only for personal issues.

If you alter the folder structures, the grading of your homework will be significantly delayed and possibly penalized.

Be aware that your code will undergo plagiarism check both vertically and horizontally. Please do your own work.

<!--Write your report here in markdown or html-->
