# Action Recognition @ UCF101  
**Due date: 11:59 pm on Nov. 19, 2019 (Tuesday)**

## Description
---
In this homework, you will perform action recognition using a Recurrent Neural Network (RNN), specifically a Long Short-Term Memory (LSTM) network. You will be given a dataset called UCF101, which consists of 101 different actions/classes, and for each action there will be 145 samples. We tagged each sample as either training or testing. Each sample is supposed to be a short video, but we sampled 25 frames from each video to reduce the amount of data. Consequently, a training sample is an image tuple that forms a 3D volume, with one dimension encoding the *temporal correlation* between frames, together with a label indicating which action it is.

To tackle this problem, we aim to build a neural network that captures not only the spatial information of each frame but also the temporal information between frames. Fortunately, you don't have to do this on your own. An RNN, a type of neural network designed to deal with time-series data, is right here for you to use. In particular, you will be using an LSTM for this task.

Instead of training an end-to-end neural network from scratch whose computation is prohibitively expensive, we divide this into two steps: feature extraction and modelling. Below are the things you need to implement for this homework:
- **{35 pts} Feature extraction**. Use any of the [pre-trained models](https://pytorch.org/docs/stable/torchvision/models.html) to extract features from each frame. We recommend not using the activations of the last layer, because features tend to become task-specific towards the end of the network.
    **hints**: 
    - A good starting point is a pre-trained VGG16 network (`torchvision.models.vgg16`). We suggest using the output of its first fully connected layer (4096-dim) as the features of each video frame. This results in a 4096x25 matrix for each video.
    - Normalize your images using `torchvision.transforms` 
    ```
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    prep = transforms.Compose([ transforms.ToTensor(), normalize ])
    prep(img)
    # The mean and std above are specific to ImageNet data.
    
    ```
    More details of image preprocessing in PyTorch can be found at http://pytorch.org/tutorials/beginner/data_loading_tutorial.html
    
- **{35 pts} Modelling**. With the extracted features, build an LSTM network which takes a **dx25** sample as input (where **d** is the dimension of the extracted feature for each frame), and outputs the action label of that sample.
- **{20 pts} Evaluation**. After training your network, you need to evaluate your model on the testing data by computing the prediction accuracy **(5 points)**. The baseline test accuracy for this data is 75%, and **10 points** out of 20 are for achieving a test accuracy greater than the baseline. Moreover, you need to compare **(5 points)** the result of your network with that of a support vector machine (SVM) (stack the **dx25** feature matrix into a long vector and train an SVM on it).
- **{10 pts} Report**. Details regarding the report can be found in the submission section below.
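
The SVM baseline in the evaluation item needs one long vector per video. The following is a minimal sketch of that stacking step, assuming each video's features are stored as a (frames, d) array; the helper name `stack_features` and the toy dimensions are illustrative only and not part of the assignment.

```python
import numpy as np

def stack_features(videos):
    """Flatten each (frames, d) feature matrix into one row of length frames * d."""
    return np.stack([np.asarray(v).reshape(-1) for v in videos])

# Toy data: 3 hypothetical videos, 25 frames each, d = 4 instead of 4096.
rng = np.random.default_rng(0)
videos = [rng.standard_normal((25, 4)) for _ in range(3)]

X = stack_features(videos)
print(X.shape)  # (3, 100)
```

Each row of `X` can then be passed to a linear classifier as a single sample.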

Note that the raw images are 256x340, whereas your pre-trained model might take **nxn** images as input. Instead of resizing the images, which distorts the aspect ratio, we take a better approach: crop five **nxn** images, one at the image center and four at the corners, compute the **d**-dim features for each of them, and average these five **d**-dim features to get the final feature representation of the raw image.
For example, VGG takes 224x224 images as inputs, so we take the five 224x224 croppings of the image, compute 4096-dim VGG features for each of them, and then take the mean of these five 4096-dim vectors to be the representation of the image.
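
To make the cropping arithmetic concrete, here is a small sketch that computes the five crop boxes for a 256x340 image and averages a set of feature vectors. The function names `five_crop_boxes` and `mean_feature` are hypothetical helpers written for this illustration, and plain Python lists stand in for real feature tensors.

```python
def five_crop_boxes(height, width, n):
    """Return (top, left, bottom, right) boxes: four corners, then the center."""
    if n > height or n > width:
        raise ValueError("crop size exceeds image size")
    offsets = [
        (0, 0),                                # top-left
        (0, width - n),                        # top-right
        (height - n, 0),                       # bottom-left
        (height - n, width - n),               # bottom-right
        ((height - n) // 2, (width - n) // 2), # center
    ]
    return [(t, l, t + n, l + n) for t, l in offsets]

def mean_feature(vectors):
    """Element-wise mean of equally long feature vectors."""
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]

boxes = five_crop_boxes(256, 340, 224)
for box in boxes:
    print(box)
# (0, 0, 224, 224)
# (0, 116, 224, 340)
# (32, 0, 256, 224)
# (32, 116, 256, 340)
# (16, 58, 240, 282)
```

In the real pipeline each box would be cropped from the image, passed through the network, and the five resulting **d**-dim vectors averaged as `mean_feature` does.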

In order to save you computational time, you need to do the classification task only for **the first 25** classes of the whole dataset. The same applies to those who have access to GPUs. **Bonus 10 points for running and reporting on the entire 101 classes.**


## Dataset
Download the **dataset** from [UCF101](http://vision.cs.stonybrook.edu/~yangwang/public/UCF101_images.tar) (image data for each video). The **annos folder**, which contains the video labels and the label-to-class-name mapping, is included in the uploaded assignment folder.


UCF101 dataset contains 101 actions and 13,320 videos in total.  

+ `annos/actions.txt`  
  + lists all the actions (`ApplyEyeMakeup`, .., `YoYo`)   
  
+ `annos/videos_labels_subsets.txt`  
  + lists all the videos (`v_000001`, .., `v_013320`)  
  + labels (`1`, .., `101`)  
  + subsets (`1` for train, `2` for test)  

+ `images/`  
  + each folder represents a video
  + the video/folder name to class mapping can be found in `annos/videos_labels_subsets.txt`, e.g. `v_000001` belongs to class 1, i.e. `ApplyEyeMakeup`
  + each video folder contains 25 frames  
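
As an illustration of the annotation layout described above, the sketch below parses a few rows into zero-based labels and a train/test flag. It assumes a tab-separated file with the columns in the order video name, label, subset; the two sample rows and the helper name `read_annotations` are made up for this example.

```python
import csv
import io

# Made-up sample rows: video name, label (1..101), subset (1 = train, 2 = test).
sample = "v_000001\t1\t1\nv_000002\t1\t2\n"

def read_annotations(handle, max_label=None):
    """Parse annotation rows; optionally keep only labels <= max_label."""
    rows = []
    for name, label, subset in csv.reader(handle, delimiter="\t"):
        label, subset = int(label), int(subset)
        if max_label is not None and label > max_label:
            continue
        # Shift labels to start at 0, as most loss functions expect.
        rows.append({"video": name, "label": label - 1, "is_train": subset == 1})
    return rows

rows = read_annotations(io.StringIO(sample), max_label=25)
print(rows[0])
```

Passing `max_label=25` restricts the data to the first 25 classes, as required for the non-bonus part.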



## Some Tutorials
- Good materials for understanding RNN and LSTM
    - http://blog.echen.me
    - http://karpathy.github.io/2015/05/21/rnn-effectiveness/
    - http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Implementing RNN and LSTM with PyTorch
    - [LSTM with PyTorch](http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py)
    - [RNN with PyTorch](http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
cd '/content/gdrive/My Drive/Sarnot_SurabhiSantosh_112584690_hw5'

/content/gdrive/My Drive/Sarnot_SurabhiSantosh_112584690_hw5


In [0]:
# import tarfile

# tar = tarfile.open('UCF101_images.tar', "r:")
# tar.extractall()
# tar.close()

In [0]:
from torchvision import models
from torchvision import transforms
import scipy.io
from random import shuffle
import copy
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision

import numpy as np
import scipy as sp
import pandas as pd
import cv2
from scipy.io import savemat
import os
import glob
import pdb

---
---
## **Problem 1.** Feature extraction

In [0]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [0]:
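# *write your codes for feature extraction (You can use multiple cells, this is just a place holder)
vgg16 = models.vgg16(pretrained=True)
vgg16 = vgg16.to(device)
# Keep classifier[0] (Linear 25088 -> 4096) and classifier[1] (ReLU); drop the later layers.
del vgg16.classifier[2:]
vgg16.eval()  # inference mode for feature extraction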



In [0]:
anno = pd.read_csv("annos/videos_labels_subsets.txt", header=None, sep='\t')

In [0]:
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
prep = transforms.Compose([ transforms.ToTensor(), normalize ])

In [0]:
def crop_img(img, size=(224, 224)):
    # img is a (C, H, W) tensor; returns the four corner crops and the center crop.
    h, w = img.shape[1], img.shape[2]
    ch, cw = size
    top, left = (h - ch) // 2, (w - cw) // 2
    return (img[:, :ch, :cw],                      # top-left
            img[:, h - ch:, :cw],                  # bottom-left
            img[:, :ch, w - cw:],                  # top-right
            img[:, h - ch:, w - cw:],              # bottom-right
            img[:, top:top + ch, left:left + cw])  # center

In [0]:
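class1folder = 'UCF101_images/'
targetfolder = 'UCF101_target/'
os.makedirs(targetfolder, exist_ok=True)

for _, line in anno.iterrows():
    d = line[0]  # video/folder name, e.g. v_000001
    video_dir = os.path.join(class1folder, d)
    # The original cell tested `d in l`, but `l` is never defined in this notebook;
    # checking that the frame folder exists is an assumed stand-in.
    if not os.path.isdir(video_dir):
        continue

    features = []
    # Sort the file names so the frame order is deterministic.
    for f in sorted(os.listdir(video_dir)):
        # cv2.imread returns BGR, while the ImageNet mean/std above are in RGB order.
        img = cv2.cvtColor(cv2.imread(os.path.join(video_dir, f)), cv2.COLOR_BGR2RGB)
        img = prep(img)
        with torch.no_grad():
            # Five crops -> (5, 4096) features -> mean over the crops -> (4096,)
            out = vgg16(torch.stack(crop_img(img)).to(device))
        features.append(out.mean(dim=0).cpu())
    # One (frames, 4096) matrix per video
    savemat(os.path.join(targetfolder, d + '.mat'), {'Feature': torch.stack(features).numpy()})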

In [0]:
anno = pd.read_csv("annos/videos_labels_subsets.txt", header=None, sep='\t')
anno = anno[anno[1]<=25]

In [0]:
train_data = []
train_labels = []
test_data = []
test_labels = []

count = 0
for _, line in anno.iterrows():
    val = torch.tensor(line[1]-1)
    dat = torch.tensor(sp.io.loadmat(os.path.join(targetfolder, line[0]+'.mat'))['Feature'])
    if line[2]==1:
        train_data.append(dat)
        train_labels.append(val)
    else:
        test_data.append(dat)
        test_labels.append(val)

***
***
## **Problem 2.** Modelling

* ##### **Print the size of your training and test data**

In [39]:
# Don't hardcode the shape of train and test data
print('Size of training data is :', len(train_data), " Each of size:", train_data[0].shape)
print('Size of test/validation data is :', len(test_data), " Each of size:", test_data[0].shape)
print('Size of total data is :', len(test_data) + len(train_data))

Size of training data is : 2409  Each of size: torch.Size([25, 4096])
Size of test/validation data is : 951  Each of size: torch.Size([25, 4096])
Size of total data is : 3360


In [0]:
# Creating batches for data

def create_batch_data(batch_num, data, labels):
  # Groups samples into full batches of `batch_num`; a trailing partial batch is dropped.
  # Each (25, 4096) feature matrix is flattened, so a batch has shape (batch_num, 1, 25*4096).
  batch_data = []
  batch_labels = []

  for i in range(len(data) // batch_num):
      minibatch_d = data[i*batch_num: (i+1)*batch_num]
      minibatch_d = [t.numpy() for t in minibatch_d]
      minibatch_d = np.reshape(minibatch_d, (batch_num, 1, 25*4096))
      batch_data.append(torch.from_numpy(minibatch_d))

      minibatch_l = labels[i*batch_num: (i+1)*batch_num]
      batch_labels.append(torch.LongTensor(minibatch_l))
  return batch_data, batch_labels
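
Because `create_batch_data` only builds full batches, some samples can be left out. The sketch below reproduces just the index arithmetic for the 951 test samples printed earlier and a batch size of 8; `batch_slices` is a hypothetical helper, not part of the notebook.

```python
def batch_slices(num_samples, batch_size):
    """Yield (start, stop) index pairs for full batches only."""
    for i in range(num_samples // batch_size):
        yield i * batch_size, (i + 1) * batch_size

slices = list(batch_slices(951, 8))
used = slices[-1][1]
print(len(slices), used, 951 - used)  # 118 944 7
```

With these numbers, an accuracy computed over such batches covers 944 of the 951 test samples.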

In [0]:
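# *write your codes for modelling using the extracted feature (You can use multiple cells, this is just a place holder)
# LSTM model

class video_net(nn.Module):
    def __init__(self, hidden_dim, batch_size, label_size):
        super(video_net, self).__init__()
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        # Input size 102400 = 25 frames x 4096 features (one flattened video per step)
        self.lstm = nn.LSTM(102400, hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, label_size)

    def forward(self, input_features):
        # nn.LSTM defaults to batch_first=False, so this view is read as
        # (seq_len=batch_size, batch=1, 102400): the videos of a batch are consecutive steps.
        lstm_out, hidden = self.lstm(
            input_features.view(len(input_features), 1, -1))
        outputs = self.hidden2label(lstm_out.view(self.batch_size, -1))
        # Log-probabilities; nn.CrossEntropyLoss applies log_softmax again, which leaves them unchanged.
        output_labels = F.log_softmax(outputs, dim=-1)
        return output_labels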
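
`video_net` feeds each video to the LSTM as a single flattened 102400-dim vector. For comparison, the following is a sketch of a variant that steps over the 25 frames of each video, which is closer to the **dx25** input described in the task. It is an assumed alternative and not the model that produced the accuracies reported in this notebook; the class name `FrameLSTM` and the small toy dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FrameLSTM(nn.Module):
    """LSTM over the frames of a video, classifying from the last hidden state."""

    def __init__(self, feature_dim, hidden_dim, num_classes):
        super().__init__()
        # batch_first=True: inputs are (batch, frames, feature_dim)
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # h_n has shape (num_layers, batch, hidden_dim); use the last layer.
        _, (h_n, _) = self.lstm(x)
        return self.classifier(h_n[-1])  # raw logits for nn.CrossEntropyLoss

# Toy sizes: 8 videos, 25 frames, 16-dim features (4096 in the real setting).
model = FrameLSTM(feature_dim=16, hidden_dim=32, num_classes=25)
clips = torch.randn(8, 25, 16)
logits = model(clips)
print(logits.shape)  # torch.Size([8, 25])
```

With this layout the features would be batched as (batch, 25, 4096) tensors instead of being flattened.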

---
---
## **Problem 3.** Evaluation

In [0]:
def test_accuracy(data_test):
  # Uses the globals `model_1` and `labels_test` set in the surrounding cells.
  correct = 0
  total = 0
  with torch.no_grad():
      for i in range(len(data_test)):
        video = data_test[i].to(device)
        label = labels_test[i].view(-1).to(device)
        outputs = model_1(video)
        # Predicted class = index of the largest log-probability
        _, predicted = torch.max(outputs, 1)
        total += label.size(0)
        correct += (predicted == label).sum().item()
  accuracy = (correct/total)*100
  print("Accuracy on the test dataset is:", accuracy)

**With hidden dimension 256**

In [0]:
Batch_size = 8
Hidden_dim = 256
Num_classes =25

data , labels = create_batch_data(Batch_size, train_data, train_labels)

model_1 = video_net(Hidden_dim,Batch_size,Num_classes)
print(model_1)
model_1 = model_1.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_1.parameters(), lr=0.1)

initial_loss = 0
for epoch in range(40):
  loss_count = 0.0
  for i in range(len(data)):
      video = data[i]
      label = labels[i]
      size = video.size()
      video = video.to(device)
      label = label.view(-1)
      label = label.to(device)
      model_1.zero_grad()
      outputs = model_1(video)

      loss = loss_function(outputs, label)
      loss.backward()
      optimizer.step()
      loss_count += loss.item()
  print('Epoch : %d Loss: %.3f' %(epoch+1, loss_count/len(data)))
        

video_net(
  (lstm): LSTM(102400, 256)
  (hidden2label): Linear(in_features=256, out_features=25, bias=True)
)
Epoch : 1 Loss: 0.729
Epoch : 2 Loss: 0.645
Epoch : 3 Loss: 0.473
Epoch : 4 Loss: 0.358
Epoch : 5 Loss: 0.299
Epoch : 6 Loss: 0.215
Epoch : 7 Loss: 0.188
Epoch : 8 Loss: 0.144
Epoch : 9 Loss: 0.122
Epoch : 10 Loss: 0.107
Epoch : 11 Loss: 0.092
Epoch : 12 Loss: 0.084
Epoch : 13 Loss: 0.074
Epoch : 14 Loss: 0.062
Epoch : 15 Loss: 0.064
Epoch : 16 Loss: 0.061
Epoch : 17 Loss: 0.056
Epoch : 18 Loss: 0.049
Epoch : 19 Loss: 0.044
Epoch : 20 Loss: 0.037
Epoch : 21 Loss: 0.033
Epoch : 22 Loss: 0.030
Epoch : 23 Loss: 0.026
Epoch : 24 Loss: 0.022
Epoch : 25 Loss: 0.019
Epoch : 26 Loss: 0.017
Epoch : 27 Loss: 0.015
Epoch : 28 Loss: 0.012
Epoch : 29 Loss: 0.010
Epoch : 30 Loss: 0.009
Epoch : 31 Loss: 0.008
Epoch : 32 Loss: 0.007
Epoch : 33 Loss: 0.006
Epoch : 34 Loss: 0.005
Epoch : 35 Loss: 0.005
Epoch : 36 Loss: 0.004
Epoch : 37 Loss: 0.004
Epoch : 38 Loss: 0.004
Epoch : 39 Loss: 0.003
E

In [0]:
#Creating batches for test data of size 8

Batch_size = 8
data_test , labels_test = create_batch_data(Batch_size, test_data, test_labels)
test_accuracy(data_test)

Accuracy on the test dataset is: 79.23728813559322


**With hidden dimension 512**

In [0]:
Batch_size = 8
Hidden_dim = 512
Num_classes =25

data , labels = create_batch_data(Batch_size, train_data, train_labels)

model_1 = video_net(Hidden_dim,Batch_size,Num_classes)
print(model_1)
model_1 = model_1.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_1.parameters(), lr=0.1)

initial_loss = 0
for epoch in range(40):
  loss_count = 0.0
  for i in range(len(data)):
      video = data[i]
      label = labels[i]
      size = video.size()
      video = video.to(device)
      label = label.view(-1)
      label = label.to(device)
      model_1.zero_grad()
      outputs = model_1(video)

      loss = loss_function(outputs, label)
      loss.backward()
      optimizer.step()
      loss_count += loss.item()
  print('Epoch : %d Loss: %.3f' %(epoch+1, loss_count/len(data)))

video_net(
  (lstm): LSTM(102400, 512)
  (hidden2label): Linear(in_features=512, out_features=25, bias=True)
)
Epoch : 1 Loss: 0.595
Epoch : 2 Loss: 0.546
Epoch : 3 Loss: 0.388
Epoch : 4 Loss: 0.282
Epoch : 5 Loss: 0.211
Epoch : 6 Loss: 0.174
Epoch : 7 Loss: 0.120
Epoch : 8 Loss: 0.105
Epoch : 9 Loss: 0.079
Epoch : 10 Loss: 0.069
Epoch : 11 Loss: 0.059
Epoch : 12 Loss: 0.053
Epoch : 13 Loss: 0.049
Epoch : 14 Loss: 0.046
Epoch : 15 Loss: 0.040
Epoch : 16 Loss: 0.037
Epoch : 17 Loss: 0.037
Epoch : 18 Loss: 0.037
Epoch : 19 Loss: 0.033
Epoch : 20 Loss: 0.035
Epoch : 21 Loss: 0.025
Epoch : 22 Loss: 0.021
Epoch : 23 Loss: 0.024
Epoch : 24 Loss: 0.021
Epoch : 25 Loss: 0.021
Epoch : 26 Loss: 0.018
Epoch : 27 Loss: 0.018
Epoch : 28 Loss: 0.021
Epoch : 29 Loss: 0.016
Epoch : 30 Loss: 0.014
Epoch : 31 Loss: 0.013
Epoch : 32 Loss: 0.010
Epoch : 33 Loss: 0.011
Epoch : 34 Loss: 0.009
Epoch : 35 Loss: 0.010
Epoch : 36 Loss: 0.009
Epoch : 37 Loss: 0.010
Epoch : 38 Loss: 0.008
Epoch : 39 Loss: 0.009
E

In [0]:
test_accuracy(data_test)

Accuracy on the test dataset is: 78.8135593220339


**With hidden dimension 1024**

In [0]:
Batch_size = 8
Hidden_dim = 1024
Num_classes =25

data , labels = create_batch_data(Batch_size, train_data, train_labels)

model_1 = video_net(Hidden_dim,Batch_size,Num_classes)
print(model_1)
model_1 = model_1.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_1.parameters(), lr=0.1)

initial_loss = 0
for epoch in range(40):
  loss_count = 0.0
  for i in range(len(data)):
      video = data[i]
      label = labels[i]
      size = video.size()
      video = video.to(device)
      label = label.view(-1)
      label = label.to(device)
      model_1.zero_grad()
      outputs = model_1(video)

      loss = loss_function(outputs, label)
      loss.backward()
      optimizer.step()
      loss_count += loss.item()
  print('Epoch : %d Loss: %.3f' %(epoch+1, loss_count/len(data)))

video_net(
  (lstm): LSTM(102400, 1024)
  (hidden2label): Linear(in_features=1024, out_features=25, bias=True)
)
Epoch : 1 Loss: 0.562
Epoch : 2 Loss: 0.504
Epoch : 3 Loss: 0.351
Epoch : 4 Loss: 0.241
Epoch : 5 Loss: 0.184
Epoch : 6 Loss: 0.137
Epoch : 7 Loss: 0.097
Epoch : 8 Loss: 0.081
Epoch : 9 Loss: 0.062
Epoch : 10 Loss: 0.062
Epoch : 11 Loss: 0.046
Epoch : 12 Loss: 0.036
Epoch : 13 Loss: 0.035
Epoch : 14 Loss: 0.034
Epoch : 15 Loss: 0.030
Epoch : 16 Loss: 0.027
Epoch : 17 Loss: 0.031
Epoch : 18 Loss: 0.026
Epoch : 19 Loss: 0.023
Epoch : 20 Loss: 0.021
Epoch : 21 Loss: 0.030
Epoch : 22 Loss: 0.019
Epoch : 23 Loss: 0.017
Epoch : 24 Loss: 0.014
Epoch : 25 Loss: 0.015
Epoch : 26 Loss: 0.013
Epoch : 27 Loss: 0.018
Epoch : 28 Loss: 0.031
Epoch : 29 Loss: 0.011
Epoch : 30 Loss: 0.010
Epoch : 31 Loss: 0.009
Epoch : 32 Loss: 0.008
Epoch : 33 Loss: 0.007
Epoch : 34 Loss: 0.007
Epoch : 35 Loss: 0.006
Epoch : 36 Loss: 0.005
Epoch : 37 Loss: 0.004
Epoch : 38 Loss: 0.004
Epoch : 39 Loss: 0.003

In [0]:
test_accuracy(data_test)

Accuracy on the test dataset is: 80.08474576271186


**With learning rate 0.01 and hidden dimension 1024**

In [0]:
Batch_size = 8
Hidden_dim = 1024
Num_classes =25

data , labels = create_batch_data(Batch_size, train_data, train_labels)

model_1 = video_net(Hidden_dim,Batch_size,Num_classes)
print(model_1)
model_1 = model_1.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_1.parameters(), lr=0.01)

initial_loss = 0
for epoch in range(40):
  loss_count = 0.0
  for i in range(len(data)):
      video = data[i]
      label = labels[i]
      size = video.size()
      video = video.to(device)
      label = label.view(-1)
      label = label.to(device)
      model_1.zero_grad()
      outputs = model_1(video)

      loss = loss_function(outputs, label)
      loss.backward()
      optimizer.step()
      loss_count += loss.item()
  print('Epoch : %d Loss: %.3f' %(epoch+1, loss_count/len(data)))

video_net(
  (lstm): LSTM(102400, 1024)
  (hidden2label): Linear(in_features=1024, out_features=25, bias=True)
)
Epoch : 1 Loss: 0.897
Epoch : 2 Loss: 0.412
Epoch : 3 Loss: 0.194
Epoch : 4 Loss: 0.118
Epoch : 5 Loss: 0.089
Epoch : 6 Loss: 0.070
Epoch : 7 Loss: 0.057
Epoch : 8 Loss: 0.050
Epoch : 9 Loss: 0.038
Epoch : 10 Loss: 0.028
Epoch : 11 Loss: 0.021
Epoch : 12 Loss: 0.016
Epoch : 13 Loss: 0.013
Epoch : 14 Loss: 0.010
Epoch : 15 Loss: 0.008
Epoch : 16 Loss: 0.007
Epoch : 17 Loss: 0.006
Epoch : 18 Loss: 0.005
Epoch : 19 Loss: 0.005
Epoch : 20 Loss: 0.004
Epoch : 21 Loss: 0.004
Epoch : 22 Loss: 0.004
Epoch : 23 Loss: 0.003
Epoch : 24 Loss: 0.003
Epoch : 25 Loss: 0.003
Epoch : 26 Loss: 0.003
Epoch : 27 Loss: 0.003
Epoch : 28 Loss: 0.002
Epoch : 29 Loss: 0.002
Epoch : 30 Loss: 0.002
Epoch : 31 Loss: 0.002
Epoch : 32 Loss: 0.002
Epoch : 33 Loss: 0.002
Epoch : 34 Loss: 0.002
Epoch : 35 Loss: 0.002
Epoch : 36 Loss: 0.002
Epoch : 37 Loss: 0.002
Epoch : 38 Loss: 0.002
Epoch : 39 Loss: 0.002

In [0]:
Batch_size = 8
data_test , labels_test = create_batch_data(Batch_size, test_data, test_labels)
test_accuracy(data_test)

Accuracy on the test dataset is: 81.46186440677965


In [0]:
##LSTM with two nn.LSTM layers
class video_net_2(nn.Module):

    def __init__(self, hidden_dim_1, hidden_dim_2, batch_size, label_size):
        super(video_net_2, self).__init__()
        self.hidden_dim_1 = hidden_dim_1
        self.hidden_dim_2 = hidden_dim_2
        self.batch_size = batch_size
        self.lstm_1 = nn.LSTM(102400, hidden_dim_1)
        self.lstm_2 = nn.LSTM(hidden_dim_1, hidden_dim_2)
        self.hidden2label = nn.Linear(hidden_dim_2, label_size)

    def forward(self, input_features):
        lstm_out_1, hidden_1 = self.lstm_1(
            input_features.view(len(input_features), 1, -1))
        lstm_out_2, hidden_2 = self.lstm_2(
            lstm_out_1.view(len(lstm_out_1), 1, -1))
        outputs = self.hidden2label(lstm_out_2.view(self.batch_size,-1))
        output_labels = F.log_softmax(outputs, dim=-1)
        return output_labels

In [44]:
Batch_size = 8
Hidden_dim_1 = 512
Hidden_dim_2 = 1024
Num_classes =25

data , labels = create_batch_data(Batch_size, train_data, train_labels)

model_1 = video_net_2(Hidden_dim_1, Hidden_dim_2,Batch_size,Num_classes)
print(model_1)
model_1 = model_1.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_1.parameters(), lr=0.01)

initial_loss = 0
for epoch in range(40):
  loss_count = 0.0
  for i in range(len(data)):
      video = data[i]
      label = labels[i]
      size = video.size()
      video = video.to(device)
      label = label.view(-1)
      label = label.to(device)
      model_1.zero_grad()
      outputs = model_1(video)

      loss = loss_function(outputs, label)
      loss.backward()
      optimizer.step()
      loss_count += loss.item()
  print('Epoch : %d Loss: %.3f' %(epoch+1, loss_count/len(data)))
        

video_net_2(
  (lstm_1): LSTM(102400, 512)
  (lstm_2): LSTM(512, 1024)
  (hidden2label): Linear(in_features=1024, out_features=25, bias=True)
)
Epoch : 1 Loss: 2.971
Epoch : 2 Loss: 2.309
Epoch : 3 Loss: 1.445
Epoch : 4 Loss: 0.867
Epoch : 5 Loss: 0.575
Epoch : 6 Loss: 0.429
Epoch : 7 Loss: 0.326
Epoch : 8 Loss: 0.256
Epoch : 9 Loss: 0.210
Epoch : 10 Loss: 0.172
Epoch : 11 Loss: 0.142
Epoch : 12 Loss: 0.119
Epoch : 13 Loss: 0.104
Epoch : 14 Loss: 0.105
Epoch : 15 Loss: 0.087
Epoch : 16 Loss: 0.071
Epoch : 17 Loss: 0.061
Epoch : 18 Loss: 0.051
Epoch : 19 Loss: 0.046
Epoch : 20 Loss: 0.037
Epoch : 21 Loss: 0.033
Epoch : 22 Loss: 0.029
Epoch : 23 Loss: 0.026
Epoch : 24 Loss: 0.022
Epoch : 25 Loss: 0.019
Epoch : 26 Loss: 0.017
Epoch : 27 Loss: 0.015
Epoch : 28 Loss: 0.014
Epoch : 29 Loss: 0.013
Epoch : 30 Loss: 0.012
Epoch : 31 Loss: 0.011
Epoch : 32 Loss: 0.011
Epoch : 33 Loss: 0.010
Epoch : 34 Loss: 0.010
Epoch : 35 Loss: 0.009
Epoch : 36 Loss: 0.009
Epoch : 37 Loss: 0.008
Epoch : 38 Los

In [45]:
Batch_size = 8
data_test , labels_test = create_batch_data(Batch_size, test_data, test_labels)
test_accuracy(data_test)

Accuracy on the test dataset is: 72.35169491525424



# **Bonus : Training on all 101 classes**




In [0]:
anno = pd.read_csv("annos/videos_labels_subsets.txt", header=None, sep='\t')

In [0]:
train_data_101 = []
train_labels_101 = []
test_data_101 = []
test_labels_101 = []

count = 0
for _, line in anno.iterrows():
    val = torch.tensor(line[1]-1)
    dat = torch.tensor(sp.io.loadmat(os.path.join(targetfolder, line[0]+'.mat'))['Feature'])
  
    if line[2]==1:
        train_data_101.append(dat)
        train_labels_101.append(val)
    else:
        test_data_101.append(dat)
        test_labels_101.append(val)

In [52]:
l1 = len(train_data_101)
l2 = len(test_data_101)
print("Size of training data is: ", l1)
print("Size of testing data is: ",l2)
print("Total size is:", l1+l2)

Size of training data is:  9537
Size of testing data is:  3783
Total size is: 13320


In [22]:
Batch_size = 8
Hidden_dim = 1024
Num_classes =101

data , labels = create_batch_data(Batch_size, train_data_101, train_labels_101)

model_1 = video_net(Hidden_dim,Batch_size,Num_classes)
print(model_1)
model_1 = model_1.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_1.parameters(), lr=0.1)

initial_loss = 0
for epoch in range(40):
  loss_count = 0.0
  for i in range(len(data)):
      video = data[i]
      label = labels[i]
      size = video.size()
      video = video.to(device)
      label = label.view(-1)
      label = label.to(device)
      model_1.zero_grad()
      outputs = model_1(video)

      loss = loss_function(outputs, label)
      loss.backward()
      optimizer.step()
      loss_count += loss.item()
  print('Epoch : %d Loss: %.3f' %(epoch+1, loss_count/len(data)))

video_net(
  (lstm): LSTM(102400, 1024)
  (hidden2label): Linear(in_features=1024, out_features=101, bias=True)
)
Epoch : 1 Loss: 0.909
Epoch : 2 Loss: 0.848
Epoch : 3 Loss: 0.655
Epoch : 4 Loss: 0.495
Epoch : 5 Loss: 0.385
Epoch : 6 Loss: 0.309
Epoch : 7 Loss: 0.244
Epoch : 8 Loss: 0.188
Epoch : 9 Loss: 0.154
Epoch : 10 Loss: 0.126
Epoch : 11 Loss: 0.106
Epoch : 12 Loss: 0.089
Epoch : 13 Loss: 0.076
Epoch : 14 Loss: 0.065
Epoch : 15 Loss: 0.054
Epoch : 16 Loss: 0.049
Epoch : 17 Loss: 0.045
Epoch : 18 Loss: 0.041
Epoch : 19 Loss: 0.042
Epoch : 20 Loss: 0.035
Epoch : 21 Loss: 0.032
Epoch : 22 Loss: 0.034
Epoch : 23 Loss: 0.028
Epoch : 24 Loss: 0.024
Epoch : 25 Loss: 0.022
Epoch : 26 Loss: 0.019
Epoch : 27 Loss: 0.018
Epoch : 28 Loss: 0.016
Epoch : 29 Loss: 0.014
Epoch : 30 Loss: 0.012
Epoch : 31 Loss: 0.011
Epoch : 32 Loss: 0.010
Epoch : 33 Loss: 0.009
Epoch : 34 Loss: 0.008
Epoch : 35 Loss: 0.007
Epoch : 36 Loss: 0.006
Epoch : 37 Loss: 0.006
Epoch : 38 Loss: 0.005
Epoch : 39 Loss: 0.00

In [23]:
Batch_size = 8
data_test , labels_test = create_batch_data(Batch_size, test_data_101, test_labels_101)
test_accuracy(data_test)

Accuracy on the test dataset is: 52.7542372881356


In [46]:
Batch_size = 8
Hidden_dim_1 = 512
Hidden_dim_2 = 1024
Num_classes =101

data , labels = create_batch_data(Batch_size, train_data_101, train_labels_101)

model_1 = video_net_2(Hidden_dim_1,Hidden_dim_2,Batch_size,Num_classes)
print(model_1)
model_1 = model_1.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_1.parameters(), lr=0.01)

initial_loss = 0
for epoch in range(40):
  loss_count = 0.0
  for i in range(len(data)):
      video = data[i]
      label = labels[i]
      size = video.size()
      video = video.to(device)
      label = label.view(-1)
      label = label.to(device)
      model_1.zero_grad()
      outputs = model_1(video)

      loss = loss_function(outputs, label)
      loss.backward()
      optimizer.step()
      loss_count += loss.item()
  print('Epoch : %d Loss: %.3f' %(epoch+1, loss_count/len(data)))

video_net_2(
  (lstm_1): LSTM(102400, 512)
  (lstm_2): LSTM(512, 1024)
  (hidden2label): Linear(in_features=1024, out_features=101, bias=True)
)
Epoch : 1 Loss: 3.440
Epoch : 2 Loss: 2.534
Epoch : 3 Loss: 1.608
Epoch : 4 Loss: 1.035
Epoch : 5 Loss: 0.730
Epoch : 6 Loss: 0.545
Epoch : 7 Loss: 0.421
Epoch : 8 Loss: 0.332
Epoch : 9 Loss: 0.271
Epoch : 10 Loss: 0.219
Epoch : 11 Loss: 0.186
Epoch : 12 Loss: 0.160
Epoch : 13 Loss: 0.134
Epoch : 14 Loss: 0.116
Epoch : 15 Loss: 0.100
Epoch : 16 Loss: 0.089
Epoch : 17 Loss: 0.087
Epoch : 18 Loss: 0.071
Epoch : 19 Loss: 0.058
Epoch : 20 Loss: 0.076
Epoch : 21 Loss: 0.051
Epoch : 22 Loss: 0.044
Epoch : 23 Loss: 0.039
Epoch : 24 Loss: 0.035
Epoch : 25 Loss: 0.047
Epoch : 26 Loss: 0.035
Epoch : 27 Loss: 0.030
Epoch : 28 Loss: 0.031
Epoch : 29 Loss: 0.029
Epoch : 30 Loss: 0.024
Epoch : 31 Loss: 0.032
Epoch : 32 Loss: 0.023
Epoch : 33 Loss: 0.020
Epoch : 34 Loss: 0.018
Epoch : 35 Loss: 0.016
Epoch : 36 Loss: 0.014
Epoch : 37 Loss: 0.013
Epoch : 38 Lo

In [47]:
Batch_size = 8
data_test , labels_test = create_batch_data(Batch_size, test_data_101, test_labels_101)
test_accuracy(data_test)

Accuracy on the test dataset is: 31.54131355932203


# **Training SVM**

In [0]:
from sklearn.svm import LinearSVC
svc_train = []
svc_train_label = []
for i in range(len(train_data)):
    svc_train.append(train_data[i].view(1, -1))
    svc_train_label.append(train_labels[i].view(1, -1))

svc_test = []
svc_test_label = []
for i in range(len(test_data)):
    svc_test.append(test_data[i].view(1, -1))
    svc_test_label.append(test_labels[i].view(1, -1))

In [0]:
svc_train = torch.cat(svc_train).numpy()
svc_train_label = torch.cat(svc_train_label).view(-1).numpy()
svc_test = torch.cat(svc_test).numpy()
svc_test_label = torch.cat(svc_test_label).view(-1).numpy()

In [0]:
clf = LinearSVC(random_state=0, C=0.08920 , multi_class='ovr')
clf.fit(svc_train, svc_train_label)

LinearSVC(C=0.0892, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
          verbose=0)

In [0]:
svc_test_label = np.array(svc_test_label)
svctest_results = clf.predict(svc_test)

print("Accuracy for SVC model: %f %%" % (100.0*np.sum(svctest_results == svc_test_label) / len(svctest_results)))

Accuracy for SVC model: 83.491062 %


## Submission
---
**Runnable source code in ipynb file and a pdf report are required**.

The report should be of 3 to 4 pages describing what you have done and learned in this homework and report performance of your model. If you have tried multiple methods, please compare your results. If you are using any external code, please cite it in your report. Note that this homework is designed to help you explore and get familiar with the techniques. The final grading will be largely based on your prediction accuracy and the different methods you tried (different architectures and parameters).

Please indicate clearly in your report what model you have tried, what techniques you applied to improve the performance and report their accuracies. The report should be concise and include the highlights of your efforts.
The naming convention for the report is **Surname_Givenname_SBUID_report*.pdf**

When submitting your .zip file through Blackboard, please name it **Surname_Givenname_SBUID_hw*.zip**.

This zip file should include:
```
Surname_Givenname_SBUID_hw*
        |---Surname_Givenname_SBUID_hw*.ipynb
        |---Surname_Givenname_SBUID_hw*.pdf
        |---Surname_Givenname_SBUID_report*.pdf
```

For instance, student Michael Jordan should submit a zip file named "Jordan_Michael_111134567_hw5.zip" for homework5 in this structure:
```
Jordan_Michael_111134567_hw5
        |---Jordan_Michael_111134567_hw5.ipynb
        |---Jordan_Michael_111134567_hw5.pdf
        |---Jordan_Michael_111134567_report*.pdf
```

The **Surname_Givenname_SBUID_hw*.pdf** should include a **google shared link**. To generate the **google shared link**, first create a folder named **Surname_Givenname_SBUID_hw*** in your Google Drive with your Stony Brook account. 

Then right-click this folder, click ***Get shareable link***, and in the People text field enter the two TAs' emails: ***bo.cao.1@stonybrook.edu*** and ***sayontan.ghosh@stonybrook.edu***. Make sure that TAs who have the link **can edit**, ***not just*** **can view**, and also **uncheck** the **Notify people** box.

Colab has a good feature of version control, you should take advantage of this to save your work properly. However, the timestamp of the submission made in blackboard is the only one that we consider for grading. To be more specific, we will only grade the version of your code right before the timestamp of the submission made in blackboard. 

You are encouraged to post and answer questions on Piazza. Based on the amount of email that we have received in past years, there might be delays in replying to personal emails. Please ask questions on Piazza and send emails only for personal issues.

Be aware that your code will undergo a plagiarism check both vertically and horizontally. Please do your own work.