### This notebook contains the following operations
* Load the US video data with titles, tags, and descriptions
* Select a subset of the full data by category pair (primary category: news & politics, id 25)
* Create iterators for the train, validation, and test datasets
* Run the analysis with neural network models (RNN, CNN, and a simple linear NN)

In [28]:
import data_input as data_in
import nnmodels as nnm
import mlmodels as mlm
import bow_models as bowm
import train_eval
import visualization as vis
from torchtext import data
import torch
import pandas as pd

### Import Tags/Titles/Descriptions Data with the selected categories

In [50]:
ori_data_dir = r'D:\Researching Data\Youtube data\USvideos.csv' # path to the US videos CSV; adjust for your machine
sub_categories_id = [25, 24] # active category pair; swap in one of the commented alternatives below
#sub_categories_id = [25, 22]
#sub_categories_id = [25, 28]
#sub_categories_id = [25, 1]

prime_id = 25
ori_data = pd.read_csv(ori_data_dir)
new_data = ori_data[ori_data["category_id"].isin(sub_categories_id)]
new_data_dir = r'D:\Researching Data\Youtube data\sub_categories\sub_data.csv' # output path for the subset CSV; adjust for your machine
new_data.to_csv(new_data_dir)
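
The cell above switches category pairs by commenting lines in and out. As a minimal sketch of an alternative, the pairs could be generated from a small lookup table. The dictionary and the helper `category_pairs` are hypothetical and not part of `data_input`; the names paraphrase the section headings used later in this notebook.

```python
# Category ids used in this notebook, with paraphrased names (assumption).
CATEGORY_NAMES = {
    25: "news & politics",
    24: "entertainment",
    22: "people & blogs",
    28: "science & technology",
    1: "film & animation",
}

PRIME_ID = 25


def category_pairs(prime_id, names):
    """Return [prime_id, other_id] for every non-primary category."""
    return [[prime_id, other] for other in names if other != prime_id]


for pair in category_pairs(PRIME_ID, CATEGORY_NAMES):
    label = " vs ".join(CATEGORY_NAMES[c] for c in pair)
    print(pair, label)
```

Each generated pair has the same shape as `sub_categories_id`, so it could be passed to `isin` in the same way.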

In [51]:
new_TEXT, new_label, new_arr = data_in.load_data(new_data_dir, prime_id, "full")

61.23188405797102  percent of videos are labelled as the selected category
the baseline precision is  61.23188405797102  in this model
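
The baseline printed above equals the share of videos in the primary category: a classifier that labels every video as that category reaches a precision equal to this share. The following is a minimal sketch of that calculation on toy labels. `positive_share` is a hypothetical helper, and how `data_in.load_data` computes the figure internally is not shown in this notebook.

```python
def positive_share(category_ids, prime_id):
    """Fraction of rows whose category equals the primary category."""
    if not category_ids:
        raise ValueError("category_ids is empty")
    positives = sum(1 for c in category_ids if c == prime_id)
    return positives / len(category_ids)


# Toy labels, not the real data: 3 of 5 rows belong to category 25.
toy_ids = [25, 1, 25, 25, 1]
share = positive_share(toy_ids, 25)
print(f"{100 * share:.2f} percent of videos are labelled as the selected category")
```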


In [52]:
path = r'D:\Researching Data\Youtube data\sub_categories' # directory that holds the subset CSV; adjust for your machine
MAX_VOCAB_SIZE = 25000
TRAIN_VALID_TEST_R = (0.4, 0.4, 0.2)
BATCH_SIZE = 64

torch.backends.cudnn.deterministic = True # only affects cuDNN (GPU) runs; no random seed is set in this cell
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)
train_data, valid_data, test_data = data_in.build_train_test(path, new_arr, TRAIN_VALID_TEST_R, TEXT, LABEL)

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)
device = torch.device('cpu')
#device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator, test_iterator = data_in.build_iterator(BATCH_SIZE, device, train_data, valid_data, test_data)

The size of train, valid and test data are 331 331 166
Number of training examples: 330
Number of testing examples: 165
Number of validation examples:330
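
The split sizes printed above follow from applying `TRAIN_VALID_TEST_R = (0.4, 0.4, 0.2)` to 828 rows (331 + 331 + 166). Below is a minimal sketch of such a ratio split. `split_by_ratio` is a hypothetical helper, and `data_in.build_train_test` may shuffle or round differently.

```python
import random


def split_by_ratio(items, ratios, seed=0):
    """Shuffle items and cut them into three consecutive chunks sized by ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_valid = round(n * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder, so no row is dropped
    return train, valid, test


train, valid, test = split_by_ratio(range(828), (0.4, 0.4, 0.2))
print(len(train), len(valid), len(test))  # 331 331 166
```

Giving the test set the remainder, rather than rounding it separately, guarantees the three parts always add up to the full dataset.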


In [53]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 50
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
N_FILTERS = 100
FILTER_SIZES = [3,4,5]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model_wordem = nnm.WordEmbAvg_2linear(INPUT_DIM, EMBEDDING_DIM, 
                                      HIDDEN_DIM, OUTPUT_DIM, PAD_IDX)
model_rnn = nnm.SimpleRNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, 
                          OUTPUT_DIM, PAD_IDX)
model_BLSTM = nnm.LSTM(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, 
                       N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)
model_GRU = nnm.GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, 
                    N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)
model_CNN = nnm.CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, 
                    OUTPUT_DIM, DROPOUT, PAD_IDX)
MODEL_DICT = {"avg_embedding": model_wordem, "SimpleRNN": model_rnn,
              "BLSTM": model_BLSTM, "BGRU": model_GRU, "CNN": model_CNN}

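The hyperparameters above determine the size of the feature vector each model feeds into its final linear layer. The sketch below works through that arithmetic under two assumptions that this notebook does not confirm: that `nnm.CNN` applies one max-over-time pooling per filter and concatenates the results, and that the bidirectional `nnm.LSTM` and `nnm.GRU` concatenate the final forward and backward hidden states. All three helpers are hypothetical.

```python
HIDDEN_DIM = 50
N_FILTERS = 100
FILTER_SIZES = [3, 4, 5]
BIDIRECTIONAL = True


def conv_output_length(seq_len, kernel_size):
    """Length of a 1-D convolution output with stride 1 and no padding."""
    return max(seq_len - kernel_size + 1, 0)


def cnn_feature_dim(n_filters, filter_sizes):
    """One max-pooled value per filter, concatenated over all filter sizes."""
    return n_filters * len(filter_sizes)


def rnn_feature_dim(hidden_dim, bidirectional):
    """Final hidden state size, doubled when both directions are concatenated."""
    return hidden_dim * (2 if bidirectional else 1)


print(cnn_feature_dim(N_FILTERS, FILTER_SIZES))             # 300
print(rnn_feature_dim(HIDDEN_DIM, BIDIRECTIONAL))           # 100
print([conv_output_length(4, k) for k in FILTER_SIZES])     # [2, 1, 0]
```

The last line shows why very short texts matter for the CNN: a four-token input leaves no valid position for a filter of size 5, so such inputs typically need padding up to the largest filter size.
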
### Run the Neural Network models with the data of the given pairs

#### politics (25) vs entertainment (24)
23.79% of videos are labelled as the selected category (politics)

In [39]:
best_models, models_perf = train_eval.compare_models(MODEL_DICT, device, train_iterator, valid_iterator, test_iterator, 5)
train_eval.get_effective_norms(best_models, TEXT)

currently training the model:  avg_embedding
Epoch 0: Dev Accuracy: 0.8799894962991986 Dev Loss:0.6713152836476054
Epoch 1: Dev Accuracy: 0.9042804624353137 Dev Loss:0.5997440655316625
Epoch 2: Dev Accuracy: 0.875525210584913 Dev Loss:0.9372916306768145
Epoch 3: Dev Accuracy: 0.8576680677277702 Dev Loss:1.058928749391011
Epoch 4: Dev Accuracy: 0.8565519962991986 Dev Loss:1.0720378437212534
Test Loss: 0.525 | Test Acc: 88.82%
Test Prec: 69.814% | Test Rec: 96.203%
currently training the model:  SimpleRNN
Epoch 0: Dev Accuracy: 0.7792804624353137 Dev Loss:0.5343857066971915
Epoch 1: Dev Accuracy: 0.7792804624353137 Dev Loss:0.5313312879630497
Epoch 2: Dev Accuracy: 0.7792804624353137 Dev Loss:0.5373117050954274
Epoch 3: Dev Accuracy: 0.7792804624353137 Dev Loss:0.5330587582928794
Epoch 4: Dev Accuracy: 0.7792804624353137 Dev Loss:0.5375687118087497
Test Loss: 0.552 | Test Acc: 75.97%
Test Prec: nan% | Test Rec: 0.000%
currently training the model:  BLSTM
Epoch 0: Dev Accuracy: 0.85431985
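
The SimpleRNN result above (recall 0.000%, precision nan) is the pattern produced when a model never predicts the positive class: with no predicted positives, precision is 0/0 and therefore undefined. Its test accuracy of 75.97% is also close to the 76.21% share of non-politics videos implied by the 23.79% figure for this pair. A minimal sketch with hypothetical confusion counts follows; `precision_recall` is not a function from `train_eval`.

```python
import math


def precision_recall(tp, fp, fn):
    """Precision and recall from confusion counts; nan where undefined."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else math.nan
    recall = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    return precision, recall


# Hypothetical counts: 40 positive videos, none of them predicted positive.
p, r = precision_recall(tp=0, fp=0, fn=40)
print(p, r)  # nan 0.0
```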

#### politics (25) vs people & blogs (22)
50.40% of videos are labelled as the selected category (politics)

In [44]:
best_models, models_perf = train_eval.compare_models(MODEL_DICT, device, train_iterator, valid_iterator, test_iterator, 5)
train_eval.get_effective_norms(best_models, TEXT)

currently training the model:  avg_embedding
Epoch 0: Dev Accuracy: 0.8125 Dev Loss:0.6217183300427028
Epoch 1: Dev Accuracy: 0.8549107142857143 Dev Loss:0.43280339666775297
Epoch 2: Dev Accuracy: 0.8861607142857143 Dev Loss:0.30445895344018936
Epoch 3: Dev Accuracy: 0.8861607142857143 Dev Loss:0.3391383034842355
Epoch 4: Dev Accuracy: 0.9040178571428571 Dev Loss:0.29253519432885305
Test Loss: 0.270 | Test Acc: 88.98%
Test Prec: 85.960% | Test Rec: 92.567%
currently training the model:  SimpleRNN
Epoch 0: Dev Accuracy: 0.4642857142857143 Dev Loss:0.6976036940302167
Epoch 1: Dev Accuracy: 0.5602678571428571 Dev Loss:0.6940380420003619
Epoch 2: Dev Accuracy: 0.5178571428571429 Dev Loss:0.6979015469551086
Epoch 3: Dev Accuracy: 0.49107142857142855 Dev Loss:0.7072133677346366
Epoch 4: Dev Accuracy: 0.49107142857142855 Dev Loss:0.7273320640836444
Test Loss: 0.750 | Test Acc: 44.36%
Test Prec: 40.509% | Test Rec: 37.297%
currently training the model:  BLSTM
Epoch 0: Dev Accuracy: 0.495535714
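
`compare_models` returns `best_models`, which suggests that each model keeps its best checkpoint. One common way to choose it, assumed here because the code of `train_eval` is not shown, is to keep the epoch with the highest dev accuracy. The sketch applies that rule to the avg_embedding log above; `best_epoch` is a hypothetical helper.

```python
def best_epoch(dev_accuracies):
    """Index of the epoch with the highest dev accuracy (first one on ties)."""
    if not dev_accuracies:
        raise ValueError("no epochs recorded")
    return max(range(len(dev_accuracies)), key=lambda i: dev_accuracies[i])


# avg_embedding dev accuracies from the log above, rounded to four decimals.
dev_acc = [0.8125, 0.8549, 0.8862, 0.8862, 0.9040]
print(best_epoch(dev_acc))  # 4
```

Selecting on dev loss instead of dev accuracy is an equally common choice and can pick a different epoch when the two metrics disagree.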

#### politics (25) vs science & technology (28)
57.10% of videos are labelled as the selected category (politics)

In [49]:
best_models, models_perf = train_eval.compare_models(MODEL_DICT, device, train_iterator, valid_iterator, test_iterator, 5)
train_eval.get_effective_norms(best_models, TEXT)

currently training the model:  avg_embedding
Epoch 0: Dev Accuracy: 0.6135110308726629 Dev Loss:0.6524952252705892
Epoch 1: Dev Accuracy: 0.7864583333333334 Dev Loss:0.514727920293808
Epoch 2: Dev Accuracy: 0.9077818592389425 Dev Loss:0.30390355984369916
Epoch 3: Dev Accuracy: 0.907169113556544 Dev Loss:0.24439624200264612
Epoch 4: Dev Accuracy: 0.9175857802232107 Dev Loss:0.23426234101255736
Test Loss: 0.233 | Test Acc: 92.59%
Test Prec: 84.007% | Test Rec: 100.000%
currently training the model:  SimpleRNN
Epoch 0: Dev Accuracy: 0.5441176493962606 Dev Loss:0.6894890467325846
Epoch 1: Dev Accuracy: 0.5467218160629272 Dev Loss:0.689419706662496
Epoch 2: Dev Accuracy: 0.5493259827295939 Dev Loss:0.6912829180558523
Epoch 3: Dev Accuracy: 0.553768386443456 Dev Loss:0.7007514735062917
Epoch 4: Dev Accuracy: 0.563725491364797 Dev Loss:0.6868770023187002
Test Loss: 0.672 | Test Acc: 59.95%
Test Prec: 53.830% | Test Rec: 21.632%
currently training the model:  BLSTM
Epoch 0: Dev Accuracy: 0.554

#### politics (25) vs film & animation (1)
61.23% of videos are labelled as the selected category (politics)

In [54]:
best_models, models_perf = train_eval.compare_models(MODEL_DICT, device, train_iterator, valid_iterator, test_iterator, 5)
train_eval.get_effective_norms(best_models, TEXT)

currently training the model:  avg_embedding
Epoch 0: Dev Accuracy: 0.6510416666666666 Dev Loss:0.6208838224411011
Epoch 1: Dev Accuracy: 0.8390625019868215 Dev Loss:0.5024471034606298
Epoch 2: Dev Accuracy: 0.8052083353201548 Dev Loss:0.4195164442062378
Epoch 3: Dev Accuracy: 0.7807291646798452 Dev Loss:0.5098017056783041
Epoch 4: Dev Accuracy: 0.7677083313465118 Dev Loss:0.8191860119501749
Test Loss: 0.531 | Test Acc: 84.42%
Test Prec: 77.101% | Test Rec: 87.871%
currently training the model:  SimpleRNN
Epoch 0: Dev Accuracy: 0.5796875009934107 Dev Loss:0.681547224521637
Epoch 1: Dev Accuracy: 0.5416666666666666 Dev Loss:0.673817773660024
Epoch 2: Dev Accuracy: 0.5817708373069763 Dev Loss:0.6792107025782267
Epoch 3: Dev Accuracy: 0.5885416666666666 Dev Loss:0.6968029538790385
Epoch 4: Dev Accuracy: 0.59375 Dev Loss:0.6667316953341166
Test Loss: 0.693 | Test Acc: 61.63%
Test Prec: nan% | Test Rec: 1.333%
currently training the model:  BLSTM
Epoch 0: Dev Accuracy: 0.6364583373069763 De
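
The SimpleRNN result above pairs a nan precision with a non-zero recall (1.333%). A single pooled confusion matrix cannot produce that combination, because a non-zero recall implies at least one true positive and therefore a defined precision. One possible explanation, offered as an assumption since the code of `train_eval` is not shown, is that metrics are averaged per batch, so a single batch with no predicted positives turns the mean into nan. The counts below are hypothetical.

```python
import math


def precision(tp, fp):
    """Precision from counts; nan when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else math.nan


# Hypothetical per-batch (tp, fp) counts; the second batch predicts no positives.
batches = [(1, 0), (0, 0), (0, 1)]

# Averaging per-batch precisions: a single nan propagates to the mean.
per_batch = [precision(tp, fp) for tp, fp in batches]
mean_of_batches = sum(per_batch) / len(per_batch)

# Pooling the counts first: defined once any batch predicts a positive.
total_tp = sum(tp for tp, _ in batches)
total_fp = sum(fp for _, fp in batches)
pooled = precision(total_tp, total_fp)

print(mean_of_batches, pooled)  # nan 0.5
```

Pooling counts over the whole test set before dividing avoids the nan and also weights every example equally, regardless of batch size.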