# ML4NLP1
## Starting Point for Exercise 1, part II

This notebook is supposed to serve as a starting point and/or inspiration when starting exercise 1, part II.

One of the goals of this exercise is o make you acquainted with **skorch**. You will probably need to consult the [documentation](https://skorch.readthedocs.io/en/stable/).

# Installing skorch and loading libraries

In [1]:
import subprocess

# Installation on Google Colab
try:
    import google.colab
    subprocess.run(['python', '-m', 'pip', 'install', 'skorch'])
except ImportError:
    pass

In [2]:
import torch
from torch import nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

In [3]:
torch.manual_seed(0)
torch.cuda.manual_seed(0)

In [4]:
import pandas as pd
import numpy as np
import csv
import re
import string
from collections import defaultdict

## Training a classifier and making predictions

In [5]:
# download dataset
!gdown 1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs # x_train
!gdown 1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6 # x_test
!gdown 1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl # y_train
!gdown 1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X # y_test

Downloading...
From: https://drive.google.com/uc?id=1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs
To: /content/x_train.txt
100% 64.1M/64.1M [00:00<00:00, 245MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6
To: /content/x_test.txt
100% 65.2M/65.2M [00:00<00:00, 126MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl
To: /content/y_train.txt
100% 480k/480k [00:00<00:00, 113MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X
To: /content/y_test.txt
100% 480k/480k [00:00<00:00, 58.8MB/s]


In [6]:
with open(f'x_train.txt') as f:
    x_train = f.read().splitlines()
with open(f'y_train.txt') as f:
    y_train = f.read().splitlines()
with open(f'x_test.txt') as f:
    x_test = f.read().splitlines()
with open(f'y_test.txt') as f:
    y_test = f.read().splitlines()

In [7]:
import pandas as pd
# combine x_train and y_train into one dataframe
train_df = pd.DataFrame({'text': x_train, 'label': y_train})

#combine x_test and y_test into one dataframe
test_df = pd.DataFrame({'text': x_test, 'label': y_test})

In [8]:
# T: Please use again the train/test data that includes English, German, Dutch, Danish, Swedish and Norwegian, plus 20 additional languages of your choice (the labels can be found in the file labels.csv)
# and adjust the train/test split if needed
from sklearn.model_selection import train_test_split

l = train_df['label'].unique().tolist()
l.sort()


all_df = pd.concat([train_df, test_df], ignore_index=True)


# pre selected langauges
languages = ['eng', 'deu', 'nld', 'dan', 'swe', 'nob']

# get 20 other langauges randomly
num_languages = 20
l = [lang for lang in l if lang not in languages]
np.random.seed(42)
rand_idx = np.random.choice(len(l), num_languages)

for i in range(num_languages):
  idx = rand_idx[i]
  languages.append(l[idx])

# now use this list to get corresponding training data
new_df = all_df[all_df['label'].isin(languages)]
train_x, test_x, train_y, test_y= train_test_split(new_df['text'], new_df['label'], test_size = 0.2, random_state = 42, stratify = new_df['label'])



In [9]:
# T: use your adjusted code to encode the labels here

from sklearn.preprocessing import LabelEncoder
# le_fitted = LabelEncoder().fit(train_df['label'])
# y_train_dev, y_test = le_fitted.fit(train_df['label']), le_fitted.fit(test_df['label'])

le_fitted = LabelEncoder().fit(train_y)
y_train_dev = le_fitted.transform(train_y)
y_test = le_fitted.transform(test_y)

In [10]:
# T: In the following, you can find a small (almost) working example of a neural network. Unfortunately, again, the cat messed up some of the code. Please fix the code such that it is executable.

In [18]:
# First, we extract some simple features as input for the neural network
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2), max_features=500,)
X = vectorizer.fit_transform(train_x.to_numpy())

In [12]:
X = X.astype(np.float32)
y = y_train_dev.astype(np.int64)

In the following, we define a vanilla neural network with two hidden layers. The output layer should have as many outputs as there are classes. In addition, it should have a nonlinearity function.

In [13]:
class ClassifierModule(nn.Module):
    def __init__(
            self,
            num_units=200,
            nonlin=F.relu,
    ):
        super(ClassifierModule, self).__init__()
        self.num_units = num_units
        self.nonlin = nonlin

        self.dense0 = nn.Linear(500, num_units)
        self.nonlin = nonlin
        self.dense1 = nn.Linear(num_units, 250)
        self.dense2 = nn.Linear(250, 100)
        self.output = nn.Linear(100, 26)


    def forward(self, X, **kwargs):
      X = self.nonlin(self.dense0(X))

      X = self.nonlin(self.dense1(X))
      X = self.nonlin(self.dense2(X))
      X = self.output(X)
      return X.squeeze(dim=1)

In [14]:
net = NeuralNetClassifier(
    ClassifierModule,
    max_epochs=20,
    criterion=nn.CrossEntropyLoss(),
    lr=0.1,
    device='cuda',  # comment this to train with CPU
)

In [15]:
net.fit(X, y)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        [36m1.4489[0m       [32m0.9357[0m        [35m0.3554[0m  2.4107
      2        [36m0.2920[0m       [32m0.9575[0m        [35m0.2032[0m  2.0235
      3        [36m0.1869[0m       [32m0.9640[0m        [35m0.1698[0m  1.9718
      4        [36m0.1394[0m       [32m0.9670[0m        [35m0.1515[0m  1.9670
      5        [36m0.1181[0m       [32m0.9692[0m        [35m0.1436[0m  2.7986
      6        [36m0.0960[0m       [32m0.9710[0m        [35m0.1377[0m  1.9043
      7        [36m0.0846[0m       [32m0.9718[0m        [35m0.1340[0m  1.9962
      8        [36m0.0761[0m       [32m0.9720[0m        [35m0.1311[0m  1.9780
      9        [36m0.0688[0m       0.9720        [35m0.1299[0m  1.9624
     10        [36m0.0626[0m       0.9720        [35m0.1295[0m  1.9654
     11        [36m0.0580[0m       [32m0.9725[0m        0.13

<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=ClassifierModule(
    (dense0): Linear(in_features=500, out_features=200, bias=True)
    (dense1): Linear(in_features=200, out_features=250, bias=True)
    (dense2): Linear(in_features=250, out_features=100, bias=True)
    (output): Linear(in_features=100, out_features=26, bias=True)
  ),
)

In [19]:
X = vectorizer.transform(test_x.to_numpy())
X = X.astype(np.float32)
y = y_test.astype(np.int64)
y_test


array([16, 16, 24, ..., 17, 11,  3])

In [21]:
y_pred = net.predict(X)

In [22]:
accuraccy = y_pred == y_test

In [24]:
accuraccy.mean()

0.973