## Support Vector Machine to classify the [Ten Thousand German News Articles Dataset](https://github.com/tblock/10kGNAD)
This notebook contains the code to reproduce the exact results reported in my thesis.

Run all cells consecutively.

### Environment Setup

In [0]:
# load the dataset and generate subsets
!rm -rf 10kGNAD lowshot
!git config --global advice.detachedHead false
!git clone -q --branch v1.1 https://github.com/tblock/10kGNAD.git && echo "downloaded dataset"
!mkdir lowshot
!cp 10kGNAD/train.csv .
!python 10kGNAD/code/generate_lowshot_sets.py > /dev/null && echo "generated train subsets"

downloaded dataset
generated train subsets
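
The cells further down read these files with `sep=';'` and `quotechar="'"`. As a minimal sketch of what those two options do, the snippet below parses two made-up rows (not taken from 10kGNAD) in the assumed `label;text` layout. The variable names `sample` and `df` are only for illustration.

```python
import io
import pandas as pd

# Two made-up rows in the assumed layout: label;text, with ' as the quote character.
# The first text contains a semicolon, so it has to be quoted.
sample = "Sport;'Ein Tor; dann noch eins.'\nWeb;Ein Text ohne Semikolon\n"

df = pd.read_csv(
    io.StringIO(sample),   # stands in for a file such as 10kGNAD/test.csv
    header=None,           # the files have no header row
    sep=';',               # fields are separated by semicolons
    quotechar="'",         # a semicolon inside '...' is part of the text
    names=['label', 'text'],
)

print(df)
```

Without the matching `quotechar`, the semicolon inside the first text would be treated as a field separator and the row would no longer fit the two-column layout.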


In [0]:
import glob
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import StandardScaler

### Train Models

In [0]:
# load test set
df_test = pd.read_csv('10kGNAD/test.csv', header=None, sep=';', quotechar="'", names=['label', 'text'])
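
The pipeline in the next cell sets `sublinear_tf=True`, which replaces each term frequency tf with 1 + log(tf). The following sketch isolates that one option on a made-up two-document corpus. Turning off idf weighting and normalisation (`use_idf=False`, `norm=None`) is a choice made here only to keep the scaling visible; the actual pipeline keeps the defaults for both.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up toy corpus (not from 10kGNAD): "tor" occurs 4 times in the first document.
docs = ["tor tor tor tor spiel", "spiel markt"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)   # sparse matrix of raw term counts
tor_col = vect.vocabulary_["tor"]   # column index of the term "tor"

# Illustration only: idf and normalisation are disabled so that
# the only difference between the two results is the tf scaling.
raw = TfidfTransformer(use_idf=False, norm=None).fit_transform(counts)
sub = TfidfTransformer(use_idf=False, norm=None, sublinear_tf=True).fit_transform(counts)

print("raw tf:      ", raw[0, tor_col])
print("sublinear tf:", sub[0, tor_col])
print("1 + log(4):  ", 1 + np.log(4))
```

The logarithm is the natural logarithm, so a term that occurs four times gets a weight of about 2.39 instead of 4, which dampens the influence of words that are repeated many times within one article.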

In [0]:
filenames = sorted(glob.glob("lowshot/*.csv"))

for filename in filenames:  # for each subset
  
  df_train = pd.read_csv(filename, header=None, sep=';', quotechar="'", names=['label', 'text'])

  # build the classifier pipeline
  lsvc_classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer(
        sublinear_tf=True # Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
    )),
    ('clf', LinearSVC(
        dual=False,
        C=1.6,
        class_weight="balanced"
    ))
  ])

  lsvc_classifier.fit(df_train['text'], df_train['label'])  # train the classifier
  predicted = lsvc_classifier.predict(df_test['text'])  # predict the test set 
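  acc = np.mean(predicted == df_test['label'])  # accuracy: fraction of correctly classified test articles

  # print the subset label (filename without its first 16 characters and the ".csv" extension)
  # followed by the error rate in percent
  print(filename[16:-4], "%.2f" % (100 - acc * 100), sep=" -> ")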

0_0.01_0 -> 40.56
0_0.01_1 -> 39.40
0_0.01_2 -> 36.87
0_0.01_3 -> 38.13
0_0.01_4 -> 42.02
0_0.01_5 -> 40.56
0_0.01_6 -> 39.59
0_0.01_7 -> 38.33
0_0.01_8 -> 38.04
0_0.01_9 -> 39.59
1_0.02_0 -> 32.10
1_0.02_1 -> 27.72
1_0.02_2 -> 30.64
1_0.02_3 -> 32.78
1_0.02_4 -> 28.21
1_0.02_5 -> 29.38
1_0.02_6 -> 30.06
1_0.02_7 -> 28.99
1_0.02_8 -> 31.81
1_0.02_9 -> 32.10
2_0.05_0 -> 22.76
2_0.05_1 -> 24.22
2_0.05_2 -> 20.53
2_0.05_3 -> 21.89
2_0.05_4 -> 22.86
2_0.05_5 -> 22.57
2_0.05_6 -> 22.76
2_0.05_7 -> 24.51
2_0.05_8 -> 21.69
2_0.05_9 -> 22.28
3_0.075_0 -> 18.97
3_0.075_1 -> 19.26
3_0.075_2 -> 20.72
3_0.075_3 -> 20.43
3_0.075_4 -> 20.53
3_0.075_5 -> 19.94
3_0.075_6 -> 19.94
3_0.075_7 -> 19.84
3_0.075_8 -> 19.65
3_0.075_9 -> 20.14
4_0.1_0 -> 19.36
4_0.1_1 -> 18.68
4_0.1_2 -> 17.02
4_0.1_3 -> 18.77
4_0.1_4 -> 17.32
4_0.1_5 -> 17.61
4_0.1_6 -> 18.48
4_0.1_7 -> 18.68
4_0.1_8 -> 17.61
4_0.1_9 -> 17.02
5_0.2_0 -> 14.98
5_0.2_1 -> 15.47
5_0.2_2 -> 14.88
5_0.2_3 -> 15.66
5_0.2_4 -> 15.27
5_0.2_5 -> 15.3