# Predict decision time

## Load data and preprocess

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
URL = r'https://raw.githubusercontent.com/rostro36/Vernehmlassungen/master/laws.csv'
df = pd.read_csv(URL)

In [2]:
df.head(100)

Unnamed: 0.1,Unnamed: 0,index,Department,Title,Text,Vernehmlassung_Day,Vernehmlassung_Month,Vernehmlassung_Year,Behoerde,SR_Links,SR_Numbers,Link_count,index.1,Months_until_decision,Decision_day,Decision_month,Decision_year,Months_until_accept,Accept_day,Accept_month,Accept_year
0,0,0,BK,Revision des Bundesgesetzes über die politisch...,,28,2,1993,Bundesrat,,,,0,,,,,,,,
1,1,1,EDA,Beitritt der Schweiz zum UNO-Übereinkommen übe...,,15,12,1992,Bundesrat,,,,1,,,,,,,,
2,2,2,EDI,Verordnung über den Wald (Waldverordnung),,16,3,1992,Bundesrat,,,,2,,,,,,,,
3,3,3,EDI,Beitritt der Schweiz zu drei internationalen B...,,15,6,1992,Bundesrat,,,,3,,,,,,,,
4,4,4,EDI,Bundesbeschluss über befristete Massnahmen geg...,,30,6,1992,Bundesrat,,,,4,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,95,13,EFD,Verzinsung von Verrechnungssteuerguthaben (Var...,Es stehen zwei Varianten von Gesetzesentwürfen...,15,6,1995,Bundesrat,,,,95,,,,,,,,
96,96,14,EFD,Bundesbeschluss über die Anordnung einer allge...,,30,6,1995,Bundesrat,,,,96,,,,,,,,
97,97,15,EFD,Verordnung über das öffentliche Beschaffungswesen,,18,9,1995,Bundesrat,,,,97,,,,,,,,
98,98,16,EFD,Finanzierung des öffentlichen Verkehrs,"Der Bundesrat sieht vor, die drei Finanzierung...",15,11,1995,Bundesrat,,,,98,,,,,,,,


In [3]:
df = df.dropna(
    subset=['SR_Links', 'Months_until_accept', 'Months_until_decision'])
no_text = df.drop(columns=['Title', 'Text', 'Unnamed: 0', 'index', 'index.1', 'SR_Links',
                  'SR_Numbers', 'Months_until_accept', 'Accept_day', 'Accept_month', 'Accept_year']).reset_index()
# Note: scikit-learn >= 1.2 renames this parameter to `sparse_output`.
encoder = OneHotEncoder(sparse=False)
encoded = encoder.fit_transform(no_text[['Behoerde', 'Department']])
encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
no_text = no_text.drop(columns=['Behoerde', 'Department', 'index'])
no_text = pd.concat([no_text, encoded], axis=1)
no_text.tail()

Unnamed: 0,Vernehmlassung_Day,Vernehmlassung_Month,Vernehmlassung_Year,Link_count,Months_until_decision,Decision_day,Decision_month,Decision_year,Behoerde_Behördenkommission,Behoerde_Bundesrat,Behoerde_Bundesversammlung,Behoerde_Departement oder Bundeskanzlei,Behoerde_Einheit der zentralen oder dezentralen Bundesverwaltung,Behoerde_Parlamentarische Kommissionen,Department_BK,Department_EDA,Department_EDI,Department_EFD,Department_EJPD,Department_EVD,Department_Parl.,Department_UVEK,Department_VBS,Department_WBF,Department_other
937,6,7,2015,2.0,15.066667,30.0,9.0,2016.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
938,14,8,2015,1.0,10.266667,17.0,6.0,2016.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
939,18,12,2015,1.0,6.066667,17.0,6.0,2016.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
940,15,3,2016,1.0,3.133333,17.0,6.0,2016.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
941,21,3,2016,1.0,12.033333,17.0,3.0,2017.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


## Overfitting test
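As a sanity check, the decision date is deliberately left in the features, even though it gives away the value we want to predict. If the pipeline works, the models should reach a near-zero error.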

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from tensorflow import keras
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

targets = no_text['Months_until_decision']
features = no_text.drop(columns=['Months_until_decision'])
scaler = StandardScaler()
features = scaler.fit_transform(features)
features_training, features_test, targets_training, targets_test = train_test_split(
    features, targets, test_size=0.2, random_state=42)

parameters = {'kernel': ('linear', 'poly', 'rbf',
                         'sigmoid'), 'C': [0.1, 1, 10]}
clf = GridSearchCV(SVR(), parameters, n_jobs=-1, cv=5,
                   verbose=3, scoring='neg_mean_squared_error')
clf.fit(features_training, targets_training)
predicted_test = clf.predict(features_test)
print(mean_squared_error(targets_test, predicted_test))

Fitting 5 folds for each of 12 candidates, totalling 60 fits
0.0027676437022887104


Yes, the SVR learns how to subtract the two dates from each other.
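
Judging from the rows displayed above, the target appears to be the day difference between the two dates divided by 30. This is inferred from the printed output, not taken from the scraping code. A minimal sketch that reproduces the value for row 937, where the consultation date is 6 July 2015 and the decision date is 30 September 2016:

```python
from datetime import date

# Dates taken from row 937 of the table above.
consultation = date(2015, 7, 6)
decision = date(2016, 9, 30)

# Assumption: one month is counted as 30 days.
days = (decision - consultation).days
months_until_decision = days / 30

print(days, round(months_until_decision, 6))  # 452 15.066667
```

With the decision date among the features, the target is therefore close to a linear function of the inputs, which would explain the near-zero error of the SVR.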

In [5]:
tf.random.set_seed(12)
for learning_rate in [1, 0.1, 0.001, 0.0001]:
    model = keras.Sequential([keras.layers.Dense(10, activation='ReLU', input_shape=(
        24,)), keras.layers.Dropout(0.1), keras.layers.Dense(1, activation='ReLU')])
    model.build()
    model.compile(optimizer=tf.optimizers.Adam(
        learning_rate=learning_rate), loss='mean_squared_error')
    model.fit(features_training, targets_training, batch_size=1, epochs=10)
    predicted_test = model.predict(features_test)
    print(learning_rate)
    print(mean_squared_error(targets_test, predicted_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
1
1701.6642974720746
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
0.1
376.6424641286496
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
0.001
630.8834373801285
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
0.0001
1580.755114114337


For the feed-forward neural net, learning rates of 0.1 and 0.001 seem to work reasonably well. Both are slightly overfitting, but training for longer would probably give even better scores.

## Not overfitting
Excluding the exact decision date.

### SVM

In [6]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

targets = no_text['Months_until_decision']
features = no_text.drop(columns=[
                        'Months_until_decision', 'Decision_day', 'Decision_month', 'Decision_year'])
scaler = StandardScaler()
features = scaler.fit_transform(features)
features_training, features_test, targets_training, targets_test = train_test_split(
    features, targets, test_size=0.2, random_state=42)

parameters = {'kernel': ('linear', 'poly', 'rbf',
                         'sigmoid'), 'C': [0.1, 1, 10]}
clf = GridSearchCV(SVR(), parameters, n_jobs=-1, cv=5, verbose=3)
clf.fit(features_training, targets_training)
print('SVM')
print(clf.best_params_)
predicted_test = clf.predict(features_test)
print(mean_squared_error(predicted_test, targets_test))

Fitting 5 folds for each of 12 candidates, totalling 60 fits
SVM
{'C': 10, 'kernel': 'rbf'}
1350.3636136938092


### Decision tree

In [7]:
parameters = {'max_depth': (3, 6, 12, 25, None), 'min_samples_leaf': [1, 3, 7]}
clf = GridSearchCV(DecisionTreeRegressor(random_state=12),
                   parameters, n_jobs=-1, cv=5, verbose=3)
clf.fit(features_training, targets_training)
print('Tree')
print(clf.best_params_)
predicted_test = clf.predict(features_test)
print(mean_squared_error(predicted_test, targets_test))

Fitting 5 folds for each of 15 candidates, totalling 75 fits
Tree
{'max_depth': 3, 'min_samples_leaf': 7}
1184.0546846776547


### Nearest Neighbour

In [8]:
parameters = {'n_neighbors': (3, 5, 7, 11), 'weights': ['uniform', 'distance']}
clf = GridSearchCV(KNeighborsRegressor(), parameters,
                   n_jobs=-1, cv=5, verbose=4)
clf.fit(features_training, targets_training)
print('Nearest Neighbour')
print(clf.best_params_)
predicted_test = clf.predict(features_test)
print(mean_squared_error(predicted_test, targets_test))

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Nearest Neighbour
{'n_neighbors': 11, 'weights': 'uniform'}
1108.483075196408


### Random Forest

In [9]:
parameters = {'n_estimators': (10, 50, 100), 'max_depth': (
    3, 6, 12, 25, None), 'min_samples_leaf': [1, 3, 7]}
clf = GridSearchCV(RandomForestRegressor(random_state=12),
                   parameters, n_jobs=-1, cv=5, verbose=4)
clf.fit(features_training, targets_training)
print('Random Forest')
print(clf.best_params_)
predicted_test = clf.predict(features_test)
print(mean_squared_error(predicted_test, targets_test))

Fitting 5 folds for each of 45 candidates, totalling 225 fits
Random Forest
{'max_depth': 3, 'min_samples_leaf': 3, 'n_estimators': 100}
1117.5740255716626


### Gradient Boosting

In [10]:
parameters = {'learning_rate': (0.001, 0.01, 0.1, 0.4), 'n_estimators': (
    10, 50, 100), 'max_depth': (3, 6, 12, 25, None), 'min_samples_leaf': [1, 3, 7]}
clf = GridSearchCV(GradientBoostingRegressor(random_state=12),
                   parameters, n_jobs=-1, cv=5, verbose=4)
clf.fit(features_training, targets_training)
print('Boosting')
print(clf.best_params_)
predicted_test = clf.predict(features_test)
print(mean_squared_error(predicted_test, targets_test))

Fitting 5 folds for each of 180 candidates, totalling 900 fits
Boosting
{'learning_rate': 0.1, 'max_depth': 3, 'min_samples_leaf': 1, 'n_estimators': 10}
1146.7757638073708


### Feed-forward neural network

In [11]:
for learning_rate in [1, 0.1, 0.001, 0.0001]:
    tf.random.set_seed(12)
    model = keras.Sequential([keras.layers.Dense(10, activation='ReLU', input_shape=(
        21,)), keras.layers.Dropout(0.1), keras.layers.Dense(1, activation='ReLU')])
    model.build()
    model.compile(optimizer=tf.optimizers.Adam(
        learning_rate=learning_rate), loss='mean_squared_error')
    model.fit(features_training, targets_training, batch_size=1, epochs=10)
    predicted_test = model.predict(features_test)
    print(learning_rate)
    print(mean_squared_error(targets_test, predicted_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
1
1701.6642974720746
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
0.1
1130.7311531690807
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
0.001
1057.6135826738655
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
0.0001
1506.9872370208307


As expected, this task is much harder.

Many of the regressors perform very similarly, and all have a rather bad score. The best test MSE values are around 1100 (in months squared), which corresponds to an RMSE of roughly 33 months, i.e. just under three years.
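
Since the reported scores are mean squared errors in months squared, taking the square root gives a typical error in months. A quick check of the conversion for a score of 1100:

```python
import math

mse = 1100.0  # approximate best test MSE reported above, in months squared

rmse_months = math.sqrt(mse)
rmse_years = rmse_months / 12

print(f"RMSE: {rmse_months:.1f} months = {rmse_years:.2f} years")
# RMSE: 33.2 months = 2.76 years
```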

This is pretty bad, but is also expected, as this is a very hard task with only limited training data (in comparison to many other ML tasks).
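
One way to judge how bad such a score is would be to compare it against a predictor that ignores the features and always returns the mean of the training targets. The sketch below uses synthetic data rather than the consultation data, and all variable names are illustrative; with the real splits, `features_training`, `targets_training` and their test counterparts would take the place of the random arrays.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real features and targets (assumption).
X_train, X_test = rng.normal(size=(200, 5)), rng.normal(size=(50, 5))
y_train = rng.gamma(shape=2.0, scale=15.0, size=200)
y_test = rng.gamma(shape=2.0, scale=15.0, size=50)

# Always predicts the mean of the training targets, whatever the input.
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))

print(baseline_mse)
```

A model is only informative if its test MSE is clearly below this baseline.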

## Adding text information
Since this project has only a small number of samples (and I expect the texts to be weak features), I did not fine-tune BERT and instead used the embeddings of the pre-trained model as fixed features. This also keeps training uniform across the models.
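
Using frozen embeddings as features amounts to appending them as extra columns to the tabular data. The sketch below uses random arrays as stand-ins for BERT's pooled output (768 dimensions for a base-sized model) and for the 21 non-textual features; the prefixes are an addition for illustration that keeps the column names unique. The resulting width of 21 + 768 + 768 = 1557 matches the `input_shape` used for the network with text features.

```python
import numpy as np
import pandas as pd

n_samples = 4
rng = np.random.default_rng(0)

# Stand-ins (assumption): 21 tabular features and two 768-dimensional
# pooled embeddings per sample, filled with random numbers.
tabular = pd.DataFrame(rng.normal(size=(n_samples, 21))).add_prefix('tab_')
title_emb = pd.DataFrame(rng.normal(size=(n_samples, 768))).add_prefix('title_')
text_emb = pd.DataFrame(rng.normal(size=(n_samples, 768))).add_prefix('text_')

# The frozen embeddings are simply appended as extra feature columns.
features = pd.concat([tabular, title_emb, text_emb], axis=1)

print(features.shape)  # (4, 1557)
```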

### Download transformers for easier BERT

In [12]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transfor

### Load libraries and BERT

In [13]:
from transformers import TFBertModel
from transformers import BertTokenizer
import numpy as np


model = TFBertModel.from_pretrained('bert-base-german-cased')

Some layers from the model checkpoint at bert-base-german-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-german-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


### Prepare dataframe with previous, non-textual information

In [14]:
df = df.dropna(
    subset=['SR_Links', 'Months_until_accept', 'Months_until_decision'])
text = df.drop(columns=['Unnamed: 0', 'index', 'index.1', 'SR_Links', 'SR_Numbers',
               'Months_until_accept', 'Accept_day', 'Accept_month', 'Accept_year']).reset_index()
# Note: scikit-learn >= 1.2 renames this parameter to `sparse_output`.
encoder = OneHotEncoder(sparse=False)
encoded = encoder.fit_transform(text[['Behoerde', 'Department']])
encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

### Add Title embeddings

In [15]:
tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')
inputs = tokenizer(text['Title'].to_list(), return_tensors='tf', padding=True)
embedded_title = model(inputs)
embedded_title = pd.DataFrame(embedded_title['pooler_output'].numpy())


### Add Text embeddings

In [16]:
def no_nan(value):
    # Missing texts are read by pandas as float NaN; replace them with "".
    if isinstance(value, float):
        return ""
    return value


checked_text = [no_nan(value) for value in text['Text']]
tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')
concat_text = None
len_text = len(text)
for i in range(50):
    inputs = tokenizer(checked_text[int(
        i*len_text/50):int((i+1)*len_text/50)], return_tensors='tf', padding=True, truncation=True)
    embedded_text = model(inputs)
    embedded_text = embedded_text['pooler_output'].numpy()
    print(i)
    if concat_text is None:
        concat_text = embedded_text
    else:
        concat_text = np.concatenate([concat_text, embedded_text])
embedded_text = pd.DataFrame(concat_text)

0
1
2
...
49


In [17]:
text = text.drop(columns=['Behoerde', 'Department', 'index', 'Title', 'Text'])
text = pd.concat([text, encoded, embedded_title, embedded_text], axis=1)
text.tail()

Unnamed: 0,Vernehmlassung_Day,Vernehmlassung_Month,Vernehmlassung_Year,Link_count,Months_until_decision,Decision_day,Decision_month,Decision_year,Behoerde_Behördenkommission,Behoerde_Bundesrat,Behoerde_Bundesversammlung,Behoerde_Departement oder Bundeskanzlei,Behoerde_Einheit der zentralen oder dezentralen Bundesverwaltung,Behoerde_Parlamentarische Kommissionen,Department_BK,Department_EDA,Department_EDI,Department_EFD,Department_EJPD,Department_EVD,Department_Parl.,Department_UVEK,Department_VBS,Department_WBF,Department_other,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,...,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767
937,6,7,2015,2.0,15.066667,30.0,9.0,2016.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.999462,-0.028841,-0.002988,0.47778,-0.181328,0.990687,0.045872,0.908294,0.097314,-0.031537,-0.0053,-0.024832,0.167311,0.999628,-0.91481,...,-0.641382,-0.805752,0.04355,-0.974765,-0.948802,-0.084835,0.015448,0.97911,-0.245245,-0.150366,0.037998,-0.086163,-0.048969,0.056265,0.050823,-0.051538,-0.131883,-0.97736,0.767668,0.106707,-0.895128,-0.91953,-0.123482,0.998901,0.059751,0.70667,-0.065601,-0.199007,0.983413,-0.056764,-0.99393,-0.184786,-0.992309,0.922609,0.066132,0.008587,0.050328,-0.059185,0.632312,-0.087605
938,14,8,2015,1.0,10.266667,17.0,6.0,2016.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.999418,-0.091151,0.132117,0.736032,-0.324399,0.970595,0.033339,0.911434,0.087837,-0.27164,0.126724,0.023224,0.669815,0.9999,-0.977855,...,0.762551,-0.957289,-0.324086,-0.586304,-0.845189,-0.059567,0.100485,0.994294,0.5018,-0.21506,-0.000606,-0.24379,0.182886,0.203146,-0.062922,-0.032283,0.08491,-0.975928,0.281003,-0.147259,-0.783449,-0.554282,-0.043908,0.981162,0.050912,0.850203,0.949988,-0.175343,0.983218,-0.012234,-0.624301,-0.146627,-0.967391,0.44949,0.053388,-0.038136,-0.042739,-0.102848,0.457173,-0.321369
939,18,12,2015,1.0,6.066667,17.0,6.0,2016.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.999866,-0.038268,0.260984,0.583915,-0.036757,0.988617,0.184825,-0.233967,-0.047974,-0.074589,0.000464,0.107325,-0.014181,0.999823,-0.988998,...,0.503953,-0.326982,-0.120823,-0.96231,-0.957036,-0.269614,-0.112452,0.709622,-0.146578,-0.118479,-0.074916,0.034735,-0.090729,0.124472,0.065864,-0.064125,-0.047836,-0.953403,0.196931,0.074513,-0.927573,-0.934958,0.013441,0.997444,0.16215,-0.075736,0.089827,-0.232584,0.961293,0.137314,-0.983843,0.082303,-0.972651,0.687559,0.144285,0.008939,-0.125636,-0.067978,0.026198,-0.290221
940,15,3,2016,1.0,3.133333,17.0,6.0,2016.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.999861,-0.094212,0.335018,0.895429,0.00521,0.937152,0.113863,0.219569,-0.112734,0.160694,-0.191104,0.00102,0.379901,0.999863,-0.965881,...,0.538928,0.129436,-0.185132,-0.990912,-0.991046,-0.104258,0.112069,0.78671,-0.504732,-0.227002,0.077774,-0.432937,0.117931,0.145919,-0.320744,-0.058922,-0.053402,-0.974284,0.190112,-0.04497,-0.948323,-0.99393,-0.113425,0.998874,0.246805,0.768723,-0.871369,-0.073727,0.965794,0.002006,-0.994511,-0.002302,-0.898509,-0.234354,-0.212925,-0.082415,0.006231,-0.010596,0.985365,-0.417561
941,21,3,2016,1.0,12.033333,17.0,3.0,2017.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.999642,-0.020097,-0.128804,0.607999,-0.063407,0.93745,0.121918,0.841808,0.156542,-0.205109,0.100531,-0.056608,0.707338,0.999882,-0.893409,...,-0.009983,-0.753983,-0.209791,-0.987087,-0.988997,-0.239222,0.069142,0.924157,-0.028016,-0.283644,-0.233953,-0.056391,0.176925,0.236797,-0.309774,0.021292,0.020069,-0.972721,-0.219711,-0.078375,-0.389197,-0.879693,-0.002308,0.997193,0.129594,0.593103,-0.251392,-0.054487,0.984396,-0.070566,-0.996397,-0.161145,-0.931884,0.853336,0.086695,-0.146199,0.040197,-0.025053,0.919121,-0.240841


### Make splits

In [18]:
targets = text['Months_until_decision']
features = text.drop(columns=['Months_until_decision',
                     'Decision_day', 'Decision_month', 'Decision_year'])
scaler = StandardScaler()
features = scaler.fit_transform(features)

features_training, features_test, targets_training, targets_test = train_test_split(
    features, targets, test_size=0.2, random_state=42)



### SVM

In [19]:
parameters = {'kernel': ('linear', 'poly', 'rbf',
                         'sigmoid'), 'C': [0.1, 1, 10]}
clf = GridSearchCV(SVR(), parameters, n_jobs=-1, cv=5, verbose=3)
clf.fit(features_training, targets_training)
print('SVM')
print(clf.best_params_)
predicted_test = clf.predict(features_test)
print(mean_squared_error(predicted_test, targets_test))

Fitting 5 folds for each of 12 candidates, totalling 60 fits
SVM
{'C': 0.1, 'kernel': 'linear'}
1146.1267993530737


### Decision tree

In [20]:
parameters = {'max_depth': (3, 6, 12, 25, None), 'min_samples_leaf': [1, 3, 7]}
clf = GridSearchCV(DecisionTreeRegressor(random_state=12),
                   parameters, n_jobs=-1, cv=5, verbose=3)
clf.fit(features_training, targets_training)
print('Tree')
print(clf.best_params_)
predicted_test = clf.predict(features_test)
print(mean_squared_error(predicted_test, targets_test))

Fitting 5 folds for each of 15 candidates, totalling 75 fits
Tree
{'max_depth': 3, 'min_samples_leaf': 1}
1469.9408970910997


### Feed-forward neural network

In [21]:
for learning_rate in [1, 0.1, 0.001, 0.0001]:
    tf.random.set_seed(12)
    model = keras.Sequential([keras.layers.Dense(10, activation='ReLU', input_shape=(
        1557,)), keras.layers.Dropout(0.1), keras.layers.Dense(1, activation='ReLU')])
    model.build()
    model.compile(optimizer=tf.optimizers.Adam(
        learning_rate=learning_rate), loss='mean_squared_error')
    model.fit(features_training, targets_training, batch_size=1, epochs=10)
    predicted_test = model.predict(features_test)
    print(learning_rate)
    print(mean_squared_error(targets_test, predicted_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
1
1701.6642974720746
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
0.1
1701.6642974720746
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
0.001
1148.4681955810413
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
0.0001
1219.7659823243757


The same observations as without the text embeddings apply. The scores are, as expected, bad.

What may be concerning is that using the textual features brings no gain: the best test MSE with text (about 1146) is worse than the best without it (about 1058). One possible reason is that the scraping had to make some assumptions that may not hold. The other is that the texts are simply not good features, which in my opinion is the case: I could not easily find a German BERT model that specialises in law or politics, and the small differences between such words matter for this task. A further reason why the task is hard is that the most important piece of information is the current political climate, which is very hard to turn into features and is not part of our data set.

Generally, politics is sometimes very unpredictable, even for the best experts.