## SUNIL KAMKAR 

### Create an algorithm that takes an HTML page as input and infers whether the page contains information about a cancer tumor board. What is a tumor board? A tumor board is a consilium of doctors (usually from different disciplines) who discuss the cancer cases in their departments. The expected result is a CSV file for the test data with the columns [doc_id and prediction].



You have the train.csv and test.csv files and a folder with the corresponding .html files.

#### Files:

- train.csv contains the following columns: url, doc_id and label
- test.csv contains the following columns: url and doc_id
- htmls contains files named {doc_id}.html
- keyword2tumor_type.csv contains useful keywords for the types of tumor boards

#### Description of tumor board labels:

- 1 (no evidence): tumor boards are not mentioned on the page
- 2 (medium confidence): tumor boards are mentioned, but the page is not completely dedicated to tumor board description
- 3 (high confidence): page is completely dedicated to the description of tumor board types and dates

You are asked to prepare a model using htmls, referred to in train.csv, and make predictions for htmls from test.csv

In [52]:
# load all the required libraries
import pandas as pd
import numpy as np
import json
import csv
#import xmltodict
import yaml
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [91]:
#pip install plotly

Collecting plotly
  Downloading plotly-5.11.0-py2.py3-none-any.whl (15.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.3/15.3 MB 2.2 MB/s eta 0:00:00
Collecting tenacity>=6.2.0
  Downloading tenacity-8.1.0-py3-none-any.whl (23 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.11.0 tenacity-8.1.0
Note: you may need to restart the kernel to use updated packages.


In [43]:
import pandas as pd

def load_data(filepath):
    extension = filepath.split(".")[-1]
    if extension == "csv":
        data = pd.read_csv(filepath)
    elif extension == "json":
        data = pd.read_json(filepath)
    elif extension=='xlsx':
        data=pd.read_excel(filepath)
        
    else:
        raise ValueError("Unsupported file format")
    
    print(data.head())


    return data



In [44]:
train_data = load_data("train.csv")

# Print the data


                                                 url  doc_id  label
0  http://elbe-elster-klinikum.de/fachbereiche/ch...       1      1
1  http://klinikum-bayreuth.de/einrichtungen/zent...       3      3
2  http://klinikum-braunschweig.de/info.php/?id_o...       4      1
3  http://klinikum-braunschweig.de/info.php/?id_o...       5      1
4  http://klinikum-braunschweig.de/zuweiser/tumor...       6      3


In [45]:
test_data = load_data("test.csv")

# Print the data


                                                 url  doc_id
0  http://chirurgie-goettingen.de/medizinische-ve...       0
1  http://evkb.de/kliniken-zentren/chirurgie/allg...       2
2  http://krebszentrum.kreiskliniken-reutlingen.d...       7
3  http://marienhospital-buer.de/mhb-av-chirurgie...      15
4  http://marienhospital-buer.de/mhb-av-chirurgie...      16


In [46]:
tumor_keywords=load_data('keyword2tumor_type.csv')


        keyword tumor_type
0  senologische      Brust
1  brustzentrum      Brust
2        breast      Brust
3        thorax      Brust
4     thorakale      Brust


In [47]:
# lets summarize the data 
import pandas as pd

def summarize_dataframe(df):
  # Get the shape of the DataFrame
  shape = df.shape

  # Print the shape
  print("Shape:", shape)

  # Iterate over the columns
  for column in df.columns:
    # Get the number of unique values
    n_unique = df[column].nunique()
    #print(f"{column}: {n_unique} unique values")

    # Get the value counts
    value_counts = df[column].value_counts()
    #print(value_counts)


In [48]:
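# summarize_dataframe prints the shape itself and returns None,
# so it is called directly instead of being wrapped in print()
summarize_dataframe(train_data)
summarize_dataframe(test_data)

Shape: (100, 3)
Shape: (48, 2)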


In [50]:
summarize_dataframe(tumor_keywords)

Shape: (126, 2)


In [51]:
tumor_keywords['tumor_type'].value_counts()

Lunge                           10
Darm                            10
Gynäkologie                      8
Interdisziplinär                 7
Haut                             7
Hämatooncology                   7
Magen                            7
Brust                            6
Urologische                      6
Kopf-hals                        6
Sarkome                          6
Endokrine malignome              5
Pädiatrische                     4
Mamma carcinoma                  4
Pankreas                         3
Prostata                         3
Gallenblasen/gallengangkrebs     3
Neuroonkologie                   3
Leber                            2
Hoden, penis                     2
Knochentumoren                   2
Niere                            1
Mikroskopieren                   1
Stammzelltransplantation         1
Schwerpunkt                      1
Prätherapeutische                1
Oral                             1
Molekular                        1
Harnblase           

In [53]:
tumor_keywords['keyword'].value_counts()

senologische          1
systemerkrankungen    1
weichteil             1
weichgewebstumoren    1
sarkome               1
                     ..
skin                  1
haut                  1
melanoma              1
dermo                 1
chirurgisch           1
Name: keyword, Length: 126, dtype: int64

In [55]:
# the .size() method counts the number of rows for each group
train_data.groupby(by="label").size()

label
1    32
2    59
3     9
dtype: int64

In [61]:
def read_html(doc_id: int) -> str:
    """
    Reads the HTML file that belongs to the given doc_id.
    The file is decoded as 'latin1' rather than the more
    common 'utf-8'. 'latin1' maps every byte to a character,
    so reading never fails, but pages that were saved as
    UTF-8 come out with garbled umlauts (e.g. 'Ã¤' for 'ä').
    """
    with open(file=f"htmls/{doc_id}.html",
              mode="r",
              encoding="latin1") as f:
        html = f.read()
    return html
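
The garbled umlauts in the outputs further down (for example `Ã¤` where `ä` is expected) are what UTF-8 bytes look like when they are decoded as latin1, so at least some of the pages appear to be UTF-8. A minimal sketch of a more forgiving reader, assuming it is acceptable to try UTF-8 first and fall back to latin1; the helper name `read_text_with_fallback` and the sample file are hypothetical and not part of the original notebook:

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path: Path) -> str:
    """Try UTF-8 first; fall back to latin1, which can decode any byte sequence."""
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        return path.read_text(encoding="latin1")


# Demonstration on a throwaway file (the content is made up for illustration).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.html"
    sample_path.write_text("<p>Pädiatrie</p>", encoding="utf-8")

    decoded_ok = read_text_with_fallback(sample_path)
    decoded_latin1 = sample_path.read_text(encoding="latin1")

print(decoded_ok)      # <p>Pädiatrie</p>
print(decoded_latin1)  # <p>PÃ¤diatrie</p>
```

Switching the reader would change every downstream output, so the results shown in this notebook still correspond to the latin1 version.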




In [62]:
# this will store the actual HTML text in the 'html' column
train_data["html"] = train_data["doc_id"].apply(read_html)

In [63]:
train_data

Unnamed: 0,url,doc_id,label,html
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1,<!DOCTYPE html>\n<!-- jsn_reta_pro 1.0.2 -->\n...
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3,"<!DOCTYPE html>\n<html class=""no-js"" lang=""de""..."
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<..."
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<..."
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<..."
...,...,...,...,...
95,http://www.unicross.uni-freiburg.de/thema/unifm/,140,1,"<!DOCTYPE html>\n<html lang=""de-DE""\nprefix=""o..."
96,http://www.uniklinik-duesseldorf.de/patienten-...,141,1,"<!DOCTYPE html>\n<html class=""no-js"" lang=""de""..."
97,http://www.vivantes.de/fuer-sie-vor-ort/klinik...,144,2,"<!DOCTYPE html>\n\n<html class=""no-js"" lang=""d..."
98,http://www.vivantes.de/fuer-sie-vor-ort/klinik...,145,2,"<!DOCTYPE html>\n\n<html class=""no-js"" lang=""d..."


As mentioned in the Tips section, we can use the BeautifulSoup package to parse the HTML content and pull out the data we need. Besides installing BeautifulSoup itself with pip install beautifulsoup4, you also need a proper HTML parser: lxml, installed with pip install lxml. The lxml library is OS-dependent, so we recommend reading its Installation page for the most accurate instructions.

In [66]:
#pip install lxml

In [70]:
import warnings
from bs4 import BeautifulSoup
warnings.filterwarnings(action='ignore')

def extract_html_text(html: str) -> str:
    
    bs=BeautifulSoup(markup=html, features='lxml')
    
    for script in bs(name=["script","style"]):
        script.decompose() # removes all script and style tags from html
    return bs.get_text(separator=" ")
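
To make the idea behind the extraction step concrete without any third-party dependency, here is a rough stand-in built on the standard library's `html.parser`. It is not the BeautifulSoup code above and it treats whitespace differently (it strips each text chunk); the class and function names are hypothetical:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that is outside script/style blocks
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><p>Tumorkonferenz</p><script>var x = 1;</script>"
    "<p>jeden Dienstag</p></body></html>"
)
print(visible_text(sample_html))  # Tumorkonferenz jeden Dienstag
```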


In [71]:
train_data['html_text']=train_data['html'].apply(extract_html_text)
train_data.head(5)

Unnamed: 0,url,doc_id,label,html,html_text
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1,<!DOCTYPE html>\n<!-- jsn_reta_pro 1.0.2 -->\n...,\n \n \n \n \n \n Elbe-Elster Klinikum - Chiru...
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3,"<!DOCTYPE html>\n<html class=""no-js"" lang=""de""...",\n \n \n \n \n \n \n Onkologisches Zentrum - K...
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Zentrum - SozialpÃ¤diatrisches Zentrum -...
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Leistung - Spezielle UnterstÃ¼tzung bei ...
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Zuweiser - Tumorkonferenzen - Tumorkonfe...


In [82]:
from gensim.parsing import preprocessing


def preprocess_html_text(html_text: str) -> str:
    """
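    The preprocessing consists of the following six steps:

    1. Strips all non-alphanumerical characters.
    2. Strips all multiple whitespaces.
    3. Strips all punctuation.
    4. Strips all numerical characters.
    5. Converts to lowercase and then stems the text
       (gensim's Porter stemmer, which is designed for English).
    6. Removes all stop-words (gensim's built-in English list,
       so German stop-words such as "bei" and "der" remain).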
    """
    preprocessed_text = preprocessing.strip_non_alphanum(s=html_text)
    preprocessed_text = preprocessing.strip_multiple_whitespaces(s=preprocessed_text)
    preprocessed_text = preprocessing.strip_punctuation(s=preprocessed_text)
    preprocessed_text = preprocessing.strip_numeric(s=preprocessed_text)

    preprocessed_text = preprocessing.stem_text(text=preprocessed_text)
    preprocessed_text = preprocessing.remove_stopwords(s=preprocessed_text)
    return preprocessed_text


train_data["preprocessed_html_text"] = train_data["html_text"].apply(preprocess_html_text)
train_data.head(5)

Unnamed: 0,url,doc_id,label,html,html_text,preprocessed_html_text
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1,<!DOCTYPE html>\n<!-- jsn_reta_pro 1.0.2 -->\n...,\n \n \n \n \n \n Elbe-Elster Klinikum - Chiru...,elb elster klinikum chirurgi finsterwald suche...
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3,"<!DOCTYPE html>\n<html class=""no-js"" lang=""de""...",\n \n \n \n \n \n \n Onkologisches Zentrum - K...,onkologisch zentrum klinikum bayreuth aktuel ã...
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Zentrum - SozialpÃ¤diatrisches Zentrum -...,zentrum sozialpã diatrisch zentrum stã dtisch ...
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Leistung - Spezielle UnterstÃ¼tzung bei ...,leistung speziel unterstã¼tzung bei der anmeld...
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Zuweiser - Tumorkonferenzen - Tumorkonfe...,zuweis tumorkonferenzen tumorkonferenz gastroi...


In [83]:
# just trying one more method to solve the above issue
import pandas as pd
from bs4 import BeautifulSoup
import re

def extract_words(df, column):
  # Collect the cleaned texts in a plain list; DataFrame.append was
  # deprecated in pandas 1.4 and removed in pandas 2.0
  cleaned_texts = []

  # Iterate over the values of the specified column
  for text in df[column]:
    # Parse the HTML using BeautifulSoup
    soup = BeautifulSoup(text, "html.parser")

    # Extract the text from the parsed HTML
    text = soup.get_text()

    # Replace every non-word character with a space
    text = re.sub(r"\W", " ", text)

    cleaned_texts.append(text)

  # Return the result as a one-column DataFrame
  return pd.DataFrame({column: cleaned_texts})




In [84]:
extract_words(train_data,'html_text')

Unnamed: 0,html_text
0,Elbe Elster Klinikum Chirurgie F...
1,Onkologisches Zentrum Klinikum...
2,Zentrum SozialpÃ diatrisches Zentrum S...
3,Leistung Spezielle UnterstÃ¼tzung bei de...
4,Zuweiser Tumorkonferenzen Tumorkonfere...
...,...
95,uniFM uniCROSS ...
96,Interdis...
97,FÃ¼r Ã rzte Vivantes...
98,Innere Medizin â ...


In [108]:
import plotly.express as px

px.histogram(x=train_data["preprocessed_html_text"].apply(lambda text: set(text.split(" "))).apply(len),
             title="Unique Words Count")

There is one document with 6500-7000 unique words. All others consist of < 2000 unique words.

In [109]:
# modeling part
import random
import numpy as np
import tensorflow as tf

# it is always useful to set the random seeds
# wherever possible, for reproducibility of results
random.seed(42)  # the pair generator draws classes and shuffles with the random module
np.random.seed(42)
tf.random.set_seed(seed=42)

# DATA GENERATOR

A siamese network performs well on tasks with little training data. The two subnetworks share their weights, so there are fewer parameters to learn during training. Siamese networks are especially useful when there are many classes with only a small number of observations per class.

The code below defines a class called Pair, which inherits from the Sequence class in the tf.keras.utils module. It generates the training data for a model that uses the siamese network architecture.

The __init__ method is the class constructor. It takes four arguments: a pandas object holding the texts (annotated as a DataFrame, although a Series of texts is what gets passed in later), a pandas Series of labels, n_batch, and batch_size, and stores them in the instance variables self.dataframe, self.labels, self.n_batch, and self.batch_size. It also records the set of classes in self.all_classes and builds a dict called self.anchor_groups that splits the data, for each class, into a positive group (rows of that class) and a negative group (all other rows).

The __len__ method returns the number of batches per epoch, which is the value of self.n_batch.

The __getitem__ method is called whenever the Pair instance is indexed (e.g. my_pair = Pair(...) followed by my_pair[0]); Keras indexes the sequence in this way during training to fetch each batch. It builds a list of pairs: for each randomly drawn anchor it adds one pair with a positive example (same class as the anchor, target 1) and one pair with a negative example (different class, target 0). The list is then shuffled and converted to a NumPy array, and the data pairs and the targets are returned as separate arrays.



The get_support_set method returns a sample set of a given size for each target class. It is not used during training; it is called at prediction time to supply the labelled examples that a new text is compared against.


When the method creates the pairs, it does so by generating a pair for each element in range(int(self.batch_size / 2)). This means that for each iteration of the loop, a pair of (anchor, positive example) and a pair of (anchor, negative example) are created, resulting in two pairs per iteration.

Therefore, dividing the batch_size by 2 ensures that the method creates the desired number of pairs. For example, if batch_size is 4, then int(self.batch_size / 2) is 2, so the method will create 2 pairs of (anchor, positive example) and 2 pairs of (anchor, negative example), resulting in a total of 4 pairs.
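
The counting argument above can be checked with a small standard-library sketch. It mirrors the loop structure of `__getitem__` but works on a plain dict of toy texts instead of pandas objects; `make_pairs` and `toy_groups` are hypothetical names used only for illustration:

```python
import random


def make_pairs(groups, batch_size, rng):
    """Build (anchor, other, target) pairs; two pairs per loop iteration."""
    classes = sorted(groups)
    pairs = []
    for _ in range(batch_size // 2):
        anchor_class = rng.choice(classes)
        positives = groups[anchor_class]
        negatives = [text for label in classes if label != anchor_class
                     for text in groups[label]]

        anchor = rng.choice(positives)
        pairs.append((anchor, rng.choice(positives), 1))  # same class
        pairs.append((anchor, rng.choice(negatives), 0))  # different class
    rng.shuffle(pairs)
    return pairs


# toy texts grouped by label (made up for illustration)
toy_groups = {1: ["a1", "a2"], 2: ["b1", "b2"], 3: ["c1"]}
toy_pairs = make_pairs(toy_groups, batch_size=4, rng=random.Random(42))

print(len(toy_pairs))                             # 4
print(sum(target for _, _, target in toy_pairs))  # 2
```

Half of the pairs always carry target 1 and half target 0, so the batches are balanced regardless of how the labels are distributed. As in the class itself, the positive example is drawn independently and can coincide with the anchor.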


In the Pair class, the __getitem__ method generates pairs by selecting random examples from the data in self.dataframe.

To select a random example, the method uses the pandas sample method, which returns a random sample of the data. Called with n=1, it returns an object containing a single row.

The method then uses .iloc[0] to access the first (and only) row of that sample. iloc selects by position rather than by index label, which matters here because the sampled row keeps its original index label.

In summary, the line anchor = anchor_group.sample(n=1).iloc[0] selects a random example from anchor_group and stores it in the anchor variable. The same is done for the anchor, positive, and negative examples in each iteration.
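
A small illustration of the `sample(n=1).iloc[0]` idiom on a toy Series; the texts and index labels are made up, and `random_state=0` is only there to make the draw repeatable:

```python
import pandas as pd

# toy Series whose index labels deliberately do not start at 0
texts = pd.Series(["seite a", "seite b", "seite c"], index=[10, 20, 30])

one_row = texts.sample(n=1, random_state=0)  # Series of length 1, original index label kept
picked = one_row.iloc[0]                     # positional access, independent of the label

print(len(one_row))          # 1
print(picked in set(texts))  # True
```

Because the sampled row keeps its original label (10, 20 or 30 here), label-based access with `one_row[0]` would not find it, which is why positional access through `iloc` is the safer choice.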

In [116]:
class Pair(tf.keras.utils.Sequence):
    def __init__(self, dataframe: pd.DataFrame, labels: pd.Series, n_batch: int, batch_size: int):
        """Initialization"""
        self.dataframe = dataframe
        self.labels = labels
        self.n_batch = n_batch
        self.batch_size = batch_size
        self.all_classes = set(self.labels)
        self.anchor_groups = {}
        for target_class in self.all_classes:
            self.anchor_groups[target_class] = {
                "positive": self.dataframe[self.labels == target_class],
                "negative": self.dataframe[self.labels != target_class]
            }

    def __len__(self):
        return self.n_batch

    def __getitem__(self, item):
        pairs = []

        # each iteration adds one positive and one negative pair
        for _ in range(self.batch_size // 2):
            # draw the anchor class from the labels that are actually present
            anchor_class = random.choice(list(self.all_classes))
            anchor_group = self.anchor_groups[anchor_class]["positive"]
            not_anchor_group = self.anchor_groups[anchor_class]["negative"]

            anchor = anchor_group.sample(n=1).iloc[0]
            positive = anchor_group.sample(n=1).iloc[0]
            negative = not_anchor_group.sample(n=1).iloc[0]

            pairs.append([anchor, positive, 1])  # same class -> target 1
            pairs.append([anchor, negative, 0])  # different class -> target 0

        random.shuffle(pairs)
        pairs = np.array(pairs)

        data_pairs = pairs[:, :2]
        targets = pairs[:, 2]

        return data_pairs, tf.convert_to_tensor(targets, dtype=np.float32)

    def get_support_set(self, sample_size: int = 1):
        """Returns sample sets of certain size of each target class"""
        support_set = {}
        for target_class in self.all_classes:
            support_set[target_class] = self.anchor_groups[target_class]["positive"].sample(n=sample_size)
        return support_set

The **get_support_set** method takes one parameter, sample_size, which specifies the size of the sample set to be returned for each target class. This parameter has a default value of 1, so if it is not provided, the sample sets will each have size 1.

First, the method creates an empty dictionary called support_set, which will be used to store the sample sets for each target class. Then, it iterates over all target classes using the all_classes attribute, which is a set of all unique target classes in the labels series.

For each target class, the method gets the positive sample group for that class from the anchor_groups attribute, which is a dictionary that maps target classes to their positive and negative sample groups. The positive group is the group of samples that belong to the target class, and the negative group is the group of samples that do not belong to the target class.

Next, the method samples a set of size sample_size from the positive group, and adds the sample set to the support_set dictionary using the target class as the key.

Finally, the method returns the support_set dictionary, which contains the sample sets for all target classes.
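
The same steps, written as a standalone function on toy data so they can be run without the Pair class. `build_support_set`, the toy texts and the `random_state` argument are assumptions made for this sketch:

```python
import pandas as pd

# toy texts and labels (made up for illustration)
texts = pd.Series(["a1", "a2", "a3", "b1", "b2", "c1", "c2"])
labels = pd.Series([1, 1, 1, 2, 2, 3, 3])


def build_support_set(texts, labels, sample_size=1, random_state=None):
    """Sample `sample_size` texts per class, keyed by class label."""
    support_set = {}
    for target_class in sorted(set(labels)):
        class_texts = texts[labels == target_class]
        support_set[target_class] = class_texts.sample(n=sample_size, random_state=random_state)
    return support_set


support_set = build_support_set(texts, labels, sample_size=2, random_state=0)
for target_class, sample in support_set.items():
    print(target_class, sorted(sample))
```

Note that `sample` draws without replacement by default and raises a ValueError when `n` is larger than the group, so `sample_size` cannot exceed the size of the smallest class.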






# model
Here, we define our model, as a siamese network. The model is a sequence of layers, starting with a TextVectorization layer. This layer accepts natural language (text) as input, and maps it to an integer sequence. At initialization time, we should provide a vocabulary of words for it to be able to map the words at prediction time.

Following the text vectorization layer, we implement three Dense layers, with two Dropout layers in between. Lastly, we apply an L2 normalization layer, which scales each output vector to unit length so that the dot product of two encodings equals their cosine similarity.

In our implementation of a siamese network, we override the __call__ method of the tf.keras.Model class. This is needed because of the nature of the model.

Siamese networks are trained on triplets: an anchor (baseline) input, a positive sample from the same class as the anchor, and a negative sample from a different class. Here each triplet is fed to the network as two pairs, so the anchor passes through the network twice: once together with the positive sample and once together with the negative sample. The outputs of the two passes are then compared against their targets. We expect the similarity to be high for the "positive pass" and low for the "negative pass", since we want samples from the same class to be as similar to each other as possible, and as different from samples of other classes as possible.
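
What the final two layers of the network compute can be shown with plain NumPy: L2-normalizing two vectors and taking their row-wise dot product gives their cosine similarity. The vectors below are made up, and `l2_normalize` is a hypothetical helper that stands in for `tf.math.l2_normalize`:

```python
import numpy as np


def l2_normalize(vectors, axis=1):
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(vectors, axis=axis, keepdims=True)
    return vectors / norms


# two toy "encodings" per side: the first pair is parallel, the second orthogonal
anchors = np.array([[3.0, 4.0], [1.0, 0.0]])
supports = np.array([[6.0, 8.0], [0.0, 2.0]])

anchors_unit = l2_normalize(anchors)
supports_unit = l2_normalize(supports)

# row-wise dot product, the same reduction a Dot layer with axes=1 performs
similarity = np.sum(anchors_unit * supports_unit, axis=1)
print(similarity)  # approximately [1. 0.]
```

Parallel vectors score 1 and orthogonal vectors score 0; in general the score lies between -1 and 1.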



In [119]:
class SiameseNetwork(tf.keras.Model):
    def __init__(self, corpora: pd.Series):
        super(SiameseNetwork, self).__init__()
        self.vectorizer_layer: tf.keras.layers.TextVectorization = tf.keras.layers.TextVectorization(
            max_tokens=2000,  # empirically chosen as best, higher number overfits (see the unique words count plot)
            output_mode="int",
            output_sequence_length=512
        )
        self.vectorizer_layer.adapt(corpora.values) #This line calls the adapt method of the vectorizer_layer attribute, which adapts the layer to the provided texts
        self.encoder = tf.keras.Sequential(layers=[
            self.vectorizer_layer,
            tf.keras.layers.Dense(units=256, activation=tf.keras.activations.relu),
            tf.keras.layers.Dropout(rate=0.3),
            tf.keras.layers.Dense(units=128, activation=tf.keras.activations.relu),
            tf.keras.layers.Dropout(rate=0.3),
            tf.keras.layers.Dense(units=64, activation=tf.keras.activations.relu),
            tf.keras.layers.Lambda(function=lambda x: tf.math.l2_normalize(x, axis=1))
        ])
        self.encoding_distance = tf.keras.layers.Dot(axes=1)

    def __call__(self, inputs, *args, **kwargs):
        anchors, supports = inputs[:, 0], inputs[:, 1]
        anchors_encoded = self.encoder(anchors)
        supports_encoded = self.encoder(supports)
        return self.encoding_distance((anchors_encoded, supports_encoded))

    def predict_with_support_set(self, entry, support_set: dict):
        """
        Custom method that wraps around the __call__ method.
        It compares the entry (input text) with every text in the
        support set and averages the similarity scores per class,
        which gives a more stable estimate than a single comparison.
        Returns the class with the highest average score.
        """
        scores = {}
        for instance_class, texts in support_set.items():
            class_scores = ([self(np.array([entry, text]).reshape((-1, 2))) for text in texts])
            scores[instance_class] = tf.math.reduce_mean(class_scores)
        return max(scores, key=scores.get)

The code defines a class called SiameseNetwork, which is a type of neural network that uses two identical sub-networks to process two input vectors and compare them. The class extends tf.keras.Model, which is a high-level API for building and training models in TensorFlow.

The __init__ method of the SiameseNetwork class takes a corpora argument, which is expected to be a Pandas series containing a collection of texts. The method initializes the parent class with super(), and then creates a vectorizer_layer attribute, which is an instance of the tf.keras.layers.TextVectorization class. This layer is used to convert the input texts into numerical vectors.

The __call__ method of the SiameseNetwork class is used to define the forward pass of the network, which takes two input vectors (anchors and supports), encodes them using the encoder attribute (which is a tf.keras.Sequential model), and then compares the encoded vectors using the encoding_distance attribute (which is a tf.keras.layers.Dot layer with axes=1).

The predict_with_support_set method is a custom method that wraps around the __call__ method. It is used to pass the input text multiple times through the model to average out the outputs and provide more stable estimates. It takes an entry (input text) and a support_set, which is a dictionary mapping classes to lists of texts belonging to those classes. It returns the class with the highest average score.

In the predict_with_support_set method, the entry (input text) and one text from the support_set are combined into a NumPy array before being passed to the __call__ method. np.array([entry, text]) has shape (2,), whereas __call__ slices its input as inputs[:, 0] and inputs[:, 1] and therefore expects shape (batch_size, 2). The call to reshape((-1, 2)) turns the array into shape (1, 2), that is, a batch containing a single (entry, support text) pair.
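
The effect of `reshape((-1, 2))` on a single pair of texts can be checked directly; the two strings are placeholders:

```python
import numpy as np

entry = "text of the page to classify"
support_text = "text of a labelled page"

pair = np.array([entry, support_text])
batch = pair.reshape((-1, 2))

print(pair.shape)   # (2,)
print(batch.shape)  # (1, 2)

# the same slicing that __call__ applies to its input
anchors, supports = batch[:, 0], batch[:, 1]
print(anchors[0] == entry, supports[0] == support_text)  # True True
```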




In [144]:

model = SiameseNetwork(corpora=train_data["preprocessed_html_text"])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])



2022-12-14 23:54:39.739210: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


In [122]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(train_data["preprocessed_html_text"], train_data["label"],
                                                      test_size=0.2,
                                                      random_state=42, stratify=train_data["label"])
# stratify keeps the class proportions of the full data set in both splits;
# without it, the rare label 3 could end up entirely in one of them
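
As a rough back-of-the-envelope check of what stratification implies here, the label counts from train.csv can be scaled by `test_size`. This is only the proportional arithmetic, not scikit-learn's exact allocation procedure, although for these counts simple rounding reproduces the support column of the classification report shown later:

```python
# label counts taken from train_data.groupby(by="label").size()
label_counts = {1: 32, 2: 59, 3: 9}
test_size = 0.2

expected_valid = {label: round(count * test_size) for label, count in label_counts.items()}
expected_train = {label: count - expected_valid[label] for label, count in label_counts.items()}

print(expected_valid)  # {1: 6, 2: 12, 3: 2}
print(expected_train)  # {1: 26, 2: 47, 3: 7}
```

Only about 7 training pages carry label 3, which is also the largest support-set size that can be sampled for that class.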

In [123]:
# training params
BATCH_SIZE = 64
N_BATCH = 100
# we instantiate training and validation data / pair generators
TRAIN_PAIR_GENERATOR = Pair(dataframe=X_train, labels=y_train, n_batch=N_BATCH, batch_size=BATCH_SIZE)
VALID_PAIR_GENERATOR = Pair(dataframe=X_valid, labels=y_valid, n_batch=N_BATCH, batch_size=BATCH_SIZE)

In [124]:
early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)

In [129]:
history = model.fit(
    x=TRAIN_PAIR_GENERATOR,
    validation_data=VALID_PAIR_GENERATOR,
    epochs=10,
    callbacks=[early_stopping_callback],
    verbose=1
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


In [130]:
import plotly.graph_objects as go
figure = go.Figure()

figure.add_scatter(y=history.history["binary_accuracy"], name="Training Accuracy")
figure.add_scatter(y=history.history["val_binary_accuracy"], name="Validation Accuracy")

figure.update_layout(dict1={
    "title": "Model Accuracy During Training",
    "xaxis_title": "Epoch",
    "yaxis_title": "Accuracy"
}, overwrite=True)

figure.show()

In [131]:
figure = go.Figure()

figure.add_scatter(y=history.history["loss"], name="Training Loss")
figure.add_scatter(y=history.history["val_loss"], name="Validation Loss")

figure.update_layout(dict1={
    "title": "Model Loss During Training",
    "xaxis_title": "Epoch",
    "yaxis_title": "Loss"
}, overwrite=True)

figure.show()

In [132]:
y_pred = X_valid.apply(lambda text: model.predict_with_support_set(
    entry=text,
    support_set=TRAIN_PAIR_GENERATOR.get_support_set(7)
))

In [133]:
# build a classification report
from sklearn.metrics import classification_report

report = classification_report(y_true=y_valid, y_pred=y_pred, zero_division=0)
print(report)

              precision    recall  f1-score   support

           1       0.56      0.83      0.67         6
           2       0.91      0.83      0.87        12
           3       0.00      0.00      0.00         2

    accuracy                           0.75        20
   macro avg       0.49      0.56      0.51        20
weighted avg       0.71      0.75      0.72        20



In [134]:
# this stores the actual HTML text in the 'html' column
test_data["html"] = test_data["doc_id"].apply(read_html)
# extracts text elements from the HTML
test_data["html_text"] = test_data["html"].apply(extract_html_text)
# cleans the extracted text
test_data["preprocessed_html_text"] = test_data["html_text"].apply(preprocess_html_text)
test_data.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,html,html_text,preprocessed_html_text
27,http://www.josephstift-dresden.de/pressemittei...,71,"<?xml version=""1.0"" encoding=""utf-8""?>\n<rss v...",\n \n Krankenhaus St. Joseph-Stift Dresden (PM...,krankenhau st joseph stift dresden pm http www...
40,http://www.pius-hospital.de/kliniken/gynaekolo...,123,"<!DOCTYPE html>\n<html lang=""de"" dir=""ltr"" pre...",\n \n \n \n \n \n \n \n \n \n \n Patienteninfo...,patienteninformationen klinik fã¼r frauenheilk...
26,http://www.interdisziplinaere-endoskopie.mri.t...,70,"<!DOCTYPE html>\n<html lang=""de"">\n\t<!--[if I...",\n \n \n \n \n \n \n Herzlich Willkommen â...,herzlich willkommen â interdisziplinã endoskop...
43,http://www.uk-augsburg.de/krebsbehandlung/diag...,134,"<!DOCTYPE html> \n<html lang=""de""> \n\t<head> ...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Z...,zweitmeinung â warum ein weiter einschã tzung ...
24,http://www.hjk-muenster.de/unsere-kompetenzen/...,68,"<!DOCTYPE html><html lang=""de"" class=""no-js""><...",Gelenkersatz Skip to main content hjk Die Ei...,gelenkersatz skip main content hjk die einrich...


In [135]:
test_data["predictions"] = test_data["preprocessed_html_text"].apply(lambda text: model.predict_with_support_set(
    entry=text,
    support_set=TRAIN_PAIR_GENERATOR.get_support_set(sample_size=7)
))
test_data.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,html,html_text,preprocessed_html_text,predictions
27,http://www.josephstift-dresden.de/pressemittei...,71,"<?xml version=""1.0"" encoding=""utf-8""?>\n<rss v...",\n \n Krankenhaus St. Joseph-Stift Dresden (PM...,krankenhau st joseph stift dresden pm http www...,2
40,http://www.pius-hospital.de/kliniken/gynaekolo...,123,"<!DOCTYPE html>\n<html lang=""de"" dir=""ltr"" pre...",\n \n \n \n \n \n \n \n \n \n \n Patienteninfo...,patienteninformationen klinik fã¼r frauenheilk...,1
26,http://www.interdisziplinaere-endoskopie.mri.t...,70,"<!DOCTYPE html>\n<html lang=""de"">\n\t<!--[if I...",\n \n \n \n \n \n \n Herzlich Willkommen â...,herzlich willkommen â interdisziplinã endoskop...,1
43,http://www.uk-augsburg.de/krebsbehandlung/diag...,134,"<!DOCTYPE html> \n<html lang=""de""> \n\t<head> ...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Z...,zweitmeinung â warum ein weiter einschã tzung ...,2
24,http://www.hjk-muenster.de/unsere-kompetenzen/...,68,"<!DOCTYPE html><html lang=""de"" class=""no-js""><...",Gelenkersatz Skip to main content hjk Die Ei...,gelenkersatz skip main content hjk die einrich...,2


In [137]:
test_data["predictions"].value_counts()

2    26
1    22
Name: predictions, dtype: int64

In [138]:
test_data[["doc_id", "predictions"]]

Unnamed: 0,doc_id,predictions
0,0,2
1,2,2
2,7,2
3,15,2
4,16,2
5,24,2
6,31,1
7,32,1
8,36,1
9,38,2
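
The task statement asks for a CSV file with the columns doc_id and prediction. A minimal sketch of that last step, using a three-row stand-in for `test_data`; the file name `predictions.csv` and the temporary directory are assumptions, and the column is renamed from `predictions` to `prediction` to match the requested header:

```python
import os
import tempfile

import pandas as pd

# stand-in for test_data with the first three rows shown above
predictions_frame = pd.DataFrame({"doc_id": [0, 2, 7], "predictions": [2, 2, 2]})

submission = predictions_frame[["doc_id", "predictions"]].rename(
    columns={"predictions": "prediction"}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, "predictions.csv")
    submission.to_csv(output_path, index=False)  # index=False keeps the row index out of the file
    with open(output_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```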


The model is very prone to overfitting, which is almost certainly due to the small data set.
Precision and recall are both zero for label = 3, and no test page is predicted as label 3 either. With only 9 training pages carrying that label, this is concerning and should be addressed.
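
To see how much the missing label 3 costs, the averages in the classification report can be recomputed by hand. The support values come from the report; the `predicted` and `correct` counts are inferred from the rounded precision and recall values and should be read as an assumption rather than as logged output:

```python
support = {1: 6, 2: 12, 3: 2}    # true pages per class (from the report)
predicted = {1: 9, 2: 11, 3: 0}  # inferred: how often each class was predicted
correct = {1: 5, 2: 10, 3: 0}    # inferred: correctly predicted pages per class


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a class is never predicted."""
    return numerator / denominator if denominator else 0.0


f1 = {}
for label in support:
    precision = safe_div(correct[label], predicted[label])
    recall = safe_div(correct[label], support[label])
    f1[label] = safe_div(2 * precision * recall, precision + recall)

total = sum(support.values())
accuracy = sum(correct.values()) / total
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[label] * support[label] for label in support) / total

print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))  # 0.75 0.51 0.72
```

The macro average weights all three classes equally, so the zero for label 3 pulls it down to 0.51, while accuracy and the weighted average hide most of the problem because label 3 accounts for only 2 of the 20 validation pages.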