# Applying the TF-IDF transformation

Before we apply the TF-IDF transformation, we have to set aside some
test data for evaluating our model later. Otherwise, the TF-IDF statistics would be computed on the
entire dataset, and a future Machine Learning model could pick up information about the test data through them.
However, the entire purpose of the train-test-split is to evaluate the model on
data it has not seen before.

In [1]:
import pandas as pd

df = pd.read_json("../data/processed/data.json")
df = df.loc[df["Procedures_Length"] > 0, [
    "Label", 
    "Procedures", 
    "Description", 
    "Procedures_Length", 
    "Description_Length",
    "Procedures_Description_Ratio"
]]
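
To make the leakage argument concrete, here is a minimal sketch in plain Python. The toy sentences and the helper `document_frequencies` are made up for illustration and are not part of the SCP dataset or of `sklearn`. It shows that document frequencies, the statistic behind the IDF part of TF-IDF, change as soon as test documents are included.

```python
from collections import Counter


def document_frequencies(docs):
    """Count in how many documents each lowercase word occurs."""
    counts = Counter()
    for doc in docs:
        # A set makes sure each word is counted once per document.
        counts.update(set(doc.lower().split()))
    return counts


# Hypothetical toy corpus, not taken from the SCP dataset.
train_docs = ["the object is safe", "the object is hostile"]
test_docs = ["the anomaly is hostile", "the anomaly escaped"]

df_train = document_frequencies(train_docs)
df_all = document_frequencies(train_docs + test_docs)

# Words that only enter the statistics because the test set was included.
leaked_terms = sorted(set(df_all) - set(df_train))
print(leaked_terms)
print(df_train["hostile"], df_all["hostile"])
```

Both the vocabulary and the counts differ between the two variants, which is exactly the information a model should not have about the test set.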

## Making a train-test-split

With `sklearn`, splitting a `DataFrame` reduces to calling the `train_test_split`
function from the `model_selection` module. The `test_size` argument determines
the relative size of the test set.

In [2]:
from sklearn.model_selection import train_test_split

X, y = df.drop(columns=["Label"]), df["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Note that we split up our target column `Label` from the rest so that it will
not be included in the following transformations.
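
As a quick sanity check, the following sketch repeats the split on a made-up `DataFrame` with 20 rows (the column contents are placeholders, not real data) and verifies that the two parts have the expected sizes and do not share any rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 20 rows standing in for the real DataFrame.
toy = pd.DataFrame({
    "Label": ["SAFE", "EUCLID"] * 10,
    "Description": [f"article {i}" for i in range(20)],
})

X, y = toy.drop(columns=["Label"]), toy["Label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# With test_size=0.1, 2 of the 20 rows end up in the test set.
print(len(X_train), len(X_test))
# No row index appears in both parts.
print(set(X_train.index) & set(X_test.index))
# The target column is not part of the features.
print("Label" in X_train.columns)
```

If the classes are imbalanced, passing `stratify=y` keeps the class proportions similar in both parts; whether that is needed here is a judgment call.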

## Fitting `TfidfVectorizer`s

Since we have two text columns (`Procedures` and `Description`), it is best to
fit two `TfidfVectorizer`s so that the information contained in each column is
preserved separately.
The rest of the features should be _scaled_, as certain models encounter
numerical problems when two features are on very different scales (that is to
say, one feature is usually very large, e.g. $\gg 10^6$, while another only attains
values between 0 and 1). To do all of this in one go, `sklearn` provides us with
a [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) that takes a list of tuples, each consisting of a name,
a transformer, and the column that the transformer should be applied to. Additionally,
the `ColumnTransformer`'s `remainder` keyword argument may be another
transformer that will be applied to the remaining columns. Here's how to use it:


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


columnwise_tfidf = ColumnTransformer(
    [
        (
            "procedures", 
            TfidfVectorizer(), 
            "Procedures"
        ),
        (
            "desc", 
            TfidfVectorizer(), 
            "Description"
        )
    ],
    remainder=StandardScaler(),
    n_jobs=-1,
)

The first item in each tuple is a name for the transformation for later
reference. The second item, the [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidfvectorizer#sklearn.feature_extraction.text.TfidfVectorizer) with its default arguments, constructs the
TF-IDF vectors in almost the same way that I explained in the blog post accompanying this part of the project. The only
difference is that the document frequency of each word is increased by one to
prevent division by zero (this is the `smooth_idf=True` default). The third item is the column to transform. Finally, the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) passed as `remainder` scales the
remaining features such that they have zero mean and unit standard deviation.
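
For the curious, the smoothing can be checked numerically. As far as I can tell from the `sklearn` documentation, the default IDF of a term $t$ is $\ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, where $n$ is the number of documents and $\mathrm{df}(t)$ the number of documents containing $t$. The sketch below compares this formula with the fitted `idf_` attribute on a made-up toy corpus.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the SCP dataset.
docs = [
    "the reptile is hostile",
    "the statue is hostile",
    "the reptile regenerates",
]
vectorizer = TfidfVectorizer().fit(docs)
n_docs = len(docs)

# Document frequencies counted by hand for three of the terms.
document_frequency = {"the": 3, "reptile": 2, "regenerates": 1}

expected_idf = {}
learned_idf = {}
for term, df in document_frequency.items():
    # Smoothed IDF: both numerator and denominator are increased by one.
    expected_idf[term] = math.log((1 + n_docs) / (1 + df)) + 1
    # `vocabulary_` maps each term to its column index in `idf_`.
    learned_idf[term] = vectorizer.idf_[vectorizer.vocabulary_[term]]
    print(term, round(expected_idf[term], 4), round(learned_idf[term], 4))
```

A word that occurs in every document, such as "the", ends up with an IDF of exactly 1, while rarer words receive larger weights.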

Applying this `ColumnTransformer` to our train set follows the usual `sklearn`
API. Each `Transformer` has `fit` and `transform` methods. Here, the first is
used _solely on the train set_ to fit the `Transformer`. Afterwards, the second
may be used to transform both the train and test set.

In [4]:
columnwise_tfidf.fit(X_train)
X_train_transformed = columnwise_tfidf.transform(X_train)

Conveniently, most transformers have a `fit_transform` method that combines
these two steps into one:

In [5]:
X_train_transformed = columnwise_tfidf.fit_transform(X_train)
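
To illustrate what fitting on the train set only means in practice, here is a small sketch on made-up data. The column names `text` and `length` are placeholders, and `sparse_threshold=0` is only set so that the result is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are placeholders.
train = pd.DataFrame({
    "text": ["the cat sat", "the dog barked", "a cat purred"],
    "length": [10.0, 20.0, 30.0],
})
test = pd.DataFrame({"text": ["the bird sang"], "length": [40.0]})

toy_transformer = ColumnTransformer(
    [("text", TfidfVectorizer(), "text")],
    remainder=StandardScaler(),
    sparse_threshold=0,  # always return a dense array for easy inspection
)

train_t = toy_transformer.fit_transform(train)  # statistics come from train only
test_t = toy_transformer.transform(test)        # reuses the train statistics

# Six vocabulary columns plus one scaled remainder column.
print(train_t.shape, test_t.shape)
# "bird" and "sang" were never seen during fitting, so only "the" is non-zero.
print(np.count_nonzero(test_t[0, :-1]))
# The test length is scaled with the mean and standard deviation of the train set.
print(test_t[0, -1], (40.0 - 20.0) / np.std([10.0, 20.0, 30.0]))
```

Words that only occur in the test data are silently dropped, and the scaled test value is not centered around zero, because both transformers only know the statistics of the train set.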

## Extracting keywords

Let us use the fitted transformers to extract keywords from articles. First, we will extract the vocabulary as determined by the `TfidfVectorizer`s. To distinguish between the words from the Procedures and the Description, we will add a prefix to each of them.

In [6]:
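def vocabulary():
    # Note: scikit-learn 1.0 introduced `get_feature_names_out()` as the
    # replacement and 1.2 removed `get_feature_names()`; adapt the calls below
    # when running this on a newer version.
    return (
        [f"proc__{name}" for name in columnwise_tfidf.named_transformers_["procedures"].get_feature_names()]
        + [f"desc__{name}" for name in columnwise_tfidf.named_transformers_["desc"].get_feature_names()]
    )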

Note that the names we have provided for the `TfidfVectorizer`s earlier now come into play.

Second, let's write a function accepting an article and returning a `DataFrame` containing the words with the highest TF-IDF weights.

In [7]:
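def extract_keywords(article, topn=10):
    # `article` is a one-row DataFrame with the same columns as X_train.
    article_transformed = columnwise_tfidf.transform(article).toarray()[0]
    # zip stops at the end of the vocabulary, so the scaled remainder columns
    # at the end of the transformed row are ignored.
    frequencies = list(zip(vocabulary(), article_transformed))
    # Sort by descending TF-IDF weight.
    frequencies.sort(key=lambda x: -x[1])
    return pd.DataFrame(frequencies[:topn])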
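
The same ranking idea can be tried out in isolation. The sketch below uses a single `TfidfVectorizer` on a made-up corpus; the helper `top_keywords` is hypothetical and differs from the function above in that it reads the non-zero entries of the sparse row directly and recovers the terms from the `vocabulary_` attribute, which should work regardless of the `sklearn` version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def top_keywords(vectorizer, text, topn=3):
    """Return the `topn` (term, weight) pairs with the highest TF-IDF weight."""
    row = vectorizer.transform([text])  # sparse matrix with a single row
    # `vocabulary_` maps term -> column index; invert it into an ordered list.
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    # Only the non-zero entries of the row are stored in `indices` and `data`.
    pairs = [(terms[i], weight) for i, weight in zip(row.indices, row.data)]
    pairs.sort(key=lambda pair: -pair[1])
    return pairs[:topn]


# Hypothetical toy corpus, not taken from the SCP dataset.
docs = [
    "the reptile is hostile",
    "the statue is hostile",
    "the reptile regenerates",
]
toy_vectorizer = TfidfVectorizer().fit(docs)

keywords = top_keywords(toy_vectorizer, "the reptile regenerates")
for term, weight in keywords:
    print(term, round(weight, 3))
```

The rarest word comes out on top and the word occurring in every document comes last, which is the behavior we hope for when extracting keywords.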

Finally, let's extract keywords from one of the most iconic SCP articles: the one for [SCP-682](http://www.scp-wiki.net/scp-682). This is one of the best examples of Keter class SCPs.

In [8]:
scp_682 = df.loc[df["Description"].str.startswith("SCP-682")].drop(columns=["Label"])
extract_keywords(scp_682)

|   | 0 | 1 |
|---|---|---|
| 0 | proc__682 | 0.767357 |
| 1 | desc__kia | 0.738121 |
| 2 | desc__682 | 0.523255 |
| 3 | desc__agent | 0.171312 |
| 4 | desc__personnel | 0.156161 |
| 5 | proc__speak | 0.153737 |
| 6 | proc__acid | 0.144138 |
| 7 | proc__to | 0.133515 |
| 8 | desc__pvt | 0.110179 |
| 9 | proc__scp | 0.107281 |


This does not look too promising. First, numbers should probably be ignored. Second, there are words such as "to" and "of" that appear in almost every English text. "speak" might also not tell us much. This only gets worse if we look at the top 30 keywords.

In [9]:
extract_keywords(scp_682, topn=30)

|   | 0 | 1 |
|---|---|---|
| 0 | proc__682 | 0.767357 |
| 1 | desc__kia | 0.738121 |
| 2 | desc__682 | 0.523255 |
| 3 | desc__agent | 0.171312 |
| 4 | desc__personnel | 0.156161 |
| 5 | proc__speak | 0.153737 |
| 6 | proc__acid | 0.144138 |
| 7 | proc__to | 0.133515 |
| 8 | desc__pvt | 0.110179 |
| 9 | proc__scp | 0.107281 |


## Fine-tuning the `TfidfVectorizer`

Fortunately, `TfidfVectorizer` has a lot of options to fine-tune its behavior. First and maybe most importantly, we can have certain words ignored via the `stop_words` keyword argument. It expects either the string "english", in which case it uses a list constructed by the `sklearn` developers (with its own set of disadvantages), or a list of strings containing the words that shall be ignored.

Second, we can specify a regex pattern via the `token_pattern` keyword argument. This pattern is used when parsing the articles to build up the vocabulary. The default pattern matches tokens of two or more word characters, which includes digits; we will modify it to only match words of two or more letters. Third, `strip_accents='unicode'` removes accents from characters before tokenizing.

In [10]:
columnwise_tfidf = ColumnTransformer(
    [
        (
            "procedures", 
            TfidfVectorizer(
                stop_words="english", 
                strip_accents='unicode',
                token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
            ), 
            "Procedures"
        ),
        (
            "desc", 
            TfidfVectorizer(
                stop_words="english", 
                strip_accents='unicode',
                token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b'
            ), 
            "Description"
        )
    ],
    remainder=StandardScaler()
)

columnwise_tfidf.fit(X_train)

ColumnTransformer(n_jobs=None,
                  remainder=StandardScaler(copy=True, with_mean=True,
                                           with_std=True),
                  sparse_threshold=0.3, transformer_weights=None,
                  transformers=[('procedures',
                                 TfidfVectorizer(analyzer='word', binary=False,
                                                 decode_error='strict',
                                                 dtype=<class 'numpy.float64'>,
                                                 encoding='utf-8',
                                                 input='content',
                                                 lowercase=True, max_df=1.0,
                                                 max_features=None, min_df=1...
                                                 dtype=<class 'numpy.float64'>,
                                                 encoding='utf-8',
                                                 input='co
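
To see what these options do to a single sentence, one can inspect the analyzer that a `TfidfVectorizer` builds from its settings. The sentence below is made up for illustration; `build_analyzer` returns the function that turns a raw string into the list of tokens entering the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example sentence, not a quote from the article.
sentence = "SCP-682 is to be submerged in acid"

# Default settings: tokens of two or more word characters, no stop words.
default_analyzer = TfidfVectorizer().build_analyzer()

# Tuned settings as in the cell above: letters only, English stop words removed.
tuned_analyzer = TfidfVectorizer(
    stop_words="english",
    strip_accents="unicode",
    token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z]+\b",
).build_analyzer()

default_tokens = default_analyzer(sentence)
tuned_tokens = tuned_analyzer(sentence)

print(default_tokens)
print(tuned_tokens)
```

With the tuned settings, the number and the filler words disappear, while "scp" survives because it is not part of the built-in English stop word list.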

In [11]:
extract_keywords(scp_682, topn=30)

|   | 0 | 1 |
|---|---|---|
| 0 | desc__kia | 0.890278 |
| 1 | proc__speak | 0.272335 |
| 2 | proc__acid | 0.255331 |
| 3 | desc__agent | 0.206627 |
| 4 | proc__scp | 0.190041 |
| 5 | desc__personnel | 0.188352 |
| 6 | proc__attempts | 0.174127 |
| 7 | proc__reacted | 0.169915 |
| 8 | proc__incapacitation | 0.161413 |
| 9 | proc__fear | 0.155381 |


This looks much better. A few remarks:

- I had to look up the two abbreviations "kia" and "pvt". The first stands for "killed in action", while the second stands for the military rank "Private".
- On second thought, "speak" *may* carry the information that the SCP object is able to speak and thus might hint at it being sapient. As sapient SCPs are probably more likely to be of class Euclid or Keter, this could be valuable information for a Machine Learning model.
- One could start building a custom list of stop words more suitable for parsing SCP articles. In the list above, the words "best" and "called" as well as "scp" could be ignored. I will postpone this to the next part of this series of posts. Because some models give some insight into their learning process, we can use them to see whether their decisions are based on filler words.
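
In the meantime, here is a minimal sketch of what such a custom list could look like. It assumes that extending the built-in `ENGLISH_STOP_WORDS` set is a reasonable starting point; the three extra words are the ones mentioned above, and the example sentence is made up.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Assumption: start from sklearn's English list and add SCP-specific fillers.
custom_stop_words = sorted(ENGLISH_STOP_WORDS.union({"scp", "best", "called"}))

custom_analyzer = TfidfVectorizer(
    stop_words=custom_stop_words,
    strip_accents="unicode",
    token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z]+\b",
).build_analyzer()

# Made-up example sentence, not a quote from the article.
custom_tokens = custom_analyzer("SCP-682 is best called a hostile reptile")
print(custom_tokens)
```

Passing `custom_stop_words` to both `TfidfVectorizer`s in the `ColumnTransformer` would apply the same filtering to the real data.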