# 📘 Explainable Stylo Workshop Notebook

Welcome to the Colab notebook for the **Explainable Stylo Workshop**.

This notebook helps you:
- Load the [`cl_explainable_stylo`](https://github.com/remolek/Wshop-ExplainableStylo) package from GitHub
- Prepare to work with textual data stored in folders like `example1/`, `example2/`
- Understand how to explore and use the package interactively


In [3]:
# 🔧 Setup: Clone the repository and install the package
import os

# Clone the repo only if it's not already cloned
if not os.path.exists("Wshop-ExplainableStylo"):
    !git clone https://github.com/remolek/Wshop-ExplainableStylo.git
%cd Wshop-ExplainableStylo

Cloning into 'Wshop-ExplainableStylo'...
remote: Enumerating objects: 104, done.
remote: Counting objects: 100% (104/104), done.
remote: Compressing objects: 100% (93/93), done.
remote: Total 104 (delta 34), reused 40 (delta 5), pack-reused 0 (from 0)
Receiving objects: 100% (104/104), 775.69 KiB | 18.92 MiB/s, done.
Resolving deltas: 100% (34/34), done.
/content/Wshop-ExplainableStylo/Wshop-ExplainableStylo


## 📦 Import the Package

Now that the package is downloaded, you can import its modules and use them in the notebook.
The `base` module imported below contains the main class, which wraps most of the package's other functions.


In [4]:
# Import a module from cl_explainable_stylo
from cl_explainable_stylo import base

# Help or test it
# help(example_loader)

In [5]:
help(base)

Help on module cl_explainable_stylo.base in cl_explainable_stylo:

NAME
    cl_explainable_stylo.base - This module defines a base class 'explain_style' providing a pipeline for preprocessing texts (currently by Spacy), extracting their features, classifying them with LGBM, explaining the classifiers with SHAP and visualising the results.

DESCRIPTION
    To initialize the class you need a .json file with metadata of the form, minimally:
    {"experiment_name":"...",
    "labels": ["filename", "class"],
    "files":
            {"filename": ["path_to_file1.txt","path_to_file2.txt"],
            "class": ["file1_class", "file2_class"]}
     }
    
    Author: Jeremi K. Ochab
    Date: August 14, 2023

CLASSES
    builtins.object
        explain_style
    
    class explain_style(builtins.object)
     |  explain_style(metadata_json, manual=True)
     |  
     |  Methods defined here:
     |  
     |  __init__(self, metadata_json, manual=True)
     |      Initialize self.  See help(type(s

## 📁 Example1: two translations

Let's start with a simple example of comparing two translations of Joseph Conrad's *Heart of Darkness* into Polish, by Aniela Zagórska (1930) and Jacek Dukaj (2017).

In [6]:
import os
from glob import glob

# 🔍 Find all folders starting with "example"
# example_dirs = sorted([d for d in os.listdir() if d.startswith("example") and os.path.isdir(d)])
# print(f"Detected folders: {example_dirs}")

# Find and preview all .txt files inside each folder
folder = 'example1'
txt_files = sorted(glob(os.path.join(folder, "*/*.txt")))
print(f"\n📁 {folder} contains {len(txt_files)} .txt files:")
for path in txt_files:
    print(f"  📄 {os.path.basename(path)}")
    with open(path, "r", encoding="utf-8") as f:
        preview = f.read(200).strip().replace("\n", " ")
    print(f"    Preview:\n    {preview[:150]}{'...' if len(preview) > 150 else ''}")


📁 example1 contains 2 .txt files:
  📄 Dukaj.txt
    Preview:
    SERCE CIEMNOŚCI Jacek Dukaj ISBN 978-83-08-06606-5   Wszelka sztuka przemawia przede wszystkim do zmysłów. I także artysta słowa pisanego musi przemaw...
  📄 Zagórska.txt
    Preview:
    Tytuł oryginału: Heart of Darkness Przełożyła: Aniela Zagórska ISBN: 978-83-7779-308-4   I  Jacht krążowniczy „Nellie” obrócił się na kotwicy bez najl...


### 📄 Text loading and processing

We have prepared beforehand a small `.json` file containing the basic metadata for our first experiment (at this point it might seem a bit redundant).
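
For your own corpus, a metadata file in the minimal form described in the `base` module's help text could be written roughly as follows. This is only a sketch: the experiment name, file paths and class names are hypothetical placeholders, the length check reflects an assumption that every label needs one entry per file, and the file is written to a temporary directory so that nothing in the repository is overwritten.

```python
import json
import os
import tempfile

# Minimal metadata in the form given by help(base).
# All names and paths below are hypothetical placeholders.
metadata = {
    "experiment_name": "my_experiment",
    "labels": ["filename", "class"],
    "files": {
        "filename": ["my_corpus/text_a.txt", "my_corpus/text_b.txt"],
        "class": ["author_a", "author_b"],
    },
}

# Assumption: each label list has exactly one entry per file.
n_files = len(metadata["files"]["filename"])
for label in metadata["labels"]:
    assert len(metadata["files"][label]) == n_files, f"'{label}' has the wrong length"

# ensure_ascii=False keeps non-ASCII names (e.g. 'Zagórska') readable in the file.
out_path = os.path.join(tempfile.mkdtemp(), "init_metadata.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(metadata, f, ensure_ascii=False, indent=2)

print(f"Metadata written to {out_path}")
```

The workshop's own file for this example additionally carries an `author` label, as the initialisation log shows.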

In [7]:
import json
from IPython.display import JSON, display

# Load the JSON
with open('example1/init_metadata.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Display an interactive JSON tree (in Jupyter lab only)
display(JSON(data))


<IPython.core.display.JSON object>

With these metadata we will initialize the experiment. All the text, results, etc. will be stored in one variable `exp`.

In [8]:
exp = base.explain_style('example1/init_metadata.json',manual = True)

Initialisation metadata loaded from example1/init_metadata.json.
Available text labels ['filename', 'class', 'author'].
Predefined text classes ['dukaj', 'zagórska'].
You are in manual mode. The next steps would be to run in sequence '.texts_load()', '.texts_preprocess()', '.texts_subsample()', '.extract_features()', '.classify()', '.explain()', and then any of the '.plot_...()' methods. If you would like to change the parameters provided in metadata or set by default  ('preproc_scheme', 'subsample_scheme', 'feature_scheme', 'cv_scheme', 'classifier_scheme'), please, load them via '.load_parameter_name()' to avoid incosistent file naming and config saving.


Let's follow the instructions and load the text files by calling the method `.texts_load()` on our all-embracing `exp`. After that they will be stored in `exp.texts` (to be exact, these are going to be tuples of text and its metadata).

In [9]:
exp.texts_load()

List of labels used: '['filename', 'class', 'author']'.


0it [00:00, ?it/s]

If you have not run spaCy in this notebook before, it will ask your permission to install a language model. It may take a minute to download and install the model. Afterwards, the text processing will commence.

[⏱️ A Google Colab session may last from about 90 minutes when idle to about 12 hours while code is running. Each time a session ends and you start the environment again, all the packages and models need to be installed and loaded again.]

In [10]:
exp.texts_preprocess()

Default folder 'explain_example1/subsamples_none_800' created.
Labels ['filename', 'class', 'author'] will be now accessible in documents via 'document._.label'.
Checking if the model is available...
The pl_core_news_lg is available from Spacy models.
Spacy model not installed.
 Would you like to install it now: [y/n]y
✔ Download and installation successful
You can now load the package via spacy.load('pl_core_news_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Number of documents provided: 2.


0it [00:00, ?it/s]

Documents saved as explain_example1/subsamples_none_800/docs.spacy.
They can be reloaded with '.docs_load(filename)'.
Preprocessing parameters saved as 'explain_example1/subsamples_none_800/config_preproc_scheme.json'. It can be reloaded with '.load_params_preproc(filename)'.


If you are running the code locally, you will notice some new

*   folders, e.g., `explain_example1/` - referring to the given experiment
*   subfolders, e.g., `subsamples_none_800` - referring to some preprocessing parameters related to segmenting the texts into smaller chunks
*   config files, e.g., `config_*.json` - containing Python dictionaries listing all the settings and parameters (currently the default ones)
*   data files, e.g., `docs.spacy` - texts already processed by spaCy

These files record the processing steps you have taken, for later reference, and let you reload intermediate stages of the analysis.
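
To see what has been written so far, you could walk the experiment directory with a small helper like the one below. It is a sketch using only the standard library; `list_experiment_files` is a hypothetical helper, not part of `cl_explainable_stylo`, and `explain_example1` is simply the folder name reported in the log above.

```python
import os
from collections import defaultdict

def list_experiment_files(root):
    """Return {subfolder: sorted list of file names} for everything under `root`."""
    found = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        for name in filenames:
            found[rel].append(name)
    return {folder: sorted(names) for folder, names in sorted(found.items())}

# Prints nothing if the folder does not exist yet (os.walk yields no entries).
for folder, names in list_experiment_files("explain_example1").items():
    print(folder)
    for name in names:
        print(f"  {name}")
```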

Now, you have your `exp.docs` ready in memory. But let's assume that you are working on your own laptop and want to reload them tomorrow...
```
exp.docs_load(filename = 'docs.spacy')
# This looks for 'docs.spacy' in your experiment's directory 'explain_example1/subsamples_none_800/'
# This is a default name, so you do not need to indicate the filename:
# exp.docs_load()
```

In [11]:
print(f"You've got {len(exp.docs)} texts ready.\n"
f"The first author is {exp.docs[0]._.author}, and the second {exp.docs[1]._.author}.")

You've got 2 texts ready.
The first author is Dukaj, and the second Zagórska.


That is rather few texts per author; the classifier won't have much to go on later.

So let's 🪓🪓🪓 these beautiful pieces of literature into smaller chunks to get more samples.

Currently implemented choices of `sample_type` are:
- `'none'` - no chunking
- `'t'` - segments defined by their number of tokens
- `'s'` - segments defined by their number of sentences (as defined by spaCy)
- `'ts'` - segments defined by their number of tokens, but never cutting a sentence in half.
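
To make the `'ts'` idea concrete, here is a minimal sketch of token-based chunking that respects sentence boundaries. It is not the package's implementation: `chunk_whole_sentences` is a hypothetical function, and it assumes that a chunk is closed once it reaches the target length and that a shorter leftover is kept as the last chunk. The package may handle both cases differently.

```python
def chunk_whole_sentences(sentences, sample_length):
    """Group tokenised sentences into chunks of at least `sample_length` tokens.

    `sentences` is a list of sentences, each a list of tokens. A chunk is closed
    as soon as it holds `sample_length` tokens or more, so no sentence is split.
    A shorter leftover at the end is kept as the last chunk (an assumption).
    """
    chunks, current, n_tokens = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        n_tokens += len(sentence)
        if n_tokens >= sample_length:
            chunks.append(current)
            current, n_tokens = [], 0
    if current:
        chunks.append(current)
    return chunks

# Toy input: four "sentences" of 3, 4, 2 and 5 tokens, target length 6.
toy = [["a"] * 3, ["b"] * 4, ["c"] * 2, ["d"] * 5]
for chunk in chunk_whole_sentences(toy, sample_length=6):
    print([len(s) for s in chunk], "->", sum(len(s) for s in chunk), "tokens")
```

With this rule the chunks vary slightly in length, which is the price of never cutting a sentence.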

In [12]:
exp.subsample_scheme = {'sample_type': 'ts', 'sample_length': 200}
exp.texts_subsample()
print(f"Samples 1-150 are {exp.docs_subsampled[0]._.author}'s, and 151-351 are {exp.docs_subsampled[170]._.author}'s.")

'subsample_scheme' changed to '{'sample_type': 'ts', 'sample_length': 200}'.


  0%|          | 0/2 [00:00<?, ?it/s]

Number of text samples produced: 351.
Samples 1-150 are Dukaj's, and 151-351 are Zagórska's.


### 🔬 Feature extraction

This is how our text samples look now:

In [13]:
from IPython.display import display, HTML
sample = exp.docs_subsampled[153]
display(HTML(f"<pre style='white-space:pre-wrap'>{sample}</pre>"))

But spaCy gives you plenty of annotation, e.g.:

In [14]:
print(f"Lemmas: {[token.lemma_ for token in sample[-20:]]}")
print(f"Categories of named entities: {[token.ent_type_ for token in sample[-20:]]}")
print(f"Part-of-speech tags: {[token.pos_ for token in sample[-20:]]}")
print(f"Dependency tags: {[token.dep_ for token in sample[-20:]]}")

Lemmas: ['Francis', 'Drake', '’', 'a', 'do', 'sir', 'John', 'Franklin', '–', 'rycerz', 'utytułowany', 'lub', 'nie', ',', 'wielki', ',', 'błędny', 'rycerz', 'morze', '.']
Categories of named entities: ['persName', 'persName', '', '', '', '', 'persName', 'persName', '', '', '', '', '', '', '', '', '', '', '', '']
Part-of-speech tags: ['PROPN', 'PROPN', 'PROPN', 'PROPN', 'ADP', 'NOUN', 'PROPN', 'PROPN', 'PUNCT', 'NOUN', 'ADJ', 'CCONJ', 'PART', 'PUNCT', 'ADJ', 'PUNCT', 'ADJ', 'NOUN', 'NOUN', 'PUNCT']
Dependency tags: ['appos', 'flat', 'punct', 'fixed', 'case', 'nmod', 'appos', 'flat', 'punct', 'appos', 'amod', 'cc', 'conj', 'punct', 'amod', 'punct', 'conj', 'conj', 'nmod', 'punct']


You can utilise many of these as features for the classifier, and later as features that will explain differences between text styles.

A longer list is given below:

In [15]:
from cl_explainable_stylo import feature_extraction
help(feature_extraction.count_features)

Help on function count_features in module cl_explainable_stylo.feature_extraction:

count_features(docs, feature_scheme={'features': [13, 23, 30, 52], 'max_features': 1000, 'n_grams_word': (1, 3), 'n_grams_pos': (1, 3), 'n_grams_dep': (1, 3), 'n_grams_morph': (1, 1), 'min_cull_word': 0.0, 'max_cull_word': 1.0, 'min_cull_d2': 0.0, 'max_cull_d2': 1.0, 'remove_duplicates': True}, verbose=0, tqdm_propagate=False)
    Count the features in the given documents.
    
    Parameters
    ----------
    docs : spacy.tokens.doc.Doc or spacy.tokens.span.Span or list
        A single `spacy.tokens.doc.Doc` object or `spacy.tokens.span.Span` or a list of such objects representing the documents.
    features : int or list
        A single integer or a list of integers representing the features to count.
    max_features : int, optional
        The maximum number of features to consider (default is 1000).
    n_grams_word : tuple, optional
        The n-gram range for word features (default is (1, 3))

From the list above, choose the features that you consider potentially useful.

The default settings are defined in `exp.default_feature_scheme`,
and the currently used settings are stored in `exp.feature_scheme`, which you can modify.
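
As an illustration of what settings such as `n_grams_word` and `max_features` control, the sketch below counts word n-grams over an inclusive range and keeps only the most frequent ones. It uses plain Python rather than the package's own code; `count_word_ngrams` is a hypothetical helper, and whether the package applies the cap per feature type or globally is not stated in the help text.

```python
from collections import Counter

def count_word_ngrams(tokens, n_range=(1, 3), max_features=None):
    """Count word n-grams for every n in the inclusive range `n_range`."""
    counts = Counter()
    low, high = n_range
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # Keep only the most frequent n-grams, mimicking a 'max_features' cap.
    return dict(counts.most_common(max_features))

tokens = "ale nie ale nie wiem".split()
print(count_word_ngrams(tokens, n_range=(1, 2), max_features=3))
```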

🥼 Currently, the pipeline is implemented to work with _spaCy_. If you have your own annotations, you have to extract them as features on your own. But you can still use them in later steps.

In [16]:
exp.default_feature_scheme

{'features': [13, 23, 32, 52, 61],
 'max_features': 1000,
 'n_grams_word': (1, 3),
 'n_grams_pos': (1, 3),
 'n_grams_dep': (1, 3),
 'n_grams_morph': (1, 1),
 'min_cull_word': 0.0,
 'max_cull_word': 1.0,
 'min_cull_d2': 0.0,
 'max_cull_d2': 1.0,
 'remove_duplicates': False}

In [17]:
exp.feature_scheme['max_features']=500
exp.feature_scheme

{'features': [13, 23, 32, 52, 61],
 'max_features': 500,
 'n_grams_word': (1, 3),
 'n_grams_pos': (1, 3),
 'n_grams_dep': (1, 3),
 'n_grams_morph': (1, 1),
 'min_cull_word': 0.0,
 'max_cull_word': 1.0,
 'min_cull_d2': 0.0,
 'max_cull_d2': 1.0,
 'remove_duplicates': False}

Now, let the machine do some heavy lifting for us:

In [18]:
exp.extract_features(save_to_file=True)

Number of documents provided: 351.
Features to be extracted: [13, 23, 32, 52, 61].


  0%|          | 0/5 [00:00<?, ?it/s]

-- Extracting non-NER lemmas (replacing named entities with their entity type).
-- Extracting dependency-based non-NER lemma bigrams (including punctuation, excluding numerals, replacing named entities with their entity type).
-- Extracting all parts of speech (no punctuation).
-- Extracting morphology annotations with punctuation (replacing named entities with their entity type).
-- Extracting all named entities.
Default folder 'explain_example1/subsamples_ts_200' created.
Feature extraction parameters saved as 'explain_example1/subsamples_ts_200/config_feature_scheme.json'. It can be reloaded with '.load_params_feature(filename)'.
'.feature_dataframe' saved as 'explain_example1/subsamples_ts_200/features.csv'. It can be reloaded with '.features_load(filename)'.


The table (samples × features) is saved as a `.csv` file, which can later be reloaded with
```
exp.features_load(filename = 'features.csv')
```
Our all-encompassing variable `exp` now also holds that table as `exp.feature_dataframe`, a _Pandas_ `DataFrame`.

⚠️ It can grow rather **BIG**, ~100 MB and more, depending on the size of your corpus and the number of features.

⚠️ The storage folder is named after the subsampling settings: `subsamples_ts_200`. If you run multiple experiments with different subsampling settings, they are stored separately. If you run experiments with different `.feature_scheme` settings, they might overwrite earlier files or produce `features_1.csv`, `features_2.csv`, etc. (with corresponding `config_*_1.json`, `config_*_2.json`, ..., files).


In [19]:
exp.feature_dataframe

Unnamed: 0,aby_13,ach_13,agent_13,albo_13,ale_13,ale nie_13,ani_13,aż_13,bardzo_13,bez_13,...,wschodnioindyjskiej_61,wtajemniczy_61,wygraża_61,wylornetkowałem_61,zamorskiego_61,Ósma_61,ósmej_61,Łódź_61,Śródziemnego_61,Śródziemnym_61
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
346,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
347,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
348,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
349,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
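
The column names in this table appear to end with the numeric code of the feature type they come from (`_13`, `_61`, matching the codes listed in `feature_scheme['features']`). Assuming that naming convention holds, a quick way to count columns per feature type is sketched below; `feature_code` is a hypothetical helper and the column list is a small hand-copied excerpt.

```python
from collections import Counter

def feature_code(column_name):
    """Return the part after the last underscore, e.g. 'ale nie_13' -> '13'."""
    return column_name.rsplit("_", 1)[-1]

# A few column names copied from the table above.
columns = ["aby_13", "ale nie_13", "bez_13", "wygraża_61", "Łódź_61"]
print(Counter(feature_code(c) for c in columns))
```

Passing `exp.feature_dataframe.columns` instead of the short list should give the count for the whole table; a column without an underscore, such as an index column, is returned unchanged and counted under its own name.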


🥼 If you have your own features and they are in the same table format (in particular, with the same number of text samples), you can still load them with
```
exp.features_load(filename = 'features.csv')
```
Alternatively, if you have them stored in a variable as a _Pandas_ `DataFrame`,
you can simply assign it to `exp` (without needlessly running `exp.extract_features()`):
```
exp.feature_dataframe = your_df
```

### 🏋️‍♀️ Training a classifier

Everything's ready to feed the classifier, train it to distinguish between the authors and then explain how it is doing that.

Like before, the default settings are defined in `exp.default_cv_scheme` and `exp.default_classifier_scheme`,
and the currently used settings are stored in `exp.cv_scheme` and `exp.classifier_scheme`, which you can modify.

⚠️ Not all the settings are used for a given `cv_method`.

In [20]:
exp.cv_scheme

{'cv_method': 'StratifiedKFold',
 'n_repeats': 10,
 'n_splits': 10,
 'shuffle': True,
 'n_groups': 2,
 'p': 2,
 'test_fold': [1, 0, 1, 1, 0],
 'test_size': 0.2,
 'train_size': 0.2,
 'random_state': None,
 'val_fraction': 0.25,
 'scoring': {'acc': {'func': <function sklearn.metrics._classification.accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)>},
  'f1': {'func': <function sklearn.metrics._classification.fbeta_score(y_true, y_pred, *, beta, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')>,
   'params': {'beta': 1, 'average': 'macro'}}}}

In [21]:
exp.cv_scheme['val_fraction'] = 0. # There's a bug that I need to fix. That's the current workaround.
exp.cv_scheme['n_repeats'] = 1

In [22]:
exp.classify(save_to_file=True)

Proceeding with 2 classes.
Proceeding with StratifiedKFold cross-validation.


  0%|          | 0/1 [00:00<?, ?it/s]

'val_fraction' setting results in 0.0 validation samples. No nested validation will be performed.
Train:315, Val:0, Test:36

'val_fraction' setting results in 0.0 validation samples. No nested validation will be performed.
Train:316, Val:0, Test:35

'val_fraction' setting results in 0.0 validation samples. No nested validation will be performed.
Train:316, Val:0, Test:35

'val_fraction' setting results in 0.0 validation samples. No nested validation will be performed.
Train:316, Val:0, Test:35

'val_fraction' setting results in 0.0 validation samples. No nested validation will be performed.
Train:316, Val:0, Test:35

'val_fraction' setting results in 0.0 validation samples. No nested validation will be performed.
Train:316, Val:0, Test:35

'val_fraction' setting results in 0.0 validation samples. No nested validation will be performed.
Train:316, Val:0, Test:35

'val_fraction' setting results in 0.0 validation samples. No nested validation will be performed.
Train:316, Val:0, Test:35

'val_fraction' setting results in 0.0 validation samples. No nested validation will be performed.
Train:316, Val:0, Test:35

'val_fraction' setting results in 0.0 validation samples. No nested validation will be performed.
Train:316, Val:0, Test:35

  0%|          | 0/10 [00:00<?, ?it/s]

Accuracy [0-1, higher better]: 	0.97 	 Baseline value:  0.57
F1 [0-1, higher better]: 		0.97 	 Baseline value:  0.36
Scores saved as explain_example1/subsamples_ts_200/scores_class-class.pkl.
Classifiers saved as explain_example1/subsamples_ts_200/classifiers_class-class.pkl.
Classifiers saved as booster_*.txt.
Classifier parameters saved as 'explain_example1/subsamples_ts_200/config_classifier_scheme.json'. It can be reloaded with '.load_params_classifier(filename)'.
Cross-validation parameters saved as 'explain_example1/subsamples_ts_200/config_cv_scheme.pkl'. It can be reloaded with '.load_params_cv(filename)'.
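
The train and test sizes in the log follow from splitting 351 samples into 10 stratified folds. The sketch below reproduces that arithmetic, assuming 150 and 201 samples per class (as printed after subsampling) and assuming each class is spread over the folds as evenly as possible. `stratified_fold_sizes` is a hypothetical helper, and which fold receives the extra sample is an implementation detail of the splitter.

```python
def stratified_fold_sizes(class_counts, n_splits):
    """Return the test-set size of each fold when every class is spread as evenly as possible."""
    test_sizes = [0] * n_splits
    for count in class_counts:
        base, extra = divmod(count, n_splits)
        for fold in range(n_splits):
            # The first `extra` folds receive one additional sample of this class.
            test_sizes[fold] += base + (1 if fold < extra else 0)
    return test_sizes

class_counts = [150, 201]  # samples per translator, as printed after subsampling
total = sum(class_counts)
for fold, n_test in enumerate(stratified_fold_sizes(class_counts, n_splits=10)):
    print(f"Fold {fold}: Train:{total - n_test}, Test:{n_test}")
```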


Let's check if the classifier is actually able to correctly attribute the samples to the authors:

In [23]:
exp.scores.print_scores()

Accuracy [0-1, higher better]: 	0.97 	 Baseline value:  0.57
F1 [0-1, higher better]: 		0.97 	 Baseline value:  0.36
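
The output does not say how the baseline values are computed. They are consistent with a classifier that always predicts the larger class, which the sketch below works through for class sizes of 150 and 201 (an assumption taken from the subsampling printout). `majority_baseline` is a hypothetical helper, not the package's scoring code.

```python
def majority_baseline(class_counts):
    """Accuracy and macro-F1 obtained by always predicting the largest class."""
    total = sum(class_counts)
    majority = max(class_counts)
    accuracy = majority / total
    # For the majority class: recall is 1 and precision equals the accuracy.
    f1_majority = 2 * accuracy / (accuracy + 1)
    # Every other class is never predicted, so its F1 counts as 0.
    macro_f1 = f1_majority / len(class_counts)
    return accuracy, macro_f1

acc, f1 = majority_baseline([150, 201])
print(f"Accuracy baseline: {acc:.2f}, macro-F1 baseline: {f1:.2f}")
```

A trained classifier is only informative to the extent that it beats these numbers.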


In [None]:
exp.scores.predictions_per_cv

Both the classifier itself
```
exp.classifiers_load(postfix='_class-class')
```
and the scores
```
exp.scores_load(postfix='_class-class')
```
are saved and can be reloaded for further inspection or for inference using the trained model.

### Explanations

In [25]:
exp.explain(save_to_file=True)

  0%|          | 0/10 [00:00<?, ?it/s]

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

As before, the explanations are saved to a file and can be reloaded with
```
exp.explanations_load(postfix='_class-class')
```

⚠️ Note that this file can get big, too. Consequently, writing to it and reading it may be slow at the moment.
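
The explanation objects printed in the next cell hold, for each cross-validation fold, SHAP `.values` and `.base_values`. Assuming these are on the log-odds scale, as is usual for SHAP with gradient-boosted binary classifiers (this is not confirmed by the package output), the base value plus the SHAP values of one sample gives that sample's raw score, which a sigmoid turns into a probability. In the sketch below the base value is copied from the first fold's output, while the per-feature contributions are hypothetical toy numbers.

```python
import math

def shap_to_probability(base_value, shap_values):
    """Add SHAP values to the base value (log-odds) and map the sum to a probability."""
    log_odds = base_value + sum(shap_values)
    return 1.0 / (1.0 + math.exp(-log_odds))

base_value = 1.60916807             # copied from the first fold's output
toy_shap = [-0.6, 0.25, 0.0, -1.1]  # hypothetical per-feature contributions for one sample

print(f"Base probability:  {shap_to_probability(base_value, []):.3f}")
print(f"After SHAP values: {shap_to_probability(base_value, toy_shap):.3f}")
```

Negative contributions push the prediction towards one class and positive ones towards the other; features with a value of zero leave it unchanged.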

In [35]:
exp.explanations.shap_cv

{0: {0: .values =
  array([-0.59710091,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ])
  
  .base_values =
  array([1.60916807, 1.60916807, 1.60916807, 1.60916807, 1.60916807,
         1.60916807, 1.60916807, 1.60916807, 1.60916807, 1.60916807,
         1.60916807, 1.60916807, 1.60916807, 1.60916807, 1.60916807,
         1.60916807, 1.60916807, 1.60916807, 1.60916807, 1.60916807,
         1.60916807, 1.60916807, 1.60916807, 1.60916807, 1.60916807,
         1.60916807, 1.60916807, 1.60916807, 1.60916807, 1.60916807,
         1.60916807, 1.60916807, 1.60916807, 1.60916807, 1.60916807,
         1.60916807])
  
  .data =
  array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)},
 1: {0: .values =
  array([-0.60525228,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ])
  
  .base_values =
  array([1.64194388, 1.64194388, 1.64194388, 1.64194388, 1.64194388,
         1.64194388, 1.64194388, 1.64194388, 1.64194388, 1.64194388,
        

## 🧪 Your Playground

You can now run your own analyses using functions from `cl_explainable_stylo`, loading texts and experimenting with stylistic explainability.

In [None]:
# Write your own code here!
# For example, analyze loaded text, vectorize it, or explain features
# from cl_explainable_stylo import analyze_text

# results = analyze_text(text)
# print(results)