# Demo notebook for interactive plotting and data analysis
SSC, September 2022

This demonstrates usage of the interactive plotting methods.

In [None]:
# Please ignore this cell: extra install steps that are only executed when running the notebook on Google Colab
# flake8-noqa-cell
import os
if 'google.colab' in str(get_ipython()) and not os.path.isdir('data/Test_Data'):
    # we're running on colab and we haven't already downloaded the test data
    # first install pinned version of setuptools (latest version doesn't seem to work with this package on colab)
    %pip install setuptools==61 -qqq
    # install the moralization package
    %pip install git+https://github.com/ssciwr/moralization.git -qqq
    # download test data sets
    !wget https://github.com/ssciwr/moralization/archive/refs/heads/test_data.zip -q
    !mkdir -p data && unzip -qq test_data.zip && mv -f moralization-test_data/*_Data ./data/. && rm -rf moralization-test_data test_data.zip
    !spacy download de_core_news_sm

In [None]:
from moralization import DataManager

# Import the data using the DataManager

If you need more information about raised warnings, run: <br>
`import logging` <br>
`logging.getLogger().setLevel(logging.DEBUG)`

In [None]:
# analyse small dataset
# data_manager = DataManager("/content/data/Test_Data/XMI_11")

# if your data is in a language other than German, you
# can pass the matching spaCy language model for the corpus
# using the language_model keyword argument
# for a selection of the models, see https://spacy.io/usage/models
data_manager = DataManager("/content/data/Test_Data/XMI_11", language_model="en_core_web_sm")

# analyse full dataset
# data_manager = DataManager("/content/data/All_Data/XMI_11") 


## Validate the quality of the data

The integrity of the data is checked using the `check_data_integrity` method. This method returns `True` only if all categories meet the minimum requirements; otherwise it returns `False`.

To check the integrity of the data, four categories are evaluated: `frequency`, `length`, `span_distinctiveness`, `boundary_distinctiveness`. The check is based on the spaCy [span analyzer](https://github.com/ljvmiranda921/spacy-span-analyzer).

- `Frequency` is the total number of spans for a span type in the dataset’s training corpus. Recommended minimum value: 50
- `Relative frequency` is the percentage a certain category occupies. Recommended minimum value: 0.2
- `Span distinctiveness` is a measure of how distinctive the text that comprises spans is compared to the overall text of the corpus. Recommended minimum value: 1
- `Boundary distinctiveness` is a measure of how distinctive the starts and ends of spans are. Recommended minimum value: 1


See https://www.romanklinger.de/publications/PapayKlingerPado2020.pdf page 3 for more information.
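
As a rough illustration of how such a threshold check could work, here is a minimal sketch that compares per-category metrics against the recommended minimum values listed above. The function name, the dictionary keys, and the example numbers are all hypothetical; this is not the implementation used by `check_data_integrity`.

```python
# Hypothetical thresholds, copied from the recommended minimum values listed above.
MIN_VALUES = {
    "frequency": 50,
    "relative_frequency": 0.2,
    "span_distinctiveness": 1,
    "boundary_distinctiveness": 1,
}


def find_failures(metrics):
    """Return (category, metric, value) triples that fall below the minimum."""
    failures = []
    for category, values in metrics.items():
        for metric, minimum in MIN_VALUES.items():
            if values[metric] < minimum:
                failures.append((category, metric, values[metric]))
    return failures


# Made-up numbers for two categories, for illustration only.
example_metrics = {
    "Care": {"frequency": 120, "relative_frequency": 0.35,
             "span_distinctiveness": 1.4, "boundary_distinctiveness": 1.1},
    "Fairness": {"frequency": 12, "relative_frequency": 0.04,
                 "span_distinctiveness": 1.8, "boundary_distinctiveness": 0.7},
}

failures = find_failures(example_metrics)
passed = not failures  # True only if every category meets every minimum
print("passed:", passed)
for category, metric, value in failures:
    print(f"{category}: {metric} = {value} is below {MIN_VALUES[metric]}")
```

In this toy input, the second category fails three of the four checks, so the overall result is `False`, mirroring the all-or-nothing behaviour described above.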

In [None]:
result = data_manager.check_data_integrity()



In [None]:
print("Data passed the test?:", result)

This tells you whether some categories are too rare to be reliable for either statistics or training.

# Analyse the data per paragraph (instance)
Analysis of how often an annotation occurs per text source is carried out using `occurrence_analysis`. 

This function has three different modes:

- `table`: Show which categories are present in which paragraph, sorted by filenames.
- `corr`: Show the correlation of the occurrence of different categories within the same paragraph. This is based on the [pandas `corr` function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) and uses the Pearson correlation coefficient.
- `heatmap`: A heatmap visualization of the correlation matrix `corr`.
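
To make the `corr` mode more concrete, the following sketch applies the pandas `corr` function to a small, made-up occurrence table. The category names and values are assumptions chosen for illustration, and the real table returned by `occurrence_analysis` may be laid out differently.

```python
import pandas as pd

# Toy occurrence table: one row per paragraph, one column per category,
# 1 if the category is annotated in that paragraph and 0 otherwise.
# Category names and values are made up for illustration.
toy_table = pd.DataFrame(
    {
        "Care": [1, 0, 1, 1, 0],
        "Fairness": [1, 0, 1, 0, 0],
        "Cheating": [0, 1, 0, 0, 1],
    },
    index=[f"paragraph_{i}" for i in range(5)],
)

# DataFrame.corr defaults to the Pearson correlation coefficient.
toy_corr = toy_table.corr()
print(toy_corr.round(2))
```

Categories that always occur together get a coefficient close to 1, categories that never share a paragraph (here `Care` and `Cheating`) get a coefficient close to -1, and unrelated categories end up near 0.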



In [None]:
occurence_table = data_manager.occurrence_analysis(_type="table")
occurence_table.head(3)

To find examples of spans where specific categories are present, you can use the code below.
Change the filter conditions as needed.

In [None]:
filter_conditions = [
    ("KAT1-Moralisierendes Segment", "Moralisierung explizit"),
    ("KAT2-Moralwerte", "Care"),
]

filtered_df = occurence_table.copy()
for first_level, second_level in filter_conditions:
    filtered_df = filtered_df.loc[filtered_df[(first_level, second_level)] == 1]
filtered_df

The analysis can also be returned as an occurrence correlation matrix:

In [None]:
correlation_df = data_manager.occurrence_analysis(_type="corr")
correlation_df.head(5)

With `_type="heatmap"`, this function gives a quick overview of the correlation matrix.

For a more detailed look at the correlation heatmap, please use the `interactive_correlation_analysis` function, which is explained at the end of the notebook.

In [None]:
data_manager.occurrence_analysis(_type="heatmap")

The dataframes can also be exported as CSV for further sorting.

In [None]:
# for the general table
df = data_manager.occurrence_analysis()
df.to_csv("./table_occurrence.csv")

In [None]:
# for the correlation table
df = data_manager.occurrence_analysis(_type="corr")
df.to_csv("./table_correlation.csv")

If you do not want the full table but filter it for specific data files, you can do so by providing a `file_filter` keyword:

In [None]:
data_manager.occurrence_analysis(file_filter="test_data-trimmed_version_of-Gerichtsurteile-neg-AW-neu-optimiert-BB")

In [None]:
data_manager.occurrence_analysis(_type="heatmap", file_filter="test_data-trimmed_version_of-Gerichtsurteile-neg-AW-neu-optimiert-BB")

Likewise, if you do not want the full correlation plot but want to filter it for specific categories, you can do so by providing a `cat_filter` keyword (only works for `_type="heatmap"`):

In [None]:
data_manager.occurrence_analysis(_type="heatmap", cat_filter="KAT1-Moralisierendes Segment")

# Analyse the dataset as a whole
You can also analyse the data using the spaCy [span analyzer](https://github.com/ljvmiranda921/spacy-span-analyzer). The mode is selected with the `result_type` argument:
- `frequency` stands for the total frequency in the complete dataset;
- `length` for the geometric mean of the spans' lengths in tokens in the complete dataset;
- `span_distinctiveness` for the distinctiveness of the spans compared to the corpus. It measures how distinct the text comprising the spans is compared to the rest of the corpus. It is defined as the KL divergence D(P_span || P), where P is the unigram word distribution of the corpus and P_span is the unigram distribution of the tokens within the spans. High values indicate that different words are used inside spans than in the rest of the text, whereas low values indicate that the word distribution is similar inside and outside of spans. This property is positively correlated with model performance. Spans with high distinctiveness should be able to rely more heavily on local features, as each token carries information about span membership. Low span distinctiveness calls for sequence information instead.
- `boundary_distinctiveness` for the distinctiveness of the span boundaries compared to the corpus. It measures how distinctive the starts and ends of spans are. It is formalized as the KL divergence D(P_bounds || P), where P is the unigram word distribution of the corpus and P_bounds is the unigram distribution of the boundary tokens. This property is positively correlated with model performance. High values mean that the start and end points of spans are easy to spot, while low values indicate smooth transitions.
- `all` will return all of the above as a dictionary.
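
The KL divergence used for the two distinctiveness metrics can be sketched on a toy example. The snippet below is only an illustration of the formula D(P_span || P) with made-up tokens; the helper names are hypothetical, the natural logarithm is an assumption, and the span analyzer may tokenize and normalise differently.

```python
import math
from collections import Counter


def unigram_distribution(tokens):
    """Relative frequency of each token in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def kl_divergence(p, q):
    """D(p || q) in nats; assumes every token with p > 0 also has q > 0."""
    return sum(prob * math.log(prob / q[token]) for token, prob in p.items())


# Toy corpus (made up): the last two tokens form the only annotated span.
corpus_tokens = ["the", "court", "ruled", "that", "the", "act", "was", "deeply", "unjust"]
span_tokens = ["deeply", "unjust"]

p_corpus = unigram_distribution(corpus_tokens)
p_span = unigram_distribution(span_tokens)

span_distinctiveness = kl_divergence(p_span, p_corpus)
print(round(span_distinctiveness, 3))
```

Because the span tokens are rare in the corpus as a whole, the divergence is clearly above zero. Using only the first and last token of each span instead of `span_tokens` would give the analogous boundary quantity.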


All of these metrics are also used in the `check_data_integrity` function at the top.

Show how often a given label is present in different categories.


In [None]:
data_manager.return_analyzer_result(result_type="frequency")

Show how long the spans are for different labels. 


In [None]:
data_manager.return_analyzer_result(result_type="length")
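
The `length` metric is described above as a geometric mean rather than an arithmetic mean. A short sketch with made-up span lengths shows the difference; the numbers are purely illustrative.

```python
from statistics import geometric_mean, mean

# Made-up span lengths in tokens; one span is much longer than the others.
span_lengths = [2, 4, 8, 64]

geo = geometric_mean(span_lengths)
arith = mean(span_lengths)
print("geometric mean:", geo)
print("arithmetic mean:", arith)
```

The geometric mean is pulled up far less by the single long span than the arithmetic mean, which makes it a more robust summary when span lengths are skewed.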

Show the span distinctiveness for different labels. 


In [None]:
data_manager.return_analyzer_result(result_type="span_distinctiveness")

Show the boundary distinctiveness for different labels. 


In [None]:
data_manager.return_analyzer_result(result_type="boundary_distinctiveness")

In [None]:
data_manager.return_analyzer_result(result_type="all")

Again, any of these can be exported as CSV.

In [None]:
df = data_manager.return_analyzer_result(result_type="boundary_distinctiveness")
df.to_csv("./boundary_distinctiveness.csv")

# Interactive data analysis

Here is an example of the interactive data analysis tools we provide.

Please note that it can take a few seconds on Google Colab to go from `loading` to the interface.<br>
Once Dash shows that it is running on a port, you can click the link to open the applet in a new tab.

`interactive_data_analysis` can be used to quickly get an overview of the `frequency`, `length`, `span_distinctiveness` and `boundary_distinctiveness` for the different categories. The same results can be viewed numerically with the `return_analyzer_result` function described above.

- Frequency is the number of spans for a span type in the dataset’s training corpus.
- Span length is the geometric mean of spans’ lengths, in tokens.
- Span distinctiveness is a measure of how distinctive the text that comprises spans is compared to the overall text of the corpus.
- Boundary distinctiveness is a measure of how distinctive the starts and ends of spans are.

See https://www.romanklinger.de/publications/PapayKlingerPado2020.pdf page 3 for more information.

In [None]:
data_manager.interactive_data_analysis(port=8058)
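
Each interactive tool in this notebook is started on its own port. If an applet does not come up, one possible cause is that the port is already in use. The following helper is a hypothetical convenience sketch, not part of the moralization package, that checks this with the standard library before you pick a port.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing accepts connections on host:port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is in use.
        return sock.connect_ex((host, port)) != 0


# Example: check the port before passing it to one of the interactive functions.
print(port_is_free(8058))
```

If the check returns `False`, choosing a different value for the `port` argument is the simplest workaround.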

`interactive_correlation_analysis` visualizes the heatmap of the category correlations in a simpler overview. It includes precise filtering of which categories to show.

The map also allows zooming into specific regions.

The picture can be exported using the controls in the top right.

In [None]:
data_manager.interactive_correlation_analysis(port=8059)

The next function might struggle on large datasets. <br>
This function will show you the selected dataset with annotations. <br>
Select `sc` to see all annotations.

In [None]:
data_manager.interactive_data_visualization(port=8065)