# Demo notebook for interactive plotting and data analysis
SSC, September 2022

This demonstrates usage of the interactive plotting methods.

In [None]:
# Please ignore this cell: extra install steps that are only executed when running the notebook on Google Colab
# flake8-noqa-cell
import os
if 'google.colab' in str(get_ipython()) and not os.path.isdir('data/Test_Data'):
    # we're running on colab and we haven't already downloaded the test data
    # first install pinned version of setuptools (latest version doesn't seem to work with this package on colab)
    %pip install setuptools==61 -qqq
    # install the moralization package
    %pip install git+https://github.com/ssciwr/moralization.git -qqq
    # download test data sets
    !wget https://github.com/ssciwr/moralization/archive/refs/heads/test_data.zip -q
    !mkdir -p data && unzip -qq test_data.zip && mv -f moralization-test_data/*_Data ./data/. && rm -rf moralization-test_data test_data.zip
    !spacy download de_core_news_sm

In [None]:
from moralization import DataManager

# Import the data using the DataManager

If you need more information about raised warnings, run: <br>
```import logging``` <br>
```logging.getLogger().setLevel(logging.DEBUG)```

In [None]:
# analyse small dataset
data_manager = DataManager("/content/data/Test_Data/XMI_11")
# analyse full dataset
# data_manager = DataManager("/content/data/All_Data/XMI_11") 

The integrity of the data is checked using the `check_data_integrity` method.

To check the integrity of the data, four metrics are evaluated: `frequency`, `length`, `span_distinctiveness`, and `boundary_distinctiveness`.

- Frequency is the total number of spans for a span type in the dataset’s training corpus. Recommended minimum value: 50
- Relative frequency is the share that a certain category occupies. Recommended minimum value: 0.2
- Span distinctiveness is a measure of how distinctive the text that comprises spans is compared to the overall text of the corpus. Recommended minimum value: 1
- Boundary distinctiveness is a measure of how distinctive the starts and ends of spans are. Recommended minimum value: 1


See https://www.romanklinger.de/publications/PapayKlingerPado2020.pdf page 3 for more information.
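
As a rough illustration of how the recommended minimum values listed above could be applied, here is a minimal sketch in plain Python. It is not the implementation of `check_data_integrity`: the function `find_unreliable`, the metric keys, the category names and all numbers are made up for this example.

```python
# Recommended minimum values quoted in the list above.
# The dictionary keys are hypothetical names chosen for this sketch.
RECOMMENDED_MINIMUMS = {
    "frequency": 50,
    "relative_frequency": 0.2,
    "span_distinctiveness": 1,
    "boundary_distinctiveness": 1,
}


def find_unreliable(metrics):
    """Return {category: [names of metrics below the recommended minimum]}."""
    flagged = {}
    for category, values in metrics.items():
        low = [
            name
            for name, minimum in RECOMMENDED_MINIMUMS.items()
            if name in values and values[name] < minimum
        ]
        if low:
            flagged[category] = low
    return flagged


# Made-up numbers for two made-up categories, for illustration only.
example_metrics = {
    "category_a": {
        "frequency": 120,
        "relative_frequency": 0.35,
        "span_distinctiveness": 1.4,
        "boundary_distinctiveness": 1.1,
    },
    "category_b": {
        "frequency": 12,
        "relative_frequency": 0.04,
        "span_distinctiveness": 2.0,
        "boundary_distinctiveness": 0.8,
    },
}

flagged = find_unreliable(example_metrics)
print(flagged)
```

Only `category_b` is reported here, because three of its made-up values fall below the recommended minimum.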

In [None]:
data_manager.check_data_integrity()

This will tell you if some categories are exceptionally rare and therefore unreliable for both statistics and training.

# Data analysis
Analysis of how often an annotation occurs per text source is carried out using `occurence_analysis`. 

This function has three different modes:

- `table`: Show which categories are present in which paragraph, sorted by filenames.
- `corr`: Show the correlation of the occurrence of different categories within the same paragraph.
- `heatmap`: A heatmap visualization of the correlation matrix.
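
To make the relation between the `table` and `corr` modes more concrete, here is a small sketch in plain Python. It assumes that the table holds one row per paragraph with a 0/1 indicator per category and that the correlation is a Pearson correlation between these indicator columns; both are assumptions for illustration, not a description of what `occurence_analysis` does internally. The category names and values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denominator = sqrt(var_x * var_y)
    # A column that never changes has no defined correlation.
    return cov / denominator if denominator else float("nan")


# Made-up occurrence table: one entry per paragraph, 1 = category present.
occurrence = {
    "category_a": [1, 0, 1, 1, 0, 0],
    "category_b": [1, 0, 1, 0, 0, 0],
    "category_c": [0, 1, 0, 0, 1, 1],
}

# Correlation matrix as a nested dictionary.
corr = {
    first: {second: pearson(occurrence[first], occurrence[second]) for second in occurrence}
    for first in occurrence
}

for name, row in corr.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

In this toy data `category_a` and `category_c` never occur in the same paragraph, so their correlation is -1, while `category_a` and `category_b` mostly occur together and correlate positively.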



In [None]:
occurence_table = data_manager.occurence_analysis(_type="table")
occurence_table.head(3)

To find examples of spans where specific categories are present, you can use the following code.
Just change the filter conditions to whatever you need.

In [None]:
filter_conditions = [
    ("KAT1-Moralisierendes Segment", "Moralisierung explizit"),
    ("KAT2-Moralwerte", "Care"),
]

filtered_df = occurence_table.copy()
for first_level, second_level in filter_conditions:
    filtered_df = filtered_df.loc[filtered_df[(first_level, second_level)] == 1]
filtered_df

The analysis can also be returned as an occurrence correlation:

In [None]:
correlation_df = data_manager.occurence_analysis(_type="corr")
correlation_df.head(5)

or as a heatmap:

In [None]:
data_manager.occurence_analysis(_type="heatmap")

The dataframes can also be exported as CSV files for further sorting.

In [None]:
# for the general table
df = data_manager.occurence_analysis()
df.to_csv("./table_occurence.csv")

In [None]:
# for the correlation table
df = data_manager.occurence_analysis(_type="corr")
df.to_csv("./table_correlation.csv")

If you do not want the full table but want to filter it for specific data files, you can do so by providing the `file_filter` keyword:

In [None]:
data_manager.occurence_analysis(file_filter="test_data-trimmed_version_of-Gerichtsurteile-neg-AW-neu-optimiert-BB")

In [None]:
data_manager.occurence_analysis(_type="heatmap", file_filter="test_data-trimmed_version_of-Gerichtsurteile-neg-AW-neu-optimiert-BB")

Likewise, if you do not want the full correlation plot but want to filter it for specific categories, you can do so by providing the `cat_filter` keyword (only works for `_type="heatmap"`):

In [None]:
data_manager.occurence_analysis(_type="heatmap", cat_filter="KAT1-Moralisierendes Segment")

# Check the data
You can also analyse the data using the spaCy [span analyzer](https://github.com/ljvmiranda921/spacy-span-analyzer).


In [None]:
data_manager.return_analyzer_result(result_type="frequency")

In [None]:
data_manager.return_analyzer_result(result_type="length")

In [None]:
data_manager.return_analyzer_result(result_type="span_distinctiveness")

In [None]:
data_manager.return_analyzer_result(result_type="boundary_distinctiveness")

In [None]:
data_manager.return_analyzer_result(result_type="all")

Again, any of these can be exported as CSV.

In [None]:
df = data_manager.return_analyzer_result(result_type="boundary_distinctiveness")
df.to_csv("./boundary_distinctiveness.csv")

# Interactive data analysis

Here is an example of the interactive data analysis tools we provide.

Please note that on Google Colab it can take a few seconds to go from `loading` to the interface.<br>
Once Dash shows that it is running on a port, you can click the link to open the applet in a new tab.

`interactive_data_analysis` can be used to quickly get an overview of the `frequency`, `length`, `span_distinctiveness`, and `boundary_distinctiveness` for the different categories.

- Frequency is the number of spans for a span type in the dataset’s training corpus.
- Span length is the geometric mean of spans’ lengths, in tokens.
- Span distinctiveness is a measure of how distinctive the text that comprises spans is compared to the overall text of the corpus.
- Boundary distinctiveness is a measure of how distinctive the starts and ends of spans are.

See https://www.romanklinger.de/publications/PapayKlingerPado2020.pdf page 3 for more information.
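
The first two measures in the list above can be sketched with the Python standard library alone. The snippet below is only an illustration on made-up spans and made-up category names; it does not call the span analyzer and makes no claim about how the analyzer computes its results.

```python
from statistics import geometric_mean

# Made-up spans as (span type, span length in tokens) pairs.
spans = [
    ("category_a", 2),
    ("category_a", 8),
    ("category_a", 4),
    ("category_b", 3),
    ("category_b", 27),
]

# Collect the span lengths per span type.
lengths_by_type = {}
for span_type, length in spans:
    lengths_by_type.setdefault(span_type, []).append(length)

# Frequency: number of spans per span type.
frequency = {span_type: len(lengths) for span_type, lengths in lengths_by_type.items()}

# Span length: geometric mean of the span lengths, in tokens.
span_length = {
    span_type: geometric_mean(lengths) for span_type, lengths in lengths_by_type.items()
}

print(frequency)
print({span_type: round(value, 2) for span_type, value in span_length.items()})
```

For the two `category_b` spans of 3 and 27 tokens the geometric mean is 9, whereas the arithmetic mean would be 15, so a single very long span influences the geometric mean less.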

In [None]:
data_manager.interactive_data_analysis()

`interactive_correlation_analysis` provides a quick way to compare the correlation between categories interactively.

In [None]:
data_manager.interactive_correlation_analysis()

The next function might struggle on large datasets.

In [None]:
data_manager.interactive_data_visualization()