# Freeform explorer and column explorer

A function `freeforms()` that lists the freeforms (open-ended story tags) associated with a character pair in a slash relationship.

Also a more general function `mentions()` that searches one column and then reports the frequencies of the values in another column for the matching rows.

## Load dataframe from .CSV into `df`

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
#
# Categories are ordered by descending frequency in dataset so that
# setting unsorted=True makes graphs come out correctly where
# the category is the primary category of the population.
#
# To get the frequency of, say, 'language', run
# src/explore/table/histogram-table.py -i data/database/20220612.yaml language

categories_type = pd.api.types.CategoricalDtype(
    categories=[
        'M/M',
        'Gen',
        'F/M',
        'F/F',
        'Multi',
        'No category',
        'Other'
    ],
    ordered=True)

# Ordered by frequency in dataset
warnings_type = pd.api.types.CategoricalDtype(
    categories=[
        'No Archive Warnings Apply',
        'Choose Not To Use Archive Warnings',
        'Graphic Depictions Of Violence',
        'Major Character Death',
        'Rape/Non-Con',
        'Underage',
    ],
    ordered=True)

# Ordered by frequency in dataset
rating_type = pd.api.types.CategoricalDtype(
    categories=[
        'General Audiences',
        'Teen And Up Audiences',
        'Explicit',
        'Mature',
        'Not Rated',
    ],
    ordered=True)

language_type = pd.api.types.CategoricalDtype(
    categories=[
        'en',
        'ru',
        'de',
        'zh-Hans',
        'it',
        'pt-br',
        'ko',
        'fr',
        'es',
        'cy',
        'pl',
        'cs',
        'ja',
        'he',
        'tlh-Latn',
        'nl'
    ],
    ordered=True)

dtypes = { 'id': 'int64',
           'author': 'string',
           'chapter': 'Int64',
           'chapters': 'Int64',
           'comments': 'Int64',
           'complete': 'bool',
           'filename': 'string',
           'hits': 'Int64',
           'kudos': 'Int64',
           'summary': 'string',
           'title': 'string',
           'userid': 'Int64',
           'words': 'Int64',
           'rating': rating_type,
           'language': language_type }

# Load data from CSV into Pandas dataframe
# See https://pbpython.com/pandas_dtypes.html
df = pd.read_csv('../../../data/database/20220612.csv', dtype=dtypes)
df.set_index('id', inplace=True)

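# Convert some strings to lists.
# ast.literal_eval only parses Python literals, so unlike eval() it
# cannot execute arbitrary code found in the CSV.
import ast

def strtolist(s):
    if pd.isna(s):
        return []
    return ast.literal_eval(s)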

df['categories'] = df['categories'].apply(strtolist)
df['characters'] = df['characters'].apply(strtolist)
df['charactersclean'] = df['charactersclean'].apply(strtolist)
df['fandoms'] = df['fandoms'].apply(strtolist)
df['freeforms'] = df['freeforms'].apply(strtolist)
df['relationships'] = df['relationships'].apply(strtolist)
df['relationshipspair'] = df['relationshipspair'].apply(strtolist)
df['relationshipspairslash'] = df['relationshipspairslash'].apply(strtolist)
df['relationshipspairamp'] = df['relationshipspairamp'].apply(strtolist)
df['relationshipspax'] = df['relationshipspax'].apply(strtolist)
df['relationshipspaxslash'] = df['relationshipspaxslash'].apply(strtolist)
df['relationshipspaxamp'] = df['relationshipspaxamp'].apply(strtolist)
df['warnings'] = df['warnings'].apply(strtolist)

# Convert to pandas datetime
# Only publications after 2010
df['publicationdate'] = pd.to_datetime(df['publicationdate'])
dawn = pd.Timestamp('2010-01-01')
df = df[df['publicationdate'] >= dawn]

# Only English
df = df[df['language'] == 'en']

# Complete works
df = df[df['complete'] == True]
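
The list-valued columns are stored in the CSV as text that looks like a Python list, which is why they need converting after loading. The sketch below shows that conversion on a tiny made-up column. The name `parse_list` and the toy data are assumptions for illustration only, and it uses `ast.literal_eval`, which accepts literals but will not run arbitrary code.

```python
import ast
import pandas as pd

def parse_list(s):
    """Turn text such as "['a', 'b']" into a real list; missing values become []."""
    if pd.isna(s):
        return []
    return ast.literal_eval(s)

# Made-up data: one populated cell, one missing cell, one empty list
toy = pd.DataFrame({'freeforms': ["['Fluff', 'Angst']", None, "[]"]})
toy['freeforms'] = toy['freeforms'].apply(parse_list)
print(toy['freeforms'].tolist())
```

After the conversion every cell holds a real list, so tests such as `'Fluff' in cell` check list membership rather than substring matching.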


## Define the data explorer for freeforms

This code is easy to follow if you have src/misc/pandas-cheetsheet.md handy, as it mostly reuses phrases from that file.

In [None]:
class Explorer:
    def __init__(self, df):
        # Narrow the dataframe to just the columns of interest
        self.df = df[['relationshipspairslash',
                      'freeforms',
                      'categories',
                      'charactersclean',
                      'fandoms',
                      'relationshipspair',
                      'relationshipspairamp',
                      'relationshipspax',
                      'relationshipspaxamp',
                      'relationshipspaxslash']]
        
    def freeforms(self, pair, top=0):
        # Narrow the dataframe to `pair_df` which contains just the rows which contain the pairing
        pair_df = pd.DataFrame()
        for index, relationshipspairslash in zip(self.df.index, self.df['relationshipspairslash']):
            if pair in relationshipspairslash:
                # Match found, copy the whole row to the bottom of the new dataframe
                row = self.df[self.df.index == index].copy()
                pair_df = pd.concat([pair_df, row])
            
        # Explode out the freeforms list into separate rows in a dataframe `explode_df`
        explode_df = pd.DataFrame()
        for index, freeforms in zip(pair_df.index, pair_df['freeforms']):
            for i in freeforms:
                row = pair_df[pair_df.index == index].copy()
                row['freeforms'] = i
                explode_df = pd.concat([explode_df, row], ignore_index=True)

        # Count how often each freeform occurs (a Series indexed by freeform)
        freq_df = explode_df['freeforms'].value_counts()
        # Narrow the returned Series to the N most popular if that was requested
        if top > 0:
            freq_df = freq_df.head(top)
        
        return freq_df

    
    # A more general version of freeforms() above.
    def mentions(self, phrase, search='relationshipspairslash', collect='relationshipspairslash'):
        # Search for `phrase` in dataset column `search`
        match_df = pd.DataFrame()
        for index, value in zip(self.df.index, self.df[search]):
            if phrase in value:
                row = self.df[self.df.index == index].copy()
                match_df = pd.concat([match_df, row])
        # Explode column `collect`
        explode_df = pd.DataFrame()
        for index, values in zip(match_df.index, match_df[collect]):
            for i in values:
                row = match_df[match_df.index == index].copy()
                row[collect] = i
                explode_df = pd.concat([explode_df, row], ignore_index=True)
        # Count frequencies
        freq_df = explode_df[collect].value_counts()
        return freq_df
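
Both methods follow the same search, explode, count pattern. As a rough sketch only, the same pattern can also be written with pandas built-ins (`Series.apply` for the search, `Series.explode` to unpack the lists, `Series.value_counts` to count). The function name `count_mentions` and the toy data are made up for illustration and are not part of the `Explorer` class.

```python
import pandas as pd

def count_mentions(df, phrase, search, collect):
    """Count values of list column `collect` over rows whose `search` list contains `phrase`."""
    # Boolean mask: True where the list in `search` contains the phrase
    mask = df[search].apply(lambda values: phrase in values)
    # One row per list element, then count how often each value occurs
    return df.loc[mask, collect].explode().value_counts()

# Made-up data for illustration
toy = pd.DataFrame({
    'relationshipspairslash': [['A/B'], ['A/B', 'C/D'], ['C/D']],
    'freeforms': [['Fluff', 'Angst'], ['Fluff'], ['Humor']],
})
counts = count_mentions(toy, 'A/B', 'relationshipspairslash', 'freeforms')
print(counts)
```

This avoids copying rows one at a time, which may matter on a large dataset, but it is an alternative phrasing rather than a tested replacement for the methods above.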

## Using the explorer

### Freeform explorer
 
Explore the freeforms for the pairing 'Elim Garak/Julian Bashir' and save the results to variable `f`

```
ex = Explorer(df)
f = ex.freeforms('Elim Garak/Julian Bashir')
```
 
Print that exploration
 
```
f
```
 
Save that exploration to a .CSV file. CSV files don't have any internal documentation, so take care with the file name.

```
f.to_csv('freeform-explorer-garak-bashir.csv')
```
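
Because the CSV carries no documentation of its own, it can also help to give the two columns explicit names before saving. The sketch below assumes the result is a pandas Series of counts, as returned by `value_counts()`. The labels `'freeform'` and `'count'` and the stand-in data are made up, and it writes to an in-memory buffer rather than a file so the header is easy to inspect.

```python
import io
import pandas as pd

# Stand-in for the result `f`: a Series of counts indexed by freeform
counts = pd.Series(['Fluff', 'Angst', 'Fluff']).value_counts()

# Name the index and the values so the CSV header is self-describing
labelled = counts.rename_axis('freeform').rename('count')

buffer = io.StringIO()
labelled.to_csv(buffer)
print(buffer.getvalue())
```

With a real result, pass a file name to `to_csv()` instead of the buffer.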

You don't need to restart Jupyter to explore another pairing, just run the `freeforms()` function again. We might only want the top 10 most popular freeforms:

```
f = ex.freeforms('Elim Garak/Julian Bashir', top=10)
```

or to explore another pairing entirely:

```
f = ex.freeforms('Jadzia Dax/Kira Nerys')
```
 
Similarly, you don't always need to save the results to `f`. Any variable name is fine.


### List field explorer

For fields which are lists, the explorer can `search` one column for the presence of a value. For the matching
rows, each value in the `collect` column's lists is then counted.

This is a generalised version of `freeforms()`. The same work as `ex.freeforms('Jadzia Dax/Kira Nerys')` is
done with `ex.mentions('Jadzia Dax/Kira Nerys', search='relationshipspairslash', collect='freeforms')`.

```
m = ex.mentions('Elim Garak/Julian Bashir', search='relationshipspairslash', collect='charactersclean')
m
```

The column names which can be passed to `search=` and `collect=` are:

* `'relationshipspairslash'`
* `'freeforms'`
* `'categories'`
* `'charactersclean'`
* `'fandoms'`
* `'relationshipspair'`
* `'relationshipspairamp'`
* `'relationshipspax'`
* `'relationshipspaxamp'`
* `'relationshipspaxslash'`

There is a bug at the moment. If there is no match then the program crashes. I'll fix that.
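
Until that is fixed, one possible workaround is to check for a match before calling the explorer. The helper below is a hypothetical sketch, not part of the `Explorer` class. It assumes the `search` column holds lists, as the columns listed above do, and is shown on made-up data.

```python
import pandas as pd

def has_match(df, phrase, search='relationshipspairslash'):
    """Return True if any row's list in column `search` contains `phrase`."""
    return any(phrase in values for values in df[search])

# Made-up data for illustration
toy = pd.DataFrame({
    'relationshipspairslash': [['A/B'], ['C/D', 'A/B'], []],
    'freeforms': [['Fluff'], ['Angst'], ['Humor']],
})
print(has_match(toy, 'A/B'))
print(has_match(toy, 'X/Y'))
```

With the real data this could guard a call, for example `if has_match(ex.df, 'Odo/Quark'): f = ex.freeforms('Odo/Quark')`.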


### Try one

Give it a try here. Change the line with `ex.freeforms()` or `ex.mentions()`.  Remove the leading `#` (the comment character). Then press \[Run\] to re-run the changed Jupyter cell.

In [None]:
pd.options.display.max_rows = 999
ex = Explorer(df)

In [None]:
# f = ex.freeforms('Odo/Quark', top=20)
# f
# f.to_csv('freeforms-filename.csv')

# m = ex.mentions('Elim Garak/Julian Bashir', search='relationshipspairslash', collect='charactersclean')
# m
# m.to_csv('mentions-filename.csv')