# Frequency of variables

Calculate the frequency of selected variables, write the frequency table to a CSV, and print the frequency table.

The code here starts with `base.ipynb`; the rest is cut-and-paste from `src/misc/pandas-cheatsheet.md`.

## Load the data into dataframe `df`

This cell is from `base.ipynb`.

Do some basic narrowing which applies to all frequency tables.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import ds9
df = ds9.df()

## charactersclean

We'll show the creation of the frequency table for `charactersclean` in detail.

### Narrow the dataframe to just the column `charactersclean`

In [2]:
charactersclean_df = df[['charactersclean']]

### Explode the `charactersclean` list into rows

Straight from the cheat sheet.

In [3]:
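# Build a new dataframe with one row per list element.
new_df = pd.DataFrame()
for index, charactersclean in zip(charactersclean_df.index, charactersclean_df['charactersclean']):
    for i in charactersclean:
        # Copy the source row, then overwrite the list with a single element.
        row = charactersclean_df[charactersclean_df.index == index].copy()
        row['charactersclean'] = i
        # Append the single-element row to the result.
        new_df = pd.concat([new_df, row], ignore_index=True)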
charactersclean_df = new_df
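
The row-by-row loop above re-filters the dataframe and rebuilds `new_df` for every element, which gets slow on large inputs. As a possible alternative, and not what the cheat sheet uses, pandas 1.1 and later offer `DataFrame.explode` with `ignore_index`. The sketch below runs on a made-up `toy_df` and assumes every cell of the column holds a Python list.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values): each cell holds a list.
toy_df = pd.DataFrame({
    'charactersclean': [['Odo', 'Quark'], ['Odo'], []],
})

# One row per list element; ignore_index=True renumbers the rows from 0.
exploded = toy_df[['charactersclean']].explode('charactersclean', ignore_index=True)

# The loop skips empty lists entirely, while explode() keeps them as NaN rows,
# so drop those rows to get the same result as the loop.
exploded = exploded.dropna(subset=['charactersclean']).reset_index(drop=True)
print(exploded)
```

The `dropna` step matters only when some works have an empty list; without it the NaN rows stay in the dataframe, although `value_counts()` ignores NaN by default.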

### Calculate the frequency of the rows' values

From the cheat sheet: `value_counts()` returns a new Series of frequency counts, sorted from most to least frequent.

Then save those into a CSV.
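
To make the shape of the result concrete, here is a minimal sketch of the count-and-save step on made-up values. It writes to an in-memory buffer instead of a file so the CSV text can be inspected directly, and it passes `header=False`, the documented boolean form, to omit the header row.

```python
import io
import pandas as pd

# Hypothetical exploded column: one character name per row.
values = pd.Series(['Odo', 'Quark', 'Odo', 'Nog', 'Odo', 'Quark'],
                   name='charactersclean')

# Count occurrences; the result is indexed by value and sorted by count.
toy_freq = values.value_counts()

# Write "value,count" lines with no header row into an in-memory buffer.
buffer = io.StringIO()
toy_freq.to_csv(buffer, header=False)
print(buffer.getvalue())
```

Each line of the CSV is a value followed by its count, in the same descending order as the printed Series.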

In [4]:
freq = charactersclean_df['charactersclean'].value_counts()
freq.to_csv('frequency-charactersclean.csv', header=None)
pd.options.display.max_rows = 999
print(freq)

Julian Bashir                         4938
Elim Garak                            4521
Kira Nerys                            1945
non-cast                              1922
Jadzia Dax                            1631
Odo                                   1380
Benjamin Sisko                        1190
Quark                                 1122
Miles O'Brien                         1051
Ezri Dax                               577
Skrain Dukat                           544
Worf                                   438
Jake Sisko                             384
Weyoun                                 369
Keiko O'Brien                          365
Corat Damar                            323
Tora Ziyal                             289
Nog                                    283
Kelas Parmak                           224
Rom                                    203
Enabran Tain                           186
Leeta                                  169
Molly O'Brien                          147
Kasidy Yate

## relationshipspairslash

The processing of the other variables follows this same model.
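
Since every variable goes through the same narrow, explode, count, and save steps, the repetition could be folded into one helper. The sketch below is an illustration only: `frequency_table` is a hypothetical name, not part of the `ds9` module, and it assumes each cell of the column holds a list.

```python
import pandas as pd

def frequency_table(df, column, csv_path=None):
    """Explode a list-valued column and count how often each value occurs.

    Hypothetical helper; csv_path is optional so the function can be
    tried without touching the filesystem.
    """
    # One entry per list element; empty lists become NaN and are dropped.
    exploded = df[column].explode().dropna()
    freq = exploded.value_counts()
    if csv_path is not None:
        freq.to_csv(csv_path, header=False)
    return freq

# Made-up data in the same shape as the real column.
toy_df = pd.DataFrame({
    'relationshipspairslash': [['Odo/Quark'], ['Odo/Quark', 'Kira Nerys/Odo'], []],
})
toy_freq = frequency_table(toy_df, 'relationshipspairslash')
print(toy_freq)
```

With such a helper, each section would reduce to a single call like `frequency_table(df, 'freeforms', 'frequency-freeforms.csv')` followed by a `print`.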

In [5]:
relationshipspairslash_df = df[['relationshipspairslash']]

new_df = pd.DataFrame()
for index, relationshipspairslash in zip(relationshipspairslash_df.index, relationshipspairslash_df['relationshipspairslash']):
    for i in relationshipspairslash:
        row = relationshipspairslash_df[relationshipspairslash_df.index == index].copy()
        row['relationshipspairslash'] = i
        new_df = pd.concat([new_df, row], ignore_index=True)
relationshipspairslash_df = new_df

freq = relationshipspairslash_df['relationshipspairslash'].value_counts()
freq.to_csv('frequency-relationshipspairslash.csv', header=None)
pd.options.display.max_rows = 999
print(freq)

Elim Garak/Julian Bashir                                                                                                                                              3812
Jadzia Dax/Kira Nerys                                                                                                                                                  394
Odo/Quark                                                                                                                                                              372
non-cast/non-cast                                                                                                                                                      304
Kira Nerys/Odo                                                                                                                                                         244
Jadzia Dax/Worf                                                                                                                                  

## relationshipspairamp

In [6]:
relationshipspairamp_df = df[['relationshipspairamp']]

new_df = pd.DataFrame()
for index, relationshipspairamp in zip(relationshipspairamp_df.index, relationshipspairamp_df['relationshipspairamp']):
    for i in relationshipspairamp:
        row = relationshipspairamp_df[relationshipspairamp_df.index == index].copy()
        row['relationshipspairamp'] = i
        new_df = pd.concat([new_df, row], ignore_index=True)
relationshipspairamp_df = new_df

freq = relationshipspairamp_df['relationshipspairamp'].value_counts()
freq.to_csv('frequency-relationshipspairamp.csv', header=None)
pd.options.display.max_rows = 999
print(freq)

Elim Garak & Julian Bashir                                  583
Julian Bashir & Miles O'Brien                               140
Jadzia Dax & Julian Bashir                                  105
non-cast & non-cast                                          74
Jadzia Dax & Kira Nerys                                      69
Odo & Quark                                                  62
Julian Bashir & Kira Nerys                                   62
Julian Bashir & non-cast                                     48
Elim Garak & Odo                                             41
Elim Garak & Enabran Tain                                    37
Elim Garak & non-cast                                        36
Kira Nerys & Odo                                             36
Ezri Dax & Julian Bashir                                     35
Jake Sisko & Nog                                             34
Elim Garak & Tora Ziyal                                      34
Benjamin Sisko & Jadzia Dax             

## freeforms

In [7]:
freeforms_df = df[['freeforms']]

new_df = pd.DataFrame()
for index, freeforms in zip(freeforms_df.index, freeforms_df['freeforms']):
    for i in freeforms:
        row = freeforms_df[freeforms_df.index == index].copy()
        row['freeforms'] = i
        new_df = pd.concat([new_df, row], ignore_index=True)
freeforms_df = new_df

freq = freeforms_df['freeforms'].value_counts()
freq.to_csv('frequency-freeforms.csv', header=None)
pd.options.display.max_rows = 20000
# Just print those seen twice or more, see the .CSV for the full list
print(freq[freq > 1])

Fluff                                                                                794
Established Relationship                                                             613
Angst                                                                                610
Post-Canon                                                                           398
Hurt/Comfort                                                                         378
Post-Canon Cardassia                                                                 335
Humor                                                                                305
Friendship                                                                           291
Alternate Universe - Canon Divergence                                                264
First Kiss                                                                           255
Oral Sex                                                                             245
Friends to Lovers    

## relationshipspairslashgender

In [8]:
gender_df = df[['relationshipspairslashgender']]
explode_df = ds9.explode(gender_df, 'relationshipspairslashgender')
freq = explode_df['relationshipspairslashgender'].value_counts()
freq.to_csv('frequency-relationshipspairslashgender.csv', header=None)
freq

Male/Male                                                                                     5174
Female/Male                                                                                   1208
Male/unknown                                                                                   690
Female/Female                                                                                  665
unknown/unknown                                                                                324
Female/unknown                                                                                 165
Male/Male/Male                                                                                  74
Female/Male/Male                                                                                53
Male/Male/unknown                                                                               43
Female/Male/unknown                                                                             37
Female/Fem

## relationshipspairslashspecies

In [9]:
species_df = df[['relationshipspairslashspecies']]
explode_df = ds9.explode(species_df, 'relationshipspairslashspecies')
freq = explode_df['relationshipspairslashspecies'].value_counts()
freq.to_csv('frequency-relationshipspairslashspecies.csv', header=None)
freq

Cardassian/Human                                                                                                                3883
Bajoran/Trill                                                                                                                    488
Changeling/Ferangi                                                                                                               372
Human/unknown                                                                                                                    335
unknown/unknown                                                                                                                  328
Cardassian/Cardassian                                                                                                            259
Bajoran/Changeling                                                                                                               246
Cardassian/unknown                                                   