In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MultiLabelBinarizer

%matplotlib inline

plt.style.use('bmh')
pd.options.display.float_format = '{:,.0f}'.format

## Let's get set up

In [2]:
df = pd.read_csv('../data/fanfics_metadata.csv')

In [None]:
df.info()

## Clean data

In [None]:
cols_with_missing = [col for col in df.columns if df[col].isnull().any()]
cols_with_missing

Some columns contain blanks, but a blank here carries meaning and is not really missing data. For example, NaN in bookmarks means no one has bookmarked that work, so let's fill in zeros to be explicit about this. For some categorical columns we fill in a placeholder label such as 'No Category Specified'; any NaN left in the other categorical columns is handled when we do one-hot encoding.

In [3]:
# Fill per column with a dict and assign the result back. Calling
# fillna(inplace=True) on a column selection may act on a copy, and
# recent pandas versions warn about it.
fill_values = {
    'category': 'No Category Specified',
    'relationship': 'No Relationship Specified',
    'character': 'No Character Specified',
    'additional tags': 'No Additional Tags Specified',
    'words': 0,
    'comments': 0,
    'kudos': 0,
    'bookmarks': 0,
    'hits': 0,
}
df = df.fillna(value=fill_values)

We have multiple categorical columns, several of which hold a list of values per work ('category', 'fandom', 'relationship', 'additional tags'). The 'category' column is actually fairly short: each entry is some combination of M/M, F/F, F/M, Gen, Multi, Other, or No Category Specified. The three other columns are progressively more complicated.

In [None]:
df['category'].unique()
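
Because each entry is a combination of labels, `unique()` lists combinations rather than the individual labels. A minimal sketch of counting the individual labels instead, assuming entries are separated by `', '` as in the output shown later; the `toy` frame and its values are made up for illustration and stand in for `df`.

```python
import pandas as pd

# Toy stand-in for df['category']; values are illustrative, not real data.
toy = pd.DataFrame({
    'category': ['F/F, M/M', 'Gen', 'F/M, Multi', 'Gen', 'No Category Specified']
})

# Split each entry on ', ' into a list, put one label per row, then count labels.
label_counts = toy['category'].str.split(', ').explode().value_counts()
print(label_counts)
```

On the real data the same expression should work with `df` in place of `toy`, provided no label itself contains `', '`.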

## Working with Categories

The following columns are all categorical and can be one-hot encoded as needed: 'rating', 'language', 'status', 'category', 'fandom', 'relationship', 'character', 'additional tags'. Use col_list to specify the columns you want encoded in the new data frame, and the drop argument to choose whether to drop the original columns.

In [12]:
def column_MLB(col_name, df):
    # Takes a column name and does one-hot encoding on it, independent of whether
    # each row of the column holds a single category or a list of categories.
    # Rows where the column is NaN get 0 in every encoded column.

    mlb = MultiLabelBinarizer()
    # create a boolean mask matching the non-NaN values
    mask = df[col_name].notnull()
    # keep only the non-NaN rows, strip list brackets, and split on ', '
    arr = mlb.fit_transform(df.loc[mask, col_name].str.strip('[]').str.split(', '))
    # create the DataFrame and add back the NaN rows, filled with zeros
    return (pd.DataFrame(arr, index=df.index[mask], columns=mlb.classes_)
               .reindex(df.index, fill_value=0))
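
As a quick way to see what this kind of encoding produces, here is a small sketch on a toy column. It uses pandas' own `Series.str.get_dummies` rather than `MultiLabelBinarizer`, so it is a swapped-in technique for cross-checking and not the function above. The `toy` frame and its values are assumptions made up for illustration.

```python
import pandas as pd

# Toy stand-in for one multi-label column; None plays the role of NaN.
toy = pd.DataFrame({'category': ['F/F, M/M', 'Gen', None, 'F/M, Multi']})

# One indicator column per distinct label; rows with a missing value get all zeros.
encoded = toy['category'].str.get_dummies(sep=', ')
print(encoded)
```

Each row has a 1 under every label it carries, which matches the intent of the masked `fit_transform` plus `reindex(fill_value=0)` approach.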

In [13]:
def one_hot_encoding(df, columns, drop):
    # Work on a copy so the caller's data frame is never modified
    df_one_hot_encoded = df.copy()
    for column in columns:
        # Encode the column and append the indicator columns on the right
        result = column_MLB(column, df)
        df_one_hot_encoded = pd.concat([df_one_hot_encoded, result], axis=1, sort=False)
        if drop:
            df_one_hot_encoded = df_one_hot_encoded.drop(columns=[column])

    return df_one_hot_encoded
    

In [14]:
#col_list = ['rating', 'language', 'status', 'category', 'fandom', 'relationship', 'character', 'additional tags']
col_list = ['category']
new_df = one_hot_encoding(df, col_list, drop=True)
new_df

Unnamed: 0,work_id,title,rating,fandom,relationship,character,additional tags,language,published,status,...,kudos,bookmarks,hits,F/F,F/M,Gen,M/M,Multi,No Category Specified,Other
0,3104510,Second Chances,Teen And Up Audiences,"Star Wars - All Media Types, Star Wars Prequel...","Obi-Wan Kenobi/Anakin Skywalker, Obi-Wan Kenob...","Leia Organa, Luke Skywalker, Anakin Skywalker,...","Age Regression/De-Aging, Soul Bond, The Force,...",English,2015-01-05,Completed,...,1917,446,74424,1,0,0,1,0,0,0
1,6423526,hurricane on the edge of oblivion (with nowher...,Mature,Star Wars: The Wrath of Darth Maul - Ryder Win...,"Obi-Wan Kenobi & Xanatos, Qui-Gon Jinn & Feemo...","Obi-Wan Kenobi, Xanatos (Star Wars), Qui-Gon J...","minor OC's - Freeform, at least I'm pretty sur...",English,2016-04-01,Updated,...,1815,380,28728,0,0,1,0,0,0,0
2,9552773,time to change the road you're on,General Audiences,"Star Wars - All Media Types, Star Wars: The Cl...","Anakin Skywalker & Ahsoka Tano, Ahsoka Tano an...","Ahsoka Tano, Anakin Skywalker, Kanan Jarrus, E...","AU, Time Travel Fix-It, possibly more of a tim...",English,2017-02-02,Completed,...,1446,340,25348,0,0,1,0,0,0,0
3,5162474,Twin Sunrise,General Audiences,"Star Wars Original Trilogy, Star Wars: Rebels,...","Luke Skywalker & Darth Vader, Darth Vader & Ap...","Anakin Skywalker | Darth Vader, Luke Skywalker...","Grey Jedi, Alternate Universe, Sith Shenanigan...",English,2015-11-07,Completed,...,1418,382,52527,0,0,1,0,0,0,0
4,4417469,On the Edge of the Devil's Backbone,Teen And Up Audiences,"Star Wars: Rebels, Star Wars - All Media Types",Kanan Jarrus/Hera Syndulla,"Hera Syndulla, Kanan Jarrus, Sabine Wren, Gara...","Alternate Universe, Canon-Typical Violence",English,2015-07-25,Completed,...,1395,255,39386,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4315,17114495,Ostateczne Poswiecenie. Nieznana historia ukry...,General Audiences,"Star Wars - All Media Types, Rogue One: A Star...",No Relationship Specified,"Wilhuff Tarkin, Ahsoka Tano, Wulff Yularen",No Additional Tags Specified,Polski,2018-12-23,Completed,...,0,0,41,0,0,1,0,0,0,0
4316,16719585,The Life of Scoundrels,General Audiences,"Star Wars Sequel Trilogy, Star Wars Rebels",No Relationship Specified,"Han Solo, Chewbacca (Star Wars), Boba Fett, Ho...","disintegrations WILL become part of your rep, ...",English,2018-11-23,Completed,...,0,0,61,0,0,1,0,0,0,0
4317,20451119,In Night,Mature,"Star Wars: Rebels, Star Wars:KANAN",Janus Kasmir/Caleb Dume,"Caleb Dume, Janus Kasmir",No Additional Tags Specified,Ri Ben Yu,2019-08-30,Completed,...,0,0,15,0,0,0,1,0,0,0
4318,17136161,"""I am my prayer to you""",General Audiences,Star Wars: Rebels,"Garazeb ""Zeb"" Orrelios/Original Character(s)","Garazeb ""Zeb"" Orrelios, Original Characters","Romance, Religion, Lasan, Lasat, Prayer, Pilgr...",English,2018-12-24,Completed,...,0,0,46,0,1,0,0,0,0,0


Some general notes. One-hot encoding all the categorical columns will result in over 15k features and a very sparse matrix.
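
A rough way to check a feature count like this before encoding anything is to count the distinct labels per column. The sketch below does that on a toy frame; the `toy` data and the `count_labels` helper are hypothetical names introduced here, and the density figure assumes no label repeats within a row.

```python
import pandas as pd

# Toy stand-in for the multi-label columns; values are illustrative only.
toy = pd.DataFrame({
    'category': ['F/F, M/M', 'Gen', 'Gen'],
    'character': ['Leia Organa, Luke Skywalker', 'Ahsoka Tano',
                  'Luke Skywalker, Ahsoka Tano'],
})

def count_labels(series):
    # Distinct labels in a column = columns that one-hot encoding would add.
    return series.dropna().str.split(', ').explode().nunique()

n_features = {col: count_labels(toy[col]) for col in toy.columns}
total_features = sum(n_features.values())

# Fraction of cells that would be 1 in the encoded matrix.
n_ones = sum(toy[col].dropna().str.split(', ').str.len().sum() for col in toy.columns)
density = n_ones / (len(toy) * total_features)

print(n_features, total_features, density)
```

On the real data the density should come out far lower than in this toy example, since each work uses only a handful of the available labels.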

AO3 actually allows users to tag things as they like, but a human will assign synonyms as they see fit. It's an unusual and powerful hybrid. Using the synonyms instead of the user-specified tags might reduce the number of columns by perhaps a third? I'm not really sure without further scraping and possibly a database. For the near term, exploring the data further might be most helpful.
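
To make the synonym idea concrete, here is a sketch of collapsing user tags onto canonical tags before encoding. The `synonyms` table, the `canonicalize` helper, and the example works are all hypothetical; the real mapping would have to come from further scraping.

```python
# Hypothetical synonym table: user-written tag -> canonical tag.
# These pairs are made up for illustration, not scraped from AO3.
synonyms = {
    'AU': 'Alternate Universe',
    'Alternate Universe - Freeform': 'Alternate Universe',
    'Time Travel Fix-It': 'Time Travel',
}

def canonicalize(tag_string, mapping):
    # Split on ', ', map each tag to its canonical form (unknown tags stay
    # unchanged), and drop duplicates while keeping the original order.
    canonical = [mapping.get(tag, tag) for tag in tag_string.split(', ')]
    return list(dict.fromkeys(canonical))

# Toy 'additional tags' values for three works.
works = [
    'AU, Time Travel Fix-It',
    'Alternate Universe, Canon-Typical Violence',
    'Alternate Universe - Freeform, Time Travel',
]

raw_labels = {tag for work in works for tag in work.split(', ')}
canonical_labels = {tag for work in works for tag in canonicalize(work, synonyms)}

print(len(raw_labels), len(canonical_labels))
```

The drop in distinct labels is what would translate into fewer one-hot columns; how large that drop is on the real data remains an open question.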

## What is the shape of the data

In [None]:
new_df.info()