In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

First, I want to inspect the metadata files and tidy them so the information is easier to use, for example by cleaning up the author column in the Project Gutenberg data.

In [2]:
pg_metadata = pd.read_csv('../data/gut_books/austen_metadata.csv')

In [3]:
pg_metadata.head()

Unnamed: 0,Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
0,105,Text,1994-02-01,Persuasion,en,"Austen, Jane, 1775-1817",England -- Social life and customs -- 19th cen...,PR,
1,121,Text,1994-04-01,Northanger Abbey,en,"Austen, Jane, 1775-1817",England -- Social life and customs -- 19th cen...,PR,Gothic Fiction
2,141,Text,1994-06-01,Mansfield Park,en,"Austen, Jane, 1775-1817",England -- Fiction; Young women -- Fiction; Lo...,PR,
3,158,Text,1994-08-01,Emma,en,"Austen, Jane, 1775-1817",Humorous stories; England -- Fiction; Young wo...,PR,
4,21839,Text,2007-06-15,Sense and Sensibility,en,"Austen, Jane, 1775-1817; Dobson, Austin, 1840-...",England -- Social life and customs -- 19th cen...,PR,


Keep columns: Text#, Title, Authors

In [4]:
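# .copy() gives an independent frame, so the Authors assignment in the next cell does not write to a slice of the original
pg_metadata = pg_metadata[['Text#', 'Title', 'Authors']].copy()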

In [5]:
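# clean up Authors: keep only the first two comma-separated parts ("Surname, Forename"), dropping life dates and any co-authors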
pg_metadata['Authors'] = pg_metadata['Authors'].str.split(',').str[:2].str.join(',')

In [6]:
pg_metadata

Unnamed: 0,Text#,Title,Authors
0,105,Persuasion,"Austen, Jane"
1,121,Northanger Abbey,"Austen, Jane"
2,141,Mansfield Park,"Austen, Jane"
3,158,Emma,"Austen, Jane"
4,21839,Sense and Sensibility,"Austen, Jane"
5,42671,Pride and Prejudice,"Austen, Jane"
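
As a quick check on the author clean-up above, here is a minimal sketch of the same split/slice/join chain applied to a small toy Series. The second and third values are hypothetical (they are not taken from the metadata file) and are only there to show how a multi-author entry and a single-name entry would behave.

```python
import pandas as pd

# Toy values: the first mirrors the format shown in the metadata,
# the other two are hypothetical examples for illustration only.
authors = pd.Series([
    "Austen, Jane, 1775-1817",
    "Austen, Jane, 1775-1817; Doe, John, 1800-1850",  # hypothetical co-author
    "Anonymous",                                       # hypothetical single-name author
])

# Split on commas, keep the first two parts, and join them back together.
cleaned = authors.str.split(',').str[:2].str.join(',')

print(cleaned.tolist())
```

Note that this approach keeps only the first listed author, so co-authors are dropped. That is fine for the Austen data here, but it might need revisiting for other collections.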


In [7]:
ao3_metadata = pd.read_csv('../data/fanfic_texts/links_05182024053844.csv')

In [8]:
ao3_metadata.head()

Unnamed: 0,title,author,summary,fandoms,warnings,characters,relationships,tags,words,rating,chapters,categories,complete,link
0,We Have Been Trying To Reach You About Your Li...,Katri,\n<p>Just a bit of silliness based on the prom...,"['Pride and Prejudice - Jane Austen', 'Pride a...",['No Archive Warnings Apply'],['Mr. Bennet (Pride and Prejudice)'],[],[],11116,General Audiences,2/2,F/M,True,https://archiveofourown.org/works/55697983
1,The Younger Son,Sonetka,"\n<p>For the JAFF Trope Inversion Prompt: ""The...","['AUSTEN Jane - Works', 'Sense and Sensibility...",['Creator Chose Not To Use Archive Warnings'],"['Edward Ferrars', 'Elinor Dashwood', 'Fanny D...",['Elinor Dashwood/Edward Ferrars'],[],7164,General Audiences,1/1,"F/M, Gen",True,https://archiveofourown.org/works/55785940
2,Golden,Courtney621,\n<p>Mrs. Bennet’s foolish boast about her eld...,"['Pride and Prejudice - Jane Austen', 'AUSTEN ...",['No Archive Warnings Apply'],"['Jane Bennet', 'Charles Bingley', 'Caroline B...",['Jane Bennet/Charles Bingley'],"['Alternate Universe - Fairy Tale', 'Alternate...",12066,General Audiences,3/3,F/M,True,https://archiveofourown.org/works/55435072
3,Pride and Prejudice,wildwomendontgettheblues,"\n<p>""It is a truth universally acknowledged, ...",['Pride and Prejudice - Jane Austen'],['No Archive Warnings Apply'],"['Fitzwilliam Darcy', 'Elizabeth Bennet', 'Wil...","['Elizabeth Bennet/Fitzwilliam Darcy', 'Jane B...","['Enemies to Lovers', 'Falling In Love', 'Comi...",122143,Not Rated,2/2,"F/M, Gen",True,https://archiveofourown.org/works/55736635
4,The Settlement of Lady Elliot's Piano,Gwynterys,\n<p>—Admiral Croft executes an outflanking ma...,['Persuasion - Jane Austen'],['No Archive Warnings Apply'],"['Anne Elliot', 'Frederick Wentworth', 'Admira...",['Anne Elliot & Frederick Wentworth'],"['Marriage', 'Family Drama', 'Character Study'...",5941,General Audiences,2/2,"F/M, Gen",True,https://archiveofourown.org/works/55189105


In [9]:
# create Text# column by extracting the work_id from link column
ao3_metadata['Text#'] = ao3_metadata['link'].str.split(pat='/').str[-1]

In [None]:
ao3_metadata.head()

Keep columns: Text#, title, author, words

In [10]:
ao3_metadata = ao3_metadata[['Text#', 'title', 'author', 'words']]

In [11]:
ao3_metadata.head()

Unnamed: 0,Text#,title,author,words
0,55697983,We Have Been Trying To Reach You About Your Li...,Katri,11116
1,55785940,The Younger Son,Sonetka,7164
2,55435072,Golden,Courtney621,12066
3,55736635,Pride and Prejudice,wildwomendontgettheblues,122143
4,55189105,The Settlement of Lady Elliot's Piano,Gwynterys,5941
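
One thing to keep in mind: splitting the link yields the work id as a string, whereas `read_csv` will normally parse a purely numeric `Text#` column in the PG metadata as integers. If the two tables are ever combined or compared on `Text#` (an assumption on my part, not something done in this notebook yet), it may help to cast the AO3 ids to integers. A minimal sketch, using two of the links shown above:

```python
import pandas as pd

# Two links taken from the AO3 metadata preview above.
links = pd.Series([
    "https://archiveofourown.org/works/55697983",
    "https://archiveofourown.org/works/55785940",
])

# Same extraction as in the notebook, followed by a cast to integer.
work_ids = links.str.split(pat='/').str[-1].astype(int)

print(work_ids.tolist())
print(work_ids.dtype)
```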


The AO3 metadata already includes a word count for each work, but I'll need to compute it for the PG books. Since those counts may change after data exploration and further cleaning, I'll write the code to generate them here and apply it later.

In [25]:
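for text_id in pg_metadata['Text#']:  # text_id avoids shadowing the built-in id()
    with open(f'../data/gut_books/{text_id}.txt', 'r') as file:
        print(text_id)
        # split() with no arguments splits on any run of whitespace, giving word tokens
        words = file.read().split()
        print(words[100:110])  # sample of tokens to sanity-check the split
        word_num = len(words)

    print(word_num)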

105
['admiration', 'and', 'respect,', 'by', 'contemplating', 'the', 'limited', 'remnant', 'of', 'the']
83335
121
['immediate', 'publication.', 'It', 'was', 'disposed', 'of', 'to', 'a', 'bookseller,', 'it']
77223
141
['XLVII', 'CHAPTER', 'XLVIII', 'CHAPTER', 'I', 'About', 'thirty', 'years', 'ago', 'Miss']
159630
158
['IX.', 'CHAPTER', 'X.', 'CHAPTER', 'XI.', 'CHAPTER', 'XII.', 'CHAPTER', 'XIII.', 'CHAPTER']
157558
21839
['XI', 'CHAPTER', 'XII', 'CHAPTER', 'XIII', 'CHAPTER', 'XIV', 'CHAPTER', 'XV', 'CHAPTER']
121870
42671
['of', '"Sense', 'and', 'Sensibility."', 'VOL.', 'I.', 'London:', 'Printed', 'for', 'T.']
121980


Initially, the word count for the Austen books was quite low because newline characters were effectively joining words together. I went back to the scraping notebook and amended the code to remove those characters, which fixed the problem.
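
When it comes time to use the counts, the loop above could be wrapped in a small helper and stored as a new column. The sketch below is only one possible way to do that: `count_words` is a hypothetical helper name, the UTF-8 encoding is an assumption about the text files, and the demo uses a temporary directory with two tiny made-up files rather than the real `../data/gut_books/` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def count_words(path):
    """Return the number of whitespace-separated tokens in a text file."""
    # Assumes the file is UTF-8 encoded; adjust if the scraped files differ.
    return len(Path(path).read_text(encoding='utf-8').split())


# Demo with a throwaway directory standing in for the real book folder.
with tempfile.TemporaryDirectory() as tmp:
    book_dir = Path(tmp)
    (book_dir / '1.txt').write_text('It is a truth\nuniversally acknowledged', encoding='utf-8')
    (book_dir / '2.txt').write_text('About thirty years ago', encoding='utf-8')

    # Made-up metadata frame with the same Text# / Title layout as pg_metadata.
    demo = pd.DataFrame({'Text#': [1, 2], 'Title': ['Demo A', 'Demo B']})
    demo['words'] = [count_words(book_dir / f'{text_id}.txt') for text_id in demo['Text#']]

print(demo)
```

Using the column name `words` would match the AO3 metadata, which should make the two tables easier to compare later.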