# Scraping

This notebook uses routines from [`scrape.py`](../util/scrape.py) to scrape fanfiction from [Archive of our Own](http://archiveofourown.org) (AO3). We've chosen to focus on *Avatar: The Last Airbender* (which we'll call Avatar) fanfiction for this project.

In [1]:
import sys
sys.path.append("..")

from util.scrape import get_search_url, get_works_info, scrape_fic_list
import numpy as np

%load_ext autoreload
%autoreload 2

We define our search query: completed, English-language Avatar fanfiction ("fics") with more than 400 kudos.

In [2]:
avatar_search = get_search_url("Avatar: The Last Airbender", min_kudos=400, complete=True, english_only=True)

In [3]:
avatar_search

'https://archiveofourown.org/works/search?utf8=✓&work_search[complete]=T&work_search[crossover]=F&work_search[fandom_names]=Avatar: The Last Airbender&work_search[single_chapter]=&work_search[language_id]=en&work_search[kudos_count]=>400&work_search[sort_column]=created_at&work_search[sort_direction]=desc'
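For readers without `scrape.py` at hand, here is a minimal sketch of how a search URL like the one above could be assembled. The function name `build_search_url` and its parameters are hypothetical; only the `work_search[...]` keys and values are taken from the output above, and the real `get_search_url` may be implemented quite differently.

```python
def build_search_url(fandom, min_kudos=0, complete=True, english_only=True):
    """Assemble an AO3 work-search URL (illustrative stand-in, not the real get_search_url)."""
    base = "https://archiveofourown.org/works/search?utf8=✓"
    # Keys and their order mirror the work_search[...] fields visible in the URL above.
    fields = [
        ("complete", "T" if complete else ""),
        ("crossover", "F"),
        ("fandom_names", fandom),
        ("single_chapter", ""),
        ("language_id", "en" if english_only else ""),
        ("kudos_count", f">{min_kudos}" if min_kudos else ""),
        ("sort_column", "created_at"),
        ("sort_direction", "desc"),
    ]
    # The string is left unencoded, as in the notebook output.
    query = "&".join(f"work_search[{key}]={value}" for key, value in fields)
    return f"{base}&{query}"


url = build_search_url("Avatar: The Last Airbender", min_kudos=400)
print(url)
```

Note that `min_kudos=400` becomes `kudos_count=>400`, so a work with exactly 400 kudos would not match.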

Let's scrape the search results to collect metadata for the fics we would like to include in our corpus. We request up to 2000 works and restrict the collection to works that are longer than 1000 words and not part of a series. The series restriction matters because some authors post dozens of chapters as individual fics, which would bias our data towards those few authors.

In [59]:
works_info = get_works_info(avatar_search, 2000, word_range=(1000,0), exclude_series=True)

2358 potential matches found
getting page 1, 2000 works left
getting page 6, 1957 works left
getting page 11, 1904 works left
getting page 16, 1864 works left
getting page 21, 1828 works left
getting page 26, 1782 works left
getting page 31, 1730 works left
getting page 36, 1682 works left
getting page 41, 1630 works left
getting page 46, 1580 works left
getting page 51, 1531 works left
getting page 56, 1487 works left
getting page 61, 1451 works left
getting page 66, 1412 works left
getting page 71, 1385 works left
getting page 76, 1343 works left
getting page 81, 1301 works left
getting page 86, 1260 works left
getting page 91, 1200 works left
getting page 96, 1156 works left
getting page 101, 1114 works left
getting page 106, 1062 works left
getting page 111, 1012 works left
getting page 116, 967 works left
ran out of fics with 937 left


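The search ran out of matching works with 937 of the requested 2000 still unfilled, so 1063 works were collected. The above was run 11/15/20 at 7:42 AM, but its results weren't saved, and a rerun would return something different: the two searches in this notebook already report different match counts (2358 and 2370). Let's see an example of the metadata format.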

In [5]:
example = get_works_info(avatar_search, 20, exclude_series=True)

2370 potential matches found
getting page 1, 20 works left
finished


In [8]:
example[0]

{'work_id': '27388627',
 'rating': 'general',
 'lang': 'English',
 'words': 1476,
 'chapters': 1,
 'date': '04 Nov 2020',
 'series': {},
 'author': 'lesmiserablol',
 'all_authors': ['lesmiserablol']}
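To make the word-count and series filters concrete, the sketch below applies them to records shaped like the one above. It assumes that in `word_range=(low, high)` an upper bound of `0` means "no upper limit", that the lower bound is inclusive, and that an empty `series` dictionary means the work belongs to no series. These are guesses about `get_works_info`, and `keep_work` is a hypothetical helper, not part of `scrape.py`.

```python
def keep_work(work, word_range=(0, 0), exclude_series=False):
    """Return True if a metadata record passes the word-count and series filters."""
    low, high = word_range
    # Assumption: the lower bound is inclusive.
    if work["words"] < low:
        return False
    # Assumption: an upper bound of 0 means "unbounded".
    if high and work["words"] > high:
        return False
    # Assumption: works outside any series carry an empty 'series' dict.
    if exclude_series and work["series"]:
        return False
    return True


records = [
    {"work_id": "27388627", "words": 1476, "series": {}},  # from the example above
    {"work_id": "1", "words": 800, "series": {}},  # made up: too short
    {"work_id": "2", "words": 5000, "series": {"123": "made-up series"}},  # made up: in a series
]
kept = [w["work_id"] for w in records
        if keep_work(w, word_range=(1000, 0), exclude_series=True)]
print(kept)
```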

We pass the collected metadata, an array of records like the one above, to `scrape_fic_list`, which uses each `work_id` to find the fanfiction's URL and scrape its text.

In [60]:
fic_df = scrape_fic_list(works_info, print_every=30)

beginning scrape...
scraping fic 1/1063
scraping fic 31/1063
scraping fic 61/1063
scraping fic 91/1063
scraping fic 121/1063
scraping fic 151/1063
scraping fic 181/1063
scraping fic 211/1063
scraping fic 241/1063
scraping fic 271/1063
scraping fic 301/1063
scraping fic 331/1063
scraping fic 361/1063
scraping fic 391/1063
scraping fic 421/1063
scraping fic 451/1063
scraping fic 481/1063
scraping fic 511/1063
scraping fic 541/1063
scraping fic 571/1063
scraping fic 601/1063
scraping fic 631/1063
scraping fic 661/1063
scraping fic 691/1063
scraping fic 721/1063
scraping fic 751/1063
scraping fic 781/1063
scraping fic 811/1063
scraping fic 841/1063
scraping fic 871/1063
scraping fic 901/1063
scraping fic 931/1063
scraping fic 961/1063
scraping fic 991/1063
scraping fic 1021/1063
scraping fic 1051/1063
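The loop behind a log like this can be sketched as follows. This is not the code in `scrape.py`: `work_url`, `scrape_all`, and the `fetch` callable are hypothetical names, the `view_full_work` query parameter is an assumption about how one would request all chapters on one page, and the sketch stores the raw page instead of parsing out the text, title, and tags. Passing `fetch` in as an argument keeps the example runnable without network access.

```python
import time


def work_url(work_id):
    """Build the URL of a work's page from its ID."""
    # Assumption: view_full_work=true asks for every chapter on a single page.
    return f"https://archiveofourown.org/works/{work_id}?view_full_work=true"


def scrape_all(works, fetch, print_every=30, delay=0.0):
    """Fetch each work's page and attach it to a copy of its metadata."""
    rows = []
    print("beginning scrape...")
    for i, work in enumerate(works):
        if i % print_every == 0:
            print(f"scraping fic {i + 1}/{len(works)}")
        page = fetch(work_url(work["work_id"]))
        rows.append({**work, "text": page})
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return rows


def fake_fetch(url):
    """Stand-in for a real HTTP request; it just echoes the URL."""
    return f"<html>{url}</html>"


rows = scrape_all([{"work_id": "27388627"}], fake_fetch, print_every=30)
```

In real use, `fetch` would perform an HTTP GET and `delay` would be set to a few seconds.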


Here's what the first few entries look like.

In [13]:
fic_df.head(3)

| | work_id | rating | lang | words | chapters | date | series | author | all_authors | title | text | relationships | chars | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27388627 | general | English | 1476 | 1 | 04 Nov 2020 | {} | lesmiserablol | [lesmiserablol] | the water's rough (but this love is ours) | \nZuko is not surprised in the slightest when ... | [Sokka/Zuko (Avatar)] | [Sokka (Avatar), Zuko (Avatar)] | [Hurt/Comfort, Established Relationship, Chron... |
| 1 | 27288721 | general | English | 2540 | 1 | 30 Oct 2020 | {} | Haicrescendo | [Haicrescendo] | Amateur Theatrics | Toph doesn’t give two shits about books or scr... | [Toph Beifong & Zuko] | [Toph Beifong, Zuko (Avatar), Aang (Avatar)] | [no ships here we die like men, fuck around an... |
| 2 | 27241933 | teen | English | 7232 | 1 | 28 Oct 2020 | {} | nights | [nights] | Love Amongst the Algorithm | Zuko’s thumb swipes the screen, again, again, ... | [Sokka/Zuko (Avatar)] | [Zuko (Avatar), Sokka (Avatar), Ty Lee (Avatar... | [Alternate Universe - Modern Setting, Alternat... |


Save our dataframe to a file. The call is commented out below; uncomment it to write the pickle.

In [67]:
#fic_df.to_pickle("../data/avatar_fics_scraped.pickle")

Potential future work: limit the number of works from any single author in the corpus, especially when those works are very short.
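One way this could look, as a rough sketch only: group the metadata records by `author`, keep the longest few works from each, and drop the rest. The helper `cap_per_author`, the default cap of 3, and the choice to prefer longer works are all assumptions made for illustration; nothing like this exists in the notebook yet.

```python
from collections import defaultdict


def cap_per_author(works, max_per_author=3):
    """Keep at most max_per_author works per author, preferring longer works."""
    by_author = defaultdict(list)
    for work in works:
        by_author[work["author"]].append(work)
    kept = []
    for author_works in by_author.values():
        # Longest first, so very short works are the first to be dropped.
        author_works.sort(key=lambda w: w["words"], reverse=True)
        kept.extend(author_works[:max_per_author])
    return kept


# Made-up records with only the fields the helper needs.
sample = [
    {"author": "a", "words": 1200},
    {"author": "a", "words": 9000},
    {"author": "a", "words": 1500},
    {"author": "b", "words": 2000},
]
capped = cap_per_author(sample, max_per_author=2)
print([(w["author"], w["words"]) for w in capped])
```

Grouping on `author` ignores co-authors; a stricter version could count a work against everyone listed in `all_authors`.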