# Fetch Gutenberg books from a category

## Step 1, get book ids

- Go to http://m.gutenberg.org/ebooks/search.mobile/?query=bsxHorror&sort_order=downloads
- Scroll to the bottom and click "show more" a few times.
- Paste the JavaScript below into the browser's JS console.
- The ids should now be on your clipboard; paste them into `ids` below.


```js
// to get all book ids shown on page, paste this javascript into js console in browser when on the page above
a_elems = document.getElementsByClassName("table link")
hrefs = Array.from(a_elems)
  .map(e=>e.href) // get link
  .filter(e=>e) // remove empty links
ids = hrefs.map(e=>/(\d+)\.mobile/.exec(e)) // regular expression match
  .filter(e=>e) // remove ones not found
  .map(e=>e[1]) // get just id
copy(ids) // copy to clipboard
```
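
If the console's `copy()` helper is not available, the same extraction can be done in Python on a saved list of links. This is a minimal sketch, assuming the links contain `<id>.mobile` as the regular expression above expects; `extract_ids` and the sample URLs are made up for illustration.

```python
import re

# Same pattern as the JavaScript snippet: digits followed by ".mobile"
ID_PATTERN = re.compile(r"(\d+)\.mobile")

def extract_ids(hrefs):
    """Return the book id found in each link, skipping empty or non-matching links."""
    ids = []
    for href in hrefs:
        if not href:
            continue  # remove empty links
        match = ID_PATTERN.search(href)
        if match:
            ids.append(match.group(1))  # keep just the id
    return ids

# Hypothetical sample links, for illustration only
sample = [
    "http://m.gutenberg.org/ebooks/16328.mobile",
    "http://m.gutenberg.org/ebooks/search.mobile/?query=x",
    "",
]
print(extract_ids(sample))
```

The search-page link is dropped because `search.mobile` has no digits in front of `.mobile`.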

In [1]:
import sys
sys.path.append('..')  # make the parent directory importable

In [2]:
import requests
import os
import re
import bs4
import time
import json


dest_dir = '../data/input/poetry_gutenberg'
raw_dir = os.path.join(dest_dir, 'raw')
# raw_dir is nested inside dest_dir, so this creates both if they are missing
os.makedirs(raw_dir, exist_ok=True)

In [3]:
from tqdm import tqdm

In [4]:
from dataset import parse_gutenberg

In [5]:
# Gutenberg book ids to download (pasted from step 1)
ids = [
  "16328",
  "1322",
  "20",
  "9622",
  "228",
  "1321",
  "16452",
  "1012",
  "981",
  "1567",
  "23684",
  "847",
  "14568",
  "14020",
  "4800",
  "19",
  "2490",
  "24269",
  "3333",
  "1365",
  "19221",
  "9700",
  "20158",
  "2039",
  "8388",
  "18500",
  "23475",
  "30276",
  "8209",
  "24280",
  "8801",
  "21811",
  "262",
  "28665",
  "20732",
  "25340",
  "12389",
  "27577",
  "20431",
  "16786",
  "841",
  "8820",
  "28666",
  "261",
  "28621",
  "1020",
  "43224",
  "13830",
  "12925",
  "13310",
  "8672",
  "27308",
  "53375",
  "27199",
  "22531",
  "12031",
  "15553",
  "12759",
  "12924",
  "17119",
  "579",
  "12664",
  "13037",
  "23979",
  "26376",
  "15390",
  "13900",
  "2670",
  "18726",
  "21029",
  "20174",
  "12032",
  "19784",
  "22001",
  "13223",
  "9606",
  "27663",
  "7325",
  "12717",
  "22833",
  "18871",
  "53385",
  "17948",
  "12718",
  "26288",
  "18673",
  "7110",
  "17347",
  "1229",
  "13224",
  "12413",
  "18524",
  "9889",
  "53378",
  "18287"
]


In [7]:
# download raw files, skipping any that are already cached in raw_dir
for bid in tqdm(ids, mininterval=60):

    # first fetch the directory listing for this book
    index_url = "http://www.gutenberg.org/files/{bid}".format(bid=bid)
    r = requests.get(index_url)
    r.raise_for_status()
    soup = bs4.BeautifulSoup(r.content, "html5lib")
    # href=True skips anchors without an href attribute
    hrefs = [e['href'] for e in soup.find_all('a', href=True)]
    links = [h for h in hrefs if h.endswith('.txt')]
    time.sleep(0.1)  # pause between requests to avoid hammering the server or a ban

    # then download each text file that is not cached yet
    for link in links:
        txt_url = index_url + '/' + link
        outfile = os.path.join(raw_dir, link)
        if not os.path.isfile(outfile):
            r = requests.get(txt_url)
            r.raise_for_status()
            with open(outfile, 'w') as f:
                f.write(r.text)

            time.sleep(0.1)  # pause between requests to avoid hammering the server or a ban

100%|██████████| 95/95 [00:34<00:00,  2.72it/s]
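
The link-filtering step in the download loop can be tried without any network access. This is a minimal sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup; `LinkCollector`, `txt_links`, and the sample HTML are hypothetical and only illustrate the idea of keeping the `.txt` links from a directory listing.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # anchors without href are skipped
                self.hrefs.append(href)

def txt_links(html):
    """Return the links in html that end with '.txt'."""
    parser = LinkCollector()
    parser.feed(html)
    return [h for h in parser.hrefs if h.endswith(".txt")]

# Invented listing, loosely shaped like a file index page
sample_html = (
    '<a href="20.txt">20.txt</a> '
    '<a href="20-h/">20-h</a> '
    '<a name="top">top</a> '
    '<a href="20-0.txt">20-0.txt</a>'
)
print(txt_links(sample_html))
```

The anchor without an `href` shows why a plain `attrs['href']` lookup can fail on real pages.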


## Step 2, turn into cleaned(ish) text

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# parse each raw file and collect the content of the English-language books

max_len = 400
num_sent = 6
data = []
for infile in os.listdir(raw_dir):
    path = os.path.join(raw_dir, infile)
    with open(path) as f:
        info = parse_gutenberg(f.read())
    if info['language'] == 'English':
        print(info['title'])
        data.append(info['content'])

The Odyssey of Homer
The Aeneid
Endymion, A Poetic Romance
Poems by Jean Ingelow, In Two Volumes, Volume II.
Pastoral Poems by Nicholas Breton,, Selected Poetry by George Wither, and, Pastoral Poetry by William Browne (of Tavistock)
Rose and Roof-Tree, Poems
The World's Best Poetry, Volume 3, Sorrow and Consolation
Endymion, A Poetic Romance
Japanese Prints
Amores, Poems
The Complete Poetical Works of James Russell Lowell
The Iliad of Homer, Translated into English Blank Verse
Beowulf
The New Morning, Poems
Songs and Other Verse
Love-Songs of Childhood
Astrophel and Other Poems, Taken from The Collected Poetical Works of Algernon Charles, Swinburne, Vol. VI
    Problem: No title found

    Problem: No '*** START' seen

    Problem: No '*** END' seen

The Carmina of Caius Valerius Catullus
Poetry: A Magazine of Verse, Volume I, October-March, 1912-13
Beowulf
Hymen
Death Be Not Proud
The Works of Horace
The Complete Works of Robert Burns: Containing his Poems, Songs, and Correspondence.,
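
`parse_gutenberg` comes from the local `dataset` module, which is not shown here. The sketch below is only a guess at what such a parser might do, inferred from the `title`, `language`, and `content` keys used above and from the "Problem:" messages in the output. The name `parse_gutenberg_sketch`, the header patterns, and the sample text are all assumptions, not the real helper.

```python
import re

def parse_gutenberg_sketch(text):
    """Hypothetical stand-in for parse_gutenberg; not the notebook's real helper."""
    info = {}
    # Assumed header format: "Title: ..." and "Language: ..." on their own lines
    for key in ("Title", "Language"):
        m = re.search(r"^%s:\s*(.+)$" % key, text, flags=re.MULTILINE)
        info[key.lower()] = m.group(1).strip() if m else None
        if m is None:
            print("Problem: No %s found" % key.lower())

    # Assumed body markers: lines starting with "*** START" and "*** END"
    lines = text.splitlines()
    start = next((i for i, l in enumerate(lines) if l.startswith("*** START")), None)
    end = next((i for i, l in enumerate(lines) if l.startswith("*** END")), None)
    if start is None:
        print("Problem: No '*** START' seen")
    if end is None:
        print("Problem: No '*** END' seen")

    # Fall back to the whole text when a marker is missing
    first = 0 if start is None else start + 1
    last = len(lines) if end is None else end
    info["content"] = "\n".join(lines[first:last]).strip()
    return info

# Invented sample, for illustration only
sample = "\n".join([
    "Title: Example Poems",
    "Language: English",
    "",
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE POEMS ***",
    "first line of verse",
    "second line of verse",
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE POEMS ***",
    "license text",
])
info = parse_gutenberg_sketch(sample)
print(info["title"], info["language"])
print(info["content"])
```

With this design a file that lacks the markers still yields content, which would explain why a "Problem:" message can appear in the middle of the title list without stopping the loop.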

In [10]:
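# default test_size=0.25: 75% of the books go to train
# (no random_state is set, so the split changes between runs)
x_train, x_test = train_test_split(data)
# split the held-out 25% again: 75% of it becomes val, 25% becomes test
x_val, x_test = train_test_split(x_test)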
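
The two successive splits above give uneven validation and test sets. This is a small arithmetic sketch of the resulting sizes, assuming the default `test_size` of 0.25 and that the held-out count is rounded up, which is my understanding of scikit-learn's behaviour; `split_sizes` is a hypothetical helper and does not call scikit-learn.

```python
import math

def split_sizes(n, test_size=0.25):
    """Sizes of (train, val, test) after two successive splits of n items."""
    held_out = math.ceil(test_size * n)    # first split: held-out part
    train = n - held_out
    test = math.ceil(test_size * held_out)  # second split: test part of the held-out set
    val = held_out - test
    return train, val, test

print(split_sizes(100))  # → (75, 18, 7)
```

So roughly three quarters of the books end up in train, and the validation set is about three times the size of the test set.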

In [11]:
from pathlib import Path

# write_text closes each file and returns the number of characters written
Path(dest_dir, "train.txt").write_text("\n\n".join(x_train))
Path(dest_dir, "val.txt").write_text("\n\n".join(x_val))
Path(dest_dir, "test.txt").write_text("\n\n".join(x_test))

1074206