# Fetch Gutenberg books from a category

## Step 1, get book ids

- Go to http://m.gutenberg.org/ebooks/search.mobile/?query=bsxHorror&sort_order=downloads

- Scroll to the bottom and click "show more" a few times.
- Paste the JavaScript below into the browser's JS console.
- The ids should now be on your clipboard; paste them into `ids` below.


```js
// to get all book ids shown on page, paste this javascript into js console in browser when on the page above
a_elems = document.getElementsByClassName("table link")
hrefs = Array.from(a_elems)
  .map(e=>e.href) // get link
  .filter(e=>e) // remove empty links
ids = hrefs.map(e=>/(\d+)\.mobile/.exec(e)) // regular expression match
  .filter(e=>e) // remove ones not found
  .map(e=>e[1]) // get just id
copy(ids) // copy to clipboard
```
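
If the hrefs are already available outside the browser, the same regular expression can be applied in Python. This is a minimal sketch, not part of the original workflow: the function name `extract_ids` and the sample hrefs are made up for illustration.

```python
import re

# Same pattern as the console snippet: capture the digits before ".mobile".
ID_PATTERN = re.compile(r"(\d+)\.mobile")


def extract_ids(hrefs):
    """Return the book id from every href that contains one, in order."""
    ids = []
    for href in hrefs:
        match = ID_PATTERN.search(href or "")  # tolerate empty or missing hrefs
        if match:
            ids.append(match.group(1))
    return ids


# Hypothetical hrefs, invented for this example.
sample = [
    "http://m.gutenberg.org/ebooks/36.mobile",
    "http://m.gutenberg.org/ebooks/84.mobile",
    "",
    "http://m.gutenberg.org/ebooks/search.mobile/?query=x",
]
print(extract_ids(sample))  # ['36', '84']
```

Like the console version, this keeps repeated ids if the same book is linked more than once.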

In [1]:
import sys
sys.path.append('..')  # so that modules in the parent directory can be imported

In [2]:
import requests
import os
import re
import bs4
import time
import json


dest_dir = '../data/input/scifi_gutenberg'
raw_dir = os.path.join(dest_dir, 'raw')
# raw_dir is nested inside dest_dir, so one call creates both if they are missing
os.makedirs(raw_dir, exist_ok=True)

In [3]:
from tqdm import tqdm

In [4]:
from dataset import parse_gutenberg

In [5]:
# book ids to download (pasted from the browser console)
ids = [
  "36",
  "36",
  "84",
  "43",
  "35",
  "36",
  "1250",
  "42",
  "21279",
  "164",
  "159",
  "18857",
  "62",
  "41445",
  "5230",
  "32032",
  "31516",
  "1268",
  "42324",
  "10662",
  "31547",
  "10002",
  "41562",
  "1164",
  "64",
  "624",
  "51783",
  "18247",
  "28554",
  "2488",
  "83",
  "30240",
  "30123",
  "1951",
  "3526",
  "68",
  "32706",
  "32154",
  "51171",
  "28346",
  "126",
  "51461",
  "72",
  "52167",
  "49525",
  "32522",
  "32530",
  "29132",
  "1153",
  "20728",
  "29614",
  "50921",
  "28767",
  "52326",
  "30255",
  "1013",
  "3748",
  "11229",
  "54873",
  "775",
  "40284",
  "31979",
  "51233",
  "40992",
  "28698",
  "551",
  "1329",
  "19445",
  "21970",
  "1249",
  "16921",
  "19362",
  "20869",
  "32832",
  "18137",
  "20727",
  "8993",
  "40964",
  "14021",
  "21873",
  "8086",
  "6542",
  "27188",
  "123",
  "19651",
  "33854",
  "23731",
  "30971",
  "29019",
  "4791",
  "25550",
  "149",
  "18458",
  "17401",
  "21051",
  "605",
  "19145",
  "29948",
  "12163",
  "11696",
  "780",
  "32633",
  "18668",
  "29405",
  "20788",
  "36258",
  "37775",
  "22357",
  "20796",
  "40953",
  "22893",
  "24104",
  "21489",
  "19726",
  "23790",
  "16457",
  "18224",
  "552",
  "51184",
  "20038",
  "52228",
  "49462",
  "41064",
  "18846",
  "44278",
  "22541",
  "19090",
  "96",
  "18807",
  "36358",
  "10542",
  "51845",
  "19474",
  "553",
  "765",
  "50571",
  "22754",
  "20919",
  "29579",
  "51782",
  "51650",
  "20898",
  "27462",
  "20988",
  "32436",
  "29135",
  "33516",
  "18342",
  "31619",
  "604",
  "22958",
  "28215",
  "14287",
  "20707",
  "42914",
  "20121",
  "32563",
  "26521",
  "20857",
  "51804",
  "29142",
  "41637",
  "13944",
  "5097",
  "21092",
  "29720",
  "22549",
  "19141",
  "22629",
  "10966",
  "718",
  "19478",
  "29206",
  "545",
  "22218",
  "24035",
  "22073",
  "32664",
  "50138",
  "23210",
  "1353",
  "32360",
  "29204",
  "50774",
  "30964",
  "19029",
  "1607",
  "51866",
  "29662",
  "25438",
  "51273",
  "41981",
  "18151",
  "29283",
  "35103",
  "715",
  "32498",
  "51046",
  "50022",
  "14888",
  "18831",
  "27013",
  "27013",
  "17028",
  "20782",
  "30408",
  "20000",
  "41905",
  "32026",
  "6709",
  "49165",
  "51809",
  "3809",
  "28832",
  "19471",
  "30283",
  "30458",
  "18105",
  "22216",
  "28650",
  "20154",
  "37653",
  "51082",
  "18800",
  "28063",
  "7303",
  "32447",
  "48850",
  "29310",
  "50566",
  "32256",
  "3479",
  "23146",
  "20659",
  "285",
  "30960",
  "18584",
  "22966",
  "3808",
  "30796",
  "51267",
  "13704",
  "33644",
  "51712",
  "50783",
  "29471",
  "53042",
  "49651",
  "22767",
  "24277",
  "32654",
  "51101",
  "31262",
  "43235",
  "50441",
  "21094",
  "30334",
  "51101",
  "32108",
  "53132",
  "50928",
  "24750",
  "24517",
  "30828",
  "31501",
  "30014",
  "42901",
  "22559",
  "19513",
  "38674",
  "29559",
  "51037",
  "27444",
  "6538",
  "50682",
  "20726",
  "19526",
  "50682",
  "24302",
  "22332",
  "50999",
  "10349",
  "51152",
  "18719",
  "52574",
  "51814",
  "50876",
  "19067",
  "3797",
  "29133",
  "23169",
  "12901",
  "51258",
  "50863",
  "51361",
  "27053",
  "22540",
  "4548",
  "9055",
  "25086",
  "29750",
  "41714",
  "41714",
  "22544",
  "51519",
  "32004",
  "21647",
  "18855",
  "31778",
  "18949",
  "53456",
  "51854",
  "42989",
  "24395",
  "20212",
  "51168",
  "51726",
  "51509",
  "30019",
  "23028",
  "18492",
  "30427",
  "18602",
  "26066",
  "51774",
  "50585",
  "29503",
  "50819",
  "26191",
  "51148",
  "32040",
  "51801",
  "30002",
  "23085",
  "29140",
  "32579",
  "949",
  "26955",
  "23164",
  "17026",
  "31663",
  "29322",
  "21510",
  "51397",
  "25713",
  "29410",
  "19027",
  "51781",
  "22538",
  "52501",
  "24246",
  "50844",
  "21627",
  "35425",
  "18520",
  "20649",
  "1368",
  "30199",
  "30679",
  "18814",
  "35204",
  "51768",
  "13716",
  "26741",
  "49897",
  "50406",
  "51833",
  "41049",
  "18641",
  "32237",
  "31664",
  "50893",
  "17958",
  "51129",
  "51549",
  "9862",
  "29303",
  "29190",
  "41586",
  "30044",
  "29446",
  "29303",
  "32078",
  "51115",
  "51310",
  "17355",
  "18460",
  "29475",
  "31286",
  "799",
  "51668",
  "30493",
  "29492",
  "29448",
  "29202",
  "29353",
  "51150",
  "25078",
  "29455",
  "32041",
  "31324",
  "51330",
  "51855",
  "31324",
  "29445",
  "50063",
  "52844",
  "51330",
  "51074",
  "31976",
  "32346",
  "50103",
  "51603",
  "51274",
  "24196",
  "18109",
  "22579",
  "35759",
  "32067",
  "29908",
  "29548",
  "32651",
  "31587",
  "50980",
  "6620",
  "19076",
  "33660",
  "23232",
  "55801",
  "23845",
  "33660",
  "21897",
  "50948",
  "19111",
  "30140",
  "25439",
  "20739",
  "25067",
  "23960",
  "2509",
  "30234",
  "50904",
  "29458",
  "29876",
  "24247",
  "29298",
  "42987",
  "51122",
  "23104",
  "30399",
  "51687",
  "50935",
  "4920",
  "29619",
  "29389",
  "32850",
  "50935",
  "51379",
  "24152",
  "28912",
  "32208",
  "18172",
  "24370",
  "51247",
  "32819",
  "29504",
  "18346",
  "33642",
  "22227",
  "18861",
  "51589",
  "29177",
  "30062",
  "26292",
  "30311",
  "18632",
  "50622",
  "18861",
  "19258",
  "51852",
  "18632",
  "46111",
  "30311",
  "51589",
  "30639",
  "30062",
  "28893",
  "28550",
  "19338",
  "17870",
  "26292",
  "53102",
  "53045",
  "51651",
  "51596",
  "51844",
  "27089",
  "17027",
  "30764",
  "18753",
  "14301",
  "32587",
  "51651",
  "51596",
  "51433",
  "29966",
  "27143",
  "32272",
  "20553",
  "40993",
  "31207",
  "29525",
  "29487",
  "32317",
  "23591",
  "32953",
  "50924",
  "19476",
  "50668",
  "32207",
  "19066",
  "50713",
  "18361",
  "30476",
  "19709",
  "51759",
  "51545",
  "18361",
  "27631",
  "51433",
  "30034",
  "23657",
  "29976",
  "27392",
  "20859",
  "29416",
  "33934",
  "24121",
  "21782",
  "51842",
  "23535",
  "42111",
  "30770",
  "26867",
  "26782",
  "31583",
  "29060",
  "51353",
  "5155",
  "28437",
  "28062",
  "51363",
  "28062",
  "51256",
  "50998",
  "24382",
  "49754",
  "32828",
  "23612",
  "51530",
  "19102",
  "29139",
  "43046",
  "30901",
  "25644",
  "30728",
  "51751",
  "51740",
  "51321",
  "43835",
  "10008",
  "26206",
  "30715",
  "24149",
  "23868",
  "18261",
  "25234",
  "22338",
  "23868",
  "56062",
  "30715",
  "32544",
  "24558",
  "31648",
  "27968",
  "51662",
  "30303",
  "32410",
  "28438",
  "28840",
  "29421",
  "29309",
  "29618",
  "29643",
  "26957",
  "23764",
  "30267",
  "42816",
  "30348",
  "49809",
  "30468",
  "40968",
  "24064",
  "51295",
  "28809",
  "32665",
  "23599",
  "17030",
  "3831",
  "34469",
  "22560",
  "51395",
  "23882",
  "33850",
  "30742",
  "18786",
  "30474",
  "32696",
  "14152",
  "51240",
  "30307",
  "51436",
  "29159",
  "17029",
  "22890",
  "17138",
  "23799",
  "50802",
  "23162",
  "20147",
  "24198",
  "11583",
  "23636",
  "52776",
  "22866",
  "50753",
  "26906",
  "50847",
  "26966",
  "29940",
  "22467",
  "11626",
  "27595",
  "51241",
  "27730",
  "7463",
  "20519",
  "49531",
  "24928",
  "51681",
  "29205",
  "32077",
  "29466",
  "22967",
  "21988",
  "32133",
  "29599",
  "51112",
  "23651",
  "51824",
  "29742",
  "32592",
  "29897",
  "51531",
  "33386",
  "51823",
  "51268",
  "51758",
  "51799",
  "40954",
  "9081",
  "51663",
  "51534",
  "50872",
  "30371",
  "51810",
  "30398",
  "51518",
  "23337",
  "32697",
  "28628",
  "31922",
  "28924",
  "51408",
  "23767",
  "31981",
  "29408",
  "16721",
  "4340",
  "51699",
  "29620",
  "50835",
  "22132",
  "51713",
  "32324",
  "52009",
  "27464",
  "31892",
  "24392",
  "32068",
  "30251",
  "31611",
  "22527",
  "32906",
  "28341",
  "50826",
  "28643",
  "50332",
  "22466",
  "41062",
  "24145",
  "52845",
  "29299",
  "26246",
  "51741",
  "23102",
  "50988",
  "27309",
  "51669",
  "29170",
  "18916",
  "29418",
  "32029",
  "29735",
  "40737",
  "29994",
  "19086",
  "30035",
  "33842",
  "51075",
  "51009",
  "51407",
  "51396",
  "32126",
  "22226",
  "31897",
  "28453",
  "48880",
  "30767",
  "39572",
  "27393",
  "51027",
  "27609",
  "32359",
  "11205",
  "36867",
  "29038",
  "10165",
  "22869",
  "41941",
  "24118",
  "32351",
  "24180",
  "32764",
  "29321",
  "51449",
  "29326",
  "51420",
  "29632",
  "51137",
  "29822",
  "31767",
  "51498",
  "30322",
  "49762",
  "30338",
  "32683",
  "23561",
  "49693",
  "32181",
  "51574",
  "28244",
  "7052",
  "23153",
  "50890",
  "24977",
  "32124",
  "28031",
  "41027",
  "24101",
  "31215",
  "25862",
  "32486",
  "29965",
  "32079",
  "29990",
  "53059",
  "29293",
  "51478",
  "23942",
  "29623",
  "29271",
  "23929",
  "29680",
  "26967",
  "29931",
  "22176",
  "51288",
  "23148",
  "50827",
  "30339",
  "32321",
  "27633",
  "51102",
  "28118",
  "43264",
  "22462",
  "35426",
  "28705",
  "51483",
  "32805",
  "50848",
  "19515",
  "51362",
  "24119",
  "51656",
  "23884",
  "31948",
  "30214",
  "32780",
  "29046",
  "32427",
  "26795",
  "51688",
  "29625",
  "49656",
  "29698",
  "51132",
  "29771",
  "51832",
  "23920",
  "32761",
  "23198",
  "51072",
  "41084",
  "51153",
  "32011",
  "51482",
  "30029",
  "44404",
  "29963",
  "31062",
  "23147",
  "51609",
  "23688",
  "50889",
  "30583",
  "51167",
  "29146",
  "37991",
  "30242",
  "40961",
  "27248",
  "51475",
  "29240",
  "32347",
  "29401",
  "32903",
  "32833",
  "31364",
  "23443",
  "26761",
  "22154",
  "26843",
  "22585",
  "19370",
  "11556",
  "30381",
  "32820",
  "19000",
  "50818",
  "24005",
  "32837",
  "24054",
  "32905",
  "28883",
  "51170",
  "25567",
  "32637",
  "29954",
  "51091",
  "8757",
  "32591",
  "29118",
  "32134",
  "2026",
  "51008",
  "26109",
  "51125",
  "29196",
  "32150",
  "30497",
  "51126",
  "29328",
  "31686",
  "26536",
  "32836",
  "24275",
  "51570",
  "24278",
  "30885",
  "16827",
  "51255",
  "27110",
  "50290",
  "19964",
  "50936",
  "18257",
  "30832",
  "51185",
  "4968",
  "51028",
  "53015",
  "51622",
  "51081",
  "43041",
  "30382",
  "51867",
  "23160",
  "33839",
  "25166",
  "51597",
  "29053",
  "35401",
  "26917",
  "35879",
  "29149",
  "32676",
  "23185",
  "31644",
  "22897",
  "48089",
  "24122",
  "51344",
  "23762",
  "50884",
  "20838",
  "52933",
  "26290",
  "51296",
  "29384",
  "51445",
  "22590",
  "32597",
  "24274",
  "23099",
  "51773",
  "4717",
  "30884",
  "30337",
  "51834",
  "23197",
  "51805",
  "24567",
  "51605",
  "28030",
  "51546",
  "19738",
  "31585",
  "27492",
  "51727",
  "23831",
  "50959",
  "11393",
  "31651",
  "29437",
  "34420",
  "23568",
  "32212",
  "29308",
  "51623",
  "24187",
  "28111",
  "29457",
  "23669",
  "29975",
  "26956",
  "27019",
  "31356",
  "22997",
  "27588",
  "6717",
  "31573",
  "23791",
  "51297",
  "20920",
  "49838",
  "31343",
  "51331",
  "20551",
  "38287",
  "24864",
  "32562",
  "28583",
  "40970",
  "26882",
  "31929",
  "23872",
  "49779",
  "20802",
  "31665",
  "23588"
]
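
The pasted list contains repeats (for example, `"36"` appears three times near the top). The run below used the list as pasted; as an optional step, the repeats could be dropped while keeping the original order. A minimal sketch, where `dedupe_keep_order` is a hypothetical helper and `sample_ids` is just the first six entries of the list above:

```python
from collections import Counter


def dedupe_keep_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    # dict preserves insertion order, so fromkeys() acts as an ordered set
    return list(dict.fromkeys(items))


sample_ids = ["36", "36", "84", "43", "35", "36"]  # first six entries of `ids`
unique_ids = dedupe_keep_order(sample_ids)
repeats = {bid: n for bid, n in Counter(sample_ids).items() if n > 1}

print(unique_ids)  # ['36', '84', '43', '35']
print(repeats)     # {'36': 3}
```

Since the download loop skips text files that already exist, a repeated id only costs an extra request for the index page.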



In [9]:
# download/cache raw files
for bid in tqdm(ids, mininterval=60):

    # first download the index page that lists the book's files
    index_url = "http://www.gutenberg.org/files/{bid}".format(bid=bid)
    r = requests.get(index_url)
    r.raise_for_status()
    soup = bs4.BeautifulSoup(r.content, "html5lib")
    # href=True skips anchors that have no href attribute
    hrefs = [e['href'] for e in soup.find_all('a', href=True)]
    links = [h for h in hrefs if h.endswith('.txt')]
    time.sleep(0.1)  # pause between requests to avoid overloading the server or getting banned

    # download each text file that is not already cached
    for link in links:
        txt_url = index_url + '/' + link
        outfile = os.path.join(raw_dir, link)
        if not os.path.isfile(outfile):
            r = requests.get(txt_url)
            r.raise_for_status()
            with open(outfile, 'w') as f:
                f.write(r.text)

            time.sleep(0.1)  # pause between requests

100%|██████████| 1002/1002 [10:39<00:00,  1.57it/s]
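
The link-filtering step of the loop above can be tried offline. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages; the class `TxtLinkParser`, the helper `txt_links`, and the sample listing are all made up for illustration and are not what the notebook itself uses.

```python
from html.parser import HTMLParser


class TxtLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose target ends in .txt."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")  # None when the anchor has no href
        if href and href.endswith(".txt"):
            self.links.append(href)


def txt_links(html):
    """Return all .txt hrefs found in an HTML string, in document order."""
    parser = TxtLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical directory listing, invented for this example.
listing = """
<html><body>
<a href="/files/">Parent Directory</a>
<a href="36.txt">36.txt</a>
<a href="36-h/">36-h/</a>
<a>no href here</a>
<a href="36-0.txt">36-0.txt</a>
</body></html>
"""
print(txt_links(listing))  # ['36.txt', '36-0.txt']
```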


## Step 2, turn into cleaned(ish) txt

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
# collect the cleaned text of every English book into a list

max_len = 400
num_sent = 6
data = []
for infile in os.listdir(raw_dir):
    path = os.path.join(raw_dir, infile)
    with open(path) as f:
        info = parse_gutenberg(f.read())
    if info['language'] == 'English':
        print(info['title'])
        data.append(info['content'])

In the Orbit of Saturn
In the Year 2889
Caxton's Book: A Collection of Essays, Poems, Tales, and Sketches.
Polaris of the Snows
The Music Master of Babylon
The Silver Menace
Subspace Survivors
They Twinkled Like Jewels
Danger in Deep Space
The Mad Planet
The Dictator
The Moon Pool
Wall of Crystal, Eye of Night
When the Mountain Shook
Pagan Passions
B-12's Moon Glow
Eight Keys to Eden
Asteroid of Fear
The Great Drought
Bodyguard
The Place Where Chicago Was
The Inheritors
No Moving Parts
Lighter Than You Think
Desire No More
The Day Time Stopped Moving
The Ultimate Experiment
Today is Forever
Slave Planet
The Fifth-Dimension Tube
Return to Pleasure Island
The Vortex Blaster
Arm of the Law
Alien Offer
Adrift in the Unknown, or, Queer Adventures in a Queer Realm, Author: William Wallace Cook, Release Date: December 10, 2013 [EBook #44404], Language: English, Character set encoding: US-ASCII
Unborn Tomorrow
Priestess of the Flame
Swenson, Dispatcher
The Door Through Space
Pick a Crime
Metam

The People of the Crater
Disowned
Triplanetary
The Weirdest World
The Affair of the Brains
Anchorite
Urania
The Secret Martians
The Fire and the Sword
Proof of the Pudding
The Scarlet Plague
The Great Nebraska Sea
That Sweet Little Old Lady
Fee of the Frontier
The Next Time We Die
The Floating Island of Madness
This World Must Die!
A Journey to the Centre of the Earth
Urania
Looking Backward, 2000-1887
Summit
Robur the Conqueror
To Save Earth
The Sweeper of Loray
Bridge Crossing
Off on a Comet
The Golden Amazons of Venus
The Luckiest Man in Denv
Unborn Tomorrow
The Big Fix
Rescue Squad
Naudsonce
Urania
The End of Time
When the Sleeper Wakes
Say "Hello" for Me
The Coming of the Ice
Big Pill
Security Risk
All Around the Moon
The Ego Machine
It's a Small Solar System
Tulan
Tony and the Beetles
Planet of the Damned
Soldier Boy
The Phantom Airman, Author: Rowland Walker, Release Date: July 20, 2013 [EBook #43264], Language: English, Character set encoding: ISO-8859-1
Hawk Carse
The Last Man

The Lost Continent
The Ambulance Made Two Trips
The Worlds of If
Micro-Man
No Shield from the Dead
Scrimshaw
Anchorite
Vanishing Point
The People that Time Forgot
The Invisible Man
Check and Checkmate
Operation Distress
Oneness
When the Sleeper Wakes
Robots of the World! Arise!
Salvage in Space
Project Mastodon
Gambler's World
Freedom
Highways in Hiding
The Chessmen of Mars
From An Unseen Censor
An Incident on Route 12
The Six Fingers of Time
The Chapter Ends
The Seed of the Toc-Toc Birds
Prospector's Special
The Big Tomorrow
Meeting of the Minds
The Cosmic Express
Skin Game
Rip Foster Rides the Gray Planet
Equation of Doom
Dead Giveaway
The Hoofer
Stand by for Mars!
A World is Born
The Stuff
Extracts from the Galactick Almanack
The Mind Master
Code Three
Dead World
Space Prison
The Glory of Ippling
The Door Through Space
Space Tug
The Fourth Invasion
I'll Kill You Tomorrow
Star Mother
The Golden Amazons of Venus
Potential Enemy
The Lani People
Frankenstein, or Modern Prometheus
The Al
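
`parse_gutenberg` comes from the local `dataset` module, whose source is not shown here. The sketch below is a hypothetical stand-in, not the real function: it assumes the file has `Title:` and `Language:` header lines and `*** START OF ...` / `*** END OF ...` marker lines, and returns the same three keys the loop above reads. The sample text is invented.

```python
import re


def parse_gutenberg_sketch(text):
    """Hypothetical stand-in for dataset.parse_gutenberg (not the real one)."""

    def header(name):
        # Find a "Name: value" line anywhere in the text.
        m = re.search(r"^%s:\s*(.+)$" % re.escape(name), text, flags=re.MULTILINE)
        return m.group(1).strip() if m else None

    # Assumed marker lines; keep only the text between them.
    start = re.search(r"^\*\*\* ?START OF.*$", text, flags=re.MULTILINE)
    end = re.search(r"^\*\*\* ?END OF.*$", text, flags=re.MULTILINE)
    body_start = start.end() if start else 0
    body_end = end.start() if end else len(text)

    return {
        "title": header("Title"),
        "language": header("Language"),
        "content": text[body_start:body_end].strip(),
    }


# Made-up sample file for illustration.
sample = """Title: An Example Story
Author: Nobody
Language: English

*** START OF THIS PROJECT GUTENBERG EBOOK AN EXAMPLE STORY ***

Once upon a time.

*** END OF THIS PROJECT GUTENBERG EBOOK AN EXAMPLE STORY ***
License text follows.
"""
info = parse_gutenberg_sketch(sample)
print(info["title"], "|", info["language"], "|", info["content"])
```

A few titles in the output above carry other header fields run together, so real files evidently vary more than this sketch handles.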

In [12]:
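# train_test_split holds out 25% by default, so the two calls give roughly
# 75% train, 19% validation and 6% test
x_train, x_test = train_test_split(data)
x_val, x_test = train_test_split(x_test)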
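
To make the resulting proportions concrete, here is a small standard-library sketch of the same two-stage split. It is a stand-in for `train_test_split`, not a reimplementation of it: the helper `split`, the seed, and the toy list of 16 items are assumptions chosen for illustration.

```python
import math
import random


def split(items, second_fraction, rng):
    """Shuffle a copy of items and cut it into (first_part, second_part)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_second = math.ceil(len(shuffled) * second_fraction)
    cut = len(shuffled) - n_second
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)  # fixed seed so the split is repeatable
books = ["book %d" % i for i in range(16)]  # toy data

train, holdout = split(books, 0.25, rng)  # 12 train, 4 held out
val, test = split(holdout, 0.25, rng)     # 3 validation, 1 test

print(len(train), len(val), len(test))  # 12 3 1
```

With scikit-learn itself, passing `random_state` to `train_test_split` gives the same kind of repeatability.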

In [13]:
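from pathlib import Path

# one file per split, books separated by a blank line;
# write_text closes the file and returns the number of characters written
Path(dest_dir, "train.txt").write_text("\n\n".join(x_train))
Path(dest_dir, "val.txt").write_text("\n\n".join(x_val))
Path(dest_dir, "test.txt").write_text("\n\n".join(x_test))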

10146460