# Top Gutenberg Ebooks (yesterday's ranking) download

## Tirthajyoti Sarkar, Sunnyvale, CA, 2017

### What is Project Gutenberg?
Project Gutenberg is a volunteer effort to digitize and archive cultural works, to "encourage the creation and distribution of eBooks". It was founded in 1971 by American writer Michael S. Hart and is the **oldest digital library.** This longest-established ebook project releases books that have entered the public domain, which can be freely read or downloaded in various electronic formats.

* **This starter code scrapes the URL of Project Gutenberg's Top 100 ebooks (yesterday's ranking) to identify the ebook links.**
* **It uses BeautifulSoup4 to parse the HTML and regular expressions to identify the Top 100 ebook file numbers.**
* **It includes a function that takes a user input for how many books to download and then crawls the server to download them into a dictionary object.**
* **Finally, it includes a function to save the downloaded ebooks as text files in a local directory.**

In [1]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re

#### Ignore SSL certificate errors

In [2]:
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#### Read the HTML from the URL and pass on to BeautifulSoup

In [3]:
# Read the HTML from the URL and pass on to BeautifulSoup
top100url = 'https://www.gutenberg.org/browse/scores/top'
url = top100url
print(f"Opening the file connection to {url}")
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
print("Connection established and HTML parsed...")

Opening the file connection to https://www.gutenberg.org/browse/scores/top
Connection established and HTML parsed...


#### Find all the `<a>` tags and store their _'href'_ attributes in the list of links

In [4]:
# Empty list to hold all the http links in the HTML page
lst_links=[]

In [5]:
# Find all the <a> tags and store their href attributes in the list of links
for link in soup.find_all('a'):
    #print(link.get('href'))
    lst_links.append(link.get('href'))

#### Use a regular expression to find the numeric digits in these links. These are the file numbers for the Top 100 books.

In [6]:
# Use a regular expression to find the numeric digits in these links. These are the file numbers for the Top 100 books.
# Initialize empty list to hold the file numbers
booknum=[]

In [7]:
# Links at indices 19 through 118 in the original list of links hold the Top 100 books' numbers.
for i in range(19,119):
    link=lst_links[i]
    link=link.strip()
    # Regular expression to find the numeric digits in the link (href) string
    n=re.findall('[0-9]+',link)
    if len(n)==1:
        # Append the file number cast as an integer
        booknum.append(int(n[0]))

print ("\nThe file numbers for the top 100 ebooks on Gutenberg are shown below\n"+"-"*70)
print(booknum)


The file numbers for the top 100 ebooks on Gutenberg are shown below
----------------------------------------------------------------------
[1342, 84, 1080, 46, 219, 2542, 98, 345, 2701, 844, 11, 5200, 43, 16328, 76, 74, 1952, 6130, 2591, 1661, 41, 174, 23, 1260, 1497, 408, 3207, 1400, 30254, 58271, 1232, 25344, 58269, 158, 44881, 1322, 205, 2554, 1184, 2600, 120, 16, 58276, 5740, 34901, 28054, 829, 33, 2814, 4300, 100, 55, 160, 1404, 786, 58267, 3600, 19942, 8800, 514, 244, 2500, 2852, 135, 768, 58263, 1251, 3825, 779, 58262, 203, 730, 20203, 35, 1250, 45, 161, 30360, 7370, 58274, 209, 27827, 58256, 33283, 4363, 375, 996, 58270, 521, 58268, 36, 815, 1934, 3296, 58279, 105, 2148, 932, 1064, 13415]
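The fixed index range used above depends on the page layout at the time of scraping. A minimal sketch of a layout-independent alternative follows. It assumes (this is not confirmed by the code above) that book links have the form `/ebooks/<number>` and that yesterday's list is the first one on the page. The helper name `extract_book_numbers` and the sample links are hypothetical.

```python
import re

def extract_book_numbers(links, limit=100):
    """Return the first `limit` book numbers from hrefs shaped like '/ebooks/<number>'."""
    pattern = re.compile(r'^/ebooks/(\d+)$')  # assumed link format
    numbers = []
    for href in links:
        if href is None:  # <a> tags without an href attribute yield None
            continue
        match = pattern.match(href.strip())
        if match:
            numbers.append(int(match.group(1)))
    return numbers[:limit]

# Made-up sample mixing book links with navigation links
sample_links = ['/about/', None, '/ebooks/1342', ' /ebooks/84 ', '/ebooks/search/']
print(extract_book_numbers(sample_links))  # [1342, 84]
```

Matching on the link pattern also skips `None` entries, which would otherwise raise an error when `strip()` is called on them.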


#### Search the text extracted from the soup object (using a regular expression) to find the names of the Top 100 ebooks (yesterday's ranking)

In [8]:
# Split the page text into lines once and locate the heading of yesterday's Top 100 list
lines=soup.text.splitlines()
start_idx=lines.index('Top 100 EBooks yesterday')
lst_titles_temp=[] # Empty list of Ebook names
for i in range(100):
    lst_titles_temp.append(lines[start_idx+2+i])

In [9]:
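# Use a regular expression to keep only the leading run of letters and spaces in each name string and append it to an empty list
# (a title is cut at its first digit or punctuation mark, such as an apostrophe or a period)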
lst_titles=[]
for i in range(100):
    id1,id2=re.match('^[a-zA-Z ]*',lst_titles_temp[i]).span()
    lst_titles.append(lst_titles_temp[i][id1:id2])
for l in lst_titles:
    print(l)

Pride and Prejudice by Jane Austen 
Frankenstein
A Modest Proposal by Jonathan Swift 
A Christmas Carol in Prose
Heart of Darkness by Joseph Conrad 
Et dukkehjem
A Tale of Two Cities by Charles Dickens 
Dracula by Bram Stoker 
Moby Dick
The Importance of Being Earnest
Alice
Metamorphosis by Franz Kafka 
The Strange Case of Dr
Beowulf
Adventures of Huckleberry Finn by Mark Twain 
The Adventures of Tom Sawyer by Mark Twain 
The Yellow Wallpaper by Charlotte Perkins Gilman 
The Iliad by Homer 
Grimms
The Adventures of Sherlock Holmes by Arthur Conan Doyle 
The Legend of Sleepy Hollow by Washington Irving 
The Picture of Dorian Gray by Oscar Wilde 
Narrative of the Life of Frederick Douglass
Jane Eyre
The Republic by Plato 
The Souls of Black Folk by W
Leviathan by Thomas Hobbes 
Great Expectations by Charles Dickens 
The Romance of Lust
The Tower of London by William Benham 
Il Principe
The Scarlet Letter by Nathaniel Hawthorne 

Emma by Jane Austen 
Confessions of a Thug by Meadows Taylo
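
Because the pattern above stops at the first character that is not a letter or a space, some titles in the output are cut short (for example `Alice` and `The Strange Case of Dr`). A hedged sketch of an alternative is shown below. It assumes each raw line ends with a parenthesized number, such as a download count, and removes only that suffix. The helper name `clean_title` and the sample lines, including their counts, are made up for illustration.

```python
import re

def clean_title(raw_line):
    """Remove a trailing parenthesized number such as ' (1234)' and surrounding whitespace."""
    return re.sub(r'\s*\(\d+\)\s*$', '', raw_line).strip()

# Illustrative lines; the numbers in parentheses are invented
sample_lines = [
    "Alice's Adventures in Wonderland by Lewis Carroll (1234)",
    "Beowulf",
]
for line in sample_lines:
    print(clean_title(line))
```

This keeps apostrophes, periods, and accented characters in the title, at the cost of relying on the assumed line format.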

### Define a function that takes a user input of how many top books to download and crawls the server to download them

In [26]:
def download_top_books(num_download=10,verbosity=0):
    """
    Function: Download top N books from Gutenberg.org where N is specified by user
    Verbosity: If verbosity is turned on (set to 1) then prints the downloading status for every book
    Returns: Returns a dictionary where keys are the names of the books and values are the raw text.
    Exception Handling: If a book is not found on the server (due to broken link or whatever reason), inserts "NOT FOUND" as the text.
    """
    topEBooks = {}
    
    if num_download<=0:
        print("I guess no download is necessary")
        return topEBooks
    
    if num_download>100:
        print("You asked for more than 100 downloads.\nUnfortunately, Gutenberg ranks only top 100 books.\nProceeding to download top 100 books.")
        num_download=100
    
    # Base URL for files repository
    baseurl= 'http://www.gutenberg.org/files/'
    
    if verbosity==1:
        count_done=0
        for i in range(num_download):
            print ("Working on book:", lst_titles[i])
            
            # Create the proper download link (url) from the book id
            # You have to examine the Gutenberg.org file structure carefully to come up with the proper url
            bookid=booknum[i]
            bookurl= baseurl+str(bookid)+'/'+str(bookid)+'-0.txt'
            # Create a file handler object
            try:
                fhand = urllib.request.urlopen(bookurl)
                txt_dump = ''
                # Iterate over the lines in the file handler object and dump the data into the text string
                for line in fhand:
                    # Use decode method to convert the UTF-8 to Unicode string
                    txt_dump+=line.decode()
                # Add downloaded text to the dictionary with keys matching the list of book titles.
                # This puts the raw text as the value of the key of the dictionary bearing the name of the Ebook 
                topEBooks[lst_titles[i]]=txt_dump
                count_done+=1
                print (f"Finished downloading {round(100*count_done/num_download,2)}%")
            except urllib.error.URLError as e:
                topEBooks[lst_titles[i]]="NOT FOUND"
                count_done+=1
                print(f"**ERROR: {lst_titles[i]} {e.reason}**")
    else:
        count_done=0
        from tqdm import tqdm
        for i in tqdm(range(num_download),desc='Download % completed',dynamic_ncols=True):
            # Create the proper download link (url) from the book id
            # You have to examine the Gutenberg.org file structure carefully to come up with the proper url
            bookid=booknum[i]
            bookurl= baseurl+str(bookid)+'/'+str(bookid)+'-0.txt'
            # Create a file handler object
            try:
                fhand = urllib.request.urlopen(bookurl)
                txt_dump = ''
                # Iterate over the lines in the file handler object and dump the data into the text string
                for line in fhand:
                    # Use decode method to convert the UTF-8 to Unicode string
                    txt_dump+=line.decode()
                # Add downloaded text to the dictionary with keys matching the list of book titles.
                # This puts the raw text as the value of the key of the dictionary bearing the name of the Ebook 
                topEBooks[lst_titles[i]]=txt_dump
                count_done+=1
            except urllib.error.URLError as e:
                topEBooks[lst_titles[i]]="NOT FOUND"
                count_done+=1
                print(f"**ERROR: {lst_titles[i]} {e.reason}**")
        
    print ("-"*40+"\nFinished downloading all books!\n"+"-"*40)
       
    return (topEBooks)
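
The two branches of the function above repeat the same URL construction and download steps. One possible way to factor them out is sketched below. The names `build_book_url` and `fetch_book_text` and the `opener` parameter are hypothetical; the URL pattern and the `"NOT FOUND"` marker are taken from the function above.

```python
import urllib.request
import urllib.error

BASE_URL = 'http://www.gutenberg.org/files/'  # same base URL as in the function above

def build_book_url(bookid):
    """Build the plain-text URL for a book id, following the pattern used above."""
    return f"{BASE_URL}{bookid}/{bookid}-0.txt"

def fetch_book_text(bookid, opener=urllib.request.urlopen):
    """Return the decoded text of one book, or 'NOT FOUND' if the request fails."""
    try:
        with opener(build_book_url(bookid)) as fhand:
            # Decode the downloaded bytes as UTF-8 (the default for bytes.decode)
            return fhand.read().decode()
    except urllib.error.URLError:
        return "NOT FOUND"

print(build_book_url(1342))  # http://www.gutenberg.org/files/1342/1342-0.txt
```

Passing the opener as a parameter lets the download step be exercised with a stand-in function, without contacting the server.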

#### Test the function with verbosity=0 (default)

In [23]:
dict_books=download_top_books(1)

Download % completed: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.83s/it]


----------------------------------------
Finished downloading all books!
----------------------------------------


#### Test the function with verbosity=1

In [272]:
dict_books=download_top_books(105,1)

You asked for more than 100 downloads. 
Unfortunately, Gutenberg ranks only top 100 books.
Proceeding to download 100 books.


#### Show the final dictionary and an example of the downloaded text

In [14]:
print(dict_books[lst_titles[0]][:1500])

﻿The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Pride and Prejudice

Author: Jane Austen

Posting Date: August 26, 2008 [EBook #1342]
Release Date: June, 1998
Last Updated: March 10, 2018

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***




Produced by Anonymous Volunteers





PRIDE AND PREJUDICE

By Jane Austen



Chapter 1


It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that he is considered the right

### Write a function to download and save the downloaded texts

In [27]:
def save_text_files(num_download=10,verbosity=1):
    """
    Downloads top N books from Gutenberg.org where N is specified by user.
    If verbosity is turned on (set to 1) then prints the downloading status for every book.
    Asks the user for a location on the computer where to save the downloaded Ebooks and saves them there.
    Prints a status message indicating how many ebooks were successfully downloaded and saved.
    """
    
    import os
    
    # Download the Ebooks and save in a dictionary object (in-memory)
    dict_books=download_top_books(num_download=num_download,verbosity=verbosity)
    
    if dict_books=={}:
        return None
    
    # Ask the user for a save location (directory path)
    savelocation=input("Please enter a folder location to save the Ebooks in:")
    
    count_successful_download=0
    
    # Use a default folder/directory in the current working directory if the input is blank
    if (len(savelocation)<1):
        savelocation=os.path.join(os.getcwd(),'Ebooks')
    # Create the directory if it does not exist. Otherwise, just use the existing path.
    os.makedirs(savelocation,exist_ok=True)
    #print("Saving files at:",savelocation)
    for k,v in dict_books.items():
        if (v!="NOT FOUND"):
            # os.path.join inserts the path separator appropriate for the operating system
            filename=os.path.join(savelocation,str(k).strip()+'.txt')
            # The with block closes the file even if the write fails
            with open(filename,'wb') as file:
                file.write(v.encode("UTF-8",'ignore'))
            count_successful_download+=1
            count_successful_download+=1
    
    # Status message
    print (f"{count_successful_download} book(s) was/were successfully downloaded and saved to the location {savelocation}")
    if (num_download!=count_successful_download):
        print(f"{num_download-count_successful_download} books were not found on the server!")
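
Book titles can contain characters that are not valid in file names on some operating systems. A minimal sketch of the saving step using `pathlib` with sanitized file names is shown below. The helper names `safe_filename` and `save_books`, the replacement character, and the demo data are assumptions made for illustration, not part of the function above.

```python
import re
import tempfile
from pathlib import Path

def safe_filename(title):
    """Replace characters that are invalid in common file systems and trim whitespace."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return cleaned or 'untitled'  # fall back for empty titles

def save_books(books, folder):
    """Write each found book to <folder>/<title>.txt and return the number saved."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = 0
    for title, text in books.items():
        if text == "NOT FOUND":  # marker used by download_top_books
            continue
        (folder / f"{safe_filename(title)}.txt").write_text(text, encoding='utf-8')
        saved += 1
    return saved

# Usage with made-up data and a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo = {"Moby Dick ": "Call me Ishmael.", "Dracula": "NOT FOUND"}
    print(save_books(demo, tmp))  # 1
```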

In [28]:
save_text_files(100,verbosity=1)

Working on book: Pride and Prejudice by Jane Austen 
Finished downloading 1.0%
Working on book: Heart of Darkness by Joseph Conrad 
Finished downloading 2.0%
Working on book: Dracula by Bram Stoker 
**ERROR: Dracula by Bram Stoker  Not Found**
Working on book: A Tale of Two Cities by Charles Dickens 
Finished downloading 4.0%
Working on book: The Art of Being Happy by Joseph Droz 
Finished downloading 5.0%
Working on book: Moby Dick
Finished downloading 6.0%
Working on book: Alice
Finished downloading 7.0%
Working on book: Metamorphosis by Franz Kafka 
**ERROR: Metamorphosis by Franz Kafka  Not Found**
Working on book: Frankenstein
Finished downloading 9.0%
Working on book: The Romance of Lust
Finished downloading 10.0%
Working on book: The Importance of Being Earnest
**ERROR: The Importance of Being Earnest Not Found**
Working on book: The Adventures of Tom Sawyer by Mark Twain 
Finished downloading 12.0%
Working on book: The Adventures of Sherlock Holmes by Arthur Conan Doyle 
**ERRO