## Download the Top Gutenberg Ebooks (yesterday's ranking)
* **This starter code scrapes Project Gutenberg's Top 100 ebooks page (yesterday's ranking) to identify the ebook links.**
* **It uses BeautifulSoup4 to parse the HTML and regular expressions to identify the Top 100 ebook file numbers.**
* **It includes a function that takes a user input for how many books to download and then crawls the server to download them into a dictionary object.**
* **Finally, it includes a function to save the downloaded Ebooks as text files in a local directory.**

In [59]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re

#### Ignore SSL certificate errors

In [2]:
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
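
The cell above switches certificate checking off, which is convenient for a quick experiment but means the server's identity is never verified. As a hedged alternative, the sketch below shows the default context left untouched; the name `verified_ctx` is introduced here for illustration only and does not appear elsewhere in the notebook.

```python
import ssl

# create_default_context() verifies both the certificate chain and the
# hostname unless those checks are explicitly disabled.
verified_ctx = ssl.create_default_context()

print(verified_ctx.check_hostname)
print(verified_ctx.verify_mode == ssl.CERT_REQUIRED)
```

Passing `context=verified_ctx` to `urllib.request.urlopen` would keep verification on; whether that works in a given environment depends on the locally installed CA certificates.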

#### Read the HTML from the URL and pass on to BeautifulSoup

In [11]:
# Read the HTML from the URL and pass on to BeautifulSoup
top100url = 'https://www.gutenberg.org/browse/scores/top'
url = top100url
print(f"Opening the file connection to {url}")
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
print("Connection established and HTML parsed...")

Opening the file connection to https://www.gutenberg.org/browse/scores/top
Connection established and HTML parsed...


#### Find all the `<a>` tags and store their _'href'_ attributes in the list of links

In [53]:
# Empty list to hold all the http links in the HTML page
lst_links=[]

In [54]:
# Find all the <a> tags and store their href attributes in the list of links
for link in soup.find_all('a'):
    #print(link.get('href'))
    lst_links.append(link.get('href'))

#### Use a regular expression to find the numeric digits in these links. These are the file numbers for the Top 100 books.

In [14]:
# Use a regular expression to find the numeric digits in these links. These are the file numbers for the Top 100 books.
# Initialize empty list to hold the file numbers
booknum=[]

In [15]:
# Links at indices 19 through 118 in the list hold the Top 100 books' file numbers.
# These offsets depend on the page layout at the time of scraping.
for i in range(19,119):
    link=lst_links[i]
    # Skip anchors that have no href attribute
    if link is None:
        continue
    link=link.strip()
    # Regular expression to find the numeric digits in the link (href) string
    n=re.findall('[0-9]+',link)
    if len(n)==1:
        # Append the file number cast to an integer
        booknum.append(int(n[0]))

print ("\nThe file numbers for the top 100 ebooks on Gutenberg are shown below\n"+"-"*70)
print(booknum)


The file numbers for the top 100 ebooks on Gutenberg are shown below
----------------------------------------------------------------------
[1342, 219, 56633, 11, 84, 98, 844, 1661, 2701, 76, 56630, 345, 5200, 2591, 30254, 74, 27827, 2542, 1232, 2600, 4300, 158, 1400, 56632, 6130, 1184, 1952, 174, 43, 5740, 1260, 16, 56623, 135, 768, 3207, 829, 1322, 161, 16328, 28054, 30360, 408, 1497, 4363, 244, 2274, 56625, 2680, 2554, 3600, 100, 147, 120, 23, 2500, 56626, 863, 1080, 20203, 10, 2852, 56612, 2814, 1399, 28520, 46, 205, 19942, 1404, 1934, 3090, 56635, 20228, 3296, 45, 5739, 35, 55, 56634, 56628, 1112, 25305, 36, 730, 33283, 203, 514, 521, 851, 34901, 308, 1998, 56620, 42, 209, 2097, 236, 1228, 20]
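
The hard-coded index range 19 to 119 only holds for one particular page layout. A possible alternative, sketched below, is to match the link pattern itself. It assumes that book links on the page have the form `/ebooks/<number>` and that the first 100 such links belong to yesterday's list; both are assumptions about the page, and `extract_book_ids` and the sample list are hypothetical names and data used only for illustration.

```python
import re

def extract_book_ids(hrefs, limit=100):
    """Return up to `limit` book ids from hrefs shaped like '/ebooks/1342'."""
    # Assumed link shape: '/ebooks/' followed only by digits
    pattern = re.compile(r'^/ebooks/(\d+)$')
    ids = []
    for href in hrefs:
        # link.get('href') returns None for anchors without an href
        if href is None:
            continue
        match = pattern.match(href.strip())
        if match:
            ids.append(int(match.group(1)))
        if len(ids) == limit:
            break
    return ids

# Small made-up sample standing in for lst_links
sample_links = ['/about/', None, '/ebooks/1342', ' /ebooks/219 ', '/ebooks/search/']
print(extract_book_ids(sample_links))
```

With the real page, the call would be `extract_book_ids(lst_links)`; the result should be checked against the rendered page before relying on it.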


#### Search the text extracted from the soup object (using a regular expression) to find the names of the Top 100 Ebooks (yesterday's ranking)

In [83]:
start_idx=soup.text.splitlines().index('Top 100 EBooks yesterday')
lst_titles_temp=[] # Empty list of Ebook names
for i in range(100):
    lst_titles_temp.append(soup.text.splitlines()[start_idx+2+i])

In [87]:
# Use regular expression to extract only text from the name strings and append to an empty list
lst_titles=[]
for i in range(100):
    id1,id2=re.match('^[a-zA-Z ]*',lst_titles_temp[i]).span()
    lst_titles.append(lst_titles_temp[i][id1:id2])
for l in lst_titles:
    print(l)

Pride and Prejudice by Jane Austen 
Heart of Darkness by Joseph Conrad 
The City That Was by Stephen Smith 
Alice
Frankenstein
A Tale of Two Cities by Charles Dickens 
The Importance of Being Earnest
The Adventures of Sherlock Holmes by Arthur Conan Doyle 
Moby Dick
Adventures of Huckleberry Finn by Mark Twain 
Rome by W
Dracula by Bram Stoker 
Metamorphosis by Franz Kafka 
Grimms
The Romance of Lust
The Adventures of Tom Sawyer by Mark Twain 
The Kama Sutra of Vatsyayana by Vatsyayana 
Et dukkehjem
Il Principe
War and Peace by graf Leo Tolstoy 
Ulysses by James Joyce 
Emma by Jane Austen 
Great Expectations by Charles Dickens 
Two Little Women and Treasure House by Carolyn Wells 
The Iliad by Homer 
The Count of Monte Cristo
The Yellow Wallpaper by Charlotte Perkins Gilman 
The Picture of Dorian Gray by Oscar Wilde 
The Strange Case of Dr
Tractatus Logico
Jane Eyre
Peter Pan by J
Quaint Korea by Louise Jordan Miln 
Les Mis
Wuthering Heights by Emily Bront
Leviathan by Thomas Hobbes 
G
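
The pattern `^[a-zA-Z ]*` stops at the first character that is not an ASCII letter or a space, which is why titles in the output above are cut short at apostrophes, periods, hyphens, and accented letters ("Alice", "Rome by W", "Les Mis"). A gentler approach, sketched here, removes only a trailing parenthesised number. It assumes each raw line ends with a download count such as `(1234)`; that format is an assumption about the page, and `clean_title` is a hypothetical helper.

```python
import re

def clean_title(raw):
    """Strip a trailing '(digits)' suffix and surrounding whitespace."""
    # Only the final parenthesised number is removed; punctuation and
    # non-ASCII letters inside the title are left untouched.
    return re.sub(r'\s*\(\d+\)\s*$', '', raw).strip()

# Made-up sample lines in the assumed format
print(clean_title("Les Misérables by Victor Hugo (512)"))
print(clean_title("Peter Pan by J. M. Barrie (730) "))
```

Because the titles later become dictionary keys and file names, keeping them complete also reduces the chance that two different books collapse to the same key.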

### Define a function that takes a user input for how many top books to download and crawls the server to download them

In [273]:
def download_top_books(num_download=10,verbosity=0):
    """
    Function: Download top N books from Gutenberg.org where N is specified by user
    Verbosity: If verbosity is turned on (set to 1) then prints the downloading status for every book
    Returns: Returns a dictionary where keys are the names of the books and values are the raw text.
    Exception Handling: If a book is not found on the server (due to broken link or whatever reason), inserts "NOT FOUND" as the text.
    """
    topEBooks = {}
    
    if num_download<=0:
        print("I guess no download is necessary")
        return topEBooks
    
    if num_download>100:
        print("You asked for more than 100 downloads.\nUnfortunately, Gutenberg ranks only top 100 books.\nProceeding to download top 100 books.")
        num_download=100
    
    # Base URL for files repository
    baseurl= 'http://www.gutenberg.org/files/'
    
    if verbosity==1:
        count_done=0
        for i in range(num_download):
            print ("Working on book:", lst_titles[i])
            
            # Create the proper download link (url) from the book id
            # You have to examine the Gutenberg.org file structure carefully to come up with the proper url
            bookid=booknum[i]
            bookurl= baseurl+str(bookid)+'/'+str(bookid)+'-0.txt'
            # Create a file handler object
            try:
                fhand = urllib.request.urlopen(bookurl)
                txt_dump = ''
                # Iterate over the lines in the file handler object and dump the data into the text string
                for line in fhand:
                    # Use decode method to convert the UTF-8 to Unicode string
                    txt_dump+=line.decode()
                # Add downloaded text to the dictionary with keys matching the list of book titles.
                # This puts the raw text as the value of the key of the dictionary bearing the name of the Ebook 
                topEBooks[lst_titles[i]]=txt_dump
                count_done+=1
                print (f"Finished downloading {round(100*count_done/num_download,2)}%")
            except urllib.error.URLError as e:
                topEBooks[lst_titles[i]]="NOT FOUND"
                count_done+=1
                print(f"ERROR: {lst_titles[i]} {e.reason}")
    else:
        count_done=0
        from tqdm import tqdm
        for i in tqdm(range(num_download),desc='Download % completed',dynamic_ncols=True):
            # Create the proper download link (url) from the book id
            # You have to examine the Gutenberg.org file structure carefully to come up with the proper url
            bookid=booknum[i]
            bookurl= baseurl+str(bookid)+'/'+str(bookid)+'-0.txt'
            # Create a file handler object
            try:
                fhand = urllib.request.urlopen(bookurl)
                txt_dump = ''
                # Iterate over the lines in the file handler object and dump the data into the text string
                for line in fhand:
                    # Use decode method to convert the UTF-8 to Unicode string
                    txt_dump+=line.decode()
                # Add downloaded text to the dictionary with keys matching the list of book titles.
                # This puts the raw text as the value of the key of the dictionary bearing the name of the Ebook 
                topEBooks[lst_titles[i]]=txt_dump
                count_done+=1
            except urllib.error.URLError as e:
                topEBooks[lst_titles[i]]="NOT FOUND"
                count_done+=1
                print(f"ERROR: {lst_titles[i]} {e.reason}")
        
    print ("-"*40+"\nFinished downloading all books!\n"+"-"*40)
       
    return (topEBooks)
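
Both branches of the function above repeat the same fetch-and-fallback steps. One way to express that step once is a small helper like the sketch below. `fetch_book_text` and its `opener` parameter are hypothetical and not part of the notebook; the URL pattern is the one the function already uses. Decoding with `utf-8-sig` is a choice made here so that a leading byte-order mark, if present, is dropped.

```python
import io
import urllib.error
import urllib.request

def fetch_book_text(bookid, opener=urllib.request.urlopen,
                    baseurl='http://www.gutenberg.org/files/'):
    """Return the text of one book, or "NOT FOUND" if the request fails."""
    # Same URL pattern as in download_top_books
    bookurl = f"{baseurl}{bookid}/{bookid}-0.txt"
    try:
        with opener(bookurl) as fhand:
            # 'utf-8-sig' decodes UTF-8 and strips a leading BOM if present
            return fhand.read().decode('utf-8-sig')
    except urllib.error.URLError:
        # HTTPError is a subclass of URLError, so 404s land here too
        return "NOT FOUND"

# Offline demonstration with stand-in openers (no network access needed)
def fake_open(url):
    return io.BytesIO(b'\xef\xbb\xbfHeart of Darkness')

def failing_open(url):
    raise urllib.error.URLError('unreachable')

print(fetch_book_text(219, opener=fake_open))
print(fetch_book_text(219, opener=failing_open))
```

Injecting the opener keeps the helper testable without touching the server; in the notebook it would simply be called as `fetch_book_text(booknum[i])`.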

#### Test the function with verbosity=0 (default)

In [269]:
dict_books=download_top_books(1)

Download % completed: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.37s/it]


----------------------------------------
Finished downloading all books!
----------------------------------------


#### Test the function with verbosity=1

In [272]:
dict_books=download_top_books(105,1)

You asked for more than 100 downloads. 
Unfortunately, Gutenberg ranks only top 100 books.
Proceeding to download 100 books.


#### Show the final dictionary and an example of the downloaded text

In [205]:
print(dict_books[lst_titles[1]][:1500])

﻿The Project Gutenberg EBook of Heart of Darkness, by Joseph Conrad

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Heart of Darkness

Author: Joseph Conrad

Release Date: February 1995 [EBook #219]
Last Updated: September 7, 2016

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK HEART OF DARKNESS ***




Produced by Judith Boss and David Widger





HEART OF DARKNESS

By Joseph Conrad




I


The Nellie, a cruising yawl, swung to her anchor without a flutter of
the sails, and was at rest. The flood had made, the wind was nearly
calm, and being bound down the river, the only thing for it was to come
to and wait for the turn of the tide.

The sea-reach of the Thames stretched before us like the beginning of
an interminable waterway. In t

### Write a function to download and save the downloaded texts

In [290]:
def save_text_files(num_download=10,verbosity=1):
    """
    Downloads the top N books from Gutenberg.org, where N is specified by the user.
    If verbosity is turned on (set to 1), prints the download status for every book.
    Asks the user for a folder in which to save the downloaded Ebooks and writes one text file per book there.
    Prints a status message indicating how many Ebooks were successfully downloaded and saved.
    """
    
    import os
    
    # Download the Ebooks and save in a dictionary object (in-memory)
    dict_books=download_top_books(num_download=num_download,verbosity=verbosity)
    
    if dict_books=={}:
        return None
    
    # Ask the user for a save location (directory path)
    savelocation=input("Please enter a folder location to save the Ebooks in:")
    
    count_successful_download=0
    
    # Fall back to an 'Ebooks' folder in the current working directory if the input is blank
    if (len(savelocation)<1):
        savelocation=os.path.join(os.getcwd(),'Ebooks')
    # Joining with '' appends the platform's path separator if it is missing
    savelocation=os.path.join(savelocation,'')
    # Create the directory if it does not exist. Otherwise, just use the existing path.
    os.makedirs(savelocation,exist_ok=True)
    #print("Saving files at:",savelocation)
    for k,v in dict_books.items():
        if (v!="NOT FOUND"):
            # Strip the trailing space that the title regex can leave behind
            filename=os.path.join(savelocation,str(k).strip()+'.txt')
            with open(filename,'wb') as file:
                file.write(v.encode("UTF-8",'ignore'))
            count_successful_download+=1
    
    # Status message
    print (f"{count_successful_download} book(s) was/were successfully downloaded and saved to the location {savelocation}")
    if (num_download!=count_successful_download):
        print(f"{num_download-count_successful_download} books were not found on the server!")
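
Book titles are used directly as file names, so a title containing characters such as `?`, `:` or `/` could make the write fail on some systems. The sketch below shows one possible sanitising step; `safe_filename` is a hypothetical helper, and the set of replaced characters reflects the ones Windows rejects in file names.

```python
import re

def safe_filename(title, ext='.txt'):
    """Turn a book title into a file name that common file systems accept."""
    # Replace characters that are not allowed in Windows file names
    cleaned = re.sub(r'[<>:"/\\|?*]', '_', title)
    # Drop surrounding whitespace and trailing dots
    cleaned = cleaned.strip().rstrip('.')
    return (cleaned or 'untitled') + ext

print(safe_filename('Pride and Prejudice by Jane Austen '))
print(safe_filename('Who? What: A/B'))
```

Inside `save_text_files`, the result could be joined to the save folder with `os.path.join` in place of the plain string concatenation.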

In [289]:
save_text_files(3,1)

Working on book: Pride and Prejudice by Jane Austen 
Finished downloading 33.33%
Working on book: Heart of Darkness by Joseph Conrad 
Finished downloading 66.67%
Working on book: The City That Was by Stephen Smith 
Finished downloading 100.0%
----------------------------------------
Finished downloading all books!
----------------------------------------
Please enter a folder location to save the Ebooks in:
3 book(s) was/were successfully downloaded and saved to the location C:\Users\Tirtha\Documents\Personal\Data Science related\Gits and Projects\PythonScripts\WebDataAnalytics\Ebooks\
