# 01 Text selection and download

In this notebook, we select the books we want to analyse and download them.
There are several ways to get a catalog of the books in Project Gutenberg. The cleanest is to use the SQLite database that the gutenbergpy package can build by running gutenbergpy.gutenbergcache.GutenbergCache.create(). We do not run that line here because it takes a long time; instead, we include a zipped copy of the database, which we now unzip:

In [1]:
import os
import zipfile
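# 'archive' rather than 'zip', so that the built-in zip() is not shadowed
with zipfile.ZipFile('gutenbergindex.db.zip', 'r') as archive:
    archive.extractall()  # extract into the current working directory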
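
Re-running the cell above extracts the database again each time. A minimal sketch of a guard that skips the extraction when the file is already present is shown below; the helper name `extract_if_missing` is hypothetical, and it assumes the archive member is called `gutenbergindex.db`, as the later `sqlite3.connect` call suggests.

```python
import os
import zipfile

def extract_if_missing(archive, member, target_dir="."):
    """Extract `member` from `archive` unless it already exists in `target_dir`."""
    target = os.path.join(target_dir, member)
    if os.path.exists(target):
        return target  # already extracted, nothing to do
    with zipfile.ZipFile(archive) as zf:
        if member not in zf.namelist():
            raise FileNotFoundError(f"{member} not found in {archive}")
        zf.extract(member, path=target_dir)
    return target

# Intended use in this notebook (assumed member name):
# extract_if_missing('gutenbergindex.db.zip', 'gutenbergindex.db')
```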

We fetch all French-language books from the catalog and put their Gutenberg number, author and title in a dataframe.

In [2]:
import sqlite3
import pandas as pd
con = sqlite3.connect('gutenbergindex.db')
cur = con.cursor()

cur.execute("""SELECT titles.name, authors.name, books.gutenbergbookid
FROM books
JOIN titles ON titles.bookid = books.id
JOIN book_authors ON books.id = book_authors.bookid
JOIN authors ON book_authors.authorid = authors.id
WHERE books.languageid = 5""")  # French books have a languageid of 5

# A book appears once per author alias; keep only the first row for each Gutenberg number
gut_df = (
    pd.DataFrame(cur.fetchall(), columns=['title', 'author', 'number'])
    .drop_duplicates(subset='number', keep='first')
    .set_index("number")
)

con.close()
gut_df

number,title,author
29114,La cour et la ville de Madrid vers la fin du X...,"d'Aulnoy, Marie-Catherine"
797,L'Abbesse De Castro,Marie-Henri Beyle
32138,Vers Ispahan,"Viaud, Julien"
24861,L'affaire Sougraine,"Lemay, Léon Pamphile"
37678,Louis XI et Les États Pontificaux de France au...,"Rey, Raymond"
...,...,...
12120,Nouveaux contes bleus,"Laboulaye, Édouard-René Lefebvre de"
43782,Gringalette,"Grassal, Georges-Joseph"
17660,L'archipel en feu,"Verne, Júlio"
53722,Viviane,"Tennyson, Alfred Lord"
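
The query above relies on knowing the table layout and on the value 5 for French. A minimal sketch for listing the tables and columns of any SQLite database, using only standard SQLite features (`sqlite_master` and `PRAGMA table_info`), is given below; the helper name `describe_schema` is hypothetical, and nothing here assumes a particular gutenbergpy schema.

```python
import sqlite3

def describe_schema(con):
    """Map each table name to the list of its column names."""
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 holds the column name
        columns = con.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in columns]
    return schema

# Intended use in this notebook:
# describe_schema(sqlite3.connect('gutenbergindex.db'))
```

If the database turns out to contain a table of language names (an assumption, not checked here), it could be used to confirm the language id instead of hard-coding it.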


Which authors are the most represented?

In [3]:
gut_df['author'].value_counts()[:30]

Verschillende                             261
Dupin, Amandine-Aurore-Lucile              73
Dumas, Alexander                           55
Verne, Júlio                               44
Shakspeare, William                        42
Anonyme                                    38
De Maupassant, Guy                         37
Michelet, J.                               32
Thiers, Marie-Joseph-Louis-Adolphe         30
Zola, Émile Édouard Charles Antoine        29
Lamartine, Alphonse Marie Louis de         28
Hugo, Victor Marie, comte                  28
Féval, Paul Henri Corentin                 26
Gourmont, Remi de                          25
Viaud, Julien                              23
Lebert, Marie                              22
Boylesve, Rene                             22
Sue, Eugene                                20
Goncourt, Edmond Louis Antoine Huot de     20
Gaboriau, Émile                            19
Thibault, Jacques-Anatole-François         19
Arouet, François Marie            
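
Two of the top entries do not look like real authors: "Verschillende" is Dutch for "various", and "Anonyme" means "anonymous". A small sketch on toy data, assuming we want to leave such placeholder entries out before counting (the variable names are hypothetical):

```python
import pandas as pd

# Toy data standing in for gut_df['author']; names are taken from the output above
authors = pd.Series(
    ["Verschillende", "Anonyme", "Verne, Júlio", "Verne, Júlio", "Sue, Eugene"]
)
placeholders = ["Verschillende", "Anonyme"]  # catalog placeholders, not individual writers

# Drop the placeholder rows, then count books per remaining author
real_author_counts = authors[~authors.isin(placeholders)].value_counts()
print(real_author_counts)
```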

Let's choose ten well-known and well-represented French authors and put their book data in a dataframe. We replace the catalog form of each name with the name under which the author is best known.

In [4]:
author_list_raw=["Dupin, Amandine-Aurore-Lucile", "Dumas, Alexander", "Verne, Júlio", "De Maupassant, Guy", "Hugo, Victor Marie, comte", "Lamartine, Alphonse Marie Louis de",
             "Thibault, Jacques-Anatole-François", "Proust, Marcel", "Zola, Émile Édouard Charles Antoine",  "Flaubert, Gustave"] 
sel_gut_df1=gut_df[gut_df['author'].isin(author_list_raw)]
sel_gut_df=sel_gut_df1.copy()

author_list=["George Sand", "Alexandre Dumas", "Jules Verne", "Guy de Maupassant", "Victor Hugo", "Alphonse de Lamartine",
             "Anatole France", "Marcel Proust", "Émile Zola",  "Gustave Flaubert"] 
for i, author in enumerate(author_list_raw):
    sel_gut_df.loc[sel_gut_df1['author'] == author, 'author'] = author_list[i]

sel_gut_df

number,title,author
34204,La petite Fadette,George Sand
5095,20000 Lieues sous les mers — Part 1,Jules Verne
24850,Lourdes,Émile Zola
41054,Cours familier de Littérature - Volume 07,Alphonse de Lamartine
63794,La comédie de celui qui épousa une femme muette,Anatole France
...,...,...
17693,"La San-Felice, Tome 01",Alexandre Dumas
7772,Les Quarante-Cinq — Tome 3,Alexandre Dumas
38674,"De la terre à la lune, trajet direct en 97 heu...",Jules Verne
17660,L'archipel en feu,Jules Verne
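
The selection and renaming above keep two parallel lists in step by position. A possible alternative, sketched here on a toy frame, stores the pairs in one dictionary and applies it with `Series.map`; the names `pen_names` and `selected` are hypothetical, and only two of the ten authors are shown.

```python
import pandas as pd

# Toy frame standing in for gut_df
df = pd.DataFrame(
    {
        "title": ["La petite Fadette", "L'archipel en feu", "Viviane"],
        "author": ["Dupin, Amandine-Aurore-Lucile", "Verne, Júlio", "Tennyson, Alfred Lord"],
    },
    index=pd.Index([34204, 17660, 53722], name="number"),
)

# Catalog name -> best-known name, kept together so the pairs cannot drift apart
pen_names = {
    "Dupin, Amandine-Aurore-Lucile": "George Sand",
    "Verne, Júlio": "Jules Verne",
}

# Keep only the chosen authors, then translate their names
selected = df[df["author"].isin(list(pen_names))].copy()
selected["author"] = selected["author"].map(pen_names)
print(selected)
```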


Let's create folders for our files.

In [5]:

# exist_ok=True makes the cell safe to re-run
os.makedirs("rawtext", exist_ok=True)
# one sub-folder per selected author
for a in sel_gut_df["author"].unique():
    os.makedirs(os.path.join("rawtext", a), exist_ok=True)

Next we download the files. (This could probably be done more efficiently with gutenbergpy's built-in download function.)

In [6]:
import re  # needed for the encoding lookup in the third branch
import requests

for a in author_list:
    for n in sel_gut_df[sel_gut_df['author']==a].index:
        filename=os.path.join("rawtext", a, str(n))
        if not os.path.isfile(filename):
            url1 = 'https://www.gutenberg.org/files/'+str(n)+"/"+str(n)+"-0.txt" # we prefer these as these are utf-8
            url2 = 'https://www.gutenberg.org/files/'+str(n)+"/"+str(n)+"-8.txt" # these are in latin-1
            url3 = 'https://www.gutenberg.org/files/'+str(n)+"/"+str(n)+".txt" # if all else fails we take the unknown (latin-1 or ascii) encoding and try to figure out what encoding it is
            response1 = requests.head(url1)
            response2 = requests.head(url2)
            response3 = requests.head(url3)
            if response1.status_code == 200:
                text=requests.get(url1)
                with open(filename, 'wb') as f:
                    f.write(text.content)
            elif response2.status_code == 200:
                text=requests.get(url2)
                with open(filename, 'wb') as f:
                    f.write(text.content.decode('latin1').encode('utf-8'))
            elif response3.status_code == 200:
                text=requests.get(url3)
                try:
                    # the file header declares its own encoding
                    encoding = re.search(r"Character set encoding: (.*?)\n", text.text).group(1).strip()
                    if encoding=='ISO-Latin-1':
                        encoding='latin1'
                    data = text.content.decode(encoding).encode('utf-8')
                except (AttributeError, LookupError, UnicodeDecodeError):
                    # no header line, unknown codec name, or bytes that do not match it
                    data = text.content
                with open(filename, 'wb') as f:
                    f.write(data)
            else:
                print("no such file:", filename)



no such file: rawtext/Jules Verne/20973


The download failed because book 20973 is an audio file, not a text. Let's update our dataframe accordingly.

In [7]:
sel_gut_df.drop(20973, inplace=True)
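
The download loop sends three HEAD requests for every book before the actual GET. A possible alternative, sketched below with the standard library only, tries the same three URL patterns in order and stops at the first one that answers. All helper names (`candidate_urls`, `http_get`, `declared_encoding`, `download_as_utf8`) are hypothetical, and the `fetch` parameter exists only so the logic can be exercised without network access.

```python
import re
from urllib.request import urlopen

def candidate_urls(number):
    """URL patterns from the loop above, in order of preference, with their encoding."""
    base = f"https://www.gutenberg.org/files/{number}/{number}"
    # None means the encoding has to be read from the file header
    return [(base + "-0.txt", "utf-8"), (base + "-8.txt", "latin1"), (base + ".txt", None)]

def http_get(url):
    """Return the response body as bytes, or None if the URL cannot be fetched."""
    try:
        with urlopen(url, timeout=30) as response:
            return response.read()
    except OSError:  # HTTPError, URLError and socket timeouts are all OSError subclasses
        return None

def declared_encoding(raw):
    """Read the 'Character set encoding:' header line, if present."""
    match = re.search(rb"Character set encoding:[ \t]*([^\r\n]+)", raw)
    if match is None:
        return None
    name = match.group(1).decode("ascii", errors="replace").strip()
    return "latin1" if name == "ISO-Latin-1" else name

def download_as_utf8(number, fetch=http_get):
    """Return the book as UTF-8 bytes, or None if no candidate URL works."""
    for url, encoding in candidate_urls(number):
        raw = fetch(url)
        if raw is None:
            continue  # try the next pattern
        encoding = encoding or declared_encoding(raw)
        try:
            return raw.decode(encoding).encode("utf-8") if encoding else raw
        except (LookupError, UnicodeDecodeError):
            return raw  # unknown or mismatching encoding: keep the bytes unchanged
    return None
```

A call such as `download_as_utf8(17660)` would return the bytes to write to disk, or `None` for a book like 20973 that has no text file.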


Let's try to open the files:

In [8]:
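for number, row in sel_gut_df.iterrows():
    filename=os.path.join("rawtext", row["author"], str(number))
    # the files are meant to be UTF-8, so read them as UTF-8 whatever the platform default is
    with open(filename, 'r', encoding='utf-8') as f:
        try:
            text=f.read()
        except UnicodeDecodeError:
            print(filename)  # this file contains bytes that are not valid UTF-8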

rawtext/Alexandre Dumas/1910


This file seems to be wrongly encoded. Fortunately, we have another version:

In [9]:
sel_gut_df[sel_gut_df['title'].str.contains('ulipe')]

number,title,author
1910,La Tulipe Noire,Alexandre Dumas
26504,La tulipe noire,Alexandre Dumas
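
Before discarding a file, it can be useful to see where decoding breaks. A minimal sketch, assuming the files are expected to be UTF-8; the helper name `first_invalid_utf8_offset` is hypothetical.

```python
def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if the file is valid."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start  # position of the first offending byte
    return None

# Intended use in this notebook:
# first_invalid_utf8_offset(os.path.join("rawtext", "Alexandre Dumas", "1910"))
```

Looking at the bytes around the reported offset usually shows whether the file is Latin-1 text that was never converted.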


We delete the wrongly encoded version and save our catalog dataframe.

In [10]:
sel_gut_df.drop(1910, inplace=True)
sel_gut_df.to_csv('sel_gut_df.csv')