# Synthetic Cover Page Generation

This notebook is the first step in our pipeline. The aim was to train a neural network on thousands of cover pages, set in a variety of fonts and font sizes, so that it could automatically recognize and extract characters from a new cover page. Because photographing that many physical books was infeasible, we generated the pages synthetically.

A brief outline of the process implemented in this notebook:

1. Generate a list of possible URLs hosted by the University of Rome's Italian Library website.
2. Scrape the online library for all valid URLs and get the page content.
3. Use Beautiful Soup to extract and save content from the 'Author' and 'Title' tags as their respective lists.
4. Save a list of common fonts (available for non-commercial use from Google Fonts library).
5. For each (title, author) pair, combine m font styles with n font sizes to generate m × n cover pages in .pdf format.

In [1]:
from bs4 import BeautifulSoup

In [2]:
import requests

## 1. Generate a list of possible URLs hosted by the University of Rome's Italian Library website.

In [4]:
main5 = 'http://www.bibliotecaitaliana.it/indice/visualizza_scheda/bibit00000'
main4 = 'http://www.bibliotecaitaliana.it/indice/visualizza_scheda/bibit0000'
main3 = 'http://www.bibliotecaitaliana.it/indice/visualizza_scheda/bibit000'
main2 = 'http://www.bibliotecaitaliana.it/indice/visualizza_scheda/bibit00'

In [5]:
list1 = list(range(0,2000))

In [6]:
len(str(list1[100]))

3

In [7]:
links = []

In [8]:
# Pick the prefix whose zero padding brings every record number to six digits.
for i in list1:
    n_digits = len(str(i))
    if n_digits == 1:
        links.append(main5 + str(i))
    elif n_digits == 2:
        links.append(main4 + str(i))
    elif n_digits == 3:
        links.append(main3 + str(i))
    elif n_digits == 4:
        links.append(main2 + str(i))
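
The four prefixes differ only in the number of zeros they carry, so the branching above amounts to zero-padding the record number to six digits. A minimal sketch of that equivalence, assuming the same URL scheme; `BASE` and `build_links` are names introduced here for illustration and are not part of the notebook:

```python
# Hypothetical helper names; the URL prefix is taken from the cell above.
BASE = 'http://www.bibliotecaitaliana.it/indice/visualizza_scheda/bibit'


def build_links(n_records, width=6):
    """Return one URL per record number, zero-padded to `width` digits."""
    return [f"{BASE}{i:0{width}d}" for i in range(n_records)]


links_padded = build_links(2000)
print(links_padded[0])     # URL for record 0
print(links_padded[1999])  # URL for record 1999
```

The format specification `0{width}d` handles every digit count with one expression, so extending the range beyond 9999 would not require another prefix.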

In [9]:
all_requests = []

In [10]:
from tqdm import tqdm

## 2. Scrape the online library for all valid URLs and get the page content.

In [11]:
for link in tqdm(links):
    all_requests.append(requests.get(link))

100%|██████████| 2000/2000 [34:34<00:00,  1.01it/s]


In [12]:
# Second name for the same list object (not a copy).
all_requests2 = all_requests

In [13]:
len(all_requests2)

2000

In [14]:
valid_pages = []

In [15]:
for r in all_requests2:
    if r.status_code == 200:
        valid_pages.append(r)
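
The crawl above takes over half an hour, and a single connection error or stalled request would interrupt it. The sketch below shows one way to combine fetching and status filtering so that failed requests are skipped. It is an illustration under assumptions, not the code that produced the results in this notebook: `fetch_valid_pages` and its `timeout` and `delay` parameters are hypothetical, and `get` is expected to behave like `requests.get`.

```python
import time


def fetch_valid_pages(urls, get, timeout=10.0, delay=0.0):
    """Fetch each URL with `get` and keep only responses with status code 200.

    `get` is any callable with the calling convention of `requests.get`
    (URL first, `timeout` as a keyword argument).
    """
    valid = []
    for url in urls:
        try:
            response = get(url, timeout=timeout)
        except OSError:
            # requests' RequestException derives from OSError (IOError), so
            # connection failures and timeouts skip this URL instead of
            # aborting the whole crawl.
            continue
        if response.status_code == 200:
            valid.append(response)
        if delay:
            time.sleep(delay)  # optional pause between requests
    return valid
```

With `requests` imported, this could be called as `valid_pages = fetch_valid_pages(links, requests.get)`. Passing the `get` method of a `requests.Session()` instead would reuse the connection across requests.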

## 3. Use Beautiful Soup to extract and save content from the 'Author' and 'Title' tags as their respective lists.

In [16]:
soups = []

In [17]:
for page in valid_pages:
    soup = BeautifulSoup(page.content,'html.parser')
    soups.append(soup)

In [18]:
authors = []
titles = []

In [19]:
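for soup in soups:
    text_only = list(soup.stripped_strings)
    try:
        # Each value follows its label in the sequence of stripped strings.
        author = text_only[text_only.index('Autore:') + 1]
        title = text_only[text_only.index('Titolo:') + 1]
    except (ValueError, IndexError):
        # Skip pages missing either label rather than reusing the index
        # left over from the previous page.
        continue
    authors.append(author)
    titles.append(title)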
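
The lookup relies on each label ('Autore:' or 'Titolo:') being immediately followed by its value in `soup.stripped_strings`. The sketch below isolates that logic on plain lists of strings and makes the missing-label case explicit. The helper names `value_after` and `extract_pairs` are hypothetical, and the example pages are made up to stand in for `list(soup.stripped_strings)`.

```python
def value_after(strings, label):
    """Return the string that follows `label` in `strings`, or None if absent."""
    try:
        return strings[strings.index(label) + 1]
    except (ValueError, IndexError):
        return None


def extract_pairs(pages):
    """Collect (author, title) pairs, skipping pages that lack either field."""
    pairs = []
    for strings in pages:
        author = value_after(strings, 'Autore:')
        title = value_after(strings, 'Titolo:')
        if author is not None and title is not None:
            pairs.append((author, title))
    return pairs


# Made-up example data standing in for list(soup.stripped_strings).
example_pages = [
    ['Scheda', 'Autore:', 'Dante Alighieri', 'Titolo:', 'Vita nuova'],
    ['Pagina senza metadati'],
    ['Autore:', 'Giovanni Boccaccio', 'Titolo:', 'Decameron'],
]
pairs = extract_pairs(example_pages)
print(pairs)
```

Collecting pairs rather than two separate lists keeps each author aligned with its title even when a page is skipped.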

In [20]:
len(authors)

1629

In [21]:
len(titles)

1629

In [23]:
import pickle

In [24]:
with open('authors.pkl', 'wb') as f:
    pickle.dump(authors, f)

In [25]:
with open('titles.pkl', 'wb') as f:
    pickle.dump(titles, f)
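
Saving the lists means a later session can reload them instead of repeating the crawl. A minimal round-trip sketch follows; `save_list` and `load_list` are hypothetical helpers, and the data is a placeholder written to a temporary directory rather than to the notebook's `authors.pkl`.

```python
import os
import pickle
import tempfile


def save_list(items, path):
    """Serialize `items` to `path` with pickle."""
    with open(path, 'wb') as f:
        pickle.dump(items, f)


def load_list(path):
    """Load a pickled object; only use this on files you created yourself."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Round trip in a temporary directory with placeholder data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'authors.pkl')
    save_list(['Author A', 'Author B'], path)
    restored = load_list(path)

print(restored)
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from trusted sources.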

In [41]:
from fpdf import FPDF
from tqdm import tqdm
import os
import random

In [42]:
import warnings
warnings.filterwarnings('ignore')

## 4. Save a list of common fonts (available for non-commercial use from Google Fonts library).

In [43]:
fonts = []

In [44]:
for root, dirs, files in os.walk("./fonts_dir"):
    for filename in files:
        # Drop the extension (e.g. ".ttf") to keep only the font name.
        fonts.append(os.path.splitext(filename)[0])

In [45]:
print(len(fonts))

482


In [46]:
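# Font files are loaded as fname1 + font + fname2. The listing above walks
# ./fonts_dir, so both paths must refer to the same directory.
fname1 = './fonts/'
fname2 = '.ttf'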
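
The font names are later turned back into paths by joining `fname1`, the name and `fname2`. This only works if the listing and the prefix point at the same directory and every listed file is a `.ttf`. The sketch below avoids both assumptions by recording the full path of each font file. The function `find_fonts` is hypothetical, and the demonstration uses empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_fonts(font_dir):
    """Map font name (file stem) to the path of each .ttf file under `font_dir`."""
    return {p.stem: p for p in sorted(Path(font_dir).rglob('*.ttf'))}


# Demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('Lato-Regular.ttf', 'Roboto-Bold.ttf', 'LICENSE.txt'):
        (Path(tmp) / name).touch()
    font_paths = find_fonts(tmp)
    font_names = sorted(font_paths)

print(font_names)
```

Looking up `font_paths[name]` then gives the file to load without rebuilding the path from string fragments.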

## 5. For each (title, author) pair, combine m font styles with n font sizes to generate m × n cover pages in .pdf format.

In [47]:
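def create_individual_cover(font, size, title, author):
    """Write one PDF cover page with the title and author in the given font and size."""
    f_path = fname1 + font + fname2
    # Line height: roughly the font size converted from points to FPDF's
    # default unit, millimetres (1 pt is about 0.35 mm).
    t_height = size / 3
    # Output name: <title>_<author>_<font>_<size>.pdf, spaces replaced by underscores.
    parts = [title, author, font, str(size)]
    output_file = './raw_pages/' + '_'.join(p.replace(" ", "_") for p in parts) + '.pdf'

    this_pdf = FPDF()
    try:
        this_pdf.add_font(family=font, style='', fname=f_path, uni=True)
        this_pdf.add_page()
        this_pdf.set_font(font, '', size)
        this_pdf.write(h=t_height, txt=title)
        this_pdf.multi_cell(w=10, h=t_height, txt="\n\n")  # blank lines between title and author
        this_pdf.write(h=t_height, txt=author)
        this_pdf.output(output_file, 'F')
        this_pdf.close()
    except Exception:
        # Any failure (e.g. a font that cannot be loaded or an output path
        # that cannot be written) silently skips this cover.
        pass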
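
The output path is built directly from the title and author, so a character such as `/` or `:` in either one can produce a path that cannot be written. Because the function swallows exceptions, such a cover would be skipped without notice. The sketch below shows one possible way to sanitize the name parts first. The helpers `safe_name` and `cover_filename`, the `max_len` limit and the example title are all illustrative assumptions.

```python
import re


def safe_name(text, max_len=60):
    """Turn arbitrary text into a string that is safe to use inside a file name."""
    # Replace every run of characters that are neither word characters nor
    # hyphens with a single underscore, then trim underscores at the ends.
    cleaned = re.sub(r'[^\w\-]+', '_', text).strip('_')
    return cleaned[:max_len] or 'untitled'


def cover_filename(title, author, font, size, out_dir='./raw_pages'):
    """Build the PDF path from sanitized title, author and font names."""
    parts = [safe_name(title), safe_name(author), safe_name(font), str(size)]
    return f"{out_dir}/{'_'.join(parts)}.pdf"


# Illustrative input containing a slash and an accented character.
example = cover_filename('Il Principe / De Principatibus',
                         'Niccolò Machiavelli', 'Lato-Regular', 48)
print(example)
```

Accented letters count as word characters in Python 3 string patterns, so Italian names pass through unchanged.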

In [48]:
font_sizes_range = list(range(33,96))

In [51]:
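def create_cover_combos(book_title, book_author):
    # random.choices samples with replacement, so a font or size can repeat
    # and a book may end up with fewer than 17 * 8 distinct covers.
    for f in random.choices(fonts, k=17):
        sizes = random.choices(font_sizes_range, k=8)
        for size in sizes:
            create_individual_cover(f, size, book_title, book_author)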
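
Whether each (title, author) pair really yields m × n distinct files depends on whether fonts and sizes are drawn with or without replacement. The sketch below is a variant that draws without replacement using `random.sample`, which guarantees m × n distinct (font, size) pairs when the font names are unique. The function `pick_combos` and the placeholder font names are hypothetical; the size range mirrors `font_sizes_range`.

```python
import random


def pick_combos(fonts, sizes, m, n, rng=random):
    """Return m * n distinct (font, size) pairs: m fonts, each with n sizes."""
    combos = []
    for font in rng.sample(fonts, k=m):       # m distinct fonts
        for size in rng.sample(sizes, k=n):   # n distinct sizes for this font
            combos.append((font, size))
    return combos


# Placeholder font names; the size range mirrors font_sizes_range.
demo_fonts = [f"font_{i}" for i in range(482)]
demo_sizes = list(range(33, 96))
combos = pick_combos(demo_fonts, demo_sizes, m=17, n=8, rng=random.Random(0))
print(len(combos), len(set(combos)))
```

With m = 17 and n = 8 this gives 136 covers per book. Passing a seeded `random.Random` instance makes the selection reproducible.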

In [52]:
for (author,title) in tqdm(set(zip(authors,titles))):
    create_cover_combos(title, author)

100%|██████████| 1628/1628 [26:39<00:00,  1.09it/s]


As the project progressed, its direction changed and we used an OpenCV pipeline instead of training a neural network. The main reason is that supervised learning on this dataset requires pre-labelled training data, which in this scenario means the bounding box coordinates of every character in every cover image. Labelling an image for object detection is manageable when it contains a single object (for example a cat or a face), but labelling images with an average of 40 characters each proved infeasible within our time constraints.

Towards the end of the project, however, we realized that this generation pipeline would be very useful for building on the work we have completed so far. This is explained in more detail in later notebooks and in the README.