# #readMoreCanlit | Notebook 1: Data acquisition

<center><img src='../img/readMoreCanlit.png'></center>

<a name="contents"></a>
## Contents

* <a href="#overview">Overview</a><br>
* <a href="#imports">Imports</a><br>
* <a href="#data-sources">Data sources</a><br>
* <a href="#get-international-book-metadata">Get international book metadata</a><br>
* <a href="#get-canadian-book-metadata">Get Canadian book metadata</a><br>
* <a href="#get-canadian-book-cover-art">Get Canadian book cover art</a><br>

<a name="overview"></a>
## Overview

To populate the corpus and app, three sets of data needed to be gathered:

> 1. information on international books (title, author, description)
> 2. information on Canadian books (title, author, description)
> 3. book cover art for Canadian books

For the international books, a broad search was undertaken to find online lists of ISBNs (International Standard Book Numbers, the identifiers used in publishing to distinguish books and editions from one another). Sources found at openlibrary.org and data.world included over 2.7 million ISBNs; the lists contain the ISBN and no further information.

To gather the necessary metadata, the ISBNdb.com API was used to query a database of 12 million books. Progress was slow, however. First, queries were limited to 15,000 per day. Second, the ISBNdb.com database is incomplete: many ISBNs had no entry, and many of those that did lacked a description (the primary piece of metadata that the recommender system needs to work effectively). On average, about 5 percent of the returned ISBN information was usable. Over 100,000 ISBN queries were executed to produce a dataset of 6,000 international titles.
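The figures quoted above imply a rough budget that can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are illustrative, and "over 100,000" is treated as exactly 100,000, so the results are bounds rather than measurements.

```python
import math

# Figures quoted in the overview
daily_query_limit = 15_000   # ISBNdb.com queries allowed per day
queries_executed = 100_000   # "over 100,000" queries, taken here as a lower bound
usable_titles = 6_000        # size of the final international dataset

# Minimum number of days needed when every day uses the full quota
min_days = math.ceil(queries_executed / daily_query_limit)

# Share of queries that ended up as a usable record (an upper bound,
# since the true number of queries was higher than 100,000)
usable_rate = usable_titles / queries_executed

print(f"At least {min_days} days of querying")
print(f"Usable rate of at most {usable_rate:.1%}")
```

The upper bound on the usable rate is in line with the "about 5 percent" average reported above.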

Canadian titles were more easily sourced from the website 49thshelf.com. Through web scraping, a set of URLs was requested to collect title, author and description information for 8,500 Canadian fiction titles (by contrast, it was not possible to differentiate the international titles by genre). The website also provided book cover images for all of the titles.


<div style="text-align: right">(<a href="#contents">home</a>) </div>

<a name="imports"></a>
## Imports

In [46]:
# pandas and numpy
import pandas as pd
import numpy as np

# other imports
from bs4 import BeautifulSoup
import json
import requests
import time
import urllib.request
from datetime import datetime

pd.options.display.max_seq_items = 2000
pd.options.display.max_rows = 4000

<div style="text-align: right">(<a href="#contents">home</a>) </div>

<a name="get-international-book-metadata"></a>
## Get international book metadata

In this section of the notebook, I use the ISBNdb.com API to query their database with a long list of ISBNs. The code in this section was rerun many times over a period of days. To run it, you will need to uncomment it and obtain your own ISBNdb.com membership and API key. The assembled data is in the repo in the /data folder.
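Because many ISBNs come back with no entry or with fields missing, it can help to keep the field extraction separate from the network call. The sketch below assumes the `{'book': {...}}` payload shape that the download loop in this section reads (`title`, `authors`, `overview`); the helper name `extract_book_fields` and the sample payloads, including the error body, are made up for illustration.

```python
from typing import Any, Dict, Optional

FIELDS = ("title", "authors", "overview")

def extract_book_fields(payload: Any) -> Dict[str, Optional[Any]]:
    """Return the three fields of interest, using None where one is missing."""
    book = payload.get("book") if isinstance(payload, dict) else None
    if not isinstance(book, dict):
        book = {}
    return {field: book.get(field) for field in FIELDS}

# Hypothetical payloads, not real API responses
complete = {"book": {"title": "Example", "authors": ["A. Writer"], "overview": "A summary."}}
no_overview = {"book": {"title": "Example", "authors": ["A. Writer"]}}
not_found = {"message": "made-up error body"}

for payload in (complete, no_overview, not_found):
    print(extract_book_fields(payload))
```

Returning `None` for every absent field means a record without a description can be filtered out later instead of interrupting a long download run.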

In [27]:
# Read in the ISBN list

isbn = pd.read_csv('../data/data_acquisition/international_for_download.csv')

# My for loop below requires the ISBNs to be interpreted as strings so they can be interpolated into URLs
isbn = isbn.applymap(str)

# Confirm the change
isbn.dtypes

isbn        object
title       object
authors     object
overview    object
dtype: object

In [28]:
# Reduce to a subset matching the ISBNdb.com daily limit

isbn = isbn[0:15000]
isbn

Unnamed: 0,isbn,title,authors,overview
0,9781780104089,,,
1,9781780104065,,,
2,9781780104058,,,
3,9781780104041,,,
4,9781780104034,,,
5,9781780104027,,,
6,9781780104003,,,
7,9781780103990,,,
8,9781780103983,,,
9,9781780103976,,,


In [29]:
# Iterate through dataframe containing the list of ISBNs, 
# constructing URLs to pass to requests
# along with the ISBNdb authorization key
# return the necessary content in JSON format
# and write it back into the dataframe

# Note, this process was repeated many times
# it is commented out so I don't incur costs if all cells are run

# header = {'Authorization': 'YOUR API KEY HERE'}
# base_url = 'https://api2.isbndb.com/book/'

# for j in range(len(isbn)):

#     response = requests.get(base_url + isbn.at[j, 'isbn'], headers=header)

#     # A missing ISBN may come back without a JSON body or without a 'book' entry
#     try:
#         book = response.json()['book']
#     except (ValueError, KeyError, TypeError):
#         book = {}

#     # Write each field back to the dataframe, using NaN where it is absent
#     # (.at sets a single cell and avoids chained assignment)
#     for field in ['title', 'authors', 'overview']:
#         isbn.at[j, field] = book.get(field, np.nan)

#     print('Info downloaded for book ' + str(j + 1) + ' of ' + str(len(isbn)) + ' books.')

#     time.sleep(1)
    

Info downloaded for book 1 of 3000 books.
Info downloaded for book 2 of 3000 books.
Info downloaded for book 3 of 3000 books.
...
Info downloaded for book 2972 of 3000 books.
Info downloaded for book 2973 of 3000 books.
[output truncated]

In [30]:
# Drop the ISBN column now that it is no longer needed
# and save the current set of international book metadata out to csv
# with the file named for the current date and time.

now = datetime.now()
dt = now.strftime("%d-%m-%Y_%H-%M-%S")


isbn.drop('isbn', axis=1, inplace=True)
isbn.to_csv('../data/saved/isbn' + dt +'.csv', index = False)

# Note that this process was repeated many times to assemble the 
# international portion of the app's dataframe
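Since this cell is rerun many times, each run leaves its own timestamped file behind. One optional variation, sketched here as an assumption rather than as what the notebook does, is to put the year first in the timestamp so that the file names sort in chronological order. The helper name `run_filename` and the example dates are hypothetical.

```python
from datetime import datetime
from pathlib import Path

def run_filename(prefix: str, when: datetime, folder: str = "../data/saved") -> Path:
    """Build a CSV path whose name sorts in chronological order."""
    stamp = when.strftime("%Y-%m-%d_%H-%M-%S")  # year first, so names sort by time
    return Path(folder) / f"{prefix}_{stamp}.csv"

# Fixed timestamps for illustration; a real run would pass datetime.now()
earlier = run_filename("isbn", datetime(2021, 3, 9, 18, 5, 0))
later = run_filename("isbn", datetime(2021, 11, 2, 7, 30, 0))

print(earlier.name)
print(later.name)
print(sorted([later.name, earlier.name]))
```

With day-first names, a plain alphabetical listing of the folder mixes months and years together, which makes it harder to see which runs have already been merged.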

<div style="text-align: right">(<a href="#contents">home</a>) </div>

<a name="get-canadian-book-metadata"></a>
## Get Canadian book metadata

In this section of the notebook, I connect to the website 49thshelf.com in order to access descriptions of Canadian books used by the model. Uncomment the relevant code to run the download process. This code only needs to be run once, but it will take about 24 hours to complete. The data is in the repo in the /data folder.

In [5]:
# Read in the list of Canadian book URLs
canadian_download = pd.read_csv('../data/data_acquisition/canadian_for_download.csv')

# Cast every column to string so the URLs can be passed straight to requests
canadian_download = canadian_download.applymap(str)

# Keep the rows from position 7001 onward and renumber them from 0
canadian_download = canadian_download[7001:]
canadian_download.reset_index(drop=True, inplace=True)

# Confirm the change
canadian_download

Unnamed: 0,title_url,title,authors,image,description
0,https://49thshelf.com/Books/T/The-Rib-From-Whi...,The Rib From Which I Remake the World,Ed Kurtz,https://images.49thshelf.com/var/ezflow_site/s...,
1,https://49thshelf.com/Books/T/The-Rise-Fall-of...,The Rise & Fall of Great Powers,Tom Rachman,https://images.49thshelf.com/var/ezflow_site/s...,
2,https://49thshelf.com/Books/T/The-Rise-Fall-of...,The Rise & Fall of Great Powers,Tom Rachman,https://images.49thshelf.com/var/ezflow_site/s...,
3,https://49thshelf.com/Books/T/The-Rise-of-the-...,The Rise of the Iron Moon,Stephen Hunt,https://images.49thshelf.com/var/ezflow_site/s...,
4,https://49thshelf.com/Books/T/The-Rise-of-the-...,The Rise of the Iron Moon,Stephen Hunt,https://images.49thshelf.com/var/ezflow_site/s...,
5,https://49thshelf.com/Books/T/The-Rising-Tide,The Rising Tide,Mark Frutkin,https://images.49thshelf.com/var/ezflow_site/s...,
6,https://49thshelf.com/Books/T/The-Rivals-of-Ve...,The Rivals of Versailles,Sally Christie,https://images.49thshelf.com/var/ezflow_site/s...,
7,https://49thshelf.com/Books/T/The-River4,The River,Cheryl Kaye Tardif,https://images.49thshelf.com/var/ezflow_site/s...,
8,https://49thshelf.com/Books/T/The-River-Burns3,The River Burns,Trevor Ferguson,https://images.49thshelf.com/var/ezflow_site/s...,
9,https://49thshelf.com/Books/T/The-River-Killers,The River Killers,Bruce Burrows,https://images.49thshelf.com/var/ezflow_site/s...,


In [6]:
# Iterate through the Canadian book metadata dataframe (populated from a csv);
# Grab the URL where the book description lives and use beautifulsoup
# to grab the relevant content and write it back to the dataframe

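for c in range(len(canadian_download)):
    response = requests.get(canadian_download.at[c, 'title_url'])
    soup = BeautifulSoup(response.text, 'html.parser')

    try:
        # The second positional argument of find() filters on CSS class
        result = soup.find("div", "description")
        text = result.text.strip()
        # Remove the leading "Description" heading as a prefix
        # (str.lstrip would strip a set of characters, not a prefix)
        if text.startswith('Description'):
            text = text[len('Description'):].strip()
        canadian_download.at[c, 'description'] = text
        print('Processed book ' + str(c))
    except AttributeError:
        # find() returned None: the page has no description block
        canadian_download.at[c, 'description'] = np.nan
        print('Failed to process book ' + str(c))

    time.sleep(1)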
    
canadian_download.to_csv('../data/processed/canadian_pre5.csv', index = False)

Failed to process book 0
Processed book 1
Processed book 2
Processed book 3
...
Processed book 1690
Failed to process book 1691
Processed book 1692
[output truncated]
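One detail worth care when trimming a leading "Description" heading from scraped text: `str.lstrip` treats its argument as a set of characters, not as a prefix, so it can eat into the description itself. The sketch below shows the difference on a made-up blurb; `strip_heading` is a hypothetical helper, not part of the notebook.

```python
raw = "\nDescription\n\nDespite its title, this is a hypothetical blurb.\n"

# lstrip removes any leading run of the characters in its argument,
# so the first word of the blurb is lost along with the heading
print(repr(raw.lstrip("\nDescription\n\n")))

def strip_heading(text: str, heading: str = "Description") -> str:
    """Remove surrounding whitespace and one leading heading, if present."""
    text = text.strip()
    if text.startswith(heading):
        text = text[len(heading):].strip()
    return text

print(repr(strip_heading(raw)))
```

Only blurbs whose opening letters all happen to appear in "Description" are affected, which makes the problem easy to miss when spot-checking results.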

In [75]:
canadian_images

Unnamed: 0,id,title,authors,image,description,image_name
0,0,88,Michael Fletcher,https://images.49thshelf.com/var/ezflow_site/s...,The dream of Artificial Intelligence is dead a...,88.jpg
1,1,419,Will Ferguson,https://images.49thshelf.com/var/ezflow_site/s...,From internationally bestselling travel writer...,419.jpg
2,2,1978,Daniel Jones,https://images.49thshelf.com/var/ezflow_site/s...,"In this violent, raw, and often beautiful nove...",1978.jpg
3,3,1979,Ray Robertson,https://images.49thshelf.com/var/ezflow_site/s...,It’s 1979 and Tom Buzby is thirteen years old ...,1979.jpg
4,4,2113,Kevin J. Anderson,https://images.49thshelf.com/var/ezflow_site/s...,18 exhilarating journeys into Rush-inspired wo...,2113.jpg
...,...,...,...,...,...,...
6768,6768,Zero Day,Ezekiel Boone,https://images.49thshelf.com/var/ezflow_site/s...,"The wildly entertaining, deeply satisfying fin...",zero_day.jpg
6769,6769,Zip's File,Shannon Maguire,https://images.49thshelf.com/var/ezflow_site/s...,Zip's File: A Romance of Silence explores the ...,zip's_file.jpg
6770,6770,Zolitude,Paige Cooper,https://images.49thshelf.com/var/ezflow_site/s...,WINNER OF THE 2018 QUEBEC WRITERS' FEDERATION ...,zolitude.jpg
6771,6771,Zoo and Crowbar,David Zieroth,https://images.49thshelf.com/var/ezflow_site/s...,The Wind has mysteriously caused the death of ...,zoo_and_crowbar.jpg


In [None]:
# Drop the ISBN column now that it is no longer needed
# and check the shape of each dataframe.

canadian_download.drop('isbn', axis=1, inplace=True)
canadian_download.shape

international_download.drop('isbn', axis=1, inplace=True)
international_download.shape

In [None]:
# Remove duplicate entries from the dataframe
canadian_download = canadian_download.drop_duplicates(subset='title', keep="first")
canadian_download.to_csv('../data/processed/canadian_pre.csv', index = False)
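The save above writes to a fixed file name, so each run replaces the previous file. If a separate file per run is wanted, the name could be built from the current date and time. The following is a minimal sketch using only the standard library; the helper name `timestamped_csv_path`, the timestamp format and the example date are my own assumptions, not part of the original workflow.

```python
from datetime import datetime
from pathlib import Path


def timestamped_csv_path(directory, stem, now=None):
    """Return a path such as <directory>/<stem>_YYYYMMDD_HHMMSS.csv."""
    # `now` can be passed in explicitly to make the result reproducible
    now = now or datetime.now()
    return Path(directory) / f"{stem}_{now:%Y%m%d_%H%M%S}.csv"


# Hypothetical usage with the dataframe from the cell above:
# canadian_download.to_csv(timestamped_csv_path('../data/processed', 'canadian_pre'), index=False)

# The date below is arbitrary and only shows the shape of the resulting name
example = timestamped_csv_path('../data/processed', 'canadian_pre',
                               now=datetime(2020, 5, 19, 14, 30, 5))
print(example.name)
```

A dated name keeps earlier downloads on disk, at the cost of having to pick the right file when reading the data back in.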

<div style="text-align: right">(<a href="#contents">home</a>) </div>

<a name="get-canadian-book-cover-art"></a>
## Get Canadian book cover art

In this section, I use the metadata assembled in the previous section, which includes a URL for each book's cover image, to download the cover art. Running this process would involve uncommenting the code and waiting about 24 hours for the downloads to complete.

In [69]:
# Read in the list of Canadian book metadata
# that contains URLs for book-cover imagery
canadian_images = pd.read_csv('../data/processed/canadian_pre1.csv')

# The cells below call string methods on the titles and concatenate values into file paths,
# so cast every value to a string (astype(str) replaces applymap, which newer pandas versions deprecate)
canadian_images = canadian_images.astype(str)
# Confirm the change
canadian_images.dtypes

id             object
title          object
authors        object
image          object
description    object
dtype: object

In [70]:
# Drop all-empty rows (a no-op once the cast to str above has turned NaN into the string 'nan')
canadian_images.dropna(how='all', inplace=True)

In [71]:
canadian_images

|  | id | title | authors | image | description |
|---|---|---|---|---|---|
| 0 | 0 | 88 | Michael Fletcher | https://images.49thshelf.com/var/ezflow_site/s... | The dream of Artificial Intelligence is dead a... |
| 1 | 1 | 419 | Will Ferguson | https://images.49thshelf.com/var/ezflow_site/s... | From internationally bestselling travel writer... |
| 2 | 2 | 1978 | Daniel Jones | https://images.49thshelf.com/var/ezflow_site/s... | In this violent, raw, and often beautiful nove... |
| 3 | 3 | 1979 | Ray Robertson | https://images.49thshelf.com/var/ezflow_site/s... | It’s 1979 and Tom Buzby is thirteen years old ... |
| 4 | 4 | 2113 | Kevin J. Anderson | https://images.49thshelf.com/var/ezflow_site/s... | 18 exhilarating journeys into Rush-inspired wo... |
| ... | ... | ... | ... | ... | ... |
| 6768 | 6768 | Zero Day | Ezekiel Boone | https://images.49thshelf.com/var/ezflow_site/s... | The wildly entertaining, deeply satisfying fin... |
| 6769 | 6769 | Zip's File | Shannon Maguire | https://images.49thshelf.com/var/ezflow_site/s... | Zip's File: A Romance of Silence explores the ... |
| 6770 | 6770 | Zolitude | Paige Cooper | https://images.49thshelf.com/var/ezflow_site/s... | WINNER OF THE 2018 QUEBEC WRITERS' FEDERATION ... |
| 6771 | 6771 | Zoo and Crowbar | David Zieroth | https://images.49thshelf.com/var/ezflow_site/s... | The Wind has mysteriously caused the death of ... |


In [72]:
# Build each image file name from the title: spaces become underscores and letters are lowercased.
# A single vectorized assignment avoids the chained indexing (df[col][row] = ...) that pandas warns about.
canadian_images['image_name'] = (
    canadian_images['title'].str.replace(' ', '_', regex=False).str.lower() + '.jpg'
)
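The file names built above keep every character of the title except spaces, so a name such as `zip's_file.jpg` still contains an apostrophe, and a title containing a character like `/` or `?` would not be a valid file name on some systems. Whether that explains any of the failed downloads further down is only a guess on my part. A stricter name could be produced along these lines; `safe_image_name` and the `untitled` fallback are hypothetical and were not used to produce the files in this notebook.

```python
import re


def safe_image_name(title, extension='.jpg'):
    """Turn a book title into a lowercase file name made of letters, digits and underscores."""
    # Replace every run of non-word characters (spaces, punctuation, slashes) with one underscore
    slug = re.sub(r'\W+', '_', str(title).lower()).strip('_')
    # Fall back to a placeholder if the title had no usable characters at all
    return (slug or 'untitled') + extension


print(safe_image_name("Zip's File"))
print(safe_image_name('Zoo and Crowbar'))
```

Two different titles can still collapse to the same name with this approach (or with the original one), in which case the later download overwrites the earlier file; appending the `id` column to the name would avoid that.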

In [73]:
canadian_images

|  | id | title | authors | image | description | image_name |
|---|---|---|---|---|---|---|
| 0 | 0 | 88 | Michael Fletcher | https://images.49thshelf.com/var/ezflow_site/s... | The dream of Artificial Intelligence is dead a... | 88.jpg |
| 1 | 1 | 419 | Will Ferguson | https://images.49thshelf.com/var/ezflow_site/s... | From internationally bestselling travel writer... | 419.jpg |
| 2 | 2 | 1978 | Daniel Jones | https://images.49thshelf.com/var/ezflow_site/s... | In this violent, raw, and often beautiful nove... | 1978.jpg |
| 3 | 3 | 1979 | Ray Robertson | https://images.49thshelf.com/var/ezflow_site/s... | It’s 1979 and Tom Buzby is thirteen years old ... | 1979.jpg |
| 4 | 4 | 2113 | Kevin J. Anderson | https://images.49thshelf.com/var/ezflow_site/s... | 18 exhilarating journeys into Rush-inspired wo... | 2113.jpg |
| ... | ... | ... | ... | ... | ... | ... |
| 6768 | 6768 | Zero Day | Ezekiel Boone | https://images.49thshelf.com/var/ezflow_site/s... | The wildly entertaining, deeply satisfying fin... | zero_day.jpg |
| 6769 | 6769 | Zip's File | Shannon Maguire | https://images.49thshelf.com/var/ezflow_site/s... | Zip's File: A Romance of Silence explores the ... | zip's_file.jpg |
| 6770 | 6770 | Zolitude | Paige Cooper | https://images.49thshelf.com/var/ezflow_site/s... | WINNER OF THE 2018 QUEBEC WRITERS' FEDERATION ... | zolitude.jpg |
| 6771 | 6771 | Zoo and Crowbar | David Zieroth | https://images.49thshelf.com/var/ezflow_site/s... | The Wind has mysteriously caused the death of ... | zoo_and_crowbar.jpg |


In [60]:
urllib.request.urlretrieve(canadian_images['image'][0], '../img/books/' + canadian_images['image_name'][0])

('../img/books/88.jpg', <http.client.HTTPMessage at 0x2f251f56a88>)

In [74]:
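for i in range(len(canadian_images)):

    try:
        urllib.request.urlretrieve(canadian_images['image'][i], '../img/books/' + canadian_images['image_name'][i])
        print('Just captured image number ' + str(i))
    except Exception:
        # Catch Exception rather than using a bare except so that Ctrl-C can still stop the loop
        print('Failed to capture image number ' + str(i))

    # Pause between requests to avoid overloading the server
    time.sleep(1)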


Just captured image number 0
Just captured image number 1
Just captured image number 2
Just captured image number 3
Just captured image number 4
Failed to capture image number 5
Just captured image number 6
Just captured image number 7
Just captured image number 8
Failed to capture image number 9
Just captured image number 10
Just captured image number 11
Just captured image number 12
Just captured image number 13
Just captured image number 14
Just captured image number 15
Just captured image number 16
Just captured image number 17
Failed to capture image number 18
Just captured image number 19
Failed to capture image number 20
Just captured image number 21
Just captured image number 22
Just captured image number 23
Just captured image number 24
Failed to capture image number 25
Just captured image number 26
Just captured image number 27
Failed to capture image number 28
Failed to capture image number 29
Just captured image number 30
Just captured image number 31
...

Just captured image number 261
Failed to capture image number 262
Just captured image number 263
Just captured image number 264
Just captured image number 265
Just captured image number 266
Just captured image number 267
Failed to capture image number 268
Just captured image number 269
Just captured image number 270
Failed to capture image number 271
Just captured image number 272
Just captured image number 273
Just captured image number 274
Just captured image number 275
Just captured image number 276
Just captured image number 277
Failed to capture image number 278
Just captured image number 279
Just captured image number 280
Failed to capture image number 281
Just captured image number 282
Just captured image number 283
Just captured image number 284
Just captured image number 285
Just captured image number 286
Just captured image number 287
Just captured image number 288
Just captured image number 289
Just captured image number 290
Just captured image number 291
...

Just captured image number 517
Just captured image number 518
Just captured image number 519
Just captured image number 520
Just captured image number 521
Just captured image number 522
Just captured image number 523
Just captured image number 524
Just captured image number 525
Just captured image number 526
Failed to capture image number 527
Just captured image number 528
Just captured image number 529
Just captured image number 530
Failed to capture image number 531
Just captured image number 532
Just captured image number 533
Failed to capture image number 534
Just captured image number 535
Just captured image number 536
Just captured image number 537
Failed to capture image number 538
Just captured image number 539
Just captured image number 540
Just captured image number 541
Just captured image number 542
Just captured image number 543
Just captured image number 544
Just captured image number 545
Just captured image number 546
Failed to capture image number 547
...

Just captured image number 775
Just captured image number 776
Just captured image number 777
Just captured image number 778
Just captured image number 779
Just captured image number 780
Just captured image number 781
Failed to capture image number 782
Just captured image number 783
Failed to capture image number 784
Failed to capture image number 785
Just captured image number 786
Just captured image number 787
Failed to capture image number 788
Failed to capture image number 789
Just captured image number 790
Just captured image number 791
Failed to capture image number 792
Just captured image number 793
Just captured image number 794
Failed to capture image number 795
Just captured image number 796
Just captured image number 797
Just captured image number 798
Failed to capture image number 799
Just captured image number 800
Just captured image number 801
Just captured image number 802
Failed to capture image number 803
Just captured image number 804
Just captured image number 805
...

Just captured image number 1031
Just captured image number 1032
Just captured image number 1033
Just captured image number 1034
Just captured image number 1035
Just captured image number 1036
Failed to capture image number 1037
Just captured image number 1038
Just captured image number 1039
Just captured image number 1040
Failed to capture image number 1041
Just captured image number 1042
Just captured image number 1043
Just captured image number 1044
Just captured image number 1045
Just captured image number 1046
Just captured image number 1047
Just captured image number 1048
Just captured image number 1049
Just captured image number 1050
Just captured image number 1051
Just captured image number 1052
Just captured image number 1053
Just captured image number 1054
Just captured image number 1055
Just captured image number 1056
Just captured image number 1057
Just captured image number 1058
Just captured image number 1059
Just captured image number 1060
Just captured image number 1061


Just captured image number 1281
Just captured image number 1282
Just captured image number 1283
Just captured image number 1284
Just captured image number 1285
Just captured image number 1286
Just captured image number 1287
Just captured image number 1288
Failed to capture image number 1289
Just captured image number 1290
Just captured image number 1291
Just captured image number 1292
Just captured image number 1293
Just captured image number 1294
Failed to capture image number 1295
Just captured image number 1296
Failed to capture image number 1297
Just captured image number 1298
Failed to capture image number 1299
Just captured image number 1300
Failed to capture image number 1301
Just captured image number 1302
Failed to capture image number 1303
Just captured image number 1304
Just captured image number 1305
Just captured image number 1306
Just captured image number 1307
Just captured image number 1308
Just captured image number 1309
Just captured image number 1310
...

Just captured image number 1531
Just captured image number 1532
Just captured image number 1533
Just captured image number 1534
Just captured image number 1535
Just captured image number 1536
Just captured image number 1537
Just captured image number 1538
Just captured image number 1539
Just captured image number 1540
Just captured image number 1541
Just captured image number 1542
Just captured image number 1543
Just captured image number 1544
Failed to capture image number 1545
Just captured image number 1546
Just captured image number 1547
Just captured image number 1548
Just captured image number 1549
Just captured image number 1550
Just captured image number 1551
Just captured image number 1552
Just captured image number 1553
Just captured image number 1554
Just captured image number 1555
Failed to capture image number 1556
Failed to capture image number 1557
Just captured image number 1558
Just captured image number 1559
Just captured image number 1560
...

Just captured image number 1782
Failed to capture image number 1783
Just captured image number 1784
Failed to capture image number 1785
Failed to capture image number 1786
Failed to capture image number 1787
Failed to capture image number 1788
Just captured image number 1789
Just captured image number 1790
Just captured image number 1791
Just captured image number 1792
Failed to capture image number 1793
Failed to capture image number 1794
Just captured image number 1795
Failed to capture image number 1796
Just captured image number 1797
Just captured image number 1798
Just captured image number 1799
Just captured image number 1800
Just captured image number 1801
Just captured image number 1802
Just captured image number 1803
Just captured image number 1804
Just captured image number 1805
Just captured image number 1806
Just captured image number 1807
Failed to capture image number 1808
Just captured image number 1809
Just captured image number 1810
Failed to capture image number 1811


Failed to capture image number 2031
Failed to capture image number 2032
Failed to capture image number 2033
Just captured image number 2034
Just captured image number 2035
Just captured image number 2036
Failed to capture image number 2037
Just captured image number 2038
Just captured image number 2039
Just captured image number 2040
Just captured image number 2041
Just captured image number 2042
Just captured image number 2043
Failed to capture image number 2044
Just captured image number 2045
Just captured image number 2046
Failed to capture image number 2047
Failed to capture image number 2048
Just captured image number 2049
Just captured image number 2050
Failed to capture image number 2051
Just captured image number 2052
Just captured image number 2053
Just captured image number 2054
Failed to capture image number 2055
Just captured image number 2056
Just captured image number 2057
Just captured image number 2058
Just captured image number 2059
Failed to capture image number 2060


Failed to capture image number 2281
Just captured image number 2282
Just captured image number 2283
Failed to capture image number 2284
Failed to capture image number 2285
Just captured image number 2286
Just captured image number 2287
Just captured image number 2288
Failed to capture image number 2289
Just captured image number 2290
Just captured image number 2291
Just captured image number 2292
Just captured image number 2293
Just captured image number 2294
Just captured image number 2295
Just captured image number 2296
Just captured image number 2297
Just captured image number 2298
Failed to capture image number 2299
Just captured image number 2300
Just captured image number 2301
Just captured image number 2302
Just captured image number 2303
Just captured image number 2304
Failed to capture image number 2305
Just captured image number 2306
Just captured image number 2307
Just captured image number 2308
Failed to capture image number 2309
Just captured image number 2310
...

Just captured image number 2529
Just captured image number 2530
Just captured image number 2531
Just captured image number 2532
Just captured image number 2533
Failed to capture image number 2534
Just captured image number 2535
Failed to capture image number 2536
Just captured image number 2537
Just captured image number 2538
Just captured image number 2539
Failed to capture image number 2540
Just captured image number 2541
Just captured image number 2542
Failed to capture image number 2543
Just captured image number 2544
Just captured image number 2545
Just captured image number 2546
Failed to capture image number 2547
Just captured image number 2548
Failed to capture image number 2549
Just captured image number 2550
Just captured image number 2551
Just captured image number 2552
Just captured image number 2553
Just captured image number 2554
Just captured image number 2555
Just captured image number 2556
Just captured image number 2557
Just captured image number 2558
...

Just captured image number 2778
Failed to capture image number 2779
Just captured image number 2780
Just captured image number 2781
Just captured image number 2782
Just captured image number 2783
Just captured image number 2784
Just captured image number 2785
Just captured image number 2786
Just captured image number 2787
Just captured image number 2788
Just captured image number 2789
Just captured image number 2790
Just captured image number 2791
Just captured image number 2792
Just captured image number 2793
Failed to capture image number 2794
Just captured image number 2795
Just captured image number 2796
Failed to capture image number 2797
Just captured image number 2798
Just captured image number 2799
Just captured image number 2800
Failed to capture image number 2801
Just captured image number 2802
Just captured image number 2803
Failed to capture image number 2804
Just captured image number 2805
Just captured image number 2806
Just captured image number 2807
...

Just captured image number 3026
Just captured image number 3027
Just captured image number 3028
Just captured image number 3029
Just captured image number 3030
Just captured image number 3031
Just captured image number 3032
Just captured image number 3033
Just captured image number 3034
Failed to capture image number 3035
Failed to capture image number 3036
Just captured image number 3037
Failed to capture image number 3038
Just captured image number 3039
Just captured image number 3040
Failed to capture image number 3041
Failed to capture image number 3042
Just captured image number 3043
Just captured image number 3044
Failed to capture image number 3045
Just captured image number 3046
Just captured image number 3047
Just captured image number 3048
Just captured image number 3049
Just captured image number 3050
Just captured image number 3051
Failed to capture image number 3052
Just captured image number 3053
Just captured image number 3054
Just captured image number 3055
...

Just captured image number 3276
Just captured image number 3277
Just captured image number 3278
Just captured image number 3279
Just captured image number 3280
Just captured image number 3281
Failed to capture image number 3282
Just captured image number 3283
Just captured image number 3284
Just captured image number 3285
Just captured image number 3286
Just captured image number 3287
Just captured image number 3288
Failed to capture image number 3289
Failed to capture image number 3290
Failed to capture image number 3291
Failed to capture image number 3292
Failed to capture image number 3293
Just captured image number 3294
Just captured image number 3295
Just captured image number 3296
Just captured image number 3297
Failed to capture image number 3298
Failed to capture image number 3299
Just captured image number 3300
Just captured image number 3301
Just captured image number 3302
Failed to capture image number 3303
Just captured image number 3304
Just captured image number 3305
...

Just captured image number 3526
Just captured image number 3527
Failed to capture image number 3528
Failed to capture image number 3529
Failed to capture image number 3530
Failed to capture image number 3531
Failed to capture image number 3532
Failed to capture image number 3533
Failed to capture image number 3534
Just captured image number 3535
Failed to capture image number 3536
Failed to capture image number 3537
Just captured image number 3538
Just captured image number 3539
Failed to capture image number 3540
Just captured image number 3541
Just captured image number 3542
Just captured image number 3543
Just captured image number 3544
Just captured image number 3545
Just captured image number 3546
Just captured image number 3547
Just captured image number 3548
Just captured image number 3549
Failed to capture image number 3550
Just captured image number 3551
Just captured image number 3552
Just captured image number 3553
Just captured image number 3554
...

Failed to capture image number 3776
Just captured image number 3777
Just captured image number 3778
Just captured image number 3779
Just captured image number 3780
Just captured image number 3781
Just captured image number 3782
Just captured image number 3783
Just captured image number 3784
Just captured image number 3785
Just captured image number 3786
Just captured image number 3787
Just captured image number 3788
Just captured image number 3789
Just captured image number 3790
Just captured image number 3791
Failed to capture image number 3792
Just captured image number 3793
Just captured image number 3794
Just captured image number 3795
Just captured image number 3796
Just captured image number 3797
Failed to capture image number 3798
Just captured image number 3799
Just captured image number 3800
Just captured image number 3801
Failed to capture image number 3802
Just captured image number 3803
Just captured image number 3804
Just captured image number 3805
...

Just captured image number 4025
Just captured image number 4026
Just captured image number 4027
Just captured image number 4028
Just captured image number 4029
Just captured image number 4030
Just captured image number 4031
Just captured image number 4032
Failed to capture image number 4033
Failed to capture image number 4034
Just captured image number 4035
Just captured image number 4036
Failed to capture image number 4037
Just captured image number 4038
Just captured image number 4039
Just captured image number 4040
Just captured image number 4041
Failed to capture image number 4042
Just captured image number 4043
Just captured image number 4044
Just captured image number 4045
Just captured image number 4046
Failed to capture image number 4047
Just captured image number 4048
Just captured image number 4049
Just captured image number 4050
Just captured image number 4051
Failed to capture image number 4052
Just captured image number 4053
Just captured image number 4054
...

Just captured image number 4273
Just captured image number 4274
Just captured image number 4275
Just captured image number 4276
Just captured image number 4277
Failed to capture image number 4278
Just captured image number 4279
Just captured image number 4280
Just captured image number 4281
Just captured image number 4282
Just captured image number 4283
Just captured image number 4284
Just captured image number 4285
Just captured image number 4286
Just captured image number 4287
Just captured image number 4288
Just captured image number 4289
Just captured image number 4290
Just captured image number 4291
Failed to capture image number 4292
Failed to capture image number 4293
Just captured image number 4294
Failed to capture image number 4295
Failed to capture image number 4296
Just captured image number 4297
Just captured image number 4298
Failed to capture image number 4299
Just captured image number 4300
Failed to capture image number 4301
Just captured image number 4302
...

Just captured image number 4522
Failed to capture image number 4523
Failed to capture image number 4524
Failed to capture image number 4525
Just captured image number 4526
Just captured image number 4527
Just captured image number 4528
Failed to capture image number 4529
Just captured image number 4530
Failed to capture image number 4531
Just captured image number 4532
Failed to capture image number 4533
Failed to capture image number 4534
Just captured image number 4535
Failed to capture image number 4536
Just captured image number 4537
Failed to capture image number 4538
Just captured image number 4539
Failed to capture image number 4540
Failed to capture image number 4541
Failed to capture image number 4542
Just captured image number 4543
Failed to capture image number 4544
Failed to capture image number 4545
Just captured image number 4546
Just captured image number 4547
Failed to capture image number 4548
Just captured image number 4549
Just captured image number 4550
...

Just captured image number 4769
Failed to capture image number 4770
Just captured image number 4771
Just captured image number 4772
Just captured image number 4773
Just captured image number 4774
Failed to capture image number 4775
Just captured image number 4776
Failed to capture image number 4777
Failed to capture image number 4778
Just captured image number 4779
Failed to capture image number 4780
Just captured image number 4781
Just captured image number 4782
Just captured image number 4783
Just captured image number 4784
Just captured image number 4785
Just captured image number 4786
Failed to capture image number 4787
Just captured image number 4788
Failed to capture image number 4789
Just captured image number 4790
Just captured image number 4791
Just captured image number 4792
Just captured image number 4793
Just captured image number 4794
Just captured image number 4795
Just captured image number 4796
Just captured image number 4797
Failed to capture image number 4798
...

Just captured image number 5018
Just captured image number 5019
Just captured image number 5020
Just captured image number 5021
Just captured image number 5022
Failed to capture image number 5023
Just captured image number 5024
Just captured image number 5025
Just captured image number 5026
Just captured image number 5027
Just captured image number 5028
Just captured image number 5029
Just captured image number 5030
Just captured image number 5031
Just captured image number 5032
Just captured image number 5033
Failed to capture image number 5034
Just captured image number 5035
Failed to capture image number 5036
Failed to capture image number 5037
Just captured image number 5038
Failed to capture image number 5039
Just captured image number 5040
Just captured image number 5041
Just captured image number 5042
Just captured image number 5043
Just captured image number 5044
Just captured image number 5045
Failed to capture image number 5046
Just captured image number 5047
...

Failed to capture image number 5266
Just captured image number 5267
Just captured image number 5268
Just captured image number 5269
Failed to capture image number 5270
Failed to capture image number 5271
Just captured image number 5272
Just captured image number 5273
Failed to capture image number 5274
Just captured image number 5275
Just captured image number 5276
Just captured image number 5277
Just captured image number 5278
Failed to capture image number 5279
Just captured image number 5280
Failed to capture image number 5281
Just captured image number 5282
Just captured image number 5283
Failed to capture image number 5284
Just captured image number 5285
Failed to capture image number 5286
Just captured image number 5287
Just captured image number 5288
Just captured image number 5289
Just captured image number 5290
Just captured image number 5291
Just captured image number 5292
Just captured image number 5293
Failed to capture image number 5294
Just captured image number 5295
...

Just captured image number 5513
Just captured image number 5514
Just captured image number 5515
Just captured image number 5516
Just captured image number 5517
Just captured image number 5518
Just captured image number 5519
Just captured image number 5520
Just captured image number 5521
Just captured image number 5522
Just captured image number 5523
Just captured image number 5524
Just captured image number 5525
Failed to capture image number 5526
Just captured image number 5527
Just captured image number 5528
Just captured image number 5529
Just captured image number 5530
Failed to capture image number 5531
Just captured image number 5532
Failed to capture image number 5533
Just captured image number 5534
Just captured image number 5535
Just captured image number 5536
Failed to capture image number 5537
Just captured image number 5538
Failed to capture image number 5539
Just captured image number 5540
Just captured image number 5541
Failed to capture image number 5542
...

Just captured image number 5761
Just captured image number 5762
Just captured image number 5763
Failed to capture image number 5764
Failed to capture image number 5765
Just captured image number 5766
Just captured image number 5767
Failed to capture image number 5768
Just captured image number 5769
Just captured image number 5770
Just captured image number 5771
Just captured image number 5772
Just captured image number 5773
Just captured image number 5774
Failed to capture image number 5775
Failed to capture image number 5776
Failed to capture image number 5777
Just captured image number 5778
Just captured image number 5779
Failed to capture image number 5780
Failed to capture image number 5781
Just captured image number 5782
Just captured image number 5783
Failed to capture image number 5784
Just captured image number 5785
Just captured image number 5786
Just captured image number 5787
Just captured image number 5788
Failed to capture image number 5789
Just captured image number 5790


Just captured image number 6009
Just captured image number 6010
Just captured image number 6011
Failed to capture image number 6012
Just captured image number 6013
Just captured image number 6014
Failed to capture image number 6015
Just captured image number 6016
Just captured image number 6017
Just captured image number 6018
Just captured image number 6019
Just captured image number 6020
Failed to capture image number 6021
Just captured image number 6022
Just captured image number 6023
Just captured image number 6024
Just captured image number 6025
Just captured image number 6026
Failed to capture image number 6027
Failed to capture image number 6028
Just captured image number 6029
Just captured image number 6030
Failed to capture image number 6031
Just captured image number 6032
Failed to capture image number 6033
Just captured image number 6034
Just captured image number 6035
Just captured image number 6036
Just captured image number 6037
Failed to capture image number 6038
...

Failed to capture image number 6258
Just captured image number 6259
Just captured image number 6260
Just captured image number 6261
Just captured image number 6262
Just captured image number 6263
Failed to capture image number 6264
Just captured image number 6265
Just captured image number 6266
Failed to capture image number 6267
Just captured image number 6268
Just captured image number 6269
Just captured image number 6270
Just captured image number 6271
Just captured image number 6272
Just captured image number 6273
Just captured image number 6274
Just captured image number 6275
Just captured image number 6276
Failed to capture image number 6277
Just captured image number 6278
Just captured image number 6279
Just captured image number 6280
Just captured image number 6281
Failed to capture image number 6282
Failed to capture image number 6283
Just captured image number 6284
Just captured image number 6285
Just captured image number 6286
Failed to capture image number 6287
...

Just captured image number 6507
Just captured image number 6508
Just captured image number 6509
Just captured image number 6510
Just captured image number 6511
Just captured image number 6512
Just captured image number 6513
Just captured image number 6514
Just captured image number 6515
Just captured image number 6516
Just captured image number 6517
Just captured image number 6518
Just captured image number 6519
Just captured image number 6520
Just captured image number 6521
Just captured image number 6522
Just captured image number 6523
Just captured image number 6524
Just captured image number 6525
Just captured image number 6526
Just captured image number 6527
Just captured image number 6528
Just captured image number 6529
Just captured image number 6530
Failed to capture image number 6531
Just captured image number 6532
Just captured image number 6533
Just captured image number 6534
Just captured image number 6535
Just captured image number 6536
Failed to capture image number 6537


Failed to capture image number 6757
Just captured image number 6758
Just captured image number 6759
Just captured image number 6760
Just captured image number 6761
Just captured image number 6762
Just captured image number 6763
Just captured image number 6764
Failed to capture image number 6765
Just captured image number 6766
Just captured image number 6767
Just captured image number 6768
Just captured image number 6769
Just captured image number 6770
Just captured image number 6771
Failed to capture image number 6772
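The log above shows which downloads failed but not why, and the failures are not stored anywhere for a later retry. One way to address both is sketched below. It is not the code that produced the output above: the function names, the `fetch` parameter, the 30-second timeout and the skip-if-present behaviour are all assumptions of mine. Only `urllib.request.urlopen` and the standard `os` and `time` modules are used.

```python
import os
import time
import urllib.request


def fetch_url(url, timeout=30):
    """Return the raw bytes at `url`, raising an exception on any network or HTTP error."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_covers(items, dest_dir, fetch=fetch_url, pause=1.0):
    """Download (url, file_name) pairs into dest_dir and return a list of failures.

    Files that already exist are skipped, so an interrupted run can be resumed.
    """
    os.makedirs(dest_dir, exist_ok=True)
    failures = []
    for url, file_name in items:
        target = os.path.join(dest_dir, file_name)
        if os.path.exists(target):
            continue
        try:
            data = fetch(url)
            with open(target, 'wb') as handle:
                handle.write(data)
        except Exception as error:
            # Keep the reason so failed downloads can be inspected or retried later
            failures.append((url, file_name, repr(error)))
        # Pause between requests, as in the loop above
        time.sleep(pause)
    return failures


# Hypothetical usage with the dataframe built above:
# failed = download_covers(zip(canadian_images['image'], canadian_images['image_name']), '../img/books/')
```

Because existing files are skipped, re-running the same call after an interruption only attempts the covers that are still missing, and the returned list records the error for each one that failed.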


In [77]:
# # Drop the column with the image URLs as it's no longer needed
# canadian_images.drop(columns='image', inplace=True)

# And resave to the file
canadian_images.to_csv('../data/processed/canadian_may19.csv', index = False)

By running the processes in this notebook as many times as needed, I was able to build a database of 6,500+ Canadian titles and 10,000+ international titles. Next, they will go through some feature engineering and preprocessing so that they are ready to be run through the TF-IDF vectorizer, which will turn the words into numerical values that a model can understand and compare with one another.
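To give a feel for what turning words into numerical values means, here is a small pure-Python illustration of the basic TF-IDF idea: a word scores highly in a description when it is frequent there and rare across the other descriptions. The three descriptions are made up for the example, and the formula is the plain textbook version. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {word: weight} dict per document using a basic TF-IDF formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency (share of the document) times inverse document frequency
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Invented descriptions, used only to show the shape of the output
descriptions = [
    'a detective hunts a killer in toronto',
    'a family saga set in rural manitoba',
    'a detective story set in montreal',
]
for vector in tfidf(descriptions):
    print(vector)
```

Words that occur in every description, such as "a" and "in" here, receive a weight of zero, while a word unique to one description, such as "toronto", receives the highest weight in that description.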

<div style="text-align: right">(<a href="#contents">home</a>) </div>