# Final Evaluation - Part 1

First, create a dataset of more than 500 cases using any of the APIs listed [here](https://github.com/public-apis/public-apis) or [here](https://github.com/realpython/list-of-python-api-wrappers) or by scraping the website of your choice. You will submit your notebook and the csv data file.

We used the public REST API of Penguin Random House (http://www.penguinrandomhouse.biz/webservices/rest/) to create a dataset of 501 books with the word "shakespeare" in their title. Specifically, we wrote a function that takes a search term and the number of books required, and retrieves up to that many books whose title contains the search term.

In [1]:
import requests
import pandas as pd

def get_books(search_term, count):
    """Return a list of dataframes holding up to `count` books that match `search_term`."""

    frames = []

    # The API returns at most 100 books per request, so we page through the requested count in batches of up to 100
    for start_index in range(0, count, 100):

        # Request 100 books, or only the remaining ones if this is the final batch
        max_count = min(100, count - start_index)

        # Call the API and parse the JSON response
        base_url   = 'https://reststop.randomhouse.com/resources/titles/'
        parameters = {'start': start_index,
                      'max': max_count,
                      'expandLevel': 1, # Get all possible information
                      'search': search_term}
        headers = {'Accept': 'application/json',
                   'Content-Type': 'application/json'}
        r = requests.get(base_url, params = parameters, headers = headers).json()

        # The response has no 'title' key when no books are left (this can happen if 'count' is larger
        # than the number of available books with this search term), so we only append non-empty responses.
        # A single book comes back as one dictionary and several books as a list of dictionaries, so we
        # check the type of the response instead of relying on the requested batch size.
        if 'title' in r:
            if isinstance(r['title'], dict):
                frames.append(pd.DataFrame([r['title']]))
            else:
                frames.append(pd.DataFrame(r['title']))

    return frames
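
The batching arithmetic in `get_books` can be checked without calling the API. The helper below, `batch_ranges`, is a hypothetical name introduced only for this illustration; it lists the `start`/`max` pairs the loop is meant to send, assuming the 100-book limit mentioned in the comment above.

```python
def batch_ranges(count, batch_size=100):
    """Return (start, size) pairs covering `count` items in chunks of at most `batch_size`."""
    return [(start, min(batch_size, count - start))
            for start in range(0, count, batch_size)]

# For 501 books: five full batches, then a final batch holding a single book
print(batch_ranges(501))
# → [(0, 100), (100, 100), (200, 100), (300, 100), (400, 100), (500, 1)]
```

The last pair shows why the single-book case needs its own branch: the final request for 501 books asks for exactly one title.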



In [2]:
# Extract 501 books with the word 'shakespeare' in their title. Note that the output contains the requested
# number of books, or fewer if fewer books with the search term in their title are available.

books = get_books('shakespeare', 501)
df = pd.concat(books, ignore_index = True)

In [3]:
# Length of the output dataframe
len(df)

501

In [4]:
df

Unnamed: 0,@uri,author,authors,authorbio,authorweb,awards,characters,contributorfirst1,contributorfirst2,contributorlast1,...,rgabout,rgauthbio,rgcopy,rgdiscussion,bonusfeature,authqanda,authordesktop,authordesktoplinktext,authordesktoplinkurl,sgmtDesc
0,https://reststop.randomhouse.com/resources/tit...,"RUDNICKI, STEFAN","{'authorId': {'@contributortype': 'E V', '$': ...",<b>Stefan Rudnicki</b>&#160;is an avid audiobo...,Collected and Introduced by Stefan Rudnicki,,,Stefan,Stefan,Rudnicki,...,,,,,,,,,,
1,https://reststop.randomhouse.com/resources/tit...,"EPSTEIN, NORRIE","{'authorId': {'@contributortype': 'A', '$': '8...",<b>Norrie Epstein</b> has lectured extensively...,Norrie Epstein,,,Norrie,,Epstein,...,,,,,,,,,,
2,https://reststop.randomhouse.com/resources/tit...,"CRYSTAL, DAVID","{'authorId': [{'@contributortype': '1', '$': '...",David Crystal is one of the most authoritative...,David Crystal,,,David,Ben,Crystal,...,,,,,,,,,,
3,https://reststop.randomhouse.com/resources/tit...,"CLARK, SANDRA","{'authorId': {'@contributortype': 'E', '$': '2...",The author of many books on Shakespeare and En...,Sandra Clark,,,Sandra,,Clark,...,,,,,,,,,,
4,https://reststop.randomhouse.com/resources/tit...,"BRADLEY, A. C.","{'authorId': [{'@contributortype': 'U', '$': '...",<b>A. C. Bradley</b> was born in Cheltenham in...,A. C. Bradley,,,A. C.,John,Bradley,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
496,https://reststop.randomhouse.com/resources/tit...,"DOESCHER, IAN","{'authorId': [{'@contributortype': 'A', '$': '...",<b>Ian Doescher&#160;</b>is the <i>New York Ti...,Ian Doescher,,,Ian,Kent,Doescher,...,,,,,,,,,,
497,https://reststop.randomhouse.com/resources/tit...,"DOESCHER, IAN","{'authorId': [{'@contributortype': 'A', '$': '...",<b>Ian Doescher&#160;</b>is the <i>New York Ti...,Ian Doescher,,,Ian,Kent,Doescher,...,,,,,,,,,,
498,https://reststop.randomhouse.com/resources/tit...,"DOESCHER, IAN","{'authorId': {'@contributortype': 'A', '$': '1...",<b>Ian Doescher</b> is the <i>New York Times</...,Ian Doescher,,,Ian,,Doescher,...,,,,,,,,,,
499,https://reststop.randomhouse.com/resources/tit...,"DOESCHER, IAN","{'authorId': {'@contributortype': 'A', '$': '1...",<b>Ian Doescher</b> is the <i>New York Times</...,Ian Doescher,,,Ian,,Doescher,...,,,,,,,,,,


In [5]:
# Dataframe's column names
df.columns

Index(['@uri', 'author', 'authors', 'authorbio', 'authorweb', 'awards',
       'characters', 'contributorfirst1', 'contributorfirst2',
       'contributorlast1', 'contributorlast2', 'division', 'flapcopy',
       'formatcode', 'formatname', 'imprint', 'isbn', 'isbn10',
       'isbn10hyphenated', 'isbn13hyphenated', 'keyword', 'onsaledate',
       'pages', 'pricecanada', 'priceusa', 'relatedisbns', 'salestatus',
       'subjectcategory1', 'subjectcategory2', 'subjectcategory3',
       'subjectcategorydescription1', 'subjectcategorydescription2',
       'subjectcategorydescription3', 'subtitle', 'tgpdf', 'themes',
       'titleauthisbn', 'titleshort', 'titlesubtitleauthisbn', 'titleweb',
       'updatedOn', 'webdomains', 'links', 'workid', 'contributorfirst3',
       'contributorlast3', 'jacketquotes', 'excerpt', 'tableofcontents',
       'agerange', 'agerangecode', 'acmartflap', 'subformat', 'rgabout',
       'rgauthbio', 'rgcopy', 'rgdiscussion', 'bonusfeature', 'authqanda',
       'au

We can see that the output dataframe is not very clean: for example, some columns hold dictionaries as their elements, and some text is HTML-formatted, which makes it hard to read. We therefore clean the dataset before saving it.
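
One way to find such columns systematically, instead of spotting them by eye, is to scan the raw records. The sketch below works on a list of dictionaries shaped like the API output; the two records are made up for illustration, and `nested_fields` and `html_fields` are hypothetical helper names, not part of the notebook.

```python
records = [  # made-up records that only mimic the shape of the API output
    {"titleweb": "Example Title", "authorbio": "<b>Jane Doe</b> writes.", "links": None},
    {"titleweb": "Another Title", "authorbio": None,
     "links": {"link": {"linktext": "Example link", "url": "https://example.com"}}},
]

def nested_fields(rows):
    """Return the sorted names of fields whose value is a dict or list in any row."""
    return sorted({key for row in rows for key, value in row.items()
                   if isinstance(value, (dict, list))})

def html_fields(rows):
    """Return the sorted names of string fields that appear to contain an HTML tag."""
    return sorted({key for row in rows for key, value in row.items()
                   if isinstance(value, str) and "<" in value and ">" in value})

print(nested_fields(records))  # fields that need unpacking
print(html_fields(records))    # fields that need HTML conversion
```

The tag check is deliberately crude (any `<` and `>` in the same string), so it may flag plain text that merely contains angle brackets.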

In [5]:
# Create a function which converts HTML text to plain readable text, and apply it to the HTML-formatted columns

import html2text

def convert_from_html_to_text(html_text):
    # Missing values (NaN or None) are not strings and are returned unchanged
    if isinstance(html_text, str):
        html_text = html_text.replace("<b>", "")
        html_text = html_text.replace("</b>", "")
        # <i> and </i> mark text in italics, which in this case identifies book titles and such,
        # so we put those words in quotation marks instead
        html_text = html_text.replace("<i>", "'")
        html_text = html_text.replace("</i>", "'")
        html_text = html2text.html2text(html_text)
    return html_text

df['authorbio_text'] = df['authorbio'].apply(convert_from_html_to_text)
df['flapcopy_text'] = df['flapcopy'].apply(convert_from_html_to_text)
df['jacketquotes_text'] = df['jacketquotes'].apply(convert_from_html_to_text)
df['excerpt_text'] = df['excerpt'].apply(convert_from_html_to_text)
df['tableofcontents_text'] = df['tableofcontents'].apply(convert_from_html_to_text)
df['acmartflap_text'] = df['acmartflap'].apply(convert_from_html_to_text)

In [6]:
# Example of what HTML text looks like after being converted to plain text

print(df['authorbio'][0])
print('\n')
print(df['authorbio_text'][0])

<b>Stefan Rudnicki</b>&#160;is an avid audiobook narrator, receiving numerous Earphones Awards from&#160;<i>AudioFile</i>&#160;magazine. He is also a Grammy-winning audiobook producer.


Stefan Rudnicki is an avid audiobook narrator, receiving numerous Earphones
Awards from 'AudioFile' magazine. He is also a Grammy-winning audiobook
producer.
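
For comparison, a rough approximation of the same cleanup can be written with the standard library only. This is a swapped-in technique, not what the notebook uses: it relies on `re` and `html.unescape` instead of `html2text`, and `strip_html` is a hypothetical name. Unlike the output above, it does not wrap long lines.

```python
import html
import re

def strip_html(text):
    """Rough stdlib-only cleanup: italics to quotes, drop other tags, decode entities."""
    if not isinstance(text, str):
        return text  # leave missing values (None or NaN) untouched
    text = re.sub(r"</?i>", "'", text)   # italics mark titles here, so quote them
    text = re.sub(r"<[^>]+>", "", text)  # drop any remaining tags
    text = html.unescape(text)           # e.g. &#160; becomes a non-breaking space
    return text.replace("\xa0", " ").strip()

sample = ("<b>Stefan Rudnicki</b>&#160;is an avid audiobook narrator, receiving "
          "numerous Earphones Awards from&#160;<i>AudioFile</i>&#160;magazine.")
print(strip_html(sample))
# → Stefan Rudnicki is an avid audiobook narrator, receiving numerous Earphones Awards from 'AudioFile' magazine.
```

A regular expression is only adequate for simple inline markup like this; for arbitrary HTML a real parser is the safer choice.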




In [7]:
# Extract link URLs and link descriptions from the 'links' column

def extract_link_field(links, field):
    # Rows without links hold a missing value (None or NaN) and yield None
    if not isinstance(links, dict):
        return None
    # A single link is stored as one dictionary, several links as a list of dictionaries
    if isinstance(links['link'], dict):
        return links['link'][field]
    return [link[field] for link in links['link']]

def extract_linktext(links):
    return extract_link_field(links, 'linktext')

def extract_linkurl(links):
    return extract_link_field(links, 'url')

df['links_url'] = df['links'].apply(extract_linkurl)
df['links_description'] = df['links'].apply(extract_linktext)
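
Note that the functions above return a plain string for a book with one link and a list for a book with several. If a uniform type is easier to work with downstream, one possible alternative is to wrap the single-dictionary case in a list first. The sketch below is such a variant, not the notebook's method; `as_list` and `link_urls` are hypothetical names and the sample values are made up.

```python
def as_list(value):
    """Wrap a single dict in a list, pass lists through, map anything else to []."""
    if isinstance(value, dict):
        return [value]
    if isinstance(value, list):
        return value
    return []

def link_urls(links):
    """Always return a list of URLs, whether 'link' holds one dict or several."""
    if not isinstance(links, dict):
        return []
    return [item["url"] for item in as_list(links.get("link"))]

one = {"link": {"linktext": "Example", "url": "https://example.com/a"}}
many = {"link": [{"linktext": "A", "url": "https://example.com/a"},
                 {"linktext": "B", "url": "https://example.com/b"}]}

print(link_urls(one))   # a one-element list
print(link_urls(many))  # a two-element list
print(link_urls(None))  # an empty list for rows without links
```

The trade-off is that rows without links become empty lists instead of missing values, which changes how they show up in the saved file.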

In [8]:
# Extract award name, year, ISBN and award level from the column 'awards'

def extract_award_field(awards, field):
    # Rows without awards hold a missing value (None or NaN) and yield None
    if not isinstance(awards, dict):
        return None
    # A single award is stored as one dictionary, several awards as a list of dictionaries
    if isinstance(awards['award'], dict):
        return awards['award'][field]
    return [award[field] for award in awards['award']]

def extract_award_name(awards):
    return extract_award_field(awards, 'awarddesc')

def extract_award_year(awards):
    return extract_award_field(awards, 'awardyear')

def extract_award_isbn(awards):
    return extract_award_field(awards, 'awardisbn')

def extract_award_level(awards):
    return extract_award_field(awards, 'awardlevel')

    
df['award_name'] = df['awards'].apply(extract_award_name)
df['award_year'] = df['awards'].apply(extract_award_year)
df['award_isbn'] = df['awards'].apply(extract_award_isbn)
df['award_level'] = df['awards'].apply(extract_award_level)

In [9]:
# Put all relevant columns into a new dataframe

df_clean = df[['workid', 'authorweb', 'titleweb', 'subtitle', 'pages', 'formatcode', 'formatname', 'division', 'imprint', 
    'salestatus', 'onsaledate', 'updatedOn', 'pricecanada', 'priceusa', 'isbn', 'isbn10', 'isbn10hyphenated', 
    'isbn13hyphenated', 'themes', 'agerange', 'subjectcategory1', 'subjectcategorydescription1', 
    'subjectcategory2', 'subjectcategorydescription2', 'subjectcategory3', 'subjectcategorydescription3',
    'authorbio_text', 'tableofcontents_text', 'excerpt_text', 'flapcopy_text', 'acmartflap_text', 'jacketquotes_text',
    'links_url', 'links_description', 'award_name', 'award_year', 'award_isbn', 'award_level'
    ]]

In [11]:
df_clean.sample(10)

Unnamed: 0,workid,authorweb,titleweb,subtitle,pages,formatcode,formatname,division,imprint,salestatus,...,excerpt_text,flapcopy_text,acmartflap_text,jacketquotes_text,links_url,links_description,award_name,award_year,award_isbn,award_level
429,204501,William Shakespeare,The Taming of the Shrew,,224,EL,eBook,Random House Group,Modern Library,EL,...,"Introduction \n \nTHE ""TAMING"" AND THE ""SHRE...","A robust and bawdy battle of the sexes, this e...",,,,,,,,
361,324152,William Shakespeare,The Merry Wives of Windsor,,160,EL,eBook,Penguin Adult HC/TR,Penguin Classics,EL,...,,The acclaimed Pelican Shakespeare series edite...,,,,,,,,
174,299479,William Shakespeare,Pericles/Cymbeline/The Two Noble Kinsmen,,736,MM,Paperback,Berkley / NAL,Signet,IP,...,,The plays collected here follow: the journeys ...,,,,,,,,
319,213819,William Shakespeare,The Comedy of Errors,,176,TR,Trade Paperback,Random House Group,Modern Library,IP,...,Act 1 Scene 1 running scene 1 \n \nEnter Duk...,"""I see two husbands, or mine eyes deceive me.""...",,Praise for 'William Shakespeare: Complete Work...,,,,,,
69,317729,"E. Foley, B. Coates",Shakespeare Basics for Grown-Ups,Everything You Need to Know About the Bard,336,TR,Trade Paperback,Penguin Adult HC/TR,Plume,IP,...,INTRODUCTION\n\nWilliam Shakespeare is without...,"An essential guide to Shakespeare, from the in...",,"""An obvious candidate to take to a desert isla...",,,,,,
393,555885,William Shakespeare,Antony and Cleopatra,,192,EL,eBook,Penguin Adult HC/TR,Penguin Classics,EL,...,,The acclaimed Pelican Shakespeare series edite...,,"""Gorgeous new Shakespeare paperbacks."" \n--Ma...",,,,,,
413,205696,William Shakespeare,Richard III,,272,EL,eBook,Random House Group,Modern Library,EL,...,"Chapter 1 \n \nlist of parts \n \nRICHARD,...",An exciting new edition of the complete works ...,,,,,,,,
180,303494,Lilian Jackson Braun,The Cat Who Knew Shakespeare,,256,MM,Paperback,Berkley / NAL,Berkley,IP,...,,In this mystery in the' 'bestselling Cat Who s...,,Praise for Lilian Jackson Braun and the Cat Wh...,,,,,,
104,9278,John Barton Foreword by Trevor Nunn,Playing Shakespeare,An Actor's Guide,288,EL,eBook,Knopf,Anchor,EL,...,part one \n \nObjective Things \n \nchapte...,Playing Shakespeare' 'is the premier guide to ...,,"""One of the sanest, wisest, and most practical...",,,,,,
244,610951,Emma Smith,This Is Shakespeare,,0,DN,Unabridged Audiobook Download,Audio,Random House Audio,EL,...,INTRODUCTION \n \nWhy should you read a book...,An electrifying new study that investigates th...,,"'Advance Praise from the U.K.:' \n \n""I admi...",https://soundcloud.com/penguin-audio/this-is-s...,Link to Audio Clip,,,,


In [23]:
# Save the dataframe to a CSV file

df_clean.to_csv('penguin_books_scraped.csv', encoding='utf8', index=False)

In [26]:
# Read the CSV file back to check that it was saved correctly

df_clean_read = pd.read_csv('penguin_books_scraped.csv', encoding='utf8')
df_clean_read.sample(10)

Unnamed: 0,workid,authorweb,titleweb,subtitle,pages,formatcode,formatname,division,imprint,salestatus,...,excerpt_text,flapcopy_text,acmartflap_text,jacketquotes_text,links_url,links_description,award_name,award_year,award_isbn,award_level
394,555886,William Shakespeare,Measure for Measure,,160,EL,eBook,Penguin Adult HC/TR,Penguin Classics,EL,...,,The acclaimed Pelican Shakespeare series edite...,,"""Gorgeous new Shakespeare paperbacks."" \n--Ma...",,,,,,
491,597934,Ian Doescher,William Shakespeare's Get Thee Back to the Fut...,,176,EL,eBook,Quirk Books,Quirk Books,EL,...,,Celebrate 'Back to the Future' with this illus...,,"""A weird and wonderful Shakespearean play.""--'...",,,,,,
58,555886,William Shakespeare,Measure for Measure,,160,TR,Trade Paperback,Penguin Adult HC/TR,Penguin Classics,IP,...,,The acclaimed Pelican Shakespeare series edite...,,"""Gorgeous new Shakespeare paperbacks."" \n--Ma...",,,,,,
170,294240,William Shakespeare,Titus Andronicus and Timon of Athens,,464,MM,Paperback,Berkley / NAL,Signet,IP,...,,As part of the Signet Classics Shakespeare Ser...,,,,,,,,
155,326574,William Shakespeare,"Henry IV, Part I",,336,MM,Paperback,Berkley / NAL,Signet,IP,...,,"This edition of Shakespeare's Henry IV, Part 1...",,,,,,,,
87,164696,William Shakespeare,King John and Henry VIII,,592,EL,eBook,Bantam Dell,Bantam Classics,EL,...,'Introduction \n' \n \nThe Life and Death o...,These two history plays--one written in the ea...,,,,,,,,
479,622223,Elizabeth J. Duncan,Ill Met By Murder,A Shakespeare in the Catskills Mystery,288,EL,eBook,Crooked Lane Books,Crooked Lane Books,EL,...,,It's the most important night of the year for ...,,"Praise for 'Ill Met by Murder': \n \n""Duncan...",,,,,,
209,164679,William Shakespeare Edited by David Bevingto...,Hamlet,,384,MM,Paperback,Bantam Dell,Bantam Classics,IP,...,"Dramatis Personae\n\n*\n\nghost of Hamlet, the...","One of the greatest plays of all time, the com...","One of the greatest plays of all time, the com...",Praise for 'William Shakespeare: Complete Work...,http://www.randomhouse.com/bantamdell/shakespe...,Click here for the official Bantam Dell Shakes...,,,,
297,205693,William Shakespeare,King Lear,,272,TR,Trade Paperback,Random House Group,Modern Library,IP,...,'Chapter One \n \n \nAct 1 Scene 1 running ...,King Lear is Shakespeare's bleakest and profou...,King Lear is Shakespeare's bleakest and profou...,,,,,,,
260,163697,Francine Segan,Shakespeare's Kitchen,Renaissance Recipes for the Contemporary Cook:...,288,EL,eBook,Random House Group,Random House,EL,...,"Some pigeons, Davy, a couple of \nshort-legge...","""Shakespeare's Kitchen not only reveals, somet...",,"' \n'""Shakespeare's Kitchen treats four-hundr...",,,,,,
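
One thing the round trip above can hide: CSV stores every cell as text, so a list-valued cell such as a multi-link `links_url` entry is written as its string representation and read back as a string. The sketch below shows this with the standard-library `csv` module and a made-up row instead of pandas; `to_csv` and `read_csv` are expected to behave the same way for such columns, but that is an assumption worth checking on the real file.

```python
import ast
import csv
import io

# Made-up row with a list-valued cell, mimicking a book with two links
row = {"workid": "1", "links_url": ["https://example.com/a", "https://example.com/b"]}

# Write the row to an in-memory CSV file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: the list has become a string that merely looks like a list
buffer.seek(0)
restored = next(csv.DictReader(buffer))
as_text = restored["links_url"]

# ast.literal_eval safely parses the string back into a real list
parsed = ast.literal_eval(as_text)
print(type(as_text).__name__, type(parsed).__name__)
# → str list
```

If the lists need to survive intact, a format with native nesting such as JSON may be a better fit than CSV.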
