# Books scraping
<img src = 'goodreads.png'>

### Pick a website and describe your objective

* Browse through different websites and pick one to scrape. Check the 'Project Ideas' section for inspiration.
* Summarize your project idea and outline your strategy in a jupyter notebook

#### Project Outline:

* We're going to scrape https://www.goodreads.com/genres
* For each category, we'll get the new releases
* For every release, we'll grab its book information:
 ```
 Author, Book name, Summary, Rating, Published year
 ```

In [1]:
# Let's write a function to download the genres page
import requests
from bs4 import BeautifulSoup

def get_genres_type():
    # Download the Goodreads genres page
    genres_url = 'https://www.goodreads.com/genres'
    response = requests.get(genres_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(genres_url))

    # Parse using Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [2]:
doc = get_genres_type()

In [3]:
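# Let's create a helper function to parse information from the page
def get_genres_titles(doc):
    selection_class = 'gr-hyperlink'
    genres = doc.find_all('a', {'class': selection_class})
    genres_title = []
    for tag in genres:
        # Skip links with no text; list.remove('') would raise ValueError
        # if no empty string were present and only drops the first one
        if tag.text != '':
            genres_title.append(tag.text)

    # De-duplicate; note that set() does not preserve page order
    return list(set(genres_title))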

In [4]:
# get_genres_titles can be used to get the list of genre titles

In [5]:
titles = get_genres_titles(doc)

In [6]:
len(titles)

42

In [7]:
titles[:5]

['Mystery', 'Fiction', 'Chick Lit', 'Religion', 'Biography']
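
`get_genres_titles` de-duplicates with `list(set(...))`, which is why the titles above come back in no particular order. If page order matters, one option is `dict.fromkeys`, which keeps the first occurrence of each item. This is a minimal sketch; `unique_in_order` is a hypothetical helper name, not part of the notebook.

```python
def unique_in_order(items):
    """Drop empty strings and duplicates while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(item for item in items if item))


sample = ['Mystery', '', 'Fiction', 'Mystery', 'Chick Lit', 'Fiction']
print(unique_in_order(sample))  # ['Mystery', 'Fiction', 'Chick Lit']
```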

In [8]:
# Create URLs for the new releases of each category
def get_genres_new_release_url(doc):
    new_release_url = []
    base_url = 'https://www.goodreads.com/genres/'

    selection_class = 'gr-hyperlink'
    genres = doc.find_all('a', {'class': selection_class})

    for tag in genres:
        # Skip links with no text instead of removing the bare URL afterwards,
        # which would raise ValueError if no such link existed
        if tag.text != '':
            new_release_url.append(base_url + 'new_releases/' + tag.text)

    return list(set(new_release_url))

In [9]:
# save all urls inside url variable
url = get_genres_new_release_url(doc)

In [34]:
# First 10 urls
url[:10]

['https://www.goodreads.com/genres/new_releases/Fantasy',
 'https://www.goodreads.com/genres/new_releases/More Genres',
 'https://www.goodreads.com/genres/new_releases/Sports',
 'https://www.goodreads.com/genres/new_releases/Suspense',
 'https://www.goodreads.com/genres/new_releases/History',
 'https://www.goodreads.com/genres/new_releases/Philosophy',
 'https://www.goodreads.com/genres/new_releases/Horror',
 'https://www.goodreads.com/genres/new_releases/Travel',
 'https://www.goodreads.com/genres/new_releases/Classics',
 'https://www.goodreads.com/genres/new_releases/Fiction']
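
Some genre labels contain spaces (for example 'More Genres' above), so the concatenated strings are not strictly valid URLs. The sketch below percent-encodes the label with the standard library's `urllib.parse.quote`. `build_new_release_url` is a hypothetical helper, and whether Goodreads resolves every encoded label to a real genre page is an assumption that is not verified here.

```python
from urllib.parse import quote

BASE_URL = 'https://www.goodreads.com/genres/new_releases/'


def build_new_release_url(genre_title):
    """Return the new-releases URL for a genre label, percent-encoding unsafe characters."""
    # quote() turns a space into %20; safe='' also encodes any '/' inside the label
    return BASE_URL + quote(genre_title.strip(), safe='')


print(build_new_release_url('Sports'))       # https://www.goodreads.com/genres/new_releases/Sports
print(build_new_release_url('More Genres'))  # https://www.goodreads.com/genres/new_releases/More%20Genres
```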

In [11]:
# Download and parse a genre's new-releases page
def get_genres_page(new_release_url):
    # Download the page
    response = requests.get(new_release_url)
    # check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(new_release_url))
    
    #parse using beautiful soup
    genre_doc = BeautifulSoup(response.text, 'html.parser')
    return genre_doc

In [12]:
# parse and save information inside genre_doc
genre_doc = get_genres_page('https://www.goodreads.com/genres/new_releases/Sports')

In [13]:
# Find the links to all books in the divs associated with that genre
def get_each_book_link(genre_doc):
    # genre_doc is the parsed new-releases page for one genre
    divs = genre_doc.find_all('div', {'class': 'coverWrapper'})
    book_url = []
    for div in divs:
        # The first link inside each cover wrapper points to the book page
        a_tags = div.find_all('a')
        book_url.append('https://www.goodreads.com' + a_tags[0]['href'])
    return book_url

In [14]:
# Save all book links in book_url
book_url = get_each_book_link(genre_doc)

In [15]:
book_url

['https://www.goodreads.com/book/show/122495892-meet-your-match',
 'https://www.goodreads.com/book/show/177138498-on-the-shore',
 'https://www.goodreads.com/book/show/60619760-plays-well-with-others',
 'https://www.goodreads.com/book/show/61851486-mine-to-take',
 'https://www.goodreads.com/book/show/62927890-the-all-american',
 'https://www.goodreads.com/book/show/63017290-the-fire-the-water-and-maudie-mcginn',
 'https://www.goodreads.com/book/show/63249764-lexington',
 'https://www.goodreads.com/book/show/62067439-forever-goals',
 'https://www.goodreads.com/book/show/123249526-game-changer',
 'https://www.goodreads.com/book/show/157354980-power-play',
 'https://www.goodreads.com/book/show/173468176-embracing-fate',
 'https://www.goodreads.com/book/show/160646938-25-blue-lock-25',
 'https://www.goodreads.com/book/show/175067322-onside-play',
 'https://www.goodreads.com/book/show/108518990-reckless',
 'https://www.goodreads.com/book/show/174713026-behind-the-net',
 'https://www.goodread

In [21]:
book_url[0]

'https://www.goodreads.com/book/show/122495892-meet-your-match'

In [22]:
# Download and parse a book page
def get_book_page(book_url):
    # Download the page
    response = requests.get(book_url)
    # check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(book_url))
    
    #parse using beautiful soup
    book_doc = BeautifulSoup(response.text, 'html.parser')
    return book_doc

In [39]:
# parse and save the information of the given book inside book_doc
book_doc = get_book_page('https://www.goodreads.com/book/show/122495892-meet-your-match')

In [45]:
# Extract the book's details from book_doc, the parsed book page.
# Note: tags are picked by position, so the indexes depend on the page layout.
def get_book_info(book_doc):
    big_div = book_doc.find_all('div',{'class': 'BookPage__mainContent'})
    a_tags = big_div[0].find_all('a')
    div_tags = big_div[0].find_all('div')
    span_tags = big_div[0].find_all('span')
    h1_tags = big_div[0].find_all('h1')
    p_tags = big_div[0].find_all('p')
    published_year = p_tags[1].text
    book_title = h1_tags[0].text
    authorName = span_tags[1].text
    Rating = div_tags[8].text
    Summary = span_tags[20].text
    return authorName, book_title, Summary, Rating, published_year

In [46]:
# meet-your-match book information
get_book_info(book_doc)

('Kandi Steiner',
 'Meet Your Match',
 'One Month with Vince Tanev: Tampa’s Hotshot RookieTwenty-four-seven access on and off the ice.The headline says it all, and my bosses are over the moon when the opportunity of a lifetime lands in my lap. Of course, they aren’t aware that I’ve already met Vince Cool at an all-star gala — and that we were at each other’s throats the entire time.It doesn’t matter that he’s the kind of hot that shows God has favorites — messy brown hair, heated hazel eyes, the smirk of a rockstar, and a scar over his eyebrow that makes every woman particularly feral.He’s a rich, cocky playboy — a brand I’m all too familiar with, and one I’m determined to never be around again.But after my coverage of the gala stirs up buzz, the team’s General Manager and my CEO strike a deal. To help fill the arena at home games, I’ll get up close and personal with Tampa’s new shiny toy. Whether he’s at practice, playing in a game, partying, or drinking coffee half-naked in his condo

In [31]:
# book --> fate-of-a-royal, genre --> sports
book_doc2 = get_book_page('https://www.goodreads.com/book/show/185358293-fate-of-a-royal')

get_book_info(book_doc2)

('Meagan Brandy, Amo Jones',
 'Fate of a Royal',
 '',
 '4.17',
 'First published July 7, 2023')
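
The outline asks for the published year, but `get_book_info` returns the whole line, such as 'First published July 7, 2023'. A minimal sketch of pulling out the year with a regular expression follows. `extract_year` is a hypothetical helper, and it assumes the year appears as a standalone four-digit number.

```python
import re


def extract_year(published_text):
    """Return the last standalone four-digit number in the text as an int, or None."""
    # \b keeps us from matching four digits inside a longer number
    matches = re.findall(r'\b\d{4}\b', published_text)
    return int(matches[-1]) if matches else None


print(extract_year('First published July 7, 2023'))  # 2023
print(extract_year(''))                              # None
```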

In [27]:
# book --> long-shot, genre --> sports
book_doc2 = get_book_page('https://www.goodreads.com/book/show/172002322-long-shot')

get_book_info(book_doc2)

('M.J. Fields',
 'Long Shot: Lincoln U- Ice Hockey',
 "Ellie Before my junior year at Lincoln University, I promised myself three things. Just three. When Professor Taylor posted our chem midterm p artners and I got stuck with the captain of Lincoln’s Ice Hockey team , I was sure I’d be able to abide by rules 2 and 3. After all, I was immune to Leo’s whole ‘hottest guy on campus’ thing. I grew up with him— my brother’s best friend . That I’ve been obsessed with since I was eight. F' me. Leo Senior year. Captain of the hockey team , already engaged to the NHL with the perfect union months away, my whole life mapped out ahead of me. All my dreams have come true. Except for the one with Eleanor Rhodes. That one’s reoccurring, starring her in nothing but my number and that perfect little blush that shows up every time she looks at me. No biggie. I’ve worked that one out. In the shower. Numerous times. Never had a reason to get any closer than her Instagram until Professor Taylor assigned u

In [30]:
# book --> reckless, genre --> sports
book_doc2 = get_book_page('https://www.goodreads.com/book/show/108518990-reckless')

get_book_info(book_doc2)

('Elsie Silver',
 'Reckless',
 'Theo Silva. Rowdy bull rider. Notorious ladies’ man. Scorching hot trouble wrapped up in a drool-worthy package.And he’s looking at me like I might be his next meal.But I’m almost free of my toxic marriage and have sworn off men entirely. So all I see when I look back is temptation served up with a heaping side of heartbreak.The man is hard to trust—and even harder to resist.Make that impossible. Because Theo is persistent. And no matter how hard I try to freeze him out, he melts my icy exterior and pulls apart all my defenses.Over a drink in a small town bar, I blurt out my deepest, darkest secrets. Then I spend the singular hottest night of my life with him.He worships my body. He makes me blush. I come alive beneath his hands.Then I tell him to forget it ever happened. I want simple, and with him it all feels complicated.It was supposed to be a one-time thing.A secret.But that little plus sign is going to make this secret impossible to keep.',
 '4.45'

In [33]:
# Select the History genre and parse its new-releases page
genre_doc = get_genres_page('https://www.goodreads.com/genres/new_releases/History')
# Get all book links associated with History new releases
book_url = get_each_book_link(genre_doc)
# Select the first book link from the list and parse it
book_doc = get_book_page(book_url[0])
# Get the book information
get_book_info(book_doc)


('B. Dylan Hollis',
 'Baking Yesteryear: The Best Recipes from the 1900s to the 1980s',
 'GenresCookbooksNonfictionCookingFoodHistoryHumorFoodie',
 '4.75',
 'First published July 25, 2023')
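
In the History example above, the Summary slot holds the genre list ('GenresCookbooks...'), which suggests that the fixed positions used in `get_book_info` do not line up on every book page. A position can also be missing entirely, in which case indexing raises `IndexError`. The sketch below guards only against that second case. `text_at` is a hypothetical helper that works on any sequence of objects with a `.text` attribute, such as the lists returned by `find_all`; the stand-in objects are there only to keep the example self-contained.

```python
from types import SimpleNamespace


def text_at(tags, index, default=''):
    """Return the stripped text of tags[index], or default if that position does not exist."""
    try:
        return tags[index].text.strip()
    except IndexError:
        return default


# Stand-in for a list of parsed tags (anything with a .text attribute works)
tags = [SimpleNamespace(text=' Kandi Steiner ')]
print(text_at(tags, 0))         # Kandi Steiner
print(repr(text_at(tags, 20)))  # ''
```

Inside `get_book_info`, `span_tags[20].text` could then become `text_at(span_tags, 20)`. Picking the right element on every layout would still require selecting by class or attribute rather than by position.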

In [36]:
# Select the Horror genre and parse its new-releases page
genre_doc = get_genres_page('https://www.goodreads.com/genres/new_releases/Horror')
# Get all book links associated with Horror new releases
book_url = get_each_book_link(genre_doc)
# Select the first book link from the list and parse it
book_doc = get_book_page(book_url[0])
# Get the book information
get_book_info(book_doc)

('Karen M. McManus',
 'One of Us Is Back',
 "From international bestseller, Karen McManus, comes the explosive third and final thrilling instalment in the acclaimed One of Us... series. Ever since Simon died in detention, life hasn't been easy for the Bayview Crew. First the Bayview Four had to prove they weren't killers. Then a new generation had to outwit a vengeful copycat. Now, it's beginning again. At first the mysterious billboard seems like a bad joke: Time for a new game, Bayview. But when a member of the crew disappears, it's clear this 'game' just got serious - and no one understands the rules.Everyone's a target. And now that someone unexpected has returned to Bayview, things are starting to get deadly. Simon was right about secrets - they all come out in the end.  The thing is, Simon was right about secrets-they all come out, eventually. And Bayview has a lot it's still hiding.",
 '4.16',
 'First published July 25, 2023')

### Let's see the first 10 books from the Horror genre


In [51]:
# Select the Horror genre and parse its new-releases page
genre_doc = get_genres_page('https://www.goodreads.com/genres/new_releases/Horror')
# Get all book links associated with Horror new releases
book_url = get_each_book_link(genre_doc)

# Print the first 10 book links
print(book_url[:10])

['https://www.goodreads.com/book/show/57932307-one-of-us-is-back', 'https://www.goodreads.com/book/show/61796642-the-militia-house', 'https://www.goodreads.com/book/show/63066025-doomsday-match', 'https://www.goodreads.com/book/show/61884783-boys-in-the-valley', 'https://www.goodreads.com/book/show/62858184-burn-the-negative', 'https://www.goodreads.com/book/show/62919378-at-the-end-of-every-day', 'https://www.goodreads.com/book/show/62049709-the-possibilities', 'https://www.goodreads.com/book/show/61317666-a-guide-to-the-dark', 'https://www.goodreads.com/book/show/62919225-infested', 'https://www.goodreads.com/book/show/62997448-magdalena']


In [52]:
# Select one book link from the list ('a-guide-to-the-dark') and parse it
book_doc = get_book_page('https://www.goodreads.com/book/show/61317666-a-guide-to-the-dark')
# get the book information
get_book_info(book_doc)

('Meriam Metoui',
 'A Guide to the Dark',
 "You can check out of Room 9, but you can never leave.The Haunting of Hill House meets Nina LaCour in this paranormal mystery YA about the ghosts we carry with us.Something is building, simmering just out of reach.The room is watching. But Mira and Layla don't know this yet. When the two best friends are stranded on their spring break college tour road trip, they find themselves at the Wildwood Motel, located in the middle of nowhere, Indiana. Mira can't shake the feeling that there is something wrong and rotten about their room. Inside, she's haunted by nightmares of her dead brother. When she wakes up, he's still there.Layla doesn't see him. Or notice anything suspicious about Room 9. The place may be a little run down, but it has a certain charm she can’t wait to capture on camera. If Layla is being honest, she’s too preoccupied with confusing feelings for Mira to see much else. But when they learn eight people died in that same room, they 

## References and Future Work

Summary of what we did

- First, we scraped all genres from https://www.goodreads.com/genres
- Then we appended each genre to the new-releases URL: https://www.goodreads.com/genres/new_releases/
- Parsed the new-releases page of a particular genre
- Collected all book links associated with that genre
- Parsed a particular book page
- Extracted the following information:

    ```
    AuthorName, Book Title, Summary, Rating, Published year
    ```
    


References to links you found useful

- Jovian, where I learned the basics of web scraping: https://jovian.com/aakashns/python-web-scraping-project-guide


 
Ideas for future work

- Automate the scraping for every category
- Save all book info from a particular category in CSV format
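
Both ideas can be sketched together. The code below is an outline under stated assumptions rather than a tested scraper: `scrape_books` and `save_books_to_csv` are hypothetical names, `fetch_info` stands for any callable that maps a book URL to the `(author, title, summary, rating, published)` tuple, and the one-second delay is an arbitrary pause between requests.

```python
import csv
import time

# Column names taken from the project outline
FIELDS = ['Author', 'Book name', 'Summary', 'Rating', 'Published year']


def scrape_books(book_urls, fetch_info, delay=1.0):
    """Call fetch_info on every URL, skip pages that fail, and return the collected rows."""
    rows = []
    for url in book_urls:
        try:
            rows.append(fetch_info(url))
        except Exception as error:  # e.g. a failed download or an unexpected layout
            print('Skipping {}: {}'.format(url, error))
        time.sleep(delay)  # pause between requests
    return rows


def save_books_to_csv(rows, path):
    """Write a header line and one line per book to a CSV file."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(rows)
```

With the notebook's functions, this might be called as `save_books_to_csv(scrape_books(book_url, lambda u: get_book_info(get_book_page(u))), 'sports.csv')`, where `'sports.csv'` is an example file name.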