## Popular Fiction In The New Release

Objective:
   
   Find the top fiction books each month in the newly released section, based on the average rating given by readers. At the end of each month, I want to read the newly released fiction book with the highest average rating. This notebook automates the search for the highest-rated books on Goodreads and then emails me the links to order them from Book Depository.

Websites used: www.goodreads.com, www.bookdepository.com

How to solve the problem using Python
1. Get the data (book title, author, average rating, etc.) from the www.goodreads.com website and save it in a CSV file for analysis. This is done by sending HTTP requests with Python's requests library and pulling the data out of the HTML with the BeautifulSoup library.


2. Clean the data extracted from the Goodreads website. Split the data into separate columns where required, and check for null and NaN values.


3. Analyse the data and identify the top n books based on average rating. For the top n books, show the number of ratings and reviews in a chart built with the plotting library plotly.


4. Generate an email that includes the corresponding Book Depository link (www.bookdepository.com) for each of the top n books, for ordering.


Individual Project by Elango Sindhu Priya


In [None]:
#import the libraries

In [26]:
import requests
from bs4 import BeautifulSoup
import datetime as dt
from tqdm import tqdm as pb
import pandas as pd

import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

import smtplib
import os
from email.mime.text import MIMEText
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart

In [27]:
init_notebook_mode(connected=True)

## 1. Data Extraction

In [28]:
# func - send the HTTP request and use BeautifulSoup to parse the response
def get_soup(url, headers=None):
    resp = requests.get(url, headers=headers)
    data_html = resp.text
    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(data_html, "html.parser")
    return soup

In [4]:
# get the required data and save it into csv file
url="https://www.goodreads.com/genres/new_releases/fiction"

last_month = (pd.Period(dt.datetime.now(), 'M') - 1).strftime('%b')

rows= []
soup = get_soup(url)   
for tag_a in pb(soup.find_all("a")):
    book_link = str(tag_a.get("href"))
    if book_link.startswith("/book/show"):
        fullurl = "https://www.goodreads.com" + book_link
        # use a separate name so the listing-page soup is not overwritten
        book_soup = get_soup(fullurl)
        BookTitle = (book_soup.find("h1", {"id": "bookTitle"})).text.strip()
        AuthorName = (book_soup.find("span", {"itemprop": "name"})).text
        Avg_Rating = (book_soup.find("span", {"itemprop": "ratingValue"})).text
        Rating_Count = (book_soup.find("meta", {"itemprop": "ratingCount"})).get("content")
        Review_count = (book_soup.find("meta", {"itemprop": "reviewCount"})).get("content")
        desc = (book_soup.find("div", {"id": "details"}))
        # reset for every book so a missing description is not filled from the previous book
        desc_str = None
        if desc and len(desc.find_all("div", {"class": "row"})) == 2:
            desc_str= desc.find_all("div", {"class": "row"})[1].text.strip()
        row = (BookTitle,AuthorName,Avg_Rating,Rating_Count,Review_count,desc_str)
        rows.append(row)
df = pd.DataFrame(rows)

columns = ['BookTitle','AuthorName','Avg_Rating','No_of_Rating','No_of_Review','Description']
df.to_csv(f'GoodReads_NewRelease_Info_raw_{last_month}.csv', header=columns)

100%|██████████| 321/321 [04:22<00:00,  1.22it/s]
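
The loop above requests every anchor whose link starts with `/book/show`. The sketch below shows one way to collect those links first and drop repeats before fetching anything. It assumes the listing page may link to the same book more than once, which the source does not confirm, and `book_urls` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.goodreads.com"

def book_urls(hrefs):
    """Return absolute book-page URLs in first-seen order, without repeats."""
    seen = set()
    urls = []
    for href in hrefs:
        # Anchors without an href come through as None; skip them and non-book links
        if not href or not href.startswith("/book/show"):
            continue
        full = urljoin(BASE_URL, href)
        if full not in seen:
            seen.add(full)
            urls.append(full)
    return urls

# Made-up hrefs for illustration only
sample = ["/book/show/1-a", None, "/genres/fiction", "/book/show/1-a", "/book/show/2-b"]
print(book_urls(sample))
```

If the page does repeat links, fetching each book once would shorten the run and avoid duplicate rows in the CSV.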


## 2. Data Cleansing

In [29]:
#open the dataset
File_name = f"GoodReads_NewRelease_Info_raw_{last_month}.csv"
df = pd.read_csv(File_name)
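
The file name relies on `last_month`, which is computed in the extraction cell with `pd.Period`. As a rough sketch, the same label can be derived with the standard library alone, which is handy when this cell is run without re-running the scrape. `previous_month_label` is a hypothetical helper, and `%b` follows the current locale (the default locale gives English abbreviations such as `Oct`).

```python
import datetime as dt

def previous_month_label(today=None):
    """Return the abbreviated name of the month before `today`."""
    today = today or dt.date.today()
    # Step back one day from the first of this month to land in the previous month
    last_day_prev = today.replace(day=1) - dt.timedelta(days=1)
    return last_day_prev.strftime("%b")

print(previous_month_label(dt.date(2019, 11, 15)))
print(previous_month_label(dt.date(2020, 1, 3)))
```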

In [30]:
df
#df.dtypes

Unnamed: 0.1,Unnamed: 0,BookTitle,AuthorName,Avg_Rating,No_of_Rating,No_of_Review,Description
0,0,Imaginary Friend,Stephen Chbosky,3.57,3330,1088,Published\n October 1st 2019\n ...
1,1,The Fountains of Silence,Ruta Sepetys,4.34,3239,914,Published\n October 1st 2019\n ...
2,2,"Olive, Again",Elizabeth Strout,4.37,3895,915,Published\n October 15th 2019\n ...
3,3,Find Me,André Aciman,3.49,1956,452,Published\n October 29th 2019\n ...
4,4,The Library of the Unwritten,A.J. Hackwith,3.91,806,288,Published\n October 1st 2019\n ...
...,...,...,...,...,...,...,...
95,95,Things We Say in the Dark,Kirsty Logan,4.26,122,33,Published\n October 3rd 2019\n ...
96,96,The Weight of a Moment,Michael Bowe,4.60,52,33,Published\n October 1st 2019\n ...
97,97,Bury the Lede,Gaby Dunn,3.47,265,118,Published\n October 8th 2019\n ...
98,98,The Testaments,Margaret Atwood,4.25,58250,7970,Published\n September 10th 2019\n ...


In [31]:
#to remove unnamed column
df.columns = ["tmp_col"] + list(df.columns[1:])
df.drop(columns=["tmp_col"], inplace=True)

In [32]:
df['Description'].values[0]

'Published\n        October 1st 2019\n         by Grand Central Publishing'

In [33]:
#split the Description column into Publication_Month, Publication_Year, Publication_Company using string split()
desc_split = df['Description'].str.split("\n",expand = True) #expand = True returns the pieces as separate columns
df["Publication_Date"]= desc_split[1].str.strip()
df["Publication_Company"]= desc_split[2].str.strip()

pcomp_split = df["Publication_Company"].str.split("by",n=1,expand = True) #n=1 splits only at the first "by"
df["Publication_Company"] = pcomp_split[1].str.strip()

date_split = df["Publication_Date"].str.split(" ",expand = True) 
df['Publication_Month'] =  date_split[0].str.strip()
df['Publication_Year'] =  date_split[2].str.strip()

df.drop(columns =["Description"], inplace = True) # drop the column
df.drop(columns =["Publication_Date"], inplace = True) # drop the column
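
The chained splits above depend on the description always having exactly three lines. As an alternative sketch, a single regular expression can pull the same three fields out of one description string and return `None` when the text does not match. The pattern is an assumption based only on the sample value shown earlier (`'Published\n        October 1st 2019\n         by Grand Central Publishing'`), and `parse_description` is a hypothetical helper.

```python
import re

# Month name, optional day with ordinal suffix, four-digit year, optional "by <publisher>"
DESC_PATTERN = re.compile(
    r"Published\s+(?P<month>[A-Za-z]+)\s+(?:\d{1,2}(?:st|nd|rd|th)\s+)?(?P<year>\d{4})"
    r"(?:\s+by\s+(?P<publisher>.+))?",
    re.DOTALL,
)

def parse_description(text):
    """Return a dict with month, year and publisher, or None if the text does not match."""
    match = DESC_PATTERN.search(text or "")
    if match is None:
        return None
    parts = match.groupdict()
    if parts["publisher"]:
        parts["publisher"] = parts["publisher"].strip()
    return parts

sample = "Published\n        October 1st 2019\n         by Grand Central Publishing"
print(parse_description(sample))
```

Such a function could be applied to the column with `df['Description'].apply(parse_description)`, though rows that return `None` would still need handling.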

In [34]:
df[['Publication_Month','Publication_Year','Publication_Company']].values[0]

array(['October', '2019', 'Grand Central Publishing'], dtype=object)

In [38]:
df.Publication_Month.unique()

array(['October'], dtype=object)

In [39]:
df.Publication_Year.unique()

array(['2019'], dtype=object)

In [37]:
#Filter the dataset - to include only the data from October 2019
#df.dtypes
df = df[(df.Publication_Month == "October")&(df.Publication_Year == "2019")]

In [40]:
#check for null/Nan values in dataframe
df.isna().sum() #isnull() is an alias of isna()

BookTitle              0
AuthorName             0
Avg_Rating             0
No_of_Rating           0
No_of_Review           0
Publication_Company    0
Publication_Month      0
Publication_Year       0
dtype: int64

In [41]:
df

Unnamed: 0,BookTitle,AuthorName,Avg_Rating,No_of_Rating,No_of_Review,Publication_Company,Publication_Month,Publication_Year
0,Imaginary Friend,Stephen Chbosky,3.57,3330,1088,Grand Central Publishing,October,2019
1,The Fountains of Silence,Ruta Sepetys,4.34,3239,914,Philomel Books,October,2019
2,"Olive, Again",Elizabeth Strout,4.37,3895,915,Random House,October,2019
3,Find Me,André Aciman,3.49,1956,452,"Farrar, Straus and Giroux",October,2019
4,The Library of the Unwritten,A.J. Hackwith,3.91,806,288,Ace Books,October,2019
...,...,...,...,...,...,...,...,...
93,Deeplight,Frances Hardinge,4.37,132,63,Macmillan Children's Books,October,2019
94,Watershed,Mark Barr,4.31,32,9,Hub City Press,October,2019
95,Things We Say in the Dark,Kirsty Logan,4.26,122,33,Harvill Secker,October,2019
96,The Weight of a Moment,Michael Bowe,4.60,52,33,Villa Campanile Press,October,2019


In [42]:
#save the cleansed dataset into a csv file
df.to_csv(f"GoodReads_NewRelease_Info_clean_{last_month}.csv")

## 3. Analyse the Dataset

In [43]:
# to find top books based on the Avg_Rating
Avg_Rating_df = df.sort_values('Avg_Rating',ascending=False).copy()

In [44]:
#display the top 10 books based on Avg_Rating
Avg_Rating_df[['BookTitle','Avg_Rating']].head(10)

Unnamed: 0,BookTitle,Avg_Rating
96,The Weight of a Moment,4.6
79,How Fires End,4.59
43,Shattered Bonds,4.56
67,Amber Hollow,4.55
60,Holding On To Nothing,4.52
89,The Tornado,4.49
61,The Painted Castle,4.47
90,Here Until August: Stories,4.44
34,Tristan Strong Punches a Hole in the Sky,4.42
75,Salvation Lost,4.4
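
To make the ranking step easy to check by hand, the sketch below repeats it on four rows from the dataset using only the standard library. It is an illustration of "sort by average rating, keep the first n", not a replacement for the pandas code, and `top_n_by_rating` is a hypothetical helper.

```python
import heapq

def top_n_by_rating(books, n=10):
    """Return the n books with the highest 'Avg_Rating', best first."""
    return heapq.nlargest(n, books, key=lambda book: book["Avg_Rating"])

# Four rows taken from the cleaned dataset shown above
books = [
    {"BookTitle": "Imaginary Friend", "Avg_Rating": 3.57},
    {"BookTitle": "The Weight of a Moment", "Avg_Rating": 4.60},
    {"BookTitle": "Olive, Again", "Avg_Rating": 4.37},
    {"BookTitle": "The Fountains of Silence", "Avg_Rating": 4.34},
]

for book in top_n_by_rating(books, n=2):
    print(book["BookTitle"], book["Avg_Rating"])
```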


In [45]:
#plot the graph to show the number of ratings and reviews for each of the top n books with the highest Avg_Rating
df1 = (df.nlargest(10,"Avg_Rating"))
objs = [
    go.Bar(x=df1.BookTitle, y=df1.No_of_Rating, name="No of Rating"),
    go.Bar(x=df1.BookTitle, y=df1.No_of_Review, name="No of Review"),
]
iplot(objs)

## 4. Generate an email

In [46]:
#func to send mail  
#include book details(name & author) and links in the email body
def send_mail(data,No_of_links = 0):
    
    #smtp connection to send mail
    smtp_ssl_host = 'smtp.gmail.com'  # smtp.mail.yahoo.com
    smtp_ssl_port = 465
    username = 'pthnproject@gmail.com'
    password = 'XXXXXXXXXXXXXXXXX'
    sender = 'pthnproject@gmail.com'
    targets = ['abc@gmail.com']  # list of recipients; ', '.join() on a plain string would split it into characters
    
    msg = MIMEMultipart()
    
    msg['Subject'] = f"Happy Reading {last_month}"
    msg['From'] = sender
    msg['To'] = ', '.join(targets)
    
    #intro
    intro = f"""<pre>
        Book Depository Links to {last_month} Releases 
        Top {No_of_links} Books with Highest Average Rating in the fiction section 
    
        Click on the links to order</pre>"""
    txt = MIMEText(intro,'html')
    msg.attach(txt)
    
    #body
    item_no =1
    for item in range(0,No_of_links):
        book_details = ((data['BookTitle'].values[item]).strip()) +" by "+ ((data['AuthorName'].values[item]).strip())
        book_link = book_details.replace(" ", "+")
        url = f"https://www.bookdepository.com/search?searchTerm={book_link}"
        Link = url
        
        email_body= f"""<pre> 
        {item_no}.{book_details}
        {Link}</pre>"""
        item_no +=1
        
        txt = MIMEText(email_body,'html')
        msg.attach(txt)
    
    #image
    filepath = "Reading quotes.jpeg"
    with open(filepath, 'rb') as f:
        img = MIMEImage(f.read())

    img.add_header('Content-Disposition',
               'attachment',
               filename=os.path.basename(filepath))
    msg.attach(img)

    server = smtplib.SMTP_SSL(smtp_ssl_host, smtp_ssl_port)
    server.login(username, password)
    server.sendmail(sender, targets, msg.as_string())
    server.quit()
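
The search link in `send_mail` is built by replacing spaces with `+`. Titles and authors in this dataset also contain commas and accented letters ("Olive, Again", "André Aciman"), so a safer sketch is to percent-encode the whole query with `urllib.parse.quote_plus`. `build_search_url` is a hypothetical helper. The `searchTerm` parameter is taken from the notebook, and how the site treats encoded characters is not verified here.

```python
from urllib.parse import quote_plus

SEARCH_URL = "https://www.bookdepository.com/search?searchTerm="

def build_search_url(title, author):
    """Return a search URL for 'title by author' with the query percent-encoded."""
    query = f"{title.strip()} by {author.strip()}"
    # quote_plus turns spaces into '+' and escapes commas, accents and other reserved characters
    return SEARCH_URL + quote_plus(query)

print(build_search_url("Olive, Again", "Elizabeth Strout"))
print(build_search_url("Find Me", "André Aciman"))
```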

In [47]:
#call the func to get the order links for each top n books in the email
send_mail(Avg_Rating_df,No_of_links = 5)
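
Before sending, it can help to build the message on its own and inspect the headers. The sketch below does only that and never opens an SMTP connection. It also shows why the recipients should be a list: `', '.join(...)` on a plain string puts a separator between every character. `build_message` and the second address are hypothetical, for illustration only.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_message(sender, targets, subject, html_body):
    """Build a multipart message; `targets` must be a list of addresses."""
    msg = MIMEMultipart()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(targets)
    msg.attach(MIMEText(html_body, "html"))
    return msg

msg = build_message(
    "pthnproject@gmail.com",
    ["abc@gmail.com", "reader@example.com"],  # second address is a placeholder
    "Happy Reading Oct",
    "<pre>Top books</pre>",
)
print(msg["To"])
print(", ".join("abc"))  # joining a plain string splits it character by character
```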