## Popular Fiction In The New Release

Objective:
   
   Find the top fiction books each month from the Goodreads new-releases section, ranked by the average rating given by readers. At the end of each month, I want to read the newly released fiction book with the highest average rating. The goal is to automate the search for the highest-rated books on Goodreads and then receive, by email, links to order them from Book Depository.

Websites used: www.goodreads.com, www.bookdepository.com

How to solve the problem using Python
1. Get the data (book title, author, average rating, etc.) from www.goodreads.com and save it in a CSV file for analysis. This is done by sending HTTP requests with Python's requests library and pulling data out of the HTML with the BeautifulSoup library.


2. Clean the data extracted from Goodreads. Split the data into separate columns where required, and check for null and NaN values.


3. Analyse the data and identify the top n books based on average rating. For the top n books, show the number of ratings and reviews visually using the plotly plotting library.


4. Generate an email that includes the corresponding Book Depository link (www.bookdepository.com) for each of the top n books, for ordering.
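
Steps 3 and 4 can be sketched with the standard library alone, which makes the ranking and link-building logic easy to check in isolation. This is a minimal sketch, not the notebook's implementation: the sample records and the helper name `top_n_links` are made up for illustration, and only the search URL pattern is taken from the notebook.

```python
from urllib.parse import quote_plus

# Hypothetical sample records; the notebook scrapes the real ones from Goodreads.
books = [
    {"title": "Book A", "author": "Author One", "avg_rating": 4.12},
    {"title": "Book B", "author": "Author Two", "avg_rating": 3.96},
    {"title": "Book C", "author": "Author Three", "avg_rating": 4.03},
]

def top_n_links(records, n):
    """Return (label, search URL) pairs for the n highest-rated records."""
    ranked = sorted(records, key=lambda r: r["avg_rating"], reverse=True)
    links = []
    for rec in ranked[:n]:
        label = f"{rec['title']} by {rec['author']}"
        # quote_plus turns spaces into "+" and escapes other special characters
        url = "https://www.bookdepository.com/search?searchTerm=" + quote_plus(label)
        links.append((label, url))
    return links

for label, url in top_n_links(books, 2):
    print(label, "->", url)
```

Slicing with `ranked[:n]` also keeps the sketch safe when n is larger than the number of books available.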


Individual Project by Elango Sindhu Priya


In [None]:
#import the libraries

In [293]:
import requests
from bs4 import BeautifulSoup
import datetime as dt
from tqdm import tqdm as pb
import pandas as pd
from IPython.display import display
import seaborn as sns 

import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

import smtplib
import os
from email.mime.text import MIMEText
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart

In [2]:
init_notebook_mode(connected=True)

In [None]:
#to display all the rows (None removes the row limit; no DataFrame named df exists yet at this point)
pd.set_option('display.max_rows', None)
#pd.options.display.max_columns = None

## Define Functions

In [424]:
last_month = (pd.Period(dt.datetime.now(), 'M') - 1).strftime('%B')
url="https://www.goodreads.com/genres/new_releases/fiction"

# func - send the HTTP request and use BeautifulSoup to pull out the data from response
def get_soup(url, headers=None):
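    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    data_html = resp.text
    soup = BeautifulSoup(data_html, "html.parser")  # explicit parser avoids the bs4 "no parser specified" warning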
    return soup

#to get the last month's year
def year():
    last_year = (pd.Period(dt.datetime.now(),'Y') - 1).strftime('%Y')
    current_year = pd.Period(dt.datetime.now(),'Y').strftime('%Y')
    if last_month=='December':
        return last_year
    else:
        return current_year
    
#data extraction through request library
def data_extraction():
    #url="https://www.goodreads.com/genres/new_releases/fiction"
    rows= []
    soup = get_soup(url)   
    for tag_a in pb(soup.find_all("a")):
        book_link = str(tag_a.get("href"))
        if book_link.startswith("/book/show"):
            fullurl = "https://www.goodreads.com" + book_link
            soup = get_soup(fullurl)
            BookTitle = (soup.find("h1", {"id": "bookTitle"})).text.strip()
            AuthorName = (soup.find("span", {"itemprop": "name"})).text
            Avg_Rating = (soup.find("span", {"itemprop": "ratingValue"})).text
            Rating_Count = (soup.find("meta", {"itemprop": "ratingCount"})).get("content")
            Review_count = (soup.find("meta", {"itemprop": "reviewCount"})).get("content")
            desc = (soup.find("div", {"id": "details"}))
            # default to None so a page without the expected details block
            # neither raises NameError nor reuses the previous book's values
            desc_str, Page_Count = None, None
            if desc:
                detail_rows = desc.find_all("div", {"class": "row"})
                if len(detail_rows) == 2:
                    Page_Count = detail_rows[0].text.strip()
                    desc_str = detail_rows[1].text.strip()
            row = (BookTitle,AuthorName,Avg_Rating,Rating_Count,Review_count,desc_str,Page_Count)
            rows.append(row)
    df = pd.DataFrame(rows)
    columns = ['BookTitle','AuthorName','Avg_Rating','No_of_Rating','No_of_Review','Description','Page_Count']
    df.to_csv(f'GoodReads_NewRelease_Info_raw_{last_month}.csv',index = False, header=columns)
    print("-----Data extraction completed-----")
    print("Extracted data is stored in csv file")
    
def data_cleaning():
    #open the dataset
    File_name = f"GoodReads_NewRelease_Info_raw_{last_month}.csv"
    df = pd.read_csv(File_name)
    #split the Description column into Publication_Month,Publication_Year, Publication_Company using string split()
    desc_split = df['Description'].str.split("\n",expand = True) #set expand = True to split into two columns
    df["Publication_Date"]= desc_split[1].str.strip()
    df["Publication_Company"]= desc_split[2].str.strip()

    pcomp_split = df["Publication_Company"].str.split("by",n=1,expand = True) #n=1: split only at the first "by"
    df["Publication_Company"] = pcomp_split[1].str.strip()

    date_split = df["Publication_Date"].str.split(" ",expand = True) 
    df['Publication_Month'] =  date_split[0].str.strip()
    df['Publication_Year'] =  date_split[2].str.strip()

    df.drop(columns =["Description"], inplace = True) # drop the column
    df.drop(columns =["Publication_Date"], inplace = True) # drop the column
    #Filter the dataset - keep only the books published last month (in that month's year)
    df = df[(df.Publication_Month == last_month)&(df.Publication_Year == year())].copy() #copy so later column assignments do not act on a view
    
    df['Page_NO'] = df['Page_Count'].str.extract(r'(\d+)', expand=False) #the first run of digits (\d+) is the page count
    df["Page_NO"] = df["Page_NO"].fillna(0).astype(int) #books without a page count get 0
    desc_split = df['Page_Count'].str.split(",",expand = True) #set expand = True to split into  columns
    df["Type"]= desc_split[0].str.strip()
    df["Type"] = df["Type"].replace(to_replace=r'([0-9]+ )\w+',value='No format',regex=True) #entries such as "320 pages" carry no format
    df.drop(columns =["Page_Count"], inplace = True) # drop the column

    # drop rows where Avg_Rating is NaN so the ranking only sees rated books
    df = df[pd.notnull(df["Avg_Rating"])]
    
    #check for null/Nan values in dataframe
    nan_values = df.isna().sum()
    print("sum of NaN values in each column:\n",nan_values)
    
    print("\nTypes of each column:\n",df.dtypes) # print the column dtype
    
    #save the cleansed dataset into a csv file
    df.to_csv(f"GoodReads_NewRelease_Info_clean_{last_month}.csv",index = False)
    print("\n-----Data cleaning completed-----")
    print("New csv file with cleansed data is created")

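#text colour for Page_NO: green for 1-299 pages (quick read), red for 0 (page count missing), black otherwise
def color_negative_red(value):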
    if value > 0 and value < 300:
        color = 'green'
    elif value ==0:
        color = 'red'
    else:
        color = 'black'
    return 'color: %s' % color

#text colour for Type: blue for Kindle Edition, darkorange for ebook, black otherwise
def color_negative_green(value):
    if value == 'Kindle Edition':
        color = 'blue'
    elif value == 'ebook':
        color = 'darkorange'
    else:
        color = 'black'
    return 'color: %s' % color

def clean_data():
    File_name = f"GoodReads_NewRelease_Info_clean_{last_month}.csv"
    df = pd.read_csv(File_name)
    cm = sns.light_palette("green", as_cmap=True)
    print("\n\n*Page_NO-green - quick read books with less than 300 pages")
    print("*Page_NO-yellow - book with highest no of pages")
    display(df.style
    .applymap(color_negative_red, subset=['Page_NO'])
    .applymap(color_negative_green, subset=['Type'])
    .background_gradient(cmap=cm,subset=['Avg_Rating','No_of_Rating','No_of_Review'])
    .highlight_null(null_color='red')
    .highlight_max(subset=['Page_NO']))
    
def data_analyse():
    File_name = f"GoodReads_NewRelease_Info_clean_{last_month}.csv"
    df = pd.read_csv(File_name)
    # to find top books based on the Avg_Rating
    Avg_Rating_df = df.sort_values('Avg_Rating',ascending=False).copy()
    #top 10 books based on Avg_Rating (inside a function this expression shows nothing unless wrapped in display())
    Avg_Rating_df[['BookTitle','Avg_Rating']].head(10)
    #plot the graph to show the number of ratings and reviews for each of the top n books with the highest Avg_Rating
    df1 = (df.nlargest(10,"Avg_Rating"))
    objs = [
        go.Bar(x=df1.BookTitle, y=df1.No_of_Rating, name="No of Rating",marker=dict(color='rgb(49,130,189)'),text=df1['Avg_Rating']),
        go.Bar(x=df1.BookTitle, y=df1.No_of_Review, name="No of Review",text=df1['Avg_Rating']),
    ]
    layout = go.Layout(
        title='Top 10 Books With Highest Average Rating'
    )
    fig = go.Figure(data=objs, layout=layout)
    iplot(fig)
    #categorize & group_by the books based on the Avg_Rating column
    #plot to show the percentage of books in each category
    def func(x):
        if x < 3.5:
            return "Below 3.5"
        elif x < 4.5:
            return "3.5-4.5"
        else:
            return "4.5-5.0"
    df["Rating_range"] = df.Avg_Rating.apply(func)
    labels = ['Below 3.5','3.5-4.5','4.5-5.0']
    #reindex so a category with no books counts as 0 instead of raising an error
    res = df.groupby("Rating_range")["Avg_Rating"].count().reindex(labels, fill_value=0)
    values = res.tolist()
    layout = go.Layout(title='Percentage of Books under Each Rating Category')
    fig = go.Figure(data=[go.Pie(labels=labels, values=values)],layout=layout)
    #fig.show()
    iplot(fig)
    df2 = df[(df['Page_NO'] > 0) & (df['Page_NO'] <=300)]
    objs = [
            go.Bar(x=df2.BookTitle, y=df2.Page_NO, name="Page count",text=df2['Avg_Rating']),
        ]
    layout = go.Layout(
            title='Quick Read Books with less than 300 Pages'
        )
    fig = go.Figure(data=objs, layout=layout)
    iplot(fig)
    print("-----Data analyse completed-----")
    send_mail(Avg_Rating_df,get_no_of_links())
    print("email send")
#func to send mail  
#include book details(name & author) and links in the email body
def send_mail(data,No_of_links):
    
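    #smtp connection to send mail
    smtp_ssl_host = 'smtp.gmail.com'  # smtp.mail.yahoo.com
    smtp_ssl_port = 465
    username = 'pthnproject@gmail.com'
    #read the app password from the environment instead of hard-coding it;
    #GMAIL_APP_PASSWORD is an assumed variable name, set it before running
    password = os.environ['GMAIL_APP_PASSWORD']
    sender = 'pthnproject@gmail.com'
    targets = ['sindhupriya.e@gmail.com'] #a list, so ', '.join(targets) joins addresses rather than single characters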
    
    msg = MIMEMultipart()
    
    msg['Subject'] = f"Happy Reading {last_month}"
    msg['From'] = sender
    msg['To'] = ', '.join(targets)
    
    #intro
    intro = f"""<pre>
        Book Depository Links to {last_month} Releases 
        Top {No_of_links} Books with Highest Average Rating in the fiction section 
    
        Click on the links to order</pre>"""
    txt = MIMEText(intro,'html')
    msg.attach(txt)
    
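    #body
    from urllib.parse import quote_plus #local import; encodes spaces as "+" and escapes characters such as "&" or ":"
    item_no =1
    for item in range(0,min(No_of_links,len(data))): #never index past the available books
        book_details = ((data['BookTitle'].values[item]).strip()) +" by "+ ((data['AuthorName'].values[item]).strip())
        book_link = quote_plus(book_details)
        url = f"https://www.bookdepository.com/search?searchTerm={book_link}"
        Link = url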
        
        email_body= f"""<pre> 
        {item_no}.{book_details}
        {Link}</pre>"""
        item_no +=1
        
        txt = MIMEText(email_body,'html')
        msg.attach(txt)
        
    print("sending email......")
    
    #image
    filepath = "Reading quotes.jpeg"
    with open(filepath, 'rb') as f:
        img = MIMEImage(f.read())

    img.add_header('Content-Disposition',
               'attachment',
               filename=os.path.basename(filepath))
    msg.attach(img)

    #the context manager closes the connection even if login or sending fails
    with smtplib.SMTP_SSL(smtp_ssl_host, smtp_ssl_port) as server:
        server.login(username, password)
        server.sendmail(sender, targets, msg.as_string())
    
#to get the no of links 
def get_no_of_links():
    while True:
        amount = input("Enter the No of Links to be send in mail: ")
        try:
            val = int(amount)
            if val > 0:
                break
            else:
                print("No of Links can't be negative or Zero, try again")
        except ValueError:
            print("No of Links must be a number, try again")
    return val

def main():
    #data_extraction() #calls get_soup()
    data_cleaning()
    clean_data()
    data_analyse() #calls send_mail() & get_no_of_links()
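
The `last_month` and `year()` logic above reads the current date directly, which makes the December/January rollover hard to check. A minimal sketch of the same idea as a pure function follows; `previous_month` is a hypothetical helper name, not part of the notebook, and it uses only the standard `datetime` module rather than `pd.Period`.

```python
import datetime as dt

def previous_month(today):
    """Return (month name, year as a string) for the month before `today`."""
    first_of_this_month = today.replace(day=1)
    # one day before the first of this month always falls in the previous month
    last_of_previous = first_of_this_month - dt.timedelta(days=1)
    return last_of_previous.strftime("%B"), last_of_previous.strftime("%Y")

print(previous_month(dt.date(2019, 12, 15)))
print(previous_month(dt.date(2020, 1, 3)))
```

Passing the date in as an argument lets the January case (where the previous month belongs to the previous year) be tested without waiting for the calendar. The month name comes from `strftime("%B")`, so it is English only under the default locale.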

In [425]:
main()

sum of NaN values in each column:
 BookTitle              0
AuthorName             0
Avg_Rating             0
No_of_Rating           0
No_of_Review           0
Publication_Company    2
Publication_Month      0
Publication_Year       0
Page_NO                0
Type                   0
dtype: int64

Types of each column:
 BookTitle               object
AuthorName              object
Avg_Rating             float64
No_of_Rating             int64
No_of_Review             int64
Publication_Company     object
Publication_Month       object
Publication_Year        object
Page_NO                  int64
Type                    object
dtype: object

-----Data cleaning completed-----
New csv file with cleansed data is created


*Page_NO-green - quick read books with less than 300 pages
*Page_NO-yellow - book with highest no of pages


Unnamed: 0,BookTitle,AuthorName,Avg_Rating,No_of_Rating,No_of_Review,Publication_Company,Publication_Month,Publication_Year,Page_NO,Type
0,The Starless Sea,Erin Morgenstern,4.12,12069,3433,Doubleday Books,November,2019,498,Hardcover
1,Blood Heir,Amélie Wen Zhao,3.96,1146,528,Delacorte Press,November,2019,464,Hardcover
2,The Deep,Rivers Solomon,3.86,1608,503,Gallery / Saga Press,November,2019,176,Hardcover
3,The Confession Club,Elizabeth Berg,3.8,1563,485,Random House,November,2019,304,Kindle Edition
4,Ali Cross,James Patterson,4.03,763,42,jimmy patterson,November,2019,320,Hardcover
5,Fate of the Fallen,Kel Kade,3.89,845,191,Tor Books,November,2019,400,ebook
6,The Revisioners,Margaret Wilkerson Sexton,3.9,434,87,Counterpoint,November,2019,288,Hardcover
7,Twenty-One Truths About Love,Matthew Dicks,3.72,673,426,St. Martin's Press,November,2019,352,Hardcover
8,Queen of the Conquered,Kacen Callender,3.74,146,60,Orbit,November,2019,400,Paperback
9,"The Other Windsor Girl: A Novel of Princess Margaret, Royal Rebel",Georgie Blalock,3.66,397,82,William Morrow Paperbacks,November,2019,377,Paperback


-----Data analyse completed-----
Enter the No of Links to be send in mail: 2
sending email......
email send
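
Before sending real mail, it can help to build the message and inspect it without opening an SMTP connection. The following is a sketch under assumptions: `build_message` is a hypothetical helper, the addresses are placeholders, and the single entry reuses one title from the output above together with the notebook's search URL pattern.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_message(sender, targets, month, entries):
    """Assemble the email without sending it; targets is a list of addresses."""
    msg = MIMEMultipart()
    msg["Subject"] = f"Happy Reading {month}"
    msg["From"] = sender
    msg["To"] = ", ".join(targets)  # joins whole addresses because targets is a list
    for number, (label, url) in enumerate(entries, start=1):
        body = f"<pre>{number}. {label}\n{url}</pre>"
        msg.attach(MIMEText(body, "html"))
    return msg

entries = [(
    "The Starless Sea by Erin Morgenstern",
    "https://www.bookdepository.com/search?searchTerm=The+Starless+Sea+by+Erin+Morgenstern",
)]
msg = build_message("sender@example.com", ["reader@example.com"], "November", entries)
print(msg["To"])
print(len(msg.get_payload()), "part(s) attached")
```

Checking `msg["To"]` here would catch the case where a plain string is passed as `targets`, since joining a string yields comma-separated characters instead of an address.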
