# Analyze SQL

The coronavirus pandemic took the world by surprise and changed everyone's daily routine. City residents no longer spend their free time out at cafes and malls; they stay home more often and spend that time reading. This has encouraged startup companies to develop new applications for book lovers.

You have been given a database of one of the competing companies in this industry. The database contains data about books, publishers, authors, as well as customer ratings and reviews of related books. This information will be used in making a price quote for a new product.

**Data Description**

books (Contains data about books):
   - `book_id` — book ID
   - `author_id` — author ID
   - `title` — book title
   - `num_pages` — number of pages
   - `publication_date` — publication date
   - `publisher_id` — publisher ID

authors (Contains data about the author):
   - `author_id` — author ID
   - `author` — author name

publishers (Contains data about publishers):
   - `publisher_id` — publisher ID
   - `publisher` — publisher name

ratings (Contains data about user ratings):
   - `rating_id` — rating ID
   - `book_id` — book ID
   - `username` — the username that rated the book
   - `rating` — rating given by the user

reviews (Contains data about customer reviews):
   - `review_id` — Review ID
   - `book_id` — book ID
   - `username` — the username who reviewed the book
   - `text` — review text
   
<b>Objective</b>:

- Analyze this dataset to obtain information that will be used in making a price quote for a new product.

<b>Stages:</b><a id='back'></a>

1. [Data Overview](#Start)
2. [*Exploratory Data Analysis*](#EDA)
    - [Number of books released after January 1, 2000](#1)
    - [Number of user reviews and average rating for each book](#2)
    - [Publisher who has published the most number of books](#3)
    - [Author with highest average book rating](#4)
    - [Average number of review texts among users](#5)
3. [General Conclusion](#Conclusion)

Note:

Install the PostgreSQL driver that SQLAlchemy uses to connect to the PostgreSQL database with the command below:

`pip install psycopg2-binary`

In [1]:
# import required libraries
import pandas as pd
from sqlalchemy import create_engine

In [2]:
# Database credentials
db_config = {'user': 'practicum_student',                         # username
             'pwd': 's65BlTKV3faNIGhmvJVzOqhs',                   # password
             'host': 'rc1b-wcoijxj3yxfsf3fs.mdb.yandexcloud.net', # host
             'port': 6432,                                        # connection port
             'db': 'data-analyst-final-project-db'}               # database name


# Build the connection string from the credentials
connection_string = 'postgresql://{}:{}@{}:{}/{}'.format(db_config['user'],
                                                         db_config['pwd'],
                                                         db_config['host'],
                                                         db_config['port'],
                                                         db_config['db'])

# Create the database engine from the connection string (SSL required)
engine = create_engine(connection_string, connect_args={'sslmode':'require'})
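The connection string above is assembled from hard-coded values. As a minimal sketch of an alternative, the same string can be built from environment variables, with the user name and password URL-encoded so that characters such as `@` or `/` do not break the URL. The variable names (`BOOKS_DB_USER` and so on), the fallback defaults, and the example values are all hypothetical and are not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=os.environ):
    """Assemble a PostgreSQL URL from a mapping of settings.

    The BOOKS_DB_* names are hypothetical; adapt them to your own setup.
    """
    user = env.get("BOOKS_DB_USER", "user")
    pwd = env.get("BOOKS_DB_PASSWORD", "")
    host = env.get("BOOKS_DB_HOST", "localhost")
    port = int(env.get("BOOKS_DB_PORT", "5432"))
    db = env.get("BOOKS_DB_NAME", "postgres")
    # quote_plus escapes reserved characters in the user name and password
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(pwd), host, port, db
    )


# Example with made-up settings passed as a plain dict instead of os.environ
example = build_connection_string({
    "BOOKS_DB_USER": "reader",
    "BOOKS_DB_PASSWORD": "p@ss/word",
    "BOOKS_DB_HOST": "db.example.com",
    "BOOKS_DB_PORT": "6432",
    "BOOKS_DB_NAME": "books",
})
print(example)
```

The resulting string can be passed to `create_engine` in the same way as `connection_string`.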

In [3]:
# Helper function for running SQL queries
def query_sql(query):
    '''
    Run a SQL query against the database and return the result as a DataFrame.

    Parameters:
    -----------
        query:
            the SQL query to execute
    '''
    return pd.read_sql(query, con=engine)

## Data Overview<a id='Start'></a>

In [4]:
# View the books dataset and display the results
query_sql(
    '''
    SELECT
        *
    FROM
        books;
    '''
).head()

|   | book_id | author_id | title | num_pages | publication_date | publisher_id |
|---|---|---|---|---|---|---|
| 0 | 1 | 546 | 'Salem's Lot | 594 | 2005-11-01 | 93 |
| 1 | 2 | 465 | 1 000 Places to See Before You Die | 992 | 2003-05-22 | 336 |
| 2 | 3 | 407 | 13 Little Blue Envelopes (Little Blue Envelope... | 322 | 2010-12-21 | 135 |
| 3 | 4 | 82 | 1491: New Revelations of the Americas Before C... | 541 | 2006-10-10 | 309 |
| 4 | 5 | 125 | 1776 | 386 | 2006-07-04 | 268 |


In [5]:
# View the reviews dataset and display the results
query_sql(
    '''
    SELECT
        *
    FROM
        reviews;
    '''
).head()

|   | review_id | book_id | username | text |
|---|---|---|---|---|
| 0 | 1 | 1 | brandtandrea | Mention society tell send professor analysis. ... |
| 1 | 2 | 1 | ryanfranco | Foot glass pretty audience hit themselves. Amo... |
| 2 | 3 | 2 | lorichen | Listen treat keep worry. Miss husband tax but ... |
| 3 | 4 | 3 | johnsonamanda | Finally month interesting blue could nature cu... |
| 4 | 5 | 3 | scotttamara | Nation purpose heavy give wait song will. List... |


In [6]:
# View the ratings dataset and display the results
query_sql(
    '''
    SELECT
        *
    FROM
        ratings;
    '''
).head()

|   | rating_id | book_id | username | rating |
|---|---|---|---|---|
| 0 | 1 | 1 | ryanfranco | 4 |
| 1 | 2 | 1 | grantpatricia | 2 |
| 2 | 3 | 1 | brandtandrea | 5 |
| 3 | 4 | 2 | lorichen | 3 |
| 4 | 5 | 2 | mariokeller | 2 |


In [7]:
# View the publishers dataset and display the results
query_sql(
    '''
    SELECT 
        *
    FROM
        publishers;
    '''
).head()

|   | publisher_id | publisher |
|---|---|---|
| 0 | 1 | Ace |
| 1 | 2 | Ace Book |
| 2 | 3 | Ace Books |
| 3 | 4 | Ace Hardcover |
| 4 | 5 | Addison Wesley Publishing Company |


In [8]:
# View the authors dataset and display the results
query_sql(
    '''
    SELECT 
        *
    FROM
        authors;
    '''
).head()

|   | author_id | author |
|---|---|---|
| 0 | 1 | A.S. Byatt |
| 1 | 2 | Aesop/Laura Harris/Laura Gibbs |
| 2 | 3 | Agatha Christie |
| 3 | 4 | Alan Brennert |
| 4 | 5 | Alan Moore/David Lloyd |


## *Exploratory Data Analysis*<a id='EDA'></a>

### Number of books released after January 1, 2000<a id='1'></a>

In [9]:
# Counts the number of books released after the specified date
query_sql(
    '''
    SELECT 
        COUNT(publication_date) as books_cnt
    FROM
        books
    WHERE
        publication_date > '2020-01-01';
    '''
).head()

|   | books_cnt |
|---|---|
| 0 | 1 |


In [10]:
# Examine the book
query_sql(
    '''
    -- CTE(Common Table Expression)
    WITH table_2 AS(
        SELECT
            *
        FROM
            authors
    )
    
    -- Merge tables
    SELECT 
        books.author_id,
        publication_date,
        author,
        title,
        num_pages
    FROM
        books LEFT JOIN table_2
        ON table_2.author_id = books.author_id
    WHERE
        publication_date > '2020-01-01';
    '''
).head()

|   | author_id | publication_date | author | title | num_pages |
|---|---|---|---|---|---|
| 0 | 377 | 2020-03-31 | Lynsay Sands | A Quick Bite (Argeneau #1) | 360 |


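The query above filters on `publication_date > '2020-01-01'`, so the count of 1 is the number of books released after January 1, 2020, not after January 1, 2000 as the section title states. That one book is:
- 'A Quick Bite (Argeneau #1)' by Lynsay Sands, published on 2020-03-31, with 360 pages.

To answer the question in the title, the date in the `WHERE` clause needs to be `'2000-01-01'`.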
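The effect of the cutoff date can be sketched on toy data. The example below uses an in-memory SQLite database instead of the PostgreSQL database from this notebook, and the four rows are made up for illustration; only the column names `book_id` and `publication_date` come from the real `books` table. The helper `count_books_after` is a hypothetical name.

```python
import sqlite3

# Toy data: four hypothetical books with ISO-formatted publication dates
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (book_id INTEGER PRIMARY KEY, publication_date TEXT)"
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [(1, "1999-05-01"), (2, "2000-01-01"), (3, "2005-11-01"), (4, "2020-03-31")],
)


def count_books_after(cutoff):
    """Count books published strictly after the cutoff date (YYYY-MM-DD)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM books WHERE publication_date > ?", (cutoff,)
    ).fetchone()
    return row[0]


after_2000 = count_books_after("2000-01-01")  # the date named in the title
after_2020 = count_books_after("2020-01-01")  # the date used in the query
print(after_2000, after_2020)
```

Passing the cutoff as a query parameter rather than typing it into the SQL text makes it harder for the date in the query to drift away from the date in the heading.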

### Number of user reviews and average rating for each book<a id='2'></a>

In [11]:
# Calculates the number of reviews and the average rating of each book
query_sql(
    '''
    -- CTE
    WITH table_reviews AS(
        SELECT
            book_id,
            review_id
        FROM
            reviews
    )
    
    -- Merge tables
    SELECT
        ratings.book_id,
        ROUND(AVG(ratings.rating)::numeric, 2) AS rating_avg, -- Average rating of each book
        COUNT(table_reviews.review_id) as review_total -- Review count of each book
    FROM
        ratings LEFT JOIN table_reviews
        ON ratings.book_id = table_reviews.book_id
    GROUP BY
        ratings.book_id
    ORDER BY
        review_total DESC;
    '''
).head()

|   | book_id | rating_avg | review_total |
|---|---|---|---|
| 0 | 948 | 3.66 | 1120 |
| 1 | 750 | 4.13 | 528 |
| 2 | 673 | 3.83 | 516 |
| 3 | 302 | 4.41 | 492 |
| 4 | 299 | 4.29 | 480 |


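Sorted by `review_total`:
1. `book_id` 948 ranks first with a `review_total` of 1120 and an average rating of 3.66,
2. followed by `book_id` 750 with 528 and an average rating of 4.13,
3. then `book_id` 673 with 516 and an average rating of 3.83.

Note that `review_total` is not the actual number of reviews. Joining `ratings` to `reviews` on `book_id` pairs every rating with every review of the same book, so the count is the number of ratings multiplied by the number of reviews. For `book_id` 948, which has 160 ratings (see the rating counts further below), 1120 = 160 × 7, which points to about 7 reviews rather than 1120. The average rating is unaffected, because each rating is repeated the same number of times.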
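One way to get a review count that is not multiplied by the number of ratings is to aggregate `ratings` and `reviews` separately and join the two summaries afterwards. The sketch below shows both the direct join and the pre-aggregated version on an in-memory SQLite database with made-up rows; the CTE names `rating_stats` and `review_stats` are hypothetical, and the PostgreSQL `::numeric` cast from the notebook is dropped because SQLite does not need it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ratings (rating_id INTEGER PRIMARY KEY, book_id INTEGER, rating INTEGER);
    CREATE TABLE reviews (review_id INTEGER PRIMARY KEY, book_id INTEGER);
    -- Book 1: 3 ratings, 2 reviews; book 2: 2 ratings, 1 review; book 3: 1 rating, no reviews
    INSERT INTO ratings (book_id, rating) VALUES (1, 4), (1, 2), (1, 5), (2, 3), (2, 2), (3, 5);
    INSERT INTO reviews (book_id) VALUES (1), (1), (2);
""")

# Direct join: every rating row is paired with every review row of the book
naive = conn.execute("""
    SELECT ratings.book_id, COUNT(reviews.review_id) AS review_total
    FROM ratings LEFT JOIN reviews ON ratings.book_id = reviews.book_id
    GROUP BY ratings.book_id
    ORDER BY ratings.book_id
""").fetchall()

# Pre-aggregated join: one summary row per book on each side
fixed = conn.execute("""
    WITH rating_stats AS (
        SELECT book_id, ROUND(AVG(rating), 2) AS rating_avg
        FROM ratings
        GROUP BY book_id
    ), review_stats AS (
        SELECT book_id, COUNT(*) AS n_reviews
        FROM reviews
        GROUP BY book_id
    )
    SELECT r.book_id, r.rating_avg, COALESCE(v.n_reviews, 0) AS review_total
    FROM rating_stats r LEFT JOIN review_stats v ON r.book_id = v.book_id
    ORDER BY r.book_id
""").fetchall()

print(naive)
print(fixed)
```

In the toy data, book 1 has 2 reviews, but the direct join reports 6 (3 ratings × 2 reviews). `COUNT(DISTINCT review_id)` in the original query would be a smaller change with the same effect on the count.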

### Publisher who has published the most number of books<a id='3'></a>

Only books with more than 50 pages are counted, which excludes brochures and similar short publications.

In [12]:
# Calculates the total number of books released by the publisher
query_sql(
    '''
    -- CTE
    WITH table_publisher AS(
        SELECT
            *
        FROM
            publishers
    )
    
    -- Merge tables
    SELECT
        books.publisher_id,
        COUNT(books.book_id) AS books_cnt, -- Total books released
        table_publisher.publisher
    FROM
        books LEFT JOIN table_publisher
        ON books.publisher_id = table_publisher.publisher_id
    WHERE
        books.num_pages > 50
    GROUP BY
        books.publisher_id,
        table_publisher.publisher
    ORDER BY
        books_cnt DESC;
    '''
).head()

|   | publisher_id | books_cnt | publisher |
|---|---|---|---|
| 0 | 212 | 42 | Penguin Books |
| 1 | 309 | 31 | Vintage |
| 2 | 116 | 25 | Grand Central Publishing |
| 3 | 217 | 24 | Penguin Classics |
| 4 | 33 | 19 | Ballantine Books |


Among books with more than 50 pages:
1. Penguin Books has published the most books (42),
2. followed by Vintage (31), Grand Central Publishing (25), Penguin Classics (24), and Ballantine Books (19).

### Author with highest average book rating<a id='4'></a>

Only books that have been rated more than 50 times are considered.

In [13]:
# Counts the ratings each book received, keeping books with more than 50 ratings
query_sql(
    '''
    -- CTE
    WITH table_authors AS(
        SELECT
            *
        FROM
            authors
    ), table_books AS(
        SELECT
            book_id,
            author_id
        FROM
            books
    )
    
    -- Merge tables
    SELECT
        ratings.book_id,
        COUNT(ratings.username) AS rating_cnt,
        table_authors.author_id,
        table_authors.author
    FROM
        ratings
        LEFT JOIN table_books
        ON ratings.book_id = table_books.book_id
        LEFT JOIN table_authors
        ON table_books.author_id = table_authors.author_id
    GROUP BY
        ratings.book_id,
        table_authors.author_id,
        table_authors.author
    HAVING
        COUNT(ratings.username)>50
    ORDER BY
        rating_cnt DESC;
    '''
).head()

|   | book_id | rating_cnt | author_id | author |
|---|---|---|---|---|
| 0 | 948 | 160 | 554 | Stephenie Meyer |
| 1 | 750 | 88 | 240 | J.R.R. Tolkien |
| 2 | 673 | 86 | 235 | J.D. Salinger |
| 3 | 75 | 84 | 106 | Dan Brown |
| 4 | 302 | 82 | 236 | J.K. Rowling/Mary GrandPré |


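1. Stephenie Meyer is the author of the most-rated book (`book_id` 948, with 160 ratings),
2. followed by J.R.R. Tolkien (88), J.D. Salinger (86), Dan Brown (84), and J.K. Rowling/Mary GrandPré (82).

This query counts ratings per book; it does not compute an average rating, so it does not by itself identify the author with the highest average book rating.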
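To rank authors by average rating rather than by rating count, one possible approach is to compute the average rating of each sufficiently rated book first and then average those values per author. The sketch below does this on an in-memory SQLite database with made-up rows. The author names, the `book_stats` CTE, and the `MIN_RATINGS` constant are hypothetical; the threshold is set to 2 only so that the toy data is small, whereas this section uses 50.

```python
import sqlite3

MIN_RATINGS = 2  # hypothetical threshold for the toy data; the section uses 50

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, author TEXT);
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, author_id INTEGER);
    CREATE TABLE ratings (rating_id INTEGER PRIMARY KEY, book_id INTEGER, rating INTEGER);
    INSERT INTO authors VALUES (1, 'Author A'), (2, 'Author B');
    INSERT INTO books VALUES (1, 1), (2, 1), (3, 2), (4, 2);
    -- Book 4 has a single rating and falls below the threshold
    INSERT INTO ratings (book_id, rating) VALUES
        (1, 5), (1, 4), (1, 4),
        (2, 3), (2, 3), (2, 3),
        (3, 5), (3, 5), (3, 4),
        (4, 1);
""")

author_ranking = conn.execute("""
    WITH book_stats AS (
        -- Average rating per book, keeping only books with enough ratings
        SELECT book_id, AVG(rating) AS book_avg
        FROM ratings
        GROUP BY book_id
        HAVING COUNT(*) > ?
    )
    SELECT a.author,
           ROUND(AVG(s.book_avg), 2) AS author_avg,
           COUNT(*) AS books_counted
    FROM book_stats s
    JOIN books b ON b.book_id = s.book_id
    JOIN authors a ON a.author_id = b.author_id
    GROUP BY a.author_id, a.author
    ORDER BY author_avg DESC
""", (MIN_RATINGS,)).fetchall()

print(author_ranking)
```

Averaging per book first gives every qualifying book equal weight. Averaging all ratings of an author directly would instead weight books by how often they were rated, which is a different but equally defensible definition.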

### Average number of review texts among users<a id='5'></a>

Only books that have been rated more than 50 times are considered. The query computes the average length of the review text, in characters, for each of these books.

In [14]:
# Average review text length for books rated more than 50 times
query_sql(
    '''
    -- CTE
    WITH table_authors AS (
        SELECT
            *
        FROM
            authors
    ), table_books AS (
        SELECT 
            book_id,
            author_id,
            title
        FROM
            books
    ), table_reviews AS (
        SELECT
            book_id,
            text
        FROM
            reviews
    ), table_ratings AS (
        SELECT
            book_id,
            COUNT(username) AS rating_cnt FROM ratings
        GROUP BY
            book_id
        HAVING
            COUNT(username) > 50 -- Keep books rated more than 50 times
    )

    -- Merge tables
    SELECT 
        table_ratings.book_id,
        table_ratings.rating_cnt,
        table_books.title,
        AVG(LENGTH(table_reviews.text)) AS avg_text_len
    FROM
        table_ratings
        JOIN table_books ON table_ratings.book_id = table_books.book_id
        JOIN table_authors ON table_books.author_id = table_authors.author_id
        JOIN table_reviews ON table_ratings.book_id = table_reviews.book_id
    GROUP BY 
        table_ratings.book_id,
        table_ratings.rating_cnt,
        table_books.title
    ORDER BY 
        table_ratings.rating_cnt DESC;

    '''
).head()

|   | book_id | rating_cnt | title | avg_text_len |
|---|---|---|---|---|
| 0 | 948 | 160 | Twilight (Twilight #1) | 89.571429 |
| 1 | 750 | 88 | The Hobbit or There and Back Again | 86.833333 |
| 2 | 673 | 86 | The Catcher in the Rye | 103.166667 |
| 3 | 75 | 84 | Angels & Demons (Robert Langdon #1) | 112.6 |
| 4 | 302 | 82 | Harry Potter and the Prisoner of Azkaban (Harr... | 82.0 |


The results are ordered by rating count, not by review length:
1. "Twilight (Twilight #1)" has an average review length of 89.57 characters,
2. followed by "The Hobbit or There and Back Again" (86.83), "The Catcher in the Rye" (103.17), "Angels & Demons (Robert Langdon #1)" (112.60), and "Harry Potter and the Prisoner of Azkaban" (82.00).
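The section title can also be read differently: as the average number of text reviews written per user, restricted to users who rated more than 50 books. That reading is an assumption, not something the query above computes. A minimal sketch of it on an in-memory SQLite database with made-up users follows; the CTE names `active_users` and `review_counts` and the `MIN_RATED` constant are hypothetical, and the threshold is set to 2 only to keep the toy data small.

```python
import sqlite3

MIN_RATED = 2  # hypothetical threshold for the toy data; the text mentions 50

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ratings (rating_id INTEGER PRIMARY KEY, book_id INTEGER, username TEXT);
    CREATE TABLE reviews (review_id INTEGER PRIMARY KEY, book_id INTEGER, username TEXT);
    -- u1, u2 and u4 rated three books each; u3 rated only one
    INSERT INTO ratings (book_id, username) VALUES
        (1, 'u1'), (2, 'u1'), (3, 'u1'),
        (1, 'u2'), (2, 'u2'), (3, 'u2'),
        (1, 'u3'),
        (1, 'u4'), (2, 'u4'), (3, 'u4');
    -- u1 wrote 2 reviews, u2 wrote 1, u3 wrote 5, u4 wrote none
    INSERT INTO reviews (book_id, username) VALUES
        (1, 'u1'), (2, 'u1'),
        (1, 'u2'),
        (1, 'u3'), (1, 'u3'), (1, 'u3'), (1, 'u3'), (1, 'u3');
""")

avg_reviews = conn.execute("""
    WITH active_users AS (
        -- Users who rated more than the threshold number of books
        SELECT username
        FROM ratings
        GROUP BY username
        HAVING COUNT(book_id) > ?
    ), review_counts AS (
        -- LEFT JOIN keeps active users who wrote no reviews (count 0)
        SELECT u.username, COUNT(r.review_id) AS review_cnt
        FROM active_users u
        LEFT JOIN reviews r ON r.username = u.username
        GROUP BY u.username
    )
    SELECT AVG(review_cnt) FROM review_counts
""", (MIN_RATED,)).fetchone()[0]

print(avg_reviews)
```

Here u3 is excluded for rating too few books, and u4 counts as an active user with zero reviews, so the average is taken over the counts 2, 1, and 0.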

## General Conclusion<a id='Conclusion'></a>

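The general conclusions from this analysis are:
1. The dataset contains information about books, authors, publishers, user ratings, and user reviews.
2. Only one book was released after January 1, 2020, the date actually used in the query; the number of books released after January 1, 2000 was not computed.
3. The book with ID 948, "Twilight (Twilight #1)", ranks first by `review_total` and has an average rating of 3.66. The `review_total` values are inflated by the join between `ratings` and `reviews`, so they should not be read as actual review counts.
4. Penguin Books has published the most books with more than 50 pages (42).
5. Stephenie Meyer is the author of the most-rated book (160 ratings). The author with the highest average book rating was not computed.
6. Among the five most-rated books, the average review length ranges from 82.0 to 112.6 characters.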

Based on these results, some recommendations for preparing a price quote for a new product are:
1. Use books that are widely reviewed and have a good average rating as a reference when positioning new products.
2. Contact the relevant publishers to explore opportunities for cooperation on new products.
3. Look at the works of highly rated authors and consider inviting them to cooperate on new products.
4. Take user reviews, including the average review length, into account when designing the offer.