
# Ungraded Lab: Data Exploration Lab

## 📋 Overview
Welcome to the Data Exploration lab! In this hands-on session, you'll investigate BookCycle's data to help the management team gain valuable insights about their inventory and sales. You'll learn how to sort query results and use basic aggregate functions to summarize data, skills that are crucial for data analysis in real business settings.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:
<ul>
    <li>Sort query results using ORDER BY</li>
    <li>Use basic aggregate functions (COUNT, SUM, AVG) to summarize data</li>
    <li>Apply filters with WHERE clauses in combination with aggregations</li>
    <li>Interpret summarized data to derive business insights</li>
</ul>

## 📚 Dataset Information
We'll be working with the 'books' table from the BookCycle database. This table contains information about the books in inventory, including details like title, author, genre, condition, pricing, and location.

## 🖥️ Activities

### Activity 1: Connecting to the Database and Basic Sorting

As a data analyst at BookCycle, your first task is to organize the book inventory data to help the management team quickly access information about their stock.

<b>Step 1</b>: Import the necessary libraries and connect to the database:


In [11]:
import sqlite3
import pandas as pd

# Setting up the database. DO NOT edit the code given below
def setup_database():
    conn = sqlite3.connect('bookcycle.db')
    cursor = conn.cursor()

    # Create books table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS books (
            book_id INTEGER PRIMARY KEY,
            title TEXT,
            author TEXT,
            genre TEXT,
            condition TEXT,
            purchase_price REAL,
            list_price REAL,
            stock INTEGER,
            location TEXT
        )
    ''')

    # Sample data
    sample_data = [
        ('The Great Gatsby', 'F. Scott Fitzgerald', 'Classic Fiction', 'Like New', 8.50, 15.00, 5, 'Shelf A1'),
        ('1984', 'George Orwell', 'Classic Fiction', 'Good', 7.00, 12.00, 3, 'Shelf A2'),
        ('To Kill a Mockingbird', 'Harper Lee', 'Classic Fiction', 'Very Good', 9.00, 16.00, 7, 'Shelf A1'),
        ('Pride and Prejudice', 'Jane Austen', 'Classic Fiction', 'Acceptable', 6.00, 10.00, 2, 'Shelf A3'),
        ('The Hitchhiker\'s Guide to the Galaxy', 'Douglas Adams', 'Science Fiction', 'Like New', 7.50, 13.00, 8, 'Shelf B1'),
        ('Dune', 'Frank Herbert', 'Science Fiction', 'Good', 9.50, 17.00, 4, 'Shelf B2'),
        ('Foundation', 'Isaac Asimov', 'Science Fiction', 'Very Good', 8.00, 14.00, 6, 'Shelf B1'),
        ('Murder on the Orient Express', 'Agatha Christie', 'Mystery', 'Like New', 6.50, 11.00, 10, 'Shelf C1'),
        ('The Da Vinci Code', 'Dan Brown', 'Mystery', 'Good', 5.50, 9.00, 9, 'Shelf C2'),
        ('Gone Girl', 'Gillian Flynn', 'Mystery', 'Very Good', 7.00, 12.50, 5, 'Shelf C1'),
        ('Educated', 'Tara Westover', 'Biography', 'Like New', 10.00, 18.00, 3, 'Shelf D1'),
        ('Becoming', 'Michelle Obama', 'Biography', 'Good', 9.00, 16.00, 4, 'Shelf D2'),
        ('Sapiens: A Brief History of Humankind', 'Yuval Noah Harari', 'History', 'Very Good', 11.00, 20.00, 6, 'Shelf E1'),
        ('Cosmos', 'Carl Sagan', 'Science', 'Like New', 8.50, 15.00, 5, 'Shelf F1'),
        ('The Immortal Life of Henrietta Lacks', 'Rebecca Skloot', 'Science', 'Good', 7.00, 12.00, 3, 'Shelf F2'),
        ('The Night Circus', 'Erin Morgenstern', 'Fantasy', 'Very Good', 8.00, 14.00, 7, 'Shelf G1'),
        ('A Game of Thrones', 'George R.R. Martin', 'Fantasy', 'Good', 9.00, 16.00, 5, 'Shelf G2'),
        ('Harry Potter and the Sorcerer\'s Stone', 'J.K. Rowling', 'Fantasy', 'Like New', 7.00, 12.00, 10, 'Shelf G3'),
        ('The Hunger Games', 'Suzanne Collins', 'Young Adult', 'Very Good', 6.00, 10.00, 8, 'Shelf H1'),
        ('Divergent', 'Veronica Roth', 'Young Adult', 'Good', 5.00, 9.00, 6, 'Shelf H2')
    ]

    # Check if table is empty before inserting
    cursor.execute("SELECT COUNT(*) FROM books")
    if cursor.fetchone()[0] == 0:
        cursor.executemany('''
            INSERT INTO books (title, author, genre, condition, purchase_price, list_price, stock, location)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
        ''', sample_data)
        conn.commit()
        print("Database setup complete with sample data.")
    else:
        print("Database already contains data, skipping insertion.")


    conn.close()

setup_database()

Database already contains data, skipping insertion.


In [12]:
# Connect to the SQLite database
conn = sqlite3.connect('bookcycle.db')

# Test the connection by querying the first 5 rows of the books table
query = """
SELECT *
FROM books
LIMIT 5;
"""

df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,book_id,title,author,genre,condition,purchase_price,list_price,stock,location
0,1,The Great Gatsby,F. Scott Fitzgerald,Classic Fiction,Like New,8.5,15.0,5,Shelf A1
1,2,1984,George Orwell,Classic Fiction,Good,7.0,12.0,3,Shelf A2
2,3,To Kill a Mockingbird,Harper Lee,Classic Fiction,Very Good,9.0,16.0,7,Shelf A1
3,4,Pride and Prejudice,Jane Austen,Classic Fiction,Acceptable,6.0,10.0,2,Shelf A3
4,5,The Hitchhiker's Guide to the Galaxy,Douglas Adams,Science Fiction,Like New,7.5,13.0,8,Shelf B1


In [13]:
df.shape

(5, 9)

<b>Step 2:</b> Let's sort the books by list price in descending order and display the 10 most expensive titles:

In [14]:
query = """
SELECT title, author, list_price
FROM books
ORDER BY list_price DESC
LIMIT 10;
"""

df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,title,author,list_price
0,Sapiens: A Brief History of Humankind,Yuval Noah Harari,20.0
1,Educated,Tara Westover,18.0
2,Dune,Frank Herbert,17.0
3,To Kill a Mockingbird,Harper Lee,16.0
4,Becoming,Michelle Obama,16.0
5,A Game of Thrones,George R.R. Martin,16.0
6,The Great Gatsby,F. Scott Fitzgerald,15.0
7,Cosmos,Carl Sagan,15.0
8,Foundation,Isaac Asimov,14.0
9,The Night Circus,Erin Morgenstern,14.0


<b>💡 Tip:</b> The ORDER BY clause sorts the results. ASC (the default) sorts in ascending order; DESC sorts in descending order.
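
ORDER BY also accepts more than one sort key, which helps when several books share the same price. The sketch below is a minimal illustration, not part of the lab's required code: it assumes a throwaway in-memory table holding four rows copied from the sample data, and it uses only the standard `sqlite3` module.

```python
import sqlite3

# Hypothetical in-memory table with a few rows copied from the lab's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, list_price REAL)")
conn.executemany(
    "INSERT INTO books (title, list_price) VALUES (?, ?)",
    [
        ("To Kill a Mockingbird", 16.00),
        ("Becoming", 16.00),
        ("Dune", 17.00),
        ("1984", 12.00),
    ],
)

# Highest price first; books with the same price are then sorted by title (A to Z).
rows = conn.execute(
    """
    SELECT title, list_price
    FROM books
    ORDER BY list_price DESC, title ASC;
    """
).fetchall()
conn.close()

for title, price in rows:
    print(f"{price:5.2f}  {title}")
```

Without the second key, the relative order of the two 16.00 books would be left to the database engine.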

<b>Step 3: Try it yourself:</b> Sort the books by title in alphabetical order and display the first 15 results.

In [15]:
query = """
SELECT *
FROM books
ORDER BY title ASC
LIMIT 15;
"""

df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,book_id,title,author,genre,condition,purchase_price,list_price,stock,location
0,2,1984,George Orwell,Classic Fiction,Good,7.0,12.0,3,Shelf A2
1,17,A Game of Thrones,George R.R. Martin,Fantasy,Good,9.0,16.0,5,Shelf G2
2,12,Becoming,Michelle Obama,Biography,Good,9.0,16.0,4,Shelf D2
3,14,Cosmos,Carl Sagan,Science,Like New,8.5,15.0,5,Shelf F1
4,20,Divergent,Veronica Roth,Young Adult,Good,5.0,9.0,6,Shelf H2
5,6,Dune,Frank Herbert,Science Fiction,Good,9.5,17.0,4,Shelf B2
6,11,Educated,Tara Westover,Biography,Like New,10.0,18.0,3,Shelf D1
7,7,Foundation,Isaac Asimov,Science Fiction,Very Good,8.0,14.0,6,Shelf B1
8,10,Gone Girl,Gillian Flynn,Mystery,Very Good,7.0,12.5,5,Shelf C1
9,18,Harry Potter and the Sorcerer's Stone,J.K. Rowling,Fantasy,Like New,7.0,12.0,10,Shelf G3
10,8,Murder on the Orient Express,Agatha Christie,Mystery,Like New,6.5,11.0,10,Shelf C1
11,4,Pride and Prejudice,Jane Austen,Classic Fiction,Acceptable,6.0,10.0,2,Shelf A3
12,13,Sapiens: A Brief History of Humankind,Yuval Noah Harari,History,Very Good,11.0,20.0,6,Shelf E1
13,9,The Da Vinci Code,Dan Brown,Mystery,Good,5.5,9.0,9,Shelf C2
14,1,The Great Gatsby,F. Scott Fitzgerald,Classic Fiction,Like New,8.5,15.0,5,Shelf A1

In [16]:
df.shape

(15, 9)

#### ⚙️ Test Your Work:
<ul>
    <li>Did your query execute without errors?</li>
    <li>Are the books sorted alphabetically by title?</li>
    <li>Did you see 15 results?</li>

</ul>

### Activity 2: Using Aggregate Functions

The management team wants to understand the overall state of their inventory. You'll use aggregate functions to provide summary statistics.

<b>Step 1:</b> Let's start by counting the number of book records (rows) in the inventory table:

In [17]:
query = """
SELECT COUNT(*) as total_books
FROM books;
"""

df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,total_books
0,20


<b>Step 2:</b> Let's calculate the average purchase price and list price of the books:

In [18]:
query = """
SELECT
    AVG(purchase_price) as avg_purchase_price,
    AVG(list_price) as avg_list_price
FROM books;
"""

df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,avg_purchase_price,avg_list_price
0,7.75,13.575
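
To see how the three aggregate functions relate, it can help to compute an average by hand. The following is a small sketch under stated assumptions: it uses an in-memory table with three rows copied from the sample data (the two Biography titles and the History title) and checks that AVG gives the same value as SUM divided by COUNT.

```python
import sqlite3

# Hypothetical in-memory table with three rows copied from the lab's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, list_price REAL)")
conn.executemany(
    "INSERT INTO books (title, list_price) VALUES (?, ?)",
    [
        ("Educated", 18.00),
        ("Becoming", 16.00),
        ("Sapiens: A Brief History of Humankind", 20.00),
    ],
)

# One query returns all three aggregates plus a hand-computed average.
count, total, average, manual_average = conn.execute(
    """
    SELECT
        COUNT(*)                     AS book_count,
        SUM(list_price)              AS total_list_price,
        AVG(list_price)              AS avg_list_price,
        SUM(list_price) / COUNT(*)   AS manual_avg
    FROM books;
    """
).fetchone()
conn.close()

print(count, total, average, manual_average)
```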


<b>Step 3: Try it yourself:</b> Calculate the total inventory value (list price multiplied by stock, summed over all books) and the number of unique authors in the database.


In [19]:
query = """
SELECT
    SUM(list_price * stock) as total_inventory_value,
    COUNT(DISTINCT author) as unique_authors
FROM books;
"""

df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,total_inventory_value,unique_authors
0,1533.5,20
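
The difference between summing list prices and summing list price times stock is easy to overlook. Here is a minimal sketch, assuming an in-memory table with three rows copied from the sample data, that computes both figures side by side.

```python
import sqlite3

# Hypothetical in-memory table with three rows copied from the lab's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, list_price REAL, stock INTEGER)")
conn.executemany(
    "INSERT INTO books (title, list_price, stock) VALUES (?, ?, ?)",
    [
        ("The Great Gatsby", 15.00, 5),
        ("1984", 12.00, 3),
        ("Pride and Prejudice", 10.00, 2),
    ],
)

sum_of_prices, inventory_value = conn.execute(
    """
    SELECT
        SUM(list_price)          AS sum_of_prices,     -- one copy of each title
        SUM(list_price * stock)  AS inventory_value    -- every copy in stock
    FROM books;
    """
).fetchone()
conn.close()

print(sum_of_prices, inventory_value)
```

The first figure treats each title as a single copy (15 + 12 + 10), while the second weights each price by the number of copies on the shelf (75 + 36 + 20).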


#### ⚙️ Test Your Work:
<ul>
    <li>Did your query execute without errors?</li>
    <li>Do you see two columns: total_inventory_value and unique_authors?</li>
    <li>Are the values reasonable given what you know about the bookstore?</li>

</ul>

### Activity 3: Combining Aggregations with Filters

The management wants to analyze the inventory of specific genres and conditions to make informed decisions about future purchases.

<b>Step 1:</b> Let's find the average list price for books in 'Classic Fiction' genre:

In [21]:
query = """
SELECT AVG(list_price) as avg_price_classic_fiction
FROM books
WHERE genre = 'Classic Fiction';
"""

df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,avg_price_classic_fiction
0,13.25
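
If you want to repeat this calculation for other genres, one option is to pass the genre as a query parameter instead of editing the SQL string. The sketch below is an illustration only: `average_list_price` is a hypothetical helper, and the in-memory table holds five rows copied from the sample data.

```python
import sqlite3

# Hypothetical in-memory table with five rows copied from the lab's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, genre TEXT, list_price REAL)")
conn.executemany(
    "INSERT INTO books (title, genre, list_price) VALUES (?, ?, ?)",
    [
        ("The Great Gatsby", "Classic Fiction", 15.00),
        ("1984", "Classic Fiction", 12.00),
        ("To Kill a Mockingbird", "Classic Fiction", 16.00),
        ("Pride and Prejudice", "Classic Fiction", 10.00),
        ("Dune", "Science Fiction", 17.00),
    ],
)


def average_list_price(connection, genre):
    """Return the average list price for one genre (hypothetical helper)."""
    # The ? placeholder lets sqlite3 bind the value safely.
    row = connection.execute(
        "SELECT AVG(list_price) FROM books WHERE genre = ?;", (genre,)
    ).fetchone()
    return row[0]


classic_avg = average_list_price(conn, "Classic Fiction")
# No rows match this genre, so AVG returns NULL, which Python sees as None.
missing_avg = average_list_price(conn, "Poetry")
conn.close()

print(classic_avg, missing_avg)
```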


<b>Step 2:</b> Let's count the number of books in 'Very Good' condition for each genre:

In [22]:
query = """
SELECT genre, COUNT(*) as count_very_good
FROM books
WHERE condition = 'Very Good'
GROUP BY genre
ORDER BY count_very_good DESC;
"""

df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,genre,count_very_good
0,Young Adult,1
1,Science Fiction,1
2,Mystery,1
3,History,1
4,Fantasy,1
5,Classic Fiction,1
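
Every genre in the result above has a count of 1, so sorting by the count alone does not determine the order of the rows. A possible refinement, sketched here on an in-memory table with four rows copied from the sample data, is to add the genre as a second sort key so that ties come out in a predictable order.

```python
import sqlite3

# Hypothetical in-memory table with four rows copied from the lab's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, genre TEXT, condition TEXT)")
conn.executemany(
    "INSERT INTO books (title, genre, condition) VALUES (?, ?, ?)",
    [
        ("To Kill a Mockingbird", "Classic Fiction", "Very Good"),
        ("1984", "Classic Fiction", "Good"),
        ("Foundation", "Science Fiction", "Very Good"),
        ("Gone Girl", "Mystery", "Very Good"),
    ],
)

# WHERE removes the 'Good' row before grouping; the second sort key breaks ties.
rows = conn.execute(
    """
    SELECT genre, COUNT(*) AS count_very_good
    FROM books
    WHERE condition = 'Very Good'
    GROUP BY genre
    ORDER BY count_very_good DESC, genre ASC;
    """
).fetchall()
conn.close()

for genre, count in rows:
    print(genre, count)
```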


<b>Step 3: Try it yourself:</b> Find the total value (sum of list prices) of books in 'Like New' condition for each location, sorted by total value in descending order.

In [25]:
query = """
SELECT location, SUM(list_price) AS sum_list_prices
FROM books
WHERE condition = 'Like New'
GROUP BY location
ORDER BY sum_list_prices DESC;
"""

df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,location,sum_list_prices
0,Shelf D1,18.0
1,Shelf F1,15.0
2,Shelf A1,15.0
3,Shelf B1,13.0
4,Shelf G3,12.0
5,Shelf C1,11.0


#### Close the Connection
It's good practice to close the database connection when you're done.

In [26]:
# Close the database connection
conn.close()
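
As an alternative to calling `conn.close()` by hand, the connection can be wrapped in `contextlib.closing` so that it is closed even if a query raises an error. This is a sketch on a throwaway in-memory database, not a change to the lab's code. Note that using a `sqlite3` connection directly in a `with` statement manages transactions but does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the with-block ends, even after an error.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE books (title TEXT)")
    conn.execute("INSERT INTO books (title) VALUES ('Dune')")
    total = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]

# Any further use of the closed connection raises sqlite3.ProgrammingError.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(total, still_open)
```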

#### ⚙️ Test Your Work:

- Did your query execute without errors?
- Do you see results for different locations?
- Are the results sorted with the highest total value first?

## ✅ Success Checklist
- You can sort query results using ORDER BY
- You can use COUNT, SUM, and AVG functions to summarize data
- You can combine WHERE clauses with aggregations
- You can interpret the results to derive business insights

## 🔍 Common Issues & Solutions
- Problem: Syntax error in SQL query
    - Solution: Double-check your SQL syntax, especially commas between selected columns and semicolons at the end of queries
- Problem: Unexpected results from aggregations
    - Solution: Verify that you're grouping correctly and using the right aggregate function for your needs
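
Comma mistakes can show up in two different ways, which the following sketch tries to illustrate on a throwaway in-memory table (one row copied from the sample data). A stray comma before FROM raises an error, while a missing comma between two column names raises nothing: SQLite reads the second name as an alias for the first.

```python
import sqlite3

# Hypothetical in-memory table with one row copied from the lab's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT)")
conn.execute("INSERT INTO books (title, author) VALUES ('Dune', 'Frank Herbert')")

# Stray comma before FROM: SQLite rejects the query with a syntax error.
try:
    conn.execute("SELECT title, FROM books;")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

# Missing comma: no error, but "author" becomes an alias for the title column,
# so the result has one column instead of two.
cursor = conn.execute("SELECT title author FROM books;")
column_names = [description[0] for description in cursor.description]
row = cursor.fetchone()
conn.close()

print(error_message)
print(column_names, row)
```

Checking the column names of a result is a quick way to catch the silent case.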

## ➡️ Summary

Congratulations on completing the Data Exploration lab! You've now practiced the core SQL skills for sorting and aggregating data: ORDER BY, COUNT, SUM, AVG, WHERE, and GROUP BY. These are the building blocks for extracting insights from a dataset and supporting data-driven business decisions.

### 🔑 Key Points
- ORDER BY is used to sort results in ascending (default) or descending (DESC) order
- Aggregate functions like COUNT, SUM, and AVG summarize data
- WHERE clauses can be used with aggregations to filter data before summarizing
- GROUP BY is used with aggregations to summarize data for each group
