# Project 1, Part 1, Sales Related Queries

University of California, Berkeley

Master of Information and Data Science (MIDS) program

w205 - Fundamentals of Data Engineering

Student: Terra Jiang

Year: 2025

Semester: Summer

Section: 4


# Included Modules and Packages

Code cell containing your includes for modules and packages

In [1]:
import math
import numpy as np
import pandas as pd

import psycopg2

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  Remember you can use any code from the labs.

In [2]:
#
# Run a select query and return the rows in a pandas dataframe.
# pandas loads numeric values from postgres as float64; any float column whose
# non-null values are all whole numbers is converted to nullable Int64.
# Relies on the module-level `connection` object created in a later cell.
#

def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    """Run a select query and return the rows in a pandas dataframe."""

    if rollback_before_flag:
        connection.rollback()

    df = pd.read_sql_query(query, connection)

    if rollback_after_flag:
        connection.rollback()

    # fix the float columns that really should be integers
    for column in df:

        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:

                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True
                        break  # one fractional value is enough to keep the column as float

            if not fraction_flag:
                df[column] = df[column].astype('Int64')

    return df
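
The integer fix-up in the function above can be tried without a database. The following is a minimal sketch of the same idea in vectorized form; the helper name `floats_to_nullable_int` and the `demo` dataframe are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd


def floats_to_nullable_int(df):
    """Convert float64 columns that hold only whole numbers (or NaN) to Int64."""
    for column in df.columns:
        if df[column].dtype == "float64":
            values = df[column].dropna()
            # every non-null value equals its own floor, so nothing is fractional
            if (values == np.floor(values)).all():
                df[column] = df[column].astype("Int64")
    return df


# hypothetical data: one whole-number column and one fractional column, both with a gap
demo = pd.DataFrame({
    "whole": [1.0, 2.0, np.nan],
    "fractional": [1.5, 2.0, np.nan],
})
demo = floats_to_nullable_int(demo)
print(demo.dtypes)
```

The nullable `Int64` dtype (capital "I") is used rather than `int64` because it can hold missing values, which a plain NumPy integer column cannot.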
    

In [3]:
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)

# 1.1.1 Total sales as a dollar amount, total number of sales, average dollar amount per sale


Each record in the sales table is an individual sale, and the total_amount column is the total amount for that individual sale.

Write 1 and only 1 query.  Note that the query may have as many subqueries, including "with" clauses, as you wish.  

Name column headers exactly as shown in the example below. 

Format data exactly as shown in the example below.

Ensure that when you check this Jupyter Notebook into GitHub, the query results in the Pandas dataframe are clearly visible in GitHub.

The query should return only 1 row into a Pandas dataframe and should look similar to this: 

||total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|
|0|98739408|98.7|1537617|64.22|

In [4]:
rollback_before_flag = True
rollback_after_flag = True

query = """

SELECT sum(total_amount) as total_sales_dollars, 
    round(sum(total_amount)/1000000, 1) as total_sales_million_dollars, 
    count(sale_id) as total_number_of_sales,
    round(sum(total_amount)/count(sale_id), 2) as average_dollar_amount_per_sale

FROM sales

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df

||total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|
|0|98739408|98.7|1537617|64.22|
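
The derived columns in the output above can be cross-checked with plain arithmetic. This is a small sketch that uses only the two base figures from the result (total dollars and number of sales). Note that Python's `round` on floats and PostgreSQL's `round` on `numeric` treat exact ties differently, but neither value here is a tie, so the two agree.

```python
# base figures taken from the query result above
total_sales_dollars = 98739408
total_number_of_sales = 1537617

# same derivations as the SQL: scale to millions, and divide dollars by sales
total_sales_million_dollars = round(total_sales_dollars / 1_000_000, 1)
average_dollar_amount_per_sale = round(total_sales_dollars / total_number_of_sales, 2)

print(total_sales_million_dollars, average_dollar_amount_per_sale)  # 98.7 64.22
```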


# 1.1.2 Total sales as a dollar amount, total number of sales, average dollar amount per sale by store


Each record in the sales table is an individual sale, and the total_amount column is the total amount for that individual sale.

For store_name use the store's city.

Sort by store_name in alphabetical order.

Write 1 and only 1 query.  Note that the query may have as many subqueries, including "with" clauses, as you wish.  

Name column headers exactly as shown in the example below. 

Format data exactly as shown in the example below.

Ensure that when you check this Jupyter Notebook into GitHub, the query results in the Pandas dataframe are clearly visible in GitHub.

The query should return 5 rows into a Pandas dataframe. The first and last rows should look similar to this: 

||store_name|total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|---|
|0|Berkeley|25041060|25.0|390375|64.15|
|...|...|...|...|...|...|
|4|Seattle|22024512|22.0|342327|64.34|

In [5]:
rollback_before_flag = True
rollback_after_flag = True

query = """

SELECT s.city as store_name,
    sum(sa.total_amount) as total_sales_dollars, 
    round(sum(sa.total_amount)/1000000, 1) as total_sales_million_dollars, 
    count(sa.sale_id) as total_number_of_sales,
    round(sum(sa.total_amount)/count(sa.sale_id), 2) as average_dollar_amount_per_sale

FROM sales as sa
    JOIN stores as s
        ON s.store_id = sa.store_id

GROUP BY store_name
ORDER BY store_name ASC


"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df

||store_name|total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|---|
|0|Berkeley|25041060|25.0|390375|64.15|
|1|Dallas|19408260|19.4|302120|64.24|
|2|Miami|17692404|17.7|275074|64.32|
|3|Nashville|14573172|14.6|227721|64.0|
|4|Seattle|22024512|22.0|342327|64.34|
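
One way to sanity-check a grouped result is to confirm that the groups add back up to the overall figures. The sketch below does this with the five store rows above and the totals reported in 1.1.1. It checks only the internal consistency of the two outputs, not the correctness of the underlying data.

```python
# (store_name, total_sales_dollars, total_number_of_sales) from the result above
store_rows = [
    ("Berkeley", 25041060, 390375),
    ("Dallas", 19408260, 302120),
    ("Miami", 17692404, 275074),
    ("Nashville", 14573172, 227721),
    ("Seattle", 22024512, 342327),
]

dollars_across_stores = sum(dollars for _name, dollars, _count in store_rows)
sales_across_stores = sum(count for _name, _dollars, count in store_rows)

# overall figures from 1.1.1
print(dollars_across_stores == 98739408)  # True
print(sales_across_stores == 1537617)     # True
```

If the join to `stores` had dropped or duplicated sales rows, these sums would no longer match the ungrouped totals.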


# 1.1.3 Total sales as a dollar amount, total number of sales, average dollar amount per sale by month

Each record in the sales table is an individual sale, and the total_amount column is the total amount for that individual sale.

Derive the month_number (1 = January) and the month from the sale_date.

Sort by month_number.

Write 1 and only 1 query.  Note that the query may have as many subqueries, including "with" clauses, as you wish.  

Name column headers exactly as shown in the example below. 

Format data exactly as shown in the example below.

Ensure that when you check this Jupyter Notebook into GitHub, the query results in the Pandas dataframe are clearly visible in GitHub.

The query should return 12 rows into a Pandas dataframe. The first and last rows should look similar to this: 

||month_number|month|total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|---|---|
|0|1|January  |7803828|7.8|121955|63.99|
|...|...|...|...|...|...|...|
|11|12|December |8340420|8.3|130209|64.05|
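
The trailing spaces after the month names in the reference table are consistent with how PostgreSQL's `to_char(..., 'Month')` behaves: the name is blank-padded to 9 characters, the length of "September". The following sketch imitates that padding in Python with `str.ljust`. It is an illustration of the padding only, not of how PostgreSQL implements it.

```python
# month names padded to 9 characters, as to_char(..., 'Month') would return them
for name in ["January", "May", "September", "December"]:
    padded = name.ljust(9)
    print(repr(padded), len(padded))

january_padded = "January".ljust(9)   # 'January  ' (two trailing spaces)
december_padded = "December".ljust(9)  # 'December ' (one trailing space)
```

If unpadded names were wanted instead, PostgreSQL offers the `FM` prefix (`'FMMonth'`), but the reference output here keeps the padding.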

In [6]:
rollback_before_flag = True
rollback_after_flag = True

query = """

SELECT extract(month from sale_date) as month_number,
    to_char(sale_date, 'Month') as month,
    sum(total_amount) as total_sales_dollars, 
    round(sum(total_amount)/1000000, 1) as total_sales_million_dollars, 
    count(sale_id) as total_number_of_sales,
    round(sum(total_amount)/count(sale_id), 2) as average_dollar_amount_per_sale

FROM sales

GROUP BY month_number, month
ORDER BY month_number ASC

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df

||month_number|month|total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|---|---|
|0|1|January|7803828|7.8|121955|63.99|
|1|2|February|7574280|7.6|117984|64.2|
|2|3|March|8779620|8.8|136653|64.25|
|3|4|April|8251284|8.3|128155|64.39|
|4|5|May|7977840|8.0|124380|64.14|
|5|6|June|8124108|8.1|126248|64.35|
|6|7|July|7993044|8.0|124290|64.31|
|7|8|August|9029808|9.0|140467|64.28|
|8|9|September|7578960|7.6|117974|64.24|
|9|10|October|8895108|8.9|138731|64.12|


# 1.1.4 Total sales as a dollar amount, total number of sales, average dollar amount per sale by store and month

Each record in the sales table is an individual sale, and the total_amount column is the total amount for that individual sale.

For store_name use the store's city.

Derive the month_number (1 = January) and the month from the sale_date.

Sort by store_name in alphabetical order then by month_number.

Write 1 and only 1 query.  Note that the query may have as many subqueries, including "with" clauses, as you wish.  

Name column headers exactly as shown in the example below. 

Format data exactly as shown in the example below.

Ensure that when you check this Jupyter Notebook into GitHub, the query results in the Pandas dataframe are clearly visible in GitHub.

The query should return 60 rows into a Pandas dataframe. The first and last rows should look similar to this: 

||store_name|month_number|month|total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|---|---|---|
|0|Berkeley|1|January  |1988904|2.0|31045|64.07|
|...|...|...|...|...|...|...|...|
|59|Seattle|12|December |1876056|1.9|29136|64.39|
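
The expected count of 60 rows follows from grouping on two keys: 5 stores times 12 months. The sketch below enumerates the group keys in the required sort order, assuming every store has at least one sale in every month. That assumption is not stated in the problem, only implied by the expected row count.

```python
from itertools import product

# store names as they appear in the 1.1.2 result
store_names = ["Berkeley", "Dallas", "Miami", "Nashville", "Seattle"]
month_names = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# one group per (store, month); sort by store_name, then month_number
group_keys = sorted(
    (store, month_number, month_names[month_number - 1])
    for store, month_number in product(store_names, range(1, 13))
)

print(len(group_keys))   # 60
print(group_keys[0])     # ('Berkeley', 1, 'January')
print(group_keys[-1])    # ('Seattle', 12, 'December')
```

The first and last keys line up with rows 0 and 59 of the reference table.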

In [7]:
rollback_before_flag = True
rollback_after_flag = True

query = """

SELECT s.city as store_name,
    extract(month from sa.sale_date) as month_number,
    to_char(sa.sale_date, 'Month') as month,
    sum(sa.total_amount) as total_sales_dollars, 
    round(sum(sa.total_amount)/1000000, 1) as total_sales_million_dollars, 
    count(sa.sale_id) as total_number_of_sales,
    round(sum(sa.total_amount)/count(sa.sale_id), 2) as average_dollar_amount_per_sale

FROM sales as sa
    JOIN stores as s
        ON s.store_id = sa.store_id

GROUP BY store_name, month_number, month
ORDER BY store_name, month_number ASC

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df

||store_name|month_number|month|total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|---|---|---|
|0|Berkeley|1|January|1988904|2.0|31045|64.07|
|1|Berkeley|2|February|1930272|1.9|30062|64.21|
|2|Berkeley|3|March|2224500|2.2|34704|64.1|
|3|Berkeley|4|April|2092056|2.1|32589|64.2|
|4|Berkeley|5|May|2019264|2.0|31485|64.13|
|5|Berkeley|6|June|2065140|2.1|32153|64.23|
|6|Berkeley|7|July|2034708|2.0|31582|64.43|
|7|Berkeley|8|August|2286732|2.3|35676|64.1|
|8|Berkeley|9|September|1922256|1.9|29876|64.34|
|9|Berkeley|10|October|2248008|2.2|35199|63.87|


# 1.1.5 Total sales as a dollar amount, total number of sales, average dollar amount per sale by day of week

Each record in the sales table is an individual sale, and the total_amount column is the total amount for that individual sale.

Derive the dow (0 = Sunday) and the day_of_week from the sale_date.

Sort by dow.

Write 1 and only 1 query.  Note that the query may have as many subqueries, including "with" clauses, as you wish.  

Name column headers exactly as shown in the example below. 

Format data exactly as shown in the example below.

Ensure that when you check this Jupyter Notebook into GitHub, the query results in the Pandas dataframe are clearly visible in GitHub.

**Note: the reference output is in Markdown which drops trailing zeros.  Pandas does not drop trailing zeros.  This is ok.**

The query should return 7 rows into a Pandas dataframe. The first and last rows should look similar to this: 

||dow|day_of_week|total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|---|---|
|0|0|Sunday   |18589068|18.6|289869|64.13|
|...|...|...|...|...|...|...|
|6|6|Saturday |19421460|19.4|302055|64.3|
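
The `dow` numbering used here (0 = Sunday through 6 = Saturday) is what PostgreSQL's `extract(dow from ...)` returns, and it differs from Python's `date.weekday()`, which starts at 0 = Monday. The sketch below shows one way to reproduce the PostgreSQL numbering in Python when checking results by hand. The helper name `postgres_dow` is hypothetical.

```python
from datetime import date


def postgres_dow(d):
    """Day of week numbered like PostgreSQL's extract(dow ...): 0 = Sunday."""
    # isoweekday() runs 1 = Monday .. 7 = Sunday, so modulo 7 maps Sunday to 0
    return d.isoweekday() % 7


sunday = date(2025, 6, 1)
saturday = date(2025, 6, 7)

print(sunday.weekday(), postgres_dow(sunday))      # 6 0
print(saturday.weekday(), postgres_dow(saturday))  # 5 6
```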

In [8]:
rollback_before_flag = True
rollback_after_flag = True

query = """

SELECT extract(dow from sale_date) as dow,
    to_char(sale_date, 'Day') as day_of_week,
    sum(total_amount) as total_sales_dollars, 
    round(sum(total_amount)/1000000, 1) as total_sales_million_dollars, 
    count(sale_id) as total_number_of_sales,
    round(sum(total_amount)/count(sale_id), 2) as average_dollar_amount_per_sale

FROM sales

GROUP BY dow, day_of_week
ORDER BY dow ASC

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df

||dow|day_of_week|total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|---|---|
|0|0|Sunday|18589068|18.6|289869|64.13|
|1|1|Monday|13167720|13.2|204909|64.26|
|2|2|Tuesday|6895332|6.9|107488|64.15|
|3|3|Wednesday|13952556|14.0|217288|64.21|
|4|4|Thursday|13834644|13.8|214969|64.36|
|5|5|Friday|12878628|12.9|201039|64.06|
|6|6|Saturday|19421460|19.4|302055|64.3|


# 1.1.6 Total sales as a dollar amount, total number of sales, average dollar amount per sale by store and day of week


Each record in the sales table is an individual sale, and the total_amount column is the total amount for that individual sale.

For store_name use the store's city.

Derive the dow (0 = Sunday) and the day_of_week from the sale_date.

Sort by store_name, then by dow.

Write 1 and only 1 query.  Note that the query may have as many subqueries, including "with" clauses, as you wish.

Name column headers exactly as shown in the example below. 

Format data exactly as shown in the example below.

Sort data exactly as shown in the example below.

Ensure that when you check this Jupyter Notebook into GitHub, the query results in the Pandas dataframe are clearly visible in GitHub.


The query should return 35 rows into a Pandas dataframe. The first and last rows should look similar to this: 

||store_name|dow|day_of_week|total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|---|---|---|
|0|Berkeley|0|Sunday   |4694640|4.7|73481|63.89|
|...|...|...|...|...|...|...|...|
|34|Seattle|6|Saturday |4336704|4.3|67220|64.52|

In [9]:
rollback_before_flag = True
rollback_after_flag = True

query = """

SELECT s.city as store_name,
    extract(dow from sale_date) as dow,
    to_char(sale_date, 'Day') as day_of_week,
    sum(sa.total_amount) as total_sales_dollars, 
    round(sum(sa.total_amount)/1000000, 1) as total_sales_million_dollars, 
    count(sa.sale_id) as total_number_of_sales,
    round(sum(sa.total_amount)/count(sa.sale_id), 2) as average_dollar_amount_per_sale

FROM sales as sa
    JOIN stores as s
        ON s.store_id = sa.store_id

GROUP BY store_name, dow, day_of_week
ORDER BY store_name, dow ASC

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df

||store_name|dow|day_of_week|total_sales_dollars|total_sales_million_dollars|total_number_of_sales|average_dollar_amount_per_sale|
|---|---|---|---|---|---|---|---|
|0|Berkeley|0|Sunday|4694640|4.7|73481|63.89|
|1|Berkeley|1|Monday|3340116|3.3|52072|64.14|
|2|Berkeley|2|Tuesday|1752036|1.8|27281|64.22|
|3|Berkeley|3|Wednesday|3546144|3.5|55216|64.22|
|4|Berkeley|4|Thursday|3507660|3.5|54561|64.29|
|5|Berkeley|5|Friday|3273240|3.3|51071|64.09|
|6|Berkeley|6|Saturday|4927224|4.9|76693|64.25|
|7|Dallas|0|Sunday|3650748|3.7|56896|64.17|
|8|Dallas|1|Monday|2602980|2.6|40280|64.62|
|9|Dallas|2|Tuesday|1352760|1.4|21137|64.0|
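
Because all seven Berkeley rows appear in the output above, they can be checked against the Berkeley totals from 1.1.2. The sketch below is an internal-consistency check on the displayed figures only. The other stores are not fully shown here, so they are not checked.

```python
# (day_of_week, total_sales_dollars, total_number_of_sales) for Berkeley, from the result above
berkeley_by_day = [
    ("Sunday", 4694640, 73481),
    ("Monday", 3340116, 52072),
    ("Tuesday", 1752036, 27281),
    ("Wednesday", 3546144, 55216),
    ("Thursday", 3507660, 54561),
    ("Friday", 3273240, 51071),
    ("Saturday", 4927224, 76693),
]

berkeley_dollars = sum(dollars for _day, dollars, _count in berkeley_by_day)
berkeley_sales = sum(count for _day, _dollars, count in berkeley_by_day)

# Berkeley totals from 1.1.2
print(berkeley_dollars == 25041060)  # True
print(berkeley_sales == 390375)      # True
```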
