# Project 2, Part 6, Preliminary analytics

University of California, Berkeley

Master of Information and Data Science (MIDS) program

w205 - Fundamentals of Data Engineering

Student: Sophie Yeh

Year: 2022

Semester: Spring

Section: 8


# Included Modules and Packages

Code cell containing your includes for modules and packages

In [1]:
import csv
import numpy as np
import pandas as pd
import psycopg2
import math

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  

Remember you can freely use any code from the labs. You do not need to cite code from the labs.

In [3]:
#
# function to run a select query and return rows in a pandas dataframe
# pandas reads all numeric values from postgres as floats
# if every value in a column is a whole number, change the column to integer
#
def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    "function to run a select query and return rows in a pandas dataframe"
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # fix the float columns that really should be integers
    
    for column in df:
    
        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:
                
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True

            if not fraction_flag:
                df[column] = df[column].astype('Int64')
    
    return df


# connect to postgres database
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)
# create a cursor for the connection
cursor = connection.cursor()
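
The integer clean-up inside `my_select_query_pandas` can also be written without the inner Python loop. The following is a minimal sketch rather than the lab's code: `floats_to_nullable_int` is a hypothetical helper name, and the sample dataframe is made-up data used only to show the behaviour.

```python
import numpy as np
import pandas as pd


def floats_to_nullable_int(df):
    """Return a copy of df where whole-number float64 columns become Int64."""
    out = df.copy()
    for column in out.columns:
        series = out[column]
        if series.dtype != "float64":
            continue
        non_null = series.dropna()
        # Same test as the loop version: no non-null value has a fractional part.
        if (non_null % 1 == 0).all():
            out[column] = series.astype("Int64")  # NaN becomes <NA>
    return out


# Made-up data for illustration only.
sample = pd.DataFrame({"whole": [1.0, 2.0, np.nan], "fractional": [1.5, 2.0, 3.0]})
converted = floats_to_nullable_int(sample)
print(converted.dtypes)
```

The nullable `Int64` dtype is what lets a column keep missing values while still displaying whole numbers, which is why the original function uses it instead of plain `int64`.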

# 2.6.1 Total dollar amount of sales

Write a query to sum the total_amount in the stage_1_peak_sales table and present the sum in a Pandas dataframe with appropriate column header name.

It is fine to leave the sum as is.  You do not have to format it or put in dollar signs.

Remember that you need to convert varchars to numeric using column::numeric before doing any math on them.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [6]:
def dollar_sales():
    query = """
        SELECT sum(total_amount::numeric) as sum_total_dollars
        FROM stage_1_peak_sales
        """
    return my_select_query_pandas(query, True, True).style.set_caption("Total Dollar Amount of Sales")
dollar_sales()

Unnamed: 0,sum_total_dollars
0,6480
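
The reminder about `column::numeric` matters because the staging columns hold text. As a rough Python analogy (not a demonstration of Postgres itself, which rejects `sum` on a varchar column outright), adding text values concatenates them, while converting first gives a real sum. The amounts below are made-up values, not rows from `stage_1_peak_sales`.

```python
from decimal import Decimal

# Made-up amounts stored as text, the way a varchar staging column holds them.
raw_amounts = ["12", "48", "60"]

# Without conversion, "+" on strings concatenates instead of adding.
concatenated = raw_amounts[0] + raw_amounts[1] + raw_amounts[2]

# Converting each value first (the role column::numeric plays in the query) gives a sum.
total = sum(Decimal(value) for value in raw_amounts)

print(concatenated)  # 124860
print(total)         # 120
```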


# 2.6.2 Total number of sales

Write a query to count the total number of sales in the stage_1_peak_sales table and present the count in a Pandas dataframe with appropriate column header name.  Each record in the stage_1_peak_sales table is a sale.

It is fine to leave the count as is.  You do not have to format it.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [7]:
def num_sales():
    query = """
        SELECT count(sale_id) as sum_num_sales
        FROM stage_1_peak_sales
        """
    return my_select_query_pandas(query, True, True).style.set_caption("Total Number of Sales")
num_sales()

Unnamed: 0,sum_num_sales
0,97


# 2.6.3 Total dollar amount of sales, total cut paid to Peak, net to AGM

AGM is paying Peak an 18% cut to deliver the meals.

Write a query to calculate the total dollar amount of sales, the total cut paid to Peak, and the net to AGM.  

You may want to round to 2 decimal places for the total cut paid to Peak and the net to AGM, as they will be decimal.  

You do not need to format the numbers with commas, dollar signs, etc.

Remember that you need to convert varchars to numeric using column::numeric before doing any math on them.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [8]:
def sales_cut():
    query = """
        SELECT sum(total_amount::numeric) as sum_total_dollars,
               cast(0.18*sum(total_amount::numeric) as varchar) as peak_cut,
               cast(sum(total_amount::numeric)*(1-0.18) as varchar) as agm_cut
        FROM stage_1_peak_sales
        """
    return my_select_query_pandas(query, True, True).style.set_caption("Total Dollar Amount of Sales and Cuts")
sales_cut()

Unnamed: 0,sum_total_dollars,peak_cut,agm_cut
0,6480,1166.4,5313.6
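
The figures in this output can be cross-checked by hand: 18% of 6480 is the cut paid to Peak, and the remaining 82% is the net to AGM. A small sketch of that arithmetic follows, using `Decimal` so the two-decimal rounding is exact. The variable names are illustrative and are not part of the assignment.

```python
from decimal import Decimal, ROUND_HALF_UP

total_sales = Decimal("6480")  # sum_total_dollars from the query output
peak_rate = Decimal("0.18")    # Peak's cut, per the problem statement

cents = Decimal("0.01")
peak_cut = (total_sales * peak_rate).quantize(cents, rounding=ROUND_HALF_UP)
agm_net = (total_sales - peak_cut).quantize(cents, rounding=ROUND_HALF_UP)

print(peak_cut)  # 1166.40
print(agm_net)   # 5313.60
```

These equal the 1166.4 and 5313.6 shown in the output, and the two parts add back up to 6480.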


# 2.6.4 Total number of meals sold

Write a query to sum the quantity in the stage_1_peak_line_items table and present the sum in a Pandas dataframe with appropriate column header name.

It is fine to leave the number as is.  You do not have to format it.

Remember that you need to convert varchars to numeric using column::numeric before doing any math on them.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [9]:
def total_meals():
    query = """
        SELECT sum(quantity::numeric) as sum_quantity
        FROM stage_1_peak_line_items
        """
    return my_select_query_pandas(query, True, True).style.set_caption("Total Number of Meals Sold")
total_meals()

Unnamed: 0,sum_quantity
0,540


# 2.6.5 Total number of meals sold by meal

Expanding on the last query, group the sum of quantity by meal.  Display the meal followed by the number of meals sold. 

Sort by highest number sold first.

Note that you will need to use the peak_product_mapping table and the products table in addition to the stage_1_peak_line_items table.

It is fine to leave the numbers as is.  You do not have to format them.

Remember that you need to convert varchars to numeric using column::numeric before doing any math on them.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [10]:
def meal_ranks():
    query = """
        SELECT p2.description as meal, sum(quantity::numeric) as quantity
        FROM stage_1_peak_line_items as l
            JOIN peak_product_mapping as p1
            ON p1.peak_product_id = l.product_id::numeric
            JOIN products as p2
            ON p2.product_id = p1.product_id
        GROUP BY p2.description
        ORDER BY quantity DESC
        """
    return my_select_query_pandas(query, True, True).style.set_caption("Total Number of Meals Sold By Meal")
meal_ranks()

Unnamed: 0,meal,quantity
0,Pistachio Salmon,113
1,Eggplant Lasagna,107
2,Curry Chicken,101
3,Teriyaki Chicken,80
4,Brocolli Stir Fry,60
5,Tilapia Piccata,44
6,Spinach Orzo,27
7,Chicken Salad,8
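
As a consistency check, the per-meal quantities should add up to the total number of meals found in 2.6.4, and the rows should appear in descending order. A short sketch using the values copied from the output above (meal names are kept exactly as they appear in the data):

```python
# Quantities copied from the query output above.
meals_sold = {
    "Pistachio Salmon": 113,
    "Eggplant Lasagna": 107,
    "Curry Chicken": 101,
    "Teriyaki Chicken": 80,
    "Brocolli Stir Fry": 60,
    "Tilapia Piccata": 44,
    "Spinach Orzo": 27,
    "Chicken Salad": 8,
}

total_meals_sold = sum(meals_sold.values())
quantities = list(meals_sold.values())
is_descending = quantities == sorted(quantities, reverse=True)

print(total_meals_sold)  # 540, matching sum_quantity in 2.6.4
print(is_descending)     # True
```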


# 2.6.6 Average number of meals per sale

Write a query to find the average number of meals per sale, which should be equal to the total number of meals sold divided by the total number of sales, both of which we have calculated before.

You may want to round to 1 decimal place.

Remember that you need to convert varchars to numeric using column::numeric before doing any math on them.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [11]:
def avg_meals():
    query = """
        SELECT cast(round(sum(l.quantity::numeric)/count(distinct s.sale_id), 1) as varchar) as quantity
        FROM stage_1_peak_sales as s
            JOIN stage_1_peak_line_items as l
            ON s.sale_id = l.sale_id
        """
    return my_select_query_pandas(query, True, True).style.set_caption("Average Number of Meals per Sale")
avg_meals()

Unnamed: 0,quantity
0,5.6
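
The result can be verified against the earlier totals: 540 meals (2.6.4) divided by 97 sales (2.6.2). A minimal sketch of that division, rounded to 1 decimal place, with illustrative variable names:

```python
from decimal import Decimal, ROUND_HALF_UP

total_meals_sold = Decimal("540")  # sum_quantity from 2.6.4
total_sales = Decimal("97")        # sum_num_sales from 2.6.2

average = (total_meals_sold / total_sales).quantize(
    Decimal("0.1"), rounding=ROUND_HALF_UP
)

print(average)  # 5.6
```

The join-based query agrees with this figure as long as every sale has at least one line item, since an inner join would otherwise drop sales with no line items from the distinct count.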
