# Project 2, Part 7, Executive summary

University of California, Berkeley

Master of Information and Data Science (MIDS) program

w205 - Fundamentals of Data Engineering

Student: Sophie Yeh

Year: 2022

Semester: Spring

Section: 8


# Included Modules and Packages

Code cell containing your includes for modules and packages

In [2]:
import csv
import numpy as np
import pandas as pd
import psycopg2
import math

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  

Remember you can freely use any code from the labs. You do not need to cite code from the labs.

In [3]:
#
# function to run a select query and return rows in a pandas dataframe
# pandas puts all numeric values from postgres to float
# if it will fit in an integer, change it to integer
#
def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    "function to run a select query and return rows in a pandas dataframe"
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # fix the float columns that really should be integers
    
    for column in df:
    
        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:
                
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True

            if not fraction_flag:
                df[column] = df[column].astype('Int64')
    
    return(df)


#connect to postgres database
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)
# create cursor for connections
cursor = connection.cursor()
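
The integer clean-up inside `my_select_query_pandas` can also be written with vectorized pandas operations instead of a value-by-value loop. The sketch below is an illustration only, not the lab helper itself: `floats_to_nullable_ints` and the sample values are hypothetical, and it assumes the columns contain no infinite values.

```python
import numpy as np
import pandas as pd


def floats_to_nullable_ints(df):
    """Return a copy where whole-number float64 columns become nullable Int64."""
    out = df.copy()
    for column in out.columns:
        if out[column].dtype == "float64":
            non_null = out[column].dropna()
            # Convert only when every non-missing value is a whole number.
            if (non_null == np.floor(non_null)).all():
                out[column] = out[column].astype("Int64")
    return out


# Hypothetical sample: one count-like column and one column with real fractions.
sample = pd.DataFrame({
    "sales_count": [624.0, 97.0, np.nan],
    "avg_amount": [66.80, 63.87, np.nan],
})
converted = floats_to_nullable_ints(sample)
print(converted.dtypes)
```

The nullable `Int64` dtype is what lets a column hold integers and missing values at the same time, which is why the helper uses it rather than plain `int64`.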

# 2.7.1 Executive summary

Write an executive summary.  

The summary should be the equivalent to 3/4 to 1 page using standard fonts, spacing, and margins. 

You may write about any aspect (or aspects).  Basically figure out what you think is the most important aspect (or aspects) that the executives would want to know.  

It could be related to the process itself. Such as how you were able to take a dataset, load it into staging tables, and get analytics very quickly, instead of a months long traditional waterfall process. 

It could be related to the preliminary analytics.  Any insights you gained.  Possible comparison to the analytics from project 1.  Do the delivery sales have different patterns or the same patterns as in store sales?  Is this a good way to grow sales? etc.

It could be related to both.

You are not required to write any queries nor create any data visualizations.  However, you may want to include some to enhance and add quality to your submission.  Submissions with these tend to be higher quality, although, not always.

You may use any number of code cells and/or markdown cells. 

You may alternate between code cells and markdown cells.  That is perfectly fine.  It is understood that before we present it, an editor would pull out the text, results of queries, and data visualizations.

----


Peak Deliveries provided AGM with one day of sales data, which gave us the opportunity to understand how their data can be integrated into AGM's database and to evaluate the performance of their service. Validating and exploring the data confirmed that Peak Deliveries provides clean data with no gaps. Mapping Peak's codes to AGM's was straightforward, with the exception of customer information, where typos and misspellings are frequent. Because the volume was small, these were resolved with fuzzy join techniques. Working through the data wrangling process for this single day of sales has allowed AGM to develop a streamlined process that can be reused as Peak continues to send data. 

The ability of third-party channels to provide clear, usable data has many benefits. Not only does it allow us to withhold our database information from external parties, but the datasets can also be used for business analytics. Recently, AGM chose to do a proof of concept (POC) using Peak Deliveries. The goal of this three-month trial at the Berkeley store is to determine whether adding a delivery option will increase sales, grow the customer base, and increase profitability.


## Did sales increase?



When comparing total sales, Peak's one-day count of 97 delivery sales is well below AGM's October average of 624 sales a day at the Berkeley store (counting customers within 5 miles), as shown in the table below. 

In [51]:
def sales():
    query = """

    SELECT cast(avg(sales_count) as numeric(10,0)) as avg_agm_sales
    FROM(
        SELECT a.sale_date,
               count(a.sale_id) as sales_count
        FROM sales as a
            JOIN stores as s
            ON s.store_id = a.store_id
            JOIN customers as c
            on c.customer_id = a.customer_id
        WHERE s.city = 'Berkeley' and EXTRACT(MONTH FROM a.sale_date) = 10 and c.distance <= 5
        GROUP BY sale_date
        ORDER BY sale_date) as a

    """
    agm_sales = my_select_query_pandas(query, True, True) #.style.set_caption("AGM's Average Daily Sales")

    query = """
        SELECT count(sale_id) as peak_sales
        FROM stage_1_peak_sales
        """
    peak_sales = my_select_query_pandas(query, True, True) #.style.set_caption("Peak's Total Number of Sales")

    avg_daily_sales = pd.concat([agm_sales, peak_sales], axis=1)
    return avg_daily_sales.style.set_caption("AGM's Avg Daily Sales vs Peak's One-Day Sales Count")

sales()

| | avg_agm_sales | peak_sales |
|---|---|---|
| 0 | 624 | 97 |


However, the average number of meals per sale is higher for Peak than for AGM: it rose from 5.3 to 5.6, suggesting that the delivery service may encourage people to buy more meals.

In [34]:
## do people buy more meals?

query = """
    SELECT cast(round(sum(l.quantity::numeric)/count(distinct s.sale_id), 1) as varchar) as peak_quantity
    FROM stage_1_peak_sales as s
        JOIN stage_1_peak_line_items as l
        ON s.sale_id = l.sale_id
    """
df1 = my_select_query_pandas(query, True, True)# .style.set_caption("Peak - Average Number of Meals per sale")

query = """

    SELECT SUM(l.quantity)/COUNT(DISTINCT l.sale_id) as agm_quantity
    FROM line_items as l
        JOIN stores as s 
        ON l.store_id = s.store_id
    WHERE s.city = 'Berkeley'
    GROUP BY s.city
    ORDER BY s.city
    """
df2 = my_select_query_pandas(query, True, True)

df3 = pd.concat([df1,df2],axis=1).style.set_caption('Average number of meals per sale in Berkeley')
df3

| | peak_quantity | agm_quantity |
|---|---|---|
| 0 | 5.6 | 5.345514 |
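
To put the two comparisons above side by side, the short calculation below expresses Peak's one-day sales count as a share of AGM's average day and the meals-per-sale difference as a percentage. It is a rough sketch that uses only the figures reported in the two tables above; the variable names are illustrative, and one day of data is too little to treat either percentage as stable.

```python
# Figures copied from the two query results above.
agm_avg_daily_sales = 624
peak_one_day_sales = 97
agm_meals_per_sale = 5.345514
peak_meals_per_sale = 5.6

# Peak's single day of sales relative to an average AGM day.
peak_share = peak_one_day_sales / agm_avg_daily_sales

# Relative difference in meals per sale (Peak versus AGM).
meals_increase = (peak_meals_per_sale - agm_meals_per_sale) / agm_meals_per_sale

print(f"Peak sales as a share of an average AGM day: {peak_share:.1%}")
print(f"Increase in meals per sale: {meals_increase:.1%}")
```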


## Constant customer base
We were able to find exact matches for all Peak customers in the AGM database, so the trial does not yet suggest that the customer base will grow. However, the customer base should be monitored once the delivery option is extended to greater distances. Project 1 concluded that most of AGM's customers already live within 16 miles of the store. To widen the customer base, the delivery option should be offered over a larger radius, so that people who previously did not order because the store was too far away would be more likely to join the customer base.

In [52]:
def cust_base():
    query = """
    select cu1.first_name,
           cu1.last_name,
           cu1.street,
           cu2.first_name as agm_first_name,
           cu2.last_name as agm_last_name,
           cu2.street as agm_street
    from stage_1_peak_customers as cu1
         left join customers as cu2
         on levenshtein_less_equal(cu1.first_name, cu2.first_name, 5) < 6
            and levenshtein_less_equal(cu1.last_name, cu2.last_name, 5) < 3
            and levenshtein_less_equal(cu1.street, cu2.street, 5) < 6
    order by first_name
    ;
    """
    return my_select_query_pandas(query, True, True).style.set_caption("Customer records matching based on Levenshtein distance")
cust_base()


| | first_name | last_name | street | agm_first_name | agm_last_name | agm_street |
|---|---|---|---|---|---|---|
| 0 | Adelice | Cosyns | 95594 Kennedy Alley | Adelice | Cosyns | 95594 Kennedy Alley |
| 1 | Amber | Peniello | 1537 Kipling Drive | Amber | Peniello | 1537 Kipling Drive |
| 2 | Antonie | Jakubski | 262 Manufacturers Road | Antonie | Jakubski | 262 Manufacturers Road |
| 3 | Ardyce | Lauderdale | 483 Pearson Point | Ardyce | Lauderdale | 483 Pearson Point |
| 4 | Armand | Olyet | 48 Valley Edge Plaza | Armand | Olyet | 48 Valley Edge Plaza |
| 5 | Arv | Doret | 2120 Mesta Circle | Arv | Doret | 2120 Mesta Circle |
| 6 | Aurilia | Sand | 2187 Carpenter Pass | Aurilia | Sand | 2187 Carpenter Pass |
| 7 | Babette | Patifield | 818 Dryden Circle | Babette | Patifield | 818 Dryden Circle |
| 8 | Barny | Cheal | 8945 Vera Lane | Barny | Cheal | 8945 Vera Lane |
| 9 | Basile | Wrassell | 59 Burrows Parkway | Basile | Wrassell | 59 Burrows Parkway |
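
The join above depends on `levenshtein_less_equal` from PostgreSQL's `fuzzystrmatch` extension. To make the matching rule concrete, here is a minimal pure-Python sketch of the same idea. It is an illustration, not the database's implementation: `levenshtein` and `is_match` are hypothetical helper names, the limits of 5, 2, and 5 restate the query's `< 6`, `< 3`, and `< 6` conditions, and the misspelled record is invented for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def is_match(peak, agm, limits=(5, 2, 5)):
    """Compare (first_name, last_name, street) tuples field by field."""
    return all(levenshtein(p, a) <= limit
               for p, a, limit in zip(peak, agm, limits))


agm_record = ("Adelice", "Cosyns", "95594 Kennedy Alley")
# Invented misspelling, for illustration only.
typo_record = ("Adelise", "Cosyns", "95594 Kenedy Alley")
other_record = ("Amber", "Peniello", "1537 Kipling Drive")

print(is_match(agm_record, agm_record))    # identical record
print(is_match(typo_record, agm_record))   # small typos stay within the limits
print(is_match(other_record, agm_record))  # different customer
```

Because every Peak customer in the output above matched an AGM record exactly, each of these distances would be 0 for the rows shown; the thresholds only matter once misspelled records appear.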


## Increased Profitability

The Peak sales data had an average of \\$66.80 per transaction on a single day in October. Compared with AGM's average of \\$63.87 per transaction for the month of October, using Peak's service increased the average transaction by \\$2.93. This is \\$0.77 above Peak's estimated charge of \\$2.16 per meal. Therefore, yes, the average sale amount did increase. 

In [53]:
def avg_profit():
    query = """

    SELECT EXTRACT(MONTH FROM sale_date::date) as month_num, 
           to_char(sale_date::date, 'MONTH') as month_name, 
           cast(avg(total_amount::numeric) as numeric(10,2)) as peak_profit
    FROM stage_1_peak_sales
    GROUP BY month_num, month_name
    ORDER BY month_num, month_name

    """
    df1 = my_select_query_pandas(query, True, True)

    query = """

    SELECT EXTRACT(MONTH FROM a.sale_date) as month_num, 
           to_char(a.sale_date, 'MONTH') as month_name, 
           cast(avg(a.total_amount) as numeric(10,2)) as agm_profit
    FROM sales as a
        JOIN stores as s
        ON s.store_id = a.store_id
    WHERE s.city = 'Berkeley' and EXTRACT(MONTH FROM a.sale_date) = 10
    GROUP BY s.city, month_num, month_name
    ORDER BY s.city, month_num, month_name;

    """
    df2 = my_select_query_pandas(query, True, True)
    df3 = pd.concat([df1['peak_profit'],df2['agm_profit']], axis=1).style.set_caption("Average Total Dollar Amount per Sale in October, Berkeley store")
    return df3
avg_profit()

| | peak_profit | agm_profit |
|---|---|---|
| 0 | 66.8 | 63.87 |
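
The comparison in this section can be reproduced directly from the rounded values in the table above. The sketch below uses `Decimal` to avoid floating-point noise in the cents; the variable names are illustrative, and the \\$2.16 figure is Peak's estimated charge as quoted in the text, compared here against the per-transaction difference in the same way the text does.

```python
from decimal import Decimal

peak_avg = Decimal("66.80")  # peak_profit from the table above
agm_avg = Decimal("63.87")   # agm_profit from the table above
peak_fee = Decimal("2.16")   # Peak's estimated charge, as quoted in the text

# Difference in average transaction amount, Peak versus AGM.
uplift = peak_avg - agm_avg

# How much of that difference remains after Peak's estimated charge.
margin_over_fee = uplift - peak_fee

print(f"Uplift per transaction: ${uplift}")
print(f"Uplift minus estimated charge: ${margin_over_fee}")
```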


## Recommendations

The single day of sales data from Peak showed positive signs of higher sales per transaction and profitability. More data will help determine whether this growth is consistent. Although there were no additions to AGM's customer base, further testing is needed before drawing a firm conclusion. Since the majority of AGM's customers already live close to its stores, expanding the delivery radius will provide insight into how the service affects customers farther away.