# Project 1, Part 2, Customer Related Queries



# Included Modules and Packages

Code cell containing your includes for modules and packages

In [1]:
import math
import numpy as np
import pandas as pd

import psycopg2

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  Remember you can use any code from the labs.

In [2]:
#
# function to run a select query and return rows in a pandas dataframe
# pandas may return numeric values from postgres as float
# (for example numeric columns, or integer columns that contain NULLs)
# if every value in a float column is a whole number, change it to integer
#

def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    "function to run a select query and return rows in a pandas dataframe"
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # fix the float columns that really should be integers
    
    for column in df:
    
        if df[column].dtype == "float64":

            # ignore NaN (SQL NULL) values when looking for fractional parts
            values = df[column].dropna()

            if (values % 1 == 0).all():
                df[column] = df[column].astype('Int64')
    
    return df
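
To see the float-to-integer fix in isolation, here is a minimal sketch that applies the same idea to a hand-built dataframe, with no database connection. The function name `floats_to_nullable_ints` and the sample data are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def floats_to_nullable_ints(df):
    """Return a copy where whole-number float64 columns become Int64."""
    out = df.copy()
    for column in out.columns:
        if out[column].dtype == "float64":
            # NaN stands in for SQL NULL, so leave it out of the check
            values = out[column].dropna()
            if (values % 1 == 0).all():
                # Int64 (capital I) is the pandas nullable integer type
                out[column] = out[column].astype("Int64")
    return out


# Hypothetical sample: one column of whole numbers with a missing value,
# one column with a real fractional part
sample = pd.DataFrame({
    "whole": [1.0, 2.0, np.nan],
    "fractional": [0.5, 1.0, 2.0],
})

converted = floats_to_nullable_ints(sample)
print(converted.dtypes)
```

The `whole` column should come back as `Int64` with its missing value preserved, while `fractional` stays `float64`.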
    

In [3]:
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)

# 1.2.1 Total Number of Customers for all of AGM

Each record in the customer table is an individual customer.

Write 1 and only 1 query.  Note that the query may have as many subqueries, including "with" clauses, as you wish.  

Name column headers exactly as shown in the example below. 

Format data exactly as shown in the example below.

When you check this Jupyter Notebook into GitHub, ensure that the query results in the Pandas dataframe are clearly visible.

The query should return only 1 row into a Pandas dataframe and should look similar to this: 

||total_number_of_customers|
|---|---|
|0|31082|

In [4]:
rollback_before_flag = True
rollback_after_flag = True

query = """

SELECT count(customer_id) as total_number_of_customers
FROM customers

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df

||total_number_of_customers|
|---|---|
|0|31082|
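
A side note on `count(customer_id)` versus `count(*)`: the first counts only rows where the column is not NULL, the second counts every row. The sketch below illustrates the difference with an in-memory SQLite database standing in for Postgres and a made-up three-row table (both are assumptions for illustration). If `customer_id` is a primary key, as seems likely here, the two counts are the same.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Postgres
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, first_name TEXT);
    -- hypothetical rows; the third one has no customer_id
    INSERT INTO customers VALUES (1, 'A'), (2, 'B'), (NULL, 'C');
""")

count_column, count_rows = conn.execute(
    "SELECT count(customer_id), count(*) FROM customers"
).fetchone()

print(count_column, count_rows)
conn.close()
```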


# 1.2.2 Total Number of Customers by Store

Each record in the customer table is an individual customer.  

Use the customer's closest_store_id to decide which store they belong to.

For store_name use the store's city.

Sort by store_name in alphabetical order.

Write 1 and only 1 query.  Note that the query may have as many subqueries, including "with" clauses, as you wish.  

Name column headers exactly as shown in the example below. 

Format data exactly as shown in the example below.

When you check this Jupyter Notebook into GitHub, ensure that the query results in the Pandas dataframe are clearly visible.

The query should return 5 rows into a Pandas dataframe. The first and last rows should look similar to this: 

||store_name|total_number_of_customers|
|---|---|---|
|0|Berkeley|8138|
|...|...|...|
|4|Seattle|7214|

In [5]:
rollback_before_flag = True
rollback_after_flag = True

query = """

SELECT s.city as store_name,
    count(cu.customer_id) as total_number_of_customers
    
FROM customers as cu
    JOIN stores as s
        ON s.store_id = cu.closest_store_id

GROUP BY store_name
ORDER BY store_name ASC

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df

||store_name|total_number_of_customers|
|---|---|---|
|0|Berkeley|8138|
|1|Dallas|6359|
|2|Miami|5725|
|3|Nashville|3646|
|4|Seattle|7214|
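
As a cross-check of the join and group-by logic, the same steps can be sketched in pandas. This is only an illustration on hypothetical toy rows, not the real tables: customers are matched to stores through `closest_store_id`, counted per city, and sorted alphabetically.

```python
import pandas as pd

# Hypothetical toy rows; the real tables live in Postgres
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "closest_store_id": [10, 10, 20, 30, 20],
})
stores = pd.DataFrame({
    "store_id": [10, 20, 30],
    "city": ["Seattle", "Berkeley", "Dallas"],
})

per_store = (
    # inner join, like JOIN ... ON s.store_id = cu.closest_store_id
    customers.merge(stores, left_on="closest_store_id", right_on="store_id")
    # GROUP BY city with count(customer_id)
    .groupby("city")["customer_id"]
    .count()
    .rename("total_number_of_customers")
    .rename_axis("store_name")
    .reset_index()
    # ORDER BY store_name ASC
    .sort_values("store_name", ignore_index=True)
)

print(per_store)
```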


# 1.2.3 List of Customers who have signed up but not bought anything

Each record in the customer table is an individual customer.  

Find all customers who are in the customers table but don't have any sales in the sales table.  

Sort by customer last_name, then first_name.

Write 1 and only 1 query.  Note that the query may have as many subqueries, including "with" clauses, as you wish.  

Name column headers exactly as shown in the example below. 

Format data exactly as shown in the example below.

When you check this Jupyter Notebook into GitHub, ensure that the query results in the Pandas dataframe are clearly visible.

The query should return 35 rows into a Pandas dataframe. The first and last rows should look similar to this: 

||last_name|first_name|
|---|---|---|
|0|Agott|Tracy|
|...|...|...|
|34|Vedekhin|Cyrill|

In [6]:
rollback_before_flag = True
rollback_after_flag = True

query = """

SELECT last_name, first_name

FROM customers

WHERE customer_id NOT IN 
    (
    SELECT distinct sa.customer_id
    FROM sales as sa
    )

ORDER BY last_name, first_name

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df

||last_name|first_name|
|---|---|---|
|0|Agott|Tracy|
|1|Arnke|Daniella|
|2|Assandri|Hyacintha|
|3|Borman|Felice|
|4|Breit|Domini|
|5|Butterick|Jacenta|
|6|Camillo|Marysa|
|7|Dukelow|Lilas|
|8|Dukesbury|Corinna|
|9|Ellaway|Lorianna|
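
The `NOT IN` form works as long as `sales.customer_id` is never NULL. If the subquery ever returns a NULL, `NOT IN` matches no rows at all, while `NOT EXISTS` is unaffected. The sketch below shows this with an in-memory SQLite database standing in for Postgres and made-up rows. Both are assumptions for illustration, as is the presence of a NULL `customer_id` in sales.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Postgres
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, last_name TEXT, first_name TEXT);
    CREATE TABLE sales (sale_id INTEGER, customer_id INTEGER);
    -- hypothetical rows: only customer 2 has a sale, and one sale has no customer
    INSERT INTO customers VALUES (1, 'Alpha', 'Ann'), (2, 'Beta', 'Bob'), (3, 'Gamma', 'Gil');
    INSERT INTO sales VALUES (100, 2), (101, NULL);
""")

# NOT IN: the NULL in the subquery makes the test unknown for every non-buyer
not_in_rows = conn.execute("""
    SELECT last_name, first_name
    FROM customers
    WHERE customer_id NOT IN (SELECT customer_id FROM sales)
    ORDER BY last_name, first_name
""").fetchall()

# NOT EXISTS: a correlated anti-join that ignores the NULL row
not_exists_rows = conn.execute("""
    SELECT c.last_name, c.first_name
    FROM customers AS c
    WHERE NOT EXISTS (
        SELECT 1 FROM sales AS sa WHERE sa.customer_id = c.customer_id
    )
    ORDER BY c.last_name, c.first_name
""").fetchall()

print("NOT IN:    ", not_in_rows)
print("NOT EXISTS:", not_exists_rows)
conn.close()
```

With these toy rows, `NOT IN` returns an empty list and `NOT EXISTS` returns the two customers without sales.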


# 1.2.4 What is the percentage of customers per population at the zip code level?

Each record in the customer table is an individual customer.  

Use the customer's zip code (not the store's zip code). 

Use the zip_codes table in the secondary dataset to find the population in that zip code.

Sort by highest percentage first.  The percentage is rounded to 3 decimal places for display, but you may need to sort on the unrounded values.

**Note: When a query result has a large number of rows, Pandas will only display the first 5 rows, a row with ellipses, and the last 5 rows. This is ok.**

**Note: the reference output is in Markdown which drops trailing zeros.  Pandas does not drop trailing zeros.  This is ok.**

The query should return 550 rows into a Pandas dataframe. The first and last rows should look similar to this: 

||zip|percentage_customers_per_population|
|---|---|---|
|0|98164|1.29|
|...|...|...|
|549|75034|0.001|

In [7]:
rollback_before_flag = True
rollback_after_flag = True

query = """

WITH a as (
        SELECT zip, 
            count(customer_id) as id_count
        FROM customers
        GROUP BY zip
    ), b as (
        SELECT zip, 
            population as pop
        FROM zip_codes
    )

-- cast the count to numeric so the division cannot truncate to an integer
SELECT a.zip,
    round((a.id_count::numeric/b.pop)*100, 3) as percentage_customers_per_population
FROM a
    JOIN b
        ON a.zip = b.zip

ORDER BY (a.id_count::numeric/b.pop)*100 DESC
    
"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)
df

||zip|percentage_customers_per_population|
|---|---|---|
|0|98164|1.290|
|1|98050|1.087|
|2|33109|1.053|
|3|94613|1.045|
|4|37240|1.028|
|...|...|...|
|545|33033|0.002|
|546|75067|0.001|
|547|75035|0.001|
|548|94565|0.001|
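
Two details of this calculation can be sketched in plain Python: sorting on the unrounded percentage, and the risk of integer division. The rows below are hypothetical, and `Decimal` is used to mimic exact numeric arithmetic. Whether integer truncation could occur in the real query depends on the column types, which the notebook does not show.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical (zip, customer count, population) rows
rows = [
    ("00001", 299, 250000),
    ("00002", 301, 250000),
    ("00003", 5, 1000),
]


def percentage(count, population):
    """Customers per population as an exact, unrounded percentage."""
    return Decimal(count) / Decimal(population) * 100


# Sort on the unrounded value, highest first
ranked = sorted(rows, key=lambda r: percentage(r[1], r[2]), reverse=True)

# Round to 3 decimal places only for display
display = [
    (zip_code, percentage(count, pop).quantize(Decimal("0.001"), rounding=ROUND_HALF_UP))
    for zip_code, count, pop in ranked
]

for zip_code, pct in display:
    print(zip_code, pct)

# Integer division truncates before the multiplication, giving 0
truncated = 299 // 250000 * 100
print(truncated)
```

Zips `00002` and `00001` both display as 0.120, yet their order is still determined by the unrounded values 0.1204 and 0.1196. Sorting on the rounded column would leave that order undefined.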
