# Serving for Data Analytics

This notebook provides an example of serving data for analytics after an ETL transformation. Data is retrieved and analyzed with Amazon Athena, using simple SQL queries. The query output is then used to build an interactive dashboard for exploring sales data by country and product line.

Import all of the required packages.

In [None]:
# !pip install --upgrade pip
# !pip install Cython
# !pip install pandas
# !pip install awswrangler
# !pip install seaborn
# !pip install ipywidgets
import pandas as pd
import awswrangler as wr
import seaborn as sns
import ipywidgets as widgets
from IPython.display import display

After the AWS Glue job runs, a new database named `de-c1w2-analytics-db` is created. It contains four tables with the following schema:

![image alt ><](./images/schema_after_ETL.png)

Your data is ready to be served for analytics. Amazon Athena lets you query and analyze data stored in various sources with simple SQL, without managing complex infrastructure. Start by looking at the data stored in the `dim_products` table:

In [None]:
GLUE_DATABASE = "de-c1w2-analytics-db"

products_df = wr.athena.read_sql_query(
    """
    SELECT * FROM dim_products
    """,
    database=GLUE_DATABASE,
)
    
products_df.head()

You can gain further insight by making the SQL query slightly more complex. In the following cell, you will sum the total sales by country and display the top 10 records:

In [None]:
product_sales_by_country_df = wr.athena.read_sql_query(
    """
    SELECT
        dim_locations.country,
        SUM(fact_orders.orderAmount) AS total_sales
    FROM
        fact_orders
    JOIN
        dim_locations ON fact_orders.postalCode = dim_locations.postalCode
    GROUP BY 1
    """,
    database=GLUE_DATABASE,
)
    
product_sales_by_country_df.sort_values("total_sales", ascending=False).head(10)
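
To see what the join and aggregation do without connecting to Athena, the same query shape can be run locally. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in engine. The tables contain only the columns the query needs, and every row is invented for illustration. It also assumes you are happy to push the sorting and the top-10 cut into SQL with `ORDER BY` and `LIMIT` instead of doing them in pandas.

```python
import sqlite3

# In-memory database with toy stand-ins for the Glue tables.
# All rows below are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_locations (postalCode TEXT, country TEXT);
    CREATE TABLE fact_orders (postalCode TEXT, orderAmount REAL);

    INSERT INTO dim_locations VALUES ('44000', 'France'), ('10022', 'USA');
    INSERT INTO fact_orders VALUES
        ('44000', 100.0),
        ('44000', 50.0),
        ('10022', 300.0);
    """
)

# Same join and aggregation as the Athena query, with the grouping
# column spelled out and the sort/limit done in SQL.
rows = conn.execute(
    """
    SELECT
        dim_locations.country,
        SUM(fact_orders.orderAmount) AS total_sales
    FROM
        fact_orders
    JOIN
        dim_locations ON fact_orders.postalCode = dim_locations.postalCode
    GROUP BY dim_locations.country
    ORDER BY total_sales DESC
    LIMIT 10
    """
).fetchall()
conn.close()

print(rows)  # [('USA', 300.0), ('France', 150.0)]
```

Each order row is matched to its location through `postalCode`, and the order amounts are then summed once per country. Sorting and limiting in the query means only the rows you need are returned, which can matter for larger result sets.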

Now you will combine data from three tables: `fact_orders`, `dim_products`, and `dim_locations`. The query will select the order date, product line, product name, country, and total sales amount, grouping the results by order date, product line, product name, and country:

In [None]:
product_sales_df = wr.athena.read_sql_query(
    """
    SELECT
        fact_orders.orderDate,
        dim_products.productLine,
        dim_products.productName,
        dim_locations.country,
        SUM(fact_orders.orderAmount) AS total_sales
    FROM
        fact_orders
    JOIN
        dim_products ON fact_orders.productCode = dim_products.productCode
    JOIN
        dim_locations ON fact_orders.postalCode = dim_locations.postalCode
    GROUP BY 1, 2, 3, 4
    """,
    database=GLUE_DATABASE,
)
    
product_sales_df.head()

The result can be used to build an interactive dashboard with dropdown widgets for selecting a country and a product line. You can also restrict the date range and choose how many of the top-selling products to display. Note that Athena returns column names in lowercase, so the DataFrame columns are `orderdate`, `productline`, and `productname`:

In [None]:
country_widget = widgets.Dropdown(
    options=["ALL"] + sorted(list(product_sales_df.country.unique())),
    value="ALL",
    description="Country",
)

productline_widget = widgets.Dropdown(
    options=["ALL"] + sorted(list(product_sales_df.productline.unique())),
    value="ALL",
    description="Product Line",
)

@widgets.interact(
    start_date=widgets.DatePicker(value=product_sales_df.orderdate.min(), description="Start Date"),
    end_date=widgets.DatePicker(value=product_sales_df.orderdate.max(), description="End Date"),
    country=country_widget,
    productline=productline_widget,
    top_n=widgets.IntSlider(value=5, min=1, max=10, step=1, description="Top N"),
)
def plot_top_n_sales(start_date, end_date, country, productline, top_n):
    filtered_df = product_sales_df[
        (product_sales_df.orderdate >= pd.to_datetime(start_date))
        & (product_sales_df.orderdate <= pd.to_datetime(end_date))
    ]
    
    title_str = f"Top {top_n} Popular "
    
    if productline != "ALL":
        filtered_df = filtered_df[filtered_df.productline == productline]
        title_str += productline
    else: 
        title_str += "Products"
        
    if country != "ALL":
        filtered_df = filtered_df[filtered_df.country == country]
        title_str += " in " + country
    
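    if not filtered_df.empty:
        # Sum sales per product first, then rank and keep the top N
        top_products_df = (
            filtered_df.groupby("productname", as_index=False)["total_sales"]
            .sum()
            .sort_values("total_sales", ascending=False)
            .head(top_n)
        )
        ax = sns.barplot(
            x="total_sales",
            y="productname",
            data=top_products_df,
        )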

        ax.set(
            xlabel="Total Sales",
            ylabel="Product Name",
            title=title_str
        )
    else:
        print(f"There were no sales of {productline} to {country} during that period")
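
Because `product_sales_df` holds one row per order date, product, and country, the same product can appear in several rows after filtering. The following is a minimal sketch of ranking products by their summed sales, so the logic can be checked without the widgets or an Athena connection. The helper name `top_n_products` and all rows in `toy_df` are hypothetical and exist only for illustration.

```python
import pandas as pd


def top_n_products(df, top_n):
    """Return the top_n products ranked by their summed total_sales."""
    return (
        df.groupby("productname", as_index=False)["total_sales"]
        .sum()
        .sort_values("total_sales", ascending=False)
        .head(top_n)
        .reset_index(drop=True)
    )


# Invented rows shaped like product_sales_df (lowercase column names).
toy_df = pd.DataFrame(
    {
        "orderdate": pd.to_datetime(
            ["2003-01-06", "2003-01-09", "2003-01-09", "2003-02-11"]
        ),
        "productname": ["Car A", "Car B", "Car A", "Plane C"],
        "total_sales": [100.0, 250.0, 200.0, 50.0],
    }
)

top2 = top_n_products(toy_df, top_n=2)
print(top2)
```

Here "Car A" appears on two dates, so its rows are summed to 300.0 before ranking, which places it ahead of "Car B" at 250.0. Sorting before calling `head` is what makes the result a true top N; taking the first N rows and sorting afterwards would only order whichever rows happened to come first.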

Fantastic! You can now see how easily the data can be accessed once the ETL transformation is complete.