# **Project 1: Pizza Sales Analysis**

## **Problem Statement**

### **Business Context**

A moderately sized, locally popular food joint operates a chain of outlets located in a metropolitan area, offering a diverse menu of pizzas, sides, and beverages. Despite having a steady flow of customers, they face challenges in optimizing their order fulfillment process, leading to delays during peak hours, which results in customer dissatisfaction and impacts repeat business. Additionally, they struggle with inventory management, often experiencing shortages of popular ingredients or excess stock of less favored items. To address these issues, they are implementing a new order management system and seeking to analyze sales data to better predict demand and streamline inventory.

### **Objective**

You have been engaged by the business as a Data Analyst to enhance operational efficiency and boost customer satisfaction. You have been provided with raw historical sales data and tasked with pre-processing it to uncover trends, building an interactive dashboard for visual reporting of key metrics, and generating email reports to communicate key insights to stakeholders. This will give stakeholders a clearer understanding of the business, keep them on top of changing market scenarios via frequent alerts, and help them make quick, informed decisions to resolve operational challenges. The anticipated outcomes include reduced order processing times, improved inventory turnover, and increased customer satisfaction leading to higher repeat sales.

### **Data Description:**

This dataset contains detailed information about pizza orders, including specifics about the pizza variants, quantities, pricing, dates, times, and categorization details.
- **pizza_id:** A unique identifier assigned to each distinct pizza variant available for ordering.  
- **order_id:** A unique identifier for each order made, which links to multiple pizzas.  
- **pizza_name_id:** An identifier linking to a specific name of the pizza.  
- **quantity:** The number of units of a specific pizza variant ordered within an order.  
- **order_date:** The date when the order was placed.  
- **order_time:** The time when the order was placed.  
- **unit_price:** The cost of a single unit of the specific pizza variant.  
- **pizza_size:** Represents the size of the pizza (e.g., small, medium, large).  
- **pizza_category:** Indicates the category of the pizza, such as vegetarian, non-vegetarian, etc.  
- **pizza_name:** Specifies the name of the specific pizza variant ordered.  

## **AzureML Setup and Data Loading**

### **Connect to Azure Machine Learning Workspace**

In [1]:
# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()

In [7]:
# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="REPLACE WITH YOUR SUBSCRIPTION ID",
    resource_group_name="REPLACE WITH YOUR RESOURCE GROUP",
    workspace_name="REPLACE WITH YOUR WORKSPACE NAME",
)

### **Create Compute Cluster**

In [2]:
from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster
cpu_compute_target = "cpu-cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="Standard_A2_v2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=1,
        # Seconds a node stays idle after a job finishes before it is scaled down
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's create_or_update method
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()

print(
    f"AMLCompute with name {cpu_cluster.name} is ready, the compute size is {cpu_cluster.size}"
)

### **Register Dataset into Data Assets**

In [11]:
#Write your code here

### **Create a Job Environment**

In [13]:
# Create a directory called "project_env" for the preprocessing script

In [3]:
# Create a code.yml with all the dependencies and store it in the project_env directory

In [4]:
# Create the environment with the code.yml file
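
The directory and dependency file described in the cells above could be produced along the following lines. This is a minimal sketch, assuming a conda-style specification: the environment name `project-env`, the Python version, and the package list are all assumptions rather than requirements from the project brief, and `write_env_spec` is a hypothetical helper. Registering the environment in the workspace is a separate step that is not shown here.

```python
import os

# Hypothetical dependency list; adjust it to what your scripts actually import.
CONDA_SPEC = """\
name: project-env
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pip
  - pip:
      - pandas
      - huggingface_hub
      - datasets
"""


def write_env_spec(base_dir="."):
    """Create project_env/ under base_dir and write code.yml into it."""
    env_dir = os.path.join(base_dir, "project_env")
    os.makedirs(env_dir, exist_ok=True)  # no error if the folder already exists
    spec_path = os.path.join(env_dir, "code.yml")
    with open(spec_path, "w", encoding="utf-8") as handle:
        handle.write(CONDA_SPEC)
    return spec_path
```

Calling `write_env_spec()` from the notebook writes `project_env/code.yml` relative to the current working directory.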

## Exploratory Data Analysis

### **Data Overview**

In [5]:
# Uncomment one of the following code snippets and execute it to install the seaborn library
# !pip install seaborn  
# pip install seaborn

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [20]:
#load the data

In [6]:
# print the top 5 rows

In [7]:
# get the shape of the data

In [8]:
# get the info of the data

In [9]:
# generate the statistical summary of the dataset

In [11]:
# calculate the sum of duplicated values 

In [10]:
# check for any missing values
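
The overview steps listed in the cells above might look roughly like this. The sketch runs on a tiny made-up frame named `data` so that it is self-contained; in the notebook, `data` would instead come from loading the registered sales file, and the values below are purely illustrative.

```python
import pandas as pd

# Tiny stand-in frame; the real notebook would load the pizza sales data instead.
data = pd.DataFrame(
    {
        "order_id": [1, 1, 2, 2, 3],
        "quantity": [1, 2, 1, 1, 1],
        "unit_price": [13.25, 16.00, 12.00, 12.00, None],
        "pizza_size": ["M", "L", "S", "S", "M"],
    }
)

print(data.head())      # top rows
print(data.shape)       # (rows, columns)
data.info()             # dtypes and non-null counts
print(data.describe())  # summary statistics for numeric columns

# Count rows that are exact copies of an earlier row
duplicate_rows = int(data.duplicated().sum())
# Count missing values in each column
missing_per_column = data.isna().sum()
print(duplicate_rows)
print(missing_per_column)
```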

In [27]:
df = data.copy()

### **Univariate Analysis**

In [16]:
#write your code here

### **Feature Engineering**

Write code to convert the order_date and order_time columns into a single order_time column in datetime format, and then drop the original order_date column from the dataframe. Finally, display the first few rows of the modified dataframe.

In [17]:
# Write your code here
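
One possible way to approach the task above is sketched here on a small made-up frame. It assumes the dates are stored as `YYYY-MM-DD` strings and the times as `HH:MM:SS` strings; the real file may use a different layout, in which case the `format` argument would need to change.

```python
import pandas as pd

# Illustrative rows only; the date and time formats are assumptions.
df = pd.DataFrame(
    {
        "order_id": [1, 2],
        "order_date": ["2015-01-01", "2015-01-02"],
        "order_time": ["11:38:36", "19:05:10"],
    }
)

# Join the two text columns and parse the result as a single timestamp
df["order_time"] = pd.to_datetime(
    df["order_date"] + " " + df["order_time"], format="%Y-%m-%d %H:%M:%S"
)

# The date is now part of order_time, so the separate column can go
df = df.drop(columns=["order_date"])
print(df.head())
```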

Write a code snippet to extract the month from the order_time column and create a new column called order_month. Additionally, classify the order_time into different parts of the day (Morning, Afternoon, Evening) based on the hour and store this information in a new column called time_of_day. Finally, display the first few rows of the modified dataframe.

In [18]:
# Write your code here
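
The month and time-of-day columns described above could be derived as in the sketch below. The hour boundaries (before 12 is Morning, 12 to 16 is Afternoon, 17 onward is Evening) are an assumption, since the task does not specify them, and `part_of_day` is a hypothetical helper name.

```python
import pandas as pd

# Illustrative timestamps; in the notebook, order_time is already a datetime column.
df = pd.DataFrame(
    {
        "order_time": pd.to_datetime(
            ["2015-01-01 09:15:00", "2015-03-10 13:40:00", "2015-07-04 20:05:00"]
        )
    }
)


def part_of_day(hour):
    """Map an hour (0-23) to a label; the cut-offs are an assumption."""
    if hour < 12:
        return "Morning"
    if hour < 17:
        return "Afternoon"
    return "Evening"


df["order_month"] = df["order_time"].dt.month
df["time_of_day"] = df["order_time"].dt.hour.apply(part_of_day)
print(df.head())
```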

### **Bivariate Analysis**

In [19]:
# Write your code here

## **Data Preprocessing**

In [23]:
# Create a directory called "project_scripts" for the preprocessing script

### **Create Preprocessing Script**

In [24]:
%%writefile {src_dir_job_scripts}/pre_process.py

import argparse
import pandas as pd
from pathlib import Path

if __name__ == "__main__":

    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--output", type=str, help="path to output data")
    args = parser.parse_args()

    # Read the input data from the specified path
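    df = pd.read_csv(args.data)

    # ------------------------------------------------------------

    # Keep all data processing and feature engineering
    # steps here to automate the entire feature engineering process

    # ------------------------------------------------------------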

    # Define output paths
    output_data_store = args.output
    preprocessed_data_output_path = Path(output_data_store, "processed_data.csv")

    # Save the processed data
    df.to_csv(preprocessed_data_output_path, index=False)

### **Creating Preprocessing Job**

Write a code snippet to define a data preparation step for a pizza analysis project using a command function. The step should read a CSV input, process and transform the data, with inputs and outputs specified as uri_folder. Include a description, display name, and the command to execute a Python script for data preprocessing. Use a specified environment for execution.

In [27]:
# Write your code here

## **Reporting via Dashboard**

### **Creating Files for the Dashboard**

#### Creating Directory

In [28]:
# Create a directory called project_files for storing the dashboard files

#### Creating Dashboard Script

In [30]:
%%writefile {src_dir_hf_space}/app.py
import streamlit as st
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load data
def load_data():
    df = pd.read_csv("processed_data.csv")  # replace with your dataset
    return df

# Create Streamlit app
def app():
    # Title for the app
    st.title("Pizza Sales Data Analysis Dashboard")
    df = load_data()

    df = pd.DataFrame(df)

    # ------------------------------------------------------------
    # Calculate key metrics
    # Write a code snippet to calculate key metrics from the pizza orders dataframe, including the
    # total number of unique orders, total revenue generated, the most popular pizza size, the most
    # frequent pizza category, and the total pizzas sold

    # ------------------------------------------------------------

    # Sidebar with key metrics
    # Write a code snippet to display key metrics in the sidebar of a Streamlit application.
    # Show the total number of orders, total revenue (formatted as currency), the most popular
    # pizza size, the most popular pizza category, and the total number of pizzas sold
    # using the st.sidebar.metric function.

    # ------------------------------------------------------------

    # Provide the details of the plots here
    plots = [
        {"title": "__________", "x": "_________", "y": "___________"},
    ]

    for plot in plots:
        st.header(plot["title"])

        fig, ax = plt.subplots()

        # ------------------------------------------------------------

        # Provide the details of the plots here

        # ------------------------------------------------------------

        st.pyplot(fig)
    

if __name__ == "__main__":
    app()
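
The key-metric calculation requested inside `app()` could be sketched as a small helper like the one below. It assumes revenue is `quantity * unit_price` (the data description lists no total-price column) and that "most popular" means most frequent across order lines rather than by quantity; `key_metrics` and the sample rows are hypothetical.

```python
import pandas as pd


def key_metrics(df):
    """Return the five headline metrics as a plain dictionary."""
    return {
        # Number of distinct orders
        "total_orders": int(df["order_id"].nunique()),
        # Revenue, assumed to be quantity times unit price per line
        "total_revenue": float((df["quantity"] * df["unit_price"]).sum()),
        # Size and category that appear on the most order lines
        "popular_size": df["pizza_size"].value_counts().idxmax(),
        "popular_category": df["pizza_category"].value_counts().idxmax(),
        # Total units sold
        "total_pizzas_sold": int(df["quantity"].sum()),
    }


# Made-up rows for illustration only
sample = pd.DataFrame(
    {
        "order_id": [1, 1, 2, 3],
        "quantity": [1, 2, 1, 1],
        "unit_price": [10.0, 12.5, 10.0, 20.0],
        "pizza_size": ["M", "L", "L", "S"],
        "pizza_category": ["Veggie", "Classic", "Classic", "Supreme"],
    }
)
metrics = key_metrics(sample)
print(metrics)
```

Inside the dashboard, each value could then be passed to `st.sidebar.metric`, for example with the revenue formatted as `f"${metrics['total_revenue']:,.2f}"`.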

#### Creating Requirements File

In [29]:
# Write your code here and store in to the project_files directory

#### Creating Docker File

In [None]:
%%writefile {src_dir_hf_space}/Dockerfile
FROM python:3.9-slim

RUN useradd -m -u 1000 user

USER user

ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH

WORKDIR $HOME/app

COPY --chown=user . $HOME/app



# Run apt-get update and install as root
USER root

RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    software-properties-common \
    git \
    && rm -rf /var/lib/apt/lists/*

USER user


# COPY requirements.txt ./

# COPY src/ ./src/

COPY . .

RUN pip3 install -r requirements.txt

EXPOSE 8501

HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

#### Creating Script to Push Files to Hugging Face Space

In [31]:
%%writefile {src_dir_job_scripts}/hugging_face_auth_push_files.py
from huggingface_hub import login, HfApi
from datasets import Dataset
import argparse
import os 
import pandas as pd

os.makedirs("./outputs", exist_ok=True) # Create the "outputs" directory if it doesn't exist


# Once job 1 has run, its output processed_data.csv is stored in that job's output folder.
# After job 1 completes, job 2 can read the processed_data.csv file.

def select_first_file(path):
    """Selects first file in folder, use under assumption there is only one file in folder
    Args:
        path (str): path to directory or file to choose
    Returns:
        str: full path of selected file
    """
    files = os.listdir(path)
    return os.path.join(path, files[0])

def main():
    parser = argparse.ArgumentParser()

    # Provide the input arguments for this job
    parser.add_argument("--processed_data_push", type=str, help="path to processed data")
    parser.add_argument("--streamlit_files", type=str, help="path to streamlit files")
    args = parser.parse_args()
    
    # Write a code snippet to authenticate and upload files to a Hugging Face space. 
    # The snippet should include logging in using an access token, initializing the Hugging Face API, 
    # uploading a folder containing application files and a requirements file, and 
    # then uploading a specific processed data file as processed_data.csv


if __name__ == '__main__':
    main()

### **Creating Dashboard Reporting Job**

Write code to define a Hugging Face authentication step for a dashboard project using a command function. The step should authenticate to the Hugging Face hub and push specified files. It should include inputs for the path to the processed data and a folder containing Streamlit files, along with the command to execute a Python script for authentication and file upload. Specify the source folder for the component and the environment for execution.

In [49]:
# Write your code here

## **Reporting via Email**

### **Creating Email Report Script**

In [33]:
%%writefile {src_dir_job_scripts}/daily_email_report.py
import argparse
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import pandas as pd
import os


os.makedirs("./outputs", exist_ok=True) # Create the "outputs" directory if it doesn't exist


def select_first_file(path):
    files = os.listdir(path)
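    return os.path.join(path, files[0])


# ------------------------------------------------------------


# Define email configuration with sender, receiver, app passwords along with port number


# ------------------------------------------------------------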


def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("--processed_data", type=str, help="path to processed data")
    args = parser.parse_args()
    df = pd.read_csv(select_first_file(args.processed_data))  # Read the processed data


    # ------------------------------------------------------------
    # Calculate the top 5 metrics
    # Write a code snippet to calculate key metrics from a pizza orders dataframe.
    # This should include the total number of unique orders, total revenue from orders,
    # the most popular pizza size, the most frequent pizza category, and the total quantity
    # of pizzas sold. Each metric should be clearly defined with a brief comment explaining its purpose.

    # ------------------------------------------------------------

    # Create a report dictionary for summary

    # ------------------------------------------------------------

    # Create email content
    def create_email_content(report):
        email_content = f"""


        # Write your message here

        """
        return email_content

    # Send email report
    def send_email_report(email_config, report_metrics):
        email_content = create_email_content(report_metrics)

        server = smtplib.SMTP(email_config['smtp_server'], email_config['smtp_port'])
        server.starttls()
        server.login(email_config['sender_email'], email_config['password'])

        msg = MIMEMultipart()  # Create the message container for the report
        msg['From'] = email_config['sender_email']
        msg['To'] = email_config['receiver_emails']
        msg['Subject'] = 'Pizza Sales Analysis Report'
        msg.attach(MIMEText(email_content, 'html'))

        text = msg.as_string()
        server.sendmail(email_config['sender_email'], email_config['receiver_emails'], text)
        print(f"Email report sent to {email_config['receiver_emails']} successfully!")

        server.quit()

    # Call the function to send email report
    send_email_report(email_config, report)

if __name__ == '__main__':
    main()
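
The email configuration and message body left open in the script above might be filled in along these lines. This is a sketch only: every address, the server name, and the password placeholder are hypothetical, and `build_message` is an illustrative helper that assembles the message without connecting to any server. It also assumes `receiver_emails` is a list, which differs from the script above, where the value is assigned directly to `msg['To']`.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Placeholder configuration; every value here is hypothetical.
email_config = {
    "smtp_server": "smtp.example.com",
    "smtp_port": 587,
    "sender_email": "sender@example.com",
    "password": "APP_PASSWORD",
    "receiver_emails": ["owner@example.com", "manager@example.com"],
}


def create_email_content(report):
    """Render the report dictionary as a small HTML table."""
    rows = "".join(
        f"<tr><td>{name}</td><td>{value}</td></tr>" for name, value in report.items()
    )
    return f"<html><body><h3>Pizza Sales Summary</h3><table>{rows}</table></body></html>"


def build_message(config, report):
    """Assemble the MIME message without opening a network connection."""
    msg = MIMEMultipart()
    msg["From"] = config["sender_email"]
    # The To header is a single string, so the list is joined with commas
    msg["To"] = ", ".join(config["receiver_emails"])
    msg["Subject"] = "Pizza Sales Analysis Report"
    msg.attach(MIMEText(create_email_content(report), "html"))
    return msg


# Example report values, made up for illustration
report = {"Total Orders": 3, "Total Revenue": "$65.00"}
message = build_message(email_config, report)
print(message["To"])
```

In practice the password would come from a secret store or environment variable rather than being written into the script.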

### **Creating Email Reporting Job**

Write code to define a step for sending report emails based on processed pizza order data. The step should include inputs for the processed data (as a URI folder), a clear name and display name, and a description of its functionality. Specify the source folder for the component and the command to execute a Python script that generates and sends the email report. Use the designated environment for execution.

In [34]:
# Write your code here

## **Building an Analytical Pipeline**

### **Assembling all Jobs into a Single Pipeline**

Write a code snippet to define an Azure ML pipeline for the Intelligent Reporting Project. The pipeline should include three steps: 

1. A data preparation job that processes an input CSV file (pizza_sales.csv).
2. A job for Hugging Face authentication and pushing files, taking the processed data from the first job and specified Streamlit files as inputs.
3. An email report job that sends a report based on the processed data from the first job.

Ensure that the pipeline has appropriate input parameters and returns a dictionary of outputs with identifiers for each step.

In [None]:
# Write your code here

### **Providing Paths for the Jobs**

In [35]:
# Define paths for input registered data


Write a code snippet to instantiate the defined Azure ML pipeline by providing the necessary input parameters. The pipeline should take the input path for the pizza sales data as a URI file and a URI folder for the Hugging Face files, ensuring that both inputs are correctly specified for execution.

In [36]:
# Write your code here

### **Executing the Pipeline**

Write a code snippet to create or update a job in Azure ML using the defined pipeline. The job should be associated with a specified experiment name, such as "Project 1 - Intelligent Reporting on Azure ML Pipeline," and utilize the previously instantiated pipeline for execution.

In [None]:
# Write your code here

## **Sample Output**

1. Conclusions and Recommendations
2. Hugging Face Space link and screenshot of the dashboard
3. Email report screenshot 

### Power Ahead!