# Liquor Sales Data Analysis

## Objective

In this case study, you will practice hands-on data processing and analysis using
a detailed dataset on liquor sales spanning 2020 to 2025. The assignment gives you
practical experience in processing large datasets with technologies such as AWS RDS, HBase,
and Hadoop MapReduce. You will apply the techniques covered in the project modules
and gain insight into the complexities of business data analytics. By the end of this
assignment, you will have developed a working understanding of the following:

* Data ingestion using tools such as AWS RDS and HBase.
* Data cleaning and preparation to ensure high-quality analysis.
* Applying MapReduce to solve real-world analytics problems.
* Deriving actionable insights and recommendations from the analysis.

This hands-on approach bridges the gap between theoretical learning and practical implementation,
preparing you for real-world challenges in the field of data analytics.



## Business Objective
The liquor industry is a significant contributor to the retail economy, particularly in regions where
sales are highly regulated and tracked. For liquor businesses, understanding sales trends is vital to
maintaining competitive advantage, meeting customer demand, and ensuring efficient operations.
As an analyst, you are tasked with analyzing detailed liquor sales data from 2020 to 2025 to uncover
patterns and insights that can drive strategic decision-making. The objective is to identify trends in
consumer preferences, regional sales performance, and product popularity, enabling stakeholders to
optimize inventory management, boost profitability, and enhance customer satisfaction.

## Hadoop and MapReduce Assignment Tasks:

**Data Cleaning Tasks:**

* Task 1: Data Cleaning

**Data Ingestion Tasks:**

* Task 2: Upload Liquor Sales Data to AWS RDS
* Task 3: Ingest Data into HBase

**Data Analysis Using MapReduce:**

* Task 4: Total Revenue by Store
* Task 5: Top-Selling Liquor Categories
* Task 6: County-Level Sales Analysis
* Task 7: Store Performance Analysis
* Task 8: Trends in Liquor Sales Over Time
* Task 9: Vendor Performance


**NOTE:** The marks shown next to headings and sub-headings are cumulative totals for that section and are given for your reference only. For example, the marks shown with heading *2* or sub-heading *2.1* are cumulative.

The marks you earn for completing each task are specified within the task itself.

Suppose the marks for two tasks are: 3 marks for 2.1.1 and 2 marks for 3.2.2, or
* 2.1.1 [3 marks]
* 3.2.2 [2 marks]

then, you will earn 3 marks for completing task 2.1.1 and 2 marks for completing task 3.2.2.

---

## Data Understanding
The dataset can be downloaded from the following [link](https://liquor-data.s3.us-east-1.amazonaws.com/Liquor_Sales.csv).
It contains liquor sales data from multiple stores across various states, providing rich information for analysis. The fields are as follows:


| Variable              | Class            | Description                                                     |
|-----------------------|------------------|-----------------------------------------------------------------|
| Invoice/Item Number   | String/Integer   | Unique identifier for each sale.                                |
| Date                  | Date             | The date of the sale.                                           |
| Store Number          | Integer          | Unique identifier for the store.                                |
| Store Name            | String           | Name of the store.                                              |
| Address               | String           | Store address.                                                  |
| City                  | String           | City where the store is located.                                |
| Zip Code              | String/Integer   | ZIP code of the store location.                                 |
| Store Location        | String/GeoPoint  | GPS coordinates of the store.                                   |
| County Number         | Integer          | Unique identifier for the county.                               |
| County                | String           | Name of the county.                                             |
| Category              | Integer          | Liquor category code.                                           |
| Category Name         | String           | Name of the liquor category (e.g., Whiskey, Vodka).             |
| Vendor Number         | Integer          | Vendor's unique identifier.                                     |
| Vendor Name           | String           | Name of the vendor/distributor.                                 |
| Item Number           | Integer          | Product's unique identifier.                                    |
| Item Description      | String           | Description of the liquor product.                              |
| Pack                  | Integer          | Number of bottles in a pack.                                    |
| Bottle Volume (ml)    | Float/Integer    | Volume of a single bottle in milliliters.                       |
| State Bottle Cost     | Float            | Cost per bottle for the state.                                  |
| State Bottle Retail   | Float            | Retail price per bottle.                                        |
| Bottles Sold          | Integer          | Number of bottles sold.                                         |
| Sale (Dollars)        | Float            | Total revenue from the sale.                                    |
| Volume Sold (Liters)  | Float            | Volume sold in liters.                                          |
| Volume Sold (Gallons) | Float            | Volume sold in gallons.                                         |
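
Text fields such as Store Name or Address can contain commas, so splitting a line on `,` is unreliable. The sketch below parses one raw line with the standard `csv` module into a dictionary keyed by the column names in the table. It assumes the file's column order matches the table (check this against the actual header), and the sample row, including its date format, is invented for illustration.

```python
import csv

# Column names in the order listed in the table above.
# Assumption: the CSV file uses this same order.
COLUMNS = [
    "Invoice/Item Number", "Date", "Store Number", "Store Name", "Address",
    "City", "Zip Code", "Store Location", "County Number", "County",
    "Category", "Category Name", "Vendor Number", "Vendor Name",
    "Item Number", "Item Description", "Pack", "Bottle Volume (ml)",
    "State Bottle Cost", "State Bottle Retail", "Bottles Sold",
    "Sale (Dollars)", "Volume Sold (Liters)", "Volume Sold (Gallons)",
]

def parse_line(line):
    """Parse one raw CSV line into a {column: value} dict, or None if malformed."""
    try:
        fields = next(csv.reader([line]))
    except (csv.Error, StopIteration):
        return None
    if len(fields) != len(COLUMNS):
        return None
    return dict(zip(COLUMNS, fields))

# Invented sample row; the quoted store name contains a comma.
sample = ('INV-001,01/02/2020,2633,"Example Liquor, Downtown",1 Main St,'
          'Example City,50320,POINT (-93.59 41.55),77,POLK,1031100,'
          'American Vodkas,260,Example Vendor,38176,Example Vodka,12,750,'
          '6.50,9.75,12,117.00,9.00,2.38')
row = parse_line(sample)
print(row["Store Name"], row["Sale (Dollars)"])
```

All values come back as strings; converting them to the classes listed in the table is left to the cleaning step.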


### Import Libraries and Load Dataset

In [None]:
!pip install mrjob==0.7.4

In [None]:
# Import the libraries you will be using for analysis
import pandas as pd
import numpy as np
from mrjob.job import MRJob
import csv

## **1** Data Cleaning
<font color = red>[5 marks]</font> <br>

#### **1.1** Fixing Columns
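
One common column fix is normalizing header names so they are easy to reference in code. A minimal sketch, assuming you want lowercase snake_case names; the helper name `clean_column_name` is hypothetical.

```python
import re

def clean_column_name(name):
    """Convert a raw header such as 'Sale (Dollars)' to 'sale_dollars'."""
    name = name.strip().lower()
    name = re.sub(r"[^a-z0-9]+", "_", name)  # runs of other characters -> '_'
    return name.strip("_")

raw_headers = ["Invoice/Item Number", "Bottle Volume (ml)", "Sale (Dollars)"]
print([clean_column_name(h) for h in raw_headers])
```

With pandas, the same helper could be applied as `df.columns = [clean_column_name(c) for c in df.columns]`.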

#### **1.2** Fixing Rows
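
What counts as a bad row depends on what you find in the file. As one possible interpretation, the sketch below drops header rows repeated inside the data, fully empty rows, and exact duplicates. The three-column sample is invented and the function name `fix_rows` is hypothetical.

```python
def fix_rows(rows, header):
    """Drop repeated header rows, fully empty rows, and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(cell.strip() for cell in row)
        if key == tuple(header):   # header repeated inside the data
            continue
        if not any(key):           # every cell is empty
            continue
        if key in seen:            # exact duplicate of an earlier row
            continue
        seen.add(key)
        cleaned.append(list(key))
    return cleaned

header = ["invoice", "store", "sale"]
rows = [
    ["INV-1", "2633", "117.00"],
    ["invoice", "store", "sale"],
    ["INV-1", "2633", "117.00"],
    ["", "", ""],
    ["INV-2 ", "4829", "58.50"],
]
cleaned_rows = fix_rows(rows, header)
print(cleaned_rows)
```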

#### **1.3** Handling Missing Values
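
A simple policy for missing values is to drop records that lack a field the analysis cannot do without and to fill descriptive fields with a placeholder. The sketch below illustrates that policy under assumptions: which fields are treated as required or fillable is a choice made here for illustration, not something the assignment prescribes, and values are expected to be strings (or `None`) as read from the CSV.

```python
REQUIRED = ("Store Number", "Date", "Sale (Dollars)")       # assumed critical fields
FILL_UNKNOWN = ("County", "Category Name", "Vendor Name")   # assumed fillable fields

def handle_missing(record):
    """Return a cleaned copy of record, or None if a required field is missing."""
    cleaned = {k: (v or "").strip() for k, v in record.items()}
    if any(not cleaned.get(field) for field in REQUIRED):
        return None
    for field in FILL_UNKNOWN:
        if not cleaned.get(field):
            cleaned[field] = "UNKNOWN"
    return cleaned

record = {"Store Number": "2633", "Date": "01/02/2020", "Sale (Dollars)": "117.00",
          "County": " ", "Category Name": "Vodka", "Vendor Name": None}
print(handle_missing(record))
```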

#### **1.4** Handling Outliers
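
One widely used rule flags values that lie more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies it to a handful of invented sale amounts. Whether flagged rows should be removed, capped, or kept is a judgment call: a very large sale may be a genuine bulk order rather than an error.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) limits: quartiles widened by k times the IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented sale amounts with one extreme value.
sales = [12.0, 15.5, 14.0, 13.2, 16.8, 15.0, 14.4, 950.0]
low, high = iqr_bounds(sales)
outliers = [s for s in sales if s < low or s > high]
print(outliers)
```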

## **2** Data Ingestion Tasks
<font color = red>[15 marks]</font> <br>

#### 2.1 Upload Liquor Sales Data to AWS RDS
<font color = red>[5 marks]</font> <br>
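
One way to load cleaned rows into a relational table is a parameterized batch insert through a Python DB-API driver. The sketch below uses the built-in `sqlite3` module purely as a local stand-in so it runs without an AWS account. For an RDS instance you would swap in the driver and connection details for your database engine, and the placeholder style (`?` here) and transaction handling depend on that driver. The table name, column names, and rows are hypothetical.

```python
import sqlite3

# Invented rows: (invoice, date, store number, county, sale in dollars).
rows = [
    ("INV-001", "2020-01-02", 2633, "POLK", 117.00),
    ("INV-002", "2020-01-03", 2633, "POLK", 58.50),
    ("INV-003", "2020-01-03", 4829, "LINN", 210.25),
]

conn = sqlite3.connect(":memory:")  # stand-in for the RDS connection
conn.execute("""
    CREATE TABLE liquor_sales (
        invoice_number TEXT PRIMARY KEY,
        sale_date      TEXT,
        store_number   INTEGER,
        county         TEXT,
        sale_dollars   REAL
    )
""")

# Parameterized insert: values are passed separately from the SQL text.
with conn:  # in sqlite3 this commits on success and rolls back on error
    conn.executemany("INSERT INTO liquor_sales VALUES (?, ?, ?, ?, ?)", rows)

# Sanity check after loading: row count and total revenue.
loaded = conn.execute(
    "SELECT COUNT(*), SUM(sale_dollars) FROM liquor_sales"
).fetchone()
print(loaded)
```

Comparing the loaded row count and a column total against the source file is a cheap way to confirm that nothing was dropped during the upload.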

#### 2.2 Ingest Data into HBase
<font color = red>[10 marks]</font> <br>
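
HBase stores every cell as bytes and keeps rows sorted by row key, so the row key design decides which scans are cheap. The sketch below shows one possible mapping from a cleaned record to a row key and a `family:qualifier` dictionary. The key layout, the column families `store` and `sale`, and the record field names are all hypothetical choices. The resulting pair has the shape that a client such as `happybase` accepts in `table.put(row_key, columns)`, though the assignment does not name a client library.

```python
def to_hbase_put(record):
    """Build (row_key, columns) as bytes for a single HBase put."""
    # Hypothetical key: zero-padded store number, ISO date, invoice number.
    # Zero-padding and ISO dates keep byte order equal to logical order,
    # so one store's sales are adjacent and sorted by date.
    row_key = "{}#{}#{}".format(
        str(record["store_number"]).zfill(6),
        record["sale_date"],
        record["invoice_number"],
    ).encode("utf-8")
    columns = {
        b"store:county": record["county"].encode("utf-8"),
        b"sale:dollars": str(record["sale_dollars"]).encode("utf-8"),
        b"sale:bottles": str(record["bottles_sold"]).encode("utf-8"),
    }
    return row_key, columns

record = {"store_number": 2633, "sale_date": "2020-01-02",
          "invoice_number": "INV-001", "county": "POLK",
          "sale_dollars": 117.0, "bottles_sold": 12}
row_key, columns = to_hbase_put(record)
print(row_key, columns)
```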

## **3** Analytics Queries using MapReduce
<font color = red>[60 marks]</font> <br>

#### 3.1 Total Revenue by Store
<font color = red>[10 marks]</font> <br>

In [None]:
class MRTotalRevenueByStore(MRJob):

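    # Mapper: called once per input line; yield (key, value) pairs here
    def mapper(self, _, line):
        pass  # TODO: parse the line and yield (store, sale amount)

    # Reducer: called once per key with an iterator over that key's values
    def reducer(self, key, values):
        pass  # TODO: sum the values and yield (store, total revenue)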


if __name__ == '__main__':
    MRTotalRevenueByStore.run()
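
The map and reduce logic for this task can be prototyped as plain generator functions before it is moved into the `MRJob` class. The sketch below is one possible approach, not a reference solution: it assumes the column positions follow the data table (store number at index 2, sale amount at index 21), and `run_local` and `make_line` are hypothetical helpers that imitate the shuffle phase and build invented test lines.

```python
import csv
from collections import defaultdict

# Assumed zero-based column positions, taken from the order of the data table.
STORE_NUMBER, SALE_DOLLARS, N_COLUMNS = 2, 21, 24

def mapper(line):
    """Yield (store_number, sale) for one CSV line; skip the header and bad rows."""
    fields = next(csv.reader([line]), [])
    try:
        yield fields[STORE_NUMBER], float(fields[SALE_DOLLARS])
    except (IndexError, ValueError):
        return  # header row, short row, or non-numeric sale amount

def reducer(store, sales):
    """Yield (store_number, total revenue) for one store."""
    yield store, round(sum(sales), 2)

def run_local(lines):
    """Stand-in for the shuffle: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for store, sale in mapper(line):
            groups[store].append(sale)
    return {k: v for store in sorted(groups)
            for k, v in reducer(store, groups[store])}

def make_line(store, sale):
    """Build an invented 24-column line with only the two needed fields filled."""
    fields = [""] * N_COLUMNS
    fields[STORE_NUMBER], fields[SALE_DOLLARS] = store, sale
    return ",".join(fields)

lines = [make_line("Store Number", "Sale (Dollars)"),  # header, skipped
         make_line("2633", "117.00"), make_line("2633", "58.50"),
         make_line("4829", "210.25")]
totals = run_local(lines)
print(totals)
```

In an `MRJob` subclass the same bodies would go into `mapper(self, _, line)` and `reducer(self, key, values)`, with the grouping handled by the framework.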

#### 3.2 Top-Selling Categories
<font color = red>[10 marks]</font> <br>

In [None]:
class MRTopSellingLiquorCategories(MRJob):

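    # Mapper: called once per input line; yield (key, value) pairs here
    def mapper(self, _, line):
        pass  # TODO: parse the line and yield (key, value)

    # Reducer: called once per key with an iterator over that key's values
    def reducer(self, key, values):
        pass  # TODO: aggregate the values and yield (key, result)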


if __name__ == '__main__':
    MRTopSellingLiquorCategories.run()
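
Finding the top categories is a two-stage problem: first total each category, then rank the totals, and the ranking needs to see all totals at once. The sketch below shows both stages as plain functions. It assumes "top-selling" means bottles sold (revenue would work the same way), and the category pairs are invented.

```python
import heapq
from collections import defaultdict

def sum_by_category(pairs):
    """Stage 1: total bottles sold per category name."""
    totals = defaultdict(int)
    for category, bottles in pairs:
        totals[category] += bottles
    return totals

def top_n(totals, n=3):
    """Stage 2: keep the n categories with the largest totals."""
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])

# Invented (category name, bottles sold) pairs, as a mapper might emit them.
mapped = [("Vodka", 12), ("Whiskey", 6), ("Rum", 4), ("Vodka", 18),
          ("Tequila", 3), ("Whiskey", 30), ("Rum", 2)]
top_categories = top_n(sum_by_category(mapped), n=2)
print(top_categories)
```

In mrjob, one way to express the second stage is an additional `MRStep` whose mapper sends every `(category, total)` pair to a single shared key, so that one reducer call sees all totals.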

#### 3.3 County-Level Sales Analysis
<font color = red>[10 marks]</font> <br>

In [None]:
class MRCountyLevelSalesAnalysis(MRJob):

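    # Mapper: called once per input line; yield (key, value) pairs here
    def mapper(self, _, line):
        pass  # TODO: parse the line and yield (key, value)

    # Reducer: called once per key with an iterator over that key's values
    def reducer(self, key, values):
        pass  # TODO: aggregate the values and yield (key, result)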


if __name__ == '__main__':
    MRCountyLevelSalesAnalysis.run()
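
A county-level analysis usually needs more than one metric per county. One option is for the mapper to emit a tuple of metrics per row and for the reducer to add the tuples element by element. The sketch below assumes revenue, bottles, and liters are the metrics of interest and that rows have already been parsed into dictionaries with hypothetical snake_case keys. It also normalizes the county name, since inconsistent casing would otherwise split one county into several keys.

```python
from collections import defaultdict

def mapper(record):
    """Yield (county, (revenue, bottles, liters)) for one parsed row."""
    yield record["county"].strip().upper(), (
        float(record["sale_dollars"]),
        int(record["bottles_sold"]),
        float(record["volume_liters"]),
    )

def reducer(county, values):
    """Add the metric tuples for one county element by element."""
    revenue, bottles, liters = 0.0, 0, 0.0
    for r, b, l in values:
        revenue += r
        bottles += b
        liters += l
    yield county, {"revenue": round(revenue, 2), "bottles": bottles,
                   "liters": round(liters, 2)}

def run_local(records):
    """Stand-in for the shuffle: group by county, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for county, metrics in mapper(record):
            groups[county].append(metrics)
    return {k: v for county in sorted(groups)
            for k, v in reducer(county, groups[county])}

# Invented rows; note the inconsistent county spelling.
records = [
    {"county": "Polk", "sale_dollars": "117.00", "bottles_sold": "12", "volume_liters": "9.0"},
    {"county": "POLK ", "sale_dollars": "58.50", "bottles_sold": "6", "volume_liters": "4.5"},
    {"county": "Linn", "sale_dollars": "210.25", "bottles_sold": "24", "volume_liters": "18.0"},
]
by_county = run_local(records)
print(by_county)
```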

#### 3.4 Store Performance Analysis
<font color = red>[10 marks]</font> <br>

In [None]:
class MRStorePerformanceAnalysis(MRJob):

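    # Mapper: called once per input line; yield (key, value) pairs here
    def mapper(self, _, line):
        pass  # TODO: parse the line and yield (key, value)

    # Reducer: called once per key with an iterator over that key's values
    def reducer(self, key, values):
        pass  # TODO: aggregate the values and yield (key, result)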


if __name__ == '__main__':
    MRStorePerformanceAnalysis.run()
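
If store performance is measured by average sale value per transaction (one possible metric; the assignment leaves the definition to you), the average should be built from `(revenue, count)` pairs rather than from partial averages. Sums and counts can be merged in any order, so the same merge works in a combiner and in the reducer, whereas averaging per-split averages gives the wrong answer when splits differ in size. The sketch below demonstrates this with invented numbers.

```python
def merge(pairs):
    """Add up (revenue, count) pairs; safe to apply in a combiner or a reducer."""
    total, count = 0.0, 0
    for revenue, n in pairs:
        total += revenue
        count += n
    return total, count

def store_average(partials):
    """Final reduce step: mean sale value from merged partial sums."""
    total, count = merge(partials)
    return round(total / count, 2)

# Invented sales for one store, processed by two different mappers.
split_a = [(100.0, 1), (200.0, 1), (300.0, 1)]
split_b = [(50.0, 1)]

partial_a = merge(split_a)  # what a combiner would emit for split A
partial_b = merge(split_b)  # what a combiner would emit for split B

correct = store_average([partial_a, partial_b])  # mean over all four sales
naive = (200.0 + 50.0) / 2                       # mean of the two split means
print(correct, naive)
```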

#### 3.5 Trends in Liquor Sales Over Time
<font color = red>[10 marks]</font> <br>

In [None]:
class MRLiquorSalesTrends(MRJob):

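    # Mapper: called once per input line; yield (key, value) pairs here
    def mapper(self, _, line):
        pass  # TODO: parse the line and yield (key, value)

    # Reducer: called once per key with an iterator over that key's values
    def reducer(self, key, values):
        pass  # TODO: aggregate the values and yield (key, result)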


if __name__ == '__main__':
    MRLiquorSalesTrends.run()
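
For a time trend, the main design decision is the grouping key derived from the Date column. The sketch below groups revenue by month using a `YYYY-MM` key, which sorts chronologically as plain text. The accepted date formats are assumptions, so check how dates are actually written in the file, and the sample pairs are invented.

```python
from collections import defaultdict
from datetime import datetime

DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")  # assumed formats; verify against the file

def month_key(date_text):
    """Return 'YYYY-MM' for a date string, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(date_text.strip(), fmt).strftime("%Y-%m")
        except ValueError:
            continue
    return None

def monthly_revenue(pairs):
    """Sum sale amounts per month, skipping rows with unparseable dates."""
    totals = defaultdict(float)
    for date_text, sale in pairs:
        key = month_key(date_text)
        if key is not None:
            totals[key] += sale
    return dict(sorted(totals.items()))

# Invented (date, sale) pairs, including one bad date.
pairs = [("01/02/2020", 117.0), ("01/15/2020", 58.5),
         ("2020-02-01", 210.25), ("not a date", 10.0)]
monthly = monthly_revenue(pairs)
print(monthly)
```

Switching the `strftime` pattern to `"%Y"` would give yearly totals with the same code.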

#### 3.6 Vendor Performance
<font color = red>[10 marks]</font> <br>

In [None]:
class MRVendorPerformance(MRJob):

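    # Mapper: called once per input line; yield (key, value) pairs here
    def mapper(self, _, line):
        pass  # TODO: parse the line and yield (key, value)

    # Reducer: called once per key with an iterator over that key's values
    def reducer(self, key, values):
        pass  # TODO: aggregate the values and yield (key, result)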


if __name__ == '__main__':
    MRVendorPerformance.run()
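
Vendor performance can be defined in several ways. As one illustration, the sketch below aggregates revenue, bottles sold, and an estimated margin per vendor, then ranks vendors by revenue. The margin formula (State Bottle Retail minus State Bottle Cost, times Bottles Sold) is an assumption about how those columns relate, and the record keys and sample rows are hypothetical.

```python
from collections import defaultdict

def vendor_metrics(records):
    """Aggregate revenue, bottles, and estimated margin per vendor, ranked by revenue."""
    totals = defaultdict(lambda: {"revenue": 0.0, "bottles": 0, "margin": 0.0})
    for r in records:
        bottles = int(r["bottles_sold"])
        # Assumption: margin per bottle = state retail price minus state cost.
        unit_margin = float(r["state_bottle_retail"]) - float(r["state_bottle_cost"])
        t = totals[r["vendor_name"]]
        t["revenue"] += float(r["sale_dollars"])
        t["bottles"] += bottles
        t["margin"] += unit_margin * bottles
    ranked = ((name, {k: round(v, 2) for k, v in t.items()})
              for name, t in totals.items())
    return sorted(ranked, key=lambda item: item[1]["revenue"], reverse=True)

# Invented rows for two vendors.
records = [
    {"vendor_name": "Vendor B", "state_bottle_cost": "8.00", "state_bottle_retail": "12.00",
     "bottles_sold": "10", "sale_dollars": "120.00"},
    {"vendor_name": "Vendor A", "state_bottle_cost": "6.50", "state_bottle_retail": "9.75",
     "bottles_sold": "12", "sale_dollars": "117.00"},
    {"vendor_name": "Vendor A", "state_bottle_cost": "10.00", "state_bottle_retail": "15.00",
     "bottles_sold": "4", "sale_dollars": "60.00"},
]
ranking = vendor_metrics(records)
print(ranking)
```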