# Liquor Sales Capstone Project Pitch
- The state of **Iowa** publishes its class **E** license liquor sales data monthly at https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy
- It contains more than $19M$ samples of the detailed transaction records from the liquor vendors to
the various retail stores within the state, since $2012$.
- The dataset records the rich interaction between the $\sim 300$ liquor vendors, 
$\sim 2400$ liquor stores 
(with their lat-long coordinates), selling $\sim 9.4K$ liquor products.
- The quality of the data allows us to study local liquor business activity in a way that parallels commercial data.

<img src="data/Iowa state.jpg" height=400 width=400>

<img src="data/liqour-by-ostill-istock.jpg" height=400 width=400>

## The Data and the Project Scope
- The data (downloadable from the above link) has various columns on transaction date, vendor information, store information (including name, address,
city, zip, county, long-lat), liquor item, liquor category, item cost, item retail selling price,
pack, bottle, total volume, sales amount, etc.
- With the additional information on **proof** (the alcohol content) at 
https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Products/gckp-fe7r, it allows us to
paint a vivid picture of the local alcohol consumption.

#### The Scope of the Project
- Building a relational database
- Data cleaning to resolve the data inconsistencies
- From the perspective of the products: product analytics
- From the perspective of the vendors: inventory/demand analytics
- From the perspective of the liquor stores: location analytics and demand analytics
- From the perspective of ML: time series prediction, classification, clustering and market basket analysis (MBA)
- You will make heavy use of data analysis, visualization techniques and statistical/ML models for these tasks.


- Please be aware that the purpose of the project is to train your data science skills in a
business environment.  Whether you perform data analysis, time series analysis or machine learning,
the ultimate goal is to provide valuable business insights/impact to the recipients/stakeholders.

### Step 1---Data Cleaning
- The transaction data has been collected on an ongoing basis. The long time horizon of the data collection process has produced various inconsistencies in the data.
   - The same liquor item might be mapped to different categories or names at different times.
   - The names of the categories are time dependent, sometimes producing over-fragmentation.
   - The same store (using the store number as its primary key) might be associated with non-unique
   long-lat coordinates and more than one (non-standardized) store name.
   - The same vendor (using the vendor number as its primary key) might have more than one name.
   - The same item (using its item number as the primary key) might have different packs, inconsistent 'bottles sold' information or different item descriptions.

- To solve the data consistency issue, decompose the original dataframe into multiple tables.
    - Define a new dataframe called **products**, cleaning up the inconsistency.
    - Define a new dataframe called **product prices**, recording the historical bottle price 
    changes of the products.
    - Define a new dataframe called **vendors**, cleaning up the inconsistency.
    - Define a new dataframe called **stores**, cleaning up the inconsistency.
    - Define a master dataframe called **transactions**, which records the vendor number, store number, product (item number), transaction date, 
    bottles sold, volume (either in gallons or in liters), sales amount. 
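
The decomposition above can be sketched with **pandas** as follows. This is only a minimal illustration on made-up rows: the column names ('Store Number', 'Store Name', 'Sale (Dollars)', ...) are assumptions to be checked against the actual download, and the helper `latest_value_table` with its "most recent record wins" rule is a hypothetical choice, not the only reasonable way to resolve conflicting names.

```python
import pandas as pd

# Toy rows standing in for the raw transaction file; the column names are
# assumptions and should be matched to the actual download.
raw = pd.DataFrame({
    "Date": ["01/05/2015", "02/10/2015", "03/15/2016", "03/20/2016"],
    "Store Number": [2633, 2633, 4829, 4829],
    "Store Name": ["Hy-Vee #3 / BDI / Des Moines", "Hy-Vee #3",
                   "Central City 2", "Central City 2"],
    "Item Number": [11788, 11788, 43337, 43337],
    "Bottles Sold": [12, 6, 24, 12],
    "Sale (Dollars)": [180.0, 90.0, 300.0, 156.0],
})
raw["Date"] = pd.to_datetime(raw["Date"], format="%m/%d/%Y")


def latest_value_table(df, key, cols):
    """Keep one row per key, using the most recent record as the canonical one."""
    return (df.sort_values("Date")
              .drop_duplicates(subset=key, keep="last")[[key] + cols]
              .reset_index(drop=True))


# One row per store number, with a single canonical name
stores = latest_value_table(raw, "Store Number", ["Store Name"])

# The transactions table keeps only keys and measures; names live in `stores`
transactions = raw[["Date", "Store Number", "Item Number",
                    "Bottles Sold", "Sale (Dollars)"]]

print(stores)
```

The same helper could be reused for the **vendors** and **products** tables by changing the key and the descriptive columns.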

<img src='data/mega_store.jpg' height=400 width=400>

### Creating a Relational Database
- The current data size is manageable on a personal computer/laptop. But it would be helpful to your job search to create a relational database and populate it with these cleaned tables by insertions. The full transaction records should then be recoverable as a 
multi-table join of these fundamental building blocks.

You may use the **sqlite3** package or other similar packages to establish a python connection
to the relational DB.
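
As a minimal sketch of this step, the snippet below loads two toy tables into an in-memory SQLite database with the standard-library **sqlite3** module and recovers a joined view with SQL. The table names, the snake_case column names and the sample rows are illustrative assumptions, not the project's actual schema.

```python
import sqlite3

import pandas as pd

# Toy cleaned tables (hypothetical contents and column names)
stores = pd.DataFrame({"store_number": [2633, 4829],
                       "store_name": ["Hy-Vee #3", "Central City 2"]})
transactions = pd.DataFrame({"store_number": [2633, 2633, 4829],
                             "item_number": [11788, 43337, 11788],
                             "bottles_sold": [12, 6, 24]})

conn = sqlite3.connect(":memory:")  # pass a file path instead for a persistent DB
stores.to_sql("stores", conn, index=False, if_exists="replace")
transactions.to_sql("transactions", conn, index=False, if_exists="replace")

# Rebuild a store-level summary by joining the building blocks
query = """
SELECT s.store_name, SUM(t.bottles_sold) AS bottles
FROM transactions AS t
JOIN stores AS s ON s.store_number = t.store_number
GROUP BY s.store_name
ORDER BY s.store_name
"""
joined = pd.read_sql_query(query, conn)
conn.close()
print(joined)
```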

<img src='data/db.gif'>

#### Time Aggregation
- The date strings can be converted to datetime objects by **pd.to_datetime**.
- To study the aggregated behavior, it is often helpful to aggregate sales on a weekly or monthly
basis. The syntax df.groupby(['Item Number', ...., pd.Grouper(key='Date', freq='M')]) (change 'M' into
'W' for weekly sales) is very useful here. Note that recent pandas versions spell the month-end frequency 'ME' and warn about 'M'.
- The manual of **pd.Grouper** could be helpful: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Grouper.html
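
A small sketch of the monthly aggregation on made-up rows is shown below. The column name 'Volume Sold (Liters)' is an assumption, and the sketch substitutes freq='MS' (month-start labels) for the 'M' in the text, since 'MS' is accepted by both older and newer pandas versions.

```python
import pandas as pd

# Made-up transactions; column names are assumptions
df = pd.DataFrame({
    "Date": ["01/05/2015", "01/20/2015", "02/03/2015", "02/17/2015"],
    "Item Number": [11788, 11788, 11788, 43337],
    "Volume Sold (Liters)": [9.0, 4.5, 18.0, 10.5],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Sum the volume per item and per calendar month (labelled by month start)
monthly = (df.groupby(["Item Number", pd.Grouper(key="Date", freq="MS")])
             ["Volume Sold (Liters)"].sum()
             .reset_index())
print(monthly)
```

Replacing 'MS' by 'W' gives weekly totals instead.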

### The Usage of Pivot or Pivot Table
- To convert a long format dataframe into a wide format table, 
the dataframe methods **pivot** and **pivot_table** are crucial. Please refer to the
**pandas** documentation for the details: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.pivot.html
or https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
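
For instance, a long table of monthly store volumes can be turned into a stores-by-months matrix as sketched below. The data and the column names ('Month', 'Volume Sold (Liters)') are made up for illustration; **pivot_table** is used rather than **pivot** because it aggregates duplicate (store, month) rows instead of raising an error.

```python
import pandas as pd

# Long format: one row per (store, month) record, possibly with duplicates
monthly = pd.DataFrame({
    "Store Number": [2633, 2633, 4829, 4829, 4829],
    "Month": pd.to_datetime(["2015-01-01", "2015-02-01",
                             "2015-01-01", "2015-02-01", "2015-02-01"]),
    "Volume Sold (Liters)": [13.5, 18.0, 10.5, 3.0, 4.5],
})

# Wide format: one row per store, one column per month
wide = monthly.pivot_table(index="Store Number", columns="Month",
                           values="Volume Sold (Liters)",
                           aggfunc="sum", fill_value=0)
print(wide)
```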

### Data Analysis
- At its core, the data describes multiple liquor vendors selling thousands of liquor products to thousands of Iowa retail stores.
The retail stores order these products from different vendors to satisfy the local demand
of the consumers. Even though the individual consumers do not appear in the data, the aggregation of 
all the retail stores' sales in the same area does reflect the local demand for the liquor products.

- The data analysis can be carried out from the perspective of the vendors, 
the stores and the products.  As in any market in a capitalist economy, stores open and close.
New products are introduced into the market, while some existing products lose 
popularity among the consumers and are forced to exit the liquor market.

- **Vendor Analysis**:
    - There are nearly $10K$ liquor products and only $300+$ vendors, 
    so most of the vendors must be selling multiple products. How many products does a vendor sell?
    How does this evolve w.r.t. time?  How many categories do these products fall into?
    
    - How many sales channels (retail stores) does each vendor have and how does it evolve w.r.t. time?
    - Are different vendors supplying the same product? Are different vendors supplying the same stores?
    - Is there direct competition among different vendors? Tell a story (case study) about the growth/shrinkage
    of their sales channels.

- **Store Analysis**:
    - Categorize the retail liquor stores into different types, chain-super market, specialized
    liquor stores, convenience stores, news stands, tobacco stores,..... and report the relevant statistics.
    - New stores pop up and some of them shut down after a few years. Analyze the store opening dynamics
    and report the findings on store life-cycle, store survival analysis, survival curve, survival
    probabilities, etc. Refine your findings based on the store categories you design. Is it easy for
    the stores to experience sales volume growth? With new stores popping up for competition, how does
    it affect the sales of the existing stores?  
    Does the pie (Gross sales volume in **Iowa**) grow bigger? Do the stores share the bigger pie?
    Does the winner take all or more players share smaller slices?
    - In terms of store inventories, report the varieties of product/product category for different
    store types. 
    What product/product category are the major sales contributors for different types of stores?
    - Study the monthly sales volumes and gross profits (without taking price-discounts into account)
    of the stores. 
    - Use the county population information, 
    say https://www.iowa-demographics.com/counties_by_population, to estimate the stores-per-capita in different
        counties. Using the supermarket chain **Hy-Vee** as a case study, depict its store distribution
        across the counties of Iowa and their liquor sales dynamics.
    - Within the context of **B2B**, 
    the stores play the role of the vendors' customers.
    Thus the concept and tools of customer analytics can be used to analyze the stores.
    As not all the stores can continue to survive or they might change their vendors, the vendors often like to know the
    **customer lifetime value** of their customers (the stores).
    Provide an analysis of the **CLTV** (over a fixed time horizon) based on the nature of the stores.
    You may visit https://exponea.com/blog/customer-lifetime-value-guide/
    for an introduction to customer lifetime value computation. Note that
    your discussion of 'customer churn', in the context of the liquor stores as the vendors' customers, must
    include store shutdowns.
    

    
- **Product Analysis**:
    - Among the nearly $10K$ liquor products, which product categories do they belong to? What 
    are the popular product categories?  
    What are the popular brands (in terms of product line-ups)?
    As there are too many categories to be considered, design a coarse-grained category system based on
    the types of liquor (Whiskies, Rum, ....) and domestic/import status.
    
    - **Product Survival**: How many products are introduced into the market every year? How many
    are removed from the market every year?  On average, what is the chance (probability) that a new product
    can survive till the end of the current data?
    
    By looking at each product category (or the coarse category you design), 
    your team can refine your analysis and paint a dynamical picture on
    the product's life-cycle and the market competition.
    
    - For simplicity, restrict the analysis to the products which are sold from the beginning (2012) to the end of the data.
    In terms of mean monthly sales volume and sales amount, which products, which brands and which categories are
    the leaders and the laggards?
    
    - For the successful products (in absolute terms, or within their category) 
    which sell well (top of the monthly sales volume or sales $\$$ list), 
    what are their sales channels?
    How many stores are they sold to? Among the popular products, is there a growth trend in the sales
    volume?
    Does the number of stores (sales channels) grow w.r.t. time, or is it same-store sales
    growth which is responsible for the sales growth? Conversely, does the shrinkage of the sales channels
    induce sales declines? 
    
    With the new products popping up constantly, is it easy for a typical liquor 
    product to experience sale-volume growth? Relate your discussion with liquor market competition.
    
    - For liquor products, alcohol content/retail price is the basic gauge on how cheap or expensive the
    products are. Study if there is any relationship between alcohol volume/$\$$ and the sales volume.
    - What kind of sales channels (chain-supermarket, specialized liquor store, convenience store, drug store, ....) are more
    important in terms of the sales?
    - **Profit margin**: Using (state retail price - bottle cost) as a proxy, estimate the
        profit margin of the individual products or the product categories. Which products or product categories
        offer higher profit margins?
    -  Looking at the quarterly aggregated sales volumes, do different products show seasonal popularity?
    For example, hypothetically the whiskies are super popular among the **Iowa** residents in the winter, etc. 
    - Liquor is distilled from fermented starches and sugars (grains, potatoes, sugarcane, ...), which are commodities. Thus the liquor
    prices are affected by the commodity prices. Over the years, the vendors may raise their
    liquor product prices to compensate for inflationary pressure. Report the percentage price increases of
    different types of liquor products/product categories during the studied period 2012-current.
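
The survival questions raised above, for stores as well as for products, can be approached with a Kaplan-Meier estimate. The following is a minimal pandas sketch on hypothetical durations; the function `kaplan_meier`, the column names and the data are illustrative assumptions. In practice you must also decide how to define an 'exit' (for example, no sales during the last few months of the data), and products or stores still active at the end of the data are treated as censored.

```python
import pandas as pd

# Hypothetical products: months on the market, and whether the exit was observed
# (1 = product disappeared, 0 = still selling at the end of the data, i.e. censored)
products = pd.DataFrame({
    "duration_months": [6, 12, 12, 24, 36, 36, 48, 60],
    "exited":          [1,  1,  0,  1,  1,  0,  0,  0],
})


def kaplan_meier(durations, events):
    """Return the Kaplan-Meier survival probability at each observed duration."""
    df = pd.DataFrame({"t": durations, "d": events})
    by_time = df.groupby("t")["d"].agg(exits="sum", total="count").sort_index()
    # Number still at risk just before each time point
    at_risk = len(df) - by_time["total"].cumsum().shift(fill_value=0)
    survival = (1.0 - by_time["exits"] / at_risk).cumprod()
    return survival.rename("survival")


km = kaplan_meier(products["duration_months"], products["exited"])
print(km)
```

Running the same estimate per store type or per coarse product category gives the stratified survival curves asked for above.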

### Price Sensitivity and Sales Volume
- It is well known that consumers react strongly to price drops,
hence the slogan "Pricing Drives Sales".
- The **Iowa** liquor data offers a window into the relationship between
prices and sales volume that is unparalleled among open datasets. 
- To compare prices and sales-volume, it is crucial to recognize:
    - different products have different bottle volumes, which affects their prices.
    - the same product could have different sales-volume at different time of the year due to
    seasonal fluctuations. If a liquor product is popular in the summer, its sales-volume picks up
    during the summer, which has nothing to do with its price change. 
- In order to address these issues, we will need to use normalized pricing, i.e. price/unit volume
in our study.
- To reduce noise in the data, we need to aggregate the sales volume into weekly or monthly 'Volume Sold'
and average over different months to reduce the effect of seasonality.
- Verify that there is a general trend of negative price sensitivity: an increase in a liquor's
    price lowers its average monthly sales volume. Frame your analysis using simple linear regression.
- Stratify your analysis by liquor category to refine it.  
- What does it tell you about the sensitivity of the sales volume to the retail price/unit volume for different
liquor categories? Hint: you may need to lump different categories together to avoid sparse samples.

<img src='data/price sensitivity plot.png' height=400 width=400>

- The negative slope of the linear regression line in the log-log plot suggests a power-law relationship,
$$\text{volume} = (\text{price})^{\beta_1}\cdot 10^{\beta_0+\epsilon},$$
which is equivalent to $\log_{10}(\text{volume}) = \beta_0 + \beta_1\log_{10}(\text{price}) + \epsilon$.
- The more negative the slope $\beta_1$, the sharper the drop in sales volume as the price rises.
- The above simple linear regression has $\beta_1\cong -1.22$ with $R^2\sim 0.25$.
- This indicates that $75\%$ of the variance of the (log) sales volume is not explained by the price.
- **Question**: Stratifying the price/bottle-volume vs sales volume data according to the product
    categories, the products in different categories follow different negative slopes. Tie the
    different price sensitivities with the different levels of competition of different types of liquor products.
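
The log-log regression described above can be sketched with **numpy** as follows. The data here is synthetic, generated from the power law itself with assumed parameters; it is not the Iowa data and the fitted numbers say nothing about the real market. On the real data, `price` would be the price per unit volume and `volume` the average monthly sales volume of each product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from the power law itself (NOT the Iowa data):
# log10(volume) = beta0 + beta1 * log10(price) + noise
true_beta0, true_beta1 = 3.0, -1.2
price = rng.uniform(5.0, 100.0, size=500)      # price per unit volume
noise = rng.normal(0.0, 0.3, size=500)
volume = price ** true_beta1 * 10 ** (true_beta0 + noise)

# Simple linear regression in log-log space
x, y = np.log10(price), np.log10(volume)
beta1, beta0 = np.polyfit(x, y, deg=1)         # slope first, then intercept

residuals = y - (beta0 + beta1 * x)
r_squared = 1.0 - residuals.var() / y.var()
print(f"beta1={beta1:.2f}, beta0={beta0:.2f}, R^2={r_squared:.2f}")
```

Repeating the fit within each (lumped) liquor category yields one slope per category, which is the quantity to compare in the question above.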

### Time Series Analysis
- **Descriptive analysis and Covid-19**:
     - Study the time series of the **Iowa** state-wide monthly aggregate sales volumes and dollar amounts. 
        What are the major seasonal patterns? Which holidays do they coincide with?
     - The aggregate time series can be sub-divided into $\sim 10k$ **product sales volumes** time series,
    $\sim 2.5K$ **store sales volumes** time series, $\sim 300$ 
    **vendor sales volumes** time series, $\sim 100$ **county sales volume**
    time series etc.
    
    - Use product or store competition as your theme, dig out insights on the sales volume quantile statistics time series.
    - **Covid-19** and the surge of Iowa liquor sales:  According to the news media 
        https://globegazette.com/news/state-and-regional/govt-and-politics/covid-19-causes-iowa-liquor-sales-surge/article_d1b8f0b0-9449-57ac-bb3f-95b47d122e3a.html
        or https://iowacapitaldispatch.com/2020/05/15/covid-19-appears-to-have-triggered-a-spike-in-liquor-sales/ the Iowa 
        liquor sales spiked after the coronavirus outbreak. In fact, not all stores, 
        counties, products or vendors saw a sales spike. In your analysis, use your
        quantitative skills to identify the stores, the products and the counties which experienced the sales spikes.
        The news coverage suggests that the spike in unemployment is 
        the major cause of addictive liquor consumption. 
        Is your time series analysis consistent with this claim? Based on your findings, what types of
        sales channels are more exposed to consumers with an addictive drinking pattern?
        In terms of geographic locations, which cities, zip codes or counties are more prone to
        liquor addiction induced by the coronavirus outbreak? 
    
- For **Iowa** unemployment rate, visit 
https://www.iowaworkforcedevelopment.gov/local-area-unemployment-statistics
- For county population, visit https://data.iowa.gov/Community-Demographics/County-Population-in-Iowa-by-Year/qtnr-zsrc    
- For the various **Iowa** related data, you may visit their data center https://www.iowadatacenter.org/data
    or the other online resources.
        
        
- **Demand/Inventory Analytics**: 
    - In supply chain analytics, forecasting the demands and optimizing the inventories are important tasks to maintain profitability. 
    - Following https://www.bain.com/insights/demand-forecasting-with-advanced-analytics/, take
        a top-down approach to forecast the demand for the various liquor products. 
    - For inventory management (see https://www.scnsoft.com/blog/inventory-optimization-with-data-science),
        inventory shortage and holding cost are two opposite pitfalls your team wants to avoid.
        Frame your regression problem to optimize this trade-off.
    - Instead of predicting each product's demand individually, it is more stable to forecast the total demand and
    then forecast the sub-components.  For example, forecast the monthly aggregated sales volume, then
    predict the monthly sales volume of the liquor categories, .... You may find classifiers handy
    to predict, say, the sales percentages of different liquor categories. 
    In other words, forecast the aggregate demand first, then forecast the sub-category demands
    as fractions (probabilities) using probabilistic **ML** classifiers. 
    - Traditionally, time series models like the **exponentially weighted moving average**, **ARIMA**, **VAR** or **VARMAX** can be used to perform
        time series forecasting. Besides these, multiple linear regression and advanced regression models such as
        **SVR**, tree-based models and neural network models can also be used for time series tasks.
        The concept of stationary vs non-stationary time series is an important consideration in
        time series forecasting.

#### Monthly Sales Volume Time Series of Various Whiskies Related Categories

<img src='data/time series of various whiskies sales volume.png' height=700 width=500>

### Seasonality of Liquor Consumption
- The consumption of some types of liquor is highly periodic (seasonal).
- Can you identify all of these products?

<img src='data/Gins and Cocktails.png' width=500 height=400>

### Moving Average, Arima, Varmax
- For the concept of **moving average**, visit https://www.datacamp.com/community/tutorials/moving-averages-in-pandas
    
- Take a look at https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
    for **ARIMA** in python statsmodels.
- For **VARMAX**, visit  
https://machinelearningmastery.com/time-series-forecasting-methods-in-python-cheat-sheet/
- For the application of time series forecasting in demand analytics, read the blog 
https://www.linkedin.com/pulse/demand-forecasting-using-arima-holt-winters-logistics-faiz-fablillah
    to gain some information.
    

### Moving Average Example
- The moving average smooths out the noise in the time series data.
- It can be used as a very basic method of time series forecasting. 
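
A minimal sketch of both uses with **pandas** is given below. The monthly volumes are made-up numbers and the 3-month window is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up monthly sales volumes for one product category
sales = pd.Series([100.0, 120.0, 90.0, 110.0, 130.0, 105.0],
                  index=pd.period_range("2015-01", periods=6, freq="M"),
                  name="volume")

# 3-month moving average; the first two entries are NaN (incomplete window)
ma3 = sales.rolling(window=3).mean()

# Naive forecast: predict next month by the last available moving average
next_month_forecast = ma3.iloc[-1]
print(ma3)
print(next_month_forecast)
```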

<img src='data/Vodka and Moving Average.png' height=500 width=400>

### Location Analytics Through Unsupervised Learning
- Clearly it does not make sense to place a store in a rural location
with almost no residents. On the other hand, opening a new store in a densely populated area
could mean fierce competition from the existing stores. Therefore it is very important for the
store owner to have the business intelligence to choose the right store location.
- In **scikit-learn** there is an unsupervised **KNN** (the **NearestNeighbors** class), which outputs the row indexes of the nearest
neighbors and their distances. Using unsupervised **KNN** as a tool on the store long-lat coordinates, analyze the store success rate/store sales volumes
in the highly competitive and the less competitive areas. 
- Visualize the stores on the **Iowa** state map, identify the highly competitive and the less competitive areas,
and relate them to the population densities or the major population hubs.

- There is a well known density-based clustering technique called **DBSCAN**, which forms clusters
from densely accumulated samples. The isolated samples which do not satisfy the density
criterion are marked as outliers with no cluster assignment.

- At the end, please report your findings as advice to an entrepreneur who wants to enter the liquor sales
business.
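
The two tools mentioned above can be sketched on a handful of hypothetical store coordinates as follows. The coordinates, the number of neighbors `k`, and the DBSCAN parameters `eps` and `min_samples` are all illustrative assumptions that need tuning on the real store table. Longitude is rescaled by the cosine of the latitude so that Euclidean distances roughly track distances on the ground.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Hypothetical store coordinates (longitude, latitude): a tight urban group
# of five stores plus two isolated rural stores.
lon_lat = np.array([
    [-93.62, 41.59], [-93.61, 41.60], [-93.63, 41.58],
    [-93.60, 41.59], [-93.62, 41.61],
    [-95.80, 42.90], [-91.20, 40.80],
])

# Rescale longitude by cos(latitude) so Euclidean distance roughly tracks
# distance on the ground (measured in degrees of latitude).
mean_lat = np.deg2rad(lon_lat[:, 1].mean())
xy = np.column_stack([lon_lat[:, 0] * np.cos(mean_lat), lon_lat[:, 1]])

# Each point's nearest neighbor is itself, so ask for k + 1 neighbors.
k = 2
nn = NearestNeighbors(n_neighbors=k + 1).fit(xy)
distances, indices = nn.kneighbors(xy)
mean_neighbor_dist = distances[:, 1:].mean(axis=1)   # small = crowded area

# Density-based clusters; the label -1 marks isolated stores (outliers)
labels = DBSCAN(eps=0.05, min_samples=3).fit_predict(xy)
print(mean_neighbor_dist.round(3))
print(labels)
```

The mean neighbor distance can serve as a simple competition score per store, to be compared against store sales volumes or survival.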






### Clustering On the Store Demands
- Given that there are thousands of liquor stores, or supermarkets which sell liquors, it is interesting
to segment the stores based on their liquor order time series.
  - It would be insightful to aggregate the raw transactions into monthly sales volume.
  - Either one can focus on a particular product, a category, an aggregation of several categories,
or the total liquor sales volume.
  - We would like to group the different stores into different clusters, based on their similarity on 
    monthly sales-volume time series. 
  - There are multiple ways to frame the question:
     - Either we focus on the volume time series themselves
     - Or we consider the log-return time series $\log(V_{m+1})-\log(V_m)$
     - Notice that if we consider the $N\times M$ dataframe, where $N$ stands for the number of
    stores and $M$ stands for the total number of months, then the **cosine similarity** after row-wise de-meaning
    is exactly the store-vs-store correlation, so the **cosine distance** is one minus that correlation.
- Use either **Kmeans** (with Euclidean distance) or hierarchical clustering (with **cosine** distance)
to cluster the stores (you probably want to ignore the stores with short durations) and interpret
your results in terms of the store types or store locations. 
- **Question**: Does your clustering provide additional insights on 
    different stores' demand patterns based on their locations or their store customers?
- The store demands, from the perspective of the liquor vendors,
are highly related to the **customer lifetime value** discussed above.
Formulate your clustering features in terms of **CLTV** and provide
valuable insights to the liquor vendors.
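
The remark above on cosine distance and correlation can be checked numerically. The sketch below uses a synthetic $4\times 24$ matrix (4 stores, 24 months) of random volumes, which is an assumption for illustration only: after subtracting each row's mean, the cosine similarity between rows coincides with the Pearson correlation between the stores' time series.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic monthly volumes: 4 stores (rows) x 24 months (columns)
volumes = rng.gamma(shape=2.0, scale=50.0, size=(4, 24))

# Row-wise de-meaning, then normalize each row to unit length
centered = volumes - volumes.mean(axis=1, keepdims=True)
unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)

cosine_similarity = unit @ unit.T
cosine_distance = 1.0 - cosine_similarity

# Pearson correlation between the rows (stores)
correlation = np.corrcoef(volumes)
print(np.allclose(cosine_similarity, correlation))
```

The resulting `cosine_distance` matrix can be passed to a hierarchical clustering routine that accepts precomputed distances.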

### Market Basket Analysis/Association Rule Mining
- Market Basket Analysis (also known as association rule mining) is an unsupervised ML technique
aimed at finding hidden patterns in the transaction records.
- In our current data, a single row records the order of a single liquor product item.
In order for the technique to be applicable, we aggregate the rows into baskets 
by customer (in this case the store number) and by date,
at a weekly or a monthly frequency. 
- **MBA** aims to uncover the items (in our case the different liquor products) which
are frequently purchased together. Report your findings.
- What happens if we condition on different types of stores 
(chain-super markets, specialized liquor stores, small convenience stores, news stands)?

- What happens if we condition on the month? Report your findings for different months.
- Consider the stores in the same county, with the same zip code, within the major cities
 (Des Moines, Cedar Rapids, Davenport, ....), what insights does **MBA** give us?
- Regarding the stores in the densely populated areas vs the rural areas, what kinds of 
transaction patterns do we see?

- In python, there are packages like **mlxtend** or **apyori**, which implement the apriori algorithm
for **MBA**. Visit https://intellipaat.com/blog/data-science-apriori-algorithm/ for a brief introduction 
    to **MBA**.
- In R, **arules** implements **MBA**. Visit https://blog.exploratory.io/introduction-to-association-rules-market-basket-analysis-in-r-7a0dd900a3e0
for some explanation.
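
To make the basket construction and the basic rule metrics concrete, here is a minimal sketch on made-up orders. It counts item pairs by brute force rather than running the apriori algorithm, so it only illustrates support, confidence and lift on a tiny example; the column names and item labels are assumptions. For the full data, a dedicated package is the more practical choice.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Made-up orders: one row per (store, date, item)
orders = pd.DataFrame({
    "Store Number": [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "Date": pd.to_datetime(["2015-01-05", "2015-01-19", "2015-01-26",
                            "2015-01-07", "2015-01-21",
                            "2015-01-06", "2015-01-13", "2015-01-27",
                            "2015-01-09"]),
    "Item": ["vodka", "rum", "gin", "vodka", "rum",
             "vodka", "rum", "whiskey", "gin"],
})

# One basket = the set of items a store ordered within a calendar month
orders["Month"] = orders["Date"].dt.to_period("M")
baskets = orders.groupby(["Store Number", "Month"])["Item"].apply(set)

n = len(baskets)
item_count = Counter(item for b in baskets for item in b)
pair_count = Counter(pair for b in baskets
                     for pair in combinations(sorted(b), 2))

rows = []
for (a, b), c in pair_count.items():
    support = c / n                       # share of baskets containing both
    confidence = c / item_count[a]        # P(b in basket | a in basket)
    lift = confidence / (item_count[b] / n)   # > 1: more often than by chance
    rows.append((a, b, support, confidence, lift))

rules = pd.DataFrame(rows, columns=["antecedent", "consequent",
                                    "support", "confidence", "lift"])
print(rules.sort_values("lift", ascending=False))
```

Conditioning on store type, month or location amounts to filtering `orders` before building the baskets.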
    

        
        

<img src='data/market-basket.png'>

### Customer/Store Lifetime Value Prediction
- Build machine learning models or statistical models to predict
the **lifetime value** of the different stores to different products/vendors.
- You will need to formulate the business question as a machine learning
problem and use machine learning techniques to answer business relevant questions.
- Naively, predicting the **customer/store lifetime value** is a regression problem.
If you have used clustering techniques to segment the customers/stores, you can also
use classification techniques to predict the store segments.
Visit https://towardsdatascience.com/data-driven-growth-with-python-part-3-customer-lifetime-value-prediction-6017802f2e0f
for some discussion on **ML** style **CLTV** predictions.
    
- Besides modern machine learning techniques, **CLTV** can also be
estimated by statistical modeling and time series analysis. Team members with a quantitative
background can refer to https://medium.com/bolt-labs/understanding-the-customer-lifetime-value-with-data-science-c14dcafa0364
for a high level description of using the classical binomial/Poisson and gamma distributions
to model **CLTV**.


#### How to Compute the Distances between Two Long-Lat Coordinates?
- There are packages (like **geopy**) which allow you to compute the spherical distances. 
- For the purpose of machine learning, it is desirable to let the $L^2$ Euclidean distance (used by the algorithm) 
approximate the spherical distance. This can be very handy in the neighborhood comps modeling.
- Let $(\theta_1, \phi_1)$, $(\theta_2, \phi_2)$ be the (longitude, latitude) coordinates, in radians, of two points on the sphere (the earth).
When these two points are sufficiently close to each other, $\theta_1\cong \theta_2$, $\phi_1\cong
\phi_2$. For simplicity we assume that the sphere has radius $1$ (the earth's radius is about $3950$ miles). 
The spherical distance $\Delta s$ can be approximated by the following formula,

$$(\Delta s)^2 \cong \cos^2(\phi)\,(\Delta\theta)^2 + (\Delta\phi)^2,$$
where $\Delta\theta = \theta_2-\theta_1$, $\Delta\phi = \phi_2-\phi_1$, and $\phi\cong \phi_1\cong\phi_2$ ranges from about $40.6^{\circ}$ to $43.5^{\circ}$ in **Iowa**.

This suggests that if we map the long-lat coordinates $(\theta_i, \phi_i)$, $i=1,2$, to $(\cos(\phi)\theta_i, \phi_i)$,
the 2D Euclidean distance between the mapped points is an approximation of the spherical distance.

<img src='data/Long_Lat.png'>

- $\lambda$ refers to $\theta$ in our notation.
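
The quality of this approximation can be checked against the haversine great-circle formula. In the sketch below, the function names, the reference latitude of $42^{\circ}$ and the rounded city coordinates are illustrative assumptions; the earth radius is the approximate value quoted above.

```python
import numpy as np

EARTH_RADIUS_MILES = 3950.0  # approximate radius used in the text


def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


def flat_miles(lon1, lat1, lon2, lat2, ref_lat=42.0):
    """Euclidean distance after scaling longitude by cos(reference latitude)."""
    scale = np.cos(np.radians(ref_lat))
    dx = np.radians(lon2 - lon1) * scale
    dy = np.radians(lat2 - lat1)
    return EARTH_RADIUS_MILES * np.hypot(dx, dy)


# Rounded coordinates of Des Moines and Cedar Rapids (illustrative values)
des_moines = (-93.62, 41.59)
cedar_rapids = (-91.67, 41.98)

exact = haversine_miles(*des_moines, *cedar_rapids)
approx = flat_miles(*des_moines, *cedar_rapids)
print(f"haversine: {exact:.1f} miles, flat approximation: {approx:.1f} miles")
```

Using a single reference latitude for the whole state keeps the mapping linear, which is what lets a Euclidean-distance algorithm work directly on the rescaled coordinates.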

### Plotting the Iowa County Polygons
- The US county boundary (in long-lat coordinates) can be downloaded from 
https://public.opendatasoft.com/explore/dataset/us-county-boundaries/export/
- To convert a string-encoded python dictionary like 
{"type": "Polygon", "coordinates": [[[-92.5543...
into an actual dictionary, we use the **loads** function of the **json** package, **json.loads**.
- The following plot shows the liquor store locations as a scatter-plot overlaid on the **Iowa** county map.
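
A minimal sketch of the parsing step is shown below. The geometry string is a shortened, made-up rectangle in the same GeoJSON layout, not an actual county boundary from the download.

```python
import json

# A shortened, made-up geometry string in the same GeoJSON layout
geo_shape = ('{"type": "Polygon", "coordinates": '
             '[[[-92.55, 42.64], [-92.08, 42.64], [-92.08, 42.30], '
             '[-92.55, 42.30], [-92.55, 42.64]]]}')

shape = json.loads(geo_shape)          # str -> dict
outer_ring = shape["coordinates"][0]   # first ring is the outer boundary
lons = [pt[0] for pt in outer_ring]
lats = [pt[1] for pt in outer_ring]

# lons and lats can now be passed to a plotting call such as
# matplotlib's plt.plot(lons, lats) to draw the county outline.
print(shape["type"], len(outer_ring))
```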

<img src='data/Iowa Liquor Stores.png' height=500 width=500>