## Introduction
This notebook practices working with geography (shapefiles) and time (datetime types) using the Iowa Liquor Sales dataset. It focuses on county-level analysis and brings in two additional datasets: Iowa county populations during the 2010s and shapefiles for the Iowa county boundaries. Together these let us normalize sales by each county's population and visualize county-level metrics on a map.

In [None]:
import numpy as np
import pandas as pd
import fiona
import geopandas
import plotnine
import warnings
from plotnine import *
warnings.filterwarnings("ignore")

## Reading in the datasets

In [None]:
# Reading in all of the files.
geo_df = geopandas.read_file("/kaggle/input/iowa-county-boundaries/IowaCounties.shp")
pop_df = pd.read_csv("/kaggle/input/iowa-county-populations/IowaCountyPopulations.csv")
liq_df = pd.read_csv("/kaggle/input/iowa-liquor-sales/Iowa_Liquor_Sales.csv") 

The main dataset has the following columns and datatypes.

In [None]:
liq_df.dtypes

## Additional dataset cleaning and organization

The cleaned dataset with a subset of just the columns to be utilized looks like this.

In [None]:
# Cleaning up some of the columns in the liquor sales dataframe and subsetting with variables of interest.
liq_df["datetime"] = pd.to_datetime(liq_df["Date"])
liq_df["sale_usd"] = liq_df["Sale (Dollars)"].map(lambda x: float(str(x).replace("$","").replace(",","")), na_action="ignore")
liq_df["city"] = liq_df["City"].map(lambda x: str(x).lower().strip(), na_action="ignore")
liq_df["county"] = liq_df["County"].map(lambda x: str(x).lower().replace("'","").strip(), na_action="ignore")
liq_df["category"] = liq_df["Category Name"].map(lambda x: str(x).lower().strip(), na_action="ignore")
liq_df["year"] = liq_df["datetime"].dt.year
liq_df = liq_df.rename({"Volume Sold (Liters)":"volume_liters"},axis="columns")
liq_df = liq_df[["county", "city", "datetime", "year", "sale_usd", "category", "volume_liters"]]
liq_df.head(10)
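
The currency cleanup above can also be written with pandas' vectorized string methods instead of a per-value lambda. The following is a minimal sketch on made-up values, assuming the amounts arrive as strings such as "$1,234.50"; the names `raw` and `clean_usd` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up sample values; None stands in for a missing entry.
raw = pd.Series(["$1,234.50", "12.00", None, " $7.25 "])

def clean_usd(series):
    """Remove '$', ',' and whitespace, then convert to float (NaN if unparseable)."""
    stripped = series.str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")

cleaned = clean_usd(raw)
print(cleaned)
```

With `errors="coerce"`, values that cannot be parsed become NaN rather than raising, so they show up in the missing-value summary.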

Looking at just these first few rows, some columns have missing values. We can look at the percentage of each column in this dataset that is missing.

In [None]:
liq_df.isna().apply(lambda x: (x.sum()/len(liq_df))*100).to_frame(name="percent_missing")
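
Because the mean of a boolean column is the fraction of `True` values, the same percentages can be computed without `apply`. This is a small sketch on a toy dataframe; `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy dataframe with one missing value in each column.
toy = pd.DataFrame({
    "county": ["polk", None, "linn", "story"],
    "sale_usd": [10.0, 5.0, None, 2.5],
})

# isna() gives booleans; their column-wise mean is the fraction missing.
percent_missing = (toy.isna().mean() * 100).to_frame(name="percent_missing")
print(percent_missing)
```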

Less than 1% of each variable consists of missing values. The county variable is missing the most values, with ~0.6% of the values missing. For the purposes of this notebook, we will subset the dataset to only include rows where all these variables are present.

In [None]:
liq_df.dropna(axis=0, how="any", inplace=True)

## Fixing typos to make the datasets compatible

The following values are present in the county column of the main liquor sales dataset but not in the geography or population datasets.

In [None]:
# Making sure the county columns in the other two dataframes will be compatible so we can merge them.
geo_df["county"] = geo_df["CountyName"].map(lambda x: x.lower().replace("'",""))
pop_df["county"] = pop_df["County"].map(lambda x: x.lower().replace("'",""))

In [None]:
counties_in_geo_df = pd.unique(geo_df.county)
counties_in_pop_df = pd.unique(pop_df.county)
counties_in_liq_df = pd.unique(liq_df.county)
assert len(set(counties_in_geo_df)-set(counties_in_pop_df)) == 0
print(set(counties_in_liq_df)-set(counties_in_pop_df))

These look like typos that will have to be corrected manually. We can simply map them to the correct spellings and double-check that the counties mentioned in all three datasets are now the same.

In [None]:
county_name_typo_map = {"pottawatta":"pottawattamie",
                       "cerro gord":"cerro gordo",
                       "buena vist":"buena vista"}
liq_df["county"] = liq_df["county"].map(lambda x: county_name_typo_map.get(x,x))
counties_in_geo_df = pd.unique(geo_df.county)
counties_in_pop_df = pd.unique(pop_df.county)
counties_in_liq_df = pd.unique(liq_df.county)
assert len(set(counties_in_geo_df)-set(counties_in_pop_df)) == 0
assert len(set(counties_in_geo_df)-set(counties_in_liq_df)) == 0
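
The assertions above compare the sets in one direction only. A sketch of a two-way comparison is shown below, using the three typos found earlier as sample input; `county_mismatches` is a hypothetical helper, not something defined in the notebook.

```python
def county_mismatches(left, right):
    """Return the names found in only one of the two collections."""
    left, right = set(left), set(right)
    return {
        "only_left": sorted(left - right),
        "only_right": sorted(right - left),
    }

# Sample names: the liquor data contains truncated spellings.
liq_names = ["polk", "pottawatta", "cerro gord", "buena vist"]
pop_names = ["polk", "pottawattamie", "cerro gordo", "buena vista"]

typo_map = {
    "pottawatta": "pottawattamie",
    "cerro gord": "cerro gordo",
    "buena vist": "buena vista",
}

before = county_mismatches(liq_names, pop_names)
after = county_mismatches([typo_map.get(n, n) for n in liq_names], pop_names)
print(before)
print(after)
```

Checking both directions would also catch a name that appears only in the liquor data, which a one-way difference from the geography side would miss.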

## Exploring the dataset

Let's look at the number of sales recorded in this dataset during each year the dataset covers, as well as the total amount of sales in terms of USD.

In [None]:
# Making a plot of the total number of sales logged during each year in the dataset.
plot_data = liq_df.groupby(liq_df.datetime.dt.year).size().reset_index()
plot_data.columns = ["year","number_of_sales"]

plotnine.options.dpi = 100
plotnine.options.figure_size=(6,4)
(ggplot(data=plot_data)
 + aes(x="year", y="number_of_sales")
 + geom_bar(stat="identity", fill="lightgray", color="black", alpha=0.5)
 + scale_x_continuous(breaks=list(range(plot_data["year"].min(), plot_data["year"].max()+1)))
 + theme_bw()
 + ylab("Number of Sales")
 + xlab("Year")
)

In [None]:
# Making a plot of the total sales amount (in USD) logged during each year in the dataset.
plot_data = liq_df.groupby(liq_df.datetime.dt.year)["sale_usd"].sum().reset_index()
plot_data.columns = ["year","sale_usd"]
plot_data["sale_usd_m"] = plot_data["sale_usd"]/1000000

plotnine.options.dpi = 100
plotnine.options.figure_size=(6,4)
(ggplot(data=plot_data)
 + aes(x="year", y="sale_usd_m")
 + geom_bar(stat="identity", fill="lightgray", color="black", alpha=0.5)
 + scale_x_continuous(breaks=list(range(plot_data["year"].min(), plot_data["year"].max()+1)))
 + theme_bw()
 + ylab("Total Sales (Million USD)")
 + xlab("Year")
)

We can also look at the total amount of sales in terms of USD that each city recorded over the entire duration. This figure is showing only the top 40 cities, ordered by total sales amount.

In [None]:
# Looking at how much each city sold over the entire period.
num_cities_to_show = 40
plot_data = pd.DataFrame(liq_df.groupby("city")["sale_usd"].sum()).reset_index()
plot_data = plot_data.sort_values(by="sale_usd",ascending=True).tail(num_cities_to_show)
plot_data["city"] = pd.Categorical(plot_data["city"], categories=plot_data["city"].values)
plot_data["sale_usd_m"] = plot_data["sale_usd"]/1000000

plotnine.options.dpi = 100
plotnine.options.figure_size=(6,6)
(ggplot(data=plot_data)
 + aes(x="city", y="sale_usd_m")
 + geom_bar(stat="identity", fill="lightgray", color="black", alpha=0.5)
 + theme_bw()
 + ylab("Total Sales (Million USD)")
 + xlab("City")
 + coord_flip()
)
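
The sort-then-`tail` step used to pick the top cities can also be expressed with `Series.nlargest`. This is a sketch on made-up numbers; the city names and amounts are placeholders, not values from the dataset.

```python
import pandas as pd

# Made-up sales records.
sales = pd.DataFrame({
    "city": ["a", "b", "a", "c", "b", "d"],
    "sale_usd": [10.0, 20.0, 5.0, 7.0, 1.0, 3.0],
})

num_to_show = 2
totals = sales.groupby("city")["sale_usd"].sum()

# Keep the largest totals, then sort ascending so that the largest bar
# ends up at the top once the axes are flipped.
top = totals.nlargest(num_to_show).sort_values(ascending=True).reset_index()

# Fix the category order so the plot follows the sorted order, not alphabetical.
top["city"] = pd.Categorical(top["city"], categories=top["city"].values)
print(top)
```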

We can look at the same plot with respect to counties instead.


In [None]:
# Looking at how much each county sold over the entire period.
num_counties_to_show = 40
plot_data = pd.DataFrame(liq_df.groupby("county")["sale_usd"].sum()).reset_index()
plot_data = plot_data.sort_values(by="sale_usd",ascending=True).tail(num_counties_to_show)
plot_data["county"] = pd.Categorical(plot_data["county"], categories=plot_data["county"].values)
plot_data["sale_usd_m"] = plot_data["sale_usd"]/1000000

plotnine.options.dpi = 100
plotnine.options.figure_size=(6,6)
(ggplot(data=plot_data)
 + aes(x="county", y="sale_usd_m")
 + geom_bar(stat="identity", fill="lightgray", color="black", alpha=0.5)
 + theme_bw()
 + ylab("Total Sales (Million USD)")
 + xlab("County")
 + coord_flip()
)

We can visualize this same information through a geographical view by merging the data for these plots with a dataframe containing a geometry column for specifying the shape of each county.

In [None]:
# Now let's visualize the total sales in each county over the entire period.
plot_data = pd.DataFrame(liq_df.groupby("county")["sale_usd"].sum()).reset_index()
plot_data = geo_df.merge(right=plot_data, how="left", on="county")
plot_data = plot_data.dropna(axis="index", how="any")
plot_data["sale_usd_m"] = plot_data["sale_usd"]/1000000
plot_data["centroid_x"] = plot_data.centroid.x
plot_data["centroid_y"] = plot_data.centroid.y

plotnine.options.dpi = 100
plotnine.options.figure_size=(10,7)
(ggplot(plot_data)
 + geom_map(aes(fill="sale_usd_m"), show_legend=True)
 + geom_text(aes(x="centroid_x", y="centroid_y", label="county"), size=5)
 + scale_fill_gradient(name="Total Sales (Million USD)", low="#CCFF99", high="#FFC300")
 + theme(legend_position="right", panel_background=element_rect(fill="white"))
 + theme(axis_text=element_blank(), axis_ticks=element_blank(), axis_title=element_blank())
)
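
A left merge followed by `dropna` silently removes any county whose name does not match. One way to see what would be dropped is the `indicator` option of `DataFrame.merge`. The sketch below uses plain dataframes as stand-ins for the shapefile and sales tables; the names and values are made up.

```python
import pandas as pd

# Stand-in for geo_df (geometry column omitted) and for the per-county totals.
shapes = pd.DataFrame({"county": ["polk", "linn", "story"]})
totals = pd.DataFrame({
    "county": ["polk", "linn", "obrien"],
    "sale_usd": [3.0, 2.0, 1.0],
})

# indicator=True adds a "_merge" column saying where each row came from;
# validate raises if a county name is duplicated on either side.
merged = shapes.merge(totals, how="outer", on="county",
                      indicator=True, validate="one_to_one")
unmatched = merged.loc[merged["_merge"] != "both", ["county", "_merge"]]
print(unmatched)
```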

Let's instead look at data from just one particular year, normalized by the population of each county in that year.

In [None]:
# Let's take just one particular year, and normalize sales in each county during
# that year by the estimated population of that county for that year.
plot_data = liq_df[liq_df["year"]==2016]
plot_data = pd.DataFrame(plot_data.groupby("county")["sale_usd"].sum()).reset_index()
plot_data = geo_df.merge(right=plot_data, how="left", on="county")
plot_data = plot_data.dropna(axis="index", how="any")
plot_data["centroid_x"] = plot_data.centroid.x
plot_data["centroid_y"] = plot_data.centroid.y
plot_data = plot_data.merge(right=pop_df[["county","2016"]], on="county", how="left")
plot_data = plot_data.rename({"2016":"population_in_2016"}, axis="columns")
plot_data["sale_usd_per_person"] = plot_data["sale_usd"]/plot_data["population_in_2016"]

plotnine.options.dpi = 100
plotnine.options.figure_size=(10,7)
(ggplot(plot_data)
 + geom_map(aes(fill="sale_usd_per_person"), show_legend=True)
 + geom_text(aes(x="centroid_x", y="centroid_y", label="county"), size=5)
 + scale_fill_gradient(low="#CCFF99", high="#FFC300", name="2016 Sales (USD per Person)")
 + theme(legend_position="right", panel_background=element_rect(fill="white"))
 + theme(axis_text=element_blank(), axis_ticks=element_blank(), axis_title=element_blank())
)
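
The per-person normalization comes down to joining one year column from a wide population table and dividing. Here is a minimal sketch with made-up numbers, assuming (as in the code above) that the population table has one string-named column per year.

```python
import pandas as pd

# Made-up sales totals for one year and a wide population table.
sales_2016 = pd.DataFrame({
    "county": ["polk", "dickinson"],
    "sale_usd": [1000.0, 300.0],
})
pop = pd.DataFrame({
    "county": ["polk", "dickinson"],
    "2015": [90, 9],
    "2016": [100, 10],
})

year = 2016
per_capita = (
    sales_2016
    .merge(pop[["county", str(year)]], on="county", how="left")
    .rename(columns={str(year): "population"})
)
per_capita["sale_usd_per_person"] = per_capita["sale_usd"] / per_capita["population"]
print(per_capita)
```

Keeping the year in a variable makes it easy to repeat the calculation for any other year column in the population table.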

As expected, the total amount of sales in each county is tightly correlated with population size. We can look directly at the relationship between these two variables. Here, Dickinson County is highlighted in a different color, as it stood out as a data point of interest in the previous plot.

In [None]:
# As expected, total sales are tightly correlated with population.
# Let's label the data point that looked interesting on the plot above.
# This is a county with a low population where total sales were higher than would be expected.
plot_data["is_dickinson"] = plot_data["county"].map(lambda x: (x=="dickinson"))
plot_data["population_in_2016_k"] = plot_data["population_in_2016"]/1000
plot_data["sale_usd_m"] = plot_data["sale_usd"]/1000000

plotnine.options.dpi = 100
plotnine.options.figure_size=(6,4)
(ggplot(plot_data)
 + geom_point(aes(x="population_in_2016_k", y="sale_usd_m", fill="is_dickinson"), show_legend=False)
 #+ geom_smooth(aes(x="population_in_2016_k", y="sale_usd_m"), method="lm", se=False)
 + scale_fill_manual(values={True:"black",False:"white"})
 + theme_bw()
 + ylab("2016 Sales (Million USD)")
 + xlab("2016 Population (Thousands)")
)
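
To put a number on the relationship and to find points like Dickinson programmatically, one option is a Pearson correlation plus the residuals from a straight-line fit. The sketch below uses made-up values; the fit via `np.polyfit` is an illustrative choice, not a method used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up counties: four lie on a line, "e" sells more than its size suggests.
toy = pd.DataFrame({
    "county": ["a", "b", "c", "d", "e"],
    "population_k": [10.0, 20.0, 30.0, 40.0, 15.0],
    "sale_usd_m": [1.0, 2.0, 3.0, 4.0, 4.5],
})

# Pearson correlation between population and sales.
r = toy["population_k"].corr(toy["sale_usd_m"])

# Least-squares line, then the gap between observed and fitted sales.
slope, intercept = np.polyfit(toy["population_k"], toy["sale_usd_m"], deg=1)
toy["residual"] = toy["sale_usd_m"] - (slope * toy["population_k"] + intercept)

# The county with the largest positive residual sells the most relative to the fit.
outlier = toy.loc[toy["residual"].idxmax(), "county"]
print(r, outlier)
```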

We can also use the datetime column of this dataset to compare total sales (in USD) across the days of the week.

In [None]:
liq_df["day_of_week_name"] = liq_df.datetime.dt.day_name()
liq_df["day_of_week_id"] = liq_df.datetime.dt.day_of_week

In [None]:
# What do the total sales (in USD) look like distributed over days of the week?
plot_data = liq_df.groupby(["day_of_week_name","day_of_week_id"])["sale_usd"].sum().reset_index()
plot_data = plot_data.sort_values(by="day_of_week_id")
plot_data["day_of_week_name"] = pd.Categorical(plot_data["day_of_week_name"], categories=plot_data["day_of_week_name"].values)
plot_data["sale_usd_m"] = plot_data["sale_usd"]/1000000

plotnine.options.dpi = 100
plotnine.options.figure_size=(6,4)
(ggplot(data=plot_data)
 + aes(x="day_of_week_name", y="sale_usd_m")
 + geom_bar(stat="identity", fill="lightgray", color="black", alpha=0.5)
 + theme_bw()
 + ylab("Total Sales (Million USD)")
 + xlab("Day of the Week")
)

Instead of looking at statewide total sales for each day of the week, we can look at each county separately and at the spread across counties on each day, to get a sense of how consistent this pattern is across the state. Note that in this case the y-axis shows the fraction of a county's total sales that falls on a given day of the week, rather than the raw amount.

In [None]:
plot_data = liq_df.groupby(["county","day_of_week_name","day_of_week_id"])["sale_usd"].sum().reset_index()
plot_data = plot_data.sort_values(by="day_of_week_id", ignore_index=True)
plot_data["day_of_week_name"] = pd.Categorical(plot_data["day_of_week_name"], categories=plot_data["day_of_week_name"].unique())

# We want to be able to look at sales in each county as a fraction of their sales throughout the week, 
# not just their total number of sales. So let's normalize by the total sale amount in each county.
county_to_total_sales = dict(liq_df.groupby("county")["sale_usd"].sum())
plot_data["total_sale_usd"] = plot_data["county"].map(county_to_total_sales)
plot_data["day_fraction"] = plot_data["sale_usd"]/plot_data["total_sale_usd"]
plot_data.head(10)

plotnine.options.dpi = 100
plotnine.options.figure_size=(6,4)
(ggplot(plot_data)
 + geom_boxplot(aes(x="day_of_week_name", y="day_fraction", group="day_of_week_id"), show_legend=True)
 + theme_bw()
 + ylab("Fraction of Total Sales")
 + xlab("Day of the week")
)
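
The per-county normalization can also be done in one step with `groupby(...).transform("sum")`, which returns each county's total aligned to the original rows. This is a sketch on made-up values.

```python
import pandas as pd

# Made-up per-county, per-day totals.
toy = pd.DataFrame({
    "county": ["polk", "polk", "linn", "linn"],
    "day": ["Monday", "Tuesday", "Monday", "Tuesday"],
    "sale_usd": [30.0, 10.0, 5.0, 15.0],
})

# transform("sum") repeats each county's total on every row of that county.
county_total = toy.groupby("county")["sale_usd"].transform("sum")
toy["day_fraction"] = toy["sale_usd"] / county_total

# Each county's fractions should add up to 1.
fraction_sums = toy.groupby("county")["day_fraction"].sum()
print(toy)
```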

In general, Monday through Wednesday see the heaviest sales, with sales decreasing from Thursday to Saturday. There are interesting outliers: counties where a specific day of the week accounts for nearly all of the recorded sales. We can sort the dataframe by the fraction of sales for a given day and county to see which counties these are.

In [None]:
plot_data.sort_values(by="day_fraction", ascending=False, ignore_index=True)[["county","day_of_week_name","day_fraction"]].head(50)

In [None]:
sum_to_one_check = [(x>0.99 and x<1.01) for x in plot_data.groupby("county")["day_fraction"].sum().values]
assert all(sum_to_one_check)