# Milestone Report: Market Analysis for Sociable Cider Werks


## Introduction

Sociable Cider Werks is a cidery local to Minneapolis, Minnesota. Sociable has been rapidly growing, and its owners are looking to open a new taproom in 2018-2019. The owners would ideally like to open a new taproom in a metropolitan area outside of Minnesota with a population that is comparable to or larger than the population of Minneapolis (~400,000 people). 

My task is to identify a few target cities for the new Sociable Cider Werks taproom by exploring the demographic features of large US cities and determining whether some mix of demographic features makes certain cities especially amenable to the craft cider industry. 

## The Data

### U.S. Census Bureau

I gathered multiple data sets from the U.S. Census Bureau on city demographics. The demographic features included are:

* *Population:* Sociable Cider Werks is looking to open a new taproom in a city with a population of ~400,000 people or greater. Therefore, it is important to include population information to make sure the cities I explore meet Sociable's requirements. 

* *Income:* Exploring the population's income is important when considering potential target cities for a new cider taproom. Craft cider is a luxury, and it is essential to consider whether a city's population has enough disposable income to support a new cidery, so I include the median income of all cities in the analysis. 

* *Age:* Sociable Cider Werks' consumers are typically in their 20s and 30s, so I include city-level data on both median age and the percent of the population in their 20s and 30s in the analysis. 

* *Sex Ratio:* Cider is typically more popular among females, so I include sex ratio data for all cities. This measures the number of males for every 100 females. 


### US Cidery Data 

In addition to demographic data, I needed some measure of cidery performance. Over the course of my analysis, I used cidery location data (obtained [here](https://ciderguide.com/cider-maps/united-states/)) as a proxy for a city's performance in the craft cider market. 




## Data Wrangling 

All code for data wrangling can be found [here!](https://github.com/sjordan29/Springboard-Code/blob/master/Capstone_1/Data_Files.ipynb)

#### Pre-processing 

The columns of the U.S. Census Bureau datasets are named with codes, and fuller descriptions of the columns' contents are stored in separate files. I ultimately wanted to combine all of this census information into a single dataframe, but different datasets reused identical code numbers for different columns, so a code that looked the same in two datasets could refer to two different measures. Therefore, my first step was to rename the columns in each dataframe using the column descriptions provided in its separate description file.

Additionally, all of the values in these dataframes were strings instead of floats. I converted all of the numbers to floats and all of the non-numbers to NaN to make my dataframe easier to work with.
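
The conversion described above can be sketched with `pd.to_numeric` and `errors="coerce"`. This is a minimal illustration rather than the notebook's actual code: the column names, the `"(X)"` placeholder, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical sample mimicking a census table that was read in as strings.
# Column names and values are illustrative only, not taken from the real data.
raw = pd.DataFrame({
    "Geography": ["Example city, State A", "Sample city, State B"],
    "Median age": ["32.5", "(X)"],
    "Total population": ["450000", "520000"],
})

clean = raw.copy()
for col in raw.columns.drop("Geography"):
    # Numbers become floats; anything that is not a number becomes NaN.
    clean[col] = pd.to_numeric(raw[col], errors="coerce").astype(float)

print(clean.dtypes)
```

Coercing instead of raising keeps the load step from failing on placeholder strings, at the cost of having to handle the resulting NaN values later.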



#### Merging Data Frames

The next step was to merge all of the separate dataframes into a single dataframe containing all demographic and cider shipment information. The common column between the U.S. Census data and the cider data was the state name. However, in my demographic data set, city and state were contained in a single column. My first step was splitting this column into two columns, one with the city name and one with the state name.

At this point, I thought I was ready to merge the dataframes. However, I soon realized that there was an issue with the state names in the cidery data: there were leading spaces before each state name, so the merge function was unable to identify any overlapping state names. I was able to get rid of the leading spaces and then merge all of my data into a single dataframe.
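
A minimal sketch of the split, strip, and merge steps follows. The frame names (`demo`, `cider`), column names, and counts are assumptions made for illustration, not the notebook's actual identifiers or data.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; all values are made up.
demo = pd.DataFrame({"Geography": ["Example city, State A", "Sample city, State B"]})
cider = pd.DataFrame({"State": [" State A", " State B"], "Cideries": [10, 20]})

# Split "City, State" into two columns on the first comma.
demo[["City", "State"]] = demo["Geography"].str.split(",", n=1, expand=True)
demo["City"] = demo["City"].str.strip()
demo["State"] = demo["State"].str.strip()

# Remove the stray leading spaces that would otherwise prevent any match.
cider["State"] = cider["State"].str.strip()

# A left merge keeps every city even if its state has no cidery record.
merged = demo.merge(cider, on="State", how="left")
print(merged)
```

Without the `str.strip()` call on the cidery table, `" State A"` and `"State A"` would be treated as different keys and the `Cideries` column would come back entirely NaN.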

#### Identifying Data of Interest

There were far more columns in the U.S. Census data than I needed. I systematically went through all of the columns that I did not need for my analysis, or that provided redundant information, and removed them. 

The age breakdown given by the U.S. Census Bureau was much finer than I required for my analysis: it was broken down into five-year age groups. In the end, Sociable Cider Werks will want to target cities with a high percentage and a high number of people in their twenties and thirties. I therefore combined the many age columns into a much smaller number: children (people too young to drink alcoholic beverages), people in their twenties (a target population), people in their thirties (another target population), and people forty and older, who are less likely to go to a cider taproom.
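
As a rough sketch of this consolidation, the five-year brackets can be summed into broader groups. The bracket labels and percentages below are hypothetical and only cover the two target groups.

```python
import pandas as pd

# Hypothetical five-year bracket percentages for two made-up cities.
ages = pd.DataFrame(
    {
        "20 to 24 years": [9.0, 8.0],
        "25 to 29 years": [11.0, 9.5],
        "30 to 34 years": [10.0, 9.0],
        "35 to 39 years": [8.0, 7.5],
    },
    index=["City A", "City B"],
)

# Sum the fine-grained brackets into the broader groups used in the analysis.
combined = pd.DataFrame({
    "Twenties": ages[["20 to 24 years", "25 to 29 years"]].sum(axis=1),
    "Thirties": ages[["30 to 34 years", "35 to 39 years"]].sum(axis=1),
})
combined["Twenties and thirties"] = combined["Twenties"] + combined["Thirties"]
print(combined)
```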

#### Null Values

Finally, I dealt with null values. Upon examination, I realized that there was just one row with null values. Since I have plenty of city data, I decided to just drop this row rather than imputing data values.
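
Checking for and dropping incomplete rows might look like the sketch below; the frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical city table with one incomplete row.
cities = pd.DataFrame({
    "City": ["City A", "City B", "City C"],
    "Median income": [60000.0, np.nan, 55000.0],
})

# Count rows containing at least one null value before deciding how to handle them.
n_incomplete = int(cities.isna().any(axis=1).sum())

# With a single incomplete row, dropping it is simpler than imputing.
cities_clean = cities.dropna().reset_index(drop=True)
print(n_incomplete, len(cities_clean))
```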





## Initial Findings

### The Clusters 

I performed a cluster analysis on cities with populations of 400,000 or larger. I clustered based on the log of population (this reduced the effect of outliers like New York City and Los Angeles), age, income, and sex ratio. I found that there were three distinct clusters, which have the following cities:

#### Cluster 0
* San Francisco, CA
* San Jose, CA
* Washington, D.C.
* Seattle, WA

#### Cluster 1
* Mesa, AZ
* Phoenix, AZ
* Tucson, AZ
* Fresno, CA
* Sacramento, CA
* Colorado Springs, CO
* Jacksonville, FL
* Miami, FL
* Indianapolis, IN
* Louisville, KY
* Baltimore, MD
* Detroit, MI
* Kansas City, MO
* Omaha, NE
* Las Vegas, NV
* Albuquerque, NM
* Columbus, OH
* Oklahoma City, OK
* Tulsa, OK
* Philadelphia, PA
* Memphis, TN
* Nashville, TN
* Dallas, TX
* El Paso, TX
* Fort Worth, TX
* Houston, TX
* San Antonio, TX
* Milwaukee, WI

#### Cluster 2
* Long Beach, CA
* Los Angeles, CA
* Oakland, CA
* San Diego, CA
* Denver, CO
* Atlanta, GA
* Chicago, IL
* Boston, MA
* Minneapolis, MN
* New York, NY
* Charlotte, NC
* Raleigh, NC
* Portland, OR
* Austin, TX
* Virginia Beach, VA
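
The report does not state which clustering algorithm or feature scaling produced these clusters, so the sketch below covers only the feature preparation described at the start of this section: taking the log of population and putting all features on a comparable scale. The z-score standardization is an assumption, and the city names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values for three made-up cities, one with an outlying population.
cities = pd.DataFrame(
    {
        "Population": [400_000, 900_000, 8_000_000],
        "Median age": [32.0, 34.0, 36.0],
        "Median income": [55_000.0, 65_000.0, 60_000.0],
        "Sex ratio": [98.0, 101.0, 93.0],
    },
    index=["City A", "City B", "City C"],
)

# Replace raw population with its log to damp the effect of very large cities.
features = cities.drop(columns="Population").copy()
features.insert(0, "Log population", np.log(cities["Population"]))

# Assumption: z-score each feature so no single unit dominates the distances.
scaled = (features - features.mean()) / features.std(ddof=0)
print(scaled.round(2))
```

A clustering routine would then be fit on `scaled`; which routine and how many clusters to request are choices left to the linked notebooks.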


### Cidery Prevalence 

The next step was incorporating the cidery data to see if the demographic features that resulted in the clusters above have an impact on cidery prevalence. 

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the cidery counts for the cities in each cluster
city_cluster_0 = pd.read_csv('data/cidery_cluster_0.csv')
city_cluster_1 = pd.read_csv('data/cidery_cluster_1.csv')
city_cluster_2 = pd.read_csv('data/cidery_cluster_2.csv')

# Draw one box per cluster showing the distribution of cideries per city
cidery_frequency = [city_cluster_0.Frequency, city_cluster_1.Frequency, city_cluster_2.Frequency]
plt.boxplot(cidery_frequency)
plt.xticks([1, 2, 3], ['0', '1', '2'])
plt.xlabel('Cluster')
plt.ylabel('Number of Cideries')
plt.title('Number of Cideries by City')
plt.show()
```

*Figure: box plot of the number of cideries per city in each cluster.*

Right off the bat, I can tell that Cluster 0 and Cluster 2 both look like strong candidates to focus on. A breakdown of cidery distribution across the three clusters then makes Cluster 2 an immediate standout:

##### Cluster 0
* Mean Number of Cideries: 2
* Median Number of Cideries: 1
* Percent of Cities with at Least 1 Cidery: 50%

##### Cluster 1
* Mean Number of Cideries: 0.89
* Median Number of Cideries: 1
* Percent of Cities with at Least 1 Cidery: 53.6%

##### Cluster 2
* Mean Number of Cideries: 3
* Median Number of Cideries: 2
* Percent of Cities with at Least 1 Cidery: 80%
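
Summary figures like these could be computed from each cluster's `Frequency` column along the following lines. The helper name `summarize` is hypothetical, the counts are made up so the output is easy to check by hand, and the percentage is read here as the share of cities with at least one cidery.

```python
import pandas as pd

def summarize(freq: pd.Series) -> dict:
    """Mean, median, and percent of cities with at least one cidery (hypothetical helper)."""
    return {
        "mean": float(freq.mean()),
        "median": float(freq.median()),
        "pct_with_cidery": float((freq >= 1).mean() * 100),
    }

# Made-up cidery counts for five cities, not the report's data.
example = pd.Series([0, 1, 1, 2, 6], name="Frequency")
stats = summarize(example)
print(stats)
```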

However, I took some precautions to make sure that the differences in cidery prevalence between these clusters were statistically significant. 

All code for the following analyses can be found [here](https://github.com/sjordan29/Springboard-Code/blob/master/Capstone_1/Inferential%20Statistics.ipynb).


#### ANOVA and Tukey's Test

I performed a one-way ANOVA test to test the null hypothesis:

$H_0: \mu_{\text{cluster 0}} = \mu_{\text{cluster 1}} = \mu_{\text{cluster 2}}$

Ultimately, I was able to reject the null hypothesis: with a p-value of 0.03, there is a statistically significant difference in the prevalence of cideries between at least two of the clusters. After performing a Tukey test, I found that there was a statistically significant difference between Cluster 2 and Cluster 1. This left me with just Cluster 2 and Cluster 0 as potential candidates for identifying a target city recommendation. 
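
For reference, the one-way ANOVA statistic behind this test is the ratio of between-cluster to within-cluster variance. With $k = 3$ clusters of sizes $n_0 = 4$, $n_1 = 28$, and $n_2 = 15$ (counted from the city lists above, so $N = 47$), and writing $x_{ij}$ for the number of cideries in city $i$ of cluster $j$ (notation introduced here for illustration, not taken from the notebook):

$$
F = \frac{\sum_{j} n_j \left(\bar{x}_j - \bar{x}\right)^2 / (k - 1)}{\sum_{j} \sum_{i} \left(x_{ij} - \bar{x}_j\right)^2 / (N - k)}
$$

Under $H_0$, and given the usual ANOVA assumptions, $F$ follows an $F$ distribution with $k - 1 = 2$ and $N - k = 44$ degrees of freedom.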

Given the much greater percentage of cities with a cidery in Cluster 2 (80%) than in Cluster 0 (50%), I was inclined to recommend a city from Cluster 2. I decided to proceed with permutation sampling to see how likely it is that a difference this large in cidery prevalence could be observed just by chance. 

#### Permutation Sampling 

I made 10,000 random permutations of two clusters of the same sizes as Cluster 2 and Cluster 0 to find the probability of observing a 30-percentage-point difference in the share of cities with cideries between Cluster 2 and Cluster 0 simply by chance. With a p-value of 0.0001, I was able to reject the null hypothesis that the difference arose by chance. This test supports the conclusion that the mix of demographic features that defines Cluster 2 creates a strong market for craft cider.  
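
The permutation procedure can be sketched with NumPy as below. This is an illustration under stated assumptions, not the notebook's code: the function names are hypothetical, the test is taken to be one-sided, and the cidery counts are made up.

```python
import numpy as np

def frac_with_cidery(counts):
    """Fraction of cities with at least one cidery."""
    return np.mean(np.asarray(counts) >= 1)

def permutation_p_value(a, b, n_perm=10_000, seed=0):
    """One-sided permutation test on the difference in fractions (a minus b)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    observed = frac_with_cidery(a) - frac_with_cidery(b)
    pooled = np.concatenate([a, b])
    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Shuffle the pooled cities and re-split them into groups of the original sizes.
        shuffled = rng.permutation(pooled)
        diff = frac_with_cidery(shuffled[:len(a)]) - frac_with_cidery(shuffled[len(a):])
        if diff >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_perm

# Made-up cidery counts per city, NOT the report's data.
group_a = [3, 2, 0, 5, 1, 4, 2, 0, 1, 3]
group_b = [0, 1, 0, 0]
observed, p_value = permutation_p_value(group_a, group_b)
print(observed, p_value)
```

The p-value is the share of shuffles whose difference is at least as large as the observed one, so its resolution is limited to `1 / n_perm`.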


### Analysis of Cluster 2

Cluster analysis and inferential statistics helped narrow down a long list of potential cities to just the fifteen cities in Cluster 2, all strong candidates for a new taproom. Having focused on differences between clusters, I next explored the subtler differences between the cities within Cluster 2 to narrow this list down to just a few strong candidates. I explored the differences in age, income, and gender distribution, and I also added an index representing the cost of living in each city to make sure I wouldn't recommend a city that has the right demographics but prohibitive economics. The exploration of the data can be found [here](https://github.com/sjordan29/Springboard-Code/blob/master/Capstone_1/Cluster_2_Analysis%20.ipynb). I will summarize my findings below:

#### Top 5 Candidates Based on Median Age 
* (Minneapolis)
* Boston
* Austin
* Atlanta
* Long Beach
* Raleigh

#### Top 5 Candidates Based on % of Population in their 20s and 30s 
* Boston
* (Minneapolis)
* Austin
* Atlanta
* Denver
* Raleigh

#### Top 5 Candidates Based on Median Income
* San Diego
* Virginia Beach
* Oakland
* Austin
* Raleigh

#### Top 5 Candidates Based on Sex Ratio 
* New York City
* Raleigh
* Charlotte
* Boston
* Chicago 

#### Top 5 Candidates Based on Cidery Prevalence 
* Portland
* Denver
* (Minneapolis)
* Austin
* Charlotte
* Chicago

#### Top 5 Candidates Based on Price of Living Index
* Raleigh
* Virginia Beach
* Charlotte
* Austin
* Atlanta 
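
Rankings like the ones above can be produced with `nsmallest` and `nlargest`. The sketch below uses only the three cities whose figures appear in the breakdowns in the Recommendations section, so it ranks a top two rather than a top five; the helper `top_n` is hypothetical.

```python
import pandas as pd

# Figures for the three cities whose breakdowns are listed in this report.
cities = pd.DataFrame(
    {
        "Median age": [32.7, 33.8, 32.1],
        "Median income": [66697, 64456, 63621],
        "Price of living index": [163, 131, 186],
    },
    index=["Austin", "Raleigh", "Boston"],
)

def top_n(df, column, n, lowest_is_best):
    """Return the n best-ranked city names for one metric (hypothetical helper)."""
    ranked = df[column].nsmallest(n) if lowest_is_best else df[column].nlargest(n)
    return ranked.index.tolist()

youngest = top_n(cities, "Median age", 2, lowest_is_best=True)
highest_income = top_n(cities, "Median income", 2, lowest_is_best=False)
cheapest = top_n(cities, "Price of living index", 2, lowest_is_best=True)
print(youngest, highest_income, cheapest)
```

Whether low or high values are favorable differs by metric (a lower median age and price index, a higher income), which is why the direction is an explicit argument.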





Considering the cities above, Austin, Raleigh, and Boston are my top three recommendations for Sociable Cider Werks.


## Recommendations

#### Austin, Texas

Austin has a young population with a high median income, while the price of living remains low, which means these high-earning individuals have lots of disposable income. Furthermore, Austin has demonstrated success in the craft cider market with four cideries currently in business. Overall, Austin looks highly similar to Minneapolis in terms of age distribution, cidery prevalence, and the ratio of males to females. However, its population has a higher overall median income with a lower price of living, which could help make a cidery even more successful in Austin than in Minneapolis. 

Breakdown:
* Median Age: 32.7
* % of Population in 20s and 30s: 39.1% 
* Median Income: $66,697
* Males per 100 Females: 102.9
* Price of Living Index: 163 (4th lowest in this cluster)
* Number of Cideries: 4

#### Raleigh, North Carolina

Similar to Austin, Raleigh has a young population and a high median income. Raleigh has the lowest cost of living index of all of these cities, so its population has plenty of disposable income for craft cider. Cider is typically more popular among females, and Raleigh has more females than males. Raleigh has only one cidery so far, which could be seen as either a warning or an exciting opportunity. Bart Watson, the economist for the Brewers Association, notes that the best place to open a new cidery has typically been in cities with other cideries. While there aren't many cideries here yet, Raleigh has the right demographic distribution, so it could be a great place to start carving out the craft cider industry alongside the one existing cidery. 

Breakdown:
* Median Age: 33.8
* % of Population in 20s and 30s: 35.8% 
* Median Income: $64,456
* Males per 100 Females: 91.3
* Price of Living Index: 131 (the lowest in this cluster)
* Number of Cideries: 1

#### Boston, Massachusetts 

Boston tops these rankings in the age category, and a young population is important for the craft cider scene. While Boston doesn't make the Top 5 list for median income, it does come in sixth for median income, and it is ranked second in this cluster for mean income. However, the cost of living in Boston is extremely high, which could affect the population's overall disposable income for craft cider. This could also indicate potentially high startup costs for a new business. 

Breakdown:
* Median Age: 32.1
* % of Population in 20s and 30s: 41.2% 
* Median Income: $63,621
* Males per 100 Females: 92.4
* Price of Living Index: 186 (the second highest in this cluster)
* Number of Cideries: 2