# U.S. Farmers Markets Exploratory Analysis

![Farmer Products](https://www.morningagclips.com/wp-content/uploads/2020/05/6468881225_9e3d16c641_k-720x400.jpg)

This is a notebook I created to explore U.S. farmers market datasets. There are several facts in the data that I want to explain. So let's get started.

## Table of Contents:
* [Overview](#section-one)
    - [Problem Question](#subsection-one-one)
    - [Methodology](#subsection-one-two)
* [Import Module](#section-two)
* [Acquire & EDA](#section-three)
* [Pre-processing](#section-four)
* [Analyze](#section-five)
* [Result & Conclusion](#section-six)

<a id="section-one"></a>
# Overview
***
<a id="subsection-one-one"></a>
## Problem Question

The problem question for this task is: **One criticism of farmers markets is that they are inaccessible to Americans who live in certain parts of the country or are of low socio-economic status. Does the data reflect this criticism?** 

***
<a id="subsection-one-two"></a>
## Methodology

To answer this question, I follow these data analysis steps:
* **Acquire Step**: finding, accessing, and moving data.
* **Prepare Step**: `Explore` (review the data to understand its nature, i.e. its quality and format, and run an initial analysis) and `Pre-process` (after a first look at the data, clean it, filter it, and make it readable by converting it to a particular data model or packaging it in a specific form; finally, integrate data from the various sources).
* **Analyze Step**: selecting the analysis techniques, building data models, and analyzing the results.
* **Report Step**: evaluating the results of the analysis, presenting them visually, and writing a report that evaluates the results against the desired criteria.

***


<a id="section-two"></a>
# Import Module
***

In [None]:
# import modules for data processing
import numpy as np
import pandas as pd

# import modules for data visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import seaborn as sns
from plotly.subplots import make_subplots

from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from itertools import cycle, islice
from pandas.plotting import parallel_coordinates


import pickle
import os

<a id="section-three"></a>
# Acquire & EDA
***

In [None]:
pd.options.display.max_columns = None
InteractiveShell.ast_node_interactivity = "all"
Mapbox = pickle.load(open('../input/mapbox/mapbox_credentials.pkl','rb'))
px.set_mapbox_access_token(Mapbox['Access Token'])

market_dataset = pd.read_csv('../input/farmers-markets-in-the-united-states/farmers_markets_from_usda.csv')
county_dataset = pd.read_csv('../input/farmers-markets-in-the-united-states/wiki_county_info.csv')
print("\n{:^30}".format('Market Dataframe'))
print("*"*120)
display(market_dataset.head(5))
print("\n{:^30}".format('County Dataframe'))
print("*"*120)
display(county_dataset.head(5))

In [None]:
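# Convert currency/number strings such as "$26,168" or "1,234" to integers
def convert_dollar_number(data):
    if isinstance(data, float):
        # floats (including NaN for missing values) pass through unchanged
        return data
    if isinstance(data, str):
        # strip the dollar sign and the thousands separators
        return int(data.replace('$', '').replace(',', ''))
    # any other type (e.g. an int) is returned as-is instead of becoming None
    return data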

for i in range(3,8):
    county_dataset[county_dataset.columns[i]] = county_dataset[county_dataset.columns[i]].apply(convert_dollar_number)
print("\n{:^30}".format('Change Data Format'))
print("*"*120)

After changing the data format, I want to understand the nature of the data. In this step, I want to know **the columns of each dataframe, whether there are any null values, and which columns contain them**.
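As a minimal sketch of this null check, the table behind such a graph can be built with `isnull().sum()`. The toy dataframe and the helper name `summarize_missing` are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Toy data (hypothetical values), not the real market dataset
toy = pd.DataFrame({
    "MarketName": ["A", "B", "C", "D"],
    "Website": [None, "http://example.com", None, None],
    "zip": ["10001", None, "94105", "60601"],
})

def summarize_missing(df):
    """Return the count and percentage of null values for each column."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "Sum": counts,
        "Percentage": counts / len(df) * 100,
    })
    # Most incomplete columns first
    return summary.sort_values("Sum", ascending=False)

missing_summary = summarize_missing(toy)
print(missing_summary)
```

Dividing by `len(df)` of the dataframe being summarized keeps the percentage correct for any dataframe passed in.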
***

In [None]:
# Plot the number (and percentage) of null values per column
def graph_missing_value(dataset, name):
    num_missing_value = dataset.isnull().sum()
    # use the size of the dataset passed in, not always market_dataset
    pct_missing_value = (num_missing_value/dataset.shape[0])*100
    frame = { 'Name': num_missing_value.index, 'Sum': num_missing_value, 'Percentage': pct_missing_value } 
    result = pd.DataFrame(frame).reset_index(drop=True)

    fig = px.bar(result, x='Name', y='Sum',
                 hover_data=['Percentage'], color='Percentage',
                 labels={'Percentage':'Percentage (%)'}, height=400)

    fig.update_layout(
        title='Percentage of Null Value on '+name,
        xaxis=dict(
            title='Feature Name',
            titlefont_size=16,
            tickfont_size=14,
        ),
        yaxis=dict(
            title='Number of Null Value',
            titlefont_size=16,
            tickfont_size=14,
        )
    )
    return fig
    
graph_missing_value(market_dataset, 'Market Dataframe')
graph_missing_value(county_dataset, 'County Dataframe')

Based on the graphs above, many columns have a large number of null values: **website, twitter, youtube, season 2, 3, 4, location, and many more**. I decided to remove those columns because they are not important for this analysis. In the county dataframe the number of null values is small, so we can simply drop the affected rows.

<a id="section-four"></a>
# Pre-processing
***
In this step, I will remove some features that are not important. Then I will change the value Y to 1 and N to 0.
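The Y/N encoding can be sketched on a toy dataframe as follows. This variant maps only a chosen list of flag columns, which differs from the whole-dataframe `replace` used in the cell below; the column subset `flag_columns` and the toy values are assumptions.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "MarketName": ["A", "B", "C"],
    "Organic": ["Y", "-", "N"],
    "SNAP": ["N", "Y", "Y"],
})

flag_columns = ["Organic", "SNAP"]  # hypothetical subset of the Y/N columns

# Map only the flag columns; values outside the mapping (such as '-') become NaN
encoded = toy.copy()
encoded[flag_columns] = encoded[flag_columns].apply(
    lambda col: col.map({"Y": 1, "N": 0})
)
print(encoded)
```

Restricting the mapping to named columns avoids accidentally changing a text column whose value happens to be exactly `Y` or `N`.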

In [None]:
# Delete feature 'Website', 'Facebook', 'Twitter', 'Youtube','OtherMedia','Season2Date', 'Season2Time', 
#'Season3Date','Season3Time', 'Season4Date', 'Season4Time','Location','updateTime'

print('Pre-processing market dataset on progress')
print("*"*120)
market_dataset.drop(['Website', 'Facebook', 'Twitter', 'Youtube','OtherMedia','Season2Date', 'Season2Time', 
                     'Season3Date','Season3Time', 'Season4Date', 'Season4Time','Location','updateTime'], axis=1,inplace=True)

# Replace null values in 'street', 'city', 'County', 'State', 'zip'
# with 'Not Reported', then drop rows that still contain nulls
market_dataset.update(market_dataset[['street', 'city', 'County', 'State', 'zip']].fillna('Not Reported'))
market_dataset = market_dataset.dropna()


# Change every 'Y' value to 1
market_dataset.replace('Y', 1, inplace=True)

# Change every 'N' value to 0
market_dataset.replace('N', 0, inplace=True)

# Replace every '-' value with the mode of the 'Organic' column
market_dataset.replace('-',market_dataset['Organic'].mode()[0],inplace=True)
# Check null values
graph_missing_value(market_dataset, 'Market Dataframe')

print('Pre-processing county dataset on process')
print("*"*120)

county_dataset = county_dataset.dropna()
graph_missing_value(county_dataset, 'County Dataframe')

After cleaning null values from the data, I merge the **market dataset and county dataset** for better understanding and then try to remove outliers.
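To illustrate what an inner merge on county and state keeps and drops, here is a small sketch. The two toy dataframes and their values are made up; only `merge` with `how`, `on`, and `indicator` is standard pandas.

```python
import pandas as pd

# Toy data (hypothetical values)
markets = pd.DataFrame({
    "MarketName": ["M1", "M2", "M3"],
    "County": ["Orange", "Orange", "Lake"],
    "State": ["California", "Florida", "Ohio"],
})
counties = pd.DataFrame({
    "County": ["Orange", "Orange", "Kings"],
    "State": ["California", "Florida", "New York"],
    "population": [3000000, 1400000, 2500000],  # made-up numbers
})

# Inner merge: keep only markets whose (County, State) pair exists in both frames
merged = markets.merge(counties, how="inner", on=["County", "State"])
print(merged)

# An outer merge with indicator=True shows which rows failed to match
check = markets.merge(counties, how="outer", on=["County", "State"], indicator=True)
print(check["_merge"].value_counts())
```

Merging on both keys matters because county names such as "Orange" repeat across states.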

In [None]:
# merge the market and county datasets
# the inner join keeps only counties that have a market in the market dataset

county_dataset = county_dataset.rename(columns={"county": "County"})

merge_dataset = market_dataset.copy().merge(county_dataset, how='inner',on=['County','State'])
merge_dataset.drop(['number'], axis=1,inplace=True)
InteractiveShell.ast_node_interactivity = "last"

The merged dataframe has zero null values. Finally, I will remove outliers from it.

In [None]:
print('EDA On Merge Dataframe')
print("*"*120)

fig = make_subplots(
    rows=3,
    row_heights=[0.6, 0.6, 0.9],
    specs=[[{"type": "box",}],
           [{"type": "box"}],
           [{"type": "box"}]])

fig.add_trace(
    go.Box(y=merge_dataset['median household income'], 
           x= merge_dataset['State'],  # x and y must come from the same dataframe
           name='median household income (Hint: point represents county in given state)',
           boxpoints='outliers', # can also be 'all', 'suspectedoutliers', or False
           jitter=0.3, # add some jitter for a better separation between points
          )
        )

fig.add_trace(
    go.Box(y=merge_dataset['per capita income'], 
           x= merge_dataset['State'],
           name='per capita income (Hint: point represents county in given state)',
           boxpoints='outliers', # can also be 'all', 'suspectedoutliers', or False
           jitter=0.3, # add some jitter for a better separation between points
          )
        )

fig.add_trace(
    go.Box(y=merge_dataset['median family income'], 
           x= merge_dataset['State'],
           name='median family income (Hint: point represents county in given state)',
           boxpoints='outliers', # can also be 'all', 'suspectedoutliers', or False
           jitter=0.3, # add some jitter for a better separation between points
          )
        )

fig.update_layout(
    title='Merge Dataframe : per capita income & median household income & median family income boxplot', 
    xaxis=({'categoryorder':'category ascending'}),
    template="plotly_dark",
    margin=dict(r=10, t=25, b=40, l=60),
    annotations=[
        dict(
            text="Source: NOAA",
            showarrow=False,
            xref="paper",
            yref="paper",
            x=0,
            y=0)
    ],
    legend=dict(
        x=0,
        y=0.1,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)

Only the box plot above matters; it is the final output. Based on the plot, the data still contains quite a few outliers, especially in the columns median household income, per capita income, and median family income.

In [None]:
print('EDA On Merge Dataframe')
print("*"*120)

fig = make_subplots(
    rows=2,
    row_heights=[0.4, 0.6],
    specs=[[{"type": "box",}],
           [{"type": "box"}]])

fig.add_trace(
    go.Box(y=merge_dataset['population'], 
           x= merge_dataset['State'],  # x and y must come from the same dataframe
           name='population (Hint: point represents county in given state)',
           boxpoints='outliers', # can also be 'all', 'suspectedoutliers', or False
           jitter=0.3, # add some jitter for a better separation between points
          )
        )

fig.add_trace(
    go.Box(y=merge_dataset['number of households'], 
           x= merge_dataset['State'],
           name='number of households (Hint: point represents county in given state)',
           boxpoints='outliers', # can also be outliers, or suspectedoutliers, or False
           jitter=0.3, # add some jitter for a better separation between points
          )
        )

fig.update_layout(
    title='Merge Dataframe : population & number of households boxplot', 
    xaxis=({'categoryorder':'category ascending'}),
    template="plotly_dark",
    margin=dict(r=10, t=25, b=40, l=60),
    annotations=[
        dict(
            text="Source: Internet",
            showarrow=False,
            xref="paper",
            yref="paper",
            x=0,
            y=0)
    ],
    legend=dict(
        x=0,
        y=0.1,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)

The population and number of households columns also have outliers. I will remove them below.
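The outlier rule used here is the common 1.5 × IQR rule: a value is treated as an outlier when it lies below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. A minimal sketch on toy data follows; the values and the name `numeric_cols` are assumptions.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D", "E", "F"],
    "population": [100, 120, 110, 130, 125, 5000],  # 5000 is an obvious outlier
    "per capita income": [30, 32, 31, 29, 33, 30],
})

numeric_cols = ["population", "per capita income"]
q1 = toy[numeric_cols].quantile(0.25)
q3 = toy[numeric_cols].quantile(0.75)
iqr = q3 - q1

# A row is an outlier if ANY numeric column falls outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = ((toy[numeric_cols] < lower) | (toy[numeric_cols] > upper)).any(axis=1)
cleaned = toy[~is_outlier]
print(cleaned)
```

Restricting the comparison to the numeric columns avoids comparing text columns against numeric bounds.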

In [None]:
print('Removing Outlier Process')
print("*"*120)

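# columns from position 46 onward are the numeric county features
numeric_cols = merge_dataset.columns[46:]
Q1 = merge_dataset[numeric_cols].quantile(0.25)
Q3 = merge_dataset[numeric_cols].quantile(0.75)

IQR = Q3 - Q1

# flag rows where any numeric feature lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
is_outlier = ((merge_dataset[numeric_cols] < (Q1 - 1.5 * IQR)) | (merge_dataset[numeric_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
merge_dataset_wout = merge_dataset[~is_outlier]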

print('Size of dataframe before outlier removing {}'.format(merge_dataset.shape))
print('Size of dataframe after outlier removing {}'.format(merge_dataset_wout.shape))

We have removed every outlier, so we can continue to the analysis step.

<a id="section-five"></a>
# Analyze
***
We have already removed the outliers from the merged dataframe. In this step, we analyze the clean dataframe to get some insight. After some research, I think this problem can be addressed with clustering analysis: we can cluster counties or states to find characteristics that answer the problem question. I will use K-Means clustering with PCA. We can compare farmers markets on specific criteria, and those criteria can be clustered for our analysis.
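The overall idea (standardize, reduce with PCA, then cluster with K-Means) can be sketched on synthetic data. The two artificial groups, the feature count, and the choice of 2 clusters are assumptions for illustration only; the notebook's real inputs are the per-state dataframes built in the cells below.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: two groups of 20 "states" with 5 numeric features each
low = rng.normal(loc=0.0, scale=1.0, size=(20, 5))
high = rng.normal(loc=8.0, scale=1.0, size=(20, 5))
X = np.vstack([low, high])

# 1) standardize each feature to mean 0 and variance 1
X_scaled = StandardScaler().fit_transform(X)
# 2) keep the first 2 principal components
scores = PCA(n_components=2).fit_transform(X_scaled)
# 3) cluster the PCA scores
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(scores)

print(scores.shape)
print(np.bincount(labels))
```

Standardizing first matters because features such as population and income are on very different scales, and both PCA and K-Means are sensitive to scale.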

In [None]:
print('Creating New Dataframe for Clustering')
print("*"*120)

df_cluster_1 = merge_dataset_wout.copy()
df_cluster_1.drop(['FMID','MarketName','street','city','County','zip','Season1Date','Season1Time','x','y'],axis=1,inplace=True)
name_state = df_cluster_1['State'].unique()
name_feature = [
       'Credit', 'WIC', 'WICcash', 'SFMNP', 'SNAP', 'Organic', 'Bakedgoods',
       'Cheese', 'Crafts', 'Flowers', 'Eggs', 'Seafood', 'Herbs', 'Vegetables',
       'Honey', 'Jams', 'Maple', 'Meat', 'Nursery', 'Nuts', 'Plants',
       'Poultry', 'Prepared', 'Soap', 'Trees', 'Wine', 'Coffee', 'Beans',
       'Fruits', 'Grains', 'Juices', 'Mushrooms', 'PetFood', 'Tofu',
       'WildHarvested']

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for i in name_state:
    new_row = {'State': i}
    state_rows = df_cluster_1[df_cluster_1['State']==i]
    for j in name_feature:
        # number of markets in this state with (1) and without (0) the feature
        counts = state_rows[j].value_counts()
        if 1 in counts.index:
            new_row[j+'_'+'1'] = counts[1]
        if 0 in counts.index:
            new_row[j+'_'+'0'] = counts[0]
    rows.append(new_row)
df_cluster_2 = pd.DataFrame(rows)
# sort the columns alphabetically because later cells select columns by position
df_cluster_2 = df_cluster_2[sorted(df_cluster_2.columns)]

In the code above, I create a new dataframe that holds product availability for each U.S. state.

In [None]:
print('Cleaning new data frame')
print("*"*120)

df_cluster_2 = df_cluster_2.reset_index()
df_cluster_2.drop(['index'],axis=1,inplace=True)
df_cluster_2.replace(np.nan, 0, inplace=True)
display(df_cluster_2.head(5))

A feature whose name ends in 1 is the number of stores in the state that sell that product. A feature whose name ends in 0 is the number of stores that do not provide it.
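The same `_1` / `_0` counts can also be produced with a `groupby`, shown here as an alternative sketch rather than the notebook's own approach. It assumes every flag column contains only 0 and 1 with no missing values; the toy data is made up.

```python
import pandas as pd

# Toy data (hypothetical values): one row per market
toy = pd.DataFrame({
    "State": ["Ohio", "Ohio", "Ohio", "Texas", "Texas"],
    "SNAP": [1, 0, 1, 0, 0],
    "Eggs": [1, 1, 1, 0, 1],
})

features = ["SNAP", "Eggs"]
grouped = toy.groupby("State")[features]

# Markets that offer the feature (value 1) and markets that do not (value 0)
offered = grouped.sum().add_suffix("_1")
not_offered = (grouped.count() - grouped.sum()).add_suffix("_0")

availability = pd.concat([offered, not_offered], axis=1).reset_index()
print(availability)
```

With this approach a state that has no market offering a product gets an explicit 0 instead of a missing value, so no later `NaN` replacement is needed.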

In [None]:
print('Removing useless column and grouping data frame')
print("*"*120)
df_cluster_1.drop(['Credit', 'WIC', 'WICcash', 'SFMNP', 'SNAP', 'Organic',
       'Bakedgoods', 'Cheese', 'Crafts', 'Flowers', 'Eggs', 'Seafood', 'Herbs',
       'Vegetables', 'Honey', 'Jams', 'Maple', 'Meat', 'Nursery', 'Nuts',
       'Plants', 'Poultry', 'Prepared', 'Soap', 'Trees', 'Wine', 'Coffee',
       'Beans', 'Fruits', 'Grains', 'Juices', 'Mushrooms', 'PetFood', 'Tofu',
       'WildHarvested'],axis=1,inplace=True)

df_cluster_1 = df_cluster_1.groupby('State',as_index = False).agg({'per capita income': 'mean', 'median household income': 'mean',
                                                             'median family income': 'mean', 'population':'sum',
                                                             'number of households':'sum'})

In the code above, I drop every feature that is not needed in that dataframe. Why? Because this dataframe is used for clustering by citizen income, so the only features needed are **state, per capita income, median household income, median family income, population, and number of households**.

In [None]:
print('Feature engineering process: To find out how many information from columns')
print("*"*120)


# Standardize the data to have a mean of ~0 and a variance of 1
X_std = StandardScaler().fit_transform(df_cluster_1[df_cluster_1.columns[1:]])


# conduct dimensionality reduction using PCA
pca = PCA()
pca.fit(X_std)

pca.explained_variance_ratio_

plt.plot(range(1,6), pca.explained_variance_ratio_.cumsum(), marker='o', linestyle='--')
plt.title('Explained Variance by Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')

Based on the graph above, we create the PCA with 2 components, reducing the 5 numeric features to only 2. Explained variance measures how much of the information (variance) in the data each component captures. So with just 2 components, we already cover about 98% of the information.
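To make "explained variance" concrete, the ratio can be computed by hand with NumPy from the singular values of the standardized data, which should agree with what `PCA.explained_variance_ratio_` reports. The synthetic data and the 95% threshold below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
a = rng.normal(size=n)
b = rng.normal(size=n)

def noise():
    return rng.normal(scale=0.1, size=n)

# Synthetic data: 5 features that are really driven by only 2 signals (a and b)
X = np.column_stack([a, a + noise(), a + noise(), b, b + noise()])

# Standardize to mean 0 and variance 1, as StandardScaler does
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# The squared singular values are proportional to the variance
# captured by each principal component
s = np.linalg.svd(X_std, compute_uv=False)
explained_variance_ratio = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative.round(3), n_components)
```

Picking the component count from a threshold on the cumulative ratio is one common way to read the same curve that the plot above shows.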

In [None]:
print('Feature engineering process: To find out how many information from columns')
print("*"*120)


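# Select columns by name instead of by position, so the grouping does not depend on column order
nutrition_features = ['SFMNP', 'SNAP', 'WIC', 'WICcash']
feature_columns = [c for c in df_cluster_2.columns if c != 'State']

# nutrition program columns, e.g. 'SNAP_1' and 'SNAP_0'
nutrition_program = [c for c in feature_columns if c.rsplit('_', 1)[0] in nutrition_features]
# every other count column describes product (or payment) availability
product_group = [c for c in feature_columns if c not in nutrition_program]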

# Standardize the data to have a mean of ~0 and a variance of 1
X_std_2 = StandardScaler().fit_transform(df_cluster_2[nutrition_program])
X_std_3 = StandardScaler().fit_transform(df_cluster_2[product_group])


# conduct dimensionality reduction using PCA
pca_2 = PCA()
pca_2.fit(X_std_2)

pca_3 = PCA()
pca_3.fit(X_std_3)

In the code above, we create 2 more PCAs, for 3 in total. PCA 1 is for clustering by citizen income, PCA 2 for clustering by nutrition program, and PCA 3 for clustering by product group/availability. 

In [None]:
print('Feature engineering process: plotting number of information on nutrition program feature')
print("*"*120)

plt.plot(range(1,9), pca_2.explained_variance_ratio_.cumsum(), marker='o', linestyle='--')
plt.title('Explained Variance by Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')

The graph above shows the cumulative explained variance by number of components. For clustering by nutrition program, we create a PCA with 2 components.

In [None]:
print('Feature engineering process: plotting number of information on product group feature')
print("*"*120)

plt.plot(range(1,52), pca_3.explained_variance_ratio_.cumsum(), marker='o', linestyle='--')
plt.title('Explained Variance by Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')

Based on the graph above, we should use approximately 3-10 components, but I want to use just 3.

In [None]:
print('Feature engineering process:fitting pca with specific number of component & create pca scores for cluster 1')
print("*"*120)

pca = PCA(n_components = 2)
pca.fit(X_std)

pca.transform(X_std)

scores_pca = pca.transform(X_std)

In [None]:
print('Feature engineering process: fitting pca with specific number of component & create pca scores for cluster 2 - nutrition program')
print("*"*120)

pca_2 = PCA(n_components = 2)
pca_2.fit(X_std_2)

pca_2.transform(X_std_2)

scores_pca_2 = pca_2.transform(X_std_2)

In [None]:
print('Feature engineering process: fitting pca with specific number of component & create pca scores for cluster 3 - product group')
print("*"*120)

pca_3 = PCA(n_components = 3)
pca_3.fit(X_std_3)

pca_3.transform(X_std_3)

scores_pca_3 = pca_3.transform(X_std_3)

In [None]:
print('Feature engineering process: find the inertia (within-cluster sum of squares) for cluster 1')
print("*"*120)

wcss = []
for i in range(1,10):
    kmeans_pca = KMeans(n_clusters = i, init='k-means++', random_state=42)
    kmeans_pca.fit(scores_pca)
    wcss.append(kmeans_pca.inertia_)

In [None]:
print('Feature engineering process: decide how many cluster that we want to create for cluster 1')
print("*"*120)

plt.plot(range(1,10), wcss, marker='o', linestyle='--')
plt.title('K-Means with PCA Clustering')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

Based on the graph above, the curve flattens at 4 clusters. Looking at each point, the slope between points is steep from one to four clusters, but from 5 clusters onward the change between points is much smoother. The same principle (the elbow method) will be used in the next steps to decide the number of clusters for nutrition program and product availability.
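The elbow can also be read numerically by looking at how much the WCSS drops each time a cluster is added. The sketch below uses synthetic points around 4 made-up centers, so the elbow at 4 is built in by construction; it is not a result from the farmers market data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic data: 30 points around each of 4 well-separated centers
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

wcss = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    model.fit(points)
    wcss.append(model.inertia_)  # within-cluster sum of squares

# Drop in WCSS gained by adding one more cluster
drops = -np.diff(wcss)
for k, drop in zip(range(2, 10), drops):
    print(f"k={k}: WCSS drops by {drop:.1f}")
```

The drops are large up to the true number of groups and small afterwards, which is the same pattern as the visible bend in the plot.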

In [None]:
print('Feature engineering process: find out number of inertia for cluster 2 - nutrition program')
print("*"*120)

wcss_2 = []
for i in range(1,10):
    kmeans_pca_2 = KMeans(n_clusters = i, init='k-means++', random_state=42)
    kmeans_pca_2.fit(scores_pca_2)
    wcss_2.append(kmeans_pca_2.inertia_)

In [None]:
print('Feature engineering process: decide how many cluster that we want to create for cluster 2 - nutrition program')
print("*"*120)

plt.plot(range(1,10), wcss_2, marker='o', linestyle='--')
plt.title('K-Means with PCA Clustering')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
print('Feature engineering process: find out number of inertia for cluster 3 - product group')
print("*"*120)

wcss_3 = []
for i in range(1,10):
    kmeans_pca_3 = KMeans(n_clusters = i, init='k-means++', random_state=42)
    kmeans_pca_3.fit(scores_pca_3)
    wcss_3.append(kmeans_pca_3.inertia_)

In [None]:
print('Feature engineering process: decide how many cluster that we want to create for cluster 3 - product group')
print("*"*120)

plt.plot(range(1,10), wcss_3, marker='o', linestyle='--')
plt.title('K-Means with PCA Clustering')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
print('Training process: KMeans with 4 clusters for cluster 1')
print("*"*120)

kmeans_pca = KMeans(n_clusters=4, init='k-means++',random_state=42)
kmeans_pca.fit(scores_pca)

In [None]:
print('Training process: KMeans with 4 clusters for cluster 2 - nutrition program')
print("*"*120)

kmeans_pca_2 = KMeans(n_clusters=4, init='k-means++',random_state=42)
kmeans_pca_2.fit(scores_pca_2)

In [None]:
print('Training process: KMeans with 4 clusters for cluster 3 - product group')
print("*"*120)

kmeans_pca_3 = KMeans(n_clusters=4, init='k-means++',random_state=42)
kmeans_pca_3.fit(scores_pca_3)

In [None]:
print('Create New Feature: Label based on cluster citizen income for dataframe cluster 1')
print("*"*120)

df_cluster_1['Segment Citizen Income'] = kmeans_pca.labels_
df_cluster_1['Segment Citizen Income'] = df_cluster_1['Segment Citizen Income'].map({
    0: 'one',
    1: 'two',
    2: 'three',
    3: 'four'
})


In the code above, I create a new feature from the KMeans cluster labels. Clustering 1 (based on citizen income) has 4 labels: 0, 1, 2, 3. At this point we do not yet know what distinguishes them, so I simply rename the labels to one, two, three, four.

In [None]:
print('Create New Feature: Label based on cluster nutrition program for dataframe cluster 2')
print("*"*120)

df_cluster_2['Segment Nutrition Program'] = kmeans_pca_2.labels_
df_cluster_2['Segment Nutrition Program'] = df_cluster_2['Segment Nutrition Program'].map({
    0: 'one',
    1: 'two',
    2: 'three',
    3: 'four'
})


In [None]:
print('Create New Feature: Label based on cluster product group for dataframe cluster 2')
print("*"*120)

df_cluster_2['Segment Product Group'] = kmeans_pca_3.labels_
df_cluster_2['Segment Product Group'] = df_cluster_2['Segment Product Group'].map({
    0: 'one',
    1: 'two',
    2: 'three',
    3: 'four'
})


In [None]:
print('Function for drawing cluster characteristics')
print("*"*120)

def parallel_plot(data, category):
    # one color per row (cluster) in the data
    my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(data)))
    plt.figure(figsize=(15,15)).gca().axes.set_ylim([-5,+5])
    parallel_coordinates(data, category, color = my_colors, marker='o')

In [None]:
print('Cluster characteristics for cluster 1')
print("*"*120)

df_analysis_cluster_1 = df_cluster_1.copy()
df_analysis_cluster_1.iloc[:,1:6] = X_std

citizen_income = df_analysis_cluster_1.columns.values[1:8].tolist()
df_analysis_cluster_1 = df_analysis_cluster_1[citizen_income].groupby('Segment Citizen Income',
                                                                        as_index = False).mean()
parallel_plot(df_analysis_cluster_1, 'Segment Citizen Income')

In [None]:
print('Cluster characteristics for cluster 2')
print("*"*120)

df_analysis_cluster_2 = df_cluster_2.copy()
df_analysis_cluster_2[nutrition_program] = X_std_2
df_analysis_cluster_2[product_group] = X_std_3

nutrition_characteristics = nutrition_program + ['Segment Nutrition Program']
product_characteristics_1 = product_group[0:11] + ['Segment Product Group']
product_characteristics_2 = product_group[11:22] + ['Segment Product Group']
product_characteristics_3 = product_group[22:33] + ['Segment Product Group']
product_characteristics_4 = product_group[33:42] + ['Segment Product Group']
product_characteristics_5 = product_group[42:51] + ['Segment Product Group']
product_characteristics_6 = product_group[51:] + ['Segment Product Group']

In [None]:
print('Cluster characteristics for cluster 2 based on nutrition program')
print("*"*120)

df_nutrition_program = df_analysis_cluster_2[nutrition_characteristics].groupby('Segment Nutrition Program',
                                                                        as_index = False).mean()


parallel_plot(df_nutrition_program, 'Segment Nutrition Program')

In [None]:
print('Cluster characteristics for cluster 2 based on product group')
print("*"*120)

df_pg_1 = df_analysis_cluster_2[product_characteristics_1].groupby('Segment Product Group',
                                                                        as_index = False).mean()

parallel_plot(df_pg_1, 'Segment Product Group')

In [None]:
print('Cluster characteristics for cluster 2 based on product group')
print("*"*120)

df_pg_2 = df_analysis_cluster_2[product_characteristics_2].groupby('Segment Product Group',
                                                                        as_index = False).mean()

parallel_plot(df_pg_2, 'Segment Product Group')

In [None]:
print('Cluster characteristics for cluster 2 based on product group')
print("*"*120)

df_pg_3 = df_analysis_cluster_2[product_characteristics_3].groupby('Segment Product Group',
                                                                        as_index = False).mean()

parallel_plot(df_pg_3, 'Segment Product Group')

In [None]:
print('Cluster characteristics for cluster 2 based on product group')
print("*"*120)

df_pg_4 = df_analysis_cluster_2[product_characteristics_4].groupby('Segment Product Group',
                                                                        as_index = False).mean()

parallel_plot(df_pg_4, 'Segment Product Group')

In [None]:
print('Cluster characteristics for cluster 2 based on product group')
print("*"*120)

df_pg_5 = df_analysis_cluster_2[product_characteristics_5].groupby('Segment Product Group',
                                                                        as_index = False).mean()

parallel_plot(df_pg_5, 'Segment Product Group')

In [None]:
print('Cluster characteristics for cluster 2 based on product group')
print("*"*120)

df_pg_6 = df_analysis_cluster_2[product_characteristics_6].groupby('Segment Product Group',
                                                                        as_index = False).mean()

parallel_plot(df_pg_6, 'Segment Product Group')

In [None]:
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/PublicaMundi/MappingAPI/master/data/geojson/us-states.json') as response:
    counties = json.load(response)

In [None]:
print('States Cluster Based on Citizen Income')

px.set_mapbox_access_token(Mapbox['Access Token'])
fig = px.choropleth_mapbox(df_cluster_1, geojson=counties, color="Segment Citizen Income",
                           locations="State", featureidkey="properties.name",
                           hover_data = ['per capita income', 'median household income', 'population'],
                           color_discrete_map={
                            "one": px.colors.sequential.Rainbow[8],
                            "two": px.colors.sequential.Rainbow[6],
                            "three": px.colors.sequential.Rainbow[4],
                            "four": px.colors.sequential.Rainbow[2]
                        })
fig.update_layout(mapbox_style="carto-positron",
                  mapbox_zoom=3, mapbox_center = {"lat": 37.0902, "lon": -95.7129})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

After running KMeans clustering, I visualize the result. The parallel plot shown earlier displays the standardized values of df_cluster_1 for each cluster, and the map above colors each state by its cluster. As a reminder, df_cluster_1 is used for clustering by citizen income. There are 4 clusters:
* One (Red Area)
* Two (Yellow Area)
* Three (Green Area)
* Four (Blue Area)

I will drill into more specific characteristics below.
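Cluster characteristics like these can be read from the per-cluster mean of the standardized features. The sketch below shows the idea on toy data; the values, the segment assignments, and the column name `Segment` are assumptions and do not reproduce the notebook's results.

```python
import pandas as pd

# Toy data (hypothetical values and cluster assignments)
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D", "E", "F"],
    "per capita income": [20, 22, 35, 37, 28, 29],
    "population": [1.0, 1.2, 0.9, 1.1, 9.0, 8.5],
    "Segment": ["three", "three", "four", "four", "two", "two"],
})
features = ["per capita income", "population"]

# Standardize with the population standard deviation (ddof=0), as StandardScaler does
z = (toy[features] - toy[features].mean()) / toy[features].std(ddof=0)

# Mean standardized value of each feature per cluster
profile = z.groupby(toy["Segment"]).mean()
print(profile)

# Rank the clusters from highest to lowest income
ranking = profile["per capita income"].sort_values(ascending=False).index.tolist()
print(ranking)
```

A positive value in `profile` means the cluster is above the overall average for that feature, and a negative value means it is below.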

## Cluster One (Red Area)
***
I think most U.S. states fall into this category. In general, the statistics show that they have the third highest per capita income, median household income, and median family income.

## Cluster Two (Yellow Area)
***
Approximately 5 states fall into this category. The distinguishing characteristics of this cluster are a very large population and number of households, so I think these are metropolitan states with dense populations. In general, the statistics show that they have the second highest per capita income, median household income, and median family income.

## Cluster Three (Green Area)
***
They have the lowest per capita income, median household income, and median family income, with an average population and number of households compared to the other clusters. This means they tend to have **low socio-economic status**.

## Cluster Four (Blue Area)
***
They have the highest per capita income, median household income, and median family income, with an average population and number of households compared to the other clusters. This means they tend to have the **highest socio-economic status**.

In [None]:
print('States Cluster Based on Nutrition Program')
fig = px.choropleth_mapbox(df_cluster_2, geojson=counties, color="Segment Nutrition Program",
                           locations="State", featureidkey="properties.name",
                           color_discrete_map={
                            "one": px.colors.sequential.Rainbow[8],
                            "two": px.colors.sequential.Rainbow[6],
                            "three": px.colors.sequential.Rainbow[4],
                            "four": px.colors.sequential.Rainbow[2]
                           }
                          )
fig.update_layout(mapbox_style="carto-positron",
                  mapbox_zoom=3, mapbox_center = {"lat": 37.0902, "lon": -95.7129})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

## Availability of Senior Farmer Market Nutrition Program
The Seniors Farmers' Market Nutrition Program (SFMNP) is designed to provide low-income seniors with access to locally grown fruits, vegetables, honey and herbs, increase the domestic consumption of agricultural commodities through farmers' markets, roadside stands, and community supported agricultural programs, and aid in the development of new and additional farmers' markets, roadside stands, and community support agricultural programs.
* Cluster **four** has the highest average number of stores that serve SFMNP.
* Cluster **one** has the second highest average number of stores that serve SFMNP.
* Cluster **two** has the third highest average number of stores that serve SFMNP.
* Cluster **three** has the lowest average number of stores that serve SFMNP.
***

## Availability of Supplemental Nutrition Assistance Program (SNAP)
The Supplemental Nutrition Assistance Program (SNAP) provides over **45 million low-income Americans with monthly benefits that can be used to purchase most foods and beverages.**
***
* Cluster **four** has the highest average number of stores that serve SNAP.
* Cluster **one** has the second highest average number of stores that serve SNAP.
* Cluster **two** has the third highest average number of stores that serve SNAP.
* Cluster **three** has the lowest average number of stores that serve SNAP.
***

## Availability of WIC Program
The Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) provides federal grants to states for supplemental foods, health care referrals, and nutrition education for low-income pregnant, breastfeeding, and non-breastfeeding postpartum women, and to infants and children up to age five who are found to be at nutritional risk.
***
* Cluster **four** has the highest average number of stores that serve WIC.
* Cluster **one** has the second highest average number of stores that serve WIC.
* Cluster **two** has the third highest average number of stores that serve WIC.
* Cluster **three** has the lowest average number of stores that serve WIC.
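
The rankings listed for SFMNP, SNAP, and WIC all come down to a group-by-cluster average followed by a sort. Below is a minimal sketch of that step. The `toy` DataFrame, its numbers, and the helper `rank_clusters` are assumptions made for illustration only; they are not the notebook's data or functions. Only the column name `Segment Nutrition Program` is taken from the notebook.

```python
import pandas as pd

# Toy per-state counts of markets accepting each program.
# These numbers are made up for illustration, not taken from the dataset.
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Segment Nutrition Program": ["one", "one", "two", "two",
                                  "three", "three", "four", "four"],
    "SFMNP": [60, 80, 30, 50, 5, 15, 200, 240],
    "SNAP": [90, 110, 40, 60, 10, 20, 300, 340],
    "WIC": [50, 70, 25, 35, 4, 8, 150, 190],
})


def rank_clusters(df, cluster_col, value_cols):
    """Hypothetical helper: for each value column, return the cluster
    labels ordered from highest to lowest mean."""
    means = df.groupby(cluster_col)[value_cols].mean()
    return {
        col: means[col].sort_values(ascending=False).index.tolist()
        for col in value_cols
    }


program_ranking = rank_clusters(
    toy, "Segment Nutrition Program", ["SFMNP", "SNAP", "WIC"]
)
for program, order in program_ranking.items():
    print(program, " > ".join(order))
```

With the notebook's own per-state table in place of `toy`, the same call would give the ordering for each program in one pass instead of reading it off separate charts.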

In [None]:
print('States Cluster Based on Product Group')
fig = px.choropleth_mapbox(df_cluster_2, geojson=counties, color="Segment Product Group",
                           locations="State", featureidkey="properties.name",
                           color_discrete_map={
                            "one": px.colors.sequential.Rainbow[8],
                            "two": px.colors.sequential.Rainbow[6],
                            "three": px.colors.sequential.Rainbow[4],
                            "four": px.colors.sequential.Rainbow[2]
                           }
                          )
fig.update_layout(mapbox_style="carto-positron",
                  mapbox_zoom=3, mapbox_center = {"lat": 37.0902, "lon": -95.7129})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

## Availability of Products
* Average number of stores that serve **Bakedgoods** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Beans** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Cheese** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Coffee** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Crafts** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Credit** (from highest to lowest): two > four > one > three

* Average number of stores that serve **Eggs** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Flowers** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Fruits** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Grains** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Herbs** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Honey** (from highest to lowest): two > four > one > three

* Average number of stores that serve **Jams** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Juices** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Maple** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Meat** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Mushrooms** (from highest to lowest): two > four > one > three

* Average number of stores that serve **Nursery** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Nuts** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Organic** (from highest to lowest): two > four > one > three
* Average number of stores that serve **PetFood** (from highest to lowest): two > four > one > three

* Average number of stores that serve **Plants** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Poultry** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Prepared** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Seafood** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Soap** (from highest to lowest): two > four > one > three

* Average number of stores that serve **Tofu** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Trees** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Vegetables** (from highest to lowest): two > four > one > three
* Average number of stores that serve **WildHarvested** (from highest to lowest): two > four > one > three
* Average number of stores that serve **Wine** (from highest to lowest): two > one > four > three
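
Because almost every product follows the same ordering, it can help to compute the orderings programmatically and flag only the products that deviate (Wine, in the list above). The following is a sketch under stated assumptions: the `toy` table and its numbers are invented for illustration, and only three product columns are shown. The column name `Segment Product Group` is the one used in the notebook.

```python
from collections import Counter

import pandas as pd

# Toy per-state product counts (made-up numbers for illustration only).
toy = pd.DataFrame({
    "Segment Product Group": ["one", "one", "two", "two",
                              "three", "three", "four", "four"],
    "Fruits": [40, 60, 200, 220, 5, 15, 90, 110],
    "Honey": [30, 50, 150, 170, 4, 6, 70, 90],
    "Wine": [25, 35, 80, 100, 1, 3, 10, 20],
})
product_cols = ["Fruits", "Honey", "Wine"]

# Mean count per cluster for every product column.
means = toy.groupby("Segment Product Group")[product_cols].mean()

# Cluster labels from highest to lowest mean, one tuple per product.
orderings = {
    col: tuple(means[col].sort_values(ascending=False).index)
    for col in product_cols
}

# The ordering shared by most products, and the products that differ from it.
common_order, _ = Counter(orderings.values()).most_common(1)[0]
exceptions = {
    col: order for col, order in orderings.items() if order != common_order
}

print("common order:", " > ".join(common_order))
for col, order in exceptions.items():
    print("exception:", col, " > ".join(order))
```

Reporting one common ordering plus its exceptions keeps the same information as the full list while making the outlier easier to spot.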

In [None]:
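from functools import reduce

# Attach each state's three cluster labels to the market rows (inner join on 'State').
final_dataframe = reduce(lambda x, y: pd.merge(x, y, on=['State'], how='inner'),
                         [merge_dataset_wout.copy(),
                          df_cluster_2[['State', 'Segment Product Group', 'Segment Nutrition Program']],
                          df_cluster_1[['State', 'Segment Citizen Income']]])
# Constant marker size so every market is drawn with the same dot size on the maps.
final_dataframe['marker size'] = 1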

In [None]:
fig = px.scatter_mapbox(final_dataframe, 
                        lat="y", 
                        lon="x", 
                        color="Segment Citizen Income",
                        size="marker size",
                        hover_name="MarketName", 
                        hover_data=["State","Segment Nutrition Program","Segment Product Group"],
                        color_discrete_map={
                            "one": px.colors.sequential.Rainbow[8],
                            "two": px.colors.sequential.Rainbow[6],
                            "three": px.colors.sequential.Rainbow[4],
                            "four": px.colors.sequential.Rainbow[2]
                           }, 
                        size_max=15, 
                        zoom=3,
                        mapbox_style='streets',
                        title='U.S Store cluster Based On Citizen Income')
fig.show()

In [None]:
fig = px.scatter_mapbox(final_dataframe, 
                        lat="y", 
                        lon="x", 
                        color="Segment Nutrition Program",
                        size="marker size",
                        hover_name="MarketName", 
                        hover_data=["State","Segment Citizen Income","Segment Product Group"],
                        color_discrete_map={
                            "one": px.colors.sequential.Rainbow[8],
                            "two": px.colors.sequential.Rainbow[6],
                            "three": px.colors.sequential.Rainbow[4],
                            "four": px.colors.sequential.Rainbow[2]
                        }, 
                        size_max=15, 
                        zoom=3,
                        mapbox_style='streets',
                        title='U.S Store cluster Based On Nutrition Program')
fig.show()

In [None]:
fig = px.scatter_mapbox(final_dataframe, 
                        lat="y", 
                        lon="x", 
                        color="Segment Product Group",
                        size="marker size",
                        hover_name="MarketName", 
                        hover_data=["State","Segment Nutrition Program","Segment Citizen Income"],
                        color_discrete_map={
                            "one": px.colors.sequential.Rainbow[8],
                            "two": px.colors.sequential.Rainbow[6],
                            "three": px.colors.sequential.Rainbow[4],
                            "four": px.colors.sequential.Rainbow[2]
                        }, 
                        size_max=15, 
                        zoom=3,
                        mapbox_style='streets',
                        title='U.S Store cluster Based On Product Group')
fig.show()

<a id="section-six"></a>
# Result & Conclusion
***

After this analysis, I have enough information to answer the problem question. In the first section of this notebook, I stated the problem question as **Are farmers markets inaccessible to Americans who live in certain parts of the country or are of low socio-economic status? Does the data reflect this criticism?** Here is my answer: **Yes. The data supports the criticism that U.S farmers markets are inaccessible to these Americans. Why?** Because the analysis surfaced the following facts:
1. A clear example of an area with low socio-economic status, based on per capita income and median household income, is **Puerto Rico** (a U.S. territory rather than a state). It has an average per capita income of USD 15,160 and a median household income of USD 23,387. With income at that level, **Puerto Rico** falls into cluster three for both product availability/group and nutrition programs. To recall, cluster three (3) has the lowest access both to the nutrition programs the U.S Government provides (WIC, SFMNP, SNAP) and to product variety and availability. Other states in a similar position include Idaho, New Mexico, and Montana.
2. What about prosperous states? Consider **New York, California, & Massachusetts**. All three are wealthy and have better access to both product availability/variety and nutrition programs. They rank at the top of every clustering, which supports the criticism in the problem question.
3. There are also some anomalies in the clusters above. **New Hampshire, Connecticut, & Hawaii** are generally known as prosperous states, yet none of them has better access to farmers market products or nutrition programs. I do not know why, but my guess is that, because of their high per capita and household incomes, fewer of their residents are eligible for the nutrition programs.
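
One way to check these three points against the cluster labels is a cross-tabulation of the income segment against the nutrition-program segment, followed by a filter for states that are wealthy but sit in the lowest-access cluster. This is only a sketch. The `toy` table is invented, and the income labels `"high"` and `"low"` are placeholders (the notebook names its income segments one to four, and which label denotes the wealthiest group has to be read from the cluster means).

```python
import pandas as pd

# Toy state-level labels (invented for illustration only).
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D", "E", "F"],
    "Segment Citizen Income": ["high", "high", "high", "low", "low", "low"],
    "Segment Nutrition Program": ["four", "four", "three",
                                  "three", "three", "one"],
})

# Number of states in each (income segment, program segment) pair.
crosstab = pd.crosstab(
    toy["Segment Citizen Income"], toy["Segment Nutrition Program"]
)
print(crosstab)

# Anomalies: high-income states that fall in the lowest-access cluster
# ("three" in the notebook's program clustering).
mismatch = toy[
    (toy["Segment Citizen Income"] == "high")
    & (toy["Segment Nutrition Program"] == "three")
]
anomalies = mismatch["State"].tolist()
print("anomalies:", anomalies)
```

If most states sit on the matching cells of the table (high income with high access, low income with low access), the criticism is supported, and the off-diagonal cells list the anomalies directly.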

So, those are my answers. I am open to advice and comments from all of you, because I am a beginner in this field. Thanks!