## Exploratory Data Analysis I

Let's begin our analysis by exploring both the univariate and bivariate characteristics of this dataset. The general goals of this step include: 

* finding outliers & distributions through univariate visualizations
* finding trends & patterns through bivariate visualizations

When you are done with this section of the project, validate that your output matches the screenshot provided in the `docs/part1.md` file and answer the questions located underneath `Exploratory Data Analysis II` in your own words.

In [None]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Loading `data/raw/shopping.csv` as a pandas dataframe
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

df = pd.read_csv('../data/raw/shopping.csv')

In [None]:
# Printing out the first 5 rows for display
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

df.head()

In [None]:
# Printing out summary statistics for all numeric columns
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

df.describe()
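
Since `describe()` only summarizes numeric columns, it can help to also check each column's data type and how many values are missing. The sketch below assumes a small made-up table named `toy` (hypothetical, not the real `shopping.csv`) so that it runs on its own; the same two calls should work on `df`.

```python
import pandas as pd

# Toy stand-in for the shopping data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [25, 31, None, 47],
    "Gender": ["Male", "Female", "Female", None],
    "Purchase Amount (USD)": [30.0, 52.5, 41.0, 28.0],
})

# Data type of each column (numeric vs. text).
print(toy.dtypes)

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```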

## Univariate Analysis

Let's generate visualizations for each numeric variable to get an idea of the outliers & distributions present in our dataset.

In addition, let's also visualize the frequency counts of qualitative variables to get an understanding of the composition of our dataset.
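
Outliers can also be flagged numerically. One common convention (an assumption here, not something this project prescribes) is the 1.5 × IQR rule, which is also the default whisker length in a boxplot. A minimal sketch on made-up purchase amounts:

```python
import pandas as pd

# Made-up purchase amounts; 400 is an intentionally extreme value.
amounts = pd.Series([20, 25, 30, 35, 40, 45, 50, 400], name="Purchase Amount (USD)")

# Interquartile range: the spread of the middle 50% of the values.
q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]
print(outliers)
```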

In [None]:
# Plot seaborn histogram for the "Age" column
# Documentation: https://seaborn.pydata.org/generated/seaborn.histplot.html

sns.histplot(x = 'Age', data = df)

In [None]:
# Plot seaborn histogram for the "Purchase Amount (USD)" column

sns.histplot(data = df, x = 'Purchase Amount (USD)')

In [None]:
# Plot a seaborn histogram for the "Review Rating" column

sns.histplot(data = df, x = 'Review Rating')

In [None]:
# Plot a seaborn histogram for the "Previous Purchases" column

sns.histplot(data = df, x = 'Previous Purchases')
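
A histogram is a count of how many values fall into each bin. The sketch below reproduces that idea numerically with `numpy.histogram` on made-up ages; the choice of 4 equal-width bins is an assumption for illustration, and seaborn chooses its own bin count unless you pass `bins`.

```python
import numpy as np
import pandas as pd

# Made-up ages, for illustration only.
ages = pd.Series([18, 22, 25, 31, 34, 38, 42, 47, 55, 70], name="Age")

# Split the range of ages into 4 equal-width bins and count the values in each.
counts, edges = np.histogram(ages, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```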

In [None]:
# Count the frequency of unique values in the "Gender" column, save the result (a pandas Series) into a new variable named "gender_counts"
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html

gender_counts = df.value_counts('Gender')
gender_counts
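
Raw counts can be hard to compare across columns, so it is sometimes useful to look at proportions instead. A minimal sketch, assuming a made-up `toy` table; `normalize=True` is a documented option of `value_counts`.

```python
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Male", "Male"]})

# normalize=True returns each value's share of the rows instead of a raw count.
gender_share = toy["Gender"].value_counts(normalize=True)
print(gender_share)
```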

In [None]:
# Plot matplotlib barplot for the gender_counts dataframe
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html

gender_counts.plot.bar()

In [None]:
# Count frequency of unique values in the "Season" column, save this value into a new dataframe named "season_counts"

season_counts = df.value_counts('Season')
season_counts

In [None]:
# Plot matplotlib barplot for the season_counts dataframe

season_counts.plot.bar()

In [None]:
# Count frequency of unique values in the "Shipping Type" column, save the result (a pandas Series) into a new variable named "shipping_counts"

shipping_counts = df.value_counts('Shipping Type')
shipping_counts

In [None]:
# Plot matplotlib barplot for the shipping_counts dataframe

shipping_counts.plot.bar()

In [None]:
# Count the frequency of unique values in the "Promo Code Used" column, save this value into a new dataframe named "promo_counts"

promo_counts = df.value_counts('Promo Code Used')
promo_counts

In [None]:
# Plot matplotlib barplot for the promo_counts dataframe

promo_counts.plot.bar()

In [None]:
# Count frequency of unique values in the "Payment Method" column, save this value into a new dataframe named "pay_counts"

pay_counts = df.value_counts('Payment Method')
pay_counts

In [None]:
# Plot matplotlib barplot for the pay_counts dataframe

pay_counts.plot.bar()

In [None]:
# Count frequency of unique values in the "Frequency of Purchases" column, save this value into a new dataframe named "purch_counts"

purch_counts = df.value_counts('Frequency of Purchases')
purch_counts

In [None]:
# Plot a matplotlib barplot for the purch_counts dataframe

purch_counts.plot.bar()

In [None]:
# Count frequency of unique values in the "Location" column, save this value into a new dataframe named "loc_counts"

loc_counts = df.value_counts('Location')
loc_counts

In [None]:
# Plot a horizontal barplot for the loc_counts dataframe
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.barh.html
# Resize the figure using "plt.figure(figsize=(10,10))" to "unsquish" your visualization

plt.figure(figsize=(10,10))
loc_counts.plot.barh()
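
`value_counts` returns its result sorted from largest to smallest, and a horizontal bar plot draws the first row at the bottom. If you would rather see the largest bar at the top, one option is to sort ascending before plotting. A sketch on made-up locations (the `toy` table is hypothetical):

```python
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({"Location": ["Ohio", "Texas", "Texas", "Utah", "Texas", "Ohio"]})

loc_counts_toy = toy["Location"].value_counts()

# Ascending order puts the smallest count first (bottom of a barh plot)
# and the largest count last (top of a barh plot).
ordered = loc_counts_toy.sort_values(ascending=True)
print(ordered)

# ordered.plot.barh() would then draw the bars in this order.
```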

## Bivariate Analysis

Let's generate visualizations of relationships between pairs of variables to get an idea of the patterns and clusters that might be present in our dataset.

In [None]:
# Create a boxplot that reveals the range of "Purchase Amount (USD)" for each "Gender" 
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

sns.boxplot(data = df, x = 'Gender', y = 'Purchase Amount (USD)')
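
The box in a boxplot spans the 25th to 75th percentile, with a line at the median. To read those numbers directly, one option is to group by the category and call `describe()`. A sketch on a made-up `toy` table:

```python
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Purchase Amount (USD)": [20, 40, 60, 30, 50, 70],
})

# One row of summary statistics per gender.
summary = toy.groupby("Gender")["Purchase Amount (USD)"].describe()

# The quartiles correspond to the bottom, middle line, and top of each box.
print(summary[["25%", "50%", "75%"]])
```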

In [None]:
# Create a boxplot that reveals the range of "Purchase Amount (USD)" for each "Season" 
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

sns.boxplot(data = df, x = "Season", y = "Purchase Amount (USD)")

In [None]:
# Create a boxplot that reveals the range of "Purchase Amount (USD)" for each "Review Rating" 
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

sns.boxplot(data = df, x = "Review Rating", y = "Purchase Amount (USD)" )

In [None]:
# Create a boxplot that reveals the range of "Purchase Amount (USD)" for each "Promo Code Used" 
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

sns.boxplot(data = df, x = "Purchase Amount (USD)", y = "Promo Code Used" )

In [None]:
# Create a boxplot that reveals the range of "Purchase Amount (USD)" for each "Payment Method"
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

sns.boxplot(data = df, x = "Purchase Amount (USD)", y = "Payment Method" )

In [None]:
# Plot grid of diagrams on all numeric columns where the upper-half of the grid are scatter-plots
# the bottom-half are kde-plots
# and the diagonal is a histplot
# Documentation: https://seaborn.pydata.org/tutorial/axis_grids.html
# This might take a few seconds to load
# To read the kde diagrams in the bottom-half check out https://www.greenbelly.co/pages/contour-lines

g = sns.PairGrid(df)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot)
g.map_diag(sns.histplot)
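
A correlation matrix can serve as a quick numeric companion to the pair grid. The sketch below assumes a made-up `toy` table and uses Pearson correlation, the default of `DataFrame.corr`; note that it only measures linear relationships.

```python
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Previous Purchases": [2, 4, 6, 8],
    "Review Rating": [4.5, 3.0, 4.0, 2.5],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes("number").corr()
print(corr.round(2))
```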

## Exploratory Data Analysis II

In the next section, answer a few questions regarding your dataset using the visualizations you've generated.

### Q1

Which state contains the most shoppers? Which state contains the fewest?

California contains the most shoppers, while Hawaii contains the fewest.
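
An answer like this can be double-checked without reading the bar chart by asking the counts for their largest and smallest entries. A sketch on made-up locations (the `toy` table is hypothetical); if several states tie, `idxmax` and `idxmin` return only the first match.

```python
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({"Location": ["Ohio", "Texas", "Texas", "Utah", "Texas", "Ohio"]})

counts = toy["Location"].value_counts()

# Index labels of the largest and smallest counts.
most_common = counts.idxmax()
least_common = counts.idxmin()
print(most_common, least_common)
```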

### Q2

Which season has the largest amount of purchases?

Winter has the largest amount of purchases ($1321).

### Q3

What is the most popular form of payment for our customers in the US? What is the least popular form of payment?

US customers primarily used credit cards as their method of payment. Only a few US customers paid with cash.

### Q4

What is the most popular form of shipping for our customers in the US? What is the least popular form of shipping?

Standard shipping was the most popular form of shipping for US customers, while store pick-up was the least popular option.

### Q5

What kind of distribution do we observe for our `Age` column? What does this tell us about the typical shopper in the US?

The distribution of the `Age` column is skewed to the right; judging by the distribution, the typical US shopper appears to be between 30 and 40 years old (mean: 34.26).
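
Skew can also be checked numerically. In a right-skewed distribution the mean usually sits above the median, and the sample skewness is positive. A sketch on made-up ages (not the real `Age` column):

```python
import pandas as pd

# Made-up ages with a long tail toward older shoppers.
ages = pd.Series([22, 24, 25, 27, 28, 30, 33, 45, 60, 68], name="Age")

# Positive skewness suggests a right (positive) skew.
skewness = ages.skew()

print("mean:", ages.mean())
print("median:", ages.median())
print("skewness:", round(skewness, 2))
```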

### Q6

What kind of distribution do we observe for our `Purchase Amount (USD)` column? Why might this be? Take a look at the boxplots that you've generated to help answer this question.

The "Purchase Amount (USD)" column has a bimodal distribution, which appears to be driven by whether customers used a promotional code. Customers who didn't use a promotional code spent $30 on average, while those who did spent $50 on average; these match the two peaks observed in the "Purchase Amount (USD)" histogram.