<a href="https://colab.research.google.com/github/s-lasch/Notebooks/blob/main/Normal%20Distributions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Imports**
> I like to use [plotly](https://plotly.com/python/) as my primary data visualization library because it renders interactive plots in the browser through the `plotly.js` JavaScript library

In [1]:
# statistics
import scipy.stats as st
import math

# data wrangling
import pandas as pd

# data visualization
import plotly.express as px
import plotly.figure_factory as ff

# **Normal Distributions:** Marketing Statistics Data
> Data Obtained From: https://github.com/nailson/ifood-data-business-analyst-test

In [2]:
# load data
df = pd.read_csv("https://raw.githubusercontent.com/nailson/ifood-data-business-analyst-test/master/ifood_df.csv")
df.head()

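| | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | ... | marital_Together | marital_Widow | education_2n Cycle | education_Basic | education_Graduation | education_Master | education_PhD | MntTotal | MntRegularProds | AcceptedCmpOverall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1529 | 1441 | 0 |
| 1 | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 21 | 15 | 0 |
| 2 | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 734 | 692 | 0 |
| 3 | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 48 | 43 | 0 |
| 4 | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 407 | 392 | 0 |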


In [3]:
# focus on income variable
income = df['Income']

In [19]:
# create distplot data
hist_data = [income]
labels = ['incomes']

# create normal distribution
norm = ff.create_distplot(hist_data,
                          labels,
                          bin_size=10000,
                          curve_type="normal",
                          show_hist=True,
                          show_rug=True,
                          colors=['rgba(0,0,255,.5)'],
                          ).update_layout(title="<b>Normal Distribution of Incomes</b>",
                                          title_x=0.42,
                                          hovermode='x unified',
                                          width=1000
                                          )
                          
# add a vertical line representing mean income
norm.add_vline(x=income.mean(),
               line_dash="dash",
               line_color='darkred',
               annotation=dict(font_color='darkred'),
               annotation_text=f"<b>mean:</b> ${income.mean():,.2f}",
               annotation_position="right"
               )

# add a vertical line representing median income
norm.add_vline(x=income.median(),
               line_dash="dash",
               line_color='blue',
               annotation=dict(font_color='blue'),
               annotation_text=f"<b>median:</b> ${income.median():,.2f}",
               annotation_position="left"
               )

# display plot
norm.show()



> Because the mean and the median are almost identical, **the data shows little to no skew**: the distribution is roughly symmetric around its center
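
The mean-versus-median comparison above can be backed up with a sample skewness coefficient. The sketch below is only an illustration: it uses synthetic data rather than the income column, and `describe_skew` is a hypothetical helper name, not something from the notebook. A skewness near 0 suggests a roughly symmetric distribution, while a clearly positive value suggests a long right tail.

```python
import numpy as np
import scipy.stats as st

# synthetic samples (assumption: stand-ins for real data, not the income column)
rng = np.random.default_rng(0)
symmetric = rng.normal(loc=50_000, scale=20_000, size=5_000)
right_skewed = rng.lognormal(mean=10.5, sigma=0.8, size=5_000)


def describe_skew(values):
    """Return mean, median, and bias-corrected sample skewness (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return {
        "mean": values.mean(),
        "median": np.median(values),
        "skewness": st.skew(values, bias=False),
    }


for name, sample in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    summary = describe_skew(sample)
    print(f"{name}: mean={summary['mean']:,.2f}, "
          f"median={summary['median']:,.2f}, skewness={summary['skewness']:.3f}")
```

In the notebook, the same check would presumably be `describe_skew(income)`, or simply `income.skew()` with pandas.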







# **Example Problem:** Confidence Intervals
> Calculate a **90% confidence interval** for the mean income, using the sample plotted above.  
> Using the **Margin of Error (MOE)**, we can build an interval around the sample mean ***$\bar{x}$*** that, across repeated samples, would capture the population mean ***$\mu$*** about **90%** of the time.

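$\text{CI} = \bar{x} \ \pm \ \large{t^{*}\frac{s}{\sqrt{n}}}$


$\text{MOE} = \large{t^{*}\frac{s}{\sqrt{n}}}$

> Here $t^{*}$ is the critical value of the $t$ distribution with $n-1$ degrees of freedom, used in place of $z$ because the population standard deviation $\sigma$ is unknown.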

In [5]:
# find the size
n = len(income)
n

2205

In [6]:
# calculate mean, denoted as xbar
x_bar = income.mean()
x_bar

51622.0947845805

In [12]:
# calculate the t critical value, as sigma is unknown
# (a two-sided 90% interval leaves 5% in each tail, hence the 95th percentile)
t = st.t.ppf(.95, n-1)
t

1.6455452847215248
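
To see where the `.95` comes from and how much the choice of $t$ over $z$ matters at this sample size, here is a minimal sketch. The names `confidence`, `alpha`, `t_crit`, and `z_crit` are illustrative and not from the notebook; `n` is the sample size reported above.

```python
import scipy.stats as st

confidence = 0.90
alpha = 1 - confidence   # total probability left outside the interval
n = 2205                 # sample size reported in the notebook

# a two-sided interval puts alpha/2 in each tail, so the cutoff is the
# (1 - alpha/2) quantile, i.e. the 95th percentile for a 90% interval
t_crit = st.t.ppf(1 - alpha / 2, n - 1)
z_crit = st.norm.ppf(1 - alpha / 2)

print(f"t critical value (df={n - 1}): {t_crit:.4f}")
print(f"z critical value:            {z_crit:.4f}")
print(f"difference:                  {t_crit - z_crit:.4f}")
```

With more than two thousand degrees of freedom the two critical values differ only in the third decimal place, so the $t$ and $z$ intervals would be nearly identical here.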

In [8]:
# calculate the sample standard deviation (st.tstd without limits divides by n-1)
s = st.tstd(income)
s

20713.06382588019
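
As a sanity check on what `st.tstd` returns, the sketch below compares it with the pandas and NumPy equivalents on a small toy dataset (an assumption for illustration, not the income data). Called without `limits`, `st.tstd` should match the usual sample standard deviation with `ddof=1`.

```python
import math

import numpy as np
import pandas as pd
import scipy.stats as st

values = [2, 4, 4, 4, 5, 5, 7, 9]  # toy data, not taken from the income column

sample_sd_scipy = st.tstd(values)            # no trimming limits: plain sample SD
sample_sd_pandas = pd.Series(values).std()   # pandas defaults to ddof=1
sample_sd_numpy = np.std(values, ddof=1)     # NumPy needs ddof=1 set explicitly
population_sd = np.std(values)               # NumPy default (ddof=0) divides by n

print(f"sample SD:     {sample_sd_scipy:.5f}")
print(f"population SD: {population_sd:.5f}")
```

The distinction matters little for a sample of 2,205 incomes, but the confidence interval formula calls for the sample version $s$.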

In [13]:
# determine confidence interval
moe = t * (s / math.sqrt(n))
lower_bound = x_bar - moe
upper_bound = x_bar + moe

# display the interval
print(f"90% Confidence Interval: ${lower_bound:,.2f} to ${upper_bound:,.2f}")

90% Confidence Interval: $50,896.24 to $52,347.95
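
The manual calculation above can be cross-checked with `scipy.stats.t.interval`, which returns both endpoints in one call. This is a sketch that hard-codes the summary statistics printed in the earlier cells (`n`, `x_bar`, `s`) so it runs without downloading the data. The names `standard_error`, `lower`, `upper`, and `margin_of_error` are illustrative.

```python
import math

import scipy.stats as st

# summary statistics copied from the cell outputs above
n = 2205
x_bar = 51622.0947845805
s = 20713.06382588019

# standard error of the mean: s / sqrt(n)
standard_error = s / math.sqrt(n)

# 90% interval from a t distribution with n-1 degrees of freedom,
# centered on x_bar and scaled by the standard error
lower, upper = st.t.interval(0.90, n - 1, loc=x_bar, scale=standard_error)
margin_of_error = (upper - lower) / 2

print(f"90% Confidence Interval: ${lower:,.2f} to ${upper:,.2f}")
print(f"Margin of Error: ${margin_of_error:,.2f}")
```

The endpoints should agree with the manually computed bounds, since both approaches use the same critical value and standard error.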
