Do your work for this exercise in a Jupyter notebook named hypothesis_testing.ipynb.

For each of the following questions, formulate a null and an alternative hypothesis (be as specific as you can), then give an example of what a true positive, a true negative, a type I error, and a type II error would look like.

Is the website redesign any good?

- $H_0$: Web traffic after the redesign did not increase.
    
- $H_a$: Web traffic after the redesign increased.

- Type I error: We conclude that the redesign increased web traffic, when in reality it did not and our sample just had abnormally high traffic.

- Type II error: We conclude that the redesign did not increase web traffic, when in reality it did and our sample just had abnormally low traffic.

Is our television ad driving more sales?

- $H_0$: Sales did not increase after the ad aired.
    
- $H_a$: Sales increased after the ad aired.

- Type I error: We conclude that sales increased after the ad aired, when our sample just happened to have abnormally high sales, unimpacted by the ad.

- Type II error: We conclude that sales did not increase after the ad aired, when our sample actually just had abnormally low sales, despite the ad's impact.

Has the network latency gone up since we switched internet service providers?

- $H_0$: Since we changed ISPs, network latency has not gone up.
    
- $H_a$: Since we changed ISPs, network latency has gone up.

- Type I error: We find that our ISP switch increased network latency, but our sample actually just had abnormally high network latency, despite the switch's negligible impact.

- Type II error: We find that our ISP switch did not increase network latency, but our sample actually just had abnormally low network latency despite the significant impact of the ISP switch.
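
The two error types above can also be seen by simulation. The sketch below is illustrative only: the means, standard deviation, sample size, and the 5-unit shift are assumed values, not data from any of the scenarios. It repeats a one-sided two-sample t-test many times, first with no real effect (every rejection is a type I error) and then with a real effect (every failure to reject is a type II error).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed, for repeatability
alpha = 0.05
n_trials = 2000


def rejection_rate(true_shift):
    """Fraction of simulated experiments in which H0 is rejected."""
    rejections = 0
    for _ in range(n_trials):
        # Hypothetical "before" and "after" measurements
        before = rng.normal(loc=100, scale=15, size=50)
        after = rng.normal(loc=100 + true_shift, scale=15, size=50)
        t_stat, p_two_sided = stats.ttest_ind(after, before)
        # Convert to a one-sided p-value for H_a: mean(after) > mean(before)
        p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
        if p_one_sided < alpha:
            rejections += 1
    return rejections / n_trials


type_1_rate = rejection_rate(true_shift=0)  # H0 is true: rejections are false positives
power = rejection_rate(true_shift=5)        # H0 is false: rejections are true positives
type_2_rate = 1 - power                     # H0 is false but we failed to reject it

print(f"type I rate: {type_1_rate:.3f}")
print(f"type II rate: {type_2_rate:.3f}")
```

The type I rate should land close to `alpha`, while the type II rate depends on the effect size and the sample size chosen for the simulation.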

In [3]:
from scipy import stats

sample_1 = stats.norm(90, 15).rvs(40)
sample_2 = stats.norm(100, 20).rvs(50)

stats.ttest_ind(sample_1, sample_2)

# Reject H0: the average times differ, p-value of 0.0015

Ttest_indResult(statistic=-3.272226451476716, pvalue=0.0015253054415761674)
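
The two samples above are drawn with different standard deviations (15 and 20) and different sizes, while `stats.ttest_ind` assumes equal variances by default. A minimal sketch of the comparison with Welch's t-test (`equal_var=False`) follows. The seed is an arbitrary assumption added for repeatability, so the numbers will not match the output shown above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, not part of the original cell

# Same distributions and sample sizes as the cell above
sample_1 = rng.normal(loc=90, scale=15, size=40)
sample_2 = rng.normal(loc=100, scale=20, size=50)

# Default: pooled-variance (Student) t-test
pooled = stats.ttest_ind(sample_1, sample_2)

# Welch's t-test: does not assume the two variances are equal
welch = stats.ttest_ind(sample_1, sample_2, equal_var=False)

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

When the variances really are similar the two tests give nearly the same answer, so Welch's version is a reasonable default when in doubt.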

In [28]:
from pydataset import data

mpg = data('mpg')
mpg['mpg'] = (mpg['cty'] + mpg['hwy']) / 2

mpg_08 = mpg[mpg['year'] == 2008]['mpg']
mpg_99 = mpg[mpg['year'] == 1999]['mpg']

stats.ttest_ind(mpg_08, mpg_99)

# Fail to reject H0: no significant difference between 1999 and 2008, p-value of 0.83

compact = mpg[mpg['class'] == 'compact']['mpg']
average_car = mpg['mpg']

print(compact.mean())
print(average_car.mean())
print(stats.ttest_ind(compact, average_car))

# Yes, compact cars get better mileage, p-value of .00000029
# Note: average_car also contains the compact cars, so the two samples overlap.

mpg['is_auto'] = mpg['trans'].apply(lambda x: 'auto' in x)
manual = mpg[~mpg['is_auto']]['mpg']
auto = mpg[mpg['is_auto']]['mpg']


print(manual.mean())
print(auto.mean())
print(stats.ttest_ind(manual,auto))

# Yes, manual cars get better mileage, p-value of .0000071

24.21276595744681
20.14957264957265
Ttest_indResult(statistic=5.260311926248542, pvalue=2.8684546158129373e-07)
22.227272727272727
19.130573248407643
Ttest_indResult(statistic=4.593437735750014, pvalue=7.154374401145683e-06)
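
Because the compact cars are a subset of all cars, the two groups passed to `ttest_ind` in the cell above are not independent. Two possible alternatives are sketched below on synthetic data: a one-sample t-test of the compact cars against the overall mean, or a two-sample test of compact cars against the remaining cars. The array values and group sizes are made-up stand-ins, not the real `mpg` data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical stand-in for the combined mileage column
all_cars = rng.normal(loc=20, scale=5, size=234)
is_compact = np.arange(all_cars.size) < 47   # pretend the first 47 rows are compact cars
all_cars[is_compact] += 4                    # give the pretend compact cars better mileage

compact = all_cars[is_compact]
not_compact = all_cars[~is_compact]

# Option 1: compare the subgroup to the overall mean, treated as a fixed value
one_sample = stats.ttest_1samp(compact, all_cars.mean())

# Option 2: compare two non-overlapping groups
two_sample = stats.ttest_ind(compact, not_compact, equal_var=False)

print(one_sample)
print(two_sample)
```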


In [108]:
from env import user, host, password
import pandas as pd

def get_db_url(username, hostname, password, db_name):
    return f'mysql+pymysql://{username}:{password}@{hostname}/{db_name}'

url = get_db_url(user, host, password, 'telco_churn')

query = '''
select tenure, monthly_charges, total_charges, phone_service, internet_service_type_id from customers;
'''

telco = pd.read_sql(query,url)



In [113]:
def isfloat(value):  # credit to Stack Exchange user Eric Leschinski
    try:
        float(value)
        return True
    except ValueError:
        return False

floats_only = telco.total_charges.apply(isfloat)

stats.pearsonr(telco.tenure,telco.monthly_charges) #Yes, there's a correlation. 
stats.pearsonr(telco[floats_only].tenure,telco[floats_only].total_charges.apply(float)) #Yes, there's a correlation.

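# Assumption: ids greater than 1 mean the customer has internet service.
# Verify against the internet_service_types table before relying on this filter.
has_internet = telco.internet_service_type_id > 1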
has_phone = telco.phone_service == "Yes"

stats.pearsonr(telco.tenure[has_internet], telco.monthly_charges[has_internet]) #Yes, correlated
stats.pearsonr(telco.tenure[has_phone], telco.monthly_charges[has_phone]) #Yes, correlated

(0.24538898585362878, 7.117871077967264e-88)
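
The `isfloat` filter above works. A more compact way to get the same effect might be `pd.to_numeric` with `errors='coerce'`, which turns unparseable strings into `NaN` so they can be dropped. The sketch below uses a small made-up frame in place of the real query result.

```python
import pandas as pd
from scipy import stats

# Hypothetical rows standing in for the telco query result
telco_demo = pd.DataFrame({
    'tenure': [1, 34, 0, 45, 2],
    'total_charges': ['29.85', '1889.5', ' ', '1840.75', '108.15'],
})

# Strings that cannot be parsed as numbers (here, the blank one) become NaN
telco_demo['total_charges'] = pd.to_numeric(telco_demo['total_charges'], errors='coerce')

# Keep only the rows with a usable total_charges value
clean = telco_demo.dropna(subset=['total_charges'])

r, p = stats.pearsonr(clean['tenure'], clean['total_charges'])
print(r, p)
```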

In [116]:
url = get_db_url(user, host, password, 'employees')
query = '''
select datediff(curdate(),e.hire_date) as tenure,
    salary
    from employees as e
    join salaries as s 
    using(emp_no)
    where s.to_date like "9999%%";
     '''
salary_by_tenure = pd.read_sql(query,url)



In [119]:
salary_by_tenure['tenure'] = salary_by_tenure['tenure'] - salary_by_tenure['tenure'].min()

stats.pearsonr(salary_by_tenure['tenure'],salary_by_tenure['salary']) #Yes, they're correlated

(0.30646256131860783, 0.0)
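
A p-value printed as `0.0` is a floating-point underflow, not an exact zero. With a very large sample, even a modest correlation produces one. The sketch below uses synthetic data (the sample size and slope are assumptions) to show that the effect size, here $r^2$, is the more informative number in that situation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed
n = 200_000                     # hypothetical large sample

x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)  # weak-to-moderate linear relationship plus noise

r, p = stats.pearsonr(x, y)
r_squared = r ** 2  # share of the variance in y explained by a linear fit on x

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, p = {p}")
```

Here the p-value is vanishingly small even though `x` explains only a small share of the variance in `y`.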

In [121]:
query = '''
select datediff(curdate(),e.hire_date) as tenure,
    count(title) as titles
    from employees as e
    join titles as t 
    using(emp_no)
    group by emp_no;
    '''
titles_by_tenure = pd.read_sql(query,url)

In [122]:
titles_by_tenure['tenure'] = titles_by_tenure['tenure'] - titles_by_tenure['tenure'].min()

stats.pearsonr(titles_by_tenure['tenure'],titles_by_tenure['titles']) # Yes, they're correlated

(0.26659892991366196, 0.0)

In [123]:
sleep = data('sleepstudy')

In [125]:
stats.pearsonr(sleep.Reaction,sleep.Days) # Yes, they're correlated

(0.5352302262650253, 9.894096322214812e-15)

In [126]:
index = ['Macbook', 'Not Macbook']
columns = ['Codeup', 'Not Codeup']
macbooks = pd.DataFrame([[49,20], [1, 30]], index=index, columns=columns)

In [127]:
stats.chi2_contingency(macbooks) # Reject H0 of independence: very low p-value

(36.65264142122487, 1.4116760526193828e-09, 1, array([[34.5, 34.5],
        [15.5, 15.5]]))
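
The tuple returned by `chi2_contingency` is easier to read when unpacked into named values. For a 2x2 table SciPy applies Yates' continuity correction by default, and passing `correction=False` gives the uncorrected statistic. A short sketch using the same table as above:

```python
import pandas as pd
from scipy import stats

index = ['Macbook', 'Not Macbook']
columns = ['Codeup', 'Not Codeup']
macbooks = pd.DataFrame([[49, 20], [1, 30]], index=index, columns=columns)

# Default call: Yates' correction is applied because the table has 1 degree of freedom
chi2, p, dof, expected = stats.chi2_contingency(macbooks)

# Same test without the continuity correction
chi2_raw, p_raw, _, _ = stats.chi2_contingency(macbooks, correction=False)

# Label the expected counts so they line up with the observed table
expected = pd.DataFrame(expected, index=index, columns=columns)

print(f"corrected:   chi2 = {chi2:.4f}, p = {p:.3g}, dof = {dof}")
print(f"uncorrected: chi2 = {chi2_raw:.4f}, p = {p_raw:.3g}")
print(expected)
```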

In [137]:
stats.chi2_contingency(pd.crosstab(mpg.trans, mpg.manufacturer))
# Null: Transmission type is independent of manufacturer.
# Alt: Transmission type is not independent of manufacturer.
# p-value < 0.05, so we reject the null hypothesis

(246.91908570197074,
 7.163203875453598e-10,
 126,
 array([[ 0.38461538,  0.40598291,  0.79059829,  0.53418803,  0.19230769,
          0.2991453 ,  0.17094017,  0.08547009,  0.06410256,  0.08547009,
          0.27777778,  0.10683761,  0.2991453 ,  0.72649573,  0.57692308],
        [ 0.15384615,  0.16239316,  0.31623932,  0.21367521,  0.07692308,
          0.11965812,  0.06837607,  0.03418803,  0.02564103,  0.03418803,
          0.11111111,  0.04273504,  0.11965812,  0.29059829,  0.23076923],
        [ 6.38461538,  6.73931624, 13.12393162,  8.86752137,  3.19230769,
          4.96581197,  2.83760684,  1.41880342,  1.06410256,  1.41880342,
          4.61111111,  1.77350427,  4.96581197, 12.05982906,  9.57692308],
        [ 3.        ,  3.16666667,  6.16666667,  4.16666667,  1.5       ,
          2.33333333,  1.33333333,  0.66666667,  0.5       ,  0.66666667,
          2.16666667,  0.83333333,  2.33333333,  5.66666667,  4.5       ],
        [ 0.46153846,  0.48717949,  0.94871795,  0.641025
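
Many of the expected counts in the output above are well below 5. A common rule of thumb says the chi-square approximation becomes unreliable when more than about 20% of expected counts fall below 5, or when any fall below 1. The sketch below shows one way to check this, using a small made-up table in place of the transmission-by-manufacturer crosstab.

```python
import numpy as np
from scipy import stats

# Hypothetical sparse contingency table (rows and columns are arbitrary categories)
observed = np.array([
    [12, 1, 0, 3],
    [9, 2, 1, 0],
    [30, 4, 2, 5],
])

chi2, p, dof, expected = stats.chi2_contingency(observed)

share_small = (expected < 5).mean()  # fraction of cells with expected count below 5
any_below_one = (expected < 1).any()  # True if any expected count is below 1

print(f"dof = {dof}, share of expected counts < 5: {share_small:.2f}")
print(f"any expected count < 1: {any_below_one}")
```

When the check fails, options include merging sparse categories before testing or treating the p-value with caution.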

In [138]:
query = '''
select gender, dept_name
    from employees
    join dept_emp as de using(emp_no)
    join departments using(dept_no)
    where de.to_date like "9999%%";
    '''

gender_vs_dept = pd.read_sql(query,url)



In [146]:
dept_filter = gender_vs_dept['dept_name'].isin(['Marketing', 'Sales'])
relevant = gender_vs_dept[dept_filter]
stats.chi2_contingency(pd.crosstab(relevant.gender, relevant.dept_name)) # Fail to reject H0: no evidence that gender and department are associated, p-value is .57

(0.3240332004060638,
 0.5691938610810126,
 1,
 array([[ 5893.2426013, 14969.7573987],
        [ 8948.7573987, 22731.2426013]]))

In [148]:
query = '''
select gender, count(dm.emp_no) as m_count
    from employees
    left join dept_manager as dm using(emp_no)
    group by emp_no;'''

gender_vs_manager = pd.read_sql(query,url)

In [152]:
gender_vs_manager['been_manager'] = gender_vs_manager['m_count'] > 0

stats.chi2_contingency(pd.crosstab(gender_vs_manager.gender, gender_vs_manager.been_manager)) # Fail to reject H0: no evidence that gender and having been a manager are associated, p-value is .23

(1.4566857643547197,
 0.22745818732810363,
 1,
 array([[1.20041397e+05, 9.60331174e+00],
        [1.79958603e+05, 1.43966883e+01]]))