# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [55]:
# import numpy and pandas
import pandas as pd
import numpy as np
import math
import statistics as stats


# Challenge 1 - Exploring the Data

In this challenge, we will examine all salaries of employees of the City of Chicago. We will start by loading the dataset and examining its contents.

In [56]:
# Your code here:
salaries = pd.read_csv('Current_Employee_Names__Salaries__and_Position_Titles.csv')


Examine the `salaries` dataset using the `head` function below.

In [57]:
# Your code here:
pd.set_option('display.max_columns', None)
salaries.head()


Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
0,"AARON, JEFFERY M",SERGEANT,POLICE,F,Salary,,101442.0,
1,"AARON, KARINA",POLICE OFFICER (ASSIGNED AS DETECTIVE),POLICE,F,Salary,,94122.0,
2,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,F,Salary,,101592.0,
3,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,F,Salary,,110064.0,
4,"ABASCAL, REECE E",TRAFFIC CONTROL AIDE-HOURLY,OEMC,P,Hourly,20.0,,19.86


The output of `head` shows that there is quite a bit of missing data. Let's examine how much data is missing in each column. Produce this output in the cell below.

In [58]:
# Your code here:
#salaries.info()
#salaries.isnull().values.any()
salaries.isnull().sum()


Name                     0
Job Titles               0
Department               0
Full or Part-Time        0
Salary or Hourly         0
Typical Hours        25161
Annual Salary         8022
Hourly Rate          25161
dtype: int64

Let's also look at the count of hourly vs. salaried employees. Write the code in the cell below

In [59]:
# Your code here:
salaries['Salary or Hourly'].value_counts()


Salary    25161
Hourly     8022
Name: Salary or Hourly, dtype: int64

What this information indicates is that the table contains information about two types of employees - salaried and hourly. Some columns apply only to one type of employee while other columns only apply to another kind. This is why there are so many missing values. Therefore, we will not do anything to handle the missing values.
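As a quick check of that explanation, the missing values can be counted separately for each employee type. The sketch below uses a tiny made-up table (`toy` is a hypothetical stand-in, not the real dataset) with the same column names; on the real data the same expression would be applied to `salaries`.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for the salaries table
toy = pd.DataFrame({
    "Salary or Hourly": ["Salary", "Salary", "Hourly", "Hourly"],
    "Typical Hours": [np.nan, np.nan, 20.0, 40.0],
    "Annual Salary": [101442.0, 94122.0, np.nan, np.nan],
    "Hourly Rate": [np.nan, np.nan, 19.86, 46.10],
})

pay_columns = ["Typical Hours", "Annual Salary", "Hourly Rate"]

# Count missing values per pay column, split by employee type
missing_by_type = toy[pay_columns].isnull().groupby(toy["Salary or Hourly"]).sum()
print(missing_by_type)
```

If the explanation holds, each column is missing only for the employee type it does not apply to.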

There are different departments in the city. List all departments and the count of employees in each department.

In [60]:
# Your code here:
salaries['Department'].value_counts()


POLICE                   13414
FIRE                      4641
STREETS & SAN             2198
OEMC                      2102
WATER MGMNT               1879
AVIATION                  1629
TRANSPORTN                1140
PUBLIC LIBRARY            1015
GENERAL SERVICES           980
FAMILY & SUPPORT           615
FINANCE                    560
HEALTH                     488
CITY COUNCIL               411
LAW                        407
BUILDINGS                  269
COMMUNITY DEVELOPMENT      207
BUSINESS AFFAIRS           171
COPA                       116
BOARD OF ELECTION          107
DoIT                        99
PROCUREMENT                 92
INSPECTOR GEN               87
MAYOR'S OFFICE              85
CITY CLERK                  84
ANIMAL CONTRL               81
HUMAN RESOURCES             79
CULTURAL AFFAIRS            65
BUDGET & MGMT               46
ADMIN HEARNG                39
DISABILITIES                28
TREASURER                   22
HUMAN RELATIONS             16
BOARD OF

# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the mean hourly wage of all hourly workers is significantly different from $30/hr. Import the correct one-sample test function from scipy and perform a two-sided hypothesis test at the 95% confidence level.

In [61]:
# Your code here:
import scipy.stats as st


In [62]:
# Keep only the rows that have an hourly rate, i.e. the hourly workers
hourly = salaries.dropna(subset=['Hourly Rate'])
hourly

Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
4,"ABASCAL, REECE E",TRAFFIC CONTROL AIDE-HOURLY,OEMC,P,Hourly,20.0,,19.86
6,"ABBATACOLA, ROBERT J",ELECTRICAL MECHANIC,AVIATION,F,Hourly,40.0,,46.10
7,"ABBATE, JOSEPH L",POOL MOTOR TRUCK DRIVER,STREETS & SAN,F,Hourly,40.0,,35.60
10,"ABBOTT, BETTY L",FOSTER GRANDPARENT,FAMILY & SUPPORT,P,Hourly,20.0,,2.65
18,"ABDULLAH, LAKENYA N",CROSSING GUARD,OEMC,P,Hourly,20.0,,17.68
...,...,...,...,...,...,...,...,...
33164,"ZUREK, FRANCIS",ELECTRICAL MECHANIC,OEMC,F,Hourly,40.0,,46.10
33168,"ZWARYCZ MANN, IRENE A",CROSSING GUARD,OEMC,P,Hourly,20.0,,17.68
33169,"ZWARYCZ, THOMAS J",POOL MOTOR TRUCK DRIVER,WATER MGMNT,F,Hourly,40.0,,35.60
33174,"ZYGADLO, JOHN P",MACHINIST (AUTOMOTIVE),GENERAL SERVICES,F,Hourly,40.0,,46.35


In [63]:
hourly_sample = hourly['Hourly Rate'].sample(30)
hourly_sample


17739    16.00
6527     16.17
4977     46.10
12359    15.22
1341     44.25
21822    45.35
9139     45.35
3926     40.20
12363    44.55
13892    12.49
5790     41.30
11777    22.12
2275     35.60
23170    35.60
23384    19.86
21089    32.04
17732    19.86
8304     49.10
14520    35.60
27674    46.10
5655     35.60
22078    40.20
26496    45.07
21007    22.36
11298    14.51
10999    36.21
30537    34.57
12467    35.60
13563    40.20
13997    35.60
Name: Hourly Rate, dtype: float64

In [64]:
# H0: mu = 30, H1: mu != 30

In [65]:
st.ttest_1samp(hourly_sample,30)

Ttest_1sampResult(statistic=1.6362415697286739, pvalue=0.11259767035521023)

In [66]:
# The p-value (0.113) is above 0.05, so we fail to reject H0:
# this sample gives no significant evidence that the mean hourly rate differs from $30
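The test statistic can be cross-checked by hand with the `math` and `statistics` modules. The sketch below copies the 30 hourly rates printed above and compares |t| with the two-sided critical value for 29 degrees of freedom; the constant 2.045 is taken from a standard t table and is an assumption of this sketch, not something the notebook computes.

```python
import math
import statistics as stats

# The 30 hourly rates printed in the sample above
hourly_sample = [
    16.00, 16.17, 46.10, 15.22, 44.25, 45.35, 45.35, 40.20, 44.55, 12.49,
    41.30, 22.12, 35.60, 35.60, 19.86, 32.04, 19.86, 49.10, 35.60, 46.10,
    35.60, 40.20, 45.07, 22.36, 14.51, 36.21, 34.57, 35.60, 40.20, 35.60,
]

mu0 = 30  # hypothesised mean under H0
n = len(hourly_sample)
sample_mean = stats.mean(hourly_sample)
standard_error = stats.stdev(hourly_sample) / math.sqrt(n)  # sample std / sqrt(n)
t_statistic = (sample_mean - mu0) / standard_error

# Two-sided critical value for alpha = 0.05 and 29 degrees of freedom (t table)
T_CRITICAL_29 = 2.045

reject_h0 = abs(t_statistic) > T_CRITICAL_29
print(round(t_statistic, 4), reject_h0)
```

The statistic agrees with the value reported by `ttest_1samp`, and since |t| is below the critical value the decision is the same as the one based on the p-value.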

We are also curious about salaries in the police force. The chief of police in Chicago claimed in a press briefing that salaries this year are higher than last year's mean of $86000/year for all salaried employees. Test this one-sided hypothesis at the 95% confidence level.

Hint: When the test statistic points in the direction of the alternative hypothesis, the one-tailed p-value is half of the two-tailed p-value. For a "greater than" alternative, rejecting the null hypothesis therefore also requires a positive test statistic.
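The hint can be written as a small helper. This is only a sketch: `one_sided_p_value` is a hypothetical function name, and the numbers in the usage lines are illustrative, not taken from the dataset.

```python
def one_sided_p_value(statistic, two_sided_p, alternative):
    """Convert a two-sided t-test p-value into a one-sided p-value.

    alternative is "greater" (H1: mean > mu0) or "less" (H1: mean < mu0).
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = two_sided_p / 2
    if alternative == "greater":
        points_toward_h1 = statistic > 0
    else:
        points_toward_h1 = statistic < 0
    # If the statistic points away from H1, the one-sided p-value is the complement
    return half if points_toward_h1 else 1 - half


# Illustrative values only
p_greater = one_sided_p_value(statistic=2.0, two_sided_p=0.06, alternative="greater")
p_less = one_sided_p_value(statistic=2.0, two_sided_p=0.06, alternative="less")
print(p_greater, p_less)
```

The second call shows why the sign matters: a positive statistic can never support a "less than" alternative, however small the two-sided p-value is.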

In [67]:
# Your code here:
# Salaried employees from every department (no POLICE filter is applied here)
yearly = salaries.dropna(subset=['Annual Salary'])
yearly


Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
0,"AARON, JEFFERY M",SERGEANT,POLICE,F,Salary,,101442.0,
1,"AARON, KARINA",POLICE OFFICER (ASSIGNED AS DETECTIVE),POLICE,F,Salary,,94122.0,
2,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,F,Salary,,101592.0,
3,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,F,Salary,,110064.0,
5,"ABBASI, CHRISTOPHER",STAFF ASST TO THE ALDERMAN,CITY COUNCIL,F,Salary,,50436.0,
...,...,...,...,...,...,...,...,...
33178,"ZYLINSKA, KATARZYNA",POLICE OFFICER,POLICE,F,Salary,,72510.0,
33179,"ZYMANTAS, LAURA C",POLICE OFFICER,POLICE,F,Salary,,48078.0,
33180,"ZYMANTAS, MARK E",POLICE OFFICER,POLICE,F,Salary,,90024.0,
33181,"ZYRKOWSKI, CARLO E",POLICE OFFICER,POLICE,F,Salary,,93354.0,


In [68]:
yearly_sample = yearly['Annual Salary'].sample(30)
yearly_sample

14049     73992.0
15627    136794.0
2614      93354.0
28818     96060.0
9375      93300.0
24402     48078.0
22316     48078.0
20433    111474.0
10671     48078.0
22999     96060.0
17024     93354.0
31690     49704.0
18089     87006.0
31825     93354.0
19809    127068.0
25156     90024.0
22618     90024.0
14216    260004.0
24971     96060.0
12102     93354.0
31839     62940.0
4177      42108.0
4161     101442.0
20297     90024.0
7478      48078.0
12896     93876.0
9266     111474.0
27789     84870.0
6226     170112.0
6568      97440.0
Name: Annual Salary, dtype: float64

In [69]:
# H0: mu <= 86000, H1: mu > 86000

In [70]:
st.ttest_1samp(yearly_sample,86000)

Ttest_1sampResult(statistic=1.0715974842598575, pvalue=0.29273824890928557)

In [71]:
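# The statistic is positive, so the one-sided p-value is 0.2927 / 2 = 0.146, which is above 0.05.
# We fail to reject H0: this sample does not show that the mean salary is higher than $86000.
# Note: the sample was drawn from all salaried employees, not only from the POLICE department.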

Using the `crosstab` function, find the department that has the most hourly workers. 

In [72]:
# Your code here:
salaries.rename(columns = {'Typical Hours':'Typical_Hours'}, inplace = True)


In [73]:
department = pd.crosstab(salaries.Department, salaries.Typical_Hours)
department
# STREETS & SAN department has the most hourly workers

Typical_Hours,10.0,20.0,35.0,40.0
Department,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ANIMAL CONTRL,1,18,0,0
AVIATION,0,18,15,1049
BUDGET & MGMT,0,0,2,0
BUSINESS AFFAIRS,0,7,0,0
CITY COUNCIL,7,36,18,3
COMMUNITY DEVELOPMENT,0,3,1,0
CULTURAL AFFAIRS,0,0,7,0
FAMILY & SUPPORT,0,286,0,1
FINANCE,0,2,0,42
FIRE,0,0,0,2
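Another way to answer the question is to cross-tabulate the department against the `Salary or Hourly` column, which gives one count of hourly workers per department directly. The sketch below runs on a small made-up table (`toy` is hypothetical); on the real data the same two lines would use `salaries`.

```python
import pandas as pd

# Hypothetical stand-in for the salaries table
toy = pd.DataFrame({
    "Department": ["STREETS & SAN", "STREETS & SAN", "STREETS & SAN",
                   "POLICE", "POLICE", "OEMC", "OEMC"],
    "Salary or Hourly": ["Hourly", "Hourly", "Salary",
                         "Salary", "Salary", "Hourly", "Salary"],
})

# Rows: departments, columns: "Hourly" and "Salary" head counts
pay_type_by_department = pd.crosstab(toy["Department"], toy["Salary or Hourly"])

# Department with the largest number of hourly workers
most_hourly = pay_type_by_department["Hourly"].idxmax()
print(pay_type_by_department)
print(most_hourly)
```

Sorting the `Hourly` column with `sort_values(ascending=False)` would show the full ranking instead of only the top department.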


The workers from the department with the most hourly workers have complained that their hourly wage is less than $35/hour. Using a one sample t-test, test this one-sided hypothesis at the 95% confidence level.

In [74]:
# Your code here:
# The workers claim that the mean is below $35, so that claim is the alternative hypothesis
# H0: mu >= 35, H1: mu < 35

In [75]:
streets = hourly[hourly["Department"] == "STREETS & SAN"]
streets

Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
7,"ABBATE, JOSEPH L",POOL MOTOR TRUCK DRIVER,STREETS & SAN,F,Hourly,40.0,,35.60
21,"ABDUL-SHAKUR, TAHIR",GENERAL LABORER - DSS,STREETS & SAN,F,Hourly,40.0,,21.43
24,"ABERCROMBIE, TIMOTHY",MOTOR TRUCK DRIVER,STREETS & SAN,F,Hourly,40.0,,35.60
36,"ABRAMS, DANIELLE T",SANITATION LABORER,STREETS & SAN,F,Hourly,40.0,,36.21
39,"ABRAMS, SAMUEL A",POOL MOTOR TRUCK DRIVER,STREETS & SAN,F,Hourly,40.0,,35.60
...,...,...,...,...,...,...,...,...
33106,"ZIZUMBO, JOSE N",MOTOR TRUCK DRIVER,STREETS & SAN,F,Hourly,40.0,,36.13
33107,"ZIZUMBO, LUIS",MOTOR TRUCK DRIVER,STREETS & SAN,F,Hourly,40.0,,35.60
33147,"ZUMMO, ROBERT J",MOTOR TRUCK DRIVER,STREETS & SAN,F,Hourly,40.0,,35.60
33149,"ZUNICH, JONATHAN G",SANITATION LABORER,STREETS & SAN,F,Hourly,40.0,,36.21


In [76]:
streets_sample = streets['Hourly Rate'].sample(30)
streets_sample

14019    36.21
11847    35.60
24608    36.21
24953    35.60
9849     35.60
18237    36.21
9117     36.21
22975    35.60
18084    36.21
10182    35.60
17435    36.21
152      35.60
19984    36.13
29391    36.21
30349    36.21
22927    35.60
21869    35.60
5419     44.55
15579    28.48
32149    28.48
1038     36.21
20861    36.21
6171     37.25
30386    36.21
22647    35.60
3276     36.21
6354     35.60
16429    36.21
17370    49.10
25430    35.60
Name: Hourly Rate, dtype: float64

In [77]:
st.ttest_1samp(streets_sample,35)

Ttest_1sampResult(statistic=1.8810272275583453, pvalue=0.07004343897701772)

In [78]:
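# For the claim "mean < $35" (H1: mu < 35) the statistic would have to be negative.
# Here it is positive (1.88): the sample mean is above $35, so the one-sided p-value is
# 1 - 0.0700 / 2 = 0.965. We fail to reject H0: the sample does not support the workers' claim.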

# Challenge 3: To practice - Constructing Confidence Intervals

While testing a hypothesis is one way to gather empirical evidence for or against it, another way is to construct a confidence interval. A confidence interval gives a range of plausible values for the true mean of the population. For a 95% confidence interval, the procedure that builds the interval captures the true population mean in 95% of repeated samples.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level, the number of rows - 1 for the degrees of freedom, the mean of the sample for the location parameter and the standard error of the sample for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.

In [79]:
# Your code here:
sample = hourly['Hourly Rate'].sample(30)
sample


14103    35.60
19044    35.60
20634    49.10
18691    46.10
12670    45.07
26250    28.48
29932    13.15
5287     33.56
6027     49.10
14874    35.60
836      41.10
14650    36.21
18       17.68
17811    49.95
12601    28.48
3079     19.86
16418    35.60
13799    21.98
17224    40.20
908      35.60
20972    36.21
28425    37.25
17654    36.21
31875    48.90
1412     40.20
27894    35.60
28945    47.80
27061    44.55
31664    28.48
9745     35.60
Name: Hourly Rate, dtype: float64

In [80]:
from statistics import mean

In [86]:
st.t.interval(0.95, len(sample)-1, loc=sample.mean(), scale=st.sem(sample))

In [None]:
# The first attempt used loc=streets.mean(). That is the mean of every numeric column of the
# STREETS & SAN DataFrame, so it returned one interval per column (nan for Annual Salary):
# (array([33.56613703, nan, 30.15702854]), array([40.70883612, nan, 37.29972763]))
# loc must be the mean of the sample itself: sample.mean()
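The interval for the sample printed above can also be built by hand as mean ± t × standard error. This sketch copies the 30 sampled rates and uses the 97.5th percentile of the t distribution with 29 degrees of freedom; the constant 2.04523 comes from a t table and is an assumption of the sketch.

```python
import math
import statistics as stats

# The 30 hourly rates printed in the sample above
sample = [
    35.60, 35.60, 49.10, 46.10, 45.07, 28.48, 13.15, 33.56, 49.10, 35.60,
    41.10, 36.21, 17.68, 49.95, 28.48, 19.86, 35.60, 21.98, 40.20, 35.60,
    36.21, 37.25, 36.21, 48.90, 40.20, 35.60, 47.80, 44.55, 28.48, 35.60,
]

# 97.5th percentile of Student's t with 29 degrees of freedom (t table)
T_CRITICAL_29 = 2.04523

n = len(sample)
sample_mean = stats.mean(sample)
standard_error = stats.stdev(sample) / math.sqrt(n)
margin = T_CRITICAL_29 * standard_error

interval = (sample_mean - margin, sample_mean + margin)
print(interval)
```

The interval is centred on the mean of the sample, which is why `loc` in `st.t.interval` has to be `sample.mean()` and `scale` the standard error of the same sample.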

Now construct the 95% confidence interval for the mean annual salary of all salaried employees in the police department in the cell below.

In [88]:
# Your code here:
salaried = salaries[salaries["Salary or Hourly"] == "Salary"]
sample_salaried = salaried['Annual Salary'].sample(30)
sample_salaried

32873     72264.0
31528     99024.0
24449    107988.0
3499      72510.0
23957     48078.0
24031     70092.0
2469      94476.0
19011    103932.0
26837     97386.0
4493      93666.0
2678      53340.0
2855      87006.0
22742    111474.0
14584     93354.0
8334     103932.0
13609     35004.0
11069     50628.0
29638    102510.0
22609     90024.0
9122      84054.0
2635      59436.0
21626     76266.0
23812     64392.0
25602     72510.0
16698    100980.0
2241      84054.0
32944     87006.0
3943      48078.0
11417     96060.0
30940     70092.0
Name: Annual Salary, dtype: float64

In [89]:
st.t.interval(0.95,len(sample_salaried)-1,loc=sample_salaried.mean(),scale=st.sem(sample_salaried))

(73312.60879322155, 88661.79120677845)

In [None]:
# 95% confidence interval for the mean annual salary: $73312.61 to $88661.79
# Note: the sample covers all salaried employees; it was not filtered to the POLICE department

# Bonus Challenge - Hypothesis Tests of Proportions

Another type of one-sample test is a hypothesis test of proportions. In this test, we examine whether the proportion of a group in our sample is significantly different from a given fraction.

You can read more about one sample proportion tests [here](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS6-CategoricalData/SAS6-CategoricalData2.html).

In the cell below, use the `proportions_ztest` function from `statsmodels` to perform a hypothesis test that will determine whether the proportion of hourly workers in the City of Chicago is significantly different from 25% at the 95% confidence level.

In [91]:
# Your code here:
from statsmodels.stats.proportion import proportions_ztest


In [92]:
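significance = 0.05
# count = number of hourly workers, nobs = total number of employees
hourly_count = (salaries['Salary or Hourly'] == 'Hourly').sum()
total_count = len(salaries)
# H0: the proportion of hourly workers is 25%, H1: it is different from 25%
null_hypothesis = 0.25
stat, p_value = proportions_ztest(count=hourly_count, nobs=total_count, value=null_hypothesis, alternative='two-sided')
# Reject H0 if p_value < significance
stat, p_value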
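The z-test for a proportion can be sketched with the standard library alone, using the counts shown earlier by `value_counts` (8022 hourly and 25161 salaried employees). The standard error here is based on the sample proportion, which is an assumption about how `proportions_ztest` behaves with its default settings; treat the result as a cross-check rather than a replacement.

```python
import math
from statistics import NormalDist

# Counts taken from the value_counts output earlier in the notebook
hourly_count = 8022
total_count = 8022 + 25161

null_proportion = 0.25  # H0: 25% of employees are hourly workers
sample_proportion = hourly_count / total_count

# Standard error based on the sample proportion (assumed default of proportions_ztest)
standard_error = math.sqrt(sample_proportion * (1 - sample_proportion) / total_count)
z_statistic = (sample_proportion - null_proportion) / standard_error

# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z_statistic)))

print(round(sample_proportion, 4), round(z_statistic, 2), p_value < 0.05)
```

About 24.2% of the employees are hourly workers. With more than 33000 observations even this small gap from 25% gives a p-value below 0.05, so the two-sided test rejects H0.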
