In [1]:
import numpy as np 
import pandas as pd
from scipy import stats

**Exercise 1:** The hourly wages in a particular industry are normally distributed with mean 13.20 and standard deviation 2.50 (dollars per hour).
A company in this industry employs 40 workers, paying them an average of 12.20 per hour. Can this company be accused of paying substandard wages? Use an α = .01 level test. (Wackerly, Ex.10.18)
- H0 : µ = 13.2
- H1 : µ < 13.2

In [2]:
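xbar = 12.2   # sample mean wage of the 40 workers
mu = 13.2     # industry mean under H0
sigma = 2.5   # known population standard deviation
n = 40
# Standardized shortfall of the sample mean below mu (the negative of the usual z = (xbar - mu) / SE)
z = (mu - xbar) / (sigma / np.sqrt(n))
z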

2.5298221281347035

In [3]:
# Upper-tail area beyond z; by symmetry this equals the lower-tail p-value for H1 : mu < 13.2
p = 1 - stats.norm.cdf(z)
p

0.005706018193000872

H0 is rejected (p ≈ 0.0057 < α = 0.01). At the 1% significance level, the data support the claim that this company pays substandard wages.
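
The same decision can be reached by comparing the test statistic with a critical value instead of using the p-value. The sketch below assumes the conventional sign, z = (x̄ − µ0) / (σ / √n), so the rejection region for H1 : µ < 13.2 is the lower tail. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Values taken from the exercise statement.
xbar, mu0, sigma, n, alpha = 12.2, 13.2, 2.5, 40, 0.01

# Conventional z statistic: negative when the sample mean is below mu0.
z = (xbar - mu0) / (sigma / np.sqrt(n))

# Lower-tail critical value for a left-tailed test at level alpha.
z_crit = stats.norm.ppf(alpha)

# Lower-tail p-value.
p_value = stats.norm.cdf(z)

# Reject H0 when z falls below the critical value.
reject_h0 = z < z_crit
print(f"z = {z:.4f}, critical value = {z_crit:.4f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

Both rules are equivalent: z falls below the critical value exactly when the p-value is smaller than α.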

**Exercise 2:** Shear strength measurements derived from unconfined compression tests for two types of soils gave the results shown in the following document (measurements in tons per square foot). Do the soils appear to differ with respect to average shear strength, at the 1% significance level?

In [4]:
df = pd.read_excel("soil.xlsx")
df.head()

Unnamed: 0,soil1,soil2
0,1.442,1.364
1,1.943,1.878
2,1.11,1.337
3,1.912,1.828
4,1.553,1.371


- H0 : µ_soil1 is equal to µ_soil2
- H1 : µ_soil1 is not equal to µ_soil2

In [5]:
df.soil1.mean()

1.6918

In [6]:
df.soil2.mean()

1.4171142857142855

In [7]:
df.soil1.std() ** 2

0.04280878620689653

In [8]:
df.soil2.std() ** 2

0.048041751260504195

In [9]:
soil_test = stats.ttest_ind(df.soil1.dropna(), df.soil2.dropna(), equal_var=True, alternative="two-sided")
soil_test

Ttest_indResult(statistic=5.1681473319343345, pvalue=2.593228732352821e-06)

H0 is rejected (p-value ≈ 2.6e-06 < α = 0.01). The two soil types differ in average shear strength.
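
For reference, with `equal_var=True` the statistic reported by `stats.ttest_ind` is the pooled two-sample t statistic. In the notation below (not used elsewhere in this notebook), n₁ and n₂ are the numbers of non-missing measurements for each soil, and s₁², s₂² are the sample variances computed above.

```latex
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
\text{df} = n_1 + n_2 - 2
```

The two sample variances here (about 0.043 and 0.048) are close, which is consistent with using the pooled form.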

**Exercise 3:** The following dataset (2015 PISA Test Dataset) is based on data provided by World Bank EdStats (https://datacatalog.worldbank.org/dataset/education-statistics).

1. Get descriptive statistics (the central tendency, dispersion and shape of a dataset’s distribution) for each continent group (AS, EU, AF, NA, SA, OC).
2. Determine whether there is any difference (on the average) for the math scores among European (EU) and Asian (AS) countries (assume normality and equal variances). Draw side-by-side box plots.

In [10]:
df = pd.read_excel("pisa_test_2015.xlsx")
df.head()

Unnamed: 0,country_code,continent_code,internet_users_per_100,math,reading,science
0,ALB,EU,63.252933,413.157,405.2588,427.225
1,ARE,AS,90.5,427.4827,433.5423,436.7311
2,ARG,SA,68.043064,409.0333,425.3031,432.2262
3,AUS,OC,84.560519,493.8962,502.9006,509.9939
4,AUT,EU,83.940142,496.7423,484.8656,495.0375


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   country_code            70 non-null     object 
 1   continent_code          65 non-null     object 
 2   internet_users_per_100  70 non-null     float64
 3   math                    70 non-null     float64
 4   reading                 70 non-null     float64
 5   science                 70 non-null     float64
dtypes: float64(4), object(2)
memory usage: 3.4+ KB
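
`continent_code` has only 65 non-null values out of 70. A likely cause is that the code "NA" (North America) was read as a missing value, since pandas treats the string "NA" as missing by default. The sketch below reproduces that behavior on two illustrative rows that are not read from the actual file. `pd.read_excel` accepts the same `keep_default_na` parameter, but whether it fully explains the five missing values in `pisa_test_2015.xlsx` is an assumption.

```python
import io
import pandas as pd

# Two illustrative rows; "NA" is meant as the continent code for North America.
csv_text = "country_code,continent_code\nUSA,NA\nFRA,EU\n"

# Default parsing: the string "NA" is converted to a missing value.
default_parse = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps "NA" as an ordinary string.
literal_parse = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_parse["continent_code"].isna().sum())  # 1
print(literal_parse["continent_code"].tolist())      # ['NA', 'EU']
```

With `keep_default_na=False`, genuinely empty cells are no longer converted to NaN either, so missing numeric values would need separate handling.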


In [12]:
df.describe(include = "all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
country_code,70.0,70.0,ALB,1.0,,,,,,,
continent_code,65.0,5.0,EU,37.0,,,,,,,
internet_users_per_100,70.0,,,,71.973099,16.390632,21.976068,60.89902,72.99935,85.026763,98.2
math,70.0,,,,460.971557,53.327205,327.702,417.416075,477.60715,500.482925,564.1897
reading,70.0,,,,460.997291,49.502679,346.549,426.948625,480.19985,499.687475,535.1002
science,70.0,,,,465.439093,48.397254,331.6388,425.923375,475.40005,502.43125,555.5747
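
The table above summarizes all countries together, whereas part 1 of the exercise asks for statistics per continent group. A minimal sketch of a grouped summary follows. To stay self-contained it uses only the five rows printed by `df.head()`; on the full data the same calls would be applied to `df`.

```python
import pandas as pd

# First five rows of the PISA data, as printed by df.head() above.
sample = pd.DataFrame({
    "country_code": ["ALB", "ARE", "ARG", "AUS", "AUT"],
    "continent_code": ["EU", "AS", "SA", "OC", "EU"],
    "math": [413.157, 427.4827, 409.0333, 493.8962, 496.7423],
    "reading": [405.2588, 433.5423, 425.3031, 502.9006, 484.8656],
    "science": [427.225, 436.7311, 432.2262, 509.9939, 495.0375],
})

scores = ["math", "reading", "science"]
grouped = sample.groupby("continent_code")[scores]

# Central tendency and dispersion per continent (count, mean, std, quartiles).
by_continent = grouped.describe()

# Shape of the distribution per continent; groups with fewer than
# three rows do not have a defined sample skewness.
skewness = grouped.skew()

print(by_continent["math"])
```

Rows with a missing `continent_code` are dropped by `groupby` by default, so any country whose continent was read as NaN will not appear in the grouped output.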


In [13]:
df_eu = df[df.continent_code == "EU"]["math"]
df_eu

0     413.1570
4     496.7423
5     506.9844
6     441.1899
9     521.2506
14    437.1443
15    492.3254
16    505.9713
17    511.0876
20    485.8432
21    519.5291
22    511.0769
23    492.9204
24    492.4785
26    453.6299
28    464.0401
29    476.8309
31    503.7220
32    488.0332
34    489.7287
40    478.3834
41    485.7706
42    482.3051
44    419.6635
46    371.3114
47    478.6448
48    417.9341
50    512.2528
51    501.7298
54    504.4693
55    491.6270
57    443.9543
58    494.0600
60    475.2301
61    509.9196
62    493.9181
66    420.4540
Name: math, dtype: float64

In [14]:
df_as = df[df.continent_code == "AS"]["math"]
df_as

1     427.4827
11    531.2961
25    403.8332
27    547.9310
30    386.1096
33    469.6695
35    380.2590
36    532.4399
37    459.8160
38    524.1062
39    396.2497
43    543.8078
49    446.1098
56    402.4007
59    564.1897
63    415.4638
69    494.5183
Name: math, dtype: float64

- H0 : µ_EU is equal to µ_AS (the population mean math scores are equal)
- H1 : µ_EU is not equal to µ_AS

In [15]:
df_eu.mean()

477.98144864864867

In [16]:
df_as.mean()

466.2166470588236

In [17]:
df_eu.std() ** 2

1235.550804859233

In [18]:
df_as.std() ** 2

4141.7578222101465

In [19]:
indTest = stats.ttest_ind(df_eu, df_as, equal_var=True, alternative="two-sided")
indTest

Ttest_indResult(statistic=0.8700553179679787, pvalue=0.38826888111307556)

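H0 cannot be rejected: the p-value (≈ 0.388) is larger than any conventional α (e.g. 0.05), so the data do not show a difference in average math scores between EU and AS countries.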
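
The exercise asks us to assume equal variances, yet the two sample variances computed above differ by a factor of more than three. As a robustness check, and not as part of the requested solution, the sketch below recomputes the test from the summary statistics printed above, once with the pooled variance and once with Welch's unequal-variance version. The group sizes 37 and 17 are the lengths of `df_eu` and `df_as`.

```python
import numpy as np
from scipy import stats

# Sample sizes, means and variances as reported in the cells above.
n_eu, mean_eu, var_eu = 37, 477.98144864864867, 1235.550804859233
n_as, mean_as, var_as = 17, 466.2166470588236, 4141.7578222101465

# Pooled-variance t-test (what equal_var=True does in stats.ttest_ind).
pooled = stats.ttest_ind_from_stats(
    mean_eu, np.sqrt(var_eu), n_eu,
    mean_as, np.sqrt(var_as), n_as,
    equal_var=True,
)

# Welch's t-test, which does not assume equal variances.
welch = stats.ttest_ind_from_stats(
    mean_eu, np.sqrt(var_eu), n_eu,
    mean_as, np.sqrt(var_as), n_as,
    equal_var=False,
)

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

The pooled result reproduces the statistic and p-value reported above. Welch's version also yields a p-value above 0.05, so the decision does not depend on the equal-variance assumption.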

**Exercise 4:** A gym operator organized a 2-month exercise and diet program for 15 customers who wanted to lose excess weight. To evaluate whether the program was effective, he recorded each customer's starting and ending weight (Weight Dataset). Did the exercise and diet program lead to weight loss? Use an α = .01 level test.

In [20]:
df = pd.read_excel("weight.xlsx")
df.head()

Unnamed: 0,ID,starting,ending
0,1.0,76.0,72.0
1,2.0,81.0,82.0
2,3.0,86.0,84.0
3,4.0,71.0,71.0
4,5.0,88.0,83.0


- H0 : µ_d = 0, where d = ending - starting for each customer
- H1 : µ_d < 0 (customers weigh less after the program)


In [21]:
# Weight lost per customer: positive values mean the customer ended lighter
df["diff"] = df.starting - df.ending
df["diff"]

0     4.0
1    -1.0
2     2.0
3     0.0
4     5.0
5     4.0
6     6.0
7     1.0
8     1.0
9    -2.0
10    3.0
11    1.0
12    2.0
13   -2.0
14    1.0
Name: diff, dtype: float64

In [22]:
depTest = stats.ttest_rel(df["ending"], df["starting"], alternative="less")
depTest

Ttest_relResult(statistic=-2.6780834840499255, pvalue=0.00900646517506626)

H0 is rejected (p-value ≈ 0.009 < α = 0.01): at the 1% significance level, the exercise and diet program reduced the customers' weight on average.
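
A paired t-test is equivalent to a one-sample t-test on the per-customer differences. The sketch below recomputes the statistic by hand from the `diff` values printed above (starting minus ending, so the sign is flipped and the upper tail is used). The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Weight lost per customer (starting - ending), copied from the diff column above.
lost = np.array([4, -1, 2, 0, 5, 4, 6, 1, 1, -2, 3, 1, 2, -2, 1], dtype=float)
n = lost.size

# One-sample t statistic for H0: mean loss = 0.
t_stat = lost.mean() / (lost.std(ddof=1) / np.sqrt(n))

# Upper-tail p-value for H1: mean loss > 0, with n - 1 degrees of freedom.
p_value = stats.t.sf(t_stat, df=n - 1)

print(f"t = {t_stat:.4f}, p = {p_value:.5f}")
```

The statistic has the same magnitude as the `ttest_rel` result with the opposite sign, and the p-value is the same, because testing mean(starting - ending) > 0 is the same hypothesis as mean(ending - starting) < 0.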