The global ice volume is a statistic: a quantity computed from a sample of glaciers. The population is the set of all glaciers on Earth. We want to describe that population, but because our approach is empirical we can only analyze a sample of glaciers, that is, a subset of the total population.

We have several hundred thousand glaciers, so the sample statistics are likely to be close to the overall population parameters. For example, the sample mean will likely be close to the population mean.
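
As a rough illustration of this claim, the sketch below draws samples of increasing size from a synthetic, skewed population and reports how far each sample mean lies from the population mean. The lognormal population, the seed, and the helper name `relative_error_of_sample_mean` are assumptions made for illustration only; none of this is glacier data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed "population" of volumes (synthetic, not real glacier data).
population = [random.lognormvariate(0.0, 1.0) for _ in range(200_000)]
population_mean = statistics.fmean(population)


def relative_error_of_sample_mean(n):
    """Relative gap between the mean of a random sample of size n
    (drawn without replacement) and the population mean."""
    sample = random.sample(population, n)
    return abs(statistics.fmean(sample) - population_mean) / population_mean


for n in (10, 1_000, 100_000):
    print(f'n = {n:>7}: relative error = {relative_error_of_sample_mean(n):.4f}')
```

Small samples can land close to the population mean by chance, but only large samples do so reliably.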

We consider the test statistic ${X = \sum Y_i}$, where ${X}$ is the sum of the individual glacier volumes ${Y_{i}}$.

# Variance testing
What do we want to do? We want to determine how much the means and standard deviations differ between the studies. As I understand it, each study provides a sample of the true (unknown) parent population of global ice thicknesses. We can measure the error of each mean and the differences between samples, but we cannot compare a sample to the population. If we have truly sampled the parent population at random, then there should be no significant differences among the sample means that we have obtained.

This doesn't sound fully right.

In [1]:
# data
import pandas as pd
df = pd.DataFrame(
    {
        'Study'       :['Edasi', 'Farinotti', 'Millan'],
        'Mean'        :[101.48 , 158.17     , 140.8],
        'STD'         :[42.19  , 41.03      , 40.4],
        'Sample Size' :[3811   , 5          , 1]
    }
)
print(df)
mean_E = df['Mean'].loc[0]
mean_F = df['Mean'].loc[1]
mean_M = df['Mean'].loc[2]

std_E = df['STD'].loc[0]
std_F = df['STD'].loc[1]
std_M = df['STD'].loc[2]

# sample sizes come from the 'Sample Size' column, not 'STD'
samp_E = df['Sample Size'].loc[0]
samp_F = df['Sample Size'].loc[1]
samp_M = df['Sample Size'].loc[2]

       Study    Mean    STD  Sample Size
0      Edasi  101.48  42.19         3811
1  Farinotti  158.17  41.03            5
2     Millan  140.80  40.40            1


# Z Test
To compare two samples directly, we can use a modified Z test.

\begin{equation}
    Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma_{X_1}^2 + \sigma_{X_2}^2}}
\end{equation}

Where

- X̄<sub>1</sub> is the mean value of sample one
- X̄<sub>2</sub> is the mean value of sample two
- σ<sub>X<sub>1</sub></sub> is the standard error of sample one: its standard deviation divided by the square root of its number of data points
- σ<sub>X<sub>2</sub></sub> is the standard error of sample two: its standard deviation divided by the square root of its number of data points
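
The formula above can be wrapped in a small helper. This is a minimal sketch using only the standard library; the function name `two_sample_z` and the numbers in the usage line are made up for illustration and are not taken from any of the studies.

```python
import math


def two_sample_z(mean_1, std_1, n_1, mean_2, std_2, n_2):
    """Z score for the difference between two sample means.

    Each standard deviation is first converted to a standard error
    (std / sqrt(n)); the two standard errors are combined in quadrature.
    """
    se_1 = std_1 / math.sqrt(n_1)
    se_2 = std_2 / math.sqrt(n_2)
    return (mean_1 - mean_2) / math.sqrt(se_1**2 + se_2**2)


# Illustrative values only: two samples of 100 points each.
z = two_sample_z(10.0, 2.0, 100, 9.0, 2.0, 100)
print(f'Z = {z:.3f}')
```

Swapping the two samples flips the sign of Z, so only the magnitude matters for a two-sided comparison.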


In [2]:
# by hand
import numpy as np

# n for each study, read from the 'Sample Size' column
samp_E, samp_F, samp_M = df['Sample Size']

# compare Edasi and Millan
Z_EM = (
    (mean_E - mean_M) / np.sqrt(
        (std_E / np.sqrt(samp_E))**2 + (std_M / np.sqrt(samp_M))**2
    )
)
print(f'Z score for Edasi and Millan {Z_EM:.3f}')


# compare Edasi and Farinotti
Z_EF = (
    (mean_E - mean_F) / np.sqrt(
        (std_E / np.sqrt(samp_E))**2 + (std_F / np.sqrt(samp_F))**2
    )
)
print(f'Z score for Edasi and Farinotti {Z_EF:.3f}')


# compare Farinotti and Millan
Z_FM = (
    (mean_F - mean_M) / np.sqrt(
        (std_F / np.sqrt(samp_F))**2 + (std_M / np.sqrt(samp_M))**2
    )
)
print(f'Z score for Farinotti and Millan {Z_FM:.3f}')

Z score for Edasi and Millan -0.973
Z score for Edasi and Farinotti -3.087
Z score for Farinotti and Millan 0.391


# Load the full reference data

In [3]:
import glacierml as gl
pd.set_option('display.max_columns', None)

df, ref = gl.notebook_data_loader()

Global Volume: 101.48, UB: 39.16, LB: 42.34, STD: 42.19


# Z test with built-in functions

In [4]:
from statsmodels.stats.weightstats import ztest
# https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html
vol_f = ref['Farinotti Volume']
print(vol_f.mean())
vol_e = ref['Edasi Volume']
print(vol_e.mean())
print(vol_f.mean() - vol_e.mean())
# perform two-sample z-test; `value` is the difference in means assumed under the null hypothesis
ztest(vol_f, vol_e, value = 0.305)

0.7342542707588735
0.4294892992798123
0.30476497147906123


(-0.004399919924887662, 0.996489383150345)
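
The second number in the tuple above is the two-sided p-value that goes with the Z score. As a sketch of how one follows from the other, assuming the statistic is standard normal under the null hypothesis, the conversion can be written with the standard library alone; the helper name `two_sided_p_value` is hypothetical.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a Z score under a standard normal null.

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2.0))


# Z score reported by ztest above
print(two_sided_p_value(-0.004399919924887662))
```

A Z score this close to zero is expected here, because the hypothesized difference of 0.305 was chosen to be almost exactly the observed difference in means.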

# Now for some sampling!
Let's look at residuals via glacier size and other attributes.

In [5]:
def z_test(
    data = ref,
    variable = 'Area',
    lower_range = 1,
    upper_range = 10,
    range_scale = 1
):
    # Step through bins of width range_scale; i is the upper edge of each bin.
    for i in np.arange(lower_range, upper_range, range_scale):
        # Glaciers whose `variable` lies strictly inside the bin
        # (use the `data` argument rather than the global `ref`)
        df_temp = data[
            (data[variable] > (i - range_scale)) &
            (data[variable] < i)
        ]
        n = len(df_temp)
        vol_std_e = df_temp['Edasi Volume'].std()
        vol_std_f = df_temp['Farinotti Volume'].std()
        mean_vol_e = df_temp['Edasi Volume'].mean()
        mean_vol_f = df_temp['Farinotti Volume'].mean()

        # Two-sample Z score for the difference in mean volume within the bin
        z = (mean_vol_e - mean_vol_f) / np.sqrt(
            (vol_std_e / np.sqrt(n))**2 + (vol_std_f / np.sqrt(n))**2
        )
        print(f'Z test sampling glaciers between {i - range_scale} and {i}: Z = {z}')

    print('')

z_test(
    data = ref,
    variable = 'Area',
    lower_range = 1,
    upper_range = 100,
    range_scale = 1
)


Z test sampling glaciers between 0 and 1: Z = 54.67073295146456
Z test sampling glaciers between 1 and 2: Z = -3.1280608502368743
Z test sampling glaciers between 2 and 3: Z = -13.399337068332208
Z test sampling glaciers between 3 and 4: Z = -13.624747809816519
Z test sampling glaciers between 4 and 5: Z = -12.796984514285692
Z test sampling glaciers between 5 and 6: Z = -12.210044405029935
Z test sampling glaciers between 6 and 7: Z = -10.475664250924797
Z test sampling glaciers between 7 and 8: Z = -9.795232212420672
Z test sampling glaciers between 8 and 9: Z = -8.385564685500755
Z test sampling glaciers between 9 and 10: Z = -8.042903112319195
Z test sampling glaciers between 10 and 11: Z = -7.460218936356272
Z test sampling glaciers between 11 and 12: Z = -7.582969709699431
Z test sampling glaciers between 12 and 13: Z = -9.278927957618947
Z test sampling glaciers between 13 and 14: Z = -7.7559807530116815
Z test sampling glaciers between 14 and 15: Z = -8.417700984073711
Z test s