In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats as spst

# Introduction to non-parametric testing
Most traditional, introductory statistical methods (e.g., the t-test and Pearson correlation) are "parametric": they assume that the data come from a distribution described by a small number of parameters. For instance, many such tests assume that your observations (aka samples, data points, etc.) are drawn independently from a normal distribution, which is fully defined by a formula with two parameters: the mean and the standard deviation.
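To make the "two parameters" point concrete, here is a minimal sketch that writes out the normal density by hand and compares it with `scipy.stats.norm.pdf`. The helper name `normal_pdf` and the example values (mean 170, SD 8) are illustrative choices, not part of the analysis below.

```python
import numpy as np
from scipy import stats as spst

def normal_pdf(x, mean, sd):
    """Density of a normal distribution; mean and sd are its only two parameters."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Evaluate the density at a few heights (in cm) for an illustrative mean/sd
x = np.linspace(130, 200, 8)
manual = normal_pdf(x, mean=170, sd=8)
reference = spst.norm.pdf(x, loc=170, scale=8)

# The hand-written formula should agree with scipy's implementation
print(np.allclose(manual, reference))
```

Changing `mean` shifts the curve left or right, and changing `sd` makes it wider or narrower; nothing else about the shape can vary.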

A concrete example of data commonly believed to follow a normal distribution is people's height. Say we want to ask whether there is a "real" difference between girls' and boys' heights in a class of 20 girls and 25 boys. We can calculate the mean and standard deviation for each group. Say the observed mean height is 155cm for girls and 170cm for boys, and both groups have SD=8cm. In the following cell I simply simulate the two sets of values using NumPy's normal().

In [None]:
np.random.seed(0) #reset the seed so we can have the same values
height_girls = np.random.normal(155, 8, 20) # draw 20 samples from a normal distribution with mean=155 and sd=8
height_boys = np.random.normal(170, 8, 25)
plt.hist(height_boys,40,color='k',label='M')
plt.hist(height_girls,40,color='r',label='F',alpha=.8)
plt.legend()
plt.show()

* Just looking at the histograms, they look similar enough to a normal distribution, so we carry out an independent-samples t-test:

In [None]:
spst.ttest_ind(height_boys, height_girls)

* The resulting p-value is 0.0005, which tells us that, if there were NO real difference between the girls' and boys' heights, the probability of observing a difference at least as large as ours would be 0.0005, i.e., 0.05%, which is a very small probability. When this happens, we conclude that it is "very unlikely" that there is no real difference, meaning the girls' and boys' heights are indeed different.

* Put more formally, the null hypothesis (H0) here is that there is NO real difference between the girls' and boys' mean heights, i.e., mean(height_boys) == mean(height_girls). The alternative hypothesis (H1) is mean(height_boys) != mean(height_girls). Because the p-value, i.e., the probability of observing data at least this extreme if H0 is true, is "very small", we reject H0 in favour of H1.
    * Note that this is called a "two-tailed" or "two-sided" test because H1 includes both directions, i.e., height_boys > height_girls and height_boys < height_girls. If you are only interested in one direction, it is called a one-tailed test.
    * There is no objective answer to how small is "very small"; it reflects how conservative or how ready you are to reject H0. By convention, p<0.05 is usually considered small enough, and if you want to be more conservative about rejecting H0, you can lower the threshold to 0.01 or 0.001, and so on.
    * If we reject H0, the effect is said to be "statistically significant".
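As a rough illustration of where the t statistic and the two-tailed p-value come from, the sketch below recomputes them by hand and compares them with `spst.ttest_ind`. It uses freshly simulated samples (the names `a` and `b` and the generator seed are assumptions for this example), and it assumes equal variances in the two groups, which is the default of `ttest_ind`.

```python
import numpy as np
from scipy import stats as spst

# Simulated samples, separate from the notebook's height_boys / height_girls
rng = np.random.default_rng(0)
a = rng.normal(170, 8, 25)
b = rng.normal(155, 8, 20)

n_a, n_b = len(a), len(b)
dof = n_a + n_b - 2

# Pooled variance: a weighted average of the two sample variances
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof

# t = difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-tailed p-value: probability of |t| at least this large under H0
p_manual = 2 * spst.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = spst.ttest_ind(a, b)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

The factor of 2 in `p_manual` is what makes the test two-tailed; dropping it would give the one-tailed p-value for the direction `a > b`.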
    
* Nonetheless, let's go back to our assumption that people's heights follow a normal distribution. That is probably true for the population as a whole, but if we just look at the 20 girls and 25 boys, the histograms don't actually look that normal, especially for the boys I'd say. This is not unusual when the sample size is small. So we may have violated an assumption of the t-test, which puts the validity of the result in question.

* **The same situation arises when the data we are looking at don't follow a normal distribution, like the Entrepreneurship Index here (see cell below), in which case it would be inappropriate to use a t-test.**

In [None]:
data = pd.read_csv('../input/women-entrepreneurship-and-labor-force/Dataset3.csv',sep=';')
data['Entrepreneurship Index'].hist()
data['Women Entrepreneurship Index'].hist(alpha=.6)
plt.show()
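Histograms can be complemented by a formal normality check. The sketch below applies the Shapiro-Wilk test (`spst.shapiro`) to two simulated samples, one normal and one skewed; the sample names and sizes are assumptions for illustration, and the same call could be applied to a column of `data`. A small p-value suggests the sample is unlikely to come from a normal distribution, although with small samples the test has limited power.

```python
import numpy as np
from scipy import stats as spst

# Simulated samples for illustration only
rng = np.random.default_rng(0)
samples = {
    'normal': rng.normal(170, 8, 25),
    'skewed': rng.exponential(scale=8, size=25),
}

# H0 of the Shapiro-Wilk test: the sample comes from a normal distribution
results = {name: spst.shapiro(sample) for name, sample in samples.items()}
for name, (w, p) in results.items():
    print('%s sample: W=%.3f, p=%.4f' % (name, w, p))
```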

# Q1: Do the Women Entrepreneurship Index and Global Entrepreneurship Index values show a statistically significant difference between the countries that are members of the European Union and not? (Method Mann-Whitney U)

* The non-parametric alternative to an independent-samples t-test is the *Mann-Whitney U test*.
* In the following cells, we can see that the Mann-Whitney test also gives a statistic (the U value) and a p-value. The interpretation of the p-value is **exactly the same** as above, i.e., if there is NO difference between the two sets of values, what is the probability of observing a difference at least this large? If the p-value is small enough, we reject the null hypothesis and conclude that there is a true difference between the two sets.
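To show why this test does not depend on normality, here is a small sketch that derives the U statistic from ranks alone, using two made-up samples `x` and `y`. Which of the two U values `mannwhitneyu` reports has differed between SciPy versions, so the comparison at the end is written to hold either way.

```python
import numpy as np
from scipy import stats as spst

# Made-up samples for illustration
x = np.array([12.0, 15.0, 9.0, 20.0, 17.0])
y = np.array([8.0, 11.0, 7.0, 10.0])
n_x, n_y = len(x), len(y)

# Rank all values together, then sum the ranks that belong to x
ranks = spst.rankdata(np.concatenate([x, y]))
rank_sum_x = ranks[:n_x].sum()

# U for x counts the (x, y) pairs in which the x value is larger
u_x = rank_sum_x - n_x * (n_x + 1) / 2
u_y = n_x * n_y - u_x
print(u_x, u_y)  # 18.0 2.0

u_scipy, p = spst.mannwhitneyu(x, y, alternative='two-sided')
# The reported statistic is one of the two U values, depending on the SciPy version
print(u_scipy in (u_x, u_y), p)
```

Only the ordering of the values enters the calculation, so an outlier or a skewed distribution affects the result far less than it would affect a t-test.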

In [None]:
lbl_eu = data['European Union Membership']=='Member'
targetColumn = 'Entrepreneurship Index'
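# select the target column before averaging so non-numeric columns are not passed to mean()
mean_values = data.groupby('European Union Membership')[targetColumn].mean()
print(mean_values)
# fix the box order so that it matches the order of mean_values
sns.boxplot(y=targetColumn, x='European Union Membership',palette='Blues',data=data,order=list(mean_values.index))
plt.plot(range(2),mean_values,'kd',markersize=12)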
plt.show()
u, p = spst.mannwhitneyu(data[lbl_eu][targetColumn],data[~lbl_eu][targetColumn])
print('Mann-Whitney u=%f, p=%f' % (u,p))

In [None]:
targetColumn = 'Women Entrepreneurship Index'
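# select the target column before averaging so non-numeric columns are not passed to mean()
mean_values = data.groupby('European Union Membership')[targetColumn].mean()
print(mean_values)
# fix the box order so that it matches the order of mean_values
sns.boxplot(y=targetColumn, x='European Union Membership',palette='Purples',data=data,order=list(mean_values.index))
plt.plot(range(2),mean_values,'kd',markersize=12)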
plt.show()
u, p = spst.mannwhitneyu(data[lbl_eu][targetColumn],data[~lbl_eu][targetColumn])
print('Mann-Whitney u=%f, p=%f' % (u,p))

In [None]:
# we can also visualize the four sets of values together:
data_long = data.melt(id_vars='European Union Membership',value_vars=['Entrepreneurship Index','Women Entrepreneurship Index'],\
                      var_name='Entprn Index Category',value_name='Entpnr Index')
data_long

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x='Entprn Index Category', y='Entpnr Index', hue='European Union Membership',palette='Blues',data=data_long)
plt.plot([-0.2,0.2, 0.8,1.2],data_long.groupby(['Entprn Index Category','European Union Membership']).mean(),'kd',markersize=12)
plt.show()

* Another question we can ask is "Does the discrepancy between global and women EI differ between EU and non-EU countries?"

In [None]:
diff_EU = data[lbl_eu]['Entrepreneurship Index'] - data[lbl_eu]['Women Entrepreneurship Index']
diff_nonEU = data[~lbl_eu]['Entrepreneurship Index'] - data[~lbl_eu]['Women Entrepreneurship Index']
spst.mannwhitneyu(diff_EU, diff_nonEU)

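* The p-value is 0.22, which is not small enough by convention. Thus we cannot reject H0: the data give no evidence that the discrepancy between global and women EI differs between EU and non-EU countries. Note that this is not the same as showing that the discrepancy is identical in the two groups.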

# Q2: Is there a statistically significant relationship between Women's Entrepreneurship Index and Global Entrepreneurship Index values? (Method Spearman Correlation Coefficient)

* When we want to ask whether two variables co-vary (i.e., when variable 1 increases, variable 2 also increases or decreases), the most common measure is Pearson's correlation (r). For instance, we can calculate whether there is a correlation between the amount of milk consumption and children's heights.
* When Pearson's r=1, the two variables have a perfect linear relationship: the same increase in milk consumption always corresponds to the same increase in height.
* However, when the variables are ordinal or dichotomous, or when we cannot assume the relationship is linear, the alternative is to use Spearman's rank-order correlation (rho). In this situation we don't require a perfect linear relationship and don't treat the actual values as that important. For instance, we may not consider the difference between 140cm and 150cm to be equivalent to the difference between 170cm and 180cm, and many kids have heights between 160cm and 170cm that differ by only a few centimeters, yet we still want those small differences to count.
* In this situation, we can instead rank all the kids' heights as well as their milk consumption from low to high, and forget about the actual values. Spearman's rho measures the relationship between the two rank orders. If rho=1, kids who consume more milk are always taller, regardless of the actual amount of milk or the height in cm.
    * In other words, Spearman's rho measures a monotonic relationship that does not have to be linear.
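The difference between the two measures can be sketched with a made-up monotonic but nonlinear relationship (the variables `x` and `y` below are illustrative, not taken from the dataset). Spearman's rho is simply Pearson's r computed on the ranks, so it reaches 1 here while Pearson's r stays below 1.

```python
import numpy as np
from scipy import stats as spst

# y always increases with x, but not by a constant amount
x = np.arange(1, 11, dtype=float)
y = np.exp(x / 2)

r, _ = spst.pearsonr(x, y)      # sensitive to the curvature
rho, _ = spst.spearmanr(x, y)   # only the ordering matters

# Spearman's rho is Pearson's r applied to the ranks of the data
r_on_ranks, _ = spst.pearsonr(spst.rankdata(x), spst.rankdata(y))

print("Pearson's r = %.4f, Spearman's rho = %.4f, r on ranks = %.4f" % (r, rho, r_on_ranks))
```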

* In the cell below, we can see that in fact Pearson's r and Spearman's rho are very similar. There is a very high correlation between Entrepreneurship Index and Women Entrepreneurship Index across the countries measured by both correlation methods.

In [None]:
gEI = data['Entrepreneurship Index']
wEI = data['Women Entrepreneurship Index']
sns.scatterplot(x=gEI, y=wEI, hue=data['European Union Membership'])  # recent seaborn versions require x and y as keyword arguments
plt.show()
r = np.corrcoef(gEI, wEI)[0,1]
rho, p = spst.spearmanr(gEI, wEI)
print("Pearson's r = %.4f, Spearman's rho = %.4f, p = %.8f" % (r, rho, p))

## Permutation test.
* However, I think a more interesting question is whether EU and non-EU countries differ in the relationship between global EI and Women EI.
* As shown below, Spearman's rho for non-EU countries is lower than for EU countries by 0.0868. Thus the question is, is there a real difference between EU and non-EU countries?
* In other words, do non-EU countries **really** have a weaker correlation between global EI and Women EI?

In [None]:
rho_eu, p_eu = spst.spearmanr(gEI[lbl_eu], wEI[lbl_eu])
print("EU: Spearman's rho=%.4f, p=%.8f" % (rho_eu, p_eu))
rho_non_eu, p_non_eu = spst.spearmanr(gEI[~lbl_eu], wEI[~lbl_eu])
print("non-EU: Spearman's rho=%.4f, p=%.8f" % (rho_non_eu, p_non_eu))
# observed difference in correlation (EU minus non-EU)
real_difference = rho_eu - rho_non_eu
print(real_difference)

* To test this, I use a permutation test, shuffling the "European Union Membership" label 1000 times.
* The key to a permutation test is to build an **empirical null distribution** from the data at hand.
    * The empirical null distribution shows what the statistic of interest would look like if there were no real effect in the data.
    * A permutation test is often preferred because the empirical null distribution reflects the variability actually present in the data, rather than relying on a theoretical distribution.
    * In the current example, to create this distribution we simply shuffle the "European Union Membership" label, eliminating the EU vs. non-EU distinction in the dataset.
* The resulting p-value is 0.04. It is calculated as the proportion of the 1000 permuted differences that are greater than the real difference of 0.0868. Because only one direction is counted, this is a one-tailed test.
* The interpretation is exactly the same as in the other tests above, i.e., if there is NO REAL difference between the correlation values of EU and non-EU countries, what is the probability that we observe a difference of at least 0.0868?
* The permutation test tells us the probability is 4%, a relatively small number that falls below the conventional 0.05 threshold. Thus we conclude that the correlation between global and Women EI in EU countries is **significantly higher** than in non-EU countries.

In [None]:
n_it = 1000
null_distribution = np.zeros(n_it)
np.random.seed(0)
for it in range(n_it):
    # shuffle the membership labels to break any link between membership and the indices
    random_label = np.random.permutation(data['European Union Membership'])
    lbl_eu_random = random_label=='Member'
    rho1 = spst.spearmanr(gEI[lbl_eu_random], wEI[lbl_eu_random])[0]
    rho2 = spst.spearmanr(gEI[~lbl_eu_random], wEI[~lbl_eu_random])[0]
    null_distribution[it] = rho1 - rho2
plt.hist(null_distribution,40)
plt.axvline(x=real_difference,color='r')
plt.show()
# one-tailed p-value: proportion of permuted differences larger than the observed one
p_perm = np.mean(null_distribution > real_difference)
print('mean of the null distribution=%.4f+/-%.4f\nReal difference=%f, non-param p=%f' %
      (np.mean(null_distribution),np.std(null_distribution), real_difference, p_perm))

* The empirical null distribution is skewed to the left. Therefore, even though the real difference doesn't seem very large by itself, the chance of observing a difference of 0.0868 or more in the current data is still quite small.
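The p-value above only counts permuted differences in one direction. As a hedged sketch of how a one-tailed and a two-tailed permutation p-value differ, the hypothetical helper `permutation_p` below works on any array of null values; the toy numbers are made up so the counts can be checked by hand. Note that it counts values greater than or equal to the observed one, a common convention that differs slightly from the strict `>` used in the cell above.

```python
import numpy as np

def permutation_p(null_values, observed, two_tailed=False):
    """Proportion of null values at least as extreme as the observed statistic."""
    null_values = np.asarray(null_values, dtype=float)
    if two_tailed:
        # extreme in either direction
        return np.mean(np.abs(null_values) >= abs(observed))
    # extreme in the positive direction only
    return np.mean(null_values >= observed)

# Toy null distribution, small enough to count by hand
toy_null = np.array([-0.3, -0.1, 0.0, 0.1, 0.2])
p_one = permutation_p(toy_null, 0.15)                    # 1 of 5 values -> 0.2
p_two = permutation_p(toy_null, 0.15, two_tailed=True)   # 2 of 5 values -> 0.4
print(p_one, p_two)
```

A two-tailed p-value is never smaller than the one-tailed p-value in the same direction, so a result that is borderline in a one-tailed test may not pass the same threshold in a two-tailed one.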

# Q3: Are European Union membership variable and development variable independent from each other? (Method: Chi-Square Test)

* Unlike the above examples, in which variables take on continuous values (e.g., height, amount of milk consumption), some variables are nominal or categorical, e.g., in this dataset, whether or not a country is in the EU and whether or not it is Developed.
* The chi-squared test for independence tests whether there is a statistical association between two categorical variables, i.e., whether the observed counts/frequencies differ significantly from those expected under independence.
    * For instance, say we ask 10 boys and 10 girls whether their favourite subject is English or Math (assuming there are only these 2 options). If there is no association between gender and favourite subject, and the two subjects are equally popular overall, we should expect about 5 boys to like English and 5 to like Math, and the same for girls. Conversely, an extreme scenario of association is that all girls like English and all boys like Math. When we observe something in between, the chi-squared test for independence tells us whether there is likely a real association between the two variables.
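The boys/girls example can be worked through in a few lines. The counts in `observed` below are hypothetical, chosen so that the expected count in every cell is 5. The sketch computes the expected frequencies and the chi-square statistic by hand and compares them with `spst.chi2_contingency`, passing `correction=False` because SciPy otherwise applies Yates' continuity correction to 2x2 tables.

```python
import numpy as np
from scipy import stats as spst

# Hypothetical counts: rows = boys, girls; columns = English, Math
observed = np.array([[3, 7],
                     [7, 3]])

# Expected count per cell under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic: squared deviations from expectation, scaled by expectation
chi2_manual = ((observed - expected) ** 2 / expected).sum()

chi2_scipy, p, dof, expected_scipy = spst.chi2_contingency(observed, correction=False)
print(expected)
print(np.isclose(chi2_manual, chi2_scipy), dof, p)
```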

* In this case the two categorical variables are whether or not a country is in the EU, and whether it is Developed or not.
* The first step of the chi-squared test is to create a contingency table (with pd.crosstab()).
* Using the contingency table, chi2_contingency() in scipy conducts the test and returns the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies (see next cell).

In [None]:
contingency_table = pd.crosstab(data['European Union Membership'], data['Level of development'])
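# rows of the table are EU membership (index) and columns are level of development, so label the axes accordingly
sns.heatmap(contingency_table.values,cmap='Blues',annot=True,xticklabels=contingency_table.columns, yticklabels=contingency_table.index,annot_kws={'fontsize':16})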
plt.show()
chi2, p, dof, expected_freq = spst.chi2_contingency(contingency_table)
print('Chi2=%.4f, p=%.8f' % (chi2,p))

In [None]:
print('Expected frequency IF %s & %s are independent' % (contingency_table.index.name,contingency_table.columns.name))
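# expected_freq has the same layout as the contingency table: rows = EU membership, columns = level of development
sns.heatmap(expected_freq,cmap='Blues',annot=True,xticklabels=contingency_table.columns, yticklabels=contingency_table.index,annot_kws={'fontsize':16})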
plt.show()

### Result: 
* The margins of the contingency table give the group sizes: the row sums are the numbers of EU members and non-members, and the column sums are the numbers of developed and developing countries (27 and 24 along one dimension, 20 and 31 along the other). The total number of countries is 51.
* Given those totals, we can calculate the expected frequency of each of the 4 cells.
    * That is, if H0 (there is no association between EU membership and level of development) is true, we should expect to observe those frequencies.
* Because the observed frequencies (i.e., the contingency table) differ significantly from the expected frequencies, we reject H0 and conclude that there is an association between EU membership and level of development.
    * The interpretation of the p-value is exactly the same as in the other tests: if H0 is true, what is the probability of observing frequencies that deviate at least this much from the expected ones? Because that probability is very small (p=3e-7), we reject H0.