# Women Entrepreneurship and Labor Force



Author: Gabriella Pauline Djojosaputro <br>
Date: 23 Dec 2020 <br>
Data Source: [Kaggle](https://www.kaggle.com/babyoda/women-entrepreneurship-and-labor-force)

The aim of the study is to examine whether the 2015 Women's Entrepreneurship Index and Global Entrepreneurship Index differ significantly between OECD countries that are members of the European Union and those that are not, and to determine whether the relationship between the two indexes is significant.

In [None]:
#Importing necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats as st

In [None]:
#Read data into dataframe
data = pd.read_csv('../input/women-entrepreneurship-and-labor-force/Dataset3.csv',sep=';')
data

In [None]:
data.info()

## Task 1:
*Are **European Union membership** variable and **development** variable **independent from each other**?*

A chi-square test of independence can be used to examine the association between two categorical variables. It assumes that the expected count in every cell of the contingency table is at least 5.

* H<sub>0</sub>: There is no association between European Union membership and the level of development of the country
* H<sub>1</sub>: There is an association between European Union membership and the level of development of the country
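The expected-count assumption can be checked before interpreting the test. The sketch below uses a hypothetical 2x2 table (the counts are illustrative, not taken from the dataset) and reads the expected counts that `st.chi2_contingency` returns.

```python
import numpy as np
import scipy.stats as st

# Hypothetical 2x2 counts (rows: EU membership, columns: development level).
# These numbers are illustrative only, not the notebook's data.
table = np.array([[20, 0],
                  [14, 17]])

# The fourth return value holds the expected counts under independence
_, _, _, expected_counts = st.chi2_contingency(table)

min_expected = expected_counts.min()
print(f'Smallest expected count: {min_expected:.2f}')
print('Rule of thumb (>= 5) satisfied:', min_expected >= 5)
```

Note that an observed count of 0 does not by itself violate the rule of thumb; what matters is the expected count in that cell.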

In [None]:
contingency = pd.crosstab(data['European Union Membership'],data['Level of development'])
contingency

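Since no developing country is also a European Union member (that cell count is 0), a continuity correction needs to be applied in the chi-square test. In SciPy, `correction=True` applies Yates' continuity correction, which only takes effect when the table has one degree of freedom.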

In [None]:
#Set correction to True
chisq,p,df,expected = st.chi2_contingency(contingency,correction=True)
print(f'Chi-squared: {chisq}')
print(f'p-value: {p}')
print(f'degrees of freedom: {df}')

Since the p-value is 3.04e-07 < 0.05, we reject the null hypothesis. There is enough evidence to support that there is an association between European Union Membership and the level of development of the country.
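When a 2x2 table contains a zero or very small cell, Fisher's exact test is a common cross-check for the chi-square result. This is a different technique from the one used in this notebook; the sketch below applies it to hypothetical counts, not to the dataset.

```python
import numpy as np
import scipy.stats as st

# Hypothetical 2x2 counts with one empty cell (illustrative only)
table = np.array([[20, 0],
                  [14, 17]])

# Fisher's exact test does not rely on large expected counts
odds_ratio, p_fisher = st.fisher_exact(table, alternative='two-sided')
print(f'Fisher exact p-value: {p_fisher:.2e}')
```

If both tests lead to the same decision, the conclusion does not hinge on the chi-square approximation.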

In [None]:
contingency-expected

From the difference between the observed and expected records for European Union Membership and Level of development, we see that European Union members tend to be developed countries and vice versa.
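Raw differences depend on the size of each cell, so they can be hard to compare. A minimal sketch, assuming a hypothetical 2x2 table, scales them into Pearson residuals, `(observed - expected) / sqrt(expected)`; cells with large absolute residuals contribute most to the chi-square statistic.

```python
import numpy as np
import scipy.stats as st

# Hypothetical 2x2 counts (illustrative only, not the notebook's data)
table = np.array([[20, 0],
                  [14, 17]])

_, _, _, expected_counts = st.chi2_contingency(table)

# Pearson residuals: positive means more observations than expected
residuals = (table - expected_counts) / np.sqrt(expected_counts)
print(np.round(residuals, 2))
```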

## Task 2:
*Do the **Women Entrepreneurship Index** and **Global Entrepreneurship Index** values show a statistically significant difference between the countries that are **members of the European Union** and not?*

Two hypotheses are being tested:<br>
* Hypothesis 1:<br>
    * H<sub>0</sub>: The population average of the Women Entrepreneurship Index for European Union members is equal to that of non-members<br>
    * H<sub>1</sub>: The population average of the Women Entrepreneurship Index for European Union members is not equal to that of non-members<br>
<br>
* Hypothesis 2:<br>
    * H<sub>0</sub>: The population average of the Entrepreneurship Index for European Union members is equal to that of non-members<br>
    * H<sub>1</sub>: The population average of the Entrepreneurship Index for European Union members is not equal to that of non-members<br>
<br>


In [None]:
data['European Union Membership'].value_counts()

Because there are fewer than 30 European Union members in the sample, it is too small to rely on the Central Limit Theorem, so a t-test should be used instead of a z-test. <br><br>
If the normality and equal-variance assumptions are not met, a non-parametric alternative to the t-test should be used. Both assumptions are tested in the following sections.
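The decision rule described above can be sketched as a small helper. `choose_two_sample_test` is a hypothetical function, not part of SciPy, and it checks normality per group, which is one of several reasonable choices. The demo arrays are synthetic.

```python
import numpy as np
import scipy.stats as st

def choose_two_sample_test(a, b, alpha=0.05):
    """Hypothetical helper: pick a two-sample test from assumption checks."""
    # Shapiro-Wilk: a small p-value is evidence against normality
    normal = (st.shapiro(a)[1] > alpha) and (st.shapiro(b)[1] > alpha)
    if not normal:
        return 'mann-whitney'
    # Levene: a small p-value is evidence of unequal variances
    equal_var = st.levene(a, b, center='median')[1] > alpha
    return 'student-t' if equal_var else 'welch-t'

# Synthetic, clearly bimodal sample versus an evenly spread sample
bimodal = np.concatenate([np.linspace(0, 1, 15), np.linspace(50, 51, 15)])
spread = np.linspace(0, 10, 30)
print(choose_two_sample_test(bimodal, spread))
```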

### Normal Distribution Check

In [None]:
# Histogram of each index on its own axis
fig, axes = plt.subplots(2, 1)
axes[0].hist(data['Women Entrepreneurship Index'], bins=10)
axes[0].set_title('Women Entrepreneurship Index')
axes[1].hist(data['Entrepreneurship Index'], bins=10)
axes[1].set_title('Entrepreneurship Index')
fig.tight_layout()

Neither histogram looks normally distributed. Instead of a bell curve with the highest frequency at the mean, both show several peaks. <br><br>
Statistical tests can also be used to confirm this observation. The Shapiro-Wilk test has the null hypothesis that the data are normally distributed, so a p-value below the significance level of 0.05 means we reject normality.

In [None]:
st.shapiro(data['Women Entrepreneurship Index'])
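The cell above runs the Shapiro-Wilk test on the Women Entrepreneurship Index only. A minimal sketch of applying the same check to every index column in a loop is shown below; the `demo` frame reuses the notebook's column names but holds synthetic values, so its output says nothing about the real data.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Synthetic stand-in for the dataset (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Women Entrepreneurship Index': rng.normal(50, 10, 40),
    'Entrepreneurship Index': rng.exponential(20, 40),
})

# Run Shapiro-Wilk on each column and keep the (W, p-value) pair
normality = {}
for column in demo.columns:
    w_stat, p_value = st.shapiro(demo[column])
    normality[column] = (w_stat, p_value)
    print(f'{column}: W={w_stat:.3f}, p={p_value:.4f}')
```

With the real data, `demo` would be replaced by `data[['Women Entrepreneurship Index', 'Entrepreneurship Index']]`.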

### Equal Variance Test

Levene's test checks for equal variances, with H<sub>0</sub> that the groups have equal variance. If the p-value is smaller than the significance level of 0.05, we conclude that the variances differ between the groups.

In [None]:
#Equal Variance Test for Women Entrepreneurship Index
st.levene(data['Women Entrepreneurship Index'][data['European Union Membership']=='Member'], 
          data['Women Entrepreneurship Index'][data['European Union Membership']=='Not Member'],
          center='median')

The p-value is 0.21 > 0.05, so we do not reject the null hypothesis. There is no evidence that the variance of the Women Entrepreneurship Index differs between European Union members and non-members, so the variances are treated as **equal**.

In [None]:
sns.stripplot(data=data,x='European Union Membership',y='Women Entrepreneurship Index')

In [None]:
#Equal Variance Test for Entrepreneurship Index
st.levene(data['Entrepreneurship Index'][data['European Union Membership']=='Member'], 
          data['Entrepreneurship Index'][data['European Union Membership']=='Not Member'],
          center='median')

The p-value is 0.45 > 0.05, so we do not reject the null hypothesis. There is no evidence that the variance of the Entrepreneurship Index differs between European Union members and non-members, so the variances are treated as **equal**.

In [None]:
sns.stripplot(data=data,x='European Union Membership',y='Entrepreneurship Index')

Although the equal-variance assumption holds, neither the Women Entrepreneurship Index nor the Entrepreneurship Index appears normally distributed. Therefore the Mann-Whitney U test, a non-parametric alternative to the t-test, will be used.


### Mann-Whitney U Test

In [None]:
# Mann-Whitney U Test for Women Entrepreneurship Index
st.mannwhitneyu(data['Women Entrepreneurship Index'][data['European Union Membership']=='Member'],
                data['Women Entrepreneurship Index'][data['European Union Membership']=='Not Member'])

The p-value is 6.79e-06 < 0.05, so we reject the null hypothesis. We conclude that the ___Women Entrepreneurship Index differs between European Union members and non-members___.

In [None]:
WEIfig = sns.boxplot(data=data,x='European Union Membership',y='Women Entrepreneurship Index')
WEIfig.set_title('Women Entrepreneurship Index in European Union Members and Non-Members')

In [None]:
# Mann-Whitney U Test for Entrepreneurship Index
st.mannwhitneyu(data['Entrepreneurship Index'][data['European Union Membership']=='Member'],
                data['Entrepreneurship Index'][data['European Union Membership']=='Not Member'])

The p-value is 0.0002 < 0.05, so we reject the null hypothesis. We conclude that the ___Entrepreneurship Index differs between European Union members and non-members___.
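The hypotheses in this task are two-sided, and the default for `alternative` in `st.mannwhitneyu` has differed between SciPy releases, so passing it explicitly avoids ambiguity. The sketch below also adds a rank-biserial effect size, computed from ranks so that it does not depend on which U statistic a given SciPy version returns. `rank_biserial` is a hypothetical helper and the two samples are toy values, not the dataset.

```python
import numpy as np
import scipy.stats as st

def rank_biserial(a, b):
    """Hypothetical helper: rank-biserial correlation, between -1 and 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    # Rank the pooled sample, then derive U for the first group
    ranks = st.rankdata(np.concatenate([a, b]))
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2
    return 2 * u1 / (n1 * n2) - 1

# Toy samples with no overlap (illustrative only)
members = [1, 2, 3, 4, 5]
non_members = [6, 7, 8, 9, 10]

_, p_two_sided = st.mannwhitneyu(members, non_members, alternative='two-sided')
effect = rank_biserial(members, non_members)
print(f'two-sided p-value: {p_two_sided:.4f}, rank-biserial: {effect:.2f}')
```

A value near -1 or 1 means the two groups barely overlap, while a value near 0 means neither group tends to rank above the other.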

In [None]:
EIfig = sns.boxplot(data=data,x='European Union Membership',y='Entrepreneurship Index')
EIfig.set_title('Entrepreneurship Index in European Union Members and Non-Members')

## Task 3:
*Is there a statistically significant **relationship between Women's Entrepreneurship Index and Global Entrepreneurship Index values**?*

* H<sub>0</sub>: There is no correlation between Women's Entrepreneurship Index and Global Entrepreneurship Index
* H<sub>1</sub>: There is correlation between Women's Entrepreneurship Index and Global Entrepreneurship Index

Since the variables are not normally distributed, a non-parametric correlation test such as Spearman or Kendall should be used instead of Pearson. Spearman will be used in this project.
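To illustrate the difference between these coefficients, the sketch below uses synthetic data with a perfectly monotonic but non-linear relationship. Spearman and Kendall work on ranks and report a perfect association, while Pearson, which measures linear association, does not.

```python
import numpy as np
import scipy.stats as st

# Synthetic data: y always increases with x, but not linearly
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

pearson_r, _ = st.pearsonr(x, y)
spearman_rho, _ = st.spearmanr(x, y)
kendall_tau, _ = st.kendalltau(x, y)

print(f'Pearson:  {pearson_r:.3f}')
print(f'Spearman: {spearman_rho:.3f}')
print(f'Kendall:  {kendall_tau:.3f}')
```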

In [None]:
st.spearmanr(data[['Women Entrepreneurship Index','Entrepreneurship Index']])

Since the p-value is less than 0.05 (4.06e-20), we reject the null hypothesis. There is a statistically significant correlation between Women Entrepreneurship Index and Entrepreneurship Index. <br><br>

The correlation value is **0.91**, indicating a ___strong positive correlation between the two variables___.

In [None]:
sns.jointplot(data=data,x='Women Entrepreneurship Index',y='Entrepreneurship Index')
