# TC data sample comparisons
We need to compare groups of indicators taken from the Transparent Cities Transparency dataset. The groups are based on two criteria:
1. Complex vs simple indicators: a simple indicator takes little to moderate effort to implement and is worth at most 1 point. A complex indicator takes more effort to implement and is worth at most 2 points. See the latest Transparency methodology at http://transparentcities.in.ua for details.
2. Imperative vs non-imperative indicators: the content of an indicator may either be regulated by law (imperative) or be purely advisory (non-imperative). A middle-ground case is an indicator that is grounded in legislation when implemented one way but not when implemented another way.

This notebook is devoted to the complex vs simple comparison. Our goals are:
- To see whether the mean completion rates of the two groups differ
- To see whether the difference is statistically significant.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import copy
import numpy as np
import scipy.stats as stats
from statsmodels.graphics.gofplots import qqplot

In [None]:
#load the dataset
data = pd.read_excel("G:/My Drive/Особисті доки/Прозорість/210903 зведена база 2020.xlsx")

Let's add a binary variable for complexity.

In [None]:
data['complexity'] = data['maxPoint'].map({1.0:'simple', 2.0:'complex'})
#check if the operation went well - there should be two categories
data['complexity'].unique()
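
`Series.map` with a dictionary turns every value that is not a key of the dictionary into `NaN`, so `unique()` would show a third entry if any `maxPoint` were neither 1 nor 2. A minimal sketch of a stricter check, using a small made-up frame (the `toy` data and the 1.5 value are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: 1.5 is not a key in the mapping below
toy = pd.DataFrame({"maxPoint": [1.0, 2.0, 2.0, 1.5]})
toy["complexity"] = toy["maxPoint"].map({1.0: "simple", 2.0: "complex"})

# dropna=False keeps NaN in the tally, so unmapped rows stay visible
counts = toy["complexity"].value_counts(dropna=False)
print(counts)

# Number of rows that matched neither category
n_unmapped = int(toy["complexity"].isna().sum())
print(n_unmapped)
```

On the real data, a non-zero count of unmapped rows would mean some indicators fall outside the simple/complex split and would be silently excluded from both groups.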

We will not compare absolute scores (which differ by construction, since the maxima are 1 and 2 points) but the average completion rate, measured as a city's points divided by the maximum points for the indicator.

In [None]:
data['percent_implemeted'] = data['point']/data['maxPoint']

Split the completion rates into arrays, one for all indicators and one for each group.

In [None]:
all_implemented = np.array(data['percent_implemeted'])
simple_implemented = np.array(data[data['complexity']=='simple']['percent_implemeted'])
complex_implemented = np.array(data[data['complexity']=='complex']['percent_implemeted'])

In [None]:
# calculate means of both groups
simple_mean = simple_implemented.mean()
complex_mean = complex_implemented.mean()

# the observed difference between the group means is the quantity we will test
gT = simple_mean - complex_mean

# the .1% format multiplies by 100 and rounds, avoiding float artifacts such as 52.300000000000004%
print(f'Mean implementation of simple indicators: {simple_mean:.1%}')
print(f'Mean implementation of complex indicators: {complex_mean:.1%}')
print(f'Difference of means: {gT:.1%}')
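
The same group means can also be obtained in one step with `groupby`, which avoids building separate arrays. The sketch below is an illustration on made-up numbers (the `toy` frame is hypothetical); it reuses the notebook's column names, including the `percent_implemeted` spelling.

```python
import pandas as pd

# Hypothetical toy data using the notebook's column names
toy = pd.DataFrame({
    "point":    [1.0, 0.0, 1.0, 2.0, 1.0, 0.0],
    "maxPoint": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
})
toy["complexity"] = toy["maxPoint"].map({1.0: "simple", 2.0: "complex"})
toy["percent_implemeted"] = toy["point"] / toy["maxPoint"]

# Mean completion rate per complexity group
group_means = toy.groupby("complexity")["percent_implemeted"].mean()
diff = group_means["simple"] - group_means["complex"]

print(group_means)
print(f"Difference of means: {diff:.1%}")
```

Note that `groupby` drops rows whose group key is `NaN` by default, so unmapped indicators would be excluded here as well.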


Now that we know the means and their difference, we can formulate our hypotheses.
- **Null hypothesis**: there is no difference in mean % of implementation between the two groups of indicators.
- **Alternative hypothesis**: there is a difference in mean % of implementation between the two groups of indicators.

## Normality test
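Now that we know the difference of averages is about 10 percentage points, let's check the distributions within our "complex" and "simple" groups. Based on the distributions, we'll decide whether to use a parametric or a non-parametric test to assess significance. We use `stats.normaltest`, which implements D'Agostino and Pearson's test; its null hypothesis is that the sample comes from a normal distribution.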

In [None]:
# for simple indicators
stats.normaltest(simple_implemented)

In [None]:
#for complex indicators
stats.normaltest(complex_implemented)
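
`stats.normaltest` returns a result with a `statistic` and a `pvalue`; a small p-value is evidence against normality. To illustrate how to read the output, here is a sketch on synthetic samples (both samples and the 0.05 threshold are assumptions chosen for illustration, not part of the analysis above):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Synthetic samples, not the notebook's data
normal_sample = rng.normal(loc=0.5, scale=0.1, size=500)
# Scores restricted to three levels: 0, 0.5 and 1
discrete_sample = rng.choice([0.0, 0.5, 1.0], size=500)

alpha = 0.05  # assumed significance level
for name, sample in [("normal", normal_sample), ("discrete", discrete_sample)]:
    result = stats.normaltest(sample)
    verdict = "reject normality" if result.pvalue < alpha else "cannot reject normality"
    print(f"{name}: statistic={result.statistic:.2f}, p={result.pvalue:.3g} -> {verdict}")
```

A sample that takes only a few distinct values has much flatter tails than a normal distribution, which the test picks up through its kurtosis component.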

For both groups of indicators the test rejects normality decisively: the distributions are far from normal. Let's see how they look.

In [None]:
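# histogram of completion rates for simple indicators (no random seed needed: nothing here is random)
plt.hist(simple_implemented)
plt.title('Simple indicators')
plt.xlabel('Share of maximum points')
plt.show()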

In [None]:
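# histogram of completion rates for complex indicators
plt.hist(complex_implemented)
plt.title('Complex indicators')
plt.xlabel('Share of maximum points')
plt.show()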
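
Histograms with default bins can hide whether the values are continuous or concentrated on a handful of levels. As a complement, the distinct values and their shares can be tabulated directly. A minimal sketch on made-up completion rates (the `toy_complex` array is hypothetical; on the real data one would pass `complex_implemented` or `simple_implemented` instead):

```python
import numpy as np

# Hypothetical completion rates; the real arrays come from the dataset
toy_complex = np.array([0.0, 0.5, 1.0, 1.0, 0.5, 0.0, 1.0, 1.0])

# Distinct values (sorted) and how often each occurs
values, counts = np.unique(toy_complex, return_counts=True)
shares = counts / counts.sum()

for v, s in zip(values, shares):
    print(f"{v:.2f}: {s:.1%}")
```

If the real arrays turn out to contain only a few distinct levels, that alone would explain why a normality test rejects so strongly.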