# Prerequisites

- Python 3.10.4

> Warning: Installing from the conda environment may take a few minutes.

Configuring conda environment
```cmd
conda create -n ca2_env python=3.10.4
conda activate ca2_env
conda install -c conda-forge pingouin
```

Installing Jupyter Notebook in the ca2_env environment
```cmd
conda install jupyter notebook
python -m ipykernel install --name ca2_env
```

Run Jupyter
```cmd
jupyter notebook
```
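
Before opening the notebook, it can help to confirm that the kernel really runs inside `ca2_env` and can see the packages the notebook imports. The snippet below is a minimal sketch using only the standard library; the package list is an assumption taken from the import cell, so adjust it if your setup differs.

```python
import importlib.util
import sys

# Packages imported by the notebook (assumed from its import cell).
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "scipy", "statsmodels", "pingouin"]


def check_environment(required=REQUIRED):
    """Return a dict mapping each package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in required}


availability = check_environment()
print(f"Python {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
for name, found in availability.items():
    print(f"{name:12s} {'ok' if found else 'MISSING'}")
```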

In [None]:
from preamble import agriculture
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pingouin as pg
import seaborn as sns
from scipy import stats
from scipy.stats import f_oneway
from statsmodels.formula.api import ols
import statsmodels.stats.multicomp as mc
import statsmodels.api as sm
from IPython.display import display, Markdown, Latex

# Change default colormap
plt.rcParams["image.cmap"] = "Set2"
sns.set_palette("Set2")
sns.color_palette("Set2")


In [None]:
# Datasets
agriculture_df = pd.read_csv("../data/agriculture_dataset.csv")
ireland = agriculture_df.query("country == 'IE'")
eu_country_codes = pd.read_csv("../data/eu_country_codes.csv")
eu_country_codes.columns = ["id","description","iso2"]
eu_country_codes = eu_country_codes[["iso2","description"]].set_index("iso2")
country_codes_dic =  eu_country_codes.description.to_dict()
columns_dic = agriculture.columns_dic

# Variable Analysis

In [None]:
ireland.shape

In [None]:
ireland.describe()

In [None]:
continuous_variables = ireland.dtypes[ireland.dtypes == np.float64].index
data = ireland[continuous_variables]
sns.set(rc = {'figure.figsize':(15,8)})
sns.heatmap(data.corr(), cmap="YlGnBu", annot=True)

In [None]:
# Gross Value Added
ireland.gross_value_added.describe()

In [None]:
ireland.info()

# Analysis of countries with similar characteristics to Ireland

Identify countries whose GVA and total utilised agricultural area (UAA) are similar to Ireland's.
- Criteria 1: Countries whose mean GVA and mean UAA both fall within ±75% of Ireland's means (that is, between 25% and 175% of the Irish value)
- Criteria 2: ANOVA test and post hoc analysis to identify countries

## Criteria 1: ±75% of Ireland's GVA

In [None]:
# Countries with a utilised agricultural area (UAA) and GVA similar to IE
uaa_means = agriculture_df.groupby('country').total_uaa_ha.mean().reset_index()
gva_means = agriculture_df.groupby('country').gross_value_added.mean().reset_index()
ie_uua = uaa_means.query("country == 'IE'").total_uaa_ha.values[0]
ie_GVA = gva_means.query("country == 'IE'").gross_value_added.values[0]

country_with_similar_uaa = uaa_means.query(f"\
        total_uaa_ha >= {ie_uua * 0.25} and \
        total_uaa_ha <= {ie_uua * 1.75}").country.values.flatten()

counties_with_similar_gva = gva_means.query(f"\
        gross_value_added >= {ie_GVA * 0.25} and \
        gross_value_added <= {ie_GVA * 1.75} \
").country.values.flatten()

# Set similar countries to Ireland by UAA and GVA
similar_countries = list(set(country_with_similar_uaa).intersection(set(counties_with_similar_gva)))
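
The ±75% band can also be expressed as a small reusable function. The sketch below uses made-up country means (the labels `AA` to `DD` and all values are hypothetical) and a helper named `within_band`, which is not part of the notebook; it assumes the reference value is positive.

```python
import pandas as pd


def within_band(means: pd.Series, reference: str, tolerance: float = 0.75) -> list:
    """Return the index labels whose value lies within +/- tolerance of the reference value."""
    ref_value = means[reference]
    lower = ref_value * (1 - tolerance)
    upper = ref_value * (1 + tolerance)
    return sorted(means[(means >= lower) & (means <= upper)].index)


# Made-up country means, for illustration only
toy_gva = pd.Series({"IE": 100.0, "AA": 30.0, "BB": 170.0, "CC": 20.0, "DD": 400.0})
toy_uaa = pd.Series({"IE": 50.0, "AA": 60.0, "BB": 10.0, "CC": 55.0, "DD": 45.0})

similar_gva = within_band(toy_gva, "IE")  # band is [25, 175]
similar_uaa = within_band(toy_uaa, "IE")  # band is [12.5, 87.5]

# A country qualifies only if it is inside both bands
both = sorted(set(similar_gva) & set(similar_uaa))
print(both)  # ['AA', 'IE']
```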



## Criteria 2: ANOVA

### T-Test Assumptions

1. The first assumption made regarding t-tests concerns the scale of measurement. The assumption for a t-test is that the scale of measurement applied to the data collected follows a continuous or ordinal scale, such as the scores for an IQ test.

2. The second assumption made is that of a simple random sample, that the data is collected from a representative, randomly selected portion of the total population.

3. The third assumption is that the data, when plotted, follow a normal, bell-shaped distribution. When a normal distribution is assumed, one can specify a level of probability (alpha level, level of significance, p) as a criterion for acceptance. In most cases, a 5% value can be assumed.

4. The fourth assumption is that a reasonably large sample size is used. A larger sample size means the distribution of results should approach a normal bell-shaped curve.

5. The final assumption is homogeneity of variance. Homogeneous, or equal, variance exists when the standard deviations of the samples are approximately equal.

ref: Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. A, 160(901), 268-282.
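
Assumptions 3 and 5 can be checked numerically. The following is a minimal sketch on synthetic data, assuming `scipy.stats.shapiro` for normality and `scipy.stats.levene` for equal variances; the helper names `looks_normal` and `equal_variance` are illustrative and do not appear in the notebook. With these samples, both checks should report `False`, because the exponential sample is strongly skewed and the two normal samples have very different spreads.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples, for illustration only
normal_a = rng.normal(loc=10.0, scale=1.0, size=200)
wide_b = rng.normal(loc=10.0, scale=10.0, size=200)
skewed_c = rng.exponential(scale=1.0, size=200)

ALPHA = 0.05


def looks_normal(sample, alpha=ALPHA):
    """Shapiro-Wilk: True when normality is not rejected at the given alpha."""
    _, pvalue = stats.shapiro(sample)
    return bool(pvalue > alpha)


def equal_variance(*samples, alpha=ALPHA):
    """Levene: True when equal variances are not rejected at the given alpha."""
    _, pvalue = stats.levene(*samples)
    return bool(pvalue > alpha)


print("skewed sample looks normal:", looks_normal(skewed_c))
print("normal_a and wide_b share a variance:", equal_variance(normal_a, wide_b))
```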

In [None]:
# Check the normal distribution of every variable for each country
variables = agriculture_df.columns
results = [["country","variable","anova_score"]]
for variable in variables:
    if(variable == "country"):
        continue
    if(variable == "year"):
        continue
        
    print("=======================================================================")
    print(f"                           {variable}                                 ")
    print("=======================================================================")
    # Countries whose values for this variable pass the Shapiro-Wilk normality test
    gva_c_normal_dist = []
    for c in agriculture_df.country.unique():
        X = agriculture_df.query(f"country=='{c}'")[variable]
        stat, pvalue = stats.shapiro(X)
        if pvalue > 0.05:
            gva_c_normal_dist.append([c, pvalue])

    similar_countries = [country for country, _ in gva_c_normal_dist]

    # Homogeneity of variance between Ireland and each candidate country: Levene's test
    arr = []
    for c in similar_countries:
        if c == 'IE':
            continue

        countries = ['IE', c]
        ds = agriculture_df.query("country in @countries")
        levenes_result = pg.homoscedasticity(ds, dv=variable, group='country', method='levene')
        # One row per country: W, pvalue, equal_var, variable, country
        arr.append(list(levenes_result.values.flatten()) + [variable, c])

    levene_df = pd.DataFrame(arr, columns=["W","pvalue","equal_var","variable","country"])


    # Results of countries
    levene_df.query("equal_var == True")

    similar_countries = list(levene_df.query("equal_var == True").country.unique())
    similar_countries.append('IE')

    # Run anova report for all valid countries
    print(f"Candidates for ANOVA {similar_countries}")
    dataset = agriculture_df.query("country in @similar_countries")

    model = ols(f"{variable}~country", data = dataset).fit()
    #print(model.summary())
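    # `typ` (not `type`) selects the ANOVA type; an unknown keyword is silently ignored
    aov2 = sm.stats.anova_lm(model, typ=2)
    print(aov2)
    if(aov2["PR(>F)"].country > 0.05):
        # NOTE: `c` is the last country visited by the Levene loop above
        results.append([c,variable,aov2["PR(>F)"].country])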

    comp = mc.MultiComparison(dataset[variable], dataset['country'])
    post_hoc_res = comp.tukeyhsd()
    result = post_hoc_res.summary()
    #f = post_hoc_res.plot_simultaneous(comparison_name="IE",ylabel="Country",xlabel=columns_dic.get(variable))\
    #    .savefig(f"../visualizations/01_stats_anova_meanplot_{variable}.png");


In [None]:
print("ANOVA result comparison for Ireland with other Member States and variables")
results_pd = pd.DataFrame.from_records(np.array(results)[1:,:], columns=results[0])
results_pd["country_name"] = results_pd.country.apply(lambda x: country_codes_dic.get(x))
results_pd

> For the following two country/variable pairs, the ANOVA found no significant difference in means from Ireland:

- Bulgaria: subsidies on field crops
- Slovakia: cereals production price

However, for the sake of analysing the independent variables, Criteria 1 will be used for the machine learning. This exercise has confirmed which countries have statistical characteristics similar to Ireland's. It would also be possible to break the European countries down by subregional area, since countries in the same subregion are most likely to be closer to Ireland in size on the selected variables of study.

In [None]:
# Set similar countries to Ireland by UAA and GVA as per Criteria 1
similar_countries = list(set(country_with_similar_uaa).intersection(set(counties_with_similar_gva)))

In [None]:
similar_countries

# Inferential Statistics Analysis of variables of the selected countries

In [None]:
dataset = agriculture_df.query("country in @similar_countries")

The criteria for selecting the above countries for the comparison analysis will be investigated further by comparing their characteristics and analysing the mean and variance of the samples taken.

In [None]:
n = len(dataset.year.unique())
print(f"Sample data {n}")

## 1. Ireland 

In [None]:
# Normal distribution check
results = [["variable","pvalue"]]
for variable in continuous_variables:
    pvalue = agriculture.plot_normal_dist(ireland[variable],
                     columns_dic.get(variable) , 
                     f"../visualizations/01_stats_normaldist_{variable}.png")
    if(pvalue > 0.05):
        results.append([variable,pvalue])

In [None]:
ireland_var_normdist = pd.DataFrame.from_records(np.array(results)[1:,:], columns=results[0])
ireland_var_normdist.sort_values("variable")

The following variables show an approximately normal distribution, judging by the fit to the histogram. In the probability plot, their data points also fall on a fairly straight line, which indicates normality.

- `agri_energy_use_tj`
- `cereals_produce_price_usd_tonne`
- `compensation_of_employees`
- `crop_mean_residues_kg`
- `employment_ratio_rural_areas_pct`
- `female_employment_ratio_rural_areas_pct`
- `female_mean_weekly_working_hours`
- `gross_value_added`
- `male_employment_ratio_rural_areas_pct`
- `male_mean_weekly_working_hours`
- `prod_cereals_real_price`
- `total_uaa_ha`
- `wages_and_salaries`
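
The "fairly straight line" reading of a probability plot can be quantified. The internals of `agriculture.plot_normal_dist` are not shown here, so the sketch below is an assumption about one way to do it: `scipy.stats.probplot` returns the correlation `r` between the ordered sample and the theoretical normal quantiles, and values close to 1 correspond to a straighter line. The data are synthetic and `probplot_r` is a hypothetical helper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples, for illustration only
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)
skewed_sample = rng.exponential(scale=1.0, size=200)


def probplot_r(sample):
    """Correlation between ordered sample values and theoretical normal quantiles."""
    (_, _), (_, _, r) = stats.probplot(sample, dist="norm")
    return r


r_normal = probplot_r(normal_sample)
r_skewed = probplot_r(skewed_sample)
print(f"r (normal sample): {r_normal:.3f}")
print(f"r (skewed sample): {r_skewed:.3f}")
```

The normal sample should give the higher `r`; the skewed sample bends away from the reference line and scores lower.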


In [None]:
variables_normally_dist_candidates = ireland_var_normdist.variable.unique()

In [None]:
# Shapiro-Wilk test for normality in the Irish dataset
variables_normally_dist = []
for variable in variables_normally_dist_candidates:
    stat, pvalue = stats.shapiro(ireland[variable])
    if pvalue > 0.05:
        variables_normally_dist.append(variable)
    else:
        print(f"Removing {variable} with pvalue: {pvalue}")


In [None]:
variables_normally_dist

## 2. Other member states and Ireland

### Check homogeneity of variances across all countries, per variable

In [None]:
# Homogeneity of variance: Levene's test
arr = []
for variable in variables_normally_dist:
    levenes_result = pg.homoscedasticity(dataset, dv=variable, group='country', method='levene')
    # One row per variable: W, pvalue, equal_var, variable, country
    arr.append(list(levenes_result.values.flatten()) + [variable, "all"])

levene_df = pd.DataFrame(arr, columns=["W","pvalue","equal_var","variable","country"])
levene_df.query("equal_var == True")


> The variables listed above have homogeneous variance across all countries.

### Check homogeneity of variances between Ireland and each other member state on the selected variables

In [None]:
# Variables with homogeneous variance across all selected countries
homogeneous_vars = levene_df.query("equal_var == True").variable.unique()

for c in similar_countries:
    if c == 'IE':
        continue
    cName = country_codes_dic.get(c)
    print(f"Analysing homogeneity test Ireland / {cName}")
    countries = ['IE', c]
    ds = dataset.query("country in @countries")
    arr = []
    for variable in homogeneous_vars:
        levenes_result = pg.homoscedasticity(ds, dv=variable, group='country', method='levene')
        arr.append(list(levenes_result.values.flatten()) + [variable, c])
    levene_df = pd.concat([levene_df, pd.DataFrame(arr, columns=["W","pvalue","equal_var","variable","country"])])

In [None]:
print("Variables for selection")
print("=======================")
_ = [print(f"\t{x}") for x in levene_df.query("equal_var == True").variable.unique()]

print("Countries for study")
print("=======================")
_ = [print(f"\t{country_codes_dic.get(x)}") for x in levene_df.query("equal_var == True").country.unique()]
print(f"\t{country_codes_dic.get('IE')}")

# One-way ANOVA 
## Analysis of all variables and countries over the 17-year series, testing whether the dataset shows no differences among the means

- H0: $\mu_1 = \mu_2 = \dots = \mu_6$
- H1: At least one group mean differs from the others
- alpha = `0.05`
- Degrees of freedom between groups (k - 1): 6 countries, so 6 - 1 = `5`
- Degrees of freedom within groups (N - k): with N = 6 × 17 = 102 observations (one per country and year), 102 - 6 = `96`
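
As a worked check of these quantities, the sketch below computes the F statistic by hand and compares it with `scipy.stats.f_oneway`. It runs on synthetic data and assumes a balanced panel, so that N is the total number of observations (6 countries × 17 years = 102), which gives 5 degrees of freedom between groups and 96 within groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 6 countries with 17 yearly observations each (balanced panel assumed, synthetic values)
k, n_per_group = 6, 17
groups = [rng.normal(loc=50.0, scale=5.0, size=n_per_group) for _ in range(k)]

N = k * n_per_group      # total number of observations
df_between = k - 1       # degrees of freedom between groups
df_within = N - k        # degrees of freedom within groups

grand_mean = np.mean(np.concatenate(groups))
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the between-group to the within-group mean square
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(df_between, df_within)
print(f"F = {f_manual:.4f} (manual) vs {f_scipy:.4f} (scipy)")
```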

In [None]:
# Run the ANOVA report for all valid variables
anova_frames = []
for variable in levene_df.query("equal_var == True").variable.unique():
    pvalue, df = agriculture.anova_result(dataset,variable,f"{variable}~country",False, False)
    if pvalue > 0.05:
        df["variable"] = variable
        # Keep only the pairwise comparisons that involve Ireland and do not reject H0
        anova_frames.append(df.query("(group1 == 'IE' or group2 == 'IE') and reject == False"))

# Comparing a DataFrame with None in an `if` raises, so collect the frames and concatenate once
results_anova = pd.concat(anova_frames, ignore_index=True) if anova_frames else None

In [None]:
results_anova["country_name"] = results_anova.apply(lambda r: country_codes_dic.get(r.group2) if r.group1 == 'IE' else country_codes_dic.get(r.group1), axis=1)
results_anova

> For cereals production price, Belgium, Denmark, Lithuania, Latvia and Portugal show no significant difference in means from Ireland.
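
The pairwise reading above comes from a Tukey HSD post hoc comparison. The internals of `agriculture.anova_result` are not shown, so the following is only a sketch of how such a table can be produced, assuming `pairwise_tukeyhsd` from statsmodels and synthetic groups labelled `A`, `B` and `C`. A pair with `reject == False` is one whose means are not significantly different.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(7)

# Synthetic data: groups A and B share a mean, group C is far away
values = np.concatenate([
    rng.normal(10.0, 1.0, 17),
    rng.normal(10.0, 1.0, 17),
    rng.normal(50.0, 1.0, 17),
])
labels = np.repeat(["A", "B", "C"], 17)

tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())

# Pairs are reported in the order (A, B), (A, C), (B, C)
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
rejected = {pair: bool(flag) for pair, flag in zip(pairs, tukey.reject)}
print(rejected)
```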

# Non-parametric test for the variables that do not meet the ANOVA assumptions

In [None]:
# Variables analysed in the ANOVA; the remaining ones do not meet its requirements
anova_vars = set(levene_df.query("equal_var == True").variable.unique())
all_columns = set(columns_dic.keys())
kruskal_vars = all_columns - anova_vars


print("Variables for selection")
print("=======================")
_ = [print(f"\t{x}") for x in kruskal_vars]

print("Countries for study")
print("=======================")
_ = [print(f"\t{country_codes_dic.get(x)}") for x in levene_df.query("equal_var == True").country.unique()]

countries = list(levene_df.query("equal_var == True").country.unique())
countries.append('IE')
dataset = agriculture_df.query("country in @countries")

results_nonp = [["variable","kruskal_test_result"]]
# KRUSKAL TEST
for variable in kruskal_vars:    
    k = agriculture.kruskal_report(dataset,variable,f"{variable}~country")
    results_nonp.append([variable,k])

## Summary of the non-parametric test

In [None]:
print("Non-parametric test comparison indicates similar values between Ireland and the selected\n" +
      f"Member States {similar_countries} on the following variables")
results_nonp_pd = pd.DataFrame.from_records(np.array(results_nonp)[1:,:], columns=results_nonp[0])
results_nonp_pd["kruskal_test_result"] = results_nonp_pd.kruskal_test_result.astype("float")
results_nonp_pd.query("kruskal_test_result > 0.05")
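
The body of `agriculture.kruskal_report` is not shown in this notebook. As a rough illustration of the underlying test, and assuming it is the Kruskal-Wallis H test, the sketch below applies `scipy.stats.kruskal` to synthetic skewed samples. A p-value above 0.05 would mean the groups cannot be told apart; here one group is shifted far from the others, so the test should reject.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic, skewed samples (17 observations each), for illustration only
group_a = rng.exponential(scale=2.0, size=17)
group_b = rng.exponential(scale=2.0, size=17)
group_c = rng.exponential(scale=2.0, size=17) + 100.0  # clearly shifted group

# Kruskal-Wallis compares rank distributions, so it does not assume normality
h_stat, pvalue = stats.kruskal(group_a, group_b, group_c)
print(f"H = {h_stat:.3f}, p-value = {pvalue:.3g}")

groups_differ = bool(pvalue < 0.05)
print("At least one group differs:", groups_differ)
```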

# Employment by gender analysis


In [None]:
dataset[['female_employment_ratio_rural_areas_pct','male_employment_ratio_rural_areas_pct']].corr(method='pearson')

In [None]:
# Check means of the 2 groups male and female.
dataset[['female_employment_ratio_rural_areas_pct','male_employment_ratio_rural_areas_pct']].describe()

In [None]:
stat, pvalue = stats.levene(dataset.female_employment_ratio_rural_areas_pct, dataset.male_employment_ratio_rural_areas_pct)
print(f"H0: There is homogeneity of variance in male and female employment ratio; Levene test pvalue {pvalue}")

> Box plot of employment ratio by gender

In [None]:
df_gender = dataset[['female_employment_ratio_rural_areas_pct','male_employment_ratio_rural_areas_pct']]
df_gender.columns = ["Female","Male"]
df_melt = df_gender.melt()

# Generate a box plot to see the data distribution by gender. A box plot makes
# differences between the two groups easy to spot.

ax = sns.boxplot(x='variable', y='value', data=df_melt, color='#99c2a2')
ax = sns.swarmplot(x="variable", y="value", data=df_melt, color='#7d0013')
plt.show()

In [None]:
df_gender = dataset[['country','female_employment_ratio_rural_areas_pct','male_employment_ratio_rural_areas_pct']]
df_gender.columns = ['country',"Female","Male"]
pvalue, df_gender_anova_group_result = agriculture.anova_result(df_gender,"Female",f"Male~Female+country", True)
print(pvalue)

The employment ratio of the female population in rural areas is not the same as that of the male population in the selected countries. The comparison within groups of different countries follows.

In [None]:
df_gender_anova_group_result.query("reject == False and (group1 == 'IE' or group2 == 'IE')")

> Portugal and Ireland have similar employment ratios for the male and female groups.