### I. Preparation of the Python environment

### I.1. Modules import and data loading

#### I.1.1. Modules import

To carry out the work, the following modules will be needed:
1. pandas - for data handling and analysis,
2. scipy.stats - for calculating correlation and statistical significance coefficients,
3. numpy - for modifying array objects,
4. LinearRegression (from sklearn.linear_model) - for calculating linear regression coefficients.

In [1]:
import pandas as pd
import scipy.stats as sp
import numpy as np
from sklearn.linear_model import LinearRegression

#### I.1.2. Data loading

Data from the "bi_to_python_countries_data.csv" file will be loaded into a pandas DataFrame object.

In [2]:
# read_csv creates the DataFrame itself, so no empty DataFrame is needed beforehand
dataset_df = pd.read_csv(r'C:\Users\Sebastian\Documents\Data Analyst\Portfolio\1\bi_to_python_countries_data.csv')

To check that the data loaded correctly, the top and bottom rows of the DataFrame object will be displayed.

In [3]:
dataset_df.head()

Unnamed: 0,location,location_id,year,total_kcal,carbohydrates,protein,fat,lex,lex_females,lex_males
0,Afghanistan,4,1961,2999,2321.77,339.64,337.59,33.0681,33.8128,32.4086
1,Afghanistan,4,1962,2917,2246.59,331.92,338.49,33.5471,34.2969,32.8833
2,Afghanistan,4,1963,2698,2042.39,308.48,347.13,34.0162,34.7731,33.3461
3,Afghanistan,4,1964,2953,2268.49,333.96,350.55,34.4942,35.2464,33.8282
4,Afghanistan,4,1965,2956,2262.99,335.44,357.57,34.9528,35.7021,34.2889


In [4]:
dataset_df.tail()

Unnamed: 0,location,location_id,year,total_kcal,carbohydrates,protein,fat,lex,lex_females,lex_males
9638,Zambia,894,2015,2130,1413.8,240.28,475.92,61.2078,63.5089,58.785
9639,Zambia,894,2016,2181,1496.62,239.96,444.42,61.7937,64.1205,59.3493
9640,Zambia,894,2017,2232,1540.95,234.48,456.57,62.1201,64.6084,59.5269
9641,Zambia,894,2018,2254,1553.63,232.28,468.09,62.3422,64.9158,59.6741
9642,Zambia,894,2019,2267,1527.27,243.56,496.17,62.7926,65.4095,60.0801


The size of the dataset_df object will be checked.

In [5]:
import sys
print(sys.getsizeof(dataset_df))

1268890


The dataset_df object takes up about 1.27 MB of memory (1,268,890 bytes), so it can be processed and analysed locally.
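
As a complementary check, pandas can report memory use per column. The sketch below is a minimal illustration on a small stand-in frame (`sample_df` is a hypothetical name, with a few values copied from the rows displayed above, not the full dataset). `DataFrame.memory_usage(deep=True)` also counts the bytes held by string columns.

```python
import pandas as pd

# Small stand-in for dataset_df (values copied from the head() output above)
sample_df = pd.DataFrame({
    "location": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "year": [1961, 1962, 1963],
    "total_kcal": [2999, 2917, 2698],
})

# Per-column memory in bytes; deep=True also measures the stored strings
per_column = sample_df.memory_usage(deep=True)
shallow_total = sample_df.memory_usage(deep=False).sum()
deep_total = per_column.sum()

print(per_column)
print(f"total: {deep_total / 1e6:.6f} MB")
```

The per-column view makes it easy to see whether text columns such as `location` dominate the footprint.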

A list of unique country identifiers (location_id) will be created to allow further work with the Country class.
The length of this list will be calculated.

In [6]:
countries_list = dataset_df['location_id'].unique().tolist()
len(countries_list)

188
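
Because the dictionary built later is keyed by `location_id`, it may be worth confirming that each identifier maps to exactly one country name. The following is a minimal sketch on a hypothetical stand-in frame (`sample_df` is illustrative, not the real data):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "location": ["Afghanistan", "Afghanistan", "Poland", "Zambia"],
    "location_id": [4, 4, 616, 894],
})

# Number of distinct names per identifier; every value should be 1
names_per_id = sample_df.groupby("location_id")["location"].nunique()
ids_are_consistent = bool((names_per_id == 1).all())

# nunique() gives the same count as len(unique().tolist())
n_countries = sample_df["location_id"].nunique()
print(ids_are_consistent, n_countries)
```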

### I.2. The creation of a Country class

For easier data management, a Country class will be defined with the following attributes:
1. *country_name* - the name of the country, as a string,
2. *country_data* - the annual caloric supply per capita and the life expectancy in a given country, as a DataFrame,
3. *correlation* - the Pearson correlation coefficient for the relationship between caloric supply and life expectancy, rounded to 2 decimal places,
4. *p_value* - the p-value (statistical significance) of that correlation, rounded to 2 decimal places,
5. *a* - the slope of the linear regression function y = a*x + b, rounded to 4 decimal places,
6. *b* - the intercept of the linear regression function y = a*x + b, rounded to 4 decimal places.

In the code, the attributes from points 3-6 carry a prefix and a suffix identifying the data combination, e.g. *correlation_tlex*, *p_value_clexf* or *regr_a_tlexm*. The first letter of the suffix denotes the nutritional variable (t - total_kcal, c - carbohydrates, p - protein, f - fat), and the rest denotes the life expectancy variable (lex, lexf - lex_females, lexm - lex_males).

For each combination of the source data (i.e. caloric supply and life expectancy), a separate *stats* method will be defined for the Country class (e.g. *stats_tlex*). Each method performs the following steps:
1. selecting the relevant pair of columns from the *country_data* attribute,
2. calculating the Pearson correlation coefficient for the parameters concerned,
3. calculating the p-value for the parameters concerned,
4. calculating the coefficient *a* of the linear regression (only in the methods based on total_kcal),
5. calculating the coefficient *b* of the linear regression (only in the methods based on total_kcal).
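
The calculation behind a single *stats* method can be sketched in a few lines. The example below is an illustration only: it uses `scipy.stats.linregress` as a swapped-in alternative to the `pearsonr` plus `LinearRegression` combination used in this notebook, and the helper name `pair_stats` is an assumption. The sample values are the first five rows for Poland displayed later in this notebook.

```python
import pandas as pd
import scipy.stats as sp

def pair_stats(df, x_col, y_col):
    """Return (r, p, a, b) for the line y = a*x + b fitted on two columns."""
    res = sp.linregress(df[x_col], df[y_col])
    return (round(res.rvalue, 2), round(res.pvalue, 2),
            round(res.slope, 4), round(res.intercept, 4))

# Hypothetical sample frame (five rows only, so the result is not meaningful
# as a statistic; it only demonstrates the call)
sample_df = pd.DataFrame({
    "total_kcal": [3270, 3273, 3291, 3304, 3358],
    "lex": [67.9367, 67.6365, 68.5615, 68.7841, 69.486],
})

r, p, a, b = pair_stats(sample_df, "total_kcal", "lex")
print(r, p, a, b)
```

For a single predictor, the slope and intercept from `linregress` should agree with an ordinary least-squares fit, and its `rvalue` and `pvalue` should agree with `pearsonr`, so one call could cover all four values.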

The instances of the Country class will be stored in a dictionary, where the keys will be the country identifiers (location_id) and the values will be the instances of the class.

In [7]:
# creation of the Country class
class Country:

    # definition of country attributes
    def __init__(self, country_name = None, country_id = None, country_data = None,
                 correlation_tlex = None, correlation_clex = None, correlation_plex = None, correlation_flex = None,
                 correlation_tlexf = None, correlation_clexf = None, correlation_plexf = None, correlation_flexf = None,
                 correlation_tlexm = None, correlation_clexm = None, correlation_plexm = None, correlation_flexm = None,
                 p_value_tlex = None, p_value_clex = None, p_value_plex = None, p_value_flex = None,
                 p_value_tlexf = None, p_value_clexf = None, p_value_plexf = None, p_value_flexf = None,
                 p_value_tlexm = None, p_value_clexm = None, p_value_plexm = None, p_value_flexm = None,
                 regr_a_tlex = None, regr_b_tlex = None,
                 regr_a_tlexf = None, regr_b_tlexf = None,
                 regr_a_tlexm = None, regr_b_tlexm = None):
        self.country_name = country_name
        self.country_id = country_id
        self.country_data = country_data
        self.correlation_tlex = correlation_tlex
        self.correlation_clex = correlation_clex
        self.correlation_plex = correlation_plex
        self.correlation_flex = correlation_flex
        self.correlation_tlexf = correlation_tlexf
        self.correlation_clexf = correlation_clexf
        self.correlation_plexf = correlation_plexf
        self.correlation_flexf = correlation_flexf
        self.correlation_tlexm = correlation_tlexm
        self.correlation_clexm = correlation_clexm
        self.correlation_plexm = correlation_plexm
        self.correlation_flexm = correlation_flexm
        self.p_value_tlex = p_value_tlex
        self.p_value_clex = p_value_clex
        self.p_value_plex = p_value_plex
        self.p_value_flex = p_value_flex
        self.p_value_tlexf = p_value_tlexf
        self.p_value_clexf = p_value_clexf
        self.p_value_plexf = p_value_plexf
        self.p_value_flexf = p_value_flexf
        self.p_value_tlexm = p_value_tlexm
        self.p_value_clexm = p_value_clexm
        self.p_value_plexm = p_value_plexm
        self.p_value_flexm = p_value_flexm
        self.regr_a_tlex = regr_a_tlex
        self.regr_b_tlex = regr_b_tlex
        self.regr_a_tlexf = regr_a_tlexf
        self.regr_b_tlexf = regr_b_tlexf
        self.regr_a_tlexm = regr_a_tlexm
        self.regr_b_tlexm = regr_b_tlexm
        
    # definition of stats method for data_tlex
    def stats_tlex(self):
        tempdf = self.country_data
        tempdf = tempdf[['total_kcal', 'lex']]
        corr_coef_tlex, p_value_tlex = sp.pearsonr(tempdf['total_kcal'], tempdf['lex'])
        self.correlation_tlex = round(corr_coef_tlex, 2)
        self.p_value_tlex = round(p_value_tlex, 2)
               
        x = np.array(tempdf['total_kcal']).reshape(-1, 1)
        y = tempdf['lex']
        tempmodel = LinearRegression().fit(x, y)
        pre_regr_a_tlex = tempmodel.coef_
        regr_a_tlex = round(pre_regr_a_tlex[0], 4)
        regr_b_tlex = round(tempmodel.intercept_, 4)
        self.regr_a_tlex = regr_a_tlex
        self.regr_b_tlex = regr_b_tlex
                
    # definition of stats method for data_clex
    def stats_clex(self):
        tempdf = self.country_data
        tempdf = tempdf[['carbohydrates', 'lex']]
        corr_coef_clex, p_value_clex = sp.pearsonr(tempdf['carbohydrates'], tempdf['lex'])
        self.correlation_clex = round(corr_coef_clex, 2)
        self.p_value_clex = round(p_value_clex, 2)
        
    # definition of stats method for data_plex
    def stats_plex(self):
        tempdf = self.country_data
        tempdf = tempdf[['protein', 'lex']]
        corr_coef_plex, p_value_plex = sp.pearsonr(tempdf['protein'], tempdf['lex'])
        self.correlation_plex = round(corr_coef_plex, 2)
        self.p_value_plex = round(p_value_plex, 2) 
    
    # definition of stats method for data_flex
    def stats_flex(self):
        tempdf = self.country_data
        tempdf = tempdf[['fat', 'lex']]
        corr_coef_flex, p_value_flex = sp.pearsonr(tempdf['fat'], tempdf['lex'])
        self.correlation_flex = round(corr_coef_flex, 2)
        self.p_value_flex = round(p_value_flex, 2)
    
    # definition of stats method for data_tlexf
    def stats_tlexf(self):
        tempdf = self.country_data
        tempdf = tempdf[['total_kcal', 'lex_females']]
        corr_coef_tlexf, p_value_tlexf = sp.pearsonr(tempdf['total_kcal'], tempdf['lex_females'])
        self.correlation_tlexf = round(corr_coef_tlexf, 2)
        self.p_value_tlexf = round(p_value_tlexf, 2)
        
        x = np.array(tempdf['total_kcal']).reshape(-1, 1)
        y = tempdf['lex_females']
        tempmodel = LinearRegression().fit(x, y)
        pre_regr_a_tlexf = tempmodel.coef_
        regr_a_tlexf = round(pre_regr_a_tlexf[0], 4)
        regr_b_tlexf = round(tempmodel.intercept_, 4)
        self.regr_a_tlexf = regr_a_tlexf
        self.regr_b_tlexf = regr_b_tlexf
                
    # definition of stats method for data_clexf
    def stats_clexf(self):
        tempdf = self.country_data
        tempdf = tempdf[['carbohydrates', 'lex_females']]
        corr_coef_clexf, p_value_clexf = sp.pearsonr(tempdf['carbohydrates'], tempdf['lex_females'])
        self.correlation_clexf = round(corr_coef_clexf, 2)
        self.p_value_clexf = round(p_value_clexf, 2)
    
    # definition of stats method for data_plexf
    def stats_plexf(self):
        tempdf = self.country_data
        tempdf = tempdf[['protein', 'lex_females']]
        corr_coef_plexf, p_value_plexf = sp.pearsonr(tempdf['protein'], tempdf['lex_females'])
        self.correlation_plexf = round(corr_coef_plexf, 2)
        self.p_value_plexf = round(p_value_plexf, 2) 
    
    # definition of stats method for data_flexf
    def stats_flexf(self):
        tempdf = self.country_data
        tempdf = tempdf[['fat', 'lex_females']]
        corr_coef_flexf, p_value_flexf = sp.pearsonr(tempdf['fat'], tempdf['lex_females'])
        self.correlation_flexf = round(corr_coef_flexf, 2)
        self.p_value_flexf = round(p_value_flexf, 2)
        
    # definition of stats method for data_tlexm
    def stats_tlexm(self):
        tempdf = self.country_data
        tempdf = tempdf[['total_kcal', 'lex_males']]
        corr_coef_tlexm, p_value_tlexm = sp.pearsonr(tempdf['total_kcal'], tempdf['lex_males'])
        self.correlation_tlexm = round(corr_coef_tlexm, 2)
        self.p_value_tlexm = round(p_value_tlexm, 2)
        
        x = np.array(tempdf['total_kcal']).reshape(-1, 1)
        y = tempdf['lex_males']
        tempmodel = LinearRegression().fit(x, y)
        pre_regr_a_tlexm = tempmodel.coef_
        regr_a_tlexm = round(pre_regr_a_tlexm[0], 4)
        regr_b_tlexm = round(tempmodel.intercept_, 4)
        self.regr_a_tlexm = regr_a_tlexm
        self.regr_b_tlexm = regr_b_tlexm
                
    # definition of stats method for data_clexm
    def stats_clexm(self):
        tempdf = self.country_data
        tempdf = tempdf[['carbohydrates', 'lex_males']]
        corr_coef_clexm, p_value_clexm = sp.pearsonr(tempdf['carbohydrates'], tempdf['lex_males'])
        self.correlation_clexm = round(corr_coef_clexm, 2)
        self.p_value_clexm = round(p_value_clexm, 2)
    
    # definition of stats method for data_plexm
    def stats_plexm(self):
        tempdf = self.country_data
        tempdf = tempdf[['protein', 'lex_males']]
        corr_coef_plexm, p_value_plexm = sp.pearsonr(tempdf['protein'], tempdf['lex_males'])
        self.correlation_plexm = round(corr_coef_plexm, 2)
        self.p_value_plexm = round(p_value_plexm, 2) 
    
    # definition of stats method for data_flexm
    def stats_flexm(self):
        tempdf = self.country_data
        tempdf = tempdf[['fat', 'lex_males']]
        corr_coef_flexm, p_value_flexm = sp.pearsonr(tempdf['fat'], tempdf['lex_males'])
        self.correlation_flexm = round(corr_coef_flexm, 2)
        self.p_value_flexm = round(p_value_flexm, 2) 
        

In [8]:
countries_dict = {}
for i in countries_list:
    countries_dict[i] = Country()

print(countries_dict)

{4: <__main__.Country object at 0x0000020D078861B0>, 8: <__main__.Country object at 0x0000020D07628EC0>, 12: <__main__.Country object at 0x0000020D078862D0>, 24: <__main__.Country object at 0x0000020D07886990>, 28: <__main__.Country object at 0x0000020D07886300>, 31: <__main__.Country object at 0x0000020D07886E10>, 32: <__main__.Country object at 0x0000020D07886E70>, 36: <__main__.Country object at 0x0000020D07884560>, 40: <__main__.Country object at 0x0000020D07886F00>, 44: <__main__.Country object at 0x0000020D07886F30>, 50: <__main__.Country object at 0x0000020D07886F60>, 51: <__main__.Country object at 0x0000020D07886F90>, 52: <__main__.Country object at 0x0000020D07886FC0>, 56: <__main__.Country object at 0x0000020D07886FF0>, 58: <__main__.Country object at 0x0000020D07887020>, 60: <__main__.Country object at 0x0000020D07887050>, 68: <__main__.Country object at 0x0000020D07887080>, 70: <__main__.Country object at 0x0000020D078870B0>, 72: <__main__.Country object at 0x0000020D07887
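
The printed dictionary shows only memory addresses, because the class defines no `__repr__` method. A minimal sketch of how a readable representation could be added is given below. `CountryStub` is a hypothetical, stripped-down stand-in, not the Country class used in this notebook.

```python
class CountryStub:
    """Hypothetical stand-in with only two of the Country attributes."""

    def __init__(self, country_name=None, country_id=None):
        self.country_name = country_name
        self.country_id = country_id

    def __repr__(self):
        # Used by print() when the object sits inside a dict or list
        return f"CountryStub(name={self.country_name!r}, id={self.country_id!r})"

# Dict comprehension equivalent to a loop that fills the dictionary key by key
stub_dict = {i: CountryStub(country_id=i) for i in [4, 8, 12]}
print(stub_dict)
```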

Then, by iterating over the key-value pairs of the dictionary, each instance of the Country class will be assigned the data, the name and the identifier of the corresponding country as attributes.

In [9]:
for key, value in countries_dict.items():
    temp_filter = dataset_df['location_id'] == key
    value.country_data = dataset_df[temp_filter]
    value.country_name = value.country_data['location'].unique()[0]
    value.country_id = value.country_data['location_id'].unique()[0]
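
The same per-country split can also be obtained with `DataFrame.groupby`, which avoids building a boolean filter for every key. The following is a minimal sketch on a hypothetical stand-in frame (`sample_df` and `data_by_id` are illustrative names):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "location": ["Afghanistan", "Afghanistan", "Poland"],
    "location_id": [4, 4, 616],
    "year": [1961, 1962, 1961],
})

# One sub-frame per identifier; .copy() makes each sub-frame independent
data_by_id = {loc_id: group.copy()
              for loc_id, group in sample_df.groupby("location_id")}

print(data_by_id[4])
print(data_by_id[616]["location"].iloc[0])
```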

In order to test the correct operation of the loop above, the attributes for Poland will be shown.

In [10]:
countries_dict[616].country_data.head()

Unnamed: 0,location,location_id,year,total_kcal,carbohydrates,protein,fat,lex,lex_females,lex_males
7017,Poland,616,1961,3270,2088.77,386.08,795.15,67.9367,70.8064,64.8557
7018,Poland,616,1962,3273,2089.44,386.88,796.68,67.6365,70.5363,64.5455
7019,Poland,616,1963,3291,2085.39,391.2,814.41,68.5615,71.5218,65.4025
7020,Poland,616,1964,3304,2074.74,394.96,834.3,68.7841,71.6129,65.7483
7021,Poland,616,1965,3358,2094.24,403.72,860.04,69.486,72.3612,66.3999


In [11]:
countries_dict[616].country_data.tail()

Unnamed: 0,location,location_id,year,total_kcal,carbohydrates,protein,fat,lex,lex_females,lex_males
7071,Poland,616,2015,3373,1868.01,403.48,1101.51,77.4151,81.307,73.4696
7072,Poland,616,2016,3451,1919.96,415.76,1115.28,77.8025,81.7191,73.8273
7073,Poland,616,2017,3502,1963.9,421.2,1116.9,77.7205,81.5418,73.8484
7074,Poland,616,2018,3542,1966.42,423.76,1151.82,77.6282,81.4865,73.7447
7075,Poland,616,2019,3508,1978.73,419.84,1109.43,77.9272,81.7395,74.0824


In [12]:
print(countries_dict[616].country_name)

Poland


### II. The calculation of correlation coefficient and statistical significance coefficient

The *stats* methods of the Country class will be used for the calculation.

In [13]:
for key, value in countries_dict.items():
    value.stats_tlex()
    value.stats_clex()
    value.stats_plex()
    value.stats_flex()
    value.stats_tlexf()
    value.stats_clexf()
    value.stats_plexf()
    value.stats_flexf()
    value.stats_tlexm()
    value.stats_clexm()
    value.stats_plexm()
    value.stats_flexm()
    print(value.country_name, value.country_id,
          value.correlation_tlex, value.p_value_tlex, value.regr_a_tlex, value.regr_b_tlex,
          value.correlation_clex, value.p_value_clex,
          value.correlation_plex, value.p_value_plex,
          value.correlation_flex, value.p_value_flex,
          value.correlation_tlexf, value.p_value_tlexf, value.regr_a_tlexf, value.regr_b_tlexf,
          value.correlation_clexf, value.p_value_clexf,
          value.correlation_plexf, value.p_value_plexf,
          value.correlation_flexf, value.p_value_flexf,
          value.correlation_tlexm, value.p_value_tlexm, value.regr_a_tlexm, value.regr_b_tlexm,
          value.correlation_clexm, value.p_value_clexm,
          value.correlation_plexm, value.p_value_plexm,
          value.correlation_flexm, value.p_value_flexm)

Afghanistan 4 -0.77 0.0 -0.0218 98.6684 -0.74 0.0 -0.86 0.0 -0.5 0.0 -0.78 0.0 -0.0223 101.6533 -0.75 0.0 -0.87 0.0 -0.45 0.0 -0.76 0.0 -0.0211 95.4887 -0.72 0.0 -0.84 0.0 -0.54 0.0
Albania 8 0.89 0.0 0.018 22.78 0.52 0.0 0.89 0.0 0.86 0.0 0.88 0.0 0.0178 26.1165 0.5 0.0 0.88 0.0 0.85 0.0 0.9 0.0 0.0183 19.2219 0.53 0.0 0.89 0.0 0.87 0.0
Algeria 12 0.97 0.0 0.0207 8.0353 0.97 0.0 0.98 0.0 0.91 0.0 0.97 0.0 0.0209 8.6734 0.97 0.0 0.98 0.0 0.91 0.0 0.97 0.0 0.0204 7.5202 0.97 0.0 0.98 0.0 0.91 0.0
Angola 24 0.84 0.0 0.0233 0.9695 0.68 0.0 0.83 0.0 0.9 0.0 0.83 0.0 0.0221 6.1747 0.66 0.0 0.83 0.0 0.91 0.0 0.85 0.0 0.0241 -3.2383 0.69 0.0 0.83 0.0 0.89 0.0
Antigua and Barbuda 28 0.43 0.0 0.0075 56.798 -0.29 0.03 0.69 0.0 0.69 0.0 0.43 0.0 0.0076 59.1613 -0.29 0.02 0.69 0.0 0.69 0.0 0.45 0.0 0.0078 53.4426 -0.28 0.03 0.71 0.0 0.71 0.0
Azerbaijan 31 0.95 0.0 0.0087 43.8774 0.91 0.0 0.95 0.0 0.95 0.0 0.95 0.0 0.0077 50.2183 0.91 0.0 0.95 0.0 0.95 0.0 0.94 0.0 0.0097 37.548 0.91 0.0 0.94 0.0 0
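
Many p-values in this output appear as 0.0 because they are rounded to 2 decimal places; the underlying values are small but not zero. The sketch below illustrates this on synthetic data (the arrays are made up for the example) and shows scientific notation as one possible way to keep the magnitude.

```python
import numpy as np
import scipy.stats as sp

# Synthetic, strongly correlated data (not taken from the dataset)
x = np.arange(30, dtype=float)
y = x + 3 * np.sin(x)

r, p = sp.pearsonr(x, y)

rounded_p = round(p, 2)   # the rounding applied in the stats methods
sci_p = f"{p:.2e}"        # scientific notation keeps the magnitude

print(round(r, 2), rounded_p, sci_p)
```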

The results for each country will be stored in a DataFrame object.

In [14]:
results_countries = []
results_id = []
results_corr_tlex = []
results_p_tlex = []
results_regr_a_tlex = []
results_regr_b_tlex = []
results_corr_clex = []
results_p_clex = []
results_corr_plex = []
results_p_plex = []
results_corr_flex = []
results_p_flex = []
results_corr_tlexf = []
results_p_tlexf = []
results_regr_a_tlexf = []
results_regr_b_tlexf = []
results_corr_clexf = []
results_p_clexf = []
results_corr_plexf = []
results_p_plexf = []
results_corr_flexf = []
results_p_flexf = []
results_corr_tlexm = []
results_p_tlexm = []
results_regr_a_tlexm = []
results_regr_b_tlexm = []
results_corr_clexm = []
results_p_clexm = []
results_corr_plexm = []
results_p_plexm = []
results_corr_flexm = []
results_p_flexm = []

for key, value in countries_dict.items():
    results_countries.append(value.country_name)
    results_id.append(value.country_id)
    results_corr_tlex.append(value.correlation_tlex) 
    results_p_tlex.append(value.p_value_tlex)
    results_regr_a_tlex.append(value.regr_a_tlex)
    results_regr_b_tlex.append(value.regr_b_tlex)
    results_corr_clex.append(value.correlation_clex) 
    results_p_clex.append(value.p_value_clex)
    results_corr_plex.append(value.correlation_plex) 
    results_p_plex.append(value.p_value_plex)
    results_corr_flex.append(value.correlation_flex) 
    results_p_flex.append(value.p_value_flex)
    results_corr_tlexf.append(value.correlation_tlexf) 
    results_p_tlexf.append(value.p_value_tlexf)
    results_regr_a_tlexf.append(value.regr_a_tlexf)
    results_regr_b_tlexf.append(value.regr_b_tlexf)
    results_corr_clexf.append(value.correlation_clexf) 
    results_p_clexf.append(value.p_value_clexf)
    results_corr_plexf.append(value.correlation_plexf) 
    results_p_plexf.append(value.p_value_plexf)
    results_corr_flexf.append(value.correlation_flexf) 
    results_p_flexf.append(value.p_value_flexf)
    results_corr_tlexm.append(value.correlation_tlexm) 
    results_p_tlexm.append(value.p_value_tlexm)
    results_regr_a_tlexm.append(value.regr_a_tlexm)
    results_regr_b_tlexm.append(value.regr_b_tlexm)
    results_corr_clexm.append(value.correlation_clexm) 
    results_p_clexm.append(value.p_value_clexm)
    results_corr_plexm.append(value.correlation_plexm) 
    results_p_plexm.append(value.p_value_plexm)
    results_corr_flexm.append(value.correlation_flexm) 
    results_p_flexm.append(value.p_value_flexm)

result_df = pd.DataFrame({'country' : results_countries, 'id' : results_id,
                          'corr_tlex' : results_corr_tlex, 'p_tlex' : results_p_tlex,
                          'regr_a_tlex' : results_regr_a_tlex, 'regr_b_tlex' : results_regr_b_tlex,
                          'corr_clex' : results_corr_clex, 'p_clex' : results_p_clex,
                          'corr_plex' : results_corr_plex, 'p_plex' : results_p_plex,
                          'corr_flex' : results_corr_flex, 'p_flex' : results_p_flex,
                          'corr_tlexf' : results_corr_tlexf, 'p_tlexf' : results_p_tlexf,
                          'regr_a_tlexf' : results_regr_a_tlexf, 'regr_b_tlexf' : results_regr_b_tlexf,
                          'corr_clexf' : results_corr_clexf, 'p_clexf' : results_p_clexf,
                          'corr_plexf' : results_corr_plexf, 'p_plexf' : results_p_plexf,
                          'corr_flexf' : results_corr_flexf, 'p_flexf' : results_p_flexf,
                          'corr_tlexm' : results_corr_tlexm, 'p_tlexm' : results_p_tlexm,
                          'regr_a_tlexm' : results_regr_a_tlexm, 'regr_b_tlexm' : results_regr_b_tlexm,
                          'corr_clexm' : results_corr_clexm, 'p_clexm' : results_p_clexm,
                          'corr_plexm' : results_corr_plexm, 'p_plexm' : results_p_plexm,
                          'corr_flexm' : results_corr_flexm, 'p_flexm' : results_p_flexm})
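
The same table can be assembled more compactly by collecting one dictionary per country and passing the list to `pd.DataFrame`. This is a sketch only: `ResultStub` is a hypothetical stand-in with three of the Country attributes, and the two sample values are taken from the output printed earlier.

```python
import pandas as pd

class ResultStub:
    """Hypothetical stand-in holding three of the Country attributes."""

    def __init__(self, country_name, country_id, correlation_tlex):
        self.country_name = country_name
        self.country_id = country_id
        self.correlation_tlex = correlation_tlex

stub_dict = {4: ResultStub("Afghanistan", 4, -0.77),
             8: ResultStub("Albania", 8, 0.89)}

# Map each output column to the attribute it is read from
columns = {"country": "country_name", "id": "country_id",
           "corr_tlex": "correlation_tlex"}

# One dictionary per country, one key per output column
rows = [{col: getattr(obj, attr) for col, attr in columns.items()}
        for obj in stub_dict.values()]
stub_result_df = pd.DataFrame(rows)
print(stub_result_df)
```

With this layout, adding a further statistic means adding one entry to the `columns` mapping rather than a new list and a new `append` call.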

In order to test the correct operation of the code, the results for Poland will be shown.

In [15]:
result_df.loc[result_df['id'] == 616]

Unnamed: 0,country,id,corr_tlex,p_tlex,regr_a_tlex,regr_b_tlex,corr_clex,p_clex,corr_plex,p_plex,...,corr_tlexm,p_tlexm,regr_a_tlexm,regr_b_tlexm,corr_clexm,p_clexm,corr_plexm,p_plexm,corr_flexm,p_flexm
134,Poland,616,0.16,0.22,0.0057,53.0656,-0.61,0.0,-0.13,0.34,...,0.16,0.23,0.0053,50.4938,-0.53,0.0,-0.14,0.31,0.62,0.0


The results will be saved in the "python_to_bi_countries_stats.csv" file.

In [16]:
result_df.to_csv(r'C:\Users\Sebastian\Documents\Data Analyst\Portfolio\1\python_to_bi_countries_stats.csv', index=False)
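
The absolute Windows path ties the notebook to one machine. A minimal sketch of a more portable variant follows, assuming the files live in a directory chosen by the reader. Here a temporary directory stands in for that directory, and `data_dir` and `stub_df` are hypothetical names.

```python
import tempfile
from pathlib import Path

import pandas as pd

stub_df = pd.DataFrame({"country": ["Poland"], "id": [616], "corr_tlex": [0.16]})

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)  # stand-in for the project's data directory
    out_path = data_dir / "python_to_bi_countries_stats.csv"

    # index=False keeps the DataFrame index out of the file
    stub_df.to_csv(out_path, index=False)

    # Reading the file back confirms that the columns survive the round trip
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```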