# Analysis 2

## Model Building

In this section, we try to fit a model that uses the columns we picked as predictors of the level of web accessibility. Although we know that the relationships between many of these columns and the level of web accessibility are not causal, the columns can still serve as predictors if the correlation is strong. We will select the model using AIC and BIC. AIC (Akaike's Information Criterion) tends to overestimate the number of significant predictors and works best for large datasets, where the number of observations divided by the number of candidate variables is greater than 40. Since this is not the case for our data set, we will start with BIC (Bayesian Information Criterion), which penalizes each additional predictor more heavily and tends to underestimate the number of significant predictors.
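To make the difference between the two criteria concrete, here is a minimal sketch using the textbook forms of AIC and BIC for a Gaussian linear model (up to an additive constant). The two residual sums of squares and predictor counts are made-up illustration values, and `n = 119` is taken from the observation count reported in the regression summaries later in this notebook; none of this reproduces what `LassoLarsIC` computes internally.

```python
import math

def aic(n, rss, k):
    """AIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k

def bic(n, rss, k):
    """BIC for the same model; the per-parameter penalty is ln(n) instead of 2."""
    return n * math.log(rss / n) + k * math.log(n)

n = 119                       # observation count (assumed, see lead-in)
rss_small, k_small = 80.0, 3  # hypothetical fit with 3 predictors
rss_large, k_large = 70.0, 8  # hypothetical fit with 8 predictors

# Lower is better for both criteria.
aic_prefers_large = aic(n, rss_large, k_large) < aic(n, rss_small, k_small)
bic_prefers_large = bic(n, rss_large, k_large) < bic(n, rss_small, k_small)

print("per-parameter penalty: AIC = 2, BIC =", round(math.log(n), 2))
print("AIC prefers the larger model:", aic_prefers_large)
print("BIC prefers the larger model:", bic_prefers_large)
```

Because ln(n) exceeds 2 for any n above about 7, BIC charges more per predictor than AIC here, which is why it can reject a larger model that AIC accepts.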

In [36]:
from sklearn import linear_model

X = data.drop(['country', 'web_access'], axis=1)  # X holds all of the candidate predictor columns
y = data['web_access']  # y is the dependent variable, web accessibility

model = linear_model.LassoLarsIC('bic').fit(X, y)

print('Coefficients:\n' + str(model.coef_))

Coefficients:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


Given that all of the coefficients are zero, the BIC-selected model keeps none of the predictors. We will try AIC instead:

In [37]:
modelA = linear_model.LassoLarsIC('aic').fit(X, y)

print('Coefficients:\n' + str(modelA.coef_))
print('\nR squared: ' + str(modelA.score(X,y)))

Coefficients:
[0.00000000e+00 0.00000000e+00 0.00000000e+00 1.16543390e-02
 0.00000000e+00 0.00000000e+00 0.00000000e+00 7.83794449e-03
 0.00000000e+00 3.29149392e-06 9.49257984e-05]

R squared: 0.23494436688426745


Most of the nonzero coefficients are close to zero, and the R squared value is very low. Since AIC is the more generous of the two criteria, the columns we chose arbitrarily clearly do not have a strong correlation with web accessibility. After seeing this result, we reconsidered what we are really trying to do here. We realized that it did not make sense to drop columns while cleaning the dataset, because some of those columns may have had a correlation with web access. Therefore, we are going to try building a model with the entire internet accessibility dataset. It is already clean apart from the long column names. Since there are many columns and the long names accurately describe what each one represents, we will leave the full dataset as is for now, as you can see below.

In [38]:
full_internet_data = pd.read_csv("3i-index-data.csv", encoding='latin-1')

new_colnames_full = [c[c.find(')') + 2:].lower() for c in full_internet_data.columns]
new_colnames_full = [c.replace(' ', '_') for c in new_colnames_full]
new_colnames_full = [c[:c.find('/')] + '(' + c[c.find('/') + 2:] + ')' for c in new_colnames_full]

full_internet_data.columns = new_colnames_full

full_internet_data.head()

Unnamed: 0,s(o),ountry(roup),internet_users_(%_of_households),fixed-line_broadband_subscribers_(per_100_inhabitants),mobile_subscribers_(per_100_inhabitants),gender_gap_in_internet_access_(%_difference),gender_gap_in_mobile_phone_access_(%_difference),average_fixed_broadband_upload_speed_(kbps),average_fixed_broadband_download_speed_(kbps),average_fixed_broadband_latency_(ms),...,internet_users_(population)_(millions),offline_population_(millions),internet_access_gender_gap_(difference_in_percentage_points),mobile_phone_access_gender_gap_(difference_in_percentage_points),internet_users_(percent_of_population)_(%_of_population),male_internet_users_(%_of_male_population),female_internet_users_(%_of_female_population),male_mobile_phone_subscribers_(%_of_male_population),female_mobile_phone_subscribers_(%_of_female_population),total_fixed_line_broadband_subscribers_(number_of_subscriptions)
0,DZ,Algeria,74.4,7.26,121.9,21.7,7.3,2090.0,3990.0,64.0,...,25.16,17.07,13.0,6.0,59.6,60.0,47.0,82.0,76.0,3063835.0
1,AR,Argentina,75.9,19.1,130.0,-5.7,-3.6,7960.0,33960.0,31.0,...,32.64,11.29,-4.0,-3.0,74.3,70.0,74.0,83.0,86.0,8473655.0
2,AU,Australia,86.1,32.22,113.6,2.1,2.2,20030.0,42630.0,24.0,...,21.28,3.31,2.0,2.0,86.5,94.0,92.0,93.0,91.0,7922000.0
3,AT,Austria,88.8,28.35,123.5,2.2,-1.0,16920.0,55030.0,20.0,...,7.8,1.09,2.0,-1.0,87.7,91.0,89.0,96.0,97.0,2521000.0
4,AZ,Azerbaijan,78.2,18.2,103.9,15.0,11.5,23490.0,21200.0,31.0,...,7.94,2.01,12.0,11.0,79.8,80.0,68.0,96.0,85.0,1810474.0


In [39]:
X = full_internet_data.drop(['s(o)', 'ountry(roup)', 'level_of_web_accessibility_(qualitative_rating_0-4,_4=best)'], axis=1)  # X holds all of the candidate predictor columns
y = full_internet_data['level_of_web_accessibility_(qualitative_rating_0-4,_4=best)']  # y is the dependent variable, web accessibility

model = linear_model.LassoLarsIC('bic').fit(X, y)

print('Coefficients:\n' + str(model.coef_))
print('\nR squared: ' + str(model.score(X,y)))

Coefficients:
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.04015812 0.
 0.07166296 0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.        ]

R squared: 0.1294469934089938


In [40]:
modelA = linear_model.LassoLarsIC('aic').fit(X, y)

print('Coefficients:\n' + str(modelA.coef_))
print('\nR squared: ' + str(modelA.score(X,y)))

Coefficients:
[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00 -1.40793729e-07
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  7.60446042e-02  4.98340508e-03  0.00000000e+00
  0.00000000e+00  0.00000000e+00  2.30333322e-02  0.00000000e+00
  0.00000000e+00  3.71069061e-04  0.00000000e+00  0.00000000e+00
 -1.21491943e-01  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -6.03947231e-03  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  6.58977357e-02  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  3.13541277e-01  0.00000000e+00  0.00000000e+00  1.41328427e-01
  0.0000000

In [41]:
# find which variables the AIC-selected model keeps (those with nonzero coefficients)
# note: `coefficients` holds the column names, not the coefficient values
coefficients = [col for col, coef in zip(X.columns, modelA.coef_) if coef != 0]

#do linear regression with only those columns as the predictors
import statsmodels.api as sm

reg12 = sm.OLS(y, full_internet_data[coefficients]).fit()
reg12.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.833
Model:,OLS,Adj. R-squared (uncentered):,0.814
Method:,Least Squares,F-statistic:,44.43
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,5.32e-36
Time:,20:11:34,Log-Likelihood:,-132.71
No. Observations:,119,AIC:,289.4
Df Residuals:,107,BIC:,322.8
Df Model:,12,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
bandwidth_capacity_(bit/s_per_internet_user),-2.878e-07,9.75e-08,-2.952,0.004,-4.81e-07,-9.45e-08
"private_sector_initiatives_to_make_wi-fi_available_(qualitative_rating_0-2,_2=best)",0.1297,0.104,1.250,0.214,-0.076,0.335
internet_exchange_points_(number_of_ixps_per_10_million_inhabitants),0.0215,0.020,1.067,0.288,-0.018,0.061
mobile_phone_cost_(prepaid_tariff)_(%_of_monthly_gni_per_capita),0.0640,0.020,3.180,0.002,0.024,0.104
"average_revenue_per_user_(arpu,_annualized)_(usd)",-3.859e-06,0.001,-0.004,0.997,-0.002,0.002
"availability_of_basic_information_in_the_local_language_(qualitative_rating_0-3,_3=best)",-0.2766,0.146,-1.889,0.062,-0.567,0.014
value_of_e-finance_(%),-0.0150,0.006,-2.569,0.012,-0.027,-0.003
"support_for_digital_literacy_(qualitative_rating_0-3,_3=best)",0.2060,0.100,2.070,0.041,0.009,0.403
"technology-neutrality_policy_for_spectrum_use_(qualitative_rating_0-1,_1=best)",0.3957,0.242,1.637,0.104,-0.083,0.875

0,1,2,3
Omnibus:,1.733,Durbin-Watson:,1.994
Prob(Omnibus):,0.42,Jarque-Bera (JB):,1.283
Skew:,0.087,Prob(JB):,0.527
Kurtosis:,3.478,Cond. No.,2700000.0
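
The summary above reports "R-squared (uncentered)" because `sm.OLS` is given only the predictor columns and no intercept column (statsmodels does not add one automatically; `sm.add_constant` is the usual way to include it). An uncentered R squared measures variation around zero rather than around the mean of y, so it is not directly comparable to the R squared values printed by `model.score` earlier, which are centered. The sketch below illustrates the gap with made-up ratings and a "model" that always predicts the mean; the helper `r_squared` and the toy data are assumptions for illustration only.

```python
def r_squared(y, y_hat, centered=True):
    """R^2 = 1 - RSS/TSS, where TSS is measured around mean(y) (centered)
    or around zero (uncentered)."""
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    baseline = sum(y) / len(y) if centered else 0.0
    tss = sum((yi - baseline) ** 2 for yi in y)
    return 1.0 - rss / tss

# Hypothetical ratings on a 0-4 scale and a "model" that always predicts the mean.
y_toy = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]
mean_y = sum(y_toy) / len(y_toy)
y_hat_toy = [mean_y] * len(y_toy)

r2_centered = r_squared(y_toy, y_hat_toy, centered=True)
r2_uncentered = r_squared(y_toy, y_hat_toy, centered=False)

print("centered R^2:  ", round(r2_centered, 3))
print("uncentered R^2:", round(r2_uncentered, 3))
```

A predictor-free guess scores zero on the centered measure yet looks strong on the uncentered one, so the jump from roughly 0.23 to roughly 0.8 between the two model-fitting steps should be read with that difference in mind.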


In [42]:
coefficients.remove('bandwidth_capacity_(bit/s_per_internet_user)')
reg11 = sm.OLS(y, full_internet_data[coefficients]).fit()
reg11.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.819
Model:,OLS,Adj. R-squared (uncentered):,0.801
Method:,Least Squares,F-statistic:,44.5
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,4.72e-35
Time:,20:11:34,Log-Likelihood:,-137.37
No. Observations:,119,AIC:,296.7
Df Residuals:,108,BIC:,327.3
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
"private_sector_initiatives_to_make_wi-fi_available_(qualitative_rating_0-2,_2=best)",0.1334,0.107,1.243,0.217,-0.079,0.346
internet_exchange_points_(number_of_ixps_per_10_million_inhabitants),0.0169,0.021,0.815,0.417,-0.024,0.058
mobile_phone_cost_(prepaid_tariff)_(%_of_monthly_gni_per_capita),0.0683,0.021,3.288,0.001,0.027,0.109
"average_revenue_per_user_(arpu,_annualized)_(usd)",0.0002,0.001,0.190,0.849,-0.002,0.002
"availability_of_basic_information_in_the_local_language_(qualitative_rating_0-3,_3=best)",-0.2066,0.150,-1.381,0.170,-0.503,0.090
value_of_e-finance_(%),-0.0159,0.006,-2.643,0.009,-0.028,-0.004
"support_for_digital_literacy_(qualitative_rating_0-3,_3=best)",0.2095,0.103,2.033,0.044,0.005,0.414
"technology-neutrality_policy_for_spectrum_use_(qualitative_rating_0-1,_1=best)",0.3890,0.250,1.555,0.123,-0.107,0.885
"government_efforts_to_promote_5g_(qualitative_rating_0-3,_3=best)",0.1952,0.083,2.344,0.021,0.030,0.360

0,1,2,3
Omnibus:,1.423,Durbin-Watson:,1.964
Prob(Omnibus):,0.491,Jarque-Bera (JB):,0.974
Skew:,0.001,Prob(JB):,0.614
Kurtosis:,3.443,Cond. No.,661.0


In [43]:
coefficients.remove('average_revenue_per_user_(arpu,_annualized)_(usd)')
reg10 = sm.OLS(y, full_internet_data[coefficients]).fit()
reg10.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.819
Model:,OLS,Adj. R-squared (uncentered):,0.803
Method:,Least Squares,F-statistic:,49.38
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,6.64e-36
Time:,20:11:34,Log-Likelihood:,-137.39
No. Observations:,119,AIC:,294.8
Df Residuals:,109,BIC:,322.6
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
"private_sector_initiatives_to_make_wi-fi_available_(qualitative_rating_0-2,_2=best)",0.1328,0.107,1.243,0.217,-0.079,0.345
internet_exchange_points_(number_of_ixps_per_10_million_inhabitants),0.0172,0.021,0.835,0.406,-0.024,0.058
mobile_phone_cost_(prepaid_tariff)_(%_of_monthly_gni_per_capita),0.0676,0.020,3.322,0.001,0.027,0.108
"availability_of_basic_information_in_the_local_language_(qualitative_rating_0-3,_3=best)",-0.2132,0.145,-1.472,0.144,-0.500,0.074
value_of_e-finance_(%),-0.0164,0.005,-2.986,0.003,-0.027,-0.006
"support_for_digital_literacy_(qualitative_rating_0-3,_3=best)",0.2101,0.103,2.050,0.043,0.007,0.413
"technology-neutrality_policy_for_spectrum_use_(qualitative_rating_0-1,_1=best)",0.3832,0.247,1.550,0.124,-0.107,0.873
"government_efforts_to_promote_5g_(qualitative_rating_0-3,_3=best)",0.1989,0.081,2.468,0.015,0.039,0.359
"democracy_index_(score,_0-10;_10_=_best)",0.1176,0.056,2.087,0.039,0.006,0.229

0,1,2,3
Omnibus:,1.445,Durbin-Watson:,1.965
Prob(Omnibus):,0.486,Jarque-Bera (JB):,0.998
Skew:,0.005,Prob(JB):,0.607
Kurtosis:,3.448,Cond. No.,183.0


In [44]:
coefficients.remove('private_sector_initiatives_to_make_wi-fi_available_(qualitative_rating_0-2,_2=best)')
reg9 = sm.OLS(y, full_internet_data[coefficients]).fit()
reg9.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.817
Model:,OLS,Adj. R-squared (uncentered):,0.802
Method:,Least Squares,F-statistic:,54.43
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,1.85e-36
Time:,20:11:34,Log-Likelihood:,-138.23
No. Observations:,119,AIC:,294.5
Df Residuals:,110,BIC:,319.5
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
internet_exchange_points_(number_of_ixps_per_10_million_inhabitants),0.0165,0.021,0.797,0.427,-0.024,0.057
mobile_phone_cost_(prepaid_tariff)_(%_of_monthly_gni_per_capita),0.0667,0.020,3.273,0.001,0.026,0.107
"availability_of_basic_information_in_the_local_language_(qualitative_rating_0-3,_3=best)",-0.2239,0.145,-1.545,0.125,-0.511,0.063
value_of_e-finance_(%),-0.0163,0.005,-2.958,0.004,-0.027,-0.005
"support_for_digital_literacy_(qualitative_rating_0-3,_3=best)",0.2121,0.103,2.064,0.041,0.008,0.416
"technology-neutrality_policy_for_spectrum_use_(qualitative_rating_0-1,_1=best)",0.4316,0.245,1.764,0.081,-0.053,0.916
"government_efforts_to_promote_5g_(qualitative_rating_0-3,_3=best)",0.2055,0.081,2.550,0.012,0.046,0.365
"democracy_index_(score,_0-10;_10_=_best)",0.1225,0.056,2.173,0.032,0.011,0.234
"eiu_business_environment_rankings_(score,_1-10,_10_=_high)",0.1450,0.099,1.462,0.146,-0.051,0.342

0,1,2,3
Omnibus:,1.524,Durbin-Watson:,1.974
Prob(Omnibus):,0.467,Jarque-Bera (JB):,1.076
Skew:,0.046,Prob(JB):,0.584
Kurtosis:,3.457,Cond. No.,180.0


In [45]:
coefficients.remove('internet_exchange_points_(number_of_ixps_per_10_million_inhabitants)')
reg8 = sm.OLS(y, full_internet_data[coefficients]).fit()
reg8.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.816
Model:,OLS,Adj. R-squared (uncentered):,0.802
Method:,Least Squares,F-statistic:,61.36
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,3.12e-37
Time:,20:11:34,Log-Likelihood:,-138.57
No. Observations:,119,AIC:,293.1
Df Residuals:,111,BIC:,315.4
Df Model:,8,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
mobile_phone_cost_(prepaid_tariff)_(%_of_monthly_gni_per_capita),0.0648,0.020,3.205,0.002,0.025,0.105
"availability_of_basic_information_in_the_local_language_(qualitative_rating_0-3,_3=best)",-0.2362,0.144,-1.641,0.104,-0.521,0.049
value_of_e-finance_(%),-0.0165,0.005,-3.005,0.003,-0.027,-0.006
"support_for_digital_literacy_(qualitative_rating_0-3,_3=best)",0.2103,0.103,2.051,0.043,0.007,0.414
"technology-neutrality_policy_for_spectrum_use_(qualitative_rating_0-1,_1=best)",0.4370,0.244,1.790,0.076,-0.047,0.921
"government_efforts_to_promote_5g_(qualitative_rating_0-3,_3=best)",0.2107,0.080,2.627,0.010,0.052,0.370
"democracy_index_(score,_0-10;_10_=_best)",0.1266,0.056,2.259,0.026,0.016,0.238
"eiu_business_environment_rankings_(score,_1-10,_10_=_high)",0.1551,0.098,1.580,0.117,-0.039,0.350

0,1,2,3
Omnibus:,1.204,Durbin-Watson:,1.947
Prob(Omnibus):,0.548,Jarque-Bera (JB):,0.746
Skew:,0.04,Prob(JB):,0.689
Kurtosis:,3.379,Cond. No.,180.0


In [46]:
coefficients.remove('eiu_business_environment_rankings_(score,_1-10,_10_=_high)')
reg7 = sm.OLS(y, full_internet_data[coefficients]).fit()
reg7.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.811
Model:,OLS,Adj. R-squared (uncentered):,0.8
Method:,Least Squares,F-statistic:,68.85
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,1.22e-37
Time:,20:11:34,Log-Likelihood:,-139.9
No. Observations:,119,AIC:,293.8
Df Residuals:,112,BIC:,313.2
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
mobile_phone_cost_(prepaid_tariff)_(%_of_monthly_gni_per_capita),0.0731,0.020,3.724,0.000,0.034,0.112
"availability_of_basic_information_in_the_local_language_(qualitative_rating_0-3,_3=best)",-0.0888,0.110,-0.805,0.422,-0.307,0.130
value_of_e-finance_(%),-0.0152,0.005,-2.790,0.006,-0.026,-0.004
"support_for_digital_literacy_(qualitative_rating_0-3,_3=best)",0.2376,0.102,2.335,0.021,0.036,0.439
"technology-neutrality_policy_for_spectrum_use_(qualitative_rating_0-1,_1=best)",0.4535,0.246,1.847,0.067,-0.033,0.940
"government_efforts_to_promote_5g_(qualitative_rating_0-3,_3=best)",0.2739,0.070,3.916,0.000,0.135,0.413
"democracy_index_(score,_0-10;_10_=_best)",0.1768,0.046,3.805,0.000,0.085,0.269

0,1,2,3
Omnibus:,1.103,Durbin-Watson:,1.919
Prob(Omnibus):,0.576,Jarque-Bera (JB):,0.648
Skew:,0.064,Prob(JB):,0.723
Kurtosis:,3.338,Cond. No.,178.0


In [47]:
coefficients.remove('availability_of_basic_information_in_the_local_language_(qualitative_rating_0-3,_3=best)')
reg6 = sm.OLS(y, full_internet_data[coefficients]).fit()
reg6.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.81
Model:,OLS,Adj. R-squared (uncentered):,0.8
Method:,Least Squares,F-statistic:,80.46
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,1.79e-38
Time:,20:11:34,Log-Likelihood:,-140.24
No. Observations:,119,AIC:,292.5
Df Residuals:,113,BIC:,309.2
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
mobile_phone_cost_(prepaid_tariff)_(%_of_monthly_gni_per_capita),0.0723,0.020,3.693,0.000,0.034,0.111
value_of_e-finance_(%),-0.0174,0.005,-3.637,0.000,-0.027,-0.008
"support_for_digital_literacy_(qualitative_rating_0-3,_3=best)",0.2111,0.096,2.196,0.030,0.021,0.402
"technology-neutrality_policy_for_spectrum_use_(qualitative_rating_0-1,_1=best)",0.4067,0.238,1.707,0.091,-0.065,0.879
"government_efforts_to_promote_5g_(qualitative_rating_0-3,_3=best)",0.2659,0.069,3.847,0.000,0.129,0.403
"democracy_index_(score,_0-10;_10_=_best)",0.1748,0.046,3.773,0.000,0.083,0.267

0,1,2,3
Omnibus:,0.96,Durbin-Watson:,1.93
Prob(Omnibus):,0.619,Jarque-Bera (JB):,0.554
Skew:,0.123,Prob(JB):,0.758
Kurtosis:,3.226,Cond. No.,173.0


In [48]:
coefficients.remove('technology-neutrality_policy_for_spectrum_use_(qualitative_rating_0-1,_1=best)')
reg5 = sm.OLS(y, full_internet_data[coefficients]).fit()
reg5.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.805
Model:,OLS,Adj. R-squared (uncentered):,0.797
Method:,Least Squares,F-statistic:,94.39
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,7.28e-39
Time:,20:11:34,Log-Likelihood:,-141.76
No. Observations:,119,AIC:,293.5
Df Residuals:,114,BIC:,307.4
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
mobile_phone_cost_(prepaid_tariff)_(%_of_monthly_gni_per_capita),0.0762,0.020,3.884,0.000,0.037,0.115
value_of_e-finance_(%),-0.0170,0.005,-3.532,0.001,-0.027,-0.007
"support_for_digital_literacy_(qualitative_rating_0-3,_3=best)",0.2616,0.092,2.836,0.005,0.079,0.444
"government_efforts_to_promote_5g_(qualitative_rating_0-3,_3=best)",0.2751,0.069,3.959,0.000,0.137,0.413
"democracy_index_(score,_0-10;_10_=_best)",0.2047,0.043,4.734,0.000,0.119,0.290

0,1,2,3
Omnibus:,1.637,Durbin-Watson:,1.92
Prob(Omnibus):,0.441,Jarque-Bera (JB):,1.137
Skew:,0.15,Prob(JB):,0.566
Kurtosis:,3.373,Cond. No.,67.7


In [49]:
print(coefficients)

['mobile_phone_cost_(prepaid_tariff)_(%_of_monthly_gni_per_capita)', 'value_of_e-finance_(%)', 'support_for_digital_literacy_(qualitative_rating_0-3,_3=best)', 'government_efforts_to_promote_5g_(qualitative_rating_0-3,_3=best)', 'democracy_index_(score,_0-10;_10_=_best)']
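
The cells above drop one predictor at a time by hand and refit after each removal. A generic version of that idea, backward elimination by p-value, can be sketched as a loop. This is an illustration under assumptions, not a reproduction of the exact removal order used in this notebook: the function `backward_eliminate`, its `pvalues_for` callback, the 0.05 threshold, and the toy p-values are all hypothetical.

```python
def backward_eliminate(columns, pvalues_for, threshold=0.05):
    """Repeatedly drop the predictor with the largest p-value until every
    remaining p-value is at or below `threshold`.

    `pvalues_for(cols)` must return a mapping {column name: p-value} for a
    model refit on exactly `cols`.
    """
    remaining = list(columns)
    while remaining:
        pvalues = pvalues_for(remaining)
        worst = max(remaining, key=lambda c: pvalues[c])
        if pvalues[worst] <= threshold:
            break  # everything left is below the threshold
        remaining.remove(worst)
    return remaining

# Toy stand-in for a real refit: fixed, made-up p-values per column.
toy_pvalues = {"a": 0.001, "b": 0.40, "c": 0.03, "d": 0.85}
kept = backward_eliminate(toy_pvalues, lambda cols: {c: toy_pvalues[c] for c in cols})
print(kept)
```

With statsmodels, the callback could be something like `lambda cols: sm.OLS(y, full_internet_data[cols]).fit().pvalues`, since the fitted results expose p-values indexed by column name. In a real refit the p-values change after every removal, which is why the loop recomputes them each round.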


In [50]:
newNames = {'mobile_phone_cost_(prepaid_tariff)_(%_of_monthly_gni_per_capita)':'Cell_Cost', 'value_of_e-finance_(%)':'E-Finance_Value','support_for_digital_literacy_(qualitative_rating_0-3,_3=best)':'Support_For_Dig_Lit','government_efforts_to_promote_5g_(qualitative_rating_0-3,_3=best)':'Govt_Promote_5g','democracy_index_(score,_0-10;_10_=_best)':'Democracy_Index'}
full_internet_data = full_internet_data.rename(columns=newNames)
coefficients = list(newNames.values())

In [51]:
from sklearn.preprocessing import PolynomialFeatures
interaction = PolynomialFeatures(degree=2).fit_transform(full_internet_data[coefficients])
interaction = pd.DataFrame(interaction)
full_internet_data[coefficients].head()

Unnamed: 0,Cell_Cost,E-Finance_Value,Support_For_Dig_Lit,Govt_Promote_5g,Democracy_Index
0,2.04,32.0,1.0,0.0,3.5
1,1.12,62.0,3.0,0.0,7.0
2,0.5,44.0,3.0,3.0,9.1
3,0.16,50.0,3.0,3.0,8.3
4,1.38,45.0,1.0,0.0,2.6


In [52]:
interaction.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,20
0,1.0,2.04,32.0,1.0,0.0,3.5,4.1616,65.28,2.04,0.0,...,1024.0,32.0,0.0,112.0,1.0,0.0,3.5,0.0,0.0,12.25
1,1.0,1.12,62.0,3.0,0.0,7.0,1.2544,69.44,3.36,0.0,...,3844.0,186.0,0.0,434.0,9.0,0.0,21.0,0.0,0.0,49.0
2,1.0,0.5,44.0,3.0,3.0,9.1,0.25,22.0,1.5,1.5,...,1936.0,132.0,132.0,400.4,9.0,9.0,27.3,9.0,27.3,82.81
3,1.0,0.16,50.0,3.0,3.0,8.3,0.0256,8.0,0.48,0.48,...,2500.0,150.0,150.0,415.0,9.0,9.0,24.9,9.0,24.9,68.89
4,1.0,1.38,45.0,1.0,0.0,2.6,1.9044,62.1,1.38,0.0,...,2025.0,45.0,0.0,117.0,1.0,0.0,2.6,0.0,0.0,6.76
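
The integer column labels in `interaction` are easier to work with once they are mapped back to feature names. For degree 2 with the default bias column, `PolynomialFeatures` emits the constant, then the original features, then every degree-2 product in `combinations_with_replacement` order. The sketch below rebuilds that ordering with the standard library only; the `names` list and the `^2` / ` * ` label style are choices made here for illustration. (Recent scikit-learn versions also offer `get_feature_names_out` on a fitted `PolynomialFeatures` object, which would require keeping the transformer instead of calling `fit_transform` inline.)

```python
from itertools import combinations_with_replacement

features = ['Cell_Cost', 'E-Finance_Value', 'Support_For_Dig_Lit',
            'Govt_Promote_5g', 'Democracy_Index']

# Column 0 is the constant term, columns 1-5 are the original features,
# and the rest are all degree-2 products in combinations_with_replacement order.
names = ['1'] + list(features)
for a, b in combinations_with_replacement(features, 2):
    names.append(a + '^2' if a == b else a + ' * ' + b)

for index, name in enumerate(names):
    print(index, name)
```

Under this ordering, columns 6, 11, 15, 18 and 20 are the squared terms and column 19 is the product of `Govt_Promote_5g` and `Democracy_Index`, which matches the values shown in the table above (for example, 2.04 squared is 4.1616 in column 6 of the first row).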


In [53]:
internet_data_SQ = full_internet_data.copy()
coefficients_SQ = coefficients.copy()

In [54]:
internet_data_SQ['Cell_Cost_SQ'] = interaction[6]
internet_data_SQ['E-Finance_Value_SQ'] = interaction[11]
internet_data_SQ['Support_For_Dig_Lit_SQ'] = interaction[15]
internet_data_SQ['Govt_Promote_5g_SQ'] = interaction[18]
internet_data_SQ['Democracy_Index_SQ'] = interaction[20]

In [55]:
coefficients_SQ.extend(['Cell_Cost_SQ','E-Finance_Value_SQ','Support_For_Dig_Lit_SQ','Govt_Promote_5g_SQ','Democracy_Index_SQ'])

In [56]:
regSQ5 = sm.OLS(y, internet_data_SQ[coefficients_SQ]).fit()
regSQ5.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.813
Model:,OLS,Adj. R-squared (uncentered):,0.796
Method:,Least Squares,F-statistic:,47.32
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,4.31e-35
Time:,20:11:34,Log-Likelihood:,-139.47
No. Observations:,119,AIC:,298.9
Df Residuals:,109,BIC:,326.7
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Cell_Cost,0.0016,0.072,0.023,0.982,-0.142,0.145
E-Finance_Value,-0.0332,0.022,-1.541,0.126,-0.076,0.009
Support_For_Dig_Lit,0.3744,0.406,0.922,0.358,-0.430,1.179
Govt_Promote_5g,-0.1690,0.294,-0.575,0.566,-0.751,0.413
Democracy_Index,0.4632,0.218,2.124,0.036,0.031,0.896
Cell_Cost_SQ,0.0035,0.004,0.918,0.360,-0.004,0.011
E-Finance_Value_SQ,0.0001,0.000,0.664,0.508,-0.000,0.001
Support_For_Dig_Lit_SQ,-0.0343,0.108,-0.318,0.751,-0.248,0.179
Govt_Promote_5g_SQ,0.1479,0.093,1.590,0.115,-0.037,0.332

0,1,2,3
Omnibus:,0.337,Durbin-Watson:,2.018
Prob(Omnibus):,0.845,Jarque-Bera (JB):,0.081
Skew:,0.022,Prob(JB):,0.96
Kurtosis:,3.12,Cond. No.,17900.0


- finalize model with significant interaction terms
- plot multivariable regression maybe
- write it up

Since adding the squared terms increased the p-values of the original predictors, we will remove them again.

In [57]:
internet_data_inter = full_internet_data.copy()
coefficients_inter = coefficients.copy()

In [58]:
for i in interaction.columns:
    if i not in [0,1,2,3,4,5,6,11,15,18,20]:
        internet_data_inter['interaction ' + str(i)] = interaction[i]
        coefficients_inter.append('interaction ' + str(i))

In [59]:
regi = sm.OLS(y, internet_data_inter[coefficients_inter]).fit()
regi.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.823
Model:,OLS,Adj. R-squared (uncentered):,0.798
Method:,Least Squares,F-statistic:,32.3
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,2.49e-32
Time:,20:11:34,Log-Likelihood:,-136.03
No. Observations:,119,AIC:,302.1
Df Residuals:,104,BIC:,343.7
Df Model:,15,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Cell_Cost,0.1259,0.111,1.131,0.260,-0.095,0.347
E-Finance_Value,-0.0248,0.018,-1.359,0.177,-0.061,0.011
Support_For_Dig_Lit,-0.2218,0.360,-0.617,0.539,-0.935,0.491
Govt_Promote_5g,1.0106,0.422,2.396,0.018,0.174,1.847
Democracy_Index,0.3251,0.219,1.488,0.140,-0.108,0.758
interaction 7,-0.0011,0.002,-0.567,0.572,-0.005,0.003
interaction 8,0.0409,0.027,1.497,0.138,-0.013,0.095
interaction 9,-0.0785,0.065,-1.213,0.228,-0.207,0.050
interaction 10,-0.0096,0.020,-0.490,0.625,-0.048,0.029

0,1,2,3
Omnibus:,0.369,Durbin-Watson:,2.08
Prob(Omnibus):,0.832,Jarque-Bera (JB):,0.239
Skew:,-0.11,Prob(JB):,0.887
Kurtosis:,3.01,Cond. No.,2440.0


In [78]:
internet_data_final = full_internet_data.copy()
coefficients_final = coefficients.copy()

In [79]:
internet_data_final['Govt_Promote_5g:Democracy_Index'] = interaction[19]
coefficients_final.extend(['Govt_Promote_5g:Democracy_Index'])

In [80]:
regi19 = sm.OLS(y, internet_data_final[coefficients_final]).fit()
regi19.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.814
Model:,OLS,Adj. R-squared (uncentered):,0.804
Method:,Least Squares,F-statistic:,82.18
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,6.85e-39
Time:,20:15:34,Log-Likelihood:,-139.22
No. Observations:,119,AIC:,290.4
Df Residuals:,113,BIC:,307.1
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Cell_Cost,0.0670,0.020,3.399,0.001,0.028,0.106
E-Finance_Value,-0.0206,0.005,-4.115,0.000,-0.030,-0.011
Support_For_Dig_Lit,0.1440,0.105,1.372,0.173,-0.064,0.352
Govt_Promote_5g,0.6449,0.180,3.580,0.001,0.288,1.002
Democracy_Index,0.3002,0.060,4.963,0.000,0.180,0.420
Govt_Promote_5g:Democracy_Index,-0.0608,0.027,-2.219,0.029,-0.115,-0.007

0,1,2,3
Omnibus:,0.176,Durbin-Watson:,2.014
Prob(Omnibus):,0.916,Jarque-Bera (JB):,0.161
Skew:,0.084,Prob(JB):,0.922
Kurtosis:,2.935,Cond. No.,144.0


In [81]:
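# drop() returns a new DataFrame, so assign the result back for the column to actually be removed
internet_data_final = internet_data_final.drop(columns=['Support_For_Dig_Lit'])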
coefficients_final.remove('Support_For_Dig_Lit')

In [82]:
reg_final = sm.OLS(y, internet_data_final[coefficients_final]).fit()
reg_final.summary()

0,1,2,3
Dep. Variable:,"level_of_web_accessibility_(qualitative_rating_0-4,_4=best)",R-squared (uncentered):,0.81
Model:,OLS,Adj. R-squared (uncentered):,0.802
Method:,Least Squares,F-statistic:,97.49
Date:,"Mon, 23 Nov 2020",Prob (F-statistic):,1.66e-39
Time:,20:15:35,Log-Likelihood:,-140.2
No. Observations:,119,AIC:,290.4
Df Residuals:,114,BIC:,304.3
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Cell_Cost,0.0609,0.019,3.160,0.002,0.023,0.099
E-Finance_Value,-0.0191,0.005,-3.900,0.000,-0.029,-0.009
Govt_Promote_5g,0.7730,0.155,4.998,0.000,0.467,1.079
Democracy_Index,0.3505,0.048,7.265,0.000,0.255,0.446
Govt_Promote_5g:Democracy_Index,-0.0798,0.024,-3.358,0.001,-0.127,-0.033

0,1,2,3
Omnibus:,0.092,Durbin-Watson:,2.042
Prob(Omnibus):,0.955,Jarque-Bera (JB):,0.085
Skew:,0.057,Prob(JB):,0.958
Kurtosis:,2.936,Cond. No.,115.0


In [83]:
internet_data_final[coefficients_final].head()

Unnamed: 0,Cell_Cost,E-Finance_Value,Govt_Promote_5g,Democracy_Index,Govt_Promote_5g:Democracy_Index
0,2.04,32.0,0.0,3.5,0.0
1,1.12,62.0,0.0,7.0,0.0
2,0.5,44.0,3.0,9.1,27.3
3,0.16,50.0,3.0,8.3,24.9
4,1.38,45.0,0.0,2.6,0.0


In [84]:
X = internet_data_final[coefficients_final]