# Feature selection
This notebook works through several methods for identifying the features that have the most impact on life expectancy. Year is expected to have the largest impact, so we will have at least 4 features to rank overall.

2021-09-21 - DXG 

In [30]:
import pandas as pd

In [31]:
df = pd.read_csv('finalDataSetForModelling.csv')

In [32]:
features = df.columns.tolist()

In [33]:
features.remove('CountryName')
features.remove('Life expectancy at birth, total (years)')

In [34]:
X=df[features]
y=df['Life expectancy at birth, total (years)']

## Feature selection uses a crude forward selection via linear regression to find the quantitative predictors with the greatest impact on life expectancy
The process is to select 10 features at a time, then remove the ones that are obviously correlated with life expectancy, such as mortality rate, population growth, and population age groups as a percentage of the total.
The trimming continues until we have 10 selected features that do not appear to have a direct correlation with life expectancy.
Year is left out, as it is expected to be directly correlated with life expectancy (and it is an uncontrollable predictor).
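
One way to make the judgement about which features correlate too directly with life expectancy less subjective is to rank the candidates by absolute Pearson correlation with the target. The sketch below runs on a tiny made-up frame; the column names, the values, and the 0.9 cut-off are illustrative assumptions, not figures from this notebook.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for the real dataset
toy = pd.DataFrame({
    'life_expectancy': [50.0, 55.0, 60.0, 65.0, 70.0],
    'mortality_rate': [300.0, 260.0, 210.0, 170.0, 120.0],  # near mirror image of the target
    'arable_land_pct': [12.0, 9.0, 14.0, 10.0, 13.0],       # only weakly related
})

target = toy['life_expectancy']
candidates = toy.drop(columns='life_expectancy')

# Absolute Pearson correlation of each candidate with the target, strongest first
abs_corr = candidates.corrwith(target).abs().sort_values(ascending=False)

# Flag columns above an arbitrary threshold as too directly related to keep
threshold = 0.9
too_direct = abs_corr[abs_corr > threshold].index.tolist()
print(abs_corr)
print(too_direct)
```

A ranking like this only captures linear, pairwise relationships, so it would complement the manual review rather than replace it.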

In [37]:
#!pip install mlxtend

In [38]:
# importing the models

from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

In [39]:
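def GetFeatures(k_features, X, y, verbose):
    """Return the names of the k_features columns of X picked by forward selection."""
    # Plain linear regression is the estimator scored at each step
    lreg = LinearRegression()
    # Forward selection scored on negative MSE (mlxtend cross-validates by default)
    sfs1 = sfs(lreg, k_features=k_features, forward=True, verbose=verbose, scoring='neg_mean_squared_error')

    sfs1 = sfs1.fit(X, y)

    # k_feature_names_ holds the selected column names when X is a DataFrame
    feat_names = list(sfs1.k_feature_names_)
    return feat_names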
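
For intuition about what the selector does at each step, here is a minimal from-scratch sketch of greedy forward selection with ordinary least squares. It is not mlxtend's implementation: it scores on in-sample mean squared error with no cross-validation, and the function name `forward_select` and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np

def forward_select(X, y, k):
    """Greedily pick k column indices of X that minimise in-sample MSE of an OLS fit."""
    n_samples, n_features = X.shape
    selected = []
    remaining = list(range(n_features))
    while len(selected) < k and remaining:
        best_mse, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            # Design matrix: intercept column plus the candidate feature set
            A = np.column_stack([np.ones(n_samples), X[:, cols]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            mse = np.mean((y - A @ coef) ** 2)
            if best_mse is None or mse < best_mse:
                best_mse, best_j = mse, j
        # Keep the single feature that lowered the error the most
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic data: y depends strongly on column 2, weakly on column 0, not on column 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 * X_demo[:, 2] + 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

picked = forward_select(X_demo, y_demo, k=2)
print(picked)
```

Because each step only adds the single best feature given those already chosen, the result depends on the pool of candidates, which is why dropping columns between passes changes what gets selected.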

# First pass
We need to exclude features that are highly correlated with life expectancy, so let's first see what the selector returns.

In [40]:
featurePass1 = GetFeatures(10,X,y,0)

In [41]:
featurePass1

['Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Age dependency ratio, old (% of working-age population)',
 'CO2 emissions from liquid fuel consumption (% of total)',
 'Death rate, crude (per 1,000 people)',
 'GDP per capita (current US$)',
 'Mortality rate, adult, female (per 1,000 female adults)',
 'Mortality rate, under-5 (per 1,000)',
 'Population ages 65 and above (% of total)',
 'Survival to age 65, female (% of cohort)',
 'Survival to age 65, male (% of cohort)']

#### Let's remove the mortality-related features, those tied to the upper end of the age range, and survival to certain ages

In [42]:
featuresToRemove = [
 'Death rate, crude (per 1,000 people)',
 'Mortality rate, adult, female (per 1,000 female adults)',
 'Mortality rate, under-5 (per 1,000)',
 'Population ages 65 and above (% of total)',
 'Survival to age 65, female (% of cohort)',
 'Survival to age 65, male (% of cohort)']

X=X.drop(columns=featuresToRemove)

In [43]:
featurePass2 = GetFeatures(10,X,y,0)

In [44]:
featurePass2

['Year',
 'Birth rate, crude (per 1,000 people)',
 'GDP at market prices (current US$)',
 'Livestock production index (2004-2006 = 100)',
 'Merchandise trade (% of GDP)',
 'Mortality rate, adult, male (per 1,000 male adults)',
 'Mortality rate, infant (per 1,000 live births)',
 'Population density (people per sq. km of land area)',
 'Population, ages 15-64 (% of total)',
 'Rural population (% of total population)']

In [45]:
#### Let's remove all the mortality-related features and Year, as well as relative population ages

In [46]:
featuresToRemove2 = ['Year',
 'Mortality rate, adult, male (per 1,000 male adults)',
 'Mortality rate, infant (per 1,000 live births)',
 'Population, ages 15-64 (% of total)']
X=X.drop(columns=featuresToRemove2)

In [47]:
featurePass3 = GetFeatures(10,X,y,0)

In [48]:
featurePass3

['Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Age dependency ratio (% of working-age population)',
 'Age dependency ratio, old (% of working-age population)',
 'Birth rate, crude (per 1,000 people)',
 'CO2 emissions from gaseous fuel consumption (% of total)',
 'CO2 emissions from solid fuel consumption (% of total)',
 'GDP per capita (current US$)',
 'Population growth (annual %)',
 'Urban population (% of total)',
 'Urban population growth (annual %)']

In [49]:
#### Remove age-related indicators - fertility rate is hard to control

In [50]:
featuresToRemove3 = ['Age dependency ratio (% of working-age population)',
 'Age dependency ratio, old (% of working-age population)']
X=X.drop(columns=featuresToRemove3)

In [51]:
featurePass4 = GetFeatures(10,X,y,0)
featurePass4

['Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Age dependency ratio, young (% of working-age population)',
 'Birth rate, crude (per 1,000 people)',
 'CO2 emissions from gaseous fuel consumption (% of total)',
 'CO2 emissions from solid fuel consumption (% of total)',
 'CO2 emissions from solid fuel consumption (kt)',
 'GDP per capita (current US$)',
 'Population growth (annual %)',
 'Urban population (% of total)',
 'Urban population growth (annual %)']

#### Remove age dependencies, as well as growth, but leave relative growth

In [52]:
featuresToRemove4 = ['Age dependency ratio, young (% of working-age population)',
 'Population growth (annual %)',
 'Urban population growth (annual %)']
X=X.drop(columns=featuresToRemove4)


In [53]:
featurePass5 = GetFeatures(10,X,y,0)
featurePass5

['Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Birth rate, crude (per 1,000 people)',
 'CO2 emissions from gaseous fuel consumption (% of total)',
 'CO2 emissions from solid fuel consumption (% of total)',
 'CO2 emissions from solid fuel consumption (kt)',
 'GDP per capita (current US$)',
 'Permanent cropland (% of land area)',
 'Population, ages 0-14 (% of total)',
 'Rural population (% of total population)',
 'Rural population growth (annual %)']

In [54]:
featuresToRemove5 = [
 'Population, ages 0-14 (% of total)',
 'Rural population growth (annual %)']
X=X.drop(columns=featuresToRemove5)

In [55]:
featurePass6 = GetFeatures(10,X,y,0)
featurePass6

['Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Arable land (% of land area)',
 'Arable land (hectares per person)',
 'Birth rate, crude (per 1,000 people)',
 'CO2 emissions from solid fuel consumption (% of total)',
 'Crop production index (2004-2006 = 100)',
 'Livestock production index (2004-2006 = 100)',
 'Permanent cropland (% of land area)',
 'Population, female (% of total)',
 'Rural population (% of total population)']
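
The select, inspect, and drop passes above could in principle be folded into a loop. The sketch below is one hedged way to do that: `prune_until_clean`, the keyword blocklist, and the stand-in selector `first_k` are all hypothetical names introduced here, and keyword matching is a much blunter rule than the manual judgement used in this notebook.

```python
def prune_until_clean(columns, select, blocked_keywords, k=10, max_passes=20):
    """Repeat select -> drop flagged columns until a pass returns no flagged column.

    select(columns, k) is any callable returning k chosen column names,
    for example a thin wrapper around a forward-selection routine.
    """
    columns = list(columns)
    for _ in range(max_passes):
        chosen = select(columns, k)
        # A column is flagged if its name contains any blocked keyword
        flagged = [c for c in chosen if any(kw in c.lower() for kw in blocked_keywords)]
        if not flagged:
            return chosen
        columns = [c for c in columns if c not in flagged]
    raise RuntimeError('no clean selection within max_passes')

# Stand-in selector for the demo: simply takes the first k columns
def first_k(columns, k):
    return columns[:k]

demo_columns = [
    'Mortality rate, infant',
    'GDP per capita',
    'Survival to age 65, male',
    'Arable land',
    'Rural population',
]
result = prune_until_clean(demo_columns, first_k, blocked_keywords=['mortality', 'survival'], k=2)
print(result)
```

Keeping the selector as a parameter means the same loop works with any selection routine, while the blocklist records in one place which kinds of features were ruled out.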

### We now have 10 predictors with no apparent direct correlation with life expectancy, so let's explore these.

In [56]:
finalColumns = list(featurePass6)

In [57]:
finalColumns.append('Life expectancy at birth, total (years)')
finalColumns.append('Year')
finalColumns.append('CountryName')

In [58]:
featureSelectedDataset = df[finalColumns]
featureSelectedDataset.reset_index()

Unnamed: 0,index,"Adolescent fertility rate (births per 1,000 women ages 15-19)",Arable land (% of land area),Arable land (hectares per person),"Birth rate, crude (per 1,000 people)",CO2 emissions from solid fuel consumption (% of total),Crop production index (2004-2006 = 100),Livestock production index (2004-2006 = 100),Permanent cropland (% of land area),"Population, female (% of total)",Rural population (% of total population),"Life expectancy at birth, total (years)",Year,CountryName
0,0,145.3210,11.947431,0.801756,51.614,35.807860,73.15,48.66,0.111816,48.634625,90.574,34.092878,1964,Afghanistan
1,1,145.3210,11.947431,0.785075,51.668,37.818182,75.72,51.53,0.114879,48.703560,90.250,34.525390,1965,Afghanistan
2,2,145.3210,12.014827,0.756515,51.716,31.142857,83.05,60.38,0.206782,48.824759,89.570,35.389415,1967,Afghanistan
3,3,145.3210,12.014827,0.740015,51.705,27.245509,85.05,64.93,0.208314,48.878412,89.214,35.822415,1968,Afghanistan
4,4,145.3210,12.039335,0.724457,51.672,38.521401,87.39,65.29,0.208314,48.929270,88.848,36.260390,1969,Afghanistan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6254,6254,117.5050,10.339925,0.300802,35.397,78.975132,94.33,102.64,0.258498,50.498150,66.257,44.177756,2007,Zimbabwe
6255,6255,116.6702,10.986170,0.314921,35.788,76.754177,88.28,103.90,0.258498,50.526399,66.440,45.804488,2008,Zimbabwe
6256,6256,115.8354,10.598423,0.298812,36.094,75.834446,82.48,105.03,0.258498,50.555102,66.622,47.624659,2009,Zimbabwe
6257,6257,115.0006,10.339925,0.286248,36.267,75.873274,99.55,102.85,0.258498,50.586182,66.804,49.574659,2010,Zimbabwe


In [59]:
# Explicit copy so the column assignments below do not raise SettingWithCopyWarning
featureSelectedDataset = featureSelectedDataset.copy()

meanByYear = featureSelectedDataset[['Year','Life expectancy at birth, total (years)']].groupby('Year').mean().reset_index()
meanOverall = featureSelectedDataset['Life expectancy at birth, total (years)'].mean()

featureSelectedDataset['MeanLifeExpetancyOverall'] = meanOverall

meanByYear = meanByYear.rename(columns={'Life expectancy at birth, total (years)':'MeanLifeExpetancyForYear'})

# Attach each year's mean to every row of that year
featureSelectedDataset = pd.merge(left=featureSelectedDataset, right=meanByYear, on='Year')

featureSelectedDataset['AboveAverageLifeExpectancyOverall'] = featureSelectedDataset['Life expectancy at birth, total (years)']>featureSelectedDataset['MeanLifeExpetancyOverall']

featureSelectedDataset['AboveAverageLifeExpectancyByYear'] = featureSelectedDataset['Life expectancy at birth, total (years)']>featureSelectedDataset['MeanLifeExpetancyForYear']

# Note: this overwrites the input file read at the top of the notebook
featureSelectedDataset.to_csv("finalDataSetForModelling.csv", index=False)
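
An alternative to the merge above is `groupby(...).transform('mean')`, which returns one value per row aligned to the original index, so no join is needed and the row order is preserved. A small sketch on made-up numbers follows; the short column names are hypothetical and chosen only to keep the example readable.

```python
import pandas as pd

toy = pd.DataFrame({
    'Year': [1964, 1964, 1965, 1965],
    'LifeExp': [40.0, 70.0, 42.0, 72.0],
})

# Per-year mean broadcast back to every row of that year
toy['MeanForYear'] = toy.groupby('Year')['LifeExp'].transform('mean')
# Overall mean is a single scalar assigned to every row
toy['MeanOverall'] = toy['LifeExp'].mean()

# Boolean flags comparing each row with the two means
toy['AboveYearMean'] = toy['LifeExp'] > toy['MeanForYear']
toy['AboveOverallMean'] = toy['LifeExp'] > toy['MeanOverall']
print(toy)
```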


In [60]:
featureSelectedDataset

Unnamed: 0,"Adolescent fertility rate (births per 1,000 women ages 15-19)",Arable land (% of land area),Arable land (hectares per person),"Birth rate, crude (per 1,000 people)",CO2 emissions from solid fuel consumption (% of total),Crop production index (2004-2006 = 100),Livestock production index (2004-2006 = 100),Permanent cropland (% of land area),"Population, female (% of total)",Rural population (% of total population),"Life expectancy at birth, total (years)",Year,CountryName,MeanLifeExpetancyOverall,MeanLifeExpetancyForYear,AboveAverageLifeExpectancyOverall,AboveAverageLifeExpectancyByYear
0,145.32100,11.947431,0.801756,51.614000,35.807860,73.150000,48.660000,0.111816,48.634625,90.574000,34.092878,1964,Afghanistan,64.193563,55.937218,False,False
1,46.64780,4.605392,3.168264,20.500000,61.654161,34.230000,61.430000,0.021999,49.604197,16.890000,70.880976,1964,Australia,64.193563,55.937218,True,True
2,56.30660,20.065391,0.229381,18.500000,48.662900,93.290000,72.460000,0.811334,53.270334,35.038000,69.921951,1964,Austria,64.193563,55.937218,True,True
3,88.50720,8.691025,0.378911,45.799000,0.000000,18.740000,35.630000,0.709471,52.189871,88.199000,39.136854,1964,Benin,64.193563,55.937218,False,False
4,100.55860,1.237884,0.336018,43.539000,0.247525,24.620000,20.760000,0.126465,50.360061,62.043000,43.430220,1964,Bolivia,64.193563,55.937218,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6254,84.36840,30.070682,0.135601,18.500000,72.813122,63.980000,80.210000,0.512545,51.567427,21.901000,70.826829,1963,United Kingdom,64.193563,56.791621,True,True
6255,77.95060,19.606375,0.948912,21.700000,31.668591,45.860000,57.320000,0.204062,50.488993,28.866000,69.917073,1963,United States,64.193563,56.791621,True,True
6256,62.75300,13.358473,0.887243,21.661000,2.716469,33.150000,58.160000,0.302823,50.074380,19.101000,68.440927,1963,Uruguay,64.193563,56.791621,True,True
6257,136.62780,3.291197,0.317951,45.607000,0.802505,38.580000,19.930000,0.656425,49.089366,35.023000,60.985829,1963,"Venezuela, RB",64.193563,56.791621,False,True


In [61]:
featureSelectedDataset.to_csv('featureSelectedDataset.csv', index=False)

In [62]:
print(*featureSelectedDataset.columns, sep = "\n")

Adolescent fertility rate (births per 1,000 women ages 15-19)
Arable land (% of land area)
Arable land (hectares per person)
Birth rate, crude (per 1,000 people)
CO2 emissions from solid fuel consumption (% of total)
Crop production index (2004-2006 = 100)
Livestock production index (2004-2006 = 100)
Permanent cropland (% of land area)
Population, female (% of total)
Rural population (% of total population)
Life expectancy at birth, total (years)
Year
CountryName
MeanLifeExpetancyOverall
MeanLifeExpetancyForYear
AboveAverageLifeExpectancyOverall
AboveAverageLifeExpectancyByYear


### The final feature-selected dataset has 10 predictors and the life expectancy target, with Year and CountryName added for informational purposes
Adolescent fertility rate (births per 1,000 women ages 15-19)
Arable land (% of land area)
Arable land (hectares per person)
Birth rate, crude (per 1,000 people)
CO2 emissions from solid fuel consumption (% of total)
Crop production index (2004-2006 = 100)
Livestock production index (2004-2006 = 100)
Permanent cropland (% of land area)
Population, female (% of total)
Rural population (% of total population)
Life expectancy at birth, total (years)
Year
CountryName