# Thinkful Project - Kalika Curry

Find the factors that affect the life expectancy. Specifically, you need to find out which factors increase the expected life in the countries and which factors decrease it.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy.stats.mstats import winsorize
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
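# Keep the path of the last file listed above (this assumes the input directory holds a single CSV).
url = os.path.join(dirname, filename)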
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv(url)
df.head()

## Data Cleaning
*The Kaggle description for this dataset references that a lot of additional information has been added to the set for research purposes. This briefing references additional timeframes as well as countries. My goal is to observe the dataset with an idea in mind to create a predictive model with respect to life expectancy.*

1. Detect the problems with the data, such as missing values and outliers.
2. Are there any nonsense values that seem to stem from the data collection?
3. For the missing values, discuss which technique would be the most suitable for filling them in.
4. Regarding the outliers, discuss their potential effects on your analysis and select an appropriate method to deal with them.

### Missingness

In [None]:
#look at all the variables.
df.describe(include='all').T

In [None]:
#I want to take a look at some of the unique categorical values.
for col in df.describe(include='O'):
    print(df[col].unique())

In [None]:
#Data types. There are nulls. Year is read as an integer. I might have to fix this; for now I take note.
#I'm not noticing any nonsense values.
df.info()

In [None]:
#Summarize missingness: for each column that has nulls, the percentage of its rows that are missing.
def missingness_summary(df, print_log=False, sort='ascending'):
  s = df.isnull().sum()*100/df.isnull().count()
  s = s [s > 0]
  if sort.lower() == 'ascending':
    s = s.sort_values(ascending=True)
  elif sort.lower() == 'descending':
    s = s.sort_values(ascending=False)  
  if print_log: 
    print(s)
  
  return pd.Series(s)

suspects = missingness_summary(df, True, 'descending')

I do not notice any nonsense values.


Drop Null Potentials: 
* Investigate variables with less than 5% missing and see if those rows can be dropped
* My variable of interest is life expectancy, and 0.34% of its values are missing. It is related to [Adult Mortality](https://www.who.int/gho/mortality_burden_disease/mortality_adult/situation_trends_text/en/#:~:text=Adult%20mortality%20rate%20represents%20the,per%201000%20population%20in%202016.), so I drop all records where either of these values is null.
 

In [None]:
#Drop the rows where Life expectancy or Adult Mortality is null.
df.dropna(subset=['Life expectancy ', 'Adult Mortality'], inplace=True)

suspects = missingness_summary(df, False)
drop_suspects = suspects[ suspects < 5 ]
drop_suspects

In [None]:
df[df[drop_suspects.keys()].isnull().any(axis=1)]


I'm investigating the weight, polio, and diphtheria features. There are a number of approaches I can take. I could go back and fill in the average for each country/region/area for these values.
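
The per-country average mentioned above could be sketched roughly as follows. This is only an illustration on a made-up toy frame (the `toy` data and the `Polio_filled` column are hypothetical), not something I apply to the real dataset in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Polio":   [90.0, np.nan, 94.0, 60.0, 62.0, np.nan],
})

# Mean of the observed values within each country, broadcast back to every row.
country_mean = toy.groupby("Country")["Polio"].transform("mean")

# Fill each missing value with its own country's mean.
toy["Polio_filled"] = toy["Polio"].fillna(country_mean)
print(toy)
```

A country with no observed values at all would still come out as null with this approach, so it only helps when the gaps are partial.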

They are all developing countries. I have forty rows of them. How is that going to impact the life expectancy and mortality calculations for those developing countries? When I weigh against polio, there's not a lot I can do about that. When I weigh against, say, population, HIV, etc., it could have a serious impact. Why don't we have polio information on these developing countries? Could it be because they didn't have access to the technology to test for polio? Were their systems down that year? Should I let those possible interfering factors keep their data from contributing to the other features that I'm observing? No.

Am I going to look at polio and diphtheria? Is there any correlation between these items and life expectancy?

I have a lot of features. I'm going to assign them to a different dataframe for the time being and remove the suspected null values from that dataframe. For the remainder of this project, I will omit these features. I might come back to them.

Remember, the country a person lives in could also impact the life expectancy. Do I want to just think about the time of year? I anticipate that the time of year would have a very real impact on the life expectancy of ESPECIALLY developing countries.
Yes. I could drop the nulls individually, but there's enough significance in the nulls for these features for me to go ahead and make the assumption. I just feel a little bad for Sudan - when I look at polio, etc.

In [None]:
var = ['Life expectancy '] #variable of interest
lost_development = list(drop_suspects.keys())
lost_development.extend(var)

#Lost Development dataframe. Short name for easy access.
ld = df[lost_development]
ld = ld.dropna()

#No Missing Values for the lost development dataframe. 
missingness_summary(ld)

In [None]:
#Remove the drop_suspect features from the dataset. We will not use them with this dataset as this subset of data is incomplete.
cols = list(df.columns)
cols = [x for x in cols if x not in drop_suspects.keys()]

#My new Life Expectancy dataframe, without those few missing entries.
#Take a copy so that columns added later do not write into a slice of df.
le = df[cols].copy()

### Filling in Data
Now I have to understand what's going on with the columns that have a larger portion of their data missing: more than five percent, but less than half, of their values are null.

I want to take a look at these and determine if they should be imputed, interpolated, or given the same treatment as the others. 
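
For reference, interpolation within each country's time series might look something like the sketch below. The `toy` frame and the `GDP_interp` column are made up for illustration; the real columns may need different handling.

```python
import numpy as np
import pandas as pd

# Toy data: country A has one gap between known values, country B has nothing at all.
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "Year":    [2000, 2001, 2002, 2003] * 2,
    "GDP":     [100.0, np.nan, 120.0, 130.0, np.nan, np.nan, np.nan, np.nan],
})

# Interpolate linearly along each country's own timeline, and only between
# known values (limit_area="inside"), so nothing is extrapolated.
toy = toy.sort_values(["Country", "Year"])
toy["GDP_interp"] = (
    toy.groupby("Country")["GDP"]
       .transform(lambda s: s.interpolate(method="linear", limit_area="inside"))
)
print(toy)
```

The gap in country A is filled from its neighbours, while country B stays entirely null. That second case is the one to watch for: interpolation has nothing to work with when a whole country's series is missing.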

In [None]:
miss = missingness_summary(le, True)

I've got seven features to think on. I'm not comfortable filling in Hepatitis B with an estimation. I will omit that feature. Rather than create another dataframe this time, I will just opt to exclude it because now I'm rethinking my plan of attack. 

I think perhaps there should be an alternate feature set that should be handled differently when there's more time available to explore this dataset. 

Omitted Features and Reasonings:
* Total Expenditure - I am satisfied with percentage expenditure.
* Hepatitis B - should be treated separately. 
* Income Composition of Resources - I don't understand what this is. Further research would be required to determine the appropriate handling.
* Alcohol - Treat it separately. 
* Population - 21% of the data with respect to the population is missing. I really want to use this feature, but there is a significant amount of data missing in developing countries, and it's consecutive. 
* Schooling has the same complication. I am required to omit that feature.
* GDP also has the same complication as all the others. There's too much data missing with respect to the country and much of the data is consecutive. A little research suggests that there may be [some relationship between percentage expenditure and GDP.](https://en.wikipedia.org/wiki/Government_spending#:~:text=The%20figures%20below%20of%2042,was%20%2422%2C726%20in%20the%20U.S.)
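
One way to put a number on "consecutive" missingness is to measure the longest unbroken run of nulls per country. This is a sketch on made-up data; `longest_null_run` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

def longest_null_run(s: pd.Series) -> int:
    """Length of the longest unbroken run of missing values in s."""
    is_null = s.isna()
    # Every non-null value starts a new block; the nulls that follow it share
    # its block id, so summing nulls per block gives the run lengths.
    block_id = (~is_null).cumsum()
    runs = is_null.groupby(block_id).sum()
    return int(runs.max()) if len(runs) else 0

# Toy data: country A has scattered gaps, country B has one long gap.
toy = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 5,
    "Year": list(range(2000, 2005)) * 2,
    "Population": [1.0, np.nan, 3.0, np.nan, 5.0,
                   np.nan, np.nan, np.nan, np.nan, 2.0],
})

gaps = (
    toy.sort_values(["Country", "Year"])
       .groupby("Country")["Population"]
       .apply(longest_null_run)
)
print(gaps)
```

Short, scattered gaps are candidates for interpolation; a run that covers most of a country's years is not.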

In [None]:
le[le[miss.keys()].isnull().any(axis=1)].sort_values(['Country','Year'])

In [None]:
le[le['Schooling'].isna()]

In [None]:
omit = ['Total expenditure', 'Hepatitis B', 'Income composition of resources', 'Alcohol', 'Population', 'Schooling', 'GDP']

#drop those features you don't want to use right now. 
cols =  [x for x in cols if x not in omit]
le = le[cols].copy()
miss = missingness_summary(le, True)

In [None]:
le

### Outliers

Now that I have my data cleaned up enough to try and accomplish something with it, I want to take a look at it and see if there are any values that fall far outside the normal range.

In [None]:
#Country, Year, and Status should not have any outliers. 
out = [x for x in cols if x not in ['Country', 'Year', 'Status']]

le[out[0]].plot.box(whis=3) 
plt.show()
out.pop(0)

In [None]:
le[out[0]].plot.box(whis=3) 
plt.show()
out.pop(0)

In [None]:
le[out[0]].plot.box(whis=3) 
plt.show()
out.pop(0)

In [None]:
#First incident of excessive outliers. Add the variable to an outlier list for future investigation.
outlier = ['infant deaths']

In [None]:
le[out[0]].plot.box(whis=3) 
plt.show()
out.pop(0)

In [None]:
outlier.append('percentage expenditure')

In [None]:
le[out[0]].plot.box(whis=3) 
plt.show()
out.pop(0)

In [None]:
outlier.append('Measles ')

In [None]:
le[out[0]].plot.box(whis=3) 
plt.show()
out.pop(0)

In [None]:
outlier.append('under-five deaths ')

In [None]:
le[out[0]].plot.box(whis=3) 
plt.show()
out.pop(0)

In [None]:
outlier.append(' HIV/AIDS')

In [None]:
outlier

Good news! Our variables of interest don't have very many outliers. They're usable as is. 

The rest do. The outliers can have an impact on the visual results. When there are too many outliers, the data needs to be compressed or transformed to make it easier to work with.

These outliers that we're seeing could be a result of bad data, they could just be a few dozen values falling outside of the norm, or they could be something more serious.



In [None]:
# Tukey's method.
def tukey(field):
  q75, q25 = np.percentile(field, [75 ,25])
  iqr = q75 - q25
 
  for threshold in np.arange(1,5,0.5):
      min_val = q25 - (iqr*threshold)
      max_val = q75 + (iqr*threshold)
      print("The score threshold is: {}".format(threshold))
      print("Number of outliers is: {}".format(
          len((np.where((field > max_val) 
                        | (field < min_val))[0]))
      ))
        
for col in outlier:
    print("TUKEY INFORMATION FOR", col)
    print('____________________________')
    tukey(le[col])

In [None]:
#Trying out winsorize. 
from scipy.stats.mstats import winsorize

# Apply one-way winsorization to the highest end. Limits of (0, 0.15) cap the top 15% of values, i.e. everything above the 85th percentile.
print(outlier[0])
wv1 = winsorize(le[outlier[0]], (0, 0.15))
plt.boxplot(wv1)
plt.show()

In [None]:
#Add a column to the datatable for this transformation.
le["w"+ outlier[0]] = wv1

# Apply one-way winsorization to the highest end. Limits of (0, 0.15) cap the top 15% of values, i.e. everything above the 85th percentile.
print(outlier[1])
wv2 = winsorize(le[outlier[1]], (0, 0.15))
plt.boxplot(wv2)
plt.show()

In [None]:
#Add a column to the datatable for this transformation.
le["w"+ outlier[1]] = wv2

# Apply one-way winsorization to the highest end. I went with the 80th percentile. 
print(outlier[2])
wv3 = winsorize(le[outlier[2]], (0, 0.2))
plt.boxplot(wv3)
plt.show()

In [None]:
#Add a column to the datatable for this transformation.
le["w"+ outlier[2]] = wv3

# Apply one-way winsorization to the highest end. Note that limits of (0, 0.0) cap nothing, so the values are left unchanged.
print(outlier[3])
wv4 = winsorize(le[outlier[3]], (0, 0.0))
plt.boxplot(wv4)
plt.show()

In [None]:
#Add a column to the datatable for this transformation.
le["w"+ outlier[3]] = wv4

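# Apply one-way winsorization to the highest end, this time with limits of (0, 1.0), the largest upper limit possible.
print(outlier[4])
wv5 = winsorize(le[outlier[4]], (0, 1.0))
plt.boxplot(wv5)  # plot the HIV/AIDS result (wv5), not the previous variable (wv4)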
plt.show()

In [None]:
le[outlier[4]].describe()

A winsorize transformation worked for all variables except the HIV/AIDS variable. What impact do I believe this data will have when run against the life expectancy? Even with this many outlier values, I should be able to get an understanding of the correlation - I think.
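
To make the winsorize limits concrete: an upper limit of 0.20 caps everything above roughly the 80th percentile, and 0.15 caps everything above roughly the 85th. The sketch below imitates that with plain NumPy on made-up numbers. `cap_upper_tail` is a hypothetical helper, and it interpolates the percentile, whereas scipy's `winsorize` picks its cutoff by rank in the sorted data, so the two can differ slightly.

```python
import numpy as np

def cap_upper_tail(values, upper_fraction):
    """Replace values above the (1 - upper_fraction) percentile with that cutoff."""
    values = np.asarray(values, dtype=float)
    cutoff = np.percentile(values, 100 * (1 - upper_fraction))
    return np.minimum(values, cutoff)

# Made-up data with one extreme value at the top.
toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float)
capped = cap_upper_tail(toy, 0.20)

print(capped)
print("max before:", toy.max(), "max after:", capped.max())
```

The low end is untouched and the extreme value is pulled down to the cutoff, which is why the boxplots tighten up after the transformation.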

In [None]:
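#Print the correlation; a bare expression that is not the last line of a cell is not displayed.
print(le['Life expectancy '].corr(le[' HIV/AIDS']))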
le.plot.scatter(x='Life expectancy ', y=' HIV/AIDS')
plt.show()

The distribution isn't normal, so it's difficult to get a good comparison between these two variables. When I think about whether or not they'll make good features in my model, I'm also thinking about the other features that I'm using. Measles and HIV/AIDS seem to make more sense in another model that looks at the impact of diseases on overall life expectancy, whereas this analysis seems to be looking at how percentage expenditure and infant deaths relate to overall life expectancy. At least that appears to be the direction I'm heading.

A log transformation COULD fit this data to the model, but because of the features that I've started to eliminate with respect to my approach, I'm going to work without measles and HIV/AIDS. They need to be investigated and handled separately - like all other diseases.  
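
For when I come back to this, here is a sketch of what a log transformation does to a right-skewed column. The numbers are made up, and `np.log1p` (the log of 1 + x) is my assumption for handling values at or near zero.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values: mostly small, with a long tail.
toy = pd.Series([0.1, 0.1, 0.2, 0.3, 0.5, 1.0, 2.5, 9.0, 25.0, 50.0])

# log1p compresses the large values much more than the small ones.
logged = np.log1p(toy)

print("skew before:", toy.skew())
print("skew after: ", logged.skew())
```

The skew drops after the transformation, which is the effect that would make a column like this easier to compare against life expectancy.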

In [None]:
cols = list(le.columns)
omit = [' HIV/AIDS', 'Measles ', 'wMeasles ' ]
#drop those features you don't want to use right now. 
cols =  [x for x in cols if x not in omit]
le = le[cols]
le

Explore the data using univariate and multivariate exploration techniques. You should pay special attention to your target variable. In this regard, your focus should be on finding the relevant variables that may affect life expectancy.



In [None]:
le.plot()

In [None]:
le.plot.scatter(x='Life expectancy ', y="percentage expenditure")
plt.show()

In [None]:
le.plot.scatter(x='Life expectancy ', y="wpercentage expenditure")
plt.show()

In [None]:
max_corr = 0.0  #named max_corr so the built-in max() is not shadowed
var = 'Life expectancy '


#Slice from the fifth column on: these are the numeric columns that come after the variable of interest.
for col in cols[4:]:
    correlation = le[var].corr(le[col])
    print("The correlation score for {} is {} ".format(col, correlation ))
    
    if abs(correlation) >= max_corr:
        max_corr = abs(correlation)
        best = col

print("The feature most strongly correlated with {} is {}".format(var, best)) 


In [None]:
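#Restrict to numeric columns; Country and Status are text and newer pandas versions raise an error on them.
le.select_dtypes(include='number').corr()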
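
The correlation loop above can also be written with pandas' own `corrwith`, which returns every score at once so they can be sorted. A sketch on a made-up frame (the values are invented; only the column names mirror the dataset):

```python
import pandas as pd

# Made-up rows; column names follow the dataset, including its trailing space.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Life expectancy ": [50.0, 60.0, 65.0, 70.0, 80.0],
    "Adult Mortality": [400.0, 300.0, 260.0, 200.0, 90.0],
    "percentage expenditure": [10.0, 5.0, 40.0, 30.0, 90.0],
})

target = "Life expectancy "
numeric = toy.select_dtypes(include="number")

# Correlate every other numeric column with the target, strongest first
# by absolute value so that strong negative correlations are not buried.
ranked = (
    numeric.drop(columns=target)
           .corrwith(numeric[target])
           .sort_values(key=abs, ascending=False)
)
print(ranked)
```

Sorting by absolute value matters here because a feature like adult mortality is expected to correlate negatively with life expectancy.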

### Feature Engineering

I want to take a closer look at the infant deaths and under five deaths. The percentage expenditures had the second highest correlation, so I want to see how these three relate. 

In [None]:
features = cols[-3:]
le[features].corr()

The infant and under-five deaths have a low but similar correlation to each other.
The percentage expenditure is pretty far from the rest of them. Good. I understand how these possible components might relate.


In [None]:
features.append(var)
le[features]

In [None]:
plt.figure(figsize=(20,7))
#Including ci=None because I'm looking at the consequence of including bootstrapping on just two variables when we use all data - on a relatively small dataset. (See summary)
sns.lineplot(data=le[features], x='Life expectancy ', y='wpercentage expenditure', ci=None)
plt.show()

In [None]:
le[['wunder-five deaths ','winfant deaths', var]].plot(x=var, y=['wunder-five deaths ','winfant deaths'], figsize=(20, 8))


Oh yeah, that's right. I can't really plot the under-five deaths and the infant deaths against the life expectancy. I mean, they died before they got that old.
I'm amazed that I managed to get a graph of this thing. They do relate to each other, though. 

# Summary

The life expectancy is most closely related to the country's percentage expenditure. I can see that as the expenditure increases there are gradual increases in the life expectancy. 

This information comes from just one graph. It includes every country in the dataset from 2000 to 2015, which is just five years ago. That covers developed and developing countries alike, regardless of disease.


In [None]:
plt.figure(figsize=(20,7))
sns.lineplot(data=le[features], x='Life expectancy ', y='wpercentage expenditure')
plt.show()

Time, country, and whether or not a country is developing: it would be interesting to pull these together and place them on this chart to see what impact they may have.
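
As a starting point for that follow-up, the correlation could be computed separately for each development status. This is a sketch on made-up rows; only the column names come from the dataset.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Status": ["Developed"] * 4 + ["Developing"] * 4,
    "Life expectancy ": [78.0, 80.0, 81.0, 83.0, 52.0, 60.0, 58.0, 66.0],
    "percentage expenditure": [900.0, 1100.0, 1000.0, 1500.0,
                               20.0, 80.0, 40.0, 60.0],
})

# One correlation score per status group instead of one for the whole world.
by_status = (
    toy.groupby("Status")[["Life expectancy ", "percentage expenditure"]]
       .apply(lambda g: g["Life expectancy "].corr(g["percentage expenditure"]))
)
print(by_status)
```

The same pattern would work with `Year` or `Country` as the grouping key, as long as each group has enough rows for the correlation to mean something.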

As for the diseases, investigating what impact they have on life expectancy is an experiment for another day. I'd be most excited to see what impact they have on children, as well as to pull in some information about when we introduced the polio vaccine - and all the other vaccines.

That'll take a bit of time. 

#### Afterward 

I need to add a step to my cleaning process that removes the extra spaces at the beginning and end of these column names, to make the data easier to navigate during exploration. It's not something I have time for right now, but it is something that happens often.
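
A minimal sketch of that cleaning step, using an empty toy frame that carries a few of the column names from this dataset:

```python
import pandas as pd

# Column names copied from this dataset, including their stray spaces.
toy = pd.DataFrame(columns=["Life expectancy ", " HIV/AIDS", "Measles ", "under-five deaths "])

# Strip leading and trailing whitespace from every column name.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

Doing this right after `pd.read_csv` would mean every later reference could use the clean names, such as `'Life expectancy'` without the trailing space.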

Inplace reassignment had a big impact on my own performance. I make a lot of typos, small programming errors, etc. while I'm working. These inplace commands slow me down because a mistake forces me to rerun the entire session. I might want to refrain from using them.
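
The alternative I have in mind is to assign the result to a new name instead of modifying the frame in place. A sketch with made-up values (`raw` and `clean` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up rows with a null in each column.
raw = pd.DataFrame({
    "Life expectancy ": [65.0, np.nan, 72.0],
    "Adult Mortality": [263.0, 271.0, np.nan],
})

# Without inplace=True, dropna returns a new frame and leaves `raw` untouched,
# so this cell can be rerun (or corrected) without reloading the data.
clean = raw.dropna(subset=["Life expectancy ", "Adult Mortality"])

print(len(raw), len(clean))
```

Because `raw` is never modified, a typo in the cleaning step only costs one cell rerun rather than the whole session.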


