# Week 2: Exploratory Data Analysis (EDA)



## This Week's Content
This week, we will do some EDA on our datasets to clean up the data and get a better understanding of what we are looking at. We will combine the five yearly files into one consistent dataset and explore the correlations between certain features and their corresponding happiness scores.

### Relevant Libraries (Read the Short Summary if any are new to you)
[pandas](https://pandas.pydata.org/): a fast, powerful, flexible and easy to use open source data analysis and manipulation tool built on top of the Python programming language. It is one of the most common libraries used in data analysis and we will primarily be using the pandas DataFrame to manipulate our data.

[numpy](https://numpy.org/): the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

[matplotlib](https://matplotlib.org/): a comprehensive library for creating static, animated, and interactive visualizations in Python. Many of the matplotlib functions are built into pandas DataFrames, so we will likely not have to call them directly.

[seaborn](https://seaborn.pydata.org/): Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

In [None]:
# Uncomment and run the lines below if the imports fail; you may need to install some of these packages
# !pip install pandas
# !pip install numpy
# !pip install seaborn
# !pip install matplotlib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Next, we will load in the data from all five years of the World Happiness Report.

In [None]:
whr_2015 = pd.read_csv('https://github.com/shalindb/world_happiness_report/blob/main/data/WHR_2015.csv?raw=true')
whr_2016 = pd.read_csv('https://github.com/shalindb/world_happiness_report/blob/main/data/WHR_2016.csv?raw=true')
whr_2017 = pd.read_csv('https://github.com/shalindb/world_happiness_report/blob/main/data/WHR_2017.csv?raw=true')
whr_2018 = pd.read_csv('https://github.com/shalindb/world_happiness_report/blob/main/data/WHR_2018.csv?raw=true')
whr_2019 = pd.read_csv('https://github.com/shalindb/world_happiness_report/blob/main/data/WHR_2019.csv?raw=true')

Datasets collected over a long time period often have little to no consistency in their column names. If you look below, you will see that ours are a bit of a mess, so we're gonna have to clean them up a bit before we can do anything more : (

In [None]:
whr_2015.columns

In [None]:
whr_2016.columns

In [None]:
whr_2017.columns

In [None]:
whr_2018.columns

In [None]:
whr_2019.columns
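Eyeballing five column lists gets tiring, so here is a minimal sketch of comparing two of them with set operations. The frames `toy_2015` and `toy_2017` are made-up stand-ins (not the real WHR values); the same idea should work on `whr_2015.columns` and friends.

```python
import pandas as pd

# Hypothetical toy frames that mimic the naming drift between years
toy_2015 = pd.DataFrame({"Country": ["A"], "Happiness Score": [7.5], "Standard Error": [0.03]})
toy_2017 = pd.DataFrame({"Country": ["A"], "Happiness.Score": [7.4], "Whisker.high": [7.6]})

cols_2015 = set(toy_2015.columns)
cols_2017 = set(toy_2017.columns)

shared = sorted(cols_2015 & cols_2017)      # names present in both years
only_2015 = sorted(cols_2015 - cols_2017)   # names that need a match or a drop
only_2017 = sorted(cols_2017 - cols_2015)

print("shared:   ", shared)
print("only 2015:", only_2015)
print("only 2017:", only_2017)
```

Anything that lands in the "only" lists is a candidate for a rename or a drop.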

Well, that's a mess. Let's try to make them all consistent and then put them into one dataset, yeah? I'll go ahead and do 2015 - 2018 and you can do 2019! Our goal is to use the columns from the 2015 iteration of the World Happiness Report (except Standard Error).

<!-- 
whr_2019 = whr_2019.rename(columns={'Overall rank': 'Happiness Rank',
                                   'Country or region': 'Country',
                                   'Score': 'Happiness Score',
                                   'GDP per capita': 'Economy (GDP per Capita)',
                                   'Social support': 'Family',
                                   'Healthy life expectancy': 'Health (Life Expectancy)',
                                   'Freedom to make life choices': 'Freedom',
                                   'Perceptions of corruption': 'Trust (Government Corruption)'})
-->

In [None]:
whr_2015['Year'] = '2015'
whr_2016['Year'] = '2016'
whr_2017['Year'] = '2017'
whr_2018['Year'] = '2018'
whr_2019['Year'] = '2019'
whr_2015 = whr_2015.drop(columns=['Standard Error'])
whr_2016 = whr_2016.drop(columns=['Lower Confidence Interval',
                                  'Upper Confidence Interval'])

# tell me who made the column names for 2017, i just wanna talk
whr_2017 = whr_2017.rename(columns={'Happiness.Rank': 'Happiness Rank',
                                   'Happiness.Score': 'Happiness Score',
                                   'Economy..GDP.per.Capita.': 'Economy (GDP per Capita)',
                                   'Health..Life.Expectancy.': 'Health (Life Expectancy)',
                                   'Trust..Government.Corruption.': 'Trust (Government Corruption)',
                                   'Dystopia.Residual': 'Dystopia Residual'}).drop(columns=['Whisker.high', 'Whisker.low'])
# i wish i could tell you why they decided to do 'country or region' randomly in 2018 and 2019
# maybe they forgot countries are not regions idk
whr_2018 = whr_2018.rename(columns={'Overall rank': 'Happiness Rank',
                                   'Country or region': 'Country',
                                   'Score': 'Happiness Score',
                                   'GDP per capita': 'Economy (GDP per Capita)',
                                   'Social support': 'Family',
                                   'Healthy life expectancy': 'Health (Life Expectancy)',
                                   'Freedom to make life choices': 'Freedom',
                                   'Perceptions of corruption': 'Trust (Government Corruption)'})
### YOU DO THIS PART
whr_2019 = ...

###
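One thing to watch out for with `rename`: by default it quietly skips keys that don't match any column, so a typo in a column name won't raise an error. Below is a small sketch on a made-up frame (`toy` is hypothetical) showing that behavior and how passing `errors="raise"` surfaces the typo.

```python
import pandas as pd

# Hypothetical toy frame using the 2017-style dotted names
toy = pd.DataFrame({"Happiness.Score": [7.4], "Whisker.high": [7.6]})

# A misspelled key is silently ignored (the default is errors="ignore")
renamed = toy.rename(columns={"Happiness.Scor": "Happiness Score"})
print(list(renamed.columns))

# With errors="raise", the same typo becomes a KeyError
try:
    toy.rename(columns={"Happiness.Scor": "Happiness Score"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
print("typo caught:", typo_caught)

# Correct spelling, then drop the column we don't need
fixed = (toy.rename(columns={"Happiness.Score": "Happiness Score"}, errors="raise")
            .drop(columns=["Whisker.high"]))
print(list(fixed.columns))
```

If your renamed columns don't look right, comparing `list(df.columns)` before and after is a quick sanity check.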

Now you can go ahead and merge them into one big, beautiful dataset. 

*Hint: This should only take one line of code*

<!-- pd.concat([whr_2015, whr_2016, whr_2017, whr_2018, whr_2019]) -->

In [None]:
merged = ...
merged
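To see what stacking frames does when their columns don't line up, here is a rough sketch on two made-up frames (`toy_a` and `toy_b` are hypothetical). It shows two things: missing columns get filled with NaN, and the row labels repeat unless you ask pandas to renumber them with `ignore_index=True`.

```python
import pandas as pd

# Hypothetical toy frames: the second one has no Region column
toy_a = pd.DataFrame({"Country": ["A", "B"], "Region": ["North", "South"], "Happiness Score": [7.5, 6.1]})
toy_b = pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.4, 6.3]})

stacked = pd.concat([toy_a, toy_b])
print(stacked.index.tolist())          # each frame keeps its own row labels
print(stacked["Region"].isna().sum())  # rows from toy_b have no Region

# ignore_index=True gives the result fresh row labels
renumbered = pd.concat([toy_a, toy_b], ignore_index=True)
print(renumbered.index.tolist())
```

Whether you renumber is a judgment call; repeated labels are harmless for what we do this week, but they can surprise you when selecting rows by label later.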

You may notice that the Region value for countries in 2018 and 2019 is NaN. This is because the later reports apparently just gave up on adding that information, so I made fixing it part of this EDA section! Below, I've written a lil' function that takes in a country name and returns its region. This one is a bit challenging, but I want you to use that function to replace all the NaNs in Region with the correct region.

In [None]:
def region(country):
    """ Simply getting the country's region from the 2015 dataset since it was labeled in that one """
    region_name = ''
    try:
        region_name = list(whr_2015.loc[whr_2015['Country'] == country]['Region'])[0]
        return region_name
    except IndexError:
        print(f'{country} is not present in other datasets')
        return ''

Below, complete the line of code to apply the region function to each row.

*Hint: Check out the [DataFrame.apply()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) function*

<!-- merged.apply(lambda row: region(row['Country']), axis=1) -->

In [None]:
# i hope you're having a great day and drinking water, staying hydrated, and killing it
merged['Region'] = ...
merged

You should see output of the form:

Puerto Rico is not present in other datasets

Belize is not present in other datasets

Somalia is not present in other datasets

Somaliland Region is not present in other datasets

(Sorry, I would've inserted an image but Jupyter is being a lil' annoying today)
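For the curious, a row-by-row function isn't the only way to do a lookup like this. The sketch below is an alternative, not what the exercise asks for: it builds a country-to-region lookup from a made-up reference frame and uses `Series.map`. The names `reference` and `toy_merged` are hypothetical, and it assumes each country appears only once in the reference frame.

```python
import pandas as pd

# Hypothetical reference frame (plays the role of the year that has regions)
reference = pd.DataFrame({"Country": ["A", "B"], "Region": ["North", "South"]})
# Hypothetical merged frame with gaps; country "C" is not in the reference
toy_merged = pd.DataFrame({"Country": ["A", "B", "C"], "Region": ["North", None, None]})

# Series indexed by country, so map() can look each country up
lookup = reference.set_index("Country")["Region"]
mapped = toy_merged["Country"].map(lookup)

# Countries with no match come back as NaN; use '' like region() does
toy_merged["Region"] = mapped.fillna("")

missing = toy_merged.loc[toy_merged["Region"] == "", "Country"].tolist()
print(toy_merged)
print("no region found for:", missing)
```

The list of unmatched countries plays the same role as the printed messages above: it tells you which ones still need a region assigned by hand.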

It's important to note that in a situation like this, the most efficient approach might be to simply drop the countries that don't have an easily assigned region, but these are entire countries we are talking about. We can't exclude these people from our analyses, so we will do our best to include them by finding the regions they are in. In the code below, fill in the blanks to finish off our region-fixing-extravaganza.

In [None]:
# List of valid regions
whr_2015['Region'].unique()

In [None]:
# I did the first few for you
merged.loc[merged['Country'] == 'Puerto Rico', 'Region'] = 'Latin America and Caribbean'
merged.loc[merged['Country'] == 'Belize', 'Region'] = 'Latin America and Caribbean'
merged.loc[merged['Country'] == 'Somalia', 'Region'] = 'Sub-Saharan Africa'
merged.loc[merged['Country'] == 'Somaliland Region', 'Region'] = 'Sub-Saharan Africa'
merged.loc[merged['Country'] == 'Namibia', 'Region'] = 'Sub-Saharan Africa'
merged.loc[merged['Country'] == 'South Sudan', 'Region'] = 'Sub-Saharan Africa'
merged.loc[merged['Country'] == 'Taiwan Province of China', 'Region'] = 'Eastern Asia'
# Go ahead and do the last 5, shouldn't take too long! Google is your friend :)
merged.loc[merged['Country'] == ...
merged.loc[merged['Country'] == ...
merged.loc[merged['Country'] == ...
merged.loc[merged['Country'] == ...
merged.loc[merged['Country'] == ...
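If the pattern `df.loc[mask, column] = value` is new to you, here is a tiny sketch of how it behaves on a made-up frame. The countries and regions in `manual_regions` are hypothetical placeholders, not answers to the exercise. It also shows a quick check for rows that are still blank afterwards.

```python
import pandas as pd

# Hypothetical toy frame: the same country can appear in several years
toy = pd.DataFrame({
    "Country": ["C", "D", "C"],
    "Region": ["", "", ""],
    "Year": ["2018", "2018", "2019"],
})

# Placeholder assignments (made up for illustration)
manual_regions = {"C": "East", "D": "West"}

for country, region_name in manual_regions.items():
    # The boolean mask picks every row for that country, across all years
    toy.loc[toy["Country"] == country, "Region"] = region_name

# Any country still listed here has not been assigned a region yet
still_blank = toy.loc[toy["Region"] == "", "Country"].unique().tolist()
print(toy)
print("still blank:", still_blank)
```

Running a check like `still_blank` on `merged` after you fill in your five lines is a cheap way to confirm nothing was missed.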

Yeah, I get it. It might seem like busy work putting in the regions for these countries, but I have a reason for it (I swear, I don't hate you). It's important for us to take a step beyond what might be the most efficient or effective solution to our problem and understand what the *best* solution actually is. In this case, it would've been easy and saved time had we ignored all these countries and simply removed them from our dataset. However, if we're measuring something like happiness and seeking to make policymaking decisions, every country and person matters.

We also want to get rid of any NaN values in the other columns that may prevent sound analysis in the future. [Here](https://www.geeksforgeeks.org/replace-nan-values-with-zeros-in-pandas-dataframe/) is a guide on how to do so (this should be very easy, do not overthink it).

<!-- merged.fillna(0) -->

In [None]:
merged = ...
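Before replacing NaNs, it can help to see where they are. The sketch below uses a made-up frame (`toy` is hypothetical) to count NaNs per column and then shows that `fillna` returns a new frame rather than changing the original in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a couple of gaps
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Generosity": [0.2, np.nan],
    "Dystopia Residual": [np.nan, 1.9],
})

nan_counts = toy.isna().sum()   # NaNs per column
print(nan_counts)

filled = toy.fillna(0)          # returns a new frame; toy itself is untouched
print(filled)
print("NaNs left in filled:", int(filled.isna().sum().sum()))
print("NaNs left in toy:   ", int(toy.isna().sum().sum()))
```

Keep in mind that a zero is a real value as far as later calculations are concerned, so columns with many filled gaps will pull averages and correlations toward zero.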

Now that you guys finished that last, undoubtedly grueling task, we will finish off the week of EDA by plotting a correlation matrix to understand the correlations between our various features and our happiness scores.

In [None]:
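plt.rcParams['figure.figsize'] = (20, 15)
# r-value (Pearson correlation) between each pair of numeric features, to see which features correlate most with others;
# numeric_only=True skips text columns such as Country and Region (this argument needs pandas >= 1.5)
df_corr = merged.corr(numeric_only=True)
ax = sns.heatmap(df_corr, cmap='copper', annot=True)  # creates a heatmap
# workaround for matplotlib versions (e.g. 3.1.1) that cut the top and bottom heatmap rows in half
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)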

plt.show()
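A heatmap is great for the big picture, but sometimes you just want a ranked list of what correlates with happiness. Here is a minimal sketch on a made-up frame (`toy` and its numbers are hypothetical), assuming a pandas version that supports the `numeric_only` argument of `corr` (1.5 or newer).

```python
import pandas as pd

# Hypothetical toy data: GDP rises with the score, rank falls as the score rises
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Happiness Score": [7.0, 6.0, 5.0, 4.0],
    "Economy (GDP per Capita)": [1.4, 1.2, 1.0, 0.8],
    "Happiness Rank": [1, 2, 3, 4],
})

# Text columns such as Country are left out of the matrix
corr = toy.corr(numeric_only=True)

# Correlation of every other numeric feature with the score, strongest positive first
with_score = (corr["Happiness Score"]
              .drop("Happiness Score")
              .sort_values(ascending=False))
print(with_score)
```

On the real data you would swap `toy` for `merged`; values near 1 or -1 mean a strong linear relationship, and values near 0 mean a weak one.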

Last, I am going to save the cleaned data into our data folder so we can access it in future weeks.

In [None]:
merged.to_csv('./data/cleaned_WHR_data.csv')
merged
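A heads-up for when you load the saved file in future weeks: a CSV round trip changes a couple of things. The sketch below uses a made-up frame and a temporary folder (both hypothetical) to show that the row labels come back as an extra `Unnamed: 0` column unless you save with `index=False`, and that years stored as text come back as integers.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame built the same way as merged (stacked years, Year as text)
toy = pd.concat([
    pd.DataFrame({"Country": ["A"], "Year": ["2015"]}),
    pd.DataFrame({"Country": ["A"], "Year": ["2016"]}),
])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")

    toy.to_csv(path)                   # row labels are written as a first column
    with_index = pd.read_csv(path)

    toy.to_csv(path, index=False)      # row labels are left out
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
print(without_index["Year"].tolist())  # read back as integers, not text
```

Neither behavior is a problem as long as you know about it; you can drop the extra column or convert `Year` back to text after loading.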

### Discussion! 
Cool, cool cool cool. (Brooklyn 99 or Community reference anyone?) Now that we have finished this week's content, I want to ask you: what do you think the correlation matrix above shows? When we build correlation matrices for specific regions in a future week, we will notice that the values change drastically depending on the region and its cultural and societal values. Pretty neat, huh? 

Again, I want to stress that if you feel the pace is too fast/too slow, or if you're bored or anything, please please let me know via Slack DM. I truly appreciate any and all feedback, and I want to do my best to make this a fun experience for you guys, so seriously. Roast the shit out of me if you must, because my goal is for y'all to enjoy it : )

Last note: please fill out this feedback form. It's anonymous and takes less than 3 minutes. I just want to gauge how everyone is feeling so I can cater future content to everyone's likes and dislikes.

[Form Link](https://docs.google.com/forms/d/e/1FAIpQLSfcmm4XOfpzUf7IQCKlxrGVBHriHLDVev3XmN38BSmKoUtLGQ/viewform?usp=sf_link)