# Week 3: Silly goofy Visualizations + Outlier Extravaganza

## This Week's Content
This week, we will take it step by step, starting with some visualizations via Plotly and Matplotlib to explore the dataset a little more (think of it like an extended EDA) and get a grasp on what the data looks like from a few perspectives. I thought this was a fun one because there are so many ways data scientists can view a dataset, with the literal bajillions of options to visualize something these days. Let's explore a few of them : )

The way this week will work is that I will *attempt* to teach what each visualization is and why it may be important, ask a short coding and qualitative question, and then provide a demo for you guys!

### Relevant Libraries
[pandas](https://pandas.pydata.org/): a fast, powerful, flexible and easy to use open source data analysis and manipulation tool built on top of the Python programming language. It is one of the most common libraries used in data analysis and we will primarily be using the pandas DataFrame to manipulate our data.

[numpy](https://numpy.org/): the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

[matplotlib](https://matplotlib.org/): a comprehensive library for creating static, animated, and interactive visualizations in Python. The plotting methods on pandas DataFrames (like `DataFrame.plot`) are built on top of matplotlib, so we will likely not have to call it directly.

[plotly](https://plotly.com/python/): Plotly's Python graphing library makes interactive, publication-quality graphs. Its documentation has examples of how to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.

[scipy](https://docs.scipy.org/doc/scipy/reference/stats.html): we will be using its `scipy.stats` module, which contains a large number of probability distributions, summary and frequency statistics, correlation functions and statistical tests, masked statistics, kernel density estimation, quasi-Monte Carlo functionality, and more.

In [52]:
# Uncomment and run the lines below if the code below causes an issue, you may need to download some of these pkgs
# !pip install plotly
# !pip install scipy

import numpy as np
import pandas as pd
from plotly.offline import init_notebook_mode, iplot, plot
import plotly as py
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from scipy import stats

# Now for the content!
## First, load in our beautifully cleaned dataset

In [2]:
# For some reason when we load in the data a random Unnamed column shows up, so we are gonna drop it like it's hot
whr_data = pd.read_csv('https://github.com/shalindb/world_happiness_report/blob/main/data/cleaned_WHR_data.csv?raw=true').drop('Unnamed: 0', axis=1) 
whr_data

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,Year
0,Switzerland,Western Europe,1,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,2015
1,Iceland,Western Europe,2,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.43630,2.70201,2015
2,Denmark,Western Europe,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,2015
3,Norway,Western Europe,4,7.522,1.45900,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,2015
4,Canada,North America,5,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,2015
...,...,...,...,...,...,...,...,...,...,...,...,...
777,Rwanda,Sub-Saharan Africa,152,3.334,0.35900,0.71100,0.61400,0.55500,0.41100,0.21700,0.00000,2019
778,Tanzania,Sub-Saharan Africa,153,3.231,0.47600,0.88500,0.49900,0.41700,0.14700,0.27600,0.00000,2019
779,Afghanistan,Southern Asia,154,3.203,0.35000,0.51700,0.36100,0.00000,0.02500,0.15800,0.00000,2019
780,Central African Republic,Sub-Saharan Africa,155,3.083,0.02600,0.00000,0.10500,0.22500,0.03500,0.23500,0.00000,2019


# Scatterplots
Let's start with a common visualization method most people have heard of and are pretty familiar with: scatterplots!

To start us off, scatterplots are a type of data visualization that lets us compare two numerical variables and check for correlations (positive, negative, or none at all). Pretty simple, but these let us visualize things like the correlations we saw in an earlier week's correlation matrix, the change in happiness scores over time, etc!
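
If you want a number to go with what your eyes see in a scatterplot, here is a minimal sketch using pandas' `Series.corr`, which computes the Pearson correlation by default. The tiny `toy` table and its values are made up for illustration; with the real data you would use columns of `whr_data` instead.

```python
import pandas as pd

# Toy stand-in for a few rows of whr_data (values are made up for illustration)
toy = pd.DataFrame({
    'Family':  [1.35, 1.40, 1.10, 0.80, 0.50],
    'Freedom': [0.67, 0.63, 0.50, 0.40, 0.20],
})

# Pearson correlation: near +1 is a positive trend, near -1 is a negative trend,
# and near 0 means little linear relationship between the two columns
family_freedom_r = toy['Family'].corr(toy['Freedom'])
print(f"Correlation between Family and Freedom: {family_freedom_r:.3f}")
```

A scatterplot of the same two columns would show the points climbing from the bottom left to the top right, which is what a strongly positive value looks like.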


Alright, before we get into something crazy like an overlaid scatterplot (wtf is that, you may ask? I just learned about it 5 minutes ago, so we are going through this together), my task for you is to create the x and y values you'd like to use - AKA, what two features would you like to plot against each other?

I have some skeleton code for you to use below, and I set up the plotly visualization below!
Below are some hints for the ellipses. If you are stuck for too long, double click on the associated hint to get the answer : )

Hint for first set of ellipses: We want to find out where the Year is 2015!
<!-- Answer: whr_data['Year'] == 2015 -->

Hint for second set of ellipses: We want to get the first twenty rows (Try using [this](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html))

<!-- Answer: .iloc[:20, :] -->

Hint for third set of ellipses: We simply want to index the feature name we are trying to get

<!-- Answer: 'Feature Name' -->
<!-- What I mean by this is that you can use any of the feature names i.e. Family, Freedom, Generosity, etc -->

In [5]:
x_data = whr_data.loc[whr_data['Year'] == 2015].iloc[:20, :]['Family']

y_data = whr_data.loc[whr_data['Year'] == 2015].iloc[:20, :]['Freedom'] # fill this out with the same exact code as above, just change your feature name!


In [51]:

data = go.Scatter(  x = x_data,
                    y = y_data,
                    mode = "markers",
                    name = "2015",
                    marker = dict(color = 'blue'),
                    #line = dict(color='firebrick', width=4, dash='dot'),
                    text= whr_data.loc[whr_data['Year'] == 2015].Country)
layout = dict(title = 'Family vs. Freedom in 2015 for Top 20 Countries',
              xaxis= dict(title= 'Family',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'Freedom',ticklen= 5,zeroline= False),
              hovermode="x"
             )
fig = dict(data = data, layout = layout)
iplot(fig)

Below, we will make an overlaid scatterplot of the changes in reported happiness for the top 20 countries

In [7]:
# import graph objects as "go"
import plotly.graph_objs as go
# creating trace1
trace1 =go.Scatter(
                    x = whr_data.loc[whr_data['Year'] == 2015].iloc[:20, :]['Country'],
                    y = whr_data.loc[whr_data['Year'] == 2015].iloc[:20, :]['Happiness Score'],
                    mode = "markers",
                    name = "2015",
                    marker = dict(color = 'red'),
                    #line = dict(color='firebrick', width=4, dash='dot'),
                    text= whr_data.loc[whr_data['Year'] == 2015].Country)
# creating trace2
trace2 =go.Scatter(
                    x = whr_data.loc[whr_data['Year'] == 2016].iloc[:20, :]['Country'],
                    y = whr_data.loc[whr_data['Year'] == 2016].iloc[:20, :]['Happiness Score'],
                    mode = "markers",
                    name = "2016",
                    marker = dict(color = 'green'),
                    text= whr_data.loc[whr_data['Year'] == 2016].Country)
# creating trace3
trace3 =go.Scatter(
                    x = whr_data.loc[whr_data['Year'] == 2017].iloc[:20, :]['Country'],
                    y = whr_data.loc[whr_data['Year'] == 2017].iloc[:20, :]['Happiness Score'],
                    mode = "markers",
                    name = "2017",
                    marker = dict(color = 'blue'),
                    text= whr_data.loc[whr_data['Year'] == 2017].Country)

# creating trace4
trace4 =go.Scatter(
                    x = whr_data.loc[whr_data['Year'] == 2018].iloc[:20, :]['Country'],
                    y = whr_data.loc[whr_data['Year'] == 2018].iloc[:20, :]['Happiness Score'],
                    mode = "markers",
                    name = "2018",
                    marker = dict(color = 'black'),
                    text= whr_data.loc[whr_data['Year'] == 2018].Country)

# creating trace5
trace5 =go.Scatter(
                    x = whr_data.loc[whr_data['Year'] == 2019].iloc[:20, :]['Country'],
                    y = whr_data.loc[whr_data['Year'] == 2019].iloc[:20, :]['Happiness Score'],
                    mode = "markers",
                    name = "2019",
                    marker = dict(color = 'pink'),
                    text= whr_data.loc[whr_data['Year'] == 2019].Country)


data = [trace1, trace2, trace3, trace4, trace5]
layout = dict(title = 'Happiness Score Changes from 2015 to 2019 for Top 20 Countries',
              xaxis= dict(title= 'Country',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'Happiness',ticklen= 5,zeroline= False),
              hovermode="x unified"
             )
fig = dict(data = data, layout = layout)
iplot(fig)
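
The demo above takes the first twenty rows of each year. If you would rather follow the *same* set of countries across every year, here is one possible sketch that builds the traces in a loop. The helper name `make_year_traces` and the toy table are made up for illustration, and it assumes each country appears at most once per year. The traces are plain dictionaries, which Plotly accepts in place of `go.Scatter` objects.

```python
import pandas as pd

# Toy stand-in for whr_data: three countries, two years, made-up scores,
# already sorted by rank within each year like the real table
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'B', 'A', 'C'],
    'Happiness Score': [7.5, 7.4, 7.0, 7.6, 7.3, 6.9],
    'Year': [2015, 2015, 2015, 2016, 2016, 2016],
})

def make_year_traces(df, base_year, top_n, colors):
    """Build one scatter trace per year for the top_n countries of base_year."""
    top_countries = df.loc[df['Year'] == base_year, 'Country'].iloc[:top_n].tolist()
    traces = []
    for year, color in colors.items():
        # Look up each tracked country's score for this year by name, not by row position
        scores = (df.loc[df['Year'] == year]
                    .set_index('Country')['Happiness Score']
                    .reindex(top_countries))  # NaN if a country is missing that year
        traces.append(dict(type='scatter', x=top_countries, y=scores.tolist(),
                           mode='markers', name=str(year),
                           marker=dict(color=color), text=top_countries))
    return traces

traces = make_year_traces(toy, base_year=2015, top_n=2,
                          colors={2015: 'red', 2016: 'green'})
```

With the real data you could call it with `whr_data`, `top_n=20`, and one color per year, then pass `dict(data=traces, layout=layout)` to `iplot` as before.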

# Map Plots
These types of plots are used to understand geographical data and allow us to see general trends across the globe! Plotly offers a bunch of different map types; however, we won't be going through all of them.

We will input our data in a dictionary as follows:
* type --> type of the map
* colorscale --> palette
* marker_line_width --> width of border line of countries
* locations --> locations from dataset
* locationmode --> locations created via country names
* z --> The column of the graph we want to display (in our case, the feature)
* text --> hovertext
* colorbar --> determining colorbar

All we really care about here is `z`, since that is the feature being colored in. `locations` and `locationmode` just tell Plotly which country each value belongs to, and everything else is styling, so you can choose those values based on what you want your graph to look like. As long as you have some list of country names, you will be able to visualize those scores by country as we do below!

Note: Because there are different years' data in our dataset, we would have to do a little more tweaking in order to get the averaged value over the years which may lead to a slightly different visualization. In our case, we are technically looking at values from 2015 as they are the first values that appear for a country in our dataset.
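
For the curious, the "little more tweaking" could look something like the sketch below: average the feature per country first, then hand the averaged values to the same kind of dictionary. The toy table and the name `avg_by_country` are made up for illustration; with the real data you would group `whr_data` instead.

```python
import pandas as pd

# Toy stand-in for whr_data with two years per country (made-up values)
toy = pd.DataFrame({
    'Country': ['Norway', 'Norway', 'Togo', 'Togo'],
    'Family': [1.3, 1.5, 0.1, 0.5],
    'Year': [2015, 2016, 2015, 2016],
})
feature_name = 'Family'

# One row per country: the mean of the feature across all of its years
avg_by_country = toy.groupby('Country', as_index=False)[feature_name].mean()

# Same dictionary shape as the choropleth cell, but fed the averaged values
avg_map_data = dict(
    type='choropleth',
    locations=avg_by_country['Country'],
    locationmode='country names',
    z=avg_by_country[feature_name],
    text=avg_by_country['Country'],
    colorbar={'title': f'Mean {feature_name}'},
)
```

Because every country now appears exactly once, there is no ambiguity about which year's value ends up on the map.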

In [31]:
feature_name = 'Family'
data = dict(
        type = 'choropleth',
        colorscale = 'Viridis',
         marker_line_width=1,
        locations = whr_data['Country'],
        locationmode = "country names",
        z = whr_data[feature_name],
        text = whr_data['Country'],
        colorbar = {'title' : feature_name})

layout = dict(title = f'Map of Global {feature_name} Values ',
              geo = dict(projection = {'type':'mercator'}, showocean = False, showlakes = True, showrivers = True, )
             )
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap,validate=False)

### Policy Problem
Okay, so I'm going to introduce these types of questions where I propose certain policy decisions - or ask you to propose your own - and you use your visualization to try and figure out what you'd want to implement!

This first one will be a pretty easy one to get us started:
If we were ONLY using the visualization of Happiness Scores, what region would we likely want to focus our international policy decisions on in order to have equity in everyone's happiness? (AKA which region has, on average, the lowest happiness scores?)



This second one is a bit more tricky:
Given the distribution of 'Family' values across the globe, let us choose the country with the lowest score. What policy do you believe would be the most effective and why?
1. Offering income subsidies to the working class of this country to decrease their working hours per week
2. Enacting children's rights laws to ensure proper education and growth for children ages 1-12

Feel free to do a bit of research on this country and their position on family values, and how each of these policy decisions could impact norms in this country!

# Bubble Charts
A bubble chart is a bit of a step up from a scatterplot as discussed above. Here, we can also change the size of bubbles to reflect yet another, third variable! And we can even change the color to reflect another,,, FOURTH variable! That's a lot of variables. This can be useful if we have something like our happiness score, and then multiple features that may affect that score. We can see if the largest bubbles are also largest in regards to certain features.
Also these are some of the most aesthetically pleasing visualizations to look at, the pastel colors are all nice and pretty

Our data will be input in the form:

**Important:**
* x --> x axis
* y --> y axis
* marker --> color represents 3rd dimension of graph, size represents 4th dimension of graph

**Not as important:**
* mode --> how points are displayed
* name --> name of the color (i.e.: red for 2015, green for 2016..)
* showscale --> colorbar
* text --> Text that shows when you hover over a dot. (In this ex: Country name)

We will be plotting two features against each other with Happiness Score as the size of the bubbles and Trust (Government Corruption) as the color, which gives us 4 variables in total. Feel free to play around with this by changing the `size` and `color` attributes in the `marker` key of the dictionary (setting both to Happiness Score, for example, brings it back down to 3 variables).

In [40]:
bubble_x = 'Family'
bubble_y = 'Economy (GDP per Capita)'
bubble_color = 'Trust (Government Corruption)'
bubble_size = 'Happiness Score'
# Keep only the 2019 rows so x, y, color, size, and hover text all line up row by row
whr_2019 = whr_data.loc[whr_data['Year'] == 2019]
data = dict(x = whr_2019[bubble_x],
            y = whr_2019[bubble_y],
            mode = 'markers',
            marker = {
                'color': whr_2019[bubble_color],
                'size': whr_2019[bubble_size],
                'showscale': True,
            },
            text = whr_2019.Country)

layout = go.Layout(barmode='group', hovermode="x",
                   title=f'Bubble Chart: x = {bubble_x}, y = {bubble_y}, Size = {bubble_size}, Color = {bubble_color}, Year = 2019',
                   xaxis=dict(title=bubble_x),
                   yaxis=dict(title=bubble_y))

fig = go.Figure(data=data, layout=layout)
iplot(fig)
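
One optional tweak: Happiness Scores are single-digit numbers, so when they are used directly as marker sizes the bubbles come out small and hard to tell apart. A possible sketch for scaling them uses the marker's `sizemode` and `sizeref` settings, with the `sizeref` formula that Plotly's bubble chart documentation suggests. The scores and the `max_bubble_px` value below are made up for illustration.

```python
import pandas as pd

sizes = pd.Series([3.2, 5.0, 7.6])  # made-up Happiness Scores
max_bubble_px = 40                  # hypothetical choice: size of the biggest bubble in pixels

marker = dict(
    color=sizes,
    size=sizes,
    sizemode='area',  # bubble area (not diameter) is proportional to the value
    # Scaling factor suggested in Plotly's bubble chart docs for sizemode='area'
    sizeref=2.0 * sizes.max() / (max_bubble_px ** 2),
    sizemin=4,        # keep the smallest bubbles visible
    showscale=True,
)
```

You could drop a dictionary like this into the `marker` key of the bubble chart data, swapping `sizes` for the Happiness Score column of the year you are plotting.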

Conceptual understanding questions bc bubble charts can be confusing as hell

1. Given:

    `bubble_x = 'Family'
    bubble_y = 'Freedom'
    bubble_size = 'Happiness Score'
    bubble_color = 'Happiness Score'`
    
If I see a small bubble with low x value and high y value, what does that mean?

2. Given:

    `bubble_x = 'Family'
    bubble_y = 'Economy (GDP per Capita)'
    bubble_size = 'Happiness Score'
    bubble_color = 'Trust (Government Corruption)'`
    
If I see a large, yellow bubble with high x value and high y value, what does that mean?


*Hint: Denmark is a country that is associated with the above characteristics*

# Intro to Outliers

What are outliers and why are they important to understand when doing data analysis? They're lil' data points that diverge from what all the other data points are doing. They are the quirky ones that are out of place in what we expect the data to show. For example, let's say Health is highly correlated with Happiness Score. But what if there's a country, let's call it ShalinLand, where they only eat Trader Joe's Frozen Pizzas because they're monsters, and yet their happiness is at its peak? It's intriguing: how the heck is this country of insanely unhealthy maniacs still finding happiness? Outliers like this let us ask more questions and consider alternate hypotheses.

We can find outliers via visualizations (if you look at the visualizations above, I'm sure you can find a few points that stick out) or calculate them *mathematically*.

In order to calculate the outliers by hand, there's this process where you find the quartiles, compute the interquartile range, do a lil mathematics, and out pops what you're looking for. Instead of that *slightly* annoying process, right now we can simply use the `zscore` function that the scipy library so conveniently has for us.

This will calculate how far some value is from the mean/average of the data in terms of standard deviations. (If this language is confusing to you, data8 will clear it up but for now, think of it as how far a certain datapoint is from the average of all the data. If it's a certain distance away, it's considered an outlier)
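
To make "how far from the mean in terms of standard deviations" concrete, here is a small sketch that computes z-scores by hand with numpy. The scores are made up for illustration. Using `ddof=0` (the population standard deviation) matches the default behavior of `stats.zscore`.

```python
import numpy as np

scores = np.array([7.5, 7.4, 5.2, 5.0, 4.9, 5.1, 5.3, 2.0])  # made-up values

# z-score = (value - mean) / standard deviation
z = (scores - scores.mean()) / scores.std(ddof=0)

# Flag anything more than 2 standard deviations away from the mean
is_outlier = np.abs(z) > 2
print(scores[is_outlier])
```

Notice that the two high scores are not flagged here: they are above average, but still within 2 standard deviations of the mean.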

### Question 1:
Find outliers in the `whr_data` table where the absolute value of the zscore for Happiness Score is larger than 2

*Hint 1: stats.zscore(...) will return an array of zscores*

*Hint 2: np.abs(...) will come in handy*

*Hint 3: If we want to filter a dataframe, we can do something like `df[array > threshold]`*

In [57]:
happiness_score_column = whr_data['Happiness Score']
abs_zscores = np.abs(stats.zscore(happiness_score_column))
filtered_whr_data = whr_data[abs_zscores > 2]

filtered_whr_data

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,Year
155,Syria,Middle East and Northern Africa,156,3.006,0.6632,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858,2015
156,Burundi,Sub-Saharan Africa,157,2.905,0.0153,0.41587,0.22396,0.1185,0.10062,0.19727,1.83302,2015
157,Togo,Sub-Saharan Africa,158,2.839,0.20868,0.13995,0.28443,0.36453,0.10731,0.16681,1.56726,2015
313,Syria,Middle East and Northern Africa,156,3.069,0.74719,0.14866,0.62994,0.06912,0.17233,0.48397,0.81789,2016
314,Burundi,Sub-Saharan Africa,157,2.905,0.06831,0.23442,0.15747,0.0432,0.09419,0.2029,2.10404,2016
468,Burundi,Sub-Saharan Africa,154,2.905,0.091623,0.629794,0.151611,0.059901,0.084148,0.204435,1.683024,2017
469,Central African Republic,Sub-Saharan Africa,155,2.693,0.0,0.0,0.018773,0.270842,0.056565,0.280876,2.066005,2017
624,Central African Republic,Sub-Saharan Africa,155,3.083,0.024,0.0,0.01,0.305,0.038,0.218,0.0,2018
625,Burundi,Sub-Saharan Africa,156,2.905,0.091,0.627,0.145,0.065,0.076,0.149,0.0,2018
626,Finland,Western Europe,1,7.769,1.34,1.587,0.986,0.596,0.393,0.153,0.0,2019


What do you notice about these outliers? Are there more outliers that have a low happiness score or a high happiness score? Try changing the feature you're finding outliers for to see how the outlier countries change!
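
If you'd like to compare against the interquartile-range process mentioned earlier, here is a minimal sketch of the common 1.5 × IQR rule using pandas. The scores are made up for illustration, and the 1.5 multiplier is the usual convention rather than anything specific to this dataset.

```python
import pandas as pd

scores = pd.Series([2.0, 4.9, 5.0, 5.1, 5.2, 5.3, 5.4, 8.9])  # made-up values

q1 = scores.quantile(0.25)  # 25th percentile
q3 = scores.quantile(0.75)  # 75th percentile
iqr = q3 - q1               # spread of the middle half of the data

# Anything beyond 1.5 * IQR from the quartiles counts as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_outliers = scores[(scores < lower) | (scores > upper)]
print(iqr_outliers)
```

Unlike the z-score, this rule is built from quartiles rather than the mean and standard deviation, so a couple of extreme values do not shift the cutoffs as much.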

For multivariate datasets (e.g. we are trying to find outliers when plotting two variables against each other) we can use what is known as the Mahalanobis Distance (feel free to read up on it [here](https://towardsdatascience.com/multivariate-outlier-detection-in-python-e946cfc843b3)). This concept is incredibly out of scope for most classes you may take except for EECS 127 and CS 189. The main idea is that it measures how far a point is from the center of the data while accounting for how the variables move together, so it can find outliers in situations where there are multiple variables. Use the above linked reading to find out more if you are interested in the math behind it. We can use this, along with scipy's `scipy.spatial.distance.mahalanobis` function, in order to calculate that distance and find outliers in an n-dimensional space.
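
Here is a rough sketch of the idea using only numpy, so you can see every step. The two-feature data is randomly generated with one deliberately planted off-trend point, so none of it comes from `whr_data`. scipy's `mahalanobis(u, v, VI)` computes the same distance for a single pair of points, where `VI` is the inverse covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 points where y roughly follows 0.8 * x ...
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.3, size=200)
points = np.column_stack([x, y])
# ... plus one planted point that goes against the trend
points = np.vstack([points, [2.0, -2.0]])

mean = points.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(points, rowvar=False))  # inverse covariance matrix

# Mahalanobis distance of each point p from the mean:
# sqrt((p - mean)^T * cov_inv * (p - mean))
diffs = points - mean
distances = np.sqrt(np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs))

most_unusual = int(np.argmax(distances))
print(points[most_unusual], distances[most_unusual])
```

The planted point is not extreme in x or y on its own, so a per-column z-score would be less likely to flag it. It stands out because it breaks the relationship between the two variables.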