# Week 3: Silly goofy Visualizations + Outlier Extravaganza

## This Week's Content
This week, we will take it step by step, starting with some visualizations in Plotly and Matplotlib to explore the dataset a little more (think of it as an extended EDA) and get a grasp on what the data looks like from a few perspectives. I thought this was a fun one because data scientists have literal bajillions of options for visualizing a dataset these days. Let's explore a few of them : )

Here's how this week will work: I will *attempt* to teach what each visualization is and why it may be important, ask a short coding question and a qualitative question, and then provide a demo for you guys!

### Relevant Libraries
[pandas](https://pandas.pydata.org/): a fast, powerful, flexible and easy to use open source data analysis and manipulation tool built on top of the Python programming language. It is one of the most common libraries used in data analysis and we will primarily be using the pandas DataFrame to manipulate our data.

[numpy](https://numpy.org/): the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

[matplotlib](https://matplotlib.org/): a comprehensive library for creating static, animated, and interactive visualizations in Python. Many of the matplotlib functions are built into pandas DataFrames, so we will likely not have to call them directly.

[plotly](https://plotly.com/python/): Plotly's Python graphing library makes interactive, publication-quality graphs, including line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.
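To make the note about matplotlib being built into pandas a bit more concrete, here is a minimal sketch. It uses five rows copied from the 2015 portion of the dataset loaded later in this notebook; the variable name `sample` and the plot title are just illustrative choices.

```python
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

# Five rows copied from the 2015 data used in this notebook
sample = pd.DataFrame({
    "Family": [1.34951, 1.40223, 1.36058, 1.33095, 1.32261],
    "Happiness Score": [7.587, 7.561, 7.527, 7.522, 7.427],
})

# DataFrame.plot.scatter draws with matplotlib under the hood
# and hands back a regular matplotlib Axes object
ax = sample.plot.scatter(x="Family", y="Happiness Score")
ax.set_title("Family vs. Happiness Score (5 sample rows)")

print(isinstance(ax, matplotlib.axes.Axes))  # True
```

Because the result is an ordinary Axes, any matplotlib tweak (titles, limits, annotations) still works on it. In a notebook the figure shows up on its own; in a script you would call `plt.show()`.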

In [1]:
# If the imports below fail, uncomment and run the line below; you may need to install some of these packages
# !pip install plotly

import numpy as np
import pandas as pd
from plotly.offline import init_notebook_mode, iplot, plot
import plotly as py
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import matplotlib.pyplot as plt

## First, load in our beautifully cleaned dataset

In [6]:
# The CSV has an extra 'Unnamed: 0' column (most likely the row index that was saved along with the data), so we are gonna drop it like it's hot
whr_data = pd.read_csv('https://github.com/shalindb/world_happiness_report/blob/main/data/cleaned_WHR_data.csv?raw=true').drop('Unnamed: 0', axis=1)
whr_data

| | Country | Region | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | Western Europe | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 | 2015 |
| 1 | Iceland | Western Europe | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 | 2015 |
| 2 | Denmark | Western Europe | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 | 2015 |
| 3 | Norway | Western Europe | 4 | 7.522 | 1.45900 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.46531 | 2015 |
| 4 | Canada | North America | 5 | 7.427 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.45176 | 2015 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 777 | Rwanda | Sub-Saharan Africa | 152 | 3.334 | 0.35900 | 0.71100 | 0.61400 | 0.55500 | 0.41100 | 0.21700 | 0.00000 | 2019 |
| 778 | Tanzania | Sub-Saharan Africa | 153 | 3.231 | 0.47600 | 0.88500 | 0.49900 | 0.41700 | 0.14700 | 0.27600 | 0.00000 | 2019 |
| 779 | Afghanistan | Southern Asia | 154 | 3.203 | 0.35000 | 0.51700 | 0.36100 | 0.00000 | 0.02500 | 0.15800 | 0.00000 | 2019 |
| 780 | Central African Republic | Sub-Saharan Africa | 155 | 3.083 | 0.02600 | 0.00000 | 0.10500 | 0.22500 | 0.03500 | 0.23500 | 0.00000 | 2019 |
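Where does a column like `Unnamed: 0` come from? A likely explanation (an assumption here, since we did not see how the cleaned file was written) is that the DataFrame was saved with its row index included. The sketch below reproduces that effect on a tiny two-row frame, fully in memory, and shows two common ways around it.

```python
import io
import pandas as pd

# Tiny stand-in frame; the two rows are copied from the data above
df = pd.DataFrame({
    "Country": ["Switzerland", "Iceland"],
    "Happiness Score": [7.587, 7.561],
})

# to_csv writes the row index by default, as a first column with no header
csv_text = df.to_csv()

# Reading it back naively turns that header-less column into 'Unnamed: 0'
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded.columns.tolist())  # ['Unnamed: 0', 'Country', 'Happiness Score']

# Option 1: tell read_csv that the first column is the index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(with_index.columns.tolist())  # ['Country', 'Happiness Score']

# Option 2: skip the index when saving in the first place
clean_csv = df.to_csv(index=False)
print(pd.read_csv(io.StringIO(clean_csv)).columns.tolist())  # ['Country', 'Happiness Score']
```

Dropping the column after loading, as done in this notebook, gets you to the same place; `index_col=0` just does it in one step.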


## Scatterplots
Let's start with a common visualization method most people have heard of and are pretty familiar with: scatterplots!

To start us off, a scatterplot is a type of data visualization that lets us compare two numerical variables and check for a correlation between them (positive, negative, or none). Pretty simple, but scatterplots let us visualize things like the correlations we saw in an earlier week's correlation matrix, the change in happiness scores over time, and so on!
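To put numbers on the three cases a scatterplot can reveal, here is a small sketch using made-up values (not the happiness data) and the Pearson correlation from `Series.corr`. The names `x`, `y_up`, `y_down`, and `y_flat` are purely illustrative.

```python
import pandas as pd

# Made-up toy values, chosen so the three cases come out cleanly
x = pd.Series([1, 2, 3, 4, 5])
y_up = pd.Series([2, 4, 6, 8, 10])     # rises with x
y_down = pd.Series([10, 8, 6, 4, 2])   # falls as x rises
y_flat = pd.Series([3, 1, 4, 1, 3])    # no linear trend with x

# Series.corr computes the Pearson correlation coefficient by default
r_up = x.corr(y_up)
r_down = x.corr(y_down)
r_flat = x.corr(y_flat)

print(round(r_up, 3), round(r_down, 3), round(abs(r_flat), 3))  # 1.0 -1.0 0.0
```

On a scatterplot, the first case looks like points climbing from lower left to upper right, the second like points sloping downward, and the third like a cloud with no clear direction. Real data almost never lands exactly on 1, -1, or 0.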


Alright, before we get into something crazy like an overlaid scatterplot (wtf is that, you may ask? I just learned about it 5 minutes ago, so we are going through this together), my task for you is to create the x and y values you'd like to use - AKA, what two features would you like to plot against each other?

I have some skeleton code for you to use below, and I have already set up the plotly visualization that follows it!
Here are some hints for the ellipses. If you are stuck for too long, double click on the associated hint to get the answer : )

Hint for first set of ellipses: We want only the rows where the Year is 2015!
<!-- Answer: whr_data['Year'] == 2015 -->

Hint for second set of ellipses: We want to get the first twenty rows (Try using [this](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html))

<!-- Answer: .iloc[:20, :] -->

Hint for third set of ellipses: We simply want to index the feature name we are trying to get

<!-- Answer: 'Feature Name' -->
<!-- What I mean by this is that you can use any of the feature names i.e. Family, Freedom, Generosity, etc -->
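If the three steps feel abstract, here is the same pattern (filter rows with a condition, keep the first few rows, pick one column) on a made-up toy table. Everything in it, including the `toy` frame and its `Season`, `Team`, and `Points` columns, is hypothetical and only there to show the shape of the code.

```python
import pandas as pd

# Made-up toy table, unrelated to the happiness data
toy = pd.DataFrame({
    "Season": [2020, 2020, 2020, 2021, 2021],
    "Team": ["A", "B", "C", "A", "B"],
    "Points": [90, 85, 70, 88, 91],
})

# Step 1: a boolean condition inside .loc keeps only the matching rows
season_2020 = toy.loc[toy["Season"] == 2020]

# Step 2: .iloc slices by position, here the first two rows and all columns
first_two = season_2020.iloc[:2, :]

# Step 3: square brackets with a column name pick out a single column
points = first_two["Points"]

print(points.tolist())  # [90, 85]
```

The three steps can also be chained on one line, which is the shape the skeleton code below is asking for.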

In [3]:
x_data = whr_data.loc[...]...[...]

y_data = ... # fill this out with the same exact code as above, just change your feature name!


SyntaxError: invalid syntax (<ipython-input-3-d5fcf1de9b0c>, line 1)

In [4]:
import plotly.graph_objs as go
# a single trace: one marker per country, using the x_data and y_data you built above
trace = go.Scatter( x = x_data,
                    y = y_data,
                    mode = "markers",
                    name = "2015",
                    marker = dict(color = 'blue'),
                    # hover text: only the first 20 countries of 2015, to match x_data and y_data
                    text= whr_data.loc[whr_data['Year'] == 2015].iloc[:20, :]['Country'])
# change the title and axis titles below to match the two features you chose
layout = dict(title = 'Family vs. Happiness Score for the Top 20 Countries in 2015',
              xaxis= dict(title= 'Family',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'Happiness',ticklen= 5,zeroline= False),
              hovermode="x"
             )
fig = dict(data = [trace], layout = layout)
iplot(fig)

NameError: name 'x_data' is not defined

Below, we will make an overlaid scatterplot (several sets of markers drawn on the same axes, one set per year) of the changes in reported happiness for the top 20 countries.

In [5]:
# import graph objects as "go"
import plotly.graph_objs as go

# the top 20 countries of 2015 define the x-axis for every year
top20 = whr_data.loc[whr_data['Year'] == 2015].iloc[:20, :]['Country']

# one marker color per year
year_colors = {2015: 'red', 2016: 'green', 2017: 'blue', 2018: 'black', 2019: 'pink'}

# creating one trace per year
data = []
for year, color in year_colors.items():
    # Look up this year's score for each of those 20 countries by name, so every
    # marker sits above the right country (a country missing that year becomes NaN).
    # Slicing each year's first 20 rows instead would pair scores with the wrong
    # countries whenever the ranking changes.
    # This assumes country names are spelled the same way in every year.
    year_scores = (whr_data.loc[whr_data['Year'] == year]
                   .drop_duplicates('Country')
                   .set_index('Country')['Happiness Score']
                   .reindex(top20))
    data.append(go.Scatter(
                    x = top20,
                    y = year_scores,
                    mode = "markers",
                    name = str(year),
                    marker = dict(color = color),
                    text= top20))

layout = dict(title = 'Happiness Score 2015 to 2019 for the Top 20 Countries of 2015',
              xaxis= dict(title= 'Country',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'Happiness',ticklen= 5,zeroline= False),
              hovermode="x unified"
             )
fig = dict(data = data, layout = layout)
iplot(fig)
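One thing to watch with a plot like this: each year's rows are ranked separately, so the i-th row of 2016 is not guaranteed to be the same country as the i-th row of 2015. A quick way to check that scores line up by country is to pivot to one row per country and one column per year. The sketch below does that on a made-up three-country table; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Made-up toy data: countries A and B swap places between 2015 and 2016
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "B", "A", "C"],
    "Year": [2015, 2015, 2015, 2016, 2016, 2016],
    "Happiness Score": [7.5, 7.4, 7.3, 7.6, 7.45, 7.2],
})

# One row per country, one column per year
wide = toy.pivot(index="Country", columns="Year", values="Happiness Score")

# The top 2 countries of 2015, in rank order
top_2015 = toy.loc[toy["Year"] == 2015, "Country"].iloc[:2].tolist()

# Aligned by country name: A's and B's own 2016 scores
aligned = wide.loc[top_2015, 2016].tolist()

# Sliced by position: the 2016 scores of whoever ranked 1st and 2nd in 2016
positional = toy.loc[toy["Year"] == 2016, "Happiness Score"].iloc[:2].tolist()

print(top_2015)    # ['A', 'B']
print(aligned)     # [7.45, 7.6]
print(positional)  # [7.6, 7.45]
```

The two lists differ because the ranking changed, which is exactly the situation where labeling positional slices with the 2015 country names would mislead.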