# Visualization Is The Gateway Drug To Statistics

The main point of this chapter is to show you how much you can do with very little. We will get into the hard-core (introductory) data science stuff soon, but first let’s play. Python need not be intimidating; you just have to find something fun to do with it and off you will go.

I’m actually not sure who said visualization is the gateway drug to statistics, but I like it, and it’s true. If I find out, I will add it to the online errata for the book. The best way to break down the barriers of programming and scary statistics is to visualize something, so let’s get started.

I expect that after you see how easy some visuals are in Python you will be off and running with your own data explorations. Data visualization is one of the pillars of data science, so it is critical that you have a working knowledge of as many visualizations as you can and are able to produce as many as you can. Even more important is the ability to identify a bad visualization, if for no other reason than to make certain you do not create one and release it into the wild. There is a site for those people; don’t be those people! Edward Tufte has done some great work in the field of visualization, and one term I want to introduce to you is chartjunk. Just knowing that the term exists will make your visualizations better. You certainly do not want one of your visualizations to end up on http://viz.wtf.


We are going to start easy. You should already have a Python notebook environment set up; if you have not, this would be a great time to do it. Your first visualization is what is typically considered advanced, but I will let you be the judge of that after we are done.

**Packages** – Packages are the fundamental units of reproducible Python code. They include reusable Python functions, the documentation that describes how to use them, and sometimes sample data.

**Choropleth** – A thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density, per-capita income, or just about anything you can imagine stuffing into a map.



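Below is the code for two choropleths built with the Folium package. The first shades each state by its unemployment rate from the US_Unemployment_Oct2012 data set. The second uses the Civilian_labor_force_2011 column, the civilian labor force of every county in the US, though very likely out of date.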

Install the Folium package first. With `pip` the package name needs no quotes, and the leading `!` tells the notebook to run the line as a shell command.


In [None]:
!pip install folium 
# json ships with Python's standard library, so there is nothing to install for it
!pip install requests
!pip install pydataset
!pip install seaborn

In [None]:
#!pip will install the package to the notebook instance from the internet  
#!pip install folium

#import will load the package into memory 
import folium
import json
import requests
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

m = folium.Map(location=[35.5585, -75.4665], zoom_start=6)
m

In [None]:


url = (
    "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)

state_geo = f"{url}/us-states.json"
state_unemployment = f"{url}/US_Unemployment_Oct2012.csv"
state_data = pd.read_csv(state_unemployment)


In [None]:
state_unemployment

In [None]:
state_data.head(10)

In [None]:
m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
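    geo_data=state_geo,  # GeoJSON file with the state polygons
    name="choropleth",  # layer name shown in the layer control
    data=state_data,  # DataFrame holding the values to plot
    columns=["State", "Unemployment"],  # key column, then value column
    key_on="feature.id",  # GeoJSON field matched against the key column
    fill_color="YlGn",  # yellow-to-green color scale
    fill_opacity=0.9,  # opacity of the state fill
    line_opacity=0.4,  # opacity of the state borders
    legend_name="Unemployment Rate (%)",  # legend title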
).add_to(m)

folium.LayerControl().add_to(m)

m

In [None]:
import branca

url = (
    "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
county_data = f"{url}/us_county_data.csv"
county_geo = f"{url}/us_counties_20m_topo.json"


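df = pd.read_csv(county_data, na_values=[" "])  # treat blank cells as missing values

# Linear yellow-to-red color scale set up for values from 0 to 50,000
colorscale = branca.colormap.linear.YlOrRd_09.scale(0, 50e3)
# Series indexed by county FIPS code; the values are the 2011 civilian labor force
employed_series = df.set_index("FIPS_Code")["Civilian_labor_force_2011"]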



In [None]:
df

**FIPS** codes are numbers that uniquely identify geographic areas. The number of
digits in a FIPS code varies depending on the level of geography. State-level FIPS
codes have two digits; county-level FIPS codes have five digits, of which the
first two are the FIPS code of the state to which the county belongs.
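
To make the two-plus-three digit structure concrete, here is a small sketch that splits a county FIPS code into its state and county parts. The helper name `split_county_fips` and the sample feature id are made up for illustration; the only assumption carried over from the notebook is that a feature id ends in the five-digit county code, which is what the `[-5:]` slice in the style function relies on.

```python
def split_county_fips(code):
    """Split a county FIPS code into (state, county) parts as strings."""
    # Numeric columns drop leading zeros (01001 is stored as 1001),
    # so pad back to five digits before slicing.
    padded = str(code).zfill(5)
    if len(padded) != 5 or not padded.isdigit():
        raise ValueError(f"not a five-digit county FIPS code: {code!r}")
    return padded[:2], padded[2:]


# Hypothetical feature id that ends in a county FIPS code
feature_id = "0500000US37055"
state, county = split_county_fips(feature_id[-5:])
print(state, county)            # 37 055
print(split_county_fips(1001))  # ('01', '001')
```

The zero padding matters because a column such as `FIPS_Code` is usually read as integers, which is also why the style function converts the id with `int(...)` before looking it up.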

In [None]:
employed_series

In [None]:

def style_function(feature):
    # The last five characters of the feature id are the county FIPS code
    employed = employed_series.get(int(feature["id"][-5:]), None)
    return {
        "fillOpacity": 1,
        "weight": .1,
        # counties with no matching data are drawn in black
        "fillColor": "black" if employed is None else colorscale(employed),
    }


m = folium.Map(location=[48, -102], tiles="cartodbpositron", zoom_start=3)

folium.TopoJson(
    json.loads(requests.get(county_geo).text),
    "objects.us_counties_20m",
    style_function=style_function,
).add_to(m)


m

# Copy the code above, create a new dataset using a different column, and create a new map.

*****

# THE HISTOGRAM

You have probably noticed that Python behaves very much like a scripting language, which for me, having been a T-SQL guy, seemed familiar. You may also have noticed that it behaves like a programming language, in that you can install a package and invoke a function or data set stored in that package, very much like a DLL, though no compiling is required.

It’s clear that it is very flexible as a language, which, as you will learn, is both its strength and its downfall. If you decide to start designing your own Python packages, you can write them as terribly as you want, though I would rather you didn’t.

There are lots of places to find datasets: pydataset, seaborn, scikit-learn, pandas, and NLTK. These will give you a nice list of datasets to doodle with; as you learn something new, explore the datasets to see how you can apply your new-found knowledge.

Examples below



In [None]:
# Import package
#!pip install pydataset
from pydataset import data

# Check out datasets
data()

In [None]:
# Import seaborn
import seaborn as sns

# Check out available datasets
print(sns.get_dataset_names())

In [None]:
# Import package
# https://scikit-learn.org/stable/datasets.html
# load_boston was removed in scikit-learn 1.2, so this uses another bundled dataset
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
print(X.shape)

In [None]:

#https://www.nltk.org/book/ch02.html
    

First, let’s check out the histogram. If you have worked with SQL you know what a histogram is, and it is marginally similar to a statistical histogram visual. We are going to look at a real one. The basic definition is that it is a graphical representation of the distribution of numerical data: the range of values is split into bins, and each bar shows how many values fall into its bin.
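
To make that definition concrete, here is a rough sketch of the counting a histogram does before anything is drawn. The function `bin_counts` and the MPG numbers are made up for illustration, and plotting libraries pick their bin edges in their own ways, so treat this as the idea rather than the exact method any library uses.

```python
def bin_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if low <= v <= high:
            # A value sitting exactly on the top edge goes into the last bin
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


mpg = [12, 15, 18, 21, 22, 24, 25, 27, 30, 33, 41]  # made-up values
counts = bin_counts(mpg, low=10, high=50, n_bins=4)

# Draw a crude text histogram, one row per bin
for i, count in enumerate(counts):
    start = 10 + i * 10
    print(f"{start}-{start + 10}: {'#' * count}")
print(counts)  # [3, 5, 2, 1]
```

Changing `n_bins` changes the shape you see, which is the same effect the `bins` argument has on a plotted histogram.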

When to use it? When you want to know the distribution of a single column or variable.
Typically you will be using the pandas `hist` method, so we will start with that.

First we have to load a dataset into pandas. We will play with lots of datasets, but I want you to get very good at one data set: the EPA MPG dataset from 2018.

# Histogram

In [None]:
import pandas as pd  # load the package for use
epa = pd.read_csv("https://raw.githubusercontent.com/sqlshep/SQLShepBlog/master/data/epaMpg.csv")  # read the file into a dataframe

In [None]:
epa  #display the dataframe

In [None]:
epa['FuelEcon'].hist()  # this method will select the column and perform a basic histogram
plt.show()

In [None]:
epa['FuelEcon'].hist(bins = 30) 
plt.show()

In [None]:
epa.hist() 
plt.show()

## Your turn: try a couple of histograms


# Scatterplot


In the ongoing visualization show and tell, scatterplots have come up next on my list. As I write this I try very hard to check and double-check my knowledge and methods, and I usually have a dataset or two in mind long before I get to the point where I want to write about it. I want to use the epa dataset again to play around with, and run a line through the scatterplot to show a trend.


We loaded the epa dataset above, so we can go straight to the plot.

In [None]:

epa.plot.scatter(x='HorsePower',
                y='FuelEcon',figsize=(10,8))

plt.show()

With the scatterplot we have two dimensions of data: on the left is the y axis, the FuelEcon of the vehicle in miles per gallon, and on the bottom is the x axis, HorsePower.

That was cool, no? Notice anything interesting about the scatterplot? Is there an outlier?

Something else you can notice is that as the horsepower increases the MPG tends to decrease, indicating a trend.

I wonder if we can write a script to draw a linear regression line through the plot? We will dive deeper into what exactly that line means and what you can do with it later, but know it exists. 

Yes, we will use Seaborn for this. 
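
Before handing the work to Seaborn, it may help to see roughly what such a line is. The sketch below fits an ordinary least-squares line by hand; the helper `fit_line` and the horsepower and MPG pairs are invented for illustration and are not values from the epa dataset.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = how x and y vary together, divided by how much x varies on its own
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


horsepower = [100, 150, 200, 250, 300]  # made-up values
mpg = [35, 30, 25, 22, 18]              # made-up values

slope, intercept = fit_line(horsepower, mpg)
print(f"mpg = {intercept:.1f} + ({slope:.3f}) * horsepower")
```

A negative slope is the numeric version of the trend in the scatterplot: each extra unit of horsepower goes with slightly lower fuel economy.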


In [None]:
import seaborn as sns

In [None]:

plot = sns.lmplot(x='HorsePower', y='FuelEcon', data=epa, fit_reg=True,height=8.27, aspect=11.7/8.27)
plt.show()

## Your turn: choose a couple of numeric columns in the dataset and create a scatterplot.


# Boxplot


Boxplot, or box-and-whisker plot! You will see this one, and it’s always a little fuzzy to recall what is what until you have been doing it for a while.

It is said a picture is worth a thousand words, so I will leave these here. The following samples are from the epa dataset, so you can easily recreate them. Quartiles are coming up in the next chapter, so if you need to, you can jump to the Range and IQR section and then come back to this.

But for now, just know that the IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower half and of the upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
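
Here is that recipe as a short sketch, using only the standard library and a handful of made-up numbers. The helper `quartiles` is illustrative; it leaves the middle value out of both halves when the count is odd, which is one common convention among several.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # lower half
    upper = ordered[(n + 1) // 2 :]  # upper half, skipping the middle value if n is odd
    return median(lower), median(upper)


data = [7, 1, 3, 15, 9, 5, 13, 11]  # made-up values
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)  # 4.0 12.0 8.0
```

Sorted, the data is 1, 3, 5, 7, 9, 11, 13, 15, so the lower half has a median of 4, the upper half has a median of 12, and the IQR is 8.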

In [None]:

epa.boxplot(column=['FuelEcon'])
plt.show()

See the data points that fall outside the plot? Those are called outliers. An outlier is an observation that is numerically distant from the rest of the data. When reviewing a box plot, an outlier is defined as a data point located outside the whiskers of the box plot, typically more than 1.5 × IQR below Q1 or above Q3.
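
The 1.5 × IQR rule can be sketched in a few lines. The numbers are made up, the helper names are illustrative, and the quartiles use the median-of-halves method; plotting libraries may compute quartiles slightly differently, so the exact cutoffs on a real box plot can differ a little.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    return median(ordered[: n // 2]), median(ordered[(n + 1) // 2 :])


def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


mpg = [10, 12, 13, 14, 15, 16, 18, 40]  # made-up values
print(find_outliers(mpg))  # [40]
```

Here Q1 is 12.5 and Q3 is 17, so the IQR is 4.5 and the cutoffs sit at 5.75 and 23.75; only the value 40 falls outside them.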

In [None]:
epa.boxplot(column=['Cylinders'])
plt.show()

In [None]:
epa.boxplot(column=['HorsePower'])
plt.show()

# Try some on your own...

# Violin Box plot

This is more eye candy, but it lets you see the shape of the distribution of the data.

A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.
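
If kernel density estimation sounds mysterious, the sketch below shows the core idea: put a small bell curve on every data point and average them. The function `gaussian_kde_sketch`, the sample values, and the bandwidth of 2.0 are all assumptions chosen for illustration; Seaborn picks its own bandwidth, so its curves will not match these numbers.

```python
import math


def gaussian_kde_sketch(samples, bandwidth):
    """Return a function that estimates the density at any point x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per sample, then normalize so the area is 1
        bumps = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        return bumps / norm

    return density


mpg = [18, 20, 21, 22, 25, 30]  # made-up values
density = gaussian_kde_sketch(mpg, bandwidth=2.0)

for x in (15, 21, 30, 40):
    print(f"density at {x}: {density(x):.4f}")
```

The estimate is highest where the points cluster and falls toward zero far from any data, which is what gives a violin its wide and narrow sections.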

In [None]:

ax = sns.violinplot(x="Cylinders", y="FuelEcon", data=epa)  # one violin per cylinder count
plt.show()