# Data Visualization

Data visualization is an important part of data science. It is the most intuitive way to get an overall view of a dataset.

First, data visualization is usually a necessary step before applying machine learning. Before we extract information from data with machine learning algorithms, data visualization helps us understand how the data looks from a human perspective and choose a suitable algorithm. Second, we can sometimes extract a large amount of information from visualization alone, such as trends in the data and correlations between variables. Third, visualizations are more attractive and persuasive than bare numbers and text, so they are often used to convey analysis results.

# Packages

In this tutorial, we focus on interactive data visualization using Bokeh. Before that, I'd like to introduce other popular data visualization packages in Python. In practice, we need to choose among them and find the one most suitable for our task.

*****
Matplotlib

Matplotlib is the standard for plotting in Python and the most widely used package, although its figures look rather plain by default. That is good enough for exploratory visualization, where the goal is to identify trends and characteristics of a dataset.

*****
Seaborn

Seaborn is built on Matplotlib. It keeps the strengths of Matplotlib and adds styles and additional functionality that give figures a nicer look. However, it is aimed at generating static plots and is not well suited to interacting with data.

*****
Bokeh

Bokeh supports both interactive and static plots. It is built on a JavaScript library: the Python library specifies a plot and then hands it over to the JavaScript library (BokehJS), which renders it. In this way, it gains the main strength of JavaScript, interactivity, without requiring us to write JavaScript code.

# Bokeh

To explain Bokeh in detail, we use a sample dataset to demonstrate several types of plots that Bokeh can produce. The dataset contains U.S. worker injury information from 2015 to 2017, published by the United States Department of Labor, Occupational Safety and Health Administration (OSHA). It comes as a single CSV file.
https://www.osha.gov/severeinjury/index.html

In [1]:
# import packages
import pandas as pd
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

In [2]:
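# Read the data from the CSV file, skipping malformed rows.
# on_bad_lines="skip" replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0.
data = pd.read_csv("severeinjury.csv", on_bad_lines="skip", encoding="ISO-8859-1")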

In [3]:
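# Feature engineering: normalize employer names so that spelling variants
# such as "United States ..." and "US ..." are grouped together.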
def employerFunction(row):
    return row.Employer.replace("United States", "U.S.").replace('US ', 'U.S. ').lower().strip()
data['EmployerNew'] = data.apply(employerFunction, axis = 1)
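
The plain substring replacement above also rewrites "US " inside longer words, and it raises an error on rows where the employer is missing. A minimal sketch of a stricter variant follows. The helper name `normalize_employer` and the word-boundary rule are assumptions for illustration, not part of the original notebook, and the results differ slightly from `employerFunction` (for example, a trailing "US" without a following space is also rewritten).

```python
import re


def normalize_employer(name):
    """Normalize an employer name so that spelling variants group together."""
    if not isinstance(name, str):
        # Leave missing values (None / NaN) untouched instead of raising.
        return name
    name = name.replace("United States", "U.S.")
    # \b limits the match to the standalone word "US",
    # so a name such as "BONUS Foods" is not altered.
    name = re.sub(r"\bUS\b", "U.S.", name)
    return name.lower().strip()


# Hypothetical sample names, not taken from the dataset.
samples = ["United States Postal Service", " US Postal Service", "BONUS Foods"]
normalized = [normalize_employer(s) for s in samples]
```

Under these assumptions it could be applied to the column with `data['Employer'].map(normalize_employer)`.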

In [4]:
# view results in notebook
output_notebook()

### Time Series Chart

By drawing a time series plot, we can view the trend of the data over the three years. Time series charts used to be drawn by importing TimeSeries from bokeh.charts. However, the old bokeh.charts API has been moved to a separate bkcharts project, which is currently unmaintained, so its use is strongly discouraged. We can create time series plots easily enough with the stable bokeh.plotting API.

In [15]:
# Overall trend from 2015 to 2017: number of reported injuries per day
timeSeriesDF = pd.DataFrame({'count': data.groupby(["EventDate"]).size()}).reset_index()

timeSeriesDF.index = pd.to_datetime(timeSeriesDF['EventDate'])
timeSeriesDF.index.name = 'Date'
timeSeriesDF.sort_index(inplace=True)

source = ColumnDataSource(timeSeriesDF)

# Note: Bokeh 3.0 and later use width/height in place of plot_width/plot_height.
p = figure(x_axis_type="datetime", plot_width=800, plot_height=350)
p.line('Date', 'count', source=source)

show(p)
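
Daily counts tend to be noisy, which can hide the longer-term trend. The sketch below shows one possible way to smooth them with a time-based rolling mean in pandas before plotting. The helper names `daily_counts` and `smooth`, the window lengths, and the toy dates are assumptions for illustration only.

```python
import pandas as pd


def daily_counts(dates):
    """Count events per calendar day, returned as a Series sorted by date."""
    days = pd.to_datetime(pd.Series(list(dates)))
    counts = days.groupby(days).size().sort_index()
    counts.index.name = "Date"
    return counts.rename("count")


def smooth(counts, window="30D"):
    """Rolling mean over a time-based window ending at each date.

    Only days that appear in `counts` contribute, so days without any
    report are ignored rather than treated as zero.
    """
    return counts.rolling(window).mean()


# Hypothetical toy dates standing in for data["EventDate"].
toy_dates = ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-04"]
toy_counts = daily_counts(toy_dates)
toy_smoothed = smooth(toy_counts, window="2D")
```

With the real data, something like `smooth(daily_counts(data["EventDate"]))` would give a smoothed series that can be drawn with `p.line` in the same way as the daily counts.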

### Bar Plot

Find the 10 employers with the most reported injuries.

In [16]:
employersDF = pd.DataFrame({'count': data.groupby(["EmployerNew"]).size()}).reset_index()
employersDF.sort_values(by=['count'], ascending = False, inplace = True)
employersDF = employersDF[:10]

# x and y axes
employers = employersDF['EmployerNew'].tolist()
count = employersDF['count'].tolist()

# Note: Bokeh 3.0 and later use width/height in place of plot_width/plot_height.
p = figure(x_range=employers, plot_width=1300, plot_height=250, title="Top 10 Employers Causing Injury")
p.vbar(x=employers, top=count, width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0

show(p)

### Heatmap

Use a heatmap to examine the relationship between the employer and the part of the body injured.
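
One way to build such a heatmap with bokeh.plotting is to count each (employer, body part) pair and draw one colored rectangle per pair. The sketch below is a rough illustration, not a definitive implementation. The helper names `count_pairs` and `make_heatmap`, the toy data, and the column name `BodyPart` are assumptions; in the OSHA file the body part column is assumed to be called something like `Part of Body Title`, which should be checked against the actual CSV.

```python
import pandas as pd


def count_pairs(df, row_col, col_col):
    """Count how often each (row_col, col_col) pair occurs, in long format."""
    return df.groupby([row_col, col_col]).size().reset_index(name="count")


def make_heatmap(counts, row_col, col_col):
    """Draw one colored cell per (row, column) pair and return the figure."""
    # Imported here so count_pairs can be used without Bokeh installed.
    from bokeh.models import ColumnDataSource
    from bokeh.palettes import Viridis256
    from bokeh.plotting import figure
    from bokeh.transform import linear_cmap

    rows = sorted(counts[row_col].unique().tolist())
    cols = sorted(counts[col_col].unique().tolist())
    p = figure(x_range=cols, y_range=rows,
               title="Injuries by employer and body part",
               tooltips=[("count", "@count")])
    # Map the count of each cell linearly onto the palette.
    mapper = linear_cmap("count", Viridis256, low=0,
                         high=int(counts["count"].max()))
    p.rect(x=col_col, y=row_col, width=1, height=1,
           source=ColumnDataSource(counts),
           fill_color=mapper, line_color=None)
    return p


# Hypothetical toy data standing in for the injury records.
toy = pd.DataFrame({
    "EmployerNew": ["a", "a", "b", "a"],
    "BodyPart": ["hand", "hand", "foot", "foot"],
})
toy_counts = count_pairs(toy, "EmployerNew", "BodyPart")
# In a notebook: show(make_heatmap(toy_counts, "EmployerNew", "BodyPart"))
```

With the full dataset it is probably worth restricting the rows to the most frequent employers first, since a heatmap with thousands of categories is unreadable.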

### Choropleth 

See the distribution of injuries across the U.S.

In [247]:
# Feature engineering: map full state names to two-letter postal abbreviations
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

data['StateAbbrev'] = data['State'].map(us_state_abbrev)
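
State names that do not match a dictionary key exactly (different capitalization, territories, the District of Columbia) are mapped to NaN and would silently drop out of the map. The sketch below shows one possible way to list such names, count injuries per state, and color the states with Bokeh. The helper names and toy data are assumptions for illustration, and the plotting function assumes the Bokeh sample data (`bokeh.sampledata.us_states`) has been installed or downloaded.

```python
import pandas as pd


def unmapped_states(states, mapping):
    """Return the sorted distinct state names that have no abbreviation."""
    abbrev = states.map(mapping)
    missing = states[abbrev.isna() & states.notna()]
    return sorted(missing.unique().tolist())


def count_by_state(states, mapping):
    """Return {abbreviation: number of rows}, ignoring unmapped names."""
    return states.map(mapping).value_counts().to_dict()


def make_choropleth(state_counts):
    """Color each contiguous state by its count and return the figure."""
    # Imported here so the helpers above work without Bokeh installed.
    from bokeh.models import ColumnDataSource
    from bokeh.palettes import Viridis256
    from bokeh.plotting import figure
    from bokeh.sampledata.us_states import data as states
    from bokeh.transform import linear_cmap

    # Skip Alaska and Hawaii so the contiguous states fill the plot.
    codes = [c for c in states if c not in ("AK", "HI")]
    counts = [int(state_counts.get(c, 0)) for c in codes]
    source = ColumnDataSource({
        "xs": [states[c]["lons"] for c in codes],
        "ys": [states[c]["lats"] for c in codes],
        "state": codes,
        "count": counts,
    })
    mapper = linear_cmap("count", Viridis256, low=0, high=max(counts) or 1)
    p = figure(title="Injuries by state",
               tooltips=[("state", "@state"), ("count", "@count")])
    p.patches("xs", "ys", source=source, fill_color=mapper, line_color="white")
    return p


# Hypothetical toy data: a two-state mapping and a few raw state names.
toy_mapping = {"Texas": "TX", "Ohio": "OH"}
toy_states = pd.Series(["Texas", "OHIO", "Texas", "Guam", None])
toy_missing = unmapped_states(toy_states, toy_mapping)
toy_totals = count_by_state(toy_states, toy_mapping)
```

With the notebook's data, `unmapped_states(data['State'], us_state_abbrev)` would show which names need cleaning before `show(make_choropleth(count_by_state(data['State'], us_state_abbrev)))` is called.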


In [None]:
# number of unique values in each column
data.apply(pd.Series.nunique).sort_values()