<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:27%; left:10%;">
     INE Bootcamp
</h1>
<h2 style="color: white; position: absolute; top:36%; left:10%;">
    Data Analysis, Visualization and Predictive Modeling
</h2> 

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:58%; left:10%;">
    <b>David Mertz, Ph.D.</b>
</h3>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:63%; left:10%;">
    <b>Data Scientist</b>
</h3>
</div>

<div style="width: 100%; height: 200px; background-color: #222; text-align: center; padding-top: 20px; margin-bottom: 40px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Visualization using Pandas
</h1>

<br><br> 
</div>

<img src="https://user-images.githubusercontent.com/7065401/75165824-badf4680-5701-11ea-9c5b-5475b0a33abf.png" style="width:200px; float: right; margin: 0 40px 40px 40px;"/>

> The Pandas (Panel Data) Python library is a very powerful tool for data manipulation and analysis. In this lesson, we look at some of the visualization capabilities built into the library.

The visualizations in Pandas, like those we will see later with Seaborn, are based on the underlying library Matplotlib.  Whenever you issue a single `.plot()` method call in Pandas, "under the hood" Pandas is composing numerous Matplotlib calls to configure style, axes, colors, fonts, legends, tick marks, and other elements.  

In concept, everything you visualize in Pandas could be replicated exactly with Matplotlib alone.  However, using the Pandas wrapper makes the work *a whole lot easier* and *usually* assures that results are "reasonable looking" (whereas, with low-level Matplotlib, you can make very beautiful things; but because you control everything, you can also make very bad choices).
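To make the comparison concrete, here is a minimal sketch that draws roughly the same figure both ways. The `toy` DataFrame and its values are invented for illustration and are not part of the lesson data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({"a": [1, 3, 2, 4], "b": [4, 2, 3, 1]},
                   index=[2000, 2001, 2002, 2003])

# One Pandas call: lines, legend, and title are configured for us
ax_pandas = toy.plot(title="Pandas wrapper")

# Roughly the same figure, assembled by hand in Matplotlib
fig, ax_mpl = plt.subplots()
for column in toy.columns:
    ax_mpl.plot(toy.index, toy[column], label=column)
ax_mpl.legend()
ax_mpl.set_title("Matplotlib alone")
```

The hand-built version is not long here, but every additional refinement (tick formatting, date handling, secondary axes) is another explicit call.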

In [None]:
import pandas as pd
# Import Seaborn only to set style options throughout
import seaborn as sns
sns.set_style('whitegrid')
# Configure ellipses in DataFrames
pd.set_option('display.max_rows', 15)
pd.set_option('display.max_columns', 8)
# Make sure figures render inside notebook
%matplotlib inline

All Pandas DataFrame and Series objects have a `.plot()` method.  Many plot options are set by default. Some plot styles perform statistical operations.

There are several plot types available using the `kind=` keyword argument.

* `line` : line plot (default)
* `bar` : vertical bar plot
* `barh` : horizontal bar plot
* `hist` : histogram
* `box` : boxplot
* `kde` : Kernel Density Estimation plot
* `density` : same as 'kde'
* `area` : area plot
* `pie` : pie plot
* `scatter` : scatter plot
* `hexbin` : hexbin plot

See the [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html).
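Each plot kind can be requested either through the `kind=` keyword or through an accessor method of the same name. The sketch below uses an invented `toy` DataFrame to show that the two spellings produce the same kind of plot.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 1, 3]})

# Two spellings of the same request
ax_keyword = toy.plot(kind="barh")
ax_accessor = toy.plot.barh()

# Kinds that need explicit columns take them as x= and y=
ax_scatter = toy.plot(kind="scatter", x="x", y="y")
```

The accessor form has the small advantage that tab completion in a notebook lists the available kinds.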

<h2 style="font-weight: bold;">
    Line Plots
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Let us read in a sample dataset about gender and university degrees in the United States.

In [None]:
degrees = pd.read_csv('data/percent-bachelors-degrees-women-usa.csv', index_col='Year')
degrees

Let's try to visualize this completely naively.

In [None]:
degrees.plot();

It turns out it is not *impossible* to go wrong with Pandas defaults.  While the data is *accurate*, the arrangement of the plot is a mess.  We can fix three issues very easily:

* The legend covers the plot
* There are too many line trends to follow easily
* The `AxesSubplot` object type echoes needlessly

In [None]:
degrees.columns

In [None]:
stem = ['Computer Science', 'Math and Statistics', 'Engineering', 'Physical Sciences', 'Biology']
degrees[stem].plot(figsize=(12,5));
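Selecting fewer columns and widening the figure is usually enough. If the legend still overlaps the lines, one option is to move it outside the axes using the Matplotlib `Axes` object that `.plot()` returns. This is a sketch on an invented `toy` table, not the degrees data.

```python
import pandas as pd

# Toy stand-in for a wide table; column names are invented
toy = pd.DataFrame({f"series {i}": [i, i + 1, i + 2] for i in range(6)},
                   index=[1970, 1980, 1990])

ax = toy.plot(figsize=(12, 5))
# Anchor the legend's left-center point at the right edge of the axes
ax.legend(loc="center left", bbox_to_anchor=(1.0, 0.5))
```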

<h2 style="font-weight: bold;">
    Timeseries
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Let us work with the fairly large NOAA dataset that was used in an exercise in an earlier lesson.  Recall that this data covers most of 2019, but only a subset of the tens of thousands of total stations.  Very roughly, those stations retrieved into this dataset are the northernmost ones, between 45°N and the north pole.

In [None]:
weather = pd.read_csv('data/NOAA-2019-partial.csv.gz', 
                      index_col='DATE', parse_dates=True)
weather

In [None]:
print(f"    Rows: {len(weather):,}")
print(f"   Dates: {weather.index.min().date()} to {weather.index.max().date()}")
print(f"Stations: {len(weather.STATION.unique())}")

For our purposes, let us select only the data from one station (we arbitrarily choose the first one by station ID).

In [None]:
jan_mayen = weather.loc[weather.STATION == 1001099999, ['TEMP', 'PRCP', 'WDSP', 'NAME']]
jan_mayen.head()

Here again, the very simplest call to `.plot()` is imperfect, but easy to improve.  The main issue here is that the different trends are over quantities measured in very different units, and hence they span very different numeric ranges (℉ vs. inches vs. mph).

In [None]:
jan_mayen.plot();

In [None]:
jan_mayen.plot(figsize=(12, 7), subplots=True,
               title="Temperature, Precipitation, and Windspeed in Jan Mayen, NO");

This version gives us a reasonable picture of how these quantities might be interrelated as a year progresses.
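With `subplots=True`, `.plot()` returns an array holding one Matplotlib `Axes` per column rather than a single `Axes`, so each panel can be adjusted separately. A minimal sketch, using synthetic readings (the values are invented; the unit labels follow the description above):

```python
import numpy as np
import pandas as pd

# Synthetic daily readings on very different scales (values are invented)
days = pd.date_range("2019-01-01", periods=60, freq="D")
rng = np.random.default_rng(0)
toy = pd.DataFrame({"TEMP": rng.normal(30, 5, 60),
                    "PRCP": rng.exponential(0.05, 60),
                    "WDSP": rng.normal(12, 3, 60)}, index=days)

# One Axes per column, stacked vertically
axes = toy.plot(subplots=True, figsize=(12, 7))
for ax, unit in zip(axes, ["°F", "inches", "mph"]):
    ax.set_ylabel(unit)
```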

<h2 style="font-weight: bold;">
    Scatter
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


Pandas scatter plots let us explore relationships among quantities. Let's slice a different portion of the weather station data to find something likely to be useful.  Temperature, elevation, latitude, precipitation, and wind speed are likely to have some notable correlations; we choose a single day to compare across stations.

In [None]:
apr1 = weather.loc['2019-04-01', 
                   ['TEMP', 'ELEVATION', 'LATITUDE', 'PRCP', 'WDSP', 'STATION']]
apr1.set_index('STATION', inplace=True)
apr1

In [None]:
apr1.plot.scatter(x='LATITUDE', y='TEMP', color='blue', marker='.',
                  title="Latitude vs. Temperature on April 1, 2019");

In [None]:
apr1.plot.scatter(x='ELEVATION', y='TEMP', c='green', s=1,
                  title="Elevation vs. Temperature on April 1, 2019");

We can add several details here.  Probably latitude and elevation interact to affect temperature.  We can use color to represent a third dimension.  Moreover, we can choose a colormap that is iconic of how temperatures are usually represented (albeit, people in mid-latitudes are unlikely to think of 50℉ as "red").

As a tweak, we need to disable the `sharex` option to prevent the colorbar from hiding the X-axis label (this may not quite be a bug, but it's a glitch not a feature).

In [None]:
apr1.plot.scatter(x='LATITUDE', y='ELEVATION', figsize=(10, 6),
                  c=apr1.TEMP, cmap='coolwarm', sharex=False,
                  title="Elevation and Latitude influencing temperature");
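The color argument `c=` also accepts a column name, in which case Pandas looks the values up in the DataFrame and adds a colorbar for them. A small sketch with invented station-like rows:

```python
import pandas as pd

# Invented station-like rows, only to show the call
toy = pd.DataFrame({"LATITUDE": [46, 52, 60, 68, 75],
                    "ELEVATION": [900, 300, 150, 40, 10],
                    "TEMP": [48, 41, 33, 20, 5]})

# c= names a column; its values are mapped through the colormap
ax = toy.plot.scatter(x="LATITUDE", y="ELEVATION", c="TEMP",
                      cmap="coolwarm", sharex=False)
```

The colorbar is drawn as a second `Axes` on the same figure, which is why axis-sharing options can affect the main plot's labels.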

<h2 style="font-weight: bold;">
    Box Plots
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


Box Plots provide a quick statistical overview of column data.

In [None]:
apr1.plot(kind='box', subplots=True, figsize=(14,8), sym='k.');

All of these have quite a few outliers, and they are asymmetrically distributed for all the columns.  Precipitation is particularly notable since it is "usually zero" and hence the quartiles and 1.5x IQR whiskers are all solidly right at zero. 
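The effect on precipitation is easy to reproduce by computing the quartiles directly. In this sketch the `prcp` values are invented: mostly dry days with a few wet ones.

```python
import pandas as pd

# Invented daily precipitation: 17 dry days and 3 wet ones
prcp = pd.Series([0.0] * 17 + [0.1, 0.4, 1.2])

q1, median, q3 = prcp.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Every nonzero reading lies beyond the fence, so it is drawn as an outlier
outliers = prcp[prcp > upper_fence]
print(q1, median, q3, upper_fence, len(outliers))
```

When three quarters or more of the values are zero, the box and whiskers collapse to a line at zero and all the information ends up in the outlier markers.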

<h2 style="font-weight: bold;">
    Histograms and Bar Charts
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


The boxplot was not very revealing for precipitation.  A histogram might be more informative.  The visual predominance of zeros can be reduced by using a log Y scale.  We can use the entire large `weather` data for this. Note that *in this dataset*, zero can indicate "no report" rather than "no precipitation" per se.

In [None]:
weather.PRCP.plot(kind='hist', logy=True, bins=30, 
                  title="2019 daily precipitation distribution across stations");
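To see why the log scale matters, compare the raw bin counts on a toy series (the values are invented). On a linear Y axis, a bin holding two observations is invisible next to one holding a thousand.

```python
import numpy as np
import pandas as pd

# Invented values: 1,000 zeros and a handful of larger readings
prcp = pd.Series([0.0] * 1000 + [0.5] * 20 + [2.9] * 2)

# Counts per bin, using explicit bin edges
counts, edges = np.histogram(prcp, bins=[0, 1, 2, 3])
print(counts)

# The same bins drawn with a logarithmic Y axis
ax = prcp.plot(kind="hist", bins=[0, 1, 2, 3], logy=True)
```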

For bar charts we wish to plot some sort of categorical values.  The station ID for the April 1 observations is a good candidate.  Plotting all 1253 of them this way is not useful, so let us pick 8 at random for an example.

In [None]:
stations = [3660099999, 6806599999, 4196099999, 6619099999, 
            2428599999, 6797099999, 1336099999, 1257099999]
sample = apr1.loc[stations, ['ELEVATION', 'TEMP']]
sample

In [None]:
sample.plot.bar(secondary_y='TEMP', rot=60, figsize=(10, 4),
                title="Elevation and temperature at selected stations",
                color=['lightgreen', 'darkcyan']);

<div style="width: 100%; height: 200px; background-color: #ef7d22; text-align: center; padding-top: 20px; margin-bottom: 40px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Exercises
</h1>

<br><br> 
</div>

<h2 style="font-weight: bold;">
    Visualize three correlated variables
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

In the earlier dataset about the percentage of women in US colleges who major in various fields, there are clear overall dependencies among the values, but they might be complex.  Each female college student has one of the groupings of majors listed.

As the student body of women collectively shifts among majors over time, other majors thereby have fewer women in them.  In this exercise, try to visually express a relationship among Business, Engineering, and Art & Performance in comparison to each other.

**Hint**: It is easy to think this dataset tells us more than it actually does.  From this data alone, we know neither what overall percentage of students are women in a given year, nor what the relative enrollment in the different majors is.  Working with ratio data is tricky, so avoiding misrepresentation is especially important.  Think about accurate and descriptive plot titles.

In [None]:
# your code here...

<h2 style="font-weight: bold;">
    Stacked Bars
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

The lesson on the Ethics of Visualization specifically called out misuse of stacked bars (at least as an example of an irresponsible comparison of incompatible quantities).  For example, stacking elevation and temperature—or indeed any two of the measurements in the NOAA data—would be nonsensical. Moreover, stacking e.g. temperature from one day to the next also makes no sense.

However, one quantity measured clearly has a cumulative quality.  The precipitation over months is (approximately) the sum of the precipitation during each day.  This may not be exactly true in the dataset because of missing data, but it is not directly a conceptual problem.  Just to clarify, the measurements given are "rain or melted snow", so the units are compatible between days.

For this exercise, take several stations (e.g. those used in the lesson), and show the daily precipitation over the course of one month as a stacked bar chart.  You may pick whatever month you like for this exercise.  An even shorter period like 2 weeks might be better for the visualization; feel free to do that.

**Hint**: Look at the Pandas DataFrame `.pivot()` method.  This will make the task easier. Another approach uses `.groupby()`, discussed in the next lesson.
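As a reminder of what `.pivot()` does, here is a tiny reshaping sketch on invented records; the column names `day`, `station`, and `value` are hypothetical and do not match the NOAA data.

```python
import pandas as pd

# Invented long-format records: one row per (day, station) pair
long = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue"],
    "station": ["A", "B", "A", "B"],
    "value": [0.1, 0.0, 0.3, 0.2],
})

# One row per day, one column per station
wide = long.pivot(index="day", columns="station", values="value")
print(wide)
```

A wide table like this has one column per category, which is the shape the Pandas bar plots expect.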

In [None]:
# your code here...

<div style="width: 100%; height: 400px; background-color: #222; text-align: center; padding-top: 120px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Review and questions
</h1>

<br><br> 
</div>

---
<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

<img src="https://user-images.githubusercontent.com/7065401/98864025-08deda80-2448-11eb-9600-22aa17884cdf.png" style="height: 100%; max-height: inherit; position: absolute; top: 20%; left: 0px;"></img>
<br>

<h2 style="font-weight: bold;">
    David Mertz, Ph.D.
</h2>

<h3 style="color: #ef7d22; margin-top: 0.8em">
    Data Scientist
</h3>
<hr>
<br><br>

<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    david.mertz@gmail.com
</p>
<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    linkedin.com/in/dmertz/
</p>

</div>

<br><br><br>