# Data Visualization
We'll take a look at two different libraries for data visualization:
* **Matplotlib**: this is probably the most commonly used data visualization package for Python. There are many examples in the [Matplotlib documentation](https://matplotlib.org/). For the examples here, I'll use the `pyplot` module. There are more advanced plotting options (similar to MATLAB) that are well documented in the link above.
* **Seaborn**: this library is actually built on top of Matplotlib with a focus on more statistically oriented plotting. Here is a link to the [Seaborn documentation](https://seaborn.pydata.org/).


In [None]:
# import necessary libraries
import pandas as pd # for data frames, reading and writing data
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

import numpy as np
from math import sqrt

# the next line makes Matplotlib plots show up in the notebook cell
%matplotlib inline

## Load Data
Let's use the same sample data that we used before in the Pandas section. We'll load the user data, since that has the most fields and potential for "dirty" data.

In [None]:
filename = 'sample_data.xlsx'
user_df = pd.read_excel(filename, sheet_name='user_data')
tweet_df = pd.read_excel(filename, sheet_name='tweet_data')
tweets_classified = pd.read_excel(filename, sheet_name='tweets_classified')
user_df.head()

## Super-Simplified Plotting w. Pandas
Pandas has some basic plotting functionality built directly on top of Matplotlib. With it, you can create simple plots by calling methods right off of a data frame or series. Let's create a column in the users table for the month the user was created and then make a quick histogram from that.
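
The Pandas plotting methods return a Matplotlib `Axes` object, so the result can be customized further with ordinary Matplotlib calls. Here is a minimal sketch of that idea. The `signup` series is made up for illustration and is not part of the sample data.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical signup timestamps, standing in for a real datetime column
signup = pd.Series(pd.to_datetime(['2018-01-15', '2018-01-20', '2018-03-02',
                                   '2018-07-09', '2018-07-11', '2018-12-25']))

# .dt.month extracts the month number (1-12) from each timestamp
months = signup.dt.month

# Series.hist draws with Matplotlib and returns the Axes it drew on;
# one bin per month keeps the bars aligned with month numbers
ax = months.hist(bins=list(range(1, 14)))
ax.set_xlabel('Month')
ax.set_ylabel('Number of users')
```

Because `ax` is a regular Matplotlib object, anything shown later in this notebook (titles, grids, tick formatting) can be applied to it as well.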

### Pandas Histogram

In [None]:
user_df['created_month'] = user_df['created_at'].dt.month

# call .hist from the created_month series to get a histogram
user_df['created_month'].hist()


### Pandas Timeseries
Let's take a look at a time series plot. Since we don't have much nice time series data in our sample data, let's just create some here. We'll create a dummy dataset with 9 rows and 4 columns of random, normally distributed data.

In [None]:
a = np.random.standard_normal((9,4))
# Create a dataframe from this matrix
df = pd.DataFrame(a)

# Rename the columns to 'No1...No4'
df.columns = ['No1', 'No2', 'No3', 'No4']

# Review your results
df

Add a date index to our dummy dataset:

In [None]:
dates = pd.date_range('2018-1-1', periods=9, freq='M')
dates

In [None]:
# set the index of the dataframe equal to the dates
df.index = dates

df
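
The three steps above (create the frame, name the columns, attach the dates) can also be done in a single constructor call. This is a sketch under a couple of assumptions: the names `rng`, `values`, `idx`, and `demo_df` are chosen here for illustration, the generator is seeded so the numbers repeat, and month-start dates are used instead of month-end dates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)          # seeded so the numbers are reproducible
values = rng.standard_normal((9, 4))     # 9 rows, 4 columns of normal draws

# 'MS' (month start) is used only because the alias is accepted by both
# older and newer pandas versions; the notebook itself uses month-end dates
idx = pd.date_range('2018-01-01', periods=9, freq='MS')

# columns= and index= set the labels at construction time
demo_df = pd.DataFrame(values, columns=['No1', 'No2', 'No3', 'No4'], index=idx)
```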

In [None]:
# run the describe function on the data frame for more general summary info
df.describe()


Some of the aggregate functions can be applied at the data frame level, which applies them to every column. Here we'll look at the cumulative sum of each of the `No#` columns and plot them as time series.
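
To see what a column-wise cumulative sum does, here is a tiny worked example on a hand-made frame (the frame `small` is invented for illustration):

```python
import pandas as pd

small = pd.DataFrame({'No1': [1, 2, 3], 'No2': [10, 20, 30]})

# default axis=0: accumulate down the rows, separately for each column
running = small.cumsum()
# running['No1'] is 1, 3, 6 and running['No2'] is 10, 30, 60
```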

In [None]:
# call the cumsum() function on the data frame and then .plot
df.cumsum().plot()


## Plotting directly with Matplotlib
Pandas did all the work in sorting out that plot with 4 series. Here's how you could do it directly with matplotlib.

Formatting dates can be messy sometimes. Here's a good bit of [example code](https://matplotlib.org/examples/api/date_demo.html) that can help.
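
As a rough sketch of date formatting, the tick labels on a date axis are controlled by a locator (where the ticks go) and a formatter (how they are written). The data below is synthetic, and month-start dates are an assumption made to keep the example short.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.dates as mdates

# Synthetic monthly data for illustration only
dates = pd.date_range('2018-01-01', periods=9, freq='MS')
values = np.arange(9)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(dates, values, marker='o')

# one tick per month, labeled as year-month
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))

# rotate and right-align the labels so they do not overlap
fig.autofmt_xdate()
```

Note that `ax.format_xdata` only affects the coordinate readout shown in interactive windows, while `set_major_formatter` changes the tick labels themselves.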

In [None]:
fig, ax = plt.subplots(figsize = (8,5))
# Plot each series individually
plt.plot(df['No1'], color='green', marker='o', label='No1')
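plt.plot(df['No2'], color='blue', marker='o', label='No2')
plt.plot(df['No3'], color='red', marker='o', label='No3')
plt.plot(df['No4'], color='orange', marker='o', label='No4')
# Format the date tick labels on the x axis
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
# rotates and right aligns the x labels, and moves the bottom of the
# axes up to make room for them
fig.autofmt_xdate()
# add a title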
plt.title('Multi-Series Plot')
plt.legend(loc=0)
plt.grid(True)

## Splitting into subplots
Let's say we want this same plot, but split into two different plots with No1-No2 in one and No3-No4 in the other. We can do this with the `plt.subplot` command. The syntax is `plt.subplot(numrows, numcols, fignum)`, where `fignum` counts the panels starting from 1.
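
A minimal sketch of the `plt.subplot` call on its own, using sine and cosine curves as stand-in data (the panel titles and variable names are made up for illustration):

```python
import numpy as np
from matplotlib import pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)

fig = plt.figure(figsize=(14, 5))

# 1 row, 2 columns, select the 1st panel (numbering starts at 1)
left = plt.subplot(1, 2, 1)
left.plot(x, np.sin(x), label='sin')
left.set_title('Left panel')
left.legend(loc=0)

# same grid, select the 2nd panel
right = plt.subplot(1, 2, 2)
right.plot(x, np.cos(x), color='green', label='cos')
right.set_title('Right panel')
right.legend(loc=0)

# a title for the whole figure, above both panels
fig.suptitle('Two panels in one figure')
```

Each call to `plt.subplot` returns the `Axes` for that panel and makes it the current one, so later `plt.plot` or `plt.title` calls apply to that panel until the next `plt.subplot` call.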

In [None]:
fig = plt.figure(figsize=(14,5))

# Define the sub-plots - 1 row, two columns
# 1st plot
ax1 = plt.subplot(1,2,1)
# Plot variable No1
plt.plot(df['No1'], color='green', marker='o', label='No1')
# Plot variable No2 together with No1
plt.plot(df['No2'], color='blue', marker='o', label='No2')
# put a title on this subplot
plt.title('No1 and No2')
# Format the date tick labels on the x axis
ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.legend(loc=0)
plt.grid(True)

# 2nd plot
ax2 = plt.subplot(1,2,2)
# same as above - add series for No3 and No4 and a title
plt.plot(df['No3'], color='red', marker='o', label='No3')
plt.plot(df['No4'], color='orange', marker='o', label='No4')
plt.title('No3 and No4')
# Format the date tick labels on the x axis
ax2.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.legend(loc=0)
plt.grid(True)

# rotates and right aligns the x labels, and moves the bottom of the
# axes up to make room for them
fig.autofmt_xdate()
# add a supertitle
plt.suptitle('Multi-Series Plots', size=12)

## Categorical variable versus a binary variable - Means and Confidence Intervals
From one of Bob's questions. The plot should show:
* The categories of the categorical variable along the vertical axis
* The mean/percentage of the binary variable (1 values) along the horizontal axis
* Confidence intervals at the ends of the bars
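
For a binary (0/1) variable, the group mean is the proportion of 1 values, and the standard error of that mean is the sample standard deviation divided by the square root of the group size. Here is a small worked sketch on made-up data (the `toy` frame and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up classifications: 4 tweets per topic, 1 = positive class
toy = pd.DataFrame({
    'topic': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'class': [1, 0, 1, 1, 0, 1, 0, 0],
})

summary = toy.groupby('topic')['class'].agg(['mean', 'std', 'count'])

# standard error of the mean = sample std / sqrt(n)
summary['SE'] = summary['std'] / np.sqrt(summary['count'])
# topic 'a': mean 0.75, std 0.5, count 4, SE 0.25
```

Plotting the mean plus or minus two standard errors gives an approximate 95% confidence interval, which is a rough approximation for small groups like these.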

In [None]:
plot_data = tweets_classified.groupby(['topic'])['class'].agg(['mean', 'std', 'count'])
# standard error of the mean = std / sqrt(n)
plot_data['SE'] = plot_data['std'] / np.sqrt(plot_data['count'])
plot_data.sort_values('mean', inplace=True)
plot_data

In [None]:
# Plot again using Matplotlib directly
# define the location on the y axis - based on the number of topics
fig, ax = plt.subplots(figsize=(9,5))
y_pos = np.arange(len(plot_data))
class_avg = plot_data['mean']
error = 2*plot_data['SE']

ax.errorbar(class_avg, 
            y_pos,
            xerr=error, 
            capsize = 8,
            fmt='o', 
            markersize=10, 
            linewidth=2, 
            color='green')
ax.set_yticks(y_pos)
ax.set_yticklabels(plot_data.index, size=12)
# ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Proportion Positive Class')
ax.set_title('Manual Tweet Classifications\nError Bars = 2 Standard Errors of the Mean', size=14)

plt.show()
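
The cell above draws points with error bars. If actual bars are preferred, one possible variant is `ax.barh` with the `xerr` argument. This is only a sketch: the topic names and numbers below are hypothetical, not taken from the sample data.

```python
import numpy as np
from matplotlib import pyplot as plt

# Hypothetical summary values for illustration
topics = ['topic_a', 'topic_b', 'topic_c']
means = np.array([0.20, 0.45, 0.70])
std_errs = np.array([0.04, 0.05, 0.03])

fig, ax = plt.subplots(figsize=(9, 5))
y_pos = np.arange(len(topics))

# horizontal bars with error bars of two standard errors at the bar ends
bars = ax.barh(y_pos, means, xerr=2 * std_errs, capsize=8, color='lightgreen')
ax.set_yticks(y_pos)
ax.set_yticklabels(topics)
ax.set_xlabel('Proportion in positive class')
```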

## Box Plots
Let's take a look at box plots of `followers_count` by time zone. First let's look at how many unique time zone values we have:

In [None]:
user_df.time_zone.value_counts()

That's way too many! We'll distill those down to a more reasonable grouping as we did in the QA notebook.

In [None]:
# There are a lot of missing values! Let's mark those as missing for now:
user_df.loc[user_df.time_zone.isna(), 'time_zone'] = 'Missing'

# Create a new value for the grouped time zone so we don't lose the original data.
user_df['grouped_tz'] = 'Other'
user_df.loc[user_df.time_zone=='Missing', 'grouped_tz'] = 'Missing'

#Europe
user_df.loc[user_df.time_zone.isin(['London', 'Dublin','Edinburgh','Amsterdam','Stockholm','Lisbon']),
            'grouped_tz'] = 'Europe'
# Eastern & Atlantic (Toronto observes Eastern Time)
user_df.loc[user_df.time_zone.isin(['Eastern Time (US & Canada)', 'America/New_York', 'America/Toronto',
                                    'Indiana (East)', 'Atlantic Time (Canada)']),
            'grouped_tz'] = 'Eastern'
# Central
user_df.loc[user_df.time_zone.isin(['Central Time (US & Canada)']),
            'grouped_tz'] = 'Central'
# Pacific, Alaska, Hawaii
user_df.loc[user_df.time_zone.isin(['Pacific Time (US & Canada)', 'America/Los_Angeles','Alaska','Hawaii']),
            'grouped_tz'] = 'Pacific'
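
An alternative to the repeated `.loc` assignments is a lookup dictionary combined with `Series.map`. The sketch below uses a shortened, illustrative mapping and a made-up series. The names `tz_groups`, `time_zone`, and `grouped` are assumptions for this example only.

```python
import numpy as np
import pandas as pd

# Shortened mapping for illustration; a real one would list every zone of interest
tz_groups = {
    'London': 'Europe',
    'Amsterdam': 'Europe',
    'Eastern Time (US & Canada)': 'Eastern',
    'Central Time (US & Canada)': 'Central',
    'Pacific Time (US & Canada)': 'Pacific',
    'Missing': 'Missing',
}

time_zone = pd.Series(['London', np.nan, 'Pacific Time (US & Canada)', 'Tokyo'])

# 1) label missing values, 2) look up the group, 3) anything unmapped becomes 'Other'
grouped = time_zone.fillna('Missing').map(tz_groups).fillna('Other')
# grouped is: Europe, Missing, Pacific, Other
```

Keeping the groups in one dictionary makes it easier to review and extend the mapping than scanning several separate assignments.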

### Box Plot - followers_count by Timezone
This is a good place to use Seaborn! Many of the faceting and advanced plotting functions we know from ggplot can be done with Seaborn.

In this plot, we have some users with VERY large `followers_count` values. To make the plot more readable, I've zoomed in by limiting the data to those users with fewer than 2,000 followers.
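
Before zooming in like this, it can be worth checking how much data the filter drops. A minimal sketch on invented follower counts (the `users` frame is hypothetical):

```python
import pandas as pd

# Invented follower counts, including a few very large accounts
users = pd.DataFrame({'followers_count': [12, 150, 1999, 2000, 48000, 3_500_000]})

# boolean mask keeps only rows strictly below the cutoff
zoomed = users.loc[users['followers_count'] < 2000]

# fraction of users that remain after filtering
share_kept = len(zoomed) / len(users)
```

In this toy example half of the rows are removed, so any box plot of `zoomed` describes only the smaller accounts.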

In [None]:
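# set the style before creating the figure so it applies to the new axes
sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(10,5))
sns.boxplot(x='grouped_tz', 
            y='followers_count', 
            hue='grouped_tz',
            data=user_df.loc[user_df.followers_count<2000], 
            palette="Set3",
            ax=ax)
# hue duplicates the x axis, so drop the redundant legend if one was drawn
if ax.get_legend() is not None:
    ax.get_legend().remove()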
# Set titles
plt.title('Follower Counts by Grouped Time Zone', size=15)
ax.set_xlabel('Grouped Time Zone')
ax.set_ylabel('Followers Count')
