# Week 7: Exploratory vs Explanatory Visualizations

## Lab

Execute the *Lab* portion of this notebook (everything before *Exercise*), download a copy as a notebook file, and submit it on Blackboard.

### Introduction

While visualizations can certainly be useful as an aid in storytelling, we can also rely on them to help make sense of data when we're exploring it and trying to extract meaning. An exploratory visualization might make sense while we're working closely with the data, but its meaning could be lost, or the plot could be overwhelming, when used as an explanatory tool in a report or summary. An exploratory visualization *could* be used as an explanatory visualization, especially if it clearly supports the message being conveyed and can be easily understood by the target audience.

Here we'll look at examples of visualizations used for both exploratory and explanatory purposes.


### Preparation

This notebook makes use of the seaborn, pandas, wordcloud, requests, and beautifulsoup libraries, along with the lxml parser used by Beautiful Soup later in the lab.  To begin, we'll make sure they are installed.

In [None]:
!pip install seaborn pandas wordcloud requests beautifulsoup4 lxml

### Color, Form, Etc.

We often use explanatory visualizations to "tell a story," so we try to make them visually appealing.  Exploratory visualizations are typically used to make sense of data, but the same design choices (color, form, and so on) can make trends easier to spot and plots less tiring to read.


### Exploratory Analysis

We'll begin by exploring a [dataset](https://github.com/mwaskom/seaborn-data) provided with the [seaborn](https://seaborn.pydata.org/) library that contains tip-related data.  The seaborn library provides a high-level interface to the commonly used [matplotlib](https://matplotlib.org/) library ([examples](https://matplotlib.org/gallery/index.html)) that can be used to create more-polished [visualizations](https://seaborn.pydata.org/examples/index.html). 

First, we import the [pandas](https://pandas.pydata.org/) and seaborn libraries and name them pd and sns, respectively, based on common convention.  Next, we'll indicate to the notebook software that we want generated plots to appear in-line as part of the notebook and set the figure size to 12x10 "inches".

In [None]:
import pandas as pd
import seaborn as sns
%matplotlib inline
sns.set(rc={'figure.figsize':(12,10)})

Because this dataset is provided with the seaborn library, we can load it using the *load_dataset()* function with the dataset name specified.

In [None]:
tips = sns.load_dataset("tips")
tips

Here, we see that there are seven columns in the dataset.  This is a generated dataset used for demonstration but we can assign meaning to each of the columns.

- *total_bill*: the total bill for the meal
- *tip*: tip paid for service
- *sex*: of the customer
- *smoker*: whether the customer was a smoker or not
- *day*: day of the week on which the order occurred
- *time*: lunch or dinner
- *size*: number of people in the party

One of the first questions we might ask is, "What is the busiest day in terms of number of orders?" We can get numeric values relatively easily using functionality included with a pandas DataFrame.

In [None]:
tips.day.value_counts()

For a few values, it's easy to determine which one is the greatest, which is the least, and what the order is.  With more values, it becomes harder to determine these things by sight without sorting the data.  A visualization, on the other hand, might allow us to extract meaning from the data more quickly.  Here, we use a bar chart to visualize each day's number of orders.

In [None]:
tips.day.value_counts().plot(kind='bar')
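
The same question can also be answered numerically. As a minimal sketch, assuming a small made-up series of day labels (not the real tips data), `value_counts()` returns counts sorted from largest to smallest, `idxmax()` picks out the most frequent label, and `normalize=True` converts counts to shares:

```python
import pandas as pd

# Hypothetical day labels for illustration only; not taken from the tips dataset
days = pd.Series(["Sat", "Sun", "Sat", "Thur", "Sat", "Sun"])

counts = days.value_counts()                # counts, sorted largest first
busiest = counts.idxmax()                   # label with the highest count
shares = days.value_counts(normalize=True)  # fraction of rows per label

print(counts)
print("busiest:", busiest)
print(shares.round(2))
```

With only a handful of categories the table is easy to read; the bar chart becomes more useful as the number of categories grows.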

Visualizations also give us a rough idea of the relative differences between values.

In [None]:
tips.time.value_counts().plot(kind='bar')

We can use a histogram to understand how the data is distributed.  Below is a histogram displaying distribution details for *total_bill*. Note that this histogram is generated using the DataFrame *hist()* method.

In [None]:
# distribution of total_bill
tips.total_bill.hist()

We can generate a histogram using the seaborn library directly.  Seaborn provides additional display options with a histogram, including the ability to display a *rug plot*, a way of visualizing the distribution of data along a single axis, and a *kernel density estimation plot*, a visualization of the corresponding probability density for the given data. Note that when the kernel density estimation plot is shown, the y-axis represents density rather than counts: the area under the curve equals 1, so individual values on the axis are not themselves probabilities.

In [None]:
sns.histplot(data=tips, x="tip", kde=True, stat="density")
sns.rugplot(data=tips, x="tip")
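
When a histogram is scaled as a density, the bar heights are chosen so that the total area (height × width, summed over all bars) equals 1; the individual heights are not capped at 1. A minimal sketch with NumPy, using made-up, tightly clustered values rather than the tips data:

```python
import numpy as np

# Hypothetical, tightly clustered values for illustration only
rng = np.random.default_rng(0)
values = rng.normal(loc=3.0, scale=0.1, size=500)

# density=True rescales bar heights so the bars' total area is 1
heights, edges = np.histogram(values, bins=20, density=True)
widths = np.diff(edges)

total_area = float(np.sum(heights * widths))
max_height = float(heights.max())

print("total area:", round(total_area, 6))
print("tallest bar:", round(max_height, 2))
```

Because these values span less than one unit on the x-axis, the tallest bar has to rise above 1 for the area to come out to 1.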

In a previous course, we briefly discussed exploratory data analysis but focused more on numeric values rather than visualizations themselves.  Recall that we can use a DataFrame's *describe()* method to calculate some descriptive statistics.  We could use visualizations such as box plots to quickly get a sense of some of these values and more easily compare similar data.

In [None]:
tips[['total_bill', 'tip', 'size']].describe()

In [None]:
sns.boxplot(x="total_bill", data=tips)

In [None]:
sns.boxplot(x="tip", data=tips)

In [None]:
sns.boxplot(y="size", data=tips)
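
A box plot is largely a drawing of the quartiles that *describe()* reports. The sketch below uses a few made-up values (not the tips data) and assumes the common 1.5 × IQR rule for deciding which points are drawn individually as outliers:

```python
import pandas as pd

# Hypothetical bill amounts for illustration only
bills = pd.Series([1.0, 2.0, 2.0, 3.0, 3.5, 4.0, 5.0, 10.0])

q1 = bills.quantile(0.25)   # lower edge of the box
median = bills.median()     # line inside the box
q3 = bills.quantile(0.75)   # upper edge of the box
iqr = q3 - q1               # height of the box

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```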

We can also drill into our data using different groupings and aggregations.  Here, we calculate basic descriptive statistics for *total_bill* for each day of the week.

In [None]:
tips.groupby("day")['total_bill'].describe()

We can use visualizations to convey the same information. The next two plots allow us to see the distribution of *total_bill* by each day of the week.

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips)

In [None]:
sns.swarmplot(x="day", y="total_bill", data=tips)

We can quickly generate box plots for *total_bill* separated by day of the week or time of day as well. 

In [None]:
sns.boxplot(x="day", y="total_bill", data=tips)

In [None]:
sns.boxplot(x="time", y="total_bill", data=tips)

Visualizations can also allow us to examine multiple, complex aspects of data at the same time. For example, *violin plots* combine aspects of a box plot and kernel density estimation plot.

In [None]:
sns.violinplot(x="day", y="total_bill", hue="sex", split=True, data=tips)

We can also show each individual observation inside a violin plot.

In [None]:
sns.violinplot(x="day", y="total_bill", inner="stick", data=tips)

Previously, we used pivot tables to aggregate data based on categorical values.  Here we see aggregations for *smoker*, *sex*, and *time*.

In [None]:
pd.pivot_table(tips, index=["smoker"], values=["total_bill", "tip"], aggfunc="median")

In [None]:
pd.pivot_table(tips, index=["sex"], values=["total_bill", "tip"], aggfunc="median")

In [None]:
pd.pivot_table(tips, index=["time"], values=["total_bill", "tip"], aggfunc="median")
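
With a single index column, a pivot table like these is a grouped aggregation under another name. A minimal sketch, using a few made-up rows rather than the tips data, that builds the same table both ways:

```python
import pandas as pd

# Hypothetical rows for illustration only
df = pd.DataFrame({
    "smoker": ["Yes", "No", "Yes", "No", "No"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 60.0],
    "tip": [1.0, 3.0, 5.0, 6.0, 9.0],
})

# Pivot table: one row per smoker value, median of each value column
pivot = pd.pivot_table(df, index="smoker",
                       values=["total_bill", "tip"], aggfunc="median")

# The same aggregation written as a groupby
grouped = df.groupby("smoker")[["tip", "total_bill"]].median()

print(pivot)
print(grouped)
```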

Rather than compare numerical values, it might be easier or faster to compare visualizations.

In [None]:
g = sns.PairGrid(tips,
                 x_vars=["smoker", "sex", "time"],
                 y_vars=["tip", "total_bill"],
                 aspect=.9, height=5)
g.map(sns.violinplot, palette="pastel");

In a previous course, we looked at regressions as a way of modeling interdependence of data.  Consider the relationship between *total_bill* and *tip*.

In [None]:
tips[['total_bill', 'tip']]

We can calculate statistics such as the mean, median, and mode, or the [correlation coefficient](https://en.wikipedia.org/wiki/Correlation_coefficient), for a dataset, but these can be misleading. Consider [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet).  This is a collection of four datasets in which the means, sample variances, correlation coefficients, linear regression coefficients, and [coefficients of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) are the same or very close in value even though the underlying data is quite different.

In [None]:
anscombe = sns.load_dataset("anscombe")
anscombe

In [None]:
anscombe.groupby("dataset").describe()

In [None]:
anscombe.groupby("dataset").corr()

Looking only at these statistics, it's easy to mistake the four datasets as being very similar.  Visualizing the data can help us see how different they are.

In [None]:
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=anscombe, ci=None, col_wrap=2)
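
To see how closely the summary statistics agree, the sketch below compares the first two datasets of the quartet using NumPy alone. The values are typed in by hand from the published quartet, so treat the `anscombe` DataFrame loaded above as the authoritative copy:

```python
import numpy as np

# Datasets I and II of Anscombe's quartet share the same x values
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

mean_y1, mean_y2 = y1.mean(), y2.mean()

# Pearson correlation between x and each y
r1 = np.corrcoef(x, y1)[0, 1]
r2 = np.corrcoef(x, y2)[0, 1]

print(round(mean_y1, 2), round(mean_y2, 2))
print(round(r1, 3), round(r2, 3))
```

The means and correlations agree to about two decimal places, yet the first dataset is roughly linear and the second follows a curve, which only a plot makes obvious.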

For the tip data, we can plot *tip* and *total_bill* values using a scatter plot before calculating any regression coefficients to ensure that a linear regression would be appropriate.  The second plot includes the linear regression.

In [None]:
sns.regplot(x='total_bill', y='tip', data=tips, fit_reg=False)

In [None]:
sns.regplot(x='total_bill', y='tip', data=tips, ci=None)
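
The regression line in such a plot is a least-squares fit. As a rough sketch of the numbers behind it, the example below fits a straight line with `numpy.polyfit` to a few made-up (bill, tip) pairs; these are not values from the tips dataset:

```python
import numpy as np

# Hypothetical (total_bill, tip) pairs for illustration only
bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([2.0, 3.0, 5.0, 6.0, 8.0])

# Degree-1 polynomial fit: tip is approximated by slope * bill + intercept
slope, intercept = np.polyfit(bill, tip, 1)

# Correlation coefficient for the same pairs
r = np.corrcoef(bill, tip)[0, 1]

print(round(slope, 3), round(intercept, 3), round(r, 3))
```

Here the slope can be read as the extra tip associated with each additional unit of the bill.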

We can also use a scatter plot to compare data based on categorical values. First we plot *tip* and *total_bill* by *sex* then show the pairwise plots for numeric data by *day*.

In [None]:
sns.lmplot(x="total_bill", y="tip", hue="sex", data=tips, ci=None)

In [None]:
sns.pairplot(data=tips, hue="day")

Rather than comparing *tip* and *total_bill*, it is probably of greater interest to examine tip as a percentage of the total.

In [None]:
tips['percent'] = tips.tip/tips.total_bill * 100

In [None]:
g = sns.PairGrid(tips,
                 x_vars=["smoker", "sex", "time"],
                 y_vars=["percent"],
                 aspect=.9, height=5)
g.map(sns.violinplot);

Consider another popular dataset, the [*Iris* dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), that records sepal length and width and petal length and width by iris species. 

In [None]:
iris = sns.load_dataset("iris")
iris

We can easily see how the data is distributed for each column and see if there is any correlation between two columns using a pair plot where *species* is used to determine marker color.

In [None]:
sns.pairplot(hue="species", data=iris)

### Explanatory

While an exploratory plot could be used as an explanatory plot, explanatory plots tend to be simpler to understand because they are meant to convey a specific idea rather than to extract unknown information from data.

Suppose we wanted to show both the mean tip percent by day of the week and the fact that this value varied only slightly from day to day.  We might use a bar plot.  Note that in choosing this visualization, we knew what we wanted to convey ahead of time.

In [None]:
tips.groupby('day')['percent'].mean().plot(kind='bar')
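
A minimal sketch of the aggregation behind a bar plot like this one, using a few made-up rows rather than the tips data. Selecting the *percent* column before calling `mean()` keeps the aggregation away from non-numeric columns such as *day*:

```python
import pandas as pd

# Hypothetical rows for illustration only
bills = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun"],
    "total_bill": [20.0, 40.0, 10.0, 50.0],
    "tip": [3.0, 8.0, 2.0, 5.0],
})

# Tip as a percentage of the bill, row by row
bills["percent"] = bills["tip"] / bills["total_bill"] * 100

# Mean tip percentage per day; this Series is what the bar plot draws
mean_percent = bills.groupby("day")["percent"].mean()

print(mean_percent)
```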


We might want to drill into the days to show the difference with respect to meal time as well.  Here, we use a pivot table to generate a heat map.

In [None]:
tips.pivot_table(index="day", columns="time", values="percent", aggfunc="mean")

In [None]:
sns.heatmap(tips.pivot_table(index="day", columns="time", values="percent"), 
            cmap=sns.light_palette("green"), annot=True, linewidths=0.5)

Sometimes, the best way to convey information is with a simple table of values.

In [None]:
tips.pivot_table(index="sex", values="percent", aggfunc="median").T

Another difference between exploratory and explanatory visualizations is that explanatory visualizations might be more "flashy" or less precise.  Consider, for example, a [word cloud](https://en.wikipedia.org/wiki/Tag_cloud).  Here we generate a word cloud from [reviews of Columbus State on Yelp](https://www.yelp.com/biz/columbus-state-community-college-columbus-2). We scrape the site using the [Requests](http://docs.python-requests.org/en/master/) and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) libraries; we extract the text associated with the reviews using specific HTML elements and classes.

In [None]:
from wordcloud import WordCloud
from bs4 import BeautifulSoup
import requests

In [None]:
response = requests.get("https://www.yelp.com/biz/columbus-state-community-college-columbus-2")
bs = BeautifulSoup(response.content, 'lxml')
review_containers = bs.find_all("div", {"class": "review-content"})

In [None]:
reviews_text = ""
for review_container in review_containers:
    paragraphs = review_container.find_all("p")
    # join paragraphs with spaces so words at paragraph boundaries don't run together
    reviews_text += ' '.join(p.text for p in paragraphs) + ' '

In [None]:
reviews_text
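
Because the live page's markup can change, it may help to try the extraction logic on a fixed snippet first. The sketch below runs the same kind of `find_all` calls against a made-up HTML string; the class name `review-content` mirrors the one used above, but the markup and review text are invented for illustration, and the built-in `html.parser` is used so no extra parser is required:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; not copied from any real page
html = """
<div class="review-content"><p>Great campus.</p><p>Helpful staff.</p></div>
<div class="review-content"><p>Parking is hard.</p></div>
<div class="other"><p>Ad text</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the containers that hold reviews
containers = soup.find_all("div", {"class": "review-content"})

# Join each review's paragraphs, then join the reviews, with single spaces
reviews = [" ".join(p.get_text() for p in c.find_all("p")) for c in containers]
sample_text = " ".join(reviews)

print(sample_text)
```

If the selector matches nothing, the resulting text is empty, so it is worth checking the text before generating a word cloud from it.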

In [None]:
wordcloud = WordCloud().generate(reviews_text)

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
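
A word cloud sizes each word roughly by how often it appears in the text. The sketch below shows only that counting idea with the standard library; the sample sentence and the tiny stopword set are made up for illustration, and the wordcloud library's own tokenizing and stopword handling are more involved than this:

```python
import re
from collections import Counter

# Made-up text for illustration only
sample = "Great campus, great teachers. Parking is hard, but the campus is great."

# Lowercase the text and pull out runs of letters (and apostrophes)
words = re.findall(r"[a-z']+", sample.lower())

# Hypothetical, very small stopword set
stopwords = {"is", "but", "the"}

word_counts = Counter(w for w in words if w not in stopwords)
top_words = word_counts.most_common(2)

print(top_words)
```

The most frequent words are the ones that would be drawn largest, which is why a word cloud gives a quick impression of a text rather than precise counts.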

## Additional Resources

- [The Python Graph Gallery](https://python-graph-gallery.com/)
- [Seaborn Gallery](https://seaborn.pydata.org/examples/index.html)
- [Matplotlib Gallery](https://matplotlib.org/gallery/index.html)

## Exercise

Create a word cloud using text of your choice.  To do this, enter the text between the two sets of triple quotes below and execute the cell. Save the generated image and submit it on Blackboard.

In [None]:
import matplotlib.pyplot as plt

from wordcloud import WordCloud

# insert your text between the quotes
text = """

"""

wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")