# Types of Visualizations

- Histograms


- Bar Charts

- Scatterplots

- Time Series

- Heat Maps

There are many different Python libraries for creating plots.

We often use matplotlib as our library.

However, other popular choices are:

- Seaborn
- ggplot (a Python port of R's ggplot2)
- prettyplotlib

We're going to focus on Seaborn today.

# Before we begin

Make sure you have the Seaborn package installed.

Can you run the following line?

In [None]:
import seaborn as sns

If you get an error, go to the 'Anaconda Navigator' and look under "Environments".

You can add it from there.

# Histograms 

Histograms allow you to summarize many statistics of data in one simple chart.

Statistics captured in a histogram:
 - The Average
 - The Mode(s)
 - Variability 
 - Skew

Histograms resemble scatterplots in that they have two axes, but they describe a single variable:
 - X-axis: Response value (grouped into bins)
 - Y-axis: Frequency of that response
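
To make the two axes concrete, here is a minimal sketch that computes the numbers a histogram would draw, without plotting anything. The simulated heights and the choice of 10 bins are assumptions made only for illustration.

```python
import numpy as np

# Simulated data (an assumption for illustration): 1000 heights in inches.
rng = np.random.default_rng(0)
heights = rng.normal(67, 3, 1000)

# np.histogram returns the count in each bin (the y-axis) and the bin
# edges (the x-axis); there is always one more edge than there are bins.
counts, edges = np.histogram(heights, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} inches: {count} people")
```

Each printed row corresponds to one bar: the range is the bar's position on the x-axis and the count is its height.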

## Histogram Example:



We'll first look at creating histograms the way we've done previously.

First, let's import the packages we need:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd

Next, we will simply create a pandas Series (or access a column from our Data Frame).

For this example, we'll simulate creating a histogram for people's heights (in inches).

This series is a collection of 1000 random values drawn from a normal distribution that has a mean of 67 and standard deviation of 3.

In [None]:
x = np.random.normal(67,3,1000)
height = pd.Series(x)

Now, to create a histogram, we'll use the hist() method of our Series.

In [None]:
height.hist()

Notice how the histogram shows that heights are mainly concentrated around the 67 inch mark, with less and less concentration as you move away from the center.

We can create a histogram using Seaborn with the following command:

In [None]:
import seaborn as sns
sns.distplot(height,kde = False)

What if you want the histogram to focus on where the data are? That is, you only want the axes to cover the range where the data seem to begin and end.

In Seaborn, we can easily set where the x and y axes start and stop.

In [None]:
histogram = sns.distplot(height,kde = False)

#The code for setting the axes
axes = histogram.axes
axes.set_ylim(0,110)
axes.set_xlim(58,78)

Additionally, we can set titles and labels to the figure.

In [None]:
histogram = sns.distplot(height,kde = False)


axes = histogram.axes

axes.set_title("Histogram of Heights")
axes.set_ylabel("Frequency")
axes.set_xlabel("Height (in inches)")

We can change the background to plain white, instead of gray with gridlines. Set the style before drawing, because it only applies to plots created afterwards.

In [None]:
sns.set_style("white")
histogram = sns.distplot(height,kde = False)

Here is an example of using a dark figure style:

In [None]:
sns.set_style("dark")
histogram = sns.distplot(height,kde = False)

## Problem with histograms

Histograms are an effective way of demonstrating where your data are concentrated.

However, they are a somewhat subjective representation of your data, because their shape can change drastically depending on how you choose the bins (the groupings on the x-axis).

Below, we will create a bi-modal distribution of 120 cases, where one mode (40 cases) occurs at x=0 and another (80 cases) occurs at x=5.

We'll choose to make 10 evenly spaced bins in the range of -5 to 10.


In [None]:
N = 20
X = np.hstack((np.random.normal(0, 1, 2 * N),
               np.random.normal(5, 1, 4 * N)))

# 11 evenly spaced edges define 10 bins
bins = np.linspace(-5, 10, 11)
# histogram 1
sns.distplot(X, bins=bins, kde=False)

Now, let's try it with 5 evenly spaced bins.

In [None]:
# 6 evenly spaced edges define 5 bins
bins = np.linspace(-5, 10, 6)
# histogram 2
sns.distplot(X, bins=bins, kde=False)

Now let's try it with 7 evenly spaced bins.

In [None]:
# 8 evenly spaced edges define 7 bins
bins = np.linspace(-5, 10, 8)

# histogram 3
sns.distplot(X, bins=bins, kde=False)

The problem arises because we are cutting our data into groups instead of leaving it continuous.

Instead, we should think of the probability of our data falling into a certain area.

We can use a 'Kernel Density Estimator' to plot the distribution in a continuous way.

In [None]:
sns.kdeplot(X, shade=True);
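
To see what a kernel density estimate computes, here is a minimal hand-rolled sketch using NumPy only. It is not Seaborn's implementation; the function name `gaussian_kde_sketch` and the bandwidth of 0.5 are assumptions chosen for illustration. The idea is to place a small Gaussian bump on every data point and average the bumps.

```python
import numpy as np

def gaussian_kde_sketch(data, grid, bandwidth):
    """Average a Gaussian bump centred on every data point."""
    # Distance from every grid point to every data point, in bandwidth units.
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
# Bi-modal sample like the one above: 40 cases near 0 and 80 cases near 5.
data = np.hstack((rng.normal(0, 1, 40), rng.normal(5, 1, 80)))

grid = np.linspace(-5, 10, 301)
# The bandwidth is a hypothetical choice; smaller values give a bumpier curve.
density = gaussian_kde_sketch(data, grid, bandwidth=0.5)

# The curve is a probability density, so the area under it is close to 1.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

The bandwidth plays a role similar to the bin width in a histogram, but the resulting curve does not depend on where bin edges happen to fall.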

## Plot Two or More Histograms

It's also possible to represent two different groups with histograms.

To plot the histograms for two different groups, we are assuming that the data for each group are in different columns (e.g., one column has male heights and the other has female heights).

If your heights are in one column and your genders in another, remember that you can create two series of data by doing the following:

maleheights = df[(df.gender == 'male')]['height']
femaleheights = df[(df.gender == 'female')]['height']
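
The two lines above can be tried on a tiny table. This is a sketch with a hypothetical data frame (the values are made up); it also shows `groupby` as one alternative when there are more than two groups.

```python
import pandas as pd

# Hypothetical data frame with one height column and one gender column.
df = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "height": [70.1, 64.3, 68.9, 65.0],
})

# Boolean indexing, as in the lines above.
maleheights = df[df.gender == "male"]["height"]
femaleheights = df[df.gender == "female"]["height"]

# groupby builds one Series per group in a single pass.
heights_by_gender = {name: group for name, group in df.groupby("gender")["height"]}
print(heights_by_gender["male"])
```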

Plotting two groups involves running the kdeplot command twice and assigning the results to two different variables.

In [None]:
maleheights=np.random.normal(70,3,100)
femaleheights=np.random.normal(64,3,100)
hist1 = sns.kdeplot(maleheights, shade=True, color="blue")
hist2 = sns.kdeplot(femaleheights, shade=True,color="pink")
#you can hide the y-axis like this 
hist1.get_yaxis().set_visible(False)

# Two Histograms Against One Another

A histogram will tell you where the data are concentrated.

You can create a histogram along two axes to see where the data are most likely to intersect.

For example, consider the bus stop data we collected in the API Scraping Lesson.

Using the API, we could get a list of all of the latitude and longitude coordinates for every bus stop in the city.

This data has already been saved in a csv with each bus route and the location of every one of its stops.

In [None]:
import pandas as pd
import seaborn as sns
bus_stops = pd.read_csv("busstops.csv")
bus_stops.sample(10)

If we plotted a histogram of just the latitude, we would see the latitudes where most bus stops tend to be.

In [None]:
sns.kdeplot(bus_stops['Lat'], shade=True,color="blue")

Likewise, if we plotted the histogram of just the longitude, we could see which longitudes have the highest concentration of bus stops.

In [None]:
sns.kdeplot(bus_stops['Lon'], shade=True,color="green")

However, if we wanted to represent where in 2-dimensional space the bus stops tend to be, we could use a joint plot, where we indicate:

- Which two variables we want plotted against one another
- What kind of plot we want to make

Here, we are plotting the Longitude against the Latitude, and making a kdeplot out of it.

The cmap indicates what color scheme we should use for high and low values. "coolwarm" makes low values blue and high values red.

"shade=True" means that we want the density contours filled in.

In [None]:
g = sns.JointGrid(x="Lon", y="Lat", data=bus_stops)

g = g.plot_joint(sns.kdeplot, cmap="coolwarm", shade=True)
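
The binned counterpart of a joint density is a two-dimensional histogram. As a rough sketch of the idea, `np.histogram2d` counts how many points fall into each cell of a grid. The coordinates below are simulated with arbitrary values and are not the bus stop file.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated coordinates with arbitrary centres (assumptions for illustration).
lon = rng.normal(0.0, 0.05, 500)
lat = rng.normal(0.0, 0.08, 500)

counts, lon_edges, lat_edges = np.histogram2d(lon, lat, bins=5)

# counts[i, j] is the number of points in longitude bin i and latitude bin j.
i, j = np.unravel_index(counts.argmax(), counts.shape)
print("Busiest cell: longitude bin", i, "latitude bin", j)
```

A joint kdeplot smooths this grid of counts in the same way that a one-dimensional kdeplot smooths an ordinary histogram.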

## Color Maps

http://matplotlib.org/users/colormaps.html
    

# Violin Plots

Violin plots, like histograms, convey a lot of statistics in a single figure.

Violin plots are best used when you have many groups with overlapping values whose central tendency you want to compare.

In [None]:
#np.random.normal(mean,std_deviation,samplesize)
salaries=pd.DataFrame({'HR':np.random.normal(90000,5000,100),
                   'Management':np.random.normal(90000,5000,100),
                   'Accounting':np.random.normal(70000,5000,100)})

In [None]:
sns.set_style("white")
hist1 = sns.kdeplot(salaries['HR'], shade=True, color="blue")
hist2 = sns.kdeplot(salaries['Management'], shade=True,color="green")
hist3 = sns.kdeplot(salaries['Accounting'], shade=True,color="magenta")

A box plot is generally preferred when you are comparing the central tendency of many groups, and overlaying them would be difficult to visualize.

In [None]:
sns.set_style("whitegrid")
sns.boxplot(data=salaries, palette="deep")

Although box plots can convey the median, interquartile range, overall spread (the whiskers), and outliers, they do not convey the shape of the distribution of your data.

A violin plot can provide distribution information as well as information about the median and variability.

In [None]:
sns.set_style("white")
sns.violinplot(data=salaries)
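
The summary statistics drawn by a box plot can also be computed directly. The following is a minimal sketch on simulated salaries; it assumes the common convention that points more than 1.5 times the interquartile range beyond the box are drawn as outliers.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated salaries, similar to the HR column above.
salaries_hr = rng.normal(90000, 5000, 100)

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(salaries_hr, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = salaries_hr[(salaries_hr < lower_fence) | (salaries_hr > upper_fence)]

print("median:", round(median), "IQR:", round(iqr), "outliers:", len(outliers))
```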

# Scatter Plots

When plotting two continuous variables against one another, scatterplots tend to be used most frequently.

Let's read in data collected from the 1984 Olympics that contains information about how the women from each country performed in various running events.

In [None]:
olympics = pd.read_csv("olympicperformance.csv")
print(olympics.head(10))

If we wanted to visualize how one variable relates to another, we would use the lmplot() command to indicate we want a scatterplot.

In [None]:
sns.set(color_codes=True)
g = sns.lmplot(x="m100", y="m400", data=olympics)

By default, the scatterplot tries to fit a regression line through the data and plots the bootstrapped 95% confidence interval for each point in the line.
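
As a rough sketch of what a bootstrapped confidence band means (this is not Seaborn's own code), we can resample the data many times, refit a line each time, and take percentiles of the refitted lines. The data below are simulated stand-ins, and the 1000 resamples are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data (not the Olympic file): y rises linearly with x.
x = rng.uniform(10, 12, 50)
y = 4.5 * x + rng.normal(0, 1, 50)

grid = np.linspace(x.min(), x.max(), 20)
n_boot = 1000
lines = np.empty((n_boot, grid.size))

for i in range(n_boot):
    # Resample the (x, y) pairs with replacement and refit the line.
    idx = rng.integers(0, x.size, x.size)
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    lines[i] = slope * grid + intercept

# The band spans the middle 95% of the refitted lines at each x position.
lower, upper = np.percentile(lines, [2.5, 97.5], axis=0)
print(lower[:3], upper[:3])
```

The band is typically narrowest near the middle of the data and widens toward the ends, where the fitted line is less certain.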

We have many options, such as changing the color and the point marker.

In [None]:
sns.set(color_codes=True)
g = sns.regplot(x="m100", y="Marathon", data=olympics, marker="+", color="g")

## Plotting Interactions

We might have data in which we want to show an interaction.

Remember, an interaction is when the relationship (slope) between two variables (X and Y) differs depending on a third variable.

The following data show an interaction between spiciness of food and preference for the food. The relationship between spiciness and preference changes depending on whether it is a hot or cold day.

In [None]:
spicepreference = pd.read_csv("spicepreference.csv")

g = sns.lmplot(x="Spiciness", y="Preference", data=spicepreference)
g.set(ylim=(0, None),xlim=(0, None))

In [None]:
g = sns.lmplot(x="Spiciness", y="Preference", hue="Weather", data=spicepreference)
g.set(ylim=(0, None),xlim=(0, None))
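
To put numbers on an interaction, one option is to fit a separate line for each level of the third variable and compare the slopes. The sketch below uses simulated data; the column names follow the lesson, but the values and the "hot"/"cold" labels are assumptions and do not come from spicepreference.csv.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-in data: the slope is negative on hot days and
# positive on cold days (an assumption made for illustration).
spiciness = rng.uniform(0, 10, 200)
weather = np.where(rng.random(200) < 0.5, "hot", "cold")
true_slope = np.where(weather == "hot", -0.5, 0.5)
preference = 5 + true_slope * (spiciness - 5) + rng.normal(0, 0.5, 200)
data = pd.DataFrame({"Spiciness": spiciness,
                     "Preference": preference,
                     "Weather": weather})

# Fit one straight line per weather condition and keep its slope.
slopes = {
    name: np.polyfit(group["Spiciness"], group["Preference"], 1)[0]
    for name, group in data.groupby("Weather")
}
print(slopes)
```

Slopes that differ clearly between the groups are what the two differently colored regression lines in the hue plot show visually.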

# Plot Clusters

In [None]:
pizza = pd.read_csv("pizza.csv",index_col=0)
pizza

In [None]:
g = sns.clustermap(pizza)

## Plotting Clusters

In [None]:
from sklearn.cluster import KMeans
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(pizza)
k_means_labels = k_means.labels_

In [None]:
k_means_labels

In [None]:
clusters = pd.DataFrame(k_means_labels,index=pizza.index,columns=["Cluster"])
pizza_with_cluster = pd.concat([pizza, clusters], axis=1)

In [None]:
g = sns.lmplot(x="Spicy_Italian", y="Mediterranean", hue="Cluster", 
               fit_reg=False, 
               data=pizza_with_cluster)

# Time Series

In [None]:
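def make_number(content):
    # Strip the dollar sign and thousands separators, e.g. "$1,234.50" -> 1234.5
    try:
        cleaned = content.replace("$","").replace(",","")
        return float(cleaned)
    except (AttributeError, ValueError):
        # Not a string, or not parseable as a number
        return None
dowjones = pd.read_csv("dow_jones_index.csv")
dowjones['date'] = pd.to_datetime(dowjones['date'])
dowjones['close'] = dowjones['close'].apply(make_number)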

In [None]:
dowjones['date_delta'] = (dowjones['date'] - dowjones['date'].min())  / np.timedelta64(1,'D')

In [None]:
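from scipy.stats import sem
grouped_close = dowjones.groupby("date_delta")['close']
mean_close = grouped_close.mean().values
sem_close = grouped_close.apply(sem).values
# Take the days from the grouped index so they line up with the means
# (despite its name, this variable holds days since the first date)
years = grouped_close.mean().index.values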


In [None]:
plt.figure(figsize=(8, 6)) 


plt.fill_between(years, mean_close - sem_close,  mean_close + sem_close, color="#3F5D7D")  
plt.plot(years, mean_close, color="white", lw=2)  

plt.ylim(20, 80)  
plt.title("Dow Jones by Day", fontsize=22) 
plt.ylabel("Closing Price", fontsize=16)  
plt.xlabel("Day", fontsize=10)  




# Heat Maps

In [None]:
olympics = pd.read_csv("olympicperformance.csv")
print(olympics.head(10))

In [None]:
correlations = olympics.corr()
print(correlations)

In [None]:
ax = sns.heatmap(correlations,square=True)

In [None]:
import numpy as np
sns.set_style("white")

mask = np.zeros_like(correlations)
mask[np.triu_indices_from(correlations)] = True
ax = sns.heatmap(correlations, mask=mask, square=True)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
iris = load_iris()
X = iris.data
target = iris.target

knn = KNeighborsClassifier()
knn.fit(X,target)
predictions = knn.predict(X)

mat = confusion_matrix(target,predictions)
print(mat)

In [None]:
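# confusion_matrix(target, predictions) puts the actual classes on the rows
# and the predicted classes on the columns
row_labels = ["Actual_"+x for x in iris.target_names]
column_labels = ["Predicted_"+x for x in iris.target_names]
confusion_df = pd.DataFrame(mat, index=row_labels, columns=column_labels)
ax = sns.heatmap(confusion_df, square=False,cmap="Blues")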

Even if you don't have a square matrix, you can plot data on a heatmap if you know what the row and column variables are.

For example, let's consider this dataset, which has the difference between the temperature for each month and the average temperature over the past 150 years.

We want to show that each year, the global temperature is farther above average, and that this is true for every month.

Therefore, we want the rows to be the year, the columns to be the month, and the values to be the difference from the average global temperature.

In [None]:
globaltemperatures = pd.read_csv("yearlytemp.csv")
globaltemperatures.sample(5)

We're going to use the "pivot" function so that each month gets its own column.

We'll end up with a Year x Month table, and each cell in the table will contain the temperature difference.

In [None]:
temps_by_month_and_year = globaltemperatures.pivot(index="Year", columns="Month", values="TempDiff")
print(temps_by_month_and_year.head(10))
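
To see what pivot does, here is a minimal sketch on a tiny hypothetical table. The column names match the lesson, but the values are made up.

```python
import pandas as pd

# Hypothetical long-format data: one row per (Year, Month) pair.
long_format = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001],
    "Month": [1, 2, 1, 2],
    "TempDiff": [0.25, 0.5, 0.75, 1.0],
})

# Years become the rows, months become the columns, and TempDiff fills the cells.
wide = long_format.pivot(index="Year", columns="Month", values="TempDiff")
print(wide)
```

Each (Year, Month) pair must appear only once in the long table; otherwise pivot cannot decide which value belongs in the cell.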

Let's plot the heatmap now using the pivoted table.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 16))
ax = sns.heatmap(temps_by_month_and_year,yticklabels=5,cmap="inferno")
# Shrink the year labels after the heatmap exists, so that ax refers to this plot
plt.setp(ax.yaxis.get_majorticklabels(), fontsize=6)
ax.invert_yaxis()