---
title: Data Cleaning Lab + Mini-Lesson on Visualization
type: lab
duration: "1:5"
creator:
    name: Joshua Cook
    city: Santa Monica
---

# Basic Visualizations with Seaborn 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

iris_data_location = '../data/iris.csv'
# header=0 marks the file's first row as a header row, which `names` then replaces
iris_data_dataframe = pd.read_csv(iris_data_location,
                                  index_col=False,
                                  header=0,
                                  names=['sepal_length',
                                         'sepal_width',
                                         'petal_length',
                                         'petal_width',
                                         'class'])

In [None]:
iris_data_dataframe.head(3)

## Using `seaborn`

http://seaborn.pydata.org/api.html

### `sns.pairplot`

Plots pairwise relationships in a dataset.

- draws scatterplots for joint relationships between each pair of variables (off the diagonal)
- draws histograms for the univariate distribution of each variable (on the diagonal)

In [None]:
sns.pairplot(iris_data_dataframe)

### `sns.countplot`

Show the counts of observations in each categorical bin using bars.

A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.
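
As a rough sketch of what the bars represent, the heights of a count plot are the same numbers that `value_counts()` returns for the column. The small DataFrame below is made up for illustration and is not the iris file used in this lesson.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'class' column
toy = pd.DataFrame({'class': ['setosa', 'setosa', 'setosa',
                              'versicolor', 'versicolor',
                              'virginica']})

# One count per category: these are the bar heights countplot would draw
class_counts = toy['class'].value_counts()
print(class_counts)
```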

In [None]:
plt.figure(figsize=(20, 6))
sns.countplot(x='class',
              data=iris_data_dataframe)

In [None]:
# petal_length is continuous, so countplot treats each distinct value as its own category
plt.figure(figsize=(20, 6))
sns.countplot(x='petal_length',
              data=iris_data_dataframe)

### `sns.stripplot`

Draw a scatterplot where one variable is categorical.

A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

In [None]:
sns.stripplot(x='class',
              y='sepal_length',
              data=iris_data_dataframe)

### `sns.swarmplot`

Draw a categorical scatterplot with non-overlapping points.

This function is similar to `stripplot()`, but the points are adjusted (only along the categorical axis) so that they don’t overlap.

In [None]:
sns.swarmplot(x='class',
              y='sepal_length',
              data=iris_data_dataframe)

### `sns.boxplot`

Draw a box plot to show distributions with respect to categories.

A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable.

The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.

https://en.wikipedia.org/wiki/Box_plot
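
The outlier rule can be sketched by hand. This assumes the common convention of placing the cutoffs 1.5 times the inter-quartile range beyond the quartiles, which matches the default `whis=1.5` in `sns.boxplot`. The measurements below are made up for illustration.

```python
import numpy as np

# Hypothetical measurements, sorted for readability
values = np.array([4.3, 4.8, 5.0, 5.0, 5.1, 5.2, 5.4, 7.9])

# Quartiles mark the edges of the box
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]

print('box:', q1, q3)
print('outliers:', outliers)
```

The whiskers then extend to the most extreme points that are still inside the two bounds.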

In [None]:
sns.boxplot(x='class',
            y='sepal_length',
            data=iris_data_dataframe)

### `sns.barplot`

Show point estimates and confidence intervals as rectangular bars.

A bar plot represents an estimate of central tendency (the mean, by default) for a numeric variable with the height of each rectangle, and uses error bars to indicate the uncertainty around that estimate.
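
By default, `sns.barplot` draws the mean of `y` within each `x` category as the bar height. A minimal sketch of that aggregation with `groupby`, using made-up values rather than the iris file:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the lesson
toy = pd.DataFrame({'class': ['setosa', 'setosa', 'virginica', 'virginica'],
                    'sepal_length': [5.0, 5.2, 6.0, 7.0]})

# Mean of sepal_length per class: the bar heights barplot would draw by default
bar_heights = toy.groupby('class')['sepal_length'].mean()
print(bar_heights)
```

The error bars that seaborn adds on top of each bar are not reproduced here.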

In [None]:
sns.barplot(x='class',
            y='sepal_length',
            data=iris_data_dataframe)

### Subplots

In [None]:
fig = plt.figure(figsize=(20, 6))
# 121 = grid of 1 row and 2 columns, first panel
fig.add_subplot(121)
sns.violinplot(x='class',
               y='sepal_length',
               data=iris_data_dataframe)
# 122 = same grid, second panel
fig.add_subplot(122)
sns.violinplot(x='class',
               y='sepal_width',
               data=iris_data_dataframe)

In [None]:
fig = plt.figure(figsize=(20,12))
for i, param in enumerate(['sepal_length', 
                           'sepal_width',
                           'petal_length',
                           'petal_width']):
    
    fig.add_subplot(221+i)
    plt.title('class vs. ' + param)
    sns.swarmplot(x='class',
                  y=param,
                  data=iris_data_dataframe)

# Lab: Data Cleaning

In [None]:
titanic_data_location = '../data/titanic.csv'
lusitania_data_location = '../data/lusitania.csv'

In [None]:
# Load It

In [None]:
# Summarize It (mean, median, mode, variance, std, range)

In [None]:
# Clean It

In [None]:
# Does cleaning it change the summary information?
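
One way to approach the last question is to compute the same summary before and after cleaning and compare the two. The sketch below uses a made-up `age` column, a hypothetical `summarize` helper, and median filling as one possible cleaning choice. None of these come from the lab data, and other cleaning choices will give different answers.

```python
import pandas as pd

def summarize(values):
    """Return the summary statistics named in the lab for a numeric Series."""
    return {
        'mean': values.mean(),
        'median': values.median(),
        'mode': values.mode().iloc[0],   # first mode if there are ties
        'variance': values.var(),
        'std': values.std(),
        'range': values.max() - values.min(),
    }

# Hypothetical column with two missing entries
age = pd.Series([22.0, 38.0, None, 35.0, None], name='age')

before = summarize(age)  # pandas skips NaN in these statistics by default

# One possible cleaning choice: fill missing values with the median
cleaned = age.fillna(age.median())
after = summarize(cleaned)

for stat in before:
    print(stat, before[stat], after[stat])
```

In this toy case the median and range are unchanged, while the mean shifts toward the filled value and the spread shrinks.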