# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

# Pandas and Seaborn

This tutorial on Pandas and Seaborn was prepared by [Shang-Tse Chen](http://www.cc.gatech.edu/~schen351/), who was a teaching assistant for CSE 6040 in Fall 2015.

Most of the examples in this tutorial come from the [Pandas tutorial](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) and the [Seaborn tutorial](http://stanford.edu/~mwaskom/software/seaborn/tutorial.html).

## Part 1: Data analysis using Pandas

Pandas is pre-installed with Anaconda. 
Let's try to import it.

In [None]:
import pandas as pd

## Create Data
The data set will consist of 5 baby names and the number of births recorded for that year (1880).

In [None]:
# The initial set of baby names and birth counts
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]

To merge these two lists together we will use the zip function.

In [None]:
BabyDataSet = list (zip (names,births))
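
As a quick check of what `zip` produces, the sketch below pairs the same two lists and prints the first pair. The variable name `pairs` is only for illustration.

```python
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

# zip pairs the i-th name with the i-th birth count;
# list() materializes the pairs so they can be indexed
pairs = list(zip(names, births))

print(pairs[0])    # the first (name, births) tuple
print(len(pairs))  # one tuple per baby name
```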

We are basically done creating the data set. We now will use the **pandas** library to export this data set into a csv file.

We will create a DataFrame object. You can think of this object as holding the contents of the BabyDataSet in a format similar to an Excel spreadsheet.

In [None]:
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df

Export the DataFrame to a ***csv*** file. We can name the file ***births1880.csv***. The function ***to_csv*** will be used to export the file. The file will be saved in the same directory as the notebook unless a different path is given.

In [None]:
df.to_csv ('births1880.csv', index=False, header=False)
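
To preview what gets written without opening the file, `to_csv` returns the CSV text as a string when no path is given. A minimal sketch, rebuilding the same DataFrame so that it runs on its own (`df_babies` and `csv_text` are illustrative names):

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
df_babies = pd.DataFrame(list(zip(names, births)), columns=['Names', 'Births'])

# With no path argument, to_csv returns the CSV content as a string
csv_text = df_babies.to_csv(index=False, header=False)
print(csv_text)
```

Because of `header=False`, the first line of the output is already a data row, and because of `index=False`, no row numbers are written.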

## Get Data
To pull in the csv file, we will use the pandas function *read_csv*. Let us take a look at this function and what inputs it takes.

In [None]:
df = pd.read_csv("births1880.csv")
df

This brings us to our first problem of the exercise. The ***read_csv*** function treated the first record in the csv file as the header names. This is not correct, since the file we wrote does not contain a header row.

To correct this we will pass the ***header*** parameter to the *read_csv* function and set it to ***None*** (Python's null value).

In [None]:
df = pd.read_csv ("births1880.csv", header=None)
df

If we want to give the columns specific names, we pass another parameter called ***names***. When *names* is given, the *header* parameter can be omitted.

In [None]:
df = pd.read_csv ("births1880.csv", names=['Names', 'Births'])
df
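
The three `read_csv` variants can be compared side by side without touching the disk. The sketch below is a stand-in under an assumption: it feeds the same five rows through `io.StringIO` instead of reading `births1880.csv`.

```python
import io
import pandas as pd

# Same content as the exported file (an in-memory stand-in, not the file itself)
csv_text = "Bob,968\nJessica,155\nMary,77\nJohn,578\nMel,973\n"

# Default: the first row is consumed as column names
inferred = pd.read_csv(io.StringIO(csv_text))
# header=None: every row is data, columns are numbered 0, 1
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
# names=...: every row is data, columns get the given names
named = pd.read_csv(io.StringIO(csv_text), names=['Names', 'Births'])

print(list(inferred.columns), len(inferred))
print(list(no_header.columns), len(no_header))
print(list(named.columns), len(named))
```

Only the default call loses a record: it returns four rows, with "Bob" and "968" used as column names.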

It is also possible to read in a csv file by passing a URL.
Here we use the famous [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals.

In [None]:
df = pd.read_csv ("https://raw.githubusercontent.com/bigmlcom/bigmler/master/data/iris.csv")
df.head(10)

## Analyze Data

In [None]:
# show basic statistics
df.describe()

In [None]:
# Select a column
df["sepal length"].head()

In [None]:
# select columns
df[["sepal length", "petal width"]].head()

In [None]:
# select rows by index label (the end label is included)
df.loc[5:10]

In [None]:
# select rows by position
df.iloc[5:10]
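
The `loc` and `iloc` slices look alike but do not return the same rows: `loc` slices by index label and includes the end label, while `iloc` slices by integer position and excludes the end, as ordinary Python slicing does. A small sketch on a made-up frame (`toy` is hypothetical, not the Iris data):

```python
import pandas as pd

toy = pd.DataFrame({"value": range(12)})  # default index labels 0..11

by_label = toy.loc[5:10]      # labels 5 through 10, end included
by_position = toy.iloc[5:10]  # positions 5 through 9, end excluded

print(len(by_label), len(by_position))
```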

In [None]:
# select rows by condition
df[df["sepal length"] > 5.0]
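
The expression inside the brackets is a boolean Series, one True/False value per row, and pandas keeps the rows where it is True. Several conditions can be combined with `&` and `|`, each wrapped in parentheses. A sketch on made-up measurements (the values are illustrative, not taken from the Iris file):

```python
import pandas as pd

toy = pd.DataFrame({
    "sepal length": [4.9, 5.1, 6.3, 5.8],
    "petal width": [0.2, 0.3, 1.8, 1.2],
})

# The comparison alone yields a boolean mask
mask = toy["sepal length"] > 5.0
print(mask.tolist())

# Parentheses are needed because & binds tighter than >
long_and_wide = toy[(toy["sepal length"] > 5.0) & (toy["petal width"] > 1.0)]
print(long_and_wide)
```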

We can get the maximum sepal length with the `max` method:

In [None]:
df["sepal length"].max()

To see the full record of the flower with the maximum sepal length, we can sort by that column in descending order and take the first row:

In [None]:
df.sort_values (by=["sepal length"], ascending=False).head (1)
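
Sorting the whole table is not the only way to reach this row. As an alternative sketch (not the approach used in this tutorial), `idxmax` returns the index label of the maximum, which `loc` can then look up. The frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "sepal length": [5.1, 7.9, 6.3],
    "species": ["Iris-setosa", "Iris-virginica", "Iris-versicolor"],
})

# idxmax gives the index label of the largest value in the column
label = toy["sepal length"].idxmax()
print(toy.loc[label])
```

If several rows tie for the maximum, `idxmax` reports only the first of them.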

## Exercise 
Print the full information of the flower whose petal length is the second shortest among the 50 Iris-setosa flowers.

In [None]:
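# keep only the Iris-setosa rows (assumes the label column is named "species")
setosa = df[df["species"] == "Iris-setosa"]
# sort by petal length and take the second row, leaving df itself unchanged
setosa.sort_values (by=["petal length"], ascending=True).iloc[1]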

Pandas also has some basic plotting functions.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
df.hist()

## Part 2: Visualization using Seaborn
Depending on your Anaconda version, Seaborn may not be installed by default.

If the import below fails, try installing it using pip: **pip install seaborn**.

In [None]:
import seaborn as sns

# show the plots right below the code cells
%matplotlib inline

## Plotting univariate distributions
The `distplot()` function draws a histogram and fits a kernel density estimate. (Seaborn 0.11 and later deprecate it in favor of `histplot()` and `displot()`.)

In [None]:
import numpy as np
x = np.random.normal (size=100)
sns.distplot (x)

In [None]:
import random
x = [random.normalvariate (0, 1) for i in range (0, 1000)]
sns.distplot (x)
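
When a density estimate is overlaid, the histogram is drawn as a density, so the bar areas sum to 1 instead of the bar heights being raw counts. A NumPy-only sketch of that normalization follows; the seed and bin count are arbitrary choices, and this is not Seaborn's own code.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

# density=True rescales counts so that sum(height * bin width) equals 1
density, edges = np.histogram(x, bins=10, density=True)
area = np.sum(density * np.diff(edges))
print(area)
```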

## Plotting bivariate distributions

In [None]:
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
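
To confirm that the draws follow the requested `mean` and `cov`, the sample statistics can be compared with them. This sketch assumes a larger sample (20000 instead of 200) and a fixed seed so that the estimates are stable; `samples` is an illustrative name.

```python
import numpy as np

mean, cov = [0, 1], [(1, .5), (.5, 1)]
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=20000)

# Column means should be close to [0, 1]
sample_mean = samples.mean(axis=0)
# rowvar=False treats each column as one variable
sample_cov = np.cov(samples, rowvar=False)

print(sample_mean)
print(sample_cov)
```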

### Scatter plot

In [None]:
sns.jointplot(x="x", y="y", data=df)

### Hexbin plot

In [None]:
sns.jointplot(x="x", y="y", data=df, kind="hex")

### Kernel density estimation

In [None]:
sns.jointplot(x="x", y="y", data=df, kind="kde")

## Visualizing pairwise relationships in a dataset
To plot multiple pairwise bivariate distributions in a dataset, you can use the `pairplot()` function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes:

In [None]:
iris = sns.load_dataset("iris")
sns.pairplot(iris)

In [None]:
# we can add colors to different species
sns.pairplot(iris, hue="species")

### Visualizing linear relationships

In [None]:
tips = sns.load_dataset("tips")
tips.head()

We can use the function `regplot` to show the linear relationship between total_bill and tip. 
It also shows the 95% confidence interval.

In [None]:
sns.regplot(x="total_bill", y="tip", data=tips)

### Visualizing higher order relationships

In [None]:
anscombe = sns.load_dataset("anscombe")
sns.regplot(x="x", y="y", data=anscombe[anscombe["dataset"] == "II"])

The plot clearly shows that this is not a good model.
Let's try to fit a polynomial regression model with degree 2.

In [None]:
sns.regplot(x="x", y="y", data=anscombe[anscombe["dataset"] == "II"], order=2)
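
For `order` greater than 1, `regplot` estimates the polynomial with `numpy.polyfit`, so the comparison between the two plots can be repeated numerically. In the sketch below the eleven points are Anscombe's dataset II typed in by hand rather than loaded with `sns.load_dataset`, so treat them as an assumption to check against the loaded frame. The helper `fit_polynomial` is hypothetical.

```python
import numpy as np

# Anscombe's dataset II, entered by hand (assumption: matches the loaded data)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def fit_polynomial(degree):
    """Least-squares polynomial fit; returns coefficients and sum of squared residuals."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return coeffs, float(np.sum(residuals ** 2))

line_coeffs, line_sse = fit_polynomial(1)
quad_coeffs, quad_sse = fit_polynomial(2)

print("degree 1:", line_coeffs, line_sse)
print("degree 2:", quad_coeffs, quad_sse)
```

The squared error of the degree-2 fit is far smaller than that of the straight line, which is what the two plots show visually.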

### Strip plots
This is similar to a scatter plot, but it is used when one variable is categorical.

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips)

### Boxplots

In [None]:
sns.boxplot(x="day", y="total_bill", hue="time", data=tips)
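
Each box spans the first to the third quartile, with a line at the median. By default the whiskers reach the most extreme points within 1.5 times the interquartile range, and points beyond that are drawn individually. A NumPy sketch of those quantities on made-up bill amounts (not the `tips` data):

```python
import numpy as np

bills = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 60.0])

# Quartiles define the box; the median is the line inside it
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Points above this fence would be drawn as individual outliers
upper_fence = q3 + 1.5 * iqr
outliers = bills[bills > upper_fence]

print(q1, median, q3)
print(upper_fence, outliers)
```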

### Bar plots

In [None]:
titanic = sns.load_dataset("titanic")
sns.barplot(x="sex", y="survived", hue="class", data=titanic)
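
By default, the height of each bar in `barplot` is the mean of the `y` column within that group. The same numbers can be obtained in pandas with `groupby`. The sketch below uses a handful of made-up rows (not the Titanic data) with the same column names.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "female", "male"],
    "class": ["First", "Third", "First", "Third", "First", "Third"],
    "survived": [1, 0, 1, 0, 1, 1],
})

# Mean of a 0/1 column is the fraction of ones in each (sex, class) group
rates = toy.groupby(["sex", "class"])["survived"].mean()
print(rates)
```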