# SI 618: Data Manipulation and Analysis
## 05 - Data analysis II: ANOVA, t-test, linear models

### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>


## Visualization for Data Scientists

We're going to ask a special virtual guest lecturer to provide some background on data visualization.  Together, we'll watch [a brief (8-minute) video by Dr. Chris Brooks](https://www.coursera.org/learn/python-plotting/lecture/qrqqa/tools-for-thinking-about-design-alberto-cairo)
and pause it several times to answer the following questions:



## <font color="magenta">Q1a: As someone who is studying data science, who are you trying to reach through your visualizations?  </font>


(replace this with your response)

## <font color="magenta">Q1b: What sense can you make of this image?</font>
![](resources/BrooksResearch.png)


(replace this with your response)

## <font color="magenta">Q1c: How many different kinds of information can you see in the Minard graphic, and what are they?</font>

![](resources/Menard.png)

(replace this with your response)


## Returning to Seaborn

https://seaborn.pydata.org/examples/index.html

Take a look at the different visualizations that are possible.

## <font color="magenta">Q2a: Provide the title, description, and URL of one of the visualizations that you find particularly interesting and explain why you find it interesting.  </font>

(replace this with your response)

## <font color="magenta">Q2b: Given what we learned from Prof. Brooks, indicate 1-3 axes from Cairo's Visual Wheel where your chosen Seaborn visualization would likely score highly. Explain why.</font>

![](resources/CairoVisualWheel.png)

(replace this with your response)

## Seaborn versus Matplotlib
* Matplotlib
     * Low-level, basis for many packages
     * Painful to construct certain graphs
     * Not Pandas friendly
     * Not interactive
* Seaborn
     * Pandas friendlier
     * Great for some stats plots


## Part 1: Iris dataset
![](resources/iris.png)

In [None]:
import seaborn as sns

In [None]:
df = sns.load_dataset('iris')
df.head()

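Remember our distplots (seaborn 0.11 and later deprecate ```distplot``` in favor of ```histplot``` and ```displot```, so you may see a warning):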


In [None]:
sns.distplot(df.sepal_length)

## <font color="magenta"> Q3: Create similar plots for the other three numeric variables in the dataset. In a couple of sentences, describe each of the plots.  </font>

In [None]:
# insert your code here

Insert your interpretation here

We often want to see how two variables vary with each other.  We'll get into the details 
in a few classes, but for now let's examine them visually.  In seaborn, we do this using 
```jointplot()```, which draws a scatter plot of the two variables along with the distribution of each.
For example, to look at the relationship between sepal_length and sepal_width, we could do something like:



In [None]:
sns.jointplot(x='sepal_length',y='sepal_width',data=df)

## <font color="magenta"> Q4: It's a bit difficult to see where the interesting areas in the plot are, so it's worth trying a hexbin plot.  Go ahead and copy the above  code block and add ```kind="hex"``` to the jointplot parameters. In a couple of sentences, describe what stands out to you about the visualization. </font>

In [None]:
# insert your code here

Insert your interpretation here

Now, take a look at what happens when you set ```kind="kde"```

In [None]:
sns.jointplot(x='sepal_length',y='sepal_width',data=df,kind="kde")

Finally, you may want to look at all the numeric variables in your
dataset at once. Use ```pairplot``` to do this (here restricted to the setosa species):


In [None]:
sns.pairplot(df.query("species == 'setosa'"))

We can get fancier by using a different column to set the color (or "hue"). Try running the following code:

In [None]:
sns.pairplot(df,hue="species")

Now let's introduce some correlations.  We're not going to spend time on learning about the 
theory behind correlation, as you've done that in the statistics prerequisite for this course.
Instead, we're going to jump right in and annotate a graph with a lot of statistical information:

In [None]:
from scipy import stats

In [None]:
# ignore the warning about deprecated annotation
g = sns.JointGrid(data=df,x='petal_length',y='sepal_length')
g = g.plot(sns.regplot, sns.distplot)
g = g.annotate(stats.pearsonr)
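
The annotation step above relies on ```JointGrid.annotate```, which is deprecated and may not be available in newer seaborn releases. As a minimal sketch, the same statistic can be computed directly with ```scipy.stats.pearsonr```. The two arrays below are made-up values standing in for two numeric columns; they are not taken from the iris data.

```python
import numpy as np
from scipy import stats

# Made-up measurements standing in for two numeric columns (not the iris data)
x = np.array([1.4, 1.5, 4.7, 4.5, 6.0, 5.1])
y = np.array([5.1, 4.9, 7.0, 6.4, 6.3, 5.8])

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p = stats.pearsonr(x, y)
print(f"pearsonr = {r:.2f}; p = {p:.3g}")
```

With a data frame, the same call would take two columns, for example ```stats.pearsonr(df.petal_length, df.sepal_length)```.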

Think about what the different components mean.  We'll return to using this in the next section on Wine Quality.

## Part 2: Wine quality
![](resources/vinho.png)
https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009/home

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

In [None]:
wine = pd.read_csv('data/winequality-red.csv')
wine.head()

In [None]:
wine['isgood'] = np.where(wine['quality'] > 5, 'good','bad')
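
The cell above uses ```np.where(condition, a, b)```, which picks ```a``` wherever the condition is true and ```b``` everywhere else. A small sketch with made-up quality scores (not read from the wine file) shows the effect:

```python
import numpy as np

# Made-up quality scores (not taken from the wine file)
quality = np.array([3, 5, 6, 8])

# Elementwise choice: 'good' where the score exceeds 5, otherwise 'bad'
labels = np.where(quality > 5, 'good', 'bad')
print(labels)
```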

In [None]:
# This will yield a warning if you're using python >= 3.7 and scipy < 1.2
#   For now, I suggest ignoring the warning
#   For a more detailed explanation, please see https://stackoverflow.com/questions/52594235/futurewarning-using-a-non-tuple-sequence-for-multidimensional-indexing-is-depre

sns.distplot(wine['fixed acidity'])

## <font color="magenta">Q5: Create a pairplot for the wine dataset that plots 'good' and 'bad' wines in different hues. In a couple of sentences, describe interesting relationships shown by the visualization.</font>

In [None]:
# insert your code here

Insert your interpretation here

## T-test

A t-test is a simple statistical test that's commonly used to check whether the means of two different
groups are the same.  ```scipy.stats``` gives us a handy interface for this:

In [None]:
goodwines = wine.query('isgood == "good"')
badwines = wine.query('isgood == "bad"')

In [None]:
# Compare mean fixed acidity between the two groups defined above
stats.ttest_ind(goodwines['fixed acidity'], badwines['fixed acidity'])
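
To make the output of ```ttest_ind``` easier to read, here is a minimal sketch on made-up samples (randomly generated, not the wine data). It unpacks the t statistic and p-value, and also shows Welch's variant, which drops the equal-variance assumption that ```ttest_ind``` makes by default.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two made-up samples whose true means differ by 1
group_a = rng.normal(loc=8.0, scale=1.5, size=200)
group_b = rng.normal(loc=9.0, scale=1.5, size=200)

# Student's t-test (assumes equal variances, the scipy default)
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
print(f"Welch t = {t_welch:.2f}, p = {p_welch:.3g}")
```

A small p-value suggests the difference in sample means is unlikely under the hypothesis that the two groups share the same mean.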

## <font color="magenta">Q6: Using the JointGrid approach we used above, look at the relationship between sulphates and chlorides.  What patterns do you see?</font>

In [None]:
# insert your code here

Insert your interpretation here.

## Ordinary Least Squares (OLS) Regression

We can get a lot more detail about the regression model by using ```statsmodels```.

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

statsmodels.formula.api allows you to use R-Style formulas: y ~ x1 + x2 + x3 + ...

1. y represents the outcome/dependent variable
2. x1, x2, x3, etc. represent explanatory/independent variables

In [None]:
model1 = smf.ols('chlorides ~ sulphates', data=wine).fit()
model1.summary()
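
To see how the coefficients in a summary relate to the data, it can help to fit the same kind of formula to data where the answer is known. The sketch below generates made-up data from y = 2 + 3x plus noise (the column names ```x``` and ```y``` and the data frame ```demo``` are hypothetical) and checks that OLS recovers values close to the intercept and slope.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Made-up data generated as y = 2 + 3*x + noise
demo = pd.DataFrame({'x': rng.uniform(0, 10, size=200)})
demo['y'] = 2 + 3 * demo['x'] + rng.normal(scale=1.0, size=200)

# Same formula style as above: outcome on the left, predictor on the right
fit = smf.ols('y ~ x', data=demo).fit()

print(fit.params)    # Intercept and slope estimates
print(fit.rsquared)  # share of the variance in y explained by x
```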

### Interesting things happen when we use OLS to do an ANOVA (look closely at the model):

In [None]:
model2 = smf.ols('chlorides ~ C(isgood)', data=wine).fit()
model2.summary()

In [None]:
aov_table = sm.stats.anova_lm(model2, typ=2)
print(aov_table)
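
One thing worth noticing with a two-level factor such as ```isgood```: a one-way ANOVA and an equal-variance t-test are equivalent, with F equal to the square of t. The sketch below checks this on made-up samples using ```scipy.stats.f_oneway```, which is a different route from the statsmodels approach above and is used here only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Made-up measurements for two groups (stand-ins for 'good' and 'bad')
good = rng.normal(loc=0.08, scale=0.02, size=50)
bad = rng.normal(loc=0.09, scale=0.02, size=60)

t_stat, t_p = stats.ttest_ind(good, bad)  # equal-variance t-test
f_stat, f_p = stats.f_oneway(good, bad)   # one-way ANOVA with two groups

print(t_stat ** 2, f_stat)  # the two values agree up to rounding
print(t_p, f_p)             # and so do the p-values
```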

### We might want to experiment with the original ```quality``` variable, either in a regression model:

In [None]:
model3 = smf.ols('chlorides ~ quality', data=wine).fit()
model3.summary()

### or in an ANOVA (again, look closely at the model):

In [None]:
model4 = smf.ols('chlorides ~ C(quality)', data=wine).fit()
model4.summary()

In [None]:
aov_table = sm.stats.anova_lm(model4, typ=2)
print(aov_table)

## <font color="magenta">Q7: Use OLS to perform either a regression or an ANOVA on a variable (other than chlorides) and interpret your results.</font>

In [None]:
# insert your code here

Insert your interpretation here

## Part 3:  Airplane Crashes and Fatalities
The next dataset we are going to look at is the full history of airplane crashes throughout the world, from 1908 to 2009.  It's taken from:

https://opendata.socrata.com/Government/Airplane-Crashes-and-Fatalities-Since-1908/q2te-8cvq

In [None]:
import pandas as pd
import seaborn as sns

We've provided the CSV file for this lab so you can go ahead and load it in the usual way:

In [None]:
crashes = pd.read_csv('data/Airplane_Crashes_and_Fatalities_Since_1908.csv')

As always, you should take a look at the data to get a sense of 
what it's like:

In [None]:
crashes.head()

As we mentioned in an earlier class, pandas is really good at helping
us deal with dates.  The 'Date' column in the dataframe contains 
strings that look like dates.  We can use the ```pandas.to_datetime()``` function to convert the strings to an internal datetime object
(see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html for more details):

In [None]:
crashes['Date'] = pd.to_datetime(crashes['Date'])

And let's look at the dataframe again.  See any difference?

In [None]:
crashes.head()

The pandas datetime object makes it easy to extract interesting 
parts of the date or time.  In our case, we're interested in extracting
the year, so we can do that with the following code:

In [None]:
crashes['year'] = crashes['Date'].dt.year

And, as always, let's look at what we got:

In [None]:
crashes.year.head()
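
The conversion and extraction steps above can be tried in isolation. This is a minimal sketch on made-up date strings; the month/day/year layout and the ```demo``` data frame are assumptions for illustration, not a description of the CSV file.

```python
import pandas as pd

# Made-up date strings in month/day/year form (the format is an assumption)
demo = pd.DataFrame({'Date': ['09/17/1908', '07/12/1912', '06/01/2009']})

# Parse the strings; an explicit format avoids ambiguous day/month guesses
demo['Date'] = pd.to_datetime(demo['Date'], format='%m/%d/%Y')

# The .dt accessor exposes datetime parts for a whole column
demo['year'] = demo['Date'].dt.year
demo['month'] = demo['Date'].dt.month
print(demo)
```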

As part of the final exercise in this class, let's create a 
visualization of ```Fatalities``` by year:

In [None]:
# Pass x and y by keyword; newer seaborn versions no longer accept them positionally
sns.barplot(x='year', y='Fatalities', data=crashes)

That doesn't look great, does it?  
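
One thing to keep in mind when reading this plot: by default ```barplot``` draws the mean of the y variable within each x category, so each bar is the average number of fatalities per crash in that year, not the yearly total. The sketch below uses made-up records (not from the CSV) and a pandas ```groupby``` to show the difference between the two aggregates.

```python
import pandas as pd

# Made-up crash records (not from the CSV)
demo = pd.DataFrame({
    'year': [1950, 1950, 1950, 1951],
    'Fatalities': [10, 20, 30, 40],
})

per_year = demo.groupby('year')['Fatalities']
print(per_year.mean())  # what barplot shows by default
print(per_year.sum())   # total fatalities in each year
```

If totals are what you want, ```barplot``` accepts an ```estimator``` argument, for example ```estimator=sum```.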


## <font color="magenta">Q8: Create a barplot of the number of fatalities per decade and describe the results.</font>

Go ahead and create a new column called 'decade' 
that represents the decade for each year.  Remember that an integer divide (a.k.a. a floor divide) can be
done with the // operator.
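
As a quick reminder of how floor division behaves on plain integers (the numbers are arbitrary examples):

```python
# Floor division discards the remainder and returns a whole number
half = 7 // 2
tens = 1987 // 10

# Ordinary division keeps the fractional part
exact = 1987 / 10

print(half, tens, exact)
```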

What's the trend in airplane crash fatalities?

In [None]:
# insert your code here

Insert your interpretation here.

## <font color="magenta">Q9: (Optional, up to 2 bonus points): Explore some of the options available in Seaborn to control the aesthetics of your plots</font>

Using any of the figures we created in this lab, or any other figures you like, explore the various ways in which you can control
the aesthetics of your figures.  See https://seaborn.pydata.org/tutorial/aesthetics.html for additional information.

In [None]:
# insert your code here

Insert your interpretation here.

## Part 4 (FYI): Functional Magnetic Resonance Imaging
**NOTE:  The remainder of this notebook requires seaborn 0.9.0 (or newer) to provide the "relplot" capabilities**

![](resources/fmri.png)

In [None]:
fmri = sns.load_dataset("fmri")

In [None]:
fmri.head()

In [None]:
fmri.describe()

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri);

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", ci=None, data=fmri);

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", ci="sd",data=fmri);

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", estimator=None, data=fmri);

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri, hue="event");

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri, hue="region", style="event");