# First Steps & Orientation

## Acquire, Describe, and Visualize Data
- The descriptive statistics tell one story about the data
- Visualizing the data adds another layer to that story

In [1]:
# Run this block of code in order to import all of the libraries that we need for this module.
# Shift + Enter on your keyboard or click the play button to run code cells. 
import numpy as np # linear algebra library
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Visualization libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

## Acquire the data:
- In practice, our data isn't always 100% ready, clean, and accessible.
- You're lucky if you get a spreadsheet or CSV ready to go!

In [2]:
# Acquire the data
# pd.read_csv loads a CSV file into a DataFrame, which we assign to a variable.
# index_col="id" uses the "id" column as the row index.
quartet = pd.read_csv("../input/quartet.csv", index_col="id")
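To see what `index_col` does without the file on disk, here is a minimal sketch that reads a few rows from an in-memory string. The name `sample_csv` and the header line `id,dataset,x,y` are assumptions for illustration; the three data rows are copied from the quartet data.

```python
import io

import pandas as pd

# Hypothetical stand-in for quartet.csv (assumed header, three real rows).
sample_csv = io.StringIO(
    "id,dataset,x,y\n"
    "0,I,10.0,8.04\n"
    "1,I,8.0,6.95\n"
    "2,I,13.0,7.58\n"
)

# index_col="id" makes the "id" column the row index instead of a data column.
sample = pd.read_csv(sample_csv, index_col="id")

print(sample.index.name)     # name of the row index
print(list(sample.columns))  # the remaining data columns
```

Without `index_col`, pandas would keep `id` as an ordinary column and add its own 0-based index.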

In [3]:
# There are 4 groups (I through IV), each made up of x and y values
print(quartet)

   dataset     x      y
id                     
0        I  10.0   8.04
1        I   8.0   6.95
2        I  13.0   7.58
3        I   9.0   8.81
4        I  11.0   8.33
5        I  14.0   9.96
6        I   6.0   7.24
7        I   4.0   4.26
8        I  12.0  10.84
9        I   7.0   4.82
10       I   5.0   5.68
11      II  10.0   9.14
12      II   8.0   8.14
13      II  13.0   8.74
14      II   9.0   8.77
15      II  11.0   9.26
16      II  14.0   8.10
17      II   6.0   6.13
18      II   4.0   3.10
19      II  12.0   9.13
20      II   7.0   7.26
21      II   5.0   4.74
22     III  10.0   7.46
23     III   8.0   6.77
24     III  13.0  12.74
25     III   9.0   7.11
26     III  11.0   7.81
27     III  14.0   8.84
28     III   6.0   6.08
29     III   4.0   5.39
30     III  12.0   8.15
31     III   7.0   6.42
32     III   5.0   5.73
33      IV   8.0   6.58
34      IV   8.0   5.76
35      IV   8.0   7.71
36      IV   8.0   8.84
37      IV   8.0   8.47
38      IV   8.0   7.04
39      IV   8.0

## Descriptive Statistics
- The `.describe()` method provides descriptive statistics for pandas DataFrames
- `count` is the number of non-missing values
- `mean` is the average of the values
- `std` is the standard deviation, which represents the typical distance from the mean to an observation in the data
- Think of standard deviation as a measure of how spread out the data is
- `min` and `max` are the minimum and maximum values
- `25%`, `50%`, and `75%` are quartiles (`50%` is the median)
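As a rough illustration of what `mean` and `std` measure, the sketch below computes both by hand for the x values of dataset I (copied from the printed data) and compares them with the pandas built-ins. The variable names are illustrative, and it relies on pandas reporting the sample standard deviation (dividing by n - 1).

```python
import numpy as np
import pandas as pd

# x values of dataset I, copied from the printed quartet data.
x = pd.Series([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])

# Mean: the sum of the values divided by how many there are.
mean_x = x.sum() / len(x)

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by (n - 1).
std_x = np.sqrt(((x - mean_x) ** 2).sum() / (len(x) - 1))

print(mean_x, std_x)
print(x.mean(), x.std())  # the pandas built-ins should agree with the manual values
```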

In [4]:
# Let's look at the entire dataset
quartet.describe()

| | x | y |
|---|---|---|
| count | 44.0 | 44.0 |
| mean | 9.0 | 7.500682 |
| std | 3.198837 | 1.958925 |
| min | 4.0 | 3.1 |
| 25% | 7.0 | 6.1175 |
| 50% | 8.0 | 7.52 |
| 75% | 11.0 | 8.7475 |
| max | 19.0 | 12.74 |


## Let's group by the "dataset" column to compare

In [5]:
# Let's look at the mean and the standard deviation of each dataset in the quartet
# The mean is the average value
# The standard deviation is the typical distance from the mean to an observation (how spread out the data is)
quartet.groupby('dataset').agg(["mean", "std"])

| dataset | x mean | x std | y mean | y std |
|---|---|---|---|---|
| I | 9.0 | 3.316625 | 7.500909 | 2.031568 |
| II | 9.0 | 3.316625 | 7.500909 | 2.031657 |
| III | 9.0 | 3.316625 | 7.5 | 2.030424 |
| IV | 9.0 | 3.316625 | 7.500909 | 2.030579 |
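The same grouping can be tried without the CSV file. This is a minimal sketch that rebuilds only datasets I and II from the printed rows (the names `subset` and `summary` are illustrative) and shows how to pick a single statistic out of the two-level column labels that `.agg` returns.

```python
import pandas as pd

# Datasets I and II only, copied from the printed quartet data.
x_values = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
subset = pd.DataFrame({
    "dataset": ["I"] * 11 + ["II"] * 11,
    "x": x_values * 2,
    "y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68,
          9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
})

# groupby splits the rows by label; agg computes each statistic per group and column.
summary = subset.groupby("dataset").agg(["mean", "std"])

# Columns are labeled by (column, statistic) pairs.
print(summary[("y", "mean")])
print(summary.loc["II", ("y", "std")])
```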


## Looking at the means and standard deviations above, what hypothesis do you have?
- Looks like each dataset is pretty similar, right?
- With x values ranging from 4 to 19 and y values from about 3 to 13, the per-group means and standard deviations look nearly identical, right?

## Let's See! (Visualize the data)

In [6]:
# Once you have compared the descriptive statistics of each of the datasets in the quartet,
# uncomment the next line and run this cell to visualize all 4 datasets next to each other!
# sns.relplot(x='x', y='y', col='dataset', data=quartet)
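A numeric check can complement the plot. The sketch below is one possible way to do it, using only datasets I and II copied from the printed data; the names `y_one`, `y_two`, and `gap` are illustrative. It lines up the two datasets by x value and measures how far apart their y values are, even though their means match.

```python
import pandas as pd

# Datasets I and II share the same x values, so x can serve as the index.
x_values = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y_one = pd.Series(
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    index=x_values,
)
y_two = pd.Series(
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    index=x_values,
)

# The two means are essentially the same...
print(y_one.mean(), y_two.mean())

# ...but the y values at the same x can differ a lot.
gap = (y_one - y_two).abs()
print(gap.max(), gap.idxmax())  # largest gap and the x value where it occurs
```

Matching summary statistics do not guarantee matching observations, which is what the plot makes visible at a glance.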

## Lessons learned
- Ask yourself: what did you gain from this exercise?