<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px;">

# Intro to Pandas 2

***

Week 2 | Lesson 2.1

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Perform boolean indexing on dataframes
- Perform math functions using pandas.Series functions

### LESSON GUIDE
| TIMING  | TYPE  | TOPIC  |
|:-:|---|---|
| 15 min  | [Introduction](#introduction)   | Series and DataFrame data types |
| 25 min  | [Demo / Guided Practice](#demo)  | pd.Series  |
| 25 min  | [Demo / Guided Practice](#demo)  | Boolean indexing  |
| 20 min  | [Independent Practice](#ind-practice)  |   |
| 5 min  | [Conclusion](#conclusion)  |  |

---

## Partner or group - What is a DataFrame and how is it useful? (3-5 mins)

- A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. 

![](http://image.slidesharecdn.com/2013-11-14-20enterthematrix-131207071455-phpapp02/95/enter-the-matrix-10-638.jpg)

Remember this?

_note: A series is a fundamental building block inside of a DataFrame.  A perfect time to draw a dataframe._

<a name="Series and DataFrame data types"></a>
## Introduction: Series and DataFrame data types (10 mins)

- Series is a one-dimensional labeled array capable of holding any data type (integers, strings, 
floating point numbers, Python objects, etc.). The axis labels are collectively referred to as 
the index. The basic method to create a Series is to call:

```Python
s = pd.Series(data, index=index)
```

- Here, data can be many different things:
    - a Python dict
    - an ndarray
    - a scalar value (like 5)
    - a list or tuple
    - another Series

- The passed index is a list of _row axis_ labels. 

Generally speaking, the index parameter is *optional*.  If you don't specify one, Pandas will select appropriate defaults for you.
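
As a rough illustration of the input types listed above, the sketch below builds a Series from a dict, an ndarray, a scalar, and a list. The variable names and values are made up for this example.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From an ndarray with an explicit index (one label per value).
s_array = pd.Series(np.array([10.0, 20.0, 30.0]), index=["x", "y", "z"])

# From a scalar: the value is repeated once per index label.
s_scalar = pd.Series(5, index=["a", "b", "c"])

# From a list with no index: pandas falls back to integer labels 0, 1, 2.
s_list = pd.Series([7, 8, 9])

print(s_dict.index.tolist())
print(s_list.index.tolist())
```

Note that the scalar case only makes sense with an index, since the index tells pandas how many times to repeat the value.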

### Valid Types of Input

- Like Series, DataFrame accepts many different kinds of input:
    - Dict of 1D ndarrays, lists, dicts, or Series
    - 2-D numpy.ndarray
    - Structured or record ndarray
    - A Series
    - Another DataFrame

### The "Index" as Variable Labels

- Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. 

- If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame.
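
A minimal sketch of passing index and columns explicitly, assuming a small 2-D ndarray as input. The second DataFrame shows what happens with a dict of Series whose labels only partly overlap. All names and values here are hypothetical.

```python
import numpy as np
import pandas as pd

# A 2-D ndarray carries no labels, so we supply both axes ourselves.
values = np.arange(6).reshape(3, 2)
df_labeled = pd.DataFrame(values, index=["r1", "r2", "r3"], columns=["x", "y"])

# A dict of Series: the Series indexes are aligned (union of labels),
# and positions with no value are filled with NaN.
df_aligned = pd.DataFrame({
    "one": pd.Series([1.0, 2.0], index=["a", "b"]),
    "two": pd.Series([3.0, 4.0], index=["b", "c"]),
})

print(df_labeled)
print(df_aligned)
```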


### Without "Index" Parameter

If axis labels are not passed, they will be constructed from the input data based on common sense rules.

For example, without an index parameter, when the input is a dictionary whose string keys map to lists, Pandas uses the keys as column labels and numbers the rows 0, 1, 2, and so on.

[Pandas Documentation: Series and Dataframes](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)


### But how does this work!?

I'm glad that you asked!

Here is how a dictionary turns into a DataFrame, when there are **`string`** keys having **`list`** values.

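In [2]:
import pandas as pd  # needed before pd.DataFrame can be used

test_data = {
    "animal": ["doge", "cuthulu", "zebra"],
    "first": ["Pat", "Betsy", "Frank"],
    "last": ["Kat", "Reynolds", "Spinelli"]
}
pd.DataFrame(test_data)

|   | animal | first | last |
|---|---|---|---|
| 0 | doge | Pat | Kat |
| 1 | cuthulu | Betsy | Reynolds |
| 2 | zebra | Frank | Spinelli |

### How about with explicitly set indexes?

In [3]:
test_data = {
    "animal": ["doge", "cuthulu", "zebra"],
    "first": ["Pat", "Betsy", "Frank"],
    "last": ["Kat", "Reynolds", "Spinelli"]
}
pd.DataFrame(test_data, index=["Animal A", "Animal B", "Animal C"])

|   | animal | first | last |
|---|---|---|---|
| Animal A | doge | Pat | Kat |
| Animal B | cuthulu | Betsy | Reynolds |
| Animal C | zebra | Frank | Spinelli |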

Indexes can also be created when we aggregate data or create DataFrames from database resources.  If you are familiar with relational database conventions, you can think of the index as a "primary key" in a sense.  We will go over this in the future, but it's a helpful concept to relate to if you already know database systems.

The primer below is also worth a short read before we talk about joins later.

[RDBMS Key Primer](http://rdbms.opengrass.net/2_Database%20Design/2.1_TermsOfReference/2.1.2_Keys.html)

_note: Draw in the blanks exercise._

**Check:** What are some differences between Series and DataFrame?

<a name="pd.Series"></a>
## Demo / Guided Practice: pd.Series + pd.DataFrame (25 mins)

We will demonstrate a few ways "series" and DataFrame work.  Let's create a series and see what pandas.Series can do. 

[demo code](.../code/W2%20L2.1%20pandas.Series%20and%20Boolean%20indexing%20demo%20code.ipynb)

In [4]:
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(7), index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])  
s

a   -1.658622
b    0.388910
c   -0.669340
d   -0.563510
e   -1.085203
f    1.734919
g   -1.555393
dtype: float64

Now we have a series of 7 random numbers. Let's try out the same things we did with a data frame back in W2 L1.1. First, let's look at the series head!

In [5]:
pd.Series.head(s)

a   -1.658622
b    0.388910
c   -0.669340
d   -0.563510
e   -1.085203
dtype: float64

### Another way to do this:

In [6]:
s.head(5)

a   -1.658622
b    0.388910
c   -0.669340
d   -0.563510
e   -1.085203
dtype: float64

### Let's look at the tail. 

In [7]:
s.tail(3) # you can pass any N as the parameter to .head() or .tail()

e   -1.085203
f    1.734919
g   -1.555393
dtype: float64

### Summary Statistics


In [8]:
s.describe()

count    7.000000
mean    -0.486891
std      1.198280
min     -1.658622
25%     -1.320298
50%     -0.669340
75%     -0.087300
max      1.734919
dtype: float64
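
Each statistic reported by describe() is also available as its own Series method. A small sketch, using hypothetical values chosen so the results are easy to check by hand:

```python
import pandas as pd

# Hypothetical values, not the random series from the demo.
scores = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

print(scores.mean())               # arithmetic mean
print(scores.median())             # middle value (the 50% row in describe())
print(scores.std())                # sample standard deviation (ddof=1)
print(scores.min(), scores.max())

# Arithmetic on a Series is element-wise and keeps the index.
centered = scores - scores.mean()
print(centered)
```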

### Pandas Objects are Also Valid Dataframe Input

You can also put these summary stats in a dataframe to make them easier to read in a notebook.

In [9]:
pd.DataFrame(s.describe()).T  # notice the dataframe input is s.describe()!

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | -0.486891 | 1.19828 | -1.658622 | -1.320298 | -0.66934 | -0.0873 | 1.734919 |


## Slicing / Selecting Data

This deserves its own section, but as a quick overview of how it works, you have a few ways to select data from your DataFrame.

- Directly from the DataFrame, with a column label between the brackets: my_df['my_attribute']
- Through the DataFrame's indexer attributes:
  - df.loc[row_label_index]
  - df.iloc[row_int_index]
  - df.ix[mixed_index] (deprecated in pandas 0.20 and removed in pandas 1.0; use .loc or .iloc instead)
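
The selection styles above can be sketched side by side on a small DataFrame. The `pets` data and its labels are hypothetical, chosen only for illustration.

```python
import pandas as pd

pets = pd.DataFrame(
    {"species": ["cat", "dog", "fish"], "age": [3, 5, 1]},
    index=["tom", "rex", "nemo"],
)

ages = pets["age"]          # brackets with a column label -> one column as a Series
rex_row = pets.loc["rex"]   # .loc with a row label -> that row as a Series
first_row = pets.iloc[0]    # .iloc with an integer position -> the first row

print(ages)
print(rex_row)
print(first_row)
```

Plain brackets select columns, while .loc and .iloc select rows by label and by position respectively.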

### df.loc[row_label_index] - Select row by index label

Allowed inputs are:

 - A single label, e.g. 'a'
 - A list or array of labels ['a', 'b', 'c']
 - A slice object with labels 'a':'f', (note that contrary to usual python slices, both the start and the stop are included!)
 - A boolean array
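
Here is a short sketch of each allowed input, using a hypothetical `temps` series so it does not give away the checks further down:

```python
import pandas as pd

# Hypothetical series with labels that differ from the demo series `s`.
temps = pd.Series([61, 64, 70, 68], index=["mon", "tues", "wed", "thurs"])

print(temps.loc["tues"])                      # single label -> scalar value
print(temps.loc[["mon", "wed"]])              # list of labels -> Series
print(temps.loc["tues":"thurs"])              # label slice, both ends included
print(temps.loc[[True, False, True, False]])  # boolean list, one flag per row
```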

In [10]:
s.loc['c':]
# s.loc['b':]    # Everything from b onwards

c   -0.669340
d   -0.563510
e   -1.085203
f    1.734919
g   -1.555393
dtype: float64

In [11]:
s.loc['d':'f'] # Everything between d and f, including d and f

d   -0.563510
e   -1.085203
f    1.734919
dtype: float64

**Check**: How would you select just 'd'?

### Slicing By Rows

You can also slice rows by position.  This example returns the first three rows (positions 0 through 2).

In [12]:
s[:3]

a   -1.658622
b    0.388910
c   -0.669340
dtype: float64

_**Check:** How would you select just 'd'?_

### df.iloc[row_int_index] - select rows by int index

.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. 

Allowed inputs are:

- An integer e.g. 5
- A list or array of integers [4, 3, 0]
- A slice object with ints 1:7
- A boolean array

[See more at Selection by Position](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer)
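
The same kinds of input can be sketched for .iloc. The `temps` series here is hypothetical. Unlike label slices with .loc, an integer slice excludes its stop position.

```python
import pandas as pd

# Hypothetical series; .iloc ignores the labels and works by position.
temps = pd.Series([61, 64, 70, 68], index=["mon", "tues", "wed", "thurs"])

print(temps.iloc[0])                           # single integer -> first value
print(temps.iloc[[3, 0]])                      # list of positions, in the order given
print(temps.iloc[1:3])                         # positions 1 and 2 (stop is excluded)
print(temps.iloc[[False, True, False, True]])  # boolean list, one flag per row
```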

In [13]:
column_labels = ['a', 'b', 'c', 'd', 'e']
row_index = ['mon', 'tues', 'wed', 'thurs', 'fri', 'sat', 'sun']
df = pd.DataFrame(np.random.rand(7,5), index=row_index, columns=column_labels)
df.head(4)

|   | a | b | c | d | e |
|---|---|---|---|---|---|
| mon | 0.930682 | 0.529349 | 0.412982 | 0.928974 | 0.283231 |
| tues | 0.230997 | 0.439511 | 0.894314 | 0.089572 | 0.983133 |
| wed | 0.525121 | 0.285015 | 0.035506 | 0.033048 | 0.096367 |
| thurs | 0.951957 | 0.097154 | 0.246512 | 0.000575 | 0.22415 |


Selecting row "thurs" alone?

In [14]:
df.iloc[3]

a    0.951957
b    0.097154
c    0.246512
d    0.000575
e    0.224150
Name: thurs, dtype: float64

How about the columns labeled 'a', 'c', and 'e' **ONLY**?

In [15]:
# Try solution here

<a name="Boolean indexing"></a>
## Demo / Guided Practice: Boolean indexing (25 mins)

Another common operation is the use of boolean vectors to filter the data. The operators 
are: | for or, & for and, and ~ for not. Each comparison must be wrapped in parentheses, because 
these operators bind more tightly than comparisons such as < and >.

Let's create another series and use pandas to do some Boolean indexing. 

In an IPython notebook, type:
```Python
s = pd.Series(range(-3, 4))
s
```

Find the values that are > 0. 
```Python
s[s > 0]
```

Find the values that are < -1 or > 0.5
```Python
s[(s < -1) | (s > 0.5)]
```

Find the values that are not < 0.
```Python
s[~(s < 0)]
```

[boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges)
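
The same operators work on DataFrames, where a condition on one or more columns selects whole rows. A minimal sketch, assuming a hypothetical `demo_df` with two numeric columns:

```python
import pandas as pd

# Hypothetical data for illustration only.
demo_df = pd.DataFrame(
    {"a": [1, -2, 3, -4], "b": [10, 20, 30, 40]},
    index=["w", "x", "y", "z"],
)

positive_a = demo_df[demo_df["a"] > 0]                       # rows where a is positive
both = demo_df[(demo_df["a"] > 0) & (demo_df["b"] > 10)]     # and: both conditions hold
either = demo_df[(demo_df["a"] < -3) | (demo_df["b"] < 15)]  # or: at least one holds
not_positive = demo_df[~(demo_df["a"] > 0)]                  # not: invert the mask

print(positive_a)
print(both)
print(either)
print(not_positive)
```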

**Check:** How would you find all the numbers that are < 2? 

<a name="ind-practice"></a>
## Independent Practice: (25 minutes)
- Create a series
- Look at the head, tail, and summary stats
- Select series values by index
  - Single value
  - Multiple value
- Using Boolean indexing, find values that are less than another value
- Using Boolean indexing, find values that are greater than another value
- Using Boolean indexing, find values that are less than one value and greater than another

Bonus:  Create a DataFrame with at least 5 rows

- Slice for certain rows
  - Single row by label and index
  - Multiple row range by label and index
  - Single row series + multi-value range
      - labeled index
      - int index

<a name="conclusion"></a>
## Conclusion (5 mins)
We very briefly used data frames in W2 L1.1. In this lesson we learned more about them and also
about series. What are some differences between series and dataframes? 