# Introduction to pandas

## Meet pandas

### Welcome

Notes:
* pandas: panel + data
* Panel data: data that is multidimensional, involving measurements over time.
* pandas provides fast, flexible, and expressive data structures designed to make working with relational or labeled data easy and intuitive.
* A fundamental, high-level building block for practical, real-world data analysis in Python.

Data frame: 2D data structure

### Meet Series

Series:
* A one-dimensional array with additional labels that allow you to access specific values.
* Conceptually, a Python dictionary wrapped around a NumPy array.
    * Specify data type (dtype) of NumPy array
    * Labels = keys
* Single dimensional container object.
* Ordered, typed, indexable.
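
To make the "dictionary wrapped around a NumPy array" idea concrete, here is a minimal sketch. The lap times and the variable names are made up for illustration; only `pd.Series`, `.index`, and `.to_numpy()` are pandas itself.

```python
import numpy as np
import pandas as pd

# Hypothetical data: lap times keyed by racer name
lap_times = pd.Series(
    [61.5, 59.2, 63.8],
    index=['mario', 'peach', 'toad'],  # labels act like dictionary keys
    dtype='float64',                   # dtype of the underlying array
)

labels = list(lap_times.index)   # the "keys", in order
values = lap_times.to_numpy()    # the values as a NumPy array
peach_time = lap_times['peach']  # dictionary-style lookup by label
```

The labels keep their order, and every value shares the one dtype chosen for the array.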

### Creating a Series

#### Code Challenge

In [10]:
# Just like how NumPy is almost always abbreviated as np
import numpy as np
# pandas is usually shortened to pd
import pandas as pd

In [2]:
pd.Series([1, 2, 3])

0    1
1    2
2    3
dtype: int64

In [3]:
pd.Series(3, index=['mario', 'peach', 'toad'])

mario    3
peach    3
toad     3
dtype: int64

#### Quiz
* Labels are autogenerated (0, 1, 2, ...) when no index is specified.
* A scalar value is repeated for every label specified in the index.

### Accessing a Series

#### Code Challenge

In [4]:
sample = {
    'neptune': 2.793,
    'earth': 92.96,
    'uranus': 1.784,
    'jupiter': 483.8,
}
distances = pd.Series(sample)
distances.loc['earth':'jupiter']

earth       92.960
uranus       1.784
jupiter    483.800
dtype: float64

#### Quiz
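* The `loc` indexer slices by label and is inclusive (the end label is part of the result).
* The `iloc` indexer slices by position, like standard list slicing; it is exclusive (excludes the end of the range).
* Labels are exposed as attributes on the Series (if they are valid Python identifiers), as in `distances.earth`.
* Positional indexing works like a list, including negative indexes to get the last value. Prefer `iloc` for this: plain `distances[-1]` on a labeled Series relies on a positional fallback that recent pandas versions deprecate.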

In [5]:
'pluto' in distances

False

In [6]:
distances.earth

92.96

In [7]:
distances.iloc[0:2]

neptune     2.793
earth      92.960
dtype: float64

In [8]:
distances.iloc[-1]

483.8
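
As a side-by-side sketch of the inclusive/exclusive difference, the snippet below slices the same `distances` Series once by label and once by position. The variable names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

distances = pd.Series({
    'neptune': 2.793,
    'earth': 92.96,
    'uranus': 1.784,
    'jupiter': 483.8,
})

# Label-based slice: the end label 'uranus' is included
by_label = distances.loc['earth':'uranus']

# Position-based slice: stops before position 2, so only 'earth'
by_position = distances.iloc[1:2]
```

Both slices start at `'earth'` (position 1), but only the `loc` version includes the end of the range.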

### Vectorization and Broadcasting Review

#### Vectorization in NumPy
NumPy provides a vectorized `add` function, also available through the `+` operator, which removes the need to loop through each value to add arrays together.

In [15]:
np.array([1, 2, 3]) + np.array([4, 5, 6])

array([5, 7, 9])
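
For comparison, this sketch computes the same sum with an explicit Python loop and with the vectorized call. The names `looped` and `vectorized` are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Explicit loop: pair up the elements and add them one at a time
looped = np.array([x + y for x, y in zip(a, b)])

# Vectorized: the same element-wise sum with no Python-level loop
vectorized = np.add(a, b)  # equivalent to a + b
```

Both produce `array([5, 7, 9])`. The vectorized form is shorter and pushes the loop down into NumPy's compiled code.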

#### Broadcasting in NumPy
Scalar values can be broadcast across an array; the scalar behaves as if it were stretched into an equal-sized array filled with that value (all 1's in the example below).

In [16]:
conference_counts = np.array([4, 5, 10, 8, 15])
# Broadcast a scalar value
conference_counts + 1

array([ 5,  6, 11,  9, 16])
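
The "equal-sized array" picture can be spelled out explicitly. In this sketch, `np.ones_like` builds the array of 1's by hand so the result can be compared with the broadcast version; the names `ones`, `explicit`, and `broadcast` are illustrative.

```python
import numpy as np

conference_counts = np.array([4, 5, 10, 8, 15])

# What broadcasting behaves like: an equal-sized array of 1's
ones = np.ones_like(conference_counts)
explicit = conference_counts + ones

# What you actually write: NumPy stretches the scalar for you
broadcast = conference_counts + 1
```

The two results are identical. Broadcasting gives the same answer without building the intermediate array yourself.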

### Series Vectorization and Broadcasting

#### Code Challenge

In [17]:
remaining = {
    'mario': 3,
    'peach': 2,
    'yoshi': 2,
}
completed = {
    'peach': 1,
    'bowser': 2,
}
remaining_laps = pd.Series(remaining)
completions = pd.Series(completed)
remaining_laps - completions

bowser    NaN
mario     NaN
peach     1.0
yoshi     NaN
dtype: float64

In [19]:
totals = {
    'mario': 135,
    'peach': 149,
    'yoshi': 122,
}
final = {
    'peach': 45,
    'mario': 63,
    'yoshi': 77,
}
total_laps = pd.Series(totals)
final_lap = pd.Series(final)
total_laps + final_lap

mario    198
peach    194
yoshi    199
dtype: int64

#### Quiz
* Vectorized math operations return a new Series with `np.nan` for any label that is missing from either side. You can "correct" this by using the vectorized `subtract` method with a `fill_value` of 0.
* When both Series share the same labels, the labels line up (regardless of order) and vectorization takes care of the element-to-element addition.
* Scalar values broadcast to every element.

In [20]:
coins = {
    'mario': 1500,
    'peach': 2200,
    'yoshi': 500,
}
total_coins = pd.Series(coins)
total_coins + 500

mario    2000
peach    2700
yoshi    1000
dtype: int64
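
The `fill_value` point from the quiz can be sketched with the lap data from the earlier code challenge. The name `adjusted` is illustrative.

```python
import pandas as pd

remaining_laps = pd.Series({'mario': 3, 'peach': 2, 'yoshi': 2})
completions = pd.Series({'peach': 1, 'bowser': 2})

# Labels missing from one side are treated as 0 instead of producing NaN
adjusted = remaining_laps.subtract(completions, fill_value=0)
```

With `fill_value=0`, `mario` and `yoshi` keep their remaining laps (3.0 and 2.0) and `peach` drops to 1.0. Note that `bowser`, who only appears in `completions`, comes out as 0 - 2 = -2.0, so check whether a default of 0 makes sense for your data.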

### Meet Data Frames
Imagine a data frame as a bunch of Series lined up next to each other, sharing the same index. It has rows and columns like a spreadsheet.
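
As a rough sketch of that picture, two Series from the earlier examples can be placed side by side to form a data frame. The column names `coins` and `laps` and the variable `racers` are chosen here for illustration.

```python
import pandas as pd

total_coins = pd.Series({'mario': 1500, 'peach': 2200, 'yoshi': 500})
total_laps = pd.Series({'mario': 135, 'peach': 149, 'yoshi': 122})

# Each Series becomes a column; the shared labels become the rows
racers = pd.DataFrame({'coins': total_coins, 'laps': total_laps})

coins_column = racers['coins']   # pulling out a column gives a Series back
peach_row = racers.loc['peach']  # a row is also a Series, labeled by column
```

Here `racers` has three rows (one per racer) and two columns.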