## pandas

The two workhorse data structures of pandas are *Series* and *DataFrame*.

A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy's) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data; pandas then supplies a default integer index running from 0 to N - 1.

In [5]:
import pandas as pd
from pandas import Series, DataFrame

a_series = Series([4, 7, -5, 3])

In [6]:
a_series

0    4
1    7
2   -5
3    3
dtype: int64

Often it will be desirable to create a Series with an index identifying each data point with a label.

In [7]:
b_series = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

b_series


d    4
b    7
a   -5
c    3
dtype: int64

In [8]:
b_series.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link.

In [9]:
import numpy as np

np.exp(b_series)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [10]:
b_series ** 2

d    16
b    49
a    25
c     9
dtype: int64
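The cells above cover math functions and exponentiation. The other two operations mentioned, filtering with a boolean array and scalar multiplication, can be sketched as follows. This is a minimal illustration that rebuilds *b_series* so it runs on its own; the names *positive* and *doubled* are introduced here only for the example.

```python
import pandas as pd

b_series = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Boolean filtering: keep only the entries whose value is positive.
positive = b_series[b_series > 0]

# Scalar multiplication: every value is doubled, labels stay attached.
doubled = b_series * 2

print(positive)
print(doubled)
```

In both results each value keeps the label it had in *b_series*.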

A Python dict can be converted into a Series; the dict's keys become the index labels.


In [11]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

Series(sdata)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

### Detecting Missing Data in pandas

Use the *isnull* and *notnull* functions in pandas to detect missing data. Both are available as top-level functions and as Series methods.

In [12]:
pd.isnull(a_series)

0    False
1    False
2    False
3    False
dtype: bool

In [13]:
a_series.isnull()

0    False
1    False
2    False
3    False
dtype: bool
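Because *a_series* has no missing values, both calls above return False everywhere. The sketch below uses a small made-up Series (the name *readings* and its values are assumptions for illustration) that does contain a NaN, so the masks are more informative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second reading is missing.
readings = pd.Series([1.5, np.nan, 3.0], index=['x', 'y', 'z'])

missing_mask = pd.isnull(readings)    # True where a value is missing
present_mask = pd.notnull(readings)   # the element-wise opposite

print(missing_mask)
print(readings[present_mask])         # keep only the non-missing entries
```

Indexing with *present_mask* is one simple way to drop the missing entries while keeping the labels of the rest.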

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations.
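As a rough illustration of this alignment, the sketch below adds two Series whose indexes only partly overlap. The names *s1* and *s2* and their values are made up for the example.

```python
import pandas as pd

# Hypothetical figures; the two Series share only some labels.
s1 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000})
s2 = pd.Series({'Texas': 1000, 'Oregon': 2000, 'Utah': 5000})

# Values are matched by label, not by position.
total = s1 + s2
print(total)
```

Labels found in both Series are summed. Labels found in only one of them ('Ohio' and 'Utah' here) appear in the result as NaN, which connects alignment back to missing-data detection.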


A Series’s index can be altered in place by assignment.

In [14]:
a_series

0    4
1    7
2   -5
3    3
dtype: int64

In [15]:
a_series.index = ['Bob', 'chok', 4, 'fol']

In [16]:
a_series

Bob     4
chok    7
4      -5
fol     3
dtype: int64
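One caveat worth sketching: the new index must have the same number of labels as the Series has values. The example below is a minimal illustration on a fresh Series (the names *s* and *error* are chosen here for the example).

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])
s.index = ['w', 'x', 'y', 'z']   # same length as the data: accepted

error = None
try:
    s.index = ['only', 'three', 'labels']   # 3 labels for 4 values
except ValueError as exc:
    error = exc
    print("rejected:", exc)

print(s.index)   # the earlier, valid index is still in place
```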

## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.
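The "dict of Series" picture can be made concrete with a small sketch. The Series names and row labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical columns, each a Series with the same row labels.
year = pd.Series([2000, 2001, 2002], index=['a', 'b', 'c'])
state = pd.Series(['Ohio', 'Ohio', 'Nevada'], index=['a', 'b', 'c'])

df = pd.DataFrame({'year': year, 'state': state})

# Selecting a column gives back a Series that carries the shared row index.
col = df['state']
print(type(col).__name__)
print(col.index.equals(df.index))
```

Each column has its own value type (integers and strings here), while the row index is common to all of them.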

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays.

In [17]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [19]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


Passing a sequence of names as *columns* arranges the DataFrame's columns in that order.


In [20]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


If a requested column is not present in the data, it still appears in the DataFrame, filled with NaN (missing values).

In [21]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                       index=['one', 'two', 'three', 'four',
                             'five', 'six'])

In [22]:
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
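The missing-data tools from the Series section apply to DataFrame columns as well. As a minimal sketch on a shortened, made-up version of the data (the name *small* is an assumption for this example), one can confirm that a column requested but absent from the data is entirely missing:

```python
import pandas as pd

data = {'state': ['Ohio', 'Nevada'], 'year': [2000, 2001], 'pop': [1.5, 2.4]}

# 'debt' is requested but is not a key of data, so it is filled with NaN.
small = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])

print(list(small.columns))
print(small['debt'].isnull().all())
```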
