# Chapter 5: Intro to pandas Data Structures
## Series
A `Series` is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of labels, called its *index*. Let's import NumPy and pandas and create some data objects.

In [21]:
# import pandas and NumPy
import pandas as pd
import numpy as np
import pprint as pp

In [23]:
# create list as input for series
lst = [num ** 2 for num in range(1, 64) if num % 2 == 0 and num % 3 == 0]

# print this list
pp.pprint(lst)

[36, 144, 324, 576, 900, 1296, 1764, 2304, 2916, 3600]


In [25]:
# define series
series_1 = pd.Series(lst)

# print out series
series_1

0      36
1     144
2     324
3     576
4     900
5    1296
6    1764
7    2304
8    2916
9    3600
dtype: int64

Now I'd like to get each component out of the `Series`: the `index` on the left and the array of values on the right. I'll do this using the `index` and `values` attributes.

In [26]:
# get the index
series_1.index

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [27]:
# get the values
series_1.values

array([  36,  144,  324,  576,  900, 1296, 1764, 2304, 2916, 3600])
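As a small sketch of how the two pieces fit together, a `Series` can be taken apart into its `index` and `values` and rebuilt from them. The names `squares` and `rebuilt` are illustrative and not part of the notebook. The index type that gets displayed depends on the pandas version; newer releases show a `RangeIndex` instead of the `Int64Index` above.

```python
import numpy as np
import pandas as pd

# same data as series_1 above (divisible by 2 and 3 means divisible by 6)
squares = pd.Series([num ** 2 for num in range(1, 64) if num % 6 == 0])

index = squares.index     # the labels (0 through 9 by default)
values = squares.values   # the underlying NumPy array

# putting the two pieces back together gives an equal Series
rebuilt = pd.Series(values, index=index)

print(type(values))
print(list(index))
print(rebuilt.equals(squares))
```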

In [28]:
# create a series and specify the index
series_2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# print
series_2

d    4
b    7
a   -5
c    3
dtype: int64

In [29]:
# Use the index to subset the series
series_2['a']

-5

In [30]:
# use the index to subset and reassign the value
series_2['d'] = 6

In [31]:
# one more subsetting example with a list of values
series_2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

### NumPy array operations on Series
Filtering with a boolean array, scalar multiplication, and applying a math function all preserve the `index`-`value` link.

In [32]:
# Boolean array
series_2[series_2 > 0]

d    6
b    7
c    3
dtype: int64

In [33]:
# scalar multiplication
series_2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [34]:
# math function
np.exp(series_2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
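The claim that these operations keep the `index`-`value` link can also be checked programmatically. This is a minimal sketch that rebuilds the same data under the illustrative name `s` and compares the resulting indexes.

```python
import numpy as np
import pandas as pd

s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]      # boolean filtering drops 'a' but keeps the other labels
doubled = s * 2          # scalar multiplication keeps every label
exponent = np.exp(s)     # a NumPy ufunc returns a Series with the same index

print(list(positive.index))
print(doubled.index.equals(s.index))
print(exponent.index.equals(s.index))
```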

Another way to think about a `Series` is as a fixed-length, ordered dict. It is a mapping of index values to data values. It can be substituted into many functions that expect a dict.

In [35]:
# testing membership
'b' in series_2

True

In [36]:
# membership example 2
'e' in series_2

False
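Beyond membership tests, a few other dict-style operations work on a `Series`. The sketch below uses `keys`, `items`, `get`, and `to_dict`; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

keys = list(s.keys())      # the index labels, in order
pairs = list(s.items())    # (label, value) pairs
missing = s.get('e', 0)    # default value when the label is absent

# convert to a plain Python dict
as_dict = s.to_dict()

print(keys)
print(missing)
print(as_dict)
```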

When you have a Python dict, you can create a Series from it by passing the dict.

In [37]:
# create the dictionary to be passed to Series
dict_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [38]:
# now pass the dictionary to Series
series_3 = pd.Series(dict_data)

# print it out
series_3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [41]:
# pass a new index to series_3
states = ['California', 'Ohio', 'Oregon', 'Utah', 'Texas']

# Series 4 using the dictionary and the new index (states)
series_4 = pd.Series(dict_data, index=states)

# print it out
series_4

California      NaN
Ohio          35000
Oregon        16000
Utah           5000
Texas         71000
dtype: float64

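Notice that California is inserted into the Series, but with no data: its value is `NaN`, and the dtype has changed from `int64` to `float64` to accommodate it. Now, let's look at a few ways to detect missing values: `isnull` and `notnull`, both top-level functions in the `pandas` library.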

In [42]:
# is the value Null?
pd.isnull(series_4)

California     True
Ohio          False
Oregon        False
Utah          False
Texas         False
dtype: bool

In [43]:
# is the value valid?
pd.notnull(series_4)

California    False
Ohio           True
Oregon         True
Utah           True
Texas          True
dtype: bool

In [44]:
# the Series also has these two library functions as instance methods
series_4.isnull()

California     True
Ohio          False
Oregon        False
Utah          False
Texas         False
dtype: bool
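One common use of these boolean results is as a mask. The sketch below counts the missing entries and keeps only the labels that have data; the names `n_missing` and `present` are illustrative.

```python
import pandas as pd

dict_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Utah', 'Texas']
s = pd.Series(dict_data, index=states)

n_missing = s.isnull().sum()   # each True counts as 1
present = s[s.notnull()]       # keep only the labels that have data

print(n_missing)
print(list(present.index))
```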

A nice feature of `Series` is that it automatically aligns differently indexed data in arithmetic operations.

In [52]:
pp.pprint(series_3)
print('\n')
pp.pprint(series_4)

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64


California      NaN
Ohio          35000
Oregon        16000
Utah           5000
Texas         71000
dtype: float64


In [53]:
# notice above the different indexing, let's add them
series_3 + series_4

California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah           10000
dtype: float64
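To make the alignment rule more concrete, here is a sketch with a second, hypothetical `Series` `b` that shares only one label with `a`. With the `+` operator, any label missing from either side yields `NaN`; the `add` method with `fill_value=0` treats the missing side as zero instead.

```python
import pandas as pd

a = pd.Series({'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000, 'Utah': 5000})
b = pd.Series({'California': 1000, 'Ohio': 35000})   # hypothetical data

plain = a + b                      # labels present on only one side become NaN
filled = a.add(b, fill_value=0)    # the missing side is treated as 0

print(plain)
print(filled)
```

Note that `fill_value` only helps when a label is absent from one operand; a value that is already `NaN` on both sides stays `NaN`.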

Both the `Series` object and its index have a `name` attribute.

In [54]:
# give series name
series_4.name = 'population'

# give index name
series_4.index.name = 'state'

In [55]:
series_4

state
California      NaN
Ohio          35000
Oregon        16000
Utah           5000
Texas         71000
Name: population, dtype: float64

You can also alter the `index` in place by assignment.

In [56]:
# reassign the index
series_1.index = ['Bob', 'Jack', 'Ryan', 'Laura', 'Dave'] * 2

In [57]:
# print it out
series_1

Bob        36
Jack      144
Ryan      324
Laura     576
Dave      900
Bob      1296
Jack     1764
Ryan     2304
Laura    2916
Dave     3600
dtype: int64
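Two consequences of the assignment above are worth a quick sketch: index labels may repeat, and the new index must supply exactly one label per value. The shortened data and the names `bobs` and `mismatch_error` are illustrative.

```python
import pandas as pd

s = pd.Series([36, 144, 324, 576])
s.index = ['Bob', 'Jack'] * 2     # labels are allowed to repeat

bobs = s['Bob']                   # a duplicated label returns a Series, not a scalar

# the new index must have one label per value
try:
    s.index = ['only', 'three', 'labels']
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(bobs)
print(type(mismatch_error).__name__)
```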

## DataFrame
A `DataFrame` represents a tabular (spreadsheet-like) data structure containing an ordered collection of columns, each of which can be of a different data type. A `DataFrame` has both a row index and a column index. One thing I'd like to understand in more detail is the difference between R's `data.frame` and pandas' `DataFrame`.

There are many ways to create a `DataFrame`. One of the most common is to pass a dict of equal-length lists or NumPy arrays.

In [58]:
# this dictionary will be passed to the DataFrame constructor (dict of equal-length lists)
df_data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
           'year': [2000, 2001, 2002, 2001, 2002],
           'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

In [59]:
# create the dataframe
frame = pd.DataFrame(df_data)

# print the dataframe
frame

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002


In [60]:
# specify the order of columns
pd.DataFrame(df_data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9


In [64]:
# pass a column that doesn't exist in the underlying data set (get NaN's)
frame_2 = pd.DataFrame(df_data, columns=['year', 'state', 'pop', 'debt'],
                       index=['one', 'two', 'three', 'four', 'five'])

# print
frame_2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN


Get the column labels from the `DataFrame` object:

In [66]:
# use column attribute
frame_2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')
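The examples so far build the frame from a dict of lists. As a sketch of the NumPy-array variant mentioned earlier, the block below passes arrays as the dict values and uses `columns` to fix the order and request a column with no data. The name `array_data` is illustrative.

```python
import numpy as np
import pandas as pd

# same data as df_data above, but with NumPy arrays as the dict values
array_data = {'state': np.array(['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada']),
              'year': np.array([2000, 2001, 2002, 2001, 2002]),
              'pop': np.array([1.5, 1.7, 3.6, 2.4, 2.9])}

# 'debt' is not a key in array_data, so that column is filled with NaN
frame = pd.DataFrame(array_data, columns=['year', 'state', 'pop', 'debt'])

print(list(frame.columns))
print(frame['debt'].isnull().all())
print(frame.shape)
```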

A column in a `DataFrame` can be retrieved as a `Series` either by dict-like notation or by attribute:

In [67]:
# dict-like notation
frame_2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [68]:
# attribute
frame_2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

Two things to note here: **1)** the resulting `Series` has the same `index` as the `DataFrame`, and **2)** its `name` attribute has been set to the column name.

In [70]:
# use loc by label: gets the row labelled 'three' (the 3rd row)
frame_2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
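A minimal sketch of row selection using `.loc` (by label) and `.iloc` (by integer position), the two indexers that current pandas releases provide. The small frame here is a cut-down, hypothetical stand-in for `frame_2`.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'state': ['Ohio', 'Ohio', 'Ohio'],
                      'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

by_label = frame.loc['three']    # the row labelled 'three'
by_position = frame.iloc[2]      # the third row, counted from zero

print(by_label.name)
print(by_label.equals(by_position))
```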

Columns can be modified by reassignment. Take the empty `debt` column for example.

In [71]:
# assign a scalar value to debt
frame_2['debt'] = 16.5

# print
frame_2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5


In [73]:
# or use an array
frame_2['debt'] = np.arange(5.)

# print
frame_2

       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4


When you assign an array or list to a column, its length must match the length of the `DataFrame`. Assigning a `Series` is a little more flexible: the `Series` is conformed to the `DataFrame` by matching on the index, and missing values are inserted wherever an index label is not present in the `Series`.

In [74]:
# create a series to be used for the debt column
debt_series = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

# assignment
frame_2['debt'] = debt_series

# print
frame_2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
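The difference between the two assignment styles can be sketched on a smaller, hypothetical frame: a list of the wrong length is rejected, while a `Series` is aligned on the index regardless of its length.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# a list must supply exactly one value per row
try:
    frame['debt'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

# a Series is aligned on the index: labels it lacks become NaN,
# and labels the frame lacks ('ten') are ignored
frame['debt'] = pd.Series([-1.2, -1.7], index=['two', 'ten'])

print(type(length_error).__name__)
print(frame)
```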


Assigning to a column that does not exist creates a new column. To delete a column, use the **del** keyword.

In [75]:
# assign a column that doesn't exist
frame_2['eastern'] = frame_2.state == 'Ohio'

# print
frame_2

       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False


In [76]:
# now delete
del frame_2['eastern']

# get columns to make sure it worked
frame_2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')
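One caveat about creating columns, shown as a sketch on a small hypothetical frame: only bracket assignment creates a column. Attribute assignment with a new name sets a plain Python attribute on the object instead, and pandas issues a warning about it.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# bracket assignment creates the column
frame['eastern'] = frame['state'] == 'Ohio'
created = 'eastern' in frame.columns

# attribute assignment with a new name does not create a column
with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # silence the pandas warning for this demo
    frame.western = frame['state'] == 'Nevada'
not_created = 'western' not in frame.columns

del frame['eastern']                  # remove the column again

print(created, not_created)
print(list(frame.columns))
```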

**Note:** <br>
The column returned when indexing a `DataFrame` is a *view* on the underlying data, not a copy. Thus, any in-place modifications to the `Series` will be reflected in the `DataFrame`. The column can be explicitly copied using the `Series`'s `copy` method.
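A minimal sketch of the explicit copy, using a hypothetical one-column frame: changes made to the copy never reach the `DataFrame`. Whether an uncopied column propagates changes depends on the pandas version (newer releases with copy-on-write enabled do not), so calling `copy` is the way to get the same behaviour everywhere.

```python
import pandas as pd

frame = pd.DataFrame({'debt': [0.0, 1.0, 2.0]}, index=['one', 'two', 'three'])

debt_copy = frame['debt'].copy()   # independent copy of the column
debt_copy['one'] = 99.0            # modify only the copy

print(debt_copy['one'])
print(frame.loc['one', 'debt'])
```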

Another common form of input data is a nested dictionary: a dictionary of dictionaries.

In [77]:
# create nested dict to be passed to DataFrame
population = {'Nevada': {2001: 2.4, 2002: 2.9},
              'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# pass to Dataframe
frame_3 = pd.DataFrame(population)

# print
frame_3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6


In [78]:
# let's transpose the result
frame_3.T

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


The keys in the inner dicts are unioned and sorted to form the index in the final result. However, if you explicitly pass an index, it is used as given instead:

In [79]:
pd.DataFrame(population, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


In [80]:
# let's try a dict of Series
# first input: take a look first
frame_3['Ohio'][:-1] # [:-1] selects everything up to, but not including, the last element

2000    1.5
2001    1.7
Name: Ohio, dtype: float64

In [82]:
# second input: take a look first
frame_3['Nevada'][:2] # same rows as [:-1] above, but only because there are exactly three elements

2000    NaN
2001    2.4
Name: Nevada, dtype: float64
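The two inputs above can then be combined into a dict of `Series` and passed to `DataFrame`. This is a sketch under a few assumptions: the name `frame_4` is illustrative, `sort_index` is called explicitly because the row order produced from nested dicts can differ between pandas versions, and `.iloc` is used so the slices are unambiguously positional.

```python
import pandas as pd

population = {'Nevada': {2001: 2.4, 2002: 2.9},
              'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# sort the rows so that 2000 comes first regardless of pandas version
frame_3 = pd.DataFrame(population).sort_index()

# dict of Series: each value is a slice of an existing column
series_data = {'Ohio': frame_3['Ohio'].iloc[:-1],      # years 2000 and 2001
               'Nevada': frame_3['Nevada'].iloc[:2]}   # years 2000 and 2001

frame_4 = pd.DataFrame(series_data)

print(frame_4)
```

As with the nested dicts, the indexes of the individual `Series` are combined to form the row index of the result.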