## Getting Started with Pandas

`pandas` adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.

The biggest difference is that `pandas` is designed for working with tabular or heterogeneous data, whereas NumPy is best suited for working with homogeneous numerical array data.

In [1]:
import pandas as pd
from pandas import Series, DataFrame

# 5.1 Introduction to pandas Data Structures 

## Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

In [2]:
# The simplest series is formed from only an array of data 
obj = pd.Series([1, 2, 3, 4])
obj

0    1
1    2
2    3
3    4
dtype: int64

In [3]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [4]:
obj.values

array([1, 2, 3, 4])

Since we didn't assign an index, one was assigned automatically. It's often desirable to create a Series with an index identifying each data point with a **label**:

In [5]:
obj2 = pd.Series([4,7,-5,3], index=['a', 'b', 'c', 'd'])
obj2

a    4
b    7
c   -5
d    3
dtype: int64

In [6]:
obj2.index

Index(['a', 'b', 'c', 'd'], dtype='object')

Note: you can use labels to index the series

In [7]:
obj2['b']

np.int64(7)

Using NumPy functions or NumPy-like operations, such as filtering with a `boolean` array, scalar multiplication, or applying math functions, will preserve the index-value link. 

In [9]:
obj2[obj2 > 0]

a    4
b    7
d    3
dtype: int64

In [10]:
obj*2

0    2
1    4
2    6
3    8
dtype: int64

In [11]:
obj2*2

a     8
b    14
c   -10
d     6
dtype: int64

In [13]:
import numpy as np
np.exp(obj2)

a      54.598150
b    1096.633158
c       0.006738
d      20.085537
dtype: float64

**Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.** 

In [14]:
# use it as if it were a dictionary 
'b' in obj2

True

In [15]:
'e' in obj2

False

As you might have guessed, you can pass a Python dict when creating a Series:


In [16]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah':5000}
obj3=pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Notice that the index labels are the dict keys from the source, in the order they were passed. This can be overridden by passing the dict keys in the order we want them to appear in the resulting Series. Labels with no matching dict key get `NaN`, and dict keys that are not listed are dropped.

In [17]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [18]:
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

`NaN` (not a number) is used in pandas to mark missing or `NA` values. If we want to detect **missing** or **NA** values in our data, we can use the `isnull` and `notnull` functions in pandas.

In [19]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [20]:
pd.notnull(obj4) 

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

 These are pandas functions as well as Series instance methods. 

In [21]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

A useful feature of Series is that they are automatically aligned by index label in arithmetic operations.

In [22]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [23]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [24]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

This is similar to a `join` operation. 
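When `NaN` for non-overlapping labels is not wanted, the `add` method with `fill_value` is one alternative to plain `+`. The sketch below rebuilds the same data; the names `a`, `b`, `plain`, and `filled` are only illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
a = pd.Series(sdata)
b = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Plain + gives NaN wherever a label is missing on either side
plain = a + b

# add() with fill_value substitutes 0 when only ONE side is missing;
# a label missing on both sides (California here) stays NaN
filled = a.add(b, fill_value=0)
print(filled)
```

Here `Utah` keeps its value from `a` instead of becoming `NaN`, while `California` is still missing because neither Series has a value for it.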

Both the Series object **and** its index have a `name` attribute, which integrates with other key areas of pandas functionality.

In [26]:
obj4.name = 'population'

In [27]:
obj4.index.name = 'state'

In [28]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [29]:
obj4.index

Index(['California', 'Ohio', 'Oregon', 'Texas'], dtype='object', name='state')
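One place these names show up is when a Series is turned into a DataFrame. A small sketch, assuming a Series built like `obj4` above (without the missing California entry, for brevity):

```python
import pandas as pd

obj4 = pd.Series({'Ohio': 35000.0, 'Oregon': 16000.0, 'Texas': 71000.0})
obj4.name = 'population'
obj4.index.name = 'state'

# reset_index() moves the index into a regular column; the index name
# and the Series name are reused as the column labels
as_frame = obj4.reset_index()
print(list(as_frame.columns))
```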

 A Series' index can be altered in-place by assignment:

In [30]:
obj

0    1
1    2
2    3
3    4
dtype: int64

In [31]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      1
Steve    2
Jeff     3
Ryan     4
dtype: int64

## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, bool, etc.). The DF has both a row and column index; it can be thought of as a dict of Series all sharing the same index.
- There are many ways to construct one; one of the most common is from a dict of **equal-length lists or NumPy arrays**.

In [33]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [35]:
frame = pd.DataFrame(data) 

In [36]:
frame

,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


Just like with `Series`, we did not assign an index, so one is assigned automatically. The columns are placed in the order in which they were passed to `pd.DataFrame`.

In [37]:
# select the first 5 rows with the .head() method 
frame.head()

,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


When creating the dataframe, use the `columns` argument to specify the order. 

In [38]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


 If you pass a column that isn't contained in the dict (input data) it will appear with missing values in the result

In [39]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [40]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [41]:
frame2.index

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

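A column in a DF can be retrieved as a Series either by dict-like notation or as an attribute. Attribute access only works when the column name is a valid Python identifier that does not clash with a DataFrame method name (for example, `frame2.pop` refers to the `pop` method, not the column).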

In [42]:
frame2['year']

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [43]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

Note, the returned Series have the same index as the DF, and their name attribute has been appropriately set. 

In [44]:
frame2.year.name

'year'

**Rows** can also be retrieved by label with the special `loc` attribute (or by integer position with `iloc`).

In [45]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
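To make the label/position distinction concrete, here is a minimal sketch on a small stand-in frame (not the `frame2` above): `loc` takes the row label, `iloc` takes the integer position.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002], 'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

by_label = frame.loc['three']    # select the row labelled 'three'
by_position = frame.iloc[2]      # select the third row (0-based position 2)

# Both selections return the same row as a Series
print(by_label.equals(by_position))
```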

Columns can be modified by assignment. Let's use this to assign values to the `debt` column. 

In [48]:
frame2.debt = 16.5
frame2

,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5
six,2003,Nevada,3.2,16.5


In [52]:
frame2['debt'] = np.arange(6.)
frame2['debt']

one      0.0
two      1.0
three    2.0
four     3.0
five     4.0
six      5.0
Name: debt, dtype: float64

When assigning lists or arrays to a column, the value's **length** must match the length of the DF.
- If you assign a Series, its labels will be realigned exactly to the DF's index, inserting missing values into any holes.
  - So in this case it does not have to be the same length; you just have to provide the labels of where you want the data to go.

In [54]:
val = pd.Series([-1.2, -1.5, -1.7,], index=['two', 'four', 'five'])

frame2['debt']=val

frame2

,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


- Assigning a column that doesn't exist will create a new column.
- New columns cannot be created with the `frame.column` attribute syntax; use `frame['column']` instead.
- The `del` keyword will delete columns, as with a dict.
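The difference between bracket and attribute assignment can be checked on a small stand-in frame. This is a sketch; the column names `eastern` and `western` are made up for illustration.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# Bracket assignment creates a real column
frame['eastern'] = frame['state'] == 'Ohio'

# Attribute assignment to a new name only sets a plain Python attribute;
# pandas emits a warning about it, which is silenced here
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.western = [False, True]

print(list(frame.columns))   # 'western' is not among the columns
```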

In [58]:
# add a bool column to the DataFrame using the method above
frame2['bool_column'] = frame2.state == 'Nevada'
frame2

,year,state,pop,debt,bool_column
one,2000,Ohio,1.5,,False
two,2001,Ohio,1.7,-1.2,False
three,2002,Ohio,3.6,,False
four,2001,Nevada,2.4,-1.5,True
five,2002,Nevada,2.9,-1.7,True
six,2003,Nevada,3.2,,True


In [59]:
# delete that column 
del frame2['bool_column']
frame2

,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


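The column returned from indexing a DF is a **view** on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DF. (This describes the classic behavior; with Copy-on-Write, which newer pandas versions adopt, such modifications no longer propagate.) The column can be explicitly copied with the Series' `copy` method.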
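A short sketch of the explicit-copy route, using a small stand-in frame. Changes made to the copy leave the DataFrame untouched regardless of pandas version.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# Take an independent copy of the column, then modify the copy
col = frame['pop'].copy()
col['one'] = 99.0

# The original DataFrame still holds the old value
print(frame.loc['one', 'pop'])
```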

In [61]:
# nested dict of dicts. 
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}
       }
frame3 = pd.DataFrame(pop)
frame3

,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


If the nested dict is passed to the DF, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices.

Transposing syntax is similar to NumPy

In [62]:
frame3.T

,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


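The keys in the inner dicts (of a nested dict input) are combined to form the index in the result. In the output above they appear in order of first appearance (2001, 2002, 2000) rather than sorted; some older pandas versions sorted them. Passing an explicit index fixes the order: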

In [63]:
pd.DataFrame(pop, index=[2000, 2001, 2002])

,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6
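If a sorted row index is wanted without listing the labels by hand, `sort_index` is one option. A minimal sketch with the same nested dict:

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Build the frame, then sort the rows by their index labels
frame3 = pd.DataFrame(pop).sort_index()
print(list(frame3.index))
```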


In [69]:
# dicts of series are treated in much the same way. 
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}

In [70]:
pd.DataFrame(pdata)

,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


If the DF's `index` and `columns` have their `name` attributes set, these will also be displayed.

In [71]:
frame3.index.name='year'; frame3.columns.name='state'

In [72]:
frame3

state,Nevada,Ohio
year,,
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


**As with `Series`, the `values` attribute returns the data contained in the DF as a 2D ndarray.**
- Remember it's an attribute, so you don't need to attach `()` to the end of it.

In [74]:
frame3['Nevada'].values

array([2.4, 2.9, nan])

 If the DF's columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns: 

In [75]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
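The same rule can be seen on two tiny stand-in frames: purely numeric columns are upcast to a common numeric dtype, while a mix of numbers and strings falls back to `object`.

```python
import pandas as pd

numeric = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]})
mixed = pd.DataFrame({'year': [2000, 2001], 'state': ['Ohio', 'Nevada']})

# int64 and float64 columns share float64 as a common dtype
print(numeric.values.dtype)

# numbers and strings have no common numeric dtype, so object is used
print(mixed.values.dtype)
```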

Possible data inputs to DF constructor. 
- **2D ndarray:** A matrix of data, passing optional row and column labels
- **dict of arrays, lists, or tuples:** Each sequence becomes a column in the DF; all sequences must be the same length
- **NumPy structured/record array:** Treated as the "dict of arrays" case
- **dict of Series:** Each value becomes a column; indexes from each Series are unioned together to form the result's row index if no explicit index is passed
- **dict of dicts:** Each inner dict becomes a column; keys are unioned to form the row index as in the "dict of Series" case.
- **List of dicts or Series:** Each item becomes a row in the DF; union of dict keys or Series indexes become the DF's column labels.
- **List of lists or tuples:** 2D ndarray case
- **Another DF:** DF's indexes are used unless different ones are passed.
- **NumPy MaskedArray:** Like the 2D ndarray case except masked values become NA/missing in the DF result.
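Two of the inputs listed above, sketched with made-up data: a 2D ndarray with explicit labels, and a list of dicts where each dict becomes a row.

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(3, 2),
                          index=['a', 'b', 'c'], columns=['x', 'y'])

# List of dicts: each dict is a row, and the union of keys forms the columns;
# keys missing from a given dict become NaN in that row
records = [{'x': 1, 'y': 2}, {'x': 3, 'z': 4}]
from_records = pd.DataFrame(records)

print(from_array.shape, list(from_records.columns))
```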

## Index Objects 

An Index object is responsible for holding the axis labels and other metadata (like the axis name or names).

In [78]:
# arrays or sequences of labels passed to the `index` arg are converted to an Index
obj = pd.Series(range(3), index=['a','b','c'])
obj.index

Index(['a', 'b', 'c'], dtype='object')

Index objects can be indexed. 

In [79]:
obj.index[0]

'a'

Index objects are **immutable**. This makes it safer to share objects among data structures. 

In [80]:
labels = pd.Index(np.arange(3))
labels

Index([0, 1, 2], dtype='int64')

In [81]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2.index

Index([0, 1, 2], dtype='int64')

In [84]:
obj2.index is labels 

True
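Immutability can be checked directly: item assignment on an Index raises a `TypeError`. A minimal sketch (the `raised` flag is only there to record the outcome):

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])

raised = False
try:
    labels[1] = 'd'          # attempt to modify the Index in place
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

print(list(labels))          # labels are unchanged
```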

An Index also behaves like a fixed-size set, but unlike Python sets, it can contain duplicate labels.

In [87]:
1 in obj2.index

True

 Selections with duplicate labels will select all occurrences of that label. 
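A small sketch of that behavior, using a made-up Series with a repeated label:

```python
import pandas as pd

dup = pd.Series([10, 20, 30], index=['a', 'a', 'b'])

both = dup['a']     # duplicated label: a Series with every matching row
single = dup['b']   # unique label: a single scalar value

print(dup.index.is_unique)
print(both)
```

Because the return type depends on whether the label is duplicated, `index.is_unique` is a handy check before relying on scalar results.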

Index objects have a number of methods and properties for set logic
- `append`
- `difference`
- `intersection`
- un
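A brief sketch of the first three methods in the list, on two made-up Index objects:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

appended = left.append(right)        # concatenation; duplicates are kept
only_left = left.difference(right)   # labels in left that are not in right
common = left.intersection(right)    # labels present in both

print(list(appended), list(only_left), list(common))
```

Each call returns a new Index, which is consistent with Index objects being immutable.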