## Getting Started with Pandas

`pandas` adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without `for` loops.

The biggest difference is that `pandas` is designed for working with tabular or heterogeneous data, whereas NumPy is best suited to homogeneous numerical array data.
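To make the homogeneous/heterogeneous distinction concrete, here is a minimal sketch. The column names and values are toy data invented for illustration, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixed inputs are coerced to one type.
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype)  # a Unicode string dtype: every element became a string

# A DataFrame stores each column separately, so dtypes can differ per column.
# The column names and values below are hypothetical toy data.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "count": [1, 2, 3],
    "flag": [True, False, True],
})
print(df.dtypes)

# Array-style computation applies to a whole column at once, with no explicit loop.
df["count_doubled"] = df["count"] * 2
print(df)
```

The last step is the "no `for` loops" style in practice: the multiplication is applied element-wise to the entire column.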

In [1]:
import pandas as pd
from pandas import Series, DataFrame

# 5.1 Introduction to pandas Data Structures 

## Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

In [2]:
# The simplest Series is formed from only an array of data
obj = pd.Series([1, 2, 3, 4])
obj

0    1
1    2
2    3
3    4
dtype: int64

In [3]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [4]:
obj.values

array([1, 2, 3, 4])

Since we didn't assign an index, a default one made up of the integers 0 through 3 was created automatically. It's often desirable to create a Series with an index identifying each data point with a **label**:

In [5]:
obj2 = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
obj2

a    4
b    7
c   -5
d    3
dtype: int64

In [6]:
obj2.index

Index(['a', 'b', 'c', 'd'], dtype='object')

Note: you can use labels in the index to select values from the Series:

In [7]:
obj2['b']

np.int64(7)
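Label-based indexing also works for assignment and for selecting several elements at once. The sketch below rebuilds `obj2` so it runs on its own; the new value `6` is arbitrary.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])

# Assign to a single element through its label.
obj2["d"] = 6

# Pass a list of labels to select several elements, in the order given.
subset = obj2[["c", "a", "d"]]
print(subset)
```

Note that `["c", "a", "d"]` is a list of labels, so the result follows that order rather than the original order of the Series.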

Using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication, or applying math functions, preserves the index-value link.

In [9]:
obj2[obj2 > 0]

a    4
b    7
d    3
dtype: int64

In [10]:
obj * 2

0    2
1    4
2    6
3    8
dtype: int64

In [11]:
obj2 * 2

a     8
b    14
c   -10
d     6
dtype: int64

In [13]:
import numpy as np
np.exp(obj2)

a      54.598150
b    1096.633158
c       0.006738
d      20.085537
dtype: float64

**Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.** 

In [14]:
# Membership tests work on the index labels, as with dict keys
'b' in obj2

True

In [15]:
'e' in obj2

False
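One consequence of the dict analogy is worth spelling out: `in` checks the index labels, not the data values. The following sketch, which rebuilds `obj2` so it is self-contained, shows the difference along with two other dict-style operations, `items()` and `to_dict()`.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])

# `in` tests index labels, like dict keys; it does not look at the values.
print("b" in obj2)  # "b" is a label
print(7 in obj2)    # 7 is a value, not a label

# To test for a value, check the underlying values explicitly.
print(7 in obj2.values)

# Iterate over (label, value) pairs, as with dict.items().
for label, value in obj2.items():
    print(label, value)

# Convert back to a plain Python dict.
as_dict = obj2.to_dict()
print(as_dict)
```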

As you might have guessed, you can create a Series from a Python dict by passing it to the constructor:

In [16]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Notice that the index labels are the dict keys from the source, in the order they were inserted. You can override this by passing an `index` argument listing the labels in the order you want them to appear in the resulting Series.

In [17]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [18]:
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

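Three values from `sdata` were placed at their matching labels, but `sdata` has no entry for 'California', so that value appears as `NaN` (not a number), which pandas uses to mark missing or NA values. 'Utah' is excluded because it is not in `states`. To detect missing data, use the `isnull` and `notnull` functions in pandas.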

In [19]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [20]:
pd.notnull(obj4) 

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also provides these as instance methods:

In [21]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
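Because these checks return a Boolean Series, they combine naturally with the Boolean filtering shown earlier. The sketch below rebuilds `obj4` so it runs on its own, and uses `isna` and `notna`, which pandas provides as aliases of `isnull` and `notnull`.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

# isna() is an alias of isnull(); both return a Boolean Series.
mask = obj4.isna()

# True counts as 1, so summing the mask counts the missing entries.
n_missing = int(mask.sum())
print(n_missing)

# Use the complementary mask to keep only the entries that have data.
present = obj4[obj4.notna()]
print(present)
```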

A useful feature of Series is that it automatically aligns by index label in arithmetic operations.

In [22]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [23]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [24]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

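This alignment is similar to a database `join` operation: the result's index is the union of both indexes, and a label found in only one of the Series yields `NaN`.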
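To see that alignment goes by label rather than by position, consider two small Series whose labels overlap but appear in a different order. The data is arbitrary toy data. The `fill_value=0` choice in the second step is an assumption made for illustration; whether a missing entry should count as zero depends on what the data means.

```python
import pandas as pd

# Toy data; the labels and values are arbitrary.
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["c", "a", "d"])

# Values are matched by label, not by position: "a" pairs 1 with 20,
# and "c" pairs 3 with 10. "b" and "d" exist on one side only, so they are NaN.
total = left + right
print(total)

# Series.add accepts fill_value to substitute for a label missing on one side.
# Treating missing entries as 0 is an assumption for this example.
filled = left.add(right, fill_value=0)
print(filled)
```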

Both the Series object **and** its index have a `name` attribute, which integrates with other key areas of pandas functionality.

In [26]:
obj4.name = 'population'

In [27]:
obj4.index.name = 'state'

In [28]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [29]:
obj4.index

Index(['California', 'Ohio', 'Oregon', 'Texas'], dtype='object', name='state')
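As one example of how the names carry over, converting the Series to a DataFrame uses them as column labels. This sketch rebuilds `obj4` so it is self-contained.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4.name = "population"
obj4.index.name = "state"

# to_frame() uses the Series name as the column label
# and keeps the named index.
as_frame = obj4.to_frame()
print(as_frame)

# reset_index() moves the index into a regular column,
# labeled with the index name.
flat = obj4.reset_index()
print(flat.columns.tolist())
```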