# Operating on Data in Pandas

# Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects. Let's start by defining a simple Series and DataFrame on which to demonstrate this:

In [1]:
import pandas as pd
import numpy as np

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int64

In [2]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4
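To make the index-preservation claim concrete, here is a minimal sketch that applies NumPy ufuncs to a Series and a DataFrame holding the same values as above. The names `s`, `frame`, `exp_s`, and `sin_frame` are illustrative and not part of the original text. The result of each ufunc should be another Pandas object with its labels intact.

```python
import numpy as np
import pandas as pd

# Same values as the Series and DataFrame shown above, written out explicitly.
s = pd.Series([6, 3, 7, 4])
frame = pd.DataFrame([[6, 9, 2, 6], [7, 4, 3, 7], [7, 2, 5, 4]],
                     columns=list("ABCD"))

# Apply NumPy ufuncs directly to the Pandas objects.
exp_s = np.exp(s)
sin_frame = np.sin(frame * np.pi / 4)

# The results keep the original row and column labels.
print(type(exp_s).__name__, list(exp_s.index))
print(type(sin_frame).__name__, list(sin_frame.columns))
```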


# Index Alignment

## Index alignment in Series

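For binary operations on two Series objects, Pandas aligns the indices while performing the operation, which is convenient when working with incomplete data. As an example, suppose we are combining two data sources and have only the top three US states by area and the top three US states by population: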

In [3]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

# Divide these to compute the population density:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

The resulting Series contains the union of the indices of the two inputs. Any item for which one or the other does not have an entry is marked with NaN, or "Not a Number," which is how Pandas marks missing data.
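One way to check which labels end up in the result is to compute the union of the two indices explicitly with `Index.union`. The sketch below rebuilds the two Series so that it runs on its own; the names `density` and `labels` are illustrative.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

density = population / area

# Labels present in either input.
labels = area.index.union(population.index)
print(sorted(labels))

# The result carries exactly those labels; entries missing from
# either input are NaN.
print(set(density.index) == set(labels))
print(sorted(density[density.isnull()].index))
```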

This index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:

In [4]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
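If NaN is not the desired behavior, the method form of the operator accepts a `fill_value` argument that stands in for entries missing from one side. A minimal sketch using `Series.add` on the same `A` and `B` (the name `result` is illustrative):

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Treat an entry missing from either Series as 0 before adding.
result = A.add(B, fill_value=0)
print(result)
# 0    2.0
# 1    5.0
# 2    9.0
# 3    5.0
# dtype: float64
```

The result is still floating point, but labels 0 and 3 now hold the value from whichever Series had an entry.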

## Index alignment in DataFrame

A similar type of alignment takes place for both columns and indices when performing operations on DataFrames:


In [5]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

   A   B
0  1  11
1  5   1


In [6]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6


In [7]:
A + B

      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN
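As with Series, `DataFrame.add` accepts a `fill_value` to use in place of entries missing from one operand. The sketch below rebuilds the two frames with explicit values so that it does not depend on the random state. Using the mean of `A` as the fill is an arbitrary choice for illustration, and the names `fill` and `result` are not from the original.

```python
import pandas as pd

A = pd.DataFrame([[1, 11], [5, 1]], columns=list("AB"))
B = pd.DataFrame([[4, 0, 9], [5, 8, 0], [9, 2, 6]], columns=list("BAC"))

# Mean of all values in A: (1 + 11 + 5 + 1) / 4 = 4.5
fill = A.to_numpy().mean()

# Cells missing from one frame are replaced by `fill` before adding.
result = A.add(B, fill_value=fill)
print(result)
#       A     B     C
# 0   1.0  15.0  13.5
# 1  13.0   6.0   4.5
# 2   6.5  13.5  10.5
```

Columns are matched by label, so the different column order in `B` does not affect the result.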


# Missing Data in Pandas

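The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types. Pandas therefore represents missing data with sentinel values, reusing two existing null values: the Python None object and the floating-point NaN.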

The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because it is a Python object, None cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects):

In [8]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)
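One practical consequence of the object dtype is that aggregations fall back to Python-level operations, and an operation between an integer and None is undefined. The sketch below shows the resulting error; the `outcome` variable is only there to capture the message.

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])
print(vals1.dtype)  # object

# Summing an object array adds the elements as Python objects,
# so the int + None step raises a TypeError.
try:
    vals1.sum()
    outcome = "no error"
except TypeError as exc:
    outcome = f"TypeError: {exc}"

print(outcome)
```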

## NaN: Missing numerical data

The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [9]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype


dtype('float64')

In [10]:
1 + np.nan

nan

In [11]:
0 * np.nan

nan

Because NaN propagates through arithmetic in this way, aggregates over the values are well defined (i.e., they don't result in an error) but not always useful:

In [13]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)
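NumPy also provides NaN-aware versions of these aggregations, which skip the missing values instead of propagating them. A short sketch using `np.nansum`, `np.nanmin`, and `np.nanmax` on the same array (the names `total`, `smallest`, and `largest` are illustrative):

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])

# NaN-aware aggregations ignore the missing entry.
total = np.nansum(vals2)
smallest = np.nanmin(vals2)
largest = np.nanmax(vals2)

print(total, smallest, largest)  # 8.0 1.0 4.0
```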


## NaN and None in Pandas

NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:


In [14]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to np.nan, it will automatically be upcast to a floating-point type to accommodate the NA:
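The upcasting described above can be observed without modifying a Series in place. The sketch below does not assign into the Series, because how in-place assignment of a missing value to an integer Series behaves depends on the Pandas version. It relies on construction and `reindex` instead, and the names `ints`, `with_none`, and `grown` are illustrative.

```python
import pandas as pd

ints = pd.Series([1, 2], dtype="int64")
print(ints.dtype)  # int64

# A missing value supplied at construction time forces a float result.
with_none = pd.Series([1, 2, None])
print(with_none.dtype)  # float64

# Reindexing to a label that does not exist introduces NaN,
# which also upcasts the integers to floats.
grown = ints.reindex([0, 1, 2])
print(grown.dtype)  # float64
```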

# Detecting null values

Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one returns a Boolean mask over the data. For example:

In [15]:
data = pd.Series([1, np.nan, 'hello', None])

data.isnull()

0    False
1     True
2    False
3     True
dtype: bool
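Because these methods return Boolean masks, they can be used directly to index a Series. The sketch below keeps only the non-null entries using `notnull()`; the names `mask` and `kept` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

# notnull() is the element-wise inverse of isnull().
mask = data.notnull()
print(mask.tolist())  # [True, False, True, False]

# Use the mask to select only the entries that are present.
kept = data[mask]
print(list(kept.index), kept.tolist())  # [0, 2] [1, 'hello']
```

The original index labels (0 and 2) are preserved in the filtered result.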