# Pandas Tutorial
Explore data analysis with Python. Pandas DataFrames make manipulating your data easy, from selecting or replacing columns and indices to reshaping your data. Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things. The DataFrame is one of these structures.

__Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.__

# Key Features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.<br>
- Tools for loading data into in-memory data objects from different file formats.<br>
- Data alignment and integrated handling of missing data.<br>
- Reshaping and pivoting of data sets.<br>
- Label-based slicing, indexing and subsetting of large data sets.<br>
- Columns from a data structure can be deleted or inserted.<br>
- Group by data for aggregation and transformations.<br>
- High performance merging and joining of data.<br>
- Time Series functionality.<br>

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are. Thus, before we go any further, let's introduce these three fundamental Pandas data structures: the Series, DataFrame, and Index.

We will start our code sessions with the standard NumPy and Pandas imports:

In [1]:
import numpy as np
import pandas as pd

### Series
A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one. It can be created from a list or array as follows:

In [7]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the __values__ and __index__ attributes.<br>__The values are simply a familiar NumPy array:__

In [8]:
data.values

array([ 0.25,  0.5 ,  0.75,  1.  ])

The __index__ is an array-like object of type __pd.Index__, which we'll discuss in more detail momentarily.

In [9]:
data.index

RangeIndex(start=0, stop=4, step=1)

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [10]:
data[1]

0.5

In [11]:
data[1:3]

1    0.50
2    0.75
dtype: float64

As we will see, though, the Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates. For instance, it can hold values of mixed types:

In [12]:
# create a Series with an arbitrary list
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'])
s

0                7
1       Heisenberg
2             3.14
3      -1789710578
4    Happy Eating!
dtype: object

__Series as generalized NumPy array__<br>
From what we've seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type. If we wish, we can use strings as an index:

In [13]:
# you can specify an index to use when creating the Series.
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'],
              index=['A', 'Z', 'C', 'Y', 'E'])
s

A                7
Z       Heisenberg
C             3.14
Y      -1789710578
E    Happy Eating!
dtype: object

In [14]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [15]:
print(data['b'])
#We can even use non-contiguous or non-sequential indices:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
print(data)
print(data.index)

0.5
2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
Int64Index([2, 5, 3, 7], dtype='int64')


In [16]:
data[5]

0.5

## Series as specialized dictionary
In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary:

In [17]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

In [18]:
population['California']

38332521

In [19]:
#Unlike a dictionary, though, the Series also supports array-style operations such as slicing:
population['California':'Illinois']

California    38332521
Florida       19552860
Illinois      12882135
dtype: int64

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [20]:
2 in data

True

In [21]:
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [22]:
data.keys()

Int64Index([2, 5, 3, 7], dtype='int64')

In [23]:
list(data.items())

[(2, 0.25), (5, 0.5), (3, 0.75), (7, 1.0)]

Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:

In [24]:
data[1] = 1.25
data

2    0.25
5    0.50
3    0.75
7    1.00
1    1.25
dtype: float64

In [25]:
data[3] = 1.5
data

2    0.25
5    0.50
3    1.50
7    1.00
1    1.25
dtype: float64

This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.

A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, slices, masking, and fancy indexing. Examples of these are as follows:

In [26]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [27]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

Among these, slicing may be the source of the most confusion. Notice that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.

In [28]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [29]:
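# fancy indexing (every label listed must exist in the index;
# recent Pandas versions raise a KeyError for a missing label such as 'e')
data[['a', 'd']]

a    0.25
d    1.00
dtype: float64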

In [30]:
a = np.array([[1,2],[3,4]])
print(a)
print(a[:,1])

[[1 2]
 [3 4]]
[2 4]


__Indexers: loc, iloc__ <br>
These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.

In [40]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [41]:
# explicit index when indexing
data[1]

'a'

In [42]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

First, the __loc__ attribute allows indexing and slicing that always references the explicit index:

In [49]:
data.loc[:3]  # slice up to and including label 3

1    a
3    b
dtype: object

The __iloc__ attribute allows indexing and slicing that always references the implicit Python-style index:

In [48]:
data.iloc[:3]  # slice the first three rows

1    a
3    b
5    c
dtype: object
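
The contrast between the two indexers is easiest to see side by side. The following is a minimal sketch that reuses the same integer-labelled data under the hypothetical name letters; it assumes nothing beyond the loc and iloc behaviour described above.

```python
import pandas as pd

# Hypothetical example data: integer labels that do not match positions.
letters = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = letters.loc[1]       # label 1
by_position = letters.iloc[1]   # second element (position 1)

label_slice = letters.loc[1:3]      # labels 1 through 3, end label included
position_slice = letters.iloc[1:3]  # positions 1 and 2, end position excluded

print(by_label, by_position)
print(list(label_slice.index), list(position_slice.index))
```

Using loc or iloc explicitly makes the intent of the code visible to a reader, which is usually worth the few extra characters when the index holds integers.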

## Constructing Series objects

We've already seen a few ways of constructing a Pandas Series from scratch; all of them are some version of the following:

__pd.Series(data, index=index)__<br>
where index is an optional argument, and data can be one of many entities.<br>
For example, data can be a list or NumPy array, in which case index defaults to an integer sequence:
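
As a small sketch of that default, the snippet below builds one Series from a list and one from a NumPy array. The variable names and values are illustrative only.

```python
import numpy as np
import pandas as pd

from_list = pd.Series([2, 4, 6])
from_array = pd.Series(np.array([2.5, 4.5]))

# With no index argument, both receive integer labels starting at 0.
print(list(from_list.index))
print(list(from_array.index))
```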

In [52]:
# data can be a scalar, which is repeated to fill the specified index:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [53]:
# data can be a dictionary, in which case the index defaults to the dictionary keys
# (sorted in the Pandas version shown here; newer versions keep insertion order):
pd.Series({2:'a', 1:'b', 3:'c'})

1    b
2    a
3    c
dtype: object

In [55]:
# In each case, the index can be explicitly set if a different result is preferred:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
# Notice that in this case, the Series is populated only with the explicitly identified keys.

3    c
2    a
dtype: object

# The Pandas DataFrame Object
The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We'll now take a look at each of these perspectives.

## DataFrame as a generalized NumPy array
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.<br><br>
To demonstrate this, let's first construct a new Series listing the area of each of the five states discussed in the previous section:

In [56]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64

In [59]:
print(population)

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64


Now that we have this along with the __population__ Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [60]:

states = pd.DataFrame({'population': population,
                       'area': area})
states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


Like the __Series__ object, the __DataFrame__ has an __index__ attribute that gives access to the index labels:

In [61]:
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

In [62]:
#Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:
states.columns

Index(['area', 'population'], dtype='object')

Thus the __DataFrame__ can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

Similarly, we can also think of a DataFrame as a specialization of a dictionary. 
Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.
For example, asking for the 'area' attribute returns the Series object containing the areas we saw earlier:

In [63]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

## Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways. Here we'll give several examples.

### From a single Series object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:

In [66]:
pd.DataFrame(population, columns=['population'])


            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193


### From a list of dicts
Any list of dictionaries can be made into a DataFrame. We'll use a simple list comprehension to create some data:

In [67]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [68]:
# Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., "not a number") values:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


### From a dictionary of Series objects
As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:

In [70]:
pd.DataFrame({'population': population,
              'area': area})

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


### From a two-dimensional NumPy array
Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:

In [71]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.411613  0.038766
b  0.177241  0.122676
c  0.102195  0.415941
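
The effect of omitting the labels can be checked directly. This is a minimal sketch that uses fixed values from np.arange instead of random numbers so the result is reproducible; the names plain and labeled are illustrative.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

plain = pd.DataFrame(values)          # no labels given
labeled = pd.DataFrame(values,
                       columns=['foo', 'bar'],
                       index=['a', 'b', 'c'])

# Without labels, both rows and columns fall back to integers.
print(list(plain.index), list(plain.columns))
# With labels, a cell can be addressed by row and column name.
print(labeled.loc['b', 'bar'])
```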


# The Pandas Index Object
We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Those views have some interesting consequences in the operations available on Index objects. As a simple example, let's construct an Index from a list of integers:

In [73]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

## Index as immutable array
The Index in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices:

In [74]:
ind[1]

3

In [75]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

Index objects also have many of the attributes familiar from NumPy arrays:

In [76]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


One difference between Index objects and NumPy arrays is that indices are immutable–that is, they cannot be modified via the normal means:

In [77]:
ind[1] = 0

TypeError: Index does not support mutable operations

This immutability makes it safer to share indices between multiple DataFrames and arrays, without the potential for side effects from inadvertent index modification.
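
In practice, a changed index is built as a new object rather than edited in place. The sketch below catches the error from the failed assignment and then constructs a replacement. The list comprehension is just one possible way to do that, not a recommended idiom.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0
    modified = True
except TypeError as err:
    modified = False
    print("assignment rejected:", err)

# Build a new Index with the value 3 replaced by 0; ind itself is untouched.
new_ind = pd.Index([0 if v == 3 else v for v in ind])
print(list(ind), list(new_ind))
```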



## Index as ordered set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [79]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [80]:
indA & indB  # intersection

Int64Index([3, 5, 7], dtype='int64')

In [81]:
indA | indB  # union


Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [82]:
indA ^ indB  # symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')

In [83]:
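# These operations may also be accessed via object methods, which is the safer spelling:
# recent Pandas versions no longer treat &, | and ^ on Index objects as set operations.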
indA.intersection(indB)

Int64Index([3, 5, 7], dtype='int64')
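
For completeness, here is a short sketch of the method forms of all the set operations on the same two indices. The result names are illustrative.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # in both
either = indA.union(indB)                   # in at least one
only_one = indA.symmetric_difference(indB)  # in exactly one
only_a = indA.difference(indB)              # in indA but not in indB

for name, result in [('intersection', common), ('union', either),
                     ('symmetric difference', only_one),
                     ('difference', only_a)]:
    print(name, sorted(result))
```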

In [85]:
states.area is states['area']

True

In [86]:
states.values

array([[  423967, 38332521],
       [  170312, 19552860],
       [  149995, 12882135],
       [  141297, 19651127],
       [  695662, 26448193]], dtype=int64)

With this picture in mind, many familiar array-like operations can be done on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:

In [87]:
states.T

            California   Florida  Illinois  New York     Texas
area            423967    170312    149995    141297    695662
population    38332521  19552860  12882135  19651127  26448193


When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:

In [89]:
states.values[0]

array([  423967, 38332521], dtype=int64)
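
The difference between the two views can be sketched on a reduced, two-state version of the table. The smaller DataFrame is an assumption made to keep the example self-contained.

```python
import pandas as pd

states = pd.DataFrame({'area': [423967, 695662],
                       'population': [38332521, 26448193]},
                      index=['California', 'Texas'])

first_row = states.values[0]         # array view: a single index is a row
area_column = states['area']         # dictionary view: a single key is a column
row_by_label = states.loc['Texas']   # explicit row access by label

print(first_row)
print(list(area_column))
print(row_by_label['population'])
```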

In [91]:
# Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects.
# Let's start by defining a simple Series and DataFrame on which to demonstrate this:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int32

In [92]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4


If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object with the indices preserved:

In [94]:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [95]:
np.sin(df * np.pi / 4)

          A             B         C             D
0 -1.000000  7.071068e-01  1.000000 -1.000000e+00
1 -0.707107  1.224647e-16  0.707107 -7.071068e-01
2 -0.707107  1.000000e+00 -0.707107  1.224647e-16
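
Index preservation is easier to see with a non-default index. The following sketch applies np.exp to a small Series with string labels; the data and names are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.0, 1.0, 2.0], index=['x', 'y', 'z'])
result = np.exp(ser)

# The result is still a Series and carries the original labels.
print(type(result).__name__, list(result.index))
```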


# Index alignment in Series
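For binary operations on two Series, Pandas aligns the indices of the operands before computing the result. As an example, suppose we are combining two different data sources, and find only the top three US states by area and the top three US states by population: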

In [96]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

Let's see what happens when we divide these to compute the population density:

In [98]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

The resulting array contains the union of indices of the two input arrays, which could be determined using standard Python set arithmetic on these indices:

In [100]:
area.index.union(population.index)

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

Any item for which one or the other does not have an entry is marked with NaN, or "Not a Number," which is how Pandas marks missing data.

This index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:

In [101]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators. For example, calling A.add(B) is equivalent to calling A + B, but allows optional explicit specification of the fill value for any elements in A or B that might be missing:

In [102]:
A.add(B, fill_value=0)


0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
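
The same pattern applies to the other arithmetic methods. As a sketch, the snippet below compares the plain operator with add and mul, choosing a fill value that is neutral for each operation (0 for addition, 1 for multiplication). That choice is an assumption about what is wanted, not a rule.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

plain_sum = A + B                         # NaN where a label is on one side only
filled_sum = A.add(B, fill_value=0)       # missing side treated as 0
filled_product = A.mul(B, fill_value=1)   # missing side treated as 1

print(plain_sum.isna().sum())
print(filled_sum.tolist())
print(filled_product.tolist())
```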

# Index alignment in DataFrame
A similar type of alignment takes place for both columns and indices when performing operations on DataFrames:

In [104]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

   A   B
0  1  11
1  5   1


In [105]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6


In [106]:
A + B

      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN


Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted. As was the case with Series, we can use the associated object's arithmetic method and pass any desired fill_value to be used in place of missing entries. Here we'll fill with the mean of all values in A (computed by first stacking the rows of A):

In [107]:
fill = A.stack().mean()
A.add(B, fill_value=fill)

      A     B     C
0   1.0  15.0  13.5
1  13.0   6.0   4.5
2   6.5  13.5  10.5


# Ufuncs: Operations Between DataFrame and Series
When performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and one-dimensional NumPy array. Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:

In [109]:
A = rng.randint(10, size=(3, 4))
A

array([[3, 8, 2, 4],
       [2, 6, 4, 8],
       [6, 1, 3, 8]])

In [110]:
A - A[0]

array([[ 0,  0,  0,  0],
       [-1, -2,  2,  4],
       [ 3, -7,  1,  4]])

In [111]:
# In Pandas, the convention similarly operates row-wise by default:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1 -1 -2  2  4
2  3 -7  1  4


In [112]:
# If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the axis keyword
df.subtract(df['R'], axis=0)


   Q  R  S  T
0 -5  0 -6 -4
1 -4  0 -2  2
2  5  0  2  7
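
A common use of the two directions is centering data. The sketch below subtracts column means with the default alignment and row means with axis=0, using the same values as the array above. The names centered_columns and centered_rows are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[3, 8, 2, 4],
                            [2, 6, 4, 8],
                            [6, 1, 3, 8]]),
                  columns=list('QRST'))

# Default alignment: df.mean() is labelled by column, so each column
# has its own mean subtracted.
centered_columns = df - df.mean()

# axis=0: df.mean(axis=1) is labelled by row, so each row has its own
# mean subtracted.
centered_rows = df.sub(df.mean(axis=1), axis=0)

print(centered_columns.mean().round(12).tolist())
print(centered_rows.mean(axis=1).round(12).tolist())
```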


Note that these DataFrame/Series operations, like the operations discussed above, will automatically align indices between the two elements:

In [113]:
halfrow = df.iloc[0, ::2]
halfrow

Q    3
S    2
Name: 0, dtype: int32

In [114]:
df - halfrow


     Q   R    S   T
0  0.0 NaN  0.0 NaN
1 -1.0 NaN  2.0 NaN
2  3.0 NaN  1.0 NaN


This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the types of silly errors that might come up when working with heterogeneous and/or misaligned data in raw NumPy arrays.



In [117]:
# https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html
# https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view

# Pythonic missing data

In [31]:
vals1 = np.array([1, None, 3, 4])
vals1


array([1, None, 3, 4], dtype=object)

This dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.


The use of Python objects in an array also means that if you perform aggregations like sum() or min() across an array with a None value, you will generally get an error:

In [32]:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
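
To keep a script running past this failure, the error can be caught and the None entries skipped by hand. This is only a sketch of one workaround; the variable names are illustrative.

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])

try:
    total = vals1.sum()
except TypeError as err:
    total = None
    print("sum failed:", err)

# Summing only the entries that are not None avoids the error.
clean_total = sum(v for v in vals1 if v is not None)
print(clean_total)
```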

# NaN: Missing numerical data

The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [34]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

dtype('float64')

In [39]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

In [37]:
1 + np.nan

nan

In [38]:
0 *  np.nan

nan

NumPy does provide some special aggregations that will ignore these missing values:

In [40]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)
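
Two related details are worth a quick sketch: NaN does not compare equal to itself, so a mask has to be built with np.isnan rather than ==, and the nan-aware family includes np.nanmean as well.

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])

print(np.nan == np.nan)          # comparison with NaN is never true

mask = np.isnan(vals2)           # boolean mask of the missing entries
mean_ignoring_nan = np.nanmean(vals2)

print(mask.tolist())
print(mean_ignoring_nan)
```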

Keep in mind that __NaN__ is specifically a floating-point value; there is no equivalent __NaN__ value for integers, strings, or other types.

# NaN and None in Pandas
__NaN__ and __None__ both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [41]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to np.nan, it will automatically be upcast to a floating-point type to accommodate the NA:

In [42]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64

In [43]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64

Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a NaN value. (Newer Pandas releases also provide opt-in nullable integer dtypes such as Int64, which can hold a missing value without the cast to float; the example above uses the default NumPy-backed int64 dtype.)<br>

| Typeclass | Conversion When Storing NAs | NA Sentinel Value |
|-----------|-----------------------------|-------------------|
| __floating__ | No change | __np.nan__ |
| __object__ | No change | __None or np.nan__ |
| __integer__ | Cast to float64 | __np.nan__ |
| __boolean__ | Cast to object | __None or np.nan__ |

Keep in mind that in Pandas, string data is always stored with an object dtype.<br>
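
These upcasting rules can be observed by forcing a missing value into Series of different types. The sketch below uses reindex with an extra label as the way to introduce the gap. That is a choice made for this example, and any operation that creates a missing entry would do.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2])
bools = pd.Series([True, False])
floats = pd.Series([1.5, 2.5])

# Label 2 has no data in any of the Series, so reindexing adds a missing value.
with_gap = {name: s.reindex([0, 1, 2])
            for name, s in [('ints', ints), ('bools', bools), ('floats', floats)]}

for name, s in with_gap.items():
    print(name, '->', s.dtype)
```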

# Operating on Null Values
As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:

__isnull():__ Generate a boolean mask indicating missing values <br>
__notnull():__ Opposite of isnull()<br>
__dropna():__ Return a filtered version of the data<br>
__fillna():__ Return a copy of the data with missing values filled or imputed<br>

In [46]:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [47]:
data[data.notnull()]

0        1
2    hello
dtype: object
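
Because the mask is boolean, it can also be summed to count missing entries. Here is a brief sketch for a Series and for a small DataFrame; the DataFrame values are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])
n_missing = data.isnull().sum()      # True counts as 1
has_missing = data.isnull().any()

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
per_column = df.isnull().sum()       # one count per column

print(n_missing, has_missing)
print(per_column.tolist())
```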

### Dropping null values
In addition to the masking used before, there are the convenience methods __dropna()__ (which removes NA values) and __fillna()__ (which fills in NA values). For a Series, the result is straightforward:

In [49]:
data.dropna()

0        1
2    hello
dtype: object

In [51]:
# For a DataFrame, there are more options. Consider the following DataFrame:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so dropna() gives a number of options for a DataFrame.

By default, dropna() will drop all rows in which any null value is present:

In [52]:
df.dropna()

     0    1  2
1  2.0  3.0  5


Alternatively, you can drop NA values along a different axis; axis=1 (equivalently, axis='columns') drops all columns containing a null value:

In [53]:
df.dropna(axis='columns')

   2
0  2
1  5
2  6


In [54]:
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [55]:
df[3] = np.nan
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [57]:
df.dropna(axis='columns', how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
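
The how parameter decides whether a single null is enough to drop a row or column ('any', the default) or whether every value must be null ('all'). A minimal sketch on the same data, comparing the two settings:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan   # a column that is entirely null

any_dropped = df.dropna(axis='columns', how='any')   # drop if any value is null
all_dropped = df.dropna(axis='columns', how='all')   # drop only if all are null

print(list(any_dropped.columns))
print(list(all_dropped.columns))
```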


For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:

In [58]:
df.dropna(axis='rows', thresh=3)

     0    1  2   3
1  2.0  3.0  5 NaN


Here the first and last row have been dropped, because they contain only two non-null values.

# Filling null values
Sometimes rather than dropping NA values, you'd rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in-place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced.

In [59]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

We can fill NA entries with a single value, such as zero:

In [60]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
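
The fill value does not have to be a constant. As a sketch of the imputation and interpolation mentioned above, the snippet below fills with the mean of the known values and, separately, with linear interpolation between neighbours. Both are illustrative choices rather than recommendations.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Impute with the mean of the non-null values (1, 2 and 3).
mean_filled = data.fillna(data.mean())

# Linear interpolation between the neighbouring known values,
# treating the entries as equally spaced.
interpolated = data.interpolate()

print(mean_filled.tolist())
print(interpolated.tolist())
```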

We can specify a forward-fill to propagate the previous value forward:

In [62]:
# forward-fill (ffill() is the direct spelling; recent Pandas deprecates fillna(method='ffill'))
data.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [63]:
# Or we can specify a back-fill to propagate the next values backward
# (bfill() is the direct spelling of fillna(method='bfill')):
data.bfill()

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

For DataFrames, the options are similar, but we can also specify an axis along which the fills take place:

In [64]:
df.ffill(axis=1)

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0


Notice that if a previous value is not available during a forward fill, the NA value remains.
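
If a leading gap like this must also be filled, one option is to follow the forward fill with a backward fill. The sketch below shows the idea on a small Series. Whether borrowing a later value is acceptable depends on the data, so treat it as an assumption.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0])

forward = s.ffill()          # the first entry has nothing before it
both = s.ffill().bfill()     # the back-fill covers the remaining leading gap

print(forward.tolist())
print(both.tolist())
```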