# Chapter 7 - Data Cleaning and Preparation

*__During the course of doing data analysis and modeling, a significant amount of time
is spent on data preparation: loading, cleaning, transforming, and rearranging. Such
tasks are often reported to take up 80% or more of an analyst’s time__*. Sometimes the
way that data is stored in files or databases is not in the right format for a particular
task. Many researchers choose to do ad hoc processing of data from one form to
another using a general-purpose programming language, like Python, Perl, R, or Java,
or Unix text-processing tools like sed or awk. *__Fortunately, pandas, along with the
built-in Python language features, provides you with a high-level, flexible, and fast set
of tools to enable you to manipulate data into the right form__*.

If you identify a type of data manipulation that isn’t anywhere in this book or
elsewhere in the pandas library, feel free to share your use case on one of the Python
mailing lists or on the pandas GitHub site. Indeed, much of the design and
implementation of pandas has been driven by the needs of real-world applications.

In this chapter I discuss tools for missing data, duplicate data, string manipulation,
and some other analytical data transformations. In the next chapter, I focus on
combining and rearranging datasets in various ways.

## 7.1 Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals
of pandas is to make working with missing data as painless as possible. For example,
all of the descriptive statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect,
but it is functional for a lot of users. For numeric data, pandas uses the floating-point
value NaN (Not a Number) to represent missing data. We call this a sentinel value that
can be easily detected:

In [1]:
import pandas as pd
from pandas import Series, DataFrame 

import numpy as np

In [6]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [8]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we’ve adopted a convention used in the R programming language by
referring to missing data as NA, which stands for not available. In statistics applications,
NA data may either be data that does not exist or that exists but was not observed
(through problems with data collection, for example). When cleaning up data for
analysis, it is often important to do analysis on the missing data itself to identify data
collection problems or potential biases in the data caused by missing data.
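
As a minimal sketch of analyzing the missing data itself, the snippet below counts NA values per column and computes the share of rows that are missing. The `survey` DataFrame and its column names are invented for illustration and do not come from any dataset in this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; the column names are illustrative only.
survey = pd.DataFrame({
    'age': [34, np.nan, 29, np.nan],
    'income': [52000, 48000, np.nan, 61000],
    'city': ['Oslo', np.nan, 'Lima', 'Pune'],
})

# Number of missing entries in each column
missing_counts = survey.isnull().sum()

# Fraction of rows that are missing, per column
missing_share = survey.isnull().mean()

print(missing_counts)
print(missing_share)
```

A column with an unusually high missing share is often a good place to start looking for a data collection problem.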

The built-in Python None value is also treated as NA in object arrays:

In [10]:
string_data[0] = None

In [11]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

There is work ongoing in the pandas project to improve the internal details of how
missing data is handled, but the user API functions, like pandas.isnull, abstract
away many of the annoying details. See Table 7-1 for a list of some functions related
to missing data handling.

![](NA_handling.jpg)

### Filtering Out Missing Data

There are a few ways to filter out missing data. While you always have the option to
do it by hand using pandas.isnull and boolean indexing, the dropna method can be helpful.
On a Series, it returns the Series with only the non-null data and index values:

In [12]:
from numpy import nan as NA

In [13]:
data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [14]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to:

In [15]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows
or columns that are all NA or only those containing any NAs. *__dropna by default drops
any row containing a missing value__*:

In [16]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [17]:
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


*__Passing how='all' will only drop rows that are all NA__*:

In [18]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [19]:
data[4] = NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [21]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


A related way to filter out DataFrame rows tends to concern time series data. Suppose
you want to keep only rows containing a certain number of observations. You can
indicate this with the thresh argument:

In [22]:
df = pd.DataFrame(np.random.randn(7,3))
df

Unnamed: 0,0,1,2
0,0.298935,1.857714,0.249267
1,-1.24045,-0.62792,-2.270728
2,-1.26108,-0.290014,0.073613
3,-1.689148,0.692656,-0.391908
4,0.859421,0.290142,-0.943482
5,0.037043,-0.983893,0.498118
6,0.128975,-0.698469,0.897002


In [24]:
df.iloc[:4,1] = NA

In [25]:
df.iloc[:2,2] = NA

In [26]:
df

Unnamed: 0,0,1,2
0,0.298935,,
1,-1.24045,,
2,-1.26108,,0.073613
3,-1.689148,,-0.391908
4,0.859421,0.290142,-0.943482
5,0.037043,-0.983893,0.498118
6,0.128975,-0.698469,0.897002


In [27]:
df.dropna()

Unnamed: 0,0,1,2
4,0.859421,0.290142,-0.943482
5,0.037043,-0.983893,0.498118
6,0.128975,-0.698469,0.897002


In [28]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,-1.26108,,0.073613
3,-1.689148,,-0.391908
4,0.859421,0.290142,-0.943482
5,0.037043,-0.983893,0.498118
6,0.128975,-0.698469,0.897002
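
To make the meaning of thresh concrete, here is a small sketch showing that `thresh=2` keeps the rows with at least two non-NA values. The frame below uses hand-picked values rather than the random data above, and the boolean-mask version is only an illustration of the rule, not a description of how pandas implements it.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with hand-picked values
frame = pd.DataFrame([[1.0, np.nan, np.nan],
                      [2.0, np.nan, 0.5],
                      [3.0, 4.0, 6.0]])

# Keep rows with at least two non-NA observations
by_thresh = frame.dropna(thresh=2)

# The same selection written out with a boolean mask
enough_obs = frame.notnull().sum(axis=1) >= 2
by_mask = frame[enough_obs]

print(by_thresh)
print(by_thresh.equals(by_mask))
```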


### Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways. For most
purposes, the fillna method is the workhorse function to use. Calling fillna with a
constant replaces missing values with that value:

In [29]:
df

Unnamed: 0,0,1,2
0,0.298935,,
1,-1.24045,,
2,-1.26108,,0.073613
3,-1.689148,,-0.391908
4,0.859421,0.290142,-0.943482
5,0.037043,-0.983893,0.498118
6,0.128975,-0.698469,0.897002


In [30]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.298935,0.0,0.0
1,-1.24045,0.0,0.0
2,-1.26108,0.0,0.073613
3,-1.689148,0.0,-0.391908
4,0.859421,0.290142,-0.943482
5,0.037043,-0.983893,0.498118
6,0.128975,-0.698469,0.897002


Calling fillna with a dict, you can use a different fill value for each column:

In [31]:
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,0.298935,0.5,0.0
1,-1.24045,0.5,0.0
2,-1.26108,0.5,0.073613
3,-1.689148,0.5,-0.391908
4,0.859421,0.290142,-0.943482
5,0.037043,-0.983893,0.498118
6,0.128975,-0.698469,0.897002


fillna returns a new object, but you can modify the existing object in-place:

In [32]:
_ = df.fillna(0, inplace=True)

In [33]:
df

Unnamed: 0,0,1,2
0,0.298935,0.0,0.0
1,-1.24045,0.0,0.0
2,-1.26108,0.0,0.073613
3,-1.689148,0.0,-0.391908
4,0.859421,0.290142,-0.943482
5,0.037043,-0.983893,0.498118
6,0.128975,-0.698469,0.897002


The same interpolation methods available for reindexing can be used with fillna:

In [34]:
df = pd.DataFrame(np.random.randn(6,3))

In [35]:
df

Unnamed: 0,0,1,2
0,0.146076,1.778811,0.904821
1,0.995247,-0.194858,0.81802
2,0.532572,-1.472086,-0.468087
3,-2.429891,0.413744,-0.619391
4,-0.994494,-0.962905,1.079316
5,0.114552,-0.497561,1.576744


In [37]:
df.iloc[2:,1] = NA

In [38]:
df.iloc[4:,2] = NA

In [39]:
df

Unnamed: 0,0,1,2
0,0.146076,1.778811,0.904821
1,0.995247,-0.194858,0.81802
2,0.532572,,-0.468087
3,-2.429891,,-0.619391
4,-0.994494,,
5,0.114552,,


In [40]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,0.146076,1.778811,0.904821
1,0.995247,-0.194858,0.81802
2,0.532572,-0.194858,-0.468087
3,-2.429891,-0.194858,-0.619391
4,-0.994494,-0.194858,-0.619391
5,0.114552,-0.194858,-0.619391


In [41]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,0.146076,1.778811,0.904821
1,0.995247,-0.194858,0.81802
2,0.532572,-0.194858,-0.468087
3,-2.429891,-0.194858,-0.619391
4,-0.994494,,-0.619391
5,0.114552,,-0.619391
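
Depending on your pandas version, passing `method=` to fillna may emit a deprecation warning (to my knowledge this applies from pandas 2.1 onward; check the release notes for the version you run). The dedicated `ffill` method performs the same forward fill, so a sketch with hand-picked values might look like this:

```python
import numpy as np
import pandas as pd

# Hand-picked values so the result is easy to check by eye
frame = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                      'b': [5.0, 6.0, np.nan, np.nan]})

# Forward fill every gap
filled = frame.ffill()

# Forward fill at most two consecutive gaps per column
filled_limited = frame.ffill(limit=2)

print(filled)
print(filled_limited)
```

With `limit=2`, the last value in column `a` stays missing because it is the third consecutive gap.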


With fillna you can do lots of other things with a little creativity. For example, you
might pass the mean or median value of a Series:

In [42]:
data = pd.Series([1., NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [44]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
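
The same idea extends to a DataFrame. The sketch below, using a made-up two-column frame, fills each column with its own mean: `frame.mean()` returns a Series indexed by column label, and fillna aligns on those labels.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the column names are arbitrary
frame = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                      'y': [np.nan, 10.0, 20.0]})

# Each column is filled with its own mean
filled = frame.fillna(frame.mean())

print(filled)
```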

See Table 7-2 for a reference on fillna.

![](fillna.jpg)

## 7.2 Data Transformation

So far in this chapter we’ve been concerned with handling missing data. Filtering, cleaning,
and other transformations are another class of important operations.

### Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an
example:

In [45]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method duplicated returns a boolean Series indicating whether each
row is a duplicate (has been observed in a previous row) or not:

In [46]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame where the duplicated array is
False:

In [47]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Both of these methods by default consider all of the columns; alternatively, you can
specify any subset of them to detect duplicates. Suppose we had an additional column
of values and wanted to filter duplicates only based on the 'k1' column:

In [49]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [50]:
data['v1'] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [53]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one:

In [54]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
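
Besides 'first' and 'last', the keep argument also accepts False, which treats every member of a duplicated group as a duplicate. The sketch below uses a small made-up frame to show the difference:

```python
import pandas as pd

# Rows 0 and 2 are identical; rows 1 and 3 are unique
frame = pd.DataFrame({'k1': ['one', 'two', 'one', 'two'],
                      'k2': [1, 1, 1, 2]})

# keep=False marks every member of a duplicated group, not just the repeats
all_dupes = frame.duplicated(keep=False)

# Dropping with keep=False removes the whole group
no_dupes = frame.drop_duplicates(keep=False)

print(all_dupes.tolist())
print(no_dupes)
```

This can be handy when a duplicated record signals a data quality problem and you want to inspect or discard all copies rather than keep one.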


### Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [55]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon',
                    'pastrami', 'honey ham', 'nova lox'], 
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [56]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [60]:
len(data)

9

In [63]:
data['food']

0          bacon
1    pulled pork
2          bacon
3       Pastrami
4    corned beef
5          Bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

Suppose you wanted to add a column indicating the type of animal that each food
came from. Let’s write down a mapping of each distinct meat type to the kind of
animal:

In [57]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

In [58]:
meat_to_animal

{'bacon': 'pig',
 'pulled pork': 'pig',
 'pastrami': 'cow',
 'corned beef': 'cow',
 'honey ham': 'pig',
 'nova lox': 'salmon'}

The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and
others are not. Thus, we need to convert each value to lowercase using the str.lower
Series method:

In [59]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [65]:
data['animal'] = lowercased.map(meat_to_animal)

In [66]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


We could also have passed a function that does all the work:

In [67]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Using map is a convenient way to perform element-wise transformations and other
data cleaning–related operations.
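
One difference between the two approaches is worth a sketch. When map is given a dict, values without a matching key come back as NaN, whereas a function that indexes the dict directly would raise a KeyError. The 'tofu' entry and the 'unknown' fallback below are invented for illustration.

```python
import pandas as pd

foods = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# With a dict, values that have no key come back as NaN
mapped = foods.str.lower().map(meat_to_animal)

# With a function, dict.get supplies an explicit fallback instead
mapped_default = foods.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

print(mapped.tolist())
print(mapped_default.tolist())
```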

### Replacing Values

Filling in missing data with the fillna method is a special case of more general value
replacement. As you’ve already seen, map can be used to modify a subset of values in
an object, but replace provides a simpler and more flexible way to do so. Let’s
consider this Series:

In [68]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA
values that pandas understands, we can use replace, producing a new Series (unless
you pass inplace=True):

In [69]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, you instead pass a list and then the
substitute value:

In [71]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [72]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:

In [73]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The data.replace method is distinct from data.str.replace,
which performs string substitution element-wise. We look at these
string methods on Series later in the chapter.
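
A short sketch of that distinction, using made-up string codes: replace matches whole values by default, while str.replace substitutes inside each string.

```python
import pandas as pd

codes = pd.Series(['N/A', 'N/A-12', 'ok'])

# Series.replace matches whole values by default
whole = codes.replace('N/A', 'missing')

# Series.str.replace substitutes inside each string
inside = codes.str.replace('N/A', 'missing', regex=False)

print(whole.tolist())
print(inside.tolist())
```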

### Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or
mapping of some form to produce new, differently labeled objects. You can also modify
the axes in-place without creating a new data structure. Here’s a simple example:

In [74]:
data = pd.DataFrame(np.arange(12).reshape((3,4)), index=['Ohio', 'Colorado', 'New York'],
                   columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


Like a Series, the axis indexes have a map method:

In [87]:
transform = lambda x: x[:4].upper()

In [88]:
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

You can assign to index, modifying the DataFrame in-place:

In [89]:
data.index = data.index.map(transform)

In [90]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If you want to create a transformed version of a dataset without modifying the
original, a useful method is rename:

In [91]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


Notably, rename can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [92]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [93]:
data.rename(index={'OHIO':'INDIANA'}, columns={'three':'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


rename saves you from the chore of copying the DataFrame manually and assigning
to its index and columns attributes. Should you wish to modify a dataset in-place,
pass inplace=True:

In [94]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [95]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [96]:
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


### Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose you have data about a group of people in a study, and you want to group
them into discrete age buckets:

In [97]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To
do so, you can use cut, a function in pandas:

In [98]:
bins = [18, 25, 35, 60, 100]

In [99]:
cats = pd.cut(ages, bins)

In [100]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object. The output you see
describes the bins computed by pandas.cut. You can treat it like an array of strings
indicating the bin name; internally it contains a categories array specifying the distinct category names along with a labeling for the ages data in the codes attribute:

In [101]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [102]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [103]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

Note that pd.value_counts(cats) gives the bin counts for the result of pandas.cut.

Consistent with mathematical notation for intervals, a parenthesis means that the side
is open, while the square bracket means it is closed (inclusive). You can change which
side is closed by passing right=False:

In [104]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

You can also pass your own bin names by passing a list or array to the labels option:

In [105]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [106]:
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
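
One detail that is easy to miss: values that fall outside every bin become NA, and with the default right-closed intervals the lowest edge itself is excluded. The sketch below uses a few hand-picked ages to show this; `include_lowest=True` is the pandas.cut option that makes the first interval include its left edge.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]

# 10 and 101 lie outside the bins, so they are assigned no category
outside = pd.cut([10, 20, 101], bins)
print(outside.codes)       # -1 marks values that fell outside every bin
print(pd.isnull(outside))

# 18 sits exactly on the lowest edge, which (18, 25] excludes by default
edge_default = pd.cut([18, 20], bins)
edge_included = pd.cut([18, 20], bins, include_lowest=True)
print(edge_default.codes)
print(edge_included.codes)
```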

If you pass an integer number of bins to cut instead of explicit bin edges, it will
compute equal-length bins based on the minimum and maximum values in the data.
Consider the case of some uniformly distributed data chopped into fourths:

In [107]:
data = np.random.rand(20)

In [110]:
data_bin = pd.cut(data, 4, precision=2)
data_bin

[(0.26, 0.49], (0.72, 0.96], (0.26, 0.49], (0.26, 0.49], (0.023, 0.26], ..., (0.49, 0.72], (0.49, 0.72], (0.023, 0.26], (0.72, 0.96], (0.72, 0.96]]
Length: 20
Categories (4, interval[float64]): [(0.023, 0.26] < (0.26, 0.49] < (0.49, 0.72] < (0.72, 0.96]]

In [111]:
pd.value_counts(data_bin)

(0.72, 0.96]     6
(0.26, 0.49]     6
(0.49, 0.72]     4
(0.023, 0.26]    4
dtype: int64

The precision=2 option limits the decimal precision to two digits.

A closely related function, qcut, bins the data based on sample quantiles. Depending
on the distribution of the data, using cut will not usually result in each bin having the
same number of data points. Since qcut uses sample quantiles instead, by definition
you will obtain roughly equal-size bins:

In [112]:
data = np.random.randn(1000) # Normally distributed

In [113]:
cats = pd.qcut(data, 4) #Cut into quartiles
cats

[(0.0249, 0.684], (0.0249, 0.684], (-3.114, -0.621], (0.684, 2.701], (-3.114, -0.621], ..., (-0.621, 0.0249], (-3.114, -0.621], (-0.621, 0.0249], (-3.114, -0.621], (-3.114, -0.621]]
Length: 1000
Categories (4, interval[float64]): [(-3.114, -0.621] < (-0.621, 0.0249] < (0.0249, 0.684] < (0.684, 2.701]]

In [115]:
pd.value_counts(cats)

(0.684, 2.701]      250
(0.0249, 0.684]     250
(-0.621, 0.0249]    250
(-3.114, -0.621]    250
dtype: int64

Similar to cut you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [116]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(0.0249, 1.289], (0.0249, 1.289], (-1.211, 0.0249], (1.289, 2.701], (-3.114, -1.211], ..., (-1.211, 0.0249], (-1.211, 0.0249], (-1.211, 0.0249], (-1.211, 0.0249], (-3.114, -1.211]]
Length: 1000
Categories (4, interval[float64]): [(-3.114, -1.211] < (-1.211, 0.0249] < (0.0249, 1.289] < (1.289, 2.701]]

We’ll return to cut and qcut later in the book during our discussion of aggregation
and group operations, as these discretization functions are especially useful for
quantile and group analysis.

### Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations.
Consider a DataFrame with some normally distributed data:

In [117]:
data = pd.DataFrame(np.random.randn(1000,4))

In [118]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.068295,0.001946,-0.037967,-0.024312
std,0.998774,0.997067,0.968906,1.029778
min,-2.629798,-2.938478,-2.944608,-3.063135
25%,-0.772509,-0.635134,-0.703351,-0.744794
50%,-0.05563,0.00547,-0.006946,0.010222
75%,0.599362,0.707452,0.576067,0.670859
max,3.297096,3.466338,3.255889,4.457267


Suppose you wanted to find values in one of the columns exceeding 3 in absolute
value:

In [119]:
col = data[2]

In [120]:
col[np.abs(col)>3]

103    3.255889
512    3.040428
Name: 2, dtype: float64

To select all rows having a value exceeding 3 or –3, you can use the any method on a
boolean DataFrame:

In [121]:
data[(np.abs(data)>3).any(axis=1)]

Unnamed: 0,0,1,2,3
22,3.297096,-1.509745,0.519751,-2.074684
103,-0.982861,1.179048,3.255889,-0.528437
145,-1.07569,-0.947955,0.166397,3.267319
439,1.687686,3.466338,1.443134,0.65853
451,0.994681,3.082845,-0.260967,0.902555
512,0.864805,0.288832,3.040428,0.764003
525,0.571544,-0.43124,-0.691053,3.099501
661,-1.15575,3.127733,-1.337533,-0.273137
707,-0.059804,0.048412,-1.062518,4.457267
719,0.946133,0.394742,-2.188168,3.213398


Values can be set based on these criteria. Here is code to cap values outside the
interval –3 to 3:

In [122]:
data[np.abs(data)>3] = np.sign(data)*3

In [123]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.068592,0.001269,-0.038263,-0.026286
std,0.997815,0.994906,0.967941,1.022482
min,-2.629798,-2.938478,-2.944608,-3.0
25%,-0.772509,-0.635134,-0.703351,-0.744794
50%,-0.05563,0.00547,-0.006946,0.010222
75%,0.599362,0.707452,0.576067,0.670859
max,3.0,3.0,3.0,3.0


The statement np.sign(data) produces 1 and –1 values based on whether the values
in data are positive or negative:

In [124]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,-1.0,-1.0,1.0,1.0
1,-1.0,1.0,-1.0,1.0
2,-1.0,-1.0,-1.0,1.0
3,1.0,1.0,-1.0,-1.0
4,1.0,1.0,1.0,1.0
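
As an alternative sketch, not the approach used above, the DataFrame clip method bounds values to an interval in a single call. The example below generates its own random data with a fixed seed and checks that clipping agrees with the sign-based assignment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

# Cap with the sign-based assignment
capped = frame.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Cap with clip, which bounds values to the closed interval [-3, 3]
clipped = frame.clip(lower=-3, upper=3)

print(np.allclose(capped, clipped))
```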


### Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do
using the numpy.random.permutation function. Calling permutation with the length
of the axis you want to permute produces an array of integers indicating the new
ordering:
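
The example for this section is missing here, so the following is a sketch of what it might look like rather than the original code. It permutes the rows of a small frame with take, then uses the sample method to draw rows without replacement and Series values with replacement. All variable names and values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# An array holding a random ordering of the integers 0 through 4
sampler = np.random.permutation(5)

# Reorder the rows by integer position (df.iloc[sampler] is equivalent)
shuffled = df.take(sampler)

# Draw 3 rows at random without replacement
subset = df.sample(n=3)

# Draw 10 values with replacement from a Series
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)

print(shuffled)
print(subset)
print(draws.tolist())
```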

### Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning
applications is converting a categorical variable into a “dummy” or “indicator” matrix. If a
column in a DataFrame has k distinct values, you would derive a matrix or
DataFrame with k columns containing all 1s and 0s. pandas has a get_dummies function
for doing this, though devising one yourself is not difficult. Let’s return to an earlier
example DataFrame:

In [125]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [131]:
df['data1']

0    0
1    1
2    2
3    3
4    4
5    5
Name: data1, dtype: int64

In [126]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing this:

In [127]:
dummies = pd.get_dummies(df['key'], prefix='key')

In [128]:
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [132]:
df_with_dummy = df[['data1']].join(dummies)

In [133]:
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0


If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let’s look at the MovieLens 1M dataset, which is investigated in more detail in
Chapter 14:

In [3]:
mnames = ['movie_id', 'title', 'genres'] # Create column names.
mnames

['movie_id', 'title', 'genres']

In [4]:
movies = pd.read_table('datasets/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')


In [5]:
movies[:10]

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [6]:
movies.genres.head()

0     Animation|Children's|Comedy
1    Adventure|Children's|Fantasy
2                  Comedy|Romance
3                    Comedy|Drama
4                          Comedy
Name: genres, dtype: object

Adding indicator variables for each genre requires a little bit of wrangling. First, we
extract the list of unique genres in the dataset:

In [8]:
all_genres = []

In [9]:
for x in movies.genres:
    all_genres.extend(x.split('|'))

In [10]:
all_genres[:10]

['Animation',
 "Children's",
 'Comedy',
 'Adventure',
 "Children's",
 'Fantasy',
 'Comedy',
 'Romance',
 'Comedy',
 'Drama']

Note that all_genres does not contain unique movie genres yet.

In [11]:
genres = pd.unique(all_genres) # Get unique genres only.
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

One way to construct the indicator DataFrame is to start with a DataFrame (matrix) of all
zeros:

In [19]:
zero_matrix = np.zeros((len(movies), len(genres)))

In [20]:
dummies = pd.DataFrame(zero_matrix, columns=genres)
dummies.head()

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now, iterate through each movie and set entries in each row of dummies to 1. To do
this, we use dummies.columns to compute the column indices for each genre. For illustration, let's do this first for the movie at index 0:

In [21]:
gen = movies.genres[0]

In [22]:
gen.split('|')

['Animation', "Children's", 'Comedy']

In [26]:
indices = dummies.columns.get_indexer(gen.split('|'))
indices

array([0, 1, 2], dtype=int64)

Then, we can use .iloc to set values based on these indices:

In [29]:
dummies.iloc[0,indices] = 1 

In [30]:
dummies.head()

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's do this for all movies:

In [33]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i,indices] = 1

In [34]:
dummies.head()

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Then, as before, you can combine this with movies:

In [35]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic

Unnamed: 0,movie_id,title,genres,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Adventure,Genre_Fantasy,Genre_Romance,Genre_Drama,...,Genre_Crime,Genre_Thriller,Genre_Horror,Genre_Sci-Fi,Genre_Documentary,Genre_War,Genre_Musical,Genre_Mystery,Genre_Film-Noir,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,6,Heat (1995),Action|Crime|Thriller,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,7,Sabrina (1995),Comedy|Romance,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,8,Tom and Huck (1995),Adventure|Children's,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,9,Sudden Death (1995),Action,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,10,GoldenEye (1995),Action|Adventure|Thriller,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


For much larger data, this method of constructing indicator
variables with multiple membership is not especially speedy. It would
be better to write a lower-level function that writes directly to a
NumPy array, and then wrap the result in a DataFrame.
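
A minimal sketch of that lower-level idea follows, assuming a tiny made-up genres column in place of the MovieLens file. It fills a plain NumPy array and wraps it in a DataFrame once at the end; the final line cross-checks the result against `Series.str.get_dummies`, a built-in pandas method that handles delimiter-separated membership directly.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for movies.genres; the values are made up
genres_col = pd.Series(['Comedy|Drama', 'Drama', 'Action|Comedy'])

# Distinct genres in order of first appearance
all_genres = []
for entry in genres_col:
    all_genres.extend(entry.split('|'))
genres = list(dict.fromkeys(all_genres))
position = {g: j for j, g in enumerate(genres)}

# Fill a plain NumPy array, then wrap it once at the end
matrix = np.zeros((len(genres_col), len(genres)))
for i, entry in enumerate(genres_col):
    for g in entry.split('|'):
        matrix[i, position[g]] = 1
dummies = pd.DataFrame(matrix, columns=genres, index=genres_col.index)

# Cross-check against the built-in string accessor
builtin = genres_col.str.get_dummies(sep='|')
matches = bool((dummies[builtin.columns] == builtin).all().all())

print(dummies)
print(matches)
```

The dictionary lookup replaces the repeated `get_indexer` calls, and writing into the array avoids per-row DataFrame assignment. Whether this is fast enough for a given dataset is something to measure rather than assume.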

A useful recipe for statistical applications is to combine get_dummies with a
discretization function like cut:

In [37]:
np.random.seed(12345)

In [38]:
values = np.random.rand(10)

In [40]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [41]:
pd.get_dummies(pd.cut(values,bins))

   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0

We set the random seed with numpy.random.seed to make the example deterministic.
We will look again at pandas.get_dummies later in the book.

## 7.3 String Manipulation

Python has long been a popular raw data manipulation language in part due to its
ease of use for string and text processing. Most text operations are made simple with
the string object’s built-in methods. For more complex pattern matching and text
manipulations, regular expressions may be needed. pandas adds to the mix by
enabling you to apply string and regular expressions concisely on whole arrays of data,
additionally handling the annoyance of missing data.

### String Object Methods

In many string munging and scripting applications, built-in string methods are
sufficient. As an example, a comma-separated string can be broken into pieces with
split:
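
The example for this section is missing here, so what follows is a sketch with a made-up string rather than the original code. It uses only built-in str methods: split, strip, join, find, and replace.

```python
val = 'a,b,  guido'

# Break on commas; surrounding whitespace is kept
parts = val.split(',')

# Trim whitespace from each piece
pieces = [x.strip() for x in val.split(',')]

# Glue the pieces back together with a different delimiter
joined = '::'.join(pieces)

# Substring checks and substitution
has_guido = 'guido' in val
first_comma = val.find(',')   # find returns -1 when nothing is found
replaced = val.replace(',', '::')

print(parts)
print(pieces)
print(joined)
print(has_guido, first_comma)
print(replaced)
```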

### Regular Expressions

Regular expressions provide a flexible way to search or match (often more complex)
string patterns in text. A single expression, commonly called a regex, is a string
formed according to the regular expression language. Python’s built-in re module is
responsible for applying regular expressions to strings; I’ll give a number of examples
of its use here.
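
The examples themselves are missing from this section, so here is a brief sketch of common re usage. The sample text, the addresses, and the simplified email pattern are all made up for illustration; the pattern is not a complete email validator.

```python
import re

text = "foo    bar\t baz  \tqux"

# Split on one or more whitespace characters
tokens = re.split(r'\s+', text)

# Compile once when the same pattern is reused
email = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}',
                   flags=re.IGNORECASE)
contacts = "Dave dave@example.com\nSteve steve@example.org"

found = email.findall(contacts)             # all matches as a list
first = email.search(contacts)              # first match object, or None
redacted = email.sub('REDACTED', contacts)  # replace every match

print(tokens)
print(found)
print(first.group())
print(redacted)
```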

### Vectorized String Functions in pandas

Cleaning up a messy dataset for analysis often requires a lot of string munging and
regularization. To complicate matters, a column containing strings will sometimes
have missing data:
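
The example for this section is missing here, so the following sketch uses a made-up Series of addresses with one missing entry. Methods under the Series `str` attribute apply to each element and handle the missing value instead of raising.

```python
import numpy as np
import pandas as pd

# Illustrative data; the names and addresses are invented
emails = pd.Series({'Dave': 'dave@example.com',
                    'Steve': 'steve@example.org',
                    'Wes': np.nan})

# Element-wise substring test and case conversion
has_org = emails.str.contains('example.org', regex=False)
upper = emails.str.upper()

# Regex extraction; each capture group becomes a column
parts = emails.str.extract(r'([a-z0-9._%+-]+)@([a-z0-9.-]+)')

print(has_org)
print(upper)
print(parts)
```

Calling a plain string method through map, for example `emails.map(str.upper)`, would fail on the missing entry, which is why the `str` accessor is usually the more convenient route.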

## 7.4 Conclusion

Effective data preparation can significantly improve productivity by enabling you to
spend more time analyzing data and less time getting it ready for analysis. We have
explored a number of tools in this chapter, but the coverage here is by no means comprehensive. In the next chapter, we will explore pandas’s joining and grouping
functionality.