## Handling Missing Data
Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.
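
As a quick illustration of that default, here is a minimal sketch on a made-up three-element Series (the values are hypothetical). It shows that reductions such as `mean` skip missing values unless `skipna=False` is passed:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Reductions skip NaN by default (skipna=True).
mean_default = s.mean()          # average of the two non-missing values
count_non_missing = s.count()    # counts only non-missing values

# Passing skipna=False lets the missing value propagate instead.
mean_strict = s.mean(skipna=False)

print(mean_default, count_non_missing, mean_strict)
```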

The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value that can be easily detected:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [3]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [4]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [5]:
string_data[0] = None

In [6]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

In [7]:
string_data.dropna()

1    artichoke
3      avocado
dtype: object

## NA handling methods
|Method|Description|
|---|---|
|dropna|Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.|
|fillna|Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.|
|isnull|Return boolean values indicating which values are missing/NA.|
|notnull|Negation of isnull.|

In [8]:
from numpy import nan as NA

In [9]:
data = pd.Series([1,NA,3.5,NA,7])

In [11]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [12]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [13]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [14]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

In [15]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [16]:
cleaned = data.dropna()

In [17]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [18]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


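The code above shows that when dropna is applied to a DataFrame, it drops every row that contains a missing value by default, which is why cleaned ends up with only one row. The how and axis arguments let you drop more selectively: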

In [19]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [20]:
data.dropna(axis=0)

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [21]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [22]:
data[4] = NA

In [23]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [24]:
data.dropna(axis=1,how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [25]:
data.dropna(axis=0,how='all')

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
3,,6.5,3.0,


In [37]:
df = pd.DataFrame(np.random.randn(7,3))

In [38]:
df.iloc[:4, 1] = NA  # set column 1 to NA in the first 4 rows

In [39]:
df.iloc[:2, 2] = NA  # set column 2 to NA in the first 2 rows

In [29]:
df

Unnamed: 0,0,1,2
0,-0.049311,,
1,-0.041027,,
2,0.952446,,0.001225
3,0.036818,,-1.102898
4,-0.024246,0.717118,1.310856
5,0.956486,1.042377,-0.610825
6,0.672617,-0.103243,-0.130949


In [30]:
df.dropna()

Unnamed: 0,0,1,2
4,-0.024246,0.717118,1.310856
5,0.956486,1.042377,-0.610825
6,0.672617,-0.103243,-0.130949


In [31]:
df.dropna(thresh=2)  # keep only rows that have at least 2 non-NA values

Unnamed: 0,0,1,2
2,0.952446,,0.001225
3,0.036818,,-1.102898
4,-0.024246,0.717118,1.310856
5,0.956486,1.042377,-0.610825
6,0.672617,-0.103243,-0.130949
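
To make the meaning of `thresh` concrete, the sketch below compares `dropna(thresh=2)` with an explicit filter on the number of non-missing values per row. The small frame is made up for illustration and is not the `df` used above:

```python
import numpy as np
import pandas as pd

# Small illustrative frame (values are hypothetical).
frame = pd.DataFrame({0: [1.0, 2.0, 3.0],
                      1: [np.nan, np.nan, 4.0],
                      2: [np.nan, 5.0, 6.0]})

# Number of non-missing values in each row.
non_na_per_row = frame.notna().sum(axis=1)

# thresh=2 keeps rows with at least two non-missing values ...
by_thresh = frame.dropna(thresh=2)
# ... which matches filtering on the per-row count directly.
by_mask = frame[non_na_per_row >= 2]

print(by_thresh)
```

Row 0 has only one non-missing value, so it is the only row dropped.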


In [32]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.049311,0.0,0.0
1,-0.041027,0.0,0.0
2,0.952446,0.0,0.001225
3,0.036818,0.0,-1.102898
4,-0.024246,0.717118,1.310856
5,0.956486,1.042377,-0.610825
6,0.672617,-0.103243,-0.130949


In [33]:
df.fillna({1:0.5, 2:0})

Unnamed: 0,0,1,2
0,-0.049311,0.5,0.0
1,-0.041027,0.5,0.0
2,0.952446,0.5,0.001225
3,0.036818,0.5,-1.102898
4,-0.024246,0.717118,1.310856
5,0.956486,1.042377,-0.610825
6,0.672617,-0.103243,-0.130949


In [34]:
df

Unnamed: 0,0,1,2
0,-0.049311,,
1,-0.041027,,
2,0.952446,,0.001225
3,0.036818,,-1.102898
4,-0.024246,0.717118,1.310856
5,0.956486,1.042377,-0.610825
6,0.672617,-0.103243,-0.130949


In [41]:
df.fillna(0,inplace=True)

In [42]:
df

Unnamed: 0,0,1,2
0,2.887393,0.0,0.0
1,-0.60344,0.0,0.0
2,-0.748708,0.0,-0.835078
3,-0.240042,0.0,0.876521
4,0.020183,-0.370359,-0.173548
5,0.452031,-1.296994,0.14299
6,-0.153101,0.927651,-0.460632


In [43]:
df = pd.DataFrame(np.random.randn(6,3))

In [44]:
df.iloc[2:,1] = NA

In [45]:
df.iloc[4:, 2] = NA

In [46]:
df

Unnamed: 0,0,1,2
0,0.412648,0.23623,0.779399
1,-0.078059,1.141965,0.473746
2,-0.566939,,0.025557
3,-0.093627,,-0.353639
4,0.904277,,
5,-0.374406,,


In [47]:
df.ffill()  # same result as df.fillna(method='ffill'), which newer pandas versions deprecate

Unnamed: 0,0,1,2
0,0.412648,0.23623,0.779399
1,-0.078059,1.141965,0.473746
2,-0.566939,1.141965,0.025557
3,-0.093627,1.141965,-0.353639
4,0.904277,1.141965,-0.353639
5,-0.374406,1.141965,-0.353639


In [48]:
df.ffill(limit=2)  # forward fill at most 2 consecutive missing values per column

Unnamed: 0,0,1,2
0,0.412648,0.23623,0.779399
1,-0.078059,1.141965,0.473746
2,-0.566939,1.141965,0.025557
3,-0.093627,1.141965,-0.353639
4,0.904277,,-0.353639
5,-0.374406,,-0.353639


In [49]:
data = pd.Series([1.,NA,3.5, NA,7])

In [50]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [51]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

## fillna function arguments
|Argument|Description|
|---|---|
|value|Scalar value or dict-like object to use to fill missing values|
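|method|Fill method: 'ffill' propagates the last valid value forward, 'bfill' propagates the next valid value backward; recent pandas versions deprecate this argument in favor of the ffill and bfill methods|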
|axis|Axis to fill on; default axis=0|
|inplace|Modify the calling object without producing a copy|
|limit|For forward and backward filling, maximum number of consecutive periods to fill|
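
Two of these arguments are not exercised in the cells above, so here is a small sketch of them on a made-up frame (the column names `a` and `b` are hypothetical). It passes a Series as `value` to fill each column with its own mean, and uses `axis=1` to fill across columns instead of down rows:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, 4.0, 8.0]})

# A Series passed as `value` is aligned on column labels,
# so each column is filled with its own mean.
filled_mean = frame.fillna(frame.mean())

# axis=1 propagates values from left to right within each row.
# A missing value in the first column has nothing to its left, so it stays NaN.
filled_across = frame.ffill(axis=1)

print(filled_mean)
print(filled_across)
```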

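By default, duplicated and drop_duplicates treat a row as a duplicate only when it matches an earlier row in every column. If you pass a list of column names, duplicates are detected using only those columns: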

In [52]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

In [53]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [54]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [55]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [56]:
data['v1'] = range(7)

In [57]:
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [58]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [59]:
data.drop_duplicates(['v1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [63]:
data.drop_duplicates()

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one:

In [64]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
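
Besides 'first' and 'last', `keep` also accepts `False`, which drops every member of a duplicate group. A minimal sketch on a made-up frame (not the `data` used above):

```python
import pandas as pd

frame = pd.DataFrame({'k1': ['one', 'two', 'one', 'two'],
                      'k2': [1, 1, 1, 2]})

# keep='first' (the default) retains the first row of each duplicate group.
first = frame.drop_duplicates(['k1', 'k2'])

# keep=False drops every row that has a duplicate anywhere in the frame.
none_kept = frame.drop_duplicates(['k1', 'k2'], keep=False)

print(first)
print(none_kept)
```

Rows 0 and 2 are identical on both keys, so the default keeps row 0 while `keep=False` removes both.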


In [2]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                            'Pastrami', 'corned beef', 'Bacon',
                            'pastrami', 'honey ham', 'nova lox'],
                            'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [3]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [4]:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

In [5]:
lowercased = data['food'].str.lower()

In [6]:
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [7]:
data['animal'] = lowercased.map(meat_to_animal)

In [8]:
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [9]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [10]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
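
The two approaches differ when a value has no entry in the mapping. The sketch below uses a shortened mapping and a deliberately unmapped value, 'tofu' (both hypothetical), to show that passing the dict to `map` yields a missing value, whereas indexing the dict inside a lambda would raise `KeyError`:

```python
import pandas as pd

foods = pd.Series(['bacon', 'Pastrami', 'tofu'])  # 'tofu' is deliberately unmapped
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Passing the dict to map: unmapped keys become missing values.
via_dict = foods.str.lower().map(meat_to_animal)

# meat_to_animal[x.lower()] would raise KeyError for 'tofu';
# dict.get returns None instead, which pandas also treats as missing.
via_get = foods.map(lambda x: meat_to_animal.get(x.lower()))

print(via_dict)
print(via_get)
```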

In [11]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [12]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [13]:
data.replace(-999,np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
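
The call above leaves -1000 untouched. As a sketch of how several sentinels could be handled at once, `replace` also accepts a list of values or a dict mapping each value to its own replacement. The replacement 0 for -1000 is only an illustrative choice:

```python
import numpy as np
import pandas as pd

readings = pd.Series([1., -999., 2., -999., -1000., 3.])

# A list of values to replace, all mapped to the same replacement.
both_missing = readings.replace([-999, -1000], np.nan)

# A dict maps each sentinel to its own replacement.
per_value = readings.replace({-999: np.nan, -1000: 0})

print(both_missing)
print(per_value)
```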

In [14]:
data = pd.DataFrame(np.arange(12).reshape((3,4)),index=['Ohio','Colorado','New York'],
                   columns=['one','two','three','four'])

In [15]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [17]:
transform = lambda x: x[:4].upper()  # take the first four characters of the string x and uppercase them

In [18]:
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [19]:
data.index = data.index.map(transform)

In [20]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [21]:
data.rename(index=str.title,columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [22]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [23]:
data.rename(index={'OHIO':'INDIANA'},
           columns={'three':'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [24]:
data.rename(index={'OHIO':'INDIANA'},inplace=True)

In [25]:
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [2]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [3]:
bins = [18,25,35,60,100]

In [4]:
cat = pd.cut(ages, bins)

In [5]:
cat

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
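
In the output above, a parenthesis marks an open side of an interval and a square bracket a closed side. The sketch below shows how `right=False` moves the closed side to the left, and how `labels` attaches names to the bins. The group names are hypothetical:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Intervals are closed on the right by default, so 25 falls in (18, 25].
default = pd.cut(ages, bins)
# right=False closes the left side instead, so 25 falls in [25, 35).
left_closed = pd.cut(ages, bins, right=False)

# Hypothetical group names, one per bin, for more readable categories.
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
named = pd.cut(ages, bins, labels=group_names)

# Count how many ages fall in each named bin.
counts = pd.Series(named).value_counts()
print(counts)
```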

In [6]:
data = np.random.rand(20)

In [7]:
pd.cut(data, 4, precision=2)

[(0.73, 0.96], (0.27, 0.5], (0.5, 0.73], (0.27, 0.5], (0.73, 0.96], ..., (0.041, 0.27], (0.27, 0.5], (0.73, 0.96], (0.73, 0.96], (0.73, 0.96]]
Length: 20
Categories (4, interval[float64]): [(0.041, 0.27] < (0.27, 0.5] < (0.5, 0.73] < (0.73, 0.96]]

In [8]:
data = pd.DataFrame(np.random.randn(1000, 4))

In [9]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.022154,-0.020094,-0.00394,-0.049114
std,1.021228,0.960554,1.037436,1.03232
min,-3.291055,-3.220117,-3.275621,-3.314959
25%,-0.630975,-0.672014,-0.677309,-0.713331
50%,0.036015,-0.029478,-0.046533,-0.03618
75%,0.73928,0.621921,0.641739,0.599139
max,2.925631,3.378068,3.411638,3.062952


In [10]:
data.head(10)

Unnamed: 0,0,1,2,3
0,-0.732505,0.863037,-1.165898,1.006525
1,0.611177,1.432307,0.324261,0.001816
2,-0.53523,0.13164,-1.757463,-0.363379
3,2.815838,-0.673132,-0.513225,-1.269158
4,-0.536183,-0.988692,0.430385,0.885017
5,-0.392571,-1.562996,0.127269,-2.004748
6,1.07771,-0.746262,0.436635,-0.469977
7,-0.122376,-0.926434,-0.462651,-1.640351
8,1.267969,1.285491,1.380572,-0.192813
9,0.263314,1.4607,0.823524,-0.875365


In [11]:
col = data[2]

In [14]:
col[0:10]

0   -1.165898
1    0.324261
2   -1.757463
3   -0.513225
4    0.430385
5    0.127269
6    0.436635
7   -0.462651
8    1.380572
9    0.823524
Name: 2, dtype: float64

In [12]:
col[np.abs(col) > 3]

275    3.411638
322   -3.275621
917    3.192670
947    3.012454
955    3.033400
Name: 2, dtype: float64

In [15]:
data[np.abs(data) > 3] = np.sign(data) * 3

In [16]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.022445,-0.020359,-0.004314,-0.048479
std,1.020324,0.958232,1.034614,1.030041
min,-3.0,-3.0,-3.0,-3.0
25%,-0.630975,-0.672014,-0.677309,-0.713331
50%,0.036015,-0.029478,-0.046533,-0.03618
75%,0.73928,0.621921,0.641739,0.599139
max,2.925631,3.0,3.0,3.0
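
The same capping can also be written with `DataFrame.clip`. The sketch below regenerates random data with an arbitrary seed (so the numbers differ from the output above) and checks that both forms agree:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

# Boolean-mask assignment, as in the cell above.
capped_mask = frame.copy()
capped_mask[capped_mask.abs() > 3] = np.sign(capped_mask) * 3

# clip bounds every value to the interval [-3, 3] in one call.
capped_clip = frame.clip(lower=-3, upper=3)

print(capped_clip.describe())
```

`clip` returns a new object and leaves `frame` unchanged, whereas the mask assignment modifies the frame it is applied to.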


In [17]:
val = 'a,b,  guido'

In [18]:
val.split(',')

['a', 'b', '  guido']

In [19]:
pieces = [x.strip() for x in val.split(',')]  # strip removes the surrounding whitespace from each piece

In [20]:
pieces

['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter using addition:

In [21]:
first, second, third = pieces

In [23]:
first + '::' + second + '::' + third

'a::b::guido'

In [24]:
first

'a'

In [25]:
'::'.join(pieces)

'a::b::guido'

In [26]:
'guido' in val

True

In [27]:
val.index(',')

1

In [28]:
val.find(':')

-1

In [29]:
val.count(',')

2

In [30]:
val.replace(',','::')

'a::b::  guido'

In [31]:
val.replace(',','')

'ab  guido'

## Python built-in string methods
|Method|Description|
|---|---|
|count|Return the number of non-overlapping occurrences of substring in the string.|
|endswith|Returns True if string ends with suffix.|
|startswith|Returns True if string starts with prefix.|
|join|Use string as delimiter for concatenating a sequence of other strings.|
|index|Return position of first character in substring if found in the string; raises ValueError if not found.|
|find|Return position of first character of first occurrence of substring in the string; like index, but returns –1 if not found.|
|rfind|Return position of first character of last occurrence of substring in the string; returns –1 if not found.|
|replace|Replace occurrences of string with another string.|
|strip, rstrip, lstrip|Trim whitespace, including newlines, from both sides (strip), the right side only (rstrip), or the left side only (lstrip).|
|split|Break string into list of substrings using passed delimiter.|
|lower|Convert alphabet characters to lowercase.|
|upper|Convert alphabet characters to uppercase.|
|casefold|Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form.|
|ljust, rjust|Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width.|
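
Several methods in the table are not exercised in the cells above. A brief sketch, reusing the same `val` string and one made-up example word for `casefold`:

```python
val = 'a,b,  guido'

starts = val.startswith('a')     # True: the string begins with 'a'
ends = val.endswith('guido')     # True: the string ends with 'guido'
last_comma = val.rfind(',')      # position of the last comma
padded = 'guido'.ljust(8, '.')   # pad on the right with '.' to width 8
folded = 'Straße'.casefold()     # casefold also maps 'ß' to 'ss'

# index raises ValueError when the substring is absent; find returns -1.
try:
    val.index(':')
    missing = False
except ValueError:
    missing = True

print(starts, ends, last_comma, padded, folded, missing)
```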

## Regular Expressions
Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python’s built-in re module is responsible for applying regular expressions to strings; I’ll give a number of examples of its use here.

The re module functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes. Let’s look at a simple example: suppose we wanted to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+: