# Pandas Data Transformation
- Removing duplicates
    - DataFrameObj.duplicated() returns a boolean Series
    - DataFrameObj.drop_duplicates()
    - DataFrameObj.drop_duplicates(['column_name']) identifies duplicates based on a single column
    - DataFrameObj.drop_duplicates(['column_name1', 'column_name2'], keep='last') identifies duplicates based on two columns and keeps the last occurrence
- Transforming data using a function or mapping
    - Using map()
        - data['column_name'].str.lower().map(mapping)
        - DataFrameObj['column_name'].map(lambda x: mapping[x.lower()])
- Replacing values
    - SeriesObj.replace(-999, np.nan) one value to one replacement
    - SeriesObj.replace([-999, -1000], np.nan) several values to one replacement
    - SeriesObj.replace([-999, -1000], [np.nan, 0]) element-wise pairs, so still one-to-one
    - SeriesObj.replace({-999: np.nan, -1000: 0}) several one-to-one pairs given as a dict
- Renaming axis indexes
    - DataFrameObj.index.map(transformFunc)
    - DataFrameObj.rename(index=str.title, columns=str.upper)
    - DataFrameObj.rename(index={'origIndexName': 'newIndexName'}, columns={'origColumnsName': 'newColumnName'})
    - rename(inplace=True) modifies the DataFrame in place
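- Discretization and binning
    - cats = pd.cut(array_data, bins)
        - cats: the resulting Categorical object
        - cats.categories: the name of each bin
        - cats.codes: the bin number each value falls into
        - pd.value_counts(cats): the number of values in each bin
        - right=False: bins are left-open and right-closed by default; with right=False they become left-closed and right-open
        - pd.cut(array_data, bins, labels=group_names) sets the category names
        - pd.cut(array_data, 4, precision=2) splits into 4 equal-width bins and limits the bin labels to 2 digits of precision
    - cats = pd.qcut(data, 4): cut into quartiles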
- Detecting and filtering outliers
    - Filter one column by a rule, using a boolean array as the mask:
        - col = DataFrameObj[column_name]; col[np.abs(col) > 3]
    - Select the rows in which any column matches the rule
        - DataFrameObj[(np.abs(DataFrameObj) > 3).any(axis=1)]
    - Cap the outliers
        - DataFrameObj[np.abs(DataFrameObj) > 3] = np.sign(DataFrameObj) * 3
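- Permutation and random sampling
    - np.random.permutation(5) returns a random permutation of 0 through 4
    - df.take(integer_array) selects rows by the integer positions in the array
    - np.random.randint(0, 100, size=10) draws 10 random integers from 0 (inclusive) to 100 (exclusive)
- Computing indicator/dummy variables
    - pd.get_dummies()
    - Example: turning the genres of movielens/movies.dat into indicator columns
    - Example: combining bins with get_dummies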

In [1]:
# coding:utf-8
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
%pwd

u'/Users/zhangjun/Documents/machine-learning-notes/data-processing'

## Removing Duplicates

In [2]:
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,one,1
2,one,2
3,two,3
4,two,3
5,two,4
6,two,4


The DataFrame method `duplicated` returns a boolean Series indicating whether each row is a duplicate or not:

In [4]:
data.duplicated()

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

Relatedly, `drop_duplicates` returns a DataFrame containing only the rows for which `duplicated` is False:

In [5]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
2,one,2
3,two,3
5,two,4


Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:

In [6]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
3,two,3,3


`duplicated` and `drop_duplicates` keep the first observed value combination by default. Passing `keep='last'` keeps the last one instead:

In [7]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
1,one,1,1
2,one,2,2
4,two,3,4
6,two,4,6
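
As a complement to the examples above, the following sketch rebuilds the same small frame and compares the `keep` options. `keep=False` marks every member of a duplicate group rather than sparing one of them; the variable names (`mask_all`, `unique_only`, and so on) are illustrative and not part of the original notes.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False flags all rows that belong to a duplicated group
mask_all = data.duplicated(keep=False)

# Rows that occur exactly once in the frame
unique_only = data[~mask_all]

# The default keeps the first row of each group; keep='last' keeps the last
first_kept = data.drop_duplicates(keep='first')
last_kept = data.drop_duplicates(keep='last')

print(list(unique_only.index))
print(list(first_kept.index))
print(list(last_kept.index))
```

Comparing the printed index labels shows which physical rows survive under each option, which matters when other columns (such as `v1` above) differ between the duplicates.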


## Transforming Data Using a Function or Mapping

In [8]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
                              'corned beef', 'Bacon', 'pastrami', 'honey ham',
                              'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you wanted to add a column indicating the type of animal that each food came from.

In [9]:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

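The `map` method on a Series accepts a function or dict-like object containing a mapping. Because some of the meats are capitalized while the mapping keys are all lowercase, we first convert each value to lowercase with `str.lower`: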

In [10]:
data['animal'] = data['food'].str.lower().map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


We could also have passed a function that does all the work:

In [11]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
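
The two approaches above behave differently when a value has no entry in the mapping. The sketch below illustrates this with a hypothetical food (`'tofu'`) that is deliberately missing from a shortened mapping: passing the dict to `map` yields NaN for it, whereas indexing the dict inside a lambda would raise `KeyError`, so the lambda here uses `dict.get` with an assumed default of `'unknown'`.

```python
import pandas as pd

# 'tofu' is a made-up value that is not present in the mapping
food = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Mapping with a dict: unmatched keys become NaN
by_dict = food.str.lower().map(meat_to_animal)

# Mapping with a function: dict.get supplies a fallback instead of raising
by_func = food.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

print(by_dict.isna().tolist())
print(by_func.tolist())
```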

## Replacing Values

In [12]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

Values such as -999 may be sentinels for missing data. To replace them with NA values that pandas understands, we can use `replace`, which produces a new Series:

In [13]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, pass a list of the values followed by the substitute value:

In [14]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [15]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:

In [16]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
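
A short sketch of why replacing sentinels matters in practice, using the same Series as above: `replace` leaves the original untouched, and once the sentinels are NaN, reductions such as `mean` skip them by default. The name `cleaned` is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# replace returns a new Series; data itself is not modified
cleaned = data.replace({-999: np.nan, -1000: 0})

print(data[1])               # still the sentinel value
print(cleaned.isna().sum())  # number of values now treated as missing
print(cleaned.mean())        # NaN entries are skipped: (1 + 2 + 0 + 3) / 4
```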

## Renaming Axis Indexes
Like a Series, the axis indexes have a `map` method:

In [18]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
transform = lambda x: x[:4].upper()
data.index.map(transform)

array(['OHIO', 'COLO', 'NEW '], dtype=object)

You can assign to `index`, modifying the DataFrame in place:

In [19]:
data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If you want to create a transformed version of a data set without modifying the original, a useful method is `rename`:

In [20]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


Notably, `rename` can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [21]:
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


`rename` saves having to copy the DataFrame manually and assign to its `index` and `columns` attributes. Should you wish to modify a data set in place, pass `inplace=True`:

In [22]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11
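
The following sketch shows that `rename` accepts a function for one axis and a dict for the other in the same call, and that without `inplace=True` the original frame keeps its labels. It rebuilds the frame from the start of this section; `renamed` is an illustrative name.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Function for the index, dict for a subset of the columns
renamed = data.rename(index=str.upper, columns={'three': 'peekaboo'})

print(list(renamed.index))
print(list(renamed.columns))
print(list(data.columns))  # the original labels are unchanged
```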


## Discretization and Binning

In [23]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you have to use `cut`, a function in pandas:

In [24]:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special `Categorical` object. You can treat it like an array of strings indicating the bin name; internally it contains a `categories` array indicating the distinct category names along with a labeling for the `ages` data in the `codes` attribute:

In [25]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [26]:
cats.categories

Index([u'(18, 25]', u'(25, 35]', u'(35, 60]', u'(60, 100]'], dtype='object')

In [27]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

Consistent with mathematical notation for intervals, a parenthesis means that the side is open while the square bracket means it is closed (inclusive). Which side is closed can be changed by passing `right=False`:

In [28]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, object): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
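
To make the open and closed sides concrete, the sketch below bins a few boundary ages (chosen for illustration) both ways and inspects the integer codes. A code of -1 means the value fell outside every bin. The `include_lowest` option, which closes the left edge of the first bin, is not used elsewhere in these notes and is shown only as one possible remedy.

```python
import pandas as pd

ages = [18, 25, 26, 35, 36]  # illustrative boundary values

# Default right=True: bins are (18, 25], (25, 35], (35, 60]
right_closed = pd.cut(ages, [18, 25, 35, 60])

# right=False: bins are [18, 26), [26, 36), [36, 61)
left_closed = pd.cut(ages, [18, 26, 36, 61], right=False)

# include_lowest=True also admits the lowest edge (18) into the first bin
with_lowest = pd.cut(ages, [18, 25, 35, 60], include_lowest=True)

print(right_closed.codes.tolist())  # 18 is excluded, so its code is -1
print(left_closed.codes.tolist())
print(with_lowest.codes.tolist())
```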

You can also pass your own bin names by passing a list or array to the `labels` option:

In [29]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

If you pass `cut` an integer number of bins instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data.

In [30]:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.51, 0.72], (0.72, 0.93], (0.086, 0.3], (0.3, 0.51], (0.086, 0.3], ..., (0.086, 0.3], (0.72, 0.93], (0.72, 0.93], (0.086, 0.3], (0.51, 0.72]]
Length: 20
Categories (4, object): [(0.086, 0.3] < (0.3, 0.51] < (0.51, 0.72] < (0.72, 0.93]]

A closely related function, `qcut`, bins the data based on sample quantiles. Depending on the distribution of the data, using `cut` will usually not result in each bin having the same number of data points. Since `qcut` uses sample quantiles instead, by definition you will obtain roughly equal-size bins:

In [31]:
data = np.random.randn(1000)  # Normally distributed
cats = pd.qcut(data, 4)  # Cut into quartiles
cats

[(0.739, 3.74], [-3.207, -0.637], (0.739, 3.74], [-3.207, -0.637], (0.0548, 0.739], ..., (-0.637, 0.0548], [-3.207, -0.637], (0.739, 3.74], (0.0548, 0.739], (0.739, 3.74]]
Length: 1000
Categories (4, object): [[-3.207, -0.637] < (-0.637, 0.0548] < (0.0548, 0.739] < (0.739, 3.74]]

In [32]:
pd.value_counts(cats)

(0.739, 3.74]       250
(0.0548, 0.739]     250
(-0.637, 0.0548]    250
[-3.207, -0.637]    250
dtype: int64

Similar to `cut`, you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [33]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(1.279, 3.74], [-3.207, -1.179], (0.0548, 1.279], (-1.179, 0.0548], (0.0548, 1.279], ..., (-1.179, 0.0548], (-1.179, 0.0548], (0.0548, 1.279], (0.0548, 1.279], (0.0548, 1.279]]
Length: 1000
Categories (4, object): [[-3.207, -1.179] < (-1.179, 0.0548] < (0.0548, 1.279] < (1.279, 3.74]]
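
The contrast between `cut` and `qcut` can be checked directly by counting the members of each bin. This is a minimal sketch on seeded normal data; the seed and the variable names are arbitrary choices for reproducibility and do not come from the original notes.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # arbitrary seed, for reproducibility only
data = rng.randn(1000)

# Four equal-width bins between min and max
equal_width = pd.cut(data, 4)
# Four bins with (roughly) equal numbers of points
equal_count = pd.qcut(data, 4)

width_counts = pd.Series(equal_width).value_counts()
count_counts = pd.Series(equal_count).value_counts()

print(width_counts)  # uneven: the central bins hold most of the points
print(count_counts)  # roughly 250 per bin
```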

## Detecting and Filtering Outliers

In [34]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.103722,-0.054296,-0.028274,-0.008895
std,0.973973,1.020384,0.992144,0.999435
min,-2.93757,-2.879931,-3.209693,-3.095305
25%,-0.544465,-0.734624,-0.701465,-0.670831
50%,0.108698,-0.075327,-0.026093,-0.046242
75%,0.790623,0.614452,0.665148,0.662897
max,3.256732,2.998553,2.836367,3.462821


Suppose you wanted to find values in one of the columns exceeding three in magnitude:

In [35]:
col = data[3]
col[np.abs(col) > 3]

56     3.462821
162    3.111654
511    3.213526
938   -3.095305
Name: 3, dtype: float64

To select all rows having a value exceeding 3 or -3, you can use the `any` method on a boolean DataFrame:

In [36]:
data[(np.abs(data) > 3).any(axis=1)]  # pass axis by keyword; recent pandas versions reject the positional form

Unnamed: 0,0,1,2,3
56,1.231454,-1.206923,0.86404,3.462821
162,0.872046,-1.446569,-0.643924,3.111654
178,-1.724388,0.258003,-3.209693,-1.770807
511,-0.539618,-1.004802,-0.405893,3.213526
583,3.256732,1.04386,0.783482,0.482464
822,0.414631,-0.685493,-3.100293,0.544737
938,-1.972411,-0.03954,0.390369,-3.095305


Here is code to cap values outside the interval -3 to 3:

In [37]:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.103465,-0.054296,-0.027964,-0.009587
std,0.973175,1.020384,0.991187,0.99663
min,-2.93757,-2.879931,-3.0,-3.0
25%,-0.544465,-0.734624,-0.701465,-0.670831
50%,0.108698,-0.075327,-0.026093,-0.046242
75%,0.790623,0.614452,0.665148,0.662897
max,3.0,2.998553,2.836367,3.0


The ufunc `np.sign` returns an array of 1 and -1 depending on the sign of the values (and 0 where a value is exactly zero).
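
As a cross-check, the capping step can also be written with `DataFrame.clip`, which is not used in the original notes. The sketch below applies both approaches to a copy of seeded random data (the seed is arbitrary) and compares the results.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(12345)  # arbitrary seed
data = pd.DataFrame(rng.randn(1000, 4))

# Approach from the text: overwrite values beyond +/-3 with sign * 3
capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Alternative: clip limits every value to the interval [-3, 3]
clipped = data.clip(lower=-3, upper=3)

print(np.allclose(capped.values, clipped.values))
print(float(np.abs(capped.values).max()))
```

Working on a copy keeps the raw data available, which is useful if the cap threshold needs to be revisited later.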

## Permutation and Random Sampling
Calling `permutation` with the length of the axis you want to permute produces an array of integers indicating the new ordering:

In [38]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler

array([1, 4, 3, 0, 2])

That array can then be used in `iloc`-based indexing or the equivalent `take` function:

In [39]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [40]:
df.take(sampler)

Unnamed: 0,0,1,2,3
1,4,5,6,7
4,16,17,18,19
3,12,13,14,15
0,0,1,2,3
2,8,9,10,11


To select a random subset `without replacement`, one way is to slice off the first k elements of the array returned by `permutation`, where k is the desired subset size.

In [41]:
df.take(np.random.permutation(len(df))[:3])

Unnamed: 0,0,1,2,3
3,12,13,14,15
1,4,5,6,7
2,8,9,10,11


To generate a sample `with replacement`, the fastest way is to use `np.random.randint` to draw random integers:

In [42]:
bag = np.array([5, 7, -1, 6, 4])
sampler = np.random.randint(0, len(bag), size=10)
sampler

array([2, 4, 1, 1, 4, 0, 1, 3, 1, 3])

In [43]:
draws = bag.take(sampler)
draws

array([-1,  4,  7,  7,  4,  5,  7,  6,  7,  6])
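
pandas also provides a `sample` method on DataFrame and Series that covers both cases without building the index array by hand. The sketch below is an alternative to the `permutation` and `randint` recipes above, not what the original notes use; `random_state=0` is an arbitrary seed chosen for reproducibility.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# Without replacement: three distinct rows
subset = df.sample(n=3, random_state=0)

# With replacement: more draws than rows, so some rows must repeat
bootstrap = df.sample(n=10, replace=True, random_state=0)

print(subset.index.is_unique)
print(len(bootstrap), bootstrap.index.is_unique)
```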

## Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns consisting entirely of 1’s and 0’s. pandas has a `get_dummies` function for doing this.

In [44]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,0.0,0.0,1.0
4,1.0,0.0,0.0
5,0.0,1.0,0.0


In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. `get_dummies` has a prefix argument for doing this:

In [45]:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0.0,1.0,0.0
1,1,0.0,1.0,0.0
2,2,1.0,0.0,0.0
3,3,0.0,0.0,1.0
4,4,1.0,0.0,0.0
5,5,0.0,1.0,0.0


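If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Consider the MovieLens movies data, where each movie can have several genres: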

In [46]:
mnames = ['movie_id', 'title', 'genres']
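# A multi-character separator such as '::' needs the Python parsing engine;
# naming it explicitly avoids a ParserWarning
movies = pd.read_table('data/movielens/movies.dat', sep='::', header=None,
                       names=mnames, engine='python')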
movies[:10]


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


First, we extract the list of unique genres in the dataset:

In [47]:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:

In [48]:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)

Now, iterate through each movie and set entries in each row of dummies to 1. To do this, we use the `dummies.columns` to compute the column indices for each genre:

In [49]:
gen = movies.genres[0]
gen.split('|')

['Animation', "Children's", 'Comedy']

In [50]:
dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2])

Then, we can use `.iloc` to set values based on these indices:

In [51]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]

movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
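
For delimiter-separated categories like these, pandas string methods offer a shortcut: `Series.str.get_dummies` splits each string on a separator and builds the indicator columns in one step. The sketch below is an alternative to the loop above, not the method the notes use, and it runs on a three-row frame copied from the table shown earlier so that it does not depend on the data file.

```python
import pandas as pd

# Three rows taken from the movies table above
movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Grumpier Old Men (1995)',
              'Sudden Death (1995)'],
    'genres': ["Animation|Children's|Comedy", 'Comedy|Romance', 'Action'],
})

# Split on '|' and create one 0/1 column per distinct genre
dummies = movies['genres'].str.get_dummies('|')
movies_windic = movies.join(dummies.add_prefix('Genre_'))

print(sorted(dummies.columns))
print(movies_windic.iloc[0])
```

The explicit loop remains useful when the set of columns must be fixed in advance, for example to keep the same column order across several files.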

A useful recipe for statistical applications is to combine `get_dummies` with a discretization function like `cut`:

In [52]:
np.random.seed(12345)  # Set the random seed for deterministic results
values = np.random.rand(10)
values

array([ 0.92961609,  0.31637555,  0.18391881,  0.20456028,  0.56772503,
        0.5955447 ,  0.96451452,  0.6531771 ,  0.74890664,  0.65356987])

In [53]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1]"
0,0.0,0.0,0.0,0.0,1.0
1,0.0,1.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0
5,0.0,0.0,1.0,0.0,0.0
6,0.0,0.0,0.0,0.0,1.0
7,0.0,0.0,0.0,1.0,0.0
8,0.0,0.0,0.0,1.0,0.0
9,0.0,0.0,0.0,1.0,0.0
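
The same recipe can be combined with the `labels` option of `cut` to get readable column names. In this sketch the sample values and the label names are made up for illustration, and `dtype=int` is passed because the default dtype of the indicator columns differs between pandas versions.

```python
import numpy as np
import pandas as pd

values = np.array([0.93, 0.32, 0.18, 0.57])  # hypothetical sample values
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
labels = ['very_low', 'low', 'mid', 'high', 'very_high']  # hypothetical names

binned = pd.cut(values, bins, labels=labels)

# One column per bin, including bins that received no values
indicators = pd.get_dummies(binned, dtype=int)

print(indicators)
```

Because the input is categorical, every bin gets a column even when no value falls into it, which keeps the shape of the indicator matrix stable across samples.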
