# Data Cleaning and Preparation

### DO NOT ASSUME GOOD QUALITY OF YOUR DATA

* Data preparation is commonly said to take up 80% or more of an analyst's time. Data may arrive in the wrong format, be of poor quality, or both.
* pandas provides high-level tools to manipulate data into the right form.

## Handling missing data
* For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. It is called a sentinel value and can be easily detected.
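
NaN compares unequal to everything, including itself, so an equality test cannot find it. The following is a minimal sketch (the variable names are illustrative, not from the notebook) of why `isna`/`isnull` is the right detection tool:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# NaN is never equal to anything, not even to itself
nan_equals_itself = (np.nan == np.nan)

# so an equality comparison finds nothing ...
found_by_equality = int((values == np.nan).sum())

# ... while isna() flags the missing entry
found_by_isna = int(values.isna().sum())

print(nan_equals_itself, found_by_equality, found_by_isna)
```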

In [8]:
import pandas as pd
import numpy as np

string_data = pd.Series(['a', 'b', np.nan, 'd'])
string_data

0      a
1      b
2    NaN
3      d
dtype: object

In [9]:
string_data.isnull().sum()  # count the missing values

1

In [10]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

* A missing value in pandas plays the same role as NA in the R language.
* NA may be either data that does not exist or data that exists but was not observed, i.e., missing data.
* Analyse the missing data to identify data collection problems or <b>potential bias</b> due to missing data. For example, if very rich people decline to provide their salary when it is collected, the estimated average salary of the population will be biased downward.
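
A first step in analysing missing data is usually to count it per column. This is a rough sketch; the `survey` table and its values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, invented for illustration
survey = pd.DataFrame({
    'age': [25, 40, 58, 33],
    'salary': [30000.0, np.nan, np.nan, 52000.0],
})

missing_count = survey.isna().sum()   # number of NAs per column
missing_share = survey.isna().mean()  # fraction of NAs per column
print(missing_count)
print(missing_share)
```

A column with a high share of NAs, like 'salary' here, is a candidate for a closer look at how the data was collected.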

In [4]:
string_data[0] = None
string_data

0    None
1       b
2     NaN
3       d
dtype: object

In [5]:
string_data[3] = np.nan
string_data

0    None
1       b
2     NaN
3     NaN
dtype: object

### What is the difference between NaN and None?
#### np.nan is a float value, so it allows vectorized operations, while None is a Python object that forces the object dtype, which basically disables all efficiency in numpy.
#### So repeat 3 times fast: object==bad, float==good
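
The dtype difference can be checked directly. This is a small sketch with illustrative array names:

```python
import numpy as np
import pandas as pd

float_arr = np.array([1.0, np.nan, 3.0])   # NaN fits in a float array
object_arr = np.array([1.0, None, 3.0])    # None forces dtype=object

print(float_arr.dtype, object_arr.dtype)

# pandas converts None to NaN when the rest of the data is numeric
s = pd.Series([1.0, None, 3.0])
print(s.dtype, s.isna().sum())
```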

In [6]:
string_data.isnull()

0     True
1    False
2     True
3     True
dtype: bool

In [7]:
string_data.isna() #exactly same as isnull()

0     True
1    False
2     True
3     True
dtype: bool

### Filtering 'Out' Missing Data
* We always have the option to filter out missing data by hand using 'isnull' (or 'notnull') and boolean indexing.
* The 'dropna' method can be pretty useful too. For a Series, it returns the Series with only the non-null data and index values.
* For a DataFrame, it is a bit more complex. By default, dropna drops any row that contains even one missing value. Passing "how='all'" drops only rows where every value is NA.
* To drop columns instead of rows, pass 'axis=1'.

In [8]:
from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [9]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [10]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                   [NA, NA, NA], [NA, 6.5, 3.]])

cleaned = data.dropna()
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [11]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [12]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [13]:
data[4] = NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [14]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


* Another DataFrame cleaning method is often used with time series data.
* To keep only rows with a certain number of observations, use the 'thresh' argument.
* thresh=N keeps a row only if it has at least N non-NaN values (with 'axis=1', the same rule applies to columns). In the example below, rows 0 and 1 each have only one non-NaN value, so 'thresh=2' drops them, while rows 2 and 3 have two non-NaN values and survive.
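
Because the random data in the cell below changes on every run, here is a small deterministic sketch of 'thresh' on rows (the default) and on columns. The `frame` table is made up for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [np.nan, 5.0, 6.0],
    'c': [np.nan, np.nan, 9.0],
})

rows_kept = frame.dropna(thresh=2)          # rows with >= 2 non-NaN values
cols_kept = frame.dropna(axis=1, thresh=2)  # columns with >= 2 non-NaN values

print(rows_kept.index.tolist())
print(cols_kept.columns.tolist())
```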

In [15]:
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA

df

Unnamed: 0,0,1,2
0,0.45362,,
1,-1.833472,,
2,-1.210399,,-1.163364
3,0.723744,,0.591394
4,-1.557714,-1.351764,-0.074159
5,-1.440924,1.003015,-1.146151
6,0.990508,1.131964,-0.48035


In [16]:
df.dropna()

Unnamed: 0,0,1,2
4,-1.557714,-1.351764,-0.074159
5,-1.440924,1.003015,-1.146151
6,0.990508,1.131964,-0.48035


In [17]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,-1.210399,,-1.163364
3,0.723744,,0.591394
4,-1.557714,-1.351764,-0.074159
5,-1.440924,1.003015,-1.146151
6,0.990508,1.131964,-0.48035


### Filling In Missing Data
* Rather than removing NAs and discarding potentially important information in the same rows, we can fill in the NAs in different ways.
* 'fillna' is the workhorse method: the constant we pass replaces the missing values.
* If we call fillna with a dict, we can use a different fill value for each column.

In [18]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.45362,0.0,0.0
1,-1.833472,0.0,0.0
2,-1.210399,0.0,-1.163364
3,0.723744,0.0,0.591394
4,-1.557714,-1.351764,-0.074159
5,-1.440924,1.003015,-1.146151
6,0.990508,1.131964,-0.48035


In [19]:
df.fillna({1:0.5, 2: 0}) #each column may have different value to fill NA

Unnamed: 0,0,1,2
0,0.45362,0.5,0.0
1,-1.833472,0.5,0.0
2,-1.210399,0.5,-1.163364
3,0.723744,0.5,0.591394
4,-1.557714,-1.351764,-0.074159
5,-1.440924,1.003015,-1.146151
6,0.990508,1.131964,-0.48035


* By default, fillna returns a new object, but passing 'inplace=True' modifies the existing object instead.
* The interpolation methods used for reindexing, like 'ffill', can also be used for filling.
* fillna allows you to do lots of creative things, like filling with mean or median values.
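
As a sketch of the mean/median idea for a whole DataFrame (the `frame` table is invented for illustration): passing the Series returned by `mean()` fills each column with its own mean.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [np.nan, 10.0, 20.0],
})

# frame.mean() is a Series indexed by column name, so each column
# receives its own mean as the fill value
filled_mean = frame.fillna(frame.mean())
filled_median = frame.fillna(frame.median())
print(filled_mean)
```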

In [20]:
_ = df.fillna(0, inplace=True)

df

Unnamed: 0,0,1,2
0,0.45362,0.0,0.0
1,-1.833472,0.0,0.0
2,-1.210399,0.0,-1.163364
3,0.723744,0.0,0.591394
4,-1.557714,-1.351764,-0.074159
5,-1.440924,1.003015,-1.146151
6,0.990508,1.131964,-0.48035


In [21]:
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.767574,-1.508552,0.726299
1,-0.481561,0.319287,-1.385299
2,0.150983,,2.406146
3,1.180724,,1.018532
4,-0.26726,,
5,-1.005461,,


In [22]:
df.ffill()  # 'ffill' stands for 'forward fill': it propagates the last valid observation forward (same result as the older fillna(method='ffill'))

Unnamed: 0,0,1,2
0,-0.767574,-1.508552,0.726299
1,-0.481561,0.319287,-1.385299
2,0.150983,0.319287,2.406146
3,1.180724,0.319287,1.018532
4,-0.26726,0.319287,1.018532
5,-1.005461,0.319287,1.018532


In [23]:
df.ffill(limit=2)  # forward fill at most 2 consecutive NaNs

Unnamed: 0,0,1,2
0,-0.767574,-1.508552,0.726299
1,-0.481561,0.319287,-1.385299
2,0.150983,0.319287,2.406146
3,1.180724,0.319287,1.018532
4,-0.26726,,1.018532
5,-1.005461,,1.018532


In [24]:
data = pd.Series([1., NA, 3.5, NA, 7])

data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

In [25]:
(1+3.5+7)/3

3.8333333333333335

### Removing Duplicates (sometimes called "dedup")
* The DataFrame method 'duplicated' returns a boolean Series indicating whether each row is a duplicate (i.e., has been observed in a previous row) or not.
* Similarly, 'drop_duplicates' returns a DataFrame containing the rows for which 'duplicated' is False.

In [11]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                    'k2': [1,1,2,3,3,4,4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [27]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [28]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


* Both of the above methods consider all of the columns by default. You can also specify a subset of columns to use when detecting duplicates.
* By default, both keep the first observation in case of duplicates. We can specify "keep='last'" to keep the last observation instead.
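
Besides 'first' and 'last', the 'keep' argument also accepts False, which flags every member of a duplicate group. This is a small sketch on made-up data, useful when you want to inspect duplicates before dropping them:

```python
import pandas as pd

frame = pd.DataFrame({'k1': ['one', 'two', 'one', 'two'],
                      'k2': [1, 1, 1, 2]})

# keep=False marks every member of a duplicate group, not only the repeats
all_dupes = frame.duplicated(subset=['k1', 'k2'], keep=False)
print(frame[all_dupes])
```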

In [12]:
data['v1'] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [13]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [30]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6


### Transforming Data Using a Function or Mapping
* We sometimes need to make transformations based on the values in an array, Series, or DataFrame column.
* We can use the map method with a function or a dict-like object containing the mapping to add or change a column.
* Sometimes the values in the column we map from differ in case from the keys of our mapping. In that situation, we can convert all the values to lowercase first, or just pass a function that does it for us.
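
One difference between the two approaches shows up when a value has no entry in the mapping. A minimal sketch, using a deliberately incomplete, hypothetical mapping:

```python
import pandas as pd

foods = pd.Series(['bacon', 'Bacon', 'tofu'])
to_animal = {'bacon': 'pig'}  # deliberately incomplete mapping

# A dict passed to map leaves unmatched keys as NaN
mapped = foods.str.lower().map(to_animal)

# A function that indexes the dict would raise KeyError on 'tofu',
# so dict.get is used here to fall back to None instead
mapped_via_function = foods.map(lambda x: to_animal.get(x.lower()))
print(mapped)
```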

In [31]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                             'Pastrami', 'corned beef', 'Bacon',
                            'pastrami', 'honey ham', 'nova lox'],
                    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [32]:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

#### String methods

* Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array.
* Perhaps most importantly, **these methods skip missing/NA values automatically** (the result for a missing entry is simply NA).
* These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods.
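
A short sketch of the NA handling, with an illustrative Series: calling a plain Python string method on NaN would fail, whereas the `str` accessor passes the missing entry through.

```python
import numpy as np
import pandas as pd

names = pd.Series(['Bacon', np.nan, 'Nova Lox'])

lowered = names.str.lower()  # NaN stays missing, no exception is raised
lengths = names.str.len()    # the length of the missing entry is NaN
print(lowered)
print(lengths)
```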

In [36]:
data['food'].str

<pandas.core.strings.accessor.StringMethods at 0x265b62edb90>

In [31]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [32]:
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [33]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

In [34]:
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


### Replacing Values
* The fillna method is a special case of more general value replacement.
* The map method can be used to modify a subset of values, but 'replace' provides a simpler and more flexible way to do so.
* Passing the sentinel (or garbage) value followed by the replacement value creates a new object with the values replaced.
* If we want in-place replacement, use "inplace=True".

In [35]:
data = pd.Series([1., -999., 2., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3   -1000.0
4       3.0
dtype: float64

In [36]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3   -1000.0
4       3.0
dtype: float64

* To replace multiple values with a single value, pass a list followed by the substitute value.
* To use a different replacement for each value, pass a list of substitutes of the same length.
* You can also pass a dict that maps each value to its substitute.
* NOTE - 'data.replace' is different from 'data.str.replace'. The latter performs element-wise string substitution.
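
The difference noted in the last bullet can be seen on a small, made-up Series of strings:

```python
import pandas as pd

codes = pd.Series(['a-1', 'b-2', '-'])

# Series.replace matches whole values by default
whole_value = codes.replace('-', 'missing')

# Series.str.replace substitutes inside each string
inside_strings = codes.str.replace('-', '_', regex=False)
print(whole_value.tolist(), inside_strings.tolist())
```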

In [37]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
dtype: float64

In [38]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    0.0
4    3.0
dtype: float64

In [39]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    0.0
4    3.0
dtype: float64

### Renaming Axis Indexes
* Just like values, axis labels can be transformed by a function or mapping to produce differently labeled objects.
* We can also modify the axes in-place without creating a new data structure.

In [40]:
data = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index = ['Ohio', 'Colorado', 'New York'],
                   columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [41]:
transform = lambda x: x[:4].upper()

data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [42]:
data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


* To get a transformed version of a dataset **without modifying the original**, use 'rename'.
* It can also be used with a dict-like object that provides new values for a subset of the axis labels.
* It saves you from copying the DataFrame manually and then assigning to its index and columns. To modify in-place, use the parameter 'inplace=True'.

In [43]:
data.rename(index=str.title, columns=str.upper)
# data.rename(index=str.title, columns=lambda x: x+"_test")

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [44]:
data.rename(index={'OHIO':"INDIANA"},
           columns = {'three':'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [45]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


### Discretization and Binning
* Continuous data is often discretized, or separated into 'bins', for analysis.
* To bin a set of continuous data, use the 'cut' function from pandas.
* In the example below, we bin a set of ages into the groups 18 to 25, 26 to 35, 36 to 60, and 61 and older.

In [46]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)  # named 'cats' because the cells below refer to it by that name
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

* pandas returns a special Categorical object from the cut function.
* The output describes the bin that each element falls in. You can treat it like an array of strings giving the bin name of each element.
* Internally, the output contains a 'categories' array specifying the distinct category names, along with a labeling for the 'ages' data in the 'codes' attribute.

In [47]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [48]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [49]:
pd.Series(cats).value_counts()  # same counts as the older top-level pd.value_counts(cats)

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

* The interval notation used by cut is consistent with mathematical notation. A parenthesis means that the side is open and a square bracket means that it is closed (inclusive).
* We can change which side is closed by passing 'right=False'.
* We can supply our own bin names by passing a list or array to the 'labels' option.

In [50]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

In [51]:
group_names = ['Youth', 'YoungAdult', 'MiddleAges', 'Senior']
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAges', 'MiddleAges', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAges' < 'Senior']

* Instead of specifying explicit bin edges, we can pass an integer number of bins to get that many equal-length bins <b>based on the min and max values of the data</b>.
* The 'precision' parameter limits the decimal precision of the bin edges. 'precision=2' limits it to 2 digits.

In [52]:
data = np.random.rand(20)

pd.cut(data, 4, precision=2)

[(0.52, 0.74], (0.74, 0.96], (0.075, 0.3], (0.3, 0.52], (0.075, 0.3], ..., (0.52, 0.74], (0.075, 0.3], (0.74, 0.96], (0.74, 0.96], (0.52, 0.74]]
Length: 20
Categories (4, interval[float64, right]): [(0.075, 0.3] < (0.3, 0.52] < (0.52, 0.74] < (0.74, 0.96]]

#### cut has a closely related function, 'qcut', that bins data based on sample quantiles.
* Depending on the distribution of the data, using cut will usually not result in <b>each bin having the same number of data points.</b>
* But as qcut uses sample quantiles, you will obtain roughly equal-size bins.
* We can even pass our own quantiles (numbers between 0 and 1, inclusive) to qcut.
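
The contrast is easiest to see on skewed data. A minimal sketch, where the `skewed` values are made up and contain one extreme point:

```python
import pandas as pd

skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# cut: two bins of equal width between the min and the max
equal_width = pd.cut(skewed, 2).value_counts(sort=False)

# qcut: two bins split at the sample median
equal_count = pd.qcut(skewed, 2).value_counts(sort=False)
print(equal_width.tolist(), equal_count.tolist())
```

With cut, the single extreme value gets a bin almost to itself, while qcut splits the ten points evenly.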

In [53]:
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats

[(-2.8209999999999997, -0.691], (0.697, 2.96], (-2.8209999999999997, -0.691], (-0.0268, 0.697], (-0.0268, 0.697], ..., (-2.8209999999999997, -0.691], (-0.691, -0.0268], (0.697, 2.96], (0.697, 2.96], (-2.8209999999999997, -0.691]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.8209999999999997, -0.691] < (-0.691, -0.0268] < (-0.0268, 0.697] < (0.697, 2.96]]

In [54]:
pd.Series(cats).value_counts()  # each bin has the same count

(-2.8209999999999997, -0.691]    250
(-0.691, -0.0268]                250
(-0.0268, 0.697]                 250
(0.697, 2.96]                    250
dtype: int64

In [55]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-2.8209999999999997, -1.28], (1.267, 2.96], (-2.8209999999999997, -1.28], (-0.0268, 1.267], (-0.0268, 1.267], ..., (-2.8209999999999997, -1.28], (-1.28, -0.0268], (-0.0268, 1.267], (1.267, 2.96], (-1.28, -0.0268]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.8209999999999997, -1.28] < (-1.28, -0.0268] < (-0.0268, 1.267] < (1.267, 2.96]]

### Detecting and Filtering Outliers
* Filtering or transforming outliers is mostly a matter of applying array operations.
* To find values exceeding a threshold in absolute value, use boolean indexing together with functions like 'abs()', as required.
* To select all rows having at least one value that exceeds a threshold, use the 'any' method on the boolean DataFrame with 'axis=1' (or "axis='columns'").
* Values can also be set based on these criteria, so you can cap values that fall outside an interval or threshold.
* The 'np.sign()' function returns 1 or -1 depending on whether each value is positive or negative.
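
The steps above can be sketched on a tiny, hand-made frame, so the result does not depend on random data. The `clip` method shown at the end is an alternative way to cap values and is not used elsewhere in this notebook:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [-4.2, 0.5, 3.7], 'b': [1.0, -1.1, 2.9]})

# Rows with at least one value outside [-3, 3]
outlier_rows = frame[(frame.abs() > 3).any(axis=1)]

# Cap with np.sign, as described above ...
capped = frame.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# ... or with the built-in clip method, which gives the same result here
clipped = frame.clip(lower=-3, upper=3)
print(clipped)
```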

In [15]:
data = pd.DataFrame(np.random.randn(1000, 10))
data.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,0.01948,-0.025731,-0.001563,0.021673,-0.075158,0.004238,-0.039418,0.015128,-0.020907,-0.053543
std,1.061362,0.999677,0.976298,0.984215,0.969858,0.996623,0.959133,1.000497,0.985763,1.028341
min,-3.424189,-2.737034,-2.997512,-3.790986,-2.929871,-3.259187,-3.245552,-2.707198,-3.29577,-3.018881
25%,-0.688316,-0.75266,-0.65413,-0.621561,-0.744031,-0.645981,-0.718353,-0.676601,-0.685393,-0.75807
50%,0.021735,-0.040243,-0.031316,0.030796,-0.071831,0.023172,-0.015025,-0.013618,0.016702,-0.03353
75%,0.742173,0.620024,0.631223,0.713208,0.587569,0.663703,0.608041,0.705518,0.662919,0.626668
max,3.112621,2.672284,2.900661,2.883422,2.914465,3.250909,2.708158,3.071881,2.761528,3.194761


In [16]:
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.963219,-1.783863,-1.953851,0.777673,-0.425421,0.350850,1.151583,-1.018415,0.795849,1.144987
1,-1.756073,-1.736723,1.446913,-0.241217,-0.635583,0.501397,-0.017901,0.730742,-0.919426,-0.732066
2,1.529311,1.708207,-0.291036,-1.527785,-0.824747,-1.624641,-0.809000,-0.243516,-0.561359,0.847795
3,-0.439707,0.533734,-2.531073,-0.111463,-1.516860,-3.259187,0.764269,1.341833,-0.894433,-2.213260
4,1.221181,0.516359,-0.817971,-2.285367,-1.813022,-1.467670,-0.660212,0.793322,-1.306187,-0.407413
...,...,...,...,...,...,...,...,...,...,...
995,-0.341332,0.609882,-1.025395,0.241590,-0.737908,-1.425613,0.101672,0.154599,-0.634627,0.566285
996,-1.014023,-0.420599,-0.668563,0.128325,-0.506338,-1.211656,-0.576944,-0.368268,-0.597135,-1.950333
997,1.483011,0.013538,0.159707,-0.642934,-2.210136,0.698386,0.484624,-0.513643,-1.523177,-1.825329
998,-0.435337,0.657692,-0.954776,1.061046,0.552612,0.069969,-0.248030,-0.555209,0.445375,0.338914


In [20]:
# data=data*10
col = data[5]
col

0      0.350850
1      0.501397
2     -1.624641
3     -3.259187
4     -1.467670
         ...   
995   -1.425613
996   -1.211656
997    0.698386
998    0.069969
999   -0.805043
Name: 5, Length: 1000, dtype: float64

In [21]:
col[np.abs(col) > 3]

3     -3.259187
676    3.250909
Name: 5, dtype: float64

In [22]:
(np.abs(data)>3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,True,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,False,False,False,False,False
996,False,False,False,False,False,False,False,False,False,False
997,False,False,False,False,False,False,False,False,False,False
998,False,False,False,False,False,False,False,False,False,False


In [23]:
(np.abs(data)>3).any(axis='columns')

0      False
1      False
2      False
3       True
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Length: 1000, dtype: bool

In [24]:
data[(np.abs(data) > 3).any(axis='columns')]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
3,-0.439707,0.533734,-2.531073,-0.111463,-1.51686,-3.259187,0.764269,1.341833,-0.894433,-2.21326
13,-0.951366,1.056374,1.45085,1.263329,-0.775593,1.366444,-0.880215,-0.123336,-3.094567,2.473757
93,1.054432,0.325522,-0.268746,1.684932,0.271686,0.617536,-0.539557,0.120378,-3.074202,-1.20038
111,3.069836,-0.510177,-0.201143,-0.226509,-2.307458,-2.122461,2.451263,0.451879,0.316294,-0.647779
133,-3.403942,0.100852,-0.75125,-0.452448,-1.162245,2.057683,1.545381,-0.18546,-0.686959,-1.034519
303,-1.396295,1.471188,-0.288988,1.201568,1.226989,0.516654,0.12659,3.009832,1.119952,0.146831
314,1.349915,0.336991,-0.564487,-3.790986,-2.540229,-0.673561,-0.89919,-1.335036,0.225082,0.20194
362,-1.095624,0.23985,-1.201595,0.163812,0.185223,0.384306,0.809605,3.071881,1.176817,-0.625292
506,0.012799,0.906924,-0.76571,-0.035952,-1.672983,0.531914,-1.193088,0.002971,0.157926,-3.018881
567,-0.468255,0.350055,-2.555257,-0.338477,0.149714,-1.303649,-3.245552,-0.942283,0.958461,0.441696


In [25]:
# Capping outside -3 to 3
data[np.abs(data) > 3] = np.sign(data) * 3 
data.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,0.020126,-0.025731,-0.001563,0.022654,-0.075158,0.004246,-0.039173,0.015046,-0.020443,-0.053718
std,1.058316,0.999677,0.976298,0.980857,0.969858,0.995019,0.958342,1.000251,0.984305,1.027689
min,-3.0,-2.737034,-2.997512,-3.0,-2.929871,-3.0,-3.0,-2.707198,-3.0,-3.0
25%,-0.688316,-0.75266,-0.65413,-0.621561,-0.744031,-0.645981,-0.718353,-0.676601,-0.685393,-0.75807
50%,0.021735,-0.040243,-0.031316,0.030796,-0.071831,0.023172,-0.015025,-0.013618,0.016702,-0.03353
75%,0.742173,0.620024,0.631223,0.713208,0.587569,0.663703,0.608041,0.705518,0.662919,0.626668
max,3.0,2.672284,2.900661,2.883422,2.914465,3.0,2.708158,3.0,2.761528,3.0


In [26]:
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.963219,-1.783863,-1.953851,0.777673,-0.425421,0.350850,1.151583,-1.018415,0.795849,1.144987
1,-1.756073,-1.736723,1.446913,-0.241217,-0.635583,0.501397,-0.017901,0.730742,-0.919426,-0.732066
2,1.529311,1.708207,-0.291036,-1.527785,-0.824747,-1.624641,-0.809000,-0.243516,-0.561359,0.847795
3,-0.439707,0.533734,-2.531073,-0.111463,-1.516860,-3.000000,0.764269,1.341833,-0.894433,-2.213260
4,1.221181,0.516359,-0.817971,-2.285367,-1.813022,-1.467670,-0.660212,0.793322,-1.306187,-0.407413
...,...,...,...,...,...,...,...,...,...,...
995,-0.341332,0.609882,-1.025395,0.241590,-0.737908,-1.425613,0.101672,0.154599,-0.634627,0.566285
996,-1.014023,-0.420599,-0.668563,0.128325,-0.506338,-1.211656,-0.576944,-0.368268,-0.597135,-1.950333
997,1.483011,0.013538,0.159707,-0.642934,-2.210136,0.698386,0.484624,-0.513643,-1.523177,-1.825329
998,-0.435337,0.657692,-0.954776,1.061046,0.552612,0.069969,-0.248030,-0.555209,0.445375,0.338914


In [63]:
np.sign(data).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,-1.0,1.0,-1.0,1.0,-1.0,-1.0,-1.0,1.0,1.0
1,1.0,-1.0,1.0,1.0,1.0,-1.0,1.0,1.0,-1.0,-1.0
2,1.0,1.0,1.0,-1.0,-1.0,1.0,1.0,-1.0,1.0,1.0
3,1.0,1.0,1.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,-1.0
4,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0


### Permutation and Random Sampling
* We can easily permute (randomly reorder) a Series or the rows of a DataFrame using 'numpy.random.permutation'.
* Calling it with the length of the axis you want to permute creates an array of integers indicating the new ordering.
* We can then use that array in iloc-based indexing or with the equivalent 'take' function.
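
If the shuffle needs to be repeatable, one option is a seeded NumPy generator. This is a sketch: the seed value 42 is arbitrary, and `default_rng` is swapped in here for the module-level `np.random.permutation`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)))

rng = np.random.default_rng(42)       # seeded so the shuffle is repeatable
order = rng.permutation(len(frame))   # a random ordering of 0..3

shuffled_take = frame.take(order)
shuffled_iloc = frame.iloc[order]     # same rows in the same order
print(shuffled_take)
```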

In [64]:
df = pd.DataFrame(np.arange(5*4).reshape((5,4)))
print(df)
sampler = np.random.permutation(5)
sampler

    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19


array([2, 0, 3, 1, 4])

In [65]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [66]:
df.take(sampler) #reshuffle by the new index

Unnamed: 0,0,1,2,3
2,8,9,10,11
0,0,1,2,3
3,12,13,14,15
1,4,5,6,7
4,16,17,18,19


In [67]:
df.iloc[sampler] #it does the same thing

Unnamed: 0,0,1,2,3
2,8,9,10,11
0,0,1,2,3
3,12,13,14,15
1,4,5,6,7
4,16,17,18,19


* Use the 'sample' method to select a random subset <b>without replacement.</b>
* To generate a sample with replacement (i.e., allowing repeat choices), pass 'replace=True'.
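
Both modes accept a 'random_state' argument for reproducible draws. A small sketch, with an arbitrary seed of 0:

```python
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])

# Without replacement: each row can be picked at most once
subset = choices.sample(n=3, random_state=0)

# With replacement: n may exceed the length of the data
draws = choices.sample(n=10, replace=True, random_state=0)
print(subset.index.is_unique, len(draws))
```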

In [68]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


In [69]:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws

3    6
3    6
4    4
4    4
2   -1
2   -1
4    4
4    4
3    6
1    7
dtype: int64

### Computing Indicator / Dummy Variables
* Another transformation is to <b>convert a categorical variable into a 'dummy' or 'indicator' matrix.</b>
* If a column has k distinct values, we can derive a matrix or a DataFrame with k columns all containing 1s and 0s.
* pandas has the 'get_dummies' function to do this.
* You may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with other data. get_dummies has the 'prefix' argument to do this.
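
Recent pandas versions return boolean indicator columns by default. If 0/1 integers are wanted, as in the outputs shown in this notebook, the 'dtype' argument can request them. A minimal sketch on a shortened version of the example data:

```python
import pandas as pd

frame = pd.DataFrame({'key': ['b', 'b', 'a', 'c'], 'data1': range(4)})

# dtype=int requests 0/1 integers instead of True/False
dummies = pd.get_dummies(frame['key'], prefix='key', dtype=int)
with_dummies = frame[['data1']].join(dummies)
print(with_dummies)
```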

In [70]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                  'data1': range(6)})
print(df)
pd.get_dummies(df['key']) #transform categorical data into numeric data

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5


Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [71]:
dummies = pd.get_dummies(df['key'], prefix='key')
print(dummies)
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0


Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0


#### But if a row belongs to multiple categories, then creating dummies becomes more complicated.
* Adding indicator variables for a column like genres involves a bit of data wrangling.
* First we extract all the unique genre values in the DataFrame.

In [72]:
mnames = ['movie_id', 'title', 'genres']

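# a multi-character separator is treated as a regex, which requires the python engine
movies = pd.read_table('./movies.dat', sep='::', engine='python',
                      header=None, names=mnames, encoding='latin-1')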
movies[:10]


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [73]:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
    
genres = pd.unique(all_genres)
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

* One way to construct the indicator DataFrame is to start with a DataFrame of all zeros.
* Then iterate through each movie and set the entries in the corresponding row of 'dummies' to 1. For this, we use 'dummies.columns' to compute the column indices for each genre.
* Then we use iloc to set values based on those indices. After that, we combine this indicator DataFrame with the original DataFrame.
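
For delimiter-separated labels like these, pandas also offers a shortcut, `Series.str.get_dummies`. The sketch below uses it in place of the explicit loop, on a tiny stand-in table (the titles and genres are invented) rather than the movies file:

```python
import pandas as pd

# Small stand-in for the movies table, invented for illustration
films = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C'],
    'genres': ['Comedy|Romance', 'Action', 'Action|Comedy'],
})

# str.get_dummies splits on the separator and builds one 0/1 column per label
indicators = films['genres'].str.get_dummies(sep='|')
films_with_indicators = films.join(indicators.add_prefix('Genre_'))
print(films_with_indicators)
```

The explicit loop remains useful for seeing what happens under the hood, and it gives control over the column order.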

In [74]:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
dummies

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3880,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3881,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [75]:
gen = movies.genres[0]
gen.split('|')

['Animation', "Children's", 'Comedy']

In [76]:
dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2], dtype=int64)

In [77]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1 #broadcast 1 to the selected locations

In [78]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[1]

movie_id                                        2
title                              Jumanji (1995)
genres               Adventure|Children's|Fantasy
Genre_Animation                               0.0
Genre_Children's                              1.0
Genre_Comedy                                  0.0
Genre_Adventure                               1.0
Genre_Fantasy                                 1.0
Genre_Romance                                 0.0
Genre_Drama                                   0.0
Genre_Action                                  0.0
Genre_Crime                                   0.0
Genre_Thriller                                0.0
Genre_Horror                                  0.0
Genre_Sci-Fi                                  0.0
Genre_Documentary                             0.0
Genre_War                                     0.0
Genre_Musical                                 0.0
Genre_Mystery                                 0.0
Genre_Film-Noir                               0.0


In [79]:
movies_windic.head()

Unnamed: 0,movie_id,title,genres,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Adventure,Genre_Fantasy,Genre_Romance,Genre_Drama,...,Genre_Crime,Genre_Thriller,Genre_Horror,Genre_Sci-Fi,Genre_Documentary,Genre_War,Genre_Musical,Genre_Mystery,Genre_Film-Noir,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


* A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut. The result shows which bin each value falls in.
* For the example below, we set the random seed to make the example reproducible.

In [27]:
np.random.seed(12345)
values = np.random.rand(10)
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [28]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.cut(values, bins)

[(0.8, 1.0], (0.2, 0.4], (0.0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.4, 0.6], (0.8, 1.0], (0.6, 0.8], (0.6, 0.8], (0.6, 0.8]]
Categories (5, interval[float64, right]): [(0.0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1.0]]

In [81]:
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


### Today's homework

1. Load "msleep.csv" data into dataframe.
2. Split the dataframe into 3 dataframes - the good (rows without any NaN), the bad (rows with exactly one NaN), the ugly (rows with multiple NaNs).
3. Using the "bad" dataframe, fill each NaN with the average value (numeric columns) or the most frequent value (categorical columns) of its column.
4. Using the "good" dataframe, convert the column "order" into dummies with the prefix "order_".
5. Using the "good" dataframe, cut "bodywt" into 10 bins and return the counts in each bin.
6. Using the "good" dataframe, cap "bodywt" at a maximum of 100.
7. Using the <b>filled</b> "bad" dataframe, cut "bodywt" into 10 bins and return the counts in each bin.