
---
# Handling Numerical Data

### Introduction
This chapter covers numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms.
### 4.1 Rescaling a Feature
Use scikit-learn's `MinMaxScaler` to rescale a feature array:

In [1]:
import numpy as np
from sklearn import preprocessing

# create a feature
feature = np.array([
    [-500.5],
    [-100.1],
    [0],
    [100.1],
    [900.9]
])

# create scaler
minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))

# scale feature
scaled_feature = minmax_scaler.fit_transform(feature)

scaled_feature

array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])

#### Discussion
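Rescaling is a common preprocessing task in machine learning. Many of the algorithms described later in this book assume that all features are on the same scale, typically 0 to 1 or -1 to 1. There are a number of rescaling techniques, but one of the simplest is called min-max scaling. Min-max scaling uses the minimum and maximum values of a feature to rescale its values to within a range. For a feature vector $x$ with individual elements $x_i$, the rescaled element $x_i'$ is: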
$$
x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}
$$
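
As a quick check of the formula, here is a minimal sketch that applies min-max scaling by hand with NumPy to the same values used in the solution. The helper name `min_max_rescale` is made up for this illustration and is not part of scikit-learn.

```python
import numpy as np

def min_max_rescale(x):
    """Rescale a 1-D array to the range [0, 1] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Same values as the feature in the solution
feature_values = np.array([-500.5, -100.1, 0, 100.1, 900.9])
rescaled = min_max_rescale(feature_values)
print(rescaled)
```

The result should agree with the `MinMaxScaler` output, up to floating-point rounding.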



#### See Also
* Feature scaling, Wikipedia (https://en.wikipedia.org/wiki/Feature_scaling)

### 4.2 Standardizing a Feature
scikit-learn's `StandardScaler` transforms a feature to have a mean of 0 and a standard deviation of 1.

In [2]:
import numpy as np
from sklearn import preprocessing

# create a feature
feature = np.array([
    [-1000.1],
    [-200.2],
    [500.5],
    [600.6],
    [9000.9]
])

# create scaler
scaler = preprocessing.StandardScaler()

# transform the feature
standardized = scaler.fit_transform(feature)

standardized

array([[-0.76058269],
       [-0.54177196],
       [-0.35009716],
       [-0.32271504],
       [ 1.97516685]])

#### Discussion
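A common alternative to min-max scaling is to rescale features so that they are approximately standard normally distributed. To achieve this, we use standardization to transform the data so that it has a mean, $\bar{x}$, of 0 and a standard deviation, $\sigma$, of 1. Specifically, each element in the feature is transformed as follows: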
$$
x_i' = \frac{x_i - \bar{x}}{\sigma}
$$

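Standardization is a common go-to scaling method for machine learning preprocessing and is used more often than min-max scaling. However, the right choice depends on the learning algorithm. For example, principal component analysis often works better with standardization, while min-max scaling is often recommended for neural networks. As a general rule, default to standardization, which is also known as computing z-scores. We can check the effect of standardization by looking at the mean and standard deviation of the solution's output: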

In [4]:
print("Mean {}".format(round(standardized.mean())))
print("Standard Deviation: {}".format(standardized.std()))

Mean 0.0
Standard Deviation: 1.0
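
The same z-scores can be reproduced by hand, which makes the formula concrete. This is a minimal sketch that assumes the population standard deviation (NumPy's default, `ddof=0`), which is the estimator `StandardScaler` uses.

```python
import numpy as np

# Same values as the feature in the solution
x = np.array([-1000.1, -200.2, 500.5, 600.6, 9000.9])

# Subtract the mean and divide by the population standard deviation (ddof=0)
z_scores = (x - x.mean()) / x.std()
print(z_scores)
```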


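If our data has significant outliers, they can negatively impact our standardization by affecting the feature's mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and interquartile range. In scikit-learn, we do this using the `RobustScaler` class: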

In [3]:
# create scaler
robust_scaler = preprocessing.RobustScaler()

# transform feature
robust_scaler.fit_transform(feature)

array([[-1.87387612],
       [-0.875     ],
       [ 0.        ],
       [ 0.125     ],
       [10.61488511]])
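
With its default settings, `RobustScaler` centers each feature on its median and scales it by the interquartile range (the 25th to 75th percentile range). A minimal sketch that reproduces the calculation by hand with NumPy:

```python
import numpy as np

# Same values as the feature in the solution
x = np.array([-1000.1, -200.2, 500.5, 600.6, 9000.9])

# Center on the median and scale by the interquartile range (IQR)
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust_scaled = (x - median) / (q3 - q1)
print(robust_scaled)
```

Because the median and IQR barely move when a single extreme value is added, the bulk of the data keeps a sensible scale while the outlier remains clearly separated.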

### 4.3 Normalizing Observations
Use scikit-learn's `Normalizer` to rescale the feature values to have unit norm (a total length of 1):

In [5]:
import numpy as np
from sklearn.preprocessing import Normalizer

# create feature matrix
features = np.array([
    [0.5, 0.5],
    [1.1, 3.4]
])

# create normalizer
normalizer = Normalizer(norm="l2")

# transform feature matrix
normalizer.transform(features)

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452]])

#### Discussion
Many rescaling methods operate on features; however, we can also rescale across individual observations. `Normalizer` rescales the values of each individual observation to have unit norm (so that each observation, treated as a vector, has a length of 1). This type of rescaling is often used when we have many equivalent features (e.g., text classification when every word or n-word group is a feature).

`Normalizer` provides three norm options with Euclidean norm (often called L2) being the default:
$$
||x||_2 = \sqrt{x_1^2 + x_2^2 + ... + x_n^2}
$$

where x is an individual observation and x_n is that observation's value for the nth feature.

Alternatively, we can specify Manhattan norm (L1):
$$
||x||_1 = \sum_{i=1}^n{|x_i|}
$$

Intuitively, the L2 norm can be thought of as the distance between two points in New York for a bird (i.e., a straight line), while L1 can be thought of as the distance for a human walking on the street (walk north one block, east one block, north one block, east one block, etc.), which is why it is called the "Manhattan norm" or "Taxicab norm".

Practically, notice that `norm='l1'` rescales an observation's values so that their absolute values sum to 1 (for non-negative values, the values themselves sum to 1), which can sometimes be a desirable quality:

In [7]:
# transform feature matrix
features_l1_norm = Normalizer(norm="l1").transform(features)
print("Sum of the first observation's values: {}".format(features_l1_norm[0,0] + features_l1_norm[0,1]))
features_l1_norm

Sum of the first observation's values: 1.0


array([[0.5       , 0.5       ],
       [0.24444444, 0.75555556]])
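
To make the two norms concrete, the following sketch rescales a single observation by hand using `np.linalg.norm`. The numbers are illustrative, and a negative value is included on purpose to show that L1 normalization makes the absolute values, not the raw values, sum to 1.

```python
import numpy as np

# One observation with a negative value (illustrative numbers)
observation = np.array([3.0, -4.0])

l2_norm = np.linalg.norm(observation, ord=2)   # sqrt(3^2 + (-4)^2) = 5
l1_norm = np.linalg.norm(observation, ord=1)   # |3| + |-4| = 7

l2_scaled = observation / l2_norm
l1_scaled = observation / l1_norm

print(l2_scaled)                  # unit Euclidean length
print(np.abs(l1_scaled).sum())    # absolute values sum to 1
print(l1_scaled.sum())            # plain sum is not 1 when signs differ
```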

### 4.4 Generating Polynomial and Interaction Features

You want to create polynomial and interaction features.

Even though some choose to create polynomial and interaction features manually, scikit-learn offers a built-in method:

In [9]:
# Load libraries
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Create feature matrix
features = np.array([[2, 4],
                     [2, 3],
                     [2, 8]])

# Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree=2, include_bias=False)

# Create polynomial features
polynomial_interaction.fit_transform(features)


array([[ 2.,  4.,  4.,  8., 16.],
       [ 2.,  3.,  4.,  6.,  9.],
       [ 2.,  8.,  4., 16., 64.]])

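The `degree` parameter determines the maximum degree of the polynomial. For example, `degree=2` creates features raised to the second power, along with the product of each pair of features. For two features, the columns of the output above are, in order:

$$
x_1, x_2, x_1^2, x_1 x_2, x_2^2
$$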

We can restrict the features created to only interaction features by setting interaction_only to True:

In [10]:
interaction = PolynomialFeatures(degree=2,
            interaction_only=True, include_bias=False)
interaction.fit_transform(features)

array([[ 2.,  4.,  8.],
       [ 2.,  3.,  6.],
       [ 2.,  8., 16.]])

#### Discussion

Polynomial features are often created when we want to include the notion that there exists a nonlinear relationship between the features and the target. For example, we might suspect that the effect of age on the probability of having a major medical condition is not constant over time but increases as age increases. We can encode that nonconstant effect in a feature, $x$, by generating that feature’s higher-order forms ($x^2$, $x^3$, etc.).

Additionally, often we run into situations where the effect of one feature is dependent on another feature. A simple example would be if we were trying to predict whether or not our coffee was sweet and we had two features: (1) whether or not the coffee was stirred and (2) if we added sugar. Individually, each feature does not predict coffee sweetness, but the combination of their effects does.

That is, a coffee would only be sweet if the coffee had sugar and was stirred. The effects of each feature on the target (sweetness) are dependent on each other. We can encode that relationship by including an interaction feature that is the product of the individual features.
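
The coffee example can be sketched directly: with binary features, the product of the two columns is 1 only when both conditions hold. The data below is invented for illustration.

```python
import numpy as np

# Columns: stirred (0/1), sugar added (0/1); illustrative data
coffee = np.array([[0, 0],
                   [1, 0],
                   [0, 1],
                   [1, 1]])

# Interaction feature: product of the two columns
stirred_and_sugar = coffee[:, 0] * coffee[:, 1]

# Append the interaction as a third column
coffee_with_interaction = np.column_stack([coffee, stirred_and_sugar])
print(coffee_with_interaction)
```

Only the last row, which is both stirred and sugared, gets an interaction value of 1.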

### 4.5 Transforming Features

You want to make a custom transformation to one or more features.

In scikit-learn, use FunctionTransformer to apply a function to a set of features:

In [11]:
# Load libraries
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Create feature matrix
features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])
    
# Define a simple function
def add_ten(x):
    return x + 10
    
# Create transformer
ten_transformer = FunctionTransformer(add_ten) 

# Transform feature matrix
ten_transformer.transform(features)

array([[12, 13],
       [12, 13],
       [12, 13]])

We can create the same transformation in pandas using apply:

In [12]:
# Load library
import pandas as pd

# Create DataFrame
df = pd.DataFrame(features, columns=["feature_1", "feature_2"])
    
# Apply function
df.apply(add_ten)    

| | feature_1 | feature_2 |
|---|---|---|
| 0 | 12 | 13 |
| 1 | 12 | 13 |
| 2 | 12 | 13 |


#### Discussion

It is common to want to make some custom transformations to one or more features. For example, we might want to create a feature that is the natural log of the values of a different feature. We can do this by creating a function and then mapping it to features using either scikit-learn’s `FunctionTransformer` or pandas’ `apply`. In the solution we created a very simple function, `add_ten`, which added 10 to each input, but there is no reason we could not define a much more complex function.
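
As a sketch of the log transformation mentioned above, an existing NumPy function can be passed straight to `FunctionTransformer`. The feature values are illustrative, and `np.log1p` (which computes log(1 + x)) is used here as an assumption so that zeros would not cause problems; plain `np.log` would work the same way for strictly positive data.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Illustrative, non-negative feature values
features = np.array([[1.0, 10.0],
                     [100.0, 1000.0]])

# np.log1p computes log(1 + x), which is also defined at x = 0
log_transformer = FunctionTransformer(np.log1p)
log_features = log_transformer.fit_transform(features)
print(log_features)
```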


### 4.6 Detecting Outliers

You want to identify extreme observations.

Detecting outliers is unfortunately more of an art than a science. However, a common method is to assume the data is normally distributed and based on that assumption “draw” an ellipse around the data, classifying any observation inside the ellipse as an inlier (labeled as 1) and any observation outside the ellipse as an outlier (labeled as -1):

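The `make_blobs` function generates a simulated dataset for clustering, returning the feature matrix together with the corresponding labels: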

In [19]:
# Load libraries
import numpy as np
from sklearn.covariance import EllipticEnvelope 
from sklearn.datasets import make_blobs

# Create simulated data
features, _ = make_blobs(n_samples = 10,
                         n_features = 2,
                         centers = 1,
                         random_state = 1)

# Replace the first observation's values with extreme values
features[0,0] = 10000
features[0,1] = 10000

# Create detector
outlier_detector = EllipticEnvelope(contamination=.1)

# Fit detector
outlier_detector.fit(features)

# Predict outliers (1 = inlier, -1 = outlier)
outlier_detector.predict(features)

# Show the first feature, which contains the extreme value
feature = features[:,0]
feature


array([ 1.00000000e+04, -2.76017908e+00, -1.61734616e+00, -5.25790464e-01,
        8.52518583e-02, -7.94152277e-01, -1.34052081e+00, -1.98197711e+00,
       -2.18773166e+00, -1.97451969e-01])

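A major limitation of this approach is the need to specify a contamination parameter, which is the proportion of observations that are outliers, a value that we do not know. Think of contamination as our estimate of the cleanliness of our data. If we expect our data to have few outliers, we can set contamination to something small. However, if we believe that the data is very likely to have outliers, we can set it to a higher value.

Instead of looking at observations as a whole, we can look at individual features and identify extreme values in those features using the interquartile range (IQR):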

In [20]:
# Create one feature
feature = features[:,0]

# Create a function to return the indices of outliers
def indices_of_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((x > upper_bound) | (x < lower_bound))

# Run function
indices_of_outliers(feature)

(array([0], dtype=int64),)

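IQR is the difference between the first and third quartiles of a set of data. Outliers are commonly defined as any value more than 1.5 IQRs below the first quartile or more than 1.5 IQRs above the third quartile.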


#### Discussion

There is no single best technique for detecting outliers. Instead, we have a collection of techniques all with their own advantages and disadvantages. Our best strategy is often trying multiple techniques (e.g., both 'EllipticEnvelope' and IQR-based detection) and looking at the results as a whole.

If at all possible, we should take a look at observations we detect as outliers and try to understand them. For example, if we have a dataset of houses and one feature is number of rooms, is an outlier with 100 rooms really a house or is it actually a hotel that has been misclassified? 

### 4.7 Handling Outliers

Typically we have three strategies we can use to handle outliers. 

First, we can drop them:

In [22]:
# Load library
import pandas as pd

# Create DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]

# Filter observations
houses[houses['Bathrooms'] < 20]

| | Price | Bathrooms | Square_Feet |
|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 |
| 1 | 392333 | 3.5 | 2500 |
| 2 | 293222 | 2.0 | 1500 |



Second, we can mark them as outliers and include that marker as a feature:


In [23]:
# Load library
import numpy as np

# Create feature based on boolean condition
houses["Outlier"] = np.where(houses["Bathrooms"] < 20, 0, 1)

# Show data
houses

| | Price | Bathrooms | Square_Feet | Outlier |
|---|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 | 0 |
| 1 | 392333 | 3.5 | 2500 | 0 |
| 2 | 293222 | 2.0 | 1500 | 0 |
| 3 | 4322032 | 116.0 | 48000 | 1 |



Finally, we can transform the feature to dampen the effect of the outlier:
    

In [24]:
# Log feature
houses["Log_Of_Square_Feet"] = np.log(houses["Square_Feet"])

# Show data
houses

| | Price | Bathrooms | Square_Feet | Outlier | Log_Of_Square_Feet |
|---|---|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 | 0 | 7.31322 |
| 1 | 392333 | 3.5 | 2500 | 0 | 7.824046 |
| 2 | 293222 | 2.0 | 1500 | 0 | 7.31322 |
| 3 | 4322032 | 116.0 | 48000 | 1 | 10.778956 |


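#### Discussion

Similar to detecting outliers, there is no hard-and-fast rule for handling them. How we handle them should be based on two considerations. First, we should consider what makes them outliers. If we believe they are errors in the data, such as from a broken sensor or a miscoded value, then we might drop the observation or replace the outlier values with NaN, since we cannot trust those values. However, if we believe the outliers are genuine extreme values (e.g., a house [mansion] with 200 bathrooms), then marking them as outliers or transforming their values is more appropriate.

Second, how we handle outliers should be based on our goal for machine learning. For example, if we want to predict house prices based on features of the house, we might reasonably assume that the price of mansions with over 100 bathrooms is driven by a different dynamic than that of regular family homes. Furthermore, if we are training a model to use as part of an online home loan web application, we might assume that our potential users will not include billionaires looking to buy a mansion.

One additional point: if there are outliers, standardization might not be appropriate because the mean and variance can be highly influenced by them. In this case, use a rescaling method that is more robust against outliers, such as `RobustScaler`.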

### 4.8 Discretizing Features

You have a numerical feature and want to break it up into discrete bins.

Depending on how we want to break up the data, there are two techniques we can use. 

First, we can binarize the feature according to some threshold:


In [25]:
# Load libraries
import numpy as np
from sklearn.preprocessing import Binarizer

# Create feature
age = np.array([[6],
                [12],
                [20],
                [36],
                [65]])

# Create binarizer
binarizer = Binarizer(threshold=18)

# Transform feature
binarizer.fit_transform(age)


array([[0],
       [0],
       [1],
       [1],
       [1]])


Second, we can break up numerical features according to multiple thresholds:


In [26]:
# Bin feature
np.digitize(age, bins=[20,30,64])

array([[0],
       [0],
       [1],
       [2],
       [3]], dtype=int64)

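Note that the values passed to the `bins` parameter denote the left edge of each bin. For example, the first bin (everything below 20) does not include the element with the value 20, only the two smaller values. We can switch this behavior by setting the `right` parameter to `True`: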

In [16]:
# Bin feature
np.digitize(age, bins=[20,30,64], right=True)

array([[0],
       [0],
       [0],
       [2],
       [3]])
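
The integers returned by `digitize` are bin indices, so they can be used to look up a label for each bin. A minimal sketch with the default left-edge behavior; the label strings are invented for this illustration.

```python
import numpy as np

age = np.array([6, 12, 20, 36, 65])
bins = [20, 30, 64]

# Hypothetical label for each of the four bins defined by three edges
labels = np.array(["under 20", "20 to 29", "30 to 63", "64 and over"])

# Each age is replaced by the index of the bin it falls into
bin_indices = np.digitize(age, bins=bins)

# Use the indices to pick the matching label
age_groups = labels[bin_indices]
print(age_groups)
```

Three bin edges always produce four possible bins, so the label array needs one more entry than `bins`.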

#### Discussion

Discretization can be a fruitful strategy when we have reason to believe that a numerical feature should behave more like a categorical feature. For example, we might believe there is very little difference in the spending habits of 19- and 20-year-olds, but a significant difference between 20- and 21-year-olds (the age in the United States when young adults can consume alcohol). In that example, it could be useful to break up individuals in our data into those who can drink alcohol and those who cannot. Similarly, in other cases it might be useful to discretize our data into three or more bins.

In the solution, we saw two methods of discretization—scikit-learn’s Binarizer for two bins and NumPy’s digitize for three or more bins—however, we can also use digitize to binarize features like Binarizer by only specifying a single threshold:

In [18]:
# Bin feature
np.digitize(age, bins=[18])

array([[0],
       [0],
       [1],
       [1],
       [1]])

### 4.9 Grouping Observations Using Clustering

You want to cluster observations so that similar observations are grouped together.

If you know that you have k groups, you can use k-means clustering to group similar observations and output a new feature containing each observation’s group membership:


In [27]:
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# make 50 simulated observations with two features and three clusters
features, _ = make_blobs(n_samples = 50,
                         n_features = 2,
                         centers = 3,
                         random_state = 1)

df = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# make k-means clusterer
clusterer = KMeans(3, random_state=0)

# fit clusterer
clusterer.fit(features)

# predict values
df['group'] = clusterer.predict(features)

df.head()

| | feature_1 | feature_2 | group |
|---|---|---|---|
| 0 | -9.877554 | -3.336145 | 0 |
| 1 | -7.28721 | -8.353986 | 2 |
| 2 | -6.943061 | -7.023744 | 2 |
| 3 | -7.440167 | -8.791959 | 2 |
| 4 | -6.641388 | -8.075888 | 2 |


#### Discussion

We are jumping ahead of ourselves a bit and will go much more in depth about clustering algorithms later in the book. However, I wanted to point out that we can use clustering as a preprocessing step. 

Specifically, we use unsupervised learning algorithms like k-means to cluster observations into groups. The end result is a categorical feature with similar observations being members of the same group.

### 4.10 Deleting Observations with Missing Values

You need to delete observations containing missing values.

Deleting observations with missing values is easy with a clever line of NumPy:

In [28]:
import numpy as np

features = np.array([
    [1.1, 11.1],
    [2.2, 22.2],
    [3.3, 33.3],
    [np.nan, 55]
])

# keep only observations that are not (denoted by ~) missing
features[~np.isnan(features).any(axis=1)]

array([[ 1.1, 11.1],
       [ 2.2, 22.2],
       [ 3.3, 33.3]])

Alternatively, we can drop missing observations using pandas:

In [11]:
import pandas as pd
df = pd.DataFrame(features, columns=["feature_1", "feature_2"])
df.dropna()

| | feature_1 | feature_2 |
|---|---|---|
| 0 | 1.1 | 11.1 |
| 1 | 2.2 | 22.2 |
| 2 | 3.3 | 33.3 |


#### Discussion
Most machine learning algorithms cannot handle any missing values in the target and feature arrays. The simplest solution is to delete every observation that contains one or more missing values.

There are three types of missing data:

*Missing Completely At Random (MCAR)*
* The probability that a value is missing is independent of everything.

*Missing At Random (MAR)*
* The probability that a value is missing is not completely random, but depends on information captured in other features.

*Missing Not At Random (MNAR)*
* The probability that a value is missing is not random and depends on information not captured in our features.

#### See Also
* Identifying the Three Types of Missing Data (https://measuringu.com/missing-data/)
* Missing-Data Imputation (http://www.stat.columbia.edu/~gelman/arm/missing.pdf)

### 4.11 Imputing Missing Values
You have missing values in your data and want to fill in or predict their values.

If you have a small amount of data, predict the missing values using k-nearest neighbors (KNN), here with scikit-learn's `KNNImputer`:

In [36]:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# make simulated data
features, _ = make_blobs(n_samples = 1000,
                         n_features = 2,
                         random_state = 1)

# standardize the features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)

# replace the first feature's first value with a missing value
true_value = standardized_features[0, 0]
standardized_features[0, 0] = np.nan

# create imputer that predicts missing values from the five nearest neighbors
knn_imputer = KNNImputer(n_neighbors=5)

# impute values
features_knn_imputed = knn_imputer.fit_transform(standardized_features)

# compare true and imputed values
print("True Value: {}".format(true_value))
print("Imputed Value: {}".format(features_knn_imputed[0,0]))


Alternatively, we can use scikit-learn’s `SimpleImputer` class to fill in missing values with the feature’s mean, median, or most frequent value. However, we will typically get worse results than KNN:

In [37]:
# Load library
from sklearn.impute import SimpleImputer

# Create imputer
mean_imputer = SimpleImputer(strategy="mean")

# Impute values
features_mean_imputed = mean_imputer.fit_transform(standardized_features)

# Compare true and imputed values
print("True Value: {:.6f}".format(true_value))
print("Imputed Value: {:.6f}".format(features_mean_imputed[0,0]))

True Value: 0.873019
Imputed Value: -0.000874


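#### Discussion

There are two main strategies for replacing missing data with substitute values, each with its own strengths and weaknesses. First, we can use machine learning to predict the values of the missing data. To do this, we treat the feature with missing values as a target vector and use the remaining subset of features to predict the missing values. While we can use a wide range of machine learning algorithms to impute values, a popular choice is KNN. KNN is addressed in depth later, but the short explanation is that the algorithm uses the k nearest observations (according to some distance metric) to predict the missing value. In our solution, we predicted the missing value using the five closest observations.

The downside of KNN is that, in order to know which observations are closest to the missing value, it needs to calculate the distance between the missing value and every single observation. This is reasonable in smaller datasets, but quickly becomes problematic if a dataset has millions of observations.

An alternative and more scalable strategy is to fill in all missing values with some average value. For example, in our solution we used scikit-learn to fill in missing values with the feature's mean value. The imputed value is often not as close to the true value as when we use KNN, but we can easily scale mean-filling to data containing millions of observations.

If we use imputation, it is a good idea to create a binary feature indicating whether or not the observation contains an imputed value.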
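
A minimal sketch of that last suggestion, using only NumPy: record which values were missing before filling them, then append the indicators as extra columns. The data and variable names are invented for illustration, and column-mean filling stands in for whichever imputation method is actually used.

```python
import numpy as np

features = np.array([[1.1, 11.1],
                     [np.nan, 22.2],
                     [3.3, np.nan]])

# 1 where a value is missing (and will be imputed), 0 otherwise
was_missing = np.isnan(features).astype(int)

# Fill each missing value with its column mean, ignoring NaNs
column_means = np.nanmean(features, axis=0)
imputed = np.where(np.isnan(features), column_means, features)

# Append the indicator columns to the imputed features
features_with_flags = np.hstack([imputed, was_missing])
print(features_with_flags)
```

Keeping the indicator lets a downstream model learn whether missingness itself carries information.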
 

#### See Also
* A Study of K-Nearest Neighbor as an Imputation Method (http://conteudo.icmc.usp.br/pessoas/gbatista/files/his2002.pdf)