# 100 Days of Machine Learning, Day 1: Data Preprocessing

## Step 1: Import the required libraries

In [1]:
import numpy as np
import pandas as pd

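## Step 2: Import the dataset
We use Pandas' read_csv method to load a local CSV file into a DataFrame. From the DataFrame we then build the feature matrix X (the independent variables) and the target vector Y (the dependent variable).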

In [2]:
dataset = pd.read_csv('../datasets/Data.csv')
X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : , 3].values
print(dataset.head());print()
print('the shape of dataset: {}'.format(dataset.shape))

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes

the shape of dataset: (10, 4)
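
The slicing in this step can be tried without the CSV file. The sketch below rebuilds only the five rows printed by `dataset.head()` above; the names `df`, `X_demo`, and `Y_demo` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Excerpt: only the five rows shown by dataset.head(), not the full 10-row file
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, None],  # None becomes NaN
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

X_demo = df.iloc[:, :-1].values  # every column except the last: the features
Y_demo = df.iloc[:, -1].values   # the last column only: the target

print(X_demo.shape)  # (5, 3)
print(Y_demo.shape)  # (5,)
```

Using `-1` for the target column instead of the literal index `3` keeps the slice valid if feature columns are added later.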


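## Step 3: Handle missing data
Real-world data is rarely complete. Values can be missing for many reasons, and they must be handled so that they do not degrade the performance of the machine learning model. A common approach is to replace each missing value with the mean or median of its column. We use the SimpleImputer class from sklearn.impute for this task; the Imputer class from sklearn.preprocessing was deprecated in scikit-learn 0.20 and removed in 0.22.

In [3]:
from sklearn.impute import SimpleImputer

# Replace NaN entries with the column mean (SimpleImputer always works column-wise)
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(X[ : , 1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])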
print("X")
print(X.shape)

# Step 3, alternative: fill the missing values with pandas
X = dataset.iloc[ : , :-1].fillna(dataset.iloc[ : , 1:-1].mean()).round(decimals=1).values
print("---------------------")
print("Step 3: Handling the missing data Another Way:Use pandas")
print("X")
print(X.shape)
print(X)

X
(10, 3)
---------------------
Step 3: Handling the missing data Another Way:Use pandas
X
(10, 3)
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.8]
 ['France' 35.0 58000.0]
 ['Spain' 38.8 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
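
The text mentions the median as an alternative to the mean. The following is a minimal sketch comparing the two strategies, assuming scikit-learn 0.20 or later (where `SimpleImputer` lives in `sklearn.impute`). The toy array `demo` is hypothetical and only borrows a few values from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical (age, salary) rows with one gap in each column
demo = np.array([[44.0, 72000.0],
                 [27.0, np.nan],
                 [np.nan, 54000.0],
                 [38.0, 61000.0]])

# Each imputer learns one fill value per column from the non-missing entries
mean_filled = SimpleImputer(strategy="mean").fit_transform(demo)
median_filled = SimpleImputer(strategy="median").fit_transform(demo)

print(mean_filled[2, 0], median_filled[2, 0])  # filled age: mean vs. median
print(mean_filled[1, 1], median_filled[1, 1])  # filled salary: mean vs. median
```

The median is less sensitive to outliers, which can matter for skewed columns such as salaries.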


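## Step 4: Encode categorical data
Categorical data are variables that hold label values rather than numeric values, usually drawn from a fixed set. Labels such as "Yes" and "No" cannot be used in a model's mathematical computations, so they have to be encoded as numbers. To do this, we import the LabelEncoder class from sklearn.preprocessing. The iris model from Day 4 also had to handle 'str'-typed variables.
### Why?
Many machine learning algorithms cannot operate on label data directly.
<br>
They require all input variables and output variables to be numeric.
How do we convert categorical data to numerical data?
<br>This involves two steps:
* Integer encoding
* One-hot encoding: for categorical variables with no ordinal relationship, integer encoding alone is not enough, because a model could interpret the integers as an ordering.

One-hot encoding converts categorical data into binary indicators.
Each category is represented by its own binary feature.
A categorical feature with n possible values becomes n binary features, and exactly one of them is active (1) for any given sample.
<br>Since country has 3 possible values, the OneHotEncoder below turns it into 3 binary features: [[1,0,0],[0,0,1],[0,1,0]]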
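
The two steps can be traced by hand with NumPy alone. This is an illustrative sketch, not the scikit-learn implementation; the variable names are made up for the example.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain"])

# Integer encoding: np.unique sorts the labels and returns each row's index
categories, codes = np.unique(countries, return_inverse=True)
print(categories)  # ['France' 'Germany' 'Spain']
print(codes)       # [0 2 1 2]

# One-hot encoding: row i of the identity matrix is the indicator for category i
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot)
```

Every row of `one_hot` contains a single 1, so no category appears larger or smaller than another.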

In [4]:
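from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# One-hot encode the country column (index 0) and pass Age and Salary through unchanged.
# OneHotEncoder(categorical_features=[0]) was removed in scikit-learn 0.22.
# sparse_threshold=0 forces a dense array instead of a sparse matrix.
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
X = ct.fit_transform(X).astype(float)
# Encode the target labels as integers: "No" -> 0, "Yes" -> 1
labelencoder_Y = LabelEncoder()
Y =  labelencoder_Y.fit_transform(Y)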
print("---------------------")
print("Step 4: Encoding categorical data")
print("X")
print(X)
print(X.shape)
print("Y")
print(Y.shape)

---------------------
Step 4: Encoding categorical data
X
[[1.00000e+00 0.00000e+00 0.00000e+00 4.40000e+01 7.20000e+04]
 [0.00000e+00 0.00000e+00 1.00000e+00 2.70000e+01 4.80000e+04]
 [0.00000e+00 1.00000e+00 0.00000e+00 3.00000e+01 5.40000e+04]
 [0.00000e+00 0.00000e+00 1.00000e+00 3.80000e+01 6.10000e+04]
 [0.00000e+00 1.00000e+00 0.00000e+00 4.00000e+01 6.37778e+04]
 [1.00000e+00 0.00000e+00 0.00000e+00 3.50000e+01 5.80000e+04]
 [0.00000e+00 0.00000e+00 1.00000e+00 3.88000e+01 5.20000e+04]
 [1.00000e+00 0.00000e+00 0.00000e+00 4.80000e+01 7.90000e+04]
 [0.00000e+00 1.00000e+00 0.00000e+00 5.00000e+01 8.30000e+04]
 [1.00000e+00 0.00000e+00 0.00000e+00 3.70000e+01 6.70000e+04]]
(10, 5)
Y
(10,)
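
To see which integer stands for which label, a fitted `LabelEncoder` exposes `classes_` and `inverse_transform`. The short sketch below uses only the five Purchased values printed in Step 2; the names `labels` and `encoder` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "No", "Yes"]  # Purchased values from dataset.head()
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)  # ['No' 'Yes'], so index 0 means "No" and 1 means "Yes"
print(encoded)           # [0 1 0 0 1]

# Map the integers back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded)
```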


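## Step 5: Split the dataset into a training set and a test set
We split the dataset in two: a training set used to fit the model and a test set used to evaluate it. The split is typically 80:20. We import the train_test_split() function from sklearn.model_selection.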

In [5]:
# from sklearn.cross_validation import train_test_split
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection instead

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size=0.2, random_state=0)
print("---------------------")
print("Step 5: Splitting the datasets into training sets and Test sets")
print("X_train")
print(X_train)
print("X_test")
print(X_test)
print("Y_train")
print(Y_train)
print("Y_test")
print(Y_test)

---------------------
Step 5: Splitting the datasets into training sets and Test sets
X_train
[[0.00000e+00 1.00000e+00 0.00000e+00 4.00000e+01 6.37778e+04]
 [1.00000e+00 0.00000e+00 0.00000e+00 3.70000e+01 6.70000e+04]
 [0.00000e+00 0.00000e+00 1.00000e+00 2.70000e+01 4.80000e+04]
 [0.00000e+00 0.00000e+00 1.00000e+00 3.88000e+01 5.20000e+04]
 [1.00000e+00 0.00000e+00 0.00000e+00 4.80000e+01 7.90000e+04]
 [0.00000e+00 0.00000e+00 1.00000e+00 3.80000e+01 6.10000e+04]
 [1.00000e+00 0.00000e+00 0.00000e+00 4.40000e+01 7.20000e+04]
 [1.00000e+00 0.00000e+00 0.00000e+00 3.50000e+01 5.80000e+04]]
X_test
[[0.0e+00 1.0e+00 0.0e+00 3.0e+01 5.4e+04]
 [0.0e+00 1.0e+00 0.0e+00 5.0e+01 8.3e+04]]
Y_train
[1 1 1 0 1 0 0 1]
Y_test
[0 0]
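
The effect of `test_size` and `random_state` is easier to see on synthetic data. The arrays below are hypothetical placeholders, not the tutorial's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features
y_demo = np.arange(10)                 # one label per sample

# test_size=0.2 puts 2 of the 10 samples in the test set
a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# The same random_state reproduces exactly the same split
again = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(a_train.shape, a_test.shape)         # (8, 2) (2, 2)
print(np.array_equal(a_test, again[1]))    # True
```

Fixing `random_state` makes results comparable between runs; leaving it unset gives a different shuffle each time.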


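## Step 6: Feature scaling
Many machine learning algorithms rely on the Euclidean distance between two data points, but features often differ widely in magnitude, units, and range. In a distance computation, features with large magnitudes outweigh features with small magnitudes. Standardization (Z-score normalization) solves this problem. We import the StandardScaler class from sklearn.preprocessing.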

In [6]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)  # reuse the training-set mean and std; do not refit on the test set
print("---------------------")
print("Step 6: Feature Scaling")
print("X_train")
print(X_train)
print("X_test")
print(X_test)

---------------------
Step 6: Feature Scaling
X_train
[[-1.          2.64575131 -0.77459667  0.26258245  0.12381682]
 [ 1.         -0.37796447 -0.77459667 -0.25397319  0.46175601]
 [-1.         -0.37796447  1.29099445 -1.97582532 -1.53093364]
 [-1.         -0.37796447  1.29099445  0.05596019 -1.11142003]
 [ 1.         -0.37796447 -0.77459667  1.64006416  1.72029684]
 [-1.         -0.37796447  1.29099445 -0.08178798 -0.16751441]
 [ 1.         -0.37796447 -0.77459667  0.9513233   0.98614802]
 [ 1.         -0.37796447 -0.77459667 -0.59834362 -0.48214962]]
X_test
[[-1.          2.64575131 -0.77459667 -1.45926968 -0.90166323]
 [-1.          2.64575131 -0.77459667  1.98443458  2.13981045]]
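
Standardization computes z = (x - mean) / std for each column. The sketch below checks a hand-written version against `StandardScaler`, with the statistics learned from the training rows only and then reused for the test row. The arrays are a small excerpt of (age, salary) pairs from the dataset, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[27.0, 48000.0], [38.0, 61000.0], [44.0, 72000.0]])
test = np.array([[30.0, 54000.0]])

# Learn mean and std from the training rows, then apply them to both sets
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# The same computation by hand, using the population std (ddof=0)
mu = train.mean(axis=0)
sigma = train.std(axis=0)
manual = (test - mu) / sigma

print(np.allclose(test_scaled, manual))  # True
```

After scaling, each training column has mean 0 and standard deviation 1, so age and salary contribute on a comparable scale to distance computations.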


<b>For the complete project, see the GitHub repository <a href="https://github.com/MachineLearning100/100-Days-Of-ML-Code">100-Days-Of-ML-Code</a>. Suggestions and comments are welcome in the issues.</b>