<a href="https://colab.research.google.com/github/xslittlemaggie/ML-DL-Algorithm-Notes/blob/master/Data%20Preprocessing%20%26%20Feature%20Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center>Five steps of data mining</center></h1>

# **Part 1**: Loading data

Data can be loaded from Kaggle, from local files, from scikit-learn's built-in datasets, etc.

# **Part 2**: Data Preprocessing & Feature Engineering

The purpose of data preprocessing is to clean the data so that it is ready for model building. Relevant scikit-learn modules:

- module **preprocessing**: contains almost all data preprocessing methods

- module **feature_selection**: contains many feature selection methods

- module **decomposition**: dimensionality reduction, e.g. PCA

In [0]:
import pandas as pd

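# 1. Feature scaling (nondimensionalization)

In most cases, standardization is the scaling method to use. Putting features on a common scale can:

- increase model accuracy

- increase the computation speed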

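### 1.1 Normalization (min-max scaling)

Normalization rescales each feature using its minimum and maximum values, so it is very sensitive to outliers.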
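To make the formula concrete, here is a minimal NumPy sketch of column-wise min-max scaling, `(x - min) / (max - min)`, on the same toy data used in the MinMaxScaler cells. The extra row `[100.0, 18.0]` is a made-up outlier, added only to illustrate the sensitivity noted above.

```python
import numpy as np

# Same toy data as in the MinMaxScaler example.
X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Column-wise min-max scaling: (x - min) / (max - min)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)

# Hypothetical outlier in the first column: the original four rows
# are squeezed into a tiny interval near 0 after scaling.
X_outlier = np.vstack([X, [100.0, 18.0]])
out_min = X_outlier.min(axis=0)
out_max = X_outlier.max(axis=0)
X_outlier_scaled = (X_outlier - out_min) / (out_max - out_min)
print(X_outlier_scaled[:, 0])
```

With the outlier present, the first column's original values all map to less than 0.02, which is why min-max scaling is usually a poor choice for data with extreme values.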

In [63]:
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
pd.DataFrame(data, columns = ["x_1", "x_2"])

Unnamed: 0,x_1,x_2
0,-1.0,2
1,-0.5,6
2,0.0,10
3,1.0,18


In [64]:
scaler = MinMaxScaler()
result = scaler.fit_transform(data)
result

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [1.  , 1.  ]])

In [65]:
scaler.inverse_transform(result)

array([[-1. ,  2. ],
       [-0.5,  6. ],
       [ 0. , 10. ],
       [ 1. , 18. ]])

In [66]:
# scale the data into a custom range instead of the default [0, 1]

scaler = MinMaxScaler(feature_range = (5, 10))  # feature_range is documented as a tuple (min, max)
result = scaler.fit_transform(data)
result

array([[ 5.  ,  5.  ],
       [ 6.25,  6.25],
       [ 7.5 ,  7.5 ],
       [10.  , 10.  ]])

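### 1.2 Standardization

Standardization (z-score scaling) subtracts each feature's mean and divides by its standard deviation, so every feature ends up with mean 0 and variance 1.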
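As a sanity check on what the scaler computes, the z-score can be sketched by hand with NumPy on the same toy data. This is an illustration under the assumption that the population standard deviation (`ddof=0`) is wanted, which is what StandardScaler uses.

```python
import numpy as np

# Same toy data as in the scaling examples.
X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

mean = X.mean(axis=0)      # per-column mean
std = X.std(axis=0)        # per-column population std (ddof=0)
Z = (X - mean) / std       # z-score: (x - mean) / std

print(mean)                # per-column mean of the raw data
print(std ** 2)            # per-column variance of the raw data
print(Z.mean(axis=0))      # approximately 0 for every column
print(Z.std(axis=0))       # approximately 1 for every column
```

Passing `axis=0` checks each column separately, which is a stricter check than taking the mean and standard deviation over the whole array.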

In [67]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
result = scaler.fit_transform(data)
result

array([[-1.18321596, -1.18321596],
       [-0.50709255, -0.50709255],
       [ 0.16903085,  0.16903085],
       [ 1.52127766,  1.52127766]])

Mean and variance of the original data, as learned by the scaler:

In [68]:
print("Mean: {}".format(scaler.mean_))
print("Variance: {}".format(scaler.var_))

Mean: [-0.125  9.   ]
Variance: [ 0.546875 35.      ]


Mean and standard deviation of the transformed data (computed over the whole array):

In [69]:
print("Mean: {}".format(result.mean()))
print("Std: {}".format(result.std()))

Mean: 0.0
Std: 1.0


In [70]:
scaler.inverse_transform(result)

array([[-1. ,  2. ],
       [-0.5,  6. ],
       [ 0. , 10. ],
       [ 1. , 18. ]])

# 2. Missing values

Load the Titanic dataset from Kaggle for practice.

In [0]:
import os

os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"  # "username" field from kaggle.json
os.environ['KAGGLE_KEY'] = "<your-kaggle-key>"  # "key" field from kaggle.json; never publish a real key

In [72]:
!pip install -q kaggle
#!kaggle datasets list -s titanic  # lists up to 20 Kaggle datasets matching "titanic"
!kaggle datasets download -d kittisaks/testtitanic -p /content/
# the archive is named after the dataset (testtitanic.zip); -o overwrites existing files without prompting
!unzip -q -o /content/testtitanic.zip -d /content/titanic/

testtitanic.zip: Skipping, found more recently modified local copy (use --force to force download)


In [73]:
data = pd.read_csv("/content/titanic/titanic_data.csv")
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [74]:
columns = ["Age", "Sex", "Embarked", "Survived"]
data = data[columns]
data.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,male,S,0
1,38.0,female,C,1
2,26.0,female,S,1
3,35.0,female,S,1
4,35.0,male,S,0


In [75]:
data.shape

(891, 4)

### **Method 1**: impute.SimpleImputer

Missing values are represented as `np.nan`.

`class sklearn.impute.SimpleImputer(missing_values=nan, strategy="mean", fill_value=None, copy=True)`

- missing_values: the placeholder that marks a missing value; `np.nan` by default, or any other value to be replaced
- strategy: "mean", "median", "most_frequent", or "constant"
- fill_value: the replacement value, used when strategy = "constant"
- copy: whether to work on a copy of the data instead of modifying it in place; default = True
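A minimal sketch of how these parameters behave, on small made-up arrays (the values and variable names are illustrative and are not taken from the Titanic data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric and categorical columns, each with one missing value.
ages = np.array([[22.0], [np.nan], [26.0], [35.0]])
ports = np.array([["S"], ["C"], [np.nan], ["S"]], dtype=object)

# strategy="median": the missing age becomes the median of 22, 26, 35.
ages_filled = SimpleImputer(strategy="median").fit_transform(ages)

# strategy="most_frequent" also works on string columns.
ports_mode = SimpleImputer(strategy="most_frequent").fit_transform(ports)

# strategy="constant" replaces every missing value with fill_value.
ports_const = SimpleImputer(strategy="constant",
                            fill_value="Unknown").fit_transform(ports)

print(ages_filled.ravel())
print(ports_mode.ravel())
print(ports_const.ravel())
```

Note that "mean" and "median" only apply to numeric data, while "most_frequent" and "constant" can be used for both numeric and string columns.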

In [76]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 4 columns):
Age         714 non-null float64
Sex         891 non-null object
Embarked    889 non-null object
Survived    891 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 28.0+ KB


The features "Age" and "Embarked" have missing values.

#### 1. Replacing missing values in Age with the median

In [77]:
Age = data.loc[:, "Age"].values.reshape(-1, 1)  # reshape the 1-D column to 2-D, as SimpleImputer expects
Age[:5]

array([[22.],
       [38.],
       [26.],
       [35.],
       [35.]])

In [0]:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer()  # mean by default
imp_median = SimpleImputer(strategy = "median")
imp_0 = SimpleImputer(strategy = "constant", fill_value = 0)


In [0]:
imp_mean = imp_mean.fit_transform(Age)
imp_median = imp_median.fit_transform(Age)
imp_0 = imp_0.fit_transform(Age)

In [80]:
imp_mean[:20]

array([[22.        ],
       [38.        ],
       [26.        ],
       [35.        ],
       [35.        ],
       [29.69911765],
       [54.        ],
       [ 2.        ],
       [27.        ],
       [14.        ],
       [ 4.        ],
       [58.        ],
       [20.        ],
       [39.        ],
       [14.        ],
       [55.        ],
       [ 2.        ],
       [29.69911765],
       [31.        ],
       [29.69911765]])

In [81]:
imp_median[:20]

array([[22.],
       [38.],
       [26.],
       [35.],
       [35.],
       [28.],
       [54.],
       [ 2.],
       [27.],
       [14.],
       [ 4.],
       [58.],
       [20.],
       [39.],
       [14.],
       [55.],
       [ 2.],
       [28.],
       [31.],
       [28.]])

In [82]:
imp_0[:20]

array([[22.],
       [38.],
       [26.],
       [35.],
       [35.],
       [ 0.],
       [54.],
       [ 2.],
       [27.],
       [14.],
       [ 4.],
       [58.],
       [20.],
       [39.],
       [14.],
       [55.],
       [ 2.],
       [ 0.],
       [31.],
       [ 0.]])

In [0]:
# replace the Age column with the median-imputed values
data.loc[:, "Age"] = imp_median

In [84]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 4 columns):
Age         891 non-null float64
Sex         891 non-null object
Embarked    889 non-null object
Survived    891 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 28.0+ KB


#### 2. Replacing missing values in Embarked with the mode

In [85]:
Embarked = data.loc[:, "Embarked"].values.reshape(-1, 1)
imp_mode = SimpleImputer(strategy = "most_frequent")
data.loc[:, "Embarked"] = imp_mode.fit_transform(Embarked)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 4 columns):
Age         891 non-null float64
Sex         891 non-null object
Embarked    891 non-null object
Survived    891 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 28.0+ KB


### **Method 2**: replacing missing values with pandas and NumPy

In [86]:
data = pd.read_csv("/content/titanic/titanic_data.csv")
columns = ["Age", "Sex", "Embarked", "Survived"]
data = data[columns]
data.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,male,S,0
1,38.0,female,C,1
2,26.0,female,S,1
3,35.0,female,S,1
4,35.0,male,S,0


In [87]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 4 columns):
Age         714 non-null float64
Sex         891 non-null object
Embarked    889 non-null object
Survived    891 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 28.0+ KB


In [0]:
# replace missing values with fillna 
data.loc[:, "Age"] = data.loc[:, "Age"].fillna(data.loc[:, "Age"].median())

In [0]:
# fill missing Embarked values with the mode (the most frequent value), using pandas only
data.loc[:, "Embarked"] = data.loc[:, "Embarked"].fillna(data.loc[:, "Embarked"].mode()[0])

In [0]:
# Alternative: drop missing values instead of filling them
# data.dropna(axis = 0, inplace = True)
# axis = 0 removes every row that contains a missing value; axis = 1 removes every column that does
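The dropna alternative from the commented-out cell can be sketched on a small hypothetical frame (the column values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age and one missing Embarked.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", "C", None],
    "Survived": [0, 1, 1],
})

rows_dropped = df.dropna(axis=0)                 # drop rows with any missing value
cols_dropped = df.dropna(axis=1)                 # drop columns with any missing value
subset_dropped = df.dropna(subset=["Embarked"])  # only consider Embarked

print(rows_dropped.shape)
print(list(cols_dropped.columns))
print(len(subset_dropped))
```

Dropping is simplest when only a few rows are affected; for a column like Age, where many values are missing, imputing keeps far more of the data.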

In [91]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 4 columns):
Age         891 non-null float64
Sex         891 non-null object
Embarked    891 non-null object
Survived    891 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 28.0+ KB


# 3. Categorical features: encoding and dummy variables

### 3.1 Label encoder

In [0]:
from sklearn.preprocessing import LabelEncoder
# LabelEncoder is meant for the target (label) column and takes 1-D input
data.iloc[:, -1] = LabelEncoder().fit_transform(data.iloc[:, -1])

In [93]:
data.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,male,S,0
1,38.0,female,C,1
2,26.0,female,S,1
3,35.0,female,S,1
4,35.0,male,S,0


In [94]:
y = data.iloc[:, -1]
le = LabelEncoder()
data.iloc[:, -1] = le.fit_transform(y)
le.classes_

array([0, 1])

In [95]:
data.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,male,S,0
1,38.0,female,C,1
2,26.0,female,S,1
3,35.0,female,S,1
4,35.0,male,S,0


In [96]:
label = le.inverse_transform(y)
label

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
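Because Survived is already coded as 0 and 1, the cells above leave it unchanged. A minimal sketch with hypothetical string labels (not part of the Titanic data) shows the mapping LabelEncoder learns and how inverse_transform recovers the original labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, used only for illustration.
labels = ["died", "survived", "survived", "died"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)        # classes are sorted, then numbered
decoded = le_demo.inverse_transform(encoded)   # back to the original strings

print(le_demo.classes_)
print(encoded)
print(decoded)
```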

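### 3.2 Ordinal encoder for features (for practice only, not appropriate for this data)

OrdinalEncoder maps each category to an integer, which implies an order. Sex and Embarked have no natural order, so this section is only a demonstration.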

In [0]:
from sklearn.preprocessing import OrdinalEncoder
data_ = data.copy()

In [98]:
data_.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,male,S,0
1,38.0,female,C,1
2,26.0,female,S,1
3,35.0,female,S,1
4,35.0,male,S,0


In [99]:
OrdinalEncoder().fit(data_.iloc[:, 1:-1]).categories_

[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]

In [100]:
data_.iloc[:, 1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:, 1:-1])
data_.head(10)

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,1.0,2.0,0
1,38.0,0.0,0.0,1
2,26.0,0.0,2.0,1
3,35.0,0.0,2.0,1
4,35.0,1.0,2.0,0
5,28.0,1.0,1.0,0
6,54.0,1.0,2.0,0
7,2.0,1.0,2.0,0
8,27.0,0.0,2.0,1
9,14.0,0.0,0.0,1
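For contrast, here is a sketch of a case where an ordinal encoding is appropriate. The `size` column and its ordering are hypothetical. Passing `categories` explicitly fixes the order, instead of relying on the alphabetical order that the encoder would otherwise infer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature with a genuine order: small < medium < large.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# One list of categories per encoded column, in the desired order.
enc_demo = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = enc_demo.fit_transform(df[["size"]])

print(codes.ravel())
```

Without the explicit `categories` argument the inferred order would be alphabetical (large, medium, small), which would reverse the intended ranking.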


### 3.3 OneHotEncoder to deal with nominal categorical data

In [101]:
from sklearn.preprocessing import OneHotEncoder
X = data.iloc[:, 1:-1]
X.head()

Unnamed: 0,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S


In [102]:
enc = OneHotEncoder(categories = "auto")
result = enc.fit_transform(X).toarray()

# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
enc.get_feature_names()

array(['x0_female', 'x0_male', 'x1_C', 'x1_Q', 'x1_S'], dtype=object)

In [103]:
newdata = pd.concat([data, pd.DataFrame(result)], axis = 1)
newdata.head()

Unnamed: 0,Age,Sex,Embarked,Survived,0,1,2,3,4
0,22.0,male,S,0,0.0,1.0,0.0,0.0,1.0
1,38.0,female,C,1,1.0,0.0,1.0,0.0,0.0
2,26.0,female,S,1,1.0,0.0,0.0,0.0,1.0
3,35.0,female,S,1,1.0,0.0,0.0,0.0,1.0
4,35.0,male,S,0,0.0,1.0,0.0,0.0,1.0


In [104]:
newdata.drop(["Sex", "Embarked"], axis = 1, inplace = True)
newdata.columns = ["Age", "Survived", "Female", "Male", "Embarked_C", "Embarked_Q", "Embarked_S"]
newdata.head()

Unnamed: 0,Age,Survived,Female,Male,Embarked_C,Embarked_Q,Embarked_S
0,22.0,0,0.0,1.0,0.0,0.0,1.0
1,38.0,1,1.0,0.0,1.0,0.0,0.0
2,26.0,1,1.0,0.0,0.0,0.0,1.0
3,35.0,1,1.0,0.0,0.0,0.0,1.0
4,35.0,0,0.0,1.0,0.0,0.0,1.0
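The same dummy-variable layout can also be produced directly in pandas. This is an alternative sketch using `pd.get_dummies` on a small made-up frame, not the approach used in the cells above:

```python
import pandas as pd

# Hypothetical rows in the same shape as the Titanic subset.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# One 0/1 column per category; the original columns are dropped
# and the new columns are named <column>_<category> automatically.
dummies = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=float)

print(list(dummies.columns))
print(dummies)
```

One design difference worth noting: OneHotEncoder remembers the categories seen during fit and can be reused on new data, whereas `get_dummies` only creates columns for the categories present in the frame it is given.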


# 4. Continuous features: binarization

In [0]:
data_2 = data.copy()

In [0]:
from sklearn.preprocessing import Binarizer  # a feature transformer: requires 2-D input, not 1-D

X = data_2.iloc[:, 0].values.reshape(-1, 1)

In [107]:
# values greater than the threshold become 1; values less than or equal to it become 0
transformer = Binarizer(threshold = 30).fit_transform(X)
transformer[:10]

array([[0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.]])
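The thresholding rule can be checked against a plain NumPy comparison. The ages below are made up, and include a value exactly at the threshold to show that it maps to 0:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical ages; 30.0 sits exactly on the threshold.
ages = np.array([[22.0], [38.0], [30.0], [35.0]])

by_sklearn = Binarizer(threshold=30).fit_transform(ages)
by_numpy = (ages > 30).astype(float)   # strictly greater than the threshold

print(by_sklearn.ravel())
print(np.array_equal(by_sklearn, by_numpy))
```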

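# 5. Binning continuous features

### preprocessing.KBinsDiscretizer

- n_bins: the number of bins, 5 by default; a single value is applied to every feature passed in
- encode: the encoding of the result, "onehot" by default
- strategy: how the bin edges are chosen, "quantile" by default
  - "quantile": equal-frequency binning; every bin of a feature holds the same number of samples
  - "uniform": equal-width binning; every bin of a feature has width (max - min) / n_bins
  - "kmeans": clustering-based binning; the values in each bin share the same nearest center of a 1-D k-means clustering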
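To see how the strategies differ, the following sketch fits "uniform" and "quantile" binning on a small made-up column with one extreme value and compares the learned bin edges (`bin_edges_`). The data and variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed feature: nine small values and one extreme value.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float).reshape(-1, 1)

uniform = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit(x)
quantile = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile").fit(x)

uniform_edges = uniform.bin_edges_[0]     # equally spaced between min and max
quantile_edges = quantile.bin_edges_[0]   # placed at quantiles of the data

# Number of samples that land in each of the three bins.
uniform_counts = np.bincount(uniform.transform(x).ravel().astype(int), minlength=3)
quantile_counts = np.bincount(quantile.transform(x).ravel().astype(int), minlength=3)

print(uniform_edges, uniform_counts)
print(quantile_edges, quantile_counts)
```

With equal-width bins the single extreme value stretches the range, so almost every sample falls into the first bin; quantile bins adapt their edges to the data and spread the samples far more evenly.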

In [0]:
from sklearn.preprocessing import KBinsDiscretizer

X = data.iloc[:, 0].values.reshape(-1, 1)

In [109]:
est = KBinsDiscretizer(n_bins = 3, encode = "ordinal", strategy = "uniform")
est.fit_transform(X).ravel()[:20]

array([0., 1., 0., 1., 1., 1., 2., 0., 1., 0., 0., 2., 0., 1., 0., 2., 0.,
       1., 1., 1.])

In [110]:
set(est.fit_transform(X).ravel())

{0.0, 1.0, 2.0}

In [111]:
est = KBinsDiscretizer(n_bins = 3, encode = "onehot", strategy = "uniform")
est.fit_transform(X).toarray()

array([[1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       ...,
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])

# **Part 4**: Model building (skipped)


# **Part 5**: Model evaluation & application (skipped)
