### Data Preprocessing

#### Data preprocessing workflow:

Missing-data handling $\rightarrow$ **Data transformation** $\rightarrow$ **Outlier handling** $\rightarrow$ **Data scaling**

### Data Transformation

#### 1. Purpose of data transformation

- Reduce the influence of outliers


- Make the data approximately normally distributed


#### 2. Log transformation

(1) A log transformation captures relative (multiplicative) changes rather than absolute ones and compresses large magnitudes. It is defined only for strictly positive values.

(2) Log normalization using numpy

a. Checking the variance

    print(df["col"].var())

b. Apply the log normalization function to the column

    # Requires `import numpy as np`; values in the column must be > 0
    df["col"] = np.log(df["col"])

c. Check the variance of the normalized column

    print(df["col"].var())
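
The three steps above can be sketched end to end as follows. The data is synthetic and the column name `col_log` is an assumption made for illustration; the sketch only shows that the log compresses the variance and reduces the skew of right-skewed, strictly positive data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive data (not from the original notes)
rng = np.random.default_rng(0)
df = pd.DataFrame({"col": rng.lognormal(mean=3.0, sigma=1.0, size=1000)})

# a. Variance before the transformation
var_before = df["col"].var()

# b. np.log needs strictly positive input; np.log1p (log(1 + x)) is a common
#    alternative when the column contains zeros
df["col_log"] = np.log(df["col"])

# c. Variance after the transformation
var_after = df["col_log"].var()

print(f"variance before: {var_before:.2f}")
print(f"variance after:  {var_after:.2f}")
print(f"skew before: {df['col'].skew():.2f}, after: {df['col_log'].skew():.2f}")
```

Writing the result to a new column, rather than overwriting `col`, keeps the original values available for comparison.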
    
#### 3. Box-Cox transformation

(1) A family of power transformations that reshapes a non-normal variable toward an approximately normal distribution. The input must be strictly positive.
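
For reference, the one-parameter Box-Cox transform of a strictly positive value $x$ is commonly written as below, where $\lambda$ plays the role of the `lmbda` argument (called `p` in these notes):

$$
y(\lambda) =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln x, & \lambda = 0
\end{cases}
$$

Under this definition, $\lambda = 1$ gives $x - 1$, which shifts the data but leaves its shape unchanged, and $\lambda = 0.5$ gives $2(\sqrt{x} - 1)$, a shifted and rescaled square root.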


(2) SciPy function:

      scipy.stats.boxcox(data, lmbda=p)

- p = -2: reciprocal square


- p = -1: reciprocal


- p = -0.5: reciprocal square root


- p = 0: log


- p = 0.5: square root


- p = 1: no transform


- p = 2: square
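
A minimal sketch of how the values of `p` listed above behave in practice, using synthetic data as an assumption. It checks that `lmbda=0` matches the natural log and that `lmbda=0.5` is a shifted and rescaled square root, and it shows that omitting `lmbda` lets SciPy estimate a value from the data.

```python
import numpy as np
from scipy import stats

# Hypothetical strictly positive sample; Box-Cox is undefined for values <= 0
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=0.8, size=500)

# lmbda=0 is the natural log
x_log = stats.boxcox(x, lmbda=0)

# lmbda=0.5 is a shifted and rescaled square root: (sqrt(x) - 1) / 0.5
x_sqrt = stats.boxcox(x, lmbda=0.5)

# With lmbda omitted, boxcox returns the transformed data together with the
# lambda that maximizes the log-likelihood
x_auto, fitted_lambda = stats.boxcox(x)

print(np.allclose(x_log, np.log(x)))
print(np.allclose(x_sqrt, (np.sqrt(x) - 1) / 0.5))
print(f"fitted lambda: {fitted_lambda:.3f}")
```

Letting SciPy fit the parameter is often a reasonable starting point when no particular power is suggested by the domain.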


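Example: Data transformation

    # Import modules
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.stats import boxcox

    # Subset data (values must be strictly positive for Box-Cox)
    A = data['A']

    # Histogram and kernel density estimate
    plt.figure()
    sns.histplot(A, kde=True)  # sns.distplot is deprecated
    plt.show()

    # Box-Cox transformation with lambda = 0, i.e. a log transform
    A_log = boxcox(A, lmbda=0)

    # Histogram and kernel density estimate after the transformation
    plt.figure()
    sns.histplot(A_log, kde=True)
    plt.show()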

#### 4. If outliers remain after the transformation

$\Rightarrow$ Apply outlier handling

### Outlier Detection and Handling

#### 1. Visualizing outliers

(1) Boxplot conditioned on the target variable

    sns.boxplot() 

Example: Inspecting outliers with boxplots
    
    # Import modules
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(1, 2)
    
    # Univariate boxplot
    sns.boxplot(y=data['B'], ax=ax[0])
    
    # Boxplot of B conditioned on A
    sns.boxplot(x='A', y='B', data=data, ax=ax[1])
    
    plt.show()

(2) Histogram and kernel density estimate (KDE)

    sns.histplot(kde=True)  # replaces the deprecated sns.distplot()

#### 2. Outlier handling methods

(1) Handling outliers with Z-scores

- The Z-score rule treats points roughly more than 3 standard deviations from the mean as outliers.


- Z-scores are also often used to scale data before fitting a model.


Steps:

i. Compute the Z-scores:  

    stats.zscore()

ii. Flag points whose absolute Z-score exceeds 3 as possible outliers. (The 1.5 × IQR criterion is a separate rule: points more than 1.5 times the IQR below the first quartile or above the third quartile are suspected outliers.)
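
The two detection rules mentioned here can be compared side by side. This is a minimal sketch on synthetic data with two planted extreme values; the sample and all variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 200 well-behaved points plus two planted extremes
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(loc=50.0, scale=5.0, size=200), [120.0, -30.0]])

# Z-score rule: flag points more than 3 standard deviations from the mean
z = stats.zscore(values)
z_outliers = np.abs(z) > 3

# IQR rule: flag points more than 1.5 * IQR below Q1 or above Q3
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print("flagged by z-score:", values[z_outliers])
print("flagged by IQR:    ", values[iqr_outliers])
```

The IQR rule relies on quartiles, so it tends to be less affected by the extreme values themselves than the mean and standard deviation used by the Z-score rule.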


Example: Removing outliers with stats.zscore

    # Import modules
    import numpy as np
    import pandas as pd
    from scipy import stats

    # Print: before dropping
    print(numeric_cols.mean())
    print(numeric_cols.median())
    print(numeric_cols.max())

    # Create index of rows to keep: every numeric column within 3 standard deviations
    idx = (np.abs(stats.zscore(numeric_cols)) < 3).all(axis=1)

    # Concatenate numeric and categoric subsets
    ld_out_drop = pd.concat([numeric_cols.loc[idx], categoric_cols.loc[idx]], axis=1)

    # Print: after dropping (numeric columns only)
    print(ld_out_drop.mean(numeric_only=True))
    print(ld_out_drop.median(numeric_only=True))
    print(ld_out_drop.max(numeric_only=True))

(2) Handling outliers with Winsorizing

    mstats.winsorize(a, limits=[0.05, 0.05])

Example: Capping outliers with mstats.winsorize

    # Import modules
    import pandas as pd
    from scipy.stats import mstats

    # Print: before winsorize
    print(df['A'].mean())
    print(df['A'].median())
    print(df['A'].max())

    # Winsorize the column: cap the lowest and highest 5% of values
    df_win = mstats.winsorize(df['A'], limits=[0.05, 0.05])

    # Convert to DataFrame, reassign column name, keep the original index
    df_out = pd.DataFrame(df_win, columns=['A'], index=df.index)

    # Print: after winsorize
    print(df_out.mean())
    print(df_out.median())
    print(df_out.max())
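
To make the effect of `limits` concrete, here is a small sketch on hypothetical data chosen so that 5% on each side is exactly one value. Winsorizing caps extreme values instead of dropping rows, so the length of the data is unchanged.

```python
import numpy as np
from scipy.stats import mstats

# Hypothetical data: the numbers 1..20, so 5% on each side is exactly one value
values = np.arange(1, 21, dtype=float)

# limits=[0.05, 0.05] replaces the lowest 5% with the next-lowest value
# and the highest 5% with the next-highest value
winsorized = mstats.winsorize(values, limits=[0.05, 0.05])

# winsorize returns a masked array; np.asarray gives a plain ndarray
winsorized = np.asarray(winsorized)

print(winsorized.min(), winsorized.max())  # 2.0 19.0
print(len(winsorized) == len(values))      # True
```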

(3) Handling outliers with summary statistics

    np.where(condition, value_if_true, value_if_false)

Example: Replacing outliers with the median using np.where()

    # Print: before replace with median
    print(df['A'].mean())
    print(df['A'].median())
    print(df['A'].max())

    # Median of the values below the cutoff
    # (threshold is a user-chosen cutoff that must be defined beforehand)
    median = df.loc[df['A'] < threshold, 'A'].median()

    # Replace values above the cutoff with that median
    df['A'] = np.where(df['A'] > threshold, median, df['A'])

    # Print: after replace with median
    print(df['A'].mean())
    print(df['A'].median())
    print(df['A'].max())
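
The example above leaves `threshold` undefined. One possible choice, shown here purely as an assumption, is the upper IQR fence (Q3 + 1.5 × IQR). The data and the column name `A_clean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two extreme values
df = pd.DataFrame({"A": [10.0, 12.0, 11.0, 13.0, 12.0, 95.0, 11.0, 10.0, 120.0, 12.0]})

# Assumed threshold choice: the upper IQR fence, Q3 + 1.5 * IQR
q1, q3 = df["A"].quantile([0.25, 0.75])
threshold = q3 + 1.5 * (q3 - q1)

# Median of the values that are not flagged
median = df.loc[df["A"] <= threshold, "A"].median()

# Replace flagged values with that median; all other values are kept
df["A_clean"] = np.where(df["A"] > threshold, median, df["A"])

print(df)
```

Computing the median from the non-flagged values only keeps the replacement value from being pulled toward the outliers.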

### Data Scaling

#### 1. Feature scaling

Scaling is typically worth applying when:

- Dataset features are continuous and on different scales


- Dataset features have high variance


- The model has linear characteristics or operates in a linear space


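Note: scaling changes only the location and spread of a feature. It does not by itself make a skewed feature approximately normal; use a transformation for that.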


#### 2. Standardization: Z-score standardization

(1) Scales each feature to mean = 0 and standard deviation = 1

(2) Scaling function: 

    sklearn.preprocessing.StandardScaler()

Example: Standardizing data with Z-scores

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    
    # Subset features
    numeric_cols = df.select_dtypes(include=[np.number])
    categoric_cols = df.select_dtypes(include=[object])

    # Instantiate
    scaler = StandardScaler()

    # Fit and transform, convert to DF (keep the original index so rows stay aligned)
    numeric_cols_scaled = scaler.fit_transform(numeric_cols)
    numeric_cols_scaledDF = pd.DataFrame(numeric_cols_scaled, columns=numeric_cols.columns, index=numeric_cols.index)

    # Concatenate categoric columns to scaled numeric columns
    final_DF = pd.concat([categoric_cols, numeric_cols_scaledDF], axis=1)
    print(final_DF.head())
    print(final_DF.var(numeric_only=True))
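
As a quick check of what `StandardScaler` computes, the sketch below compares its output with the formula (x - mean) / std on a small hypothetical table. The column names and values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "income": [20000.0, 35000.0, 50000.0, 80000.0, 120000.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

# StandardScaler computes (x - mean) / std using the population std (ddof=0)
manual = (df - df.mean()) / df.std(ddof=0)

print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
print(np.allclose(scaled, manual))
```

Because pandas `var()` and `std()` default to `ddof=1`, calling them on the scaled data reports n / (n - 1) rather than exactly 1 for the variance; passing `ddof=0` matches the scaler's convention.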

#### 3. Normalization: Min/max normalizing

(1) Scales each feature to the range [0, 1]

(2) Scaling function: 

    sklearn.preprocessing.MinMaxScaler()
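
A minimal sketch of min/max normalization on the same kind of hypothetical table, compared with the formula (x - min) / (max - min). The column names and values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "income": [20000.0, 35000.0, 50000.0, 80000.0, 120000.0],
})

scaler = MinMaxScaler()  # default feature_range=(0, 1)
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

# Equivalent manual computation, column by column
manual = (df - df.min()) / (df.max() - df.min())

print(scaled)
print(np.allclose(scaled, manual))
```

Since the minimum and maximum define the output range, a single extreme value can squeeze the remaining points into a narrow band, which is one reason outlier handling comes before scaling in the workflow.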
