## data pipeline in research environment
* we need to deploy the entire pipeline, not only the models, to both research and production environments
* the predictions from both environments should be consistent
* challenges of traditional software
  + reliability
  + reusability
  + maintainability
  + flexibility
  + reproducibility (specific to ML)
* open source software/models are preferred. 
  + in case we don't have a specific model from open source, we create in house software
    + version software for good reproducibility
    + tested (reliability)
    + shareable
      + reusability
    + minimise deployment time
    
### Typical machine learning pipeline
* gathering data sources
* data analysis (understand data)
* data pre-processing (feature engineering)
  + filling missing values
  + encoding categorical variables
* variable selection (feature selection)
  + find the most predictive variables and include them in the model
* machine learning model building
* model evaluation and business uplift evaluation
  + use the statistics metrics related to business value
  
#### Feature Engineering
* feature engineering addresses the following 5 data characteristics
  + missing data 
    + missing values within a variable
  + labels in categorical variables  
    + convert strings to numeric representations
    + cardinality: the number of unique labels
    + rare labels: infrequent categories (unbalanced data, can cause overfitting in tree-based models)
      + some labels may appear only in the test set and not in the training set
  + distribution
    + normal vs skewed
  + outliers
    + unusual or unexpected values
  + feature magnitude-scale
    + machine learning models sensitive to feature scale
      + linear and logistic regression
      + neural networks
      + SVMs
      + KNN
      + K-means clustering
      + linear discriminant analysis (LDA)
      + principal component analysis
      
    + tree-based ML models insensitive to feature scale
      + classification and regression trees
      + random forests
      + gradient boosted trees  
      
* missing value imputation techniques
  + numerical variables
    + mean/median imputation
    + arbitrary value imputation
    + end of tail imputation
  + categorical variables
    + frequent category imputation
    + adding a missing category
  + for both categorical and numerical data
    + complete case analysis
    + adding a "missing" indicator
    + random sample imputation
* categorical encoding
* rare labels of a categorical variable can be combined into one group
* distributions (some models make assumptions on the variable distributions)
  + apply variable transformations to make the distribution more Gaussian like
    + logarithmic
    + exponential
    + reciprocal
    + box-cox
    + yeo-johnson
  + discretisation
    + cut the data to discrete buckets
      + unsupervised
        + equal-width
        + equal-frequency
        + k means
      + supervised
        + decision trees
* outliers
  + discretisation
  + capping / censoring
  + truncation
* feature extraction of datetime variables
  + convert the datetime to day, month, semester, and year
  + extract hour, min, sec 
  + calculate the elapsed time
    + time between transactions
    + age
* text
  + characters, words, unique words
  + lexical diversity
  + sentences, paragraphs
  + bag of words
  + TFIDF
* transactions and time series (aggregate data)
  + number of payments in last 3, 6, 12 months
  + time since last transaction
  + total spending in last month
* geo data
  + distance
* feature combination
  + ratio: total debt divided by income gives the debt-to-income ratio
  + sum: debts on different credit cards add up to the total debt
  + subtraction: income minus expenses gives the disposable income
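
A few of the techniques listed above can be sketched with pandas. This is a minimal illustration on made-up numbers, not the course's code: the column names (`income`, `total_debt`) and the IQR-based capping limit are assumptions chosen for the example.

```python
import pandas as pd

# Toy, made-up income values with one extreme outlier (illustrative only)
income = pd.Series(
    [18.0, 22.0, 25.0, 27.0, 30.0, 34.0, 41.0, 55.0, 60.0, 400.0], name="income"
)

# Equal-width discretisation: 3 bins of identical width over the value range
equal_width = pd.cut(income, bins=3, labels=False)

# Equal-frequency discretisation: 5 bins holding the same number of rows
equal_freq = pd.qcut(income, q=5, labels=False)

# Capping (censoring): values above Q3 + 1.5 * IQR are set to that limit
q1, q3 = income.quantile(0.25), income.quantile(0.75)
upper_limit = q3 + 1.5 * (q3 - q1)
capped = income.clip(upper=upper_limit)

# Feature combination: ratio of two hypothetical columns
debts = pd.DataFrame({"total_debt": [5.0, 20.0], "income": [50.0, 40.0]})
debts["debt_to_income"] = debts["total_debt"] / debts["income"]

print(equal_width.tolist())
print(equal_freq.tolist())
print(capped.tolist())
```

Equal-width bins are dominated by the outlier (almost every row lands in the first bin), while equal-frequency bins spread the rows evenly, which is one reason discretisation is also listed as an outlier treatment.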
 
#### Feature selection
* algorithms or procedures that allow us to find the best subset of features
* process to identify the most predictive features
* why select features?
  + simple models are easier to understand
  + shorter training times
  + enhanced generalization by reducing overfitting  
  + easier for software developers to implement and put into production
    + smaller volume of data to transfer over the network to feed the model (e.g. JSON messages)
    + less code to pre-process the features (less data engineering code)
    + less code to handle potential errors
      + typically, we write error handlers for each variable we send to model
    + less information to log
  + reduced risk of data errors during model use
  + data redundancy (many features contain similar information)
* variable redundancy
  + constant variables (the variable only has one value)
  + quasi-constant variables (> 99% of observations show same value)
  + duplication (same variable multiple times in the dataset)
  + correlation
    + correlated variables provide the same information
* Feature selection methods
  + filter methods
    + filter features based on simple statistical tests such as ANOVA or chi-squared
    + pros
      + quick feature removal
      + model agnostic
      + fast computation
    + cons
      + does not capture redundancy since each feature is evaluated independently
      + does not capture feature interaction
      + often lower model performance than the other methods
      
  + wrapper methods
    + take the ML algorithm into consideration and evaluate groups of features rather than each feature separately
    + also known as greedy algorithms; they evaluate all possible feature combinations and decide which one is the best
    + pros
      + consider feature interaction
      + best performance
      + best feature subset for a given algorithm
    + cons
      + not model agnostic
      + computationally expensive
      + often impracticable
  + embedded methods
    + feature selection happens during training of the ML algorithm, such as Lasso regression or feature importance from tree-based algorithms
    + pros
      + good model performance
      + capture feature interaction
      + better than filter methods
      + faster than wrapper methods
    + cons
      + not model agnostic
* the course didn't select the embedded method, but defined a list of features to be integrated into the pipeline
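
The variable redundancy checks listed above (constant, quasi-constant, duplicated, correlated) can be sketched as a single filter step. This is a hedged illustration: the helper name `drop_redundant_features`, the 0.9 correlation threshold and the toy columns are assumptions, and only the 99% quasi-constant threshold comes from the notes.

```python
import numpy as np
import pandas as pd


def drop_redundant_features(X, quasi_constant_threshold=0.99, corr_threshold=0.9):
    """Drop constant, quasi-constant, duplicated and highly correlated columns."""
    X = X.copy()

    # Constant / quasi-constant: the most frequent value covers more than
    # `quasi_constant_threshold` of the rows
    top_share = X.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )
    X = X.loc[:, top_share <= quasi_constant_threshold]

    # Duplicated columns: identical values stored under a different name
    X = X.loc[:, ~X.T.duplicated()]

    # Correlation: for each highly correlated pair, drop the later column
    corr = X.select_dtypes("number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return X.drop(columns=to_drop)


# Synthetic data built so that three of the five columns are redundant
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "a_copy": a,                                        # duplicate of "a"
    "a_scaled": 2 * a + 0.01 * rng.normal(size=200),    # strongly correlated with "a"
    "b": rng.normal(size=200),
    "constant": 1.0,                                    # single value
})
selected = drop_redundant_features(X)
print(list(selected.columns))
```

Because each check looks only at the features themselves, this step is model agnostic and cheap, which matches the pros listed for filter methods.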

#### Machine learning model pipeline - model building
* first build several models
* then evaluate the models by different metrics
  + ROC-AUC
  + accuracy
  + MAE, RMSE
* model deployment
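
As a small illustration of the metrics named above, the sketch below computes them with `sklearn.metrics` on made-up predictions. The numbers are invented for the example and the 0.5 classification cut-off is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    roc_auc_score,
)

# Regression: made-up true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Classification: made-up labels and predicted probabilities
labels = np.array([0, 0, 1, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc = roc_auc_score(labels, probs)                        # uses the raw scores
accuracy = accuracy_score(labels, (probs >= 0.5).astype(int))  # uses hard labels

print(f"MAE={mae:.3f} RMSE={rmse:.3f} ROC-AUC={roc_auc:.3f} accuracy={accuracy:.3f}")
```

ROC-AUC and accuracy apply to classification, MAE and RMSE to regression, so in practice only the pair matching the problem type is reported.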

### Demonstrations of data analysis
* explore the target variable
  + histogram of the distribution
  + transform the data with a logarithm and show that the distribution becomes more Gaussian
  + find all the categorical and numerical columns and count how many there are of each
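
The target exploration above can be sketched as follows. The data here is synthetic (a log-normal stand-in for `SalePrice`), so the numbers only illustrate the effect of the transformation and are not results from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target standing in for SalePrice (not the real data)
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "SalePrice": rng.lognormal(mean=12, sigma=0.4, size=1000),
    "LotArea": rng.lognormal(mean=9, sigma=0.5, size=1000),
    "Street": rng.choice(["Pave", "Grvl"], size=1000),
})

# Skewness close to 0 indicates a more symmetric, Gaussian-like shape
skew_raw = data["SalePrice"].skew()
skew_log = np.log(data["SalePrice"]).skew()

# Split the predictors into categorical (non-numeric) and numerical columns
cat_vars = list(data.select_dtypes(exclude="number").columns)
num_vars = [var for var in data.columns if var not in cat_vars and var != "SalePrice"]

print(f"skew raw={skew_raw:.2f}, skew log={skew_log:.2f}")
print(f"{len(cat_vars)} categorical, {len(num_vars)} numerical")
```

Calling `data["SalePrice"].hist(bins=50)` before and after the transformation gives the visual version of the same check.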
  
#### process missing values  
* find all the columns with missing values, and the percentage of missing values of each column
  + `vars_with_na = [var for var in data.columns if data[var].isnull().sum() > 0]`
  + `data[vars_with_na].isnull().mean().sort_values(ascending=False)`
  + plot the columns with missing values and their missing value percentages
* find out how many categorical and numerical columns have missing values  
  + traverse each column with missing values, and plot the mean and std of the target variable for rows grouped by whether or not that column's value is missing.
    + if the target variable shows a similar mean and std for the rows with missing and with valid values of that column, the missing values of that column may not be critical for prediction
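
The missingness check described above can be sketched as a small helper. The helper name `target_by_missingness` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): 'LotFrontage' has missing values, 'SalePrice' is the target
data = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan, 70.0, 60.0],
    "SalePrice": [200.0, 120.0, 220.0, 100.0, 210.0, 190.0],
})


def target_by_missingness(df, var, target="SalePrice"):
    """Mean and std of the target where `var` is missing (1) or present (0)."""
    is_missing = df[var].isnull().astype(int).rename(var + "_missing")
    return df.groupby(is_missing)[target].agg(["mean", "std"])


summary = target_by_missingness(data, "LotFrontage")
print(summary)
```

In this toy example the two groups have clearly different means, which would suggest that missingness itself carries information about the target.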

#### Temporal variables
* There are 4 variables containing year information
* traverse each column to check how many unique values are there for each column (which years are contained in each column)
* group the data by `YrSold` and check the median of SalePrice for each year
* group the data by `YearBuilt` and check the median of SalePrice for each year
* group the data by `YrSold` and check the median of the elapsed times (year sold minus year built, year sold minus year remodelled) for each sale year. In more recent years, the elapsed time between remodelling and sale is longer, meaning that the houses sold more recently tend to be older
* group by the elapsed times (year sold minus year built, year sold minus year remodelled) and show the median of SalePrice
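
A minimal sketch of the group-by checks above, on made-up rows. The column names follow the house price dataset, but the values are invented and the `age_at_sale` column is a name chosen for this example.

```python
import pandas as pd

# Toy rows (made up) using column names from the house price dataset
data = pd.DataFrame({
    "YrSold":    [2006, 2006, 2008, 2008, 2010, 2010],
    "YearBuilt": [2000, 1996, 1990, 1988, 1970, 1960],
    "SalePrice": [250.0, 230.0, 210.0, 190.0, 150.0, 130.0],
})

# Median sale price for each year of sale
median_price_by_year = data.groupby("YrSold")["SalePrice"].median()

# Median age of the house at the time of sale, for each year of sale
data["age_at_sale"] = data["YrSold"] - data["YearBuilt"]
median_age_by_year = data.groupby("YrSold")["age_at_sale"].median()

print(median_price_by_year)
print(median_age_by_year)
```

Appending `.plot()` to either result gives the line chart of the median against the year of sale.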

#### Discrete variables
* find all columns with discrete values
`[var for var in num_vars if len(data[var].unique()) < 20 and var not in year_vars]`
* for each discrete column, plot the saleprice vs the value

#### Continuous variables
* traverse all columns that are not categorical and discrete and not year related
* find the distributions of each column variable
* separate columns having skewed and normal distributions
* apply the Yeo-Johnson transformation from scipy to the skewed columns
  + most columns now have a Gaussian distribution
``` python
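    from scipy import stats
    import matplotlib.pyplot as plt
    # transform a copy so the original data stays untouched
    tmp = data.copy()
    for var in cont_vars:
        # yeojohnson returns the transformed values and the fitted lambda
        tmp[var], param = stats.yeojohnson(data[var])
    tmp[cont_vars].hist(bins=30, figsize=(15,15))
    plt.show()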
```

* plot the target variable against each original and transformed variable and check whether the transformation strengthens the correlation between the target and the predictor variables
* for columns with strictly positive values, we apply the logarithmic transformation and run the same checks
  + distribution of the column values
  + correlation between the target and the predictor variables
  
* for highly skewed data (for example, most of the values are zeros), we can binarise the variable: keep the zeros as 0 and set all non-zero values to 1
  + group by the zero and one value and check if target variable has a different distribution between the two groups
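
The binarisation check above might look like the following sketch. The values are made up, and `has_pool` is a name introduced for the example.

```python
import numpy as np
import pandas as pd

# Toy data (made up): most 'PoolArea' values are zero
data = pd.DataFrame({
    "PoolArea": [0, 0, 0, 0, 512, 0, 648, 0],
    "SalePrice": [150.0, 160.0, 140.0, 155.0, 300.0, 145.0, 320.0, 150.0],
})

# 0 stays 0, any non-zero value becomes 1
has_pool = pd.Series(np.where(data["PoolArea"] == 0, 0, 1), name="has_pool")

# Compare the target distribution between the two groups
summary = data.groupby(has_pool)["SalePrice"].describe()
print(summary[["count", "mean", "std"]])
```

If the two groups show clearly different target distributions, the binary flag is worth keeping even though the original magnitudes are discarded.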
  
#### Categorical variables
* count the unique values of each categorical variable
`data[cat_vars].nunique().sort_values(ascending=False).plot.bar(figsize=(12, 5))`
* a set of quality variable columns that describe the quality of house using 
  + values similar to 
    + Ex = Excellent
    + Gd = Good
    + TA = Average/typical
    + Fa = Fair
    + Po = Poor
  + we map these strings to numbers representing quality
    + Po to 1
    + Fa to 2
    + TA to 3
    + Gd to 4
    + Ex to 5
    + Missing and NA to 0
  + after the transformation, draw a box plot of the saleprice against the transformed numbers, overlaid with the original data points
 
* for categorical columns with rare labels
  + some values/labels have < 1% in the column values
  ```python
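    def find_rare_labels(df, var, rare_perc):  # helper name is illustrative
        # share of rows per label; keep only the labels below the threshold
        tmp = df.groupby(var)['SalePrice'].count() / len(df)
        return tmp[tmp < rare_perc]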
  ```
  + two problems of these variables
    + overfitting due to unbalanced distribution
    + the test set may contain values not present in the training set
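
The quality-score mapping described earlier in this section can be sketched with a dictionary and `Series.map`. The column name `ExterQual` and the sample values are assumptions for the example; the mapping itself follows the list above.

```python
import numpy as np
import pandas as pd

# Mapping from the notes above; "Missing" and "NA" both map to 0
quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5, "Missing": 0, "NA": 0}

# Toy column (made-up values) with one true missing entry
data = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex", np.nan, "Fa", "Po"]})

# Treat real NaN as "Missing" first, then map the strings to ordered integers
data["ExterQual"] = data["ExterQual"].fillna("Missing").map(quality_map)
print(data["ExterQual"].tolist())
```

Filling NaN before mapping matters: any label that is absent from the dictionary would otherwise become NaN after `map`.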

### Feature Engineering
#### missing value
* impute missing values
* for categorical columns, if there is a high percentage of missing values (> 10%), assign a new label "missing" to them. If fewer than 10% of the values are missing, impute them with the most frequent value
  + to find the most frequent value, use                      
    + `mode = X_train[var].mode()[0]`
    + `X_train[var] = X_train[var].fillna(mode)` (and the same for `X_test`, using the mode learned from the training set)
* for numeric columns, we replace the missing value by the mean, and create another binary value column indicating if the value is missing (1 if missing otherwise 0)  
  ```python
    import numpy as np

    # calculate the mean on the training set only
    mean_val = X_train[var].mean()
    
    # add a binary indicator column (1 if missing, 0 otherwise)
    X_train[var + '_na'] = np.where(X_train[var].isnull(), 1, 0)
    X_test[var + '_na'] = np.where(X_test[var].isnull(), 1, 0)
    
    # replace missing values by the training mean
    # (plain assignment, since inplace=True on a column selection is unreliable in recent pandas)
    X_train[var] = X_train[var].fillna(mean_val)
    X_test[var] = X_test[var].fillna(mean_val)
    
    # confirm we don't have missing values
    X_train[vars_with_na].isnull().sum()

  ```

#### Temporal variables
* use the time difference between the sold time and each temporal column variable
* create columns corresponding to the time difference and drop the original time columns
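
A minimal sketch of the elapsed-time step, assuming the year columns are plain integers. The helper name `add_elapsed_years` and the `_elapsed` suffix are choices made for this example, and the rows are made up.

```python
import pandas as pd


def add_elapsed_years(df, year_vars, sold_var="YrSold"):
    """Add '<var>_elapsed' = sold year - var, then drop the original year columns."""
    df = df.copy()
    for var in year_vars:
        df[var + "_elapsed"] = df[sold_var] - df[var]
    return df.drop(columns=year_vars + [sold_var])


# Toy rows (made up) with column names from the house price dataset
X_train = pd.DataFrame({
    "YrSold": [2008, 2010],
    "YearBuilt": [2000, 1990],
    "YearRemodAdd": [2005, 1990],
})
X_train = add_elapsed_years(X_train, ["YearBuilt", "YearRemodAdd"])
print(X_train)
```

The same function would be applied to `X_test`; no parameter is learned here, so there is nothing to fit on the training set first.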

#### Non-Gaussian distributed variables
* apply np.log transformation to positive value columns to make the data more Gaussian distributed
* apply Yeo-Johnson transformation to other non-Gaussian distributed numeric columns
  + transform the train dataset and obtain the param. Then use the param from train data to transform test dataset
    + `X_train['LotArea'], param = stats.yeojohnson(X_train['LotArea'])`
    + `X_test['LotArea'] = stats.yeojohnson(X_test['LotArea'], lmbda=param)`
  
* for a few columns very skewed with many values as zeros, we transform those into binary variables
  + `X_train[var] = np.where(X_train[var]==0, 0, 1)`
  + `X_test[var] = np.where(X_test[var]==0, 0, 1)`

#### Categorical variables
* remove rare labels
* convert strings to numbers (encoding) by monotonic encoding
  + group the categorical column values and get the mean of target for each group, order the groups and return the index
  ``` python
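    # tmp is assumed to hold the training features together with the target
    # order the labels by the mean of the target and number them 0, 1, 2, ...
    ordered_labels = tmp.groupby([var])[target].mean().sort_values().index
    order_label = {k: i for i, k in enumerate(ordered_labels, 0)}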
    
    # replace the categorical string by the order_labels
    train[var] = train[var].map(order_label)
    test[var] = test[var].map(order_label)
    
  ```
* standardize the values of the variables to the same range

* categorical variables having orders such as different quality, map them to integers
* for categorical variables with an unbalanced distribution, we keep all values with a frequency >= 10% and replace all values with a frequency < 10% by "Rare"
  + we do this by extracting a frequent list for each categorical variable using train data, and apply the list to both train and test datasets
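
The rare-label grouping above can be sketched as follows. The helper name `find_frequent_labels`, the `Zone` column and its values are assumptions for the example; the 10% threshold is the one stated in the notes.

```python
import pandas as pd


def find_frequent_labels(train, var, min_freq):
    """Labels of `var` whose share of the training rows is at least `min_freq`."""
    freq = train[var].value_counts(normalize=True)
    return freq[freq >= min_freq].index


# Toy data (made up): "C" is rare in train, "D" appears only in test
X_train = pd.DataFrame({"Zone": ["A"] * 12 + ["B"] * 7 + ["C"]})
X_test = pd.DataFrame({"Zone": ["A", "C", "D"]})

# Learn the list of frequent labels on the training set only
frequent = find_frequent_labels(X_train, "Zone", min_freq=0.10)

# Apply the same list to both sets; everything else becomes "Rare"
X_train["Zone"] = X_train["Zone"].where(X_train["Zone"].isin(frequent), "Rare")
X_test["Zone"] = X_test["Zone"].where(X_test["Zone"].isin(frequent), "Rare")
print(X_test["Zone"].tolist())
```

Because unseen labels such as "D" also fall into "Rare", this step addresses the problem of test-set values that never appeared in training.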

#### transformation for all columns
* use MinMaxScaler to scale all columns
    ``` python
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # fit the scaler on the training set only
    scaler = MinMaxScaler()
    scaler.fit(X_train)

    # sklearn returns numpy arrays, so we wrap the array with a pandas dataframe
    X_train = pd.DataFrame(
        scaler.transform(X_train),
        columns=X_train.columns
    )

    X_test = pd.DataFrame(
        scaler.transform(X_test),
        columns=X_test.columns
    )

    ```
    
#### save the transformed data and scaler

    ``` python
        import joblib

        X_train.to_csv('xtrain.csv', index=False)
        X_test.to_csv('xtest.csv', index=False)

        # targets go to their own files so they do not overwrite the features
        y_train.to_csv('ytrain.csv', index=False)
        y_test.to_csv('ytest.csv', index=False)

        joblib.dump(scaler, 'minmax_scaler.joblib')
    ```
    
#### Feature selection
* using Lasso regression and sklearn.feature_selection.SelectFromModel

 ``` python
    import pandas as pd
    import numpy as np

    import matplotlib.pyplot as plt

    from sklearn.linear_model import Lasso
    from sklearn.feature_selection import SelectFromModel

    # SelectFromModel keeps the features with non-zero coefficients
    # use a small alpha with small penalty to keep sufficient features
    sel_ = SelectFromModel(Lasso(alpha=0.001, random_state=0))
    sel_.fit(X_train, y_train)

    # get_support returns a boolean array: True for selected features, False otherwise
    sel_.get_support()
    selected_features = X_train.columns[(sel_.get_support())]

    # number of features is
    sel_.get_support().sum()  
```    

### Model training
* use the Lasso regression model to make predictions
* plot the predictions against `y_test`; the points should fall close to a straight line if the two are highly correlated
* draw the histogram of the residuals `y_test - y_predict` to check that the errors are approximately Gaussian distributed
* explore the relative importance of features with `np.abs(lin_model.coef_.ravel())`
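
The training and evaluation steps above can be sketched end to end. This is an illustration on synthetic data, not the course notebook: the feature names `f1` to `f4`, the train/test split and the noise level are assumptions, and only `alpha=0.001` is taken from the feature selection snippet.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the prepared data (features already scaled to [0, 1])
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(size=(300, 4)), columns=["f1", "f2", "f3", "f4"])
y = 3.0 * X["f1"] - 2.0 * X["f2"] + rng.normal(scale=0.1, size=300)
X_train, X_test = X.iloc[:200], X.iloc[200:]
y_train, y_test = y.iloc[:200], y.iloc[200:]

# Fit the Lasso model and predict on the held-out rows
lin_model = Lasso(alpha=0.001, random_state=0)
lin_model.fit(X_train, y_train)
y_predict = lin_model.predict(X_test)

# Residuals should look roughly Gaussian and centred on zero
residuals = y_test - y_predict
rmse = np.sqrt(mean_squared_error(y_test, y_predict))
r2 = r2_score(y_test, y_predict)

# Relative importance: absolute coefficients, largest first
importance = pd.Series(np.abs(lin_model.coef_.ravel()), index=X_train.columns)
importance = importance.sort_values(ascending=False)

print(f"RMSE={rmse:.3f}, R2={r2:.3f}")
print(importance)
```

`residuals.hist(bins=30)` and a scatter plot of `y_test` against `y_predict` give the two plots mentioned in the list. Comparing absolute coefficients as importances is only meaningful because the features share the same scale.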
