solegalli/feature-engineering-for-machine-learning

License: https://github.com/solegalli/feature-engineering-for-machine-learning/blob/master/LICENSE | Sponsorship: https://www.trainindata.com/

Feature Engineering for Machine Learning - Code Repository

Code repository for the online course Feature Engineering for Machine Learning

Launched: November 2017

Actively maintained.

Table of Contents

  1. Introduction: Variable Types

    1. Numerical Variables: Discrete and continuous
    2. Categorical Variables: Nominal and Ordinal
    3. Datetime variables
    4. Mixed variables: strings and numbers
  2. Variable Characteristics

    1. Missing Data
    2. Cardinality
    3. Category Frequency
    4. Distributions
    5. Outliers
    6. Magnitude
  3. Missing Data Imputation

    1. Mean and Median Imputation
    2. Arbitrary value imputation
    3. End of Tail Imputation
    4. Frequent category imputation
    5. Adding a "Missing" category
    6. Random Sample Imputation
    7. Adding a missing indicator
    8. Imputation with Scikit-learn
    9. Imputation with Feature-engine
  4. Multivariate Imputation

    1. MICE
    2. KNN imputation
  5. Categorical Variable Encoding

    1. One hot encoding: simple and of frequent categories
    2. Ordinal encoding: arbitrary and ordered
    3. Target mean encoding
    4. Weight of evidence
    5. Rare Label encoding
    6. Encoding with Scikit-learn
    7. Encoding with Feature-engine
    8. Encoding with category encoders
  6. Variable Transformation

    1. Log, power and reciprocal
    2. Box-Cox
    3. Yeo-Johnson
    4. Transformation with Scikit-learn
    5. Transformation with Feature-engine
  7. Discretisation

    1. Arbitrary
    2. Equal-frequency discretisation
    3. Equal-width discretisation
    4. K-means discretisation
    5. Discretisation with trees
    6. Discretisation with Scikit-learn
    7. Discretisation with Feature-engine
  8. Outliers

    1. Capping
    2. Trimming
  9. Datetime

    1. Extracting day, month, week, etc.
    2. Extracting hour, minute, second, etc.
    3. Capturing elapsed time
    4. Working with timezones
  10. Mixed variables

    1. Creating new variables from strings and numbers
  11. Feature creation

    1. Sum, product, count, mean, standard deviation, etc.
    2. Division, subtraction
    3. Polynomial expansion
    4. Splines
  12. Feature Scaling

    1. Standardisation
    2. MinMaxScaling
    3. MaxAbsoluteScaling
    4. RobustScaling
  13. Pipelines

    1. Classification Pipeline
    2. Regression Pipeline
    3. Pipeline with cross-validation
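
To give a flavour of how several of the topics above fit together, here is a minimal sketch that chains median imputation, a missing indicator, and min-max scaling. It is not taken from the course notebooks. The table of contents lists Scikit-learn and Feature-engine for these steps, whereas this sketch uses only the Python standard library so that it runs anywhere. The function names and the toy `age` values are hypothetical. The point it illustrates is that every parameter (the median, the minimum, the maximum) is learned from the training data and then reused unchanged on the test data.

```python
from statistics import median


def fit_median_imputer(values):
    """Learn the median from the observed (non-missing) training values."""
    observed = [v for v in values if v is not None]
    return median(observed)


def impute_with_indicator(values, fill_value):
    """Replace None with fill_value and return a 0/1 missing indicator."""
    imputed = [fill_value if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


def fit_min_max(values):
    """Learn the minimum and maximum from the training values."""
    return min(values), max(values)


def min_max_scale(values, lo, hi):
    """Scale values with a previously learned minimum and maximum."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


# Hypothetical toy data: None marks a missing observation.
train_age = [22.0, None, 35.0, 58.0, None, 41.0]
test_age = [None, 30.0, 64.0]

# Learn every parameter on the training set only.
fill = fit_median_imputer(train_age)
train_imputed, train_missing = impute_with_indicator(train_age, fill)
lo, hi = fit_min_max(train_imputed)

# Apply the learned parameters to both sets.
train_scaled = min_max_scale(train_imputed, lo, hi)
test_imputed, test_missing = impute_with_indicator(test_age, fill)
test_scaled = min_max_scale(test_imputed, lo, hi)

print(fill, train_missing, test_missing)
print(train_scaled)
print(test_scaled)
```

Because the scaler is fitted on the training set, a test value larger than the training maximum (64 here) maps to a number above 1. That is expected behaviour, not a bug.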

Links