ryanholbrook/data-science-project

An outline for a data science project.

Table of Contents

  1. Process
  2. Objectives
  3. Data Collection
    1. Experimental Design
  4. Data Specification
    1. Determine Statistical Type
    2. Determine Roles
    3. Determine Validation Rules
  5. Data Correction
    1. String Correction
    2. Numeric Correction
    3. Missing Correction
  6. Determine Model Validation Scheme
  7. Data Exploration
  8. Establish Baseline Performance and Evaluation Metrics
  9. Feature Engineering and Model Selection
  10. Training and Tuning
  11. Model Evaluation
  12. Model Interpretation
  13. Reporting and Deployment
  14. References

Process

Objectives

Understand utilities. What are the costs of erroneous predictions? What are the benefits of correct predictions? This should (help) determine your scoring function.
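
For example, if false negatives cost more than false positives, the scoring function can encode that directly. A minimal sketch in R, with made-up costs and hypothetical truth/prediction vectors:

  # Asymmetric misclassification costs (the cost values are illustrative assumptions).
  cost_score <- function(truth, pred, fn_cost = 5, fp_cost = 1) {
    fn <- sum(truth == 1 & pred == 0)  # missed positives
    fp <- sum(truth == 0 & pred == 1)  # false alarms
    fn * fn_cost + fp * fp_cost        # total cost; lower is better
  }

  cost_score(truth = c(1, 0, 1, 0), pred = c(1, 1, 0, 0))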

Determine constraints. Do you need predictions? Does the model need to be interpretable? What resources do you have?

Define success. How do you know you're done?

Data Collection

If possible, estimate how much data will be needed to satisfactorily meet the objectives. Alternatively (if more data collection is not possible), determine to what extent the data available will meet those objectives. (Sample size calculations).
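
A standard starting point is a power calculation. A minimal sketch with base R, assuming a two-sample comparison and an effect size of half a standard deviation:

  # Sample size per group to detect a 0.5 SD difference with 80% power at alpha = 0.05.
  power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)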

Collect the data in raw form, if not provided.

Experimental Design

Statistically optimal ways of collecting data. The mantra is: to get the most from your experiments, reduce the variance (Good 2005). When data collection is expensive, try to do it in the best way possible. Experimental design can be used for computer simulations, too. Consider active learning if you need to label data.
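
As a small illustration (with made-up factors), a replicated full factorial design in base R, run in randomized order:

  # All combinations of two factors, three replicates each, randomized run order.
  design <- expand.grid(
    temperature = c("low", "high"),
    catalyst    = c("A", "B", "C")
  )
  design <- design[rep(seq_len(nrow(design)), times = 3), ]  # replicate
  design <- design[sample(nrow(design)), ]                   # randomize
  head(design)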

Data Specification

What does clean data look like? These properties should be part of the data validation process. Anything in the data that deviates from this specification needs to be corrected or otherwise addressed before modeling. Garbage In, Garbage Out.

What are you assuming about the data? These properties are assumptions we make about the data based on what it is supposed to represent (that is, the type and distribution of its corresponding population) or how it was collected (for example, whether it is an independent, identically distributed sample). These properties need to be addressed as part of the modeling process.

If the data is collected as part of an ongoing process (like with stock prices, say), we need to be careful about drift. Distributions tend to change over time with changing conditions (regime change).

Determine Statistical Type

  • Binary: Integer or Logical
  • Nominal: Factor
  • Ordinal: Ordered Factor
  • Continuous: Float
  • Linear Temporal: Date
  • Cyclic Temporal: Factor
  • Text: String

See Statistical Data Type [Wikipedia]
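
A sketch of mapping these types onto R types; the data frame df and its column names are hypothetical:

  df <- within(df, {
    churned  <- as.logical(churned)                                    # binary
    region   <- factor(region)                                         # nominal
    severity <- ordered(severity, levels = c("low", "medium", "high")) # ordinal
    income   <- as.numeric(income)                                     # continuous
    signup   <- as.Date(signup, format = "%Y-%m-%d")                   # linear temporal
    weekday  <- factor(weekday, levels = c("Mon", "Tue", "Wed", "Thu",
                                           "Fri", "Sat", "Sun"))       # cyclic temporal
    comment  <- as.character(comment)                                  # text
  })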

Determine Roles

  • Feature
  • Response
  • Identity
  • Information

Determine Validation Rules

Specify what properties the data must have to be "clean". Domain knowledge is essential here.
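
One option is to declare the rules once and check the data against them, for instance with the validate package (the rules and column names below are hypothetical):

  library(validate)

  rules <- validator(
    age >= 0,
    age <= 120,
    income >= 0,
    status %in% c("active", "closed"),
    !is.na(customer_id)
  )

  summary(confront(df, rules))  # how many records pass or fail each rule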

Data Correction

Decide how to correct erroneous data. Understanding why the data is erroneous is important. Visualization tools can help.

Many errors can be corrected through an automatic application of rules. For others, error localization can be used to remove any uncorrectable fields.

Correct observed data before imputing missing data.

String Correction
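
A minimal sketch of rule-based string cleanup in base R (the column name is hypothetical):

  df$city <- trimws(df$city)                        # strip stray whitespace
  df$city <- tolower(df$city)                       # normalize case
  df$city <- gsub("\\s+", " ", df$city)             # collapse repeated spaces
  df$city[df$city %in% c("", "n/a", "na")] <- NA    # placeholder strings become missing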

Numeric Correction
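
A minimal sketch of rule-based numeric fixes; the unit rule and the plausible ranges are assumptions for illustration:

  i <- which(df$height_cm < 3)                      # likely recorded in metres
  df$height_cm[i] <- df$height_cm[i] * 100          # convert to centimetres

  df$age[which(df$age < 0 | df$age > 120)] <- NA    # impossible values -> missing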

Missing Correction
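
A minimal sketch of simple single imputation in base R (median for a numeric column, most frequent level for a factor); model-based or multiple imputation is often preferable:

  df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)

  mode_level <- names(which.max(table(df$region)))  # most frequent observed level
  df$region[is.na(df$region)] <- mode_level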

Determine Model Validation Scheme

Decide on validation procedures (for feature engineering, performance, tuning, benchmarking) and make data splits.
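
A minimal sketch with base R, assuming a data frame df: hold out a test set for final evaluation and assign cross-validation folds on the remainder:

  set.seed(42)
  n        <- nrow(df)
  test_idx <- sample(n, size = floor(0.2 * n))   # 20% held out until the end
  train    <- df[-test_idx, ]
  test     <- df[test_idx, ]

  k     <- 5
  folds <- sample(rep(seq_len(k), length.out = nrow(train)))  # fold id per training row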

Data Exploration

Consider various automated EDA tools. See "The Landscape of R Packages for Automated Exploratory Data Analysis" by Staniak and Biecek.
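
Before reaching for a dedicated package, base R already covers the basics:

  summary(df)                           # per-column summaries
  str(df)                               # types and a preview of values
  pairs(df[, sapply(df, is.numeric)])   # scatterplot matrix (for a modest number of columns)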

Establish Baseline Performance and Evaluation Metrics

  • Null (Featureless) Model: the simple expectation E[Y], the optimal constant prediction under RMSE
  • Best Single Variable Model: the best of the single-feature models, max over X_i of E[Y | X_i = x_i]
  • Naive Bayes: a naive lower bound
  • Current Performance: a practical lower bound
  • Bayes Error Estimates: an estimate of the best performance achievable on the data set (try estimating by resampling kNN)
  • Other Complexity Estimates:

see:
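
A minimal sketch of the first two baselines for a numeric response y in a training frame train (names hypothetical); for honest numbers, compute these inside the validation scheme rather than in-sample:

  rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

  null_rmse <- rmse(train$y, mean(train$y))          # featureless model: predict E[Y]

  single_rmse <- sapply(setdiff(names(train), "y"), function(v) {
    fit <- lm(reformulate(v, response = "y"), data = train)  # one-feature model
    rmse(train$y, fitted(fit))
  })
  best_single <- min(single_rmse)                    # best single-variable baseline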

Feature Engineering and Model Selection

Understand the data. Use domain knowledge and visualization.

Understand how your data will interact with your algorithms. Be aware of the following (a few quick checks are sketched after this list):

  • Factor encodings
  • Outliers and robustness assumptions
  • Missing data
  • Statistical assumptions (e.g., independence, identical distribution, normality, homoskedasticity)
  • Sensitivity to scale
  • High correlation
  • Rank deficiency (linear dependence)
  • Multicollinearity (ill conditioning)
  • Noninformative features (regularization)
  • Feature interactions and nonlinearity
  • High dimensionality
  • Computational complexity
  • Convergence rates (some algorithms require a lot of data to make accurate estimates)
  • Sparsity
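
A few of the items above can be checked quickly in base R; the data frame df and the response name y are hypothetical:

  colSums(is.na(df))                                        # missing data per column

  num  <- df[, sapply(df, is.numeric)]
  cors <- cor(num, use = "pairwise.complete.obs")
  which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)  # highly correlated pairs

  X <- model.matrix(~ . - y, data = df)
  qr(X)$rank < ncol(X)                                      # TRUE signals rank deficiency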

Consider representation learning methods:

  • the PCA family: linear, nonlinear, kernel, probabilistic, ICA, FA, categorical, MCA, HOMALS
  • Autoencoders (a nonlinear analogue of PCA)
  • Response encodings
  • Missing value imputation
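
A minimal sketch of linear PCA with base R (assuming complete numeric columns); scaling matters because PCA is sensitive to the units of each column:

  num_cols <- sapply(df, is.numeric)
  pca <- prcomp(df[, num_cols], center = TRUE, scale. = TRUE)
  summary(pca)            # variance explained per component
  scores <- pca$x[, 1:2]  # first two components as new features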

Data Transforms:

  • log transforms
  • Box-Cox family
  • interactions
  • smoothing: splines, kernels
  • factor encoding: dummy, response, thermometer, cyclic
  • time-series embeddings
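
A sketch of a few of these with base R and MASS; the column names are hypothetical:

  df$log_income <- log1p(df$income)                    # log transform (handles zeros)

  library(MASS)
  boxcox(lm(income ~ age, data = df))                  # profile the Box-Cox parameter (response must be positive)

  X <- model.matrix(~ region + income:age, data = df)  # dummy encoding plus an interaction

  df$hour_sin <- sin(2 * pi * df$hour / 24)            # cyclic encoding of hour-of-day
  df$hour_cos <- cos(2 * pi * df$hour / 24)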

Feature Selection:

  • Filters
  • Wrappers
  • Embedded
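
A minimal sketch of a filter method: rank numeric features by absolute correlation with the response and keep the top few (the cutoff is arbitrary; names hypothetical):

  num_feats <- setdiff(names(train)[sapply(train, is.numeric)], "y")
  scores    <- sapply(num_feats, function(v) abs(cor(train[[v]], train$y, use = "complete.obs")))
  keep      <- head(names(sort(scores, decreasing = TRUE)), 10)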

When you are transforming the data, it is important to ask: is the transformation data-dependent? Does it depend on the features? Does it depend on the response? If so, it ought to be estimated inside the validation procedure; otherwise you risk overfitting through data leakage. Data-independent transformations can be applied at will.
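
A minimal sketch of the point with centering and scaling: the parameters are estimated on the training rows only and then reused on the held-out rows:

  mu    <- mean(train$income)
  sigma <- sd(train$income)

  train$income_z <- (train$income - mu) / sigma
  test$income_z  <- (test$income  - mu) / sigma   # reuse training parameters; never refit on test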

Training and Tuning

Consider model aggregation methods: bagging, model averaging, ensembles, SuperLearning. You want a collection of models giving imperfectly correlated predictions. You may be able to reduce hyperparameter optimization and feature selection this way. (Put a sample of models in and let the superlearner select from them.)
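
A minimal sketch of plain model averaging, assuming the rpart package is available and a numeric response y (the model choices are illustrative):

  library(rpart)

  fit_lm   <- lm(y ~ ., data = train)
  fit_tree <- rpart(y ~ ., data = train)

  pred_avg <- (predict(fit_lm, newdata = test) + predict(fit_tree, newdata = test)) / 2
  sqrt(mean((test$y - pred_avg)^2))               # RMSE of the averaged predictions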

If a particular statistic is of interest, consider Targeted Learning.

Model Evaluation

  • Residuals

  • Permutation Tests: Compare to the same model fit on a randomised response. Can help detect overfitting.

  • Benchmarking

Examining the residuals is very important. It can help you determine whether your model is well-specified. You want the residuals to look like white noise. Look at QQ-plots or worm plots.

Compare a parametric model to some non-parametric equivalent. If the parametric model is well-specified, it should outperform the non-parametric model. This is because a parametric model should be able to "leverage its assumptions."
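
A minimal sketch of the permutation test mentioned above, for a regression model: refit the same model on a shuffled response; if the real model's fit is not clearly better than the shuffled ones, the apparent performance is suspect:

  set.seed(1)
  real_rmse <- sqrt(mean(residuals(lm(y ~ ., data = train))^2))

  perm_rmse <- replicate(100, {
    shuffled <- transform(train, y = sample(y))        # break the X-y relationship
    sqrt(mean(residuals(lm(y ~ ., data = shuffled))^2))
  })

  mean(perm_rmse <= real_rmse)                         # approximate p-value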

see:

Model Interpretation

Reporting and Deployment

References

On process, see:

On infrastructure, see:

On data validation, see:

On experimental design, see:

On models generally, see:

On deep learning, see:

On time series, see:

On feature engineering, see:

On validation and resampling, see:

On model interpretation, see:

On statistics and mathematics, see:

On graphics, see:
