ryanholbrook/data-science-project

An outline for a data science project.

Table of Contents

  1. Process
  2. Objectives
  3. Data Collection
    1. Experimental Design
  4. Data Specification
    1. Determine Statistical Type
    2. Determine Roles
    3. Determine Validation Rules
  5. Data Correction
    1. String Correction
    2. Numeric Correction
    3. Missing Correction
  6. Determine Model Validation Scheme
  7. Data Exploration
  8. Establish Baseline Performance and Evaluation Metrics
  9. Feature Engineering and Model Selection
  10. Training and Tuning
  11. Model Evaluation
  12. Model Interpretation
  13. Reporting and Deployment
  14. References

Process

Objectives

Understand utilities. What are the costs of erroneous predictions? What are the benefits of correct predictions? This should (help) determine your scoring function.
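
For example, if false negatives cost more than false positives, the scoring function can encode that directly. A minimal sketch in R, with made-up costs and hypothetical truth/prediction vectors:

  # Asymmetric misclassification costs (the cost values are illustrative assumptions).
  cost_score <- function(truth, pred, fn_cost = 5, fp_cost = 1) {
    fn <- sum(truth == 1 & pred == 0)  # missed positives
    fp <- sum(truth == 0 & pred == 1)  # false alarms
    fn * fn_cost + fp * fp_cost        # total cost; lower is better
  }

  cost_score(truth = c(1, 0, 1, 0), pred = c(1, 1, 0, 0))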

Determine constraints. Do you need predictions? Does the model need to be interpretable? What resources do you have?

Define success. How do you know you're done?

Data Collection

If possible, estimate how much data will be needed to satisfactorily meet the objectives. Alternatively (if more data collection is not possible), determine to what extent the data available will meet those objectives. (Sample size calculations).
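
A standard starting point is a power calculation. A minimal sketch with base R, assuming a two-sample comparison and an effect size of half a standard deviation:

  # Sample size per group to detect a 0.5 SD difference with 80% power at alpha = 0.05.
  power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)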

Collect the data in raw form, if not provided.

Experimental Design

Statistically optimal ways of collecting data. The mantra is: to get the most from your experiments, reduce the variance (Good 2005). When data collection is expensive, try to do it in the best way possible. Experimental design can be used for computer simulations, too. Consider active learning if you need to label data.
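
As a small illustration (with made-up factors), a replicated full factorial design in base R, run in randomized order:

  # All combinations of two factors, three replicates each, randomized run order.
  design <- expand.grid(
    temperature = c("low", "high"),
    catalyst    = c("A", "B", "C")
  )
  design <- design[rep(seq_len(nrow(design)), times = 3), ]  # replicate
  design <- design[sample(nrow(design)), ]                   # randomize
  head(design)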

Data Specification

What does clean data look like? These properties should be part of the data validation process. Anything in the data that deviates from this specification needs to be corrected or otherwise addressed before modeling. Garbage In, Garbage Out.

What are you assuming about the data? These properties are assumptions we make about the data based on what it is supposed to represent (that is, the type and distribution of its corresponding population) or how it was collected (for example, whether it is an independent, identically distributed sample). These properties need to be addressed as part of the modeling process.

If the data is collected as part of an ongoing process (like with stock prices, say), we need to be careful about drift. Distributions tend to change over time with changing conditions (regime change).

Determine Statistical Type

  • Binary: Integer or Logical
  • Nominal: Factor
  • Ordinal: Ordered Factor
  • Continuous: Float
  • Linear Temporal: Date
  • Cyclic Temporal: Factor
  • Text: String

See Statistical Data Type [Wikipedia]
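
A sketch of mapping these types onto R types; the data frame df and its column names are hypothetical:

  df <- within(df, {
    churned  <- as.logical(churned)                                    # binary
    region   <- factor(region)                                         # nominal
    severity <- ordered(severity, levels = c("low", "medium", "high")) # ordinal
    income   <- as.numeric(income)                                     # continuous
    signup   <- as.Date(signup, format = "%Y-%m-%d")                   # linear temporal
    weekday  <- factor(weekday, levels = c("Mon", "Tue", "Wed", "Thu",
                                           "Fri", "Sat", "Sun"))       # cyclic temporal
    comment  <- as.character(comment)                                  # text
  })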

Determine Roles

  • Feature
  • Response
  • Identity
  • Information

Determine Validation Rules

Specify what properties the data must have to be "clean". Domain knowledge is essential here.
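
One option is to declare the rules once and check the data against them, for instance with the validate package (the rules and column names below are hypothetical):

  library(validate)

  rules <- validator(
    age >= 0,
    age <= 120,
    income >= 0,
    status %in% c("active", "closed"),
    !is.na(customer_id)
  )

  summary(confront(df, rules))  # how many records pass or fail each rule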

Data Correction

Decide how to correct erroneous data. Understanding why the data is erroneous is important. Visualization tools can help.

Many errors can be corrected through an automatic application of rules. For others, error localization can be used to remove any uncorrectable fields.

Correct observed data before imputing missing data.

String Correction
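
A minimal sketch of rule-based string cleanup in base R (the column name is hypothetical):

  df$city <- trimws(df$city)                        # strip stray whitespace
  df$city <- tolower(df$city)                       # normalize case
  df$city <- gsub("\\s+", " ", df$city)             # collapse repeated spaces
  df$city[df$city %in% c("", "n/a", "na")] <- NA    # placeholder strings become missing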

Numeric Correction
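
A minimal sketch of rule-based numeric fixes; the unit rule and the plausible ranges are assumptions for illustration:

  i <- which(df$height_cm < 3)                      # likely recorded in metres
  df$height_cm[i] <- df$height_cm[i] * 100          # convert to centimetres

  df$age[which(df$age < 0 | df$age > 120)] <- NA    # impossible values -> missing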

Missing Correction
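
A minimal sketch of simple single imputation in base R (median for a numeric column, most frequent level for a factor); model-based or multiple imputation is often preferable:

  df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)

  mode_level <- names(which.max(table(df$region)))  # most frequent observed level
  df$region[is.na(df$region)] <- mode_level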

Determine Model Validation Scheme

Decide on validation procedures (for feature engineering, performance, tuning, benchmarking) and make data splits.
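
A minimal sketch with base R, assuming a data frame df: hold out a test set for final evaluation and assign cross-validation folds on the remainder:

  set.seed(42)
  n        <- nrow(df)
  test_idx <- sample(n, size = floor(0.2 * n))   # 20% held out until the end
  train    <- df[-test_idx, ]
  test     <- df[test_idx, ]

  k     <- 5
  folds <- sample(rep(seq_len(k), length.out = nrow(train)))  # fold id per training row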

Data Exploration

Consider various automated EDA tools. See "The Landscape of R Packages for Automated Exploratory Data Analysis" by Staniak and Biecek.
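
Before reaching for a dedicated package, base R already covers the basics:

  summary(df)                           # per-column summaries
  str(df)                               # types and a preview of values
  pairs(df[, sapply(df, is.numeric)])   # scatterplot matrix (for a modest number of columns)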

Establish Baseline Performance and Evaluation Metrics

  • Null (Featureless) Model: the simple expectation E[Y], the optimal constant prediction under RMSE
  • Best Single Variable Model: the best of the single-feature models, max over X_i of E[Y | X_i = x_i]
  • Naive Bayes: a naive lower bound
  • Current Performance: a practical lower bound
  • Bayes Error Estimates: an estimate of the best performance achievable on the data set (try estimating by resampling kNN)
  • Other Complexity Estimates:

see:
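
A minimal sketch of the first two baselines for a numeric response y in a training frame train (names hypothetical); for honest numbers, compute these inside the validation scheme rather than in-sample:

  rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

  null_rmse <- rmse(train$y, mean(train$y))          # featureless model: predict E[Y]

  single_rmse <- sapply(setdiff(names(train), "y"), function(v) {
    fit <- lm(reformulate(v, response = "y"), data = train)  # one-feature model
    rmse(train$y, fitted(fit))
  })
  best_single <- min(single_rmse)                    # best single-variable baseline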

Feature Engineering and Model Selection

Understand the data. Use domain knowledge and visualization.

Understand how your data will interact with your algorithms. Be aware of the following (a few quick checks are sketched after this list):

  • Factor encodings
  • Outliers and robustness assumptions
  • Missing data
  • Statistical assumptions (e.g., independence, identical distribution, normality, homoskedasticity)
  • Sensitivity to scale
  • High correlation
  • Rank deficiency (linear dependence)
  • Multicollinearity (ill conditioning)
  • Noninformative features (regularization)
  • Feature interactions and nonlinearity
  • High dimensionality
  • Computational complexity
  • Convergence rates (some algorithms require a lot of data to make accurate estimates)
  • Sparsity
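
A few of the items above can be checked quickly in base R; the data frame df and the response name y are hypothetical:

  colSums(is.na(df))                                        # missing data per column

  num  <- df[, sapply(df, is.numeric)]
  cors <- cor(num, use = "pairwise.complete.obs")
  which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)  # highly correlated pairs

  X <- model.matrix(~ . - y, data = df)
  qr(X)$rank < ncol(X)                                      # TRUE signals rank deficiency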

Consider representation learning methods:

  • the PCA family: linear, nonlinear, kernel, probabilistic, ICA, FA, categorical, MCA, HOMALS
  • Autoencoders (a nonlinear analogue of PCA)
  • Response encodings
  • Missing value imputation
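
A minimal sketch of linear PCA with base R (assuming complete numeric columns); scaling matters because PCA is sensitive to the units of each column:

  num_cols <- sapply(df, is.numeric)
  pca <- prcomp(df[, num_cols], center = TRUE, scale. = TRUE)
  summary(pca)            # variance explained per component
  scores <- pca$x[, 1:2]  # first two components as new features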

Data Transforms:

  • log transforms
  • Box-Cox family
  • interactions
  • smoothing: splines, kernels
  • factor encoding: dummy, response, thermometer, cyclic
  • time-series embeddings
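
A sketch of a few of these with base R and MASS; the column names are hypothetical:

  df$log_income <- log1p(df$income)                    # log transform (handles zeros)

  library(MASS)
  boxcox(lm(income ~ age, data = df))                  # profile the Box-Cox parameter (response must be positive)

  X <- model.matrix(~ region + income:age, data = df)  # dummy encoding plus an interaction

  df$hour_sin <- sin(2 * pi * df$hour / 24)            # cyclic encoding of hour-of-day
  df$hour_cos <- cos(2 * pi * df$hour / 24)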

Feature Selection:

  • Filters
  • Wrappers
  • Embedded
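
A minimal sketch of a filter method: rank numeric features by absolute correlation with the response and keep the top few (the cutoff is arbitrary; names hypothetical):

  num_feats <- setdiff(names(train)[sapply(train, is.numeric)], "y")
  scores    <- sapply(num_feats, function(v) abs(cor(train[[v]], train$y, use = "complete.obs")))
  keep      <- head(names(sort(scores, decreasing = TRUE)), 10)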

When you are transforming the data, it is important to ask: is the transformation data-dependent? Does it depend on the features? Does it depend on the response? If so, it ought to be estimated inside the validation procedure; otherwise you risk overfitting through data leakage. Data-independent transformations can be applied at will.
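
A minimal sketch of the point with centering and scaling: the parameters are estimated on the training rows only and then reused on the held-out rows:

  mu    <- mean(train$income)
  sigma <- sd(train$income)

  train$income_z <- (train$income - mu) / sigma
  test$income_z  <- (test$income  - mu) / sigma   # reuse training parameters; never refit on test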

Training and Tuning

Consider model aggregation methods: bagging, model averaging, ensembles, SuperLearning. You want a collection of models giving imperfectly correlated predictions. You may be able to reduce hyperparameter optimization and feature selection this way. (Put a sample of models in and let the superlearner select from them.)
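
A minimal sketch of plain model averaging, assuming the rpart package is available and a numeric response y (the model choices are illustrative):

  library(rpart)

  fit_lm   <- lm(y ~ ., data = train)
  fit_tree <- rpart(y ~ ., data = train)

  pred_avg <- (predict(fit_lm, newdata = test) + predict(fit_tree, newdata = test)) / 2
  sqrt(mean((test$y - pred_avg)^2))               # RMSE of the averaged predictions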

If a particular statistic is of interest, consider Targeted Learning.

Model Evaluation

  • Residuals

  • Permutation Tests: Compare to the same model fit on a randomised response. Can help detect overfitting.

  • Benchmarking

Examining the residuals is very important. It can help you determine whether your model is well-specified. You want the residuals to look like white noise. Look at QQ-plots or worm plots.

Compare a parametric model to some non-parametric equivalent. If the parametric model is well-specified, it should outperform the non-parametric model. This is because a parametric model should be able to "leverage its assumptions."
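
A minimal sketch of the permutation test mentioned above, for a regression model: refit the same model on a shuffled response; if the real model's fit is not clearly better than the shuffled ones, the apparent performance is suspect:

  set.seed(1)
  real_rmse <- sqrt(mean(residuals(lm(y ~ ., data = train))^2))

  perm_rmse <- replicate(100, {
    shuffled <- transform(train, y = sample(y))        # break the X-y relationship
    sqrt(mean(residuals(lm(y ~ ., data = shuffled))^2))
  })

  mean(perm_rmse <= real_rmse)                         # approximate p-value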

see:

Model Interpretation

Reporting and Deployment

References

On process, see:

On infrastructure, see:

On data validation, see:

On experimental design, see:

On models generally, see:

On deep learning, see:

On time series, see:

On feature engineering, see:

On validation and resampling, see:

On model interpretation, see:

On statistics and mathematics, see:

On graphics, see:
