# Data science workflow

Whenever you start a data science project, you should follow a workflow, which will help you:

* Perform all the steps in an analysis
* Produce reproducible results and track data provenance
* Avoid simple errors
* Produce higher quality work

The [Cross-industry standard process for data mining](https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) (CRISP-DM) is a good workflow to use, unless you know of a better one.

Make sure that you also think about correctness, and integrate *Verification, Validation, & Uncertainty Quantification (VV & UQ)* with your workflow:

*  *Verify* that your code correctly implements your model ('solves the equations right') 
*  *Validate* that your model has high fidelity with reality ('solves the right equations').  
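
One common way to verify code is to run it on synthetic data generated from known parameters and check that it recovers them. The sketch below does this for a simple least-squares fit; the function `fit_line` and the parameter values are illustrative assumptions, not part of any particular library.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Verification: generate noise-free data from known (hypothetical) parameters
# and check that the code recovers them up to floating-point error.
rng = random.Random(0)
true_intercept, true_slope = 2.0, -0.5
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [true_intercept + true_slope * x for x in xs]

intercept, slope = fit_line(xs, ys)
assert abs(intercept - true_intercept) < 1e-9
assert abs(slope - true_slope) < 1e-9
```

Passing this check only shows that the code solves the equations right. Validation, i.e., whether a straight line is the right model for your data, still requires comparing predictions with reality.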

## References

* [Verification and Validation in Scientific Computing](http://www.amazon.com/Verification-Validation-Scientific-Computing-Oberkampf/dp/0521113601/ref=sr_1_1?ie=UTF8&qid=1445036147&sr=8-1&keywords=verification+and+validation) is an excellent reference.
* Benjamin S. Skrainka's talk on ["Correctness in Data Science"](https://youtu.be/kex-UXZTGU4) provides a quick introduction and much practical advice.
* [Fundamentals of Machine Learning for Predictive Data Analytics](https://mitpress.mit.edu/books/fundamentals-machine-learning-predictive-data-analytics) has a good discussion of CRISP-DM plus worked case studies.

## 1. Business understanding

Everything starts with business understanding.  Speak with your stakeholders:

* What is the business problem you need to answer?
* What are the requirements?
* How do you measure success?

Do not proceed until you have answered these questions.  Often, it is not clear what success looks like or even what you should use as a label (target) to train your model.  The metric for success will typically be a business quantity like 'decrease churn rate by 10%' instead of an improvement in AUC or MAPE.  Consequently, you need to tune your model based on the right business outcome.  Make sure you always state results in business terms, such as 'this policy will save $100 MM' or 'this policy will decrease fraud by 10%'.
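
As a rough illustration of tuning on a business outcome, the sketch below picks a classification threshold by net dollar value rather than by a statistical score. Everything in it is a hypothetical assumption: the function `campaign_value`, the dollar amounts, the toy scores, and the simplification that every targeted churner is retained.

```python
def campaign_value(scores, churned, threshold,
                   value_per_save=200.0, cost_per_offer=20.0):
    """Net dollar value of sending a retention offer to every customer
    whose predicted churn score is at or above `threshold`.

    Simplifying assumption: each targeted customer who would have
    churned is retained by the offer.
    """
    value = 0.0
    for score, did_churn in zip(scores, churned):
        if score >= threshold:
            value -= cost_per_offer
            if did_churn:
                value += value_per_save
    return value


# Hypothetical model scores and observed outcomes for six customers.
scores = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
churned = [True, True, False, True, False, False]

# Choose the threshold that maximizes dollars, not AUC.
candidates = [0.1, 0.3, 0.5, 0.7]
best = max(candidates, key=lambda t: campaign_value(scores, churned, t))
```

With real data, the value and cost figures should come from your stakeholders, which is one more reason to settle them before modeling.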

Note: These steps are an iterative process. E.g., after performing a step, such as **Modeling**, you may discover a mistake which causes you to repeat an earlier step, such as **data cleaning**.

## 2. Data understanding

After you define the business problem, you need to determine what data is available.  Ponder the following:

* What datasets are available?
* How can you combine them to produce a dataset that answers your stakeholders' business questions?
* Do you need to collect additional data?
* Does your data have a label (target) or do you need to generate one, perhaps by using [Mechanical Turk](https://www.mturk.com/mturk/welcome) or an equivalent service?

## 3. Data preparation

To prepare a dataset for modeling, you should first explore the data and, concurrently, figure out how to clean it.  At the end of this step, you should have a dataset you can use to build a model.

### 3a. Load data and perform minimal cleaning

Start by loading your data so that you can begin exploring it.  Perform only the most minimal cleaning necessary -- overcleaning can remove valuable information (signal).  Pro tip:  if your data is huge, start by making sure everything works on a small subset of your data, like a single shard.  You want to be able to iterate quickly and get your pipeline working before attempting full-scale analysis and modeling.
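
A minimal sketch of loading only a small subset, using the standard library's `csv` module. The helper `read_sample`, the column names, and the in-memory stand-in for a file are assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def read_sample(lines, n_rows):
    """Return the first `n_rows` records from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    return list(islice(reader, n_rows))


# Stand-in for a large file; in practice, pass an open file handle instead.
raw = io.StringIO(
    "order_id,order_date,amount\n"
    "1,2015-11-10,20.5\n"
    "2,2015-11-11,310.0\n"
    "3,2015-11-11,\n"
    "4,2015-11-12,18.0\n"
)

# Only the requested rows are read, so this stays cheap on huge files.
sample = read_sample(raw, n_rows=2)
```

Note that the first rows of a file are not a random sample, so treat this as a way to get the pipeline running rather than as a basis for conclusions.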

#### Dates

Working with dates is frustrating and tricky.  Start by converting all dates into a `datetime` object using `datetime.datetime.strptime()`.  Next, you may need to add fixed effects (dummy variables) to handle shocks caused by day, week, month, year, or day of week.  For example, "Singles' Day" (November 11) in China causes a huge spike in Internet purchases.  Also, you may need to normalize data by the number of (working) days per month.  This is common with data on sales or GDP.
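
The sketch below shows one way these steps might look with the standard library. The helper names, the date format, and the sales figure are assumptions; the working-day count treats every Monday through Friday as a working day and ignores public holidays.

```python
import calendar
from datetime import date, datetime


def date_features(date_string, fmt="%Y-%m-%d"):
    """Parse a date string and build simple calendar features."""
    parsed = datetime.strptime(date_string, fmt)
    features = {"date": parsed}
    # Day-of-week fixed effects; Monday (weekday 0) is the omitted baseline.
    for day in range(1, 7):
        features[f"dow_{day}"] = int(parsed.weekday() == day)
    # Fixed effect for one known shock: Singles' Day (November 11).
    features["singles_day"] = int((parsed.month, parsed.day) == (11, 11))
    return features


def weekdays_in_month(year, month):
    """Count Monday-to-Friday days in a month (holidays are ignored)."""
    n_days = calendar.monthrange(year, month)[1]
    return sum(1 for day in range(1, n_days + 1)
               if date(year, month, day).weekday() < 5)


features = date_features("2015-11-11")

# Normalize a hypothetical monthly sales total by the number of working days.
monthly_sales = 4200.0
sales_per_working_day = monthly_sales / weekdays_in_month(2015, 11)
```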

### 3b. Exploratory data analysis (EDA)

Get to know your data:

* What are the strengths and weaknesses?
* Are there any weird values, outliers, missing values, or malformed/unstructured fields?
* What is the nature of your missing values?  Are they missing at random?  If not, how are you going to deal with them?
* Compute summary statistics
* Plot features to see if they have predictive power.  If you have a lot of data, draw a subset -- and make sure your results don't depend on the subset you have chosen.
* Plot histograms of label and key features
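
A few of these checks can be sketched with the standard library alone. The helpers `summarize` and `histogram_counts` and the toy `amounts` column are assumptions; `None` stands in for a missing value.

```python
import statistics


def summarize(values):
    """Summary statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "n_missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }


def histogram_counts(values, bin_width):
    """Count values per bin, keyed by the lower edge of each bin."""
    counts = {}
    for v in values:
        if v is None:
            continue
        lower = (v // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical column with one missing value and one possible outlier.
amounts = [20.5, 310.0, None, 18.0, 22.0, 19.5]
summary = summarize(amounts)
histogram = histogram_counts(amounts, bin_width=50)
```

Here the gap between the mean and the median already hints at the outlier, which the histogram then makes obvious.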

### 3c. Feature engineering

Finally, assemble your final dataset.  Feature engineering -- how you construct the features for your model -- is often more important than what model you choose.  Some issues:

* Compute a target/label if one doesn't exist -- this will require domain expertise
* Handle missing values -- can you bin them, or are they *missing at random* so that you can drop them?
* Handle outliers -- should you bin the data to make it discrete?
* Replace categorical variables with dummy variables, which most `scikit-learn` estimators require, unlike `R`, whose modeling functions encode factors for you
* Transform data:
  *  Take `log` of data, which is often useful with long-tailed data
  *  Possibly, quantize continuous variables to handle non-linear behavior
  * Convert text data to features using:
    * *Natural Language Processing* (NLP)
    * *Term frequency-inverse document frequency* (TF-IDF)
    * *[feature hashing](https://en.wikipedia.org/wiki/Feature_hashing)* trick
    * Compute n-grams, etc.
* Rationalize address data into standard USPS format
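
Several of the transformations above can be sketched in a few lines. The helper names, the bucket count, and the sample values are assumptions. The log transform uses `log(1 + x)` rather than a plain `log` so that zeros stay defined, and the hashing example uses `zlib.crc32` simply because it is stable across runs.

```python
import math
import zlib


def one_hot(value, categories):
    """Dummy variables for `value`; the first category is the omitted baseline."""
    return [int(value == c) for c in categories[1:]]


def log_transform(x):
    """log(1 + x), which is defined at zero for non-negative long-tailed data."""
    return math.log1p(x)


def hash_features(tokens, n_buckets=8):
    """Hashing trick: map tokens into a fixed-length vector of counts."""
    vector = [0] * n_buckets
    for token in tokens:
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        bucket = zlib.crc32(token.encode("utf-8")) % n_buckets
        vector[bucket] += 1
    return vector


def bigrams(tokens):
    """Adjacent token pairs (n-grams with n = 2)."""
    return list(zip(tokens, tokens[1:]))


# One hypothetical row: a categorical color plus a long-tailed amount.
row = one_hot("blue", ["red", "green", "blue"]) + [log_transform(1200.0)]

tokens = "free shipping on all orders".split()
text_vector = hash_features(tokens)
pairs = bigrams(tokens)
```

Different tokens can collide in the same bucket, so the bucket count trades memory against collisions.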

Be careful to avoid using *endogenous* features, i.e., those which are codetermined with the outcome.  For example, if you regress `quantity` on `price`, `price` is almost always endogenous because it is codetermined by the interaction between the buyer and the seller.  Consequently, in the regression model


$$q_{it} = \alpha_0 + \alpha_1 \, price_{it} + \epsilon_{it}$$


$\mathbb{E}[price_{it} \, \epsilon_{it}] \neq 0$, which violates the exogeneity assumption that linear regression needs in order to produce consistent estimates.
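
A small simulation can make the problem concrete. The sketch below assumes a hypothetical market with a linear demand curve (true slope -1) and a linear supply curve, where price and quantity are set jointly by market clearing. All of the coefficients and shock distributions are made up for illustration.

```python
import random


def ols_slope(xs, ys):
    """Slope from a least-squares regression of ys on xs (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)
true_demand_slope = -1.0
prices, quantities, demand_shocks = [], [], []
for _ in range(5000):
    e_demand = rng.gauss(0, 1)
    e_supply = rng.gauss(0, 1)
    # Demand: q = 10 - p + e_demand.  Supply: q = 2 + p + e_supply.
    # Setting demand equal to supply gives the market-clearing price.
    price = (8 + e_demand - e_supply) / 2
    quantity = 2 + price + e_supply
    prices.append(price)
    quantities.append(quantity)
    demand_shocks.append(e_demand)

# Regressing quantity on price does not recover the demand slope of -1.
estimated_slope = ols_slope(prices, quantities)

# The reason: price moves with the demand shock, so E[price * eps] != 0.
price_on_shock = ols_slope(demand_shocks, prices)
```

In this setup the estimated slope lands near zero instead of -1, because price responds to the same shocks that sit in the regression error.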

Now you should be ready to start modeling...

## 4. Modeling

Start by producing a 'shitty' first model.  If it performs too well, you should be suspicious.  Look for *information leakage*, i.e., a single variable which predicts the outcome (almost) perfectly.  This often happens with time series data when you have a feature which contains information from future periods.  It also occurs when a feature is derived from the label.  You may want to run *one-in* or *one-out* reports to test for this problem.
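
One simple reading of a one-in report is to score each feature on its own against the label and flag any that predict it almost perfectly. The sketch below uses Pearson correlation as that score; the function names, the 0.95 cutoff, and the toy data (including the leaky `refund_issued` column) are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; assumes neither column is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def one_in_report(features, label, threshold=0.95):
    """Correlate each feature alone with the label and flag suspicious ones."""
    report = {}
    for name, column in features.items():
        r = pearson(column, label)
        report[name] = (r, abs(r) >= threshold)
    return report


label = [0, 1, 0, 1, 1, 0, 1, 0]
features = {
    "tenure_months": [3, 10, 7, 8, 12, 5, 6, 9],
    # Hypothetical leak: this flag is only recorded after the outcome occurs,
    # so it is effectively derived from the label.
    "refund_issued": [0, 1, 0, 1, 1, 0, 1, 0],
}

report = one_in_report(features, label)
suspects = [name for name, (_, flagged) in report.items() if flagged]
```

A flagged feature is not automatically a leak, but it deserves a close look at how and when it was recorded.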

Next, refine the model.  You should use *cross-validation* to tune parameter settings and find a candidate model.  Use diagnostic tools to determine which features are most important.  Do the results agree with your intuition?  Also, test that the assumptions needed to use your model are satisfied.
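
The idea behind cross-validated tuning can be sketched without any libraries. The example below assumes a deliberately tiny model, a one-feature ridge regression without an intercept, and picks its penalty by 5-fold cross-validation on synthetic data. The function names, the penalty grid, and the data-generating process are all illustrative assumptions.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


def fit_ridge_slope(xs, ys, penalty):
    """Slope of a no-intercept ridge regression: sum(xy) / (sum(xx) + penalty)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def cv_mse(xs, ys, penalty, k=5):
    """Mean squared error on held-out folds, averaged over the k folds."""
    fold_errors = []
    for train, validation in k_fold_indices(len(xs), k):
        slope = fit_ridge_slope([xs[i] for i in train],
                                [ys[i] for i in train], penalty)
        errors = [(ys[i] - slope * xs[i]) ** 2 for i in validation]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Synthetic data with a known slope of 3 plus noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [3.0 * x + rng.gauss(0, 0.5) for x in xs]

scores = {penalty: cv_mse(xs, ys, penalty) for penalty in (0.0, 1.0, 10.0, 100.0)}
best_penalty = min(scores, key=scores.get)
```

Each observation is used for validation exactly once, so the score reflects performance on data the model did not see during fitting.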

## 5. Evaluation

Before releasing the model to the wild, you should evaluate it.  One of the best ways to do this is to design and run an experiment such as an A/B test.  If that is not possible, you can also use an approach like [Bayesian structural time series models](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41854.pdf) or some other method that lets you determine causally whether your awesome new model has the desired impact.
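
For a simple A/B test on conversion rates, one standard analysis is a two-proportion z-test. The sketch below is a minimal version under the usual large-sample assumptions; the function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: control (A) uses the old model, treatment (B) the new one.
z, p_value = two_proportion_z_test(successes_a=200, n_a=2000,
                                   successes_b=260, n_b=2000)
```

Decide the sample size and the success metric before the experiment starts, so the test answers the business question defined in step 1.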

## 6. Deployment

Finally, your model is ready for deployment!  Then the process will start all over:  you may need to keep retraining and retuning the model as the competitive environment changes.  Also, follow-on studies are common, so plan for change!

## Conclusion

Most likely, at several points along the workflow, you will need to present your results to non-technical (MBA) stakeholders.  Make sure you have an easily digested explanation and graphic to support your findings.  Modern companies are keen to make *data-driven decisions*.  As a data scientist, you are a key part of this revolution -- doing things according to the 4Ps is no longer good enough, because we are drowning in data and can measure what actually works.