# Data science workflow

Whenever you start a data science project, you should follow a workflow, which will help you:

* Perform all the steps in an analysis
* Produce reproducible results and track data provenance
* Avoid simple errors
* Produce higher quality work

The [Cross-industry standard process for data mining](https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) (CRISP-DM) is a good default workflow, unless you already have one that suits your project better.

Make sure that you also think about correctness, i.e., verifying that your code correctly implements your model ('solves the equations right') and validating that your model has high fidelity with reality ('solves the right equations').  See [Verification and Validation in Scientific Computing](http://www.amazon.com/Verification-Validation-Scientific-Computing-Oberkampf/dp/0521113601/ref=sr_1_1?ie=UTF8&qid=1445036147&sr=8-1&keywords=verification+and+validation) for more information.
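One lightweight way to practice verification is to test your code against a case small enough to compute by hand. The sketch below assumes MAPE as the metric; the function name `mape` and the sample numbers are illustrative only.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Assumes no true value is 0."""
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


# Hand-computed case: both predictions are off by exactly 10% of the true value,
# so the metric should come out at 10%.
result = mape([100.0, 200.0], [110.0, 180.0])
assert math.isclose(result, 10.0)
```

A check like this only covers verification. Validation still requires comparing model output against real-world outcomes.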

## 1. Business understanding

Everything starts with business understanding.  Speak with your stakeholders:

* What is the business problem you need to answer?
* What are the requirements?
* How do you measure success?

Do not proceed until you have answered these questions.  Often, it is not clear what success looks like or even what you should use as a label (target) to train your model.  The metric for success will typically be a business quantity like 'decrease churn rate by 10%' rather than an improvement in AUC or MAPE.  Consequently, you need to tune your model against the right business outcome.  Make sure you always state results in business terms, like 'this policy will save $100 MM' or 'this policy will decrease fraud by 10%'.
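As a rough illustration of tuning against a business outcome instead of a statistical score, the sketch below picks a classification threshold by net dollars saved. Everything in it is assumed for the example: the function names, the cost of a retention offer (`offer_cost`), the value of a retained customer (`value_saved`), and the simplification that every contacted churner is retained.

```python
def expected_savings(y_true, scores, threshold, value_saved=30.0, offer_cost=12.0):
    """Net dollars from contacting every customer scored at or above threshold."""
    total = 0.0
    for churned, score in zip(y_true, scores):
        if score >= threshold:
            total -= offer_cost        # every contact costs money
            if churned:
                total += value_saved   # assumed: a contacted churner is retained
    return total


def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest net savings."""
    return max(candidates, key=lambda t: expected_savings(y_true, scores, t))


# Toy data: 1 = customer churned, score = model's predicted churn probability.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
candidates = [0.05, 0.3, 0.5, 0.7, 0.95]

chosen = best_threshold(y_true, scores, candidates)
print(chosen, expected_savings(y_true, scores, chosen))
```

With these made-up costs the best threshold is 0.5, for a net saving of $24. Different costs would move the threshold even though the model and its AUC stay the same.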

Note: This workflow is an iterative process. E.g., after performing a step such as **Modeling**, you may discover a mistake that forces you to repeat an earlier step, such as **Data cleaning**.

## 2. Data understanding

After you define the business problem, you need to determine what data is available.  Ponder the following:

* What datasets are available?
* How can you combine them to produce a dataset that answers the business questions?
* Do you need to collect additional data?
* Does your data have a label (target) or do you need to generate one, perhaps by using [MechTurk](https://www.mturk.com/mturk/welcome) or an equivalent service?

## 3. Data preparation

To prepare a dataset for modeling, you should first explore the data and, concurrently, figure out how to clean it.  At the end of this step, you should have a dataset you can use to build a model.

### 3a. Data cleaning

Start by loading your data so that you can begin exploring it.  Perform only the most minimal cleaning necessary -- overcleaning can remove valuable information (signal).  Pro tip:  if your data is huge, start by making sure everything works on a small subset of your data, like a single shard.  You want to be able to iterate quickly and get your pipeline working before attempting full-scale analysis and modeling.
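A minimal sketch of the 'small subset first' tip, using only the Python standard library. The helper name `read_sample` and the inline CSV are made up for illustration; in practice you would pass an open file handle for one shard.

```python
import csv
import io
from itertools import islice


def read_sample(lines, n_rows):
    """Parse at most n_rows data rows from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    return list(islice(reader, n_rows))


# Stand-in for a large file. Nothing beyond the requested rows is parsed.
raw = io.StringIO("id,amount\n1,10.5\n2,\n3,7.25\n4,300.0\n")
sample = read_sample(raw, n_rows=2)
print(sample)
```

Note that the empty `amount` in the second row is kept as an empty string rather than dropped, in line with doing only minimal cleaning at load time.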

### 3b. Exploratory data analysis (EDA)

Get to know the strengths and weaknesses of your data:

* Are there any weird values?  Outliers?  Missing values?  Malformed or unstructured fields?
* What is the nature of your missing values?  Are they missing at random?  If not, how are you going to deal with them?
* Compute summary statistics
* Plot features to see if they have predictive power.  If you have a lot of data, draw a subset -- and make sure your results don't depend on the subset you have chosen.
* Plot histograms of label and key features
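The summary-statistics and subset checks above can be sketched with the standard library alone. The helper names, the use of `None` to mark missing values, and the toy column are all assumptions for this example.

```python
import random
import statistics


def summarize(values):
    """Summary statistics for a numeric column where None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }


def subsample_means(values, k, seeds):
    """Mean of k randomly drawn non-missing values, computed once per seed."""
    present = [v for v in values if v is not None]
    return [statistics.mean(random.Random(s).sample(present, k)) for s in seeds]


column = [12.0, 15.0, None, 14.0, 13.0, 250.0, None, 16.0]
stats = summarize(column)
print(stats)

# If these means differ a lot, conclusions depend on which subset was drawn.
print(subsample_means(column, k=4, seeds=[0, 1, 2]))
```

Here the mean (about 53.3) sits far above the median (14.5), which is a quick hint that the value 250.0 deserves a closer look as a possible outlier.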

### 3c. Feature engineering

Finally, assemble your final dataset.  Feature engineering -- how you construct the features for your model -- is often more important than what model you choose.  Some issues:

* Handling missing values -- can you bin them into their own category, or are they *missing at random* so that you can drop them?
* Handling outliers -- should you bin the data to make it discrete?
* Replacing categorical variables with dummy variables
* Transforming data, e.g., taking the `log` of a feature, which is often useful with long-tailed data
* Converting text data to features using *Natural Language Processing* (NLP) techniques such as *term frequency-inverse document frequency* (TF-IDF), the [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) trick, n-grams, etc.
* Standardizing address data into USPS format
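Two of the transformations listed above, dummy variables and a log transform, can be sketched in plain Python as follows. The function names and the `is_` column prefix are illustrative, and `log1p` (log of 1 + x) is one possible choice that stays defined at zero.

```python
import math


def one_hot(values):
    """Replace a categorical column with one 0/1 dummy column per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def log_transform(values):
    """Apply log(1 + x), which compresses a long right tail and is defined at 0."""
    return [math.log1p(v) for v in values]


plans = ["basic", "pro", "basic", "team"]   # categorical feature
spend = [0.0, 9.0, 99.0, 999.0]             # long-tailed numeric feature

dummies = one_hot(plans)
log_spend = log_transform(spend)

print(dummies[1])   # dummy columns for the "pro" row
print(log_spend)
```

For linear models you would typically drop one dummy column per variable to avoid perfect collinearity. That step is omitted here to keep the sketch short.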

Now you should be ready to start modeling.

## 4. Modeling

## 5. Evaluation

## 6. Deployment

## Conclusion