# Overview

Julia Evans' excellent article, [Machine Learning isn't Kaggle Competitions](http://jvns.ca/blog/2014/06/19/machine-learning-isnt-kaggle-competitions/), provides a better overview of this section than I ever could, so I'm just going to include it verbatim:

> If you want to predict flight arrival times, what are you really trying to do? Some possible options:

> * Help the airline understand which flights are likely to be delayed, so they can fix it.
> * Help people buy flights that are less likely to be delayed.
> * Warn people if their flight tomorrow is going to be delayed

> I've spent time on projects where I didn't understand at all how the model was going to fit into business plans. If this is you, it doesn't matter how good your model is. At all.

> Understanding the business problem will also help you decide:

> * How accurate does my model really need to be? What kind of false positive rate is acceptable?
> * What data can I use? If you're predicting flight days tomorrow, you can look at weather data, but if someone is buying a flight a month from now then you'll have no clue.

# Project scoping
This framework for project scoping comes from Max Shron's [talk for the NYC Data Science meetup](https://vimeo.com/98768831).

The world gives us vague requests. We have to make things clear before we start or our models will be rambling and unhelpful. The framework consists of four parts:

1. Context
2. Need
3. Vision
4. Outcome

**Example of a bad scope:** 

We're working with a company that has a subscription business. 

> CEO: It would be great if you could help us build a churn model.

> Us: Alright - we're going to use a logistic regression to predict when someone is about to stop using the product.

Why is this a bad scope? It isn't actionable for the company, and it includes irrelevant details, such as the choice of logistic regression.

## Context
Who are we working with? What are the big picture, long-term goals?

> "The company has a subscription model. Their goal is to improve profitability."

This is different from a churn model for a company that is just trying to get as many users as possible, or for one that is just trying to increase revenue as much as possible.

## Need
What is the particular knowledge we are missing?

> "We want to understand who drops off **early enough** that we can intervene."

Finding out two seconds before someone is about to quit is not that helpful.

## Vision
What would it look like to solve the problem?

> "We will build a predictive model **using behavioral data** to predict who will drop off - early enough to be useful."

Data sources matter. The kinds of intervention possible matter (different messaging channels, different content types).

## Outcome
Who will be responsible for next steps? How will we know if we are correct?

> "The tech team will implement the model in a batch process to run daily, automatically sending out email offers. We will calculate success metrics (precision and recall) on held out users, and send a weekly email of stats to stay on top of outcomes."

A good cross-validation score isn't actually success. To know whether the email offers reduce churn, we need a control group that doesn't receive them.
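
As a rough illustration of the two measurements in this outcome, the sketch below computes precision and recall on held-out users, then compares churn between flagged users who received the offer and a control group who did not. All of the data, the function names `precision_recall` and `churn_rate`, and the even treated/control split are made up for illustration; a real evaluation would need far more users and a significance test.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two equal-length sequences of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)        # flagged and churned
    fp = sum(p and not a for p, a in pairs)    # flagged but stayed
    fn = sum(a and not p for p, a in pairs)    # churned but not flagged
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def churn_rate(churned):
    """Fraction of users in a group who churned."""
    return sum(churned) / len(churned)


# Hypothetical held-out users: did the model flag them, and did they churn?
predicted = [True, True, True, False, False, False, True, False]
actual = [True, True, False, False, True, False, True, False]
precision, recall = precision_recall(predicted, actual)

# Hypothetical flagged users, split at random: one half got the email offer,
# the other half (the control group) did not.
treated_churned = [True, False, False, False]
control_churned = [True, True, False, True]
churn_reduction = churn_rate(control_churned) - churn_rate(treated_churned)

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"churn reduction vs control={churn_reduction:.2f}")
```

Precision and recall describe how well the model finds likely churners. Only the comparison against the control group says whether acting on those predictions changed anything.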

**Example of a better scope:**

**Context:** We are working with a hospital system that has had 250k patients in the last 20 years. The CEO is interested in building a tool for reducing medical issues.

**Need:** After talking to some doctors, we found a belief that antibiotics are overused, but this is hard to detect.

**Vision:** A pilot investigation. If we find a signal, we will build a repeatable flagging tool.

**Outcome:** The CMO will decide if the pilot is valuable based on our report. The automated tool would be run by the CMO on-demand.

# Client communication
Interviews are obviously the most straightforward way to get this information - but tools from the design industry like roleplaying and storytelling can provide additional context.

Before getting started, it helps to sketch out what the end visualizations could look like, or to put into a single sentence the kind of insight we want to be able to deliver once we've done the work.

# Client best practices
Be able to advise clients on steps they can take to improve their data for future analysis. Sasha Laundy laid out the steps that non-technical clients should take to audit their data at her [PyData NYC 2014 talk](http://blog.sashalaundy.com/talks/data-audit/), which I've summarized below:

## Data completeness
Are business-critical fields populated? Check volume, for example by comparing session counts from Google Analytics against internal system logs.

Engineers working without marketers have a tendency to log errors but not other potentially meaningful measures of user behavior. Simple things - if you have five categories, make sure you’re not just tracking two of them.
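
A minimal sketch of these completeness checks, using only the Python standard library. The inline log export, the field names, the list of expected categories, and the external session count are all hypothetical stand-ins for whatever the client actually has.

```python
import csv
import io

# Hypothetical export of internal event logs, inlined so the sketch runs on its own.
RAW = """user_id,event_category,revenue
1,purchase,9.99
2,error,
3,error,
4,purchase,
"""

# Assumed: the five categories the business cares about.
EXPECTED_CATEGORIES = {"purchase", "error", "signup", "cancel", "search"}
# Assumed: session count reported by a second system for the same period.
EXTERNAL_COUNT = 5

rows = list(csv.DictReader(io.StringIO(RAW)))

# Share of rows in which each field is empty.
missing = {
    field: sum(1 for r in rows if not r[field].strip()) / len(rows)
    for field in rows[0]
}

# Categories that should exist but never appear in the logs.
seen = {r["event_category"] for r in rows}
untracked = EXPECTED_CATEGORIES - seen

# Relative gap in volume between the two sources.
volume_gap = (EXTERNAL_COUNT - len(rows)) / EXTERNAL_COUNT

print("missing share per field:", missing)
print("untracked categories:", sorted(untracked))
print(f"volume gap vs external source: {volume_gap:.0%}")
```

Here three of the five expected categories never show up, which is exactly the "only tracking two of them" situation described above.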

## Data correctness
Again, check against existing knowledge. In most cases, the data shouldn’t shock the client. 

Speed up early data analysis by using [csvkit](http://csvkit.readthedocs.org/) - a command line tool for working with CSV files. 

Example:

```bash
# Start by creating a clean workspace
mkdir csvkit_tutorial
cd csvkit_tutorial

# Fetch the data
curl -L -O https://github.com/onyxfish/csvkit/raw/master/examples/realdata/ne_1033_data.xlsx

# Copy the contents of the Excel file to a new CSV
in2csv ne_1033_data.xlsx > data.csv

# Log headers from data.csv
csvcut -n data.csv
```

```
  1: state
  2: county
  3: fips
  4: nsn
  5: item_name
  6: quantity
  7: ui
  8: acquisition_cost
  9: total_cost
 10: ship_date
 11: federal_supply_category
 12: federal_supply_category_name
 13: federal_supply_class
 14: federal_supply_class_name
```




```bash
cd csvkit_tutorial

# Look at the data in just a few of our columns.
# We can select them by index:
# csvcut -c 2,5,6 data.csv | head

# ...or by column name:
csvcut -c county,item_name,quantity data.csv | head
```

```
county,item_name,quantity
ADAMS,"RIFLE,7.62 MILLIMETER",1
ADAMS,"RIFLE,7.62 MILLIMETER",1
ADAMS,"RIFLE,7.62 MILLIMETER",1
ADAMS,"RIFLE,7.62 MILLIMETER",1
ADAMS,"RIFLE,7.62 MILLIMETER",1
ADAMS,"RIFLE,7.62 MILLIMETER",1
BUFFALO,"RIFLE,5.56 MILLIMETER",1
BUFFALO,"RIFLE,5.56 MILLIMETER",1
BUFFALO,"RIFLE,5.56 MILLIMETER",1
```
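
Once the columns of interest are isolated, a quick per-column summary makes it easier to check the data against what the client already knows. The sketch below does this with Python's standard `csv` module on a few rows copied from the output above. The `summarize` helper and the "quantities must be positive" rule are illustrative assumptions, not part of csvkit or of the dataset's documentation.

```python
import csv
import io

# A few rows from the csvcut output, inlined so the sketch runs on its own.
SAMPLE = '''county,item_name,quantity
ADAMS,"RIFLE,7.62 MILLIMETER",1
ADAMS,"RIFLE,7.62 MILLIMETER",1
BUFFALO,"RIFLE,5.56 MILLIMETER",1
'''


def summarize(rows, column, numeric=False):
    """Count non-empty and distinct values; add min/max for numeric columns."""
    values = [r[column] for r in rows if r[column] != ""]
    summary = {"count": len(values), "distinct": len(set(values))}
    if numeric:
        numbers = [float(v) for v in values]
        summary["min"], summary["max"] = min(numbers), max(numbers)
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
county_summary = summarize(rows, "county")
quantity_summary = summarize(rows, "quantity", numeric=True)

# Assumed sanity rule: a shipped quantity should never be zero or negative.
suspicious = [r for r in rows if float(r["quantity"]) <= 0]

print("county:", county_summary)
print("quantity:", quantity_summary)
print("suspicious rows:", len(suspicious))
```

If a summary like this shocks the client (an unexpected number of counties, a negative quantity), that is worth investigating before any modeling starts.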


## Data connectability
Can the data be joined easily? Not just to other client datasets - if, for example, the client team has weird internal definitions of territories that don't match up with zip codes or longitude/latitude, you’re going to have a hell of a time getting insights from any other source.

The best way to do this is to actually perform a join, and check it against existing knowledge.

Make sure the client understands the importance of unique IDs that are persistent across all databases - marketing, in-app, transactional.
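
To make the join check concrete, here is a small sketch that looks for duplicate IDs in one source and measures how many IDs actually match across two sources. The two record lists, the `user_id` key, and the helper names `duplicate_ids` and `join_coverage` are hypothetical; in practice the records would come from the client's marketing and transactional databases.

```python
from collections import Counter

# Hypothetical records from two client systems that should share user_id.
marketing = [
    {"user_id": "u1", "channel": "email"},
    {"user_id": "u2", "channel": "ads"},
    {"user_id": "u2", "channel": "social"},  # same ID appears twice
    {"user_id": "u4", "channel": "email"},
]
transactions = [
    {"user_id": "u1", "amount": 20.0},
    {"user_id": "u2", "amount": 35.0},
    {"user_id": "u3", "amount": 12.5},
]


def duplicate_ids(records, key):
    """Return the IDs that occur more than once in a source."""
    counts = Counter(r[key] for r in records)
    return sorted(k for k, n in counts.items() if n > 1)


def join_coverage(left, right, key):
    """Report which IDs match across two sources and which are orphaned."""
    left_ids = {r[key] for r in left}
    right_ids = {r[key] for r in right}
    matched = left_ids & right_ids
    return {
        "matched": sorted(matched),
        "left_only": sorted(left_ids - right_ids),
        "right_only": sorted(right_ids - left_ids),
        "left_match_rate": len(matched) / len(left_ids),
    }


dupes = duplicate_ids(marketing, "user_id")
coverage = join_coverage(marketing, transactions, "user_id")

print("duplicate marketing IDs:", dupes)
print("join coverage:", coverage)
```

A low match rate or a long list of orphaned IDs is a sign that the identifiers are not persistent across systems, which is worth raising with the client early.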

# Resources
## Video
1. Thinking with Data, Max Shron (NYC Data Science Meetup 2014) - [https://vimeo.com/98768831](https://vimeo.com/98768831)
2. How to Make Your Future Data Scientists Love You, Sasha Laundy (PyData NYC 2014) - [http://blog.sashalaundy.com/talks/data-audit/](http://blog.sashalaundy.com/talks/data-audit/)

## Tools
1. csvkit - [http://csvkit.readthedocs.io/](http://csvkit.readthedocs.io/)
2. data_hacks - [https://github.com/bitly/data_hacks](https://github.com/bitly/data_hacks)

## Reading
### Articles
1. Machine Learning isn't Kaggle Competitions, Julia Evans - [http://jvns.ca/blog/2014/06/19/machine-learning-isnt-kaggle-competitions/](http://jvns.ca/blog/2014/06/19/machine-learning-isnt-kaggle-competitions/)

### Books
1. Thinking with Data, Max Shron - [http://shop.oreilly.com/product/0636920029182.do](http://shop.oreilly.com/product/0636920029182.do)
2. Data Science at the Command Line, Jeroen Janssens - [http://shop.oreilly.com/product/0636920032823.do](http://shop.oreilly.com/product/0636920032823.do)

