# Project 1: Futurama Example

##### IDENTIFY: Understand the problem:
Using Planet Express customer data from January 3001-3005, determine how likely previous customers are to request a repeat delivery using demographic information (profession, company size, location) and previous delivery data (days since last delivery, number of total deliveries).

- Identify business/product objectives:
	- Are previous customers likely to request a repeat delivery?
- Identify and hypothesize goals and criteria for success:
	- What factors are likely to influence a customer's decision to reuse Planet Express for delivery?
- Create a set of questions to help you identify the correct data set.


##### ACQUIRE: Obtain the data

**Ideal data vs. data that is available**
We often start by identifying the *ideal data* we would want for a project.

During the data acquisition phase, we'll learn about the limitations on the types of data that are available. We have to decide if these limitations will inhibit our ability to answer our question of interest or if we can work with what we have to find a reasonable and reliable answer.

Data for this example:
- demographic information (profession, company size, location)
- previous delivery data (days since last delivery, number of total deliveries)

Questions we may ask include:  

- Identifying the “right” data set(s)
	- Is there enough data?
	- Does it appropriately align with the question/problem statement?
	- Can the dataset be trusted? How was it collected?
	- Is this dataset aggregated? Can we use the aggregation or do we need to get it pre-aggregation?
- Assess resources, requirements, assumptions, and constraints
- Import data from the web (Google Analytics, HTML, XML)
- Import data from a file (CSV, XML, TXT, JSON)
- Import data from a preexisting database (SQL)
- Set up local or remote data structure
- Determine most appropriate tools to work with data
	- The choice of tool follows the format and size of the dataset
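
The import options in the list above (file, web payload, database) can be sketched with Python's standard library. The CSV text, the JSON payload, and the `deliveries` table below are made up for illustration; they are not the Planet Express data.

```python
import csv
import io
import json
import sqlite3

# Hypothetical CSV text standing in for a file on disk.
csv_text = "Profession,Location,Number of Deliveries\nPilot,Earth,3\nRobot,Earth,4\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical JSON payload, such as one returned by a web API.
json_text = '[{"Profession": "Pilot", "Location": "Earth"}]'
json_rows = json.loads(json_text)

# Hypothetical SQL table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deliveries (profession TEXT, n_deliveries INTEGER)")
conn.executemany(
    "INSERT INTO deliveries VALUES (?, ?)",
    [("Pilot", 3), ("Robot", 4)],
)
sql_rows = conn.execute(
    "SELECT profession, n_deliveries FROM deliveries ORDER BY n_deliveries"
).fetchall()
conn.close()

print(len(csv_rows), len(json_rows), len(sql_rows))
```

Note that `csv.DictReader` returns every value as a string, so numeric columns still need converting before analysis.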

##### PARSE: Understand the data
Many times we are given *secondary data*, or data that was collected previously. In these cases, we have to learn as much as possible about our data using tools like data dictionaries and source documentation to determine how the data was gathered.

Example data dictionary:

Variable | Description | Type of Variable
---| ---| ---
Profession | Title of the account owner | Categorical
Company Size | 1- small, 2- medium, 3- large| Categorical
Location | Planet of the company | Categorical
Days Since Last Delivery | Float | Continuous
Number of Deliveries | Integer | Continuous

**Common steps include:**  

- Read any documentation provided with the data (e.g. data dictionary above)
- Perform exploratory surface analysis via filtering, sorting, and simple visualizations
- Describe data structure and the information being collected
- Explore variables, data types via select
- Assess preliminary outliers, trends
- Verify the quality of the data (feedback loop -> 1)
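
A minimal sketch of this kind of surface analysis with pandas might look like the following. The rows are hypothetical and only follow the shape of the example data dictionary; the value 40 is planted so the summary has an outlier to show.

```python
import pandas as pd

# Hypothetical rows in the shape of the example data dictionary.
sample = pd.DataFrame({
    "Profession": ["Pilot", "Robot", "Pilot", "Doctor"],
    "Company Size": [2, 1, 2, 3],
    "Location": ["Earth", "Amphibios 9", "Earth", "Bogad"],
    "Days Since Last Delivery": [3.61, 3.67, 4.0, 2.93],
    "Number of Deliveries": [3, 4, 1, 40],
})

# Describe the structure: column names and inferred data types.
print(sample.dtypes)

# Select and filter: only the columns and rows we want to inspect.
pilots = sample.loc[sample["Profession"] == "Pilot", ["Profession", "Location"]]

# Sort to bring extreme values to the top.
most_deliveries = sample.sort_values("Number of Deliveries", ascending=False)

# Summary statistics give a first look at possible outliers.
summary = sample["Number of Deliveries"].describe()
print(summary)
```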

##### MINE: Prepare, structure, and clean the data  
Often, our data will need to be cleaned prior to performing an analysis.

Common steps include:

- Sample the data, determine sampling methodology
- Iterate and explore outliers, null values via select
- Intro qualitative vs quantitative data
- Format and clean data in Python (dates, number signs, formatting)
- Define how to appropriately address missing values (cleaning)
- Categorization, manipulation, slicing, format, integrate data
- Format and combining different data points, separate columns, etc.
- Determine most appropriate aggregations, cleaning, etc. methods
- Create necessary derived columns from the data (new data)
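
A few of these steps can be sketched in pandas as follows. The messy rows, the missing-value policy (drop rows without a profession, fill numeric gaps with the median), and the `Repeat Customer` column are all assumptions chosen for illustration, not rules taken from the Planet Express data.

```python
import pandas as pd

# Hypothetical messy rows: stray whitespace, number signs, and missing values.
raw = pd.DataFrame({
    "Profession": [" pilot", "Robot ", "Pilot", None],
    "Number of Deliveries": ["3", "#4", "1", "2"],
    "Days Since Last Delivery": [3.61, None, 4.0, 2.93],
})

clean = raw.copy()

# Formatting: trim whitespace and normalize capitalization.
clean["Profession"] = clean["Profession"].str.strip().str.title()

# Formatting: drop stray number signs, then convert to integers.
clean["Number of Deliveries"] = (
    clean["Number of Deliveries"].str.replace("#", "", regex=False).astype(int)
)

# Missing values: drop rows missing a key field, fill numeric gaps with the median.
clean = clean.dropna(subset=["Profession"])
median_days = clean["Days Since Last Delivery"].median()
clean["Days Since Last Delivery"] = clean["Days Since Last Delivery"].fillna(median_days)

# Derived column (hypothetical definition): more than one delivery on record.
clean["Repeat Customer"] = clean["Number of Deliveries"] > 1

print(clean)
```

Whatever policy we choose for missing values, it is worth writing down, since it changes which customers end up in the analysis.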

##### REFINE: Exploratory data analysis
As an example of basic statistics, you might check the Mean (STD) or specific frequency counts.

Variable | Mean (STD) or Frequency (%)
---| ---
Number of Deliveries | 50.0 (10)
Earth | 50 (10%)
Amphibios 9 | 100 (20%)
Bogad | 100 (20%)
Colgate 8| 100 (20%)
Other| 150 (30%)

These descriptive stats allow us to:

- Identify trends and outliers
- Decide how to deal with outliers - excluding, filtering, and communication
- Apply descriptive and inferential statistics
- Determine initial visualization techniques
- Document and capture knowledge
- Choose visualization techniques for different data types
- Transform data
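
The two formats in the summary table, Mean (STD) and Frequency (%), can be computed along these lines. The location counts come from the table above; the delivery counts are hypothetical and serve only to show the format.

```python
from collections import Counter
from statistics import mean, stdev

# Location counts from the table above, expanded to one entry per customer.
table_counts = {
    "Earth": 50,
    "Amphibios 9": 100,
    "Bogad": 100,
    "Colgate 8": 100,
    "Other": 150,
}
locations = [planet for planet, n in table_counts.items() for _ in range(n)]

# Frequency (%): count each category and divide by the total.
counts = Counter(locations)
total = sum(counts.values())
for planet, n in counts.items():
    print(f"{planet}: {n} ({n / total:.0%})")

# Mean (STD) for a continuous variable, using hypothetical delivery counts.
deliveries = [40, 45, 50, 55, 60]
print(f"Number of Deliveries: {mean(deliveries):.1f} ({stdev(deliveries):.1f})")
```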

##### BUILD: Create a data model
We select a model based on the outcome we are interested in or the assumptions of the model we are using. An example of a model statement might look like this:

- We completed a logistic regression using Statsmodels v. XX. We calculated the probability of a customer placing another order with Planet Express.  

Here, we are using a logistic model because we are determining the probability that a customer will place a return order, which at its heart is a *classification problem*.

The steps for model building are:  

- Select appropriate model
- Build model
- Evaluate and refine model
- Predict outcomes, action items

##### PRESENT: Communicate the results of your analysis  
Presentations are a critical part of your analysis. It doesn't matter how brilliant your model is or how illuminating your findings are; if you cannot communicate your results effectively, they will not be used.

The most basic form of a data science presentation should include a simple sentence that describes your results:

- "Customers from large companies had twice (CI 1.9, 2.1) the odds of placing another order with Planet Express compared to customers from small companies."
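
A statement like this usually comes from exponentiating a logistic regression coefficient and the ends of its confidence interval. The helper below is a hypothetical illustration: `odds_ratio_with_ci` is not a library function, and the coefficient and standard error are chosen only so the result roughly matches the sentence above.

```python
import math


def odds_ratio_with_ci(coef, std_err, z=1.96):
    """Turn a log-odds coefficient into an odds ratio with an approximate 95% CI."""
    return (
        math.exp(coef),
        math.exp(coef - z * std_err),
        math.exp(coef + z * std_err),
    )


# Hypothetical coefficient (log of 2) and standard error.
coef = math.log(2)
std_err = 0.025

odds_ratio, lower, upper = odds_ratio_with_ci(coef, std_err)
print(f"OR {odds_ratio:.1f} (CI {lower:.1f}, {upper:.1f})")
# prints: OR 2.0 (CI 1.9, 2.1)
```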

Data science presentations can also be far more complex and exciting, like some of the [research presented by Nate Silver's 538 blog](http://fivethirtyeight.com/burrito/#brackets-view).

When creating a presentation, always consider your audience and make sure to practice your presentation beforehand. Consider the types of questions people might have or - better yet - test your presentation on a few people and pay attention to their response. Clarify and refine your presentation accordingly.

Make sure to consider your needs and goals as well as those of your audience. A presentation created for your fellow data scientists will be vastly different than a presentation intended for some executives who are trying to make a business decision.

Key factors of a good presentation include:  

- Summarize findings with narrative and storytelling techniques
- Refine your visualizations for broader comprehension
- Present both limitations and assumptions
- Determine the integrity of your analysis
- Consider the degree of disclosure for various stakeholders
- Test and evaluate the effectiveness of your presentation beforehand

##### A Note About Iteration
Iteration is an important part of *every step* in the Data Science Workflow. At any given point in the process, you may find yourself repeating or going back and re-doing elements in order to better understand your data, clarify your model, and refine your presentation.

For example, after presenting your findings, you may want to:

- Identify follow-up problems and questions for future analysis
- Create a visually effective summary or report
- Consider the needs of different stakeholders and how your report might be changed for them
- Identify the limitations of your analysis
- Identify relationships between visualizations

***


### Read and evaluate the following problem statement: 

Using Planet Express customer data from January 3001-3005, determine how likely previous customers are to request a repeat delivery using demographic information (profession, company size, location) and previous delivery data (days since last delivery, number of total deliveries).



#### 1. What is the outcome?

Answer: return customer indicator (yes/no)

#### 2. What are the predictors/covariates? 

Answer: year, profession, company size, location, days since last delivery, number of total deliveries

#### 3. What timeframe is this data relevant for?

Answer: Jan 3001-3005

#### 4. What is the hypothesis?

Answer: Demographic and previous delivery info will allow us to predict if a customer will be a repeat customer

### Let's get started with our dataset

```python
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np

# Load the Planet Express customer data and preview the first five rows.
pe_df = pd.read_csv("dataset/planet-express.csv")
pe_df.head()
```

Unnamed: 0 | Company Size | Profession | Days Since Last Delivery | Number of Deliveries | Location | Year
---| ---| ---| ---| ---| ---| ---
0 | 2 | Pilot | 3.61 | 3 | Earth | 3001
1 | 2 | Robot | 3.67 | 3 | Earth | 3001
2 | 2 | Pilot | 4.0 | 1 | Earth | 3001
3 | 2 | Pilot | 3.19 | 4 | Earth | 3001
4 | 2 | Pizza Delivery Boy | 2.93 | 4 | Earth | 3001


#### 1. Create a data dictionary 

Answer: 

Variable | Description | Type of Variable
---| ---| ---
Profession |Title of the account owner  | categorical
Company Size | 1- small, 2- medium, 3- large| categorical
Location | planet of the company | categorical 
Days Since Last Delivery | float | continuous
Number of Deliveries | integer | continuous
Year | integer | continuous
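
One way to check a data dictionary against the data is to compare it with the types pandas infers. The sketch below rebuilds the five rows printed by `pe_df.head()` as CSV text. It assumes the first header cell in the file is empty, which is what the `Unnamed: 0` label usually indicates; `head_df` is a name introduced here for illustration.

```python
import io

import pandas as pd

# The rows printed by pe_df.head(), as CSV text. The first, unnamed column
# is assumed to be a row index that was saved along with the data.
csv_text = """,Company Size,Profession,Days Since Last Delivery,Number of Deliveries,Location,Year
0,2,Pilot,3.61,3,Earth,3001
1,2,Robot,3.67,3,Earth,3001
2,2,Pilot,4.0,1,Earth,3001
3,2,Pilot,3.19,4,Earth,3001
4,2,Pizza Delivery Boy,2.93,4,Earth,3001
"""

# index_col=0 reads that column as the index instead of a data column.
head_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Compare the inferred types with the descriptions in the data dictionary.
print(head_df.dtypes)
```

Company Size is read as an integer, but the data dictionary treats it as categorical, so the inferred type alone does not settle how a variable should be used.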


We would like to explore the association between admission into grad school and the prestige of undergraduate institutions.

#### 1. What is the outcome?

Answer: ????

#### 2. What are the predictors/covariates? 

Answer: ????

#### 3. What timeframe is this data relevant for?

Answer: ????

#### 4. What is the hypothesis?

Answer: ????