# Regression Case Study

## Objectives

* Introduce [CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining) (Cross-industry standard process for data mining)
* Examine what CRISP-DM tells us about how to do real data science.
* Share tips for using GitHub for collaboration.
* Reinforce workflow.

<img src="./crisp_dm.png" width="400"/>


* What's the first step?
* What adjacent steps are allowed to inform one another?
* What adjacent steps don't inform one another?
* What's the meaning of the outer circle?

# Git workflow

This simple Git workflow should be enough for today's purposes. A more complex project would likely need something more structured.

[Centralized Git Workflow](https://www.atlassian.com/git/tutorials/comparing-workflows)

## The gist

1. One person creates/forks a repository (and adds others as collaborators)
1. Everyone clones the repository
1. User A makes and commits a small change.
    - Like, really small.
1. User A pushes the change.
1. User B makes and commits a small change.
1. User B tries to push, but gets an error because the remote now contains User A's commit.
1. User B pulls User A's changes.
1. User B resolves any merge conflicts (this is why you keep commits small).
1. User B pushes.
1. Repeat

# Expert advice

## Good ways to improve your case study outcome
1. Make more complete CRISP-DM cycles
1. Collaborate more effectively
1. Use transformers and `sklearn.pipeline`
1. Make a ***very*** simple model sooner
    1. Name some simple models.
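
As one possible illustration of the last two points, here is a minimal sketch that puts a transformer and a very simple model into a single `sklearn.pipeline` and compares it against a trivial baseline. The synthetic data and the names `baseline` and `simple_model` are assumptions made for this example, not part of the case study. Linear regression and a mean predictor are just two candidates for "simple".

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration: y is linear in two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean")

# Simple model: scale the features, then fit ordinary least squares.
# Because the scaler lives inside the pipeline, it is refit on each
# cross-validation training fold and never sees the held-out fold.
simple_model = make_pipeline(StandardScaler(), LinearRegression())

baseline_r2 = cross_val_score(baseline, X, y, cv=5, scoring="r2").mean()
simple_r2 = cross_val_score(simple_model, X, y, cv=5, scoring="r2").mean()
print(f"baseline R^2: {baseline_r2:.3f}, linear model R^2: {simple_r2:.3f}")
```

Having both numbers early gives every later CRISP-DM cycle something concrete to beat.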
   

# End Game

Reconvene at the end of the day to share your results with **a short presentation**. "Short" means 5 minutes, or about 6 slides. You have very little time to communicate the most important things you learned from your data, so don't spend any of it on the sentence "I loaded the data into a pandas dataframe". That sentence tells us nothing about what you've learned.

**Start your presentation with a conclusion**. Then explain the modelling / EDA choices you made to justify that conclusion.



### Pro tip zone: categorical variables in cross validation

Imagine you have a feature `color`, and this is all your training data.

| index  | color  |
|---|---|
|  0 | red  |
|  1 | blue  |
|  2 | red  |
|  3 | green  |
|  4 | blue  |

Dummifying this column (using, say, `pd.get_dummies`) produces one indicator column per color, shown here next to the original `color` column:

| index  | color  | red | blue | green |
|---|---|---|---|---|
|  0 | red  | 1 | 0 | 0 |
|  1 | blue  | 0 | 1 | 0 |
|  2 | red  | 1 | 0 | 0 |
|  3 | green  | 0 | 0 | 1 |
|  4 | blue  | 0 | 1 | 0 |

Suppose you fit a model using only the columns `red`, `blue`, and `green`, in that order, so a data point `x = [0,1,0]` represents a blue thing.

Now say a test data point `x*` comes in and has color `pink`. How can we pass this to our model? In terms of dummy features we saw in the training set, we can only say "this point is not red, green, or blue", so we would encode it as 

`x* = [0,0,0]`

If you have a dataframe full of test points and a categorical column contains a value not seen in the training data, calling `pd.get_dummies` on `df_test` will create a column that your model cannot accept. Likewise, a training category that never appears in the test data leaves a column missing. Your model only knows about the features defined in your training set and expects the same columns in the same order, so make sure your code aligns the test columns to the training columns.
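
One way to handle this, sketched below, is to dummify the test data and then reindex it to the columns seen at training time. The toy dataframes and the names `X_train`, `X_test`, and `train_columns` are assumptions for illustration. This is only one option, and an encoder fitted inside an `sklearn` pipeline is another.

```python
import pandas as pd

# Toy data mirroring the table above, plus a test set with an unseen color.
df_train = pd.DataFrame({"color": ["red", "blue", "red", "green", "blue"]})
df_test = pd.DataFrame({"color": ["pink", "blue"]})

# Dummify the training data and remember the resulting columns and their order.
X_train = pd.get_dummies(df_train["color"], dtype=int)
train_columns = X_train.columns

# Dummify the test data, then align it to the training columns:
# unseen categories (pink) are dropped, and training categories missing
# from the test set (green, red) are added as all-zero columns.
X_test = pd.get_dummies(df_test["color"], dtype=int).reindex(
    columns=train_columns, fill_value=0
)

print(X_test)
```

Note that `pd.get_dummies` orders the columns alphabetically (`blue`, `green`, `red`), not as in the illustrative table, so in this sketch the pink row becomes `[0, 0, 0]` and the blue row becomes `[1, 0, 0]`.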