# The Data Science 'Pipeline'

- The terminology and the exact steps vary slightly depending on where you learned them, but the core of the pipeline stays the same.

## Step One: Planning

Fairly self-explanatory, but a few things should be defined by the end of this phase:

- The goal

- Any deliverables

- A rough 'How to get there'


### The Goal

- Again, fairly obvious: define what you are actually trying to achieve and the measure(s) you will use to judge success. Include any plans for how to achieve it.

### The Deliverable(s)

- This will be the documentation of the goal, measure of success, and the plan to get there. If you don't define what success looks like to you, you won't know when you're there.

### How to Get There

- This will be answered by asking questions *about* the final product and identifying any initial hypotheses to move forward with.

- Common questions will be something like:

    - What will the end product look like?
    
    - What format will it be in?
    
    - Who is it for?
    
    - What is my MVP (Minimum Viable Product)?
    
- Formulating hypotheses will look something like this:

    - Is attribute 1 (from the actual data) related to attribute 2?
    
    - How does attribute 3 relate to the target variable?
    
    - Is the mean of the target variable for subset A significantly different from subset B?
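
The last question above is the kind that a two-sample t-test can answer. What follows is only a minimal sketch, assuming SciPy and NumPy are installed. The two subsets are synthetic stand-ins invented for illustration, and Welch's variant (`equal_var=False`) is an assumption chosen because it does not require the two subsets to have equal variances.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the target variable in two subsets (illustrative only).
rng = np.random.default_rng(seed=0)
subset_a = rng.normal(loc=50, scale=5, size=200)
subset_b = rng.normal(loc=55, scale=5, size=200)

# H0: the two subsets have the same mean target value.
# Welch's t-test (equal_var=False) does not assume equal variances.
t_stat, p_value = stats.ttest_ind(subset_a, subset_b, equal_var=False)

alpha = 0.05  # significance level, chosen before looking at the data
reject_null = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

Writing the hypothesis down in this form during planning makes it clear which columns and which subsets need to be available after acquisition.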
    


## Step Two: Acquisition

- The goal (for this step)

- The deliverable

- How to get there

### The Goal

- Create a path from the original data source to the environment in which you will be working with the data. In my case, I acquire the data in such a way that I can work with it in JupyterLab.

### The Deliverable

- A file, acquire.py (or alternatively wrapped into a wrangle.py), that contains the functions needed to reproduce the acquisition of the data.


### How to Get There

- There are any number of ways to get data, but one very common method is pulling it from a SQL database. If this is used, some refinement of the SQL query is probably necessary before reading the data into the Python environment.

- Another method is to use pandas to read the data directly from a file such as a CSV, JSON, TXT, or XLSX file (among others).

- Web scraping with BeautifulSoup, or even Selenium, is another way to acquire data.
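
As a rough idea of what an acquire.py function for the SQL route might look like, here is a minimal sketch. It assumes pandas is installed and uses an in-memory SQLite database as a stand-in for the real source. The function name `acquire_from_sql`, the `customers` table, and the optional CSV cache are all hypothetical choices made for illustration.

```python
import os
import sqlite3

import pandas as pd


def acquire_from_sql(conn, query, cache_path=None):
    """Return the result of `query` as a DataFrame, caching it to CSV if asked."""
    # Reuse a local copy when one exists, so the database is only hit once.
    if cache_path is not None and os.path.exists(cache_path):
        return pd.read_csv(cache_path)

    df = pd.read_sql(query, conn)

    if cache_path is not None:
        df.to_csv(cache_path, index=False)
    return df


if __name__ == "__main__":
    # In-memory SQLite database standing in for the real data source.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, monthly_charges REAL)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, 29.85), (2, 56.95)]
    )
    print(acquire_from_sql(conn, "SELECT id, monthly_charges FROM customers"))
```

Keeping the query and the caching inside one function means anyone with access to the same source can reproduce the acquisition with a single call.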


## Step Three: Preparation

- The goal 

- The deliverable

- How to get there

### The Goal

- By the end of preparation you want your data split into 2 or 3 subsets (usually 3; if cross-validation will be used later, a train/test split is sufficient). The point is to hold back one sample that was not used in the exploration or development of the model, so that the final model can be tested on 'future' unseen data and its generality and usefulness judged from there.

- Before the split, the data must be cleaned into a form that is easy to interpret.

- Along with acquisition, preparation is one of the most time-consuming parts of this process.

### The Deliverable

- As with acquisition, the deliverable here is a prepare.py file (or functions inside the wrangle.py mentioned above) containing all of the functions used to clean the data, so that the work is reproducible.

- The result should be 2 or 3 dataframes, one per sample.

- If the data is split three ways, there will be a train set for training the algorithms, a validate set for validating the models developed on train, and a test set for a final check on completely unseen data, to ensure the model performs on new data and is not overfit. In this case the splits should be somewhere around 50-60% for train, 20-30% for validate, and around 20% for test.

- If the data is split two ways, there will be a train set and a test set. In this case train should be roughly 80% and test 20%. With this split, another method such as cross-validation should be used to make up for not having a validate set.
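
The two-way split with cross-validation could be sketched as follows. This is an illustration only, assuming scikit-learn is installed. The synthetic dataset, the decision tree, and the choice of 5 folds are all assumptions made for the example, not recommendations from the steps above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a prepared dataset (illustrative only).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 80/20 train/test split; the test set is set aside until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 5-fold cross-validation on the train set takes the place of a validate set.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"mean CV accuracy: {cv_scores.mean():.3f}")
```

Each fold serves once as a temporary validate set, so every training row is used for both fitting and checking while the test set stays untouched.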

### How to Get There

- Using various Python libraries to clean the data: handling null values and outliers, normalizing text data, converting data types into something more useful, and doing any binning required.

- Using matplotlib or seaborn to plot the distributions of the numeric attributes and the target individually. Do NOT compare features to each other until the data has been split, to avoid bias.

- Using scikit-learn to split the data as described above.
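
A prepare.py along these lines might start out like the sketch below, assuming pandas and scikit-learn are installed. The function names `clean_data` and `split_data`, the toy dataframe, and the exact 60/20/20 proportions are hypothetical. Dropping rows with nulls is just one of several reasonable ways to handle missing values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def clean_data(df):
    """Drop duplicate rows and rows with nulls (one simple null-handling choice)."""
    return df.drop_duplicates().dropna().reset_index(drop=True)


def split_data(df, seed=42):
    """Split into roughly 60% train, 20% validate, 20% test."""
    # First carve off 20% as the test set.
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    # 25% of the remaining 80% is 20% of the original data.
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=seed
    )
    return train, validate, test


# Toy data standing in for the acquired dataframe (illustrative only).
df = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})
train, validate, test = split_data(clean_data(df))
print(train.shape, validate.shape, test.shape)
```

Fixing `random_state` keeps the split reproducible, which matters because all later exploration should only ever see the train sample.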

## Step Four: Exploration (and Pre-processing)

- The goal

- The deliverable

- How to get there

### The Goal

- The goal here is to uncover the features that have the largest impact on the target variable, that is, the ones that drive the target in one direction or another.
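
One simple first pass at finding such features is to rank the numeric columns of the train sample by their correlation with the target. The sketch below assumes pandas is installed. The function name `rank_features_by_correlation` and the toy columns are made up for illustration, and Pearson correlation only captures linear relationships, so this is a starting point and not a complete exploration.

```python
import pandas as pd


def rank_features_by_correlation(train, target):
    """Rank numeric features by absolute Pearson correlation with the target."""
    numeric = train.select_dtypes("number")
    correlations = numeric.corr()[target].drop(target)
    return correlations.abs().sort_values(ascending=False)


# Toy train sample (illustrative only): 'strong' tracks the target, 'weak' does not.
train = pd.DataFrame(
    {
        "strong": [1, 2, 3, 4, 5, 6],
        "weak": [2, 1, 2, 1, 2, 1],
        "target": [2, 4, 6, 8, 10, 13],
    }
)
ranked = rank_features_by_correlation(train, "target")
print(ranked)
```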

### The Deliverable 

- The first will be an explore.py file that contains any functions needed to reproduce the pre-processing and exploration of the data.

- The dataframe resulting from this file should be ready to be used for modeling

- This means that:

    - attributes will be reduced to features
    
    - features are in a numeric form
    
    - there are no missing values
    
    - continuous (or numerically encoded categorical) values are scaled to be unitless
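
The checklist above could be met with something like the following sketch, assuming pandas and scikit-learn are installed. The function name `preprocess`, the `tenure` and `plan` columns, and the choice of one-hot encoding with min-max scaling are all assumptions made for illustration. A validate sample would be transformed the same way as the test sample here.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess(train, test, numeric_cols, categorical_cols):
    """One-hot encode categorical columns and min-max scale numeric columns."""
    # Encode categorical attributes as 0/1 indicator columns.
    train_enc = pd.get_dummies(train, columns=categorical_cols, dtype=int)
    test_enc = pd.get_dummies(test, columns=categorical_cols, dtype=int)
    # Give test the same columns as train; categories absent from test become 0.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

    # Fit the scaler on train only, then apply it to both samples.
    scaler = MinMaxScaler().fit(train_enc[numeric_cols])
    train_enc[numeric_cols] = scaler.transform(train_enc[numeric_cols])
    test_enc[numeric_cols] = scaler.transform(test_enc[numeric_cols])
    return train_enc, test_enc


# Toy samples (illustrative only).
train = pd.DataFrame({"tenure": [1.0, 5.0, 9.0], "plan": ["basic", "pro", "basic"]})
test = pd.DataFrame({"tenure": [3.0], "plan": ["pro"]})
train_p, test_p = preprocess(train, test, ["tenure"], ["plan"])
print(train_p)
print(test_p)
```

Fitting the scaler on train alone keeps information from the held-out samples from leaking into the model's inputs.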