# Kaggle Data Science Project Workflow

## General workflow

Reference: https://github.com/ShuaiW/how-to-kaggle

1. Divide Forces
    - Determine jobs that each team member will have
2. Data Setup
    - Get the data
    - Store the data
3. Literature Review
    - See how people solved similar problems
4. Establish an Evaluation Framework
    - Determine metrics to use (accuracy, r2, mean_squared_error, area under the ROC curve, etc.)
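    - Determine the cross-validation strategy (nested cross-validation is a strong default when hyperparameters are tuned, because tuning and performance estimation then use separate folds)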
5. Exploratory Data Analysis
6. Preprocessing
    1. Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
    2. Data integration: use multiple data sources and join/merge data
    3. Data transformation: normalize, aggregate, and embed (e.g. word embeddings)
    4. Data reduction: reduce the volume of data but produce the same or similar analytical results (e.g. PCA)
    5. Data discretization: replace numerical attributes with nominal/discrete ones (for instance, bin a continuous feature)
7. Feature Engineering
    1. Feature selection
        1. Removing features with low variance
        2. Univariate feature selection
        3. Recursive feature elimination
        4. Selecting from model
    2. Feature creation
        1. This is where we create new features from the existing ones. Some examples that apply to many models:
            1. Add a `zero_count` (the number of zero-valued fields) for each row
            2. Split dates into year, month, day, weekday, weekend flag, etc.
            3. Add the percent change from feature to feature (or other interactions among features)
8. Model Tuning
    1. Use grid-search-style algorithms to search the hyperparameter space for the best set of parameters
    2. Use a sequential model-based optimizer such as hyperopt to tune hyperparameters
    3. Use an AutoML tool such as TPOT to search over whole pipelines
9. Ensemble
    1. Combine models to optimize performance
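
The evaluation and tuning steps above (4 and 8) can be combined in a nested cross-validation loop. The following is a minimal sketch with scikit-learn, assuming a binary classification task scored by area under the ROC curve. The synthetic data, the logistic regression model, and the `C` grid are placeholders chosen for illustration, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for a competition training set (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Inner loop: choose hyperparameters using only the outer-training folds.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    scoring="roc_auc",
    cv=inner_cv,
)

# Outer loop: score the whole "search, then refit" procedure on held-out folds.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)

print(f"nested ROC AUC: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

Because the grid search is refit inside every outer fold, the outer scores estimate how well the tuning procedure generalizes rather than how well one particular parameter set fits the data it was tuned on.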

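The feature creation ideas in step 7 can be sketched in pandas as follows. This is only an illustration: the column names (`sale_date`, `q1_sales`, `q2_sales`) are hypothetical, and the last feature is read here as a percent change between two columns, which is an assumption.

```python
import pandas as pd

# Toy frame; all column names are hypothetical.
df = pd.DataFrame(
    {
        "sale_date": pd.to_datetime(["2021-01-02", "2021-01-04", "2021-03-15"]),
        "q1_sales": [100.0, 0.0, 50.0],
        "q2_sales": [110.0, 0.0, 25.0],
    }
)
numeric_cols = ["q1_sales", "q2_sales"]

# 1. zero_count: number of zero-valued numeric entries in each row.
df["zero_count"] = (df[numeric_cols] == 0).sum(axis=1)

# 2. Date parts.
df["year"] = df["sale_date"].dt.year
df["month"] = df["sale_date"].dt.month
df["day"] = df["sale_date"].dt.day
df["weekday"] = df["sale_date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["weekday"] >= 5

# 3. Percent change from one feature to the next.
#    A zero base is replaced by NaN so the result is NaN instead of inf.
base = df["q1_sales"].replace(0, float("nan"))
df["q1_to_q2_pct_change"] = (df["q2_sales"] - df["q1_sales"]) / base * 100

print(df)
```

Wrapping each of these steps in a function makes it easy to apply the same transformation to both the training and the test set.
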
## Classification Problems
1. Exploratory Data Analysis
    1. Target Variable Analysis
        1. Basic distribution
        2. Are the classes imbalanced?
    2. Null Counts/Fraction
        1. Nulls by column
        2. Nulls by row
        3. Nulls by label (by row and/or column)
        4. Determine whether the nulls stand for an obvious value (e.g. in the Iowa housing problem, nulls had a specific meaning in the lookup table)
    3. Count of 0s (if it's unclear what a 0 represents)
    4. Univariate analyses
        1. Distributions or value counts for each of the individual variables (if the number of variables is small enough)
        2. Histograms of the individual variables
            1. Plot these on a log scale if the range of values is very wide.
        3. Outlier analysis
            1. Calculate z-scores or robust z-scores
    5. Contingency tables
2. Model Preprocessing
    1. Null value imputation
    2. Categorical variable encoding
    3. Transforms
    4. Etc.
3. Model building
    1. Pipeline
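
Several of the EDA checks listed above (class balance, null counts by column, row, and label, and robust z-scores) can be sketched in a few lines of pandas. The toy data and column names are hypothetical. The robust z-score here uses the median and the median absolute deviation (MAD) scaled by 1.4826, and the 3.5 cutoff is a common rule of thumb rather than something prescribed by this workflow.

```python
import numpy as np
import pandas as pd

# Toy data; column names are hypothetical.
df = pd.DataFrame(
    {
        "age": [22.0, 35.0, np.nan, 41.0, 29.0, 95.0],
        "income": [30.0, np.nan, np.nan, 52.0, 48.0, 50.0],
        "label": [0, 1, 1, 0, 0, 1],
    }
)
features = ["age", "income"]

# Target distribution as fractions; a skewed result signals class imbalance.
class_balance = df["label"].value_counts(normalize=True)

# Null fraction by column and null count by row.
is_null = df[features].isna()
null_frac_by_col = is_null.mean()
null_count_by_row = is_null.sum(axis=1)

# Null fraction by column within each label.
null_frac_by_label = is_null.astype(int).groupby(df["label"]).mean()


def robust_zscore(s: pd.Series) -> pd.Series:
    """Median/MAD z-score; 1.4826 makes the MAD comparable to a normal std."""
    median = s.median()
    mad = (s - median).abs().median()
    return (s - median) / (1.4826 * mad)


# Flag values far from the median (3.5 is an assumed threshold).
age_outliers = df.loc[robust_zscore(df["age"]).abs() > 3.5, "age"]

print(class_balance, null_frac_by_col, null_frac_by_label, age_outliers, sep="\n\n")
```

If the MAD is zero (more than half of the values are identical), this version divides by zero, so a real analysis would need a fallback for that case.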

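The model preprocessing and model building steps above can be tied together in a single scikit-learn `Pipeline`. The sketch below is one possible arrangement, assuming one numeric and one categorical column. The column names, the imputation strategies, and the logistic regression classifier are illustrative choices, not requirements of the workflow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; column names are hypothetical.
X = pd.DataFrame(
    {
        "age": [22.0, 35.0, np.nan, 41.0, 29.0, 58.0, 33.0, 47.0],
        "city": ["a", "b", "a", np.nan, "b", "a", "b", "a"],
    }
)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Numeric columns: null imputation followed by a scaling transform.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: null imputation followed by one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["city"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Keeping the preprocessing inside the pipeline means that imputation and encoding statistics are learned only from the training folds during cross-validation, which avoids leaking information from the validation data.
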
## Keeping track of models

There are two possible approaches. The first carries a lot of overhead (and will be used on actual work projects): it separates the tasks into individual files/scripts.

The second streamlines the process a bit for quicker iteration.

1. The first notebook/script gets the data (downloading anything that is needed)
    - Combines external data if needed
    - Takes data from the raw data folder and the external sources folder, and writes outputs to the interim data folder
    
2. The second notebook, or series of notebooks (2.0-2.9), does the EDA
    - No file outputs other than charts and visualizations
    - Some feature engineering will occur in this stage, so the outputs may include feature-creation functions, which are then applied in the preprocessing/feature engineering step
    
3. The third series (3.0-3.9) does preprocessing/feature engineering. These two tasks are intertwined and should be done together.
    - Inputs are the files from the interim folder
    - Outputs to the processed data folder
    
4. The fourth series (4.0-4.9) trains and tunes models
    - Inputs are the processed data files
    - Outputs are:
        - The serialized submissions for the Kaggle competition
        - The serialized (pickled) model files
        - A model summary document (references the training data used, the Python source code used, the cross-validation score obtained on the training data, possibly the hyperparameters, etc.)
  
5. The fifth series (5.0-5.9) combines/ensembles models
    - Inputs are the models
    - Outputs are the same as the previous step's outputs
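
The outputs of the training step (a pickled model plus a model summary document) could be written along the following lines. This is a sketch under stated assumptions: the run name, the file names, the `data/processed/train.csv` path, and the JSON layout of the summary are all hypothetical, and a temporary directory stands in for the project's models folder.

```python
import json
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the processed training data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = LogisticRegression(C=1.0, max_iter=1000)
cv_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
model.fit(X, y)

# Temporary directory used here in place of a real models folder.
models_dir = Path(tempfile.mkdtemp()) / "models"
models_dir.mkdir(parents=True)
run_name = "4.0-logreg-baseline"  # hypothetical naming scheme

# Serialized model.
model_path = models_dir / f"{run_name}.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Model summary document: data, source, CV score, hyperparameters.
summary = {
    "run_name": run_name,
    "training_data": "data/processed/train.csv",  # hypothetical path
    "source_file": "4.0-train-logreg.ipynb",  # hypothetical notebook name
    "cv_metric": "roc_auc",
    "cv_score_mean": float(cv_scores.mean()),
    "cv_score_std": float(cv_scores.std()),
    "hyperparameters": model.get_params(),
}
summary_path = models_dir / f"{run_name}.json"
summary_path.write_text(json.dumps(summary, indent=2, default=str))

print(summary_path.read_text())
```

Using the same run name for the pickle and the summary keeps each model paired with the record of how it was produced, which helps when choosing models for the ensembling step.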

