# Project 3: Contract Proposal with UCI & Josh's Madelon Dataset


### Stewart Knox, DSI-Plus-2

## Problem Statement

> You're working as a data scientist with a research firm. Your firm is bidding on a big project that will involve working with thousands or possibly tens of thousands of features. You know it will be impossible to use conventional feature selection techniques. You propose that a way to win the contract is to demonstrate a capacity to identify relevant features using machine learning. Your boss says, "Great idea. Write it up." You figure that working with a synthetic dataset such as Madelon is an excellent way to demonstrate your abilities.

Your challenge here is to develop a series of models for two purposes:

1. for the purposes of identifying relevant features. 
2. for the purposes of generating predictions from the model. 

My final product consists of:

1. This prepared report of my findings, which details the accuracy and assumptions of my model
2. A series of Jupyter notebooks to be used to control my pipelines
   
## Executive Summary
- Feature selection successfully removed both noise and redundant features from both example datasets, eventually yielding 86-92% prediction accuracy on held-out test data. The method also took advantage of several computational shortcuts and sampling, meaning it can scale to thousands or tens of thousands of features without significant additional investment in resources or manpower.
- On the UCI Madelon dataset (hereafter UCI), reduced to its 20 relevant features, our model scored 92% accuracy using SVC, far better than the 50% baseline achievable by guessing, in a dataset that contained ~1 million cells. On Josh's Madelon dataset (hereafter DB), likewise reduced to 20 relevant features, our model scored 86.3% with KNN and 85.9% with RandomForestClassifier, in a dataset that contained ~200 million cells.
- In both cases, the processes of noise reduction and feature selection performed excellently on small subsets of the data and scaled up to analyzing the entire dataset. Several lessons learned in this process will enable further scaling to even larger datasets, including those on our upcoming "big project that will involve...tens of thousands of features."

## Project Roadmap

### Attribute Information:

Listing of attributes: 

- Features: `0,1,2...499` for UCI and `0,1,2...999` for DB: a combination of 5 informative, 15 redundant, and either 480 or 980 random  features 
- Target: `500` in UCI and `1000` in DB

### Jupyter Notebook Step 0 - EDA
1. Gather and store data. Do substantive work on at least six subsets of the data. 
    - 3 sets of 10% of the data from the UCI Madelon set.  
        - Upon loading the data, 3 samples of 200 rows each were pulled for processing. Initial descriptive statistics bore out that these were representative of the overall dataset, since together they amounted to 30% of the data.
        - Lessons learned: this small sample size had consequences down the road when running the model benchmarks, which were ultimately solved in notebook 4 by running the whole dataset (see Step 4 below). Each sample corresponded to a 7% margin of error at 95% confidence. Most descriptive statistics at this point, however, provided little insight since the data is opaque and has no real-world meaning.
        - Column 500 in UCI was all NaN, so I dropped it and filled it with the provided target instead
    - 3 sets of 10% of the data from the Madelon set made available by your instructors  		
        - Loading this data initially proved difficult. AWS and Josh's server are limited in what can be downloaded and stored at any given time. Ten percent is 20,000 rows, but that's 160 MB x 3, and I couldn't pull more than 5,000 rows at a time from Josh's server without crashing AWS. Ultimately, a 1.5% sample was pulled and randomly split 3 ways for processing. Initial descriptive statistics bore out that this was representative of the dataset as a whole, putting me within a 3% margin of error at 95% confidence.
        - Column 0 in DB was a seemingly meaningless `_id` value that I deleted
2. EDA - Removing noise
    - Since the signal-to-noise ratio was known to be extremely poor, I ran Josh's R2 method at this stage as my initial form of EDA, even though it's technically feature selection. This removed the need for computationally intensive 500x500 or 1000x1000 correlation and histogram plots, which would not add significant value. The method was used on both the UCI and DB samples in series, with a function written to run each over KNeighborsRegressor and DecisionTreeRegressor and output the lists of features.
    - Each run of the models across all 6 datasets yielded nearly identical results that allowed elimination of the 480/980 noisy features, so all the noise was subsequently dropped from the sample datasets (with the `_clean` marker appended to the samples)
3. Skew-Normalize & Standardize Data
    - I did not heavily investigate the skew of the data, since it appeared to be generated in a normal way
4. Investigate outliers
    - I did not investigate outliers because I reasoned that both the size and generating method for the data would limit their presence or effect
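
The noise screen in item 2 above is not spelled out in this report, so the following is only a minimal sketch under a stated assumption: that the R2 method means predicting each feature from all the other features and keeping the ones with a positive held-out R2. The function name `r2_screen`, the threshold, and the small synthetic stand-in dataset are all hypothetical and are not taken from the project notebooks.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def r2_screen(X, model, threshold=0.0, random_state=0):
    """Return indices of columns that the remaining columns can predict.

    ASSUMPTION: one plausible reading of the "R2 method", not the project's
    actual code. A pure-noise column is independent of every other column,
    so its held-out R^2 should be at or below zero.
    """
    keep = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)  # every column except column j
        X_tr, X_te, t_tr, t_te = train_test_split(
            others, X[:, j], test_size=0.25, random_state=random_state
        )
        score = model.fit(X_tr, t_tr).score(X_te, t_te)  # held-out R^2
        if score > threshold:
            keep.append(j)
    return keep


# Hypothetical stand-in data, much smaller than Madelon. With shuffle=False,
# columns 0-4 are informative, 5-9 are redundant, and 10-29 are pure noise.
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=5, n_redundant=5,
    n_repeated=0, shuffle=False, random_state=0,
)
kept = r2_screen(X, DecisionTreeRegressor(random_state=0))
print("columns kept:", kept)
```

Swapping in `KNeighborsRegressor` for the tree and comparing the two lists would mirror the two-model check described above.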

### Jupyter Notebook Step 1 - Benchmarking
1. For efficiency, all samples and pipelines were streamlined so they could be looped over with the same function (one that includes training/testing validation) across all notebooks.  In benchmarking, pipelines were built to perform a naive fit for each of the base model classes:
	- logistic regression
	- decision tree
	- k nearest neighbors
	- support vector classifier

<img src="2.png" alt="Smiley face" height="200" width="700">
<img src="3.png" alt="Smiley face" height="200" width="700">
<img src="4.png" alt="Smiley face">

2. The most valuable performance metric here is test accuracy, a measure of how well the model predicts the correct label of our target variable. The benchmark classifiers seriously overfit at this stage, with train scores far above test scores, especially on samples that still contain the noise (all 500 or 1000 features) compared to the ones that have had Josh's first-pass noise reduction run on them, which supports the conclusion that the noise was successfully removed. With baseline parameters, however, many of the models are only a little better than a coin flip at predicting an outcome on the test split.
	- Of these, KNN and SVC seem to perform the best, getting as high as 70-80 percent in many of the test splits.  Variance is large, however, as SVC often gets worse than a coin flip with some samples.  As a result, more tuning is needed.
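
The naive benchmark loop described in item 1 above might look roughly like the sketch below. It is an illustration rather than the notebook code: the `benchmark` helper, the `StandardScaler` step, the 75/25 split, and the synthetic 200-row, 500-feature stand-in sample are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def benchmark(X, y, random_state=0):
    """Fit each base model naively and return (train, test) accuracy per model."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    models = {
        # max_iter is raised only to avoid convergence warnings.
        "logistic regression": LogisticRegression(max_iter=1000),
        "decision tree": DecisionTreeClassifier(random_state=random_state),
        "k nearest neighbors": KNeighborsClassifier(),
        "support vector classifier": SVC(),
    }
    scores = {}
    for name, model in models.items():
        # ASSUMPTION: scaling inside the pipeline; the report does not say.
        pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
        pipe.fit(X_tr, y_tr)
        scores[name] = (pipe.score(X_tr, y_tr), pipe.score(X_te, y_te))
    return scores


# Hypothetical stand-in for one noisy 200-row UCI sample (500 features).
X, y = make_classification(
    n_samples=200, n_features=500, n_informative=5, n_redundant=15,
    random_state=0,
)
results = benchmark(X, y)
for name, (train_acc, test_acc) in results.items():
    print(f"{name:26s} train={train_acc:.2f}  test={test_acc:.2f}")
```

A large gap between the train and test columns is the overfitting signature discussed above.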

<img src="noise.png" alt="Smiley face">
<img src="6.png" alt="Smiley face">


### Jupyter Notebook Step 2 - Identify Features
1. SelectKBest, SelectFromModel, and SelectPercentile were initially selected as my 3 feature selectors to get from 20 to 5 features. Reusing code and pipelines from the last notebook allowed for quick and easy gridsearching over each model. The intersection of the resulting feature lists was then identified and graphed as appropriate.

<img src="8.png" alt="Smiley face" height="150" width="450">
<img src="9.png" alt="Smiley face" height="450" width="450">
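
As a rough sketch of the selector comparison in item 1, the snippet below fits the three selectors and intersects the features they keep. The scoring functions, the `k`, `percentile`, and `threshold` values, the random forest estimator, and the synthetic 20-feature data are hypothetical choices for illustration, not the tuned settings from the notebook.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    SelectFromModel,
    SelectKBest,
    SelectPercentile,
    f_classif,
    mutual_info_classif,
)

# Hypothetical stand-in for a cleaned sample: 5 informative + 15 redundant.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_redundant=15,
    random_state=0,
)

selectors = {
    "SelectKBest": SelectKBest(f_classif, k=10),
    "SelectPercentile": SelectPercentile(
        partial(mutual_info_classif, random_state=0), percentile=50
    ),
    "SelectFromModel": SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold="median",
    ),
}

# Record which column indices each selector keeps.
chosen = {}
for name, selector in selectors.items():
    selector.fit(X, y)
    chosen[name] = set(selector.get_support(indices=True).tolist())

# Features that every selector agrees on.
common = sorted(set.intersection(*chosen.values()))
print("kept by all three selectors:", common)
```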

2. Despite extensive tuning, these methods were unable to reduce the feature set to fewer than about 9 or 10 features in either the UCI or DB sets.

3. A plot of the correlation matrix was used to test redundancy, and while it showed low redundancy, some still remained.  This necessitated a different approach to feature selection and dimensional reduction.

4. Thus, Principal Component Analysis was next.  Each graph across samples came out as below, showing no explanatory power beyond 5 features.  This insight was incorporated into all following pipelines.

<img src="10.png" alt="Smiley face">

<img src="11.png" alt="Smiley face" height="450" width="450">

<img src="pca.png" alt="Smiley face" height="450" width="450">
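
The PCA check in item 4 can be sketched as follows, assuming synthetic data built the same way as Madelon's relevant features (5 informative plus 15 redundant linear combinations). The 99.9% cutoff and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a cleaned 20-feature sample.
X, _ = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_redundant=15,
    random_state=0,
)

# Fit PCA with all components and accumulate the explained variance.
pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 99.9% of the variance.
n_needed = int(np.searchsorted(cumulative, 0.999) + 1)
print("cumulative explained variance:", np.round(cumulative[:6], 4))
print("components needed:", n_needed)
```

Because the redundant columns here are exact linear combinations of the informative ones, the cumulative curve flattens after at most five components, which is the same shape the plots above show.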


### Jupyter Notebook Step 3 - Testing Model Pipelines
- Considering these results, develop a strategy for building a final predictive model.  Recommended approaches:
    - Use feature selection to reduce the dataset to a manageable size then use conventional methods
    - Use dimension reduction to reduce the dataset to a manageable size then use conventional methods
    - Use an iterative model training method to use the entire dataset
- This notebook should be a "playground" where you try various approaches to solving this problem

1. Thus far, I have used Josh's method to remove the noise from the dataset as part of EDA, and subsequently ran PCA to narrow down to 5 features, which was successful. This is a version of the second approach above: dimension reduction by removing noise, followed by further dimension reduction with PCA (going from 20 to 5 features). Conventional methods then follow, including our baseline models and Bagging, RandomForest, and ExtraTrees models, all of which were gridsearched to tune their parameters for a larger test. Initial results were not much better than the benchmark, and were worse in some cases.

<img src="14.png" alt="Smiley face">
<img src="15.png" alt="Smiley face">

2. Notably, DB scores either gained in accuracy or stayed the same, whereas UCI scores generally dropped.  This is likely due to our initial sampling limitations, where the size of the sample yielded a higher margin of error for UCI than it did for DB.  The UCI sets (at 200 apiece) were likely too small to explain a 5D parameter space.  Luckily, the smaller model is the easiest to enlarge, so Notebook 4 will document my efforts to scale up and increase the score on the full data.
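
The margin-of-error figures quoted for the two sets of samples follow from the standard formula for a proportion at 95% confidence, using the worst case p = 0.5. The sketch below assumes 200 rows per UCI sample and about 1,000 rows per DB sample (1.5% of 200,000 rows split three ways), and ignores the finite population correction.

```python
import math


def margin_of_error(n, z=1.96, p=0.5):
    """Margin of error for a proportion: z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)


print(f"UCI sample, n=200:  {margin_of_error(200):.1%}")   # 6.9%
print(f"DB sample,  n=1000: {margin_of_error(1000):.1%}")  # 3.1%
```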

### Jupyter Notebook Step 4 - Build Model

1. After sampling and noise reduction, the process of pulling solely the informative and redundant features was simple and not computationally taxing.  The DB sample was pulled with a SELECT query on our 20 identified critical features (20 cols x 200000 rows = 4 million cells, only marginally more than the original samples). 
2. Additional models were added beyond the baseline models to see if scores improved, including `AdaBoost`, `ExtraTreesClassifier`, `RandomForestClassifier`, and `BaggingClassifier`
3. Running these models and gridsearching over the same parameters on `UCIfull_clean` produced a marked improvement in test score, especially with SVC and KNN. The score was further improved by stratifying at the train_test_split (`stratify=y`) rather than as part of the gridsearch, saving some computation by doing it upfront rather than multiple times during processing. The lowest-performing models from testing were also dropped, and the parameter space was narrowed to allow for quicker searching over large numbers of rows. SVC ultimately yielded a 92.4% accuracy score on the test data using the parameters identified below.

<img src="17.png" alt="Smiley face">
<img src="18.png" alt="Smiley face" height="700" width="700">

4. And now, the elephant in the room. Running on Josh's DB required further limiting of models and feature spaces, and caused several kernel deaths. Another week could have been spent optimizing this one. There's still overfitting relative to the original model, but variance has been reduced significantly, increasing scores from 68% up to 86% with KNN and to 85.9% with `RandomForestClassifier`.

<img src="19.png" alt="Smiley face">

<img src="20.png" alt="Smiley face" height="700" width="700">
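
The stratified split and SVC gridsearch from item 3 can be sketched as below. The parameter grid, the 3-fold cross-validation, the scaling step, and the synthetic data are hypothetical placeholders. The tuned values behind the 92.4% score are the ones shown in the figures above and are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for a cleaned 20-feature dataset.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_redundant=15,
    random_state=0,
)

# Stratify once, up front, so train and test keep the same class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),  # 5 components, per the PCA result in Step 2
    ("svc", SVC()),
])

# ASSUMPTION: an illustrative, deliberately narrow grid.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_tr, y_tr)
test_acc = search.score(X_te, y_te)
print("best params:", search.best_params_)
print(f"test accuracy: {test_acc:.3f}")
```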


