# Outline

1. Abstract
2. Intro
3. Datasets & Algorithm Intro
    1. Data
        - Abalone
        - Madelon
    2. Algorithms
        - Decision Tree
        - Boosting
        - ANN
        - KNN
        - SVM
4. Methodology
    * Brief overview of scripts/pipeline
5. Discussion: Algorithms Analysis
    * For each algorithm: 
        * Train/Test Error Rates
            * At the end if there's enough time, might want to add
            precision, recall, f1 score as metrics on best hyper param models
        * Training Time
        * Learning Rate
        * 'Overfitting' Curves (Expressiveness)
        * Hyperparameter Analysis
            * Why did these come out the best? 
            Discuss what each parameter does and reasoning for why that performed best. 
            Look at Grid search results if possible to see if distribution of params change 
            the outcome much (i.e. the effect of param on performance)
        

6. Conclusion
    * Key Takeaways
    

# Abstract


The world is full of problems, which means there's learning to do. If you've got answers, I've got supervised learning techniques. Specifically, if the problem happens to be identifying the age range of abalones or solving a non-linear classification problem buried in vast amounts of noise, then the supervised learning algorithms Decision Trees, Boosting, Artificial Neural Networks, K Nearest Neighbours and Support Vector Machines may be of help. For each of these datasets, and in turn each algorithm, classification accuracy was tested under cross validation over a variety of hyperparameters (learning rate, regularization, etc.) using scikit-learn's GridSearchCV. The resulting hyperparameters, model performance, learning curves and 'overfit-ability' are examined. This analysis gives further insight into both the algorithms and the problems.


[Short explanation of the Chebyshev distance for KNN](https://www.matec-conferences.org/articles/matecconf/pdf/2017/54/matecconf_iceesi2017_01024.pdf)

# Intro
#### Datasets
Interesting analysis requires interesting problems. To illuminate the strengths, weaknesses and quirks of the examined supervised learning algorithms, two well-known datasets from the UCI Machine Learning Repository are examined.

#### Abalone 

*Instances*: 4177

*Attributes*: 10 

*Data Types*: Continuous (9), Categorical (1)

*Classes*: 0, 1, 2 for abalones with rings in ranges [0, 8], [9, 10], [11, 28]

This dataset records continuous physical characteristics of abalones (a type of shellfish) together with their categorical sex. The purpose of the dataset is to classify each abalone by age, which is a function of the number of rings inside the abalone's shell (age = rings + 1.5). Counting the rings is a time-consuming process, which could be avoided by predicting the age from characteristics that are easier to measure. The original dataset has 28 separate classes (Rings ranging from 1-28). To reduce this, the Rings were segmented into three classes, one each for ring counts of 0-8, 9-10 and 11+. These ranges each cover roughly 1/3 of the distribution, with the mean number of Rings almost exactly 10. The variance of ~10 and kurtosis of 2.3 indicate a widely dispersed distribution with heavy tails. This is further evidenced by the Rings histogram:

<img src='img/rings_hist.png'>

The distribution peaks sharply around 9-10 rings, indicating that most abalones may stop growing around this point or that some confounding variable is present (such as access to food, or survival conditions such as weather and/or harvesting). Binning down to three classes creates a roughly balanced classification problem with ~1400 instances in each class (4177 instances split into roughly equal thirds). The algorithms will be learning to discriminate between young abalones (Class 0), average age abalones (Class 1) and older abalones (Class 2). This will be a difficult task, as nearly all features are moderately to strongly positively correlated with both each other and the target variable (Rings/Class), as evidenced by this correlation matrix:

<img src='img/abalone_correlation1.PNG'>

The strong correlation with the classes should aid in discriminating between classes 0 and 2, but the multicollinearity of the features may make it difficult to distinguish the extremes of the age range (Class 0, Class 2) from the average aged abalones (Class 1). In other words, the algorithms will need to work out which physical traits separate the young abalones from average age abalones and the old abalones from average age abalones, all from variables that are strongly correlated with one another and have similar distributions. The differing weight metrics are all very similar, with a large positive skew in their distributions, while length and diameter are negatively skewed. Height has minimal variance, indicating it may be uninformative.
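The class construction described in this section can be sketched in a few lines. This is a minimal illustration and not the script used for the analysis: the names `CLASS_UPPER_EDGES`, `rings_to_class` and `rings_to_age` are hypothetical, and the bin edges are taken directly from the ranges quoted above (0-8, 9-10, 11+).

```python
from bisect import bisect_left

# Inclusive upper ring counts for Class 0 and Class 1 (0-8 and 9-10 in the text).
# Anything above the last edge falls into Class 2 (11+).
CLASS_UPPER_EDGES = (8, 10)


def rings_to_class(rings):
    """Map a ring count to class 0 (young), 1 (average age) or 2 (older)."""
    # bisect_left returns how many edges are strictly below `rings`,
    # which is exactly the class index for inclusive upper edges.
    return bisect_left(CLASS_UPPER_EDGES, rings)


def rings_to_age(rings):
    """Age estimate quoted in the text: age = rings + 1.5."""
    return rings + 1.5


example_rings = [4, 8, 9, 10, 11, 20]
example_classes = [rings_to_class(r) for r in example_rings]
```

Applying `rings_to_class` to the Rings column of the dataset would give the three-class target used in the rest of the analysis.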

#### Madelon 

*Instances*: 5000

*Attributes*: 500

*Data Types*: Continuous (500)

*Classes*: 0, 1 

The MADELON dataset is an artificial dataset created in 2003 for the NIPS conference as part of a feature selection challenge. The data points are grouped in 32 clusters placed on the vertices of a five dimensional hypercube, and each cluster was randomly assigned a class (either 1 or -1). Additionally, the five dimensions were transformed by linear combinations to form fifteen more features. To complicate the problem, 480 features of random noise were added to the dataset.
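To make the construction above concrete, a Madelon-like dataset can be generated with scikit-learn's `make_classification`, whose generator is adapted from the one designed for Madelon. This is only a sketch of the structure (5 informative, 15 redundant and 480 noise features, 32 clusters), not the actual challenge data; the helper name `make_madelon_like` and the default sample count are assumptions for illustration, and labels come out as 0/1 rather than 1/-1.

```python
from sklearn.datasets import make_classification


def make_madelon_like(n_samples=2000, random_state=0):
    """Generate synthetic data with the same feature structure as Madelon."""
    X, y = make_classification(
        n_samples=n_samples,
        n_features=500,           # 5 informative + 15 redundant + 480 noise
        n_informative=5,          # the five hypercube dimensions
        n_redundant=15,           # linear combinations of the informative ones
        n_repeated=0,
        n_classes=2,
        n_clusters_per_class=16,  # 2 classes x 16 clusters = 32 clusters
        shuffle=False,            # keep column order: informative, redundant, noise
        random_state=random_state,
    )
    return X, y
```

With `shuffle=False`, columns 0-4 are the informative features, columns 5-19 the redundant ones and the remainder pure noise, which makes it easy to check what a feature selection step actually keeps.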

Of particular interest here is that the Madelon dataset presents a highly non-linear problem where the signal-to-noise ratio is very low. Only 1% of the features are truly useful (the 5 dimensions), while 15 (3%) are superfluous albeit still informative. This leaves 96% as completely useless to learn from. To alleviate some of the imbalance in signal-to-noise ratio, sklearn's feature selection method SelectFromModel was used in tandem with a RandomForestClassifier. The feature selection was repeated four times with the threshold set to 'median', i.e. any feature in the lower half of feature importance is dropped. In other words, the more important half of the features was kept on each pass, and four passes left 31 features for the algorithms to learn from. In the best case scenario, this would leave the 20 informative features and 11 noise features.
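The repeated median-threshold selection described above might look roughly like the following. It is a sketch under assumptions rather than the original script: the function name `repeated_median_selection`, the forest size and the synthetic stand-in data (80 features instead of 500, to keep it quick) are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel


def repeated_median_selection(X, y, rounds=4, n_estimators=25, random_state=0):
    """Return the column indices that survive `rounds` passes of halving."""
    keep = np.arange(X.shape[1])
    for _ in range(rounds):
        selector = SelectFromModel(
            RandomForestClassifier(
                n_estimators=n_estimators, random_state=random_state
            ),
            threshold="median",  # drop features below the median importance
        )
        selector.fit(X[:, keep], y)
        keep = keep[selector.get_support()]
    return keep


# Small synthetic stand-in for Madelon (assumption: 80 features, not 500).
X_small, y_small = make_classification(
    n_samples=400, n_features=80, n_informative=5, n_redundant=15,
    n_clusters_per_class=16, shuffle=False, random_state=0,
)
selected = repeated_median_selection(X_small, y_small)
```

Because each pass refits the forest on the surviving columns only, the importances are re-estimated every time, which is the reason for repeating the selection instead of keeping the top features from a single fit.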

In addition to the noise issues, the non-linearity of the problem presents an interesting challenge to the learning algorithms. Algorithms without the expressiveness to describe non-linear patterns, e.g. a linear SVM, may struggle on the dataset, while others, e.g. an SVM with an RBF kernel, may perform better.
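The difference in expressiveness between a linear and an RBF kernel can be illustrated on a toy problem. The sketch below uses scikit-learn's `make_circles` (two concentric rings, not the Madelon data) as an assumed stand-in for a non-linear decision boundary; the helper name `compare_kernels` is hypothetical.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def compare_kernels(X, y, cv=5):
    """Mean cross-validated balanced accuracy for a linear and an RBF SVM."""
    scores = {}
    for kernel in ("linear", "rbf"):
        fold_scores = cross_val_score(
            SVC(kernel=kernel), X, y, cv=cv, scoring="balanced_accuracy"
        )
        scores[kernel] = fold_scores.mean()
    return scores


# Two concentric rings: no straight line separates the classes.
X_rings, y_rings = make_circles(
    n_samples=400, factor=0.3, noise=0.05, random_state=0
)
kernel_scores = compare_kernels(X_rings, y_rings)
```

On data like this the linear kernel should sit near chance while the RBF kernel separates the rings almost perfectly; whether the same gap appears on Madelon depends on how much noise survives feature selection.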

## Algorithms & Methodology

All algorithms were implemented via the Python machine learning package scikit-learn.

For each algorithm, the learner was trained with five-fold cross validation across a variety of hyperparameters, using balanced accuracy as the performance metric. The best parameters were stored, and the best performing classifier was then trained on varying amounts of the data, with its performance and wall clock time recorded to illustrate its learning curve and computation cost. The variance or 'overfit-ability' of the iterative learners (ANN, Boosting, SVMs) was tested by measuring the train and test accuracy across an increasing number of iterations, using hyperparameters with high expressiveness (i.e. regularization parameters set to very low values).
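The tuning and learning-curve steps described above could be sketched as follows. This is a hedged illustration, not the pipeline used for the results: the function name `tune_and_profile`, the decision tree parameter grid and the synthetic data are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, learning_curve
from sklearn.tree import DecisionTreeClassifier


def tune_and_profile(estimator, param_grid, X, y, cv=5):
    """Grid search with balanced accuracy, then a timed learning curve."""
    search = GridSearchCV(
        estimator, param_grid, scoring="balanced_accuracy", cv=cv
    )
    search.fit(X, y)

    # Refit the best configuration on growing subsets of the training folds
    # and record scores plus wall clock fit times.
    sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        search.best_estimator_, X, y, cv=cv, scoring="balanced_accuracy",
        train_sizes=np.linspace(0.2, 1.0, 5), return_times=True,
    )
    return {
        "best_params": search.best_params_,
        "best_cv_score": search.best_score_,
        "train_sizes": sizes,
        "train_score_mean": train_scores.mean(axis=1),
        "test_score_mean": test_scores.mean(axis=1),
        "fit_time_mean": fit_times.mean(axis=1),
    }


# Illustrative data and grid (assumptions, not the grids used in the analysis).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
toy_grid = {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 8]}
profile = tune_and_profile(
    DecisionTreeClassifier(random_state=0), toy_grid, X_toy, y_toy
)
```

Plotting `train_score_mean` and `test_score_mean` against `train_sizes` gives the learning curve, and `fit_time_mean` gives the corresponding computation cost.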

#### Artificial Neural Network
*Hyper Parameters Searched: Activation Function, Learning Rate, Hidden Layer Size*

#### Boosting
*Hyper Parameters Searched: Number of Estimators, Learning Rate (of base estimator Decision Tree)*

#### Decision Tree
*Hyper Parameters Searched: Splitting Criteria, Learning Rate, Node Count*


#### K Nearest Neighbours
*Hyper Parameters Searched: Distance Metric, Number of Neighbours, Weighting of Neighbours Method*

#### SVM
*Hyper Parameters Searched: Kernel (Linear, RBF), Learning Rate, Number of Iterations*

## Algorithm Analysis




#### Artificial Neural Network

