# Mushroom Edibility Report

## Overview

This is a project for DSCI 573, a feature and model selection class. The goal was to perform a full analysis on a dataset of my choosing. I chose a dataset on mushroom edibility and set out to build an accurate model that predicts whether a mushroom is edible given a set of attributes.

## Data

For this project, I used [this dataset](http://archive.ics.uci.edu/ml/datasets/Mushroom) from the UCI Machine Learning Repository. It has two possible labels that show whether the mushroom is poisonous (`p`) or edible (`e`). There are 22 attributes, including the following: cap shape, surface, and colour; bruises; odor; gill attributes; stalk attributes; veil attributes; ring attributes; spore colour; population density; and habitat. The table below shows an example of the raw data:

![](table.png)

The next table shows the same kind of sample in numerical form:

![](fixed.png)

## Methodology

### Data Wrangling

Luckily, this dataset needed little cleaning. I checked for NULL/NA values (there were none), and the data were in an easily usable format, with all columns of equal length. I decided to convert the string variables (e.g. `p` and `e`) into a numerical format. I did this using `LabelEncoder` from sklearn's `preprocessing` module, which mapped every letter-coded value in the dataset to an integer.
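A minimal sketch of the encoding step described above. The column names and the four rows are made up for illustration (they are not taken from the UCI file), and fitting one `LabelEncoder` per column is an assumption about how the encoder was applied.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the raw data: hypothetical column names and made-up rows,
# NOT the actual UCI mushroom file.
raw = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap_shape": ["x", "x", "b", "x"],
    "odor": ["p", "a", "l", "p"],
})

# Count NULL/NA cells across the whole frame before encoding.
n_missing = int(raw.isnull().sum().sum())
print("missing cells:", n_missing)

# Fit a separate LabelEncoder on each column. Each encoder maps that column's
# sorted unique letters to the integers 0..n_classes-1.
encoded = raw.apply(lambda col: LabelEncoder().fit_transform(col))
print(encoded)
```

One caveat worth checking on the real data: the UCI documentation lists missing `stalk-root` values coded as `?`, which a NULL/NA check would not flag and which `LabelEncoder` would treat as just another category. The integer codes are also arbitrary (alphabetical order), so models that read them as ordered quantities, such as a linear SVM, may behave differently than they would with one-hot encoding.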

### Model Selection

This is a binary classification problem (i.e. each mushroom is labelled as either poisonous or edible). There are a number of supervised machine learning models I could use for such a problem.

I decided to test three methods: a linear SVM, a decision tree with a maximum depth of 5, and a random forest ensemble classifier. I chose these to cover three different types of classifier, including an ensemble method, so that I could compare their accuracy. All 3 performed well. Because the random forest and the decision tree already performed so well, I continued the analysis with the linear SVM, the least accurate of the three, to see if I could improve its accuracy.
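The comparison above can be sketched roughly as follows. This is not the project's actual code: the data are synthetic (generated with `make_classification` to have 22 features), the 80/20 split and the random seeds are assumptions, and `SVC(kernel="linear")` is one of several ways to fit a linear SVM in sklearn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 22 features to mirror the attribute
# count, but NOT the mushroom dataset itself.
X, y = make_classification(
    n_samples=500, n_features=22, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear SVM": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Error is reported as 1 - accuracy on each split.
errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_err = 1 - model.score(X_train, y_train)
    test_err = 1 - model.score(X_test, y_test)
    errors[name] = (train_err, test_err)
    print(f"{name}: train error {train_err:.4f}, test error {test_err:.4f}")
```

The numbers printed here describe the synthetic data only and should not be compared with the results reported for the mushroom data.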

### Tuning and Feature Selection

For the first step of parameter tuning, I used sklearn's `GridSearchCV` to determine which values of C and gamma best fit the model. The grid search iterated over a range of values for both and selected a C value of 10 and a gamma value of $10^{-6}$.
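A hedged sketch of a grid search over C and gamma. The candidate values, the 5-fold cross-validation, and the synthetic data are all assumptions, since the report does not list the grid that was searched. Note that in sklearn's `SVC`, `gamma` is a coefficient for the RBF, polynomial, and sigmoid kernels, so with `kernel="linear"` every gamma value should score identically and only C actually changes the fit.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not the mushroom dataset.
X, y = make_classification(
    n_samples=300, n_features=22, n_informative=8, random_state=0
)

# Hypothetical grid; the values actually searched are not given in the report.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [1e-6, 1e-4, 1e-2],
}

# Every (C, gamma) pair is scored by 5-fold cross-validated accuracy, and the
# best pair is refit on the full data.
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 4))
```

If gamma is meant to matter, the same grid can be run with `SVC(kernel="rbf")` instead, at which point the model is no longer a linear SVM.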

Next, I used sklearn's recursive feature elimination class, `RFE`, to iterate over the features and select those most important to the model. The graph below shows the training and test error for different numbers of selected features.

![graph](features.png)
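A curve like the one in the graph could be produced along these lines. This is a sketch under stated assumptions: synthetic data, an 80/20 split, `C=10` carried over from the grid search, and an arbitrary set of feature counts to try. `RFE` needs an estimator that exposes coefficients or feature importances, which `SVC` does when `kernel="linear"`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not the mushroom dataset.
X, y = make_classification(
    n_samples=300, n_features=22, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = []
for n in range(2, 23, 4):  # try 2, 6, 10, 14, 18, and 22 features
    # RFE repeatedly fits the SVM and drops the feature with the smallest
    # absolute coefficient until only n features remain.
    selector = RFE(SVC(kernel="linear", C=10), n_features_to_select=n)
    selector.fit(X_train, y_train)
    # RFE.score reduces X to the selected features before scoring.
    train_err = 1 - selector.score(X_train, y_train)
    test_err = 1 - selector.score(X_test, y_test)
    results.append((n, train_err, test_err))
    print(f"{n:2d} features: train error {train_err:.4f}, test error {test_err:.4f}")
```

After fitting, `selector.support_` is a boolean mask over the columns, which is how the names of the retained features can be recovered from the original column list.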

## Results and Discussion

All 3 of the models that I fit to the data had very good results with their initial runs. 

| Classifier | Training Error | Test Error |
| --- | --- | --- |
| Linear SVM | 0.0377 | 0.04259 |
| Decision Tree | 0.0195 | 0.0219 |
| Random Forest | 0 | 0 |
| Linear SVM (post-tuning and feature selection) | 0.0266 | 0.0286 |

As stated earlier, I decided to pursue the linear SVM purely so that I could try to improve its accuracy through feature selection and parameter tuning. The grid search gave me the optimal values of C and gamma, and RFE gave me the 10 most relevant features: bruises, gill attachment, gill spacing, gill size, stalk shape, stalk root, stalk surface (above ring), veil colour, ring number, and ring type. This is logical: all are visible features of a mushroom, and the result perhaps supports the belief that it is possible to tell whether or not something is edible based on its appearance.

The methodology that I followed reduced both the training error (from 0.0377 to 0.0266) and the test error (from 0.04259 to 0.0286) of the linear SVM. Though it did not reach the accuracy of the random forest (whose zero error on both sets seems suspicious to begin with), the linear SVM did perform very well.

## Conclusion

My main conclusion from this analysis is that the relationship between the 22 attributes and whether a mushroom is poisonous or edible is so strong that fitting a machine learning model is very easy. All of the models performed very well. My exploration of parameter tuning and feature selection did improve the accuracy of my least accurate model (the linear SVM), which highlights the importance of both methods in model fitting. In general, hopefully this means that whether or not a mushroom is poisonous can be easily gleaned from looking at a few features!