# Open Coding Hour Machine Learning Workshop
By Matthew Smith and Alexandra Lukasiewicz

Thanks for joining us on Kaggle! This website hosts a wide variety of datasets and examples of machine learning in different programming languages. We've found it very useful in creating this tutorial.

### During this workshop we will build a binary classifier using the scikit-learn Python package

### We will be working with two datasets from the [UCI machine learning repository](https://archive.ics.uci.edu/ml/index.php)
* [Mushrooms](https://www.kaggle.com/uciml/mushroom-classification) (a dataset containing key mushroom identification features and their edibility) 
* [Wine](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009) (a dataset containing features of different red wines, such as acidity and sugar content, and their quality scores)

![](https://media.winefolly.com/Wine-pairing-portobello-mushroom-pinot-winefolly-1.jpg)
Image from "Wine and Grill Food Pairings Made For The Porch" by Phil Keelig, Wine Folly



# Introduction to machine learning 
### What are examples of questions you can ask using ML tools in scikit-learn?
ML can be useful in answering biological questions where the exact steps or conditions that generate an outcome are unknown. 
Examples include: 
* Thermodynamic models with a series of complex steps
* Metabolic engineering and rule-based systems
* Identifying patterns in systems (unsupervised clustering)
* Reducing heterogeneous datasets (dimensionality reduction)

### What are challenges in machine learning?
**Poor dataset**
* Training dataset has too few entries 
* Dataset is not representative of new cases it will encounter 
* Irrelevant features

**Poor algorithm**
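* Overfitting to the training dataset (the model memorizes noise in the training data and does poorly on new information)
* Underfitting the training dataset (the model is too simple, or uses too few informative features, to be accurate when presented with new information)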
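
As a rough illustration of overfitting, the sketch below fits two decision trees to a synthetic dataset with deliberately noisy labels. The dataset (from `make_classification`), the amount of label noise, and the depth limit of 3 are all assumptions chosen for the demonstration; none of this is the workshop data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data: flip_y=0.2 assigns a random class to about 20% of the samples (label noise)
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# an unrestricted tree keeps splitting until it has memorized the training set
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# a depth-limited tree is forced to learn only the broad patterns
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("deep", deep_tree), ("shallow", shallow_tree)]:
    print(name, "train:", model.score(X_train, y_train), "test:", model.score(X_test, y_test))
```

A large gap between the training and testing accuracy of the deep tree is the usual signature of overfitting.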

### Applying machine learning in your own research
**Generating hypotheses**
* Unsupervised clustering to observe patterns in data 
* Defining a clear goal or question to answer (Am I categorizing data? Am I predicting a value, such as binding strength or enzyme production?)

**Evaluating dataset (do I need more information?)** 
* Selecting a set of algorithms for your question
* Evaluating performance 

### To begin, we will import packages and datasets for the rest of the workshop 

In [None]:
#packages for dataframe manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#scikit-learn specific packages:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

In [None]:
#Import mushroom and wine datasets 
mushrooms = pd.read_csv("../input/mushroom-classification/mushrooms.csv")

wine = pd.read_csv("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")

### Quick modification of the Wine dataset
The wine dataset has a 'quality' column of numeric scores. To use this dataset for classification, we will convert each score to 'poor' or 'excellent' using an arbitrary cutoff: scores of 7 or higher are 'excellent', and anything lower is 'poor'.

'quality' will be our column of categorical variables, and 'quality_score' will hold the original numeric values. If you would like to use this dataset to generate a numeric predictor of quality instead, you can use several algorithmic approaches, including linear regression (with or without L1 or L2 regularization), SVMs, or even decision trees.

In [None]:
#convert the numeric quality score in wine to a categorical label
wine = wine.rename(columns = {"quality":'quality_score'})
#scores of 7 or higher are labelled 'excellent'; everything below the cutoff is 'poor'
wine['quality'] = np.where(wine['quality_score'] >= 7, 'excellent', 'poor')

# Let's take a look at our datasets

In [None]:
mushrooms.head()

In [None]:
wine.head()

### Working with non-numeric data
Each of the columns contains a different feature of a mushroom that will help us classify whether our sample is edible or poisonous. 
However, these are all in a non-numeric format, and not all ML algorithms support categorical variables. 
We can use the scikit-learn tool `LabelEncoder` to convert our dataset into a numeric format.

In [None]:
le = LabelEncoder()
for col in mushrooms.columns:
    mushrooms[col] = le.fit_transform(mushrooms[col])

mushrooms.head()

We can now see that the "class" column of (e)dible or (p)oisonous has been converted into a binary 0 or 1. In addition, all of the other letter categories have been converted into numbers.
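
To see exactly what the encoder does, here is a minimal sketch on a toy list of labels (made up for illustration, not taken from the mushroom file). It uses `classes_` and `inverse_transform` to show how the integer codes map back to the original letters.

```python
from sklearn.preprocessing import LabelEncoder

# toy column of edibility labels (made up for illustration)
labels = ["e", "p", "p", "e"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(encoded)                          # integer code for each entry
print(le.classes_)                      # classes_[i] is the label that was encoded as i
print(le.inverse_transform(encoded))    # recover the original letters
```

Note that an encoder only remembers the mapping from its most recent fit, so when one encoder is reused across many columns, only the last column's mapping can be recovered afterwards.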

### Why do we use dummy variables?

After converting our variables into numbers, we still have a problem. Most of our columns have more than two possible values. For example, look at “cap-surface” in the mushroom dataset - it takes several different values. If we treat the encoded labels as numerical values, we impose an artificial ordering that groups together categories that may have no relation. Taking three encoded values as an example, most machine learning algorithms will have an easy time separating cap-surface 1 and 2 from cap-surface 3, or separating cap-surface 2 and 3 from cap-surface 1, but they will have a much harder time (or even find it impossible) separating cap-surface 1 and 3 from cap-surface 2. As the number of categories in a column goes up, this problem gets worse.

To fix this, we can break each of our existing columns into multiple dummy columns, one per category. For the cap-surface N column, the value is 1 if the cap-surface is type N, and 0 otherwise. This lets your machine learning algorithm handle arbitrary relationships between the categories in a particular column of the original dataset. (The code below passes `drop_first=True`, which drops one dummy column per feature; that category is implied whenever all of the remaining dummy columns are 0.)
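
As a small sketch of what this looks like, the example below builds dummy columns for a toy "cap-surface" column with three made-up category codes. The toy DataFrame is an assumption for illustration only.

```python
import pandas as pd

# toy column with three made-up category codes
toy = pd.DataFrame({"cap-surface": [1, 3, 2, 3]})

# one indicator column per category
dummies = pd.get_dummies(toy, columns=["cap-surface"])
print(dummies)

# drop_first=True omits the first category; it is implied when the rest are all 0
dummies_dropped = pd.get_dummies(toy, columns=["cap-surface"], drop_first=True)
print(dummies_dropped)
```

Each row has exactly one "on" indicator in the full encoding, so no ordering between the categories is implied.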

In [None]:
mushrooms = pd.get_dummies(mushrooms,columns=mushrooms.columns,drop_first=True)
mushrooms.head()

Since the wine dataset gives numerical data, rather than categorical data, you don't need to do anything here.

## Splitting dataset into training and testing sets 
Splitting your dataset into training and testing sets is key to evaluating the performance of your algorithm later on (and to comparing multiple algorithms against one another).
* Training dataset - used to train the algorithm 
* Testing dataset - used to evaluate the accuracy of the algorithm on data it has not seen

### Is my dataset large enough to split and train? 
The key question here is whether your dataset has enough entries that represent all possibilities that the model may encounter when applied to some unknown set of features. 
If our training dataset represents overwhelmingly edible mushrooms and we encounter a poisonous one can we trust that the algorithm will accurately categorize this outcome?
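
One way to check and protect class balance is to count the classes and ask `train_test_split` to stratify on them. The sketch below uses a made-up, heavily imbalanced label vector (90 edible, 10 poisonous) and a placeholder feature column rather than the workshop data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up imbalanced labels: 90 edible (0) and 10 poisonous (1)
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)   # placeholder feature column

# how many examples of each class do we have?
print(np.bincount(y))

# stratify=y keeps the class proportions the same in the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(np.bincount(y_train), np.bincount(y_test))
```

Without `stratify`, a random split of a small, imbalanced dataset can leave very few examples of the rare class in one of the two sets.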

### What is a general percentage to aim for? 
How much of my dataset should be split into testing and training? There is no single right answer, but holding out roughly 20-30% of the data for testing is a common starting point; the code below holds out 20%.

### What do I do when my dataset is too small?
K-fold cross validation may be a way to avoid being too optimistic in your fit. We won't cover this in today's code-along, but if you need to know more about it in the future, [resources are available here](https://machinelearningmastery.com/k-fold-cross-validation/).
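
For reference, here is a minimal sketch of k-fold cross validation using `cross_val_score`. The small synthetic dataset and the choice of a decision tree with 5 folds are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# small synthetic dataset standing in for a dataset that is too small to split just once
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# cv=5: split into 5 folds, train on 4 and score on the held-out fold, 5 times
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)
print("mean accuracy:", scores.mean())
```

Every datapoint is used for testing exactly once, so the mean score is less dependent on one lucky or unlucky split.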





In [None]:
#first we split our dataset into the x input (mushroom features) and y response (edibility) variables 

mushrooms_x = mushrooms.drop('class_1', axis = 1)
mushrooms_y = mushrooms['class_1']

#Here we take 20% of the dataset to test with, and train with the leftover 80%
mushrooms_x_train, mushrooms_x_test, mushrooms_y_train, mushrooms_y_test = train_test_split(mushrooms_x, mushrooms_y, test_size=0.2, random_state=42)

Take some time here to view and split the wine dataset

In [None]:
#Your code here
#first we split our dataset into the x input (wine features) and y response (quality) variables



# Feature Extraction using PCA 

*What elements of our dataset account for the most variation?*

One way we can answer these questions is by performing a **Principal Component Analysis (PCA)**

PCA is an unsupervised dimensionality reduction method. It finds the directions (principal components) that account for the greatest variability in your dataset, allowing you to condense many features (genes, treatments, traits) into a set of 2 or 3 components to plot or feed into your algorithm (granted that your dataset can be condensed to these plot-able dimensions without losing too much information). PCA is a feature extraction technique: the components it produces are new features built from combinations of the original ones.

One thing to be aware of is that PCA will completely ignore the classifications of your data. It is unsupervised; it simply tries to capture the greatest variability in your dataset that it can with a particular number of dimensions. It will do this even if the direction with the greatest variability tells you nothing about the classifications you want to predict. Usually this is fine, as features with greater variability tend to be good for classification, but it’s not always the case.

PCA, in comparison to methods that will pick out a subset of your features for your classification, brings you an unusual trade-off. To understand this trade-off, take a look at this PCA diagram.


![](http://upload.wikimedia.org/wikipedia/commons/f/f5/GaussianScatterPCA.svg)

The X and Y axes are our original features. The bold black arrows are the dimensions PCA might select on this data. Notice that they are running on diagonals; they are offset from the original dimensions. This is normal for PCA - PCA will tend to mish-mash the features of your original dataset together.

A benefit of this is that typically all of your features are accounted for to some degree. However, this comes at a cost: when you use PCA, it can make your results difficult to interpret - you won’t generally know which features were most important for your classification if you run a classifier on the PCA-ed data.

It is also important to remember that PCA finds the greatest variance. This means units are important, and can affect the result. If you pick a smaller unit for a dimension (maybe you use grams instead of kilograms), then all of your values in that dimension become much farther apart. PCA will then likely try harder to capture that dimension, even if that’s not what it should be doing - so you should be careful that all of your features are in roughly the same range of numerical values.
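
The effect of units can be seen in a small sketch. The two features below (a mass in grams and a length in meters) are made up and generated independently, so neither one should be more "important" than the other; the names and numbers are assumptions for the demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# two independent made-up features: a mass recorded in grams and a length in meters
mass_g = rng.normal(loc=5000, scale=1000, size=500)
length_m = rng.normal(loc=1.5, scale=0.3, size=500)
X = np.column_stack([mass_g, length_m])

# without scaling, the feature with the largest numbers dominates the first component
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
# after standardizing, both features contribute on an equal footing
scaled_ratio = PCA(n_components=2).fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print("unscaled:", raw_ratio)
print("scaled:  ", scaled_ratio)
```

On the unscaled data the first component explains almost all of the variance simply because grams produce large numbers; after standardizing, the variance is shared roughly evenly.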



Here we will break down our mushroom dataset into two principal components 


In [None]:
# to start we need to scale our numeric dataset 
# so as not to overinflate the influence of a single feature in a different unit 
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

mushrooms_x_train_scl = sc.fit_transform(mushrooms_x_train)
mushrooms_x_test_scl = sc.transform(mushrooms_x_test)

In [None]:
# Now do this for the wines too
# YOUR CODE HERE


In [None]:
pd.DataFrame(mushrooms_x_train_scl)

In [None]:
from sklearn.decomposition import PCA
mushroom_pca = PCA(n_components=2) #create a PCA object that keeps two components

#fit PCA on the scaled training dataset only, and transform it in the same step
mushrooms_x_train_pca = mushroom_pca.fit_transform(mushrooms_x_train_scl)
mushrooms_x_test_pca = mushroom_pca.transform(mushrooms_x_test_scl) #apply the same PCA transformation to the scaled test dataset 

In [None]:
print("original shape:   ", mushrooms_x_train.shape)
print("transformed shape:", mushrooms_x_train_pca.shape) #what are the effects of dimensionality reduction for this dataset?

# Don't worry too much about the following code - we mostly included it in order to give you a useful visualization for understanding the dataset.
scatter = plt.scatter(mushrooms_x_train_pca[:, 0], mushrooms_x_train_pca[:, 1], alpha=0.8, c = mushrooms_y_train)
plt.axis('equal');
handles, labels = scatter.legend_elements(prop="colors", alpha=0.6)
legend = plt.legend(handles, labels, loc="upper right", title="edibility")

In [None]:
#what if we don't know the number of components that make up the variability in our dataset?
mushrooms_pca_n = PCA()
mushrooms_pca_n = mushrooms_pca_n.fit(mushrooms_x_train_scl)
mushrooms_variance = mushrooms_pca_n.explained_variance_ratio_[0:7]
mushrooms_df = pd.DataFrame({'var':mushrooms_variance,
             'PC':['PC1','PC2','PC3','PC4','PC5', 'PC6','PC7']})
scree = plt.bar(mushrooms_df["PC"],mushrooms_df['var'])

# Perform PCA on your dataset
Take some time now to perform PCA on your wine training dataset.

How many components account for the highest variability in your data? Prepare a scree plot that shows this. When you've decided how many dimensions to take, use PCA to extract those dimensions in your wine dataset.

In [None]:
# YOUR CODE HERE
# Here we set up a scree plot

Once you've chosen how many dimensions you want to use, run PCA on your dataset.

In [None]:
# YOUR CODE HERE

# Precision Recall Curves

For a precision-recall curve (sometimes called just a “PR-curve”), we plot recall on the horizontal axis, and precision on the vertical axis. 
To build this plot, we are going to move around a threshold. Most classifiers output a score or probability for the positive class, and the threshold tells us how confident we must be before we label a datapoint as positive; anything below the threshold is labelled negative. Recall indicates what percentage of all the truly positive points we labelled as positive. Precision indicates, of the points we labelled as positive, what percentage really are positive. Typically, there will be some points whose classification you are very certain of; with a strict threshold, your recall is low but your precision is very high. This is the top left of the chart. As you lower the threshold and label more points as positive, recall goes up and precision tends to go down, so the line moves from the top left towards the bottom right.

Here's a diagram of what precision and recall are, shamelessly copied from Wikipedia

![](https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg)

We will have plenty of examples of these Precision Recall curves below.
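
To make the threshold idea concrete, here is a minimal sketch on eight made-up datapoints. The labels, the scores, and the helper `precision_recall_at` are all hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# made-up true labels and classifier scores for eight datapoints
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.6, 0.7])

def precision_recall_at(threshold):
    """Label points with score >= threshold as positive, then score the labels."""
    y_pred = (scores >= threshold).astype(int)
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)

for threshold in [0.75, 0.5, 0.3]:
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Each threshold gives one point on the curve: the strictest threshold has perfect precision but finds only half of the positives, while the loosest finds all of them at the cost of some false alarms. scikit-learn's `precision_recall_curve` sweeps over all thresholds for you.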


# Running different classification algorithms 
There is no single classifier that will always perform best on your dataset. Because of this we run multiple algorithms on our training dataset and evaluate their predictive scores against one another. 
In this tutorial we will use:
1. Logistic Regression
1. Support Vector Machine 
1. K-Nearest Neighbors
1. Decision Tree
1. Random Forest 
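
One possible way to compare these five classifiers side by side is sketched below, using cross-validated accuracy on a synthetic stand-in dataset. The dataset and the hyperparameters are assumptions for the example; the workshop code that follows fits each model separately on the mushroom data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in dataset (not the workshop data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# mean 5-fold cross-validated accuracy for each model
results = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```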

# Logistic Regression

This algorithm fits a “logistic function” to a dataset of true and false values. Here’s an example of a logistic function, shamelessly copied from Wikipedia. The black dots are the training points - they're either 0, for fail, or 1, for pass. Despite this, we can fit a curve that outputs a predicted probability. Notice how it doesn’t go higher than 1 or lower than 0 - the fit done with logistic regression will always give a fit like this.

![](https://upload.wikimedia.org/wikipedia/commons/6/6d/Exam_pass_logistic_curve.jpeg)

* When we fit this function, it doesn’t actually give us a classification - it gives a probability that a point belongs to one class or the other. We have to choose a cut-off (scikit-learn's `predict` uses 0.5 for a two-class problem).

* One thing to note about logistic regression is that the decision boundary produced by any cut-off we choose will always be linear: a straight line in two dimensions, and a flat plane (a hyperplane) in higher dimensional spaces. We'll generate an example of this below.
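
Both points can be checked on a toy problem. The sketch below uses two made-up clusters from `make_blobs` (an assumption for illustration) to show that `predict` matches a 0.5 cut-off on the predicted probability, and prints the equation of the resulting straight-line boundary.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# two made-up clusters of 2D points, one per class
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
model = LogisticRegression().fit(X, y)

# predict_proba returns one column per class; column 1 is P(class 1)
proba = model.predict_proba(X)[:, 1]
# predict() is the same as applying a 0.5 cut-off to that probability
manual_labels = (proba > 0.5).astype(int)
print((manual_labels == model.predict(X)).all())

# the boundary is the straight line w0*x0 + w1*x1 + b = 0
(w0, w1), b = model.coef_[0], model.intercept_[0]
print(f"boundary: {w0:.2f}*x0 + {w1:.2f}*x1 + {b:.2f} = 0")
```

Choosing a different cut-off shifts this line without bending it, which is why logistic regression struggles when the classes are not linearly separable.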




In [None]:
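# to begin, scikit-learn has several functions to help with evaluating the accuracy of our model
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay

# plot_precision_recall_curve was removed in scikit-learn 1.2; this small wrapper keeps the calls below working
def plot_precision_recall_curve(estimator, X, y, **kwargs):
    return PrecisionRecallDisplay.from_estimator(estimator, X, y, **kwargs)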

In [None]:
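# we will also be using the following function to plot our data. It's for demonstration purposes only - I wouldn't worry about how it works.
# note: it reads the global variable `classifier`, so fit a model under that name before calling it.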
def plot_predictions(model_name):
    from matplotlib.colors import ListedColormap
    X_set, y_set = mushrooms_x_train_pca, mushrooms_y_train
    
    # plot decision boundary
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.6, cmap = ListedColormap(('firebrick', 'royalblue')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
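    for i, j in enumerate(np.unique(y_set)):
        # color= (rather than c=) avoids matplotlib's warning about a single RGBA value being passed as c
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], alpha = 0.4,
                    color = ListedColormap(('firebrick', 'royalblue'))(i), label = j)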
    plt.title("%s Training Set" %(model_name))
    plt.xlabel('PC 1')
    plt.ylabel('PC 2')
    plt.legend()    


In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

classifier.fit(mushrooms_x_train_pca,mushrooms_y_train)
print('Training accuracy Score: {0:.4f}\n'.format(accuracy_score(mushrooms_y_train,classifier.predict(mushrooms_x_train_pca))))
print('Testing accuracy Score: {0:.4f}\n'.format(accuracy_score(mushrooms_y_test,classifier.predict(mushrooms_x_test_pca))))

In [None]:
plot_precision_recall_curve(classifier, mushrooms_x_test_pca, mushrooms_y_test)

In [None]:
#How well did logistic regression predict edibility for our mushrooms?
plot_predictions('Logistic Regression')

Now try and fit a logistic regression classifier to the wine dataset, and analyze the results

In [None]:
# YOUR CODE HERE


# Support Vector Machines (SVM)

* A support vector machine skips the probability fit used by logistic regression, and simply tries to stick a decision boundary into your dataset at the best possible place: the one that leaves the widest margin between the two classes. With a linear kernel, it will often give very similar results to logistic regression.
* However, most SVM implementations provide you with something called the “kernel trick.” It lets the SVM behave as if more dimensions had been added to your dataset, and those dimensions can be used to find better (nonlinear) decision boundaries. 

Here’s what that can look like:

![](https://miro.medium.com/max/2400/1*gXvhD4IomaC9Jb37tzDUVg.png)
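
The same effect can be reproduced in a few lines. This sketch uses a made-up "ring inside a ring" dataset from `make_circles` (an assumption for illustration) and compares a linear kernel with the RBF kernel.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# made-up data: one class forms a ring around the other, so no straight line separates them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# same algorithm, two kernels: a straight-line boundary versus the RBF kernel
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("linear kernel test accuracy:", linear_acc)
print("rbf kernel test accuracy:   ", rbf_acc)
```

The linear kernel cannot do much better than guessing on this data, while the RBF kernel can draw a closed boundary around the inner ring.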


In [None]:
from sklearn.svm import SVC
# kernel 'rbf' is the kernel function the classifier is using - feel free to try others, but this one works just fine.
classifier = SVC(kernel='rbf',random_state=42)

classifier.fit(mushrooms_x_train_pca,mushrooms_y_train)
print('Training accuracy Score: {0:.4f}\n'.format(accuracy_score(mushrooms_y_train,classifier.predict(mushrooms_x_train_pca))))
print('Testing accuracy Score: {0:.4f}\n'.format(accuracy_score(mushrooms_y_test,classifier.predict(mushrooms_x_test_pca))))
plot_precision_recall_curve(classifier, mushrooms_x_test_pca, mushrooms_y_test)

In [None]:
plot_predictions("SVM")

Now try and run SVM on the wine dataset, and analyze the results.

In [None]:
# YOUR CODE HERE


# K Nearest Neighbors

The idea behind K Nearest Neighbors is to classify each datapoint according to the training points closest to it, given their "distance" from one another. For each test point, look at the k nearest training points - these are its “neighbors.” We will then have those neighbors “vote” on how to classify the test point.

This method is really good at picking up bizarre decision boundaries that are difficult to capture with other methods.

This method can do poorly in areas where there are points coming from both classifications. For example, if a region of your feature space has a 60% chance of being a positive case, you probably want to mark this as positive, but 40% of your training points in that region will be negative, and it is very possible to end up near a cluster of negative neighbors and misclassify your test point.
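
Here is a minimal sketch of the voting on a made-up one-dimensional dataset. The test point sits right next to a single class-1 training point, but most of its nearby neighbors are class 0, so the answer depends on k.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# made-up 1D training points and their classes
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1, 1])
test_point = np.array([[2.6]])

knn3 = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
# indices of the 3 closest training points, nearest first
_, neighbor_idx = knn3.kneighbors(test_point)
print("neighbors:", X_train[neighbor_idx[0]].ravel(), "votes:", y_train[neighbor_idx[0]])
print("k=3 prediction:", knn3.predict(test_point)[0])

# with k=1 only the single closest point gets a vote
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("k=1 prediction:", knn1.predict(test_point)[0])
```

With k=3 the two class-0 neighbors outvote the single class-1 neighbor; with k=1 the closest point decides alone. Small values of k follow local quirks of the training data, while larger values smooth them out.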


In [None]:
from sklearn.neighbors import KNeighborsClassifier as KNN
classifier = KNN()  # by default 5 neighbors are used - feel free to mess around with this and see what happens.

classifier.fit(mushrooms_x_train_pca,mushrooms_y_train)
print('Training accuracy Score: {0:.4f}\n'.format(accuracy_score(mushrooms_y_train,classifier.predict(mushrooms_x_train_pca))))
print('Testing accuracy Score: {0:.4f}\n'.format(accuracy_score(mushrooms_y_test,classifier.predict(mushrooms_x_test_pca))))
plot_precision_recall_curve(classifier, mushrooms_x_test_pca, mushrooms_y_test)

In [None]:
plot_predictions('KNN')

Now try and run K Nearest Neighbors on the wine dataset, and analyze the results.

In [None]:
# YOUR CODE HERE


# Decision Trees

A decision tree makes a tree of “decisions” (yes/no questions about individual features, such as whether a value is above some threshold), where each successive decision refines the prediction.

It deals with nonlinear situations much better than the logistic regression fit, but it can also give some really odd results. Take a look at the two-dimensional decision diagram below to see how this can go.
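
To see the individual decisions, the sketch below fits a shallow tree to made-up 2D data and prints its rules with `export_text`. The data, the feature names `x0` and `x1`, and the depth limit are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# made-up 2D data: class 1 only when both features are above 0.5
X = rng.uniform(0, 1, size=(200, 2))
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)

# limit the tree to two levels of decisions so the rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# print the learned decisions as nested if/else rules
rules = export_text(tree, feature_names=["x0", "x1"])
print(rules)
print("training accuracy:", tree.score(X, y))
```

Every decision compares one feature against a threshold, so in two dimensions the boundary is always built from horizontal and vertical segments. That is where the blocky, sometimes odd-looking regions in the decision diagram come from.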



In [None]:
from sklearn.tree import DecisionTreeClassifier as DT

classifier = DT(criterion='entropy',random_state=42) # default is gini, that's probably fine too - feel free to try it.

classifier.fit(mushrooms_x_train_pca,mushrooms_y_train)
print('Training accuracy Score: {0:.4f}\n'.format(accuracy_score(mushrooms_y_train,classifier.predict(mushrooms_x_train_pca))))
print('Testing accuracy Score: {0:.4f}\n'.format(accuracy_score(mushrooms_y_test,classifier.predict(mushrooms_x_test_pca))))
plot_precision_recall_curve(classifier, mushrooms_x_test_pca, mushrooms_y_test)

In [None]:
plot_predictions('Decision Tree')

Now it's your turn - try using a decision tree on the wines dataset.

In [None]:
# YOUR CODE HERE


# Random Forest

For a random forest, we train a *bunch* of decision trees, each on a different random subset of the data (and considering a random subset of the features at each split). Then we average their results. 

This usually gives us a much stronger classifier than any single decision tree can produce, and mitigates many of the negative effects of decision trees, such as their tendency to overfit.
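
The averaging can be checked directly. This sketch trains a forest on a noisy made-up dataset from `make_moons` (an assumption for illustration) and confirms that the forest's predicted probabilities are the mean of its individual trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# noisy made-up 2D dataset
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# the forest's class probabilities are the average of its individual trees' probabilities
tree_probas = np.mean([t.predict_proba(X_test) for t in forest.estimators_], axis=0)
print(np.allclose(tree_probas, forest.predict_proba(X_test)))

# compare one member tree against the whole forest on held-out data
single_tree_acc = (forest.estimators_[0].predict(X_test) == y_test).mean()
print("one tree:", single_tree_acc, "forest:", forest.score(X_test, y_test))
```

On noisy data like this, the forest typically scores at least as well as any one of its trees, because the errors of individual trees tend to cancel out in the average.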


In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 50, criterion = 'entropy', random_state = 42)

classifier.fit(mushrooms_x_train_pca,mushrooms_y_train)
print('Training accuracy Score: {0:.4f}\n'.format(accuracy_score(mushrooms_y_train,classifier.predict(mushrooms_x_train_pca))))
print('Testing accuracy Score: {0:.4f}\n'.format(accuracy_score(mushrooms_y_test,classifier.predict(mushrooms_x_test_pca))))
plot_precision_recall_curve(classifier, mushrooms_x_test_pca, mushrooms_y_test)

In [None]:
plot_predictions('Random Forest')

Now you try!

In [None]:
# YOUR CODE HERE


# Experiment and ask questions!

Feel free to try other things out that you may not have had a chance for earlier. You can also ask us any questions and we'll do our best to answer.

### References
* https://www.kaggle.com/raghuchaudhary/mushroom-classification
* Aurélien Géron (2019) "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition". *O'Reilly Media, Inc.*  https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/
* https://towardsdatascience.com/tidying-up-with-pca-an-introduction-to-principal-components-analysis-f876599af383

