# Introduction to Machine Learning

In this example, we'll tackle the classic iris plant classification problem using Scikit-learn. Think of this as your 'Hello World!' to machine learning. :D

The Iris dataset was used in Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems. The data set consists of 50 samples from each of three species of Iris. Each iris plant listed in the dataset has four different features or attributes:

- Sepal Length
- Sepal Width
- Petal Length
- Petal Width

Our task is to classify the iris plants into 3 species:
- Iris Setosa
- Iris Versicolor
- Iris Virginica 

## 1. Importing libraries
We'll start by importing some important libraries for data analysis and scientific computing:

In [537]:
import pandas as pd
import numpy as np

[Pandas](http://pandas.pydata.org/) provides easy-to-use data structures and data analysis tools for the Python programming language and allows for high-level data manipulation. Meanwhile [NumPy](http://www.numpy.org/), which stands for Numerical Python, is a scientific computing package.

Next, we'll import important tools and models from the Scikit-learn library.

In [538]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

In these four lines, we import the four machine learning algorithms we'll be using in this example. In future sessions, we'll get to learn how some of these algorithms work, but for now, you can view these models as 'black boxes' which accept some input (features/attributes) and produce some output (predictions), without any knowledge of their internal workings.
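To make the 'black box' idea concrete, here is a minimal sketch of the interface that all four models share: `fit` takes features and known labels, and `predict` takes features and returns predicted labels. It assumes scikit-learn's bundled copy of the Iris data (`load_iris`) instead of the CSV file, purely so the snippet runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file.
features, labels = load_iris(return_X_y=True)

model = DecisionTreeClassifier(random_state=0)
model.fit(features, labels)                 # input: features + known labels
predictions = model.predict(features[:3])   # input: features only, output: predicted labels
print(predictions)
```

Swapping in `GaussianNB`, `MLPClassifier`, or `SVC` only changes the line that constructs the model; the `fit` and `predict` calls stay the same.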

<b>FAQ</b>: How do you know which algorithm to choose and which will work well for your problem?

<b>Answer</b>: Generally, there are a lot of algorithms to choose from. [See this guide for a tour of the different ML algorithms](https://www.quora.com/How-do-you-choose-a-machine-learning-algorithm). 

To choose an appropriate algorithm, you first need to understand and categorize the problem you are trying to solve. [See this post for a detailed explanation](https://www.quora.com/How-do-you-choose-a-machine-learning-algorithm). 
In short, here is what you need to do to choose the right algorithm:
1. Categorize your problem (supervised, unsupervised, reinforcement, classification, regression, sequential/temporal, etc.)
2. Find available algorithms applicable to your problem (e.g. SVM, ANN work for classification problems; CRF, HMM for sequential data, etc.)
3. Implement them by setting up a machine learning pipeline that compares the performances of the different algorithms
4. Optimize hyperparameters (optional)
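Step 3 above can be sketched roughly as follows. This is only one possible setup, not a definitive pipeline: it assumes scikit-learn's bundled Iris data (`load_iris`), an 80/20 split, and a `max_iter=2000` setting for the neural network that was picked here just to give it room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled Iris data and an 80/20 split, for illustration only.
features, labels = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, train_size=0.8, random_state=7
)

# Candidate algorithms applicable to a classification problem.
candidates = {
    "Decision tree": DecisionTreeClassifier(random_state=7),
    "Naive Bayes": GaussianNB(),
    "Neural network": MLPClassifier(max_iter=2000, random_state=7),
    "SVM": SVC(),
}

# Train every candidate on the same training set and score it on the same test set.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the candidates from best to worst test accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```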

In fact, you can look at this [scikit-learn cheat sheet](https://i.stack.imgur.com/BZJiN.png) for a very rough guide to choosing an algorithm for your problem (just don't limit the models you use to the algorithms in the cheat sheet).

In [539]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

These two lines import tools for automatically splitting the dataset and computing the accuracy of the resulting predictions, respectively. We'll get to see these two tools in action later.

## 2. Loading and inspecting the Dataset

In [540]:
filename = '../datasets/iris.csv' 
dataframe = pd.read_csv(filename, header=0) 

Alright. So here, we've loaded the dataset. This piece of code locates the `iris.csv` file in the `datasets` folder. We set `header=0` to specify that the first row (row 0 in Python) contains the headers, i.e. the names of each column (e.g. Sepal Length, Sepal Width, ... , Species). Some datasets might not contain headers, so be careful with this!
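Here is a small sketch of why the `header` setting matters. The two tiny CSV strings are made up for illustration (they are not the real `iris.csv`), and `io.StringIO` is used only so that no file is needed.

```python
import io

import pandas as pd

# Hypothetical two-row CSVs, one with a header row and one without.
csv_with_header = "SepalLengthCm,SepalWidthCm\n5.1,3.5\n4.9,3.0\n"
csv_without_header = "5.1,3.5\n4.9,3.0\n"

# The first row holds the column names, so header=0 is correct.
with_header = pd.read_csv(io.StringIO(csv_with_header), header=0)

# No header row: say so with header=None and supply the names yourself.
no_header = pd.read_csv(
    io.StringIO(csv_without_header),
    header=None,
    names=["SepalLengthCm", "SepalWidthCm"],
)

# Mistake: header=0 on a file without headers turns the first data row into column names.
wrong = pd.read_csv(io.StringIO(csv_without_header), header=0)

print(with_header.shape, no_header.shape, wrong.shape)
print(list(wrong.columns))
```

In the last case, one row of data is silently lost, which is why it pays to check the shape and the first few rows right after loading.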

Afterwards, we store the dataset in the variable `dataframe` as a Pandas data structure called the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). You can think of a DataFrame as a two-dimensional array or matrix containing our data. [You can also look at this cheat sheet for a visualization]().

Now, let's inspect our data. It's always a good idea to know what your data looks like. 

In [541]:
print(dataframe.shape)

(150, 5)


Here, we check the dimensions of the data. There are:
- 150 rows (50 samples for each of the 3 species)
- 5 columns (Sepal Length, Sepal Width, Petal Length, Petal Width, and Species).

In [542]:
print(dataframe.head(5))

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0            5.1           3.5            1.4           0.2  Iris-setosa
1            4.9           3.0            1.4           0.2  Iris-setosa
2            4.7           3.2            1.3           0.2  Iris-setosa
3            4.6           3.1            1.5           0.2  Iris-setosa
4            5.0           3.6            1.4           0.2  Iris-setosa


This prints the first 5 rows or instances of your data. 
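Two more quick checks are often useful at this stage: counting the samples per species and counting missing values per column. The sketch below assumes scikit-learn's bundled Iris data, rebuilt into a DataFrame with the same column names as above so that it runs without the CSV file. Note that the bundled species names (`setosa`, etc.) are spelled differently from the `Iris-setosa` labels in the CSV.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: rebuild a DataFrame shaped like the CSV from scikit-learn's bundled copy.
iris = load_iris()
columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
dataframe = pd.DataFrame(iris.data, columns=columns)
dataframe["Species"] = iris.target_names[iris.target]

# How many samples does each species have?
species_counts = dataframe["Species"].value_counts()
print(species_counts)

# Are there any missing values in any column?
missing_per_column = dataframe.isnull().sum()
print(missing_per_column)
```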

## 3. Extracting Features and Labels
Next, we need to extract the features/attributes as well as our class labels.

In [543]:
X = dataframe.iloc[0:150, 0:4] 
y = dataframe.iloc[0:150, 4] 

Remember that we have four features (Sepal Length, Sepal Width, Petal Length, and Petal Width) in the first four columns and the class label ('Species') in the fifth (last) column.

<b>FAQ</b>: What's `iloc`?

<b>Answer</b>: We use `iloc` (which stands for '<b>i</b>nteger <b>loc</b>ation') to select certain parts of the data based on their position or index.

<b>Recap on Python indexing</b>: Remember that in Python, indexing starts at zero (0) and ends at n-1, where n is the number of elements. Thus, for the Iris dataset containing 150 rows, the first row is considered to be at index 0 and the last row is at index 149. Similarly, for the columns, we have 5 columns with indices 0, 1, 2, 3, and 4.

<b>Recap on Python Slicing</b>: Remember that in Python, when slicing a list or matrix, we need to specify the starting index and the ending index in square brackets separated by a colon (e.g. `[starting_index : ending_index]`). Python, however, returns the elements from the starting index up until the ending index - 1! Yeah, Python's a little eff'd up like that.

Also, if you don't specify the starting index, it automatically starts at the very beginning, which is at index `0` (e.g. `[:5]` is the same as `[0:5]`). Likewise, if you don't specify the ending index, it gets everything from the starting index up to the very end (e.g. if you have 4 elements, then `[1:]` is the same as `[1:4]`). Meaning, if you have something like `[:]`, this basically gets ALL the elements. Capisce?
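The slicing rules above can be tried out on a plain Python list. The list `letters` is just a made-up example with 4 elements.

```python
letters = ["a", "b", "c", "d"]  # indices 0, 1, 2, 3

first_two = letters[0:2]     # indices 0 and 1; index 2 is excluded
same_first_two = letters[:2]  # omitted start means "from index 0"
from_second = letters[1:]    # omitted end means "up to the very end"
same_from_second = letters[1:4]
everything = letters[:]      # a copy of all the elements

print(first_two, from_second, everything)
```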

[For more information on basic Python indexing and slicing for lists, see this very helpful tutorial!](https://www.tutorialspoint.com/python/python_lists.htm)

<b>Slicing for DataFrames</b>

In the first line, 


In [544]:
X = dataframe.iloc[0:150, 0:4] #or dataframe.iloc[:, :4]

we create a variable `X` containing a matrix of the features of all the rows. That is, we slice the DataFrame using two slices separated by a comma: the first slice is for the <b>rows</b>, the second for the <b>columns</b>. In other words, `0:150` indicates that we want to get all the rows from index 0 until index 149. That's basically all the rows in our dataset! Then, `0:4` indicates we want to get all the columns from index 0 up to index 3 (excluding index 4). Thus, `dataframe.iloc[0:150, 0:4]` is the same as `dataframe.iloc[:, :4]`.

![](images/slice1.png)

In the second line, 

In [545]:
y = dataframe.iloc[0:150, 4] #or dataframe.iloc[:, 4]

we create a variable `y` containing the class labels of all the rows (a pandas Series, which behaves like a one-dimensional array). Since the labels are located in the column at index 4 (the fifth column), we slice the DataFrame such that we get all the rows of that column.
![](images/slice2.png)

Lastly, note that we can also access column values by their header names (labels) using `loc` instead of `iloc`.
For more info, [try reading up on it here to get a better understanding](https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position).
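As a rough side-by-side of `iloc` and `loc`, the sketch below rebuilds the five rows printed earlier by hand (so it runs without the CSV file) and selects the same features and labels both ways. One difference to keep in mind: unlike position slices, label slices with `loc` include the ending label.

```python
import pandas as pd

# The five rows shown by dataframe.head(5), typed in by hand for illustration.
dataframe = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.6],
    "PetalLengthCm": [1.4, 1.4, 1.3, 1.5, 1.4],
    "PetalWidthCm": [0.2, 0.2, 0.2, 0.2, 0.2],
    "Species": ["Iris-setosa"] * 5,
})

# Selection by position: columns 0 to 3, then column 4.
X_by_position = dataframe.iloc[:, :4]
y_by_position = dataframe.iloc[:, 4]

# Selection by label: the slice runs up to AND including "PetalWidthCm".
X_by_label = dataframe.loc[:, "SepalLengthCm":"PetalWidthCm"]
y_by_label = dataframe.loc[:, "Species"]

print(X_by_position.equals(X_by_label), y_by_position.equals(y_by_label))
```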

## 4. Splitting the dataset
Some models may perform better than others on a specific dataset. By 'perform', we are referring to how well a model can make predictions on data that it has not seen before. But how can we say that one model 'performs' better than another?

A general practice is to split your data into a training and test set. You train/tune your model with your training set and test how well it generalizes to data it has never seen before with your test set. (Source: [Why split into training and test set?](https://www.quora.com/What-is-a-training-data-set-test-data-set-in-machine-learning-What-are-the-rules-for-selecting-them))
In essence, rather than using the entire dataset to train your model to learn your data, you <i>hold back</i> a part of your dataset with the goal of eventually coming up with some quantitative measure, usually a measure of accuracy or error, of how well your model performs. You can then use this quantitative measurement to compare the performances of different models. 

For example, you wouldn't want to use a model with only 50% accuracy. With three equally sized classes, random guessing already gets about 33% right, so that's not much of an improvement! We all want the best model, right?
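To make 'accuracy' concrete: it is the fraction of predictions that match the true labels. The labels below are hypothetical, made up just to show the arithmetic (3 correct out of 4 is 0.75).

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels, for illustration only.
true_labels = ["setosa", "versicolor", "virginica", "virginica"]
predicted_labels = ["setosa", "versicolor", "versicolor", "virginica"]

# 3 of the 4 predictions match, so the accuracy is 3 / 4.
accuracy = accuracy_score(true_labels, predicted_labels)
print(accuracy)
```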

In [546]:
train_size = 0.8
seed = 7 
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, random_state=seed)
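It can help to check what the split actually produced. The sketch below assumes scikit-learn's bundled Iris data (`load_iris`) so that it runs on its own. With `train_size=0.8` on 150 rows, we should get 120 training rows and 30 test rows, and fixing `random_state` should give the same split every time the code is run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data stands in for the X and y extracted from the CSV.
features, labels = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, train_size=0.8, random_state=7
)
print(X_train.shape, X_test.shape)

# Splitting again with the same seed reproduces exactly the same training rows.
X_train_again, _, _, _ = train_test_split(
    features, labels, train_size=0.8, random_state=7
)
same_split = bool((X_train == X_train_again).all())
print(same_split)
```

A fixed seed makes results repeatable, which matters when comparing models: every model is then trained and tested on exactly the same rows.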