# What is machine learning?

Machine learning is a subfield of artificial intelligence; in applied settings it is often also called predictive analytics or predictive modeling. Its goal is to build new algorithms, or apply existing ones, that learn from data. The resulting models should generalize: they should give accurate predictions, or find patterns, on new and previously unseen data that resembles the data they were trained on.

# Three Types of Machine Learning Tasks

1. Supervised learning
    - with discrete labels, it is a classification task
        - Classification is identifying group membership
            - Example: email spam filtering; good emails vs spam emails
    - with continuous labels, it is a regression task
        - Regression involves estimating or predicting a response
            - Example: relationship between SAT scores and studying time
2. Unsupervised learning
    - discovering structures in unlabeled data
3. Reinforcement learning
    - Example: chess engine
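
The regression and unsupervised cases in the list above can be sketched with scikit-learn as follows. This is a minimal illustration rather than a recommended workflow: the study-hours and score numbers are made up for the example, and `LinearRegression` and `KMeans` are simply one convenient choice of estimator for each task.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Regression: hypothetical toy data (hours studied -> test score), invented for illustration
hours = np.array([[1.0], [2.0], [3.0], [4.0]])
scores = np.array([1050.0, 1100.0, 1150.0, 1200.0])
reg = LinearRegression().fit(hours, scores)
predicted_score = reg.predict([[5.0]])[0]  # the output is a continuous number

# Unsupervised learning: no labels are given; KMeans groups points by proximity
points = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
cluster_ids = km.labels_  # one cluster id per point; the id values themselves are arbitrary

print(predicted_score)
print(cluster_ids)
```

Note that the regression model is given the answers (`scores`) during fitting, while the clustering model only ever sees `points`.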

### Example 1: Classifying a fruit as apple or orange based on weight and surface texture

In [1]:
import sklearn

features = [[140, 'smooth'], [130, 'smooth'], [150, 'bumpy'], [170, 'bumpy']]
labels = ['apple','apple','orange','orange']

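scikit-learn estimators require numeric feature values, so the strings have to be encoded as numbers. Classifiers do accept string labels, but the labels are encoded here as well for consistency:
    - make "smooth" = 1 and "bumpy" = 0
    - make "apple" = 0 and "orange" = 1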
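
Rather than retyping the data by hand, the encoding can be applied programmatically. A minimal sketch in plain Python; the dictionary names `texture_codes` and `label_codes` are hypothetical helpers introduced only for this example:

```python
# Raw data from the first cell
features = [[140, 'smooth'], [130, 'smooth'], [150, 'bumpy'], [170, 'bumpy']]
labels = ['apple', 'apple', 'orange', 'orange']

# Hypothetical lookup tables implementing the encoding described above
texture_codes = {'smooth': 1, 'bumpy': 0}
label_codes = {'apple': 0, 'orange': 1}

features_numeric = [[weight, texture_codes[texture]] for weight, texture in features]
labels_numeric = [label_codes[label] for label in labels]

print(features_numeric)  # [[140, 1], [130, 1], [150, 0], [170, 0]]
print(labels_numeric)    # [0, 0, 1, 1]
```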

In [2]:
from sklearn import tree

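# each row is [weight, texture], with texture encoded as smooth=1, bumpy=0
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
# apple=0, orange=1
labels = [0, 0, 1, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
# predict the class of a fruit with weight 150 and a bumpy surface
print(clf.predict([[150,0]]))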

[1]


### Example 2: Classifying Iris Data Set

"data" = the feature columns<br>
"target" = the label or output column<br>
"setosa" -> 0<br>
"versicolor" -> 1<br>
"virginica" -> 2<br>
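
The fields described above can be inspected directly on the object returned by `load_iris`. A short sketch, assuming a standard scikit-learn installation:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)            # names of the four feature columns
print(iris.target_names)             # species names, indexed by the integer label
print(iris.data.shape)               # (150, 4): 150 flowers, 4 measurements each
print(iris.data[0], iris.target[0])  # first sample and its integer label
```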

<img src="http://www.analyticskhoj.com/wp-content/uploads/2015/04/IRIS-Dataset.jpg">

<img src="http://ersatzassets.s3.amazonaws.com/iris_dataset_output.png">

In [44]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
test_idx = [0, 50, 100]  # hold out the first sample of each species (50 samples per class)

# training data
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# testing data
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [15]:
print(test_target)
print(clf.predict(test_data))

[0 1 2]
[0 1 2]
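
Three held-out flowers are too few to say much about how well the tree generalizes. Below is a sketch of a larger evaluation, assuming an even 50/50 split is acceptable. The use of `train_test_split` and `accuracy_score`, the name `eval_clf`, and the `test_size` and `random_state` values are choices made for this example and are not part of the original notebook.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out half of the samples, keeping class proportions equal in both halves
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, stratify=iris.target, random_state=0)

# A separate classifier, so the `clf` fitted above is left untouched
eval_clf = tree.DecisionTreeClassifier(random_state=0)
eval_clf.fit(X_train, y_train)

predictions = eval_clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)  # fraction of correct predictions
print(accuracy)

# Translate the first few integer predictions back to species names
print(iris.target_names[predictions[:5]])
```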


In [16]:
# viz code: write the fitted tree to a Graphviz .dot file
with open('/home/pybokeh/temp/iris.dot', 'w') as f:
    tree.export_graphviz(clf,
                         out_file=f,
                         feature_names=iris.feature_names,
                         class_names=iris.target_names,
                         filled=True, rounded=True,
                         impurity=False)

In [25]:
!dot -Tpdf /home/pybokeh/temp/iris.dot -o /home/pybokeh/temp/iris.pdf
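
If Graphviz is not installed, a tree can also be inspected as plain text. Below is a sketch using `tree.export_text`, which is available in scikit-learn 0.21 and later. The tree is refit on the full data set here (under the hypothetical name `text_clf`) so that the block is self-contained, so its splits may differ from those of the tree exported above.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
text_clf = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# One line per split or leaf, indented by depth in the tree
tree_text = tree.export_text(text_clf, feature_names=list(iris.feature_names))
print(tree_text)
```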