# A Taste of Supervised Learning

### Scott O’Hara - Metrowest Developers Machine Learning Group - 9/17/2020

## Goals of Talk / Notebook:

### - Touch on Key Aspects of Supervised Learning
### - Provide a Concrete Example
### - Not a Goal: Comprehensive Overview

## References:

Fox, Emily & Carlos Guestrin. *Machine Learning: Regression*. Coursera. https://www.coursera.org/learn/ml-regression.

Collins-Thompson, Kevyn. *Applied Machine Learning in Python*. Coursera. https://www.coursera.org/learn/python-machine-learning

Azevedo, A. and Santos, M. F. (2008); KDD, SEMMA and CRISP-DM: a parallel overview. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182–185.

Wikipedia contributors, "Cross-industry standard process for data mining," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Cross-industry_standard_process_for_data_mining&oldid=974543440 (accessed September 16, 2020).

Wikipedia contributors, "Coefficient of determination," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Coefficient_of_determination&oldid=978151921 (accessed September 17, 2020).

## Three Types of Machine Learning
![](3-types-of-ml.png)

Credit: image borrowed from lecture slides by David Silver, DeepMind, "Introduction to Reinforcement Learning."

## *Supervised Learning* – Learn a function from labeled data that maps input attributes to an output label.

* **Classification**
  - Learn discrete labels e.g., 'cat' or 'dog'.

* **Regression**
  - Learn real-valued labels e.g., a stock price.
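
A minimal sketch of the two flavors, using scikit-learn on made-up toy data. The numbers, the variable names, and the choice of a 1-nearest-neighbor classifier are illustrative assumptions and are not part of the housing example used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Classification: the labels are discrete categories.
X_cls = np.array([[1.0], [2.0], [8.0], [9.0]])   # one made-up input attribute
y_cls = np.array(['cat', 'cat', 'dog', 'dog'])   # discrete labels
classifier = KNeighborsClassifier(n_neighbors=1).fit(X_cls, y_cls)
predicted_labels = classifier.predict(np.array([[1.2], [8.8]]))

# Regression: the labels are real numbers.
X_reg = np.array([[1.0], [2.0], [3.0], [4.0]])
y_reg = np.array([10.0, 20.0, 30.0, 40.0])       # made-up real-valued labels
regressor = LinearRegression().fit(X_reg, y_reg)
predicted_value = regressor.predict(np.array([[5.0]]))

print(predicted_labels)   # an array of category labels
print(predicted_value)    # an array holding one real number
```

In both cases the model is fit on (input, label) pairs; only the type of the label differs.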

## *Unsupervised Learning* – Learn patterns in unlabeled data.

* **Clustering**

* **Dimensionality Reduction**

* **Outlier Detection**

## *Reinforcement Learning* – Train an agent to maximize rewards while acting in an uncertain environment.

## The Data Science Process

![](crisp-dm-process.png)

CRISP-DM: Cross-industry Standard Process for Data Mining

## Business Understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data \[science\] problem definition and a preliminary plan designed to achieve the objectives.

## Data Understanding

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.

## Data Preparation

The data preparation phase covers all activities to construct the final dataset from the initial raw data.
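
As one small, hedged illustration of a preparation step: the housing data used later in this notebook stores sale dates as strings such as `20141013T000000`. The sketch below parses two such strings into timestamps and derives numeric columns from them. The column names `sale_year` and `sale_month` are hypothetical, and the notebook itself does not perform this step.

```python
import pandas as pd

# Two raw rows in the same format as the housing data's 'date' column.
raw = pd.DataFrame({'date': ['20141013T000000', '20150225T000000'],
                    'price': [221900.0, 180000.0]})

# Parse the strings into timestamps, stating the format explicitly.
raw['date'] = pd.to_datetime(raw['date'], format='%Y%m%dT%H%M%S')

# Derive simple numeric columns that a model could consume
# (hypothetical names, chosen for this sketch).
raw['sale_year'] = raw['date'].dt.year
raw['sale_month'] = raw['date'].dt.month
print(raw)
```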

## Languages and Libraries for Machine Learning and Data Science

**Python** - an interpreted, high-level, general-purpose programming language that emphasizes code readability and supports multiple programming paradigms. It is one of the main languages used in data science, alongside **R**, **Java**, **SQL**, **MATLAB**, and **Scala**. (https://www.python.org/)

**NumPy** - supports large, multi-dimensional arrays and matrices along with a large collection of high-level mathematical functions to operate on these arrays. (https://numpy.org/)

**Pandas** - supports data analysis by providing data structures and operations that manipulate numerical tables and time series. (https://pandas.pydata.org/)

**Scikit-Learn** - a machine learning library that features various classification, regression and clustering algorithms and is designed to work with NumPy and Pandas. (https://scikit-learn.org/)

## Tools for Machine Learning and Data Science

**Anaconda** - provides free and open-source distributions of Python and R environments for data science and machine learning with the aim of simplifying package management and deployment. (https://www.anaconda.com/)

**Jupyter Notebook** - an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Provided in Anaconda. (https://jupyter.org/)

**Spyder** - a Python IDE for scientific computing and data analysis. Provided in Anaconda. (https://www.spyder-ide.org/)

**Visual Studio Code (VSCode)** - an open-source code editor from Microsoft that can be used with a variety of programming languages. Provides excellent support for Python, Jupyter, git and other data-science-related tools and tasks. Provided in Anaconda. (https://code.visualstudio.com/)

## Data Frames

A *data frame* is a table-like data structure used for data analysis in languages such as R and Python. It is similar to a relational database table: each row is a record and each named column holds values of one type.
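
A minimal sketch of building a data frame by hand, before reading one from a file. The three rows reuse a few values from the housing data; the variable name `houses` is an assumption made for this example.

```python
import pandas as pd

# Each dict key becomes a column; each list position becomes a row.
houses = pd.DataFrame({
    'bedrooms': [3, 2, 4],
    'bathrooms': [1.0, 1.0, 3.0],
    'price': [221900.0, 180000.0, 604000.0],
})

print(houses.shape)             # (number of rows, number of columns)
print(houses.dtypes)            # each column has its own data type
print(houses['price'].mean())   # column-wise operations work on named columns
```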

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Read a file containing housing sales data from King County, Washington.
df = pd.read_csv('kc_house_data.csv')

In [3]:
# Examine the first 10 records of the file.
df.head(10)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
5,7237550310,20140512T000000,1225000.0,4,4.5,5420,101930,1.0,0,0,...,11,3890,1530,2001,0,98053,47.6561,-122.005,4760,101930
6,1321400060,20140627T000000,257500.0,3,2.25,1715,6819,2.0,0,0,...,7,1715,0,1995,0,98003,47.3097,-122.327,2238,6819
7,2008000270,20150115T000000,291850.0,3,1.5,1060,9711,1.0,0,0,...,7,1060,0,1963,0,98198,47.4095,-122.315,1650,9711
8,2414600126,20150415T000000,229500.0,3,1.0,1780,7470,1.0,0,0,...,7,1050,730,1960,0,98146,47.5123,-122.337,1780,8113
9,3793500160,20150312T000000,323000.0,3,2.5,1890,6560,2.0,0,0,...,7,1890,0,2003,0,98038,47.3684,-122.031,2390,7570


In [4]:
# compute statistics for each column
df.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0
mean,4580302000.0,540088.1,3.370842,2.114757,2079.899736,15106.97,1.494309,0.007542,0.234303,3.40943,7.656873,1788.390691,291.509045,1971.005136,84.402258,98077.939805,47.560053,-122.213896,1986.552492,12768.455652
std,2876566000.0,367127.2,0.930062,0.770163,918.440897,41420.51,0.539989,0.086517,0.766318,0.650743,1.175459,828.090978,442.575043,29.373411,401.67924,53.505026,0.138564,0.140828,685.391304,27304.179631
min,1000102.0,75000.0,0.0,0.0,290.0,520.0,1.0,0.0,0.0,1.0,1.0,290.0,0.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,2123049000.0,321950.0,3.0,1.75,1427.0,5040.0,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1951.0,0.0,98033.0,47.471,-122.328,1490.0,5100.0
50%,3904930000.0,450000.0,3.0,2.25,1910.0,7618.0,1.5,0.0,0.0,3.0,7.0,1560.0,0.0,1975.0,0.0,98065.0,47.5718,-122.23,1840.0,7620.0
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10688.0,2.0,0.0,0.0,4.0,8.0,2210.0,560.0,1997.0,0.0,98118.0,47.678,-122.125,2360.0,10083.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,9410.0,4820.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0


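Pandas is integrated with Python and NumPy: built-in functions such as `len`, `type`, and `list`, and NumPy functions such as `np.median`, can be applied directly to Pandas objects.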

In [5]:
# Use Python len function to obtain number of records in data frame.
len(df)

21613

In [6]:
# df is a Data Frame
type(df)

pandas.core.frame.DataFrame

In [7]:
# A Pandas Index object
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [8]:
# create a list of column headers using Python list function
list(df.columns)

['id',
 'date',
 'price',
 'bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'floors',
 'waterfront',
 'view',
 'condition',
 'grade',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'yr_renovated',
 'zipcode',
 'lat',
 'long',
 'sqft_living15',
 'sqft_lot15']

In [9]:
# compute median price with NumPy median function.
np.median(df['price'])

450000.0

## Slicing and Dicing Data Frames

In [10]:
# Use loc to create new data frame
X = df.loc[:, ['bedrooms', 'bathrooms']]

In [11]:
X.head()

Unnamed: 0,bedrooms,bathrooms
0,3,1.0
1,3,2.25
2,2,1.0
3,4,3.0
4,3,2.0


### Data Series

A data series (a Pandas `Series`) is a one-dimensional labeled array capable of holding data of any type.


In [12]:
s = pd.Series([3,'a', 0.0])

In [13]:
s

0    3
1    a
2    0
dtype: object

In [14]:
type(s[1])

str

In [16]:
# Use loc to create a data series
y = df.loc[:, 'price']

In [17]:
# y has 21613 elements
y

0        221900.0
1        538000.0
2        180000.0
3        604000.0
4        510000.0
           ...   
21608    360000.0
21609    400000.0
21610    402101.0
21611    400000.0
21612    325000.0
Name: price, Length: 21613, dtype: float64

In [18]:
# create a data series with 3 elements
y[[0,2,3]]

0    221900.0
2    180000.0
3    604000.0
Name: price, dtype: float64

In [19]:
# Boolean data series where item is true if price is less than $90,000, false otherwise.
y < 90000

0        False
1        False
2        False
3        False
4        False
         ...  
21608    False
21609    False
21610    False
21611    False
21612    False
Name: price, Length: 21613, dtype: bool

In [20]:
# create a data frame containing all houses with a price less than $90,000
df[y < 90000]

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
465,8658300340,20140523T000000,80000.0,1,0.75,430,5050,1.0,0,0,...,4,430,0,1912,0,98014,47.6499,-121.909,1200,7500
1149,3421079032,20150217T000000,75000.0,1,0.0,670,43377,1.0,0,0,...,3,670,0,1966,0,98022,47.2638,-121.906,1160,42882
2141,1623049041,20140508T000000,82500.0,2,1.0,520,22334,1.0,0,0,...,5,520,0,1951,0,98168,47.4799,-122.296,1572,10570
3108,1721801591,20150219T000000,89950.0,1,1.0,570,4080,1.0,0,0,...,5,570,0,1942,0,98146,47.5098,-122.334,890,5100
3767,1523049188,20150430T000000,84000.0,2,1.0,700,20130,1.0,0,0,...,6,700,0,1949,0,98168,47.4752,-122.271,1490,18630
5866,9320900420,20141014T000000,89000.0,3,1.0,900,4750,1.0,0,0,...,6,900,0,1969,0,98023,47.3026,-122.363,900,3404
8274,3883800011,20141105T000000,82000.0,3,1.0,860,10426,1.0,0,0,...,6,860,0,1954,0,98146,47.4987,-122.341,1140,11250
10253,2422049104,20140915T000000,85000.0,2,1.0,830,9000,1.0,0,0,...,6,830,0,1939,0,98032,47.3813,-122.243,1160,7680
13756,1788900230,20140722T000000,86500.0,3,1.0,840,9480,1.0,0,0,...,6,840,0,1960,0,98023,47.3277,-122.341,840,9420
15293,40000362,20140506T000000,78000.0,2,1.0,780,16344,1.0,0,0,...,5,780,0,1942,0,98168,47.4739,-122.28,1700,10387
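
Boolean series can also be combined to filter on several conditions at once. The sketch below uses a small stand-in data frame with illustrative values (the name `homes` is an assumption) so that it runs without the CSV file.

```python
import pandas as pd

# A small stand-in for the housing data frame (values are illustrative).
homes = pd.DataFrame({
    'price': [80000.0, 221900.0, 89000.0, 604000.0],
    'bedrooms': [1, 3, 3, 4],
})

# Each comparison yields a boolean Series; combine them with & (and) or | (or).
# The parentheses are required because & binds more tightly than < and >=.
cheap_and_roomy = homes[(homes['price'] < 90000) & (homes['bedrooms'] >= 3)]
print(cheap_and_roomy)

# The same filter written with DataFrame.query.
same_rows = homes.query('price < 90000 and bedrooms >= 3')
```

Python's `and` and `or` keywords do not work element-wise on a series, which is why the `&` and `|` operators are used in the first form.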


## Modeling

Select and apply modeling techniques, calibrating parameters for optimal fit.

### Create training and test sets: X_train, X_test, y_train, y_test

![](ml-model-development.jpg)

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [22]:
X = df.loc[:, 'bedrooms':]
y = df.loc[:, 'price']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("Training set percentage: {:.1f}%.".format(100*len(X_train)/len(df)))
print("Test set percentage: {:.1f}%.".format(100*len(X_test)/len(df)))

Training set percentage: 75.0%.
Test set percentage: 25.0%.
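
The 75/25 proportions above come from the default behavior of `train_test_split`, which holds out 25% of the rows when neither `test_size` nor `train_size` is given. A minimal sketch on made-up arrays (all `_demo` names are assumptions for this example) shows the default next to an explicit 80/20 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples with 2 features each; row i is [2*i, 2*i + 1].
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Default: 25% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(len(X_tr), len(X_te))

# Explicit 80/20 split.
X_tr80, X_te20, y_tr80, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print(len(X_tr80), len(X_te20))

# Rows are shuffled, but each feature row stays paired with its own label.
rows_still_paired = np.array_equal(X_tr[:, 0] // 2, y_tr)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs produce the same split.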


![](linear-regression.jpg)

In [23]:
model = LinearRegression().fit(X_train, y_train)
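
To make concrete what `fit` learns, here is a small sketch on made-up, noise-free data where the true relationship is known. The data, the `_demo` names, and the coefficients 3, -2 and 5 are assumptions for illustration only; `coef_` and `intercept_` are attributes of a fitted `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3*x1 - 2*x2 + 5 exactly (no noise).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

demo_model = LinearRegression().fit(X_demo, y_demo)

print(demo_model.coef_)        # one learned weight per feature
print(demo_model.intercept_)   # learned constant term

# A prediction is the weighted sum of the features plus the intercept.
x_new = np.array([[1.0, 2.0]])
by_hand = x_new @ demo_model.coef_ + demo_model.intercept_
```

The same attributes are available on the housing `model`, with one coefficient per column of `X_train`.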

### Evaluate quality of model and compare to other models.

### Regression Quality Metric: The R^2 ("r-squared") Regression Score

* Measures how well a regression model's predictions fit the given data.
* The score is at most 1:
  * A value of 1 corresponds to perfect prediction.
  * A value of 0 corresponds to a constant model that always predicts the mean of the target values being scored.
  * The score can be negative when a model fits the data worse than that constant model, which can happen on data the model was not trained on.
* Also known as the "coefficient of determination".

See https://en.wikipedia.org/wiki/Coefficient_of_determination

![](coeff-of-determination.jpg)
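
The score can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does so on four made-up target values and compares three sets of predictions; the helper name `r_squared` is an assumption for this example, and scikit-learn's `r2_score` computes the same quantity.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
reversed_pred = np.array([4.0, 3.0, 2.0, 1.0])

perfect = r_squared(y_true, y_true)                        # every prediction exact
mean_only = r_squared(y_true, np.full(4, y_true.mean()))   # always predict the mean
worse = r_squared(y_true, reversed_pred)                   # worse than the mean

print(perfect, mean_only, worse)   # 1.0 0.0 -3.0
```

The last case shows that predictions worse than the constant mean produce a negative score.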

In [24]:
model.score(X_train, y_train)

0.7025634191135648

In [25]:
model.score(X_test, y_test)

0.6900932169858087

## Different features matter, some more than others

In [27]:
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [28]:
# Remove features: 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15'
X = df.loc[:, ['bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement']]
y = df.loc[:, 'price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Training R^2 score: {}".format(model.score(X_train, y_train)))
print("Testing R^2 score: {}".format(model.score(X_test, y_test)))


Training R^2 score: 0.6094120919619783
Testing R^2 score: 0.5925966295057667
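
The same effect can be reproduced in a controlled setting. In this sketch the data are synthetic: the target depends strongly on one feature and not at all on the other, so dropping the first should hurt the test score far more than dropping the second. All names here (`informative`, `noise`, `holdout_r2`) are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on 'informative' and ignores 'noise'.
rng = np.random.default_rng(0)
informative = rng.normal(size=500)
noise = rng.normal(size=500)
target = 5.0 * informative + 0.1 * rng.normal(size=500)

def holdout_r2(features):
    """Fit a linear model on the given feature columns; return the test R^2."""
    X = np.column_stack(features)
    X_tr, X_te, y_tr, y_te = train_test_split(X, target, random_state=0)
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

score_both = holdout_r2([informative, noise])
score_without_noise = holdout_r2([informative])
score_without_informative = holdout_r2([noise])

print(score_both, score_without_noise, score_without_informative)
```

Comparing scores with and without a feature is only a rough guide on real data, because correlated features can partly stand in for one another.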


### K-Nearest Neighbors

![](knn.jpg)

In [29]:
# Try a different model class: K-nearest neighbors.
from sklearn.neighbors import KNeighborsRegressor

X = df.loc[:, 'bedrooms':]
y = df.loc[:, 'price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = KNeighborsRegressor(n_neighbors = 5).fit(X_train, y_train)
print("Training R^2 score: {}".format(model.score(X_train, y_train)))
print("Testing R^2 score: {}".format(model.score(X_test, y_test)))

Training R^2 score: 0.6858452516400575
Testing R^2 score: 0.48204813973501615
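
For intuition, a K-nearest-neighbors regressor with default settings predicts the plain average of the target values of the K closest training points, using Euclidean distance. The sketch below reproduces one prediction by hand on made-up one-dimensional data; the `_demo` names and the values are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up 1-D training data.
X_demo = np.array([[1.0], [2.0], [4.0], [7.0], [11.0]])
y_demo = np.array([10.0, 20.0, 40.0, 70.0, 110.0])
query = np.array([2.4])
k = 3

# 1. Euclidean distance from the query to every training point.
distances = np.linalg.norm(X_demo - query, axis=1)
# 2. Indices of the k closest training points.
nearest = np.argsort(distances)[:k]
# 3. The prediction is the plain average of their target values.
by_hand = y_demo[nearest].mean()

knn = KNeighborsRegressor(n_neighbors=k).fit(X_demo, y_demo)
from_sklearn = knn.predict(query.reshape(1, -1))[0]
print(by_hand, from_sklearn)
```

Because the prediction depends on raw distances, features measured on large scales (such as lot size in square feet) can dominate features measured on small scales (such as number of bedrooms).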


In [30]:
# K is a hyperparameter. Try to find the best one.

# The split does not depend on k, so compute it once outside the loop.
X = df.loc[:, 'bedrooms':]
y = df.loc[:, 'price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [3, 5, 10, 25, 50, 100]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print("K = {}".format(k))
    print("Training R^2 score: {}".format(model.score(X_train, y_train)))
    print("Testing R^2 score: {}".format(model.score(X_test, y_test)))
    print()

K = 3
Training R^2 score: 0.7525551274835218
Testing R^2 score: 0.4700549776957391

K = 5
Training R^2 score: 0.6858452516400575
Testing R^2 score: 0.48204813973501615

K = 10
Training R^2 score: 0.6108727562385599
Testing R^2 score: 0.498699803288462

K = 25
Training R^2 score: 0.5391473087114398
Testing R^2 score: 0.4863195890913989

K = 50
Training R^2 score: 0.49049229359504715
Testing R^2 score: 0.4624989800305602

K = 100
Training R^2 score: 0.4425928431787881
Testing R^2 score: 0.4297135109848631



In [31]:
# Pass 2: K is a hyperparameter. Narrow the search around the best values from pass 1.

# The split does not depend on k, so compute it once outside the loop.
X = df.loc[:, 'bedrooms':]
y = df.loc[:, 'price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [5, 7, 9, 11, 13, 17]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print("K = {}".format(k))
    print("Training R^2 score: {}".format(model.score(X_train, y_train)))
    print("Testing R^2 score: {}".format(model.score(X_test, y_test)))
    print()

K = 5
Training R^2 score: 0.6858452516400575
Testing R^2 score: 0.48204813973501615

K = 7
Training R^2 score: 0.648026320655854
Testing R^2 score: 0.4941473635103566

K = 9
Training R^2 score: 0.6220789098271142
Testing R^2 score: 0.4991159135964429

K = 11
Training R^2 score: 0.6023817458493113
Testing R^2 score: 0.4954264093349724

K = 13
Training R^2 score: 0.5899410502643976
Testing R^2 score: 0.49730196819412714

K = 17
Training R^2 score: 0.5655779981416595
Testing R^2 score: 0.49521317350868166
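
Choosing K by comparing test scores lets the test set influence model selection. One common alternative, sketched here under stated assumptions, is to pick K by cross-validation on the training set and score the test set only once at the end. The data are synthetic, the `_demo` names are made up, and the feature scaling step is an addition that the notebook does not perform.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features; the second column has a
# much larger scale than the first.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(size=400), 1000.0 * rng.normal(size=400)])
y_demo = 3.0 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + 0.1 * rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Scale features so that no single column dominates the distance computation.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
candidate_ks = [3, 5, 7, 9, 11, 13, 17]
search = GridSearchCV(
    pipeline,
    param_grid={'kneighborsregressor__n_neighbors': candidate_ks},
    cv=5,          # 5-fold cross-validation on the training set only
    scoring='r2',
)
search.fit(X_tr, y_tr)

best_k = search.best_params_['kneighborsregressor__n_neighbors']
test_score = search.score(X_te, y_te)   # the test set is used once, at the end
print(best_k, test_score)
```

`make_pipeline` names each step after its lowercased class name, which is where the `kneighborsregressor__` prefix in the parameter grid comes from.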



## Evaluation

At this stage the model (or models) obtained are more thoroughly evaluated and the
steps executed to construct the model are reviewed to be certain it properly achieves the business
objectives.

## Deployment

Creation of the model is generally not the end of the project. Even if the purpose
of the model is to increase knowledge of the data, the knowledge gained will need to be
organized and presented in a way that the customer can use it.