# Machine Learning with Python
Part of the SWEET Workshop series presented by the [IDEA Student Center at UC San Diego](http://www.jacobsschool.ucsd.edu/student/).

### Goals
Learn the basics of machine learning in Python with scikit-learn.

### Requirements
- numpy
- matplotlib
- scikit-learn

In [1]:
# make the code compatible with python 2.x and 3.x
# NOTE: __future__ imports must come before any other statement
from __future__ import print_function, division

# load required packages

# vectorized functions
import numpy as np

# plotting
import matplotlib.pyplot as plt
%matplotlib inline

# load the machine learning methods
from sklearn import linear_model, svm
from sklearn import preprocessing

## 1) Supervised regression
We'll start with a supervised regression task: predicting the sound pressure level of the airfoil blade designs in the NASA data set from the previous IPython Notebook.

**Reminder**: "supervised" means we have training data that includes the "true" output value (so we can adapt our model based on how well it predicts the "true" value).

In [3]:
# we'll load the data using numpy's genfromtxt() function
#
# NOTE: the data columns are separated by commas
#

# load the data
data = np.genfromtxt("airfoil_self_noise.csv", delimiter=",")

# check the data dimensions
print( data.shape )

(1503, 6)


In [4]:
# split the data into a training and testing set
train = data[:1000]
test = data[1000:]

# check the size of the two data sets
print( train.shape )
print( test.shape )

(1000, 6)
(503, 6)
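
The split above takes the first 1000 rows in file order. If the rows happen to be grouped in some way (an assumption, not something checked here), the testing set may look systematically different from the training set. A minimal sketch of a shuffled split, using a made-up stand-in array with the same shape as `data`:

```python
import numpy as np

# stand-in for the airfoil array loaded above (same shape, made-up values)
data = np.arange(1503 * 6, dtype=float).reshape(1503, 6)

# shuffle the row indices with a fixed seed so the split is repeatable
rng = np.random.RandomState(0)
shuffled = rng.permutation(data.shape[0])

# first 1000 shuffled rows for training, the rest for testing
train = data[shuffled[:1000]]
test = data[shuffled[1000:]]

# check the size of the two data sets
print(train.shape)
print(test.shape)
```

The rest of this notebook keeps the simple in-order split, so the scores shown below would change if this shuffled split were used instead.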


In [8]:
# now let's separate the input variables from the output (aka target)

# inputs (a matrix with m rows x 5 columns):
# - frequency [Hz]
# - angle of attack [degrees]
# - chord length [m]
# - free-stream velocity [m/s]
# - suction side displacement thickness [m]
#
X_train = train[:, :5]
X_test = test[:, :5]

# output (a vector with m rows): scaled sound pressure level [dB]
y_train = train[:, -1]
y_test = test[:, -1]

### 1.1) Ordinary Least-Squares
Now that we have the data organized, we can move on to training a model. Let's start with Ordinary Least-Squares (OLS) regression.

In [13]:
# initialize the model
ols = linear_model.LinearRegression()

# fit the model to the training data
ols.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Simple, right? That's because the Python community has put thousands of hours into making the scikit-learn interface as simple as possible. But underneath, there's a lot of important theory. And training a model is just the start.
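
As a small peek at that theory: OLS picks the coefficients that minimize the sum of squared differences between predictions and targets. The sketch below solves that problem directly with numpy's `np.linalg.lstsq` on made-up, noise-free data. It illustrates the idea only and is not meant as a description of how scikit-learn implements `LinearRegression` internally.

```python
import numpy as np

# made-up data: y = 2*x0 - 1*x1 + 3, with no noise
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
true_w = np.array([2.0, -1.0])
y = X.dot(true_w) + 3.0

# append a column of ones so that an intercept is fitted too
A = np.hstack([X, np.ones((X.shape[0], 1))])

# least-squares solution: minimizes ||A.dot(coef) - y||^2
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

# the first two entries are the slopes, the last one is the intercept
print(coef)
```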

But now that we have a trained model, let's check its performance.

In [14]:
# get the coefficient of determination (R^2) of the model
# on the training and testing sets
#
# NOTE:
# - R^2 = 1 => perfect result
# - R^2 = 0 => no better than always predicting the mean
# - R^2 < 0 => worse than always predicting the mean
#

# model performance on the training set
print( ols.score(X_train, y_train) )


# model performance on the testing set (which the model has
# never seen before)
print( ols.score(X_test, y_test) )

0.609742100092
0.310343768412
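
For regression models, `score()` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. A minimal sketch of that formula on made-up numbers (the helper name `r_squared` is hypothetical):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# made-up targets
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# perfect predictions
perfect = r_squared(y_true, y_true)

# always predicting the mean of the targets
mean_only = r_squared(y_true, np.full(4, y_true.mean()))

# predictions that are worse than the mean (targets in reverse order)
worse = r_squared(y_true, y_true[::-1])

print(perfect, mean_only, worse)
```

The three cases give 1, 0, and a negative value, which is why a negative score is possible on the testing set.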


**Discussion**: How did the model perform? Was the performance on the testing set better or worse than on the training set? And by how much?

### 1.2) Support Vector Regression
Now that we've tried a Least-Squares method, let's try a more "pure" Machine Learning method: Support Vector Regression (SVR).

In [18]:
# initialize the SVR model
svr = svm.SVR()

# train the model
svr.fit(X_train, y_train)

# check the model performance
print( "Training set:", svr.score(X_train, y_train) )
print( "Testing set:", svr.score(X_test, y_test) )

Training set: 0.321303886874
Testing set: -0.0248261308626


**Discussion**: How did the SVR method compare to the OLS method? What could have caused the difference in model performance? Does it mean Machine Learning methods are always good or always bad?
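
One thing worth looking at for this discussion: `svm.SVR()` uses an RBF kernel by default, which depends on the squared distance between input vectors. When the columns have very different units, one column can dominate that distance. A minimal sketch with two made-up designs (the numbers are illustrative, not taken from the data set):

```python
import numpy as np

# two made-up designs: [frequency in Hz, displacement thickness in m]
a = np.array([1000.0, 0.002])
b = np.array([2000.0, 0.050])

# contribution of each input to the squared distance between a and b
per_feature = (a - b) ** 2

# fraction of the squared distance that comes from each input
share = per_feature / per_feature.sum()
print(share)
```

Here almost all of the distance comes from the frequency column, so the thickness column has practically no influence on the kernel value.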

In [31]:
# let's try to improve the performance of the SVR model

# we'll scale the input variables so that they have
# zero mean and unit variance
#
X_scaled = preprocessing.scale(data[:, :5])
X_scaled_train = X_scaled[:1000]
X_scaled_test = X_scaled[1000:]

# initialize the SVR model (same default settings as before)
svr = svm.SVR()

# train the model
svr.fit(X_scaled_train, y_train)

# check the model performance
print( "Training set:", svr.score(X_scaled_train, y_train) )
print( "Testing set:", svr.score(X_scaled_test, y_test) )

Training set: 0.788224047272
Testing set: 0.459319656191
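
Note that the cell above computes the scaling statistics from all 1503 rows, including the testing rows. A common alternative, sketched here on made-up data, is to compute the mean and standard deviation on the training rows only and reuse them for the testing rows, using scikit-learn's `StandardScaler`. The column scales below are assumptions chosen only to mimic inputs with very different units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up inputs with very different column scales
rng = np.random.RandomState(0)
X = rng.rand(1503, 5) * np.array([1e4, 20.0, 0.3, 70.0, 0.05])
X_train, X_test = X[:1000], X[1000:]

# learn the mean and standard deviation from the training rows only
scaler = StandardScaler()
X_scaled_train = scaler.fit_transform(X_train)

# apply the same shift and scale to the testing rows
X_scaled_test = scaler.transform(X_test)

# the training columns now have zero mean and unit variance
print(X_scaled_train.mean(axis=0))
print(X_scaled_train.std(axis=0))
```

With this approach the testing columns will not have exactly zero mean and unit variance, which is expected: the model never gets to see statistics of the testing set.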


**Discussion**: Did scaling the inputs help at all? If so, how does the SVR (with scaled inputs) compare to the original OLS model?

## 2) Supervised classification
Classification means assigning an input to one of several categories. Examples:
- Does this image have a cat? Y/N
- What character (A-Z, 0-9) is shown in this photo?
- Is there a person walking in front of the car (and therefore should the self-driving car stop)?

In comparison, regression is focused on predicting continuous values. However, there is a large overlap between classification and regression tasks (e.g. splitting data into training and testing sets, fitting models, and calculating statistics on the model predictions in order to evaluate the performance).
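
To illustrate that overlap, here is a minimal sketch of a classification workflow on made-up data: two clusters of 2-D points labelled 0 and 1, a training/testing split, and a `LogisticRegression` model. The data set and the choice of model are assumptions for illustration only. For classifiers, `score()` returns the accuracy (the fraction of correct predictions) rather than R^2.

```python
import numpy as np
from sklearn import linear_model

# two made-up clusters of 2-D points, labelled 0 and 1
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) - 2.0, rng.randn(100, 2) + 2.0])
y = np.array([0] * 100 + [1] * 100)

# shuffle the rows, then split into training and testing sets
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# initialize and train the classifier
clf = linear_model.LogisticRegression()
clf.fit(X_train, y_train)

# check the model performance (accuracy on the testing set)
accuracy = clf.score(X_test, y_test)
print("Testing set accuracy:", accuracy)
```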