# Lab 2.1: scikit-learn
## TA: Suraj Yerramilli
## Date: January 18th, 2019

Scikit-learn is a popular machine learning library that has implementations of a large number of common algorithms, including ones for text processing. An advantage of scikit-learn is a consistent syntax across different kinds of models.

Before starting, ensure you have the scikit-learn library (version >=0.20) installed. The package is named `scikit-learn` but is imported as `sklearn`. You can use either the Anaconda Navigator or the command line to install or upgrade it.

**pip**:

```{bash}
pip install --user scikit-learn
pip install --user --upgrade scikit-learn
```

**conda**:

```{bash}
conda install scikit-learn
conda update scikit-learn
```

We will start with a script that performs Ridge regression on the California housing dataset.

In [None]:
## Typical sklearn script

# importing libraries
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# read the housing data and extract features and labels
housing = np.genfromtxt("../data/california_housing.csv",delimiter=",",names=None,skip_header=1)
X = housing[:,:-1]
y = housing[:,-1]

# Split into train and test sets (2:1 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# preprocessing the data
scaler = StandardScaler()
X_sc = scaler.fit_transform(X_train)

# initialize regressor and fit object
reg = Ridge(alpha=1e-2) # alpha is the regularization parameter
reg.fit(X_sc,y_train)

# Obtain predictions
y_pred_train = reg.predict(X_sc) 
X_sc_test = scaler.transform(X_test)  # need to normalize (with the same mean and sd) before obtaining predictions
y_pred_test = reg.predict(X_sc_test)


# Report training and test performance
print("Training MSE: {}".format(mean_squared_error(y_train,y_pred_train)))
print("Training R^2: {}".format(r2_score(y_train,y_pred_train)))
print()
print("Test MSE: {}".format(mean_squared_error(y_test,y_pred_test)))
print("Test R^2: {}".format(r2_score(y_test,y_pred_test)))

In [None]:
reg.coef_ # regression coefficients

Let us look at the individual components of the script.

## Data Input

In most cases, the input data will be NumPy arrays. Text processing may involve other data types.

In [None]:
# shapes of the different arrays
print(X.shape)
print(y.shape)

## Training and Test Sets (Model Validation)

Training and testing a model on the same data is a mistake: the model can fit the training data almost perfectly and still perform poorly on unseen data. The typical strategy is therefore to hold out part of the data as a test set. The `train_test_split` function does just that.

In [None]:
help(train_test_split)
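
As a quick illustration of what the split returns, here is a minimal sketch on made-up data. The arrays `X_toy` and `y_toy` are hypothetical stand-ins, not the housing data, and the 100-sample size is chosen only to make the 2:1 split easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.normal(size=100)

# Same arguments as in the script above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

print(X_tr.shape, X_te.shape)  # (67, 3) (33, 3)
print(y_tr.shape, y_te.shape)  # (67,) (33,)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same train and test rows.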

The `train_test_split` function is part of the `sklearn.model_selection` module. The module also provides other validation strategies, such as `KFold` cross-validation, which gives a more robust performance estimate.
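
The following is a rough sketch of K-fold validation, assuming synthetic data from `make_regression` in place of the housing data. The choices of 5 folds and `scoring="r2"` are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption for this sketch)
X_toy, y_toy = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# 5 folds: each fold serves once as the held-out set
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One R^2 score per fold
scores = cross_val_score(Ridge(alpha=1e-2), X_toy, y_toy, cv=cv, scoring="r2")

print(scores)
print("Mean R^2: {}".format(scores.mean()))
```

Averaging over folds reduces the dependence of the estimate on one particular train/test split.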

## Estimators

The main API implemented by scikit-learn is the Estimator class. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data.

Examples of transforming estimators: StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder. These are available in the `sklearn.preprocessing` module.

Estimator classes for transformers typically have the following methods:

| Method                | Description                                                  |
|-----------------------|--------------------------------------------------------------|
| .fit(X)               | Compute transformer "parameters" from data                   |
| .transform(X)         | Transform new data                                           |
| .fit_transform(X)     | Compute transformer "parameters" and return transformed data |
| .inverse_transform(X) | Map transformed data back to its original representation     |
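
The methods in the table can be tried on a tiny hand-made array. This is only a sketch: `X_small`, `X_new` and `scaler_demo` are hypothetical names introduced here and are not part of the lab script.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 samples, 2 features
X_small = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[4.0, 40.0]])

# .fit computes the per-feature mean and standard deviation
scaler_demo = StandardScaler()
scaler_demo.fit(X_small)
print(scaler_demo.mean_)  # [ 2. 20.]

# .transform applies the fitted mean and sd to new data
Z_new = scaler_demo.transform(X_new)

# .inverse_transform undoes the scaling
X_back = scaler_demo.inverse_transform(Z_new)
print(np.allclose(X_back, X_new))  # True

# .fit_transform is fit followed by transform on the same data
Z_small = StandardScaler().fit_transform(X_small)
print(Z_small.mean(axis=0), Z_small.std(axis=0))
```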

In the code, we use the StandardScaler to normalize the features in the training set to have zero mean and unit standard deviation. Note that the test data should be scaled with the same mean and standard deviation, that is, those computed from the training set.

In [None]:
# fitted parameters of scaler object
{'mean':scaler.mean_,'sd': scaler.scale_}

Estimator classes for supervised models typically have the following methods:

| Method             | Description                                                         |
|--------------------|---------------------------------------------------------------------|
| .fit(X,y)          | Fits the model on the data. y is not needed for unsupervised models |
| .predict(X)        | Returns predictions (either values or labels) on the new data       |
| .predict_proba(X)  | Returns class probabilities on the new data (when applicable)       |
| .get_params()      | Get the model's tuning parameters (defined in object call)          |
| .score(X,y)        | Returns the prediction score on the new data                        |
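
A minimal sketch of these methods on a classifier follows, assuming synthetic data from `make_classification` and `LogisticRegression` as an arbitrary example model. For classifiers `.score` returns the mean accuracy; for regressors it returns R^2.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption for this sketch)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(C=1.0, solver="lbfgs")
clf.fit(X_toy, y_toy)                 # learn from the data

labels = clf.predict(X_toy[:5])       # predicted class labels
proba = clf.predict_proba(X_toy[:5])  # one column of probabilities per class
params = clf.get_params()             # dict of constructor arguments
acc = clf.score(X_toy, y_toy)         # mean accuracy for classifiers

print(labels)
print(proba.shape)  # (5, 2)
print(params["C"])  # 1.0
print(acc)
```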

## Metrics

The `sklearn.metrics` module contains functions which return some measure of prediction performance. They are of two types - loss/error and score. The typical syntax is as follows:

```{python}
loss_or_score_function(y_true,y_pred)
```

Note that the syntax is **different** for unsupervised learning metrics, such as clustering metrics.

Examples:

Regression - mean_squared_error, r2_score, mean_absolute_error

Classification - accuracy_score, precision_score, recall_score, log_loss

Clustering - silhouette_score
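
The sketch below uses small hand-made values (all hypothetical) to show the `(y_true, y_pred)` call pattern and how `silhouette_score` differs: it takes the feature matrix and the cluster labels, not a pair of label vectors.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, accuracy_score, silhouette_score

# Regression: squared errors are 1, 0 and 4, so MSE = 5/3
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.0, 5.0, 4.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)

# Classification: 3 of 4 labels match, so accuracy = 0.75
y_true_cls = [0, 1, 1, 0]
y_pred_cls = [0, 1, 0, 0]
acc = accuracy_score(y_true_cls, y_pred_cls)

# Clustering: arguments are the data and the assigned cluster labels
X_pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
cluster_labels = [0, 0, 1, 1]
sil = silhouette_score(X_pts, cluster_labels)

print(mse, acc, sil)
```

For losses such as `mean_squared_error` lower is better, while for scores such as `accuracy_score` and `silhouette_score` higher is better.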

## Exercise 1

Fit a radial-basis-function (RBF) support vector machine for regression on the California housing dataset. For regression, the estimator class is SVR. Initialize the SVR with arguments `kernel` = 'rbf', `gamma` = 0.125 and `C` = 1. Compare its prediction performance with that of the linear model.

In [None]:
# Train SVM on the california housing dataset
from sklearn.svm import SVR

### YOUR CODE GOES HERE
### You can use the variables defined previously
### Note: Fitting the SVM can take up to several minutes

## Exercise 2 (optional)

Train your favorite classifier (from sklearn of course) on the digits dataset. Return the accuracy score and log(istic) loss (when probabilities are computed). To obtain predicted probabilities use the `.predict_proba` method instead of the regular `.predict` method.

In [None]:
from sklearn.datasets import load_digits

# load digits dataset
X,y = load_digits(return_X_y=True)

### YOUR CODE GOES HERE
# use multiple blocks if you want to do a little more exploration