# Train, Test, Validate Methodology

## Setup
---

### Dataset

https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized

Download Dataset [CommViolPredUnnormalizedData.txt](https://archive.ics.uci.edu/ml/machine-learning-databases/00211/CommViolPredUnnormalizedData.txt)

Download Header [CommViolPredUnnormalizedDataHeaders.csv](https://github.com/smnieee/ml_workshop/blob/master/session3/CommViolPredUnnormalizedDataHeaders.csv)

### Upload Data

In [None]:
from google.colab import files
upload = files.upload()

### Alternative for using Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Load and View the Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The data file comes without a header row. We could enter the column names manually, but there are 147 columns. Instead, let's use a header file to bring in the column names.

In [None]:
head = pd.read_csv('/content/CommViolPredUnnormalizedDataHeaders.csv')

Now, we can use the column names of this empty DataFrame as the column names of our data file.

In [None]:
df = pd.read_csv('/content/CommViolPredUnnormalizedData.txt', names=head.columns.values.tolist(),
                 na_values='?')
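
To see what `names=` and `na_values='?'` do, here is a minimal sketch on in-memory text instead of the real files. The three column names and the two data rows are made up for illustration; only the `read_csv` arguments mirror the cell above.

```python
import io
import pandas as pd

# Hypothetical header-only file and a matching two-row data file
header_text = "communityname,population,violentPerPop\n"
data_text = "Alpha,1200,41.5\nBeta,?,17.2\n"

# Reading a header-only file gives an empty DataFrame that still has column names
head_demo = pd.read_csv(io.StringIO(header_text))

# Apply those names to the headerless data and treat '?' as missing
df_demo = pd.read_csv(io.StringIO(data_text),
                      names=head_demo.columns.tolist(),
                      na_values='?')
print(df_demo)
```

Because the `?` becomes `NaN`, the `population` column is read as numeric rather than as text.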

View the first 5 rows of data with `df.head()`.

In [None]:
df.head()

## Extract the Data for Analysis

For our analysis we will look at the relationship between violent and non-violent crime.

We want to exclude any rows that have `NaN` in the `nonViolPerPop` or `violentPerPop` columns. The code below also drops rows that are missing `popDensity`, which is used as a second input later. Then, we will look at scatter plots to see what a trend might look like.

In [None]:
dfcln = df.dropna(subset=['popDensity','nonViolPerPop', 'violentPerPop'])
dfcln.plot.scatter(x='nonViolPerPop',y='violentPerPop')
dfcln.plot.scatter(x='popDensity',y='violentPerPop')

There appears to be an approximately linear or quadratic relationship between the two measures.

## Manually Fit Linear Model

We will try to fit the data on our own to see how this works. Then we will use the `scikit-learn` tools for fitting the data. 

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [None]:
# Assign the violent crimes data as the Y-data
viol = dfcln['violentPerPop'].values
N = len(viol)
Y_data = viol.copy()

nonviol = dfcln['nonViolPerPop'].values
density = dfcln['popDensity'].values

# Create a 2-D X-data array with one column per feature
X_data = np.column_stack((nonviol, density))

print(f"The number of samples is: {N}")

### Set up number of samples in each set

In [None]:
test_pct = 0.1
valid_pct = 0.2

X_train_valid, X_test, Y_train_valid, Y_test = train_test_split(X_data, Y_data, test_size=test_pct)
X_train, X_valid, Y_train, Y_valid = train_test_split(X_train_valid, Y_train_valid, test_size=valid_pct)

print(f"Train Size: {len(Y_train)}")
print(f"Validation Size: {len(Y_valid)}")
print(f"Test Size: {len(Y_test)}")
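
Because the second split is applied to what remains after the first, `valid_pct` is a fraction of the remaining 90%, not of the whole data set. The sketch below works through that arithmetic for a hypothetical 1000 samples. It assumes `train_test_split` rounds the held-out count up when given a fraction; that is worth confirming against the sizes printed above.

```python
import math

def two_stage_split_sizes(n_samples, test_pct, valid_pct):
    """Expected set sizes for a test split followed by a validation split."""
    # Assumption: the held-out count is rounded up
    n_test = math.ceil(test_pct * n_samples)
    n_remaining = n_samples - n_test
    n_valid = math.ceil(valid_pct * n_remaining)
    n_train = n_remaining - n_valid
    return n_train, n_valid, n_test

n_total = 1000  # hypothetical sample count
n_train, n_valid, n_test = two_stage_split_sizes(n_total, 0.1, 0.2)
print(f"Train: {n_train} ({n_train / n_total:.0%})")
print(f"Validation: {n_valid} ({n_valid / n_total:.0%})")
print(f"Test: {n_test} ({n_test / n_total:.0%})")
```

With these settings the overall proportions come out to 72% train, 18% validation, and 10% test.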

### Create Function for Calculating Linear Output

This function takes the input data and the model parameters (an intercept and one slope per feature) and returns the mean squared error.

In [None]:
def my_simple_fit(X, y, intercept, slopes):
  """
  Calculate mean squared error for manual linear fit.
  """
  ycalc = intercept + slopes @ X.T
  err = mean_squared_error(y, ycalc)
  return err
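
The expression `intercept + slopes @ X.T` computes every prediction at once, and the error is $\mathrm{MSE} = \frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$. As a small check, the sketch below compares the vectorized form to an explicit loop on a made-up three-sample array (the numbers are illustrative only) and computes the MSE directly with NumPy.

```python
import numpy as np

# Made-up data: 3 samples, 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([3.0, 7.0, 12.0])
intercept = 0.5
slopes = np.array([1.0, 1.0])

# Vectorized: (2,) @ (2, 3) gives one prediction per sample
y_vec = intercept + slopes @ X.T

# Same calculation written out sample by sample
y_loop = np.array([intercept + sum(m * x for m, x in zip(slopes, row))
                   for row in X])

# Mean squared error written out by hand
mse = np.mean((y - y_vec) ** 2)
print(y_vec, y_loop, mse)
```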

In [None]:
# Define an intercept and slopes and calculate the fit
b = 0.0
m = np.array([1.0, 1.0])

err = my_simple_fit(X_train, Y_train, b, m)
print(f"Error is {err}")

In [None]:
# Once the parameters are optimal, compare to the validation set.
val_err = my_simple_fit(X_valid, Y_valid, b, m)
print(f"The validation set error is {val_err}")
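
Adjusting `b` and `m` by hand gets tedious quickly. For comparison, here is a minimal sketch of finding the least-squares parameters directly with `np.linalg.lstsq`. This is a different route from the manual search above, and the data is synthetic and noise-free so that the known answer can be recovered.

```python
import numpy as np

# Synthetic data generated from known parameters (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so the first parameter acts as the intercept
A = np.column_stack((np.ones(len(X)), X))

# Solve for the parameters that minimize the squared error
params, *_ = np.linalg.lstsq(A, y, rcond=None)
b_opt, m_opt = params[0], params[1:]
print(f"Intercept: {b_opt}, slopes: {m_opt}")
```

On the real data, values found this way could be passed to `my_simple_fit` as the intercept and slopes.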

### Use Python Tools and Compare

Let's let `scikit-learn` do the calculation and see how we did.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Create the linear regression object
regr = LinearRegression()

# Use the fit method to create the fit
regr.fit(X_train, Y_train)

# Print the coefficients
print(f"The intercept is {regr.intercept_} and the slopes are {regr.coef_}")

Now, check the calculated model with the validation data.

In [None]:
Y_pred = regr.predict(X_valid)
mse_pred = mean_squared_error(Y_valid, Y_pred)
print(f"The validation set MSE is {mse_pred}")

Finally, check the results against the test data.

In [None]:
Y_pred_test = regr.predict(X_test)
mse_pred_test = mean_squared_error(Y_test, Y_pred_test)
print(f"The test set MSE is {mse_pred_test}")

## Polynomial Regression to the Data

Choosing the linear parameters by hand was difficult. The data can be fit with a straight line, but is that optimal? Now, we will fit a polynomial to see how well that model describes the data.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
# Set up a quadratic polynomial
quad = PolynomialFeatures(degree=2)

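# Create quadratic x-data: fit the transformer on the training set only,
# then apply the same transform to the validation and test sets
Xq_train = quad.fit_transform(X_train)
Xq_valid = quad.transform(X_valid)
Xq_test = quad.transform(X_test)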

# Now use the new x-data in the linear regression
qregr = LinearRegression(fit_intercept=False)

qregr.fit(Xq_train, Y_train)

Yq_pred = qregr.predict(Xq_train)
mse_yq_train = mean_squared_error(Y_train, Yq_pred)

# Print the MSE for the training data
print(f"The training data MSE is {mse_yq_train}")
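
To make the transformed data less of a black box, the sketch below applies `PolynomialFeatures(degree=2)` to a single made-up sample with two features, a = 2 and b = 3. The output columns are 1, a, b, a², ab, b².

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up sample with two features: a = 2, b = 3
X_small = np.array([[2.0, 3.0]])

quad_demo = PolynomialFeatures(degree=2)
Xq_small = quad_demo.fit_transform(X_small)

# Columns are: 1, a, b, a^2, a*b, b^2
print(Xq_small)
```

The leading column of ones already plays the role of the intercept, which is one reason to pass `fit_intercept=False` to `LinearRegression` when fitting on these features.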

In [None]:
Yq_valid = qregr.predict(Xq_valid)
Yq_test = qregr.predict(Xq_test)

mse_yq_valid = mean_squared_error(Y_valid, Yq_valid)
mse_yq_test = mean_squared_error(Y_test, Yq_test)

# Print the validation and test results
print(f"The validation set MSE is {mse_yq_valid}")
print(f"The test set MSE is {mse_yq_test}")

## K-Fold Validation

The previous example demonstrated how each step works and tested whether the violent crime rate shows a quadratic dependence on the inputs. Now, we want to run cross-validation on the data set to see how the results vary with the choice of training and validation data.

In [None]:
from sklearn.model_selection import KFold

### Investigate KFold Results

The `KFold` object automatically generates the indices for splitting up our data. Iterating over `kf.split(...)` gives each combination of train and validation assignments in turn. We can then see how the error changes across the combinations.

In [None]:
nfolds = 5
kf = KFold(n_splits=nfolds)

err_array = np.zeros(nfolds)

n = 0
for train_index, valid_index in kf.split(X_train_valid):
  # Split up the data using KFold
  Xkf_train, Xkf_valid = X_train_valid[train_index], X_train_valid[valid_index]
  Ykf_train, Ykf_valid = Y_train_valid[train_index], Y_train_valid[valid_index]

  # Use linear regression to fit and measure the data
  regr.fit(Xkf_train, Ykf_train)
  err_array[n] = mean_squared_error(Ykf_valid, regr.predict(Xkf_valid))
  n += 1
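
It can help to see the indices themselves. This sketch runs `KFold` on ten placeholder samples (the array is illustrative only) and prints which indices land in each validation fold. Without `shuffle=True`, the folds are consecutive blocks.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder samples; only the number of rows matters here
X_demo = np.arange(10).reshape(-1, 1)

kf_demo = KFold(n_splits=5)
folds = [(tr.tolist(), va.tolist()) for tr, va in kf_demo.split(X_demo)]

for i, (tr, va) in enumerate(folds, start=1):
    print(f"Fold {i}: train={tr} valid={va}")
```

Every sample appears in exactly one validation fold and in the training portion of the other four.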

Now, let's look at the error across the folds.

In [None]:
perm_num = range(1, len(err_array)+1)

plt.scatter(perm_num, err_array)
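
The fold-by-fold loop can also be condensed with `cross_val_score`. The sketch below is one possible shortcut rather than the approach used in this notebook, and it runs on synthetic data. Note that scikit-learn reports this score as a negative MSE, so the sign is flipped before summarizing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic two-feature data with noise (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = (1.0 + 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1]
          + rng.normal(0, 1, size=100))

# One score per fold; the scorer returns negative MSE
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=KFold(n_splits=5),
                         scoring="neg_mean_squared_error")
mse_per_fold = -scores

print(f"MSE per fold: {mse_per_fold}")
print(f"Mean: {mse_per_fold.mean()}, Std: {mse_per_fold.std()}")
```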

So, the error is obviously dependent on the selection of data in the training and in the validation sets. What if we save all the fit parameters and look at those?

In [None]:
kf_intercepts = np.zeros(nfolds)
kf_slopes = np.zeros((nfolds, 2))

n = 0
for train_index, valid_index in kf.split(X_train_valid):
  # Split up the data using KFold
  Xkf_train, Xkf_valid = X_train_valid[train_index], X_train_valid[valid_index]
  Ykf_train, Ykf_valid = Y_train_valid[train_index], Y_train_valid[valid_index]

  # Use linear regression to fit and measure the data
  regr.fit(Xkf_train, Ykf_train)
  kf_intercepts[n] = regr.intercept_
  kf_slopes[n,:] = regr.coef_
  err_array[n] = mean_squared_error(Ykf_valid, regr.predict(Xkf_valid))
  n += 1

fig, ax = plt.subplots()
plt.scatter(range(1,nfolds+1), kf_intercepts)
plt.title('Intercepts')
plt.show()

In [None]:
fig, ax = plt.subplots()
plt.scatter(range(1,nfolds+1), kf_slopes[:,0])
plt.title('Slope for Nonviolent Crimes')
plt.show()

In [None]:
fig, ax = plt.subplots()
plt.scatter(range(1,nfolds+1), kf_slopes[:,1])
plt.title('Slope for Population Density')
plt.show()

### Model Results after KFold

Now, the fit parameters can be averaged across the folds, and the resulting model can be checked against the test set.

In [None]:
mean_intercept = kf_intercepts.mean()
mean_slopes = kf_slopes.mean(axis=0)

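# Build a model from the averaged parameters by setting the
# fitted attributes directly instead of calling fit()
new_regr = LinearRegression()
new_regr.intercept_ = mean_intercept
new_regr.coef_ = mean_slopes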

# Print the MSE for the test set
final_mse = mean_squared_error(Y_test, new_regr.predict(X_test))
print(f"The MSE for the test set after KFolds of {nfolds} is {final_mse}")
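
Averaging the parameters works here because the model is linear: predicting with the averaged intercept and slopes gives the same result as averaging the predictions of the individual fold models. The sketch below checks this with made-up parameters for three hypothetical folds.

```python
import numpy as np

# Made-up parameters from three hypothetical folds
intercepts = np.array([1.0, 2.0, 3.0])
slopes = np.array([[0.5, 1.0],
                   [1.5, 2.0],
                   [1.0, 0.0]])

# Two made-up samples to predict on
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Option 1: average the parameters, then predict
pred_from_mean_params = intercepts.mean() + X @ slopes.mean(axis=0)

# Option 2: predict with each fold's model, then average the predictions
per_model = intercepts[:, None] + slopes @ X.T  # shape (3 folds, 2 samples)
mean_of_preds = per_model.mean(axis=0)

print(pred_from_mean_params, mean_of_preds)
```

This equivalence would not generally hold for a model that is nonlinear in its parameters.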