# Assignment 2 - Scott Wilkinson (V00887986) 

## Q3: Finding Optimal p and k for KNN Regression using GridSearch

This notebook uses scikit-learn's GridSearchCV to find the optimal p and k for a KNN regression model that predicts the concentration of the light profiles of real galaxies from the Canada-France Imaging Survey, based on their non-parametric morphology data.

First, we must import the necessary Python packages.

In [1]:
# importing packages used in notebook
import numpy as np
import matplotlib.pyplot as plt
import pymysql, os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

## Importing Morphology Data

In [2]:
#Query SQL for morphology data
# connects to database
db = pymysql.connect(host = 'lauca.phys.uvic.ca', db = 'sdss', user = 'swilkinson', passwd = '123Sdss!@#')
x = 'SELECT  s.objID, asymmetry, shape_asymmetry, gini, m20, concentration, total_mass_med\
    FROM dr7_cfis_statmorph_newmask s, dr7_uberuber u\
    WHERE u.objID = s.objID\
    AND flag_morph = 0\
    AND asymmetry > -1\
    AND total_sfr_med>-5\
    AND total_mass_med > 7'
c = db.cursor()
c.execute(x)
db_data = c.fetchall()
c.close()
db.close()

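# save names as a string
names_morph = np.array(db_data, dtype = str).T[0]

# save the five morphology columns as floats, in query order:
# asymmetry, shape_asymmetry, gini, m20, concentration
morph = np.array(db_data, dtype = float).T[1:-1]

# last column of the query; note that despite the variable name,
# this holds total_mass_med and it is not used in the rest of the notebook
sfr = np.array(db_data, dtype = float).T[-1:]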

print(morph.shape)
print(sfr.shape)

(5, 147063)
(1, 147063)


## Normalizing the data

In [3]:
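# earlier variant, kept for reference: all morphology columns as inputs, last query column as target
#input_tr,input_va,target_tr, target_va = train_test_split(morph.T, sfr.T,test_size=0.25, shuffle = True)

# inputs: asymmetry, shape_asymmetry, gini, m20; target: concentration
# 75% of the galaxies go to training, 25% to validation
input_tr,input_va,target_tr, target_va = train_test_split(morph[:-1].T, morph[-1].T,test_size=0.25, shuffle = True)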

### normalizing inputs

# fit on training set
scaler = StandardScaler().fit(input_tr)  

# normalize training
input_tr_norm= scaler.transform(input_tr)

# normalize validation with the same scaler (fitted on the training set only)
input_va_norm= scaler.transform(input_va)
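
The scaling step above can be illustrated without scikit-learn. The following is a minimal sketch, assuming a single feature column with made-up values: the mean and standard deviation are learned from the training data only and then reused for the validation data. `StandardScaler` uses the population standard deviation, which is what `pstdev` computes; the helper names `fit_standardizer` and `standardize` are hypothetical.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Learn the mean and population standard deviation from training data."""
    return fmean(column), pstdev(column)

def standardize(column, mean, std):
    """Apply previously learned statistics to any data set."""
    return [(x - mean) / std for x in column]

# made-up values standing in for one morphology feature
train = [1.0, 2.0, 3.0, 4.0, 5.0]
valid = [3.0, 6.0]

# the statistics come from the training set only
mean, std = fit_standardizer(train)

train_norm = standardize(train, mean, std)
valid_norm = standardize(valid, mean, std)  # reuses the training statistics

print(train_norm)
print(valid_norm)
```

After this step the training column has zero mean and unit variance, while the validation column generally does not, because it is scaled with statistics it did not contribute to.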

## Find optimal p and k using GridSearchCV

GridSearchCV is a scikit-learn class that tries every combination of the parameter values you provide, scores each combination with cross-validation on the training data, and selects the combination with the best mean cross-validated score.
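
To give a sense of the cost of this exhaustive search, here is a small sketch that enumerates the candidates for the coarse grid used in the next cell and counts the model fits implied by 5-fold cross-validation. It only mimics the bookkeeping with `itertools.product`; it does not call scikit-learn, and the variable names `candidates` and `n_fits` are illustrative.

```python
from itertools import product

# same coarse grid as in the next cell
K = list(range(1, 26)) + [30, 35, 40, 45, 50, 55, 60, 75, 100, 200]
P = [1, 2]
n_folds = 5

# every (k, p) pair is one candidate model
candidates = [{"n_neighbors": k, "p": p} for k, p in product(K, P)]

# each candidate is fitted once per cross-validation fold
n_fits = len(candidates) * n_folds

print(len(candidates))  # 70
print(n_fits)           # 350
```

With the default `refit=True`, GridSearchCV performs one additional fit of the best candidate on the full training set.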

In [4]:
# list of k's to try; large range and somewhat coarse 
K = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,\
    21,22,23,24,25,30,35,40,45,50,55,60,75,100,200]

# Minkowski distance powers to test: p = 1 (Manhattan) and p = 2 (Euclidean)
P = [1,2]

param_grid = [{"n_neighbors": K, "p":P}]

KNN = KNeighborsRegressor()

gs = GridSearchCV(KNN, param_grid, cv = 5)

gs.fit(input_tr_norm, target_tr)

print(gs.best_params_)
print(gs.best_score_)

{'n_neighbors': 50, 'p': 2}
0.8926060914578967


GridSearchCV selects k = 50 and p = 2, with a mean cross-validated score of 0.8926 (R², the default score for a regressor). In the previous assignment, I manually identified k = 25 and p = 2 as the optimal values, but that search was severely limited by computation time and by the manual method.
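
For reference, the parameter p sets the power of the Minkowski distance that KNN uses to decide which neighbours are nearest. The sketch below is a plain-Python illustration with made-up points, not the scikit-learn implementation; the function name `minkowski` is hypothetical.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# two made-up points in a 2D feature space
a = (0.0, 0.0)
b = (3.0, 4.0)

d1 = minkowski(a, b, 1)  # Manhattan distance: |3| + |4|
d2 = minkowski(a, b, 2)  # Euclidean distance: sqrt(3^2 + 4^2)

print(d1)  # 7.0
print(d2)  # 5.0
```

Because the two metrics can rank neighbours differently, p is worth including in the grid alongside k.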

Next, I search a finer grid of k values around the optimum found above.

In [5]:
# a finer tuned list around 50
K = [45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65]

# test both, just in case
P = [1,2]

param_grid = [{"n_neighbors": K, "p":P}]

KNN = KNeighborsRegressor()

gs = GridSearchCV(KNN, param_grid, cv = 5)

gs.fit(input_tr_norm, target_tr)

print(gs.best_params_)
print(gs.best_score_)

{'n_neighbors': 53, 'p': 2}
0.8926494045740414


## Conclusion

The optimal k was found to be 53 with p = 2 (Euclidean distance). With these values the score is 0.89265, a very marginal improvement over the case of k = 50, which produces a score of 0.89261.

Thus, as an extra conclusion, I find that the score is insensitive to the exact number of neighbours near the optimum, so in the future, using steps of 5 or 10 in k in the grid search is probably OK.
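
The coarse-then-fine strategy used in this notebook can be sketched as follows. The two scores are copied from the outputs above; the helper `refine_grid` is a hypothetical convenience that lists every integer k within one coarse step of the coarse optimum, which is narrower than the 45 to 65 range actually searched here.

```python
# best scores reported by the two grid searches above
score_k50 = 0.8926060914578967
score_k53 = 0.8926494045740414

# absolute gain in score from refining k = 50 to k = 53
gain = score_k53 - score_k50

def refine_grid(best_k, step):
    """All integer k within one coarse step of best_k, never below 1."""
    return list(range(max(1, best_k - step), best_k + step + 1))

# the coarse grid used steps of 5 around k = 50
fine_K = refine_grid(50, 5)

print(f"{gain:.1e}")
print(fine_K)
```

A gain of this size is small compared with the score itself, which is consistent with the conclusion that a coarse grid is adequate for this data set.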