# Introduction

<span style="font-size: 1.1em;">
This is a guided walkthrough of how to optimize the support vector machine (SVM) classifier using GridSearchCV. The code usually takes about a day to run, though it can finish sooner with a smaller data set. You'll want to repeat this optimization for each type of classifier, and again whenever you change the training data, whether the classifier is one step of a multistep classification or a single multi-prediction classifier.
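
For readers who have not used it before, here is a minimal sketch of a GridSearchCV search over C and gamma for an RBF-kernel SVM. It uses scikit-learn directly on synthetic data from `make_classification`; the grid ranges and the fold count are illustrative assumptions, not the settings used inside fire_cv.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the spectra and labels (assumption, not SAGE-Spec data)
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# Log-spaced candidate values; these ranges are illustrative only
param_grid = {
    "C": np.logspace(-1, 3, 5),
    "gamma": np.logspace(-4, 0, 5),
}

# Try every (C, gamma) pair with 3-fold cross-validation
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)

best_c = search.best_params_["C"]
best_gamma = search.best_params_["gamma"]
print("Best C:", best_c, "Best gamma:", best_gamma)
```

The search cost grows with the number of grid points times the number of folds, which is one reason the full optimization described here can take so long.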

### Setting Up and Running the Classifier

These first few steps are drawn straight from the fire_svm notebook. If you want to know more about any of them, refer to that notebook.

In [5]:
cd /Users/RichardP/research/icyfire/py

/Users/RichardP/research/icyfire/py


In [24]:
import numpy as np
import warnings
import collections
warnings.filterwarnings("ignore")

import fire_data as dat
import fire_svm as clf
import fire_model as model
import fire_cv as cv
import fire_org as org

In [9]:
file_read = dat.file_read('/Users/RichardP/research/icyfire/data/sage5.fits')  
data = dat.data_to_pytorch(file_read.data)
name_labels = {}
counter = 1
for i in data.name_unique:
    name_labels[i] = counter
    counter += 1
data.relabelling(name_labels)
(sorted(data.__dict__))

['dflux', 'flux', 'label', 'name', 'name_unique', 'spectra', 'wave']

In [14]:
label_carb = org.multistep(data.label)
spectra_oxy, label_oxy = org.deletion(data.spectra, data.label)

### Optimization

<span style="font-size: 1.1em;">
For the actual optimization step, we first create some arrays. We have two parameters, C and gamma (we are only optimizing the RBF kernel because RBF is generally superior). We also keep an array for the loss, which is most useful when you are classifying more than two classes of objects.

In [20]:
c = np.array([])
gamma = np.array([])
svm_loss = np.array([])

<span style="font-size: 1.1em;">
In this next step we optimize the classifier. It is very important to do this for every classifier. In my case, with the SAGE-Spec data, I built two classifiers: one for carbon and the other for oxygen/RSG stars. The idea is to run the optimization 1000 times, each time with a different randomly selected training and testing set. If you optimize for a single training set, the resulting parameters may not hold for other training sets. By running the optimization 1000 times, we hope the results narrow down to a single set of parameters.

In [25]:
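for i in range(1000):
    # Draw a new random split on every pass (90 = percent of data used for training)
    training_carb, testing_carb, train_carb, test_carb = data.randomization(label_carb, data.spectra, 90)
    # Search for the best parameters on this particular split
    cross = cv.cross_validator(train_carb['x'], train_carb['y'], test_carb['x'], test_carb['y'])
    # Record the winning C, gamma, and loss for this run
    c = np.append(c, cross.parameter1)
    gamma = np.append(gamma, cross.parameter2)
    svm_loss = np.append(svm_loss, cross.loss)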

[(1264.8552168552958, 1), (2682.6957952797247, 1), (1526.4179671752318, 1)]
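
The loop above relies on `data.randomization` and `cv.cross_validator` from the project's own modules. As a rough sketch of the same idea using scikit-learn only, the example below redraws a 90/10 split several times, runs a grid search on each training set, and records the winning C and gamma. The synthetic data, the grid values, and the choice of 5 repeats (instead of 1000) are assumptions made to keep it quick to run; it is not the implementation of fire_cv.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the spectra and labels (assumption)
X, y = make_classification(n_samples=100, n_features=8, random_state=1)

# Small illustrative grid (assumption)
param_grid = {"C": [1.0, 10.0, 100.0], "gamma": [1e-3, 1e-2, 1e-1]}

best_c, best_gamma, test_score = [], [], []
for seed in range(5):
    # New random 90% training / 10% testing split on every pass
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.9, random_state=seed, stratify=y
    )
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
    search.fit(X_train, y_train)
    best_c.append(search.best_params_["C"])
    best_gamma.append(search.best_params_["gamma"])
    # Accuracy of the refit best model on the held-out 10%
    test_score.append(search.score(X_test, y_test))

print(list(zip(best_c, best_gamma)))
```

Collecting the results in Python lists and converting to an array once at the end avoids the repeated copying that `np.append` does inside a long loop.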


<span style="font-size: 1.1em;">
There are many ways to approach this next step. Our optimizer produced a distribution of the best parameters it chose across the random data splits, but there are two parameters to settle. The C parameter increases on a log scale (the nature of the parameter), while gamma scales by a factor of 10. You could simply take the most common gamma from the distribution and, separately, the median or mean of C. However, the two parameters are somewhat linked. Instead, find the most common gamma and look only at the C values that were chosen together with that gamma. From those, select the most common one.

In [34]:
# Three most common gamma values and how often each was chosen
x = collections.Counter(gamma).most_common(3)
print("Gamma values: " + str(x[0][0]) + ", " + str(x[1][0]) + ", " + str(x[2][0]))
print("Counts for Gamma: " + str(x[0][1]) + ", " + str(x[1][1]) + ", " + str(x[2][1]))
# Keep only the C values that were chosen together with the most common gamma
mask = np.where(gamma == x[0][0])
y = c[mask]
z = collections.Counter(y).most_common(3)
print("C value: " + str(z[0][0]) + ", " + str(z[1][0]) + ", " + str(z[2][0]))
print("Counts for C: " + str(z[0][1]) + ", " + str(z[1][1]) + ", " + str(z[2][1]))

Gamma values: 0.0001, 0.0001, 0.0001
Counts for 0.0001: 3
C value: 1264.85521686
Counts for 1264.85521686: 1
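
The selection rule described above (most common gamma first, then the most common C among the runs that chose that gamma) can be wrapped in a small helper. This is a sketch only: `pick_parameters` is a hypothetical name, and the input values are made up to show a case where choosing C and gamma independently would give a different answer.

```python
from collections import Counter

def pick_parameters(c_values, gamma_values):
    """Return (C, gamma): the most common gamma, then the most common C paired with it."""
    best_gamma = Counter(gamma_values).most_common(1)[0][0]
    # Keep only the C values from runs that selected the most common gamma
    paired_c = [c for c, g in zip(c_values, gamma_values) if g == best_gamma]
    best_c = Counter(paired_c).most_common(1)[0][0]
    return best_c, best_gamma

# Made-up results from seven hypothetical optimization runs
c_runs = [1000.0, 1000.0, 1000.0, 10.0, 10.0, 100.0, 1.0]
gamma_runs = [0.01, 0.01, 0.01, 0.001, 0.001, 0.001, 0.001]

print(pick_parameters(c_runs, gamma_runs))  # (10.0, 0.001)
```

Here the most common C overall is 1000.0, but every one of those runs chose gamma = 0.01. Restricting to the most common gamma (0.001) gives C = 10.0 instead, which is why the two parameters are selected jointly. The helper compares floats exactly, which works when the optimizer picks from a fixed grid of values.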


Now you have optimized parameters that you can plug into your classifier.

<span style="font-size: 1.1em;">
One possible source of error is that we trained with 90% of the data, so the parameters could shift slightly once the classifier is given all 100% of the data. This is only a possibility, and not one that I think will drastically change the results.