# CSE 5243 - Introduction to Data Mining
## Homework 2: Classification
- Semester: Spring 2021
- Instructor: Vedang Patel
- Section: Tuesday/Thursday 9:35AM
- Student Name: Ryan Stuckey
- Student Email: stuckey.87@osu.edu
- Student ID: 500201211

Template Version V2.(Adopted from Prof. Tom Bihari's version)
***

**Instructions and Helpful Hints:**
- Consider putting all of your "discussion" text in markdown cells, not inline with code. That gives you more control over formatting. Markdown cheat sheet: https://www.markdownguide.org/cheat-sheet
- Explain what you are doing, and why.  Explain what you found out or learned.
- *Make sure you run your entire workbook before handing it in, so the output cells are populated.*
- Follow the Section structure as much as possible - put your content where it is requested, so we can find your answers.
- If you have questions on expectations or need clarification or guidance, please ask.  Post to Piazza if it is a general question, so everyone benefits.
- The class label for this exercise is IsitDay.

***
# Section: Overview
- Insert a short description of the scope of this exercise, any supporting information, etc.
***

For this homework, I will be training and testing three different classifiers to determine how we can predict whether it is night or day for SmartBike. The three classifiers I am using are KNN, Decision Trees, and SVM. Finally, at the end, I will analyze and compare each classifier and choose the best one. After that, I will test the final one on the test data to see how it works. 

I put larger functions inside hw2.py to keep this Jupyter notebook clean and clutter free. In addition, hw2.py also contains all the functions and transformations that I used in homework 1. The code that I left here in the Jupyter notebook includes anything that I felt was important to show what computations and actions I was taking.

***
# Section: Setup
- Add any needed imports, helper functions, etc., here.
***

In [1]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from statistics import mean

import pandas as pd
import texttable as tt

import hw2

train_data = hw2.data('altered_seoulbikedata_train.csv')

---
***
# Section: 1 - Evaluation Method
- Define measures for evaluating the classification models you develop.  Explain why the measures you choose provide a useful view into the value and usefulness of the model you eventually chose for the company to use.  Define two types:
***

***
## Section: 1.1 - Define measures that do not include the cost information
- (e.g., confusion matrices, accuracy, precision, recall, F-measures, etc.).
***

### Confusion Matrix
- simple representation that shows number of true positives, false positives, true negatives, and false negatives
- true positives and true negatives are the desired outcomes and we want these numbers to be maximized
- used in a wide variety of different measures

### Accuracy
- just a ratio of the correct classifications to the total number of classifications
- can be problematic and misleading in certain instances if a large amount of the data belongs to a single class

### Precision
- ratio over number of true positive classifications over the total number of predicted positive classifications (both true and false positives)
- gives us a better idea of how well our model performed in identifying true positive cases (regardless of how much data belongs to either class) and how our model does at avoiding false positives

### Recall/Sensitivity/True Positive Rate
- similar to precision, but looks at actual positive classification instead
- recall is a ratio of number of true positive classifications to total actual positive classifications

### Specitivity/True Negative Rate
- like recall, but looks at the actual negative classification
- ratio of number of true negatives to total actual negative classifications

### False Positive Rate
- ratio of number of false positives over total number of true negatives and false positives
- also just (1 - specitivity) or &#945;

### False Negative Rate
- ratio of number of false negatives over total number of false negatives and true positives
- also just (1 - sensitivity) or &#946;

### Power
- same thing as recall/sensitivity

### F-measure
- combines precision and recall measure
- calculate with (2\*precision\*recall)/(precision+recall)

### Type I and Type II Errors
- Type I error is an error in which we reject null hypothesis even though it is true (or a false positive)
- &#945; is the probability that we have a Type I error
- Type II error is where we accept a nyull hypothesis even though it is false (or a false negative)
- &#946; is the probability that we have a Type II error

### Scoring Choices
To evaluate my models, I chose **F-measure, accuracy, recall, and precision**. I chose F-measure because I think it will provide a good overview of how well my model did. Accuracy was chosen because it is straight forward and easy to compute; even though it can be misleading, I still wanted to compute it to see what it was. I chose recall and precision because those will better let me know how good my model is doing at identifying positive case. Precision here is important, as that will tell us how good our model is doing at avoiding a positive label when the label is actually negative. If our model predicts a true value of IsitDay when it is actually false, then the bike's headlights would turn off at nighttime.

In [2]:
# Initialize list of evaluation methods to use later
scoring=['accuracy', 'f1', 'precision', 'recall']

***
## Section: 1.2 - Define measures that do include the cost information
- (e.g., using cost matrices).
***

### Cost Matrix
- just gives the cost of each potential outcome (i.e., true postive, false positive, true negative, false negative)
- calculate total cost of model by multiplying each outcome by the cost for that outcome
- can be used to emphasize that some classifications are extremely unfavorable compared to others
- an example of this would be a self-driving care when it is stopped at a stop sign and deciding if a car is coming and it should wait
    - true positive is when the car decides a car is coming and a car is actually coming, true negative is when car decides a car is not coming and a car is not actually coming
    - in this case, a false negative would be very undesirable, as it could lead to a potential collision (e.g., car decides it is safe to move when a car is coming)
    - to address this, we can give a high cost to false negatives while giving a low cost to false postive (i.e., better to wait than to risk a crash)

Below, I create a function to compute the cost of a single prediction:

In [3]:
def compute_cost(conf_matrix):
    tp_cost, fp_cost, tn_cost, fn_cost = 0, 10, 0, 1
    return conf_matrix[0,0] * tp_cost + conf_matrix[0,1] * fn_cost + conf_matrix[1,0] * fp_cost + conf_matrix[1,1] * tn_cost

---
***
# Section: 2 - Pre-Processing of the Dataset
- Use the altered_seoulbikedata_train.  Split it into a Training dataset and a validation dataset.  Keep them separate and use the Training dataset for training/tuning and the validation dataset for hyperparameter tuning. Or you can use cross validation - https://scikit-learn.org/stable/modules/cross_validation.html.
***

For my training and validation sets, I am going to use cross validation (CV) as mentioned above using SciKit Learn's [cross_val_score method](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score). This step is done later after instantiating my classfiers. 

With this method, you can fit the data using the given classifier, training set, and target set. In addition, you specify how many folds to use as well as the evaluation method to use from [this list](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter). The method returns an array with the score for each run of CV.

***
## Section: 2.1 - Revise the dataset
- Review the meanings of the attributes and consider removing redundant or (likely) irrelevant attributes, combining attributes, etc., to reduce the number of attributes.
- (You may choose to use techniques such as those you used in Homework 1 to analyze the impacts of individual attributes on the CLASS attribute, but you need not do a “deep” analysis.)
- Describe what you chose to do (and not do), and why.
***

### Transformations
- For my initial revisions, I did applied the same transformations that I used in homework 1. My transformations involved removing outliers (using 1.5xIQR method) and removing and potentially erroneous data. These revisions included:
    1. Identify and remove temperature outliers. For this, I took outliers in each day, as temperaure will widely vary throughout the year (e.g., temperatures in summer vs. temperatures in winter) but varies less throughout the day.
    2. Identify and remove dew point outliers. Even though I am removing the dew point attribute (due to its low correlation with our class attribute, IsitDay), I want to remove outliers for it because I use it to check for erroneous humidity values in the next step.
    3. Check for erroneous humidity outliers. To do this, I check if the recorded humidity is within a certain range of my calculate humidity. I calculate the humidity using the formula found in the function hw2.humidty in (hw2.py). If the acceptable humidity varies by &#177;5, then I will remove that data entry.
    4. Finally, I remove any extra/irrelevant attributes. These are attributes I found to be uncessary in predicitng IsitDay. To determine if their relevance, I took their correlation with IsitDay and reasoned with logic on whether or not they were important. The attributes I ended up removing were Visibility, Dew point temperature, Rainfall, Snowfall, Seasons, Holiday, Functioning day, and Date.

In [4]:
print('--Transformation Summary--')
target = hw2.hw1_transformation(train_data)
train_data.describe()

--Transformation Summary--
+---------------------+-------+
| Elimination Cause   | Count |
+---------------------+-------+
| Temperature Outlier | 270   |
+---------------------+-------+
| Humidity Error      | 78    |
+---------------------+-------+
| Total Eliminations  | 348   |
+---------------------+-------+


Unnamed: 0,Rented Bike Count,Temperature(C),Humidity(%),Wind speed (m/s),Solar Radiation (MJ/m2)
count,7010.0,7010.0,7010.0,7010.0,7010.0
mean,711.702282,12.894479,58.472896,1.716933,0.567655
std,651.26385,11.904562,20.15385,1.03529,0.866163
min,0.0,-17.8,11.0,0.0,0.0
25%,193.0,3.5,43.0,0.9,0.0
50%,509.0,13.6,57.0,1.5,0.01
75%,1074.0,22.5,74.0,2.3,0.93
max,3556.0,39.4,98.0,7.4,3.52


***
## Section: 2.2 - Transform the attributes
- Consider transforming the remaining attributes (e.g., one hot encoding in case python classification models does not support nomial attribute), normalizing / scaling values, encoding labels (if necessary), etc.
- Describe what you chose to do (and not do), and why.
***

### Additional Transformations
- The only additional transformation I found necessary was to normalize my data via SciKit's MinMaxScaler class.
- With the attributes I had leftover, I did not need to do anything like one hot encoding or encoding of labels because all values I had leftover were numerical.

The results from scaling can be seen below. Notice how all values now range from 0 to 1 in comparison to the data printed in Section 1.2.

In [5]:
scaler = MinMaxScaler()
scaler.fit(train_data)
train_data = pd.DataFrame(scaler.transform(train_data), columns=train_data.columns)
train_data.describe()

Unnamed: 0,Rented Bike Count,Temperature(C),Humidity(%),Wind speed (m/s),Solar Radiation (MJ/m2)
count,7010.0,7010.0,7010.0,7010.0,7010.0
mean,0.200141,0.536617,0.545665,0.232018,0.161266
std,0.183145,0.208122,0.231653,0.139904,0.246069
min,0.0,0.0,0.0,0.0,0.0
25%,0.054274,0.372378,0.367816,0.121622,0.0
50%,0.143138,0.548951,0.528736,0.202703,0.002841
75%,0.302025,0.704545,0.724138,0.310811,0.264205
max,1.0,1.0,1.0,1.0,1.0


---
***
# Section: 3 - Evaluation of the Off-The-Shelf KNN Classifier
- Select the KNN classifier from the SciKit Learn library and run it on the dataset.
***

***
## Section: 3.1 - Configure the off-the-shelf KNN classifier
- Use the KNeighborsClassifier from the SciKit Learn library
- Explain all setup, parameters and execution options you chose to set, and why.
***

The definition for SciKit's KNeighborsClassifier class is:

```
KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
```

The parameters that are important or that I'll be changing are described below:
- n_neighbors- number of nearest neighbors to compute; default = 5
- weights- weighting function to use when looking at points in neighborhood; default = 'uniform' but 'distance' or a user defined weighting function can also be passed
- algorithm- algorithm used to compute nearest neighbors; default = 'auto' but other options include 'ball_tree', 'kd_tree', or 'brute'
- metric- method to use for measuring distance; acceptable values can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric) or a user-defined function can be passed in; default = Minkowski, which has the formula sum(|x - y|^p)^(1/p)
- p- power to use for computing Minkowski metric and refers to the *p* value in the formula above; p=1 is Manhattan distance and p=2 is Euclidean distance
- metric_params- additionaly parameters for the metric formula passed in

### KNN Instantiation
For my KNN classifier, I am going to be trying 10 different classifiers, split into two groups. Group 1 will contain 5 classifiers, look for 1 to 5 nearest neighbors, and use Manhattan distance. Group 2 will also contain 5 classfiers, look for 1 to 5 nearest neighbors, and use Euclidean distance. 

I am not quite sure where to start when choosing my value for k, so I wanted to test a considerable range of values. If I make it too small, my classifier will be too sensitive to noise points. However, if I make it too big, my classifier may find data from other classes.

Additionally, I'll be using cross validation with SciKit's cross_val_score and cross_validate methods to train and validate my model. cross_validate is just like cross_val_score, but it allows multiple types of evaluation measures to be used. 

In [6]:
min_neighbors, max_neighbors = 1,5
knn_manhat = [KNeighborsClassifier(n_neighbors=x, metric='manhattan') for x in range(min_neighbors, max_neighbors + 1)]
knn_euclid = [KNeighborsClassifier(n_neighbors=x, metric='euclidean') for x in range(min_neighbors, max_neighbors + 1)]

***
## Section: 3.2 - Run and evaluate the classifier
- Try several values of the K parameter and compare the results.
- Evaluate the performance of the classifier, using the evaluation method you defined above.
***

In [7]:
knn_manhat_scores = [cross_validate(estimator=knn_manhat[i], X=train_data, y=target, cv=5, scoring=scoring) for i in range(0, len(knn_manhat))]
knn_manhat_cost_scores = [compute_cost(confusion_matrix(target, cross_val_predict(knn_manhat[i], train_data, target, cv=5)))  for i in range(0, len(knn_manhat))]

knn_euclid_scores = [cross_validate(estimator=knn_euclid[i], X=train_data, y=target, cv=5, scoring=scoring) for i in range(0, len(knn_euclid))]
knn_euclid_cost_scores = [compute_cost(confusion_matrix(target, cross_val_predict(knn_euclid[i], train_data, target, cv=5)))  for i in range(0, len(knn_euclid))]

table_knn_manhat = tt.Texttable()
table_knn_manhat.add_rows([['Results for Manhattan KNN', '', '', '', '', ''],
                ['Value of K', 'Accuracy', 'F-measure', 'Recall', 'Precision', 'Cost']
])

for i in range(0, len(knn_manhat_scores)):
    table_knn_manhat.add_row( ['k=' + str(i+1), mean(knn_manhat_scores[i]['test_accuracy']), mean(knn_manhat_scores[i]['test_f1']), mean(knn_manhat_scores[i]['test_recall']), mean(knn_manhat_scores[i]['test_precision']), knn_manhat_cost_scores[i] ])
print(table_knn_manhat.draw())
print()

table_knn_euclid = tt.Texttable()
table_knn_euclid.add_rows([['Results for Euclidean KNN', '', '', '', '',''],
                ['Value of K', 'Accuracy', 'F-measure', 'Recall', 'Precision', 'Cost']
])
for i in range(0, len(knn_euclid_scores)):
    table_knn_euclid.add_row( ['k=' + str(i+1), mean(knn_euclid_scores[i]['test_accuracy']), mean(knn_euclid_scores[i]['test_f1']), mean(knn_euclid_scores[i]['test_recall']), mean(knn_euclid_scores[i]['test_precision']), knn_euclid_cost_scores[i] ])
print(table_knn_euclid.draw())

+---------------------------+----------+-----------+--------+-----------+------+
| Results for Manhattan KNN |          |           |        |           |      |
| Value of K                | Accuracy | F-measure | Recall | Precision | Cost |
+---------------------------+----------+-----------+--------+-----------+------+
| k=1                       | 0.902    | 0.889     | 0.859  | 0.922     | 4774 |
+---------------------------+----------+-----------+--------+-----------+------+
| k=2                       | 0.897    | 0.876     | 0.795  | 0.976     | 6642 |
+---------------------------+----------+-----------+--------+-----------+------+
| k=3                       | 0.916    | 0.902     | 0.852  | 0.959     | 4876 |
+---------------------------+----------+-----------+--------+-----------+------+
| k=4                       | 0.904    | 0.886     | 0.812  | 0.975     | 6116 |
+---------------------------+----------+-----------+--------+-----------+------+
| k=5                       

***
## Section: 3.3 - Evaluate the choice of the KNN classifier
- What characteristics of the problem and data made KNN a good or bad choice?
***

Overall, KNN seems to work reasonably well for the data set. However, this is my initial findings as I have not yet run tests with the other two classifiers.

KNN might not be the best fit for this dataset as the numbers are all over the place. Additionally, we have a larger number of dimensions here with 5 different attributes, so this could cause problems for the model when we start trying to use it to predict the class of new data. Our attributes also are across a wide variety of various measurement systems, so the data varies pretty dramatically in magnitude. I fixed this with normalization, but it is still something to keep in mind when choosing the final classifier.

---
***
# Section: 4 - Evaluation of Off-The-Shelf Classifier #2
- As with the KNN classifier above, choose another classifier from the SciKit Learn library (Decision Tree, SVM, Logistic Regression, etc.) and run it on the dataset.
***

***
## Section: 4.1 - Configure the classifier
- Use the appropriate classifier from the SciKit Learn library.
- Explain all setup, parameters and execution options you chose to set, and why.
***

For my second classifier, I am using a decision tree. The definition for SciKit's decision tree constructor is:

```
DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, ccp_alpha=0.0)
```

Some of the important features or the features I will be changing include:
- criterion- function to use to measure the quality of a split; default = 'gini' for Gini impurity but can also use 'entropy' for information gain
- splitter- strategy that chooses the split at each node; default = 'best' to choose the best split or use 'random' for the best random split
- max_depth- maximum depth of the tree; default = None so the tree will be expanded until all leaves are pure or all leaves contain less than *min_samples_split* samples (see below)
- min_samples_split- minimum number of samples needed to split a node; default = 2;

### Decision Tree Instantiation

For my decision tree, I will be trying four different classifiers with different hyperparameter settings. Two trees will be use Gini index to measure splits, and the other two will use entropy. Additionally, one from each group will use the 'best' split strategy while the other classifier in each group will use the 'random' strategy.

In [8]:
trees = {
    'gini': {
        'best': DecisionTreeClassifier(criterion='gini', splitter='best'),
        'random': DecisionTreeClassifier(criterion='gini', splitter='random')
    },
    'entropy': {
        'best': DecisionTreeClassifier(criterion='entropy', splitter='best'),
        'random': DecisionTreeClassifier(criterion='entropy', splitter='random')
    }
}

***
## Section: 4.2 - Run and evaluate the classifier
- Try several values of the parameters (if appropriate) and compare the results.
- Evaluate the performance of the classifier, using the evaluation method you defined above.
***

In [9]:
tree_scores = {
    'gini': {
        'best': cross_validate(estimator=trees['gini']['best'], X=train_data, y=target, cv=5, scoring=scoring),
        'random': cross_validate(estimator=trees['gini']['random'], X=train_data, y=target, cv=5, scoring=scoring)
    },
    'entropy': {
        'best': cross_validate(estimator=trees['entropy']['best'], X=train_data, y=target, cv=5, scoring=scoring),
        'random': cross_validate(estimator=trees['entropy']['random'], X=train_data, y=target, cv=5, scoring=scoring)
    }
}
tree_cost_scores = {
    'gini': {
        'best': compute_cost(confusion_matrix(target, cross_val_predict(trees['gini']['best'], train_data, target, cv=5))),
        'random': compute_cost(confusion_matrix(target, cross_val_predict(trees['gini']['random'], train_data, target, cv=5)))
    },
    'entropy': {
        'best': compute_cost(confusion_matrix(target, cross_val_predict(trees['entropy']['random'], train_data, target, cv=5))),
        'random': compute_cost(confusion_matrix(target, cross_val_predict(trees['entropy']['random'], train_data, target, cv=5)))
    }
}

table_trees = tt.Texttable()
table_trees.add_row(['Decision Tree Results', '', '', '', '', '', ''])
table_trees.add_row(['Quality Criterion', 'Splitter', 'Accuracy', 'F-measure', 'Recall', 'Precision', 'Cost'])

table_trees.add_row(['Gini', 'Best', mean(tree_scores['gini']['best']['test_accuracy']), mean(tree_scores['gini']['best']['test_f1']), mean(tree_scores['gini']['best']['test_recall']), mean(tree_scores['gini']['best']['test_precision']), tree_cost_scores['gini']['best']])

table_trees.add_row(['Gini', 'Random', mean(tree_scores['gini']['random']['test_accuracy']), mean(tree_scores['gini']['random']['test_f1']), mean(tree_scores['gini']['random']['test_recall']), mean(tree_scores['gini']['random']['test_precision']), tree_cost_scores['gini']['random']])

table_trees.add_row(['Entropy', 'Best', mean(tree_scores['entropy']['best']['test_accuracy']), mean(tree_scores['entropy']['best']['test_f1']), mean(tree_scores['entropy']['best']['test_recall']), mean(tree_scores['entropy']['best']['test_precision']), tree_cost_scores['entropy']['best']])

table_trees.add_row(['Entropy', 'Random', mean(tree_scores['entropy']['random']['test_accuracy']), mean(tree_scores['entropy']['random']['test_f1']), mean(tree_scores['entropy']['random']['test_recall']), mean(tree_scores['entropy']['random']['test_precision']), tree_cost_scores['entropy']['random']])

print(table_trees.draw())

+----------------+----------+----------+-----------+--------+-----------+------+
| Decision Tree  |          |          |           |        |           |      |
| Results        |          |          |           |        |           |      |
+----------------+----------+----------+-----------+--------+-----------+------+
| Quality        | Splitter | Accuracy | F-measure | Recall | Precision | Cost |
| Criterion      |          |          |           |        |           |      |
+----------------+----------+----------+-----------+--------+-----------+------+
| Gini           | Best     | 0.921    | 0.914     | 0.915  | 0.913     | 3065 |
+----------------+----------+----------+-----------+--------+-----------+------+
| Gini           | Random   | 0.918    | 0.911     | 0.909  | 0.913     | 3071 |
+----------------+----------+----------+-----------+--------+-----------+------+
| Entropy        | Best     | 0.925    | 0.918     | 0.917  | 0.919     | 3081 |
+----------------+----------

***
## Section: 4.3 - Evaluate the choice of the classifier
- What characteristics of the problem and data made the classifier a good or bad choice?
***

I think this classifier is an great choice to use in this situation as it uses basic yes or no questions to come to a conclusion, much in the same way that a human would ask to determine if it is dark outside (e.g., is the sun out? are there a lot of bikes rented out? what is the temperature?).

The results here show that decision tree classification is a great classifier for this problem when comparing it to KNN in **Section 3**. While most metrics remained the same, the cost improved significantly.

---
***
# Section: 5 - Evaluation of Off-The-Shelf Classifier #3
- As with the KNN classifier above, choose another classifier from the SciKit Learn library (Decision Tree, SVM, Logistic Regression, etc.) and run it on the dataset.
***

***
## Section: 5.1 - Configure the classifier
- Use the appropriate classifier from the SciKit Learn library.
- Explain all setup, parameters and execution options you chose to set, and why.
***

For my third classifier, I decided to use SVM. SciKit's definition for SVM is:

```
SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=- 1, decision_function_shape='ovr', break_ties=False, random_state=None)
```

The parameters that I will be using or that are common for SVM/SCV are:
- C- regularization parameter; controls how much SVM focuses on a smooth decision boundary vs classifying every training point correctly; larger values of C mean we focus more on getting every single training point correctly classified; default = 1
- kernel- type of function to use in SVM; default = 'rbf' or radial basis function, but other acceptable values include 'linear', 'poly', 'sigmoid', and 'precomputed'
- degree- degree of the polynomial function specified in *kernel*, but only used if *kernel*='poly'; default = 3
- gamma- kernel coefficient used in 'rbf', 'poly', and 'sigmoid' kernels; default = 'scale'

As I have never used SVM before, I am not quite sure where to begin with the parameter setting. I will start with the default parameters for one classifier instance. Additionally, I will make at least two more instances, some with polynomial kernels (with varying degree) and some with linear kernels.

Finally, for choosing *C*, I will use run multiple iterations using multiples of 10 while leaving all the other parameters constant and then look at the best results from that.

In [10]:
svm_dict = { 'linear': {}, 'poly': {}, 'rbf': {}, 'sigmoid': {} }
for i in range (-2, 3):
    svm_dict['linear'][str(10 ** i)] = SVC(C=10 ** i, kernel='linear', gamma='scale')
for i in range (-2, 3):
    svm_dict['poly'][str(10 ** i)] = SVC(C=10 ** i, kernel='poly', gamma='scale')
for i in range (-2, 3):
    svm_dict['rbf'][str(10 ** i)] = SVC(C=10 ** i, kernel='rbf', gamma='scale')
for i in range (-2, 3):
    svm_dict['sigmoid'][str(10 ** i)] = SVC(C=10 ** i, kernel='sigmoid', gamma='scale')

***
## Section: 5.2 - Run and evaluate the classifier
- Try several values of the parameters (if appropriate) and compare the results.
- Evaluate the performance of the classifier, using the evaluation method you defined above.
***

In [11]:
svm_scores, svm_cost_scores, svm_saved_scores = hw2.init_section5_2_dictionaries()

for kernel in svm_dict.keys():
    for c in svm_dict[kernel].keys():
        svm_scores[kernel][c] = cross_validate(estimator=svm_dict[kernel][c], X=train_data, y=target, cv=5, scoring=scoring)
        svm_cost_scores[kernel][c] = compute_cost(confusion_matrix(target, cross_val_predict(svm_dict[kernel][c], train_data, target, cv=5)))
        hw2.update_min_max(svm_scores, svm_cost_scores, svm_saved_scores, kernel, c)

final_min_max = hw2.find_absolute_min_max(svm_saved_scores)
svm_table = hw2.get_svm_results(final_min_max)
print(svm_table.draw())

+-----------+-----------+---------+-----------+-----------+---------+----------+
| SVM       |           |         |           |           |         |          |
| Results   |           |         |           |           |         |          |
+-----------+-----------+---------+-----------+-----------+---------+----------+
| Test      | Kernel    | C (min) | Value     | Kernel    | C (max) | Value    |
|           | (min)     |         | (min)     | (max)     |         | (max)    |
+-----------+-----------+---------+-----------+-----------+---------+----------+
| Accuracy  | sigmoid   | 100     | 0.634     | rbf       | 100     | 0.938    |
+-----------+-----------+---------+-----------+-----------+---------+----------+
| F-measure | sigmoid   | 0.010   | 0.491     | rbf       | 100     | 0.931    |
+-----------+-----------+---------+-----------+-----------+---------+----------+
| Precision | sigmoid   | 100     | 0.602     | sigmoid   | 0.010   | 0.992    |
+-----------+-----------+---

As seen above, the results were all over the place. However, using a radial basis function kernel with a C of 100 seems to minimize cost and maximize other measures. Beacuse of this, I wanted to look at the data for the RBF kernel with a C of 100, 10, 1, 0.1, and 0.01.

In [12]:
rbf_table = hw2.get_svm_rbf_table(svm_scores, svm_cost_scores)
print(rbf_table.draw())

+--------------------+-------+-------+-------+-------+--------+
| RBF Kernel Results |       |       |       |       |        |
+--------------------+-------+-------+-------+-------+--------+
| Test               | C=100 | C=10  | C=1   | C=0.1 | C=0.01 |
+--------------------+-------+-------+-------+-------+--------+
| Accuracy           | 0.938 | 0.934 | 0.920 | 0.890 | 0.843  |
+--------------------+-------+-------+-------+-------+--------+
| F-measure          | 0.931 | 0.926 | 0.908 | 0.868 | 0.796  |
+--------------------+-------+-------+-------+-------+--------+
| Precision          | 0.955 | 0.954 | 0.955 | 0.963 | 0.985  |
+--------------------+-------+-------+-------+-------+--------+
| Recall             | 0.908 | 0.900 | 0.866 | 0.790 | 0.667  |
+--------------------+-------+-------+-------+-------+--------+
| Cost               | 3087  | 3370  | 4441  | 6848  | 10732  |
+--------------------+-------+-------+-------+-------+--------+


***
## Section: 5.3 - Evaluate the choice of the classifier
- What characteristics of the problem and data made the classifier a good or bad choice?
***

When finding the correct parameters for SVM, it worked considerably good for our data. SVM with an RBF kernel worked extremely well because the decision boundary can be made to fit the data better. This is important for this problem, as the values in our dataset were all over the place and many of them had a considerably large range.

However, a potential problem with using RBF and a C of 100 is that we risk overfitting the data with such a large C. As we increase C, SVM will focus more on correctly classifying every single data point instead of creating a smooth hyperplane. To address this, we could compromise and find a slightly lower C value. A candidate for this would be C=1 or C=10.

Overall, SVM is a great classifier to use because the decision boundary is flexible and can work around the data.

***
# Section: 6 - Comparison of the Three Classifiers
***

***
## Section: 6.1 - Compare the performance of these classifiers to each other
- What are their strong and weak points? Configure the classifier.
***

To start my comparison, I will take two classifier instances from each classifier that I used and tabulate the values:

In [13]:
clfs_cmp = hw2.get_classifier_cmp_table(knn_manhat_scores, knn_manhat_cost_scores, knn_euclid_scores, knn_euclid_cost_scores, tree_scores, tree_cost_scores, svm_scores, svm_cost_scores)
print(clfs_cmp.draw())

+---------------------------+----------+-----------+-----------+--------+------+
| Classifier Comparison     |          |           |           |        |      |
+---------------------------+----------+-----------+-----------+--------+------+
| Classifier                | Accuracy | F-measure | Precision | Recall | Cost |
+---------------------------+----------+-----------+-----------+--------+------+
| KNN (Manhattan, k=3)      | 0.916    | 0.902     | 0.959     | 0.852  | 4876 |
+---------------------------+----------+-----------+-----------+--------+------+
| KNN (Euclidean, k=3)      | 0.915    | 0.903     | 0.953     | 0.858  | 4715 |
+---------------------------+----------+-----------+-----------+--------+------+
| Decision Tree (Gini,      | 0.921    | 0.914     | 0.913     | 0.915  | 3065 |
| Best)                     |          |           |           |        |      |
+---------------------------+----------+-----------+-----------+--------+------+
| Decision Tree (Entropy,   

And now, going over the pros and cons of each:


In [14]:
pros_cons_table = hw2.get_final_classifier_cmp_table()
print(pros_cons_table.draw())

+-------------------+-------------------+-------------------+------------------+
| Classifier        |                   |                   |                  |
| Comparison        |                   |                   |                  |
+-------------------+-------------------+-------------------+------------------+
| Classifier        | Pros              | Cons              | Summary          |
+-------------------+-------------------+-------------------+------------------+
| KNN               | High scores on    | Very large cost   | Overall, not     |
|                   | precision,        | compared to       | bad. But,        |
|                   | accuracy          | others, costly to | compared to our  |
|                   |                   | compute and run   | other            |
|                   |                   |                   | classifiers, we  |
|                   |                   |                   | could definitely |
|                   |       

***
## Section: 6.2 - Choose a Best Classifier
- Choose one of the three classifiers as best and explain why.
***

In [15]:
test_data = hw2.data('altered_seoulbikedata_test.csv')
test_target = test_data['IsitDay']
hw2.remove_cols(test_data)
scaler = MinMaxScaler()
scaler.fit(test_data)
test_data = pd.DataFrame(scaler.transform(test_data), columns=test_data.columns)

final_classifier = svm_dict['rbf']['10']
final_classifier.fit(train_data, target)
test_results = final_classifier.predict(test_data)

final_table = tt.Texttable()
final_table.add_row(['Final Results', 'Accuracy', 'F-Measure', 'Precision', 'Recall', 'Cost'])
final_table.add_row(['SVM (rbf, C=10)', accuracy_score(test_target, test_results), f1_score(test_target, test_results), precision_score(test_target, test_results), recall_score(test_target, test_results), compute_cost(confusion_matrix(test_target, test_results))])

print(final_table.draw())

+-----------------+----------+-----------+-----------+--------+------+
| Final Results   | Accuracy | F-Measure | Precision | Recall | Cost |
+-----------------+----------+-----------+-----------+--------+------+
| SVM (rbf, C=10) | 0.925    | 0.917     | 0.936     | 0.899  | 912  |
+-----------------+----------+-----------+-----------+--------+------+


***
# Section: 7 - Conclusions
- Write a paragraph on what you discovered or learned from this homework.
***

I went into this pretty unsure of what I was doing. I did not feel great about my work on my first homework and was not sure if I was analyzing and transforming the data correctly. However, after running KNN on my data initially, I began to feel better about what I was doing after seeing the results. I was worried about whether or not I was getting the "correct" answer. However, I realized that it was more about experimenting and that there was not really a correct answer. After that realization, I enjoyed experimenting with the hyperparameters more to see how it affected my model.

I learned a lot about the different parameters for each type of classifier and what each parameter actually means. Using the different classifiers made me do some research on each and find articles or videos that helped me better understand the classifiers and how they worked. After getting a better understanding of each parameter and how it affected the classifier, I was better able to fine tune my hyperparameters. Prior to taking this class, I had only take CSE 3521 and, after taking that, I felt like I had not learned a lot of applicable information that I could use in future work. However, after this homework, I feel more confident in my ability to apply what I learned here to my future homework and studies.

---
***
### END-OF-SUBMISSION
***