# CSE 5243 - Introduction to Data Mining
## Homework 2: Classification
- Semester: Spring 2021
- Instructor: Vedang Patel
- Section: Tuesday/Thursday 9:35AM
- Student Name: Ryan Stuckey
- Student Email: stuckey.87@osu.edu
- Student ID: 500201211

Template Version V2. (Adapted from Prof. Tom Bihari's version)
***

**Instructions and Helpful Hints:**
- Consider putting all of your "discussion" text in markdown cells, not inline with code. That gives you more control over formatting. Markdown cheat sheet: https://www.markdownguide.org/cheat-sheet
- Explain what you are doing, and why.  Explain what you found out or learned.
- *Make sure you run your entire workbook before handing it in, so the output cells are populated.*
- Follow the Section structure as much as possible - put your content where it is requested, so we can find your answers.
- If you have questions on expectations or need clarification or guidance, please ask.  Post to Piazza if it is a general question, so everyone benefits.
- The class label for this exercise is IsitDay.

***
# Section: Overview
- Insert a short description of the scope of this exercise, any supporting information, etc.
***

***
# Section: Setup
- Add any needed imports, helper functions, etc., here.
***

In [1]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from statistics import mean

import pandas as pd
import texttable as tt

import hw2

train_data = hw2.data('altered_seoulbikedata_train.csv')
test_data = hw2.data('altered_seoulbikedata_test.csv')

---
***
# Section: 1 - Evaluation Method
- Define measures for evaluating the classification models you develop.  Explain why the measures you choose provide a useful view into the value and usefulness of the model you eventually chose for the company to use.  Define two types:
***

***
## Section: 1.1 - Define measures that do not include the cost information
- (e.g., confusion matrices, accuracy, precision, recall, F-measures, etc.).
***

### Confusion Matrix
- simple representation that shows number of true positives, false positives, true negatives, and false negatives
- true positives and true negatives are the desired outcomes and we want these numbers to be maximized
- used in a wide variety of different measures

### Accuracy
- just a ratio of the correct classifications to the total number of classifications
- can be problematic and misleading in certain instances if a large amount of the data belongs to a single class

### Precision
- ratio of the number of true positive classifications to the total number of predicted positive classifications (both true and false positives)
- gives us a better idea of how well our model identifies true positive cases (regardless of how much data belongs to either class) and how well it avoids false positives

### Recall/Sensitivity/True Positive Rate
- similar to precision, but looks at the actual positive classifications instead of the predicted ones
- recall is the ratio of the number of true positive classifications to the total number of actual positive classifications

### Specificity/True Negative Rate
- like recall, but looks at the actual negative classifications
- ratio of the number of true negatives to the total number of actual negative classifications

### False Positive Rate
- ratio of the number of false positives to the total number of true negatives and false positives
- also just (1 - specificity) or &#945;

### False Negative Rate
- ratio of the number of false negatives to the total number of false negatives and true positives
- also just (1 - sensitivity) or &#946;

### Power
- same thing as recall/sensitivity

### F-measure
- combines the precision and recall measures into a single score (their harmonic mean)
- calculated as (2\*precision\*recall)/(precision+recall)

### Type I and Type II Errors
- A Type I error is one in which we reject the null hypothesis even though it is true (a false positive)
- &#945; is the probability that we have a Type I error
- A Type II error is one in which we fail to reject the null hypothesis even though it is false (a false negative)
- &#946; is the probability that we have a Type II error
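
The measures defined above all derive from the four confusion-matrix counts. The following is a minimal sketch of those formulas; the function name and the counts are hypothetical, chosen only for illustration, and the sketch assumes no denominator is zero.

```python
def classification_measures(tp, fp, tn, fn):
    """Compute the measures defined above from the four confusion-matrix counts.

    Assumes every denominator is non-zero (no empty class or empty prediction set).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # also called sensitivity, true positive rate, power
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "false_positive_rate": fp / (fp + tn),   # equals 1 - specificity
        "false_negative_rate": fn / (fn + tp),   # equals 1 - recall
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, not taken from the homework data
measures = classification_measures(tp=40, fp=10, tn=45, fn=5)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.850 while recall is about 0.889 and precision is 0.800, which shows how the individual measures can tell different stories about the same model.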

### Scoring Choices
To evaluate my models, I chose F-measure, accuracy, recall, and precision. I chose F-measure because it summarizes precision and recall in a single number, giving a good overview of how well my model did. I chose accuracy because it is straightforward and easy to compute; even though it can be misleading, I still wanted it as a point of comparison. I chose recall and precision because they show how well my model identifies positive cases. Precision is especially important here, as it tells us how well our model avoids assigning a positive label when the label is actually negative. If our model predicts a true value of IsitDay when it is actually false, then the bike's headlights would turn off at nighttime.

In [2]:
# Initialize list of evaluation methods to use later
scoring=['accuracy', 'f1', 'precision', 'recall']

***
## Section: 1.2 - Define measures that do include the cost information
- (e.g., using cost matrices).
***

### Cost Matrix
- gives the cost of each potential outcome (i.e., true positive, false positive, true negative, false negative)
- calculate the total cost of a model by multiplying the count of each outcome by the cost for that outcome and summing the results
- can be used to emphasize that some misclassifications are far more unfavorable than others
- an example of this would be a self-driving car stopped at a stop sign, deciding whether another car is coming and it should wait
    - a true positive is when the car decides another car is coming and one actually is; a true negative is when the car decides no car is coming and none actually is
    - in this case, a false negative would be very undesirable, as it could lead to a collision (e.g., the car decides it is safe to move when another car is coming)
    - to address this, we can give a high cost to false negatives while giving a low cost to false positives (i.e., better to wait than to risk a crash)
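
The total-cost calculation described above can be sketched as follows. The cost values and both sets of outcome counts are hypothetical, picked to mirror the stop-sign example; they are not results from this homework.

```python
# Hypothetical costs for the stop-sign example: a false negative (moving when
# a car is coming) is penalized far more heavily than a false positive (waiting).
cost = {"tp": 0, "fp": 1, "tn": 0, "fn": 100}


def total_cost(counts, cost):
    """Multiply each outcome count by its cost and sum over the four outcomes."""
    return sum(counts[outcome] * cost[outcome] for outcome in cost)


# Two hypothetical models evaluated on the same 200 cases
model_a = {"tp": 85, "fp": 5, "tn": 95, "fn": 15}   # accuracy 0.90
model_b = {"tp": 98, "fp": 30, "tn": 70, "fn": 2}   # accuracy 0.84

print(total_cost(model_a, cost))  # 5*1 + 15*100 = 1505
print(total_cost(model_b, cost))  # 30*1 + 2*100 = 230
```

Here the model with the lower accuracy has the much lower total cost, because it makes far fewer of the expensive false-negative mistakes.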

---
***
# Section: 2 - Pre-Processing of the Dataset
- Use the altered_seoulbikedata_train.  Split it into a Training dataset and a validation dataset.  Keep them separate and use the Training dataset for training/tuning and the validation dataset for hyperparameter tuning. Or you can use cross validation - https://scikit-learn.org/stable/modules/cross_validation.html.
***

For my training and validation sets, I am going to use cross validation (CV) as mentioned above using SciKit Learn's [cross_val_score method](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score). This step is done later, after instantiating my classifiers.

With this method, you can fit the data using the given classifier, training set, and target set. In addition, you specify how many folds to use as well as the evaluation method to use from [this list](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter). The method returns an array with the score for each run of CV.
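
As a minimal sketch of the call described above, the example below runs 5-fold cross validation on synthetic stand-in data (generated with `make_classification`, an assumption made only so the example is self-contained) rather than on the bike-sharing set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the notebook uses the bike-sharing training set instead
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)

# cv=5 requests 5 folds; scoring="f1" selects the F-measure
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")

print(scores)         # one F-measure per fold
print(scores.mean())  # average across the 5 folds
```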

***
## Section: 2.1 - Revise the dataset
- Review the meanings of the attributes and consider removing redundant or (likely) irrelevant attributes, combining attributes, etc., to reduce the number of attributes.
- (You may choose to use techniques such as those you used in Homework 1 to analyze the impacts of individual attributes on the CLASS attribute, but you need not do a “deep” analysis.)
- Describe what you chose to do (and not do), and why.
***

### Transformations
- For my initial revisions, I applied the same transformations that I used in homework 1. These involved removing outliers (using the 1.5xIQR method) and removing any potentially erroneous data. These revisions included:
    1. Identify and remove temperature outliers. For this, I identified outliers within each day, as temperature varies widely throughout the year (e.g., temperatures in summer vs. temperatures in winter) but varies less within a single day.
    2. Identify and remove dew point outliers. Even though I am removing the dew point attribute (due to its low correlation with our class attribute, IsitDay), I want to remove outliers for it because I use it to check for erroneous humidity values in the next step.
    3. Check for erroneous humidity values. To do this, I check whether the recorded humidity is within a certain range of my calculated humidity. I calculate the humidity using the formula found in the function hw2.humidty (in hw2.py). If the recorded humidity differs from the calculated value by more than &#177;5, I remove that data entry.
    4. Finally, I remove any extra/irrelevant attributes. These are attributes I found to be unnecessary in predicting IsitDay. To determine their relevance, I took their correlation with IsitDay and reasoned about whether or not they were important. The attributes I ended up removing were Visibility, Dew point temperature, Rainfall, Snowfall, Seasons, Holiday, Functioning day, and Date.
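
The 1.5xIQR rule mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the actual implementation lives in hw2.py and is not shown here, the function names are hypothetical, the quartile method (`inclusive`) is one of several conventions, and the temperatures are made up.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Return the (lower, upper) fences of the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def remove_outliers(values):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values)
    return [v for v in values if lower <= v <= upper]


# Hypothetical temperatures recorded in one day; 35.0 is an implausible spike
temps = [10.1, 10.4, 10.9, 11.3, 11.8, 12.0, 12.4, 12.9, 35.0]
cleaned = remove_outliers(temps)
print(cleaned)  # the 35.0 reading falls above the upper fence and is dropped
```

Applying the rule to one day at a time, as described in step 1, keeps a hot summer afternoon from being flagged merely because it differs from winter readings.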

In [3]:
print('--Transformation Summary--')
target = hw2.hw1_transformation(train_data)
train_data.describe()

--Transformation Summary--
+---------------------+-------+
| Elimination Cause   | Count |
+---------------------+-------+
| Temperature Outlier | 270   |
+---------------------+-------+
| Humidity Error      | 78    |
+---------------------+-------+
| Total Eliminations  | 348   |
+---------------------+-------+


| | Rented Bike Count | Temperature(C) | Humidity(%) | Wind speed (m/s) | Solar Radiation (MJ/m2) |
|---|---|---|---|---|---|
| count | 7010.0 | 7010.0 | 7010.0 | 7010.0 | 7010.0 |
| mean | 711.702282 | 12.894479 | 58.472896 | 1.716933 | 0.567655 |
| std | 651.26385 | 11.904562 | 20.15385 | 1.03529 | 0.866163 |
| min | 0.0 | -17.8 | 11.0 | 0.0 | 0.0 |
| 25% | 193.0 | 3.5 | 43.0 | 0.9 | 0.0 |
| 50% | 509.0 | 13.6 | 57.0 | 1.5 | 0.01 |
| 75% | 1074.0 | 22.5 | 74.0 | 2.3 | 0.93 |
| max | 3556.0 | 39.4 | 98.0 | 7.4 | 3.52 |


***
## Section: 2.2 - Transform the attributes
- Consider transforming the remaining attributes (e.g., one hot encoding in case Python classification models do not support nominal attributes), normalizing / scaling values, encoding labels (if necessary), etc.
- Describe what you chose to do (and not do), and why.
***

### Additional Transformations
- The only additional transformation I found necessary was to normalize my data via SciKit's MinMaxScaler class.
- With the attributes I had leftover, I did not need to do anything like one hot encoding or encoding of labels because all values I had leftover were numerical.

The results from scaling can be seen below. Notice how all values now range from 0 to 1, in comparison to the data printed in Section 2.1.

In [4]:
scaler = MinMaxScaler()
scaler.fit(train_data)
train_data = pd.DataFrame(scaler.transform(train_data), columns=train_data.columns)
train_data.describe()

| | Rented Bike Count | Temperature(C) | Humidity(%) | Wind speed (m/s) | Solar Radiation (MJ/m2) |
|---|---|---|---|---|---|
| count | 7010.0 | 7010.0 | 7010.0 | 7010.0 | 7010.0 |
| mean | 0.200141 | 0.536617 | 0.545665 | 0.232018 | 0.161266 |
| std | 0.183145 | 0.208122 | 0.231653 | 0.139904 | 0.246069 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.054274 | 0.372378 | 0.367816 | 0.121622 | 0.0 |
| 50% | 0.143138 | 0.548951 | 0.528736 | 0.202703 | 0.002841 |
| 75% | 0.302025 | 0.704545 | 0.724138 | 0.310811 | 0.264205 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
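
For reference, min-max scaling maps each value x to (x - min) / (max - min). The sketch below applies that formula by hand to the minimum, median, and maximum of Temperature(C) from the unscaled summary; the helper name is hypothetical and the sketch assumes the column is not constant (max differs from min).

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1.

    Assumes max(values) != min(values); a constant column would divide by zero.
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Min, median, and max of Temperature(C) from the summary before scaling
temps = [-17.8, 13.6, 39.4]
scaled = min_max_scale(temps)
print([round(v, 6) for v in scaled])  # [0.0, 0.548951, 1.0]
```

The middle value agrees with the scaled 50% entry for Temperature(C), since the median keeps its rank under a linear rescaling.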


***
# Section: 3 - Evaluation of the Off-The-Shelf KNN Classifier
- Select the KNN classifier from the SciKit Learn library and run it on the dataset.
***

***
## Section: 3.1 - Configure the off-the-shelf KNN classifier
- Use the KNeighborsClassifier from the SciKit Learn library
- Explain all setup, parameters and execution options you chose to set, and why.
***

The definition for SciKit's KNeighborsClassifier class is:

```
KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
```

The parameters that are important or that I'll be changing are described below:
- `n_neighbors`: number of nearest neighbors to use; default = 5
- `weights`: weighting function to use when looking at points in the neighborhood; default = 'uniform', but 'distance' or a user-defined weighting function can also be passed
- `algorithm`: algorithm used to compute nearest neighbors; default = 'auto', but other options include 'ball_tree', 'kd_tree', or 'brute'
- `metric`: method to use for measuring distance; acceptable values can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric), or a user-defined function can be passed in; default = Minkowski, which has the formula sum(|x - y|^p)^(1/p)
- `p`: power to use for computing the Minkowski metric; refers to the *p* value in the formula above; p=1 is Manhattan distance and p=2 is Euclidean distance
- `metric_params`: additional parameters for the metric function passed in
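
To make the Minkowski formula above concrete, here is a small hand-written version (a sketch for illustration, not scikit-learn's implementation) showing how p=1 and p=2 give the Manhattan and Euclidean distances.

```python
def minkowski(x, y, p=2):
    """Minkowski distance: (sum over i of |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, p=1))  # Manhattan: |3| + |4| = 7.0
print(minkowski(a, b, p=2))  # Euclidean: sqrt(9 + 16) = 5.0
```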

### KNN Instantiation
For my KNN classifier, I am going to try 10 different classifiers, split into two groups. Group 1 will contain 5 classifiers, use 1 to 5 nearest neighbors, and use Manhattan distance. Group 2 will also contain 5 classifiers, use 1 to 5 nearest neighbors, and use Euclidean distance.

I am not quite sure where to start when choosing my value for k, so I wanted to test a range of values. If I make k too small, my classifier will be too sensitive to noise points. However, if I make it too big, the neighborhood may include points from other classes.
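
The trade-off in choosing k can be illustrated with a tiny hand-written nearest-neighbor vote. This is a sketch on made-up one-dimensional data, not the scikit-learn classifier used in this notebook; the function name and the points are hypothetical.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Predict the majority label among the k training points nearest to query.

    train is a list of (x, label) pairs with a single numeric feature.
    """
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical data: one mislabeled "night" point (0.86) sits among the "day" points
train = [(0.10, "night"), (0.15, "night"), (0.20, "night"),
         (0.80, "day"), (0.85, "day"), (0.86, "night"), (0.90, "day")]

print(knn_predict(train, 0.87, k=1))  # night: the single nearest point is the noisy one
print(knn_predict(train, 0.87, k=3))  # day: two of the three nearest points are "day"
print(knn_predict(train, 0.87, k=7))  # night: distant "night" points outvote the local ones
```

With k=1 the noisy point decides the answer, and with k=7 the far-away cluster does; the intermediate value gives the label of the local neighborhood.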

Additionally, I'll be using cross validation with SciKit's cross_val_score and cross_validate methods to train and validate my model. cross_validate is just like cross_val_score, but it allows multiple types of evaluation measures to be used. 
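
As a sketch of how cross_validate reports several measures at once, the example below uses synthetic stand-in data from `make_classification`. It also wraps the scaler and classifier in a `Pipeline`, which is a variation on what this notebook does (the notebook scales the whole training set once before cross validation); with a pipeline, the scaler is refit on the training portion of each fold.

```python
from statistics import mean

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, not the bike-sharing set
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Scaling inside the pipeline means the scaler only sees each fold's training data
model = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3, metric="manhattan")),
])

scoring = ["accuracy", "f1", "precision", "recall"]
results = cross_validate(model, X, y, cv=5, scoring=scoring)

# Each measure is returned under the key "test_<name>", one value per fold
for name in scoring:
    print(name, mean(results[f"test_{name}"]))
```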

In [16]:
min_neighbors, max_neighbors = 1,5
knn_manhat = [KNeighborsClassifier(n_neighbors=x, metric='manhattan') for x in range(min_neighbors, max_neighbors + 1)]
knn_euclid = [KNeighborsClassifier(n_neighbors=x, metric='euclidean') for x in range(min_neighbors, max_neighbors + 1)]

***
## Section: 3.2 - Run and evaluate the classifier
- Try several values of the K parameter and compare the results.
- Evaluate the performance of the classifier, using the evaluation method you defined above.
***

In [30]:
# Run 5-fold cross validation for every classifier, computing each measure in `scoring`
knn_manhat_scores = [cross_validate(estimator=clf, X=train_data, y=target, cv=5, scoring=scoring) for clf in knn_manhat]
knn_euclid_scores = [cross_validate(estimator=clf, X=train_data, y=target, cv=5, scoring=scoring) for clf in knn_euclid]

def print_results(title, all_scores):
    """Print the mean of each measure across folds, one row per value of k."""
    table = tt.Texttable()
    table.add_rows([[title, '', '', '', ''],
                    ['Value of K', 'Accuracy', 'F-measure', 'Recall', 'Precision']
    ])
    for k, scores in enumerate(all_scores, start=min_neighbors):
        table.add_row(['k=' + str(k), mean(scores['test_accuracy']), mean(scores['test_f1']), mean(scores['test_recall']), mean(scores['test_precision'])])
    print(table.draw())

print_results('Results for Manhattan KNN', knn_manhat_scores)
print_results('Results for Euclidean KNN', knn_euclid_scores)

+---------------------------+----------+-----------+--------+-----------+
| Results for Manhattan KNN |          |           |        |           |
| Value of K                | Accuracy | F-measure | Recall | Precision |
+---------------------------+----------+-----------+--------+-----------+
| k=1                       | 0.902    | 0.889     | 0.859  | 0.922     |
+---------------------------+----------+-----------+--------+-----------+
| k=2                       | 0.897    | 0.876     | 0.795  | 0.976     |
+---------------------------+----------+-----------+--------+-----------+
| k=3                       | 0.916    | 0.902     | 0.852  | 0.959     |
+---------------------------+----------+-----------+--------+-----------+
| k=4                       | 0.904    | 0.886     | 0.812  | 0.975     |
+---------------------------+----------+-----------+--------+-----------+
| k=5                       | 0.915    | 0.901     | 0.842  | 0.970     |
+---------------------------+---------

***
## Section: 3.3 - Evaluate the choice of the KNN classifier
- What characteristics of the problem and data made KNN a good or bad choice?
***

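Overall, my KNN classifier produced strong results. With Manhattan distance, the mean cross-validated accuracy ranged from 0.897 (k=2) to 0.916 (k=3), and k=3 also gave the highest F-measure (0.902).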

***
# Section: 4 - Evaluation of Off-The-Shelf Classifier #2
- As with the KNN classifier above, choose another classifier from the SciKit Learn library (Decision Tree, SVM, Logistic Regression, etc.) and run it on the dataset.
***

***
## Section: 4.1 - Configure the classifier
- Use the appropriate classifier from the SciKit Learn library.
- Explain all setup, parameters and execution options you chose to set, and why.
***

***
## Section: 4.2 - Run and evaluate the classifier
- Try several values of the parameters (if appropriate) and compare the results.
- Evaluate the performance of the classifier, using the evaluation method you defined above.
***

***
## Section: 4.3 - Evaluate the choice of the classifier
- What characteristics of the problem and data made the classifier a good or bad choice?
***

***
# Section: 5 - Evaluation of Off-The-Shelf Classifier #3
- As with the KNN classifier above, choose another classifier from the SciKit Learn library (Decision Tree, SVM, Logistic Regression, etc.) and run it on the dataset.
***

***
## Section: 5.1 - Configure the classifier
- Use the appropriate classifier from the SciKit Learn library.
- Explain all setup, parameters and execution options you chose to set, and why.
***

***
## Section: 5.2 - Run and evaluate the classifier
- Try several values of the parameters (if appropriate) and compare the results.
- Evaluate the performance of the classifier, using the evaluation method you defined above.
***

***
## Section: 5.3 - Evaluate the choice of the classifier
- What characteristics of the problem and data made the classifier a good or bad choice?
***

***
# Section: 6 - Comparison of the Three Classifiers
***

***
## Section: 6.1 - Compare the performance of these classifiers to each other
- What are their strong and weak points?
***

***
## Section: 6.2 - Choose a Best Classifier
- Choose one of the three classifiers as best and explain why.
***

***
# Section: 7 - Conclusions
- Write a paragraph on what you discovered or learned from this homework.
***

***
### END-OF-SUBMISSION
***