# Homework Assignment 6: Model Evaluation 2
As in the previous assignments, in this homework assignment you will continue your exploration of the [SWAN-SF Dataset](https://doi.org/10.7910/DVN/EBCFKM), described in the paper found [here](https://doi.org/10.1038/s41597-020-0548-x).

This assignment will utilize a copy of the extracted feature dataset we have been working with. The dataset has been processed by performing outlier clipping, z-score and range scaling, and forward feature selection to select 20 features. Like in Homework Assignment 5, we are going to continue to utilize more than one partition worth of data, so for the z-score and range scaling, the mean, standard deviation, minimum, and maximum were calculated using data from both partitions so that a global scaling can be performed on each partition. 

---

## Step 1: Downloading the Data

This assignment will continue to use [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB) for a training set and [Partition 2](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/TCRPUD) as a testing set. 

---
#### Homework 1 & 2

Recall that in Homework 1, we started to construct the analytics base table for our [SWAN-SF Dataset](https://doi.org/10.7910/DVN/EBCFKM). In that assignment, we read the data from the two subdirectories, __FL__ and __NF__, of the __partition1__ directory. These two subdirectories represented the two classes of our target feature in the solar flare prediction problem we are attempting to solve this semester. We then processed these samples of multivariate time series to construct descriptive features for each sample, and then placed them into our analytics base table.

Then, in Homework 2, you utilized a set of extracted descriptive features much like what you were asked to construct in Homework 1. However, this dataset contained many more extracted features than you were asked to compute for Homework 1 (>800). So, we needed to explore the data to find data quality issues and identify ways to address any we found. Below are the links to the full extracted feature datasets for partitions 1 and 2, and a toy representative dataset of partition 1 that was used as input to Homework 2.

- [Full Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/partition1ExtractedFeatures.csv)
- [Full Partition 2 feature dataset](http://dmlab.cs.gsu.edu/solar/data/partition2ExtractedFeatures.csv)
- [Toy Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/toy_partition1ExtractedFeatures.csv)

---

#### Homework 3

Then, in Homework 3, you were asked to perform additional data preprocessing on data that would have been produced from Homework 2. These preprocessing steps included finding features with large ranges and features with a large number of outliers. You were asked to clip some of the outliers for the features you found and were also asked to perform a few different types of scaling, such as decimal and z-score. The links to those files are below.  

- [Full Cleaned Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/cleaned_partition1ExtractedFeatures.csv)
- [Toy Cleaned Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/toy_cleaned_partition1ExtractedFeatures.csv)
- [Data Quality Table for Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/data_quality_table.csv)

---

#### Homework 4

I then did much more of this preprocessing for you to produce data for Homework 4, including clipping outliers and performing z-score and range normalization. I constructed both a full normalized and a toy normalized data file for use in that assignment; both are linked below.

- [Full Normalized Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition1ExtractedFeatures.csv)
- [Toy Normalized Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/toy_normalized_partition1ExtractedFeatures.csv)

You were then asked to remove columns that had too many NaN or Inf values in them and replace the remaining NaN and Inf values with the median of the feature in which they occurred. Then you were asked to perform various types of feature selection on the features and find a set that we might want to use for classification later.

---

#### Homework 5

For Homework 5, I again clipped outliers and performed z-score and range normalization, but this time the calculations were based upon the values of partitions 1 and 2.  I also performed the same NaN and Inf processing you were asked to do in Homework 4. I then performed the last feature selection method you were asked to perform using [scikit-learn Sequential Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html#sequential-feature-selection) with [scikit-learn LassoLarsCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsCV.html#sklearn.linear_model.LassoLarsCV) as the estimator. I then constructed a subset of 20 features for both partition 1 and 2 (links below), which was the input data for the assignment.

- [Partition 1 selected feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition1SelectedFeatures.csv)
- [Partition 2 selected feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition2SelectedFeatures.csv)

---

# Data for Now

We are going to use the same data as was used in Homework 5. 

- [Partition 1 selected feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition1SelectedFeatures.csv)
- [Partition 2 selected feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition2SelectedFeatures.csv)

Now that you have the two CSV files of selected features, you should load each into a Pandas DataFrame using the [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method. 

---

### Evaluation Metric

As was done in Homework 5, for each of the models we evaluate in this assignment, you will calculate the True Skill Statistic score using the test data from Partition 2 to determine which model performs the best for classifying the positive flaring class.

    True skill statistic (TSS) = TPR + TNR - 1 = TPR - (1-TNR) = TPR - FPR

Where:

    True positive rate (TPR) = TP/(TP+FN) Also known as recall or sensitivity
    True negative rate (TNR) = TN/(TN+FP) Also known as specificity or selectivity
    False positive rate (FPR) = FP/(FP+TN) = (1-TNR) Also known as fall-out or false alarm ratio


**Recall**

    True positive (TP)
    True negative (TN)
    False positive (FP)
    False negative (FN)
    
See [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) for more information.
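
As a quick worked example of the formulas above, the sketch below computes TSS directly from the four confusion-matrix counts. The function name `tss_from_counts` and the counts are made up for illustration and are not part of the assignment code.

```python
def tss_from_counts(tp: int, fn: int, fp: int, tn: int) -> float:
    """Compute TSS = TPR - FPR from raw confusion-matrix counts."""
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    return tpr - fpr


# Hypothetical counts: 10 flaring samples (8 detected) and
# 90 non-flaring samples (9 raised as false alarms).
example_tss = tss_from_counts(tp=8, fn=2, fp=9, tn=81)
print(f"TSS = {example_tss:.2f}")
```

Note that a model that labels every sample as non-flaring has TPR = 0 and FPR = 0, so its TSS is 0 no matter how imbalanced the classes are.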

Below is a function implemented to provide your score for each model.

---

In [None]:
%matplotlib inline
import os
import itertools
import pandas as pd
from pandas import DataFrame 
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
def calc_tss(y_true, y_predict):
    '''
    Calculates the true skill score for binary classification based on the output of the confusion
    table function
    
        Parameters:
            y_true   : A vector/list of values that represent the true class label of the data being evaluated.
            y_predict: A vector/list of values that represent the predicted class label for the data being evaluated.
    
        Returns:
            tss_value (float): A floating point value (-1.0,1.0) indicating the TSS of the input data
    '''
    scores = confusion_matrix(y_true, y_predict).ravel()
    TN, FP, FN, TP = scores
    print('TN={0}\tFP={1}\tFN={2}\tTP={3}'.format(TN, FP, FN, TP))
    tp_rate = TP / float(TP + FN) if TP > 0 else 0  
    fp_rate = FP / float(FP + TN) if FP > 0 else 0
    
    return tp_rate - fp_rate

---

### Label Conversion

In Homework 5, you were asked to construct a method that converted the multi-class labels of the analytics base table into a binary label of either 1 or 0. The function was to return a copy of the analytics base table that converted those labeled as M or X to be the positive flaring class (1), and those labeled as C, B, or NF to be the negative non-flaring class (0).  Since we will continue to use the binary classification in this assignment, and since you will be asked to perform some over and under sampling of individual classes before the conversion, I have provided that function below. 

---

In [None]:
def copy_and_convert_to_binary(data:DataFrame)->DataFrame:
    '''
    Makes a copy of the input DataFrame and converts the labels in the `lab` column into a binary 0, 1 label.
    
        Parameters:
            data (DataFrame): A DataFrame of samples with a `lab` column
            
        Returns:
            data_copy (DataFrame): A copy of the input DataFrame with the binary conversion applied
    
    '''
    lab_map = {'NF':0, 'A':0, 'B':0, 'C':0, 'M':1, 'X':1}
    data_cpy = data.copy()
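    # Assign the mapped column back instead of calling replace(..., inplace=True) on a
    # column selection, which may not update data_cpy in newer pandas versions.
    data_cpy['lab'] = data_cpy['lab'].map(lab_map)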
    return data_cpy

---

### Reading the partitions

In [None]:
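data_dir = '/data/FDS'
data_file = "normalized_partition1SelectedFeatures.csv"
data_file2 = "normalized_partition2SelectedFeatures.csv"
# Load each partition into its own DataFrame (abt = partition 1, abt2 = partition 2).
abt = pd.read_csv(os.path.join(data_dir, data_file))
abt2 = pd.read_csv(os.path.join(data_dir, data_file2))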

---
### Q1 (10 points)

In this assignment, we will compare a few different methods of over/under sampling. If you recall, in Homework 5, you were asked to implement a very naive method for sampling that simply found the smallest class and then undersampled all of the other classes to be that same size. We want to compare other methods against this naive one. So, for this question, reimplement this sampling method. If you got it correct in Homework 5, then copy your code over; if not, here's your chance to get it right.

---

Below you will find a function stub. Complete the function and have it return a copy of the input DataFrame in which each class (except for the smallest one) has been undersampled to match the size of the smallest class in the dataset. In this function you should assume the `lab` column is the class label.

To do this you may want to use the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function of the DataFrame to get groups of rows from your DataFrame.  You may also wish to use the [sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) function to select a number of rows from a group. You can also use the [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method to process each group from your grouped rows. These are just hints, you can solve the problem how you see fit.
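
The hinted calls can be tried on a tiny made-up table first. The sketch below uses a toy DataFrame (the `feat` column and all values are invented) to show how `groupby(...).size()` reports per-class counts and how `sample` draws a fixed number of rows; it is an illustration of the calls, not the required solution.

```python
import pandas as pd

# Toy table with an imbalanced `lab` column (values are invented).
toy = pd.DataFrame({
    "feat": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    "lab":  ["NF", "NF", "NF", "NF", "C", "C", "X"],
})

# Number of rows in each class, indexed by label.
class_sizes = toy.groupby("lab").size()
smallest = int(class_sizes.min())

# Draw `smallest` rows (without replacement) from the NF rows only.
nf_rows = toy[toy["lab"] == "NF"]
nf_sample = nf_rows.sample(n=smallest, random_state=0)
print(class_sizes.to_dict(), len(nf_sample))
```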

Once this function is complete, apply it to the original analytics base table for partition 1 (the one with all the NF, C, .., X labels). 

---

Once the function is complete do the following:

<ol>
    <li> 
        Apply your perform_naive_under_sample to the original analytics base table for partition 1 (the one with all the NF, C, .., X labels) to obtain an under sampled copy of the partition
    </li> 
    <li>
        Pass the results of your under sampling function to the copy_and_convert_to_binary function I have provided to convert the multi-class problem to a binary problem, and assign the results to a variable so we can use this new undersampled data for later questions
    </li>
    <li>
        Pass the analytics base table for partition 1 to the copy_and_convert_to_binary function I have provided to convert the multi-class problem to a binary problem and save the results to a variable so we can use this data for later questions
    </li>
    <li>
         Pass the analytics base table for partition 2 to the copy_and_convert_to_binary function I have provided to convert the multi-class problem to a binary problem, and assign the results to a variable so we can use this data for later questions
    </li>
</ol>

At this point you should have three copies of the input analytics base tables: training partition 1 sampled, training partition 1 unsampled, and test partition 2 unsampled. 

---

In [None]:
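def perform_naive_under_sample(data:DataFrame)->DataFrame:
    # Size of the smallest class in the `lab` column.
    min_size = int(data['lab'].value_counts().min())
    # Randomly draw min_size rows from every class, then renumber the rows.
    return data.groupby('lab', group_keys=False).sample(n=min_size).reset_index(drop=True)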

In [None]:
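# Training partition 1: naive undersampled copy and unsampled copy, both with binary labels.
S_abt_binary_cpy = copy_and_convert_to_binary(perform_naive_under_sample(abt))
abt_binary_cpy = copy_and_convert_to_binary(abt)
# Test partition 2 is converted to binary labels but never sampled.
abt2_binary_cpy = copy_and_convert_to_binary(abt2)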

---
### Q2 (10 points)

With your binary classification dataset constructed, now it's time to start training and testing some models. So, same as you did with training and testing in Homework 5, use your unsampled copy of partition 1 that has had its `lab` column converted to a binary label. Then train and test 8 different [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) instances with the following settings. **(see documentation to know what these are)**


|Model Number| kernel | class_weight |
|------------|---------|-------------|
|1|linear|None|
|2|linear|balanced|
|3|poly|None|
|4|poly|balanced|
|5|rbf|None|
|6|rbf|balanced|
|7|sigmoid|None|
|8|sigmoid|balanced|


---
When testing each of your models, you should be using your binary classification copy of partition 2, then calculate (using the calc_tss function provided) and print the TSS score for each result. **NOTE: The model does take a little while to train.**
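
For reference, the scikit-learn documentation describes `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below evaluates that expression on a made-up, imbalanced label vector so the effect is visible; the label counts are hypothetical and unrelated to the real partitions.

```python
import numpy as np

# Hypothetical binary labels: 95 non-flaring (0) and 5 flaring (1).
y = np.array([0] * 95 + [1] * 5)

n_samples = len(y)
n_classes = len(np.unique(y))

# One weight per class, in label order [0, 1].
balanced_weights = n_samples / (n_classes * np.bincount(y))
print(balanced_weights)
```

The rare class receives the larger weight, so misclassifying one of its samples costs more during training than misclassifying a sample from the common class.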

---

In [None]:
info_dict = {
    1:	['linear',	None],
    2:	['linear', 'balanced'],
    3:	['poly', None],
    4:	['poly', 'balanced'],
    5:	['rbf', None],
    6:	['rbf', 'balanced'],
    7:	['sigmoid', None],
    8:	['sigmoid', 'balanced']
}
for key in info_dict.keys():
    clf = SVC(kernel=info_dict.get(key)[0], class_weight=info_dict.get(key)[1])
    clf.fit(abt_binary_cpy.drop('lab', axis=1),abt_binary_cpy['lab'])
    score = calc_tss(abt2_binary_cpy['lab'], clf.predict(abt2_binary_cpy.drop('lab', axis=1)))
    print(f'Model {key} TSS score: {score}')

---
### Q3 (10 points)

For this question, do the same training and testing as in Question 2, but this time use your sampled copy of partition 1 that has had its `lab` column converted to a binary label. Train and test 8 different [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) instances with the following settings. **(see documentation to know what these are)**


|Model Number| kernel | class_weight |
|------------|---------|-------------|
|1|linear|None|
|2|linear|balanced|
|3|poly|None|
|4|poly|balanced|
|5|rbf|None|
|6|rbf|balanced|
|7|sigmoid|None|
|8|sigmoid|balanced|


---
When you test each of your models, you should be using your binary classification copy of partition 2, then calculate (using the calc_tss function provided) and print the TSS score for each result. **NOTE: The model now takes less time to train!**

---

In [None]:
info_dict = {
    1:	['linear',	None],
    2:	['linear', 'balanced'],
    3:	['poly', None],
    4:	['poly', 'balanced'],
    5:	['rbf', None],
    6:	['rbf', 'balanced'],
    7:	['sigmoid', None],
    8:	['sigmoid', 'balanced']
}
for key in info_dict.keys():
    clf = SVC(kernel=info_dict.get(key)[0], class_weight=info_dict.get(key)[1])
    # Train on the naive undersampled, binary-labeled copy of partition 1.
    clf.fit(S_abt_binary_cpy.drop('lab', axis=1),S_abt_binary_cpy['lab'])
    score = calc_tss(abt2_binary_cpy['lab'], clf.predict(abt2_binary_cpy.drop('lab', axis=1)))
    print(f'Model {key} TSS score: {score}')

---

Like you saw for the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) in Homework 5, you should see that the sampling method improves both the training time and the results! Why is that the case though? Let's take some time to look into this.

---

---
### Q4 (10 points)

In order to plot the two different classes from two different DataFrames, let's start by breaking them into their own DataFrames. 

In this question, construct the 4 DataFrames listed below and assign them to unique variables, we will be using them in the next question:
<ol>
    <li>Flaring samples from the unsampled analytics base table</li>
    <li>Non-Flaring samples from the unsampled analytics base table</li>
    <li>Flaring samples from the sampled analytics base table</li>
    <li>Non-Flaring samples from the sampled analytics base table</li>
</ol>

For this question, I want you to utilize the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function of the DataFrame to first group the rows of your DataFrame(s) into flare and no-flare groups. This will give you a [GroupBy](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) object.  Look at the documentation of that object to see how to get a DataFrame of a specific group.  **Note: if you use groupby to construct your sampled dataset, you may need to [reset_index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) of your sampled analytics base table before attempting to group your data again.**
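
As a small illustration of retrieving one group from a GroupBy object, the sketch below calls `get_group` on a made-up binary table. The DataFrame contents and variable names are invented for the example.

```python
import pandas as pd

# Toy binary-labeled table (values are invented).
toy_binary = pd.DataFrame({
    "feat": [0.1, 0.9, 0.2, 0.8, 0.3],
    "lab":  [0, 1, 0, 1, 0],
})

grouped = toy_binary.groupby("lab")
toy_flaring = grouped.get_group(1)      # rows where lab == 1
toy_non_flaring = grouped.get_group(0)  # rows where lab == 0
print(len(toy_flaring), len(toy_non_flaring))
```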

---

In [None]:
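# Group each binary table by label, then pull out the flaring (1) and non-flaring (0) groups.
unsampled_groups = abt_binary_cpy.groupby('lab')
sampled_groups = S_abt_binary_cpy.groupby('lab')
flaringUnsampled = unsampled_groups.get_group(1)
nonFlaringUnsampled = unsampled_groups.get_group(0)
flaringSampled = sampled_groups.get_group(1)
nonFlaringSampled = sampled_groups.get_group(0)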

---
### Q5 (15 points)

Now that you have your four different DataFrames from Question 4, you are going to plot each one of these using the seaborn [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) method. 

The plots need to be constructed in the following way:

1. Use Matplotlib [subplots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) to request a set of 2 row by 2 column subplots with `figsize=(14,14)`.  This function will return a tuple of 1.) the [Figure](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure) object, 2.) an array of [Axes](https://matplotlib.org/stable/api/axes_api.html#matplotlib.axes.Axes) objects. Assign each of these to variables as you will be using at least the array of Axes objects.

2. For each one of the [Axes](https://matplotlib.org/stable/api/axes_api.html#matplotlib.axes.Axes) objects in the array, you should set `ylim=(-0.75, 1.25)` and `xlim=(-1.5, 1.0)` using either the set method or the set_xlim or set_ylim methods. 

3. Each [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) should plot `TOTUSJZ_linear_weighted_average` on the x axis and `TOTFZ_last_value` on the y axis. Use the `x` and `y` input arguments to set the key to those column names. Then assign the `data` input argument to be the DataFrame you are plotting.

4. Use `ax=axes[m,n]` to assign the [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) to a specific subplot, where m is the row and n is the column in our 2 row by 2 column set of plots. 0,0 should be the non-flaring unsampled data, 0,1 should be the flaring unsampled data, 1,0 should be the non-flaring sampled data, and 1,1 should be the flaring sampled data.

5. You should use `cmap='Blues'` for non-flaring and `cmap='Reds'` for flaring as an argument into your  [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) call.  

6. Additional arguments should be `cbar=True`, `fill=True`, and `legend=False` to give you some more information in your plots.



---

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14,14))

for ax in axes.flat:
    ax.set_xlim([-1.5, 1.0])
    ax.set_ylim([-0.75, 1.25])

# Column names are passed as `x` and `y`; the DataFrame itself is passed as `data`.
x_col, y_col = 'TOTUSJZ_linear_weighted_average', 'TOTFZ_last_value'
sns.kdeplot(data=nonFlaringUnsampled, x=x_col, y=y_col, ax=axes[0,0], cmap='Blues', cbar=True, fill=True, legend=False)
sns.kdeplot(data=flaringUnsampled, x=x_col, y=y_col, ax=axes[0,1], cmap='Reds', cbar=True, fill=True, legend=False)
sns.kdeplot(data=nonFlaringSampled, x=x_col, y=y_col, ax=axes[1,0], cmap='Blues', cbar=True, fill=True, legend=False)
sns.kdeplot(data=flaringSampled, x=x_col, y=y_col, ax=axes[1,1], cmap='Reds', cbar=True, fill=True, legend=False)

---

After you plot the four different kernel density estimate plots, you can begin to see why the error-based learning method would improve by using the sampled data. The distributions of samples within the feature space are slightly different for the flaring samples, but the non-flaring samples see the most change. In the unsampled data, we have virtually all the weight of the samples in a very narrow band. Because of this, the support vectors of our SVC would prioritize classifying samples that fall in this narrow band at the expense of samples that fall outside of it. 

The high number of instances falling within this narrow band of feature values is caused by the high number of NF samples in our original dataset greatly outnumbering any of the other classes (B, C, M, or X).  This shows why one needs to be careful when selecting data to perform a specific task. If your data is highly imbalanced and you make the wrong design decisions for constructing your training dataset, then your training results may end up being less than desirable.

---

### Q6 (10 points)
Let's try a better sampling strategy to see how this affects our training results. So, for this question we will be reimplementing our sampling with some minor changes and then performing the binary conversion again. Below is a method stub to complete. Like before (in Q1), it returns a copy of the input DataFrame, but this time it will follow this procedure:

1. Since X class samples are so rare, maybe we want to replicate the few instances we have to emphasize their importance in the classification problem. So, to do this, let's oversample the X class samples by doing random oversampling with replacement until the number of X class samples in the dataset is 2 times the original number of X class samples.

2. Next, determine how many M and X class samples there now are as a combined count.  This will be what we wish to match with our non-flare class in the binary classification problem.

3. With the value calculated from step 2 above, determine the counts needed from the NF, B, and C labeled instances to match that value when counting all three AND keeping the ratio of NF to B to C the same. This will be a proportional undersampling for the binary non-flaring class while keeping the `climatology` of the underlying classes the same.  For example, say the ratios of the NF, B, and C labels are 3/6 for NF, 2/6 for B and 1/6 for C, and we wanted to get 200 total instances when we combine them. Then the number of NF instances we would want would be 200 * (3/6), etc. 

4. After you have determined how many of each NF, B, and C labeled instances you want, randomly undersample each of these classes and put them in the output DataFrame with the oversampled X class samples and the unchanged M class samples.
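
The arithmetic in steps 1 through 3 can be checked with plain numbers before touching the DataFrame. The sketch below uses made-up class counts (they are not the real partition 1 counts) and a hypothetical helper name, `proportional_targets`.

```python
import math


def proportional_targets(neg_counts: dict, target_total: int) -> dict:
    """Split target_total across classes in proportion to their counts."""
    total = sum(neg_counts.values())
    return {lab: math.floor(target_total * n / total)
            for lab, n in neg_counts.items()}


# Hypothetical class counts, for illustration only.
counts = {"NF": 3000, "B": 2000, "C": 1000, "M": 150, "X": 25}

x_after_oversampling = 2 * counts["X"]                   # step 1
flare_total = counts["M"] + x_after_oversampling         # step 2
negatives = {k: counts[k] for k in ("NF", "B", "C")}
targets = proportional_targets(negatives, flare_total)   # step 3
print(flare_total, targets)
```

Because each target is rounded down, the non-flaring total can end up a sample or two short of the flaring total; that small difference is usually acceptable, but it is a design choice you can handle differently.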



In this function you should assume the `lab` column is the class label.

To do this you may want to use the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function of the DataFrame to get groups of rows from your DataFrame.  You may also wish to use the [sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) function to select a number of rows from a group. 

Once this function is complete, apply it to the original analytics base table for partition 1 (the one with all the NF, C, .., X labels). Then pass the results to the copy_and_convert_to_binary function I have provided to convert the multi-class problem to a binary problem, and assign the results to a variable so we can use this new over/under sampled data for later questions.

---

In [None]:
def perform_under_over_sample(data:DataFrame)->DataFrame:
    '''
    Oversamples the X class to twice its original size, then undersamples NF, B, and C
    proportionally so that together they match the combined M and X count.
    '''
    groups = data.groupby('lab')
    x_data = groups.get_group('X')
    m_data = groups.get_group('M')

    # Step 1: add len(x_data) X rows drawn with replacement, doubling the X count.
    x_over = pd.concat([x_data, x_data.sample(n=len(x_data), replace=True)])

    # Step 2: combined flaring (M + X) count that the non-flaring side must match.
    flare_total = len(x_over) + len(m_data)

    # Step 3: the NF:B:C ratio of the input determines each class's share of flare_total.
    neg_groups = [groups.get_group(lab) for lab in ('NF', 'B', 'C')]
    neg_total = sum(len(g) for g in neg_groups)

    # Step 4: randomly undersample each non-flaring class to its (rounded down) share.
    neg_samples = [g.sample(n=int(flare_total * len(g) / neg_total)) for g in neg_groups]

    # The input DataFrame is left unmodified; a new, re-indexed DataFrame is returned.
    return pd.concat([x_over, m_data] + neg_samples).reset_index(drop=True)

In [None]:
sampled_abt = copy_and_convert_to_binary(perform_under_over_sample(abt))

### Q7 (10 points)

For this question, do the same training and testing as in Question 3, but this time use your under and over sampled copy of partition 1 that has had its `lab` column converted to a binary label. Train and test 8 different [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) instances with the following settings. **(see documentation to know what these are)**


|Model Number| kernel | class_weight |
|------------|---------|-------------|
|1|linear|None|
|2|linear|balanced|
|3|poly|None|
|4|poly|balanced|
|5|rbf|None|
|6|rbf|balanced|
|7|sigmoid|None|
|8|sigmoid|balanced|


---
When you test each of your models, you should be using your binary classification copy of partition 2, then calculate (using the calc_tss function provided) and print the TSS score for each result. 

---

In [None]:
info_dict = {
    1:	['linear',	None],
    2:	['linear', 'balanced'],
    3:	['poly', None],
    4:	['poly', 'balanced'],
    5:	['rbf', None],
    6:	['rbf', 'balanced'],
    7:	['sigmoid', None],
    8:	['sigmoid', 'balanced']
}
for key in info_dict.keys():
    clf = SVC(kernel=info_dict.get(key)[0], class_weight=info_dict.get(key)[1])
    clf.fit(sampled_abt.drop('lab', axis=1),sampled_abt['lab'])
    score = calc_tss(abt2_binary_cpy['lab'], clf.predict(abt2_binary_cpy.drop('lab', axis=1)))
    print(f'Model {key} TSS score: {score}')

---
Since we are doing random sampling, your results will vary from one execution of the sampling method to the next, but you should see that this method has returned somewhat better results than the naive sampling method we previously did.  Let's see if our distributions look any different now.

---

---
### Q8 (10 points)

In order to plot the two different classes from what are now three different DataFrames, let's break the results from Question 6 into flare and non-flare DataFrames like we did for the previously sampled data in Question 4.

In this question, construct the 2 additional DataFrames listed below and assign them to unique variables, we will be using them in the next question:
<ol>
    <li>Flaring samples from the over/under sampled analytics base table</li>
    <li>Non-Flaring samples from the over/under sampled analytics base table</li>
</ol>

For this question, I want you to utilize the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function of the DataFrame to first group the rows of your DataFrame(s) into flare and no-flare groups. This will give you a [GroupBy](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) object.  Look at the documentation of that object to see how to get a DataFrame of a specific group.  **Note: if you use groupby to construct your sampled dataset, you may need to [reset_index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) of your sampled analytics base table before attempting to group your data again.**

---

In [None]:
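# Group the over/under sampled table by label, then pull out each class.
sampled2_groups = sampled_abt.groupby('lab')
flaringSampled2 = sampled2_groups.get_group(1)
nonFlaringSampled2 = sampled2_groups.get_group(0)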

---
### Q9 (15 points)
Now that you have the over/under sampled data broken into flaring and non-flaring DataFrames, as well as the data from the original analytics base table and the naive sampling analytics base table, let's plot all three sets of data to see how they compare, much like was done in Question 5.

The plots need to be constructed in the following way:

1. Use Matplotlib [subplots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) to request a set of 3 row by 2 column subplots with `figsize=(14,14)`.  This function will return a tuple of 1.) the [Figure](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure) object, 2.) an array of [Axes](https://matplotlib.org/stable/api/axes_api.html#matplotlib.axes.Axes) objects. Assign each of these to variables as you will be using at least the array of Axes objects.

2. For each one of the [Axes](https://matplotlib.org/stable/api/axes_api.html#matplotlib.axes.Axes) objects in the array, you should set `ylim=(-0.75, 1.25)` and `xlim=(-1.5, 1.0)` using either the set method or the set_xlim or set_ylim methods. 

3. Each [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) should plot `TOTUSJZ_linear_weighted_average` on the x axis and `TOTFZ_last_value` on the y axis. Use the `x` and `y` input arguments to set the key to those column names. Then assign the `data` input argument to be the DataFrame you are plotting.

4. Use `ax=axes[m,n]` to assign the [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) to a specific subplot, where m is the row and n is the column in our 3 row by 2 column set of plots. 0,0 should be the non-flaring unsampled data, 0,1 should be the flaring unsampled data, 1,0 should be the non-flaring naive sampled data, 1,1 should be the flaring naive sampled data, 2,0 should be the non-flaring over/under sampled data, and 2,1 should be the flaring over/under sampled data.

5. You should use `cmap='Blues'` for non-flaring and `cmap='Reds'` for flaring as an argument into your  [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) call.  

6. Additional arguments should be `cbar=True`, `fill=True`, and `legend=False` to give you some more information in your plots.



---

In [None]:
# Three rows (unsampled, naive sampled, over/under sampled) by two columns (non-flaring, flaring).
fig, axes = plt.subplots(3, 2, figsize=(14,14))

for ax in axes.flat:
    ax.set_xlim([-1.5, 1.0])
    ax.set_ylim([-0.75, 1.25])

x_col, y_col = 'TOTUSJZ_linear_weighted_average', 'TOTFZ_last_value'
sns.kdeplot(data=nonFlaringUnsampled, x=x_col, y=y_col, ax=axes[0,0], cmap='Blues', cbar=True, fill=True, legend=False)
sns.kdeplot(data=flaringUnsampled, x=x_col, y=y_col, ax=axes[0,1], cmap='Reds', cbar=True, fill=True, legend=False)
sns.kdeplot(data=nonFlaringSampled, x=x_col, y=y_col, ax=axes[1,0], cmap='Blues', cbar=True, fill=True, legend=False)
sns.kdeplot(data=flaringSampled, x=x_col, y=y_col, ax=axes[1,1], cmap='Reds', cbar=True, fill=True, legend=False)
sns.kdeplot(data=nonFlaringSampled2, x=x_col, y=y_col, ax=axes[2,0], cmap='Blues', cbar=True, fill=True, legend=False)
sns.kdeplot(data=flaringSampled2, x=x_col, y=y_col, ax=axes[2,1], cmap='Reds', cbar=True, fill=True, legend=False)

---
We see a slight variation between the different techniques, but nothing too drastic (which is not that much of a surprise given how similar the scores are).

---

### Bonus Q10 (30 points)

I want you to utilize what you have learned over the course of this semester. Try to outdo the performance seen by the [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) instances of Question 7. This could be accomplished in any number of ways, but make sure the methods used are able to run from start to finish by simply doing restart and run all from the kernel menu above. There should be no need for additional intervention by the grader.

Some ideas to improve could be:

1. Perform some sort of different sampling technique. 

2. There are more parameters of the [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) than what we altered, maybe you do a [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find the best performing set of parameter values?

3. Maybe we chose a poor set of features to perform classification on? You could use the links at the top of this document to the unprocessed feature tables to construct your own set of features?

4. Perhaps the choice of normalization I performed before performing feature selection was incorrect? Maybe, instead of clipping features that have a large range and a large number of outliers a better option would have been to perform log scaling and range normalization? Similar to 3, you could use the links at the top of this document to get the unprocessed feature tables and work from there.

5. Maybe we go back to one of the other classification methods now that we have sampled the data in a different way?
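
For idea 2, one possible starting point is sketched below: a `GridSearchCV` over two SVC parameters, scored with TSS through scikit-learn's `make_scorer`. Everything specific here is an assumption for illustration: the synthetic data from `make_classification` stands in for your sampled partition 1 table, the helper `tss_score` is a hypothetical name, and the parameter grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def tss_score(y_true, y_pred) -> float:
    """TSS = TPR - FPR for binary labels 0/1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    return float(tpr - fpr)


# Synthetic stand-in data; swap in your sampled partition 1 features and labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, weights=[0.8, 0.2], random_state=0
)

# Arbitrary example grid; the values are not tuned for SWAN-SF.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.1]}

search = GridSearchCV(
    SVC(kernel="rbf"), param_grid, scoring=make_scorer(tss_score), cv=3
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The cross-validation in this sketch happens entirely inside the training data, so partition 2 can still be held back for the final TSS evaluation.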

Really, the possibilities are endless. I in no way have done the optimal thing when processing the data for the assignments this semester. I chose to perform quick and fairly understandable processes so that you could be exposed to them for the first time.  It takes more than a few hours a week to become truly familiar with a dataset and understand what design decisions are useful and what are not. This is your time to begin to explore on your own. Be creative, I hope to see some interesting solutions!


In [None]:
### Place your code in this and any number of additional code blocks. 