# Data Mining CMP-7023B

## Lab 5: Supervised Learning - Classification Part 1 - practice

# Kepler

## Context

The Kepler Space Observatory is a NASA-built satellite that was launched in 2009. The telescope is dedicated to searching for exoplanets in star systems other than our own, with the ultimate goal of finding other habitable planets. The original mission ended in 2013 due to mechanical failures, but the telescope has nevertheless been functional since 2014 on a "K2" extended mission.

Kepler had verified 1284 new exoplanets as of May 2016. As of October 2017 there are over 3000 confirmed exoplanets total (using all detection methods, including ground-based ones). The telescope is still active and continues to collect new data on its extended mission.
## Content

This dataset is a cumulative record of all observed Kepler "objects of interest" — basically, all of the approximately 10,000 exoplanet candidates Kepler has taken observations on.

This dataset has an extensive data dictionary, which can be accessed here. Columns of note are:

   - `kepoi_name`: A KOI is a target identified by the Kepler Project that displays at least one transit-like sequence within Kepler time-series photometry that appears to be of astrophysical origin and initially consistent with a planetary transit hypothesis.
   - `kepler_name`: [These names] are intended to clearly indicate a class of objects that have been confirmed or validated as planets, a step up from the planet candidate designation.
   - `koi_disposition`: The disposition in the literature towards this exoplanet candidate. One of CANDIDATE, FALSE POSITIVE, NOT DISPOSITIONED or CONFIRMED.
   - `koi_pdisposition`: The disposition Kepler data analysis has towards this exoplanet candidate. One of FALSE POSITIVE, NOT DISPOSITIONED, and CANDIDATE.
   - `koi_score`: A value between 0 and 1 that indicates the confidence in the KOI disposition. For CANDIDATEs, a higher value indicates more confidence in its disposition, while for FALSE POSITIVEs, a higher value indicates less confidence in that disposition.

## Acknowledgements

This dataset was published as-is by NASA. You can access the original table here. More data from the Kepler mission is available from the same source here.

https://www.kaggle.com/nasa/kepler-exoplanet-search-results

link: https://github.com/OpenExoplanetCatalogue/open_exoplanet_catalogue


### Starting out: loading data and libraries
We begin by loading the necessary libraries for the work we are going to do in this lab.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#designate the path where you saved your OEC data
planet_data_path = "C:\\DM-DATA\\planets.csv"

#Load the data using pandas read_csv function. 
orig_data = pd.read_csv(planet_data_path)

print("Setup complete.")

### Which columns do we want to use?

Not using: `["rowid","kepid","kepoi_name","kepler_name","koi_pdisposition","koi_tce_delivname"]`

empty cols: `["koi_teq_err1","koi_teq_err2"]`

The koi_disposition will be our target

In [None]:
# Name of the target column, kept separate from the target values themselves
label_column = "koi_disposition"
label = orig_data[label_column]

In [None]:
features_to_use = ["koi_score","koi_fpflag_nt","koi_fpflag_ss","koi_fpflag_co","koi_fpflag_ec",
                   "koi_period","koi_period_err1","koi_period_err2","koi_time0bk","koi_time0bk_err1",
                   "koi_time0bk_err2","koi_impact","koi_impact_err1","koi_impact_err2","koi_duration",
                   "koi_duration_err1","koi_duration_err2","koi_depth","koi_depth_err1","koi_depth_err2",
                   "koi_prad","koi_prad_err1","koi_prad_err2","koi_teq","koi_insol","koi_insol_err1",
                   "koi_insol_err2","koi_model_snr","koi_tce_plnt_num","koi_steff","koi_steff_err1",
                   "koi_steff_err2","koi_slogg","koi_slogg_err1","koi_slogg_err2","koi_srad","koi_srad_err1",
                   "koi_srad_err2","ra","dec","koi_kepmag"]

# Keep only the selected feature columns; .copy() gives an independent DataFrame
data = orig_data[features_to_use].copy()

In [None]:
data.head()

### Exploratory data analysis 
Explore the data to gain insights about the data.

View dimensions of dataset

We can see that there are 9564 instances and 41 attributes in the data set.

Describe basic statistics of data

#### View summary of dataset

We can see that there are 40 numerical variables and 1 categorical variable in the dataset.
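
The three checks above (dimensions, basic statistics, summary) correspond to `shape`, `describe()` and `info()` in pandas. The following is a minimal sketch on a small made-up frame; `demo` and its values are hypothetical stand-ins for the lab's `data`, not real Kepler rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the values are invented for illustration
demo = pd.DataFrame({
    "koi_period": [9.49, 54.42, 19.90, 1.74],
    "koi_depth": [615.8, 874.8, np.nan, 8079.2],
    "koi_disposition": ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "CONFIRMED"],
})

n_rows, n_cols = demo.shape      # dimensions: (instances, attributes)
stats = demo.describe()          # count, mean, std and quartiles of numeric columns
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print(stats)
demo.info()                      # dtype and non-null count for every column
```

Counting the entries of `numeric_cols` and `categorical_cols` is one way to arrive at a numerical/categorical tally like the one stated above.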

### Explore problems within variables

#### Missing Value Analysis

Check missing values in variables

#### Impute missing values

The OEC data has various missing values. Pre-process the data to impute them or handle them in another way; scikit-learn's SimpleImputer is one option.
Note: Further investigation may be needed, and different methods could be considered.
Any imputation done on the training stage should be consistent with the test stage.
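
As a hedged sketch of the consistency point: the imputer is fitted on the training rows only, and the statistics it learned are then reused on the test rows. The frames `train` and `test`, their values, and the choice of the median strategy are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames with gaps; the values are invented
train = pd.DataFrame({"koi_period": [1.0, np.nan, 3.0], "koi_depth": [10.0, 20.0, np.nan]})
test = pd.DataFrame({"koi_period": [np.nan, 5.0], "koi_depth": [40.0, np.nan]})

missing_per_column = train.isnull().sum()    # number of NaNs in each column
print(missing_per_column)

imputer = SimpleImputer(strategy="median")   # median is one choice among several
train_imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
# Reuse the medians learned from the training rows; do not refit on the test rows
test_imputed = pd.DataFrame(imputer.transform(test), columns=test.columns)
print(test_imputed)
```

Here the missing test values are filled with the training medians (2.0 and 15.0), not with statistics computed from the test rows.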

#### Exploring the target variable

Assuming your target variable is 'koi_disposition' and it is stored in a variable called 'label'

Now check frequency distribution of target variable/class (koi_disposition)

In [None]:
#check how many examples are in each category


We can see that the target variable contains 3 class labels: FALSE POSITIVE, CONFIRMED, CANDIDATE.
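
One way to obtain such a frequency table is `value_counts`. In this sketch `demo_label` is a hypothetical stand-in for `label`, and the counts are invented.

```python
import pandas as pd

# Hypothetical stand-in for `label`; the class counts are invented
demo_label = pd.Series(
    ["FALSE POSITIVE"] * 5 + ["CONFIRMED"] * 3 + ["CANDIDATE"] * 2,
    name="koi_disposition",
)

counts = demo_label.value_counts()                      # absolute frequency per class
proportions = demo_label.value_counts(normalize=True)   # share of each class
print(counts)
print(proportions.round(2))
```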

Now visualise the target variable distribution using sns.countplot

We've noticed that the target variable is imbalanced, so we'll need to address this issue later on.

### Label Encoding

Transform the set of labels from strings to a suitable encoding such that they can be used with a classifier. 
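
A minimal sketch of this step with `LabelEncoder`, assuming the three class names mentioned in this lab; `raw_labels` is a hypothetical list, not the real target column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values mirroring the three classes in the text
raw_labels = ["CONFIRMED", "FALSE POSITIVE", "CANDIDATE", "CONFIRMED"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)    # classes are sorted, then numbered from 0
decoded = encoder.inverse_transform(encoded)   # map the integers back to the strings

print(list(encoder.classes_))
print(encoded)
```

Keeping the fitted `encoder` around is useful later, because `inverse_transform` turns predicted integers back into readable class names.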

In [None]:
from sklearn.preprocessing import LabelEncoder


If the dataset has various categorical columns, consider how you can handle categorical values by generating an alternative encoding. 

#### Correlation Analysis

Convert data from numpy array to a Pandas DataFrame

In [None]:
# Pass the names as a flat list; wrapping them in another list would create a MultiIndex
data = pd.DataFrame(data, columns=features_to_use)

Create a Correlation Heat Map
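
The matrix behind such a heat map can be computed with `DataFrame.corr()`. This sketch uses three made-up columns (`a`, `b`, `c` are hypothetical) so that one strong and one weak correlation are visible in the numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Hypothetical columns: `b` is built from `a`, `c` is independent noise
demo = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr()   # pairwise Pearson correlation between numeric columns
print(corr.round(2))
```

In the notebook, passing `corr` to `sns.heatmap` draws the matrix as a colour grid; with 41 features a larger figure size is likely needed for the labels to stay readable.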

In [None]:
# Create a heatmap of the correlation matrix


### Split data into separate training and test set 

Split data into separate training and test set using the data and labels you have crafted.
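
A possible sketch of the split, assuming an 80/20 ratio and a stratified split so that both parts keep the class proportions. `X` and `y` here are synthetic placeholders for the prepared features and encoded labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and imbalanced encoded labels
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 30 + [2] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))   # class counts in the test part
```

The `stratify=y` argument is a design choice rather than a requirement; without it, a small class can end up under-represented in the test part by chance.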

In [None]:
from sklearn.model_selection import train_test_split


### Balancing

Our target variable 'label' is imbalanced, meaning some classes have significantly fewer instances than others. To address this imbalance, we're employing SMOTE (Synthetic Minority Over-sampling Technique), a method that creates synthetic samples for the minority classes to achieve a more balanced dataset.
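
To make "creates synthetic samples" concrete, the sketch below hand-rolls the core idea: a new point is placed on the line segment between a minority sample and one of its nearest minority neighbours. It is a simplified illustration, not imblearn's implementation; `smote_like`, `minority` and the parameter values are hypothetical. In the lab itself, calling `fit_resample` on a `SMOTE` object with the training features and labels performs the resampling.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Hypothetical minority-class points in two dimensions
minority = rng.normal(loc=5.0, scale=1.0, size=(10, 2))

def smote_like(points, n_new, k=3, rng=rng):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    # k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    _, idx = nn.kneighbors(points)
    base = rng.integers(0, len(points), size=n_new)              # real starting points
    neighbour = idx[base, rng.integers(1, k + 1, size=n_new)]    # skip column 0 (itself)
    gap = rng.random((n_new, 1))                                 # position on the segment
    return points[base] + gap * (points[neighbour] - points[base])

synthetic = smote_like(minority, n_new=20)
print(synthetic.shape)
```

Because every synthetic point lies between two real minority points, oversampling is normally applied to the training split only, so that the test set still contains real observations alone.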

In [None]:
from imblearn.over_sampling import SMOTE

# Assuming 'X' is your feature matrix and 'y' is your target variable


Now, you have the opportunity to visualize the distribution of the target variable both before and after the balancing process.

In [None]:
import matplotlib.pyplot as plt


## Classifier 1: k-NN

### 1- For balanced data
Train a k Nearest Neighbours classifier on your dataset, when k=3

Hint: from sklearn.neighbors import KNeighborsClassifier
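
A minimal sketch of training and predicting with k=3. The data comes from `make_classification` and only stands in for the balanced Kepler features; the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the balanced training set
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 as in the exercise
knn.fit(X_train, y_train)                   # k-NN stores the training points
y_pred = knn.predict(X_test)                # majority vote among the 3 nearest points

print(y_pred[:10])
```

k-NN relies on distances, so features on very different scales (as in the Kepler table) can dominate the vote; scaling the features first is worth considering.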

In [None]:
from sklearn.neighbors import KNeighborsClassifier


Make predictions on the test set

In [None]:
# Make predictions on the test data


#### Classifier 1 Model Evaluation:
Once you have trained your classifier on the balanced training set, you can evaluate its performance on a test set using various metrics such as confusion matrix, accuracy, precision, and recall. Now evaluate your trained model with various metrics.

Evaluate the accuracy, precision, and recall of the classifier: here, y_test are the true class labels and y_pred are the predicted class labels in the test-set.
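
A sketch of these metrics on a tiny hand-made example. The lists `y_test` and `y_pred` below are hypothetical; with three classes, precision and recall need an averaging rule, and `"macro"` is assumed here.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             classification_report)

# Hypothetical true and predicted labels for three encoded classes
y_test = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]

accuracy = accuracy_score(y_test, y_pred)
# "macro" computes the metric per class and then takes the unweighted mean
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")

print(accuracy, precision, recall)
print(classification_report(y_test, y_pred))   # per-class precision, recall and F1
```

Macro averaging treats every class equally regardless of its size, which makes weak performance on a small class visible; `average="weighted"` is an alternative that weights classes by their support.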

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score


#### Generate the classification report

### Check for overfitting and underfitting

Now, you can compare the train-set and test-set accuracy to check for overfitting.
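
One way to make this comparison is the classifier's `score` method, which returns accuracy. The sketch uses synthetic data in place of the Kepler features, so the printed numbers will differ from those in the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data; stands in for the prepared training and test sets
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

train_score = knn.score(X_train, y_train)   # accuracy on data the model has seen
test_score = knn.score(X_test, y_test)      # accuracy on held-out data
print(f"Training set score: {train_score:.4f}")
print(f"Test set score: {test_score:.4f}")
print(f"Gap: {train_score - test_score:.4f}")
```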

In [None]:
# print the scores on training and test set


Interpret the outcome here:

Training set score (0.8865): This indicates the accuracy or performance of the model on the data it was trained on. A score of 0.8865 suggests that the model performs well on the training data, achieving approximately 88.65% accuracy.

Test set score (0.5813): This represents the accuracy or performance of the model on new, unseen data (the test set). A score of 0.5813 suggests that the model's performance drops when applied to data it hasn't seen before, achieving approximately 58.13% accuracy.

A significant difference between the training and test set scores might indicate overfitting, where the model is too tailored to the training data and doesn't generalize well to new data. Further model evaluation and tuning may be needed to improve test set performance.

#### You can use cross-validation to find the best value of k (number of neighbors) for your k-NN classifier.

Hint: Find the mean accuracy for different values of k using 5-fold cross-validation. You can then choose the value of k that gives the highest mean accuracy.
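
The hint can be sketched as a loop over candidate values of k. The range of odd values from 1 to 15 is an assumption, and the data is synthetic, so the selected k says nothing about the Kepler data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the training set
X_train, y_train = make_classification(n_samples=300, n_features=8, n_informative=5,
                                       n_classes=3, random_state=0)

k_values = list(range(1, 16, 2))   # odd values of k (an assumed search range)
mean_scores = []
for k in k_values:
    # 5-fold cross-validation: five accuracy values, one per held-out fold
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5, scoring="accuracy")
    mean_scores.append(scores.mean())

best_k = k_values[int(np.argmax(mean_scores))]
print(best_k, round(max(mean_scores), 3))
```

Plotting `mean_scores` against `k_values` shows whether the best k is a clear peak or only marginally ahead of its neighbours.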


In [None]:
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt


#### Interpret your findings here:

The analysis indicates that, among the odd values of k tested, the model achieved its highest mean accuracy with k set to 1. The mean accuracy of 0.82 suggests that, on average, the model correctly predicted the class labels for approximately 82% of the instances in the cross-validated training data. This finding implies that a lower k value, in this case, 1, led to better performance on the given training data. It's crucial to note that the choice of the optimal k depends on the specific characteristics of the data, and in this instance, a lower k appears to be favorable for the model's accuracy. Further exploration and consideration of the model's generalization to unseen data may provide additional insights.

### 2- For unbalanced data

Now, you can use cross-validation to find the best value of k (number of neighbors) for your k-NN classifier on the unbalanced dataset.

#### Generate the classification report for the best value of k 

### Check for overfitting and underfitting

Now, you can compare the train-set and test-set accuracy for the unbalanced dataset to check for overfitting. Do you see any difference?

In [None]:
# print the scores on training and test set


## Classifier 2: SVC

Train an SVC classifier on your dataset with default hyperparameters

Default hyperparameters mean C=1.0, kernel='rbf' and gamma='scale' (the default since scikit-learn 0.22; older versions used 'auto'), among other parameters.

https://scikit-learn.org/stable/modules/svm.html#svm

Hint: from sklearn.svm import SVC
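
A minimal sketch of an SVC with default hyperparameters, again on synthetic data that only stands in for the prepared Kepler features; the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class data; stands in for the prepared training and test sets
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

svc = SVC()                          # no arguments: library defaults are used
svc.fit(X_train, y_train)
y_pred_svc = svc.predict(X_test)

print(svc.get_params())              # shows the defaults actually in effect
print(classification_report(y_test, y_pred_svc))
```

Printing `get_params()` is a quick way to confirm which defaults the installed scikit-learn version applies.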

In [None]:
from sklearn.svm import SVC


Make predictions on the test set

In [None]:
# Make predictions on the test set


#### Classifier 2 Model Evaluation:
Once you have trained your SVC classifier (svc) on the balanced training set, you can evaluate its performance on a test set using various metrics such as confusion matrix, accuracy, precision, and recall. Now evaluate your SVC classifier:

#### Generate the classification report

In [None]:
# Classification Report


### Confusion matrix for all classifiers
Draw confusion matrix for all classifiers
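
As a sketch, the matrices for both classifiers can be computed in one loop with `confusion_matrix`. The data is synthetic and the dictionary names are hypothetical. For a plotted version, recent scikit-learn releases provide `ConfusionMatrixDisplay.from_predictions`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic three-class data; stands in for the prepared training and test sets
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifiers = {"k-NN (k=3)": KNeighborsClassifier(n_neighbors=3), "SVC": SVC()}
matrices = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # Rows are true classes, columns are predicted classes
    matrices[name] = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1, 2])
    print(name)
    print(matrices[name])
```

The diagonal of each matrix holds the correctly classified instances per class, which helps answer which classifier is strongest on each class.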

In [None]:
#pip install --upgrade scikit-learn

### Compare performance of classifiers
Which classifier performed best overall? Which classifier had the highest accuracy on each class?

As evident from the results, the performance of both classifiers is suboptimal. Your next task is to explore alternative analyses with varied parameters for each classifier, experiment with different feature selection techniques, and assess the models using the original unbalanced data. Keep in mind that balancing the data doesn't always guarantee improved performance, so it's crucial to thoroughly investigate various configurations to improve the classifiers' effectiveness.

### Optimizing SVM Classifier

   - Use the SVM classifier (SVC) from scikit-learn.
   - Perform a grid search for hyperparameter optimization using GridSearchCV (cv=3).
   - Use the following hyperparameter combinations:
       - Kernel: 'linear', 'rbf'
       - C: 1.0, 10.0, 100.0
       - Gamma: 'scale'
   - Use a balanced training dataset (X_train_balanced, y_train_balanced).
   - Print the best hyperparameters found by the grid search.
   - Print a classification report for the predictions on the test set.
   
**This task may take some time to run.**
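
The steps listed above can be sketched as follows. The grid and `cv=3` follow the task description; the data is synthetic, so `X_train_balanced` and `y_train_balanced` here are placeholders for the SMOTE-balanced training set, and the accuracy scoring is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic three-class data; placeholders for the balanced training set and the test set
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train_balanced, X_test, y_train_balanced, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [1.0, 10.0, 100.0],
    "gamma": ["scale"],
}
# 2 kernels x 3 values of C x 1 gamma = 6 candidates, each fitted on 3 folds
grid = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
grid.fit(X_train_balanced, y_train_balanced)

print("Best hyperparameters:", grid.best_params_)
# After the search, `grid` predicts with the best model refitted on all training data
print(classification_report(y_test, grid.predict(X_test)))
```

On the full Kepler data the linear kernel with large C can be slow on unscaled features, which is one reason the run time may be noticeable.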

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import time

