## 1. Import necessary packages

For this exercise we need:

* pandas
* train_test_split
* LogisticRegression
* pyplot from matplotlib
* KNeighborsClassifier
* RandomForestClassifier
* DummyClassifier

In [1]:
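import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# scikit-learn classes and helpers listed above
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier

import warnings

warnings.filterwarnings('ignore')  # hide all warnings to keep the output readable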

## 2. Load and prepare the dataset
* Load the training data into a dataframe named train_df.
* Turn the task into a binary classification problem by renaming some class labels.
* Create a dataframe named X that holds the 9 features (drop column 9).
* Create a dataframe named y that holds the labels (select only column 9).
* Split the data into a training set and a test set.

In [2]:
train_df = pd.read_csv('data/shuttle_full.csv')

In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43499 entries, 0 to 43498
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   50      43499 non-null  int64
 1   21      43499 non-null  int64
 2   77      43499 non-null  int64
 3   0       43499 non-null  int64
 4   28      43499 non-null  int64
 5   0.1     43499 non-null  int64
 6   27      43499 non-null  int64
 7   48      43499 non-null  int64
 8   22      43499 non-null  int64
 9   2       43499 non-null  int64
dtypes: int64(10)
memory usage: 3.3 MB
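
The column names reported above (50, 21, 77, ...) look like data values, which usually means the first row of the CSV was read as a header. Assuming the file really has no header row (this is not confirmed here), the sketch below shows on a tiny in-memory stand-in how `header=None` keeps that row as data. The two rows are taken from the column names and the first record shown in this notebook.

```python
import io
import pandas as pd

# Two header-less rows standing in for the real file
raw = "50,21,77,0,28,0,27,48,22,2\n55,0,92,0,0,26,36,92,56,4\n"

with_default = pd.read_csv(io.StringIO(raw))            # first row is consumed as column names
no_header = pd.read_csv(io.StringIO(raw), header=None)  # first row is kept as data

print(with_default.shape)          # one data row remains
print(no_header.shape)             # both rows are data
print(list(with_default.columns))  # the duplicate name "0" is renamed to "0.1"
```

With `header=None` the columns are simply numbered 0 to 9, which matches the "column 9" wording used in the task list.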


In [4]:
train_df.head()

   50  21  77   0  28  0.1  27  48  22  2
0  55   0  92   0   0   26  36  92  56  4
1  53   0  82   0  52   -5  29  30   2  1
2  37   0  76   0  28   18  40  48   8  1
3  37   0  79   0  34  -26  43  46   2  1
4  85   0  88  -4   6    1   3  83  80  5


In [5]:
train_df.describe()

              50         21         77          0         28         0.1         27         48         22         2
count    43499.0    43499.0    43499.0    43499.0    43499.0     43499.0    43499.0    43499.0    43499.0   43499.0
mean   48.249707  -0.205614  85.341755   0.262742  34.528932    1.298306  37.074783  50.899929  13.964413  1.700522
std    12.252756  78.143602   8.908614  41.004603  21.703636  179.488823  13.135619  21.463492   25.64867  1.354663
min         27.0    -4821.0       21.0    -3939.0     -188.0    -13839.0      -48.0     -353.0     -356.0       1.0
25%         38.0        0.0       79.0        0.0       26.0        -5.0       31.0       37.0        0.0       1.0
50%         45.0        0.0       83.0        0.0       42.0         0.0       39.0       44.0        2.0       1.0
75%         55.0        0.0       89.0        0.0       46.0         5.0       42.0       60.0       14.0       1.0
max        126.0     5075.0      149.0     3830.0      436.0     13148.0      105.0      270.0      266.0       7.0


In [6]:
train_df.isnull()

          50     21     77      0     28    0.1     27     48     22      2
0      False  False  False  False  False  False  False  False  False  False
1      False  False  False  False  False  False  False  False  False  False
2      False  False  False  False  False  False  False  False  False  False
3      False  False  False  False  False  False  False  False  False  False
4      False  False  False  False  False  False  False  False  False  False
...      ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
43494  False  False  False  False  False  False  False  False  False  False
43495  False  False  False  False  False  False  False  False  False  False
43496  False  False  False  False  False  False  False  False  False  False
43497  False  False  False  False  False  False  False  False  False  False


In [8]:
train_df.isna().mean() * 100  # percentage of missing values per column

50     0.0
21     0.0
77     0.0
0      0.0
28     0.0
0.1    0.0
27     0.0
48     0.0
22     0.0
2      0.0
dtype: float64

In [None]:
train_df
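
The remaining preparation steps from the list at the top of this section (binary labels, X, y, split) might look like the sketch below. The relabeling rule (class 1 becomes 0, every other class becomes 1), the 75/25 stratified split, and the small random stand-in frame `demo_df` are assumptions made for illustration; the exercise does not say which labels to merge.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the same layout: columns 0-8 are features, column 9 is the class
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.integers(0, 100, size=(200, 9)))
demo_df[9] = rng.choice([1, 4, 5], size=200, p=[0.8, 0.15, 0.05])

X = demo_df.iloc[:, :9]                    # the 9 feature columns
y = (demo_df.iloc[:, 9] != 1).astype(int)  # assumption: class 1 -> 0, all other classes -> 1

# Stratify so both sets keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
print(y.value_counts())
```

Positional indexing with `iloc` is used so the same lines would work on `train_df` regardless of how its columns are named.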

## 3. Create the model
* Instantiate a Logistic Regression classifier with the lbfgs solver.
* Fit the classifier to the data.
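
One possible way to do this is sketched below. So that the snippet runs on its own, it uses synthetic data from `make_classification` as a stand-in; in the notebook you would use the training split from section 2 instead. The `max_iter=1000` setting is an assumption to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 9 features, imbalanced binary target
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X_train, y_train)
print(clf.coef_.shape)  # one coefficient per feature
```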

## 4. Calculate Accuracy
* Calculate and print the accuracy of the model on the test data.
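
As a rough sketch, accuracy can be obtained either from the estimator's `score` method or from `accuracy_score`. The synthetic data is only a stand-in so the snippet is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption); use the split from section 2 in the notebook
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)  # same value as clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```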

## 5. Dummy Classifier
* Use the dummy classifier to measure the accuracy achievable by pure random chance.
* Compare this result to the result of the logistic regression classifier above. What does this result tell you?
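
A minimal sketch of such a baseline follows. `strategy="uniform"` guesses each class with equal probability, which is the closest match to "pure random chance"; `strategy="most_frequent"` is included as a second baseline because it is often harder to beat on imbalanced data. The synthetic data is a stand-in, not the shuttle data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): roughly 80% of samples belong to class 0
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

random_guess = DummyClassifier(strategy="uniform", random_state=0).fit(X_train, y_train)
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

random_acc = random_guess.score(X_test, y_test)
majority_acc = majority.score(X_test, y_test)
print(f"Uniform random guessing: {random_acc:.3f}")
print(f"Always the majority class: {majority_acc:.3f}")
```

If the real classifier is not clearly above the majority-class baseline, its accuracy mostly reflects the class imbalance rather than anything it learned.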

## 6. Confusion Matrix
* Print the confusion matrix.
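
The confusion matrix can be computed with `confusion_matrix`, as in this sketch on stand-in synthetic data. Rows are true classes and columns are predicted classes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data (assumption); use the split from section 2 in the notebook
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_train, y_train)

cm = confusion_matrix(y_test, clf.predict(X_test))
tn, fp, fn, tp = cm.ravel()  # binary case: [[TN, FP], [FN, TP]]
print(cm)
```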

## 7. Plot a nicer confusion matrix (optional)
* Use the plot_confusion_matrix function from above to plot a nicer looking confusion matrix.

## 8. Calculate Metrics
* Print the F1, F-beta, precision, recall, and accuracy scores.
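
These scores all live in `sklearn.metrics`. The sketch below uses stand-in synthetic data, and `beta=2` (which weights recall more heavily than precision) is an arbitrary choice for illustration, since the exercise does not specify a value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, fbeta_score, precision_score, recall_score
)
from sklearn.model_selection import train_test_split

# Stand-in data (assumption); use the split from section 2 in the notebook
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

scores = {
    "F1": f1_score(y_test, y_pred),
    "F-beta (beta=2)": fbeta_score(y_test, y_pred, beta=2),  # beta is an assumed value
    "Precision": precision_score(y_test, y_pred),
    "Recall": recall_score(y_test, y_pred),
    "Accuracy": accuracy_score(y_test, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```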

## 9. Print a classification report
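
A short sketch using `classification_report`, which summarizes precision, recall, F1, and support per class. The synthetic data is a stand-in for the notebook's own split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in data (assumption); use the split from section 2 in the notebook
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_train, y_train)

report = classification_report(y_test, clf.predict(X_test))
print(report)
```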

## 10. Plot ROC Curve and AUC
* Calculate the AUC and plot the ROC curve.
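
The ROC curve needs scores rather than hard predictions, so this sketch uses the predicted probability of the positive class. It runs on stand-in synthetic data; the styling choices are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Stand-in data (assumption); use the split from section 2 in the notebook
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_train, y_train)

y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"Logistic regression (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
```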


## 11. Plot Precision-Recall Curve
* Plot the precision-recall curve for the model above.
* Find the value of C for the Logistic Regression classifier that best avoids overfitting: plot the training and testing accuracy over a range of C values from 0.05 to 1.5.
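
Both tasks can be sketched together as follows, again on stand-in synthetic data. The grid of 30 evenly spaced C values (a step of 0.05) is an assumption; the exercise only gives the range.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Stand-in data (assumption); use the split from section 2 in the notebook
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Precision-recall curve for the default model
clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_train, y_train)
precision, recall, thresholds = precision_recall_curve(
    y_test, clf.predict_proba(X_test)[:, 1]
)

# Training and testing accuracy for each C (smaller C means stronger regularization)
C_values = np.linspace(0.05, 1.5, 30)
train_acc, test_acc = [], []
for C in C_values:
    model = LogisticRegression(C=C, solver="lbfgs", max_iter=1000).fit(X_train, y_train)
    train_acc.append(model.score(X_train, y_train))
    test_acc.append(model.score(X_test, y_test))

fig, (ax_pr, ax_c) = plt.subplots(1, 2, figsize=(11, 4))
ax_pr.plot(recall, precision)
ax_pr.set_xlabel("Recall")
ax_pr.set_ylabel("Precision")
ax_c.plot(C_values, train_acc, label="Training accuracy")
ax_c.plot(C_values, test_acc, label="Testing accuracy")
ax_c.set_xlabel("C")
ax_c.set_ylabel("Accuracy")
ax_c.legend()

best_C = C_values[int(np.argmax(test_acc))]
print(f"C with the highest testing accuracy: {best_C:.2f}")
```

A widening gap between the training and testing curves as C grows is the usual sign of overfitting. Choosing C on the test set gives an optimistic estimate, so a separate validation set or cross-validation would be a cleaner way to pick it.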


## 12. Cross Validation
* Perform 5-fold cross validation for a Logistic Regression Classifier. Print the 5 accuracy scores and the mean validation score.
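
A minimal sketch with `cross_val_score`. For a classifier and an integer `cv`, scikit-learn uses stratified folds by default. The synthetic data stands in for the full X and y.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data (assumption); use the full X and y from section 2 in the notebook
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)

clf = LogisticRegression(solver="lbfgs", max_iter=1000)
cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

print("Fold accuracies:", cv_scores.round(3))
print(f"Mean validation accuracy: {cv_scores.mean():.3f}")
```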


## 13. Is this really linear?
* Our linear classifier is not giving us better accuracy than the dummy classifier. What if the data is not linearly separable? Instantiate and train a KNN model with k = 7. How does its accuracy compare to that of the Logistic Regression model above? What does that tell you about the data?
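
The comparison could be set up as in this sketch, which trains both models on the same split. The synthetic data is a stand-in, so the numbers it prints say nothing about the shuttle data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption); use the split from section 2 in the notebook
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
logreg = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_train, y_train)

knn_acc = knn.score(X_test, y_test)
logreg_acc = logreg.score(X_test, y_test)
print(f"KNN (k=7): {knn_acc:.3f}")
print(f"Logistic regression: {logreg_acc:.3f}")
```

KNN relies on distances, so features with very different ranges (as the `describe()` output suggests for this dataset) can dominate the result; scaling the features first may be worth trying.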


## 14. Random Forest
* Next, instantiate and fit a RandomForestClassifier and calculate the accuracy of that model.
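
A sketch of this last step, with `n_estimators=100` and a fixed `random_state` as assumed settings for reproducibility. As elsewhere, the synthetic data only keeps the snippet self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (assumption); use the split from section 2 in the notebook
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

forest_acc = forest.score(X_test, y_test)
print(f"Random forest accuracy: {forest_acc:.3f}")
print("Feature importances:", forest.feature_importances_.round(3))
```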