<h3 style = "text-align: center; color:blue"> ISAT 341: Machine Learning and Data Science </h3>

<h3 style = "text-align: center; color:green"> Project: Machine Learning Confidential Sensor Data </h3>

<img src="images/machine_learning.jpg" width="200" height="250" align="center">

<h3 style = "text-align: center; color:red"> Working with <em>real-world</em> datasets </h3>

### Objectives

Demonstrate the ability to complete an end-to-end data science / machine learning project on real-world data by following and
implementing the main steps of the machine learning checklist that lead to a solution, namely:
* Frame the problem and look at the big picture.
* Get the data.
* Explore the data to gain insights.
* Prepare the data to expose the underlying data patterns to Machine Learning algorithms.
* Explore many different models and short-list the best ones.
* Fine-tune your models and combine them into a great solution.
* Present your solution.
* Launch, monitor, and maintain your system.

<h3 style = "text-align: left; color:purple"> Frame the Problem </h3>

<img src="images/sensor_array.jpg" width="200" height="250" align="center">

### Sensor Data

The data source and the exact nature of the data are confidential. Each data instance contains 12 real-valued input attributes. Each input
attribute is the reading of a sensor designed to detect the presence of one of two groups of substances. Alternatively, the sensor readings may
represent a 'false alarm'.
* Substance 1 is represented by the value 'one' in the class attribute column.
* Substance 2 is represented by the value 'two' in the class attribute column.
* A false alarm is represented by the value 'three' in the class attribute column.

The problem is framed as a **supervised learning** problem: Predict the class of a substance from sensor data using the given measurements in the dataset.

<h3 style = "text-align: Center; color:purple"> Project Analysis Starts Here! </h3>


In [13]:
# import packages
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.set_printoptions(precision=3, suppress=True)

<h3 style = "text-align: left; color:green"> Exploratory Data Analysis </h3>

### 1) TO DO: Use Pandas to load your data into a dataframe

In [14]:
df = pd.read_csv('Data/Sensor_Data_Confidential_341Project_DataSet5.csv')
df.head(10)

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,one
5,1.796,2.262,3.18,2.888,0.2966,0.1477,0.1624,0.282,-9999.0,1.172,1.03,4.861,one
6,1.566,2.323,3.469,2.711,0.2417,0.05371,0.2112,-0.02686,0.2197,1.158,0.9924,4.78,one
7,1.425,2.152,3.287,2.781,0.2991,0.2075,0.1038,0.1147,0.2698,1.271,1.115,5.662,one
8,1.595,2.271,3.323,2.743,0.1733,0.1965,0.1685,0.05859,0.2051,1.29,1.033,5.145,one
9,1.628,2.211,3.176,2.71,0.08423,0.1892,0.2783,0.1685,0.3491,1.155,1.008,5.613,one


### 2) TO DO: Describe the data

In [15]:
df.describe()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12
count,2064.0,2064.0,2064.0,2064.0,2064.0,2064.0,2064.0,2064.0,2064.0,2064.0,2064.0,2064.0
mean,-102.944836,-109.711795,-111.319884,-67.994856,-124.949039,-105.066162,-81.658039,-125.587986,-100.835192,-74.740051,-85.312925,-78.599944
std,1027.427211,1050.057089,1072.729741,849.911071,1115.540299,1027.206775,903.995215,1115.467928,1003.772889,877.403031,930.088695,904.281076
min,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0
25%,3.0585,0.8176,4.84825,3.99825,0.4626,0.6494,0.3088,0.188,0.4236,1.379,1.089,0.7898
50%,4.082,1.353,5.336,5.041,0.8063,1.6345,0.5652,0.3027,0.6946,1.996,1.2855,4.915
75%,4.504,2.35425,5.59125,5.64275,1.40075,2.159,0.9598,0.4919,1.23425,4.974,1.86525,5.403
max,5.105,4.675,5.944,6.013,2.754,3.638,2.446,1.199,2.561,5.312,5.64,20.0


### 3) TO DO: Display the shape of the data

In [16]:
df.shape

(2064, 13)

### 4) TO DO: Drop the missing data

In [17]:
# First replace the sentinel value -9999 with NumPy's NaN, then drop every row that contains a NaN.
df = df.replace(-9999, np.nan)
df.dropna(inplace = True)

In [18]:
# Shape of the dataframe after cleaning
df.shape

(1826, 13)
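
The cleaning step above removed 2064 - 1826 = 238 rows. As an alternative to replacing the sentinel after loading, pandas can treat -9999 as missing while reading the file. The sketch below is only an illustration: it uses a tiny in-memory CSV with made-up values (`csv_text` and `demo` are hypothetical names, not the project file) and counts the missing readings per column before dropping rows.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for the sensor file; -9999 marks a failed reading.
csv_text = """Input 1,Input 2,class
1.473,2.311,one
-9999,2.377,one
1.552,-9999,two
1.605,2.228,three
"""

# na_values tells read_csv to parse the sentinel as NaN while loading.
demo = pd.read_csv(io.StringIO(csv_text), na_values=[-9999])

# Count missing readings per column before dropping anything.
missing_per_column = demo.isna().sum()

# Drop every row that has at least one missing reading.
clean = demo.dropna()

print(missing_per_column)
print(clean.shape)
```

Counting the missing values first shows whether the losses are spread across sensors or concentrated in a few columns, which can inform whether dropping rows is a reasonable choice.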

### 5) TO DO: Use pandas correlation method to find the two features (inputs) with the highest correlation

In [24]:
# Pairwise Pearson correlation between the 12 sensor columns (the text 'class' column is excluded)
df.drop(columns='class').corr()

Inputs 10 and 11 are the most highly correlated.
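
Reading the largest off-diagonal entry from a 12 x 12 correlation table by eye is error-prone, so one option is to extract it programmatically. The following is a minimal sketch on synthetic data (the frame `demo` and its values are made up, with `Input 3` deliberately built to track `Input 2`); the same steps could be applied to the numeric columns of the cleaned dataframe.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned sensor frame (values are made up).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "Input 1": rng.normal(size=200),
    "Input 2": base,
    "Input 3": base + rng.normal(scale=0.1, size=200),  # built to track Input 2
})

# Absolute Pearson correlations between every pair of columns.
corr = demo.corr().abs()

# Keep only the strict upper triangle so each pair appears once and the
# diagonal (self-correlation of 1.0) is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# The index of the largest remaining value is the most correlated pair.
top_pair = pairs.idxmax()
print(top_pair, round(pairs.max(), 3))
```

Taking the absolute value treats strong negative correlations as equally interesting; drop `.abs()` if only positive correlation matters.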

<h3 style = "text-align: Left; color:green"> Data Visualization </h3>


### 6) TO DO: Plot bar charts using pandas dataframe (plot the mean value of the sensors)
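
One possible way to approach this step is sketched below. It assumes a dataframe with the twelve `Input` columns; here a synthetic frame `demo` with made-up values stands in for the cleaned data, and the titles and axis labels are illustrative choices.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned sensor frame (values are made up).
rng = np.random.default_rng(0)
sensor_cols = [f"Input {i}" for i in range(1, 13)]
demo = pd.DataFrame(rng.uniform(0, 6, size=(50, 12)), columns=sensor_cols)

# Mean reading of each sensor, one value per column.
sensor_means = demo[sensor_cols].mean()

# pandas draws the bar chart through matplotlib and returns the Axes.
ax = sensor_means.plot(kind="bar", figsize=(8, 4), title="Mean value per sensor")
ax.set_xlabel("Sensor")
ax.set_ylabel("Mean reading")
ax.figure.tight_layout()
```

In the notebook, `%matplotlib inline` renders the figure at the end of the cell. Plotting after the -9999 sentinel rows have been removed matters here, because the sentinel would otherwise dominate every mean.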

<h3 style = "text-align: Left; color:green"> Data Preprocessing </h3>

### 7) TO DO: Create Feature Matrix and Target Vector

In [40]:
# Define the target vector (y) and the feature matrix (X)
y = df['class']
X = df[['Input 1', 'Input 2', 'Input 3', 'Input 4', 'Input 5', 'Input 6', 'Input 7', 'Input 8', 'Input 9', 'Input 10', 'Input 11', 'Input 12']]

### 8) TO DO: Convert the features dataframe to a numpy array

In [42]:
# Assign the converted arrays back so the conversion is kept, then display X
y = y.to_numpy()
X = X.to_numpy()
X

array([[1.473, 2.311, 3.179, ..., 1.138, 1.111, 4.712],
       [1.46 , 2.377, 3.214, ..., 1.213, 1.027, 5.463],
       [1.552, 2.164, 3.064, ..., 1.221, 1.058, 5.332],
       ...,
       [3.64 , 1.284, 5.111, ..., 1.46 , 1.118, 4.867],
       [3.746, 1.261, 5.049, ..., 1.482, 1.128, 5.627],
       [3.959, 1.108, 5.422, ..., 1.595, 1.244, 5.623]])

### 9) TO DO: Label Encoding
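
A minimal sketch of label encoding with scikit-learn's `LabelEncoder`, assuming the target holds the strings 'one', 'two' and 'three' described earlier. The array `y_demo` is a small made-up example, not the project data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Small made-up target vector using the three class names from the dataset.
y_demo = np.array(["one", "two", "three", "one", "three"])

# fit learns the sorted set of distinct labels; transform maps each to an integer.
le = LabelEncoder()
y_encoded = le.fit_transform(y_demo)

print(le.classes_)  # labels in the order of their integer codes
print(y_encoded)

# inverse_transform recovers the original strings, e.g. for reporting predictions.
print(le.inverse_transform(y_encoded))
```

Note that the integer codes follow the alphabetical order of the labels, so 'one' maps to 0, 'three' to 1 and 'two' to 2. The codes are identifiers only and carry no numeric meaning.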

### 10) Split the data into Training and Testing Sets

Scikit-learn provides the **train_test_split** function, which randomly shuffles the dataset and then splits it into two datasets: a
**training set** used to build the model and a **test set** used to evaluate how well the model works on unseen data (also called out-of-sample data).

### *Using 80%/20% split*

In [None]:
TO DO: 
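
A sketch of an 80%/20% split, shown on synthetic arrays with the same shape as the cleaned data (1826 rows, 12 features). The names `X_demo` and `y_demo`, the value `random_state=42`, and the use of `stratify` are assumptions made for illustration rather than requirements of the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the cleaned data: 1826 rows, 12 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1826, 12))
y_demo = rng.integers(0, 3, size=1826)  # integer-encoded classes 0, 1, 2

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable;
# stratify keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 1826 rows, this split yields 1460 training rows and 366 test rows, each with 12 feature columns.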

### 11) Look at the shape of the data (rows and columns) after splitting it into training and testing sets

<h3 style = "text-align: left; color:green"> Scale the Data </h3>
<h3 style = "text-align: left; color:red"> IMPORTANT: Standardizing the features:</h3>

Standardization of datasets (feature scaling) is a common requirement for many machine learning and optimization algorithms implemented in
scikit-learn. These algorithms may perform poorly if the individual features do not roughly resemble standard normally distributed data, i.e., Gaussian
with zero mean and unit variance.

### 12) Using the StandardScaler from Scikit-learn to transform (scale) our features
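
A minimal sketch of this step, using made-up training and test matrices in place of the sensor data. The variable name `sc` follows the description of this step; the feature means and scales used to generate the demo data are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test feature matrices (3 features on different scales).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[1.0, 50.0, -3.0], scale=[0.5, 10.0, 2.0], size=(100, 3))
X_test = rng.normal(loc=[1.0, 50.0, -3.0], scale=[0.5, 10.0, 2.0], size=(25, 3))

sc = StandardScaler()

# fit estimates the per-feature mean and standard deviation from the training data only.
sc.fit(X_train)

# transform applies (x - mean) / std using those training-set estimates.
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

print(X_train_std.mean(axis=0))  # approximately 0 for every feature
print(X_train_std.std(axis=0))   # approximately 1 for every feature
```

The test set is transformed but never fitted, so its scaled means and standard deviations will be close to, but not exactly, 0 and 1.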

**A comment on what the above code does**

* Using the preceding code, we loaded the StandardScaler class from the preprocessing module and initialized a new StandardScaler object that we assigned to the variable sc.
* Using the fit method, StandardScaler estimated the parameters μ (sample mean) and σ (standard deviation) for each feature dimension from the training data.
* By calling the transform method, we then standardized the training data using those estimated parameters μ and σ.
* Note that we used the same scaling parameters to standardize the test set so that the values in the training and test datasets are comparable to each other.

<h3 style = "text-align: left; color:green"> Model Building </h3>

**Training multiple Machine Learning models during the same session.**
1. K-Nearest Neighbor (with K=10, K=50, K=200)
2. Logistic Regression
3. Linear Support Vector Classifier

<h3 style = "text-align: left; color:green"> Build a KNN Classification Model for K = 10, 50 and 200 </h3>

### Build and train the actual machine learning model using a loop
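
One way the loop could look is sketched below. Because the sensor data is not reproduced here, the example generates a synthetic 3-class problem with 12 features via `make_classification`; the dataset parameters, `random_state` values, and the dictionary name `knn_scores` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem with 12 features standing in for the sensor data.
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=12, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scale with statistics from the training split only.
sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

# Train one model per K and record its accuracy on the held-out test set.
knn_scores = {}
for k in (10, 50, 200):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_std, y_train)
    knn_scores[k] = knn.score(X_test_std, y_test)
    print(f"K={k:>3}: test accuracy = {knn_scores[k]:.3f}")
```

Storing the scores in a dictionary keyed by K makes it straightforward to compare the three models afterwards or to plot accuracy against K.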