# HW5

For this homework, we are going to work with the [*Indoor User Movement Prediction from RSS data*](https://archive.ics.uci.edu/ml/datasets/Indoor+User+Movement+Prediction+from+RSS+data) dataset from UCI.  The homework is due Friday, December 21st at midnight.

## Task 1

Download the dataset and unzip it into a subdirectory of the `data` folder named `rss_data`.

The files we are interested in are in the subfolder `dataset`.  Each file whose name starts with `MovementAAL_RSS_` contains data collected from an individual, and each file represents a single data point.  There are 314 of these files, and hence, you have 314 data points.  Each file has 4 columns, but the number of rows changes from file to file.

There is also a file named `MovementAAL_target.csv` in that folder. This file gives the class assigned to each of these measurements: some are labelled +1 and some are labelled -1.
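A minimal sketch for checking the layout described above, assuming the archive was unzipped so that the CSV files sit in `data/rss_data/dataset` (adjust the path to your setup) and that each file starts with one header line. The helper name `count_rows_per_file` is hypothetical; it only counts the data rows in each `MovementAAL_RSS_*.csv` file.

```python
import csv
import glob
import os


def count_rows_per_file(folder):
    """Return {file name: number of data rows} for the RSS files in `folder`."""
    counts = {}
    pattern = os.path.join(folder, "MovementAAL_RSS_*.csv")
    for file_path in sorted(glob.glob(pattern)):
        with open(file_path, newline="") as handle:
            rows = list(csv.reader(handle))
        # Assumption: the first line of each file is a header, not a measurement.
        counts[os.path.basename(file_path)] = max(len(rows) - 1, 0)
    return counts


if __name__ == "__main__":
    # Assumed location; change it if your data lives elsewhere.
    counts = count_rows_per_file(os.path.join("data", "rss_data", "dataset"))
    print(len(counts), "files found")
    if counts:
        print("rows per file range from", min(counts.values()), "to", max(counts.values()))
```

If the download is complete, the first printed number should match the 314 files mentioned above.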

## Task 2

Construct an SVM model that separates the data points labelled +1 from those labelled -1.  You must first solve the problem that these data points do not have the same number of rows, even though they all have the same number of columns.

## Task 3

Using [Keras](https://keras.io/getting-started/sequential-model-guide/), write a neural network model that separates the data points labelled +1 from those labelled -1.

## Notes

1. You must document each step of your tasks: what you are doing, why you are doing it, what problems you encountered, and how you solved them.  All of these must be explained and documented.  Solutions without sufficient documentation will be penalized accordingly. 50% of your points will come from your code, and the other 50% will come from your explanations.

2. You can use MS Excel to inspect the files, but loading them into Python with pandas and inspecting them in Jupyter is easier.

3. Put the data in a separate subfolder of your `data` folder named `rss_data`. I'll take points off if the data is not saved in the correct place.

4. For both Task 2 and Task 3, you must split your data into a training set and a test set, and then evaluate the accuracy of your model on the test set.


## Task 2: Construct SVM

In [1]:
import os
import pandas as pd

In [2]:
path = os.getcwd()
#path = "home/nbuser/library"
files = os.listdir(path + '/dataset/')
files = sorted(files)

In [3]:
target_file = pd.read_csv(path + '/dataset/MovementAAL_target.csv', names=('ID', 'class_label'), skiprows=(1))

In [4]:
target_file

Unnamed: 0,ID,class_label
0,1,1
1,2,1
2,3,1
3,4,1
4,5,1
5,6,1
6,7,1
7,8,1
8,9,1
9,10,1


The class labels (the last column of `target_file`) are extracted and assigned to `target_data`.

In [5]:
target_data = target_file.iloc[:,-1]

An empty data frame named `rss_data_frame` is created with a single `target` column.

In [6]:
columns = ["target"]
rss_data_frame = pd.DataFrame(columns=columns)
rss_data_frame

Unnamed: 0,target


We will construct an SVM model on the RSS anchor values, but each individual user has a separate file that holds several sensor readings over a varying number of time steps. So, we need to reduce the dimensionality of the data so that every file yields a feature vector of the same length. PCA can be used for this.

### Principal Component Analysis

Principal Component Analysis (PCA) is a method for reducing the dimensionality of data by computing the eigenvalues and eigenvectors of its covariance matrix. In this problem, we use PCA to reduce each user's file to a single vector. We then merge these row vectors from all user files into one dataset.

First, let's load an example file with `get_dataframe`.

In [7]:
def get_dataframe(seq_id):
    return pd.read_csv( path + '/dataset/MovementAAL_RSS_%s.csv' % seq_id, 
                           names=('RSS_anchor1', 'RSS_anchor2','RSS_anchor3', 'RSS_anchor4'), 
                           skiprows=(1))

In [8]:
example_data = get_dataframe(3)

We want one row and four columns (`RSS_anchor1`, `RSS_anchor2`, `RSS_anchor3`, `RSS_anchor4`) at the end of the PCA process. So, we transpose `example_data` so that the matrix products are compatible when computing the eigenvectors and eigenvalues.

In [16]:
example_data = example_data.T

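`sigma` is the covariance matrix of `df_example`. In the formula, $X$ is the `df_example` matrix, $\bar{X}$ holds its column means, and $n$ is the number of rows. By default, `np.cov` centers the data and divides by $n - 1$.

\begin{align}
    S = \frac{1}{n-1} (X - \bar{X})^T (X - \bar{X})
\end{align}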


In [9]:
import numpy as np
df_example = np.asmatrix(example_data)
sigma = np.cov(df_example.T)
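As a sanity check on what `np.cov` computes, here is a small sketch on made-up numbers (the array `X` is illustrative, not taken from the dataset). By default `np.cov` treats each row of its argument as a variable, subtracts that variable's mean, and divides by n - 1.

```python
import numpy as np

# Illustrative data: 5 time steps (rows) x 4 anchors (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))

# np.cov expects variables in rows, hence the transpose.
sigma = np.cov(X.T)

# The same matrix written out by hand.
n = X.shape[0]
X_centered = X - X.mean(axis=0)
sigma_manual = X_centered.T @ X_centered / (n - 1)

print(sigma.shape)                       # (4, 4)
print(np.allclose(sigma, sigma_manual))  # True
```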

The eigenvalues and eigenvectors are computed from the covariance matrix `sigma`. Each eigenvalue gives the variance of the data along its eigenvector, and the eigenvector gives the corresponding direction.

In [10]:
eigVals, eigVec = np.linalg.eig(sigma)

We are interested in the eigenvalues with the largest variance and the eigenvectors that correspond to them. So, we sort the eigenvalues and eigenvectors in decreasing order of eigenvalue.

In [11]:
sorted_index = eigVals.argsort()[::-1] 
eigVals = eigVals[sorted_index]
eigVec = eigVec[:,sorted_index]
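Because each eigenvalue is the variance along its eigenvector, dividing the sorted eigenvalues by their sum gives the share of variance each component carries. A small sketch on synthetic data (the array `X` is made up); it uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix and returns real eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 50 samples, 4 features, with the first feature stretched.
X = rng.normal(size=(50, 4)) * np.array([5.0, 1.0, 1.0, 1.0])

sigma = np.cov(X.T)
eig_vals, eig_vecs = np.linalg.eigh(sigma)  # real eigenvalues, ascending order

# Sort in decreasing order of variance.
order = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]

explained_ratio = eig_vals / eig_vals.sum()
print(explained_ratio)  # shares of total variance, largest first
```

A first share close to 1 means that keeping a single component, as done here, loses little variance; a small first share means most of the variance is discarded.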

We keep the eigenvector that corresponds to the largest eigenvalue.

In [12]:
eigVec = eigVec[:,:1]
eigVec

array([[-0.52132164],
       [-0.33177132],
       [ 0.77720948],
       [ 0.11873062]])

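`df_example` is projected onto `eigVec`, and the result is stored in `transformed`. Note that the outputs shown here (the 4×1 `eigVec` above and the 23×1 `transformed` below) correspond to the untransposed 23×4 data. With the transpose applied first, as in `apply_PCA` below, the projection has 4 values, one per anchor.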

In [13]:
transformed = df_example.dot(eigVec)

In [14]:
transformed

matrix([[1.08242991],
        [0.795631  ],
        [1.32740408],
        [1.15388771],
        [1.17067167],
        [1.17938975],
        [1.18953369],
        [1.25028276],
        [1.18950713],
        [1.45046582],
        [1.42811748],
        [1.39036666],
        [1.09401566],
        [0.94317096],
        [0.52777706],
        [0.5643332 ],
        [0.66862736],
        [0.53450051],
        [0.37427927],
        [0.55665351],
        [0.21436412],
        [0.08226125],
        [0.25360558]])

In the end, we have a new dataset in the shape needed to build an SVM model, here for just one example of user sensor data.

In [15]:
#horizontally stack transformed data set.
final_df = np.hstack((transformed))

#convert the numpy array to data frame
final_df = pd.DataFrame(final_df)

#define the column names
#final_df.columns = ['x','y','label']

final_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,22
0,1.08243,0.795631,1.327404,1.153888,1.170672,1.17939,1.189534,1.250283,1.189507,1.450466,...,0.943171,0.527777,0.564333,0.668627,0.534501,0.374279,0.556654,0.214364,0.082261,0.253606


### Reference

https://code.likeagirl.io/principal-component-analysis-dimensionality-reduction-technique-step-by-step-approach-ffd46623ff67

Now, let's write a function that applies the PCA steps above.

In [16]:
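def apply_PCA(rss_data):
    # Transpose so that rows are anchors and columns are time steps.
    rss_data = rss_data.T
    df_rss = np.asmatrix(rss_data)
    # Covariance between time steps (np.cov treats rows of its input as variables).
    sigma = np.cov(df_rss.T)
    
    eigVals, eigVec = np.linalg.eig(sigma)
    # Indices that sort the eigenvalues in decreasing order.
    sorted_index = eigVals.argsort()[::-1] 
    
    eigVals = eigVals[sorted_index]
    eigVec = eigVec[:,sorted_index]
    # Keep only the eigenvector of the largest eigenvalue.
    eigVec = eigVec[:,:1]
    
    # Project each anchor's series onto that eigenvector: one value per anchor.
    transformed_rss = df_rss.dot(eigVec)
    
    # Flatten the column of values into a single-row data frame.
    final_rss_df = np.hstack((transformed_rss))
    final_rss_df = pd.DataFrame(final_rss_df)
    
    return final_rss_df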

For instance, the reduced sensor values for the 4th user:

In [19]:
apply_PCA(get_dataframe(4))

Unnamed: 0,0,1,2,3
0,(-3.513514124744332+0j),(-2.1221756814564996+0j),(2.867852254375728+0j),(2.2226989599835827+0j)
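The `+0j` parts in the output likely come from `np.linalg.eig`, which can return a complex array for a rank-deficient matrix because of round-off. Since a covariance matrix is symmetric, `np.linalg.eigh` is an alternative that returns real values. The function `apply_pca_real` below is a hypothetical variant of `apply_PCA` under that assumption, not the method used for the results in this notebook. It returns a NumPy array instead of a data frame, and the sign of the resulting vector is arbitrary, so it may differ from the `eig` version.

```python
import numpy as np


def apply_pca_real(rss_data):
    """Reduce a (time steps x anchors) array to one real value per anchor."""
    data = np.asarray(rss_data, dtype=float).T   # anchors x time steps
    sigma = np.cov(data.T)                       # time steps x time steps
    eig_vals, eig_vecs = np.linalg.eigh(sigma)   # symmetric input, real output
    leading = eig_vecs[:, np.argmax(eig_vals)]   # eigenvector of the largest eigenvalue
    return (data @ leading).reshape(1, -1)       # shape (1, anchors)


# Made-up example: 23 time steps, 4 anchors.
rng = np.random.default_rng(2)
example = rng.uniform(-1, 1, size=(23, 4))
reduced = apply_pca_real(example)
print(reduced.shape, reduced.dtype)  # (1, 4) float64
```

With real-valued output, no later cast from complex to float is needed.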


Finally, we create a data frame that combines the reduced sensor data of all users. In this loop, the `target` value is added to each row, and the rows are combined with `pd.concat`.

In [17]:
for sequential_id in range(1, 315):
    rss_data_point = apply_PCA(get_dataframe(sequential_id))
    rss_data_point["target"] = target_data[sequential_id-1]
    rss_data_frame = pd.concat([rss_data_frame, rss_data_point], axis=0, ignore_index=True)
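Calling `pd.concat` inside the loop copies the growing frame on every iteration. A hedged alternative sketch collects the rows in a list and builds the frame once. The names `build_feature_frame` and `reduce_sequence` are hypothetical, the input is made up, and the column-mean reducer is only a stand-in for `apply_PCA` to keep the example short.

```python
import numpy as np
import pandas as pd


def build_feature_frame(sequences, labels, reduce_sequence):
    """Stack one reduced row per sequence and attach its label."""
    rows = []
    for sequence, label in zip(sequences, labels):
        features = np.asarray(reduce_sequence(sequence), dtype=float).ravel()
        rows.append([label, *features])
    n_features = len(rows[0]) - 1
    columns = ["target"] + ["RSS_anchor%d" % (i + 1) for i in range(n_features)]
    return pd.DataFrame(rows, columns=columns)


# Made-up input: three sequences of different lengths, four anchors each.
rng = np.random.default_rng(3)
sequences = [rng.uniform(-1, 1, size=(length, 4)) for length in (20, 35, 27)]
labels = [1, -1, 1]

# Stand-in reducer (column means), used here only to keep the sketch short.
frame = build_feature_frame(sequences, labels, lambda seq: seq.mean(axis=0))
print(frame.shape)  # (3, 5)
```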

In [18]:
rss_data_frame.head()

Unnamed: 0,target,0,1,2,3
0,1,(-1.2267807166123779+0j),(-0.574394593231218+0j),(-0.10960860000729511+0j),(1.3553274031314444+0j)
1,1,(2.9259774034536226+0j),(2.404910259366234+0j),(-2.1751787605141732+0j),(-2.3149351660195143+0j)
2,1,(3.337187445713733+0j),(1.686085042450151+0j),(-2.7029061158562153+0j),(-2.2372550680009926+0j)
3,1,(-3.513514124744332+0j),(-2.1221756814564996+0j),(2.867852254375728+0j),(2.2226989599835827+0j)
4,1,(-4.081392660991256+0j),(-2.5329789459185936+0j),(3.2084736446164808+0j),(2.6017834094344408+0j)


In [23]:
rss_data_frame = rss_data_frame.rename(columns={0: 'RSS_anchor1', 1: 'RSS_anchor2', 2: 'RSS_anchor3', 3: 'RSS_anchor4'})

In [24]:
rss_data_frame.head()

Unnamed: 0,target,RSS_anchor1,RSS_anchor2,RSS_anchor3,RSS_anchor4
0,1,(-1.2267807166123779+0j),(-0.574394593231218+0j),(-0.10960860000729511+0j),(1.3553274031314444+0j)
1,1,(2.9259774034536226+0j),(2.404910259366234+0j),(-2.1751787605141732+0j),(-2.3149351660195143+0j)
2,1,(3.337187445713733+0j),(1.686085042450151+0j),(-2.7029061158562153+0j),(-2.2372550680009926+0j)
3,1,(-3.513514124744332+0j),(-2.1221756814564996+0j),(2.867852254375728+0j),(2.2226989599835827+0j)
4,1,(-4.081392660991256+0j),(-2.5329789459185936+0j),(3.2084736446164808+0j),(2.6017834094344408+0j)


In [19]:
rss_data_frame = rss_data_frame.astype(float)

This cast discards the imaginary parts of the complex values (all zero here), and NumPy emits a warning about it.


In [30]:
eigVec

array([[-0.24878716+0.j],
       [-0.17201892+0.j],
       [-0.27120945+0.j],
       [-0.19090073+0.j],
       [-0.26175698+0.j],
       [-0.23916103+0.j],
       [-0.24006912+0.j],
       [-0.26218033+0.j],
       [-0.19573199+0.j],
       [-0.30626061+0.j],
       [-0.28302664+0.j],
       [-0.30300177+0.j],
       [-0.24267795+0.j],
       [-0.22027532+0.j],
       [-0.16719793+0.j],
       [-0.16397693+0.j],
       [-0.15259001+0.j],
       [-0.12468508+0.j],
       [-0.1060249 +0.j],
       [-0.13722138+0.j],
       [-0.07770096+0.j],
       [-0.02441493+0.j],
       [-0.06795331+0.j]])

In [20]:
X = rss_data_frame.iloc[:,1:]
y = rss_data_frame.iloc[:, 0]
y=y.astype('int')

In [21]:
# sklearn.cross_validation was removed in newer scikit-learn releases; use model_selection.
from sklearn.model_selection import train_test_split
import random

random_value = random.randint(1,1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=random_value)
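The split above draws a new `random_state` on every run, so the reported accuracy changes between runs. A hedged sketch of a reproducible alternative: fix `random_state` (42 is an arbitrary choice) and pass `stratify=y` so that both parts keep roughly the same class balance. The data below is synthetic and only mirrors the size of the homework dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with the same number of points as the homework: 314 x 4.
rng = np.random.default_rng(4)
X = rng.normal(size=(314, 4))
y = np.where(rng.random(314) < 0.5, 1, -1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (210, 4) (104, 4)
```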



We now have a dataset that contains the reduced movement data of all users together with their class labels. So, we split the dataset and used an SVM classifier with an RBF kernel, where `C` is equal to 1 and `gamma` is equal to 0.5.

In [24]:
from sklearn import svm
#from sklearn.model_selection import GridSearchCV

classifier = svm.SVC(kernel='rbf', C=1, gamma=0.5)
#parameters = {'kernel': ('linear', 'rbf','poly'), 'C':[1.5, 10],'gamma': [1e-7, 1e-4],'epsilon':[0.1,0.2,0.5,0.3]}

#regressor = GridSearchCV(regressor, parameters)
classifier.fit(X_train, y_train)


SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.5, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
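The commented-out lines in the cell above hint at a grid search. A minimal sketch of how `C` and `gamma` could be chosen by cross-validation instead of being fixed by hand. The training data is synthetic, the grid values are assumptions rather than settings tuned for the RSS data, and the `StandardScaler` step is an addition that the solution above does not use.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up training data: two noisy clusters in four dimensions.
rng = np.random.default_rng(5)
X_train = np.vstack([rng.normal(-1, 1, size=(60, 4)), rng.normal(1, 1, size=(60, 4))])
y_train = np.array([-1] * 60 + [1] * 60)

pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# Illustrative grid; the values are assumptions, not tuned for the RSS data.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.05, 0.5, 5]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

The search should be fitted on the training split only, so that the test set still gives an unbiased accuracy estimate.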

In [55]:
y_pred = classifier.predict(X_test)

Finally, the accuracy on the test set is:

In [25]:
classifier.score(X_test, y_test)

0.8653846153846154
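The reported value is consistent with 90 correct predictions out of 104 test points (314 points with `test_size=0.33`). To see where the errors fall, the predictions in `y_pred` can be compared with `y_test` pair by pair. The sketch below uses made-up labels and a hypothetical helper, `accuracy_and_counts`, written with the standard library only.

```python
def accuracy_and_counts(y_true, y_pred):
    """Return (accuracy, counts) where counts maps (true, predicted) pairs to totals."""
    counts = {}
    for true_label, predicted_label in zip(y_true, y_pred):
        key = (true_label, predicted_label)
        counts[key] = counts.get(key, 0) + 1
    correct = sum(total for (t, p), total in counts.items() if t == p)
    return correct / len(y_true), counts


# Made-up labels, only to show the call.
y_true = [1, 1, -1, -1, 1, -1]
y_pred = [1, -1, -1, -1, 1, 1]
accuracy, counts = accuracy_and_counts(y_true, y_pred)
print(accuracy)  # 0.6666666666666666
print(counts)
```

The off-diagonal entries, such as `(1, -1)`, count the points of one class that were assigned to the other.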