In [None]:
import os
import pandas as pd

In [None]:
path = "../input/human-activity-recognition-with-smartphones/"

In [None]:
df_train = pd.read_csv(os.path.join(path,'train.csv'))
df_test = pd.read_csv(os.path.join(path,'test.csv'))

In [None]:
df_train.head()

In [None]:
df_train.shape

In [None]:
df_test.shape

In [None]:
df_train['Activity'].value_counts()

# Human Activity Recognition

In this notebook, we try to predict the activity of a user. As you can see, it is a multiclass classification problem: the goal is to build a model that predicts whether a person is `Laying`, `Standing`, `Sitting`, `Walking`, `Walking_upstairs`, or `Walking_downstairs`.

The information in this dataset comes from the measurements of the smartphone's accelerometer and gyroscope.

#### Data Information 
From the website: 

http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions

The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (`WALKING`, `WALKING_UPSTAIRS`, `WALKING_DOWNSTAIRS`, `SITTING`, `STANDING`, `LAYING`) wearing a smartphone <b>(Samsung Galaxy S II) </b> on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data.

The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.
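
The window arithmetic above can be checked with a short sketch: at 50 Hz, 2.56 s corresponds to 128 readings, and a 50% overlap means consecutive windows start 64 readings apart. The helper `sliding_windows` and the synthetic signal are illustrative assumptions, not part of the original processing pipeline.

```python
SAMPLING_RATE_HZ = 50      # from the dataset description
WINDOW_SECONDS = 2.56      # from the dataset description
OVERLAP = 0.5              # 50% overlap between consecutive windows

window_size = round(SAMPLING_RATE_HZ * WINDOW_SECONDS)  # readings per window
step = int(window_size * (1 - OVERLAP))                 # readings between window starts


def sliding_windows(signal, size, step):
    """Return fixed-width windows; a trailing partial window is dropped."""
    return [signal[start:start + size]
            for start in range(0, len(signal) - size + 1, step)]


# Hypothetical 10-second recording (500 samples), used only for illustration.
signal = list(range(500))
windows = sliding_windows(signal, window_size, step)
print(window_size, step, len(windows))
```

Each window produced this way would then be summarized into one row of features.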

### Let's talk about the features (columns)

There are `563 columns` in total: 561 features plus the `subject` and `Activity` columns.

1. The features selected for this database come from the <b> accelerometer </b> and <b> gyroscope </b> 3-axial raw signals <b> tAcc-XYZ </b> and <b> tGyro-XYZ </b>. These time domain signals (prefix <b>'t'</b> to denote time) were captured at a constant rate of 50 Hz. Then they were filtered using a median filter and a 3rd order low pass Butterworth filter with a corner frequency of 20 Hz to remove noise. 


2. Similarly, the acceleration signal was then separated into body and gravity acceleration signals <b> (tBodyAcc-XYZ and tGravityAcc-XYZ) </b> using another low pass Butterworth filter with a corner frequency of 0.3 Hz. 


3. Subsequently, the body linear acceleration and angular velocity were derived in time to obtain Jerk signals <b>(tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ)</b>. Also, the magnitude of these three-dimensional signals was calculated using the Euclidean norm (`tBodyAccMag`, `tGravityAccMag`, `tBodyAccJerkMag`, `tBodyGyroMag`, `tBodyGyroJerkMag`).

`jerk is the rate at which an object's acceleration changes with respect to time`


4. Finally a Fast Fourier Transform (FFT) was applied to some of these signals producing

`fBodyAcc-XYZ, fBodyAccJerk-XYZ, fBodyGyro-XYZ, fBodyAccJerkMag, fBodyGyroMag, fBodyGyroJerkMag. `

(Note the 'f' to indicate frequency domain signals). 

These signals were used to estimate variables of the feature vector for each pattern:  


5. <b>'-XYZ' </b> is used to denote 3-axial signals in the X, Y and Z directions.

    - tBodyAcc-XYZ
    - tGravityAcc-XYZ
    - tBodyAccJerk-XYZ
    - tBodyGyro-XYZ
    - tBodyGyroJerk-XYZ
    - tBodyAccMag
    - tGravityAccMag
    - tBodyAccJerkMag
    - tBodyGyroMag
    - tBodyGyroJerkMag
    - fBodyAcc-XYZ
    - fBodyAccJerk-XYZ
    - fBodyGyro-XYZ
    - fBodyAccMag
    - fBodyAccJerkMag
    - fBodyGyroMag
    - fBodyGyroJerkMag
    
    

6. The set of variables that were estimated from these signals are: 

    - `mean()`: Mean value
    - `std()`: Standard deviation
    - `mad()`: Median absolute deviation 
    - `max()`: Largest value in array
    - `min()`: Smallest value in array
    - `sma()`: Signal magnitude area
    - `energy()`: Energy measure. Sum of the squares divided by the number of values. 
    - `iqr()`: Interquartile range 
    - `entropy()`: Signal entropy
    - `arCoeff()`: Autoregression coefficients with Burg order equal to 4
    - `correlation()`: correlation coefficient between two signals
    - `maxInds()`: index of the frequency component with largest magnitude
    - `meanFreq()`: Weighted average of the frequency components to obtain a mean frequency
    - `skewness()`: skewness of the frequency domain signal 
    - `kurtosis()`: kurtosis of the frequency domain signal 
    - `bandsEnergy()`: Energy of a frequency interval within the 64 bins of the FFT of each window.
    - `angle()`: Angle between two vectors.
    

7. Additional vectors obtained by averaging the signals in a signal window sample. These are used on the angle() variable:

    `gravityMean
     tBodyAccMean
     tBodyAccJerkMean
     tBodyGyroMean
     tBodyGyroJerkMean`
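
Steps 2 and 3 above can be sketched with NumPy and SciPy. This is only an illustration on a synthetic signal: the filter order (`ORDER = 3`), the use of zero-phase filtering via `filtfilt`, and the finite-difference derivative are assumptions, since the description only gives the 0.3 Hz corner frequency.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 50.0          # sampling rate in Hz (from the dataset description)
CUTOFF_HZ = 0.3    # corner frequency separating gravity from body motion
ORDER = 3          # assumption: the order of this second filter is not stated

# Synthetic 3-axis total acceleration: gravity on X plus a 2 Hz body motion.
t = np.arange(0, 20, 1 / FS)
total_acc = np.column_stack([
    1.0 + 0.2 * np.sin(2 * np.pi * 2.0 * t),   # X
    0.1 * np.sin(2 * np.pi * 2.0 * t),         # Y
    0.1 * np.cos(2 * np.pi * 2.0 * t),         # Z
])

# Step 2: the low-pass output approximates gravity; the remainder is body motion.
b, a = butter(ORDER, CUTOFF_HZ, btype="low", fs=FS)
gravity_acc = filtfilt(b, a, total_acc, axis=0)   # zero-phase filtering per axis
body_acc = total_acc - gravity_acc

# Step 3: jerk is the time derivative of body acceleration (finite differences).
body_acc_jerk = np.gradient(body_acc, 1 / FS, axis=0)

# Step 3: magnitude is the Euclidean norm across the three axes.
body_acc_mag = np.linalg.norm(body_acc, axis=1)
body_acc_jerk_mag = np.linalg.norm(body_acc_jerk, axis=1)

mid = len(t) // 2
print(gravity_acc[mid].round(3), body_acc_mag[mid].round(3))
```

Away from the edges of the recording, the recovered gravity component should stay close to the constant offset placed on the X axis.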
 
 That's too much information. 
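
To make a few of the summary variables from item 6 concrete, here is a minimal sketch that computes them for a single window with NumPy. The window values are hypothetical, and details such as using the population standard deviation (NumPy's default) are assumptions; the dataset authors' exact conventions are not given here.

```python
import numpy as np

# Hypothetical window of readings; real windows hold 128 samples.
window = np.array([0.1, -0.2, 0.4, 0.0, -0.1, 0.3, -0.3, 0.2])

features = {
    "mean": window.mean(),
    "std": window.std(),                                     # population std (assumption)
    "mad": np.median(np.abs(window - np.median(window))),    # median absolute deviation
    "max": window.max(),
    "min": window.min(),
    "energy": np.sum(window ** 2) / window.size,             # sum of squares / number of values
    "iqr": np.percentile(window, 75) - np.percentile(window, 25),
}
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```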


## What's our Plan?


### `Outline`

- <b>1. Read Dataset </b>


- <b>2. Dataset Cleaning </b>
    - 2.1 Outliers
    - 2.2 Filling null values
    - 2.3 Check for data imbalance
    - 2.4 Correcting some feature names


   
- <b>3. Exploratory Data Analysis </b>


- <b>4. Data Preprocessing </b>
    - 4.1 Encoding categorical variables
    - 4.2 Normalization
    - 4.3 Split Training and testing
    
    
    
- <b>5. Models, Hyperparameter Tuning and Cross Validation</b>
    - 5.1 Logistic Regression 
    - 5.2 Naive Bayes 
    - 5.3 K-Nearest Neighbor
    - 5.4 Decision Tree
    - 5.5 Random Forest
    - 5.6 Support Vector Machine
    
    


Since we have already loaded the data and looked at its features, we will skip step 1.

In [None]:
df_train.columns

## 2. Dataset Cleaning

- 2.1 Outliers
- 2.2 Filling null values
- 2.3 Check for data imbalance

### 2.1 Outliers

In [None]:
df_train.describe()

Outliers are unlikely to be a concern here: all the feature values lie between -1 and 1.
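
Rather than reading the range off the `describe()` table, the bounds can be checked directly. The sketch below uses a tiny hypothetical frame named `demo` in place of `df_train`; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_train; the real frame has 561 feature columns.
demo = pd.DataFrame({
    "tBodyAccmeanX": [0.28, 0.27, 0.29],
    "tBodyAccstdX": [-0.99, -0.98, -0.41],
    "subject": [1, 1, 3],
    "Activity": ["STANDING", "SITTING", "WALKING"],
})

# Only the numeric feature columns are checked.
feature_cols = demo.columns.drop(["subject", "Activity"])
lowest = demo[feature_cols].min().min()
highest = demo[feature_cols].max().max()
within_bounds = bool(lowest >= -1 and highest <= 1)
print(f"min={lowest}, max={highest}, within [-1, 1]: {within_bounds}")
```

The same three lines applied to `df_train` would confirm (or refute) the observation for the full dataset.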

### 2.2 Checking for NaN/null values and Duplicates

In [None]:
## Checking for Duplicates

In [None]:
print("Total Duplicates Train: {} \n".format(sum(df_train.duplicated())))
print("Total Duplicates in Test: {} \n".format(sum(df_test.duplicated())))

In [None]:
## Checking for null values

In [None]:
print("Total Null values in Train: {}\n".format(df_train.isnull().values.sum()))
print("Total Null values in Test: {} \n".format(df_test.isnull().values.sum()))

### 2.3 Check for imbalanced dataset

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [None]:
plt.figure(figsize = (15,8))
plt.title('Subjects')
sns.countplot(x = 'subject', data = df_train);

In [None]:
plt.figure(figsize = (16,8))
plt.title("Subject with Each Activity")
sns.countplot(hue = 'Activity', x='subject',data = df_train);
plt.show()

In [None]:
plt.figure(figsize = (12,8))
sns.countplot(x = 'Activity', data = df_train);

We can see that each subject contributes a roughly similar amount of data; there is no large gap between them.
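
The balance of the activity labels can also be summarized numerically instead of visually. The sketch below uses hypothetical label counts standing in for `df_train['Activity']`; the ratio of the largest to the smallest class share is one simple imbalance indicator.

```python
import pandas as pd

# Hypothetical labels standing in for df_train['Activity'].
activity = pd.Series(
    ["LAYING"] * 14 + ["STANDING"] * 14 + ["SITTING"] * 13
    + ["WALKING"] * 12 + ["WALKING_UPSTAIRS"] * 11 + ["WALKING_DOWNSTAIRS"] * 10
)

shares = activity.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = shares.max() / shares.min()    # 1.0 means perfectly balanced
print(shares.round(3))
print("largest/smallest class ratio:", round(imbalance_ratio, 2))
```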

### 2.4 Correcting some feature names


In [None]:
df_train.head()

We can see that some feature names contain brackets `()`. We will remove them, along with hyphens and commas, so the names are easier to type correctly later.

In [None]:
columns = df_train.columns

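## Removing (), - and , from the column names
## (regex=True is needed: newer pandas versions treat the pattern as a literal string by default)

columns = columns.str.replace('[()]','', regex=True)
columns = columns.str.replace('[-]','', regex=True)
columns = columns.str.replace('[,]','', regex=True)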


In [None]:
df_train.columns = columns
df_test.columns = columns

In [None]:
df_train.columns
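
The effect of the renaming can be illustrated on a few names with the standard `re` module. The helper `clean_name` is hypothetical, and the raw names are inferred from the cleaned names used later in this notebook.

```python
import re


def clean_name(name):
    """Drop parentheses, hyphens, and commas from a column name."""
    return re.sub(r"[()\-,]", "", name)


# Raw names inferred from the cleaned names used later in this notebook.
raw_names = ["tBodyAccMag-mean()", "angle(X,gravityMean)", "angle(Y,gravityMean)"]
cleaned = [clean_name(n) for n in raw_names]
print(cleaned)

# Removing characters can make two different names collide, so check for that.
has_collisions = len(set(cleaned)) != len(cleaned)
print("collisions after cleaning:", has_collisions)
```

Running the same collision check on the full column list is a cheap safeguard before selecting columns by name.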

## 3. Exploratory Data Analysis


#### Static and Dynamic Activities

- Static activities (sitting, standing, laying) involve no significant motion.
- Dynamic activities (Walking, WalkingUpStairs, WalkingDownStairs): motion information will be significant.



#### 3.1 Stationary and Moving activities are completely different

In [None]:
sns.set_palette("Set1", desat=0.80)
facetgrid = sns.FacetGrid(df_train, hue='Activity', height=6, aspect=2)  # `size` was renamed to `height`
facetgrid.map(sns.kdeplot, 'tBodyAccMagmean')\
    .add_legend()
plt.annotate("Stationary Activities", xy=(-0.956,17), xytext=(-0.9, 23), size=20,\
            va='center', ha='left',\
            arrowprops=dict(arrowstyle="simple",connectionstyle="arc3,rad=0.1"))

plt.annotate("Moving Activities", xy=(0,3), xytext=(0.2, 9), size=20,\
            va='center', ha='left',\
            arrowprops=dict(arrowstyle="simple",connectionstyle="arc3,rad=0.1"))
plt.show()

Let's take a closer look at them

In [None]:
## Closer view of static and dynamic activities (density of tBodyAccMagmean)

plt.figure(figsize = (12,8))
plt.subplot(1,2,1)
plt.title("Static Activities (closer view)")
sns.kdeplot(df_train[df_train["Activity"]=="SITTING"]['tBodyAccMagmean'], label = 'Sitting')
sns.kdeplot(df_train[df_train["Activity"]=="STANDING"]['tBodyAccMagmean'], label = 'Standing')
sns.kdeplot(df_train[df_train["Activity"]=="LAYING"]['tBodyAccMagmean'], label = 'Laying')
plt.axis([-1.02, -0.5, 0, 35])
plt.legend()
plt.subplot(1,2,2)
plt.title("Dynamic Activities (closer view)")
sns.kdeplot(df_train[df_train["Activity"]=="WALKING"]['tBodyAccMagmean'], label = 'Walking')
sns.kdeplot(df_train[df_train["Activity"]=="WALKING_UPSTAIRS"]['tBodyAccMagmean'], label = 'Walking upstairs')
sns.kdeplot(df_train[df_train["Activity"]=="WALKING_DOWNSTAIRS"]['tBodyAccMagmean'], label = 'Walking downstairs')
plt.legend()
plt.show()

We will also use a box plot to visualize this.

In [None]:
plt.figure(figsize = (10,7))
sns.boxplot(x = 'Activity', y ='tBodyAccMagmean', data = df_train, showfliers = False);
plt.ylabel('Body Acceleration Magnitude mean')
plt.title('Boxplot of tBodyAccMagmean column across various activities')
plt.axhline(y =- 0.7, xmin = 0.1, xmax = 0.9, dashes = (3,3))
plt.axhline(y = 0.020, xmin = 0.4, dashes = (3,3))
plt.xticks(rotation = 90)
plt.show()

Using the boxplot again, we can come up with conditions to separate static activities from dynamic activities.

`` 
if (tBodyAccMagmean <= -0.8):
    Activity = "static"
if (tBodyAccMagmean >= -0.6):
    Activity = "dynamic"
``
 

Also, we can easily separate the WALKING_DOWNSTAIRS activity from the others using the boxplot.

`` 
if (tBodyAccMagmean > 0.02):
    Activity = "WALKING_DOWNSTAIRS"
else:
    Activity = "others"
``

However, 25% of the WALKING_DOWNSTAIRS observations fall below 0.02 and are misclassified as others, so this condition has a 25% error for that class.
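
The threshold rules above can be written as small functions and evaluated on labelled data. The function names and the sample values below are hypothetical and serve only to show how the miss rate of such a rule could be measured; they are not taken from the dataset.

```python
def split_static_dynamic(t_body_acc_mag_mean):
    """Apply the two box-plot thresholds; values in between stay undecided."""
    if t_body_acc_mag_mean <= -0.8:
        return "static"
    if t_body_acc_mag_mean >= -0.6:
        return "dynamic"
    return "undecided"


def is_walking_downstairs(t_body_acc_mag_mean):
    """Single-threshold rule for WALKING_DOWNSTAIRS read off the box plot."""
    return t_body_acc_mag_mean > 0.02


# Hypothetical (value, true label) pairs used only to exercise the rules.
samples = [
    (-0.95, "SITTING"),
    (-0.98, "LAYING"),
    (-0.20, "WALKING"),
    (0.10, "WALKING_DOWNSTAIRS"),
    (-0.05, "WALKING_DOWNSTAIRS"),   # below 0.02, so the rule misses it
]

downstairs = [v for v, label in samples if label == "WALKING_DOWNSTAIRS"]
missed = sum(not is_walking_downstairs(v) for v in downstairs)
print([split_static_dynamic(v) for v, _ in samples])
print(f"missed {missed} of {len(downstairs)} WALKING_DOWNSTAIRS samples")
```

Note that the first rule leaves a gap between -0.8 and -0.6, so some observations receive no label from it.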

#### 3.2 Analysing Angle between X-axis and gravityMean feature

In [None]:
plt.figure(figsize = (10,7))
sns.boxplot(x = 'Activity', y = 'angleXgravityMean', data = df_train, showfliers = False)
plt.axhline(y = 0.08, xmin = 0.1 , xmax = 0.9, dashes = (3,3))
plt.ylabel("Angle between X-axis and gravityMean")
plt.title("Box plot of angleXgravityMean column across various activities")
plt.xticks(rotation = 90)
plt.show()

<b> Observation: </b>
- If angleXgravityMean > 0.01 then Activity is <b> Laying </b>
- We can classify all datapoints belonging to Laying activity with just a single if else statement



#### 3.3 Analysing Angle between Y-axis and gravityMean feature

In [None]:
plt.figure(figsize = (10,7))
sns.boxplot(x = 'Activity', y = 'angleYgravityMean', data = df_train, showfliers = False)
plt.ylabel("Angle between Y-axis and gravityMean")
plt.title("Box plot of angleYgravitymean column across various activities")
plt.xticks(rotation = 90)
plt.axhline(y = -0.35, xmin = 0.01, dashes = (3,3))
plt.show()

#### 3.4 Visualizing data using t-SNE

t-SNE projects data from an extremely high-dimensional space into a low-dimensional one while still retaining much of the original structure. The training data has 561 features, so let's use t-SNE to visualize it in 2D.

In [None]:
from sklearn.manifold import TSNE

In [None]:
X_for_tsne = df_train.drop(['subject','Activity'], axis = 1)

In [None]:
%%time
# Note: recent scikit-learn releases rename TSNE's `n_iter` argument to `max_iter`
tsne = TSNE(random_state = 42, n_components = 2, verbose = 1, perplexity = 50, n_iter = 1000).fit_transform(X_for_tsne)

In [None]:
plt.figure(figsize = (12,8))
sns.scatterplot(x = tsne[:,0], y = tsne[:,1], hue = df_train["Activity"], palette = "bright")

<b>Observations:</b>
- Laying is a clearly distinct position.
- Walking, Walking_downstairs, and Walking_upstairs are similar, so they are clustered together.
- Standing and Sitting are also similar positions.

## 4. Data Preprocessing







#### 4.1 Splitting training and testing

In [None]:
y_train = df_train.Activity
X_train = df_train.drop(['subject','Activity'], axis = 1)
y_test = df_test.Activity
X_test = df_test.drop(['subject','Activity'], axis = 1)
print('Training data size:', X_train.shape)
print('Test data size:', X_test.shape)

In [None]:
model_score = pd.DataFrame(columns = ("Model","Score"))

## 5. Models, Hyperparameter Tuning and Cross Validation
- Logistic Regression 
- Linear SVM
- Kernel SVM
- Decision Tree
- Random Forest



#### 5.1 Logistic regression model with Hyperparameter tuning and cross validation

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [None]:
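parameters = {'C':np.arange(10,61,10),'penalty':['l2','l1']}
# the default lbfgs solver does not support the l1 penalty; saga supports both l1 and l2
lr_classifier = LogisticRegression(solver = 'saga', max_iter = 1000)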
lr_classifier_rs = RandomizedSearchCV(lr_classifier, param_distributions = parameters, cv = 5, random_state = 42)
lr_classifier_rs.fit(X_train, y_train)
y_pred = lr_classifier_rs.predict(X_test)

In [None]:
lr_accuracy = accuracy_score(y_true = y_test, y_pred = y_pred)
print("Accuracy using Logistic Regression:", lr_accuracy)

In [None]:
model_score = pd.concat([model_score, pd.DataFrame({'Model':["LogisticRegression"],'Score':[lr_accuracy]})], ignore_index = True)

In [None]:
lr_classifier_rs.best_estimator_

In [None]:
## plotting confusion matrix

def plot_confusion_matrix(cm, labels):
    fig, ax = plt.subplots(figsize = (12,8))
    im = ax.imshow(cm, interpolation = 'nearest', cmap = plt.cm.Blues)
    ax.figure.colorbar(im, ax = ax)
    # label both axes with the class names
    ax.set(xticks = np.arange(cm.shape[1]),
           yticks = np.arange(cm.shape[0]),
           xticklabels = labels, yticklabels = labels,
           ylabel = 'True label',
           xlabel = 'Predicted label')
    plt.xticks(rotation = 90)
    thresh = cm.max() / 2
    # write each count in its cell, in white on dark cells and black on light ones
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, int(cm[i,j]), ha = "center", va = "center", color = "white" if cm[i,j]> thresh else "black")
    fig.tight_layout()

In [None]:
cm = confusion_matrix(y_test.values, y_pred)
plot_confusion_matrix(cm, np.unique(y_pred))

In [None]:
## function to get best random search attributes

def get_best_randomsearch_results(model):
    print("Best estimator:", model.best_estimator_)
    print("Best set of parameters:", model.best_params_)
    print("Best score:", model.best_score_)

In [None]:
## getting best random search attributes

get_best_randomsearch_results(lr_classifier_rs)
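
To see what the randomized search does with this grid, the sketch below enumerates the candidate settings in plain Python and samples ten of them, which matches the default `n_iter` of `RandomizedSearchCV`. It only illustrates the idea; the sampling is done with Python's `random` module, so the chosen candidates will not match the ones scikit-learn picks.

```python
import itertools
import random

# Same grid as the logistic regression search above.
grid = {"C": [10, 20, 30, 40, 50, 60], "penalty": ["l2", "l1"]}

# Every combination an exhaustive grid search would try.
all_candidates = [dict(zip(grid, values))
                  for values in itertools.product(*grid.values())]

# Random search evaluates only a sample of the combinations.
rng = random.Random(42)
n_iter = 10   # default n_iter of RandomizedSearchCV
sampled = rng.sample(all_candidates, k=min(n_iter, len(all_candidates)))

print(len(all_candidates), "possible candidates,", len(sampled), "evaluated")
# With cv=5, each sampled candidate is fitted 5 times, plus one final refit.
total_fits = len(sampled) * 5 + 1
print("total model fits:", total_fits)
```

For grids with fewer combinations than `n_iter`, such as a single parameter with six values, the random search ends up trying every candidate.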

#### 5.2 Linear SVM model with Hyperparameter tuning and cross validation

In [None]:
from sklearn.svm import LinearSVC

In [None]:
parameters = {'C': np.arange(1,12,2)}
lr_svm = LinearSVC(tol = 0.00005)
lr_svm_rs = RandomizedSearchCV(lr_svm, param_distributions = parameters, random_state = 42)
lr_svm_rs.fit(X_train, y_train)
y_pred = lr_svm_rs.predict(X_test)

In [None]:
lr_svm_accuracy = accuracy_score(y_true = y_test, y_pred = y_pred)
print("Accuracy using Linear SVM:", lr_svm_accuracy)

In [None]:
model_score = pd.concat([model_score, pd.DataFrame({'Model':["LinearSVM"],'Score':[lr_svm_accuracy]})], ignore_index = True)

In [None]:
cm = confusion_matrix(y_test.values, y_pred)
plot_confusion_matrix(cm, np.unique(y_pred))

In [None]:
## getting best random search attributes
get_best_randomsearch_results(lr_svm_rs)

#### 5.3 Kernel SVM model with Hyperparameter tuning and cross validation

In [None]:
from sklearn.svm import SVC

In [None]:
np.linspace(2,22, 6)

In [None]:
parameters = {'C':[2,4,8,16], 'gamma':[0.125, 0.250, 0.5, 1]}
kernel_svm = SVC(kernel = 'rbf')
kernel_svm_rs = RandomizedSearchCV(kernel_svm, param_distributions = parameters, random_state = 42)
kernel_svm_rs.fit(X_train, y_train)
y_pred = kernel_svm_rs.predict(X_test)

In [None]:
kernel_svm_accuracy = accuracy_score(y_true = y_test, y_pred = y_pred)
print("Accuracy using Kernel SVM:", kernel_svm_accuracy)

In [None]:
model_score = pd.concat([model_score, pd.DataFrame({'Model':["KernelSVM"],'Score':[kernel_svm_accuracy]})], ignore_index = True)

In [None]:
cm = confusion_matrix(y_test.values, y_pred)
plot_confusion_matrix(cm, np.unique(y_pred))

In [None]:
## getting best random search attributes

get_best_randomsearch_results(kernel_svm_rs)

#### 5.4 Decision tree model with Hyperparameter tuning and cross validation

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
parameters = {'max_depth':np.arange(2,10,2)}
dt_classifier = DecisionTreeClassifier()
dt_classifier_rs = RandomizedSearchCV(dt_classifier,param_distributions=parameters,random_state = 42)
dt_classifier_rs.fit(X_train, y_train)
y_pred = dt_classifier_rs.predict(X_test)

In [None]:
dt_accuracy = accuracy_score(y_true = y_test, y_pred = y_pred)
print("Accuracy using Decision tree:", dt_accuracy)

In [None]:
model_score = pd.concat([model_score, pd.DataFrame({'Model':["DecisionTrees"],'Score':[dt_accuracy]})], ignore_index = True)

In [None]:
cm = confusion_matrix(y_test.values, y_pred)
plot_confusion_matrix(cm, np.unique(y_pred))

In [None]:
## getting best estimators

get_best_randomsearch_results(dt_classifier_rs)

#### 5.5 Random Forest model using Hyperparameter tuning and cross validation

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
params = {'n_estimators': np.arange(20,101,10), 'max_depth':np.arange(2,16,2)}
rf_classifier = RandomForestClassifier()
rf_classifier_rs = RandomizedSearchCV(rf_classifier, param_distributions=params,random_state = 42)
rf_classifier_rs.fit(X_train, y_train)
y_pred = rf_classifier_rs.predict(X_test)

In [None]:
rf_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy using Random Forest:", rf_accuracy)

In [None]:
model_score = pd.concat([model_score, pd.DataFrame({'Model':["RandomForest"],'Score':[rf_accuracy]})], ignore_index = True)

In [None]:
cm = confusion_matrix(y_test.values, y_pred)
plot_confusion_matrix(cm, np.unique(y_pred))

In [None]:
model_score.head()