# Human Activity Recognition (HAR)

In this project we use sensor data recorded with a smartphone. The [HAR dataset](https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones) provides measurements of the Activities of Daily Living (ADL) of 30 subjects, recorded while they were standing, walking, sitting, lying down, and so on. Such a dataset is a rich source of information about how individuals move. For example, walking speed differs from person to person, and it stands to reason that a person's physical condition is correlated with their walking speed. That is just one example of the questions such data can answer.

We will address the following questions.

1. Can we accurately predict the activity of a person using this dataset? If so, then which is the best model?
2. Which attributes are the most vital ones for predicting the activity of a person?

First, let's import the required library.

In [None]:
import pandas as pd

Now let's see the names of the files we are going to be working with.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Load the training and test data.

In [None]:
train_data = pd.read_csv('../input/human-activity-recognition-with-smartphones/train.csv')

In [None]:
test_data = pd.read_csv('../input/human-activity-recognition-with-smartphones/test.csv')

## Exploring the dataset

Let's see the columns in the training set and understand what they mean.

According to the dataset description, each record provides:
* Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
* Triaxial angular velocity from the gyroscope.
* A 561-feature vector with time and frequency domain variables.
* Its activity label.
* An identifier of the subject who carried out the experiment.

In [None]:
train_data.shape

In [None]:
train_data.head()

In [None]:
train_data.isna().sum().sum()
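The column names themselves indicate which signal each feature was derived from. The sketch below assumes names of the form `tBodyAcc-mean()-X` (signal, statistic, and axis separated by hyphens) and counts features per signal. The helper `count_features_by_signal` and the sample names are illustrative, not part of the dataset tooling.

```python
from collections import Counter

def count_features_by_signal(columns, skip=("subject", "Activity")):
    """Count columns per signal prefix (the text before the first '-')."""
    counts = Counter()
    for name in columns:
        if name in skip:
            continue
        counts[name.split("-")[0]] += 1
    return counts

# Hand-written sample names for illustration only
sample_columns = [
    "tBodyAcc-mean()-X", "tBodyAcc-mean()-Y", "tBodyAcc-std()-X",
    "tGravityAcc-mean()-X", "tBodyGyro-mean()-Z",
    "subject", "Activity",
]
signal_counts = count_features_by_signal(sample_columns)
print(signal_counts.most_common())
```

In the notebook, `count_features_by_signal(train_data.columns)` would give the same breakdown for the real features.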

## Checking for imbalance in the labeled instances

Let's see how many instances of each label there are in the dataset.

In [None]:
train_data.Activity.value_counts()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Bar chart of instance counts per activity, most frequent first
fig = plt.figure(figsize = (15, 5))
sns.countplot(x = 'Activity',
              data = train_data,
              palette = 'winter',
              order = train_data['Activity'].value_counts().index
             )
plt.show()
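Besides the plot, class balance can be summarized with two numbers: the share of each class and the ratio between the largest and smallest class. This is a minimal sketch on toy labels; the helper name `class_balance` is hypothetical.

```python
import pandas as pd

def class_balance(labels):
    """Return per-class share and the ratio of largest to smallest class."""
    counts = labels.value_counts()
    shares = counts / counts.sum()
    ratio = counts.max() / counts.min()
    return shares, ratio

# Toy labels for illustration; in the notebook pass train_data['Activity']
toy_labels = pd.Series(["WALKING"] * 6 + ["SITTING"] * 3 + ["LAYING"] * 3)
shares, ratio = class_balance(toy_labels)
print(shares.round(2))
print("max/min class ratio:", ratio)
```

A ratio close to 1 means the classes are roughly balanced, in which case plain accuracy is a reasonable metric.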

## Distribution of attributes

It is important to understand how the values in each column are distributed, but as the shape of the training set shows, there are too many attributes to examine one by one (561 of them!).

Instead, we can get a glimpse of the distributions by looking at six columns at a time.

In [None]:
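# Plot the distributions of up to six columns on a 2 x 3 grid

def plot_distribution(data, cols):
    fig, axes = plt.subplots(ncols = 3, nrows = 2, figsize = (15, 8))
    # zip stops at the shorter sequence, so at most six columns are drawn
    for col, ax in zip(cols, axes.flat):
        # histplot (seaborn >= 0.11) replaces the deprecated sns.distplot
        sns.histplot(data[col], kde = True, stat = 'density', ax = ax)
    plt.show()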

In [None]:
# Select some body acceleration attributes
cols = train_data.columns[:6]
plot_distribution(train_data, cols)

In [None]:
# Select some gravitational acceleration attributes
cols = train_data.columns[40:46]
plot_distribution(train_data, cols)

Some of the attributes are far from normally distributed. In particular, several gravitational acceleration measurements are visibly skewed.
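Since the plots cover only a few of the columns, skewness can also be computed for every column at once and ranked. The sketch below uses synthetic data; the helper `most_skewed` is hypothetical.

```python
import numpy as np
import pandas as pd

def most_skewed(data, top=5):
    """Rank numeric columns by absolute sample skewness, largest first."""
    skew = data.select_dtypes(include="number").skew()
    order = skew.abs().sort_values(ascending=False).index
    return skew.loc[order].head(top)

# Synthetic columns for illustration: one symmetric, one right-skewed
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=2000),
    "right_skewed": rng.exponential(size=2000),
})
skew_ranking = most_skewed(toy)
print(skew_ranking)
```

In the notebook, something like `most_skewed(train_data.drop(columns=['subject']), top=10)` would list the most skewed sensor features.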

## Visualizing the dataset

For a classifier to perform well, instances belonging to different labels must be separable. We will use t-distributed Stochastic Neighbor Embedding (t-SNE) to project the data onto two dimensions and visualize it.

In [None]:
from sklearn.manifold import TSNE

We make a copy of the training data before applying t-SNE and drop the `Activity` and `subject` columns, so that only the sensor features are embedded. The labels are kept separately.

In [None]:
tsne_data = train_data.copy()
tsne_data.drop(['Activity', 'subject'], axis = 1, inplace = True)

Get the labels and the count of each activity.

In [None]:
labels = train_data['Activity']
label_counts = labels.value_counts()

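The data is standardized before applying t-SNE. Standardization rescales each feature to zero mean and unit variance, so that no single feature dominates the distance computations that t-SNE relies on. It does not change the shape of a feature's distribution.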

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
def scale_data(data):
    scl = StandardScaler()
    return scl, scl.fit_transform(data)

In [None]:
scale_model, scaled_data = scale_data(tsne_data)
scaled_data.shape
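As a quick check of what `StandardScaler` actually changes, the sketch below scales a synthetic skewed feature (not the HAR data). The column is centered and rescaled, while its skewness stays the same. The `skewness` helper is written out by hand for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
raw = rng.exponential(scale=3.0, size=(1000, 1))  # synthetic skewed feature

scaled = StandardScaler().fit_transform(raw)

def skewness(x):
    """Sample skewness: mean of cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

print("mean after scaling:", round(float(scaled.mean()), 6))
print("std after scaling: ", round(float(scaled.std()), 6))
print("skew before / after:", skewness(raw), skewness(scaled))
```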

Now we will apply t-SNE on the scaled train data.

In [None]:
tsne = TSNE(random_state = 0)
tsne_transformed = tsne.fit_transform(scaled_data)

In [None]:
fig1 = plt.figure(figsize = (25, 10))
colors = ['darkblue', 'mediumturquoise', 'darkgray', 'darkorchid', 'darkred', 'darkgreen']
for i, activity in enumerate(label_counts.index):
    mask = (labels == activity).values
    plt.scatter(x = tsne_transformed[mask][:,0],
                y = tsne_transformed[mask][:,1],
                color = colors[i],
                alpha = 0.4,
                label = activity)
plt.title('Visualisation using t-SNE')
plt.legend()
plt.show()
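t-SNE can be slow on several hundred features. A common variation, not used in this notebook, is to compress the data with PCA first and run t-SNE on the reduced matrix. The sketch below shows this on synthetic blobs; the component count and perplexity are arbitrary illustrative values.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix
X_demo, y_demo = make_blobs(n_samples=150, n_features=40, centers=3, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Compress to a few principal components first, then embed in 2-D
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X_demo)
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_reduced)
print(embedding.shape)
```

The resulting `embedding` can be plotted in the same way as `tsne_transformed` above.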

## 1. Can we accurately predict the activity of a person using this dataset? If so, then which is the best model?
### Creating a prediction model

We will try a simple decision tree classifier first.

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import numpy as np

In [None]:
X_train = train_data.drop(['Activity', 'subject'], axis = 1)
y_train = train_data['Activity']

X_test = test_data.drop(['Activity', 'subject'], axis = 1)
y_test = test_data['Activity']

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf_dt = DecisionTreeClassifier(max_depth = 30)

In [None]:
clf_dt.fit(X_train, y_train)

In [None]:
y_train_pred_dt = clf_dt.predict(X_train)

In [None]:
y_test_pred_dt = clf_dt.predict(X_test)

In [None]:
def plot_train_test_accuracy(y_train, y_train_pred, y_test, y_test_pred, title):
    # Compute accuracy on both splits
    acc_train = accuracy_score(y_train, y_train_pred)
    acc_test = accuracy_score(y_test, y_test_pred)

    print('Train accuracy = ', acc_train)
    print('Test accuracy = ', acc_test)

    # Draw the two accuracies side by side for comparison
    fig, ax = plt.subplots()
    ax.bar(['train accuracy', 'test accuracy'], [acc_train, acc_test],
           color = ['darkblue', 'lightblue'])
    ax.set_title(title)
    plt.show()

In [None]:
plot_train_test_accuracy(y_train, y_train_pred_dt, y_test, y_test_pred_dt, 'Decision Tree Classifier')

The figure above shows that the training accuracy is noticeably higher than the test accuracy, so the tree is overfitting to some extent. Bagging is a well-known method for reducing the variance of a model.
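One way to see how tree depth relates to overfitting is to compare train and test accuracy at several depths. The sketch below does this on synthetic data from `make_classification`, not on the HAR features; the depths are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the HAR features
X_syn, y_syn = make_classification(n_samples=600, n_features=20, n_informative=5,
                                   flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

results = {}
for depth in (2, 5, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

A large gap between the two numbers for a given depth points to overfitting at that depth.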

### Bagging

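A Random Forest classifier is an ensemble learning technique that reduces the variance of a base learner (in our case the decision tree classifier). It trains many trees on bootstrap samples of the data, considers a random subset of features at each split, and combines the trees' predictions.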
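To make the bagging idea concrete, here is a hand-rolled sketch: fit several trees on bootstrap resamples and take a majority vote. It assumes NumPy arrays with non-negative integer labels and uses synthetic data. The function `bagged_predict` is hypothetical and is not how `RandomForestClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_new, n_trees=25, seed=0):
    """Fit trees on bootstrap resamples and return the majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_new))
    votes = np.stack(votes)  # shape: (n_trees, n_new)
    # Majority vote per sample; labels are assumed to be non-negative integers
    return np.array([np.bincount(col).argmax() for col in votes.T])

X_bag, y_bag = make_classification(n_samples=300, n_features=10, random_state=0)
bag_pred = bagged_predict(X_bag[:200], y_bag[:200], X_bag[200:])
print("bagged accuracy:", (bag_pred == y_bag[200:]).mean())
```

A random forest additionally restricts each split to a random subset of features, which further decorrelates the trees.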

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf_rf = RandomForestClassifier(n_estimators = 100)

In [None]:
clf_rf.fit(X_train, y_train)

In [None]:
y_train_pred_rf = clf_rf.predict(X_train)

In [None]:
y_test_pred_rf = clf_rf.predict(X_test)

In [None]:
plot_train_test_accuracy(y_train, y_train_pred_rf, y_test, y_test_pred_rf, 'Random Forest Classifier')

The accuracy on the test set has improved!
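Overall accuracy hides which activities are confused with each other. A confusion matrix and a per-class report give that view. The sketch below uses hand-made labels for illustration only, not the notebook's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-made labels for illustration; in the notebook use y_test and y_test_pred_rf
y_true_demo = ["SITTING", "SITTING", "STANDING", "STANDING", "WALKING", "WALKING"]
y_pred_demo = ["SITTING", "STANDING", "STANDING", "STANDING", "WALKING", "WALKING"]

class_names = ["SITTING", "STANDING", "WALKING"]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_names)
print(cm)  # rows: true class, columns: predicted class
print(classification_report(y_true_demo, y_pred_demo, labels=class_names))
```

Off-diagonal entries in `cm` count the instances of one activity that were predicted as another.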

### Gradient Boosting
Next we will try a gradient boosting tree classifier. The base learner is a weak classifier, here a decision tree with a maximum depth of 3.

In [None]:
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(max_depth = 3, n_estimators = 500, random_state = 0)
lgbm.fit(X_train, y_train)

In [None]:
y_train_pred_lgbm = lgbm.predict(X_train)
y_test_pred_lgbm = lgbm.predict(X_test)

In [None]:
plot_train_test_accuracy(y_train, y_train_pred_lgbm, y_test, y_test_pred_lgbm, 'LGBM Classifier')

The test accuracy has improved again. Of the three models tried, the LGBM classifier is therefore the best on this dataset.
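Boosting adds shallow trees one after another, each fitted to correct the errors of the ensemble built so far. To show how accuracy develops as trees are added, the sketch below swaps in scikit-learn's `GradientBoostingClassifier` (not LightGBM) on synthetic data and uses its `staged_predict` method. The sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_gb, y_gb = make_classification(n_samples=500, n_features=20, n_informative=5,
                                 random_state=0)
X_gb_tr, X_gb_te, y_gb_tr, y_gb_te = train_test_split(X_gb, y_gb, test_size=0.3,
                                                      random_state=0)

gb = GradientBoostingClassifier(max_depth=3, n_estimators=50, random_state=0)
gb.fit(X_gb_tr, y_gb_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
staged_acc = [accuracy_score(y_gb_te, pred) for pred in gb.staged_predict(X_gb_te)]
print("after 1 tree:  ", staged_acc[0])
print("after 50 trees:", staged_acc[-1])
```

Plotting `staged_acc` against the number of trees is one way to judge whether more estimators are still helping.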

## 2. Which attributes are the most vital ones for predicting the activity of a person?

This kind of question can be answered by looking at the feature importance scores provided by the LightGBM classifier. Most tree-based gradient boosting libraries expose such scores.

In [None]:
from lightgbm import plot_importance
plot_importance(lgbm, max_num_features = 10)
plt.show()
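Importance scores built into tree models depend on how the trees happened to split. Permutation importance is a model-agnostic cross-check: shuffle one feature and measure how much the score drops. The sketch below uses scikit-learn's `permutation_importance` with a random forest on synthetic data where only the first column carries signal. It is an alternative view, not the method used above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_imp = rng.normal(size=(400, 5))
y_imp = (X_imp[:, 0] > 0).astype(int)  # only column 0 carries signal

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_imp, y_imp)
result = permutation_importance(model, X_imp, y_imp, n_repeats=5, random_state=0)

ranking = np.argsort(result.importances_mean)[::-1]  # most important first
print("features ranked by permutation importance:", ranking)
```

In practice the importance would be computed on held-out data, such as the test set, rather than on the training data used here for brevity.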

**Conclusion:**

Initially I made a t-SNE plot to see the separability of the classes, and found distinct demarcations between them. In other words, the data of different classes appear to be quite easy to separate. Then a decision tree classifier was used to predict the activity of persons in the test set, and its accuracy was only 85%. Next, a random forest classifier reached an accuracy of 93%, which is much better than that of a single decision tree. Finally, a gradient boosting tree model produced results with accuracy up to 95%. The most important features are acceleration measurements along the three axes, along with certain gyroscope measurements.
