## Activity Recognition using Machine Learning

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score

## Understand the data

In [None]:
training_data = pd.read_csv('../input/human-activity-recognition-with-smartphones/train.csv')
testing_data = pd.read_csv('../input/human-activity-recognition-with-smartphones/test.csv')

In [None]:
training_data.head()

In [None]:
print("Training Data: {}".format(training_data.shape))

In [None]:
print("Null values present in training data: {}".format(training_data.isnull().values.any()))

In [None]:
print("Testing Data: {}".format(testing_data.shape))

In [None]:
print("Null values present in testing data: {}".format(testing_data.isnull().values.any()))

The training dataset contains 7352 records and the testing dataset contains 2947 records, which I will use to evaluate the models. Neither dataset contains null values.

I can see that each record consists of features derived from accelerometer and gyroscope sensor values. The last two columns are `subject`, which refers to the subject number, and `Activity`, which defines the type of activity. The `Activity` column acts as the label `y`, and all remaining columns except `subject` are the features `X`.

In [None]:
# Get X and y for training data
X_train = training_data.drop(columns = ['Activity', 'subject'])
y_train = training_data["Activity"]

# Get X and y for testing data
y_test = testing_data['Activity']
X_test = testing_data.drop(columns = ['Activity', 'subject'])

## Visualize the dataset

I will now visualise the training data to get a better understanding of the available dataset.

In [None]:
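# Sort the counts by activity name so they line up with the sorted `activities` labels used in the plots
count_of_each_activity = y_train.value_counts().sort_index().to_numpy()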

In [None]:
activities = sorted(y_train.unique())

In [None]:
# One color per activity
colors = cm.rainbow(np.linspace(0, 1, len(activities)))
plt.figure(figsize=(10,6))
plt.bar(activities,count_of_each_activity,width=0.3,color=colors)
plt.xticks(rotation=45,fontsize=12)
plt.yticks(rotation=45,fontsize=12)

In [None]:
plt.figure(figsize=(16,8))
plt.pie(count_of_each_activity, labels = activities, autopct = '%0.2f')

The percentage values show that the number of records for each activity is comparable, so the dataset is reasonably balanced across the classes.
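
The balance can also be checked numerically rather than read off the chart. The following is a minimal sketch that uses a small made-up label series in place of `y_train`; the names `labels`, `class_share` and `imbalance_ratio` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for y_train; the real labels come from train.csv
labels = pd.Series(
    ["WALKING"] * 5 + ["SITTING"] * 4 + ["STANDING"] * 6 + ["LAYING"] * 5,
    name="Activity",
)

# Share of each class as a percentage, ordered alphabetically by label
class_share = labels.value_counts(normalize=True).sort_index() * 100

# Ratio between the largest and smallest class; 1.0 means perfectly balanced
imbalance_ratio = class_share.max() / class_share.min()

print(class_share.round(2))
print("Largest-to-smallest class ratio: {:.2f}".format(imbalance_ratio))
```

A ratio close to 1 would support the reading of the pie chart; what counts as "close enough" is a judgment call that depends on the classifier and the metric.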

On inspecting the dataset, I can see that there are many features. The column names show that there are accelerometer values, gyroscope values and some other values in the dataset. I can check the share of each type by plotting a bar graph of the counts. Accelerometer features have `Acc` in their names, gyroscope features have `Gyro`, and the rest can be considered as others.

In [None]:
Acc = 0
Gyro = 0
other = 0

for value in X_train.columns:
    if "Acc" in str(value):
        Acc += 1
    elif "Gyro" in str(value):
        Gyro += 1
    else:
        other += 1

In [None]:
plt.figure(figsize=(12,8))
plt.bar(['Accelerometer', 'Gyroscope', 'Others'],[Acc,Gyro,other],color=('r','g','b'))

Accelerometer features make up the largest share, followed by gyroscope features. Very few features fall into the others category.

In [None]:
training_data['subject'].unique()

I will select all rows from the dataset that have the `Activity` label `STANDING` and store them in `standing_activity`.

In [None]:
standing_activity = training_data[training_data['Activity'] == 'STANDING']
# Reset the index for this dataframe
standing_activity = standing_activity.reset_index(drop=True)

In [None]:
standing_activity.shape

### Set time series for each subject

In [None]:
time = 1
index = 0
time_series = np.zeros(standing_activity.shape[0])
print(time_series)

The data for each individual was collected as a continuous time series and was recorded at the `same rate`. So, I can simply assign time values to the records, restarting from `1` each time the subject changes. For each subject, the `STANDING` records start with a time value of `1`, which increments by `1` for as long as the previous row's subject matches the present row's subject. I store all the time values in the array `time_series`, convert it into a dataframe using the pandas `DataFrame()` constructor and store the result in `time_series_df`. Lastly, I combine the records and the time values in `standing_activity_df` using the pandas `concat()` function.

In [None]:
for row_number in range(standing_activity.shape[0]):
    if (row_number == 0 
        or standing_activity.iloc[row_number]['subject'] == standing_activity.iloc[row_number - 1]['subject']):
        time_series[index] = time
        time += 1
    else:
        time_series[index] = 1
        time = 2
    index += 1

# Combine the time_series with the standing_activity dataframe
time_series_df = pd.DataFrame({ 'Time': time_series })
standing_activity_df = pd.concat([standing_activity, time_series_df], axis = 1)
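
The same numbering can be produced without an explicit loop. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it runs on a small made-up `demo` dataframe, and `run_id` is a hypothetical helper name. It labels each run of consecutive rows that share a subject and then counts rows within each run.

```python
import pandas as pd

# Hypothetical stand-in for standing_activity; only the subject column matters here
demo = pd.DataFrame({"subject": [1, 1, 1, 3, 3, 5, 5, 5, 5]})

# Give every run of consecutive rows with the same subject its own id
run_id = (demo["subject"] != demo["subject"].shift()).cumsum()

# Number the rows within each run, starting at 1 as the loop above does
demo["Time"] = demo.groupby(run_id).cumcount() + 1

print(demo)
```

Grouping on runs rather than on `subject` directly keeps the behaviour of the loop, which restarts the counter whenever the subject changes between adjacent rows. Note that this version yields integer time values, whereas the `np.zeros` array in the loop holds floats.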

In [None]:
standing_activity_df.head()

For each subject, I can now plot the graph of their angles with time. I use the cm subpackage of matplotlib to get a set of colors which shall be used for differentiating subjects.

In [None]:
subjects = standing_activity_df['subject'].unique()
colors = cm.rainbow(np.linspace(0, 1, len(subjects)))

# Set the figure and font size once, before any plotting
plt.rcParams.update({'figure.figsize': [40, 30], 'font.size': 24})

# Plot one line per subject; all lines are displayed overlapping on one plot
for color, subject in zip(colors, subjects):
    subject_rows = standing_activity_df[standing_activity_df['subject'] == subject]
    plt.plot(subject_rows['Time'],
             subject_rows['angle(X,gravityMean)'],
             color = color,
             label = 'Subject ' + str(subject),
             linewidth = 4)

# Axis labels, title and legend only need to be set once
plt.xlabel('Time',fontsize=28)
plt.ylabel('Angle',fontsize=28)
plt.title('Angle between X and mean Gravity v/s Time for various subjects')
plt.legend(prop = {'size': 24})

If I take a closer look at the graph, I can see that each line, on average, varies within a range of about 0.2 to 0.3. This is the expected behaviour for a standing subject, as the slight variations can be attributed to minor human error, such as small unintended movements.
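
The spread of each line can also be computed instead of estimated by eye. This is a minimal sketch on made-up angle readings (the values in `demo` are hypothetical, not taken from the dataset); on the real data the same aggregation could be applied to `standing_activity_df`.

```python
import pandas as pd

# Hypothetical angle readings for two subjects (not taken from the dataset)
demo = pd.DataFrame({
    "subject": [1, 1, 1, 3, 3, 3],
    "angle(X,gravityMean)": [-0.80, -0.70, -0.75, -0.60, -0.90, -0.70],
})

# Spread (max minus min) of the angle within each subject's readings
angle_range = (
    demo.groupby("subject")["angle(X,gravityMean)"]
        .agg(lambda s: s.max() - s.min())
)

print(angle_range)
```

Comparing these per-subject ranges against the 0.2 to 0.3 band is a quick way to confirm or refine the visual estimate.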

## Classify activities

To begin, I'll use various machine learning algorithms available inside the sklearn package that I have already imported. For each algorithm, I'll calculate the accuracy of prediction and identify the most accurate algorithm.

For now, I will keep the default values of parameters as defined in `sklearn` for each classifier.

In [None]:
accuracy_scores = np.zeros(4)

# Support Vector Classifier
clf = SVC().fit(X_train, y_train)
prediction = clf.predict(X_test)
accuracy_scores[0] = accuracy_score(y_test, prediction)*100
print('Support Vector Classifier accuracy: {}%'.format(accuracy_scores[0]))

# Logistic Regression
clf = LogisticRegression().fit(X_train, y_train)
prediction = clf.predict(X_test)
accuracy_scores[1] = accuracy_score(y_test, prediction)*100
print('Logistic Regression accuracy: {}%'.format(accuracy_scores[1]))

# K Nearest Neighbors
clf = KNeighborsClassifier().fit(X_train, y_train)
prediction = clf.predict(X_test)
accuracy_scores[2] = accuracy_score(y_test, prediction)*100
print('K Nearest Neighbors Classifier accuracy: {}%'.format(accuracy_scores[2]))

# Random Forest
clf = RandomForestClassifier().fit(X_train, y_train)
prediction = clf.predict(X_test)
accuracy_scores[3] = accuracy_score(y_test, prediction)*100
print('Random Forest Classifier accuracy: {}%'.format(accuracy_scores[3]))
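
The four blocks above repeat the same fit, predict and score steps, so they could be folded into one loop. The sketch below illustrates that pattern on synthetic data from `make_classification`, so its scores say nothing about the activity dataset. The names `classifiers` and `scores` are hypothetical, and `max_iter=1000` and `random_state=0` are my assumptions (to avoid a convergence warning and to make the run repeatable), which departs from the all-defaults setup used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data; the notebook itself uses X_train/X_test from the CSV files
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Map a display name to each classifier (non-default arguments are assumptions)
classifiers = {
    "Support Vector Classifier": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each classifier, predict on the held-out split and store accuracy in percent
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, clf.predict(X_te)) * 100
    print("{} accuracy: {:.2f}%".format(name, scores[name]))
```

Keeping the names and scores together in one dictionary also removes the need to maintain a separate label list in the same order when plotting.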

I will now plot a bar graph of the accuracies to compare them visually.

In [None]:
plt.figure(figsize=(12,8))
colors = cm.rainbow(np.linspace(0, 1, 4))
labels = ['Support Vector Classifier', 'Logistic Regression', 'K Nearest Neighbors', 'Random Forest']
plt.bar(labels,
        accuracy_scores,
        color = colors)
plt.xlabel('Classifiers',fontsize=18)
plt.ylabel('Accuracy',fontsize=18)
plt.title('Accuracy of various algorithms',fontsize=20)
plt.xticks(rotation=45,fontsize=12)
plt.yticks(fontsize=12)

The graph shows that `Logistic Regression` performed the best, with the highest accuracy of the four classifiers.