# Machine Learning Models for Human Activity Recognition

The dataset has a large number of features, which would normally call for extensive preprocessing.  
However, the aim of this notebook is to compare ML models on the unprocessed data and then try to improve the score.  
We'll also look at a feature selection method that can raise the score for some models or cut training time for others.  
It is essentially a tradeoff between time and score.  

EDA for this dataset is well demonstrated in [this](https://www.kaggle.com/abheeshthmishra/eda-of-human-activity-recognition) notebook.

In [None]:
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Importing Train Data

In [None]:
df = pd.read_csv("../input/human-activity-recognition-with-smartphones/train.csv")
df.head()

# % of Different categories
The class percentages are roughly equal, so we can treat this as a balanced dataset.  
However, we'll still use the F1-score for comparisons.

In [None]:
df['Activity'].groupby(df['Activity']).count()

In [None]:
# Count the samples per activity once and reuse the result
counts = df['Activity'].value_counts().sort_index()
activity = counts.index
activity_data = counts.values
colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#8c564b", "#a4d321"]
plt.pie(activity_data, labels=activity, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.title("% of Different categories")
plt.show()

## Checking the number of null values

In [None]:
print(df.isna().sum())

In [None]:
x = df.drop(['Activity'],axis=1)
y = df['Activity']

# Training Models

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

In [None]:
classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="rbf"),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GaussianNB(),
    RidgeClassifier(),
    LogisticRegression(max_iter=200)
]

In [None]:
def f_score(X_train, X_test, y_train, y_test):
    for clf in classifiers:
        s = time.time()
        clf.fit(X_train,y_train)
        y_pred = clf.predict(X_test)
        f = f1_score(y_true=y_test,y_pred=y_pred,average="macro")
        e = time.time()
        print(f"Score: {round(f,3)} \t Time(in secs): {round(e-s,3)} \t Classifier: {clf.__class__.__name__}")

# F1-Score

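$$\text{Recall} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}}$$

$$\text{Precision} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalsePositives}}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$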
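To make the definitions concrete, here is a minimal sketch that computes the macro-averaged F1-score by hand and compares it with `f1_score(..., average="macro")`, the setting used in this notebook. The toy labels and the helper `per_class_f1` are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import f1_score

# Toy labels, invented for illustration only
y_true = ["WALKING", "WALKING", "SITTING", "SITTING", "LAYING", "LAYING"]
y_pred = ["WALKING", "SITTING", "SITTING", "SITTING", "LAYING", "WALKING"]

labels = sorted(set(y_true))


def per_class_f1(label):
    """F1 for one class, treating that class as 'positive' (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Macro averaging: unweighted mean of the per-class F1 values
manual_macro_f1 = sum(per_class_f1(label) for label in labels) / len(labels)
sklearn_macro_f1 = f1_score(y_true, y_pred, average="macro")

print(round(manual_macro_f1, 4), round(sklearn_macro_f1, 4))
```

Because macro averaging weights every class equally, a poorly predicted class pulls the score down even when it is rare, which is why it remains a reasonable choice for a roughly balanced dataset.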

### F1-score on the held-out split of the train data

In [None]:
f_score(X_train, X_test, y_train, y_test)

**The score above is computed on a held-out 20% split of the train data, not on the test data.**

### F1-score on the test data

In [None]:
df_test = pd.read_csv("../input/human-activity-recognition-with-smartphones/test.csv")
df_test_x = df_test.drop(['Activity'],axis=1)
df_test_y = df_test['Activity']
f_score(x, df_test_x, y, df_test_y)

# Stacking Classifier
A stacking classifier trains several base classifiers and then fits a final estimator on their predictions.  
To learn more, refer [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html).
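As a rough illustration of what the final estimator actually sees, the sketch below fits a small `StackingClassifier` on synthetic data and inspects the stacked features via `transform`. The synthetic data from `make_classification`, the two base estimators, and the `LogisticRegression` final estimator are assumptions chosen to keep the example fast; they are not the configuration used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem (an assumption, not the HAR dataset)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

base_estimators = [
    ("rfc", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("knc", KNeighborsClassifier(5)),
]

# The final estimator is trained on cross-validated predictions of the base models
stack = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_f1 = f1_score(y_te, stack.predict(X_te), average="macro")

# Each base model contributes one probability column per class
meta_features = stack.transform(X_te)
print(meta_features.shape, round(stack_f1, 3))
```

The internal cross-validation is the main reason stacking is slow: every base estimator is fitted several times before the final estimator is trained.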

In [None]:
estimators = [
        ('RFC' ,RandomForestClassifier(n_estimators=500, random_state = 42)),
        ('KNC', KNeighborsClassifier(5)),
        ('DTC', DecisionTreeClassifier()),
        ('SVC', SVC(kernel="rbf")),
        ('RC',  RidgeClassifier()),
]

clf = StackingClassifier(
    estimators=estimators, 
    final_estimator=GradientBoostingClassifier()
)

### F1-score on the held-out split of the train data

In [None]:
s = time.time()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
e = time.time()
print(f"time consumed: {round(e-s,3)}")
f1_score(y_true=y_test,y_pred=y_pred,average="macro")

### F1-score on the test data

In [None]:
s = time.time()
clf.fit(x,y)
y_pred = clf.predict(df_test_x)
e = time.time()
print(f"time consumed: {round(e-s,3)}")
f1_score(y_true=df_test_y,y_pred=y_pred,average="macro")

The stacking classifier does a great job of boosting the score to **99+** on the held-out split of the train data and **96+** on the test data.  
However, it takes much longer to train.  

# Trying to reduce the number of features
A random forest classifier estimates the importance of each feature.  
This can be used to keep only the most important features.  
You may also use logistic regression.  
Put simply: logistic regression fits linear coefficients.  
Coefficients with larger magnitudes have a greater impact on `Y` than others, provided the features are on comparable scales.
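The selection rule can be sketched on synthetic data. With a random forest and no explicit `threshold`, `SelectFromModel` keeps the features whose importance is at least the mean importance. The data from `make_classification` and the `random_state` values below are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data with a few informative features (an assumption, not the HAR dataset)
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0, random_state=0
)

selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(X, y)

# Importances of the fitted forest and the threshold SelectFromModel derived from them
importances = selector.estimator_.feature_importances_
mask = selector.get_support()

# Reproduce the selection by hand: keep features at or above the threshold
manual_mask = importances >= selector.threshold_

X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
print("threshold:", selector.threshold_, "mean importance:", importances.mean())
```

Passing `threshold="median"` or a number to `SelectFromModel` changes how aggressive the cut is, which is one way to move along the time versus score tradeoff.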

In [None]:
sel = SelectFromModel(RandomForestClassifier())
sel.fit(x,y)

In [None]:
features = x.columns[(sel.get_support())]
print(len(features))
features

Hence the random forest marks these 125 features as important. The exact count can vary between runs because no `random_state` is set on the forest.

In [None]:
X1 = x.filter(items=features)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size = 0.2, random_state = 42)

# Training on the reduced dataset

### F1-score on the held-out split of the train data

In [None]:
f_score(X_train, X_test, y_train, y_test)

### F1-score on the test data

In [None]:
f_score(X1, df_test_x.filter(items=features), y, df_test_y)

# Training the Stacking Classifier on the reduced dataset

### F1-score on the held-out split of the train data

In [None]:
s = time.time()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
e = time.time()
print(f"time consumed: {round(e-s,3)}")
f1_score(y_true=y_test,y_pred=y_pred,average="macro")

### F1-score on the test data

In [None]:
s = time.time()
clf.fit(X1,y)
y_pred = clf.predict(df_test_x.filter(items=features))
e = time.time()
print(f"time consumed: {round(e-s,3)}")
f1_score(y_true=df_test_y,y_pred=y_pred,average="macro")

After feature selection we get a maximum score of **98+** on the held-out split of the train data and **93+** on the test data.

### The tradeoff between score and time
In most cases models are trained in advance and deployed with just their weights. With on-device processing, for example on a smartphone, we need to decide which of the two matters more.  
Stacking often boosts the score, as in the case above, but it comes at the cost of extra training time.  
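When deployment is the concern, it can help to time fitting and prediction separately, since the `f_score` helper in this notebook reports their sum. The following is a minimal sketch under that assumption; the helper `time_fit_and_predict` and the synthetic data are hypothetical and only illustrate the measurement.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier


def time_fit_and_predict(clf, X_train, y_train, X_test):
    """Return (fit_seconds, predict_seconds, predictions) for one classifier."""
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = clf.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds, y_pred


# Synthetic data (an assumption, not the HAR dataset)
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=10, n_classes=3, random_state=42
)
X_train, y_train, X_test = X[:800], y[:800], X[800:]

timings = {}
for clf in (KNeighborsClassifier(5), RidgeClassifier()):
    fit_s, pred_s, y_pred = time_fit_and_predict(clf, X_train, y_train, X_test)
    timings[clf.__class__.__name__] = (fit_s, pred_s)
    print(f"{clf.__class__.__name__}: fit {fit_s:.4f}s, predict {pred_s:.4f}s")
```

A model that is cheap to fit is not necessarily cheap at prediction time, so the two numbers are worth looking at separately before choosing what to run on a device.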
I hope this notebook helped you.  
**Happy Learning**