## Human Activity Recognition Using Smartphones Data Set

# `Introduction`

We will be using the [Human Activity Recognition with Smartphones](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) database, which was built from recordings of study participants performing activities of daily living (ADL) while carrying a smartphone with embedded inertial sensors. The objective is to classify each record into one of six activities (walking, walking upstairs, walking downstairs, sitting, standing, and laying).

Each record in the dataset provides:

- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration. 
- Triaxial angular velocity from the gyroscope. 
- A 561-feature vector with time and frequency domain variables. 
- Its activity label. 

More information about the features is available on the website above.

# `Plan for data exploration`
1. Exploring the data 
    * Examine the data types and value counts
2. Feature engineering 
    * Inspect the data distribution
    * Remove unimportant data, if found
    * Deal with missing (NaN) values, if found
    * Apply feature scaling to continuous variables, if needed
3. Encoding
    * Encode categorical variables, if found; here, encode the activity label as an integer
4. Splitting the data
5. Applying classification models
    * Logistic Regression
    * K-Nearest Neighbors (KNeighbors)
    * Decision Trees
    * Ensemble Methods (Gradient Boosting) 
6. Selecting the best model
7. Next steps

# `Exploring and feature engineering`

In [1]:
# importing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold,GridSearchCV
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline

%matplotlib inline

In [1]:
# filepath = 'data/Human_Activity_Recognition_Using_Smartphones_Data.csv'
train = pd.read_csv('../input/human-activity-recognition-with-smartphones/train.csv')
test = pd.read_csv('../input/human-activity-recognition-with-smartphones/test.csv')
train.head()

In [1]:
train.dtypes.value_counts()

The data columns are all floats except for the activity label.

In [1]:
# see the min and max of data excluding the target
print('min = ',train.iloc[:, :-1].min().value_counts())
print('max = ',train.iloc[:, :-1].max().value_counts())

The data are all scaled from -1 (minimum) to 1.0 (maximum).

In [1]:
# Examine the breakdown of activities-- to see if balanced or not
train.Activity.value_counts(normalize=True)

In [1]:
# True if either set contains at least one missing value
train.isnull().sum().any() or test.isnull().sum().any()
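
A missing-value check should report whether any column contains NaNs. The following minimal sketch uses a small made-up DataFrame (not the HAR data) to show how `.any()` and `.all()` behave differently on the per-column NaN counts.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): only column 'b' has a missing value
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [1.0, np.nan, 3.0]})

nan_counts = toy.isnull().sum()             # NaN count per column
has_any_missing = bool(nan_counts.any())    # at least one column has NaNs
all_cols_missing = bool(nan_counts.all())   # every column has NaNs

print(nan_counts)
print('any missing:', has_any_missing)
print('every column has missing values:', all_cols_missing)
```

Because `.all()` is only true when every column has at least one NaN, `.any()` is the safer choice for a quick "is anything missing?" check.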

# `encoding`

Scikit-learn classifiers won't accept a sparse matrix for the prediction column. Thus, either `LabelEncoder` needs to be used to convert the activity labels to integers, or, if `DictVectorizer` is used, the resulting matrix must be converted to a non-sparse array.  

We will use `LabelEncoder` to encode the "Activity" column as integers, and look at 5 random values.

In [1]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
train['Activity'] = le.fit_transform(train.Activity)
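# Reuse the mapping learned on the training labels; refitting on the test set could assign different integers
test['Activity'] = le.transform(test.Activity)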
train['Activity'].sample(5)
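
Fitting the encoder once on the training labels and reusing it for the test labels guarantees that both sets share the same integer mapping. Here is a minimal sketch, assuming two short made-up label lists in place of the real "Activity" columns:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label lists standing in for the train and test "Activity" columns
train_labels = ['WALKING', 'SITTING', 'LAYING', 'SITTING']
test_labels = ['SITTING', 'WALKING']

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_labels)  # learn the mapping on train only
test_encoded = encoder.transform(test_labels)        # reuse the same mapping on test

print(list(encoder.classes_))                         # classes are stored in sorted order
print(list(train_encoded), list(test_encoded))
print(list(encoder.inverse_transform(test_encoded)))  # back to the original strings
```

`inverse_transform` is handy later for turning integer predictions back into readable activity names.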

# `looking at correlation`

* Calculate the correlations between the features (the independent variables).
* Create a histogram of the correlation values
* Identify those that are most correlated (either positively or negatively).

In [1]:
# Calculate the correlation values
feature_cols = train.columns[:-1]
corr_values = train[feature_cols].corr()

# Simplify by keeping only the values above the diagonal;
# the diagonal and everything below it become NaN
upper_mask = np.triu(np.ones(corr_values.shape, dtype=bool), k=1)
corr_values = corr_values.where(upper_mask)
    
# Stack the data and convert to a data frame
corr_values = (corr_values
               .stack()
               .to_frame()
               .reset_index()
               .rename(columns={'level_0':'feature1',
                                'level_1':'feature2',
                                0:'correlation'}))

# Get the absolute values for sorting
corr_values['abs_correlation'] = corr_values.correlation.abs()
corr_values

In [1]:
# The most highly correlated values
corr_values.sort_values('correlation', ascending=False).query('abs_correlation>0.8')
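
The plan above also calls for a histogram of the correlation values. One way to draw it is sketched below on a small synthetic feature matrix; the column names `f1` to `f4` and the data are hypothetical stand-ins, not the HAR features.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the feature matrix (hypothetical data)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    'f1': base,
    'f2': base + 0.1 * rng.normal(size=200),  # built to correlate strongly with f1
    'f3': rng.normal(size=200),
    'f4': rng.normal(size=200),
})

corr = demo.corr()
# Keep only the strict upper triangle so each feature pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pair_corr = corr.where(mask).stack().dropna()

# Histogram of absolute pairwise correlations
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(pair_corr.abs().to_numpy(), bins=10, range=(0, 1))
ax.set(xlabel='Absolute correlation', ylabel='Number of feature pairs')
```

On the real data the same pattern would apply to the `abs_correlation` column, where a large mass near 1.0 would indicate many redundant features.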

# `Data split`

* The dataset ships with a predefined train/test split, which we use here. If you split the data yourself, consider Scikit-learn's `StratifiedShuffleSplit` to maintain the same ratio of target classes in both sets.
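
For reference, a stratified split could look like the sketch below. The labels and placeholder features are made up for illustration, and the 70/30 split size is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 60 of class 0, 30 of class 1, 10 of class 2
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

# Class proportions should be (nearly) identical in both subsets
train_ratio = np.bincount(y[train_idx]) / len(train_idx)
test_ratio = np.bincount(y[test_idx]) / len(test_idx)
print('train:', train_ratio)
print('test: ', test_ratio)
```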


In [1]:
X_train = train[feature_cols]
X_test = test[feature_cols]
y_train = train['Activity']
y_test  = test['Activity']

In [1]:
y_train.value_counts(normalize=True)

In [1]:
y_test.value_counts(normalize=True)

The distribution of the target classes is maintained successfully across the training and test sets.

# `Applying classification models`

- Logistic Regression
- K-Nearest Neighbors (KNeighbors)
- Decision Trees
- Ensemble Methods (Gradient Boosting)

## `1. Logistic Regression`

In [1]:
from sklearn.linear_model import LogisticRegression
# Standard logistic regression
lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)

In [1]:
from sklearn.linear_model import LogisticRegressionCV
# L1 regularized logistic regression
lr_l1 = LogisticRegressionCV(Cs=10, cv=4, penalty='l1', solver='liblinear').fit(X_train, y_train)
# L2 regularized logistic regression
lr_l2 = LogisticRegressionCV(Cs=10, cv=4, penalty='l2', solver='liblinear').fit(X_train, y_train)

* Predict and store the class for each model.
* Store the probability for the predicted class for each model. 

In [1]:
# Predict the class and the probability for each
y_pred = list()
y_prob = list()

coeff_labels = ['lr', 'l1', 'l2']
coeff_models = [lr, lr_l1, lr_l2]

for lab,mod in zip(coeff_labels, coeff_models):
    y_pred.append(pd.Series(mod.predict(X_test), name=lab))
    y_prob.append(pd.Series(mod.predict_proba(X_test).max(axis=1), name=lab))
    
y_pred = pd.concat(y_pred, axis=1)
y_prob = pd.concat(y_prob, axis=1)
y_pred.head()

In [1]:
y_prob.head()

### `error metrics`

For each model, calculate the following error metrics: 
* Accuracy
* Precision
* Recall
* F-score
* Confusion Matrix

In [1]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize

metrics = list()
cm = dict()

for lab in coeff_labels:

    precision, recall, fscore, _ = score(y_test, y_pred[lab], average='weighted')
    
    accuracy = accuracy_score(y_test, y_pred[lab])
    
    # ROC-AUC scores can be calculated by binarizing the data
    auc = roc_auc_score(label_binarize(y_test, classes=[0,1,2,3,4,5]),
              label_binarize(y_pred[lab], classes=[0,1,2,3,4,5]), 
              average='weighted')
    
    cm[lab] = confusion_matrix(y_test, y_pred[lab])
    
    metrics.append(pd.Series({'precision':precision, 'recall':recall, 
                              'fscore':fscore, 'accuracy':accuracy,
                              'auc':auc}, 
                             name=lab))

metrics = pd.concat(metrics, axis=1)

In [1]:
metrics
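
ROC-AUC is normally computed from scores or probabilities rather than from hard class predictions, because the curve is traced by varying the decision threshold. As a hedged alternative to binarizing predicted labels, the sketch below passes class probabilities to `roc_auc_score` with `multi_class='ovr'`. It runs on a synthetic 3-class problem, not the HAR data, and the model settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem (hypothetical data, not the HAR features)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # shape (n_samples, n_classes)

# One-vs-rest AUC computed from class probabilities, weighted by class support
auc_ovr = roc_auc_score(y_te, proba, multi_class='ovr', average='weighted')
print(round(auc_ovr, 3))
```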

In [1]:
from sklearn.metrics import f1_score

# lr was already fitted above, so it can be reused directly.
# Note that f1_score expects (y_true, y_pred) in that order.
y_pred = lr.predict(X_test)
f1_lr = f1_score(y_test, y_pred, average='weighted')
f1_lr

In [1]:
## Display or plot the confusion matrix for each model.

fig, axList = plt.subplots(nrows=2, ncols=2)
axList = axList.flatten()
fig.set_size_inches(12, 10)

axList[-1].axis('off')

for ax,lab in zip(axList[:-1], coeff_labels):
    sns.heatmap(cm[lab], ax=ax, annot=True, fmt='d');
    ax.set(title=lab);
    
plt.tight_layout()

## `2. K-Nearest Neighbors (KNeighbors)`

In [1]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn = knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
f1_knn = f1_score(y_test, y_pred, average='weighted')  # (y_true, y_pred) order
f1_knn

## `3. Decision Trees`

In [1]:
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.filterwarnings("ignore")

tuning_parameters = {'max_depth':[2,4,6,8,10],
                     'min_samples_leaf':[2,4,6,8,10], 
                     'min_samples_split':[2,4,6,8,10]}

scorer = make_scorer(f1_score, average = 'micro')
GR = GridSearchCV(DecisionTreeClassifier(random_state=42), tuning_parameters, scoring=scorer,)
GR = GR.fit(X_train, y_train)

print('GR.best_estimator_: ', GR.best_estimator_)
print('GR.best_score_: ', GR.best_score_)
print('GR.best_params_: ', GR.best_params_)

In [1]:
GR.best_score_, GR.best_params_

In [1]:
y_pred = GR.predict(X_test)
f1_dt = f1_score(y_test, y_pred, average='weighted')  # (y_true, y_pred) order
f1_dt

## `4. Gradient Boosting`

In [1]:
from sklearn.ensemble import GradientBoostingClassifier
GBC = GradientBoostingClassifier(max_features=5, n_estimators=100, random_state=42)
# Fit and predict on the same DataFrame inputs so feature names stay consistent
GBC.fit(X_train, y_train)
y_pred = GBC.predict(X_test)
f1_GBC = f1_score(y_test, y_pred, average='weighted')  # (y_true, y_pred) order
f1_GBC

# `Selecting best model`

In [1]:
pd.DataFrame({'Logistic Regression':f1_lr, 'KNN':f1_knn, 
                              'Decision Trees':f1_dt, 'Gradient Boosting':f1_GBC},index = ['F1_SCORE'])

### So we will choose `GradientBoostingClassifier` or `logistic regression`

# `key findings`

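Logistic regression with default settings achieved the highest F1 score, so we choose it. Note that scikit-learn's `LogisticRegression` applies L2 regularization (`C=1.0`) by default, so this model is not unregularized. The decision tree grid search took too long and got the worst score.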

# `Next steps`

We can use another encoding method, such as `LabelBinarizer`, and we can try another ensemble method, such as `Random forest`.
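
As a starting point for the random forest idea, a minimal sketch could look like the following. It runs on synthetic 6-class data rather than the HAR features, and the hyperparameters are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 6-class problem (hypothetical stand-in for the activity data)
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Same weighted F1 metric used for the other models
f1_rf = f1_score(y_te, rf.predict(X_te), average='weighted')
print(round(f1_rf, 3))
```

On the real data, `X_tr`, `X_te`, `y_tr` and `y_te` would be replaced by `X_train`, `X_test`, `y_train` and `y_test`, and the resulting score could be added to the comparison table.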