This assignment is a bit different. One of the key expectations of any published article is the reproducibility of its results, especially in data science. In this last assignment of the course, you are tasked with attempting to reproduce the study described in a peer-reviewed article published by the National Institutes of Health (nih.gov). This task is aimed at gradually building your capacity to tackle complex topics, familiarizing you with academic discourse, and providing context and practice for the skills you will eventually need when working on your capstone thesis or project.

#### Ensemble-Based Classifier

Familiarize yourself with the ensemble package in Python and its use in a Jupyter notebook by utilizing the “Ensemble Methods for Classification of Physical Activities” to complete this assignment. 

Note the additional digital resources in Supplemental Digital Content at the bottom of the article. Also, note that the dataset used in the article is available for download from the UCI repository and the direct link is included in the article in the Methods section, Data set 1.

Another useful resource is “A Comprehensive Guide to Ensemble Learning (with Python codes).”

Once you have reviewed the required resources, complete the following:

Follow the steps described in the article for acquiring the data and building a classifier (in Python) that implements an ensemble framework. It is expected that you will encounter obstacles along the way and not every step mentioned in the article will be straightforward to implement. Ideally, you will be able to reproduce the project in its entirety. Less than ideal, but still very useful, would be to attempt most steps, adapt some, maybe eliminate one or two classification methods from the ensemble, but still produce a working classifier. Given the breadth and depth of the projects you worked on in this course and given the detailed resources provided that cover both theory and implementation, you are expected to successfully complete this project.

If you are able to fully implement the steps described in the article, this would be an excellent opportunity to write a paper informing the scientific community (and the authors) that you corroborate the results. If you followed the steps to the letter and obtained a different result, that would be even more interesting. In such a case, the scientific community should hear from you.

Create a technical report (no need to rewrite the article), in which you document your work, all steps, including the code and its output. Compare the results to the ones in the article, even if your ensemble framework is not identical to the one described in the article.

APA style is expected, as well as formal and rigorous scientific writing, using appropriate mathematical notation and references.

This assignment uses a rubric. Review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.

#### Attribute Information:

##### The 54 columns in the data files are organized as follows:

1 timestamp (s)

2 activityID (see below for the mapping to the activities)

3 heart rate (bpm)

4-20. IMU hand

21-37. IMU chest

38-54. IMU ankle


##### The IMU sensory data contains the following columns:

1 temperature (°C)

2-4. 3D-acceleration data (m/s²), scale: ±16g, resolution: 13-bit

5-7. 3D-acceleration data (m/s²), scale: ±6g, resolution: 13-bit

8-10. 3D-gyroscope data (rad/s)

11-13. 3D-magnetometer data (μT)

14-17. orientation (invalid in this data collection)
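The column layout above can be turned into explicit column names. The following is a minimal sketch; the names themselves (`hand_temperature`, `acc16_x`, and so on) are illustrative choices, not names defined by the dataset.

```python
# Hypothetical channel names for one IMU, following the 17-column layout listed above.
IMU_CHANNELS = (
    ["temperature"]
    + [f"acc16_{axis}" for axis in "xyz"]        # columns 2-4: +/-16g accelerometer
    + [f"acc6_{axis}" for axis in "xyz"]         # columns 5-7: +/-6g accelerometer
    + [f"gyro_{axis}" for axis in "xyz"]         # columns 8-10: gyroscope
    + [f"mag_{axis}" for axis in "xyz"]          # columns 11-13: magnetometer
    + [f"orientation_{i}" for i in range(1, 5)]  # columns 14-17: invalid in this collection
)


def build_column_names():
    """Return the 54 column names: 3 leading columns plus 17 per IMU."""
    names = ["timestamp", "activityID", "heart_rate"]
    for sensor in ("hand", "chest", "ankle"):
        names += [f"{sensor}_{channel}" for channel in IMU_CHANNELS]
    return names


COLUMN_NAMES = build_column_names()
print(len(COLUMN_NAMES))
# Zero-based positions 3, 20 and 37 are the first column of each IMU block.
print(COLUMN_NAMES[3], COLUMN_NAMES[20], COLUMN_NAMES[37])
```

With zero-based indexing, positions 3, 20, and 37 are the temperature channels of the hand, chest, and ankle IMUs respectively.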


##### List of activityIDs and corresponding activities:

1 lying

2 sitting

3 standing

4 walking

5 running

6 cycling

7 Nordic walking

9 watching TV

10 computer work

11 car driving

12 ascending stairs

13 descending stairs

16 vacuum cleaning

17 ironing

18 folding laundry

19 house cleaning

20 playing soccer

24 rope jumping

0 other (transient activities)


##### Initialize

In [17]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import normalize
import xgboost as xgb
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV
# plot_roc_curve is not imported: it is unused below and is no longer available in recent scikit-learn releases
from sklearn.metrics import auc, accuracy_score, confusion_matrix, mean_squared_error, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression


In [18]:
df = pd.read_table("protocol/subject101.dat", sep=" ", header=0)
df.head()

Unnamed: 0,8.38,0,104,30,2.37223,8.60074,3.51048,2.43954,8.76165,3.35465,...,0.00830026,0.00925038,-0.0175803,-61.1888,-38.9599,-58.1438,1.2,0.7,0.8,0.9
0,8.39,0,,30.0,2.18837,8.5656,3.66179,2.39494,8.55081,3.64207,...,-0.006577,-0.004638,0.000368,-59.8479,-38.8919,-58.5253,1.0,0.0,0.0,0.0
1,8.4,0,,30.0,2.37357,8.60107,3.54898,2.30514,8.53644,3.7328,...,0.003014,0.000148,0.022495,-60.7361,-39.4138,-58.3999,1.0,0.0,0.0,0.0
2,8.41,0,,30.0,2.07473,8.52853,3.66021,2.33528,8.53622,3.73277,...,0.003175,-0.020301,0.011275,-60.4091,-38.7635,-58.3956,1.0,0.0,0.0,0.0
3,8.42,0,,30.0,2.22936,8.83122,3.7,2.23055,8.59741,3.76295,...,0.012698,-0.014303,-0.002823,-61.5199,-39.3879,-58.2694,1.0,0.0,0.0,0.0
4,8.43,0,,30.0,2.29959,8.82929,3.5471,2.26132,8.65762,3.77788,...,-0.006089,-0.016024,0.00105,-60.2954,-38.8778,-58.3977,1.0,0.0,0.0,0.0
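In the output above, the column labels are data values (8.38, 0, 104, 30, ...), which suggests the first record was consumed as a header. The sketch below illustrates the difference between `header=0` and `header=None` on a small in-memory sample. The sample is truncated to four columns and is only modelled on the rows shown above; the column names are illustrative assumptions.

```python
import io

import pandas as pd

# Illustrative sample only: three space-separated records, four columns each.
sample = "8.38 0 104 30\n8.39 0 NaN 30\n8.40 0 NaN 30\n"

# header=0 treats the first record as column labels, so one row of data is lost.
with_header = pd.read_csv(io.StringIO(sample), sep=" ", header=0)

# header=None keeps every record; names are supplied explicitly (hypothetical names).
no_header = pd.read_csv(
    io.StringIO(sample),
    sep=" ",
    header=None,
    names=["timestamp", "activityID", "heart_rate", "hand_temperature"],
)

print(len(with_header), len(no_header))
print(no_header.head())
```

The same `header=None` and `names=` arguments are accepted by `pd.read_table`.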


##### Data Preprocess

In [30]:
df_clean = df.iloc[:,[0,1,2,3,20,37]]
df_clean.columns = ['timestamp', 'activityID', 'heart_rate1', 'IMU_hand1', 'IMU_chest1', 'IMU_ankle1']
df_clean.head()
df_clean = df_clean.dropna()

In [35]:
act_dict = {1: 'lying', 2: 'sitting',3: 'standing',4: 'walking',5: 'running',6: 'cycling',7: 'Nordic_walking',9: 'watching_TV',
            10: 'computer_work',11: 'car_driving',12: 'ascending_stairs',13: 'descending_stairs',16: 'vacuum_cleaning',17: 'ironing',
            18: 'folding_laundry',19: 'house_cleaning',20: 'playing_soccer',24: 'rope_jumping',0: 'other'}
df_clean['activityID'].map(act_dict)
df_clean1 = df_clean.copy()
df_clean1['activityID'].map(act_dict)

9         other
20        other
31        other
42        other
53        other
          ...  
376364    other
376375    other
376386    other
376397    other
376408    other
Name: activityID, Length: 34089, dtype: object
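`Series.map` returns a new Series and leaves the frame unchanged, so the labels only persist if the result is assigned. A minimal sketch on a toy frame follows; the `activity` column name and the decision to drop activityID 0 are assumptions made for illustration, not steps taken in the cells above.

```python
import pandas as pd

# Subset of the activity mapping listed earlier, enough for the toy example.
act_dict = {0: "other", 1: "lying", 2: "sitting", 3: "standing"}

toy = pd.DataFrame({"activityID": [0, 1, 2, 1, 3, 0]})

# Assign the mapped labels to a new column so they are kept.
toy["activity"] = toy["activityID"].map(act_dict)

# Assumption: transient samples (activityID 0) are excluded before modelling.
labelled = toy[toy["activityID"] != 0].reset_index(drop=True)

print(labelled)
print(labelled["activity"].value_counts())
```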

In [37]:
train, test = train_test_split(df_clean, random_state=6)

X_train = normalize(train.drop(["heart_rate1"], axis=1))
X_test = normalize(test.drop(["heart_rate1"], axis=1))

y_train = train.heart_rate1
y_test = test.heart_rate1
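Two points about the split above may be worth checking. First, `sklearn.preprocessing.normalize` rescales each row to unit norm by default, which is different from standardizing each feature column. Second, the target here is `heart_rate1`, whereas the assignment asks for a classifier of physical activities. The sketch below shows one possible alternative arrangement on synthetic data: an integer activity label as the target and a `StandardScaler` fitted on the training split only. The data, the feature scales, and the three-class label are all made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)

# Synthetic stand-ins: heart rate plus three temperature channels (illustrative scales).
X = rng.normal(loc=[70.0, 30.0, 32.0, 31.0], scale=[15.0, 1.0, 1.0, 1.0], size=(200, 4))
# Synthetic activity labels 1..3 playing the role of activityID.
y = rng.integers(1, 4, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=6, stratify=y
)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(6))
print(X_train_s.std(axis=0).round(6))
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into the model.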

##### After preprocessing, start with simple models such as random forest and boosting. Run the random forest with the same number of trees as the study (20)

In [43]:
clf = RandomForestClassifier(n_estimators=20)
clf = clf.fit(X_train, y_train)

clf.score(X_test, y_test)

0.01032500293323947

In [44]:
xgb_model = xgb.XGBRegressor(random_state=6)

params = {
  "colsample_bytree": uniform(0.7, 0.3),
  "gamma": uniform(0, 0.5),
  "learning_rate": uniform(0.03, 0.3), # default 0.1 
  "max_depth": randint(2, 6), # default 3
  "n_estimators": randint(100, 150), # default 100
  "subsample": uniform(0.6, 0.4)
}

#use randomized search to sample 200 hyperparameter combinations from the distributions above
xgbcv = RandomizedSearchCV(xgb_model, param_distributions=params, random_state=6, n_iter=200, cv=3, verbose=1, 
                           n_jobs=-1, return_train_score=True)
#fit model to training data
xgbcv.fit(X_train, y_train)

predictions_test = xgbcv.predict(X_test)

xgbcv.score(X_test, y_test)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   11.6s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   55.8s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:  2.9min finished


0.9994842293757199
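The two scores above measure different things. For scikit-learn classifiers, `score` returns mean accuracy; for regressors, including those wrapped in `RandomizedSearchCV` with default scoring, it returns the coefficient of determination (R²). The sketch below checks this on synthetic data. It uses `RandomForestRegressor` in place of `XGBRegressor` purely to avoid the extra dependency, and the data are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score

rng = np.random.default_rng(6)

# Synthetic features and a continuous, heart-rate-like target.
X = rng.normal(size=(300, 3))
y_cont = 70 + 10 * X[:, 0] + rng.normal(scale=0.5, size=300)
# The same target rounded to integers, treated as class labels.
y_int = np.round(y_cont).astype(int)

X_tr, X_te = X[:225], X[225:]

clf = RandomForestClassifier(n_estimators=20, random_state=6).fit(X_tr, y_int[:225])
reg = RandomForestRegressor(n_estimators=20, random_state=6).fit(X_tr, y_cont[:225])

acc = clf.score(X_te, y_int[225:])   # exact-match accuracy over many integer classes
r2 = reg.score(X_te, y_cont[225:])   # R^2 of the continuous prediction

print("classifier accuracy:", acc)
print("regressor R^2:", r2)
```

Treating a many-valued numeric target as class labels requires an exact match for a prediction to count, which is one plausible reason a classifier's accuracy can be far lower than a regressor's R² on the same target.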

##### After the initial investigation, compare with an ensemble model made up of different model types: in this instance KNN, random forest, and logistic regression

In [15]:
#create a new knn model
knn = KNeighborsClassifier()
#create a dictionary of all values we want to test for n_neighbors
params_knn = {'n_neighbors': np.arange(1, 25)}
#use gridsearch to test all values for n_neighbors
knn_gs = GridSearchCV(knn, params_knn, cv=5)
#fit model to training data
knn_gs.fit(X_train, y_train)


#save best model
knn_best = knn_gs.best_estimator_
#check best n_neighbors value
print(knn_gs.best_params_)



#create a new random forest classifier
rf = RandomForestClassifier()
#create a dictionary of all values we want to test for n_estimators
params_rf = {'n_estimators': [50, 100, 200]}
#use gridsearch to test all values for n_estimators
rf_gs = GridSearchCV(rf, params_rf, cv=5)
#fit model to training data
rf_gs.fit(X_train, y_train)


#save best model
rf_best = rf_gs.best_estimator_
#check best n_estimators value
print(rf_gs.best_params_)


#create a new logistic regression model
log_reg = LogisticRegression()
#fit the model to the training data
log_reg.fit(X_train, y_train)


#test the three models with the test data and print their accuracy scores
print('knn: {}'.format(knn_best.score(X_test, y_test)))
print('rf: {}'.format(rf_best.score(X_test, y_test)))
print('log_reg: {}'.format(log_reg.score(X_test, y_test)))


#create a list of (name, estimator) tuples for our models
estimators=[('knn', knn_best), ('rf', rf_best), ('log_reg', log_reg)]
#create our voting classifier, inputting our models
ensemble = VotingClassifier(estimators, voting='hard')

#fit model to training data
ensemble.fit(X_train, y_train)
#test our model on the test data
ensemble.score(X_test, y_test)

{'n_neighbors': 1}
{'n_estimators': 100}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


knn: 0.9529508389064884
rf: 0.9504869177519653
log_reg: 0.06863780359028511


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.944385779655051
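The convergence warning above points to scaling the data or raising `max_iter`. One way to act on it, sketched here on synthetic data, is to wrap the logistic regression in a pipeline with `StandardScaler` and pass that pipeline to the `VotingClassifier`. The data set, `max_iter=1000`, `n_neighbors=5`, and the other settings are illustrative assumptions, not values taken from the article or tuned for this data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the activity data.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=6
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=6, stratify=y
)

# Scale features inside the pipeline and allow more solver iterations.
log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=6)),
        ("log_reg", log_reg),
    ],
    voting="hard",  # majority vote over predicted labels
)

# VotingClassifier fits clones of the listed estimators on the training data.
ensemble.fit(X_train, y_train)
score = ensemble.score(X_test, y_test)
print("ensemble accuracy:", score)
```

Because `VotingClassifier.fit` refits clones of its estimators, a warning raised by one member can appear again when the ensemble is fitted, as it does in the output above.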