# Random Forests Classification of Body Postures Data Set Using Python

Human activity monitoring is a growing field within data science. It has practical uses in the healthcare industry, particularly for monitoring the elderly so that they avoid activities that might lead to injury. Governments are also interested in it so that they can detect unusual crowd activity, perimeter breaches, or specific behaviors such as loitering, littering, or fighting. Fitness apps use activity monitoring to better estimate the calories the body burns over a period of time.

In this lab, you will train a random forest on a public domain Human Activity Dataset titled Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements, which contains 165,633 samples, one of which is invalid. The dataset has five target activities:

- Sitting
- Sitting Down
- Standing
- Standing Up
- Walking

These activities were captured from four people wearing accelerometers mounted on their waist, left thigh, right arm, and right ankle. 

In [1]:
import pandas as pd
import time

### How to Get The Dataset

Grab the DLA HAR dataset from:

- http://groupware.les.inf.puc-rio.br/har
- http://groupware.les.inf.puc-rio.br/static/har/dataset-har-PUC-Rio-ugulino.zip
- A cached copy of the dataset is included in the course repository.

After extracting it, load the dataset into a dataframe named `X` and do your regular dataframe examination:

In [2]:
X = pd.read_csv('Datasets/dataset-har-PUC-Rio-ugulino.csv', delimiter=';', decimal=',')
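
The `delimiter` and `decimal` arguments matter because the file separates fields with semicolons and writes decimals with commas. As a minimal sketch, the snippet below parses a made-up two-row sample in that style (the rows and the names `sample_csv` and `sample` are illustrative, not taken from the dataset) and then runs the usual first-look calls:

```python
import io
import pandas as pd

# Two made-up rows in the same style as the real file:
# semicolon-separated fields, comma as the decimal mark.
sample_csv = io.StringIO(
    "user;gender;how_tall_in_meters;x1;class\n"
    "alice;Woman;1,62;-3;sitting\n"
    "bob;Man;1,71;-10;walking\n"
)

sample = pd.read_csv(sample_csv, delimiter=';', decimal=',')

# Regular dataframe examination: size, first rows, and column types.
print(sample.shape)
print(sample.head())
print(sample.dtypes)
```

Without `decimal=','`, a value such as `1,62` would be read as text rather than as a float.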


Encode the gender column so that `0` is male and `1` is female:

In [3]:
X.gender = X.gender.map({'Man': 0, 'Woman': 1})
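
One thing to watch with `map` is that any label missing from the dictionary silently becomes `NaN`. The sketch below uses a hypothetical series with one deliberately misspelled label to show how to check for that after encoding:

```python
import pandas as pd

# Hypothetical values; the last one is a deliberate typo ('man' vs 'Man').
gender = pd.Series(['Man', 'Woman', 'Woman', 'man'])
encoded = gender.map({'Man': 0, 'Woman': 1})

# Labels that are not keys in the mapping become NaN instead of raising.
unmapped = int(encoded.isna().sum())
print("Unmapped labels:", unmapped)
```

A count of zero after mapping confirms that every row received a code.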

Clean up any columns with commas in them so that they're properly represented as decimals:

In [4]:
# No extra work needed here: decimal=',' was passed to pd.read_csv when loading the file

Let's take a peek at your data types:

In [5]:
X.dtypes

user                   object
gender                  int64
age                     int64
how_tall_in_meters    float64
weight                  int64
body_mass_index       float64
x1                      int64
y1                      int64
z1                      int64
x2                      int64
y2                      int64
z2                      int64
x3                      int64
y3                      int64
z3                      int64
x4                      int64
y4                      int64
z4                     object
class                  object
dtype: object

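Convert any column that still needs it to numeric. Calling `pd.to_numeric` with `errors='raise'` will alert you if a value cannot be parsed. The cell below uses `errors='coerce'` instead, which turns the invalid entry into `NaN` so that its row can be dropped: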

In [6]:
X.z4 = pd.to_numeric(X.z4, errors='coerce')
X = X.dropna(axis=0)
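
The difference between the two `errors` modes can be seen on a small hypothetical column (the values and the names `raw` and `cleaned` are made up for illustration):

```python
import pandas as pd

# Hypothetical column with one entry that cannot be parsed as a number.
raw = pd.Series(['-150', '-143', 'not-a-number', '-160'])

# errors='raise' stops at the first unparseable value.
try:
    pd.to_numeric(raw, errors='raise')
    raised = False
except ValueError as exc:
    raised = True
    print("errors='raise' stopped with:", exc)

# errors='coerce' replaces unparseable values with NaN,
# which dropna() can then remove.
cleaned = pd.to_numeric(raw, errors='coerce')
print("NaN count after coercion:", int(cleaned.isna().sum()))
cleaned = cleaned.dropna()
```

Running with `errors='raise'` first is a cheap way to find out which column is problematic before deciding to coerce.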

Okay, now encode your `y` value as a Pandas dummies version of your dataset's `class` column:

In [7]:
y = X['class'].copy()
y = pd.get_dummies(y)
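
To see what the dummies encoding produces, here is a small sketch on hypothetical labels (the label strings and the names `labels`, `one_hot`, and `recovered` are illustrative). Each row ends up with exactly one flag set, and the original label can be recovered from the column holding that flag:

```python
import pandas as pd

# Hypothetical class labels standing in for the dataset's `class` column.
labels = pd.Series(['sitting', 'walking', 'standing', 'sitting'], name='class')

# One indicator column per distinct label, ordered alphabetically.
one_hot = pd.get_dummies(labels)
print(one_hot.columns.tolist())

# Every row has exactly one indicator set.
print(one_hot.sum(axis=1).tolist())

# Recover the original label from the column that holds the flag.
recovered = one_hot.astype(int).idxmax(axis=1)
print(recovered.tolist())
```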

Now drop the `user` and `class` columns from `X`:

In [8]:
X = X.drop(labels=['user', 'class'], axis=1)

Let's take a look at your handiwork:

In [9]:
X.describe()

| | gender | age | how_tall_in_meters | weight | body_mass_index | x1 | y1 | z1 | x2 | y2 | z2 | x3 | y3 | z3 | x4 | y4 | z4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 |
| mean | 0.612044 | 38.264925 | 1.639712 | 70.819431 | 26.188535 | -6.649319 | 88.293591 | -93.164449 | -87.827956 | -52.065911 | -175.055647 | 17.423517 | 104.517056 | -93.881641 | -167.641211 | -92.625235 | -159.650985 |
| std | 0.487286 | 13.183821 | 0.05282 | 11.296557 | 2.995781 | 11.616273 | 23.895881 | 39.409487 | 169.435606 | 205.160081 | 192.817111 | 52.635546 | 54.155987 | 45.38977 | 38.311336 | 19.968653 | 13.22102 |
| min | 0.0 | 28.0 | 1.58 | 55.0 | 22.0 | -306.0 | -271.0 | -603.0 | -494.0 | -517.0 | -617.0 | -499.0 | -506.0 | -613.0 | -702.0 | -526.0 | -537.0 |
| 25% | 0.0 | 28.0 | 1.58 | 55.0 | 22.0 | -12.0 | 78.0 | -120.0 | -35.0 | -29.0 | -141.0 | 9.0 | 95.0 | -103.0 | -190.0 | -103.0 | -167.0 |
| 50% | 1.0 | 31.0 | 1.62 | 75.0 | 28.4 | -6.0 | 94.0 | -98.0 | -9.0 | 27.0 | -118.0 | 22.0 | 107.0 | -90.0 | -168.0 | -91.0 | -160.0 |
| 75% | 1.0 | 46.0 | 1.71 | 83.0 | 28.6 | 0.0 | 101.0 | -64.0 | 4.0 | 86.0 | -29.0 | 34.0 | 120.0 | -80.0 | -153.0 | -80.0 | -153.0 |
| max | 1.0 | 75.0 | 1.71 | 83.0 | 28.6 | 509.0 | 533.0 | 411.0 | 473.0 | 295.0 | 122.0 | 507.0 | 517.0 | 410.0 | -13.0 | 86.0 | -43.0 |


You can also easily display which rows contain NaNs, if any:

In [10]:
X[pd.isnull(X).any(axis=1)]

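| | gender | age | how_tall_in_meters | weight | body_mass_index | x1 | y1 | z1 | x2 | y2 | z2 | x3 | y3 | z3 | x4 | y4 | z4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|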


Create a random forest classifier named `model` with `n_estimators=30`, `max_depth=10`, `oob_score=True`, and `random_state=0`:

In [11]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=30, max_depth=10, oob_score=True, random_state=0)

Split your data into `train` / `test` sets. Use a `test` size of 30% and `random_state=7`, with the variable names `X_train`, `X_test`, `y_train`, and `y_test`:

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=7)
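
As a quick sanity check on how `test_size` and `random_state` behave, the sketch below splits a small synthetic array (the `X_demo` and `y_demo` names and their contents are made up for illustration) and confirms that repeating the call with the same seed gives the same partition:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 2 features, 5 class labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 5

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7
)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7
)
same_split = np.array_equal(X_te, X_te2)
print("Reproducible:", same_split)
```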

### Now the Fun Stuff

In [13]:
print("Fitting...")
s = time.time()

# TODO: train your model on your training set

model.fit(X_train, y_train)


print("Fitting completed in: ", time.time() - s)

Fitting...
Fitting completed in:  14.176083326339722


Display the model's out-of-bag (OOB) score:

In [14]:
score = model.oob_score_
print("OOB Score: ", round(score*100, 3))

OOB Score:  98.744
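
The OOB score is available because each tree is fit on a bootstrap sample of the training rows, so the rows a tree never saw can act as a validation set for that tree. As a rough sketch on synthetic data from `make_classification` (not the HAR dataset; all `*_syn` names and the generator settings are assumptions for illustration), the OOB estimate can be compared with accuracy on a held-out split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 5-class problem standing in for the real data.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=5, random_state=0
)
X_tr_syn, X_te_syn, y_tr_syn, y_te_syn = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=7
)

forest = RandomForestClassifier(
    n_estimators=30, max_depth=10, oob_score=True, random_state=0
)
forest.fit(X_tr_syn, y_tr_syn)

# OOB accuracy: each training row is scored only by trees that did not see it.
oob_acc = forest.oob_score_
# Held-out accuracy: computed on rows that took no part in training.
test_acc = forest.score(X_te_syn, y_te_syn)

print("OOB accuracy: ", round(oob_acc * 100, 3))
print("Test accuracy:", round(test_acc * 100, 3))
```

The two numbers are separate estimates of generalization accuracy, so they need not match exactly.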


In [15]:
print("Scoring...")
s = time.time()

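# Score the model on the held-out test set
score = model.score(X_test, y_test)

print("Score: ", round(score*100, 3))
print("Scoring completed in: ", time.time() - s)

The output recorded for this cell repeats the OOB value with a run time of 0.0, which indicates it was captured without the `model.score` call; rerun the cell to see the test-set accuracy:

Scoring...
Score:  98.744
Scoring completed in:  0.0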
