# Project: Human Activity Recognition System

### Notebook 1:

### This notebook includes loading, analyzing and processing the data.

### Description of Data chosen:
The data comes from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones

This database of human activities was built from recordings of 30 participants performing everyday activities while wearing a waist-mounted smartphone with embedded inertial sensors. The features come from the 3-axial raw accelerometer and gyroscope signals, tAcc-XYZ and tGyro-XYZ. These time-domain signals (prefixed with 't' to denote time) were captured at a constant rate of 50 Hz. They were then denoised with a median filter and a third-order low-pass Butterworth filter with a corner frequency of 20 Hz. A second low-pass Butterworth filter with a corner frequency of 0.3 Hz then separated the acceleration signal into body and gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ).
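
As a rough illustration of the filtering chain described above, the sketch below applies a median filter, a third-order 20 Hz low-pass Butterworth filter, and a 0.3 Hz low-pass filter to a synthetic one-axis signal using SciPy. The synthetic signal, the median kernel size of 3, the order of the 0.3 Hz filter, and the use of zero-phase filtering (`filtfilt`) are all assumptions made for the example; this is not the dataset authors' actual pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt, medfilt

FS = 50.0  # sampling rate in Hz, as stated in the dataset description

# Synthetic one-axis acceleration (assumption): a constant gravity
# component of 1.0 plus a 2 Hz body movement, 10 seconds long.
t = np.arange(0, 10, 1 / FS)
total_acc = 1.0 + 0.2 * np.sin(2 * np.pi * 2.0 * t)

# Step 1: noise removal with a median filter, then a 3rd-order
# low-pass Butterworth filter with a 20 Hz corner frequency.
denoised = medfilt(total_acc, kernel_size=3)
b_noise, a_noise = butter(3, 20.0, btype="low", fs=FS)
denoised = filtfilt(b_noise, a_noise, denoised)

# Step 2: a 0.3 Hz low-pass filter keeps the slowly varying gravity part;
# the remainder is treated as body acceleration.
b_grav, a_grav = butter(3, 0.3, btype="low", fs=FS)
gravity_acc = filtfilt(b_grav, a_grav, denoised)
body_acc = denoised - gravity_acc

print(f"mean gravity component: {gravity_acc.mean():.3f}")
print(f"std of body component:  {body_acc.std():.3f}")
```

In this toy signal the gravity estimate stays close to the constant offset, while the 2 Hz movement ends up almost entirely in the body component.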

The body's linear acceleration and angular velocity were then differentiated with respect to time to produce the jerk signals tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ. Additionally, the magnitude of each of these three-dimensional signals (tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, and tBodyGyroJerkMag) was computed using the Euclidean norm.
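
The jerk and magnitude computations can be sketched with NumPy as follows. The sample values are made up, and the finite-difference approximation of the time derivative is an assumption of this sketch rather than a documented detail of the dataset.

```python
import numpy as np

FS = 50.0  # sampling rate in Hz, from the dataset description

# Hypothetical body-acceleration samples: rows are time steps, columns are X, Y, Z.
body_acc = np.array([
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [6.0, 8.0, 0.0],
])

# Jerk: time derivative of the signal, approximated by a finite difference.
body_acc_jerk = np.diff(body_acc, axis=0) * FS

# Magnitude: Euclidean norm across the three axes at each time step.
body_acc_mag = np.linalg.norm(body_acc, axis=1)
body_acc_jerk_mag = np.linalg.norm(body_acc_jerk, axis=1)

print(body_acc_mag)       # one magnitude per sample
print(body_acc_jerk_mag)  # one magnitude per jerk sample
```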

Finally, some of these signals underwent a Fast Fourier Transform (FFT) to produce the signals fBodyAcc-XYZ, fBodyAccJerk-XYZ, fBodyGyro-XYZ, fBodyAccJerkMag, fBodyGyroJerkMag.
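
To make the frequency-domain step concrete, here is a minimal sketch of an FFT applied to one window of a time-domain signal. The 128-sample window length and the 5 Hz test sine are assumptions chosen for the example; only the 50 Hz sampling rate comes from the description above.

```python
import numpy as np

FS = 50.0   # sampling rate in Hz, from the dataset description
N = 128     # window length in samples (assumption for this sketch)

# Synthetic time-domain window: a 5 Hz sine wave.
t = np.arange(N) / FS
window = np.sin(2 * np.pi * 5.0 * t)

# Real-input FFT: magnitudes of the non-negative frequency components.
spectrum = np.abs(np.fft.rfft(window))
freqs = np.fft.rfftfreq(N, d=1 / FS)

# Frequency of the strongest component, and a magnitude-weighted mean frequency.
peak_freq = freqs[np.argmax(spectrum)]
mean_freq = np.sum(freqs * spectrum) / np.sum(spectrum)

print(f"peak frequency: {peak_freq:.2f} Hz")
print(f"weighted mean frequency: {mean_freq:.2f} Hz")
```

The peak lands on the FFT bin nearest to 5 Hz rather than exactly on 5 Hz, because the bin spacing is FS / N.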

These signals were used to estimate the variables of the feature vector for each pattern. The suffix '-XYZ' denotes 3-axial signals in the X, Y and Z directions.

The variables estimated from these signals are:

mean(): Mean value;
std(): Standard deviation;
mad(): Median absolute deviation;
max(): Largest value in array;
min(): Smallest value in array;
sma(): Signal magnitude area;
energy(): Energy measure. Sum of the squares divided by the number of values;
iqr(): Interquartile range;
entropy(): Signal entropy;
arCoeff(): Autoregression coefficients with Burg order equal to 4;
correlation(): correlation coefficient between two signals;
maxInds(): index of the frequency component with largest magnitude;
meanFreq(): Weighted average of the frequency components to obtain a mean frequency;
skewness(): skewness of the frequency domain signal;
kurtosis(): kurtosis of the frequency domain signal;
bandsEnergy(): Energy of a frequency interval within the 64 bins of the FFT of each window;
angle(): Angle between two vectors
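
A few of the simpler statistics in this list can be sketched with NumPy. The window values are made up, and details such as the population (not sample) standard deviation and NumPy's default percentile interpolation are assumptions; the dataset's exact conventions are not stated here.

```python
import numpy as np

# Hypothetical single-axis signal window (values are made up).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

window_stats = {
    "mean": np.mean(x),
    "std": np.std(x),  # population standard deviation (assumption)
    # Median absolute deviation: median distance from the median.
    "mad": np.median(np.abs(x - np.median(x))),
    "max": np.max(x),
    "min": np.min(x),
    # Energy: sum of squares divided by the number of values.
    "energy": np.sum(x ** 2) / x.size,
    # Interquartile range: 75th percentile minus 25th percentile.
    "iqr": np.percentile(x, 75) - np.percentile(x, 25),
}

for name, value in window_stats.items():
    print(f"{name}: {value:.3f}")
```

The single outlier (100.0) pulls the mean and energy up sharply, while mad() and iqr() barely move, which is one reason robust statistics are included alongside the mean and standard deviation.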

Additional vectors are obtained by averaging the signals in a signal window sample. These are used in the angle() variable:

gravityMean,
tBodyAccMean,
tBodyAccJerkMean,
tBodyGyroMean,
tBodyGyroJerkMean
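
The angle between two such vectors can be computed from their dot product. This is a generic sketch with made-up vectors; the helper name `angle_between` is hypothetical, and the values stored in the dataset may be scaled differently.

```python
import numpy as np

def angle_between(u, v):
    """Return the angle in radians between vectors u and v."""
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

# Hypothetical window-averaged vectors (values are made up).
gravity_mean = np.array([0.0, 0.0, 1.0])
x_axis = np.array([1.0, 0.0, 0.0])

print(angle_between(x_axis, gravity_mean))        # perpendicular vectors
print(angle_between(gravity_mean, gravity_mean))  # identical vectors
```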

#### Note: 
The data downloaded from the site is already split into train and test sets. I combined the two into a single dataset and loaded it here so that I could demonstrate one of the techniques covered in the course material: how to split the data.
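
Combining two pre-split sets can be sketched with `pd.concat`. The two small frames and the output file name below are hypothetical stand-ins; they do not reproduce how the actual combined CSV was built.

```python
import pandas as pd

# Hypothetical stand-ins for the downloaded train and test sets.
train_part = pd.DataFrame({"feat_a": [0.1, 0.2, 0.3], "activity Label": [5, 4, 1]})
test_part = pd.DataFrame({"feat_a": [0.4, 0.5], "activity Label": [2, 6]})

# Stack the rows and rebuild a clean 0..n-1 index.
combined = pd.concat([train_part, test_part], ignore_index=True)

# index=False avoids writing the row index as an extra column.
# combined.to_csv("combined.csv", index=False)  # hypothetical output file

print(combined.shape)
```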

#### Objective of this analysis:
The primary goal is to build a predictive model that accurately predicts the activity performed from the sensor data provided. I will fit a logistic regression, an SVM, and a decision tree, then measure the performance of each model and identify the best-performing one.

In [1]:
# Importing necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing 

#### Loading and displaying the data

In [2]:
df = pd.read_csv('/Users/sunilinus/Downloads/HumanActivity_recg_Data.csv') 
df.head()

| | tBodyAcc-mean()-X | tBodyAcc-mean()-Y | tBodyAcc-mean()-Z | tBodyAcc-std()-X | tBodyAcc-std()-Y | tBodyAcc-std()-Z | tBodyAcc-mad()-X | tBodyAcc-mad()-Y | tBodyAcc-mad()-Z | tBodyAcc-max()-X | ... | fBodyBodyGyroJerkMag-skewness() | fBodyBodyGyroJerkMag-kurtosis() | angle(tBodyAccMean,gravity) | angle(tBodyAccJerkMean),gravityMean) | angle(tBodyGyroMean,gravityMean) | angle(tBodyGyroJerkMean,gravityMean) | angle(X,gravityMean) | angle(Y,gravityMean) | angle(Z,gravityMean) | activity Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.288585 | -0.020294 | -0.132905 | -0.995279 | -0.983111 | -0.913526 | -0.995112 | -0.983185 | -0.923527 | -0.934724 | ... | -0.298676 | -0.710304 | -0.112754 | 0.0304 | -0.464761 | -0.018446 | -0.841247 | 0.179941 | -0.058627 | 5 |
| 1 | 0.278419 | -0.016411 | -0.12352 | -0.998245 | -0.9753 | -0.960322 | -0.998807 | -0.974914 | -0.957686 | -0.943068 | ... | -0.595051 | -0.861499 | 0.053477 | -0.007435 | -0.732626 | 0.703511 | -0.844788 | 0.180289 | -0.054317 | 5 |
| 2 | 0.279653 | -0.019467 | -0.113462 | -0.99538 | -0.967187 | -0.978944 | -0.99652 | -0.963668 | -0.977469 | -0.938692 | ... | -0.390748 | -0.760104 | -0.118559 | 0.177899 | 0.100699 | 0.808529 | -0.848933 | 0.180637 | -0.049118 | 5 |
| 3 | 0.279174 | -0.026201 | -0.123283 | -0.996091 | -0.983403 | -0.990675 | -0.997099 | -0.98275 | -0.989302 | -0.938692 | ... | -0.11729 | -0.482845 | -0.036788 | -0.012892 | 0.640011 | -0.485366 | -0.848649 | 0.181935 | -0.047663 | 5 |
| 4 | 0.276629 | -0.01657 | -0.115362 | -0.998139 | -0.980817 | -0.990482 | -0.998321 | -0.979672 | -0.990441 | -0.942469 | ... | -0.351471 | -0.699205 | 0.12332 | 0.122542 | 0.693578 | -0.615971 | -0.847865 | 0.185151 | -0.043892 | 5 |


In [3]:
# determining the number of rows and columns

rows = df.shape[0]
cols = df.shape[1]
print(f"Rows={rows} and Cols={cols}")

Rows=10299 and Cols=562


#### Checking for missing values

In [4]:
df.isnull().sum()

tBodyAcc-mean()-X                       0
tBodyAcc-mean()-Y                       0
tBodyAcc-mean()-Z                       0
tBodyAcc-std()-X                        0
tBodyAcc-std()-Y                        0
                                       ..
angle(tBodyGyroJerkMean,gravityMean)    0
angle(X,gravityMean)                    0
angle(Y,gravityMean)                    0
angle(Z,gravityMean)                    0
activity Label                          0
Length: 562, dtype: int64
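
The printed Series is truncated (the `..` row), so most of the 562 columns are not visible. A single aggregate makes the check explicit. The sketch below uses a small made-up frame with one deliberately missing value; the same calls would apply to `df`.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing value, for illustration only.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Missing cells in the whole frame.
total_missing = int(demo.isnull().sum().sum())

# Names of the columns that contain at least one missing value.
null_counts = demo.isnull().sum()
columns_with_missing = null_counts[null_counts > 0].index.tolist()

print(total_missing)
print(columns_with_missing)
```

On the notebook's data, a `total_missing` of 0 would confirm that no column has missing values, including the ones hidden by the truncated display.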

The counts above show no NaN values. Now let's look at the data types.

In [5]:
df.dtypes

tBodyAcc-mean()-X                       float64
tBodyAcc-mean()-Y                       float64
tBodyAcc-mean()-Z                       float64
tBodyAcc-std()-X                        float64
tBodyAcc-std()-Y                        float64
                                         ...   
angle(tBodyGyroJerkMean,gravityMean)    float64
angle(X,gravityMean)                    float64
angle(Y,gravityMean)                    float64
angle(Z,gravityMean)                    float64
activity Label                            int64
Length: 562, dtype: object
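
Since this output is also truncated, counting columns per dtype is a compact way to confirm that nothing non-numeric is hiding in the middle. The sketch below uses a made-up frame with one deliberately non-numeric column; the same calls would apply to `df`.

```python
import pandas as pd

# Made-up frame for illustration only.
demo = pd.DataFrame({
    "feat_a": [0.1, 0.2],
    "feat_b": [0.3, 0.4],
    "label": [5, 4],
    "note": ["ok", "ok"],  # deliberately non-numeric
})

# Number of columns per dtype, independent of display truncation.
dtype_counts = demo.dtypes.value_counts()

# Names of any non-numeric columns.
non_numeric = demo.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print(non_numeric)
```

An empty `non_numeric` list on the real data would back up the conclusion that every column is numeric.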

From the output above, every column is numeric (float64 features and an int64 label). Because there are no object columns, we do not need to check for inconsistent category values or typos.

#### Identifying the target variable:
The goal of this model is to predict the human activity based on the sensor data provided. The target feature is labelled 'activity Label'. For consistency, I rename it to 'activity_label'.

The activities are encoded as integers as follows, which keeps the data fully numeric:

1: Walking

2: Walking Upstairs

3: Walking Downstairs

4: Sitting

5: Standing

6: Laying

In [6]:
df.rename(columns={'activity Label': 'activity_label'}, inplace=True)
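
Before splitting, it can help to look at how the six classes are distributed. The sketch below maps the integer codes listed above to readable names and computes each activity's share. The label values are made up; in the notebook the Series would be `df['activity_label']`.

```python
import pandas as pd

# Integer codes and activity names as listed above.
ACTIVITY_NAMES = {
    1: "Walking", 2: "Walking Upstairs", 3: "Walking Downstairs",
    4: "Sitting", 5: "Standing", 6: "Laying",
}

# Made-up labels for illustration only.
labels = pd.Series([5, 5, 4, 1, 6, 5], name="activity_label")

# Share of each activity, shown with readable names.
class_share = labels.map(ACTIVITY_NAMES).value_counts(normalize=True)
print(class_share)
```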

Now separating the features and target feature:

In [7]:
features = df.drop(columns=['activity_label'])
target = df['activity_label']

### Splitting data to train and test

I am splitting the data 70/30: 70% of the records will be used to train the model and 30% to test it. With 10,299 records, this leaves about 3,090 records for testing.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=1)
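
The split above is purely random. One optional variant, not what the cell above does, is to pass `stratify` so that each activity keeps roughly the same proportion in the train and test sets. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 120 samples, 6 balanced classes.
X_demo = np.arange(240, dtype=float).reshape(120, 2)
y_demo = np.repeat([1, 2, 3, 4, 5, 6], 20)

# stratify=y_demo asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Count how many test samples each class received.
classes, counts = np.unique(y_te, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```

Whether stratification matters here depends on how balanced the six activities are in the real data.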

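Standardizing the training and test data so that all features are on a similar scale. The scaler is fit on the training data only and then applied to both sets, so no information from the test set leaks into the preprocessing.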

In [9]:
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train) 
X_test = scaler.transform(X_test)
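
A quick sanity check for this step, sketched on made-up data: after fitting on the training set, the scaled training features should have a mean of about 0 and a standard deviation of about 1, while the scaled test features will only be close to those values because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data with a non-zero mean and non-unit spread.
train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# Statistics are learned from the training data only.
scaler_demo = StandardScaler().fit(train_demo)
train_scaled = scaler_demo.transform(train_demo)
test_scaled = scaler_demo.transform(test_demo)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```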

Now that the data is preprocessed and ready to be fit, I save the processed data to separate CSV files.

In [10]:
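# Wrapping the scaled NumPy arrays and the target Series in DataFrames,
# keeping the original feature names as column headers
X_train_df = pd.DataFrame(X_train, columns=features.columns)
y_train_df = pd.DataFrame(y_train)
X_test_df = pd.DataFrame(X_test, columns=features.columns)
y_test_df = pd.DataFrame(y_test)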

# Saving DataFrames to CSV files
X_train_df.to_csv('X_train.csv', index=False)
y_train_df.to_csv('y_train.csv', index=False)
X_test_df.to_csv('X_test.csv', index=False)
y_test_df.to_csv('y_test.csv', index=False)
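
For reference, here is a minimal sketch of how files saved this way could be read back later. It uses made-up frames and a temporary directory instead of the real files, and assumes the target column is named `activity_label`, as set by the rename earlier.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frames standing in for X_train_df and y_train_df.
X_demo_df = pd.DataFrame({"0": [0.5, -1.2], "1": [1.5, 0.3]})
y_demo_df = pd.DataFrame({"activity_label": [5, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = Path(tmp_dir) / "X_demo.csv"
    y_path = Path(tmp_dir) / "y_demo.csv"
    X_demo_df.to_csv(x_path, index=False)
    y_demo_df.to_csv(y_path, index=False)

    X_loaded = pd.read_csv(x_path)
    # Selecting the single column turns the one-column frame back into a Series.
    y_loaded = pd.read_csv(y_path)["activity_label"]

print(X_loaded.shape, y_loaded.tolist())
```

Because the files are written with `index=False`, no extra index column appears when they are read back.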

### Now we have the data pre-processed and ready to be fit.