The source data for this project was downloaded from the following URL: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
From the ZIP archive, the following files were used in this project:
- activity_labels.txt
- test/subject_test.txt
- test/X_test.txt
- test/y_test.txt
- train/subject_train.txt
- train/X_train.txt
- train/y_train.txt
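Aside from activity_labels.txt (two variables) and the X_*.txt files (multiple numeric columns per row, treated in this project as repeated observations of a single variable), each of the remaining files has only one variable. Within each folder the three files have the same number of rows, so they can be combined column-wise, and the files under test/ and train/ share the same structure.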
- activity_labels.txt: contains one numeric variable that acts as the unique identifier for the accompanying string variable (the activity name).
- test/subject_test.txt: contains one numeric variable, which is the unique identifier of the study subject for each row.
- test/X_test.txt: contains the numeric measurements generated by an accelerometer; each row spans multiple columns, which this project treats as multiple observations of one variable.
- test/y_test.txt: contains one numeric variable, the activity code for each row, whose values correspond to the numeric variable of activity_labels.txt.
- train/subject_train.txt: same as test/subject_test.txt
- train/X_train.txt: same as test/X_test.txt
- train/y_train.txt: same as test/y_test.txt
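As a rough sketch of how the archive could be fetched and unpacked before running the script: the helper name `fetch_har_data`, the local ZIP file name, and the rename step are assumptions made for illustration, not part of the original code. The sketch also assumes the archive unpacks to a top-level folder named "UCI HAR Dataset" (with spaces), which would need renaming to match the `UCI_HAR_Dataset` path used in the code further down.

```r
# Hypothetical helper (not part of the original script): download the archive
# once, unpack it, and rename the extracted folder to the name the script uses.
fetch_har_data <- function(url, dest_dir = getwd()) {
  zip_path <- file.path(dest_dir, "UCI_HAR_Dataset.zip")
  if (!file.exists(zip_path)) {
    # mode = "wb" keeps the binary ZIP intact on Windows
    download.file(url, destfile = zip_path, mode = "wb")
  }
  unzip(zip_path, exdir = dest_dir)

  # Assumption: the archive unpacks to "UCI HAR Dataset" (with spaces)
  extracted <- file.path(dest_dir, "UCI HAR Dataset")
  target <- file.path(dest_dir, "UCI_HAR_Dataset")
  if (dir.exists(extracted) && !dir.exists(target)) {
    file.rename(extracted, target)
  }
  invisible(target)
}

# Usage (not run here, since it needs network access):
# fetch_har_data("https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip")
```

Wrapping the steps in a function keeps the download from being repeated on every run, since the ZIP is only fetched when it is not already present.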
Given the following project directives:
- Merges the training and the test sets to create one data set.
- Extracts only the measurements on the mean and standard deviation for each measurement.
- Uses descriptive activity names to name the activities in the data set
- Appropriately labels the data set with descriptive variable names.
- Creates a second, independent tidy data set with the average of each variable for each activity and each subject.
The approach taken here departs from that order: the test and train datasets were processed separately and merged only at the very end. Directives #2 and #4 were therefore carried out first, then #1, and finally #3 and #5. Directive #2 is implemented as the row-wise mean and standard deviation across all columns of each X file. Here is the code used to process each dataset and achieve directives #2 and #4:
## Test Data
# Set path to files
maindir <- getwd()
path <- paste(maindir, "UCI_HAR_Dataset/test", sep="/")
# Read files
subjectTest <- read.table(file= paste(path,"subject_test.txt", sep="/"), header=FALSE)
xTest <- read.table(file= paste(path,"X_test.txt", sep="/"), header=FALSE)
yTest <- read.table(file= paste(path,"y_test.txt", sep="/"), header=FALSE)
# Compute the row-wise mean and standard deviation across all columns of xTest,
# and collect them in a new data frame
meanTest <- rowMeans(xTest, na.rm = TRUE)
stdTest <- apply(xTest, 1, sd, na.rm = TRUE)  # na.rm matches rowMeans() above
calcsTest <- data.frame(meanTest, stdTest)
# Concatenate the files
concatTest <- cbind(subjectTest,yTest,calcsTest)
# Rename columns to apply rbind() later and give descriptive names
colnames(concatTest) <- c("Subject", "Activity", "Mean","Std")
## Train Data
# Set path to files
maindir <- getwd()
path <- paste(maindir, "UCI_HAR_Dataset/train", sep="/")
# Read files
subjectTrain <- read.table(file= paste(path,"subject_train.txt", sep="/"), header=FALSE)
xTrain <- read.table(file= paste(path,"X_train.txt", sep="/"), header=FALSE)
yTrain <- read.table(file= paste(path,"y_train.txt", sep="/"), header=FALSE)
# Compute the row-wise mean and standard deviation across all columns of xTrain,
# and collect them in a new data frame
meanTrain <- rowMeans(xTrain, na.rm = TRUE)
stdTrain <- apply(xTrain, 1, sd, na.rm = TRUE)  # na.rm matches rowMeans() above
calcsTrain <- data.frame(meanTrain, stdTrain)
# Concatenate the files
concatTrain <- cbind(subjectTrain,yTrain,calcsTrain)
# Rename columns to apply rbind() later and give descriptive names
colnames(concatTrain) <- c("Subject", "Activity", "Mean","Std")
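The test and train steps above are identical apart from the input files, so they could be folded into one function. The following is a minimal sketch under that assumption; the name `summarise_split` and the toy inputs are hypothetical and stand in for the data frames returned by `read.table()`.

```r
# Hypothetical helper: build the Subject/Activity/Mean/Std frame for one split.
# Each argument is a data frame as returned by read.table(header = FALSE).
summarise_split <- function(subject, activity, measurements) {
  data.frame(
    Subject  = subject[[1]],
    Activity = activity[[1]],
    Mean     = rowMeans(measurements, na.rm = TRUE),
    Std      = apply(measurements, 1, sd, na.rm = TRUE)
  )
}

# Toy stand-ins for subject_*.txt, y_*.txt and X_*.txt (made-up values)
toySubject  <- data.frame(V1 = c(1, 2))
toyActivity <- data.frame(V1 = c(5, 3))
toyX        <- data.frame(V1 = c(1, 2), V2 = c(2, 4), V3 = c(3, 6))

toyResult <- summarise_split(toySubject, toyActivity, toyX)
print(toyResult)
```

With a helper like this, the column names are assigned once inside the function, so both splits are guaranteed to line up when they are later stacked with `rbind()`.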
To achieve directive #1:
## Concatenate train and test data into single dataset
data <- rbind(concatTest,concatTrain)
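A quick sanity check after stacking is to confirm that no rows were lost and that the column names survived. The sketch below illustrates this on made-up data frames (`toyTest` and `toyTrain` are hypothetical stand-ins for `concatTest` and `concatTrain`).

```r
# Toy stand-ins for concatTest and concatTrain (made-up values)
toyTest <- data.frame(Subject = 2, Activity = 5, Mean = 0.1, Std = 0.5)
toyTrain <- data.frame(
  Subject  = c(1, 3),
  Activity = c(4, 6),
  Mean     = c(0.2, 0.3),
  Std      = c(0.4, 0.6)
)

# rbind() on data frames matches columns by name, which is why both
# frames were given identical column names beforehand
toyData <- rbind(toyTest, toyTrain)

# The merged frame should hold every row from both inputs
stopifnot(nrow(toyData) == nrow(toyTest) + nrow(toyTrain))
stopifnot(identical(colnames(toyData), colnames(toyTest)))
```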
To achieve directive #3:
## Replace activity code by its description from activity_labels.txt
maindir <- getwd()
path <- paste(maindir, "UCI_HAR_Dataset", sep="/")
# Read file
activityList <- read.table(file= paste(path,"activity_labels.txt", sep="/"), header=FALSE)
# Match activity id with description
data$Activity <- activityList$V2[match(data$Activity, activityList$V1)]
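To illustrate how the `match()` lookup works, here is a small sketch on a toy lookup table. The table has the same two-column shape as `activityList`, but its contents and the `toy*` names are illustrative only.

```r
# Toy lookup table in the same V1 (code) / V2 (label) shape as activityList
toyLabels <- data.frame(
  V1 = 1:3,
  V2 = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS"),
  stringsAsFactors = FALSE
)

# Activity codes as they would appear in the merged data
toyCodes <- c(3, 1, 1, 2)

# match() returns, for each code, the row of toyLabels holding that code;
# indexing V2 with those positions yields the corresponding labels
toyNames <- toyLabels$V2[match(toyCodes, toyLabels$V1)]
print(toyNames)
# "WALKING_DOWNSTAIRS" "WALKING" "WALKING" "WALKING_UPSTAIRS"
```

A code with no entry in the lookup table would come back as `NA`, so checking `anyNA()` on the result is a cheap way to catch unmatched codes.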
To achieve directive #5:
## Generate the tidy dataset: group by Subject and Activity, outputting the average of Mean and Std
# Using package data.table
library(data.table)
dataDT <- data.table(data)
dataTidy <- dataDT[, list(Mean=mean(Mean), Std=mean(Std)), by=list(Subject, Activity)]
# write to file
write.table(x=dataTidy,file=paste(maindir, "tidyDataset.txt", sep="/"), sep= "\t", col.names= TRUE, quote=FALSE, row.names=FALSE)
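For readers without data.table installed, the same grouping can be sketched in base R with `aggregate()`. This is an alternative to the code above rather than what the project used, and the toy data frame is made up for illustration.

```r
# Toy stand-in for the merged, labelled data (made-up values)
toyData <- data.frame(
  Subject  = c(1, 1, 1, 2),
  Activity = c("WALKING", "WALKING", "SITTING", "WALKING"),
  Mean     = c(0.2, 0.4, 0.1, 0.5),
  Std      = c(0.6, 0.8, 0.3, 0.7)
)

# Average Mean and Std within each Subject/Activity combination
toyTidy <- aggregate(cbind(Mean, Std) ~ Subject + Activity,
                     data = toyData, FUN = mean)
print(toyTidy)
```

The result has one row per Subject/Activity pair that occurs in the data (three rows for this toy input), which is the same shape as `dataTidy`.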