Course Project submission repository for the Getting and Cleaning Data course.
R script for the Course Project of "Getting and Cleaning Data" by Sungwook Moon
Merges the training and the test sets to create one data set.
- Download project data
temp <- tempfile()
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
download.file(fileUrl, temp, method="curl")
- Read the training data files under the train folder and bind the subject ID, activity label, and training measurements
subject_train <- read.table(unz(temp, "UCI HAR Dataset/train/subject_train.txt"), stringsAsFactors=FALSE)
ylabel_train <- read.table(unz(temp, "UCI HAR Dataset/train/y_train.txt"))
ds_train <- read.table(unz(temp, "UCI HAR Dataset/train/X_train.txt")) # read train dataset, 7,352 obs.
ds_train <- cbind(subject_train, ylabel_train, ds_train)
- Read the test data files under the test folder and bind the subject ID, activity label, and test measurements
subject_test <- read.table(unz(temp, "UCI HAR Dataset/test/subject_test.txt"), stringsAsFactors=FALSE) # read from the downloaded archive, as for the training set
ylabel_test <- read.table(unz(temp, "UCI HAR Dataset/test/y_test.txt"))
ds_test <- read.table(unz(temp, "UCI HAR Dataset/test/X_test.txt"))
ds_test <- cbind(subject_test, ylabel_test, ds_test)
- Merge both datasets and set column names
ds_merged <- rbind(ds_train, ds_test)
hdr <- read.table(unz(temp, "UCI HAR Dataset/features.txt"), stringsAsFactors=FALSE) # read column header
names(ds_merged) <- c("subjectID", "activity", hdr$V2)
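The column-bind then row-bind pattern of this step can be tried without downloading the archive. The following is a minimal sketch on made-up values: the `bind_split` helper, the numbers, and the two-feature header are assumptions for illustration and are not part of the original script. Unlike the script, the sketch names the columns before the row-bind, so `rbind` sees identical, unique column names in both pieces.

```r
# Hypothetical two-feature header standing in for features.txt.
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X")

# Illustrative helper (not in the original script): bind subject IDs,
# activity codes and measurements column-wise, then name the columns.
bind_split <- function(subject, ylabel, x, features) {
  ds <- cbind(subject, ylabel, x)
  names(ds) <- c("subjectID", "activity", features)
  ds
}

# read.table(text = ...) stands in for reading a file from the zip archive.
ds_train <- bind_split(
  read.table(text = "1\n3"),
  read.table(text = "5\n4"),
  read.table(text = "0.28 -0.02\n0.27 -0.01"),
  features
)
ds_test <- bind_split(
  read.table(text = "2"),
  read.table(text = "1"),
  read.table(text = "0.25 -0.03"),
  features
)

# Stack the two splits; training rows come first, then test rows.
ds_merged <- rbind(ds_train, ds_test)
print(dim(ds_merged))
```

Each `cbind` assumes the three inputs have the same number of rows in the same order, which is how the subject, label, and measurement files of one split line up.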
Extracts only the measurements on the mean and standard deviation for each measurement.
- Make column names unique before select
names(ds_merged) <- make.names(names=names(ds_merged), unique=TRUE)
- Select columns whose names contain "mean" or "std", excluding "meanFreq"
library(dplyr) # provides select() and contains()
ds_extracted <- select(ds_merged, subjectID, activity, contains(".mean"), contains(".std"), -contains(".meanFreq"))
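To see what `make.names` does to the feature names and which columns survive the selection, here is a small base R sketch. It swaps `select()`/`contains()` for `grepl(..., fixed = TRUE)`, so it is an approximation of the step rather than the script's own code; the zero-filled data frame is a made-up stand-in. One known difference: `contains()` ignores case by default, while the fixed `grepl` here is case-sensitive, which gives the same result on these names.

```r
# A handful of feature names in the style of features.txt.
raw_names <- c("subjectID", "activity",
               "tBodyAcc-mean()-X", "tBodyAcc-std()-X",
               "fBodyAcc-meanFreq()-X", "angle(tBodyAccMean,gravity)")

# Made-up stand-in data: two rows of zeros, one column per name.
ds <- as.data.frame(matrix(0, nrow = 2, ncol = length(raw_names)))

# make.names() replaces characters that are invalid in R names with ".".
names(ds) <- make.names(raw_names, unique = TRUE)

nms <- names(ds)
is_id   <- nms %in% c("subjectID", "activity")
is_mean <- grepl(".mean", nms, fixed = TRUE)      # literal ".mean"
is_std  <- grepl(".std", nms, fixed = TRUE)       # literal ".std"
is_freq <- grepl(".meanFreq", nms, fixed = TRUE)  # to be excluded

# Keep ID columns plus mean/std measurements, dropping meanFreq.
ds_extracted <- ds[, is_id | ((is_mean | is_std) & !is_freq), drop = FALSE]
print(names(ds_extracted))
```

The logical index keeps columns in their original order, whereas `select()` places them in the order the selectors are written (all means first, then all standard deviations).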
Uses descriptive activity names to name the activities in the data set
- Read activity labels from file
activity_labels <- read.table(unz(temp, "UCI HAR Dataset/activity_labels.txt"), stringsAsFactors=FALSE)
- Change activity levels to factor type with descriptive labels
ds_extracted$activity <- as.factor(ds_extracted$activity)
levels(ds_extracted$activity) <- activity_labels$V2
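Assigning `levels()` works because the sorted activity codes line up with the rows of the label file. A possible alternative, shown here as a sketch and not as the script's method, is to pass the code-to-label mapping to `factor()` explicitly, which does not rely on every code being present in the data. The small `activity_labels` data frame below is a stand-in that mirrors the two-column layout of activity_labels.txt; the `codes` vector is made up.

```r
# Stand-in for activity_labels.txt: V1 = numeric code, V2 = label.
activity_labels <- data.frame(
  V1 = 1:6,
  V2 = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
         "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Made-up activity codes; note that codes 2, 3 and 4 never occur.
codes <- c(5, 1, 6, 1)

# Map each code to its label explicitly via levels/labels.
activity <- factor(codes, levels = activity_labels$V1,
                   labels = activity_labels$V2)
print(activity)
```

The resulting factor keeps all six levels even though only three of them appear in the data.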
Appropriately labels the data set with descriptive variable names.
- Using the gsub function, change the column names of the dataset
names(ds_extracted) <- gsub("\\.mean","Mean", names(ds_extracted)) # escape the dot so it matches a literal "."
names(ds_extracted) <- gsub("\\.std","Std", names(ds_extracted))
names(ds_extracted) <- gsub("\\.","", names(ds_extracted))
names(ds_extracted) <- gsub("BodyBody","Body", names(ds_extracted))
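Because `gsub` treats its pattern as a regular expression, an unescaped `.` matches any character, not only a literal dot. The sketch below wraps the four renaming steps in a hypothetical helper, `tidy_names` (a name introduced here for illustration only), and contrasts the unescaped pattern with the escaped and `fixed = TRUE` forms on a made-up string.

```r
# Illustrative helper (not in the original script) applying the four
# substitutions with the dot escaped so it only matches a literal ".".
tidy_names <- function(x) {
  x <- gsub("\\.mean", "Mean", x)  # ".mean" -> "Mean"
  x <- gsub("\\.std", "Std", x)    # ".std"  -> "Std"
  x <- gsub("\\.", "", x)          # drop the remaining dots
  gsub("BodyBody", "Body", x)      # collapse the duplicated "Body"
}

cleaned <- tidy_names(c("tBodyAcc.mean...X", "fBodyBodyGyroMag.std.."))
print(cleaned)

# An unescaped dot also consumes a non-dot character before "mean".
unescaped <- gsub(".mean", "Mean", "xmean")
# With fixed = TRUE the pattern is a plain string, so nothing matches here.
literal <- gsub(".mean", "Mean", "xmean", fixed = TRUE)
print(c(unescaped, literal))
```

On the feature names in this data set, lowercase "mean" and "std" always follow a dot, so the escaped and unescaped patterns give the same result; the escaped form just states the intent precisely.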
From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.
- Create tidy dataset
library(plyr) # provides ddply() and colwise()
ds_tidy <- ddply(ds_extracted, .(subjectID, activity), colwise(mean))
- Write dataset to a txt file
write.table(ds_tidy, file="./tidy_dataset.txt", row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
- Delete the tempfile and remove the temporary variables
unlink(temp)
rm(temp, fileUrl)
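The group-wise averaging and the file output can be checked on a tiny example. The sketch below uses base R `aggregate()` in place of `ddply()` with `colwise(mean)`, so it is a stand-in technique rather than the script's own call; the data values and the temporary output path are made up.

```r
# Made-up extracted data: two subjects, two activities, two measurements.
ds_extracted <- data.frame(
  subjectID     = c(1, 1, 2, 2),
  activity      = factor(c("WALKING", "WALKING", "WALKING", "SITTING")),
  tBodyAccMeanX = c(0.2, 0.4, 0.1, 0.3),
  tBodyAccStdX  = c(-0.9, -0.7, -0.8, -0.6)
)

# Average every remaining column for each subject/activity pair.
ds_tidy <- aggregate(. ~ subjectID + activity, data = ds_extracted, FUN = mean)
print(ds_tidy)

# Write with the same options as the script, to a temporary path.
out <- tempfile(fileext = ".txt")
write.table(ds_tidy, file = out, row.names = FALSE, col.names = TRUE,
            sep = "\t", quote = FALSE)

# Read the file back to confirm that the header and shape survive.
round_trip <- read.table(out, header = TRUE, sep = "\t")
unlink(out)  # delete the temporary file
```

Only subject/activity combinations that actually occur in the data produce a row, so this toy input yields three rows.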
That is it.