zguy23/GCD_Project

Getting and Cleaning Data class project

This README is for the Getting and Cleaning Data class project at Johns Hopkins University. The data used for this course can be downloaded at the following website.
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

This repo contains the following three files.

  • README.md
  • run_analysis.R
  • CodeBook.md

This README file contains links to the data used and the specific steps taken to process the raw data from the above-mentioned website. The run_analysis.R script contains the code that turns the downloaded raw dataset into the tidy data set. The CodeBook.md file describes the variables and how I chose to summarize the data.

Project Objective

  1. Merges the training and the test sets to create one data set.
  2. Extracts only the measurements on the mean and standard deviation for each measurement.
  3. Uses descriptive activity names to name the activities in the data set.
  4. Appropriately labels the data set with descriptive variable names.
  5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

run_analysis.R Script

Notes prior to execution

  1. Download the data from the URL above and unzip it in the directory of choice.
  2. Download the run_analysis.R script into the directory above. This directory should contain the "UCI HAR Dataset" directory along with the test and train subdirectories.
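Before running the script, it can help to confirm that the unzipped data is where the script expects it. The following is a minimal sketch, assuming the working directory is the one that holds the "UCI HAR Dataset" folder; the helper name missing_har_files() is hypothetical and is not part of run_analysis.R.

```r
# Assumption: the working directory holds the unzipped "UCI HAR Dataset" folder.
data_dir <- "UCI HAR Dataset"

# Input files used by the analysis, relative to data_dir.
required_files <- c(
  "features.txt",
  "activity_labels.txt",
  file.path("train", c("subject_train.txt", "X_train.txt", "y_train.txt")),
  file.path("test", c("subject_test.txt", "X_test.txt", "y_test.txt"))
)

# Hypothetical helper: returns the required files that are not found under dir.
missing_har_files <- function(dir = data_dir) {
  paths <- file.path(dir, required_files)
  required_files[!file.exists(paths)]
}

missing <- missing_har_files()
if (length(missing) > 0) {
  message("Missing files: ", paste(missing, collapse = ", "))
}
```

If the message lists any files, check that the archive was unzipped in the same directory as the script.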

Steps Taken to Process Raw Data

  1. Using read.table, read in subject_train.txt, X_train.txt, y_train.txt, subject_test.txt, X_test.txt, y_test.txt, features.txt and activity_labels.txt files and store them accordingly.
  2. Extract feature names from the features.txt file by subsetting. This file contains the names of the variables recorded by the experiment.
  3. Update the default column names in x_train and x_test data frames using the names() function with the data from step 2.
  4. Update the default column name in the sub_train, sub_test, y_train and y_test data frames using the names() function again. Each of these data frames contains a single column, holding either Subject or Activity information.
  5. Replace the activity number with its character representation in y_train and y_test by looking up the meaning in the activity_labels.txt file. This was done using gsub().
  6. Merge the x_test, y_test and sub_test data frames together using cbind().
  7. Repeat step 6 with x_train, y_train and sub_train data frames.
  8. Merge x_train and x_test data frames with rbind(). Store this into a new data frame called x_train_test.
  9. Using grep, find the variable names that contain mean() or std(). The features_info.txt file states that variables with "()" after mean or std hold the mean and standard deviation respectively. There are 33 variables each for mean and standard deviation. Store the results in mean_cols and std_cols.
  10. Subset the data frame created in step 8, x_train_test, keeping only the variables required for the project: Subject, Activity, and the mean and standard deviation variables found in the prior step. This subsetted data frame is stored in tidy_data.
  11. Using sub() and gsub(), clean up the variable names, e.g. remove duplicate words and characters such as ".", "(", ")" and ",".
  12. Sort the tidy_data data frame using arrange() by Subject and Activity variables.
  13. Since we are asked to provide averages of the variables, I grouped the data with the group_by() function. The groups selected were Subject and Activity.
  14. Finally, aggregate or summarize data using the summarize_each function on the data frame created in step 13.
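  15. The tidy data set is now complete per project requirements. Before summarizing, the tidy_data data frame contains 10299 rows and 68 columns; the summarized result has one row per Subject and Activity combination (30 subjects × 6 activities = 180 rows) and the same 68 columns.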
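The steps above can be sketched on a toy example. This is only an illustration under stated assumptions, not the contents of run_analysis.R: the small data frames stand in for the files read in step 1 and hold made-up values, match() is used for the activity lookup where the script uses gsub(), and base R order() and aggregate() replace the dplyr arrange(), group_by() and summarize_each() calls so the sketch runs without extra packages.

```r
# Toy stand-ins for the files read in step 1 (made-up values, not real data).
features <- data.frame(
  id = 1:4,
  name = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
           "tBodyAcc-meanFreq()-X", "angle(X,gravityMean)"),
  stringsAsFactors = FALSE
)
activity_labels <- data.frame(id = 1:2, label = c("WALKING", "SITTING"),
                              stringsAsFactors = FALSE)
x_train   <- data.frame(V1 = c(0.1, 0.3), V2 = c(1, 2), V3 = c(5, 6), V4 = c(7, 8))
y_train   <- data.frame(V1 = c(1, 1))
sub_train <- data.frame(V1 = c(1, 1))
x_test    <- data.frame(V1 = c(0.5, 0.7), V2 = c(3, 4), V3 = c(5, 6), V4 = c(7, 8))
y_test    <- data.frame(V1 = c(2, 1))
sub_test  <- data.frame(V1 = c(2, 2))

# Steps 2-4: replace the default column names.
names(x_train) <- features$name
names(x_test)  <- features$name
names(sub_train) <- "Subject"
names(sub_test)  <- "Subject"
names(y_train) <- "Activity"
names(y_test)  <- "Activity"

# Step 5: swap activity numbers for labels (match() lookup instead of gsub()).
y_train$Activity <- activity_labels$label[match(y_train$Activity, activity_labels$id)]
y_test$Activity  <- activity_labels$label[match(y_test$Activity, activity_labels$id)]

# Steps 6-8: bind columns within each set, then stack the two sets.
train <- cbind(sub_train, y_train, x_train)
test  <- cbind(sub_test, y_test, x_test)
x_train_test <- rbind(train, test)

# Step 9: fixed = TRUE matches the literal "mean()" / "std()",
# so names such as "meanFreq()" are not selected.
mean_cols <- grep("mean()", names(x_train_test), fixed = TRUE)
std_cols  <- grep("std()", names(x_train_test), fixed = TRUE)

# Step 10: keep Subject, Activity and the selected measurements.
tidy_data <- x_train_test[, c(1, 2, mean_cols, std_cols)]

# Step 11: strip "-", "(", ")", "," and "." from the variable names.
names(tidy_data) <- gsub("[-(),.]", "", names(tidy_data))

# Step 12: sort by Subject, then Activity.
tidy_data <- tidy_data[order(tidy_data$Subject, tidy_data$Activity), ]

# Steps 13-14: average every variable for each Subject/Activity pair.
averages <- aggregate(. ~ Subject + Activity, data = tidy_data, FUN = mean)
print(averages)
```

On the real data the same shape of result is expected: one row per Subject and Activity pair, with one averaged column per selected measurement.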

CodeBook.md

This file contains the variable names and a description of each.
