Data Science Fundamentals: R |
[Table of Contents](../index.ipynb)
- - - 
<!--NAVIGATION-->
Module 14. [Introduction](./00.ipynb) | [Basic Syntax](./01.ipynb)  | [Data Types](./02.ipynb) | [Variables](./03.ipynb) | [Operators](./04.ipynb) | [Decision Making](./05.ipynb)  | [Functions](./06.ipynb) | [Strings](./07.ipynb) | [Vectors](./08.ipynb) | [Lists](./09.ipynb) | [Matrices](./10.ipynb) | [Arrays](./11.ipynb) | [Factors](./12.ipynb) | [Data Frames](./13.ipynb) | [Data Reshaping](./14.ipynb) | [Exercises](./15.ipynb)

Data Reshaping
---


Data reshaping in R means changing how data is organized into rows and columns. Most data processing in R takes a data frame as input. Extracting data from the rows and columns of a data frame is easy, but sometimes we need the data frame in a format different from the one in which we received it. R has many functions to split, merge, and transpose rows and columns in a data frame.

<b>Joining Columns and Rows in a Data Frame</b>

We can join multiple vectors column-wise using the cbind() function; given plain vectors, it returns a matrix rather than a data frame. We can also append the rows of one data frame to another using the rbind() function.

In [1]:
# Create vector objects.

city <- c("Tampa","Seattle","Hartford","Denver")
state <- c("FL","WA","CT","CO")
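# Note: numeric zip codes lose their leading zero (06161 prints as 6161);
# store zip codes as strings when the zero matters.
zipcode <- c(33602,98104,06161,80294)

# Combine the three vectors column-wise. cbind() on plain vectors
# returns a character matrix rather than a data frame.
addresses <- cbind(city,state,zipcode)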

# Print a header.
cat("# # # # The First data frame\n") 

# Print the data frame.
print(addresses)

# Create another data frame with similar columns

new.address <- data.frame(
   city = c("Lowry","Charlotte"),
   state = c("CO","FL"),
   zipcode = c("80230","33949"),
   stringsAsFactors = FALSE
)

# Print a header.
cat("# # # The Second data frame\n") 

# Print the data frame.
print(new.address)

# Combine rows from both data frames.
all.addresses <- rbind(addresses,new.address)

# Print a header.
cat("# # # The combined data frame\n") 

# Print the result.
print(all.addresses)

# # # # The First data frame
     city       state zipcode
[1,] "Tampa"    "FL"  "33602"
[2,] "Seattle"  "WA"  "98104"
[3,] "Hartford" "CT"  "6161" 
[4,] "Denver"   "CO"  "80294"
# # # The Second data frame
       city state zipcode
1     Lowry    CO   80230
2 Charlotte    FL   33949
# # # The combined data frame
       city state zipcode
1     Tampa    FL   33602
2   Seattle    WA   98104
3  Hartford    CT    6161
4    Denver    CO   80294
5     Lowry    CO   80230
6 Charlotte    FL   33949
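
The combined output above shows Hartford's zip code as 6161 because the zip codes were entered as numbers. The following is a minimal sketch of one way to keep the leading zero, assuming the zip codes are stored as character strings and the first table is built with data.frame() instead of cbind():

```r
# Zip codes as strings so that "06161" keeps its leading zero.
city <- c("Tampa", "Seattle", "Hartford", "Denver")
state <- c("FL", "WA", "CT", "CO")
zipcode <- c("33602", "98104", "06161", "80294")

# data.frame() builds a data frame directly (cbind() would give a matrix).
addresses <- data.frame(city, state, zipcode, stringsAsFactors = FALSE)

new.address <- data.frame(
  city = c("Lowry", "Charlotte"),
  state = c("CO", "FL"),
  zipcode = c("80230", "33949"),
  stringsAsFactors = FALSE
)

# Append the rows of the second data frame to the first.
all.addresses <- rbind(addresses, new.address)
print(all.addresses)
```

Both inputs to rbind() are now data frames with identical column names and types, so no implicit conversion is involved.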


<b>Merging Data Frames</b>

We can merge two data frames by using the merge() function. By default the merge happens on the columns that share a name in both data frames; the by.x and by.y arguments name the key columns explicitly.

In the example below, we use the Pima Indian women diabetes data sets (Pima.te and Pima.tr) from the "MASS" library. We merge the two data sets on blood pressure ("bp") and body mass index ("bmi"). Records whose values for both variables match in the two data sets are combined into a single data frame, and the remaining columns that appear in both inputs receive .x and .y suffixes.

In [2]:
library(MASS)

merged.Pima <- merge(x = Pima.te, y = Pima.tr,
   by.x = c("bp", "bmi"),
   by.y = c("bp", "bmi")
)

print(merged.Pima)
nrow(merged.Pima)


"package 'MASS' was built under R version 3.6.3"

   bp  bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y ped.y
1  60 33.8       1   117     23 0.466    27     No       2   125     20 0.088
2  64 29.7       2    75     24 0.370    33     No       2   100     23 0.368
3  64 31.2       5   189     33 0.583    29    Yes       3   158     13 0.295
4  64 33.2       4   117     27 0.230    24     No       1    96     27 0.289
5  66 38.1       3   115     39 0.150    28     No       1   114     36 0.289
6  68 38.5       2   100     25 0.324    26     No       7   129     49 0.439
7  70 27.4       1   116     28 0.204    21     No       0   124     20 0.254
8  70 33.1       4    91     32 0.446    22     No       9   123     44 0.374
9  70 35.4       9   124     33 0.282    34     No       6   134     23 0.542
10 72 25.6       1   157     21 0.123    24     No       4    99     17 0.294
11 72 37.7       5    95     33 0.370    27     No       6   103     32 0.324
12 74 25.9       9   134     33 0.460    81     No       8   126
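
The matching rule is easier to see on a few rows. The sketch below uses two small made-up data frames (left and right are hypothetical names, and the values are for illustration only) and compares the default inner merge with an outer merge requested through all = TRUE:

```r
# Two tiny illustrative tables sharing the key columns bp and bmi.
left <- data.frame(
  bp = c(60, 64, 70),
  bmi = c(33.8, 29.7, 27.4),
  glu = c(117, 75, 116)
)
right <- data.frame(
  bp = c(60, 64, 72),
  bmi = c(33.8, 31.2, 25.6),
  glu = c(125, 100, 99)
)

# Inner merge: keeps only rows whose (bp, bmi) pair occurs in both tables.
inner <- merge(left, right, by = c("bp", "bmi"))
print(inner)

# Outer merge: keeps every (bp, bmi) pair and fills the gaps with NA.
outer <- merge(left, right, by = c("bp", "bmi"), all = TRUE)
print(outer)
```

Only the pair bp = 60, bmi = 33.8 appears in both tables, so the inner merge has a single row, while the outer merge keeps all five distinct pairs.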

<b>Melting and Casting</b>

Reshaping data often takes several steps. The reshape package provides two functions for this: melt(), which converts the data to a long form, and cast(), which aggregates the molten data into the desired shape.

We consider the dataset called ships present in the library called "MASS".

In [2]:
library(MASS)
print(ships)

   type year period service incidents
1     A   60     60     127         0
2     A   60     75      63         0
3     A   65     60    1095         3
4     A   65     75    1095         4
5     A   70     60    1512         6
6     A   70     75    3353        18
7     A   75     60       0         0
8     A   75     75    2244        11
9     B   60     60   44882        39
10    B   60     75   17176        29
11    B   65     60   28609        58
12    B   65     75   20370        53
13    B   70     60    7064        12
14    B   70     75   13099        44
15    B   75     60       0         0
16    B   75     75    7117        18
17    C   60     60    1179         1
18    C   60     75     552         1
19    C   65     60     781         0
20    C   65     75     676         1
21    C   70     60     783         6
22    C   70     75    1948         2
23    C   75     60       0         0
24    C   75     75     274         1
25    D   60     60     251         0
26    D   60

### [Reshape](http://had.co.nz/reshape.1.html) Package

<b>Melt the Data</b>

Now we melt the data, converting every column other than the id variables into rows. For the ships data the id variables would be type and year; the small example below uses id and time. After melting, each row holds one unique id-variable combination and its value.

In [3]:
install.packages("reshape")




In [5]:
# Create vector objects.

id <- c(1,1,2,2)
time <- c(1,2,1,2)
x1 <- c(5,3,6,2)
x2 <- c(6,5,1,4)

# Combine the four vectors column-wise (cbind() on plain vectors returns a matrix).
mydata <- cbind(id,time,x1,x2)

In [6]:
mydata

id,time,x1,x2
1,1,5,6
1,2,3,5
2,1,6,1
2,2,2,4


In [7]:
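# example of melt function
# Note: mydata is a matrix (built with cbind), so melt() reports the row
# index (X1), the column name (X2) and the cell value; the id and time
# columns are melted like any other column.
library(reshape)
mdata <- melt(mydata, id=c("id","time"))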

"package 'reshape' was built under R version 3.6.3"

In [8]:
mdata

X1,X2,value
1,id,1
2,id,1
3,id,2
4,id,2
1,time,1
2,time,2
3,time,1
4,time,2
1,x1,5
2,x1,3
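
The output above comes from melting a matrix. The following is a sketch of the same step on a data frame, assuming the reshape package is installed; mydata.df and mdata.df are hypothetical names chosen so the objects above are not overwritten:

```r
library(reshape)

# Same values as mydata, but stored as a data frame.
mydata.df <- data.frame(
  id = c(1, 1, 2, 2),
  time = c(1, 2, 1, 2),
  x1 = c(5, 3, 6, 2),
  x2 = c(6, 5, 1, 4)
)

# id and time are kept as identifier columns; x1 and x2 are stacked
# into a "variable" column with their values in "value".
mdata.df <- melt(mydata.df, id = c("id", "time"))
print(mdata.df)
```

With a data frame as input, each of the eight rows is one id, time and variable combination, which is the layout that cast() expects.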


![caption](files/r_package.png)

<b>Cast the Melted Data</b>

Once the data is molten, we can cast it into any shape we like using the cast() function, aggregating the values for each combination of id variables. For the ships data this would give an aggregate for each type of ship in each year; the example here computes means per id and per time.

In [9]:
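# cast the melted data
# cast(data, formula, function); the formula has the form rows ~ columns,
# so a bare `id` or `time` is rejected with a "variables not found" error.
# mydata is a matrix, so convert it to a data frame before melting.
mdata <- melt(as.data.frame(mydata), id=c("id","time"))
subjmeans <- cast(mdata, id ~ variable, mean)
timemeans <- cast(mdata, time ~ variable, mean)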
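
The means that casting by id or by time is meant to produce can be cross-checked without the reshape package. The following is a minimal sketch using base R's aggregate() on the wide data; it is a swapped-in technique rather than the cast() call itself, and wide, subj.check and time.check are hypothetical names:

```r
# The same four observations in wide form, as a data frame.
wide <- data.frame(
  id = c(1, 1, 2, 2),
  time = c(1, 2, 1, 2),
  x1 = c(5, 3, 6, 2),
  x2 = c(6, 5, 1, 4)
)

# Mean of x1 and x2 for each id (averaging over time).
subj.check <- aggregate(cbind(x1, x2) ~ id, data = wide, FUN = mean)
print(subj.check)

# Mean of x1 and x2 for each time (averaging over id).
time.check <- aggregate(cbind(x1, x2) ~ time, data = wide, FUN = mean)
print(time.check)
```

For id 1 the means are 4 for x1 and 5.5 for x2; for id 2 they are 4 and 2.5. A cast of the molten data with a formula such as id ~ variable should agree with these values.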


In [None]:
# Create vector objects: the molten data written out by hand.

id <- c(1,1,2,2,1,1,2,2)
time <- c(1,2,1,2,1,2,1,2)
variable <- c("x1","x1","x1","x1","x2","x2","x2","x2")
value <- c(5,3,6,2,6,5,1,4)

# Combine the four vectors into one data frame.
newdata <- data.frame(id,time,variable,value)

In [None]:
newdata
