# Wrangling

## Sample Data

In [4]:
set.seed(1234)
my.df = data.frame(
  com  = paste('C', sample(1:3,  100, replace = TRUE), sep = ''),   # company code
  dept = paste('D', sample(1:7,  100, replace = TRUE), sep = ''),   # department code
  team = paste('T', sample(1:15, 100, replace = TRUE), sep = ''),   # team code
  x1 = rnorm(100, mean = 50, sd = 5),
  x2 = rnorm(100, mean = 20, sd = 3),
  x3 = rnorm(100, mean =  5, sd = 1),
  stringsAsFactors = FALSE
)
head(my.df)

com,dept,team,x1,x2,x3
C1,D1,T10,48.11381,21.31079,2.683964
C2,D4,T8,50.4881,23.18037,5.562472
C2,D2,T5,58.19372,21.35657,4.216225
C2,D2,T12,45.62204,21.9896,4.773946
C3,D1,T8,50.6088,16.59088,3.412897
C2,D3,T11,56.81065,18.88851,5.547524


## stringr
### Why use stringr?
- Consistent syntax: all functions start with `str_`
- Works with the `%>%` pipe, because the string is always the first argument
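
To illustrate the pipe compatibility, here is a minimal sketch that chains three stringr calls. It assumes the magrittr package is installed (it is a dependency of stringr) and reuses the `words` vector defined in the next cell; the variable name `result` is only for illustration.

```r
library(stringr)
library(magrittr)   # provides %>%; loaded explicitly to be safe

words = c("why", "video", "cross", "extra", "deal", "authority")

# Each step receives the previous result as its first argument
result = words %>%
  str_sub(1, 3) %>%          # keep the first three characters
  str_to_upper() %>%         # convert to upper case
  str_c(collapse = '-')      # fold into one string

print(result)   # "WHY-VID-CRO-EXT-DEA-AUT"
```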

In [2]:
library(stringr)
words = c("why", "video", "cross", "extra", "deal", "authority")
fruits = c('banana','apple','banana','durian','apple')

### Number of Chars

In [10]:
print( str_length( words ) )   # stringr
print( nchar(words)  )         # base R equivalent

[1] 3 5 5 5 4 9
[1] 3 5 5 5 4 9


### Combining
This operation takes a single vector and joins all of its elements into one string, separated by the value of `collapse`.

In [16]:
print( str_c(fruits, collapse = " , ") )

[1] "banana , apple , banana , durian , apple"
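
The `sep` and `collapse` arguments are easy to confuse. A minimal sketch of the difference, reusing the `fruits` vector; the numeric suffixes are made up purely for illustration.

```r
library(stringr)

fruits = c('banana','apple','banana','durian','apple')

# sep joins several vectors element by element; the result keeps the input length
labelled = str_c(fruits, 1:5, sep = '_')
print( labelled )   # "banana_1" "apple_2" "banana_3" "durian_4" "apple_5"

# collapse then folds that vector into a single string
combined = str_c(labelled, collapse = ', ')
print( combined )
```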


### Subset

In [36]:
str_sub(fruits, 1, 3)  # substring position 1 to 3
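
`str_sub` also accepts negative positions, which count from the end of the string. A short sketch on the same `fruits` vector (the names `first3` and `last2` are illustrative only):

```r
library(stringr)

fruits = c('banana','apple','banana','durian','apple')

first3 = str_sub(fruits, 1, 3)     # characters 1 to 3
last2  = str_sub(fruits, -2, -1)   # last two characters

print( first3 )   # "ban" "app" "ban" "dur" "app"
print( last2 )    # "na" "le" "na" "an" "le"
```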

### Replace

In [38]:
str_replace(words, "[aeiou]", "?")
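
Note that `str_replace` changes only the first match in each string; `str_replace_all` changes every match. A minimal sketch comparing the two on the `words` vector:

```r
library(stringr)

words = c("why", "video", "cross", "extra", "deal", "authority")

first_only = str_replace(words, "[aeiou]", "?")       # first vowel only
every_one  = str_replace_all(words, "[aeiou]", "?")   # all vowels

print( first_only )   # "why" "v?deo" "cr?ss" "?xtra" "d?al" "?uthority"
print( every_one )    # "why" "v?d??" "cr?ss" "?xtr?" "d??l" "??th?r?ty"
```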

### String Split
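**stringr::str_split** returns a result similar to **base::strsplit**: a list with one character vector per input string. The two outputs below differ only because the patterns differ (`', '` versus `','`).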

In [18]:
x = c('banana, apple  , banana , durian  , apple',
     'avocado, water melon, pineapple')
print( str_split(x, ', ')  )

[[1]]
[1] "banana"   "apple  "  "banana "  "durian  " "apple"   

[[2]]
[1] "avocado"     "water melon" "pineapple"  



In [19]:
print( strsplit(x,',')  )

[[1]]
[1] "banana"    " apple  "  " banana "  " durian  " " apple"   

[[2]]
[1] "avocado"      " water melon" " pineapple"  
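
Both results above keep stray whitespace around some pieces. Since the pattern is treated as a regular expression, one possible way to drop it is to let the pattern absorb the whitespace around each comma. This is a sketch under that assumption, not the only approach (`str_trim` on each piece would also work):

```r
library(stringr)

x = c('banana, apple  , banana , durian  , apple',
      'avocado, water melon, pineapple')

# \\s* matches any run of whitespace before and after the comma
pieces = str_split(x, '\\s*,\\s*')
print( pieces )
```

The space inside "water melon" is preserved because it is not adjacent to a comma.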



### Pattern Match

In [43]:
print( words )
print( str_detect(words, "[aeiou]")  )

[1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
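
The logical vector returned by `str_detect` is typically used to filter the input. A minimal sketch on the `words` vector, showing `str_subset` as a shortcut for keeping the matching elements (the names `with_vowel` and `no_vowel` are illustrative):

```r
library(stringr)

words = c("why", "video", "cross", "extra", "deal", "authority")

# Keep only the elements that match the pattern
with_vowel = str_subset(words, "[aeiou]")

# Index with the negated logical vector to keep the non-matching elements
no_vowel = words[ !str_detect(words, "[aeiou]") ]

print( with_vowel )   # "video" "cross" "extra" "deal" "authority"
print( no_vowel )     # "why"
```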


## dplyr

### Filter
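
A minimal sketch of row filtering, assuming `dplyr::filter()` is the verb this section is meant to cover. The small data frame `df` is hypothetical; it only mimics a few columns of `my.df` so the example runs on its own.

```r
library(dplyr)

# Hypothetical data with the same column names as my.df
df = data.frame(
  com  = c('C1', 'C2', 'C2', 'C3'),
  dept = c('D1', 'D4', 'D2', 'D1'),
  x1   = c(48.1, 50.5, 58.2, 50.6),
  stringsAsFactors = FALSE
)

# Conditions separated by commas are combined with AND
kept = df %>% filter(com == 'C2', x1 > 55)
print( kept )
```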

### Sorting
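
A minimal sketch of sorting, assuming `dplyr::arrange()` is the verb this section is meant to cover. As before, the data frame `df` is hypothetical and only mimics a few columns of `my.df`.

```r
library(dplyr)

# Hypothetical data with the same column names as my.df
df = data.frame(
  com  = c('C1', 'C2', 'C2', 'C3'),
  dept = c('D1', 'D4', 'D2', 'D1'),
  x1   = c(48.1, 50.5, 58.2, 50.6),
  stringsAsFactors = FALSE
)

# Sort by com ascending, then by x1 descending within each com
sorted = df %>% arrange(com, desc(x1))
print( sorted )
```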