<center><h1>Cleaning String Data in R</h1></center>

# 1. What is Data Cleaning?
  - There is no uniform definition of "data cleaning"
  - Roughly speaking, it refers to exploring the idiosyncrasies of a data set and then addressing them in a principled manner so that the data can be analyzed

## 1.1 Examples of Data Cleaning

  - Recoding `"NULL"`, `" "`, and `""` as `NA`
  - Eliminating duplicate entries
  - Ensuring numeric data is stored as a numeric type (e.g., `"2" + 2` throws an error in R, while `as.numeric("2") + 2 == 4`)
  - Treating dates or timestamps as the `Date` or `POSIXct` data types
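
The four tasks above can be sketched in base R. This is a minimal illustration on a made-up data frame: `toy_df` and its columns are hypothetical and are not part of the arrests data.

```r
# Toy data frame; the column names and values are made up for illustration.
toy_df <- data.frame(
    id     = c("1", "2", "2", "3"),
    city   = c("Providence", "NULL", "NULL", " "),
    joined = c("2020-01-15", "2020-02-01", "2020-02-01", "2020-03-10"),
    stringsAsFactors = FALSE
)

# 1. Recode placeholder strings ("NULL", " ", "") as NA
toy_df$city[trimws(toy_df$city) %in% c("NULL", "")] <- NA

# 2. Drop exact duplicate rows
toy_df <- unique(toy_df)

# 3. Convert digit strings to a numeric type
toy_df$id <- as.numeric(toy_df$id)

# 4. Parse date strings into the Date type
toy_df$joined <- as.Date(toy_df$joined, format = "%Y-%m-%d")

str(toy_df)
```

Recoding before de-duplicating matters here only in general: rows that differ solely in how they spell "missing" become identical once both are `NA`.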

# 2. Cleaning Strings

  - Parsing/cleaning/extracting info from strings is extremely common
  - Parsing timestamp strings is a great example
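
As a small sketch of timestamp parsing, the string below follows the format of the `arrest_date` values in the arrests data. The `"UTC"` time zone is an assumption made for the example; the data does not state its time zone.

```r
# A timestamp string in the same format as the arrest_date column
ts_str <- "2019-08-24T02:23:00.0"

# %OS reads seconds including any fractional part; tz = "UTC" is an assumption
ts <- as.POSIXct(ts_str, format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")

# Once parsed, individual components are easy to extract
arrest_year <- as.integer(format(ts, "%Y"))
arrest_hour <- as.integer(format(ts, "%H"))

class(ts)
arrest_year
arrest_hour
```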

## 2.1 Errors in our `officer_cnt`

In [1]:
# Load necessary packages and arrests data
library(stringr)
library(dplyr)

arrests_df <- read.csv("./data/pvd_arrests_2020-10-03.csv")






In [2]:
count_names <- function(names_str) {
    # This function should return the number of names in 
    # the string `names_str` that we pass to the function. 
    
    name_vec <- unlist(str_split(names_str, ", "))
    k <- length(name_vec)
    
    return(k)
}

### 2.1.1 Inconsistencies in `arresting_officers` Column

In [3]:
head(arrests_df$arresting_officers, 10)

In [4]:
tail(arrests_df$arresting_officers, 10)

## 2.2 Addressing the Inconsistency
  - Use different criteria for counting names in full-name format
    + Define a function to distinguish full-name format from first-initial format
    + Note: first-initial format always starts with two capital letters (e.g., `NManfredi`)
In [5]:
LETTERS               # This is a built-in object in R

In [6]:
"B" %in% LETTERS

## 2.3 Identifying Full-Name Format 
  - If the first two characters are both uppercase, it's first-initial format; otherwise, it's full-name format

In [7]:
is_uppercase <- function(chr) {
    res <- chr %in% LETTERS
    return(res)
}

has_full_names <- function(names_str) {
    char1 <- substr(names_str, 1, 1)
    char2 <- substr(names_str, 2, 2)
    
    res <- !(is_uppercase(char1) && is_uppercase(char2))
    return(res)
}
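
As an alternative to the character-by-character check, the same rule can be written as a single regular expression. The function name `has_full_names_rx` is hypothetical and is not used elsewhere in this notebook. One possible advantage is that it works on a whole character vector at once, whereas `&&` expects length-one inputs (recent versions of R raise an error otherwise).

```r
library(stringr)

# Hypothetical vectorized variant of has_full_names()
has_full_names_rx <- function(names_chr) {
    # TRUE unless the string starts with two uppercase ASCII letters
    !str_detect(names_chr, "^[A-Z]{2}")
}

has_full_names_rx(c("NManfredi", "MPlace, JPerez, ASantos", "Newton, Frank"))
# FALSE FALSE TRUE
```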

### 2.3.1 Testing our Functions

In [8]:
is_uppercase("a")                            # false
is_uppercase("b")                            # false
has_full_names("NManfredi")                  # Not full name
has_full_names("MPlace, JPerez, ASantos")    # Not full name

is_uppercase("A")                            # true
is_uppercase("B")                            # true
has_full_names("Newton, Frank")              # Full name
has_full_names("Newton, Frank/ Chin, Rosemarie")    # Full name

## 2.4 Fixing our `count_names()` Function

In [9]:
old_count_names <- function(names_str) {
    name_vec <- unlist(str_split(names_str, ", "))
    k <- length(name_vec)
    
    return(k)
}

In [10]:
count_names <- function(names_str) {
    names_str_trm <- str_trim(names_str)     # remove whitespace
    
    if (has_full_names(names_str_trm)) {
        split_char <- "/ "
    } else {
        split_char <- ", "
    }
    
    name_vec <- unlist(str_split(names_str_trm, split_char))
    k <- length(name_vec)
    
    return(k)
}

### 2.4.1 Testing New `count_names()`

In [11]:
old_count_names("YGonzalez, LTaveras") == 2                # TRUE
old_count_names("Newton, Frank/ Chin, Rosemarie") == 2     # FALSE: old function returns 3
count_names("YGonzalez, LTaveras") == 2                    # TRUE
count_names("Newton, Frank/ Chin, Rosemarie") == 2         # TRUE

## 2.5 Re-Counting Officers
  - Let's compare how the "old" (i.e., incorrect) method did relative to our new `count_names()`

In [12]:
count_officers <- function(col, old = FALSE) {

    n <- length(col)   # get the length of our input column
    cnts <- rep(0, n)  # allocate vector of zeros to populate with counts

    for (i in seq_len(n)) {   # seq_len() is safe when n is 0, unlike 1:n
        if (old) {
            cnts[i] <- old_count_names(col[i])
        } else {
            cnts[i] <- count_names(col[i])
        }
    }
    return(cnts) 
}
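
The loop can also be avoided altogether. The sketch below is one possible vectorized formulation, not the method used in this notebook: `count_names_vec` is a hypothetical name, and it counts separators and adds one instead of splitting each string.

```r
library(stringr)

# Hypothetical vectorized counterpart to count_officers()
count_names_vec <- function(names_chr) {
    trimmed <- str_trim(names_chr)                      # remove whitespace

    # First-initial format starts with two capital letters
    initial_fmt <- str_detect(trimmed, "^[A-Z]{2}")

    # Pick the separator per element: ", " for initials, "/ " for full names
    sep <- ifelse(initial_fmt, ", ", "/ ")

    # Number of names = number of separators + 1
    str_count(trimmed, sep) + 1L
}

count_names_vec(c("YGonzalez, LTaveras",
                  "Newton, Frank/ Chin, Rosemarie",
                  "NManfredi",
                  " Lugo, Jeann "))
# 2 2 1 1
```

Both separators contain no regex metacharacters, so passing them to `str_count()` as patterns matches them literally.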

In [13]:
arrests_df$old_officer_cnt <- count_officers(arrests_df$arresting_officers, old = TRUE)

arrests_df$officer_cnt <- count_officers(arrests_df$arresting_officers)

In [14]:
head(arrests_df)

|  | arrest_date | year | month | gender | race | ethnicity | year_of_birth | age | from_address | from_city | from_state | statute_type | statute_code | statute_desc | counts | case_number | arresting_officers | arrestee_id | old_officer_cnt | officer_cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| 1 | 2019-08-24T02:23:00.0 | 2019 | 8 | Male | White | NonHispanic | 1981 | 37 | No Permanent Address | providence | Rhode Island |  |  |  |  | 2019-00084142 | YGonzalez, LTaveras | pvd2218242150382148273 | 2 | 2 |
| 2 | 2019-08-24T02:02:00.0 | 2019 | 8 |  |  |  | 1994 | 25 | SUMMER AVE | Cranston | Rhode Island | RI Statute Violation | 31-11-18 | Driving after Denial, Suspension or Revocation of License | 1.0 | 2019-00084127 | NManfredi | pvd15166785558364246202 | 1 | 1 |
| 3 | 2019-08-24T02:02:00.0 | 2019 | 8 | Female | Black | NonHispanic | 1984 | 34 | DOUGLAS AVE | Providence | Rhode Island | RI Statute Violation | 12-7-10 | RESISTING LEGAL OR ILLEGAL ARREST | 1.0 | 2019-00084126 | MPlace, JPerez, ASantos | pvd3142917706201385905 | 3 | 3 |
| 4 | 2019-08-24T02:02:00.0 | 2019 | 8 | Female | Black | NonHispanic | 1984 | 34 | DOUGLAS AVE | Providence | Rhode Island | RI Statute Violation | 11-45-1 | DISORDERLY CONDUCT | 1.0 | 2019-00084126 | MPlace, JPerez, ASantos | pvd3142917706201385905 | 3 | 3 |
| 5 | 2019-08-24T02:02:00.0 | 2019 | 8 | Female | Black | Unknown | 2001 | 18 | TRASH ST |  |  | RI Statute Violation | 12-7-10 | RESISTING LEGAL OR ILLEGAL ARREST | 1.0 | 2019-00084126 | MPlace, JPerez, ASantos | pvd460449304532374599 | 3 | 3 |
| 6 | 2019-08-24T02:02:00.0 | 2019 | 8 | Female | Black | Unknown | 2001 | 18 | TRASH ST |  |  | RI Statute Violation | 11-45-1 | DISORDERLY CONDUCT | 1.0 | 2019-00084126 | MPlace, JPerez, ASantos | pvd460449304532374599 | 3 | 3 |


In [15]:
tail(arrests_df, 12)

|  | arrest_date | year | month | gender | race | ethnicity | year_of_birth | age | from_address | from_city | from_state | statute_type | statute_code | statute_desc | counts | case_number | arresting_officers | arrestee_id | old_officer_cnt | officer_cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| 8744 | 2020-09-25T15:14:00.0 | 2020 | 9 | Male | White | NonHispanic | 1947 | 72 | WASHBURN ST | Providence |  | RI Statute Violation | 11-45-1 | DISORDERLY CONDUCT | 1.0 | 2020-00079845 | Lugo, Jeann | pvd6155618112432892995 | 2 | 1 |
| 8745 | 2020-09-25T14:36:00.0 | 2020 | 9 | Male | White | Hispanic | 1999 | 21 | NIAGARA ST | Providence | Rhode Island | RI Statute Violation | 11-5-2 | DOMESTIC-FELONY ASSAULT | 1.0 | 2020-00079837 | Lopez, Vincent/ Schneider, Alex/ Vargas, Guillermo | pvd8046681516085598213 | 4 | 3 |
| 8746 | 2020-09-25T14:36:00.0 | 2020 | 9 | Male | White | Hispanic | 1999 | 21 | NIAGARA ST | Providence | Rhode Island | RI Statute Violation | 11-47-42 | WEAPONS OTHER THAN FIREARMS PROHIBITED | 1.0 | 2020-00079837 | Lopez, Vincent/ Schneider, Alex/ Vargas, Guillermo | pvd8046681516085598213 | 4 | 3 |
| 8747 | 2020-09-25T14:36:00.0 | 2020 | 9 | Male | White | Hispanic | 1999 | 21 | NIAGARA ST | Providence | Rhode Island | RI Statute Violation | 11-45-1 | DOMESTIC-DISORDERLY CONDUCT | 1.0 | 2020-00079837 | Lopez, Vincent/ Schneider, Alex/ Vargas, Guillermo | pvd8046681516085598213 | 4 | 3 |
| 8748 | 2020-09-25T14:36:00.0 | 2020 | 9 | Male | White | Hispanic | 1999 | 21 | NIAGARA ST | Providence | Rhode Island | RI Statute Violation | 11-44-1 | DOMESTIC-VANDALISM/MALICIOUS INJURY TO PROP | 1.0 | 2020-00079837 | Lopez, Vincent/ Schneider, Alex/ Vargas, Guillermo | pvd8046681516085598213 | 4 | 3 |
| 8749 | 2020-09-25T14:36:00.0 | 2020 | 9 | Male | White | Hispanic | 1999 | 21 | NIAGARA ST | Providence | Rhode Island | RI Statute Violation | 11-5-3 | DOMESTIC-SIMPLE ASSAULT/BATTERY | 1.0 | 2020-00077230 | Lopez, Vincent/ Schneider, Alex/ Vargas, Guillermo | pvd8046681516085598213 | 4 | 3 |
| 8750 | 2020-09-25T09:45:00.0 | 2020 | 9 | Male | White | NonHispanic | 1970 | 50 | PINE ST | Providence | Rhode Island | RI Statute Violation | 11-5-2 | FELONY ASSAULT/ DANG. WEAPON OR SUBSTANCE | 1.0 | 2020-00079750 | Maycock, Michael | pvd13478136167689373662 | 2 | 1 |
| 8751 | 2020-09-25T09:11:00.0 | 2020 | 9 | Male | Black | NonHispanic | 1990 | 29 | WARWICK AVE | Cranston |  | RI Statute Violation | 31-11-18 | Driving after Denial, Suspension or Revocation of License | 1.0 | 2020-00079744 | San Lucas, Luis | pvd15638788339600544418 | 2 | 1 |
| 8752 | 2020-09-25T00:00:00.0 | 2020 | 9 | Female | Black | Hispanic | 1986 | 33 | ATWELLS AVE | Providence |  |  |  |  |  | 2020-00079901 | Lopes, Joseph | pvd7076185679870331431 | 2 | 1 |
| 8753 | 2020-09-25T00:00:00.0 | 2020 | 9 | Male | Black | Hispanic | 1981 | 39 | IVES ST | Providence |  |  |  |  |  | 2020-00079921 | Lopes, Joseph | pvd17011494258935977890 | 2 | 1 |


## 2.6 How Many Errors?

In [16]:

sum(arrests_df$old_officer_cnt != arrests_df$officer_cnt)

In [17]:
nrow(arrests_df)
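
A raw count of mismatches is easier to interpret as a share of all rows. The sketch below uses made-up count vectors (`old_cnt` and `new_cnt` are hypothetical, not values from the arrests data) to show one way to compute that share and to cross-tabulate the two columns.

```r
# Made-up counts for illustration only
old_cnt <- c(2, 1, 3, 2, 4)
new_cnt <- c(2, 1, 3, 1, 3)

# Number of rows where the two methods disagree
n_errors <- sum(old_cnt != new_cnt)

# The mean of a logical vector is the proportion of TRUE values
error_rate <- mean(old_cnt != new_cnt)

# Cross-tabulation shows which old counts map to which new counts
cnt_table <- table(old = old_cnt, new = new_cnt)

n_errors      # 2
error_rate    # 0.4
cnt_table
```

With the real data, the same expressions would be applied to `arrests_df$old_officer_cnt` and `arrests_df$officer_cnt`.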