# Cleansing : Address

### Check Outlier
In addition to empty values, an address should not consist solely of letters (plus spaces) or solely of digits (plus spaces). Rows that do are treated as outliers.

In [2]:
sql <- "SELECT kode_pelanggan, alamat, pola_alamat 
FROM dqlab_messy_data 
WHERE pola_alamat REGEXP '^[aAw]+$' or pola_alamat REGEXP '^[9w]+$'"

If we apply the regex directly to the alamat column, the patterns are:
<pre>
^[A-Za-z ]+$
^[0-9 ]+$
</pre>
If we apply the regex to the pola_alamat column instead, the patterns are:
<pre>
^[aAw]+$
^[9w]+$
</pre>

Note: The caret (^) at the beginning and the dollar sign ($) at the end anchor the pattern, so it must match the entire text from start to end rather than just a part of it.
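As a minimal sketch, the two variants of the check can be compared on a few made-up values. The sample addresses are illustrative and not taken from dqlab_messy_data, and `to_pola` is a hypothetical helper that assumes the pattern column encodes lowercase letters as `a`, uppercase letters as `A`, digits as `9`, and whitespace as `w`; the real pola_alamat column may be built differently.

```r
# Illustrative values only (not from dqlab_messy_data).
alamat <- c("Jalan Hang Tuah", "12345 678", "Jl. Pulo Bambu No. 15", "")

# Direct check on the raw text: only letters and spaces, or only digits and spaces.
outlier_direct <- grepl("^[A-Za-z ]+$", alamat) | grepl("^[0-9 ]+$", alamat)

# Hypothetical helper. Assumed encoding:
# lowercase -> a, uppercase -> A, digit -> 9, whitespace -> w.
to_pola <- function(x) {
  x <- gsub("[a-z]", "a", x, perl = TRUE)
  x <- gsub("[A-Z]", "A", x, perl = TRUE)
  x <- gsub("[0-9]", "9", x, perl = TRUE)
  gsub("\\s", "w", x, perl = TRUE)
}

# Same check expressed on the derived pattern column.
pola_alamat <- to_pola(alamat)
outlier_pola <- grepl("^[aAw]+$", pola_alamat) | grepl("^[9w]+$", pola_alamat)

print(data.frame(alamat, pola_alamat, outlier_direct, outlier_pola))
```

Because both patterns end in `+`, the empty string matches neither of them, so empty addresses still need their own check.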

### Standardize Alamat Column :  Jalan


In [21]:
library(openxlsx)
data_alamat <- read.xlsx("C:/Users/aftermath/Documents/CS-101/R/asset/data_alamat.xlsx")

<pre>
    Jl. Pulo Bambu No. 15, Kota Tenggara Lama
    Jln. Tegal Sari Indah, No. D87 -- Kota H
    Jalan Hang Tuah, No. 11, Kota DM
</pre>

We will replace all variations of the abbreviation shown above with "Jalan".
The regex pattern is as follows:
<pre>
    jln[ ]*\\.
    \\bjln\\b
    jl[ ]*\\.
    \\bjl\\b
    jalan\\.
</pre>
Where
<pre>
    \\b is the word boundary marker.
    \\. matches a literal period (the backslash is doubled because it is written inside an R string).
    [ ]*\\. matches zero or more spaces followed by a period.
</pre>
Note: These patterns are only an example that fits our case. In practice you need to collect the variations found in your own data before you can standardize them.
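One way to keep such a collection manageable is to hold the patterns in a single vector and apply them in a loop. The sketch below does this for the patterns used in this section's `gsub` calls and runs them on the three sample addresses listed earlier; `jalan_patterns` and `standardize_jalan` are hypothetical names introduced here for illustration.

```r
# Patterns applied in order; new variants can be appended in one place.
jalan_patterns <- c("jln[ ]*\\.", "\\bjln\\b", "jl[ ]*\\.", "\\bjl\\b", "jalan\\.")

# Hypothetical helper: replace every matching variant with "Jalan".
standardize_jalan <- function(x, patterns = jalan_patterns) {
  for (p in patterns) {
    x <- gsub(p, "Jalan", x, ignore.case = TRUE)
  }
  x
}

# The three sample addresses from this section.
contoh <- c(
  "Jl. Pulo Bambu No. 15, Kota Tenggara Lama",
  "Jln. Tegal Sari Indah, No. D87 -- Kota H",
  "Jalan Hang Tuah, No. 11, Kota DM"
)

hasil <- standardize_jalan(contoh)
print(hasil)
```

The order of the patterns matters: the variants with a trailing period run before the bare word-boundary variants, so the period is removed together with the abbreviation instead of being left behind.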

In [23]:
# Variants with a trailing period are handled before the bare abbreviations,
# so the period is removed together with the abbreviation.
data_alamat$alamat <- gsub("jln[ ]*\\.", "Jalan", data_alamat$alamat, ignore.case=TRUE)
data_alamat$alamat <- gsub("\\bjln\\b", "Jalan", data_alamat$alamat, ignore.case=TRUE)
data_alamat$alamat <- gsub("jl[ ]*\\.", "Jalan", data_alamat$alamat, ignore.case=TRUE)
data_alamat$alamat <- gsub("\\bjl\\b", "Jalan", data_alamat$alamat, ignore.case=TRUE)
# "Jalan." (already spelled out, but with a stray period) loses the period.
data_alamat$alamat <- gsub("jalan\\.", "Jalan", data_alamat$alamat, ignore.case=TRUE)
data_alamat

kode_pelanggan,alamat
KD-00032,"Vila Sempilan, No. 67 - Kota B"
KD-00053,"Vila Sempilan, No. 11 - Kota B"
KD-00133,"Vila Sempilan, No. 1 - Kota B"
KD-00056,"Vila Permata Intan Berkilau, Blok C5-7"
KD-00111,"Vila Permata Intan Berkilau, Blok A1/2"
KD-00036,"Vila Gunung Seribu, Blok O1 - No. 1"
KD-00126,"Vila Gunung Seribu, Blok F4 - No. 8"
KD-00137,"Vila Bukit Sagitarius, Gang. Sawit No. 3"
KD-00046,"Vila Bukit Sagitarius, Gang Kelapa No. 6"
KD-00027,"Vila Bukit Sagitarius, Blok A1 No. 1"


### Save Data as Xlsx

In [24]:
write.xlsx(data_alamat, file = "C:\\Users\\aftermath\\Documents\\CS-101\\R\\asset\\staging.alamat.xlsx")

Note: zip::zip() is deprecated, please use zip::zipr() instead
