# Cleansing: Duplicate Data
Data duplication is a condition in which a dataset contains more than one record that represents the same entity, but the records have not been grouped into one.

From the business side, this can lead to lost opportunities. For example, if a business group can integrate the customers of all its units, it can learn the shopping behavior of each person and make better offers.

### Text Distance
In R, the "stringdist" package provides a function, also called stringdist, that calculates the distance between two texts. Here is a direct example of its use.
<pre>
    stringdist("Agus Cahyono", "Cahyono, Agus", method = "cosine")
</pre>

Where:
<br>stringdist: the function that calculates the distance between texts.
<br>"Agus Cahyono": the first text to be compared.
<br>"Cahyono, Agus": the second text to be compared.
<br>method = "cosine": the method used to calculate the text distance, in this case "cosine". This method breaks each text into q-grams (chunks of q characters; by default q = 1, i.e. single characters), counts them into a vector, and compares the two vectors. It is used here because it does not look at the position of the characters.

Note: The stringdist function is case sensitive, meaning that the upper and lowercase forms of the same letter are considered different characters.
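
Because of this case sensitivity, it can help to normalize the texts before comparing them. The sketch below is one possible approach, not part of the stringdist package: normalize_name is a hypothetical helper that lower-cases the text and strips punctuation.

```r
library(stringdist)

# Hypothetical helper (an assumption, not a stringdist function):
# lower-case the text, drop punctuation, and trim surrounding spaces.
normalize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  trimws(x)
}

a <- normalize_name("Agus Cahyono")   # "agus cahyono"
b <- normalize_name("Cahyono, Agus")  # "cahyono agus"

# After normalization both texts hold the same characters,
# so the cosine distance should come out as 0.
stringdist(a, b, method = "cosine")
```

Whether punctuation and case should be ignored depends on the data; for names of people it is usually a reasonable assumption.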

Other methods:
<pre>
lv: Levenshtein distance. The distance is the number of characters that must be deleted, added, or changed so that the two texts become the same. The distance value is an integer from 0 upwards.

dl: Damerau-Levenshtein distance. An extension of the Levenshtein distance that also allows transposition (swapping) of adjacent characters. The distance value is an integer from 0 upwards.

hamming: Hamming distance - the number of positions at which the two texts have different characters. The two texts must have the same length; otherwise, Inf is returned. The distance value is an integer from 0 upwards.

osa: Optimal string alignment - similar to dl, but each substring can only be edited once. The distance value is an integer from 0 upwards. This is the default method for stringdist.

lcs: longest common substring - how many characters must be removed from both texts so that they become the same text. The distance value is an integer from 0 upwards.

qgram: q-gram distance - how many q-grams (chunks of q characters from the text) are not shared by the two texts. The distance value is an integer from 0 upwards.

jaccard: Jaccard distance - one minus the number of q-grams shared by both texts divided by the number of distinct q-grams found in either text. The distance value is a decimal value between 0 and 1.

jw: Jaro-Winkler distance - based on the number of matching characters and the number of transpositions between the two texts. With the default p = 0 this is the plain Jaro distance. The distance value is a decimal value between 0 and 1.

soundex: a distance based on differences in English pronunciation. Each text is translated to a soundex code; the distance is 0 when the codes are the same and 1 otherwise.
</pre>
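
To get a feel for how the methods differ, the same pair of texts can be run through several of them. This is a small illustrative sketch; the choice of q = 2 for the Jaccard call is an assumption made for the example, not a recommendation.

```r
library(stringdist)

a <- "Agus Cahyono"
b <- "Agus Tjahyono"

# Edit-based methods count edit operations.
stringdist(a, b, method = "lv")       # 2: change "C" to "T", then insert "j"
stringdist(a, b, method = "hamming")  # Inf: the two texts differ in length

# q-gram based and heuristic methods; both return values between 0 and 1.
stringdist(a, b, method = "jaccard", q = 2)
stringdist(a, b, method = "jw")
```

Integer-valued distances such as lv grow with the length of the texts, so a threshold chosen for them has to take the typical text length into account.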

In [4]:
library('stringdist')

stringdist("Agus Cahyono" ,"Agus Cahyono", method="cosine")
stringdist("Agus Cahyono", "agus cahyono", method="cosine")
stringdist("Agus Cahyono", "Agus Tjahyono", method="cosine")
stringdist("Agus Cahyono", "Cahyono Agus", method="cosine")
stringdist("Agus Cahyono",  "Cahyono, Agus", method="cosine")
stringdist("Agus Cahyono", "Justin Bieber", method="cosine")

From these results, the following can be summarized:

The first result is a distance of 0, meaning there is no distance at all: the two texts are identical.

The second result is 0.131401. The two texts contain the same letters in the same order, but differ in upper and lower case.

The third result is 0.1029148, because "Agus Cahyono" and "Agus Tjahyono" differ only in the "Cahyono" and "Tjahyono" parts. The figure shows that the two texts are still very close.

The fourth result is 0. The two texts are treated as identical even though the word order is reversed ("Agus Cahyono" and "Cahyono Agus"), because the cosine method does not look at character positions.

The fifth result is 0.03390822. The distance is very small even though the word order is reversed ("Agus Cahyono" and "Cahyono, Agus"). The small difference is caused by the comma.

The sixth result is 0.7407185. "Agus Cahyono" and "Justin Bieber" are far apart, in clear contrast to the results above.
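
To see where these numbers come from, the cosine distance can be recomputed by hand. The sketch below is a base R illustration, not stringdist's actual implementation. It assumes the default q = 1, so each text is reduced to a vector of single-character counts; cosine_distance is a hypothetical helper name.

```r
# Hypothetical helper: cosine distance over single-character counts (q = 1).
cosine_distance <- function(a, b) {
  chars_a <- strsplit(a, "")[[1]]
  chars_b <- strsplit(b, "")[[1]]
  alphabet <- union(chars_a, chars_b)

  # Count how often each character of the shared alphabet occurs in each text.
  count_a <- as.numeric(table(factor(chars_a, levels = alphabet)))
  count_b <- as.numeric(table(factor(chars_b, levels = alphabet)))

  # Cosine distance = 1 - cosine similarity of the two count vectors.
  1 - sum(count_a * count_b) / sqrt(sum(count_a^2) * sum(count_b^2))
}

cosine_distance("Agus Cahyono", "agus cahyono")   # about 0.1314
cosine_distance("Agus Cahyono", "Agus Tjahyono")  # about 0.1029
cosine_distance("Agus Cahyono", "Cahyono Agus")   # 0: same characters, different order
```

Since only the character counts enter the calculation, reordering the words leaves the vectors unchanged, while "A" and "a" land in different vector positions.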

### Duplicate in Vector

The stringdist function is vectorized: when a single reference text is compared with a vector of texts, it returns one distance per item. Filtering those distances with a threshold (0.15 in the code that follows) keeps the names that are likely duplicates of the reference, in this case every name except "Justin Bieber".

In [5]:
referensi <- "Agus Cahyono"
nama.pelanggan <- c("Agus Cahyono", "Justin Bieber", "Agus Tjahyono", "Cahyono Agus")

jarak.teks <- stringdist(referensi, nama.pelanggan, method="cosine")

nama.pelanggan[jarak.teks<=0.15]

### Duplicate Grouping

Records that share the same grouping number are considered to be the same (duplicates). In the example below, grouping 1 has three records, while grouping 2 has only one record (no duplicates).

<pre>
  grouping          nama
1        1  Agus Cahyono
2        1 Agus Tjahyono
3        1  Cahyono Agus
4        2 Justin Bieber 
</pre>


There are many ways to do this. In the code editor, the DQLab team has prepared an algorithm with the following logic:

<br>The customer variable is filled with the initial vector data.
<br>Initialize the grouping number variable (grouping_no) to 1.
<br>The search for duplicates starts by taking the first item of the vector as the reference.
<br>Each time duplicates are found, they are removed from the customer variable, so the length of the vector eventually drops to zero.
<br>Calculate the distance between the reference text and all items of the customer names.
<br>Filter the customer names whose text distance is within the threshold, and save them to a result variable.
<br>Create a temporary variable (var.temp), a data frame that contains the current grouping number and the duplicates found.
<br>Combine var.temp with the previous results into the final result variable.
<br>Remove the items just found from the customer names by keeping only the items whose text distance is above the threshold.
<br>Increase grouping_no by 1.
<br>If items still remain, repeat the process from the step that takes the first item as the reference.

In [6]:
#Create the vector variable of names
nama.pelanggan <- c("Agus Cahyono", "Justin Bieber", "Agus Tjahyono", "Cahyono Agus")

#Initialize the variable for hasil.akhir (the final result)
hasil.akhir <- NULL

#Initialize the variable grouping_no with the value 1
grouping_no <- 1

#Repeat the search with while until the vector is empty (length = 0)
while(length(nama.pelanggan)>0)
{
  #The referensi variable is filled with the first item of nama.pelanggan
  referensi <- nama.pelanggan[1]

  #Calculate the distance between the reference and the items of nama.pelanggan
  jarak.teks <- stringdist(referensi, nama.pelanggan, method="cosine")

  #The items within the distance threshold of 0.15 are saved to nama.hasil
  nama.hasil <- nama.pelanggan[jarak.teks <= 0.15]

  #Build a temporary data frame with the current grouping number and the duplicates found
  var.temp <- data.frame(grouping=grouping_no, nama=nama.hasil)

  #Combine with the previous results
  hasil.akhir <- rbind(hasil.akhir, var.temp)

  #Keep only the items outside the threshold, using the ! symbol (the not operator)
  nama.pelanggan <- nama.pelanggan[!(jarak.teks <= 0.15)]

  #Increase the grouping number for the next iteration
  grouping_no <- grouping_no + 1
}
#Display the final result
hasil.akhir


grouping,nama
1,Agus Cahyono
1,Agus Tjahyono
1,Cahyono Agus
2,Justin Bieber


In [16]:
library(openxlsx)

data.pelanggan <- read.xlsx("C:\\Users\\aftermath\\Documents\\CS-101\\R\\asset\\merged.data.xlsx")
data.pelanggan

kode_pelanggan,nama,alamat,no_telepon,anomali_no_telepon,kode_pos,tanggal_lahir
KD-00001,Agus Cahyonos,"Jalan Pulo Bambu No. 15, Kota Tenggara Lama",+628298911112222,TRUE,,08-02-1967
KD-00002,Khairul Nissa,"Taman Vivo Indah, Blok AA No. 7",+6287132221371404,TRUE,,23-10-1991
KD-00003,Slamet Wiyanto,"Meta Residences, No. 32C",+6285725955303368,TRUE,,23-11-1962
KD-00004,DRS. Maria Simangunsong,"Gang Bulan Desember III, No. 9",+6283376770990635,TRUE,,17-02-2097
KD-00005,Prihatin Setyonugroho,"Jalan Tegal Sari Indah, No. D87 -- Kota H",+6286843623971825,TRUE,,19-08-1986
KD-00006,DR. Candra Wijaya,"Perum Pluto, Blok C No. 1",+6284063423953696,TRUE,,05-09-1990
KD-00007,"Indra Kurniawan, ST","Apartemen Kecapi Indah, Lt. 16 No. 1610",+6283840529196797,TRUE,,23-10-1979
KD-00008,Willy Sanjaya,"Kali Mars Cluster, No. 24C",+6285312577710538,TRUE,,22-07-1973
KD-00009,Antonius Winarta,"Jalan Kebon Jahe, No. F16 - Kota E",+6282722234294686,TRUE,,
KD-00010,"Sri Wahyuni, Ir","Perum Venus, Gg. Harimau No. 1A",+6284079659289143,TRUE,,23-10-1991


In [17]:
#Initialize the variable for hasil.akhir (the final result)
hasil.akhir <- NULL

#Initialize the variable grouping_no with the value 1
grouping_no <- 1

#Repeat until no rows are left in data.pelanggan
while(length(data.pelanggan$nama)>0)
{
  #The reference name and address are taken from the first row
  referensi.nama <- data.pelanggan$nama[1]
  referensi.alamat <- data.pelanggan$alamat[1]
  
  #Calculate the distance between the references and all names and addresses
  #using method "cosine" for the name and method "lv" for the address
  jarak.teks.nama <- stringdist(referensi.nama, data.pelanggan$nama, method="cosine")
  jarak.teks.alamat <- stringdist(referensi.alamat, data.pelanggan$alamat, method="lv")

  #The distance filter with the thresholds
  # - less than or equal to 0.15 for the name
  # - less than 15 for the address
  #is saved to the variable filter.jarak
  filter.jarak <- (jarak.teks.nama <= 0.15 & jarak.teks.alamat < 15)

  #Filter data.pelanggan and take three columns
  #to be saved into three variables
  kode_pelanggan.temp <- data.pelanggan[filter.jarak,]$kode_pelanggan
  nama.temp <- data.pelanggan[filter.jarak,]$nama
  alamat.temp <- data.pelanggan[filter.jarak,]$alamat
  
  #Construct the temporary variable
  var.temp <- data.frame(grouping=grouping_no,
                         kode_pelanggan=kode_pelanggan.temp,
                         nama=nama.temp,
                         alamat=alamat.temp,
                         jumlah_record=length(kode_pelanggan.temp))

  #Combine the temporary variable with the previous results
  hasil.akhir <- rbind(hasil.akhir, var.temp)
  
  #Remove the rows that have just been grouped from data.pelanggan
  data.pelanggan <- data.pelanggan[!filter.jarak,]

  #Increase the grouping number for the next iteration
  grouping_no <- grouping_no + 1
}

In [18]:
hasil.akhir

grouping,kode_pelanggan,nama,alamat,jumlah_record
1,KD-00001,Agus Cahyonos,"Jalan Pulo Bambu No. 15, Kota Tenggara Lama",3
1,KD-00012,"Cahyono, Agus","Pulo Bambu No. 15, Kota Tenggara Lama",3
1,KD-00778,Cahyono Agus H.,Jalan Pulau Bambu No. 15 - Kota Tenggara Lama,3
2,KD-00002,Khairul Nissa,"Taman Vivo Indah, Blok AA No. 7",1
3,KD-00003,Slamet Wiyanto,"Meta Residences, No. 32C",1
4,KD-00004,DRS. Maria Simangunsong,"Gang Bulan Desember III, No. 9",1
5,KD-00005,Prihatin Setyonugroho,"Jalan Tegal Sari Indah, No. D87 -- Kota H",1
6,KD-00006,DR. Candra Wijaya,"Perum Pluto, Blok C No. 1",1
7,KD-00007,"Indra Kurniawan, ST","Apartemen Kecapi Indah, Lt. 16 No. 1610",1
8,KD-00008,Willy Sanjaya,"Kali Mars Cluster, No. 24C",1
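
One way to review such a result is to list only the groups that hold more than one record. The sketch below rebuilds a few rows of the output above by hand so that it runs on its own; contoh.hasil and duplikat are illustrative names, not variables from the original code.

```r
# A small excerpt of the grouped result shown above, rebuilt by hand for illustration.
contoh.hasil <- data.frame(
  grouping = c(1, 1, 1, 2, 3),
  kode_pelanggan = c("KD-00001", "KD-00012", "KD-00778", "KD-00002", "KD-00003"),
  nama = c("Agus Cahyonos", "Cahyono, Agus", "Cahyono Agus H.",
           "Khairul Nissa", "Slamet Wiyanto"),
  jumlah_record = c(3, 3, 3, 1, 1)
)

# Keep only the groups that contain more than one record,
# i.e. the suspected duplicates that need a manual check.
duplikat <- contoh.hasil[contoh.hasil$jumlah_record > 1, ]
duplikat
```

The same filter applied to hasil.akhir would return the three "Agus Cahyono" variants of grouping 1.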


### Save Data to Xlsx

In [19]:
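#data.pelanggan has no rows left once the while loop has finished, so the grouped result hasil.akhir is saved instead
write.xlsx(hasil.akhir, file = "C:\\Users\\aftermath\\Documents\\CS-101\\R\\asset\\nonduplicate.data.xlsx")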

Note: zip::zip() is deprecated, please use zip::zipr() instead
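
If the goal is a file with one record per customer, one possible approach is to keep only the first record of each grouping number. This is a sketch under that assumption; which record of a group should survive is a business decision. The data frame is again a hand-built excerpt, and contoh.hasil and data.unik are illustrative names.

```r
# Hand-built excerpt of a grouped result, for illustration only.
contoh.hasil <- data.frame(
  grouping = c(1, 1, 1, 2, 3),
  kode_pelanggan = c("KD-00001", "KD-00012", "KD-00778", "KD-00002", "KD-00003"),
  nama = c("Agus Cahyonos", "Cahyono, Agus", "Cahyono Agus H.",
           "Khairul Nissa", "Slamet Wiyanto")
)

# duplicated() marks the second and later rows of each grouping number,
# so negating it keeps the first row of every group.
data.unik <- contoh.hasil[!duplicated(contoh.hasil$grouping), ]
data.unik$kode_pelanggan
```

A data frame built this way could then be passed to write.xlsx in the same manner as above.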
