The data wrangled here is the Jakarta, Indonesia metro extract from https://mapzen.com/data/metro-extracts/metro/jakarta_indonesia/

I am from Indonesia, so I thought it would be nice to explore data I should already be familiar with. That familiarity makes it easier to spot the most common errors in the data, particularly for my own area.

# Problems Encountered in the Map

Several issues were found in the data, most of them in the `tag` elements.

1. Inconsistent address formats: some values include only a street address, others only a residential cluster name or a bare number
```
'\xc3\xa2\xc2\x80\xc2\x8eJl. Ir. H. Djuanda No. 95'
'JI.Margonda Raya No. 428, Beji, Depok , Indonesia'
'M.I. RIDWAN RAIS NO. 37, Beji Timur. Depok'
'Sentra Niaga Puri Indah'
'Pamulang Permai blok D III no. 1-2'
'22'
```

2. Incorrectly formatted postal codes
```
b'16127.'
b'14450.'
b'11550.'
b'\xc3\xa2\xc2\x80\xc2\x8e15414'
b'Lippo Karawaci 1600 Tangerang 15811'
b'151416'
```

3. Inconsistent language used for city names
```
'Jakarta Selatan',
'South Jakarta',
```

4. Inconsistent phone number formats
```
'+62 21 799 0888',
'+62 21 5263137',
'14045',
'622178834966',
'+62 8983 2943',
'(0251) 831 6348',
```

These errors should therefore be cleaned before the data is exported into the database. The logic behind the cleaning for each error is described below.

1. To clean the inconsistent address format, in this case the street ("Jalan") naming, the following steps are applied:
    - Replace all abbreviated forms of "Jalan", such as jl, jln, Jln., and so on.
    - Collapse double spaces, as in "Jalan  A".
    - Remove all non-ASCII characters from the street name, as in "\xe2\x80\x8eJalan Ir. H. Djuanda No. 95" (Indonesian street names mostly use plain ASCII characters, although this needs more research).
    - Remove everything that is not the street name and its number, such as the city name and country.

    The function below implements these steps:

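    ```
    import re

    # Abbreviated or lowercase forms of "Jalan": jl, jln, Jl., JI., JLN, jalan, ...
    ADDRESS_ABBRV = re.compile(r'[jJ](?:l|ln|I|L|LN)[\s.]|jalan')

    def fix_address(data):
        data = ADDRESS_ABBRV.sub('Jalan ', data)  # expand abbreviations
        data = data.encode('ascii', 'ignore').decode()  # drop non-ASCII characters
        data = re.sub(r'\s{2,}', ' ', data).strip()  # collapse repeated spaces
        if 'Jalan' not in data:
            data = 'Jalan ' + data
        # keep only the comma-separated part that holds the street name
        for value in re.split(r',\s*', data):
            if 'Jalan' in value:
                data = value
        return data
    ```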
    
2. Cleaning the postal code is simpler; it only needs a regex that extracts the first run of five digits:
    ```
    import re

    POSTAL_CODE = re.compile(r'[0-9]{5}')

    def fix_postal(data):
        # return the first run of five digits, or a placeholder if there is none
        match = POSTAL_CODE.search(data)
        if match:
            return match.group(0)
        return '00000'
    ```

3. Interestingly, in Indonesia we are sometimes unsure whether to use English or Bahasa Indonesia. As a result, city names appear in both languages, for example "South Jakarta" and "Jakarta Selatan". As a simple fix, I provide a small dictionary that translates the English names to Bahasa Indonesia:
    ```
    CITY_TRANSLATION = {
        "south jakarta": "Jakarta Selatan",
        "north jakarta": "Jakarta Utara",
        "west jakarta": "Jakarta Barat",
        "east jakarta": "Jakarta Timur",
    }
    ```
    However, this is only a simple fix; further development, such as per-word or context-aware translation, may handle this better.

4. People generally write phone numbers in their own format. As a simple fix, I remove all special characters from the phone number to make it more consistent.
    ```
    PROBLEMCHARS = re.compile(r'[()\-\=\+/&<>;\'\"\?%#$@\,\.\ \t\r\n]')
    ```
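
Steps 3 and 4 give only a dictionary and a regex, so here is a minimal sketch of how they might be applied. The helper names `fix_city` and `fix_phone` are hypothetical and do not come from the original code; the sketch assumes city names are matched case-insensitively and that unknown names are left unchanged.

```python
import re

# Lowercase English name -> Indonesian name, copied from step 3.
CITY_TRANSLATION = {
    "south jakarta": "Jakarta Selatan",
    "north jakarta": "Jakarta Utara",
    "west jakarta": "Jakarta Barat",
    "east jakarta": "Jakarta Timur",
}

# Same character class as PROBLEMCHARS in step 4.
PROBLEMCHARS = re.compile(r'[()\-\=\+/&<>;\'\"\?%#$@\,\.\ \t\r\n]')


def fix_city(data):
    """Hypothetical helper: translate a known English city name, else return it stripped."""
    key = data.strip().lower()
    return CITY_TRANSLATION.get(key, data.strip())


def fix_phone(data):
    """Hypothetical helper: delete every character matched by PROBLEMCHARS."""
    return PROBLEMCHARS.sub('', data)


print(fix_city('South Jakarta'))      # Jakarta Selatan
print(fix_phone('+62 21 799 0888'))   # 62217990888
print(fix_phone('(0251) 831 6348'))   # 02518316348
```

Stripping special characters makes the values comparable, but it does not unify the `+62`, `62`, and leading `0` prefixes; that would need a further normalization step.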

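One caveat with the postal code regex in step 2: for a six-digit value such as `151416` it returns the first five digits (`15141`) instead of flagging the value as invalid. A possible stricter variant, shown here only as a sketch with the hypothetical name `fix_postal_strict`, uses lookarounds so that only runs of exactly five digits are accepted.

```python
import re

# Exactly five digits, not preceded or followed by another digit.
STRICT_POSTAL_CODE = re.compile(r'(?<![0-9])[0-9]{5}(?![0-9])')


def fix_postal_strict(data):
    """Hypothetical variant of fix_postal that rejects longer digit runs."""
    match = STRICT_POSTAL_CODE.search(data)
    return match.group(0) if match else '00000'


for raw in ['16127.', 'Lippo Karawaci 1600 Tangerang 15811', '151416']:
    print(repr(raw), '->', fix_postal_strict(raw))
```

With this variant the four-digit `1600` and the six-digit `151416` are both skipped, so the last sample falls back to the `00000` placeholder.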

# Overview of the data
As a general overview, 16071868 lines were processed. The input file and the CSV files produced from it have the following sizes:

```
jakarta_indonesia.osm: 3.04 GB
nodes.csv: 1.45 GB
nodes_tags.csv: 17.9 MB
ways.csv: 253 MB
ways_nodes.csv: 507.4 MB
ways_tags.csv: 703.5 MB
```
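
A listing like the one above can be produced with a short script. The following is a sketch only: it assumes the files sit in the current working directory and that sizes are reported in decimal units (1 MB = 1,000,000 bytes), neither of which is stated in the original. `human_size` and `report_sizes` are hypothetical helper names.

```python
import os


def human_size(num_bytes):
    """Format a byte count using decimal units (assumption: 1 MB = 1000 * 1000 bytes)."""
    size = float(num_bytes)
    for unit in ('B', 'KB', 'MB', 'GB'):
        if size < 1000 or unit == 'GB':
            return '{:.1f} {}'.format(size, unit)
        size /= 1000


def report_sizes(paths):
    """Return 'name: size' lines for the files that exist; missing files are skipped."""
    lines = []
    for path in paths:
        if os.path.exists(path):
            size = human_size(os.path.getsize(path))
            lines.append('{}: {}'.format(os.path.basename(path), size))
    return lines


FILES = ['jakarta_indonesia.osm', 'nodes.csv', 'nodes_tags.csv',
         'ways.csv', 'ways_nodes.csv', 'ways_tags.csv']
print('\n'.join(report_sizes(FILES)))
```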


# Other ideas about the dataset
