## **Preliminary analysis** | 30 September 2023
Breakdown of the problem into simpler chunks

### **File format**
The XML files are all in the same format. They all have the following tags:

1. putusan
2. kepala_putusan
3. identitas
4. riwayat_penahanan
5. riwayat_perkara
6. riwayat_tuntutan
7. riwayat_dakwaan
8. fakta
9. fakta_hukum
10. pertimbangan_hukum
11. amar_putusan
12. penutup
    
where `<putusan>` is the parent tag of all the other tags.

> Correction: Some files contain additional properties \<pertimbangan_hukum\> and \<fakta_hukum\>
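Since not every file necessarily has every tag, it helps to check which sections a given file actually contains. A minimal sketch using a plain substring test per tag, assuming the tag names listed above; `SECTION_TAGS` and `present_sections` are hypothetical names used only for illustration:

```python
# Child tags from the list above; <putusan> is the root and is left out.
SECTION_TAGS = (
    "kepala_putusan", "identitas", "riwayat_penahanan", "riwayat_perkara",
    "riwayat_tuntutan", "riwayat_dakwaan", "fakta", "fakta_hukum",
    "pertimbangan_hukum", "amar_putusan", "penutup",
)


def present_sections(xml_text: str) -> list[str]:
    """Return the section tags whose opening tag occurs in xml_text."""
    return [tag for tag in SECTION_TAGS if f"<{tag}>" in xml_text]


sample = "<putusan><kepala_putusan>putusan</kepala_putusan><fakta>x</fakta></putusan>"
print(present_sections(sample))
```

Because the closing `>` is part of the search string, `<fakta>` does not accidentally match `<fakta_hukum>`.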

### **General Workflow**
The assignment is massive, composed of multiple different functionalities that must be implemented at the same time.
From what I understand so far, we need to:

1. Fetch all **files from a folder**
2. Process **command line arguments**
    1. Use keywords and boolean operators
    2. Translate natural language to computer language
3. **Filter** results based on command line arguments
    1. Search for certain sections within the document
    2. Implement conditional operators `(AND/OR/ANDNOT)`
4. Output **formatted strings** based on fetched results
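
The filtering step can be sketched as a single function. This is only an illustration under two assumptions that the notes do not confirm: keywords are matched as case-sensitive substrings, and `ANDNOT` means "contains the first keyword but not the second". The name `matches` is hypothetical:

```python
def matches(text: str, keyword1: str, operator: str = "", keyword2: str = "") -> bool:
    """Return True if text satisfies the keyword query."""
    has_first = keyword1 in text
    if not operator:
        # Single-keyword query
        return has_first
    has_second = keyword2 in text
    if operator == "AND":
        return has_first and has_second
    if operator == "OR":
        return has_first or has_second
    if operator == "ANDNOT":
        # Assumed meaning: first keyword present, second keyword absent
        return has_first and not has_second
    raise ValueError(f"Unknown operator: {operator}")


print(matches("pidana narkotika", "pidana", "ANDNOT", "korupsi"))
```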


## **Problem Breakdown** | 1 October 2023

### **Accessing files**

Accessing files can be done using the `open()` function and the file object's `read()` method, but only after finding the path to the specified XML file.
Looping through files in a folder is to be done separately.

In [1]:
my_file = open("C:/Users/tobya/OneDrive/Documents/University Documents/Akademik/Dasar-Dasar Pemrograman 1/TP 2/indo-law-main/dataset/000d41015da43a8f1060facc4a67113f.xml", "r")
print(my_file.read())

<putusan amar="pidana" amar_lainnya="pidana-penjara-waktu-tentu" id="000d41015da43a8f1060facc4a67113f" klasifikasi="pidana-khusus" lama_hukuman="1440" lembaga_peradilan="pn-pekalongan" provinsi="jateng" status="berkekuatan-hukum-tetap" sub_klasifikasi="narkotika-dan-psikotropika" url="https://putusan3.mahkamahagung.go.id/direktori/putusan/000d41015da43a8f1060facc4a67113f.html">
<kepala_putusan>
putusan
no 180 pid sus 2018 pn pk
demi keadilan berdasarkan ketuhanan yang maha esa
pengadilan negeri pekalongan yang mengadili perkara perkara pidana pada peradilan tingkat pertama dengan acara pemeriksaan biasa telah menjatuhkan putusan sebagai berikut
</kepala_putusan>
<identitas>
nama lengkap fery irawan als kenceng bin suprapto
tempat lahir pekalongan
umur tgl lahir
24 tahun 3 pebruari 1994
jenis kelamin laki laki
kewarganegaraan indonesia
tempat tinggal dk sepete d jetak kidul rt 016 rw 003 kecamatan wonopringgo kabupaten pekalongan
agama islam pekerjaan buruh harian lepas
pendidikan smp k

### **Looping through files**

We can retrieve a list of all the file names in a directory using the os.listdir() function.

In [2]:
import os

folder_contents = os.listdir("indo-law-main/dataset")
for f in folder_contents:
    print(f)

00035681c8d944203f25d2e8215ae2bf.xml
000399ce26773e18695ce14f519cb9e6.xml
0006582ad67cd9bd1ddf4261a09bf382.xml
00092bbac1a705aa44f2e10a0511cc0c.xml
0009b7fa2e45129b1755ddbdf35c7eec.xml
000d41015da43a8f1060facc4a67113f.xml
0011e2eb493179fd588719d8f5ce5524.xml
00122b1be15a10ad474bb3b7ec0dea73.xml
00136d1554e18c63256deac42aad0c58.xml
0013e8cdeaab97f04b4601d46a008546.xml
001477db6a8d6599f8ac40908d983a0b.xml
00182198c1634ee0e7532bb8ec7b6158.xml
001c437281a5e012307f85894cc15fef.xml
001c8a8f1d5c891146e090fe31452218.xml
001f721fe31cee23eab7ddd68db7fea3.xml
00213770c9651cc862623c409a689f8f.xml
0023eff6dfcda08f49cf9b055cb2782f.xml
002440ead64d7422fd79ab097b3d1cf8.xml
00289784b1c883ed25b0dd2d5ec20e2f.xml
0029289f42555a023476b7ed7a382a8a.xml
00299fca74c4dddd9f1a58424e09a260.xml
002cf65f6bc63c190a51d4713ca19f04.xml
002d3f2456292f874d710695561f670d.xml
002fc593bc1e983a9328ce38a7d28de7.xml
003235753e4492bff39cf2e12d5a5511.xml
0034080b5e150e1dcec9017c61038c7a.xml
003605d9ecf0888b82813bf2973c79c3.xml
0

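> Through some testing, it turns out that storing the list in a variable and then looping through the list items is around 0.03s faster than looping through the list defined within the for loop (0.026s versus 0.060s for 22630 files in the run below). Both versions call `os.listdir()` only once, so part of the gap may come from the directory listing already being cached for the second measurement. This will only be useful in outputting text.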

In [3]:
import os
import time

start_time = time.time()
count = 0
for f in os.listdir("indo-law-main/dataset"):
    count += 1
end_time = time.time()
print(f"Looped through {count} files in {end_time - start_time:.5}s")

time.sleep(1)

start_time = time.time()
count = 0
folder = os.listdir("indo-law-main/dataset")
for f in folder:
    count += 1
end_time = time.time()
print(f"Looped through {count} files in {end_time - start_time:.5}s")

Looped through 22630 files in 0.060027s


Looped through 22630 files in 0.026342s
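
Single timings like the two above are sensitive to caching and to the order in which they run. A minimal sketch of a more repeatable measurement, taking the best of several runs with `time.perf_counter()`. The helper `time_listing` is hypothetical, and the temporary folder only stands in for the real dataset:

```python
import os
import tempfile
import time


def time_listing(folder: str, repeats: int = 5) -> float:
    """Return the best wall-clock time (seconds) of `repeats` os.listdir() loops."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        count = 0
        for _name in os.listdir(folder):
            count += 1
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in data: 100 empty files in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    for i in range(100):
        open(os.path.join(folder, f"{i}.xml"), "w").close()
    print(f"best of 5: {time_listing(folder):.6f}s")
```

Taking the minimum rather than the average reduces the influence of one-off delays such as the first uncached directory read.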


### **Command Line Argument Processing**

Using `sys.argv[index]` allows us to access the command line arguments.

We know that the script name (`search.py`) will always be at index 0 of our command line arguments.

We also know that the section being referred to will always be argument index 1 of our command line arguments.

There are two possible versions of the "keywords" argument, which are:

1. `search.py <section> [keyword]`
2. `search.py <section> [keyword1] AND/OR/ANDNOT [keyword2]`

We can handle this by using an if statement to check whether the user input contains a keyword operator.
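
A minimal sketch of that check, assuming the two argument shapes listed above are the only valid ones. `parse_query` and `OPERATORS` are hypothetical names, and the function takes the argument list as a parameter so it can be tried without a real command line:

```python
OPERATORS = ("AND", "OR", "ANDNOT")


def parse_query(argv: list[str]) -> tuple[str, str, str, str]:
    """Split argv into (section, keyword1, operator, keyword2)."""
    args = argv[1:]  # drop the script name at index 0
    if len(args) == 2:
        # search.py <section> [keyword]
        return args[0], args[1], "", ""
    if len(args) == 4 and args[2] in OPERATORS:
        # search.py <section> [keyword1] AND/OR/ANDNOT [keyword2]
        return args[0], args[1], args[2], args[3]
    raise ValueError("usage: search.py <section> keyword1 [AND|OR|ANDNOT keyword2]")


print(parse_query(["search.py", "fakta", "narkotika"]))
print(parse_query(["search.py", "all", "pidana", "ANDNOT", "korupsi"]))
```

In the real script the call would be `parse_query(sys.argv)` after `import sys`.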

# **Attempt 1** | 2 October 2023

The goal I wish to achieve in this attempt is to simply let it run to spec. As long as the logic works out, I'm happy with it.

### **Roadbump 1** | type(search_all) -> Literal[False]


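```python
import sys

# Child tag names from the file format list (assumed contents of SECTION_NAMES)
SECTION_NAMES = (
    "kepala_putusan", "identitas", "riwayat_penahanan", "riwayat_perkara",
    "riwayat_tuntutan", "riwayat_dakwaan", "fakta", "fakta_hukum",
    "pertimbangan_hukum", "amar_putusan", "penutup",
)

search_all = True
if sys.argv[1] == "all":
    search_all = True
elif sys.argv[1] in SECTION_NAMES:
    search_all = False
    section = sys.argv[1]
else:
    # section is not defined on this path, so report the raw argument instead
    print(f"Unknown section: {sys.argv[1]}")
    sys.exit()
```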

For some reason, search_all is being set to type Literal, even when I explicitly try to change its data type to bool.

I'm thinking it could be something to do with the use of the sys module, but I can't seem to figure out why.

**Fix is currently in progress**

**Resolved:**
It turns out that the problem I was facing was completely separate from the reported issue.

### **Roadbump 2** | get_section(file, content) -> None

In [1]:
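def get_section(file: str, section_name: str):
    open_tag = f"<{section_name}>"
    close_tag = f"</{section_name}>"
    try:
        # Slice between the tags so the tags themselves are excluded
        section_start = file.index(open_tag) + len(open_tag)
        section_end = file.index(close_tag, section_start)
        return file[section_start:section_end]
    except ValueError:
        # Tag missing: falls through and implicitly returns None
        pass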

Passing a file that does not contain the specified tag makes the function return nothing, i.e. `None` (an object of type `NoneType`).

This causes an error when the return value of the function is processed as a string.

> **Resolved:** This problem was resolved by adding a return value of "", which can still be processed as a string.

In [None]:
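def get_section(file: str, section_name: str) -> str:
    # "all" means the whole document is searched
    if section_name == "all":
        return file
    open_tag = f"<{section_name}>"
    close_tag = f"</{section_name}>"
    try:
        # Slice between the tags so the tags themselves are excluded
        section_start = file.index(open_tag) + len(open_tag)
        section_end = file.index(close_tag, section_start)
        return file[section_start:section_end]
    except ValueError:
        # Tag missing: return "" so the caller can still treat the result as a string
        return ""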

### **To Do:** 
1. Input validation and exception handling
2. Retrieval of file properties
3. Runtime calculation
4. String formatting
5. Code optimization
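
For item 2, a minimal sketch, assuming "file properties" refers to the attributes on the opening `<putusan ...>` tag (such as `provinsi` or `lama_hukuman` in the sample file shown earlier). `root_attributes` is a hypothetical helper, and the regular expressions assume attribute values are always wrapped in double quotes:

```python
import re


def root_attributes(xml_text: str) -> dict[str, str]:
    """Return the attributes of the opening <putusan ...> tag as a dict."""
    opening = re.search(r"<putusan\b([^>]*)>", xml_text)
    if opening is None:
        return {}
    # Each attribute looks like name="value"
    return dict(re.findall(r'([\w-]+)="([^"]*)"', opening.group(1)))


sample = '<putusan amar="pidana" lama_hukuman="1440" provinsi="jateng">\n<kepala_putusan>'
print(root_attributes(sample))
```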


# **Attempt 2** | 3 October 2023

I plan to achieve input validation, exception handling, and string formatting in this iteration of the program.

I would also like to use some of the optimization ideas I got after writing the first version of the program.

### **Progress**

1. Input validation and exception handling has been trivially put into place
2. Optimizing program by minimizing conditional statements has proven near useless for efficiency
3. The choice between `os.scandir()` and `os.listdir()` makes no noticeable difference in this case; file access is the bottleneck

> Correction: minimizing conditional statements does in fact make a difference in efficiency; times went up from an average of 9.8s to 14-15s with larger variance from the average.

# **Attempt 3** | 7 October 2023

Most likely the final iteration of the program, using components of previous attempts.

### **Goals**

1. Implement the general structure of attempt 1.
2. Use the functions and input validation defined in attempt 2.
3. More thorough documentation of code

# **Code Optimization**

The largest bottleneck I've found is that, the very first time the code runs, the data within each file is read from cold storage.

As of October 7th, I haven't found a workaround for this hardware bottleneck.

#### **Optimization Possibilities for Attempt 1** | 3 October 2023

I found that using `with` automatically closes an opened file, which could shorten code and potentially make it more efficient.
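
A minimal sketch of reading a file this way. `read_text` is a hypothetical helper, the UTF-8 encoding is an assumption about the dataset, and the temporary file only stands in for a real XML file:

```python
import os
import tempfile


def read_text(path: str) -> str:
    """Read a whole file; the with block closes it even if read() raises."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Stand-in file so the example runs without the dataset
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("<putusan></putusan>")
    print(read_text(path))
```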

I just realized how stupid of an idea it is to loop through the argument length conditionals every single file iteration. No wonder it is taking forever to iterate ;-;

It turns out that `os.scandir()` is more lightweight than `os.listdir()`, but it returns an iterator of `DirEntry` objects instead of a `list` of file names.
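
A minimal sketch of collecting the XML paths with `os.scandir()`. `xml_paths` is a hypothetical helper, and filtering on the `.xml` suffix is an assumption about what the dataset folder contains:

```python
import os
import tempfile


def xml_paths(folder: str) -> list[str]:
    """Return sorted paths of the .xml files directly inside folder."""
    # The with block closes the scandir iterator once the list is built
    with os.scandir(folder) as entries:
        return sorted(
            entry.path
            for entry in entries
            if entry.is_file() and entry.name.endswith(".xml")
        )


# Stand-in folder so the example runs without the dataset
with tempfile.TemporaryDirectory() as folder:
    for name in ("b.xml", "a.xml", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    print([os.path.basename(p) for p in xml_paths(folder)])
```

Each `DirEntry` already carries `name` and `path`, so the full path does not need to be rebuilt with `os.path.join()` as it would with `os.listdir()`.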