# Project 3 
This notebook is the final hand-in for Project 3, as part of the NLP class taught by Georges-André Silber at Mines Paris in January 2025.

## Project Guidelines 

To view the project guidelines, see this PDF:

[Project 3 Guidelines](project3/enonce/project_3.pdf)

## Data 

The dataset contains 133 regional prefectural orders.





## First Examples
Let's classify three different prefectural orders to better understand the types of **labels**, **targets** and **text** to extract from the HTML files.

### Example 1: May 12, 2022 Order on Subterranean Pumping Tests
Here is an image of the prefectural order:

![Prefectural Order on Pump Tests](./figs/2022_05_12_pumping_img.png)

This prefectural order authorizes and regulates temporary pumping tests conducted by the company ROXANE in the municipalities of Saint-Céneri-le-Gérei and La Ferrière-Bochard. It follows ROXANE's declaration of its project to perform this operation (pursuant to Article L. 214-3 of the Environmental Code).

**Authorized action:** 
Conducting complementary pumping tests on four boreholes (F1, F2, F3, and F4) with specific flow rates for each borehole, for a limited period (from 05/16/2022 to 12/16/2022).

**Purpose of the tests:**
These tests aim to better understand the functioning of the groundwater table and to assess the potential impacts of permanent withdrawals on the environment, before any commercial exploitation of groundwater.

**Main conditions imposed:**
1. Minimum duration of tests: 7 months
2. Obligation to implement continuous piezometric monitoring
3. Implementation of daily rainfall monitoring
4. Monitoring of wetlands around the boreholes
5. Compliance with maximum specific flow rates for each borehole

For this prefectural order, a correct classification could be:
```json
{
    "arrete prefectoral n° 2350-22-00082 du 12 mai 2022": {
        "label": "authorize action",
        "target": "S.A. ROXANE",
        "text": "..."
    }
}
```

We could also suggest the following classification, by treating this prefectural order as a simple addendum to Article L. 214-3 of the Environmental Code that adds conditions specific to ROXANE's pumping test operation:
```json
{
    "arrete prefectoral n° 2350-22-00082 du 12 mai 2022": {
        "label": "add conditions",
        "target": "article L. 214-3 du code de l’environnement",
        "text": "prescriptions techniques..."
    }
}
```
It is not easy to extract all the technical requirements from the order, as there are many.  
We could summarize them using a language model, but that is not the spirit of the "text" field, which serves as justification for the "label" and "target" values.


### Example 2: August 3, 1994 Order on Liquid Storage
Here is an image of the decree:
![Prefectural Order on Liquid Storage](./figs/2022_05_12_pumping_img.png)


This decree, dated August 3, 1994, administratively authorizes the COGESAL S.A. factory in Argentan to operate its industrial activities, subject to compliance with numerous technical requirements. It takes into account a regularization request, a favorable public inquiry, and various advisory opinions.

The decree regulates the following activities:

*   Storage of slightly and extremely flammable liquids.
*   Refrigeration installations.
*   Effluent spreading.

It sets general rules concerning:

*   Water protection (discharge standards, prevention of accidental pollution, management of rainwater and wastewater).
*   Prevention of air pollution (limitation of smoke, gas emissions, etc.).
*   Waste disposal.
*   Noise nuisances (compliance with standards).
*   Electrical installations (safety and control).
*   Traffic rules on the site.
*   Prevention of the risks of fire and explosion (detection, emergency resources, smoking ban, fire permit).
*   Dairy and dairy product manufacturing activities.
*   Refrigeration installations operating with ammonia.
*   Combustion installations.

The decree also specifies deadlines for certain studies or compliance, control methods, and obligations to publicize the decision. It repeals several previous decrees and declaration receipts.


For this prefectural order, a correct classification could be:
```json
{
    "arrete prefectoral du 3 août 1994": {
        "label": "authorize action",
        "target": "COGESAL S.A.",
        "text": "..."
    }
}
```
### Thoughts on the manual classification
We can see that the only true classification step is choosing the **label**, such as **replace all** or **authorize action**. 
The rest of the task consists of extracting the target company and the name and date of the prefectural order, then formatting this information correctly. 
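
The extraction and formatting step described above can be sketched as follows. This is only an illustration under stated assumptions: the regular expression `ORDER_TITLE_RE` and the helpers `find_order_title` and `build_record` are hypothetical names, and the pattern assumes that titles look like the two examples shown earlier. The label and target are passed in by hand here, since choosing them is the actual classification problem.

```python
import json
import re

# Hypothetical pattern (an assumption, not taken from the project): it matches titles
# such as "arrêté préfectoral n° 2350-22-00082 du 12 mai 2022"
# or "arrete prefectoral du 3 août 1994".
ORDER_TITLE_RE = re.compile(
    r"arr[êe]t[ée]\s+pr[ée]fectoral\s+"
    r"(?:n\s*[°º]\s*(?P<number>[\w-]+)\s+)?"
    r"du\s+(?P<date>\d{1,2}(?:er)?\s+\w+\s+\d{4})",
    re.IGNORECASE,
)


def find_order_title(text):
    """Return the first order title found in the text (lowercased), or None."""
    match = ORDER_TITLE_RE.search(text)
    if match is None:
        return None
    # Collapse runs of whitespace so the title can be used as a dictionary key.
    return " ".join(match.group(0).lower().split())


def build_record(text, label, target, justification):
    """Format one classification in the same shape as the JSON examples."""
    title = find_order_title(text)
    if title is None:
        return None
    return {title: {"label": label, "target": target, "text": justification}}


if __name__ == "__main__":
    sample = "PREFECTURE\nArrêté préfectoral n° 2350-22-00082 du 12 mai 2022 ..."
    record = build_record(sample, "authorize action", "S.A. ROXANE", "...")
    print(json.dumps(record, ensure_ascii=False, indent=4))
```

Note that the key keeps whatever accents appear in the source text, whereas the manual examples above use unaccented keys; normalizing accents would be a separate step.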

We can now formulate a general structure that these documents have, which will help us better filter out the unnecessary information and extract the correct elements for classification. 

### General structure of the prefectural orders

**1. Header: Document and Authority Identification**

**2. Vus (Legal and Procedural Grounds)**

This section demonstrates that the prefecture followed all legal procedures and consulted all stakeholders before making its decision.

**3. Central Parts of the Document, Classification of the Establishment's Activities:**

This section is important because it describes the activities of the establishment that are subject to authorization.

**4. Main Articles (Prescriptions and Obligations)**

This section contains the articles that define the operator's obligations. It can be divided into several parts:

**5. Final Provisions (Notification, Publicity, Copies)**

**6. Signature**

*   The document is signed by the Prefect (or by delegation), attesting to the validity of the order.


### Preprocessing
To reduce the size of documents, we can filter out parts that don't serve classification purposes.

Only sections 1 (Header) and 3 (Central parts of the document and classification of the establishment's activities) appear relevant to our task, so the rest can be filtered out.
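
As one piece of this filtering, the "Vu" recitals of section 2 could be dropped with a regular expression. The sketch below is a minimal illustration, not the notebook's actual preprocessing: `drop_recitals` is a hypothetical helper, and it assumes that the text has already been extracted from the HTML, that paragraphs are separated by blank lines, and that every recital paragraph starts with the word "Vu".

```python
import re

# Assumption: a recital is any paragraph whose first word is "Vu" (any casing).
# The word boundary keeps unrelated words such as "Vue" from matching.
RECITAL_RE = re.compile(r"\s*vu\b", re.IGNORECASE)


def drop_recitals(text):
    """Remove paragraphs that start with "Vu" and keep everything else."""
    # Split on blank lines to get one string per paragraph.
    paragraphs = re.split(r"\n\s*\n", text)
    kept = [p for p in paragraphs if not RECITAL_RE.match(p)]
    return "\n\n".join(kept)
```

The other sections to discard (main articles, final provisions, signature) would need their own markers, which depend on how consistently the orders are formatted.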

### What types of labels should we account for in the classification process? 

A large number of labels may make the classification less useful because it becomes overly specific. Too few labels could lead to inaccurate classifications, where some prefectural orders receive labels that fit poorly. 

Here are the main labels that will be sought out in the prefectoral order texts: 
- **authorize action**. Most prefectural orders fall into this category, as they authorize a company to carry out an activity, such as an industrial project in a given location. Here, the target of the prefectural order is a company, and the establishment affected by the order is that same company. 
- **replace or modify existing order**. Some prefectural orders fall into this category because they modify an existing law, order or set of prescriptions that applies to a given establishment's activities. In this case, the target of the prefectural order is the law, order or prescription text being modified or replaced, while the establishment affected is the company. 
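
To keep the label set small and explicit, the two main labels above could be fixed in code and each output entry checked against them. This is a hedged sketch: the names `Label`, `REQUIRED_KEYS`, `ALLOWED_LABELS` and `is_valid_entry` are hypothetical, and the label strings are assumed to be spelled as in the list above.

```python
from enum import Enum


class Label(str, Enum):
    """The main labels listed above (assumed spelling)."""

    AUTHORIZE_ACTION = "authorize action"
    REPLACE_OR_MODIFY = "replace or modify existing order"


# Keys used by the JSON examples earlier in this notebook.
REQUIRED_KEYS = {"label", "target", "text"}
ALLOWED_LABELS = {label.value for label in Label}


def is_valid_entry(entry):
    """Check that an entry has the three expected keys and a known label."""
    return REQUIRED_KEYS <= set(entry) and entry["label"] in ALLOWED_LABELS
```

Adding a label later, such as the "add conditions" variant discussed in Example 1, would only require a new member in `Label`.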


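In [1]:
from pathlib import Path

# Assumption: DATA_DIR was never defined in this notebook; point it at the folder
# that holds the orders (the path below is a placeholder to adjust).
DATA_DIR = Path("data")

# select the first file (sorted so the choice is reproducible)
file = sorted(DATA_DIR.iterdir())[0]

# read file
with open(file, "r") as f:
    text = f.read()
print(text)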
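
Since the orders are stored as HTML files, the raw content of a file still contains markup. A minimal sketch for recovering plain text with the standard library's `html.parser` is shown below; the names `TextExtractor` and `html_to_text` are hypothetical, and the approach assumes that dropping all tags (plus the contents of `<script>` and `<style>`) is acceptable for these documents.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A third-party parser could be used instead; the standard library version is shown only to keep the example free of extra dependencies.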

### Preprocessing

To reduce the size of the documents, we can filter out the parts that do not help with classification. 
The administrative recitals ("Vu...") consist of: 
- reference laws and decrees (Law 76-663, Decree 77-1133, Law 92-3 on water)
- procedural documents (public inquiry, plans)
- the opinions of the various stakeholders:
    - Municipal Councils
    - Departmental services (DDASS, DDE, DDAF, SDIS)
    - Departmental Hygiene Council (Conseil Départemental d'Hygiène)

This part is not useful for classification. We can remove it using regular expressions. Similarly, 