# **Data augmentation: semantic-driven method**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pickle
import numpy as np
import random
from random import randrange
from collections import Counter

from lxml import html
import requests

# Download the WordNet data (omw-1.4 provides the multilingual lemmas, including French):
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## **Load the data**

Classification task:

In [None]:
task = '3'

Load train set:

In [None]:
with open("drive/MyDrive/train_set_" + task + "_orig.pkl", "rb") as f_in:
    data_train = pickle.load(f_in)

Extract positive examples:

In [None]:
# Each record holds the segment text at index 0 and a boolean label at index 2 (True = positive).
data_train_positive = [row[0] for row in data_train if row[2]]

In [None]:
data_train_positive[0:5]

['Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes terrains de camping ou de caravanage permanents visés à l’article L.443-1 et L.444-1 du \ncode de l’urbanisme.',
 'Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes habitations légères de loisirs.',
 'Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes constructions destinées à l’habitation ne dépendant pas d’une exploitation agricole autres \nque celles visées à l’article 2 paragraphe 1).',
 'Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes constructions destinées à l’hébergement hôtelier autres que celles visées à l’article 2 \nparagraphe 1).',
 'Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes construct

## **Perform augmentation**

### **Semantic-driven (WordNet): replace a random word with a random concept from the nomenclature enriched with WordNet**

Experiment (augmentation) name:

In [None]:
experiment = '4'

Number of times to repeat the augmentation:

In [None]:
k = 1

Load basic nomenclature concepts:

In [None]:
hierarchy = {} # nomenclature hierarchy: parent concept -> list of child concepts
nomenclature_basic = [] # list of all concepts (lowercase, no duplicates)

# Each non-empty line is expected to look like "parent: child1, child2, ...".
with open("drive/MyDrive/nomenclature", "r") as f:
    for line in f:
        try:
            textLine = line.strip()
            if textLine != '':
                data = textLine.split(':')
                parent_node = data[0].strip()
                # A line without ':' raises IndexError here and is reported below.
                child_nodes = [name.strip() for name in data[1].split(',')]
                # The root node 'objet' is not treated as a concept.
                if parent_node.lower() not in nomenclature_basic and parent_node != 'objet':
                    nomenclature_basic.append(parent_node.lower())
                for name in child_nodes:
                    if name.lower() not in nomenclature_basic:
                        nomenclature_basic.append(name.lower())
                hierarchy[parent_node] = child_nodes
        except IndexError:
            print('Invalid input:', line)
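
The loader above appears to expect one `parent: child, child, ...` entry per line. A minimal sketch of the same parsing on an in-memory sample follows. The helper name `parse_nomenclature` and the sample concepts are illustrative assumptions, not the contents of the real nomenclature file, and unlike the loader it also drops empty child names.

```python
def parse_nomenclature(lines):
    """Parse 'parent: child1, child2' lines into (hierarchy, concepts)."""
    hierarchy = {}
    concepts = []  # lowercase concepts, first-seen order, no duplicates

    def add(concept):
        concept = concept.lower()
        if concept not in concepts:
            concepts.append(concept)

    for line in lines:
        text = line.strip()
        if not text:
            continue
        parent, sep, children = text.partition(':')
        if not sep:
            # No ':' separator: report the malformed line and skip it.
            print('Invalid input:', line)
            continue
        parent = parent.strip()
        child_nodes = [c.strip() for c in children.split(',') if c.strip()]
        if parent != 'objet':  # the root node is not a concept itself
            add(parent)
        for child in child_nodes:
            add(child)
        hierarchy[parent] = child_nodes
    return hierarchy, concepts


# Hypothetical sample, for illustration only.
sample = [
    "objet: bâtiment, voirie",
    "",
    "bâtiment: habitation, Hôtel",
    "ligne sans séparateur",
]
hierarchy, concepts = parse_nomenclature(sample)
print(concepts)
```

Using `str.partition` makes the missing-separator case explicit instead of relying on an exception.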

Sanity check (number of basic concepts):

In [None]:
len(nomenclature_basic)

60

Enrich the nomenclature with synonyms from WordNet:

In [None]:
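def getWordnetSynonyms(word,k):
    # Return up to k distinct lowercase French lemmas that share a synset with `word`.
    word = word.lower()
    synonyms = []
    for syn in wordnet.synsets(word, lang='fra'):
        for l in syn.lemma_names('fra'):
            lemma = l.lower()
            # Compare the lowercased lemma so that case variants are not added twice.
            if (lemma != word) and (lemma not in synonyms):
                synonyms.append(lemma)

    return synonyms[0:k]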
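
The selection logic in `getWordnetSynonyms` (lowercase, drop the query word, keep first-seen order, truncate to `k`) can be isolated from WordNet and checked on plain lists. A small sketch under that assumption; `top_k_unique` and the sample lemmas are hypothetical:

```python
def top_k_unique(candidates, word, k):
    """Return the first k distinct lowercase candidates that differ from word."""
    word = word.lower()
    seen = []
    for candidate in candidates:
        lemma = candidate.lower()
        if lemma != word and lemma not in seen:
            seen.append(lemma)
    return seen[:k]


# Made-up lemma list with a case variant and a repeat of the query word.
lemmas = ["route", "chemin", "Chemin", "voie", "route", "itinéraire"]
print(top_k_unique(lemmas, "route", 2))
```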

In [None]:
s = 5 # top synonyms for each concept

new_nomenclature = []

for name in nomenclature_basic:
    list_synonyms = getWordnetSynonyms(name.replace(' ','_').lower(),s)
    if list_synonyms != []:
        for concept in [item.replace('_',' ') for item in list_synonyms]:
            if concept.lower() not in nomenclature_basic and concept.lower() not in new_nomenclature:
                new_nomenclature.append(concept.lower())

nomenclature_extended = nomenclature_basic + new_nomenclature
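
The same enrichment loop is repeated for each synonym source in this notebook. One way to factor it out, assuming a lookup callable with the signature `lookup(word, k)`; the name `extend_nomenclature` and the toy dictionary are illustrative only:

```python
def extend_nomenclature(basic, lookup, s):
    """Return basic plus up to s new lowercase synonyms per basic concept."""
    known = set(basic)
    new_concepts = []
    for name in basic:
        for synonym in lookup(name, s):
            concept = synonym.replace('_', ' ').lower()
            if concept not in known:
                known.add(concept)
                new_concepts.append(concept)
    return basic + new_concepts


# Toy lookup standing in for WordNet, DES or Agrovoc (hypothetical data).
toy_synonyms = {
    "route": ["chemin", "voie", "Route"],
    "chemin": ["sentier", "voie"],
}


def toy_lookup(word, k):
    return toy_synonyms.get(word, [])[:k]


extended = extend_nomenclature(["route", "chemin"], toy_lookup, 5)
print(extended)
```

For WordNet, the lookup would still need to convert spaces to underscores before querying, as the cell above does.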

Sanity check (number of concepts after enrichment):

In [None]:
len(nomenclature_extended)

134

Define function for augmentation:

In [None]:
def getNewSegments(input_data,nomenclature_concepts,k,l):
    # k - how many times to repeat the augmentation for each phrase
    # l - how many words to replace in each phrase
    generated_segments = []
    for j in range(k):
        for i in range(len(input_data)):
            phrase = input_data[i]
            # Split on single spaces; newlines stay attached to the neighbouring words:
            phrase_words = phrase.split(' ')
            new_phrase = phrase
            for m in range(l):
                # Select a random word in the phrase:
                selected_word = phrase_words[random.randrange(len(phrase_words))]
                # Pick a random concept from the (enriched) nomenclature:
                nomenclature_word = nomenclature_concepts[random.randrange(len(nomenclature_concepts))]
                # Note: this replaces the first occurrence of the selected string,
                # which is not necessarily the position that was drawn.
                new_phrase = new_phrase.replace(selected_word,nomenclature_word,1)
            # Keep only segments that are new and differ from every original phrase:
            if (new_phrase not in generated_segments) and (new_phrase not in input_data):
                generated_segments.append(new_phrase)
            if i % 10 == 0:
                print("Process",i,"segment")

    return generated_segments
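
Note that `str.replace(selected_word, ..., 1)` substitutes the first occurrence of the selected string, which may sit inside an earlier word rather than at the position that was drawn. If exact token replacement is wanted, one option is to replace by index. This is a variant sketch, not the function used to produce the outputs in this notebook; `replace_random_tokens` and the sample phrase are hypothetical:

```python
import random


def replace_random_tokens(phrase, concepts, n_replacements, rng=random):
    """Replace n_replacements randomly chosen space-separated tokens."""
    tokens = phrase.split(' ')
    for _ in range(n_replacements):
        # Draw a token position, then overwrite exactly that position.
        position = rng.randrange(len(tokens))
        tokens[position] = concepts[rng.randrange(len(concepts))]
    return ' '.join(tokens)


rng = random.Random(179)  # local generator, so the global seed is untouched
phrase = "Les habitations légères de loisirs sont interdites dans la zone"
print(replace_random_tokens(phrase, ["voie ferrée", "canal"], 1, rng))
```

Passing a dedicated `random.Random` instance keeps the example reproducible without reseeding the module-level generator.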

Perform augmentation:

In [None]:
# Fix random seed to make results reproducible:
random.seed(179)

# Generate new segments:
new_segments = getNewSegments(data_train_positive,nomenclature_extended,k,1)

Process 0 segment
Process 10 segment
Process 20 segment
Process 30 segment
Process 40 segment
Process 50 segment
Process 60 segment
Process 70 segment
Process 80 segment
Process 90 segment
Process 100 segment
Process 110 segment


Generated phrases:

In [None]:
new_segments[0:10]

['Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble ilot de circulation la zone sont interdits :\n \nLes terrains de camping ou de caravanage permanents visés à l’article L.443-1 et L.444-1 du \ncode de l’urbanisme.',
 'Article 1 tu discontinu individuel Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes habitations légères de loisirs.',
 'Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes constructions destinées moyen l’habitation ne dépendant pas d’une exploitation agricole autres \nque celles visées à l’article 2 paragraphe 1).',
 'Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes constructions destinées tu discontinu l’hébergement hôtelier autres que celles visées à l’article 2 \nparagraphe 1).',
 'Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans 

Label the generated segments as positive and append them to the training set:

In [None]:
data_new = [(i,-1,True) for i in new_segments]

data_augmented = data_train + data_new

Some stats:

In [None]:
len(data_augmented)

588

In [None]:
print("Positive examples:", sum(1 for row in data_augmented if row[2]))
print("Negative examples:", sum(1 for row in data_augmented if not row[2]))

Positive examples: 236
Negative examples: 352
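
The number of repetitions `k` controls the class balance. The counts printed in this notebook are consistent with 118 original positive segments, each yielding one new segment per repetition; that figure is an inference from the totals, not something the notebook prints. Under that assumption, a rough sketch of the smallest `k` that brings the positives up to the number of negatives (`repeats_to_balance` is a hypothetical helper):

```python
import math


def repeats_to_balance(n_positive, n_negative):
    """Smallest k such that n_positive * (1 + k) >= n_negative."""
    if n_positive <= 0:
        raise ValueError("need at least one positive example")
    return max(0, math.ceil((n_negative - n_positive) / n_positive))


# 352 negatives is printed above; 118 positives is inferred (see the note).
k_needed = repeats_to_balance(118, 352)
print(k_needed, 118 * (1 + k_needed))
```

This is an upper bound on the positives actually obtained, since `getNewSegments` discards generated segments that duplicate an existing one.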


Save results:

In [None]:
with open("drive/MyDrive/train_set_" + task + "_augm-" + experiment + ".pkl", "wb") as f_out:
    pickle.dump(data_augmented, f_out)
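
Before training on a saved file, it can be worth confirming that it reloads to the same records. A minimal round-trip sketch using a temporary directory instead of Drive; the file name and the two sample records are placeholders that only mimic the `(text, id, label)` shape used above:

```python
import os
import pickle
import tempfile

# Placeholder records in the same three-field shape as the training data.
records = [("segment text", 0, False), ("augmented segment", -1, True)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_set_demo.pkl")
    with open(path, "wb") as f_out:
        pickle.dump(records, f_out)
    with open(path, "rb") as f_in:
        reloaded = pickle.load(f_in)

print(reloaded == records)
```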

### **Semantic-driven (DES): replace a random word with a random concept from the nomenclature enriched with DES**

Experiment (augmentation) name:

In [None]:
experiment = '5'

Number of times to repeat the augmentation:

In [None]:
k = 2

Enrich nomenclature by using synonyms from the DES dictionary:



In [None]:
def getSynonymsDes(word,k):
    # Scrape up to k synonyms for `word` from the DES online dictionary.
    synonyms = []

    request_str = 'https://crisco4.unicaen.fr/des/synonymes/'+word.lower().replace(" ","+")
    page = requests.get(request_str, timeout=30)
    tree = html.fromstring(page.content)
    rows = tree.xpath('//table/tr')

    for row in rows:
        links = row.xpath('./td/a/text()')
        # Skip rows without a linked entry instead of raising IndexError.
        if links:
            synonyms.append(links[0].strip())

    return synonyms[0:k]
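
Each call to the scraper issues one HTTP request, so re-running the enrichment cell repeats every request. A small memoisation wrapper is one way to avoid that; this is a sketch assuming a fetch function with the signature `fetch(word, k)`, and `make_cached_lookup` and `fake_fetch` are hypothetical names:

```python
def make_cached_lookup(fetch):
    """Wrap fetch(word, k) so each distinct (word, k) is requested only once."""
    cache = {}

    def lookup(word, k):
        key = (word.lower(), k)
        if key not in cache:
            cache[key] = fetch(word, k)
        return cache[key]

    return lookup


# Stand-in for a network-backed function; it records how often it is called.
calls = []


def fake_fetch(word, k):
    calls.append(word)
    return ["chemin", "voie", "sentier"][:k]


cached = make_cached_lookup(fake_fetch)
first = cached("Route", 2)
second = cached("route", 2)  # served from the cache, no second fetch
print(first, len(calls))
```

In the notebook, the wrapped function would be the scraper itself, for example `make_cached_lookup(getSynonymsDes)`.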

In [None]:
s = 5 # top synonyms for each concept

new_nomenclature = []

for name in nomenclature_basic:
    list_synonyms = getSynonymsDes(name,s)
    if list_synonyms != []:
        for concept in [item.replace('_',' ') for item in list_synonyms]:
            if concept.lower() not in nomenclature_basic and concept.lower() not in new_nomenclature:
                new_nomenclature.append(concept.lower())

nomenclature_extended = nomenclature_basic + new_nomenclature

Sanity check (number of concepts after enrichment):

In [None]:
len(nomenclature_extended)

153

Perform augmentation:

In [None]:
# Fix random seed to make results reproducible:
random.seed(179)

# Generate new segments:
new_segments = getNewSegments(data_train_positive,nomenclature_extended,k,1)

Process 0 segment
Process 10 segment
Process 20 segment
Process 30 segment
Process 40 segment
Process 50 segment
Process 60 segment
Process 70 segment
Process 80 segment
Process 90 segment
Process 100 segment
Process 110 segment
Process 0 segment
Process 10 segment
Process 20 segment
Process 30 segment
Process 40 segment
Process 50 segment
Process 60 segment
Process 70 segment
Process 80 segment
Process 90 segment
Process 100 segment
Process 110 segment


Generated phrases:

In [None]:
new_segments[0:10]

['Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble ilot de circulation la zone sont interdits :\n \nLes terrains de camping ou de caravanage permanents visés à l’article L.443-1 et L.444-1 du \ncode de l’urbanisme.',
 'Article 1 tu discontinu individuel Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes habitations légères de loisirs.',
 'Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes constructions destinées part l’habitation ne dépendant pas d’une exploitation agricole autres \nque celles visées à l’article 2 paragraphe 1).',
 'Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes constructions destinées tu discontinu l’hébergement hôtelier autres que celles visées à l’article 2 \nparagraphe 1).',
 'Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l

Label the generated segments as positive and append them to the training set:

In [None]:
data_new = [(i,-1,True) for i in new_segments]

data_augmented = data_train + data_new

Some stats:

In [None]:
len(data_augmented)

706

In [None]:
print("Positive examples:", sum(1 for row in data_augmented if row[2]))
print("Negative examples:", sum(1 for row in data_augmented if not row[2]))

Positive examples: 354
Negative examples: 352


Save results:

In [None]:
with open("drive/MyDrive/train_set_" + task + "_augm-" + experiment + ".pkl", "wb") as f_out:
    pickle.dump(data_augmented, f_out)

### **Semantic-driven (Agrovoc): replace a random word with a random concept from the nomenclature enriched with Agrovoc**

Experiment (augmentation) name:

In [None]:
experiment = '6'

Number of times to repeat the augmentation:

In [None]:
k = 3

Enrich nomenclature by using synonyms from the Agrovoc dictionary:

In [None]:
def getSynonymsAgrovoc(word,k):
    # Scrape up to k matching labels for `word` from the Agrovoc search results page.
    request_str = 'https://agrovoc.fao.org/browse/agrovoc/en/search?clang=fr&q='+word.lower().replace(" ","+")
    page = requests.get(request_str, timeout=30)
    tree = html.fromstring(page.content)
    synonyms = tree.xpath('//span[@class="versal value"]/text()')

    return synonyms[0:k]

In [None]:
s = 5 # top synonyms for each concept

new_nomenclature = []

for name in nomenclature_basic:
    list_synonyms = getSynonymsAgrovoc(name,s)
    if list_synonyms != []:
        for concept in list_synonyms:
            if concept.lower() not in nomenclature_basic and concept.lower() not in new_nomenclature:
                new_nomenclature.append(concept.lower())

nomenclature_extended = nomenclature_basic + new_nomenclature

Sanity check (number of concepts after enrichment):

In [None]:
len(nomenclature_extended)

120

Perform augmentation:

In [None]:
# Fix random seed to make results reproducible:
random.seed(179)

# Generate new segments:
new_segments = getNewSegments(data_train_positive,nomenclature_extended,k,1)

Process 0 segment
Process 10 segment
Process 20 segment
Process 30 segment
Process 40 segment
Process 50 segment
Process 60 segment
Process 70 segment
Process 80 segment
Process 90 segment
Process 100 segment
Process 110 segment
Process 0 segment
Process 10 segment
Process 20 segment
Process 30 segment
Process 40 segment
Process 50 segment
Process 60 segment
Process 70 segment
Process 80 segment
Process 90 segment
Process 100 segment
Process 110 segment
Process 0 segment
Process 10 segment
Process 20 segment
Process 30 segment
Process 40 segment
Process 50 segment
Process 60 segment
Process 70 segment
Process 80 segment
Process 90 segment
Process 100 segment
Process 110 segment


Generated phrases:

In [None]:
new_segments[0:10]

['Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble tu discontinu collectif la zone sont interdits :\n \nLes terrains de camping ou de caravanage permanents visés à l’article L.443-1 et L.444-1 du \ncode de l’urbanisme.',
 'Article 1 tronçon Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes habitations légères de loisirs.',
 'Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes constructions destinées chemin de fer l’habitation ne dépendant pas d’une exploitation agricole autres \nque celles visées à l’article 2 paragraphe 1).',
 "Article 1 : Occupations ou utilisations du sol interdites\n \n1) Dans l’ensemble de la zone sont interdits :\n \nLes constructions destinées canal d'irrigation l’hébergement hôtelier autres que celles visées à l’article 2 \nparagraphe 1).",
 'Article 1 : travail du bois ou utilisations du sol interdites\n \n1) D

Label the generated segments as positive and append them to the training set:

In [None]:
data_new = [(i,-1,True) for i in new_segments]

data_augmented = data_train + data_new

Some stats:

In [None]:
len(data_augmented)

824

In [None]:
print("Positive examples:", sum(1 for row in data_augmented if row[2]))
print("Negative examples:", sum(1 for row in data_augmented if not row[2]))

Positive examples: 472
Negative examples: 352


Save results:

In [None]:
with open("drive/MyDrive/train_set_" + task + "_augm-" + experiment + ".pkl", "wb") as f_out:
    pickle.dump(data_augmented, f_out)