<a href="https://colab.research.google.com/github/rickiepark/MLQandAI/blob/main/supplementary/q15-text-augment/noise-injection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Noise Injection for Data Augmentation

In [1]:
!pip install watermark

%load_ext watermark
%watermark -a 'Sebastian Raschka' -v

Author: Sebastian Raschka

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0



### Random Character Insertion

In [2]:
import random
import string


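def random_character_insertion(text, insertion_rate=0.1):
    """Insert random ASCII letters at random positions in `text`."""
    # The number of insertions is fixed up front from the original length
    num_insertions = int(len(text) * insertion_rate)

    for _ in range(num_insertions):
        # randint is inclusive on both ends, so a letter can also land at the very end
        position = random.randint(0, len(text))
        character = random.choice(string.ascii_letters)
        text = text[:position] + character + text[position:]

    return text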

In [3]:
random.seed(1)


text = "The cat jumped over the dog."
augmented_text = random_character_insertion(text)
print("Random character insertion:", augmented_text)

Random character insertion: The Kcat jumped over the doZg.


In [4]:
import difflib


d = difflib.Differ()
diff = d.compare(text,
                 augmented_text)

print('\n'.join(diff))

  T
  h
  e
   
+ K
  c
  a
  t
   
  j
  u
  m
  p
  e
  d
   
  o
  v
  e
  r
   
  t
  h
  e
   
  d
  o
+ Z
  g
  .
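
The per-character listing printed by `Differ` grows with the length of the text. As a more compact alternative, the following sketch uses `difflib.SequenceMatcher` to report only the spans that changed. The helper name `changed_spans` is hypothetical and not part of the original notebook.

```python
import difflib


def changed_spans(original, augmented):
    """Return (tag, original_piece, augmented_piece) for each non-equal span."""
    matcher = difflib.SequenceMatcher(None, original, augmented)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Tags are 'replace', 'delete', 'insert', or 'equal'; keep only the changes
        if tag != "equal":
            spans.append((tag, original[i1:i2], augmented[j1:j2]))
    return spans


# Strings copied from the insertion example in this section
original = "The cat jumped over the dog."
augmented = "The Kcat jumped over the doZg."
for span in changed_spans(original, augmented):
    print(span)
```

For these two strings it reports two `insert` spans, one for `K` and one for `Z`.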


### Random Character Deletion

In [5]:
import random

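def random_character_deletion(text, deletion_rate=0.1):
    """Delete randomly chosen characters from `text`."""
    # The number of deletions is fixed up front from the original length
    num_deletions = int(len(text) * deletion_rate)

    for _ in range(num_deletions):
        # Stop early if nothing is left to delete
        if len(text) == 0:
            break
        position = random.randint(0, len(text) - 1)
        text = text[:position] + text[position + 1:]

    return text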

In [6]:
random.seed(1)


text = "The cat jumped over the dog."
augmented_text = random_character_deletion(text)
print("Random character deletion:", augmented_text)

Random character deletion: The at jumped overthe dog.


In [7]:
import difflib


d = difflib.Differ()
diff = d.compare(text,
                 augmented_text)

print('\n'.join(diff))

  T
  h
  e
   
- c
  a
  t
   
  j
  u
  m
  p
  e
  d
   
  o
  v
  e
  r
-  
  t
  h
  e
   
  d
  o
  g
  .
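
The loop-based deletion re-slices the string once per removed character. As a possible alternative, the sketch below samples all positions up front and rebuilds the string in a single pass. The function name `delete_characters_once` and the `rng` parameter are assumptions introduced for illustration; they are not part of the original notebook.

```python
import random


def delete_characters_once(text, deletion_rate=0.1, rng=None):
    """Delete int(len(text) * deletion_rate) distinct characters in one pass."""
    rng = rng or random.Random()
    # Clamp the count so that sample() never asks for more positions than exist
    num_deletions = max(0, min(len(text), int(len(text) * deletion_rate)))
    # Choose distinct positions first, then keep every character not chosen
    drop = set(rng.sample(range(len(text)), num_deletions))
    return "".join(ch for i, ch in enumerate(text) if i not in drop)


rng = random.Random(1)  # local generator; the global random state is untouched
text = "The cat jumped over the dog."
print(delete_characters_once(text, rng=rng))
```

Because the positions are drawn differently, this variant will generally not reproduce the loop-based function's output for the same seed. It only preserves the same number of deletions.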


### Typo Introduction

In [8]:
import random

def typo_introduction(text, introduction_rate=0.1):
    """Introduce typos into `text` by swapping randomly chosen adjacent characters."""
    num_typos = int(len(text) * introduction_rate)

    for _ in range(num_typos):
        # Ensure there are at least two characters to swap
        if len(text) < 2:
            break
        # Pick the left character of the pair, so the last index is excluded
        position = random.randint(0, len(text) - 2)
        text = text[:position] + text[position + 1] + text[position] + text[position + 2:]

    return text

In [9]:
random.seed(1)


text = "The cat jumped over the dog."
augmented_text = typo_introduction(text)
print("Typo introduction:", augmented_text)

Typo introduction: The act jumped ove rthe dog.


In [10]:
import difflib


d = difflib.Differ()
diff = d.compare(text,
                 augmented_text)

print('\n'.join(diff))

  T
  h
  e
   
+ a
  c
- a
  t
   
  j
  u
  m
  p
  e
  d
   
  o
  v
  e
+  
  r
-  
  t
  h
  e
   
  d
  o
  g
  .
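
For augmentation, one usually wants several noisy copies of each sentence rather than a single one. The sketch below shows one way to do that with a function that relies on the global `random` module, as the functions in this notebook do. The helper `augment_many` is hypothetical, and `swap_adjacent` is a stand-in that repeats the adjacent-swap logic so that the block runs on its own.

```python
import random


def swap_adjacent(text, rate=0.1):
    """Stand-in with the same adjacent-swap logic as typo_introduction."""
    for _ in range(int(len(text) * rate)):
        if len(text) < 2:
            break
        i = random.randint(0, len(text) - 2)
        text = text[:i] + text[i + 1] + text[i] + text[i + 2:]
    return text


def augment_many(text, augment_fn, num_variants=3, seed=0):
    """Return num_variants noisy copies of text, one per consecutive seed."""
    variants = []
    for k in range(num_variants):
        # Reseed the global generator that augment_fn draws from
        random.seed(seed + k)
        variants.append(augment_fn(text))
    return variants


text = "The cat jumped over the dog."
for variant in augment_many(text, swap_adjacent):
    print(variant)
```

Reseeding the global generator makes the variants reproducible, but it also affects any other code that draws from `random` afterwards. Swapping adjacent characters never changes the length or the set of characters, which gives a quick sanity check on the output.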
