#1. DATA ACQUISITION

a. DATA IS AVAILABLE:

i. Data on your desk

ii. Data in database

iii. Less data (augment it with the techniques below)


1. Synonym Replacement
2. Bigram Flip: alter the word sequence
3. Back Translation: translate text to another language and back to the original language
4. Adding noise





#1. Synonym Replacement

In [None]:
import random
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('omw-1.4')  
nltk.download('punkt')   

[nltk_data] Error loading wordnet: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading omw-1.4: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

In [2]:
text = "The movie was absolutely fantastic and enjoyable"

In [3]:
def get_synonyms(word):
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonym = lemma.name().replace('_', ' ')
            if synonym.lower() != word.lower():
                synonyms.add(synonym)
    return list(synonyms)

def synonym_replacement(text, n=2):
    """Replace up to n distinct alphabetic words in text with random WordNet synonyms."""
    words = nltk.word_tokenize(text)
    new_words = words.copy()
    # Candidate words: unique alphabetic tokens, visited in random order.
    random_word_list = list({word for word in words if word.isalpha()})
    random.shuffle(random_word_list)

    num_replaced = 0
    for word in random_word_list:
        synonyms = get_synonyms(word)
        if synonyms:
            synonym = random.choice(synonyms)
            # Replace every occurrence of the chosen word.
            new_words = [synonym if w == word else w for w in new_words]
            num_replaced += 1
        if num_replaced >= n:
            break

    return ' '.join(new_words)

In [None]:
original = "The movie was absolutely fantastic and enjoyable"
augmented = synonym_replacement(original, n=2)

print("Original:", original)
print("Augmented:", augmented)  

Original: The movie was absolutely fantastic and enjoyable
Augmented: The film exist absolutely fantastic and enjoyable
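
The output above replaces "was" with "exist", which is a valid WordNet lemma but reads badly. One possible refinement is to skip common function words before choosing replacements. The sketch below is self-contained so that it runs without WordNet: `TOY_SYNONYMS` and `STOP_WORDS` are small hand-written stand-ins invented for illustration, and `synonym_replacement_skip_stopwords` is a hypothetical name, not part of NLTK.

```python
import random

# Hypothetical, hand-written synonym table standing in for WordNet.
TOY_SYNONYMS = {
    "movie": ["film", "picture"],
    "was": ["exist"],  # the kind of poor match a thesaurus lookup can return
    "fantastic": ["wonderful", "terrific"],
    "enjoyable": ["pleasant", "delightful"],
}

# Small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "was", "and", "is", "a", "an", "of"}

def synonym_replacement_skip_stopwords(text, n=2, rng=None):
    """Replace up to n words that have synonyms and are not stop words."""
    rng = rng or random.Random()
    words = text.split()
    # Positions of words that are eligible for replacement.
    candidates = [
        i for i, w in enumerate(words)
        if w.lower() not in STOP_WORDS and w.lower() in TOY_SYNONYMS
    ]
    rng.shuffle(candidates)
    new_words = words.copy()
    for i in candidates[:n]:
        new_words[i] = rng.choice(TOY_SYNONYMS[words[i].lower()])
    return " ".join(new_words)

example = synonym_replacement_skip_stopwords(
    "The movie was absolutely fantastic and enjoyable",
    n=2,
    rng=random.Random(0),  # seeded so repeated runs give the same result
)
print("Augmented:", example)
```

With the stop-word filter in place, "was" is never a candidate, so only content words such as "movie" or "fantastic" can change.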


#2. Bigram Flip: Alter word sequence

In [None]:
import random
import nltk
nltk.download('punkt')

def bigram_flip(text):
    words = nltk.word_tokenize(text)
    new_words = words.copy()

    indices = list(range(len(words) - 1))
    if not indices:
        return text 
    flip_index = random.choice(indices)

    new_words[flip_index], new_words[flip_index + 1] = new_words[flip_index + 1], new_words[flip_index]

    return ' '.join(new_words)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
text2 = "The movie was absolutely fantastic and enjoyable"
augmented2 = bigram_flip(text2)
print("Original:", text2)
print("Augmented:", augmented2)

Original: The movie was absolutely fantastic and enjoyable
Augmented: movie The was absolutely fantastic and enjoyable


#3. Back Translation: Translate text to another language and back to the original language

In [None]:
!pip install deep-translator

In [None]:
from deep_translator import GoogleTranslator

def back_translate_verbose(text, intermediate_lang='fr'):
    try:
        translated = GoogleTranslator(source='auto', target=intermediate_lang).translate(text)

        back_translated = GoogleTranslator(source='auto', target='en').translate(translated)

        print(f"Original: {text}")
        print(f"Translated ({intermediate_lang}): {translated}")
        print(f"Back Translated (English): {back_translated}")

        return back_translated
    except Exception as e:
        print("Translation error:", e)
        return text

In [None]:
text3 = "The movie was absolutely fantastic and enjoyable"
augmented3 = back_translate_verbose(text3, intermediate_lang='fr')

Original: The movie was absolutely fantastic and enjoyable
Translated (fr): Le film était absolument fantastique et agréable
Back Translated (English): The film was absolutely fantastic and pleasant


#4. Adding Noise

There are several ways to introduce noise:

1. Random character swaps

2. Random deletions

3. Keyboard typos

1. Random character swaps

In [None]:
#Random character swaps
import random

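def add_noise(text, noise_level=0.1):
    """Swap randomly chosen pairs of adjacent characters."""
    text_chars = list(text)
    if len(text_chars) < 2:
        return text  # nothing to swap; also avoids randint(0, -1) raising ValueError
    num_noisy = int(len(text_chars) * noise_level)

    for _ in range(num_noisy):
        idx = random.randint(0, len(text_chars) - 2)
        text_chars[idx], text_chars[idx + 1] = text_chars[idx + 1], text_chars[idx]

    return ''.join(text_chars)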

In [None]:
text4 = "The movie was absolutely fantastic and enjoyable"
augmented4 = add_noise(text4, noise_level=0.1) 
print("Original:", text4)
print("Augmented (Noisy):", augmented4)

Original: The movie was absolutely fantastic and enjoyable
Augmented (Noisy): hTe movei was absolutely anftastic and enjoyable


2. Random deletions

In [None]:
#Random deletion
import random

def random_deletion(text, deletion_prob=0.2):
    words = text.split()
    if len(words) <= 1:
        return text  # nothing to delete; also avoids random.choice on an empty list

    new_words = []
    for word in words:
        r = random.random()
        if r > deletion_prob:
            new_words.append(word)

    if not new_words:
        new_words.append(random.choice(words))
    return ' '.join(new_words)

In [None]:
text5 = "The movie was absolutely fantastic and enjoyable"
augmented5 = random_deletion(text5, deletion_prob=0.2)
print("Original:", text5)
print("Augmented (Random Deletion):", augmented5)

Original: The movie was absolutely fantastic and enjoyable
Augmented (Random Deletion): The movie fantastic and enjoyable


3. Keyboard typos

In [None]:
#keyboard typos
import random

qwerty_neighbors = {
    'a': ['s', 'q', 'z'],
    'b': ['v', 'g', 'h', 'n'],
    'c': ['x', 'd', 'f', 'v'],
    'd': ['s', 'e', 'r', 'f', 'c', 'x'],
    'e': ['w', 's', 'd', 'r'],
    'f': ['d', 'r', 't', 'g', 'v', 'c'],
    'g': ['f', 't', 'y', 'h', 'b', 'v'],
    'h': ['g', 'y', 'u', 'j', 'n', 'b'],
    'i': ['u', 'j', 'k', 'o'],
    'j': ['h', 'u', 'i', 'k', 'n', 'm'],
    'k': ['j', 'i', 'o', 'l', 'm'],
    'l': ['k', 'o', 'p'],
    'm': ['n', 'j', 'k'],
    'n': ['b', 'h', 'j', 'm'],
    'o': ['i', 'k', 'l', 'p'],
    'p': ['o', 'l'],
    'q': ['a', 's', 'w'],
    'r': ['e', 'd', 'f', 't'],
    's': ['a', 'w', 'e', 'd', 'x', 'z'],
    't': ['r', 'f', 'g', 'y'],
    'u': ['y', 'h', 'j', 'i'],
    'v': ['c', 'f', 'g', 'b'],
    'w': ['q', 'a', 's', 'e'],
    'x': ['z', 's', 'd', 'c'],
    'y': ['t', 'g', 'h', 'u'],
    'z': ['a', 's', 'x']
}

def keyboard_typo(text, typo_prob=0.1):
    new_text = []
    for char in text:
        if char.lower() in qwerty_neighbors and random.random() < typo_prob:
            replacement = random.choice(qwerty_neighbors[char.lower()])
            new_char = replacement.upper() if char.isupper() else replacement
            new_text.append(new_char)
        else:
            new_text.append(char)
    return ''.join(new_text)

In [74]:
# Example
text6 = "The movie was absolutely fantastic and enjoyable"
augmented6 = keyboard_typo(text6, typo_prob=0.1)
print("Original:", text6)
print("Augmented (Keyboard Typos):", augmented6)

Original: The movie was absolutely fantastic and enjoyable
Augmented (Keyboard Typos): The movie was absolutely fantastic and enjoyanle
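
Each noise function above returns one variant per call. When data is scarce, several distinct variants per sentence are often wanted. The following is a minimal sketch of one way to do that, assuming a seeded `random.Random` so results are reproducible. The helpers `swap_chars`, `drop_word`, and `augment_many` are hypothetical, simplified stand-ins written for this example, not the functions defined earlier.

```python
import random

def swap_chars(text, rng):
    """Swap one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    chars = list(text)
    i = rng.randrange(len(chars) - 1)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def drop_word(text, rng):
    """Delete one randomly chosen word, always keeping at least one."""
    words = text.split()
    if len(words) <= 1:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)

def augment_many(text, k=3, seed=0, max_tries=50):
    """Collect up to k distinct variants that differ from the input."""
    rng = random.Random(seed)
    ops = [swap_chars, drop_word]
    variants = []
    for _ in range(max_tries):
        if len(variants) == k:
            break
        candidate = rng.choice(ops)(text, rng)
        # Keep only new variants that actually changed the text.
        if candidate != text and candidate not in variants:
            variants.append(candidate)
    return variants

sentence = "The movie was absolutely fantastic and enjoyable"
variants = augment_many(sentence, k=3, seed=42)
for v in variants:
    print("Augmented:", v)
```

Passing the random generator into each helper, rather than relying on the module-level `random` state, keeps a run repeatable for a given seed.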


#Continue...

b. DATA FROM OTHER RESOURCES

i. Public datasets

ii. Web scraping


1. Beautiful Soup
2. Selenium

iii. APIs: RapidAPI provides a large catalog of available APIs

iv. PDFs


1. Beautiful Soup

In [None]:
!pip install requests beautifulsoup4




In [92]:
import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com/"
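response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors instead of parsing an error page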
soup = BeautifulSoup(response.content, "html.parser")

quotes_data = soup.find_all("div", class_="quote")

print("Quotes with HTML Tags:\n")
for i, quote_block in enumerate(quotes_data[:5], 1):
    quote_html = str(quote_block.find("span", class_="text"))
    author_html = str(quote_block.find("small", class_="author"))
    tag_elements = quote_block.find_all("a", class_="tag")
    tags_html = [str(tag) for tag in tag_elements]

    print(f"{i}. Quote: {quote_html}")
    print(f"   Author: {author_html}")
    print(f"   Tags: {' | '.join(tags_html)}\n")


Quotes with HTML Tags:

1. Quote: <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
   Author: <small class="author" itemprop="author">Albert Einstein</small>
   Tags: <a class="tag" href="/tag/change/page/1/">change</a> | <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a> | <a class="tag" href="/tag/thinking/page/1/">thinking</a> | <a class="tag" href="/tag/world/page/1/">world</a>

2. Quote: <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
   Author: <small class="author" itemprop="author">J.K. Rowling</small>
   Tags: <a class="tag" href="/tag/abilities/page/1/">abilities</a> | <a class="tag" href="/tag/choices/page/1/">choices</a>

3. Quote: <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though ev
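
The output above still carries the HTML tags. Inside the notebook's own Beautiful Soup loop, calling `.get_text()` on each found element is the usual way to obtain plain text. As an offline illustration of the same idea, the sketch below strips tags from strings copied from the output above using only the standard library's `html.parser`; `TextExtractor` and `strip_tags` are hypothetical helper names introduced for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    """Return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

# Fragments copied from the scraped output shown above.
author_html = '<small class="author" itemprop="author">J.K. Rowling</small>'
tags_html = (
    '<a class="tag" href="/tag/abilities/page/1/">abilities</a> | '
    '<a class="tag" href="/tag/choices/page/1/">choices</a>'
)

print("Author:", strip_tags(author_html))
print("Tags:", strip_tags(tags_html))
```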

#Continue...

c. NOBODY HAS THE DATA

i. Engage a trusted client

ii. Data Generation
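
One simple reading of data generation is to fill sentence templates from small vocabularies, which yields labeled text without any source corpus. The sketch below illustrates that idea only; the templates, word lists, labels, and the `generate_examples` function are all invented for this example and are not taken from any dataset or library.

```python
import random

# Hypothetical templates and vocabularies, invented for illustration only.
TEMPLATES = [
    "The {item} was {adjective}",
    "I found the {item} {adjective}",
]
ITEMS = ["movie", "plot", "soundtrack"]
ADJECTIVES = {
    "positive": ["fantastic", "enjoyable"],
    "negative": ["boring", "disappointing"],
}

def generate_examples(count, seed=0):
    """Return count (text, label) pairs built by filling templates at random."""
    rng = random.Random(seed)
    examples = []
    for _ in range(count):
        # Pick the label first so the adjective always matches it.
        label = rng.choice(sorted(ADJECTIVES))
        text = rng.choice(TEMPLATES).format(
            item=rng.choice(ITEMS),
            adjective=rng.choice(ADJECTIVES[label]),
        )
        examples.append((text, label))
    return examples

samples = generate_examples(5, seed=1)
for text, label in samples:
    print(label, "|", text)
```

Generated text of this kind is repetitive, so it can in turn be passed through the augmentation techniques from part a to add variety.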