# Data Acquisition

How to get the data?
1. Use a public dataset
2. Scrape data
3. Product intervention: the AI team works with the product team to collect more data through better instrumentation.
4. Data augmentation
    - Synonym replacement
    - Back translation
    - TF-IDF-based word replacement
    - Bigram flipping
    - Replacing entities
    - Adding noise to data
    - Advanced techniques: Snorkel, Easy Data Augmentation (EDA), active learning
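
A few of the augmentation techniques listed above can be sketched in plain Python. This is only a minimal illustration: the `SYNONYMS` table and the function names are hypothetical, and a real pipeline would likely draw synonyms from a thesaurus such as WordNet rather than a hand-written dictionary.

```python
import random

# Hypothetical toy synonym table (an assumption for illustration only).
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
}

def synonym_replace(tokens, rng, p=0.5):
    """Replace each token that has a known synonym with probability p."""
    out = []
    for tok in tokens:
        if tok in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return out

def bigram_flip(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def add_char_noise(token, rng):
    """Simulate a typo by swapping two adjacent characters."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]

rng = random.Random(0)  # fixed seed so runs are repeatable
tokens = "the quick dog is happy".split()
print(synonym_replace(tokens, rng))
print(bigram_flip(tokens, rng))
print(add_char_noise("happy", rng))
```

Each function returns a new list or string, so the original sample stays available and several augmented variants can be generated from it.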

# Text Extraction and Cleanup

## HTML parsing and cleanup

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [2]:
myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"

html = urlopen(myurl).read()
soupified = BeautifulSoup(html, "html.parser")

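# The class names below reflect Stack Overflow's markup at the time of writing and may change;
# find() returns None when nothing matches.
question = soupified.find("div", {"class": "question"})
questiontext = question.find("div", {"class": "s-prose js-post-body"})
print("Question: \n", questiontext.get_text().strip())

# find() returns the first match, i.e. the first answer listed on the page.
answer = soupified.find("div", {"class": "answer"})
answertext = answer.find("div", {"class": "s-prose js-post-body"})
print("Best answer: \n", answertext.get_text().strip())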

Question: 
 What is the module/method used to get the current time?
Best answer: 
 Use:
>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)

>>> print(datetime.datetime.now())
2009-01-06 15:08:24.789150

And just the time:
>>> datetime.datetime.now().time()
datetime.time(15, 8, 24, 78915)

>>> print(datetime.datetime.now().time())
15:08:24.789150

See the documentation for more information.
To save typing, you can import the datetime object from the datetime module:
>>> from datetime import datetime

Then remove the leading datetime. from all of the above.
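
The cleanup step can also be tried offline on a hard-coded snippet. The sketch below swaps BeautifulSoup for the standard library's `html.parser` and keeps only visible text, skipping `<script>` and `<style>` content. The `TextExtractor` class, the `html_to_text` helper, and the sample HTML are hypothetical names and data introduced for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))  # Hello world
```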


## Unicode normalization

In [3]:
text = 'I love 🍕!  Shall we book a 🚗 to gizza?'
Text = text.encode("utf-8")
print(Text)

b'I love \xf0\x9f\x8d\x95!  Shall we book a \xf0\x9f\x9a\x97 to gizza?'
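
Encoding to bytes is one part of handling Unicode text. Another is making sure that visually identical strings compare equal. A minimal sketch using the standard library's `unicodedata.normalize` follows; the sample strings are chosen purely for illustration.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# The two strings render the same but differ code point by code point.
print(composed == decomposed)  # False

# NFC composes characters, so both forms normalize to the same string.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)  # True

# NFKC additionally folds compatibility characters, e.g. the ligature U+FB01 into 'fi'.
print(unicodedata.normalize("NFKC", "\ufb01le"))  # file
```

Normalizing all input to one form before tokenization or lookup avoids treating such variants as different words.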


## Spelling correction

In [4]:
# Requires an upgraded Azure account and a Bing Spell Check subscription key.
# Tutorial: https://docs.microsoft.com/en-us/azure/cognitive-services/bing-spell-check/quickstarts/python

import requests
import json

api_key = "<ENTER-KEY-HERE>"
example_text = "Hollo, wrld"
endpoint = "https://api.cognitive.microsoft.com/bing/v7.0/SpellCheck"

data = {'text': example_text}
params = {
    'mkt':'en-us',
    'mode':'proof'
    }
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Ocp-Apim-Subscription-Key': api_key,
    }
response = requests.post(endpoint, headers=headers, params=params, data=data)
json_response = response.json()
print(json.dumps(json_response, indent=4))

{
    "error": {
        "code": "401",
        "message": "Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource."
    }
}
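
Without a valid subscription key the request above returns only the 401 error. As an offline alternative, here is a minimal sketch of a dictionary-based corrector that proposes words within one edit of the input. This is a different technique from the Bing API, and the `WORD_COUNTS` table is a hypothetical toy vocabulary; a usable corrector would count words in a large corpus.

```python
from collections import Counter

# Hypothetical toy word-frequency table (an assumption for illustration only).
WORD_COUNTS = Counter({"hello": 50, "world": 40, "word": 30, "help": 20, "hollow": 5})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """Return all strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, or the input unchanged."""
    w = word.lower()
    if w in WORD_COUNTS:
        return w
    candidates = [c for c in edits1(w) if c in WORD_COUNTS]
    if not candidates:
        return word
    return max(candidates, key=WORD_COUNTS.__getitem__)

print([correct(w) for w in ["hollo", "wrld"]])  # ['hello', 'world']
```

Choosing the most frequent candidate is a simple heuristic; it ignores context, so it cannot fix real-word errors such as "form" typed for "from".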


## System-specific error correction

In [7]:
from PIL import Image
import pytesseract
from pytesseract import image_to_string

pytesseract.pytesseract.tesseract_cmd = r'D:\Program Files\Tesseract-OCR\tesseract.exe'

filename = "scanned_document.png"
text = image_to_string(Image.open(filename))
print(text)

in the nineteenth century the only Kind of linguistics considered
seriously was this comparative and historical study of words in languages
known or believed to be cognate—say the Semitic languages, or the Indo-
European languages. It is significant that the Germans who really made
the subject what it was, used the term Indo-germanisch. Those who know
the popular works of Otto Jespersen will remember how fitmly he
declares that linguistic science is historical. And those who have noticed
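
The OCR output above contains recognition errors such as "fitmly". One way to repair them is to match each unknown word against a vocabulary and accept only close matches. The sketch below does this with the standard library's `difflib.get_close_matches`; the `VOCAB` list, the `correct_ocr` helper, and the 0.8 cutoff are assumptions for illustration, and a real system would use a full dictionary or an in-domain word list.

```python
import difflib
import re

# Hypothetical domain vocabulary (an assumption for illustration only).
VOCAB = ["firmly", "declares", "linguistic", "science", "historical", "remember", "how", "he"]

def correct_ocr(text, vocab=VOCAB, cutoff=0.8):
    """Replace out-of-vocabulary words with their closest vocabulary entry, if any."""
    vocab_set = set(vocab)

    def fix(match):
        word = match.group(0)
        if word.lower() in vocab_set:
            return word
        close = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
        # Note: a replaced word comes back in the vocabulary's (lowercase) form.
        return close[0] if close else word

    return re.sub(r"[A-Za-z]+", fix, text)

print(correct_ocr("remember how fitmly he declares"))
# remember how firmly he declares
```

A high cutoff keeps the corrector conservative: words with no sufficiently similar vocabulary entry are left untouched rather than being forced into a wrong match.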

