# Data Retrieval and Generation

This notebook collects article URLs and retrieves the article text, so that we end up with a better dataset.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Set-Up-and-Imports" data-toc-modified-id="Set-Up-and-Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Set-Up and Imports</a></span><ul class="toc-item"><li><span><a href="#Listing-out-all-functions" data-toc-modified-id="Listing-out-all-functions-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Listing out all functions</a></span></li></ul></li><li><span><a href="#Last-Minute-Retrieval-of-more-Articles" data-toc-modified-id="Last-Minute-Retrieval-of-more-Articles-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Last-Minute Retrieval of more Articles</a></span><ul class="toc-item"><li><span><a href="#The-New-York-Times" data-toc-modified-id="The-New-York-Times-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>The New York Times</a></span></li></ul></li><li><span><a href="#Unique-URLs" data-toc-modified-id="Unique-URLs-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Unique URLs</a></span></li><li><span><a href="#Retrieval" data-toc-modified-id="Retrieval-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Retrieval</a></span></li></ul></div>

## Set-Up and Imports

We use BeautifulSoup and Requests for data acquisition, `re` for regex matching, and `glob` for checking files. pandas, NumPy and Matplotlib are imported as well.

In [1]:
from lib.datagen import *
from bs4 import BeautifulSoup as souper
import requests, re
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from glob import glob

### Listing out all functions

In [2]:
functions

[<function lib.datagen.nytimes(url)>,
 <function lib.datagen.nature(url)>,
 <function lib.datagen.forbes(url)>]

## Last-Minute Retrieval of more Articles

### The New York Times

In [3]:
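with open("datagen/nytimes.txt", "a") as file:
    # Append every dated article link (/YYYY/MM/DD/...) found on the NYT front page.
    soup = souper(requests.get("https://www.nytimes.com/").content, "html.parser")
    links = [tag["href"] for tag in soup.find_all("a", href=True)  # skip anchors without an href
             if re.match(r"https://www\.nytimes\.com/\d{4}/\d{2}/\d{2}.*", tag["href"])]
    print(*links, sep="\n", file=file)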
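To see what the link filter keeps, the same pattern can be tried on a few hand-written hrefs without touching the network. This is only a sketch: the sample URLs are hypothetical, and `is_article_link` is a helper name introduced here, not something from `lib.datagen`.

```python
import re

# Same pattern as the cell above: absolute NYT links whose path starts with /YYYY/MM/DD.
ARTICLE_URL = re.compile(r"https://www\.nytimes\.com/\d{4}/\d{2}/\d{2}.*")

def is_article_link(href):
    """Return True when href is a string that looks like a dated article URL."""
    return isinstance(href, str) and ARTICLE_URL.match(href) is not None

# Hypothetical hrefs of the kind a front page might contain.
sample_hrefs = [
    "https://www.nytimes.com/2022/02/04/example-section/example-article.html",
    "https://www.nytimes.com/section/technology",      # section page, no date in the path
    "/2022/02/04/example-section/relative-link.html",  # relative link, no scheme or host
    None,                                              # an <a> tag without an href
]

article_links = [href for href in sample_hrefs if is_article_link(href)]
print(*article_links, sep="\n")
```

Only the first sample passes. Relative links are dropped because the pattern requires the full `https://www.nytimes.com` prefix.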

## Unique URLs

Each text file is deduplicated in place, so that no URL is listed (and later retrieved) more than once.

In [4]:
for retrieve in functions:
    print(retrieve.__name__)
    filename = f"datagen/{retrieve.__name__}.txt"
    np.savetxt(filename, np.unique(np.loadtxt(filename, dtype=str)), fmt="%s")

nytimes
nature
forbes
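`np.unique` returns its result sorted, so each file ends up ordered as well as deduplicated. Note also that `np.loadtxt` treats `#` as the start of a comment by default, so a URL carrying a `#fragment` would likely be cut short at that point. The sketch below does the same clean-up with the standard library only, swapping `np.unique` for `sorted(set(...))`. The function name `dedupe_url_file` and the example URLs are assumptions made for illustration.

```python
import os
import tempfile

def dedupe_url_file(path):
    """Rewrite path so it holds each non-empty line once, in sorted order."""
    with open(path, encoding="utf-8") as infile:
        urls = sorted({line.strip() for line in infile if line.strip()})
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.writelines(url + "\n" for url in urls)
    return urls

# Hypothetical URL list containing one duplicate and one blank line.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("https://example.com/b\nhttps://example.com/a\n\nhttps://example.com/b\n")
    unique_urls = dedupe_url_file(path)
    with open(path, encoding="utf-8") as f:
        rewritten = f.read()

print(unique_urls)
```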


## Retrieval

In [5]:
for retrieve in functions:
    # Load the URL list for this site and fetch every article with the site-specific function.
    urls = pd.Series(np.unique(np.loadtxt(f"datagen/{retrieve.__name__}.txt", dtype=str)))
    articles = urls.apply(retrieve)

    # Filename = last path segment of the URL, without any query string or file extension.
    filenames = urls.str.split(r"(/\?|\?)").str.get(0).str.split("/").str.get(-1).str.split(r"\.|\?").str.get(0)
    for text, filename in zip(articles, filenames):
        file = f"datagen/{retrieve.__name__}/{filename}.txt"
        print(file)
        with open(file, "w", encoding="utf-8") as outfile:
            outfile.write(text)

datagen/nytimes/at-yosemite-a-waterfall-turns-into-a-firefall.txt
datagen/nytimes/what-to-do-when-you-dont-want-to-run.txt
datagen/nytimes/seasonal-depression-covid.txt
datagen/nytimes/women-stem-pandemic.txt
datagen/nytimes/ai-education-neural-networks.txt
datagen/nytimes/motivation-energy-advice.txt
datagen/nytimes/smartphones-iphone-android.txt
datagen/nytimes/depression-anxiety-physical-health.txt
datagen/nytimes/hearing-aids-fda.txt
datagen/nytimes/yosemite-falls.txt
datagen/nytimes/google-facebook-advertising.txt
datagen/nytimes/metaverse-politics-disinformation-society.txt
datagen/nytimes/apple-face-computers.txt
datagen/nytimes/tech-won-now-what.txt
datagen/nytimes/tech-predictions.txt
datagen/nytimes/microsoft-activision-metaverse.txt
datagen/nytimes/metaverse-gaming-definition.txt
datagen/nytimes/how-excited-are-you-about-the-metaverse.txt
datagen/nytimes/facebook-experiments.txt
datagen/nytimes/olympics-beijing-xi-putin.txt
datagen/nytimes/fact-check-joe-rogan-robert-malone.

datagen/forbes/work-experience-resume-overqualified-job-forbes-woman-leadership-career.txt
datagen/forbes/why-is-work-experience-so-undervalued.txt
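The output filenames come from the chained string splits in the retrieval cell. As a rough illustration, the same three steps can be written for a single URL in plain Python. The helper name `slug_from_url` and the sample URLs are hypothetical.

```python
import re

def slug_from_url(url):
    """Mirror the three splits from the retrieval cell on one URL string."""
    without_query = re.split(r"/\?|\?", url)[0]  # drop "?..." or "/?..." and what follows
    last_segment = without_query.split("/")[-1]  # keep the final path segment
    return re.split(r"\.|\?", last_segment)[0]   # drop ".html" and similar suffixes

# Hypothetical URLs covering the shapes the splits are meant to handle.
samples = [
    "https://www.example.com/2022/02/04/section/some-article.html",
    "https://www.example.com/sites/author/2022/02/04/another-article/?sh=abc123",
    "https://www.example.com/articles/d41586-000-00000-0",
]
slugs = [slug_from_url(url) for url in samples]
print(*slugs, sep="\n")
```

Two edge cases follow from this logic. A URL that ends in `/` with no query string produces an empty slug, and two URLs that share a last path segment map to the same file, so the later article overwrites the earlier one.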
