### This notebook demonstrates the first part of the ingestion pipeline: loading data from Kaggle.

The code is located in Loader/KaggleDatasetLoader.py.

Kaggle account credentials have been included in Loader/kaggle.json. Initializing the loader will move these credentials to '~/.kaggle' if you don't already have a kaggle.json there. The Kaggle API needs this file to authenticate.
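The credential setup described above could look roughly like the sketch below. It is not the code in KaggleDatasetLoader.py: the function name `install_kaggle_credentials` and its parameters are hypothetical, and it copies the file rather than moving it so the original stays in place.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(source: Path, kaggle_dir: Path) -> Path:
    """Place kaggle.json in kaggle_dir unless one is already there.

    Hypothetical helper for illustration; the real loader may differ.
    """
    target = kaggle_dir / "kaggle.json"
    if not target.exists():
        kaggle_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(source, target)
    # Restrict the file to owner read/write (mode 600) so the API key
    # is not readable by other users on the machine.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

A call such as `install_kaggle_credentials(Path("Loader/kaggle.json"), Path.home() / ".kaggle")` would leave an existing `~/.kaggle/kaggle.json` untouched apart from tightening its permissions.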

In [1]:
from Loader.KaggleDatasetLoader import KaggleDatasetLoader

loader = KaggleDatasetLoader()

kaggle.json file is already in ~/.kaggle
Permissions of /Users/williamshabecoff/.kaggle/kaggle.json are currently private


### Using the loader

Calling loader.get_dataset() creates a new directory in 'Data/Text'. get_dataset() only needs the user and dataset name. For example, both url = 'jeet2016/us-financial-news-articles' and url = 'https://www.kaggle.com/datasets/jeet2016/us-financial-news-articles' will work.
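One way to accept both forms is to strip the Kaggle URL prefix and split what remains into user and dataset name. The sketch below is only an illustration of that idea; `parse_dataset_ref` is a hypothetical name, not a function from the loader.

```python
KAGGLE_DATASET_PREFIX = "https://www.kaggle.com/datasets/"


def parse_dataset_ref(url: str) -> tuple[str, str]:
    """Return (user, dataset) from a full Kaggle URL or a 'user/dataset' slug.

    Hypothetical helper; the loader's own parsing may differ.
    """
    slug = url[len(KAGGLE_DATASET_PREFIX):] if url.startswith(KAGGLE_DATASET_PREFIX) else url
    parts = slug.strip("/").split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"Expected 'user/dataset' or a Kaggle dataset URL, got {url!r}")
    return parts[0], parts[1]


user, dataset = parse_dataset_ref(
    "https://www.kaggle.com/datasets/jeet2016/us-financial-news-articles"
)
```

Under this sketch, the dataset part is what would name the new directory under 'Data/Text'.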

Apologies, the Kaggle API does not support a progress bar, so the download runs without one.

In [2]:
loader.get_dataset('https://www.kaggle.com/datasets/jeet2016/us-financial-news-articles')

Loading dataset into ./Data/Text/us-financial-news-articles. This may take a few minutes
Dataset URL: https://www.kaggle.com/datasets/jeet2016/us-financial-news-articles


In [3]:
loader.get_dataset('hammadjavaid/football-news-articles')

Loading dataset into ./Data/Text/football-news-articles. This may take a few minutes
Dataset URL: https://www.kaggle.com/datasets/hammadjavaid/football-news-articles


We now have a CSV dataset at 'Data/Text/football-news-articles' and a JSON dataset at 'Data/Text/us-financial-news-articles'. Next, we use the loader to reformat these datasets. A config describes a simple mapping into a standard form. Here is football.yaml, found in Loader/schema:


```yaml
Format: csv 
Mapping:
  title: title
  text: content
  url: link
  author: author
```

In the mapping, we have output columns of ['title', 'text', 'url', 'author'], which are found in the source data at ['title', 'content', 'link', 'author']. Do not rename the output columns 'title' and 'text', because our TextDataset class expects them.
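To make the mapping concrete, here is a minimal sketch of how such a config could be applied to CSV rows. The mapping is written as a Python dict instead of being read from football.yaml, the function name `standardize_rows` is hypothetical, and the sample row is made up for illustration; none of this is taken from the loader itself.

```python
import csv
import io

# Output column -> source column, mirroring the Mapping section of football.yaml.
MAPPING = {"title": "title", "text": "content", "url": "link", "author": "author"}


def standardize_rows(rows, mapping):
    """Yield one dict per row, keyed by the output column names."""
    for row in rows:
        yield {out_col: row.get(src_col) for out_col, src_col in mapping.items()}


# Made-up sample data standing in for a file from the raw dataset.
raw_csv = io.StringIO(
    "title,content,link,author\n"
    "Example headline,Example body,https://example.com/a,Jane Doe\n"
)
standardized = list(standardize_rows(csv.DictReader(raw_csv), MAPPING))
```

Each resulting row carries the 'title' and 'text' keys regardless of what the source columns were called.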

The supported formats are ['csv', 'single_json', 'multi_json']. Single vs. multi JSON describes whether each JSON file contains one document or several. If the dataset includes metadata that we don't want in our final dataset, and this metadata uses the same file extension as the rest of the data, you will need to remove it before running the reformat. This is not necessary for our Kaggle datasets. Passing delete = True removes the original, pre-transform data.
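The difference between the two JSON formats can be sketched as follows. This assumes that a multi-document file holds a JSON array of objects, which the notebook does not confirm; `load_documents` is a hypothetical name and the sample strings are made up.

```python
import json


def load_documents(raw: str, fmt: str) -> list[dict]:
    """Return the documents contained in one JSON file's text.

    Assumption for illustration: 'multi_json' files are JSON arrays.
    """
    parsed = json.loads(raw)
    if fmt == "single_json":
        # The whole file is a single document.
        return [parsed]
    if fmt == "multi_json":
        # The file is a collection of documents.
        return list(parsed)
    raise ValueError(f"Unsupported format: {fmt!r}")


single = load_documents('{"title": "A", "text": "first"}', "single_json")
multi = load_documents('[{"title": "A"}, {"title": "B"}]', "multi_json")
```

Either way the result is a list of documents, so the same column mapping can then be applied to each one.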

In [4]:
loader.format_dataset('football-news-articles', 'football.yaml', delete = True)

Transforming files at ./Data/Text/football-news-articles into standardized format: 100%|█| 7/7 [00:02<00:00,  3.


Deleted original directory: ./Data/Text/football-news-articles


In [3]:
loader.format_dataset('us-financial-news-articles', 'financial.yaml', delete = True)

Removed .zip file: ./Data/Text/us-financial-news-articles/3811_112b52537b67659ad3609a234388c50a/2018_01_112b52537b67659ad3609a234388c50a.zip
Removed .zip file: ./Data/Text/us-financial-news-articles/3811_112b52537b67659ad3609a234388c50a/2018_04_112b52537b67659ad3609a234388c50a.zip
Removed .zip file: ./Data/Text/us-financial-news-articles/3811_112b52537b67659ad3609a234388c50a/2018_03_112b52537b67659ad3609a234388c50a.zip
Removed .zip file: ./Data/Text/us-financial-news-articles/3811_112b52537b67659ad3609a234388c50a/2018_05_112b52537b67659ad3609a234388c50a.zip
Removed .zip file: ./Data/Text/us-financial-news-articles/3811_112b52537b67659ad3609a234388c50a/2018_02_112b52537b67659ad3609a234388c50a.zip
Removed empty directory: ./Data/Text/us-financial-news-articles/3811_112b52537b67659ad3609a234388c50a


Transforming files into standardized format: 0it [00:00, ?it/s]
Transforming files into standardized format: 100%|██████████████████████| 57456/57456 [00:12<00:00, 4643.56it/s]
Transforming files into standardized format: 100%|██████████████████████| 63245/63245 [00:13<00:00, 4522.07it/s]
Transforming files into standardized format: 100%|██████████████████████| 64592/64592 [00:16<00:00, 3948.34it/s]
Transforming files into standardized format: 100%|██████████████████████| 57802/57802 [00:13<00:00, 4266.67it/s]
Transforming files into standardized format: 100%|██████████████████████| 63147/63147 [00:16<00:00, 3878.77it/s]


Deleted original directory: ./Data/Text/us-financial-news-articles


In [5]:
!ls Data/Text

football-articles-sample            us-financial-news-articles-standard
football-news-articles-standard


We now have our two standard datasets.  You may also see football-articles-sample, which comes with the repo as a smaller sample of data to demo the other ingestion functionalities.