<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Haggle" data-toc-modified-id="Haggle-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Haggle</a></span></li><li><span><a href="#Simple-example" data-toc-modified-id="Simple-example-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Simple example</a></span><ul class="toc-item"><li><span><a href="#By-the-way..." data-toc-modified-id="By-the-way...-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>By the way...</a></span></li></ul></li><li><span><a href="#Search-results-and-dataset-metadata" data-toc-modified-id="Search-results-and-dataset-metadata-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Search results and dataset metadata</a></span><ul class="toc-item"><li><span><a href="#.meta" data-toc-modified-id=".meta-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>.meta</a></span></li><li><span><a href="#Cached-search-info" data-toc-modified-id="Cached-search-info-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Cached search info</a></span></li></ul></li><li><span><a href="#The-boring-stuff" data-toc-modified-id="The-boring-stuff-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The boring stuff</a></span><ul class="toc-item"><li><span><a href="#Install" data-toc-modified-id="Install-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Install</a></span></li><li><span><a href="#API-credentials" data-toc-modified-id="API-credentials-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>API credentials</a></span></li></ul></li><li><span><a href="#F.A.Q." data-toc-modified-id="F.A.Q.-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>F.A.Q.</a></span><ul class="toc-item"><li><span><a href="#What-if-I-don't-want-a-zip-file-anymore?" data-toc-modified-id="What-if-I-don't-want-a-zip-file-anymore?-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>What if I don't want a zip file anymore?</a></span></li><li><span><a href="#Do-you-have-any-jupyter-notebooks-demoing-this." 
data-toc-modified-id="Do-you-have-any-jupyter-notebooks-demoing-this.-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Do you have any jupyter notebooks demoing this.</a></span></li></ul></li><li><span><a href="#Scrap" data-toc-modified-id="Scrap-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Scrap</a></span></li></ul></div>

# Haggle

A simple facade to [Kaggle](https://www.kaggle.com/) data.

Essentially, instantiate a `KaggleDatasets` object, and from it...
- search for datasets from the Python console (so much better than having pictures on the [kaggle website](https://www.kaggle.com/), right?)
- download what you want and start using...
- ... oh, and it automatically caches the data zip and search results to local files
- ... oh, and all the while it pretends to be a humble dict with `owner/dataset` keys, and that's the coolest bit.

**Haggle:** /ˈhaɡəl/
- an instance of intense argument (as in bargaining) 
- wrangle (over a price, terms of an agreement, etc.) 
- rhymes with Kaggle and is not taken on pypi (well, now it is)

# Simple example

In [1]:
from haggle import KaggleDatasets

rootdir = '/D/Dropbox/_odata/kaggle'  # define where you want the data to be cached/downloaded

s = KaggleDatasets(rootdir)  # make an instance

if 'rtatman/english-word-frequency' in s:
    del s['rtatman/english-word-frequency']  # just to prepare for the demo


In [2]:
list(s)  # see what you have locally

['uciml/human-activity-recognition-with-smartphones',
 'sitsawek/phonetics-articles-on-plos']

Let's search something (you can also search on [kaggle](https://www.kaggle.com/), I was kidding about it being lame!)

In [3]:
results = s.search('word frequency')
print(f"{len(results)=}")
list(results)[:10]

len(results)=180


['rtatman/english-word-frequency',
 'yekenot/fasttext-crawl-300d-2m',
 'rtatman/japanese-lemma-frequency',
 'rtatman/glove-global-vectors-for-word-representation',
 'averkij/lingtrain-hungarian-word-frequency',
 'lukevanhaezebrouck/subtlex-word-frequency',
 'facebook/fatsttext-common-crawl',
 'facebook/fasttext-wikinews',
 'facebook/fasttext-english-word-vectors-including-subwords',
 'kushtej/kannada-word-frequency']

Chosen what you want? Good, now do this:

In [4]:
v = s['rtatman/english-word-frequency']
type(v)

py2store.slib.s_zipfile.ZipReader

Okay, let's slow down a moment. What happened? What's this `ZipReader` thingy?

Well, what happened is that this downloaded the zip file of the data for you and saved it as `rtatman/english-word-frequency.zip` in the store's zips folder (`s.zips_dir`, which lives under your `rootdir`). Don't believe me? Go have a look.

But then it also returns this object called `ZipReader` that points to it. 

If you don't like it, you don't have to use it. But I think you should like it.

Look at what it can do!

List the files in the zip (okay, there's just one here, so it's a bit boring):

In [5]:
list(v)

['unigram_freq.csv']

Retrieve the data for any given file of the zip without ever having to unzip it!

Oh, and still pretending to be a dict. 

In [6]:
b = v['unigram_freq.csv']
print(f"b is a {type(b)} and has {len(b)} bytes")

b is a <class 'bytes'> and has 4956252 bytes
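
For intuition only, here is a minimal sketch of how a dict-like reader over a zip file can be built with the standard library's `zipfile` module. It is not the actual `ZipReader` implementation; the class name `TinyZipReader` is made up for this illustration, and the zip is built in memory so nothing needs to be downloaded.

```python
import io
import zipfile
from collections.abc import Mapping


class TinyZipReader(Mapping):
    """Hypothetical read-only, dict-like view of the files inside a zip."""

    def __init__(self, zip_source):
        # zip_source can be a path or a file-like object holding zip data
        self._zip = zipfile.ZipFile(zip_source)

    def __iter__(self):
        # Keys are the member names stored in the archive
        return iter(self._zip.namelist())

    def __len__(self):
        return len(self._zip.namelist())

    def __getitem__(self, name):
        try:
            # Reads one member into memory; nothing is extracted to disk
            return self._zip.read(name)
        except KeyError:
            raise KeyError(name) from None


# Build a small zip in memory so the example runs on its own
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('unigram_freq.csv', 'word,count\nthe,23135851162\n')

reader = TinyZipReader(buffer)
print(list(reader))
print(reader['unigram_freq.csv'])
```

Subclassing `Mapping` means `in`, `.keys()`, `.values()` and `.items()` come for free once `__getitem__`, `__iter__` and `__len__` are defined.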


Now the data is given in bytes by default, since that's the basis of everything. 

From there you can go everywhere. Here for example, say we'd like to go to `pandas.DataFrame`...

In [7]:
import pandas as pd
from io import BytesIO

df = pd.read_csv(BytesIO(b))
df.shape

(333333, 2)

In [8]:
print(df.head(7).to_string())

  word        count
0  the  23135851162
1   of  13151942776
2  and  12997637966
3   to  12136980858
4    a   9081174698
5   in   8469404971
6  for   5933321709
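
`pandas` is only one possible destination. As another illustration, the same kind of bytes can be parsed with the standard library's `csv` module. The sketch below uses a tiny stand-in for `b` and assumes the file is UTF-8 encoded, which is an assumption rather than something the dataset guarantees.

```python
import csv
import io

# A tiny stand-in for the bytes you would get from v['unigram_freq.csv']
b = b"word,count\nthe,23135851162\nof,13151942776\n"

# Decode the bytes to text (UTF-8 assumed), then parse rows using the header line
rows = list(csv.DictReader(io.StringIO(b.decode('utf-8'))))

# csv returns every value as a string, so convert the counts explicitly
counts = {row['word']: int(row['count']) for row in rows}
print(counts)
```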


And as mentioned, it caches the data on your local drive (that is, it downloads it once), so the next time you ask for `s['rtatman/english-word-frequency']`, it'll be faster to get those bytes.

See, let's list the contents of `s` again and see that we now have that `'rtatman/english-word-frequency'` key we didn't have before.

In [9]:
list(s)

['uciml/human-activity-recognition-with-smartphones',
 'rtatman/english-word-frequency',
 'sitsawek/phonetics-articles-on-plos']

## By the way...

So a `KaggleDatasets` is a store with a dict-like interface. 

Listing happens locally. Remote listing is done through `.search(...)`.

Getting looks locally first; if the key isn't there, the data is fetched remotely (and cached locally).
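
The "list locally, get locally first, fetch and cache on a miss" behavior can be sketched in a few lines. This is only an illustration of the pattern, not haggle's code: `LocalFirstStore`, `fake_fetch` and the `<rootdir>/owner/dataset.zip` layout are all assumptions made for the example, and the "download" is simulated.

```python
import os
import tempfile
from collections.abc import Mapping


class LocalFirstStore(Mapping):
    """Hypothetical store: lists local files, fetches and caches on a miss."""

    def __init__(self, rootdir, fetch_remote):
        self.rootdir = rootdir
        self.fetch_remote = fetch_remote  # callable: key -> bytes (stands in for a download)

    def _path(self, key):
        # 'owner/dataset' -> <rootdir>/owner/dataset.zip (an assumed layout)
        return os.path.join(self.rootdir, *key.split('/')) + '.zip'

    def __iter__(self):
        # Listing only looks at what is already on disk
        for owner in sorted(os.listdir(self.rootdir)):
            owner_dir = os.path.join(self.rootdir, owner)
            if os.path.isdir(owner_dir):
                for name in sorted(os.listdir(owner_dir)):
                    if name.endswith('.zip'):
                        yield owner + '/' + name[:-4]  # drop the '.zip' suffix

    def __len__(self):
        return sum(1 for _ in self)

    def __contains__(self, key):
        # Membership is a local check only; it must not trigger a download
        return os.path.isfile(self._path(key))

    def __getitem__(self, key):
        path = self._path(key)
        if not os.path.isfile(path):
            # Cache miss: get the data remotely, then store it locally
            data = self.fetch_remote(key)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, 'wb') as f:
                f.write(data)
        with open(path, 'rb') as f:
            return f.read()


calls = []  # records every simulated download


def fake_fetch(key):
    calls.append(key)
    return b'zip bytes for ' + key.encode()


store = LocalFirstStore(tempfile.mkdtemp(), fake_fetch)
print(list(store))                       # nothing local yet
store['rtatman/english-word-frequency']  # first access: fetched and cached
store['rtatman/english-word-frequency']  # second access: served from disk
print(list(store), calls)
```

Note the explicit `__contains__`: the default one inherited from `Mapping` calls `__getitem__`, which in a store like this would download a dataset just to answer `key in store`.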


Where are the zips stored? Ask `.zips_dir`:

In [13]:
s.zips_dir

'/D/Dropbox/_odata/kaggle/zips'

# Search results and dataset metadata

Let's have a closer look at those search results. All we did is a `len(results)` and a `list(results)`. What else can you do with that object?

Well, as it so happens, you can do whatever (read-only) operation you can do on a (take a wild guess) dict.

Namely, you can get a value for any of the keys we've listed:

In [10]:
from pprint import pprint
pprint(results['rtatman/english-word-frequency'])

{'creatorName': 'Rachael Tatman',
 'creatorUrl': 'rtatman',
 'currentVersionNumber': 1,
 'description': None,
 'downloadCount': 3079,
 'files': [],
 'id': 2367,
 'isFeatured': False,
 'isPrivate': False,
 'isReviewed': True,
 'kernelCount': 12,
 'lastUpdated': '2017-09-06T18:21:27.18Z',
 'licenseName': 'Other (specified in description)',
 'ownerName': 'Rachael Tatman',
 'ownerRef': 'rtatman',
 'ref': 'rtatman/english-word-frequency',
 'subtitle': '⅓ Million Most Frequent English Words on the Web',
 'tags': [{'competitionCount': 3,
           'datasetCount': 231,
           'description': 'Language is a method of communication that consists '
                          'of using words arranged into meaningful patterns. '
                          'This is a good place to find natural language '
                          'processing datasets and kernels to study languages '
                          'and train your chat bots.',
           'fullPath': 'topic > culture and humanities > lang

You get description, size, tags, download count... Useful stuff to make your choice. 
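
Since the results behave like a read-only dict of dicts, plain Python is enough for simple selections. The sketch below uses a small hand-made stand-in (`sample_results`, with only two fields per entry and values copied from the outputs in this document) and an arbitrary size limit, so it runs without any API call.

```python
# Stand-in for search results: ref -> metadata dict (only two fields kept)
sample_results = {
    'uciml/zoo-animal-classification': {'downloadCount': 16597, 'totalBytes': 1898},
    'yekenot/fasttext-crawl-300d-2m': {'downloadCount': 8275, 'totalBytes': 1545552000},
    'nobelfoundation/nobel-laureates': {'downloadCount': 3192, 'totalBytes': 67763},
}

max_bytes = 1_000_000  # arbitrary size limit chosen for the example

# Keep datasets under the size limit, most downloaded first
small_and_popular = sorted(
    (ref for ref, info in sample_results.items() if info['totalBytes'] <= max_bytes),
    key=lambda ref: sample_results[ref]['downloadCount'],
    reverse=True,
)
print(small_and_popular)
```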

Personally, I like to transform those results into a `DataFrame` that I can subsequently interrogate:

In [11]:
import pandas as pd
df = pd.DataFrame(results.values())[['ref', 'title', 'subtitle', 'downloadCount', 'totalBytes']]
df = df.set_index('ref').sort_values('downloadCount', ascending=False)
df.head(10)

| ref | title | subtitle | downloadCount | totalBytes |
|---|---|---|---|---|
| jealousleopard/goodreadsbooks | Goodreads-books | comprehensive list of all books listed in good... | 23640 | 637338.0 |
| uciml/zoo-animal-classification | Zoo Animal Classification | Use Machine Learning Methods to Correctly Clas... | 16597 | 1898.0 |
| yekenot/fasttext-crawl-300d-2m | FastText crawl 300d 2M | 2 million word vectors trained on Common Crawl... | 8275 | 1545552000.0 |
| rtatman/sentiment-lexicons-for-81-languages | Sentiment Lexicons for 81 Languages | Sentiment Polarity Lexicons (Positive vs. Nega... | 7960 | 1621755.0 |
| rtatman/glove-global-vectors-for-word-representation | GloVe: Global Vectors for Word Representation | Pre-trained word vectors from Wikipedia 2014 +... | 7432 | 480172600.0 |
| mozillaorg/common-voice | Common Voice | 500 hours of speech recordings, with speaker d... | 6075 | 12931470000.0 |
| arathee2/demonetization-in-india-twitter-data | Demonetization in India Twitter Data | Data extracted from Twitter regarding the rece... | 5761 | 919578.0 |
| eibriel/rdany-conversations | rDany Chat | 157 chats & 6300+ messages with a (fake) virtu... | 3983 | 916724.0 |
| mrisdal/2016-us-presidential-debates | 2016 US Presidential Debates | Full transcripts of the face-off between Clint... | 3920 | 123161.0 |
| nobelfoundation/nobel-laureates | Nobel Laureates, 1901-Present | Which country has won the most prizes in each ... | 3192 | 67763.0 |



## .meta

`.meta` is your access to metadata about datasets. 

It works the same way as the dataset zips do:
- list: lists the dataset metadata stored locally (in the location given by `s.meta_dir`)
- get: when a value (a metadata dict) is requested, (1) the key is looked up locally first; (2) if it isn't found, it is requested remotely (through the kaggle api); and (3) the fetched value is cached (stored) locally


## Cached search info

Wait, that's not all: `KaggleDatasets` will (by default) also cache the search results locally, in individual json files.

Where? Ask `meta_dir`:

In [16]:
s.meta_dir

'/D/Dropbox/_odata/kaggle/meta'
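
To picture how a folder of individual json files can be exposed as a dict, here is a small stand-alone sketch. It is not how haggle does it: `JsonDirReader` is a made-up name, and the `<meta_dir>/owner/dataset.json` file layout is an assumption chosen for the example.

```python
import json
import os
import tempfile
from collections.abc import Mapping


class JsonDirReader(Mapping):
    """Hypothetical read-only mapping: 'owner/dataset' -> dict loaded from a json file."""

    def __init__(self, rootdir):
        self.rootdir = rootdir

    def _path(self, key):
        # 'owner/dataset' -> <rootdir>/owner/dataset.json (an assumed layout)
        return os.path.join(self.rootdir, *key.split('/')) + '.json'

    def __iter__(self):
        for owner in sorted(os.listdir(self.rootdir)):
            owner_dir = os.path.join(self.rootdir, owner)
            if os.path.isdir(owner_dir):
                for name in sorted(os.listdir(owner_dir)):
                    if name.endswith('.json'):
                        yield owner + '/' + name[:-5]  # drop the '.json' suffix

    def __len__(self):
        return sum(1 for _ in self)

    def __getitem__(self, key):
        path = self._path(key)
        if not os.path.isfile(path):
            raise KeyError(key)
        with open(path, encoding='utf-8') as f:
            return json.load(f)


# Write one metadata file into a temporary folder so the example runs on its own
meta_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(meta_dir, 'emmabel'))
sample = {'ref': 'emmabel/word-occurrences-in-mr-robot', 'downloadCount': 116}
with open(os.path.join(meta_dir, 'emmabel', 'word-occurrences-in-mr-robot.json'),
          'w', encoding='utf-8') as f:
    json.dump(sample, f)

meta = JsonDirReader(meta_dir)
print(list(meta))
print(meta['emmabel/word-occurrences-in-mr-robot']['downloadCount'])
```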

You can access these files with your favorite dict-like interface, through the `.meta` attribute

In [14]:
len(s.meta)

358

In [17]:
list(s.meta)[:7]

['emmabel/word-occurrences-in-mr-robot',
 'bitsnpieces/covid19-country-data',
 'johnwdata/coronavirus-covid19-cases-by-us-state',
 'johnwdata/coronavirus-covid19-cases-by-us-county',
 'andradaolteanu/bing-nrc-afinn-lexicons',
 'rahulloha/covid19',
 'nltkdata/word2vec-sample']

In [18]:
pprint(s.meta['emmabel/word-occurrences-in-mr-robot'])

{'creatorName': 'Emma',
 'creatorUrl': 'emmabel',
 'currentVersionNumber': 1,
 'description': None,
 'downloadCount': 116,
 'files': [],
 'id': 4288,
 'isFeatured': False,
 'isPrivate': False,
 'isReviewed': False,
 'kernelCount': 1,
 'lastUpdated': '2017-11-09T18:30:15.733Z',
 'licenseName': 'CC0: Public Domain',
 'ownerName': 'Emma',
 'ownerRef': 'emmabel',
 'ref': 'emmabel/word-occurrences-in-mr-robot',
 'subtitle': "Find out F-Society's favorite lingo",
 'tags': [{'competitionCount': 0,
           'datasetCount': 7525,
           'description': 'Activities that holds the attention and interest of '
                          'an audience, or gives pleasure and delight. It can '
                          'be an idea or a task, but is more likely to be one '
                          'of the activities or events that have developed '
                          'over thousands of years specifically for the '
                          "purpose of keeping an audience's attention.",
      

So if you want to search this information locally (again, that's information about your searches, not your data zips!), you can load it into a `DataFrame` like so:

In [20]:
df = pd.DataFrame(s.meta.values())
df

|  | id | ref | subtitle | tags | creatorName | creatorUrl | totalBytes | url | lastUpdated | downloadCount | ... | datasetId | datasetSlug | ownerUser | totalViews | totalVotes | totalDownloads | licenses | keywords | collaborators | data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4288.0 | emmabel/word-occurrences-in-mr-robot | Find out F-Society's favorite lingo | [{'ref': 'arts and entertainment', 'competitio... | Emma | emmabel | 1.194660e+05 | https://www.kaggle.com/emmabel/word-occurrence... | 2017-11-09T18:30:15.733Z | 116.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 576036.0 | bitsnpieces/covid19-country-data | Country level metadata that includes temperatu... | [{'ref': 'global', 'competitionCount': 0, 'dat... | Patrick | bitsnpieces | 1.908210e+05 | https://www.kaggle.com/bitsnpieces/covid19-cou... | 2020-05-03T23:51:55.5Z | 939.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 575937.0 | johnwdata/coronavirus-covid19-cases-by-us-state | NYTimes Coronavirus Dataset | [{'ref': 'earth and nature', 'competitionCount... | John Wackerow | johnwdata | 8.258200e+04 | https://www.kaggle.com/johnwdata/coronavirus-c... | 2020-09-23T12:43:05.76Z | 59.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 575883.0 | johnwdata/coronavirus-covid19-cases-by-us-county | NYTimes Coronavirus Dataset | [{'ref': 'earth and nature', 'competitionCount... | John Wackerow | johnwdata | 3.508189e+06 | https://www.kaggle.com/johnwdata/coronavirus-c... | 2020-07-23T18:47:16.543Z | 37.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 507452.0 | andradaolteanu/bing-nrc-afinn-lexicons | the lexicons are in CSV format | [{'ref': 'earth and nature', 'competitionCount... | Andrada Olteanu | andradaolteanu | 8.396500e+04 | https://www.kaggle.com/andradaolteanu/bing-nrc... | 2020-02-09T18:39:13.343Z | 135.0 | ... |  |  |  |  |  |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 353 | 839.0 | nobelfoundation/nobel-laureates | Which country has won the most prizes in each ... | [] | Abigail Larion | abigaillarion | 6.776300e+04 | https://www.kaggle.com/nobelfoundation/nobel-l... | 2017-02-16T00:31:00.993Z | 3192.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 354 | 110364.0 | fourtonfish/hello-salut | Dataset of translations of the word "hello" to... | [{'ref': 'languages', 'competitionCount': 3, '... | Stefan Bohacek | fourtonfish | 6.600000e+03 | https://www.kaggle.com/fourtonfish/hello-salut | 2019-03-10T22:32:44.603Z | 120.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 355 | 540160.0 | guenthermi/facete | Dataset for Domain-Specific Word Embedding Eva... | [{'ref': 'internet', 'competitionCount': 18, '... | Michael Günther | guenthermi | 1.300565e+07 | https://www.kaggle.com/guenthermi/facete | 2020-03-04T15:03:24.507Z | 11.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 356 | 688051.0 | bcgvaccine/hackathon | Improve BCG Data and Provide Insights to "BCG ... | [{'ref': 'business', 'competitionCount': 2, 'd... | Radoslav Kirkov | rkirkov | 4.695259e+09 | https://www.kaggle.com/bcgvaccine/hackathon | 2020-09-22T17:16:38.747Z | 283.0 | ... |  |  |  |  |  |  |  |  |  |  |



**Note: If you don't want all your search results to be cached you can just specify it.**

```python
s = KaggleDatasets(rootdir, cache_metas_on_search=False)  # make an instance
```

# The boring stuff

## Install

pip install haggle

**You'll need a Kaggle API token to use this.**

If you already have one, you can probably just start using `haggle`.

If you don't, go get one! See the [kaggle-api repository](https://github.com/Kaggle/kaggle-api) for detailed instructions; the essentials are reproduced in the next section.




## API credentials

To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. 
Then go to the 'Account' tab of your user profile (`https://www.kaggle.com/<username>/account`) and select 'Create API Token'. 
This will trigger the download of `kaggle.json`, a file containing your API credentials. 
Place this file in the location `~/.kaggle/kaggle.json` (on Windows in the location `C:\Users\<Windows-username>\.kaggle\kaggle.json` - you can check the exact location, sans drive, with `echo %HOMEPATH%`). 
You can define a shell environment variable `KAGGLE_CONFIG_DIR` to change this location to `$KAGGLE_CONFIG_DIR/kaggle.json` (on Windows it will be `%KAGGLE_CONFIG_DIR%\kaggle.json`).

For your security, ensure that other users of your computer do not have read access to your credentials. On Unix-based systems you can do this with the following command: 

`chmod 600 ~/.kaggle/kaggle.json`

You can also choose to export your Kaggle username and token to the environment:

```bash
export KAGGLE_USERNAME=datadinosaur
export KAGGLE_KEY=xxxxxxxxxxxxxx
```
In addition, you can export any other configuration value that normally would be in
the `$HOME/.kaggle/kaggle.json` in the format 'KAGGLE_<VARIABLE>' (note uppercase).  
For example, if the file had the variable "proxy" you would export `KAGGLE_PROXY`
and it would be discovered by the client.
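
If you're not sure whether your credentials are where they should be, a quick check along the lines below can help. It is a hypothetical helper, not part of haggle or the kaggle client: it only looks in the locations described above, and the order in which it checks them is an arbitrary choice that may not match the client's own precedence.

```python
import os


def find_kaggle_credentials(env=None, home=None):
    """Return where credentials were found ('environment variables' or a file path), or None.

    Hypothetical helper that mirrors the lookup locations described in this section.
    """
    env = os.environ if env is None else env
    home = os.path.expanduser('~') if home is None else home

    # Option 1: username and key exported to the environment
    if env.get('KAGGLE_USERNAME') and env.get('KAGGLE_KEY'):
        return 'environment variables'

    # Option 2: kaggle.json in KAGGLE_CONFIG_DIR, or in ~/.kaggle by default
    config_dir = env.get('KAGGLE_CONFIG_DIR') or os.path.join(home, '.kaggle')
    config_path = os.path.join(config_dir, 'kaggle.json')
    if os.path.isfile(config_path):
        return config_path

    return None


print(find_kaggle_credentials())
```

If it prints `None`, neither the environment variables nor a `kaggle.json` file was found in the places this sketch knows about.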


# F.A.Q.

## What if I don't want a zip file anymore?

Just delete it, like you do with any file you don't want anymore. You know the one.

Or... you can be cool and do `del s['owner/dataset']` for that key (note a key doesn't include the rootdir or the `.zip` extension), just like you would with a... `dict`, once again.

## Do you have any jupyter notebooks demoing this.

Sure, you can find some [here on github](https://github.com/otosense/haggle/tree/master/docs).

# Scrap