## Preparing the 20 Newsgroups Dataset

We need some data to work with. This demo uses the 20 newsgroups dataset. Although 20 newsgroups is a toy dataset, it offers enough complications to show how embeddings can be pieced together using ``vectorizers``.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# from src import paths
# from src.data import Dataset

In [3]:
import numpy as np
import matplotlib.colors
import seaborn as sns
import pandas as pd

In [4]:
import csv

In [56]:
def read_format_citeseer():
    """Load the CiteSeer content and citation files and format them for embedding."""
    data_dir = '../data/citeseer-doc-classification'
    # Read paper ids as strings so that numeric-looking and text ids compare consistently.
    content = pd.read_csv(f'{data_dir}/citeseer.content', sep='\t', header=None, dtype={0: str})
    cites = pd.read_csv(f'{data_dir}/citeseer.cites', sep='\t', header=None, dtype=str)
    n = content.shape[1]

    # Column 0 is the paper id, columns 1..n-2 are word indicators, column n-1 is the class label.
    labels = dict(zip(content[0], content[n-1].astype(str)))
    doc_word_matrix = content.iloc[:, 1:n-1].to_numpy()
    # One row per id in column 0 of the cites file, with its column 1 entries collected in a list.
    citations = cites.groupby(0).agg(list).reset_index().rename(columns={0: 'paper', 1: 'citation'})

    # Ids that appear in the cites file but not in the content file get 'No_label'.
    citation_labels = [labels.get(x, 'No_label') for x in citations.paper]
    doc_labels = list(content[n-1])

    color_key = {'AI': '#dbe9f6',
                 'Agents': '#b6b6d8',
                 'DB': '#fee187',
                 'HCI': '#9bcdcd',
                 'IR': '#c4a0b1',
                 'ML': '#a2d88a',
                 'No_label': '#777777bb'}

    return citations, citation_labels, doc_word_matrix, doc_labels, color_key
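
To make the file layout that `read_format_citeseer` relies on more concrete, here is a minimal sketch that walks through the same steps on a few made-up rows. It uses the standard-library `csv` module on in-memory text instead of pandas, and the ids, word indicators, and the helper name `parse_toy_citeseer` are hypothetical. Unlike the pandas `groupby`, it keeps ids in first-seen order rather than sorting them.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-ins for citeseer.content and citeseer.cites (tab-separated, no header).
# Assumed layout of each content row: paper id, binary word indicators, class label.
CONTENT_TSV = "p1\t1\t0\t1\tAI\np2\t0\t1\t0\tDB\n100157\t1\t1\t0\tML\n"
CITES_TSV = "p1\tp2\np1\t100157\nghost\tp1\n"


def parse_toy_citeseer(content_text, cites_text):
    """Mirror the steps of read_format_citeseer on in-memory text."""
    labels = {}           # paper id -> class label
    doc_word_matrix = []  # one row of 0/1 word indicators per paper
    for row in csv.reader(io.StringIO(content_text), delimiter="\t"):
        paper_id, *words, label = row
        labels[paper_id] = label
        doc_word_matrix.append([int(w) for w in words])

    # Group column 1 of the cites text by column 0, as the groupby does.
    grouped = defaultdict(list)
    for left, right in csv.reader(io.StringIO(cites_text), delimiter="\t"):
        grouped[left].append(right)

    # Ids present in the cites text but absent from the content text get 'No_label'.
    citation_labels = [labels.get(paper, "No_label") for paper in grouped]
    return dict(grouped), citation_labels, doc_word_matrix, list(labels.values())


citations, citation_labels, doc_word_matrix, doc_labels = parse_toy_citeseer(
    CONTENT_TSV, CITES_TSV
)
print(citations)
print(citation_labels)
```

The `ghost` row shows why the `No_label` entry exists in the color key: an id can occur in the citation file without a matching row in the content file.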

In [38]:
# Same steps as read_format_citeseer, run at top level for inspection.
# Reading the paper ids as strings keeps pandas from inferring mixed types for those columns.
content = pd.read_csv('../data/citeseer-doc-classification/citeseer.content', sep='\t', header=None, dtype={0: str})
cites = pd.read_csv('../data/citeseer-doc-classification/citeseer.cites', sep='\t', header=None, dtype=str)
n = content.shape[1]

labels = dict(zip(content[0], content[n-1]))
doc_word_matrix = content.iloc[:, 1:n-1].to_numpy()
citations = cites.groupby(0).agg(list).reset_index().rename(columns={0: 'paper', 1: 'citation'})

In [47]:
list(content[n-1])

['Agents',
 'IR',
 'Agents',
 'DB',
 'AI',
 'AI',
 'Agents',
 'IR',
 'AI',
 'HCI',
 'IR',
 'Agents',
 'DB',
 'IR',
 'DB',
 'IR',
 'DB',
 'ML',
 'Agents',
 'Agents',
 'Agents',
 'IR',
 'IR',
 'DB',
 'IR',
 'ML',
 'DB',
 'Agents',
 'IR',
 'ML',
 'ML',
 'ML',
 'ML',
 'Agents',
 'IR',
 'Agents',
 'AI',
 'IR',
 'Agents',
 'Agents',
 'Agents',
 'Agents',
 'Agents',
 'Agents',
 'Agents',
 'Agents',
 'Agents',
 'Agents',
 'IR',
 'Agents',
 'IR',
 'IR',
 'Agents',
 'HCI',
 'IR',
 'DB',
 'ML',
 'DB',
 'IR',
 'Agents',
 'DB',
 'HCI',
 'IR',
 'AI',
 'Agents',
 'IR',
 'IR',
 'IR',
 'DB',
 'IR',
 'Agents',
 'Agents',
 'Agents',
 'DB',
 'ML',
 'Agents',
 'Agents',
 'ML',
 'ML',
 'IR',
 'DB',
 'Agents',
 'IR',
 'ML',
 'Agents',
 'IR',
 'DB',
 'AI',
 'IR',
 'Agents',
 'ML',
 'HCI',
 'Agents',
 'IR',
 'IR',
 'IR',
 'ML',
 'AI',
 'IR',
 'Agents',
 'IR',
 'Agents',
 'ML',
 'IR',
 'Agents',
 'IR',
 'IR',
 'IR',
 'DB',
 'Agents',
 'Agents',
 'Agents',
 'DB',
 'AI',
 'Agents',
 'Agents',
 'Agents',
 'DB',
 ...

With a dataset and a carefully designed color palette in hand, we are in good shape to compare different embedding methods, using UMAP to visualize each embedding.
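
As a small illustration of how the palette is meant to be used, the sketch below turns a list of labels into one color per point, which is the form a scatter plot of an embedding would typically take. The `color_key` values are copied from `read_format_citeseer`; the short label list and the helper `hex_to_rgba` are made up for this example, and the code assumes only the standard library.

```python
from collections import Counter

# Palette copied from read_format_citeseer.
color_key = {
    'AI': '#dbe9f6',
    'Agents': '#b6b6d8',
    'DB': '#fee187',
    'HCI': '#9bcdcd',
    'IR': '#c4a0b1',
    'ML': '#a2d88a',
    'No_label': '#777777bb',
}

# A short made-up label list; in the notebook this would be doc_labels or citation_labels.
example_labels = ['Agents', 'IR', 'Agents', 'DB', 'No_label']


def hex_to_rgba(hex_color):
    """Convert '#rrggbb' or '#rrggbbaa' to an (r, g, b, a) tuple of floats in [0, 1]."""
    digits = hex_color.lstrip('#')
    if len(digits) == 6:
        digits += 'ff'  # fully opaque when no alpha channel is given
    return tuple(int(digits[i:i + 2], 16) / 255 for i in range(0, 8, 2))


# Check that every label has a color before plotting.
missing = sorted(set(example_labels) - set(color_key))

# One color per point, in the same order as the labels.
point_colors = [color_key[label] for label in example_labels]

# Label frequencies, useful for spotting very small classes.
label_counts = Counter(example_labels)

print(missing)
print(point_colors)
print(label_counts.most_common())
```

The `No_label` entry is the only color with an alpha channel (`bb`), so unlabeled points are drawn semi-transparent wherever the plotting library honors eight-digit hex colors.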

## Save this Dataset
Let's save this as a dataset for easy reuse in our other notebooks, and add the color palette to the dataset's metadata.

Note: This Dataset has already been added to the catalog and the following cells do not need to be run again. They are included here as a reference.

In [9]:
# from src.helpers import notebook_as_transformer

In [10]:
# new_dataset_name = f'{ds_in.name}_pruned'
# new_data = news_data
# new_target = targets
# new_metadata = ds_in.metadata.copy()
# new_metadata['color_key'] = color_key
# added_descr_txt = f"""\n This dataset is a subselection of the {ds_in.name} Dataset where we have pruned out any post less than {prune_limit} \
# characters ({prune_limit} is chosen somewhat arbitrarily). A custom `color_key` can be found in the metadata."""
# new_metadata['descr'] += added_descr_txt

# new_ds = Dataset(dataset_name=new_dataset_name, data=new_data, target=new_target,
#                  metadata=new_metadata)


In [11]:
# # Due to various design choices in Jupyter, we need to specify this name manually.
# nbname = '00-20-newsgroups-setup.ipynb'
# dsdict = notebook_as_transformer(notebook_name=nbname,
#                                  input_datasets=[ds_in],
#                                  output_datasets=[new_ds],
#                                  overwrite_catalog=True)

