# Interactive Data Visualization
##### (C) 2023-2025 Timothy James Becker: [revision 1.0](),  [GPLv3 license](https://www.gnu.org/licenses/gpl-3.0.html) 


## <u>Finding and Loading Data</u>

#### <u>Public Datasets</u>

The GitHub repository [awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets) maintains a current listing of public datasets, and we detail some of our favorites from that list here. The basic premise of the repository is that various domains contribute URLs, giving a compact listing of public (i.e., free to use) data for our visualization practice. Other good places to look are [Kaggle](https://www.kaggle.com/datasets) and the built-in datasets included in [scikit-learn](https://scikit-learn.org/stable/datasets.html) and [Keras](https://keras.io/api/datasets/).

[1000 Genomes](https://www.internationalgenome.org/data) A US/NIH-funded whole genome research project that studied the genomes of 26 human populations. Currently there are over 3000 human genomes available (and analyzed).

[ENCODE](https://www.encodeproject.org/) The Encyclopedia of DNA Elements (ENCODE) Consortium is an ongoing international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.

[GEO](https://www.ncbi.nlm.nih.gov/gds/) The Gene Expression Omnibus is an NIH-funded site for gene expression data such as bulk RNA-seq and single-cell RNA-seq.

[GO](https://geneontology.org/docs/download-go-annotations/) The Gene Ontology (GO) is a structured, standardized representation of biological knowledge. GO describes concepts (also known as terms, or formally, classes) that are connected to each other via formally defined relations. The GO is designed to be species-agnostic to enable the annotation of gene products across the entire tree of life. The computational framework of the GO enables consistent gene annotation, comparison of functions across organisms, and integration of knowledge across diverse biological databases.

[UCSC Genome](https://hgdownload.soe.ucsc.edu/downloads.html) This page contains links to sequence and annotation downloads for the genome assemblies featured in the UCSC Genome Browser. Downloads are also available via our JSON API, MySQL server, or FTP server. Data filtering is available in the Table Browser or via the command-line utilities.

[NOAA](http://www.ncdc.noaa.gov/data-access/quick-links) NCEI provides environmental data, products, and services covering the depths of the ocean to the surface of the sun to drive resilience, prosperity, and equity for current and future generations.

[NIST](https://math.nist.gov/~RPozo/complex_datasets.html) In analyzing large-scale complex networks, it is important to establish a standard dataset from which algorithms and claims can be compared and verified. Currently, it is often difficult to track down the original data used for computational experiments. Much of it is floating around in various formats throughout the net, embedded in papers, and often difficult to get from the authors. Moreover, the datasets are often modified (filtered) by research groups interested in different attributes, so that even when the name and descriptions match a citation in a paper, there is no guarantee that the data is identical.

[International Economics Data](https://cid.ucdavis.edu/) The Center for International Data was established in 1999 and is directed by Robert Feenstra. The center is housed at the Department of Economics at the University of California, Davis. The purpose of this center is to collect, enhance, create, and disseminate international economic data, including online and offline distribution. This web site describes all the data available to the public from the Center, with details for downloading or ordering.

[Yahoo Finance](http://finance.yahoo.com/) Yahoo Finance provides some free stock data downloads and other paid services.

[CT Census Data](https://github.com/HandsOnDataViz/ct-boundaries/blob/main/ct-towns-2024-census.geojson) Converted census data for easy inclusion in visualization projects.

[NHANES](https://www.cdc.gov/nchs/nhanes/index.html) The National Health and Nutrition Examination Survey (NHANES) collects data about the health of adults and children in the United States. We also collect data about what participants eat, drink, and take as supplements to determine how many nutrients are in their diet. These dietary interviews and blood tests help us measure the nutritional status of U.S. adults and children.

[TCGA](https://portal.gdc.cancer.gov/) The Cancer Genome Atlas project is a repository and computational platform for cancer researchers who need to understand cancer, its clinical progression, and response to therapy.

[10K faces database](https://wilmabainbridge.com/datasets.html) Face images for machine learning and other uses.

[CERN](https://opendata.cern.ch/search?q=&f=type%3ADataset&l=list&order=desc&p=1&s=10&sort=mostrecent) Physics datasets

[Twitter Reputation Management](https://nlp.uned.es/replab2013/) RepLab is a competitive evaluation exercise for Online Reputation Management systems organized as an activity of CLEF. RepLab 2013 focused on the task of monitoring the reputation of entities (companies, organizations, celebrities, etc.) on Twitter. 

[baseball data](https://www.retrosheet.org/game.htm) Historical baseball scores.

#### <u>Loading Data asynchronously with D3</u>

Loading a file into a d3.js session is done asynchronously, meaning that the function call will not block (stop) the running JavaScript program. Modern web browsers can download several resources, such as HTML and media files, in parallel using this method. Modern CPUs can then process those responses and render them into the page in parallel, theoretically shortening the time between when a user navigates to a page and when that page is fully loaded.
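The non-blocking behavior can be seen in a minimal sketch that uses no network at all. The `fakeLoad` function and the `order` array are hypothetical names made up for this illustration; `fakeLoad` only imitates a download by resolving a Promise on a later turn of the event loop.

```javascript
// Records the order in which things happen.
const order = [];

// Hypothetical stand-in for a file download:
// the Promise resolves later, not during the call itself.
function fakeLoad(name) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`contents of ${name}`), 0);
  });
}

order.push("request started");
const pending = fakeLoad("cars.csv").then((text) => {
  // Runs only after the rest of the script has finished.
  order.push("data arrived");
  console.log(order.join(" -> "));
  return text;
});
// This line runs immediately; the program did not wait for the data.
order.push("program continues");
```

When this runs, "program continues" is recorded before "data arrived", which is why any code that needs the loaded data has to live inside the callback.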

First, download the [d3_template_webapp](https://github.com/timothyjamesbecker/Interactive_Data_Visualization/d3_template_webapp) and then the example comma-separated value (CSV) file called [cars.csv](https://github.com/timothyjamesbecker/Interactive_Data_Visualization/blob/main/data/cars.csv). Put the cars.csv file into your web folder next to the index.html, main.css and main.js files as shown below:

<img src="figures/webstorm_data_file.png" alt="webstorm_data_file" width="700px">

Now we will use the d3.js fetch functions (which are wrappers for the [JavaScript fetch API](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch)). More specifically, d3 combines a CSV/TSV parser with the fetch API so that you get both the data and the ability to parse it. You will modify the main.js file to contain the code shown in the screenshot further below.

d3.dsv() is a general-purpose delimited-text parser and fetch API in one function. To have it split tab-separated files you would simply change the "," to "\t". The next argument is the file you want to load, which is "cars.csv". This type of file path is known as a relative path and only works if you put the cars.csv file in the same folder as index.html and main.js. Next, the (d) => indicates an anonymous function that is called for each row and yields a data selection. Notice how a data selection is different from a DOM selection that uses d3.select.

Here we are defining how we want the d3.dsv parser to process our data. We use the return statement to pick which columns we want (projection) and can then give them new names (aliases) by defining key-value pairs. The data selection and projection are followed by the .then() function, which is shorthand for the data loader callback. That is to say, here we define what will happen once the data has been loaded.

First you will see that we have an <b>err</b> variable, which will contain any data errors. Next you will see the <b>res</b> variable, which will contain the server response; in this case that should be an array of objects once the data selection/projection code has done its work. Use the console.log function to print the results into the browser console, and always check those results before starting any visualization of the data.
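For readers who cannot make out the screenshot, the pattern described above might look roughly like the sketch below. It is not a transcription of the screenshot: it assumes D3 version 5 or later (where d3.dsv returns a Promise, so errors are handled in `.catch()`), and it assumes cars.csv has columns named `name`, `mpg` and `hp`. The function name `projectCar` is hypothetical.

```javascript
// Row function: called once per CSV row; every field in d arrives as a string.
// The column names name, mpg and hp are assumptions about cars.csv.
function projectCar(d) {
  return {
    mpg: +d.mpg,  // unary plus converts the string to a Number
    hp: +d.hp,
    name: d.name, // left as a string
  };
}

// Only runs in a page where d3.js (v5 or later, an assumption) has been loaded.
if (typeof d3 !== "undefined") {
  d3.dsv(",", "cars.csv", projectCar)
    .then((res) => {
      console.log(res); // array of projected objects
    })
    .catch((err) => {
      console.error(err); // network or parsing failure
    });
}
```

Keeping the row function separate from the d3.dsv call is a design choice for this sketch only; it lets you try the projection on a hand-written object in the console before loading the real file.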

<img src="figures/webstorm_data_d3_dsv.png" alt="webstorm_data_d3_dsv" width="700px">

When you launch your app and open the inspector (in Firefox), you can see that the contents of the <b>res</b> variable have been printed to the console (shown as Array(32) \[{...},{...}\])

<img src="figures/webstorm_data_console.png" alt="webstorm_data_console" width="700px">

When you then click the right arrow, the array contents will be shown, which as we see are nicely formatted objects that have the mpg, hp and name keys. Notice that mpg and hp have been converted to the Number data type and that name has been kept as the default string type.
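Why the conversion matters can be seen in a short sketch. The `raw` object below is a made-up example of what a parsed CSV row looks like before conversion; it is not read from cars.csv.

```javascript
// A parsed CSV row before conversion: every value is a string.
// (Made-up values for illustration.)
const raw = { mpg: "21.0", hp: "110", name: "example car" };

// With a string, + means concatenation, not addition.
const asText = raw.hp + 10;

// Unary plus converts to a Number first, so + now adds.
const asNumber = +raw.hp + 10;

// Text that is not numeric converts to NaN, which is worth checking for.
const bad = +"n/a";

console.log(asText, asNumber, Number.isNaN(bad)); // 11010 120 true
```

Scales, sorting and arithmetic in later visualization code all expect real numbers, so it is safer to convert once in the loader than to remember to do it everywhere else.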

<img src="figures/webstorm_data_cars.png" alt="webstorm_data_cars" width="700px">

If we draw a table to represent the cars.csv table we can take a closer look at the projection here:

<img src="figures/data_d3_projection.png" alt="data_d3_projection" width="700px">

Projection allows us to reorder the columns and alias them, letting us use better or shorter names, which can simplify our JavaScript code. The other thing we will do is select only some of our data rows instead of all of them (i.e., filter) by placing limits on the values:

<img src="figures/webstorm_data_d3_select_project.png" alt="webstorm_data_d3_select_project" width="700px">

This will have the effect of selecting the smaller engine cars that have less than 110 hp, which means we will have only 11 elements in our cars array instead of 32.
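A selection like this can be sketched as a row function that returns `null` for rows to be dropped, since the d3 parser skips any row whose function result is null or undefined. The names `HP_LIMIT`, `selectSmallEngine` and `sampleRows` are hypothetical, the column names are assumptions about cars.csv, and the sample rows are made up rather than taken from the file.

```javascript
// Threshold taken from the text: keep cars under 110 hp.
const HP_LIMIT = 110;

// Row function combining selection (filter) and projection.
function selectSmallEngine(d) {
  const hp = +d.hp;
  // Returning null tells the d3 parser to skip this row.
  // Written as !(hp < limit) so that NaN values are dropped too.
  if (!(hp < HP_LIMIT)) return null;
  return { mpg: +d.mpg, hp: hp, name: d.name };
}

// Made-up rows standing in for parsed CSV lines.
const sampleRows = [
  { name: "car A", mpg: "30.1", hp: "65" },
  { name: "car B", mpg: "15.2", hp: "180" },
  { name: "car C", mpg: "21.0", hp: "110" },
];

// Mimic what the loader does: apply the row function, drop skipped rows.
const kept = sampleRows.map(selectSmallEngine).filter((row) => row != null);
console.log(kept.map((row) => row.name)); // [ 'car A' ]
```

In a page with d3.js loaded, the same function could be passed as the row argument, for example `d3.dsv(",", "cars.csv", selectSmallEngine)`. Note that a car with exactly 110 hp is excluded because the comparison is strictly less than.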

<img src="figures/data_d3_selection.png" alt="data_d3_selection" width="700px">

#### Webstorm Table Viewer
WebStorm has a built-in (integrated) table viewer, which should be used on any new tabular data file before you try to write a d3.js data loader web app. If there are issues with the file, the table viewer can help you find and deal with them before you write code, which could contain errors of its own.


<img src="figures/webstorm_data_table_view.png" alt="webstorm_data_table_view" width="700px">

#### Exercises

#### [1] select only four-cylinder cars and keep all columns. How many are there?
#### [2] select cars with greater than 20 mpg with wt less than 3.2 and keep all columns. How many are there?
#### [3] do the above but get only the name of the car
#### [4*] load another dataset from the repository (like NHANES.csv or ct-census...) and repeat this process by viewing in the table viewer, then selecting and projecting with the d3 data loader patterns shown here
