# Libraries for Data Science

### Scientific Computing Libraries in Python
- Pandas: Provides data structures and tools for data cleaning, manipulation, and analysis.
- NumPy: Based on arrays and matrices, allowing mathematical functions to be applied to arrays.
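
As a minimal sketch of what "applying mathematical functions to arrays" means in NumPy, the example below converts a small array of values and takes the element-wise square root of a matrix. The temperature readings and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical temperature readings in degrees Celsius.
celsius = np.array([12.5, 18.0, 21.5, 30.0])

# Arithmetic is applied element-wise to the whole array, with no explicit loop.
fahrenheit = celsius * 9 / 5 + 32

# Mathematical functions work the same way on matrices (2-D arrays).
matrix = np.array([[1.0, 4.0], [9.0, 16.0]])
roots = np.sqrt(matrix)

print(fahrenheit)
print(roots)
```

The same element-wise style carries over to Pandas, whose columns are backed by NumPy-style arrays.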

### Visualization Libraries in Python
- Matplotlib: A popular library for creating graphs and plots, with customizable options.
- Seaborn: Built on Matplotlib, making it easy to generate heat maps, time series, and violin plots.

### High-Level Machine Learning and Deep Learning Libraries in Python
- Scikit-learn: Contains tools for statistical modeling, including regression, classification, clustering, etc.
- Keras: Allows building standard deep learning models with a high-level interface.
- TensorFlow: A low-level framework for large-scale production of deep learning models.
- PyTorch: Used for experimentation, making it simple for researchers to test ideas.

### Libraries Used in Other Languages
- Apache Spark: A general-purpose cluster-computing framework for processing data in parallel.
- Scala Libraries:
    - Vegas: A Scala library for statistical data visualizations.
    - BigDL: A library for deep learning.
- R Libraries:
    - ggplot2: A popular library for data visualization.
    - Keras and TensorFlow interfaces: Allow interaction with Python libraries.

# Application Programming Interfaces (APIs)

### Overview
- An API (Application Programming Interface) allows communication between two pieces of software.
- It is the part of the library that is visible to the user, while the library contains all the program components.
- Example: the Pandas library
    - Pandas is a set of software components that can be used to process data.
    - The Pandas API lets your program use those components without knowing what happens in the backend.
    - The backend can be written in a different language from the API, such as C++.
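
The Pandas example above can be sketched in a few lines. The column names and values are hypothetical; the point is that a single API call (`DataFrame.mean`) is all the user sees, while the implementation that does the work stays hidden.

```python
import pandas as pd

# Hypothetical data: three observations with two numeric attributes.
df = pd.DataFrame(
    {"height_cm": [150.0, 160.0, 170.0], "weight_kg": [50.0, 60.0, 70.0]}
)

# One API call; the implementation behind it is not visible to the caller.
column_means = df.mean()

print(column_means)
```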

### REST API
- REST (Representational State Transfer) APIs allow communication over the internet and give access to resources such as storage, data, and artificial intelligence algorithms.
- The client (your program) sends requests to the resource (web service) via an endpoint.
- The client sends requests using HTTP methods, and the resource returns a response using HTTP messages.
- The request usually carries a JSON payload with instructions for the operation to be performed.
- The response usually returns the result of the operation as JSON as well.
- Examples of REST APIs:
    - Watson Speech to Text API: Converts speech to text.
    - Watson Language Translator API: Translates text from one language to another.
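
The request and response cycle described above can be sketched with Python's standard library. This is an illustration only: the endpoint URL and the JSON field names (`text`, `source`, `target`, `translation`) are hypothetical and do not reflect the actual Watson API. The request is built but not sent, and the response body is simulated.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; a real service documents its own URL and authentication.
ENDPOINT = "https://api.example.com/v1/translate"


def build_request(text: str, source: str, target: str) -> Request:
    """Build (but do not send) an HTTP POST request with a JSON body."""
    payload = {"text": text, "source": source, "target": target}
    return Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(body: bytes) -> dict:
    """Decode the JSON body of an HTTP response into a dictionary."""
    return json.loads(body.decode("utf-8"))


request = build_request("Hello", "en", "es")
print(request.get_method(), request.full_url)

# Simulated response body, standing in for what a service might return.
simulated_body = b'{"translation": "Hola"}'
result = parse_response(simulated_body)
print(result["translation"])
```

To call a real service, the request object would be passed to `urllib.request.urlopen`, and the bytes read from the response would go through `parse_response`.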


# Data Sets – Powering Data Science

### Definition of a Data Set
- A data set is a structured collection of data.
- It can include various types of data such as text, numbers, images, audio, or video files.
- Tabular data sets are organized in rows and columns, with each row representing an observation and each column containing information about that observation.
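
A small sketch of a tabular data set, using only the standard library. The housing data is invented for illustration: each row is one observation, and each column holds one attribute of that observation.

```python
import csv
import io

# Hypothetical tabular data set: one row per house, one column per attribute.
raw = """city,bedrooms,price
Austin,3,350000
Boston,2,480000
Denver,4,520000
"""

# Each parsed row is a dictionary keyed by column name.
rows = list(csv.DictReader(io.StringIO(raw)))
columns = list(rows[0].keys())

# Values are read as text, so numeric columns need explicit conversion.
prices = [int(row["price"]) for row in rows]
average_price = sum(prices) / len(prices)

print(columns)
print(len(rows))
print(average_price)
```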

### Types of Data Ownership
- Private Data: Typically contains proprietary or confidential information and is not shared publicly.
- Open Data: Made available to the public, often by governments, organizations, or companies, and can be used for various purposes.

### Sources of Data
- Government Data: Many governments worldwide publish datasets on their websites, covering various topics such as economy, society, healthcare, transportation, and more.
- Intergovernmental Organizations: Organizations like the United Nations and the European Union maintain data repositories providing access to a wide range of information.
- Online Communities: Platforms like Kaggle provide access to a variety of datasets, and users can contribute their own datasets.

### Community Data License Agreement (CDLA)
- Created by the Linux Foundation to address the issue of open data distribution and use.
- Two licenses were initially created: CDLA-Sharing and CDLA-Permissive.
- CDLA-Sharing: Grants permission to use and modify the data, with the requirement to share modified versions under the same license terms.
- CDLA-Permissive: Grants permission to use and modify the data, but does not require sharing changes to the data.


## Additional Sources of Datasets

### Open datasets and sources
In this data-driven world, some datasets are freely available for anyone to access, use, modify, and share. These are called open datasets.
Open datasets include a public license and are very useful for your journey as a Data Scientist. Some of the most informative open dataset sources are listed below.

#### Government Data:

- https://www.data.gov/
- https://www.census.gov/data.html
- https://data.gov.uk/
- https://www.opendatanetwork.com/
- https://data.un.org/

#### Financial Data Sources:

- https://data.worldbank.org/
- https://www.globalfinancialdata.com/
- https://comtrade.un.org/
- https://www.nber.org/
- https://fred.stlouisfed.org/

#### Crime Data:

- https://www.fbi.gov/services/cjis/ucr
- https://www.icpsr.umich.edu/icpsrweb/content/NACJD/index.html
- https://www.drugabuse.gov/related-topics/trends-statistics
- https://www.unodc.org/unodc/en/data-and-analysis/

#### Health Data:

- https://www.who.int/gho/database/en/
- https://www.fda.gov/Food/default.htm
- https://seer.cancer.gov/faststats/selections.php?series=cancer
- https://www.opensciencedatacloud.org/
- https://pds.nasa.gov/
- https://earthdata.nasa.gov/
- https://www.sgim.org/communities/research/dataset-compendium/public-datasets-topic-grid

#### Academic and Business Data:

- https://scholar.google.com/
- https://nces.ed.gov/
- https://www.glassdoor.com/research/
- https://www.yelp.com/dataset

#### Other General Data:

- https://www.kaggle.com/datasets
- https://www.reddit.com/r/datasets/

### Proprietary datasets and sources
Proprietary datasets contain data primarily owned and controlled by specific individuals or organizations. Distribution of this data is limited because it is sold under a licensing agreement.
Unlike public data, some data from private sources cannot be easily disclosed.

National security data and geological, geophysical, and biological data are examples of proprietary data. This type of data is usually protected by copyright or patents. Proprietary datasets that mainly contain sensitive information are less widely available than open datasets.

Some common proprietary dataset sources are listed below.

#### Health Care:

- https://www.sgim.org/communities/research/dataset-compendium/proprietary-datasets

#### Financial Market data:

- https://datarade.ai/data-categories/proprietary-market-data

#### Google Cloud based datasets:

- https://cloud.google.com/datasets

### Dataset licenses
When you select a dataset, always check its license. A license states whether you can use the dataset and which conditions you must accept to do so. Common license types are listed below.

1. **PUBLIC DOMAIN MARK - PUBLIC DOMAIN:**

    When a dataset has a Public Domain license, all the rights to use, access, modify and share the dataset are open to everyone. Here there is technically no license.


2. **OPEN DATA COMMONS PUBLIC DOMAIN DEDICATION AND LICENSE – PDDL:**

    The PDDL grants the same rights as the Public Domain license, but it does so through an explicit licensing mechanism.


3. **CREATIVE COMMONS ATTRIBUTION 4.0 INTERNATIONAL - CC-BY**
    
    This license allows users to share and modify a dataset, but only if they give credit to the creator(s) of the dataset.


4. **COMMUNITY DATA LICENSE AGREEMENT – CDLA PERMISSIVE-2.0**
    
    Like most open-source licenses, this license allows users to use, modify, adapt, and share the dataset, but only if a disclaimer of warranties and liability is also included.


5. **OPEN DATA COMMONS ATTRIBUTION LICENSE - ODC-BY**
    
    This license allows users to share and adapt a dataset, but only if they give credit to the creator(s) of the dataset.


6. **CREATIVE COMMONS ATTRIBUTION-SHAREALIKE 4.0 INTERNATIONAL - CC-BY-SA**
    
    This license allows users to use, share, and adapt a dataset, but only if they give credit to the dataset, indicate any changes or transformations they made, and distribute adapted versions under the same license. Users might avoid datasets under this license because any adapted version they distribute must carry the same terms.


7. **COMMUNITY DATA LICENSE AGREEMENT – CDLA-SHARING-1.0**
    
    This license uses the principle of ‘copyleft’: users can use, modify, and adapt a dataset, but only if they don’t add license restrictions on the new work(s) they create with the dataset.


8. **OPEN DATA COMMONS OPEN DATABASE LICENSE - ODC-ODBL**
    
    This license allows users to use, share, and adapt a dataset, but only if they give credit to the dataset, indicate any changes or transformations they make, and offer adapted versions under the same license. Users might avoid datasets under this license because any adapted version they distribute must stay under the same terms.


9. **CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL 4.0 INTERNATIONAL - CC BY-NC**
    
    This license is a restrictive license. Users can share and adapt a dataset, provided they give credit to its creator(s) and ensure that the dataset is not used for any commercial purpose.


10. **CREATIVE COMMONS ATTRIBUTION-NO DERIVATIVES 4.0 INTERNATIONAL - CC BY-ND**
    
    This license is also a restrictive license. Users can share a dataset if they give credit to its creator(s). This license does not allow additions, transformations, or changes to the dataset.


11. **CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL - CC BY-NC-SA**
    
    This license allows users to share a dataset only if they give credit to its creator(s). Users can share additions, transformations, or changes to the dataset under the same license, but they cannot use the dataset for commercial purposes.


12. **CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL-NODERIVATIVES 4.0 INTERNATIONAL - CC BY-NC-ND**
    
    This license allows users to share a dataset only if they give credit to its creator(s). Users are not allowed to modify the dataset and are not allowed to use it for commercial purposes.

Note: Additional license types exist. Any dataset you use will include details about its license.
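
As a rough sketch, the conditions of the Creative Commons licenses listed above can be encoded as a lookup table and checked before a dataset is used. The short identifiers, the `LicenseTerms` class, and the `is_permitted` helper are hypothetical names chosen for this example, and the table is a simplified summary; the license text itself is always the authority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    attribution: bool        # the creator(s) must be credited
    share_alike: bool        # adapted versions must use the same license
    commercial_use: bool     # commercial use is permitted
    share_adaptations: bool  # modified versions may be shared


# Simplified summary of the Creative Commons licenses described above.
# Keys are hypothetical short identifiers chosen for this sketch.
LICENSES = {
    "CC-BY": LicenseTerms(True, False, True, True),
    "CC-BY-SA": LicenseTerms(True, True, True, True),
    "CC-BY-NC": LicenseTerms(True, False, False, True),
    "CC-BY-ND": LicenseTerms(True, False, True, False),
    "CC-BY-NC-SA": LicenseTerms(True, True, False, True),
    "CC-BY-NC-ND": LicenseTerms(True, False, False, False),
}


def is_permitted(license_id: str, commercial: bool, shares_modified: bool) -> bool:
    """Return True if the intended use is compatible with the license terms."""
    terms = LICENSES[license_id]
    if commercial and not terms.commercial_use:
        return False
    if shares_modified and not terms.share_adaptations:
        return False
    return True


# A commercial product that redistributes a modified dataset.
print(is_permitted("CC-BY", commercial=True, shares_modified=True))
print(is_permitted("CC-BY-NC", commercial=True, shares_modified=False))
print(is_permitted("CC-BY-ND", commercial=False, shares_modified=True))
```

A permitted use still carries obligations: every license in this table requires attribution, and the share-alike variants also require adapted versions to keep the same license.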


# Sharing Enterprise Data - [Data Asset eXchange](https://developer.ibm.com/exchanges/data/)
DAX provides a curated collection of high-quality open data sets for enterprise applications.
DAX includes tutorial notebooks for basic and advanced data analysis tasks.
DAX and the Model Asset eXchange (MAX) are both available on the IBM Developer website.
DAX notebooks can be opened in Watson Studio for advanced data analysis and machine learning workflows.

## Features of DAX
- Curated Data Sets: High-quality data sets from IBM Research and trusted third-party sources.
- Tutorial Notebooks: Notebooks that walk through basic tasks such as data cleaning, pre-processing, and exploratory analysis.
- Advanced Notebooks: Notebooks for complex tasks like creating charts, training machine-learning models, and running statistical analysis.
- Data Files: Data files available for download and use in projects.

## Accessing DAX
- IBM Developer Website: Access DAX and the Model Asset eXchange (MAX) on the IBM Developer website.
- Data Asset eXchange: Navigate to the Data Asset eXchange page and explore available data sets.

## Using DAX
- Getting Started: Download data sets and run notebooks in Watson Studio to perform data cleaning, pre-processing, and exploratory analysis.
- Notebooks: Execute notebooks in Watson Studio to perform advanced tasks like creating charts and training machine-learning models.
- Data Files: Load data files into projects for use in data analysis and machine learning workflows.