# Categories of Data Science Tools

- Data science tasks:
    - Data Management - storage, management, and retrieval of data
    - Data Integration and Transformation - streamline data pipelines and automate data processing tasks
    - Data Visualization - provide graphical representations of data and help communicate insights
    - Modelling - enable building, deployment, monitoring, and assessment of data and machine learning models


- These data science tasks are supported by the following:

    - Code Asset Management - store & manage code, track changes and allow collaborative development

    - Data Asset Management - organize and manage data, provide access control, and backup assets

    - Development Environments - develop, test and deploy code

    - Execution Environments - provide computational resources and run the code
    
The data science ecosystem consists of many open source and commercial options. It includes traditional desktop applications and server-based tools, as well as cloud-based services that can be accessed through web browsers and mobile interfaces.



# Open Source Tools for Data Science


- Data management: MySQL, PostgreSQL, MongoDB, Apache CouchDB, Apache Cassandra, Hadoop Distributed File System (HDFS), Ceph, Elasticsearch
- Data integration and transformation: Apache Airflow, Kubeflow, Apache Kafka, Apache NiFi, Apache Spark SQL, Node-RED
- Data visualization: PixieDust, Hue, Kibana, Apache Superset
- Model deployment: Apache PredictionIO, Seldon, Kubernetes, Red Hat OpenShift, MLeap, TensorFlow Serving, TensorFlow Lite, TensorFlow.js
- Model monitoring: ModelDB, Prometheus, IBM AI Fairness 360, IBM Adversarial Robustness 360 Toolbox, IBM AI Explainability 360
- Code asset management: Git, GitHub, GitLab, Bitbucket
- Data asset management: Apache Atlas, ODPi Egeria, Kylo
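
The relational databases in the data management list (MySQL, PostgreSQL) are all driven by SQL for storage and retrieval. The following is a minimal sketch of that pattern, assuming Python's built-in `sqlite3` module as a stand-in for a database server; the `measurements` table and its contents are hypothetical and not part of the original notes.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for a
# server such as MySQL or PostgreSQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")  # hypothetical table

# Storage: insert a few rows with a parameterized statement.
rows = [("a", 1.5), ("a", 2.5), ("b", 10.0)]
conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
conn.commit()

# Retrieval: aggregate per sensor with plain SQL.
query = "SELECT sensor, AVG(value) FROM measurements GROUP BY sensor ORDER BY sensor"
averages = conn.execute(query).fetchall()
print(averages)

conn.close()
```

The same statements would look much the same against a server database; mainly the connection call and driver module would differ.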


## Model Development

IBM Watson Studio: Engineered as an integrated environment, Watson Studio simplifies developing, training, and deploying models. It boasts support for multiple languages and frameworks, such as Python, R, and TensorFlow, alongside collaboration features, data preparation tools, and versatile deployment options.

IBM AutoAI: A notable feature embedded within Watson Studio, IBM AutoAI streamlines the machine learning model construction process. By dynamically exploring various algorithms and hyperparameters, it aims to identify the optimal model for a given dataset.
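
The idea of exploring algorithms and hyperparameters can be sketched in a few lines. This is only an illustration of the general search pattern, not AutoAI's API or its actual search strategy; the toy data, the two candidate "algorithms", and the `alpha` grid are all assumptions made for the example.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline 'algorithm': always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_ridge(xs, ys, alpha):
    """1-D regression y = a*x + b with penalty alpha shrinking the slope."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / (sxx + alpha)
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    """Mean squared error of a fitted model on held-out data."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

# Toy data following y = 2x + 1 (hypothetical).
train_x, train_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
valid_x, valid_y = [4.0, 5.0], [9.0, 11.0]

# Candidate pipelines: one baseline plus one model per hyperparameter value.
candidates = {"mean": fit_mean(train_x, train_y)}
for alpha in (0.0, 1.0, 10.0):
    candidates[f"ridge(alpha={alpha})"] = fit_ridge(train_x, train_y, alpha)

# Score every candidate on the validation set and keep the best one.
scores = {name: mse(model, valid_x, valid_y) for name, model in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Automated tools typically search far larger spaces with smarter strategies than an exhaustive loop, but the evaluate-and-select structure is the same.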

IBM Watson OpenScale: As a platform for overseeing and managing AI models in production, Watson OpenScale plays a pivotal role in ensuring model fairness, explainability, and bias mitigation. It furnishes insights into model performance and drift over time, facilitating informed decision-making.

IBM Watson Machine Learning: Available as a service on the IBM Cloud platform, Watson Machine Learning enables users to scale the training and deployment of machine learning models. It supports popular frameworks like TensorFlow, PyTorch, and scikit-learn, and offers APIs for integration with other applications.

### Development Environment

- Jupyter Notebooks: a popular tool for interactive programming, which supports many languages and allows users to combine code, output, and visualizations in one document.
- Jupyter Lab: The successor to Jupyter Notebooks, featuring a more modern interface and the ability to handle different file types.
- Apache Zeppelin: Inspired by Jupyter Notebooks but with integrated plotting capabilities (no external libraries needed). Can be extended with additional libraries.
- RStudio: a development environment designed for R that can also be used with Python. It offers functionality such as debugging, data exploration, and visualization.
- Spyder: a Python IDE similar to RStudio but with less functionality.

### Execution Environments:

- Apache Spark: A widely used tool for large-scale batch data processing, meaning it handles vast amounts of data in chunks.
- Apache Flink: Another large-scale data processing tool, but focused on real-time data streams.
- Ray: A recent development designed specifically for training large-scale deep learning models.
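
The difference between batch processing (Spark) and stream processing (Flink) can be illustrated in plain Python. This sketch does not use either framework's API; the function names and sample data are hypothetical and only show the two processing styles.

```python
from itertools import islice

def batch_total(records, chunk_size):
    """Batch style: the full dataset exists up front and is processed chunk by chunk."""
    it = iter(records)
    total = 0
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        # In a distributed engine, each chunk could go to a different worker.
        total += sum(chunk)
    return total

def streaming_totals(events):
    """Streaming style: update and emit a running result as each event arrives."""
    total = 0
    for event in events:
        total += event
        yield total

data = [3, 1, 4, 1, 5, 9]
print(batch_total(data, chunk_size=4))   # one final answer after all chunks
print(list(streaming_totals(data)))      # an updated answer after every event
```

The batch function returns a single result once all data has been read, while the streaming generator produces an up-to-date result after each record, which is what makes it suitable for unbounded, real-time input.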

### Visual Tools (No Coding Required):

- KNIME: Offers a drag-and-drop visual interface for data integration, transformation, visualization, and basic model building. It can be extended with R and Python programming and connects to Apache Spark.
- Orange: A simpler and less flexible tool compared to KNIME, but easier to use for basic data science tasks.

#### [Upgrade Material](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0130EN-SkillsNetwork/storyline/Open%20Source%20Tools/story.html?origin=www.coursera.org)  

# Commercial Tools for Data Science

### Data Management

- Oracle Database, Microsoft SQL Server, and IBM Db2 are industry-standard data management tools.
- Commercial support is a key selling point of these tools; it is delivered directly by software vendors and their partners.

### Data Integration
- Commercial data integration tools include extract, transform, and load (ETL) tools like Informatica PowerCenter and IBM InfoSphere DataStage.
- These tools support the design and deployment of ETL data processing pipelines through graphical interfaces and have connectors to various commercial and open-source systems.
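
The extract, transform, and load steps that these tools assemble graphically can also be sketched by hand. The example below assumes a tiny hypothetical CSV source and an in-memory list as the "warehouse"; it illustrates the ETL pattern only and is unrelated to the internals of PowerCenter or DataStage.

```python
import csv
import io

# Hypothetical raw input; one record has a missing temperature.
RAW_CSV = """name,temp_f
alice,98.6
bob,
carol,100.4
"""

def extract(text):
    """Extract: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete records, normalize names, convert units."""
    cleaned = []
    for row in rows:
        if not row["temp_f"]:
            continue  # skip records with a missing temperature
        temp_c = round((float(row["temp_f"]) - 32) * 5 / 9, 1)
        cleaned.append({"name": row["name"].title(), "temp_c": temp_c})
    return cleaned

def load(rows, target):
    """Load: append the cleaned records to the target store."""
    target.extend(rows)
    return len(rows)

warehouse = []  # stand-in for a target database table
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, warehouse)
```

Keeping each stage as a separate function mirrors how graphical ETL tools represent a pipeline as connected nodes, and makes each step testable on its own.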

### Data Visualization
- Commercial data visualization tools include business intelligence (BI) tools like Tableau, Microsoft Power BI, and IBM Cognos Analytics.
- These tools focus on creating visual reports and live dashboards for end users.
- Watson Studio Desktop includes a component called Data Refinery for data integration and visualization.

### Model Building
- Commercial model building tools include data mining products like SPSS Modeler and SAS Enterprise Miner.
- These tools are tightly integrated into the model-building process and can export models in open formats like predictive model markup language (PMML).

### Model Deployment and Monitoring
- Model deployment is integrated into the model-building process in commercial software.
- Model monitoring is a new discipline, and relevant commercial tools are not currently available, so open-source tools are used instead.

### Code and Data Asset Management
- Open-source tools like Git and GitHub are the de facto standard for code asset management.
- For data asset management, commercial vendors such as Informatica (Enterprise Data Governance) and IBM provide tools that cover data governance and data lineage.

### Development Environment
- Watson Studio is a fully integrated development environment for data scientists, available both in the cloud and as a desktop version.
- It combines Jupyter Notebooks with graphical tools to maximize performance.
- Watson Studio, together with Watson OpenScale, covers the complete data science life cycle and can be deployed in local data centers or on top of Kubernetes/Red Hat OpenShift.

### Other Tools
- H2O Driverless AI is another fully integrated commercial tool that covers the complete data science life cycle.

# Cloud Based Tools for Data Science

### Fully Integrated Visual Tools
- Watson Studio and Watson OpenScale cover the complete development life cycle for all data science, machine learning, and AI tasks.
- Microsoft Azure Machine Learning is another example of a full cloud-hosted offering supporting the complete development life cycle.
- H2O Driverless AI is a product that can be downloaded and installed, but also has a one-click deployment for standard cloud service providers.

### Data Management
- Many commercial tools are available as software-as-a-service (SaaS) offerings, where the cloud provider operates the tool in the cloud.
- Examples include Amazon DynamoDB (on Amazon Web Services), IBM Cloudant, and IBM's Db2 as a service.

### Data Integration
- Commercial data integration tools include extract, transform, and load (ETL) tools like Informatica Cloud Data Integration and IBM's Data Refinery.
- Data Refinery is part of IBM Watson Studio and allows transforming large amounts of raw data into consumable information in a spreadsheet-like interface.

### Data Visualization
- Cloud-based data visualization tools are numerous, with examples including Datameer, IBM's Cognos Business Intelligence suite, and IBM Data Refinery's data exploration and visualization functionality in Watson Studio.
- Various visualizations are available in Watson Studio, such as 3D bar charts, hierarchical edge bundling, 2D scatter plots with heat maps, tree maps, pie charts, and word clouds.

### Model Building
- Model building can be done using services like Watson Machine Learning, which can train and build models using various open-source libraries.
- Google's AI Platform Training is another example of a service for model building.

### Model Deployment
- Model deployment in commercial software is usually tightly integrated into the model-building process.
- Examples include SPSS Collaboration and Deployment Services and Watson Machine Learning's deployment of models using a REST interface.
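
Deploying a model behind a REST interface means clients obtain predictions by sending an HTTP request with the input data. A minimal sketch of such a client follows, assuming a placeholder URL, a made-up payload layout, and a placeholder token; none of these reflect the actual Watson Machine Learning API.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and payload layout, for illustration only.
SCORING_URL = "https://example.com/models/churn/predictions"

def build_scoring_request(fields, values, token):
    """Build (but do not send) a POST request carrying the rows to score."""
    payload = {"fields": fields, "values": values}
    return Request(
        SCORING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

req = build_scoring_request(["age", "plan"], [[42, "basic"]], token="PLACEHOLDER")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request would be a call to `urllib.request.urlopen(req)`; it is left out here because the endpoint is a placeholder.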

### Model Monitoring
- Cloud tools like Amazon SageMaker Model Monitor and Watson OpenScale are used to monitor deployed machine learning and deep learning models continuously.
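
One thing monitoring tools watch for is drift, where the data a deployed model sees moves away from the data it was trained on. The sketch below uses a deliberately simple check (shift of the mean, measured in baseline standard deviations); the sample values and the threshold are hypothetical, and this is not the method used by SageMaker Model Monitor or Watson OpenScale.

```python
from statistics import mean, stdev

def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    return abs(mean(recent) - mean(baseline)) / stdev(baseline)

THRESHOLD = 2.0  # hypothetical alerting threshold

baseline = [10.0, 12.0, 11.0, 9.0, 13.0]  # feature values seen during training
stable = [11.0, 10.0, 12.0]               # production values similar to training
shifted = [18.0, 19.0, 20.0]              # production values that have drifted

for name, sample in (("stable", stable), ("shifted", shifted)):
    score = drift_score(baseline, sample)
    status = "ALERT" if score > THRESHOLD else "ok"
    print(f"{name}: score={score:.2f} {status}")
```

A production monitor would usually compare full distributions per feature and track model quality metrics as well, but the compare-to-baseline-and-alert loop is the core of continuous monitoring.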
