whiterabbit #35

Open
wanghaisheng opened this issue Aug 13, 2018 · 3 comments

wanghaisheng (Owner)

No description provided.

wanghaisheng (Owner) commented Aug 13, 2018

Positioned as a tool to help with the downstream ETL work.

1. The source data can be txt files or one of several databases (MySQL, Oracle, SQL Server, and PostgreSQL).
2. It scans the entire source data and reports detailed information on every table, every field, and every data value. When the scan finishes it produces a report that can be used as a reference for the ETL.
3. It is a data profiling tool.
4. Workflow:

Set the folder where the scan report will be written.

Connect to the source data and test that the connection works.

Select the tables and fields to scan, then run the scan.
Export the scan report.

5. Completing the ETL:
Select one of the target standard database models bundled with the software.
Load the target standard database model.

Map tables to tables.

Map fields to fields.

Generate the ETL documentation.
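
The scan step above is essentially per-field data profiling. As a rough illustration only (this is not White Rabbit's code or API; the table name and file path are made up), a minimal pandas sketch of the kind of per-table, per-field report such a scan produces:

```python
import pandas as pd

def scan_table(df: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """Build a per-field scan report for one table: type, row count,
    missing values, distinct values, and the most frequent values."""
    rows = []
    for col in df.columns:
        series = df[col]
        top = series.value_counts(dropna=True).head(5)
        rows.append({
            "table": table_name,
            "field": col,
            "dtype": str(series.dtype),
            "rows": len(series),
            "missing": int(series.isna().sum()),
            "distinct": int(series.nunique(dropna=True)),
            "top_values": "; ".join(f"{v} ({c})" for v, c in top.items()),
        })
    return pd.DataFrame(rows)

# Hypothetical source file; a real scan would loop over all selected tables.
patients = pd.read_csv("patients.txt", sep="\t")
report = scan_table(patients, "patients")
report.to_excel("scan_report.xlsx", index=False)  # used as the ETL reference
```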

wanghaisheng (Owner) commented Aug 13, 2018

https://datacleaner.github.io/
What is DataCleaner?
Data profiling

The heart of DataCleaner is a strong data profiling engine for discovering and analyzing the quality of your data. Find the patterns, missing values, character sets and other characteristics of your data values.

Interrogating and profiling your data is an essential activity of any Data Quality, Master Data Management or Data Governance program. If you don’t know what you’re up against, you have poor chances of fixing it.

1. Connect to a data source.
Choose "New job from scratch" or go through "Manage datastores".
Cloud services can also be used as data sources.

2. For each scan job, select a number of components:
Analyzers: at least one must be selected.
Transformers: split a raw field, combine two field values into one meaningful value, or generate new columns from the results of lookups based on a value.
Filters: for example, ignore rows with null values during the analysis.

A complete scan job combines these components.

3. Run the scan.

4. A scan job can be saved as a template and reused.
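
To picture the Analyzer / Transformer / Filter component model described above, here is an illustrative pandas sketch (not DataCleaner's actual API; the table and column names are invented):

```python
import pandas as pd

# Hypothetical input: a customer table with an email column.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@example.com", None, "bad-address", "d@example.org"],
})

# Filter: drop rows with a null email so they are ignored by the analysis.
filtered = df[df["email"].notna()].copy()

# Transformer: derive a new, meaningful column (the email domain) from a field.
filtered["email_domain"] = filtered["email"].str.extract(r"@(.+)$", expand=False)

# Analyzer: summarize the values, e.g. completeness and a value distribution.
completeness = filtered["email_domain"].notna().mean()
distribution = filtered["email_domain"].value_counts(dropna=False)

print(f"email_domain completeness: {completeness:.0%}")
print(distribution)
```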

wanghaisheng (Owner) commented Aug 13, 2018

What is Data Profiling? Process, Best Practices, Top 6 Tools

Written by An Bui | June 19, 2018

Data processing and analysis can’t happen without data profiling—reviewing source data for content and quality. As data gets bigger and infrastructure moves to the cloud, data profiling is increasingly important.

Need to achieve big data profiling with limited time and resources? Learn about what data profiling is, its main types, an efficient process, best practices, and the top tools.

What is data profiling?

Data profiling is the process of reviewing source data, understanding structure, content and interrelationships, and identifying potential for data projects.

Data profiling is a crucial part of:

  • Data warehouse and business intelligence (DW/BI) projects—data profiling can uncover data quality issues in data sources, and what needs to be corrected in ETL.
  • Data conversion and migration projects—data profiling can identify data quality issues, which you can handle in scripts and data integration tools copying data from source to target. It can also uncover new requirements for the target system.
  • Source system data quality projects—data profiling can highlight data which suffers from serious or numerous quality issues, and the source of the issues (e.g. user inputs, errors in interfaces, data corruption).

Data profiling involves:

  • Collecting descriptive statistics like min, max, count and sum.
  • Collecting data types, length and recurring patterns.
  • Tagging data with keywords, descriptions or categories.
  • Performing data quality assessment, including the risk of performing joins on the data.
  • Discovering metadata and assessing its accuracy.
  • Identifying distributions, key candidates, foreign-key candidates, functional dependencies, embedded value dependencies, and performing inter-table analysis.
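
As a small illustration of the "descriptive statistics" and "recurring patterns" bullets above (a sketch in pandas; the file and column names are invented):

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical source file

# Descriptive statistics: count, unique values, top value, min/max, mean, etc.
print(df.describe(include="all"))

def value_pattern(value: str) -> str:
    """Map each character to a pattern class: 9 = digit, A = letter."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                   for c in value)

# Recurring patterns in a column, e.g. "999-999-9999" for phone numbers.
patterns = df["phone"].dropna().astype(str).map(value_pattern)
print(patterns.value_counts().head(10))
```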

Types of data profiling

There are three main types of data profiling:

Structure discovery

Validating that data is consistent and formatted correctly, and performing mathematical checks on the data (e.g. sum, minimum or maximum). Structure discovery helps understand how well data is structured—for example, what percentage of phone numbers do not have the correct number of digits.

Content discovery

Looking into individual data records to discover errors. Content discovery identifies which specific rows in a table contain problems, and which systemic issues occur in the data (for example, phone numbers with no area code).
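
A small sketch of the structure and content discovery examples above, checking a phone number column with pandas (the column names are invented, and a US-style 10-digit number is assumed to be "correct"):

```python
import pandas as pd

df = pd.read_csv("contacts.csv")  # hypothetical source file

# Keep only the digits of each phone number.
digits = df["phone"].astype(str).str.replace(r"\D", "", regex=True)

# Structure discovery: what share of phone numbers has the expected 10 digits?
correct_length = digits.str.len().eq(10)
print(f"{correct_length.mean():.1%} of phone numbers have 10 digits")

# Content discovery: which specific rows are missing the area code
# (only 7 digits), so they can be reviewed or corrected?
missing_area_code = df[digits.str.len().eq(7)]
print(missing_area_code[["customer_id", "phone"]])
```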

Relationship discovery

Discovering how parts of the data are interrelated. For example, key relationships between database tables, references between cells or tables in a spreadsheet. Understanding relationships is crucial to reusing data; related data sources should be united into one or imported in a way that preserves important relationships.
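
One way to automate relationship discovery is to look for inclusion dependencies between tables, i.e. columns whose values are fully contained in a unique column of another table. A rough pandas sketch (the tables and file names are invented):

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical parent table
orders = pd.read_csv("orders.csv")        # hypothetical child table

def foreign_key_candidates(child: pd.DataFrame, parent: pd.DataFrame):
    """Find column pairs where every non-null child value appears in a
    unique parent column, i.e. candidate foreign-key relationships."""
    candidates = []
    for p_col in parent.columns:
        if not parent[p_col].is_unique:
            continue  # only unique parent columns can act as keys
        parent_values = set(parent[p_col].dropna())
        for c_col in child.columns:
            child_values = child[c_col].dropna()
            if len(child_values) and child_values.isin(parent_values).all():
                candidates.append((c_col, p_col))
    return candidates

print(foreign_key_candidates(orders, customers))
```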

Data profiling steps—an efficient process for data profiling

Ralph Kimball, a father of data warehouse architecture, suggests a four-step process for data profiling:

  1. Use data profiling at project start to discover if data is suitable for analysis—and make a “go / no go” decision on the project.
  2. Identify and correct data quality issues in source data, even before starting to move it into the target database.
  3. Identify data quality issues that can be corrected by Extract-Transform-Load (ETL), while data is moved from source to target. Data profiling can uncover if additional manual processing is needed.
  4. Identify unanticipated business rules, hierarchical structures and foreign key / primary key relationships, and use them to fine-tune the ETL process.

Data profiling and data quality analysis best practices

Basic data profiling techniques:

  • Distinct count and percent—identifies natural keys, distinct values in each column that can help process inserts and updates. Handy for tables without headers.
  • Percent of zero / blank / null values—identifies missing or unknown data. Helps ETL architects set up appropriate default values.
  • Minimum / maximum / average string length—helps select appropriate data types and sizes in target database. Enables setting column widths just wide enough for the data, to improve performance.
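
A minimal pandas version of these three basic checks (the source file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("source_table.csv")  # hypothetical source table
n = len(df)
rows = []

for col in df.columns:
    text = df[col].astype(str).str.strip()
    blank_or_null = df[col].isna() | text.eq("")
    lengths = text[~blank_or_null].str.len()
    rows.append({
        "column": col,
        # Distinct count and percent: helps spot natural key candidates.
        "distinct": df[col].nunique(),
        "distinct_pct": round(df[col].nunique() / n, 3),
        # Percent of blank/null values: missing or unknown data.
        "blank_or_null_pct": round(blank_or_null.mean(), 3),
        # Min/max/average string length: helps size target columns.
        "min_len": lengths.min(),
        "max_len": lengths.max(),
        "avg_len": round(lengths.mean(), 1),
    })

print(pd.DataFrame(rows).set_index("column"))
```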

Advanced data profiling techniques:

  • Key integrity—ensures keys are always present in the data, using zero/blank/null analysis. Also, helps identify orphan keys, which are problematic for ETL and future analysis.
  • Cardinality—checks relationships like one-to-one, one-to-many, many-to-many, between related data sets. This helps BI tools perform inner or outer joins correctly.
  • Pattern and frequency distributions—checks if data fields are formatted correctly, for example if emails are in a valid format. Extremely important for data fields used for outbound communications (emails, phone numbers, addresses).
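
A sketch of the key integrity and cardinality checks in pandas (the table and column names are invented; orders.customer_id is assumed to reference customers.customer_id):

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical parent table
orders = pd.read_csv("orders.csv")        # hypothetical child table

# Key integrity: the foreign key should never be null ...
null_fk = orders["customer_id"].isna().sum()

# ... and every value should exist in the parent table (no orphan keys).
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(f"null foreign keys: {null_fk}, orphan rows: {len(orphans)}")

# Cardinality: how many orders per customer? A maximum of 1 means one-to-one;
# more than 1 means one-to-many, which tells BI tools how to join the tables.
per_customer = orders.groupby("customer_id").size()
print("relationship:", "one-to-one" if per_customer.max() <= 1 else "one-to-many")
```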

6 data profiling tools—open source and commercial

Data profiling, a tedious and labor-intensive activity, can be automated with tools to make huge data projects more feasible. These are essential to your data analytics stack.

Open source data profiling tools

  1. Quadient DataCleaner—key features include:
  • Data quality, data profiling and data wrangling
  • Detect and merge duplicates
  • Boolean analysis
  • Completeness analysis
  • Character set distribution
  • Date gap analysis
  • Reference data matching


2. Aggregate Profiler (Open Source Data Quality and Profiling)—key features include:

  • Data profiling, filtering, and governance
  • Similarity checks
  • Data enrichment
  • Real time alerting for data issues or changes
  • Basket analysis with bubble chart validation
  • Single customer view
  • Dummy data creation
  • Metadata discovery
  • Anomaly discovery and data cleansing tool
  • Hadoop integration


3. Talend Open Studio—a suite of open source tools, data quality features include:

  • Customizable data assessment
  • A pattern library
  • Analytics with graphical charts
  • Fraud pattern detection
  • Column set analysis
  • Advanced matching
  • Time column correlation


Commercial data profiling tools

4. Data Profiling in Informatica—key features include:

  • Data stewardship console which mimics data management workflow
  • Exception handling interface for business users
  • Enterprise data governance
  • Map data quality rules once and deploy on any platform
  • Data standardization, enrichment, de-duplication and consolidation
  • Metadata management


5. Oracle Enterprise Data Quality—key features include:

  • Data profiling, auditing and dashboards
  • Parsing and standardization including constructed fields, misfiled data, poorly structured data and notes fields
  • Automated match and merge
  • Case management by human operators
  • Address verification
  • Product data verification
  • Integration with Oracle Master Data Management


6. SAS DataFlux—key features include:

  • Extracts, cleanses, transforms, conforms, aggregates, loads and manages data
  • Supports batch-oriented and real-time Master Data Management
  • Creates real-time, reusable data integration services
  • User-friendly semantic reference data layer
  • Visibility into where data originated and how it was transformed
  • Optional enrichment components


Data profiling in a cloud-based data pipeline: the need for speed

Traditional data profiling, as described in this post, is a complex activity performed by data engineers prior to, and during, ingestion of data to a data warehouse. Data is meticulously analyzed and processed (with partial automation) before it is ready to enter the pipeline.

Today more organizations are moving data infrastructure to the cloud, and discovering that data ingestion can happen at the click of a button. Cloud data warehouses, data management tools and ETL services come pre-integrated with hundreds of data sources. But if you can click a button and move data instantly into your target system, what about data profiling?

Data profiling is more crucial than ever, with huge volumes flowing through the big data pipeline and the prevalence of unstructured data. In a cloud-based data pipeline architecture, you need an automated data warehouse that can take care of data profiling and preparation on its own. Instead of analyzing and treating the data using a data profiling tool, just pour it into the automated data warehouse, and it will automatically be cleaned, optimized and prepared for analysis.
