whiterabbit #35

Open
wanghaisheng opened this issue Aug 13, 2018 · 3 comments

wanghaisheng (Owner)

No description provided.

wanghaisheng (Owner) commented Aug 13, 2018

Positioned as a tool to help with the downstream ETL work.

1. The source data can be txt files or one of several databases (MySQL, Oracle, SQL Server, and PostgreSQL).
2. It scans the entire source data and reports detailed information on every table, every field, and every data value. When the scan finishes it produces a report that can be used as a reference for the ETL.
3. It is a data profiling tool.
4. Workflow:

Set the folder where the scan report will be written.

Connect to the source data and test that the connection works.

Select the tables and fields to scan, then run the scan.
Export the scan report.

5. Completing the ETL:
Select one of the target standard database models bundled with the software.
Load the target standard database model.

Map tables to tables.

Map fields to fields.

Generate the ETL documentation.
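
The scan step above is essentially per-field data profiling. As a rough illustration only (this is not White Rabbit's code or API; the table name and file path are made up), a minimal pandas sketch of the kind of per-table, per-field report such a scan produces:

```python
import pandas as pd

def scan_table(df: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """Build a per-field scan report for one table: type, row count,
    missing values, distinct values, and the most frequent values."""
    rows = []
    for col in df.columns:
        series = df[col]
        top = series.value_counts(dropna=True).head(5)
        rows.append({
            "table": table_name,
            "field": col,
            "dtype": str(series.dtype),
            "rows": len(series),
            "missing": int(series.isna().sum()),
            "distinct": int(series.nunique(dropna=True)),
            "top_values": "; ".join(f"{v} ({c})" for v, c in top.items()),
        })
    return pd.DataFrame(rows)

# Hypothetical source file; a real scan would loop over all selected tables.
patients = pd.read_csv("patients.txt", sep="\t")
report = scan_table(patients, "patients")
report.to_excel("scan_report.xlsx", index=False)  # used as the ETL reference
```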

wanghaisheng (Owner) commented Aug 13, 2018

https://datacleaner.github.io/
What is DataCleaner?
Data profiling

The heart of DataCleaner is a strong data profiling engine for discovering and analyzing the quality of your data. Find the patterns, missing values, character sets and other characteristics of your data values.

Interrogating and profiling your data is an essential activity of any Data Quality, Master Data Management or Data Governance program. If you don’t know what you’re up against, you have poor chances of fixing it.

1. Connect to a data source.
Choose "New job from scratch" or go through "Manage datastores".
Cloud services can also be used as data sources.

2. For each scan job, select a number of components:
Analyzers: at least one must be selected.
Transformers: split a raw field, combine two field values into one meaningful value, or generate new columns from the results of lookups based on a value.
Filters: for example, ignore rows with null values during the analysis.

A complete scan job combines these components.

3. Run the scan.

4. A scan job can be saved as a template and reused.
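
To picture the Analyzer / Transformer / Filter component model described above, here is an illustrative pandas sketch (not DataCleaner's actual API; the table and column names are invented):

```python
import pandas as pd

# Hypothetical input: a customer table with an email column.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@example.com", None, "bad-address", "d@example.org"],
})

# Filter: drop rows with a null email so they are ignored by the analysis.
filtered = df[df["email"].notna()].copy()

# Transformer: derive a new, meaningful column (the email domain) from a field.
filtered["email_domain"] = filtered["email"].str.extract(r"@(.+)$", expand=False)

# Analyzer: summarize the values, e.g. completeness and a value distribution.
completeness = filtered["email_domain"].notna().mean()
distribution = filtered["email_domain"].value_counts(dropna=False)

print(f"email_domain completeness: {completeness:.0%}")
print(distribution)
```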

wanghaisheng (Owner) commented Aug 13, 2018

What is Data Profiling? Process, Best Practices, Top 6 Tools

Written by An Bui | June 19, 2018

Data processing and analysis can’t happen without data profiling—reviewing source data for content and quality. As data gets bigger and infrastructure moves to the cloud, data profiling is increasingly important.

Need to achieve big data profiling with limited time and resources? Learn about what data profiling is, its main types, an efficient process, best practices, and the top tools.

What is data profiling?

Data profiling is the process of reviewing source data, understanding structure, content and interrelationships, and identifying potential for data projects.

Data profiling is a crucial part of:

  • Data warehouse and business intelligence (DW/BI) projects—data profiling can uncover data quality issues in data sources, and what needs to be corrected in ETL.
  • Data conversion and migration projects—data profiling can identify data quality issues, which you can handle in scripts and data integration tools copying data from source to target. It can also uncover new requirements for the target system.
  • Source system data quality projects—data profiling can highlight data which suffers from serious or numerous quality issues, and the source of the issues (e.g. user inputs, errors in interfaces, data corruption).

Data profiling involves:

  • Collecting descriptive statistics like min, max, count and sum.
  • Collecting data types, length and recurring patterns.
  • Tagging data with keywords, descriptions or categories.
  • Performing data quality assessment, including the risk of performing joins on the data.
  • Discovering metadata and assessing its accuracy.
  • Identifying distributions, key candidates, foreign-key candidates, functional dependencies, embedded value dependencies, and performing inter-table analysis.
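
As a small illustration of the "descriptive statistics" and "recurring patterns" bullets above (a sketch in pandas; the file and column names are invented):

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical source file

# Descriptive statistics: count, unique values, top value, min/max, mean, etc.
print(df.describe(include="all"))

def value_pattern(value: str) -> str:
    """Map each character to a pattern class: 9 = digit, A = letter."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                   for c in value)

# Recurring patterns in a column, e.g. "999-999-9999" for phone numbers.
patterns = df["phone"].dropna().astype(str).map(value_pattern)
print(patterns.value_counts().head(10))
```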

Types of data profiling

There are three main types of data profiling:

Structure discovery

Validating that data is consistent and formatted correctly, and performing mathematical checks on the data (e.g. sum, minimum or maximum). Structure discovery helps understand how well data is structured—for example, what percentage of phone numbers do not have the correct number of digits.

Content discovery

Looking into individual data records to discover errors. Content discovery identifies which specific rows in a table contain problems, and which systemic issues occur in the data (for example, phone numbers with no area code).
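
A small sketch of the structure and content discovery examples above, checking a phone number column with pandas (the column names are invented, and a US-style 10-digit number is assumed to be "correct"):

```python
import pandas as pd

df = pd.read_csv("contacts.csv")  # hypothetical source file

# Keep only the digits of each phone number.
digits = df["phone"].astype(str).str.replace(r"\D", "", regex=True)

# Structure discovery: what share of phone numbers has the expected 10 digits?
correct_length = digits.str.len().eq(10)
print(f"{correct_length.mean():.1%} of phone numbers have 10 digits")

# Content discovery: which specific rows are missing the area code
# (only 7 digits), so they can be reviewed or corrected?
missing_area_code = df[digits.str.len().eq(7)]
print(missing_area_code[["customer_id", "phone"]])
```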

Relationship discovery

Discovering how parts of the data are interrelated. For example, key relationships between database tables, references between cells or tables in a spreadsheet. Understanding relationships is crucial to reusing data; related data sources should be united into one or imported in a way that preserves important relationships.
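
One way to automate relationship discovery is to look for inclusion dependencies between tables, i.e. columns whose values are fully contained in a unique column of another table. A rough pandas sketch (the tables and file names are invented):

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical parent table
orders = pd.read_csv("orders.csv")        # hypothetical child table

def foreign_key_candidates(child: pd.DataFrame, parent: pd.DataFrame):
    """Find column pairs where every non-null child value appears in a
    unique parent column, i.e. candidate foreign-key relationships."""
    candidates = []
    for p_col in parent.columns:
        if not parent[p_col].is_unique:
            continue  # only unique parent columns can act as keys
        parent_values = set(parent[p_col].dropna())
        for c_col in child.columns:
            child_values = child[c_col].dropna()
            if len(child_values) and child_values.isin(parent_values).all():
                candidates.append((c_col, p_col))
    return candidates

print(foreign_key_candidates(orders, customers))
```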

Data profiling steps—an efficient process for data profiling

Ralph Kimball, a father of data warehouse architecture, suggests a four-step process for data profiling:

  1. Use data profiling at project start to discover if data is suitable for analysis—and make a “go / no go” decision on the project.
  2. Identify and correct data quality issues in source data, even before starting to move it into the target database.
  3. Identify data quality issues that can be corrected by Extract-Transform-Load (ETL), while data is moved from source to target. Data profiling can uncover if additional manual processing is needed.
  4. Identify unanticipated business rules, hierarchical structures and foreign key / primary key relationships, and use them to fine-tune the ETL process.

Data profiling and data quality analysis best practices

Basic data profiling techniques:

  • Distinct count and percent—identifies natural keys, distinct values in each column that can help process inserts and updates. Handy for tables without headers.
  • Percent of zero / blank / null values—identifies missing or unknown data. Helps ETL architects set up appropriate default values.
  • Minimum / maximum / average string length—helps select appropriate data types and sizes in target database. Enables setting column widths just wide enough for the data, to improve performance.
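
A minimal pandas version of these three basic checks (the source file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("source_table.csv")  # hypothetical source table
n = len(df)
rows = []

for col in df.columns:
    text = df[col].astype(str).str.strip()
    blank_or_null = df[col].isna() | text.eq("")
    lengths = text[~blank_or_null].str.len()
    rows.append({
        "column": col,
        # Distinct count and percent: helps spot natural key candidates.
        "distinct": df[col].nunique(),
        "distinct_pct": round(df[col].nunique() / n, 3),
        # Percent of blank/null values: missing or unknown data.
        "blank_or_null_pct": round(blank_or_null.mean(), 3),
        # Min/max/average string length: helps size target columns.
        "min_len": lengths.min(),
        "max_len": lengths.max(),
        "avg_len": round(lengths.mean(), 1),
    })

print(pd.DataFrame(rows).set_index("column"))
```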

Advanced data profiling techniques:

  • Key integrity—ensures keys are always present in the data, using zero/blank/null analysis. Also, helps identify orphan keys, which are problematic for ETL and future analysis.
  • Cardinality—checks relationships like one-to-one, one-to-many, many-to-many, between related data sets. This helps BI tools perform inner or outer joins correctly.
  • Pattern and frequency distributions—checks if data fields are formatted correctly, for example if emails are in a valid format. Extremely important for data fields used for outbound communications (emails, phone numbers, addresses).
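
A sketch of the key integrity and cardinality checks in pandas (the table and column names are invented; orders.customer_id is assumed to reference customers.customer_id):

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical parent table
orders = pd.read_csv("orders.csv")        # hypothetical child table

# Key integrity: the foreign key should never be null ...
null_fk = orders["customer_id"].isna().sum()

# ... and every value should exist in the parent table (no orphan keys).
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(f"null foreign keys: {null_fk}, orphan rows: {len(orphans)}")

# Cardinality: how many orders per customer? A maximum of 1 means one-to-one;
# more than 1 means one-to-many, which tells BI tools how to join the tables.
per_customer = orders.groupby("customer_id").size()
print("relationship:", "one-to-one" if per_customer.max() <= 1 else "one-to-many")
```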

6 data profiling tools—open source and commercial

Data profiling, a tedious and labor-intensive activity, can be automated with tools to make huge data projects more feasible. These are essential to your data analytics stack.

Open source data profiling tools

  1. Quadient DataCleaner—key features include:
  • Data quality, data profiling and data wrangling
  • Detect and merge duplicates
  • Boolean analysis
  • Completeness analysis
  • Character set distribution
  • Date gap analysis
  • Reference data matching


2. Aggregate Profiler (Open Source Data Quality and Profiling)—key features include:

  • Data profiling, filtering, and governance
  • Similarity checks
  • Data enrichment
  • Real time alerting for data issues or changes
  • Basket analysis with bubble chart validation
  • Single customer view
  • Dummy data creation
  • Metadata discovery
  • Anomaly discovery and data cleansing tool
  • Hadoop integration


3. Talend Open Studio—a suite of open source tools, data quality features include:

  • Customizable data assessment
  • A pattern library
  • Analytics with graphical charts
  • Fraud pattern detection
  • Column set analysis
  • Advanced matching
  • Time column correlation


Commercial data profiling tools

4. Data Profiling in Informatica—key features include:

  • Data stewardship console which mimics data management workflow
  • Exception handling interface for business users
  • Enterprise data governance
  • Map data quality rules once and deploy on any platform
  • Data standardization, enrichment, de-duplication and consolidation
  • Metadata management


5. Oracle Enterprise Data Quality—key features include:

  • Data profiling, auditing and dashboards
  • Parsing and standardization including constructed fields, misfiled data, poorly structured data and notes fields
  • Automated match and merge
  • Case management by human operators
  • Address verification
  • Product data verification
  • Integration with Oracle Master Data Management


6. SAS DataFlux—key features include:

  • Extracts, cleanses, transforms, conforms, aggregates, loads and manages data
  • Supports batch-oriented and real-time Master Data Management
  • Creates real-time, reusable data integration services
  • User-friendly semantic reference data layer
  • Visibility into where data originated and how it was transformed
  • Optional enrichment components


Data profiling in a cloud-based data pipeline: the need for speed

Traditional data profiling, as described in this post, is a complex activity performed by data engineers prior to, and during, ingestion of data to a data warehouse. Data is meticulously analyzed and processed (with partial automation) before it is ready to enter the pipeline.

Today more organizations are moving data infrastructure to the cloud, and discovering that data ingestion can happen at the click of a button. Cloud data warehouses, data management tools and ETL services come pre-integrated with hundreds of data sources. But if you can click a button and move data instantly into your target system, what about data profiling?

Data profiling is more crucial than ever, with huge volumes flowing through the big data pipeline and the prevalence of unstructured data. In a cloud-based data pipeline architecture, you need an automated data warehouse that can take care of data profiling and preparation on its own. Instead of analyzing and treating the data using a data profiling tool, just pour it into the automated data warehouse, and it will automatically be cleaned, optimized and prepared for analysis.
