About

The goal is to capture each source's measurable value, in terms of its contribution to person formation in the identity graph, and present it in a readable scorecard table.

MapReduce:

Mapper
Reducer

mapper.py takes the dataset and converts it into another set of data in which individual elements are broken down into (key, value) tuples, the intermediate output. reducer.py takes the output of mapper.py as its input. In the reducer, each string from mapper.py is parsed as a key, value pair and kept in a set(). Combining these tuples into a smaller set produces the final output.
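The mapper and reducer steps described here can be sketched roughly as follows. This is a minimal illustration, not the repository's mapper.py or reducer.py: the tab-separated record layout, the choice of the first field as key, and the function names map_line and reduce_pairs are all assumptions made for the example.

```python
from itertools import groupby


def map_line(line):
    """Split one tab-separated input record into a (key, value) pair.

    Assumption: the first field is the key and the second is the value.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return None  # skip malformed records
    return fields[0], fields[1]


def reduce_pairs(lines):
    """Yield (key, set_of_values) for each key in key-sorted input lines.

    Hadoop Streaming delivers reducer input sorted by key, so consecutive
    lines with the same key can be grouped one key at a time.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    pairs = (p for p in parsed if len(p) == 2)
    for key, group in groupby(pairs, key=lambda p: p[0]):
        # A set keeps each distinct value once, shrinking the output.
        yield key, {value for _, value in group}
```

In a streaming job, the mapper script would loop over sys.stdin, call map_line on each line, and print the key and value separated by a tab; the reducer script would pass sys.stdin to reduce_pairs.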

Technologies Used

Jenkins
Hadoop
Docker
Airflow
BigQuery

Docker Installation

Click here to download Docker

Double-click Docker.dmg to start the install process, then move Docker to the Applications folder.

When the installation completes and Docker starts, the whale in the top status bar shows that Docker is running and accessible from a terminal.

Run docker version in a terminal to check which release is installed and confirm that it is the latest.

Add the mapper to the Dockerfile ---> COPY mapper.py
Add the reducer to the Dockerfile ---> COPY reducer.py

Airflow

Airflow is a platform to programmatically author, schedule and monitor data pipelines. The Airflow scheduler executes your tasks while following the dependencies specified in your DAGs.

run_touchpoint_source.py

Import the library in the Python script: import abilitec_build_utils as abu
Pass the required parameters to abu.run_dataproc_hadoop_streaming_with_most_recent_source_inputs()
Save the file
Add the file to the Dockerfile ---> COPY run_touchpoint_source.py

Path Structure: gs:///<prefix(DATA/US)>//yyyymmdd/partfile
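Since inputs are organised under a yyyymmdd folder, picking the most recent input comes down to finding the latest date segment. The sketch below shows one way to do that in plain Python. It is not the logic inside abilitec_build_utils, which is not shown here; the helper name most_recent_date is hypothetical, and any bucket or source names passed to it are the caller's own.

```python
import re
from datetime import datetime

# Matches an eight-digit path segment such as /20230105/.
DATE_SEGMENT = re.compile(r"/(\d{8})/")


def most_recent_date(paths):
    """Return the latest valid yyyymmdd segment found in paths, or None."""
    dates = []
    for path in paths:
        match = DATE_SEGMENT.search(path)
        if not match:
            continue
        try:
            datetime.strptime(match.group(1), "%Y%m%d")
        except ValueError:
            continue  # eight digits, but not a real calendar date
        dates.append(match.group(1))
    # yyyymmdd strings sort chronologically when compared as text.
    return max(dates) if dates else None
```

Because the date is zero-padded and ordered year, month, day, a plain string comparison is enough; no date arithmetic is needed.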

An Airflow DAG can have multiple dependencies but no cycles. The scheduler takes care of the order in which tasks run ---> define this in the Python script source_scorecard_standalone.py
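To illustrate why a dependency graph without cycles always has a valid run order, here is a small sketch using the standard library's graphlib module (Python 3.9 or later). It only demonstrates the idea of ordering tasks by their dependencies; it is not how the Airflow scheduler is implemented, and the function name run_order is made up for this example.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies):
    """Return one valid execution order.

    dependencies maps each task name to the set of tasks it must wait for.
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

Passing a graph in which two tasks wait on each other raises CycleError, which mirrors the reason a DAG must stay acyclic: no task in the loop could ever start.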

Creating an airflow DAG in scorecard_standalone.py

Create/Rename --> DOCKER_IMAGE_NAME, SOURCE_NAMES, COMMAND
Create DAG:

with DAG(DAG_NAME, default_args=dict(D), params={}, concurrency=16, catchup=False, schedule_interval=None, user_defined_filters=dict(recursive_render=recursive_render)) as dag:

Create specific task:

task_name = docker_run_operator('TASK_NAME','DOCKER_IMAGE','COMMAND')

Add the file to the Dockerfile ---> COPY scorecard_standalone.py

BigQuery

Create BigQuery layout:

sourceScorecard_bq_layout.json can be used as a template
If needed, use fieldnames.py to generate field names for a long list of values
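As a rough idea of what generating a layout from a list of names can look like, the sketch below builds a list of field definitions in the name/type/mode shape that BigQuery accepts for JSON schema files. It is an assumption-laden illustration: the contents of fieldnames.py and sourceScorecard_bq_layout.json are not shown here, the function name build_layout is hypothetical, and the default type and mode are placeholders to adjust.

```python
import json


def build_layout(field_names, field_type="STRING", mode="NULLABLE"):
    """Return a list of BigQuery-style field definitions, one per name.

    Assumption: every field shares the same type and mode.
    """
    return [
        {"name": name, "type": field_type, "mode": mode}
        for name in field_names
    ]


def layout_to_json(field_names):
    """Serialise the layout as indented JSON text, ready to save to a file."""
    return json.dumps(build_layout(field_names), indent=2)
```

Generating the layout this way avoids typing dozens of near-identical entries by hand when a source contributes a long list of columns.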

Contributions

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Notes

For confidentiality reasons:

Data that includes source names has been removed from files
Jenkins file not included
run_touchpoint_source.py not included

