
Python 3.6 · Spark

Consumer Complaints

Table of Contents

  1. Project Summary
  2. Run Instructions
  3. Repo structure
  4. Dataset

Project Summary

Introduction

Instead of TV ads or traditional billboards, more and more companies rely on the internet for their marketing, and they want to measure their brand awareness across the web, especially when comparing against competitors or evaluating a specific campaign. However, this information is spread across many different platforms, making it difficult to get a big picture of a brand's exposure throughout the internet. To address that, this data pipeline ingests crawl data covering the whole web, then ranks and compares the brand popularity of the top U.S. fast-food chains using the normalized count of their mentions on the internet over time. The same method can be generalized to other industries, and even to election campaigns, to evaluate popularity and branding efficiency.

Slide

Demo Slide

Pipeline

(Pipeline architecture diagram)

The pipeline first retrieves from S3 the index files that contain the paths to the WARC files for each crawl record, then uses Spark SQL queries on URL keywords to extract the exact file path, offset, and length of every potentially relevant record. After shuffling the query results by file path, Spark ingests the actual WARC files containing the crawl metadata and HTML responses, processes and normalizes the mention counts for each brand across platforms over time, and saves the results as CSV files in S3, which are then used for visualization in Tableau.
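
To make the two stages concrete, here is a minimal PySpark sketch under stated assumptions: it reads the public Common Crawl columnar index (the cc-index Parquet table in the commoncrawl S3 bucket) for the Spark SQL filter, and fetches each matching WARC record by byte range with boto3 and warcio. The brand keywords, URL patterns, partition count, and output path are illustrative placeholders rather than the values used in requestcount.py, and the per-time/per-platform normalization is omitted for brevity.

```python
# Minimal sketch of the pipeline; assumes PySpark with the hadoop-aws package
# plus boto3 and warcio installed on the workers. Keyword list and paths are
# illustrative placeholders, not the repository's actual identifiers.
from io import BytesIO

import boto3
from warcio.archiveiterator import ArchiveIterator
from pyspark.sql import SparkSession

BRANDS = ["mcdonald", "burger king", "subway", "wendy"]      # illustrative keyword list
CC_INDEX = "s3a://commoncrawl/cc-index/table/cc-main/warc/"  # public columnar index

def count_mentions(row):
    """Fetch one WARC record by byte range and count brand mentions in its HTML."""
    s3 = boto3.client("s3")
    start = row.warc_record_offset
    end = start + row.warc_record_length - 1
    resp = s3.get_object(Bucket="commoncrawl", Key=row.warc_filename,
                         Range="bytes={}-{}".format(start, end))
    pairs = []
    for record in ArchiveIterator(BytesIO(resp["Body"].read())):
        if record.rec_type != "response":
            continue
        html = record.content_stream().read().decode("utf-8", errors="ignore").lower()
        pairs.extend((brand, html.count(brand)) for brand in BRANDS)
    return pairs

spark = SparkSession.builder.appName("brand-mentions").getOrCreate()

# Stage 1: query the columnar index for records whose URL matches a brand keyword,
# keeping only the coordinates needed to fetch each WARC record.
spark.read.parquet(CC_INDEX).createOrReplaceTempView("ccindex")
hits = spark.sql("""
    SELECT warc_filename, warc_record_offset, warc_record_length
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2020-10' AND subset = 'warc'
      AND (url LIKE '%mcdonalds%' OR url LIKE '%burgerking%' OR url LIKE '%subway%')
""")

# Stage 2: shuffle by WARC file, fetch each record by byte range, count mentions
# per brand, and write the aggregated result to S3 as CSV for Tableau.
(hits.repartition(200, "warc_filename").rdd
     .flatMap(count_mentions)
     .reduceByKey(lambda a, b: a + b)
     .toDF(["brand", "mentions"])
     .write.csv("s3a://your-bucket/brand_counts/", header=True, mode="overwrite"))
```

Repartitioning on warc_filename before the fetch keeps requests against the same WARC file on the same workers, which is the shuffle step described above.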

Run Instructions

  1. Set up an S3 bucket.
  2. Set up an AWS EMR cluster with the package installation script in /bootstrap/install_python_modules.sh, or set up Spark yourself and deploy it to the cluster.
  3. Run the Spark job with spark-submit --master yarn --deploy-mode client --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.7 requestcount.py --read_input {False} --input_crawl {crawl_session} {output_path} (see the example invocation after this list). {crawl_session} is the crawl index partition for a specific month (e.g. "CC-MAIN-2020-10"), and {output_path} should be replaced with the S3 object paths for the Athena query results and the output. The --read_input flag can also be set to True to read a CSV file of SQL query results that locates the WARC records.
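
For example, a run against the CC-MAIN-2020-10 crawl could look like the sketch below. The output path is a placeholder bucket, and the argument order assumes requestcount.py parses its own --read_input/--input_crawl flags, since spark-submit expects application arguments to follow the script name:

```
spark-submit --master yarn --deploy-mode client \
    --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.7 \
    requestcount.py \
    --read_input False \
    --input_crawl CC-MAIN-2020-10 \
    s3a://your-bucket/output/
```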

Repo structure

Dataset
