
Python 3.6 · Spark

Consumer Complaints

Table of Contents

  1. Project Summary
  2. Run Instructions
  3. Repo structure
  4. Dataset

Project Summary

Introduction

Instead of TV ads or traditional billboards, more and more companies rely on the internet for their marketing, and they want to measure their brand awareness across the web, especially when comparing against competitors or evaluating a specific campaign. However, this information is spread across many different platforms, making it difficult to get a big picture of a brand's exposure throughout the internet. To address that, this data pipeline ingests crawl data covering the whole web, then ranks and compares the brand popularity of the top U.S. fast-food chains using the normalized count of their mentions on the internet over time. The same method can be generalized to other industries, and even to election campaigns, to evaluate popularity and branding efficiency.

Slide

Demo Slide

Pipeline

(Pipeline architecture diagram)

The pipeline first retrieves from S3 the index files that contain the paths to the WARC files for each crawl record, then uses Spark SQL queries on URL keywords to extract the exact file path, offset, and length of every potentially relevant record. After shuffling the query results by file path, Spark ingests the actual WARC files containing the crawl metadata and HTML responses, processes and normalizes the mention counts for each brand across platforms over time, and saves the results as CSV files in S3, which are then used for visualization in Tableau.
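
To make the two stages concrete, here is a minimal PySpark sketch under stated assumptions: it reads the public Common Crawl columnar index (the cc-index Parquet table in the commoncrawl S3 bucket) for the Spark SQL filter, and fetches each matching WARC record by byte range with boto3 and warcio. The brand keywords, URL patterns, partition count, and output path are illustrative placeholders rather than the values used in requestcount.py, and the per-time/per-platform normalization is omitted for brevity.

```python
# Minimal sketch of the pipeline; assumes PySpark with the hadoop-aws package
# plus boto3 and warcio installed on the workers. Keyword list and paths are
# illustrative placeholders, not the repository's actual identifiers.
from io import BytesIO

import boto3
from warcio.archiveiterator import ArchiveIterator
from pyspark.sql import SparkSession

BRANDS = ["mcdonald", "burger king", "subway", "wendy"]      # illustrative keyword list
CC_INDEX = "s3a://commoncrawl/cc-index/table/cc-main/warc/"  # public columnar index

def count_mentions(row):
    """Fetch one WARC record by byte range and count brand mentions in its HTML."""
    s3 = boto3.client("s3")
    start = row.warc_record_offset
    end = start + row.warc_record_length - 1
    resp = s3.get_object(Bucket="commoncrawl", Key=row.warc_filename,
                         Range="bytes={}-{}".format(start, end))
    pairs = []
    for record in ArchiveIterator(BytesIO(resp["Body"].read())):
        if record.rec_type != "response":
            continue
        html = record.content_stream().read().decode("utf-8", errors="ignore").lower()
        pairs.extend((brand, html.count(brand)) for brand in BRANDS)
    return pairs

spark = SparkSession.builder.appName("brand-mentions").getOrCreate()

# Stage 1: query the columnar index for records whose URL matches a brand keyword,
# keeping only the coordinates needed to fetch each WARC record.
spark.read.parquet(CC_INDEX).createOrReplaceTempView("ccindex")
hits = spark.sql("""
    SELECT warc_filename, warc_record_offset, warc_record_length
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2020-10' AND subset = 'warc'
      AND (url LIKE '%mcdonalds%' OR url LIKE '%burgerking%' OR url LIKE '%subway%')
""")

# Stage 2: shuffle by WARC file, fetch each record by byte range, count mentions
# per brand, and write the aggregated result to S3 as CSV for Tableau.
(hits.repartition(200, "warc_filename").rdd
     .flatMap(count_mentions)
     .reduceByKey(lambda a, b: a + b)
     .toDF(["brand", "mentions"])
     .write.csv("s3a://your-bucket/brand_counts/", header=True, mode="overwrite"))
```

Repartitioning on warc_filename before the fetch keeps requests against the same WARC file on the same workers, which is the shuffle step described above.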

Run Instructions

  1. Set up an S3 bucket.
  2. Set up an AWS EMR cluster with the package installation script in /bootstrap/install_python_modules.sh, or set up Spark yourself and deploy it to the cluster.
  3. Run the Spark job with spark-submit --master yarn --deploy-mode client --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.7 requestcount.py --read_input {False} --input_crawl {crawl_session} {output_path} (see the example invocation after this list). {crawl_session} is the crawl index partition for a specific month (e.g. "CC-MAIN-2020-10"), and {output_path} should be replaced with the S3 object paths for the Athena query results and the output. The --read_input flag can also be set to True to read a CSV file of SQL query results that locates the WARC records.
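
For example, a run against the CC-MAIN-2020-10 crawl could look like the sketch below. The output path is a placeholder bucket, and the argument order assumes requestcount.py parses its own --read_input/--input_crawl flags, since spark-submit expects application arguments to follow the script name:

```
spark-submit --master yarn --deploy-mode client \
    --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.7 \
    requestcount.py \
    --read_input False \
    --input_crawl CC-MAIN-2020-10 \
    s3a://your-bucket/output/
```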

Repo structure

Dataset
