
Walmart Products Data Scraper

This Python script is a data pipeline that scrapes product data from Walmart.com based on a specified keyword. The data is cleaned and transformed before being saved to an S3 bucket in CSV format, where it can be used to create reports and visualizations in Power BI.


Requirements

  • Python 3 and pip
  • The Python packages listed in requirements.txt (installed in step 6 below)
  • A ScrapeOps API key, used for rotating proxies
  • An AWS account with an S3 bucket and access keys
  • Power BI Desktop, for the reporting steps

Installation

  1. Clone the repository:
    git clone https://github.com/usmananwaar-de/walmart-scraper
  2. Navigate to the cloned directory in the command line.
  3. Create a virtual environment by running the command (For Windows): python -m venv [venv name]
  4. Activate the virtual environment by running the command: venv\Scripts\activate
  5. Use the cd command to change into the spiders folder.
  6. Install the required libraries by running the command: pip install -r requirements.txt (the full sequence is shown below)
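Taken together, the setup commands look roughly like this (assuming a Windows shell; the virtual-environment name venv is just an example, and the spiders folder path is not specified in this README):

    git clone https://github.com/usmananwaar-de/walmart-scraper
    cd walmart-scraper
    python -m venv venv
    venv\Scripts\activate
    cd <path-to-spiders-folder>
    pip install -r requirements.txt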

Usage

Before running the script, the following changes need to be made in the settings.py file:

  1. Open the settings.py file and replace the YOUR_SCRAPEOPS_API_KEY variable with your ScrapeOps API key.
  2. Replace YOUR_S3_BUCKET_PATH with the path to your S3 bucket.
  3. Replace YOUR_AWS_KEY_ID and YOUR_AWS_SECRET_ACCESS_KEY with your AWS access key ID and secret access key. (A sketch of these settings follows this list.)
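For orientation, here is a minimal sketch of what the relevant block in settings.py might look like, using Scrapy's standard setting names for S3 feed exports; the exact variable names and layout in this repository may differ:

    # settings.py -- minimal sketch; exact names in this repo may differ
    SCRAPEOPS_API_KEY = "YOUR_SCRAPEOPS_API_KEY"

    # Credentials used by Scrapy's S3 feed storage (requires botocore)
    AWS_ACCESS_KEY_ID = "YOUR_AWS_KEY_ID"
    AWS_SECRET_ACCESS_KEY = "YOUR_AWS_SECRET_ACCESS_KEY"

    # Export scraped items to S3 as CSV; the file name here is illustrative
    FEEDS = {
        "s3://YOUR_S3_BUCKET_PATH/products.csv": {"format": "csv"},
    }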

To run the scraper, use the following command, supplying the desired product keyword:

scrapy crawl walmart
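This README does not pin down how the keyword is supplied; if the spider reads it as a standard Scrapy spider argument (the argument name keyword here is an assumption), it would be passed with the -a flag:

    scrapy crawl walmart -a keyword="laptop"

Scrapy sets each -a name=value pair as an attribute on the spider instance, so the code can read it as self.keyword when building the search URL.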

Note: rotating proxies are used because Walmart.com detects scraper bots and blocks their IP addresses.

Notes

  • Walmart.com may block your scraper, so it's important to use a proxy service like ScrapeOps to avoid this (a sketch of the proxy pattern follows).
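As a reference for how such proxying is commonly wired up, the sketch below wraps each target URL in the ScrapeOps proxy endpoint; the helper name is hypothetical, and the endpoint follows ScrapeOps' public proxy API:

    from urllib.parse import urlencode

    SCRAPEOPS_API_KEY = "YOUR_SCRAPEOPS_API_KEY"

    def scrapeops_url(url):
        # Rewrite the target URL so the request goes through the rotating
        # ScrapeOps proxy instead of hitting Walmart.com directly
        params = {"api_key": SCRAPEOPS_API_KEY, "url": url}
        return "https://proxy.scrapeops.io/v1/?" + urlencode(params)

Inside the spider, requests would then be issued as scrapy.Request(scrapeops_url(target_url), callback=self.parse).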

Get Data From AWS S3 to Power BI

  • Open Power BI Desktop and click Get Data. Search for "Python script" and paste the following code, filling in your own credentials, bucket name, and object key:

    import io
    import boto3
    import pandas as pd

    AWS_ACCESS_KEY_ID = "your-aws-access-key"
    AWS_SECRET_ACCESS_KEY = "your-aws-secret-access-key"
    AWS_DEFAULT_REGION = "your-aws-region"

    # Credentials must be passed to boto3 explicitly; defining the
    # variables alone does not configure the session
    s3 = boto3.resource(
        "s3",
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
        region_name=AWS_DEFAULT_REGION,
    )

    # The bucket name and the object key are separate arguments
    obj = s3.Object("your-bucket-name", "file.csv")
    body = obj.get()["Body"].read()

    # Power BI imports any pandas DataFrame defined in the script
    df = pd.read_csv(io.BytesIO(body))

  • Make sure boto3 and pandas are installed in the Python environment that Power BI uses; the CSV file will then be imported as a table.

Data Visualization with Power BI

To explore the interactive Power BI report, download the "Walmart report.pbix" file and open it in Power BI Desktop.


Thank you for reading to the end!
