Initial commit
ranjan-mohanty committed Mar 10, 2024
0 parents commit 20d8415
Showing 12 changed files with 603 additions and 0 deletions.
33 changes: 33 additions & 0 deletions .github/workflows/build.yml
@@ -0,0 +1,33 @@
name: Build
on:
  workflow_call:
  push:
    branches-ignore:
      - main
    tags-ignore:
      - "**"

jobs:
  build:
    name: Build distribution
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.x"
      - name: Install pypa/build
        run: >-
          python3 -m
          pip install
          build
          --user
      - name: Build a binary wheel and a source tarball
        run: python3 -m build
      - name: Store the distribution packages
        uses: actions/upload-artifact@v3
        with:
          name: python-package-distributions
          path: dist/
96 changes: 96 additions & 0 deletions .github/workflows/publish.yml
@@ -0,0 +1,96 @@
name: Publish
on:
  push:
    branches:
      - main
    tags:
      - "**"

jobs:
  build:
    uses: ./.github/workflows/build.yml

  publish-to-pypi:
    name: Publish to PyPI
    if: startsWith(github.ref, 'refs/tags/') # only publish to PyPI on tag pushes
    needs:
      - build
    runs-on: ubuntu-latest
    environment:
      name: pypi
      url: https://pypi.org/p/amazon-product-details-scraper
    permissions:
      id-token: write

    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v3
        with:
          name: python-package-distributions
          path: dist/
      - name: Publish distribution to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1

  github-release:
    name: >-
      Sign the Python distribution with Sigstore
      and upload them to GitHub Release
    needs:
      - publish-to-pypi
    runs-on: ubuntu-latest

    permissions:
      contents: write
      id-token: write

    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v3
        with:
          name: python-package-distributions
          path: dist/
      - name: Sign the dists with Sigstore
        uses: sigstore/gh-action-sigstore-python@v1.2.3
        with:
          inputs: >-
            ./dist/*.tar.gz
            ./dist/*.whl
      - name: Create GitHub Release
        env:
          GITHUB_TOKEN: ${{ github.token }}
        run: >-
          gh release create
          '${{ github.ref_name }}'
          --repo '${{ github.repository }}'
          --notes ""
      - name: Upload artifact signatures to GitHub Release
        env:
          GITHUB_TOKEN: ${{ github.token }}
        run: >-
          gh release upload
          '${{ github.ref_name }}' dist/**
          --repo '${{ github.repository }}'

  publish-to-testpypi:
    name: Publish to TestPyPI
    if: ${{ !startsWith(github.ref, 'refs/tags') }}
    needs:
      - build
    runs-on: ubuntu-latest

    environment:
      name: testpypi
      url: https://test.pypi.org/p/amazon-product-details-scraper

    permissions:
      id-token: write

    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v3
        with:
          name: python-package-distributions
          path: dist/
      - name: Publish distribution to TestPyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          repository-url: https://test.pypi.org/legacy/
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
*.egg-info
build
dist
my_env
venv
output
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Ranjan Mohanty

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
114 changes: 114 additions & 0 deletions README.md
@@ -0,0 +1,114 @@
## Amazon Product Details Scraper

This script helps you scrape product details from Amazon product pages. It extracts information like title, description, and image URLs, saving them to JSON files.

### Features

- Fetches product details from a single Amazon product URL or a list of URLs in a file.
- Writes extracted data to JSON files for easy storage and processing.
- Optionally downloads product images along with details.

### Installation

**Requirements:**

- Python 3 (tested with 3.7+)
- Libraries (installed automatically; versions pinned in `setup.py`):
  - requests 2.31.0
  - beautifulsoup4 4.11.1
  - urllib3 1.26.18

**Instructions:**

1. Make sure you have Python 3 installed. You can check by running `python3 --version` in your terminal.
2. **Create a virtual environment (recommended):**

   - Virtual environments help isolate project dependencies and avoid conflicts with other Python installations on your system.
   - Here's how to create a virtual environment using `venv`:

     ```bash
     python3 -m venv my_env # Replace "my_env" with your desired environment name
     ```

   - Activate the virtual environment:

     ```bash
     source my_env/bin/activate
     ```

3. **Install:**

   ```bash
   python3 setup.py install
   ```

   This installs the package into the activated virtual environment and automatically downloads the libraries listed in `setup.py`.

### Usage

**Basic Usage:**

```bash
amazon-scraper --url https://www.amazon.com/product-1 # Replace with your product URL
```

This will scrape details from the provided Amazon product URL and write them to a JSON file in the "output" directory (default).
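
A quick way to sanity-check the result is to load the generated file back in Python. This is only a sketch: the `product-1` sub-directory and the `title` key are assumptions for illustration, while the `product-info.json` filename comes from `DEFAULT_OUTPUT_FILENAME` in `src/config.py`.

```python
import json
from pathlib import Path

# Path is illustrative: the scraper writes "product-info.json" (see src/config.py)
# somewhere under the "output" directory -- adjust to whatever it created for your URL.
details = json.loads(Path("output/product-1/product-info.json").read_text(encoding="utf-8"))

# Field names such as "title" are assumed here, not guaranteed by the scraper.
print(details.get("title"))
```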

**Using a URL List:**

1. Create a text file containing a list of Amazon product URLs (one per line).
2. Run the script with the `--url-list` option and provide the file path:

   ```bash
   amazon-scraper --url-list product_urls.txt
   ```

This will process each URL in the file and save the scraped details for each product in separate directories within "output".
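
For example, after a run against two URLs the output folder might look roughly like this (the per-product directory names are illustrative; only the `product-info.json` filename is fixed by `src/config.py`):

```
output/
├── product-1/
│   └── product-info.json
└── product-2/
    └── product-info.json
```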

**Optional: Downloading Images**

```bash
amazon-scraper --url https://www.amazon.com/product-1 --download-image
```

The `--download-image` flag enables downloading product images along with other details.

**Getting Help:**

The script offers a built-in help message that provides a quick overview of available options and usage instructions. To access the help, run the script with the `--help` option:

```bash
amazon-scraper --help
```

### Configuration

**Logging:**

- The script uses basic logging for information and error messages.
- You can change the logging level by editing `DEFAULT_LOG_LEVEL` in `src/config.py` (see the Python `logging` documentation for the available levels); a minimal example follows.
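
For example, a small tweak to `src/config.py` (the constant already exists in this commit; only the level value changes) makes the output more verbose:

```python
import logging

# In src/config.py: raise verbosity from the default INFO to DEBUG
DEFAULT_LOG_LEVEL = logging.DEBUG
```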

### Example

**Scenario:**

Scrape details for two products from a file named "products.txt" and download images:

1. Create a file named "products.txt" with the following content:

   ```
   https://www.amazon.com/product-1
   https://www.amazon.com/product-2
   ```

2. Run the script with the following command:

   ```bash
   amazon-scraper --url-list products.txt --download-image
   ```

This will process both URLs in the file, scrape details, create separate output directories for each product, and download images.

### Disclaimer

This script is for educational purposes only. Please be respectful of Amazon's terms of service when using it. Consider using official APIs provided by Amazon for extensive data collection.
5 changes: 5 additions & 0 deletions pyproject.toml
@@ -0,0 +1,5 @@
[build-system]
# These are the assumed default build requirements from pip:
# https://pip.pypa.io/en/stable/reference/pip/#pep-517-and-518-support
requires = ["setuptools>=43.0.0", "wheel"]
build-backend = "setuptools.build_meta"
4 changes: 4 additions & 0 deletions setup.cfg
@@ -0,0 +1,4 @@
[metadata]
# This includes the license file(s) in the wheel.
# https://wheel.readthedocs.io/en/stable/user_guide.html#including-license-files-in-the-generated-wheel-file
license_files = LICENSE
28 changes: 28 additions & 0 deletions setup.py
@@ -0,0 +1,28 @@
from setuptools import setup, find_packages
import pathlib

here = pathlib.Path(__file__).parent.resolve()
# Get the long description from the README file
long_description = (here / "README.md").read_text(encoding="utf-8")

setup(
    name="amazon-product-details-scraper",
    version="1.0.0",
    description="Scrapes product details from Amazon product pages and also downloads the images",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/ranjan-mohanty/amazon-product-details-scraper/blob/main/README.md",
    author="Ranjan Mohanty",
    author_email="ranjan@duck.com",
    packages=find_packages(),
    keywords="amazon, scraper",
    entry_points={
        "console_scripts": [
            "amazon-scraper=src.scraper:main",
        ]
    },
    install_requires=["requests==2.31.0", "beautifulsoup4==4.11.1", "urllib3==1.26.18"],
    project_urls={  # Optional
        "Source": "https://github.com/ranjan-mohanty/amazon-product-details-scraper",
    },
)
Empty file added src/__init__.py
Empty file.
8 changes: 8 additions & 0 deletions src/config.py
@@ -0,0 +1,8 @@
import os
import logging


DEFAULT_LOG_LEVEL = logging.INFO

DEFAULT_OUTPUT_FOLDER = os.path.expanduser("../output")
DEFAULT_OUTPUT_FILENAME = "product-info.json"
