Use git scraping in refurbished products #104

Open
zmoog opened this issue Nov 4, 2022 · 15 comments

@zmoog
Owner

zmoog commented Nov 4, 2022

https://simonwillison.net/2020/Oct/9/git-scraping/

@zmoog
Owner Author

zmoog commented Nov 5, 2022

The first step is creating a new repository to host the data: https://github.com/zmoog/refurbished-history

@zmoog zmoog self-assigned this Nov 5, 2022
@zmoog
Owner Author

zmoog commented Nov 5, 2022

I am adding the git-scraping topic to the repository, as other people are doing.

@zmoog
Owner Author

zmoog commented Nov 5, 2022

Try to scrape some data:

asdf local python 3.10.8

python -m venv venv
. venv/bin/activate

pip install refurbished
pip install --upgrade pip

rfrb it ipads --format json | jq > it-ipads.json
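
Once the JSON is on disk it can be inspected with a few lines of Python. This is only a minimal sketch: it assumes each record carries the `model`, `price`, and `previous_price` fields seen in the stores/ca/macs.json excerpt later in this thread, and the `summarize` helper is a hypothetical name, not part of the refurbished package. The sample record is shortened from that excerpt so the block runs on its own.

```python
import json

# One record shaped like the stores/ca/macs.json excerpt in this thread
# (name shortened for readability).
SAMPLE = """[
  {
    "name": "Refurbished Mac mini Apple M1 Chip",
    "family": "mac",
    "store": "ca",
    "price": 979.0,
    "previous_price": 1149.0,
    "savings_price": 170.0,
    "saving_percentage": 0.14795474325500435,
    "model": "FGNT3LL"
  }
]"""


def summarize(products):
    """Return (model, price, discount) tuples, largest discount first."""
    rows = []
    for product in products:
        previous = product["previous_price"]
        # Discount as a fraction of the previous price.
        discount = (previous - product["price"]) / previous
        rows.append((product["model"], product["price"], discount))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# To read a real scrape instead, load the file, e.g. json.load(open("it-ipads.json")).
products = json.loads(SAMPLE)
for model, price, discount in summarize(products):
    print(f"{model}: {price:.2f} ({discount:.1%} off)")
```

Recomputing the discount from the two prices is also a cheap sanity check against the `saving_percentage` value in the record.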

@zmoog
Owner Author

zmoog commented Nov 5, 2022

We should cache Python packages to use as few resources as possible. Here is an article to start from:

@zmoog
Owner Author

zmoog commented Nov 5, 2022

Added a simple workflow to start scraping:

name: Scrape latest data

on:
  push:
  workflow_dispatch:
  # schedule:
  #   - cron:  '6,26,46 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
      uses: actions/checkout@v2
    - uses: actions/setup-python@v2
      with:
        python-version: '3.10'
        cache: 'pip'
    - run: pip install refurbished
    - name: Fetch latest data
      run: |-
        # curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . > incidents.json
        for STORE in "it" "uk" "us" "cn" "ie" "fr" "ca"
        do
            PRODUCTS="stores/$STORE"
            mkdir -p ${PRODUCTS}

            for PRODUCT in "ipads" "iphones" "macs"
            do
                echo "Scraping $STORE $PRODUCT"
                rfrb ${STORE} ${PRODUCT} --format json 2> /dev/null > ${PRODUCTS}/${PRODUCT}.json
            done
        done
    - name: Commit and push if it changed
      run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push

I am keeping the schedule disabled until I have tested the workflow using workflow_dispatch.
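
For reference, the nested loop in the "Fetch latest data" step can be sketched in Python as a dry run that only lists what would be executed. This is an illustration under assumptions, not the workflow itself: `scrape_plan` is a hypothetical helper, and it builds the same `rfrb` command lines and `stores/<store>/<product>.json` paths as the shell loop without running anything.

```python
from pathlib import PurePosixPath

# Stores and product families copied from the workflow above.
STORES = ["it", "uk", "us", "cn", "ie", "fr", "ca"]
PRODUCTS = ["ipads", "iphones", "macs"]


def scrape_plan(stores=STORES, products=PRODUCTS, root="stores"):
    """Return (command, output_path) pairs mirroring the workflow's nested loop."""
    plan = []
    for store in stores:
        for product in products:
            command = ["rfrb", store, product, "--format", "json"]
            output = PurePosixPath(root) / store / f"{product}.json"
            plan.append((command, str(output)))
    return plan


# Dry run: print each command next to the file it would write.
for command, output in scrape_plan():
    print(" ".join(command), ">", output)
```

Seven stores times three product families gives 21 files per run, which is what the commit step then checks for changes.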

@zmoog
Owner Author

zmoog commented Nov 5, 2022

Weird, the setup-python@v2 step fails with this error:

Error: No file in /home/runner/work/refurbished-history/refurbished-history matched to [**/requirements.txt], make sure you have checked out the target repository

It looks like this step looks for a requirements.txt file in the repo.

@zmoog
Owner Author

zmoog commented Nov 5, 2022

Oh, I see it uses the hash of requirements.txt to cache dependencies:

The action defaults to searching for a dependency file (requirements.txt for pip, Pipfile.lock for pipenv or poetry.lock for poetry) in the repository, and uses its hash as a part of the cache key.

See Caching packages dependencies for more details.
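
To make the idea concrete, here is a minimal sketch of deriving a cache key from the hash of a dependency file. It is not the action's actual implementation: the `cache_key` helper, the `pip-` prefix, and the choice of SHA-256 are all assumptions for illustration, and the real key may include other parts. The point is only that the key changes whenever requirements.txt changes, which is why the action needs the file to exist.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(dependency_file, prefix="pip"):
    """Build a cache key from the SHA-256 of the dependency file (illustrative format)."""
    digest = hashlib.sha256(Path(dependency_file).read_bytes()).hexdigest()
    return f"{prefix}-{digest}"


with tempfile.TemporaryDirectory() as tmp:
    requirements = Path(tmp) / "requirements.txt"

    requirements.write_text("refurbished\n")
    first = cache_key(requirements)

    # Changing the dependency list changes the hash, so the old cache is not reused.
    requirements.write_text("refurbished\ngit-history\n")
    second = cache_key(requirements)

print(first != second)
```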

@zmoog
Owner Author

zmoog commented Nov 5, 2022

So I will:

  • upgrade setup-python to v4, since there are several warnings from the action.
  • embrace requirements.txt

@zmoog
Owner Author

zmoog commented Nov 5, 2022

The next step is to schedule the scrape. I believe we can start with once a day at 9:00 am (CET).

@zmoog
Owner Author

zmoog commented Nov 5, 2022

The GitHub Actions schedule trigger uses UTC.

Let's schedule the scraping job at 9:00 UTC, which is 10:00 CET:

on:
  push:
  workflow_dispatch:
  schedule:
    - cron:  '00 09 * * *'
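
A quick way to check what a UTC cron hour means locally is to convert it with the standard library. This is a small sketch using fixed offsets (CET as UTC+1, CEST as UTC+2) rather than a full time zone database, so it ignores daylight-saving transitions; `local_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: CET is UTC+1, CEST (summer time) is UTC+2.
CET = timezone(timedelta(hours=1), "CET")
CEST = timezone(timedelta(hours=2), "CEST")


def local_time(hour_utc, tz):
    """Format the local wall-clock time for a job that fires at hour_utc:00 UTC."""
    run = datetime(2022, 11, 5, hour_utc, 0, tzinfo=timezone.utc)
    return run.astimezone(tz).strftime("%H:%M %Z")


print(local_time(9, CET))   # the schedule above, seen from CET
print(local_time(8, CET))   # the UTC hour that would land on 9:00 am CET
```

So a job that should run at 9:00 am CET would need hour 8 in the cron expression, and hour 7 while summer time is in effect.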

@zmoog
Owner Author

zmoog commented Nov 5, 2022

$ head -n 12 stores/ca/macs.json
[
  {
    "name": "Refurbished Mac mini Apple M1 Chip with 8\u2010Core CPU and 8\u2010Core GPU",
    "family": "mac",
    "store": "ca",
    "url": "https://www.apple.com/ca/shop/product/FGNT3LL/A/refurbished-mac-mini-apple-m1-chip-with-8%E2%80%91core-cpu-and-8%E2%80%91core-gpu",
    "price": 979.0,
    "previous_price": 1149.0,
    "savings_price": 170.0,
    "saving_percentage": 0.14795474325500435,
    "model": "FGNT3LL"
  },

$ git-history file refurbished.db stores/ca/macs.json --id model
  [####################################]  1/1  100%

$ sqlite3 refurbished.db
SQLite version 3.37.0 2021-12-09 01:34:53
Enter ".help" for usage hints.
sqlite> select * from namespaces;
1|item
sqlite> .tables
namespaces
sqlite>

🤔

@zmoog
Owner Author

zmoog commented Nov 5, 2022

While I try to understand what I am doing wrong, I updated the schedule to run every 4 hours.

@zmoog
Owner Author

zmoog commented Nov 5, 2022

Oh, it looks like this happens because the files are stored in subfolders, see https://github.com/simonw/git-history/pull/52/files for more.

@zmoog
Owner Author

zmoog commented Nov 5, 2022

I patched the installed version of git-history using this change: https://github.com/simonw/git-history/pull/52/files#diff-8c939659a7826623adb8d2ebb92231029a116aaeb75f1558101de8aa3ea487f8

And now it creates the tables in the SQLite database.

$ git-history file refurbished.db stores/ca/macs.json --id model
  [####################################]  2/2  100%

$ sqlite3 refurbished.db
SQLite version 3.37.0 2021-12-09 01:34:53
Enter ".help" for usage hints.
sqlite> .tables
columns              item_changed         namespaces
commits              item_version
item                 item_version_detail
sqlite>

Cool.
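
The same check can be scripted. Below is a minimal sketch that lists every table and view in a SQLite file with its row count, using only the standard `sqlite3` module. `table_row_counts` is a hypothetical helper, and the in-memory database is a stand-in with two of the table names from the output above; its columns are placeholders, not the real git-history schema.

```python
import sqlite3


def table_row_counts(conn):
    """Return {name: row_count} for every user table and view in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    # Names are quoted because identifiers cannot be bound as SQL parameters.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Stand-in database; point sqlite3.connect() at refurbished.db for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE namespaces (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO namespaces (name) VALUES ('item')")
conn.execute("CREATE TABLE item (placeholder TEXT)")  # placeholder column only

print(table_row_counts(conn))
```

Run against the database from before the patch, a helper like this would have shown only `namespaces`, which is the symptom described earlier in the thread.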

@zmoog
Owner Author

zmoog commented Nov 8, 2022

Explore the database using Datasette:

$ datasette refurbished.db
INFO:     Started server process [5797]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)

[Screenshot: the refurbished.db database opened in Datasette]
