Use git scraping in refurbished products #104
The first step is creating a new repository to host the data: https://github.com/zmoog/refurbished-history

I am adding the label
Try to scrape some data:

```shell
asdf local python 3.10.8
python -m venv venv
. venv/bin/activate
pip install refurbished
pip install --upgrade pip
rfrb it ipads --format json | jq > it-ipads.json
```
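Once the JSON is on disk, jq can answer quick questions about it. A small sketch, assuming an array of objects with the `name` and `price` fields shown in the sample output later in this thread (the sample data here is made up for illustration):

```shell
# Tiny sample in the shape rfrb emits (name/price keys are taken from the
# sample JSON later in this thread), just to demonstrate the jq query.
cat > it-ipads.json <<'EOF'
[
  {"name": "Refurbished iPad Air", "price": 439.0},
  {"name": "Refurbished iPad Pro", "price": 899.0},
  {"name": "Refurbished iPad mini", "price": 389.0}
]
EOF

# Cheapest products first: price, then name, tab separated.
jq -r 'sort_by(.price) | .[] | "\(.price)\t\(.name)"' it-ipads.json | head -n 5
```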
We should cache Python packages to use as few resources as possible. Here are some articles to start from:
Added a simple workflow to start scraping:

```yaml
name: Scrape latest data

on:
  push:
  workflow_dispatch:
  # schedule:
  #   - cron: '6,26,46 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Check out this repo
        uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.10'
          cache: 'pip'
      - run: pip install refurbished
      - name: Fetch latest data
        run: |-
          # curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . > incidents.json
          for STORE in "it" "uk" "us" "cn" "ie" "fr" "ca"
          do
            PRODUCTS="stores/$STORE"
            mkdir -p ${PRODUCTS}
            for PRODUCT in "ipads" "iphones" "macs"
            do
              echo "Scraping $STORE $PRODUCT"
              rfrb ${STORE} ${PRODUCT} --format json 2> /dev/null > ${PRODUCTS}/${PRODUCT}.json
            done
          done
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
```

I am disabling the schedule until I have tested it using workflow_dispatch.
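The `git commit ... || exit 0` line is what keeps the workflow green when a run scrapes no new data: `git commit` exits non-zero when nothing is staged, and `|| exit 0` turns that into a clean no-op instead of a failed job. A standalone sketch of the idiom in a throwaway repo:

```shell
# Demonstrate the "commit only if something changed" idiom in a scratch repo.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Automated"
git config user.email "actions@users.noreply.github.com"

# Nothing staged: git commit exits non-zero, || rescues it into a no-op.
git add -A
git commit -q -m "Latest data" || echo "nothing to commit"

# With an actual change the commit succeeds as usual.
echo "hello" > data.json
git add -A
git commit -q -m "Latest data" && echo "committed"
```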
Weird, the setup-python@v2 step fails with this error:
It looks like this step looks for a
Oh, I see it uses the hash of
See Caching packages dependencies for more details.
So I will:
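If the failure really is about the cache key, the fix is to give setup-python a dependency file to hash. A minimal sketch (the one-line requirements.txt is my assumption of what this repo needs):

```shell
# actions/setup-python's pip cache is keyed on the hash of a dependency
# file (requirements.txt by default); committing one gives it something
# to hash.
printf 'refurbished\n' > requirements.txt

# The workflow install step would then become:
#   - run: pip install -r requirements.txt
cat requirements.txt
```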
The next step is scheduling the scrape. I believe we can start with once a day at 9:00 am (CET).
The GitHub Actions schedule trigger uses UTC. Let's schedule the scraping job at 9:00 UTC:

```yaml
on:
  push:
  workflow_dispatch:
  schedule:
    - cron: '00 09 * * *'
```
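Since the cron expression is interpreted in UTC, `00 09 * * *` fires at 10:00 CET in winter and 11:00 during CEST. GNU date (available on the ubuntu-latest runners; the dates and the Europe/Rome zone are just illustrative picks) can double-check a conversion:

```shell
# What local time does 09:00 UTC correspond to in Central Europe?
TZ='Europe/Rome' date -d '2023-01-15 09:00 UTC' '+%H:%M'   # winter (CET)
TZ='Europe/Rome' date -d '2023-07-15 09:00 UTC' '+%H:%M'   # summer (CEST)
```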
```console
$ head -n 12 stores/ca/macs.json
[
  {
    "name": "Refurbished Mac mini Apple M1 Chip with 8\u2010Core CPU and 8\u2010Core GPU",
    "family": "mac",
    "store": "ca",
    "url": "https://www.apple.com/ca/shop/product/FGNT3LL/A/refurbished-mac-mini-apple-m1-chip-with-8%E2%80%91core-cpu-and-8%E2%80%91core-gpu",
    "price": 979.0,
    "previous_price": 1149.0,
    "savings_price": 170.0,
    "saving_percentage": 0.14795474325500435,
    "model": "FGNT3LL"
  },
$ git-history file refurbished.db stores/ca/macs.json --id model
[####################################] 1/1 100%
$ sqlite3 refurbished.db
SQLite version 3.37.0 2021-12-09 01:34:53
Enter ".help" for usage hints.
sqlite> select * from namespaces;
1|item
sqlite> .tables
namespaces
```

🤔
While I try to understand what I am doing wrong, I have updated the schedule to run every 4 hours.
Oh, it looks like this happens because the files are stored in subfolders; see https://github.com/simonw/git-history/pull/52/files for more.

I patched the installed version of git-history using this change: https://github.com/simonw/git-history/pull/52/files#diff-8c939659a7826623adb8d2ebb92231029a116aaeb75f1558101de8aa3ea487f8 Now it creates the tables in the SQLite database.

Cool.
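With the tables in place, the price history of a single model can be pulled out with a join. The snippet below runs against a toy database built with the same shape; the `item`/`item_version` tables and `_id`/`_item`/`_commit` columns are my reading of git-history's schema, so double-check them against the real refurbished.db:

```shell
# Toy database in (assumed) git-history shape: one item, two versions.
db=refurbished-demo.db
rm -f "$db"
sqlite3 "$db" "
  create table item (_id integer primary key, model text);
  create table item_version (_item integer, _commit integer, price real);
  insert into item values (1, 'FGNT3LL');
  insert into item_version values (1, 1, 1149.0), (1, 2, 979.0);
"

# Price history for one model, oldest commit first.
sqlite3 "$db" "
  select v._commit, v.price
  from item_version v join item i on i._id = v._item
  where i.model = 'FGNT3LL'
  order by v._commit;
"
```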
https://simonwillison.net/2020/Oct/9/git-scraping/