RFC: Store all links to all related questions in file #10

Open

@rubo77 rubo77 commented Jul 14, 2019

php artisan exportStackExchange    

will create a list of links to every Q&A that you took part in. There is a 1 s sleep between curl calls so that you don't get rate limited, which otherwise happens after about 150 calls. I hope 1 s is enough to avoid it; somewhere else I needed a 15 s sleep.
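
The throttling idea can be sketched in plain shell. This is only an illustration of pausing between calls, not the code in this PR: the helper name `throttled_fetch` and the file `urls.txt` are made up for the example, and it assumes one URL per line on stdin.

```shell
# Hypothetical helper (not part of this PR): run a command once per URL
# read from stdin, sleeping DELAY seconds between consecutive calls.
throttled_fetch() {
    delay="$1"; shift
    first=1
    while IFS= read -r url; do
        # Skip empty lines in the input.
        [ -n "$url" ] || continue
        # Sleep before every call except the first one.
        if [ "$first" -eq 0 ]; then sleep "$delay"; fi
        first=0
        "$@" "$url"
    done
}
```

For example, `throttled_fetch 1 curl -sS -O < urls.txt` would fetch each listed URL with a 1 s pause in between; passing 15 instead of 1 gives the longer delay mentioned above.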

That page can be downloaded with

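mkdir download
cd download/
# de-duplicate the exported link list
sort ../storage/app/StackExchange/2019-07-14_225540_UTC/urls.html | uniq > urls.html
# serve the current directory on port 8000 (Python 2; on Python 3 use: python3 -m http.server)
python -m SimpleHTTPServer &
# fetch urls.html and every page it links to directly, including pages on other hosts (-H)
wget -c -r -l 1 --wait 0.1 --random-wait --adjust-extension -e robots=off -p -k -H http://localhost:8000/urls.html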

If that is more than 100 pages, you should use a longer wait between requests, otherwise you will get blocked:

 --wait 15
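
As a rough sketch, the wait could be picked automatically from the size of the list. The helper name `choose_wait` is made up for this example, and it assumes that the number of unique lines in `urls.html` approximates the number of pages to fetch; the threshold of 100 and the values 0.1 and 15 are the ones from this description.

```shell
# Hypothetical helper (not part of this PR): print a wget --wait value
# based on how many unique lines the given URL list contains.
choose_wait() {
    # Arithmetic expansion strips any padding that wc -l may emit.
    count=$(( $(sort -u "$1" | wc -l) ))
    if [ "$count" -gt 100 ]; then
        echo 15     # long list: slow down to avoid getting blocked
    else
        echo 0.1    # short list: the short wait is assumed to be fine
    fi
}
```

It could then be used as `--wait "$(choose_wait urls.html)"` in the wget call above.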

With depth level -l 1 you will get all questions that are directly linked on the page.

@rubo77 rubo77 changed the title Store all links to all related questions in file RFC: Store all links to all related questions in file Jul 14, 2019
@rubo77 rubo77 force-pushed the all_qas branch 4 times, most recently from 32abf06 to 6fce886 Compare July 15, 2019 09:52