
Miss tickers when mass scraping #6

Closed · truongphanduykhanh opened this issue Aug 29, 2021 · 10 comments

truongphanduykhanh commented Aug 29, 2021

There are more than 3,000 tickers in the file bizType_ind_tickers.csv. However, only 600-1,000 tickers are downloaded when I mass scrape (each run takes about 5 hours with my internet connection). Many blue chips whose information is certainly available on VietStock, such as VIC and GAS, are missed.

I have tried several times over both WiFi and LAN connections. The number of tickers and JSON files varies from one run to another: one run produced 657 tickers, another 1,025. None of the runs came close to the full 3,129 tickers listed in bizType_ind_tickers.csv.

Every run stopped by itself and clearly stated that it had finished (see the terminal record at the end).

Here is a count summary of total versus downloaded tickers for one mass-scrape run:

       biztype_id  ind_id  tickers  tickers_downloaded
TOTAL                         3129                 657
0               1     100      136                 110
1               1     200       81                  30
2               1     300      171                  49
3               1     400      598                  49
4               1     500      903                  50
5               1     600      192                  51
6               1     700       67                  11
7               1     800      221                  50
8               1     900       84                  24
9               1    1000       22                   0
10              1    1100        3                   0
11              1    1200       66                  12
12              1    1300        5                   0
13              1    1400       66                   7
14              1    1500        4                   0
15              1    1600        5                   0
16              1    1700        7                   0
17              1    1800       43                   0
18              1    1900        6                   0
19              1    2000        4                   0
20              2    1000      105                 103
21              3    1000       75                  60
22              4     900        1                   0
23              4    1000       39                  24
24              4    1600        1                   0
25              5    1000       32                  19
26              6    1000        2                   1
27              6    2000        2                   1
28              7    1000        7                   6
29              8    1200      181                   0
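A minimal sketch of how such a summary can be computed by comparing the ticker list against the downloaded JSON files (the output directory name, filename pattern, and CSV column names here are assumptions, not the project's actual layout):

```python
import glob
import os

import pandas as pd

# Tickers that should be scraped (file name from the repo).
expected = pd.read_csv("bizType_ind_tickers.csv")

# Tickers actually downloaded; assumes one "<TICKER>_....json" file per
# ticker in a hypothetical output directory named "localData".
downloaded = {
    os.path.basename(p).split("_")[0]
    for p in glob.glob("localData/*.json")
}

expected["downloaded"] = expected["ticker"].isin(downloaded)

# Per-group summary: total vs downloaded tickers.
summary = (
    expected.groupby(["biztype_id", "ind_id"])
    .agg(tickers=("ticker", "size"), tickers_downloaded=("downloaded", "sum"))
    .reset_index()
)
print(summary)
print("TOTAL:", len(expected), "downloaded:", expected["downloaded"].sum())
```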

Here is the terminal record:

Scrape Finance Data - version 2

Do you wish to mass scrape? [y/n] y
Do you wish clear ALL scraped files and kill ALL running Celery workers? [y/n] y
Clearing scraped files and all running workers, please wait...

OK
rm: cannot remove './run/celery/*': No such file or directory
removed './run/scrapy/financeInfo.scrapy'
removed './logs/corporateAZExpress_log_verbose.log'
removed './logs/corporateAZOverview_log_verbose.log'
removed './logs/financeInfo_log_verbose.log'
Do you wish to start mass scraping now? Process will automatically exit when finished. [y] y
Creating Celery workers...
Running Celery tasks for mass scrape...
Scrapy is still running...
Scrapy is still running...
Scrapy is still running...
…
Scrapy is still running...
Scrapy has finished
Killing Celery workers, flushing Redis queues, deleting Celery run files...
OK
removed './run/celery/workercorpAZ.pid'
removed './run/celery/workerfinance.pid'
Exiting...
@vincetran96 (Owner)

Can you post the summary at the end of the financeInfo log file here? It looks something like this:

2021-08-29 14:28:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/status/403': 1,
 'elapsed_time_seconds': 1442.329607,
 'finish_reason': 'CorpAZ is closed; CorpAZ queue is empty; Spider is idling',
 'finish_time': datetime.datetime(2021, 8, 29, 14, 28, 48, 534389),
 'httpcompression/response_bytes': 84604811,
 'httpcompression/response_count': 2330,
 'log_count/INFO': 3533,
 'memusage/max': 64888832,
 'memusage/startup': 61087744,
 'response_received_count': 2331,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/403': 1,
 'scheduler/dequeued/redis': 2330,
 'scheduler/enqueued/redis': 6500,
 'start_time': datetime.datetime(2021, 8, 29, 14, 4, 46, 204782)}
2021-08-29 14:28:48 [scrapy.core.engine] INFO: Spider closed (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
2021-08-29 14:28:48 [financeInfo] INFO: Deleted status file at run/scrapy/financeInfo.scrapy

@truongphanduykhanh (Author)

Here we go:

2021-08-29 22:18:53 [financeInfo] INFO: === IDLING... ===
2021-08-29 22:18:53 [financeInfo] INFO: corpAZ closed key: 1
2021-08-29 22:18:53 [financeInfo] INFO: corpAZ key financeInfo:corpAZtickers contains: []
2021-08-29 22:18:53 [financeInfo] INFO: set()
2021-08-29 22:18:53 [scrapy.core.engine] INFO: Closing spider (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
2021-08-29 22:18:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/error/twisted.internet.error.TimeoutError': 1,
 'bans/status/403': 1,
 'elapsed_time_seconds': 21089.140718,
 'finish_reason': 'CorpAZ is closed; CorpAZ queue is empty; Spider is idling',
 'finish_time': datetime.datetime(2021, 8, 29, 22, 18, 53, 434348),
 'httpcompression/response_bytes': 1455817851,
 'httpcompression/response_count': 33926,
 'log_count/INFO': 44694,
 'log_count/WARNING': 1,
 'memusage/max': 71532544,
 'memusage/startup': 71348224,
 'response_received_count': 33927,
 'retry/count': 1,
 'retry/reason_count/twisted.internet.error.TimeoutError': 1,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/403': 1,
 'scheduler/dequeued/redis': 33927,
 'scheduler/enqueued/redis': 33927,
 'start_time': datetime.datetime(2021, 8, 29, 16, 27, 24, 293630)}
2021-08-29 22:18:53 [scrapy.core.engine] INFO: Spider closed (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
2021-08-29 22:18:53 [financeInfo] INFO: Deleted status file at run/scrapy/financeInfo.scrapy

@truongphanduykhanh (Author)

@vincetran96 although mass scraping doesn't work perfectly for me, as described above, it runs much faster than the business-industry options. The mass option takes about 5 hours to download ~600 tickers before stopping, while it takes... 20 hours to scrape biz-ind option 1;400, which also covers ~600 tickers.

@vincetran96 (Owner)

That's quite a big difference. There could still be a logic error in the implementation of the biz-ind scraping option. Other factors, like the quality of the connection to the Vietstock API at the time of scraping, may also be involved. Unfortunately, I currently do not have the time to go through the code.

If you're really interested and have the time, you're more than welcome to inspect the code and propose what needs to be changed to make it better 😄

@vincetran96 (Owner)

@truongphanduykhanh are you satisfied with what you got from the program so far? Would you still like me to look into this? I suspect network problems (pertaining to connection quality) that are not caught during the scraping process, as this tends to happen a lot when downloading data over extended periods of time.

If you're OK with what you got, I will close this issue.
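If connection quality is indeed the culprit, tightening Scrapy's retry and timeout settings might reduce the number of dropped tickers. A minimal sketch with illustrative values (these are not the project's actual settings):

```python
# settings.py -- illustrative values, not this project's actual settings.

# Retry transient network failures (e.g. twisted TimeoutError) more
# aggressively via Scrapy's built-in RetryMiddleware.
RETRY_ENABLED = True
RETRY_TIMES = 5  # Scrapy's default is 2

# Allow slow responses more time before they count as failures.
DOWNLOAD_TIMEOUT = 300  # seconds; Scrapy's default is 180

# Throttle politely to reduce the chance of bans and timeouts.
DOWNLOAD_DELAY = 0.25
AUTOTHROTTLE_ENABLED = True
```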

@truongphanduykhanh (Author)

@vincetran96 I haven't figured out how to run the mass-scrape option without interruption. I have tried mass scraping 5 times, and every run stopped after finishing 600-1,000 tickers.

If it doesn't take too much of your time, could you look into it? Otherwise, the biz-ind option can download all tickers, even though it is slow.

@vincetran96 (Owner)

Sorry for the delayed comment. I was trying to look into this yesterday. However, it seems Vietstock has implemented a new security measure to prevent repeated/duplicate requests to their financials API endpoints (see issue #8). Now I have to find a workaround for that before addressing this issue.

@vincetran96 (Owner)

By default, Scrapy only handles response codes in the 200-300 range (source). While I suspect there were some responses outside that range when you mass-scraped, there don't seem to be any in the scrape log summary you posted.

I will have to do some further testing to see how Scrapy behaves with HTTP exceptions. For now, please just scrape whatever tickers are missing. Thanks!
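For context, this behavior comes from Scrapy's HttpErrorMiddleware; a spider can opt in to out-of-range status codes so they reach the callback and can be logged instead of being dropped silently. A minimal sketch (the spider name and URL are placeholders, not this project's code):

```python
import scrapy


class StatusSketchSpider(scrapy.Spider):
    """Illustrative spider, not the project's financeInfo spider."""

    name = "status_sketch"

    # Let these statuses reach parse() instead of being dropped silently
    # by HttpErrorMiddleware, which only passes 200-299 by default.
    handle_httpstatus_list = [403, 429, 500, 502, 503]

    def start_requests(self):
        # Placeholder URL; the real endpoint is Vietstock's financials API.
        yield scrapy.Request("https://example.com/financials?ticker=VIC")

    def parse(self, response):
        if response.status != 200:
            # Out-of-range responses would otherwise be ignored rather
            # than surfacing in the stats as errors.
            self.logger.warning("HTTP %s for %s", response.status, response.url)
            return
        # ... normal parsing would go here ...
```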

@vincetran96 (Owner)

vincetran96 commented Sep 27, 2021

So... Good news and bad news!

The good news is that I found the scraper did not fully print out the downloader stats (the summary at the end of the financeInfo log file) when finished (see issue #10). I have fixed this bug, so you can now see how many requests errored out, etc., which should give you a better picture of why some tickers are missing/not downloaded. Also, when the scraper encounters an error, it now logs the error to a file called financeInfo_{report_type}_spidererrors_short.log, containing the ticker, the report_type, the page number at which the error occurred, and the error type (or class).
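A minimal sketch of the kind of error logging described above, done with a request errback (the filename and logged fields follow the description, but the code itself is illustrative, not the actual commit):

```python
import logging

import scrapy


def make_error_logger(report_type):
    """Illustrative helper: a file logger named like the one described above."""
    logger = logging.getLogger(f"financeInfo_{report_type}_errors")
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        logger.addHandler(
            logging.FileHandler(f"financeInfo_{report_type}_spidererrors_short.log")
        )
    return logger


class ErrbackSketchSpider(scrapy.Spider):
    """Illustrative spider, not the project's financeInfo spider."""

    name = "errback_sketch"
    report_type = "ReportType"  # placeholder

    def start_requests(self):
        # Placeholder URL; the real endpoint is Vietstock's financials API.
        yield scrapy.Request(
            "https://example.com/financials",
            callback=self.parse,
            errback=self.log_error,
            cb_kwargs={"ticker": "VIC", "page": 1},
        )

    def parse(self, response, ticker, page):
        pass  # normal parsing would go here

    def log_error(self, failure):
        # The errback receives a twisted Failure; the failing request
        # (and its cb_kwargs) is available as failure.request.
        request = failure.request
        make_error_logger(self.report_type).error(
            "%s;%s;%s;%s",
            request.cb_kwargs.get("ticker"),
            self.report_type,
            request.cb_kwargs.get("page"),
            failure.type.__name__,
        )
```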

The bad news is that I still don't know why there are missing tickers..., but I still believe it's due to connection issues, because I don't see any logic errors that could leave tickers unhandled. Honestly, this has taken quite a lot of my time, and I think this fix will at least point you in the right direction: mass scraping is not an easy job, and a lot of external factors can interfere during the process and cause errors.

Please try pulling from master with the latest commit and see how it goes.

@vincetran96 (Owner)

@truongphanduykhanh I'm closing this issue since it's been inactive for more than a week. If there's anything, please comment :)
