
Miss tickers when mass scraping #6

Closed · truongphanduykhanh opened this issue Aug 29, 2021 · 10 comments

truongphanduykhanh commented Aug 29, 2021

There are more than 3,000 tickers in the file bizType_ind_tickers.csv. However, only 600-1,000 tickers are downloaded when I mass scrape (each run takes about 5 hours with my internet connection). Many blue chips whose information is certainly available on VietStock, such as VIC and GAS, are missed.

I have tried several times over both WiFi and LAN connections. The number of tickers and JSON files varies from one run to another: one run produced 657 tickers, another 1,025. None of the runs came close to the full 3,129 tickers listed in bizType_ind_tickers.csv.

Every run stopped by itself and clearly stated that it had finished (see the terminal record at the end).

Here is a count summary of total versus downloaded tickers for one mass-scrape run:

       biztype_id  ind_id  tickers  tickers_downloaded
TOTAL                         3129                 657
0               1     100      136                 110
1               1     200       81                  30
2               1     300      171                  49
3               1     400      598                  49
4               1     500      903                  50
5               1     600      192                  51
6               1     700       67                  11
7               1     800      221                  50
8               1     900       84                  24
9               1    1000       22                   0
10              1    1100        3                   0
11              1    1200       66                  12
12              1    1300        5                   0
13              1    1400       66                   7
14              1    1500        4                   0
15              1    1600        5                   0
16              1    1700        7                   0
17              1    1800       43                   0
18              1    1900        6                   0
19              1    2000        4                   0
20              2    1000      105                 103
21              3    1000       75                  60
22              4     900        1                   0
23              4    1000       39                  24
24              4    1600        1                   0
25              5    1000       32                  19
26              6    1000        2                   1
27              6    2000        2                   1
28              7    1000        7                   6
29              8    1200      181                   0
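A minimal sketch of how such a summary can be computed by comparing the ticker list against the downloaded JSON files (the output directory name, filename pattern, and CSV column names here are assumptions, not the project's actual layout):

```python
import glob
import os

import pandas as pd

# Tickers that should be scraped (file name from the repo).
expected = pd.read_csv("bizType_ind_tickers.csv")

# Tickers actually downloaded; assumes one "<TICKER>_....json" file per
# ticker in a hypothetical output directory named "localData".
downloaded = {
    os.path.basename(p).split("_")[0]
    for p in glob.glob("localData/*.json")
}

expected["downloaded"] = expected["ticker"].isin(downloaded)

# Per-group summary: total vs downloaded tickers.
summary = (
    expected.groupby(["biztype_id", "ind_id"])
    .agg(tickers=("ticker", "size"), tickers_downloaded=("downloaded", "sum"))
    .reset_index()
)
print(summary)
print("TOTAL:", len(expected), "downloaded:", expected["downloaded"].sum())
```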

Here is the terminal record:

Scrape Finance Data - version 2

Do you wish to mass scrape? [y/n] y
Do you wish clear ALL scraped files and kill ALL running Celery workers? [y/n] y
Clearing scraped files and all running workers, please wait...

OK
rm: cannot remove './run/celery/*': No such file or directory
removed './run/scrapy/financeInfo.scrapy'
removed './logs/corporateAZExpress_log_verbose.log'
removed './logs/corporateAZOverview_log_verbose.log'
removed './logs/financeInfo_log_verbose.log'
Do you wish to start mass scraping now? Process will automatically exit when finished. [y] y
Creating Celery workers...
Running Celery tasks for mass scrape...
Scrapy is still running...
Scrapy is still running...
Scrapy is still running...
…
Scrapy is still running...
Scrapy has finished
Killing Celery workers, flushing Redis queues, deleting Celery run files...
OK
removed './run/celery/workercorpAZ.pid'
removed './run/celery/workerfinance.pid'
Exiting...
@vincetran96 (Owner)

Can you post the summary at the end of the financeInfo log file here? It looks something like this:

2021-08-29 14:28:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/status/403': 1,
 'elapsed_time_seconds': 1442.329607,
 'finish_reason': 'CorpAZ is closed; CorpAZ queue is empty; Spider is idling',
 'finish_time': datetime.datetime(2021, 8, 29, 14, 28, 48, 534389),
 'httpcompression/response_bytes': 84604811,
 'httpcompression/response_count': 2330,
 'log_count/INFO': 3533,
 'memusage/max': 64888832,
 'memusage/startup': 61087744,
 'response_received_count': 2331,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/403': 1,
 'scheduler/dequeued/redis': 2330,
 'scheduler/enqueued/redis': 6500,
 'start_time': datetime.datetime(2021, 8, 29, 14, 4, 46, 204782)}
2021-08-29 14:28:48 [scrapy.core.engine] INFO: Spider closed (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
2021-08-29 14:28:48 [financeInfo] INFO: Deleted status file at run/scrapy/financeInfo.scrapy

@truongphanduykhanh (Author)

Here we go:

2021-08-29 22:18:53 [financeInfo] INFO: === IDLING... ===
2021-08-29 22:18:53 [financeInfo] INFO: corpAZ closed key: 1
2021-08-29 22:18:53 [financeInfo] INFO: corpAZ key financeInfo:corpAZtickers contains: []
2021-08-29 22:18:53 [financeInfo] INFO: set()
2021-08-29 22:18:53 [scrapy.core.engine] INFO: Closing spider (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
2021-08-29 22:18:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/error/twisted.internet.error.TimeoutError': 1,
 'bans/status/403': 1,
 'elapsed_time_seconds': 21089.140718,
 'finish_reason': 'CorpAZ is closed; CorpAZ queue is empty; Spider is idling',
 'finish_time': datetime.datetime(2021, 8, 29, 22, 18, 53, 434348),
 'httpcompression/response_bytes': 1455817851,
 'httpcompression/response_count': 33926,
 'log_count/INFO': 44694,
 'log_count/WARNING': 1,
 'memusage/max': 71532544,
 'memusage/startup': 71348224,
 'response_received_count': 33927,
 'retry/count': 1,
 'retry/reason_count/twisted.internet.error.TimeoutError': 1,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/403': 1,
 'scheduler/dequeued/redis': 33927,
 'scheduler/enqueued/redis': 33927,
 'start_time': datetime.datetime(2021, 8, 29, 16, 27, 24, 293630)}
2021-08-29 22:18:53 [scrapy.core.engine] INFO: Spider closed (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
2021-08-29 22:18:53 [financeInfo] INFO: Deleted status file at run/scrapy/financeInfo.scrapy

@truongphanduykhanh (Author)

@vincetran96 although mass scraping doesn't work perfectly for me, as described above, it runs much faster than the business-industry options. The mass option takes about 5 hours to download ~600 tickers before stopping, while it takes... 20 hours to scrape biz-ind option 1;400, which also covers ~600 tickers.

@vincetran96 (Owner)

That's quite a big difference. There could still be a logic error in the implementation of the biz-ind scraping option. Other factors, like the quality of the connection to the Vietstock API at the time of scraping, may also be involved. Unfortunately, I currently do not have the time to go through the code.

If you're really interested and have the time, you're more than welcome to inspect the code and propose what needs to be changed to make it better 😄

@vincetran96 (Owner)

@truongphanduykhanh are you satisfied with what you got from the program so far? Would you still like me to look into this? I suspect network problems (pertaining to connection quality) that are not caught during the scraping process, as this tends to happen a lot when downloading data over extended periods of time.

If you're OK with what you got, I will close this issue.
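If connection quality is indeed the culprit, tightening Scrapy's retry and timeout settings might reduce the number of dropped tickers. A minimal sketch with illustrative values (these are not the project's actual settings):

```python
# settings.py -- illustrative values, not this project's actual settings.

# Retry transient network failures (e.g. twisted TimeoutError) more
# aggressively via Scrapy's built-in RetryMiddleware.
RETRY_ENABLED = True
RETRY_TIMES = 5  # Scrapy's default is 2

# Allow slow responses more time before they count as failures.
DOWNLOAD_TIMEOUT = 300  # seconds; Scrapy's default is 180

# Throttle politely to reduce the chance of bans and timeouts.
DOWNLOAD_DELAY = 0.25
AUTOTHROTTLE_ENABLED = True
```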

@truongphanduykhanh (Author)

@vincetran96 I haven't figured out how to run the mass-scrape option without interruption. I have tried mass scraping 5 times, and every run stopped after finishing 600-1,000 tickers.

If it doesn't take too much of your time, could you look into it? Otherwise, the biz-ind option can download all tickers, even though it is slow.

@vincetran96 (Owner)

Sorry for the delayed comment. I was trying to look into this yesterday. However, it seems Vietstock has implemented a new security measure to prevent repeated/duplicate requests to their financials API endpoints (see issue #8). Now I have to find a workaround for that before addressing this issue.

@vincetran96 (Owner)

By default, Scrapy only handles response codes in the 200-300 range (source). While I suspect there were some responses outside that range when you mass-scraped, there don't seem to be any in the scrape log summary you posted.

I will have to do some further testing to see how Scrapy behaves with HTTP exceptions. For now, please just scrape whatever tickers are missing. Thanks!
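For context, this behavior comes from Scrapy's HttpErrorMiddleware; a spider can opt in to out-of-range status codes so they reach the callback and can be logged instead of being dropped silently. A minimal sketch (the spider name and URL are placeholders, not this project's code):

```python
import scrapy


class StatusSketchSpider(scrapy.Spider):
    """Illustrative spider, not the project's financeInfo spider."""

    name = "status_sketch"

    # Let these statuses reach parse() instead of being dropped silently
    # by HttpErrorMiddleware, which only passes 200-299 by default.
    handle_httpstatus_list = [403, 429, 500, 502, 503]

    def start_requests(self):
        # Placeholder URL; the real endpoint is Vietstock's financials API.
        yield scrapy.Request("https://example.com/financials?ticker=VIC")

    def parse(self, response):
        if response.status != 200:
            # Out-of-range responses would otherwise be ignored rather
            # than surfacing in the stats as errors.
            self.logger.warning("HTTP %s for %s", response.status, response.url)
            return
        # ... normal parsing would go here ...
```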

@vincetran96 (Owner)

vincetran96 commented Sep 27, 2021

So... Good news and bad news!

The good news is that I found the scraper did not fully print out the downloader stats (the summary at the end of the financeInfo log file) when finished (see issue #10). I have fixed this bug, so you can now see how many requests errored out, etc., which should give you a better picture of why some tickers are missing/not downloaded. Also, when the scraper encounters an error, it now logs the error to a file called financeInfo_{report_type}_spidererrors_short.log, containing the ticker, the report_type, the page number at which the error occurred, and the error type (or class).
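A minimal sketch of the kind of error logging described above, done with a request errback (the filename and logged fields follow the description, but the code itself is illustrative, not the actual commit):

```python
import logging

import scrapy


def make_error_logger(report_type):
    """Illustrative helper: a file logger named like the one described above."""
    logger = logging.getLogger(f"financeInfo_{report_type}_errors")
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        logger.addHandler(
            logging.FileHandler(f"financeInfo_{report_type}_spidererrors_short.log")
        )
    return logger


class ErrbackSketchSpider(scrapy.Spider):
    """Illustrative spider, not the project's financeInfo spider."""

    name = "errback_sketch"
    report_type = "ReportType"  # placeholder

    def start_requests(self):
        # Placeholder URL; the real endpoint is Vietstock's financials API.
        yield scrapy.Request(
            "https://example.com/financials",
            callback=self.parse,
            errback=self.log_error,
            cb_kwargs={"ticker": "VIC", "page": 1},
        )

    def parse(self, response, ticker, page):
        pass  # normal parsing would go here

    def log_error(self, failure):
        # The errback receives a twisted Failure; the failing request
        # (and its cb_kwargs) is available as failure.request.
        request = failure.request
        make_error_logger(self.report_type).error(
            "%s;%s;%s;%s",
            request.cb_kwargs.get("ticker"),
            self.report_type,
            request.cb_kwargs.get("page"),
            failure.type.__name__,
        )
```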

The bad news is that I still don't know why there are missing tickers..., but I still believe it's due to connection issues, because I don't see any logic errors that could leave tickers unhandled. Honestly, this has taken quite a lot of my time, and I think this fix will at least point you in the right direction: mass scraping is not an easy job, and a lot of external factors can interfere during the process and cause errors.

Please try pulling from master with the latest commit and see how it goes.

@vincetran96 (Owner)

@truongphanduykhanh I'm closing this issue since it's been inactive for more than a week. If there's anything, please comment :)
