Miss tickers when mass scraping #6
Comments
Can you post the summary at the end of the financeInfo log file here? It looks something like this:
Here we go:
@vincetran96 although the mass scraping doesn't work perfectly for me, as described, it runs much faster than the business-industry option. The mass option takes about 5 hours to download ~600 tickers before stopping, while the biz-ind option takes... 20 hours to scrape.
That's quite a big difference. There could still be some logic error in the implementation of the biz-ind scraping option. Other factors, like the quality of the connection to the Vietstock API at the time of scraping, may also be involved. Unfortunately, I currently do not have the time to go through the code. If you're really interested and have the time, you're more than welcome to inspect the code and propose what needs to be changed to make it better 😄
@truongphanduykhanh are you satisfied with what you got from the program so far? Would you still like me to look into this? I suspect network problems (pertaining to connection quality) that are not caught during the scraping process, as this tends to happen a lot when downloading data for extended periods of time. If you're OK with what you got, I will close this issue.
@vincetran96 I haven't figured out how to scrape with the mass option without interruption. I have tried to mass scrape 5 times, and every run was interrupted after finishing 600-1000 tickers. If it doesn't take you too much time, can you look into it? Otherwise, the biz-ind option can download all tickers even though it is slow.
Sorry for the delayed comment. I was trying to look into this yesterday. However, it seems like Vietstock has implemented a new security measure to prevent repeated/duplicate requests to their financials API endpoints (see issue #8). Now I have to find a workaround for that before addressing this issue.
By default, Scrapy only handles response codes in the 200-300 range (source). While I suspect there were some responses outside that range when you mass-scraped, there doesn't seem to be any in the scrape log summary you posted. I will have to do some further testing to see how Scrapy behaves with HTTP exceptions. For now, please just scrape whatever tickers are missing. Thanks!
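For reference, here is a rough sketch of Scrapy settings that would let non-2xx responses reach the spider and retry transient failures more often. The setting names are standard Scrapy settings, but the specific values are illustrative assumptions, not this project's actual configuration:

```python
# settings.py (sketch): the values below are illustrative assumptions,
# not the configuration this project ships with.

# Pass these non-2xx status codes through to spider callbacks instead of
# letting HttpErrorMiddleware filter them out. Callbacks must then check
# response.status themselves.
HTTPERROR_ALLOWED_CODES = [403, 404, 429, 500, 502, 503]

# Retry transient failures more times than Scrapy's default of 2.
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [408, 429, 500, 502, 503, 504]

# Give up on a slow request sooner than the 180-second default.
DOWNLOAD_TIMEOUT = 60
```

With codes allowed through like this, a callback can log the ticker together with `response.status`, which would show whether the missing tickers correspond to rejected requests.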
So... Good news and bad news! The good news is that I found that the scraper did not fully print out the downloader stats (the summary at the end of the The bad news is that I still don't know why there are missing tickers, but I still believe it's due to connection issues, because I don't see any logic errors that may leave some tickers unhandled. Honestly, this is taking quite a lot of my time, and I personally think that this fix will guide you in the right direction, because mass-scraping is not an easy job and a lot of external factors can come into play during the process, causing errors. Please try pulling from
@truongphanduykhanh I'm closing this issue since it's been inactive for more than a week. If there's anything, please comment :)
There are more than 3,000 tickers in the file bizType_ind_tickers.csv. However, only 600-1000 tickers are downloaded when I mass scrape (each run took 5 hours with my internet connection). It missed many blue chips whose information is certainly available on Vietstock, such as VIC and GAS. I have tried several times over both WiFi and LAN connections. The number of tickers and the number of JSON files vary from one run to another: one run got 657 tickers, another got 1025. None of the runs came close to the total of 3129 tickers in bizType_ind_tickers.csv. Every run stopped by itself and stated clearly that it had finished (as the terminal record at the end shows).
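A comparison like the one described above can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: the column name `ticker` and the idea that each downloaded JSON file name starts with the ticker symbol (for example `VIC_something.json`) are hypothetical and may not match the real file layout.

```python
import csv
from pathlib import Path


def load_expected_tickers(csv_path, column="ticker"):
    """Read the ticker column from the CSV (the column name is an assumption)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {
            row[column].strip()
            for row in csv.DictReader(f)
            if row.get(column)
        }


def load_downloaded_tickers(json_dir):
    """Collect tickers from JSON file names.

    Assumes the ticker is the part of the file name before the first
    underscore (or the whole stem when there is no underscore).
    """
    return {p.stem.split("_", 1)[0] for p in Path(json_dir).glob("*.json")}


def missing_tickers(csv_path, json_dir, column="ticker"):
    """Return the sorted tickers listed in the CSV that have no JSON file."""
    expected = load_expected_tickers(csv_path, column)
    downloaded = load_downloaded_tickers(json_dir)
    return sorted(expected - downloaded)
```

Calling `missing_tickers("bizType_ind_tickers.csv", "path/to/json/output")` (the output directory here is a placeholder) would then give the list of tickers to scrape again.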
The following is a count summary of total tickers and downloaded tickers for one of my mass scrapes:
The following is the terminal record: