
loop on multiple Analyze website write on same variable object #49

Open

mayouf opened this issue Jan 1, 2020 · 5 comments
mayouf commented Jan 1, 2020

Hi,
I want to analyze multiple websites by looping over a list and writing the results to a JSON file.

I noticed that when we crawl 2 different websites and store the outputs in two different variables (say A and B), the second variable, B, gets A's results added to it... and so on for subsequent crawls.

It is as if analyze() writes to the same object!

And it gets even weirder: when I delete A and B with del A, B, the analyze() function does not re-run, it recovers the old results from nowhere!

I tried the %reset function to erase the memory... but it still recovers the results from a local cache!

here is an example:

from seoanalyzer import analyze
A = analyze("https://krugerwildlifesafaris.com/")

# the length is 90
print(len(A['pages']))

B = analyze("http://www.vintage.co.bw/")

# the length is 90
print(len(A['pages']))
# the length is 100 but it should be 10
print(len(B['pages']))

A has 90 pages and B should have only 10 pages, but it has 100: the 90 from A plus its own 10.

How do I avoid this?
Why this erratic behavior?

regards,

karim.m
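
A toy illustration of the behavior described above: if analyze() kept its results in a module-level list, B would inherit A's pages, and del A, B would only drop the names while the library's internal state survived (even through IPython's %reset of the user namespace). This is a sketch of the pattern, not seoanalyzer's actual code:

_page_cache = []  # module-level state, survives del of caller variables

def analyze(url):
    _page_cache.append(url)                  # state accumulates across calls
    return {'pages': list(_page_cache)}

A = analyze("https://site-a.example/")
B = analyze("https://site-b.example/")
print(len(A['pages']), len(B['pages']))      # 1 2 -- B already includes A's entry

del A, B                                     # removes the names only...
C = analyze("https://site-c.example/")
print(len(C['pages']))                       # 3 -- the module-level cache survived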

@mayouf changed the title from "Analyze funtion write on same variable object" to "loop on multiple Analyze website write on same variable object" on Jan 1, 2020

ghost commented Jan 4, 2020

Same problem here, guys!


ghost commented Jan 4, 2020

I fixed the issue by doing this:
Go to the Manifest class in the implementation and look for the Analyze method.

At the end of the method, just before return output, add:
Manifest.clear_cache()

Everything will be cool!
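
A minimal sketch of where that call would sit, going by the description above (the Manifest class and clear_cache() method are the commenter's names, not verified against the library source):

class Manifest:
    _cache = {}                     # shared state that leaks between runs

    @classmethod
    def clear_cache(cls):
        cls._cache.clear()

    def analyze(self, url):
        output = ...                # existing crawling logic builds the results here
        Manifest.clear_cache()      # reset the shared cache before returning
        return output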


mayouf commented Jan 4, 2020

Hi Ghezaielm,

Thanks for your quick feedback. In the meantime, I used another workaround, see below:

import os

for website in list_of_website:
    file_name = ...  # whatever file name you want
    command = 'seoanalyze {} -f json > "{}"'.format(website, file_name)
    returned_value = os.system(command)
    print(str(returned_value) + ' name= ' + file_name + ' ' + website)
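
For what it's worth, subprocess.run is the more idiomatic equivalent of os.system here. A sketch assuming the same seoanalyze CLI and -f json flag as above, with a hypothetical file-naming scheme:

import subprocess

for website in list_of_website:
    # hypothetical naming scheme; use whatever file name you want
    file_name = website.replace('://', '_').replace('/', '_') + '.json'
    with open(file_name, 'w') as out:
        result = subprocess.run(['seoanalyze', website, '-f', 'json'], stdout=out)
    print(result.returncode, 'name=', file_name, website)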

And it is convenient if you want to parallelize your crawl using ThreadPoolExecutor.

I have an 8-core / 20-thread CPU and it is damn fast... I crawled 20k websites in a few hours!

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=80) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(analyze_SEO, url): url for url in list_website}
    # print(future_to_url)

    for future in concurrent.futures.as_completed(future_to_url):
        url_completed = future_to_url[future]
        try:
            data = future.result()
            if data is not None:
                print(data)
        except Exception as exc:
            print('%r generated an exception: %s' % (url_completed, exc))
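
The analyze_SEO function used above is not shown in the thread; presumably it wraps the per-site CLI call from the earlier loop. A hypothetical version consistent with that workaround (the name, file naming, and return value are illustrative):

import os

def analyze_SEO(url):
    # hypothetical helper: run the seoanalyze CLI for one URL and redirect
    # the JSON output to a per-site file, as in the loop above
    file_name = url.replace('://', '_').replace('/', '_') + '.json'
    returned_value = os.system('seoanalyze {} -f json > "{}"'.format(url, file_name))
    return '{} -> {} (exit {})'.format(url, file_name, returned_value)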

(PS: sorry, I did not know how to format the spaces in GitHub code quotes.)


mayouf commented Jan 4, 2020

Did you submit the correction on GitHub?

sethblack (Owner) commented

Ah, right. I'm putting this on my roadmap for v4.1. 👍

@sethblack self-assigned this on Feb 11, 2020