Encounter error "Unable to activate object" when there are multiple threads / concurrent tasks in Spark #5784
In the previous post mentioned here: https://discuss.nebula-graph.com.cn/t/topic/9726, zhang_hytc encountered the same issue as you did. You can try the following steps: first, execute the … Next, confirm whether you can establish a connection from your local environment to the storaged addresses exposed by the metad service.
Thanks @QingZ11 for your prompt response. The issue we encounter is not that we cannot connect to the storaged service under any circumstances; rather, we encounter the error when …
Please make sure all the Spark workers can ping the storaged address.
Thanks @Nicole00 for the reminder. I confirm that all Spark workers and the storaged service are within the same VPC network and their ports are reachable from one another.
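As a quick sanity check on each worker, one can try opening a TCP socket to the storaged address directly. This is a minimal sketch; the host and port below are placeholders (9779 is NebulaGraph's default storaged data port, which may differ in your deployment):

```scala
import java.net.{InetSocketAddress, Socket}

// Returns true if a TCP connection to host:port succeeds within timeoutMs.
// The host/port values used by callers are placeholders; storaged listens
// on 9779 by default, but check your own nebula-storaged.conf.
def isReachable(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: java.io.IOException => false
  } finally {
    socket.close()
  }
}

// Example: run this from each Spark worker against the storaged address
// println(isReachable("xx.xx.xx.xx", 9779))
```

Running this from every worker (not just the driver) distinguishes a cluster-wide connectivity problem from a per-host one.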
I've taken the initiative to do some preliminary checks, but so far they have not led to a resolution. To troubleshoot the issue more effectively, could you advise me on the following: …
In addition, we have observed weird behavior in another test, which is to connect to the database and run a count via spark-shell:

```
spark-shell --master yarn --deploy-mode client --driver-memory=2G --executor-memory=2G --num-executors=2 --executor-cores=2 --conf spark.dynamicAllocation.enabled=false --jars nebula-spark-connector_3.0-3.0-SNAPSHOT-jar-with-dependencies.jar
```

We then run the following snippet:

```scala
import org.apache.spark.sql.DataFrame
import com.vesoft.nebula.connector.connector.NebulaDataFrameReader
import com.vesoft.nebula.connector.{NebulaConnectionConfig, ReadNebulaConfig}

sc.setLogLevel("INFO")
val ec2_public_ip = "xx.xx.xx.xx"
val config = NebulaConnectionConfig.builder().withMetaAddress(s"${ec2_public_ip}:9559").withConnectionRetry(2).build()
val nebulaReadEdgeConfig: ReadNebulaConfig = ReadNebulaConfig.builder().withSpace("acct2asset_20231130").withLabel("USES").withNoColumn(false).withReturnCols(List()).withPartitionNum(20).build()
val dataset = spark.read.nebula(config, nebulaReadEdgeConfig).loadEdgesToDF()
dataset.show()
dataset.count()
```

The first four tasks raised the "Unable to activate object" error while the following ones did not.
So weird! Are the first four tasks located on different machines from the other tasks?
@Nicole00 Yes, these tasks all run on the same single machine where the storaged service is, and I confirm …
Really weird. If the tasks all run on ONE AND THE SAME machine, it looks like the storaged server was not ready at 10:59:00 but was ready at 11:01:09.
Could you please provide some log information for nebula storaged?
Sure, could you please let me know what …
You can config …
The logging is configured so that …
@Nicole00 By the way, following some random thought, we found tons of TCP connections while running:

```scala
import org.apache.spark.sql.DataFrame
import com.vesoft.nebula.connector.connector.NebulaDataFrameReader
import com.vesoft.nebula.connector.{NebulaConnectionConfig, ReadNebulaConfig}

sc.setLogLevel("INFO")
val ec2_public_ip = "xx.xx.xx.xx"
val config = NebulaConnectionConfig.builder().withMetaAddress(s"${ec2_public_ip}:9559").withConnectionRetry(2).build()
val nebulaReadEdgeConfig: ReadNebulaConfig = ReadNebulaConfig.builder().withSpace("acct2asset_20231130").withLabel("USES").withNoColumn(false).withReturnCols(List()).withPartitionNum(20).build()
val dataset = spark.read.nebula(config, nebulaReadEdgeConfig).loadEdgesToDF()
dataset.count()
```

There are around 21k TCP connections. Is this expected?
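To measure this on the affected host without eyeballing netstat output, one can count sockets whose remote endpoint uses a given port by parsing `/proc/net/tcp`. This is a Linux-only sketch, and the port 9779 mentioned in the comment is an assumption for the storaged data port; note it only covers IPv4 sockets (IPv6 lives in `/proc/net/tcp6`):

```scala
import scala.io.Source

// Count IPv4 sockets in /proc/net/tcp whose remote endpoint uses `port`.
// Each non-header line's 3rd column is rem_address in HEXIP:HEXPORT form,
// e.g. "0100007F:2633". Linux-only sketch; 9779 is an assumed storaged port.
def countConnectionsToPort(port: Int): Int = {
  val src = Source.fromFile("/proc/net/tcp")
  try {
    src.getLines()
      .drop(1) // skip the header line
      .map(_.trim.split("\\s+"))
      .count { cols =>
        cols.length > 2 && {
          val remPort = Integer.parseInt(cols(2).split(":")(1), 16)
          remPort == port
        }
      }
  } finally src.close()
}

// Example: snapshot the connection count while the Spark job runs
// println(countConnectionsToPort(9779))
```

Sampling this before, during, and after the `count()` job would show whether connections are being pooled and released or accumulating over the life of the job.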
Sorry for the late reply. I checked the … How much data is in your …
@Nicole00 No worries! There are 303938330 …
OK, I'll run a test to see if there is any connection leak. In the meantime, maybe you can update your nebula-spark-connector to the latest version.
A brief summary of what has been observed so far:
@Nicole00 Do you have any other ideas, taking these observations into consideration? Is there anything we could try to increase parallelism when reading the graph?
I really cannot reproduce your problem. I still think it's a network problem.
This question occurred to me quite accidentally: it's about the number of ports.
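One thing worth checking along those lines is the kernel's ephemeral port range, since on many distributions the default range holds roughly 28k ports, and ~21k concurrent outbound connections from a single host would leave little headroom. A Linux-only sketch (the path is the standard sysctl file, but the interpretation that port exhaustion explains this error is only a hypothesis):

```scala
import scala.io.Source

// Read the kernel's ephemeral (local) port range on Linux.
// A common default is 32768-60999, i.e. about 28k usable client ports;
// ~21k concurrent outbound connections would approach exhaustion.
def ephemeralPortRange(): (Int, Int) = {
  val src = Source.fromFile("/proc/sys/net/ipv4/ip_local_port_range")
  try {
    val Array(lo, hi) = src.mkString.trim.split("\\s+").map(_.toInt)
    (lo, hi)
  } finally src.close()
}

// val (lo, hi) = ephemeralPortRange()
// println(s"available ephemeral ports: ${hi - lo + 1}")
```

If the observed connection count is close to `hi - lo + 1`, new connection attempts would start failing intermittently, which could look like sporadic "Unable to activate object" errors on some tasks but not others.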
@sparkle-apt Hi, I have noticed that the issue you created hasn't been updated for nearly a month. Has this issue been resolved? If not, can you provide some more information? If it has been solved, can you close this issue? Thanks a lot for your contribution anyway 😊
We have decided not to be blocked by this issue for the moment and to move forward with other projects and tests on larger clusters. We will get back to it when bandwidth allows, so we can close the issue. Thanks for the reminder.
Settings
Hardware and software overview
NebulaGraph Database deployment
Computation cluster
Graph data
Others
Issues
When trying to scan the full graph data, e.g. with `count()` as shown in the snippet below, on the EMR machine we encountered the `Unable to activate object` error.

Snippet:

Error log:
However, we can successfully run the following and get results on the EMR machine.
We also tested scripts involving different volumes of the graph data. With `val n_limit = 1000000`, we can successfully run the following (a modified snippet from the nebula-algorithm package). However, when we increase it to `val n_limit = 10000000`, it fails and we get the same `Unable to activate object` error.

What we found so far
With more tests, we found that when the number of threads / concurrent tasks is `1`, the error does not occur, whereas when the number of threads is greater than `1`, the error appears. We suspect there is some constraint in NebulaGraph Database and wonder whether proper configuration tuning could help.

Could you please help with this issue? Feel free to let me know if I need to provide more information. Thanks a lot!