Hateful users Jupyter notebook #430
Conversation
…methods, GCN, GAT, and GraphSAGE, on classifying hateful users in an online social network.
…models run without problems with StellarGraph version 0.7.0
Check out this pull request on ReviewNB: https://app.reviewnb.com/stellargraph/stellargraph/pull/430 You'll be able to see visual diffs and write comments on notebook cells. Powered by ReviewNB.
Code Climate has analyzed commit 26d57c7 and detected 0 issues on this pull request. View more on Code Climate.
The notebook is an interesting use case showing how easy it is to compare different models. It definitely contributes to further understanding of the library's capabilities. However, it still needs a couple of changes:
- I'm not sure that the current folder structure is the best option, as it is too nested. First, we might not have use cases for all the tasks and have many for some. Secondly, if we develop use cases that are "mixed", like exploring both node classification and interpretability, or link prediction and communities, we will have problems deciding where to put them. One possible suggestion would be to leave use cases in the demos (as it is now) but remove the "node-classification" folder. I think having all the use cases together will be good enough, and we can be more verbose in titling them (the current title is good, imo).
- "We pose identifying hateful users as a binary classification problem and demonstrate the advantage of connected vs unconnected data with regards to increased prediction accuracy for a highly unbalanced dataset in a semi-supervised setting with few training examples." - this part does not read really well. Maybe better: "We pose the problem of identifying hateful users as a binary classification task". And I'd split the second sentence, as it is too long.
- Cell 3: do you think we need a table of all the features? Especially if we delete all the _c* features and the features related to network structure? I copied them directly from [1], so if somebody wants more details, the link is there. I'd keep only the ones we ended up using, something like: "the dataset contains features like ..., and the rest of the features are described here [link to the exact txt file in the Ribeiro GitHub repo]".
- Cell 12: I think a comment before creating the Yeo-Johnson pre-processing of the features would be useful for following what is happening (though you do explain it later).
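For illustration, a minimal sketch of Yeo-Johnson pre-processing using scikit-learn's `PowerTransformer` (the data here is hypothetical, not taken from the notebook):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical skewed, count-like feature column.
x = np.random.default_rng(0).exponential(scale=2.0, size=(200, 1))

# Yeo-Johnson handles zero and negative values, unlike Box-Cox.
pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
x_t = pt.fit_transform(x)
```

After the transform, the feature is roughly Gaussian and standardised to zero mean and unit variance, which is what a comment before the cell could briefly state.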
- Cell 13: which feature was transformed? Maybe explicitly mention the exact feature. Bug: you are plotting a row instead of a column.
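To make the row-vs-column bug concrete, a small sketch with a hypothetical feature table (rows are users, columns are features):

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: 5 users, 3 features.
df = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

row = df.iloc[0]  # the first ROW: one value per feature (length 3)
col = df["a"]     # the column "a": one value per user (length 5)
```

Plotting `df.iloc[0]` gives a histogram of one user's values across all features, while the intended plot is the distribution of a single feature across users, i.e. `df["a"]`.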
- Cell 21: for less experienced users it might be useful to mention that 1 indicates that a user is hateful, 0 that they are not, and 2 an unknown label.
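One way to make the label encoding self-documenting in the notebook (the column values below are hypothetical):

```python
import pandas as pd

# Hypothetical label column: 1 = hateful, 0 = normal, 2 = not annotated.
labels = pd.Series([2, 2, 0, 1, 0, 2, 1])
label_names = {0: "normal", 1: "hateful", 2: "unknown"}

# Readable counts instead of raw integer codes.
counts = labels.map(label_names).value_counts()
```

Printing `counts` then shows "normal"/"hateful"/"unknown" instead of bare 0/1/2.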
- Before cell 23: "For machine learning we want to take a subset of the nodes for training, and use the rest for validation and testing. We'll use scikit-learn again to split our data into training and test sets." - does "again" refer to the preprocessing here? It is a bit unclear why it is important that we use it again :)
- "We are only going to use 15% of the annotated nodes for training and the reamining 85% of nodes for testing." - typo in "remaining", and it would be good to mention why we train on only 15%. Is it a scalability issue? I'm more used to the training set being the same size as, or larger than, the test set.
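For reference, a minimal sketch of a 15%/85% stratified split with scikit-learn (labels here are hypothetical, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical annotated node labels (1 = hateful, 0 = normal), imbalanced.
labels = np.array([0] * 80 + [1] * 20)

# Train on only 15% of annotated nodes; stratify to preserve the class ratio.
train_idx, test_idx = train_test_split(
    np.arange(len(labels)), train_size=0.15, stratify=labels, random_state=42
)
```

Stratifying matters here: with few training examples and a highly unbalanced dataset, an unstratified split could leave almost no hateful users in the training set.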
- I also feel that cell 32 requires a bit of explanation (e.g. that these `node_features` are assigned to the graph object). It is also unclear why you need to create `annotated_user_features` separately, as the graph features come from `node_data`.
- As this use case is aimed at external library users, we do not need to print so many `data.head()` outputs, e.g. in cells 9, 19, 24, 30, and 32, as they provide no additional information for a user beyond a sanity check.
- Before cell 33: "As the model is imblanaced" - it is the data that is imbalanced, and there is a typo ("imbalanced").
- Before cell 34: "and the number of estimators in the ensemble as well as model-specific parameters." - I don't see any parameters related to ensembles.
- Before cell 36: "Now we can specify our machine learning model, we need a few more parameters for this but the parameters are model-specific." - please rephrase, e.g. "Now we can specify our machine learning model; it needs a few more, model-specific parameters."
- Before cell 39: "the binary cross entroy loss" - typo: entropy.
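As a side note, the loss in question is easy to state explicitly; a minimal NumPy sketch (illustrative only, the notebook presumably uses the Keras built-in):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; y_pred are predicted probabilities."""
    # Clip probabilities to avoid log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

loss = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
```

Confident correct predictions (0.9 for a positive, 0.1 for a negative) give a small loss; it grows without bound as predictions approach the wrong extreme.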
- Cells 75-77 require a bit more clarification, e.g. mention that these are the first 100 samples at different thresholds, and why exactly rows 55:66? Also, what can we conclude from this table? I suggest moving this discussion from the last paragraph of the conclusion to the cell under the table.
The reason this is needed is that
Note the change to printing the value counts. Also, the train_targets can be kept as a DataFrame and used in the flow method, except that calculating the class weights does require using
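For context, a minimal sketch of computing balanced class weights with scikit-learn, which expects a plain array rather than a DataFrame (the labels below are hypothetical):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced training labels (far fewer hateful users).
train_targets = np.array([0] * 90 + [1] * 10)

# "balanced" weights: n_samples / (n_classes * count per class).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=train_targets
)
```

The minority class receives the larger weight, which is what compensates for the imbalance during training.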
You could also plot these as labelled points on the ROC curve.
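A quick sketch of how those labelled points fall out of scikit-learn's `roc_curve` (scores below are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted scores for annotated test users.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
# Each threshold corresponds to one (fpr, tpr) point, so selected
# thresholds can be annotated directly on the ROC curve.
```

Annotating a few `thresholds[i]` values at their `(fpr[i], tpr[i])` coordinates would tie the threshold table back to the curve.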
Hi @adocherty and @annitrolla I have tackled all the issues you raised. Can you have another look? Thank you!
Looks good:
The notebook looks good 👍
26d57c7
Hi @adocherty and @annitrolla
I have created a demo notebook for the hateful users on Twitter use-case. It can be used to build a predictive model using GraphSAGE, GCN, or GAT and compare with a Logistic Regression model trained on the same split of the data.
I have created a new folder under demos named use-cases. I think notebooks that are not demonstrating a specific algorithm applied to a specific type of graph-ml problem, e.g., node classification or edge prediction, but rather show how to use one or more methods to solve an interesting use case, should be placed in this new folder. I'm open to suggestions if you find this is not the best folder structure.
I have tested the notebook against the latest StellarGraph release, 0.7.0. Let me know of any issues.
Regards,
P.