
Hateful users Jupyter notebook #430

Merged
merged 11 commits into develop from hateful-users on Jun 27, 2019

Conversation

@PantelisElinas
Contributor

Hi @adocherty and @annitrolla

I have created a demo notebook for the hateful users on Twitter use-case. It can be used to build a predictive model using GraphSAGE, GCN, or GAT, and to compare it with a Logistic Regression model trained on the same split of the data.

I have created a new folder under demos named use-cases. I think notebooks that do not demonstrate a specific algorithm applied to a specific type of graph-ML problem, e.g. node classification or edge prediction, but rather show how to use one or more methods to solve an interesting use-case, should be placed in this new folder. I'm open to suggestions if you find this is not the best folder structure.

I have tested the notebook against the latest StellarGraph release, 0.7.0. Let me know of any issues.

Regards,

P.

@review-notebook-app

Check out this pull request on ReviewNB: https://app.reviewnb.com/stellargraph/stellargraph/pull/430

You'll be able to see visual diffs and write comments on notebook cells. Powered by ReviewNB.

@codeclimate

codeclimate bot commented Jun 21, 2019

Code Climate has analyzed commit 26d57c7 and detected 0 issues on this pull request.

View more on Code Climate.

@annitrolla annitrolla left a comment
Contributor

The notebook is an interesting use-case showing how easy it is to compare different models, and it definitely contributes to further understanding of the library's capabilities. However, it still needs a couple of changes:

  • I'm not sure that the current folder structure is the best option, as it is too nested. First, we might not have use-cases for all the tasks and might have many for some. Secondly, if we develop use-cases that are "mixed", like exploring both node classification and interpretability, or link prediction and communities, we will have problems deciding where to put them. One possible suggestion would be to leave use-cases in demos (as it is now) but remove the "node-classification" folder. I think having just the use-cases all together will be good enough, and we can be more verbose in titling them (the current title is good, imo).

  • "We pose identifying hateful users as a binary classification problem and demonstrate the advantage of connected vs unconnected data with regards to increased prediction accuracy for a highly unbalanced dataset in a semi-supervised setting with few training examples." - this part does not read really well. May be better: "We pose a problem of identifying users as a binary classification task". And I'd split the second sentence as it is too long.

  • Cell 3: do you think we need a table of all the features, especially if we delete all the _c* features and the features related to network structure? I copied them directly from [1], so if somebody wants more details, the link is there. I'd keep only the ones we ended up using, something like: "the dataset contains features such as ..., and the rest of the features are described in [link to the exact txt file in Ribeiro's GitHub]".

  • Cell 12: I think a comment before creating the Yeo-Johnson pre-processing of the features would be useful for following what is happening (though you do explain it later); see the sketch after this list.

  • Cell 13: which feature was transformed? Maybe explicitly mention the exact feature. Bug: you are plotting a row instead of a column.

  • Cell 21: for less experienced users it might be useful to mention that label 1 indicates that a user is hateful, 0 that they are not, and 2 that the label is unknown.

  • Before cell 23: "For machine learning we want to take a subset of the nodes for training, and use the rest for validation and testing. We'll use scikit-learn again to split our data into training and test sets." - does "again" refer to the preprocessing here? It's a bit unclear why it is important that we use it again :)

  • "We are only going to use 15% of the annotated nodes for training and the reamining 85% of nodes for testing." - typo in "remaining", and would be good to mention why to train on 15%. is it a scalability issue? I guess I'm more used to training being the same or larger than the test.

  • I also feel that cell 32 requires a bit of explanation (e.g. that these node_features are assigned to a graph object). Also, it is not clear why you need to create annotated_user_features separately, since the graph features come from node_data.

  • As this use-case is aimed at external users of the library, we do not need to print so many data.head() outputs, e.g. in cells 9, 19, 24, 30, and 32; they do not provide additional information for a user beyond a sanity check.

  • Before cell 33: "As the model is imblanaced" - it is the data that is imbalanced, and there is a typo in "imbalanced".

  • Before cell 34: "and the number of estimators in the ensemble as well as model-specific parameters." - I don't see any parameters related to ensembles.

  • Before cell 36: "Now we can specify our machine learning model, we need a few more parameters for this but the parameters are model-specific." - please rephrase; this sentence is hard to follow.

  • Before cell 39: "the binary cross entroy loss" - typo: entropy.

  • Cells 75-77 require a bit more clarification, e.g. mention that these are the first 100 samples at different thresholds, and explain why exactly 55:66. Also, what can we conclude from this table? I suggest moving this from the last paragraph of the conclusion to a cell under the table.
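For the Yeo-Johnson comment above, a minimal sketch of the kind of pre-processing being discussed, assuming scikit-learn's PowerTransformer (the notebook's actual call may differ, and node_features is a hypothetical name):

from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson makes heavily skewed features (e.g. follower and tweet counts)
# more Gaussian-like; unlike Box-Cox it also handles zero and negative values.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
node_features_transformed = pt.fit_transform(node_features)  # node_features is hypothetical

And for the 15%/85% split comment, a sketch of how such a split could be made with scikit-learn; annotated_users and the 'hate' column follow the code discussed below, while the stratify and random_state choices are assumptions:

from sklearn.model_selection import train_test_split

# Keep only 15% of the annotated nodes for training and 85% for testing;
# stratifying preserves the (imbalanced) class proportions in both splits.
train_data, test_data = train_test_split(
    annotated_users,
    train_size=0.15,
    test_size=0.85,
    stratify=annotated_users["hate"],
    random_state=42,
)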

@adocherty
Contributor

  • I agree with @annitrolla that the use-cases directory would be better without the intermediate node-classification directory.

  • "Of the original 1037 node features, we are keeping only 204 that are based on a user's attributes and tweet lexicon." Some justification for this would improve understanding of what we are doing.

  • Having to add an axis back again for the targets is a little clunky:

annotated_user_features = annotated_users.drop(columns=['hate'])
annotated_user_targets = annotated_users['hate']  # a pd.Series, not a pd.DataFrame
print(annotated_user_targets.value_counts())
...
# the Series has lost the column axis, so it has to be added back:
train_targets = train_targets.values
train_targets = train_targets[..., np.newaxis]
test_targets = test_targets.values
test_targets = test_targets[..., np.newaxis]

The reason this is needed is that annotated_users['hate'] is a pd.Series, not a pd.DataFrame. By using double brackets for the selection we get a pd.DataFrame and don't have to add the axis back:

annotated_user_targets = annotated_users[['hate']]
print(annotated_user_targets.hate.value_counts())
...

Note the change to printing the value counts. Also, train_targets can then be kept as a DataFrame and used in the flow method, except that calculating the class weights does require using train_targets.values (see the sketch below).
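A minimal sketch of that class-weight calculation, assuming scikit-learn's compute_class_weight; the train_targets name follows the snippet above:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Flatten the single-column DataFrame to the 1-D array sklearn expects.
y = train_targets.values.ravel()
classes = np.unique(y)
weights = compute_class_weight("balanced", classes=classes, y=y)
class_weights = dict(zip(classes, weights))  # e.g. pass as class_weight to Keras fit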

  • "Train the model, keeping track of its loss and accuracy on the training set, and its performance on the validation set during the training (e.g., for early stopping), " we aren't using early stopping here, so perhaps remove it.

  • The table showing FPR and TPR at the end is hard to interpret. Giving approximate numbers would be easier, for example:

print("At 1% FPR, GraphSAGE TPR={:.3f}, LR TPR={:.3f}".format(np.interp(0.01, fpr, tpr), np.interp(0.01, fpr_lr, tpr_lr)))

You could also plot these as labelled points on the ROC curve.
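A minimal sketch of that idea, assuming fpr/tpr and fpr_lr/tpr_lr come from sklearn.metrics.roc_curve for the GraphSAGE and Logistic Regression models respectively:

import numpy as np
import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label="GraphSAGE")
plt.plot(fpr_lr, tpr_lr, label="Logistic Regression")
# Mark each model's TPR at approximately 1% FPR as a labelled point.
for name, f, t in [("GraphSAGE", fpr, tpr), ("LR", fpr_lr, tpr_lr)]:
    tpr_at_1pct = np.interp(0.01, f, t)
    plt.plot(0.01, tpr_at_1pct, "ko")
    plt.annotate("{}: TPR={:.2f}".format(name, tpr_at_1pct), (0.01, tpr_at_1pct))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()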

@PantelisElinas
Contributor Author

Hi @adocherty and @annitrolla

I have tackled all the issues you raised. Can you have another look?

Thank you!

@annitrolla
Contributor

Looks good! I found two typos: "the totail number of nodes" and "the reamining 85%".

annitrolla previously approved these changes Jun 27, 2019
adocherty previously approved these changes Jun 27, 2019

@adocherty adocherty left a comment
Contributor

The notebook looks good 👍

@PantelisElinas PantelisElinas dismissed stale reviews from adocherty and annitrolla via 26d57c7 June 27, 2019 23:26
@PantelisElinas PantelisElinas merged commit 3eacabf into develop Jun 27, 2019
@PantelisElinas PantelisElinas deleted the hateful-users branch June 27, 2019 23:37