
Hateful users Jupyter notebook #430

Merged
merged 11 commits into develop from hateful-users on Jun 27, 2019

Conversation

@PantelisElinas
Contributor

Hi @adocherty and @annitrolla

I have created a demo notebook for the hateful users on Twitter use-case. It can be used to build a predictive model using GraphSAGE, GCN, or GAT, and to compare it with a Logistic Regression model trained on the same split of the data.

I have created a new folder under demos named use-cases. I think notebooks that do not demonstrate a specific algorithm applied to a specific type of graph-ML problem, e.g. node classification or edge prediction, but rather show how to use one or more methods to solve an interesting use-case, should be placed in this new folder. I'm open to suggestions if you find this is not the best folder structure.

I have tested the notebook against the latest StellarGraph release, 0.7.0. Let me know of any issues.

Regards,

P.

@review-notebook-app

Check out this pull request on ReviewNB: https://app.reviewnb.com/stellargraph/stellargraph/pull/430

You'll be able to see visual diffs and write comments on notebook cells. Powered by ReviewNB.

@codeclimate

codeclimate bot commented Jun 21, 2019

Code Climate has analyzed commit 26d57c7 and detected 0 issues on this pull request.

View more on Code Climate.

@annitrolla annitrolla left a comment
Contributor

The notebook is an interesting use-case showing how easy it is to compare different models, and it definitely contributes to further understanding of the library's capabilities. However, it still needs a couple of changes:

  • I'm not sure that the current folder structure is the best option, as it is too nested. First, we might not have use-cases for all the tasks and might have many for some. Secondly, if we develop use-cases that are "mixed", like exploring both node classification and interpretability, or link prediction and communities, we will have problems deciding where to put them. One possible suggestion would be to leave use-cases in demos (as it is now) but remove the "node-classification" folder. I think having just the use-cases all together will be good enough, and we can be more verbose in titling them (the current title is good, imo).

  • "We pose identifying hateful users as a binary classification problem and demonstrate the advantage of connected vs unconnected data with regards to increased prediction accuracy for a highly unbalanced dataset in a semi-supervised setting with few training examples." - this part does not read really well. May be better: "We pose a problem of identifying users as a binary classification task". And I'd split the second sentence as it is too long.

  • Cell 3: do you think we need a table of all the features, especially if we delete all the _c* features and the features related to network structure? I copied them directly from [1], so if somebody wants more details, the link is there. I'd keep only the ones we ended up using, something like: "the dataset contains features such as ..., and the rest of the features are described in [link to the exact txt file in Ribeiro's GitHub]".

  • Cell 12: I think a comment before creating the Yeo-Johnson pre-processing of the features would be useful for following what is happening (though you do explain it later); see the sketch after this list.

  • Cell 13: which feature was transformed? Maybe explicitly mention the exact feature. Bug: you are plotting a row instead of a column.

  • Cell 21: for less experienced users it might be useful to mention that label 1 indicates that a user is hateful, 0 that they are not, and 2 that the label is unknown.

  • Before cell 23: "For machine learning we want to take a subset of the nodes for training, and use the rest for validation and testing. We'll use scikit-learn again to split our data into training and test sets." - does "again" refer to the preprocessing here? It's a bit unclear why it is important that we use it again :)

  • "We are only going to use 15% of the annotated nodes for training and the reamining 85% of nodes for testing." - typo in "remaining", and would be good to mention why to train on 15%. is it a scalability issue? I guess I'm more used to training being the same or larger than the test.

  • I also feel that cell 32 requires a bit of explanation (e.g. that these node_features are assigned to a graph object). Also, it is not clear why you need to create annotated_user_features separately, since the graph features come from node_data.

  • As this use-case is aimed at external users of the library, we do not need to print so many data.head() outputs, e.g. in cells 9, 19, 24, 30, and 32; they do not provide additional information for a user beyond a sanity check.

  • Before cell 33: "As the model is imblanaced" - it is the data that is imbalanced, and there is a typo in "imbalanced".

  • Before cell 34: "and the number of estimators in the ensemble as well as model-specific parameters." - I don't see any parameters related to ensembles.

  • Before cell 36: "Now we can specify our machine learning model, we need a few more parameters for this but the parameters are model-specific." - please rephrase; this sentence is hard to follow.

  • Before cell 39: "the binary cross entroy loss" - typo: entropy.

  • Cells 75-77 require a bit more clarification, e.g. mention that these are the first 100 samples at different thresholds, and explain why exactly 55:66. Also, what can we conclude from this table? I suggest moving this from the last paragraph of the conclusion to a cell under the table.
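For the Yeo-Johnson comment above, a minimal sketch of the kind of pre-processing being discussed, assuming scikit-learn's PowerTransformer (the notebook's actual call may differ, and node_features is a hypothetical name):

from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson makes heavily skewed features (e.g. follower and tweet counts)
# more Gaussian-like; unlike Box-Cox it also handles zero and negative values.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
node_features_transformed = pt.fit_transform(node_features)  # node_features is hypothetical

And for the 15%/85% split comment, a sketch of how such a split could be made with scikit-learn; annotated_users and the 'hate' column follow the code discussed below, while the stratify and random_state choices are assumptions:

from sklearn.model_selection import train_test_split

# Keep only 15% of the annotated nodes for training and 85% for testing;
# stratifying preserves the (imbalanced) class proportions in both splits.
train_data, test_data = train_test_split(
    annotated_users,
    train_size=0.15,
    test_size=0.85,
    stratify=annotated_users["hate"],
    random_state=42,
)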

@adocherty
Contributor

  • I agree with @annitrolla that the use-cases directory would be better without the intermediate node-classification directory.

  • "Of the original 1037 node features, we are keeping only 204 that are based on a user's attributes and tweet lexicon." Some justification for this would improve understanding of what we are doing.

  • Having to add an axis back again for the targets is a little clunky:

annotated_user_features = annotated_users.drop(columns=['hate'])
annotated_user_targets = annotated_users['hate']  # a pd.Series, not a pd.DataFrame
print(annotated_user_targets.value_counts())
...
# the Series has lost the column axis, so it has to be added back:
train_targets = train_targets.values
train_targets = train_targets[..., np.newaxis]
test_targets = test_targets.values
test_targets = test_targets[..., np.newaxis]

The reason this is needed is that annotated_users['hate'] is a pd.Series, not a pd.DataFrame. By using double brackets for the selection we get a pd.DataFrame and don't have to add the axis back:

annotated_user_targets = annotated_users[['hate']]
print(annotated_user_targets.hate.value_counts())
...

Note the change to printing the value counts. Also, train_targets can then be kept as a DataFrame and used in the flow method, except that calculating the class weights does require using train_targets.values (see the sketch below).
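A minimal sketch of that class-weight calculation, assuming scikit-learn's compute_class_weight; the train_targets name follows the snippet above:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Flatten the single-column DataFrame to the 1-D array sklearn expects.
y = train_targets.values.ravel()
classes = np.unique(y)
weights = compute_class_weight("balanced", classes=classes, y=y)
class_weights = dict(zip(classes, weights))  # e.g. pass as class_weight to Keras fit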

  • "Train the model, keeping track of its loss and accuracy on the training set, and its performance on the validation set during the training (e.g., for early stopping), " we aren't using early stopping here, so perhaps remove it.

  • The table showing FPR and TPR at the end is hard to interpret. Giving approximate numbers would be easier, for example:

print("At 1% FPR, GraphSAGE TPR={:.3f}, LR TPR={:.3f}".format(np.interp(0.01, fpr, tpr), np.interp(0.01, fpr_lr, tpr_lr)))

You could also plot these as labelled points on the ROC curve.
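A minimal sketch of that idea, assuming fpr/tpr and fpr_lr/tpr_lr come from sklearn.metrics.roc_curve for the GraphSAGE and Logistic Regression models respectively:

import numpy as np
import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label="GraphSAGE")
plt.plot(fpr_lr, tpr_lr, label="Logistic Regression")
# Mark each model's TPR at approximately 1% FPR as a labelled point.
for name, f, t in [("GraphSAGE", fpr, tpr), ("LR", fpr_lr, tpr_lr)]:
    tpr_at_1pct = np.interp(0.01, f, t)
    plt.plot(0.01, tpr_at_1pct, "ko")
    plt.annotate("{}: TPR={:.2f}".format(name, tpr_at_1pct), (0.01, tpr_at_1pct))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()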

@PantelisElinas
Contributor Author

Hi @adocherty and @annitrolla

I have tackled all the issues you raised. Can you have another look?

Thank you!

@annitrolla
Contributor

Looks good! I found two typos: "the totail number of nodes" and "the reamining 85%".

annitrolla previously approved these changes Jun 27, 2019
adocherty previously approved these changes Jun 27, 2019

@adocherty adocherty left a comment
Contributor

The notebook looks good 👍

@PantelisElinas PantelisElinas dismissed stale reviews from adocherty and annitrolla via 26d57c7 June 27, 2019 23:26
@PantelisElinas PantelisElinas merged commit 3eacabf into develop Jun 27, 2019
@PantelisElinas PantelisElinas deleted the hateful-users branch June 27, 2019 23:37