Rebuilds the ptwiki model with more data. #225

Merged
merged 2 commits into from Jun 11, 2020
32 changes: 23 additions & 9 deletions Makefile
@@ -2902,6 +2902,10 @@ datasets/ptwiki.human_labeled_revisions.20k_2015.json:
 ./utility fetch_labels \
 https://labels.wmflabs.org/campaigns/ptwiki/7/ > $@

+datasets/ptwiki.human_labeled_revisions.4k_2020.json:
+./utility fetch_labels \
+https://labels.wmflabs.org/campaigns/ptwiki/93/ > $@
+
 # From https://quarry.wmflabs.org/query/43215
 datasets/ptwiki.sampled_revisions.10k_2020.json:
 wget -qO- https://quarry.wmflabs.org/run/444194/output/0/json-lines > $@
@@ -2910,17 +2914,27 @@ datasets/ptwiki.autolabeled_revisions.10k_2020.json: \
 datasets/ptwiki.sampled_revisions.10k_2020.json
 cat $< | \
 ./utility autolabel --host=https://pt.wikipedia.org \
---trusted-groups=bot,sysop,bureaucrat \
+--trusted-groups=bot,sysop,bureaucrat,autoreviewer,rollbacker \
 --trusted-edits=1000 \
 --revert-radius=5 \
 --verbose > $@

+datasets/ptwiki.labeled_revisions.10k_2020.json: \
+datasets/ptwiki.human_labeled_revisions.4k_2020.json \
+datasets/ptwiki.autolabeled_revisions.10k_2020.json
+./utility merge_labels $^ > $@
+
 datasets/ptwiki.labeled_revisions.20k_2015.json: \
 datasets/ptwiki.human_labeled_revisions.20k_2015.json
 ./utility merge_labels $^ > $@

-datasets/ptwiki.labeled_revisions.w_cache.20k_2015.json: \
-datasets/ptwiki.labeled_revisions.20k_2015.json
+datasets/ptwiki.labeled_revisions.30k_2015_2020.json: \
+datasets/ptwiki.labeled_revisions.20k_2015.json \
+datasets/ptwiki.labeled_revisions.10k_2020.json
+cat $^ > $@
+
+datasets/ptwiki.labeled_revisions.w_cache.30k_2015_2020.json: \
+datasets/ptwiki.labeled_revisions.30k_2015_2020.json
 cat $< | \
 revscoring extract \
 editquality.feature_lists.ptwiki.damaging \
@@ -2930,7 +2944,7 @@ datasets/ptwiki.labeled_revisions.w_cache.20k_2015.json: \
 --verbose > $@

 tuning_reports/ptwiki.damaging.md: \
-datasets/ptwiki.labeled_revisions.w_cache.20k_2015.json
+datasets/ptwiki.labeled_revisions.w_cache.30k_2015_2020.json
 cat $< | \
 revscoring tune \
 config/classifiers.params.yaml \
@@ -2945,13 +2959,13 @@ tuning_reports/ptwiki.damaging.md: \
 --debug > $@

 models/ptwiki.damaging.gradient_boosting.model: \
-datasets/ptwiki.labeled_revisions.w_cache.20k_2015.json
+datasets/ptwiki.labeled_revisions.w_cache.30k_2015_2020.json
 cat $< | \
 revscoring cv_train \
 revscoring.scoring.models.GradientBoosting \
 editquality.feature_lists.ptwiki.damaging \
 damaging \
---version=$(damaging_major_minor).0 \
+--version=$(damaging_major_minor).1 \
 -p 'learning_rate=0.01' \
 -p 'max_depth=7' \
 -p 'max_features="log2"' \
@@ -2964,7 +2978,7 @@ models/ptwiki.damaging.gradient_boosting.model: \
 revscoring model_info $@ > model_info/ptwiki.damaging.md

 tuning_reports/ptwiki.goodfaith.md: \
-datasets/ptwiki.labeled_revisions.w_cache.20k_2015.json
+datasets/ptwiki.labeled_revisions.w_cache.30k_2015_2020.json
 cat $< | \
 revscoring tune \
 config/classifiers.params.yaml \
@@ -2979,13 +2993,13 @@ tuning_reports/ptwiki.goodfaith.md: \
 --debug > $@

 models/ptwiki.goodfaith.gradient_boosting.model: \
-datasets/ptwiki.labeled_revisions.w_cache.20k_2015.json
+datasets/ptwiki.labeled_revisions.w_cache.30k_2015_2020.json
 cat $< | \
 revscoring cv_train \
 revscoring.scoring.models.GradientBoosting \
 editquality.feature_lists.ptwiki.goodfaith \
 goodfaith \
---version=$(goodfaith_major_minor).0 \
+--version=$(goodfaith_major_minor).1 \
 -p 'learning_rate=0.01' \
 -p 'max_depth=7' \
 -p 'max_features="log2"' \
20 changes: 16 additions & 4 deletions config/wikis/ptwiki.yaml
@@ -5,6 +5,8 @@ host: pt.wikipedia.org
 external_samples:
 human_labeled_revisions.20k_2015:
 labeling_campaign: https://labels.wmflabs.org/campaigns/ptwiki/7/
+human_labeled_revisions.4k_2020:
+labeling_campaign: https://labels.wmflabs.org/campaigns/ptwiki/93/
 sampled_revisions.10k_2020:
 quarry_page: https://quarry.wmflabs.org/query/43215
 quarry_url: https://quarry.wmflabs.org/run/444194/output/0/json-lines
@@ -23,34 +25,44 @@ autolabeled_samples:
 merged_samples:
 labeled_revisions.20k_2015:
 - human_labeled_revisions.20k_2015
+labeled_revisions.10k_2020:
+- human_labeled_revisions.4k_2020
+- autolabeled_revisions.10k_2020

+concatenated_samples:
+labeled_revisions.30k_2015_2020:
+- labeled_revisions.20k_2015
+- labeled_revisions.10k_2020
+
 extracted_samples:
-labeled_revisions.w_cache.20k_2015:
-sample: labeled_revisions.20k_2015
+labeled_revisions.w_cache.30k_2015_2020:
+sample: labeled_revisions.30k_2015_2020
 features_for:
 - damaging
 - goodfaith

 models:
 damaging:
-observations: labeled_revisions.w_cache.20k_2015
+observations: labeled_revisions.w_cache.30k_2015_2020
 label: damaging
 pop_rate_true: 0.06896029864299047
 tune: true
 cv_train:
 algorithm: GradientBoosting
+build_number: 1
 parameters:
 learning_rate: 0.01
 max_depth: 7
 max_features: log2
 n_estimators: 700
 goodfaith:
-observations: labeled_revisions.w_cache.20k_2015
+observations: labeled_revisions.w_cache.30k_2015_2020
 label: goodfaith
 pop_rate_true: 0.9397669373959542
 tune: true
 cv_train:
 algorithm: GradientBoosting
+build_number: 1
 parameters:
 learning_rate: 0.01
 max_depth: 7
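Note on versioning: the `build_number: 1` entries added to `config/wikis/ptwiki.yaml` line up with the `.1` suffix on `--version=$(damaging_major_minor).1` and `--version=$(goodfaith_major_minor).1` in the Makefile. The following is a minimal sketch of that composition, assuming the version string is simply `<major_minor>.<build_number>` and that a missing `build_number` means `0`. The function name `model_version` and the `"0.5"` value are hypothetical; this diff does not show how the Makefile is generated or what `damaging_major_minor` expands to.

```python
def model_version(major_minor: str, build_number: int = 0) -> str:
    """Join a major.minor prefix and a build number into a version string.

    Assumption: a config without `build_number` corresponds to build 0,
    which matches the `.0` suffix the Makefile used before this change.
    """
    if build_number < 0:
        raise ValueError("build_number must be non-negative")
    return f"{major_minor}.{build_number}"


# Hypothetical major.minor value; the real one is defined elsewhere.
print(model_version("0.5", 1))
```

Under these assumptions, bumping only `build_number` signals a rebuild on new data without a change to the feature set or the major.minor version.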
112 changes: 56 additions & 56 deletions model_info/arwiki.damaging.md
@@ -1,12 +1,12 @@
 Model Information:
 - type: GradientBoosting
 - version: 0.5.0
-- params: {'population_rates': None, 'min_samples_split': 2, 'scale': True, 'warm_start': False, 'loss': 'deviance', 'labels': [True, False], 'subsample': 1.0, 'learning_rate': 0.01, 'min_impurity_decrease': 0.0, 'init': None, 'validation_fraction': 0.1, 'presort': 'auto', 'center': True, 'min_samples_leaf': 1, 'criterion': 'friedman_mse', 'max_features': 'log2', 'max_leaf_nodes': None, 'multilabel': False, 'label_weights': OrderedDict([(True, 10)]), 'tol': 0.0001, 'max_depth': 3, 'n_estimators': 100, 'min_impurity_split': None, 'n_iter_no_change': None, 'random_state': None, 'min_weight_fraction_leaf': 0.0, 'verbose': 0}
+- params: {'labels': [True, False], 'learning_rate': 0.01, 'warm_start': False, 'n_iter_no_change': None, 'init': None, 'multilabel': False, 'ccp_alpha': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'center': True, 'n_estimators': 100, 'population_rates': None, 'max_depth': 3, 'verbose': 0, 'min_impurity_decrease': 0.0, 'subsample': 1.0, 'max_features': 'log2', 'scale': True, 'label_weights': OrderedDict([(True, 10)]), 'loss': 'deviance', 'tol': 0.0001, 'criterion': 'friedman_mse', 'min_impurity_split': None, 'presort': 'deprecated', 'max_leaf_nodes': None, 'validation_fraction': 0.1, 'min_weight_fraction_leaf': 0.0, 'random_state': None}
 Environment:
-- revscoring_version: '2.5.1'
-- platform: 'Linux-4.9.0-9-amd64-x86_64-with-debian-9.9'
+- revscoring_version: '2.8.0'
+- platform: 'Linux-4.9.0-11-amd64-x86_64-with-debian-9.12'
 - machine: 'x86_64'
-- version: '#1 SMP Debian 4.9.168-1+deb9u2 (2019-05-13)'
+- version: '#1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20)'
 - system: 'Linux'
 - processor: ''
 - python_build: ('default', 'Sep 27 2018 17:25:39')
@@ -15,67 +15,67 @@ Model Information:
 - python_implementation: 'CPython'
 - python_revision: ''
 - python_version: '3.5.3'
-- release: '4.9.0-9-amd64'
+- release: '4.9.0-11-amd64'

 Statistics:
-counts (n=18528):
+counts (n=18479):
 label n ~True ~False
 ------- ----- --- ------- --------
-True 339 --> 16 323
-False 18189 --> 14 18175
+True 339 --> 1 338
+False 18140 --> 2 18138
 rates:
 True False
 ---------- ------ -------
 sample 0.018 0.982
 population 0.021 0.979
-match_rate (micro=0.977, macro=0.5):
-False True
-------- ------
-0.998 0.002
-filter_rate (micro=0.023, macro=0.5):
-False True
-------- ------
-0.002 0.998
-recall (micro=0.979, macro=0.523):
-False True
-------- ------
-0.999 0.047
-!recall (micro=0.068, macro=0.523):
-False True
-------- ------
-0.047 0.999
-precision (micro=0.971, macro=0.776):
-False True
-------- ------
-0.98 0.573
-!precision (micro=0.582, macro=0.776):
-False True
-------- ------
-0.573 0.98
-f1 (micro=0.97, macro=0.538):
-False True
-------- ------
-0.989 0.087
-!f1 (micro=0.107, macro=0.538):
-False True
-------- ------
-0.087 0.989
+match_rate (micro=0.978, macro=0.5):
+True False
+------ -------
+0 1
+filter_rate (micro=0.022, macro=0.5):
+True False
+------ -------
+1 0
+recall (micro=0.979, macro=0.501):
+True False
+------ -------
+0.003 1
+!recall (micro=0.024, macro=0.501):
+True False
+------ -------
+1 0.003
+precision (micro=0.966, macro=0.674):
+True False
+------ -------
+0.369 0.979
+!precision (micro=0.382, macro=0.674):
+True False
+------ -------
+0.979 0.369
+f1 (micro=0.968, macro=0.497):
+True False
+------ -------
+0.006 0.989
+!f1 (micro=0.027, macro=0.497):
+True False
+------ -------
+0.989 0.006
 accuracy (micro=0.979, macro=0.979):
-False True
-------- ------
-0.979 0.979
-fpr (micro=0.932, macro=0.477):
-False True
-------- ------
-0.953 0.001
-roc_auc (micro=0.938, macro=0.937):
-False True
-------- ------
-0.938 0.937
-pr_auc (micro=0.983, macro=0.638):
-False True
-------- ------
-0.998 0.278
+True False
+------ -------
+0.979 0.979
+fpr (micro=0.976, macro=0.499):
+True False
+------ -------
+0 0.997
+roc_auc (micro=0.936, macro=0.936):
+True False
+------ -------
+0.937 0.936
+pr_auc (micro=0.982, macro=0.616):
+True False
+------ -------
+0.234 0.998

-- score_schema: {'title': 'Scikit learn-based classifier score with probability', 'type': 'object', 'properties': {'probability': {'description': 'A mapping of probabilities onto each of the potential output labels', 'type': 'object', 'properties': {'true': {'type': 'number'}, 'false': {'type': 'number'}}}, 'prediction': {'description': 'The most likely label predicted by the estimator', 'type': 'boolean'}}}
+- score_schema: {'type': 'object', 'title': 'Scikit learn-based classifier score with probability', 'properties': {'probability': {'type': 'object', 'description': 'A mapping of probabilities onto each of the potential output labels', 'properties': {'true': {'type': 'number'}, 'false': {'type': 'number'}}}, 'prediction': {'type': 'boolean', 'description': 'The most likely label predicted by the estimator'}}}
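
Note on reading the statistics: the per-label `recall` values in `model_info/arwiki.damaging.md` can be checked by hand against the `counts` table. The sketch below assumes each row reads `label n --> ~True ~False`, that is, the label total followed by how many of those observations were predicted True and predicted False. The helper `recall_per_label` is illustrative and not part of revscoring. Precision is not recomputed here: the reported values do not match the raw counts, which suggests they are scaled by the population rates.

```python
# Rows copied from the two `counts` tables in the diff:
# label -> (n, predicted True, predicted False)
OLD_COUNTS = {True: (339, 16, 323), False: (18189, 14, 18175)}
NEW_COUNTS = {True: (339, 1, 338), False: (18140, 2, 18138)}


def recall_per_label(counts):
    """Return recall for each label, rounded to three decimals."""
    result = {}
    for label, (n, predicted_true, predicted_false) in counts.items():
        # A prediction is correct when it equals the row's own label.
        correct = predicted_true if label else predicted_false
        result[label] = round(correct / n, 3)
    return result


print(recall_per_label(OLD_COUNTS))
print(recall_per_label(NEW_COUNTS))
```

The results agree with the `recall` rows on both sides of the diff: the rebuilt arwiki model predicts True for only 1 of the 339 damaging observations, down from 16.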