## Draft topic walk-through

This notebook investigates how the drafttopic model is built, then uses the model to score a document.

## How does it work?

We start with a low-level check that confirms the word2vec embeddings loaded successfully and shows what kind of data is used to train the model. In normal use, you would let feature extraction run with the defaults.

In [2]:
from drafttopic.feature_lists import wordvectors


wordvectors.vectorize_words(["love", "shack"])

[array([ 0.10302734, -0.15234375,  0.02587891,  0.16503906, -0.16503906,
         0.06689453,  0.29296875, -0.26367188, -0.140625  ,  0.20117188,
        -0.02624512, -0.08203125, -0.02770996, -0.04394531, -0.23535156,
         0.16992188,  0.12890625,  0.15722656,  0.00756836, -0.06982422,
        -0.03857422,  0.07958984,  0.22949219, -0.14355469,  0.16796875,
        -0.03515625,  0.05517578,  0.10693359,  0.11181641, -0.16308594,
        -0.11181641,  0.13964844,  0.01556396,  0.12792969,  0.15429688,
         0.07714844,  0.26171875,  0.08642578, -0.02514648,  0.33398438,
         0.18652344, -0.20996094,  0.07080078,  0.02600098, -0.10644531,
        -0.10253906,  0.12304688,  0.04711914,  0.02209473,  0.05834961,
        -0.10986328,  0.14941406, -0.10693359,  0.01556396,  0.08984375,
         0.11230469, -0.04370117, -0.11376953, -0.0037384 , -0.01818848,
         0.24316406,  0.08447266, -0.07080078,  0.18066406,  0.03515625,
        -0.09667969, -0.21972656, -0.00328064, -0.0
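To make the output above easier to read: each input word maps to one fixed-length array of floats. The sketch below mimics that lookup with a tiny hand-made table. The names `TOY_EMBEDDINGS` and `toy_vectorize_words`, the three-dimensional made-up values, and the choice to skip unknown words are all illustrative assumptions, not drafttopic's implementation or real word2vec values.

```python
# Hypothetical embedding table with made-up three-dimensional vectors.
# Real word2vec vectors are much longer and come from a trained model.
TOY_EMBEDDINGS = {
    "love": [0.10, -0.15, 0.03],
    "shack": [-0.02, 0.21, 0.07],
}


def toy_vectorize_words(words, embeddings=TOY_EMBEDDINGS):
    """Return one vector per known word, in input order.

    Assumption for this sketch: words missing from the table are skipped.
    """
    return [embeddings[word] for word in words if word in embeddings]


vectors = toy_vectorize_words(["love", "shack", "zzyzx"])
print(len(vectors), len(vectors[0]))  # prints "2 3"
```

The unknown word contributes nothing, so two vectors of length three come back for three input words.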

Prepare an extractor to pull information from the English Wikipedia API.

In [2]:
import mwapi
from revscoring.extractors import api

extractor = api.Extractor(mwapi.Session("https://en.wikipedia.org", user_agent="drafttopic demo"))

Run feature extraction from within the revscoring framework.

In [8]:
features = wordvectors.drafttopic
rev_id = 604413609
feature_values = extractor.extract(rev_id, features)
list(zip(features, feature_values))

[(<feature_vector.revision.text.google_news_vector_mean>,
  array([ 0.05240885,  0.01155599, -0.02832031,  0.2718099 , -0.23209635,
         -0.04370117,  0.0612793 , -0.18595378,  0.09033203, -0.01529948,
          0.08577474,  0.10717773,  0.19824219,  0.03938802, -0.02099609,
         -0.08333333, -0.00569661,  0.08056641, -0.15966797, -0.03450521,
          0.1352946 ,  0.16015625, -0.00236003, -0.00642904, -0.14200846,
         -0.01424154,  0.06347656,  0.11686198,  0.0059611 ,  0.03503418,
         -0.04589844,  0.06241862, -0.12255859,  0.01123047,  0.0953776 ,
         -0.11206055,  0.0631307 , -0.00764974,  0.02596029,  0.08719889,
         -0.01660156,  0.18983968,  0.04178874,  0.0999349 , -0.01448568,
         -0.1866862 ,  0.08902486,  0.04707845, -0.02050781,  0.12174479,
         -0.1319987 , -0.00981649,  0.00183105, -0.13989258,  0.11751302,
          0.08821615, -0.15429688, -0.00587972, -0.05625407, -0.22330729,
          0.15226237,  0.0193278 , -0.10071818, -0.062
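The feature name `google_news_vector_mean` suggests that the per-word vectors for the revision text are averaged element-wise into a single vector. A minimal sketch of that averaging step, assuming plain Python lists and a hypothetical helper `mean_vector` (this is not the revscoring code itself):

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("need at least one vector to average")
    count = len(vectors)
    # zip(*vectors) groups the i-th component of every vector together.
    return [sum(column) / count for column in zip(*vectors)]


# Two made-up word vectors standing in for real embeddings.
word_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
print(mean_vector(word_vectors))  # prints [2.0, 3.0, 4.0]
```

Averaging gives a document of any length the same fixed-size representation, which is what a classifier with a fixed number of input features needs.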

## Running the model

Once the model has been trained, we can use it for high-level analysis: given a document, it predicts which WikiProject topics the document's subject matter overlaps with.

In [1]:
from revscoring import Model
sm = Model.load(open("../models/enwiki.drafttopic.gradient_boosting.model"))
print(sm.info.format())



Model Information:
	 - type: GradientBoosting
	 - version: None
	 - params: {'min_impurity_decrease': 0.0, 'max_leaf_nodes': None, 'labels': ['Culture.Plastic arts', 'Assistance.Files', 'STEM.Geosciences', 'History_And_Society.Military and warfare', 'STEM.Chemistry', 'Culture.Internet culture', 'STEM.Science', 'Assistance.Article improvement and grading', 'Assistance.Contents systems', 'History_And_Society.Politics and government', 'STEM.Medicine', 'Culture.Visual arts', 'STEM.Information science', 'Geography.Bodies of water', 'History_And_Society.Education', 'STEM.Meteorology', 'Culture.Philosophy and religion', 'STEM.Space', 'Culture.Performing arts', 'Geography.Oceania', 'Culture.Language and literature', 'Geography.Europe', 'STEM.Time', 'Assistance.Maintenance', 'Culture.Media', 'Culture.Food and drink', 'Geography.Landforms', 'STEM.Technology', 'History_And_Society.History and society', 'History_And_Society.Transportation', 'Geography.Maps', 'Geography.Cities', 'History_And_Societ

In [6]:
feature_values = list(extractor.extract(604413609, sm.features))
sm.score(feature_values)

{'prediction': ['Culture.Performing arts'],
 'probability': {'Assistance.Article improvement and grading': 2.011002259723534e-06,
  'Assistance.Contents systems': 0.0005100113181515472,
  'Assistance.Files': 9.719637976059903e-05,
  'Assistance.Maintenance': 0.000764378196588805,
  'Culture.Arts': 0.00028011403998532634,
  'Culture.Broadcasting': 0.0009482704118863797,
  'Culture.Crafts and hobbies': 0.0006498123595639325,
  'Culture.Entertainment': 0.0015516597726452023,
  'Culture.Food and drink': 9.58275297133763e-05,
  'Culture.Internet culture': 0.0005362720620836575,
  'Culture.Language and literature': 0.004046640436250856,
  'Culture.Media': 0.0014750708225405053,
  'Culture.Performing arts': 0.999728007557077,
  'Culture.Philosophy and religion': 0.0010963450396127364,
  'Culture.Plastic arts': 0.0012806616011608325,
  'Culture.Sports': 0.0003678988148347935,
  'Culture.Visual arts': 0.0024963296801508612,
  'Geography.Bodies of water': 0.0005059871503130626,
  'Geography.Citi
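The score is a plain dictionary with a `prediction` list and a `probability` mapping, so it can be post-processed with ordinary Python. The sketch below ranks topics by probability, using a handful of the values printed above. The helpers `top_topics` and `labels_above` are hypothetical, and the 0.5 cutoff is an illustrative assumption rather than the threshold the model itself applies.

```python
# A subset of the score shown above, copied from the notebook output.
score = {
    "prediction": ["Culture.Performing arts"],
    "probability": {
        "Assistance.Article improvement and grading": 2.011002259723534e-06,
        "Culture.Entertainment": 0.0015516597726452023,
        "Culture.Language and literature": 0.004046640436250856,
        "Culture.Performing arts": 0.999728007557077,
        "Culture.Visual arts": 0.0024963296801508612,
    },
}


def top_topics(score, n=3):
    """Return the n (label, probability) pairs with the highest probability."""
    ranked = sorted(
        score["probability"].items(), key=lambda item: item[1], reverse=True
    )
    return ranked[:n]


def labels_above(score, threshold=0.5):
    """Return labels whose probability meets an assumed cutoff."""
    return [
        label
        for label, probability in score["probability"].items()
        if probability >= threshold
    ]


for label, probability in top_topics(score):
    print(f"{label}: {probability:.4f}")

print(labels_above(score))  # prints ['Culture.Performing arts']
```

Ranking is useful when you want a shortlist of candidate topics rather than only the labels in `prediction`.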