# The top-k approach
The top-k method is a simple way to inspect a model's results: look at the **most and least successful examples** and identify patterns within them. These patterns can then be used to engineer new features or to iterate on existing ones.

First, we load the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import joblib

import sys
sys.path.append("..")
import warnings
warnings.filterwarnings('ignore')

In [2]:
from ml_editor.ch4_data_processing import (
    format_raw_df,
    get_split_by_author,
    add_text_features_to_df,
    get_vectorized_series,
    get_feature_vector_and_label,
)

In [3]:
from ml_editor.ch5_model_evaluation import get_top_k

Then, we add features and split the dataset.

In [5]:
data_path = Path('../raw_data/writers.csv')
df = pd.read_csv(data_path)
df = format_raw_df(df.copy())

In [6]:
df = add_text_features_to_df(df.loc[df['is_question']].copy())
train_df, test_df = get_split_by_author(df, test_size=0.2, random_state=40)

We load the trained model, and vectorize the features.

In [7]:
model_path = Path('../models/model_1.pkl')
clf = joblib.load(model_path)
vectorizer_path = Path("../models/vectorizer_1.pkl")
vectorizer = joblib.load(vectorizer_path)

In [8]:
train_df['vectors'] = get_vectorized_series(train_df['full_text'].copy(),
                                            vectorizer)
test_df['vectors'] = get_vectorized_series(test_df['full_text'].copy(),
                                           vectorizer)

features = [
    'action_verb_full', 'question_mark_full', 'text_len', 'language_question'
]

X_train, y_train = get_feature_vector_and_label(train_df, features)
X_test, y_test = get_feature_vector_and_label(test_df, features)

Now, we'll use the top-k method to look at:

- The k best performing examples for each class (high and low scores)
- The k worst performing examples for each class
- The k most unsure examples, where our model's predicted probability is close to 0.5

To read more about how plotting these particular examples can help with model iteration, please refer to Chapter 5 of the book.
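
The groups listed above can be sketched with plain pandas. This is only an illustration of the idea, not the implementation of `get_top_k` in `ml_editor`, which may differ: the function name `top_k_sketch`, its `threshold` parameter, and the toy data are all assumptions made for this example.

```python
import pandas as pd


def top_k_sketch(df, proba_col, label_col, k=2, threshold=0.5):
    """Hypothetical stand-in for get_top_k; the real helper may differ."""
    predicted_pos = df[proba_col] > threshold
    actual_pos = df[label_col].astype(bool)

    correct_pos = df[predicted_pos & actual_pos]
    correct_neg = df[~predicted_pos & ~actual_pos]
    false_neg = df[~predicted_pos & actual_pos]  # positives scored low
    false_pos = df[predicted_pos & ~actual_pos]  # negatives scored high

    # Most confident correct predictions for each class.
    top_pos = correct_pos.nlargest(k, proba_col)
    top_neg = correct_neg.nsmallest(k, proba_col)
    # Most confident mistakes for each true class.
    worst_pos = false_neg.nsmallest(k, proba_col)
    worst_neg = false_pos.nlargest(k, proba_col)

    # Unsure: smallest distance between the score and the threshold.
    distance = (df[proba_col] - threshold).abs()
    unsure = df.loc[distance.nsmallest(k).index]
    return top_pos, top_neg, worst_pos, worst_neg, unsure


# Toy data (made up for illustration only).
toy = pd.DataFrame({
    "predicted_proba": [0.9, 0.8, 0.1, 0.2, 0.15, 0.7, 0.52, 0.49],
    "true_label": [True, True, False, False, True, False, True, False],
})
top_pos, top_neg, worst_pos, worst_neg, unsure = top_k_sketch(
    toy, "predicted_proba", "true_label", k=1
)
```

In this sketch, `worst_pos` holds examples whose true label is positive but which received a low score, and `worst_neg` holds examples whose true label is negative but which received a high score. That naming matches the outputs displayed in this notebook.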

In [9]:
test_analysis_df = test_df.copy()
y_predicted_proba = clf.predict_proba(X_test)
test_analysis_df['predicted_proba'] = y_predicted_proba[:, 1]
test_analysis_df['true_label'] = y_test

In [11]:
test_analysis_df.head(2)

Id,Unnamed: 0,AcceptedAnswerId,AnswerCount,Body,ClosedDate,CommentCount,CommunityOwnedDate,ContentLicense,CreationDate,FavoriteCount,...,Score_question,AcceptedAnswerId_question,full_text,action_verb_full,language_question,question_mark_full,text_len,vectors,predicted_proba,true_label
5,3,,8,<p>I want my short story to have a specific po...,,1,,CC BY-SA 4.0,2010-11-18T20:43:59.693,4.0,...,,,Decide on a theme/overarching meaning before w...,True,False,True,878,"(0, 8983)\t0.0347055340123551\n (0, 8959)\t...",0.55,True
18,12,32.0,11,<p>I write a daily piece and have been doing s...,,6,,CC BY-SA 2.5,2010-11-18T20:55:11.123,14.0,...,,,Self Editing tips/tricks I write a daily piece...,False,False,True,431,"(0, 9001)\t0.10301466893680125\n (0, 8983)\...",0.51,True


In [13]:
to_display = [
    'predicted_proba',
    'true_label',
    'Title',
    'body_text',
    'text_len',
    'action_verb_full',
    'question_mark_full',
    'language_question',
]
threshold = 0.5

top_pos, top_neg, worst_pos, worst_neg, unsure = get_top_k(test_analysis_df,
                                                           'predicted_proba',
                                                           'true_label',
                                                           k=2)
pd.options.display.max_colwidth = 500

Most confident correct positive predictions

In [14]:
top_pos[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
32980,0.74,True,How to communicate character desire?,"As mentioned elsewhere, it has been a stumbling block for my readers to understand what drives my characters. I had thought I had communicated character desires through showing, and action, but it is not seeming to translate to the reader. \nExample: One character wishes to follow the footsteps of her mother, who has passed away. This desire is to honor her mother, and she comes from this sort of culture. She makes choices towards following her mother's footsteps throughout the first half of...",2924,True,True,False
529,0.73,True,How to make travel scenes interesting without adding needless plot diversions?,"I have always had a problem with travel in my stories. Since I'm writing an epic fantasy novel, travel is a big theme as characters often have to move from where they are to where the plot dictates.\nHowever, one of the difficulties I have is that the travel itself is often not important to the plot. In the novel I'm reading now (Wizard's First Rule by Terry Goodkind), there is a huge amount of travel, and the author adds needless encounters with various magical beasts just to keep tension...",1391,True,True,False


Most confident correct negative predictions

In [15]:
top_neg[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
3509,0.08,False,Bibliography entry for a paper presented to a workshop,"I'm using bibtex to create a bibliography with xelatex.\nI have to cite a paper presented to a workshop, but not published on a book or proceedings.\nWhat @misc fields can I use?\n",232,True,True,False
7488,0.09,False,Releases needed for picture books?,Do you need location releases for national parks and model releases for Pets to use in picture books?\n,137,False,True,False


Most of the correct negative predictions appear to be **short**. This is consistent with the feature importance analysis, which showed question length to be one of the most important features.
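
One way to check a length pattern like this beyond eyeballing a few rows is to compare the median `text_len` across prediction outcomes. The sketch below uses only the eight confident rows displayed in this notebook (two per outcome), so it illustrates the computation rather than providing evidence. The `outcome` column and its labels are names introduced for this example.

```python
import numpy as np
import pandas as pd

# Scores, labels and lengths copied from the rows shown in this notebook.
sample = pd.DataFrame({
    "predicted_proba": [0.74, 0.73, 0.08, 0.09, 0.10, 0.11, 0.78, 0.74],
    "true_label": [True, True, False, False, True, True, False, False],
    "text_len": [2924, 1391, 232, 137, 262, 298, 694, 1862],
})

threshold = 0.5
predicted = sample["predicted_proba"] > threshold
actual = sample["true_label"]

# Label each row with its outcome (hypothetical labels for this sketch).
sample["outcome"] = np.select(
    [predicted & actual, ~predicted & ~actual, ~predicted & actual],
    ["true_pos", "true_neg", "false_neg"],
    default="false_pos",
)

# Median question length for each outcome.
median_len = sample.groupby("outcome")["text_len"].median()
print(median_len)
```

Running the same grouping on the full `test_analysis_df` would give a more meaningful comparison, since the rows above were selected precisely because they are extreme.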

Let's look at the most confident incorrect negative predictions, i.e. questions whose true label is positive but which the model scored low.

In [16]:
worst_pos[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
2250,0.1,True,Put text between section and subsection headings?,Are there rules about whether or not to put text between section and subsection headings (in scholarly works)?\n\n1 section heading\ntext / no text here?\n1.1 subsection heading\nmore text\n1.2 subsection heading\n...\n\n,262,False,True,False
42882,0.11,True,Capitlization of A Named Experiment,"I have an experiment which we call 'the krypton experiment'. In referring to the krypton experiment, should it be capitalized?\ne.g.\nThe Krypton Experiment was used as a source of benchmark data.\nor\nThe krypton experiment was used as a source of benchmark data.\n",298,True,True,True


On the flip side, short questions with high scores are overrepresented among the examples our model got wrong.

Next, let's look at the most confident incorrect positive predictions, i.e. questions whose true label is negative but which the model scored high.

In [17]:
worst_neg[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
7878,0.78,False,"When quoting a person's informal speech, how much liberty do you have to make changes to what they say?","Even during a formal interview for a news article, people speak informally. They say ""uhm"", they cut off sentences half-way through, they interject phrases like ""you know?"", and they make innocent grammatical mistakes.\nAs somebody who wants to fairly and accurately report the discussion that takes place in an interview, what guidelines should I use in making changes to what a person says?\nWhile the simplest solution is to write exactly what they say and [sic] any errors they make, that can...",694,True,True,False
38928,0.74,False,How to use professional jargon when writing fiction?,"The military, the medical professions, police, etc. - they have their professional jargon. One noteworthy characteristic of this jargon is the extensive use of abbreviations. Those abbreviations are associated with ""being a professional"" to such an extent that tv shows often use them as shorthand for marking out the professionals.\nAs far as the general picture goes, soldiers at least do indeed use a lot of abbreviations (personal experience here). So to that extent, the media got it right.\...",1862,True,True,False


Finally, let's look at the most "unsure" questions, the ones where the model's predicted probability is closest to equal across all classes (0.5 in our case, since we have two classes).

In [18]:
unsure[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
8778,0.5,False,Does my poem convey the character of the (fictional) author well?,"The core character in my current work-in-progress is an immortal goddess (of the minor kind), who goes increasingly desperate. In her desperation she's about to do something quite terrible, and the protagonist's task is to stop her. He knows her final intent, but not the motives - why she wants to do this. About the most significant thread of the story is discovery: learning about her motives, worries and desires, and acting upon them, influencing others in ways that rekindle her hope.\nAt o...",3306,True,True,False
31561,0.5,True,How to avoid the villain being a caricature,"I am on draft 4 of my story now, and many things are hanging together well. As a result, lesser items are coming into sharper focus. I need to revise for those next.\nMy villain needs work. He is too much of a caricature. I looked on this site and found this question which gives me some ideas to improve my villain. Still, I feel he needs more work than those answers provide (make him human, consistent, the hero in his frame of reference). My immediate goal is for him to be frightening, sinis...",2828,True,True,False


To find new candidate features, I recommend combining the top-k method with feature importance and vectorization.