In [1]:
from standard_libs import *
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from ml_editor.data_processing import (
    format_raw_df, get_split_by_author,
    add_text_features_to_df,
    get_vectorized_series,
    get_feature_vector_and_label
)

from ml_editor.model_evaluation import get_top_k

data_path = Path('../data/writers.csv')
df = pd.read_csv(data_path)
df = format_raw_df(df.copy())

Then we add features and split the dataset.

In [3]:
df = add_text_features_to_df(df.loc[df['is_question']].copy())
train_df, test_df = get_split_by_author(df, test_size=0.2, random_state=42)
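
Splitting by author presumably keeps all of one author's questions on the same side of the split, so the model is not tested on writers it has already seen. The sketch below illustrates that idea on toy data. It is not the implementation of `get_split_by_author`; the helper `split_by_group` and the column name `author_id` are hypothetical.

```python
import numpy as np
import pandas as pd


def split_by_group(df, group_col, test_size=0.2, random_state=42):
    """Hypothetical helper: keep every row of a group on one side of the split."""
    rng = np.random.RandomState(random_state)
    # Copy the unique group ids so they can be shuffled in place
    groups = np.array(df[group_col].unique())
    rng.shuffle(groups)
    # Hold out a fraction of the groups (not of the rows), at least one group
    n_test = max(1, int(round(len(groups) * test_size)))
    in_test = df[group_col].isin(groups[:n_test])
    return df[~in_test].copy(), df[in_test].copy()


# Toy data: six authors, some with several questions
toy = pd.DataFrame({
    "author_id": [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
    "text_len": [120, 340, 90, 1500, 800, 60, 410, 230, 75, 990],
})
toy_train, toy_test = split_by_group(toy, "author_id", test_size=0.2)

# No author appears on both sides of the split
print(set(toy_train["author_id"]) & set(toy_test["author_id"]))
```

Because whole groups are held out, the share of rows in the test set only approximates `test_size`.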

We load the trained model and its vectorizer, then vectorize the text and build the feature matrices.

In [4]:
model_path = Path('../models/model_1.pkl')
clf = joblib.load(model_path)
vectorizer_path = Path('../models/vectorizer_1.pkl')
vectorizer = joblib.load(vectorizer_path)

In [5]:
train_df['vectors'] = get_vectorized_series(train_df['full_text'].copy(), vectorizer)
test_df['vectors'] = get_vectorized_series(test_df['full_text'].copy(), vectorizer)

features = [
    "action_verb_full",
    "question_mark_full",
    "text_len",
    "language_question",
]

X_train, y_train = get_feature_vector_and_label(train_df, features)
X_test, y_test = get_feature_vector_and_label(test_df, features)

Now, we'll use the top-k method to look at:
* The k best-performing examples for each class (high and low scores)
* The k worst-performing examples for each class
* The k most unsure examples, where our model's predicted probability is close to 0.5
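
The three groups above can be sketched with plain pandas. This is a minimal illustration on toy data, not necessarily how `get_top_k` in `ml_editor` works; the function name `top_k_sketch` and its `threshold` parameter are assumptions made for the example.

```python
import pandas as pd


def top_k_sketch(df, proba_col, label_col, k=2, threshold=0.5):
    """Illustrative only: pick best, worst and most unsure examples."""
    predicted = df[proba_col] > threshold
    correct = df[predicted == df[label_col]]
    incorrect = df[predicted != df[label_col]]

    # Best: correct predictions the model is most confident about
    best_pos = correct[correct[label_col]].nlargest(k, proba_col)
    best_neg = correct[~correct[label_col]].nsmallest(k, proba_col)
    # Worst: wrong predictions the model is most confident about
    worst_pos = incorrect[incorrect[label_col]].nsmallest(k, proba_col)
    worst_neg = incorrect[~incorrect[label_col]].nlargest(k, proba_col)
    # Unsure: predicted probability closest to the threshold
    distance = (df[proba_col] - threshold).abs()
    unsure = df.loc[distance.nsmallest(k).index]
    return best_pos, best_neg, worst_pos, worst_neg, unsure


toy = pd.DataFrame({
    "predicted_proba": [0.9, 0.8, 0.6, 0.45, 0.2, 0.1, 0.15, 0.85],
    "true_label": [True, True, False, True, False, False, True, False],
})
best_pos, best_neg, worst_pos, worst_neg, unsure = top_k_sketch(
    toy, "predicted_proba", "true_label", k=1
)
print(worst_pos)  # a true positive that received a low score
print(worst_neg)  # a true negative that received a high score
```

Under this definition, the worst negatives are examples whose true label is `False` but whose predicted probability is high, and the worst positives are the reverse.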

In [6]:
test_analysis_df = test_df.copy()

y_predicted_proba = clf.predict_proba(X_test)

test_analysis_df['predicted_proba'] = y_predicted_proba[:, 1]
test_analysis_df['true_label'] = y_test

to_display = [
    "predicted_proba",
    "true_label",
    "Title",
    "body_text",
    "text_len",
    "action_verb_full",
    "question_mark_full",
    "language_question"
]

threshold = 0.5

top_pos, top_neg, worst_pos, worst_neg, unsure = get_top_k(test_analysis_df, "predicted_proba", "true_label", k=2)

In [21]:
top_pos[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
20034,0.77,True,Can Extensive Outlining Take the Place of the First Draft?,"Background: I've been writing fan fiction for five years now. I began when I didn't have a clue what I was doing, so my methods have evolved during those years. By now I have a solid process that I follow, and I feel I have a good grasp on what I'm doing. There is one small detail that has been bothering me for some time now though: \nOn this site and off, I've heard writers everywhere refer to first, second, third, and sometimes even fourth or fifth drafts. You write your story out as the first draft, wait a while, start over with the second draft, and so on. It's a solid principle that I try to use. I say 'try' because, to this day, I have never written a second draft. Everything I have ever written, every piece of fan fiction, I have written only one draft of. I've done editing for sure, but I've never rewritten the whole thing. The most I've done is maybe rewrite half of a chapter a few times. \nMost people say that your first draft is usually terrible. Some even go so far as to say that it has little use beyond getting your idea down. Most agree that you largely dispose of your first draft and simply start over. (These observations are based on what I've heard.) Here's the problem though: starting even with my very first fan fiction, my ratings have consistently been high. My readers have liked what I wrote. Even to me, my writing hasn't looked like the disorganized mess I think a first draft is supposed to be. \nI think I know what is going on. As I mentioned above, I have a solid process that I follow. That process is more for outlining and development than writing. I go through every aspect of the fiction that I need (character, plot, stakes, etc.) in detail. I work out exactly what I need, how I'll get it, and where it will be. In fact, by the time I get to the plot section, the fiction has already begun to take shape just from all the other parts I know it will need to have. 
By the time I'm done with my process and ready to begin the first draft, my fiction is detailed down to the individual scenes. Not much editing of the outline takes place; I generally leave that up to when I am writing, as I feel is necessary. Sometimes, during writing, I change, delete, or add a few scenes to make it work, and I often have to detail things better than I have in the outline, but for the most part, my outline remains in the same general shape as when I started. The closest I've ever come to writing a second draft is scraping chapter one several times in quick succession until I come up with the right opening. \nI think because my outlining and development is so detailed, it is taking the place of the first and possibly even the second draft. Could this be? I'm hesitant to accept this, because multiple drafts seems like one of those universal things that all writers go through, with very little exception. \nQuestion: Is my detailed outlining and development taking the place of first and possibly second drafts? \n",3015,False,True,False
5400,0.75,True,How can I catch more errors when I proofread?,"I have a problem where I often proof my own writing and I don't catch all the errors while I am reading through it. I often miss entire words out of sentences or find myself repeating words. I can read a document several times and I catch new errors every time. Eventually, I'll feel like I've caught everything, but I find out after I've posted or printed it that I left out some word. The whole process takes hours instead of a few minutes. This process is so frustrating that sometimes I just give up. Does anyone have this experience writing and if so, what techniques have you developed that help?\nP.S:\nFor some reason, I make fewer errors and my writing is a lot speedier if I write it out long hand first. For some reason, the word processor makes it hard to keep your train of thought going because you find yourself derailed by the formatting. I also found using NotePad to be a useful tool. Since it doesn't have formatting, it is less distracting. I also set the width of the Window to be very short. For some reason, my thoughts are less likely to get derailed and I make fewer errors.\nEdit: I haven't picked an answer because all of these responses are great! I also want to keep the suggestions coming so that others will benefit. Thanks a lot. \n",1311,True,True,False


In [7]:
top_neg[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
16095,0.12,False,Citation anniversary editions,"How do I do MLA citations for an anniversary edition, e.g. 30th anniversary Selfish Gene. Do I treat it as a normal edition or just ignore that it is a new edition?\n",195,False,True,False
16799,0.13,False,Capitalization of Open form Compound Words in Titles,"What would be considered proper capitalization of open form compound words in titles? Should the second part of the compound word be capitalized? Why?\nFor example, the capitalization for which title would be correct?\n\nCash flow Analysis Report\n\n--OR--\n\nCash Flow Analysis Report\n\nThanks!\n",343,True,True,True


It seems most of the correct negative predictions are **short**. This result reinforces the feature importance analysis, which showed question length as one of the most important features.
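
One way to probe this impression is to compare question length across predicted classes. The sketch below uses only the six distinct rows displayed in this notebook, so it is an illustration of the check and not evidence; running the same `groupby` on the full `test_analysis_df` would be more informative. The column `predicted_label` and the 0.5 cutoff are assumptions made for the example.

```python
import pandas as pd

# Values copied from the rows displayed in this notebook
toy = pd.DataFrame({
    "predicted_proba": [0.77, 0.75, 0.12, 0.13, 0.08, 0.11],
    "true_label": [True, True, False, False, True, True],
    "text_len": [3015, 1311, 195, 343, 142, 591],
})

# Hypothetical column: predicted class at a 0.5 cutoff
toy["predicted_label"] = toy["predicted_proba"] > 0.5

# Median question length for each predicted class
median_len = toy.groupby("predicted_label")["text_len"].median()
print(median_len)
```

On these hand-picked rows, the questions predicted as negative are much shorter than those predicted as positive, including the two positives that the model scored low.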

In [8]:
worst_pos[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
35367,0.08,True,How to relate stormy weather to sadness?,Is it possible that stormy weather can be related to sadness? The weather could be related to gloom?\n,142,True,True,False
17543,0.11,True,Using Pronoun 'It' repetitvely for emphasis?,"I'd like to know if using ""It"" repetitively (for emphasis) in this context is okay grammatically.\n\nTV has become the modern day baby sitter. It is raising our children. It is dictating the cultural narrative and shaping future society. It is raising the bored inattentive child. It is raising the consumer child. It is raising the aggressive child. It is raising the obese child. It is raising the misinformed and complacent child. It is raising the disenchanted child. And what’s more, it is doing all this with our smiling acquiescence. \n\n",591,False,True,False


In [10]:
worst_neg[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
16095,0.12,False,Citation anniversary editions,"How do I do MLA citations for an anniversary edition, e.g. 30th anniversary Selfish Gene. Do I treat it as a normal edition or just ignore that it is a new edition?\n",195,False,True,False
16799,0.13,False,Capitalization of Open form Compound Words in Titles,"What would be considered proper capitalization of open form compound words in titles? Should the second part of the compound word be capitalized? Why?\nFor example, the capitalization for which title would be correct?\n\nCash flow Analysis Report\n\n--OR--\n\nCash Flow Analysis Report\n\nThanks!\n",343,True,True,True


In [11]:
unsure[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
19046,0.5,False,Should I be a Novel Writer or a Screenwriter?,"Should I be a Novel Writer or a Screenwriter? \nThis question is intended for the beginning writer, who is unsure if he should start with writing physical books, or with writing scripts for movie directors/producers. \nObviously this question will inspire a lot of debate, which I know is not something we want on this site. So instead, I was wondering if someone could provide me with a comprehensible list of the pros and cons of fictional novel writing versus screenwriting. \n",523,True,True,False
37010,0.5,True,Death as person - A funny part of the story? Or serious stuff?,"Death as a person is commonly known to any reader of the ""Discworld"" series from Terry Pratchett. Also death appears in the series ""Supernatural"" as one of the apocalyptic riders. Another approach in this case is not known to me. \nThe thought of death as a ""person"" is something that follows me through my whole writing life. But there is always the one question: Is death as a person only good for a small funny part of the story (something like ""taking out the pressure of the scene"") or can death as a person be a serious part of the story?\nThis is the question to you.\n",636,True,True,False
