# Inspect feature importance
A good way to diagnose a model's performance is to examine which features it uses the most to make predictions, and which features don't seem to help at all. This is called feature importance analysis.

To do so, we first load the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import joblib

import sys
sys.path.append("..")
import warnings
warnings.filterwarnings('ignore')

In [2]:
from ml_editor.ch4_data_processing import (
    format_raw_df,
    get_split_by_author,
    add_text_features_to_df,
    get_vectorized_series,
    get_feature_vector_and_label,
)

In [3]:
from ml_editor.ch5_model_evaluation import get_feature_importance

In [4]:
data_path = Path('../raw_data/writers.csv')
df = pd.read_csv(data_path)
df = format_raw_df(df.copy())

Then, we add features and split the data.

In [5]:
df = add_text_features_to_df(df.loc[df['is_question']].copy())
train_df, test_df = get_split_by_author(df, test_size=0.2, random_state=40)

We load the pretrained model and vectorizer.

In [6]:
model_path = Path('../models/model_1.pkl')
clf = joblib.load(model_path)
vectorizer_path = Path("../models/vectorizer_1.pkl")
vectorizer = joblib.load(vectorizer_path)

We vectorize the text and get the features ready for our model.

In [7]:
train_df['vectors'] = get_vectorized_series(train_df['full_text'].copy(),
                                            vectorizer)
test_df['vectors'] = get_vectorized_series(test_df['full_text'].copy(),
                                           vectorizer)

features = [
    'action_verb_full', 'question_mark_full', 'text_len', 'language_question'
]

X_train, y_train = get_feature_vector_and_label(train_df, features)
X_test, y_test = get_feature_vector_and_label(test_df, features)

Now, we can use scikit-learn's API to display the most and least important features.
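The ranking itself is simple: pair each feature name with its importance score and sort in descending order. The sketch below illustrates this with a hypothetical helper, `rank_importances`, which is not the `ml_editor` implementation. It assumes the model exposes one score per feature, as tree ensembles in scikit-learn do through `feature_importances_`. The toy names and values are made up for illustration.

```python
def rank_importances(importances, feature_names):
    """Pair each feature name with its importance, highest first.

    Hypothetical helper for illustration; not the ml_editor implementation.
    """
    if len(importances) != len(feature_names):
        raise ValueError("expected one importance per feature name")
    pairs = zip(feature_names, importances)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Toy values for illustration only, not taken from the trained model.
toy_names = ["what", "brows", "text_len"]
toy_importances = [0.3, 0.0, 0.5]

ranked = rank_importances(toy_importances, toy_names)
print(ranked[:2])   # the two highest-importance pairs
print(ranked[-1:])  # the lowest-importance pair
```

With a fitted model that has a `feature_importances_` attribute, the equivalent call would be `rank_importances(clf.feature_importances_, all_feature_names)`. Slicing the result with `[:k]` and `[-k:]` gives the top and bottom `k` features.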

In [8]:
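# get_feature_names() was removed in scikit-learn 1.2; use it only on older versions.
get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
w_indices = list(get_names())
w_indices.extend(features)
all_feature_names = np.array(w_indices)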

In [9]:
k = 10
print('Top %s importances: \n ' % k)
print('\n'.join([
    '%s: %.2g' % (tup[0], tup[1])
    for tup in get_feature_importance(clf, all_feature_names)[:k]
]))

Top 10 importances: 
 
text_len: 0.0092
what: 0.0049
are: 0.0049
writing: 0.0043
story: 0.0041
can: 0.004
am: 0.0038
do: 0.0038
not: 0.0038
as: 0.0037


In [10]:
print('\nBottom %s importances: \n ' % k)
print('\n'.join([
    '%s: %.2g' % (tup[0], tup[1])
    for tup in get_feature_importance(clf, all_feature_names)[-k:]
]))


Bottom 10 importances: 
 
whos: 0
communications: 0
owned: 0
slick: 0
pacific: 0
funding: 0
fundamentally: 0
functionally: 0
succinctly: 0
brows: 0


We can see that text length is the most important feature. The next most important features are common English words such as "what", "are", and "can", while the least important features are rare words with an importance of 0. To use this model to provide useful writing suggestions, we should work on generating feature values that would be **easier** to turn into actionable writing advice.
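To see how the hand-crafted features compare with the vocabulary words, one option is to look up where they fall in the ranked list. The sketch below assumes a list of `(name, importance)` pairs sorted from most to least important, which is how the output of `get_feature_importance` is used above. The helper `positions_of` is hypothetical and the values are toy data.

```python
def positions_of(ranked, names_of_interest):
    """Return {name: (rank, importance)}, where rank 1 is the most important.

    Hypothetical helper for illustration only.
    """
    wanted = set(names_of_interest)
    return {
        name: (rank, importance)
        for rank, (name, importance) in enumerate(ranked, start=1)
        if name in wanted
    }


# Toy ranking for illustration only, not the trained model's output.
toy_ranked = [
    ("text_len", 0.5),
    ("what", 0.3),
    ("question_mark_full", 0.1),
    ("brows", 0.0),
]
handcrafted = ["text_len", "question_mark_full"]

print(positions_of(toy_ranked, handcrafted))
```

In this notebook, the equivalent call would be `positions_of(get_feature_importance(clf, all_feature_names), features)`. It would show whether features such as `action_verb_full` rank near the top or are buried among the vocabulary words.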