## Inspect Feature Importance
A good way to diagnose a model's performance is to examine which features it relies on most to make predictions, and which features contribute little or nothing.

In [1]:
from pathlib import Path

import joblib
import numpy as np
import pandas as pd

In [2]:
%load_ext autoreload
%autoreload 2

In [4]:
from ml_editor.data_processing import (
    format_raw_df,
    get_split_by_author,
    add_text_features_to_df,
    get_vectorized_series,
    get_feature_vector_and_label
)

In [5]:
def get_feature_importance(clf, feature_names):
    importances = clf.feature_importances_
    indices_sorted_by_importance = np.argsort(importances)[::-1]
    return list(
        zip(
            feature_names[indices_sorted_by_importance],
            importances[indices_sorted_by_importance],
        )
    )
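
To see what this helper returns, here is a minimal sketch on toy data. The random forest, the synthetic data, and the names `toy_clf`, `toy_names`, and `ranked` are assumptions made for illustration; they are not the model or features used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def get_feature_importance(clf, feature_names):
    # Same logic as the notebook cell: sort features by descending importance.
    importances = clf.feature_importances_
    indices_sorted_by_importance = np.argsort(importances)[::-1]
    return list(
        zip(
            feature_names[indices_sorted_by_importance],
            importances[indices_sorted_by_importance],
        )
    )


# Toy data (hypothetical): only the first column determines the label.
rng = np.random.RandomState(0)
X_toy = rng.rand(500, 3)
y_toy = (X_toy[:, 0] > 0.5).astype(int)

toy_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)
toy_names = np.array(["signal", "noise_a", "noise_b"])

ranked = get_feature_importance(toy_clf, toy_names)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Note that `feature_names` has to be a NumPy array rather than a plain list, because the helper indexes it with an array of sorted positions.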

In [6]:
data_path = Path('../data/writers.csv')
df = pd.read_csv(data_path)
df = format_raw_df(df.copy())

In [7]:
df = add_text_features_to_df(df.loc[df["is_question"]].copy())
train_df, test_df = get_split_by_author(df, test_size=0.2, random_state=40)
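
Splitting by author keeps every question from a given author on one side of the split, so the model is not evaluated on authors it has already seen. The implementation of `get_split_by_author` is not shown here; the sketch below is one way such a split could be written with scikit-learn's `GroupShuffleSplit`. The helper `split_by_group` and the `author_id` column are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_group(df, group_col, test_size=0.2, random_state=40):
    """Hypothetical helper: split so that no group appears in both sets."""
    splitter = GroupShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    return df.iloc[train_idx], df.iloc[test_idx]


# Toy data (hypothetical): five authors with two questions each.
toy_df = pd.DataFrame(
    {
        "author_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        "text": [f"question {i}" for i in range(10)],
    }
)
toy_train, toy_test = split_by_group(toy_df, "author_id")

# No author should appear in both the training and the test set.
shared_authors = set(toy_train["author_id"]) & set(toy_test["author_id"])
print(len(toy_train), len(toy_test), shared_authors)
```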

In [8]:
model_path = Path('../models/model_1.pkl')
clf = joblib.load(model_path)
vectorizer_path = Path('../models/vectorizer_1.pkl')
vectorizer = joblib.load(vectorizer_path)

In [9]:
train_df["vectors"] = get_vectorized_series(train_df["full_text"].copy(), vectorizer)
test_df["vectors"] = get_vectorized_series(test_df["full_text"].copy(), vectorizer)

features = [
    "action_verb_full",
    "question_mark_full",
    "text_len",
    "language_question",
]
X_train, y_train = get_feature_vector_and_label(train_df, features)
X_test, y_test = get_feature_vector_and_label(test_df, features)

Now we can use the `feature_importances_` attribute that scikit-learn exposes on the fitted classifier to display the most and least important features.

In [10]:
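# Names are ordered as vocabulary terms first, then the hand-crafted features,
# which must match the column order of X_train.
# get_feature_names() was removed in scikit-learn 1.2; on recent versions,
# use list(vectorizer.get_feature_names_out()) instead.
w_indices = vectorizer.get_feature_names()
w_indices.extend(features)
all_feature_names = np.array(w_indices)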
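
The importances are only interpretable if the array of names follows the same order as the columns of the feature matrix. The sketch below illustrates that alignment on a toy corpus. It assumes a layout with vocabulary columns first and hand-crafted columns last; the actual layout produced by `get_feature_vector_and_label` is not shown here, and all `toy_*` names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical), standing in for the "full_text" column.
toy_texts = ["how do I write a story", "what is a good plot", "can I write faster"]

toy_vectorizer = TfidfVectorizer().fit(toy_texts)
text_matrix = toy_vectorizer.transform(toy_texts)  # one column per vocabulary term

# One hand-crafted feature: the text length in characters.
text_len = np.array([[len(text)] for text in toy_texts])

# Assumed layout: vocabulary columns first, hand-crafted columns last.
toy_X = hstack([text_matrix, csr_matrix(text_len)]).tocsr()

# vocabulary_ maps each term to its column index, so sorting terms by that
# index recovers the names in column order on any scikit-learn version.
vocab_names = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_feature_names = np.array(vocab_names + ["text_len"])

# Every column needs exactly one name.
assert toy_X.shape[1] == len(toy_feature_names)
print(toy_feature_names[-1], toy_X[:, -1].toarray().ravel())
```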

In [13]:
k = 10
pd.DataFrame(
    get_feature_importance(clf, all_feature_names),
    columns=["Feature Name", "Importance"],
).iloc[:k, :]

| | Feature Name | Importance |
|---|---|---|
| 0 | text_len | 0.008286 |
| 1 | are | 0.005689 |
| 2 | what | 0.004942 |
| 3 | ve | 0.004694 |
| 4 | writing | 0.004478 |
| 5 | can | 0.004441 |
| 6 | do | 0.004104 |
| 7 | story | 0.004055 |
| 8 | don | 0.00396 |
| 9 | not | 0.003678 |


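The least important features:

In [15]:
# Note: the slice -k:-1 stops before the final row, so k - 1 rows are shown.
# Use .iloc[-k:, :] to include all k rows.
pd.DataFrame(
    get_feature_importance(clf, all_feature_names),
    columns=["Feature Name", "Importance"],
).iloc[-k:-1, :]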

| | Feature Name | Importance |
|---|---|---|
| 7551 | goodreads | 0.0 |
| 7552 | brilliance | 0.0 |
| 7553 | goods | 0.0 |
| 7554 | temperature | 0.0 |
| 7555 | brevity | 0.0 |
| 7556 | breeze | 0.0 |
| 7557 | temple | 0.0 |
| 7558 | plate | 0.0 |
| 7559 | bread | 0.0 |


Text length is the most important feature, and the next most important features are common English words such as "are", "what", and "can". Neither translates directly into a recommendation for the writer. To use this model to provide useful writing suggestions, we should work on generating features that would be **easier** to turn into actionable writing advice.
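
One way to quantify this observation is to measure how much of the total importance falls on the hand-crafted features as opposed to the vocabulary. The sketch below is a minimal version of that check. The helper `importance_share` is hypothetical, and the toy values are made up for illustration rather than taken from the model above.

```python
def importance_share(ranked_features, selected_names):
    """Hypothetical helper: fraction of total importance held by selected_names."""
    selected = set(selected_names)
    total = sum(importance for _, importance in ranked_features)
    if total == 0:
        return 0.0
    picked = sum(
        importance for name, importance in ranked_features if name in selected
    )
    return picked / total


# Hypothetical (name, importance) pairs, not taken from the model above.
toy_ranked = [("text_len", 0.5), ("are", 0.3), ("what", 0.2)]
share = importance_share(toy_ranked, ["text_len", "question_mark_full"])
print(f"{share:.2f}")
```

In this notebook, the equivalent call would be `importance_share(get_feature_importance(clf, all_feature_names), features)`.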