# Training the Final Model and Determining Important Features

Now that the final model has been trained with optimal hyperparameters, we want to understand which features are important so that we can provide actionable insights to users of the tool. To do this, we'll load the final model and examine which unigrams/bigrams and meta-features are indicative of receiving relief.

The first step is to load the data from PostgreSQL, standardize the meta-features, load the n-grams constructed previously, and combine the two into the full feature matrix used by the model.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import nltk
from scipy import sparse
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import metrics
from sklearn.linear_model import SGDRegressor, SGDClassifier
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import (learning_curve, StratifiedShuffleSplit, cross_val_score, ShuffleSplit,
                                     cross_val_predict, GridSearchCV)
import seaborn as sns

%matplotlib inline
sns.set(context='notebook', style='darkgrid')
sns.set(font_scale=1.4)

In [None]:
# Set Postgres credentials/read in complaint database
db_name = 'complaint1'
username = 'postgres'
host = 'localhost'
port = '5432'
password = ''  # set your Postgres password here

con = psycopg2.connect(database=db_name,
    host=host,
    port=port,
    user=username,
    password=password)

sql_query = """
SELECT * FROM complaint1;
"""
complaints_df = pd.read_sql_query(sql_query,con)

In [None]:
#the narratives might have missing values after pre-processing, so we'll remove any that are empty now
complaints_df=complaints_df.dropna(subset = ['narrative'])
complaints_df.shape

In [None]:
meta_feat=['sentiment','ADJ','ADP','ADV','CCONJ','DET','INTJ','NOUN','NUM','PART','PRON',
          'PROPN','PUNCT','SPACE','SYM','VERB','X','avg_words_sent','num_sent','num_word']

#select these features from the full data set
X = complaints_df[meta_feat]

# Remove all rows with no data
X_cleaned = X[~X.isnull().all(axis=1)]

# Fill remaining missing values with zero
X_cleaned = X_cleaned.fillna(0)

# Standardize the meta features
scaler = StandardScaler()
X_std = scaler.fit_transform(X_cleaned)

X_ngrams = sparse.load_npz("ngrams.npz")

X_std_sparse = sparse.csr_matrix(X_std)
X_full = sparse.hstack([X_std_sparse, X_ngrams])
X_full.shape
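
For readers unfamiliar with `StandardScaler`: it rescales each column to zero mean and unit variance, using the population standard deviation. Below is a minimal pure-Python sketch of the same computation for a single column. The column name `num_word_toy` and its values are made up for illustration and do not come from the complaint data.

```python
import statistics

# Hypothetical word counts for four complaints (illustrative values only)
num_word_toy = [100.0, 150.0, 200.0, 250.0]

def standardize(column):
    """Rescale a column to zero mean and unit variance (population std)."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    # Constant columns have zero variance; map them to zeros instead of dividing by 0
    if std == 0:
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]

z_scores = standardize(num_word_toy)
print(z_scores)
```

Note that `sparse.hstack` requires both blocks to have the same number of rows, so any row dropped from the meta features must also be absent from `ngrams.npz`.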

Now, we need to load the classifier trained on the full data set. 

In [None]:
trained_class = joblib.load('trained_classifier.pkl')

# Identifying the Most Important Features

The first thing we will consider is how the whole feature set (meta features and n-grams) is predictive of receiving relief.

In [None]:
#load in the pickle of the vectorizer for the n-grams
vectorizer = joblib.load('tfidf_unibi_250.pkl')

In [None]:
# Combine meta feature labels with n-gram labels
# get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0
all_features = meta_feat + list(vectorizer.get_feature_names_out())
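
The ranking that follows pairs each coefficient with a name purely by position, so the name list has to be in the same order as the columns of `X_full` (meta features first, then n-grams). Here is a small sketch of that pairing, with hypothetical names and coefficients standing in for the real ones:

```python
# Hypothetical names and coefficients, for illustration only
meta_names = ['sentiment', 'ADJ']
ngram_names = ['credit report', 'fee', 'refund']
coefficients = [0.40, 0.10, 0.25, 0.30, -0.20]

# Same ordering as the stacked matrix: meta features first, then n-grams
all_names = meta_names + ngram_names

# A length mismatch would mean names and coefficients are misaligned
assert len(all_names) == len(coefficients)

# Pair names with coefficients and sort from largest to smallest
ranked = sorted(zip(all_names, coefficients), key=lambda pair: pair[1], reverse=True)
top_three = ranked[:3]
print(top_three)
```

Running the same length check against `trained_class.coef_` before plotting is a cheap way to catch a vectorizer or feature list that does not match the fitted model.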

In [None]:
# Pair each coefficient with its feature name, keep the 19 largest, and
# reverse them so the largest is drawn at the top of the bar plot
feature_ranks = pd.Series(
    trained_class.coef_.T.ravel(),
    index=all_features
).sort_values(ascending=False)[:19][::-1]

# Display a bar graph of the top features
graph = feature_ranks.plot(
    kind='barh',
    legend=False,
    figsize=(4, 8),
    color='#666666'
);

It looks like some of the key words are related to credit bureaus, fees people are charged, and fraudulent charges. Let's now take a look at the meta-features to figure out which of those are predictive of receiving relief.

In [None]:
# Add the corresponding meta feature names to the parameters, sorted from
# highest to lowest
meta_feature_ranks = pd.Series(
    trained_class.coef_.T.ravel()[:len(meta_feat)],
    index=meta_feat
).sort_values(ascending=False)[::-1]

# Display a bar plot of the meta feature importance
graph2 = meta_feature_ranks.plot(
    kind='barh',
    legend=False,
    figsize=(5, 8),
    color='#666666'
)

It looks like complaints that are positive in tone and clearly written are more likely to obtain relief. The use of adjectives suggests that the complaint should be descriptive, and punctuation likely indicates that writers should be concise (the number of sentences is negatively related to receiving relief). Determiners likely help make the writing clearer.

In contrast, writing a very long complaint is probably not a good idea: the number of sentences and the use of coordinating conjunctions are both negatively related to receiving relief. Likewise, interjections and particles (e.g., the possessive 's) are associated with not receiving relief.

In [None]:
#define the predictive features of receiving relief 
pred_feat=['sentiment','SYM','DET','PUNCT','ADJ','VERB']
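
The list above is written out by hand. As a cross-check, one could list the meta features whose coefficients are positive. The sketch below does this for hypothetical coefficient values; they are not the fitted model's, and this is not necessarily how the list above was chosen.

```python
# Hypothetical coefficients; the real values come from trained_class.coef_
toy_coefs = {'sentiment': 0.31, 'ADJ': 0.05, 'CCONJ': -0.12,
             'num_word': -0.40, 'DET': 0.08}

# Keep features with a positive weight, strongest first
positive = sorted(
    (name for name, weight in toy_coefs.items() if weight > 0),
    key=lambda name: toy_coefs[name],
    reverse=True,
)
print(positive)
```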

# Making Recommendations/Actionable Insights

So that users can modify their narrative complaints to make them more likely to receive relief, we will compare the user's submitted narrative with those that were successful. To do so, we first need to identify the narratives in our training set that were successful. Then we will pre-process this data (clean and standardize the meta-features) so that the two can be compared.

In [None]:
pd.options.display.max_columns = None
complaints_df.head()

In [None]:
#subset only successful complaints
success=complaints_df[complaints_df.response=='relief']
success.shape

In [None]:
#just subset the meta-features to provide recommendations
success_meta=success[meta_feat]
#get rid of rows that don't contain any information
success_cleaned = success_meta[~success_meta.isnull().all(axis=1)]
#fill in missing values
success_cleaned = success_cleaned.fillna(0)

In [None]:
#compute the average of each of the meta features for successful complaints 
avg_meta_success = success_cleaned.mean()

#standardize the meta features
success_meta_std = pd.Series(scaler.transform([avg_meta_success]).ravel(), index=meta_feat)

#save these results so they can be easily called on for webapp
joblib.dump(success_meta_std, 'success_meta_vector.pkl')

# Creating Recommendations From a Narrative Complaint

We now need to compare the features in a complaint to complaints that were successful. To do that, we'll use a tester complaint that I made up.

In [None]:
import feat_eng
tester=("I am really disappointed by Wells Fargo. I opened up a savings account with them a few years ago because they had a good rate; however, I suddenly realized they had opened many new accounts under my name. I found out about this problem when I pulled my credit report. I'm not sure who, but someone had opened dozens of accounts at the bank in my name. There were several checking accounts and savings accounts that were charged tons of fees, totalling around $300. When I called, I was told that these accounts were all mistakes and my money would be refunded, but unfortunately, it has not been refunded. I'm writing to get the money that is owed to me refunded.")

In [None]:
# Calculate the meta features for the tester complaint using the functions
# developed in the feat_eng file

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

def sentiment_analyzer_scores(text):
    score = analyser.polarity_scores(text)
    lb = score['compound']
    if lb >= 0.05:
        return 1
    elif (lb > -0.05) and (lb < 0.05):
        return 0
    else:
        return -1

import spacy
from collections import Counter
nlp=spacy.load('en_core_web_sm')  # the 'en' shortcut is no longer available in spaCy 3

def postag(text):
    doc=nlp(text)
    pos=[(i, i.pos_) for i in doc]
    counts=Counter(tag for word, tag in pos)
    return counts

def sent_word_tok(text):
    sents=nltk.sent_tokenize(text)
    words=nltk.word_tokenize(text)
    num_sents=len(sents)
    num_words=len(words)
    
    if num_words == 0:
        avg_word_sent = 0
    else:
        avg_word_sent = num_words/num_sents
    return {'num_word': num_words, 'num_sent': num_sents, 'avg_words_sent': avg_word_sent}

def meta_calc(narrative):
    
    #take the narrative text and analyze it so that we can get a prediction
    # first, get sentiment
    senti = sentiment_analyzer_scores(narrative)
    # Take input text and get POS tags
    pos = postag(narrative)
    # Take input text and get summary statistics about length
    length = sent_word_tok(narrative)

    # POS tags used as meta features; tags absent from the text are counted as zero
    pos_tags = ['ADJ','ADP','ADV','CCONJ','DET','INTJ','NOUN','NUM','PART','PRON',
                'PROPN','PUNCT','SPACE','SYM','VERB','X']
    pos_counts = {tag: pos.get(tag, 0) for tag in pos_tags}

    # combine sentiment, POS counts and length statistics into a single row
    row = {'sentiment': senti}
    row.update(pos_counts)
    row.update(length)

    # order the columns to match the meta_feat list used to fit the scaler
    columns = ['sentiment'] + pos_tags + ['avg_words_sent','num_sent','num_word']

    #generate the output as a one-row data frame
    return pd.DataFrame([row], columns=columns)

In [None]:
meta_feat_test=meta_calc(tester)

In [None]:
meta_feat_test
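
Before scaling, it is worth confirming that the computed features are in the same order as `meta_feat`, since the scaler was fit on columns in that order. Below is a minimal sketch of the reordering idea, using a plain dictionary and a shortened, hypothetical feature list with made-up values:

```python
# Shortened, hypothetical feature order (the real one is meta_feat)
expected_order = ['sentiment', 'ADJ', 'avg_words_sent', 'num_sent', 'num_word']

# A computed row whose keys arrive in a different order (made-up values)
computed_row = {'num_word': 130, 'num_sent': 7, 'avg_words_sent': 18.6,
                'ADJ': 9, 'sentiment': -1}

# Fail early if a feature is missing or unexpected
missing = [name for name in expected_order if name not in computed_row]
extra = [name for name in computed_row if name not in expected_order]
assert not missing and not extra

# Build the vector in the order the scaler expects
ordered_vector = [computed_row[name] for name in expected_order]
print(ordered_vector)
```

With a pandas data frame, selecting columns with a list of names reorders them in the same way.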

In [None]:
# Standardize the feature vector
scaler=joblib.load('trained_scaler.pkl')
feature_vector_std = pd.Series(scaler.transform(meta_feat_test).ravel(),index=meta_feat)

In [None]:
# Compute meta feature ranks
feature_ranks = pd.Series(trained_class.coef_.T.ravel()[:len(meta_feat)], index=meta_feat)

In [None]:
# Compute the weighted score of the meta features of a narrative
user_nar_score = np.multiply(feature_vector_std[pred_feat],feature_ranks[pred_feat])
user_nar_score

In [None]:
# Compute the weighted score of the meta features of successful narratives
suc_nar_score = np.multiply(success_meta_std[pred_feat],feature_ranks[pred_feat])
suc_nar_score
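
The comparison can be turned into plain-language suggestions by looking at where the user's weighted score falls below the successful average. The sketch below is one hypothetical way to do that. The numbers, the `threshold` parameter, and the `suggest` helper are illustrative and not part of the original tool.

```python
# Hypothetical weighted scores (standardized value x coefficient)
user_scores = {'sentiment': -0.30, 'DET': 0.02, 'ADJ': 0.10}
success_scores = {'sentiment': 0.05, 'DET': 0.03, 'ADJ': 0.04}

def suggest(user, success, threshold=0.05):
    """Return features where the user trails successful narratives by more than threshold."""
    gaps = {name: success[name] - user[name] for name in success}
    lagging = [name for name, gap in gaps.items() if gap > threshold]
    # Largest gap first, so the most impactful change is listed first
    return sorted(lagging, key=lambda name: gaps[name], reverse=True)

print(suggest(user_scores, success_scores))
```

A threshold keeps the tool from flagging features where the user is already close to the successful average; its value here is arbitrary and would need tuning.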

In [None]:
# Combine the weighted scores into a single DataFrame
messy = pd.DataFrame([user_nar_score, suc_nar_score], index=['Your Narrative', 'Successful Narratives']).T.reset_index()

# Transform the combined data into tidy format
tidy = pd.melt(messy,id_vars='index', value_vars=['Your Narrative', 'Successful Narratives'],var_name=' ')
messy

In [None]:
# Draw a grouped bar plot of the weighted scores
fig = sns.catplot(  # factorplot was renamed catplot in seaborn 0.9
    data=tidy,
    y='index',
    x='value',
    hue=' ',
    kind='bar',
    height=5,  # the size argument was renamed height
    aspect=1.5,
    palette='Set1',
    legend_out=False
).set(
    xlabel='score',
    ylabel='',
    xticks=[]
)
# Re-label the y-axis (labels follow the order of pred_feat:
# 'sentiment','SYM','DET','PUNCT','ADJ','VERB') and reposition the legend
labels = ['Sentiment','Symbols','Determiners','Punctuation','Adjectives','Verbs']
plt.yticks(np.arange(len(pred_feat)), labels)
fig.ax.legend(loc='lower right');
