# Classification with Scikit-Learn #

By John Semerdjian, assisted by Marti Hearst

October 2015

This is a tutorial on how to use Scikit-Learn in combination with Pandas to train and test a text classification algorithm.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import nltk

### Reading a csv file

Pandas can import data straight into a `DataFrame` from most file types, e.g. `.csv` or `.json`.

Download the consumer complaints dataset as a `.csv` file here: https://data.consumerfinance.gov/dataset/Consumer-Complaints/s6ew-h6mp

Description from [Consumer Financial Protection Bureau](http://www.consumerfinance.gov/complaintdatabase/):

> Each week we send thousands of consumers' complaints about financial products and services to companies for response. Complaints are listed in the database after the company responds or after they’ve had the complaint for 15 calendar days, whichever comes first.

> We publish the consumer’s description of what happened if the consumer opts to share it and after taking steps to remove personal information. See our Scrubbing Standard for more details

> We don’t verify all the facts alleged in these complaints, but we take steps to confirm a commercial relationship. We may remove complaints if they don’t meet all of the publication criteria. Data is refreshed nightly.

In [None]:
df = pd.read_csv("Consumer_Complaints.csv", low_memory=False)

### First look at your data

In [None]:
df.head()

### The data is really wide! Let's extract a few columns to review

We can pass a list of column names to filter our DataFrame.

In [None]:
cols = ["Product", "Sub-product", "Issue", "Sub-issue", 
        "Consumer complaint narrative", "Company public response", 
        "Company", "Company response to consumer"]

Put the list of column names in brackets after the name of the DataFrame to subset. 

In [None]:
df[cols].head()

### Return all rows that do not have `NaN` in the `Consumer complaint narrative` column

The expression `df["Consumer complaint narrative"].notnull()` returns a boolean Series: `True` where the value is not null (`NaN`) and `False` otherwise. We place this boolean Series inside the brackets of `df` to keep only the rows marked `True`. Farther down we will see how `reset_index()` gives us a clean index, so we don't have to keep the row labels inherited from the larger DataFrame.

In [None]:
filtered_data = df["Consumer complaint narrative"].notnull()
filtered_data[:10]

In [None]:
df_narrative = df[filtered_data]

Notice where the index starts in the left-most column: it no longer runs consecutively from 0 up to the number of rows, because the filtered rows keep their original labels.

In [None]:
df_narrative[cols].head()
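To see the masking and indexing behaviour in isolation, here is a minimal sketch on a made-up three-row DataFrame. The column name `text` and its values are invented for illustration and are not part of the complaints dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the middle row has a missing narrative.
toy = pd.DataFrame({"text": ["late fee", np.nan, "wrong balance"]})

mask = toy["text"].notnull()          # boolean Series: True, False, True
kept = toy[mask]                      # keeps rows 0 and 2; original labels survive
clean = kept.reset_index(drop=True)   # relabels the remaining rows 0, 1

print(list(kept.index))    # [0, 2]
print(list(clean.index))   # [0, 1]
```

The gap in `kept.index` is the same effect we see in `df_narrative`.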

### Plotting with Pandas

We can plot the distribution of categories in the `Product` column by chaining the `.value_counts()` and `.plot()` methods after selecting the `Product` column.

First we count, for each unique value in that column, the number of observations in the DataFrame, sorted in ascending order.

In [None]:
sorted_product_counts = df_narrative.Product.value_counts(ascending=True)
sorted_product_counts
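As a small illustration of what `value_counts(ascending=True)` returns, here is a sketch on a hand-made Series. The six values are invented for the example.

```python
import pandas as pd

# Hypothetical product labels, not taken from the real dataset.
products = pd.Series(["Mortgage", "Credit card", "Mortgage",
                      "Mortgage", "Credit card", "Student loan"])

# One row per unique value, with the rarest category first.
counts = products.value_counts(ascending=True)
print(counts)
```

The result is itself a Series indexed by category, which is why `.plot()` can be chained directly onto it.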

Next, we can plot a horizontal (`barh`) bar graph to view the results, fix the figure size to 8x6, and give it a title.

In [None]:
sorted_product_counts.plot(kind='barh', figsize=(8,6), title="Product Categories");

### Create training, development, and test sets

First, let's shuffle the rows in our `DataFrame`. There are many ways of splitting our data into training, development, and test sets. We'll use the `numpy` function `random.permutation` to generate a randomized array of row indices. 

(Alternatively, we can use the [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from `sklearn.model_selection` (called `sklearn.cross_validation` in scikit-learn releases before 0.18) to easily create training and "test" sets.)
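For reference, here is a minimal sketch of that alternative, assuming scikit-learn 0.18 or later, where `train_test_split` is imported from `sklearn.model_selection`. The toy DataFrame and the `random_state` value are made up for the example. Calling the function twice gives a 60/20/20 split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row dataset standing in for the complaints data.
toy = pd.DataFrame({"text": ["doc %d" % i for i in range(10)],
                    "label": ["a", "b"] * 5})

# First carve off 60% for training, then halve the remainder into dev and test.
train, rest = train_test_split(toy, train_size=0.6, random_state=0)
dev, test = train_test_split(rest, test_size=0.5, random_state=0)

print(len(train), len(dev), len(test))  # 6 2 2
```

The function shuffles by default, so no separate permutation step is needed with this approach.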

In [None]:
df_narrative.index[:10]

In [None]:
random_index = np.random.permutation(df_narrative.index)
random_index[:10]

After we apply this randomized index, we'll need to reset the index of our new `DataFrame`. This allows us to use the normal indexing approaches.



In [None]:
df_narrative.loc[random_index, ['Product', 'Consumer complaint narrative']][:5]

The `drop=True` option in `reset_index()` resets our rows without adding a new column indicating the old index, while `inplace=True` performs the operation in place instead of returning a copy of the `DataFrame`.

In [None]:
df_narrative_shuffled = df_narrative.loc[random_index, ['Product', 'Consumer complaint narrative']]
df_narrative_shuffled.reset_index(drop=True, inplace=True)
df_narrative_shuffled[:5]

### Create 60/20/20 split for training/dev/test sets

The `.shape` attribute returns a tuple of the number of rows and columns in a DataFrame

In [None]:
rows, columns = df_narrative_shuffled.shape
print("Rows:", rows)
print("Columns:", columns)

In [None]:
train_size = int(round(rows*.6))
dev_size   = int(round(rows*.2))

First 60% of rows are the training set

In [None]:
# .iloc slices by position and excludes the end point, so the sets do not overlap
df_train = df_narrative_shuffled.iloc[:train_size]
df_train.shape

In [None]:
df_train.head()

Followed by the next 20% of rows for the development set

In [None]:
df_dev = df_narrative_shuffled.iloc[train_size:dev_size+train_size].reset_index(drop=True)
df_dev.shape

And the last 20% are the test set

In [None]:
df_test = df_narrative_shuffled.iloc[dev_size+train_size:].reset_index(drop=True)
df_test.shape
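The shuffle-and-slice steps can be collected into one helper. The sketch below is an illustration, not the tutorial's own code: the function name `split_60_20_20`, the `seed` argument, and the toy DataFrame are all hypothetical. It slices by position with `.iloc`, which excludes the end point, whereas label slicing with `.loc` includes it and would place a boundary row in two sets.

```python
import numpy as np
import pandas as pd

def split_60_20_20(df, seed=0):
    """Shuffle df and return (train, dev, test) with a 60/20/20 row split."""
    rng = np.random.RandomState(seed)
    # Reorder rows by a random permutation of the labels, then relabel 0..n-1.
    shuffled = df.loc[rng.permutation(df.index.values)].reset_index(drop=True)
    n_train = int(round(len(shuffled) * 0.6))
    n_dev = int(round(len(shuffled) * 0.2))
    train = shuffled.iloc[:n_train]
    dev = shuffled.iloc[n_train:n_train + n_dev].reset_index(drop=True)
    test = shuffled.iloc[n_train + n_dev:].reset_index(drop=True)
    return train, dev, test

# Hypothetical data with a non-consecutive index, like df_narrative.
toy = pd.DataFrame({"id": range(100, 110)}, index=range(50, 60))
train, dev, test = split_60_20_20(toy)
print(len(train), len(dev), len(test))  # 6 2 2
```

Fixing the seed makes the split reproducible, which helps when comparing models across runs.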

# Scikit-Learn

After we've wrangled/cleaned/separated our data with `Pandas`, we can start building machine learning algorithms using `Scikit-Learn`, which gives us a rich, unified API to quickly create classification models.

### Building features from scratch

Let's say you have an intuition for the terms you think would be helpful for classifying consumer complaints. We can quickly create a column vector for each feature you think of, then use a simple classification algorithm for prediction.

For now let's just build features to classify credit card-related complaints.

I thought of the following features:

* character: "$"
* word: "payment"
* bigram: "credit card"

There are two feature processing functions below.  One handles features consisting of one word, and the other handles features consisting of two words.  They count how often the passed in term occurs in the document.  In the bigram case, a FreqDist is needed to keep track.

In [None]:
def unigram_feature(x, unigram):
    word_list = x.lower().split(" ")
    return word_list.count(unigram)

def bigram_feature(x, bigram):
    bigram_tuple = tuple(bigram.split())
    word_list = x.lower().split(" ")
    bi = nltk.FreqDist(nltk.bigrams(word_list))
    return bi[bigram_tuple]
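To check what these counting functions do on a known input, here is a standard-library sketch of the same idea. It swaps NLTK's `FreqDist` and `bigrams` for `collections.Counter` and `zip`. The function names and the sample sentence are made up for the example.

```python
from collections import Counter

def count_unigram(text, unigram):
    """Count exact matches of unigram among space-separated, lower-cased tokens."""
    return text.lower().split(" ").count(unigram)

def count_bigram(text, bigram):
    """Count adjacent token pairs equal to the two words in bigram."""
    words = text.lower().split(" ")
    pairs = Counter(zip(words, words[1:]))   # frequency of each adjacent pair
    return pairs[tuple(bigram.split())]

# Hypothetical complaint text.
sample = "My credit card payment was late. The Credit Card company charged $ 35."
print(count_unigram(sample, "payment"))      # 1
print(count_bigram(sample, "credit card"))   # 2
print(count_unigram(sample, "$"))            # 1
print(count_unigram(sample, "late"))         # 0, because the token is "late."
```

The last line shows a limitation of splitting on spaces only: punctuation stays attached to words, so a `$` glued to a number (as in `$35`) would not be counted either.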

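Compute the dollar sign feature for the training set. In the run shown here it doesn't occur in the first 10 documents; your output will differ because the shuffle is random. Note that the function only counts a `$` that stands alone as a token, so `$40` is not counted.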

In [None]:
train_dollarsign_feature = df_train['Consumer complaint narrative'].apply(lambda x: unigram_feature(x, ('$')))
train_dollarsign_feature[:10]

Compute the 'payment' feature. In the run shown here it occurs twice in document 1 and once in document 8.

In [None]:
train_payment_feature = df_train['Consumer complaint narrative'].apply(lambda x: unigram_feature(x, ('payment')))
train_payment_feature[:10]

In [None]:
train_creditcard_feature = df_train['Consumer complaint narrative'].apply(lambda x: bigram_feature(x, ('credit card')))
train_creditcard_feature[:10]

Bring your feature vectors together into a `DataFrame`

In [None]:
df_train_features = pd.DataFrame({'dollar': train_dollarsign_feature, 
                                  'payment': train_payment_feature, 
                                  'creditcard': train_creditcard_feature})

In [None]:
df_train_features.head()

Create the feature vectors for the development set too

In [None]:
dev_dollarsign_feature = df_dev['Consumer complaint narrative'].apply(lambda x: unigram_feature(x, ('$')))
dev_payment_feature = df_dev['Consumer complaint narrative'].apply(lambda x: unigram_feature(x, ('payment')))
dev_creditcard_feature = df_dev['Consumer complaint narrative'].apply(lambda x: bigram_feature(x, ('credit card')))

In [None]:
df_dev_features = pd.DataFrame({'dollar': dev_dollarsign_feature, 
                                'payment': dev_payment_feature, 
                                'creditcard': dev_creditcard_feature})

In [None]:
df_dev_features.head()

### Building a model using your features

We'll build a simple Naive Bayes classifier to predict the product category based on the consumer complaint features we just created. The `fit` function does the training. We pass in the features and the correct classes that we want as output (the `Product` column) as the arguments.

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb_model = nb.fit(df_train_features, df_train.Product)

The predict function does the classification.

In [None]:
nb_predictions = nb_model.predict(df_dev_features)
nb_predictions[0]

We'll use a `Scikit-Learn` function to calculate the accuracy.

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(df_dev.Product, nb_predictions)

Ouch! That's not a very good overall score. Let's look at individual class accuracy.

### Classification report

Another way of evaluating the performance of your models is to use `Scikit-Learn`'s `classification_report` function.

* **Precision**:
$$\frac{TP} {TP+FP}$$

* **Recall, Sensitivity, TP Rate**:
$$\frac{TP} {TP+FN}$$

* **$F_1$ Measure**:
$$F _1 = 2 \frac{PR} {P + R}$$
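As a worked example of the three formulas, the sketch below computes them by hand for one class. The helper name `precision_recall_f1` and the six labels are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical true and predicted labels.
y_true = ["card", "card", "card", "loan", "loan", "loan"]
y_pred = ["card", "card", "loan", "card", "loan", "loan"]

# For "card": TP = 2, FP = 1, FN = 1, so all three scores equal 2/3.
p, r, f1 = precision_recall_f1(y_true, y_pred, positive="card")
print(p, r, f1)
```

`classification_report` reports these same quantities once per class.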

The class labels are usually returned sorted in alphabetical/numerical order.

In [None]:
class_labels = np.sort(df_train.Product.unique())

Run the full report

In [None]:
from sklearn.metrics import classification_report

print(classification_report(df_dev.Product, nb_predictions, target_names=class_labels))

### Creating features using overall word counts

Another strategy of creating features is to use *all* the words in our collection. [`CountVectorizer()`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) accepts an array of text and converts the text through tokenization and counting unique terms, thereby generating our so-called "bag of words".

In other words, for each row in our `DataFrame` we get a long vector/array of the counts of each word.  You can modify the tokenizer, remove stop words, generate n-gram features, and perform other types of text processing. 

Here are some options to explore:

`token_pattern : string`

> Regular expression denoting what constitutes a “token”, only used if tokenize == ‘word’. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

`min_df : float in range [0.0, 1.0] or int, default=1`

> When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

`max_features : int or None, default=None`

> If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.

`stop_words : string {‘english’}, list, or None (default)`

> If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.


`ngram_range : tuple (min_n, max_n)`

> The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.


If you're interested in using Tf-Idf instead of counts, check out [`TfidfVectorizer()`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

The [feature extraction documentation](http://scikit-learn.org/stable/modules/feature_extraction.html) from Scikit-Learn is also very good.
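To make a few of these options concrete, here is a small sketch on three made-up sentences. It reads the learned terms from the fitted `vocabulary_` attribute. The documents and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus.
docs = ["the bank closed my account",
        "the bank charged a late fee",
        "my credit card was charged twice"]

# Defaults: unigrams only, tokens of two or more alphanumeric characters.
unigrams = CountVectorizer()
unigrams.fit(docs)

# Unigrams and bigrams, English stop words removed,
# and only terms that appear in at least 2 documents.
filtered = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
filtered.fit(docs)

print(sorted(unigrams.vocabulary_))   # 12 terms; the one-letter word "a" is dropped
print(sorted(filtered.vocabulary_))   # ['bank', 'charged']
```

With so few documents, `min_df=2` removes every bigram, because none of them occurs in two different sentences.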

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

Let's use n-grams of length 1 to 2 with a simplified token pattern, keeping only n-grams that appear in at least 5 documents in the collection.

In [None]:
vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=5)

We can manually inspect what the tokenizer does by passing in a string

In [None]:
tokenizer = vec.build_tokenizer()

In [None]:
tokenizer("What's the warranty for this $40.00 toaster?")
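The tokenizer is essentially a regular-expression search, so its behaviour can be approximated with Python's `re` module alone. The sketch below compares the simplified pattern used here with `(?u)\b\w\w+\b`, which is the default `token_pattern` of `CountVectorizer`. Treat it as an approximation of the tokenizer step only, since lower-casing happens in a separate preprocessing step.

```python
import re

text = "What's the warranty for this $40.00 toaster?"

# Simplified pattern: any run of one or more word characters.
simplified = re.findall(r"\b\w+\b", text)

# Default CountVectorizer pattern: runs of two or more word characters.
default = re.findall(r"(?u)\b\w\w+\b", text)

print(simplified)
# ['What', 's', 'the', 'warranty', 'for', 'this', '40', '00', 'toaster']
print(default)
# ['What', 'the', 'warranty', 'for', 'this', '40', '00', 'toaster']
```

Both patterns drop the dollar sign and split `40.00` in two, which is one source of the uninformative features that show up later.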

For starters, we'll use the vectorizer we just configured and all the text in the **training set** to create our feature vectors. The `fit_transform()` method learns the vocabulary and returns a sparse matrix of the n-gram counts.

In [None]:
arr_train_feature_sparse = vec.fit_transform(df_train["Consumer complaint narrative"])
arr_train_feature_sparse

A sparse matrix is an efficient way of storing data where most values are 0. Just for information's sake, below we convert the sparse array into a normal array to get a better sense of what's going on.

In [None]:
arr_train_feature = arr_train_feature_sparse.toarray()
arr_train_feature

Instead of getting back a `Pandas DataFrame`, we get back a `numpy array` object.
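For a feel of what the sparse format stores, here is a minimal sketch using `scipy.sparse.csr_matrix` on a tiny hand-written array. The numbers are invented for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical count matrix: 3 documents, 4 terms, mostly zeros.
dense = np.array([[0, 0, 3, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 0]])
sparse = csr_matrix(dense)

print(sparse.shape)   # (3, 4)
print(sparse.nnz)     # 2 stored (non-zero) entries instead of 12 cells
print(np.array_equal(sparse.toarray(), dense))   # True
```

Scikit-Learn's `MultinomialNB` and `LogisticRegression` accept sparse input directly, so on a large corpus the dense conversion is mainly useful for inspection.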

### Inspecting our features

To see what our features are, we'll use the `get_feature_names_out()` method on the `vec` object that we fitted and transformed (releases of scikit-learn before 1.0 call it `get_feature_names()`). We wrap the result in a `list` so that we can search it with `.index()` later. Remember that we asked for both unigrams and bigrams when we created the feature vector.

You'll see a lot of nonsensical features there. What are some strategies that you can think of to make the tokenizer produce more informative features?

In [None]:
feature_labels = list(vec.get_feature_names_out())
feature_labels[100:110]

Let's try matching one word from the first row in our `DataFrame` to its respective position in our feature vector.

In [None]:
row0 = df_train.loc[0, 'Consumer complaint narrative']
row0

This is how we search for a word in the feature labels.

In [None]:
feature_index = feature_labels.index('credit')
feature_index

We should expect the count of the number of occurrences of 'credit' in the first row

In [None]:
arr_train_feature[0, feature_index]

In order to build our model, we'll need to perform the same transformation on our dev and test sets as we did on the training set.  To do this, we use the `transform()` function.

Note that we use `transform()` and not `fit_transform()`, since the latter would relearn the vocabulary from the dev or test text. We only want to use the features that are present in the training set.

In [None]:
arr_dev_feature_sparse = vec.transform(df_dev["Consumer complaint narrative"])
arr_dev_feature = arr_dev_feature_sparse.toarray()
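The difference between the two calls is easiest to see on a tiny example. In this sketch (documents and variable names are made up), the dev sentence contains a word that never occurs in the training text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and dev documents.
train_docs = ["late fee on my card", "card payment declined"]
dev_docs = ["mortgage payment late"]

toy_vec = CountVectorizer()
train_counts = toy_vec.fit_transform(train_docs)   # learns a 7-term vocabulary
dev_counts = toy_vec.transform(dev_docs)           # reuses that vocabulary

print(train_counts.shape)   # (2, 7)
print(dev_counts.shape)     # (1, 7)
print("mortgage" in toy_vec.vocabulary_)   # False
print(dev_counts.sum())     # 2: only "payment" and "late" are counted
```

Because the columns line up between the two matrices, a model trained on `train_counts` can score `dev_counts` directly. The unseen word "mortgage" is silently ignored.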

### We have way more features than observations!

This is a good time to consider dimensionality reduction.

In [None]:
arr_train_feature.shape

### Most common features

Let's plot the distribution of counts for the massive feature set we created.

In [None]:
feature_sum = arr_train_feature.sum(axis=0)

df_feature_sum = pd.DataFrame({'counts': feature_sum})
df_feature_sum.index = vec.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0

Top 10 features

In [None]:
df_feature_sum.sort_values('counts', ascending=False)[:10]

Plot the top 50 features

In [None]:
df_feature_sum.sort_values('counts', ascending=False)[:50].plot(kind='barh', figsize=(7,10));

### Many of these features are stopwords

How might we fix this?  (Hint: this makes the algorithm work much better on this classification problem.)

### Let's manually reduce our dimensions by using only the top N features in our training set

Since so many of our features rarely occur, let's (arbitrarily) cap our features at the 1000 most common n-grams using the `max_features` parameter of `CountVectorizer()`.

In [None]:
vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=5, max_features=1000)
arr_train_feature_sparse = vec.fit_transform(df_train["Consumer complaint narrative"])
arr_train_feature = arr_train_feature_sparse.toarray()
arr_train_feature.shape

Remember that if you transform the training set, you also have to transform the development set using the new vectorizer `vec` to get the desired effect.

In [None]:
arr_dev_feature_sparse = vec.transform(df_dev["Consumer complaint narrative"])
arr_dev_feature = arr_dev_feature_sparse.toarray()
arr_dev_feature.shape

# Train two machine learning classification models

### Naive Bayes (generative)

This initializes a [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) model which we then fit on the next line. We pass the training features along with their true labels, `df_train.Product`.

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb_model = nb.fit(arr_train_feature, df_train.Product)

We can easily predict the labels using our new model and the feature vectors from the dev set.

In [None]:
nb_predictions = nb_model.predict(arr_dev_feature)
nb_predictions[0]

This function returns the accuracy of our Naive Bayes model.

In [None]:
accuracy_score(df_dev.Product, nb_predictions)

Instead of looking at the class labels, let's look at the probability of predicting each class.

In [None]:
nb_predictions_probs = nb_model.predict_proba(arr_dev_feature)
nb_predictions_probs.shape
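To show how the probability matrix relates to the predicted labels, here is a sketch on a tiny hand-made count matrix. The features, labels, and variable names are hypothetical. Each row of `predict_proba` sums to 1, its columns follow the order of `model.classes_`, and the highest-probability column gives the same label as `predict`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical term counts for six documents and three terms.
X = np.array([[3, 0, 0], [2, 1, 0],
              [0, 3, 1], [0, 2, 2],
              [0, 0, 3], [1, 0, 4]])
y = np.array(["card", "card", "loan", "loan", "mortgage", "mortgage"])

model = MultinomialNB().fit(X, y)
probs = model.predict_proba(X)

print(probs.shape)       # (6, 3): one row per document, one column per class
print(model.classes_)    # column order: sorted class labels
print(np.allclose(probs.sum(axis=1), 1.0))   # True

# Picking the most probable column reproduces predict().
predicted = model.classes_[probs.argmax(axis=1)]
print(np.array_equal(predicted, model.predict(X)))   # True
```

Reading the column order from `model.classes_` avoids having to hard-code the number of classes when labelling a plot.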

Plot the predicted probabilities of each class for the first observation in our dev set.

In [None]:
plt.figure(figsize=(8,5))
plt.plot(nb_predictions_probs[0,:])
plt.xticks(np.arange(len(class_labels)), class_labels, rotation='vertical')
plt.show()

What does the text look like for the first observation in the dev set?

In [None]:
df_dev.loc[0, 'Consumer complaint narrative']

Its real label?

In [None]:
df_dev.loc[0, 'Product']

### Logistic Regression (discriminative)

We perform the same steps here as we did above.

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg_model = logreg.fit(arr_train_feature, df_train.Product)

In [None]:
logreg_predictions = logreg_model.predict(arr_dev_feature)

In [None]:
accuracy_score(df_dev.Product, logreg_predictions)

Let's compare the probabilities for the first observation in the dev set between logistic regression and naive bayes.

In [None]:
logreg_predictions_probs = logreg_model.predict_proba(arr_dev_feature)

In [None]:
plt.figure(figsize=(8,5))
plt.plot(logreg_predictions_probs[0,:], label='Logistic Regression')
plt.plot(nb_predictions_probs[0,:], label='Naive Bayes')
plt.xticks(np.arange(len(class_labels)), class_labels, rotation='vertical')
plt.legend(frameon=False)
plt.show()

In [None]:
logreg_predictions[0]

# Create Confusion Matrix

A confusion matrix is handy when inspecting the errors from a multi-class classification problem. Each row corresponds to a true label and each column to a predicted label, so each cell counts how often a given true class received a given prediction.

`Scikit-Learn` has a function called `confusion_matrix()` which produces an array with this data. However, the `Pandas crosstab` function does a better job displaying this information.

Remember, if our model had 100% accuracy, we would expect only the diagonal values to be populated.

In [None]:
pd.crosstab(df_dev.Product, nb_predictions, 
            rownames=['True'], colnames=['Predicted'], 
            margins=True)

In [None]:
pd.crosstab(df_dev.Product, logreg_predictions, 
            rownames=['True'], colnames=['Predicted'], 
            margins=True)
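To connect the two representations, the sketch below builds both from six invented labels. It also recovers accuracy as the diagonal total divided by the grand total.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = ["card", "card", "card", "loan", "loan", "loan"]
y_pred = ["card", "card", "loan", "card", "loan", "loan"]

# Rows are true labels, columns are predicted labels, both in sorted order.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]

# The same counts with readable row and column labels.
table = pd.crosstab(pd.Series(y_true, name="True"),
                    pd.Series(y_pred, name="Predicted"))
print(table)

# Correct predictions sit on the diagonal.
accuracy = cm.trace() / float(cm.sum())
print(accuracy)   # 4 correct out of 6
```

The off-diagonal cells show which pairs of classes the model mixes up, which is harder to read from a single accuracy score.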

We created a function to help you visualize the confusion matrix data below.

In [None]:
def plot_confusion_matrix(cm, title, target_names, cmap=plt.cm.coolwarm):
    plt.figure(figsize=(8,8))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

### Plot Confusion Matrix

The `confusion_matrix` function in Scikit-Learn orders its rows and columns by sorted label, which matches the order of `class_labels`.

In [None]:
from sklearn.metrics import confusion_matrix

nb_cm = confusion_matrix(df_dev.Product, nb_predictions)
plot_confusion_matrix(nb_cm, "Naive Bayes Confusion Matrix", class_labels)

In [None]:
logreg_cm = confusion_matrix(df_dev.Product, logreg_predictions)
plot_confusion_matrix(logreg_cm, "Logistic Regression Confusion Matrix", class_labels)

### Classification report

In [None]:
print(classification_report(df_dev.Product, nb_predictions, target_names=class_labels))

In [None]:
print(classification_report(df_dev.Product, logreg_predictions, target_names=class_labels))