# Dealing with text and Naive Bayes

***

In [1]:
# Import the libraries we will be using
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

import matplotlib.pylab as plt
%matplotlib inline

# We will want to keep track of some different ROC curves; let's do that here
tprs = []
fprs = []
roc_labels = []
aucs = []

## Document classification and customer satisfaction

You've been hired by Trans American Airlines (TAA) as a business analytics professional. One of TAA's top priorities is customer service, so it is of utmost importance to identify whenever customers are unhappy with the way employees have treated them. Your task is to analyze Twitter data to detect when a customer is complaining about flight attendants. Tweets suspected of being flight attendant complaints should be forwarded directly to the customer service department so it can track the issue and take corrective action.

Let's start by loading the data.

In [None]:
! git clone https://github.com/yizuc/datamining.git

In [None]:
data_path = '/content/datamining/Module6_Text_NaiveBayes/data/tweets.csv'
df = pd.read_csv(data_path)
df.head()

Let's take a look at what people complain about on Twitter.

In [None]:
df.negativereason.value_counts()

Our label is given by the "Flight Attendant Complaints" category.

In [None]:
# We'll call our label 'is_fa_complaint' and keep only the text as a feature.
df["is_fa_complaint"] = (df.negativereason == "Flight Attendant Complaints").astype(int)
df = df[["is_fa_complaint", "text"]]
df.shape

Let's take a look at the fraction of tweets that are complaints about flight attendants.

In [None]:
df['is_fa_complaint'].mean()

Here are some examples of the tweets.

In [None]:
print("Flagged as FA complaints:")
print(df[df.is_fa_complaint == 1].text.values[0:5])

print("\nNot flagged as FA complaint:")
print(df[df.is_fa_complaint == 0].text.values[0:5])

Since we are going to do some modeling, we should split our data into a training and a test set.

In [8]:
X = df['text']
Y = df['is_fa_complaint']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=42)

### Text as features
How can we turn the large amount of text for each record into useful features?


#### Binary representation
One way is to create a matrix that uses each word as a feature and keeps track of whether or not a word appears in a document/record. You can do this in sklearn with a `CountVectorizer()`, setting `binary` to `True`. The process is very similar to how you fit a model: you fit the `CountVectorizer()` on the training text, which figures out what words exist in your data.

In [None]:
binary_vectorizer = CountVectorizer(binary=True)
binary_vectorizer.fit(X_train)

Let's look at the vocabulary the `CountVectorizer()` learned.

In [None]:
vocabulary_list = list(zip( binary_vectorizer.vocabulary_.keys(), binary_vectorizer.vocabulary_.values()) )

vocabulary_list[0:10]

Now that we know what words are in the data, we can transform our text into a clean matrix. Simply `.transform()` the raw data using our fitted `CountVectorizer()`. You will do this for both the training and the test data. What do you think happens if there are new words in the test data that were not seen in the training data?

In [11]:
X_train_binary = binary_vectorizer.transform(X_train)
X_test_binary = binary_vectorizer.transform(X_test)
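One way to explore the question above is a tiny experiment. The sketch below uses made-up tweets (`toy_train`, `toy_test`, and `toy_vectorizer` are hypothetical names, not part of the course data) and shows that a word seen only at transform time gets no column of its own.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy tweets, not taken from the course data
toy_train = ["the crew was rude", "great flight great crew"]
toy_test = ["rude rude attendant"]  # "attendant" never appears in toy_train

toy_vectorizer = CountVectorizer(binary=True)
toy_vectorizer.fit(toy_train)

# Recover the column order: column i corresponds to the word with index i
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)
# ['crew', 'flight', 'great', 'rude', 'the', 'was']

# "rude" appears twice but is recorded as 1 (binary=True);
# "attendant" is unknown to the vectorizer and is silently dropped
toy_test_matrix = toy_vectorizer.transform(toy_test).toarray()
print(toy_test_matrix)
# [[0 0 0 1 0 0]]
```

This is why the vectorizer is fit on the training data only: the test matrix must have exactly the same columns as the matrix the model was trained on.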

We can take a look at our new `X_test_binary`.

In [None]:
X_test_binary

Sparse matrix? Where is our data?

If you look at the output above, you will see that the data is stored in a *sparse* matrix (as opposed to the typical dense matrix) with ~3k rows and ~13k columns. The rows are records in the original data and the columns are words. Given the shape, there are ~39m cells that could hold values. However, the output shows that only ~46k of them (~0.12%) actually do! Why is this?

To save space, sklearn uses a sparse matrix. This means that only values that are not zero are stored! This saves a ton of space! This also means that visualizing the data is a little trickier. Let's look at a very small chunk.

In [None]:
# Recall that 13183 is the index for "you"
X_test_binary[0:20, 13180:13200].todense()
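To make the idea of sparsity concrete, here is a minimal sketch on a hand-made matrix (the array `dense` and the other names are illustrative assumptions). The same `.shape` and `.nnz` attributes should be available on the matrix returned by the vectorizer, so the density of `X_test_binary` can be checked the same way.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small hand-made matrix that is mostly zeros
dense = np.array([[0, 0, 1, 0],
                  [0, 0, 0, 0],
                  [1, 0, 0, 1]])

# Convert to compressed sparse row format: only non-zero entries are stored
sparse = csr_matrix(dense)

n_cells = sparse.shape[0] * sparse.shape[1]  # total number of cells
density = sparse.nnz / n_cells               # share of cells that are non-zero

print("stored values:", sparse.nnz)
print("total cells:", n_cells)
print("density:", density)
```

Calling `.toarray()` (or `.todense()`) converts back to a regular array, which is only practical for small slices like the one shown above.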

#### Applying a model
Now that we have a ton of features (one for every word!) let's try using a logistic regression model to predict which tweets are about flight attendant complaints.

In [None]:
def get_model_roc(models, Xs_test, names, Y_test):
    """Plot one ROC curve per model and print each model's AUC."""
    plt.rcParams['figure.dpi'] = 100
    for model, X_test, name in zip(models, Xs_test, names):
        # Predicted probability of the positive class
        probs = model.predict_proba(X_test)[:, 1]
        fpr, tpr, thresholds = metrics.roc_curve(Y_test, probs)
        plt.plot(fpr, tpr, label=name)
        print("AUC for {0} = {1:.3f}".format(name, metrics.roc_auc_score(Y_test, probs)))
    # Diagonal reference line: the expected ROC curve of a random classifier
    plt.plot([0, 1], [0, 1], linestyle='dashed', color='black')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend()
    plt.show()

        
model_binary = LogisticRegression(solver='liblinear')
model_binary.fit(X_train_binary, Y_train)
get_model_roc([model_binary], [X_test_binary], ['binary'], Y_test)

#### Counts instead of binary
Instead of using a 0 or 1 to represent the occurrence of a word, we can use the actual counts. We do this the same way as before, but now we leave `binary` set to `False` (the default value).

In [None]:
# Fit a counter
count_vectorizer = CountVectorizer()
count_vectorizer.fit(X_train)

# Transform to counter
X_train_counts = count_vectorizer.transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

# Model
model_counts = LogisticRegression(solver='liblinear')
model_counts.fit(X_train_counts, Y_train)

get_model_roc([model_binary, model_counts], [X_test_binary, X_test_counts], ['binary', 'counts'], Y_test)

#### Tf-idf
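Another popular technique when dealing with text is to use the term frequency - inverse document frequency (tf-idf) measure instead of just counts as the feature values (see the book). Tf-idf gives more weight to words that appear often in a document but rarely across the whole collection.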

In [None]:
# Fit a tf-idf vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X_train)

# Transform the text into tf-idf features
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Model
model_tfidf = LogisticRegression(solver='liblinear')
model_tfidf.fit(X_train_tfidf, Y_train)

get_model_roc([model_binary, model_tfidf], [X_test_binary, X_test_tfidf], ['binary', 'tf-idf'], Y_test)
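For readers who want to see where the tf-idf numbers come from, the sketch below recomputes them by hand on three made-up documents. It assumes scikit-learn's default settings as I understand them (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the book's definition may differ in detail, and all variable names here are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy documents, not taken from the course data
toy_docs = ["late flight late", "rude crew", "late crew"]

# Term frequencies: raw word counts per document
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents does each word appear?
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed sklearn default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale each row to unit L2 length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with what TfidfVectorizer produces using its defaults
sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Words that occur in fewer documents receive a larger idf, so they contribute more to a document's feature vector than words that occur everywhere.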

The `CountVectorizer()` and `TfidfVectorizer()` classes have many options. You can restrict the words you would like in the vocabulary. You can add n-grams. You can use stop word lists. Which options you should use generally depends on the type of data you are dealing with.

In [None]:
# Fit a tf-idf vectorizer with English stop words removed and 1- and 2-grams
ngram_vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
ngram_vectorizer.fit(X_train)

# Transform the text into tf-idf features
X_train_ngram = ngram_vectorizer.transform(X_train)
X_test_ngram = ngram_vectorizer.transform(X_test)

# Model
model_ngram = LogisticRegression(solver='liblinear')
model_ngram.fit(X_train_ngram, Y_train)

get_model_roc([model_binary, model_ngram], [X_test_binary, X_test_ngram], ['binary', '2-ngram'], Y_test)
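To see what these options do to the vocabulary, here is a small sketch on a single made-up tweet (the tweet and the vectorizer names are assumptions for illustration). It uses `CountVectorizer`, but the same `ngram_range` and `stop_words` arguments are the ones passed to `TfidfVectorizer` above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy tweet, not taken from the course data
toy_tweet = ["the flight attendant was not helpful"]

# Single words only (the default)
unigram_vec = CountVectorizer().fit(toy_tweet)

# Single words plus pairs of adjacent words
bigram_vec = CountVectorizer(ngram_range=(1, 2)).fit(toy_tweet)

# Same, but with sklearn's built-in English stop word list removed first
stop_vec = CountVectorizer(stop_words='english', ngram_range=(1, 2)).fit(toy_tweet)

print(sorted(unigram_vec.vocabulary_))
print(sorted(bigram_vec.vocabulary_))
print(sorted(stop_vec.vocabulary_))
```

Comparing the three printed vocabularies is worthwhile: removing stop words before building n-grams changes which word pairs exist, which may or may not be what you want for a given task.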

### Modeling with another technique: Naive Bayes

So far we have been exposed to tree classifiers and logistic regression in class. Now, it's time for another popular modeling technique of supervised learning (especially in text classification): the Naive Bayes (NB) classifier. In particular, we are using a Bernoulli Naive Bayes (BNB) for our binary classification. (Bernoulli NB is the model described in the book; there are other versions of NB.)

As described in your text, the Naive Bayes model is a **probabilistic approach which assumes conditional independence between features** (in this case, each word is a feature, and the conditioning is on the true class). It assigns class labels (e.g. `is_fa_complaint = 1` or `is_fa_complaint = 0`). To do so, Naive Bayes models the probability of the presence of each _word_ given that the tweet is a flight attendant complaint, and given that it is not. Then it combines them using Bayes' theorem (again, as described in the book).

Using this model in sklearn works just the same as the others we've seen ([more details here](http://scikit-learn.org/stable/modules/naive_bayes.html)):

- Choose the model
- Fit the model (Train)
- Predict with the model (Train or Test or Use data)

In [None]:
from sklearn.naive_bayes import BernoulliNB

model_nb = BernoulliNB()
model_nb.fit(X_train_binary, Y_train)
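The calculation that Bernoulli Naive Bayes performs can be sketched by hand on a tiny made-up data set. The example below assumes three hypothetical word-presence features and Laplace smoothing with `alpha = 1`, and checks that the manual posterior agrees with `BernoulliNB`; it is an illustration of the idea, not a description of sklearn's internals.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: rows are tweets, columns are word-presence flags
# for a made-up vocabulary ["rude", "late", "great"]
X_toy = np.array([[1, 0, 0],
                  [1, 1, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
y_toy = np.array([1, 1, 0, 0, 0])   # 1 = complaint, 0 = not a complaint
x_new = np.array([1, 1, 0])         # new tweet containing "rude" and "late"
alpha = 1.0                         # Laplace smoothing value

scores = []
for c in (0, 1):
    X_c = X_toy[y_toy == c]
    prior = len(X_c) / len(X_toy)                      # P(class)
    # Smoothed P(word present | class)
    p_word = (X_c.sum(axis=0) + alpha) / (len(X_c) + 2 * alpha)
    # Independence assumption: multiply per-word probabilities,
    # using p for present words and (1 - p) for absent ones
    likelihood = np.prod(np.where(x_new == 1, p_word, 1 - p_word))
    scores.append(prior * likelihood)

# Bayes' theorem: normalize so the class scores sum to one
manual_posterior = np.array(scores) / np.sum(scores)

toy_nb = BernoulliNB(alpha=alpha).fit(X_toy, y_toy)
sklearn_posterior = toy_nb.predict_proba(x_new.reshape(1, -1))[0]

print(manual_posterior)
print(sklearn_posterior)
```

Note that absent words matter too: in the Bernoulli model, not seeing "great" is itself evidence about the class.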

Over the past few weeks we have seen that many of the models we are using have different complexity control parameters that can be tweaked. In Naive Bayes, the parameter that is typically tuned is the Laplace smoothing value **`alpha`**.
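As a rough illustration of what `alpha` does, the sketch below fits `BernoulliNB` on a made-up two-word data set with several smoothing values and reads off the estimated probability of a word that never appears in the positive class. The data and the names `X_toy`, `y_toy`, and `probs` are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data with columns ["rude", "great"];
# "great" never appears in a complaint (class 1)
X_toy = np.array([[1, 0],
                  [1, 0],
                  [0, 1],
                  [0, 1]])
y_toy = np.array([1, 1, 0, 0])

probs = {}
for alpha in (0.01, 1.0, 10.0):
    nb = BernoulliNB(alpha=alpha).fit(X_toy, y_toy)
    # feature_log_prob_[class, feature] holds log P(word present | class)
    probs[alpha] = np.exp(nb.feature_log_prob_[1, 1])
    print("alpha = {0}: P('great' | complaint) = {1:.4f}".format(alpha, probs[alpha]))
```

With a small `alpha` the model trusts the observed counts almost completely, so an unseen word gets a probability close to zero. A larger `alpha` pulls every estimate toward 0.5, which acts as stronger regularization.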

Also, there are other versions of Naive Bayes:

1. **Multinomial Naive Bayes (MNB):** This model handles count features and not just binary features. MNB is sometimes used with binary presence/absence variables (like word presence) even though that violates the model assumptions, because it works well in practice.
2. **Gaussian Naive Bayes (GNB):** This model treats the likelihood of the features as Gaussian, and thus we can use it for continuous features. Sometimes GNB and Bernoulli NB are combined when one has features of mixed types.
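Both variants follow the same fit/predict pattern in sklearn. The sketch below is a minimal illustration on made-up data (the arrays and variable names are hypothetical): `MultinomialNB` is given word counts, and `GaussianNB` is given a single continuous feature.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, GaussianNB

y_toy = np.array([1, 1, 0, 0])

# Multinomial NB: features are word counts
# (hypothetical columns: ["rude", "late", "great"])
X_counts = np.array([[2, 0, 0],
                     [1, 1, 0],
                     [0, 0, 3],
                     [0, 1, 2]])
mnb = MultinomialNB().fit(X_counts, y_toy)
mnb_pred = mnb.predict(np.array([[3, 0, 0]]))   # a tweet saying "rude" three times
print("MNB prediction:", mnb_pred[0])

# Gaussian NB: features are continuous
# (hypothetical column: hours of delay mentioned in the tweet)
X_cont = np.array([[5.1], [4.8], [1.2], [0.9]])
gnb = GaussianNB().fit(X_cont, y_toy)
gnb_pred = gnb.predict(np.array([[5.0]]))
print("GNB prediction:", gnb_pred[0])
```

In the notebook's setting, `MultinomialNB` could be fit on `X_train_counts` in the same way that `BernoulliNB` is fit on `X_train_binary`.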

In [None]:
get_model_roc([model_binary, model_nb], [X_test_binary, X_test_binary], ['binary', 'naive-bayes'], Y_test)