# Steven Rae - Machine Learning Capstone Project - Step 2

## Using Naive Bayes Classification Technique

**Question to be Answered:** Focusing on Essay no. 7 ("On a typical Friday night I am..."), I want to create
a model that predicts whether a submitted essay describes outgoing activities or
introverted/stay-at-home activities occurring on a Friday night.

Define inputs and load the OKCupid data into a pandas dataframe:

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

okcupidDF = pd.read_csv("profiles.csv")

My review of the essay question content in Step 1 showed that all essay text is provided in completely lower case, so there is no need to write code specifically to change the text to lower case. However, the Essay 7 text does need to be checked for any NaN entries, which are replaced with an empty string:

In [2]:
okcupidDF['essay7'] = okcupidDF['essay7'].replace(np.nan, '', regex=True)
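
As a small illustration of the NaN handling, here is a minimal sketch on a made-up three-row series (the data is hypothetical, not taken from profiles.csv). It uses `fillna('')`, which should give the same result as the `replace` call for this purpose, and then confirms that no missing values remain:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the essay7 column (hypothetical data, not from profiles.csv)
essays = pd.Series(["out with friends", np.nan, "netflix at home"])

# fillna('') swaps every missing value for an empty string
cleaned = essays.fillna("")

print(cleaned.isna().sum())  # number of missing values left
print(cleaned.tolist())
```

An empty string means the essay contributes no words to the word counts later on, so it will always receive the 'Stay at home' label under the keyword rule.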

To determine which of the essays could be labeled as 'Outgoing' and which as 'Stay at home', I decided to create a list of search phrases that would typically be found in an essay describing 'Outgoing' activities:

In [3]:
outgoing_search_phrases = [
    'friend', 'friends', 'out with', 'out and about', 'out in the', 'out at',
    'family', 'children', 'club', 'clubbing', 'hanging out with', 'hanging with',
    'concert', 'show', 'good company', 'dancing', 'bar', 'bar hopping',
    'night life', 'nightlife', 'kids', 'out to dinner', 'out for dinner',
    'out on the town', 'event', 'live music', 'socializing', 'go out',
    'going out', 'happy hour', 'party',
]

Next, add a label column to the dataframe that has 1 for each Essay 7 considered 'Outgoing' and 0 for 'Stay at home'. These are treated as the 'correct' labels, to be compared with the labels predicted by the model:

In [4]:
pattern = '|'.join(outgoing_search_phrases)

# add label column to dataframe
okcupidDF['is_outgoing'] = np.where(okcupidDF['essay7'].str.contains(pattern), 1, 0)

print(okcupidDF['is_outgoing'].head())

# do a quick check for values and NaNs just in case
print(okcupidDF.is_outgoing.value_counts(dropna=False))

0    1
1    0
2    1
3    0
4    0
Name: is_outgoing, dtype: int64
1    31141
0    28805
Name: is_outgoing, dtype: int64


Assuming my choices of search phrases are reasonably good, it looks like I've got a pretty balanced dataset between 'outgoing' and 'stay at home' entries. I saved each type (1 and 0) to separate csv files and spent about 15 minutes per file eyeballing the essays. I would say that the classification is pretty good, if I do say so myself... :)
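
One thing worth checking when eyeballing the labels: `str.contains` treats the joined pattern as a regular expression and matches substrings, so a phrase like 'bar' also matches 'barely' and 'show' matches 'shower'. The sketch below compares the substring pattern with a hypothetical stricter variant that requires word boundaries. The essays and the shortened phrase list are made up for illustration, and `word_pattern` is not part of the notebook:

```python
import re
import pandas as pd

# Shortened, illustrative subset of the search phrases
phrases = ["bar", "show", "friend", "go out"]

# Made-up essays for the comparison
essays = pd.Series([
    "barely awake, taking a long shower",  # no outgoing activity intended
    "at a bar with a friend",
    "reading at home",
])

# Substring pattern, built the same way as in the notebook cell
substring_pattern = "|".join(phrases)

# Hypothetical stricter variant: escape each phrase and require word boundaries
word_pattern = r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"

substring_labels = essays.str.contains(substring_pattern).astype(int)
word_labels = essays.str.contains(word_pattern).astype(int)

print(substring_labels.tolist())
print(word_labels.tolist())
```

The trade-off is that word boundaries also stop 'friend' from matching 'friends', so plural and inflected forms would need to be listed explicitly. Whether the stricter rule gives better labels on the real essays is something I have not measured.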

Create a dataframe that's a subset of the okcupidDF. This new dataframe will be the input to the classification model:

In [5]:
feature_data = okcupidDF[['essay7', 'is_outgoing']]

Create training and validation sets. By default, train_test_split holds out 25% of the data for the validation set:

In [6]:
training_set, validation_set = train_test_split(feature_data, random_state = 1)
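
To make the default split concrete, here is a minimal sketch on a made-up eight-row frame with the same two column names. It also shows the optional `stratify` argument, which keeps the ratio of 1s to 0s the same in both parts. The notebook does not use it, so treat it as a possible variation rather than what was run:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same two columns as feature_data (made-up rows)
toy = pd.DataFrame({
    "essay7": [f"essay {n}" for n in range(8)],
    "is_outgoing": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Default split: test_size is left unset, which scikit-learn treats as 0.25
train_default, valid_default = train_test_split(toy, random_state=1)

# Optional variant: stratify keeps the 1/0 ratio the same in both parts
train_strat, valid_strat = train_test_split(
    toy, random_state=1, stratify=toy["is_outgoing"]
)

print(len(train_default), len(valid_default))
print(valid_strat["is_outgoing"].tolist())
```

With a dataset that is already close to balanced, as here, stratifying probably changes little, but it removes one source of random variation between runs.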

Set up the Naive Bayes classifier model and compute the predictions and probabilities for the validation set and the training set:

In [9]:
counter = CountVectorizer()

counter.fit(training_set['essay7'])
# print(counter.vocabulary_)
training_counts = counter.transform(training_set['essay7'])
validation_counts = counter.transform(validation_set['essay7'])

classifier = MultinomialNB()
classifier.fit(training_counts, training_set['is_outgoing'])
predictions_validation = classifier.predict(validation_counts)
probabilities_validation = classifier.predict_proba(validation_counts)
predictions_training = classifier.predict(training_counts)
probabilities_training = classifier.predict_proba(training_counts)
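
The same count-then-classify steps can be sketched on a tiny made-up corpus to see what the model returns. This version wraps `CountVectorizer` and `MultinomialNB` in a scikit-learn pipeline, which is a convenience I am assuming here and not what the cell above does. The texts and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up miniature corpus; 1 = 'Outgoing', 0 = 'Stay at home'
train_texts = [
    "out with friends at a bar",
    "dancing at a club with friends",
    "home with a book and tea",
    "watching movies at home alone",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier in one step
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_texts = ["bar with friends", "tea and a book at home"]
predictions = model.predict(new_texts)
probabilities = model.predict_proba(new_texts)

print(model.classes_)          # column order of predict_proba
print(predictions)
print(probabilities.round(2))
```

The columns of `predict_proba` follow `classes_`, so with labels 0 and 1 the second column is the probability of 'Outgoing'. That is why the later cells index the probabilities with `[i][1]` for outgoing predictions and `[i][0]` for introverted ones.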

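Look at accuracy for the validation predictions. Note that this is a proxy: I'm assuming that a prediction is correct if the probability associated with it is above 75%, so the counts below measure how confident the model is rather than how often it agrees with the 'is_outgoing' labels: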

In [10]:
i = 0
total_predictions = 0
total_predictions_correct = 0
outgoing_predictions = 0
outgoing_predictions_correct = 0
introverted_predictions = 0
introverted_predictions_correct = 0

while i < len(predictions_validation):
    if predictions_validation[i] == 1:
        outgoing_predictions += 1
        total_predictions += 1
        if probabilities_validation[i][1] > 0.75:
            outgoing_predictions_correct += 1
            total_predictions_correct += 1
    if predictions_validation[i] == 0:
        introverted_predictions += 1
        total_predictions += 1
        if probabilities_validation[i][0] > 0.75:
            introverted_predictions_correct += 1
            total_predictions_correct += 1
    i += 1

print('------- Validation Set -------')
print('Total Predictions: ', total_predictions)
print('Total Outgoing Predictions: ', outgoing_predictions)
print('Total Outgoing Predictions Correct: ', outgoing_predictions_correct)
print('Total Introverted Predictions: ', introverted_predictions)
print('Total Introverted Predictions Correct: ', introverted_predictions_correct)
print('Total Predictions Correct: ', total_predictions_correct)

------- Validation Set -------
Total Predictions:  14987
Total Outgoing Predictions:  11990
Total Outgoing Predictions Correct:  8156
Total Introverted Predictions:  2997
Total Introverted Predictions Correct:  2127
Total Predictions Correct:  10283
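
The share of predictions above the 75% threshold (here 10283 of 14987, about 0.69) is a confidence measure, and it does not have to equal accuracy measured against the labels. The sketch below shows the difference on made-up arrays. The numbers are hypothetical and chosen so that one confident prediction is wrong:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels, predictions and class probabilities for six essays
true_labels = np.array([1, 0, 1, 1, 0, 0])
predicted = np.array([1, 1, 1, 0, 0, 0])
probabilities = np.array([
    [0.10, 0.90],
    [0.20, 0.80],   # confident, but wrong
    [0.40, 0.60],
    [0.55, 0.45],
    [0.95, 0.05],
    [0.70, 0.30],
])

# Probability assigned to the predicted class (the larger value in each row)
confidence = probabilities.max(axis=1)

# What the while loop counts: share of predictions above the 0.75 threshold
share_confident = np.mean(confidence > 0.75)

# Accuracy in the usual sense: share of predictions that match the labels
accuracy = accuracy_score(true_labels, predicted)

print(share_confident, accuracy)
```

On the real data the equivalent call would be `accuracy_score(validation_set['is_outgoing'], predictions_validation)`, assuming the variables from the cells above are still in memory.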


Look at Precision, Recall and F1 for the validation set:

In [13]:
print(classification_report(validation_set['is_outgoing'], predictions_validation))

              precision    recall  f1-score   support

           0       0.92      0.38      0.54      7220
           1       0.63      0.97      0.76      7767

   micro avg       0.69      0.69      0.69     14987
   macro avg       0.77      0.68      0.65     14987
weighted avg       0.77      0.69      0.66     14987
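
To see where a pattern like "high precision, low recall for class 0" comes from, here is a minimal sketch that derives both numbers from a confusion matrix. The eight labels and predictions are made up so that the model over-predicts class 1, loosely mimicking the report above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up example in which the model predicts 1 far more often than 0
true_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
predicted = np.array([0, 1, 1, 1, 1, 1, 1, 1])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(true_labels, predicted).ravel()

# Class 0 ('Stay at home'): few predictions, but all of them right
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

# Class 1 ('Outgoing'): catches every true 1, at the cost of false alarms
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

print(precision_0, recall_0)
print(round(precision_1, 2), recall_1)

# Cross-check against scikit-learn's own functions
print(precision_score(true_labels, predicted, pos_label=0),
      recall_score(true_labels, predicted, pos_label=0))
```

Read this way, a precision of 0.92 with a recall of 0.38 for class 0 means the model rarely says 'Stay at home', is usually right when it does, and labels most of the true 'Stay at home' essays as 'Outgoing'.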



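Look at accuracy for the training predictions. As with the validation set, this is a proxy: I'm assuming that a prediction is correct if the probability associated with it is above 75%, so the counts measure model confidence rather than agreement with the 'is_outgoing' labels: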

In [14]:
i = 0
total_predictions = 0
total_predictions_correct = 0
outgoing_predictions = 0
outgoing_predictions_correct = 0
introverted_predictions = 0
introverted_predictions_correct = 0

while i < len(predictions_training):
    if predictions_training[i] == 1:
        outgoing_predictions += 1
        total_predictions += 1
        if probabilities_training[i][1] > 0.75:
            outgoing_predictions_correct += 1
            total_predictions_correct += 1
    if predictions_training[i] == 0:
        introverted_predictions += 1
        total_predictions += 1
        if probabilities_training[i][0] > 0.75:
            introverted_predictions_correct += 1
            total_predictions_correct += 1
    i += 1

print('------- Training Set -------')
print('Total Predictions: ', total_predictions)
print('Total Outgoing Predictions: ', outgoing_predictions)
print('Total Outgoing Predictions Correct: ', outgoing_predictions_correct)
print('Total Introverted Predictions: ', introverted_predictions)
print('Total Introverted Predictions Correct: ', introverted_predictions_correct)
print('Total Predictions Correct: ', total_predictions_correct)

------- Training Set -------
Total Predictions:  44959
Total Outgoing Predictions:  35430
Total Outgoing Predictions Correct:  24346
Total Introverted Predictions:  9529
Total Introverted Predictions Correct:  7319
Total Predictions Correct:  31665
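
Since the validation and training cells run the same loop, the counting could be pulled into one function. The sketch below is a hypothetical helper (`summarize_confidence` is my own name, not something in the notebook) that uses NumPy indexing in place of the `while` loop. It assumes the class labels are 0 and 1, so that a label can double as a column index into the probabilities:

```python
import numpy as np

def summarize_confidence(predictions, probabilities, threshold=0.75):
    """Count predictions per class and how many exceed the threshold.

    Hypothetical helper mirroring the while loops; assumes labels 0 and 1.
    """
    predictions = np.asarray(predictions)
    probabilities = np.asarray(probabilities)
    # Probability of the class that was actually predicted for each row
    chosen = probabilities[np.arange(len(predictions)), predictions]
    confident = chosen > threshold
    outgoing = predictions == 1
    introverted = predictions == 0
    return {
        "total": int(len(predictions)),
        "outgoing": int(outgoing.sum()),
        "outgoing_confident": int((outgoing & confident).sum()),
        "introverted": int(introverted.sum()),
        "introverted_confident": int((introverted & confident).sum()),
        "total_confident": int(confident.sum()),
    }

# Made-up inputs to exercise the helper
demo_predictions = [1, 1, 0, 0]
demo_probabilities = [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2], [0.55, 0.45]]
summary = summarize_confidence(demo_predictions, demo_probabilities)
print(summary)
```

Calling `summarize_confidence(predictions_training, probabilities_training)` should reproduce the training counts above, and the same call with the validation arrays the validation counts, though I have not run it against the real data.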


Look at Precision, Recall and F1 for the training set:

In [16]:
print(classification_report(training_set['is_outgoing'], predictions_training))

              precision    recall  f1-score   support

           0       0.93      0.41      0.57     21585
           1       0.64      0.97      0.77     23374

   micro avg       0.70      0.70      0.70     44959
   macro avg       0.79      0.69      0.67     44959
weighted avg       0.78      0.70      0.68     44959



## Step 2 Conclusions:

Accuracy for both the training and validation sets was around 70%, which is a fair summary since the 'outgoing' vs 'stay at home' dataset is reasonably balanced. Precision suggests that our model is really good at predicting 'stay at home' (0.92 on the validation set) but not so good at predicting the 'outgoing' activities (0.63). This is most likely due to the way I set this up: there is certainly more specificity in looking for 'outgoing' activities.

I'm starting to think that this specificity in looking for 'outgoing' activities is showing up in the recall info as well (0.38 for 'stay at home' versus 0.97 for 'outgoing' on the validation set). Perhaps I've inadvertently introduced bias into the dataset. The more balanced F1 score seems to be reflecting this as well.
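
As a check on how the F1 scores combine precision and recall, the harmonic-mean formula can be applied to the validation-set values. This is a worked example using the rounded figures from the report, so the last digit could differ slightly from what scikit-learn computes internally. For 'stay at home' (class 0):

$$F_1 = \frac{2PR}{P + R} = \frac{2(0.92)(0.38)}{0.92 + 0.38} = \frac{0.6992}{1.30} \approx 0.54$$

and for 'outgoing' (class 1):

$$F_1 = \frac{2(0.63)(0.97)}{0.63 + 0.97} = \frac{1.2222}{1.60} \approx 0.76$$

Both agree with the reported 0.54 and 0.76. Because the harmonic mean is pulled toward the smaller of the two inputs, the low recall for class 0 drags its F1 well below its precision.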

Some of this bias might have been prevented if OKCupid had asked its clients whether they considered themselves 'outgoing' or 'introverted'. I could then have used that answer as the label to be predicted by the model.