## Try 9.3.1: Naive Bayes classification in Python.

In [1]:
# Import packages and functions

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [3]:
# Read in the data and view the first five instances.
# File does not include column headers so they are provided via names.
messages = pd.read_table('SMSSpamCollection.csv', names=['Class', 'Message'])
messages.head()

| | Class | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [7]:
# Split into testing and training sets
X_train, X_test, Y_train, Y_test = train_test_split(
    messages['Message'], messages['Class'], random_state=123
)
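The split above does not set `test_size`, so scikit-learn falls back to its default of 25% test data. The sketch below illustrates that default and the effect of `random_state` on a hypothetical 100-item dataset; the names `toy_items` and `toy_classes` are made up for illustration and are not part of the SMS data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: 100 items, 20 labeled spam and 80 labeled ham
toy_items = list(range(100))
toy_classes = ["spam"] * 20 + ["ham"] * 80

# With no test_size or train_size given, 25% of the rows go to the test set
a_train, a_test, b_train, b_test = train_test_split(
    toy_items, toy_classes, random_state=123
)
print(len(a_train), len(a_test))

# Reusing the same random_state reproduces exactly the same split
a_train_again, a_test_again, _, _ = train_test_split(
    toy_items, toy_classes, random_state=123
)
print(a_train == a_train_again)
```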

In [9]:
# Count the words that appear in the messages
vectorizer = CountVectorizer(ngram_range=(1, 1))
vectorizer.fit(X_train)
# Uncomment the line below to see the words.
#vectorizer.vocabulary_

In [11]:
# Count the words in the training set and store in a matrix
X_train_vectorized = vectorizer.transform(X_train)
X_train_vectorized

<4179x7466 sparse matrix of type '<class 'numpy.int64'>'
	with 55283 stored elements in Compressed Sparse Row format>

In [13]:
# Initialize the model and fit with the training data
NBmodel = MultinomialNB()
NBmodel.fit(X_train_vectorized, Y_train)
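To give a sense of what fitting estimates, the sketch below recomputes the smoothed per-class word probabilities by hand and compares them with the fitted model. It assumes the default `alpha=1.0` (add-one smoothing) and uses a hypothetical 4-message, 3-word count matrix rather than the SMS data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 messages (rows) x 3 words (columns)
toy_X = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
toy_y = np.array(["spam", "spam", "ham", "ham"])

toy_nb = MultinomialNB()  # default alpha=1.0
toy_nb.fit(toy_X, toy_y)

# By hand: (count of word in ham + 1) / (total ham word count + vocabulary size)
ham_counts = toy_X[toy_y == "ham"].sum(axis=0)
manual_ham = (ham_counts + 1) / (ham_counts.sum() + toy_X.shape[1])

# classes_ is sorted, so row 0 of feature_log_prob_ corresponds to "ham"
sklearn_ham = np.exp(toy_nb.feature_log_prob_[0])

print(manual_ham)
print(sklearn_ham)
```

The add-one term keeps a word that never appears in a class from driving that class's probability to zero.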

In [15]:
# Make predictions on the training and testing sets.
trainPredictions = NBmodel.predict(vectorizer.transform(X_train))
testPredictions = NBmodel.predict(vectorizer.transform(X_test))

In [17]:
# How does the model work on the training set?
confusion_matrix(Y_train, trainPredictions)

array([[3615,    9],
       [  16,  539]], dtype=int64)

In [19]:
# Display that in terms of correct proportions
confusion_matrix(Y_train, trainPredictions, normalize='true')

array([[0.99751656, 0.00248344],
       [0.02882883, 0.97117117]])

About 99.7% of real (ham) messages are classified correctly.
Just under 3% of spam messages are misclassified as real.
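The proportions can be checked directly from the raw counts: `normalize='true'` divides each row by the number of true instances in that class. The sketch below copies the training counts from the output above and also computes overall accuracy, which the notebook itself does not report.

```python
import numpy as np

# Counts copied from the training confusion matrix
# (rows: true ham, true spam; columns: predicted ham, predicted spam)
train_counts = np.array([[3615, 9],
                         [16, 539]])

# Divide each row by its total, i.e. by the number of true instances per class
row_proportions = train_counts / train_counts.sum(axis=1, keepdims=True)
print(row_proportions)

# Overall accuracy: correct predictions (the diagonal) over all instances
train_accuracy = np.trace(train_counts) / train_counts.sum()
print(train_accuracy)
```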

In [22]:
# How does the model work on the test set?
confusion_matrix(Y_test, testPredictions, normalize='true')

array([[0.99333888, 0.00666112],
       [0.08854167, 0.91145833]])

About 8.9% of spam messages are classified as real in the test data and about 0.7% of real messages are classified as spam.

In [25]:
# Predict some phrases. Add your own.
NBmodel.predict(
    vectorizer.transform(
        ["Big sale today! Free cash.",
        "I'll be there in 5"]))

array(['spam', 'ham'], dtype='<U4')
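`predict` returns only the most likely label. If the estimated class probabilities are of interest, `predict_proba` exposes them. The sketch below uses a hypothetical four-message training set so that it runs on its own; the texts and names are illustrative assumptions, not the SMS data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data, for illustration only
toy_texts = ["free cash prize", "win free entry",
             "see you at lunch", "running late see you soon"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

new_texts = ["free prize", "see you soon"]
new_counts = toy_vectorizer.transform(new_texts)

toy_preds = toy_model.predict(new_counts)
toy_probs = toy_model.predict_proba(new_counts)

# Columns of predict_proba follow the order of toy_model.classes_
print(toy_model.classes_)
print(toy_preds)
print(toy_probs.round(3))
```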

## Challenge activity 9.3.2: Naive Bayes using scikit-learn.

## 1)
**One surprising recent scam is posting fake jobs. Resumes often contain personal information such as emails and phone numbers, which can be used to steal a person's identity.**

**Researchers collected data on a random sample of 50 job posts, of which 10 are fake. Since job posts are text data, some additional pre-processing is needed.**

* **Reformat the company profiles using CountVectorizer().**

  
**The code contains all imports and loads the dataset. The provided print statement displays all unique words in the company profiles.**

In [27]:


# Import dataset
jobPosts = pd.read_csv('job_posts.csv')

# Create input matrix X and output matrix y
X = jobPosts['company_profile']
y = jobPosts['fake']

# Reformat job posts using CountVectorizer()
vectorizer = CountVectorizer()
vectorizer = vectorizer.fit(X)
XVectorized = vectorizer.transform(X)

# View word list
print(vectorizer.get_feature_names_out())

['abc' 'about' 'accion' 'actioniq' 'aerial' 'affordable' 'agency'
 'aggressive' 'ago' 'aker' 'amani' 'an' 'and' 'anyone' 'apartment'
 'application' 'aptitude' 'are' 'as' 'at' 'babbel' 'based' 'bb' 'been'
 'began' 'benefits' 'best' 'boating' 'bradley' 'breakthrough' 'build'
 'building' 'builds' 'business' 'busting' 'by' 'capital' 'care'
 'carepartners' 'cares' 'change' 'changing' 'charleston' 'co' 'coming'
 'committed' 'communications' 'companies' 'company' 'consumer'
 'contracting' 'corporation' 'created' 'crest' 'csdcsd' 'customer'
 'dealership' 'delivering' 'delivers' 'delivery' 'demand' 'develop'
 'developing' 'develops' 'dice' 'digital' 'distributor' 'drives' 'due'
 'easy' 'ecommerce' 'edison' 'emerging' 'enables' 'enabling' 'enterprise'
 'established' 'everyone' 'executive' 'face' 'family' 'fans' 'finally'
 'finance' 'financial' 'first' 'focused' 'for' 'force' 'formerly' 'fort'
 'fundamentally' 'generation' 'geography' 'gets' 'global' 'going'
 'governesses' 'governors' 'great' 'gr

## 2)
**Researchers collected data on a random sample of 100 job posts, of which 20 are fake.**

* **Initialize a multinomial naive Bayes model to classify company profiles as real or fake.**

  
**The code contains all imports, and loads and processes the dataset. The provided print statement prints a confusion matrix for evaluating the model.**

In [29]:



# Create input matrix X and output matrix y
X = jobPosts['company_profile']
y = jobPosts['fake']

# Reformat job posts using CountVectorizer()
vectorizer = CountVectorizer(ngram_range = (1,1))
vectorizer = vectorizer.fit(X)
XVectorized = vectorizer.transform(X)

NBModel = MultinomialNB()
NBModel = NBModel.fit(XVectorized, y)

# Print confusion matrix (scikit-learn expects true labels first, then predictions)
pred = NBModel.predict(vectorizer.transform(X))
print(confusion_matrix(y, pred))

[[40  0]
 [ 0 10]]
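Because the matrix above is diagonal, it reads the same whichever way the arguments are ordered. In general, `confusion_matrix` treats its first argument as the true labels (rows) and its second as the predictions (columns), so swapping them transposes the result. A small sketch with hypothetical labels shows the difference.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fake, 0 = real
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Swapping the arguments transposes the matrix
cm_swapped = confusion_matrix(y_pred, y_true)
print(cm_swapped)
```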


## 3)
**Researchers collected data on a random sample of 100 job posts, of which 20 are fake.**

* **Initialize a multinomial naive Bayes model, NBModel, to classify job benefits as real or fake.**
* **Fit NBModel.**
  
**The code contains all imports, and loads and processes the dataset. The provided print statement prints a confusion matrix for evaluating the model.**

In [31]:


# Create input matrix X and output matrix y
X = jobPosts['benefits']
y = jobPosts['fake']

# Reformat job posts using CountVectorizer()
vectorizer = CountVectorizer(ngram_range = (1,1))
vectorizer = vectorizer.fit(X)
XVectorized = vectorizer.transform(X)

# Your code goes here
NBModel = MultinomialNB()
NBModel = NBModel.fit(XVectorized, y)
# Print confusion matrix (scikit-learn expects true labels first, then predictions)
pred = NBModel.predict(vectorizer.transform(X))
print(confusion_matrix(y, pred))

[[40  0]
 [ 0 10]]
