# Mystery Friend: Author Identification Using Multinomial Naive Bayes and NLP Techniques

In this project, we'll build a text classifier to determine the author of an anonymous postcard based on writing style. Using a machine learning approach, we'll classify the postcard as written by one of three friends: Emma Goldman, Matthew Henson, or TingFang Wu. We'll do this with scikit-learn's Bag-of-Words (BoW) vectorizer and its Multinomial Naive Bayes classifier.

Just as a spam filter classifies a message as spam or not spam, a friend-writing classifier can assign a piece of writing to one friend or another. Past writing from all three friends is collected in the variable `friends_docs`, so we can use scikit-learn's bag-of-words vectorizer and Naive Bayes classifier to work out who the mystery friend is!

### Objective
The objective is to classify the author of the postcard based on its content by learning from a set of previously written documents from each friend. We will train a Naive Bayes classifier, which is particularly well-suited for text classification tasks, and use it to predict the most likely author of the mystery postcard.

---

## Feature Vectors Are in the Bag with Scikit-Learn

### Step 1: Importing Required Libraries

We will begin by importing the necessary classes from scikit-learn. `CountVectorizer` converts the text data into numerical features (word counts), and `MultinomialNB` is the Naive Bayes classifier we will train on those counts.

In [1]:
# import sklearn modules here:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

### Step 2: Vectorizing the Text Data

The CountVectorizer will convert our textual data (the writings of each friend) into a matrix of word counts. This process is known as vectorization. By doing this, we can represent each document as a numeric vector, where each dimension corresponds to a unique word, and the value represents the frequency of that word in the document.

In [2]:
# Create bow_vectorizer:
bow_vectorizer = CountVectorizer()
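
To make the word-count matrix concrete, here is a minimal sketch on two made-up sentences (they are illustrative only and not taken from the friends' writings). The names `toy_docs` and `toy_vectorizer` are assumptions introduced for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny, made-up documents (not from the friends' writings)
toy_docs = ["the ship met the ice", "the storm broke the ice"]

toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
counts = toy_vectors.toarray().tolist()

print(vocab)   # one column per unique word
print(counts)  # one row per document, one count per column
```

Each row is one document and each column is one word, so "the" (which appears twice in both sentences) gets a count of 2 in both rows.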

In [3]:
# The following line is a lightweight way to install new packages. The `import_ipynb` package is needed to import the friends' documents in Step 3.
%pip install import_ipynb

Note: you may need to restart the kernel to use updated packages.


### Step 3: Prepare the Training Data

We will now load the writings of our three friends (Emma Goldman, Matthew Henson, and TingFang Wu) from the variables `goldman_docs`, `henson_docs`, and `wu_docs`. Each variable is a list of that friend's writings. We then combine the three lists and fit the vectorizer on the combined documents.

In [4]:
# Import documents from each friend's writings
import import_ipynb
from goldman_emma_raw import goldman_docs
from henson_matthew_raw import henson_docs
from wu_tingfang_raw import wu_docs

# Combine all friend documents into a single list
friends_docs = goldman_docs + henson_docs + wu_docs

# Vectorize all friend documents using the bag-of-words model
friends_vectors = bow_vectorizer.fit_transform(friends_docs)

### Step 4: Vectorizing the Mystery Postcard

We now need to vectorize the mystery postcard using the same `bow_vectorizer` that we fitted on our friends' writings. The mystery postcard is a string, and it must be transformed into a vector with the same columns as the training data before we can make a prediction.

- Define the mystery postcard's text
- Vectorize the mystery postcard

In [5]:
mystery_postcard = """
My friend,
From the 10th of July to the 13th, a fierce storm raged, clouds of
freezing spray broke over the ship, encasing her in a coat of icy mail,
and the tempest forced all of the ice out of the lower end of the
channel and beyond as far as the eye could see, but the _Roosevelt_
still remained surrounded by ice.
Hope to see you soon.
"""

mystery_vector = bow_vectorizer.transform([mystery_postcard])
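
One consequence of reusing a fitted vectorizer is worth seeing in isolation: `transform` only counts words that were in the vocabulary learned during fitting, and silently ignores the rest. The sketch below shows this on a made-up sentence; `toy_vectorizer` is a hypothetical name used only here.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer()
# Learned vocabulary, in column order: ice, met, ship, the
toy_vectorizer.fit(["the ship met the ice"])

# "storm", "raged" and "over" were never seen during fit, so they get no column
unseen_vector = toy_vectorizer.transform(["the storm raged over the ship"])
row = unseen_vector.toarray().tolist()[0]

print(row)  # counts for ice, met, ship, the
```

Words on the postcard that none of the three friends used in the training documents therefore have no influence on the prediction.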

## This Mystery Friend Gets Classified

### Step 5: Inspecting the Data

Before proceeding, it's good to inspect some sample documents from each friend to get a sense of their writing styles.

- We print a sample document from each friend

In [6]:
print("Emma Goldman Sample: ", goldman_docs[49])
print("Matthew Henson Sample: ", henson_docs[49])
print("TingFang Wu Sample: ", wu_docs[49])

Emma Goldman Sample:   What he gives to the world is only gray and hideous
things, reflecting a dull and hideous existence,--too weak to live,
too cowardly to die
Matthew Henson Sample:  Miss Marie Ahnighito Peary, aged about ten months, who
first saw the light of day at Anniversary Lodge on the 12th of the
previous September, was taken by her mother to her kinfolks in the
South
TingFang Wu Sample:   Let us, for instance, compare England with the United
States


### Step 6: Creating the Classifier

Next, we will create a Naive Bayes classifier using MultinomialNB, a variant of Naive Bayes that is well-suited for text classification tasks where features are counts of words.

- Initialize the Naive Bayes classifier

In [7]:
friends_classifier = MultinomialNB()
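
To give a feel for what a multinomial Naive Bayes model computes from word counts, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: it assumes whitespace tokenization, uses made-up documents, and the helper names `train_nb` and `predict_nb` are hypothetical. The add-one smoothing mirrors the default `alpha=1.0` of `MultinomialNB`.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Smoothed denominator: total words in class + alpha per vocabulary word
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(text, log_prior, log_likelihood):
    """Return the class with the highest log prior + summed log likelihoods."""
    scores = {}
    for c in log_prior:
        known = [w for w in text.split() if w in log_likelihood[c]]
        scores[c] = log_prior[c] + sum(log_likelihood[c][w] for w in known)
    return max(scores, key=scores.get)

# Made-up training data for illustration only
toy_docs = ["ice ship storm", "ice ship north", "liberty speech", "liberty freedom speech"]
toy_labels = ["Matthew", "Matthew", "Emma", "Emma"]

log_prior, log_likelihood = train_nb(toy_docs, toy_labels)
toy_guess = predict_nb("storm ice", log_prior, log_likelihood)
print(toy_guess)
```

Working in log space avoids multiplying many small probabilities together, and smoothing keeps a word that a class never used from forcing that class's probability to zero.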

### Step 7: Training the Classifier

Now we train the Naive Bayes classifier on the vectorized documents from each friend. We'll also define the labels for each document, which indicate the true author of each piece of writing. Then we will train the classifier:

In [8]:
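# One label per document, in the same order the documents were combined
# (154 Goldman, 141 Henson, and 166 Wu documents)
friends_labels = (
    ["Emma"] * len(goldman_docs)
    + ["Matthew"] * len(henson_docs)
    + ["Tingfang"] * len(wu_docs)
)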

friends_classifier.fit(friends_vectors, friends_labels)

### Step 8: Making Predictions

With the classifier trained, we can now make predictions on the mystery postcard. We use the `predict()` method to determine which friend is most likely the author.

- We can predict the author of the mystery postcard
- Also, we can print the probabilities of each prediction to see the results

In [9]:
predictions = friends_classifier.predict(mystery_vector)
print("Predicted Author is: ", predictions)

predictions_prob = friends_classifier.predict_proba(mystery_vector)
print("Prediction Probabilities: ", predictions_prob)

Predicted Author is:  ['Matthew']
Prediction Probabilities:  [[1.88829861e-02 9.81114912e-01 2.10168519e-06]]
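
The probabilities come back as a bare array, so it helps to pair them with the classifier's `classes_` attribute, which lists the labels in sorted order. The sketch below shows the pattern on a made-up three-document training set; the `toy_` names and `by_author` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data; labels are deliberately not in alphabetical order
toy_docs = ["ice ship storm", "liberty speech freedom", "china england trade"]
toy_labels = ["Matthew", "Emma", "Tingfang"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

probs = toy_classifier.predict_proba(
    toy_vectorizer.transform(["storm over the ice"])
)[0]

# classes_ gives the label for each probability column
by_author = dict(zip(toy_classifier.classes_, probs))
for author, p in by_author.items():
    print(f"{author}: {p:.3f}")
```

Applied to the notebook output above, the columns correspond to Emma, Matthew, and Tingfang in that order, so the second value (about 0.981) is the probability assigned to Matthew.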


## Mystery Revealed!

### Step 9: Final Reveal

Finally, we display the result to reveal the mystery author of the postcard.

- Print the mystery friend's name

In [10]:
mystery_friend = predictions[0] if predictions[0] else "someone else"
print(f"The postcard was from {mystery_friend}!")

The postcard was from Matthew!


### Step 10: Testing with New Data

You can experiment further by adding text from other sources, such as books or emails, to test how well the classifier generalizes. For example, you can fetch some writings from Project Gutenberg (gutenberg.org), transform them with the same fitted vectorizer, and observe the classifier's predictions.

- Example of adding a new sample and predicting the author

In [11]:
new_sample = "Sample text from a new document"
new_sample_vector = bow_vectorizer.transform([new_sample])
new_sample_prediction = friends_classifier.predict(new_sample_vector)
print("Predicted Author for New Sample: ", new_sample_prediction)

Predicted Author for New Sample:  ['Tingfang']
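
A single prediction says little about how well the classifier generalizes. One common way to measure it, sketched below under stated assumptions, is to hold out part of the labeled documents and score the classifier on them. The `docs` and `labels` lists are made-up stand-ins; in the notebook they would be `friends_docs` and `friends_labels`. The split size and `random_state` are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-ins for friends_docs and friends_labels
docs = [
    "ice ship storm north", "sledge ice dogs north",
    "ship ice captain storm", "dogs sledge snow ice",
    "liberty speech freedom law", "freedom workers liberty strike",
    "speech prison liberty law", "workers strike freedom speech",
    "china england trade custom", "custom dress china america",
    "america england trade china", "dress custom trade america",
]
labels = ["Matthew"] * 4 + ["Emma"] * 4 + ["Tingfang"] * 4

# Hold out a quarter of the documents, keeping one per author in the test set
train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vectorizer on the training documents only, then reuse it for the test set
vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_docs), train_labels)

accuracy = classifier.score(vectorizer.transform(test_docs), test_labels)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Fitting the vectorizer only on the training split keeps the held-out documents truly unseen, which is the same situation the mystery postcard is in.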



---

## Conclusion

In this project, we used a text classification approach to determine the author of an anonymous postcard. By vectorizing the text with a Bag-of-Words model and applying a Naive Bayes classifier, we identified Matthew Henson as the most likely author, with a predicted probability of about 0.98. This shows how natural language processing (NLP) techniques can be applied to text classification, where the goal is to label text based on patterns learned from training data.

## Insights:

- Naive Bayes Effectiveness: The Naive Bayes classifier can be quite effective in text classification tasks, even with simple representations like Bag-of-Words.
- Text Vectorization: The effectiveness of the classifier depends heavily on the quality of the vectorization process. Techniques like TF-IDF or word embeddings could further improve performance.
- Real-World Application: The approach demonstrated here could be applied to other tasks, such as spam detection, sentiment analysis, and topic classification, all of which rely on similar methods.
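
As a starting point for trying TF-IDF, the sketch below swaps `CountVectorizer` for scikit-learn's `TfidfVectorizer` and chains it with `MultinomialNB` in a pipeline. The training sentences are made up, and whether TF-IDF actually helps on the friends' documents is an open question that would need a held-out evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration only
toy_docs = [
    "ice ship storm north",
    "liberty speech freedom law",
    "china england trade custom",
]
toy_labels = ["Matthew", "Emma", "Tingfang"]

# The pipeline vectorizes raw strings and classifies them in one step
tfidf_classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
tfidf_classifier.fit(toy_docs, toy_labels)

tfidf_guess = tfidf_classifier.predict(["the storm drove ice against the ship"])[0]
print(tfidf_guess)
```

A pipeline also removes the need to call `transform` by hand on each new sample, since raw strings can be passed straight to `predict`.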