We'll start by using the [markovify](https://github.com/jsvine/markovify/) library to make some individual sentences in the style of Jane Austen.  These will be the basis for generating a stream of synthetic documents.

In [1]:
import markovify
import random

# markovify draws from Python's global `random` generator, so a notebook that
# uses it is only reproducible if each cell that generates text sets the seed first
random.seed(0xbaff1ed)

# the Austen corpus is stored in Windows-1252, so name the encoding explicitly
with open("data/austen.txt", "r", encoding="cp1252") as f:
    text = f.read()

austen_model = markovify.Text(text, retain_original=False, state_size=3)

for i in range(10):
    print(austen_model.make_short_sentence(200))

Such and such-like were the reasonings of Sir Thomas, and in smaller concerns by her sister.
I began to think my caro sposo would be absolutely jealous.
The introduction, however, was immediately made; and she had been last together; much less could her feelings acquit her of having made Harriet unhappy.
Mr. Bennet and his daughters saw all the impropriety of her father's comfort, perhaps even of his life, and you know young people like to be married to a Miss Hawkins.
She could not have believed them.
She was less handsome than her brother; but she could not be a doubt of your secrecy.
I speak feelingly.
She looked about her with due consideration, and found almost everything in his favour, should think highly of himself.
Well, she went on to say something sensible, but knew not what answer she returned to the Park, and Elinor was not blinded by the beauty, or the shrewd look of the youngest, to her want of sense.
The wish was rather eager than lasting.
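
To make the role of `state_size` concrete, here is a minimal sketch of the idea behind a word-level Markov chain. It is a toy illustration written for this notebook, not markovify's actual implementation; `build_chain`, `generate`, and `toy_corpus` are hypothetical names, and the sketch uses a state of two words so that a three-sentence corpus can still produce something new.

```python
import random
from collections import defaultdict

BEGIN, END = "__BEGIN__", "__END__"  # hypothetical sentence boundary markers

def build_chain(sentences, state_size=2):
    """Map each run of `state_size` words to the words observed after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, state_size=2, rng=random):
    """Walk the chain from the start state until the end marker is drawn."""
    state = (BEGIN,) * state_size
    out = []
    while True:
        nxt = rng.choice(chain[state])
        if nxt == END:
            return " ".join(out)
        out.append(nxt)
        state = state[1:] + (nxt,)

toy_corpus = [
    "she was not happy",
    "she was very happy",
    "he was not sorry",
]
toy_chain = build_chain(toy_corpus, state_size=2)

# Reseeding the global generator before each run reproduces the same sentence.
random.seed(0xbaff1ed)
first = generate(toy_chain)
random.seed(0xbaff1ed)
second = generate(toy_chain)
print(first)
print(first == second)
```

Because "was not" is followed by both "happy" and "sorry" in the toy corpus, the chain can emit sentences such as "she was not sorry" that never appear in the input. A larger state makes the output more faithful to the source text but less varied.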


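Constructing single sentences is interesting, but we'd really rather construct larger documents. Here we'll construct a series of documents whose lengths vary: each one gets a Poisson-distributed number of sentences (mean five) plus one more, so that no document is empty and the average is about six sentences.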

In [2]:
from scipy.stats import poisson
import numpy as np

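def make_basic_documents(sentence_count=5, document_count=1, model=austen_model, seed=None):
    """Return `document_count` documents of Markov-generated sentences.

    Each document has 1 + Poisson(sentence_count) sentences, so none is empty.
    """
    def shortsentence(ct):
        # make_short_sentence returns None when no sentence can be built; skip those
        sentences = (model.make_short_sentence(200) for _ in range(ct + 1))
        return " ".join(s for s in sentences if s is not None)

    if seed is not None:
        # seed both the Python generator and the NumPy one used by SciPy
        random.seed(seed)
        np.random.seed(seed)

    return [shortsentence(ct) for ct in poisson.rvs(sentence_count, size=document_count)]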

for doc in make_basic_documents(5, 10, seed=0xdecaf):
    print(doc)
    print("\n###\n")

We cannot have two Agathas, and we must have one Cottager's wife; and I am afraid you will be so well off. JOHNSON TO LADY S. VERNON Edward Street. We were all alive.

###

And we agreed it would be delightful. By one measure I might have written home. I am not sorry for, as I know you have the art of pleasing--the art of pleasing, at least, at Kellynch Hall; and who had opportunities of seeing me. By this time, the subject was equally convinced that it is sometimes carried a little too nice. Jane was not happy. Mr Elliot had made a change indeed! It was only necessary to mention any favourite amusement to engage her to talk.

###

He shook his head. I see it in her eyes, seemed all that he wished. A very proper compliment!--and then follows the application, which I think, my dear Harriet, you cannot find much difficulty in comprehending. Here Fanny, who had hoped to see. The necessity of the measure in a pecuniary light, and the hope of her being well principled and religious.

###

A
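
The effect of the `ct + 1` offset on document length can be checked with a quick simulation. This is a dependency-free sketch: instead of `scipy.stats.poisson` it draws Poisson variates with Knuth's multiplication method, and `poisson_draw` is a hypothetical helper introduced only for this illustration.

```python
import math
import random

def poisson_draw(mean, rng):
    """Knuth's method: count uniform draws until their product falls to exp(-mean)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(0xdecaf)
draws = [poisson_draw(5, rng) for _ in range(20000)]

# Same offset as range(ct + 1) in make_basic_documents.
sentence_counts = [ct + 1 for ct in draws]

mean_draw = sum(draws) / len(draws)
mean_sentences = sum(sentence_counts) / len(sentence_counts)
shortest = min(sentence_counts)

print(f"mean Poisson draw:       {mean_draw:.2f}")
print(f"mean sentences per doc:  {mean_sentences:.2f}")
print(f"shortest document:       {shortest} sentence(s)")
```

The raw draws average close to five and can be zero, while the sentence counts average close to six and never drop below one.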

We're going to use the Austen model as the main basis for _legitimate messages_ in our sample data set.  For the _spam messages_, we'll train two Markov models on positive and negative product reviews (taken from the [public-domain Amazon fine foods reviews dataset on Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews/)).  We'll combine the models from these sources in different proportions so that all words are _possible_ in certain kinds of messages but some words are _more likely_ in legitimate messages or in spam.

In [3]:
import gzip

def train_markov_gz(fn):
    """ trains a Markov model on gzipped text data """
    with gzip.open(fn, "rt", encoding="utf-8") as f:
        text = f.read()
    return markovify.Text(text, retain_original=False, state_size=3)

negative_model = train_markov_gz("data/reviews-1.txt.gz")
positive_model = train_markov_gz("data/reviews-5-100k.txt.gz")

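We can combine these models with relative weights, but the results can read oddly: because all three sources feed a single model, one document (or even one sentence) may switch between Austen and product reviews.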

In [4]:
legitimate_model = markovify.combine([austen_model, negative_model, positive_model], [196, 2, 2])
spam_model = markovify.combine([austen_model, negative_model, positive_model], [3, 30, 40])
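
One way to think about the weights above is as multipliers on each source's transition counts. The sketch below is a toy illustration of that idea under the assumption that combining works by scaling and summing counts; it is not markovify's code, and `combine_counts`, `follower_probability`, and the one-state tables are hypothetical.

```python
from collections import Counter

def combine_counts(tables, weights):
    """Sum per-state follower counts after scaling each table by its weight."""
    combined = {}
    for table, weight in zip(tables, weights):
        for state, followers in table.items():
            bucket = combined.setdefault(state, Counter())
            for word, count in followers.items():
                bucket[word] += count * weight
    return combined

def follower_probability(table, state, word):
    """Probability that `word` follows `state` in a count table."""
    followers = table[state]
    return followers[word] / sum(followers.values())

# Hypothetical tables: which word follows "very" in each source.
austen_like = {("very",): {"agreeable": 10}}
negative_like = {("very",): {"stale": 10}}
positive_like = {("very",): {"tasty": 10}}
sources = [austen_like, negative_like, positive_like]

# Same weights as the notebook cell above.
legit = combine_counts(sources, [196, 2, 2])
spam = combine_counts(sources, [3, 30, 40])

state = ("very",)
for name, table in [("legitimate", legit), ("spam", spam)]:
    probs = {w: round(follower_probability(table, state, w), 3) for w in table[state]}
    print(name, probs)
```

In this toy setup every word stays possible in both models, but "agreeable" dominates the legitimate table and "tasty" dominates the spam table. In real models the sources also differ in size, so the weights should not be read as literal mixing probabilities.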

In [5]:
# seed both the Python generator and the NumPy one used by SciPy
random.seed(0xc0ffee)
np.random.seed(0xc0ffee)

for s in make_basic_documents(5, 20, legitimate_model):
    print(s)
    print("\n###\n")

Please keep a close eye on them because they are dried, so they can't travel around in search of them. This information made Elizabeth smile, as she thought of poor Miss Bingley. Some members of their society sent away, and the horses were baited, he was off. But every thing was soon in a fair way of soon knowing by heart. So far, the worst coffee I ever tried in my Keurig machine, socially responsible and sensitive way they're sourced and manufactured. That will just do for me, you know, to interfere. Well, I never observed that.

###

She had fallen into good hands earlier. I am so very unwell! I fancy Lord S. is very good-humoured and pleasant in his own apartment, had they sat in one equally lively; and she gave herself up for lost.

###

It was to herself an amusing and a very respectable man, though his name was Lindsay--for particular reasons however I shall conceal it under that of Talbot. What a delightful ball we had last night. But they always do, you know. Elizabeth replied

In [6]:
random.seed(0xf00)
np.random.seed(0xf00)

for s in make_basic_documents(5, 20, spam_model):
    print(s)
    print("\n###\n")

Give it a try! I just added 2 tbsp to the dish when making it; reminiscent of Kraft Mac and Cheese. Practically inedible. Finally, I put my pet on Iams and the digestive system. There are other sources for this product. Aside from being a very yummy tasting brownie. Spinach and mango? Its a wonder I even can recall what it tasted like.

###

I love them for breakfast. Poison.They are not exaggerated reviews - it is lightly crunchy without destroying my dental work. That oiliness is natural and organic--no fields of wheat were sprayed with perfume and then dipped in tropical syrup. Even as I held the treat in half. It has a wonderfully light coconut, mango flavor. I put this candy out and in no way resembles chicken.

###

As for the looks, I don't mind nearly as much! I'm sure it is the highest temperature you can let it steep very long, which is great as a snack twice a day. Why? I typically enjoy smooth medium roasts and I prefer to sweeten my coffee for nothing. I used part of this 

We can then generate example documents for each class (20,000 legitimate and 20,000 spam) and save them to a Parquet file for use in the next notebook.

In [7]:
import pandas as pd
import numpy as np

pd.set_option("io.parquet.engine", "pyarrow")

random.seed(0xda7aba5e)
np.random.seed(0xda7aba5e)

# Poisson mean; make_basic_documents adds one sentence to every document
mean_sentences_per_example = 5
examples_per_class = 20000

# generate each class in turn, then concatenate once rather than growing a frame in a loop
frames = []
for label, model in [("legitimate", legitimate_model), ("spam", spam_model)]:
    texts = make_basic_documents(mean_sentences_per_example, examples_per_class, model)
    frames.append(pd.DataFrame({"label": label, "text": texts}))

df = pd.concat(frames)

df["text"] = df["text"].astype("str")
df["label"] = df["label"].astype("category")
df.to_parquet("data/training.parquet")

Let's go to [the next notebook](01-vectors-and-visualization.ipynb) now!