# Data Science Mathematics
# Markov Chains and Natural Language Generation
# In-Class Activity

A Markov chain is a stochastic model that describes a sequence of events in which the probability of each event depends only on the state reached in the previous event.

![Markovkate_01.svg.png](attachment:Markovkate_01.svg.png)
In this diagram of a Markov process, E and A represent distinct states.  The numbers represent the probability of the process changing from one state to the other, in the direction of the arrow.  For example, if the process is in state E, there is a 0.3 probability that it stays in state E, and a 0.7 probability that it changes to state A.
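To make the diagram concrete, here is a minimal sketch that simulates a two-state chain with Python's standard `random` module.  The E row uses the probabilities quoted above; the A row (0.4 and 0.6) is an assumption made for illustration, as are the names `TRANSITIONS` and `simulate_chain`.

```python
import random

# Transition probabilities. The E row (0.3 stay, 0.7 move to A) comes from the
# text above; the A row is an assumed value used only for illustration.
TRANSITIONS = {
    "E": {"E": 0.3, "A": 0.7},
    "A": {"E": 0.4, "A": 0.6},  # assumption, not stated in the text
}


def simulate_chain(start, steps, rng):
    """Return the list of states visited, beginning with `start`."""
    state = start
    path = [state]
    for _ in range(steps):
        next_states = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][s] for s in next_states]
        # The next state depends only on the current state.
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path


rng = random.Random(0)  # seeded so that reruns give the same path
path = simulate_chain("E", 10, rng)
print(" -> ".join(path))
```

Each row of `TRANSITIONS` sums to 1, because the process must go somewhere (possibly back to the same state) at every step.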

Markov processes can be used for Natural Language Generation, a machine learning technique for generating sentences that are coherent (i.e., not strings of random words) and thematically similar to a training set of sentences.

Bodies of text can be modeled as chains of conditional probabilities.  For any given body of text, one can calculate the probability that word Y will follow word X, written P(Y|X) (recall conditional probability).  Doing this for every pair of adjacent words yields a set of conditional probabilities.  Because Markov processes rely on conditional probabilities, one can see how they can be used to generate sentences that have a high probability of being realistic and thematically similar to the training body of text.
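The idea can be sketched from scratch in a few lines.  The example below is not how Markovify is implemented; it is a simplified illustration that estimates P(next word | current word) from a made-up toy corpus and then walks the resulting chain.  The corpus and the helper names `conditional_probabilities` and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Toy corpus, made up for illustration.
corpus = "the cat sat on the mat . the cat ate the fish ."
words = corpus.split()

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for current_word, next_word in zip(words, words[1:]):
    follow_counts[current_word][next_word] += 1


def conditional_probabilities(word):
    """Return P(next word | word) estimated from the counts above."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {nxt: n / total for nxt, n in counts.items()}


def generate(start, max_words, rng):
    """Walk the chain from `start` until a period or `max_words` words."""
    sentence = [start]
    while len(sentence) < max_words and sentence[-1] != ".":
        probs = conditional_probabilities(sentence[-1])
        if not probs:  # the word was never followed by anything
            break
        choice = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        sentence.append(choice)
    return sentence


print(conditional_probabilities("the"))
# {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}

rng = random.Random(1)  # seeded so that reruns give the same sentence
sentence = generate("the", 8, rng)
print(" ".join(sentence))
```

In the toy corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, which is where the 0.5, 0.25, and 0.25 come from.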

Python has an interesting library called Markovify that generates sentences from a training set of text.  Become familiar with the library here:

https://github.com/jsvine/markovify

Natural Language Generation has many applications.  For example, it can be used to create bots that generate realistic text.

In this assignment, you will train a Markov model on a set of tweets from known Russian bots that were active during the 2016 Presidential Election.  You will build your own disinformation bot.

1) Import markovify, pickle, and os
    A) If running the cell below raises an error saying that markovify cannot be found, go to the command 
    line (xfce) and type: 
    
    `conda install -c conda-forge markovify`.  If that does not work, type `pip install markovify`.

In [None]:
import markovify
import pickle
import os

The pickle library is a Python object serialization library.  It allows you to save an artifact ("bit dump") of any Pythonic object for later use.  We will use it to open the training data set, which has already been processed.
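If you have not used pickle before, the following sketch shows a full round trip: a string is saved to a file and then loaded back.  The sample text and the file name `sample.pkl` are placeholders, not part of the assignment data.  Note that you should only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Hypothetical sample text standing in for the real training data.
sample_text = "First example sentence. Second example sentence."

# Work in a temporary directory so nothing in your own folders is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.pkl")

    # Pickle files are binary, so open them with 'wb' and 'rb'.
    with open(sample_path, "wb") as pickle_file:
        pickle.dump(sample_text, pickle_file)

    with open(sample_path, "rb") as pickle_file:
        restored_text = pickle.load(pickle_file)

print(restored_text == sample_text)  # True
print(type(restored_text).__name__)  # str
```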

os is an operating system library for directory manipulation.

You should have downloaded the 'training_data_Session_2.pkl' file from GitHub, along with this Jupyter notebook, and saved it to a convenient location.

2) Specify the path of the training data file (use an absolute path if the file is not in the same directory as this notebook).

In [None]:
input_path = r'training_data_Session_2.pkl'  # If the file is not in the same directory as this notebook, specify its full path here

The pickled text object is just a serialized string of tweets.  You will need to load the string to train the Markov model.  But first, define a function to open a pickle object.

3) Define open_pickle function and open the training text file

In [None]:
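# Open a pickle file and return the object stored in it
def open_pickle(pickle_path):
    # Pickle files are binary, so open in 'rb' mode
    with open(pickle_path, 'rb') as pickle_file:
        # encoding only matters for strings pickled by Python 2
        object_name = pickle.load(pickle_file, encoding='utf8')
    return object_name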

Now you can open the training text file.

In [None]:
training_text = open_pickle(input_path)

Next, you need to train your model.  This is easy with Markovify...

4) Train your Markov model

In [None]:
twitter_bot = markovify.Text(training_text)

Tweets are limited to 280 characters.  Since we want our bot to generate tweets, we have to limit the length of the generated output to the same 280 characters.

The following is example code that will print 3 Markov-generated sentences of no more than 280 characters:

    # Print three randomly generated sentences of no more than 280 characters
    for i in range(3):
        print(twitter_bot.make_short_sentence(280))
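One detail worth knowing: Markovify's `make_short_sentence` returns `None` when it cannot build a suitable sentence within its allotted tries, so a loop may occasionally print `None`.  The sketch below shows one possible way to skip those failures.  The helper `collect_sentences` and the stand-in `fake_generator` are hypothetical names used only so the example runs without the training data.

```python
def collect_sentences(make_sentence, count, max_attempts=100):
    """Call `make_sentence` until `count` non-None results are collected.

    `make_sentence` is any zero-argument callable that returns a string
    or None. Gives up after `max_attempts` calls.
    """
    sentences = []
    attempts = 0
    while len(sentences) < count and attempts < max_attempts:
        attempts += 1
        sentence = make_sentence()
        if sentence is not None:
            sentences.append(sentence)
    return sentences


# Stand-in generator for demonstration: returns None on every odd call.
calls = {"n": 0}


def fake_generator():
    calls["n"] += 1
    return None if calls["n"] % 2 else f"sentence {calls['n']}"


demo = collect_sentences(fake_generator, 3)
print(demo)  # ['sentence 2', 'sentence 4', 'sentence 6']
```

With the model trained in this notebook, the same helper could be called as `collect_sentences(lambda: twitter_bot.make_short_sentence(280), 3)`.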

5) In the cell below, compose a loop that will generate 10 tweets using your Twitter bot.

Now we should check and see if the generated tweets are similar to the training body of tweets.  Let's split our training body into a list of individual sentences.

6) Generate a list of sentences by splitting the text at each period.

In [None]:
test_tweets = training_text.split('. ')
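Keep in mind that splitting on `'. '` is a rough approximation: it removes the period from every piece except the last, and it ignores sentences that end in `?` or `!`.  The sketch below compares it with a regular-expression split on a made-up sample string; the variable names are illustrative only.

```python
import re

# Made-up sample text standing in for training_text.
sample = "Breaking news today. Is this real? Yes it is! Read more here."

# Same approach as the cell above: split on a period followed by a space.
by_period = sample.split(". ")
print(by_period)
# ['Breaking news today', 'Is this real? Yes it is! Read more here.']

# Alternative: split on whitespace that follows '.', '!' or '?'.
# The lookbehind keeps the punctuation attached to each sentence.
by_punct = re.split(r"(?<=[.!?])\s+", sample)
print(by_punct)
# ['Breaking news today.', 'Is this real?', 'Yes it is!', 'Read more here.']
```

For this activity the simple split is good enough, since the goal is only to eyeball a few training sentences.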

7) Print the first 10 lines of your training set

In [None]:
# Slicing avoids an IndexError if there are fewer than 10 sentences
for tweet in test_tweets[:10]:
    print(tweet)

8) Are your generated tweets similar to the training set?  Why or why not? (Type your answer in a cell below)

9) Do you think your bot would pass a Turing Test?  Why or why not? (Type your answer in a cell below)

***Now upload your file: `git add .`, `git commit -m "second homework"`, `git push origin master`

***Then send the link and your written homework to the teacher. 