Let's build our own machine learning algorithm that uses the principles from the first Obama/Trump tweet example to learn what Trump tweets and Obama tweets look like, and then predict whether an unknown tweet was written by Trump or Obama.

We have 9 Trump tweets and 10 Obama tweets. We will use the first 6 of each to teach our system what an Obama tweet and a Trump tweet look like. We will then give the system the remaining 3 Trump tweets and 4 Obama tweets and ask it to tell us who tweeted each one.

## 1 Load the training data

In [2]:
import pandas as pd

tweets = pd.read_csv("trump-obama-tweets - Sheet1.csv")
tweets

Unnamed: 0,Author,Tweet
0,Trump,The current tax code is a burden on American t...
1,Trump,I am supportive of Lamar as a person & also of...
2,Trump,Democrat Congresswoman totally fabricated what...
3,Trump,The NFL has decided that it will not force pla...
4,Trump,"As it has turned out, James Comey lied and lea..."
5,Trump,The Democrats will only vote for Tax Increases...
6,Trump,"...people not interviewed, including Clinton h..."
7,Trump,"Wow, FBI confirms report that James Comey draf..."
8,Trump,The most important truth our FOUNDERS understo...
9,Obama,I'm grateful to @SenJohnMcCain for his lifetim...


## 2 Create the training and test sets:

Before we start coding our machine learning algorithm we need to prepare our data. Data preparation typically takes up most of the effort in any machine learning or NLP activity (80% is an often-quoted figure). Here we will take our 19 tweets and create a 'training set' of 12 tweets and a 'test set' of 7 tweets: 4 Obama and 3 Trump.

We will train our algorithm with the 12 training tweets by feeding it each tweet and telling it the author, so the algorithm can learn what an Obama tweet and a Trump tweet look like.

Once the machine has been trained we will give it the remaining 7 tweets without telling it who the author is and see how accurately it classifies them.

In [3]:
def clean(tweet):
  stopwords = ["a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also","although","always","am","among", "amongst", "amoungst", "amount",  "an", "and", "another", "any","anyhow","anyone","anything","anyway", "anywhere", "are", "around", "as",  "at", "back","be","became", "because","become","becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom","but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven","else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own","part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", 
"somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the"]
  #stopwords = []
  tweet = tweet.lower()
  for stopword in stopwords:
    tweet = tweet.replace(" " + stopword + " "," ")
  return tweet

obama_training_tweets = []
trump_training_tweets = []
obama_test_tweets = []
trump_test_tweets = []

for row in tweets.itertuples():
    if (row[0] < 15 and row[1] == 'Obama'):
        obama_training_tweets.append(clean(row[2]))
    elif (row[0] > 14 and row[1] == 'Obama'):
        obama_test_tweets.append(clean(row[2]))
    elif (row[0] < 6 and row[1] == 'Trump'):
        trump_training_tweets.append(clean(row[2]))
    elif (row[0] > 5 and row[1] == 'Trump'):
        trump_test_tweets.append(clean(row[2]))
        
        

print("Obama Training:")
print(obama_training_tweets)
print("")
print("Obama Test:")
print(obama_test_tweets)
print("")
print("Trump Training:")
print(trump_training_tweets)
print("")
print("Trump Test:")
print(trump_test_tweets)

Obama Training:
["i'm grateful @senjohnmccain lifetime service country. congratulations, john, receiving year's liberty medal.", 'michelle & i praying victims las vegas. thoughts families & enduring senseless tragedy.', 'proud cheer team usa invictus games today friend joe. represent best country.', "we're expanding efforts help puerto rico & usvi, fellow americans need right now. join http://oneamericaappeal.org ", 'prosecutor, soldier, family man, citizen. beau want better. legacy leave. testament @joebiden.', 'thinking neighbors mexico mexican-american friends tonight. cuidense mucho y fuerte abrazo para todos.']

Obama Test:
['coding important – fun. @csforall, thanks work make sure kid compete high-tech, global economy.', "michelle i want @obamafoundation inspire empower people change world. here's we're getting started fall.", 'we remember lost 9/11 honor defend country ideals. act terror change are.', 'proud mckinley tech students—inspiring young minds make hopeful future.']

Tr

## 3 Probability

In this corpus there are 6 training tweets for Trump and 6 for Obama so by inspection we can deduce that:

P(Obama) = 0.5

P(Trump) = 0.5

These are our prior probabilities: before we look at any of the words in a tweet (that is, before applying Bayes' Theorem) there is a 50/50 chance that it was written by Trump or by Obama.
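As a minimal sketch, the same priors can be computed from the training set sizes rather than read off by inspection. The variable names here are illustrative and are not used elsewhere in the notebook; the counts are the 6 and 6 stated above.

```python
# Number of training tweets per author, as stated in the text above.
obama_training_count = 6
trump_training_count = 6

total_training_count = obama_training_count + trump_training_count

# Prior probability = share of the training tweets written by each author.
p_obama = obama_training_count / total_training_count
p_trump = trump_training_count / total_training_count

print("P(Obama) =", p_obama)
print("P(Trump) =", p_trump)
```

With an unbalanced training set the two priors would differ, and they would then influence the final classification.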

## 4 Prior Probability

### 4.1 Naive Bayes Learning

Now let's calculate the probability of each word in our corpus being used by Obama and by Trump. For each author (Obama, then Trump) we will concatenate all of that author's training tweets, then iterate through every word and build a learnings dictionary that maps each word to the probability of that word being used by its author: the number of times the word occurs divided by the total number of words.

For example, the word grateful may occur two times in a total of 130 words, so our dictionary entry would look like:

`grateful: 0.015` 
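The arithmetic behind that entry can be sketched as follows. This is an illustrative alternative that counts words with `collections.Counter`; the `word_probabilities` helper and the made-up 130-word list are assumptions for the example, not part of the notebook's own code.

```python
from collections import Counter

def word_probabilities(words):
    """Map each word to (times it occurs) / (total number of words)."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Hypothetical corpus: 'grateful' twice among 130 words in total.
words = ["grateful"] * 2 + ["filler"] * 128
probabilities = word_probabilities(words)

print(round(probabilities["grateful"], 3))  # 2 / 130, rounded to 0.015
```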

In [4]:
def p_word(training_tweets, learnings):
  # join all of this author's training tweets into one string
  all_tweets = ""
  for tweet in training_tweets:
    all_tweets = all_tweets + " " + tweet
  
  # probability of a word = times it occurs / total number of words
  words = all_tweets.split()
  for word in words:
    learnings[word] = ( float(words.count(word))) / float(len(words))
  return learnings


obama_learnings = {}
trump_learnings = {}

p_word(obama_training_tweets, obama_learnings)
    
p_word(trump_training_tweets, trump_learnings)
   
print("Obama learning")    
print(obama_learnings)
print("Trump learning")    
print(trump_learnings)

Obama learning
{"i'm": 0.012987012987012988, 'grateful': 0.012987012987012988, '@senjohnmccain': 0.012987012987012988, 'lifetime': 0.012987012987012988, 'service': 0.012987012987012988, 'country.': 0.025974025974025976, 'congratulations,': 0.012987012987012988, 'john,': 0.012987012987012988, 'receiving': 0.012987012987012988, "year's": 0.012987012987012988, 'liberty': 0.012987012987012988, 'medal.': 0.012987012987012988, 'michelle': 0.012987012987012988, '&': 0.03896103896103896, 'i': 0.012987012987012988, 'praying': 0.012987012987012988, 'victims': 0.012987012987012988, 'las': 0.012987012987012988, 'vegas.': 0.012987012987012988, 'thoughts': 0.012987012987012988, 'families': 0.012987012987012988, 'enduring': 0.012987012987012988, 'senseless': 0.012987012987012988, 'tragedy.': 0.012987012987012988, 'proud': 0.012987012987012988, 'cheer': 0.012987012987012988, 'team': 0.012987012987012988, 'usa': 0.012987012987012988, 'invictus': 0.012987012987012988, 'games': 0.012987012987012988, 'tod

### 4.2 Naive Bayes Likelihood

We'll create an algorithm that uses the data we have learned to calculate the likelihood of each of our TEST tweets being either Obama or Trump. 

In [302]:
def p_likelihood(tweet, learnings):
  likelihood = 1.0;
  for word in tweet.split():
    likelihood = likelihood * (learnings.get(word,0.0) + 1.0);
  return likelihood;
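To see what `p_likelihood` computes, here is a small worked example. The function is repeated so the snippet runs on its own, and the `toy_learnings` values are invented for illustration rather than taken from the training data.

```python
def p_likelihood(tweet, learnings):
    likelihood = 1.0
    for word in tweet.split():
        # a word missing from learnings contributes 0.0 + 1.0 = 1.0
        likelihood = likelihood * (learnings.get(word, 0.0) + 1.0)
    return likelihood

# Hypothetical learnings: two known words, everything else unseen.
toy_learnings = {"proud": 0.02, "country.": 0.04}

# 'today' is not in toy_learnings, so the score is 1.02 * 1.04 * 1.0
score = p_likelihood("proud country. today", toy_learnings)
print(score)  # approximately 1.0608
```

Each known word nudges the score slightly above 1, while unknown words leave it unchanged.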

## 5 Naive Bayes Classification

Let's use our probability and likelihood in Bayes' Theorem to classify our test tweets:
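For reference, the standard Naive Bayes rule that this classification is based on can be written as below. This is the textbook form; the notebook's code approximates it by using $P(w \mid \text{author}) + 1$ as the factor for each word.

$$
P(\text{author} \mid \text{tweet}) \;\propto\; P(\text{author}) \prod_{w \,\in\, \text{tweet}} P(w \mid \text{author})
$$

The percentage that the code prints is the winning author's score divided by the sum of both authors' scores:

$$
\text{percent} = 100 \times \frac{\text{score}_{\text{predicted}}}{\text{score}_{\text{predicted}} + \text{score}_{\text{other}}}
$$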

In [303]:
def print_result(predicted_author, predicted_author_bayes_probability, other_author_bayes_probability):
  percent = predicted_author_bayes_probability / (other_author_bayes_probability + predicted_author_bayes_probability) * 100.0
  print("predicted author: {:0.2f}".format(percent) + "% " + predicted_author)

def classify(author, tweet, obama_learnings, trump_learnings):
  obama_probability = 0.5 #half of our 'known' tweets are obama.
  obama_likelihood = p_likelihood(tweet, obama_learnings)
  obama_bayes_probability = obama_probability * obama_likelihood

  trump_probability = 0.5 #half of our 'known' tweets are trump.
  trump_likelihood = p_likelihood(tweet, trump_learnings)
  trump_bayes_probability = trump_probability * trump_likelihood
    
  
  print('"' + tweet + '"')
  print("actual author: " + author)

  if obama_bayes_probability > trump_bayes_probability:
    print_result("Obama", obama_bayes_probability, trump_bayes_probability)
  elif obama_bayes_probability < trump_bayes_probability:
    print_result("Trump", trump_bayes_probability, obama_bayes_probability)
  else:
    print("Trump or Obama - equal probability: {:0.2f} {:0.2f}".format(obama_bayes_probability, trump_bayes_probability))
  print("")

#classify the Obama test tweets
for tweet in obama_test_tweets:
  classify("Obama", tweet, obama_learnings, trump_learnings)

#classify the Trump test tweets
for tweet in trump_test_tweets:
  classify("Trump", tweet, obama_learnings, trump_learnings)

"coding important – fun. @csforall, thanks work make sure kid compete high-tech, global economy."
actual author: Obama
Trump or Obama - equal probability: 0.50 0.50

"michelle i want @obamafoundation inspire empower people change world. here's we're getting started fall."
actual author: Obama
predicted author: 52.57% Obama

"we remember lost 9/11 honor defend country ideals. act terror change are."
actual author: Obama
Trump or Obama - equal probability: 0.50 0.50

"proud mckinley tech students—inspiring young minds make hopeful future."
actual author: Obama
predicted author: 53.05% Obama

"...people interviewed, including clinton herself. comey stated oath didn't this-obviously fix? justice dept?"
actual author: Trump
predicted author: 52.78% Trump

"wow, fbi confirms report james comey drafted letter exonerating crooked hillary clinton long investigation complete. many.."
actual author: Trump
predicted author: 58.27% Trump

"the important truth founders understood was: freedom gift g

## Exercises

Exercise 1:

Using our code.txt file copy the 

Exercise 2:

If you've got it right then you should see every result as __Trump or Obama - equal probability: 0.00 0.00__, which isn't very helpful. This is because every tweet contains at least one word that does not appear in an author's training tweets. For example, if the tweet contains "@csforall" and we are calculating the likelihood of Trump using that word, our __p_likelihood__ algorithm multiplies in 0 for that word, so the product over all the words in the tweet will be 0: 0.0 * 0.1 * 0.2 will always be 0. To fix this we apply a simplified form of Laplace Smoothing. The textbook version adds 1 to every word count; here we simply add 1 to every word probability.

Change this line in p_likelihood to apply Laplace Smoothing:

`likelihood = likelihood * (learnings.get(word,0.0));` 

to 

`likelihood = likelihood * (learnings.get(word,0.0) + 1.0);`

Now run the code in sections 4.2 and 5 again. You should now see classifications, with the first two Obama tweets incorrectly classified as Trump tweets. Note that the probabilities are generally very similar and not far from 50%.
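For comparison, here is a sketch of the textbook add-one (Laplace) form, which adds 1 to each word count and the vocabulary size to the denominator so that the probabilities still sum to 1. The `laplace_word_probabilities` helper and the toy tweets are assumptions for illustration and are not part of the exercise code.

```python
from collections import Counter

def laplace_word_probabilities(training_tweets, vocabulary):
    """P(word | author) = (count + 1) / (total words + vocabulary size)."""
    counts = Counter(" ".join(training_tweets).split())
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocabulary)) for w in vocabulary}

# Hypothetical, already-cleaned training tweets.
obama = ["proud team usa", "proud students"]
trump = ["fake news media"]

# The vocabulary is every distinct word seen for either author.
vocabulary = set(" ".join(obama + trump).split())

obama_probs = laplace_word_probabilities(obama, vocabulary)

print(obama_probs["proud"])  # seen twice by Obama: (2 + 1) / (5 + 7)
print(obama_probs["fake"])   # never used by Obama, but no longer zero
```

Because an unseen word now gets a small non-zero probability, the product over a tweet's words no longer collapses to 0.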

Exercise 3:

Our data set is very small, and all but the most common words (stop words) occur only once. The smoothing term we add, 1, is also much larger than any of the word probabilities (around 0.013 for a word that occurs once), so a word that is in our training set contributes almost the same amount as a word that is not.

Let's attempt to offset that bias by multiplying every real word occurrence by 10, which reduces the relative effect of the Laplace Smoothing:

Change this line in p_word

`learnings[word] = ( float(words.count(word))) / float(len(words))`

to

`learnings[word] = ( float(words.count(word) * 10.0)) / float(len(words))`

Now run the code in sections 4.1, 4.2 and 5 again. We should now see more significant differences in our classification. The first tweet will still be incorrectly predicted.
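The effect of the change can be checked with a little arithmetic. This sketch assumes a training corpus of 77 words, which is what the printed Obama learnings imply (1 / 77 = 0.012987...); the `factor` helper is hypothetical and only mirrors what p_word and p_likelihood do for a single word.

```python
# Assumption: corpus size implied by the printed value 0.012987... = 1 / 77.
total_words = 77

def factor(count, scale):
    """The value p_likelihood multiplies in for one word."""
    return (count * scale) / total_words + 1.0

seen_before = factor(1, 1.0)   # word seen once, original p_word
seen_after = factor(1, 10.0)   # word seen once, occurrences multiplied by 10
unseen = factor(0, 1.0)        # word not in the training set

print(seen_before, seen_after, unseen)
```

Before the change a seen word is only about 1.3% larger than an unseen one; after the change it is about 13% larger, so matching words move the result further from 50%.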


Exercise 4:

Stopwords are everyday words like 'a', 'the', 'and', 'but', etc. If we remove these words then we are left with the more significant words, which should increase the accuracy of our predictions:

In `def clean` uncomment

`stopwords = ["a", "about", "above", "above", "across"....]`

and comment out

`stopwords = []`

This will remove stop words from the tweets. Run sections 2, 4.1, 4.2 and 5 again. Now we have two tweets that are exact matches - 50/50. Why is that?
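One detail to keep in mind: `clean` only replaces a stopword when it has a space on both sides, so a stopword at the very start or end of a tweet survives (the cleaned test tweets above still begin with "we" and "the"). A token-based alternative is sketched below. The `remove_stopwords` helper is hypothetical, and switching to it would change the outputs shown in this notebook.

```python
def remove_stopwords(tweet, stopwords):
    """Lower-case the tweet, split it on whitespace and drop stopword tokens."""
    stopword_set = set(stopwords)  # set lookup is faster than a list
    kept = [word for word in tweet.lower().split() if word not in stopword_set]
    return " ".join(kept)

# Short illustrative stopword list, not the full list used in clean().
cleaned = remove_stopwords("We remember everyone we lost on 9/11",
                           ["we", "on", "everyone"])
print(cleaned)  # remember lost 9/11
```

Like `clean`, this does not strip punctuation, so a token such as "the," would still be kept.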
    