### Tweet Generation using Markov Chain

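* A Markov chain models how a system moves between states over time.
* Concept: the next state of a process depends only on the current state, not on the states that came before it.
* Process:
    * Using a corpus, create a dictionary whose keys are the current state (a word) and whose values are the candidate next states (the words that follow it in the corpus).
    * Example: `{'hi': ['there', 'john', 'everyone', ...]}`
    * This dictionary lets you start with a word and randomly generate the words that follow. A next word is listed once for every time it follows the key in the corpus, so words that follow more often are picked more often.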
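
The process above can be sketched on a toy corpus. The corpus string and the variable names below are made up for illustration and are not taken from the tweet data. `Counter` is used only to show how repeated entries in a value list translate into sampling probabilities.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration (not part of the tweet data)
toy_corpus = "hi there hi john hi there hi everyone"
words = toy_corpus.split(' ')

# Map each word to the list of words that follow it
toy_chain = defaultdict(list)
for current_word, next_word in zip(words[:-1], words[1:]):
    toy_chain[current_word].append(next_word)
toy_chain = dict(toy_chain)

# A follower appears once per occurrence, so sampling uniformly from the
# list reproduces the corpus frequencies
counts = Counter(toy_chain['hi'])
total = sum(counts.values())
probabilities = {word: n / total for word, n in counts.items()}

print(toy_chain['hi'])
print(probabilities)
```

Note that `'everyone'` never becomes a key here, because it is the last word of the toy corpus and nothing follows it.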

##### Importing libraries

In [1]:
import random
import pandas as pd
from collections import defaultdict

##### Load data with punctuation

In [2]:
# Read in the corpus, including punctuation!
corpus_yearly = pd.read_pickle('corpus_yearly.pkl')
corpus_yearly.head()

| | transcript |
|---|---|
| 2009 | be sure to tune in and watch trump on late nig... |
| 2010 | celebrity apprentice to outstanding list of se... |
| 2011 | watch me on late night with jimmy tomorrow nig... |
| 2012 | my interview the make great again filing and t... |
| 2013 | and the are laughing at the deal they just got... |


##### Extract 2019 tweets as corpus

In [3]:
# Extract only the 2019 text
data_2019 = corpus_yearly.transcript[2019]
data_2019[:200]

'a very good and talented guy a great new book just out “ why we fight ” lots of insight enjoy happy new year to everyone the and the fake news media will be a fantastic year for those not suffering fr'

##### Extract tweets from 2017 to 2020 as corpus

In [4]:
data_2017_to_2020 = ""
data_2017_to_2020 += corpus_yearly.transcript[2017]
data_2017_to_2020 += corpus_yearly.transcript[2018]
data_2017_to_2020 += corpus_yearly.transcript[2019]
data_2017_to_2020 += corpus_yearly.transcript[2020]
len(data_2017_to_2020)

1481790
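
One thing worth checking in the concatenation above: `+=` adds no separator between years, so unless each transcript already ends with a space, the last word of one year is fused with the first word of the next. The sketch below shows the effect on two invented strings (the `yearly` dictionary is hypothetical, not the real data) and one possible alternative using `" ".join`.

```python
# Hypothetical yearly transcripts, invented for illustration
yearly = {2017: "great year", 2018: "big wall"}

# Plain concatenation, as in the cell above
fused = ""
for year in (2017, 2018):
    fused += yearly[year]

# Joining with a single space keeps the boundary words separate
joined = " ".join(yearly[year] for year in (2017, 2018))

print(fused.split(' '))
print(joined.split(' '))
```

With only four years the number of fused tokens is small, so the effect on the generated text is minor, but joining with a space avoids it entirely.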

##### Markov Chain Model

In [5]:
# Build a dictionary with the current state as key and the list of possible next states as value.

def markov_chain(text):

    # Tokenize on single spaces, keeping punctuation
    words = text.split(' ')

    # Initialize the dictionary; missing keys start with an empty list
    m_dict = defaultdict(list)

    # Pair each word with the word that follows it and append the follower to that word's list
    for current_word, next_word in zip(words[:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the defaultdict back into a regular dictionary
    m_dict = dict(m_dict)
    return m_dict

In [12]:
p = defaultdict(list)
p

defaultdict(list, {})

In [6]:
# Create the dictionary and take a look at it
data_2019_dict = markov_chain(data_2019)
data_2019_dict

{'a': ['very',
  'great',
  'fantastic',
  'new',
  'wall',
  'strong',
  'dog',
  'total',
  'great',
  'deal',
  'flake',
  'team',
  'lot',
  'bill',
  'bill',
  'great',
  'hero',
  'lot',
  'lot',
  'president',
  'fine',
  'great',
  'very',
  'very',
  'wall',
  'part',
  'fine',
  'campaign',
  'wall',
  'big',
  'small',
  'member',
  'great',
  'record',
  'senator',
  'barrier',
  'wall',
  'year',
  'properly',
  'year',
  'more',
  'productive',
  'steel',
  'sad',
  'provision',
  'president',
  'number',
  'very',
  'proper',
  'glorious',
  'truly',
  'powerful',
  'big',
  'very',
  'record',
  'disgraceful',
  'meeting',
  'total',
  'wall',
  'white',
  'temper',
  'week',
  'wall',
  'great',
  'year',
  'potential',
  'far',
  'section',
  'game',
  'coach',
  'team',
  'century',
  'san',
  'total',
  'lie',
  'great',
  'crooked',
  'good',
  'bad',
  'number',
  'total',
  'fake',
  'strategy',
  'plan',
  'wall',
  'massive',
  'long',
  'badly',
  'shutdown',


In [7]:
# Create the dictionary and take a look at it
data_2017_to_2020_dict = markov_chain(data_2017_to_2020)
data_2017_to_2020_dict

{'well': ['the',
  'but',
  'with',
  'a',
  'he',
  'i',
  'and',
  'just',
  'in',
  'as',
  'heading',
  'actually',
  'as',
  'despite',
  'and',
  'the',
  'want',
  'and',
  'soon',
  'yesterday',
  'reasoned',
  'there',
  'with',
  'such',
  'a',
  'do',
  'i',
  'and',
  'see',
  'really',
  'coming',
  'great',
  'new',
  'together',
  'for',
  'in',
  'trump',
  'nobody',
  'we',
  'now',
  'so',
  'done',
  'received',
  'someday',
  'as',
  'interview',
  'done',
  'as',
  'not',
  'and',
  'built',
  'time',
  'and',
  'she',
  'unemployment',
  'page',
  'a',
  'together',
  'will',
  'like',
  'thank',
  'connected',
  'our',
  'we',
  'than',
  'meaning',
  'for',
  'four',
  'and',
  'tonight',
  'and',
  'it',
  'into',
  'most',
  'in',
  'in',
  'i',
  'which',
  'with',
  'now',
  'again',
  'which',
  'be',
  'is',
  'done',
  'immigration',
  'done',
  'done',
  'known',
  'prepared',
  'so',
  'just',
  'still',
  'charge',
  'we',
  'with',
  'low',
  'togethe

##### Function to generate text

* Arguments: the dictionary created above and the desired number of words.

In [14]:
def generate_tweet(chain, count):

    # Pick a random starting word and capitalize it to begin the sentence
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's value list, make it the current word, and repeat
    for i in range(count - 1):
        # A word seen only at the very end of the corpus has no entry; stop early in that case
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence
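
Because `random.choice` draws from the global random state, every run of the cells below prints different tweets. The sketch below shows one way to make runs repeatable, assuming reproducibility is wanted: a seeded `random.Random` instance is passed into a hypothetical helper, `generate_words`, which performs the same kind of walk as `generate_tweet`. The helper name and the toy chain are illustrative and not part of the notebook.

```python
import random

def generate_words(chain, count, rng):
    """Walk the chain for up to `count` words using the given RNG (hypothetical helper)."""
    # Sort the keys so the starting word depends only on the seed
    current = rng.choice(sorted(chain))
    words = [current]
    for _ in range(count - 1):
        followers = chain.get(current)
        # Stop early if the current word has no recorded followers
        if not followers:
            break
        current = rng.choice(followers)
        words.append(current)
    return words

# Toy chain, invented for illustration
toy_chain = {
    'a': ['great', 'wall', 'great'],
    'great': ['wall', 'a'],
    'wall': ['a'],
}

# Two generators created with the same seed produce the same walk
first_run = generate_words(toy_chain, 10, random.Random(42))
second_run = generate_words(toy_chain, 10, random.Random(42))
print(' '.join(first_run))
print(first_run == second_run)
```

Passing the generator in as an argument, rather than calling `random.seed`, keeps the seeding local to the call and leaves the global random state untouched.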

##### Tweet generation using 2019 data 

In [20]:
for i in range(5):
    print(generate_tweet(data_2019_dict, 40) + "\n")  

Bully shouting and ’ s inspiring portrait of slander and is difficult for and will work the us supreme court ” china too bad it ’ s trident pin this never give our border crisis at he ’ s what.

Battleground who my and is a wall are here ” as the ending witch hunt is a powerful used to the last night thank you really rock michigan north high up to mary b your part of the remain as.

Foolishness shortly at it ’ shifty a bipartisan way with down and ensure that he ran a great who to get one of that is unacceptable and trump was just a wall is a big and his family … republican.

Glorious nation from the failing new say “ ” in particular campaign i would do this for one of big us and we do like bob ’ s a basic question as most powerful military is in their case thanks.

Reaffirm our country we ’ t want them a sad joke this i will be quiet south will be a lawsuit against ” now who probe ” despite this will be back in “ fighting and all they took place.



##### Tweet generation using 2017 to 2020 data

In [18]:
for i in range(5):
    print(generate_tweet(data_2017_to_2020_dict, 40) + "\n")  

Constitutionally ” i very much fake continue our and so much good people who they want to beat him any collusion with i promise to retaliate last night is not this process “ the administration your amendment and for our.

Venture to take care quickly fixed during this is on don ’ s all would like last year the fact carbon into just in us the post “ i won if you president why they think what to what this.

Bitten before so easily one of sovereign nation keep drug price decrease in “ everyone that out of control kiss the fake news … prior to begin doing their incredible young socialist agenda against the of our and canada had.

Memorial day for tate a better and job killing animal border to serve our in with zero fairness from their as a true source ” i am angry they could improve our nation again senator and liberty on trade the.

Tend to combat have to fire this ended it cannot allow congress must build but what we are very dishonest fake news there i ’ t even though i was at our treas

##### Results

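* Markov chains can be used for basic text generation, but tweets are often too complex for such a simple model.
* The generated tweets are structurally and grammatically inaccurate, because each word is chosen based only on the word immediately before it.
* A model that captures longer context, such as an LSTM, would likely be a better choice.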