### Obama - Trump NLP Sentence Generator
Uses sklearn and an n-gram model. The notebook reads a corpus of Obama's speeches and a corpus of Trump's speeches, cleans them, and splits them into sentences. All of Trump's sentences and an equally sized random subset of Obama's sentences are then used to train an n-gram based linear classifier (sklearn's SGDClassifier), with parameters tuned by GridSearchCV. On the held-out test data the model achieves 90 percent precision and recall (2017.8.25).
We then take quotes from famous people and compute whether they speak more like Obama or more like Trump, on a sentence-by-sentence basis.

In [2]:
import pandas as pd, numpy as np, re, random
from sklearn.feature_extraction.text import CountVectorizer , TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize
from sklearn.model_selection import train_test_split, GridSearchCV

In [3]:
test_split = 0.2

In [4]:
"""
function: getCleanedData reads a text document into a single Python string. Some cleaning is performed, removing
non-ASCII characters, ANSI escape codes, line breaks and tabs, and so forth. Returns the cleaned string.
"""
def getCleanedData(path):
    # read raw bytes and decode explicitly, so this works on Python 2 and 3
    with open(path, "rb") as f:
        data = f.read().decode("utf-8")
    # drop every non-ASCII character (this also drops typeset quotes)
    data = data.encode("ascii", "ignore").decode("ascii")
    ansi_escape = re.compile(r'(\x9B|\x1B\[)[0-?]*[ -/]*[@-~]')
    data = ansi_escape.sub('', data)
    # line breaks and tabs are removed outright, not replaced by a space
    whitespace_escape = re.compile(r'[\r\n\t]')
    data = whitespace_escape.sub('', data)
    # presumably meant to unescape backslash-escaped apostrophes (\' -> ')
    quotes_escape = re.compile(r"\\'")
    data = quotes_escape.sub("'", data)
    data = data.replace(" --", ",")
    # re-insert the space after a period that is glued to the next letter
    sentence_end_regex = re.compile(r"\.([a-zA-Z])")
    data = sentence_end_regex.sub(r'. \1', data)
    return data
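
Because getCleanedData deletes line breaks and tabs instead of replacing them with spaces, the end of one sentence can end up fused to the start of the next. That seems to be why the last regex re-inserts a space after a period that is followed directly by a letter. A minimal sketch of those steps on a made-up string (the sample text is an assumption, not taken from the speech corpus):

```python
import re

# Hypothetical sample text; not taken from the speech corpus.
sample = "We will win.\nBelieve me --it's true.\tThank you.Goodbye."

# Drop newlines and tabs without a replacement character, as getCleanedData does.
no_breaks = re.sub(r'[\r\n\t]', '', sample)
# Replace the " --" dash sequence with a comma.
no_dashes = no_breaks.replace(" --", ",")
# Re-insert a space wherever a period is glued to the next letter.
cleaned = re.sub(r'\.([a-zA-Z])', r'. \1', no_dashes)
print(cleaned)
```

Note that the comma produced from " --" is not followed by a space; only the period case is repaired.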

In [5]:
# def tokenizeAndFilter(data_sent):
#     data_sent2 = []
#     ld = len(data_sent)
#     cnt = 0
#     for sent in data_sent:
#         cnt+=1
#         #print ('Processed %d out of %d' % (cnt,ld))
#         words = [w.lower() for w in word_tokenize(sent) if w.isalpha() and w.lower() not in stopwords.words("english")]
#         if (len(words) > 4):
#             data_sent2 = data_sent2 + [' '.join(words)]
#     return data_sent2

In [6]:
"""
function: simpleFilter takes a list of sentences and returns the sentences that contain more than 4 alphabetic,
non-stopword tokens. The returned sentences are left unmodified; stopwords only affect the word count.
"""
def simpleFilter(data_sent):
    stop_words = set(stopwords.words("english"))  # build the lookup set once, not once per token
    data_sent2 = []
    for sent in data_sent:
        words = [w for w in word_tokenize(sent) if w.isalpha() and w.lower() not in stop_words]
        if len(words) > 4:
            data_sent2.append(sent)
    return data_sent2
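
The filtering rule can be tried without downloading any NLTK data. The sketch below is only an approximation: the stopword set and the regex tokenizer are assumptions standing in for nltk's stopwords.words("english") and word_tokenize, and the helper names are hypothetical.

```python
import re

# Assumption: a tiny hand-picked stopword set standing in for
# nltk's stopwords.words("english").
STOPWORDS = {"a", "an", "the", "is", "it", "of", "to", "and", "we", "in", "that", "this"}

def content_words(sentence):
    # Assumption: a simple regex tokenizer standing in for nltk's word_tokenize.
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [w for w in tokens if w.lower() not in STOPWORDS]

def keep_long_sentences(sentences, min_words=4):
    # Keep the original sentence text; only the count uses the filtered words.
    return [s for s in sentences if len(content_words(s)) > min_words]

sentences = [
    "It is true.",
    "We will rebuild our roads, bridges, tunnels and airports.",
]
kept = keep_long_sentences(sentences)
print(kept)
```

The first sentence has a single non-stopword token and is dropped; the second has seven and is kept with its stopwords and punctuation intact.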

In [7]:
#Obtains Obama's and Trump's speeches, cleans them and sentence-tokenizes them.
obama = getCleanedData("./docs/allobamaspeeches.txt")
trump = getCleanedData("./docs/trumpspeeches.txt")
obama_sent = sent_tokenize(obama)
trump_sent = sent_tokenize(trump)
o,t = obama_sent, trump_sent
o,t

([' On behalf of the great state of Illinois, crossroads of a nation, Land of Lincoln, let me express my deepest gratitude for the privilege of addressing this convention.',
  "Tonight is a particular honor for me because, let's face it, my presence on this stage is pretty unlikely.",
  'My father was a foreign student, born and raised in a small village in Kenya.',
  'He grew up herding goats, went to school in a tin-roof shack.',
  'His father, my grandfather, was a cook, a domestic servant to the British.',
  'But my grandfather had larger dreams for his son.',
  'Through hard work and perseverance my father got a scholarship to study in a magical place, America, that shone as a beacon of freedom and opportunity to so many who had come before.',
  'While studying here, my father met my mother.',
  'She was born in a town on the other side of the world, in Kansas.',
  'Her father worked on oil rigs and farms through most of the Depression.',
  "The day after Pearl Harbor my grandfath

In [8]:
#Uses simpleFilter to drop sentences with <= 4 alphabetic non-stopword tokens,
#then downsamples Obama's sentences to match the number of Trump's sentences
#Then splits data into training and test sets
o_old2, t_old2 = list(o), list(t)
o = simpleFilter(obama_sent)
t = simpleFilter(trump_sent)
o = random.sample(o, len(t))
#making training sets
o_tr, o_te = train_test_split(o, test_size = test_split)
t_tr, t_te = train_test_split(t, test_size = test_split)
o_y = [1 for x in range(len(o_tr))] #1 is Obama
t_y = [0 for x in range(len(t_tr))] #0 is Trump
tr = o_tr + t_tr
y = o_y + t_y
#making test sets
test = o_te + t_te
ote_y = [1 for x in range(len(o_te))]
tte_y = [0 for x in range(len(t_te))]
te_y = ote_y + tte_y
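
The balancing and splitting logic of this cell can be illustrated with the standard library alone. This is a sketch under stated assumptions: the sentence lists are placeholders, and the hand-rolled split function is a hypothetical stand-in for sklearn's train_test_split, not a reimplementation of it.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-ins for the filtered sentence lists.
obama = ["obama sentence %d" % i for i in range(50)]
trump = ["trump sentence %d" % i for i in range(20)]

# Downsample the larger class to the size of the smaller one.
obama = random.sample(obama, len(trump))

def split(items, test_fraction):
    # Shuffle a copy, then use the first n_test items as the test part.
    items = list(items)
    random.shuffle(items)
    n_test = int(round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]

# Split each class separately so both splits stay balanced.
o_tr, o_te = split(obama, 0.2)
t_tr, t_te = split(trump, 0.2)

train_x = o_tr + t_tr
train_y = [1] * len(o_tr) + [0] * len(t_tr)  # 1 is Obama, 0 is Trump
test_x = o_te + t_te
test_y = [1] * len(o_te) + [0] * len(t_te)

print(len(train_x), len(test_x))
```

Splitting each speaker's sentences separately, as the notebook does, keeps the 50/50 class balance in both the training and the test set.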

In [9]:
#We use GridSearch among various parameters to find the best n-gram language model for our data.
#The grid covers the modified_huber loss and the perceptron loss
#(other losses were tested but these seemed to work the best for this use case.)
parameters = {'vect__ngram_range': [(1, 2),(1,3)],
              'tfidf__use_idf': [False],
              'vect__stop_words': [None],
              'vect__binary':[True, False],
              'clf__loss': [ 'modified_huber', 'perceptron']
              }


text_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', SGDClassifier(penalty='l2',
                                            alpha=1e-3, random_state=23, max_iter=5, tol=None))])
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, scoring = 'f1')
gs_clf.fit(tr,y)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st...ty='l2', power_t=0.5, random_state=23, shuffle=True,
       tol=None, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__ngram_range': [(1, 2), (1, 3)], 'clf__loss': ['modified_huber', 'perceptron'], 'vect__binary': [True, False], 'tfidf__use_idf': [False], 'vect__stop_words': [None]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1', verbose=0)
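
To make the vect__ngram_range and vect__binary options concrete, here is a plain-Python sketch of what word n-gram features for the range (1, 2) look like, with and without binarization. It is an illustration only: the ngrams helper and the sample sentence are hypothetical, and CountVectorizer additionally lowercases text and, by default, ignores single-character tokens.

```python
def ngrams(tokens, n_min, n_max):
    # All contiguous runs of n_min..n_max tokens, joined by spaces.
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Hypothetical sentence with a repeated word, split on whitespace.
tokens = "jobs jobs jobs for america".split()
grams = ngrams(tokens, 1, 2)

# Count features (binary=False): how often each n-gram occurs.
counts = {}
for g in grams:
    counts[g] = counts.get(g, 0) + 1

# Binary features (binary=True): only whether each n-gram occurs.
binary = {g: 1 for g in counts}

print(counts)
print(binary)
```

With counts, the repeated unigram "jobs" gets weight 3 and the bigram "jobs jobs" gets weight 2; with binary features both collapse to 1, so repetition within a sentence no longer influences the classifier.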

In [10]:
#We print GridSearch results
print("Best parameters set found on development set:\n")
print(gs_clf.best_params_)
print("Grid scores on development set:\n")
means = gs_clf.cv_results_['mean_test_score']
stds = gs_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gs_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r \n" % (mean, std * 2, params))
    

Best parameters set found on development set:

{'vect__ngram_range': (1, 2), 'clf__loss': 'perceptron', 'tfidf__use_idf': False, 'vect__binary': True, 'vect__stop_words': None}
Grid scores on development set:

0.889 (+/-0.010) for {'vect__ngram_range': (1, 2), 'clf__loss': 'modified_huber', 'tfidf__use_idf': False, 'vect__binary': True, 'vect__stop_words': None} 

0.883 (+/-0.008) for {'vect__ngram_range': (1, 3), 'clf__loss': 'modified_huber', 'tfidf__use_idf': False, 'vect__binary': True, 'vect__stop_words': None} 

0.882 (+/-0.011) for {'vect__ngram_range': (1, 2), 'clf__loss': 'modified_huber', 'tfidf__use_idf': False, 'vect__binary': False, 'vect__stop_words': None} 

0.877 (+/-0.009) for {'vect__ngram_range': (1, 3), 'clf__loss': 'modified_huber', 'tfidf__use_idf': False, 'vect__binary': False, 'vect__stop_words': None} 

0.898 (+/-0.008) for {'vect__ngram_range': (1, 2), 'clf__loss': 'perceptron', 'tfidf__use_idf': False, 'vect__binary': True, 'vect__stop_words': None} 

0.896 (

In [11]:
#We print classification results
print("Detailed classification report:\n")
print("The model is trained on the full development set.\n")
print("The scores are computed on the full evaluation set.\n")
y_true, y_pred = te_y, gs_clf.predict(test)
print(classification_report(y_true, y_pred))
print("\n")

Detailed classification report:

The model is trained on the full development set.

The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.88      0.93      0.90      1256
          1       0.92      0.87      0.90      1256

avg / total       0.90      0.90      0.90      2512
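
For readers unfamiliar with the columns of this report, the sketch below computes per-class precision, recall and F1 by hand. The labels are a made-up toy example, not the notebook's test set, and the prf helper is hypothetical.

```python
def prf(y_true, y_pred, label):
    # Count true positives, false positives and false negatives for one class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = float(tp) / (tp + fp) if tp + fp else 0.0
    recall = float(tp) / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Toy labels (assumption): 0 is Trump, 1 is Obama, as in the notebook.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

for label in (0, 1):
    p, r, f = prf(y_true, y_pred, label)
    print("%d  precision=%.2f  recall=%.2f  f1=%.2f" % (label, p, r, f))
```

Precision for a class is the share of sentences predicted as that class that really belong to it; recall is the share of that class's sentences the model found.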





In [14]:
#We can now test our model on custom text data. The document is a blog post by a friend.
custom_test = getCleanedData("./tests/test_doc.txt")
custom_test = sent_tokenize(custom_test)
custom_test = simpleFilter(custom_test)
print(gs_clf.predict(custom_test))

[0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 1 1 0 1
 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 1
 0 0]
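
The array above holds one prediction per sentence (1 is Obama, 0 is Trump). One way to turn it into a single verdict for the whole document is a simple majority vote. This summary step is an assumption added for illustration and is not part of the notebook; the prediction list below is a short made-up example.

```python
from collections import Counter

# Hypothetical per-sentence predictions: 1 is Obama, 0 is Trump.
predictions = [0, 0, 1, 0, 1, 1, 0, 0]
names = {1: "Obama", 0: "Trump"}

# Tally how many sentences were assigned to each speaker.
counts = Counter(names[p] for p in predictions)
share_obama = counts["Obama"] / float(len(predictions))
# The speaker with the most sentences wins the vote.
overall = counts.most_common(1)[0][0]

print("Obama-like share: %.3f, overall more like %s" % (share_obama, overall))
```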


In [19]:
test = pd.read_csv("./tests/litemind-quotes.csv", encoding='latin1')
a = test.values[:,[1,2]]
b = gs_clf.predict(list(a[:,0]))
b = ["Obama" if x == 1 else "Trump" for x in b]
result = np.concatenate((a,np.array([b]).T), axis =1)
result_df = pd.DataFrame(result, columns=['Quote', 'Speaker', 'More like...'])
result_df.to_csv("./tests/qutoes_results.csv", encoding = "utf8")
result_df

| | Quote | Speaker | More like... |
|---|---|---|---|
| 0 | History teaches us that men and nations behave... | Abba Eban | Obama |
| 1 | How many legs does a dog have, if you call the... | Abraham Lincoln | Trump |
| 2 | Nearly all men can stand adversity, but if you... | Abraham Lincoln | Obama |
| 3 | When I do good, I feel good; when I do bad, I ... | Abraham Lincoln | Trump |
| 4 | Better to remain silent and be thought a fool ... | Abraham Lincoln | Trump |
| 5 | You can fool some of the people all of the tim... | Abraham Lincoln | Trump |
| 6 | To the man who only has a hammer, everything h... | Abraham Maslow | Trump |
| 7 | My definition of a free society is a society w... | Adlai E. Stevenson Jr. | Obama |
| 8 | I use emotion for the many and reserve reason ... | Adolf Hitler | Obama |
| 9 | It is always more difficult to fight against f... | Adolf Hitler | Obama |
