## UK Political Speech NLP project

### Executive Summary

The primary aim of this project is to build a model that takes the text of a speech by an MP and predicts their political party. I am also looking at the vocabulary used by MPs of different parties and some topic analysis.

#### Data:

The data was scraped from the <a href='http://www.britishpoliticalspeech.org/index.htm'>British Political Speech</a> website. From this I got a dataset of around 3,000 speeches along with the speaker's name, the date of the speech, and a brief summary.

The website didn't list the speaker's party in a reliable format so I used the Wikipedia API to determine this, which involved discarding a few hundred speeches where either the speaker's party couldn't be determined or the speaker wasn't an MP.
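
The matching step can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: `PARTY_KEYWORDS` is a hypothetical subset of the full mapping, `party_from_categories` is a name introduced here, and the category strings are made up instead of real API output. A speaker is kept only when exactly one party can be identified.

```python
# Hypothetical subset of a category-keyword -> party mapping.
PARTY_KEYWORDS = {
    'Conservative Party': 'Conservative',
    'Labour Party': 'Labour',
    'Labour Co-operative': 'Labour',
    'Liberal Democrat': 'Lib Dem',
}

def party_from_categories(categories):
    """Return the single party implied by the categories, or None if zero or several match."""
    joined = ', '.join(categories)
    parties = {party for keyword, party in PARTY_KEYWORDS.items() if keyword in joined}
    return parties.pop() if len(parties) == 1 else None

# Illustrative category strings (assumed, not real Wikipedia output).
one_party = ['Labour Party (UK) MPs for English constituencies', 'Labour Co-operative MPs']
two_parties = ['Conservative Party (UK) MPs', 'Labour Party (UK) MPs']
no_party = ['English journalists']

print(party_from_categories(one_party))    # Labour
print(party_from_categories(two_parties))  # None
print(party_from_categories(no_party))     # None
```

Collecting matches in a set means that two keywords mapping to the same party (such as 'Labour Party' and 'Labour Co-operative') count once.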

I tokenised the speeches and limited the tokens to just those where the stem could be found in the 'words' corpus from NLTK. For training and testing I reduced the dataset to just the two main parties due to severe imbalance in the minority parties.

#### Predictive modelling:

I compared the following classification models:
- Logistic Regression
- Decision Tree (with AdaBoost)
- Random Forest
- Multinomial Naive Bayes
- Bernoulli Naive Bayes
- Support Vector Machines
- Multi-layer Perceptron
- doc2vec
- keras

I used CountVectorizer with all of the models (apart from doc2vec). I also tried a Tf-idf transformer with most of the models. I gridsearched the Logistic Regression, Naive Bayes, and SVM models, and manually tuned the other models where I could. I tried using a number of different ngrams for Logistic Regression and Naive Bayes.

I used the stopwords from the NLTK corpus. As I wanted the model to predict based on just the style of speaking rather than indicators like party names I looked at the strongest coefficients from the initial modelling and added the following:
- 'conservative'
- 'conservatives'
- 'tory'
- 'tories'
- 'labour'
- 'jeremy'
- 'corbyn'
- 'george'
- 'may'
- 'pdf'
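
A minimal sketch of how the extended stopword list takes effect. `base_stopwords` is a tiny stand-in for the NLTK English list (the real list is much longer) and `remove_stopwords` is a hypothetical helper, so this shows the idea and not the notebook's code.

```python
# Tiny stand-in for NLTK's English stopword list (assumption: the real list is far longer).
base_stopwords = {'the', 'and', 'of', 'to', 'a', 'in', 'is'}

# Party and name indicators listed above.
party_indicators = {'conservative', 'conservatives', 'tory', 'tories', 'labour',
                    'jeremy', 'corbyn', 'george', 'may', 'pdf'}

stop = base_stopwords | party_indicators

def remove_stopwords(tokens, stopwords=stop):
    """Drop any token whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = 'The Labour party and the Tories disagree on the budget'.split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['party', 'disagree', 'on', 'budget']
```

One side effect worth knowing: removing 'may' to hide the surname also removes the modal verb and the month.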

#### Evaluation:

As this was a classification problem where I wanted accurate results for both the majority and the minority classes I ranked my models based on the F1 scores rather than just accuracy. There was a class imbalance of around 70/30 between Conservative and Labour speeches, and while I felt that I could get away without under/over sampling I decided to rank the models by the macro F1 average rather than the weighted average to ensure that the final model wasn't just good at predicting the majority class.
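
To make the difference concrete, here is a small worked example in plain Python. It uses the per-class F1 scores reported for the Multi-Layer Perceptron in the modelling results, together with an assumed 70/30 split of test speeches; the support counts are illustrative, not the actual test-set sizes.

```python
def macro_f1(f1_by_class):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_by_class.values()) / len(f1_by_class)

def weighted_f1(f1_by_class, support):
    """Mean of the per-class F1 scores weighted by the number of true samples per class."""
    total = sum(support.values())
    return sum(f1_by_class[c] * support[c] / total for c in f1_by_class)

f1_by_class = {'Conservative': 0.92, 'Labour': 0.77}  # reported per-class F1 scores
support = {'Conservative': 70, 'Labour': 30}           # assumed class split

macro = macro_f1(f1_by_class)
weighted = weighted_f1(f1_by_class, support)
print(round(macro, 3))     # 0.845
print(round(weighted, 3))  # 0.875
```

The weighted average is pulled towards the majority-class score, so the macro average is the stricter ranking criterion when the minority class matters equally.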

#### Modelling results:

- The sklearn Multi-Layer Perceptron performed the best, with a test accuracy of 0.88 (baseline was 0.72), an F1 score for the Conservative speeches of 0.92, and an F1 for the Labour speeches of 0.77.
- The second best model was the Logistic Regression using single words (unigrams). I pulled out the coefficient words and displayed them using wordclouds based on their magnitude.
- Increasing the ngrams had mixed results, but helped noticeably with the Multinomial Naive Bayes model, where an ngram range of 4-6 was optimal.
- Adding a Tf-idf transformer reduced the accuracy and F1 average scores in all cases. I believe this is because it added importance to words that indicated the speech was about a specific subject (e.g. the NHS), but because both parties made speeches about the same subjects this didn't help to differentiate them.
- The Doc2vec model outperformed a number of other untuned models, but I didn't have enough experience with it to optimise it.
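
For readers unfamiliar with ngram ranges, the sketch below shows which features a range of 4-6 produces from a short token list. It is a plain-Python illustration of the idea (`word_ngrams` is a hypothetical helper), not a reimplementation of CountVectorizer, which also handles tokenisation and counting.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return every run of min_n to max_n consecutive tokens, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

tokens = 'we will invest in schools and hospitals'.split()  # 7 tokens
features = word_ngrams(tokens, 4, 6)

print(len(features))  # 9: four 4-grams, three 5-grams, two 6-grams
print(features[0])    # we will invest in
```

With a range of 4-6 no single word is a feature on its own, so the model can only pick up on recurring phrases.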

#### Additional analysis 1: Vocabulary

I compared the number of distinct word stems used by various speakers in a randomised fixed length sample of their speeches. The samples I took indicated that Conservative speakers used more distinct words than Labour speakers, but a Bayesian comparison with PyMC3 showed that the 95% credible interval for the difference in means included zero, so the difference was not conclusive.
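
The sampling idea can be sketched in plain Python as below. It assumes the tokens have already been stemmed, and `distinct_in_sample` is a hypothetical helper; drawing a random sample without replacement is equivalent to shuffling and taking the first `sample_size` tokens.

```python
import random
from statistics import mean

def distinct_in_sample(tokens, sample_size, repeats=5, seed=0):
    """Average number of distinct tokens over several random fixed-length samples."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    counts = []
    for _ in range(repeats):
        sample = rng.sample(tokens, min(sample_size, len(tokens)))
        counts.append(len(set(sample)))
    return mean(counts)

tokens = ['tax', 'health', 'school', 'tax', 'tax', 'health']
# The whole list is sampled here, so every repeat finds the same 3 distinct tokens.
print(distinct_in_sample(tokens, sample_size=6))  # 3
```

Fixing the sample size matters because the number of distinct words grows with the amount of text, so speakers with more speeches would otherwise look more varied.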

#### Additional analysis 2: Topic analysis

I used LdaModel from gensim to generate 6 topics in the speeches, which I classified as:
- Global
- Business
- UK Society
- Crime
- Education
- Health

The Conservative speeches were mostly on Global or Business topics with UK Society being the third biggest group, while Labour speeches were mostly in the UK Society topic, with Global and then Business topics coming second and third.<br>
Crime and Education were about the same for both parties, but Labour seemed to make significantly more speeches about the NHS.<br>
When restricted to just Maiden speeches, UK Society speeches were most common and Global speeches second most common for both parties.
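
Because there are far more Conservative than Labour speeches, one way to compare the parties is by the share of each party's speeches in each topic instead of by raw counts. The sketch below shows that calculation on made-up records; `topic_shares` is a hypothetical helper and the data is purely illustrative.

```python
from collections import Counter, defaultdict

def topic_shares(speeches):
    """Map each party to the fraction of its speeches that fall in each topic."""
    counts = defaultdict(Counter)
    for party, topic in speeches:
        counts[party][topic] += 1
    return {party: {topic: n / sum(c.values()) for topic, n in c.items()}
            for party, c in counts.items()}

# Made-up (party, topic) pairs for illustration only.
speeches = [('Conservative', 'global'), ('Conservative', 'global'),
            ('Conservative', 'business'), ('Conservative', 'uk society'),
            ('Labour', 'uk society'), ('Labour', 'health')]

shares = topic_shares(speeches)
print(shares['Conservative']['global'])  # 0.5
print(shares['Labour']['health'])        # 0.5
```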

#### Progress blog:

I've written blog posts as I've gone through the stages of this project. I've aimed to make them both interesting to people with some data science knowledge and accessible to people without any:
#### <a href='https://mydsblog.home.blog'>mydsblog</a>

### Data:

In [None]:
'''
My scraping functions. The speeches were organised by speaker surname,
but the page for 'C' surnames was formatted differently so needed special handling.
There were no speeches for 'X' so I used ascii_lowercase from the string module, removed
the 'x', and then used this as my thread list.
This returned the text of the link to each speech (which contained the name of the speaker,
the year, and a description) and the body of the speech itself:
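'''
import multiprocessing as mp
import re
import threading

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm_notebook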

def get_speeches(url1, url2, letter, return_dict):
    if letter == 'c':
        r = requests.get(url2.format(letter))
        soup = BeautifulSoup(r.text, 'html.parser')
        content_list = soup.find('div', attrs={'class':'site-content'})
    else:
        r = requests.get(url1.format(letter))
        soup = BeautifulSoup(r.text, 'html.parser')
        content_list = soup.find('div', attrs={'class':'entry-content'})
    speeches = []
    for result in tqdm_notebook(content_list.findAll('a', href=True)):
        if result.text != 'Speeches Index':
            try:
                speech_url = result.get('href')
                speech_r = requests.get(speech_url)
                speech_soup = BeautifulSoup(speech_r.text, 'html.parser')
                speech_text = speech_soup.find('div', attrs={'class':'entry-content'}).text
                speeches.append([result.text,speech_text])
            except Exception:
                # record the failure against the link text so the speech can be dropped later
                speeches.append([result.text,'unable to scrape'])
    return_dict[letter]=speeches

    
def request_thread(url1, url2, letters):
    manager = mp.Manager()
    return_dict = manager.dict()
    threads = []
    for letter in tqdm_notebook(letters):
        thread = threading.Thread(name=letter, 
                                  target=get_speeches, 
                                  args=(url1, url2, letter, return_dict))
        thread.start()
        threads.append(thread)

    for t in threads:
        t.join()

    return return_dict

from string import ascii_lowercase

url1 = "http://www.ukpol.co.uk/speeches/speeches-{}/"
url2 = "http://www.ukpol.co.uk/speeches/{}/"

# every letter except 'x', which has no speeches
my_letters = ascii_lowercase.replace('x','')
        
speeches_dict = request_thread(url1, url2, my_letters)
speeches1 = pd.DataFrame([x for letter in speeches_dict.values() for x in letter])
speeches1.columns = ['title','speech']

In [None]:
'''
The speeches had headers containing the date of the speech and a description.
I extracted these headers:
'''

indicators = ['Below is the text',
              'Below is the speech',
              'The below speech',
              'The speech below',
              'Below is the Hansard record',
              'Below is a part of the speech',
              'This is the text',
              'This speech was made',
              'Speech made',
              'Speech given',
              'Below is the statement',
              'Below is the transcript',
              'Below is a transcript',
              'Below is a text',
              'Below is the Q&A',
              'Below is the Passover message',
              'Below is the Christmas message',
              'Below is the 2013 Christmas message',
              'This speech was given',
              'Transcript of speech',
              'Transcript of press conference']

def find_header(text):
    text_split = text.split('\n')
    for indicator in indicators:
        for i, subtext in enumerate(text_split):
            if indicator in subtext:
                return i
    return -1

def get_header(text):
    header_index = find_header(text)
    if header_index > -1:
        return text.split('\n')[header_index]
    else:
        return 'No header'

def get_text(text):
    header_index = find_header(text)
    # Return the speech still separated by \n so that further features can be split out later.
    # If no header was found (index -1), this returns the whole text.
    return '\n'.join(text.split('\n')[header_index+1:])

uk_pol_clean = speeches1.copy()
uk_pol_clean.columns = ['title','raw_speech']
uk_pol_clean['description'] = uk_pol_clean.raw_speech.map(lambda x: get_header(x))
uk_pol_clean['text'] = uk_pol_clean.raw_speech.map(lambda x: get_text(x))

In [None]:
'''
I then extracted the speaker name, year, and subject from the titles:
'''
def get_speaker(text):
    longname = text.split('–')[0]
    if '20' in longname or '19' in longname or '18' in longname:
        name = ('{} {}').format(longname.split(' ')[1].strip().replace('-',''),
                                longname.split(' ')[0].strip().replace(',',''))
    elif ',' in longname:
        name = ('{} {}').format(longname.split(',')[1].strip(),
                                longname.split(',')[0].strip())
    else:
        name = longname.strip()
    return name

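# Assumption: year_ident isn't defined in this extract; this pattern
# (a four-digit year from 1800 to 2099) is a reconstruction.
year_ident = re.compile(r'(?:18|19|20)[0-9]{2}')

def get_year(text):
    result = re.findall(year_ident, text)
    if result:
        year = int(result[0])
    else:
        year = 0
    return year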

subject_prefixes = ['Speech in Tribute to ','Tribute to ','Speech on ','Speech to ','Speech about ',
                    'Speech following ','Speech after ','Speech After ',
                    'Speech against ','Statement on ','Statement following ','Statement after ']

def get_subject(title, year):
    if year > 0:
        subject = title.split(str(year))[1].strip().replace(' the ',' ')
    else:
        subject = title.strip()
    if 'Maiden Speech' not in subject:
        for prefix in subject_prefixes:
            subject = subject.replace(prefix,'')
    else:
        subject = 'Maiden Speech'
    return subject

uk_pol_clean['speaker'] = uk_pol_clean.title.map(lambda x: get_speaker(x))
uk_pol_clean['year'] = uk_pol_clean.title.map(lambda x: get_year(x))
uk_pol_clean['year'] = uk_pol_clean.year.astype(int)
uk_pol_clean['subject'] = uk_pol_clean.apply(lambda x: get_subject(x['title'],x['year']), axis=1)

In [None]:
'''
I extracted the full date from the previously extracted header:
'''
full_date = re.compile(r'[0-9]{1,2}[stndrdth]{0,2} [A-Z][a-z]+ [0-9]{4}')

def find_full_date(text):
    dates = re.findall(full_date, text)
    if dates:
        date = dates[0]
    else:
        date = np.nan
    return date

uk_pol_clean['date'] = uk_pol_clean.description.map(lambda x: find_full_date(x))

In [None]:
'''
I used the Wikipedia api to get the categories associated with each speaker's page.
I discarded any speaker for whom I couldn't identify exactly one party.
'''
party_dict = {'Conservative Party':'Conservative',
              'Labour Party':'Labour',
              'Labour Co-operative':'Labour',
              'Liberal Democrat':'Lib Dem',
              'Liberal Party':'Lib Dem',
              'Green Party':'Greens',
              'Scottish National Party':'SNP',
              'Plaid Cymru':'Plaid Cymru',
              'Sinn Féin':'Sinn Féin',
              'Ulster Unionist Party':'UUP',
              'Democratic Unionist Party':'DUP',
              'UK Independence Party':'UKIP',
              'Trades Union':'Labour'}

import wikipedia

def get_party(name):
    # try the plain name first, then the common Wikipedia disambiguation suffixes
    suffixes = ['', ' (politician)', ' (British politician)',
                ' (Labour politician)', ' (Northern Ireland politician)']
    wiki_cats = []
    for suffix in suffixes:
        try:
            wiki_cats = wikipedia.WikipediaPage(title=name + suffix).categories
            break
        except Exception:
            # page not found or ambiguous: fall through to the next suffix
            continue
    wiki_cats_joined = ', '.join(wiki_cats)
    party_count=0
    party_name = 'No name'
    for ref, party in party_dict.items():
        if ref in wiki_cats_joined:
            if party != party_name:
                party_count+=1
            party_name = party
    return name, party_name, party_count

def get_parties(index_group, indexes, return_dict):
    results = []
    for ind in tqdm_notebook(indexes):
        name = speakers.iloc[ind,0]
        speech_count = speakers.iloc[ind,1]
        name, party_name, party_count = get_party(name)
        results.append([name, speech_count, party_name, party_count])
    return_dict[index_group]=results
    
def request_thread(index_groups):
    manager = mp.Manager()
    return_dict = manager.dict()
    threads = []
    for index_group, indexes in tqdm_notebook(index_groups.items()):
        thread = threading.Thread(name=index_group, 
                                  target=get_parties, 
                                  args=(index_group, indexes, return_dict))
        thread.start()
        threads.append(thread)

    for t in threads:
        t.join()

    return return_dict

# get a list of speakers that I'd extracted from the scraped speech data
speakers = pd.read_csv('speakers.csv')

# split the speaker names up into chunks of 50 so I can thread the Wikipedia requests
name_index_groups = range(speakers.shape[0],0,-50)
indexes = {group:[i for i in range(0,group) if group-i<51] for group in name_index_groups}

parties_dict = request_thread(indexes)
speaker_parties = pd.DataFrame([x for group in parties_dict.values() for x in group],columns=['name','speech_count','party','party_count'])

In [None]:
'''
I wrote a function to tokenise the speeches and compare against 'words' list from NLTK:
'''

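import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

# Assumption: 'stop' isn't defined in this extract; here it is the NLTK English
# stopwords plus the party indicators listed in the summary.
stop = stopwords.words('english') + ['conservative','conservatives','tory','tories','labour',
                                     'jeremy','corbyn','george','may','pdf']

words = set(nltk.corpus.words.words())

words_join = ' '.join(words)

def get_tokens(text):
    tokens_text = []
    for i in word_tokenize(text.lower()):
        try:
            # keep the token if its stem appears anywhere in the joined 'words' corpus
            if i not in stop and re.search((stemmer.stem(i) + r'[a-z]*'), words_join):
                tokens_text.append(i)
        except re.error:
            # tokens containing regex special characters (e.g. brackets) are skipped
            pass
    return tokens_text

# get_tokens takes a single argument, so map it over the text column
uk_pol_df['tokenised'] = uk_pol_df.text.map(get_tokens)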

### Modelling:

In [None]:
'''
I used a pipeline to quickly train the sklearn models and return scores.
I could have written this into a function but in the end I just copied and pasted
the code for different models:
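'''
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
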
train_df, test_df = train_test_split(uk_pol_tokens, 
                                     stratify=uk_pol_tokens['party'], 
                                     test_size=0.3, random_state=1)

pipeline = Pipeline([
    ('vect', CountVectorizer(lowercase=True, strip_accents='unicode', stop_words=stop)),
    ('logreg', LogisticRegression(solver='lbfgs'))
]) 

pipeline.fit(train_df.tokenised, train_df.party)

print(pipeline.score(train_df.tokenised, train_df.party))
print(cross_val_score(pipeline, train_df.tokenised, train_df.party, cv=5).mean())
print(pipeline.score(test_df.tokenised, test_df.party))

predictions = pipeline.predict(test_df.tokenised)

print()
print(classification_report(test_df.party, predictions))

pd.DataFrame(confusion_matrix(test_df.party, predictions,
                              labels=test_df.party.unique()),
             columns=test_df.party.unique(),
             index=test_df.party.unique())

In [None]:
'''
For gridsearching I dropped the pipeline and created pre-vectorized train and test sets:
'''
X_train = train_df.tokenised
X_test = test_df.tokenised
y_train = train_df.party
y_test = test_df.party

# basic count vectorizer
cvec = CountVectorizer(lowercase=True, stop_words=stop, strip_accents='unicode')
X_train_v = cvec.fit_transform(X_train)
X_test_v = cvec.transform(X_test)

# vectorizer with 4-6 ngrams (fit and transform with cvec_ngrams, not the basic cvec)
cvec_ngrams = CountVectorizer(lowercase=True, stop_words=stop, ngram_range=(4,6), strip_accents='unicode')
X_train_n = cvec_ngrams.fit_transform(X_train)
X_test_n = cvec_ngrams.transform(X_test)

# binary vectorizer for the Bernoulli Naive Bayes model
bvec = CountVectorizer(lowercase=True, stop_words=stop, binary=True)
X_train_b = bvec.fit_transform(X_train)
X_test_b = bvec.transform(X_test)

# Tf-idf vectorizer
tfidf = TfidfVectorizer(stop_words=stop)
X_train_t = tfidf.fit_transform(X_train)
X_test_t = tfidf.transform(X_test)

In [None]:
'''
Example of gridsearched model:
'''
clf = MultinomialNB()

gs_params = {'alpha':np.linspace(0.1,3,20)}

gs = GridSearchCV(clf,
                  gs_params,
                  cv=5,
                  n_jobs=-1,
                  verbose=1)

gs.fit(X_train_v, y_train)

print('Best score: {}'.format(gs.best_score_))
print('Best params: {}'.format(gs.best_params_))
print()
clf = gs.best_estimator_

clf.fit(X_train_v, y_train)

print(clf.score(X_train_v, y_train))
print()
scores = cross_val_score(clf, X_train_v, y_train, cv=5)
print(scores.mean())
print()
print(clf.score(X_test_v, y_test))
print()
predictions = clf.predict(X_test_v)
print(classification_report(y_test, predictions))

pd.DataFrame(confusion_matrix(y_test, predictions),
                   index=['con_actual','lab_actual'],
                   columns=['con_pred','lab_pred'])

In [None]:
'''
Final best model - MLP Classifier.
I tuned it manually as I wasn't sure how gridsearch would react to it, and it's slow to fit:
'''
pipeline = Pipeline([
    ('vect', CountVectorizer(lowercase=True, strip_accents='unicode', stop_words=stop)),
    ('clf', MLPClassifier(activation='relu', 
                                alpha=1e-6, 
                                hidden_layer_sizes=(100,100), 
                                random_state=1))
]) 

pipeline.fit(train_df.tokenised, train_df.party)

print(pipeline.score(train_df.tokenised, train_df.party))
print(cross_val_score(pipeline, train_df.tokenised, train_df.party, cv=5).mean())
print(pipeline.score(test_df.tokenised, test_df.party))

predictions = pipeline.predict(test_df.tokenised)

print()
print(classification_report(test_df.party, predictions))

pd.DataFrame(confusion_matrix(test_df.party, predictions,
                              labels=test_df.party.unique()),
             columns=test_df.party.unique(),
             index=test_df.party.unique())

### Modelling Results:

#### I got the following F1 scores, ranked by macro F1 average:
![Unknown.png](attachment:Unknown.png)
#### The Multi-Layer Perceptron had the best results and gave the following confusion matrix for the test set (columns are predicted, rows are actual):
![Screenshot%202019-01-17%20at%2016.53.02.png](attachment:Screenshot%202019-01-17%20at%2016.53.02.png)

#### Wordclouds from the strongest coefficients from the Logistic Regression model (2nd best):
#### Conservatives:
![Unknown.png](attachment:Unknown.png)
#### Labour:
![Unknown-1.png](attachment:Unknown-1.png)

### Vocabulary:

In [None]:
'''
I wrote a function to concatenate the tokenised speeches for each speaker
and return a count of unique stems for a fixed sample size:
'''
from random import shuffle

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def unique_words_sample(name, sample_size, df):
    speaker_text = []
    for i, row in df[df.speaker==name].iterrows():
        # the column is named 'tokenised' everywhere else in this notebook
        speaker_text += row.tokenised

    # Shuffling the text smoothed out the graphs, which were looking a bit wobbly.
    # Re-running gave slightly different results, so I run it 5 times and take the mean.
    samples = []
    samples_unique = []
    for sample in range(5):
        
        shuffle(speaker_text)
        stemmed_text = [stemmer.stem(word) for word in speaker_text]
        text_sample = stemmed_text[:sample_size]
        samples.append(len(text_sample))
        samples_unique.append(len(set(text_sample)))
    
#   need to check that we've actually got the required number of words after stemming
    word_count = int(np.mean(samples))
    unique_count = int(np.mean(samples_unique))
    party = df[df.speaker==name].iloc[0].party
    
    return party, word_count, unique_count

vocab = {}
vocab['name'] = []
vocab['party'] = []
vocab['word_count'] = []
vocab['unique_count'] = []

for name in vocab_speakers:
    party, word_count, unique_count = unique_words_sample(name, 15000, uk_pol_final)
    vocab['name'].append(name)
    vocab['party'].append(party)
    vocab['word_count'].append(word_count)
    vocab['unique_count'].append(unique_count)
    
vocab_df = pd.DataFrame(vocab)

# sorting the results into bins and ranking them
vocab_df['bin'] = pd.cut(vocab_df.unique_count,20)
vocab_df['bin_mid'] = vocab_df.bin.map(lambda x: np.round(x.mid,-2))
vocab_df['ranking'] = vocab_df.sort_values('unique_count').groupby('bin_mid').cumcount()
# transform keeps the original index, so the result aligns with vocab_df on assignment
vocab_df['st_rank'] = vocab_df.groupby('bin_mid').ranking.transform(lambda x: x-x.mean())

In [None]:
'''
Plotting the results:
'''
import matplotlib.pyplot as plt
from matplotlib import rcParams

color_dict = {'Conservative':'#0087DC',
              'Labour':'#DC241f',
              'Lib Dem':'#FAA61A'}

color_list = vocab_df.party.map(lambda x: color_dict[x])

rcParams['axes.titlepad']=20

# note: I had to manually adjust the width and height to get pleasing separation between
# the tiles on the graph, but it should be fairly straightforward to set them automatically
# based on the number of tiles in the largest bin and the difference between the highest
# and lowest unique word count.
fig, ax = plt.subplots(figsize=(32,16))

for i, row in vocab_df.iterrows():
    marker_name = '\n'.join(row['name'].split(' '))
    ax.scatter(x=row.bin_mid, 
               y=row.st_rank, 
               color=color_dict[row.party], 
               s=7000, marker='s')
    plt.annotate(marker_name, xy=(row.bin_mid, row.st_rank), 
                 horizontalalignment='center',
                 verticalalignment='center',fontsize='x-large',color='white',fontweight='bold')

bins = vocab_df.bin_mid.sort_values().unique().astype(int)
bin_ticks = np.arange(bins.min(),bins.max()+1,200)

ax.set_yticks([])
ax.set_title('Unique words in 15,000 word sample', fontsize=30)
ax.set_xticks(bin_ticks)
ax.set_xticklabels(bin_ticks, fontsize=20)
ax.xaxis.grid(True)
ax.set_axisbelow(True)

plt.show();

#### Graph of unique words for various speakers based on a 15,000 word sample:
![Unknown.png](attachment:Unknown.png)

In [None]:
'''
I took unique counts for a sample size of 4,000 words from as many Labour and Conservative
speakers as possible and compared them. I ended up with unique word counts for
78 Conservative speakers and 30 Labour speakers. Means of the samples were around 1400,
standard deviations around 150.
'''
import pymc3 as pm

conlab_mean = vocab_df2_top.unique_count.mean()
conlab_std = vocab_df2_top.unique_count.std()

std_prior_lower = 1.
std_prior_upper = 1000.

con_unique = vocab_df2_top[vocab_df2_top.party=='Conservative'].unique_count
lab_unique = vocab_df2_top[vocab_df2_top.party=='Labour'].unique_count

with pm.Model() as conlab_model:

    con_mean = pm.Normal('con_mean', mu=conlab_mean, sd=conlab_std)
    lab_mean = pm.Normal('lab_mean', mu=conlab_mean, sd=conlab_std)
    
    con_std = pm.Uniform('con_std', lower=std_prior_lower, upper=std_prior_upper)
    lab_std = pm.Uniform('lab_std', lower=std_prior_lower, upper=std_prior_upper)
    
    con = pm.Normal('con', mu=con_mean, sd=con_std, observed=con_unique)
    lab = pm.Normal('lab', mu=lab_mean, sd=lab_std, observed=lab_unique)
    
    diff_of_means = pm.Deterministic('mean_diff', con_mean - lab_mean)
    
with conlab_model:
    
    trace = pm.sample(5000)

#### Graph of mean_diff:
![Unknown.png](attachment:Unknown.png)

### Topic Analysis:

In [None]:
'''
Generated 6 topics:
'''
import gensim
from gensim import corpora

doc_clean = [text for text in uk_pol_tokens.tokenised]
dictionary = corpora.Dictionary(doc_clean)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics=6, id2word = dictionary, passes=50)

topics = ldamodel.print_topics(num_topics=6, num_words=8)
topic_ident = re.compile(r'[a-z]+')

def get_topic(text):
    '''function to return the cleaned topic words list for a text'''
    bowvector = dictionary.doc2bow(text)
    topic_no = sorted(ldamodel[bowvector], key=lambda tup: -1*tup[1])[0][0]
    topic_list_dirty = ldamodel.print_topics(num_topics=6, num_words=8)[topic_no][1].split('+')
    topic_list_clean = []
    for topic in topic_list_dirty:
        topic_list_clean.append(' '.join(re.findall(topic_ident,topic)))
    return ' '.join(topic_list_clean)

uk_pol_tokens['topic'] = uk_pol_tokens.tokenised.map(lambda x: get_topic(x))

In [None]:
'''
Assigned topic titles based on the topic words:
'''
topic_dict = {'eu uk world would us europe people trade':'global',
              'new uk government also world business year investment':'business',
              'people government country work one need make want':'uk society',
              'police crime people public women also work policing':'crime',
              'schools education people children school young work want':'education',
              'nhs health services care patients public service new':'health'}

uk_pol_topics['topic_title'] = uk_pol_topics.topic.map(lambda x: topic_dict[x])

#### Topics by party:
![Unknown.png](attachment:Unknown.png)

#### Maiden Speech topics by party:
![Unknown-1.png](attachment:Unknown-1.png)