# DISCOVERING AND ASSIGNING TOPICS FOR ADVOCACY CAMPAIGN PETITION MAILINGS USING LATENT DIRICHLET ALLOCATION (LDA)

### In this phase of my project I use machine learning models to cluster SumOfUs mailings into issue topics. Then, for each topic, I assign a percentage to each campaign, based on the probability that the topic relates to that campaign.  The algorithm I am using is called Latent Dirichlet Allocation (LDA).

### The LDA algorithm will process the text of 1399 SumOfUs advocacy mailings and discover the top 8 issues, or topics, in SumOfUs campaigns.  The LDA will then go back through each mailing and assign it a percentage for each of the 8 topics.  Some campaigns pertain to only one topic, while others might relate to several topics in differing degrees.

### After I assign the percentages to each campaign, I will go on to the next phase, where I will combine this information with other data about the campaign, such as virality, regional distribution, and the very early behavior of new joiners on that campaign.  All of this will go into a different model, called a logistic regression, that will help me to predict the donation propensity of a new cohort only a week after joining the list.

In [28]:
#import the first Python modules I will be using in the project
import pandas as pd  #for working with data in table form within Python
import numpy as np   #important math and logic functions
from bs4 import BeautifulSoup # a module that works with HTML
import nltk  # import the Natural Language Toolkit, which includes various tools for text analysis
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize, PunktSentenceTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import *
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models # LDA model
import pyLDAvis.gensim #visualization
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) #hide ugly warnings when libraries are slightly out of date

### First, I read in the CSV file that I prepared in the Data Wrangling phase.  The file contains the page id, the name of each mailing tag, and the full HTML of each mailing. Since each mailing can have more than one tag, most mailings have multiple rows in the file.

In [2]:
camp_txt = pd.read_csv('../capstone/page_mailing_selected.csv', encoding = "ISO-8859-1")  #import CSV as a Pandas table
camp_txt.tail(10)  #displays the last 10 rows of the table

Unnamed: 0,page_id,tag_name,html
2092,16154,#Environment,<div style=width: 320px; float: right;>\r\n<ta...
2093,16148,#Environment,<div style=width: 320px; float: right;>\r\n<ta...
2094,16160,bees,<div style=width: 320px; float: right;>\r\n<ta...
2095,16160,#Environment,<div style=width: 320px; float: right;>\r\n<ta...
2096,16416,#Environment,<p>Dear {{ user.first_name|capfirst|default:Fr...
2097,16458,#Workers_Rights,<div style=width: 320px; float: right;>\r\n<ta...
2098,16479,#Workers_Rights,<div style=width: 320px; float: right;>\r\n<ta...
2099,16736,#Privatization_and_Political_Meddling,<div style=width: 320px; float: right;>\r\n<ta...
2100,16940,oil company,<table style=width: 100%; max-width: 300px; ma...
2101,16940,#Environment,<table style=width: 100%; max-width: 300px; ma...


### In order for my program to work, I need each mailing to be in a single row, not multiple rows for every tag. I do this by taking every tag for a single mailing and turning it into a list.  Then I can 'flatten' the current table.  Now, instead of 2101 rows, my table only has 1399 rows, one for each mailing.

In [3]:
flat = camp_txt.copy()  #copy the current dataframe into a new one (I found this helpful for troubleshooting)

def flatten_frame(df,col):  #df= DataFrame, col= the column to be flattened; in our case, 'tag_name'
    df[col] = df[col].fillna('') #if there are no tags, add an empty string
    headers = list(df.columns.values) #pull in the list of columns
    group_cols = list(set(headers) - set([col])) #get the cols to group on by subtracting the one we are flattening
    df = pd.DataFrame(df.groupby(by=(group_cols))[col].apply(list)).reset_index() #group by all the columns except the tags
    df[col] = df[col].apply(lambda x: ', '.join(x)) #convert the flattened col of tags into a string
    return df

flat = flatten_frame(flat,'tag_name') # executes the function defined above
flat = flat[['page_id','tag_name','html']].sort_values(by='page_id')  #sorts the new flattened table
flat = flat.reset_index(drop=True) #needed so the index keeps the same order of the page_ids

pd.options.display.max_colwidth = 110
flat.tail(10)  #displays the last 10 rows of the table

Unnamed: 0,page_id,tag_name,html
1390,16102,#Womens_Rights,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=0 cellspacing=0...
1391,16118,"taxes, #Economic_Justice, #trade",<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=0 cellspacing=0...
1392,16148,#Environment,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=0 cellspacing=0...
1393,16154,#Environment,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=0 cellspacing=0...
1394,16160,"bees, #Environment",<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=0 cellspacing=0...
1395,16416,#Environment,<p>Dear {{ user.first_name|capfirst|default:Friend }}</p>\r\n<p><strong>Water issues are sure heating up a...
1396,16458,#Workers_Rights,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=0 cellspacing=0...
1397,16479,#Workers_Rights,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=0 cellspacing=0...
1398,16736,#Privatization_and_Political_Meddling,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=0 cellspacing=0...
1399,16940,"oil company, #Environment",<table style=width: 100%; max-width: 300px; margin-left: 10px; margin-right: 10px; border=0 cellspacing=0 ...
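### To make the flattening step concrete, here is a minimal, dependency-free sketch of the same idea on three made-up rows. The `flatten_tags` helper and the sample rows are hypothetical illustrations; the real work is done by the pandas `flatten_frame` function above.

```python
from collections import OrderedDict

def flatten_tags(rows):
    """Collapse (page_id, tag, html) rows into one row per mailing, with the tags joined."""
    grouped = OrderedDict()
    for page_id, tag, html in rows:
        # group on everything except the tag, just like flatten_frame does
        grouped.setdefault((page_id, html), []).append(tag or '')
    return [(page_id, ', '.join(tags), html) for (page_id, html), tags in grouped.items()]

# made-up sample rows: one mailing with two tags, one mailing with a single tag
rows = [
    (16160, 'bees', '<div>bee mailing</div>'),
    (16160, '#Environment', '<div>bee mailing</div>'),
    (16416, '#Environment', '<p>water mailing</p>'),
]

flat_rows = flatten_tags(rows)
for row in flat_rows:
    print(row)
```

### Three input rows become two output rows, and the two tags for mailing 16160 end up in a single comma-separated string.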


### Now each campaign is on a single row, but in order to analyze the mailing text, I have to extract plain text from all that ugly HTML/Django. I do this with the help of a module called 'Beautiful Soup'. I also have a separate function that takes out all the punctuation from both the html and tag_name fields; I will need that done before the next stage of the analysis.

In [4]:
clean = flat.copy() #copy the results from the last step into a new frame

def clean_soup(df,old_col,new_col): # df=DataFrame, old_col=the column with the dirty HTML, new_col=the column for the clean text
    for index, item in df[old_col].items(): #go row by row through the column
        soup = BeautifulSoup(item, "lxml") #turn the current item into a BeautifulSoup object
        washed = soup.get_text(" ",strip=True) #get text from the soup object and store the text in the washed variable
        df.at[index,new_col] = washed #update the clean data frame with the washed text
    df[new_col] = df[new_col].str.replace(r'{(.+)}', ' ', regex=True) #remove the django tags
    return df

def remove_punc(df,old_col,new_col):
    df[new_col] = df[old_col].str.replace(r'[^\w\s]', ' ', regex=True) #replaces most punctuation with spaces
    df[new_col] = df[new_col].str.replace(r'[_]', ' ', regex=True) #replaces underscores with spaces
    return df

clean['text_clean'] ='' #a new column for storing our squeaky-clean text
clean['tags_clean'] = '' #a new column for storing our squeaky-clean tags

clean = clean_soup(clean,'html','text_clean')  #run my function to clean the html
clean = remove_punc(clean,'text_clean','text_clean')  #run my function to remove punctuation from the text
clean = remove_punc(clean,'tag_name','tags_clean') #run my function to remove punctuation from the tags

pd.options.display.max_colwidth = 95
clean[['page_id','html','text_clean']].tail(10)

Unnamed: 0,page_id,html,text_clean
1390,16102,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=...,Craigslist allow exploitative adverts offering homeless women accommodation for sex Hundre...
1391,16118,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=...,Apple is avoiding more than 13 billion in taxes And that s not even the half of it Deman...
1392,16148,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=...,The world s two biggest greenhouse gas polluters have now ratified the Paris Agreement so...
1393,16154,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=...,The world s two biggest greenhouse gas polluters have now ratified the Paris Agreement so...
1394,16160,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=...,Regulators are rolling out a controversial pesticide banned in Europe to kill Zika mosquito...
1395,16416,<p>Dear {{ user.first_name|capfirst|default:Friend }}</p>\r\n<p><strong>Water issues are su...,Dear Great news After 225000 of us spoke out against the mismanagement of the precious w...
1396,16458,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=...,Foreign worker lists That s what this government wants to make all firms publish Show tha...
1397,16479,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=...,Asos is coming under fire for treating its workers like machines Tell Asos to treat its ...
1398,16736,<div style=width: 320px; float: right;>\r\n<table style=width: 300px; margin: 10px; border=...,Corporations have the government s ear when it comes to Brexit it promised Nissan tariff ...
1399,16940,<table style=width: 100%; max-width: 300px; margin-left: 10px; margin-right: 10px; border=0...,Urgent The Trudeau government could approve the Kinder Morgan tar sands pipeline in three ...
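### One caveat about the Django-tag pattern used in `clean_soup`: `.+` is greedy, so when a line contains more than one tag it removes everything from the first `{` to the last `}`. The sketch below compares it with a non-greedy alternative on a made-up sentence. The sample string and the non-greedy pattern are assumptions for illustration only; the results in this notebook come from the pattern shown in `clean_soup`.

```python
import re

# Made-up mailing text with two Django template tags on one line
sample = "Dear {{ user.first_name|capfirst|default:Friend }} thank you {{ user.city }} for signing"

# Greedy: matches from the first '{' to the last '}' on the line
greedy = re.sub(r'{(.+)}', ' ', sample)

# Non-greedy (assumed alternative): removes each {{ ... }} tag separately
non_greedy = re.sub(r'\{\{.*?\}\}', ' ', sample)

# Collapse runs of whitespace so the difference is easy to read
greedy = ' '.join(greedy.split())
non_greedy = ' '.join(non_greedy.split())

print(greedy)      # Dear for signing
print(non_greedy)  # Dear thank you for signing
```

### With the greedy pattern, the words between the two tags are lost along with the tags themselves.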


### Finally we have clear, readable text! However, not all of these words are salient for my purposes; many won't help my algorithm figure out which campaigns belong to which topics.

### At this point, I need to start whittling down this text to the words that will be most helpful for our analysis. One thing I consider is the part of speech.  Things like prepositions and conjunctions will confuse the LDA when it tries to discern discrete topics.  Verbs and adverbs are largely unhelpful too.

### A Python module called the Natural Language Toolkit (NLTK) will help me filter all the words in the mailing by part of speech.  After some trial and error, I decide to keep only singular and plural common nouns.  I do not include proper nouns, which strips out the names of the corporations we target.

In [5]:
filtered = clean[['page_id','text_clean','tags_clean']].copy() #copy relevant results of last step into a new frame

def filter_camp_by_pos(df,old_col,new_col,pos_codes): #function to filter the campaigns by part of speech
    df[new_col] = df.apply(lambda row: nltk.word_tokenize(row[old_col]), axis=1) #tokenize (pre-process) each word
    df[new_col] = df[new_col].apply(lambda row: nltk.pos_tag(row)) #tag each word with the part of speech
    df[new_col] = [[word.lower() for word, tag in row if tag in pos_codes] for row in df[new_col]]  #select all words matching my filter codes
    return df

pos_codes = ['NN','NNS']  #just singular and plural common nouns

filter_camp_by_pos(filtered,'text_clean','text_filtered',pos_codes)  #runs the function I defined above

pd.options.display.max_colwidth = 80
filtered[['page_id','text_clean','text_filtered']].tail(10)  #displays last 10 rows
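### The filtering logic itself is easy to see on a tiny example. In this sketch the (word, tag) pairs are tagged by hand with Penn Treebank codes, standing in for the output of `nltk.pos_tag`; the sentence and its tags are made up for illustration.

```python
# Hand-tagged tokens standing in for nltk.pos_tag output (assumed, not real tagger output)
tagged = [
    ('Regulators', 'NNS'),  # plural noun
    ('are', 'VBP'),         # verb
    ('spraying', 'VBG'),    # verb
    ('a', 'DT'),            # determiner
    ('pesticide', 'NN'),    # singular noun
    ('near', 'IN'),         # preposition
    ('Monsanto', 'NNP'),    # proper noun, deliberately excluded
    ('bees', 'NNS'),        # plural noun
]

pos_codes = {'NN', 'NNS'}  # same codes as the filter above

# keep only the lower-cased words whose tag is in the filter list
kept = [word.lower() for word, tag in tagged if tag in pos_codes]
print(kept)  # ['regulators', 'pesticide', 'bees']
```

### The proper noun (tag `NNP`) is dropped along with the verbs, the determiner and the preposition.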

### Up until this point, I have kept the text of mailings separate from the tags, but now I want to grab the words from the 'tag_name' column and add them to the mailing text to be evaluated together.  I couldn't do that until after I ran the part of speech filter, because NLTK needs to have sentences in context in order to properly mark the part of speech; adding short phrases would have created problems.  From here on, I will be dealing with collections of words, where the order doesn't matter.  Because the tags were specifically chosen to convey topic information, I have chosen to give them three times the weight of the regular mailing text, based on trial and error.

In [6]:
merged = filtered[['page_id','text_clean','text_filtered','tags_clean']].copy() #copy results of last step into a new frame

def concat_cols(df, filtered_col, unfiltered_col, new_col, coef):
    weighted = ((df[unfiltered_col].str.lower()+' ')*coef) #multiply the unfiltered column by the desired coefficient
    df[new_col] = weighted.apply(lambda row: nltk.word_tokenize(row)) #tokenize (pre-process) each word 
    df[new_col] = df[new_col] + df[filtered_col]
    return df

merged['text_merged'] = '' # a new column to store the combined result of text and tags                     

merged = concat_cols(merged,'text_filtered','tags_clean','text_merged',3)

pd.options.display.max_colwidth =70
merged[['page_id','text_filtered','tags_clean','text_merged']].tail(10)

Unnamed: 0,page_id,text_filtered,tags_clean,text_merged
1390,16102,"[adverts, women, accommodation, sex, hundreds, vile, ads, petition...",Womens Rights,"[womens, rights, womens, rights, womens, rights, adverts, women, a..."
1391,16118,"[taxes, half, tax, evasion, petition, secret, technology, giant, p...",taxes Economic Justice trade,"[taxes, economic, justice, trade, taxes, economic, justice, trade,..."
1392,16148,"[world, s, greenhouse, gas, polluters, excuse, fossil, fuel, indus...",Environment,"[environment, environment, environment, world, s, greenhouse, gas,..."
1393,16154,"[world, s, greenhouse, gas, polluters, governments, excuse, govern...",Environment,"[environment, environment, environment, world, s, greenhouse, gas,..."
1394,16160,"[regulators, pesticide, mosquitoes, bees, bees, regulators, sprayi...",bees Environment,"[bees, environment, bees, environment, bees, environment, regulato..."
1395,16416,"[news, mismanagement, water, water, rates, users, companies, miles...",Environment,"[environment, environment, environment, news, mismanagement, water..."
1396,16458,"[worker, lists, government, firms, plans, workers, workers, headli...",Workers Rights,"[workers, rights, workers, rights, workers, rights, worker, lists,..."
1397,16479,"[fire, workers, machines, workers, petition, cost, fashion, giant,...",Workers Rights,"[workers, rights, workers, rights, workers, rights, fire, workers,..."
1398,16736,"[corporations, government, tariff, access, market, corporations, b...",Privatization and Political Meddling,"[privatization, and, political, meddling, privatization, and, poli..."
1399,16940,"[government, sands, pipeline, days, dozens, organizations, hundred...",oil company Environment,"[oil, company, environment, oil, company, environment, oil, compan..."
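### The weighting trick is just repetition: each tag word is added to the bag of words three times. Here is a minimal sketch of that idea, using `str.split` as a stand-in for `nltk.word_tokenize` (an assumption that holds for these simple, punctuation-free tags). The `weight_tags` helper is hypothetical.

```python
def weight_tags(tags_clean, text_tokens, coef):
    """Repeat the lower-cased tag words coef times, then append the mailing's own tokens."""
    tag_tokens = tags_clean.lower().split()
    return tag_tokens * coef + text_tokens

# tags and first few filtered words in the style of mailing 16160
merged_tokens = weight_tags('bees Environment', ['regulators', 'pesticide'], 3)
print(merged_tokens)
# ['bees', 'environment', 'bees', 'environment', 'bees', 'environment', 'regulators', 'pesticide']
```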


### Next I need to assemble a list of words that I want to exclude from the analysis. NLTK has gotten rid of many of the words we don't need based on parts of speech, but other words are specific to SumOfUs use, and have to be listed manually.  Words like 'petition', 'corporation', and 'click' are not salient for us, even though they could be salient in another context. I started with a list of generic stopwords and then added to it by hand.  Using the frequency distribution module of NLTK to make it easier, I can see the most commonly used words in my dataset, and then add the ones I want to exclude to my stoplist.  I ran this step several times until I was left with a list of salient terms.

In [30]:
go = merged[['page_id','text_clean','text_merged']].copy() #copy results of last step into a new frame

def exclude_stopwords(df,old_col,new_col,display_num,export_num):
    stop_words = set(stopwords.words('english')) #load NLTK's generic English stopwords as the base stoplist
    df[new_col] = df[old_col].apply(lambda x:[word for word in x if word not in stop_words]) #remove the stop words
    freq = FreqDist(df[new_col].sum()) #count every remaining word across all mailings
    go_freq = freq.most_common(display_num) #the top words that are left by frequency, to display on screen
    export_freq = freq.most_common(export_num) #the top words that are left by frequency, to export for viz
    return df, go_freq, export_freq


go['text_go'] = '' #create a new column that will store the text once the stop words have been filtered
    
go, go_freq, export_freq = exclude_stopwords(go,'text_merged','text_go',20,5000)

go.to_csv('go.csv') #exporting the data at this stage, just in case I want to use it in visualizations later on

go_freq #I use this list to find common words that I don't want in my analysis, then manually add them to the stoplist

[('oil', 1179),
 ('food', 974),
 ('water', 788),
 ('tax', 678),
 ('protection', 665),
 ('climate', 628),
 ('health', 521),
 ('liberties', 512),
 ('trade', 498),
 ('media', 469),
 ('economic', 459),
 ('privatization', 457),
 ('meddling', 438),
 ('women', 427),
 ('gmos', 372),
 ('customers', 355),
 ('children', 310),
 ('coal', 292),
 ('farmers', 283),
 ('conditions', 283)]
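### The loop of "filter, count, inspect, extend the stoplist" can be sketched without NLTK. In this sketch the generic stopword set is a tiny stand-in for `stopwords.words('english')`, the custom words are the three examples named above, and the two documents are made up; `collections.Counter` plays the role of `FreqDist`.

```python
from collections import Counter

generic_stopwords = {'the', 'and', 'of', 'to', 'a'}       # stand-in for NLTK's English list
custom_stopwords = {'petition', 'corporation', 'click'}   # words specific to SumOfUs mailings
stop_words = generic_stopwords | custom_stopwords         # combined stoplist

# made-up token lists, one per mailing
docs = [
    ['petition', 'oil', 'pipeline', 'oil'],
    ['click', 'food', 'oil', 'corporation'],
]

# drop the stop words, then count what is left across all mailings
kept = [[word for word in doc if word not in stop_words] for doc in docs]
freq = Counter(word for doc in kept for word in doc)

print(freq.most_common(3))
```

### Any unhelpful word that still shows up near the top of the count is a candidate for the custom stoplist on the next pass.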

### Some words have the same root, such as 'work', 'working', and 'worker' or 'economy', 'economic' and 'economists'.  For our purposes, these words have basically the same meaning, and should be combined in order to properly represent the weight of each term in the campaign.  I apply three different 'stemmer' algorithms, one after another, in order to use the roots of words in my model.

In [8]:
# stem the filtered tokens
stemmed = go[['page_id','text_clean','text_go']].copy() #copy results of last step into a new frame

snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()
porter = PorterStemmer()

def stem(df,old_col,new_col):
    df[new_col] = df[old_col].apply(lambda x: [porter.stem(word) for word in x])
    df[new_col] = df[new_col].apply(lambda x: [snowball.stem(word) for word in x])
    df[new_col] = df[new_col].apply(lambda x: [lancaster.stem(word) for word in x])
    return df

stemmed['text_stemmed'] = ''
stemmed = stem(stemmed,'text_go','text_stemmed')

pd.options.display.max_colwidth =90
stemmed[['page_id','text_go','text_stemmed']].tail(10)

Unnamed: 0,page_id,text_go,text_stemmed
1390,16102,"[womens, womens, womens, adverts, women, accommodation, sex, vile, bans, adverts, wome...","[wom, wom, wom, advert, wom, accommod, sex, vil, ban, advert, wom, exchang, sex, wom, ..."
1391,16118,"[taxes, economic, trade, taxes, economic, trade, taxes, economic, trade, taxes, tax, e...","[tax, econom, trad, tax, econom, trad, tax, econom, trad, tax, tax, ev, secret, techno..."
1392,16148,"[greenhouse, gas, polluters, excuse, fuel, climate, climate, treaty, force, treaty, fu...","[greenh, ga, pollut, exc, fuel, clim, clim, treat, forc, treat, fuel, clim, forc, clim..."
1393,16154,"[greenhouse, gas, polluters, governments, excuse, governments, fuel, climate, governme...","[greenh, ga, pollut, govern, exc, govern, fuel, clim, govern, govern, treat, fuel, cli..."
1394,16160,"[bees, bees, bees, regulators, pesticide, mosquitoes, bees, bees, regulators, spraying...","[bee, bee, bee, reg, pesticid, mosquito, bee, bee, reg, spray, mosquito, reg, pesticid..."
1395,16416,"[mismanagement, water, water, rates, milestone, water, ecology, economy, conversation,...","[mism, wat, wat, rat, mileston, wat, ecolog, econom, conv, anch, med, wat, approv, com..."
1396,16458,"[lists, firms, headline, flames, racism, businesses, workforce, lists, doctors, visa, ...","[list, firm, headlin, flam, rac, bus, workforc, list, doct, vis, stud, workforc, badg,..."
1397,16479,"[machines, fashion, clothes, expense, warehouse, warehouse, water, toilet, breaks, per...","[machin, fash, clo, exp, wareh, wareh, wat, toilet, break, perform, second, turnov, il..."
1398,16736,"[privatization, meddling, privatization, meddling, privatization, meddling, tariff, br...","[priv, meddl, priv, meddl, priv, meddl, tariff, brexit, strategi, min, car, auto, nego..."
1399,16940,"[oil, oil, oil, pipeline, tar, pipeline, megaproject, messages, pipeline, pipeline, co...","[oil, oil, oil, pipelin, tar, pipelin, megaproject, mess, pipelin, pipelin, controvers..."
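### To show why stemming matters for the word counts, here is a deliberately crude sketch. The `toy_stem` function is a hypothetical suffix stripper written only for this illustration; it is not the Porter, Snowball or Lancaster algorithm, and it would handle real text badly.

```python
from collections import Counter

def toy_stem(word, suffixes=('ing', 'ers', 'er', 'es', 's')):
    """Strip the first matching suffix, as long as at least three letters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

tokens = ['work', 'working', 'worker', 'workers', 'tax', 'taxes']

before = Counter(tokens)                       # six different terms, one occurrence each
after = Counter(toy_stem(t) for t in tokens)   # related words merged onto one root

print(before)
print(after)
```

### Before stemming, every variant counts as its own rare term; afterwards the counts pile up on the shared root, which is what the topic model needs to see.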


### Now, at long last, I am finally able to get to the good stuff and build my model!  I use a module called gensim; the name is derived from “generate similar” because it generates topics based on the similarity of words within each group.  The first step is to take my painstakingly cleaned and filtered text and turn each mailing into a "bag of words", a count of each word that ignores word order.  The collection of all these bags is called a corpus, and I feed it into a Latent Dirichlet Allocation (LDA) model.  The model derives topics from the corpus based on which words tend to appear together.  It then returns the top four words in each topic, ranked by weight.  The coefficients represent the probability of that word within the topic as a whole.  This alone does not tell you that much about the topics, so in the next step, I generate an interactive graphic to explore the topics in more depth.

In [13]:
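def generate_lda(df,col,topics,words,num_passes):
    camp_corpus = df[col].tolist() #one list of stemmed tokens per mailing
    dictionary = corpora.Dictionary(camp_corpus) #map each unique token to an integer id
    corpus = [dictionary.doc2bow(text) for text in camp_corpus] #convert each mailing to (token id, count) pairs
    lda = models.ldamodel.LdaModel(corpus, num_topics=topics, id2word=dictionary, passes=num_passes) #train the model
    topics = lda.print_topics(num_topics=topics,num_words=words) #the top words and their weights for each topic
    return lda, corpus, dictionary, topics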

lda, corpus, dictionary, all_topics = generate_lda(stemmed,'text_stemmed', 8, 4, 60)

all_topics

[(0, '0.060*clim + 0.057*oil + 0.028*pipelin + 0.028*fuel'),
 (1, '0.045*wom + 0.033*med + 0.018*account + 0.017*viol'),
 (2, '0.123*tax + 0.045*econom + 0.013*cant + 0.012*ev'),
 (3, '0.066*wat + 0.050*oil + 0.034*min + 0.023*prison'),
 (4, '0.051*bee + 0.023*neon + 0.023*protect + 0.019*libert'),
 (5, '0.057*trad + 0.034*drug + 0.031*protect + 0.015*heal'),
 (6, '0.027*factor + 0.025*priv + 0.024*meddl + 0.018*safet'),
 (7, '0.077*food + 0.037*farm + 0.034*gmo + 0.025*heal')]
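### The "bag of words" that goes into the model is simply a list of (token id, count) pairs for each mailing. Here is a pure-Python sketch of that representation on two made-up mailings. The id numbering is a simplification: gensim's `Dictionary` may assign different ids, but the shape of the result is the same idea.

```python
from collections import Counter

# made-up stemmed token lists, one per mailing
docs = [
    ['oil', 'pipelin', 'oil'],
    ['tax', 'econom', 'tax', 'oil'],
]

# build a token -> integer id mapping in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def doc2bow(doc):
    """Return sorted (token id, count) pairs for one mailing."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

corpus = [doc2bow(doc) for doc in docs]

print(token2id)  # {'oil': 0, 'pipelin': 1, 'tax': 2, 'econom': 3}
print(corpus)    # [[(0, 2), (1, 1)], [(0, 1), (2, 2), (3, 1)]]
```

### Word order is gone at this point; only how often each word appears in each mailing is left for the model to work with.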

### Here I have generated an interactive visualization that will help me explore and understand each topic.  Each bubble represents a single topic. The size of the bubble indicates the relative proportion of 'tokens' (words) in the corpus that are related to that topic.  The bubbles are numbered by size in descending order and do not match the topic listing in the previous step.

### The topics are placed on a field with two axes: PC1 and PC2.  PC stands for Principal Component. Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize. The axes don't mean any one thing but represent a composite of features chosen algorithmically to maximize variation.  Which is to say, the placement of the bubbles on the field shows you how closely related the topics are to each other in terms of the words they contain.  

### A single word can be found in more than one topic when it is used in different contexts.  For example, the word 'oil' often appears with 'pipeline', and also often appears with 'palm'.  The algorithm makes guesses as to what topic a word belongs to, based on other words in the same campaigns.  When two topics share many terms they are shown on the field as overlapping.

### The blue bars on the right represent the overall frequency of each term in the corpus.  If you mouse over any bubble, red bars appear to the right.  The red bars represent the estimated number of times a given term was included in a given topic.  If you mouse over any word on the right, the bubbles change size to be proportional to the frequency of that term in each topic.

In [14]:
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis)

In [20]:
pyLDAvis.save_html(vis,'myLDAvis.html')  #saves out the HTML so I can use it later
prep_string = pyLDAvis.prepared_data_to_html(vis,'simple')

### Now that the topics have been generated, I need to apply them back to the campaigns.  The LDA algorithm looks at each campaign, and estimates the probability that the campaign is related to each of the topics.  Note that the topic numbers here relate to the original numbered list, not the numbers on the bubbles in the visualization.


In [24]:
assigned = stemmed[['page_id','text_clean','text_stemmed']].copy()  #copy results of last step into a new frame

def calc_freq(df,old_col,new_col,num):
    #store the num most common stemmed words for each mailing
    df[new_col] = df[old_col].apply(lambda tokens: FreqDist(tokens).most_common(num))
    return df

def assign_topics(df,old_col,new_col,min_prob):
    #convert each mailing to a bag of words, then ask the model for its topic mix
    #min_prob=0 keeps every topic in the result, even the ones with tiny probabilities
    df[new_col] = df[old_col].apply(
        lambda tokens: lda.get_document_topics(dictionary.doc2bow(tokens), minimum_probability=min_prob))
    return df
                              
assigned['freq'] = ''
assigned['topics'] = ''

assigned = calc_freq(assigned,'text_stemmed','freq',20)
assigned = assign_topics(assigned,'text_stemmed','topics',0).reset_index(drop=True)

pd.options.display.max_colwidth =180
assigned[['page_id','freq','topics']].tail(10)


Unnamed: 0,page_id,freq,topics
1390,16102,"[(wom, 8), (advert, 5), (sex, 4), (accommod, 3), (exchang, 2), (homeless, 2), (pic, 2), (vil, 2), (rent, 2), (hou, 1), (fem, 1), (pict, 1), (charit, 1), (ban, 1), (return, 1), ...","[(0, 0.00255316875486), (1, 0.982126754688), (2, 0.00255289156061), (3, 0.00255647606754), (4, 0.00255165578172), (5, 0.00255291122412), (6, 0.00255376009662), (7, 0.0025523818..."
1391,16118,"[(tax, 11), (trad, 3), (econom, 3), (account, 2), (period, 2), (investig, 2), (tril, 1), (secret, 1), (jurisdict, 1), (ev, 1), (aid, 1), (landmark, 1), (offsh, 1), (pict, 1), (...","[(0, 0.00304981314656), (1, 0.00305332203658), (2, 0.835371065206), (3, 0.00305041849157), (4, 0.00305140422491), (5, 0.146323395133), (6, 0.00305022771717), (7, 0.003050354043..."
1392,16148,"[(clim, 5), (fuel, 4), (forc, 3), (treat, 3), (pollut, 2), (med, 2), (emiss, 2), (greenh, 1), (govern, 1), (argu, 1), (breakthrough, 1), (exc, 1), (tomorrow, 1), (pledg, 1), (f...","[(0, 0.651482167389), (1, 0.136085514047), (2, 0.00378977389249), (3, 0.00379551976244), (4, 0.00378999162326), (5, 0.193475269085), (6, 0.00379059496915), (7, 0.00379116923282)]"
1393,16154,"[(govern, 8), (clim, 4), (fuel, 4), (pollut, 2), (med, 2), (forc, 2), (emiss, 2), (treat, 2), (greenh, 1), (breakthrough, 1), (exc, 1), (tomorrow, 1), (fal, 1), (pact, 1), (ple...","[(0, 0.63126443834), (1, 0.0327551105414), (2, 0.00338103459657), (3, 0.0033805386808), (4, 0.00338061207978), (5, 0.31907593973), (6, 0.0033813316843), (7, 0.00338099434699)]"
1394,16160,"[(bee, 9), (reg, 7), (beekeep, 3), (mosquito, 3), (chem, 3), (pollin, 2), (liv, 2), (pesticid, 2), (spray, 2), (childr, 1), (expos, 1), (issu, 1), (twist, 1), (child, 1), (toxi...","[(0, 0.00250180858339), (1, 0.0025030344139), (2, 0.00250262334425), (3, 0.199254788839), (4, 0.429245604062), (5, 0.00250258740794), (6, 0.00250358935542), (7, 0.358985963994)]"
1395,16416,"[(wat, 6), (rat, 2), (approv, 1), (mileston, 1), (anch, 1), (ecolog, 1), (commit, 1), (legisl, 1), (med, 1), (econom, 1), (conv, 1), (backlash, 1), (mism, 1), (cla, 1)]","[(0, 0.00595662743821), (1, 0.229361187545), (2, 0.00596783801044), (3, 0.734873521549), (4, 0.00596259037688), (5, 0.00596312122818), (6, 0.00595915873937), (7, 0.005955955113..."
1396,16458,"[(crim, 5), (hat, 4), (rac, 3), (list, 2), (workforc, 2), (street, 2), (clim, 1), (tre, 1), (firm, 1), (artic, 1), (tid, 1), (discrimin, 1), (disgust, 1), (ris, 1), (colleagu, ...","[(0, 0.027875296385), (1, 0.631960873362), (2, 0.00272031470219), (3, 0.00271850821104), (4, 0.00271803113451), (5, 0.00272317567568), (6, 0.326563955463), (7, 0.00271984506656)]"
1397,16479,"[(wareh, 4), (fash, 3), (factor, 2), (search, 1), (tre, 1), (toilet, 1), (tort, 1), (perform, 1), (machin, 1), (angor, 1), (monit, 1), (target, 1), (break, 1), (employ, 1), (jo...","[(0, 0.00320869900173), (1, 0.00321282924715), (2, 0.348745688389), (3, 0.0681157192622), (4, 0.163762500725), (5, 0.114708551826), (6, 0.295039559291), (7, 0.003206452257)]"
1398,16736,"[(meddl, 3), (priv, 3), (taxpay, 2), (sort, 2), (strategi, 2), (tax, 2), (effect, 2), (tre, 1), (dark, 1), (favo, 1), (brexit, 1), (transp, 1), (tariff, 1), (favourit, 1), (bil...","[(0, 0.00284468194864), (1, 0.0797222488117), (2, 0.342432142476), (3, 0.102517134997), (4, 0.00284226229498), (5, 0.0868071749118), (6, 0.109035451351), (7, 0.273798903209)]"
1399,16940,"[(pipelin, 8), (oil, 3), (govern, 3), (street, 2), (panel, 1), (clim, 1), (wal, 1), (scal, 1), (pop, 1), (defend, 1), (megaproject, 1), (pri, 1), (permiss, 1), (land, 1), (prot...","[(0, 0.780314219013), (1, 0.201805200161), (2, 0.00297988344469), (3, 0.00298507117049), (4, 0.00297916429427), (5, 0.00297941472224), (6, 0.00297819763017), (7, 0.002978849563..."
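### Each entry in the 'topics' column is a list of (topic number, probability) pairs that add up to roughly 1. As a small sketch of how to read one, the snippet below takes the values shown above for mailing 16148 and picks out its dominant topic. The variable names are just for illustration.

```python
# (topic number, probability) pairs copied from the row for page_id 16148 above
doc_topics = [
    (0, 0.651482167389), (1, 0.136085514047), (2, 0.00378977389249),
    (3, 0.00379551976244), (4, 0.00378999162326), (5, 0.193475269085),
    (6, 0.00379059496915), (7, 0.00379116923282),
]

# the probabilities form a distribution over the 8 topics
total = sum(prob for _, prob in doc_topics)

# the dominant topic is the pair with the highest probability
dominant_topic, dominant_prob = max(doc_topics, key=lambda pair: pair[1])

print(round(total, 6))                 # 1.0
print(dominant_topic, dominant_prob)   # 0 0.651482167389
```

### Topic 0 is the climate and oil topic from the list of topics above, which fits a mailing about the Paris Agreement.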


### Now I need to break down that topics field so that each topic gets assigned to its own column. For readability, I will assign each column a name, and convert the numbers into rounded percentages. Finally, I will both display the final table and save it as a csv for use in my logistic regression.  The logistic regression model will determine how predictive each topic is of a cohort's propensity to donate.  These topics will be combined with other information from the campaign like regional breakdown, campaign virality, and early behavior of new joiners from that cohort.  All of the information will be used to determine the best way to predict the probability that a new member from a given campaign will become a donor.

In [151]:
pivoted = assigned[['page_id','text_clean','freq','topics']].copy()
                                  
def pivot(df,topic_col,headers):
    topicframe = pd.DataFrame()
    topicframe[topic_col] = [[prob for _, prob in topic] for topic in df[topic_col]]  #keep just the probabilities, which removes a layer of nesting
    topicframe = pd.DataFrame(topicframe[topic_col].tolist())
    topicframe = (topicframe*100).round(0) #convert to rounded percentages
    topicframe.columns=headers
    
    df = pd.concat([df, topicframe], axis=1)
    return df

topic_headers = ['fossil','human','econ','habitat','other','consumer','workers','food']

pivoted = pivot(pivoted,'topics', topic_headers)

pivoted.to_csv('mailing_topic.csv')

pd.options.display.max_colwidth =70
pivoted.tail(10)

Unnamed: 0,page_id,text_clean,freq,topics,fossil,human,econ,habitat,other,consumer,workers,food
1390,16102,Craigslist allow exploitative adverts offering homeless women acco...,"[(wom, 8), (advert, 5), (sex, 4), (accommod, 3), (exchang, 2), (ho...","[(0, 0.00255316875486), (1, 0.982126754688), (2, 0.00255289156061)...",0.0,98.0,0.0,0.0,0.0,0.0,0.0,0.0
1391,16118,Apple is avoiding more than 13 billion in taxes And that s not e...,"[(tax, 11), (trad, 3), (econom, 3), (account, 2), (period, 2), (in...","[(0, 0.00304981314656), (1, 0.00305332203658), (2, 0.835371065206)...",0.0,0.0,84.0,0.0,0.0,15.0,0.0,0.0
1392,16148,The world s two biggest greenhouse gas polluters have now ratified...,"[(clim, 5), (fuel, 4), (forc, 3), (treat, 3), (pollut, 2), (med, 2...","[(0, 0.651482167389), (1, 0.136085514047), (2, 0.00378977389249), ...",65.0,14.0,0.0,0.0,0.0,19.0,0.0,0.0
1393,16154,The world s two biggest greenhouse gas polluters have now ratified...,"[(govern, 8), (clim, 4), (fuel, 4), (pollut, 2), (med, 2), (forc, ...","[(0, 0.63126443834), (1, 0.0327551105414), (2, 0.00338103459657), ...",63.0,3.0,0.0,0.0,0.0,32.0,0.0,0.0
1394,16160,Regulators are rolling out a controversial pesticide banned in Eur...,"[(bee, 9), (reg, 7), (beekeep, 3), (mosquito, 3), (chem, 3), (poll...","[(0, 0.00250180858339), (1, 0.0025030344139), (2, 0.00250262334425...",0.0,0.0,0.0,20.0,43.0,0.0,0.0,36.0
1395,16416,Dear Great news After 225000 of us spoke out against the misman...,"[(wat, 6), (rat, 2), (approv, 1), (mileston, 1), (anch, 1), (ecolo...","[(0, 0.00595662743821), (1, 0.229361187545), (2, 0.00596783801044)...",1.0,23.0,1.0,73.0,1.0,1.0,1.0,1.0
1396,16458,Foreign worker lists That s what this government wants to make al...,"[(crim, 5), (hat, 4), (rac, 3), (list, 2), (workforc, 2), (street,...","[(0, 0.027875296385), (1, 0.631960873362), (2, 0.00272031470219), ...",3.0,63.0,0.0,0.0,0.0,0.0,33.0,0.0
1397,16479,Asos is coming under fire for treating its workers like machines ...,"[(wareh, 4), (fash, 3), (factor, 2), (search, 1), (tre, 1), (toile...","[(0, 0.00320869900173), (1, 0.00321282924715), (2, 0.348745688389)...",0.0,0.0,35.0,7.0,16.0,11.0,30.0,0.0
1398,16736,Corporations have the government s ear when it comes to Brexit i...,"[(meddl, 3), (priv, 3), (taxpay, 2), (sort, 2), (strategi, 2), (ta...","[(0, 0.00284468194864), (1, 0.0797222488117), (2, 0.342432142476),...",0.0,8.0,34.0,10.0,0.0,9.0,11.0,27.0
1399,16940,Urgent The Trudeau government could approve the Kinder Morgan tar...,"[(pipelin, 8), (oil, 3), (govern, 3), (street, 2), (panel, 1), (cl...","[(0, 0.780314219013), (1, 0.201805200161), (2, 0.00297988344469), ...",78.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
#store the LDA variables so that I can recreate the data visualization in my data story
%store lda
%store corpus
%store dictionary

Stored 'lda' (LdaModel)
Stored 'corpus' (list)
Stored 'dictionary' (Dictionary)


### The LDA algorithm has classified all my mailings into topics.  But will my human brain think that it did a good job?  First I will look at the mailings with a probability of 95% or more in each category.

In [125]:
analyze = pivoted.copy() #copy the output to a new frame
analyze = analyze[['page_id','text_clean'] + topic_headers] #drop all the fields except the text and the topics

pd.options.display.width = 250
pd.options.display.max_colwidth = 500
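
Before reading individual examples, it can help to know how many mailings clear the 95% bar for each topic. The cell below is a minimal sketch of that count. It uses a small made-up frame (`toy`) in place of `analyze`, and the `threshold` and `high_confidence_counts` names are my own illustration rather than part of the pipeline above.

```python
import pandas as pd

topic_headers = ['fossil', 'human', 'econ', 'habitat', 'other', 'consumer', 'workers', 'food']

# Hypothetical stand-in for `analyze`: one row per mailing, one column per topic (percent)
toy = pd.DataFrame(
    [[99.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 98.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
     [65.0, 14.0, 0.0, 0.0, 0.0, 19.0, 0.0, 0.0],
     [96.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
    columns=topic_headers)

threshold = 95  # assumed cut-off, taken from the 95% figure in the text

# Comparing gives a boolean frame; summing each column counts the mailings at or above the threshold
high_confidence_counts = (toy[topic_headers] >= threshold).sum()
print(high_confidence_counts)
```

Swapping `toy` for `analyze` should give the real counts, assuming the topic columns still hold percentages on a 0 to 100 scale.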

## TOP FOSSIL FUEL CAMPAIGNS

In [126]:
analyze.sort_values(by='fossil', ascending=False).head(3)

Unnamed: 0,page_id,text_clean,fossil,human,econ,habitat,other,consumer,workers,food
686,2636,Outdoor ice rinks where Wayne Gretzky and Sidney Crosby learned to play hockey are becoming an endangered species because of global warming Stephen Harper says he s a big hockey fan Prove it Commit to stopping climate change and protect our backyard rinks Fifty years ago Walter Gretzky cleared away a patch of snow in his Brantford Ontario backyard and made an ice rink where he taught his son Wayne a thing or two about hockey Today due to climate change this critical part of Cana...,99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
680,2622,MPs are about to vote on whether to allow fracking under our homes without permission The Infrastructure Bill could force fracking onto people across the UK despite 99 being against it Tell the government to drop the pro fracking amendments It s make or break time for fracking RIGHT NOW MPs are debating whether to force fracking on people across the UK drilling for oil and gas under our homes without permission and leaving unknown toxic substances in the ground despite 99 of th...,99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
568,2325,Today the U S Senate narrowly rejected a bill forcing approval of Keystone XL We know they ll keep trying until the bill passes Ask President Obama to promise to veto this tar sands pipeline We re safe for now The U S Senate has just blocked approval of the controversial Keystone XL pipeline We re out of the woods for the moment but when right wing Republicans take over the Senate in January we definitely won t be Now is the time to remind President Obama that the power to sto...,99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## TOP HUMAN RIGHTS CAMPAIGNS

In [127]:
analyze.sort_values(by='human', ascending=False).head(3)

Unnamed: 0,page_id,text_clean,fossil,human,econ,habitat,other,consumer,workers,food
624,2466,The CIA s barbaric torture program was made possible by two psychologists who earned 81m designing and carrying out brutal abuse The American Psychological Association was complicit relaxing its ethics rules so members could help the CIA Tell the APA to apologize and expel all those linked in torture The CIA s horrific torture program was designed by two professional psychologists They even carried out torture themselves and got paid 81 million to do it The American Psychological As...,0.0,99.0,0.0,0.0,0.0,0.0,0.0,0.0
8,886,They choose younger girls the most vulnerable They do whatever they want Serco staff have been accused of sexually abusing vulnerable women at Yarl s Wood immigration removal centre and witnesses are now being deported This is part of a pattern of abuse Tell the Home Office to break its contract with Serco now Serco one of the largest and most powerful companies in the UK FTSE 250 privately runs Yarl s Wood in Bedfordshire an immigration removal centre where vulnerable women are d...,0.0,99.0,0.0,0.0,0.0,0.0,0.0,0.0
1352,12488,A pilot at WestJet sexually assaulted a flight attendant and the company didn t do enough to protect her before or after she came forward Tell WestJet CEO Gregg Saretsky step down now Sign the petition Warning trigger alert this email may be very difficult to read as the content refers to sexual assault A WestJet pilot sexually assaulted former flight attendant Mandalena Lewis while she was at work When she spoke out management didn t protect her they fired her Now she is go...,0.0,99.0,0.0,0.0,0.0,0.0,0.0,0.0


## TOP ECONOMIC CAMPAIGNS

In [128]:
analyze.sort_values(by='econ', ascending=False).head(3)

Unnamed: 0,page_id,text_clean,fossil,human,econ,habitat,other,consumer,workers,food
1339,12363,When corporations skip out on their taxes we lose funding for the public services we need Raise your voice in support of the ATO s crackdown on multinational companies gaming the tax system Sign the petition Commissioner of Taxation Chris Jordan is on a mission to stop corporate tax avoidance in Australia The Australian Taxation Office ATO just warned 60 multinational companies to stop gaming the system and pay their taxes We ve heard case after case of multinational companies reap...,0.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0
576,2346,Starbucks has been busted by the EU which says its dodgy tax avoidance set up is illegal But the coffee chain continues to wage a PR war rather than just paying tax in the countries where it operates Tell Starbucks to stop conning public services out of millions and pay its taxes Starbucks has been busted for dodging millions in taxes in EU countries But rather than changing its ways it s hired a PR firm to protect its reputation from the growing public backlash Enough already Starbuc...,0.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0
293,1712,Nike has cheated the Treasury out of 9 1m tax on profits from selling Manchester United kits Meanwhile the cuts are destroying lives Tell Nike to pay its taxes When it comes to tax avoidance Nike keeps just doing it The global sportswear giant has cheated the UK Treasury out of 9 1m in taxes from selling Manchester United replica kits by funnelling profits to its Dutch subsidiary Not very sporting is it Nike s lucrative partnership with Man U has netted it 100m in sales over the l...,0.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0


## TOP HABITAT DESTRUCTION CAMPAIGNS

In [129]:
analyze.sort_values(by='habitat', ascending=False).head(3)

Unnamed: 0,page_id,text_clean,fossil,human,econ,habitat,other,consumer,workers,food
645,2546,KFC doesn t know where the palm oil it uses in its food comes from so the chain can t ensure that it s not from plantations that destroy rainforests Tell KFC to adopt a deforestation free policy now KFC says it has absolutely no idea where its palm oil comes from and it s putting that palm oil into loads of products from apple turnovers to grilled chicken to country fried steak In other words KFC the world s second largest restaurant chain can t guarantee that it s not buying fro...,0.0,0.0,0.0,99.0,0.0,0.0,0.0,0.0
1015,6262,Nestlé is at it again It just conditionally bought a bottling facility in Ontario that scientists are calling the stupidest short sighted most criminal use of water they ve ever seen Let s make sure Nestle s dangerous project doesn t go ahead Sign the petition Nestlé is after another Canadian s town s water Nestlé conditionally purchased a water bottling facility in Ontario that can draw 1300 litres of water a minute from a well so deep it punctures the bedrock Residents are rightly...,0.0,0.0,0.0,99.0,0.0,0.0,0.0,0.0
357,1830,A tailings pond from an Imperial Metals copper and gold mine just burst spewing tons of toxic metals and contaminated water into British Columbia s pristine waterways Yet the company s president says the water is almost clean enough to drink Let s call on the president to back up his own claim and drink the contaminated water Dear A tailings pond containing millions of cubic metres of toxic byproducts from an Imperials Metals mine just burst near Likely British Columbia But Imperial Me...,0.0,0.0,0.0,99.0,0.0,0.0,0.0,0.0


## TOP OTHER CAMPAIGNS

In [130]:
analyze.sort_values(by='other', ascending=False).head(3)

Unnamed: 0,page_id,text_clean,fossil,human,econ,habitat,other,consumer,workers,food
1317,11232,The Snoopers Charter is back Tell BT to speak out against the new draft bill and protect us from mass surveillance Sign the petition Our private lives are at risk Home Secretary Theresa May is trying to make a shady deal with the giants of the internet and telecoms industry in an attempt to get access to all of our private data literally all May s Investigatory Powers Bill aka Snoopers Charter will put the capabilities revealed by Snowden into the statute books and actually incr...,0.0,0.0,0.0,0.0,99.0,0.0,0.0,0.0
681,2623,The death penalty has been illegal in Michigan since 1847 yet Michigan s pharmacists can still facilitate executions in other states Join us in urging the Michigan Pharmacy Board to ban pharmacists from using their professional training to enable executions In July of 2014 the state of Arizona executed Joseph Wood by lethal injection with an experimental combination of drugs prepared by pharmacists Witnesses were horrified by the 660 gasps of air Mr Wood took during the almost two hour...,0.0,0.0,0.0,0.0,99.0,0.0,0.0,0.0
749,2839,Lenovo pre installed adware called Superfish on all its laptops without its customers knowledge or consent Worse the adware caused a massive security breach that put their banking info at risk We need to ensure that no corporation tries this again Tell PC manufacturers not to sell computers pre installed with adware Lenovo s website just got hacked But to be fair Lenovo kind of deserved it Lenovo the manufacturer of popular inexpensive computers has installed secret adware called Su...,0.0,0.0,0.0,0.0,99.0,0.0,0.0,0.0


## TOP CONSUMER PROTECTION CAMPAIGNS

In [131]:
analyze.sort_values(by='consumer', ascending=False).head(3)

Unnamed: 0,page_id,text_clean,fossil,human,econ,habitat,other,consumer,workers,food
690,2653,Green Mountain Coffee s single serve coffee is quickly becoming one of the favorite ways to brew coffee but the waste this coffee brewing method leaves in its wake is causing serious environmental and health issues Tell Green Mountain Coffee that you will boycott the single serve coffee machines until 2020 unless they become recyclable Your single serve morning cup of coffee is likely ending up in a landfill What began in 1998 as a niche model of coffee consumption is quickly becoming o...,0.0,0.0,0.0,0.0,0.0,99.0,0.0,0.0
990,4973,Hedge fund managers are making a killing off people with Hepatitis C Let s tell them to stop profiteering off sickness Sign the petition Pharmaceutical companies and hedge fund managers are making a killing off people with Hepatitis C Around 3 5 million people in the US disproportionately the poor uninsured and incarcerated are living with the deadly Hepatitis C virus Pharmaceutical company Gilead Sciences holds the cure a groundbreaking new class of drugs that cures that most ...,0.0,0.0,0.0,0.0,0.0,99.0,0.0,0.0
1095,8514,A hedge fund manager just bought the patent to a life saving AIDS drug and jacked up the price 5500 If the TPP passes this insane profit scheme could become common practice and we have to stop it If the TPP passes this insane profit scheme could become common practice and we have to stop it World leaders are meeting THIS WEEK to finalize details on the top secret TPP deal tell them to scrap this deal now Sign the petition By now you may have heard that 32 year old hedge fund man...,0.0,0.0,0.0,0.0,0.0,99.0,0.0,0.0


## TOP WORKERS' RIGHTS CAMPAIGNS

In [132]:
analyze.sort_values(by='workers', ascending=False).head(4)

Unnamed: 0,page_id,text_clean,fossil,human,econ,habitat,other,consumer,workers,food
710,2709,Im Rixpack Hostel in Berlin wohnen Flüchtlinge unter prekären Bedingungen während der Betreiber sich hohe Mieten vom Senat bezahlen lässt Fordern Sie den Rixpack Betreiber Stefan Richter dazu auf für menschenwürdige Zustände in seinem Hostel zu sorgen kaputt unhygienisch zu wenig Platz so lassen sich die Räumlichkeiten im Berliner Hostel Rixpack beschreiben in denen derzeit 51 Flüchtlingen hausen Dafür kassiert der Betreiber des Hostels geschätzte 30000 monatlich vom Senat Als ...,0.0,0.0,0.0,0.0,0.0,0.0,99.0,0.0
967,4885,Even though the USPS posted a 1 1 billion surplus in the first quarter lawmakers are preparing to privatize one of America s oldest institutions The public postal service provides loads of high paying secure jobs and is a huge employer workers of colour and veterans It belongs to all of us Tell Congress to save our mail Sign the petition The United States Postal Service one of America s greatest oldest institutions is under grave threat of getting sold off Lawmakers are enforcing rid...,0.0,0.0,0.0,0.0,0.0,0.0,99.0,0.0
1125,8864,H M isn t as conscious as it claims 61 of its best factories lack working fire exits H M live up to the binding agreements of the Bangladesh Accord Sign the petition H M brands itself as a leader in garment factory safety sustainability and ethics But a new report says H M s real factory conditions don t match up to its image A new Clean Clothes Campaign report states that 61 of H M s best factories don t have working fire exits These platinum and gold factories are supp...,0.0,0.0,0.0,0.0,0.0,0.0,99.0,0.0
325,1774,Ian Jordan committed suicide in a last desperate attempt to break free from debts he incurred from out of control loan sharks With menacing tactics they target the most vulnerable of our society and entrap them in ever larger debts Call on George Osborne to immediately impose heavier regulation of the loan shark industry Ian Jordan a 60 year old granddad committed suicide after his debts to payday loan companies spiralled out of control Ian s story represents everything that is wro...,0.0,0.0,0.0,0.0,0.0,0.0,99.0,0.0


## TOP FOOD SAFETY CAMPAIGNS

In [133]:
analyze.sort_values(by='food', ascending=False).head(3)

Unnamed: 0,page_id,text_clean,fossil,human,econ,habitat,other,consumer,workers,food
720,2723,Cereal labeled no high fructose corn syrup might not be telling the truth General Mills is misleading customers about what s in its cereal by calling different sorts of sweeteners fructose Demand that General Mills stop dishonestly labelling cereals Cereal giant General Mills is sneaking high fructose corn syrup HFCS into your breakfast The worst part It s in cereal labeled no high fructose corn syrup GM just calls it something else It s a handshake deal between corn syrup manuf...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.0
169,1512,McDonald s aggressive marketing to kids is creating a huge health crisis Tell McDonald s to stop predatory advertising to kids now Dear We re not lovin this McDonald s aggressively targets kids with its advertising and it has to stop now Behind that cheerful face of Ronald McDonald lies a large order of corporate greed and a dismissive attitude toward the health of its youngest customers McDonald s knows that if it can hook kids early they ll be customers for life and that means easy...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.0
1003,5270,Supermarkets are milking farmers dr y more than half of all British dairy farms have gone out of business in the past ten years Sainsbury s and Tesco guarantee a price for milk that covers the cost of production let s get Morrisons to do the same Sign the petition Morrisons is milking farmers dry and farmers across the UK are taking action More than half of all British dairy farms have gone out of business in the past ten years Those that are left are in dire straits with man...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.0


## CAMPAIGNS WHERE NO SINGLE TOPIC REACHED 33% PROBABILITY

In [134]:
analyze.loc[(analyze[topic_headers]<33).all(axis=1)].head(5)

Unnamed: 0,page_id,text_clean,fossil,human,econ,habitat,other,consumer,workers,food
174,1526,America s richest family is trying to keep its billionaire son Rob at the helm of Walmart The family has already shown they ll stop at nothing to boost their bank balance even though they have the power to bring many workers out of poverty Tell big mutual funds Vanguard and Fidelity to vote against the re election of Rob Walton to the board Rob Walton heir to the Walmart fortune thinks he s entitled to be Chairman of Walmart s board forever and is campaigning to be re elected Tha...,29.0,0.0,26.0,0.0,0.0,21.0,22.0,0.0
390,1920,McDonald s is exploiting children in the Philippines Its Kiddie Crew Workshop makes children as young as 6 work in the restaurant and take values formation lessions Tell McDonald s to stop indoctrinating kids in an attempt to expand its customer base McDonald s is notorious for its low wages unhealthy food and monstrous environmental footprint Now there s a new sin to add to the list child labour We kid you not Children in the Philippines are being encouraged to pay McDonald s ...,0.0,22.0,28.0,0.0,0.0,14.0,13.0,23.0
409,1954,In 3 days time human rights defender Andy Hall faces 8 years in prison and a 10m fine just for exposing multiple human rights abuses at a Thai pineapple factory Global fruit giant Dole could be his salvation but only if we act now Tell Dole to stand up for Andy and demand Natural Fruit Ltd drop these outrageous charges Eight years in prison and a 10 million fine That s what Andy Hall faces in only three days time just for exposing multiple human abuses in a Thai pineapple fac...,0.0,18.0,0.0,8.0,32.0,27.0,15.0,0.0
701,2687,Former health secretary Alan Milburn is reportedly racking in 2 million in fees to advise private healthcares companies in the NHS Former ministers shoudn t rake in vast sums from private companies who want to carve up the NHS Tell David Cameron to fix the revolving door policy Former Labour Health Secretary Alan Milburn has raked in over 2 million as a consultant for a range of private healthcare companies who want to further privatise the NHS There are meant to be rules that stop f...,0.0,15.0,8.0,18.0,0.0,22.0,13.0,23.0
727,2737,This is amazing We are winning the fight against McDonald s Its CEO just quit after a three year long public relations nightmare For the first time in 60 years we have McD s on the ropes Tell the incoming CEO how to fix its image problem pay your employees a living wage This is amazing We are winning the fight against McDonald s Don Thompson its CEO is stepping down after three years of labor unrest flat sales and a public relations nightmare On one hand the fast food business mod...,10.0,19.0,31.0,16.0,0.0,0.0,0.0,22.0
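
A complementary view is to label each mailing with its single strongest topic and flag the ones that fall under the 33% cut-off. This is only a sketch: `toy`, `dominant_topic`, `top_share`, and `mixed` are illustrative names of my own, and the three rows copy topic values from the outputs above.

```python
import pandas as pd

topic_headers = ['fossil', 'human', 'econ', 'habitat', 'other', 'consumer', 'workers', 'food']

# Three rows copied from the outputs above, indexed by page_id
toy = pd.DataFrame(
    [[99.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],      # page 2636, clearly fossil
     [29.0, 0.0, 26.0, 0.0, 0.0, 21.0, 22.0, 0.0],   # page 1526, spread over four topics
     [0.0, 22.0, 28.0, 0.0, 0.0, 14.0, 13.0, 23.0]], # page 1920, spread over five topics
    columns=topic_headers, index=[2636, 1526, 1920])

# Name of the column holding the largest percentage in each row
toy['dominant_topic'] = toy[topic_headers].idxmax(axis=1)
# The largest percentage itself
toy['top_share'] = toy[topic_headers].max(axis=1)
# Same rule as the filter above: no topic reaches 33
toy['mixed'] = toy['top_share'] < 33

print(toy[['dominant_topic', 'top_share', 'mixed']])
```

A label like this is convenient for counting, but it discards the mixed-membership information that makes LDA useful, so I would treat it as a reading aid rather than a replacement for the percentages.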


### These classifications are far from perfect, but they give me enough information to start using them for rough statistical breakdowns. That phase of the analysis is beyond the scope of the LDA project, but as one example of a summary statistic, I can estimate the topic breakdown of all the campaigns.  This gives us a rough idea of how much we have campaigned on each issue area.

In [123]:
pd.options.display.float_format = '{:.2f}'.format
average_of_topics = pd.DataFrame(analyze[topic_headers].mean(),columns=['average topic probability'])
average_of_topics

Unnamed: 0,average topic probability
fossil,12.23
human,17.57
econ,9.77
habitat,13.92
other,8.63
consumer,10.98
workers,11.32
food,14.29
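
One quick sanity check on this summary is whether the averages add up to roughly 100. The sketch below re-enters the eight averages from the output above by hand, so it does not depend on `analyze`, and rescales them into shares that sum to exactly 100. The names `reported` and `shares` are made up for the illustration. Some drift from 100 is plausible, since each mailing's percentages were rounded to whole numbers before averaging.

```python
import pandas as pd

# Averages typed in from the output above
reported = pd.Series({
    'fossil': 12.23, 'human': 17.57, 'econ': 9.77, 'habitat': 13.92,
    'other': 8.63, 'consumer': 10.98, 'workers': 11.32, 'food': 14.29,
})

total = reported.sum()           # how close the averages come to 100
shares = reported / total * 100  # rescaled so the eight shares sum to 100

print(round(total, 2))
print(shares.round(2).sort_values(ascending=False))
```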


### Finally, by bringing in data from another file that shows when each mailing was sent, we can break down the data by year, getting a sense of how our campaigning has changed over time.

In [150]:
supplemental = pd.read_csv('../capstone/date_size.csv', encoding="ISO-8859-1") #read in the file with the mailing_date information

joined = pd.merge(supplemental, analyze, on='page_id') #merge the new file with the topic data, based on page_id
joined['year'] = pd.to_datetime(joined['mailing_date']).dt.year #extract the year from mailing_date and store it in a new column
joined.groupby(['year'])[topic_headers].mean() #group by year and average the topic distributions within each year

year,fossil,human,econ,habitat,other,consumer,workers,food
2013,9.53,16.42,2.95,23.37,9.84,7.84,18.32,10.32
2014,14.23,11.84,9.95,19.88,12.29,11.29,6.93,12.19
2015,13.84,10.0,8.27,26.03,6.03,11.19,7.63,15.92
2016,12.37,20.11,10.11,16.95,4.26,13.89,4.0,17.21


### From this table we can start to draw conclusions about how SumOfUs campaigning has changed over time.  Food safety campaigning has risen every year, and economic justice and consumer protection campaigning have grown overall since 2013.  At the same time, our workers' rights campaigning has markedly decreased, from a high of 18.32% of campaign content in 2013 to a low of only 4% in the first half of 2016.  While I have no objective information I can use to validate this trend data, it generally conforms to my experience of changes at SumOfUs over time.
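
To put numbers on these trends, the percentage-point change between the first and last year can be computed per topic. This sketch re-types the yearly averages from the table above into a small frame called `by_year` (my own name) rather than recomputing them from `joined`, and also checks which topics rose in every single year.

```python
import pandas as pd

topic_headers = ['fossil', 'human', 'econ', 'habitat', 'other', 'consumer', 'workers', 'food']

# Yearly averages typed in from the table above
by_year = pd.DataFrame(
    [[9.53, 16.42, 2.95, 23.37, 9.84, 7.84, 18.32, 10.32],
     [14.23, 11.84, 9.95, 19.88, 12.29, 11.29, 6.93, 12.19],
     [13.84, 10.00, 8.27, 26.03, 6.03, 11.19, 7.63, 15.92],
     [12.37, 20.11, 10.11, 16.95, 4.26, 13.89, 4.00, 17.21]],
    index=pd.Index([2013, 2014, 2015, 2016], name='year'),
    columns=topic_headers)

# Percentage-point change from 2013 to 2016, largest decline first
change = (by_year.loc[2016] - by_year.loc[2013]).sort_values()
# True only for topics whose average never fell from one year to the next
rose_every_year = by_year.apply(lambda col: col.is_monotonic_increasing)

print(change.round(2))
print(rose_every_year)
```

With only four yearly averages, these differences describe the table rather than establish a statistically meaningful trend.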

## In conclusion, Latent Dirichlet Allocation proved to be an effective, if imperfect, method for discovering latent topics in SumOfUs campaign mailings and categorizing the mailings by topic.  I can combine these topic probabilities with other data to find trends in campaigning issues and to identify campaigns that straddle issue areas.  Ultimately, I hope to use these topic probabilities as features in a logistic regression to predict the donation propensity of new campaign cohorts.