# Textual analysis of activist campaign information

### MySQL query to extract the page name, hand-selected tags, and HTML content of the top mailing for each page

```
SET @rank_pages:=0; SET @rank:=0;
select ranked.page_id, ranked.page_name, tag.name as tag_name, replace(replace(mail.html,',',''),'"','') as html
from
  (select page_id, page_name, mailing_id, count, 
          IF(@rank_pages=page_id,@rank:=@rank+1,@rank:=1) as rank, @rank_pages:=page_id
  from
	(select p.id as page_id, p.name as page_name, a.mailing_id, count(*) as count
	from core_action as a
	join core_page as p on a.page_id = p.id
	where a.mailing_id is not null
	and p.id not in (5,25,28,46,130,525,561,566,761,935,1304,1394,1862,2678,3712,8559,10668)
	and left(p.name,12) <> "controlshift"
	and p.created_at >= "2013-01-01"
	and p.lang_id = 100
	group by p.id, a.mailing_id
	order by p.id, count(*) desc
	) as unranked
  ) as ranked
join core_mailing as mail on mail.id = ranked.mailing_id
join core_page_tags as cpt on cpt.page_id = ranked.page_id
join core_tag as tag on tag.id = cpt.tag_id and tag.id IN (2,8,10,11,13,15,22,23,24,25,29,30,32,33,34,35,36,39,41,43,45,47,48,49,54,59,60,64,67,72,73,75,80,81,82,84,88,89,91,92,93,94,95,96,98,101,104,105,106,107,109,112,114,115,116,117,120,122,123,125,127,130,133,139,141,142,146,148,151,157,160,161,175,177,178,181,183,185,190,193,201,202,206,207,211,213,222,224,226,227,231,234,239,240,242,243,244,246,248,254,258,260,261,265,267,270,273,280,287,288,289,291,297,303,315,316,322,323,325,327,328,334,345,346,347,348,369,383,389,393,394,402,407,410,412,415,443,445,451,452,463,467,468,471,480,481,485,486,488,489,493,508,518,521,549,550,551,564,567,572,573,574,581,583,587,619,621,624,634,641,659,696,804,820,826,898,900,933,934,937,938,940,941,942,943,944,945,946,947,954,966,967,968,969,972,973,974,975,976,977,1000,1012,1036,1046,1071,1078,1128,1130,1132,1140,1208,1248,1282,1739,1746) 
where rank = 1
order by 1,4
```
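
The query ranks mailings within each page by action count and keeps the top one. The same "top mailing per page" step can be sketched in pandas, which makes the ranking logic easier to check on a small sample. This is only an illustration: the `actions` frame and its values are hypothetical stand-ins for the join of `core_action` and `core_page`, not data from the real tables.

```python
import pandas as pd

# Hypothetical toy data: one row per action, with the page it was taken on
# and the mailing that drove it (stand-in for core_action joined to core_page).
actions = pd.DataFrame({
    "page_id":    [400, 400, 400, 401, 401, 401],
    "mailing_id": [11, 11, 12, 21, 22, 22],
})

# Count actions per (page, mailing), like the innermost GROUP BY.
counts = (actions.groupby(["page_id", "mailing_id"])
                 .size()
                 .reset_index(name="count"))

# Sort so the busiest mailing comes first within each page,
# then keep only the first row per page (the equivalent of rank = 1).
top = (counts.sort_values(["page_id", "count"], ascending=[True, False])
             .drop_duplicates("page_id")
             .reset_index(drop=True))

print(top)
```

Unlike the user-variable trick in the SQL, this does not depend on the order in which rows are evaluated, so ties aside, the result is deterministic.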

### import modules

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

### read the csv into a DataFrame

In [39]:
camp_txt = pd.read_csv('../capstone/text_fields.csv')
pd.options.display.max_colwidth = 500
camp_txt.head(4)  # the DataFrame is named camp_txt; text_fields is only the csv file name

Unnamed: 0,page_id,page_name,tag_name,html
0,400,time-warner-al-jazeera,us corporation,<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 220px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<p>In a blatantly prejudiced move Time Warner Cable dropped CurrentTV the moment it was sold to Al Jazeera.</p>\r\n<p><strong>Tell Time Warner Cable to pick CurrentTV back up and give its new owners a fair shake.<br /><...
1,400,time-warner-al-jazeera,discrimination,<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 220px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<p>In a blatantly prejudiced move Time Warner Cable dropped CurrentTV the moment it was sold to Al Jazeera.</p>\r\n<p><strong>Tell Time Warner Cable to pick CurrentTV back up and give its new owners a fair shake.<br /><...
2,401,hbo-animal-cruelty,us corporation,<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 220px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<p>A new lawsuit claims that animal abuse by HBO led to the death of four horses in one season of shooting <em>Luck</em>.</p>\r\n<p><strong>Tell&nbsp; HBO to investigate claims of animal abuse and enact measures to prev...
3,401,hbo-animal-cruelty,animal abuse,<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 220px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<p>A new lawsuit claims that animal abuse by HBO led to the death of four horses in one season of shooting <em>Luck</em>.</p>\r\n<p><strong>Tell&nbsp; HBO to investigate claims of animal abuse and enact measures to prev...


### flatten tags for each campaign into a list, then turn the list into a string; each campaign is now a single row

In [55]:
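# group on a list of keys (current pandas treats a tuple as a single key), collecting each campaign's tags into a list
camp_txt = camp_txt.groupby(by=['page_id','page_name','html'])['tag_name'].apply(list).reset_index()
camp_txt['tag_name'] = camp_txt['tag_name'].apply(', '.join) #joins each list of tags into one comma-separated string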
camp_txt = camp_txt[['page_id','page_name','tag_name','html']]
camp_txt.head(10)

Unnamed: 0,page_id,page_name,tag_name,html
0,400,time-warner-al-jazeera,"us corporation, discrimination",<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 220px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<p>In a blatantly p...
1,401,hbo-animal-cruelty,"us corporation, animal abuse",<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 220px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<p>A new lawsuit cl...
2,402,gm-strike,"us corporation, working conditions, workers",<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 220px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<p>This man has sti...
3,403,boeing-dreamliner-fire,"us corporation, consumer safety",<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 220px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<div>Boeing's new 7...
4,405,anz-mining,"bank, Australian corporation",<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 220px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<p>ANZ Bank promise...
5,406,drowning-whales,canada,<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 200px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<p>A pod of whales ...
6,407,anglogold,"africa, working conditions, workers, mining",<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 200px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<p>Hundreds of thou...
7,408,goldcorp,"workers, mining",<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 200px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<div>13 employees o...
8,409,newtown-walmart,"offline action taker, walmart, gun control",<p>Dear {{ user.first_name|default:Friend }}</p>\r\n<p>Over 110000 members of the SumOfUs community signed our petition to Walmart the largest arms dealer in the country calling for it to stop selling assault weapons and high capacity magazines.</p>\r\n<p>This Tuesday January 15 immediately foll...
9,410,hyper-racism,racism,<table cellspacing=0 cellpadding=0 align=right>\r\n<tbody>\r\n<tr>\r\n<td id=boxholder>\r\n<table style=border: 1px solid grey; margin-left: 10px; margin-bottom: 5px; width: 200px; cellspacing=0 cellpadding=0 bgcolor=#ffffff>\r\n<tbody>\r\n<tr>\r\n<td style=padding: 10px;>\r\n<p>An Apple accesso...
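
The flattening step can be checked on a tiny frame. The sketch below uses made-up rows shaped like the csv (the `html` values are placeholders) and joins the tags directly with `agg`, which skips the intermediate list; it is one possible way to write the step, not necessarily the most efficient.

```python
import pandas as pd

# Hypothetical rows shaped like the csv: one row per (campaign, tag).
rows = pd.DataFrame({
    "page_id":   [400, 400, 401],
    "page_name": ["time-warner-al-jazeera", "time-warner-al-jazeera",
                  "hbo-animal-cruelty"],
    "tag_name":  ["us corporation", "discrimination", "us corporation"],
    "html":      ["<p>a</p>", "<p>a</p>", "<p>b</p>"],
})

# Group on the columns that identify a campaign and join its tags
# into a single comma-separated string.
flat = (rows.groupby(["page_id", "page_name", "html"], sort=False)["tag_name"]
            .agg(", ".join)
            .reset_index())

# Restore the original column order.
flat = flat[["page_id", "page_name", "tag_name", "html"]]
print(flat)
```

Each campaign now occupies exactly one row, with its tags in the order they appeared in the input.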


### use BeautifulSoup to clean up the html

In [53]:
pd.options.display.max_colwidth = 300
dirty = camp_txt['html']
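clean = pd.Series(index=dirty.index, dtype=object) #explicit dtype for the initially empty Series
for index, item in dirty.items(): #iteritems() was removed in pandas 2.0
    soup = BeautifulSoup(item, "lxml")
    scrubbed = soup.get_text(strip=True) #drops the tags and keeps only the text
    clean.loc[index] = scrubbed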
camp_txt['html'] = clean
camp_txt

Unnamed: 0,page_id,page_name,tag_name,html
0,400,time warner al jazeera,us corporation discrimination,In a blatantly prejudiced move Time Warner Cable dropped CurrentTV the moment it was sold to Al Jazeera.Tell Time Warner Cable to pick CurrentTV back up and give its new owners a fair shake. On Wednesday Current TV announced that it had been sold to Al Jazeera for half a billion dollars. Before ...
1,401,hbo animal cruelty,us corporation animal abuse,A new lawsuit claims that animal abuse by HBO led to the death of four horses in one season of shootingLuck.Tell HBO to investigate claims of animal abuse and enact measures to prevent animal cruelty in the future. Information has come to light ofshocking instances of animal abuse and death on ...
2,402,gm strike,us corporation working conditions workers,This man has stitched his lips together and declared a hunger strike demanding that General Motors compensate its Colombian employees for debilitating life-long injuries.Tell General Motors to meet with its injured workers and negotiate. Jorge Parra started working as a welder at General Motors’...
3,403,boeing dreamliner fire,us corporation consumer safety,Boeing's new 787 Dreamliners keep catching on fire. Something is clearly wrong with the electrical system.Tell Boeing to recall the 787s immediately. From the startthe Boeing 787 Dreamliner has been plagued with problemsbut now a clear pattern is emerging:the electrical system keeps catching on ...
4,405,anz mining,bank Australian corporation,ANZ Bank promised when signing the Equator Principles to not loan money to projects that have a negative impact on the environment. But it gave a $1.2B loan to a massive new coal mine in NSW.Tell ANZ to overturn the loan and keep its promises to protect the environment. On Monday ANZ Bank announ...
5,406,drowning whales,canada,A pod of whales is drowning in the frozen sea of northern Canada.Tell the Canadian government to save the whales before it's too late! Right now in Canada’s far northa dozen killer whales are drowningunder a sheet of sea ice.Earlier this week residents of the village of Inukjuak about 900 miles ...
6,407,anglogold,africa working conditions workers mining,Hundreds of thousands of miners are dying from an easily preventable lung disease but mining companies like AngloGold Ashanti refuse to provide their workers with protection.Tell Anglo Gold Ashanti to provide protection against silicosis for all workers immediately. One in four gold miners in So...
7,408,goldcorp,workers mining,13 employees of Goldcorp a major mining company were killed by security during a dispute on Tuesday.Demand Goldcorp compensate the victims and ensure this never happens again. Absolutely horrendous news.Goldcorp mine shot 13 employees in Guatemala on Tuesday. We don't have a ton of details but w...
8,409,newtown walmart,offline action taker walmart gun control,Dear Over 110000 members of the SumOfUs community signed our petition to Walmart the largest arms dealer in the country calling for it to stop selling assault weapons and high capacity magazines.This Tuesday January 15 immediately following the anniversary of the shooting at Sandy Hookmembers o...
9,410,hyper racism,racism,An Apple accessory manufacturer Hyper hired models to stand silently clad only in a bikini bottom and paint in order to try and lure people into their booth at the Consumer Electronics Show.It even painted one black model white so she'd fit in better.Join us and demand that Hyper apologize for t...
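
In the output above, some words run together (for example `shootingLuck`), because `get_text(strip=True)` joins adjacent text nodes with nothing between them. A minimal sketch of one possible remedy is to pass a separator to `get_text`. The HTML snippet here is made up for illustration, and `html.parser` is used only to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup

# Made-up snippet with inline markup, similar in shape to the mailing html.
html = ("<p>in one season of shooting <em>Luck</em>.</p>"
        "<p><strong>Tell HBO to investigate.</strong></p>")
soup = BeautifulSoup(html, "html.parser")

# No separator: each text node is stripped, then joined with nothing in between.
joined = soup.get_text(strip=True)

# With a separator: each stripped text node is joined with a single space.
spaced = soup.get_text(" ", strip=True)

print(joined)
print(spaced)
```

The trade-off is that a space also appears before punctuation that follows an inline tag (`Luck .`), which matters little if punctuation is stripped before tokenizing.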


### a bit more cleaning

In [54]:
camp_txt['page_name'] = camp_txt['page_name'].str.replace(r'[^\w\s]', ' ', regex=True) #replaces all punctuation in page_name with spaces
camp_txt['tag_name'] = camp_txt['tag_name'].str.replace(r'[^\w\s]', '', regex=True) #removes all punctuation from tag_name (underscores count as word characters and survive)
camp_txt['tag_name'] = camp_txt['tag_name'].str.replace('_', ' ', regex=False) #replaces underscores in tag_name with spaces
camp_txt['html'] = camp_txt['html'].str.replace(r'\{\{.*?\}\}|\{%.*?%\}', ' ', regex=True) #removes django tags from html; non-greedy so text between two tags is kept
camp_txt

Unnamed: 0,page_id,page_name,tag_name,html
0,400,time warner al jazeera,us corporation discrimination,In a blatantly prejudiced move Time Warner Cable dropped CurrentTV the moment it was sold to Al Jazeera.Tell Time Warner Cable to pick CurrentTV back up and give its new owners a fair shake. On Wednesday Current TV announced that it had been sold to Al Jazeera for half a billion dollars. Before ...
1,401,hbo animal cruelty,us corporation animal abuse,A new lawsuit claims that animal abuse by HBO led to the death of four horses in one season of shootingLuck.Tell HBO to investigate claims of animal abuse and enact measures to prevent animal cruelty in the future. Information has come to light ofshocking instances of animal abuse and death on ...
2,402,gm strike,us corporation working conditions workers,This man has stitched his lips together and declared a hunger strike demanding that General Motors compensate its Colombian employees for debilitating life-long injuries.Tell General Motors to meet with its injured workers and negotiate. Jorge Parra started working as a welder at General Motors’...
3,403,boeing dreamliner fire,us corporation consumer safety,Boeing's new 787 Dreamliners keep catching on fire. Something is clearly wrong with the electrical system.Tell Boeing to recall the 787s immediately. From the startthe Boeing 787 Dreamliner has been plagued with problemsbut now a clear pattern is emerging:the electrical system keeps catching on ...
4,405,anz mining,bank Australian corporation,ANZ Bank promised when signing the Equator Principles to not loan money to projects that have a negative impact on the environment. But it gave a $1.2B loan to a massive new coal mine in NSW.Tell ANZ to overturn the loan and keep its promises to protect the environment. On Monday ANZ Bank announ...
5,406,drowning whales,canada,A pod of whales is drowning in the frozen sea of northern Canada.Tell the Canadian government to save the whales before it's too late! Right now in Canada’s far northa dozen killer whales are drowningunder a sheet of sea ice.Earlier this week residents of the village of Inukjuak about 900 miles ...
6,407,anglogold,africa working conditions workers mining,Hundreds of thousands of miners are dying from an easily preventable lung disease but mining companies like AngloGold Ashanti refuse to provide their workers with protection.Tell Anglo Gold Ashanti to provide protection against silicosis for all workers immediately. One in four gold miners in So...
7,408,goldcorp,workers mining,13 employees of Goldcorp a major mining company were killed by security during a dispute on Tuesday.Demand Goldcorp compensate the victims and ensure this never happens again. Absolutely horrendous news.Goldcorp mine shot 13 employees in Guatemala on Tuesday. We don't have a ton of details but w...
8,409,newtown walmart,offline action taker walmart gun control,Dear Over 110000 members of the SumOfUs community signed our petition to Walmart the largest arms dealer in the country calling for it to stop selling assault weapons and high capacity magazines.This Tuesday January 15 immediately following the anniversary of the shooting at Sandy Hookmembers o...
9,410,hyper racism,racism,An Apple accessory manufacturer Hyper hired models to stand silently clad only in a bikini bottom and paint in order to try and lure people into their booth at the Consumer Electronics Show.It even painted one black model white so she'd fit in better.Join us and demand that Hyper apologize for t...
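
When removing Django template tags, a greedy pattern such as `{(.+)}` matches from the first opening brace to the last closing brace, so any ordinary text sitting between two tags is removed as well. The sketch below contrasts it with a non-greedy pattern on a made-up sentence; the `DJANGO_TAG` name and the sample text are illustrative only.

```python
import re

# Matches {{ variable }} and {% block %} tags one at a time (non-greedy).
DJANGO_TAG = re.compile(r"\{\{.*?\}\}|\{%.*?%\}")

# Made-up sentence containing two template tags.
text = ("Dear {{ user.first_name|default:Friend }}, "
        "thanks for signing {{ page.title }} today.")

# Greedy: spans from the first "{" to the last "}", eating the text in between.
greedy = re.sub(r"\{(.+)\}", " ", text)

# Non-greedy: removes each tag separately and keeps the surrounding text.
lazy = DJANGO_TAG.sub(" ", text)

# Collapse the runs of spaces left behind by the substitutions.
lazy_tidy = " ".join(lazy.split())

print(repr(greedy))
print(repr(lazy_tidy))
```

The same non-greedy pattern can be passed to `Series.str.replace(..., regex=True)` to apply it to a whole column.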
