Identification of Polarized Blog Posts #4

TyJK · 2017-05-09T06:36:29Z

Labelling Blog Sites

We need a lot of labelled data covering various topics and sentiments. We have decided on a form of labelling called distant supervision, where we use heuristics and tags to classify far more text than we could possibly label manually; the idea is that the cost of mislabelling some data is outweighed by the far greater volume. To do this we have targeted opinion blogs for 3 main reasons:

They contain far more text than a single social media comment
Posts on the same site should largely hold the same sentiment or point of view for a given topic
Unlike news articles, they should be very semantically similar to comments

We will need to scrape this data, which means we first need to label potential target sites. To do this we need people to pick a topic, such as global warming, vaccination, religion/atheism or some other polarizing topic. Once the topic is decided on, try to find blogs that deal more or less exclusively with it, and determine the dominant sentiment of official posts on the site (not comments). Check that the sentiment is fairly consistent between posts and authors (if there's more than one).

Once a site or domain is determined to be a good target, enter the URL into a text file. The text file should be named in the format: Topic of Blog Posts - Sentiment (e.g. Climate Change - Denial, Abortion - Pro Choice, etc.). Each file should contain only one leaning, so that it can easily be run through any automated scraper we create. Avoid ambiguously leaning sites (those that post from both sides) and those whose topic varies significantly.
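As a rough illustration of why the naming format matters, here is a minimal sketch of how a scraper could recover the label from a file name. It assumes the topic and sentiment are separated by " - " (space, hyphen, space) exactly as in the examples above; the `Label` type and `label_from_filename` function are names made up for this sketch, not part of any existing tool.

```python
from pathlib import Path
from typing import NamedTuple


class Label(NamedTuple):
    topic: str
    sentiment: str


def label_from_filename(filename: str) -> Label:
    """Split 'Topic - Sentiment(.txt)' into its two parts.

    Assumes the separator is ' - ' with spaces on both sides, so hyphens
    inside a part (e.g. 'Pro-choice') are left alone.
    """
    stem = Path(filename).stem  # drops a trailing extension such as .txt
    topic, sep, sentiment = stem.partition(" - ")
    topic, sentiment = topic.strip(), sentiment.strip()
    if not sep or not topic or not sentiment:
        raise ValueError(f"expected 'Topic - Sentiment', got {filename!r}")
    return Label(topic, sentiment)


print(label_from_filename("Climate Change - Denial.txt"))
```

A file whose name does not follow the format raises a `ValueError`, which would let a scraper reject it before any crawling starts.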

What should be in the file

The first entry is the domain of the website, which will be used to limit where a crawler can go and which links it can follow. It should not include 'http://' or 'www', just the domain name, such as realclimate.org.
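If you are copying from the address bar, the bare domain can be derived mechanically. The sketch below is only one way to do it, using Python's standard `urllib.parse`; `bare_domain` is a hypothetical helper name, and it assumes that a leading "www." is the only prefix you want dropped.

```python
from urllib.parse import urlparse


def bare_domain(url_or_domain: str) -> str:
    """Return the host without the scheme or a leading 'www.'."""
    text = url_or_domain.strip()
    # urlparse only recognises the host when the text contains '//',
    # so add it for inputs such as 'www.realclimate.org'.
    if "//" not in text:
        text = "//" + text
    host = urlparse(text).hostname or ""  # hostname is lower-cased, port removed
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(bare_domain("http://www.realclimate.org/index.php/archives/2017/05/"))
```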

The next entry is the URL pattern for the blog posts. By this I mean the longest URL prefix shared by all blog pages on that site. For example, on realclimate.org all of the blog posts can be found by year, e.g. http://www.realclimate.org/index.php/archives/2017/05/ or http://www.realclimate.org/index.php/archives/2016/03/. Thus, the common URL would be http://www.realclimate.org/index.php/archives/20. This is not itself a valid URL, but all valid post URLs MUST contain this sequence. This makes it easy for anyone scraping with Portia or other scrapers to simply enter this sequence into the regex section when designing a spider and then set it loose.

Finally, if you want, you can add a subjective evaluation of how extreme you believe the site's position to be, with 1 being centrist and 5 being extremist. A template is available in the URL Dump folder. Remember to name your file with the topic and sentiment.
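The common URL can also be worked out from a handful of sample post URLs rather than by eye. This is a minimal sketch using only the Python standard library; `common_url_prefix` and `is_post_url` are hypothetical names, and the check is a plain prefix match rather than anything specific to Portia.

```python
import os
import re


def common_url_prefix(urls):
    """Longest character-wise prefix shared by every URL in the list."""
    return os.path.commonprefix(list(urls))


def is_post_url(url, prefix):
    """True if the URL starts with the prefix (escaped so '.' stays literal)."""
    return re.match(re.escape(prefix), url) is not None


samples = [
    "http://www.realclimate.org/index.php/archives/2017/05/",
    "http://www.realclimate.org/index.php/archives/2016/03/",
]
print(common_url_prefix(samples))
```

With just these two samples the result ends in "archives/201", because both years are in the 2010s. Samples from a wider range of years are needed to arrive at the more general "archives/20", so it is worth checking a few old and new posts before settling on a pattern.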

The list of possible topics includes but is not limited to:

Climate Change - IsReal/Skeptic
Abortion - Pro-life/Pro-choice
Religion - Believers/Non-believers
Vaccines - Pro-vaccination/anti-vaccination
Guns - Pro-gun/Anti-gun
Drug Policy - Criminalization/Decriminalization and Legalization

We have deliberately stayed away from topics like Politics - Left/Right or Libertarian/Authoritarian for two reasons:

These sorts of categories are quite general and tend to encompass many of the above topics
Defining what is Left vs what is Right is more subjective and varies from person to person.

If you choose to create your own topic, please keep in mind that it should be clear and unambiguous as well as broad. For example, Yankees vs. Red Sox would not be a good topic, as it's very specific. If you have any doubts, please comment on this issue with your suggested topic and we'll give you feedback. Also, while any self-directed initiative is encouraged, keep in mind that we'd rather have a lot of data for just a few topics than sparser data for many topics.

Thank you for your efforts and patience.

PikioopSo · 2017-05-17T21:47:03Z

@TyJK

What's your take on extremist/radical sites? Are those counted too?

Kip/PiReel

TyJK · 2017-05-17T21:55:17Z

@PiReel

I think that encompassing the full range of stances on each side is important. That said, it's also important to get primarily the middle 90% of perspectives, and only have a few samples from the fringes.

We want the corpus to reflect all views under the broad umbrella of 'for' and 'against', and the extremists are likely to have some of the most distinctive speaking patterns which will be beneficial. But they also tend to be the most vitriolic, and so we definitely want to keep those in the minority. They'll be the easiest to find, so really they'll most likely have to be avoided once a few decently sized examples are found.

PikioopSo · 2017-05-18T18:31:56Z

Thanks for the info, @TyJK.

I also wanted to let you know about an add-on for Firefox called Pocket. It allows you to share bookmarked pages with a group of collaborators, and it also allows you to assign labels to the bookmarked pages.

Some things I like about it are:
Good-looking interface.
Shareable web pages make it easier to explain the concepts of complicated subjects.
Connectable to third-party accounts.

I think it would work really well for a project like this, where we need to organize a bunch of links and share them.

TyJK · 2017-05-18T20:31:18Z

@PiReel
I wasn't aware of that feature; that sounds like exactly what we need, thank you! I'll play around with it to get an idea of how we could use it in a systematic way and then maybe update it so that everyone who's contributing is on the same page.

TyJK · 2017-05-26T06:14:30Z

w.r.t. Pocket, unfortunately it doesn't seem there's any easy bulk sharing feature. For now I'm going to leave the recommendation to just share with a text file, but I'll keep tinkering with it to see how it can be used.

PikioopSo · 2017-05-31T12:53:17Z

@TyJK, sorry for the late response; my email is packed with mozsprint stuff. I was wondering if you were trying to find other people to share Pocket stuff with. I guess you were.

I am going to try to do a search for you on Pocket. What should I search for?

PikioopSo · 2017-05-31T12:53:56Z

I believe you can use tags, so that people contributing can do tag searches for an echoburst tag or something.

TyJK · 2017-05-31T13:53:41Z

@PiReel Tyler JK is the name I have set up; hopefully that's unique enough, but if not let me know. I'll see if I can set that up, but I've come across a second advantage of a .txt file, which is that I can read it into a scraper I write. Is there a way to download Pocket links into something like that? Thanks :)
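For what it's worth, reading such a .txt file into a scraper could look roughly like the sketch below. The layout is an assumption, since the actual template lives in the URL Dump folder: here each non-empty line is taken to be "domain, URL pattern, optional 1-5 rating", separated by commas. `Site` and `parse_site_lines` are hypothetical names, and the rating in the example line is a placeholder, not a real assessment.

```python
from typing import Iterable, List, NamedTuple, Optional


class Site(NamedTuple):
    domain: str
    url_prefix: str
    extremity: Optional[int]  # 1 = centrist ... 5 = extremist, if given


def parse_site_lines(lines: Iterable[str]) -> List[Site]:
    """Parse lines of the ASSUMED form 'domain, url_prefix[, rating]'."""
    sites = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 2:
            raise ValueError(f"expected at least domain and URL pattern: {raw!r}")
        extremity = int(fields[2]) if len(fields) > 2 and fields[2] else None
        sites.append(Site(fields[0], fields[1], extremity))
    return sites


example = """
# rating below is a placeholder for illustration only
realclimate.org, http://www.realclimate.org/index.php/archives/20, 3
"""
for site in parse_site_lines(example.splitlines()):
    print(site)
```

To read a real file, pass the open file object instead of `example.splitlines()`. A comma-separated layout would break on URLs that themselves contain commas, so a tab or another separator may turn out to be the safer choice.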

ghost · 2017-05-31T13:59:24Z

I'm writing an extension that turns the browser's file system into an "Adobe Bridge" type application that works with Pi Reel.

For your case, though, we would have to write your scraper as a browser extension that works with Pocket.

Which would be a nice piece of bookmarking software to have, but more elaborate.

TyJK · 2017-05-31T14:07:08Z

That seems like one of those things that might be an interesting project on its own, but not necessarily something we can get up and running right now. I'm already going through the various features you could incorporate with this. From what I can tell, you need 3 things to efficiently scrape a site: the domain, the sub domains it should follow (I've found that with most blogs, the archive has a longer URL that's consistent for all blog posts and not shared by non-post content) and the webpage elements you want to scrape. Once you have that for each site, you should be able to put it into a dictionary or list of lists and then run everything. Still doing research, but hopefully I'll have something more detailed by tonight.
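The dictionary idea might look roughly like this. It is only a sketch of the bookkeeping, not a working scraper: the `SITES` layout, the function names, and especially the CSS selectors are assumptions made up for illustration and have not been checked against the real site.

```python
from urllib.parse import urlparse

# One entry per site: the three things listed above.
SITES = {
    "realclimate.org": {
        "url_prefix": "http://www.realclimate.org/index.php/archives/20",
        # Hypothetical selectors for the elements to scrape.
        "selectors": {"title": "h1", "body": "div.entry-content"},
    },
}


def site_for(url, sites=SITES):
    """Return the configured domain a URL belongs to, or None."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host if host in sites else None


def should_follow(url, sites=SITES):
    """Follow a link only if it is on a known domain and matches its prefix."""
    domain = site_for(url, sites)
    return domain is not None and url.startswith(sites[domain]["url_prefix"])


def plan_crawl(urls, sites=SITES):
    """Group the URLs worth scraping by domain, dropping everything else."""
    plan = {domain: [] for domain in sites}
    for url in urls:
        if should_follow(url, sites):
            plan[site_for(url, sites)].append(url)
    return plan


print(plan_crawl([
    "http://www.realclimate.org/index.php/archives/2017/05/",
    "http://www.realclimate.org/",
]))
```

Keeping the per-site details in data rather than in code means adding a new blog is a matter of adding one entry, and the same loop can then be run over every site.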

TyJK · 2017-06-02T13:39:31Z

I'm taking on Climate Change, both sides, atm. Will upload within the next few hours and then people can add to that if they wish.

TyJK · 2017-06-02T16:07:12Z

Climate Change is up, will be working on Drug Policy - Criminalization/Decriminalization next.

TyJK added help wanted labelling mozsprint labels May 9, 2017

TyJK mentioned this issue May 26, 2017

NLP Models and Data Collection Discussion #8

Open

TyJK closed this as completed Mar 16, 2019
