# Progress report on newspaper bias identification

This document summarizes our approach and findings when evaluating political and quality biases of different online newspapers.

Running this notebook requires increasing the `iopub_data_rate_limit` setting:

```SH
jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
```

## Load and preprocess data

In [1]:
import numpy as np
import pandas as pd
from bias import Bias

In [2]:
df = pd.read_csv("../data/source/newsclust.csv")
df = df.query("site != 'cbn.com'")

In [3]:
# Look up the bias label of each article's domain and store it as a string.
df['bias'] = df['site'].apply(Bias.get_bias_for_domain).astype('str')

In [4]:
df[['date', 'site', 'text', 'title', 'url', 'bias']].head()

| | date | site | text | title | url | bias |
|---|---|---|---|---|---|---|
| 0 | 2015-01-29T23:14:00.000+02:00 | washingtonexaminer.com | Class action filed over United’s ‘low fare gua... | Class action filed over United’s ‘low fare gua... | http://www.washingtonexaminer.com/class-action... | Bias.RIGHT_CENTER |
| 1 | 2015-01-23T02:00:00.000+02:00 | nydailynews.com | Jupiterimages/Getty Images/Goodshoot RF Snuggl... | Portland pro cuddler hosts ‘Cuddle Con’ on Val... | http://www.nydailynews.com/news/national/portl... | Bias.LEFT_CENTER |
| 2 | 2015-01-17T22:05:00.000+02:00 | youngcons.com | Cops have been getting a lot of negative atten... | The Hilarious Reason Why Cops Don’t Want to We... | http://www.youngcons.com/hilarious-reason-cops... | Bias.RIGHT |
| 3 | 2015-01-21T23:24:00.000+02:00 | youngcons.com | Powered by Starbox \nIn the social media satur... | A Third Of All Divorce Cases Are Citing THIS W... | http://www.youngcons.com/third-divorce-cases-c... | Bias.RIGHT |
| 4 | 2015-01-09T02:00:00.000+02:00 | nj.com | View/Post Comments 2013 Star-Ledger file photo... | Feds name New Castle County, Delaware a high-i... | http://www.nj.com/south/index.ssf/2015/01/feds... | Bias.LEFT_CENTER |


## Embed articles and newspapers into a topic space

In [5]:
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

In [6]:
n_samples = 1000
X_samples = df['text'][:n_samples].values

In [7]:
pipeline = Pipeline([
    ('vect', CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')),
    ('class', LatentDirichletAllocation(n_components=20, learning_method='batch'))
])

X = pipeline.fit_transform(X_samples)
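
To check that the topics are meaningful before using them, it can help to list the highest-weighted words of each topic. The following is a minimal sketch on a toy corpus, not the notebook's data: `docs`, `toy_pipeline`, and `top_words` are hypothetical names introduced for illustration, and the pipeline is a scaled-down copy of the one above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus standing in for df['text'] (an assumption for illustration).
docs = [
    "senate votes on the budget bill",
    "the budget bill passes the senate vote",
    "team wins the final game of the season",
    "the season ends as the team loses the game",
]

toy_pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('class', LatentDirichletAllocation(n_components=2, learning_method='batch',
                                        random_state=0)),
])

# Each row is one document's distribution over the topics.
doc_topics = toy_pipeline.fit_transform(docs)

# Rebuild the vocabulary in column order from the fitted vectorizer.
vect = toy_pipeline.named_steps['vect']
vocab = np.array(sorted(vect.vocabulary_, key=vect.vocabulary_.get))
lda = toy_pipeline.named_steps['class']


def top_words(lda_model, vocab, n_top=3):
    """Return the n_top highest-weighted words of every topic."""
    return [list(vocab[np.argsort(row)[::-1][:n_top]])
            for row in lda_model.components_]


for topic_id, words in enumerate(top_words(lda, vocab)):
    print(topic_id, words)
```

On the real data, the same helper could be applied to `pipeline.named_steps['class']` and `pipeline.named_steps['vect']`.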

## Choose projection axes

The topic space clusters together articles that cover the same events. However, it is difficult to interpret in terms of bias because its dimensions do not have a well-defined semantic meaning. Fortunately, we have labels describing the bias of different newspapers, and we can use them to project the embeddings along custom axes.

First, we need to compute a global representation of each newspaper's topics. To do this, we simply average the topic vectors of its individual articles.

In [8]:
centroids = []
labels = []
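# Only the first n_samples articles were embedded, so restrict the site column to match X.
sites = df['site'][:n_samples].values
for site in df['site'].unique():
    idx = sites == site
    if idx.any():  # skip sites that have no article among the samples
        centroids.append(np.mean(X[idx, :], axis=0))
        labels.append(site)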
centroids = np.array(centroids)
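
The per-site averaging can also be written as a pandas `groupby`. The sketch below uses toy stand-ins (`X_toy`, `sites_toy`, and the `toy_*` outputs are hypothetical names, not the notebook's variables) to show the idea under the assumption that each row of the topic matrix lines up with one entry of the site column.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumptions): 5 articles embedded in a 3-topic space.
X_toy = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.0, 0.4, 0.6],
])
sites_toy = pd.Series(['a.com', 'a.com', 'b.com', 'c.com', 'c.com'])

# Group the article vectors by site and average within each group.
centroid_df = pd.DataFrame(X_toy).groupby(sites_toy.values).mean()

toy_labels = list(centroid_df.index)   # site names, sorted alphabetically
toy_centroids = centroid_df.values     # one averaged topic vector per site
```

Note that `groupby` sorts the sites alphabetically, whereas the loop keeps them in order of first appearance. The labels and centroids stay aligned either way.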

We can now choose a few prototypical newspaper vectors to construct our axes. We do just that, after renormalizing the centroids for a better visual effect.

In [9]:
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
centroids = normalizer.fit_transform(centroids)
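
With its default settings, `Normalizer` rescales each row to unit Euclidean (L2) length, so every newspaper ends up on the unit sphere regardless of how concentrated its topic distribution is. A small sketch on toy values (the array `toy` is an assumption for illustration) shows the equivalent NumPy operation:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two toy row vectors (an assumption for illustration).
toy = np.array([[3.0, 4.0],
                [1.0, 0.0]])

# Normalizer defaults to norm='l2' and works row by row.
via_sklearn = Normalizer().fit_transform(toy)

# The same operation written out: divide each row by its Euclidean length.
via_numpy = toy / np.linalg.norm(toy, axis=1, keepdims=True)

print(via_sklearn)  # rows [0.6, 0.8] and [1.0, 0.0]
```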

In [10]:
# We can play a bit with the axes to get a mapping that makes more sense.

ic = dict(zip(labels, centroids))

# Note: division binds tighter than subtraction, so only the dailykos.com vector is scaled by 1/3.
x_axis = ic['breitbart.com'] - ic['dailykos.com'] / 3
y_axis = ic['reuters.com'] - ic['thinkprogress.org']

X_proj = centroids.dot(x_axis)
Y_proj = centroids.dot(y_axis)
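
To make the projection concrete, here is a minimal worked example with made-up prototypes. The names `right.example` and `left.example` and all values are assumptions, not real newspapers. The axis is the difference of two prototype vectors, and the dot product gives each point's coordinate along it.

```python
import numpy as np

# Hypothetical unit-length prototype vectors in a 3-topic space.
protos = {
    'right.example': np.array([1.0, 0.0, 0.0]),
    'left.example': np.array([0.0, 1.0, 0.0]),
}

# The axis points from the "left" prototype towards the "right" one.
axis = protos['right.example'] - protos['left.example']

# Three points: one identical to each prototype, one unrelated to both.
points = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Coordinate of each point along the axis.
proj = points.dot(axis)
print(proj)  # [ 1. -1.  0.]
```

A point close to the first prototype gets a positive coordinate, a point close to the second gets a negative one, and a point unrelated to both lands near zero. Because the centroids are unit-normalized, each coordinate is the difference between two cosine similarities.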

## Visualize results

In [11]:
from plotly.offline import init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

from plotly.graph_objs import Bar, Scatter, Figure, Layout

In [12]:
# 'top center' spells out the label position explicitly; plain dicts replace
# the deprecated XAxis/YAxis classes.
trace = Scatter(x=X_proj, y=Y_proj, mode='markers+text', text=labels,
                textposition='top center', marker=dict(size=10))
iplot({
    'data': [trace],
    'layout': Layout(
        xaxis=dict(title='Left vs Right'),
        yaxis=dict(title='Biased vs Factual'),
        autosize=False,
        width=1000,
        height=700)},
    show_link=False
)

## Next steps

This is only a preliminary result, and much can be done to improve on it. The main limitation of this technique is that it relies only on topic information. Incorporating more features is one of the directions we are considering:

* Rodrigo is looking at better classifiers
* Use NER + sentiment as an additional feature
* Group by topic and repeat the same analysis to get a finer-grained visualization