# Facebook politics fact-checking

**If you like my work, please leave an upvote and/or a comment: it will be really appreciated and will motivate me to offer more content to the Kaggle community! :)**

Welcome and thanks for viewing my Kernel!

**Context**

During the 2016 US presidential election, the phrase “fake news” found its way to the forefront in news articles, tweets, and fiery online debates the world over after misleading and untrue stories proliferated rapidly. BuzzFeed News analyzed over 1,000 stories from hyperpartisan political Facebook pages selected from the right, left, and mainstream media to determine the nature and popularity of false or misleading information they shared.

**Content**

This dataset supports the original story “Hyperpartisan Facebook Pages Are Publishing False And Misleading Information At An Alarming Rate” published October 20th, 2016. Here are more details on the methodology used for collecting and labeling the dataset (reproduced from the story):

**More on Our Methodology and Data Limitations**

“Each of our raters was given a rotating selection of pages from each category on different days. In some cases, we found that pages would repost the same link or video within 24 hours, which caused Facebook to assign it the same URL. When this occurred, we did not log or rate the repeat post and instead kept the original date and rating. Each rater was given the same guide for how to review posts:

* “Mostly True: The post and any related link or image are based on factual information and portray it accurately. This lets them interpret the event/info in their own way, so long as they do not misrepresent events, numbers, quotes, reactions, etc., or make information up. This rating does not allow for unsupported speculation or claims.

* “Mixture of True and False: Some elements of the information are factually accurate, but some elements or claims are not. This rating should be used when speculation or unfounded claims are mixed with real events, numbers, quotes, etc., or when the headline of the link being shared makes a false claim but the text of the story is largely accurate. It should also only be used when the unsupported or false information is roughly equal to the accurate information in the post or link. Finally, use this rating for news articles that are based on unconfirmed information.

* “Mostly False: Most or all of the information in the post or in the link being shared is inaccurate. This should also be used when the central claim being made is false.

* “No Factual Content: This rating is used for posts that are pure opinion, comics, satire, or any other posts that do not make a factual claim. This is also the category to use for posts that are of the “Like this if you think...” variety.

“In gathering the Facebook engagement data, the API did not return results for some posts. It did not return reaction count data for two posts, and two posts also did not return comment count data. There were 70 posts for which the API did not return share count data. We also used CrowdTangle's API to check that we had entered all posts from all nine pages on the assigned days. In some cases, the API returned URLs that were no longer active. We were unable to rate these posts and are unsure if they were subsequently removed by the pages or if the URLs were returned in error.”

**Acknowledgements**

This dataset was originally published on GitHub by BuzzFeed News here: https://github.com/BuzzFeedNews/2016-10-facebook-fact-check

# Index of the analysis

* 1 Module importing
* 2 Load and display data
* 3 Distribution of rating for all the dataset
* 4 Who is the author of the majority of the fake news?
* 5 Engagement Analysis
    * 5.1 Engagement on fake news
    * 5.2 Occupy Democrats analysis
    * 5.3 Type of content and engagement
* 6 How do left, mainstream, and right categories of Facebook pages differ in the stories they share?
    * 6.1 Posts engagement for left, right, mainstream pages
    * 6.2 Relation between post type, engagement and category
* 7 Is BuzzFeed’s finding that “the least accurate pages generated some of the highest numbers of shares, reactions, and comments on Facebook” true?
* 8 Conclusions

# 1. Module importing

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import os
print(os.listdir("../input"))

# 2. Load and display data

In [None]:
df = pd.read_csv('../input/facebook-fact-check.csv')

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()
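
The methodology quoted above mentions that the API did not return some reaction, comment and share counts. A minimal sketch of how such gaps could be counted is shown below. It runs on a tiny made-up dataframe (`toy`, hypothetical values) that only borrows the column names of the real CSV, so the numbers it prints say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV: the values are invented for illustration only.
toy = pd.DataFrame({
    "share_count": [10.0, np.nan, 5.0, np.nan],
    "reaction_count": [100.0, 50.0, np.nan, 20.0],
    "comment_count": [7.0, 3.0, 1.0, 2.0],
})

engagement_cols = ["share_count", "reaction_count", "comment_count"]

# Number of posts with no value returned, per engagement column.
missing_per_column = toy[engagement_cols].isna().sum()
print(missing_per_column)
```

Replacing `toy` with `df` would give the same check on the loaded data.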

# 3. Distribution of rating for all the dataset

Let's start by analyzing the distribution of the content we are reviewing.

Our goal is to visualize how the rating is distributed between the 4 categories available: Mostly True, Mixture of True and False, Mostly False, No Factual Content.

In [None]:
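plt.figure(figsize=(20, 10))
sns.countplot(x='Rating', data=df)  # keyword arguments keep this working across seaborn versions
plt.title('Distribution of Rating')
plt.tight_layout()  # call after drawing so the layout accounts for labels and title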

The majority of the content is rated mostly true. One possible explanation is Facebook's effort to remove false content, although the dataset itself cannot confirm this.
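
A countplot shows absolute counts. To put a number on "the majority", the same distribution can be expressed as proportions. The sketch below uses a small invented dataframe (`toy`, hypothetical rows) with the same `Rating` column name, so the printed shares are not the real ones.

```python
import pandas as pd

# Toy data: five invented posts, only the column name matches the real CSV.
toy = pd.DataFrame({
    "Rating": [
        "mostly true",
        "mostly true",
        "mostly true",
        "mixture of true and false",
        "no factual content",
    ]
})

# normalize=True returns the fraction of posts per rating instead of the count.
rating_share = toy["Rating"].value_counts(normalize=True)
print(rating_share)
```

On the real data, `df['Rating'].value_counts(normalize=True)` would give the equivalent table.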

# 4. Who is the author of the majority of the fake news?

Let's filter our dataframe for 'mostly false' news to discover the source.

We will start by creating a dataframe called *fakeDf* and then draw a countplot.

In [None]:
fakeDf = df.loc[df['Rating'] == 'mostly false']

In [None]:
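plt.figure(figsize=(20, 10))
sns.countplot(x='Page', data=fakeDf, palette='Set2')  # keyword arguments keep this working across seaborn versions
plt.title('Fake news authors')
plt.tight_layout()  # call after drawing so the layout accounts for labels and title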

The graph shows that Politico and CNN Politics were the authors of the majority of the election-related fake news.
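
Raw counts can be misleading when pages publish different numbers of posts, so it may help to look at the share of each page's posts that is rated 'mostly false'. The sketch below illustrates the idea on an invented dataframe (`toy`, with hypothetical pages "Page A" and "Page B"): both pages have one false post, but the shares differ.

```python
import pandas as pd

# Toy data: invented pages and ratings, only the column names match the real CSV.
toy = pd.DataFrame({
    "Page": ["Page A"] * 4 + ["Page B"] * 2,
    "Rating": [
        "mostly true", "mostly true", "mostly true", "mostly false",
        "mostly true", "mostly false",
    ],
})

# Absolute number of posts per page and rating.
counts = pd.crosstab(toy["Page"], toy["Rating"])

# Same table, but each row is divided by the page's total number of posts.
shares = pd.crosstab(toy["Page"], toy["Rating"], normalize="index")

print(counts)
print(shares)
```

Here both pages have the same count of 'mostly false' posts, yet the share is 0.25 for "Page A" and 0.5 for "Page B".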

At this point, we can run an engagement analysis to discover which page authored the posts with the most engagement.

# 5. Engagement Analysis

# 5.1 Engagement on fake news

Let's reuse the dataframe we created before, which contains the news rated as mostly false.

In [None]:
fakeDf.head()

Let's plot the shares, reactions and comments to visualize the engagement:

In [None]:
plt.figure(figsize=(20, 10))
# Plot fakeDf (not df) so the chart matches its title: only 'mostly false' posts
sns.barplot(x='Page', y='share_count', data=fakeDf, palette='pastel')
plt.title('Shares on mostly false news')
plt.tight_layout()

In [None]:
plt.figure(figsize=(20, 10))
sns.barplot(x='Page', y='reaction_count', data=fakeDf, palette='colorblind')
plt.title('Reactions on mostly false news')
plt.tight_layout()

In [None]:
plt.figure(figsize=(20, 10))
sns.barplot(x='Page', y='comment_count', data=fakeDf, palette='deep')
plt.title('Comments on mostly false news')
plt.tight_layout()

The graphs show that 'Occupy Democrats' leads in engagement across comments, shares and reactions.
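
By default, `sns.barplot` draws the mean of each group, so a single viral post can dominate a page's bar. A rough way to check how much that matters is to compare the mean with the median per page. The sketch below does this on invented numbers (`toy`, hypothetical pages and share counts), not on the real data.

```python
import pandas as pd

# Toy data: "Page A" has one invented viral post, "Page B" is steady.
toy = pd.DataFrame({
    "Page": ["Page A", "Page A", "Page A", "Page B", "Page B", "Page B"],
    "share_count": [10, 20, 3000, 100, 110, 120],
})

# Mean, median and number of posts per page.
summary = toy.groupby("Page")["share_count"].agg(["mean", "median", "count"])
print(summary)
```

In this toy case "Page A" wins on the mean but loses on the median, which is the kind of gap worth checking before naming a leader.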

But is Occupy Democrats a page that shares only fake news? Let's find out in the next section!

# 5.2 Occupy Democrats analysis

We will start by creating a dataframe that includes only the Occupy Democrats posts:

In [None]:
dfOccupy = df.loc[df['Page'] == 'Occupy Democrats']

Let's use the info() function to see the number of rows in this dataframe, along with other information that gives us an idea of what we are dealing with:

In [None]:
dfOccupy.info()

209 rows - At this point, we can create a countplot to see the distribution of the rating:

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(x='Rating', data=dfOccupy, palette='Set2')  # pass the column by keyword
plt.title('Rating for Occupy Democrats')

What we see is that:
* The largest share of the news published by Occupy Democrats is classified as mostly true;
* Only a small part of the news is classified as mostly false;
* The sum of the news classified as 'no factual content' and 'mixture of true and false' is roughly equal to the 'mostly true' amount.

# 5.3 Type of content and engagement

Using Occupy Democrats as an example, let's now review which content receives the highest engagement.

We can start by reviewing the dataframe:

In [None]:
dfOccupy.head()

Let's now plot the relationship between the rating, the comments/shares/reactions and the post type:

In [None]:
fig = plt.figure(figsize=(20, 15))
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)

sns.barplot(x="Rating", y="reaction_count", hue="Post Type", data=dfOccupy, ax=ax1, palette='pastel')

sns.barplot(x="Rating", y="comment_count", hue="Post Type", data=dfOccupy, ax=ax2, palette='pastel')

sns.barplot(x="Rating", y="share_count", hue="Post Type", data=dfOccupy, ax=ax3, palette='pastel')

What we can see from the 3 graphs above:
* Videos receive more engagement;
* The content rated as 'no factual content' and 'mixture of true and false' receives more engagement;
* Videos are absent from the 'mostly false' category.
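
An absent bar is easy to miss in a grouped barplot, so a claim such as "videos are absent from a rating" can be double-checked with a plain count table. The sketch below shows one way to do it on invented rows (`toy`, hypothetical values) that reuse the `Rating` and `Post Type` column names.

```python
import pandas as pd

# Toy data: five invented posts, only the column names match the real CSV.
toy = pd.DataFrame({
    "Rating": ["mostly true", "mostly true", "mostly false",
               "no factual content", "mostly false"],
    "Post Type": ["video", "link", "link", "video", "photo"],
})

# Number of posts for every (rating, post type) combination; missing pairs become 0.
combo_counts = pd.crosstab(toy["Rating"], toy["Post Type"])

# List the combinations that never occur.
empty_combos = [
    (rating, post_type)
    for rating in combo_counts.index
    for post_type in combo_counts.columns
    if combo_counts.loc[rating, post_type] == 0
]

print(combo_counts)
print(empty_combos)
```

Running the same two steps on `dfOccupy` would list the empty combinations in the real data.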

# 6. How do left, mainstream, and right categories of Facebook pages differ in the stories they share?

Let's try to visualize the differences in content according to the orientation of the page.

I always start by viewing the first rows of the data as a reminder:

In [None]:
df.head()

# 6.1 Posts engagement for left, right, mainstream pages

Okay, at this point I will create 3 subplots using: 

* The rating on the x-axis;
* The count of reactions, comments, shares on the y-axis;
* The category as hue.

Let's see what happens:

In [None]:
fig = plt.figure(figsize=(20, 15))
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)

sns.barplot(x="Rating", y="reaction_count", hue="Category", data=df, ax=ax1, palette='pastel')

sns.barplot(x="Rating", y="comment_count", hue="Category", data=df, ax=ax2, palette='pastel')

sns.barplot(x="Rating", y="share_count", hue="Category", data=df, ax=ax3, palette='pastel')

Some of the deductions we can make using the 3 graphs above:

* Left pages see more engagement;
* Left non-factual content sees more engagement;
* Mainstream and Right pages receive less engagement.
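
The deductions above are read off the bar heights. The underlying numbers can also be laid out as a table with one row per rating and one column per category. The sketch below is only an illustration on invented values (`toy`, hypothetical reaction counts), and it uses the median rather than the mean as an assumption, to reduce the weight of outliers.

```python
import pandas as pd

# Toy data: invented reaction counts, only the column names match the real CSV.
toy = pd.DataFrame({
    "Rating": ["mostly true", "mostly true", "mostly true", "mostly true",
               "no factual content", "no factual content"],
    "Category": ["left", "left", "right", "mainstream", "left", "right"],
    "reaction_count": [100, 300, 50, 80, 1000, 40],
})

# Median reactions for every rating/category pair; pairs with no posts show NaN.
table = toy.pivot_table(
    index="Rating",
    columns="Category",
    values="reaction_count",
    aggfunc="median",
)
print(table)
```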

# 6.2 Relation between post type, engagement and category

Let's proceed by creating a FacetGrid with a few barplots showing these relations:

In [None]:
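vals = ['reaction_count', 'comment_count', 'share_count']
# Fix the category order so every facet shows the same post types in the same position
post_types = sorted(df['Post Type'].dropna().unique())
for val in vals:
    g = sns.FacetGrid(df, col="Category")
    g.map(sns.barplot, "Post Type", val, order=post_types, palette='Set2')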

New deductions we can make:

* Videos, links and photos are, in this order, the content types with the most engagement;
* Text sees very little engagement compared with the other post types;
* Left content sees more engagement in this particular dataset;
* Links receive more reactions than videos, but fewer comments and fewer shares.

# 7. Is BuzzFeed’s finding that “the least accurate pages generated some of the highest numbers of shares, reactions, and comments on Facebook” true?

To answer this question, let's start with a quick review of the dataframe to refresh the data in our minds:

In [None]:
df.head()

**How do we measure engagement?**

One approach is to create a new column "TotalEngagement" that sums the values for shares, reactions and comments.

In [None]:
df['TotalEngagement'] = df[['share_count', 'reaction_count', 'comment_count']].sum(axis=1)  # missing counts are skipped, i.e. treated as 0
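
Since the methodology says some posts have no share, reaction or comment count, it is worth knowing how a row-wise sum treats those gaps. The sketch below compares the default behavior with a stricter variant on invented values (`toy`, hypothetical numbers). The stricter variant is only one possible choice, not the approach used in this kernel.

```python
import numpy as np
import pandas as pd

# Toy data: the second row lacks a share count, the third row lacks everything.
toy = pd.DataFrame({
    "share_count": [10.0, np.nan, np.nan],
    "reaction_count": [100.0, 50.0, np.nan],
    "comment_count": [7.0, 3.0, np.nan],
})
cols = ["share_count", "reaction_count", "comment_count"]

# Default: missing values are skipped, so they behave like zeros.
total_default = toy[cols].sum(axis=1)

# Stricter: the total is NaN unless all three counts are present.
total_strict = toy[cols].sum(axis=1, min_count=len(cols))

print(total_default.tolist())
print(total_strict.tolist())
```

With the default, a post with no data at all ends up with an engagement of 0, which can pull a page's average down.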

Okay, let's move on.

**How can we classify a page as 'Least Accurate'?**

In my opinion, we have to consider only the content classified as 'mixture of true and false' and 'mostly false'.

So, let's now create a new dataframe that keeps only these two ratings, leaving out 'mostly true' and 'no factual content':

In [None]:
notPositiveDf = df.loc[df['Rating'].isin(['mixture of true and false', 'mostly false'])]

In [None]:
notPositiveDf.head()

Perfect: at this point we will plot two graphs:

1. The first one will show, for each page, the average total engagement per post for the news classified as 'mixture of true and false' and 'mostly false';
2. The second one will be a countplot showing, for each page, the number of posts classified as 'mixture of true and false' or 'mostly false'.

In [None]:
fig = plt.figure(figsize=(20, 15))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)

sns.barplot(x="Page", y="TotalEngagement", hue="Rating", data=notPositiveDf, ax=ax1, palette='pastel')
# plt.title acts on the current axes, which here is the last one created (ax2), so title ax1 explicitly
ax1.set_title('Mixture of True and False and Mostly False content')
sns.countplot(x="Page", data=notPositiveDf, ax=ax2, palette='pastel')

So, it seems that the answer to our question is that **the least accurate pages generated a small number of shares, reactions, and comments on Facebook compared with the more accurate ones**.
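
One way to make "least accurate page" measurable is to compute, for each page, the share of its posts rated 'mixture of true and false' or 'mostly false', and to place it next to a typical engagement figure. The sketch below is an illustration under assumptions: the data is invented (`toy`, hypothetical pages and values), and the helper column `is_inaccurate` and the use of the median are choices made here, not part of the original analysis.

```python
import pandas as pd

# Toy data: two invented pages with four posts each.
toy = pd.DataFrame({
    "Page": ["Page A"] * 4 + ["Page B"] * 4,
    "Rating": [
        "mostly true", "mostly true", "mostly true", "mostly false",
        "mostly false", "mixture of true and false", "mostly true",
        "no factual content",
    ],
    "TotalEngagement": [100, 200, 300, 400, 1000, 2000, 3000, 4000],
})

# Flag the posts that count as inaccurate under the definition used above.
inaccurate = ["mixture of true and false", "mostly false"]
toy["is_inaccurate"] = toy["Rating"].isin(inaccurate)

# Per page: share of inaccurate posts, median engagement, number of posts.
per_page = toy.groupby("Page").agg(
    inaccurate_share=("is_inaccurate", "mean"),
    median_engagement=("TotalEngagement", "median"),
    posts=("Rating", "size"),
).sort_values("inaccurate_share", ascending=False)

print(per_page)
```

Sorting by `inaccurate_share` puts the least accurate pages first, so their engagement can be compared directly with that of the more accurate ones.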

# 8. Conclusions

**First of all, thank you so much for reading! If you liked my work, please do not forget to leave an upvote: it will be really appreciated and will motivate me to offer more content to the Kaggle community! :)**

I will review and update the kernel periodically following your suggestions or if I want to discover something new (see the changelog at the beginning with the history of the updates).

If you want to ask something, feel free to comment!