# Intro

In this notebook we collect the top posts from news-related subreddits and store the links to the news articles they point to. At the end we try to download the full article text from each publisher via these URLs.

# Imports

In [2]:
import praw
from psaw import PushshiftAPI
import pandas as pd 
import numpy as np 
#pd.set_option('display.max_colwidth', -1)

In [3]:
# Install of the PushshiftAPI
#!pip install psaw

Collecting psaw
  Downloading psaw-0.0.12-py3-none-any.whl (15 kB)
Installing collected packages: psaw
Successfully installed psaw-0.0.12


In [2]:
# Login credentials for praw & Reddit API
r = praw.Reddit(client_id='ddxZYbBilApY5A', client_secret='4rxjgOizdOJlhuyD781bi4tCqH8', user_agent='Henlo')
api = PushshiftAPI(r)
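
The credentials above are written directly into the cell. A minimal sketch of reading them from environment variables instead, assuming hypothetical variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`) that are not part of the original notebook:

```python
import os


def load_reddit_credentials(env=os.environ):
    """Collect the keyword arguments used for praw.Reddit from a mapping."""
    # Hypothetical variable names; any naming scheme works.
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a readable message if something is not set.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Usage once the variables are set in the shell:
# r = praw.Reddit(**load_reddit_credentials())
```

This way the notebook can be shared without exposing the client secret.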

---

# Functions 

The following function fetches the top posts (up to 1000) of a given subreddit and returns their titles and URLs as a dataframe.

In [47]:
def get_reddit_details(reddit_name): 
    top_posts = r.subreddit(reddit_name).top(limit=1000)
    lst = []
    title = []
    for post in top_posts:
        lst.append(post.url)
        title.append(post.title)
    df = pd.DataFrame({'title' : title, 'url' : lst})
    return df
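
Some top posts are text-only posts whose URL points back to reddit.com. A possible variant, sketched here with stand-in objects instead of real praw submissions, skips those posts while collecting the rows. The helper name `posts_to_rows` is hypothetical; `is_self` is the praw attribute that marks text-only submissions.

```python
from types import SimpleNamespace


def posts_to_rows(posts, skip_self_posts=True):
    """Turn post objects into title/url rows, optionally skipping text posts."""
    rows = []
    for post in posts:
        # Objects without an is_self attribute are treated as link posts.
        if skip_self_posts and getattr(post, "is_self", False):
            continue
        rows.append({"title": post.title, "url": post.url})
    return rows


# Stand-ins for praw submissions, for illustration only.
fake_posts = [
    SimpleNamespace(title="Linked article", url="https://example.com/story", is_self=False),
    SimpleNamespace(title="Text post", url="https://www.reddit.com/r/example/comments/abc/", is_self=True),
]
rows = posts_to_rows(fake_posts)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`.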

We use newspaper3k to download and parse the article behind every URL in our dataframe.

In [49]:
from newspaper import Article
import urllib.request, urllib.error

# Download and parse the articles behind link_list[start:end].
# Note: relies on the global variables df and link_list.
def get_articles(start, end, prefix):
    good = 0
    bad = 0

    articles_txt_list = []
    links = link_list[start:end]

    for link in links:
        # Progress: number of links that still have to be processed
        print("{}".format(len(links) - (good + bad)), end="\r")
        if link == "None":
            articles_txt_list.append("None")
            bad += 1
        else:
            try:
                #conn = urllib.request.urlopen(link)
                n_article = Article(url=link, fetch_images=False, request_timeout=10, number_threads=15)
                n_article.download()
                n_article.parse()
                articles_txt_list.append(n_article.text)
                good += 1
            except Exception:
                # Download or parsing failed: keep a placeholder so the rows stay aligned
                bad += 1
                articles_txt_list.append("None")

    articles_txt_df = pd.DataFrame({"article_txt": articles_txt_list})
    articles_txt_df_con = pd.concat([df[start:end].reset_index(drop=True), articles_txt_df], axis=1)
    articles_txt_df_con.to_csv("reddit_news_data/{}_dataset_{}_{}.csv".format(prefix, start, end), index=False)

    print("# Articles:", len(articles_txt_list))
    print("# bad:", bad)
    print("# good:", good)
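
The bookkeeping in `get_articles` (one placeholder per failed link, plus good/bad counters) can be tried without any network access. The sketch below is an assumption-laden illustration: `collect_texts` and `fake_fetch` are hypothetical names, and the download step is passed in as a function so that a stub can replace newspaper3k.

```python
def collect_texts(links, fetch_text, placeholder="None"):
    """Fetch text for every link and keep one entry per link, even on failure."""
    texts, good, bad = [], 0, 0
    for link in links:
        try:
            if link == placeholder:
                raise ValueError("no link available")
            texts.append(fetch_text(link))
            good += 1
        except Exception:
            # A failed link still gets an entry so the list lines up with the dataframe.
            texts.append(placeholder)
            bad += 1
    return texts, good, bad


def fake_fetch(link):
    """Stub downloader: fails for links containing 'broken'."""
    if "broken" in link:
        raise IOError("download failed")
    return "text of " + link


texts, good, bad = collect_texts(
    ["https://example.com/a", "None", "https://example.com/broken"], fake_fetch
)
print(texts, good, bad)
```

Because the output list always has the same length as the input, it can be concatenated column-wise with the slice of the dataframe it was built from.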

---

#### r/UpliftingNews

Only good news is shared in this subreddit.

In [None]:
df = get_reddit_details('UpliftingNews')

In [None]:
link_list = df.url.tolist()
len(link_list)

In [26]:
get_articles(0,df.shape[0], "uplifiting_news")

# Articles: 990
# bad: 135
# good: 855


In [39]:
uplifting_news = pd.read_csv("reddit_news_data/uplifiting_news_dataset_0_990.csv")
uplifting_news.head()

Unnamed: 0,title,url,0
0,"Man falsely imprisoned for 10 years, uses pris...",https://www.nbcnews.com/news/us-news/defendant...,Attorney Jarrett Adams recently helped overtur...
1,First paralyzed human treated with stem cells ...,https://educateinspirechange.org/science-techn...,Imagine losing control of your car and waking ...
2,Over a Million People Sign Petition Calling Fo...,https://www.newsweek.com/kkk-petition-terroris...,
3,Hollywood Superstar Keanu Reeves Has Secretly ...,https://www.theepochtimes.com/hollywood-supers...,
4,Amazon tribe wins legal battle against oil com...,https://www.disclose.tv/amazon-tribe-wins-laws...,The Amazon Rainforest is well known across the...


In [40]:
uplifting_news.rename(columns={"0": "article_txt"}, inplace=True)

In [41]:
# Drop None & NaN -> appears if the scraping was not successful
uplifting_news.dropna(inplace=True)
uplifting_news = uplifting_news[uplifting_news["article_txt"] != "None"]
uplifting_news.shape

(834, 3)

In [17]:
uplifting_news.head()

Unnamed: 0,title,url,article_txt
0,"Man falsely imprisoned for 10 years, uses pris...",https://www.nbcnews.com/news/us-news/defendant...,Attorney Jarrett Adams recently helped overtur...
1,First paralyzed human treated with stem cells ...,https://educateinspirechange.org/science-techn...,Imagine losing control of your car and waking ...
4,Amazon tribe wins legal battle against oil com...,https://www.disclose.tv/amazon-tribe-wins-laws...,The Amazon Rainforest is well known across the...
6,No children died in traffic accidents in Norwa...,https://www.nrk.no/trondelag/ingen-barn-dode-i...,I 1970 døde det 560 mennesker i den norske tra...
7,President Trump signs animal cruelty bill into...,https://abcnews.go.com/amp/Politics/president-...,President Trump signs animal cruelty bill into...


Looks good.

In [42]:
# Add one column with the source
uplifting_news["source"] = ["UpliftingNews" for i in range(uplifting_news.shape[0])]

In [43]:
# Add one column with the label (0 = bad, 1 = good)
uplifting_news["label"] = [1 for i in range(uplifting_news.shape[0])]

In [44]:
# Save the new/cleaned dataframe as csv
uplifting_news.to_csv("reddit_news_data/cleaned/reddit_upliftingnews.csv", index=False)

---

#### r/Collapse

Only bad news is shared in this subreddit.

In [6]:
df = get_reddit_details('collapse')

In [7]:
link_list = df.url.tolist()
len(link_list)

998

In [21]:
get_articles(0,df.shape[0],"collapse")

# Articles: 998
# bad: 26
# good: 972


In [8]:
df.url[0:30]

0     https://www.reddit.com/r/collapse/comments/gv7...
1     https://www.reddit.com/r/collapse/comments/h8f...
2                   https://i.redd.it/nxc4kdfc1d251.jpg
3                   https://i.redd.it/98fyhfza47841.jpg
4                   https://i.redd.it/24nnkbjs69p41.png
5                   https://i.redd.it/ahigharrhhm51.jpg
6     https://preview.redd.it/pmdknot1c8t41.jpg?widt...
7                   https://i.redd.it/5yppxpdkemd51.jpg
8                   https://i.redd.it/oddkv702f6d51.jpg
9                   https://i.redd.it/o02xf18i2h151.jpg
10                  https://i.redd.it/psigl5fv7q351.jpg
11                  https://i.redd.it/am5djw8rk7v51.jpg
12                  https://i.redd.it/jwvg2u2sr6f51.png
13                  https://i.redd.it/s3dipqaut2s51.jpg
14                      https://i.imgur.com/8jPnfr2.jpg
15                  https://i.redd.it/492db98qzlm51.jpg
16    https://finance.yahoo.com/news/world-2-000-bil...
17    https://preview.redd.it/qzt1mydo9lg51.jpg?

We can see that many of the URLs are not useful, e.g. links to images or to Reddit itself.
We will delete them manually.
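
Part of this manual check could be automated by looking at the host name and file extension of each URL. The following is only a sketch under assumptions: `looks_like_article` and the two small lists are hypothetical and far from complete.

```python
from urllib.parse import urlparse

# Illustrative lists, not an exhaustive filter.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")
BLOCKED_HOSTS = ("redd.it", "reddit.com", "imgur.com")


def looks_like_article(url):
    """Return False for image files and for hosts that never carry news articles."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.lower()
    # The query string is not part of path, so 'x.jpg?width=...' is still caught.
    if path.endswith(IMAGE_SUFFIXES):
        return False
    # Match the host itself or any of its subdomains (i.redd.it, www.reddit.com, ...).
    return not any(host == h or host.endswith("." + h) for h in BLOCKED_HOSTS)


urls = [
    "https://i.redd.it/nxc4kdfc1d251.jpg",
    "https://i.imgur.com/8jPnfr2.jpg",
    "https://v.redd.it/rixf4tkh3p231",
    "https://example.com/news/story",
]
kept = [u for u in urls if looks_like_article(u)]
```

Matching on the parsed host avoids dropping an article just because a blocked word happens to appear somewhere in its path.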

In [45]:
# Load the csv file as dataframe
collapse_news = pd.read_csv("reddit_news_data/collapse_dataset_0_998.csv")
collapse_news.head()

Unnamed: 0,title,url,article_txt
0,The US is a Shithole Country,https://www.reddit.com/r/collapse/comments/gv7...,I’m so mad right now. I have so much loathing ...
1,This is a class war,https://www.reddit.com/r/collapse/comments/h8f...,"Reposted again. Remember children, hug and kis..."
2,US Senator Tom Cotton calls for the military t...,https://i.redd.it/nxc4kdfc1d251.jpg,
3,Interesting Times,https://i.redd.it/98fyhfza47841.jpg,
4,Put into perspective,https://i.redd.it/24nnkbjs69p41.png,


In [21]:
collapse_news.shape

(998, 3)

In [46]:
# Drop None & NaN -> appears if the scraping was not successful
collapse_news.dropna(inplace=True)
collapse_news = collapse_news[collapse_news["article_txt"] != "None"]
collapse_news.shape

(699, 3)

In [23]:
collapse_news.head()

Unnamed: 0,title,url,article_txt
0,The US is a Shithole Country,https://www.reddit.com/r/collapse/comments/gv7...,I’m so mad right now. I have so much loathing ...
1,This is a class war,https://www.reddit.com/r/collapse/comments/h8f...,"Reposted again. Remember children, hug and kis..."
16,"The World’s 2,000 Billionaires Have More Wealt...",https://finance.yahoo.com/news/world-2-000-bil...,"Wealth inequality is nothing new, but it’s rea..."
27,How humanity solves problems,https://v.redd.it/rixf4tkh3p231,Discussion regarding the potential collapse of...
29,1 in 4 Childless Adults Say Climate Change Has...,https://morningconsult.com/2020/09/28/adults-c...,11% of childless adults say climate change is ...


As mentioned above, the dataframe still contains many rows that are not useful, e.g. URLs pointing to reddit.com.

In [47]:
# Delete the rows whose URL contains any of these substrings
# (images, videos, social media, Reddit-internal links, etc.)
unwanted = ['jpg', 'youtu', 'png', 'amazon', 'twitter', '/r/', 'imgur', 'wikipedia',
            'streamable', 'pulse', 'feralatlas', 'thesocialdilemma', 'collapsepod',
            'kisstheground', 'redd', 'medium.com']
for pattern in unwanted:
    # regex=False: plain substring match, same as the `in` operator
    collapse_news = collapse_news[~collapse_news.url.str.contains(pattern, regex=False)]

collapse_news.shape

(466, 3)

In [48]:
# Add one column with the source
collapse_news["source"] = ["CollapseNews" for i in range(collapse_news.shape[0])]

In [49]:
# Add one column with the label (0 = bad, 1 = good)
collapse_news["label"] = [0 for i in range(collapse_news.shape[0])]

In [50]:
# Save the new/cleaned dataframe as csv
collapse_news.to_csv("reddit_news_data/cleaned/reddit_collapse.csv", index=False)

---

#### r/JustGoodNews

In [48]:
df = get_reddit_details('JustGoodNews')

In [50]:
link_list = df.url.tolist()
len(link_list)

1000

In [51]:
get_articles(0,df.shape[0],"JustGoodNews")

# Articles: 1000
# bad: 94
# good: 906


In [51]:
# Load the csv file as dataframe
justgoodnews = pd.read_csv("reddit_news_data/JustGoodNews_dataset_0_1000.csv")
justgoodnews.head()

Unnamed: 0,title,url,article_txt
0,FBI uncovered Russian bribery plot before Obam...,http://thehill.com/policy/national-security/35...,Before the Obama administration approved a con...
1,Keese Love Everyone -- Find him and bring him ...,http://thegatewaypundit.com/2020/08/breaking-4...,The anonymous message board 4Chan has once aga...
2,Muslims hand out thousands of roses at London ...,http://www.standard.co.uk/news/london/muslims-...,ES News email The latest headlines in your inb...
3,"Kendrick Ray Castillo, the 18-year-old who sac...",https://reuters.com/article/us-colorado-shooti...,(The story corrected source in paragraph 11 of...
4,"DNC staffer murdered, was the leak to wikileaks.",http://www.foxnews.com/politics/2017/05/16/sla...,


In [52]:
# Drop None & NaN -> appears if the scraping was not successful
justgoodnews.dropna(inplace=True)
justgoodnews = justgoodnews[justgoodnews["article_txt"] != "None"]
justgoodnews.shape

(893, 3)

In [53]:
# Add one column with the source
justgoodnews["source"] = ["JustGoodNews" for i in range(justgoodnews.shape[0])]

In [54]:
# Add one column with the label (0 = bad, 1 = good)
justgoodnews["label"] = [1 for i in range(justgoodnews.shape[0])]

In [55]:
# Save the new/cleaned dataframe as csv
justgoodnews.to_csv("reddit_news_data/cleaned/reddit_justgoodnews.csv", index=False)

---

#### r/JustBadNews

In [52]:
df = get_reddit_details('JustBadNews')

In [53]:
link_list = df.url.tolist()
len(link_list)

1000

In [54]:
get_articles(0,df.shape[0],"JustBadNews")

# Articles: 1000
# bad: 94
# good: 906


In [56]:
# Load the csv file as dataframe
justbadnews = pd.read_csv("reddit_news_data/JustBadNews_dataset_0_1000.csv")

In [57]:
# Drop None & NaN -> appears if the scraping was not successful
justbadnews.dropna(inplace=True)
justbadnews = justbadnews[justbadnews["article_txt"] != "None"]
justbadnews.shape

(893, 3)

In [58]:
# Add one column with the source
justbadnews["source"] = ["JustBadNews" for i in range(justbadnews.shape[0])]

In [59]:
# Add one column with the label (0 = bad, 1 = good)
justbadnews["label"] = [0 for i in range(justbadnews.shape[0])]

In [60]:
# Save the new/cleaned dataframe as csv
justbadnews.to_csv("reddit_news_data/cleaned/reddit_justbadnews.csv", index=False)