# Intro

In this notebook we try to get the top posts from the Reddits about news. In addition, we also directly store the links to the respective news articles. At the end we try to get all news articles of the different publishers via the urls.

# Imports

In [102]:
import praw
from psaw import PushshiftAPI
import pandas as pd 
import numpy as np 
#pd.set_option('display.max_colwidth', -1)

In [3]:
# Install of the PushshiftAPI
#!pip install psaw

Collecting psaw
  Downloading psaw-0.0.12-py3-none-any.whl (15 kB)
Installing collected packages: psaw
Successfully installed psaw-0.0.12


In [2]:
# Login credentials for praw & Reddit API
r = praw.Reddit(client_id='ddxZYbBilApY5A', client_secret='4rxjgOizdOJlhuyD781bi4tCqH8', user_agent='Henlo')
api = PushshiftAPI(r)

---

# Functions 

With the following function we will get the top 1000 posts from a specific reddit.

In [47]:
def get_reddit_details(reddit_name): 
    top_posts = r.subreddit(reddit_name).top(limit=1000)
    lst = []
    title = []
    for post in top_posts:
        lst.append(post.url)
        title.append(post.title)
    df = pd.DataFrame({'title' : title, 'url' : lst})
    return df

We use newspaper3k to fetch all articles from our dataframe with all urls.

In [49]:
from newspaper import Article
import urllib.request, urllib.error

# download and parse articles
def get_articles(start,end,prefix): 
    good = 0
    bad = 0

    articles_txt_list = []

    for link in link_list[start:end]:
            print("{}".format(df.shape[0]-good+bad),end="\r") 
            if link == "None": 
                articles_txt_list.append("None")
                bad += 1 

            else: 
                try:
                    #conn = urllib.request.urlopen(link)
                    n_article = Article(url=link, fetch_images=False, request_timeout=10, number_threads=15)
                    n_article.download()
                    n_article.parse()
                    articles_txt_list.append(n_article.text)
                    good += 1
                except: 
                    bad +=1
                    articles_txt_list.append("None")

    articles_txt_df = pd.DataFrame({"article_txt": articles_txt_list}) 
    articles_txt_df_con = pd.concat([df[start:end].reset_index(drop=True), articles_txt_df], axis=1)
    articles_txt_df_con.to_csv("reddit_news_data/{}_dataset_{}_{}.csv".format(prefix,start,end), index=False)


    print("# Articles:", len(articles_txt_list))
    print("# bad:", bad)
    print("# good:", good)

---

#### r/UplifitingNews

In this Reddit only good news are shared.

In [None]:
df = get_reddit_details('UpliftingNews')

In [None]:
link_list = df.url.tolist()
len(link_list)

In [26]:
get_articles(0,df.shape[0], "uplifiting_news")

# Articles: 990
# bad: 135
# good: 855


In [41]:
uplifting_news = pd.read_csv("reddit_news_data/uplifitng_news_dataset_0_990.csv")
uplifting_news.head()

Unnamed: 0,title,url,0
0,"Man falsely imprisoned for 10 years, uses pris...",https://www.nbcnews.com/news/us-news/defendant...,Attorney Jarrett Adams recently helped overtur...
1,First paralyzed human treated with stem cells ...,https://educateinspirechange.org/science-techn...,Imagine losing control of your car and waking ...
2,Over a Million People Sign Petition Calling Fo...,https://www.newsweek.com/kkk-petition-terroris...,
3,Hollywood Superstar Keanu Reeves Has Secretly ...,https://www.theepochtimes.com/hollywood-supers...,
4,Amazon tribe wins legal battle against oil com...,https://www.disclose.tv/amazon-tribe-wins-laws...,The Amazon Rainforest is well known across the...


In [42]:
uplifting_news.rename(columns={"0": "article_txt"}, inplace=True)

In [43]:
# Drop None & NaN -> appears if the scraping was not successful
uplifting_news.dropna(inplace=True)
uplifting_news = uplifting_news[uplifting_news["article_txt"] != "None"]
uplifting_news.shape

(834, 3)

In [45]:
uplifting_news.head(20)

Unnamed: 0,title,url,article_txt
0,"Man falsely imprisoned for 10 years, uses pris...",https://www.nbcnews.com/news/us-news/defendant...,Attorney Jarrett Adams recently helped overtur...
1,First paralyzed human treated with stem cells ...,https://educateinspirechange.org/science-techn...,Imagine losing control of your car and waking ...
4,Amazon tribe wins legal battle against oil com...,https://www.disclose.tv/amazon-tribe-wins-laws...,The Amazon Rainforest is well known across the...
6,No children died in traffic accidents in Norwa...,https://www.nrk.no/trondelag/ingen-barn-dode-i...,I 1970 døde det 560 mennesker i den norske tra...
7,President Trump signs animal cruelty bill into...,https://abcnews.go.com/amp/Politics/president-...,President Trump signs animal cruelty bill into...
8,Man finds $24 million lottery ticket in an old...,http://www.cnn.com/2017/10/13/us/lottery-winne...,(CNN) Everyone has that spot in their house or...
9,Police say a teenager who attached uplifting m...,https://www.bbc.com/news/uk-england-tyne-44916409,"""It's just amazing, the response it has had. I..."
10,A Chinese woman whose boy was kidnapped in 198...,https://www.scmp.com/news/china/society/articl...,Mao Yin pictured as a child with his mother Li...
11,13 Year Old Girl nicknamed 'Trash Girl' was re...,https://www.edp24.co.uk/news/environment/norwi...,Video\n\nTrash Girl spends the day at the EDP ...
12,JetBlue caps direct flight ticket prices out o...,https://finance.yahoo.com/news/jetblue-caps-ti...,REUTERS/Mike Blake\n\nJetBlue (JBLU) is cappin...


Looks good.

In [46]:
# Safe the new/cleaned dataframe as csv 
uplifting_news.to_csv("reddit_news_data/cleaned/reddit_upliftingnews.csv", index=False)

 ---

#### r/Collapse

In [6]:
df = get_reddit_details('collapse')

In [7]:
link_list = df.url.tolist()
len(link_list)

998

In [21]:
get_articles(0,df.shape[0],"collapse")

# Articles: 998
# bad: 26
# good: 972


In [8]:
df.url[0:30]

0     https://www.reddit.com/r/collapse/comments/gv7...
1     https://www.reddit.com/r/collapse/comments/h8f...
2                   https://i.redd.it/nxc4kdfc1d251.jpg
3                   https://i.redd.it/98fyhfza47841.jpg
4                   https://i.redd.it/24nnkbjs69p41.png
5                   https://i.redd.it/ahigharrhhm51.jpg
6     https://preview.redd.it/pmdknot1c8t41.jpg?widt...
7                   https://i.redd.it/5yppxpdkemd51.jpg
8                   https://i.redd.it/oddkv702f6d51.jpg
9                   https://i.redd.it/o02xf18i2h151.jpg
10                  https://i.redd.it/psigl5fv7q351.jpg
11                  https://i.redd.it/am5djw8rk7v51.jpg
12                  https://i.redd.it/jwvg2u2sr6f51.png
13                  https://i.redd.it/s3dipqaut2s51.jpg
14                      https://i.imgur.com/8jPnfr2.jpg
15                  https://i.redd.it/492db98qzlm51.jpg
16    https://finance.yahoo.com/news/world-2-000-bil...
17    https://preview.redd.it/qzt1mydo9lg51.jpg?

We can see that there are many urls which are note useful. 
We delete them manually.

In [22]:
# Load the csv file as dataframe
collapse_news = pd.read_csv("reddit_news_data/collapse_dataset_0_998.csv")
collapse_news.head()

Unnamed: 0,title,url,article_txt
0,The US is a Shithole Country,https://www.reddit.com/r/collapse/comments/gv7...,I’m so mad right now. I have so much loathing ...
1,This is a class war,https://www.reddit.com/r/collapse/comments/h8f...,"Reposted again. Remember children, hug and kis..."
2,US Senator Tom Cotton calls for the military t...,https://i.redd.it/nxc4kdfc1d251.jpg,
3,Interesting Times,https://i.redd.it/98fyhfza47841.jpg,
4,Put into perspective,https://i.redd.it/24nnkbjs69p41.png,


In [23]:
collapse_news.shape

(998, 3)

In [24]:
# Drop None & NaN -> appears if the scraping was not successful
collapse_news.dropna(inplace=True)
collapse_news = collapse_news[collapse_news["article_txt"] != "None"]
collapse_news.shape

(699, 3)

In [27]:
collapse_news.head()

Unnamed: 0,title,url,article_txt
0,The US is a Shithole Country,https://www.reddit.com/r/collapse/comments/gv7...,I’m so mad right now. I have so much loathing ...
1,This is a class war,https://www.reddit.com/r/collapse/comments/h8f...,"Reposted again. Remember children, hug and kis..."
16,"The World’s 2,000 Billionaires Have More Wealt...",https://finance.yahoo.com/news/world-2-000-bil...,"Wealth inequality is nothing new, but it’s rea..."
27,How humanity solves problems,https://v.redd.it/rixf4tkh3p231,Discussion regarding the potential collapse of...
29,1 in 4 Childless Adults Say Climate Change Has...,https://morningconsult.com/2020/09/28/adults-c...,11% of childless adults say climate change is ...


As I mentioned before there a lot of stuff still in the dataframe which is not useful, i.e urls from reddit.com

In [29]:
# Delete the rows
collapse_news = collapse_news[[True if 'jpg' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'youtu' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'png' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'amazon' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'twitter' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if '/r/' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'imgur' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'wikipedia' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'streamable' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'pulse' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'feralatlas' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'thesocialdilemma' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'collapsepod' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'kisstheground' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'redd' not in i else False for i in collapse_news.url.values]]
collapse_news = collapse_news[[True if 'medium.com' not in i else False for i in collapse_news.url.values]]

collapse_news.shape

(466, 3)

In [30]:
collapse_news.head()

Unnamed: 0,title,url,article_txt
16,"The World’s 2,000 Billionaires Have More Wealt...",https://finance.yahoo.com/news/world-2-000-bil...,"Wealth inequality is nothing new, but it’s rea..."
29,1 in 4 Childless Adults Say Climate Change Has...,https://morningconsult.com/2020/09/28/adults-c...,11% of childless adults say climate change is ...
38,America Is Too Broken to Fight the Coronavirus...,https://www.nytimes.com/2020/06/22/opinion/us-...,When coronavirus cases started exploding on th...
42,The Brazilian president has fired the head of ...,https://www.climatechangenews.com/2019/08/05/b...,The Brazilian president forced a top science o...
57,Coronavirus has led to the largest ever drop i...,https://www.carbonbrief.org/analysis-coronavir...,The global coronavirus pandemic continues to u...


In [None]:
# Safe the new/cleaned dataframe as csv 
collapse_news.to_csv("reddit_news_data/cleaned/reddit_collapse.csv", index=False)

 ---

#### r/JustGoodNews

In [48]:
df = get_reddit_details('JustGoodNews')

In [50]:
link_list = df.url.tolist()
len(link_list)

1000

In [51]:
get_articles(0,df.shape[0],"JustGoodNews")

6960



# Articles: 1000
# bad: 94
# good: 906


In [55]:
# Load the csv file as dataframe
justgoodnews = pd.read_csv("reddit_news_data/justgoodnews_dataset_0_1000.csv")
justgoodnews.head()

Unnamed: 0,title,url,article_txt
0,FBI uncovered Russian bribery plot before Obam...,http://thehill.com/policy/national-security/35...,Before the Obama administration approved a con...
1,Keese Love Everyone -- Find him and bring him ...,http://thegatewaypundit.com/2020/08/breaking-4...,The anonymous message board 4Chan has once aga...
2,Muslims hand out thousands of roses at London ...,http://www.standard.co.uk/news/london/muslims-...,ES News email The latest headlines in your inb...
3,"Kendrick Ray Castillo, the 18-year-old who sac...",https://reuters.com/article/us-colorado-shooti...,(The story corrected source in paragraph 11 of...
4,"DNC staffer murdered, was the leak to wikileaks.",http://www.foxnews.com/politics/2017/05/16/sla...,


In [56]:
# Drop None & NaN -> appears if the scraping was not successful
justgoodnews.dropna(inplace=True)
justgoodnews = justgoodnews[justgoodnews["article_txt"] != "None"]
justgoodnews.shape

(893, 3)

In [None]:
# Safe the new/cleaned dataframe as csv 
justgoodnews.to_csv("reddit_news_data/cleaned/reddit_justgoodnews.csv", index=False)

---

#### r/JustBadNews

In [52]:
df = get_reddit_details('JustBadNews')

In [53]:
link_list = df.url.tolist()
len(link_list)

1000

In [54]:
get_articles(0,df.shape[0],"JustBadNews")

# Articles: 1000
# bad: 94
# good: 906


In [103]:
# Load the csv file as dataframe
justbadnews = pd.read_csv("reddit_news_data/justbadnews_dataset_0_1000.csv")

In [104]:
justbadnews[["article_txt"]].iloc[0]

article_txt    Join the conversation\n\nYou can post now and register later. If you have an account, sign in now to post with your account.
Name: 0, dtype: object

In [105]:
justbadnews.head()

Unnamed: 0,title,url,article_txt
0,"QM Can Be Seen From This Angle, a conversation with an angry person.",http://scienceforums.com/topic/36429-could-qm-be-seen-from-this-angle-a-conversation-with-an-angry-person/,"Join the conversation\n\nYou can post now and register later. If you have an account, sign in now to post with your account."
1,Senior US prosecutor Bharara fired 'after refusing Trump call',http://www.bbc.com/news/world-us-canada-40243184,"""The number of times I would have been expected to be called by the president of the United States would be zero because there has to be some kind of arm's-length relationship given the jurisdiction that various people had."""
2,Trump revealed highly classified information to Russian foreign minister and ambassador,https://www.washingtonpost.com/world/national-security/trump-revealed-highly-classified-information-to-russian-foreign-minister-and-ambassador/2017/05/15/530c172a-3960-11e7-9e48-c4f199710b69_story.html?utm_term=.5f5763864490,"Please enable cookies on your web browser in order to continue.\n\nThe new European data protection law requires us to inform you of the following before you use our website:\n\nWe use cookies and other technologies to customize your experience, perform analytics and deliver personalized advertising on our sites, apps and newsletters and across the Internet based on your interests. By clicking “I agree” below, you consent to the use by us and our third-party partners of cookies and data gathered from your use of our platforms. See our Privacy Policy and Third Party Partners to learn more about the use of data and your rights. You also agree to our Terms of Service."
3,"Woman accuses Al Franken of kissing, groping her without consent | TheHill",http://thehill.com/homenews/news/360656-woman-accuses-al-franken-of-kissing-groping-her-without-consent,"A TV host and sports broadcaster on Thursday accused Sen. Al Franken Alan (Al) Stuart FrankenTina Smith and Jason Lewis tied in Minnesota Ted Cruz mocks Al Franken over 'I Hate Ted Cruz Pint Glass' GOP Senate candidate says Trump, Republicans will surprise in Minnesota MORE (D-Minn.) of kissing and groping her without her consent in 2006.\n\nLeeann Tweeden accused Franken of groping her, without her consent, while she was asleep and provided a photo as evidence.\n\nKABC anchor: Senator Al Franken Kissed and Groped Me Without My Consent, And There’s Nothing Funny About It https://t.co/lG4A1ZTUhC pic.twitter.com/EYIzr9ok2s — Meridith McGraw (@meridithmcgraw) November 16, 2017\n\nThe incident happened in December 2006, she said, when she and Franken, then a comedian, were on a USO tour to ""entertain our troops.""\n\nFranken in a statement apologized for his actions.\n\n\n\nADVERTISEMENT\n\n""It wasn’t until I was back in the U.S. and looking through the CD of photos we were given by the photographer that I saw this one,"" Tweeden wrote about the photo on KABC.\n\n""I felt violated all over again. Embarrassed. Belittled. Humiliated,"" she wrote. ""How dare anyone grab my breasts like this and think it’s funny?""\n\nTweeden wrote that Franken, who was the headliner on the tour, had written some skits and told her he had a part for her, adding that she ""agreed to play along.""\n\n""When I saw the script, Franken had written a moment when his character comes at me for a ‘kiss’. I suspected what he was after, but I figured I could turn my head at the last minute, or put my hand over his mouth, to get more laughs from the crowd,"" she wrote.\n\n""On the day of the show Franken and I were alone backstage going over our lines one last time. He said to me, 'We need to rehearse the kiss.' I laughed and ignored him. Then he said it again. I said something like, ‘Relax Al, this isn’t SNL…we don’t need to rehearse the kiss.’""\n\nShe wrote Franken continued to insist and she got ""uncomfortable.""\n\n""He repeated that actors really need to rehearse everything and that we must practice the kiss. I said ‘OK’ so he would stop badgering me. We did the line leading up to the kiss and then he came at me, put his hand on the back of my head, mashed his lips against mine and aggressively stuck his tongue in my mouth,"" she wrote.\n\n""I immediately pushed him away with both of my hands against his chest and told him if he ever did that to me again I wouldn’t be so nice about it the next time.""\n\nShe said she felt disgusted and violated. She added that she didn't tell anyone what happened at the time, but she was ""angry.""\n\nTweeden also said she fell asleep while on the plane back to the U.S. She said is wasn't until she saw the photo of Franken groping her that she told her husband what had happened and showed him the photo.\n\n""I wanted to shout my story to the world with a megaphone to anyone who would listen, but even as angry as I was, I was worried about the potential backlash and damage going public might have on my career as a broadcaster,"" she wrote.\n\n""But that was then, this is now. I’m no longer afraid.""\n\nShe said she is sharing her story because there may be others.\n\n""I want the days of silence to be over forever,"" she wrote.\n\nTweeden also issued a direct statement to Franken.\n\n""Senator Franken, you wrote the script. But there’s nothing funny about sexual assault,"" she wrote.\n\n""You knew exactly what you were doing. You forcibly kissed me without my consent, grabbed my breasts while I was sleeping and had someone take a photo of you doing it, knowing I would see it later, and be ashamed.""\n\nFranken said he does not remember the incident the same way but acknowledged the photo ""wasn't"" funny.\n\n“I certainly don’t remember the rehearsal for the skit in the same way, but I send my sincerest apologies to Leeann. As to the photo, it was clearly intended to be funny but wasn't. I shouldn't have done it,"" he said in a statement.\n\nTweeden's comments come amid increased reports of sexual harassment in the workplace, including on Capitol Hill. Multiple women in Congress have come forward to say they have been victims of sexual harassment, prompting lawmakers to call for reform.\n\nRep. Jackie Speier (D-Calif.) said Tuesday the House has paid out $15 million in harassment settlements over more than a decade, though a spokesperson later clarified that figure does not only account for sexual harassment claims.\n\nShe said at a Tuesday hearing that two current members of Congress, one Republican and one Democrat, have been accused of sexual harassment.\n\nHer comments come after multiple women have accused GOP Senate candidate Roy Moore of sexual misconduct. Moore is facing growing calls from top Republicans to step aside in the Alabama race, though he has indicated he plans to continue running.\n\nThis report was updated at 11:07 a.m."
4,False flag attack that would frame refugees foiled in Germany.,http://www.independent.co.uk/news/world/europe/german-soldier-syria-refugee-false-flag-terror-attack-posing-arrested-frankfurt-france-bavaria-a7705231.html,"A German soldier found posing as a Syrian refugee has been arrested for allegedly planning a “false flag” shooting attack that would be blamed on asylum seekers.\n\nThe unidentified soldier was detained when he went to retrieve a loaded pistol he had hidden in a bathroom at Vienna International Airport.\n\nThe public prosecutor’s office in Frankfurt said the 28-year-old is suspected of planning a serious “state-threatening act of violence”, fraud and violating firearms laws.\n\nMore than 90 German police officers have worked alongside Austrian and French security forces to search 16 locations across three countries on Wednesday, when a suspected accomplice was arrested in Bavaria.\n\nThe suspect was stationed at Illkirch-Graffenstaden, eastern France. (AFP/Getty Images) (AFP)\n\nInvestigations have revealed that the Bundeswehr lieutenant was stationed at Illkirch-Graffenstaden in France before registering as a refugee back in Germany.\n\nHe gave false information to authorities in Giessen, Hesse, on 30 December 2015 – as Germany was overwhelmed by the arrival of almost a million asylum seekers.\n\nPosing as a Syrian refugee but reportedly speaking in French, rather than Arabic, the man submitted an asylum application at Zirndorf in Bavaria in January last year.\n\n“As a result, he was given shelter in a refugee home and has received monthly financial benefits under this false identity,” the Frankfurt prosecutor’s office said.\n\n“These findings, as well as other evidence, point towards a xenophobic motive for the soldier’s suspected plan to commit an attack using a weapon deposited at Vienna airport.”\n\nIf his plan had succeeded, his fingerprints would have registered on the refugee records system and led investigators to his false identity as a Syrian asylum seeker, turning fresh scrutiny on migrants in Germany.\n\nRefugees settle in Germany Show all 12 1 /12 Refugees settle in Germany Refugees settle in Germany Germany Mohamed Zayat, a refugee from Syria, plays with his daughter Ranim, who is nearly 3, in the one room they and Mohamed's wife Laloosh call home at an asylum-seekers' shelter in Vossberg village on October 9, 2015 in Letschin, Germany. The Zayats arrived approximately two months ago after trekking through Turkey, Greece and the Balkans and are now waiting for local authorities to process their asylum application, after which they will be allowed to live independently and settle elsewhere in Germany. Approximately 60 asylum-seekers, mostly from Syria, Chechnya and Somalia, live at the Vossberg shelter, which is run by the Arbeiter-Samariter Bund (ASB) charity 2015 Getty Images Refugees settle in Germany Germany A refugee child Amnat Musayeva points to a star with her photo and name that decorates the door to her classroom as teacher Martina Fischer looks on at the local kindergarten Amnat and her siblings attend on October 9, 2015 in Letschin, Germany. The children live with their family at an asylum-seekers' shelter in nearby Vossberg village and are waiting for local authorities to process their asylum applications. Approximately 60 asylum-seekers, mostly from Syria, Chechnya and Somalia, live at the Vossberg shelter, which is run by the Arbeiter-Samariter Bund (ASB) charity Getty Images Refugees settle in Germany Germany Kurdish Syrian asylum-applicant Mohamed Ali Hussein (R), 19, and fellow applicant Autur, from Latvia, load benches onto a truckbed while performing community service, for which they receive a small allowance, in Wilhelmsaue village on October 9, 2015 near Letschin, Germany. Mohamed and Autur live at an asylum-applicants' shelter in nearby Vossberg village. Approximately 60 asylum-seekers, mostly from Syria, Chechnya and Somalia, live at the Vossberg shelter, which is run by the Arbeiter-Samariter Bund (ASB) charity Getty Images Refugees settle in Germany Germany Mohamed Ali Hussein ((L), 19, and his cousin Sinjar Hussein, 34, sweep leaves at a cemetery in Gieshof village, for which they receive a small allowance, near Letschin Getty Images Refugees settle in Germany Germany Mohamed Zayat, a refugee from Syria, looks among donated clothing in the basement of the asylum-seekers' shelter that is home to Mohamed, his wife Laloosh and their daughter Ranim as residents' laundry dries behind in Vossberg village on October 9, 2015 in Letschin, Germany. The Zayats arrived approximately two months ago after trekking through Turkey, Greece and the Balkans and are now waiting for local authorities to process their asylum application, after which they will be allowed to live independently and settle elsewhere in Germany Getty Images Refugees settle in Germany Germany Asya Sugaipova (L), Mohza Mukayeva and Khadra Zhukova prepare food in the communal kitchen at the asylum-seekers' shelter that is their home in Vossberg village in Letschin Getty Images Refugees settle in Germany Germany Efrah Abdullahi Ahmed looks down from the communal kitchen window at her daughter Sumaya, 10, who had just returned from school, at the asylum-seekers' shelter that is their home in Vossberg Getty Images Refugees settle in Germany Germany Asylum-applicants, including Syrians Mohamed Ali Hussein (C-R, in black jacket) and Fadi Almasalmeh (C), return from grocery shopping with other refugees to the asylum-applicants' shelter that is their home in Vossberg village in Letschin Getty Images Refugees settle in Germany Germany Mohamed Zayat (2nd from L), a refugee from Syria, smokes a cigarette after shopping for groceries with his daughter Ranim, who is nearly 3, and fellow-Syrian refugees Mohamed Ali Hussein (C) and Fadi Almasalmeh (L) at a local supermarket on October 9, 2015 in Letschin, Germany. All of them live at an asylum-seekers' shelter in nearby Vossberg village and are waiting for local authorities to process their asylum applications, after which they will be allowed to live independently and settle elsewhere in Germany 2015 Getty Images Refugees settle in Germany Germany Kurdish Syrian refugees Leila, 9, carries her sister Avin, 1, in the backyard at the asylum-seekers' shelter that is home to them and their family in Vossberg village in Letschin Getty Images Refugees settle in Germany Germany Somali refugees and husband and wife Said Ahmed Gure (R) and Ayaan Gure pose with their infant son Muzammili, who was born in Germany, in the room they share at an asylum-seekers' shelter in Vossberg village on October 9, 2015 in Letschin, Germany. Approximately 60 asylum-seekers, mostly from Syria, Chechnya and Somalia, live at the Vossberg shelter, which is run by the Arbeiter-Samariter Bund (ASB) charity, and are waiting for authorities to process their application for asylum 2015 Getty Images Refugees settle in Germany Germany German Chancellor Angela Merkel pauses for a selfie with a refugee after she visited the AWO Refugium Askanierring shelter for refugees in Berlin Getty Images\n\nIsis has previously used a similar ploy, giving its militants fake Syrian passports that were found at the scene of the Paris attacks.\n\nThe man’s suspected accomplice, a 24-year-old student, was arrested in Hammelburg for alleged involvement in the plot.\n\nPolice have searched the homes of the two suspects as well as their friends and workplaces, with detectives seizing “extensive material” including mobile phones, laptops and documents.\n\nProsecutors said the soldier had no permission for the 7.65mm pistol stashed in Vienna, while illegal weapons were also found at his accomplice’s house.\n\nBoth men remain in custody in Frankfurt as the probe continues.\n\nThe soldier was arrested days after prosecutors revealed that the man who orchestrated the Dortmund bus bombings had attempted to frame Isis to make money on shares.\n\nPolice chief says Germany 'on high alert' after attack\n\nSergej W, a dual German-Russian national, detonated three bombs targeting a bus carrying the Borussia Dortmund football team, seriously injuring one player on 11 April.\n\nHe left misspelled letters at the scene claiming the attack was retaliation for German military intervention against Isis, but investigations found he was not an Islamist but a trader planning to profit from short-selling shares.\n\nA series of Isis-inspired terror attacks and plots in Germany have raised tensions leading into September’s federal elections, where Angela Merkel is battling to win a fourth term as Chancellor.\n\nRight-wing groups have blamed her decision to open borders to refugees in 2015, while extremists have launched hundreds of attacks on asylum seekers’ accommodation.\n\nAt least two neo-Nazi terror plots have been uncovered, while security services have cracked down on the anti-government Reichsbürger movement after one of its members killed a police officer."


We can see that in row 2 is an error because there is a message about cookie enabling. In row 0 we can also see an error -> "If you have an account, sign in now to post with your account". 
So we have to be careful in the further process and do more data cleaning.

In [106]:
# We can try to find the articles_txts where the cookie message appears
l = []
for a in justbadnews.article_txt:
    if "cookies" in a: 
        #print(a)
        #print("---"*30)
        l.append(a)  
print(len(l))


TypeError: argument of type 'float' is not iterable

In [84]:
justbadnews = justbadnews[[True if 'cookies' not in i else False for i in justbadnews.article_txt.values]]
justbadnews.shape

(865, 3)

In [100]:
l = []
for a in justbadnews.article_txt:
    if "sign in" in a: 
        #print(a)
        #print("---"*30)
        l.append(a)  
print(len(l))

TypeError: argument of type 'float' is not iterable

In [None]:
justbadnews = justbadnews[[True if 'sign in' not in i else False for i in justbadnews.article_txt.values]]
justbadnews.shape

In [60]:
# Drop None & NaN -> appears if the scraping was not successful
justbadnews.dropna(inplace=True)
justbadnews = justbadnews[justbadnews["article_txt"] != "None"]
justbadnews.shape

(893, 3)

In [None]:
# Safe the new/cleaned dataframe as csv 
justbadnews.to_csv("reddit_news_data/cleaned/reddit_justbadnews.csv", index=False)