# **News Generator Term Project**

**Swee Loke** | SCS3546 (Deep Learning) | University of Toronto | Dec 1, 2020

This project runs a series of experiments to decide how to construct the datasets and which parameters to use when training a news text generation model.

Data sources:

* Kaggle ("clmentbisaillon/fake-and-real-news-dataset")
* Kaggle ("rmisra/news-category-dataset")

# **Part 1 - Preprocessing the data files**

This notebook preprocesses the data files (the preprocessing of the actual train/validation datasets is presented in the next notebook).

It downloads the dataset files (CSV and JSON format), merges the news title and news body columns into a single line per article, and writes the combined text to text files (for training in the next notebook).

We want to separate the construction of the text files (from any dataset) from the actual training and experiments. The model takes only plain text files as input, so it can easily be changed to accept different text files.

In [None]:
# First we must mount google drive 
import os.path
from google.colab import files
from google.colab import drive
GDRIVE_BASE_PATH = '/content/gdrive'
drive.mount(GDRIVE_BASE_PATH, force_remount=True)

# Set the project (`HOME_DIR`) and data (`DATA_DIR`) directories on Google Drive
HOME_DIR = f'{GDRIVE_BASE_PATH}/My Drive/Colab Notebooks/NewsGenerator/'
DATA_DIR = f'{GDRIVE_BASE_PATH}/My Drive/Colab Notebooks/data/'
# Change to the project home directory
% cd '$HOME_DIR'

Mounted at /content/gdrive
/content/gdrive/My Drive/Colab Notebooks/NewsGenerator


In [None]:
import pandas as pd
import numpy as np

# Downloading the dataset

- [x] Set up Kaggle command line client
- [x] Download dataset from Kaggle
- [x] Decompress the files into the expected directories

In [None]:
# Set up the Kaggle API token so we can download the datasets directly from Kaggle
if not os.path.isfile(os.path.expanduser('~/.kaggle/kaggle.json')):
    print('Could not find ~/.kaggle/kaggle.json to download the dataset. Please do the following steps:')
    print('  1. Go to your account on kaggle')
    print('  2. Scroll to API section and click on `Expire API Token` to remove previous tokens')
    print('  3. Click on `Create New API Token`. It will download `kaggle.json` file on your machine')
    print('  4. Upload your kaggle.json in the box below')

    % cd /content
    files.upload()

    ! mkdir -p ~/.kaggle
    ! cp kaggle.json ~/.kaggle/
    ! chmod 600 ~/.kaggle/kaggle.json
    ! pip install -q kaggle
else:
  print("Ready for Kaggle download")

Ready for Kaggle download


We compare two datasets in this project.

* The first is the News Category dataset. Each item consists of just a headline and a brief description. We refer to this set as short news.

* The second is the Fake and Real News dataset. Each item has a title and the full news text. We refer to this set as long news.

We also download the GloVe (Global Vectors for Word Representation) pre-trained word embeddings.

In [None]:
# This function downloads a dataset from Kaggle.
# `category` specifies whether it comes from "competitions" or "datasets".
def download_kaggle_dataset(category, dataset_name):
  print("download_kaggle_dataset", category, dataset_name)

  % cd $DATA_DIR
  ! kaggle $category download $dataset_name 
  ! unzip -o -q '*.zip'
  ! rm *.zip
  ! ls 


# SHORT NEWS:
# If we don't have this json file, download this from kaggle 
if os.path.exists(f"{DATA_DIR}/News_Category_Dataset_v2.json"):
    print("short news dataset exists, no need to download")
else:
    dataset = "rmisra/news-category-dataset"
    # ! kaggle datasets list -s $dataset
    download_kaggle_dataset("datasets", dataset)

# LONG NEWS: 
if os.path.exists(f"{DATA_DIR}/True.csv") and os.path.exists(f"{DATA_DIR}/Fake.csv"):
    print("long news dataset exists, no need to download")
else:
    dataset = "clmentbisaillon/fake-and-real-news-dataset"
    # ! kaggle datasets list -s $dataset
    download_kaggle_dataset("datasets", dataset)

# We also download the GloVe (Global Vectors for Word Representation) word embeddings
if os.path.exists(f"{DATA_DIR}/glove.6B.50d.txt"):
    print("glove exists, no need to download")
else:
    # Download and unzip inside DATA_DIR so the existence check above finds the files next time
    % cd $DATA_DIR
    !wget http://nlp.stanford.edu/data/glove.6B.zip
    !unzip -q glove.6B.zip

short news dataset exists, no need to download
long news dataset exists, no need to download
glove exists, no need to download
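
The download cell above repeats the same "exists, otherwise download" check for each dataset. As a minimal sketch (the helper name `missing_files` and the temporary demo directory are illustrative assumptions, not part of the original notebook), the check could be expressed once:

```python
from pathlib import Path
import tempfile

def missing_files(data_dir, filenames):
    """Return the names in `filenames` that are not present in `data_dir`."""
    base = Path(data_dir)
    return [name for name in filenames if not (base / name).is_file()]

# Demo on a throwaway directory that contains only True.csv
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "True.csv").touch()

still_needed = missing_files(demo_dir, ["True.csv", "Fake.csv"])
print(still_needed)  # ['Fake.csv']
```

A download would then be triggered only when the returned list is non-empty.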


In [None]:
!ls  '$DATA_DIR'

combined_news_text.txt	glove.6B.300d.txt	       short_news_text.txt
Fake.csv		glove.6B.50d.txt	       True.csv
fake_news_text.txt	News_Category_Dataset_v2.json  true_news_text.txt
glove.6B.100d.txt	outfile.txt
glove.6B.200d.txt	shakespeare.txt


# **Preprocess the dataset files**

In this part we extract the title and text of each news item from the dataset files (CSV for the long news, JSON for the short news) and save them to plain text files.

We separate this step so that any dataset can be used to construct the text files; the training part uses only the text files to generate the run-time datasets.
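
To illustrate the plain-text contract between this notebook and the training notebook, here is a minimal round-trip sketch. The helper names `write_lines` and `read_lines`, the file name, and the sample strings are hypothetical; collapsing internal whitespace is an extra assumption so that each article occupies exactly one line.

```python
import os
import tempfile

def write_lines(path, texts):
    """Write one item per line, collapsing internal whitespace and newlines."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(" ".join(str(text).split()) + "\n")

def read_lines(path):
    """Read the file back as a list of non-empty lines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]

# Illustrative articles; the first one contains an embedded newline
articles = ["Title one. Body of the first\narticle.", "Title two. Body two."]
path = os.path.join(tempfile.mkdtemp(), "example_news_text.txt")

write_lines(path, articles)
loaded = read_lines(path)
print(loaded)
```

Any dataset that can be reduced to such a file can then be fed to the training step without changing the model code.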

# Long News

First we preprocess the long news dataset. The original dataset labels each article as true or fake news, but for training our text generator we only use the true news (which is already a lot of data).

We also create a combined text file that merges both the true and fake news, in case we want a bigger dataset.
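
When merging two sets of lines held in pandas Series, note that `+` joins the strings element-wise by index, while `pd.concat` appends one Series after the other. A small sketch with made-up strings (not taken from the datasets) shows the difference:

```python
import pandas as pd

true_lines = pd.Series(["true A", "true B"])
fake_lines = pd.Series(["fake A", "fake B", "fake C"])

# `+` aligns on the index and concatenates the strings pairwise;
# positions that exist in only one Series become missing values.
elementwise = true_lines + fake_lines

# pd.concat appends the rows, which is what a combined corpus needs.
appended = pd.concat([true_lines, fake_lines], ignore_index=True)

print(len(elementwise), len(appended))
```

For a combined corpus with one article per line, the appended form is the one to write to disk.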

In [None]:
pd.set_option('max_colwidth', 400)

In [None]:
# The True news CSV includes the source of the news at the start of the text; we remove it and keep only the news text.
csv_file1 = DATA_DIR+"True.csv"
true_news = pd.read_csv(csv_file1)
true_news = true_news[['title', 'text']]

# Remove source in the text
true_news['text'] = true_news['text'].str.partition("-")[2]
true_news.head(5)

Unnamed: 0,title,text
0,"As U.S. budget fight looms, Republicans flip their fiscal script","The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal s..."
1,U.S. military to accept transgender recruits on Monday: Pentagon,"Transgender people will be allowed for the first time to enlist in the U.S. military starting on Monday as ordered by federal courts, the Pentagon said on Friday, after President Donald Trump’s administration decided not to appeal rulings that blocked his transgender ban. Two federal appeals courts, one in Washington and one in Virginia, last week rejected the administration’s request to put ..."
2,Senior U.S. Republican senator: 'Let Mr. Mueller do his job',"The special counsel investigation of links between Russia and President Trump’s 2016 election campaign should continue without interference in 2018, despite calls from some Trump administration allies and Republican lawmakers to shut it down, a prominent Republican senator said on Sunday. Lindsey Graham, who serves on the Senate armed forces and judiciary committees, said Department of Justic..."
3,FBI Russia probe helped by Australian diplomat tip-off: NYT,"Trump campaign adviser George Papadopoulos told an Australian diplomat in May 2016 that Russia had political dirt on Democratic presidential candidate Hillary Clinton, the New York Times reported on Saturday. The conversation between Papadopoulos and the diplomat, Alexander Downer, in London was a driving factor behind the FBI’s decision to open a counter-intelligence investigation of Moscow’..."
4,Trump wants Postal Service to charge 'much more' for Amazon shipments,"President Donald Trump called on the U.S. Postal Service on Friday to charge “much more” to ship packages for Amazon (AMZN.O), picking another fight with an online retail giant he has criticized in the past. “Why is the United States Post Office, which is losing many billions of dollars a year, while charging Amazon and others so little to deliver their packages, making Amazon richer and ..."
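
The source-stripping step above uses `str.partition("-")`, which splits at the first hyphen in the text. A hedged sketch of how that behaves on edge cases, assuming the source prefix has the form `CITY (Reuters) - ` (the sample strings are illustrative, and the regular expression is an alternative suggestion, not the notebook's method):

```python
import pandas as pd

sample = pd.Series([
    "WASHINGTON (Reuters) - The head of a faction urged restraint.",
    "WINSTON-SALEM, N.C. (Reuters) - A court ruled on Monday.",
    "No dateline in this article.",
])

# Split at the first hyphen and keep everything after it.
# A hyphen inside the city name cuts too early, and text
# without any hyphen becomes an empty string.
by_partition = sample.str.partition("-")[2]

# Alternative: drop everything up to and including "(Reuters) - ",
# leaving text without that prefix untouched.
by_regex = sample.str.replace(r"^.*?\(Reuters\)\s*-\s*", "", regex=True)

print(by_partition.tolist())
print(by_regex.tolist())
```

Whether the stricter pattern is worth it depends on how many articles have hyphenated datelines or no source prefix at all.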


In [None]:
true_news.tail(5)

Unnamed: 0,title,text
21412,'Fully committed' NATO backs new U.S. approach on Afghanistan,"NATO allies on Tuesday welcomed President Donald Trump s decision to commit more forces to Afghanistan, as part of a new U.S. strategy he said would require more troops and funding from America s partners. Having run for the White House last year on a pledge to withdraw swiftly from Afghanistan, Trump reversed course on Monday and promised a stepped-up military campaign against Taliban insur..."
21413,LexisNexis withdrew two products from Chinese market,"LexisNexis, a provider of legal, regulatory and business information, said on Tuesday it had withdrawn two products from the Chinese market in March this year after it was asked to remove some content. The issue of academic freedom in China hit the headlines this week after the leading British academic publisher, Cambridge University Press, said it had complied with a request to block onlin..."
21414,Minsk cultural hub becomes haven from authorities,"In the shadow of disused Soviet-era factories in Minsk, a street lined with eclectic bars, art galleries and yoga studios has become a haven from the vigilant eyes of the Belarussian authorities. This place is like an island, said Yegor, 21, who works at popular bar Hooligan. It s the street of freedom. The government of President Alexander Lukashenko, who has ruled Belarus for the past ..."
21415,Vatican upbeat on possibility of Pope Francis visiting Russia,"Vatican Secretary of State Cardinal Pietro Parolin said on Tuesday that there was positive momentum behind the idea of Pope Francis visiting Russia, but suggested there was more work to be done if it were to happen. Parolin, speaking at a joint news conference in Moscow alongside Russian Foreign Minister Sergei Lavrov, did not give any date for such a possible visit. The Eastern and Wester..."
21416,Indonesia to buy $1.14 billion worth of Russian jets,"Indonesia will buy 11 Sukhoi fighter jets worth $1.14 billion from Russia in exchange for cash and Indonesian commodities, two cabinet ministers said on Tuesday. The Southeast Asian country has pledged to ship up to $570 million worth of commodities in addition to cash to pay for the Suhkoi SU-35 fighter jets, which are expected to be delivered in stages starting in two years. Indonesian Trad..."


In [None]:
# Merge the title and text into a single line.
true_news_text = true_news['title'] + '. ' + true_news['text'] + ' '
true_news_text[:5]

0    As U.S. budget fight looms, Republicans flip their fiscal script.  The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows,...
1    U.S. military to accept transgender recruits on Monday: Pentagon.  Transgender people will be allowed for the first time to enlist in the U.S. military starting on Monday as ordered by federal courts, the Pentagon said on Friday, after President Donald Trump’s administration decided not to appeal rulings that blocked his transgender ban. Two federal appeals courts, one in Washington and one in...
2    Senior U.S. Republican senator: 'Let Mr. Mueller do his job'.  The special counsel investigation of links between Russia and President Trump’s 2016 election campaign should continue wit

In [None]:
# The Fake news csv doesn't have the source of the news in the text, so no need to strip that.
csv_file2 = DATA_DIR+"Fake.csv"
fake_news = pd.read_csv(csv_file2)
fake_news = fake_news[['title', 'text']]

fake_news.head(5)

Unnamed: 0,title,text
0,Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing,"Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and the very dishonest fake news media. The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dis..."
1,Drunk Bragging Trump Staffer Started Russian Collusion Investigation,"House Intelligence Committee Chairman Devin Nunes is going to have a bad day. He s been under the assumption, like many of us, that the Christopher Steele-dossier was what prompted the Russia investigation so he s been lashing out at the Department of Justice and the FBI in order to protect Trump. As it happens, the dossier is not what started the investigation, according to documents obtained..."
2,Sheriff David Clarke Becomes An Internet Joke For Threatening To Poke People ‘In The Eye’,"On Friday, it was revealed that former Milwaukee Sheriff David Clarke, who was being considered for Homeland Security Secretary in Donald Trump s administration, has an email scandal of his own.In January, there was a brief run-in on a plane between Clarke and fellow passenger Dan Black, who he later had detained by the police for no reason whatsoever, except that maybe his feelings were hurt...."
3,Trump Is So Obsessed He Even Has Obama’s Name Coded Into His Website (IMAGES),"On Christmas day, Donald Trump announced that he would be back to work the following day, but he is golfing for the fourth day in a row. The former reality show star blasted former President Barack Obama for playing golf and now Trump is on track to outpace the number of golf games his predecessor played.Updated my tracker of Trump s appearances at Trump properties.71 rounds of golf includin..."
4,Pope Francis Just Called Out Donald Trump During His Christmas Speech,Pope Francis used his annual Christmas Day message to rebuke Donald Trump without even mentioning his name. The Pope delivered his message just days after members of the United Nations condemned Trump s move to recognize Jerusalem as the capital of Israel. The Pontiff prayed on Monday for the peaceful coexistence of two states within mutually agreed and internationally recognized borders. We ...


In [None]:
fake_news.tail(5)

Unnamed: 0,title,text
23476,McPain: John McCain Furious That Iran Treated US Sailors Well,"21st Century Wire says As 21WIRE reported earlier this week, the unlikely mishap of two US Naval vessels straying into Iranian waters just hours before the President s State of the Union speech, followed by the usual parade of arch-neocons coming on TV in real time to declare the incident as an act of aggression by Iran against the United States is no mere coincidence.24 hours after th..."
23477,"JUSTICE? Yahoo Settles E-mail Privacy Class-action: $4M for Lawyers, $0 for Users","21st Century Wire says It s a familiar theme. Whenever there is a dispute or a change of law, and two tribes go to war, there is normally only one real winner after the tribulation the lawyers. Ars TechnicaIn late 2013, Yahoo was hit with six lawsuits over its practice of using automated scans of e-mail to produce targeted ads. The cases, which were consolidated in federal court, all argued t..."
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to Take Territorial Booty in Northern Syria,"Patrick Henningsen 21st Century WireRemember when the Obama Administration told the world how it hoped to identify 5,000 reliable non-jihadist moderate rebels hanging out in Turkey and Jordan, who might want to fight for Washington in Syria? After all the drama over its infamous train and equip program to create their own Arab army in Syria, they want to give it another try.This week, Pen..."
23479,How to Blow $700 Million: Al Jazeera America Finally Calls it Quits,"21st Century Wire says Al Jazeera America will go down in history as one of the biggest failures in broadcast media history.Ever since the US and its allies began plotting to overthrow Libya and Syria, Al Jazeera has deteriorated from a promising international news network in 2003 into what it has become in 2016 a full-blown agit prop media shop for the US State Department and the Pentagon..."
23480,10 U.S. Navy Sailors Held by Iranian Military – Signs of a Neocon Political Stunt,"21st Century Wire says As 21WIRE predicted in its new year s look ahead, we have a new hostage crisis underway.Today, Iranian military forces report that two small riverine U.S. Navy boats were seized in Iranian waters, and are currently being held on Iran s Farsi Island in the Persian Gulf. A total of 10 U.S. Navy personnel, nine men and one woman, have been detained by Iranian authorities...."


In [None]:
# Merge the title and text into a single line.
fake_news_text = fake_news['title'] + '. ' + fake_news['text'] + ' '
fake_news_text[:5]

0     Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing. Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and  the very dishonest fake news media.  The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I w...
1     Drunk Bragging Trump Staffer Started Russian Collusion Investigation. House Intelligence Committee Chairman Devin Nunes is going to have a bad day. He s been under the assumption, like many of us, that the Christopher Steele-dossier was what prompted the Russia investigation so he s been lashing out at the Department of Justice and the FBI in order to protect Trump. As it happens, the dossier...
2     Sheriff David Clarke Becomes An Internet Joke For Threatening To Poke People ‘In The Eye’. On Friday, it was revealed that former Milwaukee Sheriff David Clarke, who was being consider

In [None]:
# We then write both sets of news to plain text files (no longer CSV)
def write_dataset_to_file(file_name, text_list):
  with open(DATA_DIR+file_name, "w") as outfile:
      outfile.write("\n".join(str(item) for item in text_list))


write_dataset_to_file("true_news_text.txt", true_news_text)
write_dataset_to_file("fake_news_text.txt", fake_news_text)
# Combine them: append the fake news lines after the true news lines.
# (Adding two Series with `+` would join the strings element-wise by index instead.)
write_dataset_to_file("combined_news_text.txt", pd.concat([true_news_text, fake_news_text], ignore_index=True))

# Short News

Now we process the short news dataset. Each short news item has a headline and a short description, which we combine into a single line.
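
One detail to keep in mind when combining string columns: if either column holds a missing value, the `+` result for that row is also missing. The sketch below uses a toy DataFrame with a deliberately missing description (an assumption for illustration; it says nothing about whether the real dataset contains such rows):

```python
import pandas as pd

toy = pd.DataFrame({
    "headline": ["Headline one", "Headline two"],
    "short_description": ["A short description.", None],
})

# Plain concatenation: the row with a missing description becomes missing.
naive = toy["headline"] + ". " + toy["short_description"] + " "

# Filling missing values with empty strings keeps the headline.
safe = toy["headline"].fillna("") + ". " + toy["short_description"].fillna("") + " "

print(naive.isna().tolist())
print(safe.tolist())
```

With `fillna("")`, a row without a description still produces a line containing its headline.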

In [None]:
json_filename = DATA_DIR + 'News_Category_Dataset_v2.json'
short_news = pd.read_json(json_filename, lines=True)
short_news = short_news[['headline', 'short_description']]
short_news.head(5)

Unnamed: 0,headline,short_description
0,"There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV",She left her husband. He killed their children. Just another day in America.
1,Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song,Of course it has a song.
2,Hugh Grant Marries For The First Time At Age 57,The actor and his longtime girlfriend Anna Eberstein tied the knot in a civil ceremony.
3,Jim Carrey Blasts 'Castrato' Adam Schiff And Democrats In New Artwork,The actor gives Dems an ass-kicking for not fighting hard enough against Donald Trump.
4,Julianna Margulies Uses Donald Trump Poop Bags To Pick Up After Her Dog,"The ""Dietland"" actress said using the bags is a ""really cathartic, therapeutic moment."""


In [None]:
short_news.tail(5)

Unnamed: 0,headline,short_description
200848,RIM CEO Thorsten Heins' 'Significant' Plans For BlackBerry,Verizon Wireless and AT&T are already promoting LTE devices including smartphones and tablets from RIM's rivals. RIM's first
200849,Maria Sharapova Stunned By Victoria Azarenka In Australian Open Final,"Afterward, Azarenka, more effusive with the press than normal, credited her coach of two years, Sam Sumyk, for his patient"
200850,"Giants Over Patriots, Jets Over Colts Among Most Improbable Super Bowl Upsets Of All Time (VIDEOS)","Leading up to Super Bowl XLVI, the most talked about game could end up being one that occurred a few years ago. After all"
200851,Aldon Smith Arrested: 49ers Linebacker Busted For DUI,CORRECTION: An earlier version of this story incorrectly stated the location of KTVU and the 2011 league leader in sacks
200852,Dwight Howard Rips Teammates After Magic Loss To Hornets,The five-time all-star center tore into his teammates Friday night after Orlando committed 23 turnovers en route to losing


In [None]:
# Merge the headline and short_description into a single line.
short_news_text = short_news['headline'] + '. ' + short_news['short_description'] +' '
short_news_text[:5]


0                      There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV. She left her husband. He killed their children. Just another day in America. 
1                                                               Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song. Of course it has a song. 
2                            Hugh Grant Marries For The First Time At Age 57. The actor and his longtime girlfriend Anna Eberstein tied the knot in a civil ceremony. 
3       Jim Carrey Blasts 'Castrato' Adam Schiff And Democrats In New Artwork. The actor gives Dems an ass-kicking for not fighting hard enough against Donald Trump. 
4    Julianna Margulies Uses Donald Trump Poop Bags To Pick Up After Her Dog. The "Dietland" actress said using the bags is a "really cathartic, therapeutic moment." 
dtype: object

In [None]:
!ls '$DATA_DIR'

combined_news_text.txt	glove.6B.300d.txt	       short_news_text.txt
Fake.csv		glove.6B.50d.txt	       True.csv
fake_news_text.txt	News_Category_Dataset_v2.json  true_news_text.txt
glove.6B.100d.txt	outfile.txt
glove.6B.200d.txt	shakespeare.txt


In [None]:
# Write the short news lines to a text file
write_dataset_to_file("short_news_text.txt", short_news_text)