### Fake News Project

By: Mateo Anusic, Emil Thorlund, Lucas A. Rosing, Victor Bergien

### Task #1

- Structure, process and clean the text.
- Tokenize the text.
- Remove stopwords and compute the size of the vocabulary.
- Compute the reduction rate of the vocabulary size after removing stopwords.
- Remove word variations with stemming and compute the size of the vocabulary.
- Compute the reduction rate of the vocabulary size after stemming.

Describe which procedures (and which libraries) you used and why they are appropriate.

In [1]:
### Code ###
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import csv
import requests
from io import StringIO
from itertools import islice


data_url = 'https://raw.githubusercontent.com/several27/FakeNewsCorpus/master/news_sample.csv'

# One-time setup: uncomment to fetch the NLTK resources used below
#nltk.download('punkt')
#nltk.download('stopwords')

response = requests.get(data_url)
response.raise_for_status()  # Raise an exception on HTTP errors

csv_data = response.content.decode('utf-8')
csv_file = StringIO(csv_data)

reader = csv.DictReader(csv_file)

start_row = 100
end_row = 102

subset_rows = list(islice(reader, start_row, end_row))

#for row_number, row in enumerate(subset_rows, start=start_row):
#    print(f"Row {row_number}:")
#    for column_name, cell_value in row.items():
#        print(f"  {column_name}: {cell_value}")
#    print()  # Print an empty line to separate rows



In [2]:
date_pattern = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})|'           # YYYY-MM-DD HH:MM:SS.MMMMMM
                        r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})|'                      # YYYY-MM-DD HH:MM:SS
                        r'(\d{4}-\d{2}-\d{2})|'                                        # YYYY-MM-DD
                        r'(\d{4}\.\d{2}\.\d{2})|'                                      # YYYY.MM.DD 
                        r'(\d{2}\.\d{2}\.\d{4})|'                                      # DD.MM.YYYY
                        r'(\d{4}/\d{2}/\d{2})|'                                        # YYYY/MM/DD
                        r'(\d{2}/\d{2}/\d{4})|'                                        # DD/MM/YYYY
                        r'((january|february|march|april|may|june|july|august|'        # <Month> DD, YYYY
                        r'september|october|november|december) \d{2}, \d{4})', re.IGNORECASE)
number_pattern = re.compile(r'(\d+(?:,\d{3})*(?:\.\d+)?)')
url_pattern = re.compile(r'https?://\S+|www\.\S+|\S+\.com')

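def clean_text_and_tokenize(read):
    read = read.lower()
    read = re.sub(r"\s+", " ", read)              # Collapse runs of whitespace
    read = re.sub(date_pattern, '<DATE>', read)   # Dates first, so their digits are not replaced by <NUM>
    read = re.sub(number_pattern, "<NUM>", read)
    read = re.sub(r"\S+@\S+", "<EMAIL>", read)
    read = re.sub(url_pattern, "<URL>", read)
    # word_tokenize splits each placeholder into '<', 'NUM', '>', so later cells count the bare word
    tokens = word_tokenize(read)
    return tokens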

# Build the stopword set and the stemmer once; rebuilding them on every call is slow on large inputs
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def remove_stopwords(tokens):
    # The NLTK stopword list is lowercase, so tokens are lowercased for the lookup
    return [token for token in tokens if token.lower() not in STOP_WORDS]

def stem_tokens(tokens):
    return [STEMMER.stem(token) for token in tokens]

In [3]:
for row_number, row in enumerate(subset_rows, start=start_row):
    print(f"Row {row_number}:")
    for column_name, cell_value in row.items():
        tokens = clean_text_and_tokenize(cell_value)  # Clean text and tokenize
        no_stopwords_tokens = remove_stopwords(tokens)
        stemmed_tokens = stem_tokens(no_stopwords_tokens)
        cleaned_cell_value = ' '.join(stemmed_tokens)
        print(f"  {column_name}: {cleaned_cell_value}")
    print()


Row 100:
  : < num >
  id: < num >
  domain: < url >
  type: fake
  url: < url >
  content: greenmedinfo – action item link % reader think stori fact . add two cent . headlin : bitcoin & blockchain search exceed trump ! blockchain stock next ! one link greedmedinfo updat incomplet . letter write campaign locat : make fda advisori , mandatori sourc : < url >
  scraped_at: < date >
  inserted_at: < date >
  updated_at: < date >
  title: greenmedinfo – action item link
  authors: downsiz dc
  keywords: 
  meta_keywords: [ `` ]
  meta_description: 
  tags: 
  summary: 

Row 101:
  : < num >
  id: < num >
  domain: < url >
  type: fake
  url: < url >
  content: < num > annoy twitter auto dm headlin : bitcoin & blockchain search exceed trump ! blockchain stock next ! seen “ cheap supplement < num > ” “ regist busi program ’ receiv endless benefits. ” like thought email spam . ’ real exampl spam , twitter . last week wrote < num > worst social media mistak . one mistak annoy auto direct messa

In [4]:
test_line = "This is an example sentence."
print(clean_text_and_tokenize(test_line))
print(remove_stopwords(clean_text_and_tokenize(test_line)))
print(stem_tokens(remove_stopwords(clean_text_and_tokenize(test_line))))

['this', 'is', 'an', 'example', 'sentence', '.']
['example', 'sentence', '.']
['exampl', 'sentenc', '.']
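
The vocabulary sizes and reduction rates asked for in Task #1 can be computed from the three token lists. The following is a minimal sketch, not the notebook's own code: `vocabulary_size` and `reduction_rate` are hypothetical helper names, and the token lists are copied from the example output above rather than produced by the pipeline.

```python
def vocabulary_size(tokens):
    """Number of distinct token types in a list of tokens."""
    return len(set(tokens))

def reduction_rate(size_before, size_after):
    """Fraction of the vocabulary removed by a processing step."""
    if size_before == 0:
        return 0.0
    return (size_before - size_after) / size_before

# Token lists copied from the example sentence output (illustration only)
raw_tokens = ['this', 'is', 'an', 'example', 'sentence', '.']
no_stopword_tokens = ['example', 'sentence', '.']
stemmed_tokens = ['exampl', 'sentenc', '.']

size_raw = vocabulary_size(raw_tokens)
size_no_stop = vocabulary_size(no_stopword_tokens)
size_stemmed = vocabulary_size(stemmed_tokens)

print(f"Vocabulary: {size_raw} -> {size_no_stop} -> {size_stemmed}")
print(f"Reduction after stopword removal: {reduction_rate(size_raw, size_no_stop):.1%}")
print(f"Reduction after stemming: {reduction_rate(size_no_stop, size_stemmed):.1%}")
```

On the real data, the same two helpers would be applied to the tokens of the whole `content` column instead of a single sentence.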


### Task #2

- Describe how you ended up representing the FakeNewsCorpus dataset (for instance with a Pandas dataframe). Argue for why you chose this design.
- Did you discover any inherent problems with the data while working with it?
- Report key properties of the data set - for instance through statistics or visualization.

The exploration can include (but need not be limited to):

- counting the number of URLs in the content
- counting the number of dates in the content
- counting the number of numeric values in the content
- determining the 100 most frequent words that appear in the content
- plotting the frequency of the 10000 most frequent words (any interesting patterns?)
- running the analysis in points 4 and 5 both before and after removing stopwords and applying stemming: do you see any difference?

In [5]:
### Code ###
#Find all the columns
import csv
import urllib.request

url = "https://raw.githubusercontent.com/several27/FakeNewsCorpus/master/news_sample.csv"

#Read CSV file from the url and parse it into a list of dictionaries
with urllib.request.urlopen(url) as response:
    data = [row for row in csv.DictReader(response.read().decode("utf-8").splitlines())]
    
print("Column Names: ", list(data[0].keys()))

Column Names:  ['', 'id', 'domain', 'type', 'url', 'content', 'scraped_at', 'inserted_at', 'updated_at', 'title', 'authors', 'keywords', 'meta_keywords', 'meta_description', 'tags', 'summary']


In [6]:
import csv
import io
import requests
import re

url = "https://raw.githubusercontent.com/several27/FakeNewsCorpus/master/news_sample.csv"
response = requests.get(url)
content = response.content.decode("utf-8")

def count_tokens(rows):
    num_count = 0
    url_count = 0
    date_count = 0
    for row in rows:
        content = row['content']  # list of tokens produced by clean_text_and_tokenize
        num_count += content.count("NUM")    # word_tokenize splits "<NUM>" into '<', 'NUM', '>'
        date_count += content.count("DATE")  # same for "<DATE>"
        url_count += content.count("URL")    # same for "<URL>"
    return num_count, date_count, url_count

rows = []
for line in csv.DictReader(io.StringIO(content)):
    line['content'] = clean_text_and_tokenize(line['content'])
    rows.append(line)

num_count, date_count, url_count = count_tokens(rows)

print(f"Number of <NUM> tokens: {num_count}")
print(f"Number of <DATE> tokens: {date_count}")
print(f"Number of <URL> tokens: {url_count}")

Number of <NUM> tokens: 2487
Number of <DATE> tokens: 40
Number of <URL> tokens: 329


In [7]:
#Count number of "<NUM>" in column 'content'
def count_num(rows):
    num_count = 0
    for row in rows:
        content = row['content']
        num_count += content.count("NUM")
    return num_count

rows = []
for line in csv.DictReader(io.StringIO(content)):
    line['content'] = clean_text_and_tokenize(line['content'])
    rows.append(line)
    
print(f"Number of <NUM> tokens: {count_num(rows)}")

Number of <NUM> tokens: 2487


In [8]:
def count_date(rows):
    num_date = 0
    for row in rows:
        content = row['content']
        num_date += content.count("DATE")
    return num_date

rows = []
for line in csv.DictReader(io.StringIO(content)):
    line['content'] = clean_text_and_tokenize(line['content'])
    rows.append(line)
    

print(f"Number of <DATE> tokens: {count_date(rows)}")

Number of <DATE> tokens: 40


In [9]:
def count_url(rows):
    num_url = 0
    for row in rows:
        content = row['content']
        num_url += content.count("URL")
    return num_url

rows = []
for line in csv.DictReader(io.StringIO(content)):
    line['content'] = clean_text_and_tokenize(line['content'])
    rows.append(line)
    
print(f"Number of <URL> tokens: {count_url(rows)}")

Number of <URL> tokens: 329
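
For the word-frequency points of the task, a minimal sketch with `collections.Counter` could look like this. The helper names `top_words` and `rank_frequency` are hypothetical, and the short token list is made up for illustration; in the notebook the input would be the concatenated tokens of the `content` column, once before and once after stopword removal and stemming.

```python
from collections import Counter

def top_words(tokens, n=100):
    """Return the n most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n)

def rank_frequency(tokens, n=10000):
    """Return the counts of the n most frequent tokens, highest first."""
    return [count for _, count in Counter(tokens).most_common(n)]

# Made-up tokens standing in for the real corpus
sample_tokens = ['a', 'a', 'a', 'b', 'b', 'c']

print(top_words(sample_tokens, n=2))
print(rank_frequency(sample_tokens))
```

The list returned by `rank_frequency` can be passed to a plotting library; natural-language corpora often show a roughly straight line on log-log axes, which would be one pattern to look for.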


### Task #3

Apply your data preprocessing pipeline to a larger proportion of the FakeNewsCorpus: https://github.com/several27/FakeNewsCorpus/releases/tag/v1.0

You may find it challenging to run your data processing pipeline on the entire FakeNewsCorpus. At a minimum, you should be able to process 10% of the data using your pipeline.

In [10]:
import csv

def process_text(text):
    if text is None:
        return ''
    tokens = clean_text_and_tokenize(text)
    tokens = remove_stopwords(tokens)
    stemmed_tokens = stem_tokens(tokens)
    return ' '.join(stemmed_tokens)

columns_to_process = {'content', 'type', 'meta_description', 'domain', 'title', 'meta_keywords'}
with open('/volumes/Glyph1TB/newsCorpus/news_cleaned_2018_02_13.csv', encoding="utf-8") as f:
    reader = csv.DictReader(f)

    # newline='' lets the csv module control line endings in the output file
    with open('/volumes/Glyph1TB/newsCorpus/news_cleaned_2018_02_13-results3.csv', 'w', encoding="utf-8", newline='') as fOut:
        fieldnames = reader.fieldnames + ["processed_text"]
        writer = csv.DictWriter(fOut, fieldnames=fieldnames)
        writer.writeheader()

        for row in reader:
            processed_row = {column_name: (process_text(cell_value) if column_name in columns_to_process else cell_value) for column_name, cell_value in row.items()}
            writer.writerow(processed_row)
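
If the full corpus is too large to process, one option is to keep each row with a fixed probability while streaming the file, so that roughly 10% of the rows reach `process_text`. The sketch below is an assumption about how this could be done, not part of the original pipeline: `sample_rows` is a hypothetical helper, and the in-memory CSV stands in for the real file.

```python
import csv
import io
import random

def sample_rows(rows, fraction=0.1, seed=0):
    """Yield each row independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    for row in rows:
        if rng.random() < fraction:
            yield row

# Hypothetical stand-in for the corpus file: 1000 small rows
fake_csv = io.StringIO(
    "id,content\n" + "".join(f"{i},text {i}\n" for i in range(1000))
)

kept = list(sample_rows(csv.DictReader(fake_csv), fraction=0.1, seed=0))
print(f"Kept {len(kept)} of 1000 rows")
```

Because the rows are filtered one at a time, memory use stays constant regardless of file size. Very long articles may exceed the `csv` module's default field size limit; `csv.field_size_limit` can raise it if that error appears.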


### Task #4

Split the resulting dataset into training, validation, and test splits. A common strategy is to split the data uniformly at random into 80% / 10% / 10%. You will use the training data to train your baseline and advanced models, the validation data can be used for model selection and hyperparameter tuning, while the test data should only be used in Part 4.
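
A minimal sketch of such a split, assuming the processed rows fit in memory as a list. `train_val_test_split` is a hypothetical helper name, the seed is an arbitrary choice for reproducibility, and the integers stand in for the real rows.

```python
import random

def train_val_test_split(rows, train=0.8, val=0.1, seed=42):
    """Shuffle rows and cut them into train, validation, and test lists."""
    rows = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # uniform random order, reproducible
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    train_rows = rows[:n_train]
    val_rows = rows[n_train:n_train + n_val]
    test_rows = rows[n_train + n_val:]  # whatever remains goes to the test set
    return train_rows, val_rows, test_rows

# Integers standing in for the processed rows
train_rows, val_rows, test_rows = train_val_test_split(range(1000))
print(len(train_rows), len(val_rows), len(test_rows))
```

Shuffling once and slicing guarantees that the three splits are disjoint and together cover every row.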

In [11]:
### Code ###