## Milestone 1
* [Task 1: Creating a Group](#first-bullet)
* [Task 2: Cleaning the Fake News Data](#second-bullet)
* [Task 3: Data Exploration](#third-bullet)
* [Task 4: Scraping Data Ourselves](#fourth-bullet)
* [Tools and Challenges](#fifth-bullet)

In [1]:
# Importing necessary tools
import time
import re
import lxml
import string
import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup
from unidecode import unidecode
from cleantext import clean
import nltk as nk
from nltk.corpus import stopwords
from nltk.stem import *
import pyarrow
import psycopg2
from pathlib import Path

#!pip install --upgrade unidecode
#!pip install --upgrade pip
#!pip install unidecode
#!pip install --upgrade clean-text
#!pip install --upgrade clean-text[gpl]
# downloading punkt and stopwords
#nk.download('punkt')
#nk.download('stopwords')

In [6]:
# pip install clean-text==0.6.0 # it is important that this version of clean-text is installed!

## Task 1: Creating a Group <a class="anchor" id="first-bullet"></a>

Group
David Andreas Seiler-Holm   (vsd187)

Emma Marie Foged Bendtsen   (jkc422)

Uffe Dalgas                 (nmb442)

Christian Jensen            (cjk248)

## Task 2: Cleaning the Fake News Data<a class="anchor" id="second-bullet"></a>

The following five blocks retrieve, process, structure and clean the dataset. 
The data is already structured in tabular form when read with the read_csv method. 
There are some unnecessary columns, such as Unnamed: 0. For a simple exploratory analysis we would keep them to avoid obfuscating the code, but for this exercise we remove them to show that we can. We used pandas because it is a well-known, easy-to-use Python module that has been (more or less) optimised for working with large amounts of data. It is therefore easy to get help on Stack Overflow, and you quickly get a sense of how to use it correctly.

Of the 1,000,000 observations in our dataset, 66 are missing a value in the content column. This is a very small percentage, so we remove these observations and expect little to no loss of information when we train our model. A dataset of 1,000,000 observations is relatively big, and processing it is memory intensive on our laptops. After the initial cleaning we saved the dataset as a CSV file. The file is relatively large, but we need it in order to read the dataset in chunks for text processing. We saved the processed chunks in the memory-efficient Parquet file format and combined them at the end. Processing data in chunks requires little to no dependence between chunks, which is the case here.
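The point about independence between chunks can be illustrated with a minimal sketch. It assumes a toy CSV held in memory and a hypothetical row-wise `process` function (neither is part of our pipeline), and it shows that processing the file chunk by chunk gives the same rows as processing it in one go:

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (hypothetical content)
csv_text = "content\nFirst Article\nSecond Article\nThird Article\nFourth Article\nFifth Article\n"

def process(frame):
    # A row-wise step: each row is transformed without looking at other rows
    out = frame.copy()
    out['content'] = out['content'].str.lower()
    return out

# Processing the whole file at once
whole = process(pd.read_csv(io.StringIO(csv_text)))

# Processing the same file two rows at a time and combining the results
chunks = [process(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]
combined = pd.concat(chunks, ignore_index=True)

print(whole['content'].tolist() == combined['content'].tolist())
```

A step that needs information from the whole dataset, such as a global word frequency, would not split this cleanly and would need an extra pass.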

In [60]:
# Loading data from csv into pandas dataframe
data = pd.read_csv('1mio-raw.csv', low_memory = False)
#print(data)

In [2]:
# Defining functions for cleaning text and dates
def clean_text(text):
        if not isinstance(text, str): # check if string 
            return float('nan') # setting to nan which is pandas' value for missing
        else:
            return clean(text,
            fix_unicode=True,               
            to_ascii=True,                  
            lower=True,    
            normalize_whitespace= True,
            no_line_breaks=False,           
            no_urls=True,                  
            no_emails=True,                
            no_phone_numbers=True,         
            no_numbers=True,               
            no_digits=True,                
            no_currency_symbols=True,     
            replace_with_url="<URL>",
            replace_with_email="<EMAIL>",
            replace_with_phone_number="<PHONE>",
            replace_with_number="<NUMBER>",
            replace_with_digit="0",
            replace_with_currency_symbol="<CUR>",
            lang="en"                       
            )

def clean_date(text): 
    if not isinstance(text, str):
        return float('nan')
    # Replace dates such as "May 21, 2005" with a <DATE> token
    date_pattern = re.compile(r'([a-zA-Z]+) *([1-3]?[0-9]), *([0-9]{4})')
    return date_pattern.sub('<DATE>', text)

In [1]:
# Cleaning text and dates
start = time.time()
data['content'] = data['content'].apply(clean_date)
data['content'] = data['content'].apply(clean_text)
end = time.time()
print ('Minutes elapsed:', round((end - start)/60,1))

# Dropping unnecessary columns
data = data.drop(columns=['Unnamed: 0'])

# Removing rows with missing content
len_old_data = len(data)
data = data[data['content'].notna()]
print(f'Removed {len_old_data - len(data)} observations out of {len_old_data}')

# Saving dataset
data.to_csv('data_for_processing.csv', index = False)

In [3]:
# Defining functions for removing stopwords and stemming
def filter_tokens(tokens, stop_words):
    return [token for token in tokens if token not in stop_words]

def stem_words(words, stemmer):
    stemmed_sentence = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_sentence)

In [135]:
# Tokenization, stop word removal and stemming
chunk_size = 100000
batch_no = 1
print(f'Number of chunks: {round(len(data)/chunk_size,2)}')

start = time.time()
for chunk in pd.read_csv('data_for_processing.csv', chunksize=chunk_size):
    print(f'Working on chunk {batch_no}: {round(batch_no/(len(data)/chunk_size)*100,2)-10}% done')
    
    # Tokenization
    chunk['content'] = chunk['content'].apply(nk.word_tokenize)

    # Removing stop words
    stop_words = set(stopwords.words('english'))
    chunk['content'] = chunk['content'].apply(lambda tokens: filter_tokens(tokens, stop_words))

    # Stemming 
    stemmer = PorterStemmer()
    chunk['content'] = chunk['content'].apply(lambda words: stem_words(words, stemmer))
    
    chunk.to_parquet(f'chunks/chunk{batch_no}.parquet',index=False)
    batch_no+=1
    
print('100% done')
end = time.time()
print ('Minutes elapsed:', round((end - start)/60,1))

Number of chunks: 10.0
Working on chunk 1: 0.0% done
Working on chunk 2: 10.0% done
Working on chunk 3: 20.0% done
Working on chunk 4: 30.0% done
Working on chunk 5: 40.0% done
Working on chunk 6: 50.0% done
Working on chunk 7: 60.0% done
Working on chunk 8: 70.01% done
Working on chunk 9: 80.01% done
Working on chunk 10: 90.01% done
100% done
Minutes elapsed: 151.6


In [137]:
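# Saving parquet files as one finished cleaned file
data_dir = Path('chunks/')
# ignore_index gives the combined frame one continuous index instead of repeating each chunk's index
cleaned = pd.concat((pd.read_parquet(parquet_file) for parquet_file in data_dir.glob('*.parquet')), ignore_index=True)
cleaned.to_csv('cleaned.csv', index=False)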

## Task 3: Data Exploration <a class="anchor" id="third-bullet"></a>

##### Observation 1: An overview of the different article types and the percentage of our dataset classified as the different types
There are 12 different categories for classifying articles: political, unreliable, bias, fake, conspiracy, unknown, rumor, clickbait, junksci, satire, reliable and hate.
Out of a total of 999,934 classified articles, around 30.2% are classified as political. Only 0.7% of the articles are classified as reliable, while 14.8% are classified as unreliable and 12.9% as fake. 

##### Observation 2: We check the classifications of different sites to see how they are classified
It seems that each site falls under exactly one category, i.e. no site has more than one classification. This is unexpected, as we would think a site could fall into multiple categories. The result might stem from how the articles were classified in the first place: the classification appears to have been done at the domain level, which means that articles from specific sites may automatically have been deemed e.g. unreliable even if they are not. The documentation for the news data on GitHub seems to confirm our hypothesis that classification was done automatically at the source level and not at the article level. 
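One way to check this directly is to count the distinct classifications per domain. The following is a minimal sketch on toy data; the domain names are hypothetical, and only the column names `domain` and `type` are taken from our dataset:

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical domains)
toy = pd.DataFrame({
    'domain': ['a.example', 'a.example', 'b.example', 'c.example', 'c.example'],
    'type':   ['fake', 'fake', 'reliable', 'bias', 'bias'],
})

# Number of distinct classifications per domain
types_per_domain = toy.groupby('domain')['type'].nunique()

# Domains with more than one classification (empty if labels are set per domain)
mixed_domains = types_per_domain[types_per_domain > 1]
print(mixed_domains)
```

If `mixed_domains` is empty on the real data, every domain carries a single label, which is what we would expect from a domain-level classification.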

##### Observation 3: We check to see if there is a correlation between type classifications and author anonymity

Only 15.4% of the reliable articles in our dataset have an author. This is very surprising, as we would expect most reliable articles to list an author. Looking at the dataset, christianpost.com in particular is missing authors, yet when visiting the site we can see authors on the articles. The low percentage of reliable articles with an author is therefore mainly caused by christianpost.com (and to some extent by nutritionfacts.org). This is a little worrying, as a model fitted on the data could be more inclined to classify articles without authors as reliable, even though this goes against our intuition. In accordance with our intuition, we do still see that the share of articles with an author is lower for unreliable articles than for reliable ones. Interestingly, 68.1% of the fake articles have an author. According to our dataset, an article is more likely to be fake if it has an author listed. Maybe this is because our dataset is still relatively small (compared to the full dataset), or maybe the writers of fake articles use an alias to make the article seem more credible.
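The share of articles with an author per type can also be computed in one step with a grouped mean. This is a minimal sketch on toy data; the author names are hypothetical, and it is an alternative formulation rather than the code we ran:

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical authors)
toy = pd.DataFrame({
    'type':    ['fake', 'fake', 'reliable', 'reliable', 'reliable', 'reliable'],
    'authors': ['A. Writer', None, None, None, 'B. Writer', None],
})

# notna() marks rows with a listed author; the mean of booleans is the share
author_share = toy['authors'].notna().groupby(toy['type']).mean() * 100
print(author_share.round(1))
```

On the toy data, half of the fake articles and a quarter of the reliable articles have an author.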

In [12]:
# Observation 1: Count and percentage of article type in Data
data = pd.read_csv('cleaned.csv')
print(f'Classifications: {len(data)}')
print()

article_counts = data['type'].value_counts()
article_pct = data['type'].value_counts(normalize=True).round(4)*100
print(f'''Percentages: 
{article_pct}''')

Classifications: 999934

Percentages: 
political     30.21
unreliable    14.77
bias          14.22
fake          12.92
conspiracy    11.45
unknown        4.94
rumor          4.85
clickbait      2.27
junksci        1.82
satire         1.49
reliable       0.69
hate           0.38
Name: type, dtype: float64


In [81]:
# Observation 2: Which sites have the different classifications?
classifications = pd.unique(data['type'])

for classification in classifications:
    data[classification] = data['type'].map({classification: 1})

data.groupby('domain')[classifications].sum().to_csv('classifications.csv')

In [131]:
# Observation 3: Articles and author anonymity
types_dict = article_counts.to_dict()  # number of articles per type

for k, v in types_dict.items():
    # notna() is True for rows where an author is listed
    count_classification_author = len(data[(data['type'] == k) & data['authors'].notna()])
    pct_author = round(count_classification_author/v,3)
    print(f'{k}: {round(pct_author*100,3)}% authors, {round((1-pct_author)*100,3)}% no author')

political: 86.6% authors, 13.4% no author
unreliable: 2.7% authors, 97.3% no author
bias: 36.8% authors, 63.2% no author
fake: 68.1% authors, 31.9% no author
conspiracy: 19.3% authors, 80.7% no author
unknown: 52.3% authors, 47.7% no author
rumor: 88.7% authors, 11.3% no author
clickbait: 57.8% authors, 42.2% no author
junksci: 52.7% authors, 47.3% no author
satire: 88.7% authors, 11.3% no author
reliable: 15.4% authors, 84.6% no author
hate: 86.0% authors, 14.0% no author


## Task 4: Scraping Data Ourselves <a class="anchor" id="fourth-bullet"></a>

We do the scraping in two steps. First, we find all links to articles on Wikinews. Second, we visit the articles and scrape information from them. We scrape two variables: the date the article was published and the content of the article. In addition, we add the time of the scrape to the data. For the publication date we simply use the date format found in the articles, but for the scraped_at variable we try to be a little more adventurous and use Unix time, i.e. seconds since 00:00:00 UTC on 1 January 1970. This makes it possible to represent time as a number in e.g. a database. It is also a way to standardise time so that we do not have to consider time zones.
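As a minimal sketch of how Unix time behaves, the snippet below takes a timestamp and converts it back to a readable UTC datetime. Truncating to whole seconds with `int` is an assumption made here for illustration; our scraper stores the float returned by `time.time()`:

```python
import time
from datetime import datetime, timezone

# Current time as whole seconds since 00:00:00 UTC on 1 January 1970
scraped_at = int(time.time())

# Converting back to a timezone-aware datetime for display
as_datetime = datetime.fromtimestamp(scraped_at, tz=timezone.utc)
print(scraped_at, as_datetime.isoformat())

# The epoch itself corresponds to timestamp 0
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
print(int(epoch.timestamp()))
```

Because the timestamp is defined relative to UTC, two scrapes made in different time zones remain directly comparable.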

In [10]:
# Getting the characters for our group
group_11_characters = "ABCDEFGHIJKLMNOPRSTUVWZABCDEFGHIJKLMNOPRSTUVWZ"[11%23:11%23+10]
print('Our characters:', group_11_characters)

Our characters: LMNOPRSTUV


In [11]:
# Retrieving article links
base_url = 'https://en.wikinews.org'
url_extension = '/w/index.php?title=Category:Politics_and_conflicts&pagefrom='
exclude = ['/wiki/Wikinews:Briefs', '/wiki/User:', '/wiki/Template:', '/wiki/Portal:']

articles = []
page_from = ''
end_of_articles = 0

while end_of_articles != 1:
    response = requests.get(f'{base_url}{url_extension}{page_from}')
    soup = BeautifulSoup(response.text, 'lxml')
    soup = soup.find(id='mw-pages') # choosing part of html containing links

    # scrape articles
    temp_articles = []
    for article in soup.find_all('a'):
        temp_articles.append(article.get('href'))
        
    temp_articles = [article for article in temp_articles if '/wiki/' in article]
    
    for exclusion in exclude:
        temp_articles = [article for article in temp_articles if exclusion not in article]

    # get next page link (str.strip removes a set of characters, not a prefix, so we replace the prefix instead)
    page_from = temp_articles[-1].replace('/wiki/', '', 1).replace('_', '+')

    #print(temp_articles)

    articles += temp_articles
    if len(temp_articles) == 1:
        end_of_articles = 1
    
articles = [article for article in articles if article[6] in group_11_characters]

In [16]:
# Scraping data
data_list = []
print("Number of articles:", len(articles)) # Number of articles: 3605 (may 8th 2022)

c = 0
for article in articles: 
    #print(f'Percentage done: {int(c/len(articles)*100)}%')
    print(article)
    response = requests.get(f'{base_url}{article}')
    soup = BeautifulSoup(response.text, 'lxml')
    #print(soup)
    
    # author
    #
    
    article_text = soup.find(class_='mw-parser-output')
    date = article_text.find(class_='published')
    
    content = []
    for text in article_text:
        if text.name == 'p' and text.find(class_='published') is None and text.text not in string.whitespace:
            content.append(text.text)

    article_dict = {'published': article_text.find_all('p')[0].text.strip('\n').strip('\xa0'), 
                    'scraped_at': time.time(), 'content': ' '.join(content), 'type': 'reliable',
                    'domain':'https://en.wikinews.org/', 'url': f'https://en.wikinews.org{article}',
                    'title':soup.find('h1', {'id': 'firstHeading'}).text}
    print(article_dict)
    data_list.append(article_dict)
    c+=1

scraped_data = pd.DataFrame(data_list)
scraped_data.head()
scraped_data.to_csv('scraped_data.csv', index = False)

Number of articles: 3609
/wiki/L.A._elects_Latino_Mayor
{'published': 'Saturday, May 21, 2005', 'scraped_at': 1653231302.2945747, 'content': "Mayor-elect Antonio Villaraigosa was swept to victory in Los Angeles on Tuesday, winning nearly 60% of the vote to defeat incumbent James Hahn. Becoming the first Latino mayor in 133 years, the historic election was marked by an anticipated low voter turnout of 30%.\n By capitalizing on voter discontent, Villaraigosa was able to over-come his loss to Hahn in the previous election and gain the key support of the African American community and San Fernando Valley. According to a Los Angeles Times exit poll, 7 in 10 voters said they wanted the city to shift direction, including roughly a third of Hahn's own supporters, .\n In a city where 47 percent of the population is Latino, Villaraigosa largely downplayed his ethenicity during the election and was able to garner cross cultural support. Whites make up 30 percent of the city population, while Afri

KeyboardInterrupt: 

In [5]:
scraped_data = pd.read_csv('scraped_data.csv')
# Cleaning text and dates
scraped_data['content'] = scraped_data['content'].apply(lambda x: clean_date(x))
scraped_data['content'] = scraped_data['content'].apply(lambda x: clean_text(x))

# Removing rows with missing content
len_old_data = len(scraped_data)
scraped_data = scraped_data[scraped_data['content'].notna()]
print(f'Removed {len_old_data - len(scraped_data)} observations out of {len_old_data}')

# Tokenization
scraped_data['content'] = scraped_data['content'].apply(nk.word_tokenize)

# Removing stop words
stop_words = set(stopwords.words('english'))
scraped_data['content'] = scraped_data['content'].apply(lambda tokens: filter_tokens(tokens, stop_words))

# Stemming 
stemmer = PorterStemmer()
scraped_data['content'] = scraped_data['content'].apply(lambda words: stem_words(words, stemmer))
scraped_data.to_csv('cleaned_scraped_data.csv', index = False)

Removed 7 observations out of 3609


## Tools and Challenges <a class="anchor" id="fifth-bullet"></a>

We use the following three libraries for scraping:

*   requests
*   bs4
*   pandas

Requests is the library we use to request the HTML of Wikinews articles. We then use Beautiful Soup (bs4) to parse the HTML for the information we need. We place the data in a pandas dataframe so that we can write it to a CSV file. The biggest challenge of the scrape was that certain news articles, such as briefs, had a different structure from other articles, which made it difficult to generalise our code. As briefs are small collections of different articles, we ended up filtering them out of the article links we got with the requests library. Otherwise we did not really have any challenges. 
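The filtering of briefs and other non-article pages can be sketched as follows. The hrefs are hypothetical examples of what a category page might contain, and the sketch uses `startswith` where our scraper checks whether `'/wiki/'` occurs anywhere in the link:

```python
# Hypothetical hrefs of the kind found on a category page
hrefs = [
    '/wiki/Lebanon_holds_elections',
    '/wiki/Wikinews:Briefs/May_1,_2005',
    '/wiki/User:Example',
    '/w/index.php?title=Category:Politics_and_conflicts',
    '/wiki/Mexico_elects_president',
]

# Link prefixes that do not point to ordinary articles
exclude = ['/wiki/Wikinews:Briefs', '/wiki/User:', '/wiki/Template:', '/wiki/Portal:']

# Keep links that look like articles and match none of the exclusions
article_links = [
    href for href in hrefs
    if href.startswith('/wiki/') and not any(pattern in href for pattern in exclude)
]
print(article_links)
```

Only the two ordinary article links remain; the brief, the user page and the index link are dropped.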