# Clean data

This code imports the data csv file and does the following:
1. Extracts only article text and label columns
2. Removes duplicate articles
3. Makes an index of non-english articles and removes them
4. Replaces all special characters present in the articles which space
5. Saves the clean data as df_article_text.csv in data folder

In [2]:
import os
os.environ['DSSG_PATH'] = '/data/dssg-disinfo'

In [3]:
os.environ['DSSG_PATH']

'/data/dssg-disinfo'

In [1]:
import sys
sys.path.append('../src/preprocessing/')
import setup
import pandas as pd

In [2]:
setup.load_cleandata()

In [3]:
df = pd.read_csv('/data/dssg-disinfo/articles_v3.csv')

In [4]:
df.shape

(28413, 8)

In [33]:
df.tail(n = 10)

Unnamed: 0,article_pk,domain_pk,domain_name,article_url,label,article_headline,article_text,publish_date
28403,69337013,37651,https://www.theepochtimes.com,https://www.theepochtimes.com/europe-awakens-t...,0,EU Awakens to Threat Posed by Chinese Communis...,U.S. Secretary of State Mike Pompeo speaks at ...,2020-06-27T16:26:08Z
28404,66625469,51,aanirfan.blogspot.co.uk,http://aanirfan.blogspot.com/search/label/tax,0,,Aangirfan\nShowing posts with label\ntax\n.\nS...,2020-06-23T06:42:07.264Z
28405,66625472,51,aanirfan.blogspot.co.uk,http://aanirfan.blogspot.com/search/label/COVI...,0,,Aangirfan\nShowing posts with label\nCOVID CUL...,2020-06-23T10:13:04.485Z
28406,67420100,1164,pajamasmedia.com,https://pjmedia.com/columns/stephen-kruiser/20...,0,"The Morning Briefing: Baseball Is Back, So We ...","It’s Baseball, But Will We Recognize It?\n\nYe...",2020-06-24T00:00:00Z
28407,64126916,297,humansarefree.com,https://humansarefree.com/?p=34202,0,Child Porn Discovered on Pentagon Computers (U...,Despite the fact that within the past couple o...,2020-06-19T00:00:00Z
28408,66625479,51,aanirfan.blogspot.co.uk,http://aanirfan.blogspot.com/search/label/PRIV...,0,,Aangirfan\nShowing posts with label\nPRIVATE E...,2020-06-23T06:42:06.857Z
28409,64126944,297,humansarefree.com,https://humansarefree.com/2020/06/child-porn-d...,0,Child Porn Discovered on Pentagon Computers (U...,Within the past couple of years alone millions...,2020-06-19T00:00:00Z
28410,69337068,23747,https://occasion-to-be.com,https://occasion-to-be.com/biden-says-he-would...,0,Biden Says He Would Use Federal Power To Force...,Joe Biden has said that he would use federal p...,2020-06-27T17:06:53.102Z
28411,66625518,51,aanirfan.blogspot.co.uk,http://aanirfan.blogspot.com/search/label/slav...,0,,Aangirfan\nShowing posts with label\nslave tra...,2020-06-23T07:52:04.362Z
28412,69337078,23747,https://occasion-to-be.com,https://occasion-to-be.com/bill-gates-lashes-o...,0,Bill Gates Lashes Out at America for Rejecting...,Bill Gates has lashed out at America for rejec...,2020-06-28T00:06:54.974Z


In [24]:
df[df['label']==0].groupby('domain_name').count().sort_values(by='article_pk',ascending=False).head(n = 10)

Unnamed: 0_level_0,article_pk,domain_pk,article_url,label,article_headline,article_text,publish_date
domain_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
aanirfan.blogspot.co.uk,244,244,244,244,13,244,244
https://www.timesnownews.com,113,113,113,113,100,113,113
breitbart.com,113,113,113,113,113,113,113
https://www.theepochtimes.com,78,78,78,78,78,78,78
infowars.com,62,62,62,62,59,62,62
zerohedge.com,51,51,51,51,51,51,51
https://www.conservapedia.com,46,46,46,46,45,46,46
counterinformation.wordpress.com,45,45,45,45,41,45,45
businessinsider.com,36,36,36,36,34,36,36
activistpost.com,27,27,27,27,27,27,27


In [38]:
df_neg = pd.read_csv('/data/dssg-disinfo/negative_articles.csv', sep=',')

In [39]:
df_neg.groupby('domain_name').count().sort_values(by='article_pk',ascending=False).head(n = 10)

Unnamed: 0_level_0,article_pk,domain_pk,article_url,label,article_headline,article_text
domain_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
businessinsider.com,36,36,36,36,34,36
motherjones.com,26,26,26,26,26,26
foreignpolicy.com,25,25,25,25,25,25
jpost.com,21,21,21,21,21,21
cnn.com,20,20,20,20,20,20
gulfnews.com,17,17,17,17,17,17
independent.co.uk,15,15,15,15,15,15
cnbc.com,14,14,14,14,10,14
politico.com,13,13,13,13,13,13
slate.com,13,13,13,13,13,13


In [9]:
df = setup.drop_noneng(df)

In [10]:
df.shape

(28413, 8)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from langdetect import detect

In [2]:
# Needs to be run once in the beginning
!pip install langdetect



In [3]:
# Read in the csv files
df_pos = pd.read_csv('data/positive_articles.csv', sep=',')
df_neg = pd.read_csv('data/negative_articles.csv', sep=',')

# Merge positive and negative data into one data frame
df = pd.concat([df_pos, df_neg], ignore_index=True)

# Subsetting label and article_text data
columns = ['label', 'article_text']
df_article_text = df[columns]

In [4]:
df_article_text.shape

(12859, 2)

Problems with the data:

1. Some articles non-english
2. '\n', 'â' characters present
3. Duplicate article_text- different urls but article is same.
4. Name of the author of the article present

In [5]:
# Removing duplicates in article_text the data
df_article_text.drop_duplicates(keep='first', inplace=True)

# Index of non-english rows
non_en_index = []

for index, row in df_article_text.iterrows():
    # Explicitly converting article_text to string because a few of the rows were being captured as non-strings
    lang = detect(str(row['article_text']))
    if lang != 'en':
        non_en_index.append(index)

# Removing non-english articles        
df_article_text.drop(non_en_index, inplace= True)

# Remove the special characters
df_article_text['article_text'] = [article.replace('\n', ' ') for article in df_article_text.article_text]
df_article_text['article_text'] = [article.replace('[^a-zA-Z\d\s:]', ' ') for article in df_article_text.article_text]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [6]:
# Write the clean data into a new csv
df_article_text.to_csv('data/df_article_text.csv', sep=',', index=False)