### Data Processing

<small>Written by: Ali Tobah - tobah@umich.edu</small>

In [1]:
import pandas as pd
import os
import re
import altair as alt

In [2]:
# In some environments, such as Google Colab, it's
# faster to upload a zip file than the raw data.
# Uncomment the line below if needed to unzip the data.
# The zip file must be uploaded first.

#!unzip 'data.zip'

# This code creates the subdirectory for the processed data.
# The code in the next cell creates the output csv file, but
# gives an error if the directory doesn't already exist.

if not os.path.exists('data/processed'):
    !mkdir data/processed
    print('Directory created.')
else:
    print('Directory already there.')

Directory already there.
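
The `!mkdir` call above relies on IPython shell syntax and on the platform's `mkdir` command. A minimal, portable sketch using only the standard library is shown here: `os.makedirs` with `exist_ok=True` creates any missing parent directories and does not raise if the directory is already present. The helper name `ensure_dir` is hypothetical and not part of the original notebook.

```python
import os

def ensure_dir(path):
    """Create `path` (and any missing parents); return True if it was newly created."""
    already_there = os.path.isdir(path)
    # exist_ok=True makes the call safe to repeat on later runs.
    os.makedirs(path, exist_ok=True)
    return not already_there
```

Calling `ensure_dir('data/processed')` before writing the CSV would cover the same case as the cell above.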


In [3]:
dirPath = 'data/raw'
dirPathSubList = ['/fakeNewsDataset/fake',
               '/fakeNewsDataset/legit',
               '/celebrityDataset/fake',
               '/celebrityDataset/legit']

columnLabels = ['Text', 'Domain', 'Label']
allArticlesList = []
for eaDir in dirPathSubList:
    eaDir = dirPath + eaDir
    allDirsContents = []
    for eaFile in os.listdir(eaDir):
        with open(os.path.join(eaDir, eaFile), 'r', encoding='utf-8') as currFile:
            # File contents, including a title if there is one
            fileText = currFile.read()

            # Get the domain: 'celebrity' for the celebrity dataset,
            # otherwise the file-name prefix before the first digit
            if eaDir.split('/')[2] == 'celebrityDataset':
                fileDomain = 'celebrity'
            else:
                fileDomain = re.split(r'(\d+)', eaFile)[0]

            # Label, whether it is fake or legit (real)
            fileLabel = eaDir.split('/')[-1]

            # Compile list of directory contents: Text plus attributes
            allDirsContents.append([fileText, fileDomain, fileLabel])
    
    # Compile list of all articles with attributes
    allArticlesList.extend(allDirsContents)

# Create a dataframe and save in a file.
# Note that the subdirectory should already exist
# as noted above.
articlesDF = pd.DataFrame(allArticlesList, columns=columnLabels)
outFile = 'data/processed/allnewsdataFakeReal.csv'
articlesDF.to_csv(outFile)
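
To make the domain rule in the loop above concrete: `re.split(r'(\d+)', name)[0]` returns everything before the first run of digits in the file name. The sketch below uses hypothetical file names (the actual names in `data/raw` are not shown in this notebook) and a hypothetical helper, `domain_from_filename`.

```python
import re

def domain_from_filename(file_name):
    """Return the part of `file_name` that precedes its first run of digits."""
    # re.split with a capturing group keeps the digits as separate items,
    # so element 0 is always the text before the first digit run.
    return re.split(r'(\d+)', file_name)[0]

# Hypothetical file names, for illustration only.
print(domain_from_filename('biz01.fake.txt'))    # biz
print(domain_from_filename('tech12.legit.txt'))  # tech
```

A name with no digits comes back unchanged, so such a file would end up with its whole file name as the domain.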
    

In [4]:
articlesDF.shape

(980, 3)

In [5]:
articlesDF.head()

,Text,Domain,Label
0,6YO Brings Gun to School\n\n\n\nIn Rancho Cuca...,edu,fake
1,UK banks said not prepared for Brexit\n\nThe B...,biz,fake
2,"Alex Jones Vindicated in ""Pizzagate"" Controver...",biz,fake
3,Facebook Messenger is eliminating Emoji's\n\nF...,tech,fake
4,Yahoo Denies Data Breach from 500M Accounts\n\...,tech,fake


In [6]:
chartDF = (
    articlesDF.groupby(['Domain', 'Label'])
    .count()
    .rename(columns={'Text': 'Number of Articles'})
    .reset_index()
)
chartDF

,Domain,Label,Number of Articles
0,biz,fake,40
1,biz,legit,40
2,celebrity,fake,250
3,celebrity,legit,250
4,edu,fake,40
5,edu,legit,40
6,entmt,fake,40
7,entmt,legit,40
8,polit,fake,40
9,polit,legit,40


In [7]:
# NOTE: This chart will not show on GitHub.
# See the next cell for more information.

articlesChart = alt.Chart(chartDF).mark_bar().encode(
    x = alt.X('Number of Articles:Q'),
    y = alt.Y('Label:N', title=None, axis=alt.Axis(labels=False, tickSize=0)),
    color = alt.Color('Label:N'),
    row = alt.Row('Domain:N')
    )
articlesChart

### Altair images

Altair-generated images don't show on GitHub. The reason is explained here: https://stackoverflow.com/questions/71346406/why-are-my-altair-data-visualizations-not-showing-up-in-github

As a workaround, this cell embeds a PNG version of `articlesChart` (from the cell above) directly in Markdown. The method is explained here: https://medium.com/analytics-vidhya/embedding-your-image-in-google-colab-markdown-3998d5ac2684

<img src='https://drive.google.com/uc?id=1qKEkEMUbZz76Me1pFbUk7A68p-B3wUuq'>
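
The image URL above follows the `https://drive.google.com/uc?id=<FILE_ID>` pattern. Below is a minimal sketch of deriving such a URL from a Drive sharing link. It assumes the sharing link has the form `https://drive.google.com/file/d/<FILE_ID>/view?...`. The helper name `drive_embed_url` and the example sharing link are illustrative and not taken from the linked article.

```python
import re

def drive_embed_url(share_url):
    """Convert a Drive sharing link to a `uc?id=` URL (assumed link format)."""
    match = re.search(r'/file/d/([\w-]+)', share_url)
    if match is None:
        raise ValueError('No file ID found in: ' + share_url)
    return 'https://drive.google.com/uc?id=' + match.group(1)

# Assumed sharing link, built around the file ID used in the <img> tag above.
share = 'https://drive.google.com/file/d/1qKEkEMUbZz76Me1pFbUk7A68p-B3wUuq/view?usp=sharing'
print(drive_embed_url(share))
```

The image will likely load for other readers only if the file's sharing setting allows anyone with the link to view it.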