# **NELA Source Aggregation - Initial preprocessing for the NELA dataset**

## **Name: Snehal Yeole**

## **SJSU ID: 012548471**

## **Course: CMPE 257**

## **Factors:**

*  Credibility/Fact Checks

*  Sensationalism

##**Goal**:  

The aim of the project is to develop a strategy to help identify the degree of fakenes of any news article. The project uses NELA-GT 2018 dataset


## **Problem Statement:**

Social media for news consumption is a double-edged sword. On the one hand, its low cost, easy access, and rapid dissemination of information lead people to seek out and consume news from social media. On the other hand, it enables the wide spread of “fake news”, i.e., low quality news with intentionally false information. The extensive spread of fake news has the potential for extremely negative impacts on individuals and society. Therefore, fake news detection on social media has recently become an emerging research that is attracting tremendous attention.


##**Dataset Description**: 

NELA2018 is a political news article dataset that contains 136K articles from 92
media sources in 2017 (Horne, Khedr, and Adalı
2018). The dataset includes sources from mainstream, hyper-partisan, conspiracy,and satire media sources. Along with the news articles, the dataset includes a rich set of natural language features on each news article, and the corresponding Facebook engagement statistics. The dataset contains nearly all of the articles published by the 92 sources during the 7 month period. GDELT is an open database of event-based news articles with temporal and location features.

## **Dataset creation:**

*  Variety of news sources was gathered from varying levels of veracity, including many well-studied misinforming sources and other less well-known sources.

*  The article data was scraped from the gathered
sources’ RSS feeds twice a day for 10 months
in 2018.

*  Source-level veracity labels was combined and corroborated from 8 independent assessments, some of which are used in the misinformation literature, others that are not. These labels provide multiple and complementary ground truth allowing for many different ways to characterize the sources.



## **Machine Learning Lifecycle:**

This notebook mainly deals with the below machine learning steps:

*  Configuration of the System
*  Data Collection
*  Set Data Narrative
*  Exploratory Data Analysis and Visualization
*  Run Stats: mean, median, mode, correlation, variance
*  Data Prep: Curation

This notebook is for determining which indicator label sets have values for which sources. This is basically the data preparation on the NELA-GT dataset in order to use the dataset for implementing our Alternus Vera project factors.



# **Step 1: Configuration of the System**

## **Link of the NELA-GT Source:**
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ULHLCB

## **Link of the Articles Database:**
https://drive.google.com/open?id=1RAj-SrXAJxJbnK-wVZTDHXtizCsL1abp

## **Link of the Labels associated with sources of the articles:**
https://drive.google.com/open?id=1-HorSxKrHaPPtm0r2qP8kN6jXsBwwVVV

## **Link of the Amalgamated Dataset:**
https://drive.google.com/open?id=1-0Bi5HIhnVm4ZVTuyv38L5K4iJm631YX

## **Link of the Train Dataset:**
https://drive.google.com/open?id=1-0JMmWR27KYzl5hVKpMQhnNNvCtkY2lz

## **Link of the Test Dataset:**
https://drive.google.com/open?id=1-0jNPBxpQBPp1ft_u9U42ynWyUyrf1Fx

# **Step 2: Data Collection**

## **Import required libraries**

In [0]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import csv
import gensim
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from nltk.stem.porter import PorterStemmer
from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.pipeline import Pipeline
from nltk.corpus import stopwords
from string import punctuation
import seaborn as sns
import pandas as pd
import numpy as np
import nltk
import re
import nltk
import logging
import os
import pickle
import re
import sqlite3
import inspect
import sys
import json
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
import matplotlib.pyplot as plt
from scipy import sparse

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## **Mount the shared drive on google colab**

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Append the path containing all the datasets**

In [0]:
import sys
sys.path.append('/content/drive/My Drive/AlternusVeraDataSets2019/FinalExam/Transformers/SnehalYeole/Datasets_NELA/Datasets')

## **Import tqdm**

In [0]:
import util
import pandas as pd
from tqdm import tqdm_notebook as tqdm

## **Read labels.csv file**

In [0]:
labels_df = pd.read_csv('/content/drive/My Drive/AlternusVeraDataSets2019/FinalExam/Transformers/SnehalYeole/Datasets_NELA/Datasets/labels.csv')
labels_df = labels_df.rename(columns={"Unnamed: 0": "Source"})

In [0]:
labels_df[labels_df.Source == "MSNBC"]["Media Bias / Fact Check, label"]

74    left_bias
Name: Media Bias / Fact Check, label, dtype: object

## **Printing the labels datset to view the data and columns present**

In [0]:
labels_df

Unnamed: 0,Source,"NewsGuard, Does not repeatedly publish false content","NewsGuard, Gathers and presents information responsibly","NewsGuard, Regularly corrects or clarifies errors","NewsGuard, Handles the difference between news and opinion responsibly","NewsGuard, Avoids deceptive headlines","NewsGuard, Website discloses ownership and financing","NewsGuard, Clearly labels advertising","NewsGuard, Reveals who's in charge, including any possible conflicts of interest","NewsGuard, Provides information about content creators","NewsGuard, score","NewsGuard, overall_class","Pew Research Center, known_by_40%","Pew Research Center, total","Pew Research Center, consistently_liberal","Pew Research Center, mostly_liberal","Pew Research Center, mixed","Pew Research Center, mostly conservative","Pew Research Center, consistently conservative","Wikipedia, is_fake","Open Sources, reliable","Open Sources, fake","Open Sources, unreliable","Open Sources, bias","Open Sources, conspiracy","Open Sources, hate","Open Sources, junksci","Open Sources, rumor","Open Sources, blog","Open Sources, clickbait","Open Sources, political","Open Sources, satire","Open Sources, state","Media Bias / Fact Check, label","Media Bias / Fact Check, factual_reporting","Media Bias / Fact Check, extreme_left","Media Bias / Fact Check, right","Media Bias / Fact Check, extreme_right","Media Bias / Fact Check, propaganda","Media Bias / Fact Check, fake_news","Media Bias / Fact Check, some_fake_news","Media Bias / Fact Check, failed_fact_checks","Media Bias / Fact Check, conspiracy","Media Bias / Fact Check, pseudoscience","Media Bias / Fact Check, hate_group","Media Bias / Fact Check, anti_islam","Media Bias / Fact Check, nationalism","Allsides, bias_rating","Allsides, community_agree","Allsides, community_disagree","Allsides, community_label","BuzzFeed, leaning","PolitiFact, Pants on Fire!","PolitiFact, False","PolitiFact, Mostly False","PolitiFact, Half-True","PolitiFact, Mostly True","PolitiFact, True"
0,21stCenturyWire,,,,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,conspiracy_pseudoscience,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,left,,,,,,
1,ABC News,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,95.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-1.0,,,,,,,,,,,,,,,left_center_bias,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Lean Left,8964.0,6949.0,somewhat agree,,,,,,,
2,AMERICAblog News,,,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,left,,,,,,
3,Activist Post,,,,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,conspiracy_pseudoscience,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,left,,,,,,
4,Addicting Info,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,1.0,,,,left_bias,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,left,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189,iPolitics,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,left_center_bias,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,
190,oann,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,,,,1.0,,,,,,,,,,,,,,,,,,,,,right,,,,,,
191,rferl,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
192,sott.net,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,conspiracy_pseudoscience,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,


**We observe that there total 194 rows and 58 columns in the labels dataset. Column1 represents source and rest of 8 columns represent various factors (or features) of evaluating the news outlet. We also observe that there are lot of NaN values for all the features.**

In [0]:
labelSetFinal = labels_df["Source"]
labelSetFinal

0       21stCenturyWire
1              ABC News
2      AMERICAblog News
3         Activist Post
4        Addicting Info
             ...       
189           iPolitics
190                oann
191               rferl
192            sott.net
193    theRussophileorg
Name: Source, Length: 194, dtype: object

In [0]:
labelSet = ["BBC", "MSNBC", "Crikey"]

## **Extracting the list of articles from articles.db file**

In [0]:
dfFinal = pd.DataFrame([])

## **Function to fetch the list of articles from articles.db file**

In [0]:
def nela_load_articles_from_source(source_name, count=-1):
    conn = sqlite3.connect("/content/drive/My Drive/AlternusVeraDataSets2019/FinalExam/Transformers/SnehalYeole/Datasets_NELA/Datasets/articles.db")

    count_string = ""
    if count != -1:
        count_string = "limit " + str(count)

    df = pd.read_sql_query(
        "SELECT * FROM articles WHERE source='"
        + str(source_name)
        + "' " + "LIMIT 100"
        #+ count_string 
        + ";",
        conn,
    )
    return df

## **The above function contains a query that will fetch 100 articles for each of the source outlets and append the articles in the dataframe.**

In [0]:
for x in labelSetFinal:
  df = nela_load_articles_from_source(x)
  dfFinal = dfFinal.append(df)


## **Printing the list of fetch articles from the db file.**

In [0]:
dfFinal

Unnamed: 0,date,source,name,content
0,2018-07-16,21stCenturyWire,Israel Attacks Iranian-Linked Airbase in Alepp...,This latest report indicates Israel used its I...
1,2018-07-16,21stCenturyWire,Laurel Canyon the CIA Counter Culture Dave McG...,The United States is a pop cultural superpower...
2,2018-07-16,21stCenturyWire,PURELY POLITICAL Dont Expect Any Real Evidence...,As the mainstream press tries to get a grip on...
3,2018-07-17,21stCenturyWire,Iran Files Lawsuit Against US in International...,Iran has filed a lawsuit alleging the US viola...
4,2018-07-17,21stCenturyWire,No Exemptions for EU US Wants unprecedented fi...,At some point along their quest for regime cha...
...,...,...,...,...
95,2018-08-06,theRussophileorg,NATO to Set Up Its First Air Base in Western B...,This\r\n\r\n[post](https://www.strategic-cultu...
96,2018-08-06,theRussophileorg,NBCs Chuck Todd Trumps Dehumanizing Tweet on P...,This\r\n\r\n[post](https://truepundit.com/nbcs...
97,2018-08-06,theRussophileorg,Naya Pakistan New Pakistan Should Apply to Joi...,This\r\n\r\n[post](https://www.globalresearch....
98,2018-08-06,theRussophileorg,Naya Pakistan Should Apply To Join The Eurasia...,This\r\n\r\n[post](https://www.eurasiafuture.c...


# **Step 3: Set Data Narrative : Set Business Objectives, what use case are you solving for**

We observed that the articles gathered in this dataset is found in an sqlite database. The database has one table name articles. This table has 4 textual columns:

date: Date of article in yyyy-mm-dd format.
source: Source of article.
name: Title of article.
content: Clean text content of article.
The rows of the article are sorted first with respect to date and then with respect to source.The dataset's articles are also provided in plain-text files, with file-structure and file naming convension.

We also observe that labels dataset conatins different source outlets and features for these articles. In order to determine the degree of fakeness for an article we will amalgamate articles dataset and labels dataset on the basis of source column. After amalgamation, we can select among different features available in order to determine the degree of fakeness of article from a particular source.

## **Step 4: Exploratory Data Analysis and Visualization**

## **4.1 Feature Analysis and Engineering**

In this project, we will deal with the below label to determine the degree of fakeness of article.

Media Bias / Fact Check, factual_reporting

This fetaure has values in the range 1.0 to 5.0

1 - represents article is fake
5 - represents article is true

## **4.2 Analyze data**

**Check for NaN values in the list of articles fetched.**

In [0]:
dfFinal.isnull().sum()

date       0
source     0
name       0
content    0
dtype: int64

# **Step 5. Data Prep: Curation**

## **5.1 Feature Selection and Extraction : what are the main features to use in this data set?**

In this project, we will deal with the below label to determine the degree of fakeness of article.

Media Bias / Fact Check, factual_reporting

This fetaure has values in the range 1.0 to 5.0

1 - represents article is fake
5 - represents article is true

## **5.2 Data Verification: Do we have enough data?**

We dont have enough data in one dataset. Therefore, we need to amalgamate articles dataset and labels dataset, in order to get labels and source for each of the articles in the articles dataset.

## **5.3 Possibility of Amalgamation1: Add Dataset 2**


**Calculating the lenght of content for each article and appending to the dataset.**

In [0]:
length = []
[length.append(len(str(text))) for text in dfFinal['content']]
dfFinal['length'] = length
dfFinal.head()

Unnamed: 0,date,source,name,content,length
0,2018-07-16,21stCenturyWire,Israel Attacks Iranian-Linked Airbase in Alepp...,This latest report indicates Israel used its I...,1388
1,2018-07-16,21stCenturyWire,Laurel Canyon the CIA Counter Culture Dave McG...,The United States is a pop cultural superpower...,1448
2,2018-07-16,21stCenturyWire,PURELY POLITICAL Dont Expect Any Real Evidence...,As the mainstream press tries to get a grip on...,7384
3,2018-07-17,21stCenturyWire,Iran Files Lawsuit Against US in International...,Iran has filed a lawsuit alleging the US viola...,2897
4,2018-07-17,21stCenturyWire,No Exemptions for EU US Wants unprecedented fi...,At some point along their quest for regime cha...,1355


In [0]:
dfFinal.shape

(18071, 5)


    Source names NewsGuard labels:
    NewsGuard, Does not repeatedly publish false content 1 true, 0 false
    NewsGuard, Gathers and presents information responsibly 1 true, 0 false
    NewsGuard, Regularly corrects or clarifies errors 1 true, 0 false
    NewsGuard, Handles the difference between news and opinion responsibly 1 true, 0 false
    NewsGuard, Avoids deceptive headlines 1 true, 0 false
    NewsGuard, Website discloses ownership and financing 1 true, 0 false
    NewsGuard, Clearly labels advertising 1 true, 0 false
    NewsGuard, Reveals who's in charge, including any possible conflicts of interest 1 true, 0 false
    NewsGuard, Provides information about content creators 1 true, 0 false
    NewsGuard, score 0-100
    NewsGuard, overall_class 1 good, 0 bad Pew Research Center
    Pew Research Center, known_by_40% 1 true, 0 false
    Pew Research Center, total 1 trusted, 0 undecided, -1 not trusted
    Pew Research Center, consistently_liberal 1 trusted, 0 undecided, -1 not trusted
    Pew Research Center, mostly_liberal 1 trusted, 0 undecided, -1 not trusted
    Pew Research Center, mixed 1 trusted, 0 undecided, -1 not trusted
    Pew Research Center, mostly conservative 1 trusted, 0 undecided, -1 not trusted
    Pew Research Center, consistently conservative 1 trusted, 0 undecided, -1 not trusted Wikipedia
    Wikipedia, is_fake 1 marked Open Sources
    Open Sources, reliable # tag
    Open Sources, fake # tag
    Open Sources, unreliable # tag
    Open Sources, bias # tag
    Open Sources, conspiracy # tag
    Open Sources, hate # tag
    Open Sources, junksci # tag
    Open Sources, rumor # tag
    Open Sources, blog # tag
    Open Sources, clickbait # tag
    Open Sources, political # tag
    Open Sources, satire # tag
    Open Sources, state # tag Media Bias / Fact Check
    Media Bias / Fact Check, label label
    Media Bias / Fact Check, factual_reporting bad 1 - 5 good
    Media Bias / Fact Check, extreme_left 1 true, 0 false
    Media Bias / Fact Check, right 1 true, 0 false
    Media Bias / Fact Check, extreme_right 1 true, 0 false
    Media Bias / Fact Check, propaganda 1 true, 0 false
    Media Bias / Fact Check, fake_news 1 true, 0 false
    Media Bias / Fact Check, some_fake_news 1 true, 0 false
    Media Bias / Fact Check, failed_fact_checks 1 true, 0 false
    Media Bias / Fact Check, conspiracy 1 true, 0 false
    Media Bias / Fact Check, pseudoscience 1 true, 0 false
    Media Bias / Fact Check, hate_group 1 true, 0 false
    Media Bias / Fact Check, anti_islam 1 true, 0 false
    Media Bias / Fact Check, nationalism 1 true, 0 false Allsides
    Allsides, bias_rating rating label
    Allsides, community_agree # votes agreeing
    Allsides, community_disagree # votes disagreeing
    Allsides, community_label agreement label BuzzFeed
    BuzzFeed, leaning left, right Politifact
    PolitiFact, Pants on Fire! # counts
    PolitiFact, False # counts
    PolitiFact, Mostly False # counts
    PolitiFact, Half-True # counts
    PolitiFact, Mostly True # counts
    PolitiFact, True # counts


In [0]:
labels_df[["Source", "Media Bias / Fact Check, factual_reporting", "Media Bias / Fact Check, label"]]
labels_df[["Source","factLabel","bias_satire_conspiracy_reputation"]] = labels_df[["Source","Media Bias / Fact Check, factual_reporting", "Media Bias / Fact Check, label"]]
df_label1 = labels_df[["Source","factLabel","bias_satire_conspiracy_reputation"]]
df_label1

Unnamed: 0,Source,factLabel,bias_satire_conspiracy_reputation
0,21stCenturyWire,3.0,conspiracy_pseudoscience
1,ABC News,4.0,left_center_bias
2,AMERICAblog News,,
3,Activist Post,2.0,conspiracy_pseudoscience
4,Addicting Info,3.0,left_bias
...,...,...,...
189,iPolitics,4.0,left_center_bias
190,oann,,
191,rferl,,
192,sott.net,3.0,conspiracy_pseudoscience


## **Checking for null values in the dataset.**

In [0]:
df_label1.isnull().sum()

Source                                0
factLabel                            82
bias_satire_conspiracy_reputation    59
dtype: int64


**Replacing all the NaN values in the factLabel and bias column.**

In [0]:
# df_label['label'].fillna('nolabel',inplace=True)
df_label1['factLabel'] = df_label1['factLabel'].replace(np.nan, 0.0)
df_label1['bias_satire_conspiracy_reputation'] = df_label1['bias_satire_conspiracy_reputation'].replace(np.nan, "nolabel")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [0]:
df_label1.isnull().sum()

Source                               0
factLabel                            0
bias_satire_conspiracy_reputation    0
dtype: int64

In [0]:
df_label1

Unnamed: 0,Source,factLabel,bias_satire_conspiracy_reputation
0,21stCenturyWire,3.0,conspiracy_pseudoscience
1,ABC News,4.0,left_center_bias
2,AMERICAblog News,0.0,nolabel
3,Activist Post,2.0,conspiracy_pseudoscience
4,Addicting Info,3.0,left_bias
...,...,...,...
189,iPolitics,4.0,left_center_bias
190,oann,0.0,nolabel
191,rferl,0.0,nolabel
192,sott.net,3.0,conspiracy_pseudoscience


## **Printing the unique values of Factlabel**

In [0]:
df_label1.factLabel.unique()

array([3., 4., 0., 2., 1., 5.])

**Defining the function to Encode the Factlable into three categories. Adding the encoded label to the dataset.**

In [0]:
true_labels = ['4.0', '5.0']
false_labels = ['1.0', '2.0']
neutral_nolabels = ['0.0', '3.0']
def simplify_label(input_label):
    if input_label in true_labels:
        return 1
    elif input_label in false_labels:
        return 0
    else:
        return 2   

In [0]:
df_label1["encoded_FactLabel"] = df_label1.apply(lambda row: simplify_label(row['factLabel']), axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [0]:
df_label1

Unnamed: 0,Source,factLabel,bias_satire_conspiracy_reputation,encoded_FactLabel
0,21stCenturyWire,3.0,conspiracy_pseudoscience,2
1,ABC News,4.0,left_center_bias,2
2,AMERICAblog News,0.0,nolabel,2
3,Activist Post,2.0,conspiracy_pseudoscience,2
4,Addicting Info,3.0,left_bias,2
...,...,...,...,...
189,iPolitics,4.0,left_center_bias,2
190,oann,0.0,nolabel,2
191,rferl,0.0,nolabel,2
192,sott.net,3.0,conspiracy_pseudoscience,2


## **Dataset Amalgamation**

In [0]:
df_label1.rename(columns = {'Source':'source'}, inplace = True) 
df_merge_col1 = pd.merge(dfFinal, df_label1, on='source')

df_merge_col1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(**kwargs)


Unnamed: 0,date,source,name,content,length,factLabel,bias_satire_conspiracy_reputation,encoded_FactLabel
0,2018-07-16,21stCenturyWire,Israel Attacks Iranian-Linked Airbase in Alepp...,This latest report indicates Israel used its I...,1388,3.0,conspiracy_pseudoscience,2
1,2018-07-16,21stCenturyWire,Laurel Canyon the CIA Counter Culture Dave McG...,The United States is a pop cultural superpower...,1448,3.0,conspiracy_pseudoscience,2
2,2018-07-16,21stCenturyWire,PURELY POLITICAL Dont Expect Any Real Evidence...,As the mainstream press tries to get a grip on...,7384,3.0,conspiracy_pseudoscience,2
3,2018-07-17,21stCenturyWire,Iran Files Lawsuit Against US in International...,Iran has filed a lawsuit alleging the US viola...,2897,3.0,conspiracy_pseudoscience,2
4,2018-07-17,21stCenturyWire,No Exemptions for EU US Wants unprecedented fi...,At some point along their quest for regime cha...,1355,3.0,conspiracy_pseudoscience,2
...,...,...,...,...,...,...,...,...
18066,2018-08-06,theRussophileorg,NATO to Set Up Its First Air Base in Western B...,This\r\n\r\n[post](https://www.strategic-cultu...,9599,0.0,nolabel,2
18067,2018-08-06,theRussophileorg,NBCs Chuck Todd Trumps Dehumanizing Tweet on P...,This\r\n\r\n[post](https://truepundit.com/nbcs...,3240,0.0,nolabel,2
18068,2018-08-06,theRussophileorg,Naya Pakistan New Pakistan Should Apply to Joi...,This\r\n\r\n[post](https://www.globalresearch....,4102,0.0,nolabel,2
18069,2018-08-06,theRussophileorg,Naya Pakistan Should Apply To Join The Eurasia...,This\r\n\r\n[post](https://www.eurasiafuture.c...,3900,0.0,nolabel,2


# **5.4 Data Cleansing**

## **Checking NaN values in the merged dataset**

In [0]:
df_merge_col1.isnull().sum()

date                                 0
source                               0
name                                 0
content                              0
length                               0
factLabel                            0
bias_satire_conspiracy_reputation    0
encoded_FactLabel                    0
dtype: int64

## **Renaming the name column to statement**

In [0]:
df_merge_col1.rename(columns = {'name':'statement'}, inplace = True) 
df_merge_col1

Unnamed: 0,date,source,statement,content,length,factLabel,bias_satire_conspiracy_reputation,encoded_FactLabel
0,2018-07-16,21stCenturyWire,Israel Attacks Iranian-Linked Airbase in Alepp...,This latest report indicates Israel used its I...,1388,3.0,conspiracy_pseudoscience,2
1,2018-07-16,21stCenturyWire,Laurel Canyon the CIA Counter Culture Dave McG...,The United States is a pop cultural superpower...,1448,3.0,conspiracy_pseudoscience,2
2,2018-07-16,21stCenturyWire,PURELY POLITICAL Dont Expect Any Real Evidence...,As the mainstream press tries to get a grip on...,7384,3.0,conspiracy_pseudoscience,2
3,2018-07-17,21stCenturyWire,Iran Files Lawsuit Against US in International...,Iran has filed a lawsuit alleging the US viola...,2897,3.0,conspiracy_pseudoscience,2
4,2018-07-17,21stCenturyWire,No Exemptions for EU US Wants unprecedented fi...,At some point along their quest for regime cha...,1355,3.0,conspiracy_pseudoscience,2
...,...,...,...,...,...,...,...,...
18066,2018-08-06,theRussophileorg,NATO to Set Up Its First Air Base in Western B...,This\r\n\r\n[post](https://www.strategic-cultu...,9599,0.0,nolabel,2
18067,2018-08-06,theRussophileorg,NBCs Chuck Todd Trumps Dehumanizing Tweet on P...,This\r\n\r\n[post](https://truepundit.com/nbcs...,3240,0.0,nolabel,2
18068,2018-08-06,theRussophileorg,Naya Pakistan New Pakistan Should Apply to Joi...,This\r\n\r\n[post](https://www.globalresearch....,4102,0.0,nolabel,2
18069,2018-08-06,theRussophileorg,Naya Pakistan Should Apply To Join The Eurasia...,This\r\n\r\n[post](https://www.eurasiafuture.c...,3900,0.0,nolabel,2


## **Saving the amalgamated dataset to the google drive.**

In [0]:
df_merge_col1.to_csv('/content/drive/My Drive/AlternusVeraDataSets2019/FinalExam/Transformers/SnehalYeole/Datasets_NELA/Datasets/credibility_factchecks_NELA.csv')

## **Conclusion:**

The Data Preprocessing and Data Preparation on NELA dataset required analysis on the sources and articles from the dataset. There are several labels that are specific to a factor contributing in the determination of news fakeness or truthfulness. Choosing the appropriate label for the factor and amalagmating the NELA dataset with articles and labels dataset was a challenge as it required iterating the sources and articles and amalgamating the evenly distributed labels with NELA dataset. 