# Data merging

In this notebook, I import several fake-news datasets from different sources and merge them into a single, consistent dataset.

## Step 1: Dataset harmonization
This step transforms every dataset into a unified format with a consistent structure, which the later merging and analysis depend on.<br>
For each dataset, I keep only two features: the text of the news item (`text`) and the label (`label`), encoded as 1 for fake and 0 for true.
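
The harmonization rule above can be sketched as one reusable function. This is a minimal sketch, not the code this notebook runs: `harmonize`, its parameters, and the toy frame are hypothetical names introduced for illustration, and it assumes each source has exactly one text column and one label column.

```python
import pandas as pd

def harmonize(df, text_col, label_col, label_map):
    """Hypothetical helper: return a (text, label) frame with 1 = fake, 0 = true."""
    out = df[[text_col, label_col]].rename(
        columns={text_col: "text", label_col: "label"}
    )
    # Keep only rows whose raw label appears in the mapping, then encode them
    out = out[out["label"].isin(list(label_map))].copy()
    out["label"] = out["label"].map(label_map).astype("int64")
    # Drop rows with missing text and renumber the index
    return out.dropna(subset=["text"]).reset_index(drop=True)

# Toy input (made-up rows, not taken from any of the datasets)
raw = pd.DataFrame({
    "News_Headline": ["claim a", "claim b", "claim c", None],
    "Label": ["TRUE", "FALSE", "half-true", "TRUE"],
})
toy = harmonize(raw, "News_Headline", "Label", {"FALSE": 1, "TRUE": 0})
print(toy)
```

The per-dataset cells below perform the same operations one at a time, which makes it easier to inspect each source along the way.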

### ISOT Fake News Dataset
[ISOT Fake News Dataset](https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/) <br>
The ISOT Fake News dataset is a compilation of several thousand fake and truthful news articles, collected from legitimate news sites and from sites flagged as unreliable by Politifact.com.


In [1]:
import zipfile
zip_path = 'C:\\Users\\Slim\\fake-news-classification\\data\\zipfiles\\news_dataset.zip'

# Specify the target directory where the files should be extracted
target_directory = 'C:\\Users\\Slim\\fake-news-classification\\data'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    # Extract all files to the target directory
    zip_ref.extractall(target_directory)
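
The same extraction step recurs for every dataset in this notebook, so it could be wrapped in a small helper. The sketch below is only an illustration under that assumption: `extract_archive` is a hypothetical name, and the toy archive is created in a temporary directory rather than at the notebook's real paths.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_archive(zip_path, target_directory):
    """Hypothetical helper: extract all members and return their names."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target_directory)
        return zip_ref.namelist()

with tempfile.TemporaryDirectory() as tmp:
    # Build a toy archive so the example is self-contained
    archive = Path(tmp) / "toy.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("Fake.csv", "title,text\n")
    names = extract_archive(archive, Path(tmp) / "data")
    extracted_exists = (Path(tmp) / "data" / "Fake.csv").exists()

print(names, extracted_exists)
```

Returning the member names makes it easy to see which CSV files an archive produced before reading them.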

In [2]:
import pandas as pd
fakedf = pd.read_csv('C:\\Users\\Slim\\fake-news-classification\\data\\Fake.csv')
truedf = pd.read_csv('C:\\Users\\Slim\\fake-news-classification\\data\\True.csv')

In [3]:
fakedf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB


In [4]:
truedf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB


In [5]:
fakedf['label'] = 1
truedf['label'] = 0

In [6]:
df1 = pd.concat([fakedf, truedf], ignore_index=True)

In [7]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB


In [8]:
df1.drop(['title', 'subject', 'date'], axis=1, inplace=True)

In [9]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    44898 non-null  object
 1   label   44898 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 701.7+ KB


In [10]:
df1['label'].value_counts()

label
1    23481
0    21417
Name: count, dtype: int64

### WELFake dataset
[WELFake dataset](https://zenodo.org/record/4561253) <br>
WELFake is a dataset of 72,134 news articles: 35,028 real and 37,106 fake.
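
Published class sizes like these give a cheap sanity check on a source's label encoding before any remapping. The sketch below is one possible way to do that, using only the standard library; `label_counts_match` is a hypothetical helper and the toy labels are made up. It accepts any iterable of labels, so a pandas column could be passed as well.

```python
from collections import Counter

def label_counts_match(labels, expected_counts):
    """Hypothetical helper: True if every label occurs exactly as often as expected."""
    observed = Counter(labels)
    return all(observed.get(label, 0) == n for label, n in expected_counts.items())

# Toy labels: two rows with label 0 and three rows with label 1
toy_labels = [0, 0, 1, 1, 1]
matches = label_counts_match(toy_labels, {0: 2, 1: 3})
mismatch = label_counts_match(toy_labels, {0: 3, 1: 2})
print(matches, mismatch)
```

If the counts only match with the classes swapped, the raw encoding is probably the opposite of what was assumed.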


In [11]:
zip_path = 'C:\\Users\\Slim\\fake-news-classification\\data\\zipfiles\\WELFake_Dataset.zip'

# Specify the target directory where the files should be extracted
target_directory = 'C:\\Users\\Slim\\fake-news-classification\\data'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    # Extract all files to the target directory
    zip_ref.extractall(target_directory)

In [12]:
df2 = pd.read_csv('C:\\Users\\Slim\\fake-news-classification\\data\\WELFake_Dataset.csv')

In [13]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72134 entries, 0 to 72133
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  72134 non-null  int64 
 1   title       71576 non-null  object
 2   text        72095 non-null  object
 3   label       72134 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 2.2+ MB


In [14]:
# Drop rows with a missing title or text
df2 = df2.dropna()

In [15]:
df2.drop(['Unnamed: 0', 'title'], axis=1, inplace=True)

In [16]:
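# Flip the labels to this notebook's convention (1 = fake, 0 = true),
# assuming the raw WELFake encoding is 0 = fake, 1 = real.
# Caveat: after this flip, label 1 has 35,028 rows, which is the published
# count of real articles, so the raw encoding is worth double-checking.
df2['label'] = df2['label'].map({0: 1, 1: 0})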

In [17]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 71537 entries, 0 to 72133
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    71537 non-null  object
 1   label   71537 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.6+ MB


In [18]:
df2['label'].value_counts()

label
0    36509
1    35028
Name: count, dtype: int64

### Fake news dataset
[Fake news dataset from Kaggle](https://www.kaggle.com/datasets/hassanamin/textdb3) <br>
This is a labeled dataset from Kaggle whose labels are the strings REAL and FAKE.


In [19]:
zip_path = 'C:\\Users\\Slim\\fake-news-classification\\data\\zipfiles\\archive.zip'

# Specify the target directory where the files should be extracted
target_directory = 'C:\\Users\\Slim\\fake-news-classification\\data'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    # Extract all files to the target directory
    zip_ref.extractall(target_directory)

In [20]:
df3 = pd.read_csv('C:\\Users\\Slim\\fake-news-classification\\data\\fake_or_real_news.csv')

In [21]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  6335 non-null   int64 
 1   title       6335 non-null   object
 2   text        6335 non-null   object
 3   label       6335 non-null   object
dtypes: int64(1), object(3)
memory usage: 198.1+ KB


In [22]:
df3.drop(['Unnamed: 0', 'title'], axis=1, inplace=True)

In [23]:
df3['label'].value_counts()

label
REAL    3171
FAKE    3164
Name: count, dtype: int64

In [24]:
df3['label'] = df3['label'].map({'REAL': 0, 'FAKE': 1})

In [25]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    6335 non-null   object
 1   label   6335 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 99.1+ KB


In [26]:
df3['label'].value_counts()

label
0    3171
1    3164
Name: count, dtype: int64

### Fake-Real News Kaggle dataset
[Fake-real news Kaggle dataset](https://www.kaggle.com/datasets/techykajal/fakereal-news) <br>
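This is another labeled dataset from Kaggle. Its labels span several truthfulness ratings, so I keep only the rows rated TRUE or FALSE. It has no article body, so the news headline serves as the text.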


In [27]:
zip_path = 'C:\\Users\\Slim\\fake-news-classification\\data\\zipfiles\\archive2.zip'

# Specify the target directory where the files should be extracted
target_directory = 'C:\\Users\\Slim\\fake-news-classification\\data'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    # Extract all files to the target directory
    zip_ref.extractall(target_directory)

In [28]:
df4 = pd.read_csv('C:\\Users\\Slim\\fake-news-classification\\data\\New Task.csv', encoding='latin-1')

In [29]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9960 entries, 0 to 9959
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   News_Headline  9960 non-null   object
 1   Link_Of_News   9960 non-null   object
 2   Source         9960 non-null   object
 3   Stated_On      9960 non-null   object
 4   Date           9960 non-null   object
 5   Label          9960 non-null   object
dtypes: object(6)
memory usage: 467.0+ KB


In [30]:
df4.drop(columns=['Link_Of_News', 'Source', 'Stated_On', 'Date'], inplace=True)

In [31]:
df4.rename(columns={"News_Headline": "text", "Label": "label"}, inplace=True)

In [32]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9960 entries, 0 to 9959
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    9960 non-null   object
 1   label   9960 non-null   object
dtypes: object(2)
memory usage: 155.8+ KB


In [33]:
df4['label'].value_counts()

label
FALSE          2273
barely-true    1737
mostly-true    1722
half-true      1685
pants-fire     1402
TRUE           1036
full-flop        70
half-flip        27
no-flip           8
Name: count, dtype: int64

In [34]:
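# Keep only the unambiguous ratings; .copy() avoids a SettingWithCopyWarning
# when the label column is reassigned later
df4 = df4[df4['label'].isin(['TRUE', 'FALSE'])].copy()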

In [35]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3309 entries, 0 to 9959
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3309 non-null   object
 1   label   3309 non-null   object
dtypes: object(2)
memory usage: 77.6+ KB


In [36]:
df4['label'].value_counts()

label
FALSE    2273
TRUE     1036
Name: count, dtype: int64

In [37]:
df4['label'] = df4['label'].map({'FALSE': 1, 'TRUE': 0})

In [38]:
df4['label'].value_counts()

label
1    2273
0    1036
Name: count, dtype: int64

### Fake News Dataset
[Fake News Dataset from Kaggle](https://www.kaggle.com/datasets/pnkjgpt/fake-news-dataset) <br>
This is another labeled fake news dataset from Kaggle.


In [39]:
zip_path = 'C:\\Users\\Slim\\fake-news-classification\\data\\zipfiles\\archive3.zip'

# Specify the target directory where the files should be extracted
target_directory = 'C:\\Users\\Slim\\fake-news-classification\\data'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    # Extract all files to the target directory
    zip_ref.extractall(target_directory)

In [40]:
df5 = pd.read_csv('C:\\Users\\Slim\\fake-news-classification\\data\\train.csv')

In [41]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   index       40000 non-null  int64 
 1   title       40000 non-null  object
 2   text        40000 non-null  object
 3   subject     40000 non-null  object
 4   date        40000 non-null  object
 5   class       40000 non-null  object
 6   Unnamed: 6  1 non-null      object
dtypes: int64(1), object(6)
memory usage: 2.1+ MB


In [42]:
df5.drop(['index', 'title', 'subject', 'date', 'Unnamed: 6'], axis=1, inplace=True)

In [43]:
df5.rename(columns={"class": "label"}, inplace=True)

In [44]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    40000 non-null  object
 1   label   40000 non-null  object
dtypes: object(2)
memory usage: 625.1+ KB


In [45]:
df5['label'].value_counts()

label
Fake                20886
Real                19113
February 5, 2017        1
Name: count, dtype: int64

In [46]:
# Remove the single malformed row whose label is a date instead of Fake/Real
df5 = df5.drop(df5[df5['label'] == 'February 5, 2017'].index)

In [47]:
df5['label'] = df5['label'].map({'Fake': 1, 'Real': 0})

In [48]:
df5['label'].value_counts()

label
1    20886
0    19113
Name: count, dtype: int64

## Step 2: Data merging
This step concatenates all the harmonized datasets above and saves the unified dataset as a new CSV file for later use.
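
Since concatenation silently accepts mismatched frames, a schema check before merging can catch a dataset that was not fully harmonized. The following is a minimal sketch under that assumption: `merge_harmonized` is a hypothetical wrapper around `pd.concat`, and the two toy frames stand in for the real datasets.

```python
import pandas as pd

def merge_harmonized(frames):
    """Hypothetical helper: validate each frame, then concatenate them row-wise."""
    for i, frame in enumerate(frames):
        # Every frame must expose exactly the two harmonized columns
        if list(frame.columns) != ["text", "label"]:
            raise ValueError(f"frame {i} has columns {list(frame.columns)}")
        # Labels must already be encoded as 1 (fake) or 0 (true)
        if not frame["label"].isin([0, 1]).all():
            raise ValueError(f"frame {i} has labels outside {{0, 1}}")
    return pd.concat(frames, axis=0, ignore_index=True)

# Toy frames (made-up rows)
a = pd.DataFrame({"text": ["x", "y"], "label": [1, 0]})
b = pd.DataFrame({"text": ["z"], "label": [0]})
merged = merge_harmonized([a, b])
print(len(merged), merged["label"].value_counts().to_dict())
```

With `ignore_index=True` the merged frame gets a fresh 0..n-1 index, so index gaps left by earlier row filtering do not carry over.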

In [49]:
finaldf = pd.concat([df1, df2, df3, df4, df5], axis=0, ignore_index=True)

In [50]:
finaldf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166078 entries, 0 to 166077
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    166078 non-null  object
 1   label   166078 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.5+ MB


In [51]:
file_path = 'C:\\Users\\Slim\\fake-news-classification\\data\\final.csv'
finaldf.to_csv(file_path, index=False)

In [52]:
import zipfile

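def zip_csv_file(csv_file_path, zip_file_path):
    # Create a new ZIP file; ZIP_DEFLATED compresses the content
    # (the default, ZIP_STORED, would only store it uncompressed)
    with zipfile.ZipFile(zip_file_path, 'w', compression=zipfile.ZIP_DEFLATED) as zip_file:
        # Add the CSV file to the ZIP file under the name data.csv
        zip_file.write(csv_file_path, arcname='data.csv')
    print(f'{csv_file_path} has been zipped to {zip_file_path}.')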

In [53]:
csv_file_path = 'data/final.csv'
zip_file_path = 'data/final.zip'
zip_csv_file(csv_file_path, zip_file_path)

data/final.csv has been zipped to data/final.zip.


To save disk space, I delete the zip and CSV files that are no longer needed, since final.zip now contains all the data.
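
A cleanup like this can also be written so that it only touches regular files and reports what it removed. The sketch below is one way to do it with `pathlib`; `remove_all_except` is a hypothetical helper, and it runs against a temporary directory with placeholder files rather than the real data folder.

```python
import tempfile
from pathlib import Path

def remove_all_except(directory, keep):
    """Hypothetical helper: delete regular files not listed in `keep`; return their names."""
    removed = []
    # Materialize the listing first so deletions do not affect the iteration
    for entry in list(Path(directory).iterdir()):
        if entry.is_file() and entry.name not in keep:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder files standing in for the extracted CSVs and the final archive
    for name in ("final.zip", "Fake.csv", "True.csv"):
        (Path(tmp) / name).write_text("placeholder")
    removed = remove_all_except(tmp, keep={"final.zip"})
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed, remaining)
```

Skipping anything that is not a regular file means a leftover subdirectory does not raise an error halfway through the cleanup.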

In [54]:
import shutil
shutil.rmtree('data/zipfiles')

In [55]:
from os import listdir
files = listdir('data')

In [56]:
from os import remove
for file in files:
    if file != 'final.zip':
        remove('data/'+file)

In [57]:
shutil.move("C:\\Users\\Slim\\fake-news-classification\\data\\final.zip", "C:\\Users\\Slim\\final.zip")

'C:\\Users\\Slim\\final.zip'
