## Data Reduction 

Before reducing the data, we repeat the basic EDA steps to confirm that the original data is intact.
- [Data Sampling](#Data-Sampling)
    - [Simple random sampling with replacement (SRSWR)](#Simple-random-sampling-with-replacement-(SRSWR))
    - [Simple random sampling without replacement (SRSWOR)](#Simple-random-sampling-without-replacement-(SRSWOR))
    - [Cluster sampling](#Cluster-Sampling)
    - [Stratified sampling](#Stratified-Sampling)
    
Each sampling method produces a new dataset, which is saved in the `sample-dataset/` directory.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("real-dataset/WELFake_Dataset.csv")

In [3]:
df.head(4)

,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0


In [4]:
df.shape

(72134, 4)

In [5]:
df.drop(["Unnamed: 0"], axis=1, inplace=True)

In [6]:
# Replace 1 and 0 by Real and Fake (news).

df['label'] = df['label'].replace({1: 'Real', 0: 'Fake'})

In [7]:
# Fill NaN values with an empty string ('')
df = df.fillna('')

In [8]:
df.shape

(72134, 3)

In [9]:
df['label'].value_counts()

label
Real    37106
Fake    35028
Name: count, dtype: int64

In [11]:
# Tabulate the label counts and percentages
def calculate_tabulation(dataframe):
    '''
    Calculate a tabulation (count and percentage) of the
    'label' column of the given dataframe.
    '''
    
    label_count = dataframe['label'].value_counts()
    label_percentage = (label_count / len(dataframe) * 100 ).round(2)
    
    tabulation = pd.DataFrame({'Count': label_count, 'Percentage': label_percentage})
    
    total_count = tabulation['Count'].sum()
    total_percentage = tabulation['Percentage'].sum()
    tabulation.loc['Total'] = [total_count, total_percentage]
    
    return tabulation
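The same tabulation can also be sketched with `value_counts(normalize=True)`. The helper name `tabulate_labels` and the toy labels below are illustrative assumptions, not part of the notebook. This variant sets the total percentage to 100 directly, since summing rounded percentages can drift slightly (for example, three equal classes would give 99.99).

```python
import pandas as pd


def tabulate_labels(labels: pd.Series) -> pd.DataFrame:
    """Count and percentage per class, plus a 'Total' row (illustrative helper)."""
    counts = labels.value_counts()
    # normalize=True returns class proportions that sum to 1.
    percentages = (labels.value_counts(normalize=True) * 100).round(2)
    table = pd.DataFrame({"Count": counts, "Percentage": percentages})
    # Set the total to 100 directly instead of summing rounded values.
    table.loc["Total"] = [counts.sum(), 100.0]
    return table


# Toy labels (hypothetical data, only to show the shape of the result).
toy_labels = pd.Series(["Real", "Real", "Real", "Fake"], name="label")
result = tabulate_labels(toy_labels)
print(result)
```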

In [12]:
calculate_tabulation(df).T

label,Real,Fake,Total
Count,37106.0,35028.0,72134.0
Percentage,51.44,48.56,100.0


In [13]:
df.head(3)

,title,text,label
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,Real
1,,Did they post their votes for Hillary already?,Real
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",Real


## Data Sampling

In [14]:
# 25% of given dataset.

percentage = 0.25
sample_size = int(percentage * df.shape[0])
sample_size

18033

### Simple random sampling with replacement (SRSWR)

In [16]:
df_srs_wr = df.sample(n=sample_size, replace=True, random_state=42, ignore_index=True)

In [19]:
df_srs_wr.head()

,title,text,label
0,IT’S CHRISTMAS IN OCTOBER: American Debt Is $1...,This one sentence should scare every taxpaying...,Real
1,High School Forced to Change Mascot over Accus...,A school district in Kentucky has canceled the...,Fake
2,Vanguard CEO 'encouraged' by efforts to revise...,(Reuters) - Vanguard Group Chief Executive Bil...,Fake
3,UNREAL! CBS’S TED KOPPEL Tells Sean Hannity He...,,Real
4,I’m Running Out of Popcorn – Harvey Organ,Let us have a look at the data for today \n. \...,Real


In [17]:
calculate_tabulation(df_srs_wr)

label,Count,Percentage
Real,9287.0,51.5
Fake,8746.0,48.5
Total,18033.0,100.0


In [20]:
df_srs_wr.to_csv('sample-dataset/dataset-with-replacement.csv')
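A note on saving: `to_csv` writes the row index by default, and reading such a file back produces an extra `Unnamed: 0` column like the one dropped earlier in this notebook. The sketch below uses an in-memory buffer and a toy frame (both assumptions for illustration) to show the difference that `index=False` makes.

```python
import io

import pandas as pd

# Toy frame (hypothetical data) standing in for a sampled dataset.
toy = pd.DataFrame({"title": ["a", "b"], "label": ["Real", "Fake"]})

# Default behaviour: the index is written as an unnamed first column.
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with = pd.read_csv(buf_with_index)

# index=False: only the data columns are written.
buf_without_index = io.StringIO()
toy.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
reloaded_without = pd.read_csv(buf_without_index)

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```

Passing `index=False` to the `to_csv` calls in this notebook would avoid having to drop that column after reloading the samples.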

### Simple random sampling without replacement (SRSWOR)

In [23]:
df_srs_wor = df.sample(n=sample_size, replace=False, random_state=42, ignore_index=True)

In [24]:
calculate_tabulation(df_srs_wor).T

label,Real,Fake,Total
Count,9163.0,8870.0,18033.0
Percentage,50.81,49.19,100.0


In [25]:
df_srs_wor.to_csv('sample-dataset/dataset-without-replacement.csv')
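To make the difference between the two simple random sampling schemes concrete, here is a minimal sketch on a toy frame (hypothetical data). It keeps the original index, unlike the `ignore_index=True` calls above, so repeated draws can be counted. Drawing more rows than exist is only possible with replacement and guarantees repeats.

```python
import pandas as pd

# Toy population of 100 rows (hypothetical data).
toy = pd.DataFrame({"text": [f"article {i}" for i in range(100)]})

# With replacement: a row can be drawn more than once, so n may exceed
# the population size, and repeats are then unavoidable.
with_repl = toy.sample(n=150, replace=True, random_state=42)

# Without replacement: every row appears at most once.
without_repl = toy.sample(n=25, replace=False, random_state=42)

# Count draws whose original index label was already seen.
dup_with = int(with_repl.index.duplicated().sum())
dup_without = int(without_repl.index.duplicated().sum())
print(dup_with, dup_without)
```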

### Cluster Sampling

In [26]:
# Treat each distinct 'label' value as a cluster. Note: the data has only two
# labels, so size=2 selects every cluster and returns the full dataset.
clusters = df['label'].unique()
selected_clusters = np.random.choice(clusters, size=2, replace=False)

df_cluster_sampling = df[df['label'].isin(selected_clusters)]

In [27]:
calculate_tabulation(df_cluster_sampling)

label,Count,Percentage
Real,37106.0,51.44
Fake,35028.0,48.56
Total,72134.0,100.0


In [28]:
df_cluster_sampling.to_csv('sample-dataset/dataset-with-cluster-sampling.csv')
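Cluster sampling reduces the data only when fewer clusters are selected than exist. The tabulation above shows all 72134 rows because both label values were chosen. The sketch below illustrates the idea on a toy frame with a hypothetical `source` column acting as the cluster key. The column name, the data, and the seeded generator are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 12 articles spread evenly over 6 sources.
toy = pd.DataFrame({
    "source": [f"source_{i % 6}" for i in range(12)],
    "text": [f"article {i}" for i in range(12)],
})

# Seeded generator so the selection is reproducible.
rng = np.random.default_rng(42)

# Pick 2 of the 6 clusters, then keep every row from the chosen clusters.
clusters = toy["source"].unique().tolist()
selected = rng.choice(clusters, size=2, replace=False)
cluster_sample = toy[toy["source"].isin(selected)]

print(sorted(selected), len(cluster_sample))
```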

### Stratified Sampling

In [29]:
# Use 'label' as the stratum and draw an equal number of rows from each stratum
strata_size = sample_size  // 2
strata_samples = df.groupby('label').apply(lambda x: x.sample(n=strata_size, random_state=42))

# Combine the samples from each stratum
df_stratified_sampling = strata_samples.reset_index(drop=True)

In [30]:
calculate_tabulation(df_stratified_sampling)

label,Count,Percentage
Fake,9016.0,50.0
Real,9016.0,50.0
Total,18032.0,100.0


In [31]:
df_stratified_sampling.to_csv('sample-dataset/dataset-with-stratified-sampling.csv')
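The cell above draws the same number of rows from each stratum, which gives a 50/50 split rather than the original 51.44/48.56 proportions. If the class proportions should be preserved instead, one option is proportional allocation with `groupby(...).sample(frac=...)`. The sketch below uses a toy frame (hypothetical data) with a 60/40 split.

```python
import pandas as pd

# Toy data (hypothetical): 60 'Real' and 40 'Fake' rows.
toy = pd.DataFrame({
    "label": ["Real"] * 60 + ["Fake"] * 40,
    "text": [f"article {i}" for i in range(100)],
})

# Draw 25% from each stratum so the sample keeps the 60/40 proportions.
proportional = toy.groupby("label").sample(frac=0.25, random_state=42)

counts = proportional["label"].value_counts()
print(counts)
```

Equal allocation, as used in this notebook, is the better fit when a balanced sample is wanted. Proportional allocation is the better fit when the sample should mirror the population.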