## Introduction

To explore the use of online sources in health communication, this project used two publicly available datasets: HINTS and Reddit Pushshift. The Health Information National Trends Survey (HINTS) collects information about Americans' use of cancer-related information. Eleven questions from the HINTS survey were selected for review. These questions cover individuals' behaviors, trust, and perceptions related to cancer information and health communication. Collectively, they provide insight into which sources Americans trust for health and cancer information and how much they trust them, their experiences searching for such information, and their perceptions of the reliability of online and social media health content. HINTS is used to better understand how the American public looks for health information for themselves or their loved ones.

The Reddit Pushshift dataset includes free-text data from the Reddit online forum, where users can post and look for information. Reddit is organized into subreddits, which are dedicated communities centered on specific topics. By examining this dataset, this project complements insights from HINTS, offering a perspective on how individuals seek, share, and discuss cancer-related health information online. The analysis of Reddit data allows researchers to explore the dynamic and informal exchanges that occur in digital forums and to improve our understanding of how cancer information is communicated.

## HINTS Preparation of Data

In [30]:
import pyreadr
import pandas as pd
from IPython.display import display

# Load the .rda file
result = pyreadr.read_r('/Users/elizabethkovalchuk/Documents/DSAN6000/Project/fall-2024-project-team-35/data/HINTS6_R_20240524/hints6_public.rda')

# Extract the DataFrame from the loaded data
hints = result['public']  # Assuming 'public' is the name of the R object in the file

# Specify the columns to select
columns = [
    "HHID", "SeekCancerInfo", "CancerFrustrated", "CancerTrustDoctor",
    "CancerTrustFamily", "CancerTrustGov", "CancerTrustCharities",
    "CancerTrustReligiousOrgs", "CancerTrustScientists", "Electronic2_HealthInfo",
    "MisleadingHealthInfo", "TrustHCSystem"
]

# Select the relevant columns
hints_select = hints[columns]

# # Convert the 'updatedate' column if required (commented for now)
# hints_select['updatedate'] = pd.to_datetime(hints_select['updatedate'] / 1000, unit='s')

# Preview the first few rows
print("Sample data from the HINTS dataset:")
display(hints_select.head())
print(f"Shape of the original dataset: {hints_select.shape}")

Sample data from the HINTS dataset:


Unnamed: 0,HHID,SeekCancerInfo,CancerFrustrated,CancerTrustDoctor,CancerTrustFamily,CancerTrustGov,CancerTrustCharities,CancerTrustReligiousOrgs,CancerTrustScientists,Electronic2_HealthInfo,MisleadingHealthInfo,TrustHCSystem
0,21000006,No,"Inapplicable, coded 2 in SeekCancerInfo",A lot,Missing data (Not Ascertained),Missing data (Not Ascertained),Missing data (Not Ascertained),Missing data (Not Ascertained),Missing data (Not Ascertained),Question answered in error (Commission Error),I do not use social media,Very
1,21000009,No,"Inapplicable, coded 2 in SeekCancerInfo",A lot,Some,A lot,Some,Some,A lot,Yes,I do not use social media,Very
2,21000020,Yes,Somewhat disagree,A lot,Some,Some,A little,Not at all,A lot,Yes,Some,Somewhat
3,21000022,No,"Inapplicable, coded 2 in SeekCancerInfo",A lot,Missing data (Not Ascertained),Missing data (Not Ascertained),Missing data (Not Ascertained),Missing data (Not Ascertained),Missing data (Not Ascertained),"Inapplicable, coded 2 in UseInternet",I do not use social media,Somewhat
4,21000039,No,"Inapplicable, coded 2 in SeekCancerInfo",Some,Some,Some,Not at all,Not at all,Some,Yes,A lot,Somewhat


Shape of the original dataset: (6252, 12)


In [31]:
# Count missing values in each column
missing_values = hints_select.isna().sum()

# Display the count of missing values
print("Missing values per column:")
display(missing_values)


Missing values per column:


HHID                        0
SeekCancerInfo              0
CancerFrustrated            0
CancerTrustDoctor           0
CancerTrustFamily           0
CancerTrustGov              0
CancerTrustCharities        0
CancerTrustReligiousOrgs    0
CancerTrustScientists       0
Electronic2_HealthInfo      0
MisleadingHealthInfo        0
TrustHCSystem               0
dtype: int64
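
Every column reports zero missing values because non-response is stored as text labels (for example `Missing data (Not Ascertained)`) rather than as `NaN`, so `isna()` does not detect it. The sketch below uses plain Python on made-up responses to count such labels by prefix. The prefix list is an assumption based on the labels visible in this notebook's outputs, and `count_non_substantive` is a hypothetical helper, not part of the original code.

```python
from collections import Counter

# Prefixes of labels that mark a non-substantive answer.
# Assumption: derived from the labels shown in this notebook's outputs.
NON_SUBSTANTIVE_PREFIXES = (
    "Missing data",
    "Inapplicable",
    "Question answered in error",
    "Multiple responses selected in error",
)

def count_non_substantive(values):
    """Count values that start with one of the non-substantive prefixes."""
    counts = Counter()
    for value in values:
        if value.startswith(NON_SUBSTANTIVE_PREFIXES):
            counts[value] += 1
    return counts

# Made-up responses for one column (not real HINTS data).
responses = [
    "A lot",
    "Missing data (Not Ascertained)",
    "Some",
    "Missing data (Not Ascertained)",
    "Multiple responses selected in error",
]

label_counts = count_non_substantive(responses)
print(label_counts)
```

Running a count like this per column would give a more informative picture of missingness than `isna()` alone.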

In [32]:
# List of ordinal columns
ordinal_columns = [
    "SeekCancerInfo", "CancerFrustrated", "CancerTrustDoctor",
    "CancerTrustFamily", "CancerTrustGov", "CancerTrustCharities",
    "CancerTrustReligiousOrgs", "CancerTrustScientists", "Electronic2_HealthInfo",
    "MisleadingHealthInfo", "TrustHCSystem"
]

# Display unique values for each ordinal column
print("Unique values for ordinal columns:")
for column in ordinal_columns:
    unique_values = hints_select[column].unique()
    print(f"\nColumn: {column}")
    print(f"Unique Values: {unique_values}")


Unique values for ordinal columns:

Column: SeekCancerInfo
Unique Values: ['No', 'Yes', 'Missing data (Not Ascertained)']
Categories (3, object): ['Missing data (Not Ascertained)', 'No', 'Yes']

Column: CancerFrustrated
Unique Values: ['Inapplicable, coded 2 in SeekCancerInfo', 'Somewhat disagree', 'Strongly disagree', 'Somewhat agree', 'Strongly agree', 'Question answered in error (Commission Error)', 'Missing data (Filter Missing)', 'Missing data (Not Ascertained)', 'Multiple responses selected in error']
Categories (9, object): ['Inapplicable, coded 2 in SeekCancerInfo', 'Missing data (Filter Missing)', 'Missing data (Not Ascertained)', 'Multiple responses selected in error', ..., 'Somewhat agree', 'Somewhat disagree', 'Strongly agree', 'Strongly disagree']

Column: CancerTrustDoctor
Unique Values: ['A lot', 'Some', 'Not at all', 'A little', 'Missing data (Not Ascertained)', 'Multiple responses selected in error']
Categories (6, object): ['A little', 'A lot', 'Missing data (Not Asce

In [33]:
# Define the valid scales for each column
valid_scales = {
    "CancerFrustrated": ['Somewhat disagree', 'Strongly disagree', 'Somewhat agree', 'Strongly agree'],
    "CancerTrustDoctor": ['A lot', 'Some', 'Not at all', 'A little'],
    "CancerTrustFamily": ['A lot', 'Some', 'Not at all', 'A little'],
    "CancerTrustGov": ['A lot', 'Some', 'Not at all', 'A little'],
    "CancerTrustCharities": ['A lot', 'Some', 'Not at all', 'A little'],
    "CancerTrustReligiousOrgs": ['A lot', 'Some', 'Not at all', 'A little'],
    "CancerTrustScientists": ['A lot', 'Some', 'Not at all', 'A little'],
    # NOTE: the sample output above shows TrustHCSystem values such as 'Very' and 'Somewhat';
    # they are not on this scale, so the filter below drops those rows.
    "TrustHCSystem": ['A lot', 'Some', 'Not at all', 'A little'],
    "Electronic2_HealthInfo": ['Yes', 'No'],
    "MisleadingHealthInfo": ['I do not use social media', 'None', 'A little', 'Some', 'A lot']
}

# Create a copy of the original DataFrame
hints_cleaned = hints_select.copy()

# Filter the DataFrame
for column, scale in valid_scales.items():
    hints_cleaned = hints_cleaned[hints_cleaned[column].isin(scale)]

# Display the cleaned dataset and its shape
print("Data after filtering invalid values:")
display(hints_cleaned.head())
print(f"Shape of the cleaned dataset: {hints_cleaned.shape}")

Data after filtering invalid values:


Unnamed: 0,HHID,SeekCancerInfo,CancerFrustrated,CancerTrustDoctor,CancerTrustFamily,CancerTrustGov,CancerTrustCharities,CancerTrustReligiousOrgs,CancerTrustScientists,Electronic2_HealthInfo,MisleadingHealthInfo,TrustHCSystem
51,21000330,Yes,Somewhat disagree,Some,Not at all,Some,Some,Not at all,A lot,Yes,A lot,A little
112,21000976,Yes,Somewhat agree,A lot,Some,Some,Some,Some,A lot,Yes,Some,A little
136,21001112,Yes,Somewhat disagree,A little,A little,Not at all,Not at all,Not at all,A little,No,A lot,Not at all
157,21001283,Yes,Somewhat disagree,A lot,Some,Not at all,A little,Some,Not at all,No,I do not use social media,Not at all
181,21001548,Yes,Strongly agree,A lot,Some,Not at all,Some,A lot,A little,Yes,Some,A little


Shape of the cleaned dataset: (323, 12)
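
The filter keeps a row only if every listed column holds a value on its valid scale, so each column can remove rows and the reductions compound. A minimal sketch, in plain Python on made-up rows, of how the number of surviving rows could be tracked after each column filter; `filter_with_report` and the toy data are hypothetical and only illustrate the idea.

```python
# Toy rows standing in for survey responses (made-up values, not HINTS data).
rows = [
    {"CancerFrustrated": "Somewhat agree", "CancerTrustDoctor": "A lot"},
    {"CancerFrustrated": "Inapplicable, coded 2 in SeekCancerInfo", "CancerTrustDoctor": "Some"},
    {"CancerFrustrated": "Strongly disagree", "CancerTrustDoctor": "Missing data (Not Ascertained)"},
    {"CancerFrustrated": "Strongly agree", "CancerTrustDoctor": "A little"},
]

valid_scales = {
    "CancerFrustrated": ["Somewhat disagree", "Strongly disagree", "Somewhat agree", "Strongly agree"],
    "CancerTrustDoctor": ["A lot", "Some", "Not at all", "A little"],
}

def filter_with_report(rows, valid_scales):
    """Apply each column filter in turn and record how many rows remain."""
    remaining = list(rows)
    report = []
    for column, scale in valid_scales.items():
        allowed = set(scale)
        remaining = [row for row in remaining if row[column] in allowed]
        report.append((column, len(remaining)))
    return remaining, report

kept, report = filter_with_report(rows, valid_scales)
for column, n_rows in report:
    print(f"after {column}: {n_rows} rows")
```

A report like this makes it easier to see which question accounts for most of the drop from 6,252 rows to 323.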


In [35]:
# Count the number of NA or NaN values in each column
na_count = hints_cleaned.isna().sum()
#print("NA values count per column:")
#print(na_count)
#print(hints_cleaned.shape)
# Count unique values in the 'SeekCancerInfo' column
value_counts = hints_cleaned['SeekCancerInfo'].value_counts()
print("Unique value counts in 'SeekCancerInfo':")
print(value_counts)

# Save the cleaned dataset to a CSV file
output_file = "../data/csv/hints_cleaned_forML_spearman.csv"
hints_cleaned.to_csv(output_file, index=False)

print(f"Cleaned dataset saved as {output_file}")


Unique value counts in 'SeekCancerInfo':
SeekCancerInfo
Yes                               323
Missing data (Not Ascertained)      0
No                                  0
Name: count, dtype: int64
Cleaned dataset saved as ../data/csv/hints_cleaned_forML_spearman.csv
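
The output file name suggests the cleaned responses feed a later Spearman correlation step, which works on ranks rather than text labels. The sketch below shows one possible way to turn the trust labels into integer ranks. It is an assumption, not a step shown in the original notebook: the ordering in `TRUST_ORDER` and the helper `encode_trust` are hypothetical.

```python
# Assumed ordering from least to most trust; not taken from the original notebook.
TRUST_ORDER = ["Not at all", "A little", "Some", "A lot"]
TRUST_CODES = {label: rank for rank, label in enumerate(TRUST_ORDER)}

def encode_trust(values):
    """Map trust labels to integer ranks; labels outside the scale become None."""
    return [TRUST_CODES.get(value) for value in values]

encoded = encode_trust(["A lot", "Some", "Not at all", "A little"])
print(encoded)
```

Returning `None` for labels outside the scale keeps unexpected values visible instead of silently assigning them a rank.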


## Reddit Preparation of the Data

The data was queried from the Reddit Pushshift dataset. Following the themes captured in the HINTS dataset, we performed an initial eight queries that searched for comments containing keywords from each of the HINTS questions. The initial queries were run in AWS on a sample of the data. After reviewing some of the comments, we extracted all the unique subreddits. From these, we compiled a list of subreddits that actually included comments about cancer and filtered out any subreddits that were not relevant to health.

Subreddits that discussed cancer in the comments:

    subreddit_list = ['CrohnsDisease', 'thyroidcancer', 'AskDocs', 'UlcerativeColitis', 'Autoimmune', 
                  'BladderCancer', 'breastcancer', 'CancerFamilySupport', 'doihavebreastcancer', 
                  'WomensHealth', 'ProstateCancer', 'cll', 'Microbiome', 'predental', 'endometrialcancer', 
                  'cancer', 'Hashimotos', 'coloncancer', 'PreCervicalCancer', 'lymphoma', 'Lymphedema', 
                  'CancerCaregivers', 'braincancer', 'lynchsyndrome', 'nursing', 'testicularcancer', 'leukemia', 
                  'publichealth', 'Health', 'Fuckcancer', 'HealthInsurance', 'BRCA', 'Cancersurvivors', 
                  'pancreaticcancer', 'skincancer', 'stomachcancer']

These subreddits were compared to a random sample from the full Reddit dataset excluding the list of cancer subreddits above.  

Queries were conducted in Azure ML using Spark, with the data sourced from the instructor's Azure Blob container. Comments from both cancer-related and non-cancer subreddits were processed using an Azure ML job and saved as Parquet files in an Azure Blob container. The job applied a filter to separate cancer subreddits into one Parquet file and non-cancer subreddits into another. For the non-cancer subreddits, the data was randomized before filtering out the cancer-related subreddits.
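
The split described above can be sketched in plain Python on a handful of made-up comments. This is not the project's Spark job: `split_comments`, the `seed` parameter, and the toy records are hypothetical, and the subreddit set is only a short excerpt of the full list.

```python
import random

# Short excerpt of the cancer subreddit list above.
cancer_subreddits = {"cancer", "breastcancer", "lymphoma"}

# Made-up comment records (not real Reddit data).
comments = [
    {"subreddit": "cancer", "body": "starting chemo next week"},
    {"subreddit": "movies", "body": "great film"},
    {"subreddit": "lymphoma", "body": "scan results came back"},
    {"subreddit": "cooking", "body": "try adding garlic"},
    {"subreddit": "gaming", "body": "new patch is out"},
]

def split_comments(comments, cancer_subreddits, seed=0):
    """Return (cancer, non_cancer) comment lists."""
    cancer = [c for c in comments if c["subreddit"] in cancer_subreddits]
    # Shuffle first, then drop the cancer subreddits, mirroring the order described above.
    shuffled = list(comments)
    random.Random(seed).shuffle(shuffled)
    non_cancer = [c for c in shuffled if c["subreddit"] not in cancer_subreddits]
    return cancer, non_cancer

cancer, non_cancer = split_comments(comments, cancer_subreddits)
print(len(cancer), len(non_cancer))
```

Fixing the seed makes the random ordering of the comparison sample reproducible between runs.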


In [None]:
# Path to the Azure ML Blob Container
workspace_default_storage_account = "projectgstoragedfb938a3e"
workspace_default_container = "azureml-blobstore-becc8696-e562-432e-af12-8a5e3e1f9b0f"
workspace_wasbs_base_url = f"wasbs://{workspace_default_container}@{workspace_default_storage_account}.blob.core.windows.net/"

comments_path = "cancer/comments"
submissions_path = "cancer/submissions"

PySpark was used to clean the data by removing leading and trailing whitespace, removing punctuation (using regex), removing underscores, and converting text to lowercase. Both subsets were limited to 10,000 rows to keep the compute time for each job reasonable. After cleaning, the data was saved as two Parquet files in an Azure ML blob container for use in the rest of the project. Together, the cancer and non-cancer subsets total 20,000 rows.
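
The cleaning steps can be illustrated on a single string with Python's standard `re` module. This is a sketch under assumptions, not the project's Spark code: `clean_text` is a hypothetical helper and the regex pattern is one plausible choice. PySpark offers comparable column functions (`trim`, `regexp_replace`, and `lower` in `pyspark.sql.functions`), but whether the project job used them is not shown here.

```python
import re

def clean_text(text: str) -> str:
    """Strip whitespace, remove punctuation and underscores, and lowercase."""
    text = text.strip()
    # \w matches letters, digits, and underscores, so this pattern removes
    # punctuation but leaves underscores in place.
    text = re.sub(r"[^\w\s]", "", text)
    # Underscores therefore need their own removal step.
    text = text.replace("_", "")
    return text.lower()

cleaned = clean_text("  Hello, World! my_var  ")
print(cleaned)  # hello world myvar
```

The separate underscore step matters because the regex word class `\w` treats `_` as a word character.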

In [None]:
# Cancer subset of Reddit data saved to an Azure ML Blob Container
cancer_output_path = f"{workspace_wasbs_base_url}cancer_subreddit.parquet"

# Non-cancer subset of Reddit data saved to an Azure ML Blob Container
# (a separate variable name so this path does not overwrite the cancer path)
non_cancer_output_path = f"{workspace_wasbs_base_url}not_cancer_subreddit.parquet"

The source code for cleaning the Reddit data is on GitHub:
[fall-2024-project-team-35/code/spark-job-sample-data](https://github.com/gu-dsan6000/fall-2024-project-team-35/tree/main/code/spark-job-sample-data)