# Note

- The ZIP file in this submission already includes cleaned versions of the sensitive dataset 'finance' and the public dataset 'imdb'.
- You therefore only need to run this notebook if you want to download and preprocess other datasets.
- If you want to run the algorithms using only the datasets provided in the submission ('finance' and 'imdb'), skip this notebook and go straight to `2_run_dpsu.ipynb`.

In [None]:
import email
import re

import pandas as pd

In [None]:
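import nltk
# Tokenizer models required by sent_tokenize and word_tokenize.
# Recent NLTK releases may look for 'punkt_tab' instead; download that resource too if a LookupError is raised.
nltk.download('punkt')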

**Pre-processing function from Gopi et al. 2020**

In [None]:
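def reddit_preprocess(text):
    """Strip URLs and Reddit placeholders, then tokenize with NLTK."""
    # Drop each URL together with the rest of its line.
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    # Replace remaining newlines and moderation placeholders with spaces.
    text = re.sub(r'\n', ' ', text, flags=re.MULTILINE)
    text = re.sub(r'\[removed\]', ' ', text, flags=re.MULTILINE)
    text = re.sub(r'\[deleted\]', ' ', text, flags=re.MULTILINE)
    # Split into sentences, then join the tokens of each sentence with single spaces.
    sentences = nltk.tokenize.sent_tokenize(text)
    sentences = [" ".join(nltk.tokenize.word_tokenize(s)) for s in sentences]
    return " ".join(sentences)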
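
To see what the regular-expression steps of the function above do in isolation, the following minimal sketch applies them to a made-up string. The helper name `strip_markup` and the sample text are illustrative assumptions, not part of the original code, and the NLTK tokenization step is left out so that the example runs without any downloaded models.

```python
import re

def strip_markup(text):
    """Regex-only part of the cleaning step (hypothetical helper, no tokenization)."""
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\n', ' ', text, flags=re.MULTILINE)
    text = re.sub(r'\[removed\]', ' ', text, flags=re.MULTILINE)
    text = re.sub(r'\[deleted\]', ' ', text, flags=re.MULTILINE)
    return text

# Made-up sample: a URL in the middle of a line and a moderation placeholder.
sample = "Great thread!\nSee https://example.com/post for details\n[deleted]"
cleaned = strip_markup(sample)
print(repr(cleaned))
```

Note that the URL pattern consumes everything from the URL to the end of its line, so in this sample the words "for details" are dropped along with the link.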

# Reddit

The Reddit dataset was gathered by [Gopi et al. 2020](https://arxiv.org/abs/2002.09745) and made available in their [GitHub repository](https://github.com/heyyjudes/differentially-private-set-union).

To download the dataset, go to [this link](https://github.com/heyyjudes/differentially-private-set-union/blob/ea7b39285dace35cc9e9029692802759f3e1c8e8/data/clean_askreddit.csv.zip) and download the file `clean_askreddit.csv.zip` by clicking the **Download** button.

After that, unzip the archive and put the file `clean_askreddit.csv` in this project's `data` folder.
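
The unzip-and-move step can also be scripted. The following is a minimal sketch using only the standard library; the helper name `extract_csv` is hypothetical, and the location of the downloaded archive is an assumption that depends on where your browser saved it.

```python
import zipfile
from pathlib import Path, PurePosixPath

def extract_csv(zip_path, member, data_dir="data"):
    """Extract one file from a ZIP archive into data_dir and return its path.

    Hypothetical helper: zip_path is wherever the download was saved.
    """
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Match on the bare file name so a folder inside the archive does not matter.
        matches = [n for n in archive.namelist() if PurePosixPath(n).name == member]
        if not matches:
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        target = target_dir / member
        target.write_bytes(archive.read(matches[0]))
    return target

# Example call (assumes the archive sits in the current working directory):
# extract_csv("clean_askreddit.csv.zip", "clean_askreddit.csv")
```

The same helper would work for the Kaggle archives in the sections below, given the file name that each section asks for.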

In [None]:
# Gopi et al. (2020) already share the cleaned dataset,
# so we do not have to clean it again.

# Twitter

The Twitter dataset is available on the Kaggle platform.

To download the dataset, go to [this link](https://www.kaggle.com/thoughtvector/customer-support-on-twitter) and click the **Download** button in the upper right corner.

After that, unzip the archive and put the file `twcs.csv` (found in the extracted `twcs` folder) in this project's `data` folder.

In [None]:
df_twitter = pd.read_csv('data/twcs.csv')

In [None]:
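# Copy the selection so that adding a column later does not trigger a SettingWithCopyWarning.
df_twitter = df_twitter[['author_id', 'text']].copy()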

In [None]:
df_twitter['clean_text'] = df_twitter['text'].apply(reddit_preprocess)

In [None]:
df_twitter = df_twitter[['author_id', 'clean_text']]
df_twitter.columns = ['author', 'clean_text']

In [None]:
# Save cleaned .csv
df_twitter.to_csv('data/twitter_cleaned.csv', index=False)

# Finance

The Finance dataset is available on the Kaggle platform.

To download the dataset, go to [this link](https://www.kaggle.com/miguelaenlle/massive-stock-news-analysis-db-for-nlpbacktests). In the data explorer, select `analyst_ratings_processed.csv` and then click the downward-arrow icon to download just this file.

After that, unzip the content and put the file `analyst_ratings_processed.csv` in this project's `data` folder.

**PLEASE NOTE THAT WE ARE ALREADY SUBMITTING THE CLEANED VERSION OF THIS DATASET AT `data/finance_cleaned.csv`**

In [None]:
df_finance = pd.read_csv('data/analyst_ratings_processed.csv')

In [None]:
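# Copy the selection so that adding columns later does not trigger a SettingWithCopyWarning.
df_finance = df_finance[['title']].copy()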

In [None]:
# Treat every row as a separate author: 'a0', 'a1', ...
df_finance['author'] = ['a' + str(i) for i in range(df_finance.shape[0])]

In [None]:
df_finance['clean_text'] = df_finance['title'].apply(reddit_preprocess)

In [None]:
df_finance.drop(['title'], axis=1, inplace=True)

In [None]:
# Save cleaned .csv
df_finance.to_csv('data/finance_cleaned.csv', index=False)

# IMDB

The IMDB dataset is available on the Kaggle platform.

To download the dataset, go to [this link](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) and click the **Download** button in the upper right corner.

After that, unzip the archive and put the file `IMDB Dataset.csv` in this project's `data` folder.

**PLEASE NOTE THAT WE ARE ALREADY SUBMITTING THE CLEANED VERSION OF THIS DATASET AT `data/imdb_cleaned.csv`**

In [None]:
df_imdb = pd.read_csv('data/IMDB Dataset.csv')

In [None]:
# Treat every review as a separate author: 'a0', 'a1', ...
df_imdb['author'] = ['a' + str(i) for i in range(df_imdb.shape[0])]

In [None]:
df_imdb['clean_text'] = df_imdb['review'].apply(reddit_preprocess)

In [None]:
df_imdb.drop(['review', 'sentiment'], axis=1, inplace=True)

In [None]:
# Save cleaned .csv
df_imdb.to_csv('data/imdb_cleaned.csv', index=False)

# Covid

The Covid dataset is available on the Kaggle platform.

To download the dataset, go to [this link](https://www.kaggle.com/fmitchell259/covid19-medical-paperscsv) and click the **Download** button in the upper right corner.

After that, unzip the archive and put the file `kaggle_covid-19_open_csv_format.csv` in this project's `data` folder.

In [None]:
df_covid = pd.read_csv('data/kaggle_covid-19_open_csv_format.csv')

In [None]:
df_covid = df_covid[['text_body']].dropna()

In [None]:
# Treat every row as a separate author: 'a0', 'a1', ...
df_covid['author'] = ['a' + str(i) for i in range(df_covid.shape[0])]

In [None]:
df_covid['clean_text'] = df_covid['text_body'].apply(reddit_preprocess)

In [None]:
df_covid.drop(['text_body'], axis=1, inplace=True)

In [None]:
# Save cleaned .csv
df_covid.to_csv('data/covid_cleaned.csv', index=False)

# Songs

The Songs dataset is available on the Kaggle platform.

To download the dataset, go to [this link](https://www.kaggle.com/edenbd/150k-lyrics-labeled-with-spotify-valence) and click the **Download** button in the upper right corner.

After that, unzip the archive and put the file `labeled_lyrics_cleaned.csv` in this project's `data` folder.

In [None]:
df_songs = pd.read_csv('data/labeled_lyrics_cleaned.csv', index_col=0)

In [None]:
df_songs['author'] = df_songs['artist']

In [None]:
df_songs['clean_text'] = df_songs['seq'].apply(reddit_preprocess)

In [None]:
df_songs.drop(['artist', 'seq', 'song', 'label'], axis=1, inplace=True)

In [None]:
# Save cleaned .csv
df_songs.to_csv('data/songs_cleaned.csv', index=False)

# Wiki

The Wiki dataset is available on the Kaggle platform.

To download the dataset, go to [this link](https://www.kaggle.com/markwijkhuizen/simplenormal-wikipedia-abstracts-v1) and click the **Download** button in the upper right corner.

After that, unzip the archive and put the file `wikipedia_abstracts.pkl` in this project's `data` folder.

In [None]:
df_wiki = pd.read_pickle('data/wikipedia_abstracts.pkl')

In [None]:
df_wiki = df_wiki[['abstract_original']]

In [None]:
df_wiki = df_wiki.dropna()

In [None]:
# Treat every abstract as a separate author: 'a0', 'a1', ...
df_wiki['author'] = ['a' + str(i) for i in range(df_wiki.shape[0])]

In [None]:
df_wiki['clean_text'] = df_wiki['abstract_original'].apply(reddit_preprocess)

In [None]:
df_wiki.drop(['abstract_original'], axis=1, inplace=True)

In [None]:
# Save cleaned .csv
df_wiki.to_csv('data/wikipedia_cleaned.csv', index=False)

# Enron

The Enron dataset is available on the Kaggle platform.

To download the dataset, go to [this link](https://www.kaggle.com/wcukierski/enron-email-dataset) and click the **Download** button in the upper right corner.

After that, unzip the archive and put the file `emails.csv` in this project's `data` folder.

In [None]:
df_enron = pd.read_csv("data/emails.csv")

In [None]:
df_enron['author'] = df_enron['file'].str.split('/').str[0]

In [None]:
def body_clean(mess):
    """Extract the body of a raw e-mail and apply the shared preprocessing."""
    e = email.message_from_string(mess)
    # get_payload() returns the body as a string for non-multipart messages.
    e2 = e.get_payload()
    return reddit_preprocess(e2)
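
As an illustration of the two Enron-specific steps above (taking the author from the `file` path and separating the body from the headers), here is a minimal sketch on made-up data. The sample path and message are assumptions chosen to resemble the dataset's layout, not actual records.

```python
import email

# Made-up values shaped like the 'file' and 'message' columns.
sample_file = "allen-p/_sent_mail/1."
sample_message = (
    "Message-ID: <1234.example>\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Meeting\n"
    "\n"
    "Let's meet at 10am.\n"
    "Bring the report.\n"
)

# The first path component identifies the mailbox owner, used here as the author.
author = sample_file.split("/")[0]

# Headers end at the first blank line; the payload is everything after it.
msg = email.message_from_string(sample_message)
body = msg.get_payload()

print(author)
print(msg["Subject"])
print(repr(body))
```

For multipart messages `get_payload()` returns a list of sub-messages rather than a string, so a more defensive version of `body_clean` might check `msg.is_multipart()` first; whether such messages occur in this dataset is not verified here.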

In [None]:
df_enron['clean_text'] = df_enron['message'].apply(body_clean)

In [None]:
df_enron.drop(['file','message'], axis=1, inplace=True)

In [None]:
# Save cleaned .csv
df_enron.to_csv('data/enron_cleaned.csv', index=False)
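
Every cell above that saves a `*_cleaned.csv` file writes the same two columns, `author` and `clean_text`. A quick sanity check of a saved file could look like the following sketch, which uses only the standard library. The helper name `has_expected_columns` is hypothetical, and the demonstration runs on a throwaway file with made-up content.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = ["author", "clean_text"]

def has_expected_columns(path):
    """Return True if the CSV header at `path` is exactly author, clean_text.

    Hypothetical helper; only the header row is read, so large files are cheap to check.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return header == EXPECTED_COLUMNS

# Demonstration on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_cleaned.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([EXPECTED_COLUMNS, ["a0", "Stocks rally ."]])
    demo_ok = has_expected_columns(demo_path)

print(demo_ok)
```

For the files written by this notebook, a call such as `has_expected_columns('data/finance_cleaned.csv')` should return `True`.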