# Twitter Financial Dataset Preparation

In [1]:
from datasets import load_dataset

# Specify the dataset you want to download
dataset_name = "zeroshot/twitter-financial-news-topic"

# Load the dataset
dataset = load_dataset(dataset_name)

Found cached dataset csv (/Users/carlosvarela/.cache/huggingface/datasets/zeroshot___csv/zeroshot--twitter-financial-news-topic-7546d9df5195ffd4/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)



In [2]:
# Access the dataset splits. The "validation" split serves as the test set
# throughout this notebook.
train_data = dataset["train"]
test_data = dataset["validation"]

In [3]:
train_data

Dataset({
    features: ['text', 'label'],
    num_rows: 16990
})

In [4]:
import pandas as pd
import spacy
import matplotlib.pyplot as plt

In [5]:
#converting to dataframes for initial exploration:
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

In [6]:
print('train dataframe info:')
print(train_df.info())
print()
print('test dataframe info:')
print(test_df.info())

train dataframe info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16990 entries, 0 to 16989
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    16990 non-null  object
 1   label   16990 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 265.6+ KB
None

test dataframe info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4117 entries, 0 to 4116
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    4117 non-null   object
 1   label   4117 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 64.5+ KB
None


In [7]:
print('train dataframe description: ',train_df.describe())
print()
print('test dataframe description: ',test_df.describe())

train dataframe description:                label
count  16990.000000
mean       9.547616
std        6.401000
min        0.000000
25%        2.000000
50%        9.000000
75%       16.000000
max       19.000000

test dataframe description:               label
count  4117.000000
mean      9.488220
std       6.448169
min       0.000000
25%       2.000000
50%       9.000000
75%      16.000000
max      19.000000
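Because `label` is a class ID rather than a quantity, the mean and quartiles above say little about the data; per-class counts are usually more informative. A minimal sketch using only the standard library, with a made-up list of labels standing in for `train_df['label']` (on the real DataFrame, `train_df['label'].value_counts()` gives the same kind of table):

```python
from collections import Counter

# Hypothetical labels standing in for train_df['label'].
labels = [0, 0, 3, 3, 3, 19, 2, 2, 0]

counts = Counter(labels)
total = len(labels)

# Share of rows per class, ordered from most to least frequent.
shares = {label: n / total for label, n in counts.most_common()}
for label, share in shares.items():
    print(f"label {label}: {counts[label]} rows ({share:.1%})")
```

A strongly skewed distribution here would suggest checking per-class metrics later rather than overall accuracy alone.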


In [8]:
#Perform initial EDA prior to cleaning for training split:
from ydata_profiling import ProfileReport

profile = ProfileReport(train_df, title="Twitter Financial News Profiling Report")

In [9]:
profile




In [10]:
#perform initial EDA prior to cleaning for testing split:
from ydata_profiling import ProfileReport

testing_set_profile = ProfileReport(test_df, title="Twitter Financial News (test data) Report")

In [11]:
testing_set_profile




## Data cleaning

**Deleting Unnecessary Data**
- We will remove leading and trailing spaces.
- We will remove special characters before tokenization for better performance.
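The two cleaning steps listed here can be sketched on plain strings with the standard library. The sample strings are adapted from rows of the dataset for illustration, and `clean_text` is a hypothetical helper name, not something defined in this notebook:

```python
import re

# Character class removed during cleaning: '/', ':' and '@'.
SPECIAL_CHARS = re.compile(r"[/:@]")

def clean_text(text: str) -> str:
    """Lowercase, drop '/', ':' and '@', then trim outer whitespace."""
    return SPECIAL_CHARS.sub("", text.lower()).strip()

samples = [
    "  Analyst call of the day for @CNBCPro  ",
    "Russian https://t.co/R0iPhyo5p7 sells",
]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

The second sample shows a side effect worth keeping in mind: dropping only `/` and `:` turns a URL into a single opaque token instead of removing it, so stripping URLs beforehand may be preferable if they are not meant to reach the tokenizer.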

In [12]:
# Lowercase the text, drop '/', ':' and '@', and trim leading/trailing whitespace.
# A raw string avoids the invalid '\/' escape; '/' needs no escaping in a regex.
train_df['cleaned title'] = train_df['text'].str.lower().str.replace(r'[/:@]', '', regex=True).str.strip()
test_df['cleaned title'] = test_df['text'].str.lower().str.replace(r'[/:@]', '', regex=True).str.strip()

In [13]:
train_df

|  | text | label | cleaned title |
|---|---|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... | 0 | here are thursday's biggest analyst calls appl... |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... | 0 | buy las vegas sands as travel to singapore bui... |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... | 0 | piper sandler downgrades docusign to sell, cit... |
| 3 | Analysts react to Tesla's latest earnings, bre... | 0 | analysts react to tesla's latest earnings, bre... |
| 4 | Netflix and its peers are set for a ‘return to... | 0 | netflix and its peers are set for a ‘return to... |
| ... | ... | ... | ... |
| 16985 | KfW credit line for Uniper could be raised to ... | 3 | kfw credit line for uniper could be raised to ... |
| 16986 | KfW credit line for Uniper could be raised to ... | 3 | kfw credit line for uniper could be raised to ... |
| 16987 | Russian https://t.co/R0iPhyo5p7 sells 1 bln r... | 3 | russian httpst.cor0iphyo5p7 sells 1 bln roubl... |
| 16988 | Global ESG bond issuance posts H1 dip as supra... | 3 | global esg bond issuance posts h1 dip as supra... |


In [14]:
test_df

|  | text | label | cleaned title |
|---|---|---|---|
| 0 | Analyst call of the day for @CNBCPro subscribe... | 0 | analyst call of the day for cnbcpro subscriber... |
| 1 | Loop upgrades CSX to buy, says it's a good pla... | 0 | loop upgrades csx to buy, says it's a good pla... |
| 2 | BofA believes we're already in a recession — a... | 0 | bofa believes we're already in a recession — a... |
| 3 | JPMorgan sees these derivative plays as best w... | 0 | jpmorgan sees these derivative plays as best w... |
| 4 | Morgan Stanley's Huberty sees Apple earnings m... | 0 | morgan stanley's huberty sees apple earnings m... |
| ... | ... | ... | ... |
| 4112 | Dollar bonds of Chinese developers fall as str... | 3 | dollar bonds of chinese developers fall as str... |
| 4113 | Longer maturity Treasury yields have scope to ... | 3 | longer maturity treasury yields have scope to ... |
| 4114 | Pimco buys €1bn of Apollo buyout loans from ba... | 3 | pimco buys €1bn of apollo buyout loans from ba... |
| 4115 | Analysis: Banks' snubbing of junk-rated loan f... | 3 | analysis banks' snubbing of junk-rated loan fu... |


**Inquiries**
- Should we leave capital letters for certain words like USA?
- Should we keep punctuation for situations such as U.S.?
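One way to explore these questions is to lowercase everything except a short list of protected terms. This matters because, for instance, a lowercased `US` becomes `us`, which appears in spaCy's English stop-word list. The sketch below is only an illustration: the `PROTECTED` set and the `lower_except_protected` helper are hypothetical and not part of this notebook.

```python
# Hypothetical set of terms whose casing or punctuation should survive cleaning.
PROTECTED = {"USA", "U.S.", "ESG"}

def lower_except_protected(text: str, protected=PROTECTED) -> str:
    """Lowercase every whitespace-separated token except protected ones."""
    return " ".join(
        tok if tok in protected else tok.lower() for tok in text.split()
    )

result = lower_except_protected("Global ESG bond issuance in the U.S. posts H1 dip")
print(result)
```

This matches whole whitespace-separated tokens only, so a term followed directly by a comma (such as `U.S.,`) would still be lowercased; a fuller version would need to separate trailing punctuation first.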

**Tokenization, lemmatization, and stop word removal**

- Tokenizers divide strings into lists of substrings. For example, a tokenizer can separate the words and punctuation in a string.
- spaCy is an NLP library. We will use it for text pre-processing: tokenizing, lemmatizing, and removing stop words.

In short, we will use spaCy to tokenize the text and remove stop words by comparing each token's lemma against its pre-built stop word list.
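To make the idea concrete without loading a language model, here is a simplified stand-in written with the standard library. It is not how spaCy tokenizes internally and it does no lemmatization; the regex tokenizer, the tiny `STOP_WORDS_DEMO` set, and both function names are assumptions made for this illustration.

```python
import re

# Tiny illustrative stop-word set; spaCy's STOP_WORDS is far larger.
STOP_WORDS_DEMO = {"the", "a", "to", "as", "is", "it", "s"}

def tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stops_and_punct(tokens: list[str]) -> list[str]:
    """Keep tokens that start with a letter or digit and are not stop words."""
    return [t for t in tokens if t[0].isalnum() and t not in STOP_WORDS_DEMO]

tokens = tokenize("loop upgrades csx to buy, says it's a good place")
kept = remove_stops_and_punct(tokens)
print(tokens)
print(kept)
```

Unlike the spaCy pipeline, this version keeps inflected forms such as `upgrades` as they are, since no lemmatizer is involved.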

**Preparing the records**

In [15]:
from spacy.lang.en.stop_words import STOP_WORDS

In [16]:
# Loading the language model:
nlp = spacy.load("en_core_web_sm")

# Applying model to a dataframe column:
train_df['docs'] = train_df['cleaned title'].apply(nlp)
test_df['docs'] = test_df['cleaned title'].apply(nlp)

# Return the lemmas of a parsed doc, skipping stop words and punctuation.
# Whitespace tokens are not filtered, so runs of spaces can leave blank entries in the result.
def nlp_tokenizer(doc):
    return [token.lemma_ for token in doc if token.lemma_ not in STOP_WORDS and not token.is_punct]

train_df['docs'] = train_df['docs'].apply(nlp_tokenizer)
test_df['docs'] = test_df['docs'].apply(nlp_tokenizer)

In [17]:
train_df.head(3)

|  | text | label | cleaned title | docs |
|---|---|---|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... | 0 | here are thursday's biggest analyst calls appl... | [thursday, big, analyst, apple, amazon, tesla,... |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... | 0 | buy las vegas sands as travel to singapore bui... | [buy, las, vegas, sand, travel, singapore, bui... |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... | 0 | piper sandler downgrades docusign to sell, cit... | [piper, sandler, downgrade, docusign, sell, ci... |


In [18]:
test_df.head(3)

|  | text | label | cleaned title | docs |
|---|---|---|---|---|
| 0 | Analyst call of the day for @CNBCPro subscribe... | 0 | analyst call of the day for cnbcpro subscriber... | [analyst, day, cnbcpro, subscriber, goldman, s... |
| 1 | Loop upgrades CSX to buy, says it's a good pla... | 0 | loop upgrades csx to buy, says it's a good pla... | [loop, upgrade, csx, buy, good, place, park, m... |
| 2 | BofA believes we're already in a recession — a... | 0 | bofa believes we're already in a recession — a... | [bofa, believe, recession, stock, beat, , htt... |


In [19]:
train_docs = train_df[['docs','label']]
test_docs = test_df[['docs','label']]

In [20]:
train_docs

|  | docs | label |
|---|---|---|
| 0 | [thursday, big, analyst, apple, amazon, tesla,... | 0 |
| 1 | [buy, las, vegas, sand, travel, singapore, bui... | 0 |
| 2 | [piper, sandler, downgrade, docusign, sell, ci... | 0 |
| 3 | [analyst, react, tesla, late, earning, break, ... | 0 |
| 4 | [netflix, peer, set, return, growth, analyst, ... | 0 |
| ... | ... | ... |
| 16985 | [kfw, credit, line, uniper, raise, 8, bln, eur... | 3 |
| 16986 | [kfw, credit, line, uniper, raise, 8, bln, eur... | 3 |
| 16987 | [russian, , httpst.cor0iphyo5p7, sell, 1, bln... | 3 |
| 16988 | [global, esg, bond, issuance, post, h1, dip, s... | 3 |


In [21]:
test_docs

|  | docs | label |
|---|---|---|
| 0 | [analyst, day, cnbcpro, subscriber, goldman, s... | 0 |
| 1 | [loop, upgrade, csx, buy, good, place, park, m... | 0 |
| 2 | [bofa, believe, recession, stock, beat, , htt... | 0 |
| 3 | [jpmorgan, derivative, play, good, way, bet, e... | 0 |
| 4 | [morgan, stanley, huberty, apple, earning, mis... | 0 |
| ... | ... | ... |
| 4112 | [dollar, bond, chinese, developer, fall, stres... | 3 |
| 4113 | [long, maturity, treasury, yield, scope, highe... | 3 |
| 4114 | [pimco, buy, €, 1bn, apollo, buyout, loan, ban... | 3 |
| 4115 | [analysis, bank, snubbing, junk, rate, loan, f... | 3 |


## EDA Post-cleaning

In [22]:
# Join each token list back into a single space-separated string:
train_df['entities'] = train_df['docs'].apply(lambda tokens: ' '.join(tokens))
test_df['entities'] = test_df['docs'].apply(lambda tokens: ' '.join(tokens))

In [23]:
cleaned_test_set_profile = ProfileReport(test_df, title="Twitter Financial News (Test data) Report post cleaning")
cleaned_test_set_profile




In [24]:
cleaned_train_set_profile = ProfileReport(train_df, title="Twitter Financial News (Train data) Report post cleaning")
cleaned_train_set_profile




The `train_docs` and `test_docs` DataFrames, which hold only the `docs` and `label` columns, are the two sets ready for modeling.

In [25]:
# Exporting EDA files as HTML:
cleaned_test_set_profile.to_file("cleaned_test_set_profile.html")
cleaned_train_set_profile.to_file("cleaned_train_set_profile.html")




In [29]:
# Keep only the token lists and labels for modeling, then export to CSV:
model_df = train_df[['docs','label']]
model_df.to_csv('twitter_training_data_processed.csv', index=False)

In [31]:
# Same export for the test split:
test_model_df = test_df[['docs','label']]
test_model_df.to_csv('twitter_testing_data_processed.csv', index=False)
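One thing to watch when reloading these files: CSV has no list type, so each `docs` list is written as its string representation and comes back as a plain string. The sketch below reproduces that round trip with the standard library on two made-up rows and restores the lists with `ast.literal_eval`. With pandas, passing `converters={'docs': ast.literal_eval}` to `pd.read_csv` should achieve the same result.

```python
import ast
import csv
import io

# Made-up rows mirroring the (docs, label) layout of the exported files.
rows = [(["kfw", "credit", "line"], 3), (["loop", "upgrade", "csx"], 0)]

# Write: csv.writer stores each list via str(), much as a CSV export of a list column does.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["docs", "label"])
for docs, label in rows:
    writer.writerow([docs, label])

# Read: the docs field comes back as a string, not a list.
buffer.seek(0)
raw = list(csv.DictReader(buffer))
print(type(raw[0]["docs"]), raw[0]["docs"])

# Parse the string representation back into a real list of tokens.
restored = [(ast.literal_eval(r["docs"]), int(r["label"])) for r in raw]
print(restored)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this purpose.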