## Exploring the dataset for announcements by listed companies on the Hong Kong Stock Exchange (HKEx)

**Aim:** 
Look for notable features in the announcements to build some intuition about which features the neural network is likely to weight most heavily.

All data is scraped from the website of the HKEX - https://www.hkexnews.hk/ 

Data will be passed into a deep neural network to train a classifier.

In [173]:
import pandas as pd
from plotly.offline import iplot
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text
import plotly.express as px
import cufflinks as cf
import re
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

In [154]:
# Read in all the data from separate CSV files
cct = pd.read_csv('ch14A_cct.csv')
agm = pd.read_csv('notice_of_agm.csv')
nt = pd.read_csv('notifiable_transactions.csv')
tk = pd.read_csv('takeovers_code_3_7.csv')
th = pd.read_csv('trading_halt.csv')
ar = pd.read_csv('annual_results_annt.csv')

In [155]:
# Simplify labels (bracket assignment sets the whole Label column to one value)
cct['Label'] = "Connected Transactions"
tk['Label'] = "Takeover Offer"
ar['Label'] = "Annual Results"

In [156]:
additional_stop_words = {'hong', 'kong', 'stock', 'exchange', '000'}
# CountVectorizer documents stop_words as a list, so convert the frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(additional_stop_words))

In [157]:
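# Helper: return the n most frequent bigrams in a corpus as (bigram, count) pairs
def get_top_n_bigrams(corpus, n=None):
    # Fit a bigram-only vocabulary; stop words are dropped before bigrams are formed
    vec = CountVectorizer(ngram_range=(2, 2), stop_words=stop_words).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each bigram's count over all documents (1 x vocabulary-size matrix)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    # n=None returns every bigram
    return words_freq[:n]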
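To see what a helper like this returns, here is a minimal sketch on a made-up three-sentence corpus. The corpus, the short stop word list and the name `top_bigrams` are assumptions for illustration only, not part of the HKEX data. Note that stop words are removed before bigrams are built, so "notice of annual" yields the bigram "notice annual".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the HKEX data
toy_corpus = [
    "notice of annual general meeting",
    "the annual general meeting will be held",
    "annual results announcement",
]

def top_bigrams(corpus, n=None, stop_words=None):
    """Return the n most frequent bigrams as (bigram, count) pairs."""
    vec = CountVectorizer(ngram_range=(2, 2), stop_words=stop_words)
    # fit_transform builds the vocabulary and the document-term matrix in one pass
    counts = vec.fit_transform(corpus).sum(axis=0)
    freq = [(bigram, int(counts[0, idx])) for bigram, idx in vec.vocabulary_.items()]
    return sorted(freq, key=lambda x: x[1], reverse=True)[:n]

# Illustrative stop word list (an assumption, much shorter than the one used above)
toy_top = top_bigrams(toy_corpus, n=3, stop_words=["of", "the", "will", "be"])
print(toy_top)
```

"annual general" and "general meeting" each occur twice here, so they lead the list; every other bigram occurs once.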

In [158]:
df_temp = pd.DataFrame(get_top_n_bigrams(cct.Text, 20), columns = ['Text' , 'count'])
df_temp.groupby('Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams for Connected Transactions after removing stopwords')

As the chart above shows, "chapter 14a" is the 5th most common bigram in the Connected Transactions dataset. From my experience representing listed companies in Hong Kong in a legal capacity, I know that the rules on connected transactions are set out in Chapter 14A of the Listing Rules of the Stock Exchange.

In [159]:
df_temp = pd.DataFrame(get_top_n_bigrams(agm.Text, 20), columns = ['Text' , 'count'])
df_temp.groupby('Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams for Announcements of Annual General Meetings after removing stopwords')

Unsurprisingly, the bigrams "general meeting" and "annual general" are the most frequent ones in this category.

In [160]:
df_temp = pd.DataFrame(get_top_n_bigrams(nt.Text, 20), columns = ['Text' , 'count'])
df_temp.groupby('Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams for Notifiable Transactions after removing stopwords')

Notifiable transactions usually involve mergers and acquisitions, so seeing the bigram "target company" in 4th place is not too surprising. The bigrams "percentage ratios" and "applicable percentage" relate to the requirement that companies publish this type of announcement only when the proposed transaction meets certain percentage thresholds, which the Listing Rules express as "percentage ratios".

In [161]:
df_temp = pd.DataFrame(get_top_n_bigrams(tk.Text, 20), columns = ['Text' , 'count'])
df_temp.groupby('Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams for Takeover Offer after removing stopwords')

Takeover offers are governed by the Takeovers Code, so no surprise here.

In [162]:
df_temp = pd.DataFrame(get_top_n_bigrams(th.Text, 20), columns = ['Text' , 'count'])
df_temp.groupby('Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams for Announcements of Trading Halts after removing stopwords')

For this type of announcement, the most common bigrams are ones that are common across *any* type of announcement. This is consistent with the relatively short length of Trading Halt announcements, as we shall see below.
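One way to make "common across any type of announcement" concrete is to drop, for each category, the bigrams that also appear in another category's top list. The sketch below does this with plain Python sets. The bigram lists are hypothetical placeholders, not the actual output of `get_top_n_bigrams`, and `distinctive_bigrams` is a name I made up for illustration.

```python
# Hypothetical top-bigram lists per category (illustrative values only,
# not the actual output of get_top_n_bigrams on the HKEX data)
top_bigrams_by_label = {
    "Trading Halt": ["board directors", "executive directors", "trading halt"],
    "Notice of AGM": ["general meeting", "annual general", "board directors"],
    "Annual Results": ["year ended", "31 december", "executive directors"],
}

def distinctive_bigrams(top_by_label):
    """For each label, keep the bigrams that appear in no other label's list."""
    result = {}
    for label, bigrams in top_by_label.items():
        # Collect every bigram that shows up under any other label
        others = set()
        for other_label, other_bigrams in top_by_label.items():
            if other_label != label:
                others.update(other_bigrams)
        result[label] = [b for b in bigrams if b not in others]
    return result

distinctive = distinctive_bigrams(top_bigrams_by_label)
for label, bigrams in distinctive.items():
    print(label, "->", bigrams)
```

A category left with few distinctive bigrams, as Trading Halt is in this toy input, is one where bigram counts alone are unlikely to separate it from the rest.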

In [163]:
df_temp = pd.DataFrame(get_top_n_bigrams(ar.Text, 20), columns = ['Text' , 'count'])
df_temp.groupby('Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams for Announcements of Annual Results after removing stopwords')

Announcements of Annual Results feature bigrams such as "31 december" and "year ended", which refer to the end of the financial year.

## Next, let's look at the length of each corpus across types of announcements to see if any inferences can be drawn from there.

In [164]:
# Concatenate all dataframes (DataFrame.append was removed in pandas 2.0)
df = pd.concat([cct, agm, nt, tk, th, ar], ignore_index=True)

In [174]:
# Preprocess the Text column: lowercase it and strip everything except letters
# (numbers, punctuation, line breaks) so that mean word count corresponds better to depth of content
cleanr = re.compile(r'[^a-z ]')

def preprocess(corpus):
    corpus = corpus.lower()
    # Replace every non-letter character (digits, punctuation, newlines) with a space
    corpus = cleanr.sub(' ', corpus)
    # Collapse runs of whitespace into a single space
    corpus = re.sub(r'\s+', ' ', corpus)
    return corpus

df['Text'] = df['Text'].apply(preprocess)
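To illustrate the effect of this cleaning step on the word count, here is a small standalone sketch. The sample sentence is invented, and `preprocess_sketch` is a hypothetical variant of the function above that additionally trims leading and trailing spaces.

```python
import re

NON_LETTERS = re.compile(r"[^a-z ]")  # anything other than lowercase letters and spaces
WHITESPACE = re.compile(r"\s+")

def preprocess_sketch(corpus):
    # Lowercase first so that the [^a-z ] pattern keeps all letters
    corpus = NON_LETTERS.sub(" ", corpus.lower())
    # Collapse repeated spaces and trim the ends
    return WHITESPACE.sub(" ", corpus).strip()

# Invented sample text, not from the HKEX data
sample = "Year ended 31 December 2019:\nRevenue rose 12.5% to HK$1,000 million."
cleaned = preprocess_sketch(sample)
print(cleaned)               # year ended december revenue rose to hk million
print(len(cleaned.split()))  # 8
```

Figures and symbols such as "31", "12.5%" and "1,000" vanish, so number-heavy announcements are not credited with extra words.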

In [175]:
# Create word count column
df['Word Count'] = df.Text.apply(lambda x: len(x.split()))

In [176]:
# Function to find mean word count of each type of announcement
def find_mean_WC(label):
    return df.loc[df.Label == label, 'Word Count'].mean()

In [177]:
# Create dictionary of Label : Mean Word Count
labels = set(df.Label.tolist())
mean_word_count = dict()
for label in labels:
    mean_word_count[label] = find_mean_WC(label)

In [178]:
# Sort by descending order
mean_word_count_sorted = sorted(mean_word_count.items(), key=lambda x: x[1], reverse=True)
mean_word_count_sorted

[('Annual Results', 9297.058531746032),
 ('Connected Transactions', 4696.560869565217),
 ('Notifiable Transactions', 3139.0132960111964),
 ('Notice of AGM', 1982.2100801377574),
 ('Takeover Offer', 1929.3349900596422),
 ('Trading Halt', 281.9398023360288)]
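The same per-label means can also be obtained in one step with a pandas `groupby`. The sketch below shows the idea on a tiny invented dataframe (`toy_df` and its contents are assumptions standing in for `df`), rather than reproducing the figures above.

```python
import pandas as pd

# Hypothetical miniature stand-in for the concatenated dataframe `df`
toy_df = pd.DataFrame({
    "Label": ["Trading Halt", "Trading Halt", "Annual Results", "Annual Results"],
    "Text": [
        "trading halted",
        "trading in the shares halted",
        "results for the year ended december",
        "revenue rose during the year",
    ],
})

# Word count per row: split on whitespace and count the pieces
toy_df["Word Count"] = toy_df["Text"].str.split().str.len()

# Mean word count per label, longest first
mean_wc = (
    toy_df.groupby("Label")["Word Count"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_wc)
```

The result is a Series indexed by label, which can be passed straight to a plotting call without building separate lists.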

In [179]:
labels = []
mwc = []
for x, y in mean_word_count_sorted:
    labels.append(x)
    mwc.append(y)

In [180]:
fig = px.pie(values=mwc, names=labels, title='Relative Proportion of Mean Word Counts Across Types of Announcements')

In [181]:
fig.show()
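The slices of the pie are each category's share of the summed mean word counts. As a rough check, those shares can be computed directly from the sorted output printed earlier. The values below are copied from that output and rounded to two decimals; the variable names are my own.

```python
# Mean word counts copied from the sorted output above (rounded to two decimals)
mean_word_counts = {
    "Annual Results": 9297.06,
    "Connected Transactions": 4696.56,
    "Notifiable Transactions": 3139.01,
    "Notice of AGM": 1982.21,
    "Takeover Offer": 1929.33,
    "Trading Halt": 281.94,
}

# Each slice is the category's mean divided by the sum of all means
total = sum(mean_word_counts.values())
shares = {label: value / total for label, value in mean_word_counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# How many times longer the longest category is than the shortest
ratio = max(mean_word_counts.values()) / min(mean_word_counts.values())
print(f"Longest vs shortest: {ratio:.1f}x")
```

These shares compare average announcement length only; they say nothing about how many announcements each category contains.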

Annual Results announcements are the longest by a wide margin, with a mean of roughly 9,300 words. This is likely due to the sheer volume of financial information that must be disclosed as part of annual results.

Connected Transactions and Notifiable Transactions have the next two highest mean word counts (roughly 4,700 and 3,100 words respectively). These types of announcement often have to set out at length the background to the transaction.