# API user guide
Below we give examples of how to use the endpoints of the web service.
All the endpoints except root return a dictionary with three keys, one per section: 'Section1', 'Section1A', and 'Section7', which correspond to 'Item1', 'Item1A', and 'Item7' as described in the Readme file. The dictionary may also contain other keys, depending on the user input.
This split by section reflects a preference for having different topic families, each modelled with documents that belong to related but slightly different domains.
The web service contains the following endpoints:
* root - Entry point
* get_topics_time - Extracts the evolution of topics over time.
* get_topics_sentiment - Extracts the evolution of topic sentiment over time.
* get_topics_url - Allows the user to input a URL and get the corresponding topics from each of the three topic models.
* docs - Contains the documentation of the endpoints.
Below we give examples of how to use the three get endpoints. Please check the docs endpoint for more details.

### get_topics_time

We illustrate four cases:
1. Full retrieval: Gets the data without any filtering or resampling.
2. Resampling: Resamples the data to a desired frequency.
3. Top_n: Extracts only the top_n most important topics, plus the outliers topic.
4. Resampling and Top_n: Combines 2 and 3.

Let's start by calling the endpoint and extracting the keys of the dictionary response. The resampling computation may take a while.

In [2]:
import requests
import pandas as pd
host = 'http://127.0.0.1:8080'
def get_topics_time_dict(freq: str = None, top_N: int = None):
    """Returns the response dictionary, with one JSON-encoded dataframe per section"""
    # Send only the query parameters that the caller actually set.
    params = {}
    if top_N is not None:
        params['top_n'] = f'{top_N}'
    if freq is not None:
        params['freq'] = f'{freq}'
    response = requests.get(f"{host}/get_topics_time", params=params)
    return response.json()

# One dictionary for each of the four cases.

dict_full = get_topics_time_dict()
dict_resample = get_topics_time_dict(freq='1Y', top_N=None)
dict_top_n = get_topics_time_dict(top_N=20, freq=None)
dict_top_n_resample = get_topics_time_dict(top_N=20, freq='1Y')

# Print the keys
for k in [dict_full, dict_resample, dict_top_n, dict_top_n_resample]:
    print(k.keys())

dict_keys(['Section1', 'Section1A', 'Section7'])
dict_keys(['frequency', 'Section1', 'Section1A', 'Section7'])
dict_keys(['top_n', 'Section1', 'Section1A', 'Section7'])
dict_keys(['frequency', 'top_n', 'Section1', 'Section1A', 'Section7'])


We can see two things:
1. All the dictionaries have three keys in common, namely 'Section1', 'Section1A', and 'Section7'.
2. Depending on the user input, the dictionary may contain an additional frequency key, top_n key, or both.
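
The two observations above suggest a small helper that separates the section payloads from the optional metadata keys. The following is a minimal sketch: `split_response` and `SECTION_KEYS` are hypothetical names, and the example response is made up to mirror the key listing above rather than taken from the service.

```python
SECTION_KEYS = ('Section1', 'Section1A', 'Section7')

def split_response(response: dict):
    """Separate the per-section payloads from any optional metadata keys."""
    sections = {k: response[k] for k in SECTION_KEYS if k in response}
    metadata = {k: v for k, v in response.items() if k not in SECTION_KEYS}
    return sections, metadata

# Hypothetical response: keys as in the listing above, payloads shortened to '[]',
# metadata values invented for illustration.
example = {
    'frequency': '1Y',
    'top_n': 20,
    'Section1': '[]',
    'Section1A': '[]',
    'Section7': '[]',
}

sections_only, metadata = split_response(example)
print(sorted(sections_only))
print(metadata)
```

With this split, the section payloads can be parsed in a loop without having to remember which extra keys a given call added.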

Let's parse one of the dataframes so that we can validate the answer. We will use dict_top_n_resample; recall that we need to exclude the frequency and top_n keys.

In [6]:
sections = ['Section1', 'Section1A', 'Section7']
dict_df = {s: pd.read_json(dict_top_n_resample[s], orient='records') for s in sections}
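
Since `pd.read_json` is called on each section value, those values are presumably JSON strings in records orientation. Under that assumption, a payload can also be inspected with the standard library alone, which may be handy for a quick look before building dataframes. The payload below is a hypothetical two-row example, not actual service output.

```python
import json

# Hypothetical payload: a records-oriented JSON string with the
# Topic/Words/Timestamp/Frequency columns used in this guide.
payload = json.dumps([
    {"Topic": -1, "Words": "example words", "Timestamp": "2020-12-31", "Frequency": 10},
    {"Topic": 0, "Words": "other words", "Timestamp": "2020-12-31", "Frequency": 7},
])

records = json.loads(payload)            # list of dicts, one per row
columns = sorted(records[0].keys())      # column names of the first row
unique_topics = {row["Topic"] for row in records}

print(columns)
print(unique_topics)
```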

We can now inspect one of the dataframes, say the one corresponding to the management discussion, Section 7.

In [29]:
dict_df['Section7'].head(5)

Unnamed: 0,Topic,Words,Timestamp,Frequency
0,-1,"28 2023, january 28, fiscal 2023, 8203 adjustm...",2023-12-31,128
1,-1,"30 2023, 2023, fiscal 2023, june 30, ended june",2023-12-31,415
2,-1,"31 2019, 2019, 2019 2018, 8203, 2018",2020-12-31,9043
3,-1,"31 2019, 2019, 2019 compared, 2018, start 8226",2020-12-31,6388
4,-1,"31 2020, 2020, 2020 2019, covid 19, covid",2021-12-31,7268


We can see the words and the associated topics. Let's verify that we actually have yearly timestamps and 20 + 1 unique topics.

In [32]:
print(f"Unique Topics (including the one for outliers) {dict_df['Section7'].loc[:,'Topic'].nunique()}")
print(f"Unique Timestamps  {dict_df['Section7'].loc[:,'Timestamp'].unique()}")

Unique Topics (including the one for outliers) 21
Unique Timestamps  ['2023-12-31T00:00:00.000000000' '2020-12-31T00:00:00.000000000'
 '2021-12-31T00:00:00.000000000' '2022-12-31T00:00:00.000000000'
 '2019-12-31T00:00:00.000000000']
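
The yearly-timestamp check can also be written as an explicit assertion-style test. Here is a minimal sketch using only the standard library; `is_year_end` is a hypothetical helper, and the timestamps are copied from the output above.

```python
from datetime import date

def is_year_end(stamp: str) -> bool:
    """True if an ISO-formatted timestamp falls on 31 December."""
    day = date.fromisoformat(stamp[:10])  # keep only the YYYY-MM-DD part
    return (day.month, day.day) == (12, 31)

# Timestamps copied from the output above.
stamps = [
    '2023-12-31T00:00:00.000000000',
    '2020-12-31T00:00:00.000000000',
    '2021-12-31T00:00:00.000000000',
    '2022-12-31T00:00:00.000000000',
    '2019-12-31T00:00:00.000000000',
]

all_year_end = all(is_year_end(s) for s in stamps)
years = sorted(date.fromisoformat(s[:10]).year for s in stamps)
print(all_year_end, years)
```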


Thus, we verify that the endpoint yields what is expected (see also the [tests](test_routes.py) for more robust testing). We now move to get_topics_sentiment.

### get_topics_sentiment

This endpoint is used to analyze the sentiment of topics over time.

In [3]:
def get_topics_sentiment_dict(freq: str = None):
    """Returns a dictionary of 3 dataframes with sentiment"""
    # Only send 'freq' when the caller provides one.
    params = {'freq': f'{freq}'} if freq else None
    response = requests.get(f"{host}/get_topics_sentiment", params=params)
    return response.json()

sentiment_full = get_topics_sentiment_dict()
sentiment_resampled = get_topics_sentiment_dict(freq='3M')

for el in [sentiment_full, sentiment_resampled]:
    print(el.keys())

dict_keys(['Section1', 'Section1A', 'Section7', 'frequency'])
dict_keys(['Section1', 'Section1A', 'Section7', 'frequency'])


Here, we can see that both responses contain the frequency key. Let's look inside.

In [4]:
print(sentiment_resampled['frequency'])
print(sentiment_full['frequency'])

3M
No frequency specified
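
The output above shows that get_topics_sentiment fills the frequency key with a placeholder message when no frequency is requested, whereas get_topics_time simply omits the key. Client code that handles both endpoints may want to normalise the two conventions. The following is a minimal sketch; `requested_frequency` is a hypothetical helper, and the placeholder string is the one printed above.

```python
NO_FREQUENCY_MESSAGE = 'No frequency specified'  # placeholder seen in the output above

def requested_frequency(response: dict):
    """Return the resampling frequency of a response, or None if none was applied."""
    freq = response.get('frequency')
    if freq is None or freq == NO_FREQUENCY_MESSAGE:
        return None
    return freq

# Hypothetical responses, reduced to the keys of interest.
print(requested_frequency({'frequency': '3M'}))
print(requested_frequency({'frequency': 'No frequency specified'}))
print(requested_frequency({'Section1': '[]'}))
```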


We can see that when no frequency is specified, that message is stored under the frequency key. This differs from the behaviour of get_topics_time, which omits the key; the discrepancy has already been noted, and future versions of the service will handle it. Let's analyze one of the dataframes, again for Section 7.

In [7]:
sentiment_df = {s: pd.read_json(sentiment_resampled[s], orient='records') for s in sections}
sentiment_df['Section7'].head()

Unnamed: 0,Topic,Name,Timestamp,mean,median
0,-1,-1_cash_debt_income_expense,2019-01-31,0.141774,0.0
1,-1,-1_cash_debt_income_expense,2019-04-30,0.087433,0.0
2,-1,-1_cash_debt_income_expense,2019-07-31,0.076477,0.0
3,-1,-1_cash_debt_income_expense,2019-10-31,0.096196,0.0
4,-1,-1_cash_debt_income_expense,2020-01-31,0.097544,0.0


We see the mean and median sentiment. Let's plot the mean. For ease of visualization, let's exclude the outliers, which are represented by topic -1, and keep only topics 1 to 20, since in BERTopic the topic numbers correspond to the ranking by frequency.

In [11]:
import plotly.express as px

sentiment_to_plot = sentiment_df['Section7'].loc[sentiment_df['Section7'].Topic.isin(range(1, 21)),:]
fig = px.line(data_frame=sentiment_to_plot, x='Timestamp', y='mean', color='Name')
fig.update_layout(
    title='Topic sentiment - Section 7',
    xaxis_title='Time',
    yaxis_title='Average Sentiment per period'
)


fig.show()

As GitHub cannot render the output of fig.show(), the [image](topic-sentiment-Section%207.png) is shown below.
![Sentiment - Section 7](topic-sentiment-Section%207.png)



Above we can see some oscillation in the topics. For instance, we can see topic 19, which relates to research and development. We can also see topic 15, which relates to fuel expenses and passenger revenue, and whose sentiment was very negative in both May 2020 and May 2021. This corresponds broadly to the pandemic period, when a significant number of airlines had to be bailed out by their respective governments, and this was most likely reflected in the management discussions and disclosures in the 10-K filings. It is worth remarking that looking at the topics aggregated at a different frequency can help in getting a more complete picture of the sentiment for certain topics. You can simply change the frequency parameter to achieve this.
Finally, we can see the frequency corresponds to what is expected.

In [40]:
print(f"Unique Timestamps  {sentiment_df['Section7'].loc[:,'Timestamp'].unique()}")

Unique Timestamps  ['2019-01-31T00:00:00.000000000' '2019-04-30T00:00:00.000000000'
 '2019-07-31T00:00:00.000000000' '2019-10-31T00:00:00.000000000'
 '2020-01-31T00:00:00.000000000' '2020-04-30T00:00:00.000000000'
 '2020-07-31T00:00:00.000000000' '2020-10-31T00:00:00.000000000'
 '2021-01-31T00:00:00.000000000' '2021-04-30T00:00:00.000000000'
 '2021-07-31T00:00:00.000000000' '2021-10-31T00:00:00.000000000'
 '2022-01-31T00:00:00.000000000' '2022-04-30T00:00:00.000000000'
 '2022-07-31T00:00:00.000000000' '2022-10-31T00:00:00.000000000'
 '2023-01-31T00:00:00.000000000' '2023-04-30T00:00:00.000000000'
 '2023-07-31T00:00:00.000000000' '2023-10-31T00:00:00.000000000'
 '2019-02-28T00:00:00.000000000' '2019-05-31T00:00:00.000000000'
 '2019-08-31T00:00:00.000000000' '2019-11-30T00:00:00.000000000'
 '2020-02-29T00:00:00.000000000' '2020-05-31T00:00:00.000000000'
 '2020-08-31T00:00:00.000000000' '2020-11-30T00:00:00.000000000'
 '2021-02-28T00:00:00.000000000' '2021-05-31T00:00:00.000000000'
 '2021

### get_topics_url
We use this endpoint to extract the topics present in any website according to the three topic models. Let's see what we get if we test it with the ECB and NASA websites. We do two types of extraction:
1. All topics
2. All unique topics


In [13]:
def get_topics_url_lists(url, keep_all=False):
    response = requests.get(f"{host}/get_topics_url", params={'url': url, 'keep_all': f'{keep_all}'})
    return response.json()
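
Note that `keep_all` is formatted into the string 'True' or 'False' before being sent. The sketch below uses the standard library to show roughly what the resulting query string looks like for the same parameters; how the service parses that value is not shown in this guide. `build_query` is a hypothetical helper introduced only for illustration.

```python
from urllib.parse import urlencode

def build_query(url: str, keep_all: bool = False) -> str:
    """Build a query string with the same parameters as the helper above."""
    return urlencode({'url': url, 'keep_all': f'{keep_all}'})

query = build_query('https://www.nasa.gov', keep_all=True)
print(query)
```

The target URL is percent-encoded, so it can itself contain query parameters without clashing with those of the endpoint.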


In [18]:
topics_1 = get_topics_url_lists(url='https://www.ecb.europa.eu/home/html/index.en.html')
topics_2 = get_topics_url_lists(url='https://www.nasa.gov')
topics_1_keep_all = get_topics_url_lists(url='https://www.ecb.europa.eu/home/html/index.en.html', keep_all=True)

Let's see the topics corresponding to the Section7 model.

In [19]:
print(f'ecb: {topics_1["Section7"]}')
print(f'Nasa: {topics_2["Section7"]}')

ecb: ['1674_climate change_change impacts_address climate_cdp climate', '2334_fte basis_fte_income fte_purposes bancorp', '2600_french tax_booking com_million euros_authorities issued', '44_rate risk_rate scenarios_yield curve_rate scenario', '43_var_var model_stressed var_var based', '50_credit losses_allowance credit_forecast period_allowance loan', '36_libor_alternative reference_reference rates_sofr', '1358_investment grade_grade securities_bonds_issuers investment', '2098_nitrogen fertilizer_fertilizer demand_supply demand_fertilizers', '1269_stress_stress testing_historical stress_stress events', '477_levelized foreign_160 levelized_translation 160_currency translation', '1809_fomc_funds rate_pnc expects_pnc', '2005_corporation 8217_financial markets_instability impact_certain corporation', '170_emissions_ghg emissions_paris agreement_epa', '81_currencies_exchange rates_currency exchange_foreign currencies', '1924_segment ebitda_europe segment_ebitda year_ebitda included', '1761_

We can see that NASA has only one topic, while the ECB has many attached topics. Let's look at the non-unique topics for the ECB.

In [25]:
from collections import Counter

# Count how many times each Section1 topic was matched on the ECB page.
counter = Counter(topics_1_keep_all["Section1"])

df = pd.DataFrame.from_dict(counter, orient='index').reset_index()
df = df.rename(columns={'index': 'element', 0: 'topic_count'})
display(df.sort_values('topic_count', ascending=False))


Unnamed: 0,element,topic_count
0,225_libor_benchmarks indices_libor settings_re...,3
4,878_monetary policy_monetary policies_instrume...,3
5,1726_59 table_8226 expectation_fiscal_8226 fiscal,3
1,189_withdrawal agreement_transition period_ref...,3
10,366_bonds_bond market_corporate bond_corporate...,3
3,574_subject risk_current expectations_credit l...,2
12,64_credit facility_term loan_senior unsecured_...,1
17,4_data protection_personal data_ccpa_protectio...,1
16,381_free trade_footwear apparel_trade restrict...,1
15,576_return periods_catastrophe loss_events_cat...,1
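
The same ranking can be obtained without pandas through `Counter.most_common`. This is a minimal sketch with made-up topic labels; in practice the list would be one of the section lists returned with keep_all=True.

```python
from collections import Counter

# Hypothetical list of matched topics, with repeats kept.
matched = ['topic_a', 'topic_b', 'topic_a', 'topic_c', 'topic_a', 'topic_b']

counts = Counter(matched)
ranking = counts.most_common()                  # (topic, count) pairs, highest count first
repeated = [topic for topic, n in ranking if n > 1]

print(ranking)
print(repeated)
```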


We see five topics with a count of 3. Let's see all the topics matched for the NASA website.

In [100]:
for s in sections:
    print(f'Nasa-{s}-model topics: {topics_2[s]} \n')

Nasa-Section1-model topics: ['756_mission systems_national security_missions_mission support'] 

Nasa-Section1A-model topics: ['503_gps_satellites_gnss_satellite'] 

Nasa-Section7-model topics: ['615_campus_life science_marcus_certification'] 



Let's look at the topics for a particular company.

In [36]:
topics_3 = get_topics_url_lists(url='https://www.homedepot.ca/en/home.html')

for s in sections:
    print(f'The Home Depot-{s}-model topics: {topics_3[s]} \n')

The Home Depot-Section1-model topics: ['1231_home depot_pro_depot_pro customers', '634_home improvement_lowe 8217_lowe_improvement products'] 

The Home Depot-Section1A-model topics: ['1485_buyers_buyers sellers_contracted purchase_sellers buyers', '694_delivery times_pickup_home improvement_delivery options'] 

The Home Depot-Section7-model topics: ['354_assortment_shopping_merchandising_shopping experience', '1176_income shipments_sales distributors_rebate programs_certain distributors', '2132_premium outlets_designer outlet_outlets_outlet'] 



We can see topics related to home improvement and shopping, which reflect the domain of the company. We finish with Wikipedia.

In [102]:
topics_4 = get_topics_url_lists(url='https://www.wikipedia.org')
for s in sections:
    print(f'Wikipedia-{s}-model topics: {topics_4[s]} \n')

Wikipedia-Section1-model topics: ['346_8220 duralast_paints 174_coatings_weed control', '1619_parish la_la 160_parish_operated 160', '968_methods treatment_ep_methods_formulation', '807_160 hydro_argentina 160_gas diesel_160 chile', '978_vitiligo_ruxolitinib cream_cream_ruxolitinib', '436_eurosport_tvn_viewers_international networks', '1822_8212 64_kucing liar_big gossan_160 grasberg', '1532_38 europe_zoledronate_zoledronate generics_generics various', '272_africa 160_160 africa_total africa_asia 160', '587_web addresses_textual references_inactive textual_reference report'] 

Wikipedia-Section1A-model topics: ['576_32 32_32_160 32_160 160', '854_coffee_arabica coffee_coffee beans_quality arabica', '190_search services_travel services_trivago_online travel', '340_eylea_vegf_eye_asthma', '391_jakafi_myelofibrosis_host disease_graft versus', '1517_indonesia_indonesia 8217_indonesia government_mining operations', '542_spam_quality content_search results_linkedin microsoft'] 

Wikipedia-Se

## Other uses

You can use the get_topics_url_lists function above to track the evolution of topics for a specific website that you are monitoring for changes. For instance, this could be used to track whether a certain company has filed a report with the SEC that could potentially be market moving. In particular, one could build an alert system that notifies the user any time something related to restructuring happens.
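
As a rough illustration of such an alert, the sketch below compares two snapshots of a topic list and reports the newly appeared topics whose label mentions a keyword. The function names, the keyword matching on topic labels, and the example snapshots are all assumptions made for illustration; in practice each snapshot would be one section of a get_topics_url_lists response fetched at a different time.

```python
def new_topics(previous, current):
    """Topics present in the current snapshot but absent from the previous one."""
    seen = set(previous)
    return [topic for topic in current if topic not in seen]

def keyword_alerts(previous, current, keywords=('restructuring',)):
    """New topics whose label mentions any of the given keywords."""
    return [
        topic for topic in new_topics(previous, current)
        if any(keyword in topic.lower() for keyword in keywords)
    ]

# Hypothetical snapshots of one section's topic list on two different days;
# the second label is invented and does not come from the topic models.
yesterday = ['354_assortment_shopping_merchandising_shopping experience']
today = yesterday + ['0000_restructuring_severance_restructuring charges_exit costs']

alerts = keyword_alerts(yesterday, today)
print(alerts)
```

Matching on labels is a deliberately simple choice; matching on topic identifiers chosen in advance would be an alternative if the labels are too noisy.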

In [26]:
import pandas as pd
df = pd.read_csv('../all_filings_and_sections.csv')

In [30]:
df.drop(columns='Unnamed: 0').to_csv('../data_sample.csv')

In [35]:
df.drop(columns='Unnamed: 0').head(1).to_csv('data_sample.csv')