<center style='font-size:150%; color:#4cd4b4'><strong>Demo on QesNLP API Service</strong></center>


$-$ latest @08/14/2023;

$-$ `QesNLP API beta version`

---

<br />

> This guide notebook walks through how to use the <span style='color:#4cd4b4'>*QesNLP*</span> API. At the core of this API is QES's proprietary language model, the Generative Fine-tuned Transformer (GFT). GFT is fine-tuned on a wide variety of financial text data to better adapt to the linguistic patterns and terminology prevalent in the finance domain. This language model, tailor-made for quantitative finance, forms the basis for a broad array of Natural Language Processing (NLP) tasks, ranging from text preprocessing to downstream classification.


# Overview


* The QES Financial Natural Language Processing (<span style='color:#4cd4b4'>**QesNLP**</span>) API provides a suite of NLP toolkits built on a state-of-the-art deep learning language model designed for applications in the finance and investment domain.


* The QesNLP API service consists of the following four functional modules:

    - <span style='color:#4cd4b4'>**Preprocessing**</span> Module - This module preprocesses raw text input into a parsed, machine-readable NLP data structure, covering tokenization, summarization, entity recognition, and keyphrase identification.

    - <span style='color:#4cd4b4'>**Embedding**</span> Module - This module embeds the input document or sentence into contextual vectors using the pre-trained language model. The contextualized vectors can be fed directly into standard machine learning algorithms and can power downstream NLP tasks.

    - <span style='color:#4cd4b4'>**Exposure**</span> Module - This module measures the thematic distance between a text document and a well-defined theme by leveraging the contextual embeddings of both the documents and the theme clusters.

    - <span style='color:#4cd4b4'>**Classification**</span> Module - This module supports downstream NLP classification tasks, such as sentiment analysis.
    
* ⚠️ Like any conventional machine learning algorithm, the GFT language model can still generate noisy outputs that may be incorrect. On average, however, it exhibits superior accuracy compared to traditional bag-of-words methods and can add incremental value on financial NLP tasks. We are also committed to iterating on the model to improve its capability in forthcoming generations.

* Please contact Luo.Qes@wolferesearch.com if you have further questions. 


---

# Usage Demo 

**Test Input**: A list of financial texts.

Here, we use the following 7 real-world texts as input documents:

- Investors thought Wendy’s 15.5% 2022 restaurant-margin target was somewhat conservative.

- We expect solid results from DASH. Nice beat/strong guide should be rewarded but upside is likely to balanced given high expectations.

- Technicals and incremental news flow will likely continue to drive trading through year end. We wouldn’t be surprised to see some more near-term upside, including the SPX trading into the 4050-4100 range.

- While the moves across markets were epic, we believe that our intermediate-term bearish base case remains intact. This includes core inflation remaining very persistent, the Fed hiking to 5-6%, and a recession hitting in 2023.

- We believe that earnings and guidance are likely to begin to come under pressure in the coming quarters. Our sense is that companies beating on the top- and bottom-lines and providing constructive outlooks should have an increased chance of outperforming their peers in the months ahead.
            
- Our business in China market will likely to benefit with the China reopening and recovery scenario.
            
- To better meet the business development requirement, we have accelerated the application of our Fintech. As we have already introduced like AI and blockchain and all the other advanced technology, we have made a lot of explorations and applications in AI and achieved a lot of outcome in empowering our business.

## Requirements and Presteps

1. Copy the pyqes [Python package](https://github.com/wolferesearch/docs/tree/master/micro-services/api/python/pyqes) from GitHub to your local directory.
2. Ensure the [Pandas](https://pandas.pydata.org/) and [requests](https://pypi.org/project/requests/) packages are installed in your Python kernel.
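
Before importing pyqes, it can help to confirm that both dependencies are importable in the current kernel. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is introduced here for illustration and is not part of pyqes.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this kernel."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `pandas` and `requests` are the two dependencies named in step 2.
missing = missing_packages(["pandas", "requests"])
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported as missing, installing it with your usual package manager before continuing should be enough.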

## Authentication and Connection

The API is protected by a username and password. If you have not received credentials, please [email](mailto:luo.qes@wolferesearch.com) us to apply for an API account.

The connection object is the gateway to the API. It allows you to access the catalog, portfolios, templates, risk models, etc.

In [16]:
from pyqes import conn
from pyqes import nlp

connection = conn.Connection(username = 'xxx',
                             password = 'xxx',
                             URL = 'http://feed.luoquant.com/nlp')
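
To avoid hard-coding credentials in a shared notebook, one option is to read them from environment variables. This is only a sketch: the variable names `QES_USERNAME` and `QES_PASSWORD` and the helper `load_credentials` are assumptions made for this example, not something pyqes defines.

```python
import os


def load_credentials(env=os.environ):
    """Read the API username and password from environment variables.

    QES_USERNAME and QES_PASSWORD are hypothetical variable names chosen for
    this sketch; they are not defined by pyqes.
    """
    try:
        return env["QES_USERNAME"], env["QES_PASSWORD"]
    except KeyError as exc:
        raise RuntimeError(f"Missing environment variable: {exc.args[0]}") from None


# Usage with the connection call shown above (requires pyqes and a valid account):
# username, password = load_credentials()
# connection = conn.Connection(username=username, password=password,
#                              URL='http://feed.luoquant.com/nlp')
```

Passing the mapping in as an argument keeps the helper easy to test without touching the real environment.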


Load the prestored test inputs. 

In [12]:
# specify the input
texts = nlp.TEST_DATA_SET['SAMPLE1']

print(texts)

['Investors thought Wendy’s 15.5% 2022 restaurant-margin target was somewhat conservative.', 'We expect solid results from DASH. Nice beat/strong guide should be rewarded but upside is likely to balanced given high expectations.', 'Technicals and incremental news flow will likely continue to drive trading through year end. We wouldn’t be surprised to see some more near-term upside, including the SPX trading into the 4050-4100 range.', 'While the moves across markets were epic, we believe that our intermediate-term bearish base case remains intact. This includes core inflation remaining very persistent, the Fed hiking to 5-6%, and a recession hitting in 2023.', 'We believe that earnings and guidance are likely to begin to come under pressure in the coming quarters. Our sense is that companies beating on the top- and bottom-lines and providing constructive outlooks should have an increased chance of outperforming their peers in the months ahead.', 'Our business in global market will like

Use the <span style='color:#4cd4b4'>**API Class**</span> `nlp.NLPApi` to interact with all NLP services. Instantiate it as a singleton using the authorized connection object.

In [13]:
nlp_api = nlp.NLPApi(connection)



---

## a. QesNLP/preprocessing 

The Preprocessing module preprocesses raw text input into a parsed, machine-readable NLP data structure, covering tokenization, summarization, entity recognition, and keyphrase identification.



### Keyphrases / Keyentities Extraction

> The key-term extraction function `.get_key_entity` extracts diverse key terms (n-grams) from the given input text.

For each input text, the function returns a list of pairs, each consisting of an identified key term and its similarity score. The similarity score ranges between 0 and 1; a score closer to 1 indicates that the extracted key term is more relevant to the context of the input text.


In [15]:
nlp_api.get_key_entity(list_of_texts = texts)

[[['wendy', 0.3962],
  ['investors', 0.3838],
  ['margin target', 0.3514],
  ['restaurant', 0.2262]],
 [['dash', 0.5709],
  ['strong guide', 0.4556],
  ['solid results', 0.3868],
  ['high expectations', 0.2787],
  ['nice beat', 0.2667]],
 [['spx trading', 0.6201],
  ['term upside', 0.3451],
  ['year end', 0.3164],
  ['technicals', 0.277],
  ['incremental news flow', 0.2255]],
 [['core inflation', 0.4825],
  ['recession', 0.4425],
  ['bearish base case', 0.3862],
  ['fed hiking', 0.2564],
  ['moves', 0.1843]],
 [['pressure', 0.3416],
  ['earnings', 0.3207],
  ['guidance', 0.2984],
  ['constructive outlooks', 0.1991],
  ['months', 0.0872]],
 [['economy recovery scenario', 0.5808],
  ['global market', 0.5643],
  ['china opening', 0.4717],
  ['business', 0.4133]],
 [['fintech', 0.6401],
  ['blockchain', 0.4081],
  ['business development requirement', 0.4076],
  ['ai', 0.3902],
  ['applications', 0.3491]]]
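
Because each result is a list of `[term, score]` pairs, it is straightforward to keep only the stronger terms. The sketch below assumes the nested-list shape shown in the output above; `filter_key_terms` and the threshold values are illustrative choices, not part of the API.

```python
def filter_key_terms(results, min_score=0.4):
    """Keep only the terms whose score is at least `min_score`.

    `results` has the nested-list shape shown above: one list of
    [term, score] pairs per input text.
    """
    return [
        [term for term, score in pairs if score >= min_score]
        for pairs in results
    ]


# The first two result lists from the output above.
sample = [
    [['wendy', 0.3962], ['investors', 0.3838],
     ['margin target', 0.3514], ['restaurant', 0.2262]],
    [['dash', 0.5709], ['strong guide', 0.4556], ['solid results', 0.3868],
     ['high expectations', 0.2787], ['nice beat', 0.2667]],
]
print(filter_key_terms(sample, min_score=0.38))
# [['wendy', 'investors'], ['dash', 'strong guide', 'solid results']]
```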

In [17]:
nlp_api.get_key_phrases(list_of_texts = texts)

[[['investors thought wendy', 0.6327],
  ['target somewhat conservative', 0.5881],
  ['wendy restaurant', 0.4744],
  ['restaurant margin', 0.3445],
  ['somewhat', 0.1834]],
 [['results dash nice', 0.6862],
  ['guide rewarded upside', 0.5599],
  ['beat strong guide', 0.5354],
  ['upside likely balanced', 0.4375],
  ['expect solid', 0.3584]],
 [['spx trading range', 0.6345],
  ['near term upside', 0.4735],
  ['drive trading year', 0.4587],
  ['technicals incremental', 0.3033],
  ['news flow likely', 0.2848]],
 [['inflation remaining persistent', 0.6236],
  ['fed hiking recession', 0.4765],
  ['bearish base case', 0.3862],
  ['moves markets epic', 0.3566],
  ['remains intact includes', 0.3407]],
 [['earnings guidance likely', 0.6213],
  ['pressure coming quarters', 0.5514],
  ['companies beating lines', 0.3351],
  ['months ahead', 0.3119],
  ['outperforming peers', 0.2935]],
 [['benefit china opening', 0.6663],
  ['business global market', 0.6427],
  ['economy recovery scenario', 0.5808],

### Summarization

The summarization function `summarize` distills the essential content of each input text into a succinct summary. It supports batch-level summarization, so multiple inputs can be processed efficiently in a single call.


In [35]:
## This may take some time.
nlp_api.summarize(texts)

{'0': 'Investors thought Wendy’s 15.5% 2022 restaurant-margin target was somewhat conservative.',
 '1': 'We expect solid results from DASH. Nice beat/strong guide should be rewarded but upside is likely to balanced given high expectations.',
 '2': 'Technicals and incremental news flow will likely continue to drive trading through year end. We wouldn’t be surprised to see some more near-term upside, including the SPX trading into the 4050-4100 range.',
 '3': 'We believe that our intermediate-term bearish base case remains intact. This includes core inflation remaining very persistent, the Fed hiking to 5-6%, and a recession hitting in 2023.',
 '4': 'We believe that earnings and guidance are likely to come under pressure in the coming quarters. Our sense is that companies beating on the top- and bottom-lines and providing constructive outlooks should have an increased chance of outperforming their peers in the months ahead.',
 '5': 'Our business in global market will likely to benefit wi
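
The result above is keyed by strings such as `'0'` and `'1'`, which appear to be the positions of the input texts (an assumption based on the output, not a documented guarantee). If so, a small helper can restore input order; sorting the keys as integers matters once there are ten or more texts. `summaries_in_input_order` is a name made up for this sketch.

```python
def summaries_in_input_order(summary_by_index):
    """Turn {'0': ..., '1': ...} into a list ordered by numeric index.

    Sorting the keys as integers avoids the string-ordering pitfall where
    '10' would sort before '2'.
    """
    return [summary_by_index[key] for key in sorted(summary_by_index, key=int)]


# Placeholder summaries, not real API output.
example = {'10': 'k', '2': 'c', '0': 'a', '1': 'b'}
print(summaries_in_input_order(example))
# ['a', 'b', 'c', 'k']
```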

---

## b. QesNLP/embedding 

> QesNLP's financial text embeddings `get_embedding` serve as an advanced tool to quantify the relatedness or similarity between different pieces of text. Essentially, text embeddings are multi-dimensional representations based on the GFT language model, where each text string is transformed into a vector, consisting of a list of floating-point numbers.

These embeddings find extensive applications across several domains, including:

- Search Functionality: Here, the relevance of the search results is determined based on the degree of similarity to a user's query string, thus optimizing the search results by aligning them closely with the user's intent.

- Clustering: Embeddings can be used to cluster or group text strings based on their similarity. This enables easier identification and categorization of related texts, improving data organization and retrieval.

- Recommendation Systems: In recommendation engines, items that share similar text strings, and thus similar embeddings, can be recommended to users, enhancing personalization and user experience.

- Anomaly Detection: Text embeddings can help identify outliers or anomalies, text strings that exhibit little to no relatedness to a given set of text, thus facilitating better monitoring and error detection.

- Diversity Measurement: By analyzing similarity distributions, text embeddings can measure the level of diversity within a given text dataset, providing crucial insights into the range and spread of content.

- Text Classification: Text embeddings can be used to assign a class or label to text strings. The classification is based on the similarity of the text to known labels, aiding in automated content classification and tagging.

<br/>

In terms of relatedness measurement, the 'distance' between two vectors in the multi-dimensional embedding space serves as the metric. A smaller distance between two vectors indicates a higher degree of relatedness, suggesting that the text strings are more similar.
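
As a rough illustration of relatedness scoring, the sketch below ranks documents against a query by cosine similarity. The guide does not say which distance metric QesNLP uses, so cosine similarity is an assumption here, and the toy 3-dimensional vectors merely stand in for real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings.
docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]
print(rank_by_similarity(query, docs))
# [0, 1, 2]
```

The same ranking function covers the search use case listed above: embed the query, embed the candidates, and sort by similarity.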





In [34]:
nlp_api.get_embedding(texts)[:2]

[[0.01513153500854969,
  -0.022457074373960495,
  0.004049981944262981,
  0.1051681712269783,
  -0.026309631764888763,
  0.11871398240327835,
  -0.008968377485871315,
  -0.12893173098564148,
  0.03029773198068142,
  0.009070790372788906,
  0.06864392757415771,
  -0.006944524589926004,
  0.07088586688041687,
  -0.026452602818608284,
  0.02657601423561573,
  -0.06798046082258224,
  0.017157988622784615,
  0.044337280094623566,
  0.0027087912894785404,
  0.04105174541473389,
  -0.04839600995182991,
  0.03366716951131821,
  0.023741252720355988,
  0.0932169184088707,
  -0.021327653899788857,
  -0.06773635745048523,
  -0.01632431335747242,
  0.031410448253154755,
  -0.0633777379989624,
  0.022282054647803307,
  -0.0327615812420845,
  -0.0686522051692009,
  -0.05086243897676468,
  0.06490565836429596,
  -0.006202189717441797,
  0.02210240252315998,
  -0.03232797607779503,
  -0.042912986129522324,
  0.051925577223300934,
  -0.030198048800230026,
  0.07424700260162354,
  0.06149189919233322,
 

---

## c. QesNLP/exposure

> The `QesNLP/exposure` module quantifies the thematic distance between a text document and a well-defined theme by leveraging the contextual embeddings of both the documents and the theme clusters.

At the heart of our thematic exposure lies <span style='color:#4cd4b4'>**ThemeGFT**</span>, a machine-learned NLP embedding and clustering pipeline powered by the GFT model, used to identify and create themes that are then fine-tuned with human annotation.

<span style='color:#4cd4b4'>**Thematic contextual embeddings**</span> provide a nuanced and powerful way to represent the meaning of words, phrases, and documents within a specific theme or domain. By incorporating contextual information from surrounding words and the overarching theme, these embeddings capture not only the general meaning of words but also the subtle variations in meaning that occur in different thematic contexts. This allows for a more accurate representation of the relationships between text elements, leading to better document classification and topic modeling within the domain.

We currently provide two specialized suites of fine-tuned ThemeGFT models to quantify the contextual exposure of a given text document to predefined themes:

- `General` encompasses 21 core universal themes, ranging from Artificial Intelligence (AI) and Environmental, Social, and Governance (ESG) to Margin.

- `China` offers dedicated themes relevant to China, based on our China Reopen thematic research.

The exposure score computed by these models varies between 0 and 1, where a score of 1 implies complete semantic exposure to a particular theme. This numerical representation offers a quantifiable measure of the extent to which a document aligns with the defined thematic context.

In [14]:
# General ThemeGFT model
nlp_api.get_general_theme_exposure(texts)


Unnamed: 0,AI,ESG,SupplyChain,Inflation,Tech,HumanCapital,Margin,CashFlow,RevGrowth,Cost,...,Debt,Earnings,Buyback,Dividend,Marketing,GlobalMarket,OperatingPerformance,Tax,Energy,Demand
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.819087,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.001185,0.0,0.0,0.0,0.0
6,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
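
To read a table like the one above at a glance, it can help to list only the themes each document is materially exposed to. The sketch below works on plain dictionaries; assuming the result is a pandas DataFrame (as the output suggests), `df.to_dict('index')` would give a similar shape. The helper `themes_above` and the 0.5 threshold are illustrative assumptions.

```python
def themes_above(row, threshold=0.5):
    """Return the theme names in `row` whose exposure is at least `threshold`,
    ordered from highest to lowest exposure."""
    hits = [(name, score) for name, score in row.items() if score >= threshold]
    return [name for name, _ in sorted(hits, key=lambda pair: pair[1], reverse=True)]


# A few columns of rows 3, 4 and 6 from the output above.
rows = {
    3: {"AI": 0.0, "Inflation": 0.819087, "Tech": 0.0, "Earnings": 0.0},
    4: {"AI": 0.0, "Inflation": 0.0, "Tech": 0.0, "Earnings": 1.0},
    6: {"AI": 1.0, "Inflation": 0.0, "Tech": 1.0, "Earnings": 0.0},
}
for index, row in rows.items():
    print(index, themes_above(row))
```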


In [33]:
# ChinaReopen ThemeGFT Model
nlp_api.get_china_theme_exposure(texts)


Unnamed: 0,China/COVID,China/Economy,China/SupplyChain,China/Market&Business,China/SalesGrowth,China/Tariff,China/Invest,China/Reopen
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,1.0,0.0,1.0,0.120171,0.0,0.761358,0.18679
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---

## d. QesNLP/classification

>  The classification module supports downstream NLP financial classification tasks, featuring financial sentiment analysis.

**Input**: A financial text.

**Argument**: The model type including analyst tone (`analyst-tone`), news sentiment (`news-sentiment`), social media sentiment (`twitter-sentiment`), etc.

**Output**: A probability for each label; the probabilities sum to one.

**Models**: Five fine-tuned NLP classification models are available, each designed for a different application domain such as financial news or analyst comments.

| Model Type/Argument | Description | Label | Notes |
|:---:|:---:|:---:|:---:|
| analyst-tone | FinBERT-analyst-tone model is fine-tuned on 10,000 manually annotated (positive, negative, neutral) sentences from analyst reports. This model achieves superior performance on the financial tone analysis task. | positive, neutral, negative | the default sentiment engine |
| news-sentiment | FinBERT-news-sentiment model is fine-tuned on 10,000 manually annotated (positive, negative, neutral) sentences from financial news. This model achieves superior performance on the financial sentiment analysis task. | positive, neutral, negative | |
| twitter-sentiment | The BERT-twitter model is trained on ~58M English tweets and fine-tuned for sentiment analysis with the TweetEval benchmark, a unified benchmark for tweet classification consisting of seven heterogeneous tasks that are core to social media NLP research, such as Sentiment Analysis and Emotion Recognition. | positive, neutral, negative | |
| twitter-emotion | | joy, anger, sadness, optimism | |
| forward-looking | Forward-looking statements (FLS) inform investors of managers’ beliefs and opinions about the firm's future events or results. | not forward looking, non-specific forward-looking, specific forward-looking | |



In [16]:
# Generic sentiment identification model 
nlp_api.compute_sentiment(list_of_texts = texts, model = 'analyst-tone')


Unnamed: 0,neutral,positive,negative
0,0.2218236,0.017037,0.7611394
1,4.367909e-09,1.0,2.229531e-09
2,0.0146564,0.983423,0.00192086
3,3.217681e-06,7e-06,0.9999905
4,5.961951e-07,0.999311,0.0006887265
5,3.992363e-09,1.0,7.782749e-09
6,3.308855e-07,1.0,1.769387e-07
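
Since each row is a probability distribution over three labels, two simple summaries are often useful: the most likely label, and a signed score. The sketch below computes both from plain dictionaries; the "positive minus negative" net score is a common convention assumed here, not something the API returns, and `summarize_sentiment` is a hypothetical helper.

```python
def summarize_sentiment(row):
    """Return (most likely label, positive minus negative probability)."""
    label = max(row, key=row.get)
    net_score = row["positive"] - row["negative"]
    return label, net_score


# Rows 0 and 2 from the output above.
rows = [
    {"neutral": 0.2218236, "positive": 0.017037, "negative": 0.7611394},
    {"neutral": 0.0146564, "positive": 0.983423, "negative": 0.00192086},
]
for row in rows:
    label, net = summarize_sentiment(row)
    print(label, round(net, 4))
```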


* Forward-looking tone detection `.compute_forward_looking_tone`: detects whether a given statement is forward-looking. Forward-Looking Statements (FLS) are typically declarations made by company management that convey their beliefs, expectations, or predictions about the company's future events or results.

In [17]:
nlp_api.compute_forward_looking_tone(list_of_texts = texts)


Unnamed: 0,notFL,nonspecificFL,specificFL
0,0.989592,0.004289,0.006119
1,0.094768,0.355109,0.550124
2,0.024221,0.315546,0.660233
3,0.921976,0.025728,0.052295
4,0.063698,0.388626,0.547676
5,0.009463,0.375107,0.615431
6,0.961722,0.013028,0.02525
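
One way to use these probabilities is to flag the texts that are more likely forward-looking than not by adding the two forward-looking columns. This is a sketch over plain dictionaries; the helper `forward_looking_indices` and the 0.5 cutoff are assumptions for illustration.

```python
def forward_looking_indices(rows, threshold=0.5):
    """Return positions of rows whose combined forward-looking probability
    (non-specific plus specific) is at least `threshold`."""
    return [
        i for i, row in enumerate(rows)
        if row["nonspecificFL"] + row["specificFL"] >= threshold
    ]


# The seven rows from the output above.
rows = [
    {"notFL": 0.989592, "nonspecificFL": 0.004289, "specificFL": 0.006119},
    {"notFL": 0.094768, "nonspecificFL": 0.355109, "specificFL": 0.550124},
    {"notFL": 0.024221, "nonspecificFL": 0.315546, "specificFL": 0.660233},
    {"notFL": 0.921976, "nonspecificFL": 0.025728, "specificFL": 0.052295},
    {"notFL": 0.063698, "nonspecificFL": 0.388626, "specificFL": 0.547676},
    {"notFL": 0.009463, "nonspecificFL": 0.375107, "specificFL": 0.615431},
    {"notFL": 0.961722, "nonspecificFL": 0.013028, "specificFL": 0.02525},
]
print(forward_looking_indices(rows))
# [1, 2, 4, 5]
```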


* Social media emotion identification `.compute_social_emotion`: this model is fine-tuned on social media comments, allowing it to discern the underlying emotions conveyed in these interactions. It can also interpret the sentiment expressed through emojis, which improves its understanding of nuanced digital communication.

In [22]:
nlp_api.compute_social_emotion(list_of_texts = ['GME to the moon? 😭',
                                                'GME to the moon! 🚀'])


Unnamed: 0,anger,joy,optimism,sadness
0,0.012298,0.071588,0.019461,0.896654
1,0.078456,0.77103,0.125856,0.024658
