## **The US vs. China Trade War**

We first mount Google Drive to access the Quotebank datasets from 2015 to 2020. To demonstrate the feasibility of our project, we focus on 2018, the year the trade war between the US and China started.

In [2]:
import os
from google.colab import drive
drive.mount('/content/drive')
!ls "/content/drive/My Drive"

Mounted at /content/drive
'Colab Notebooks'   Equity.csv	 Quotebank      VIX.csv
 Dollar.csv	    Oil.csv	 Routine.xlsx


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import bz2
import json
import nltk  # nltk is a library that helps us compute synonyms and antonyms of words
nltk.download('wordnet')
from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


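This function extends a list of words with their WordNet synonyms (the lemma names of every synset of each word), using the nltk library. Note that, as written, it collects synonyms only and does not add antonyms.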

In [4]:
def extend_vocabulary(words):
    extended_vocab = []
    for word in words:
        for syn in wordnet.synsets(word):
            for lm in syn.lemmas():
                # Normalise each lemma name: lower case, spaces instead of underscores
                lemma = lm.name().lower().replace('_', ' ')
                extended_vocab.append(lemma)
    # np.unique removes duplicates and returns a sorted array
    return np.unique(extended_vocab)

target_words = extend_vocabulary(["china"])
print(target_words)

In [8]:
trade_words = ["trade", "business", "stock", "price"]
extend_vocabulary(trade_words)

array(['ancestry', 'banal', 'barter', 'blood', 'blood line', 'bloodline',
       'breed', 'broth', 'business', 'business concern', 'business deal',
       'business enterprise', 'business organisation',
       'business organization', 'business sector', 'buy in', 'byplay',
       'carry', 'caudex', 'clientele', 'commercial enterprise',
       'commonplace', 'concern', 'cost', 'craft', 'damage', 'deal',
       'descent', 'farm animal', 'fund', 'gillyflower', 'gunstock',
       'hackneyed', 'inventory', 'job', 'leontyne price', 'line',
       'line of descent', 'line of work', 'lineage', 'livestock',
       'malcolm stock', 'mary leontyne price', 'merchandise',
       'monetary value', 'neckcloth', 'occupation', 'old-hat', 'origin',
       'parentage', 'patronage', 'pedigree', 'price', 'sell', 'shopworn',
       'sprout', 'stage business', 'standard', 'stemma', 'stock',
       'stock certificate', 'stock up', 'stockpile', 'store', 'strain',
       'swap', 'switch', 'swop', 'terms', 'thre

# **Quotebank data collection:**
As explained above, we focus on the year 2018. We first retrieve the quotes that contain the word "china" or any word from its extended vocabulary list.

In [5]:
def extract_quotes_with_words(path_input_file, path_output_file, target_words):
    with bz2.open(path_input_file, 'rb') as input_file:
        with bz2.open(path_output_file, 'wb') as output_file:
            for instance in input_file:
                instance = json.loads(instance)
                quote = instance['quotation']
                for word in target_words:
                    if word in quote.lower():
                        output_file.write((json.dumps(instance)+'\n').encode('utf-8'))
                        break  # avoid writing a quote twice when it contains several target words
    return None

In [13]:
extract_quotes_with_words('/content/drive/MyDrive/Quotebank/quotes-2018.json.bz2', '/content/quotes-2018-china.json.bz2', target_words)

Then, among the "china"-related quotes, we extract those dealing specifically with trade topics. The corresponding file will be available in the repository.

In [14]:
extract_quotes_with_words('/content/quotes-2018-china.json.bz2', '/content/quotes-2018-china_trade.json.bz2', trade_words)
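
The extraction above uses substring matching (`word in quote.lower()`), so a short target word also matches inside longer words: "china" occurs inside "machinations" and "deal" inside "ideal". A minimal sketch of whole-word matching with a regular expression is given below. The helper names `build_word_pattern` and `contains_target_word` are hypothetical and are not part of the notebook.

```python
import re

def build_word_pattern(target_words):
    """Compile one regex that matches any target word as a whole word or phrase."""
    # Longest entries first, so multi-word phrases are tried before their prefixes
    ordered = sorted(target_words, key=len, reverse=True)
    alternatives = '|'.join(re.escape(word) for word in ordered)
    return re.compile(r'\b(?:' + alternatives + r')\b', re.IGNORECASE)

def contains_target_word(quote, pattern):
    """Return True if the quote contains at least one target word."""
    return pattern.search(quote) is not None

pattern = build_word_pattern(['china', 'trade', 'deal'])

# Substring matching would accept this quote ("machinations", "ideal")
false_positive = contains_target_word('Machinations behind an ideal outcome', pattern)
true_positive = contains_target_word('A new trade deal with China', pattern)
print(false_positive, true_positive)
```

The trade-off is that whole-word matching also drops inflected forms such as "trades" or "Chinese" unless they are listed explicitly, so it could reduce recall compared with the substring filter used in this notebook.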

In [10]:
path_china_2018 = '/content/quotes-2018-china.json.bz2' 
path_china_trade_2018 = '/content/quotes-2018-china_trade.json.bz2'

This function counts the quotes in the new archive file by streaming it line by line, without loading all the data into memory.

In [6]:
def count_quotes(path_file):
    instances = 0
    with bz2.open(path_file, 'rb') as file:
        for instance in file:
            instances += 1
    return instances

count_quotes(path_china_trade_2018)
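
The line-by-line pattern shared by the extraction and counting functions can be checked on a tiny synthetic archive before running it on the full 2018 file. The sketch below is an illustration under assumptions: the records are made up, `filter_quotes` and `count_lines` are hypothetical stand-ins for the notebook's functions, and the filter copies the raw line instead of re-serialising the parsed record.

```python
import bz2
import json
import os
import tempfile

def filter_quotes(path_in, path_out, target_words):
    """Copy every JSON line whose quotation contains a target word (substring match)."""
    with bz2.open(path_in, 'rb') as source, bz2.open(path_out, 'wb') as sink:
        for line in source:
            quote = json.loads(line)['quotation'].lower()
            if any(word in quote for word in target_words):
                sink.write(line)  # each line is written at most once

def count_lines(path):
    """Count the records of a bz2-compressed JSON-lines file without loading it."""
    with bz2.open(path, 'rb') as archive:
        return sum(1 for _ in archive)

# Made-up records that follow the Quotebank field names used above
records = [
    {'quoteID': 'q1', 'quotation': 'Trade talks with China resume.'},
    {'quoteID': 'q2', 'quotation': 'The weather is fine today.'},
    {'quoteID': 'q3', 'quotation': 'China raises the price of steel.'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path_all = os.path.join(tmp_dir, 'all.json.bz2')
    path_kept = os.path.join(tmp_dir, 'kept.json.bz2')
    with bz2.open(path_all, 'wb') as archive:
        for record in records:
            archive.write((json.dumps(record) + '\n').encode('utf-8'))
    filter_quotes(path_all, path_kept, ['china'])
    total, kept = count_lines(path_all), count_lines(path_kept)

print(total, kept)
```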

Now that the number of quotes has been significantly reduced, we can load them into a dataframe to perform the analysis.

In [7]:
def create_dataframe_from_json_bz2(path_file):
    with bz2.open(path_file, 'rb') as file:
        df = pd.read_json(file, lines=True)
    return df

In [15]:
china_trade_2018 = create_dataframe_from_json_bz2(path_china_trade_2018)

In [16]:
china_trade_2018.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2018-04-13-001441,A Digital Free Trade Zone between China and Ma...,Peter Wong,"[Q7177768, Q7177774, Q7177776, Q9456225]",2018-04-13 19:30:55,1,"[[Peter Wong, 0.782], [None, 0.218]]",[http://nst.com.my/business/2018/04/356893/chi...,E
1,2018-04-13-001441,A Digital Free Trade Zone between China and Ma...,Peter Wong,"[Q7177768, Q7177774, Q7177776, Q9456225]",2018-04-13 19:30:55,1,"[[Peter Wong, 0.782], [None, 0.218]]",[http://nst.com.my/business/2018/04/356893/chi...,E
2,2018-03-23-003097,A rough week for the markets... as fears of a ...,,[],2018-03-23 10:28:51,2,"[[None, 0.9112], [President Donald Trump, 0.08...",[http://www.breitbart.com/news/world-stock-mar...,E
3,2018-04-08-011525,"Every day of the week China, comes into our ho...",Peter Navarro,[Q7176052],2018-04-08 04:00:00,17,"[[Peter Navarro, 0.6696], [None, 0.208], [LARR...",[http://dailylocal.com/general-news/20180408/a...,E
4,2018-05-14-023366,For the President to become suddenly concerned...,Jonathan Fenby,[Q15072639],2018-05-14 20:43:32,2,"[[Jonathan Fenby, 0.8834], [None, 0.1166]]",[https://www.fxstreet.com/news/wall-street-dow...,E


In [17]:
len(china_trade_2018)

18020

We now create some basic features using Python NLP libraries to demonstrate the feasibility of our project. TextBlob and vaderSentiment use a rule-based approach: they aggregate the sentiment of the individual words in a sentence to give the sentence a polarity.
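
To make the rule-based idea concrete, here is a toy scorer that averages per-word scores from a small lexicon and flips the sign of a word directly preceded by a negation. It is only an illustration: the lexicon, the negation list and the function `toy_polarity` are invented for this sketch, and neither TextBlob nor vaderSentiment is implemented this way in detail.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1, 1]
TOY_LEXICON = {'great': 1.0, 'good': 0.5, 'bad': -0.5, 'terrible': -1.0}
NEGATIONS = {'not', 'never', 'no'}

def toy_polarity(sentence):
    """Average the scores of known words; a directly preceding negation flips the sign."""
    scores = []
    negate = False
    for token in sentence.lower().split():
        word = token.strip('.,!?;:')
        if word in NEGATIONS:
            negate = True
            continue
        if word in TOY_LEXICON:
            score = TOY_LEXICON[word]
            scores.append(-score if negate else score)
        # The negation only applies to the word that immediately follows it
        negate = False
    # Sentences without any known word are treated as neutral
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity('The food was great!'))
print(toy_polarity('The food was not good.'))
```

Real rule-based analysers use much larger lexicons and extra rules (intensifiers, punctuation, capitalisation), but the aggregation principle is the same.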

In [18]:
from textblob import TextBlob

testimonial = TextBlob("The food was great!")
print(testimonial.sentiment)

Sentiment(polarity=1.0, subjectivity=0.75)


In [19]:
def create_textblob_features(df):
    # Run TextBlob once per quote, then split the result into two columns
    sentiments = [TextBlob(quote).sentiment for quote in df.quotation]
    df['textblob_polarity'] = [s.polarity for s in sentiments]
    df['textblob_subjectivity'] = [s.subjectivity for s in sentiments]
    return None

create_textblob_features(china_trade_2018)

In [20]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)

In [21]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


analyzer = SentimentIntensityAnalyzer()
sentence = "The food was great!" 
vs = analyzer.polarity_scores(sentence)
print(vs)

{'neg': 0.0, 'neu': 0.406, 'pos': 0.594, 'compound': 0.6588}


In [22]:
def create_vader_features(df):
    analyzer = SentimentIntensityAnalyzer()
    # Score each quote once, then spread the four values into separate columns
    scores = [analyzer.polarity_scores(quote) for quote in df.quotation]
    for key in ('compound', 'neg', 'pos', 'neu'):
        df['vader_' + key] = [score[key] for score in scores]
    return None

create_vader_features(china_trade_2018)

The Flair library takes a different approach to sentiment analysis: it uses embeddings to capture the context of the sentence before predicting a label for the whole sentence.

In [24]:
!pip install flair

Collecting flair
  Downloading flair-0.9-py3-none-any.whl (319 kB)

In [25]:
from flair.models import TextClassifier
from flair.data import Sentence

classifier = TextClassifier.load('en-sentiment')
sentence = Sentence('The food was great!')
classifier.predict(sentence)

# print sentence with predicted labels
print('Sentence above is: ', sentence.labels)

2021-11-12 16:01:12,110 https://nlp.informatik.hu-berlin.de/resources/models/sentiment-curated-distilbert/sentiment-en-mix-distillbert_4.pt not found in cache, downloading to /tmp/tmpx40lc9ua


100%|██████████| 265512723/265512723 [00:09<00:00, 27440249.03B/s]

2021-11-12 16:01:21,866 copying /tmp/tmpx40lc9ua to cache at /root/.flair/models/sentiment-en-mix-distillbert_4.pt





2021-11-12 16:01:22,843 removing temp file /tmp/tmpx40lc9ua
2021-11-12 16:01:23,907 loading file /root/.flair/models/sentiment-en-mix-distillbert_4.pt



Sentence above is:  [POSITIVE (0.9961)]


In [26]:
def create_flair_features(df):
    classifier = TextClassifier.load('en-sentiment')
    
    def quote_flair_score(sentence):
        flair_sentence = Sentence(sentence)
        classifier.predict(flair_sentence)
        return flair_sentence.labels[0].score

    df['flair_pred'] = df.quotation.apply(lambda quote: quote_flair_score(quote))
    return None

In [27]:
create_flair_features(china_trade_2018)
china_trade_2018.head()

2021-11-12 16:01:53,419 loading file /root/.flair/models/sentiment-en-mix-distillbert_4.pt


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,vader_compound,vader_neg,vader_pos,vader_neu,flair_pred
0,2018-04-13-001441,A Digital Free Trade Zone between China and Ma...,Peter Wong,"[Q7177768, Q7177774, Q7177776, Q9456225]",2018-04-13 19:30:55,1,"[[Peter Wong, 0.782], [None, 0.218]]",[http://nst.com.my/business/2018/04/356893/chi...,E,0.5106,0.0,0.096,0.904,0.995053
1,2018-04-13-001441,A Digital Free Trade Zone between China and Ma...,Peter Wong,"[Q7177768, Q7177774, Q7177776, Q9456225]",2018-04-13 19:30:55,1,"[[Peter Wong, 0.782], [None, 0.218]]",[http://nst.com.my/business/2018/04/356893/chi...,E,0.5106,0.0,0.096,0.904,0.995053
2,2018-03-23-003097,A rough week for the markets... as fears of a ...,,[],2018-03-23 10:28:51,2,"[[None, 0.9112], [President Donald Trump, 0.08...",[http://www.breitbart.com/news/world-stock-mar...,E,-0.7717,0.283,0.0,0.717,0.986649
3,2018-04-08-011525,"Every day of the week China, comes into our ho...",Peter Navarro,[Q7176052],2018-04-08 04:00:00,17,"[[Peter Navarro, 0.6696], [None, 0.208], [LARR...",[http://dailylocal.com/general-news/20180408/a...,E,0.1531,0.079,0.098,0.823,0.999764
4,2018-05-14-023366,For the President to become suddenly concerned...,Jonathan Fenby,[Q15072639],2018-05-14 20:43:32,2,"[[Jonathan Fenby, 0.8834], [None, 0.1166]]",[https://www.fxstreet.com/news/wall-street-dow...,E,0.5719,0.0,0.065,0.935,0.901506
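
Note that `flair_pred` stores only `labels[0].score`, the confidence of the predicted label, and not the label itself. A confident negative prediction and a confident positive one therefore both appear as values close to 1 (row 2 above, for instance, has a negative `vader_compound` but a `flair_pred` of 0.986649). One possible way to keep the direction is to fold the label into the sign, as in the sketch below. The function `signed_sentiment` is hypothetical, and the label names `POSITIVE` and `NEGATIVE` are assumed from the classifier output shown earlier.

```python
def signed_sentiment(label_value, confidence):
    """Map a (label, confidence) pair to a single score in [-1, 1]."""
    if label_value == 'POSITIVE':
        return confidence
    if label_value == 'NEGATIVE':
        return -confidence
    raise ValueError('unexpected label: ' + repr(label_value))

# Possible use inside quote_flair_score (assumption, not the notebook's code):
#     label = flair_sentence.labels[0]
#     return signed_sentiment(label.value, label.score)

print(signed_sentiment('POSITIVE', 0.9961))
print(signed_sentiment('NEGATIVE', 0.9866))
```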


In [40]:
# china_trade_2018.to_json('/content/quotes-2018-china_trade_features.json', compression='bz2')

In [34]:
len(china_trade_2018)

16392

In [39]:
len(np.unique(china_trade_2018.quoteID))

16392

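The dataframe originally contained duplicates: 18020 rows were loaded, but only 16392 distinct quoteIDs remain once duplicates are dropped (the two counts above were computed after the deduplication cell below had been run). The duplicates arose because the first version of the extraction wrote a quotation once for every target word it contained. We have since added a "break" statement to the extraction function so that future runs write each quotation only once.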

In [33]:
china_trade_2018.drop_duplicates(subset='quoteID', keep='first', inplace=True)
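
Duplicates can also be removed while streaming the archive, before anything is loaded into a dataframe. The sketch below keeps the first record seen for each `quoteID`, which mirrors `keep='first'` above. The helper `unique_by_quote_id` and the sample lines are hypothetical.

```python
import json

def unique_by_quote_id(lines):
    """Yield parsed records, keeping only the first one seen for each quoteID."""
    seen = set()
    for line in lines:
        record = json.loads(line)
        if record['quoteID'] not in seen:
            seen.add(record['quoteID'])
            yield record

# Made-up JSON lines; in practice `lines` would be an open bz2 file handle
sample_lines = [
    '{"quoteID": "a", "quotation": "first"}',
    '{"quoteID": "a", "quotation": "first"}',
    '{"quoteID": "b", "quotation": "second"}',
]
unique_records = list(unique_by_quote_id(sample_lines))
print([record['quoteID'] for record in unique_records])
```

The set of seen identifiers grows with the number of distinct quotes, which should stay manageable for a filtered file of this size.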

In [38]:
china_trade_2018.columns

Index(['quoteID', 'quotation', 'speaker', 'qids', 'date', 'numOccurrences',
       'probas', 'urls', 'phase', 'vader_compound', 'vader_neg', 'vader_pos',
       'vader_neu', 'flair_pred', 'textblob_polarity',
       'textblob_subjectivity'],
      dtype='object')