
# Centiment

## Download Dataset

In [1]:
!gdown 1oQUYSp1ySoRnGrbZ2T0ogqWE5SXYjtnS
!gdown 1Zbj5seq-emJhH2afC2FG8CzILE0ZYRIf

Downloading...
From: https://drive.google.com/uc?id=1oQUYSp1ySoRnGrbZ2T0ogqWE5SXYjtnS
To: /content/financial_data.csv
100% 1.20M/1.20M [00:00<00:00, 136MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Zbj5seq-emJhH2afC2FG8CzILE0ZYRIf
To: /content/crypto_data.csv
100% 1.88M/1.88M [00:00<00:00, 145MB/s]


In [2]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

In [3]:
df1 = pd.read_csv('./financial_data.csv').drop(['Unnamed: 0'], axis=1)
df1.head()

,sentence,sentiment
0,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative
1,Shell's $70 Billion BG Deal Meets Shareholder ...,negative
2,SSH COMMUNICATIONS SECURITY CORP STOCK EXCHANG...,negative
3,$SAP Q1 disappoints as #software licenses down...,negative
4,$AAPL afternoon selloff as usual will be bruta...,negative


In [4]:
df2 = pd.read_csv('./crypto_data.csv').drop(['Unnamed: 0', 'news_url', 'title'], axis=1)
df2.columns = ['sentence', 'sentiment']
df2.sentiment = df2.sentiment.apply(lambda x: str(x).lower())
df2.head()

,sentence,sentiment
0,Anthony Albanese's cabinet will focus mainly o...,neutral
1,Experts point out sticking points as well as g...,neutral
2,This crypto fund specializes in the fastest-gr...,neutral
3,FTX's U.S. arm previously acquired crypto deri...,neutral
4,The price of Bitcoin hit its lowest since 2020...,positive


In [5]:
df = pd.concat([df1, df2], axis=0)
df = df.sample(frac=1)  # shuffle all rows
df.head()

,sentence,sentiment
6896,Tim Cockroft brings with him an excellent trac...,positive
3319,The liquidity providing was interrupted on May...,neutral
1709,BAC See some negative price action here tomorr...,negative
1842,AAP Free-falling Now to 457,negative
8303,"Britain's FTSE steadies, supported by Dixons C...",positive


In [6]:
df.sentiment.value_counts()

neutral     5203
positive    4700
negative    4404
Name: sentiment, dtype: int64

## Downsampling

In [7]:
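# Downsample every class to the size of the smallest one.
# df was shuffled above, so taking the first re_n rows of a class is a random draw.
re_n = df.sentiment.value_counts().min()
neg = df[df['sentiment'] == 'negative'].iloc[:re_n, :]
neu = df[df['sentiment'] == 'neutral'].iloc[:re_n, :]
pos = df[df['sentiment'] == 'positive'].iloc[:re_n, :]
df = pd.concat([neg, neu, pos])
df.sentiment.value_counts()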

negative    4404
neutral     4404
positive    4404
Name: sentiment, dtype: int64
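
The same balancing step can also be written with `groupby(...).sample(...)`, which draws the rows at random with a fixed seed instead of relying on an earlier shuffle. This is only a sketch on a toy frame: the rows in `toy` and the seed `42` are assumptions for illustration, not values from the notebook.

```python
import pandas as pd

# Toy frame standing in for the combined dataset (hypothetical rows).
toy = pd.DataFrame({
    "sentence": [f"s{i}" for i in range(9)],
    "sentiment": ["negative"] * 2 + ["neutral"] * 4 + ["positive"] * 3,
})

# Size of the smallest class; every class is cut down to this many rows.
n_min = toy["sentiment"].value_counts().min()

# Draw n_min rows per class at random, with a fixed seed for repeatability.
balanced = toy.groupby("sentiment").sample(n=n_min, random_state=42)

counts = balanced["sentiment"].value_counts().to_dict()
print(counts)
```

`groupby(...).sample` needs pandas 1.1 or newer.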

## Split dataset

In [8]:
from sklearn.model_selection import train_test_split
x = df.sentence.to_numpy()
y = df.sentiment.to_numpy()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.1, shuffle=True, random_state=42)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((11890,), (1322,), (11890,), (1322,))

In [9]:
import nltk
nltk.download(['stopwords', 'punkt'])

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from string import punctuation
import re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Preprocessing

In [10]:
def preprocessing(data):
  # lowercase
  data = [item.lower() for item in data]

  # remove punctuation, then replace every run of digits with the token 'num'
  table = str.maketrans('', '', punctuation)
  data = [item.translate(table) for item in data]
  data = [re.sub(r'\d+', 'num', item) for item in data]
  
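  # remove stopwords (entries that contain punctuation or spaces, e.g. '.com' or 'the daily hodl',
  # can never match here: punctuation is already stripped and each token is a single word)
  stopword = set(stopwords.words('english') + ['\x03', '.com', 'cryptograph', 'ambcrypto', 'u.today', 'coingape', 'the daily hodl'])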
  data = [[word for word in item.split() if word not in stopword] for item in data]

  # stemming
  stemmer = PorterStemmer()
  data = [' '.join([stemmer.stem(word) for word in item]) for item in data]
  return data

x_train = preprocessing(x_train)
x_train[:5]

['aap num week support',
 'fortum intend spend much euro num bn becom sole owner tgknum',
 'proud welcom anoth distribut facil north mississippi region known logist center unit state said gray swoop execut director mda',
 'da said fincen current author patriot act would like stop actor engag illicit transact ransomwar attack darknet market',
 'invn also line watch list trade top red candl left']
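
A quick check, using only the standard library, of which custom stopword entries can ever match a token produced by `preprocessing`. The entries below are of the kind used in the custom list, and the helper `can_match` and the `normalised` list are hypothetical names introduced for this sketch. One possible fix, shown at the end, is to run the entries through the same cleaning as the text.

```python
from string import punctuation

# Same punctuation-stripping table as in preprocessing().
table = str.maketrans('', '', punctuation)

# Entries of the kind used in the custom stopword list.
custom = ['.com', 'u.today', 'coingape', 'the daily hodl']

def can_match(entry):
    """True if `entry` could equal a token produced by strip-then-split."""
    unchanged = entry == entry.translate(table)   # contains no punctuation
    single_word = len(entry.split()) == 1         # contains no whitespace
    return unchanged and single_word

matchable = [e for e in custom if can_match(e)]

# Possible fix (an assumption, not the notebook's code): clean the entries
# the same way as the text, so they line up with the tokens.
normalised = [tok for e in custom for tok in e.translate(table).split()]

print(matchable)
print(normalised)
```

Note that splitting a multi-word entry also turns common words such as `the` into stopwords, which is harmless here because the NLTK list already contains them.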

## CountVectorize and TFIDF Transform

In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

countVect = CountVectorizer()
x_train_count = countVect.fit_transform(x_train)

tfidf = TfidfTransformer()
x_train_tfidf = tfidf.fit_transform(x_train_count)
x_train_tfidf.shape

(11890, 14848)
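
With default settings, `CountVectorizer` followed by `TfidfTransformer` produces the same matrix as a single `TfidfVectorizer`. The sketch below checks this on three made-up documents (the `docs` list is an assumption for illustration). It is an alternative spelling of the same step, not a change to the notebook's approach.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy documents in the already-preprocessed style (hypothetical).
docs = [
    "aap num week support",
    "market cap increas num billion",
    "tether market surg",
]

# Two-step route, as in the cell above.
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# One-step route.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, same)
```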

## Training

In [12]:
model = LogisticRegression(max_iter=200)
model.fit(x_train_tfidf, y_train)

x_test = preprocessing(x_test)
x_test_tfidf = tfidf.transform(countVect.transform(x_test))
y_pred = model.predict(x_test_tfidf)
y_pred

array(['neutral', 'neutral', 'positive', ..., 'negative', 'positive',
       'positive'], dtype=object)

In [13]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.73      0.77      0.75       447
     neutral       0.66      0.67      0.66       439
    positive       0.71      0.66      0.68       436

    accuracy                           0.70      1322
   macro avg       0.70      0.70      0.70      1322
weighted avg       0.70      0.70      0.70      1322
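
The report gives per-class scores but not which classes are confused with each other. A confusion matrix shows that directly. The sketch below uses small made-up label lists (`y_true_toy` and `y_pred_toy` are hypothetical). In the notebook the same call would take `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

labels = ["negative", "neutral", "positive"]

# Hypothetical true and predicted labels for six sentences.
y_true_toy = ["negative", "negative", "neutral", "positive", "positive", "neutral"]
y_pred_toy = ["negative", "neutral", "neutral", "positive", "neutral", "neutral"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm)
```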



## Tuning Hyperparameter

In [14]:
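from sklearn.model_selection import GridSearchCV, KFold
# Note: cv is defined but never passed to GridSearchCV below,
# so the search runs with its default 5-fold cross-validation.
cv = KFold(n_splits=2)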

solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
params = {
    'max_iter': [200, 300, 400],
    'penalty': penalty,
    'solver': solvers,
    'C': c_values
}

grid = GridSearchCV(LogisticRegression(), params, scoring='accuracy')
grid.fit(x_train_tfidf, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

GridSearchCV(estimator=LogisticRegression(),
             param_grid={'C': [100, 10, 1.0, 0.1, 0.01],
                         'max_iter': [200, 300, 400], 'penalty': ['l2'],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear']},
             scoring='accuracy')

In [20]:
grid.best_estimator_

LogisticRegression(max_iter=200, solver='liblinear')
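
One possible variation on the search above is to wrap the vectorizer and the classifier in a `Pipeline`, so that TF-IDF is refitted on the training part of each fold rather than once on the whole training set. This is a sketch under assumptions, not the notebook's method: the toy `texts` and `labels`, the step names `tfidf` and `clf`, the reduced grid and `cv=2` are all chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: four sentences per class.
texts = [
    "stock fall loss", "price drop sell", "share down weak", "market crash fear",
    "report publish today", "compani hold meet", "fund list exchang", "bank issu note",
    "profit rise strong", "price surg gain", "share up record", "market ralli high",
]
labels = ["negative"] * 4 + ["neutral"] * 4 + ["positive"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.1, 1.0, 10]}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=2)
search.fit(texts, labels)

best_c = search.best_params_["clf__C"]
prediction = search.predict(["price drop loss"])
print(best_c, search.best_score_, prediction)
```

`best_params_` and `best_score_` report the winning setting and its mean cross-validated accuracy, which is often easier to read than the repr of `best_estimator_`.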

In [21]:
model = grid.best_estimator_
model.fit(x_train_tfidf, y_train)
y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.73      0.77      0.75       447
     neutral       0.66      0.68      0.67       439
    positive       0.71      0.65      0.68       436

    accuracy                           0.70      1322
   macro avg       0.70      0.70      0.70      1322
weighted avg       0.70      0.70      0.70      1322



## Content

In [17]:
from nltk.tokenize import sent_tokenize

In [18]:
text = """The surging DeFi sector has resulted in a mass minting of Tether in 2020 — including $3B last month alone — which has pushed its market capitalization over $15 billion.
At the beginning of the year there was just over $4 billion USDT in circulation, and today that figure is over $15 billion. DeFi has been the driving force behind the Tether mining machine as more and more liquidity pools are based on stablecoins. It was reported that Tether’s average daily transfer value had exceeded that of PayPal late last month as demand for the stablecoin continues to surge.
Tether made the milestone announcement and pointed out that last month the market capitalization has increased by billions more:
Tether has just surpassed a $15 billion market capitalization!

In only one month, Tether’s market cap has increased by more than $3 billion, maintaining its number one spot as the most liquid, stable and trusted currency! pic.twitter.com/MLOWkiIDvF
— Tether (@Tether_to) September 17, 2020
An infographic from Flipsidecrypto.com depicts the movements of Tether between users and exchanges this month. The main centralized exchanges still account for the lion’s share of USDT trade with Binance and Bitfinex holding around $2 billion between them.
Image - flipsidecrypto.com
According to the Tether Transparency Report, the amount of UDST on Ethereum has now increased to over $10 billion, or almost two thirds of the entire supply. There is currently around $4.2 billion on Tron and $1.3 billion circulating on Omni.
Late last month, Tether conducted a billion dollar token swap from Bitfinex to Binance as reported by Cointelegraph. The swap was initiated because Binance had a surplus of $1 billion USDT based on the TRON blockchain and wanted to trade it for the equivalent amount of Ethereum-based Tether.
On September 15, another swap was initiated by Tether as demand for the ERC-20 version of the stablecoin exceeds that of other networks, such as Tron.
Tomorrow Tether will coordinate with a 3rd party to perform two chain swaps (conversion from Tron to ERC20 protocol) for 1B USDt.
Tether total supply will not change during this process.
Read more here: https://t.co/abfgnELSvi
— Tether (@Tether_to) September 14, 2020
However there are ongoing moves to shift Tether transfers onto other networks from Ethereum as gas fees continue to cripple the network. Over the past month USDT has been made available on the Layer 2 OMG Network and launched on the high speed Solana blockchain.
Meanwhile some in the crypto community are still calling for a full audit which will determine whether there are $15 billion real dollars and assets backing up the stablecoin, or the whole thing is a house of cards.
The truth may be revealed as part of the ongoing Tether lawsuit in New York. The Office of the Attorney General (OAG) filed a letter on September 8 which asked for disclosure of financial documents. The lawsuit concerns allegations that Bitfinex had ‘lost’ around $1 billion in customer funds and used Tether reserves to mask the imbalance. Tether and Bitfinex have rejected the lawsuit as baseless.
"""

In [22]:
from collections import Counter
sent = sent_tokenize(text)
sent = preprocessing(sent)

sent_count = countVect.transform(sent)
sent_tfidf = tfidf.transform(sent_count)

result = model.predict(sent_tfidf)
print(Counter(result))
for i in range(len(sent)):
  print(f"{sent[i]} => {result[i]}")

Counter({'neutral': 12, 'positive': 6, 'negative': 4})
surg defi sector result mass mint tether num — includ numb last month alon — push market capit num billion => neutral
begin year num billion usdt circul today figur num billion => positive
defi drive forc behind tether mine machin liquid pool base stablecoin => neutral
report tether’ averag daili transfer valu exceed paypal late last month demand stablecoin continu surg => positive
tether made mileston announc point last month market capit increas billion tether surpass num billion market capit => positive
one month tether’ market cap increas num billion maintain number one spot liquid stabl trust currenc => neutral
pictwittercommlowkiidvf — tether tetherto septemb num num infograph flipsidecryptocom depict movement tether user exchang month => negative
main central exchang still account lion’ share usdt trade binanc bitfinex hold around num billion => neutral
imag flipsidecryptocom accord tether transpar report amount udst ethereu