<a href="https://colab.research.google.com/github/aisudev/SE-Project-Core-Engine/blob/main/Centiment_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Centiment

## Download Dataset

In [1]:
!gdown 1oQUYSp1ySoRnGrbZ2T0ogqWE5SXYjtnS
!gdown 1Zbj5seq-emJhH2afC2FG8CzILE0ZYRIf

Downloading...
From: https://drive.google.com/uc?id=1oQUYSp1ySoRnGrbZ2T0ogqWE5SXYjtnS
To: /content/financial_data.csv
100% 1.20M/1.20M [00:00<00:00, 139MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Zbj5seq-emJhH2afC2FG8CzILE0ZYRIf
To: /content/crypto_data.csv
100% 1.88M/1.88M [00:00<00:00, 143MB/s]


In [2]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

In [3]:
df1 = pd.read_csv('./financial_data.csv').drop(['Unnamed: 0'], axis=1)
df1.head()

Unnamed: 0,sentence,sentiment
0,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative
1,Shell's $70 Billion BG Deal Meets Shareholder ...,negative
2,SSH COMMUNICATIONS SECURITY CORP STOCK EXCHANG...,negative
3,$SAP Q1 disappoints as #software licenses down...,negative
4,$AAPL afternoon selloff as usual will be bruta...,negative


In [4]:
df2 = pd.read_csv('./crypto_data.csv').drop(['Unnamed: 0', 'news_url', 'title'], axis=1)
df2.columns = ['sentence', 'sentiment']
df2.sentiment = df2.sentiment.apply(lambda x: str(x).lower())
df2.head()

Unnamed: 0,sentence,sentiment
0,Anthony Albanese's cabinet will focus mainly o...,neutral
1,Experts point out sticking points as well as g...,neutral
2,This crypto fund specializes in the fastest-gr...,neutral
3,FTX's U.S. arm previously acquired crypto deri...,neutral
4,The price of Bitcoin hit its lowest since 2020...,positive


In [5]:
df = pd.concat([df1, df2], axis=0)
df = df.sample(frac=1)  # shuffle rows; no random_state is set, so the order differs between runs
df.head()

Unnamed: 0,sentence,sentiment
8069,"Furthermore , efficiency improvement measures ...",positive
210,Glencore blames rivals for creating metals glut,negative
2355,MTG holding small short pos from 5.72 for swing,negative
298,Operating loss amounted to EUR 0.9 mn in the f...,negative
5733,The business has sales of about ( Euro ) 35 mi...,neutral


In [6]:
df.sentiment.value_counts()

neutral     5203
positive    4700
negative    4404
Name: sentiment, dtype: int64

## Downsampling

In [7]:
re_n = min(df.sentiment.value_counts().values)
neg = df[df['sentiment'] == 'negative']
neu = df[df['sentiment'] == 'neutral'].iloc[:re_n, :]
pos = df[df['sentiment'] == 'positive'].iloc[:re_n, :]
df = pd.concat([neg, neu, pos])
df.sentiment.value_counts()

negative    4404
neutral     4404
positive    4404
Name: sentiment, dtype: int64
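
The cell above keeps every `negative` row and truncates the other two classes to the same count, which works here because `negative` happens to be the smallest class. As a minimal, library-free sketch of the same idea, the hypothetical helper `downsample` below samples every class down to the minority count, whichever class that is. The `(sentence, sentiment)` tuple layout and the toy rows are assumptions made for illustration, not data from the notebook.

```python
import random
from collections import Counter, defaultdict

def downsample(rows, seed=42):
    """Return rows with every class reduced to the size of the smallest class."""
    by_label = defaultdict(list)
    for sentence, label in rows:
        by_label[label].append((sentence, label))

    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))
    return balanced

# Toy data (made up for illustration, not taken from the datasets above)
rows = (
    [("up", "positive")] * 5
    + [("flat", "neutral")] * 7
    + [("down", "negative")] * 3
)
balanced = downsample(rows)
print(Counter(label for _, label in balanced))
```

Sampling with a fixed seed, rather than slicing the first rows, keeps the result reproducible even when the upstream shuffle is not seeded.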

## Split dataset

In [8]:
from sklearn.model_selection import train_test_split
x = df.sentence.to_numpy()
y = df.sentiment.to_numpy()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.1, shuffle=True, random_state=42)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((11890,), (1322,), (11890,), (1322,))

In [9]:
import nltk
nltk.download(['stopwords', 'punkt'])

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from string import punctuation
import re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Preprocessing

In [10]:
def preprocessing(data):
  # lowercase
  data = [item.lower() for item in data]

  # remove punctuation, then replace each run of digits with the token 'num'
  table = str.maketrans('', '', punctuation)
  data = [item.translate(table) for item in data]
  data = [re.sub(r'\d+', 'num', item) for item in data]

  # remove stopwords
  # note: entries containing punctuation or spaces ('.com', 'u.today', 'the daily hodl')
  # cannot match here, because punctuation is already stripped and tokens are split on whitespace
  stopword = set(stopwords.words('english') + ['\x03', '.com', 'cryptograph', 'ambcrypto', 'u.today', 'coingape', 'the daily hodl'])
  data = [[word for word in item.split() if word not in stopword] for item in data]

  # stemming
  stemmer = PorterStemmer()
  data = [' '.join([stemmer.stem(word) for word in item]) for item in data]
  return data

x_train = preprocessing(x_train)
x_train[:5]

['jack dorsey firstev tweet sold nft nonfung token last year num million worth thousand dollar',
 'mr skogster current serv manag respons abb oy system modul low voltag drive',
 'drug backlanc gone user argecap health well pfe jnj hit new numweek high morn',
 'capit structur solidium may complement financi instrument futur',
 'financi account standard board fasb wednesday reportedli unanim vote begin project review account exchangetrad digit asset commod']
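
The order of the cleaning steps determines what the later steps can see. The sketch below isolates the first three steps (lowercasing, punctuation removal, digit replacement) in a hypothetical helper named `normalize`, so their effect on a ticker-style sentence can be inspected without NLTK. It uses only `string.punctuation` and `re`, as the cell above does.

```python
import re
from string import punctuation

def normalize(text):
    """Lowercase, strip ASCII punctuation, then replace digit runs with 'num'."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', punctuation))
    return re.sub(r'\d+', 'num', text)

print(normalize("$ESI on lows, down $1.50 to $2.50"))  # esi on lows down num to num
print(normalize("52-week high"))                       # numweek high
print(normalize("U.Today"))                            # utoday
```

Because the decimal point is removed before the digits are replaced, `$1.50` collapses to a single `num`. It also follows that stopword entries such as `'.com'` and `'u.today'` can no longer match once punctuation has been stripped. `string.punctuation` covers ASCII only, which is why a curly apostrophe survives in tokens such as `bitcoin’`.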

## CountVectorize and TFIDF Transform

In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

countVect = CountVectorizer()
x_train_count = countVect.fit_transform(x_train)

tfidf = TfidfTransformer()
x_train_tfidf = tfidf.fit_transform(x_train_count)
x_train_tfidf.shape

(11890, 14790)
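
To make the weighting step less of a black box, here is a small pure-Python sketch of TF-IDF with a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. To my understanding this matches the default settings of `TfidfTransformer`, but treat that as an assumption and check the documentation for the installed version. The function name `tfidf` and the toy sentences are illustrative only, and tokenisation is a plain whitespace split rather than `CountVectorizer`'s token pattern.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalisation for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])
    return vocab, rows

docs = ["price broken beneath support", "price rose", "buyer lay around support"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[1]):
    if weight:
        print(f"{term}: {weight:.3f}")
```

In the second toy document, `rose` occurs in one document and `price` in two, so `rose` receives the larger weight.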

## Training

In [12]:
model = LogisticRegression(max_iter=200)
model.fit(x_train_tfidf, y_train)

x_test = preprocessing(x_test)
x_test_tfidf = tfidf.transform(countVect.transform(x_test))
y_pred = model.predict(x_test_tfidf)
y_pred

array(['positive', 'positive', 'positive', ..., 'negative', 'negative',
       'negative'], dtype=object)

In [13]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.72      0.77      0.74       447
     neutral       0.67      0.65      0.66       439
    positive       0.69      0.67      0.68       436

    accuracy                           0.70      1322
   macro avg       0.69      0.70      0.69      1322
weighted avg       0.70      0.70      0.70      1322
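
The F1 column is the harmonic mean of precision and recall, and the macro average is the unweighted mean over the three classes. The sketch below recomputes both from the rounded precision and recall values printed above, so it can only be expected to agree with the report to the two decimals shown.

```python
# (precision, recall) per class, copied from the report above
report = {
    "negative": (0.72, 0.77),
    "neutral": (0.67, 0.65),
    "positive": (0.69, 0.67),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {label: f1(p, r) for label, (p, r) in report.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for label, score in f1_scores.items():
    print(f"{label}: {score:.2f}")
print(f"macro avg: {macro_f1:.2f}")
```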



## Tuning Hyperparameter

In [14]:
from sklearn.model_selection import GridSearchCV, KFold
cv = KFold(n_splits=2)  # note: never passed to GridSearchCV below, so its default cross-validation is used

solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
params = {
    'max_iter': [200, 300, 400],
    'penalty': penalty,
    'solver': solvers,
    'C': c_values
}

grid = GridSearchCV(LogisticRegression(), params, scoring='accuracy')
grid.fit(x_train_tfidf, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

GridSearchCV(estimator=LogisticRegression(),
             param_grid={'C': [100, 10, 1.0, 0.1, 0.01],
                         'max_iter': [200, 300, 400], 'penalty': ['l2'],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear']},
             scoring='accuracy')
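
The number of convergence warnings is easier to understand once the size of the search is spelled out. The sketch below enumerates the grid with `itertools.product`. The fold count of 5 is an assumption: it is what I understand `GridSearchCV` to use when no `cv` argument is given, as in the call above.

```python
from itertools import product

params = {
    'max_iter': [200, 300, 400],
    'penalty': ['l2'],
    'solver': ['newton-cg', 'lbfgs', 'liblinear'],
    'C': [100, 10, 1.0, 0.1, 0.01],
}

# every combination of one value per parameter
candidates = [dict(zip(params, values)) for values in product(*params.values())]
n_folds = 5  # assumption: default number of folds when cv is not specified

print(len(candidates), "candidates")                       # 45 candidates
print(len(candidates) * n_folds, "cross-validation fits")  # 225 cross-validation fits
```

Each of those fits can emit its own warning, which would explain why the same message repeats in the output.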

In [15]:
grid.best_estimator_

LogisticRegression(max_iter=200)

In [16]:
model = grid.best_estimator_
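model.fit(x_train_tfidf, y_train)  # redundant with the default refit=True: best_estimator_ is already fitted on the full training set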
y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.72      0.77      0.74       447
     neutral       0.67      0.65      0.66       439
    positive       0.69      0.67      0.68       436

    accuracy                           0.70      1322
   macro avg       0.69      0.70      0.69      1322
weighted avg       0.70      0.70      0.70      1322



## Content

In [17]:
from nltk.tokenize import sent_tokenize

In [42]:
text = """Disclaimer: The findings of the following analysis are the sole opinions of the writer and should not be considered investment advice
Bitcoin Dominance took a huge leap earlier this month as it soared from 41.5% on 10 May to touch 45.47% on 19 May. This surge meant that Bitcoin’s share of the total market capitalization of the crypto-market rose by a huge amount, even though the price per Bitcoin remained around the same – Around $29k. Therefore, altcoins are shedding value much faster than Bitcoin, and long-term investors would be wise to remain cautious of the movement of this metric.
For The Sandbox, a buying opportunity for long-term horizon investors is not yet present. The trend, in fact, remained overwhelmingly bearish at press time.
SAND- 12 Hour Chart
Source: SAND/USDT on TradingView
The $4.4, $3.6, and $2.65-zones have been critical support levels over the past three months. The price has broken beneath each of them, and at press time, SAND was trading at $1.28. These levels had acted as strong resistance when SAND pushed north in October and November last year.
The next stronghold of the buyers lay around the $1-area, with $1.08 marked as a support level on the charts. However, the series of lower highs and lower lows over the past few months suggested that buyers run a high risk of large losses if they attempt to DCA into a steady downtrend.
Instead, long-term investors might want to wait for signs of strength from buyers before allocating some capital towards the crypto-asset.
Rationale
Source: SAND/USDT on TradingView
The price formed a hidden bearish divergence with the momentum indicator, RSI. The price formed lower highs (white) while the RSI made higher highs. This bearish divergence suggested a continuation of the downtrend, and therefore, the price could move toward the $1-mark in the days or weeks to come.
The RSI has been beneath the neutral 50 line since the start of April, which highlighted the bearish trend of SAND. The Stochastic RSI also formed a bearish crossover, adding a bit more confluence to the bearish bias.
The OBV did pick up slightly over the past week as it formed higher lows, but the buying volume is dwarfed by the selling volume of the past few weeks. Alongside the same, the CMF has also been below the -0.05 mark over the past six weeks. This meant that significant capital flow was directed out of the markets, highlighting selling pressure.
Conclusion
The indicators aligned to show seller strength in recent weeks, and the prospects don’t look great for a bullish reversal. Buyers would want to wait for market sentiment to shift, while short-sellers would be interested in SAND’s reaction at the $1.19 and $1.53-levels, as well as a breakdown under the psychological $1-support.
Read the best crypto stories of the day in less than 5 minutes
Subscribe to get it daily in your inbox.


Please select your Email Preferences.
THE DAILY DIGEST
THE WEEKLY DIGEST
"""

In [43]:
from collections import Counter
sent = sent_tokenize(text)
sent = preprocessing(sent)

sent_count = countVect.transform(sent)
sent_tfidf = tfidf.transform(sent_count)

result = model.predict(sent_tfidf)
print(Counter(result))
for i in range(len(sent)):
  print(f"{sent[i]} => {result[i]}")

Counter({'negative': 11, 'neutral': 8, 'positive': 5})
disclaim find follow analysi sole opinion writer consid invest advic bitcoin domin took huge leap earlier month soar num num may touch num num may => neutral
surg meant bitcoin’ share total market capit cryptomarket rose huge amount even though price per bitcoin remain around – around numk => neutral
therefor altcoin shed valu much faster bitcoin longterm investor would wise remain cautiou movement metric => neutral
sandbox buy opportun longterm horizon investor yet present => positive
trend fact remain overwhelmingli bearish press time => negative
sand num hour chart sourc sandusdt tradingview num num numzon critic support level past three month => negative
price broken beneath press time sand trade num => negative
level act strong resist sand push north octob novemb last year => positive
next stronghold buyer lay around numarea num mark support level chart => negative
howev seri lower high lower low past month suggest buyer run h

## Majority Score

In [44]:
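def majority_score(scores: Counter):
  """Return (most frequent label, polarity) for a Counter of predicted labels."""
  sentiment = ''
  sent_score = 0
  size = 0

  # find the label with the highest count; on ties the first label seen wins
  for key in scores.keys():
    size += scores[key]
    if sent_score < scores[key]:
      sent_score = scores[key]
      sentiment = key

  # polarity in [-1, 1]: (positive - negative) / total; raises ZeroDivisionError if scores is empty
  polarity = (scores['positive'] - scores['negative']) / size
  return sentiment, polarity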

majority_score(Counter(result))

('negative', -0.25)
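
The polarity is the share of positive sentences minus the share of negative ones, so it ranges from -1 (all negative) to 1 (all positive), and neutral sentences only dilute it. A minimal sketch of just that calculation, using a hypothetical helper `polarity` that returns 0.0 for an empty counter (a design choice of this sketch, not something the notebook specifies):

```python
from collections import Counter

def polarity(scores):
    """(positive - negative) / total, or 0.0 when there are no sentences."""
    total = sum(scores.values())
    if total == 0:
        return 0.0
    return (scores['positive'] - scores['negative']) / total

# counts from the prediction cell above: (5 - 11) / 24
print(polarity(Counter({'negative': 11, 'neutral': 8, 'positive': 5})))  # -0.25
print(polarity(Counter()))                                               # 0.0
```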