# Financial News Sentiment Prediction

Natural Language Processing (NLP) involves a set of techniques and methodologies used to enable computers to understand, interpret, and generate human language. Here is a general process for NLP:

Text Collection: The first step is to gather the data that will be used for NLP. This can be done by scraping websites, using existing datasets, or generating synthetic data.

Text Preprocessing: The collected text data must be cleaned, normalized, and transformed into a format that can be used by NLP algorithms. This process includes tasks such as tokenization (splitting text into words or phrases), part-of-speech tagging (assigning grammatical labels to words), and stemming or lemmatization (reducing words to their root form).

Feature Extraction: Next, the most relevant features must be extracted from the preprocessed text. This could include features like word frequency, n-grams (sequences of n words), or word embeddings (dense vector representations of words).
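As a rough illustration of the word-frequency and n-gram features mentioned above, here is a minimal sketch in plain Python. The sentence and the `ngrams` helper are made up for this example and are not part of the notebook's pipeline.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sentence, already lowercased; tokens are split on whitespace
tokens = "operating profit fell and net profit fell".split()

unigram_counts = Counter(tokens)            # word frequencies
bigram_counts = Counter(ngrams(tokens, 2))  # frequencies of 2-word sequences

print(unigram_counts["profit"])             # 2
print(bigram_counts[("profit", "fell")])    # 2
```

Counts like these can be used directly as features, while word embeddings (used later in this notebook) replace each word with a learned dense vector instead.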

Model Training: Once the features are extracted, a machine learning model is trained on the data. Depending on the task, different types of models can be used, such as supervised learning models (e.g., classification) or unsupervised learning models (e.g., clustering).

Model Evaluation: The performance of the model is evaluated using metrics such as accuracy, precision, recall, and F1 score. This step helps to identify the strengths and weaknesses of the model and can guide further improvements.
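The metrics named above can be computed by hand for a small case. The following sketch uses made-up labels (class ids 0, 1, 2) and a hypothetical helper `precision_recall_f1`; in practice a library such as scikit-learn would normally be used instead.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels for illustration only
y_true = [0, 1, 1, 2, 1, 0]
y_pred = [1, 1, 1, 2, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(accuracy, precision, recall, f1)
```

Here 4 of the 6 predictions are correct, and class 1 has 2 true positives, 1 false positive and 1 false negative, so its precision, recall and F1 are all 2/3.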

Model Deployment: After the model is trained and evaluated, it can be deployed to a production environment for use in applications such as chatbots, sentiment analysis, or text classification.

Model Maintenance: The NLP system must be continually monitored and updated to ensure that it remains accurate and effective. This may involve retraining the model on new data or making changes to the preprocessing or feature extraction steps.





# Importing the libraries

In [76]:
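import numpy as np
import pandas as pd
# Tokenizer builds the vocabulary and turns texts into integer sequences
from tensorflow.keras.preprocessing.text import Tokenizer
# pad_sequences pads the sequences to a common length
from tensorflow.keras.preprocessing.sequence import pad_sequences
# train_test_split splits the data into training and test sets
from sklearn.model_selection import train_test_split
# TensorFlow provides the Keras layers used to build the neural network
import tensorflow as tf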

# Loading the dataset

In [77]:
df=pd.read_csv('/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv',names=['Label','Text'],encoding='latin-1')
df

| | Label | Text |
|---|---|---|
| 0 | neutral | According to Gran , the company has no plans t... |
| 1 | neutral | Technopolis plans to develop in stages an area... |
| 2 | negative | The international electronic industry company ... |
| 3 | positive | With the new production plant the company woul... |
| 4 | positive | According to the company 's updated strategy f... |
| ... | ... | ... |
| 4841 | negative | LONDON MarketWatch -- Share prices ended lower... |
| 4842 | neutral | Rinkuskiai 's beer sales fell by 6.5 per cent ... |
| 4843 | negative | Operating profit fell to EUR 35.4 mn from EUR ... |
| 4844 | negative | Net sales of the Paper segment decreased to EU... |


# Getting Preliminary Information

In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4846 entries, 0 to 4845
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   4846 non-null   object
 1   Text    4846 non-null   object
dtypes: object(2)
memory usage: 75.8+ KB


# Preprocessing

In [79]:
def get_sequences(texts):
    # Create the tokenizer and build its vocabulary from the texts
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)
    # Replace each word with its integer index in the vocabulary
    sequences = tokenizer.texts_to_sequences(texts)
    # Pad every sequence with trailing zeros up to the length of the longest one
    max_seq_length = max(len(seq) for seq in sequences)
    sequences = pad_sequences(sequences, maxlen=max_seq_length, padding='post')
    return sequences

In [80]:
sequences=get_sequences(df['Text'])
sequences

array([[  94,    5, 3498, ...,    0,    0,    0],
       [ 840,  336,    5, ...,    0,    0,    0],
       [   1,  293,  656, ...,    0,    0,    0],
       ...,
       [  42,   31,  242, ...,    0,    0,    0],
       [  30,   27,    2, ...,    0,    0,    0],
       [  27,    3,   35, ...,    0,    0,    0]], dtype=int32)
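To make the structure of this array easier to read, here is a simplified stand-in for what `get_sequences` does, written in plain Python. It is only a sketch: the helper names and the three example texts are hypothetical, and the real Keras `Tokenizer` also strips punctuation, which this version does not.

```python
from collections import Counter

def fit_word_index(texts):
    """Give each word an integer id starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index):
    """Map words to ids, then pad with trailing zeros to the longest sequence."""
    seqs = [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]
    max_len = max(len(seq) for seq in seqs)
    return [seq + [0] * (max_len - len(seq)) for seq in seqs]

# Hypothetical mini-corpus
texts = ["Profit rose", "Profit fell sharply", "Sales rose"]
word_index = fit_word_index(texts)
padded = texts_to_padded(texts, word_index)

print(word_index)  # {'profit': 1, 'rose': 2, 'fell': 3, 'sharply': 4, 'sales': 5}
print(padded)      # [[1, 2, 0], [1, 3, 4], [5, 2, 0]]
```

The id 0 is never assigned to a word, so it can serve as the padding value, which is why the rows of the array above end in runs of zeros.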

In [81]:
def preprocess_inputs(df):
    df=df.copy()
    # Convert the texts to padded integer sequences
    sequences=get_sequences(df['Text'])
    # Map the sentiment labels to integer class ids
    label_mapping={'negative':0,
                  'neutral':1,
                  'positive':2}
    y=df['Label'].map(label_mapping)
    # Use 70% of the rows for training and hold out the remaining 30% for testing
    sequences_train,sequences_test,y_train,y_test=train_test_split(sequences,y,train_size=0.7)
    return sequences_train,sequences_test,y_train,y_test

In [82]:
sequences_train,sequences_test,y_train,y_test=preprocess_inputs(df)
print(sequences_train.shape)
print(sequences_test.shape)
print(y_train.shape)
print(y_test.shape)

(3392, 71)
(1454, 71)
(3392,)
(1454,)
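The shapes printed above follow from the split sizes. Below is a small worked sketch of the arithmetic. It assumes that the training share is rounded down and the test set receives the remainder, an assumption about the rounding rule that matches the printed shapes. The validation figures apply the same assumed rule to the `validation_split=0.2` and `batch_size=32` used later in `model.fit`.

```python
import math

n_rows = 4846            # rows reported by df.info()
train_size = 0.7         # from train_test_split
validation_split = 0.2   # from model.fit
batch_size = 32          # from model.fit

# Train/test split (assumed rule: round the training share down)
n_train = math.floor(n_rows * train_size)
n_test = n_rows - n_train

# Validation hold-out taken from the training rows (same assumed rule)
n_fit = math.floor(n_train * (1 - validation_split))
n_val = n_train - n_fit

# Number of mini-batches needed to cover the fitted rows once
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_train, n_test)                 # 3392 1454
print(n_fit, n_val, steps_per_epoch)   # 2713 679 85
```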


# What is TensorFlow?

TensorFlow is an open source software library developed by Google for building and training machine learning models. It provides a comprehensive set of tools for building and deploying machine learning models across a range of platforms, including desktops, mobile devices, and the cloud.

At its core, TensorFlow expresses numerical computations as data flow graphs, in which nodes represent mathematical operations and edges represent the data (tensors) flowing between them. TensorFlow 2.x runs operations eagerly by default and can compile Python functions into graphs with tf.function. It provides low-level APIs for building these computations, as well as the high-level Keras API for building and training machine learning models.

One of the key features of TensorFlow is its support for building deep neural networks, which are a class of machine learning models that are particularly effective for tasks such as image recognition and natural language processing. TensorFlow provides a wide range of pre-built layers and models for building deep neural networks, as well as APIs for building custom layers and models.

TensorFlow also provides a variety of tools for training machine learning models, including built-in optimizers for stochastic gradient descent and other optimization algorithms, as well as tools for monitoring training progress and visualizing results. It also supports distributed training across multiple machines or GPUs, which can significantly speed up training times for large models.

In addition to training models, TensorFlow also provides tools for deploying models in production environments. This includes support for exporting trained models to a variety of formats, as well as serving models using a variety of deployment targets, including mobile devices, web applications, and cloud services.

Another key feature of TensorFlow is its support for automatic differentiation, which is a technique for computing the gradient of a function with respect to its input parameters. This is particularly useful for training machine learning models, as it allows for efficient computation of gradients during the training process.
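To illustrate the idea of automatic differentiation without depending on TensorFlow, here is a minimal sketch using dual numbers (forward mode). This is not how TensorFlow implements it: TensorFlow records operations and differentiates in reverse mode through `tf.GradientTape`. The `Dual` class and the function `f` are hypothetical names used only for this example.

```python
import math

class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def tanh(x):
    """tanh with the chain rule applied: d/dx tanh(u) = (1 - tanh(u)^2) * u'."""
    t = math.tanh(x.value)
    return Dual(t, (1.0 - t * t) * x.deriv)

def f(w):
    # A one-parameter "neuron": tanh(3 * w + 1)
    return tanh(3 * w + 1)

w = Dual(0.5, 1.0)   # deriv=1.0 marks w as the variable to differentiate by
out = f(w)
print(out.value, out.deriv)   # f(0.5) and df/dw at w = 0.5
```

The derivative is computed exactly alongside the value, with no finite-difference step size to choose. The same principle, applied to millions of parameters, is what makes gradient-based training practical.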

Finally, TensorFlow also provides a variety of tools and APIs for working with data, including tools for reading and writing data in a variety of formats, as well as APIs for manipulating and transforming data.

Overall, TensorFlow covers the full workflow of working with data, training models, and deploying them to production. Its support for deep neural networks and distributed training makes it well suited to large-scale machine learning systems, and its open source license and active community make it a popular choice among researchers and practitioners.





# Training the Model

In [87]:
# Input layer: one integer token id per position, sequences_train.shape[1] positions per sample.
inputs=tf.keras.Input(shape=(sequences_train.shape[1],))

# Embedding layer: maps each integer token id to a dense vector of size output_dim.
# input_dim is the vocabulary size, which must be at least the largest token id plus one
# (id 0 is the padding value); it is hard-coded to 10123 here.
# input_length is the length of the input sequences.
x=tf.keras.layers.Embedding(input_dim=10123,output_dim=128,
                            input_length=sequences_train.shape[1]
                           )(inputs)

# GRU layer: a recurrent layer that reads the sequence step by step and returns its output
# at the last step. 256 is the number of units; tanh is the activation function.
x=tf.keras.layers.GRU(256,activation='tanh')(x)

# Output layer: a fully connected layer with 3 units (one per sentiment class) and a softmax
# activation, which produces a probability distribution over the classes.
outputs=tf.keras.layers.Dense(3,activation='softmax')(x)

# Wrap the input and output tensors in a tf.keras.Model.
model=tf.keras.Model(inputs=inputs,outputs=outputs)

# Compile with the Adam optimizer, sparse categorical cross-entropy loss (the labels are
# integer class ids), and accuracy as the evaluation metric.
model.compile(optimizer='adam',
             loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])

# Train the model. validation_split=0.2 holds out 20% of the training data for validation,
# batch_size sets the mini-batch size, and epochs caps the number of passes over the data.
# The EarlyStopping callback stops training once the validation loss has not improved for
# 3 consecutive epochs and restores the weights from the best epoch.
# The returned history object records the training and validation loss and accuracy per epoch.
history=model.fit(sequences_train,
                 y_train,validation_split=0.2,
                 batch_size=32,
                 epochs=100,
                 callbacks=[
                     tf.keras.callbacks.EarlyStopping(
                     monitor='val_loss',
                     patience=3,
                         restore_best_weights=True)
                 ])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
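Training stopped after 8 of the 100 allowed epochs, which is the early stopping callback at work. The sketch below mimics the patience rule in plain Python. The validation losses are made up for illustration, because the log above does not show them, and `early_stopping` is a hypothetical helper rather than the Keras implementation.

```python
def early_stopping(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both 1-based.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses: improvement until epoch 5, then none
val_losses = [0.95, 0.90, 0.88, 0.86, 0.85, 0.87, 0.86, 0.89, 0.84]
stopped_epoch, best_epoch = early_stopping(val_losses, patience=3)
print(stopped_epoch, best_epoch)   # 8 5
```

With `restore_best_weights=True`, the model would keep the weights from the best epoch (epoch 5 in this made-up series) rather than those from the last epoch that ran.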


In [88]:
model.evaluate(sequences_test,y_test)



[0.9368535280227661, 0.5852819681167603]
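The two numbers returned by `model.evaluate` are the loss and the accuracy, in the order given to `model.compile`. As a sketch of what they measure, the following computes sparse categorical cross-entropy and accuracy by hand for three made-up predictions. The probabilities, the labels and the helper `evaluate_by_hand` are hypothetical.

```python
import math

def evaluate_by_hand(probs, labels):
    """Mean sparse categorical cross-entropy and accuracy over a batch."""
    # Loss per sample: negative log of the probability given to the true class
    losses = [-math.log(row[label]) for row, label in zip(probs, labels)]
    # A prediction is the class with the highest probability
    predictions = [max(range(len(row)), key=row.__getitem__) for row in probs]
    accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
    return [sum(losses) / len(losses), accuracy]

# Made-up softmax outputs for 3 samples and their true class ids
probs = [[0.2, 0.5, 0.3],
         [0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6]]
labels = [1, 0, 1]

loss, accuracy = evaluate_by_hand(probs, labels)
print(loss, accuracy)
```

The third sample is misclassified (class 2 is predicted, class 1 is true), so the accuracy is 2/3, and that sample contributes the largest term, -ln(0.3), to the mean loss.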

In [89]:
model.predict(sequences_test)



array([[0.13867693, 0.5603542 , 0.30096886],
       [0.13867693, 0.5603542 , 0.3009689 ],
       [0.13867694, 0.5603542 , 0.3009689 ],
       ...,
       [0.13867694, 0.56035423, 0.3009689 ],
       [0.13867696, 0.56035423, 0.3009689 ],
       [0.13867696, 0.56035423, 0.3009689 ]], dtype=float32)
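Each row holds the predicted probabilities for the classes negative, neutral and positive, in the order of `label_mapping`. A short sketch of turning such rows into labels, using the first three rows printed above (the name `id_to_label` is introduced here for illustration):

```python
import numpy as np

# First three rows printed by model.predict above
probs = np.array([[0.13867693, 0.5603542, 0.30096886],
                  [0.13867693, 0.5603542, 0.3009689],
                  [0.13867694, 0.5603542, 0.3009689]])

# Inverse of the label_mapping used in preprocess_inputs
id_to_label = {0: 'negative', 1: 'neutral', 2: 'positive'}

# The predicted class is the column with the highest probability
pred_ids = probs.argmax(axis=1)
pred_labels = [id_to_label[int(i)] for i in pred_ids]
print(pred_labels)   # ['neutral', 'neutral', 'neutral']

# Largest difference between rows, per column
spread = probs.max(axis=0) - probs.min(axis=0)
print(spread.max() < 1e-6)   # True
```

The printed rows are identical to about seven decimal places, which suggests the model returns nearly the same distribution for these inputs regardless of the text. If that holds across the whole test set, the accuracy reported by `model.evaluate` simply reflects the share of neutral samples. One possible cause, stated here as an unverified assumption, is that the GRU reads long runs of trailing padding zeros; setting `mask_zero=True` on the `Embedding` layer is a common way to make recurrent layers skip padded positions.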

In [None]:
df.iloc[0,:]