<a href="https://www.kaggle.com/code/srivabhi22/sentiment-analysis-using-fasttext?scriptVersionId=187387804" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<h1 align="center"> Sentiment Analysis and Classification using FastText </h1>

<h2>Before we Begin:  </h2>
If you found this notebook useful, please upvote it. Upvotes help me keep making notebooks like this one and build my data science profile.

<h2> Introduction </h2>
In this notebook, we will be inspecting the financial news dataset, exploring it and performing sentiment classification using FastText!


<h2> Objectives of this notebook: </h2>
<ul>
<li> Studying the provided financial news dataset </li>
<li> Preprocessing the text into a form suitable for modelling </li>
<li> Using the FastText library to learn word embeddings and train a classifier on top of them </li>
<li> Evaluating model accuracy and performance on held-out test data </li>
</ul>


<h2> Outline: </h2>
I. <b>Understanding our data</b><br>
II. <b>Preprocessing</b><br>
III. <b>Visualization of our data</b><br>
IV. <b>Feature extraction using FastText</b><br>
V. <b>Model Training and Evaluation</b>
        

<h2>Importing the necessary packages  </h2>


In [None]:
import random
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re
from wordcloud import WordCloud
import plotly.express as px
import plotly.graph_objects as go
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
import fasttext
import warnings
warnings.filterwarnings('ignore')

## Reading the dataset

In [None]:
data = pd.read_csv('/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv', encoding_errors='ignore', names=["sentiment", "content"])
data

<h2>Data Preprocessing and EDA (Exploratory Data Analysis)</h2>


### Checking for null values

In [None]:
data.isnull().sum() # no missing values

### Checking for duplicates

In [None]:
print(f" Duplicated before : {data.duplicated().sum()}") # 6 duplicates
data.drop_duplicates(keep='first',inplace=True,ignore_index=True)
print(f" Duplicated after : {data.duplicated().sum()}")

<h2> Producing the corpus </h2>
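1. Keeping only alphanumeric characters and whitespace (the digits of dates and figures survive, their separators do not) <br>
2. Lowercasing all characters <br>
3. Tokenization (not performed in this cell; FastText splits its input on whitespace at training time) <br>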

In [None]:
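corpus = []
for sentence in data['content']:
    # Drop everything except letters, digits, and whitespace
    sentence = re.sub(r'[^a-zA-Z0-9\s]', '', sentence)
    sentence = sentence.lower()
    corpus.append(sentence)

# Assign the list directly; wrapping it in a one-column DataFrame is unnecessary
data['corpus'] = corpus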
data
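
To see what the cleaning step does to a single headline, here is a minimal sketch that wraps the same regular expression in a function. The name `clean_sentence` and the sample sentence are made up for illustration and are not part of the dataset or the notebook.

```python
import re

def clean_sentence(sentence: str) -> str:
    """Drop every character that is not a letter, digit, or whitespace, then lowercase."""
    return re.sub(r'[^a-zA-Z0-9\s]', '', sentence).lower()

# Hypothetical headline, written only to illustrate the behaviour
sample = "Operating profit rose to EUR 13.1 mn from EUR 8.7 mn, up 50%."
cleaned = clean_sentence(sample)
print(cleaned)

# No tokenization happens in the cleaning step itself;
# a plain whitespace split gives the tokens if we need them.
print(cleaned.split())
```

Note that punctuation is deleted rather than replaced by a space, so a decimal such as `13.1` ends up as `131`. Whether that matters depends on how much weight the figures should carry in the classifier.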

### Analysing the distribution of sentiment labels in the data

In [None]:
unique_vals, counts = np.unique(data['sentiment'], return_counts=True)
df = pd.DataFrame({'Sentiment': unique_vals, 'Counts': counts})

fig = px.histogram(df, x='Sentiment', y='Counts', title='Sentiment labels v/s counts', histfunc='sum')
fig.show()

### Word clouds of the words associated with each sentiment

In [None]:
wordcloud1 = WordCloud(random_state=0,normalize_plurals=False,width=400,height=300)
wordcloud2 = WordCloud(random_state=42,normalize_plurals=False,width=400,height=300)
wordcloud3 = WordCloud(random_state=32,normalize_plurals=False,width=400,height=300)

# Join the full text of each class; str() on a Series yields only a truncated preview
wc1=wordcloud1.generate(' '.join(data[data['sentiment']=="negative"]['corpus']))
wc2=wordcloud2.generate(' '.join(data[data['sentiment']=="neutral"]['corpus']))
wc3=wordcloud3.generate(' '.join(data[data['sentiment']=="positive"]['corpus']))

plt.figure(figsize=(15,10))
plt.subplot(1,3,1)
plt.title('Negative')
plt.imshow(wc1,interpolation='bilinear')
plt.axis('off')

plt.subplot(1,3,2)
plt.title('Neutral')
plt.imshow(wc2,interpolation='bilinear')
plt.axis('off')

plt.subplot(1,3,3)
plt.title('Positive')
plt.imshow(wc3,interpolation='bilinear')
plt.axis('off')

plt.show()
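
A word cloud is easy to read but hard to compare across classes, so a plain frequency table can complement it. The sketch below counts whitespace-separated tokens with `collections.Counter`. The helper `top_words` and the toy sentences are assumptions for illustration; in the notebook, the input would be something like `data[data['sentiment'] == 'positive']['corpus']`.

```python
from collections import Counter

def top_words(texts, n=10, stop_words=frozenset()):
    """Return the n most frequent whitespace-separated tokens across texts."""
    counts = Counter(
        token
        for text in texts
        for token in text.split()
        if token not in stop_words
    )
    return counts.most_common(n)

# Toy stand-in for the cleaned sentences of one sentiment class
toy_positive = [
    "profit rose in the quarter",
    "net profit rose sharply",
    "sales rose",
]
top_two = top_words(toy_positive, n=2, stop_words={"in", "the"})
print(top_two)
```

Passing a stop-word set (for example the NLTK English list already imported above) keeps filler words from dominating the table.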

## Feature Extraction using FastText 
### (Creating Word Embeddings)

In [None]:
data['sentiment']=data['sentiment'].apply(lambda x:  '__label__' + x)
data['sentiment_corpus']=data['sentiment'] + ' ' + data['corpus']

In [None]:
data
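
FastText's supervised mode expects one example per line, with the label marked by the `__label__` prefix and followed by the text. The sketch below shows that line format on its own, outside pandas. The helper names `to_fasttext_line` and `parse_fasttext_line` are hypothetical and assume a single label per line, as in this notebook.

```python
LABEL_PREFIX = "__label__"

def to_fasttext_line(label: str, text: str) -> str:
    """Build one training line: the prefixed label, a space, then the text."""
    return f"{LABEL_PREFIX}{label} {text}"

def parse_fasttext_line(line: str):
    """Split a single-label training line back into (label, text)."""
    first, _, text = line.strip().partition(" ")
    if not first.startswith(LABEL_PREFIX):
        raise ValueError(f"line does not start with {LABEL_PREFIX!r}")
    return first[len(LABEL_PREFIX):], text

# Toy example, not taken from the dataset
line = to_fasttext_line("positive", "net sales increased in the third quarter")
print(line)
print(parse_fasttext_line(line))
```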

### Splitting the data into train and test sets

In [None]:
X_train,X_test,y_train,y_test = train_test_split(data['sentiment_corpus'],data['sentiment'],test_size=0.2,random_state=42)

### Saving the data in the input format of FastText

In [None]:
X_train.to_csv("finance.train",columns=['sentiment_corpus'],index=False , header=False)
X_test.to_csv("finance.test",columns=['sentiment_corpus'],index=False , header=False)

In [None]:
print(X_train.shape)
print(X_test.shape)
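
Before training, it can be worth checking that every line in the saved files really starts with a label and that the class mix looks similar in both splits. Here is a minimal sketch of such a check. The function `label_counts` is a hypothetical helper, and the demo writes a small temporary file instead of touching `finance.train`.

```python
import os
import tempfile
from collections import Counter

def label_counts(path):
    """Count the leading __label__ token of every non-empty line in a file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            label = line.split(maxsplit=1)[0]
            if not label.startswith("__label__"):
                raise ValueError(f"line does not start with a label: {line[:40]!r}")
            counts[label] += 1
    return counts

# Toy file standing in for finance.train
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.train")
    with open(toy_path, "w", encoding="utf-8") as handle:
        handle.write("__label__neutral the company has no plans to move\n")
        handle.write("__label__positive operating profit rose\n")
        handle.write("__label__neutral the deal covers two plants\n")
    toy_counts = label_counts(toy_path)

print(toy_counts)
```

In the notebook, the same function could be pointed at `finance.train` and `finance.test` to compare the label proportions of the two splits.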

### Model training

In [None]:
model = fasttext.train_supervised(input="/kaggle/working/finance.train",epoch=10,lr=0.1,wordNgrams=1)

### Model Evaluation

In [None]:
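# Each call returns (number of samples, precision@1, recall@1)
print(model.test("/kaggle/working/finance.train"))  # training-set metrics
print(model.test("/kaggle/working/finance.test"))   # held-out test metrics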
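
`model.test` returns a tuple of the sample count, precision at 1, and recall at 1. With exactly one label per example, those two numbers coincide with plain accuracy, so they say nothing about how each class fares. The sketch below computes per-class precision and recall from two lists of labels in pure Python. The helper `per_class_metrics` and the toy labels are assumptions for illustration; in the notebook the predictions could be collected with something like `model.predict(text)[0][0]` for each test sentence.

```python
from collections import Counter

def per_class_metrics(true_labels, predicted_labels):
    """Return {label: {'precision', 'recall', 'support'}} for single-label predictions."""
    true_count = Counter(true_labels)
    pred_count = Counter(predicted_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)

    metrics = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        # Guard against division by zero when a class is never predicted or never present
        precision = hits[label] / pred_count[label] if pred_count[label] else 0.0
        recall = hits[label] / true_count[label] if true_count[label] else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "support": true_count[label],
        }
    return metrics

# Toy labels, not real model output
toy_true = ["neutral", "neutral", "positive", "negative"]
toy_pred = ["neutral", "positive", "positive", "neutral"]
toy_metrics = per_class_metrics(toy_true, toy_pred)
for label, scores in toy_metrics.items():
    print(label, scores)
```

Since `classification_report` from scikit-learn is already imported at the top of the notebook, it could be used on the same two lists instead of this hand-rolled version.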

### Saving the model

In [None]:
model.save_model("fasttext.bin")

In [None]:
# loaded_model=fasttext.load_model('/kaggle/working/fasttext.bin')
# loaded_model.predict(data['corpus'][180])