Title: Sentiment Analysis on a Custom Text Dataset    
Author: Raikibul Hasan

**Table of contents**<a id='toc0_'></a>    
- [Preparation](#toc1_)    
    - [Install libraries](#toc1_1_1_)    
    - [Import libraries](#toc1_1_2_)    
- [Data Exploration](#toc2_)  
    - [Load the dataset & overview of the data](#_)     
    - [Identify and handle missing data](#_)      
    - [Visualize the distribution](#_)   
- [Sentiment Analysis](#toc3_)

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Preparation](#toc0_)
This section installs and imports the required libraries and loads the dataset.

### <a id='toc1_1_1_'></a>[Install libraries](#toc0_)
***If the libraries are already installed, skip this step. Otherwise, set the `installed` variable to `False` and run the cells below.***

In [1]:
installed = True

In [70]:
def install_library():
    !pip install pandas
    !pip install numpy
    !pip install wordcloud
    !pip install nltk
    !pip install seaborn
    !pip install tensorflow
    !pip install matplotlib
    !pip install scikit-learn
    !pip install xgboost
    !pip install shap  # imported below, so it must be installed as well


if not installed:
    install_library()
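Instead of maintaining the `installed` flag by hand, the notebook could check which packages are actually importable. The following is only a sketch under assumptions: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names that are not part of the original notebook, and the mapping simply mirrors the packages listed above.

```python
import importlib.util

# Hypothetical mapping: pip package name -> top-level import name.
# The two differ for scikit-learn, which is imported as `sklearn`.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "wordcloud": "wordcloud",
    "nltk": "nltk",
    "seaborn": "seaborn",
    "tensorflow": "tensorflow",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
    "xgboost": "xgboost",
    "shap": "shap",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be located."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        # find_spec only locates a top-level module; it does not import it.
        if importlib.util.find_spec(module_name) is None
    ]


# Packages that would still need a `pip install` in this environment.
print(missing_packages())
```

An empty list would mean the install cell can be skipped; otherwise only the listed packages need installing.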


In [2]:
import pandas as pd
import numpy as np
from wordcloud import WordCloud
from collections import Counter
import matplotlib.pyplot as plt
import re
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout, Bidirectional, Concatenate, Flatten
import xgboost as xgb
import warnings
import shap



In [3]:
warnings.filterwarnings("ignore")
np.random.seed(42)
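Note that `np.random.seed(42)` seeds NumPy only. Since the notebook also imports TensorflowFlow and may rely on Python's own `random` module, a fuller seeding step might look like the sketch below. The `seed_everything` helper is a hypothetical name, not something defined in the original notebook, and the TensorFlow call is skipped if TensorFlow is not installed.

```python
import random

import numpy as np


def seed_everything(seed=42):
    """Seed Python, NumPy and (if available) TensorFlow random generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf  # optional: only seeded when installed

        tf.random.set_seed(seed)
    except ImportError:
        pass


seed_everything(42)
```

Even with all three seeded, some GPU operations can remain non-deterministic, so results may still vary slightly between runs.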

In [6]:
file_path = 'dataset.csv'
data = pd.read_csv(file_path)


In [13]:
data.head(50)

Unnamed: 0,date,id,content,username,like_count,retweet_count
0,2023-03-29 22:58:21+00:00,1641213230730051584,"Free AI marketing and automation tools, strate...",RealProfitPros,0.0,0.0
1,2023-03-29 22:58:18+00:00,1641213218520481805,@MecoleHardman4 Chat GPT says it’s 15. 😂,AmyLouWho321,0.0,0.0
2,2023-03-29 22:57:53+00:00,1641213115684536323,https://t.co/FjJSprt0te - Chat with any PDF!\n...,yjleon1976,0.0,0.0
3,2023-03-29 22:57:52+00:00,1641213110915571715,"AI muses: ""In the court of life, we must all f...",ChatGPT_Thinks,0.0,0.0
4,2023-03-29 22:57:26+00:00,1641213003260633088,Most people haven't heard of Chat GPT yet.\nFi...,nikocosmonaut,0.0,0.0
5,2023-03-29 22:57:20+00:00,1641212975012016128,@nytimes No! Chat Gpt has been putting togethe...,cordydbarb,0.0,0.0
6,2023-03-29 22:57:06+00:00,1641212917868646400,@ylzkrtt Yes also by chat gpt you can make gen...,gomezfidelphani,1.0,0.0
7,2023-03-29 22:57:02+00:00,1641212902375063552,@robinhanson @razibkhan Most people haven't he...,nikocosmonaut,0.0,0.0
8,2023-03-29 22:56:52+00:00,1641212856984109072,Yours Robotically - by Shaun Usher - Letters o...,lawyermarketer,0.0,0.0
9,2023-03-29 22:56:49+00:00,1641212845441585152,This is a metaphor for the limited perception ...,ashleighgrente2,2.0,0.0


In [11]:
data.isnull().sum()

date              0
id                6
content           6
username         34
like_count       62
retweet_count    62
dtype: int64

In [18]:
# Calculate the percentage of missing values for each column
missing_values = data.isnull().sum() / len(data) * 100
missing_values_table = pd.DataFrame({'Column': data.columns, 'Percentage Missing': missing_values})
missing_values_table.head()

Unnamed: 0,Column,Percentage Missing
date,date,0.0
id,id,0.0012
content,content,0.0012
username,username,0.0068
like_count,like_count,0.012399
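Because `head()` shows only the first five rows, the last column (`retweet_count`) is cut off in the table above. A more compact variant that lists every column, sorted by share of missing values, could look like the following sketch. It runs on a small made-up frame (`toy` and its columns are illustrative assumptions, not the notebook's data).

```python
import pandas as pd

# Made-up frame for illustration only; not the tweet dataset.
toy = pd.DataFrame({
    "a": [1, None, 3, 4],
    "b": [None, None, 1, 1],
    "c": [1, 2, 3, 4],
})

# isna() yields booleans, so the column mean is the fraction of missing values.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the notebook's `data`, the same expression would give one percentage per column without the need for a separate `Column` field.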


In [20]:
# Drop every row that contains at least one missing value
data = data.dropna()

In [21]:
data.isnull().sum()

date             0
id               0
content          0
username         0
like_count       0
retweet_count    0
dtype: int64

In [23]:
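# Keep only the columns used later. errors='ignore' makes the cell safe to re-run:
# without it, a second run raises KeyError because the columns are already gone.
data = data.drop(columns=['date', 'id', 'username', 'retweet_count'], errors='ignore')
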

In [24]:
data.head(2)

Unnamed: 0,content,like_count
0,"Free AI marketing and automation tools, strate...",0.0
1,@MecoleHardman4 Chat GPT says it’s 15. 😂,0.0
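The cleaning steps above first drop every row with any missing value and only then discard the unused columns, so a row can be lost merely because its `username` is missing. One possible alternative, sketched here on a made-up frame (`toy`, `used` and the sample rows are assumptions for illustration), is to select the needed columns first and drop rows only when those columns are missing.

```python
import numpy as np
import pandas as pd

# Made-up sample; the column names follow the notebook's dataset.
toy = pd.DataFrame({
    "content": ["good tool", None, "bad answer", "ok"],
    "username": ["a", "b", None, "d"],
    "like_count": [1.0, 0.0, 2.0, np.nan],
})

used = ["content", "like_count"]

# Order used in the notebook: drop rows with any missing value.
drop_all = toy.dropna()

# Alternative: keep the needed columns, then drop rows missing in those only.
drop_subset = toy[used].dropna(subset=used)

print(len(drop_all), len(drop_subset))
```

With only a handful of missing values in the real dataset the difference is negligible, so this is a design choice rather than a correction.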
