**Twitter Sentiment Analysis**

Data source: [Kaggle](https://www.kaggle.com/code/stoicstatic/twitter-sentiment-analysis-for-beginners/data)

The dataset being used is the sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = Negative, 4 = Positive) and they can be used to detect sentiment.

It contains the following 6 fields:

- sentiment: the polarity of the tweet (0 = negative, 4 = positive)
- ids: the id of the tweet (2087)
- date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- flag: the query (lyx). If there is no query, then this value is NO_QUERY.
- user: the user that tweeted (robotickilldozr)
- text: the text of the tweet (Lyx is cool)

We require only the sentiment and text fields, so we discard the rest.

We also remap the sentiment field so that its values are 0 = Negative and 1 = Positive.
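
As a minimal sketch of the two steps just described (keeping only the sentiment and text fields, then relabelling 4 as 1), the snippet below builds a tiny stand-in DataFrame instead of reading the real file. The two rows and the user names are invented for illustration; only the column names come from the field list above.

```python
import pandas as pd

# Column names follow the field list above; the rows are made-up stand-ins
columns = ["sentiment", "ids", "date", "flag", "user", "text"]
rows = [
    [0, 1, "Sat May 16 23:58:44 UTC 2009", "NO_QUERY", "user_a", "this is awful"],
    [4, 2, "Sat May 16 23:58:45 UTC 2009", "NO_QUERY", "user_b", "Lyx is cool"],
]
tweets = pd.DataFrame(rows, columns=columns)

# Keep only the two fields we need
tweets = tweets[["sentiment", "text"]].copy()

# Relabel positive tweets from 4 to 1; negative tweets stay 0
tweets["sentiment"] = tweets["sentiment"].replace({4: 1})

print(tweets)
```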

**Import and setup stuff**

In [None]:
# utilities
import os
import re
import pickle

import numpy as np
import pandas as pd

# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# sklearn
from sklearn.svm import LinearSVC

**Importing dataset**

In [None]:
#Running this cell will provide you with a token to link your drive to this notebook
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# This pickle holds a fitted BernoulliNB model, not a DataFrame,
# so DataFrame methods such as .info() raise AttributeError on it
model = pd.read_pickle('/content/drive/MyDrive/Fix upload github/Sentiment-BNB.pkl')
model

BernoulliNB(alpha=2)

In [None]:
# Load the labelled tweets; the CSV has no header row, so we name the columns ourselves
labels_csv = pd.read_csv("drive/My Drive/Twitter/twitter_training.csv",
                         names=["Id","Entity","Target","Text"],header=None)
# Drop the Entity and Id columns
labels_csv = labels_csv.drop(['Entity','Id'], axis=1)

# Put Text first and Target second; .copy() gives an independent DataFrame
data = labels_csv[['Text','Target']].copy()

# Summarise columns, dtypes and non-null counts
print(data.info())

# Print the first 3 rows
print(data.head(3))

# Check the dimensions
data.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74682 entries, 0 to 74681
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    73996 non-null  object
 1   Target  74682 non-null  object
dtypes: object(2)
memory usage: 1.1+ MB
None
                                                Text    Target
0  im getting on borderlands and i will murder yo...  Positive
1  I am coming to the borders and I will kill you...  Positive
2  im getting on borderlands and i will kill you ...  Positive


(74682, 2)

In [None]:
# examining closer, we find there are 4909 duplicate rows
np.sum(data.duplicated())

4909

In [None]:
# let's drop the duplicates
df = data.drop_duplicates()
df.shape

(69773, 2)

In [None]:
df['Target'].value_counts()

Negative      21238
Positive      19139
Neutral       17111
Irrelevant    12285
Name: Target, dtype: int64

In [None]:
# Label the target: -1 = Negative, 0 = Neutral or Irrelevant, 1 = Positive
label_map = {"Positive": 1, "Neutral": 0, "Irrelevant": 0, "Negative": -1}

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df = df.assign(Sentiment=df["Target"].map(label_map))

In [None]:
# Check the data for completeness (run after Sentiment has been created)

print(f"{df['Text'].isna().sum()} rows have no Text")
print(f"{df['Target'].isna().sum()} rows have no Target")
print(f"{df['Sentiment'].isna().sum()} rows have no Sentiment")

4 rows have no Text
0 rows have no Target
0 rows have no Sentiment
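
The completeness check reports a few rows with no Text. If such rows reach the cleaning step, `str()` would turn each missing value into the literal string "nan". One option, sketched here on a made-up toy frame rather than the notebook's `df`, is to drop those rows first. Whether to drop or fill them is a judgment call, so treat this as an illustration only.

```python
import numpy as np
import pandas as pd

# Toy data with one missing tweet text (values invented for illustration)
toy = pd.DataFrame({
    "Text": ["good game", np.nan, "bad patch"],
    "Target": ["Positive", "Neutral", "Negative"],
})

# Drop rows whose Text is missing and renumber the index
cleaned = toy.dropna(subset=["Text"]).reset_index(drop=True)

print(f"{len(toy) - len(cleaned)} row(s) dropped")
print(cleaned)
```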


In [None]:
df.head(3)

                                                Text    Target  Sentiment
0  im getting on borderlands and i will murder yo...  Positive          1
1  I am coming to the borders and I will kill you...  Positive          1
2  im getting on borderlands and i will kill you ...  Positive          1


In [None]:
# English stop words; needs the NLTK "stopwords" corpus (nltk.download("stopwords"))
stop_words = set(stopwords.words("english"))

**Text Cleaner**

In [None]:
# Remove digits; regex=True is required for the \d pattern in recent pandas versions
df["Text"] = df["Text"].str.replace(r"\d", "", regex=True)

In [None]:
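# Create the lemmatizer and stemmer once instead of once per token
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def cleaner(data):
    # Tokenize after stripping apostrophes and lowercasing
    tokens = word_tokenize(str(data).replace("'", "").lower())

    # Keep alphabetic tokens only (drops punctuation and numbers)
    without_punc = [w for w in tokens if w.isalpha()]

    # Remove stop words
    without_sw = [t for t in without_punc if t not in stop_words]

    # Lemmatize, then stem
    lemmas = [lemmatizer.lemmatize(t) for t in without_sw]
    text_cleaned = [stemmer.stem(w) for w in lemmas]

    return " ".join(text_cleaned)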
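
To see the first three stages of this pipeline without downloading any NLTK data, here is a simplified stand-in that uses only built-in string methods. It is a sketch, not the notebook's `cleaner`: it splits on whitespace instead of calling `word_tokenize`, it uses a tiny hypothetical stop-word set, and it skips lemmatization and stemming.

```python
# Tiny hypothetical stop-word list; the notebook uses NLTK's English list instead
STOP_WORDS = {"i", "am", "the", "and", "will", "you", "to", "on"}

def simple_clean(text):
    # Lowercase, drop apostrophes, then split on whitespace
    tokens = str(text).replace("'", "").lower().split()
    # Keep purely alphabetic tokens (numbers and punctuation-bearing tokens are dropped)
    alpha = [t for t in tokens if t.isalpha()]
    # Remove stop words and join the remainder back into one string
    return " ".join(t for t in alpha if t not in STOP_WORDS)

print(simple_clean("I am coming to the borders and I will kill you all,"))
```

Note that whitespace splitting leaves the trailing comma attached to "all,", so that token is discarded entirely, whereas a real tokenizer would separate the word from the comma.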

In [None]:
import nltk
# Extra WordNet data that WordNetLemmatizer may need, depending on the NLTK version
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
df["Text"] = df["Text"].apply(cleaner)
df["Text"].head()

0     im get borderland murder
1             come border kill
2       im get borderland kill
3    im come borderland murder
4     im get borderland murder
Name: Text, dtype: object

In [None]:
# Remove the standalone token "im"; \b keeps words such as "time" intact
df["Text"] = df["Text"].str.replace(r"\bim\b", " ", regex=True)
df["Text"].head()

0       get borderland murder
1            come border kill
2         get borderland kill
3      come borderland murder
4       get borderland murder
Name: Text, dtype: object
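
Removing the token "im" by plain substring replacement would also cut into words that merely contain those letters, such as "time". The short comparison below, using only Python's `re` module and an invented example string, shows the difference between a substring replace and a word-boundary pattern.

```python
import re

# Invented example containing "im" both as a token and inside other words
text = "im get time improv"

# Substring replace: every occurrence of "im" is removed, even inside words
naive = text.replace("im", " ")

# Word-boundary replace: only the standalone token "im" is removed
bounded = re.sub(r"\bim\b", " ", text)

print(repr(naive))
print(repr(bounded))
```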

**NLP with `spaCy` library**

In [None]:
# download spaCy's small English model
!python3 -m spacy download en_core_web_sm
import spacy 
import en_core_web_sm
nlp = en_core_web_sm.load()

**Preprocessing**

In [None]:
# We want to also keep #hashtags as a token, so we will modify the spaCy model's token_match
import re
# Retrieve the default token-matching regex pattern
re_token_match = spacy.tokenizer._get_regex_pattern(nlp.Defaults.token_match)
# Add #hashtag pattern
re_token_match = f"({re_token_match}|#\\w+)"
nlp.tokenizer.token_match = re.compile(re_token_match).match
# Now let's try again
s = "2020 can't get any worse #ihate2020 @bestfriend <https://t.co>"
doc = nlp(s)
# Let's look at the lemma and stop-word flag of each token
print("Token\t\tLemma\t\tStopword")
print("="*40)
for token in doc:
    print(f"{token}\t\t{token.lemma_}\t\t{token.is_stop}")
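
As far as the tokenizer is concerned, `token_match` is a callable tried on each whitespace-separated chunk, and a chunk it matches is kept as a single token. The sketch below uses only Python's `re` module (no spaCy) to show which chunks of an example sentence the added `#\w+` alternative would claim. It isolates the hashtag part and is not the full tokenizer.

```python
import re

# Same hashtag pattern as in the cell above; .match only anchors at the start of the chunk
hashtag_match = re.compile(r"#\w+").match

sentence = "2020 can't get any worse #ihate2020"
for chunk in sentence.split():
    # A truthy result means this chunk would be kept as one token
    print(f"{chunk}\t{bool(hashtag_match(chunk))}")
```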

In [None]:
#Assign comments
comments = data.loc[0, 'Text']
print(comments)

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
[sent.text for sent in nlp(comments).sents]

In [None]:
%%timeit -n 10
# SpaCy with DependencyParser
nlp = spacy.load("en_core_web_sm")
data.loc[:5000, "Text"].apply(lambda x: [sent.text for sent in nlp(x).sents])

In [None]:
from spacy.lang.en import English
nlp_en = English()

In [None]:
data['Text'] = data['Text'].apply(lambda x: [sent.text for sent in nlp(x).sents])

In [None]:
def preprocess(text):
  text = text.str.replace("\n", " ")
  return text

data['Text'] = preprocess(data['Text'])

**Tokenization**

In [None]:
df = pd.DataFrame({'sentences': ['This is a very good site. I will recommend it to others.', 'Can you please give me a call at 9983938428. have issues with the listings.', 'good work! keep it up']})
df

In [None]:
df = data.apply(lambda row: nltk.word_tokenize(row['Text']), axis=1)
df

In [None]:
def clean_file(text):
    text = text.lower()
    return text.replace("\n","")

In [None]:
x = data['Text'].apply(clean_file)
x 

In [None]:
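# Create the lemmatizer and stemmer once instead of once per token
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def cleaner(main):
    # Tokenize after stripping apostrophes and lowercasing
    tokens = word_tokenize(str(main).replace("'", "").lower())

    # Keep alphabetic tokens only (drops punctuation and numbers)
    without_punc = [w for w in tokens if w.isalpha()]

    # Remove stop words
    without_sw = [t for t in without_punc if t not in stop_words]

    # Lemmatize, then stem
    lemmas = [lemmatizer.lemmatize(t) for t in without_sw]
    text_cleaned = [stemmer.stem(w) for w in lemmas]

    return " ".join(text_cleaned)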

In [None]:
# Features and labels; clean_text was never defined, so apply cleaner from above
x = data['Text']
y = data['Target']
x = x.apply(cleaner)