**Data-Preprocessing**

In [14]:
from google.colab import drive
drive.mount("/drive", force_remount=True)
import csv
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('stopwords')

fakeArticle = pd.read_csv('/drive/My Drive/Colab Notebooks/IST 664/2023/Fake.csv')
trueArticle = pd.read_csv('/drive/My Drive/Colab Notebooks/IST 664/2023/True.csv')
tweetsData = pd.read_csv('/drive/My Drive/Colab Notebooks/IST 664/2023/Tweets.csv')

fake = fakeArticle['text'][0:50]
true = trueArticle['text'][0:50]
print("fake data")
print(fake)
print("---------")
print("true data")
print(true)
print("tweets data")
print(tweetsData)

Mounted at /drive


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


fake data
0     Donald Trump just couldn t wish all Americans ...
1     House Intelligence Committee Chairman Devin Nu...
2     On Friday, it was revealed that former Milwauk...
3     On Christmas day, Donald Trump announced that ...
4     Pope Francis used his annual Christmas Day mes...
5     The number of cases of cops brutalizing and ki...
6     Donald Trump spent a good portion of his day a...
7     In the wake of yet another court decision that...
8     Many people have raised the alarm regarding th...
9     Just when you might have thought we d get a br...
10    A centerpiece of Donald Trump s campaign, and ...
11    Republicans are working overtime trying to sel...
12    Republicans have had seven years to come up wi...
13    The media has been talking all day about Trump...
14    Abigail Disney is an heiress with brass ovarie...
15    Donald Trump just signed the GOP tax scam into...
16    A new animatronic figure in the Hall of Presid...
17    Trump supporters and the so-call

**Data Preprocessing**

Lower-case

To maintain uniformity and consistency among the tokens format for better understanding and processing I am using islower() which makes all token to lower case leters.

Tokenization

To separate data into individual words I am using "punkt" tokenizer which separates by spaces and special characters(non-alphabetical characters)

Removing stop-words

Removing words which doesn't add any weightage to the content of the sentence

In [15]:
# create preprocess_text function
def preprocess_text(text):

    # Tokenize the text
    tokens = word_tokenize(text.lower())

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stopwords.words('english')]


    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]


    # Join the tokens back into a string
    #processed_text = ' '.join(lemmatized_tokens)
    #return processed_text
    return lemmatized_tokens

Tokenize test data(Fake and True news Article data)

In [16]:
sentsInEachFakeArticle = []*50
allSentsInFakeArticle = []
for i in range(50):
    currentArticleSents = nltk.sent_tokenize(fake[i])
    sentsInEachFakeArticle.append(currentArticleSents)
    for j in currentArticleSents:
        allSentsInFakeArticle.append(j)
print(sentsInEachFakeArticle)



In [17]:
sentsInEachTrueArticle = []*50
allSentsInTrueArticle = []
for i in range(50):
    currentArticleSents = nltk.sent_tokenize(true[i])
    sentsInEachTrueArticle.append(currentArticleSents)
    for j in currentArticleSents:
        allSentsInTrueArticle.append(j)
print(allSentsInTrueArticle)



Drop the unwanted columns from the training dataset

In [18]:
tweets = tweetsData.drop(['tweet_id', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'], axis=1)

In [19]:
print(tweets)

      airline_sentiment                                               text
0               neutral                @VirginAmerica What @dhepburn said.
1              positive  @VirginAmerica plus you've added commercials t...
2               neutral  @VirginAmerica I didn't today... Must mean I n...
3              negative  @VirginAmerica it's really aggressive to blast...
4              negative  @VirginAmerica and it's a really big bad thing...
...                 ...                                                ...
14635          positive  @AmericanAir thank you we got on a different f...
14636          negative  @AmericanAir leaving over 20 minutes Late Flig...
14637           neutral  @AmericanAir Please bring American Airlines to...
14638          negative  @AmericanAir you have my money, you change my ...
14639           neutral  @AmericanAir we have 8 ppl so we need 2 know h...

[14640 rows x 2 columns]


Pre-process the training data

In [20]:
nltk.download('wordnet')

tweets['text'] = tweets['text'].apply(preprocess_text)
print(tweets)

[nltk_data] Downloading package wordnet to /root/nltk_data...


      airline_sentiment                                               text
0               neutral           [@, virginamerica, @, dhepburn, said, .]
1              positive  [@, virginamerica, plus, 've, added, commercia...
2               neutral  [@, virginamerica, n't, today, ..., must, mean...
3              negative  [@, virginamerica, 's, really, aggressive, bla...
4              negative    [@, virginamerica, 's, really, big, bad, thing]
...                 ...                                                ...
14635          positive  [@, americanair, thank, got, different, flight...
14636          negative  [@, americanair, leaving, 20, minute, late, fl...
14637           neutral  [@, americanair, please, bring, american, airl...
14638          negative  [@, americanair, money, ,, change, flight, ,, ...
14639           neutral  [@, americanair, 8, ppl, need, 2, know, many, ...

[14640 rows x 2 columns]


In [21]:
tweetscontent = list(zip(tweets['text'], tweets['airline_sentiment']))
print(tweetscontent)



Define set of words that will be used for features


It is imperative to establish the specific set of words that will serve as the basis for feature extraction. This set comprises all the words found within the entire document collection, with the exception that we will confine it to include only the 2000 words that occur with the highest frequency. This process is pivotal in shaping the scope and efficiency of our analysis, as it enables us to focus on the most significant and commonly appearing terms in the corpus while disregarding the less prevalent ones.

In [22]:
import random
random.shuffle(tweetscontent)

all_words_list = [word for (sent,cat) in tweetscontent for word in sent]
all_words = nltk.FreqDist(all_words_list)

# get the 2000 most frequently appearing keywords in the corpus
word_items = all_words.most_common(2000)
word_features = [word for (word,count) in word_items]
print(word_features[:50])

['@', '.', '!', '?', 'flight', 'united', ',', '#', 'usairways', 'americanair', 'southwestair', 'jetblue', "n't", ':', "'s", 'get', 'http', 'hour', 'thanks', 'cancelled', 'u', 'service', ';', 'time', 'customer', '...', 'help', '&', 'bag', 'plane', '-', "'m", 'amp', ')', 'hold', 'need', '2', 'would', 'thank', 'one', 'still', 'please', 'call', 'airline', 'day', 'gate', 'ca', 'delayed', 'back', 'virginamerica']


In [23]:
negationwords = ['no', 'not', 'never', 'none', 'nowhere', 'nothing', 'noone', 'rather', 'hardly', 'scarcely', 'rarely', 'seldom', 'neither', 'nor']

Define Bag-Of-Words


We have the ability to specify features for each document by utilizing only the words, often referred to as BOW (Bag of Words) or unigram features. For every keyword (or word) within the word_features set, we will assign a feature label in the format 'contains(keyword),' and the feature value will be a Boolean, signifying whether the document includes that specific word.

Negating opinions plays a vital role in opinion classification. In this approach, we employ a straightforward method. We identify negation terms like "not," "never," and "no," including those within contractions like "doesn't." Our approach involves negating the word immediately following a negation term, as opposed to strategies that negate words until the next punctuation mark or use syntax to determine the extent of the negation. Thus, when processing the document, we adhere to the first strategy, modifying the feature to represent a negated word if it follows a negation term.

In [24]:
def NOT_features(document, word_features, negationwords):
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = False
        features['V_NOT{}'.format(word)] = False
    # go through document words in order
    for i in range(0, len(document)):
        word = document[i]
        if ((i + 1) < len(document)) and ((word in negationwords) or (word.endswith("n't"))):
            i += 1
            features['V_NOT{}'.format(document[i])] = (document[i] in word_features)
        else:
            features['V_{}'.format(word)] = (word in word_features)
    return features
# define the feature sets
NOT_featuresets = [(NOT_features(d, word_features, negationwords), c) for (d, c) in tweetscontent]

print(NOT_featuresets[0])



Building the Model(NaiveBayes Classifier)

I have used NaiveBayesClassifier and K-Fold Cross Validation to build the model with k=5

In [25]:
import numpy as np
from sklearn.model_selection import KFold
k=5
kf = KFold(n_splits=k, shuffle=True, random_state=42)
accuracies = []
true_labels = []
predicted_labels = []

for train_index, test_index in kf.split(NOT_featuresets):
    train_set = [NOT_featuresets[i] for i in train_index]
    test_set = [NOT_featuresets[i] for i in test_index]

    # Train a Naive Bayes classifier
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Evaluate the classifier on the test set
    accuracy = nltk.classify.accuracy(classifier, test_set)
    accuracies.append(accuracy)

    for features, label in test_set:
        true_labels.append(label)
        predicted_labels.append(classifier.classify(features))

# Calculate and print the average accuracy
avg_accuracy = np.mean(accuracies)
print(f"Average Accuracy: {avg_accuracy:.2f}")

Average Accuracy: 0.78


Average Accuracy of the model is almost same for both the feature sets with difference of 0.01

In [26]:
from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
f1 = f1_score(true_labels, predicted_labels, average='weighted')

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")

Precision: 0.79
Recall: 0.78
F1-Score: 0.78


Precision, recall and F1-Score are also same for the two models

In [27]:
classifier.show_most_informative_features(30)

Most Informative Features
                    V_hr = True           negati : positi =     36.8 : 1.0
              V_passbook = True           positi : negati =     35.1 : 1.0
                V_street = True           neutra : negati =     33.8 : 1.0
              V_favorite = True           positi : negati =     32.5 : 1.0
               V_awesome = True           positi : neutra =     31.4 : 1.0
               V_helpful = True           positi : neutra =     30.1 : 1.0
                 V_kudos = True           positi : negati =     28.9 : 1.0
                V_battle = True           neutra : negati =     28.0 : 1.0
                V_dragon = True           neutra : negati =     28.0 : 1.0
           V_outstanding = True           positi : negati =     27.3 : 1.0
                  V_wall = True           neutra : negati =     26.1 : 1.0
                 V_raise = True           positi : negati =     24.7 : 1.0
             V_fantastic = True           positi : negati =     22.6 : 1.0

classify the sentiment polarity of fake and true news article

In [28]:
nltk.download('punkt')
faketexttokens = [nltk.word_tokenize(text) for text in allSentsInFakeArticle]
truetexttokens = [nltk.word_tokenize(text) for text in allSentsInTrueArticle]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [29]:
print(faketexttokens)



In [30]:
fakefeatureset = [NOT_features(token, word_features, negationwords) for token in faketexttokens]
print(fakefeatureset[0])
classifier.classify(fakefeatureset[0])



'neutral'

In [31]:
truefeatureset = [NOT_features(token, word_features, negationwords) for token in truetexttokens]
print(truefeatureset[0])
classifier.classify(truefeatureset[0])



'neutral'

**Sentiment Polarity – Analysis of Fake and Real “text”**

In [46]:
fakeposcount = [None]*50
fakenegcount = [None]*50
for i in range(len(sentsInEachFakeArticle)):
    currpos = 0
    currneg = 0
    currneu = 0
    faketokens = [nltk.word_tokenize(text) for text in sentsInEachFakeArticle[i]]
    for token in faketokens:
        feature = NOT_features(token, word_features, negationwords)
        output = classifier.classify(feature)
        if output == 'positive':
             currpos = currpos + 1
        if output == 'negative':
             currneg = currneg + 1
        if output == 'neutral':
          currneu = currneu + 1
    fakeposcount[i] = currpos
    fakenegcount[i] = currneg

In [45]:
trueposcount = [None]*50
truenegcount = [None]*50
for i in range(len(sentsInEachTrueArticle)):
    currposi = 0
    currnegi = 0
    currneut = 0
    truetokens = [nltk.word_tokenize(text) for text in sentsInEachTrueArticle[i]]
    for token in truetokens:
        feature = NOT_features(token, word_features, negationwords )
        output = classifier.classify(feature)
        if output == 'positive':
             currposi = currposi + 1
        if output == 'negative':
             currnegi = currnegi + 1
        if output == 'neutral':
            currneut = currneut +1
    trueposcount[i] = currposi
    truenegcount[i] = currnegi

In [44]:
print("comparative analysis of the positive sentences in Fake and True")
for i in range(50):
  if fakeposcount[i] <= trueposcount[i]:
      print(fakeposcount[i], "<=", trueposcount[i])
  else:
      print(fakeposcount[i], ">", trueposcount[i])

comparative analysis of the positive sentences in Fake and True
13 > 5
5 <= 5
6 <= 7
7 > 4
8 <= 11
3 <= 4
4 <= 8
2 <= 3
7 > 3
9 > 2
5 <= 7
0 <= 4
7 > 2
6 > 3
7 > 5
6 > 4
11 <= 11
5 <= 6
13 > 1
5 > 3
5 > 1
10 > 6
7 > 2
4 > 3
5 <= 10
7 > 4
4 <= 4
4 <= 6
1 > 0
13 > 6
8 > 4
7 > 2
4 > 2
6 <= 6
4 <= 16
4 <= 6
9 > 0
4 <= 7
7 > 3
15 > 0
10 > 6
6 > 2
2 <= 5
9 > 0
10 > 3
7 > 3
9 > 3
6 > 4
4 > 0
6 > 0


In [48]:
print("comparative analysis of the negative sentences in Fake and True")
for i in range(50):
  if fakenegcount[i] <= truenegcount[i]:
      print(fakenegcount[i], "<=", truenegcount[i])
  else:
      print(fakenegcount[i], ">", truenegcount[i])

comparative analysis of the negative sentences in Fake and True
2 <= 5
2 <= 2
12 > 1
2 <= 5
5 <= 11
7 <= 7
3 <= 7
3 > 1
2 > 0
1 > 0
5 > 3
5 > 1
6 > 0
0 <= 0
5 > 3
3 > 2
2 <= 6
2 > 0
3 > 2
6 > 4
6 <= 6
0 <= 3
3 <= 4
3 > 2
3 <= 3
3 <= 6
3 > 0
5 > 3
4 > 0
2 > 1
3 > 0
3 > 0
5 > 0
5 > 1
2 <= 8
2 <= 7
0 <= 8
4 <= 7
7 > 4
4 > 2
5 > 0
1 <= 4
3 <= 3
5 > 0
0 <= 3
1 > 0
5 <= 5
5 > 1
3 > 1
3 > 2




The comparative analysis of positive and negative sentences in the "True" and "Fake" categories reveals distinct patterns and relationships between numbers. In the positive sentences, greater-than, less-than, and equality relationships depict varying magnitudes, which can be essential for decision-making and trend identification. Similarly, the negative sentences provide insights into the relationships between numbers, highlighting situations where values are less than or equal to each other.

In [None]:
Saving the output to CSV file

In [36]:
import csv
import pandas as pd

fakeFile = "/drive/My Drive/Colab Notebooks/IST 664/2023/Output/Fake.csv"
trueFile = "/drive/My Drive/Colab Notebooks/IST 664/2023/Output/True.csv"

column_name1 = "text"
column_name2 = "the number of positive sentences in text"
column_name3 = "the number of negative sentences in text"
header = [ "text", "the number of positive sentences", "the number of negative sentences"]

with open(fakeFile, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(header)
    for a1, a2, a3 in zip(fake, fakeposcount, fakenegcount):
        writer.writerow([a1, a2, a3])

with open(trueFile, 'w', newline='') as file:
    writer1 = csv.writer(file)
    writer1.writerow(header)
    for a1, a2, a3 in zip(true, trueposcount, truenegcount):
        writer1.writerow([a1, a2, a3])