
### **Linguistic Features Analysis**

Data
*   https://arxiv.org/abs/2008.00036 (Tweepfakes)


Resources for Linguistic Feature Extraction
*   https://github.com/sherbold/chatgpt-student-essay-study/blob/v1.1/calc_linguistic_features.ipynb




## **Pre-Processing**

In [None]:
!pip install lexicalrichness
!pip install spacy

In [None]:
import pandas as pd
import re
import numpy as np
import spacy
from scipy.interpolate import make_interp_spline
from collections import Counter
from lexicalrichness import LexicalRichness

In [None]:
nlp = spacy.load('en_core_web_sm')

# Dataframes
test = pd.read_csv('test.csv', sep=';')
train = pd.read_csv('train.csv', sep=';')
valid = pd.read_csv('validation.csv', sep=';')

In [None]:
# Remove titles or additional less relevant portions of texts

# Current format of TweepFakes : screen_name;text;account.type;class_type
    # Username
    # Tweet or Tweep
    # Human or bot
    # Type of bot

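# For Tweepfakes - drop username (0), drop type of bot (3)
# drop(..., inplace=True) modifies each DataFrame itself; `df = df.drop(...)`
# would only rebind the loop variable and leave test/train/valid unchanged.
df_list = [test, train, valid]
for df in df_list:
  df.drop(df.columns[[0, 3]], axis=1, inplace=True)

In [None]:
test.shape

(2558, 2)

In [None]:
train.shape

(20712, 2)

In [None]:
valid.shape

(2302, 2)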
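
Whether a loop over a list of DataFrames changes the originals depends on whether the loop body rebinds the loop variable or mutates the object in place. A minimal sketch on toy data; the rows are invented, `make_frame` is a hypothetical helper, and only the column names follow the format comment above:

```python
import pandas as pd

# Toy frames with the four TweepFake columns (values are invented)
def make_frame():
    return pd.DataFrame({
        "screen_name": ["user_a"],
        "text": ["hello world"],
        "account.type": ["bot"],
        "class_type": ["gpt2"],
    })

frames = [make_frame(), make_frame()]

# Rebinding: `df` points at a new object, the frames in the list are untouched
for df in frames:
    df = df.drop(df.columns[[0, 3]], axis=1)
shapes_after_rebind = [f.shape for f in frames]

# In place: each frame in the list loses the two columns
for df in frames:
    df.drop(df.columns[[0, 3]], axis=1, inplace=True)
shapes_after_inplace = [f.shape for f in frames]

print(shapes_after_rebind)   # [(1, 4), (1, 4)]
print(shapes_after_inplace)  # [(1, 2), (1, 2)]
```

Checking `.shape` right after such a loop is a quick way to confirm that the columns were really removed.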

In [None]:
# Preprocesses data with spaCy for later use

# Tweets - preprocess with spaCy

test["tw_spacy"] = test["text"].apply(lambda x: nlp(x))
test["tw_lemma"] = test["tw_spacy"].apply(lambda x: " ".join([y.lemma_ for y in x]))

train["tw_spacy"] = train["text"].apply(lambda x: nlp(x))
train["tw_lemma"] = train["tw_spacy"].apply(lambda x: " ".join([y.lemma_ for y in x]))

valid["tw_spacy"] = valid["text"].apply(lambda x: nlp(x))
valid["tw_lemma"] = valid["tw_spacy"].apply(lambda x: " ".join([y.lemma_ for y in x]))

# Type of account - check to make sure that entries are only 'human' or 'bot'

# Print all unique values or variations in the specified column
unique_values = train['account.type'].unique()

# Display the unique values
for value in unique_values:
    print(value)

bot
human


## Linguistic Features Analysis

I only use the training data (`train`) for the linguistic feature analysis.

In [None]:
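# Separate the "human" and "bot" rows into their own DataFrames

# Filter on the 'account.type' column; .copy() makes each subset an independent
# DataFrame, so adding feature columns later does not raise SettingWithCopyWarning
bot_df = train[train['account.type'] == 'bot'].copy()
human_df = train[train['account.type'] == 'human'].copy()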

## **Sentence Count**

In [None]:
# Count number of sentences

# Takes a parsed spaCy Doc; iterating over the raw string would count characters
def num_of_sent(doc):
    return len(list(doc.sents))

bot_df["bot_sent_count"] = bot_df["tw_spacy"].apply(num_of_sent)
human_df["human_sent_count"] = human_df["tw_spacy"].apply(num_of_sent)

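
Iterating over a Python string yields one character at a time, so a sentence count needs some form of sentence segmentation. The sketch below contrasts the two on an invented string; `naive_sentence_count` is a hypothetical regex-based helper for illustration, not spaCy's segmenter, and it will miscount abbreviations, URLs and emoji-only tweets:

```python
import re

sample = "Big news today. Are you ready? Let's go!"

# Iterating over the string counts characters, not sentences
char_count = sum(1 for _ in sample)

# Hypothetical helper: split after ., ! or ? when followed by whitespace
def naive_sentence_count(text):
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])

print(char_count)                    # 40
print(naive_sentence_count(sample))  # 3
```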


## **Word Count**

In [None]:
# Count number of words

def num_of_words(text):
    count = len(text.split())
    return count

bot_df["bot_word_count"] = bot_df["text"].apply(lambda x: num_of_words(x))
human_df["human_word_count"] = human_df["text"].apply(lambda x: num_of_words(x))

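
`str.split()` with no arguments splits on any run of whitespace, so the count above is a count of whitespace-delimited tokens: punctuation stays attached to words, and mentions, hashtags and URLs each count as one word. A small illustration on an invented tweet:

```python
def num_of_words(text):
    # Whitespace-delimited tokens; repeated spaces do not create empty tokens
    return len(text.split())

tweet = "@nasa  launch today!!  https://example.com #space"
print(tweet.split())        # ['@nasa', 'launch', 'today!!', 'https://example.com', '#space']
print(num_of_words(tweet))  # 5
print(num_of_words(""))     # 0
```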


## **Sentence Complexity**

In [None]:
# Calculates the mean number of the specified dependency labels per sentence
def calculate_dep_score(text):
    temp = []
    for sentence in nlp(text).sents:
        temp.append(sent_complexity_structure(sentence))
    return np.mean(temp)

# Dependency labels treated as indicators of a complex sentence structure
COMPLEX_DEPS = {"acl", "conj", "advcl", "ccomp", "csubj", "discourse", "parataxis"}

# Returns the number of tokens in the sentence that carry one of those labels
def sent_complexity_structure(doc):
    return sum(1 for token in doc if token.dep_ in COMPLEX_DEPS)

# Calculates the mean dependency-tree depth per sentence
def calculate_dep_length(text):
    temp = []
    for sentence in nlp(text).sents:
        temp.append(walk_tree(sentence.root, 0))
    return np.mean(temp)

# Walks the dependency tree and returns the depth
def walk_tree(node, depth):
    if node.n_lefts + node.n_rights > 0:
        return max(walk_tree(child, depth + 1) for child in node.children)
    else:
        return depth


bot_df["bot_sent_complex_tags"] = bot_df["text"].apply(lambda x: calculate_dep_score(x))
human_df["human_sent_complex_tags"] = human_df["text"].apply(lambda x: calculate_dep_score(x))

bot_df["bot_sent_complex_depth"] = bot_df["text"].apply(lambda x: calculate_dep_length(x))
human_df["human_sent_complex_depth"] = human_df["text"].apply(lambda x: calculate_dep_length(x))

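
The recursion in `walk_tree` returns the number of edges on the longest path from the sentence root to a leaf. The same logic can be checked without spaCy on a hand-built tree; `Node` is a hypothetical stand-in for a spaCy token that exposes only `children`, and the example parse is written by hand rather than produced by a parser:

```python
class Node:
    """Hypothetical stand-in for a spaCy token: a label plus child nodes."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def walk_tree(node, depth=0):
    # A leaf reports the depth at which it was reached
    if not node.children:
        return depth
    # An inner node reports the deepest leaf below it
    return max(walk_tree(child, depth + 1) for child in node.children)

# Hand-built tree for "The cat chased a small mouse"
tree = Node("chased", [
    Node("cat", [Node("The")]),
    Node("mouse", [Node("a"), Node("small")]),
])

print(walk_tree(tree))        # 2
print(walk_tree(Node("Hi")))  # 0
```

A one-token sentence therefore has depth 0, which pulls the per-tweet mean down for tweets made of short fragments.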

## **Lexical Diversity**

In [None]:
# Calculates the MTLD score for each tweet

def calculate_lex_richness_MTLD2(text):
    lex = LexicalRichness(text)
    if hasattr(lex, 'words') and lex.words > 0:
        lex_rich_score = lex.mtld(threshold=0.72)
        return lex_rich_score
    else:
        return 0

bot_df["bot_LD"] = bot_df["text"].apply(lambda x: calculate_lex_richness_MTLD2(x))
human_df["human_LD"] = human_df["text"].apply(lambda x: calculate_lex_richness_MTLD2(x))

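
MTLD (measure of textual lexical diversity) walks through the tokens, tracks the running type-token ratio (TTR), and counts a "factor" each time the TTR falls to the threshold. The score is the token count divided by the number of factors, averaged over a forward and a backward pass. The following is a simplified sketch of that idea for intuition only. It is not the `lexicalrichness` implementation, its tokenisation is a plain whitespace split, and `mtld_one_pass` and `mtld_sketch` are hypothetical names:

```python
def mtld_one_pass(tokens, threshold=0.72):
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= threshold:
            factors += 1  # a full factor is complete, start a new segment
            types.clear()
            count = 0
    if count > 0:
        # Partial factor for the leftover segment
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    # No repetition at all: no factor was started, so the score is undefined here
    return len(tokens) / factors if factors > 0 else float("nan")

def mtld_sketch(text, threshold=0.72):
    tokens = text.lower().split()
    forward = mtld_one_pass(tokens, threshold)
    backward = mtld_one_pass(tokens[::-1], threshold)
    return (forward + backward) / 2

print(mtld_sketch("spam " * 10))  # 2.0
```

Very short texts may complete few or no factors, which is worth keeping in mind when reading MTLD values for tweets.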


## **Discourse Markers**

In [None]:
# # Counts the number of modals from the list of modals

# modals = pd.read_csv("markers/modals.csv", sep=",", encoding="UTF-8", header=None)
# modals[0] = modals[0].apply(lambda x: x.replace('_', ' '))

# # Counts the number of modals per tweet
# def count_total_modals(text):
#     counter = 0
#     for modal in modals.itertuples():
#         if modal[1] in text:
#             counter += text.count(modal[1])
#     return counter

# bot_df["bot_modals1"] = bot_df["tw_lemma"].apply(lambda x: count_total_modals(x))
# human_df["human_modals1"] = human_df["tw_lemma"].apply(lambda x: count_total_modals(x))


In [None]:
# # Counts the number of modals using POS tagging

# bot_df["bot_pos"] = bot_df["tw_spacy"].apply(lambda x: " ".join([y.tag_ for y in x]))
# human_df["human_pos"] = human_df["tw_spacy"].apply(lambda x: " ".join([y.tag_ for y in x]))

# bot_df["bot_modals2"] = bot_df["bot_pos"].str.count(r'MD')
# human_df["human_modals2"] = human_df["human_pos"].str.count(r'MD')

In [None]:
# # Calculates total number of modals per tweet

# bot_df["bot_modals_all"] = bot_df["bot_modals2"] + bot_df["bot_modals1"]
# human_df["human_modals_all"] = human_df["human_modals2"] + human_df["human_modals1"]

## **Epistemic Markers**

In [None]:
# Counts the total number of epistemic markers per tweet

def find_epistemic_markers(text):
    ep_markers = []
    ep_markers.extend(re.findall(r"(?:I|We|we|One|one)(?:\s\w+)?(?:\s\w+)?\s(?:believes?|thinks?|means?|worry|worries|know|guesse?s?|assumes?)\s(?:that)?", text))
    ep_markers.extend(re.findall(r"(?:It|it)\sis\s(?:believed|known|assumed|thought)\s(?:that)?", text))
    ep_markers.extend(re.findall(r"(?:I|We|we)\s(?:am|are)\s(?:thinking|guessing)\s(?:that)?", text))
    ep_markers.extend(re.findall(r"(?:I|We|we|One|one)(?:\s\w+)?\s(?:do|does)\snot\s(?:believe?|think|know)\s(?:that)?", text))
    ep_markers.extend(re.findall(r"(?:I|We|we|One|one)\swould(?:\s\w+)?(?:\snot)?\ssay\s(?:that)?", text))
    ep_markers.extend(re.findall(r"I\sam\s(?:afraid|sure|confident)\s(?:that)?", text))
    ep_markers.extend(re.findall(r"(?:My|my|Our|our)\s(?:experience|opinion|belief|knowledge|worry|worries|concerns?|guesse?s?)\s(?:is|are)\s(?:that)?", text))
    ep_markers.extend(re.findall(r"[Ii]n\s(?:my|our)(?:\s\w+)?\sopinion", text))
    ep_markers.extend(re.findall(r"As\sfar\sas\s(?:I|We|we)\s(?:am|are)\sconcerned", text))
    ep_markers.extend(re.findall(r"(?:I|We|we|One|one)\s(?:can|could|may|might)(?:\s\w+)?\sconclude\s(?:that)?", text))
    ep_markers.extend(re.findall(r"I\s(?:am\swilling\sto|must)\ssay\s(?:that)?", text))
    ep_markers.extend(re.findall(r"[Oo]ne\s(?:can|could|may|might)\ssay\s(?:that)?", text))
    ep_markers.extend(re.findall(r"[Ii]t\sis\s(?:obvious|(?:un)?clear)", text))
    ep_markers.extend(re.findall(r"[Ii]t\s(?:seems|feels|looks)", text))
    return len(ep_markers)

bot_df["bot_EpMarkers"] = bot_df["text"].apply(lambda x: find_epistemic_markers(x))
human_df["human_EpMarkers"] = human_df["text"].apply(lambda x: find_epistemic_markers(x))

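
Each `re.findall` call returns the list of non-overlapping matches for one pattern, and the feature is the total number of matches across all patterns. A reduced sketch with two of the patterns from the function above, applied to an invented sentence (`PATTERNS` and `collect_markers` are hypothetical names):

```python
import re

PATTERNS = [
    r"I\sam\s(?:afraid|sure|confident)\s(?:that)?",
    r"[Ii]t\s(?:seems|feels|looks)",
]

def collect_markers(text):
    # Gather the matches of every pattern; the feature value is len(result)
    matches = []
    for pattern in PATTERNS:
        matches.extend(re.findall(pattern, text))
    return matches

found = collect_markers("I am sure that it looks fine, but it seems risky.")
print(found)       # ['I am sure that', 'it looks', 'it seems']
print(len(found))  # 3
```

The patterns are case-sensitive and carry no word boundaries, so a string such as `bit looks` would also count as a match for the second pattern.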


## **Nominalisations**

In [None]:
# Counts the total number of nominalisations per tweet

def nominalisation_counter(doc):
    # Nouns ending in a typical nominalising suffix, optionally pluralised
    suffixes_n = r'\b[A-Z]*\w+(?:tion|ment|ance|ence|ion|it(?:y|ies)|ness|ship)(?:s|es)?\b'

    # Keep only tokens that spaCy tagged as nouns, then filter by suffix
    nouns = [token.text for token in doc if token.pos_ == 'NOUN']
    nom_nouns = [noun for noun in nouns if re.match(suffixes_n, noun)]

    return len(nom_nouns)

bot_df["bot_nominalisation"] = bot_df["tw_spacy"].apply(lambda x: nominalisation_counter(x))
human_df["human_nominalisation"] = human_df["tw_spacy"].apply(lambda x: nominalisation_counter(x))

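
The suffix pattern is purely orthographic: any noun that ends in one of the listed suffixes counts, whether or not it is derived from a verb or adjective. A small check on an invented word list, reusing the pattern from the cell above without the spaCy noun filter:

```python
import re

suffixes_n = r'\b[A-Z]*\w+(?:tion|ment|ance|ence|ion|it(?:y|ies)|ness|ship)(?:s|es)?\b'

words = ["development", "happiness", "cities", "dog", "lion"]
matched = [w for w in words if re.match(suffixes_n, w)]
print(matched)  # ['development', 'happiness', 'cities', 'lion']
```

Words such as `cities` and `lion` show that the count is an upper-bound heuristic rather than a strict count of nominalisations.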


In [None]:
# Average number of features (discourse markers, modals, epistemic markers, nominalisations) per sentence for each tweet

def average_per_sentence(feature, sent):
    # Guard against tweets for which no sentence was counted
    return feature / sent if sent else 0.0

#bot_df["bot_dm_per_sent"] = bot_df.apply(lambda row: average_per_sentence(row["bot_discourse"], row["bot_sent_count"]), axis=1)
#human_df["human_dm_per_sent"] = human_df.apply(lambda row: average_per_sentence(row["human_discourse"], row["human_sent_count"]), axis=1)

#bot_df["bot_mod_per_sent"] = bot_df.apply(lambda row: average_per_sentence(row["bot_modals_all"], row["bot_sent_count"]), axis=1)
#human_df["human_mod_per_sent"] = human_df.apply(lambda row: average_per_sentence(row["human_modals_all"], row["human_sent_count"]), axis=1)

bot_df["bot_ep_per_sent"] = bot_df.apply(lambda row: average_per_sentence(row["bot_EpMarkers"], row["bot_sent_count"]), axis=1)
human_df["human_ep_per_sent"] = human_df.apply(lambda row: average_per_sentence(row["human_EpMarkers"], row["human_sent_count"]), axis=1)

bot_df["bot_nom_per_sent"] = bot_df.apply(lambda row: average_per_sentence(row["bot_nominalisation"], row["bot_sent_count"]), axis=1)
human_df["human_nom_per_sent"] = human_df.apply(lambda row: average_per_sentence(row["human_nominalisation"], row["human_sent_count"]), axis=1)



## **Print Results**

In [None]:
print("Sentence complexity based on a number of certain dependency tags")
print("bot:", np.mean(bot_df["bot_sent_complex_tags"]))
print("human: ", np.mean(human_df["human_sent_complex_tags"]), "\n")

print("Sentence complexity based on the tree depth")
print("bot:", np.mean(bot_df["bot_sent_complex_depth"]))
print("human: ", np.mean(human_df["human_sent_complex_depth"]), "\n")

print("MTLD lexical diversity score")
print("bot:", np.mean(bot_df["bot_LD"]))
print("human: ", np.mean(human_df["human_LD"]), "\n")

# print("Average number of discourse markers per tweet")
# print("bot:", np.mean(bot_df["bot_discourse"]))
# print("human: ", np.mean(human_df["human_discourse"]), "\n")

# print("Average number of modals (from the list) per tweet")
# print("bot:", np.mean(bot_df["bot_modals1"]))
# print("human: ", np.mean(human_df["human_modals1"]), "\n")

# print("Average number of modals (POS-tags) per tweet")
# print("bot:", np.mean(bot_df["bot_modals2"]))
# print("human: ", np.mean(human_df["human_modals2"]), "\n")

print("Average number of epistemic markers per tweet")
print("bot:", np.mean(bot_df["bot_EpMarkers"]))
print("human: ", np.mean(human_df["human_EpMarkers"]), "\n")

print("Average number of nominalisations per tweet")
print("bot:", np.mean(bot_df["bot_nominalisation"]))
print("human: ", np.mean(human_df["human_nominalisation"]))

Sentence complexity based on a number of certain dependency tags
bot: 1.1799230490980057
human:  0.8386741974958812

Sentence complexity based on the tree depth
bot: 4.415448563903522
human:  3.4726414700058257

MTLD lexical diversity score
bot: 34.61954627126504
human:  48.19291374067965

Average number of epistemic markers per tweet
bot: 0.016611937415491596
human:  0.01979146553388685

Average number of nominalisations per tweet
bot: 0.32963106045972573
human:  0.3387719636995559
human:  0.3387719636995559
