# Lexical Analysis: LIWC

LIWC (Linguistic Inquiry and Word Count) has categories for negativity, anger, anxiety, and certainty that can indirectly capture cynical language.

Beyond these categories, words related to doubt (e.g., “skeptical,” “dishonest”), corruption (e.g., “manipulative,” “deceit”), and power imbalance (e.g., “elite,” “rigged”) can be counted as more direct lexical markers of cynicism.


## Setup


In [1]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /Users/pmui/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/pmui/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Define Cynicism-Related LIWC Word Categories

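Since LIWC is proprietary, we define custom LIWC-style categories that align with mistrust, negativity, and power imbalance. Several categories share identical word sets (for example, skepticism, mistrust, questionable, and doubt), so their scores will always be equal: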


In [19]:
# Define LIWC-style categories for cynicism analysis
liwc_cynicism = {
    "skepticism": {"skeptical", "dishonest", "untrustworthy", "suspicious", "questionable"},
    "mistrust": {"skeptical", "dishonest", "untrustworthy", "suspicious", "questionable"},
    "questionable": {"skeptical", "dishonest", "untrustworthy", "suspicious", "questionable"},
    "dishonesty": {"dishonest", "deceit", "corrupt", "fraudulent", "scandal"},
    "manipulation": {"manipulative", "deceit", "corrupt", "fraudulent", "scandal"},
    "deceit": {"deceit", "corrupt", "fraudulent", "scandal"},
    "fraud": {"fraudulent", "scandal", "manipulative", "deceit", "corrupt"},
    "rigged": {"rigged", "fake", "fraud", "deceit", "corrupt"},
    "doubt": {"skeptical", "dishonest", "untrustworthy", "suspicious", "questionable"},
    "corruption": {"manipulative", "deceit", "corrupt", "fraudulent", "scandal"},
    "power_imbalance": {"elite", "rigged", "oppressed", "exploited", "authoritarian"},
    "negativity": {"bad", "worst", "failure", "hopeless", "disaster"},
    "anger": {"angry", "furious", "outraged", "hate", "resentful"},
    "anxiety": {"worried", "fear", "concerned", "nervous", "insecure"},
    "certainty": {"obvious", "definitely", "undeniable", "clearly", "absolute"}
}

## Create a Function to Compute LIWC Scores

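This function lowercases and tokenizes each text, counts how often the words of each category occur, and divides each count by the total number of tokens. Matching is exact (there is no stemming, so "exploit" does not match "exploited"), and punctuation tokens count toward the total: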


In [20]:
def liwc_analysis(text):
    tokens = word_tokenize(text.lower())  # Tokenize and lowercase
    word_counts = Counter(tokens)  # Count word frequencies
    
    liwc_scores = {category: sum(word_counts[word] for word in words) for category, words in liwc_cynicism.items()}
    
    total_words = sum(word_counts.values())
    if total_words > 0:
        liwc_scores = {category: round(count / total_words, 4) for category, count in liwc_scores.items()}  # Normalize by total words
    
    return liwc_scores
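
The normalization step can also be tried without NLTK. The sketch below is not the function above: it assumes a simple regular-expression tokenizer that drops punctuation, so the denominator counts words only. The names `toy_categories` and `category_rates` are hypothetical and used only for illustration.

```python
import re
from collections import Counter

# Hypothetical two-category lexicon, a subset of the one defined above.
toy_categories = {
    "power_imbalance": {"elite", "rigged", "oppressed", "exploited", "authoritarian"},
    "certainty": {"obvious", "definitely", "undeniable", "clearly", "absolute"},
}

def category_rates(text, categories):
    """Return per-category word rates, normalized by the number of word tokens."""
    # Assumption: a word is a run of letters or apostrophes; punctuation is dropped.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in categories}
    return {
        name: round(sum(counts[word] for word in words) / total, 4)
        for name, words in categories.items()
    }

rates = category_rates(
    "The government is rigged and corrupt, and the elite exploit the weak.",
    toy_categories,
)
print(rates)
```

This sentence has 12 word tokens under the regex tokenizer, against 14 tokens when punctuation is counted, so `power_imbalance` comes out as 2/12 rather than 2/14.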


## Apply Analysis to a DataFrame

Assume a DataFrame df with a column "text" that holds one text per row:


In [21]:
# Sample DataFrame with texts
data = {
    "text": [
        "The government is rigged and corrupt, and the elite exploit the weak.",
        "I feel worried about the future, and the situation seems hopeless.",
        "They are dishonest and manipulative. It’s obvious that the system is broken.",
        "I trust the process and believe things will get better."
    ]
}

df = pd.DataFrame(data)

# Apply the LIWC analysis function to each text
df["liwc_scores"] = df["text"].apply(liwc_analysis)

In [22]:
# Expand the dictionary into separate columns
liwc_df = df.join(pd.json_normalize(df["liwc_scores"])).drop(columns=["liwc_scores"])

In [23]:
print("LIWC Cynicism Analysis:")
display(liwc_df)

LIWC Cynicism Analysis:


Unnamed: 0,text,skepticism,mistrust,questionable,dishonesty,manipulation,deceit,fraud,rigged,doubt,corruption,power_imbalance,negativity,anger,anxiety,certainty
0,"The government is rigged and corrupt, and the ...",0.0,0.0,0.0,0.0714,0.0714,0.0714,0.0714,0.1429,0.0,0.0714,0.1429,0.0,0.0,0.0,0.0
1,"I feel worried about the future, and the situa...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0769,0.0,0.0769,0.0
2,They are dishonest and manipulative. It’s obvi...,0.0625,0.0625,0.0625,0.0625,0.0625,0.0,0.0625,0.0,0.0625,0.0625,0.0,0.0,0.0,0.0,0.0625
3,I trust the process and believe things will ge...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Each row in the output DataFrame contains the original text plus one normalized LIWC-style score per cynicism indicator (e.g., negativity, doubt, corruption, anger).


## Summarizing LIWC with a Cynicism Score


In [26]:
liwc_cynicism.keys() - {"certainty"}

{'anger',
 'anxiety',
 'corruption',
 'deceit',
 'dishonesty',
 'doubt',
 'fraud',
 'manipulation',
 'mistrust',
 'negativity',
 'power_imbalance',
 'questionable',
 'rigged',
 'skepticism'}
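
A detail worth checking when excluding a key from a `dict.keys()` view: the right-hand operand of `-` is treated as an iterable of elements, so a bare string is read character by character. A small illustration with a hypothetical three-key dictionary:

```python
# Hypothetical mini-lexicon; only the keys matter here.
lexicon = {"anger": set(), "doubt": set(), "certainty": set()}

# Subtracting a bare string iterates over its characters ('c', 'e', 'r', ...),
# none of which is a key, so nothing is removed.
unchanged = lexicon.keys() - "certainty"

# Subtracting a one-element set removes the key as intended.
without_certainty = lexicon.keys() - {"certainty"}

print(sorted(unchanged))          # ['anger', 'certainty', 'doubt']
print(sorted(without_certainty))  # ['anger', 'doubt']
```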

In [27]:
def compute_cynicism_score(row):
    # Categories that add to the score: every category except certainty.
    # Note the braces: subtracting the bare string "certainty" would remove nothing.
    cynicism_categories = liwc_cynicism.keys() - {"certainty"}
    certainty_weight = 1  # Certainty is subtracted

    # Average over the cynicism categories after subtracting weighted certainty
    total = sum(row[category] for category in cynicism_categories)
    cynicism_score = (total - certainty_weight * row["certainty"]) / len(cynicism_categories)

    return round(cynicism_score, 4)  # Rounded for clarity

In [29]:
# Apply the function to compute the Cynicism Score
liwc_df["cynicism_lexical"] = liwc_df.apply(compute_cynicism_score, axis=1)

display(liwc_df)

Unnamed: 0,text,skepticism,mistrust,questionable,dishonesty,manipulation,deceit,fraud,rigged,doubt,corruption,power_imbalance,negativity,anger,anxiety,certainty,cynicism_lexical
0,"The government is rigged and corrupt, and the ...",0.0,0.0,0.0,0.0714,0.0714,0.0714,0.0714,0.1429,0.0,0.0714,0.1429,0.0,0.0,0.0,0.0,0.0459
1,"I feel worried about the future, and the situa...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0769,0.0,0.0769,0.0,0.011
2,They are dishonest and manipulative. It’s obvi...,0.0625,0.0625,0.0625,0.0625,0.0625,0.0,0.0625,0.0,0.0625,0.0625,0.0,0.0,0.0,0.0,0.0625,0.0312
3,I trust the process and believe things will ge...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Statistical Goodness of Fit for the Summary Score

To check how well the summary score represents the underlying LIWC components, we can apply goodness-of-fit measures. The thresholds below are common rules of thumb, not strict cutoffs.

Here are three goodness-of-fit approaches:

(A) Cronbach’s Alpha (Reliability Measure)
Measures internal consistency, that is, how strongly the LIWC categories correlate with one another when treated as a single scale.
If α > 0.7, the categories are usually considered reliable as a joint measure of cynicism.

(B) Principal Component Analysis (PCA) – Dimensionality Reduction
If the first principal component explains most of the variance (> 70%), a single summary score represents the combined LIWC categories well.

(C) Spearman’s Rank Correlation
Measures the monotonic relationship between the summary score and each LIWC component.
A high ρ (> 0.5) means the score tracks that component closely.


### Cronbach’s Alpha


- If α > 0.7 → Score is reliable
- If α < 0.7 → May need better weighting or additional categories
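
For reference, this is the standard form of Cronbach's alpha. Here $n$ is the number of items (the LIWC-style categories), $\sigma_i^2$ is the sample variance of item $i$ across texts, and $\sigma_T^2$ is the sample variance of the per-text total over all items:

```latex
\alpha = \frac{n}{n-1}\left(1 - \frac{\sum_{i=1}^{n} \sigma_i^2}{\sigma_T^2}\right)
```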


In [32]:
def cronbach_alpha(df):
    """Compute Cronbach's Alpha for internal consistency of LIWC categories."""
    items = df[list(liwc_cynicism)]            # one column per category
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each category
    total_var = items.sum(axis=1).var(ddof=1)  # sample variance of the per-text total
    n = items.shape[1]                         # number of categories (items)
    return (n / (n - 1)) * (1 - item_vars.sum() / total_var)

# Compute Cronbach’s Alpha
alpha = cronbach_alpha(liwc_df)
print(f"Cronbach's Alpha: {alpha:.4f}")

Cronbach's Alpha: 0.7918
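
One thing to keep in mind when reading α here: items that are exact copies of one another raise it mechanically, and several categories in `liwc_cynicism` share identical word sets. The pure-Python sketch below uses hypothetical toy numbers (not the LIWC scores above) and a hypothetical helper, `cronbach_alpha_rows`, to illustrate the effect.

```python
from statistics import variance

def cronbach_alpha_rows(rows):
    """Cronbach's alpha for a list of rows, each row a list of item scores."""
    n_items = len(rows[0])
    # Sample variance (n - 1 denominator) of each item column.
    item_vars = [variance([row[i] for row in rows]) for i in range(n_items)]
    # Sample variance of the per-row totals.
    total_var = variance([sum(row) for row in rows])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical toy data: two distinct items, x and y, over four rows.
x = [1, 2, 3, 4]
y = [2, 1, 4, 3]

two_items = [[a, b] for a, b in zip(x, y)]
# Same information, but x is entered three times under different names.
with_duplicates = [[a, a, a, b] for a, b in zip(x, y)]

alpha_base = cronbach_alpha_rows(two_items)
alpha_dup = cronbach_alpha_rows(with_duplicates)
print(f"two items: {alpha_base:.4f}, with duplicates: {alpha_dup:.4f}")
```

In this toy case α rises from 0.75 to about 0.94 even though no new information was added, so a high α over overlapping categories is weaker evidence than the same value over distinct ones.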


### PCA – How Much of the Variance is Captured by the Score?

If the first principal component explains more than 70% of the variance → a single summary score is a good fit.


In [38]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize data
scaler = StandardScaler()
liwc_scaled = scaler.fit_transform(liwc_df[liwc_cynicism.keys()])

# Run PCA
pca = PCA(n_components=1)
pca.fit(liwc_scaled)

explained_variance = pca.explained_variance_ratio_[0]
print(f"Variance Explained by 1st Principal Component: {explained_variance:.4f}")


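Variance Explained by 1st Principal Component: 0.5419

At about 54%, the first component falls short of the 70% threshold, so a single score captures only part of the variation across categories in this small sample.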
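
The explained-variance ratio of the first component is the largest eigenvalue of the covariance (or correlation) matrix divided by the sum of all eigenvalues, which equals the trace. The sketch below estimates it with plain power iteration instead of scikit-learn; the function name `first_component_share` and the toy matrix are hypothetical.

```python
def first_component_share(matrix, iterations=200):
    """Share of total variance carried by the first principal component.

    `matrix` is a symmetric covariance or correlation matrix (list of lists).
    The largest eigenvalue is estimated by power iteration.
    """
    size = len(matrix)
    # Uneven start vector, so it is unlikely to be orthogonal to the leading eigenvector.
    vector = [1.0 + 0.1 * i for i in range(size)]
    for _ in range(iterations):
        product = [sum(matrix[r][c] * vector[c] for c in range(size)) for r in range(size)]
        norm = sum(value * value for value in product) ** 0.5
        if norm == 0.0:
            return 0.0  # all-zero matrix: no variance to explain
        vector = [value / norm for value in product]
    # Rayleigh quotient v^T M v of the unit-length vector estimates the eigenvalue.
    product = [sum(matrix[r][c] * vector[c] for c in range(size)) for r in range(size)]
    largest = sum(vector[i] * product[i] for i in range(size))
    trace = sum(matrix[i][i] for i in range(size))
    return largest / trace

# Hypothetical correlation matrix: three variables, all pairwise correlations 0.5.
toy_corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
share = first_component_share(toy_corr)
print(round(share, 4))
```

For this matrix the eigenvalues are 2, 0.5, and 0.5, so the first component carries 2/3 of the variance: even three variables that all correlate at 0.5 sit just under the 70% threshold.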


### Spearman’s Correlation Between LIWC Features and the Summary Score

If |ρ| > 0.5 → the component and the summary score are strongly rank-correlated, meaning the score reflects that component well.


In [40]:
from scipy.stats import spearmanr

correlations = {col: spearmanr(liwc_df[col], liwc_df["cynicism_lexical"])[0] for col in liwc_cynicism.keys()}
print("Spearman’s Rank Correlation with Cynicism Score:")
for category, corr in correlations.items():
    print(f"{category}: {corr:.4f}")

Spearman’s Rank Correlation with Cynicism Score:
skepticism: 0.2582
mistrust: 0.2582
questionable: 0.2582
dishonesty: 0.9487
manipulation: 0.9487
deceit: 0.7746
fraud: 0.9487
rigged: 0.7746
doubt: 0.2582
corruption: 0.9487
power_imbalance: 0.7746
negativity: -0.2582
anger: nan
anxiety: -0.2582
certainty: 0.2582


The value for anger is nan because no sample text contains an anger word: the column is constant at 0, so it has no rank variation and Spearman's ρ is undefined.
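
To make the computation concrete, here is a minimal pure-Python sketch of Spearman's ρ as the Pearson correlation of average ranks. It is not how `scipy.stats.spearmanr` is implemented internally; the helper names are hypothetical, and returning nan for a constant input is a choice made here to mirror the output above.

```python
import math

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks; nan if undefined."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return math.nan  # a constant input has no rank variation
    return cov / math.sqrt(var_x * var_y)

# Column values for the four sample texts, taken from the table above.
deceit = [0.0714, 0.0, 0.0, 0.0]
anger = [0.0, 0.0, 0.0, 0.0]
# Ranking of the four texts by summary score (1 = lowest); rho depends only on ranks.
score_order = [4, 2, 3, 1]

print(round(spearman_rho(deceit, score_order), 4))  # 0.7746
print(spearman_rho(anger, score_order))             # nan
```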
