In [15]:
import pandas as pd

## 1. Natural Language Processing - A Naive Example

Before diving into real Twitter data, let’s start with a simple example.
Here’s a small corpus consisting of three short documents:

- Document 1: It is going to rain today.
- Document 2: Today I am not going outside.
- Document 3: I am going to watch the season premiere.

In [None]:
Document1= "It is going to rain today."
Document2= "Today I am not going outside."
Document3= "I am going to watch the season premiere."
Doc = [Document1 ,
 Document2 , 
 Document3]
print(Doc)


From this example, we’ll learn how to convert raw text into numerical features — or what we might call columns of numbers. This process is often referred to as vectorization.

Once we represent text as vectors, we unlock the ability to perform various types of analysis, including:
- Summarization
- Clustering
- Topic modeling
- Information retrieval (e.g., finding similar texts)
- Predictive modeling

The core idea in Natural Language Processing (NLP) is transforming unstructured text into structured numerical form. While there are many ways to do this, we’ll focus on one of the most widely used and interpretable methods: TF-IDF (Term Frequency–Inverse Document Frequency).

TF-IDF is useful in many NLP applications. For example:
- Search engines use it to rank the relevance of a document to a search query.
- It’s also used in text classification, summarization, and topic modeling.

After learning TF-IDF, we’ll apply it in a downstream task — topic modeling — to uncover hidden themes across the documents.

While we won’t cover every vectorization technique or downstream task, this example will give you a strong foundation for understanding how an NLP pipeline works.


### 1.1 Vectorization: Term Frequency(TF) — Inverse Document Frequency(IDF) Vectorization
A corpus can be defined as a collection of documents. In our example, each sentence is a document, and they collectively form a corpus.  

To vectorize text data, we use a TF-IDF method. 
- We first tokenize the text, and then assign an importance score for every term. 
- The importance score of a term is high when it occurs a lot in a given document and rarely in others. 
- In short, commonality within a document measured by TF is balanced by rarity between documents measured by IDF. The resulting TF-IDF score reflects the importance of a term for a document in the corpus.
 

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer() #TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
analyze = vectorizer.build_analyzer()
print("Document 1",analyze(Document1))
print("Document 2",analyze(Document2))
print("Document 3",analyze(Document3))

X = vectorizer.fit_transform(Doc)

print(X)
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df


We tokenize and generate a vocab of the document. For each document, we can find the TF= (Number of repetitions of word in a document) / (# of words in a document). We can further find the IDF=Log[(Number of documents) / (Number of documents containing the word)]

| words      | Doc1 | Doc2| Doc3|IDF Value|
| ----------- | ----------- |----------- |----------- |----------- |
| going      | 0.16     |0.16|0.12|0|
| to   | 0.16       |0|0.12|0.41|
|today|0.16|0.16|0|0.41|
|i|0|0.16|0.12|0.41|
|am|0|0.16|0.12|0.41|
|it|0.16|0|0|1.09|
|is |0.16|0|0|1.09|
|rain|0.16|0|0|1.09|

We then construct a document-term matrix using the TF-IDF scores:

| Docs      | going |to|today|i|am|it|is|rain|
| ------ |------ |------ |------ |------ |------ |------ |------ |------ |
| Doc1      | 0  |0.07|0.07|0|0|0.17|0.17|0.17|0.17|
| Doc2   | 0  |0|0.07|0.07|0.07|0|0|0|
|Doc3|0|0.05|0|0.05|0.05|0|0|0|

It is easy to see that 'it', 'is', and 'rain' are important for Doc 1 but not Doc 2 or Doc 3. Each row of the document-term matrix can be thought of as a numeric representation of the documents, which we often term vectors. These numeric representations help you to find similarities between documents. 
 
> You might have noticed that stop words such as “to” and “is” are included above. These are usually filtered out in real-world NLP tasks because they don’t carry much meaning.

To perform vectorization in Python, we use the <code>TfidfVectorizer</code> from the <code>sklearn</code> package.

The steps are:
- Create the vectorizer.
- Fit it on your corpus.
- Transform your corpus into vectors.

The function **TfidfVectorizer** takes two parameters. 
- max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:
    - max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
    - max_df = 25 means "ignore terms that appear in more than 25 documents".
    - The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.
- min_df is used for removing terms that appear too infrequently. For example:
    - min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
    - min_df = 5 means "ignore terms that appear in less than 5 documents".
    - The default min_df is 1, which means "ignore terms that appear in less than 1 document". Thus, the default setting does not ignore any terms.



In [None]:
docs=Doc
#Convert a collection of raw documents to a matrix of TF-IDF features.
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.1, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(docs)


tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

Now, after removing stop words, the resulting matrix looks like this (we'll call it M):

|     | outside   | premiere | rain     | season   | today    | watch    |
|-----|-----------|----------|----------|----------|----------|----------|
| 0   | 0         | 0        | 0.795961 | 0        | 0.605349 | 0        |
| 1   | 0.795961  | 0        | 0        | 0        | 0.605349 | 0        |
| 2   | 0         | 0.57735  | 0        | 0.57735  | 0        | 0.57735  |


It is notable that <code>tfidf</code> is a sparse matrix. If you'd like to view it as a full DataFrame, use:


### 1.2 Non-negative Matrix Factorization (NMF)

TF-IDF vectors are great, but high-dimensional. When we have hundreds or thousands of terms, interpretation becomes difficult.

To reduce this complexity and uncover latent themes, we use Non-negative Matrix Factorization (NMF), a powerful technique for **topic modeling**.

 If we think of the document-term matrix $M$ as a $m \times n$ matrix with $m$ documents and $n$ terms, $M$ can be factorized as 




$$
M=W \times H
$$

- M: Original document-term matrix (e.g., m docs × n terms)
- W: Document-topic matrix (m docs × k topics)
- H: Topic-term matrix (k topics × n terms)
- k: Number of topics

NMF finds W and H such that their product approximates M, and all values remain non-negative.

This technique helps extract topics from text — where each topic is a combination of words, and each document can belong to multiple topics with different strengths.
 
 

The function NMF takes two parameters. 
- n_components is the number of topics
- random_state controls the random number generator used in the attribute combining process.

In [None]:
from sklearn.decomposition import NMF

nmf_model = NMF(n_components=2, random_state=0)
#nmf_model.fit(tfidf)
W = nmf_model.fit_transform(tfidf)  # Document-topic matrix

# Display topics
feature_names = tfidf_vectorizer.get_feature_names_out()
topic_names=[]
# Assume nmf_model and feature_names are already defined
topic_names = []

# Loop through each topic
for topic_index in range(len(nmf_model.components_)):
    topic = nmf_model.components_[topic_index]
    print(topic)
    # Get the indices of the top 3 words (largest values in the topic)
    sorted_indices = topic.argsort()  # sorts from smallest to largest

    print(sorted_indices)
    top_indices = sorted_indices[-3:]  # get the last 3 (top 3 words)
    
    # Reverse to make it largest to smallest
    top_indices = top_indices[::-1]

    # Get the actual word names for these indices
    top_words = []
    for i in top_indices:
        top_words.append(feature_names[i])
    
    # Join the top words into a single string
    top_words_string = " ".join(top_words)

    # Print and save
    print("Topic #{}:".format(topic_index))
    print(top_words_string)
    topic_names.append(top_words_string)
topic_df = pd.DataFrame(W, columns=topic_names)
topic_df

topic_df = pd.DataFrame(nmf_model.components_ ,columns=feature_names)
topic_df

This is the W matrix (document-topic distribution):


|     | today outside rain | watch season premiere |
|-----|--------------------|------------------------|
| 0   | 0.490981           | 0.000000               |
| 1   | 0.490981           | 0.000000               |
| 2   | 0.000000           | 0.840054               |

And this is the H matrix (topic-word distribution):

|     | outside   | premiere | rain     | season   | today    | watch    |
|-----|-----------|----------|----------|----------|----------|----------|
| 0   | 0.810582  | 0.000000 | 0.810582 | 0.000000 | 1.232936 | 0.000000 |
| 1   | 0.000000  | 0.687278 | 0.000000 | 0.687278 | 0.000000 | 0.687278 |


## 2 Analyzing Twitter Data

Finally, we get to practice using the Twitter data! 
### 2.1 What Social Media Accounts to Search?

To identify social media accounts related to AI tools, we perform a Google search using the keyword "AI marketing tools". Below are the Search Engine Results Pages (also known as “SERPs” or “SERP”).

The first few results are sponsored links, and one organic result points us to [15 Best AI Marketing Tools in 2023-2024](https://improvado.io/blog/best-ai-marketing-tools). Among the recommended AI tools, we are particularly interested in [Grammarly](https://twitter.com/Grammarly). Let's collect tweets generated by Grammarly's official account and examine which tweets get more likes.


> Grammarly is a cloud-based typing assistant. It reviews spelling, grammar, punctuation, clarity, engagement, and delivery mistakes in English texts, detects plagiarism, and suggests replacements for the identified errors. For a brief introduction to Grammarly, watch this [video](https://www.youtube.com/watch?v=zd64pGNLjVY).


### 2.2. Data Collection
Twitter has its API service. To simplify this data collection process, I built a little package.


In [4]:
#!pip3 install --upgrade --force-reinstall git+https://github.com/tantantan12/itom6219.git


In [8]:
import os
os.environ["BEARER_TOKEN"] = "AAAAAAAAAAAAAAAAAAAAAA7fGwEAAAAATek8qNEHmKiwy5NeLLGGLu%2FOllc%3DvMI6a81TOlLcj6fthUgm5xT66tHGcKYcklMRLcRZjxQBKpqWJp"


from itom6219 import user_info, user_tweets, user_tweets_all
user=user_info(["grammarly"])
user
#tweets=user_tweets(["grammarly"], exclude_replies=True, exclude_retweets=True)

#tweets_all=user_tweets_all(["sunomusic","TSwiftLyricsBot"],max_total=1000, exclude_replies=True, exclude_retweets=True)

Unnamed: 0,name,description,verified,username,id,created_at,public_metrics.followers_count,public_metrics.following_count,public_metrics.tweet_count,public_metrics.listed_count,public_metrics.like_count,public_metrics.media_count
0,Grammarly,Good writing moves work forward. #StandWithUkr...,True,Grammarly,47191725,2009-06-14T22:23:52.000Z,227922,3455,41476,2849,21049,9977


In [13]:
tweets=user_tweets(["grammarly"])

Error fetching tweets for user Grammarly: 429


In [16]:
# We use pd.read_csv to read csv file
file_path = 'AI_tweets_all.csv'
df = pd.read_csv(file_path)
df

FileNotFoundError: [Errno 2] No such file or directory: 'AI_tweets_all.csv'

#### Data Exploration

In [None]:
import plotly.express as px
import numpy as np

df['datetime']=pd.to_datetime(df['created_at'])

df['log_view']=np.log1p(df['public_metrics.impression_count'])  
df['log_view']=np.log1p(df['public_metrics.impression_count'])  

# create a line plot with Plotly Express
fig = px.line(df, x='datetime', y='log_view', title='Impression Over Time', template='plotly_white')

# display the plot
fig.show()

In [None]:
df.sort_values(by='log_view', ascending=False)

## 3 Vectorization

In [None]:
docs=df['text']
#Convert a collection of raw documents to a matrix of TF-IDF features.
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(docs)


tfidf_df = pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

## 3.4 Topic Modeling

In [None]:

# Apply NMF
from sklearn.decomposition import NMF

nmf_model = NMF(n_components=7, random_state=0)
nmf_model.fit(tfidf)
W = nmf_model.fit_transform(tfidf)  # Document-topic matrix

# Display topics
feature_names = tfidf_vectorizer.get_feature_names_out()
topic_names=[]
for topic_index in range(len(nmf_model.components_)):
    topic = nmf_model.components_[topic_index]
    # Get the indices of the top 3 words (largest values in the topic)
    sorted_indices = topic.argsort()  # sorts from smallest to largest
    top_indices = sorted_indices[-4:]  # get the last 3 (top 3 words)
    # Reverse to make it largest to smallest
    top_indices = top_indices[::-1]
    # Get the actual word names for these indices
    top_words = []
    for i in top_indices:
        top_words.append(feature_names[i])
    # Join the top words into a single string
    top_words_string = " ".join(top_words)
    # Print and save
    print("Topic #{}:".format(topic_index))
    print(top_words_string)
    topic_names.append(top_words_string)

topic_df = pd.DataFrame(W, columns=topic_names)
topic_df


## 4 Linear Regression

Linear regression is one of the most commonly used techniques in data analysis. It helps us understand the relationship between one or more input variables (features) and an output variable (target). In the simplest case, it tries to draw a straight line that best fits the data.

In our example, we want to understand:

- How do the topics of Grammarly’s tweets influence the number of likes?
- Which topics are more likely to lead to higher engagement (likes)?
- Which topics seem to have less impact or even negative impact?

Each tweet is represented as a set of topic weights (from NMF), and our target is the like count for that tweet.

We’ll use the topic weights (<code>topic_df</code>) as features, and the like count (<code>df['public_metrics.like_count']</code>) as the target.


The model assumes a relationship of the form:

$$
\text{Like\_Count} = \beta_0 + \beta_1 \cdot \text{Topic}_1 + \beta_2 \cdot \text{Topic}_2 + \dots + \beta_k \cdot \text{Topic}_k
$$

- $\beta_0$ is the intercept.  
- $\beta_1, \beta_2, \dots, \beta_k$ are **coefficients** for each topic.  
- A **positive coefficient** ($\beta_i > 0$) means the topic is associated with **more likes**.  
- A **negative coefficient** ($\beta_i < 0$) means the topic is associated with **fewer likes**.



In [None]:
import pingouin as pg

# Combine X and y into a single dataframe
df_model = topic_df.copy()
df_model['ratio'] = df['public_metrics.like_count'] / df['log_view']

# Run linear regression
result = pg.linear_regression(df_model.drop(columns='ratio'), df_model['ratio'])

# Round coef and pval to 3 decimal places
result[['names', 'coef', 'pval']] = result[['names', 'coef', 'pval']].round(3)

# Display the rounded result
result[['names', 'coef', 'pval']]


