In [15]:
import pandas as pd

## 1. Natural Language Processing - A Naive Example

Before diving into real Twitter data, let’s start with a simple example.
Here’s a small corpus consisting of three short documents:

- Document 1: It is going to rain today.
- Document 2: Today I am not going outside.
- Document 3: I am going to watch the season premiere.

In [1]:
Document1= "It is going to rain today."
Document2= "Today I am not going outside."
Document3= "I am going to watch the season premiere."
Doc = [Document1 ,
 Document2 , 
 Document3]
print(Doc)

['It is going to rain today.', 'Today I am not going outside.', 'I am going to watch the season premiere.']



From this example, we’ll learn how to convert raw text into numerical features — or what we might call columns of numbers. This process is often referred to as vectorization.

Once we represent text as vectors, we unlock the ability to perform various types of analysis, including:
- Summarization
- Clustering
- Topic modeling
- Information retrieval (e.g., finding similar texts)
- Predictive modeling

The core idea in Natural Language Processing (NLP) is transforming unstructured text into structured numerical form. While there are many ways to do this, we’ll focus on one of the most widely used and interpretable methods: TF-IDF (Term Frequency–Inverse Document Frequency).

TF-IDF is useful in many NLP applications. For example:
- Search engines use it to rank the relevance of a document to a search query.
- It’s also used in text classification, summarization, and topic modeling.

After learning TF-IDF, we’ll apply it in a downstream task — topic modeling — to uncover hidden themes across the documents.

While we won’t cover every vectorization technique or downstream task, this example will give you a strong foundation for understanding how an NLP pipeline works.


### 1.1 Vectorization: Term Frequency(TF) — Inverse Document Frequency(IDF) Vectorization
A corpus can be defined as a collection of documents. In our example, each sentence is a document, and they collectively form a corpus.  

To vectorize text data, we use a TF-IDF method. 
- We first tokenize the text, and then assign an importance score for every term. 
- The importance score of a term is high when it occurs a lot in a given document and rarely in others. 
- In short, commonality within a document measured by TF is balanced by rarity between documents measured by IDF. The resulting TF-IDF score reflects the importance of a term for a document in the corpus.
 

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer() #TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
analyze = vectorizer.build_analyzer()
print("Document 1",analyze(Document1))
print("Document 2",analyze(Document2))
print("Document 3",analyze(Document3))

X = vectorizer.fit_transform(Doc)

print(X)
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

Document 1 ['it', 'is', 'going', 'to', 'rain', 'today']
Document 2 ['today', 'am', 'not', 'going', 'outside']
Document 3 ['am', 'going', 'to', 'watch', 'the', 'season', 'premiere']
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 18 stored elements and shape (3, 13)>
  Coords	Values
  (0, 3)	0.4711101009983051
  (0, 2)	0.4711101009983051
  (0, 1)	0.2782452148327134
  (0, 10)	0.35829137488557944
  (0, 7)	0.4711101009983051
  (0, 11)	0.35829137488557944
  (1, 1)	0.3154441510317797
  (1, 11)	0.4061917781433946
  (1, 0)	0.4061917781433946
  (1, 4)	0.5340933749435833
  (1, 5)	0.5340933749435833
  (2, 1)	0.2517108425440014
  (2, 10)	0.3241235393856436
  (2, 0)	0.3241235393856436
  (2, 12)	0.42618350336974425
  (2, 9)	0.42618350336974425
  (2, 8)	0.42618350336974425
  (2, 6)	0.42618350336974425


Unnamed: 0,am,going,is,it,not,outside,premiere,rain,season,the,to,today,watch
0,0.0,0.278245,0.47111,0.47111,0.0,0.0,0.0,0.47111,0.0,0.0,0.358291,0.358291,0.0
1,0.406192,0.315444,0.0,0.0,0.534093,0.534093,0.0,0.0,0.0,0.0,0.0,0.406192,0.0
2,0.324124,0.251711,0.0,0.0,0.0,0.0,0.426184,0.0,0.426184,0.426184,0.324124,0.0,0.426184



We tokenize and generate a vocab of the document. For each document, we can find the TF= (Number of repetitions of word in a document) / (# of words in a document). We can further find the IDF=Log[(Number of documents) / (Number of documents containing the word)]

| words      | Doc1 | Doc2| Doc3|IDF Value|
| ----------- | ----------- |----------- |----------- |----------- |
| going      | 0.16     |0.16|0.12|0|
| to   | 0.16       |0|0.12|0.41|
|today|0.16|0.16|0|0.41|
|i|0|0.16|0.12|0.41|
|am|0|0.16|0.12|0.41|
|it|0.16|0|0|1.09|
|is |0.16|0|0|1.09|
|rain|0.16|0|0|1.09|

We then construct a document-term matrix using the TF-IDF scores:

| Docs      | going |to|today|i|am|it|is|rain|
| ------ |------ |------ |------ |------ |------ |------ |------ |------ |
| Doc1      | 0  |0.07|0.07|0|0|0.17|0.17|0.17|0.17|
| Doc2   | 0  |0|0.07|0.07|0.07|0|0|0|
|Doc3|0|0.05|0|0.05|0.05|0|0|0|

It is easy to see that 'it', 'is', and 'rain' are important for Doc 1 but not Doc 2 or Doc 3. Each row of the document-term matrix can be thought of as a numeric representation of the documents, which we often term vectors. These numeric representations help you to find similarities between documents. 
 
> You might have noticed that stop words such as “to” and “is” are included above. These are usually filtered out in real-world NLP tasks because they don’t carry much meaning.

To perform vectorization in Python, we use the <code>TfidfVectorizer</code> from the <code>sklearn</code> package.

The steps are:
- Create the vectorizer.
- Fit it on your corpus.
- Transform your corpus into vectors.

The function **TfidfVectorizer** takes two parameters. 
- max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:
    - max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
    - max_df = 25 means "ignore terms that appear in more than 25 documents".
    - The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.
- min_df is used for removing terms that appear too infrequently. For example:
    - min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
    - min_df = 5 means "ignore terms that appear in less than 5 documents".
    - The default min_df is 1, which means "ignore terms that appear in less than 1 document". Thus, the default setting does not ignore any terms.



In [3]:
docs=Doc
#Convert a collection of raw documents to a matrix of TF-IDF features.
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.1, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(docs)


tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

Unnamed: 0,outside,premiere,rain,season,today,watch
0,0.0,0.0,0.795961,0.0,0.605349,0.0
1,0.795961,0.0,0.0,0.0,0.605349,0.0
2,0.0,0.57735,0.0,0.57735,0.0,0.57735


Now, after removing stop words, the resulting matrix looks like this (we'll call it M):

|     | outside   | premiere | rain     | season   | today    | watch    |
|-----|-----------|----------|----------|----------|----------|----------|
| 0   | 0         | 0        | 0.795961 | 0        | 0.605349 | 0        |
| 1   | 0.795961  | 0        | 0        | 0        | 0.605349 | 0        |
| 2   | 0         | 0.57735  | 0        | 0.57735  | 0        | 0.57735  |


It is notable that <code>tfidf</code> is a sparse matrix. If you'd like to view it as a full DataFrame, use:


### 1.2 Non-negative Matrix Factorization (NMF)

TF-IDF vectors are great, but high-dimensional. When we have hundreds or thousands of terms, interpretation becomes difficult.

To reduce this complexity and uncover latent themes, we use Non-negative Matrix Factorization (NMF), a powerful technique for **topic modeling**.

 If we think of the document-term matrix $M$ as a $m \times n$ matrix with $m$ documents and $n$ terms, $M$ can be factorized as 




$$
M=W \times H
$$

- M: Original document-term matrix (e.g., m docs × n terms)
- W: Document-topic matrix (m docs × k topics)
- H: Topic-term matrix (k topics × n terms)
- k: Number of topics

NMF finds W and H such that their product approximates M, and all values remain non-negative.

This technique helps extract topics from text — where each topic is a combination of words, and each document can belong to multiple topics with different strengths.
 
 

The function NMF takes two parameters. 
- n_components is the number of topics
- random_state controls the random number generator used in the attribute combining process.

In [4]:
from sklearn.decomposition import NMF

nmf_model = NMF(n_components=2, random_state=0)
#nmf_model.fit(tfidf)
W = nmf_model.fit_transform(tfidf)  # Document-topic matrix

# Display topics
feature_names = tfidf_vectorizer.get_feature_names_out()
topic_names=[]
# Assume nmf_model and feature_names are already defined
topic_names = []

# Loop through each topic
for topic_index in range(len(nmf_model.components_)):
    topic = nmf_model.components_[topic_index]
    print(topic)
    # Get the indices of the top 3 words (largest values in the topic)
    sorted_indices = topic.argsort()  # sorts from smallest to largest

    print(sorted_indices)
    top_indices = sorted_indices[-3:]  # get the last 3 (top 3 words)
    
    # Reverse to make it largest to smallest
    top_indices = top_indices[::-1]

    # Get the actual word names for these indices
    top_words = []
    for i in top_indices:
        top_words.append(feature_names[i])
    
    # Join the top words into a single string
    top_words_string = " ".join(top_words)

    # Print and save
    print("Topic #{}:".format(topic_index))
    print(top_words_string)
    topic_names.append(top_words_string)
topic_df = pd.DataFrame(W, columns=topic_names)
topic_df

topic_df = pd.DataFrame(nmf_model.components_ ,columns=feature_names)
topic_df

[0.81058158 0.         0.81058158 0.         1.23293637 0.        ]
[1 3 5 0 2 4]
Topic #0:
today rain outside
[0.         0.68727788 0.         0.68727788 0.         0.68727788]
[0 2 4 1 3 5]
Topic #1:
watch season premiere


Unnamed: 0,outside,premiere,rain,season,today,watch
0,0.810582,0.0,0.810582,0.0,1.232936,0.0
1,0.0,0.687278,0.0,0.687278,0.0,0.687278


This is the W matrix (document-topic distribution):


|     | today outside rain | watch season premiere |
|-----|--------------------|------------------------|
| 0   | 0.490981           | 0.000000               |
| 1   | 0.490981           | 0.000000               |
| 2   | 0.000000           | 0.840054               |

And this is the H matrix (topic-word distribution):

|     | outside   | premiere | rain     | season   | today    | watch    |
|-----|-----------|----------|----------|----------|----------|----------|
| 0   | 0.810582  | 0.000000 | 0.810582 | 0.000000 | 1.232936 | 0.000000 |
| 1   | 0.000000  | 0.687278 | 0.000000 | 0.687278 | 0.000000 | 0.687278 |


## 2 Analyzing Twitter Data

Finally, we get to practice using the Twitter data! 
### 2.1 What Social Media Accounts to Search?

To identify social media accounts related to AI tools, we perform a Google search using the keyword "AI marketing tools". Below are the Search Engine Results Pages (also known as “SERPs” or “SERP”).

The first few results are sponsored links, and one organic result points us to [15 Best AI Marketing Tools in 2023-2024](https://improvado.io/blog/best-ai-marketing-tools). Among the recommended AI tools, we are particularly interested in [Grammarly](https://twitter.com/Grammarly). Let's collect tweets generated by Grammarly's official account and examine which tweets get more likes.


> Grammarly is a cloud-based typing assistant. It reviews spelling, grammar, punctuation, clarity, engagement, and delivery mistakes in English texts, detects plagiarism, and suggests replacements for the identified errors. For a brief introduction to Grammarly, watch this [video](https://www.youtube.com/watch?v=zd64pGNLjVY).


### 2.2. Data Collection
Twitter has its API service. To simplify this data collection process, I built a little package.


In [4]:
#!pip3 install --upgrade --force-reinstall git+https://github.com/tantantan12/itom6219.git


In [5]:
import os
os.environ["BEARER_TOKEN"] = "AAAAAAAAAAAAAAAAAAAAAA7fGwEAAAAATek8qNEHmKiwy5NeLLGGLu%2FOllc%3DvMI6a81TOlLcj6fthUgm5xT66tHGcKYcklMRLcRZjxQBKpqWJp"


from itom6219 import user_info, user_tweets, user_tweets_all
user=user_info(["grammarly"])
user
#tweets=user_tweets(["grammarly"], exclude_replies=True, exclude_retweets=True)

#tweets_all=user_tweets_all(["sunomusic","TSwiftLyricsBot"],max_total=1000, exclude_replies=True, exclude_retweets=True)

Unnamed: 0,id,name,username,description,verified,created_at,public_metrics.followers_count,public_metrics.following_count,public_metrics.tweet_count,public_metrics.listed_count,public_metrics.like_count,public_metrics.media_count
0,47191725,Grammarly,Grammarly,Good writing moves work forward. #StandWithUkr...,True,2009-06-14T22:23:52.000Z,227923,3455,41476,2848,21049,9977


In [6]:
tweets=user_tweets(["grammarly"])

In [7]:
tweets

Unnamed: 0,lang,id,created_at,text,conversation_id,in_reply_to_user_id,edit_history_tweet_ids,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.bookmark_count,public_metrics.impression_count,username
0,en,1942327975266992252,2025-07-07T21:00:50.000Z,"@isamirDM Wow, Ibrahim—81 weeks strong! 💪 Big ...",1942285580127330508,810445800081936385,[1942327975266992252],1,1,1,0,0,25,Grammarly
1,en,1942307581965525412,2025-07-07T19:39:48.000Z,"RT @tbpn: TBPN | Monday, July 7th https://t.co...",1942307581965525412,,[1942307581965525412],8,0,0,0,0,0,Grammarly
2,en,1940809006550798352,2025-07-03T16:25:00.000Z,@alliswell4usai Thrilled to be part of your wr...,1940709842370678930,1859531083357786112,[1940809006550798352],0,0,1,0,0,32,Grammarly
3,zxx,1940112480799531218,2025-07-01T18:17:15.000Z,RT @Superhuman: https://t.co/5GQSDJDgIj,1940112480799531218,,[1940112480799531218],24,0,0,0,0,1,Grammarly
4,en,1940077687697285341,2025-07-01T15:59:00.000Z,Grammarly has announced its intent to acquire ...,1940077687697285341,,[1940077687697285341],30,19,264,38,65,226107,Grammarly
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,en,1886708018080817385,2025-02-04T09:26:59.000Z,@LTJ81 50 million words? What an incredible mi...,1886517350259703972,1411004539,[1886708018080817385],0,1,1,0,0,71,Grammarly
96,en,1886366614213333159,2025-02-03T10:50:22.000Z,@Ali_Khazaeei Hello! A member of our team has ...,1885948056710897717,1877370407151611906,[1886366614213333159],0,0,0,0,0,38,Grammarly
97,en,1884976520428454352,2025-01-30T14:46:38.000Z,@patrahgichobi That’s what we love to hear! Ke...,1884939768879800660,836501780158689280,[1884976520428454352],0,0,2,0,0,124,Grammarly
98,en,1884730492051955984,2025-01-29T22:29:00.000Z,RT @itsnicethat: Clear communication: explore ...,1884730492051955984,,[1884730492051955984],6,0,0,0,0,1,Grammarly


In [8]:
# We use pd.read_csv to read csv file
file_path = 'AI_tweets_all.csv'
df = pd.read_csv(file_path)
df

Unnamed: 0,text,referenced_tweets,id,author_id,edit_history_tweet_ids,created_at,entities.mentions,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.bookmark_count,public_metrics.impression_count,in_reply_to_user_id
0,RT @Infosys: We are delighted to announce the ...,"[{'type': 'retweeted', 'id': '1749406750195773...",1760971102103490795,5.451676e+07,['1760971102103490795'],2024-02-23T10:13:22.000Z,"[{'start': 3, 'end': 11, 'username': 'Infosys'...",709.0,0.0,0.0,0.0,0.0,0.0,
1,RT @edwardlimp: @rovercrc The problem is you m...,"[{'type': 'retweeted', 'id': '1760967808882762...",1760970913397309573,1.661782e+18,['1760970913397309573'],2024-02-23T10:12:37.000Z,"[{'start': 3, 'end': 14, 'username': 'edwardli...",10.0,0.0,0.0,0.0,0.0,0.0,
2,RT @BrnMetaverse: 💡 $BRN is everywhere\n\n🚀 We...,"[{'type': 'retweeted', 'id': '1760318913848504...",1760970867607998725,2.753974e+09,['1760970867607998725'],2024-02-23T10:12:26.000Z,"[{'start': 3, 'end': 16, 'username': 'BrnMetav...",60.0,0.0,0.0,0.0,0.0,0.0,
3,RT @kortizart: The Biden Harris administration...,"[{'type': 'retweeted', 'id': '1760820176780775...",1760970809063960931,3.688994e+09,['1760970809063960931'],2024-02-23T10:12:12.000Z,"[{'start': 3, 'end': 13, 'username': 'kortizar...",398.0,0.0,0.0,0.0,0.0,0.0,
4,A thread of all GenAI projects that I have bui...,,1760970779888419317,1.371103e+18,['1760970779888419317'],2024-02-23T10:12:05.000Z,,0.0,1.0,7.0,0.0,9.0,865.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
496,RT @AINN_BRC20: 🎉 Thrilled to announce $AINN j...,"[{'type': 'retweeted', 'id': '1760653986879697...",1762415592211144938,1.642337e+18,['1762415592211144938'],2024-02-27T09:53:15.000Z,"[{'start': 3, 'end': 14, 'username': 'AINN_BRC...",156.0,0.0,0.0,0.0,0.0,0.0,
497,@CryptoThro Missing $CODEX @codex_token 💎👀 \n\...,"[{'type': 'replied_to', 'id': '176238891087584...",1762415571033858182,1.744319e+18,['1762415571033858182'],2024-02-27T09:53:10.000Z,"[{'start': 0, 'end': 11, 'username': 'CryptoTh...",0.0,0.0,0.0,0.0,0.0,1.0,2.738088e+09
498,RT @rafatamames: Hoy como TopVoice escribo est...,"[{'type': 'retweeted', 'id': '1762412235740119...",1762415568823443879,3.145352e+08,['1762415568823443879'],2024-02-27T09:53:10.000Z,"[{'start': 3, 'end': 15, 'username': 'rafatama...",1.0,0.0,0.0,0.0,0.0,0.0,
499,Offered my perspective on some of @Google’s Ge...,,1762415548661477535,2.085956e+08,['1762415548661477535'],2024-02-27T09:53:05.000Z,"[{'start': 34, 'end': 41, 'username': 'Google'...",0.0,0.0,0.0,0.0,0.0,82.0,


## 3 Vectorization

In [15]:
docs=df['text']
#Convert a collection of raw documents to a matrix of TF-IDF features.
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(docs)


tfidf_df = pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

Unnamed: 0,000,00pm,10,1000,100xgems,12,14,15000,1dnzjoqndy,1m,...,どこが,なのか,なるほど,のようなフルオプトイン型契約の画像生成aiの存在を概ね今の利用者は無視しているんで詭弁に付き合う必要は無いと思います,ほうほうほう,フリーライドでき,元から許諾済みのみのgenaiがあるのに,急速にサーチのuxが置き換わると予想されてる,生成aiによる技術の発展,青線はgenaiによるアンサークエリー数で
0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
496,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
497,0.0,0.0,0.0,0.0,0.197348,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
498,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
499,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 3.4 Topic Modeling

In [16]:

# Apply NMF
from sklearn.decomposition import NMF

nmf_model = NMF(n_components=7, random_state=0)
nmf_model.fit(tfidf)
W = nmf_model.fit_transform(tfidf)  # Document-topic matrix

# Display topics
feature_names = tfidf_vectorizer.get_feature_names_out()
topic_names=[]
for topic_index in range(len(nmf_model.components_)):
    topic = nmf_model.components_[topic_index]
    # Get the indices of the top 3 words (largest values in the topic)
    sorted_indices = topic.argsort()  # sorts from smallest to largest
    top_indices = sorted_indices[-4:]  # get the last 3 (top 3 words)
    # Reverse to make it largest to smallest
    top_indices = top_indices[::-1]
    # Get the actual word names for these indices
    top_words = []
    for i in top_indices:
        top_words.append(feature_names[i])
    # Join the top words into a single string
    top_words_string = " ".join(top_words)
    # Print and save
    print("Topic #{}:".format(topic_index))
    print(top_words_string)
    topic_names.append(top_words_string)

topic_df = pd.DataFrame(W, columns=topic_names)
topic_df

Topic #0:
airdrop 000 10 genai
Topic #1:
genai_offi 5az6vq9msvjaxeinzp2nmwzh66mvtsjum3qiuxgvegkk rqdfwz3n3opcbwu87mxkkwaxj5xyv9fszvju8u6en5j aywxyybyfhoya9ewkwjwssjfcb6frevshnbhwhkcmrde
Topic #2:
genai sp4fxn6bor genaimemechallenge bullish
Topic #3:
codex utilities 100xgems 5o6m40ifkj
Topic #4:
https ai marketing genai
Topic #5:
gem getting kai guys
Topic #6:
00pm boom b² ama


Unnamed: 0,airdrop 000 10 genai,genai_offi 5az6vq9msvjaxeinzp2nmwzh66mvtsjum3qiuxgvegkk rqdfwz3n3opcbwu87mxkkwaxj5xyv9fszvju8u6en5j aywxyybyfhoya9ewkwjwssjfcb6frevshnbhwhkcmrde,genai sp4fxn6bor genaimemechallenge bullish,codex utilities 100xgems 5o6m40ifkj,https ai marketing genai,gem getting kai guys,00pm boom b² ama
0,0.009723,0.000000,0.019345,0.000000,0.041931,0.006715,0.000000
1,0.002644,0.000084,0.002855,0.000000,0.019999,0.016787,0.004619
2,0.001476,0.000000,0.000000,0.000000,0.041276,0.007060,0.000624
3,0.000000,0.000000,0.000000,0.000000,0.076822,0.003809,0.007667
4,0.011300,0.000000,0.034387,0.000000,0.095493,0.000000,0.000000
...,...,...,...,...,...,...,...
496,0.002289,0.000000,0.004224,0.000000,0.008777,0.009289,0.005431
497,0.000000,0.000000,0.000000,0.522019,0.000000,0.000000,0.000000
498,0.000693,0.000157,0.005944,0.000403,0.050524,0.008150,0.005094
499,0.002414,0.000000,0.004520,0.000000,0.094671,0.000000,0.000000



## 4 Linear Regression

Linear regression is one of the most commonly used techniques in data analysis. It helps us understand the relationship between one or more input variables (features) and an output variable (target). In the simplest case, it tries to draw a straight line that best fits the data.

In our example, we want to understand:

- How do the topics of Grammarly’s tweets influence the number of likes?
- Which topics are more likely to lead to higher engagement (likes)?
- Which topics seem to have less impact or even negative impact?

Each tweet is represented as a set of topic weights (from NMF), and our target is the like count for that tweet.

We’ll use the topic weights (<code>topic_df</code>) as features, and the like count (<code>df['public_metrics.like_count']</code>) as the target.


The model assumes a relationship of the form:

$$
\text{Like\_Count} = \beta_0 + \beta_1 \cdot \text{Topic}_1 + \beta_2 \cdot \text{Topic}_2 + \dots + \beta_k \cdot \text{Topic}_k
$$

- $\beta_0$ is the intercept.  
- $\beta_1, \beta_2, \dots, \beta_k$ are **coefficients** for each topic.  
- A **positive coefficient** ($\beta_i > 0$) means the topic is associated with **more likes**.  
- A **negative coefficient** ($\beta_i < 0$) means the topic is associated with **fewer likes**.



In [19]:
import pingouin as pg

# Combine X and y into a single dataframe
df_model = topic_df.copy()
df_model['ratio'] = df['public_metrics.like_count'] / df['log_view']

# Run linear regression
result = pg.linear_regression(df_model.drop(columns='ratio'), df_model['ratio'])

# Round coef and pval to 3 decimal places
result[['names', 'coef', 'pval']] = result[['names', 'coef', 'pval']].round(3)

# Display the rounded result
result[['names', 'coef', 'pval']]




AssertionError: Target (y) contains NaN or Inf. Please remove them manually or use remove_na=True.