# Analysis of Tweets on Generative AI

### Import Packages & Download Dataset

In [1]:
# install tweetnlp
# ! pip install tweetnlp

Collecting tweetnlp
  Using cached tweetnlp-0.4.4.tar.gz (54 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting ray[tune] (from tweetnlp)
  Downloading ray-2.8.1-cp39-cp39-win_amd64.whl.metadata (13 kB)
Collecting numpy (from tweetnlp)
  Downloading numpy-1.26.2-cp39-cp39-win_amd64.whl.metadata (61 kB)
     ---------------------------------------- 0.0/61.2 kB ? eta -:--:--
     ------ --------------------------------- 10.2/61.2 kB ? eta -:--:--
     ---------------------------------------- 61.2/61.2 kB 1.1 MB/s eta 0:00:00
Collecting urlextract (from tweetnlp)
  Using cached urlextract-1.8.0-py3-none-any.whl (21 kB)
Collecting transformers<=4.21.2 (from tweetnlp)
  Using cached transformers-4.21.2-py3-none-any.whl (4.7 MB)
Collecting huggingface-hub<=0.9.1 (from tweetnlp)
  Using cached huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting sentence_transformers (from tweetnlp)
  Using cached sentence-transformers-2.

In [42]:
# import packages
import os
import pandas as pd
from datetime import datetime as dt
import numpy as np
import tweetnlp
import spacy

In [10]:
df = pd.read_csv(r'C:\Users\rebri\Documents\Data Projects\gen-ai-tweets\dataset\GenerativeAI tweets.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Datetime,Tweet Id,Text,Username
0,0,2023-04-19 21:27:19+00:00,1648800467206672384,From Studio Gangster to Synthetic Gangster 🎤.....,resembleai
1,1,2023-04-19 21:27:09+00:00,1648800425540476929,Took me some time to find this. I build this #...,devaanparbhoo
2,2,2023-04-19 21:26:57+00:00,1648800376479715328,Mind blowing next wave #generativeai platform...,timreha
3,3,2023-04-19 21:26:49+00:00,1648800341193027584,Open Source Generative AI Image Specialist Sta...,VirtReview
4,4,2023-04-19 21:25:00+00:00,1648799883934203905,Are you an #HR leader considering which future...,FrozeElle


In [11]:
df.dtypes

Unnamed: 0     int64
Datetime      object
Tweet Id       int64
Text          object
Username      object
dtype: object

To process the data, we will:
1. Drop the unnamed column (duplicate index)
2. Change the datetime column to datetime datatype


In [12]:
df.drop(columns = ['Unnamed: 0'], inplace = True)
df.head()

Unnamed: 0,Datetime,Tweet Id,Text,Username
0,2023-04-19 21:27:19+00:00,1648800467206672384,From Studio Gangster to Synthetic Gangster 🎤.....,resembleai
1,2023-04-19 21:27:09+00:00,1648800425540476929,Took me some time to find this. I build this #...,devaanparbhoo
2,2023-04-19 21:26:57+00:00,1648800376479715328,Mind blowing next wave #generativeai platform...,timreha
3,2023-04-19 21:26:49+00:00,1648800341193027584,Open Source Generative AI Image Specialist Sta...,VirtReview
4,2023-04-19 21:25:00+00:00,1648799883934203905,Are you an #HR leader considering which future...,FrozeElle


In [13]:
df['Datetime'] = pd.to_datetime(df['Datetime'], format = '%Y-%m-%d %H:%M:%S%z')
df.head()

Unnamed: 0,Datetime,Tweet Id,Text,Username
0,2023-04-19 21:27:19+00:00,1648800467206672384,From Studio Gangster to Synthetic Gangster 🎤.....,resembleai
1,2023-04-19 21:27:09+00:00,1648800425540476929,Took me some time to find this. I build this #...,devaanparbhoo
2,2023-04-19 21:26:57+00:00,1648800376479715328,Mind blowing next wave #generativeai platform...,timreha
3,2023-04-19 21:26:49+00:00,1648800341193027584,Open Source Generative AI Image Specialist Sta...,VirtReview
4,2023-04-19 21:25:00+00:00,1648799883934203905,Are you an #HR leader considering which future...,FrozeElle


In [14]:
# check datatypes again
df.dtypes

Datetime    datetime64[ns, UTC]
Tweet Id                  int64
Text                     object
Username                 object
dtype: object

Let's check what the text of the tweets look like.

In [15]:
# check full text of a tweet
df.iloc[3, 2]

'Open Source Generative AI Image Specialist Stability AI Turns to Text \n\n@StabilityAI:  "Our StableLM models can generate text and code and will power a range of downstream applications."\n\n#StableLM #ai #generativeai #llms #machinelearning #ml #stabilityai\nhttps://t.co/KNbP5bTK8g'

I chose to show the above Tweet because it includes a number of interesting features which we should address:
*   Link at the end of the Tweet
*   Newline characters (\n)
* Hashtags (#StableLM)
* Mentions (@StabilityAI)

Let's start with links. First, I want to see if more Tweets include a link.

In [16]:
# select 5 random tweets and print the full text
tweets = []
for i in range(5):
  x = np.random.randint(0, len(df))
  text = df.iloc[i, 2]
  tweets.append(text)
tweets

['From Studio Gangster to Synthetic Gangster 🎤... we investigate how we suspect the #ghostwriter created the Drake and The Weeknd generative AI track ... \n\n#musicindustry #musicproducer #HipHopMusic #hiphopculture #AIVOICE #GenerativeAI https://t.co/KYPM3Bz8xw',
 'Took me some time to find this. I build this #nocode #prototype in Dec 2018. It’s a reality today, #botsociety #generativeai #ai #gpt https://t.co/1G2jDB3DEG',
 'Mind blowing next wave #generativeai  platform #cuebric\n\nInterview w Pinar Seyhan Demirdag Gary Kopechek at Vu Technologies #interview #genai #virtualproduction #nabshow2023 #iaaglobal #whatscomingnext https://t.co/pBeNG62sOe',
 'Open Source Generative AI Image Specialist Stability AI Turns to Text \n\n@StabilityAI:  "Our StableLM models can generate text and code and will power a range of downstream applications."\n\n#StableLM #ai #generativeai #llms #machinelearning #ml #stabilityai\nhttps://t.co/KNbP5bTK8g',
 "Are you an #HR leader considering which future tre

All five of these Tweets include a link, so it is likely many more do too. I will separate out all links in to a new column

In [17]:
df['Link'] = df['Text'].str.extract(r'(https?://[^\s]+)')
df['Text'] = df['Text'].str.replace(r'https?://[^\s]+', '')
df.head()

Unnamed: 0,Datetime,Tweet Id,Text,Username,Link
0,2023-04-19 21:27:19+00:00,1648800467206672384,From Studio Gangster to Synthetic Gangster 🎤.....,resembleai,https://t.co/KYPM3Bz8xw
1,2023-04-19 21:27:09+00:00,1648800425540476929,Took me some time to find this. I build this #...,devaanparbhoo,https://t.co/1G2jDB3DEG
2,2023-04-19 21:26:57+00:00,1648800376479715328,Mind blowing next wave #generativeai platform...,timreha,https://t.co/pBeNG62sOe
3,2023-04-19 21:26:49+00:00,1648800341193027584,Open Source Generative AI Image Specialist Sta...,VirtReview,https://t.co/KNbP5bTK8g
4,2023-04-19 21:25:00+00:00,1648799883934203905,Are you an #HR leader considering which future...,FrozeElle,https://t.co/LVJpzkMH9P


By checking for null values, we can see that not *all* Tweets include a link.

In [18]:
# check for null values
df.isnull().sum()

Datetime       0
Tweet Id       0
Text           0
Username       0
Link        8092
dtype: int64

Next, I want to copy the hashtags (#) and mentions (@) for each Tweet, so that I may further analyze them.

In [19]:
 import re

r1 = "#\w+"
r2 = "@\w+"

df['Hashtag'] = df['Text'].str.findall(r1)
df['Mention'] = df['Text'].str.findall(r2)
df

Unnamed: 0,Datetime,Tweet Id,Text,Username,Link,Hashtag,Mention
0,2023-04-19 21:27:19+00:00,1648800467206672384,From Studio Gangster to Synthetic Gangster 🎤.....,resembleai,https://t.co/KYPM3Bz8xw,"[#ghostwriter, #musicindustry, #musicproducer,...",[]
1,2023-04-19 21:27:09+00:00,1648800425540476929,Took me some time to find this. I build this #...,devaanparbhoo,https://t.co/1G2jDB3DEG,"[#nocode, #prototype, #botsociety, #generative...",[]
2,2023-04-19 21:26:57+00:00,1648800376479715328,Mind blowing next wave #generativeai platform...,timreha,https://t.co/pBeNG62sOe,"[#generativeai, #cuebric, #interview, #genai, ...",[]
3,2023-04-19 21:26:49+00:00,1648800341193027584,Open Source Generative AI Image Specialist Sta...,VirtReview,https://t.co/KNbP5bTK8g,"[#StableLM, #ai, #generativeai, #llms, #machin...",[@StabilityAI]
4,2023-04-19 21:25:00+00:00,1648799883934203905,Are you an #HR leader considering which future...,FrozeElle,https://t.co/LVJpzkMH9P,"[#HR, #AI, #ML, #GenerativeAI]","[@holgermu, @diginomica, @jonerp, @workday]"
...,...,...,...,...,...,...,...
56216,2022-04-24 16:40:01+00:00,1518268535276904448,"Understanding Generative AI, Its Impacts and L...",analyticsinme,https://t.co/H3RzuP4zhl,"[#GenerativeAI, #ArtificialIntelligence, #Arti...",[]
56217,2022-04-23 07:23:24+00:00,1517766068592381952,Y ya puedes empezar a crear #arte con @thegeni...,iia_es,https://t.co/EYx5zmhz5t,"[#arte, #InteligenciaArtificial, #aiart, #arte...",[@thegeniverse]
56218,2022-04-22 08:20:21+00:00,1517418013812830208,"NVIDIA researchers have developed GANverse3D, ...",VideoGenAI,https://t.co/56aSc34Lsx,[#GenerativeAI],[]
56219,2022-04-21 13:15:21+00:00,1517129866403008512,Tech Trend 2022: เทรนด์เทคโนโลยีสำหรับปี 2022 ...,sitthinuntp,https://t.co/ZBeiHJfTuT,"[#technology, #technologytrend, #gartner, #tec...",[]


Lastly, we can remove the newline characters

In [20]:
df['Text'] = df['Text'].str.replace('\n', '')
df.iloc[3, 2]

'Open Source Generative AI Image Specialist Stability AI Turns to Text @StabilityAI:  "Our StableLM models can generate text and code and will power a range of downstream applications."#StableLM #ai #generativeai #llms #machinelearning #ml #stabilityaihttps://t.co/KNbP5bTK8g'

To explore how TweetNLP handles mentions and hashtags, we will use testcases with a couple of the models

In [22]:
# testing with NER model
model = tweetnlp.NER()

testcase = 'For example, @Microsoft and #Google and @Amazon are all big tech companies. So is @IBM'
model.ner(testcase)

[{'type': 'corporation', 'entity': ' @usericrosoft'},
 {'type': 'corporation', 'entity': ' #Google'},
 {'type': 'corporation', 'entity': ' @usermazon'},
 {'type': 'corporation', 'entity': ' @user'}]

In [38]:
# testing with the sentiment model

model = tweetnlp.Sentiment()

senTest = pd.DataFrame()
testcases = ['I love dogs',
             'I #love dogs',
             'I hate dogs',
             'I #hate dogs']
for string in testcases:
  result = model.sentiment(string, return_probability = True)
  negative, neutral, positive = result['probability'].values()
  newdf = pd.DataFrame({
    'Phrase' : [string],
    'Negative' : [negative],
    'Neutral' : [neutral],
    'Positive' : [positive]
  })
  senTest = pd.concat([senTest,newdf], ignore_index = True)



senTest
# create visualization to emphasize?

Unnamed: 0,Phrase,Negative,Neutral,Positive
0,I love dogs,0.0116,0.056431,0.931969
1,I #love dogs,0.006855,0.025927,0.967218
2,I hate dogs,0.859892,0.117777,0.022331
3,I #hate dogs,0.920269,0.066798,0.012934


What are the most common words or phrases about genAI?
Are there any recurring topics or named entities?

In [55]:
# load medium spacy model
spacy.cli.download("en_core_web_md")
nlp = spacy.load("en_core_web_md")



# doc = nlp(concat_text)


[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_md')


In [66]:
text_list = df['Text'].to_list()
concat_text = " ".join(firsthalf_list)

nlp.max_length = 5897741
doc = nlp(concat_text)

In [69]:
from collections import Counter

nouns = [token.text for token in doc if token.is_stop != True and token.is_punct != True and token.pos_ == 'NOUN']
word_freq = Counter(nouns)
word_freq.most_common(10)

[('GenerativeAI', 5081),
 ('ChatGPT', 1891),
 ('generativeAI', 1689),
 ('AI', 1648),
 ('art', 1404),
 ('technology', 1390),
 ('images', 1331),
 ('data', 1216),
 ('AIArt', 1205),
 ('tech', 1193)]

This dataset size is so large that it may cause memory issues when using the NER or parser features. To address this, I will split the dataset in half.

In [64]:
import math

fhEnd = math.floor(len(df)/2)
shStart = math.ceil(len(df)/2)


firsthalf_list = df.loc[0:fhEnd, 'Text'].to_list()
# sechalf_list = df.loc[shStart:, 'Text'].to_list()
fh_text = " ".join(firsthalf_list)
# sh_text = " ".join(sechalf_list)
len(firsthalf_list)

28111

In [65]:
fhDoc = nlp(fh_text)

ValueError: [E088] Text of length 5897741 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

In [54]:
import math 
fhEnd = math.floor(len(df)/2)
shStart = math.ceil(len(df)/2)

28111