__Initialization__

Run the cells in this section only once per machine

In [None]:
!echo "This could take a while. Status at bottom will show idle when done."
!git clone https://github.com/JustAnotherArchivist/snscrape
!echo done

In [None]:
!echo "This could take a while. Status at bottom will show idle when done."
!pip install snscrape
!pip install pandas
!pip install seaborn
!echo done

The cells below can be run multiple times

Let's get some data from an account with a lot of video

* `--max-results` limits the number of tweets (the default is to get all of them)
* `--since` only grabs tweets posted since a given date (YYYY-MM-DD format)
* The Twitter user name goes just before the greater-than sign (`>`), which redirects the output to a file

The complete example below gets up to 10000 tweets since Jan 1, 2021 for the user SeeFunnyVideo

`!snscrape --max-results 10000 --since 2021-01-01 --jsonl --progress twitter-user SeeFunnyVideo > tweets.jsonl`

In [None]:
!echo "This could take a while. Status at bottom will show idle when done."
!snscrape --max-results 100 --jsonl --since 2021-01-01 twitter-user SeeFunnyVideo > tweets.jsonl
!echo done

Run this command to dump the raw output. Not recommended, because the file can be long.

In [None]:
more tweets.jsonl
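
Rather than dumping the whole file, a few records can be inspected from Python. The sketch below is one way to do that. The helper name `peek_jsonl` is hypothetical (it is not part of snscrape), and it assumes every non-empty line of the file is a single JSON document, which is what the `--jsonl` flag writes.

```python
import json
from itertools import islice


def peek_jsonl(path, n=3):
    """Return the first n records of a JSON Lines file as dicts (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines so a stray empty line does not break json.loads.
        lines = (line for line in handle if line.strip())
        return [json.loads(line) for line in islice(lines, n)]
```

For example, `peek_jsonl("tweets.jsonl", 2)` would return the first two tweets as dictionaries, assuming the scrape cell above has already written `tweets.jsonl`.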

Set displayMaxRows to None (displayMaxRows = None) to show all rows, or to a number such as 10 (displayMaxRows = 10) to limit the output

In [None]:
displayMaxRows = 10

Parse tweets into a pandas DataFrame

* Set maxResults to None if you want all tweets or limit by setting a number
* Set since to None for all dates or use YYYY-MM-DD format to get all tweets since that date

In [None]:
maxResults = 100
since = None
twitterUser = "SeeFunnyVideo"
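
The way these three settings map onto command-line flags can be sketched as a small function. This is only an illustration: `build_snscrape_args` is a hypothetical helper, not something snscrape provides, and the date check is an added assumption based on the YYYY-MM-DD format described above.

```python
from datetime import datetime


def build_snscrape_args(user, max_results=None, since=None):
    """Build the snscrape argument list for the given settings (hypothetical helper)."""
    args = ["snscrape", "--jsonl"]
    if max_results is not None:
        args += ["--max-results", str(max_results)]
    if since is not None:
        # Raises ValueError if the value does not parse as a YYYY-MM-DD date.
        datetime.strptime(since, "%Y-%m-%d")
        args += ["--since", since]
    # The scraper name and the user name come last.
    args += ["twitter-user", user]
    return args


print(build_snscrape_args("SeeFunnyVideo", max_results=100))
```

Leaving a setting as None simply omits its flag, which matches the "None means all" behaviour described in the list above.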

In [None]:
import json
import pandas

temp = []

# Build the optional command-line flags; an empty string omits the flag.
if maxResults is None:
    maxResultsParam = ""
else:
    maxResultsParam = f"--max-results {maxResults}"

if since is None:
    sinceParam = ""
else:
    sinceParam = f"--since {since}"

print(f"Running: snscrape {maxResultsParam} --jsonl {sinceParam} twitter-user {twitterUser}")
print("This can take a while. The status bar at the bottom of the screen will say Busy until this is done.")

# Capture the command's output as a list of lines, one JSON document per tweet.
results = !snscrape {maxResultsParam} --jsonl {sinceParam} twitter-user {twitterUser}

print("Done scraping Twitter")

for json_str in results:
    result = json.loads(json_str)

    # Classify the tweet by its first media item, if it has one.
    isVideo = False
    isImage = False
    mediaType = "None"
    views = 0
    media = result["media"]
    if media:  # skips both None and an empty list
        if media[0]["_type"] == "snscrape.modules.twitter.Photo":
            isImage = True
            mediaType = "Image"
        elif media[0]["_type"] == "snscrape.modules.twitter.Video":
            isVideo = True
            mediaType = "Video"
            views = media[0]["views"]

    record = {
        "tweetId": result["id"],
        "tweetDate": result["date"],
        "replies": result["replyCount"],
        "retweets": result["retweetCount"],
        "likes": result["likeCount"],
        "quotes": result["quoteCount"],
        "source": result["sourceLabel"],
        "isVideo": isVideo,
        "isImage": isImage,
        "videoViews": views,
        "mediaType": mediaType
    }
    
    temp.append(record)

output = pandas.DataFrame(temp)

with pandas.option_context('display.max_rows', displayMaxRows):
    display(output)
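
The media classification inside the loop can also be pulled out into a function, which makes it easy to try on hand-made input. The sketch below is one possible version: `classify_media` and the two constants are hypothetical names, the `_type` strings are the ones used in the cell above, and treating a missing or empty view count as 0 is an assumption.

```python
PHOTO_TYPE = "snscrape.modules.twitter.Photo"
VIDEO_TYPE = "snscrape.modules.twitter.Video"


def classify_media(media):
    """Return (mediaType, views) based on the first media item (hypothetical helper)."""
    if not media:  # None or an empty list
        return "None", 0
    first = media[0]
    if first.get("_type") == PHOTO_TYPE:
        return "Image", 0
    if first.get("_type") == VIDEO_TYPE:
        # Assumption: a missing or null view count is reported as 0.
        return "Video", first.get("views") or 0
    return "None", 0


# Made-up media lists, for illustration only.
print(classify_media([{"_type": VIDEO_TYPE, "views": 1200}]))
print(classify_media(None))
```

As in the loop, only the first media item decides the type, so a tweet with several attachments is classified by whichever one comes first.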

Prepare the dataset for the scatter plot by summing retweets per media type and week

In [None]:
output['tweetDate'] = pandas.to_datetime(output['tweetDate'])
# Sum retweets per media type for each week ending on a Monday.
output = (
    output
    .groupby(['mediaType', pandas.Grouper(key='tweetDate', freq='W-MON')])['retweets']
    .sum()
    .reset_index()
    .sort_values('tweetDate')
)
with pandas.option_context('display.max_rows', displayMaxRows):
    display(output)
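
To see what the weekly grouping does, it can help to run the same chain on a tiny hand-made table. The rows below are made up for illustration, and the dates are timezone-naive for simplicity, whereas scraped tweet dates normally carry a timezone.

```python
import pandas

# Made-up rows for illustration; 2021-01-11 is a Monday.
sample = pandas.DataFrame({
    "mediaType": ["Video", "Video", "Video", "Image"],
    "tweetDate": pandas.to_datetime(
        ["2021-01-05", "2021-01-11", "2021-01-12", "2021-01-06"]
    ),
    "retweets": [3, 4, 10, 1],
})

# Same grouping as the cell above: one row per media type and week.
weekly = (
    sample
    .groupby(["mediaType", pandas.Grouper(key="tweetDate", freq="W-MON")])["retweets"]
    .sum()
    .reset_index()
    .sort_values("tweetDate")
)
print(weekly)
```

With `freq='W-MON'` each week ends on a Monday and is labelled with that Monday's date, so the video tweets from Tuesday Jan 5 and Monday Jan 11 are summed into the week labelled 2021-01-11, while the tweet from Jan 12 falls into the following week.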

Basic scatter plot

In [None]:
import seaborn
import matplotlib.pyplot as plt
 
seaborn.set(style='whitegrid')
plt.figure(figsize=(20, 5))
 
scatter = seaborn.scatterplot(x="tweetDate", y="retweets", hue = "mediaType", data=output)
plt.legend(loc='upper right', bbox_to_anchor=(1.2, 1))

Scatter plot with log scale on the Y axis

In [None]:
import seaborn
import matplotlib.pyplot as plt
 
seaborn.set(style='whitegrid')
plt.figure(figsize=(20, 5))
 
scatter = seaborn.scatterplot(x="tweetDate", y="retweets", hue = "mediaType", data=output)
scatter.set(yscale='log')
plt.legend(loc='upper right', bbox_to_anchor=(1.2, 1))