Applying Trained Models to other Datasets #27

Closed
s2t2 opened this issue Dec 18, 2023 · 4 comments

s2t2 commented Dec 18, 2023

Goal: apply the user-classification models we trained on the OpenAI text embeddings to other datasets of political discussion on Twitter.

Datasets: use the combined election 2020 + transition 2021 dataset in the shared environment ("election_2020_transition_2021_combined").

Models: stored on Google Cloud Storage (see the example notebook for how to load them).

Success criteria: store the scores back in Google BigQuery (in new tables in that combined dataset).

Process / Steps:

  1. Assemble a table of user timeline tweet texts, with a row per user and all of their tweet texts concatenated together. See the notebooks for an example approach (FYI: consider all of notebooks 1-3 and try to consolidate them into a single process where applicable). Whether we take only the unique tweets for each user, or leave duplicates in, remains to be seen; first let's query the dataset to see how common this is. How many times does each user repeat a given tweet verbatim? We will likely use all of a user's tweets (hopefully we sufficiently de-duped during the collection process, but we should check for duplicates before moving on). So now we have a table with user_id and tweet_texts columns. Note: this table will only have a sample of the user's tweets (for example max 50 or maybe 100 tweets per user, selected at random). For the script, let's parameterize the tweet limit as an environment variable, perhaps called TWEETS_MAX or TWEETS_LIMIT, and let's also consider including this number in the name of the BigQuery table used to store the embeddings (like "openai_user_timeline_embeddings_max_50" or "openai_user_timeline_embeddings_max_100", using different tables for different limits). Let's start with 50 only for now, as it matches the approach we used when training the models. Note that for all subsequent tables derived from this data, we might want to name those tables with "_max_50" as well, to allow us to differentiate later.
  2. For each user in that new table, we will loop through them, maybe in batches, and obtain OpenAI text embeddings for each user's concatenated tweet texts (see the sketch after this list). Let's use the same ada text embedding model that we used when training the models, and leverage the existing code to obtain the embeddings. Specifically, when fetching the embeddings, we will "fetch in dynamic batches" to get around API limits. Let's store the embeddings in a separate table, in a way that tells us which embeddings belong to which users (still a row per user). When we save the embeddings, let's save all ~1,536 values into a single "embeddings" column of an array datatype (we may need to run BigQuery migrations first to set up that table structure). Our script for obtaining embeddings should only attempt to obtain embeddings for users we haven't already obtained embeddings for, so at the top of the script we may want to first fetch only the list of users that don't currently have records in the embeddings table (this might require joining the user timeline texts table to the embeddings table). For this script, let's use an environment variable called something like USERS_LIMIT, for testing and developing with small batches of 5-10 users at a time.
  3. Demonstrate the ability to load the trained models from cloud storage. We can use the existing storage service. NOTE: Logistic Regression may be the most reliably loadable (there may be issues with Random Forest; we need to revisit how those models were saved).
  4. Let's use the text embeddings as inputs to perform classification. We want to perform classification for each task (bot detection, opinion classification, toxicity, news quality), so we may have separate tables for each task, or a column denoting which type of scores is being stored. We should also keep track of which model was used to produce the scores. We'll wind up with one or more tables of scores (and probabilities) for each user for each classification task:
    + Bot Status
    + Opinion Community
    + Lang Toxicity
    + News Quality
    + Fourway label (multiclass bot status x opinion community)
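
For step 2, here is a minimal sketch of the embeddings-fetching loop. The table names, column names, and fixed batch size are illustrative assumptions, not the repo's actual identifiers, and the existing "fetch in dynamic batches" code should be reused for the real script:

import os
from google.cloud import bigquery
from openai import OpenAI

TWEETS_MAX = int(os.getenv("TWEETS_MAX", "50"))
USERS_LIMIT = int(os.getenv("USERS_LIMIT", "10"))  # small batches for dev/testing

DATASET = "tweet-research-shared.election_2020_transition_2021_combined"
TEXTS_TABLE = f"{DATASET}.users_and_timeline_texts_max_{TWEETS_MAX}"  # assumed name
EMBEDDINGS_TABLE = f"{DATASET}.openai_user_timeline_embeddings_max_{TWEETS_MAX}"

bq = bigquery.Client()
ai = OpenAI()  # reads OPENAI_API_KEY from the environment

# only fetch users that don't already have records in the embeddings table:
sql = f"""
    SELECT t.user_id, t.tweet_texts
    FROM `{TEXTS_TABLE}` t
    LEFT JOIN `{EMBEDDINGS_TABLE}` e USING (user_id)
    WHERE e.user_id IS NULL
    LIMIT {USERS_LIMIT}
"""
users = list(bq.query(sql).result())

BATCH_SIZE = 5  # stand-in for the dynamic batching that works around API limits
for i in range(0, len(users), BATCH_SIZE):
    batch = users[i : i + BATCH_SIZE]
    resp = ai.embeddings.create(
        model="text-embedding-ada-002",  # same model used when training
        input=[row["tweet_texts"] for row in batch],
    )
    rows = [
        {"user_id": row["user_id"], "embeddings": item.embedding}  # ARRAY<FLOAT64> column
        for row, item in zip(batch, resp.data)
    ]
    # assumes the embeddings table was already created via a migration:
    errors = bq.insert_rows_json(EMBEDDINGS_TABLE, rows)
    assert not errors, errors
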
s2t2 commented Dec 18, 2023

-- TODO: create table users_and_timeline_texts_sample (something like this)
SELECT
    user_id
    ,count(distinct status_id) as tweet_count
    -- count of retweets (will be less than the tweet count):
    ,count(distinct case when retweeted_status_id is not null then status_id end) as rt_count
    ,min(date(created_at)) as first_tweet_on
    ,max(date(created_at)) as latest_tweet_on

    -- here we are grabbing at most X of the user's tweets at random:
    --,string_agg(t.status_text, '{TWEET_DELIMETER}' ORDER BY rand() LIMIT {int(TWEETS_MAX)}) as tweet_texts
    ,string_agg(t.status_text, ' || ' ORDER BY rand() LIMIT 50) as tweet_texts

FROM `tweet-research-shared.election_2020_transition_2021_combined.tweets_v2_slim` t
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
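
A sketch of how the TWEETS_MAX environment variable could parameterize both the query and the destination table name (the variable and table names here are illustrative assumptions):

import os

TWEETS_MAX = int(os.getenv("TWEETS_MAX", "50"))
TWEET_DELIMETER = " || "

# "_max_50" naming convention, so different limits land in different tables:
destination_table = f"users_and_timeline_texts_max_{TWEETS_MAX}"

query = f"""
    SELECT
        user_id
        ,count(distinct status_id) as tweet_count
        ,string_agg(t.status_text, '{TWEET_DELIMETER}' ORDER BY rand() LIMIT {TWEETS_MAX}) as tweet_texts
    FROM `tweet-research-shared.election_2020_transition_2021_combined.tweets_v2_slim` t
    GROUP BY 1
"""
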

s2t2 commented Dec 18, 2023

We did a one-time bucket transfer from the upstream data collection environment into the shared project, so the models are now accessible to researchers with access to the shared project.

GOOGLE_PROJECT_NAME="tweet-research-shared"
BUCKET_NAME="openai-embeddings-2023-shared"
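
A minimal sketch of steps 3-4: load a trained model from this bucket and score the stored embeddings. The blob path and table names are assumptions (see the repo's storage service and example notebook for the actual layout); Logistic Regression is used here since it loads most reliably:

import joblib
from google.cloud import bigquery, storage

GOOGLE_PROJECT_NAME = "tweet-research-shared"
BUCKET_NAME = "openai-embeddings-2023-shared"
MODEL_BLOB = "models/logistic_regression/bot_status/model.joblib"  # assumed path

gcs = storage.Client(project=GOOGLE_PROJECT_NAME)
gcs.bucket(BUCKET_NAME).blob(MODEL_BLOB).download_to_filename("model.joblib")
model = joblib.load("model.joblib")

bq = bigquery.Client(project=GOOGLE_PROJECT_NAME)
rows = list(bq.query("""
    SELECT user_id, embeddings
    FROM `tweet-research-shared.election_2020_transition_2021_combined.openai_user_timeline_embeddings_max_50`
""").result())

X = [row["embeddings"] for row in rows]
labels = model.predict(X)        # predicted class per user
probas = model.predict_proba(X)  # class probabilities per user

# row per user per task, tracking which model produced the scores:
scores = [
    {"user_id": row["user_id"], "model_name": "logistic_regression",
     "task": "bot_status", "label": str(label), "probability": float(proba.max())}
    for row, label, proba in zip(rows, labels, probas)
]
bq.insert_rows_json(
    "tweet-research-shared.election_2020_transition_2021_combined.user_scores_bot_status_max_50",
    scores,  # assumed destination table, following the "_max_50" convention
)
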

s2t2 assigned JiazanShi Dec 18, 2023
JiazanShi commented

For step 1:
Before sampling user data from 'election_2020_transition_2021_combined', we checked whether there are duplicates in the dataset. There are no duplicate records, but we found that some users retweet the same text. Since we want to use unique text data for training and testing our models, we checked the distribution of the number of tweets and unique tweets per user.

-- check duplicated text for the same user
SELECT user_id, status_text, COUNT(DISTINCT status_id) AS text_dup_cnts
FROM `tweet-research-shared.election_2020_transition_2021_combined.tweets_v2_slim`
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10

-- the number of tweets and unique tweets per user
SELECT user_id,
    COUNT(status_text) AS text_cnts,
    COUNT(DISTINCT status_text) AS dedup_text_cnts
FROM `tweet-research-shared.election_2020_transition_2021_combined.tweets_v2_slim`
GROUP BY 1

-- also check the users in the training dataset
SELECT user_id,
    COUNT(status_text) AS text_cnts,
    COUNT(DISTINCT status_text) AS unique_text_cnts
FROM `tweet-research-shared.impeachment_2020.tweets_v2`
WHERE user_id IN ({str(user_list).strip('[]')}) -- the user id list from the training dataset
GROUP BY 1

There is no huge difference between the tweet counts and the unique text counts, which means repeated text makes up only a small proportion of our dataset.
[Figure: distribution of tweet counts vs. unique tweet counts per user]

So for the following steps, we will not de-duplicate when sampling the user dataset.
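
As a quick sanity check on that claim, a sketch that quantifies the overall duplication rate in a single aggregate (same table as above):

from google.cloud import bigquery

bq = bigquery.Client(project="tweet-research-shared")
row = list(bq.query("""
    SELECT
        COUNT(status_text) AS text_cnts,
        COUNT(DISTINCT status_text) AS dedup_text_cnts
    FROM `tweet-research-shared.election_2020_transition_2021_combined.tweets_v2_slim`
""").result())[0]
print(f"duplication rate: {1 - row['dedup_text_cnts'] / row['text_cnts']:.4f}")
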


s2t2 commented Feb 24, 2024

Closed by #28

s2t2 closed this as completed Feb 24, 2024