We fetched user-level and tweet-level OpenAI embeddings, stored them in BigQuery (BQ), and copied the data to CSV files on Google Drive.

We then de-duplicated the status embeddings, calculated the average status embedding for each user, and saved the resulting CSV files on Drive.
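
As a rough illustration of the de-duplicate-then-average step (the actual work was done in a prior notebook, so the `status_id` column name and the toy values below are assumptions):

```python
import pandas as pd

# Hypothetical toy data: one row per status, with a 2-dimensional embedding
statuses_df = pd.DataFrame({
    "user_id":   [1, 1, 1, 2],
    "status_id": [10, 10, 11, 12],   # status 10 appears twice
    "openai_0":  [0.2, 0.2, 0.4, 1.0],
    "openai_1":  [0.0, 0.0, 0.2, 0.5],
})

# Drop repeated statuses so they are not double-counted in the average
deduped_df = statuses_df.drop_duplicates(subset=["status_id"])

# Average each embedding dimension across a user's statuses
embedding_cols = [col for col in deduped_df.columns if col.startswith("openai_")]
user_averages_df = deduped_df.groupby("user_id")[embedding_cols].mean()
print(user_averages_df)
```

The result has one row per user and one column per embedding dimension, which is the shape of the "averaged" file used later in this notebook.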

This notebook provides a preliminary analysis of user-level vs tweet-level embeddings, focusing first on dimensionality reduction.

## Setup

Package installation:

In [161]:
%%capture
!pip install -U kaleido

You may need to restart the Colab session before continuing.

In [6]:
!pip list | grep kaleido

kaleido                          0.2.1


## Google Drive

In [7]:
import os
from google.colab import drive

drive.mount('/content/drive')
print(os.getcwd(), os.listdir(os.getcwd()))

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content ['.config', 'drive', 'results_pca_2', 'user_results_pca_2', 'sample_data']


In [8]:
# you might need to create a google drive SHORTCUT that has this same path
# ... or update the path to use your own google drive organization
#DIRPATH = '/content/drive/MyDrive/Research/Disinfo Research Shared 2022'
#DIRPATH = '/content/drive/MyDrive/Research/DS Research Shared 2023'
DIRPATH = '/content/drive/MyDrive/Research/DS Research Shared 2024'

print(DIRPATH)
os.path.isdir(DIRPATH)

/content/drive/MyDrive/Research/DS Research Shared 2024


True

New project-based directory structure for 2024:

https://drive.google.com/drive/folders/1SuXkqVT400uZ2OYFGGV8SYBf7NhtBo5k?usp=drive_link

In [9]:
DATA_DIRPATH = os.path.join(DIRPATH, "projects", "Impeachment 2020 Embeddings", "data")
os.path.isdir(DATA_DIRPATH)

True

In [10]:
os.listdir(DATA_DIRPATH)

['botometer_sample_max_50_openai_user_embeddings.csv.gz',
 'botometer_sample_max_50_openai_user_embeddings_unpacked.csv.gz',
 'botometer_sample_max_50_openai_status_embeddings_v3.csv.gz',
 'botometer_sample_max_50_openai_status_embeddings_v3_unpacked.parquet.gzip',
 'botometer_sample_max_50_openai_status_embeddings_v3_unpacked_deduped.parquet.gzip',
 'botometer_sample_max_50_openai_status_embeddings_v3_unpacked_deduped_averaged.csv.gz']

The "unpacked" versions have a column per embedding, and are generally easier to work with.
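
To make the "column per embedding" idea concrete, here is a minimal sketch of unpacking. It assumes the packed files store each vector as a JSON-style string in a single column named `embeddings`; that layout is a guess for illustration and is not confirmed by this notebook.

```python
import json
import pandas as pd

# Hypothetical "packed" layout: the whole vector stored as a string in one column
packed_df = pd.DataFrame({
    "user_id": [1, 2],
    "embeddings": ["[0.1, 0.2, 0.3]", "[0.4, 0.5, 0.6]"],
})

# Parse each string into a list, then spread the list across one column per dimension
vectors = packed_df["embeddings"].apply(json.loads).tolist()
unpacked_df = pd.DataFrame(vectors, index=packed_df.index)
unpacked_df.columns = [f"openai_{i}" for i in range(unpacked_df.shape[1])]

# Put the identifier back in front of the embedding columns
unpacked_df.insert(0, "user_id", packed_df["user_id"])
print(unpacked_df.columns.tolist())
```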

The files we will be working with are:
  + "botometer_sample_max_50_openai_user_embeddings_unpacked.csv.gz" (user-level embeddings), and
  + "botometer_sample_max_50_openai_status_embeddings_v3_unpacked_deduped_averaged.csv.gz" (average status-level embeddings).

In [55]:
RESULTS_DIRPATH = os.path.join(DIRPATH, "projects", "Impeachment 2020 Embeddings", "results")
os.makedirs(RESULTS_DIRPATH, exist_ok=True)
os.path.isdir(RESULTS_DIRPATH)

True

## Colors

In [11]:
# https://github.com/s2t2/openai-embeddings-2023/blob/main/app/colors.py

#GREY = "#ccc"
#PURPLE = "#7E57C2"

# colorbrewer scales
# light --> dark
BLUES = ['#f7fbff', '#deebf7', '#c6dbef', '#9ecae1', '#6baed6', '#4292c6', '#2171b5', '#08519c', '#08306b']
REDS = ['#fff5f0', '#fee0d2', '#fcbba1', '#fc9272', '#fb6a4a', '#ef3b2c', '#cb181d', '#a50f15', '#67000d']
PURPLES = ['#fcfbfd', '#efedf5', '#dadaeb', '#bcbddc', '#9e9ac8', '#807dba', '#6a51a3', '#54278f', '#3f007d']
GREYS = ['#ffffff', '#f0f0f0', '#d9d9d9', '#bdbdbd', '#969696', '#737373', '#525252', '#252525', '#000000']
GREENS = ["#edf8e9","#c7e9c0","#a1d99b","#74c476","#41ab5d","#238b45","#005a32"]
ORANGES = ['#fff5eb', '#fee6ce', '#fdd0a2', '#fdae6b', '#fd8d3c', '#f16913', '#d94801', '#a63603', '#7f2704']
BROWNS = ["#C46200", "#964B00"]
RD_PU = ["#feebe2","#fcc5c0","#fa9fb5","#f768a1","#dd3497","#ae017e","#7a0177"]
PU_RD = ["#f1eef6","#d4b9da","#c994c7","#df65b0","#e7298a","#ce1256","#91003f"]

OPINION_COLORS_MAP = {"Anti-Trump": BLUES[5], "Pro-Trump": REDS[5]}
BOT_COLORS_MAP = {"Human": GREYS[3], "Bot": PURPLES[6]}
Q_COLORS_MAP = {"Normal": GREYS[3], "Q-anon": REDS[6]}
TOXIC_COLORS_MAP = {"Toxic": BROWNS[1], "Normal": GREYS[3]}
FACT_COLORS_MAP = {"High Quality": GREYS[3], "Low Quality": RD_PU[4]}

FOURWAY_COLORS_MAP = {
    "Anti-Trump Human": BLUES[3],
    "Anti-Trump Bot": BLUES[6],

    "Pro-Trump Human": REDS[3],
    "Pro-Trump Bot": REDS[6],
}
SIXWAY_COLORS_MAP = {
    "Anti-Trump Human": BLUES[3],
    "Anti-Trump Bot": BLUES[6],

    "Pro-Trump Human": REDS[3],
    "Pro-Trump Bot": REDS[6],

    "Q-anon Human": REDS[4], # "Pro-Trump Q-anon Human"
    "Q-anon Bot": REDS[7], # "Pro-Trump Q-anon Bot"
}


COLORS_MAP = {
    "bot_label": BOT_COLORS_MAP,
    "opinion_label": OPINION_COLORS_MAP,
    "q_label": Q_COLORS_MAP,
    "toxic_label": TOXIC_COLORS_MAP,
    "factual_label": FACT_COLORS_MAP,

    "fourway_label": FOURWAY_COLORS_MAP,
    "sixway_label": SIXWAY_COLORS_MAP,
    "bom_overall_label": BOT_COLORS_MAP,
    "bom_astroturf_label": BOT_COLORS_MAP,
}


BOT_LABEL_ORDER = ["Human", "Bot"]
CATEGORY_ORDERS = {
    "bot_label": BOT_LABEL_ORDER,
    "bom_overall_label": BOT_LABEL_ORDER,
    "bom_astroturf_label": BOT_LABEL_ORDER,
    "opinion_label": ["Anti-Trump", "Pro-Trump"],
    "q_label": ["Normal", "Q-anon"],

    "toxic_label": ["Normal", "Toxic"],
    "factual_label": ["High Quality", "Low Quality"],

    "fourway_label": list(FOURWAY_COLORS_MAP.keys()),
    "sixway_label": list(SIXWAY_COLORS_MAP.keys()),
}

## Dimensionality Reduction

In [12]:
#import warnings
#warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*") # suppress umap warnings https://github.com/slundberg/shap/issues/2909
#warnings.simplefilter("ignore", DeprecationWarning) # suppress warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning) https://discuss.python.org/t/how-to-silence-pkg-resources-warnings/28629/7

In [58]:
import os

import numpy as np
from pandas import DataFrame
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

N_COMPONENTS = 2
#REDUCTION_RESULTS_DIRPATH = "results"
FIG_SHOW = True
FIG_SAVE = False

class ReductionPipeline:
    # adapted from: https://github.com/s2t2/openai-embeddings-2023/blob/main/app/reduction/pipeline.py

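    def __init__(self, x, labels, target, n_components=N_COMPONENTS, reducer_type="PCA", #results_dirpath=None
                ):
        """
        Params:
            x (DataFrame) : feature matrix, one row per observation and one column per embedding dimension
            labels (Series or DataFrame) : label values sharing the same index as x, merged back in for chart colors
            target (str) : name of the label column
            n_components (int) : number of dimensions to reduce to (plotting supports 2 or 3)
            reducer_type (str) : one of "PCA", "T-SNE", "UMAP" (note: perform() currently always runs PCA)
        """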

        self.x = x.copy()
        self.labels = labels.copy()
        self.target = target

        self.reducer_type = reducer_type
        self.reducer_name = {"PCA": "pca", "T-SNE": "tsne", "UMAP": "umap"}[self.reducer_type]

        self.x_scaled = (self.x - self.x.mean(axis=0)) / self.x.std(axis=0)
        #scaler = StandardScaler()
        #self.x_scaled = scaler.fit_transform(self.x)

        self.n_components = n_components
        self.component_names = [f"component_{i+1}" for i in range(self.n_components)]

        #self.results_dirpath = results_dirpath or f"results_pca_{self.n_components}"
        #os.makedirs(self.results_dirpath, exist_ok=True)


    def perform(self):
        self.pca = PCA(n_components=self.n_components, random_state=99)
        print(self.pca)

        embeddings = self.pca.fit_transform(self.x_scaled)
        print("EMBEDDINGS:", embeddings.shape)
        self.embeddings_df = DataFrame(embeddings, columns=self.component_names, index=self.x.index)

        print("EXPLAINED VARIANCE RATIO:", self.pca.explained_variance_ratio_)
        print("EXPLAINED VARIANCE:", self.pca.explained_variance_ratio_.sum().round(2))

        # https://stackoverflow.com/questions/21217710/factor-loadings-using-sklearn/44728692#44728692
        loadings = self.pca.components_.T * np.sqrt(self.pca.explained_variance_)
        print("LOADINGS", loadings.shape)
        self.loadings_df = DataFrame(loadings, columns=self.component_names, index=self.pca.feature_names_in_)


    #def plot_embeddings(self, fig_show=True, fig_save=False, height=350, labels=None, hover_data=None):
    #
    #    labels = labels or self.labels
    #
    #    chart_df = self.embeddings_df.copy()
    #    chart_df = chart_df.merge(self.labels, left_index=True, right_index=True) # ADD TARGET BACK FOR COLOR (ASSUMES INDEX IS the SAME)
    #    #chart_df = chart_df.merge(self.x, left_index=True, right_index=True) # ADD aLL DATA BACK SO WE CAN INSPECT FEATURES AS WELL
    #    #chart_df.sort_values(by=self.target, inplace=True)
    #
    #    fig = None
    #    if self.n_components == 2:
    #        fig = px.scatter(chart_df, x="component_1", y="component_2",
    #            color=self.target, height=height,
    #            title="PCA Embeddings (n_components=2)",
    #            #hover_data=self.x.columns.tolist() #["gender", "island", "body_mass_g"]
    #            hover_data=hover_data
    #        )
    #    elif self.n_components == 3:
    #        fig = px.scatter_3d(chart_df, x="component_1", y="component_2", z="component_3",
    #            color=self.target, height=height,
    #            title="PCA Embeddings (n_components=3)",
    #            #hover_data=self.x.columns.tolist() # ["gender", "island", "body_mass_g"]
    #        )
    #
    #    if fig and fig_show:
    #        fig.show()
    #
    #    if fig and fig_save:
    #        html_filepath = os.path.join(self.results_filepath, f"features.html")
    #        fig.write_html(html_filepath)
    #
    #        png_filepath = os.path.join(self.results_filepath, f"features.png")
    #        fig.write_image(png_filepath)



    def plot_embeddings(self, height=500, fig_show=FIG_SHOW, fig_save=FIG_SAVE, results_dirpath=None,
                        subtitle=None, text=None, size=None, hover_data=None,
                        color=None, color_map=None, color_scale=None, category_orders=None):

        chart_df = self.embeddings_df.copy()
        chart_df = chart_df.merge(self.labels, left_index=True, right_index=True) # ADD TARGET BACK FOR COLOR (ASSUMES INDEX IS the SAME)
        #chart_df = chart_df.merge(self.x, left_index=True, right_index=True) # ADD aLL DATA BACK SO WE CAN INSPECT FEATURES AS WELL
        #chart_df.sort_values(by=self.target, inplace=True)

        title = f"Dimensionality Reduction Results ({self.reducer_type} n_components={self.n_components})"
        if subtitle:
            title += f"<br><sup>{subtitle}</sup>"

        chart_params = dict(x="component_1", y="component_2",
            title=title, height=height,
            #color=color, #"artist_name",
            hover_data= hover_data #{"index": (self.embeddings_df.index)} #hover_data #["audio_filename", "track_number"]
        )
        if color:
            chart_params["color"] = color
        if color_map:
            chart_params["color_discrete_map"] = color_map
        if color_scale:
            chart_params["color_continuous_scale"] = color_scale
        if category_orders:
            chart_params["category_orders"] = category_orders
        if hover_data:
            chart_params["hover_data"] = hover_data
        if size:
            chart_params["size"] = size
        if text:
            chart_params["text"] = text

        if self.n_components == 2:
            fig = px.scatter(chart_df, **chart_params)
        elif self.n_components == 3:
            chart_params["z"] = "component_3"
            fig = px.scatter_3d(chart_df, **chart_params)
        else:
            return None

        if fig_show:
            fig.show()

        if fig_save:
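            # self.results_dirpath is never set in __init__, so fall back to a local directory when none is passed
            results_dirpath = results_dirpath or getattr(self, "results_dirpath", None) or f"results_{self.reducer_name}_{self.n_components}"
            os.makedirs(results_dirpath, exist_ok=True)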
            filestem = os.path.join(results_dirpath, f"{self.reducer_name}_{self.n_components}")
            fig.write_image(f"{filestem}.png")
            fig.write_html(f"{filestem}.html")

        return fig
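
One detail of the manual scaling in `__init__`: pandas `DataFrame.std` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below uses made-up numbers to suggest that the two scalings differ only by a constant factor, which should leave the PCA directions and explained variance ratios unchanged.

```python
import numpy as np
import pandas as pd

# Made-up feature matrix, for illustration only
x = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 0.0, 4.0, 6.0]})
n = len(x)

# Manual scaling as in the pipeline: pandas std uses ddof=1 by default
x_scaled_sample = (x - x.mean(axis=0)) / x.std(axis=0)

# Population scaling (ddof=0), the convention StandardScaler follows
x_scaled_population = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

# The two versions differ by the same constant factor in every column
factor = np.sqrt((n - 1) / n)
print(np.allclose(x_scaled_sample, x_scaled_population * factor))
```

With 7566 rows the factor is very close to 1, so the choice between the two mostly affects the scale of the component scores rather than the shape of the plots.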



## User Embeddings

7566 users

### Loading

Loading CSV from drive:

In [15]:
from pandas import read_csv

csv_filepath = os.path.join(DATA_DIRPATH, "botometer_sample_max_50_openai_user_embeddings_unpacked.csv.gz")
users_df = read_csv(csv_filepath, compression="gzip")
print(users_df.shape)
print(users_df.columns)
users_df.head()

(7566, 1547)
Index(['user_id', 'created_on', 'status_count', 'rt_count', 'is_bot',
       'opinion_community', 'is_q', 'avg_toxicity', 'avg_fact_score',
       'bom_astroturf',
       ...
       'openai_1526', 'openai_1527', 'openai_1528', 'openai_1529',
       'openai_1530', 'openai_1531', 'openai_1532', 'openai_1533',
       'openai_1534', 'openai_1535'],
      dtype='object', length=1547)


Unnamed: 0,user_id,created_on,status_count,rt_count,is_bot,opinion_community,is_q,avg_toxicity,avg_fact_score,bom_astroturf,...,openai_1526,openai_1527,openai_1528,openai_1529,openai_1530,openai_1531,openai_1532,openai_1533,openai_1534,openai_1535
0,3420436216,2015-08-13,555,540,True,0,False,0.056113,1.983193,0.295,...,-0.001867,-0.013167,0.020885,-0.022568,-0.033631,0.016153,0.024127,-0.017519,0.002636,-0.039838
1,108121958,2010-01-24,2,2,False,0,False,0.45671,,0.58,...,0.017651,-0.009439,0.024375,-0.032553,-0.042185,0.013782,0.01132,-0.014862,-0.010413,-0.020359
2,3038308638,2015-02-23,755,665,True,0,False,0.06986,3.401786,0.97,...,-0.026273,-0.008139,0.030285,-0.029902,-0.030887,0.022481,-0.005476,-0.016279,-0.010138,-0.021454
3,332396536,2011-07-09,951,951,True,1,False,0.044264,2.304511,0.58,...,-0.00552,-0.005288,0.017071,-0.033637,-0.040202,0.041773,-0.00937,0.003352,0.009391,-0.042671
4,955082522479808512,2018-01-21,570,533,True,0,False,0.049325,4.714286,0.355,...,0.009959,0.004695,0.005555,-0.012851,-0.032229,0.031443,0.008163,-0.018501,-0.008724,-0.042027


In [16]:
users_df["user_id"].nunique()

7566

In [17]:
users_df["is_bot"].value_counts()

False    4466
True     3100
Name: is_bot, dtype: int64

In [18]:
users_df["opinion_community"].value_counts()

0    4891
1    2675
Name: opinion_community, dtype: int64

In [19]:
users_df["avg_fact_score"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 7566 entries, 0 to 7565
Series name: avg_fact_score
Non-Null Count  Dtype  
--------------  -----  
3292 non-null   float64
dtypes: float64(1)
memory usage: 59.2 KB


In [20]:


from pandas import isnull

def add_labels(users_df):
    # APPLY SAME LABELS AS THE ORIGINAL SOURCE CODE
    # https://github.com/s2t2/openai-embeddings-2023/blob/1b8372dd36982009df5d4a80871f4c182ada743d/notebooks/2_embeddings_data_export.py#L51
    # https://github.com/s2t2/openai-embeddings-2023/blob/main/app/dataset.py#L37-L64

    # labels:
    users_df["opinion_label"] = users_df["opinion_community"].map({0:"Anti-Trump", 1:"Pro-Trump"})
    users_df["bot_label"] = users_df["is_bot"].map({True:"Bot", False:"Human"})
    users_df["fourway_label"] = users_df["opinion_label"] + " " + users_df["bot_label"]

    # language toxicity scores (0 low - 1 high)
    toxic_threshold = 0.1
    users_df["is_toxic"] = users_df["avg_toxicity"] >= toxic_threshold
    users_df["is_toxic"] = users_df["is_toxic"].map({True: 1, False :0 })
    users_df["toxic_label"] = users_df["is_toxic"].map({1: "Toxic", 0 :"Normal" })

    # fact check / media quality scores (1 low - 5 high)
    # there are null avg_fact_score, so we only apply operation if not null, and leave nulls
    fact_threshold = 3.0
    users_df["is_factual"] = users_df["avg_fact_score"].apply(lambda score: score if isnull(score) else score >= fact_threshold)
    users_df["is_factual"] = users_df["is_factual"].map({True: 1, False :0 })
    users_df["factual_label"] = users_df["is_factual"].map({1: "High Quality", 0 :"Low Quality" })

    # botometer binary and labels:
    users_df["is_bom_overall"] = users_df["bom_overall"].round()
    users_df["is_bom_astroturf"] = users_df["bom_astroturf"].round()
    users_df["bom_overall_label"] = users_df["is_bom_overall"].map({1:"Bot", 0:"Human"})
    users_df["bom_astroturf_label"] = users_df["is_bom_astroturf"].map({1:"Bot", 0:"Human"})
    users_df["bom_overall_fourway_label"] = users_df["opinion_label"] + " " + users_df["bom_overall_label"]
    users_df["bom_astroturf_fourway_label"] = users_df["opinion_label"] + " " + users_df["bom_astroturf_label"]

    return users_df


users_df = add_labels(users_df)
print(users_df.shape)
print(users_df.columns.tolist())
users_df.head()

(7566, 1560)
['user_id', 'created_on', 'status_count', 'rt_count', 'is_bot', 'opinion_community', 'is_q', 'avg_toxicity', 'avg_fact_score', 'bom_astroturf', 'bom_overall', 'openai_0', 'openai_1', 'openai_2', 'openai_3', 'openai_4', 'openai_5', 'openai_6', 'openai_7', 'openai_8', 'openai_9', 'openai_10', 'openai_11', 'openai_12', 'openai_13', 'openai_14', 'openai_15', 'openai_16', 'openai_17', 'openai_18', 'openai_19', 'openai_20', 'openai_21', 'openai_22', 'openai_23', 'openai_24', 'openai_25', 'openai_26', 'openai_27', 'openai_28', 'openai_29', 'openai_30', 'openai_31', 'openai_32', 'openai_33', 'openai_34', 'openai_35', 'openai_36', 'openai_37', 'openai_38', 'openai_39', 'openai_40', 'openai_41', 'openai_42', 'openai_43', 'openai_44', 'openai_45', 'openai_46', 'openai_47', 'openai_48', 'openai_49', 'openai_50', 'openai_51', 'openai_52', 'openai_53', 'openai_54', 'openai_55', 'openai_56', 'openai_57', 'openai_58', 'openai_59', 'openai_60', 'openai_61', 'openai_62', 'openai_63', 'opena

Unnamed: 0,user_id,created_on,status_count,rt_count,is_bot,opinion_community,is_q,avg_toxicity,avg_fact_score,bom_astroturf,...,is_toxic,toxic_label,is_factual,factual_label,is_bom_overall,is_bom_astroturf,bom_overall_label,bom_astroturf_label,bom_overall_fourway_label,bom_astroturf_fourway_label
0,3420436216,2015-08-13,555,540,True,0,False,0.056113,1.983193,0.295,...,0,Normal,0.0,Low Quality,0.0,0.0,Human,Human,Anti-Trump Human,Anti-Trump Human
1,108121958,2010-01-24,2,2,False,0,False,0.45671,,0.58,...,1,Toxic,,,0.0,1.0,Human,Bot,Anti-Trump Human,Anti-Trump Bot
2,3038308638,2015-02-23,755,665,True,0,False,0.06986,3.401786,0.97,...,0,Normal,1.0,High Quality,1.0,1.0,Bot,Bot,Anti-Trump Bot,Anti-Trump Bot
3,332396536,2011-07-09,951,951,True,1,False,0.044264,2.304511,0.58,...,0,Normal,0.0,Low Quality,1.0,1.0,Bot,Bot,Pro-Trump Bot,Pro-Trump Bot
4,955082522479808512,2018-01-21,570,533,True,0,False,0.049325,4.714286,0.355,...,0,Normal,1.0,High Quality,0.0,0.0,Human,Human,Anti-Trump Human,Anti-Trump Human
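
`add_labels` binarizes the Botometer scores with `.round()`. pandas (via NumPy) rounds exact halves to the nearest even value, so a score of exactly 0.5 would be labeled "Human". If a different convention is preferred, an explicit threshold is one option. A small sketch with made-up scores (the `bom_threshold` name is hypothetical and not part of the original code):

```python
import pandas as pd

scores = pd.Series([0.295, 0.5, 0.58, 0.97])  # made-up Botometer-style scores

# Round-half-to-even: a score of exactly 0.5 rounds down to 0.0
rounded = scores.round()

# Explicit threshold: a score of exactly 0.5 counts as a bot
bom_threshold = 0.5  # hypothetical cutoff, not from the original code
thresholded = (scores >= bom_threshold).astype(int)

print(rounded.tolist())
print(thresholded.tolist())
```

One caveat: a null score stays null under `.round()`, but compares as `False` under `>=`, so nulls would need separate handling with the threshold approach.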


In [21]:
users_df["is_factual"].value_counts()

0.0    1696
1.0    1596
Name: is_factual, dtype: int64

In [22]:
users_df["factual_label"].value_counts()

Low Quality     1696
High Quality    1596
Name: factual_label, dtype: int64
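
Note that `value_counts()` omits nulls by default. The 1696 + 1596 = 3292 labeled users match the non-null count of `avg_fact_score`, which leaves 7566 - 3292 = 4274 users without a factual label. Passing `dropna=False` makes that group visible. A toy sketch with made-up labels:

```python
import pandas as pd

labels = pd.Series(["High Quality", None, "Low Quality", None, None])  # made-up labels

counts_without_nulls = labels.value_counts()           # nulls are left out
counts_with_nulls = labels.value_counts(dropna=False)  # nulls get their own row

print(counts_without_nulls)
print(counts_with_nulls)
```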

In [23]:
users_df["is_toxic"].value_counts()


0    6132
1    1434
Name: is_toxic, dtype: int64

In [24]:
users_df["toxic_label"].value_counts()

Normal    6132
Toxic     1434
Name: toxic_label, dtype: int64

In [25]:
users_df["bot_label"].value_counts()

Human    4466
Bot      3100
Name: bot_label, dtype: int64

In [26]:
users_df["opinion_label"].value_counts()

Anti-Trump    4891
Pro-Trump     2675
Name: opinion_label, dtype: int64

In [27]:
users_df["fourway_label"].value_counts()

Anti-Trump Human    3010
Anti-Trump Bot      1881
Pro-Trump Human     1456
Pro-Trump Bot       1219
Name: fourway_label, dtype: int64

### Splitting

In [28]:
users_df.index = users_df["user_id"]

In [29]:
embeddings_cols = [col for col in users_df.columns if "openai" in col]
print(len(embeddings_cols))
print(embeddings_cols[0], "...", embeddings_cols[-1])

1536
openai_0 ... openai_1535
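
The substring test `"openai" in col` works here because only the embedding columns contain that text. If other columns ever did, a stricter prefix-plus-integer check is one alternative. A sketch with made-up column names (`is_embedding_col` is a hypothetical helper, and `openai_model_name` is an invented column):

```python
def is_embedding_col(col, prefix="openai_"):
    """Return True for names like 'openai_0' or 'openai_1535' (prefix followed by digits only)."""
    return col.startswith(prefix) and col[len(prefix):].isdigit()

columns = ["user_id", "openai_0", "openai_1535", "openai_model_name", "is_bot"]
selected_cols = [col for col in columns if is_embedding_col(col)]
print(selected_cols)
```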


In [30]:
users_x = users_df[embeddings_cols]
users_x.head()

user_id,openai_0,openai_1,openai_2,openai_3,openai_4,openai_5,openai_6,openai_7,openai_8,openai_9,...,openai_1526,openai_1527,openai_1528,openai_1529,openai_1530,openai_1531,openai_1532,openai_1533,openai_1534,openai_1535
3420436216,-0.018802,-0.007904,0.013753,-0.000709,-0.013829,0.007897,0.018043,-0.015105,-0.006663,-0.000824,...,-0.001867,-0.013167,0.020885,-0.022568,-0.033631,0.016153,0.024127,-0.017519,0.002636,-0.039838
108121958,-0.030552,-0.00533,0.014622,-0.015383,0.004529,0.022213,-0.00498,-0.008592,-0.004993,-0.007705,...,0.017651,-0.009439,0.024375,-0.032553,-0.042185,0.013782,0.01132,-0.014862,-0.010413,-0.020359
3038308638,-0.007297,0.001028,0.002483,-0.00499,-0.021728,-0.0037,0.008174,0.004453,0.014321,-0.004018,...,-0.026273,-0.008139,0.030285,-0.029902,-0.030887,0.022481,-0.005476,-0.016279,-0.010138,-0.021454
332396536,-0.018347,-0.007322,-0.009216,-0.023215,-0.007946,0.003921,0.023664,-0.009896,-0.00153,-0.010064,...,-0.00552,-0.005288,0.017071,-0.033637,-0.040202,0.041773,-0.00937,0.003352,0.009391,-0.042671
955082522479808512,-0.024804,0.007517,0.008514,-0.022979,0.003899,0.000756,0.024158,-0.016718,-0.003902,-0.008977,...,0.009959,0.004695,0.005555,-0.012851,-0.032229,0.031443,0.008163,-0.018501,-0.008724,-0.042027


In [31]:
#user_labels = users_df.drop(columns=embeddings_cols)
#print(user_labels.columns.tolist())
#user_labels.head()

### PCA 2

In [124]:
# /usr/local/lib/python3.10/dist-packages/plotly/express/_core.py:1223:
# PerformanceWarning: DataFrame is highly fragmented.
# This is usually the result of calling `frame.insert` many times, which has poor performance.
# Consider joining all columns at once using pd.concat(axis=1) instead.
# To get a de-fragmented frame, use `newframe = frame.copy()`
#  df_output[col_name] = to_unindexed_series(df_input[argument])



In [32]:
target = "fourway_label" #@param ["bot_label", "opinion_label", "fourway_label", "toxic_label", "factual_label"]
user_labels = users_df[target]
user_labels

user_id
3420436216               Anti-Trump Bot
108121958              Anti-Trump Human
3038308638               Anti-Trump Bot
332396536                 Pro-Trump Bot
955082522479808512       Anti-Trump Bot
                             ...       
1620694747               Anti-Trump Bot
1047878200406069248      Anti-Trump Bot
823502850336624640       Anti-Trump Bot
26966663                 Anti-Trump Bot
884121768428003329       Anti-Trump Bot
Name: fourway_label, Length: 7566, dtype: object

In [33]:
users_pipeline = ReductionPipeline(x=users_x, labels=user_labels, target=target, n_components=2)

users_pipeline.perform()

PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.08211384 0.02338218]
EXPLAINED VARIANCE: 0.11
LOADINGS (1536, 2)


In [34]:
users_pipeline.embeddings_df.head()

user_id,component_1,component_2
3420436216,-12.864011,-0.976722
108121958,7.074388,-5.434687
3038308638,-10.170547,-0.784753
332396536,-10.812691,5.623136
955082522479808512,-10.050573,-0.195144


In [None]:
# todo: re-implement colors map and category orders
#users_pipeline.plot_embeddings(fig_show=False, fig_save=False, height=350, )

In [35]:
color_map = COLORS_MAP[target]
category_orders = {target: CATEGORY_ORDERS[target]}

users_pipeline.plot_embeddings(fig_show=False, fig_save=False, height=350,
    color=target, color_map=color_map, category_orders=category_orders
)

In [59]:

groupby_cols = [
    "bot_label", "opinion_label", # "bom_overall_label", "bom_astroturf_label",
    "toxic_label", "factual_label",
    "fourway_label", #"sixway_label",
]

for groupby_col in groupby_cols:
    color_map = COLORS_MAP[groupby_col]
    category_orders = {groupby_col: CATEGORY_ORDERS[groupby_col]}

    labels = users_df[groupby_col]
    pipeline = ReductionPipeline(x=users_x, labels=labels, target=groupby_col, n_components=2)

    results_dirpath = os.path.join(RESULTS_DIRPATH, "openai_embeddings_v2", "text-embedding-ada-002", f"user_embeddings_{pipeline.reducer_type.lower()}_{pipeline.n_components}", groupby_col)
    os.makedirs(results_dirpath, exist_ok=True)

    pipeline.perform()

    pipeline.plot_embeddings(
        color=groupby_col, color_map=color_map, category_orders=category_orders,
        #hover_data=["user_id", "bot_label"],
        fig_show=True, fig_save=True,
        results_dirpath=results_dirpath
    )

PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.08211384 0.02338218]
EXPLAINED VARIANCE: 0.11
LOADINGS (1536, 2)


PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.08211384 0.02338218]
EXPLAINED VARIANCE: 0.11
LOADINGS (1536, 2)


PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.08211384 0.02338218]
EXPLAINED VARIANCE: 0.11
LOADINGS (1536, 2)


PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.08211384 0.02338218]
EXPLAINED VARIANCE: 0.11
LOADINGS (1536, 2)


PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.08211384 0.02338218]
EXPLAINED VARIANCE: 0.11
LOADINGS (1536, 2)


## Tweet Embeddings (User Averages)

183K statuses, averaged for each user (see the prior notebook), resulting in 7566 rows.

### Loading

In [47]:
from pandas import read_csv

csv_filepath = os.path.join(DATA_DIRPATH, "botometer_sample_max_50_openai_status_embeddings_v3_unpacked_deduped_averaged.csv.gz")
averages_df = read_csv(csv_filepath)
print(averages_df.shape)
print(averages_df.columns)
averages_df.index = averages_df["user_id"]
averages_df.head()

(7566, 1548)
Index(['user_id', 'openai_0', 'openai_1', 'openai_2', 'openai_3', 'openai_4',
       'openai_5', 'openai_6', 'openai_7', 'openai_8',
       ...
       'created_on', 'status_count', 'rt_count', 'is_bot', 'opinion_community',
       'is_q', 'avg_toxicity', 'avg_fact_score', 'bom_astroturf',
       'bom_overall'],
      dtype='object', length=1548)


user_id,user_id,openai_0,openai_1,openai_2,openai_3,openai_4,openai_5,openai_6,openai_7,openai_8,...,created_on,status_count,rt_count,is_bot,opinion_community,is_q,avg_toxicity,avg_fact_score,bom_astroturf,bom_overall
2952,2952,-0.023816,0.002004,0.004429,-0.019361,-0.00986,0.004878,0.00096,-0.015426,-0.00643,...,2006-07-24,6,6,False,0,False,0.006899,,0.21,0.2
635553,635553,-0.030022,-0.006063,0.017259,-0.018501,-0.008536,0.004416,-0.01184,-0.010581,-0.010859,...,2007-01-15,12,12,False,0,False,0.077787,,0.24,0.16
656993,656993,-0.010723,0.008235,0.004192,-0.040441,-0.015172,0.012798,-0.015786,0.008556,-0.022145,...,2007-01-17,1,1,False,0,False,0.025031,,0.11,0.1
761154,761154,-0.021389,-0.004747,0.006925,-0.017395,-0.0119,0.018309,-0.007047,-0.024175,0.001368,...,2007-02-09,4,0,False,0,False,0.172311,,0.13,0.72
777554,777554,-0.009369,-0.009612,0.01247,0.005079,-0.019303,-0.010459,0.019815,-0.019171,-0.017594,...,2007-02-17,1,1,False,0,False,0.00166,,0.15,0.03


In [38]:
averages_df["user_id"].nunique()

7566

In [39]:
len(averages_df)

7566

In [48]:
averages_df = add_labels(averages_df)
print(averages_df.shape)
print(averages_df.columns.tolist())
averages_df.head()

(7566, 1561)
['user_id', 'openai_0', 'openai_1', 'openai_2', 'openai_3', 'openai_4', 'openai_5', 'openai_6', 'openai_7', 'openai_8', 'openai_9', 'openai_10', 'openai_11', 'openai_12', 'openai_13', 'openai_14', 'openai_15', 'openai_16', 'openai_17', 'openai_18', 'openai_19', 'openai_20', 'openai_21', 'openai_22', 'openai_23', 'openai_24', 'openai_25', 'openai_26', 'openai_27', 'openai_28', 'openai_29', 'openai_30', 'openai_31', 'openai_32', 'openai_33', 'openai_34', 'openai_35', 'openai_36', 'openai_37', 'openai_38', 'openai_39', 'openai_40', 'openai_41', 'openai_42', 'openai_43', 'openai_44', 'openai_45', 'openai_46', 'openai_47', 'openai_48', 'openai_49', 'openai_50', 'openai_51', 'openai_52', 'openai_53', 'openai_54', 'openai_55', 'openai_56', 'openai_57', 'openai_58', 'openai_59', 'openai_60', 'openai_61', 'openai_62', 'openai_63', 'openai_64', 'openai_65', 'openai_66', 'openai_67', 'openai_68', 'openai_69', 'openai_70', 'openai_71', 'openai_72', 'openai_73', 'openai_74', 'openai_75

user_id,user_id,openai_0,openai_1,openai_2,openai_3,openai_4,openai_5,openai_6,openai_7,openai_8,...,is_toxic,toxic_label,is_factual,factual_label,is_bom_overall,is_bom_astroturf,bom_overall_label,bom_astroturf_label,bom_overall_fourway_label,bom_astroturf_fourway_label
2952,2952,-0.023816,0.002004,0.004429,-0.019361,-0.00986,0.004878,0.00096,-0.015426,-0.00643,...,0,Normal,,,0.0,0.0,Human,Human,Anti-Trump Human,Anti-Trump Human
635553,635553,-0.030022,-0.006063,0.017259,-0.018501,-0.008536,0.004416,-0.01184,-0.010581,-0.010859,...,0,Normal,,,0.0,0.0,Human,Human,Anti-Trump Human,Anti-Trump Human
656993,656993,-0.010723,0.008235,0.004192,-0.040441,-0.015172,0.012798,-0.015786,0.008556,-0.022145,...,0,Normal,,,0.0,0.0,Human,Human,Anti-Trump Human,Anti-Trump Human
761154,761154,-0.021389,-0.004747,0.006925,-0.017395,-0.0119,0.018309,-0.007047,-0.024175,0.001368,...,1,Toxic,,,1.0,0.0,Bot,Human,Anti-Trump Bot,Anti-Trump Human
777554,777554,-0.009369,-0.009612,0.01247,0.005079,-0.019303,-0.010459,0.019815,-0.019171,-0.017594,...,0,Normal,,,0.0,0.0,Human,Human,Anti-Trump Human,Anti-Trump Human


### Splitting

In [49]:
averages_x = averages_df[embeddings_cols]
averages_x

user_id,openai_0,openai_1,openai_2,openai_3,openai_4,openai_5,openai_6,openai_7,openai_8,openai_9,...,openai_1526,openai_1527,openai_1528,openai_1529,openai_1530,openai_1531,openai_1532,openai_1533,openai_1534,openai_1535
2952,-0.023816,0.002004,0.004429,-0.019361,-0.009860,0.004878,0.000960,-0.015426,-0.006430,0.001027,...,-0.012285,0.001094,0.015767,-0.026536,-0.024981,0.015113,0.018588,-0.002324,-0.003782,-0.028532
635553,-0.030022,-0.006063,0.017259,-0.018501,-0.008536,0.004416,-0.011840,-0.010581,-0.010859,-0.003771,...,-0.005960,-0.007866,0.010948,-0.021376,-0.023424,0.020705,0.005084,-0.011961,-0.003258,-0.026262
656993,-0.010723,0.008235,0.004192,-0.040441,-0.015172,0.012798,-0.015786,0.008556,-0.022145,-0.017026,...,-0.018110,0.007116,-0.004877,-0.032427,-0.023885,-0.000715,0.003886,-0.024242,0.003839,-0.048883
761154,-0.021389,-0.004747,0.006925,-0.017395,-0.011900,0.018309,-0.007047,-0.024175,0.001368,0.002065,...,0.013326,-0.020819,0.007364,-0.016794,-0.049548,0.013037,0.024798,-0.008543,0.006142,-0.035867
777554,-0.009369,-0.009612,0.012470,0.005079,-0.019303,-0.010459,0.019815,-0.019171,-0.017594,-0.006209,...,0.007358,0.010696,0.008784,-0.024808,-0.008042,0.011077,0.001996,-0.001104,-0.019460,-0.030301
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1234200349600288772,-0.024670,-0.007194,0.012253,-0.015047,-0.003037,0.004080,-0.004318,-0.005398,0.000077,-0.006043,...,0.001035,-0.011842,0.022085,-0.027978,-0.030627,0.017037,-0.001254,-0.012667,-0.002032,-0.026470
1234846911028453376,0.002930,-0.013252,-0.010225,-0.034511,-0.022378,0.011991,0.008446,0.007933,0.007091,-0.018994,...,0.009281,-0.010663,0.016082,-0.026220,0.000994,0.016662,0.005803,0.005945,0.001228,-0.041925
1237940420136456192,-0.013656,-0.002694,0.007115,-0.019419,-0.001060,0.007154,-0.000715,-0.013499,0.001435,-0.011751,...,0.008012,0.011208,0.025522,-0.025476,-0.034344,0.023171,-0.007861,-0.006911,-0.005543,-0.026032
1238854780191195136,-0.041529,-0.024860,0.005283,-0.026827,0.005801,0.000377,0.011535,-0.013447,0.002424,0.010552,...,0.009792,-0.014946,0.028075,-0.031737,-0.047090,0.030326,-0.023545,-0.014824,0.003257,-0.022161


### PCA 2

In [42]:
averages_labels = averages_df[target]
averages_labels

user_id
2952                   Anti-Trump Human
635553                 Anti-Trump Human
656993                 Anti-Trump Human
761154                 Anti-Trump Human
777554                 Anti-Trump Human
                             ...       
1234200349600288772     Pro-Trump Human
1234846911028453376    Anti-Trump Human
1237940420136456192     Pro-Trump Human
1238854780191195136    Anti-Trump Human
1240138605726760962    Anti-Trump Human
Name: fourway_label, Length: 7566, dtype: object

In [43]:
averages_pipeline = ReductionPipeline(x=averages_x, labels=averages_labels, target=target, n_components=2)

averages_pipeline.perform()

PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.04725761 0.03663586]
EXPLAINED VARIANCE: 0.08
LOADINGS (1536, 2)
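
The "EXPLAINED VARIANCE" line is the sum of the per-component ratios, rounded to two decimals. As a quick check, the ratios printed in this notebook for both datasets can be re-added (values copied from the cell outputs):

```python
# Explained variance ratios copied from the PCA outputs in this notebook
user_ratios = [0.08211384, 0.02338218]      # user-level embeddings
average_ratios = [0.04725761, 0.03663586]   # averaged status embeddings

user_total = sum(user_ratios)
average_total = sum(average_ratios)

print(round(user_total, 4), round(average_total, 4))
```

In both cases the first two components capture roughly a tenth of the total variance or less, so the 2D scatter plots show only a small slice of the embedding structure.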


In [44]:
averages_pipeline.embeddings_df.head()

user_id,component_1,component_2
2952,-6.801425,-1.905464
635553,-2.489854,-7.58317
656993,6.493947,-21.43361
761154,-0.132166,-3.147631
777554,-4.050601,7.515363


In [None]:
# todo: re-implement colors map and category orders
#averages_pipeline.plot_embeddings(fig_show=True, fig_save=False, height=350)

In [61]:

groupby_cols = [
    "bot_label", "opinion_label", # "bom_overall_label", "bom_astroturf_label",
    "toxic_label", "factual_label",
    "fourway_label", #"sixway_label",
]

for groupby_col in groupby_cols:
    color_map = COLORS_MAP[groupby_col]
    category_orders = {groupby_col: CATEGORY_ORDERS[groupby_col]}

    labels = averages_df[groupby_col]
    pipeline = ReductionPipeline(x=averages_x, labels=labels, target=groupby_col, n_components=2)

    results_dirpath = os.path.join(RESULTS_DIRPATH, "openai_embeddings_v2", "text-embedding-ada-002", f"status_avg_embeddings_{pipeline.reducer_type.lower()}_{pipeline.n_components}", groupby_col)
    os.makedirs(results_dirpath, exist_ok=True)

    pipeline.perform()

    pipeline.plot_embeddings(
        color=groupby_col, color_map=color_map, category_orders=category_orders,
        #hover_data=["user_id", "bot_label"],
        fig_show=True, fig_save=True,
        results_dirpath=results_dirpath
    )

PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.04725761 0.03663586]
EXPLAINED VARIANCE: 0.08
LOADINGS (1536, 2)


PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.04725761 0.03663586]
EXPLAINED VARIANCE: 0.08
LOADINGS (1536, 2)


PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.04725761 0.03663586]
EXPLAINED VARIANCE: 0.08
LOADINGS (1536, 2)


PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.04725761 0.03663586]
EXPLAINED VARIANCE: 0.08
LOADINGS (1536, 2)


PCA(n_components=2, random_state=99)
EMBEDDINGS: (7566, 2)
EXPLAINED VARIANCE RATIO: [0.04725761 0.03663586]
EXPLAINED VARIANCE: 0.08
LOADINGS (1536, 2)
