<a href="https://colab.research.google.com/github/s2t2/ml-project-2022/blob/main/notebooks/Hashtag_Similarity_Mapping_20221216_DELIVER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

## Installing Packages

In [None]:
print("THIS IS A DELIVERABLE")

THIS IS A DELIVERABLE


In [None]:
%%capture
!pip install umap-learn[plot]

In [None]:
%%capture
!pip install hdbscan

In [None]:
%%capture
# https://www.pauldesalvo.com/how-to-download-plotly-express-charts-as-images-in-google-colab/
!pip install kaleido
# quote the requirement so the shell does not treat ">=" as a redirect (which writes a file named "=4.0.0")
!pip install "plotly>=4.0.0"
!wget https://github.com/plotly/orca/releases/download/v1.2.1/orca-1.2.1-x86_64.AppImage -O /usr/local/bin/orca
!chmod +x /usr/local/bin/orca
!apt-get install xvfb libgtk2.0-0 libgconf-2-4

## Imports

In [None]:
import os

from google.colab import drive
from google.cloud import bigquery

from pandas import DataFrame, crosstab
from sklearn.preprocessing import StandardScaler # OneHotEncoder 
from sklearn.pipeline import Pipeline 

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP

from hdbscan import HDBSCAN

from sklearn import metrics

import plotly.express as px

## Mounting the Drive

In [None]:
drive.mount('/content/drive')
print(os.listdir(os.getcwd())) 

Mounted at /content/drive
['.config', '=4.0.0', 'drive', 'sample_data']


In [None]:
# you might need to create a google drive SHORTCUT that has this same path
# ... or update the path to use your own google drive organization
DIRPATH = '/content/drive/MyDrive/Research/Disinfo Research Shared 2022'
print(DIRPATH)
os.path.isdir(DIRPATH)

/content/drive/MyDrive/Research/Disinfo Research Shared 2022


True

## Configuring Credentials 


In [None]:
# google.cloud checks the file at path designated by the GOOGLE_APPLICATION_CREDENTIALS env var
# so we set it here using the shared credentials JSON file from our shared google drive
# and verify it for good measure

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.join(DIRPATH, "credentials", "tweet-research-shared-268bbccc0aac.json") 

GOOGLE_APPLICATION_CREDENTIALS = os.getenv("GOOGLE_APPLICATION_CREDENTIALS") # implicit check by google.cloud
print(GOOGLE_APPLICATION_CREDENTIALS) # verification for implicit check
print(os.path.isfile(GOOGLE_APPLICATION_CREDENTIALS)) # verification for implicit check

/content/drive/MyDrive/Research/Disinfo Research Shared 2022/credentials/tweet-research-shared-268bbccc0aac.json
True


## BigQuery Service

In [None]:
class BigQueryService():
    def __init__(self):
        self.client = bigquery.Client()

    def execute_query(self, sql, verbose=True):
        if verbose:
            print(sql)
        job = self.client.query(sql)
        return job.result()
    
    def query_to_df(self, sql, verbose=True):
        """high-level wrapper to return a DataFrame"""
        results = self.execute_query(sql, verbose=verbose)
        records = [dict(row) for row in list(results)]
        df = DataFrame(records)
        return df


In [None]:
bq_service = BigQueryService()
print(bq_service)

<__main__.BigQueryService object at 0x7fcbd4520c40>


## Helpers

In [None]:
DATA_DIRPATH = os.path.join(DIRPATH, "data", "profile_tag_mapping")
FIGURES_DIRPATH = os.path.join(DIRPATH, "figures", "profile_tag_mapping")

In [None]:
def component_names(n_components):
    columns = []
    for n in range(0, n_components):
        letter = "abcdefghijklmnopqrstuvwxyz"[n] # assumes 26 components or fewer (one per letter); beyond that, use n as the component identifier instead
        columns.append(f"component_{letter}")
    #print(columns)
    return columns

assert component_names(1) == ["component_a"]
assert component_names(2) == ["component_a", "component_b"]
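
The inline comment in `component_names` mentions falling back to a numeric identifier once the letters run out. A minimal sketch of that fallback might look like the following; `component_names_safe` is a hypothetical name and is not used elsewhere in this notebook.

```python
import string

def component_names_safe(n_components):
    """Letter suffixes while they last (a..z), numeric suffixes beyond that."""
    letters = string.ascii_lowercase
    if n_components <= len(letters):
        return [f"component_{letters[n]}" for n in range(n_components)]
    # hypothetical fallback: 1-based numeric identifiers for every component
    return [f"component_{n + 1}" for n in range(n_components)]

print(component_names_safe(3))
```

Switching the whole list to numbers (rather than mixing letters and numbers) keeps the column names uniform within a single export.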

# Dashboard

In [None]:
TOP_TAGS_LIMIT = 50 #@param {type:"slider", min:5, max:250, step:5}

## Fetching Data

In [None]:
# there are 1,288,844 total rows (per user per tag), so let's only take the top X tags instead
# sql = f"""
#     SELECT DISTINCT user_id, tag
#     FROM `tweet-research-shared.impeachment_2020.profile_tags_v2_flat`
#     ORDER BY user_id
#     -- LIMIT 10
# """

In [None]:
sql = f"""
    WITH top_tags as (
        SELECT tag, count(DISTINCT user_id) as user_count
        FROM `tweet-research-shared.impeachment_2020.profile_tags_v2_flat`
        GROUP BY tag
        ORDER BY user_count DESC
        LIMIT {int(TOP_TAGS_LIMIT)}
    )

    SELECT DISTINCT pt.user_id, pt.tag
    FROM `tweet-research-shared.impeachment_2020.profile_tags_v2_flat` pt
    JOIN top_tags on top_tags.tag = pt.tag
    ORDER BY user_id
"""

df = bq_service.query_to_df(sql, verbose=False)
df.head()

Unnamed: 0,user_id,tag
0,409,#IMPEACH
1,409,#RESIST
2,1153,#RESIST
3,1186,#1
4,3301,#RESIST


In [None]:
print("ROWS:", len(df))
print("USERS:", len(df["user_id"].unique()))
print("TAGS:", len(df["tag"].unique()))

ROWS: 360424
USERS: 183516
TAGS: 50


For top 250 tags, we see 506,119 rows (per user per tag) with 236,988 unique users and 250 unique tags.

For top 100 tags, we see 419,540 rows (per user per tag) with 204,423 unique users and 100 unique tags.

For top 75 tags, we see 394,703 rows (per user per tag) with 196,717 unique users and 75 unique tags.

For top 50 tags, we see 360,424 rows (per user per tag) with 183,516 unique users and 50 unique tags.

For top 25 tags, we see 297,095 rows (per user per tag) with 165,458 unique users and 25 unique tags.
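
The counts above can be turned into an average number of selected tags per user, which gives a rough sense of how sparse the data is at each setting. The sketch below only re-uses the figures reported in the text; nothing is queried.

```python
# (rows, unique users) reported above for each TOP_TAGS_LIMIT setting
reported = {
    250: (506_119, 236_988),
    100: (419_540, 204_423),
    75: (394_703, 196_717),
    50: (360_424, 183_516),
    25: (297_095, 165_458),
}

# rows are (user, tag) pairs, so rows / users is the mean number of tags per user
tags_per_user = {limit: rows / users for limit, (rows, users) in reported.items()}

for limit in sorted(tags_per_user):
    print(f"top {limit:>3} tags: {tags_per_user[limit]:.2f} tags per user on average")
```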





## Top Tags

In [None]:
#from plotly.express import bar

CHART_TAGS_LIMIT = 15

chart_data = df.groupby(["tag"])["user_id"].count().sort_values(ascending=False).head(CHART_TAGS_LIMIT)
#chart_data.rename(columns={'count':'user_count'})
#chart_data

In [None]:
chart_data = df.groupby(["tag"])["user_id"].count().sort_values(ascending=True).tail(CHART_TAGS_LIMIT)

fig = px.bar(x=chart_data.values, y=chart_data.index, orientation="h", #, height=750,
          title="Top Hashtags found in User Profiles (Impeachment 2020 Dataset)", 
          labels={"x":"Count of Distinct Users", "y": "Hashtag"},
          color_discrete_sequence=["#7F7F7F"]  #["#E377C2"] # ["#FFDD71"]  #["#8C564B"] #["#FF7F0E"] # "#aaaaaa"  "#004ba8"
)

fig.show()

In [None]:
top_tags = df.groupby(["tag"])["user_id"].count().sort_values(ascending=False)
tags_df = DataFrame({"tag": top_tags.index, "user_count": top_tags.values})
#tags_df.index = top_tags.index
tags_df.to_csv(os.path.join(DATA_DIRPATH, f"profile_tags_{TOP_TAGS_LIMIT}.csv" ))
tags_df.head(10)

Unnamed: 0,tag,user_count
0,#MAGA,63744
1,#RESIST,35313
2,#KAG,27820
3,#TRUMP2020,20984
4,#WWG1WGA,14999
5,#2A,13681
6,#THERESISTANCE,11647
7,#FBR,10574
8,#RESISTANCE,10324
9,#QANON,7355


In [None]:
#tags_df[tags_df["tag"] == "#WWG1WGA"]["user_count"].iloc[0]

In [None]:
#dict(top_tags)

## One Hot Encoding

In [None]:
df.head()

Unnamed: 0,user_id,tag
0,409,#IMPEACH
1,409,#RESIST
2,1153,#RESIST
3,1186,#1
4,3301,#RESIST


In [None]:
# from pandas import get_dummies as one_hot_encoder
#
# one_hot_encoder(df["tag"]) #> OK
# one_hot_encoder(df["user_id"]) #> CRASH!!! NO MEMORY :-/

In [None]:
%%time

from pandas import crosstab

df_onehot = crosstab(df["tag"], df["user_id"])
df_onehot

CPU times: user 28.2 s, sys: 2.15 s, total: 30.3 s
Wall time: 27.8 s


tag \ user_id,409,1153,1186,3301,4822,4936,5908,6504,7578,10345,...,1242241859038121984,1242262008466010112,1242268109962018818,1242277072526929920,1242278762583953408,1242299587911512064,1242304023799009280,1242479770996068359,1242485644229849090,1242493368955015169
#1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
#1A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
#2A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
#AMERICAFIRST,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
#BACKTHEBLUE,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
#BERNIE2020,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
#BLACKLIVESMATTER,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
#BLM,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
#BLUEWAVE,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
#BLUEWAVE2020,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
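
To make the shape of the `crosstab` result concrete, here is a toy example. The user ids and tags are made up for illustration and do not come from the dataset.

```python
from pandas import DataFrame, crosstab

# toy (user_id, tag) pairs, made up for illustration
toy_df = DataFrame({
    "user_id": [1, 1, 2, 3],
    "tag": ["#A", "#B", "#B", "#A"],
})

# rows are tags, columns are users; each cell counts how often the pair occurs
# (0 or 1 here, because every pair appears at most once)
toy_onehot = crosstab(toy_df["tag"], toy_df["user_id"])
print(toy_onehot)
```

Because the tags are the rows, the reducers applied to this matrix produce one point per tag, not one point per user.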


In [None]:
print(df_onehot.columns.tolist()[0:5])
print(df_onehot.index[0:5])

[409, 1153, 1186, 3301, 4822]
Index(['#1', '#1A', '#2A', '#AMERICAFIRST', '#BACKTHEBLUE'], dtype='object', name='tag')


In [None]:
df_onehot.max()

user_id
409                    1
1153                   1
1186                   1
3301                   1
4822                   1
                      ..
1242299587911512064    1
1242304023799009280    1
1242479770996068359    1
1242485644229849090    1
1242493368955015169    1
Length: 183516, dtype: int64

In [None]:
csv_filepath = os.path.join(DIRPATH, "data", f"tags_users_onehot_{TOP_TAGS_LIMIT}.csv")
df_onehot.to_csv(csv_filepath)

## PCA Pipeline

In [None]:


def pca_pipeline(df_onehot=df_onehot, n_components=2):
    column_names = component_names(n_components)

    pipeline = Pipeline(steps=[
        #("one_hot", OneHotEncoder(use_cat_names=True, cols=column_names)),
        ("scaler", StandardScaler()),
        ("reducer", PCA(n_components=n_components, random_state=99))
    ])
    print(pipeline)

    embeddings = pipeline.fit_transform(df_onehot)
    print("EMBEDDINGS:", type(embeddings), embeddings.shape)
    pca = pipeline.named_steps["reducer"]
    print("EXPLAINED VARIANCE RATIO:", pca.explained_variance_ratio_)
    print("SINGULAR VALS:", pca.singular_values_)

    df_embed = DataFrame(embeddings, columns=column_names)
    df_embed["tag"] = df_onehot.index
    print(len(df_embed))
    print(df_embed.head())
    csv_filepath = os.path.join(DATA_DIRPATH, f"profile_tags_{TOP_TAGS_LIMIT}_pca_{n_components}.csv")
    df_embed.to_csv(csv_filepath)

    title = f"PCA Dimension Reduction of Top {TOP_TAGS_LIMIT} Tags in User Profiles"
    if n_components in [1,2]:
        chart_options = dict(x="component_a", text="tag",title=title)
        if n_components == 2:
            chart_options["y"] = "component_b"
        fig = px.scatter(df_embed, **chart_options)
    elif n_components == 3:
        chart_df = df_embed.copy()
        chart_df["color"] = chart_df["component_a"] * chart_df["component_b"] * chart_df["component_c"]

        # https://plotly.com/python-api-reference/generated/plotly.express.scatter_3d.html
        fig = px.scatter_3d(chart_df, title=title, text="tag", #size="user_count",
            x="component_a", y="component_b", z="component_c", 
            #labels={"component_a":"", "component_b":"", "component_c":""},
            #color="color", color_continuous_scale=px.colors.colorbrewer.RdBu_r
        )

    fig.show()
    image_filepath = os.path.join(FIGURES_DIRPATH, f"profile_tags_{TOP_TAGS_LIMIT}_pca_{n_components}.png")
    fig.write_image(image_filepath)
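
Note that `df_onehot=df_onehot` in the signature is evaluated once, when the `def` statement runs. Rebinding the global `df_onehot` later (as the clustering section does) will not change what the function uses by default. A small standalone sketch of that Python behaviour, using hypothetical names:

```python
data = ["top 50 tags"]

def describe(rows=data):  # the default is bound to the object `data` refers to right now
    return rows[0]

data = ["top 250 tags"]  # rebinding the global name does not touch the default

print(describe())      # still uses the original list
print(describe(data))  # pass the new object explicitly to use it
```

Passing the matrix explicitly on each call avoids relying on whichever object happened to be bound when the cell was first run.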


In [None]:
#pca_pipeline(n_components=1)

In [None]:
pca_pipeline(n_components=2)

Pipeline(steps=[('scaler', StandardScaler()),
                ('reducer', PCA(n_components=2, random_state=99))])
EMBEDDINGS: <class 'numpy.ndarray'> (50, 2)
EXPLAINED VARIANCE RATIO: [0.19793495 0.14108989]
SINGULAR VALS: [1347.66891397 1137.81045008]
50
   component_a  component_b            tag
0   -48.859166   -41.474028             #1
1   -19.207557   -36.745420            #1A
2    52.820406   -34.420723            #2A
3   -15.027216   -34.990186  #AMERICAFIRST
4   -29.868907   -35.936543   #BACKTHEBLUE


In [None]:
pca_pipeline(n_components=3)

Pipeline(steps=[('scaler', StandardScaler()),
                ('reducer', PCA(n_components=3, random_state=99))])
EMBEDDINGS: <class 'numpy.ndarray'> (50, 3)
EXPLAINED VARIANCE RATIO: [0.19793495 0.14108989 0.05429723]
SINGULAR VALS: [1347.66891397 1137.81045008  705.84736694]
50
   component_a  component_b  component_c            tag
0   -48.859196   -41.473923   -23.065098             #1
1   -19.207557   -36.745421     4.175816            #1A
2    52.820406   -34.420722    41.376223            #2A
3   -15.027215   -34.990189    -1.249942  #AMERICAFIRST
4   -29.868906   -35.936545    -6.041021   #BACKTHEBLUE
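
The explained variance ratios printed above can be accumulated to see how much of the total variance the leading components retain together. This sketch copies the three printed ratios rather than refitting PCA.

```python
from itertools import accumulate

# explained variance ratios printed by pca_pipeline(n_components=3) above
explained_variance_ratio = [0.19793495, 0.14108989, 0.05429723]

# running total: variance retained by the first 1, 2 and 3 components
cumulative = list(accumulate(explained_variance_ratio))
for n, total in enumerate(cumulative, start=1):
    print(f"first {n} component(s): {total:.1%} of variance")
```

With three components the running total is roughly 39%, so most of the variance in the scaled one-hot matrix is not visible in these plots.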


## T-SNE Pipeline

In [None]:


def tsne_pipeline(df_onehot=df_onehot, n_components=2):
    column_names = component_names(n_components)

    pipeline = Pipeline(steps=[
        #("one_hot", OneHotEncoder(use_cat_names=True, cols=column_names)),
        ("scaler", StandardScaler()),
        ("reducer", TSNE(n_components=n_components, random_state=99))
    ])
    print(pipeline)
    tsne = pipeline.named_steps["reducer"]

    embeddings = pipeline.fit_transform(df_onehot)
    print("EMBEDDINGS:", type(embeddings), embeddings.shape)
    
    print("K-L DIVERGENCE:", tsne.kl_divergence_)

    df_embed = DataFrame(embeddings, columns=column_names)
    df_embed["tag"] = df_onehot.index
    print(len(df_embed))
    print(df_embed.head())
    csv_filepath = os.path.join(DATA_DIRPATH, f"profile_tags_{TOP_TAGS_LIMIT}_tsne_{n_components}.csv")
    df_embed.to_csv(csv_filepath)
    
    chart_df = df_embed.copy()
    chart_df["tag"] = df_onehot.index
    #chart_df = chart_df.merge(tags_df, left_on="tag", right_on="tag")

    title = f"T-SNE Dimension Reduction of Top {TOP_TAGS_LIMIT} Tags in User Profiles"
    if n_components in [1,2]:
        chart_options = dict(x="component_a", text="tag",title=title)
        if n_components == 2:
            chart_options["y"] = "component_b"
    
        fig = px.scatter(chart_df, **chart_options)
    elif n_components == 3:
        chart_df["color"] = chart_df["component_a"] * chart_df["component_b"] * chart_df["component_c"]
    
        # https://plotly.com/python-api-reference/generated/plotly.express.scatter_3d.html
        fig = px.scatter_3d(chart_df, title=title, text="tag", #size="user_count",
            x="component_a", y="component_b", z="component_c", 
            #labels={"component_a":"", "component_b":"", "component_c":""},
            #color="color", color_continuous_scale=px.colors.colorbrewer.RdBu_r
        )

    #fig = plot_embeddings(df_embed=df_embed, n_components=n_components, title=title)
    fig.show()
    image_filepath = os.path.join(FIGURES_DIRPATH, f"profile_tags_{TOP_TAGS_LIMIT}_tsne_{n_components}.png")
    fig.write_image(image_filepath)



In [None]:
tsne_pipeline(n_components=1)

Pipeline(steps=[('scaler', StandardScaler()),
                ('reducer', TSNE(n_components=1, random_state=99))])



The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



EMBEDDINGS: <class 'numpy.ndarray'> (50, 1)
K-L DIVERGENCE: 6.940203666687012
50
   component_a            tag
0  2879.544922             #1
1 -3368.776611            #1A
2  3121.458984            #2A
3 -3718.508545  #AMERICAFIRST
4 -3291.204346   #BACKTHEBLUE


In [None]:
tsne_pipeline(n_components=2)

Pipeline(steps=[('scaler', StandardScaler()),
                ('reducer', TSNE(random_state=99))])



The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



EMBEDDINGS: <class 'numpy.ndarray'> (50, 2)
K-L DIVERGENCE: 0.7424313426017761
50
   component_a  component_b            tag
0   -54.680664  -270.438232             #1
1   -47.781769    40.930222            #1A
2   -26.017017   176.119370            #2A
3   130.821182   -58.517063  #AMERICAFIRST
4    51.133808   -35.137714   #BACKTHEBLUE


In [None]:
tsne_pipeline(n_components=3)

Pipeline(steps=[('scaler', StandardScaler()),
                ('reducer', TSNE(n_components=3, random_state=99))])



The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



EMBEDDINGS: <class 'numpy.ndarray'> (50, 3)
K-L DIVERGENCE: 1.1587451696395874
50
   component_a  component_b  component_c            tag
0  -176.686371   -89.843567   -39.558567             #1
1    -4.534178  -110.564598   -16.706930            #1A
2   176.916321   -30.907244   -69.342514            #2A
3   157.230850   133.446976   227.507812  #AMERICAFIRST
4    55.426788    50.318306   -79.567726   #BACKTHEBLUE


## UMAP

In [None]:

def umapper(df_onehot=df_onehot, n_components=2):

    reducer = UMAP(n_components=n_components, random_state=99)
    embedding = reducer.fit_transform(df_onehot)

    embed_df = DataFrame(embedding, columns=component_names(n_components))
    #embed_df.index = df_onehot.index
    embed_df["tag"] = df_onehot.index
    #embed_df.head()
    csv_filepath = os.path.join(DATA_DIRPATH, f"profile_tags_{TOP_TAGS_LIMIT}_umap_{n_components}.csv")
    embed_df.to_csv(csv_filepath)

    #
    # PLOTTING
    #

    chart_df = embed_df.copy()
    chart_df["tag"] = df_onehot.index
    chart_df = chart_df.merge(tags_df, left_on="tag", right_on="tag")

    title = f"UMAP Dimension Reduction of Top {TOP_TAGS_LIMIT} Tags in User Profiles"
    if n_components == 1:
        chart_df["color"] = chart_df["component_a"]
        
        fig = px.scatter(chart_df, title=title, text="tag", size="user_count",
            x="component_a",            
            labels={"component_a":""},
            #color="color", color_continuous_scale=px.colors.colorbrewer.RdBu
        )
    elif n_components == 2:
        chart_df["color"] = chart_df["component_a"] * chart_df["component_b"]
        if TOP_TAGS_LIMIT == 25:
            chart_df["color"] = chart_df["component_a"] #* chart_df["component_b"] * chart_df["component_c"]

        fig = px.scatter(chart_df, title=title, text="tag", size="user_count",
            x="component_a", y="component_b",        
            labels={"component_a":"", "component_b":""},
            color="color", color_continuous_scale=px.colors.colorbrewer.RdBu
        )
    elif n_components == 3:
        chart_df["color"] = chart_df["component_a"] * chart_df["component_b"] * chart_df["component_c"]
        scale = px.colors.colorbrewer.RdBu_r
        if TOP_TAGS_LIMIT == 25:
            chart_df["color"] = chart_df["component_a"] * chart_df["component_b"] # * chart_df["component_c"]
            scale = px.colors.colorbrewer.RdBu

        # https://plotly.com/python-api-reference/generated/plotly.express.scatter_3d.html
        fig = px.scatter_3d(chart_df, title=title, text="tag", #size="user_count",
            x="component_a", y="component_b", z="component_c", 
            labels={"component_a":"", "component_b":"", "component_c":""},
            color="color", color_continuous_scale=scale
        )
    
    fig.show()
    image_filepath = os.path.join(FIGURES_DIRPATH, f"profile_tags_{TOP_TAGS_LIMIT}_umap_{n_components}.png")
    fig.write_image(image_filepath)
    if n_components in [2,3]:
        html_filepath = os.path.join(FIGURES_DIRPATH, f"profile_tags_{TOP_TAGS_LIMIT}_umap_{n_components}.html")
        fig.write_html(html_filepath)



In [None]:
#umapper(n_components=1)

In [None]:
umapper(n_components=2)

In [None]:
umapper(n_components=3)

## UMAP Enhanced Clustering

In [None]:
def umap_embeddings(top_tags_limit=100, n_components=10) -> tuple:  # returns (embed_df, df_onehot)

    # DATA FETCHING

    print("FETCHING...")
    sql = f"""
        WITH top_tags as (
            SELECT tag, count(DISTINCT user_id) as user_count
            FROM `tweet-research-shared.impeachment_2020.profile_tags_v2_flat`
            GROUP BY tag
            ORDER BY user_count DESC
            LIMIT {int(top_tags_limit)}
        )

        SELECT DISTINCT pt.user_id, pt.tag
        FROM `tweet-research-shared.impeachment_2020.profile_tags_v2_flat` pt
        JOIN top_tags on top_tags.tag = pt.tag
        ORDER BY user_id
    """
    df = bq_service.query_to_df(sql, verbose=False)
    #print(df.head())

    # ONE HOT ENCODING
    
    print("ENCODING...")
    df_onehot = crosstab(df["tag"], df["user_id"])
    #print(df_onehot.head())
    
    csv_filepath = os.path.join(DATA_DIRPATH, f"tags_users_onehot_{top_tags_limit}.csv")
    df_onehot.to_csv(csv_filepath)

    # DIMENSIONALITY REDUCTION

    print("REDUCING...")
    reducer = UMAP(n_components=n_components, random_state=99)
    embedding = reducer.fit_transform(df_onehot)

    embed_df = DataFrame(embedding, columns=component_names(n_components))
    #embed_df.index = df_onehot.index
    embed_df["tag"] = df_onehot.index
    #embed_df.head()
    csv_filepath = os.path.join(DATA_DIRPATH, f"profile_tags_{top_tags_limit}_umap_{n_components}.csv")
    embed_df.to_csv(csv_filepath)

    return embed_df, df_onehot



In [None]:
%%time

CLUSTERING_TAGS_LIMIT = 250
CLUSTERING_N_COMPONENTS = 2
embed_df, df_onehot = umap_embeddings(top_tags_limit=CLUSTERING_TAGS_LIMIT, n_components=CLUSTERING_N_COMPONENTS)
embed_df.head()

FETCHING...
ENCODING...
REDUCING...
CPU times: user 2min 54s, sys: 9.79 s, total: 3min 3s
Wall time: 3min 20s


Unnamed: 0,component_a,component_b,tag
0,0.720125,3.79015,#1
1,2.189894,2.804938,#1A
2,0.800512,3.775858,#2
3,2.982606,4.472976,#2020
4,2.343032,2.778843,#2A


In [None]:
embed_df.index = embed_df["tag"]
embed_df.drop(columns=["tag"], inplace=True)
embed_df.head()

tag,component_a,component_b
#1,0.720125,3.79015
#1A,2.189894,2.804938
#2,0.800512,3.775858
#2020,2.982606,4.472976
#2A,2.343032,2.778843


### K Means

In [None]:
#from sklearn.cluster import KMeans
## n_init: Number of times the k-means algorithm will be run with different centroid seeds. 
## ... The final results will be the best output of n_init consecutive runs in terms of inertia.
#
#N_CLUSTERS = 2
#kmeans = KMeans(init="k-means++", n_clusters=N_CLUSTERS, n_init=1000, random_state=99, verbose=True) 
#x_kmeans = kmeans.fit_transform(df_onehot)

In [None]:
#print(kmeans.cluster_centers_)
#print(kmeans.labels_)
#print(kmeans.feature_names_in_)

### K Modes

In [None]:
#%%capture
#!pip install kmodes

In [None]:
## https://github.com/nicodv/kmodes
#from kmodes.kmodes import KModes
#
#kmodes = KModes(n_clusters=5, init='Huang', n_init=5, verbose=1, n_jobs=-1, random_state=99)
#
#kmodes_clusters = kmodes.fit_predict(embed_df)
#
#print("-------")
#print(kmodes.cost_)
#print("-------")
#print(kmodes.labels_)
#print("----------")
#print(kmodes.cluster_centroids_)

### DBSCAN

In [None]:
## https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
#from sklearn.cluster import DBSCAN
#
## eps (float) default=0.5 ... The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
## min_samples (int) default=5 ... The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
## metric (str), or callable, default=’euclidean’ ... The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors for DBSCAN.
## metric_params
## algorithm {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’ The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.
## leaf_size (int), default=30 ... Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
## n_jobs (int) default=None ... The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details
#
#dbscan = DBSCAN(eps=0.3, min_samples=5)
#
#dbscan.fit(embed_df)
#
#print(dbscan.labels_)
##print(dbscan.components_)
##print(dbscan.core_sample_indices_)

### HDBSCAN

In [None]:
# %%capture
# !pip install hdbscan

In [None]:
# https://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html
# from hdbscan import HDBSCAN

# TODO: tune metric:
#{'braycurtis': hdbscan.dist_metrics.BrayCurtisDistance,
# 'canberra': hdbscan.dist_metrics.CanberraDistance,
# 'chebyshev': hdbscan.dist_metrics.ChebyshevDistance,
# 'cityblock': hdbscan.dist_metrics.ManhattanDistance,
# 'dice': hdbscan.dist_metrics.DiceDistance,
# 'euclidean': hdbscan.dist_metrics.EuclideanDistance,
# 'hamming': hdbscan.dist_metrics.HammingDistance,
# 'haversine': hdbscan.dist_metrics.HaversineDistance,
# 'infinity': hdbscan.dist_metrics.ChebyshevDistance,
# 'jaccard': hdbscan.dist_metrics.JaccardDistance,
# 'kulsinski': hdbscan.dist_metrics.KulsinskiDistance,
# 'l1': hdbscan.dist_metrics.ManhattanDistance,
# 'l2': hdbscan.dist_metrics.EuclideanDistance,
# 'mahalanobis': hdbscan.dist_metrics.MahalanobisDistance,
# 'manhattan': hdbscan.dist_metrics.ManhattanDistance,
# 'matching': hdbscan.dist_metrics.MatchingDistance,
# 'minkowski': hdbscan.dist_metrics.MinkowskiDistance,
# 'p': hdbscan.dist_metrics.MinkowskiDistance,
# 'pyfunc': hdbscan.dist_metrics.PyFuncDistance,
# 'rogerstanimoto': hdbscan.dist_metrics.RogersTanimotoDistance,
# 'russellrao': hdbscan.dist_metrics.RussellRaoDistance,
# 'seuclidean': hdbscan.dist_metrics.SEuclideanDistance,
# 'sokalmichener': hdbscan.dist_metrics.SokalMichenerDistance,
# 'sokalsneath': hdbscan.dist_metrics.SokalSneathDistance,
# 'wminkowski': hdbscan.dist_metrics.WMinkowskiDistance}

# todo tune other params
#hdbscan_params = dict(metric="euclidean",
#                    min_cluster_size=5, 
#                    max_cluster_size=0,
#                    min_samples=None,
#                    allow_single_cluster=False,
#                    leaf_size=40,
#                    # alpha=1.0, p=None, algorithm='best', 
#                    # , cluster_selection_method='eom', cluster_selection_epsilon=0.0, 
#                    #memory=Memory(None, verbose=0), 
#                    #approx_min_span_tree=True, gen_min_span_tree=False, 
#                    #core_dist_n_jobs=4, 
#                    # prediction_data=False, 
#                    # match_reference_implementation=False, **kwargs
#                )

hdbscan = HDBSCAN()

hdbscan.fit(embed_df)

print(hdbscan.labels_)

[-1  2 -1  2  2  2 -1  2  2  2  2  2  2  2 -1  2  2  2  2  2  2  2  2  2
  2  2  1  2  2 -1 -1  2  2  2 -1  2  2  2  2  2  2  2  2  2  2  2  2  2
  2  2 -1  2  2  2  2  2  2  2  2 -1  2  2  2 -1  2  0  2  2  2  2  2  2
  2  2  2  2  2  2  2  2  2  2 -1  2  2  2  2  2  2  2  2  2  2  2  2  2
  2  2  2  0  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1 -1  2  2  2
  2  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  2  2  2  2  2
  2  2  2  2  2  2  2  2  2  2  2  2  0  0  0  2  2  2  2  2  2  2  2  2
  0  2  2  2  2  2  2  2  2  2  2  2  2  2  0  2  2  2  2  2  2  2  2  2
  0  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  2  2  2  2  2  0  2  2  2  2]
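
In HDBSCAN's output, the label `-1` marks points treated as noise rather than as members of a cluster. A quick way to summarize a label array is to count each value. The sketch below uses only the first 24 labels printed above so that it stays self-contained.

```python
from collections import Counter

# first row of the labels printed above (24 of the 250 values)
sample_labels = [-1, 2, -1, 2, 2, 2, -1, 2, 2, 2, 2, 2,
                 2, 2, -1, 2, 2, 2, 2, 2, 2, 2, 2, 2]

counts = Counter(sample_labels)
noise_share = counts[-1] / len(sample_labels)

print(counts)
print(f"noise share in this sample: {noise_share:.1%}")
```

The same counting works on the full array, for example via `Counter(hdbscan.labels_.tolist())`.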


In [None]:
hdbscan.probabilities_[0:25]

array([0.        , 1.        , 0.        , 1.        , 1.        ,
       1.        , 0.        , 0.96149033, 1.        , 1.        ,
       1.        , 0.9949045 , 1.        , 1.        , 0.        ,
       1.        , 0.91667129, 0.99445681, 1.        , 1.        ,
       0.90577835, 0.93974542, 0.93222151, 0.80658614, 1.        ])

In [None]:
hdbscan_df = embed_df.copy()
hdbscan_df["hdbscan_label"] = hdbscan.labels_
hdbscan_df["hdbscan_probability"] = hdbscan.probabilities_
hdbscan_df.sort_values(by=["hdbscan_label"], inplace=True)
hdbscan_df.head()

csv_filepath = os.path.join(DATA_DIRPATH, f"tags_users_onehot_{CLUSTERING_TAGS_LIMIT}_umap_{CLUSTERING_N_COMPONENTS}_cluster_hdbscan.csv")
hdbscan_df.to_csv(csv_filepath)

#### Cluster Examination

In [None]:
hdbscan_df["hdbscan_label"].unique()

array([-1,  0,  1,  2])

In [None]:
#hdbscan_df[hdbscan_df["hdbscan_label"] == -2]

In [None]:
hdbscan_df[hdbscan_df["hdbscan_label"] == -1]

tag,component_a,component_b,hdbscan_label,hdbscan_probability
#1,0.720125,3.79015,-1,0.0
#BLOCKCHAIN,0.376044,5.988869,-1,0.0
#AMERICAN,1.139848,3.37446,-1,0.0
#CLIMATECHANGE,1.267409,6.675369,-1,0.0
#COWBOYSNATION,1.933878,6.465194,-1,0.0
#CULT45,3.423666,3.647456,-1,0.0
#BLACKLIVESMATTER,3.792548,6.334107,-1,0.0
#BITCOIN,0.353679,5.994987,-1,0.0
#ABOLISHICE,3.634205,6.403845,-1,0.0
#FEELTHEBERN,3.784604,5.109557,-1,0.0


In [None]:
hdbscan_df[hdbscan_df["hdbscan_label"] == 0]

tag,component_a,component_b,hdbscan_label,hdbscan_probability
#QARMY,3.461772,3.162215,0,1.0
#Q,3.272686,2.973814,0,0.985185
#GREATAWAKENING,3.525285,2.839632,0,0.919391
#THEGREATAWAKENING,3.577361,2.957337,0,1.0
#TRUSTTHEPLAN,3.376201,3.193832,0,1.0
#QANON,3.63281,2.794166,0,0.653429
#SAVETHECHILDREN,3.339809,3.221749,0,0.854868
#DARKTOLIGHT,3.538648,2.987036,0,1.0
#WWG1WGA,3.506495,3.005488,0,1.0


In [None]:
hdbscan_df[hdbscan_df["hdbscan_label"] == 1]

tag,component_a,component_b,hdbscan_label,hdbscan_probability
#MEDICAREFORALL,4.267169,5.063769,1,0.841627
#NOTMEUS,3.995468,5.377748,1,1.0
#BERNIE2020,4.223105,5.284012,1,1.0
#M4A,4.155612,5.186047,1,1.0
#MEDICARE4ALL,4.145595,5.281722,1,1.0
#GREENNEWDEAL,4.274827,5.1678,1,1.0


In [None]:
hdbscan_df[hdbscan_df["hdbscan_label"] == 2]

tag,component_a,component_b,hdbscan_label,hdbscan_probability
#YANG2020,1.367247,4.241541,2,1.00000
#NOTMYPRESIDENT,2.306097,5.799427,2,1.00000
#NOWARWITHIRAN,2.919118,4.908142,2,1.00000
#NRA,2.206555,2.699216,2,1.00000
#PATRIOT,1.956350,2.634614,2,1.00000
...,...,...,...,...
#GOD,0.907211,4.786075,2,0.96149
#GODBLESSAMERICA,3.621497,4.471911,2,1.00000
#GOP,3.174718,4.620247,2,1.00000
#FOLLOWBACK,1.640142,5.641024,2,1.00000


#### Cluster Evaluation

In [None]:
hdbscan_df.shape

(250, 4)

In [None]:
embed_df.shape

(250, 2)

In [None]:
df_onehot.shape

(250, 236988)

In [None]:
X = df_onehot
labels = hdbscan.labels_

sh_score = metrics.silhouette_score(X, labels, metric='euclidean')
print(sh_score) #> 0.0567

ch_score = metrics.calinski_harabasz_score(X, labels)
print(ch_score) #> 2.5839

db_score = metrics.davies_bouldin_score(X, labels)
print(db_score) #> 3.5719

0.05678184829264286
2.583912092416509
3.5719186821651365
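
scikit-learn's clustering metrics treat every distinct label as a cluster, so the HDBSCAN noise label `-1` is scored as if it were a fourth cluster in the numbers above. One option, shown here as a sketch on made-up data rather than as this notebook's method, is to drop the noise points before scoring. `drop_noise` is a hypothetical helper.

```python
import numpy as np

def drop_noise(X, labels, noise_label=-1):
    """Return only the rows of X (and their labels) that were assigned to a cluster."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    mask = labels != noise_label  # True for points that belong to some cluster
    return X[mask], labels[mask]

# made-up example: 5 points, two of them labelled as noise
toy_X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]])
toy_labels = np.array([0, 0, -1, 1, -1])

X_clean, labels_clean = drop_noise(toy_X, toy_labels)
print(X_clean.shape, labels_clean)
```

The filtered arrays could then be passed to `metrics.silhouette_score(X_clean, labels_clean)`. Scores computed this way are not directly comparable with the ones printed above, since they describe a different set of points.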


### Spectral Clustering

In [None]:
## https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering
#
#from sklearn.cluster import SpectralClustering
#
## TODO: tune params
#SPECTRAL_N_CLUSTERS = 5
#spectral = SpectralClustering(n_clusters=SPECTRAL_N_CLUSTERS, random_state=99)
#
#spectral.fit(embed_df)
#
#print(spectral.labels_)

In [None]:
#spectral_df = embed_df.copy()
#
#spectral_df["spectral_label"] = spectral.labels_
#spectral_df.sort_values(by=["spectral_label"], inplace=True)
#spectral_df.head()
#csv_filepath = os.path.join(DIRPATH, "data", f"tags_users_onehot_{CLUSTERING_TAGS_LIMIT}_umap_{CLUSTERING_N_COMPONENTS}_cluster_spectral_{SPECTRAL_N_CLUSTERS}.csv")
#spectral_df.to_csv(csv_filepath)

In [None]:
#spectral_df[spectral_df["spectral_label"] == 0]

In [None]:
#spectral_df[spectral_df["spectral_label"] == 1]

In [None]:
#spectral_df[spectral_df["spectral_label"] == 2]

In [None]:
#spectral_df[spectral_df["spectral_label"] == 3]

In [None]:
#spectral_df[spectral_df["spectral_label"] == 4]

## Clustering Evaluation Pipeline

In [None]:
assert embed_df.index.tolist() == df_onehot.index.tolist()

True

In [None]:
# https://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html
# from hdbscan import HDBSCAN

# FOR TUNING, JUST START WITH METRIC FOR NOW

# https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
# we don't have true labels, so...
# https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html
from sklearn import metrics

def cluster_eval(X_df=df_onehot, embed_df=embed_df, metric="euclidean"): 
    print("-------------")
    print("METRIC:", metric)

    hdbscan = HDBSCAN(metric=metric)

    hdbscan.fit(embed_df)
    #print(hdbscan.labels_[0:25])
    #print(hdbscan.probabilities_[0:25])

    # EXPORT RESULTS TO CSV...
    labels_df = embed_df.copy()
    labels_df["hdbscan_label"] = hdbscan.labels_
    labels_df["hdbscan_probability"] = hdbscan.probabilities_
    labels_df.sort_values(by=["hdbscan_label"], inplace=True)
    #print(labels_df.head())
    print(labels_df["hdbscan_label"].value_counts())
    #csv_filepath = os.path.join(DATA_DIRPATH, f"tags_users_onehot_{CLUSTERING_TAGS_LIMIT}_umap_{CLUSTERING_N_COMPONENTS}_cluster_hdbscan_{metric}.csv")
    #labels_df.to_csv(csv_filepath)

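    # EVALUATION
    # NOTE: scores are computed on the original one-hot matrix (X_df), using
    # euclidean distance, regardless of which metric HDBSCAN clustered with.
    # Noise points (label -1) are scored as if they formed their own cluster.

    labels = hdbscan.labels_

    sh_score = metrics.silhouette_score(X_df, labels, metric='euclidean')
    print("SH SCORE:", sh_score)

    ch_score = metrics.calinski_harabasz_score(X_df, labels)
    print("CH SCORE:", ch_score)

    db_score = metrics.davies_bouldin_score(X_df, labels)
    print("DB SCORE:", db_score)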

    # REPORT ON THE SCORES:
    return {
        "metric": metric,
        "sh_score": sh_score,
        "ch_score": ch_score,
        "db_score": db_score
    }

In [None]:
DISTINCT_METRICS = [
    'braycurtis',      # hdbscan.dist_metrics.BrayCurtisDistance
    'canberra',        # hdbscan.dist_metrics.CanberraDistance
    'chebyshev',       # hdbscan.dist_metrics.ChebyshevDistance
    'dice',            # hdbscan.dist_metrics.DiceDistance
    'euclidean',       # hdbscan.dist_metrics.EuclideanDistance
    'hamming',         # hdbscan.dist_metrics.HammingDistance
    'haversine',       # hdbscan.dist_metrics.HaversineDistance
    'jaccard',         # hdbscan.dist_metrics.JaccardDistance
    'kulsinski',       # hdbscan.dist_metrics.KulsinskiDistance
    'mahalanobis',     # hdbscan.dist_metrics.MahalanobisDistance
    'manhattan',       # hdbscan.dist_metrics.ManhattanDistance
    'matching',        # hdbscan.dist_metrics.MatchingDistance
    'minkowski',       # hdbscan.dist_metrics.MinkowskiDistance
    'pyfunc',          # hdbscan.dist_metrics.PyFuncDistance
    'rogerstanimoto',  # hdbscan.dist_metrics.RogersTanimotoDistance
    'russellrao',      # hdbscan.dist_metrics.RussellRaoDistance
    'seuclidean',      # hdbscan.dist_metrics.SEuclideanDistance
    'sokalmichener',   # hdbscan.dist_metrics.SokalMichenerDistance
    'sokalsneath',     # hdbscan.dist_metrics.SokalSneathDistance
    'wminkowski',      # hdbscan.dist_metrics.WMinkowskiDistance
]


In [None]:

eval_results = []
for metric in DISTINCT_METRICS:

    try:
        result = cluster_eval(metric=metric)
        eval_results.append(result)
    except Exception as err:
        #print("OOPS, ERROR...", err)
        result = {"metric": metric, "err": err}
        eval_results.append(result)

eval_results_df = DataFrame(eval_results)
eval_results_df

-------------
METRIC: braycurtis
 2    180
-1     49
 1     13
 0      8
Name: hdbscan_label, dtype: int64
SH SCORE: 0.0782489653133154
CH SCORE: 3.534140546794715
DB SCORE: 5.027867411858536
-------------
METRIC: canberra
 6    117
-1     74
 1     18
 3     11
 2      9
 0      7
 4      7
 5      7
Name: hdbscan_label, dtype: int64
SH SCORE: -0.16850751275941078
CH SCORE: 1.827499369041369
DB SCORE: 5.327690786535379
-------------
METRIC: chebyshev
 5    89
-1    87
 4    19
 7    18
 6    11
 1     9
 0     6
 2     6
 3     5
Name: hdbscan_label, dtype: int64
SH SCORE: -0.13645174825920084
CH SCORE: 1.7203249791719313
DB SCORE: 5.472020602819412
-------------
METRIC: dice
-1    250
Name: hdbscan_label, dtype: int64
-------------
METRIC: euclidean
 2    223
-1     12
 0      9
 1      6
Name: hdbscan_label, dtype: int64
SH SCORE: 0.05678184829264286
CH SCORE: 2.583912092416509
DB SCORE: 3.5719186821651365
-------------
METRIC: hamming
-1    250
Name: hdbscan_label, dtype: int64
---

Unnamed: 0,metric,sh_score,ch_score,db_score,err
0,braycurtis,0.078249,3.534141,5.027867,
1,canberra,-0.168508,1.827499,5.327691,
2,chebyshev,-0.136452,1.720325,5.472021,
3,dice,,,,Number of labels is 1. Valid values are 2 to n...
4,euclidean,0.056782,2.583912,3.571919,
5,hamming,,,,Number of labels is 1. Valid values are 2 to n...
6,haversine,-0.001884,1.85885,3.582341,
7,jaccard,,,,Number of labels is 1. Valid values are 2 to n...
8,kulsinski,,,,Number of labels is 1. Valid values are 2 to n...
9,mahalanobis,,,,Must provide either V or VI for Mahalanobis di...
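
The `Number of labels is 1` errors in the table line up with the runs whose printed output shows all 250 tags labeled -1 (`dice` and `hamming` above). The scikit-learn scores need at least 2 distinct labels and at most n_samples - 1. Below is a small sketch of a pre-check that could be run before scoring. The function name `can_score` is hypothetical, and the label lists are rebuilt from the value counts printed above.

```python
def can_score(labels):
    """Return True when the label set is valid for the sklearn internal metrics."""
    labels = list(labels)
    n_labels = len(set(labels))
    n_samples = len(labels)
    # sklearn requires 2 <= n_labels <= n_samples - 1
    return 2 <= n_labels <= n_samples - 1

# label counts taken from the printed output for two of the runs
euclidean_labels = [2] * 223 + [-1] * 12 + [0] * 9 + [1] * 6
dice_labels = [-1] * 250  # every tag treated as noise

print(can_score(euclidean_labels))
print(can_score(dice_labels))
```

The `mahalanobis` failure is a different case: the message says the metric needs an extra `V` or `VI` argument, so this check would not catch it.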


#### Interpretations

Silhouette:

  + The best value is 1.
  + The worst value is -1.
  + Values near 0 indicate overlapping clusters.
  + Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.


Drawbacks:

The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.
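
To make the bounds above concrete, here is a minimal NumPy sketch of the per-sample silhouette value, s = (b - a) / max(a, b), where a is the mean distance from a sample to the other members of its own cluster and b is its mean distance to the nearest other cluster. The function name and toy data are illustrative assumptions, and the sketch assumes every cluster has at least two members.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette values using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # full matrix of pairwise Euclidean distances
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself from its own cluster
        a = dists[i, same].mean()
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# toy data (hypothetical): two well-separated groups of three points
toy_X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
good = silhouette_values(toy_X, [0, 0, 0, 1, 1, 1])
bad = silhouette_values(toy_X, [0, 0, 1, 1, 1, 0])  # two points swapped

print(good.mean())  # close to 1: compact, well-separated clusters
print(bad[2])       # negative: this point sits closer to the other cluster
```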

In [None]:
eval_results_df[["metric", "sh_score"]].dropna().sort_values(by=["sh_score"], ascending=False)

Unnamed: 0,metric,sh_score
0,braycurtis,0.078249
4,euclidean,0.056782
10,manhattan,0.023579
6,haversine,-0.001884
2,chebyshev,-0.136452
1,canberra,-0.168508


CH SCORE

The Calinski-Harabasz index, also known as the Variance Ratio Criterion, can be used to evaluate the model: a higher score relates to a model with better defined clusters.


Drawbacks:
The Calinski-Harabasz index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.
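
As a rough illustration of the "variance ratio" idea, the sketch below computes the index as between-cluster dispersion divided by within-cluster dispersion, each scaled by its degrees of freedom (k - 1 and n - k). The function name and toy data are assumptions made for this example, not code from the notebook.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Ratio of between-cluster to within-cluster dispersion."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_samples = len(X)
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for cid in cluster_ids:
        members = X[labels == cid]
        centroid = members.mean(axis=0)
        # dispersion of the centroid around the global mean, weighted by size
        between += len(members) * ((centroid - overall_mean) ** 2).sum()
        # dispersion of the members around their own centroid
        within += ((members - centroid) ** 2).sum()
    return (between / (k - 1)) / (within / (n_samples - k))

# toy data (hypothetical): two well-separated groups of three points
toy_X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
ch_good = calinski_harabasz(toy_X, [0, 0, 0, 1, 1, 1])
ch_bad = calinski_harabasz(toy_X, [0, 1, 0, 1, 0, 1])  # labels ignore the geometry

print(ch_good, ch_bad)  # the sensible labeling scores much higher
```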

In [None]:
eval_results_df[["metric", "ch_score"]].dropna().sort_values(by=["ch_score"], ascending=False)

Unnamed: 0,metric,ch_score
0,braycurtis,3.534141
4,euclidean,2.583912
10,manhattan,1.861086
6,haversine,1.85885
1,canberra,1.827499
2,chebyshev,1.720325


DB SCORE

The minimum score is zero, with lower values indicating better clustering.

Drawbacks:

The Davies-Bouldin index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained from DBSCAN.
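
For intuition about why lower is better, here is a minimal sketch of the index: for each cluster, take the worst ratio of combined within-cluster scatter to centroid distance against any other cluster, then average those worst cases. The function name and toy data are hypothetical.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    centroids = np.array([X[labels == cid].mean(axis=0) for cid in cluster_ids])
    # scatter: mean distance from each member to its own centroid
    scatter = np.array([
        np.linalg.norm(X[labels == cid] - centroids[idx], axis=1).mean()
        for idx, cid in enumerate(cluster_ids)
    ])
    k = len(cluster_ids)
    worst_ratios = []
    for i in range(k):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        worst_ratios.append(max(ratios))
    return float(np.mean(worst_ratios))

# toy data (hypothetical): two well-separated groups of three points
toy_X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
db_good = davies_bouldin(toy_X, [0, 0, 0, 1, 1, 1])
db_bad = davies_bouldin(toy_X, [0, 1, 0, 1, 0, 1])  # labels ignore the geometry

print(db_good, db_bad)  # the sensible labeling scores much lower
```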

In [None]:
eval_results_df[["metric", "db_score"]].dropna().sort_values(by=["db_score"], ascending=True)

Unnamed: 0,metric,db_score
4,euclidean,3.571919
6,haversine,3.582341
10,manhattan,4.93492
0,braycurtis,5.027867
1,canberra,5.327691
2,chebyshev,5.472021
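
The three rankings do not fully agree, so one possible way to summarize them is to average each metric's rank across the three scores. This is only a sketch of one option, not a method used elsewhere in this notebook. The values are copied from the three tables above, and `rank_by` is a hypothetical helper.

```python
# scores copied from the three result tables above
scores = {
    "braycurtis": {"sh": 0.078249, "ch": 3.534141, "db": 5.027867},
    "euclidean":  {"sh": 0.056782, "ch": 2.583912, "db": 3.571919},
    "manhattan":  {"sh": 0.023579, "ch": 1.861086, "db": 4.93492},
    "haversine":  {"sh": -0.001884, "ch": 1.85885, "db": 3.582341},
    "canberra":   {"sh": -0.168508, "ch": 1.827499, "db": 5.327691},
    "chebyshev":  {"sh": -0.136452, "ch": 1.720325, "db": 5.472021},
}

def rank_by(key, higher_is_better):
    """Map each distance metric to its rank (1 = best) for one score."""
    ordered = sorted(scores, key=lambda m: scores[m][key], reverse=higher_is_better)
    return {metric: position for position, metric in enumerate(ordered, start=1)}

# silhouette and CH: higher is better; DB: lower is better
ranks = [rank_by("sh", True), rank_by("ch", True), rank_by("db", False)]
mean_rank = {m: sum(r[m] for r in ranks) / len(ranks) for m in scores}

for metric, rank in sorted(mean_rank.items(), key=lambda item: item[1]):
    print(metric, round(rank, 2))
```

Averaging ranks weights the three scores equally and ignores how large the gaps between them are, so it is best read as a rough tie-breaker.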
