## Softcosine Clusters: File Conversion (pkl2net)

**Purpose**: prepare the softcosine output for clustering by converting pickled pandas DataFrames to Pajek (`.net`) files
- Code based on https://github.com/damian0604/newsevents/blob/master/src/data-processing/030-pkl2net.py

In [1]:
import networkx as nx
from tqdm import tqdm
from glob import glob
import os
import pickle

In [2]:
pkl_dir = os.path.join("..", "..", "data", "02-intermediate", "06-newsevents", "01-softcosine-output")
models = [('softcosine', glob(f"{pkl_dir}/*.pkl"))]

net_dir = os.path.join("..", "..", "data", "02-intermediate", "06-newsevents", "02-softcosine-pkl2net")

#### Example of softcosine output

In [3]:
with open(models[0][1][0], mode='rb') as fi:
    df = pickle.load(fi)

In [4]:
df

Unnamed: 0,source,target,similarity,source_date,target_date,source_doctype,target_doctype
0,VDARE_800077991,VDARE_800077991,1.000000,2018-02-16T19:25:28+00:00,2018-02-16T19:25:28+00:00,vdare,vdare
385,DailyCaller_799691321,DailyCaller_799691321,1.000000,2018-02-16T13:10:25+00:00,2018-02-16T13:10:25+00:00,dailycaller,dailycaller
414,DailyCaller_799691321,OneAmericaNews_799543472,0.267355,2018-02-16T13:10:25+00:00,2018-02-16T09:41:08+00:00,dailycaller,oneamericanews
420,DailyCaller_799691321,InfoWars_800516011,0.437665,2018-02-16T13:10:25+00:00,2018-02-16T13:43:14+00:00,dailycaller,infowars
564,DailyCaller_799691321,Breitbart_800083844,0.311668,2018-02-16T13:10:25+00:00,2018-02-16T14:34:44+00:00,dailycaller,breitbart
...,...,...,...,...,...,...,...
296427,GatewayPundit_800017206,GatewayPundit_800961573,0.314828,2018-02-16T19:31:05+00:00,2018-02-18T04:50:48+00:00,gatewaypundit,gatewaypundit
296433,GatewayPundit_800017206,GatewayPundit_848321260,0.278300,2018-02-16T19:31:05+00:00,2018-02-18T03:00:10+00:00,gatewaypundit,gatewaypundit
296437,GatewayPundit_800017206,WashingtonExaminer_1003243768,0.276371,2018-02-16T19:31:05+00:00,2018-02-18T05:00:00+00:00,gatewaypundit,washingtonexaminer
296439,GatewayPundit_800017206,Breitbart_801468367,0.321000,2018-02-16T19:31:05+00:00,2018-02-18T16:18:01+00:00,gatewaypundit,breitbart
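
The rows above form an edge list: each row links a `source` document to a `target` document with a `similarity` score, and some rows pair a document with itself (similarity 1.0). As a minimal sketch of how such rows map onto a graph, the toy example below uses made-up IDs (`A_1`, `B_2`, `C_3` are hypothetical, not from the data) and shows that a self-pair becomes a self-loop in the resulting `networkx` graph.

```python
import networkx as nx
import pandas as pd

# Toy rows with hypothetical IDs, mirroring the source/target/similarity columns.
toy = pd.DataFrame({
    "source": ["A_1", "A_1", "B_2"],
    "target": ["A_1", "B_2", "C_3"],
    "similarity": [1.0, 0.27, 0.44],
})

# Each row becomes one undirected edge carrying its similarity score.
toy_graph = nx.from_pandas_edgelist(
    toy, source="source", target="target", edge_attr="similarity"
)

n_nodes = toy_graph.number_of_nodes()
n_edges = toy_graph.number_of_edges()
n_selfloops = nx.number_of_selfloops(toy_graph)  # the A_1 -> A_1 row
print(n_nodes, n_edges, n_selfloops)
```

Whether self-loops should be kept or dropped before clustering depends on the downstream tool; this sketch only makes their presence visible.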


#### Convert softcosine output to network format

In [5]:
# create the output directory once, before writing any files
os.makedirs(net_dir, exist_ok=True)

for modelname, filenames in models:
    print(f'Processing output of the {modelname} model...')
    for fn in tqdm(filenames):
        with open(fn, mode='rb') as fi:
            df = pickle.load(fi)
        # change similarity from float to str (necessary for pajek format)
        df['similarity'] = df['similarity'].apply(str)
        # rename the column to 'weight' to facilitate later analysis
        df = df.rename(columns={'similarity': 'weight'})
        # build the graph: nodes and edge weights come from the dataframe
        G = nx.from_pandas_edgelist(df, source='source', target='target', edge_attr='weight')
        # write to pajek, swapping the .pkl extension for .net
        stem = os.path.splitext(os.path.basename(fn))[0]
        nx.write_pajek(G, os.path.join(net_dir, stem + '.net'))

Processing output of the softcosine model...


100%|██████████| 476/476 [00:31<00:00, 15.01it/s]
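
To sanity-check a converted file, it can help to read it back. The sketch below is an illustration rather than part of the original pipeline: it writes a tiny hypothetical graph (IDs `A_1`, `B_2`, `C_3` and the file name `toy.net` are made up) to a temporary directory and reloads it. Two things are worth knowing: `nx.read_pajek` returns a multigraph, and it parses edge weights back into floats even when they were written as strings.

```python
import os
import tempfile

import networkx as nx

# Hypothetical two-edge graph; weights stored as strings, as in the conversion loop.
small = nx.Graph()
small.add_edge("A_1", "B_2", weight="0.27")
small.add_edge("B_2", "C_3", weight="0.44")

with tempfile.TemporaryDirectory() as tmp:
    net_path = os.path.join(tmp, "toy.net")
    nx.write_pajek(small, net_path)
    loaded = nx.read_pajek(net_path)  # a MultiGraph for undirected input

# Collapse to a simple graph for easier edge lookups.
loaded_simple = nx.Graph(loaded)
roundtrip_weight = loaded_simple["A_1"]["B_2"]["weight"]
print(type(loaded).__name__, sorted(loaded_simple.nodes), roundtrip_weight)
```

The same check could be pointed at one of the files in `net_dir` by replacing `net_path` with that file's path.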
