## Softcosine Clusters: File Conversion (pkl2net)

**Purpose**: Prepare for clustering by converting the pickled pandas DataFrames produced by the softcosine step into Pajek (`.net`) files.
- Code based on https://github.com/damian0604/newsevents/blob/master/src/data-processing/030-pkl2net.py

In [1]:
import networkx as nx
from tqdm import tqdm
from glob import glob
import os
import pickle

In [2]:
pkl_dir = os.path.join("..", "..", "data", "02-intermediate", "07-newsevents", "01-softcosine-output")
models = [('softcosine', glob(f"{pkl_dir}/**/*.pkl"))]

net_dir = os.path.join("..", "..", "data", "02-intermediate", "07-newsevents", "02-softcosine-pkl2net")
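
A note on the glob pattern used above: without `recursive=True`, Python's `glob` treats `**` like a single `*`, so the pattern only picks up `.pkl` files that sit exactly one directory below `pkl_dir`. That appears to match the layout used here. The sketch below illustrates the difference on a throwaway directory tree; the file and folder names are hypothetical and not part of the project data.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one .pkl file at each of three depths.
    for rel in ("top.pkl",
                os.path.join("a", "mid.pkl"),
                os.path.join("a", "b", "deep.pkl")):
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "wb").close()

    pattern = os.path.join(root, "**", "*.pkl")
    # Default behaviour: '**' acts like '*', i.e. exactly one directory level.
    one_level = sorted(os.path.basename(p) for p in glob(pattern))
    # With recursive=True, '**' matches zero or more directory levels.
    all_levels = sorted(os.path.basename(p) for p in glob(pattern, recursive=True))

print(one_level)   # ['mid.pkl']
print(all_levels)  # ['deep.pkl', 'mid.pkl', 'top.pkl']
```

If the pickles were ever nested more deeply, or stored directly in `pkl_dir`, passing `recursive=True` would be one way to still find them.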

#### Example of softcosine output

In [3]:
with open(models[0][1][0], mode='rb') as fi:
    df = pickle.load(fi)

In [4]:
df

Unnamed: 0,source,target,similarity,source_date,target_date,source_doctype,target_doctype
0,Breitbart_513186861,Breitbart_513186861,1.000000,2016-09-13T15:40:18+00:00,2016-09-13T15:40:18+00:00,breitbart,breitbart
59,Breitbart_513186861,Breitbart_545230952,0.230769,2016-09-13T15:40:18+00:00,2016-09-13T01:48:11+00:00,breitbart,breitbart
74,Breitbart_513186861,Newsmax_713072384,0.327481,2016-09-13T15:40:18+00:00,2016-09-13T00:54:23+00:00,breitbart,newsmax
75,Breitbart_513186861,Breitbart_512743064,0.233033,2016-09-13T15:40:18+00:00,2016-09-13T01:07:34+00:00,breitbart,breitbart
164,Breitbart_513186861,FoxNews_553935231,0.333716,2016-09-13T15:40:18+00:00,2016-09-13T12:00:00+00:00,breitbart,foxnews
...,...,...,...,...,...,...,...
426682,RushLimbaugh_513263320,VDARE_513726520,0.230031,2016-09-13T15:14:04+00:00,2016-09-15T05:14:24+00:00,rushlimbaugh,vdare
427266,WashingtonExaminer_513236237,Newsmax_713110752,0.201143,2016-09-13T00:01:00+00:00,2016-09-15T20:19:44+00:00,washingtonexaminer,newsmax
427268,WashingtonExaminer_513236237,OneAmericaNews_513862797,0.200890,2016-09-13T00:01:00+00:00,2016-09-15T12:35:15+00:00,washingtonexaminer,oneamericanews
427288,WashingtonExaminer_513236237,WashingtonExaminer_1214212377,0.212561,2016-09-13T00:01:00+00:00,2016-09-15T04:00:00+00:00,washingtonexaminer,washingtonexaminer


#### Convert softcosine output to network format

In [5]:
for modelname, filenames in models:
    print(f'Processing output of the {modelname} model...')
    # create the output directory once, before writing any files
    os.makedirs(net_dir, exist_ok=True)
    for fn in tqdm(filenames):
        with open(fn, mode='rb') as fi:
            df = pickle.load(fi)
        # change similarity scores from float to str (necessary for pajek format)
        df['similarity'] = df['similarity'].apply(str)
        # change column name to 'weight' to facilitate later analysis
        df = df.rename(columns={'similarity': 'weight'})
        # build the graph: nodes and edge weights come from the dataframe
        G = nx.from_pandas_edgelist(df, source='source', target='target', edge_attr='weight')
        # write to pajek, replacing the '.pkl' extension with '.net'
        net_name = os.path.splitext(os.path.basename(fn))[0] + '.net'
        nx.write_pajek(G, os.path.join(net_dir, net_name))

Processing output of the softcosine model...


100%|██████████| 1821/1821 [02:08<00:00, 14.22it/s]
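
#### Optional: round-trip check on a toy edge list

As a quick sanity check on the conversion, the following sketch writes a tiny, made-up edge list to Pajek and reads it back. The node names, weights, and temporary file name are hypothetical; only the column names (`source`, `target`, `weight`) follow the cell above. It is a minimal illustration of the round trip, not part of the original pipeline.

```python
import os
import tempfile

import networkx as nx
import pandas as pd

# Hypothetical miniature edge list; weights are strings, as in the conversion cell.
toy = pd.DataFrame({
    "source": ["doc_a", "doc_a", "doc_b"],
    "target": ["doc_b", "doc_c", "doc_c"],
    "weight": ["0.25", "0.5", "0.75"],
})

G = nx.from_pandas_edgelist(toy, source="source", target="target", edge_attr="weight")

with tempfile.TemporaryDirectory() as tmp:
    net_path = os.path.join(tmp, "toy.net")
    nx.write_pajek(G, net_path)
    # read_pajek returns a multigraph; collapse it to a simple undirected graph.
    H = nx.Graph(nx.read_pajek(net_path))

# Map each undirected edge to its weight as read back from the file.
roundtrip = {frozenset((u, v)): float(w) for u, v, w in H.edges(data="weight")}
print(roundtrip)
```

Comparing `roundtrip` against the original edge list is one way to confirm that node labels and weights survive the Pajek export.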
