# Table of contents
1. Introduction
2. Datasets and preprocessing
3. Visualisations
    - property 1 etc.
4. Reflection
5. Work distribution


# 1. Introduction
- Aims and goals
- Introduce a general audience to network science
- Explain the multiple perspectives

relationships between entities in a network

The project is designed so that a person with little prior knowledge of networks can understand the main properties of social networks.

# 2. Datasets and preprocessing
This project is based on three social network datasets that can be found on Kaggle (see the links further below). Each dataset represents a network as a list of edges. Some datasets also include a file with node attributes, usually a `.json` file. The last dataset differs from the first two: it is a collection that contains multiple social network datasets.

Before the datasets are read in, the networks to include in the project are selected manually. The selection is based on dataset size, to reduce the risk of very long processing times. The table below gives an overview of all networks, retrieved from the datasets, that are used in this project. For each network it lists what the nodes represent, what the links represent and, if applicable, which node attributes are available.

|Dataset|Node representation|Link representation|Node attributes if applicable|
|---|---|---|---|
|**NashvilleMeetupNetwork**|Member of a Meetup group|Shared group membership in 'weight' groups|n/a|
|**DeezerHR**|Deezer users from Croatia|Friendship between users|Genre preferences of each user|
|**DeezerHU**|Deezer users from Hungary|Friendship between users|Genre preferences of each user|
|**DeezerRO**|Deezer users from Romania|Friendship between users|Genre preferences of each user|
|**FacebookLargePage**|Official Facebook pages|Mutual likes between pages|Description of the purpose of the page|
|**FeatherDeezerSocial**|Deezer users from Europe|Friendship between users|Artists liked by the users|
|**FeatherLastfmSocial**| | | |
|**GithubSocial**| | | |
|**TwitchSocialNetworks**| | | |
|**TwitchSocialNetworksDE**| | | |
|**TwitchSocialNetworksENGB**| | | |
|**TwitchSocialNetworksES**| | | |
|**TwitchSocialNetworksFR**| | | |
|**TwitchSocialNetworksPTBR**| | | |
|**TwitchSocialNetworksRU**| | | |
|**SocEpinions1**| | | |
|**SocSignSlashdot081106**| | | |
|**SocSignSlashdot090216**| | | |
|**SocSignSlashdot090221**| | | |
|**SocSlashdot0811**| | | |
|**SocSlashdot0902**| | | |

Other basic info that is excluded:
- Time span of the network
- Where the data is retrieved from

## Importing the relevant Python libraries

In [9]:
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import time
import preprocessing

## Preprocessing non-csv datasets
Some network datasets store their edges in a `.txt` file, others in a `.csv` file. To read the edges with the pandas `read_csv` function, the `.txt` files first need to be converted to `.csv` files. This is done with a function from the `preprocessing.py` file, applied to the following files:
- soc-Epinions1.txt
- soc-sign-Slashdot081106.txt
- soc-sign-Slashdot090216.txt
- soc-sign-Slashdot090221.txt
- soc-Slashdot0811.txt
- soc-Slashdot0902.txt

In [10]:
preprocessing.transform_text_to_csv('datasets/soc-Epinions1.txt', 'datasets/soc-Epinions1.csv')
preprocessing.transform_text_to_csv('datasets/soc-sign-Slashdot081106.txt', 'datasets/soc-sign-Slashdot081106.csv')
preprocessing.transform_text_to_csv('datasets/soc-sign-Slashdot090216.txt', 'datasets/soc-sign-Slashdot090216.csv')
preprocessing.transform_text_to_csv('datasets/soc-sign-Slashdot090221.txt', 'datasets/soc-sign-Slashdot090221.csv')
preprocessing.transform_text_to_csv('datasets/soc-Slashdot0811.txt', 'datasets/soc-Slashdot0811.csv')
preprocessing.transform_text_to_csv('datasets/soc-Slashdot0902.txt', 'datasets/soc-Slashdot0902.csv')
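
The implementation of `transform_text_to_csv` lives in `preprocessing.py` and is not shown here. As a rough idea of what such a conversion could look like, the sketch below uses a hypothetical helper, `text_edges_to_csv`, and assumes a tab-separated edge list in which lines starting with `#` are comments and the last comment line holds the column names. These are assumptions about the file layout, not a description of the actual function.

```python
import csv
import io


def text_edges_to_csv(src, dst, delimiter="\t"):
    """Copy a delimited edge list from one file-like object to another as CSV.

    Assumption: lines starting with '#' are comments, and the last comment
    line seen holds the column names.
    """
    header = None
    rows = []
    for line in src:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        if line.startswith("#"):
            header = line.split(delimiter)  # remember the most recent comment line
            continue
        rows.append(line.split(delimiter))
    writer = csv.writer(dst, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)


# Small in-memory example instead of a real dataset file
sample = io.StringIO("# Directed graph\n# FromNodeId\tToNodeId\n0\t4\n0\t5\n")
out = io.StringIO()
text_edges_to_csv(sample, out)
print(out.getvalue())
```

Under these assumptions the `#` stays part of the first column name, which would explain column names such as `'# FromNodeId'` when the converted files are read with pandas.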

## Read in datasets with pandas
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [11]:
NashvilleMeetupNetwork_member_edges = pd.read_csv('datasets/NashvilleMeetupNetwork/member-edges.csv', index_col=0)
DeezerHR_edges = pd.read_csv('datasets/DeezerSocialNetworks/HR/HR_edges.csv')
DeezerHU_edges = pd.read_csv('datasets/DeezerSocialNetworks/HU/HU_edges.csv')
DeezerRO_edges = pd.read_csv('datasets/DeezerSocialNetworks/RO/RO_edges.csv')
FacebookLargePage_edges = pd.read_csv('datasets/facebook-large-page-page-network/musae_facebook_edges.csv')
FeatherDeezerSocial_edges = pd.read_csv('datasets/feather-deezer-social/deezer_europe_edges.csv')
FeatherLastfmSocial_edges = pd.read_csv('datasets/feather-lastfm-social/lastfm_asia_edges.csv')
GithubSocial_edges = pd.read_csv("datasets/github-social/musae_git_edges.csv")
# TwitchGamers_edges = pd.read_csv("datasets/twitch_gamers/large_twitch_edges.csv")
TwitchSocialNetworksDE_edges = pd.read_csv("datasets/twitch-social-networks/DE/musae_DE_edges.csv")
TwitchSocialNetworksENGB_edges = pd.read_csv("datasets/twitch-social-networks/ENGB/musae_ENGB_edges.csv")
TwitchSocialNetworksES_edges = pd.read_csv("datasets/twitch-social-networks/ES/musae_ES_edges.csv")
TwitchSocialNetworksFR_edges = pd.read_csv("datasets/twitch-social-networks/FR/musae_FR_edges.csv")
TwitchSocialNetworksPTBR_edges = pd.read_csv("datasets/twitch-social-networks/PTBR/musae_PTBR_edges.csv")
TwitchSocialNetworksRU_edges = pd.read_csv("datasets/twitch-social-networks/RU/musae_RU_edges.csv")
SocEpinions1_edges = pd.read_csv('datasets/soc-Epinions1.csv')
SocSignSlashdot081106_edges = pd.read_csv('datasets/soc-sign-Slashdot081106.csv')
SocSignSlashdot090216_edges = pd.read_csv('datasets/soc-sign-Slashdot090216.csv')
SocSignSlashdot090221_edges = pd.read_csv('datasets/soc-sign-Slashdot090221.csv')
SocSlashdot0811_edges = pd.read_csv('datasets/soc-Slashdot0811.csv')
SocSlashdot0902_edges = pd.read_csv('datasets/soc-Slashdot0902.csv')

## Construct graphs with networkx
https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.from_pandas_edgelist.html

In [12]:
NashvilleMeetupNetwork = nx.from_pandas_edgelist(NashvilleMeetupNetwork_member_edges, 'member1', 'member2', edge_attr='weight', create_using=nx.Graph)
DeezerHR = nx.from_pandas_edgelist(DeezerHR_edges, 'node_1', 'node_2', create_using=nx.Graph)
DeezerHU = nx.from_pandas_edgelist(DeezerHU_edges, 'node_1', 'node_2', create_using=nx.Graph)
DeezerRO = nx.from_pandas_edgelist(DeezerRO_edges, 'node_1', 'node_2', create_using=nx.Graph)
FacebookLargePage = nx.from_pandas_edgelist(FacebookLargePage_edges, 'id_1', 'id_2', create_using=nx.Graph)
FeatherDeezerSocial = nx.from_pandas_edgelist(FeatherDeezerSocial_edges, 'node_1', 'node_2', create_using=nx.Graph)
FeatherLastfmSocial = nx.from_pandas_edgelist(FeatherLastfmSocial_edges, 'node_1', 'node_2', create_using=nx.Graph) 
GithubSocial = nx.from_pandas_edgelist(GithubSocial_edges, 'id_1', 'id_2', create_using=nx.Graph)
# TwitchGamers = nx.from_pandas_edgelist(TwitchGamers_edges, 'numeric_id_1', 'numeric_id_2', create_using=nx.Graph)
TwitchSocialNetworksDE = nx.from_pandas_edgelist(TwitchSocialNetworksDE_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksENGB = nx.from_pandas_edgelist(TwitchSocialNetworksENGB_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksES = nx.from_pandas_edgelist(TwitchSocialNetworksES_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksFR = nx.from_pandas_edgelist(TwitchSocialNetworksFR_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksPTBR = nx.from_pandas_edgelist(TwitchSocialNetworksPTBR_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksRU = nx.from_pandas_edgelist(TwitchSocialNetworksRU_edges, 'from', 'to', create_using=nx.Graph)
SocEpinions1 = nx.from_pandas_edgelist(SocEpinions1_edges, '# FromNodeId', 'ToNodeId', create_using=nx.Graph)
SocSignSlashdot081106 = nx.from_pandas_edgelist(SocSignSlashdot081106_edges, '# FromNodeId', 'ToNodeId', edge_attr='Sign' , create_using=nx.Graph)
SocSignSlashdot090216 = nx.from_pandas_edgelist(SocSignSlashdot090216_edges, '# FromNodeId', 'ToNodeId', edge_attr='Sign' , create_using=nx.Graph)
SocSignSlashdot090221 = nx.from_pandas_edgelist(SocSignSlashdot090221_edges, '# FromNodeId', 'ToNodeId', edge_attr='Sign' , create_using=nx.Graph)
SocSlashdot0811 = nx.from_pandas_edgelist(SocSlashdot0811_edges, '# FromNodeId', 'ToNodeId', create_using=nx.Graph)
SocSlashdot0902 = nx.from_pandas_edgelist(SocSlashdot0902_edges, '# FromNodeId', 'ToNodeId', create_using=nx.Graph)
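
All graphs above are built with `create_using=nx.Graph`, so every network is treated as undirected. The toy example below (made-up edges, not taken from any dataset) sketches what that means for an edge list that contains both directions of a link: the two rows collapse into one edge, and for an edge attribute such as `Sign` only the value of the last row remains. If direction turns out to matter, `nx.DiGraph` could be used instead.

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list: 1->2 and 2->1 both occur, with different signs
edges = pd.DataFrame({
    "source": [1, 2, 2],
    "target": [2, 1, 3],
    "Sign": [1, -1, 1],
})

undirected = nx.from_pandas_edgelist(
    edges, "source", "target", edge_attr="Sign", create_using=nx.Graph
)
directed = nx.from_pandas_edgelist(
    edges, "source", "target", edge_attr="Sign", create_using=nx.DiGraph
)

print(undirected.number_of_edges())  # 1->2 and 2->1 are merged into one edge
print(directed.number_of_edges())    # every row stays a separate edge
print(undirected[1][2]["Sign"])      # attribute of the row that was read last
```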

# Descriptive statistics of different networks
- number of nodes
- number of edges
- density
- connectedness
- average clustering coefficient
- transitivity
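
To make these measures concrete, here is a minimal sketch on a hand-made toy graph (a triangle with one extra node attached), not on one of the datasets. It also shows that the average clustering coefficient and transitivity are different quantities: the first averages the local clustering of each node, the second is the fraction of connected triples that are closed.

```python
import networkx as nx

# Toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 2
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

stats = {
    "number_of_nodes": G.number_of_nodes(),
    "number_of_edges": G.number_of_edges(),
    "density": nx.density(G),                # 2m / (n (n - 1)) for undirected graphs
    "connected_network": nx.is_connected(G),
    "avg_cc": nx.average_clustering(G),      # mean of the local clustering coefficients
    "transitivity": nx.transitivity(G),      # 3 * triangles / connected triples
}

for key, value in stats.items():
    print(f"{key}: {value}")
```

For this graph the local clustering coefficients are 1, 1, 1/3 and 0, so the average is 7/12, while the transitivity is 3/5.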

In [13]:
networks = {
    'NashvilleMeetupNetwork' : NashvilleMeetupNetwork,
    'DeezerHR' : DeezerHR,
    'DeezerHU' : DeezerHU,
    'DeezerRO' : DeezerRO,
    'FacebookLargepage' : FacebookLargePage,
    'FeatherDeezerSocial' : FeatherDeezerSocial,
    'FeatherLastfmSocial' : FeatherLastfmSocial, 
    'GithubSocial' : GithubSocial,
    #'TwitchGamers' : TwitchGamers, # Network too large to process in a reasonable time
    'TwitchSocialNetworksDE' : TwitchSocialNetworksDE,
    'TwitchSocialNetworksENGB' : TwitchSocialNetworksENGB,
    'TwitchSocialNetworksES' : TwitchSocialNetworksES,
    'TwitchSocialNetworksFR' : TwitchSocialNetworksFR,
    'TwitchSocialNetworksPTBR' : TwitchSocialNetworksPTBR,
    'TwitchSocialNetworksRU' : TwitchSocialNetworksRU,
    'SocEpinions1' : SocEpinions1,
    'SocSignSlashdot081106' : SocSignSlashdot081106,
    'SocSignSlashdot090216' : SocSignSlashdot090216,
    'SocSignSlashdot090221' : SocSignSlashdot090221,
    'SocSlashdot0811' : SocSlashdot0811,
    'SocSlashdot0902' : SocSlashdot0902,
}

The code below takes some time to run, so the results are written to a CSV file that can be read in later.

In [14]:
# descriptive_stats = pd.DataFrame(index=list(networks.keys()), columns=['number_of_nodes', 'number_of_edges', 'density', 'connected_network', 'avg_cc', 'transitivity'])
# for name, network in networks.items():
#     # .loc[row, column] writes to the DataFrame itself and avoids chained assignment
#     descriptive_stats.loc[name, 'number_of_nodes'] = nx.number_of_nodes(network)
#     descriptive_stats.loc[name, 'number_of_edges'] = nx.number_of_edges(network)
#     descriptive_stats.loc[name, 'density'] = nx.density(network)
#     descriptive_stats.loc[name, 'connected_network'] = nx.is_connected(network)
#     descriptive_stats.loc[name, 'avg_cc'] = nx.average_clustering(network)
#     descriptive_stats.loc[name, 'transitivity'] = nx.transitivity(network)

# descriptive_stats.to_csv('results/descriptive_stats_of_networks.csv')

In [15]:
descriptive_stats = pd.read_csv('results/descriptive_stats_of_networks.csv')
descriptive_stats

Unnamed: 0.1,Unnamed: 0,number_of_nodes,number_of_edges,density,connected_network,avg_cc,transitivity
0,NashvilleMeetupNetwork,11372,1176368,0.018194,True,0.884957,0.604407
1,DeezerHR,54573,498202,0.000335,True,0.136477,0.11463
2,DeezerHU,47538,222887,0.000197,True,0.116187,0.092924
3,DeezerRO,41773,125826,0.000144,True,0.091212,0.075267
4,FacebookLargepage,22470,171002,0.000677,True,0.359738,0.232321
5,FeatherDeezerSocial,28281,92752,0.000232,True,0.14116,0.095922
6,FeatherLastfmSocial,7624,27806,0.000957,True,0.219418,0.178623
7,GithubSocial,37700,289003,0.000407,True,0.167537,0.012357
8,TwitchSocialNetworksDE,9498,153138,0.003395,True,0.200886,0.046471
9,TwitchSocialNetworksENGB,7126,35324,0.001391,True,0.130928,0.042433


# Ideas on what else we can say about networks
- Nodes in the largest SCC
- Average clustering coefficient
- Number of triangles
- Fraction of closed triangles
- Diameter (longest shortest path), in order to say something about the small-world property
- Relationship between network density and diameter, average clustering coefficient, etc.
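
A few of these quantities can be sketched on a toy graph. Two caveats: strongly connected components are defined for directed graphs, so for the undirected graphs built above the analogue is the largest connected component; and `nx.diameter` raises an error on a disconnected graph, so the sketch computes it on the largest component only. The graph below is made up for illustration.

```python
import networkx as nx

# Toy graph with two components: {0, 1, 2, 3} and {10, 11}
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (10, 11)])

# Largest connected component as its own graph
largest_nodes = max(nx.connected_components(G), key=len)
giant = G.subgraph(largest_nodes).copy()
fraction_in_giant = giant.number_of_nodes() / G.number_of_nodes()

# nx.triangles counts triangles per node, so each triangle is counted three times
n_triangles = sum(nx.triangles(G).values()) // 3

# Diameter (longest shortest path) of the largest component
diameter = nx.diameter(giant)

print(fraction_in_giant, n_triangles, diameter)
```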

## Old text for datasets and preprocessing

### 1. Nashville Meetup Network
https://www.kaggle.com/datasets/stkbailey/nashville-meetup?select=rsvps.csv

Dataset about who goes to what meetups. From these relations a social network can be constructed.

- `member-to-group-edges.csv`: Edge list for constructing a member-to-group bipartite graph. Weights represent number of events attended in each group.
- `group-edges.csv`: Edge list for constructing a group-to-group graph. Weights represent shared members between groups.
- `member-edges.csv`: Edge list for constructing a member-to-member graph. Weights represent shared group membership.
- `rsvps.csv`: Raw member-to-event attendance data, which was aggregated to form member-to-group-edges.csv.

In short the relations mean:
- `member-to-group-edges`: member is part of group and has attended 'weight' events in this group
- `group-edges`: group A has 'weight' shared members with group B
- `member-edges`: member A shares membership of 'weight' groups with member B

### 2. Deezer Social Networks
https://www.kaggle.com/datasets/andreagarritano/deezer-social-networks

Dataset about friendship networks of users on Deezer in three European countries: Romania, Croatia and Hungary. The edges represent the relationship 'friendship'. Since the data comes from three countries, these are essentially three different networks. (There is also a JSON file which contains properties of a node, namely its genre preferences.)

### 3. Facebook Large Page Page Network
https://www.kaggle.com/datasets/wolfram77/graphs-social?select=feather-lastfm-social

This is a webgraph of verified Facebook sites. The nodes represent official Facebook pages. The edges represent mutual likes between the sites. (There is also a JSON file which contains properties of a node, namely a description of the purpose of the page.) This graph was collected through the Facebook Graph API in November 2017 and restricted to pages from 4 categories which are defined by Facebook. These categories are: politicians, governmental organizations, television shows and companies.


# 3. Visualisations
Minimum 6, maximum 8 visualisations: let's aim for two visualisations per topic.

1. Average degree/degree distribution give insight into hubs: a few highly connected nodes.
    - Could be shown for all networks in a single plot.
    - If there are hubs, can we conclude that it is a real network?
2. Small-world property: only a few steps are needed to reach any arbitrary node in the network.
    - Related to the hubs. Network transitivity.
    - Logarithmic relation between network size and network diameter --> this can be plotted.
3. Triadic closure: meeting new friends through shared contacts. High clustering coefficient.
    - Real networks have high clustering coefficients.
4. How concepts, ideas and preferences spread through the network.
   - One step further: if it turns out that friends get to know each other through friends, then characteristics of people may also spread via friends. Try to demonstrate this.
   - S(x,y) : x is similar to y (similarity)
   - R(x,y) : x is a friend of y (relation in the network)
   1. If friends, then a chance of similar taste. $\forall x\forall y(R(x,y)\rightarrow S(x,y))$
   2. If similar taste, then a chance of being friends. $\forall x\forall y(S(x,y)\rightarrow R(x,y))$
   - A clustering plot could be an option.
   - Pearson correlation is also a good way to gain insight into this.
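
One possible way to make the similarity relation S(x,y) from topic 4 measurable is the Jaccard index of two users' genre sets, compared between friend pairs and non-friend pairs. The sketch below is only an illustration: the names, the friendships and the genre sets are invented, and the Jaccard index is an assumed choice of similarity measure rather than something fixed by the project.

```python
import itertools

import networkx as nx


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection / size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical friendship network and genre preferences
G = nx.Graph([("ana", "bo"), ("bo", "cy"), ("cy", "di")])
genres = {
    "ana": {"Pop", "Rock"},
    "bo": {"Pop", "Rock", "Jazz"},
    "cy": {"Jazz"},
    "di": {"Jazz", "Metal"},
}

friend_scores, other_scores = [], []
for x, y in itertools.combinations(G.nodes, 2):
    score = jaccard(genres[x], genres[y])
    # R(x, y) holds exactly when there is an edge between x and y
    if G.has_edge(x, y):
        friend_scores.append(score)
    else:
        other_scores.append(score)

mean_friends = sum(friend_scores) / len(friend_scores)
mean_others = sum(other_scores) / len(other_scores)
print(f"friends: {mean_friends:.3f}, non-friends: {mean_others:.3f}")
```

On the real networks, enumerating all node pairs would be far too expensive, so the non-friend pairs would have to be sampled.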

In [17]:
start = time.time()
print(nx.diameter(TwitchSocialNetworksPTBR))
end = time.time()
print(f'time in seconds:{end - start}')

7
time in seconds:4.84745979309082
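
To get a feeling for the relation between network size and diameter before running this on the larger datasets, a sketch on synthetic graphs can help. The example below uses Watts-Strogatz small-world graphs as a stand-in; the parameters `k=6`, `p=0.1` and the seed are arbitrary assumptions, and the graph sizes are kept small because the exact diameter gets expensive quickly.

```python
import math

import networkx as nx

results = []
for n in (100, 200, 400, 800):
    # Synthetic small-world graph that is guaranteed to be connected
    G = nx.connected_watts_strogatz_graph(n, k=6, p=0.1, seed=42)
    results.append((n, math.log(n), nx.diameter(G)))

for n, log_n, diameter in results:
    print(f"n={n:4d}  ln(n)={log_n:.2f}  diameter={diameter}")
```

If the diameter grows roughly in step with ln(n) rather than with n itself, that points towards the small-world property. The same comparison could be made for the real networks by plotting diameter against the logarithm of the number of nodes.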
