# RedHat BitbyBit Stack Overflow Tag Network Visualization and Analysis

* Making the network
* Network Visualization


# Make Network

In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [None]:
nodes = pd.read_csv('../input/stack_network_nodes.csv')
edges = pd.read_csv('../input/stack_network_links.csv')

In [None]:
nodes.head()

In [None]:
edges.head()

A graph consists of nodes and edges. Here the nodes are the tags that appear in developers' Stack Overflow ["Developer Stories"](https://stackoverflow.com/users/story/join) profiles. If two tags appear on the same profile, there is an edge between them. To build the graph in networkx, we first create an empty graph, then add the nodes and edges from the dataframes ```nodes``` and ```edges``` by iterating over their rows.

> The dataset includes only a subset of tags used on Developer Stories: tags that were used by at least 0.5% of users and were correlated with another tag with a correlation coefficient above 0.1. This means that very sparsely used tags and tags that are not used with other tags were filtered out.

For more details, see: https://www.kaggle.com/stackoverflow/stack-overflow-tag-network/discussion/41307

Nodes and edges can have metadata associated with them. 

**Node metadata:**
*  ```nodesize```: proportional to how many developers have that tag in their Developer Story profile.
*  ```group```: the group the node belongs to (computed with walktrap clustering).

**Edge metadata:**
*  ```value``` :  proportional to how correlated those two tags are (correlation coefficient * 100). 
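
As a minimal sketch of how this metadata ends up on the graph, the toy example below builds a three-tag network from two hand-made dataframes. The tag names and numbers are invented for illustration and are not taken from the dataset; only the column names mirror the ones described above.

```python
import networkx as nx
import pandas as pd

# Toy data with invented values; only the column names match the real files.
toy_nodes = pd.DataFrame({
    "name": ["python", "django", "flask"],
    "group": [1, 1, 1],
    "nodesize": [400.0, 80.0, 30.0],
})
toy_edges = pd.DataFrame({
    "source": ["python", "python"],
    "target": ["django", "flask"],
    "value": [49.9, 25.3],
})

toy_graph = nx.Graph()

# Node metadata is attached as keyword arguments to add_node.
for name, group, nodesize in toy_nodes[["name", "group", "nodesize"]].itertuples(index=False):
    toy_graph.add_node(name, group=group, nodesize=nodesize)

# Edge metadata is attached the same way; here `value` is stored as `weight`.
for source, target, value in toy_edges[["source", "target", "value"]].itertuples(index=False):
    toy_graph.add_edge(source, target, weight=value)

print(toy_graph.nodes["django"])                # attributes of one node
print(toy_graph["python"]["django"]["weight"])  # attribute of one edge
```

Node attributes can be read back through ```toy_graph.nodes[...]``` and edge attributes through ```toy_graph[u][v]```, which is how the drawing code later looks up ```group``` and ```nodesize```.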



In [None]:
G = nx.Graph()

In [None]:
for index, row in nodes.iterrows():
    G.add_node(row["name"], group=row["group"], nodesize=row["nodesize"])

In [None]:
for index, row in edges.iterrows():
    G.add_edge(row["source"], row["target"], weight=row["value"])

In [None]:
print(f"Nodes: {G.number_of_nodes()}, edges: {G.number_of_edges()}")

# Connectivity 


In [None]:
nx.is_connected(G)

In [None]:
nx.number_connected_components(G)

In [None]:
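# nx.connected_component_subgraphs was removed in networkx 2.4; take the subgraph of the largest component instead
maximum_connected_component = G.subgraph(max(nx.connected_components(G), key=len)).copy()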

# Network Visualization

Reference for the network visualization: https://www.kaggle.com/jncharon/python-network-graph

In [None]:
print(nx.__version__)

In [None]:
def draw_graph(G, size):
    # Map each group id (1-14) to a color
    color_map = {1:'#f09494', 2:'#eebcbc', 3:'#72bbd0', 4:'#91f0a1', 5:'#629fff', 6:'#bcc2f2',  
             7:'#eebcbc', 8:'#f1f0c0', 9:'#d2ffe7', 10:'#caf3a6', 11:'#ffdf55', 12:'#ef77aa', 
             13:'#d6dcff', 14:'#d2f5f0'}
    # Color nodes by group and scale marker size by nodesize
    node_color = [color_map[d['group']] for n, d in G.nodes(data=True)]
    node_size = [d['nodesize'] * 10 for n, d in G.nodes(data=True)]
    pos = nx.drawing.spring_layout(G, k=0.70, iterations=60)
    plt.figure(figsize=size)
    # draw_networkx takes the edge line width as `width`, not `edge_width`
    nx.draw_networkx(G, pos=pos, node_color=node_color, node_size=node_size, edge_color='#FFDEA2', width=1)
    plt.show()

In [None]:
draw_graph(G,size=(25,25))

# Cliques 

In [None]:
cliques = list(nx.find_cliques(G))

In [None]:
num_cliques = len(cliques)  # number of maximal cliques, not the size of the largest one
print(num_cliques)

In [None]:
for clique in cliques:
    print(clique)

# Language Specific Ego Network And Cliques 

Each programming language has a tag in the network; for example, 'python' refers to the Python language. We can therefore list the cliques that contain that node. We can also visualize the ego network of a node: the subgraph containing that node and all nodes within a specified radius (number of hops) of it.

For example, the ego network for python with radius 2 is the subgraph containing python, its direct neighbors (1 edge away from python), and the nodes that are 2 hops away from python.

Ego networks are useful for checking shortest paths or, more generally, for analyzing who is connected to whom. Cliques are helpful because they show the data at a finer granularity.
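
To make the effect of the radius concrete, here is a small sketch on an invented chain of five nodes (the node labels are placeholders, not tags from the dataset). It shows how the ego network of the first node grows as the radius goes from 1 to 2.

```python
import networkx as nx

# Hypothetical chain of nodes: a - b - c - d - e
chain = nx.path_graph(["a", "b", "c", "d", "e"])

# Radius 1: the node itself plus its direct neighbors.
ego_r1 = nx.ego_graph(chain, "a", radius=1)

# Radius 2: additionally includes nodes that are two hops away.
ego_r2 = nx.ego_graph(chain, "a", radius=2)

print(sorted(ego_r1.nodes()))  # ['a', 'b']
print(sorted(ego_r2.nodes()))  # ['a', 'b', 'c']
```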

In [None]:
print(nx.ego_graph(G,'python',radius=2).nodes())

Python participates in 4 different cliques: one for web development with django and flask; one, presumably for open source development, that is connected to linux; one for machine learning, where it is adjacent to R; and a fourth that I think reflects porting code back and forth between python and C/C++.

In [None]:
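# Filtering find_cliques directly avoids relying on the cliques_containing_node helper
[clique for clique in nx.find_cliques(G) if "python" in clique]

In [None]:
[clique for clique in nx.find_cliques(G) if "c++" in clique]

In [None]:
[clique for clique in nx.find_cliques(G) if "php" in clique]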

# Visualize Largest Cliques

Visualizing the largest cliques may reveal patterns in the data. After finding all the maximal cliques, we sort them by length (the number of nodes in the clique) and draw the largest ones. ```G.subgraph``` extracts a subgraph from the graph given a list of nodes. The 3 largest cliques each have 7 nodes. Because the subgraph is built from the set of unique nodes across those cliques, the plot shows two distinct clusters: one of javascript-related tags and one of .net-related tags.
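
The selection step can be sketched on a toy graph as follows. The graph is invented for illustration, and picking cliques by comparing against the largest size (rather than slicing a sorted list) is a variation chosen here so that the number of largest cliques does not need to be known in advance.

```python
import networkx as nx

# Toy graph (invented): a 4-clique and a triangle that share node "d".
toy_cliques_graph = nx.Graph()
toy_cliques_graph.add_edges_from([
    ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),  # 4-clique
    ("d", "e"), ("d", "f"), ("e", "f"),                                      # triangle
])

# find_cliques yields every maximal clique as a list of nodes.
all_cliques = list(nx.find_cliques(toy_cliques_graph))

# Keep only the cliques whose size equals the largest size found.
largest_size = max(len(clique) for clique in all_cliques)
largest_cliques = [clique for clique in all_cliques if len(clique) == largest_size]

# Union of their nodes, then the subgraph induced by those nodes.
largest_nodes = set().union(*largest_cliques)
largest_subgraph = toy_cliques_graph.subgraph(largest_nodes)

print(largest_size, sorted(largest_subgraph.nodes()))
```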

In [None]:
sorted_cliques = sorted(nx.find_cliques(G), key=len)

In [None]:
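max_clique_nodes = set()

# Take the last 3 entries of the sorted list, i.e. the 3 largest cliques
# (the slice [-4:-1] would skip the final, largest entry)
for nodelist in sorted_cliques[-3:]:
    max_clique_nodes.update(nodelist)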

In [None]:
max_clique = G.subgraph(max_clique_nodes)

In [None]:
print(f"Nodes: {max_clique.number_of_nodes()}, edges: {max_clique.number_of_edges()}")

In [None]:
draw_graph(max_clique,size=(10,10))

# Degree Distribution 

To check the degree distribution of the graph, we plot a histogram of the node degrees. In the tag network, most tags have only 1 or 2 neighbors, while a few tags are linked to more than 10 other tags.
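
As a small sketch of what the histogram summarizes, the example below counts how many nodes have each degree in an invented star-shaped graph (one hub and four leaves). The graph is a placeholder and does not come from the dataset.

```python
import networkx as nx
from collections import Counter

# Toy star graph (invented): node 0 is the hub, nodes 1-4 are leaves.
toy_star = nx.star_graph(4)

# G.degree() yields (node, degree) pairs; keep only the degrees.
toy_degrees = [degree for _, degree in toy_star.degree()]

# Count how many nodes share each degree value.
degree_counts = Counter(toy_degrees)
print(sorted(degree_counts.items()))  # [(1, 4), (4, 1)]
```

Four nodes have degree 1 and a single hub has degree 4, a small-scale version of the skewed shape described above.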

In [None]:
plt.hist([degree for _, degree in G.degree()])
plt.title("Stack Overflow Tag Degree Distribution")