# Social Network Analysis
Social network analysis (SNA) is the process of investigating social structures using network and graph theory. It is a specific application of graph theory in which individuals and other social actors, such as groups and organizations, are represented as nodes, and the social relations between them are represented as edges.

The data retrieved from Twitter is stored in the file snaFile. Each line in the file describes a relation between Twitter users: it starts with a Twitter user id, followed by a list of that user's follower ids.
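
To make the line format concrete, here is a minimal plain-Python sketch of how one line could be turned into edges. It reuses the two regular expressions from the parsing function in this notebook, which suggest that the follower list is enclosed in square brackets and separated by commas. The sample line and the helper names `parse_line` and `line_to_edges` are hypothetical and only serve as illustration.

```python
import re


def parse_line(line):
    """Return (user_id, [follower_id, ...]) for one raw line."""
    # First run of non-whitespace characters: the Twitter user id.
    user_match = re.search(r'^([^\s]+)', line)
    # Everything between the square brackets: the follower list.
    list_match = re.search(r'^.*\[(.*)\]', line)
    user_id = user_match.group(1) if user_match else ''
    raw_followers = list_match.group(1) if list_match else ''
    # Split on commas, trim whitespace, and drop empty entries.
    followers = [f.strip() for f in raw_followers.split(',') if f.strip()]
    return user_id, followers


def line_to_edges(line):
    """Return one (src, dst) edge per follower: follower -> followed user."""
    user_id, followers = parse_line(line)
    return [(follower, user_id) for follower in followers]


# Hypothetical sample line, not taken from the real data set.
print(line_to_edges("12345 [111, 222, 333]"))
```

Each edge points from the follower (src) to the followed user (dst), which matches the column naming used for the edge dataframe later on.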

In [1]:
from graphframes import *
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import *

sqlContext.setConf("spark.sql.shuffle.partitions", "4")
raw_dataframe = sqlContext.read.text('output2.txt')


In [2]:
""" Declare a udf function to filter out the null edge list"""
# Keep only non-empty ids; the None check guards against null values.
checkNull = udf(lambda x: x is not None and len(x) != 0, BooleanType())

In [3]:
def getVertexEdgeDataFrame(data):
    """ Parse the data, clean the null values and create the vertex and edge dataframe.

    Args:
        data :  A `dataframe` of raw data where each line consists of
                a Twitter id followed by a list of follower ids.

    Returns:
        vertex_list, edge_list_df: vertex_list is a dataframe that holds all the vertices.
                                   edge_list_df is a dataframe that holds all the edges
    """
    split_df = data.select(regexp_extract('value',r'^([^\s]+)',1).alias('v1'),
                           regexp_extract('value', r'^.*\[(.*)\]', 1).alias('v-list')) # parse the data into dest id and src list

    vertex_src_dest_df = split_df.select(trim(split_df.v1).alias('v1'),split('v-list',',').alias('v-list'))
    edge_list = vertex_src_dest_df.select(vertex_src_dest_df.v1,explode('v-list'))
    edge_list = edge_list.select(trim(edge_list.col).alias('src'),edge_list.v1.alias('dst'))
    edge_list_df = edge_list.filter(checkNull(edge_list.src)).cache()
    vertex_list = edge_list_df.select(edge_list_df.src.alias('id')).distinct()
    return vertex_list, edge_list_df

### Create the Vertex and Edge List:
Call getVertexEdgeDataFrame to parse raw_dataframe and retrieve the vertex and edge dataframes.


In [4]:
v_list,e_list = getVertexEdgeDataFrame(raw_dataframe)


### Check the Vertex and Edges count

Call the count API to retrieve the total number of vertices and edges.


In [5]:
print("Vertex list count: ", v_list.count())
print("Edge list count: ", e_list.count())

Vertex list count:  52567
Edge list count:  63166


### Create a GraphFrame

A GraphFrame represents a graph and holds its vertices and edges as dataframes. GraphFrames provides a set of APIs for running graph algorithms on the data.

In [6]:
sna_Graph=GraphFrame(v_list,e_list)


### Graph Degree
List the degrees of vertices in the graph.


In [7]:
sna_Graph.degrees.sort('degree',ascending=False).show()


+----------+------+
|        id|degree|
+----------+------+
|  37247645|   375|
| 292417113|   323|
| 270605608|   316|
| 278371078|   308|
|4841703016|   305|
| 256641818|   300|
| 407363823|   300|
| 201335291|   288|
| 342598879|   288|
|  28000155|   283|
|  19874185|   282|
| 320455769|   279|
| 362272328|   277|
|  49862199|   276|
|2737077493|   274|
|  45696435|   272|
|  48650113|   272|
|2485443951|   269|
| 211316487|   267|
| 213562417|   264|
+----------+------+
only showing top 20 rows



### Graph indegree
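Sort the vertices by inDegree and show the top 20 rows. Because each edge points from a follower (src) to the followed user (dst), inDegree counts a user's followers within the collected data.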


In [8]:
sna_Graph.inDegrees.sort('inDegree', ascending=False).show()

+----------+--------+
|        id|inDegree|
+----------+--------+
|  37247645|     369|
|4841703016|     304|
| 270605608|     302|
| 256641818|     288|
| 278371078|     287|
|  19874185|     281|
| 342598879|     279|
| 407363823|     279|
| 201335291|     277|
|2737077493|     273|
|  28000155|     273|
| 362272328|     268|
|  45696435|     266|
| 211316487|     266|
|2830180693|     263|
|2149637461|     259|
| 320455769|     259|
|  49862199|     257|
| 213562417|     256|
|2485443951|     254|
+----------+--------+
only showing top 20 rows



### Graph outdegree
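Sort the vertices by outDegree and show the top 20 rows. Because each edge points from a follower (src) to the followed user (dst), outDegree counts how many users in the collected data a vertex follows.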


In [9]:
sna_Graph.outDegrees.sort('outDegree',ascending=False).show()

+------------------+---------+
|                id|outDegree|
+------------------+---------+
|         292417113|      138|
|         109088737|      102|
|         249014213|       49|
|         224314351|       48|
|741351844362915846|       43|
|744380735100858369|       38|
|         208243253|       36|
|         191674711|       30|
|          47481461|       29|
|         121337468|       28|
|        1332595111|       27|
|748289232972582912|       25|
|          58377545|       25|
|746278469177384961|       24|
|736302842496045057|       24|
|         199821749|       24|
|          30803766|       24|
|         111618023|       23|
|        1099786399|       23|
|         420075870|       22|
+------------------+---------+
only showing top 20 rows
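
The three degree measures above are related: the degree of a vertex is its inDegree plus its outDegree. The following plain-Python sketch shows that relationship on a tiny, hypothetical edge list. It does not use Spark, and `degree_counts` is an illustrative helper, not part of GraphFrames.

```python
from collections import Counter


def degree_counts(edges):
    """Return (degree, in_degree, out_degree) counters for (src, dst) edges."""
    in_degree = Counter(dst for _, dst in edges)   # edges arriving at a vertex
    out_degree = Counter(src for src, _ in edges)  # edges leaving a vertex
    degree = in_degree + out_degree                # total edges touching a vertex
    return degree, in_degree, out_degree


# Hypothetical edges: "a" and "b" follow "c", and "c" follows "a".
edges = [("a", "c"), ("b", "c"), ("c", "a")]
degree, in_degree, out_degree = degree_counts(edges)
for vertex in sorted(degree):
    print(vertex, degree[vertex], in_degree[vertex], out_degree[vertex])
```

The same relationship can be checked against the tables: vertex 37247645 has a degree of 375 and an inDegree of 369, which leaves 6 outgoing edges.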



### PageRank
Run the PageRank algorithm on the graphframe. Sort the results based on pagerank and list the top 20 rows.

In [10]:
results = sna_Graph.pageRank(resetProbability=0.15, maxIter=10)

sna_rank = results.vertices.sort('pagerank',ascending=False)
sna_rank.show()

+------------------+------------------+
|                id|          pagerank|
+------------------+------------------+
|         292417113|2096.8267965976465|
|736302842496045057|1683.3349026011945|
|        1332595111|1082.8139035419747|
|         224314351|  912.832967680284|
|741351844362915846|  796.000924418758|
|         249014213| 792.5400662919325|
|         109088737|  717.118647015483|
|         895948314| 575.7891671169604|
|744380735100858369|   513.97804949864|
|        1099786399|500.58541644862066|
|748289232972582912| 272.0488691881934|
|746278469177384961|271.11062808072677|
|        2902981373|232.04126123535772|
|        1049448162|214.19359623140426|
|          49862199|196.48823326781095|
|          37247645| 186.0075429616639|
|         249168748|182.96643345818617|
|         304580751|182.45703822551692|
|         320455769| 181.5170042591198|
|         225520398|169.81087034066056|
+------------------+------------------+
only showing top 20 rows
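
The scores above are clearly not probabilities that sum to 1. One formulation that produces unnormalized scores of this kind starts every vertex at rank 1.0 and repeatedly sets each rank to `resetProbability + (1 - resetProbability) * (sum of rank / outDegree over incoming neighbours)`. The plain-Python sketch below implements that simplified update for a fixed number of iterations. It is meant for intuition only and is not claimed to match the GraphFrames implementation in every detail, for example in how vertices without outgoing edges are handled. The function name and the sample edges are hypothetical.

```python
def pagerank(edges, reset_probability=0.15, max_iter=10):
    """Unnormalized PageRank over (src, dst) edges with a fixed iteration count."""
    vertices = {v for edge in edges for v in edge}
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}  # assumption: every vertex starts at 1.0
    for _ in range(max_iter):
        incoming = {v: 0.0 for v in vertices}
        # Each vertex splits its current rank evenly across its outgoing edges.
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {
            v: reset_probability + (1.0 - reset_probability) * incoming[v]
            for v in vertices
        }
    return ranks


# Hypothetical star graph: "b", "c" and "d" all follow "a".
ranks = pagerank([("b", "a"), ("c", "a"), ("d", "a")])
for vertex, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(vertex, round(rank, 4))
```

In this sketch, a vertex that many others point to accumulates rank from each of them, while a vertex with no incoming edges settles at the reset probability.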



### Label Propagation

Run the label propagation algorithm to find the communities in the network.

In [11]:
communities=sna_Graph.labelPropagation(maxIter=10)
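
For intuition, label propagation can be sketched in plain Python as follows: every vertex starts with its own id as its label, and in each round every vertex adopts the label that is most common among its neighbours. The sketch treats edges as undirected, updates all vertices at the same time, and breaks ties by picking the smallest label. These choices are assumptions made for this illustration and may differ from what GraphFrames does internally. The function name and sample graph are hypothetical.

```python
from collections import Counter, defaultdict


def label_propagation(edges, max_iter=10):
    """Return {vertex: label} after max_iter synchronous propagation rounds."""
    neighbours = defaultdict(list)
    for src, dst in edges:
        # Assumption: labels travel in both directions along an edge.
        neighbours[src].append(dst)
        neighbours[dst].append(src)

    labels = {v: v for v in neighbours}  # each vertex starts in its own community
    for _ in range(max_iter):
        new_labels = {}
        for vertex, adjacent in neighbours.items():
            counts = Counter(labels[n] for n in adjacent)
            # Most frequent neighbour label; ties go to the smallest label.
            new_labels[vertex] = min(counts, key=lambda lab: (-counts[lab], lab))
        labels = new_labels
    return labels


# Hypothetical graph: two separate triangles.
edges = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4)]
labels = label_propagation(edges, max_iter=5)
print(labels)
print("Communities:", len(set(labels.values())))
```

Vertices that end up with the same label belong to the same community, which is how the label column in the result is read.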

### Communities
Display the communities and the total number of communities.

In [12]:
communities.select('id','label').show()
print("Total no. of communities: ", communities.select(communities.label).distinct().count())

+------------------+-----+
|                id|label|
+------------------+-----+
|        1332595111|    3|
|750098559484043264|   47|
|          39096451|   47|
|746278469177384961|   15|
|740215465536954368|   19|
|          23277482|   47|
|        1268241504|   27|
|736575530884071426|   31|
|        1439425182|   35|
|         249168748|   39|
|          15586037|   43|
|        1099786399|   47|
|733932435935629312|   51|
|736301137410818048|   55|
|        2259030413|   59|
|741580973108690948|   63|
|        3067369849|   67|
|        1426954831|  309|
|         424760563|  309|
|        2195048798|   79|
+------------------+-----+
only showing top 20 rows

Total no. of communities:  1994


### Social Network Data
Create a social network dataframe that holds the source vertex, the destination vertex, and the PageRank and community label of the source vertex. This information can be used for further analysis of the data.

In [13]:
e_list=e_list.distinct()
comm_rank = communities.join(sna_rank,communities.id==sna_rank.id,"inner").select(communities.id,communities.label, sna_rank.pagerank)
sna_network = comm_rank.join(e_list, e_list.src==comm_rank.id,"inner").select(comm_rank.id,e_list.dst,comm_rank.label,comm_rank.pagerank)
sna_network.count()


62428
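
The two joins above can be pictured with plain Python dictionaries: duplicate edges are dropped first, and an edge is kept only if its source vertex has both a community label and a PageRank score, which is what an inner join implies. The helper `build_network_rows` and the sample values are hypothetical and only illustrate the shape of the resulting rows.

```python
def build_network_rows(edges, labels, ranks):
    """Combine distinct (src, dst) edges with the label and rank of src."""
    rows = []
    for src, dst in sorted(set(edges)):        # mirrors e_list.distinct()
        if src in labels and src in ranks:     # mirrors the inner joins
            rows.append({
                "id": src,
                "dst": dst,
                "label": labels[src],
                "pagerank": ranks[src],
            })
    return rows


# Hypothetical inputs: one duplicate edge and one source without a label.
edges = [("111", "12345"), ("111", "12345"), ("222", "12345"), ("333", "999")]
labels = {"111": 3, "222": 47}
ranks = {"111": 1.5, "222": 0.15, "333": 0.15}
for row in build_network_rows(edges, labels, ranks):
    print(row)
```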

### Save the Output

In [14]:
sna_network.repartition(1).rdd.saveAsTextFile("/home/shan/SNA/output/")

In [15]:
sna_network.write.json("/home/shan/SNA/json/")
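
Spark writes JSON output as a directory of part files that contain one JSON object per line. As a minimal sketch of how that output could be read back without Spark, the helper below parses such lines with the standard json module. The field names follow the columns selected for sna_network (id, dst, label, pagerank). The sample lines and the name `parse_json_lines` are hypothetical.

```python
import json


def parse_json_lines(lines):
    """Parse newline-delimited JSON records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# Hypothetical records in the shape of the sna_network columns.
sample_lines = [
    '{"id": "111", "dst": "12345", "label": 3, "pagerank": 1.5}',
    '',
    '{"id": "222", "dst": "12345", "label": 47, "pagerank": 0.15}',
]
records = parse_json_lines(sample_lines)
for record in records:
    print(record["id"], record["dst"], record["label"], record["pagerank"])
```

In practice the lines would come from iterating over each part file in the output directory rather than from an in-memory list.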