---

_You are currently looking at **version 1.2** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-social-network-analysis/resources/yPcBs) course resource._

---

# Assignment 4

In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import pickle

---

## Part 1 - Random Graph Identification

For the first part of this assignment you will analyze randomly generated graphs and determine which algorithm created them.

In [2]:
P1_Graphs = pickle.load(open('A4_graphs','rb'))
P1_Graphs

[<networkx.classes.graph.Graph at 0x7f81d5904908>,
 <networkx.classes.graph.Graph at 0x7f81d5904a20>,
 <networkx.classes.graph.Graph at 0x7f81d5904a58>,
 <networkx.classes.graph.Graph at 0x7f81d5904a90>,
 <networkx.classes.graph.Graph at 0x7f81d5904ac8>]

<br>
`P1_Graphs` is a list containing 5 networkx graphs. Each of these graphs was generated by one of three possible algorithms:
* Preferential Attachment (`'PA'`)
* Small World with low probability of rewiring (`'SW_L'`)
* Small World with high probability of rewiring (`'SW_H'`)

Analyze each of the 5 graphs and determine which of the three algorithms generated the graph.

*The `graph_identification` function should return a list of length 5 where each element in the list is either `'PA'`, `'SW_L'`, or `'SW_H'`.*
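
As a rough illustration of how the three generators differ, the sketch below builds one graph of each kind with networkx and compares two summary statistics: the number of distinct degree values and the average clustering coefficient. The sizes and parameters (`1000` nodes, `m=4`, `k=10`, rewiring probabilities `0.05` and `0.75`) are assumptions chosen for the demonstration; they are not the settings used to create `P1_Graphs`.

```python
import networkx as nx


def summarize(graph):
    """Return (number of distinct degree values, average clustering coefficient)."""
    degrees = dict(graph.degree())
    return len(set(degrees.values())), nx.average_clustering(graph)


# Illustrative parameters only; the assignment's actual settings are not given.
pa_graph = nx.barabasi_albert_graph(1000, 4, seed=0)       # preferential attachment
sw_low = nx.watts_strogatz_graph(1000, 10, 0.05, seed=0)   # small world, low rewiring
sw_high = nx.watts_strogatz_graph(1000, 10, 0.75, seed=0)  # small world, high rewiring

pa_stats = summarize(pa_graph)
sw_low_stats = summarize(sw_low)
sw_high_stats = summarize(sw_high)

for name, stats in [('PA', pa_stats), ('SW_L', sw_low_stats), ('SW_H', sw_high_stats)]:
    print(name, 'distinct degrees:', stats[0], 'avg clustering:', round(stats[1], 3))
```

In graphs like these, the preferential attachment graph tends to show the widest spread of degree values, while the two small-world graphs are separated mainly by their clustering coefficients. Any thresholds derived from such statistics depend on the graph size and parameters.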

In [3]:

def graph_identification():
    
    my_output = []    
 
    for G in P1_Graphs:

        # Collect the distinct degree values in the graph.
        # dict(...) works whether G.degree() returns a dict or a degree view.
        degree_dist = dict(G.degree())
        degree_vals = sorted(set(degree_dist.values()))
        degree_dist_len = len(degree_vals)

        # Get clustering info about the graph
        clust_coeff = nx.average_clustering(G)

        #print(degree_dist_len)
        #print(clust_coeff)
        
        # PA graphs have a heavy-tailed degree distribution, so they show many distinct degree values
        if degree_dist_len > 20:
            my_output.append('PA')
        # SW_L graphs keep a high clustering coefficient because few of their lattice edges have been rewired
        elif clust_coeff > 0.25:
            my_output.append('SW_L')
        # SW_H graphs have a low clustering coefficient because many edges were rewired away from their nearest neighbors
        else:
            my_output.append('SW_H')
        
    return my_output

graph_identification()

['PA', 'SW_L', 'SW_L', 'PA', 'SW_H']

---

## Part 2 - Company Emails

For the second part of this assignment you will be working with a company's email network where each node corresponds to a person at the company, and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes `Department` and `ManagementSalary`.

`Department` indicates the department in the company which the person belongs to, and `ManagementSalary` indicates whether that person is receiving a management position salary.
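
A minimal sketch of reading these two attributes into pandas, using a hypothetical four-node toy graph rather than the real email network. The attribute names match the assignment; the nodes, edges, and values are made up for illustration.

```python
import networkx as nx
import numpy as np
import pandas as pd

# Hypothetical toy graph; only the attribute names come from the assignment.
toy = nx.Graph()
toy.add_node(0, Department=1, ManagementSalary=0.0)
toy.add_node(1, Department=1, ManagementSalary=np.nan)
toy.add_node(2, Department=21, ManagementSalary=1.0)
toy.add_node(3, Department=21, ManagementSalary=np.nan)
toy.add_edges_from([(0, 1), (1, 2), (2, 3)])

# get_node_attributes returns a {node: value} dict, which maps directly onto a Series.
attributes = pd.DataFrame({
    'Department': pd.Series(nx.get_node_attributes(toy, 'Department')),
    'ManagementSalary': pd.Series(nx.get_node_attributes(toy, 'ManagementSalary')),
})

# Nodes whose ManagementSalary is NaN are the ones that would need a prediction.
missing_nodes = attributes.index[attributes['ManagementSalary'].isnull()].tolist()
print(missing_nodes)  # [1, 3]
```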

In [4]:
G = nx.read_gpickle('email_prediction.txt')

print(nx.info(G))

Name: 
Type: Graph
Number of nodes: 1005
Number of edges: 16706
Average degree:  33.2458


### Part 2A - Salary Prediction

Using network `G`, identify the people in the network with missing values for the node attribute `ManagementSalary` and predict whether or not these individuals are receiving a management position salary.

To accomplish this, you will need to create a matrix of node features using networkx, train a sklearn classifier on nodes that have `ManagementSalary` data, and predict a probability of the node receiving a management salary for nodes where `ManagementSalary` is missing.
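
The workflow described above can be sketched on a small stand-in dataset. The example below uses networkx's built-in karate club graph, treats club membership as the binary label, and hides every fourth label to mimic missing values. The choice of graph, the hidden labels, the four features, and the scaling step are all assumptions made for the demonstration, not part of the assignment.

```python
import networkx as nx
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data: karate club graph, with membership in "Mr. Hi"'s club as the label.
K = nx.karate_club_graph()
labels = pd.Series({n: float(d['club'] == 'Mr. Hi') for n, d in K.nodes(data=True)})
labels[labels.index % 4 == 0] = np.nan  # hide every fourth label (demo assumption)

# One row per node, one column per network measure.
features = pd.DataFrame({
    'degree': pd.Series(dict(K.degree())),
    'clustering': pd.Series(nx.clustering(K)),
    'closeness': pd.Series(nx.closeness_centrality(K)),
    'betweenness': pd.Series(nx.betweenness_centrality(K)),
})

# Train on nodes with a known label, then score the nodes without one.
known = labels.notnull()
model = make_pipeline(MinMaxScaler(), LogisticRegression())
model.fit(features[known], labels[known])

# Column 1 of predict_proba is the probability of the positive class (label 1.0).
predicted = pd.Series(model.predict_proba(features[~known])[:, 1],
                      index=features.index[~known])
print(predicted)
```

The result is a Series indexed by node id, which is the shape the assignment asks for.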



Your predictions will need to be given as the probability that the corresponding employee is receiving a management position salary.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.88 or higher will receive full points, and a model with an AUC of 0.82 or higher will pass (get 80% of the full points).
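
For intuition about the metric, AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting as half. The sketch below checks this reading against `sklearn.metrics.roc_auc_score` on four made-up labels and scores; `pairwise_auc` is a hypothetical helper written for this illustration.

```python
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, y_score):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
manual_auc = pairwise_auc(y_true, y_score)
print(auc, manual_auc)  # 0.75 0.75
```

Three of the four positive/negative pairs are ordered correctly here, which gives 0.75. Because only the ranking matters, the predicted probabilities do not need to be well calibrated to score a high AUC.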

Using your trained classifier, return a series of length 252 with the data being the probability of receiving management salary, and the index being the node id.

    Example:
    
        1       1.0
        2       0.0
        5       0.8
        8       1.0
            ...
        996     0.7
        1000    0.5
        1001    0.0
        Length: 252, dtype: float64

In [5]:
G.nodes(data=True)[:15]

[(0, {'Department': 1, 'ManagementSalary': 0.0}),
 (1, {'Department': 1, 'ManagementSalary': nan}),
 (2, {'Department': 21, 'ManagementSalary': nan}),
 (3, {'Department': 21, 'ManagementSalary': 1.0}),
 (4, {'Department': 21, 'ManagementSalary': 1.0}),
 (5, {'Department': 25, 'ManagementSalary': nan}),
 (6, {'Department': 25, 'ManagementSalary': 1.0}),
 (7, {'Department': 14, 'ManagementSalary': 0.0}),
 (8, {'Department': 14, 'ManagementSalary': nan}),
 (9, {'Department': 14, 'ManagementSalary': 0.0}),
 (10, {'Department': 9, 'ManagementSalary': 0.0}),
 (11, {'Department': 14, 'ManagementSalary': 0.0}),
 (12, {'Department': 14, 'ManagementSalary': 1.0}),
 (13, {'Department': 26, 'ManagementSalary': 1.0}),
 (14, {'Department': 4, 'ManagementSalary': nan})]

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
#from sklearn import preprocessing

def salary_predictions():
    
    df = pd.DataFrame(index = G.nodes())
    df['ManagementSalary'] = pd.Series(nx.get_node_attributes(G, 'ManagementSalary'))

    # Need to include Department?
    #df['Department'] = pd.Series(nx.get_node_attributes(G, 'Department'))
    
    # Generate node features from hub/authority scores and centrality measures.
    # Each measure is wrapped in a Series so that it aligns with df on node id.
    hubs, authorities = nx.hits(G)
    df['hub_score'] = pd.Series(hubs)
    df['auth_score'] = pd.Series(authorities)
    df['degree'] = pd.Series(dict(nx.degree(G)))
    df['cent_degree'] = pd.Series(nx.degree_centrality(G))
    df['cent_close'] = pd.Series(nx.closeness_centrality(G))
    df['cent_between'] = pd.Series(nx.betweenness_centrality(G, normalized=True, endpoints=False))
    df['page_rank'] = pd.Series(nx.pagerank(G, alpha=0.85))    

    # Exclude rows with no Management Salary value
    df_target = df[(df.ManagementSalary == 0) | (df.ManagementSalary == 1)]
    # Create prediction set for those rows with no Management Salary value
    df_predict  = df[(df.ManagementSalary != 0) & (df.ManagementSalary != 1)]
    # Exclude column holding Management Salary
    X = df_target.drop('ManagementSalary', axis=1)
    y = df_target['ManagementSalary']
    X_predict = df_predict.drop('ManagementSalary', axis=1)

    # Create Train Test Split sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)                                   

    # Need to Encode Department value??
    #X_traintest = X_train.append(X_test)
    #le = preprocessing.LabelEncoder()
    #le.fit(X_traintest.loc[:, 'Department'])
    #X_train.loc[:, 'Department'] = le.transform(X_train.loc[:, 'Department'])
    #X_test.loc[:, 'Department']  = le.transform(X_test.loc[:, 'Department'])

    # Create a Logistic Regression Classifier and fit
    LRclf = LogisticRegression().fit(X_train, y_train)
    # predict_proba() returns one column per class; column 1 is the probability of a management salary.
    # Build a new Series rather than adding a column to X_predict, which is a slice of df.
    probabilities = pd.Series(LRclf.predict_proba(X_predict)[:, 1], index=X_predict.index)

    # Create a Decision Tree Classifier and fit
    #DTclf = DecisionTreeClassifier().fit(X_train, y_train)
    #X_predict['DTclf_probability'] = [p[1] for p in DTclf.predict_proba(X_predict)]
    
    return probabilities

salary_predictions()

1       0.240689
2       0.596864
5       0.958561
8       0.155016
14      0.462744
18      0.262732
27      0.316128
30      0.353912
31      0.224344
34      0.129183
37      0.154921
40      0.293567
45      0.209909
54      0.227727
55      0.179552
60      0.255897
62      0.991253
65      0.549875
77      0.101902
79      0.181177
97      0.102054
101     0.114840
103     0.225875
108     0.173790
113     0.505053
122     0.107958
141     0.465872
142     0.875109
144     0.070749
145     0.357158
          ...   
913     0.051956
914     0.055294
915     0.027373
918     0.064888
923     0.032015
926     0.041906
931     0.044590
934     0.028805
939     0.028095
944     0.026590
945     0.030012
947     0.035311
950     0.107672
951     0.036746
953     0.042537
959     0.025906
962     0.027261
963     0.098078
968     0.028673
969     0.026640
974     0.030767
984     0.026947
987     0.036302
989     0.028182
991     0.028039
992     0.026788
994     0.025171
996     0.0255

### Part 2B - New Connections Prediction

For the last part of this assignment, you will predict future connections between employees of the network. The future connections information has been loaded into the variable `future_connections`. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the `Future Connection` column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.
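
The tuple-valued index can be illustrated with a small in-memory sample. The three rows below are a hypothetical stand-in for `Future_Connections.csv`, and the file layout (an unnamed first column holding the node pair as a string) is an assumption. The sketch uses `ast.literal_eval`, which only parses Python literals, as a more restrictive way to turn each string into a tuple.

```python
import ast
import io

import pandas as pd

# Hypothetical three-row sample; the last row has no label yet.
sample_csv = io.StringIO(
    ',Future Connection\n'
    '"(6, 840)",0.0\n'
    '"(97, 226)",1.0\n'
    '"(107, 348)",\n'
)

# The converter for column 0 turns the string "(6, 840)" into the tuple (6, 840).
sample = pd.read_csv(sample_csv, index_col=0, converters={0: ast.literal_eval})

first_pair = sample.index[0]
n_missing = int(sample['Future Connection'].isnull().sum())
print(first_pair, n_missing)
```

Rows with a missing `Future Connection` value are the ones a classifier would need to score.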

In [7]:
future_connections = pd.read_csv('Future_Connections.csv', index_col=0, converters={0: eval})
future_connections.head(10)

| Unnamed: 0 | Future Connection |
|---|---|
| (6, 840) | 0.0 |
| (4, 197) | 0.0 |
| (620, 979) | 0.0 |
| (519, 872) | 0.0 |
| (382, 423) | 0.0 |
| (97, 226) | 1.0 |
| (349, 905) | 0.0 |
| (429, 860) | 0.0 |
| (309, 989) | 0.0 |
| (468, 880) | 0.0 |


Using network `G` and `future_connections`, identify the edges in `future_connections` with missing values and predict whether or not these edges will have a future connection.

To accomplish this, you will need to create a matrix of features for the edges found in `future_connections` using networkx, train a sklearn classifier on those edges in `future_connections` that have `Future Connection` data, and predict a probability of the edge being a future connection for those edges in `future_connections` where `Future Connection` is missing.
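
One way to build such an edge feature matrix is to pass the candidate pairs to networkx's link prediction functions through their `ebunch` argument, so that only those pairs are scored. The sketch below does this on a hypothetical five-node graph with two candidate pairs; the graph, the pairs, and the choice of four measures are assumptions made for illustration.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph and candidate (currently unconnected) node pairs.
toy = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])
candidate_pairs = [(0, 3), (0, 4)]

# Each function yields (u, v, score) triples in the same order as the pairs passed in.
measures = {
    'jaccard': nx.jaccard_coefficient,
    'resource_allocation': nx.resource_allocation_index,
    'adamic_adar': nx.adamic_adar_index,
    'preferential_attachment': nx.preferential_attachment,
}

# One row per candidate pair, one column per measure.
# tupleize_cols=False keeps each pair as a single tuple label in the index.
edge_features = pd.DataFrame(
    {name: [score for _, _, score in func(toy, candidate_pairs)]
     for name, func in measures.items()},
    index=pd.Index(candidate_pairs, tupleize_cols=False),
)
print(edge_features)
```

Nodes 0 and 3 share one neighbor (node 2) out of three distinct neighbors, so their Jaccard coefficient is 1/3, while nodes 0 and 4 share none and score 0. When `ebunch` is omitted, these functions score every non-existent edge in the graph, which is considerably more work on a network of this assignment's size.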



Your predictions will need to be given as the probability of the corresponding edge being a future connection.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.88 or higher will receive full points, and a model with an AUC of 0.82 or higher will pass (get 80% of the full points).
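
Since the grader's labels are not available, a held-out split of the labeled rows gives a rough estimate of the AUC before submitting. The sketch below shows the pattern on synthetic data from `make_classification`; the dataset, the classifier, and the default 75/25 split are assumptions for the demonstration, and the number it prints says nothing about the email network.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled feature matrix.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

# Hold out 25% of the labeled rows (the train_test_split default).
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# Score the held-out rows with the positive-class probability, as the grader would.
holdout_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(round(holdout_auc, 3))
```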

Using your trained classifier, return a series of length 122112 with the data being the probability of the edge being a future connection, and the index being the edge as represented by a tuple of nodes.

    Example:
    
        (107, 348)    0.35
        (542, 751)    0.40
        (20, 426)     0.55
        (50, 989)     0.35
                  ...
        (939, 940)    0.15
        (555, 905)    0.35
        (75, 101)     0.65
        Length: 122112, dtype: float64

In [8]:
#from operator import itemgetter
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler


def new_connections_predictions():
    
    # Choose some of the Link Prediction Models to populate our dataframe.  
    # Need to make a dual-index dataframe to hold the two nodes
           
    # Measure 2: Jaccard Coefficient -- Returns tuples of the two nodes, and the jaccard coefficient value
    JACO = list(nx.jaccard_coefficient(G))
    future_connections['JACO'] = pd.Series([z for x, y, z in JACO], index=[(x,y) for x, y, z in JACO])

    # Measure 3: Resource Allocation -- Returns tuples of the two nodes, and the resource alloc index
    RSAL = list(nx.resource_allocation_index(G))
    future_connections['RSAL'] = pd.Series([z for x, y, z in RSAL], index=[(x,y) for x, y, z in RSAL])
    
    # Measure 4: Adamic-Adar -- Returns tuples of the two nodes, and the adamic-adar score
    ADAR = list(nx.adamic_adar_index(G))
    future_connections['ADAR'] = pd.Series([z for x, y, z in ADAR], index=[(x,y) for x, y, z in ADAR])

    # Measure 5: Preferential Attachment -- Returns tuples of the two nodes, and the PA score
    PRAT = list(nx.preferential_attachment(G))
    future_connections['PRAT'] = pd.Series([z for x, y, z in PRAT], index=[(x,y) for x, y, z in PRAT])

    # Measure 6: Community Common Neighbors -- Returns tuples of the two nodes, and the CCN score
    COCN = list(nx.cn_soundarajan_hopcroft(G, community='Department'))
    future_connections['COCN'] = pd.Series([z for x, y, z in COCN], index=[(x,y) for x, y, z in COCN])

    # Measure 7: Community Resource Allocation -- Returns tuples of the two nodes, and the community resource allocation score
    CORA = list(nx.ra_index_soundarajan_hopcroft(G, community='Department'))
    future_connections['CORA'] = pd.Series([z for x, y, z in CORA], index=[(x,y) for x, y, z in CORA])
    
    df_target  = future_connections[future_connections['Future Connection'].notnull()]
    df_predict = future_connections[future_connections['Future Connection'].isnull()]
    X = df_target.drop('Future Connection', axis=1)
    y = df_target['Future Connection']
    X_predict = df_predict.drop('Future Connection', axis=1)

    # Create Train Test Split sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)                                       
    
    # Create a Decision Tree Classifier and fit
    #DTclf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
    #X_predict['clf_probability'] = [p[1] for p in DTclf.predict_proba(X_predict)]

    # Create an SVC Classifier and fit -- NOT GOOD FOR LARGE DATASETS
    #SVclf = SVC(gamma=0.1, kernel='rbf', probability=True).fit(X_train, y_train)
    #X_predict['clf_probability'] = [p[1] for p in SVclf.predict_proba(X_predict)]

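    # Create an MLP Classifier and fit
    MLPclf = MLPClassifier(alpha = 5, random_state = 0, solver='lbfgs', verbose=0).fit(X_train, y_train)
    # Column 1 of predict_proba() is the probability of a future connection.
    # Build a new Series rather than adding a column to X_predict, which is a slice of future_connections.
    probabilities = pd.Series(MLPclf.predict_proba(X_predict)[:, 1], index=X_predict.index)

    return probabilities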

new_connections_predictions()

(107, 348)    0.033118
(542, 751)    0.012073
(20, 426)     0.617300
(50, 989)     0.012277
(942, 986)    0.013017
(324, 857)    0.012249
(13, 710)     0.165424
(19, 271)     0.104716
(319, 878)    0.012348
(659, 707)    0.012094
(49, 843)     0.012413
(208, 893)    0.012143
(377, 469)    0.005814
(405, 999)    0.022379
(129, 740)    0.018862
(292, 618)    0.020468
(239, 689)    0.012363
(359, 373)    0.007954
(53, 523)     0.095032
(276, 984)    0.012402
(202, 997)    0.012445
(604, 619)    0.039557
(270, 911)    0.012359
(261, 481)    0.069446
(200, 450)    0.996532
(213, 634)    0.012010
(644, 735)    0.040377
(346, 553)    0.011513
(521, 738)    0.010597
(422, 953)    0.018749
                ...   
(672, 848)    0.012359
(28, 127)     0.967726
(202, 661)    0.011430
(54, 195)     0.999924
(295, 864)    0.012217
(814, 936)    0.012041
(839, 874)    0.013017
(139, 843)    0.012235
(461, 544)    0.009988
(68, 487)     0.009884
(622, 932)    0.012167
(504, 936)    0.016144
(479, 528) 