# Social Network Analysis using ML

This script is based on an assignment given in the 'Python for Social Network Analysis' course on Coursera.


**Note: this script requires networkx version 1.11 (not the more recent networkx 2.0). Run `pip install networkx==1.11` to downgrade.**

In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import pickle

# !pip install networkx==1.11

---

## Company Emails

We are working with a company's email network where each node corresponds to a person at the company, and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes `Department` and `ManagementSalary`.

`Department` indicates the department in the company which the person belongs to, and `ManagementSalary` indicates whether that person is receiving a management position salary.

In [2]:
G = nx.read_gpickle('email_prediction.txt')

print(nx.info(G))

Name: 
Type: Graph
Number of nodes: 1005
Number of edges: 16706
Average degree:  33.2458


### Part A - Salary Prediction

Using network `G`, we identify the people in the network with missing values for the node attribute `ManagementSalary` and predict whether or not these individuals are receiving a management position salary.
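As a minimal sketch of how nodes with a missing `ManagementSalary` can be picked out, the example below builds a tiny toy graph. The toy graph and its attribute values are made up for illustration and are not part of the company network.

```python
import networkx as nx
import pandas as pd

# Toy graph with hypothetical attribute values (not the real email network)
toy = nx.Graph()
toy.add_node(0, Department=1, ManagementSalary=0.0)
toy.add_node(1, Department=1, ManagementSalary=float('nan'))  # label missing
toy.add_node(2, Department=2, ManagementSalary=1.0)
toy.add_edges_from([(0, 1), (1, 2)])

# Collect the attribute into a Series indexed by node id
salary = pd.Series(nx.get_node_attributes(toy, 'ManagementSalary'))

# Nodes to predict (label missing) and nodes available for training
missing_nodes = salary[salary.isnull()].index.tolist()
labelled_nodes = salary[salary.notnull()].index.tolist()
print(missing_nodes, labelled_nodes)
```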

To accomplish this, we create a matrix of node features using networkx, train a sklearn classifier on nodes that have `ManagementSalary` data, and predict a probability of the node receiving a management salary for nodes where `ManagementSalary` is missing.
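The idea of a node feature matrix can be sketched on a small toy graph. The graph below is an assumption made for illustration; the feature names mirror the ones used later in this script. Wrapping `degree()` in `dict(...)` is a choice made here so the sketch does not depend on the networkx version.

```python
import networkx as nx
import pandas as pd

# Toy graph: a triangle (1, 2, 3) with a pendant node 4 attached to node 3
toy = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4)])

# One row per node, one column per structural feature
features = pd.DataFrame(index=sorted(toy.nodes()))
features['Degree'] = pd.Series(dict(toy.degree()))
features['Clustering'] = pd.Series(nx.clustering(toy))
features['Close Centrality'] = pd.Series(nx.closeness_centrality(toy))

print(features)
```

Each row of `features` can then be paired with the node's label (where known) to form the training data.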



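The predictions are given as the probability that the corresponding employee is receiving a management position salary. Note that the Ridge regressor used below produces unbounded scores rather than calibrated probabilities, so some values can fall outside [0, 1]; the AUC depends only on how the scores rank the nodes.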

The evaluation metric is the Area Under the ROC Curve (AUC). A model with an AUC of 0.80 or higher is deemed satisfactory.
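To make the metric concrete, here is a small sketch with made-up labels and scores (they are not taken from the email data). The AUC is the fraction of positive/negative pairs in which the positive example receives the higher score, so it depends only on the ranking of the scores.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and model scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly
auc = roc_auc_score(y_true, scores)
print(auc)  # 0.75

# A monotonic rescaling of the scores leaves the AUC unchanged
auc_rescaled = roc_auc_score(y_true, [2 * s - 0.5 for s in scores])
print(auc_rescaled)  # 0.75
```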

Using the trained classifier, we return a series of length 252 with the data being the probability of receiving management salary, and the index being the node id.

    Example:
    
        1       1.0
        2       0.0
        5       0.8
        8       1.0
            ...
        996     0.7
        1000    0.5
        1001    0.0
        Length: 252, dtype: float64

In [3]:
def salary_predictions():
    
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import roc_auc_score

    # create dataframe of nodes
    df = pd.DataFrame(index=G.nodes())
    df['Department']        = pd.Series(nx.get_node_attributes(G,      'Department'))
    df['ManagementSalary']  = pd.Series(nx.get_node_attributes(G,'ManagementSalary'))
    df['Degree']            = pd.Series(dict(G.degree()))
    df['Degree Centrality'] = pd.Series(nx.degree_centrality(G))
    df['Close Centrality']  = pd.Series(nx.closeness_centrality(G))
    df['Betw Centrality']   = pd.Series(nx.betweenness_centrality(G))
    df['Clustering']        = pd.Series(nx.clustering(G))

    # train the model on the rows whose ManagementSalary is known
    df_no_nan = df[df.ManagementSalary.notnull()]
    X = df_no_nan.drop('ManagementSalary',axis=1)
    y = df_no_nan['ManagementSalary']
    
    # split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # normalise
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled  = scaler.transform(X_test)

    # create ML model - use ridge
    rgr = Ridge()
    
    # test for the best value of the regularisation parameter alpha
    param_grid = {'alpha':(0.2, 0.4, 0.6, 0.8, 1.0)}
    grid_rgr = GridSearchCV(rgr,scoring='roc_auc',param_grid=param_grid)
    grid_rgr.fit(X_train_scaled,y_train)

    best_alpha = grid_rgr.best_params_['alpha']

    print('Best alpha:',best_alpha)

    rgr = Ridge(alpha=best_alpha).fit(X_train_scaled,y_train)

    # calculate and report the AUC on the held-out test set (the printed label says 'Training set')
    y_pred = rgr.predict(X_test_scaled)
    score = roc_auc_score(y_test,y_pred)
    print('Training set score: {0:.1f}%'.format(100*score))

    # Re-run on whole dataset to predict null values:
    # find the null values, assign to new dataframe
    df_assign = df[df['ManagementSalary'].isnull()]

    # Perform the predictions
    X_assign = df_assign.drop('ManagementSalary',axis=1)
    X_assign_scaled = scaler.transform(X_assign)
    y_assign_pred = pd.Series(index=df_assign.index,data=rgr.predict(X_assign_scaled))

    return y_assign_pred

y=salary_predictions()

Best alpha: 1.0
Training set score: 86.4%


In [4]:
print('Salary predictions:')
print(y[:10])

Salary predictions:
1     0.188365
2     0.494295
5     1.145811
8     0.164430
14    0.344260
18    0.237219
27    0.296099
30    0.302561
31    0.207467
34    0.148646
dtype: float64
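Some of the scores above exceed 1 (node 5 gets 1.145811), which is expected from a Ridge regressor. If values in [0, 1] are wanted, two options can be sketched: clipping the raw scores, or swapping in a classifier with `predict_proba`. The example below is only an illustration on synthetic data; `LogisticRegression` is a substitute technique and not the model used in this script.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Option 1: clip raw regression scores (values copied from the output above)
raw_scores = np.array([0.188365, 0.494295, 1.145811])
clipped = np.clip(raw_scores, 0.0, 1.0)
print(clipped)

# Option 2: a classifier that outputs probabilities directly.
# Synthetic features and labels, purely for illustration.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)[:, 1]  # probability of class 1
print(proba.min(), proba.max())
```

Clipping does not change the ranking among unclipped values, so it has little effect on the AUC; a classifier changes the model itself and would need to be re-evaluated.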


### Part B - New Connections Prediction

For the last part, we predict future connections between employees of the network. The future connections information has been loaded into the variable `future_connections`. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the `Future Connection` column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.

In [5]:
import ast  # literal_eval parses the "(u, v)" strings safely, unlike eval
future_connections = pd.read_csv('Future_Connections.csv', index_col=0, converters={0: ast.literal_eval})
future_connections.head(10)

Unnamed: 0,Future Connection
"(6, 840)",0.0
"(4, 197)",0.0
"(620, 979)",0.0
"(519, 872)",0.0
"(382, 423)",0.0
"(97, 226)",1.0
"(349, 905)",0.0
"(429, 860)",0.0
"(309, 989)",0.0
"(468, 880)",0.0
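The node pairs are stored in the CSV as strings such as `"(6, 840)"` and have to be converted into tuples. A minimal sketch of that step, using `ast.literal_eval` on a small in-memory CSV, is shown below. The CSV text is made up to mimic the file layout, including one row with a missing label.

```python
import ast
import io
import pandas as pd

# In-memory stand-in for the CSV file (hypothetical rows)
csv_text = (
    ',Future Connection\n'
    '"(6, 840)",0.0\n'
    '"(4, 197)",\n'
    '"(97, 226)",1.0\n'
)

# Column 0 holds the "(u, v)" strings; literal_eval turns each into a tuple
demo = pd.read_csv(io.StringIO(csv_text), converters={0: ast.literal_eval})

pairs = demo.iloc[:, 0].tolist()
n_missing = int(demo['Future Connection'].isnull().sum())
print(pairs, n_missing)
```

`ast.literal_eval` only accepts Python literals, so a malformed or malicious cell raises an error instead of being executed.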


Using network `G` and `future_connections`, we identify the edges in `future_connections` with missing values and predict whether or not these edges will have a future connection.

To accomplish this, we create a matrix of features for the edges found in `future_connections` using networkx, train a sklearn classifier on those edges in `future_connections` that have `Future Connection` data, and predict a probability of the edge being a future connection for those edges in `future_connections` where `Future Connection` is missing.
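The edge features can be illustrated on a toy graph. In this sketch the candidate pairs are passed directly to networkx through the `ebunch` argument and the scores are looked up by pair, so nothing depends on iteration order. The toy graph and the helper `score_dict` are assumptions made for illustration.

```python
import networkx as nx
import pandas as pd

# Toy graph: a triangle (1, 2, 3) with a pendant node 4 attached to node 3
toy = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4)])

# Candidate pairs that are not currently connected
pairs = [(1, 4), (2, 4)]

def score_dict(generator):
    """Hypothetical helper: turn (u, v, score) triples into {(u, v): score}."""
    return {(u, v): p for u, v, p in generator}

jaccard = score_dict(nx.jaccard_coefficient(toy, pairs))
pref = score_dict(nx.preferential_attachment(toy, pairs))

# One row per candidate pair, one column per link-prediction score
rows = [{'pair': pair, 'jai': jaccard[pair], 'pai': pref[pair]} for pair in pairs]
edge_features = pd.DataFrame(rows)
print(edge_features)
```

For the pair (1, 4), the neighbour sets are {2, 3} and {3}, giving a Jaccard coefficient of 1/2, and the degrees are 2 and 1, giving a preferential attachment score of 2.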



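The predictions are given as the probability of the corresponding edge being a future connection. The Ridge regressor used below returns ranking scores rather than calibrated probabilities, so some values may be slightly negative.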

The evaluation metric is the Area Under the ROC Curve (AUC). A model with an AUC of 0.80 or higher is deemed satisfactory.

Using the trained classifier, we return a series of length 122,112 with the data being the probability of the edge being a future connection, and the index being the edge as represented by a tuple of nodes.

    Example:
    
        (107, 348)    0.35
        (542, 751)    0.40
        (20, 426)     0.55
        (50, 989)     0.35
                  ...
        (939, 940)    0.15
        (555, 905)    0.35
        (75, 101)     0.65
        Length: 122112, dtype: float64

In [6]:
def new_connections_predictions():
    
    # import ML
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import roc_auc_score
    
    # Compute link-prediction scores over all non-edges of G, keeping only the pairs listed in future_connections
    # Jaccard Index
    jai = [((e[0],e[1]),e[2]) for e in list(nx.jaccard_coefficient(G)) if (e[0], e[1]) in future_connections.index]

    # Resource Allocation Index
    rai = [((e[0],e[1]),e[2]) for e in list(nx.resource_allocation_index(G)) if (e[0], e[1]) in future_connections.index]

    # Adamic Adar Index
    aai = [((e[0],e[1]),e[2]) for e in list(nx.adamic_adar_index(G)) if (e[0], e[1]) in future_connections.index]

    # Preferential Attachment
    pai = [((e[0],e[1]),e[2]) for e in list(nx.preferential_attachment(G)) if (e[0], e[1]) in future_connections.index]

    # Create a sorted copy of future_connections, remembering the original row order
    train_df = future_connections.copy()
    train_df['Old Index Location'] = range(len(train_df))
    train_df = train_df.sort_index()

    # Attach the scores positionally; this relies on the filtered lists being in the same order as the sorted index
    train_df['jai'] = [e[1] for e in jai]
    train_df['rai'] = [e[1] for e in rai]
    train_df['aai'] = [e[1] for e in aai]
    train_df['pai'] = [e[1] for e in pai]

    # Remove the parts of the df that have null Future Connection values
    nan_df     = train_df[ train_df['Future Connection'].isnull()]
    non_nan_df = train_df[~train_df['Future Connection'].isnull()]

    
    # Get X,y
    X = non_nan_df.drop('Future Connection',axis=1)
    y = non_nan_df['Future Connection']

    # Split the non-null into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # normalise
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled  = scaler.transform(X_test)

    # create the ML model using ridge
    rgr = Ridge()

    # find the best regularisation value
    param_grid = {'alpha':(0.2, 0.4, 0.6, 0.8, 1.0)}
    grid_rgr = GridSearchCV(rgr,scoring='roc_auc',param_grid=param_grid)
    grid_rgr.fit(X_train_scaled,y_train)
    best_alpha = grid_rgr.best_params_['alpha']
    
    
    # fit the model
    rgr = Ridge(alpha=best_alpha).fit(X_train_scaled,y_train)
    
    # calculate and report the AUC on the held-out test set (the printed label says 'Training set')
    y_pred = rgr.predict(X_test_scaled)
    score = roc_auc_score(y_test,y_pred)
    print('Training set score: {0:.1f}%'.format(100*score))
    
    
    # predict scores for the edges whose Future Connection value is missing
    X_assign = nan_df.drop('Future Connection',axis=1)
    X_assign_scaled = scaler.transform(X_assign)
    y_assign_pred = pd.Series(index=nan_df.index,data=rgr.predict(X_assign_scaled))


    guess_df = pd.DataFrame(index=nan_df.index)
    guess_df['Old Index Location'] = nan_df['Old Index Location']
    guess_df['Future Connection'] = y_assign_pred
    
    # return values to original indexing
    guess_df['Index Tuple'] = nan_df.index
    answer = guess_df.set_index('Old Index Location').sort_index().set_index('Index Tuple')['Future Connection']
    
    return pd.Series(answer)

z=new_connections_predictions()

Training set score: 90.4%


In [7]:
print('Likelihood of new connection:')
print(z[:10])

Likelihood of new connection:
Index Tuple
(107, 348)    0.075898
(542, 751)   -0.004899
(20, 426)     0.407092
(50, 989)    -0.004247
(942, 986)   -0.003549
(324, 857)   -0.004337
(13, 710)     0.247795
(19, 271)     0.213620
(319, 878)   -0.004022
(659, 707)   -0.004832
Name: Future Connection, dtype: float64
