# Web Analytics - Project-1

*June 20, 2017*

## Group 1 Members:
* Mauricio Alarcon
* Sekhar Mekala
* Aadi Kalloo
* Srinivasa Illapani
* Param Singh

## I. Background

The main goal of this project is to analyze a network data set that describes hamster owners and the friendships between them. The owners maintain details about their hamster(s) on a website, and they also maintain friendships with other owners on the same website. This data makes it possible to study relationships between people based on attributes of the hamsters they own. For instance, we can check whether the gender of a hamster has any relationship to the number of friends its owner has. We can also check whether owners of one hamster species tend to befriend owners of the same species more often than owners of different species. The data for this project is available as a compressed folder at http://konect.uni-koblenz.de/networks/petster-hamster. The compressed folder contains the 4 files described below:

   * **ent.petster-hamster** - Contains hamster attributes such as ID, name, date joined, species, color, gender, birthday, age, hometown, favorite toy, favorite activity, favorite food.
    
   * **out.petster-hamster** - Contains mapping between IDs, representing friendship between the owners of the hamsters with respective IDs.
    
   * **README.petster-hamster** - Information about the files and citation requirements
    
   * **meta.petster-hamster** - Contains information about the data

## II. Requirements

For this project, we will use the **ent.petster-hamster** and **out.petster-hamster** files. Specifically, we want to identify whether the gender of the hamster has any effect on the centrality measures of the graph. We will use the degree centrality and eigenvector centrality measures for this project. The requirements of this project are given below:

1. Build an undirected graph using the data in the *out.petster-hamster* data set. Each node represents a hamster, and each edge represents a friendship between the owners of the two hamsters.

2. Extract the gender of the hamster from *ent.petster-hamster*, and update each node of the graph.

3. Get the degree centrality and eigenvector centrality measures of each node.

4. Test the following hypotheses. A confidence level of 95% should be used in these tests:

    _Hypothesis:1_
    
    $H_0:$ The mean degree centrality is the same for male and female hamsters

    $H_1:$ The mean degree centrality differs between male and female hamsters

    _Hypothesis:2_
    
    $H_0:$ The mean eigenvector centrality is the same for male and female hamsters

    $H_1:$ The mean eigenvector centrality differs between male and female hamsters

5. Additional requirements (to be addressed later, if time permits)

Using Gephi, plot the center nodes and the nodes with higher eigenvector centrality values, and analyze how these nodes are distributed in the complete graph. Using the island method, identify subgraphs and check the attributes of the nodes in those subgraphs.

## III. Data cleansing

Importing all the required packages for this project.

In [57]:
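##Import all the required packages
import networkx as net
import urllib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sklearn
import scipy.stats as stats
#display is built into Jupyter; the explicit import keeps the cells runnable as a plain script
from IPython.display import display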

The following code block will display the initial records of the file: **ent.petster-hamster**

In [2]:
#Reading and displaying the initial lines of ent.petster-hamster
with open('data/ent.petster-hamster') as f:
    i = 0
    for line in f:
        print(line)
        i = i+1
        #Break after printing the first 6 lines (3 header lines and 3 records)
        if i > 5:
            break

%

%

% ent dat.name dat.joined dat.species dat.coloring dat.gender dat.birthday dat.age dat.hometown dat.favorite_toy dat.favorite_activity dat.favorite_food

"1" "Dexter" "August 1, 2006" "Hamster (Dwarf)" "Blonde" "Male" "June 20, 2006" "Gone to Hamster Heaven" "Vernon Hills, IL&nbsp; United States" "Toilet Paper, paper towels, etc" "Toilet Paper, paper towels, etc" "Lettuce"

"2" "Tonks" "April 15, 2004" "Hamster (Syrian)" "Golden" "Female" "June 20, 2003" "Gone to Hamster Heaven" "Windsor Mill, MD&nbsp; United States" "wooden hamster bunker" "wooden hamster bunker" "just about anything"

"3" "Bunny-Rabbit" "October 27, 2004" "Hamster (Syrian)" "Black" "Female" "August 1, 2002" "Gone to Hamster Heaven" "Chatsworth, GA&nbsp; United States" "Carpet?" "Carpet?" "Doritoes"



From the above display, we can infer that the first 3 lines (which start with %) must be skipped, although the third line lists the field names. Each field is enclosed in double quotes, and the fields are separated by a single space, so we split each record on the sequence '" "'. After splitting, the first field still carries its opening quote and the last field still carries its closing quote. We need only the ent field, which uniquely identifies a hamster, and the gender field; the code below also keeps one more attribute per node. We strip the opening quote from the ent field, and we leave the closing quote of the last field in place because that field is not used.
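
The splitting rule described above can be checked on a single record. The following is a minimal sketch using only the standard library; the helper name `parse_quoted_line` is hypothetical and the sample line is the first record shown in the display above. It also makes the field positions explicit: index 3 is `dat.species`, index 4 is `dat.coloring` and index 5 is `dat.gender`.

```python
def parse_quoted_line(line):
    """Split one record of space-separated, double-quoted fields into plain strings."""
    fields = line.strip().split('" "')
    # The first field keeps its opening quote and the last field its closing quote
    fields[0] = fields[0].lstrip('"')
    fields[-1] = fields[-1].rstrip('"')
    return fields

# First record of ent.petster-hamster, copied from the display above
sample = ('"1" "Dexter" "August 1, 2006" "Hamster (Dwarf)" "Blonde" "Male" '
          '"June 20, 2006" "Gone to Hamster Heaven" '
          '"Vernon Hills, IL&nbsp; United States" '
          '"Toilet Paper, paper towels, etc" "Toilet Paper, paper towels, etc" "Lettuce"')

fields = parse_quoted_line(sample)
print(int(fields[0]), fields[3], fields[4], fields[5])
```

Splitting on `'" "'` works here because no field in the sample contains that three-character sequence; a field with embedded quotes would need a more careful parser.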

In [6]:
##Open the file
with open('data/ent.petster-hamster') as f:
    #Create a list object to collect the data
    node_type = list()
    #Iteratively read the lines in the file
    for line in f:
        #Skip the header lines, which start with "%"
        if line.startswith("%"):
            continue
        data = line.strip().split('" "')
        
        #Extract the ent field (index 0), dropping its initial quote, and the gender field (index 5).
        #Note: index 4 is dat.coloring (dat.species is index 3), so the column
        #named "species" below actually holds the coloring values.
        node_type.append([int(data[0][1:]),data[4].strip(),data[5].strip()])
        
        
node_type_df = pd.DataFrame(node_type,columns=["node","species","gender"])
print("Displaying initial records from the node_type_df data frame")
display(node_type_df.head())

print("Does the data frame have any NA values?")
node_type_df.isnull().values.any()


Displaying initial records from the node_type_df data frame


| | node | species | gender |
|---|---|---|---|
| 0 | 1 | Blonde | Male |
| 1 | 2 | Golden | Female |
| 2 | 3 | Black | Female |
| 3 | 4 | Cream | Female |
| 4 | 5 | Cream | Female |


Does the data frame have any NA values?


False

The above display confirms that the records were read correctly, and there are no NA values in the data.

The following code block will display the initial records of the file: **out.petster-hamster**

In [7]:
#Reading and displaying the initial lines of out.petster-hamster
with open('data/out.petster-hamster') as f:
    i = 0
    for line in f:
        print(line)
        i = i+1
        #Break after printing the first 6 lines (1 header line and 5 records)
        if i > 5:
            break

% sym unweighted

99 98

999 550

999 549

999 42

999 25



The above display shows that there are 2 fields representing the edge between two nodes. We will skip the initial line and read the data into a pandas data frame named: *edges_df*

In [8]:
#Skip the first row and read the rest of the data into a data frame
#(a forward slash keeps the path portable and matches the earlier cells)
edges_df = pd.read_csv('data/out.petster-hamster', skiprows=1,header=None, sep=' ')

#Name the columns as node1 and node2
edges_df.columns = ["node1","node2"]

#Displaying initial records of the edges_df data frame
print("Displaying the initial records of the edges_df")
display(edges_df.head())

#Check if the data frame has any NA values
print("Did edges_df have any NA values?")
edges_df.isnull().values.any()


Displaying the initial records of the edges_df


| | node1 | node2 |
|---|---|---|
| 0 | 99 | 98 |
| 1 | 999 | 550 |
| 2 | 999 | 549 |
| 3 | 999 | 42 |
| 4 | 999 | 25 |


Did edges_df have any NA values?


False

## IV. Building a graph

The following code block will create a graph using the NetworkX package. It will also add the gender and species attributes to each node.

In [9]:
#Create a graph
#(from_pandas_dataframe is the NetworkX 1.x name; NetworkX 2.0 renamed it to from_pandas_edgelist)
GA = net.from_pandas_dataframe(edges_df,source="node1",target="node2")


#Add the gender and species attributes to each node
#(NetworkX 1.x argument order: graph, attribute name, values; NetworkX 2.0 changed it to graph, values, name)
gender_type = dict(zip(node_type_df["node"],node_type_df["gender"]))
species_type = dict(zip(node_type_df["node"],node_type_df["species"]))
net.set_node_attributes(GA,'gender',gender_type)
net.set_node_attributes(GA,'species',species_type)

#Print graph summary:
print("Graph built successfully. The graph summary is displayed below:")
print(net.info(GA))


Graph built successfully. The graph summary is displayed below:
Name: 
Type: Graph
Number of nodes: 2426
Number of edges: 16631
Average degree:  13.7106


The summary shows that, on average, each node (that is, each hamster's owner) is connected to about 14 other owners (the average degree is 13.71). There are 2426 owners and 16631 edges, or connections, between them. Let us now check whether the graph has any disconnected components or sub-graphs.
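
As a quick sanity check on the summary, the average degree of an undirected graph can be recomputed from the node and edge counts alone, because every edge contributes to the degree of two nodes. The sketch below is illustrative only; the function name `average_degree` is our own and is not part of NetworkX.

```python
def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: each edge adds 1 to the degree of two nodes."""
    return 2 * num_edges / num_nodes

# Node and edge counts taken from the graph summary above
full_graph_avg = average_degree(2426, 16631)
print(round(full_graph_avg, 4))
```

The rounded result, 13.7106, agrees with the `Average degree` line printed in the summary.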

In [10]:
#Use "not" for the check; "~" is bitwise inversion, and ~True (-2) is still truthy
if not net.is_connected(GA):
    #connected_component_subgraphs is available in NetworkX 1.x (it was removed in NetworkX 2.4)
    GA_list = list(net.connected_component_subgraphs(GA))

    print("There are {0} component subgraphs".format(len(GA_list)))

    #x holds the 1-based position of the current component in GA_list
    x = 1
    for i in GA_list:
        if(len(i.nodes()) > 20):
            print("Display summary of component sub-graphs, if and only if the number of nodes > 20")
            print("----------------------------------")
            print("Summary details of component sub-graph: {0} ".format(x))
            print("----------------------------------")
            print(net.info(i))
            print("Diameter:{0}".format(net.diameter(i)))
            print("Radius:{0}".format(net.radius(i)))
        x += 1
            
else:
    print("The graph does not have any disconnected components")

There are 148 component subgraphs
Display summary of component sub-graphs, if and only if the number of nodes > 20
----------------------------------
Summary details of component sub-graph: 1 
----------------------------------
Name: 
Type: Graph
Number of nodes: 2000
Number of edges: 16098
Average degree:  16.0980
Diameter:10
Radius:6


The above display shows that there are 148 component sub-graphs, and only one of them has more than 20 nodes. For the rest of the analysis, we will treat the largest component sub-graph (2000 nodes) as the main graph and discard the remaining 147 component sub-graphs, which together contain only 426 of the 2426 nodes.
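
To illustrate what splitting a graph into component sub-graphs means, here is a minimal breadth-first-search sketch in plain Python. It is not how NetworkX is implemented internally; the function name `connected_components` and the toy adjacency dictionary are assumptions made for this example.

```python
from collections import deque

def connected_components(adjacency):
    """Return a list of node sets, one per connected component, largest first."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical toy graph: a triangle, a single edge, and an isolated node
toy_graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2},
    4: {5}, 5: {4},
    6: set(),
}
components = connected_components(toy_graph)
main_component = components[0]
print(len(components), main_component)
```

Sorting by size makes the choice of the main component explicit, instead of relying on the order in which components happen to be returned.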

## V. Centralities
The following code block will obtain the degree, eigenvector, and betweenness centralities of the main graph.

In [12]:
#Use the first component as the main graph (the 2000-node component in the summary above)
GA = GA_list[0]
#Get eigen vector centralities
eigen_centrality=net.eigenvector_centrality(GA)

#Get degree centralities
degree_centrality=net.degree_centrality(GA)

#Get betweenness centralities
betweenness_centrality=net.betweenness_centrality(GA)


The top 5 nodes with highest centrality measures are displayed below:

In [42]:
#print(eigen_centrality)
temp_df1=pd.DataFrame(sorted(eigen_centrality.items(),key=lambda x:x[1],reverse=True),\
                    columns = ["node","eigen_centrality_score"])

print("Top 5 nodes based on the eigen vector centrality measure:")
display(pd.merge(temp_df1[:5],node_type_df, on="node"))


#print(degree_centrality)
temp_df2=pd.DataFrame(sorted(degree_centrality.items(),key=lambda x:x[1],reverse=True),\
                    columns = ["node","degree_centrality_score"])

print("Top 5 nodes based on the degree centrality measure:")
display(pd.merge(temp_df2[:5],node_type_df, on="node"))


#print(betweenness_centrality)
temp_df3=pd.DataFrame(sorted(betweenness_centrality.items(),key=lambda x:x[1],reverse=True),\
                    columns = ["node","betweenness_centrality_score"])

print("Top 5 nodes based on the betweenness centrality measure:")
display(pd.merge(temp_df3[:5],node_type_df, on="node"))

##Capture every centrality into a common data frame, along with species and gender:
combined_df = pd.merge(pd.merge(pd.merge(temp_df1,temp_df2, on="node"),\
                  temp_df3,on="node"),node_type_df,on="node")

Top 5 nodes based on the eigen vector centrality measure:


| | node | eigen_centrality_score | species | gender |
|---|---|---|---|---|
| 0 | 237 | 0.211469 | Golden Banded | Female |
| 1 | 238 | 0.185886 | Cream | Female |
| 2 | 168 | 0.143649 | Orange | Female |
| 3 | 177 | 0.123658 | Golden | Female |
| 4 | 356 | 0.120463 | Cream | Male |


Top 5 nodes based on the degree centrality measure:


| | node | degree_centrality_score | species | gender |
|---|---|---|---|---|
| 0 | 237 | 0.136568 | Golden Banded | Female |
| 1 | 238 | 0.111556 | Cream | Female |
| 2 | 44 | 0.086543 | Golden Banded | Female |
| 3 | 168 | 0.077039 | Orange | Female |
| 4 | 169 | 0.075538 | Beige | Male |


Top 5 nodes based on the betweenness centrality measure:


| | node | betweenness_centrality_score | species | gender |
|---|---|---|---|---|
| 0 | 237 | 0.081095 | Golden Banded | Female |
| 1 | 169 | 0.060421 | Beige | Male |
| 2 | 137 | 0.059367 | Golden Banded | Male |
| 3 | 238 | 0.040978 | Cream | Female |
| 4 | 251 | 0.039148 | Albino | Female |
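
The degree centrality scores reported above are normalized: NetworkX divides each node's degree by n - 1, the largest degree possible in a simple graph with n nodes. The sketch below reproduces that normalization on a hypothetical toy graph and then inverts it to recover the friend count implied by the top score. The function name and the toy data are our own.

```python
def degree_centrality(adjacency):
    """Degree of each node divided by (n - 1), the maximum possible degree."""
    n = len(adjacency)
    return {node: len(neighbours) / (n - 1) for node, neighbours in adjacency.items()}

# Hypothetical toy graph: a triangle (a, b, c) with a pendant node d attached to a
toy_graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
scores = degree_centrality(toy_graph)
print(scores)

# Inverting the normalization for node 237 (score 0.136568, main graph of 2000 nodes)
implied_degree = round(0.136568 * (2000 - 1))
print(implied_degree)
```

Under this normalization, the top-ranked node 237 is connected to 273 of the other 1999 nodes in the main graph.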


## VI. Hypothesis testing

In this section we will perform the hypothesis testing of the three centrality measures, to determine if there is any relationship between the gender of the hamster and its centrality measure. Unless otherwise specified, we will use 95% confidence level to conduct our hypothesis testing:

### VI.I. Effect of gender on the Degree centrality
To perform hypothesis testing of degree centrality, we will compute the average degree centrality of males ($dc_{\mu_m}$) and the average degree centrality of females ($dc_{\mu_f}$), and test the following hypotheses:

$$H_0: dc_{\mu_m} = dc_{\mu_f}$$
$$H_1: dc_{\mu_m} \ne dc_{\mu_f}$$

Or the above hypothesis can be re-written as:

$$H_0: dc_{\mu_m} - dc_{\mu_f} = 0$$
$$H_1: dc_{\mu_m} - dc_{\mu_f} \ne 0$$

The following code block will obtain the p-value based on the t-test score

In [59]:
male_dc = list(combined_df[(combined_df["gender"]=="Male")]["degree_centrality_score"])
female_dc = list(combined_df[(combined_df["gender"]=="Female")]["degree_centrality_score"])
stats.ttest_ind(male_dc, female_dc)

Ttest_indResult(statistic=-1.6266727664321396, pvalue=0.10396462032870302)

We obtained a p-value of about 10.4%, which is greater than the significance level of 5% (a 95% confidence level corresponds to a 5% significance level). We therefore fail to reject the null hypothesis: the data do not provide sufficient evidence that the mean degree centrality differs between male and female hamsters.
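
By default, `stats.ttest_ind` performs the equal-variance (pooled) two-sample t-test. The sketch below shows how that test statistic is formed, using only the standard library and hypothetical toy samples rather than the hamster data. The function name `pooled_t_statistic` is our own, and the p-value step is omitted because it needs the t distribution.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """Return the equal-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    dof = n_a + n_b - 2
    # Pooled sample variance, weighted by each sample's degrees of freedom
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / dof
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, dof

# Hypothetical toy samples, not taken from the hamster data
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = pooled_t_statistic(group_a, group_b)
print(t_stat, dof)
```

A negative statistic means the first sample has the smaller mean, which is how the sign of the `statistic` value in the results should be read.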

### VI.II. Effect of gender on the Eigenvector centrality

We will perform the following hypothesis testing to determine the effect of gender on the eigenvector centrality score.

$$H_0: ec_{\mu_m} = ec_{\mu_f}$$
$$H_1: ec_{\mu_m} \ne ec_{\mu_f}$$

Or the above hypothesis can be re-written as:

$$H_0: ec_{\mu_m} - ec_{\mu_f} = 0$$
$$H_1: ec_{\mu_m} - ec_{\mu_f} \ne 0$$

where $ec_{\mu_m}$ is the average eigenvector centrality of male hamsters and $ec_{\mu_f}$ is the average eigenvector centrality of female hamsters.

The following code block will obtain the p-value based on the t-test score


In [61]:
#Eigenvector centrality scores and their means, split by gender
male_ec = list(combined_df[(combined_df["gender"]=="Male")]["eigen_centrality_score"])
mu_male = np.mean(male_ec)
female_ec = list(combined_df[(combined_df["gender"]=="Female")]["eigen_centrality_score"])
mu_female = np.mean(female_ec)
stats.ttest_ind(male_ec, female_ec)

Ttest_indResult(statistic=-2.0712207587987468, pvalue=0.038466528960882658)

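Since the p-value is about 3.8%, which is less than the 5% significance level, we reject the null hypothesis and conclude that there is a relationship between gender and eigenvector centrality. The negative t statistic indicates that the mean eigenvector centrality of male hamsters is lower than that of female hamsters.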

### VI.III. Effect of gender on the betweenness centrality

We will perform the following hypothesis testing to determine the effect of gender on the betweenness centrality score.

$$H_0: bc_{\mu_m} = bc_{\mu_f}$$
$$H_1: bc_{\mu_m} \ne bc_{\mu_f}$$

Or the above hypothesis can be re-written as:

$$H_0: bc_{\mu_m} - bc_{\mu_f} = 0$$
$$H_1: bc_{\mu_m} - bc_{\mu_f} \ne 0$$

where $bc_{\mu_m}$ is the average betweenness centrality of male hamsters and $bc_{\mu_f}$ is the average betweenness centrality of female hamsters.

The following code block will obtain the p-value based on the t-test score


In [62]:
#Betweenness centrality scores and their means, split by gender
male_bc = list(combined_df[(combined_df["gender"]=="Male")]["betweenness_centrality_score"])
mu_male = np.mean(male_bc)
female_bc = list(combined_df[(combined_df["gender"]=="Female")]["betweenness_centrality_score"])
mu_female = np.mean(female_bc)
stats.ttest_ind(male_bc, female_bc)

Ttest_indResult(statistic=-0.9044146530500522, pvalue=0.36588482552664714)

The p-value of about 36.6% is well above the 5% significance level, so we are unable to reject the null hypothesis. The observed difference between the average betweenness values of male and female hamsters is consistent with random variation, and the data provide no evidence of a relationship between gender and betweenness centrality.

### VI.IV. What can we conclude based on the above hypothesis testing?

We can conclude the following:

* We found no statistically significant relationship between gender and the average degree centrality score (p ≈ 0.104). _So the data give no evidence that the number of connections of a node depends on the gender attribute_.

* The gender attribute has a statistically significant relationship with the average eigenvector centrality (p ≈ 0.038). _So the importance of a node, as measured by eigenvector centrality, is related to the gender attribute. Eigenvector centrality can identify nodes that do not have a high degree themselves but are still significant because they are connected to high-scoring nodes. In other words, a node's score depends on its neighbors' scores, and through them on nodes further away, not just on its immediate connections. In our network, this importance has some relationship to the gender attribute of the node_.

* We found no statistically significant relationship between gender and the average betweenness centrality score (p ≈ 0.366). _So a node with a high betweenness score may belong to either a male or a female hamster_.
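
The idea that a node's eigenvector centrality depends on its neighbors' scores can be made concrete with power iteration. The following is a minimal sketch on a hypothetical toy graph, not the NetworkX implementation; the function name, the fixed iteration count, and the toy data are assumptions. Each node keeps its own score and adds its neighbors' scores in every round, which avoids oscillation on bipartite graphs without changing the final ranking.

```python
import math

def eigenvector_centrality(adjacency, iterations=100):
    """Approximate eigenvector centrality by repeated neighbour-score summation."""
    scores = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        # New score = own score + sum of the neighbours' current scores
        updated = {
            node: scores[node] + sum(scores[nbr] for nbr in adjacency[node])
            for node in adjacency
        }
        # Rescale to unit Euclidean length so the values stay bounded
        norm = math.sqrt(sum(value * value for value in updated.values()))
        scores = {node: value / norm for node, value in updated.items()}
    return scores

# Hypothetical toy graph: a triangle (a, b, c) with a pendant node d attached to a
toy_graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
ec_scores = eigenvector_centrality(toy_graph)
ranking = sorted(ec_scores, key=ec_scores.get, reverse=True)
print(ranking[0], ranking[-1])
```

In this toy graph, node d scores lowest even though it is attached to the highest-scoring node, because it has only that one connection.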




## VII. Comparing the three metrics
We will compute the Kendall tau rank distance between the three centrality measures. This will help us identify whether the rankings of the nodes differ based on the centrality measure used.
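
As a rough illustration of the planned comparison, the Kendall tau rank distance between two score dictionaries counts the node pairs that the two measures order in opposite ways. The sketch below uses hypothetical toy scores; the function name `kendall_tau_distance` is our own, and the function treats ties as agreement, which is one of several possible conventions.

```python
from itertools import combinations

def kendall_tau_distance(scores_a, scores_b):
    """Count node pairs that the two score dictionaries rank in opposite order."""
    nodes = sorted(set(scores_a) & set(scores_b))
    discordant = 0
    for u, v in combinations(nodes, 2):
        diff_a = scores_a[u] - scores_a[v]
        diff_b = scores_b[u] - scores_b[v]
        # Opposite signs mean the two measures disagree on this pair
        if diff_a * diff_b < 0:
            discordant += 1
    return discordant

# Hypothetical toy scores for three nodes under two centrality measures
measure_1 = {"a": 0.9, "b": 0.5, "c": 0.1}
measure_2 = {"a": 0.2, "b": 0.8, "c": 0.1}
distance = kendall_tau_distance(measure_1, measure_2)
total_pairs = 3 * (3 - 1) // 2
print(distance, distance / total_pairs)
```

Dividing by the number of pairs gives a normalized distance between 0 (identical rankings) and 1 (fully reversed rankings). The pairwise loop is quadratic in the number of nodes, so for the 2000-node main graph a library routine such as `scipy.stats.kendalltau` may be more practical.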

In [75]:
print(net.info(GA))
#net.write_gml(GA,"subgraph0.gml")

#Record the installed NetworkX version
networkx_version = net.__version__

#Export the edge list of the main graph to a CSV file
pd.DataFrame(list(GA.edges()),columns=["source","target"]).to_csv("subgraph0",index=False)

Name: 
Type: Graph
Number of nodes: 2000
Number of edges: 16098
Average degree:  16.0980


Still working ...

## VIII. References
konect:2016:petster-hamster, Hamsterster full network dataset, http://konect.uni-koblenz.de/networks/petster-hamster