# Web Analytics - Project-1

*June 20, 2017*

## Group 1 Members:
* Mauricio Alarcon
* Sekhar Mekala
* Aadi Kalloo
* Srinivasa Illapani
* Param Singh

## I. Background

The main goal of this project is to analyze a network data set that contains information about hamster owners and the friendships between them. The owners maintain details about their hamster(s) on a website, and they also maintain friendships with other owners on the same website. This data makes it possible to study relationships between people based on attributes of the hamsters they own. For instance, we can check whether the gender of a hamster is related to the number of friends its owner has. We can also check whether owners of one species of hamster tend to befriend owners of the same species more than owners of different species. The data for this project is available as a compressed folder at http://konect.uni-koblenz.de/networks/petster-hamster. The compressed folder contains the 4 files described below:

   * ent.petster-hamster - Contains hamster attributes such as ID, name, date joined, species, coloring, gender, birthday, age, hometown, favorite toy, favorite activity, favorite food.

   * out.petster-hamster - Contains the mapping between IDs; each pair of IDs represents a friendship between the owners of the corresponding hamsters.

   * README.petster-hamster - Information about the files and citation requirements.

   * meta.petster-hamster - Contains information about the data.

## II. Requirements

For this project, we will use the *ent.petster-hamster* and *out.petster-hamster* files. Specifically, we want to identify whether the gender of the hamster has any effect on the centrality measures of the graph. We will use the degree centrality and eigenvector centrality measures. The requirements of this project are given below:

1. Build an undirected graph using the data in the *out.petster-hamster* file. Each node represents a hamster, and each edge represents a friendship between the owners of two hamsters.

2. Extract the gender of each hamster from *ent.petster-hamster*, and add it as an attribute of the corresponding node.

3. Get the degree centrality and eigenvector centrality measures of each node.

4. Test the following hypotheses, using a confidence level of 95%:

    _Hypothesis 1_

    $H_0:$ Degree centrality does not differ by the gender of the hamster

    $H_1:$ Degree centrality differs by the gender of the hamster

    _Hypothesis 2_

    $H_0:$ Eigenvector centrality does not differ by the gender of the hamster

    $H_1:$ Eigenvector centrality differs by the gender of the hamster

5. Additional requirement (to be addressed later, time permitting):

    Using Gephi, plot the center nodes and the nodes with higher eigenvector centrality values, and analyze how these nodes are distributed in the complete graph.
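
The requirements above do not state which statistical test will be used for the two hypotheses. As one hedged possibility, the sketch below sets up a two-sided permutation test on the difference in mean centrality between two gender groups. The function name `permutation_test`, the number of permutations, and the centrality scores are all assumptions made up for illustration; they are not taken from the data set.

```python
import random

def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random and recompute the difference
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return (extreme + 1) / (n_permutations + 1)

# Made-up centrality scores, for illustration only
female_scores = [0.012, 0.009, 0.015, 0.011, 0.010, 0.013]
male_scores = [0.011, 0.010, 0.014, 0.012, 0.009, 0.013]

p_value = permutation_test(female_scores, male_scores)
alpha = 0.05  # corresponds to the 95% confidence level
print("p-value:", round(p_value, 3))
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

A permutation test is used here only because it makes no assumption about the shape of the centrality distribution; a different test could be substituted without changing the rest of the workflow.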

## III. Data cleansing

The following code block imports all the packages required for this project.

In [1]:
##Import all the required packages
import networkx as net
import urllib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sklearn

The following code block will display the initial records of the file: **ent.petster-hamster**

In [19]:
#Reading and displaying the initial lines of ent.petster-hamster
with open('data/ent.petster-hamster') as f:
    i = 0
    for line in f:
        print(line)
        i = i+1    
        #Break after printing the initial 6 lines (3 header lines + 3 records)
        if i > 5:
            break

%

%

% ent dat.name dat.joined dat.species dat.coloring dat.gender dat.birthday dat.age dat.hometown dat.favorite_toy dat.favorite_activity dat.favorite_food

"1" "Dexter" "August 1, 2006" "Hamster (Dwarf)" "Blonde" "Male" "June 20, 2006" "Gone to Hamster Heaven" "Vernon Hills, IL&nbsp; United States" "Toilet Paper, paper towels, etc" "Toilet Paper, paper towels, etc" "Lettuce"

"2" "Tonks" "April 15, 2004" "Hamster (Syrian)" "Golden" "Female" "June 20, 2003" "Gone to Hamster Heaven" "Windsor Mill, MD&nbsp; United States" "wooden hamster bunker" "wooden hamster bunker" "just about anything"

"3" "Bunny-Rabbit" "October 27, 2004" "Hamster (Syrian)" "Black" "Female" "August 1, 2002" "Gone to Hamster Heaven" "Chatsworth, GA&nbsp; United States" "Carpet?" "Carpet?" "Doritoes"



From the above display, we can infer that the initial 3 lines (those starting with `%`) must be skipped, although the third line lists the field names. Each field is enclosed in double quotes, and the fields are separated by a single space. To split a record into fields, we therefore use `" "` (quote, space, quote) as the separator. After the split, the first field still carries its opening quote and the last field its closing quote. We are interested only in the ent and gender fields: the ent field uniquely identifies a hamster, and the gender field contains the gender of the hamster. So we drop only the opening quote of the ent field and leave the closing quote of the last field in place, since that field is not used.

In [18]:
##Open the file
with open('data/ent.petster-hamster') as f:
    #Create a list object to collect the data
    node_type = list()
    #Iteratively read the lines in the file
    for line in f:
        #Skip the 3 header lines, which start with "%"
        if line.startswith("%"):
            continue
        data = line.strip().split('" "')
        
        #Extracting only the ent field (index 0) and the gender field (index 5)
        #Also dropping the initial quote from the ent field.
        node_type.append([int(data[0][1:]),data[5].strip()])
        
node_type_df = pd.DataFrame(node_type,columns=["node","gender"])
print("Displaying initial records from the node_type_df data frame")
display(node_type_df.head())

print("Does the data frame have any NA values?")
node_type_df.isnull().values.any()


Displaying initial records from the node_type_df data frame

| | node | gender |
|---|---|---|
| 0 | 1 | Male |
| 1 | 2 | Female |
| 2 | 3 | Female |
| 3 | 4 | Female |
| 4 | 5 | Female |

Does the data frame have any NA values?


False

The above display confirms that the records were read correctly, and there are no NaN values in the data.
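
As an alternative to splitting on `" "` by hand, the quoted, space-separated layout can also be parsed with Python's standard `csv` module. The sketch below is illustrative only: the sample rows are shortened to their first six fields, and it assumes that no field contains an embedded double quote, which has not been verified against the full file.

```python
import csv
import io

# Sample lines in the layout shown earlier (rows shortened to six fields for illustration)
sample = io.StringIO(
    '% ent dat.name dat.joined dat.species dat.coloring dat.gender\n'
    '"1" "Dexter" "August 1, 2006" "Hamster (Dwarf)" "Blonde" "Male"\n'
    '"2" "Tonks" "April 15, 2004" "Hamster (Syrian)" "Golden" "Female"\n'
)

# Drop the '%' header lines, then let csv handle the quoting
data_lines = (line for line in sample if not line.startswith("%"))
reader = csv.reader(data_lines, delimiter=" ", quotechar='"')

# Keep only the ent (column 0) and gender (column 5) fields
node_gender = [(int(row[0]), row[5]) for row in reader]
print(node_gender)  # [(1, 'Male'), (2, 'Female')]
```

With this approach the quotes are removed from every field, so no manual slicing of the first or last field is needed.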

The following code block will display the initial records of the file: **out.petster-hamster**

In [20]:
#Reading and displaying the initial lines of out.petster-hamster
with open('data/out.petster-hamster') as f:
    i = 0
    for line in f:
        print(line)
        i = i+1    
        #Break after printing the initial 6 lines (1 header line + 5 records)
        if i > 5:
            break

% sym unweighted

99 98

999 550

999 549

999 42

999 25



The above display shows that each record has 2 fields, which together represent an edge between two nodes. We will skip the initial line and read the data into a pandas data frame named *edges_df*.

In [26]:
#Skip the first row and read the rest of the data into a data frame
edges_df = pd.read_csv('data/out.petster-hamster', skiprows=1,header=None, sep=' ')

#Name the columns as node1 and node2
edges_df.columns = ["node1","node2"]

#Displaying initial records of the edges_df data frame
print("Displaying the initial records of the edges_df")
display(edges_df.head())

#Check if the data frame has any NA values
print("Did edges_df have any NA values?")
edges_df.isnull().values.any()


Displaying the initial records of the edges_df


| | node1 | node2 |
|---|---|---|
| 0 | 99 | 98 |
| 1 | 999 | 550 |
| 2 | 999 | 549 |
| 3 | 999 | 42 |
| 4 | 999 | 25 |


Did edges_df have any NA values?


False

## IV. Building a graph

The following code block creates a graph using the NetworkX package. It also adds the gender attribute to each node.

In [38]:
#Create a graph
GA = net.from_pandas_dataframe(edges_df,source="node1",target="node2")


#Add the gender attribute to each node
gender_type = dict(zip(node_type_df["node"],node_type_df["gender"]))
net.set_node_attributes(GA,'gender',gender_type)

#Print graph summary:
print("Graph built successfully. The graph summary is displayed below:")
print(net.info(GA))


Graph built successfully. The graph summary is displayed below:
Name: 
Type: Graph
Number of nodes: 2426
Number of edges: 16631
Average degree:  13.7106


The summary shows that, on average, each node (or each hamster's owner) is connected to about 14 other owners. There are 2426 owners and 16631 edges (connections) between them. Let us now check if there are any disconnected components or sub-graphs in the graph.

In [35]:
#Check whether the graph is connected
if not net.is_connected(GA):
    GA_list = list(net.connected_component_subgraphs(GA))

    print("There are {0} subgraphs".format(len(GA_list)))
    print("Display summary of sub-graphs, if and only if the number of nodes > 20")

    #x is the 1-based position of the sub-graph in GA_list
    for x, i in enumerate(GA_list, start=1):
        if len(i.nodes()) > 20:
            print("----------------------------------")
            print("Summary details of sub-graph: {0} ".format(x))
            print("----------------------------------")
            print(net.info(i))
            print("Diameter:{0}".format(net.diameter(i)))
            print("Radius:{0}".format(net.radius(i)))
            
else:
    print("The graph does not have any disconnected components")

There are 148 subgraphs
Display summary of sub-graphs, if and only if the number of nodes > 20
----------------------------------
Summary details of sub-graph: 1 
----------------------------------
Name: 
Type: Graph
Number of nodes: 2000
Number of edges: 16098
Average degree:  16.0980
Diameter:10
Radius:6


The above display shows that there are 148 sub-graphs, and only one of them has more than 20 nodes. So for the rest of the analysis, we will treat the sub-graph with the highest number of nodes (2000) as the main graph and discard the remaining sub-graphs.
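
The calls used in this section (`from_pandas_dataframe`, `connected_component_subgraphs`, `info`, and the `set_node_attributes(G, name, values)` argument order) belong to the NetworkX 1.x API; later releases changed or removed them. The following is a minimal sketch of the same steps against NetworkX 2.x or later. The edge list and gender lookup are made up for illustration, and `main_graph` is a hypothetical name.

```python
import networkx as net
import pandas as pd

# Made-up edge list and gender lookup, for illustration only
edges_df = pd.DataFrame({"node1": [1, 1, 2, 4], "node2": [2, 3, 3, 5]})
gender_type = {1: "Male", 2: "Female", 3: "Female", 4: "Male", 5: "Female"}

# from_pandas_edgelist takes the place of from_pandas_dataframe
G = net.from_pandas_edgelist(edges_df, source="node1", target="node2")

# In NetworkX 2.x and later, the values come before the attribute name
net.set_node_attributes(G, gender_type, name="gender")

# connected_components yields sets of nodes; take the largest set
# and copy the subgraph it induces
largest_nodes = max(net.connected_components(G), key=len)
main_graph = G.subgraph(largest_nodes).copy()

# Summary counts, in place of net.info
print(net.is_connected(G), main_graph.number_of_nodes(), main_graph.number_of_edges())
```

Selecting the component with `max(..., key=len)` avoids relying on the order in which components are returned.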

## V. Centralities
The following code block obtains the degree and eigenvector centralities of the main graph.

In [40]:
#Keep only the largest connected component (2000 nodes) as the main graph
GA = max(GA_list, key=len)
#Get eigenvector centralities
eigen_centrality=net.eigenvector_centrality(GA)

#Get degree centralities
degree_centrality=net.degree_centrality(GA)
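
For reference, NetworkX reports degree centrality as the degree of a node divided by $n-1$, the number of other nodes in the graph. In a graph with 2000 nodes, for example, a score of 0.136568 corresponds to 273 connections (273/1999). Eigenvector centrality instead scores a node by the scores of its neighbours. The sketch below computes both measures by hand on a small made-up graph. It uses power iteration on $A + I$ and normalises the result to unit Euclidean length; this is an illustrative implementation under those assumptions, not the library's code.

```python
import math

# Toy undirected graph (made up): a triangle 1-2-3 with node 4 attached to node 3
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]

# Build an adjacency list
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

n = len(adj)

# Degree centrality: fraction of the other n-1 nodes that a node is connected to
degree_centrality = {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

def eigenvector_centrality(adj, max_iter=1000, tol=1e-12):
    """Power iteration on (A + I), normalised to unit Euclidean length."""
    x = {node: 1.0 / len(adj) for node in adj}
    for _ in range(max_iter):
        # (A + I) x: each node keeps its own score and adds its neighbours' scores
        x_new = {node: x[node] + sum(x[nbr] for nbr in adj[node]) for node in adj}
        norm = math.sqrt(sum(v * v for v in x_new.values()))
        x_new = {node: v / norm for node, v in x_new.items()}
        if sum(abs(x_new[node] - x[node]) for node in adj) < tol:
            return x_new
        x = x_new
    raise RuntimeError("power iteration did not converge")

eigen_centrality = eigenvector_centrality(adj)
for node in sorted(adj):
    print(node, round(degree_centrality[node], 3), round(eigen_centrality[node], 3))
```

Iterating with $A + I$ rather than $A$ leaves the eigenvectors unchanged but avoids oscillation on bipartite graphs. In the toy graph, node 3 ranks highest on both measures and the pendant node 4 ranks lowest.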


The top 5 nodes with highest centrality measures are displayed below:

In [45]:
#Top 5 nodes by eigenvector centrality, together with the gender of each node
temp_df=pd.DataFrame(sorted(eigen_centrality.items(),key=lambda x:x[1],reverse=True)[:5],
                    columns = ["node","eigen_centrality_score"])

print("Top 5 nodes based on the eigenvector centrality measure:")
display(pd.merge(temp_df,node_type_df, on="node"))


#Top 5 nodes by degree centrality, together with the gender of each node
temp_df=pd.DataFrame(sorted(degree_centrality.items(),key=lambda x:x[1],reverse=True)[:5],
                    columns = ["node","degree_centrality_score"])

print("Top 5 nodes based on the degree centrality measure:")
display(pd.merge(temp_df,node_type_df, on="node"))


Top 5 nodes based on the eigenvector centrality measure:

| | node | eigen_centrality_score | gender |
|---|---|---|---|
| 0 | 237 | 0.211469 | Female |
| 1 | 238 | 0.185886 | Female |
| 2 | 168 | 0.143649 | Female |
| 3 | 177 | 0.123658 | Female |
| 4 | 356 | 0.120463 | Male |

Top 5 nodes based on the degree centrality measure:


| | node | degree_centrality_score | gender |
|---|---|---|---|
| 0 | 237 | 0.136568 | Female |
| 1 | 238 | 0.111556 | Female |
| 2 | 44 | 0.086543 | Female |
| 3 | 168 | 0.077039 | Female |
| 4 | 169 | 0.075538 | Male |


In [75]:
#Display the summary of the main graph
print(net.info(GA))

#Record the installed networkx version
networkx_version = net.__version__

#Export the edge list of the main graph for later use
#net.write_gml(GA,"subgraph0.gml")
pd.DataFrame(list(GA.edges()),columns=["source","target"]).to_csv("subgraph0",index=False)

Name: 
Type: Graph
Number of nodes: 2000
Number of edges: 16098
Average degree:  16.0980


Still working ...

## VI. References

Hamsterster full network dataset, KONECT (konect:2016:petster-hamster), http://konect.uni-koblenz.de/networks/petster-hamster