# Social Graph

[New York Social Diary](http://www.newyorksocialdiary.com/) provides a
fascinating lens onto New York's socially well-to-do.  The data forms a natural
social graph for New York's social elite.  Take a look at this page of a recent
run-of-the-mill holiday party:

`http://www.newyorksocialdiary.com/party-pictures/2014/holiday-dinners-and-doers`

Besides the brand-name celebrities, you will notice the photos have carefully
annotated captions labeling those that appear in the photos.  We can think of
this as implicitly implying a social graph: there is a connection between two
individuals if they appear in a picture together.

Fetching the data can be broken down into two phases:

The first step is to crawl the data.  We want photos from parties before
December 1st, 2014.  Go to `http://www.newyorksocialdiary.com/party-pictures`
to see a list of (party) pages.  For each party's page, grab all the captions.

## Hints
  1. Click on the on the index page and see how they change the url.  Use this
     to determine a strategy to get all the data.
  2. Notice that each party has a date on the index page.  Use python's
     `datetime.strptime` function to parse it.
  3. Some captions are not useful: they contain long narrative texts that
     explain the event.  Usually in two stage processes like this, it is better
     to keep more data in the first stage and then filter it out in the second
     stage.  This makes your work more reproducible.  It's usually faster to
     download more data than you need now than to have to redownload more data
     later.
  4. To avoid having to re-scrape every time you run your code, you can
	 consider saving the data to disk, and having the parsing code load a file.
	 A checkpoint library like
	 [ediblepickle](https://pypi.python.org/pypi/ediblepickle/1.1.3) can
     streamline the process so that the time-consuming code will only be run
     when necessary.
  5. HTTP requests can sometimes fail inconsistently. You should expect to
     run into this issue and deal with it as best you can.

Now that you have a list of all captions, you should probably save the data on
disk so that you can quickly retrieve it.  Now comes the parsing part.
  1. Some captions are not useful: they contain long narrative texts that
     explain the event.  Try to find some heuristic rules to separate captions
     that are a list of names from those that are not.  A few heuristics
     include:
      - look for sentences (which have verbs) and as opposed to lists of nouns.
        For example, [nltk does part of speech
        tagging](http://www.nltk.org/book/ch05.html) but it is a little slow.
        There may also be heuristics that accomplish the same thing.
      - Look for commonly repeated threads (e.g. you might end up picking up
        the photo credits or people such as "a friend").
      - Long captions are often not lists of people.  The cutoff is subjective,
        but for grading purposes, *set that cutoff at 250 characters*.
  2. You will want to separate the captions based on various forms of
     punctuation.  Try using `re.split`, which is more sophisticated than
     `string.split`.
     **Note**: The reference solution uses regex exclusively for name parsing.
  3. You might find a person named "ra Lebenthal".  There is no one by this
     name.  Can anyone spot what's happening here?
  4. This site is pretty formal and likes to say things like "Mayor Michael
     Bloomberg" after his election but "Michael Bloomberg" before his election.
     Can you find other ('optional') titles that are being used?  They should
     probably be filtered out b/c they ultimately refer to the same person:
     "Michael Bloomberg."
  4. There is a special case you might find where couples are written as eg.
     "John and Mary Smith". You will need to write some extra logic to make
     sure this properly parses to two names: "John Smith" and "Mary Smith".
  5. When parsing names from captions, it can help to look at your output
     frequently and address the problems that you see coming up, iterating
     until you have a list that looks reasonable. This is the approach used
     in the reference solution. Because we can only asymptotically approach
     perfect identification and entity matching, we have to stop somewhere.

**Further considerations (not included in solution)**
  1. Who is Patrick McMullan and should he be included in the results? How would
     you address this?
  2. What else could you do to improve the quality of the graph's information?

For the analysis, we think of the problem in terms of a
[network](http://en.wikipedia.org/wiki/Computer_network) or a
[graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29).  Any time a pair
of people appear in a photo together, that is considered a link.  What we have
described is more appropriately called an (undirected)
[multigraph](http://en.wikipedia.org/wiki/Multigraph) with no self-loops but
this has an obvious analog in terms of an undirected [weighted
graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29#Weighted_graph).
In this problem, we will analyze the social graph of the new york social elite.

We recommend using python's `networkx` library.

In [1]:
import csv
import re
import networkx as nx
import pandas as pd

In [2]:
def create_edge_tuple(List):
    a = []
    for x in List:
        for y in List:
            if x !=y:
                a.append(tuple(list(set([x,y]))))
        List.remove(x)
    return [item for item in list(set(a))]

In [3]:
input_file = csv.DictReader(open('captiondata1.csv'))

caption_list = []
name_sublist = []
name_list = []
titles = ['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Mayor', 'CEO', 'M.D.', 'AMC', 'AOHT','ANDRUS', 'AOHT', 'ASF', 'ASPCA',\
         'ACO', 'ACC', 'ABT', 'ACandC', 'AFIPO', 'ALS', 'ALSGNY', 'AAADTs', 'AIA', 'AIS',\
         'Actress', 'Actresses', 'Actor', 'Actors', 'Author', 'Authors', 'Bad', 'C.B.E.', 'COO'\
         'Board Member','Photographs','Benefit Chairman', 'Benefit Chairmen', 'Benefit Chairs','CCBF Chairman',\
         'CCBF Doctors', 'CNBC', 'CUNY Chancellor', 'CSHL President', 'CSUN President', 'President', \
         'Vice President',  'Cardiologist', 'Miss New York', 'New York', 'COO'\
          'Board Member','Photographs','Benefit Chairman', 'Benefit Chairmen', 'Benefit Chairs','CCBF Chairman',\
         'CCBF Doctors', 'CNBC', 'CUNY Chancellor', 'CSHL President', 'CSUN President', 'President', \
         'Vice President',  'Cardiologist', 'Miss New York', 'New York', 'COO'
         ]

for row in input_file:    
    caption = row["caption"].split('%')
    for caption_item in caption:
        if len(caption_item)<250:
            caption_item = caption_item.decode('utf-8').strip().replace('\n',' ').replace('\t',' ')
            for word in titles:
                if word in caption_item:
                    caption_item = re.sub(word,'',caption_item)
            caption_item = re.sub('[^A-Za-z\,\& \.]+', ' ', caption_item)      # remove all the special characters
            split_list = re.split(',|and |with |& ',caption_item)                    
            name_sublist = filter(None, split_list)
            name_sublist = [item.strip() for item in name_sublist]
            # remove whitespaces strings
            name_sublist_filter = filter(lambda name: name.strip() and len(name.split(' '))<=4 and name[0].isupper(),name_sublist) 
            if name_sublist_filter:
                # deal with husband and wife case
                new_list = []
                c = []
                for item in name_sublist_filter:
                    if len(item.split(' ')) ==1:
                        new_list.append(item)
                        #print new_list
                        continue
                    else:
                        last_name = item.split(' ')[-1]
                        b = [first_name+" "+last_name for first_name in new_list]
                        new_list = []
                        c.extend(b)
                        c.append(item)
                name_list.append(c)
                #print name_sublist_filter
        caption_list.append(caption_item)

In [4]:
#Draw a Graph
G = nx.Graph()
node_list = [item for x in name_list for item in x]
new_node_list = list(set(node_list))
G.add_nodes_from(new_node_list)
list_tuple = []
for item in name_list:
    a = create_edge_tuple(item)
    for x in a:
        list_tuple.append(x)
#print list_tuple
for node_pair in list_tuple:
    if G.has_edge(node_pair[0],node_pair[1]):
        G[node_pair[0]][node_pair[1]]['weight']+=1            
    else:
        G.add_edge(node_pair[0],node_pair[1],weight = 1)

## 1. degree
The simplest question to ask is "who is the most popular"?  The easiest way to
answer this question is to look at how many connections everyone has.  Return
the top 100 people and their degree.  Remember that if an edge of the graph has
weight 2, it counts for 2 in the degree.

In [5]:
# Question 1
degree_dict = G.degree()
table_degree = pd.Series(degree_dict)
sorted_table_degree = table_degree.order(ascending = False)
sorted_list_degree = []
for i in range(0,100):
    Index = sorted_table_degree.index[i]
    sorted_list_degree.append((str(sorted_table_degree.index[i]),sorted_table_degree[Index])) 

## 2. pagerank
A similar way to determine popularity is to look at their
[pagerank](http://en.wikipedia.org/wiki/PageRank).  Pagerank is used for web
ranking and was originally
[patented](http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=6285999) by
Google and is essentially the stationary distribution of a [markov
chain](http://en.wikipedia.org/wiki/Markov_chain) implied by the social graph.

Use 0.85 as the damping parameter so that there is a 15% chance of jumping to
another vertex at random.

In [6]:
#Question 2
pagerank_dict = nx.pagerank(G,alpha=0.85, personalization=None, max_iter=100, tol=1e-06, nstart=None, weight='weight', dangling=None)
table_pagerank = pd.Series(pagerank_dict)
sorted_table_pagerank = table_pagerank.order(ascending = False)
sorted_list_pagerank = []
for i in range(0,100):
    Index = sorted_table_pagerank.index[i]
    sorted_list_pagerank.append((str(sorted_table_pagerank.index[i]),sorted_table_pagerank[Index]))
#print sorted_list_pagerank

## 3. best_friends
Another interesting question is who tend to co-occur with each other.  Give
us the 100 edges with the highest weights.

In [7]:
#Question 3
weights = G.edges(data = True)
L = []
for (n1,n2,w) in weights:
    t = (n1,n2,w['weight'])
    L.append(t)
df = pd.DataFrame(L, columns=['node1', 'node2', 'weight'])
sorted_df = df.sort(['weight'],ascending = False)

#print sorted_df[0:100]

best_friends = []
for name1,name2,weight in sorted_df[0:100].values:
    best_friends.append(((str(name1),str(name2)),weight))
print best_friends
len(best_friends)

[(('Gillian Miniter', 'Sylvester Miniter'), 75), (('Jamee Gregory', 'Peter Gregory'), 54), (('Bonnie Comley', 'Stewart Lane'), 51), (('Andrew Saffir', 'Daniel Benedict'), 51), (('Roric Tobin', 'Geoffrey Bradfield'), 46), (('Somers Farkas', 'Jonathan Farkas'), 40), (('Jay Diamond', 'Alexandra Lebenthal'), 37), (('Donald Tober', 'Barbara Tober'), 36), (('Martin Shafiroff', 'Jean Shafiroff'), 35), (('Chappy Morris', 'Melissa Morris'), 32), (('Campion Platt', 'Tatiana Platt'), 30), (('Chris Meigher', 'Grace Meigher'), 30), (('Lizzie Tisch', 'Jonathan Tisch'), 28), (('Peter Regna', 'Barbara Regna'), 27), (('Sessa von Richthofen', 'Richard Johnson'), 27), (('John Catsimatidis', 'Margo Catsimatidis'), 27), (('Wilbur Ross', 'Hilary Geary Ross'), 26), (('Arie Kopelman', 'Coco Kopelman'), 26), (('Deborah Norville', 'Karl Wellner'), 26), (('Elizabeth Stribling', 'Guy Robinson'), 24), (('Yaz Hernandez', 'Valentin Hernandez'), 24), (('Julia Koch', 'David Koch'), 24), (('Olivia Palermo', 'Johannes H

100