## Problem statement

The data set lists pairs of IDs for more than 2,600 people who have flown together or who know one another in some other way. Each ID is unique, and the relationship is symmetric: if X knows Y, then Y also knows X.

## Approach used
Tools used
* neo4j
* javascript for visualisations
* pandas 
* anaconda python 

A graph database (neo4j) stores the relationships between the IDs. This allows the model to scale in the future as additional domain objects come into the problem.
A graph store is well suited to persisting entities together with relationship definitions or ontologies, in this case 'KNOWS'.
Ontological (relationship-based) queries also tend to be much faster in a graph-based DB.

In [None]:
# connecting to the neo4j db

from py2neo import Graph
graph = Graph("http://neo4j:root123@localhost:7474/graph/data")
graph.delete_all()

In [41]:
# loading the data from the Excel workbook
import pandas as pd
result = pd.read_excel("Bristol Adjacency.xlsx", header = 0)
df = result[['ID-1', 'ID-2']]
df.shape

(9076, 2)

In [42]:
df.columns = ['id1', 'id2']
df.head()

Unnamed: 0,id1,id2
0,1,422
1,2,826
2,2,1047
3,3,209
4,3,612


In [43]:
# saving the file as csv

df.to_csv('ids.csv', index=False)  # skip the row index; only id1 and id2 are imported

In [44]:
graph.delete_all()

In [45]:
# bulk importing the csv into neo4j

query = '''

    LOAD CSV WITH HEADERS FROM "file:///ids.csv" AS line
    WITH line
    WHERE line.id1 IS NOT NULL
    MERGE (person1:Person{id: line.id1})
    MERGE (person2:Person{id: line.id2})
    MERGE (person1)-[:KNOWS]->(person2)
    MERGE (person2)-[:KNOWS]->(person1)

''' 
graph.run(query)

<py2neo.graph.Cursor at 0x193a67a5940>

In [46]:
# visualise the Nodes

from scripts.vis import draw
options = {"Person": "id"}
draw(graph, options)


In [47]:
# querying the graph and loading the result into a dataframe

query = """
    MATCH (person1:Person)-[:KNOWS]->(person2:Person)
    RETURN person1.id AS id1, person2.id as id2
"""
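# graph.data() runs the query itself, so a separate graph.run() call is not needed
df2 = pd.DataFrame(graph.data(query))
# NOTE: this reports the shape of the source frame `df`; use df2.shape for the query result
df.shape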

(9076, 2)

In [27]:
df2.head()

Unnamed: 0,id1,id2
0,1,422
1,2,1047
2,2,826
3,3,749
4,3,612


In [48]:
# finding the 10 ids with the most connections, then listing each of their connections

query = '''
    MATCH (b:Person)
    WITH b, SIZE(()-[:KNOWS]->(b)) as personCnt
    ORDER BY personCnt DESC LIMIT 10
    MATCH (a)-[:KNOWS]->(b)
    RETURN b.id as id1, a.id as id2
    
'''
df3 = pd.DataFrame(graph.data(query))
df3.shape

(449, 2)

In [49]:
df3.head()

Unnamed: 0,id1,id2
0,298,2595
1,298,2604
2,298,2463
3,298,2452
4,298,2435
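
The connection counts behind this query can be cross-checked directly on the edge list, without going through neo4j. The following is a minimal sketch under stated assumptions: `degree_counts` is a hypothetical helper (not part of the notebook), and the sample rows are the five shown by `df.head()` above. Note that IDs read back from neo4j in this notebook are strings, while the sample here uses integers.

```python
import pandas as pd


def degree_counts(edges: pd.DataFrame) -> pd.Series:
    """Count distinct neighbours per ID, treating each row as an undirected edge.

    Hypothetical helper for illustration; expects columns 'id1' and 'id2'.
    """
    # Add the reversed pairs so that "X knows Y" also counts as "Y knows X".
    reversed_edges = edges.rename(columns={"id1": "id2", "id2": "id1"})
    both = pd.concat([edges, reversed_edges], ignore_index=True)[["id1", "id2"]]
    # Drop repeated pairs so that an edge listed twice is only counted once.
    both = both.drop_duplicates()
    return both.groupby("id1")["id2"].nunique().sort_values(ascending=False)


# The five rows shown by df.head() earlier in the notebook.
sample = pd.DataFrame({"id1": [1, 2, 2, 3, 3], "id2": [422, 826, 1047, 209, 612]})
top = degree_counts(sample).head(10)
print(top)
```

On the full data, `degree_counts(df).head(10)` should give the same kind of ranking as the Cypher query, assuming `df` still holds the `id1` and `id2` columns.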


## Identify clusters of people who know each other.

The ten largest clusters, ranked by the number of people each central ID knows, are:

In [50]:
# counting the connections in each cluster

df3.groupby('id1').size().reset_index(name='counts').sort_values('counts', ascending=False)

Unnamed: 0,id1,counts
2,298,63
7,736,52
3,304,50
5,389,49
1,205,45
6,561,41
9,93,41
4,359,39
0,119,36
8,91,33
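
The groups above are the direct connections of the ten best-connected IDs. A different notion of cluster, swapped in here purely for illustration and not the method used in this notebook, is the connected component: everyone who can be reached from one another through a chain of 'KNOWS' links. The sketch below uses a small union-find structure in plain Python; `connected_components` is a hypothetical helper, and the sample edges are the five rows shown by `df.head()` earlier.

```python
def connected_components(edges):
    """Group IDs into connected components using union-find.

    Hypothetical helper for illustration; `edges` is an iterable of (id1, id2) pairs.
    """
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Largest components first.
    return sorted(groups.values(), key=len, reverse=True)


# The five rows shown by df.head() earlier in the notebook.
sample_edges = [(1, 422), (2, 826), (2, 1047), (3, 209), (3, 612)]
components = connected_components(sample_edges)
print([sorted(c) for c in components])
```

With the notebook's frame, the pairs could be supplied as `df[['id1', 'id2']].itertuples(index=False, name=None)`.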


## Identify the most influential people within those clusters.

The most influential person is the one with ID 298, who has 63 connections.

In [60]:
query2 = """
    
    MATCH (person1)-[:KNOWS]->(person2)
    WHERE person1.id = '298'
    RETURN person1.id AS id1, person2.id as id2
"""

graph.run(query2)

<py2neo.graph.Cursor at 0x193a64e65f8>

In [61]:
draw(graph, options)

#### Saving the data to a file


In [59]:
# the largest cluster of connections

max_conn = df3[df3.id1 == '298']
max_conn.to_csv('max_conn.csv')

## How could we use this information for target marketing?

The largest cluster helps reduce marketing costs.
The network of people in this cluster has the maximum reach, because they may know each other through social networks, business relationships, friend groups, etc.

The clustering indicates that people know each other for various sociological reasons, so their demographic parameters might be similar, such as

### age, sex, race, nationality, workplace, designation, sector of work, income, outing habits, interests etc.
This gives an opportunity to make offers that appeal to these demographics.

### Focused marketing on an individual allows information to flow easily through the cluster group.

Furthermore, we should identify the platform through which they are connected, so that we can focus on medium-based targeting.

We should also look at classifying the travel into segments such as

* Government and International Organizations
* Regional Resident Personal and Leisure Travelers
* Diaspora Personal and Leisure Travelers
* Western European Personal and Leisure Travelers
* Seasonal Holiday Travelers 
* Business Travelers

This will allow us to make need-based or travel-based offers to each segment.

### How could we use this information to recognise that some individuals are related to (close to) people of high value to EK?

We can define "High value to EK" as people who fall into categories such as frequent flyers, high-mileage flyers, and business class users.
From the clusters above, we can identify the customers connected to "High value to EK" customers.
Each of these connections is a potential "High value to EK" customer,
as knowing one another is driven by factors like
### workplace, designation, sector of work, income, outing habits, interests etc.
and these indicate that the wallet size (purchasing capacity) or net worth of these connections is likely to be similar.
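
The idea of finding customers connected to "High value to EK" people can be sketched on the edge list as follows. This is only an illustration under assumptions: the notebook has no list of high-value customers, so `high_value_ids` and the helper `neighbours_of_high_value` are hypothetical, and the edges are made-up sample pairs rather than rows from the data set.

```python
import pandas as pd


def neighbours_of_high_value(edges: pd.DataFrame, high_value_ids) -> pd.Series:
    """Rank non-high-value IDs by how many distinct high-value people they know.

    Hypothetical helper for illustration; expects columns 'id1' and 'id2'.
    """
    hv = set(high_value_ids)
    # Look at each edge from both ends, because KNOWS is symmetric.
    forward = edges[edges["id1"].isin(hv)][["id2", "id1"]].rename(
        columns={"id2": "candidate", "id1": "high_value_contact"}
    )
    backward = edges[edges["id2"].isin(hv)][["id1", "id2"]].rename(
        columns={"id1": "candidate", "id2": "high_value_contact"}
    )
    pairs = pd.concat([forward, backward], ignore_index=True).drop_duplicates()
    # People who are already high value are not new candidates.
    pairs = pairs[~pairs["candidate"].isin(hv)]
    return (
        pairs.groupby("candidate")["high_value_contact"]
        .nunique()
        .sort_values(ascending=False)
    )


# Made-up sample edges and a made-up high-value list (assumptions, not real data).
sample_edges = pd.DataFrame({"id1": [1, 1, 2, 3, 5], "id2": [2, 3, 3, 4, 4]})
high_value_ids = [1, 4]
candidates = neighbours_of_high_value(sample_edges, high_value_ids)
print(candidates)
```

Candidates who know several high-value customers appear first, which gives one simple way to prioritise them. IDs read back from neo4j in this notebook are strings, so the high-value list would need to use the same type.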