<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Telecom EDA Challenge Lab

_Author: Alex Combs (NYC)_

---

Let's do some Exploratory Data Analysis (EDA)! As a data scientist, you may often find yourself handed a data set you've never seen before and asked to do a rapid analysis. That is today's goal.

# Prompt

You work for a telecommunications company. The company has been storing metadata about customer phone usage as part of the regular course of business. Currently, this data is sitting in an unsecured database. The company doesn't want to pay to increase its database security, because it doesn't think there's really anything to be learned from the metadata.

The company is under pressure from "right to privacy" organizations to beef up the database security. These organizations argue that you can learn a lot about a person from their cell phone metadata.

The telecom company wants to understand whether this is true and wants your help. It will give you one person's metadata for 2014 and wants to see what you can learn from it.

Working in teams, create a report revealing everything you can about the person. Prepare a presentation, with slides, showcasing your findings.


# The Data

The [person's metadata](./datasets/metadata.csv) has the following fields:

| Field Name               | Description
| ---                      | ---
| **Cell Cgi**             | cell phone tower identifier
| **Cell Tower Location**  | cell phone tower location
| **Comm Identifier**      | de-identified recipient of communication
| **Comm Timedate String** | time of communication
| **Comm Type**            | type of communication (Phone, SMS, or Internet)
| **Latitude**             | latitude of communication
| **Longitude**            | longitude of communication


# Hints

This is totally open-ended! Look below for prompts only if you're totally stumped. As a starting point, given that you have geolocations, consider investigating ways to display this type of information (e.g., mapping functionality).

<font color='white'>
Well, for starters, he's in Australia!

Ideas for things to look into:
- where does he work?
- where does he live?
- who does he contact most often?
- what hours does he work?
- did he move?
- did he go on holiday? If so, where did he go?
- did he get a new phone?

Challenges:
- how does he get to work?
- where does his family live?
- if he went on holiday, can you find which flights he took?
- can you guess who some of his contacts are, based on the frequency, location, time and mode (phone/text) of communications?


If you're stuck on how to map the data, you can try "basemap" or "gmplot", or anything else you find online.
</font>

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
# Load from the relative path linked in "The Data" section; adjust it if your copy lives elsewhere.
df = pd.read_csv('./datasets/metadata.csv', encoding='latin-1')

In [3]:
df.head()

| Unnamed: 0 | Cell Cgi | Cell Tower Location | Comm Identifier | Comm Timedate String | Comm Type | Latitude | Longitude
| --- | --- | --- | --- | --- | --- | --- | ---
| 0 | 50501015388B9 | REDFERN TE | f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e | 4/1/14 9:40 | Phone | -33.892933 | 151.202296
| 1 | 50501015388B9 | REDFERN TE | 62157ccf2910019ffd915b11fa037243b75c1624 | 4/1/14 9:42 | Phone | -33.892933 | 151.202296
| 2 | 505010153111F | HAYMARKET # | c8f92bd0f4e6fb45ed7fce96fc831b283db2b642 | 4/1/14 13:13 | Phone | -33.880329 | 151.20569
| 3 | 505010153111F | HAYMARKET # | f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e | 4/1/14 13:13 | Phone | -33.880329 | 151.20569
| 4 | 5.05E+106 | HAYMARKET # | f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e | 4/1/14 17:27 | Phone | -33.880329 | 151.20569
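
Several of the open questions (working hours, holidays) depend on time, so it may help to parse `Comm Timedate String` early. The following is a minimal sketch using three timestamps copied from the rows above. The month/day/year ordering and the `timestamp`, `hour`, and `weekday` column names are assumptions, so check the ordering against the full file before relying on it.

```python
import pandas as pd

# Three timestamps copied from the df.head() output above.
sample = pd.DataFrame({
    'Comm Timedate String': ['4/1/14 9:40', '4/1/14 13:13', '4/1/14 17:27'],
})

# Assumption: month/day/year ordering. If the file turns out to be
# day/month/year, use '%d/%m/%y %H:%M' instead.
sample['timestamp'] = pd.to_datetime(
    sample['Comm Timedate String'], format='%m/%d/%y %H:%M'
)

# Hour of day and weekday name are handy for spotting work and sleep patterns.
sample['hour'] = sample['timestamp'].dt.hour
sample['weekday'] = sample['timestamp'].dt.day_name()

print(sample[['timestamp', 'hour', 'weekday']])
```

On the full data, `df.groupby('hour').size()` would then give a first look at when the phone is most active.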


In [9]:
pd.options.display.max_rows = 130

In [10]:
df['Cell Tower Location'].value_counts()

BALGOWLAH HAYES ST                          4301
CHIPPENDALE                                 1084
SUNDERLAND ST                                723
REDFERN TE                                   712
HAYMARKET #                                  563
BRICKWORKS                                   501
HARBORD 22 WAINE ST                          465
FAIRLIGHT 137 SYDNEY RD                      454
MANLY #                                      231
NEW TOWN                                     197
CHINATOWN                                    161
BEECHWORTH                                   112
BALGOWLAH VILLAGE SHOPPING CENTRE IBC        106
MANLY SOUTH STEYNE                            92
BROADWAY OTC                                  85
MASCOT INTERNATIONAL AIRPORT TERMINAL T1      65
71 MACQUARIE ST                               49
SURRY HILLS 418A ELIZABETH ST                 45
MANLY NTH STEYNE                              40
MASCOT M5 MOTORWAY EMERGENCY STAIRS           33
BALGOWLAH TE        

In [11]:
# Conjecture: the top contact may be a spouse, and the next three may be his children.
# The other frequent contacts are probably various other family members and friends.
# Contacts with a single communication are probably spam calls or texts, so counting
# them gives a rough estimate of how much spam the customer gets.
df['Comm Identifier'].value_counts()

bc0b01860486b0f0a240ce8419d3d7553fe404ab    219
12e3d1b0c95aa32b6890c4455918dfc10e09fb51    146
91aba4a11359ff3af7902428d20cfa7e676c36e7    144
a24a4646d074a779b45b34b943a47bf33168f791    133
6bbc17070aa91e2dab7909b96c6eecbd6109ba56     83
a804558e420ececf05faedf05722704a115f1b50     62
cd3b39466869088df4904451c626591cc500e4ba     56
c22670da93038f568c4a3bd8ae22f9e6fef2c5a2     44
70e1f163d854d4e9b63e9a3f4056ced467567d85     39
c521537546eee0e62e2d8e98e831ac11edbf10cc     31
746da741fb2ac66a5130b1ce2ee4615a58b356ae     29
62157ccf2910019ffd915b11fa037243b75c1624     28
a5834ee77b2c1dd26c78966f5e2c989c453878ba     26
c13b4711ebc07ab70537d8d6f6326edb1c7b419a     17
6dad3704c00eb122d2f183ca612ef8990cc72bea     17
0767e517fa09e8bcc0b87461133830378a107619     16
b10d076fa3354b695ffa1e5f0ec5178d792fea95     16
b9ac4ec5dea89a5d8cac2854bfa0161989d3aa66     15
c8f92bd0f4e6fb45ed7fce96fc831b283db2b642     14
ce4393a832fb45f7dddcb5c2147fd086763acfb4     13
dc6774d10eeca42629f043d3649f1edf903b0dab
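
The spam estimate mentioned in the comment can be made concrete by counting how many contacts appear exactly once. This is only a sketch: it uses the identifiers from the five `df.head()` rows (shortened to eight characters for readability), so the counts below describe that tiny sample, not the full dataset. The names `ids`, `counts`, and `one_off` are illustrative.

```python
import pandas as pd

# Identifiers from the five df.head() rows shown earlier, shortened for readability.
# In practice you would use the full column: df['Comm Identifier'].
ids = pd.Series(
    ['f1a6836c', '62157ccf', 'c8f92bd0', 'f1a6836c', 'f1a6836c'],
    name='Comm Identifier',
)

counts = ids.value_counts()

# Contacts seen exactly once are the candidates for spam or one-off calls.
one_off = counts[counts == 1]
print(f"{len(one_off)} of {len(counts)} contacts appear only once")
```

A single appearance is weak evidence on its own; a wrong number or a one-time delivery call would look the same.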

In [16]:
df['lat long'] = df['Latitude'].astype(str) + ' ' + df['Longitude'].astype(str)

In [17]:
df.head()


| Unnamed: 0 | Cell Cgi | Cell Tower Location | Comm Identifier | Comm Timedate String | Comm Type | Latitude | Longitude | lat long
| --- | --- | --- | --- | --- | --- | --- | --- | ---
| 0 | 50501015388B9 | REDFERN TE | f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e | 4/1/14 9:40 | Phone | -33.892933 | 151.202296 | -33.89293336 151.20229619999998
| 1 | 50501015388B9 | REDFERN TE | 62157ccf2910019ffd915b11fa037243b75c1624 | 4/1/14 9:42 | Phone | -33.892933 | 151.202296 | -33.89293336 151.20229619999998
| 2 | 505010153111F | HAYMARKET # | c8f92bd0f4e6fb45ed7fce96fc831b283db2b642 | 4/1/14 13:13 | Phone | -33.880329 | 151.20569 | -33.88032891 151.2056904
| 3 | 505010153111F | HAYMARKET # | f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e | 4/1/14 13:13 | Phone | -33.880329 | 151.20569 | -33.88032891 151.2056904
| 4 | 5.05E+106 | HAYMARKET # | f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e | 4/1/14 17:27 | Phone | -33.880329 | 151.20569 | -33.88032891 151.2056904


In [18]:
# Where the customer is when his phone is used (tower coordinates, so approximate).
# The top two locations are in Sydney, Australia: he probably lives at the first
# and works at the second. The third is in Tasmania, possibly a vacation home?
# He seems to spend most of his time in Sydney and to take trips to Tasmania.
df['lat long'].value_counts()

-33.78815 151.26654                       4301
-33.88417103 151.20235                    1084
-42.843379999999996 147.29568999999998     723
-33.89293336 151.20229619999998            712
-33.88032891 151.2056904                   563
-42.859840000000005 147.29215              501
-33.779333 151.276901                      465
-33.79661 151.27756000000002               454
-33.796679 151.285293                      231
-42.85307 147.31531999999999               197
-33.87829 151.20345                        161
-36.3567 146.7136                          112
-33.793648 151.263934                      106
-33.799479999999996 151.28933999999998      92
-33.884603000000006 151.195643              85
-33.937558 151.1657                         65
-33.861129999999996 151.21293               49
-33.8864 151.2088                           45
-33.791965000000005 151.286589              40
-33.946740000000005 151.16714               33
-33.79345 151.2631                          30
-36.502179999
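
Joining the coordinates as strings works, but it exposes floating-point noise such as `151.20229619999998` in the keys. One alternative, sketched here on the five `df.head()` rows, is to group on the numeric columns directly and keep the tower name alongside. The `location_counts` name and the `records` column are illustrative choices.

```python
import pandas as pd

# The five rows shown by df.head(), reduced to the location columns.
sample = pd.DataFrame({
    'Cell Tower Location': ['REDFERN TE', 'REDFERN TE',
                            'HAYMARKET #', 'HAYMARKET #', 'HAYMARKET #'],
    'Latitude': [-33.892933, -33.892933, -33.880329, -33.880329, -33.880329],
    'Longitude': [151.202296, 151.202296, 151.20569, 151.20569, 151.20569],
})

# Count records per tower and coordinate pair, most frequent first.
location_counts = (
    sample.groupby(['Cell Tower Location', 'Latitude', 'Longitude'])
          .size()
          .sort_values(ascending=False)
          .reset_index(name='records')
)
print(location_counts)
```

Keeping latitude and longitude as separate numeric columns also makes the result easier to pass to a plotting or mapping library later.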

In [None]:
# Assuming the customer carries his phone at all times, the tower locations above are
# the only areas he visited during the period covered by the metadata (at least
# while the phone was communicating).
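
To put the Sydney and Tasmania locations in perspective, the great-circle distance between them can be estimated with the haversine formula. This sketch uses only the standard library; the function name `haversine_km` is an illustrative choice, and the two coordinate pairs are copied (rounded) from the `value_counts()` output above.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0  # mean Earth radius
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_km * asin(sqrt(a))


# Coordinates copied from the value_counts() output above.
top_sydney = (-33.78815, 151.26654)    # most frequent location
top_tasmania = (-42.84338, 147.29569)  # third most frequent location

distance = haversine_km(*top_sydney, *top_tasmania)
print(f"{distance:.0f} km")
```

A gap on the order of a thousand kilometres is consistent with the trip hypothesis, since it is too far for a daily commute.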

In [22]:
# Internet sessions dominate. Among person-to-person communications, calls (717)
# only slightly outnumber texts (657). Could that signal an older customer?
df['Comm Type'].value_counts()

Internet    9102
Phone        717
SMS          657
Name: Comm Type, dtype: int64

In [24]:
top_comm_df = df[df['Comm Identifier'] == 'bc0b01860486b0f0a240ce8419d3d7553fe404ab']

In [25]:
# The customer mainly texts his top contact (210 of 219 communications are SMS).
top_comm_df['Comm Type'].value_counts()

SMS      210
Phone      9
Name: Comm Type, dtype: int64
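
The two cells above check one contact at a time. The same breakdown for every contact at once could be sketched with `pd.crosstab`. The rows below are invented toy data (the `contact_a` and `contact_b` identifiers are hypothetical), not values from the dataset, and the `sms_share` column is an illustrative addition.

```python
import pandas as pd

# Hypothetical toy data: identifiers and types are invented for illustration.
toy = pd.DataFrame({
    'Comm Identifier': ['contact_a', 'contact_a', 'contact_a',
                        'contact_b', 'contact_b'],
    'Comm Type': ['SMS', 'SMS', 'Phone', 'Phone', 'Phone'],
})

# Rows are contacts, columns are communication types, cells are counts.
by_type = pd.crosstab(toy['Comm Identifier'], toy['Comm Type'])

# Share of each contact's communications that were SMS.
by_type['sms_share'] = by_type['SMS'] / by_type.sum(axis=1)
print(by_type)
```

On the real data, sorting by total volume and then comparing `sms_share` across the top contacts may help separate people he mostly texts from people he mostly calls.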