# DAT210x - Programming with Python for DS

## Module 5 - Lab 2

Start by importing whatever you need to import in order to make this lab work:

In [2]:
# Magic command, works inside jupyter notebooks
# This includes an interactive control/renderer and does not require plt.show()
%matplotlib notebook

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt
import matplotlib

matplotlib.style.use('ggplot') # Make it look pretty

### CDRs

A [call detail record](https://en.wikipedia.org/wiki/Call_detail_record) (CDR) is a data record produced by a telephone exchange or other telecommunications equipment that documents the details of a telephone call or other telecommunications transaction (e.g., text message) that passes through that facility or device.

The record contains various attributes of the call, such as time, duration, completion status, source number, and destination number. It is the automated equivalent of the paper toll tickets that were written and timed by operators for long-distance calls in a manual telephone exchange.

The dataset we've curated for you contains call records for 10 people, tracked over the course of 3 years. Your job in this assignment is to find out where each of these people likely lives and works!

Start by loading up the dataset and taking a peek at its head. You can convert date-strings to real date-time objects using `pd.to_datetime`, and the times using `pd.to_timedelta`:

In [3]:
df = pd.read_csv('./Datasets/CDR.csv', header=0)
# Assign whole columns directly so the new datetime/timedelta dtypes are kept
df['CallDate'] = pd.to_datetime(df['CallDate'], errors='coerce')
df['CallTime'] = pd.to_timedelta(df['CallTime'], errors='coerce')
#df.shape
df.head(10)
# df.dtypes

| | In | Out | Direction | CallDate | CallTime | DOW | Duration | TowerID | TowerLat | TowerLon |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4638472273 | 2666307251 | Incoming | 2010-12-25 | 07:16:24.736813 | Sat | 0:02:41.741499 | 0db53dd3-eb9c-4344-abc5-c2d74ebc3eec | 32.731611 | -96.709417 |
| 1 | 4638472273 | 1755442610 | Incoming | 2010-12-25 | 21:18:30.053710 | Sat | 0:02:47.108750 | aeaf8b43-8034-44fe-833d-31854a75acbf | 32.731722 | -96.7095 |
| 2 | 4638472273 | 5481755331 | Incoming | 2010-12-25 | 14:52:42.878016 | Sat | 0:04:35.356341 | fadaa83f-6001-45fd-aa4a-17d6c6b7ec00 | 32.899944 | -96.910389 |
| 3 | 4638472273 | 1755442610 | Incoming | 2010-12-25 | 16:02:09.001913 | Sat | 0:02:23.498499 | fadaa83f-6001-45fd-aa4a-17d6c6b7ec00 | 32.899944 | -96.910389 |
| 4 | 4638472273 | 2145623768 | Incoming | 2010-12-25 | 15:28:35.028554 | Sat | 0:03:54.692497 | 95d7920d-c3cd-4d20-a568-9a55800dc807 | 32.899944 | -96.910389 |
| 5 | 4638472273 | 2946222380 | Incoming | 2010-12-25 | 11:38:17.275327 | Sat | 0:03:06.670355 | 95c91e8b-6ff1-4893-9df3-b0342636bd25 | 32.899944 | -96.910389 |
| 6 | 4638472273 | 7841019020 | Missed | 2010-12-25 | 10:38:35.924232 | Sat | 0:02:02.855268 | fadaa83f-6001-45fd-aa4a-17d6c6b7ec00 | 32.899944 | -96.910389 |
| 7 | 1559410755 | 6092528894 | Missed | 2010-12-25 | 15:15:56.502972 | Sat | 0:11:52.952187 | b4319acf-b475-4c3e-a2e0-03b2dd2daf9e | 32.696722 | -96.934306 |
| 8 | 1559410755 | 6092528894 | Incoming | 2010-12-25 | 20:15:19.667734 | Sat | 0:11:52.951080 | f958754c-3d55-47c4-8236-50b964a7b997 | 32.870972 | -96.923556 |
| 9 | 1559410755 | 8125446700 | Missed | 2010-12-25 | 10:01:02.162977 | Sat | 0:14:11.046844 | 07dec2d7-b5d1-410d-8879-ecf7385af719 | 32.696083 | -96.934333 |
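
The conversion step can be tried in isolation. The following is a minimal sketch on a hypothetical two-row sample (the `sample` frame and its deliberately malformed date are made up for illustration); it shows how `errors='coerce'` turns an unparseable value into `NaT` instead of raising:

```python
import pandas as pd

# Hypothetical two-row sample mimicking the CallDate / CallTime columns
sample = pd.DataFrame({
    'CallDate': ['2010-12-25', 'not a date'],
    'CallTime': ['07:16:24.736813', '21:18:30.053710'],
})

# errors='coerce' replaces values that cannot be parsed with NaT
sample['CallDate'] = pd.to_datetime(sample['CallDate'], errors='coerce')
sample['CallTime'] = pd.to_timedelta(sample['CallTime'], errors='coerce')

print(sample.dtypes)
print(sample)
```

After conversion, the time column can be compared against `pd.Timedelta` values rather than raw strings.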


Create a distinct list of `In` phone numbers (people) and store the values in a regular Python list. Make sure the numbers appear in your list in the same order they appear in your dataframe, but only keep a single copy of each number. [This link](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.tolist.html) might also be helpful.

In [4]:
uniquecallers = df.In.unique().tolist()
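
As a quick check of the ordering requirement, here is a small sketch with made-up values (the `numbers` series is hypothetical). `Series.unique()` keeps values in order of first appearance rather than sorting them, which is why it fits this step:

```python
import pandas as pd

# Hypothetical values with repeats, deliberately not in sorted order
numbers = pd.Series([30, 10, 30, 20, 10])

# unique() preserves the order of first appearance; tolist() gives a plain list
ordered = numbers.unique().tolist()
print(ordered)

# By contrast, sorting a set loses the original order
print(sorted(set(numbers)))
```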

Create a slice named `user1` that filters to _only_ include dataset records where the `In` feature (user's phone number) is equal to the first number on your unique list above, i.e., the very first number in the dataset:

In [5]:
user1slice = df.loc[:, 'In'] == uniquecallers[0]
user1 = df.loc[user1slice, :]
user1.shape

(3648, 10)

Let's go ahead and plot all the call locations:

In [6]:
user1.plot.scatter(x='TowerLon', y='TowerLat', c='gray', alpha=0.1, title='Call Locations')
plt.show()

INFO: The locations map above should be too "busy" to really wrap your head around. This is where domain expertise comes into play. Your intuition can direct you by knowing people are likely to behave differently on weekends vs on weekdays:

#### On Weekends
1. People probably don't go into work
1. They probably sleep in late on Saturday
1. They probably run a bunch of random errands, since they couldn't during the week
1. They should be home, at least during the very late hours, e.g. 1-4 AM

#### On Weekdays
1. People probably are at work during normal working hours
1. They probably are at home in the early morning and during the late night
1. They probably spend time commuting between work and home every day

Add more filters to the `user1` slice you created. Add bitwise logic so that you only examine records that _came in_ on weekends (Sat/Sun):

In [7]:
user1slice2A = (df.loc[:, 'In'] == uniquecallers[0])
user1slice2B = (df.loc[:, 'DOW'] == 'Sat')
user1slice2C = (df.loc[:, 'DOW'] == 'Sun')
user1slice2D = (df.loc[:, 'CallTime'] < '06:00:00')
user1slice2E = (df.loc[:, 'CallTime'] > '22:00:00')
user1slice2 = user1slice2A & (user1slice2B | user1slice2C) & (user1slice2D | user1slice2E)
user1 = df.loc[user1slice2, :]
#user1.head(10)

Further filter `user1` down to calls that came in either before 6 AM OR after 10 PM (22:00:00); the cell above already applies this filter through `user1slice2D` and `user1slice2E`. Even if you didn't convert your times from string objects to timedeltas, you can still use `<` and `>` to compare the string times as long as you code them as [military time strings](https://en.wikipedia.org/wiki/24-hour_clock), e.g. "06:00:00", "22:00:00".

You may also want to review the Data Manipulation section for this. Once you have your filtered slice, print out its length:

In [8]:
user1.shape

(28, 10)
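
The weekend and late-hour filter can be sketched on a tiny hypothetical frame (the `calls` data below is made up; only the column names follow the dataset). This version uses `isin` and explicit `pd.Timedelta` bounds, which is one possible way to write the same bitwise logic, not the only one:

```python
import pandas as pd

# Hypothetical records; column names follow the dataset above
calls = pd.DataFrame({
    'DOW': ['Sat', 'Sat', 'Sun', 'Mon', 'Sun'],
    'CallTime': pd.to_timedelta(
        ['05:30:00', '12:00:00', '23:15:00', '02:00:00', '06:00:00']),
})

is_weekend = calls['DOW'].isin(['Sat', 'Sun'])

# Each comparison is parenthesized because | and & bind tighter than < and >
is_late = ((calls['CallTime'] < pd.Timedelta(hours=6)) |
           (calls['CallTime'] > pd.Timedelta(hours=22)))

late_weekend = calls[is_weekend & is_late]
print(late_weekend)

# Zero-padded 24-hour strings compare correctly as plain text,
# but an unpadded string such as '5:30:00' does not
print('05:30:00' < '06:00:00', '5:30:00' < '06:00:00')
```

Note that the Sunday call at exactly 06:00:00 is excluded, because both comparisons are strict.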

Visualize the dataframe with a scatter plot as a sanity check. Since you're [familiar with maps](https://en.wikipedia.org/wiki/Geographic_coordinate_system#Geographic_latitude_and_longitude), you know that your X coordinate should be the tower Longitude, and your Y coordinate should be the tower Latitude. Check the dataset headers for the proper column names.

At this point, you don't yet know exactly where the user is located based on the cell phone tower position data alone. But since the calls below arrived in the late-night and early-morning hours of weekends, wherever they are bunched up is probably near the caller's residence:

In [9]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(user1.TowerLon,user1.TowerLat, c='g', marker='o', alpha=0.2)
ax.set_title('Weekend Calls (<6am or >10p)')
plt.show()

Run K-Means with `K=1`. There really should only be a single area of concentration. If you notice multiple "hot" areas (places the user spends a lot of time that are FAR apart from one another), increase to `K=2`, with the goal that one of the centroids sweeps up the annoying outliers while the other zeros in on the user's approximate home location, or rather the location of the cell tower closest to their home.

Be sure to only feed in Lat and Lon coordinates to the KMeans algorithm, since none of the other data is suitable for your purposes. Since both Lat and Lon are (approximately) on the same scale, no feature scaling is required. Print out the centroid locations and add them onto your scatter plot. Use a distinguishable marker and color.

Hint: Make sure you graph the CORRECT coordinates. This is part of your domain expertise.
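
To see how close "approximately on the same scale" is, here is a rough back-of-the-envelope sketch. It assumes a spherical Earth, a latitude of about 32.8 degrees (read off the sample rows above), and a rounded figure of 111 km per degree of latitude; none of these numbers come from the lab itself:

```python
import math

# Approximate latitude of the towers in the sample rows (an assumption)
lat_deg = 32.8

# Rounded length of one degree of latitude in km (spherical approximation)
km_per_deg_lat = 111.0

# One degree of longitude shrinks with the cosine of the latitude
km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(lat_deg))

ratio = km_per_deg_lon / km_per_deg_lat
print(round(km_per_deg_lon, 1), round(ratio, 2))
```

Under these assumptions a degree of longitude is somewhat shorter than a degree of latitude, but the two are of the same order, which is consistent with skipping feature scaling here.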

In [10]:
def doKMeans(df):
    # Plot the data with a '.' marker and a 0.3 alpha.
    # Longitude = x, Latitude = y
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(df.TowerLon, df.TowerLat, c='pink', marker='.', alpha=0.3)

    # Filter `df` using indexing so it only contains Latitude and Longitude,
    # since the remaining columns aren't really applicable for this lab:
    columns = ['TowerLat', 'TowerLon']
    df = df.loc[:, columns]

    # Use K-Means to find two cluster centers in this df, so that one
    # centroid can absorb outliers. The model is named `model` so that
    # the printing below works.
    model = KMeans(n_clusters=2).fit(df)

    # Print and plot the centroids. Each row is [lat, lon], so column 1
    # (longitude) goes on x and column 0 (latitude) goes on y:
    centroids = model.cluster_centers_
    print(centroids)
    ax.scatter(centroids[:,1], centroids[:,0], marker='x', c='red', alpha=0.5, linewidths=3, s=169)

    return centroids.tolist()
    
locations = doKMeans(user1)

[[ 32.73164942 -96.70944573]
 [ 32.750556   -96.694722  ]]


In [11]:
locations

[[32.731649423076924, -96.70944573076923], [32.750556, -96.694722]]
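
The effect of moving from `K=1` to `K=2` can be illustrated on synthetic data. The points below are hypothetical, and `n_init=10` and `random_state=0` are added only to make the sketch repeatable; they are not settings taken from the lab:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical [lat, lon] points: a tight group plus one far-away outlier
points = np.array([
    [32.7316, -96.7094],
    [32.7317, -96.7095],
    [32.7316, -96.7095],
    [32.7317, -96.7094],
    [33.0150, -96.8310],   # outlier
])

# With K=1 the single centroid is the mean of all points,
# so the outlier drags it away from the tight group
k1 = KMeans(n_clusters=1, n_init=10, random_state=0).fit(points)
print(k1.cluster_centers_)

# With K=2 one centroid absorbs the outlier and the other sits on the group
k2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(k2.cluster_centers_)
```

The order in which the two centroids are returned is not guaranteed, so any downstream code should not assume that the first row is the home location.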

Now that you have a system in place, repeat the above steps for all 10 individuals in the dataset, being sure to record their approximate home locations. You might want to use a for-loop, unless you enjoy copying and pasting:

In [14]:
usercentroids = []

for caller in uniquecallers:
    # Same filter as above: this caller, weekends only, before 6 AM or after 10 PM
    is_caller = (df.loc[:, 'In'] == caller)
    is_weekend = df.loc[:, 'DOW'].isin(['Sat', 'Sun'])
    is_late = (df.loc[:, 'CallTime'] < '06:00:00') | (df.loc[:, 'CallTime'] > '22:00:00')
    userframe = df.loc[is_caller & is_weekend & is_late, :]

    usercentroids.append(doKMeans(userframe))
    
usercentroids

[[ 32.73164942 -96.70944573]
 [ 32.750556   -96.694722  ]]


[[ 32.87096756 -96.92355156]
 [ 32.871111   -96.923556  ]]


[[ 32.86592718 -96.865298  ]
 [ 32.857778   -96.864444  ]]


[[ 32.84635163 -96.83515822]
 [ 32.861222   -96.852389  ]]


[[ 32.875    -96.730278]
 [ 32.917333 -96.759694]]


[[ 32.770833 -96.685556]
 [ 32.770833 -96.685556]]


[[ 32.705222 -96.840667]
 [ 32.695    -96.840556]]


[[ 32.703056 -96.604444]
 [ 32.703056 -96.604444]]


[[ 32.77401172 -96.81277401]
 [ 32.702      -96.920139  ]]


[[ 32.7722949  -96.77946848]
 [ 33.01525    -96.831472  ]]


[[[32.731649423076924, -96.70944573076923], [32.750556, -96.694722]],
 [[32.870967564356434, -96.92355156435644], [32.871111, -96.923556]],
 [[32.86592718181818, -96.865298], [32.857778, -96.86444399999999]],
 [[32.84635162962963, -96.83515822222222], [32.861222, -96.852389]],
 [[32.875, -96.730278], [32.917333, -96.759694]],
 [[32.770833, -96.685556], [32.770833, -96.685556]],
 [[32.705222, -96.84066700000001], [32.695, -96.840556]],
 [[32.703056, -96.604444], [32.703056, -96.604444]],
 [[32.77401171698113, -96.81277400943395], [32.702, -96.920139]],
 [[32.7722949047619, -96.77946847619049], [33.01525, -96.83147199999999]]]
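
Each person ends up with two centroids, and the lab does not say how to pick the home location between them. One plausible rule, offered here only as an assumption, is to take the centroid of the cluster that holds the most calls. The `likely_home` helper and the sample points below are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def likely_home(latlon, k=2):
    """Return the centroid of the most populated cluster (hypothetical helper)."""
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(latlon)
    counts = np.bincount(model.labels_, minlength=k)  # number of points per cluster
    return model.cluster_centers_[counts.argmax()].tolist()

# Hypothetical [lat, lon] samples: many calls near one tower, one stray call
latlon = np.array([[32.7316, -96.7094]] * 5 + [[32.7506, -96.6947]])
home = likely_home(latlon)
print(home)
```

This rule relies on the late-night weekend filter having already removed most non-home activity; if two clusters are similar in size, the result deserves a manual look at the plot.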
