# Assignment 5

## Xu, Ruoying

In this assignment, I investigate the locations of 30 residents of New York using 30 days of their time-stamped Twitter data.

In [1]:
import pandas as pd
import itertools
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
import datetime
from sklearn import mixture
import math
import heapq
import os

import warnings
warnings.filterwarnings("ignore")

First, I write two functions to select the data that is useful for this study. I keep only the latitude, longitude, and time of each sample, and exclude samples stamped on weekends.

In [2]:
#get lat,long, and time from data
def get_variable(data):
    data = data.apply(lambda x: x.str.split(','),axis=1)
    data['time'] = data[0].apply(lambda x: x[0])
    data['lat'] = data[0].apply(lambda x: float(x[1]))
    data['long'] = data[0].apply(lambda x: float(x[2]))
    return data[['time','lat','long']]

# keep only weekday samples and express the time as a fractional hour of the day
def weekday(data):
    data['weekday'] = data['time'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S').weekday())
    # Monday=0 ... Friday=4; copy so the assignments below do not write into a view of data
    d = data.loc[data.weekday.isin([0,1,2,3,4])].copy()
    d['hour'] = d['time'].apply(lambda x: int(x[11:13]))
    d['min'] = d['time'].apply(lambda x: int(x[14:16]))
    d['time_of_day'] = d.hour + d['min'] / float(60)
    return d[['lat','long','time_of_day']]
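
To make the time feature concrete, here is a minimal standalone sketch of what `weekday` computes for a single timestamp. It assumes the `'%Y-%m-%d %H:%M:%S'` format used above; the helper name `time_of_day` and the two sample timestamps are made up for illustration and do not come from the data.

```python
import datetime

def time_of_day(stamp):
    """Return (weekday_index, fractional_hour) for one timestamp string."""
    t = datetime.datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
    # weekday(): Monday=0 ... Sunday=6, so values 0-4 are kept by the filter
    return t.weekday(), t.hour + t.minute / 60.0

# hypothetical timestamps, not taken from the data set
print(time_of_day('2015-03-02 18:30:00'))  # a Monday evening: kept
print(time_of_day('2015-03-07 09:15:00'))  # a Saturday morning: dropped by the filter
```

Seconds are ignored, as in the notebook, so the time feature has a resolution of one minute.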

## Use GMM to find the most visited location

Next, I design code that uses a GMM to determine whether each person is a commuter, where their home and work locations are, and at what times they are at home and at work. The GMM uses three features: latitude, longitude, and time of day. To find the home and work locations, I make the following assumptions:

    1. Every individual spends more time at home or at work than at any other location.
    2. A commuter is at work between 9am and 5pm and spends most of the remaining time at home.

Based on the above assumptions, I design the algorithm as follows. 

    1. I set the number of GMM components to 5, so that the model can capture clusters for home before work, home after work, work, and two other activity locations.

    2. Based on the first assumption, a location cluster qualifies as home or work only if its weight is at least 0.2.

    3. The model picks the two largest location clusters by weight. However, an individual may tweet more often at home than at work, so the two largest clusters could be home before work and home after work (since time is one of the features). I therefore compute the distance between the two largest clusters from their mean latitude and longitude. If the distance is less than 1 km, I assume they are the same location, replace the second largest cluster with the third largest, and compute the distance again. If some cluster is far from the largest cluster and has a weight of at least 0.2, it becomes the second important location. Otherwise I assume that the person has only one important location and is therefore not a commuter.

    4. Once I have the two important location clusters of a person, I decide which one is home and which one is work from their mean time-stamps: a commuter is at work between 9am and 5pm and spends most of the remaining time at home.

    5. If the mean time-stamps of the two most important clusters both fall in the work period, or both fall in the home period, the person is not a commuter and the home location is the cluster with the larger weight.

    6. If the algorithm produces only one important location for an individual, that location is the home location and the person is not a commuter.
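
Steps 2 and 3 can be sketched on their own, without fitting a GMM. The sketch below takes cluster weights and means as plain lists and returns the largest cluster plus the next cluster that is both heavy enough and far enough away. Two things are assumptions of this sketch rather than part of the notebook: the function names `haversine_km` and `pick_locations`, and the use of the haversine great-circle distance instead of the fixed per-degree scale factors used in the code cell below. The weights and means are hypothetical values.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, long) points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def pick_locations(weights, means, min_weight=0.2, min_dist_km=1.0):
    """Return (primary, secondary); secondary is None if no cluster qualifies."""
    # cluster indices sorted by decreasing weight
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    primary = means[order[0]]
    for i in order[1:]:
        if weights[i] < min_weight:
            break  # all remaining clusters are lighter still
        if haversine_km(primary[0], primary[1], means[i][0], means[i][1]) > min_dist_km:
            return primary, means[i]
    return primary, None

# hypothetical GMM output: weights and (lat, long, mean time of day) per cluster
weights = [0.35, 0.30, 0.25, 0.06, 0.04]
means = [
    (40.7888, -73.9568, 20.5),  # largest cluster, evening
    (40.7889, -73.9569, 7.5),   # same place in the morning, skipped (< 1 km away)
    (40.7527, -73.9772, 12.0),  # a few km away at midday, selected
    (40.7000, -74.0000, 15.0),
    (40.8000, -73.9000, 22.0),
]
primary, secondary = pick_locations(weights, means)
print(primary, secondary)
```

Returning `None` for a missing second location makes the "only one important location" case of step 6 explicit.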

In [3]:
location_final=[]
name_all=[]
for filename in os.listdir("./new_york"):
    if filename.endswith(".csv"):
        # read data
        df = pd.read_table('new_york/' + filename.split('.')[0]+ '.csv', header=None)
        # get weekday location and time
        x = weekday(get_variable(df))
        X = x.values  # as_matrix() is no longer available in recent pandas; .values returns the same array
        
        # fit a GMM
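        # assume that the person has 5 important locations
        # GaussianMixture replaces mixture.GMM, which newer scikit-learn releases no longer provide
        gmm = mixture.GaussianMixture(n_components=5, covariance_type='full', max_iter=100)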
        gmm.fit(X)
        Y_ = gmm.predict(X)
        
        #get parameters from GMM
        mm=gmm.means_  # center of each cluster
        wei=gmm.weights_  # weight of each cluster
        
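        # location = [home_lat, home_long, time_home, work_lat, work_long, time_work]
        location=np.zeros((6,), dtype=float)
        # cluster with the largest weight
        loc_1=mm[[i for i, x in enumerate(wei) if x == max(wei)][0]]
        # scale degrees by fixed factors to get approximate distances in km
        x1=loc_1[0]*89.7
        y1=loc_1[1]*112.7
        # reset the second location for each person; without this, a value found
        # for the previous file would be reused (or loc_2 would be undefined)
        loc_2=np.zeros(3)
        w=2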
        
        while w<=5:
            weight_2=heapq.nlargest(w, wei)[-1]
            loc_temp=mm[[i for i, x in enumerate(wei) if x == weight_2][0]]
            # if the weight for the second largest cluster is too small, we say that it is not a constantly visited place
            if weight_2<0.2:
                break
            x_temp=loc_temp[0]*89.7
            y_temp=loc_temp[1]*112.7
            dist= math.sqrt(pow((x_temp - x1), 2)+pow((y_temp - y1), 2))
            
            # distance cannot be within 1km between work and home, otherwise these two clusters are the same
            if dist>1:
                loc_2=loc_temp
                break
            w+=1
            
        # end while
            
        if loc_2[2]==0:
            location[0]=loc_1[0]
            location[1]=loc_1[1]
            location[2]=loc_1[2]
        elif loc_1[2]<=9 or loc_1[2]>17:
            location[0]=loc_1[0]
            location[1]=loc_1[1]
            location[2]=loc_1[2]
            if (9<loc_2[2]<=17):
                location[3]=loc_2[0]
                location[4]=loc_2[1]
                location[5]=loc_2[2]
        elif 9<loc_1[2]<=17 and 9<loc_2[2]<=17:
            location[0]=loc_1[0]
            location[1]=loc_1[1]
            location[2]=loc_1[2]
        else:
            location[3]=loc_1[0]
            location[4]=loc_1[1]
            location[5]=loc_1[2]
            if not (9<loc_2[2]<=17):
                location[0]=loc_2[0]
                location[1]=loc_2[1]
                location[2]=loc_2[2]
        name_all.append(filename)  
        location_final.append(location)

loc_final=np.array(location_final)    
name_final=np.array(name_all)
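
The branching at the end of the loop can also be read as a small decision function. The following is a condensed sketch of the same rules (steps 4 to 6), not a replacement for the cell above. The names `is_work_time` and `label_locations` and the sample values are hypothetical. A cluster counts as a work cluster when its mean time of day falls in the window (9, 17], matching the comparisons used in the loop.

```python
def is_work_time(hour, start=9, end=17):
    """True if a mean time of day falls in the (start, end] work window."""
    return start < hour <= end

def label_locations(loc_1, loc_2=None):
    """Assign home and work from two clusters given as (lat, long, mean_time_of_day).

    loc_1 is the cluster with the larger weight; loc_2 may be None.
    Returns (home, work), where work is None for a non-commuter.
    """
    if loc_2 is None:
        return loc_1, None            # only one important location
    w1, w2 = is_work_time(loc_1[2]), is_work_time(loc_2[2])
    if w1 and not w2:
        return loc_2, loc_1           # heavier cluster is the workplace
    if w2 and not w1:
        return loc_1, loc_2           # heavier cluster is home
    return loc_1, None                # both in the same period: non-commuter

# hypothetical clusters
evening = (40.79, -73.96, 19.7)
midday = (40.80, -73.95, 11.4)
print(label_locations(evening, midday))
print(label_locations(evening))
```

Using `None` instead of zeros for a missing work location avoids treating a time of exactly 0 as "no second cluster".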

## Summary of results
Then I put all the results into a pandas dataframe and show the table. The variables home_lat and home_long give the predicted home location of each person, and work_lat and work_long give the predicted work location. The variables time_home and time_work are the times of day at which the person is most likely to be found at home and at work. If an individual is identified as a non-commuter, the work columns are left as 0.

In [4]:
result = pd.DataFrame({'1_name':name_final,'2_home_lat':loc_final[:,0],'3_home_long':loc_final[:,1],
                       '4_time_home':loc_final[:,2],'5_work_lat':loc_final[:,3],'6_work_long':loc_final[:,4],
                      '7_time_work':loc_final[:,5]})
result

| | 1_name | 2_home_lat | 3_home_long | 4_time_home | 5_work_lat | 6_work_long | 7_time_work |
|---|---|---|---|---|---|---|---|
| 0 | Ana.csv | 40.788814 | -73.956774 | 19.736366 | 40.796585 | -73.946963 | 11.362135 |
| 1 | Billy.csv | 40.723435 | -74.043201 | 20.352286 | 0.0 | 0.0 | 0.0 |
| 2 | David.csv | 40.633213 | -74.075204 | 20.766845 | 0.0 | 0.0 | 0.0 |
| 3 | Dianne.csv | 40.669517 | -74.273722 | 18.968986 | 40.670708 | -74.25046 | 10.264149 |
| 4 | Donald.csv | 40.627877 | -74.397809 | 13.733401 | 0.0 | 0.0 | 0.0 |
| 5 | Elisabeth.csv | 40.727007 | -73.985331 | 19.540995 | 40.535571 | -74.288343 | 16.201707 |
| 6 | Garland.csv | 40.743575 | -74.009059 | 21.041185 | 40.762162 | -73.692088 | 12.897446 |
| 7 | George.csv | 40.716349 | -73.982437 | 19.93333 | 0.0 | 0.0 | 0.0 |
| 8 | Heather.csv | 40.822248 | -73.972712 | 20.642322 | 40.824436 | -73.962498 | 10.872598 |
| 9 | Hilda.csv | 40.704299 | -73.982405 | 19.24331 | 40.824436 | -73.962498 | 10.872598 |


## Name, home location and time being at home for non-commuters
The table below lists the non-commuters identified by the algorithm, with their home locations and the time of day at which they are most likely to be at home. The algorithm identifies 8 non-commuters. Their home locations are given by the variables "2_home_lat" and "3_home_long", and the most likely time to find each person at home is "4_time_home". For example, the most likely time to find Billy at home is around 8pm (a time of day of 20.35) according to the table below.

In [5]:
result[(result['7_time_work']==0)]

| | 1_name | 2_home_lat | 3_home_long | 4_time_home | 5_work_lat | 6_work_long | 7_time_work |
|---|---|---|---|---|---|---|---|
| 1 | Billy.csv | 40.723435 | -74.043201 | 20.352286 | 0 | 0 | 0 |
| 2 | David.csv | 40.633213 | -74.075204 | 20.766845 | 0 | 0 | 0 |
| 4 | Donald.csv | 40.627877 | -74.397809 | 13.733401 | 0 | 0 | 0 |
| 7 | George.csv | 40.716349 | -73.982437 | 19.93333 | 0 | 0 | 0 |
| 16 | Mary.csv | 40.759657 | -73.805436 | 20.766912 | 0 | 0 | 0 |
| 17 | Megan.csv | 40.947265 | -73.86382 | 19.893895 | 0 | 0 | 0 |
| 18 | Mildred.csv | 40.674527 | -73.871793 | 18.41648 | 0 | 0 | 0 |
| 23 | Rosalie.csv | 40.730011 | -74.032238 | 20.614257 | 0 | 0 | 0 |


## Name, home/work location and time being at home for commuters
The table below lists the commuters identified by the algorithm, with their home and work locations and the times of day at which they are most likely to be at home and at work. Everyone not listed in the previous table is identified as a commuter. Their home locations are given by the variables "2_home_lat" and "3_home_long", and their work locations by "5_work_lat" and "6_work_long". The most likely time to find the person at home is "4_time_home", and the most likely time to find the person at work is "7_time_work".

For example, the most likely time to find Ana at home is around 7:45pm, and the most likely time to find Dianne at work is around 10:15am, according to the table below.
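
The time columns are fractional hours of the day, so reading them as clock times takes a small conversion. A minimal sketch follows; the helper name `to_clock` is hypothetical, and the two inputs are the values for Ana and Dianne from the table.

```python
def to_clock(hours):
    """Convert a fractional hour of the day (e.g. 19.74) to 'HH:MM', rounded to the minute."""
    total = int(round(hours * 60)) % (24 * 60)  # wrap 24:00 back to 00:00
    return '{:02d}:{:02d}'.format(total // 60, total % 60)

print(to_clock(19.736366))  # Ana, 4_time_home   -> 19:44
print(to_clock(10.264149))  # Dianne, 7_time_work -> 10:16
```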

In [6]:
result[(result['7_time_work']>0)]

| | 1_name | 2_home_lat | 3_home_long | 4_time_home | 5_work_lat | 6_work_long | 7_time_work |
|---|---|---|---|---|---|---|---|
| 0 | Ana.csv | 40.788814 | -73.956774 | 19.736366 | 40.796585 | -73.946963 | 11.362135 |
| 3 | Dianne.csv | 40.669517 | -74.273722 | 18.968986 | 40.670708 | -74.25046 | 10.264149 |
| 5 | Elisabeth.csv | 40.727007 | -73.985331 | 19.540995 | 40.535571 | -74.288343 | 16.201707 |
| 6 | Garland.csv | 40.743575 | -74.009059 | 21.041185 | 40.762162 | -73.692088 | 12.897446 |
| 8 | Heather.csv | 40.822248 | -73.972712 | 20.642322 | 40.824436 | -73.962498 | 10.872598 |
| 9 | Hilda.csv | 40.704299 | -73.982405 | 19.24331 | 40.824436 | -73.962498 | 10.872598 |
| 10 | James.csv | 40.736375 | -73.877507 | 19.34197 | 40.752649 | -73.955119 | 12.941478 |
| 11 | Jerrie.csv | 40.722768 | -74.040232 | 19.807802 | 40.717747 | -74.028192 | 10.996355 |
| 12 | John.csv | 40.964571 | -74.077437 | 19.324833 | 40.814759 | -74.104075 | 11.956605 |
| 13 | Latasha.csv | 40.757927 | -73.998582 | 21.250817 | 40.839806 | -74.104403 | 11.266519 |
