# How to use GPS data to learn important places for users
## Introduction

According to Statista, the number of mobile phone users reached 4.61 billion in 2016. As mobile devices become more common and people spend more time on them, we can retrieve more information about users from the behavior recorded by their devices and applications. GPS data is one of the most significant of these datasets. With GPS data obtained from a mobile application, we can understand a user's movement and behavior for better user profiling.

However, due to the nature of GPS data, we may not get significant insight from the raw records directly. Therefore, this tutorial serves as a starting point for GPS data analysis. I will introduce a framework that extracts important information from a GPS dataset by filtering and clustering. To be more specific, this tutorial extracts the "important places" for each user. After going through it, you will be able to use GPS data to find each user's important places, and you can use that for further analysis such as movement analysis.


## GPS Data source

The data source can be any GPS data you have collected. In this tutorial, we assume the input is a csv file with tab-separated fields in the following format:
*******************************************************
<pre>2016-07-09 00:07:31	0	-17.010466	145.738624
2016-07-10 10:44:45	0	-17.0017640005282	145.724898
2016-07-10 10:44:43	0	-17.0017640005282	145.724898
2016-07-10 10:33:38	0	-17.0017640005282	145.724898
2016-07-10 09:43:06	0	-17.0017640005282	145.72489
2016-07-12 13:40:48	1	-35.141066	138.495843
2016-07-13 13:35:55	1	-35.141057	138.495833
2016-07-13 13:30:40	1	-35.14106	138.49582
2016-07-13 13:25:47	1	-35.141055	138.495807
2016-07-14 13:21:33	1	-35.141066	138.49582
2016-07-14 13:15:39	1	-35.141062	138.495814
2016-07-14 13:12:54	1	-35.141058	138.495827
2016-07-14 13:07:54	1	-35.141064	138.495832</pre>
*******************************************************

To elaborate, the first field is the timestamp, the second field is a unique user ID that identifies each user, the third field is the latitude, and the fourth field is the longitude. Fields are separated by tabs.
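
As a minimal sketch of this layout, the snippet below parses one tab-separated line into typed fields. The helper name `parse_line` is hypothetical and is not part of the tutorial code, which keeps every field as a string.

```python
from datetime import datetime

def parse_line(line):
    """Split one tab-separated record into (timestamp, user_id, lat, lng)."""
    timestamp, user_id, lat, lng = line.rstrip("\n").split("\t")
    return (
        datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
        user_id,          # kept as a string, it is only used as a label
        float(lat),
        float(lng),
    )

# First line of the sample data shown above
sample = "2016-07-09 00:07:31\t0\t-17.010466\t145.738624"
record = parse_line(sample)
print(record)
```

Converting the coordinates to `float` up front is a design choice of this sketch; the tutorial code converts them only when a distance is computed.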

The sample GPS data I used for this tutorial came from users of a certain mobile application. Since the application has GPS tracking code embedded, we can obtain the users' GPS records. I do not share the dataset because it is sensitive and cannot be disclosed. However, I will try to make the process and framework clear enough to follow without it. By running the code here on input that is consistent with the format described above, you should be able to analyze your own GPS dataset too.


## The content of tutorial

This tutorial covers several functions that process, clean and cluster user GPS data. 
1. Preprocess the GPS dataset: 
    divide all GPS records into different groups based on user ID.
    
2. Filter out non-static GPS records: 
    after the first step, we filter out non-static data, since we want information about the user's important places and do not want to be confused by records taken while the user is moving.
    
3. Cluster GPS records into "locations":
    after we obtain the static GPS records for each user, we cluster the records based on latitude and longitude. This step is important because GPS data is very granular and the coordinates recorded for one physical location may vary, so we group nearby places into a single location to make the study more useful. Since location-based data is a special type of data, we use a variant of the k-means clustering algorithm that clusters the data based on a given radius instead of a number of clusters K.

4. Identify important locations for each user:
    after we obtain the "locations", we can identify which of them matter more to the user with indicators such as visit frequency and duration of stay.



In [None]:
import csv
import itertools
import operator
from math import radians, cos, sin, asin, sqrt,atan2
from datetime import datetime,timedelta

## 1. Preprocess the GPS dataset:


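We group the original data by user. For each user, we will have a list: the first element is the user ID and the second element is a list of dictionaries holding all the GPS records for that user. Note that `readCSV` returns only the per-user lists of dictionaries; the final loop of this tutorial pairs each of them with its user ID before calling the later steps.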

<pre>
format: 
[
[userid1,[list of dictionary of the GPS records for this user1]],
...,
[userid2,[list of dictionary of the GPS records for this user2]]
]

For example:
[
['12345', [{'timestamp':'2010-10-12 09:39:45', 'id':'12345', 'lat':'13.2345', 'lng':'100.3923'},{'timestamp':'2010-10-12 09:45:29', 'id':'12345', 'lat':'13.23467', 'lng':'100.4031'},{'timestamp':'2010-10-12 09:45:39', 'id':'12345', 'lat':'13.23467', 'lng':'100.4031'}]],
['67891', [{'timestamp':'2010-10-12 09:39:45', 'id':'67891', 'lat':'13.2345', 'lng':'100.3923'}, {'timestamp':'2010-10-12 09:45:29', 'id':'67891', 'lat':'13.23467', 'lng':'100.4031'}]]
]
</pre>
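
The grouping described above can be sketched on a few in-memory records. This is only an illustration under the assumption that every field is a string, as it would be when read from the file; the variable names `records` and `grouped` are made up for the example.

```python
import itertools
import operator

records = [
    {'timestamp': '2010-10-12 09:45:29', 'id': '12345', 'lat': '13.23467', 'lng': '100.4031'},
    {'timestamp': '2010-10-12 09:39:45', 'id': '67891', 'lat': '13.2345', 'lng': '100.3923'},
    {'timestamp': '2010-10-12 09:39:45', 'id': '12345', 'lat': '13.2345', 'lng': '100.3923'},
]

# groupby only merges adjacent items, so sort by user id (then by time) first
records.sort(key=operator.itemgetter('id', 'timestamp'))

# build [user_id, [records of that user]] pairs
grouped = [
    [user_id, list(group)]
    for user_id, group in itertools.groupby(records, key=operator.itemgetter('id'))
]

for user_id, user_records in grouped:
    print(user_id, len(user_records))
```

Sorting on the timestamp as a secondary key also leaves each user's records in chronological order, which the filtering step relies on.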

In [2]:
#############################
#   The support function    #
#############################

def haversine(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    R = 6371000 #earth radius in meters
    phi1 = radians(lat1)
    phi2 = radians(lat2)
    deltaphi = radians(lat2-lat1)
    deltalamba = radians(lon2-lon1)

    a = sin(deltaphi/2)*sin(deltaphi/2)+cos(phi1)*cos(phi2)*sin(deltalamba/2)*sin(deltalamba/2)
    c = 2*atan2(sqrt(a),sqrt(1-a))
    d = R*c
    return d

def timedelta_total_seconds(td):
    """ calculate the time difference in seconds using a timedelta object
        Args:
            td (timedelta Object): created by Time1-Time2
        Output:
            (float): secs in time differences  
    """    
    # the parameter is named td so that it does not shadow the imported timedelta class
    return td.total_seconds()

#########################################
#   The GPS data processing function    #
#########################################

def transformListToDict(data):
    """ transform input record to the dictionary format. 
        Args:
            data (list) : a record in the input csv file. 
                        format: 2015-07-12 13:07:54	1	-35.141064	138.495832
        Output:
            (dict) : the dictionary version of the record. 
                    format: {'timestamp':'2015-07-12 13:07:54', 'id':'1', 'lat':'-35.141064', 'lng':'138.495832'}
    """
    column_names = ['timestamp', 'id', 'lat', 'lng']
    parsed_dict = {}
    for i in range(len(column_names)):
        parsed_dict[column_names[i]] = data[i]
    return parsed_dict

def readCSV(filename):
    """ read the tab-separated input file line by line and apply the transformListToDict function to each row.
        Args:
            filename (String) : input file name
        Output:
            (list) : one list of record dictionaries per user, sorted by timestamp.
                    format: 
                    [[list of dictionary of the GPS records for user1],
                    ...,
                    [list of dictionary of the GPS records for user2]]
                    ex:
                    [[{'timestamp':'2010-10-12 09:39:45', 'id':'12345', 'lat':'13.2345', 'lng':'100.3923'}, 
                    {'timestamp':'2010-10-12 09:45:29', 'id':'12345', 'lat':'13.23467', 'lng':'100.4031'}],
                    [{'timestamp':'2010-10-12 09:39:45', 'id':'67891', 'lat':'13.2345', 'lng':'100.3923'}, 
                    {'timestamp':'2010-10-12 09:45:29', 'id':'67891', 'lat':'13.23467', 'lng':'100.4031'}]]
    """    
    gps = []
    with open(filename) as f:
        reader = csv.reader(f,delimiter='\t')
        for row in reader:
            if row: #skip blank lines
                gps.append(transformListToDict(row))
    #sort by user id, then by timestamp: the filtering step assumes chronological order,
    #and this timestamp format sorts correctly as a string
    gps.sort(key=operator.itemgetter("id", "timestamp"))
    keyfunc = operator.itemgetter("id")
    User_proccessed_List = [list(grp) for key, grp in itertools.groupby(gps, key=keyfunc)]
    return User_proccessed_List

######################
#   Load GPS data    #
######################
User_proccessed_List = readCSV("GPS_sample.csv")

## 2. Filter out non-static GPS records:

Since we want to identify the important "places" for each user, we keep the GPS records that were recorded while the user was not moving and filter out the trajectory records. 

[<img src="https://c2.staticflickr.com/6/5707/30110692184_6875e87326_z.jpg">](https://c2.staticflickr.com/6/5707/30110692184_6875e87326_z.jpg)

To filter out the non-static GPS records, we go through every record of a user and keep only the records that are identified as belonging to a static place. The principle, illustrated above, is that the user moves only a short distance and stays for a long time. In detail, we keep adding records to a temporary batch as long as the distance between the first record in the batch and the current record is within the distance threshold. Once we meet a record that is farther from the first record of the batch than the distance threshold, we check the time duration of the current batch. If the duration is long enough (at least the time threshold), all records in the batch are identified as "static GPS records", meaning that they were recorded while the user was not moving. The following steps and figure walk through the process to give you a clearer idea.

<pre>
1. We have the first record in the batch.
2. Mark the first record as the center of a circle whose radius is the distance threshold.
3. Let's look at the second record. The second record is within the distance threshold.
4. Let's look at the third record. The third record is within the distance threshold.
5. Let's look at the fourth record. The fourth record is outside the distance threshold.
6. We then check the time duration from the first record to the third record. If the duration is long enough, records 1-3 are considered static place records.
</pre>
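
The steps above can be sketched on a toy trace. This is a simplified illustration rather than the tutorial's function: the names `distance_m`, `batch_duration` and `static_batches` are hypothetical, the points are invented, and the records are assumed to be in chronological order.

```python
from datetime import datetime
from math import radians, sin, cos, atan2, sqrt

FMT = "%Y-%m-%d %H:%M:%S"

def distance_m(lat1, lon1, lat2, lon2):
    """Haversine distance in meters between two points in decimal degrees."""
    r = 6371000  # mean earth radius in meters
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlam / 2) ** 2
    return 2 * r * atan2(sqrt(a), sqrt(1 - a))

def batch_duration(points, batch):
    """Seconds between the first and the last record of a batch."""
    first = datetime.strptime(points[batch[0]][0], FMT)
    last = datetime.strptime(points[batch[-1]][0], FMT)
    return (last - first).total_seconds()

def static_batches(points, max_dist_m=200, min_secs=300):
    """Return the index lists of batches that stay close together long enough.

    points: chronologically ordered (timestamp_string, lat, lng) tuples.
    """
    result, batch = [], []
    for i, (_, lat, lng) in enumerate(points):
        if batch:
            _, lat0, lng0 = points[batch[0]]  # first record is the circle center
            if distance_m(lat0, lng0, lat, lng) <= max_dist_m:
                batch.append(i)
                continue
            # the record left the circle: close the batch and test its duration
            if batch_duration(points, batch) >= min_secs:
                result.append(batch)
        batch = [i]  # start a new batch centered on this record
    if batch and batch_duration(points, batch) >= min_secs:
        result.append(batch)
    return result

points = [
    ("2016-07-12 13:00:00", -35.141066, 138.495843),
    ("2016-07-12 13:05:00", -35.141057, 138.495833),
    ("2016-07-12 13:10:00", -35.141060, 138.495820),
    ("2016-07-12 13:12:00", -35.151066, 138.495843),  # roughly 1.1 km away
]
print(static_batches(points))
```

With these invented points, records 1 to 3 span ten minutes within a few meters of each other, so they form the only static batch; the fourth record starts a new batch that is too short to count.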

[<img src="https://c2.staticflickr.com/6/5707/30627569192_fe637998e3_b.jpg">](https://c2.staticflickr.com/6/5707/30627569192_fe637998e3_b.jpg)

We will also add information to each record. After filtering out the moving GPS records, we know how long the user stayed in each batch (place), and we add this duration to the record's dictionary.

In [None]:
ThresDistance = 200 # meters
ThresTime = 300 # seconds

#########################################################
#   The function to filter out non-static GPS records   #
#########################################################

def retain_static_only(data):
    """ filter out non-static GPS records for each user.
        Args:
            data (list) : a list for a certain user: [userid1,[list of dictionary of the GPS records for this user1]
                          The keys in the dictionary for each record: timestamp, id, lat, lng
        Output:
            (list) : [userid1,[list of dictionary of the GPS records for this user1]] (retain only static GPS records)
                    The keys in the dictionary for each record: timestamp, id, lat, lng, place, duration
    """  
    #data: a list for a certain user like we describe in the first step.
    userID = data[0]
    list_record = data[1]
    member_in_cur_cluster =[]
    place_count = 0
    
    #go through each record
    for i in range(len(list_record)):
        #for the first record
        if i == 0:
            member_in_cur_cluster.append(i)
            temp_location_center_lat = float(list_record[i]['lat'])
            temp_location_center_lon = float(list_record[i]['lng'])

        else:			
            #within distance threshold
            if(haversine(float(list_record[i]['lat']),float(list_record[i]['lng']),temp_location_center_lat,temp_location_center_lon) <= ThresDistance):
                member_in_cur_cluster.append(i)
                #if this is the last record, close the current batch (it now includes record i)
                if(i == len(list_record)-1):
                    last_time = datetime.strptime(list_record[member_in_cur_cluster[-1]]['timestamp'],"%Y-%m-%d %H:%M:%S")
                    begin_time = datetime.strptime(list_record[member_in_cur_cluster[0]]['timestamp'],"%Y-%m-%d %H:%M:%S")
                    duration = timedelta_total_seconds(last_time-begin_time)
                    #check if the time duration is long enough to be a static place
                    if(duration >= ThresTime):
                        for k in member_in_cur_cluster:
                            list_record[k]['place'] = "place_"+str(place_count)
                            list_record[k]['duration'] = duration
                        place_count=place_count+1
                    else:
                        for k in member_in_cur_cluster:
                            list_record[k]['place'] = "no"

            #out of distance threshold
            else:
                #close the current batch first, even when this is the last record
                last_time = datetime.strptime(list_record[member_in_cur_cluster[-1]]['timestamp'],"%Y-%m-%d %H:%M:%S")
                begin_time = datetime.strptime(list_record[member_in_cur_cluster[0]]['timestamp'],"%Y-%m-%d %H:%M:%S")
                duration = timedelta_total_seconds(last_time-begin_time)
                #check if the time duration is long enough to be a static place
                if(duration >= ThresTime):
                    for k in member_in_cur_cluster:
                        list_record[k]['place'] = "place_"+str(place_count)
                        list_record[k]['duration'] = duration
                    place_count=place_count+1
                else:
                    for k in member_in_cur_cluster:
                        list_record[k]['place'] = "no"
                #start a new batch centered on the current record
                temp_location_center_lat = float(list_record[i]['lat'])
                temp_location_center_lon = float(list_record[i]['lng'])
                member_in_cur_cluster = [i]
                #a last record that starts its own batch is not treated as a static place
                if(i == len(list_record)-1):
                    list_record[i]['place'] = "no"
    
    list_record = [x for x in list_record if ('duration' in x.keys())]	
    return (userID, list_record)

Below is the result for 30 days of GPS data from a sample user, visualized with Tableau.
The original GPS data visualization:
[<img src="https://c2.staticflickr.com/6/5450/30626476102_21a90ce16c_b.jpg">](https://c2.staticflickr.com/6/5450/30626476102_21a90ce16c_b.jpg)
After we filter out the moving GPS records, the remaining GPS data for the same user looks like this:
[<img src="https://c2.staticflickr.com/6/5560/30743187415_f71369b1f7_b.jpg">](https://c2.staticflickr.com/6/5560/30743187415_f71369b1f7_b.jpg)


## 3. Cluster GPS records into locations:

After retaining only the static GPS records, we need to cluster them into several "locations", because the GPS records for a place are very fine-grained. For example, even while you stay at home, you may produce several different GPS coordinates. Since the coordinates recorded for one physical location may vary, we group the places into a location to make the study more useful. Here, we use a variant of the k-means clustering algorithm to cluster the location-based GPS data.
[<img src="https://c2.staticflickr.com/6/5599/30626826222_5773c9cea1_z.jpg">](https://c2.staticflickr.com/6/5599/30626826222_5773c9cea1_z.jpg)

The main idea of this clustering method is shown above. In detail, we start with the first record as the center of the cluster and go through the records to find those that are within the radius of the center. After that, we recalculate the center of the cluster as the mean of its members. Then we go through the records again to collect the new set of records that are within the radius of the new center. We repeat the process until the center stops moving. The finished cluster is removed from the data and the procedure starts again on the remaining records. The figures below visualize the clustering steps.

[<img src="https://c2.staticflickr.com/6/5699/30655919461_ddd296710a_b.jpg">](https://c2.staticflickr.com/6/5699/30655919461_ddd296710a_b.jpg)

[<img src="https://c2.staticflickr.com/6/5644/30108758773_7dd7c16ffc_b.jpg">](https://c2.staticflickr.com/6/5644/30108758773_7dd7c16ffc_b.jpg)
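
The recentering loop can be sketched for a single cluster as follows. This is a hedged illustration, not the tutorial's function: `distance_m` and `grow_cluster` are hypothetical names, the 100 m radius and the `max_iter` safety cap are assumptions of the sketch, and the coordinates are invented.

```python
from math import radians, sin, cos, atan2, sqrt

def distance_m(lat1, lon1, lat2, lon2):
    """Haversine distance in meters between two points in decimal degrees."""
    r = 6371000  # mean earth radius in meters
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlam / 2) ** 2
    return 2 * r * atan2(sqrt(a), sqrt(1 - a))

def grow_cluster(points, seed, radius_m, max_iter=100):
    """Return (center, member_indices) of the cluster grown from points[seed]."""
    center = points[seed]
    members = []
    for _ in range(max_iter):  # cap the loop in case the center never settles
        # collect every point within the radius of the current center
        members = [i for i, (lat, lng) in enumerate(points)
                   if distance_m(center[0], center[1], lat, lng) <= radius_m]
        if not members:
            break
        # move the center to the mean of the members
        new_center = (sum(points[i][0] for i in members) / len(members),
                      sum(points[i][1] for i in members) / len(members))
        if new_center == center:  # the center stopped moving
            break
        center = new_center
    return center, members

points = [
    (-17.010375, 145.732230),
    (-17.010394, 145.732260),
    (-17.010393, 145.732259),
    (-17.001764, 145.724898),  # roughly a kilometer away from the others
]
center, members = grow_cluster(points, seed=0, radius_m=100)
print(center, members)
```

Here the first three points end up in one cluster and the distant point is left for a later cluster. Averaging latitudes and longitudes directly is reasonable only because the members lie within a small radius of each other.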

In the filtering phase we added one important piece of information: how long did the user stay at this place? Now that the places are clustered into locations, we can retrieve another important piece of information: how frequently does the user visit this location? We measure this as the number of distinct days on which the location has records, and in this phase we add that frequency to the record dictionary.
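
As a small sketch of that frequency measure, the snippet below counts the distinct calendar days among the timestamps of one cluster. The function name `visit_frequency` and the sample timestamps are made up for the illustration.

```python
from datetime import datetime

def visit_frequency(timestamps):
    """Count the distinct calendar days on which a cluster has records."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return len({datetime.strptime(ts, fmt).date() for ts in timestamps})

cluster_timestamps = [
    "2016-07-13 13:35:55",
    "2016-07-13 13:30:40",
    "2016-07-14 13:21:33",
    "2016-07-14 13:15:39",
]
print(visit_frequency(cluster_timestamps))  # 2
```

Counting days rather than records keeps a long stay with many GPS fixes from looking like many visits.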

In [None]:
###############################
#   The clustering function   #
###############################
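cluster_radius = 200 # meters; assumed value, since the original code uses this name without defining it. Tune it for your data.

def cluster_places(data):   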
    """ cluster the GPS records for each user.
        Args:
            data (list) : a list for a certain user: [userid1,[list of dictionary of the GPS records for this user1]
                          The keys in the dictionary for each record: timestamp, id, lat, lng, place, duration
        Output:
            (list) : [userid1,[list of dictionary of the GPS records for this user1] (retain only static GPS records)
                    The keys in the dictionary for each record: timestamp, id, lat, lng, place, duration, cluster, frequency
    """
    #data: a list for a certain user like we describe in the first step.
    userID = data[0]
    list_record = data[1]
    total_record = len(list_record)
    final_record = []
    count_cluster = 0
    temp_cluster_member = []
    temp_cluster_lat = []
    temp_cluster_lon = []
    
    #go through the records for the user
    while len(list_record) != 0:
        start_lat = float(list_record[0]['lat'])
        start_lon = float(list_record[0]['lng'])
        temp_cluster_member = [0]
        temp_cluster_lat =[start_lat]
        temp_cluster_lon = [start_lon]
        
        #find location within the initiater
        for i in range(1,len(list_record)):
            if (haversine(start_lat, start_lon, float(list_record[i]['lat']), float(list_record[i]['lng'])) <= cluster_radius):
                temp_cluster_member.append(i)
                temp_cluster_lat.append(float(list_record[i]['lat']))
                temp_cluster_lon.append(float(list_record[i]['lng']))
        
        #recalculate the center
        new_centroid_lat = sum(temp_cluster_lat)/len(temp_cluster_lat)
        new_centroid_lon = sum(temp_cluster_lon)/len(temp_cluster_lon)
        interation_count = 1

        # if the new center differs from the start center in either coordinate -> not stable yet -> keep moving the cluster
        while start_lat!=new_centroid_lat or start_lon!=new_centroid_lon:
            interation_count = interation_count+1
            start_lat = new_centroid_lat
            start_lon = new_centroid_lon
            temp_cluster_member[:] = []
            temp_cluster_lat[:] =[]
            temp_cluster_lon[:] = []
            #find location within the radius
            for i in range(len(list_record)):
                if (haversine(start_lat, start_lon, float(list_record[i]['lat']), float(list_record[i]['lng'])) <= cluster_radius):
                    temp_cluster_member.append(i)
                    temp_cluster_lat.append(float(list_record[i]['lat']))
                    temp_cluster_lon.append(float(list_record[i]['lng']))     
            new_centroid_lat = sum(temp_cluster_lat)/len(temp_cluster_lat)
            new_centroid_lon = sum(temp_cluster_lon)/len(temp_cluster_lon) 

        #after finalize the cluster -> remove place from orignial set to new set
        cluster_date_set = set()
        
        #calculate the visit frequency to this cluster 
        for member in temp_cluster_member:
            temp_date = (datetime.strptime(list_record[member]['timestamp'],"%Y-%m-%d %H:%M:%S")).date()
            cluster_date_set.add(temp_date)
        date_frequency = len(cluster_date_set)
        
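        #add cluster, visit frequency and cluster size information to the record dictionary.
        #toCSVLine reads the 'record_in_cluster' key, so it has to be set here.
        for i in temp_cluster_member:
            list_record[i]['cluster'] = count_cluster
            list_record[i]['frequency'] = date_frequency
            list_record[i]['record_in_cluster'] = len(temp_cluster_member)
            final_record.append(list_record[i])

        #keep only the records that were not assigned to this cluster
        member_set = set(temp_cluster_member)
        list_record = [record for idx, record in enumerate(list_record) if idx not in member_set]
        count_cluster = count_cluster + 1 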
    return (userID, final_record)	


############################################
#   The final output processing function   #
############################################
def toCSVLine(data):
    """ produce the result to csv file.
        Args:
            (list) : a list for certain users [userid1,[list of dictionary of the GPS records for this user1] (retain only static GPS records)
                    The keys in the dictionary for each record: timestamp, id, lat, lng, place, duration, cluster, frequency
        Output:
            (str) : A string with all records for a user. for each record, the result will have placeID, duration(in secs),
                    cluster ID, visit frequency for the cluster, # of records in cluster.
                    ex:
                    "0,2015-07-26 18:39:27,-17.010409,145.732294,place_0,96142.0,0,25,1878
                     0,2015-07-26 18:39:27,-17.010409,145.732294,place_0,96142.0,0,25,1878"

    """    
    imsi = data[0]
    list_record = data[1]
    final_output = []
    for record in list_record:
        record['imsi'] = imsi
        temp = [record['id'],record['timestamp'],record['lat'],record['lng'],record['place'],record['duration'],record['cluster'],record['frequency'],record['record_in_cluster']]
        final_output.append(temp)
    
    outlst_first = [','.join([str(c) for c in lst]) for lst in final_output]
    return '\n'.join(map(str, outlst_first))

#########################################################################################################################
#   Apply the retain_static_only function, the cluster_places function and the toCSVLine function to get final output   #
#########################################################################################################################

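# open the output file once; each user's block ends with a newline so that
# the records of different users do not run together on one line
with open('result.csv','a') as fd:
    for user in User_proccessed_List:
        ID  = user[0]["id"]
        data = [ID,user]
        data_static = retain_static_only(data)
        data_cluster = cluster_places(data_static)
        data_CSV = toCSVLine(data_cluster)
        if data_CSV: # skip users that have no static records
            fd.write(data_CSV + '\n')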

The output retains only the static records, together with the additional information we extracted. 
<pre>
Here is the sample output:
id, timestamp, lat, lng, placeID, duration, clusterID, visit frequency(days), number of records in the cluster
</pre>
<pre>
0,2015-07-26 18:17:34,-17.010375,145.73223,place_0,96142.0,0,25,1878
0,2015-07-26 18:12:50,-17.010375,145.73223,place_0,96142.0,0,25,1878
0,2015-07-26 18:08:03,-17.010394,145.73226,place_0,96142.0,0,25,1878
0,2015-07-26 18:03:14,-17.010393,145.732259,place_0,96142.0,0,25,1878
</pre>




Here is the result of the clustering algorithm for the same user and the same period. Different colors represent different clusters.

[<img src="https://c2.staticflickr.com/6/5659/30707561476_790d7b1122_b.jpg">](https://c2.staticflickr.com/6/5659/30707561476_790d7b1122_b.jpg)



## 4. Identify important locations for each user:
As you can see, after clustering, the records form several areas. Those areas may be the user's home, workplace or campus. Now, how can we understand the importance of each area to the user? We can use the information we have extracted: the duration of stay and the frequency of visits. After applying these two factors to the visualization, we get:

[<img src="https://c2.staticflickr.com/6/5574/30627560922_f09290bb63_b.jpg">](https://c2.staticflickr.com/6/5574/30627560922_f09290bb63_b.jpg)

As you can see, we now have a clear picture of the areas that are very important to the user. The two large, dark points are probably this user's important places, since he/she visits them frequently and stays long. We can also look deeper into whether the user stays there at night or during the day. If the user mostly stays there at night, we may assume that it is the user's home.
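
One way to turn this reading of the picture into numbers is sketched below. It is an assumption-laden illustration, not part of the tutorial code: `summarize_clusters` is a hypothetical helper, ranking by frequency first and total duration second is only one possible choice, and the night window of 22:00 to 06:00 is an arbitrary assumption. The rows are invented and use the same keys as the record dictionaries.

```python
from collections import defaultdict
from datetime import datetime

def summarize_clusters(rows, night_start=22, night_end=6):
    """Aggregate records per cluster and rank the clusters.

    rows: dicts with keys timestamp, place, duration, cluster, frequency.
    """
    stats = defaultdict(lambda: {"frequency": 0, "places": {}, "night": 0, "records": 0})
    for row in rows:
        s = stats[row["cluster"]]
        s["frequency"] = row["frequency"]
        # every record of a place carries the same duration, so store it once per place
        s["places"][row["place"]] = row["duration"]
        hour = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S").hour
        s["night"] += int(hour >= night_start or hour < night_end)
        s["records"] += 1
    summary = [
        {
            "cluster": cluster,
            "frequency": s["frequency"],
            "total_duration": sum(s["places"].values()),
            "night_share": s["night"] / s["records"],
        }
        for cluster, s in stats.items()
    ]
    # most frequently visited first; a longer total stay breaks ties
    summary.sort(key=lambda c: (c["frequency"], c["total_duration"]), reverse=True)
    return summary

rows = [
    {"timestamp": "2016-07-14 10:05:00", "place": "place_1", "duration": 3600.0, "cluster": 1, "frequency": 3},
    {"timestamp": "2016-07-13 23:10:00", "place": "place_0", "duration": 28800.0, "cluster": 0, "frequency": 25},
    {"timestamp": "2016-07-14 02:40:00", "place": "place_0", "duration": 28800.0, "cluster": 0, "frequency": 25},
]
for entry in summarize_clusters(rows):
    print(entry)
```

In this invented example, cluster 0 ranks first and all of its records fall in the night window, which would make it a candidate for the user's home.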

## Conclusion and reference

This tutorial aims to analyze GPS data and help readers retrieve more information from user GPS data. After applying this algorithm to your GPS dataset, you will have a better understanding of your users and will be able to provide better or additional services to them. One example is advertising: if you know where the important places for your users are, you can show them the most relevant advertisements and increase the response rate. 

Furthermore, this tutorial is just the starting point of a GPS dataset analysis. After you have the place information, you can apply a Markov model or association rules to it. For the Markov model, since you now have the sequence of visited clusters, you can use it to predict the movement of the users. For example, the sequence analysis may tell you that a user who goes to place (cluster) 1 is likely to go to place (cluster) 2 next. 
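
A minimal sketch of that idea, assuming a first-order Markov model in which the next cluster depends only on the current one, is to count the observed transitions and normalize them. The function name `transition_probabilities` and the visit sequence are invented for the illustration.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequence):
    """Estimate first-order transition probabilities from a visit sequence."""
    counts = defaultdict(Counter)
    # count how often each cluster is followed by each other cluster
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    # normalize the counts of each row so that they sum to one
    return {
        state: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for state, following in counts.items()
    }

visits = [1, 2, 1, 2, 1, 3, 1, 2]  # cluster IDs in the order they were visited
probs = transition_probabilities(visits)
print(probs[1])
```

In this toy sequence, cluster 1 is followed by cluster 2 in three of four cases, so cluster 2 would be the predicted next stop after cluster 1.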

Here are the research papers that I referenced and that can help you understand the idea better:
<pre>
1. Mining GPS Data to Determine Interesting Locations
https://www.researchgate.net/publication/254004719_Mining_GPS_data_to_determine_interesting_locations

2. Using GPS to Learn Significant Locations and Predict Movement Across Multiple Users
http://www.lifewear.gatech.edu/resources/Ashbrook,_Starner_-_Using_GPS_to_Learn_Significant_Locations_and_Predict_Movement_Across_Multiple_Users.pdf
</pre>

Thank you for reading!
