# PROJECT GO2 - Carbon Footprint Calculation

--- THE PROJECT ---

This project aims to calculate a person's carbon footprint from their Google Maps location history.

# Notebook n°1 : Raw Data Treatment & Visualisation

--- NOTEBOOK GOAL ---

This first notebook gives a quick overview of the user's location history from the raw dataset, without any Machine Learning.

At the end, the user gets a Folium map showing places of interest such as :
- the different homes (places where the user has spent many nights)
- the different work places (places where the user has spent a lot of daytime on weekdays, other than homes)
- the different trip places (places where the user has spent time far from the main home)
- all places on a HeatMap (to show movements)
 
--- DATASET ---

The dataset is uploaded by the user from their personal Google Takeout account.

In [1]:
# Creating the table of contents :
from jyquickhelper import add_notebook_menu
add_notebook_menu(first_level=1, last_level=3, header="TABLE OF CONTENTS")

# I. LIBRARIES, FUNCTIONS & HYPOTHESIS

## 1. Libraries

In [2]:
# Basic libraries :
import pandas as pd
import numpy as np
import json

# Datetime / Time :
import datetime as dt
from time import strftime

# Folium Visualisation :
import folium
from folium.plugins import HeatMap

# Distance calculation :
from geopy.distance import geodesic

## 2. Functions

In [3]:
def distance(point, coord_a, coord_b):
    """
    Calculate and return the geodesic distance between 'point', a (latitude, longitude) pair,
    and the point of coordinates (coord_a, coord_b).
    The result is in kilometers.
    """
    return geodesic((point[0], point[1]), (coord_a, coord_b)).km
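
As a quick sanity check on `distance`, the geodesic result can be compared with a great-circle estimate. The sketch below is a plain haversine formula on a sphere of radius 6371 km, which is only an approximation and not the ellipsoidal model used by `geopy`. The function name `haversine_km` and the sample coordinates are illustrative assumptions, not part of the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Illustrative coordinates (Paris and London city centres)
paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
print(round(haversine_km(*paris, *london)))
```

The two methods should agree to within a fraction of a percent at these scales, so a large gap would point to swapped latitude and longitude arguments.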

## 3. Hypothesis

In [4]:
Coord_Precision = 2         # Coordinates rounding precision
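
To get a feel for what `Coord_Precision = 2` means on the ground, the sketch below estimates the size of one rounded cell. It assumes roughly 111.32 km per degree of latitude and scales the east-west size by the cosine of the latitude. The helper `cell_size_km` and the sample latitude are hypothetical.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude

def cell_size_km(precision, latitude):
    """Approximate (north-south, east-west) size in km of a cell rounded to `precision` decimals."""
    step = 10 ** -precision
    north_south = step * KM_PER_DEGREE
    east_west = step * KM_PER_DEGREE * math.cos(math.radians(latitude))
    return north_south, east_west

ns, ew = cell_size_km(2, 48.86)  # 48.86 is an illustrative latitude
print("{:.2f} km x {:.2f} km".format(ns, ew))
```

With two decimals, each aggregated location is therefore on the order of one kilometer wide, which groups a neighbourhood rather than a single building.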

# II. DATASET

## 1. Loading

In [5]:
# Loading json dataset into raw object :
username = 'ElenaD'
datafile_path = "datas/Positions_{}.json".format(username)

with open(datafile_path, 'r') as fh:
    raw = json.load(fh)

# DataFrame definition :
DF_Locations = pd.DataFrame(raw['locations'])

# Deleting the raw object to free memory :
del raw

## 2. Quick Pretreatment

For the purpose of this Notebook, we only need to keep the latitude, longitude and timestamp columns (renamed 'Latitude', 'Longitude' and 'Timestamp' below).

In [6]:
# Features selection :
Features = ['latitudeE7', 'longitudeE7', 'timestampMs']
DF_Locations = DF_Locations[Features]

In [7]:
# Converting units into international standards :
DF_Locations['latitudeE7'] = DF_Locations['latitudeE7']/float(1e7) 
DF_Locations['longitudeE7'] = DF_Locations['longitudeE7']/float(1e7)
DF_Locations['timestampMs'] = DF_Locations['timestampMs'].map(lambda x: float(x)/1000) #(converting into seconds)

# Columns rename :
DF_Locations.rename(columns={'latitudeE7':'Latitude', 'longitudeE7':'Longitude', 'timestampMs':'Timestamp'}, inplace=True)

# Creating 'Datetime' column :
DF_Locations['Datetime'] = DF_Locations.Timestamp.map(dt.datetime.fromtimestamp)

# Displaying first and last position date, and number of rows :
Begin = dt.datetime.fromtimestamp(min(DF_Locations['Timestamp'])).strftime('%Y-%m-%d %H:%M:%S')
End = dt.datetime.fromtimestamp(max(DF_Locations['Timestamp'])).strftime('%Y-%m-%d %H:%M:%S')

print("first date of dataset : {}".format(Begin))
print("last date of dataset : {}".format(End))
print("number of rows : {}".format(len(DF_Locations)))

first date of dataset : 2017-12-19 11:48:38
last date of dataset : 2020-01-02 14:16:57
number of rows : 334156
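
Note that `dt.datetime.fromtimestamp` without a timezone argument returns the local time of the machine running the notebook, so the dates printed above depend on where the notebook is executed. The sketch below converts a single record to a timezone-aware UTC datetime instead. The record values are made up for illustration and only mimic the shape of one entry of `raw['locations']`.

```python
import datetime as dt

# Hypothetical raw record, shaped like one entry of raw['locations']
record = {'latitudeE7': 488566000, 'longitudeE7': 23522000, 'timestampMs': '1513680518000'}

latitude = record['latitudeE7'] / 1e7            # degrees
longitude = record['longitudeE7'] / 1e7          # degrees
timestamp = int(record['timestampMs']) / 1000    # seconds since the Unix epoch

# Passing tz makes the result independent of the machine's local timezone
moment = dt.datetime.fromtimestamp(timestamp, tz=dt.timezone.utc)
print(latitude, longitude, moment.isoformat())
```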


# III. FEATURES ENGINEERING

## 1. Rounding locations

We round latitudes and longitudes so that small movements around the same place collapse into a single position.

In [8]:
DF_Locations['Lat_round'] = round(DF_Locations.Latitude, Coord_Precision)
DF_Locations['Lon_round'] = round(DF_Locations.Longitude, Coord_Precision)

In [9]:
# Creating a tuple (Lat-Lon) for each rounded location :
DF_Locations['Lat_Lon'] = [(a,b) for (a,b) in zip(DF_Locations.Lat_round, 
                                                  DF_Locations.Lon_round)]

## 2. Columns Creation

### 2.1 Table creation : Occurrence table

In [10]:
# Creating a DataFrame with the number of occurrences of each rounded position
loc_occ = DF_Locations.groupby(['Lat_Lon']).size().sort_values(ascending=False)

df_loc_occ = pd.DataFrame(loc_occ).reset_index(drop=False).reset_index(drop=False)
df_loc_occ.rename(columns={0:'Occurence'}, inplace=True)

### 2.2 Column creation : 'Occ_class'
'Occ_class' : Rank of each rounded location in the occurrence table sorted by decreasing count (0 = most frequent)

In [11]:
# Creating a dictionary of corresponding 'Lat_Lon' & 'Occ_class' :
dict_Loc_OccClass = pd.DataFrame(list(df_loc_occ['index']), 
                             index = list(df_loc_occ['Lat_Lon']), 
                             columns = ['Occ_class']).to_dict()['Occ_class']

# Applying to DF_Locations :
DF_Locations['Occ_class'] = DF_Locations.Lat_Lon.map(dict_Loc_OccClass)
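
The ranking logic behind 'Occ_class' can be illustrated without pandas. The following is a minimal sketch using `collections.Counter` on made-up rounded positions. It mirrors the idea (count, sort by decreasing count, number the ranks from 0) rather than the notebook's exact implementation.

```python
from collections import Counter

# Illustrative rounded positions, one tuple per recorded row
positions = [(48.86, 2.35), (48.86, 2.35), (45.76, 4.84),
             (48.86, 2.35), (45.76, 4.84), (43.30, 5.37)]

# most_common() sorts by decreasing count, like the sorted occurrence table
ranked = Counter(positions).most_common()

# Rank 0 is the most frequent rounded location
occ_class = {lat_lon: rank for rank, (lat_lon, _count) in enumerate(ranked)}
occurrence = dict(ranked)
print(occ_class)
```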

### 2.3 Column creation : 'Occurence'
'Occurence' : Number of rows of each rounded location

In [12]:
# Creating a dictionary of corresponding 'Lat_Lon' & 'Occurence' :
dict_Loc_Occ = pd.DataFrame(list(df_loc_occ['Occurence']), 
                             index = list(df_loc_occ['Lat_Lon']), 
                             columns = ['Occurence']).to_dict()['Occurence']

# Applying to DF_Locations :
DF_Locations['Occurence'] = DF_Locations.Lat_Lon.map(dict_Loc_Occ)

### 2.4 Column creation : 'Duration'
'Duration' : Time spent at each row-location, in seconds (time elapsed until the next recorded row)

In [13]:
# Creating the 'Duration' column :
DF_Locations['Duration'] = DF_Locations.shift(periods=-1)['Timestamp'] - DF_Locations['Timestamp']
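
The shift-based computation can be checked on a toy frame. The sketch below assumes the rows are already sorted chronologically (the notebook does not sort them explicitly) and shows that the last row, which has no successor, gets a NaN duration. The toy values are illustrative.

```python
import pandas as pd

# Toy frame: timestamps in seconds, already sorted chronologically (an assumption)
toy = pd.DataFrame({'Occ_class': [0, 0, 1, 0],
                    'Timestamp': [0.0, 60.0, 180.0, 200.0]})

# Seconds until the next record; the last row has no successor, hence NaN
toy['Duration'] = toy['Timestamp'].shift(-1) - toy['Timestamp']

# Selecting the column before summing keeps the aggregation to numeric data only
time_spent = toy.groupby('Occ_class')['Duration'].sum()
print(time_spent.to_dict())
```

Because `sum` skips NaN by default, the trailing NaN does not affect the per-location totals.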

### 2.5 Column creation : 'Time_spent'
'Time_spent' : Total time spent in each aggregated rounded-location

In [14]:
# Creating a dictionary of corresponding 'Occ_class' & 'Time_spent'
# (selecting 'Duration' before summing avoids summing non-numeric columns) :
dict_Occ_Time = DF_Locations.groupby(by=['Occ_class'])['Duration'].sum().to_dict()

# Applying to DF_Locations :
DF_Locations['Time_spent'] = DF_Locations.Occ_class.map(dict_Occ_Time)

### 2.6 Creating DataFrame : Final_Location
Corresponding to locations grouped by Lat-Lon

In [15]:
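# Non-numeric columns (e.g. 'Lat_Lon', 'Datetime') are skipped by numeric_only=True :
useless_columns = ['Latitude', 'Longitude', 'Timestamp', 'Duration']
DF_Final_Locations = DF_Locations.groupby(by='Occ_class').mean(numeric_only=True).drop(columns=useless_columns)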

### 2.7 Column creation : 'Night_Time'
'Night_Time' : Total time spent in each aggregated-location during night hours

In [16]:
# 'Night Period' (to find classical Homes) :
DF_Locations['Hour'] = [x.hour for x in DF_Locations.Datetime]

dict_DayPeriod = {8 : 'Day', 9 : 'Day', 10 : 'Day', 11 : 'Day', 12 : 'Day', 13 : 'Day', 
                  14 : 'Day', 15 : 'Day', 16 : 'Day', 17 : 'Day', 18 : 'Day', 19 : 'Day', 
                  20 : 'Night', 21 : 'Night', 22 : 'Night', 23 : 'Night', 
                  0 : 'Night', 1 : 'Night', 2 : 'Night', 3 : 'Night', 4 : 'Night',
                  5 : 'Night', 6 : 'Night', 7 : 'Night'}

DF_Locations['DayPeriod'] = DF_Locations.Hour.map(dict_DayPeriod)

dict_OccClass_NightTime = DF_Locations[
    DF_Locations.DayPeriod == 'Night'].groupby(
    by=['Occ_class'])['Duration'].sum().to_dict()

DF_Final_Locations['Night_Time'] = DF_Final_Locations.index.map(dict_OccClass_NightTime)
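
The hour-to-period dictionary encodes a simple rule: hours 8 to 19 are 'Day', everything else is 'Night'. A minimal sketch of the same rule as a function follows. The name `day_period` is a hypothetical helper, not part of the notebook.

```python
def day_period(hour):
    """Return 'Day' for hours 8 to 19 inclusive, 'Night' otherwise."""
    return 'Day' if 8 <= hour <= 19 else 'Night'

# Rebuild the 24-entry mapping from the rule
rebuilt = {hour: day_period(hour) for hour in range(24)}
night_hours = sum(1 for period in rebuilt.values() if period == 'Night')
print(night_hours)
```

The night period therefore covers 12 hours per day, which is why a "full night" is counted as `60*60*12` seconds in the thresholds used later.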

### 2.8 Column creation :  'WeekDay_Time'
'WeekDay_Time' : Total time spent in each aggregated-location during week-day hours 

In [17]:
# 'Week-Day Period' (to find classical Work Place) :
DF_Locations['Day'] = [x.day_name() for x in DF_Locations.Datetime]

dict_WeekPeriod = {'Monday' : 'Week', 'Tuesday' : 'Week', 'Wednesday' : 'Week', 
                   'Thursday' : 'Week', 'Friday' : 'Week', 
                   'Saturday' : 'WeekEnd', 'Sunday' : 'WeekEnd'}

DF_Locations['WeekPeriod'] = DF_Locations.Day.map(dict_WeekPeriod)

dict_OccClass_WeekDayTime = DF_Locations[
    (DF_Locations.WeekPeriod == 'Week') & (DF_Locations.DayPeriod == 'Day')].groupby(
    by=['Occ_class'])['Duration'].sum().to_dict()

DF_Final_Locations['WeekDay_Time'] = DF_Final_Locations.index.map(dict_OccClass_WeekDayTime)
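
The week/weekend split can also be expressed without matching day names. The sketch below relies on `datetime.weekday()`, which returns 0 for Monday through 6 for Sunday. The helper `week_period` is an illustrative assumption.

```python
import datetime as dt

def week_period(moment):
    """Return 'Week' for Monday to Friday, 'WeekEnd' for Saturday and Sunday."""
    return 'Week' if moment.weekday() < 5 else 'WeekEnd'

print(week_period(dt.datetime(2020, 1, 2, 14, 16)))  # 2 January 2020 was a Thursday
```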

### 2.9 Column creation : 'First_time' / 'Last_time'
'First_time' : Datetime of first time in an aggregated-location

'Last_time' : Datetime of last time in an aggregated-location

In [18]:
# 'First_time' / 'Last_time' (aggregating the 'Datetime' column only) :
dict_OccClass_1stTime = DF_Locations.groupby(by=['Occ_class'])['Datetime'].min().to_dict()
dict_OccClass_LastTime = DF_Locations.groupby(by=['Occ_class'])['Datetime'].max().to_dict()

DF_Final_Locations['First_time'] = DF_Final_Locations.index.map(dict_OccClass_1stTime)
DF_Final_Locations['Last_time'] = DF_Final_Locations.index.map(dict_OccClass_LastTime)

## 3. Places of Interest

### 3.1 Home Places (Night Places)

In [19]:
# Sorting agg-locations depending on time spent by night ('Night_Time') :
DF_Night_Places = DF_Final_Locations.sort_values(by='Night_Time', ascending=False)

# 'Home_Places' hypothesis (20 full nights OR 20% of Dataset's night time) :
DF_Home_Places = DF_Final_Locations[
    (DF_Final_Locations.Night_Time >= 20*60*60*12) 
    | 
    (DF_Final_Locations.Night_Time >= 20*DF_Night_Places.Night_Time.sum()/100)]
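
The home hypothesis above combines an absolute threshold (20 nights of 12 hours, i.e. 864,000 seconds) with a relative one (20% of all night time in the dataset). A minimal sketch of that rule for a single location follows. The function `is_home` and its parameter names are hypothetical.

```python
SECONDS_PER_NIGHT = 12 * 60 * 60   # the night period runs from 20:00 to 08:00

def is_home(night_time, total_night_time, min_nights=20, min_share=0.20):
    """Apply the home rule to one location; both time arguments are in seconds."""
    return (night_time >= min_nights * SECONDS_PER_NIGHT
            or night_time >= min_share * total_night_time)

total = 100 * SECONDS_PER_NIGHT
print(is_home(25 * SECONDS_PER_NIGHT, total))   # passes the absolute threshold
print(is_home(5 * SECONDS_PER_NIGHT, total))    # fails both thresholds
```

The relative threshold matters mostly for short histories, where no place can reach 20 full nights.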

### 3.2 Work Places (Week-Day Places)

In [20]:
# Sorting agg-locations depending on time spent by day during weekday ('WeekDay_Time') :
DF_WeekDay_Places = DF_Final_Locations.sort_values(by='WeekDay_Time', ascending=False)

# 'Work_Places' hypothesis (20 full days OR 15% of Dataset's WeekDay time) :
DF_Work_Places = DF_Final_Locations[
    (DF_Final_Locations.WeekDay_Time >= 20*60*60*12) 
    | 
    (DF_Final_Locations.WeekDay_Time >= 15*DF_WeekDay_Places.WeekDay_Time.sum()/100)]

### 3.3 Column Creation : 'Dist_Home'
'Dist_Home' : Distance from main home (first Night Place), in kilometers

In [21]:
# Defining Home coordinates (Home Place with the most night time) :
Main_Home = DF_Home_Places.sort_values(by='Night_Time', ascending=False).iloc[0]
Coord_Home = (Main_Home.Lat_round, Main_Home.Lon_round)

In [22]:
# Creating column 'Dist_Home' :
DF_Final_Locations['Dist_Home'] = [distance(Coord_Home, a, b) for (a, b) 
                                   in zip(DF_Final_Locations.Lat_round, 
                                          DF_Final_Locations.Lon_round)]

### 3.4 Trip Places (Distant Places)

In [23]:
# 'Trip_Places' hypothesis (Dist_Home >= 50km AND Time_spent >= 24h) :
DF_Trip_Places = DF_Final_Locations[
    (DF_Final_Locations.Dist_Home >= 50) 
    & 
    (DF_Final_Locations.Time_spent / (60*60*24) >= 1)]

# IV. VISUALISATION ON FOLIUM MAP

## 1. Creating map

In [24]:
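# Center map coordinates (to adapt if necessary), here the Home Place with the most night time :
Main_Home = DF_Home_Places.sort_values(by='Night_Time', ascending=False).iloc[0]
Coord_Home = (Main_Home.Lat_round, Main_Home.Lon_round)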

# Building initial Folium Map :
Folium_Map = folium.Map(location=Coord_Home, 
                       zoom_start=2,
                       tiles='cartodbpositron', 
                       width=864, height=560)

## 2. Home Places - Markers

In [25]:
# Defining Home Places coordinates :
Coord_Home = DF_Home_Places[['Lat_round', 'Lon_round']].reset_index(drop=True)

# Adding Home Places markers to initial Folium Map :
for i in range(len(Coord_Home)):
    try :
        Nights = int((DF_Home_Places.Night_Time.iloc[i] // (60*60*12)) + 1)
    except ValueError :   # raised by int() when Night_Time is NaN
        Nights = 0
        
    folium.Marker([Coord_Home.iloc[i]['Lat_round'], Coord_Home.iloc[i]['Lon_round']], 
                  icon=folium.Icon(color='green'),
                  popup=str("Home Place\nFirst_time:{0}\nLast_time:{1}\nNights:{2}"
                            .format(
                                DF_Home_Places.iloc[i]['First_time'].strftime('%Y/%m/%d'),
                                DF_Home_Places.iloc[i]['Last_time'].strftime('%Y/%m/%d'),
                                Nights))).add_to(Folium_Map)
    
# Displaying Map
Folium_Map

## 3. Work Places - Markers

In [26]:
# Defining Work Places coordinates :
DF_Work_Places_Only = DF_Work_Places.drop(
    list(set(DF_Work_Places.index) & 
         set(DF_Home_Places.index)))

Coord_Work = DF_Work_Places_Only[['Lat_round', 'Lon_round']].reset_index(drop=True)

# Adding Work Places markers to initial Folium Map :
for i in range(len(Coord_Work)):
    try :
        WeekDay = int((DF_Work_Places_Only.WeekDay_Time.iloc[i] // (60*60*12)) + 1)
    except ValueError :   # raised by int() when WeekDay_Time is NaN
        WeekDay = 0
        
    folium.Marker([Coord_Work.iloc[i]['Lat_round'], Coord_Work.iloc[i]['Lon_round']], 
                  icon=folium.Icon(color='red'),
                  popup=str("Work Place\nFirst_time:{0}\nLast_time:{1}\nWeekDays:{2}"
                            .format(
                                DF_Work_Places_Only.iloc[i]['First_time'].strftime('%Y/%m/%d'),
                                DF_Work_Places_Only.iloc[i]['Last_time'].strftime('%Y/%m/%d'),
                                WeekDay))).add_to(Folium_Map)

# Displaying Map
Folium_Map

## 4. Trip Places - Markers

In [27]:
# Defining Trip Places coordinates (excluding Home Places and Work Places) :
DF_Trip_Places_Only = DF_Trip_Places.drop(list(
    (set(DF_Trip_Places.index) & set(DF_Home_Places.index)) 
    |
    (set(DF_Trip_Places.index) & set(DF_Work_Places.index))))
                                          
Coord_Trip = DF_Trip_Places_Only[['Lat_round', 'Lon_round']].reset_index(drop=True)


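# Adding Trip Places markers to initial Folium Map :
for i in range(len(Coord_Trip)):
    try :
        Time = int(round((DF_Trip_Places_Only.Time_spent.iloc[i] / (60*60*24)),0))
    except ValueError :   # raised by int() when Time_spent is NaN
        Time = 0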
        
    folium.Marker([Coord_Trip.iloc[i]['Lat_round'], Coord_Trip.iloc[i]['Lon_round']], 
                  icon=folium.Icon(color='blue'),
                  popup=str("Trip Place\nFirst_time:{0}\nLast_time:{1}\nTime:{2} Days"
                            .format(
                                DF_Trip_Places_Only.iloc[i]['First_time'].strftime('%Y/%m/%d'),
                                DF_Trip_Places_Only.iloc[i]['Last_time'].strftime('%Y/%m/%d'),
                                Time))).add_to(Folium_Map)
    
# Displaying Map
Folium_Map

## 5. Trips Visualisation - HeatMap

In [28]:
# On HeatMap we display all agg-rounded locations :
Coord_All = DF_Final_Locations[['Lat_round', 'Lon_round']].reset_index(drop=True)

# Adding all agg-rounded locations on HeatMap :
HeatMap(data=Coord_All, radius=10, max_zoom=13).add_to(Folium_Map)

Folium_Map

## 6. Saving FoliumMap on HTML Page

In [29]:
FoliumMapName = 'FoliumMap_{}.html'.format(username)
Folium_Map.save(FoliumMapName)