## Problem 3: How far did people travel? (8 points)

In this task, the aim is to calculate the (air-line) distance in meters that each social media user in the data set prepared in *Problem 2* has travelled between their posts. We’re interested in the Euclidean distance between subsequent points generated by the same user.
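
As a minimal illustration of what "Euclidean distance between subsequent points" means, the sketch below sums the straight-line legs between consecutive posts. The coordinates are hypothetical values in a projected (metric) coordinate system, not taken from the data set.

```python
import math

# Hypothetical projected coordinates (in meters) of one user's posts,
# already ordered from oldest to newest
toy_posts = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]

# Euclidean distance between each pair of subsequent points
legs = [math.dist(a, b) for a, b in zip(toy_posts, toy_posts[1:])]
total = sum(legs)

print(legs)   # [5.0, 6.0]
print(total)  # 11.0
```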

For this, we will need to use the `userid` column of the data set `kruger_points.shp` that we created in *Problem 2*.

Answer the following questions:
- What was the shortest distance a user travelled between all their posts (in meters)?
- What was the mean distance travelled per user (in meters)?
- What was the maximum distance a user travelled (in meters)?

---


### a) Read the input file and re-project it

- Read the input file `kruger_points.shp` into a GeoDataFrame called `kruger_points`
- Transform the data from WGS84 to `EPSG:32735` (UTM Zone 35S, suitable for South Africa). This CRS uses *meters* as its unit.
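
The re-projection step can be sketched on a tiny in-memory data set. The two longitude/latitude pairs are illustrative sample values, and the variable names `toy` and `toy_utm` are made up for this example; the sketch only assumes that the input coordinates are in WGS84 (`EPSG:4326`).

```python
import geopandas as gpd

# Illustrative sample coordinates in WGS84 (longitude, latitude)
lons = [31.484633, 31.508906]
lats = [-24.980792, -25.499225]

toy = gpd.GeoDataFrame(
    {"userid": [66487960, 65281761]},
    geometry=gpd.points_from_xy(lons, lats),
    crs="EPSG:4326",
)

# Re-project to UTM zone 35S; coordinates are now eastings/northings
toy_utm = toy.to_crs(epsg=32735)

print(toy_utm.crs.to_epsg())  # 32735
print(toy_utm.geometry.iloc[0])
```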

In [1]:
import os

# Tell GeoPandas to use Shapely instead of PyGEOS. The variable must be
# set before geopandas is imported, otherwise it has no effect.
os.environ["USE_PYGEOS"] = "0"

import pandas as pd
import geopandas as gpd
import shapely
from shapely.geometry import LineString, Point, Polygon


In [2]:
# ADD YOUR OWN CODE HERE
kruger_points = gpd.read_file("data/kruger_points.shp")
kruger_points = kruger_points.to_crs(epsg=32735)

In [3]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
kruger_points.head()

Unnamed: 0,lat,lon,timestamp,userid,geometry
0,-24.980792,31.484633,2015-07-07 03:02,66487960,POINT (952912.890 7229683.258)
1,-25.499225,31.508906,2015-07-07 03:18,65281761,POINT (953433.223 7172080.632)
2,-24.342578,30.930866,2015-03-07 03:38,90916112,POINT (898955.144 7302197.408)
3,-24.854614,31.519718,2015-10-07 05:04,37959089,POINT (956927.218 7243564.942)
4,-24.921069,31.520836,2015-10-07 05:19,27793716,POINT (956794.955 7236187.926)


In [4]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
import pyproj
assert kruger_points.crs == pyproj.CRS("EPSG:32735")

### b) Group the data by user id

Group the data by `userid` and store the grouped data in a variable `grouped_by_users`
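
A small sketch of how grouping behaves, using a hypothetical table with three users. The column name `userid` matches the data set; everything else here is made up for illustration.

```python
import pandas as pd

# Hypothetical posts: users 1 and 2 posted twice, user 3 once
toy_table = pd.DataFrame(
    {
        "userid": [1, 2, 1, 3, 2],
        "timestamp": [
            "2015-07-07 03:02",
            "2015-07-07 03:18",
            "2015-07-08 10:00",
            "2015-07-09 05:04",
            "2015-07-09 05:19",
        ],
    }
)

toy_grouped = toy_table.groupby("userid")

# One group per unique user id
print(toy_grouped.ngroups)  # 3

# Number of posts per user
sizes = toy_grouped.size()
print(sizes.to_dict())  # {1: 2, 2: 2, 3: 1}
```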

In [5]:
kruger_points["userid"].nunique()

14990

In [6]:
# ADD YOUR OWN CODE HERE

In [7]:
grouped_by_users = kruger_points.groupby("userid")

In [8]:
# Iterate over groups
for key, group in grouped_by_users:
    # Print key and group
    print("Key:\n", key)
    print("\nFirst rows of data in this group:\n", group.head())

    # Stop iteration with break command
    break

Key:
 16301

First rows of data in this group:
              lat        lon         timestamp  userid  \
30512 -24.760170  31.339430  2015-06-08 04:34   16301   
30535 -24.759508  31.371200  2015-02-08 06:18   16301   
30545 -24.774158  31.380342  2015-09-08 06:58   16301   
30770 -24.749845  31.338317  2015-02-09 08:09   16301   
38232 -24.791483  31.865172  2015-05-13 10:51   16301   

                             geometry  
30512  POINT (939011.113 7254636.121)  
30535  POINT (942231.630 7254606.868)  
30545  POINT (943105.509 7252951.967)  
30770  POINT (938934.725 7255785.084)  
38232  POINT (992154.406 7249364.820)  


In [9]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the number of groups:
assert len(grouped_by_users.groups) == kruger_points["userid"].nunique(), "Number of groups should match number of unique users!"

### c) Create `shapely.geometry.LineString` objects for each user connecting the points from oldest to most recent

There are multiple ways to solve this problem (see the [hints for this exercise](https://autogis-site.readthedocs.io/en/latest/lessons/lesson-2/exercise-2.html)). You can use, for instance, a dictionary or an empty GeoDataFrame to collect data that is generated using the steps below:

- Use a for-loop to iterate over the grouped object. For each user’s data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a `shapely.geometry.LineString` based on the user’s points

**CAREFUL**: Remember that every LineString needs at least two points. Skip users who have fewer than two posts.

Store the results in a `geopandas.GeoDataFrame` called `movements`, and remember to assign a CRS.
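
One possible way to follow the steps above is sketched here on hypothetical data (the names `toy_posts`, `records` and `toy_movements` are invented for the example). It passes every time-sorted point of a user to `LineString`, so each line follows the whole sequence of posts, and it skips users with a single post.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import LineString, Point

# Hypothetical posts in a metric CRS; user 1's rows are deliberately unsorted
toy_posts = gpd.GeoDataFrame(
    {
        "userid": [1, 1, 1, 2, 3, 3],
        "timestamp": pd.to_datetime(
            [
                "2015-07-07 03:20",
                "2015-07-07 03:00",
                "2015-07-07 03:10",
                "2015-07-08 09:00",
                "2015-07-09 12:00",
                "2015-07-09 12:30",
            ]
        ),
    },
    geometry=[
        Point(3, 10), Point(0, 0), Point(3, 4),
        Point(5, 5),
        Point(0, 0), Point(0, 2),
    ],
    crs="EPSG:32735",
)

records = []
for user_id, group in toy_posts.groupby("userid"):
    if len(group) < 2:
        continue  # a LineString needs at least two points
    ordered = group.sort_values("timestamp")
    line = LineString(ordered.geometry.tolist())
    records.append({"userid": user_id, "geometry": line})

# Assign the CRS of the input points to the resulting lines
toy_movements = gpd.GeoDataFrame(records, geometry="geometry", crs=toy_posts.crs)
print(toy_movements)
```

Parsing the timestamps with `pd.to_datetime` before sorting is a precaution assumed here; it avoids relying on the string format sorting chronologically.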

In [10]:
# ADD YOUR OWN CODE HERE
# Collect one LineString and the matching user id per user
lines = []
users = []

for user_id, group in grouped_by_users:
    if group.shape[0] < 2:
        continue  # Skip users with fewer than two points
    sorted_group = group.sort_values('timestamp')
    # This solution connects only the oldest and the newest post
    oldest_point = sorted_group.iloc[0]['geometry']
    newest_point = sorted_group.iloc[-1]['geometry']

    # Only create a LineString if the two end points differ
    if oldest_point != newest_point:
        line = LineString([oldest_point, newest_point])
        lines.append(line)
        users.append(user_id)

# Create a GeoDataFrame with these lines
movements = gpd.GeoDataFrame({'user_id': users, 'geometry': lines})

# Set the CRS to EPSG:32735 (UTM zone 35S), the CRS of the input points
movements.set_crs(epsg=32735, inplace=True)


Unnamed: 0,user_id,geometry
0,16301,"LINESTRING (942231.630 7254606.868, 995551.997..."
1,50136,"LINESTRING (944551.607 7253384.183, 963788.403..."
2,88775,"LINESTRING (902800.817 7192546.975, 902800.839..."
3,88918,"LINESTRING (959332.961 7219877.715, 963788.403..."
4,90156,"LINESTRING (944913.750 7243343.215, 944914.735..."
...,...,...
7657,99908614,"LINESTRING (903191.542 7198170.853, 903213.285..."
7658,99921781,"LINESTRING (902885.190 7196931.096, 903380.518..."
7659,99936874,"LINESTRING (963782.211 7228000.079, 963754.402..."
7660,99964140,"LINESTRING (938876.653 7305143.369, 938876.943..."


In [11]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the result
print(type(movements))
print(movements.crs)

movements

<class 'geopandas.geodataframe.GeoDataFrame'>
EPSG:32735


Unnamed: 0,user_id,geometry
0,16301,"LINESTRING (942231.630 7254606.868, 995551.997..."
1,50136,"LINESTRING (944551.607 7253384.183, 963788.403..."
2,88775,"LINESTRING (902800.817 7192546.975, 902800.839..."
3,88918,"LINESTRING (959332.961 7219877.715, 963788.403..."
4,90156,"LINESTRING (944913.750 7243343.215, 944914.735..."
...,...,...
7657,99908614,"LINESTRING (903191.542 7198170.853, 903213.285..."
7658,99921781,"LINESTRING (902885.190 7196931.096, 903380.518..."
7659,99936874,"LINESTRING (963782.211 7228000.079, 963754.402..."
7660,99964140,"LINESTRING (938876.653 7305143.369, 938876.943..."


### d) Calculate the distance between all posts of a user

- Check once more that the CRS of the data frame is correct
- Compute the lengths of the lines, and store them in a new column called `distance_meters`
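
The two steps above might look like this on hypothetical lines. The sketch assumes the frame already carries `EPSG:32735`; it checks that the CRS is projected and reads the axis unit from the CRS object before computing lengths.

```python
import geopandas as gpd
from shapely.geometry import LineString

# Hypothetical movement lines in UTM zone 35S
toy_lines = gpd.GeoDataFrame(
    {"user_id": [1, 2]},
    geometry=[
        LineString([(0, 0), (3, 4)]),
        LineString([(0, 0), (0, 10), (5, 10)]),
    ],
    crs="EPSG:32735",
)

# Lengths are only meaningful in a projected CRS with metric units
is_projected = toy_lines.crs.is_projected
unit = toy_lines.crs.axis_info[0].unit_name
print(is_projected, unit)

# GeoSeries.length returns lengths in the units of the CRS
toy_lines["distance_meters"] = toy_lines.geometry.length
print(toy_lines["distance_meters"].tolist())  # [5.0, 15.0]
```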

In [12]:
# ADD YOUR OWN CODE HERE
movements['distance_meters'] = movements['geometry'].length

In [13]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

Unnamed: 0,user_id,geometry,distance_meters
0,16301,"LINESTRING (942231.630 7254606.868, 995551.997...",67990.877546
1,50136,"LINESTRING (944551.607 7253384.183, 963788.403...",31837.816204
2,88775,"LINESTRING (902800.817 7192546.975, 902800.839...",0.080245
3,88918,"LINESTRING (959332.961 7219877.715, 963788.403...",9277.252211
4,90156,"LINESTRING (944913.750 7243343.215, 944914.735...",1.103448


### e) Answer the original questions

You should now be able to quickly find answers to the following questions: 
- What was the shortest distance a user travelled between all their posts (in meters)? (store the value in a variable `shortest_distance`)
- What was the mean distance travelled per user (in meters)? (store the value in a variable `mean_distance`)
- What was the maximum distance a user travelled (in meters)? (store the value in a variable `longest_distance`)
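
As a compact alternative to three separate calls, the summary statistics can be computed in one pass with `Series.agg`. The distances below are hypothetical and the `toy_` names are invented for the example.

```python
import pandas as pd

# Hypothetical per-user distances in meters
toy_distances = pd.Series([5.0, 15.0, 40.0], name="distance_meters")

summary = toy_distances.agg(["min", "mean", "max"])

toy_shortest = summary["min"]
toy_mean = summary["mean"]
toy_longest = summary["max"]

print(toy_shortest, toy_mean, toy_longest)  # 5.0 20.0 40.0
```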

In [14]:
# ADD YOUR OWN CODE HERE
# Calculate the shortest distance traveled by any user
shortest_distance = movements['distance_meters'].min()

# Calculate the mean distance traveled per user
mean_distance = movements['distance_meters'].mean()

# Calculate the maximum distance traveled by any user
longest_distance = movements['distance_meters'].max()

# Print results
print(f"Shortest Distance Traveled: {shortest_distance} meters")
print(f"Mean Distance Traveled: {mean_distance} meters")
print(f"Longest Distance Traveled: {longest_distance} meters")

Shortest Distance Traveled: 0.00010161846437100987 meters
Mean Distance Traveled: 18996.68193172954 meters
Longest Distance Traveled: 314280.6178495145 meters


In [15]:
movements.head()

Unnamed: 0,user_id,geometry,distance_meters
0,16301,"LINESTRING (942231.630 7254606.868, 995551.997...",67990.877546
1,50136,"LINESTRING (944551.607 7253384.183, 963788.403...",31837.816204
2,88775,"LINESTRING (902800.817 7192546.975, 902800.839...",0.080245
3,88918,"LINESTRING (959332.961 7219877.715, 963788.403...",9277.252211
4,90156,"LINESTRING (944913.750 7243343.215, 944914.735...",1.103448


### f) Save the movements in a file

Save the `movements` into a new Shapefile called `movements.shp` inside the `data` directory.

In [24]:
# ADD YOUR OWN CODE HERE
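import pathlib

# Resolve the current working directory and point at its data folder
NOTEBOOK_PATH = pathlib.Path().resolve()
DATA_DIRECTORY = NOTEBOOK_PATH / "data"

# Write the movements next to the input data
outfp = DATA_DIRECTORY / "movements.shp"
movements.to_file(outfp)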


In [25]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

assert (DATA_DIRECTORY / "movements.shp").exists()


---

# Fantastic job!

That’s all for this week! 