## Problem 3: How far did people travel? (8 points)

In this task, the aim is to calculate the (air-line) distance in meters that each social media user in the data set prepared in *Problem 2* has travelled between their posts. We are interested in the Euclidean distance between subsequent points generated by the same user.
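
As an illustration of the measure described above, the following minimal sketch sums the straight-line distances between consecutive points of a single user. The coordinates are made-up values standing in for projected (metre-based) positions, and `path_length` is a hypothetical helper that is not part of the exercise template.

```python
import math


def path_length(points):
    """Sum the Euclidean distances between consecutive (x, y) points."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Made-up projected coordinates (metres) of one user's posts, oldest first
posts = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
total = path_length(posts)
print(total)  # first leg plus second leg
```

A path with fewer than two points has no legs, so the helper returns 0 for it.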

For this, we will use the `userid` column of the data set `kruger_points.shp` that we created in *Problem 2*.

Answer the following questions:
- What was the shortest distance a user travelled between all their posts (in meters)?
- What was the mean distance travelled per user (in meters)?
- What was the maximum distance a user travelled (in meters)?

---


### a) Read the input file and re-project it

- Read the input file `kruger_points.shp` into a geo-data frame `kruger_points`
- Transform the data from WGS84 to an `EPSG:32735` projection (UTM Zone 35S, suitable for South Africa). This CRS has *metres* as units.

In [1]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import pathlib
from shapely.geometry import LineString, Polygon, Point
from statistics import mean


In [2]:
# ADD YOUR OWN CODE HERE
NOTEBOOK_PATH = pathlib.Path().resolve()
DATA_PATH = NOTEBOOK_PATH / "data" / "kruger_points.shp"
kruger_points = gpd.read_file(DATA_PATH)

# Transform projection
kruger_points = kruger_points.to_crs("EPSG:32735")

In [3]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
kruger_points.head()

| | lat | lon | timestamp | userid | geometry |
|---|---|---|---|---|---|
| 0 | -24.980792 | 31.484633 | 2015-07-07 03:02 | 66487960 | POINT (952912.890 7229683.258) |
| 1 | -25.499225 | 31.508906 | 2015-07-07 03:18 | 65281761 | POINT (953433.223 7172080.632) |
| 2 | -24.342578 | 30.930866 | 2015-03-07 03:38 | 90916112 | POINT (898955.144 7302197.408) |
| 3 | -24.854614 | 31.519718 | 2015-10-07 05:04 | 37959089 | POINT (956927.218 7243564.942) |
| 4 | -24.921069 | 31.520836 | 2015-10-07 05:19 | 27793716 | POINT (956794.955 7236187.926) |


In [4]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
import pyproj
assert kruger_points.crs == pyproj.CRS("EPSG:32735")

### b) Group the data by user id

Group the data by `userid` and store the grouped data in a variable `grouped_by_users`

In [5]:
# ADD YOUR OWN CODE HERE

In [6]:
grouped_by_users = kruger_points.groupby(['userid'])

In [7]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the number of groups:
assert len(grouped_by_users.groups) == kruger_points["userid"].nunique(), "Number of groups should match number of unique users!"
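
When iterating over a grouped object, the group key may arrive either as a plain value or as a one-element tuple, depending on whether the grouping column was passed as a string or as a list, and on the pandas version. The sketch below uses a small made-up table and normalises the key so that either form works; the data values and the name `post_counts` are assumptions for illustration only.

```python
import pandas as pd

# Made-up posts: three users, timestamps deliberately out of order
posts = pd.DataFrame(
    {
        "userid": [2, 1, 2, 1, 3],
        "timestamp": [
            "2015-07-07 03:18",
            "2015-07-07 03:02",
            "2015-07-07 03:05",
            "2015-07-08 10:00",
            "2015-07-09 12:00",
        ],
    }
)

grouped = posts.groupby(["userid"])

post_counts = {}
for key, group in grouped:
    # Depending on the pandas version, `key` is either 1 or (1,)
    userid = key[0] if isinstance(key, tuple) else key
    post_counts[int(userid)] = len(group)

print(post_counts)
```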

### c) Create `shapely.geometry.LineString` objects for each user connecting the points from oldest to most recent

There are multiple ways to solve this problem (see the [hints for this exercise](https://autogis-site.readthedocs.io/en/latest/lessons/lesson-2/exercise-2.html)). You can use, for instance, a dictionary or an empty GeoDataFrame to collect the data generated in the steps below:

- Use a for-loop to iterate over the grouped object. For each user’s data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a `shapely.geometry.LineString` based on the user’s points

**CAREFUL**: Remember that every LineString needs at least two points. Skip users who have fewer than two posts.

Store the results in a `geopandas.GeoDataFrame` called `movements`, and remember to assign a CRS.
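
The steps above can be sketched as follows. This is only one possible approach, shown on a small made-up table in which a plain pandas DataFrame stands in for the GeoDataFrame. The columns `x` and `y` are hypothetical stand-ins for projected coordinates in metres (in a GeoDataFrame these could be read from `geometry.x` and `geometry.y`).

```python
import pandas as pd
from shapely.geometry import LineString

# Made-up posts with projected coordinates in metres
posts = pd.DataFrame(
    {
        "userid": [1, 2, 1, 3, 2, 2],
        "timestamp": [
            "2015-07-07 03:18",
            "2015-07-07 05:00",
            "2015-07-07 03:02",
            "2015-07-07 09:00",
            "2015-07-07 04:00",
            "2015-07-07 06:00",
        ],
        "x": [3.0, 10.0, 0.0, 7.0, 0.0, 10.0],
        "y": [4.0, 0.0, 0.0, 7.0, 0.0, 10.0],
    }
)

records = {"userid": [], "geometry": []}
for userid, group in posts.groupby("userid"):
    # A LineString needs at least two points, so skip single-post users
    if len(group) < 2:
        continue
    # "YYYY-MM-DD HH:MM" strings sort chronologically when sorted as text
    ordered = group.sort_values(by="timestamp")
    records["userid"].append(int(userid))
    records["geometry"].append(LineString(ordered[["x", "y"]].to_numpy()))

for userid, line in zip(records["userid"], records["geometry"]):
    print(userid, line.wkt, line.length)
```

The collected dictionary can then be passed to `gpd.GeoDataFrame(records, crs=...)` with the CRS of the projected points.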

In [8]:
# ADD YOUR OWN CODE HERE
movements = {"userid":[], "geometry":[]}

for userid, group in grouped_by_users:
    sorted_group = group.sort_values(by = ["timestamp"])

    if len(sorted_group) >= 2:
        # Create a LineString from the sorted points
        lines = LineString(sorted_group[["lat", "lon"]].values)
        movements["geometry"].append(lines)
        movements["userid"].append(userid[0])

movements = gpd.GeoDataFrame(movements, crs="EPSG:32735")

In [9]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the result
print(type(movements))
print(movements.crs)

movements

<class 'geopandas.geodataframe.GeoDataFrame'>
EPSG:32735


| | userid | geometry |
|---|---|---|
| 0 | 16301 | LINESTRING (-24.760 31.371, -24.750 31.338, -2... |
| 1 | 45136 | LINESTRING (-25.321 31.026, -25.321 31.026) |
| 2 | 50136 | LINESTRING (-24.770 31.394, -24.993 31.593, -2... |
| 3 | 88775 | LINESTRING (-25.329 31.000, -25.329 31.000) |
| 4 | 88918 | LINESTRING (-25.067 31.551, -24.993 31.593) |
| ... | ... | ... |
| 9021 | 99921781 | LINESTRING (-25.290 31.000, -25.295 31.011, -2... |
| 9022 | 99936874 | LINESTRING (-24.993 31.593, -24.993 31.592, -2... |
| 9023 | 99964140 | LINESTRING (-24.305 31.322, -24.305 31.322) |
| 9024 | 99986933 | LINESTRING (-24.299 31.293, -24.276 31.299) |
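
Note that the coordinates in this output are latitude/longitude pairs in degrees, because the lines were built from the `lat` and `lon` columns. Any length measured on such lines is therefore in degrees rather than metres, even though the CRS label says `EPSG:32735`. As a rough comparison, the sketch below measures the two-point line of user 88918 (using the rounded values shown above) both ways. It calls `pyproj.Transformer` directly instead of going through geopandas, and the variable names are illustrative.

```python
from pyproj import Transformer
from shapely.geometry import LineString

# (lat, lon) pairs in degrees, rounded values taken from the output above
lat_lon = [(-25.067, 31.551), (-24.993, 31.593)]

# Line built directly from the degree values: its length is in degrees
length_degrees = LineString(lat_lon).length

# Re-project to UTM zone 35S; always_xy=True means (lon, lat) in, (x, y) out
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32735", always_xy=True)
projected = [transformer.transform(lon, lat) for lat, lon in lat_lon]

# Line built from the projected coordinates: its length is in metres
length_metres = LineString(projected).length

print(f"{length_degrees:.4f} degrees vs. {length_metres:.0f} m")
```

Building the lines from the re-projected `geometry` column instead of `lat`/`lon` would give lengths in metres directly.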


### d) Calculate the distance between all posts of a user

- Check once more that the CRS of the data frame is correct
- Compute the lengths of the lines, and store them in a new column called `distance`

In [10]:
# ADD YOUR OWN CODE HERE
# Check once more that the CRS is the one assigned above
assert movements.crs.to_epsg() == 32735

# Compute the length of each user's line
movements["distance"] = movements.geometry.length

In [11]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

| | userid | geometry | distance |
|---|---|---|---|
| 0 | 16301 | LINESTRING (-24.760 31.371, -24.750 31.338, -2... | 3.158937 |
| 1 | 45136 | LINESTRING (-25.321 31.026, -25.321 31.026) | 0.0 |
| 2 | 50136 | LINESTRING (-24.770 31.394, -24.993 31.593, -2... | 1.490359 |
| 3 | 88775 | LINESTRING (-25.329 31.000, -25.329 31.000) | 7.28011e-07 |
| 4 | 88918 | LINESTRING (-25.067 31.551, -24.993 31.593) | 0.08527984 |


### e) Answer the original questions

You should now be able to quickly find answers to the following questions: 
- What was the shortest distance a user travelled between all their posts (in meters)? (store the value in a variable `shortest_distance`)
- What was the mean distance travelled per user (in meters)? (store the value in a variable `mean_distance`)
- What was the maximum distance a user travelled (in meters)? (store the value in a variable `longest_distance`)
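
A minimal sketch of how these three values could be derived, assuming the per-user line lengths are stored in a `distance` column as in part d). The numbers below are made up for illustration; with the real data, the same calls would be applied to `movements["distance"]`.

```python
import pandas as pd

# Made-up per-user travel distances in metres (stand-in for movements["distance"])
distance = pd.Series([0.0, 1250.5, 9300.0, 450.25], name="distance")

shortest_distance = distance.min()
mean_distance = distance.mean()
longest_distance = distance.max()

print(shortest_distance, mean_distance, longest_distance)
```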

In [12]:
# ADD YOUR OWN CODE HERE
def find_distance(line, return_type):
    """Return the max, min or mean distance between consecutive points of a line."""
    if len(line.coords) < 2:
        return 0  # No distance if fewer than two points

    # Calculate distances between consecutive points
    coords = list(line.coords)
    distances = [Point(start).distance(Point(end))
                 for start, end in zip(coords, coords[1:])]

    if return_type == "max":
        return max(distances)
    elif return_type == "min":
        return min(distances)
    elif return_type == "mean":
        return mean(distances)

    # Fail loudly instead of silently returning None
    raise ValueError(f"Unknown return_type: {return_type!r}")

In [13]:
# ADD YOUR OWN CODE HERE
def max_distance(row):
    return find_distance(row["geometry"], "max")

def mean_distance(row):
    return find_distance(row["geometry"], "mean")

def min_distance(row):
    return find_distance(row["geometry"], "min")

# Per-user statistics of the distances between consecutive posts
movements["max_d"] = movements.apply(max_distance, axis=1).round(2)
movements["mean_d"] = movements.apply(mean_distance, axis=1).round(2)
movements["min_d"] = movements.apply(min_distance, axis=1).round(2)

In [14]:
movements.head()

| | userid | geometry | distance | max_d | mean_d | min_d |
|---|---|---|---|---|---|---|
| 0 | 16301 | LINESTRING (-24.760 31.371, -24.750 31.338, -2... | 3.158937 | 0.65 | 0.39 | 0.03 |
| 1 | 45136 | LINESTRING (-25.321 31.026, -25.321 31.026) | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 50136 | LINESTRING (-24.770 31.394, -24.993 31.593, -2... | 1.490359 | 0.3 | 0.17 | 0.0 |
| 3 | 88775 | LINESTRING (-25.329 31.000, -25.329 31.000) | 7.28011e-07 | 0.0 | 0.0 | 0.0 |
| 4 | 88918 | LINESTRING (-25.067 31.551, -24.993 31.593) | 0.08527984 | 0.09 | 0.09 | 0.09 |


### f) Save the movements in a file

Save the `movements` into a new Shapefile called `movements.shp` inside the `data` directory.

In [15]:
# ADD YOUR OWN CODE HERE
DATA_DIRECTORY = NOTEBOOK_PATH / "data"
movements.to_file(DATA_DIRECTORY / "movements.shp")

In [16]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

assert (DATA_DIRECTORY / "movements.shp").exists()


---

# Fantastic job!

That’s all for this week! 