# CITY LIFE. Clustering analysis

## Task

In this project, you will look for insights into the behavior of taxi customers
and taxi drivers that could help a taxi company optimize its business.
Here are the tasks you need to complete:
1. Find and visualize on a map the most popular areas where people ordered a taxi, as well as where they headed.
2. Find and visualize the most popular routes in different time intervals.
3. Find the locations of city infrastructure in the dataset and, using an animated map, visualize how customers arrived at one of them.
4. Using an animated map, visualize one day of a taxi driver and how much money he or she earned.
5. Using an animated map, visualize one day of the city (a working day and a weekend day).

## Imports

In [21]:
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import shapely
import folium


from shapely.geometry import Polygon, LineString, Point


## Load data

In [28]:
df_map = gpd.read_file('data/chicago_map.shx')

In [29]:
df_map.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   geometry  77 non-null     geometry
dtypes: geometry(1)
memory usage: 744.0 bytes


In [30]:
rush = gpd.read_file('data/rush_hours_empty.csv')

In [31]:
rush.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   name                1 non-null      object  
 1   longitude           1 non-null      object  
 2   latitude            1 non-null      object  
 3   num_of_rides        1 non-null      object  
 4   Trip End Timestamp  1 non-null      object  
 5   geometry            0 non-null      geometry
dtypes: geometry(1), object(5)
memory usage: 176.0+ bytes


In [32]:
taxi_loc = gpd.read_file('data/taxi_locations.csv')

In [33]:
taxi_loc.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 2506294 entries, 0 to 2506293
Data columns (total 12 columns):
 #   Column                      Dtype   
---  ------                      -----   
 0   Trip ID                     object  
 1   Pickup Centroid Latitude    object  
 2   Pickup Centroid Longitude   object  
 3   Pickup Centroid Location    object  
 4   Dropoff Centroid Latitude   object  
 5   Dropoff Centroid Longitude  object  
 6   Dropoff Centroid  Location  object  
 7   Trip Start Timestamp        object  
 8   Trip End Timestamp          object  
 9   Taxi ID                     object  
 10  Fare                        object  
 11  geometry                    geometry
dtypes: geometry(1), object(11)
memory usage: 229.5+ MB
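
Every column above has dtype `object` and `geometry` holds no points, because the CSV was read as plain text. The following is a minimal sketch of one way to convert the types and build point geometry. It runs on a tiny made-up sample rather than on `taxi_loc`; the column names come from the `info()` output above, while the sample values and the timestamp format are assumptions.

```python
import pandas as pd
import geopandas as gpd

# Made-up sample standing in for taxi_loc: every value is a string,
# and the last row has no pickup coordinates.
sample = pd.DataFrame({
    "Pickup Centroid Latitude": ["41.88", "41.90", ""],
    "Pickup Centroid Longitude": ["-87.63", "-87.65", ""],
    "Trip Start Timestamp": [
        "01/05/2020 08:15:00 AM",
        "01/05/2020 05:30:00 PM",
        "01/06/2020 09:00:00 AM",
    ],
    "Fare": ["12.50", "8.25", "5.00"],
})

lat, lon = "Pickup Centroid Latitude", "Pickup Centroid Longitude"

# Strings -> numbers; values that cannot be parsed become NaN.
num_cols = [lat, lon, "Fare"]
sample[num_cols] = sample[num_cols].apply(pd.to_numeric, errors="coerce")

# Strings -> datetimes (the format string is an assumption about the file).
sample["Trip Start Timestamp"] = pd.to_datetime(
    sample["Trip Start Timestamp"], format="%m/%d/%Y %I:%M:%S %p"
)

# Drop rows without coordinates, then build points as (x, y) = (lon, lat).
sample = sample.dropna(subset=[lat, lon])
pickups = gpd.GeoDataFrame(
    sample,
    geometry=gpd.points_from_xy(sample[lon], sample[lat]),
    crs="EPSG:4326",
)
```

The same steps could be repeated for the drop-off columns to obtain a second GeoDataFrame.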


## Data preparation

In [50]:
df_map.set_crs(epsg=4326, inplace=True)

| | geometry |
|---|---|
| 0 | POLYGON ((-87.60914 41.84469, -87.60915 41.844... |
| 1 | POLYGON ((-87.59215 41.81693, -87.59231 41.816... |
| 2 | POLYGON ((-87.62880 41.80189, -87.62879 41.801... |
| 3 | POLYGON ((-87.60671 41.81681, -87.60670 41.816... |
| 4 | POLYGON ((-87.59215 41.81693, -87.59215 41.816... |
| ... | ... |
| 72 | POLYGON ((-87.69646 41.70714, -87.69644 41.706... |
| 73 | POLYGON ((-87.64215 41.68508, -87.64249 41.685... |
| 74 | MULTIPOLYGON (((-87.83658 41.98640, -87.83658 ... |
| 75 | POLYGON ((-87.65456 41.99817, -87.65456 41.998... |


In [51]:
m = folium.Map(location=[41.85, -87.65], zoom_start=10, tiles='CartoDB positron')

In [52]:
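# Draw every polygon of the city map as its own GeoJson layer.
for _, r in df_map.iterrows():
    # Simplify the outline to keep the map light; the tolerance is in degrees (EPSG:4326).
    sim_geo = gpd.GeoSeries(r['geometry']).simplify(tolerance=0.001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j,
                           style_function=lambda x: {'fillColor': 'orange'})
    geo_j.add_to(m)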

In [53]:
m


## 1. Most popular areas

1. Conduct a clustering analysis of the pickup and drop-off locations based on their coordinates. The clusters may differ between the two categories (pickups and drop-offs).
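
As an illustration of this step, here is a minimal sketch that clusters coordinates with k-means. It assumes scikit-learn is installed (the notebook does not import it), uses made-up coordinates around two hypothetical hot spots, and picks `n_clusters=2` only because the sample was generated that way. The algorithm and the number of clusters for the real data are open choices.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Made-up pickup coordinates scattered around two hypothetical hot spots.
spot_a = rng.normal(loc=[41.88, -87.63], scale=0.005, size=(50, 2))
spot_b = rng.normal(loc=[41.98, -87.90], scale=0.005, size=(50, 2))
pickups = pd.DataFrame(np.vstack([spot_a, spot_b]), columns=["lat", "lon"])

# Assign each point to one of two clusters based on its coordinates.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
pickups["cluster"] = kmeans.fit_predict(pickups[["lat", "lon"]])

# One centroid per cluster, plus the number of points in each cluster.
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=["lat", "lon"])
cluster_sizes = pickups["cluster"].value_counts().sort_index()
```

Treating latitude and longitude as plain Euclidean coordinates is an approximation that is usually acceptable at city scale. Running the same code separately on the drop-off coordinates would give the second set of clusters.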

#### Pickups

#### Drop-offs

2. Draw the borders and centroids of the clusters on a map. You will end up with two maps: one for pickups and one for drop-offs.
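
One possible way to obtain a border and a centroid for each cluster is to take the convex hull and the centroid of its points with shapely. The sketch below assumes every point already carries a cluster label; the coordinates, labels, and column names are made up for illustration.

```python
import pandas as pd
from shapely.geometry import MultiPoint

# Made-up points that already carry a cluster label.
points = pd.DataFrame({
    "lat": [41.88, 41.89, 41.88, 41.89, 41.97, 41.98, 41.97, 41.98],
    "lon": [-87.63, -87.63, -87.62, -87.62, -87.90, -87.90, -87.89, -87.89],
    "cluster": [0, 0, 0, 0, 1, 1, 1, 1],
})

borders = {}
centroids = {}
for label, group in points.groupby("cluster"):
    # shapely expects (x, y), which is (lon, lat) for geographic data.
    cloud = MultiPoint(list(zip(group["lon"], group["lat"])))
    borders[label] = cloud.convex_hull   # smallest convex polygon around the points
    centroids[label] = cloud.centroid    # mean position of the points
```

Each hull can then be added to a folium map in the same way as the city polygons in the data preparation step, and each centroid can be shown with `folium.Marker([c.y, c.x])`. A convex hull is only one choice of border and can overstate the area of irregularly shaped clusters.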

3. Use a color scale to show which clusters (i.e., areas) have the largest number of pickups and drop-offs.
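
A color scale can be built by normalizing the cluster sizes and looking them up in a matplotlib colormap. This is a minimal sketch with made-up counts; the `YlOrRd` colormap is an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize, to_hex

# Made-up number of pickups per cluster label.
cluster_sizes = pd.Series({0: 1200, 1: 300, 2: 4500})

# Map the smallest count to 0.0 and the largest to 1.0.
norm = Normalize(vmin=cluster_sizes.min(), vmax=cluster_sizes.max())
cmap = plt.get_cmap("YlOrRd")

# One hex color per cluster: light for few rides, dark for many.
cluster_colors = {
    label: to_hex(cmap(float(norm(count))))
    for label, count in cluster_sizes.items()
}
```

The resulting color can be passed as `fillColor` in the `style_function` of `folium.GeoJson`. When doing this in a loop, binding the color as a default argument (`lambda feature, color=color: {'fillColor': color}`) avoids every layer ending up with the last color.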

## 2. Most popular routes
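
As a starting point, routes can be counted as pickup and drop-off pairs within a time interval. The sketch below uses made-up trips, hypothetical short column names (`pickup`, `dropoff`, `start`), and an arbitrary split of the day into four six-hour intervals.

```python
import pandas as pd

# Made-up trips; "A", "B", "C" stand for pickup and drop-off areas.
trips = pd.DataFrame({
    "pickup":  ["A", "A", "B", "C", "C", "A"],
    "dropoff": ["B", "B", "C", "A", "A", "B"],
    "start": pd.to_datetime([
        "2020-01-06 08:00", "2020-01-06 09:00", "2020-01-06 10:00",
        "2020-01-06 19:00", "2020-01-06 20:00", "2020-01-06 21:00",
    ]),
})

# Label each trip with the interval its start hour falls into: [0, 6), [6, 12), ...
trips["interval"] = pd.cut(
    trips["start"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
).astype(str)

# Count rides per (interval, pickup, dropoff) combination.
route_counts = (
    trips.groupby(["interval", "pickup", "dropoff"])
    .size()
    .rename("rides")
    .reset_index()
)

# Keep the busiest route of each interval.
top_routes = (
    route_counts.sort_values("rides", ascending=False)
    .groupby("interval")
    .head(1)
)
```

With real data, the pickup and drop-off labels could be the cluster labels from the previous section, and each top route could be drawn as a line between the two cluster centroids.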

## 3. City infrastructure

## 4. One day of a taxi driver
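
The earnings part of this task can be sketched as a running total of fares for one driver on one day. The sample below is made up; the column names follow the `taxi_loc` columns, assuming `Fare` and the timestamps have already been converted from strings.

```python
import pandas as pd

# Made-up trips for two hypothetical drivers.
trips = pd.DataFrame({
    "Taxi ID": ["t1", "t1", "t2", "t1"],
    "Trip End Timestamp": pd.to_datetime([
        "2020-01-06 09:10", "2020-01-06 08:20",
        "2020-01-06 08:30", "2020-01-07 10:00",
    ]),
    "Fare": [10.0, 7.5, 20.0, 5.0],
})

day = pd.Timestamp("2020-01-06").date()

# Trips of driver "t1" that ended on the chosen day, in chronological order.
mask = (trips["Taxi ID"] == "t1") & (trips["Trip End Timestamp"].dt.date == day)
driver_day = trips[mask].sort_values("Trip End Timestamp").copy()

# Money earned up to and including each trip.
driver_day["earned_so_far"] = driver_day["Fare"].cumsum()
```

Each row of `driver_day` could then become one frame of the animated map, for example with `folium.plugins.TimestampedGeoJson`, with `earned_so_far` shown in a popup.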

## 5. One day of the city

## Bonus