# Predict Future

In this part, you will not start anything new. Instead, you will continue working with the Prague data from the previous section and dig a bit deeper into the problem. Not everything has been covered in class, so consult the documentation when unsure.

## Continue with Classification

### 1. Explore the Classification Problem Further
- **Try different combinations of independent variables.**
  - Does it make sense to combine proximity variables with spatial heterogeneity? Test that.
  - Contrary to what you may expect, removing some variables with low importance can improve performance. Is this the case in our situation?
  - Find the best combination of variables. How far can you push accuracy?

In [1]:
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shapely
from libpysal import graph
from sklearn import ensemble, metrics, model_selection

In [2]:
gdf_buildings = gpd.read_file(
  "https://martinfleischmann.net/sds/classification/data/prg_building_locations.gpkg",
)
gdf_buildings.head()

| | cluster | floor_area_ratio | height | compactness | street_alignment | interbuilding_distance | block_perimeter_wall_length | basic_settlement_unit | cadastral zone | geometry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | large-scale industry | 0.614375 | 23.458 | 0.747131 | 10.601522 | 37.185479 | 57.751467 | U cementárny | Radotín | POINT (-749841.681 -1052279.951) |
| 1 | medieval city | 2.993299 | 16.099 | 0.469154 | 8.655982 | 8.547983 | 1033.921828 | Horní malostranský obvod | Malá Strana | POINT (-744432.289 -1042699.409) |
| 2 | periphery | 0.108374 | 3.673 | 0.498831 | 2.473966 | 26.135688 | 74.432812 | Dolní Měcholupy-střed | Dolní Měcholupy | POINT (-733300.261 -1048136.856) |
| 3 | periphery | 0.290723 | 9.097 | 0.627294 | 6.054875 | 32.423481 | 38.59203 | Trojský obvod | Troja | POINT (-742468.177 -1039691.997) |
| 4 | grids | 0.017193 | 4.216 | 0.540439 | 0.134446 | 48.068409 | 49.125654 | Vrch Svatého kříže | Žižkov | POINT (-740093.985 -1043857.813) |


In [3]:
independent_variables = [
    "floor_area_ratio",
    "height",
    "compactness",
    "street_alignment",
    "interbuilding_distance",
    "block_perimeter_wall_length",
]

training_sample = gdf_buildings.sample(20_000, random_state=0)


In [4]:
# add a proximity variable (distance to Old Town Square)
old_town_square = (
    gpd.tools.geocode("Old Town Square, Prague", provider="nominatim", user_agent="sds")
    .to_crs(gdf_buildings.crs)
    .geometry.item()
)
training_sample["distance_to_old_town"] = training_sample.distance(old_town_square)

# add spatial heterogeneity (point coordinates as variables)
training_sample[["x", "y"]] = training_sample.get_coordinates()
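
A minimal sketch of how the variable combinations from the first task could be compared. It is not the Prague workflow itself: the data frame `demo`, its `cluster` labels, and the helper `compare_feature_sets` are hypothetical stand-ins, and only the column names echo the ones used above. The idea is to score the same random forest on each list of columns with cross-validation and rank the results.

```python
import numpy as np
import pandas as pd
from sklearn import ensemble, model_selection


def compare_feature_sets(data, target, feature_sets, cv=4, random_state=0):
    """Return the mean cross-validated accuracy for each named list of columns."""
    scores = {}
    for name, columns in feature_sets.items():
        model = ensemble.RandomForestClassifier(
            n_estimators=50, random_state=random_state
        )
        cv_scores = model_selection.cross_val_score(
            model, data[columns], data[target], cv=cv, scoring="accuracy"
        )
        scores[name] = cv_scores.mean()
    return pd.Series(scores, name="accuracy").sort_values(ascending=False)


# Synthetic stand-in for `training_sample` (assumption: NOT the Prague data).
rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame(
    {
        "height": rng.normal(10, 3, n),
        "compactness": rng.uniform(0, 1, n),
        "distance_to_old_town": rng.uniform(0, 15_000, n),
        "x": rng.uniform(0, 1, n),
        "y": rng.uniform(0, 1, n),
    }
)
# The label depends on height and distance only, so the other columns are noise.
demo["cluster"] = np.where(
    (demo["height"] > 10) & (demo["distance_to_old_town"] < 8_000),
    "dense",
    "sparse",
)

feature_sets = {
    "morphology only": ["height", "compactness"],
    "morphology + proximity": ["height", "compactness", "distance_to_old_town"],
    "morphology + proximity + xy": [
        "height",
        "compactness",
        "distance_to_old_town",
        "x",
        "y",
    ],
}

accuracy = compare_feature_sets(demo, "cluster", feature_sets)
print(accuracy)
```

With the real data, `demo` would be replaced by `training_sample` and the lists by subsets of `independent_variables` plus `distance_to_old_town`, `x` and `y`. Random folds can overstate accuracy when neighbouring buildings end up on both sides of a split, so a grouped splitter such as `model_selection.GroupKFold` may be a fairer choice.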


### 2. Test Other ML Models
- **Experiment with models other than random forest.**
  - Compare the same input using different models, such as:
    - `HistGradientBoostingClassifier`
    - `DecisionTreeClassifier`
    - `AdaBoostClassifier`
  - Determine which model performs best with default hyperparameters.

### 3. Refine Your Results
- **Pick your favorite model and examine its prediction certainty.**
  - Identify clusters of high and low prediction certainty.
- **Fine-tune the models using grid search.**
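
The two refinement steps could be combined roughly as follows. This is a sketch under stated assumptions: the data are synthetic, the parameter grid holds illustrative values rather than recommended ones, and "certainty" is taken here to mean the highest class probability returned by `predict_proba`, which is one common reading and not necessarily the only one.

```python
from sklearn import ensemble, model_selection
from sklearn.datasets import make_classification

# Synthetic stand-in for the Prague variables (assumption: NOT the real data).
X, y = make_classification(
    n_samples=600,
    n_features=6,
    n_informative=4,
    n_classes=3,
    random_state=0,
)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, None],
    "min_samples_leaf": [1, 5],
}

# Try every combination in the grid with 3-fold cross-validation.
search = model_selection.GridSearchCV(
    ensemble.RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))

# Certainty of each prediction: the highest class probability per observation.
probabilities = search.best_estimator_.predict_proba(X_test)
certainty = probabilities.max(axis=1)
print("least certain prediction:", round(certainty.min(), 3))
print("most certain prediction:", round(certainty.max(), 3))
```

On the real data, `certainty` could be stored as a column of the GeoDataFrame and mapped. Clusters of high and low values could then be looked for with a local measure of spatial autocorrelation, for example local Moran's I on a graph of neighbouring buildings.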