## Introduction

This dataset contains data in Apache Parquet format along with the corresponding geometries in shapefiles. We will explore how to import and use the dataset. All the Parquet files are located in the `parquet` directory and all the shapefiles are located in the `shapefiles` directory. Each directory contains subdirectories for the `mobile` and `fixed` connection types, each of which has further subdirectories for each quarter.

In [None]:
import folium
import geopandas as gpd
import pandas as pd
from folium import Choropleth

In [None]:
ls -R "/kaggle/input/ookla-internet-speed-dataset"

## Parquet and Shapefile

Parquet files can be read using `pandas` or `dask`. Parquet is a widely used columnar format for storing data.

The shapefiles are located in the `shapefiles` directory. The benefit of reading a shapefile directly is that the geometry column is interpreted automatically.

For quarter 1 of mobile speeds, we need the `/kaggle/input/ookla-internet-speed-dataset/shapefiles/mobile/quarter=1` directory. To read a shapefile, we use the `geopandas` package.

In [None]:
df = gpd.read_file(
    "/kaggle/input/ookla-internet-speed-dataset/shapefiles/mobile/quarter=1/gps_mobile_tiles.shp",
    rows=10000,  # read only the first 10,000 rows of the shapefile
)
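The directory convention described above can be captured in a small helper, so the same cell works for other quarters or for the other connection type. This is only a sketch: `shapefile_path` is a hypothetical name, and the `gps_fixed_tiles.shp` file name for `fixed` is an assumption extrapolated from the mobile file name, so check it against the `ls -R` output first.

```python
from pathlib import Path

# Root of the dataset as used elsewhere in this notebook.
DATASET_ROOT = Path("/kaggle/input/ookla-internet-speed-dataset")


def shapefile_path(connection: str, quarter: int, root: Path = DATASET_ROOT) -> Path:
    """Build the path to the shapefile for a connection type and quarter.

    Assumes the layout shapefiles/<connection>/quarter=<n>/gps_<connection>_tiles.shp.
    The file name for "fixed" is an assumption, not confirmed by this notebook.
    """
    if connection not in ("mobile", "fixed"):
        raise ValueError(f"unknown connection type: {connection!r}")
    return (
        root
        / "shapefiles"
        / connection
        / f"quarter={quarter}"
        / f"gps_{connection}_tiles.shp"
    )
```

For example, `gpd.read_file(shapefile_path("mobile", 1), rows=10000)` should be equivalent to the cell above.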

In [None]:
df

We can see that there is a `geometry` dtype present.

In [None]:
df.dtypes

We will now plot the data where the average latency was less than 50 ms. For that purpose, we will use the `folium` package to build interactive maps.

In [None]:
m = folium.Map(location=[41.87, -87.62], tiles='cartodbpositron', zoom_start=11)

In [None]:
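# Keep only rows with average latency below 50 ms and index them by quadkey,
# so that each GeoJSON feature id matches the index of the data series.
d = df[df.avg_lat_ms < 50].set_index("quadkey")

Choropleth(
    geo_data=d.geometry.__geo_interface__,
    data=d.avg_lat_ms,
    legend_name="Average Latency (ms)",
    fill_color="RdPu",
    fill_opacity=0.8,  # Choropleth takes fill_opacity; a bare `opacity` is ignored
    highlight=True,
    key_on="feature.id",
).add_to(m)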

In [None]:
m

We can also work with the geometries directly.

In [None]:
list(df.loc[0].geometry.exterior.coords)
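The corner coordinates printed above can also be approximated from the `quadkey` itself. The sketch below assumes that the `quadkey` column follows the standard Bing Maps tile-system encoding (one base-4 digit per zoom level, Web Mercator tiles); this notebook does not state that, so treat it as an assumption. The function names `quadkey_to_tile` and `tile_bounds` are hypothetical.

```python
import math
from typing import Tuple


def quadkey_to_tile(quadkey: str) -> Tuple[int, int, int]:
    """Decode a quadkey into (tile_x, tile_y, zoom)."""
    zoom = len(quadkey)
    x = y = 0
    for i, digit in enumerate(quadkey):
        if digit not in "0123":
            raise ValueError(f"invalid quadkey digit: {digit!r}")
        mask = 1 << (zoom - i - 1)
        value = int(digit)
        if value & 1:  # low bit selects the right-hand half of the parent tile
            x |= mask
        if value & 2:  # high bit selects the lower half of the parent tile
            y |= mask
    return x, y, zoom


def tile_bounds(x: int, y: int, zoom: int) -> Tuple[float, float, float, float]:
    """Return (west, south, east, north) in degrees for a Web Mercator tile."""
    n = 2 ** zoom

    def lon(col: int) -> float:
        return col / n * 360.0 - 180.0

    def lat(row: int) -> float:
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / n))))

    return lon(x), lat(y + 1), lon(x + 1), lat(y)
```

If the assumption holds, `tile_bounds(*quadkey_to_tile(df.loc[0].quadkey))` should roughly match the extent of the polygon listed above.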

When we want to analyze the whole dataset, the Parquet files are more convenient, because with shapefiles we would have to load each file individually and then combine them. We can simply use `pandas` for that purpose. However, loading everything with `pandas` can exhaust the available memory, in which case we can use `dask` instead.

In [None]:
df = pd.read_parquet("/kaggle/input/ookla-internet-speed-dataset/parquet/mobile-speeds/")

In [None]:
df
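If the `pandas` cell above runs out of memory, the same directory can be read lazily with `dask`. The following is a minimal sketch rather than a tested recipe for this dataset: `mean_latency` is a hypothetical helper, and it assumes the Parquet files expose the same `avg_lat_ms` column as the shapefile.

```python
def mean_latency(parquet_dir: str, column: str = "avg_lat_ms") -> float:
    """Compute the mean of one column without loading the whole dataset.

    Assumes `column` exists in the Parquet files (an assumption here).
    """
    # Imported inside the function so the helper can be defined even
    # in an environment where dask is not installed.
    import dask.dataframe as dd

    # Only the requested column is read; nothing is loaded until compute().
    ddf = dd.read_parquet(parquet_dir, columns=[column])
    return float(ddf[column].mean().compute())
```

A call such as `mean_latency("/kaggle/input/ookla-internet-speed-dataset/parquet/mobile-speeds/")` would then process the data partition by partition instead of materializing a single large DataFrame.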

Similarly, we can work with the `fixed-speeds` data too.

There is also a Kaggle Learn course that teaches geospatial analysis:

[Geospatial Analysis - Kaggle Learn](https://www.kaggle.com/learn/geospatial-analysis)