## Gemmm

A table of the Middle Super Output Areas (MSOAs) used to fit the telecoms model is provided in the `tables` module.

This table includes columns for the code and name of each MSOA, as well as its corresponding Local Authority District (LAD), region and country. Wales and Scotland have no equivalent to the English regions, so for MSOAs in those countries the region column is set equal to the LAD column.
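The region fallback described above can be illustrated with a small stand-in for the table. This is only a sketch in plain Python: the first row copies values from the `gb_msoas` output in this section, while the Welsh row uses made-up placeholder codes, and the helper `region_falls_back_to_lad` is a hypothetical name, not part of `gemmm`.

```python
# Illustrative rows only: the English row copies codes from the gb_msoas
# output in this section; the Welsh row uses made-up placeholder codes.
rows = [
    {"msoa": "E02000001", "lad": "E09000001", "region": "E12000007", "country": "England"},
    {"msoa": "W02_EXAMPLE", "lad": "W06_EXAMPLE", "region": "W06_EXAMPLE", "country": "Wales"},
]


def region_falls_back_to_lad(row):
    """Return True when the region column simply repeats the LAD column."""
    return row["region"] == row["lad"]


# Map each MSOA code to whether its region is just a copy of its LAD.
fallback = {row["msoa"]: region_falls_back_to_lad(row) for row in rows}
print(fallback)
```

A check like this is useful before grouping by region, since grouping Welsh or Scottish MSOAs by region is effectively the same as grouping them by LAD.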

In [1]:
from gemmm.tables import gb_msoas

In [2]:
gb_msoas.head()

| | msoa | msoa_name | lad | lad_name | region | region_name | country |
|---|---|---|---|---|---|---|---|
| 0 | E02000001 | City of London 001 | E09000001 | City of London | E12000007 | London | England |
| 1 | E02000002 | Barking and Dagenham 001 | E09000002 | Barking and Dagenham | E12000007 | London | England |
| 2 | E02000003 | Barking and Dagenham 002 | E09000002 | Barking and Dagenham | E12000007 | London | England |
| 3 | E02000004 | Barking and Dagenham 003 | E09000002 | Barking and Dagenham | E12000007 | London | England |
| 4 | E02000005 | Barking and Dagenham 004 | E09000002 | Barking and Dagenham | E12000007 | London | England |


We can use this table to extract a list of MSOA codes for a certain area.

In [3]:
LAD_NAME = 'Cambridge'
msoas = gb_msoas.query('lad_name==@LAD_NAME').msoa.values

### Generating samples

The `OriginDestination` class can be used to sample the numbers of journeys between MSOAs at different hours of the day.

To do so, we provide it with a list of MSOAs and a day type, either weekday or weekend.

The model requires two data files to generate the samples. These files are downloaded from [Gemmm/model_data](https://github.com/ukhsa-collaboration/Gemmm/tree/main/model_data) and cached for future use. 

In [4]:
from gemmm import OriginDestination

In [5]:
X = OriginDestination(msoas=msoas, day_type='weekday')

We now need to specify the hours for which we require samples (0-23), as well as the number of realizations for each hour.

In [6]:
X.generate_sample(hours=[8, 12, 16], n_realizations=5)

The samples are stored in a dictionary in `X.samples`. The keys of the dictionary take the form (x, y) where x is the hour and y is the realization number.
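As a rough sketch of the key structure, the snippet below lists the keys we would expect for `hours=[8, 12, 16]` and `n_realizations=5`. It does not call `gemmm`; it assumes realizations are numbered from 0, which is consistent with `realization=0` being used later in this section but is not confirmed here.

```python
hours = [8, 12, 16]
n_realizations = 5

# Assumption: realization numbers run from 0 to n_realizations - 1.
expected_keys = [(hour, r) for hour in hours for r in range(n_realizations)]

print(len(expected_keys))   # one key per (hour, realization) pair
print(expected_keys[:3])
```

Under that assumption, a single sample would be looked up with a tuple key such as `X.samples[(8, 0)]`.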

Each sample is stored as a sparse matrix in coordinate (COO) format. The `row` attribute contains the indices of the start MSOAs, the `col` attribute contains the indices of the end MSOAs, and the `data` attribute contains the numbers of journeys. These indices refer to each MSOA's position in the list or NumPy array originally passed to the `OriginDestination` class.
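To make the coordinate layout concrete, here is a minimal sketch that decodes three parallel sequences into (start, end, journeys) triples. It uses plain lists as stand-ins for the `row`, `col` and `data` attributes rather than a real sparse matrix, and it assumes `E02003719` sits at position 0 of the MSOA list; the journey counts are taken from the `df_8` output in this section.

```python
# Stand-ins for the MSOA list passed to OriginDestination and for the
# row / col / data attributes of one sample (illustrative values).
msoas = ["E02003719", "E02003720", "E02003721"]
row = [0, 0, 0]       # index of the start MSOA for each entry
col = [0, 1, 2]       # index of the end MSOA for each entry
data = [229, 5, 7]    # number of journeys for each entry

# Entry k describes data[k] journeys from msoas[row[k]] to msoas[col[k]].
journeys = [(msoas[i], msoas[j], n) for i, j, n in zip(row, col, data)]

for start, end, n in journeys:
    print(start, end, n)
```

Only the entries listed in the three sequences are stored, which is what keeps the representation compact when most pairs of MSOAs have no journeys.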

The `to_pandas` method can be used to convert from a sparse matrix to a pandas DataFrame. The indices of the start and end MSOAs are now replaced with their respective codes:

In [7]:
df_8 = X.to_pandas(hour=8, realization=0)
df_8.head()

| | start_msoa | end_msoa | journeys |
|---|---|---|---|
| 0 | E02003719 | E02003719 | 229 |
| 1 | E02003719 | E02003720 | 5 |
| 2 | E02003719 | E02003721 | 7 |
| 3 | E02003719 | E02003722 | 2 |
| 4 | E02003719 | E02003723 | 5 |


We can also write the samples to a NetCDF4 file using the `save_sample` argument. If it is set to `True`, the file is saved in the current working directory; alternatively, we can specify a directory.

In [8]:
X.generate_sample(hours=[8, 12, 16], n_realizations=5, save_sample=True)

Saving samples to weekday_samples_2024-09-09--14-57-33.nc


### Loading samples

To load the samples, we again use the `OriginDestination` class, but this time provide the path to the file.

In [9]:
# update with the file name from the output of the previous cell
Y = OriginDestination(file='weekday_samples_2024-09-09--14-57-33.nc')

Available hours: 8, 12, 16
Number of realizations: 5


We can load a specific realization for one of the available hours. If the `realization` argument is omitted, a realization will be chosen at random.

By default, this will return a pandas DataFrame containing the start MSOA, end MSOA, and the number of journeys between them. Pairs with zero journeys are not included.

In [10]:
hour = 8
realization = 0
loaded_sample = Y.load_sample(hour=hour, realization=realization)
loaded_sample.head(5)

| | start_msoa | end_msoa | journeys |
|---|---|---|---|
| 0 | E02003719 | E02003719 | 231 |
| 1 | E02003719 | E02003720 | 2 |
| 2 | E02003719 | E02003721 | 6 |
| 3 | E02003719 | E02003722 | 1 |
| 4 | E02003719 | E02003723 | 7 |


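Note that this is equivalent to the output produced by calling `to_pandas` on the original object `X`. The values differ from the `df_8` output shown earlier, presumably because the samples were regenerated by the second call to `generate_sample`.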

In [11]:
(X.to_pandas(hour, realization)).equals(loaded_sample)

True

In addition, setting `wide=True` returns a DataFrame that more closely resembles an origin-destination matrix:

In [12]:
Y.load_sample(hour=hour, realization=realization, wide=True)

Journeys, with `start_msoa` as rows and `end_msoa` as columns:

| start_msoa | E02003719 | E02003720 | E02003721 | E02003722 | E02003723 | E02003724 | E02003725 | E02003726 | E02003727 | E02003728 | E02003729 | E02003730 | E02003731 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E02003719 | 231.0 | 2.0 | 6.0 | 1.0 | 7.0 | 10.0 | 43.0 | 9.0 | 1.0 | 32.0 | 10.0 | 33.0 | 14.0 |
| E02003720 | 5.0 | 260.0 | 18.0 | 3.0 | 6.0 | 19.0 | 38.0 | 10.0 | 6.0 | 30.0 | 8.0 | 51.0 | 29.0 |
| E02003721 | 2.0 | 3.0 | 403.0 | 3.0 | 8.0 | 10.0 | 46.0 | 6.0 | 3.0 | 15.0 | 8.0 | 32.0 | 18.0 |
| E02003722 | 1.0 | 1.0 | 6.0 | 390.0 | 0.0 | 1.0 | 17.0 | 5.0 | 0.0 | 11.0 | 7.0 | 21.0 | 15.0 |
| E02003723 | 5.0 | 8.0 | 3.0 | 2.0 | 426.0 | 8.0 | 25.0 | 4.0 | 6.0 | 9.0 | 2.0 | 30.0 | 13.0 |
| E02003724 | 6.0 | 12.0 | 5.0 | 4.0 | 10.0 | 543.0 | 81.0 | 4.0 | 2.0 | 14.0 | 20.0 | 37.0 | 50.0 |
| E02003725 | 8.0 | 14.0 | 19.0 | 10.0 | 15.0 | 24.0 | 1025.0 | 15.0 | 9.0 | 36.0 | 18.0 | 51.0 | 28.0 |
| E02003726 | 5.0 | 3.0 | 11.0 | 1.0 | 1.0 | 4.0 | 21.0 | 346.0 | 2.0 | 4.0 | 11.0 | 5.0 | 5.0 |
| E02003727 | 3.0 | 6.0 | 4.0 | 2.0 | 5.0 | 3.0 | 12.0 | 0.0 | 375.0 | 7.0 | 4.0 | 16.0 | 8.0 |
| E02003728 | 7.0 | 4.0 | 16.0 | 8.0 | 15.0 | 9.0 | 52.0 | 6.0 | 5.0 | 342.0 | 10.0 | 26.0 | 15.0 |
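The relationship between the long and wide layouts can be sketched in plain Python. This is not how `gemmm` builds the wide DataFrame; it is a minimal illustration that assumes pairs missing from the long format correspond to zero journeys, which is consistent with the `0.0` entries above. The three rows use values for `E02003722` and `E02003723` taken from that output.

```python
# Long format: one (start, end, journeys) triple per non-zero pair.
long_rows = [
    ("E02003722", "E02003722", 390),
    ("E02003723", "E02003722", 2),
    ("E02003723", "E02003723", 426),
]

# Collect every MSOA code that appears as a start or an end.
codes = sorted({code for start, end, _ in long_rows for code in (start, end)})

# Wide format: start MSOA -> end MSOA -> journeys, with zeros filled in.
wide = {start: {end: 0 for end in codes} for start in codes}
for start, end, n in long_rows:
    wide[start][end] = n

for start in codes:
    print(start, [wide[start][end] for end in codes])
```

The pair from `E02003722` to `E02003723` is absent from the long rows, so it appears as 0 in the wide layout.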


Finally, setting `as_pandas=False` returns a NumPy array that contains the indices of the start and end MSOAs, rather than their codes.