# FSDS Group Assessment (Group Safari)

## 1. Data Collection and Cleaning
We use two datasets:
1. Airbnb data for London (10 Dec 2022), downloaded from [InsideAirbnb](http://insideairbnb.com/get-the-data)  
2. 2011 and 2021 Census data including:
* popchurn 11.csv
* MIG009EW_LTLA_OUT.csv
* MIG009EW_LTLA_IN.csv
* ethnic group 2011.csv
* ethnic group 2021.csv
* house price_median.xls
* house price_aver.xlsx
* Deprivation 2011.xls
* Deprivation 2021.csv


### 1.1 Input data and create dataframe

Note that all data in the Data subdirectory is ignored in the `.gitignore` file.

The file names that are used in this script are as follows.

|Data Type|File Name|df/gdf name|Gentrification Score df Name|Note|
|:---|:---|:---|:--|:--|
||`popchurn 11.csv`|`popch2011`|`2011moving%`||
||`MIG009EW_LTLA_OUT.csv`|`moving2021`|`2021moving%`||
||`MIG009EW_LTLA_IN.csv`||||
||`ethnic group 2011.csv`|`eg2011`|`w_ratio11`||
||`ethnic group 2021.csv`|`eg2021`|`w_ratio21`||
||`house price_median.xls`|`price_med`|`houseprice%`||
||`house price_aver.xlsx`|`housing_df`|||
||`Deprivation 2011.xls`|`dpr2011`|`dpr2011%`||
||`Deprivation 2021.csv`|`dpr2021`|`dpr2021%`||
|`multipolygon`|`Boroughs.gpkg`|`boros`||`crs:27700`|
|`Airbnb listing`|`listings.csv.gz`|`ls`|`\`||
|`Airbnb listing -> geopandas`|`\`|`gls`|`\`|`crs:27700`|

#### 1.1.1 Get Prepared

In [1]:
# Import packages
import os
from urllib.request import urlopen
from requests import get
from urllib.parse import urlparse
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import re



In [2]:
# Download data from remote location
def cache_data(src:str, dest:str) -> str:
    """Downloads and caches a remote file locally.
    
    The function sits between the 'read' step of a pandas or geopandas
    data frame and downloading the file from a remote location. The idea
    is that it will save it locally so that you don't need to remember to
    do so yourself. Subsequent re-reads of the file will return instantly
    rather than downloading the entire file for a second or n-th time.
    
    Parameters
    ----------
    src : str
        The remote *source* for the file, any valid URL should work.
    dest : str
        The *destination* location to save the downloaded file.
        
    Returns
    -------
    str
        A string representing the local location of the file.
    """
    url = urlparse(src) # We assume that this is some kind of valid URL 
    fn  = os.path.split(url.path)[-1] # Extract the filename
    dfn = os.path.join(dest,fn) # Destination filename
    
    if not os.path.isfile(dfn):
        
        print(f"{dfn} not found, downloading!")

        path = os.path.split(dest)
        
        if len(path) >= 1 and path[0] != '':
            os.makedirs(os.path.join(*path), exist_ok=True)
            
        with open(dfn, "wb") as file:
            response = get(src)
            file.write(response.content)
            
        print("\tDone downloading...")

    else:
        print(f"Found {dfn} locally!")

    return dfn
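
To illustrate how `cache_data` derives the local file path, the sketch below repeats only the path logic, with no download. The helper name `local_path` is hypothetical and is not part of the notebook. Because `dest` is treated as a directory, passing a value that already ends in a file name produces a nested path.

```python
import os
from urllib.parse import urlparse

def local_path(src: str, dest: str) -> str:
    """Return the path cache_data would write to (hypothetical helper, no I/O)."""
    url = urlparse(src)               # the query string (?raw=true) is not part of url.path
    fn = os.path.split(url.path)[-1]  # the last path component is the file name
    return os.path.join(dest, fn)     # dest is treated as a directory

src = 'https://github.com/jreades/fsds/blob/master/data/src/Boroughs.gpkg?raw=true'
print(local_path(src, 'Data'))                # on Linux/macOS: Data/Boroughs.gpkg
print(local_path(src, 'Data/Boroughs.gpkg'))  # on Linux/macOS: Data/Boroughs.gpkg/Boroughs.gpkg
```

Passing only the directory as `dest` therefore keeps the cached file directly under `Data/`.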

Please save the data files in the ***Data/*** directory, which sits at the same level as this `.ipynb` file.

In [3]:
#local_repo_dir = 'Documents/casa/fsds/group' # change this to your own directory under 'work'
# os.chdir('/home/jovyan/work/' + local_repo_dir)
padir = 'Data/'

#### 1.1.2 Read Files and Select Columns

<span style="color:red">**For now, the data is read from local files (stored in the "Data" folder); this will be changed later to download directly from a URL.** </span>

In [None]:
## Population Churn
popch2011 = pd.read_csv(padir+'popchurn 11.csv', skiprows=7, skip_blank_lines=True, usecols=[
    'local authority: district / unitary (prior to April 2015)',
    'mnemonic',
    'Whole household lived at same address one year ago', 
    'Wholly moving household: Total']).dropna(how='all').iloc[:33]

popch2021_in_raw = pd.read_csv(padir + 'MIG009EW_LTLA_IN.csv', usecols=['Lower tier local authorities code', 'Household migration LTLA (inflow) (7 categories) code', 'Count'])
popch2021_out_raw = pd.read_csv(padir + 'MIG009EW_LTLA_OUT.csv', usecols=['Migrant LTLA one year ago code', 'Household migration LTLA (outflow) (3 categories) code', 'Count'])

popch2021_in = popch2021_in_raw.loc[popch2021_in_raw['Lower tier local authorities code'].astype(str).str.match(r'^E090000[0-9]{2}$|^E09000[1-3][0-3]$', na=False)]
popch2021_out = popch2021_out_raw.loc[popch2021_out_raw['Migrant LTLA one year ago code'].astype(str).str.match(r'^E090000[0-9]{2}$|^E09000[1-3][0-3]$', na=False)]

## Ethnic Group
eg2011 = pd.read_csv(padir+'ethnic group 2011.csv', skiprows=7, header=0, skip_blank_lines=True, usecols=[
    'mnemonic','All categories: Ethnic group','White'])
eg2021 = pd.read_csv(padir+'ethnic group 2021.csv', skiprows=6, header=0, skip_blank_lines=True, usecols=[
    'mnemonic','Total: All usual residents','White'])

## Housing price
# median housing price
price_med_raw = pd.read_excel(padir+'house price_median.xls',sheet_name='1a',engine='xlrd',skiprows=5,header=0,usecols=[
    'Local authority code','Year ending Dec 2001','Year ending Dec 2021'])
price_med = price_med_raw.loc[price_med_raw['Local authority code'].astype(str).str.contains(r'^E09', regex=True)]
price_med.set_index('Local authority code', inplace=True)
# average housing price 
housing_price = "house price_aver.xlsx"
housing_df = pd.read_excel(os.path.join(padir, housing_price),sheet_name=2,skiprows=1, header=0,index_col=0)

## Deprivation
dpr2011_raw = pd.read_excel(padir+'deprivation 2011.xls',sheet_name='QS119EW_Percentages',engine='xlrd',skiprows=10,header=0, usecols=[
    'Area code','Household is not deprived in any dimension'])
dpr2011 = dpr2011_raw.loc[dpr2011_raw['Area code'].astype(str).str.contains(r'^E090000[0-2][0-9]$|^E090003[0-3]$|^E090000[1-9][0-9]$|^E09000[1-3][0-3]$'
, regex=True)]

dpr2021_raw = pd.read_csv(padir+'deprivation 2021.csv')
dpr2021 = dpr2021_raw[dpr2021_raw['Upper tier local authorities Code'].astype(str).str.contains(
    r'^E090000[0-2][0-9]$|^E090003[0-3]$|^E090000[1-9][0-9]$|^E09000[1-3][0-3]$', regex=True)]
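
As a sanity check on the borough filters, the sketch below tests the pattern used for the population-churn tables against the 33 London codes (E09000001 to E09000033) and a few non-London codes. The non-London test strings are made up for illustration.

```python
import re

# Pattern copied from the population-churn filter above
pattern = re.compile(r'^E090000[0-9]{2}$|^E09000[1-3][0-3]$')

# London borough codes run from E09000001 to E09000033
london = [f'E09{n:06d}' for n in range(1, 34)]
# Illustrative strings that should be rejected
others = ['E06000001', 'E08000001', 'W06000001', 'E09000001x']

print(all(pattern.match(code) for code in london))  # True
print(any(pattern.match(code) for code in others))  # False
```

Only the first alternative does any work here: it accepts `E090000` followed by any two digits, while the second alternative describes an 8-character string and so cannot match a 9-character code.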

### 1.2 Calculate Gentrification Score
G = (1/2)c - (1/4)e + (1/8)h - (1/8)d + 0.25  
    c: population churn at household level - the change between 2011 and 2021 in the share of households that moved  
    e: ethnic group - the change in the proportion of non-white residents  
    h: housing price - relative change in the median house price compared with the average price  
    d: deprivation - the change in the proportion of households deprived in at least one dimension
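
As a worked example of the formula, the sketch below evaluates G for one set of inputs. The function name `gentrification_score` and the input values are made up for illustration and are not taken from the data.

```python
def gentrification_score(c: float, e: float, h: float, d: float) -> float:
    """Weighted sum from the formula above: c, e, h and d are the four change measures."""
    return 0.5 * c - 0.25 * e + 0.125 * h - 0.125 * d + 0.25

# 0.5*2 - 0.25*4 + 0.125*8 - 0.125*(-8) + 0.25 = 1 - 1 + 1 + 1 + 0.25
print(gentrification_score(c=2.0, e=4.0, h=8.0, d=-8.0))  # 2.25
```

Churn carries the largest weight, so a one-point change in `c` moves the score four times as much as a one-point change in `h` or `d`.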

#### 1.2.1   c: Population Churn (Household Level)

In [None]:
gtr = pd.DataFrame()
## 2011 moving households
popch2011['2011moving%'] = (100*
    (popch2011['Wholly moving household: Total'] /
    (popch2011['Wholly moving household: Total'] + popch2011['Whole household lived at same address one year ago'])))

gtr['borough'] = popch2011['local authority: district / unitary (prior to April 2015)']
gtr['borough code'] = popch2011['mnemonic']
gtr['2011moving%'] = popch2011['2011moving%']
print(gtr)

In [None]:
# 2021 moving households
# population churn = moving household / all household =  moving household / (not moving household +  moving household)
samead = popch2021_in.loc[popch2021_in['Household migration LTLA (inflow) (7 categories) code'] == 1].groupby('Lower tier local authorities code')['Count'].sum().reset_index()
movein = popch2021_in.loc[(popch2021_in['Household migration LTLA (inflow) (7 categories) code'] >= 2) & (popch2021_in['Household migration LTLA (inflow) (7 categories) code'] <= 5)].groupby('Lower tier local authorities code')['Count'].sum().reset_index()
moveout = popch2021_out.loc[(popch2021_out['Household migration LTLA (outflow) (3 categories) code'] >= 1) & (popch2021_out['Household migration LTLA (outflow) (3 categories) code'] <= 2)].groupby('Migrant LTLA one year ago code')['Count'].sum().reset_index()

samead = samead.rename(columns={'Lower tier local authorities code': 'code'})
movein = movein.rename(columns={'Lower tier local authorities code': 'code'})
moveout = moveout.rename(columns={'Migrant LTLA one year ago code': 'code'})
print(samead.head(5))
moving2021 = (
    (movein.set_index('code')['Count'] +moveout.set_index('code')['Count']) /
    (samead.set_index('code')['Count'] + movein.set_index('code')['Count'] + moveout.set_index('code')['Count'])
).reset_index(name='2021moving%') * 100
print(moving2021.head(5))

# Extract the first part of 'code' in moving2021
moving2021['code'] = moving2021['code'].str.slice(0, 9)

# Merge the result into gtr based on 'borough code' and 'code'
gtr = gtr.merge(moving2021, how='left', left_on='borough code', right_on='code',suffixes=('', '_y'))

# Drop the redundant 'code' column
gtr = gtr.drop(columns=['code'])

# add 'popchurn' column: 
gtr['popchurn%'] = gtr['2021moving%'] - gtr['2011moving%']
# Display the resulting DataFrame gtr
print(gtr)
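
The 2021 churn calculation relies on pandas aligning the three series on their index labels before dividing. A minimal sketch with made-up counts for two fictional areas `A` and `B`:

```python
import pandas as pd

# Made-up household counts, indexed by a fictional area code
same = pd.Series({'A': 900, 'B': 800})       # households at the same address
moved_in = pd.Series({'A': 60, 'B': 150})    # households that moved in
moved_out = pd.Series({'A': 40, 'B': 50})    # households that moved out

moving = moved_in + moved_out                # added label by label, not by position
churn_pct = 100 * moving / (same + moving)   # moving households as a % of all households
print(churn_pct)
```

A label missing from any of the series yields `NaN` for that area, so it is worth checking the result for missing values before merging it into `gtr`.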

#### 1.2.2 e: Non-white Ethnic Group Proportion Change

In [None]:
# Calculate the ratio for 2011
eg2011['w_ratio11'] = eg2011['White'] / eg2011['All categories: Ethnic group']

# Calculate the ratio for 2021
eg2021['w_ratio21'] = eg2021['White'] / eg2021['Total: All usual residents']

# Merge with gtr based on 'mnemonic' and 'borough code'
gtr = gtr.merge(eg2011[['mnemonic', 'w_ratio11']], how='left', left_on='borough code', right_on='mnemonic')
gtr = gtr.merge(eg2021[['mnemonic', 'w_ratio21']], how='left', left_on='borough code', right_on='mnemonic')

# Drop redundant columns
gtr = gtr.drop(columns=['mnemonic_x', 'mnemonic_y'])

# add 'ethnic group%' column
gtr['ethgr%'] = (gtr['w_ratio11'] - gtr['w_ratio21']) * 100
print(gtr)
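
One possible way to avoid the redundant `mnemonic_x` and `mnemonic_y` columns is to rename the key before merging. The sketch below uses made-up frames (`gtr_demo`, `eg_demo`) rather than the notebook's data, and adds `validate='one_to_one'` so that duplicated codes raise an error instead of silently multiplying rows.

```python
import pandas as pd

# Made-up stand-ins for gtr and an ethnic-group table
gtr_demo = pd.DataFrame({'borough code': ['X1', 'X2']})
eg_demo = pd.DataFrame({'mnemonic': ['X1', 'X2'], 'White': [60, 30], 'Total': [100, 100]})
eg_demo['w_ratio'] = eg_demo['White'] / eg_demo['Total']

# Rename the key so both frames share one column name, then merge on it
merged = gtr_demo.merge(
    eg_demo[['mnemonic', 'w_ratio']].rename(columns={'mnemonic': 'borough code'}),
    on='borough code', how='left', validate='one_to_one')
print(merged.columns.tolist())  # ['borough code', 'w_ratio']
```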

####  <span style="color:red"> 1.2.3 h: Housing Price Change (Median/Average) </span>

In [8]:
## median price

unique_values = price_med.index.unique()
print(unique_values)
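# select the data for the two comparison years
# NOTE: 'median_2011' is filled from the 'Year ending Dec 2001' column, the only early year loaded above
Housing_med_df = pd.DataFrame()
Housing_med_df['median_2011'] = price_med.loc[:, ['Year ending Dec 2001']]
Housing_med_df['median_2021'] = price_med.loc[:, ['Year ending Dec 2021']]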
Housing_med_df = Housing_med_df.groupby('Local authority code')[['median_2011', 'median_2021']].median()
print(Housing_med_df.head(10))

Index(['E09000002', 'E09000003', 'E09000004', 'E09000005', 'E09000006',
       'E09000007', 'E09000001', 'E09000008', 'E09000009', 'E09000010',
       'E09000011', 'E09000012', 'E09000013', 'E09000014', 'E09000015',
       'E09000016', 'E09000017', 'E09000018', 'E09000019', 'E09000020',
       'E09000021', 'E09000022', 'E09000023', 'E09000024', 'E09000025',
       'E09000026', 'E09000027', 'E09000028', 'E09000029', 'E09000030',
       'E09000031', 'E09000032', 'E09000033'],
      dtype='object', name='Local authority code')
                      median_2011  median_2021
Local authority code                          
E09000001               237500.00     797500.0
E09000002                87871.25     333750.0
E09000003               185000.00     600000.0
E09000004               119983.75     394000.0
E09000005               158656.25     509250.0
E09000006               158500.00     490000.0
E09000007               249725.00     808125.0
E09000008               125000.00     398750.0


In [9]:
print(housing_df.head(5)) # note: the 'City of London' row has been read as the header

                    E09000001  91448.98487  82202.77314  79120.70256  \
City of London                                                         
Barking & Dagenham  E09000002  50460.22660  51085.77983  51268.96956   
Barnet              E09000003  93284.51832  93190.16963  92247.52435   
Bexley              E09000004  64958.09036  64787.92069  64367.49344   
Brent               E09000005  71306.56698  72022.26197  72015.76274   
Bromley             E09000006  81671.47692  81657.55944  81449.31143   

                    77101.20804  84409.14932  94900.51244  110128.0423  \
City of London                                                           
Barking & Dagenham  53133.50526  53042.24852  53700.34831  52113.12157   
Barnet              90762.87492  90258.00033  90107.23471  91441.24768   
Bexley              64277.66881  63997.13588  64252.32335  63722.70055   
Brent               72965.63094  73704.04743  74310.48167  74127.03788   
Bromley             81124.41227  81542.61561  82382

In [10]:
## average price

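# set the index to datetime data
# NOTE: this line raises the ValueError recorded after this cell, because the index
# holds borough names (see the preview in the previous cell) rather than dates
housing_df.index = pd.to_datetime(housing_df.index, format='%Y%m%d')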
# set the column and index name
housing_df.columns.name = 'London_borough'
housing_df.index.name = 'year'
# check the index(year) type
print(housing_df.index.dtype)
# select the london borough data
London_housing_df = housing_df.filter(regex='^E09', axis=1)
# change the column and index location 
London_housing_df = London_housing_df.transpose()
# check the data
London_housing_df.head(3) 

# select the data for December 2011 and December 2021
housing_ave_df = pd.DataFrame()
housing_ave_df['average_2011'] = London_housing_df.loc[:, ['2011-12-01']]
housing_ave_df['average_2021'] = London_housing_df.loc[:, ['2021-12-01']]
housing_ave_df.head(10) 

# link the median data and average data
total_housing_df = pd.merge(housing_ave_df,Housing_med_df, left_index=True, right_index=True)
# calculate the change of housing price
total_housing_df['Compare_2011'] = total_housing_df['median_2011']/total_housing_df['average_2011']
total_housing_df['Compare_2021'] = total_housing_df['median_2021']/total_housing_df['average_2021']
total_housing_df['houseprice%'] = (total_housing_df['Compare_2021']-total_housing_df['Compare_2011']) / total_housing_df['Compare_2011']

print(total_housing_df)

ValueError: time data "Barking & Dagenham" doesn't match format "%Y%m%d", at position 0. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

In [None]:
# add 'houseprice%' to 'gtr'
gtr = pd.merge(gtr, total_housing_df['houseprice%'], left_on='borough code', right_on='???', how='left')
# Drop the redundant 'Area code' column in 'gtr'
gtr = gtr.drop('???', axis=1)
print(gtr)
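
If `total_housing_df` ends up indexed by borough code (an assumption, since the housing cell above does not yet run), the merge key could be the index itself rather than a named column. A minimal sketch on made-up frames `gtr_demo` and `price_demo`:

```python
import pandas as pd

# Made-up stand-ins: gtr keyed by a column, the price table keyed by its index
gtr_demo = pd.DataFrame({'borough code': ['X1', 'X2', 'X3']})
price_demo = pd.DataFrame({'houseprice%': [0.10, -0.05]}, index=['X1', 'X2'])

# Match the left column against the right index; no extra key column is created
merged = gtr_demo.merge(price_demo[['houseprice%']], left_on='borough code',
                        right_index=True, how='left')
print(merged)
```

With `how='left'`, an area missing from the price table (here `X3`) is kept with `NaN`, and there is no redundant key column to drop afterwards.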

#### 1.2.4 d: Deprivation Proportion Change

In [11]:
dpr2021_nodpr = dpr2021[dpr2021['Household deprivation (6 categories) Code'] == 1]
dpr2021_all = dpr2021[(dpr2021['Household deprivation (6 categories) Code'] >= 1) & (dpr2021['Household deprivation (6 categories) Code'] <= 5)]
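# total households per borough (named 'total' to avoid shadowing the built-in sum)
total = dpr2021_all.groupby('Upper tier local authorities Code')['Observation'].sum()

# share of households not deprived in any dimension, in %
ratios = (dpr2021_nodpr.groupby('Upper tier local authorities Code')['Observation'].sum() / total) * 100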

# Create a new DataFrame by merging 'dpr2011' and 'ratios'
result_df = pd.merge(dpr2011, ratios, left_on='Area code', right_index=True, how='left')
result_df = result_df.rename(columns={'Observation': 'dpr2021%',
                                      'Household is not deprived in any dimension': 'dpr2011%'})
print(result_df)

     Area code dpr2011%   dpr2021%
261  E09000007     37.9  47.523151
262  E09000001       45  59.768010
263  E09000012     31.5  44.952442
264  E09000013     41.5  51.385949
265  E09000014     35.7  43.316999
266  E09000019     36.7  48.351227
267  E09000020     43.6  52.589558
268  E09000022     39.9  50.016334
269  E09000023     38.4  47.211419
270  E09000025       25  39.304111
271  E09000028     36.2  48.546072
272  E09000030     32.7  46.404898
273  E09000032     50.4  58.362932
274  E09000033     39.2  50.139746
277  E09000002     28.2  37.591675
278  E09000003     43.2  49.573264
279  E09000004     41.5  48.518872
280  E09000005     30.9  39.941996
281  E09000006     48.5  54.642302
282  E09000008       41  47.994717
283  E09000009     37.4  46.034656
284  E09000010     36.1  42.187397
285  E09000011     37.2  48.243784
286  E09000015     41.8  48.728275
287  E09000016     39.7  47.298992
288  E09000017     40.1  45.900319
289  E09000018     37.3  44.100193
290  E09000021     5
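
The 2021 share is a filtered group sum divided by a full group sum. The same pattern on a made-up table (the column names and counts are illustrative, with category 1 treated as "not deprived" as in the cell above):

```python
import pandas as pd

# Made-up long-format table: one row per area and deprivation category
demo = pd.DataFrame({
    'area': ['X1', 'X1', 'X1', 'X2', 'X2'],
    'category': [1, 2, 3, 1, 2],          # 1 = not deprived in any dimension
    'households': [50, 30, 20, 25, 75],
})

total = demo.groupby('area')['households'].sum()
not_deprived = demo[demo['category'] == 1].groupby('area')['households'].sum()
share = 100 * not_deprived / total        # aligned on the 'area' index
print(share)
```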

In [12]:
# Merge 'result_df' with 'gtr'
gtr = pd.merge(gtr, result_df[['Area code', 'dpr2011%', 'dpr2021%']], left_on='borough code', right_on='Area code', how='left')
# Drop the redundant 'Area code' column in 'gtr'
gtr = gtr.drop('Area code', axis=1)

print(gtr)

                   borough borough code  2011moving%  2021moving%  popchurn%  \
0     Barking and Dagenham    E09000002    11.314102    11.096466  -0.217636   
1                   Barnet    E09000003    13.528176    13.449671  -0.078505   
2                   Bexley    E09000004     8.559739     9.228325   0.668586   
3                    Brent    E09000005    14.025929    15.189397   1.163468   
4                  Bromley    E09000006    10.936476    10.685033  -0.251443   
5                   Camden    E09000007    19.379820    21.068651   1.688831   
6           City of London    E09000001    27.085688    32.213164   5.127475   
7                  Croydon    E09000008    11.519518    11.934819   0.415301   
8                   Ealing    E09000009    13.772567    14.096957   0.324390   
9                  Enfield    E09000010    12.009192    10.912710  -1.096482   
10               Greenwich    E09000011    14.079714    13.880816  -0.198898   
11                 Hackney    E09000012 

In [13]:
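# add deprivation change to 'gtr'
# both columns hold the % of households NOT deprived, so 2011 minus 2021
# equals the change in the % of households deprived in at least one dimension
gtr['dpr%'] = gtr['dpr2011%'] - gtr['dpr2021%']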
print(gtr)

                   borough borough code  2011moving%  2021moving%  popchurn%  \
0     Barking and Dagenham    E09000002    11.314102    11.096466  -0.217636   
1                   Barnet    E09000003    13.528176    13.449671  -0.078505   
2                   Bexley    E09000004     8.559739     9.228325   0.668586   
3                    Brent    E09000005    14.025929    15.189397   1.163468   
4                  Bromley    E09000006    10.936476    10.685033  -0.251443   
5                   Camden    E09000007    19.379820    21.068651   1.688831   
6           City of London    E09000001    27.085688    32.213164   5.127475   
7                  Croydon    E09000008    11.519518    11.934819   0.415301   
8                   Ealing    E09000009    13.772567    14.096957   0.324390   
9                  Enfield    E09000010    12.009192    10.912710  -1.096482   
10               Greenwich    E09000011    14.079714    13.880816  -0.198898   
11                 Hackney    E09000012 

#### 1.2.5  Gentrification Score

In [None]:
gtr['score'] = (1/2*gtr['popchurn%'] - 1/4*gtr['ethgr%'] 
                + 1/8*gtr['houseprice%'] -1/8*gtr['dpr%']) + 0.25
print(gtr)

### 1.3 Airbnb data

In [16]:
dest = 'Data/'

# boroughs
boros = gpd.read_file(cache_data('https://github.com/jreades/fsds/blob/master/data/src/Boroughs.gpkg?raw=true', dest+'Boroughs.gpkg') )
boros.crs
print(boros.head(10))

# listings
url = 'http://data.insideairbnb.com/united-kingdom/england/london/2022-12-10/data/listings.csv.gz'
ls = pd.read_csv(cache_data(url, dest),compression='gzip',usecols=['id','latitude','longitude','price','bedrooms','minimum_nights'])
# strip the '$' sign and thousands separators (e.g. '1,000.00') before converting to float
ls['price'] = ls['price'].str.replace(r'[$,]', '', regex=True).astype(float)

print(ls.head(10))

Found Data/Boroughs.gpkg/Boroughs.gpkg locally!
                   NAME   GSS_CODE   HECTARES  NONLD_AREA ONS_INNER  \
0  Kingston upon Thames  E09000021   3726.117       0.000         F   
1               Croydon  E09000008   8649.441       0.000         F   
2               Bromley  E09000006  15013.487       0.000         F   
3              Hounslow  E09000018   5658.541      60.755         F   
4                Ealing  E09000009   5554.428       0.000         F   
5              Havering  E09000016  11445.735     210.763         F   
6            Hillingdon  E09000017  11570.063       0.000         F   
7                Harrow  E09000015   5046.330       0.000         F   
8                 Brent  E09000005   4323.270       0.000         F   
9                Barnet  E09000003   8674.837       0.000         F   

                                            geometry  
0  MULTIPOLYGON (((516401.600 160201.800, 516407....  
1  MULTIPOLYGON (((535009.200 159504.700, 535005....  
2  MU

In [17]:
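# convert ls to a GeoDataFrame
# latitude/longitude are WGS84 (EPSG:4326), so build the points in that CRS and then reproject to EPSG:27700
gls = gpd.GeoDataFrame(ls, 
      geometry=gpd.points_from_xy(ls.longitude, ls.latitude, crs='epsg:4326')).to_crs('epsg:27700')
# strip '$' and thousands separators; astype(str) keeps this safe if price is already numeric
gls['price'] = gls['price'].astype(str).str.replace(r'[$,]', '', regex=True).astype(float)

print(gls.head(10))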
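
Price strings such as `'1,000.00'` and `'79.00'` can be parsed as floats once the currency sign and thousands separators are removed. A minimal sketch on made-up values, assuming the raw column stores prices as strings with a leading `$`:

```python
import pandas as pd

# Made-up price strings in the assumed raw format
raw = pd.Series(['$79.00', '$1,000.00', '$12.50'])

# Remove '$' and ',' in one regex pass, then convert to float
clean = raw.str.replace(r'[$,]', '', regex=True).astype(float)
print(clean.tolist())  # [79.0, 1000.0, 12.5]
```

Converting to `float` rather than `int` avoids the failure on values with a decimal part; rounding to whole pounds can be done afterwards if needed.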

In [11]:
# add gentrification score to borough dataframe
merged_df = pd.merge(boros, gtr[['borough code', 'score']], left_on='GSS_CODE', right_on='borough code', how='left')
merged_df = merged_df.drop('borough code', axis=1)

                   NAME   GSS_CODE   HECTARES  NONLD_AREA ONS_INNER  \
0  Kingston upon Thames  E09000021   3726.117       0.000         F   
1               Croydon  E09000008   8649.441       0.000         F   
2               Bromley  E09000006  15013.487       0.000         F   
3              Hounslow  E09000018   5658.541      60.755         F   
4                Ealing  E09000009   5554.428       0.000         F   
5              Havering  E09000016  11445.735     210.763         F   
6            Hillingdon  E09000017  11570.063       0.000         F   
7                Harrow  E09000015   5046.330       0.000         F   
8                 Brent  E09000005   4323.270       0.000         F   
9                Barnet  E09000003   8674.837       0.000         F   

                                            geometry  
0  MULTIPOLYGON (((516401.600 160201.800, 516407....  
1  MULTIPOLYGON (((535009.200 159504.700, 535005....  
2  MULTIPOLYGON (((540373.600 157530.400, 540361.... 

### Don't mind

In [None]:
# data source
# https://cycling.data.tfl.gov.uk/

# files saved under Data/ActiveTravelCounts
dir = 'Data/ActiveTravelCounts'
# raw files
loc_raw = '0-Count locations.csv'
central_raw = '2022-Central.csv'
inner_raw1 = '2022-Inner-Part1.csv'
inner_raw2 = '2022-Inner-Part2.csv'
outer_raw = '2022-Outer.csv'
# saved file name
location_fn = 'count_locations.geoparquet'
travelcounts_fn = 'travel_counts.parquet'

# geodataframe for points data will be saved as loc_gdf
# dataframe for counts will be saved as counts_df

# load the points data

# check if the geoparquet file already exists
# if not, convert the raw file into geoparquet after reading it in
if not os.path.exists(os.path.join(dir, location_fn)):
    print("Loading locations from csv and saving as geoparquet")
    loc_df = pd.read_csv(os.path.join(dir, loc_raw))
    loc_gdf = gpd.GeoDataFrame(loc_df, geometry = gpd.points_from_xy(loc_df['Easting (UK Grid)'], loc_df['Northing (UK Grid)'], crs = 'EPSG:27700'))
    # convert Functional area for monitoring into category
    loc_gdf['Functional area for monitoring'] = loc_gdf['Functional area for monitoring'].astype('category')
    loc_gdf.to_parquet(os.path.join(dir, location_fn))

# if file already there, load from geoparquet
else:
    print("Loading locations from processed geoparquet")
    loc_gdf = gpd.read_parquet(os.path.join(dir, location_fn))

print("Location load complete. Use loc_gdf")

# load the travel counts data
# check if file already exists
# if not, load from csv and save the chunk before analysis

if not os.path.exists(os.path.join(dir, travelcounts_fn)):
    print("Loading counts from CSV and cleaning data")

    # load files
    cen_df = pd.read_csv(os.path.join(dir, central_raw))
    in1_df = pd.read_csv(os.path.join(dir, inner_raw1))
    in2_df = pd.read_csv(os.path.join(dir, inner_raw2))
    out_df = pd.read_csv(os.path.join(dir, outer_raw))

    # add zone
    cen_df.insert(2, 'Zone', 'Central')
    in1_df.insert(2, 'Zone', 'Inner')
    in2_df.insert(2, 'Zone', 'Inner')
    out_df.insert(2, 'Zone', 'Outer')

    # join data frames
    counts_df = pd.concat([cen_df, in1_df, in2_df, out_df])

    # clean data
    # insert datetime column in datetime format
    counts_df.insert(3, 'datetime', pd.to_datetime(counts_df['Date'] + ' ' + counts_df['Time'], dayfirst = True))
    
    # turn into categorical data
    categorical = ['Zone', 'Weather', 'Day', 'Round', 'Dir', 'Path', 'Mode']
    
    for c in categorical:
        counts_df[c] = counts_df[c].astype('category')

    # save parquet file
    counts_df.to_parquet(os.path.join(dir, travelcounts_fn))

# if file already there, load from parquet
else:
    print("Loading counts from processed parquet")
    counts_df = pd.read_parquet(os.path.join(dir, travelcounts_fn))

print("Counts load complete. Use counts_df")

### Looking at the `loc_gdf` Geodataframe

Check to confirm file loading is done correctly.


In [17]:
loc_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 2297 entries, 0 to 2296
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype   
---  ------                             --------------  -----   
 0   Site ID                            2297 non-null   object  
 1   Which folder?                      2297 non-null   object  
 2   Shared sites                       2297 non-null   object  
 3   Location description               2297 non-null   object  
 4   Borough                            2297 non-null   object  
 5   Functional area for monitoring     2297 non-null   category
 6   Road type                          2297 non-null   object  
 7   Is it on the strategic CIO panel?  2297 non-null   int64   
 8   Easting (UK Grid)                  2297 non-null   float64 
 9   Northing (UK Grid)                 2297 non-null   float64 
 10  Latitude                           2297 non-null   float64 
 11  Longitude                          