# FSDS Group Assessment (Group Safari)

## 1. Data Collection and Cleaning
We will use two different datasets:
1. Airbnb data for London (10 Dec 2022), downloaded from [InsideAirbnb](http://insideairbnb.com/get-the-data)  
2. 2011 and 2021 Census data including:
* .csv
* .csv
* .xls
* .xlsx
* ...


### 1.1 Load data and create the DataFrame and GeoDataFrame

Note that all data in the Data subdirectory is ignored in the `.gitignore` file. <span style="color:red">(***We may need to change the setting of our repo later.***)</span>

The files created by this script are as follows.

|Data|File name|df/gdf name|
|:---|:---|:---|
|Points|`***`|`***`|
|Trips|`***`|`***`|


In [1]:
# Import packages

import os
from urllib.request import urlopen
from requests import get
from urllib.parse import urlparse
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import re



<span style="color:red">For now, I am using local files, so the next code cell is not used yet. I'll adjust it later to download the data directly from a URL.</span>

In [None]:
# Download data from remote location
def cache_data(src:str, dest:str) -> str:
    """Downloads and caches a remote file locally.
    
    The function sits between the 'read' step of a pandas or geopandas
    data frame and downloading the file from a remote location. The idea
    is that it will save it locally so that you don't need to remember to
    do so yourself. Subsequent re-reads of the file will return instantly
    rather than downloading the entire file for a second or n-th time.
    
    Parameters
    ----------
    src : str
        The remote *source* for the file, any valid URL should work.
    dest : str
        The *destination* location to save the downloaded file.
        
    Returns
    -------
    str
        A string representing the local location of the file.
    """
    url = urlparse(src)
    fn  = os.path.split(url.path)[-1]
    dfn = os.path.join(dest,fn)
    
    if not os.path.isfile(dfn):
        print(f"{dfn} not found, downloading!")
        path = os.path.split(dest)
        
        if len(path) >= 1 and path[0] != '':
            os.makedirs(os.path.join(*path), exist_ok=True)
            
        response = get(src)
        # Fail loudly on HTTP errors instead of caching an error page as data
        response.raise_for_status()
        with open(dfn, "wb") as file:
            file.write(response.content)
        print("\tDone downloading...")
    else:
        print(f"Found {dfn} locally!")
        
    return dfn
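
The caching behaviour described in the docstring can be exercised without network access. The block below is a minimal sketch, not the function above: `cache_file` is a hypothetical variant that swaps `requests.get` for the standard-library `urlopen` (so that `file://` URLs work offline) and writes to a temporary file first, so an interrupted download never leaves a partial file that later looks like a valid cache hit.

```python
import os
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def cache_file(src: str, dest: str) -> str:
    """Hypothetical variant of cache_data: fetch only if missing, write atomically."""
    dest_dir = Path(dest)
    target = dest_dir / os.path.basename(urlparse(src).path)

    # Cache hit: return immediately without touching the source
    if target.is_file():
        return str(target)

    dest_dir.mkdir(parents=True, exist_ok=True)

    # urlopen raises on HTTP error codes, so a failed request is never cached
    with urlopen(src) as response:
        payload = response.read()

    # Write to a temporary file in the same directory, then rename into place
    fd, tmp = tempfile.mkstemp(dir=dest_dir)
    with os.fdopen(fd, "wb") as fh:
        fh.write(payload)
    os.replace(tmp, target)

    return str(target)
```

Calling `cache_file(url, 'Data')` twice should read the source only once; the second call returns the stored path. Note that `mkstemp` creates the file with restrictive permissions, which may or may not suit a shared data directory.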

Please save the data files under the directory ***fsds/group/Data***.

In [2]:
os.chdir('/home/jovyan/work/Documents/casa/fsds/group')
padir = 'Data/'

In [None]:
# Read files used for gentrification score

## Population Churn
popch2011 = pd.read_csv(padir+'popchurn 11.csv', skiprows=7, header=0, skip_blank_lines=True, usecols=[
    'mnemonic',
    'Whole household lived at same address one year ago', 
    'Wholly moving household: Total']).dropna(how='all')

popch2021_in_raw = pd.read_csv(padir+'MIG009EW_LTLA_IN.csv',usecols=[
    'Lower tier local authorities code',
    'Household migration LTLA (inflow) (7 categories) code',
    'Count'])
popch2021_out_raw = pd.read_csv(padir+'MIG009EW_LTLA_OUT.csv',usecols=[
    'Migrant LTLA one year ago code', 
    'Household migration LTLA (outflow) (3 categories) code',
    'Count'])
# London borough codes run from E09000001 to E09000033 (nine characters each)
popch2021_in = popch2021_in_raw.loc[popch2021_in_raw['Lower tier local authorities code'].astype(str).str.contains(r'^E090000(?:[0-2][0-9]|3[0-3])$', regex=True)]
popch2021_out = popch2021_out_raw.loc[popch2021_out_raw['Migrant LTLA one year ago code'].astype(str).str.contains(r'^E090000(?:[0-2][0-9]|3[0-3])$', regex=True)]

## Ethnic Group
eg2011 = pd.read_csv(padir+'ethnic group 2011.csv', skiprows=7, header=0, skip_blank_lines=True, usecols=[
    'mnemonic','All categories: Ethnic group','White'])
eg2021 = pd.read_csv(padir+'ethnic group 2021.csv', skiprows=6, header=0, skip_blank_lines=True, usecols=[
    'mnemonic','Total: All usual residents','White'])
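
The borough filter used in this cell is easy to get subtly wrong, because a pattern that is one character short silently drops rows instead of raising an error. The sketch below checks a candidate pattern against known codes before applying it to real data. `LONDON_BOROUGH` and `is_london_borough` are hypothetical names introduced for illustration, and the pattern assumes that London borough codes run from E09000001 to E09000033.

```python
import pandas as pd

# Assumption: London boroughs are coded E09000001 to E09000033
LONDON_BOROUGH = r"E090000(?:0[1-9]|[12][0-9]|3[0-3])"


def is_london_borough(codes: pd.Series) -> pd.Series:
    """Return a boolean mask that is True where the whole code matches."""
    return codes.astype(str).str.fullmatch(LONDON_BOROUGH)


# Boundary cases: first and last boroughs, one past the end,
# an eight-character code, and a non-London code
codes = pd.Series([
    "E09000001", "E09000029", "E09000030", "E09000033",
    "E09000034", "E0900033", "E08000001",
])
mask = is_london_borough(codes)
print(codes[mask].tolist())
```

Using `str.fullmatch` avoids the need for `^` and `$` anchors, and the non-capturing group `(?:...)` avoids the warning pandas emits when a pattern passed to `str.contains` has capture groups.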

In [None]:
## Housing Price
price_med_raw = pd.read_excel(padir+'house price _median by MSOA.xls',sheet_name='1a', skiprows=4,header=0,
                              usecols=['Local authority code','Year ending Dec 2001','Year ending Dec 2021'])
price_med = price_med_raw.loc[price_med_raw['Local authority code'].astype(str).str.contains(r'^E090000(?:[0-2][0-9]|3[0-3])$', regex=True)]
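# pd.read_excel has no `userows` argument, so the two rows are selected by label after reading.
# Assumes the first column of the sheet holds the period labels 'Dec-11' and 'Dec-21' as text.
price_aver_raw = (pd.read_excel(padir+'house price index_aver.xlsx', sheet_name='Average price', header=0, index_col=0)
                  .loc[['Dec-11','Dec-21']]
                  .transpose()
                  .rename_axis('Area code')
                  .reset_index())  # turn 'Area code' into a regular column for the filter below
price_aver = price_aver_raw.loc[price_aver_raw['Area code'].astype(str).str.contains(r'^E090000(?:[0-2][0-9]|3[0-3])$', regex=True)]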

In [None]:
## 

In [9]:
os.getcwd()
# Preview each DataFrame; a notebook cell only renders its last expression (eg2021 here)
popch2011.sample(3,random_state=7)
popch2021_in.sample(3,random_state=7)
popch2021_out.sample(3,random_state=7)
eg2011.sample(3,random_state=7)
eg2021.sample(3,random_state=7)

| |mnemonic|Total: All usual residents|White|
|:---|:---|:---|:---|
|17|E09000018|288181.0|127083.0|
|37|by small amounts. Small counts at the lowest g...| | |
|34|published<FIELD=ENDLINE>| | |
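
The preview above shows footnote text from the bottom of the CSV ending up in the `mnemonic` column. One possible cleanup, sketched below, keeps only rows whose code looks like a geography code. `drop_footer_rows` is a hypothetical helper, the toy frame mirrors the three rows in the preview, and the pattern assumes codes consist of one capital letter followed by eight digits.

```python
import pandas as pd

# Toy frame mirroring the preview: one data row plus two footnote fragments
raw = pd.DataFrame({
    "mnemonic": [
        "E09000018",
        "by small amounts. Small counts at the lowest g...",
        "published",
    ],
    "Total: All usual residents": [288181.0, None, None],
    "White": [127083.0, None, None],
})


def drop_footer_rows(df: pd.DataFrame, code_col: str = "mnemonic") -> pd.DataFrame:
    """Keep rows whose code column is one capital letter followed by eight digits."""
    is_code = df[code_col].astype(str).str.fullmatch(r"[A-Z]\d{8}")
    return df.loc[is_code].reset_index(drop=True)


clean = drop_footer_rows(raw)
print(clean)
```

Filtering on the code column is more targeted than `dropna(how='all')`, which would keep the footnote rows here because their `mnemonic` cell is not empty.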


In [16]:
# data source
# https://cycling.data.tfl.gov.uk/

# files saved under Data/ActiveTravelCounts
dir = 'Data/ActiveTravelCounts'
# raw files
loc_raw = '0-Count locations.csv'
central_raw = '2022-Central.csv'
inner_raw1 = '2022-Inner-Part1.csv'
inner_raw2 = '2022-Inner-Part2.csv'
outer_raw = '2022-Outer.csv'
# saved file name
location_fn = 'count_locations.geoparquet'
travelcounts_fn = 'travel_counts.parquet'

# geodataframe for points data will be saved as loc_gdf
# dataframe for counts will be saved as counts_df

# load the points data

# check if the geoparquet file already exists
# if not, convert the raw file into geoparquet after reading it in
if not os.path.exists(os.path.join(dir, location_fn)):
    print("Loading locations from csv and saving as geoparquet")
    loc_df = pd.read_csv(os.path.join(dir, loc_raw))
    loc_gdf = gpd.GeoDataFrame(loc_df, geometry = gpd.points_from_xy(loc_df['Easting (UK Grid)'], loc_df['Northing (UK Grid)'], crs = 'EPSG:27700'))
    # convert Functional area for monitoring into category
    loc_gdf['Functional area for monitoring'] = loc_gdf['Functional area for monitoring'].astype('category')
    loc_gdf.to_parquet(os.path.join(dir, location_fn))

# if file already there, load from geoparquet
else:
    print("Loading locations from processed geoparquet")
    loc_gdf = gpd.read_parquet(os.path.join(dir, location_fn))

print("Location load complete. Use loc_gdf")

# load the travel counts data
# check if file already exists
# if not, load from csv and save the chunk before analysis

if not os.path.exists(os.path.join(dir, travelcounts_fn)):
    print("Loading counts from CSV and cleaning data")

    # load files
    cen_df = pd.read_csv(os.path.join(dir, central_raw))
    in1_df = pd.read_csv(os.path.join(dir, inner_raw1))
    in2_df = pd.read_csv(os.path.join(dir, inner_raw2))
    out_df = pd.read_csv(os.path.join(dir, outer_raw))

    # add zone
    cen_df.insert(2, 'Zone', 'Central')
    in1_df.insert(2, 'Zone', 'Inner')
    in2_df.insert(2, 'Zone', 'Inner')
    out_df.insert(2, 'Zone', 'Outer')

    # join data frames
    counts_df = pd.concat([cen_df, in1_df, in2_df, out_df])

    # clean data
    # insert datetime column in datetime format
    counts_df.insert(3, 'datetime', pd.to_datetime(counts_df['Date'] + ' ' + counts_df['Time'], dayfirst = True))
    
    # turn into categorical data
    categorical = ['Zone', 'Weather', 'Day', 'Round', 'Dir', 'Path', 'Mode']
    
    for c in categorical:
        counts_df[c] = counts_df[c].astype('category')

    # save parquet file
    counts_df.to_parquet(os.path.join(dir, travelcounts_fn))

# if file already there, load from parquet
else:
    print("Loading counts from processed parquet")
    counts_df = pd.read_parquet(os.path.join(dir, travelcounts_fn))

print("Counts load complete. Use counts_df")

Loading locations from processed geoparquet
Location load complete. Use loc_gdf
Loading counts from processed parquet
Counts load complete. Use counts_df
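
Both halves of the cell above follow the same pattern: read the processed file if it exists, otherwise build the frame from the raw CSVs and save it. A minimal sketch of that pattern as a reusable function is shown below. `load_or_build` and its parameters are hypothetical names, and the parquet defaults assume a parquet engine such as `pyarrow` is installed.

```python
import os
from typing import Callable

import pandas as pd


def load_or_build(
    path: str,
    build: Callable[[], pd.DataFrame],
    read: Callable[[str], pd.DataFrame] = pd.read_parquet,
    write: Callable[[pd.DataFrame, str], None] = lambda df, p: df.to_parquet(p),
) -> pd.DataFrame:
    """Return the cached frame at `path`, building and saving it on first use."""
    if os.path.exists(path):
        return read(path)

    # Cache miss: run the expensive cleaning step once
    df = build()

    # Make sure the target directory exists before writing
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    write(df, path)
    return df
```

For the points data, `read` and `write` could be swapped for `gpd.read_parquet` and a call to `GeoDataFrame.to_parquet`, with `build` wrapping the CSV loading and cleaning steps. Deleting the processed file forces a rebuild.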

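The cleaning step joins the `Date` and `Time` strings and parses them with `dayfirst = True`. The sketch below shows why that matters, using made-up rows and assuming the raw files store dates as `dd/mm/yyyy`. Without a day-first hint, pandas reads an ambiguous string such as `03/04/2022` month first.

```python
import pandas as pd

# Made-up rows for illustration; the real files have more columns
toy = pd.DataFrame({
    "Date": ["03/04/2022", "13/04/2022"],
    "Time": ["06:00:00", "06:15:00"],
    "Mode": ["Pedestrians", "Pedestrians"],
})

combined = toy["Date"] + " " + toy["Time"]

# An explicit format removes any ambiguity about day and month order
stamps = pd.to_datetime(combined, format="%d/%m/%Y %H:%M:%S")

# Same intent as the notebook cell: ask pandas to treat the first field as the day
stamps_dayfirst = pd.to_datetime(combined, dayfirst=True)

# Default parsing of a single ambiguous string is month-first
ambiguous = pd.to_datetime("03/04/2022 06:00:00")

# Repeated labels are stored more compactly as a categorical column
toy["Mode"] = toy["Mode"].astype("category")

print(stamps.dt.strftime("%d %b %Y").tolist())
print(ambiguous.strftime("%d %b %Y"))
```

Passing an explicit `format` is stricter than `dayfirst=True`: a row in an unexpected layout raises an error instead of being parsed by guesswork.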

### Looking at the `loc_gdf` GeoDataFrame

Confirm that the file loaded correctly.


In [17]:
loc_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 2297 entries, 0 to 2296
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype   
---  ------                             --------------  -----   
 0   Site ID                            2297 non-null   object  
 1   Which folder?                      2297 non-null   object  
 2   Shared sites                       2297 non-null   object  
 3   Location description               2297 non-null   object  
 4   Borough                            2297 non-null   object  
 5   Functional area for monitoring     2297 non-null   category
 6   Road type                          2297 non-null   object  
 7   Is it on the strategic CIO panel?  2297 non-null   int64   
 8   Easting (UK Grid)                  2297 non-null   float64 
 9   Northing (UK Grid)                 2297 non-null   float64 
 10  Latitude                           2297 non-null   float64 
 11  Longitude                          