# London AirBnB Data Processing

## Purpose
This notebook performs the data processing required to analyse the impact of the global pandemic and the London lockdown on the demand for and supply of AirBnB listings.

## Methodology
The data was downloaded from Inside Airbnb. The directory structure for the data files is as follows:
- 1-AirBnB-Analysis
    - archive - archive of old versions of data files and analysis, included in .gitignore
    - nbs - folder containing processing and analysis notebooks
    - tmp - temporary directory containing processed data files, included in .gitignore  
    - data
        - AirBnB-Data-040920 - parent directory of data downloaded
            - 2019_07_10 - subdirectories of data downloaded, named based on when data scraped
            - 2019_08_09, etc. 
            
The following processing steps are performed: 
1. Zipped data files are renamed and unzipped
2. Calendar and listings data files are processed for subsequent analysis 
3. Processed data files are saved in feather format for faster subsequent ingestion

# Set up

## Library import

Import external libraries required for data processing.

In [5]:
# Data manipulation
import pandas as pd
import numpy as np
from datetime import date, time, timedelta
import datetime as dt
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import re 

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Hide warnings 
import warnings 
warnings.simplefilter('ignore')

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2

## Local library import

Find local library directory and import utility functions.

In [2]:
import sys
import os

# Find local library directory - which should be located within a parent directory
current_path = os.getcwd()
os.chdir("..")
current_path = os.getcwd() 
libfiledir = "library"

current_path
direxists = False
while direxists==False:
    test_dir = os.path.join(current_path,libfiledir)
    last_current_path = current_path
    direxists = os.path.isdir(test_dir)
    current_path = os.path.dirname(current_path)
    if(current_path == last_current_path):
        break;
        
if direxists == True:
    sys.path.append(test_dir)
    

# Import utility functions from data frame common
from dataframe_common import *

# Data ingestion / import

## Review of source data

The source data is held in the data directory. The download covers 12 monthly scrapes from July 2019 to June 2020, plus a later scrape dated 24 August 2020, and also includes a folder containing neighbourhood geographic data. Each folder contains the files scraped on that date.

In [4]:
data_dir = 'data/AirBnB-Data-040920'
!ls {data_dir}

2019_07_10         2019_11_05         2020_03_15         2020_08_24
2019_08_09         2019_12_09         2020_04_14         Neighbourhood_data
2019_09_14         2020_01_09         2020_05_10
2019_10_15         2020_02_16         2020_06_11


In [6]:
!ls {data_dir}/'2019_07_10'

calendar.csv.gz  listings.csv     listings_det.csv reviews.csv.gz
calendar_det.csv listings.csv.gz  reviews.csv      reviews_det.csv


Each downloaded folder contains 5 files; the `_det.csv` files in the listing above are the unzipped copies created in the next step.
- listings.csv - summary information and metrics for listings in London
- reviews.csv - summary review data and listing ID for listings in London
- calendar.csv.gz - detailed calendar data for listings in London
- reviews.csv.gz - detailed review data for listings in London
- listings.csv.gz - detailed listings data for London

## Unzip the data

The first step in unzipping the required files is to create a list of the sub-directories containing the data.

In [7]:
sub_dirs = !ls {data_dir}
sub_dirs.remove('Neighbourhood_data')
sub_dirs

['2019_07_10',
 '2019_08_09',
 '2019_09_14',
 '2019_10_15',
 '2019_11_05',
 '2019_12_09',
 '2020_01_09',
 '2020_02_16',
 '2020_03_15',
 '2020_04_14',
 '2020_05_10',
 '2020_06_11',
 '2020_08_24']
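Because each sub-directory is named after its scrape date, the names can be parsed into dates to check the spacing between scrapes. The following is a minimal sketch rather than part of the original processing; `example_periods`, `parse_period` and `gaps` are hypothetical names, and only a subset of the directories is used.

```python
from datetime import datetime

# Hypothetical subset of the sub-directory names listed above
example_periods = ['2019_07_10', '2019_08_09', '2020_06_11', '2020_08_24']

def parse_period(period):
    """Convert a 'YYYY_MM_DD' directory name into a date object."""
    return datetime.strptime(period, '%Y_%m_%d').date()

scrape_dates = [parse_period(p) for p in example_periods]

# Number of days between consecutive entries in the list
gaps = [(later - earlier).days
        for earlier, later in zip(scrape_dates, scrape_dates[1:])]
print(scrape_dates[0], gaps)
```

A check like this makes it easy to see that the final scrape does not follow the monthly cadence of the earlier ones.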

The function below iterates through the sub-directories and the zipped files. For each file it first makes a renamed copy (so that the unzipped output is distinguishable from the non-zipped summary file of the same name) and then unzips the copy in the same location. If the unzipped file is already present in the sub-directory (i.e. the function has already been run), that file is skipped.

In [8]:
def unzip_data(periods):
    ''' Takes in list of subdirectories data and copies, renames and unzips .gz files'''    
    file_names = ['calendar','listings','reviews']
    for period in periods:
        sub_dir_files = ! ls {data_dir}/{period}
        print(f'Unzipping files in subdirectory {period}')
        for file_name in file_names:
            if file_name + '_det.csv' in sub_dir_files:
                print(f'   Already unzipped {file_name}')
                pass
            else:
                print(f'   Copying and renaming {file_name}')
                cur_file_path = os.path.join(data_dir,period,file_name + '.csv.gz')
                new_file_path = os.path.join(data_dir,period,file_name + '_det.csv.gz')
                ! cp {cur_file_path} {new_file_path}    
                ! gunzip {new_file_path}
                print(f'   Unzipped copied file {file_name}') 
    print('Looped through all sub-directories')
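The function above relies on IPython shell escapes (`! ls`, `! cp`, `! gunzip`), so it only runs inside a notebook on a system that provides those commands. As a hedged alternative, the same copy-and-unzip step could be sketched with the standard library alone. `gunzip_copy` and `unzip_data_portable` are hypothetical names, not functions used elsewhere in this notebook.

```python
import gzip
import os
import shutil

def gunzip_copy(src_path, dest_path):
    """Decompress src_path (.gz) into dest_path, leaving the archive in place.

    Returns True if a file was written, False if dest_path already existed.
    """
    if os.path.exists(dest_path):
        return False
    with gzip.open(src_path, 'rb') as f_in, open(dest_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)  # streams the data, so large files are fine
    return True

def unzip_data_portable(data_dir, periods,
                        file_names=('calendar', 'listings', 'reviews')):
    """Hypothetical pure-Python counterpart of unzip_data."""
    for period in periods:
        print(f'Unzipping files in subdirectory {period}')
        for file_name in file_names:
            src = os.path.join(data_dir, period, file_name + '.csv.gz')
            dest = os.path.join(data_dir, period, file_name + '_det.csv')
            if not os.path.exists(src):
                print(f'   Missing archive for {file_name}')
                continue
            written = gunzip_copy(src, dest)
            status = 'Unzipped' if written else 'Already unzipped'
            print(f'   {status} {file_name}')
    print('Looped through all sub-directories')
```

Writing straight to the `_det.csv` name avoids the intermediate renamed copy, at the cost of differing slightly from the steps the original function performs.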

In [9]:
unzip_data(sub_dirs)

Unzipping files in subdirectory 2019_07_10
   Already unzipped calendar
   Already unzipped listings
   Already unzipped reviews
Unzipping files in subdirectory 2019_08_09
   Already unzipped calendar
   Already unzipped listings
   Already unzipped reviews
Unzipping files in subdirectory 2019_09_14
   Already unzipped calendar
   Already unzipped listings
   Already unzipped reviews
Unzipping files in subdirectory 2019_10_15
   Already unzipped calendar
   Already unzipped listings
   Already unzipped reviews
Unzipping files in subdirectory 2019_11_05
   Already unzipped calendar
   Already unzipped listings
   Already unzipped reviews
Unzipping files in subdirectory 2019_12_09
   Already unzipped calendar
   Already unzipped listings
   Already unzipped reviews
Unzipping files in subdirectory 2020_01_09
   Already unzipped calendar
   Already unzipped listings
   Already unzipped reviews
Unzipping files in subdirectory 2020_02_16
   Already unzipped calendar
   Already unzipped listi

## Ingestion and processing of data

Now unzipped, the data can be ingested and processed for analysis.

### Detailed calendar data (unzipped)

The calendar data contains daily price and stay information for each listing. The price fields are formatted with special characters and so these need to be removed. The date field is currently at a daily level, and will be used to create date_month and date_year features.   

In [11]:
! ls {data_dir}

2019_07_10         2019_11_05         2020_03_15         2020_08_24
2019_08_09         2019_12_09         2020_04_14         Neighbourhood_data
2019_09_14         2020_01_09         2020_05_10
2019_10_15         2020_02_16         2020_06_11


In [12]:
file_name = 'calendar_det.csv'
! tail {os.path.join(data_dir,'2020_06_11','calendar_det.csv')}

43737744,2021-06-03,t,$83.00,$83.00,730,1125
43737744,2021-06-04,t,$83.00,$83.00,730,1125
43737744,2021-06-05,t,$83.00,$83.00,730,1125
43737744,2021-06-06,t,$83.00,$83.00,730,1125
43737744,2021-06-07,t,$83.00,$83.00,730,1125
43737744,2021-06-08,t,$83.00,$83.00,730,1125
43737744,2021-06-09,t,$83.00,$83.00,730,1125
43737744,2021-06-10,t,$83.00,$83.00,730,1125
43737744,2021-06-11,t,$83.00,$83.00,730,1125
43737744,2021-06-12,t,$83.00,$83.00,730,1125


A function is created to clean the calendar file.

In [13]:
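def clean_calendar(calendar):
    ''' Takes in an unprocessed calendar dataframe, cleans the price column and adds date flags'''
    # Month and year features derived from the daily date (column must already be datetime)
    calendar['date_month'] = calendar['date'].dt.month
    calendar['date_year'] = calendar['date'].dt.year
    # Strip the currency symbol and thousands separator as literal characters, then cast to float
    calendar['price']  = calendar['price'].str.replace("$","",regex=False)
    calendar['price']  = calendar['price'].str.replace(",","",regex=False)
    calendar['price'] = calendar['price'].astype(float)
    # Encode availability as 1 ("t") or 0 (anything else)
    calendar['available'] = np.where(calendar['available']=="t",1,0)
    return calendar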
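To illustrate the cleaning steps on a small scale, the sketch below applies the same transformations to two made-up rows laid out like `calendar_det.csv`. The rows and the helper `strip_currency` are assumptions for illustration; the helper removes both characters with a single regular expression instead of two literal replacements.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the same layout as calendar_det.csv
sample = pd.DataFrame({
    'listing_id': [43737744, 43737744],
    'date': pd.to_datetime(['2021-06-11', '2021-06-12']),
    'available': ['t', 'f'],
    'price': ['$83.00', '$1,250.00'],
})

def strip_currency(series):
    """Remove '$' and ',' from a string column and cast it to float."""
    # Inside a character class, '$' is a literal rather than an end-of-string anchor
    return series.str.replace(r'[$,]', '', regex=True).astype(float)

sample['price'] = strip_currency(sample['price'])
sample['available'] = np.where(sample['available'] == 't', 1, 0)
sample['date_month'] = sample['date'].dt.month
sample['date_year'] = sample['date'].dt.year
print(sample)
```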

Processed data files are stored within a temporary directory. The files currently in that directory are held in a list so that files aren't processed unnecessarily.

In [19]:
tmp_dir = 'tmp'
processed_data = 'processed_data'
# os.mkdir(os.path.join(tmp_dir,processed_data))
tmp_dir_files = ! ls {tmp_dir}/{processed_data}
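The cell above depends on the output directory already existing (the `os.mkdir` call is commented out) and on the `ls` shell command. A minimal sketch of the same bookkeeping using only the standard library is shown here; `list_processed_files` and `feather_name` are hypothetical helpers, and the demonstration runs against a throwaway directory rather than the real `tmp` folder.

```python
import os
import tempfile

def list_processed_files(tmp_dir='tmp', processed_data='processed_data'):
    """Create the output directory if needed and return the set of file names in it."""
    out_dir = os.path.join(tmp_dir, processed_data)
    os.makedirs(out_dir, exist_ok=True)  # no error if the directory already exists
    return set(os.listdir(out_dir))

def feather_name(period, file_name):
    """Build an output name such as '2019_07_10calendar_det.feather'."""
    return period + file_name.replace('.csv', '.feather')

# Demonstration in a temporary directory so nothing real is touched
with tempfile.TemporaryDirectory() as demo_root:
    done = list_processed_files(tmp_dir=demo_root)
    target = feather_name('2019_07_10', 'calendar_det.csv')
    print(target, target in done)
```

Returning a set keeps the "already processed" membership test cheap, and `exist_ok=True` makes the cell safe to re-run.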

In [22]:
sub_dirs

['2019_07_10',
 '2019_08_09',
 '2019_09_14',
 '2019_10_15',
 '2019_11_05',
 '2019_12_09',
 '2020_01_09',
 '2020_02_16',
 '2020_03_15',
 '2020_04_14',
 '2020_05_10',
 '2020_06_11',
 '2020_08_24']

In [33]:
period = '2019_07_10'
file_name = 'calendar_det.csv'
updated_file_name = period + file_name.replace(".csv", ".feather")
updated_file_name

'2019_07_10calendar_det.feather'

A function is created to read and clean the calendar dataframe and save it to the temporary directory. The dataframe is then deleted to reduce the memory required.

In [48]:
def read_and_clean_calendars(periods):
    ''' Takes in a list of sub-directories and iteratively cleans and saves the calendar file in them '''
    for period in periods:
        print(f'Processing calendar dataframe from {period}...')
        updated_file_name = period + file_name.replace(".csv", ".feather")
        if updated_file_name in tmp_dir_files:
            print(f'    data already processed for this period.')
            pass
        else:
            try:
                df  = pd.read_csv(os.path.join(data_dir,period,file_name))
                print("   1. df read in")
                df['date'] = pd.to_datetime(df['date'])
                df  = clean_calendar(df)
                print("   2. df cleaned")
                df.to_feather(os.path.join(tmp_dir,processed_data,updated_file_name))
                print("   3. df renamed and outputted to feather")
                del df
                print("   4. df deleted from memory")            
            except MemoryError:
                print('Memory error')
            else:
                print(f'Processed calendar dataframe from {period}')
    print('Function complete')
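The function above guards against `MemoryError` and deletes each dataframe once it has been saved. If memory were still a constraint, one option (not used in this notebook) would be to read the CSV in chunks and store narrower dtypes. The sketch below assumes a small in-memory CSV as a stand-in for a real calendar file; `read_calendar_in_chunks` and the chosen dtypes are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large calendar_det.csv
csv_text = (
    'listing_id,date,available,price\n'
    '1,2020-06-11,t,$83.00\n'
    '1,2020-06-12,f,$83.00\n'
    '2,2020-06-11,t,"$1,250.00"\n'
)

def read_calendar_in_chunks(source, chunksize=2):
    """Read a calendar CSV chunk by chunk, cleaning each chunk before concatenating."""
    cleaned = []
    for chunk in pd.read_csv(source, parse_dates=['date'], chunksize=chunksize):
        # Clean while the chunk is small, and keep compact numeric dtypes
        chunk['price'] = (chunk['price']
                          .str.replace(r'[$,]', '', regex=True)
                          .astype('float32'))
        chunk['available'] = (chunk['available'] == 't').astype('int8')
        cleaned.append(chunk)
    return pd.concat(cleaned, ignore_index=True)

calendar = read_calendar_in_chunks(io.StringIO(csv_text))
print(calendar.dtypes)
```

With this approach only one chunk of raw strings is held at a time, although the final concatenated frame still has to fit in memory.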

The function is executed to work through all sub-directories.

In [38]:
%%time
read_and_clean_calendars(sub_dirs)

Processing calendar dataframe from 2019_07_10...
    data already processed for this period.
Processing calendar dataframe from 2019_08_09...
    data already processed for this period.
Processing calendar dataframe from 2019_09_14...
    data already processed for this period.
Processing calendar dataframe from 2019_10_15...
    data already processed for this period.
Processing calendar dataframe from 2019_11_05...
    data already processed for this period.
Processing calendar dataframe from 2019_12_09...
    data already processed for this period.
Processing calendar dataframe from 2020_01_09...
    data already processed for this period.
Processing calendar dataframe from 2020_02_16...
    data already processed for this period.
Processing calendar dataframe from 2020_03_15...
    data already processed for this period.
Processing calendar dataframe from 2020_04_14...
    data already processed for this period.
Processing calendar dataframe from 2020_05_10...
    data already proc

The temporary directory now contains a sub-directory with the processed data files.

In [39]:
! ls {tmp_dir}/{processed_data}

2019_07_10calendar_det.feather 2020_02_16listings.feather
2019_07_10listings.feather     2020_03_15calendar_det.feather
2019_08_09calendar_det.feather 2020_03_15listings.feather
2019_08_09listings.feather     2020_04_14calendar_det.feather
2019_09_14calendar_det.feather 2020_04_14listings.feather
2019_09_14listings.feather     2020_05_10calendar_det.feather
2019_10_15calendar_det.feather 2020_05_10listings.feather
2019_10_15listings.feather     2020_06_11calendar_det.feather
2019_11_05calendar_det.feather 2020_06_11listings.feather
2019_11_05listings.feather     2020_08_24calendar_det.feather
2019_12_09calendar_det.feather 2020_08_24listings.feather
2019_12_09listings.feather     all_listings.csv
2020_01_09calendar_det.feather all_listings.feather
2020_01_09listings.feather     all_listings_price.feather
2020_02_16calendar_det.feather list_price_avail_grped.feather


The processed file contains the added date feature columns and the cleaned price column. Note that adjusted_price is not cleaned and retains its original string formatting.

In [58]:
example_df = pd.read_feather(os.path.join(tmp_dir,processed_data,'2020_08_24calendar_det.feather'))
example_df.head(5)

|   | listing_id | date | available | price | adjusted_price | minimum_nights | maximum_nights | date_month | date_year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 208952 | 2020-08-26 | 0 | 263.0 | $263.00 | 2.0 | 30.0 | 8 | 2020 |
| 1 | 81951 | 2020-08-27 | 0 | 190.0 | $190.00 | 5.0 | 10.0 | 8 | 2020 |
| 2 | 81951 | 2020-08-28 | 1 | 190.0 | $190.00 | 5.0 | 10.0 | 8 | 2020 |
| 3 | 81951 | 2020-08-29 | 1 | 190.0 | $190.00 | 5.0 | 10.0 | 8 | 2020 |
| 4 | 81951 | 2020-08-30 | 1 | 190.0 | $190.00 | 5.0 | 10.0 | 8 | 2020 |


### Listings data

The listings data contains summary information and metrics for listings in London. Only the ID, room type and calculated host listings count are required from these files, for use in segmentation analysis.

In [40]:
file_name = 'listings.csv'
! head -n 5 {os.path.join(data_dir,'2020_06_11',file_name)}

id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
11551,Arty and Bright London Apartment in Zone 2,43039,Adriano,,Lambeth,51.46225,-0.11732,Entire home/apt,88,2,192,2020-03-26,1.54,2,347
13913,Holiday London DB Room Let-on going,54730,Alina,,Islington,51.56802,-0.11121,Private room,65,1,21,2020-02-22,0.18,3,347
15400,Bright Chelsea  Apartment. Chelsea!,60302,Philippa,,Kensington and Chelsea,51.48796,-0.16898,Entire home/apt,100,10,89,2020-03-16,0.70,1,288
17402,Superb 3-Bed/2 Bath & Wifi: Trendy W1,67564,Liz,,Westminster,51.52195,-0.14094,Entire home/apt,300,3,42,2019-11-02,0.37,15,326


A function is created to read and clean the listings dataframe and save it to the temporary directory. The dataframe is then deleted to reduce the memory required.

In [46]:
def read_in_listing_info(periods):
    ''' Reads and cleans the listings dataframe and saves it back to the temporary directory'''
    for period in periods:
        print(f'Processing listing dataframe from {period}...')
        updated_file_name = period + file_name.replace(".csv",".feather")
        if updated_file_name in tmp_dir_files:
            print(f'    data already processed for this period.')
            pass
        else:
            try:
                df  = pd.read_csv(os.path.join(data_dir,period,file_name)
                                 ,usecols=['id','room_type','calculated_host_listings_count'])
                print("   1. df read in")
                df.rename(columns={'id':'listing_id'},inplace=True)
                print("   2. renamed columns")
                df.to_feather(os.path.join(tmp_dir,processed_data,updated_file_name))
                print("   3. df renamed and outputted to feather")
                del df
                print("   4. df deleted from memory")            
            except MemoryError:
                print('Memory error')
            else:
                print(f'Processed listing dataframe from {period}')
    print('Function complete')
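The key step in the function above is `usecols`, which restricts parsing to the three required columns so the remaining fields are not loaded into the dataframe. The sketch below shows the effect on two rows taken from the `listings.csv` preview earlier in this section, read from an in-memory string instead of a file; the variable names are illustrative.

```python
import io
import pandas as pd

# Header and two rows copied from the listings.csv preview above
csv_text = (
    'id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,'
    'room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,'
    'calculated_host_listings_count,availability_365\n'
    '11551,Arty and Bright London Apartment in Zone 2,43039,Adriano,,Lambeth,'
    '51.46225,-0.11732,Entire home/apt,88,2,192,2020-03-26,1.54,2,347\n'
    '13913,Holiday London DB Room Let-on going,54730,Alina,,Islington,'
    '51.56802,-0.11121,Private room,65,1,21,2020-02-22,0.18,3,347\n'
)

keep = ['id', 'room_type', 'calculated_host_listings_count']

# Only the columns in `keep` are parsed; 'id' is renamed to match the calendar data
listings = (pd.read_csv(io.StringIO(csv_text), usecols=keep)
              .rename(columns={'id': 'listing_id'}))
print(listings)
```

Renaming `id` to `listing_id` gives the listings and calendar files a shared key name, which simplifies any later join between them.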

In [47]:
read_in_listing_info(sub_dirs)

Processing listing dataframe from 2019_07_10...
    data already processed for this period.
Processing listing dataframe from 2019_08_09...
    data already processed for this period.
Processing listing dataframe from 2019_09_14...
    data already processed for this period.
Processing listing dataframe from 2019_10_15...
    data already processed for this period.
Processing listing dataframe from 2019_11_05...
    data already processed for this period.
Processing listing dataframe from 2019_12_09...
    data already processed for this period.
Processing listing dataframe from 2020_01_09...
    data already processed for this period.
Processing listing dataframe from 2020_02_16...
    data already processed for this period.
Processing listing dataframe from 2020_03_15...
    data already processed for this period.
Processing listing dataframe from 2020_04_14...
    data already processed for this period.
Processing listing dataframe from 2020_05_10...
    data already processed for t

The temporary directory now contains a sub-directory with the processed data files.

In [43]:
! ls {tmp_dir}/{processed_data}

2019_07_10calendar_det.feather 2020_02_16listings.feather
2019_07_10listings.feather     2020_03_15calendar_det.feather
2019_08_09calendar_det.feather 2020_03_15listings.feather
2019_08_09listings.feather     2020_04_14calendar_det.feather
2019_09_14calendar_det.feather 2020_04_14listings.feather
2019_09_14listings.feather     2020_05_10calendar_det.feather
2019_10_15calendar_det.feather 2020_05_10listings.feather
2019_10_15listings.feather     2020_06_11calendar_det.feather
2019_11_05calendar_det.feather 2020_06_11listings.feather
2019_11_05listings.feather     2020_08_24calendar_det.feather
2019_12_09calendar_det.feather 2020_08_24listings.feather
2019_12_09listings.feather     all_listings.csv
2020_01_09calendar_det.feather all_listings.feather
2020_01_09listings.feather     all_listings_price.feather
2020_02_16calendar_det.feather list_price_avail_grped.feather


In [59]:
example_df = pd.read_feather(os.path.join(tmp_dir,processed_data,'2020_08_24listings.feather'))
example_df.head(5)

|   | listing_id | room_type | calculated_host_listings_count |
|---|---|---|---|
| 0 | 11551 | Entire home/apt | 2 |
| 1 | 13913 | Private room | 3 |
| 2 | 15400 | Entire home/apt | 1 |
| 3 | 17402 | Entire home/apt | 14 |
| 4 | 17506 | Private room | 2 |
