# Analysis introduction 

## Purpose
The purpose of this analysis is to process the data for the COVID AirBnB analysis.

## Methodology
The data was obtained from Inside Airbnb [and COVID data, TBC].
1. Zipped data files are renamed and unzipped
2. Data files are processed for subsequent analysis 
3. Processed data files are written out in feather format for faster subsequent ingestion.

## WIP - improvements

## Results

## Suggested next steps

# Set up

## Library import

In [1]:
# Data manipulation
import pandas as pd
import numpy as np
from datetime import date, time, timedelta
import datetime as dt
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import re 

# Data manipulation variables
also = ' and '

# Classification accuracy 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, removed in scikit-learn 1.2
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Visualizations
import plotly
import plotly.graph_objs as go
import plotly.offline as ply
plotly.offline.init_notebook_mode(connected=True)
import plotly.express as px
import seaborn as sns

import cufflinks as cf
cf.go_offline(connected=True)
cf.set_config_file(theme='white')

%matplotlib inline

# Visualisation options
sns.set_style("whitegrid")

# Date parsers
# pd.datetime was removed in pandas 2.0, so the standard-library datetime (imported as dt) is used
dparser = lambda date: dt.datetime.strptime(date, '%d/%m/%Y %H:%M')
dparser2 = lambda date: pd.to_datetime(date, format='%d/%m/%Y %H:%M', errors='coerce')
dparser3 = lambda date: pd.to_datetime(date, format='%d/%m/%Y %H:%M:%S', errors='coerce')
dshortparser = lambda date: dt.datetime.strptime(date, '%d/%b/%Y')
dshortparser2 = lambda date: dt.datetime.strptime(date, '%d/%m/%Y')
dshortparser3 = lambda date: pd.to_datetime(date, format='%Y%m%d', errors='coerce')

# Hide warnings 
import warnings 
warnings.simplefilter('ignore')

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2


# Another try
%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Local library import

In [4]:
import sys
import os

# Find the local library directory, which should be located in the parent
# of the current directory or in one of that parent's ancestors
os.chdir("..")
current_path = os.getcwd()
libfiledir = "library"

direxists = False
while not direxists:
    test_dir = os.path.join(current_path, libfiledir)
    direxists = os.path.isdir(test_dir)
    parent_path = os.path.dirname(current_path)
    if parent_path == current_path:
        # Reached the filesystem root, so stop searching
        break
    current_path = parent_path

if direxists:
    sys.path.append(test_dir)

from dataframe_common import *

## Change to project directory

In [6]:
os.getcwd()

'/Users/scottbarnes/Google Drive/1 Projects/Data Scientist Nanodegree'

In [9]:
os.chdir(os.path.join(os.getcwd() , "Udacity-Data-Scientist-Nanodegree/1-AirBnB-Analysis"))
os.getcwd()

'/Users/scottbarnes/Google Drive/1 Projects/Data Scientist Nanodegree/Udacity-Data-Scientist-Nanodegree/1-AirBnB-Analysis'

# Data ingestion / import

## Review of source data

The source data is held in the data directory. It covers 12 months, from July 2019 to June 2020, and also includes a file containing neighbourhood geographic data.

In [27]:
data_dir = 'data/AirBnB-Data-040920'
!ls {data_dir}

10-July-2019       14-September-2019  5-November-2019    Neighbourhood-data
10-May-2020        15-March-2020      9-August-2019
11-June-2020       15-October-2019    9-December-2019
14-April-2020      16-February-2020   9-January-2020


Each folder contains 5 files.  
- listings.csv - add description
- reviews.csv - add description
- calendar.csv.gz - add description
- reviews.csv.gz - add description
- listings.csv.gz - add description

In [28]:
!ls {data_dir}/'9-January-2020'

calendar_det.csv listings_det.csv reviews_det.csv
listings.csv     reviews.csv


## Unzip the data

The first step to unzipping the required files is to create a list with the sub-directories containing the data.

In [24]:
sub_dir = !ls {data_dir}
sub_dir.remove('Neighbourhood-data')
sub_dir

['10-July-2019',
 '10-May-2020',
 '11-June-2020',
 '14-April-2020',
 '14-September-2019',
 '15-March-2020',
 '15-October-2019',
 '16-February-2020',
 '5-November-2019',
 '9-August-2019',
 '9-December-2019',
 '9-January-2020']
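The listing above is in alphabetical order, not date order. If the periods need to be processed chronologically, a minimal sketch along these lines could sort them. The `MONTHS` list, `period_to_date` helper and `periods_sorted` name are illustrative assumptions, not part of the original notebook.

```python
from datetime import date

# Sub-directory names as listed above (day-month-year, full English month name)
sub_dir = ['10-July-2019', '10-May-2020', '11-June-2020', '14-April-2020',
           '14-September-2019', '15-March-2020', '15-October-2019',
           '16-February-2020', '5-November-2019', '9-August-2019',
           '9-December-2019', '9-January-2020']

# Explicit month lookup (assumption: names are always full English month names),
# which avoids depending on the system locale
MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

def period_to_date(period):
    """Convert a name such as '9-January-2020' into a datetime.date."""
    day, month, year = period.split('-')
    return date(int(year), MONTHS.index(month) + 1, int(day))

# Sort the period names by the date they encode
periods_sorted = sorted(sub_dir, key=period_to_date)
print(periods_sorted)
```

Sorting by the parsed date rather than by the string keeps the July 2019 download first and the June 2020 download last.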

A function is created that iterates through the sub-directories and their zipped files, first renaming each file (so it is distinguishable from the non-zipped files) and then unzipping it in the same location.

In [30]:
def unzip_data(periods):
    '''Rename and unzip the .gz files in each period sub-directory of data_dir.'''
    file_names = ['calendar', 'listings', 'reviews']
    for period in periods:
        print(f'Unzipping files in subdirectory {period}')
        for file_name in file_names:
            print(f'   Unzipping {file_name}')
            cur_file_path = os.path.join(data_dir, period, file_name + '.csv.gz')
            new_file_path = os.path.join(data_dir, period, file_name + '_det.csv.gz')
            # Rename first so the unzipped file does not overwrite the existing .csv
            ! mv {cur_file_path} {new_file_path}
            ! gunzip {new_file_path}
            print(f'   Unzipped {file_name}')

In [None]:
# unzip_data(sub_dir)  # sub_dir is the list of period sub-directories created above
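The function above relies on the `mv` and `gunzip` shell commands, so it only runs inside IPython on a Unix-like system. As a hedged alternative, the same rename-and-unzip step can be sketched with the standard library alone. The `gunzip_file` and `unzip_period` helpers are hypothetical names, and unlike `gunzip` this sketch leaves the original `.csv.gz` file in place.

```python
import gzip
import os
import shutil
import tempfile

def gunzip_file(src_path, dest_path):
    """Decompress the gzip file at src_path into dest_path; src_path is kept."""
    with gzip.open(src_path, 'rb') as f_in, open(dest_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

def unzip_period(period_dir, file_names=('calendar', 'listings', 'reviews')):
    """Write <name>_det.csv next to each <name>.csv.gz found in period_dir."""
    written = []
    for name in file_names:
        src = os.path.join(period_dir, name + '.csv.gz')
        if not os.path.isfile(src):
            continue  # skip archives that are missing from this directory
        dest = os.path.join(period_dir, name + '_det.csv')
        gunzip_file(src, dest)
        written.append(dest)
    return written

# Usage on a throwaway directory so that no real data is touched
demo_dir = tempfile.mkdtemp()
with gzip.open(os.path.join(demo_dir, 'calendar.csv.gz'), 'wb') as f:
    f.write(b'listing_id,date\n43737744,2021-06-03\n')
written = unzip_period(demo_dir)
print(written)
```

Writing straight to the `_det.csv` name has the same effect as the rename step: the detailed file does not overwrite the summary `.csv` of the same base name.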

## Ingestion and processing of data

The data will be ingested and processed for analysis.

### Detailed calendar data (unzipped)

The calendar data contains daily price and stay information for each listing. The price fields contain currency symbols and thousands separators, so they need to be cleaned before they can be converted to numbers. The date field is at a daily level, and will be used to create date_month and date_year features.

In [71]:
file_name = 'calendar_det.csv'
! tail {os.path.join(data_dir,'11-June-2020',file_name)}

43737744,2021-06-03,t,$83.00,$83.00,730,1125
43737744,2021-06-04,t,$83.00,$83.00,730,1125
43737744,2021-06-05,t,$83.00,$83.00,730,1125
43737744,2021-06-06,t,$83.00,$83.00,730,1125
43737744,2021-06-07,t,$83.00,$83.00,730,1125
43737744,2021-06-08,t,$83.00,$83.00,730,1125
43737744,2021-06-09,t,$83.00,$83.00,730,1125
43737744,2021-06-10,t,$83.00,$83.00,730,1125
43737744,2021-06-11,t,$83.00,$83.00,730,1125
43737744,2021-06-12,t,$83.00,$83.00,730,1125


Reviewing the last listing [here](https://www.airbnb.co.uk/rooms/43737744?source_impression_id=p3_1599541680_kmdSBgOWT67FmBKm&check_in=2020-09-23&guests=1&adults=1&check_out=2020-10-22) highlights that the nightly price is in GBP, not USD.

A function is created to clean the calendar file.

In [None]:
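def clean_calendar(calendar):
    '''Add month/year features and convert price and available to numeric values.'''
    calendar['date_month'] = calendar['date'].dt.month
    calendar['date_year'] = calendar['date'].dt.year
    # regex=False so that "$" is treated as a literal character, not an end-of-string anchor
    calendar['price'] = calendar['price'].str.replace("$", "", regex=False)
    calendar['price'] = calendar['price'].str.replace(",", "", regex=False)
    calendar['price'] = calendar['price'].astype(float)
    # "t" (available) becomes 1, anything else becomes 0
    calendar['available'] = np.where(calendar['available'] == "t", 1, 0)
    return calendar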
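To illustrate the effect of the cleaning steps, here is a small worked sketch on two hand-built rows. The first row mirrors the tail output shown earlier. The second row's price of `$1,250.00` is a made-up value, included only to show the thousands separator being removed.

```python
import numpy as np
import pandas as pd

# Two sample rows; the second price is hypothetical, not taken from the data
sample = pd.DataFrame({
    'listing_id': [43737744, 78892],
    'date': pd.to_datetime(['2021-06-03', '2019-07-12']),
    'available': ['t', 'f'],
    'price': ['$83.00', '$1,250.00'],
})

cleaned = sample.copy()
cleaned['date_month'] = cleaned['date'].dt.month
cleaned['date_year'] = cleaned['date'].dt.year
# Strip the currency symbol and thousands separator as literal characters
cleaned['price'] = (cleaned['price']
                    .str.replace('$', '', regex=False)
                    .str.replace(',', '', regex=False)
                    .astype(float))
# Map the 't'/'f' availability flag to 1/0
cleaned['available'] = np.where(cleaned['available'] == 't', 1, 0)
print(cleaned)
```

Passing `regex=False` matters here because `$` is otherwise a regular-expression anchor, and its handling as a pattern has changed between pandas versions.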

A function is created to iterate through the sub-directories containing the file, read in the file, clean it, and output it as a feather file in a temporary directory for processed data.

In [77]:
tmp_dir = 'tmp'
processed_data = 'processed_data'

In [83]:
! ls {tmp_dir}/{processed_data}

calendar_det10-July-2019.csv       calendar_det9-January-2020.csv
calendar_det10-May-2020.csv        calendar_det9-January-2020.feather
calendar_det9-January-2020         iris.feather


In [None]:
def read_and_clean_calendars(periods):
    '''Read, clean and write each period's calendar file to feather.'''
    for period in periods:
        print(f'Processing calendar dataframe from {period}...')
        try:
            df = pd.read_csv(os.path.join(data_dir, period, file_name))
            print("   1. df read in")
            df['date'] = pd.to_datetime(df['date'])
            df = clean_calendar(df)
            print("   2. df cleaned")
            updated_file_name = file_name.replace(".csv", "_" + period + ".feather")
            # tmp_dir (not tmpdir) is the variable defined above
            df.to_feather(os.path.join(tmp_dir, processed_data, updated_file_name))
            print("   3. df renamed and outputted to feather")
            del df
            print("   4. df deleted from memory")
        except MemoryError:
            print(f'Memory error while processing {period}')
        else:
            print(f'Processed calendar dataframe from {period}')
    print('Function complete')

The function is executed to work through all sub-directories.

In [None]:
%%time
read_and_clean_calendars(sub_dir)  # sub_dir is the list of period sub-directories created above

The temporary directory now contains a sub-directory with the processed data files.

In [86]:
! ls {tmp_dir}/{processed_data}

calendar_det10-July-2019.csv       calendar_det9-January-2020.csv
calendar_det10-May-2020.csv        calendar_det9-January-2020.feather
calendar_det9-January-2020         iris.feather


The processed file contains the added date feature columns and cleaned price columns.

In [85]:
! head {tmp_dir}/{processed_data}/calendar_det10-July-2019.csv

,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights,date_month,date_year,price_clean,available_flag
0,78892,2019-07-12,f,$36.00,$36.00,3.0,31.0,7,2019,36.0,0
1,78892,2019-07-13,f,$36.00,$36.00,3.0,31.0,7,2019,36.0,0
2,78892,2019-07-14,f,$36.00,$36.00,3.0,31.0,7,2019,36.0,0
3,78892,2019-07-15,f,$32.00,$32.00,3.0,31.0,7,2019,32.0,0
4,78892,2019-07-16,f,$32.00,$32.00,3.0,31.0,7,2019,32.0,0
5,78892,2019-07-17,f,$32.00,$32.00,3.0,31.0,7,2019,32.0,0
6,78892,2019-07-18,f,$32.00,$32.00,3.0,31.0,7,2019,32.0,0
7,78892,2019-07-19,f,$36.00,$36.00,3.0,31.0,7,2019,36.0,0
8,78892,2019-07-20,f,$36.00,$36.00,3.0,31.0,7,2019,36.0,0
