# AirBnB Analysis - Data 2 Insights!

## Package Imports

In [19]:
import numpy as np
import pandas as pd
import chardet
from data_cleaning import set_data_types, rename_columns, print_memory # Our first module export!
import data_cleaning

In [None]:
# Listing the functions of the user-defined module

# dir(data_cleaning)

## User Defined Functions

*Packaging our user-defined functions in a module lets us export and reuse them again and again. There are two ways to import them:*
* *Import the whole module; its functions are then available ONLY through the module name, e.g. `data_cleaning.set_data_types(...)`.*
* *Import specific functions from the module during the import; they can then be called directly by name.*
* *Calling a function without the module name comes at a cost: name CONFLICTS, because a later import or definition with the same name silently replaces the earlier one.*
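
The trade-off between the two import styles can be sketched with two standard-library modules that both export a function named `sqrt`. Here `math` and `cmath` are only stand-ins for our own modules, chosen for illustration.

```python
import math
import cmath

# Style 1: import the whole module and call through the module name.
# Both sqrt functions stay available and unambiguous.
real_root = math.sqrt(4)        # float result
complex_root = cmath.sqrt(-4)   # complex result

# Style 2: import the function names directly.
# The second import rebinds the name `sqrt`, silently replacing the first.
from math import sqrt
from cmath import sqrt

shadowed = sqrt(4)  # this is cmath.sqrt now, so the result is complex

print(type(real_root).__name__, type(shadowed).__name__)
```

With the module prefix the caller always knows which `sqrt` runs; with direct imports the last import wins.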

## Loading & Preprocessing & Reducing Data
*In this section of the notebook we load the data, parse the date columns, and reduce memory consumption by converting object columns (where appropriate) to the pandas `category` data type and downcasting numeric columns to the smallest type that fits their minimum and maximum values. Finally, we standardize the column header names for further processing.*
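
The body of `set_data_types` lives in the `data_cleaning` module and is not shown in this notebook. As a rough sketch of the idea only, a function along these lines could downcast integers and convert repetitive text columns to `category`. The name `shrink_dtypes`, the `max_unique_ratio` threshold, and the sample values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def shrink_dtypes(df, max_unique_ratio=0.5):
    """Hypothetical stand-in for set_data_types; returns a smaller copy of df."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Pick the smallest integer type that holds the column's min and max.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif series.dtype == object:
            # Category pays off only when values repeat often.
            if series.nunique() / max(len(series), 1) <= max_unique_ratio:
                out[col] = series.astype("category")
    return out


# Made-up sample rows, only to exercise the function.
sample = pd.DataFrame({
    "listing_id": np.array([1_000, 70_000, 2_000_000, 3_000_000], dtype="int64"),
    "city": pd.Series(["Paris", "Paris", "Rome", "Paris"], dtype=object),
    "name": pd.Series(["Loft A", "Studio B", "Flat C", "Room D"], dtype=object),
})
small = shrink_dtypes(sample)
print(small.dtypes)
```

In this sketch a column of mostly unique strings, such as `name`, stays `object`, because a category would need to store nearly every value anyway.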

In [2]:
# The data dictionary is good for a sneak peek at the data; it can be deleted afterwards
#print_memory()

listings_data_dictionary = pd.read_csv('Airbnb Data/Listings_data_dictionary.csv')

#print_memory()

# listings_data_dictionary

# del listings_data_dictionary

In [3]:
# Detecting the dataset encoding!
# This check reads only the first 10000 bytes, which is too few to detect the
# encoding reliably, so it's just a sanity check!
# with open('Airbnb Data/Listings.csv', 'rb') as f:
#    result = chardet.detect(f.read(10000)) # Reading the first 10000 bytes to detect
# print(result)

In [4]:
#print_memory()

listings = pd.read_csv('Airbnb Data/Listings.csv',
                      encoding= 'latin1', # We had a problem with utf-8 encoding.
                      parse_dates= ['host_since'],
                      low_memory= False) # Parse the file in one pass for consistent dtypes (uses more memory while reading, not less).
#print_memory()

# listings
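
The UTF-8 problem mentioned in the `read_csv` call can be reproduced on a few bytes without the dataset. The sketch below uses only the standard library; the helper `decode_with_fallback` and the sample text are hypothetical, made up for illustration.

```python
# Made-up sample: accented characters encoded as Latin-1 bytes.
raw = "Café near the Champs-Élysées".encode("latin1")


def decode_with_fallback(data, encodings=("utf-8", "latin1")):
    """Return (text, encoding) for the first encoding that decodes without error."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")


text, used = decode_with_fallback(raw)
print(used, text)
```

Note that Latin-1 maps every possible byte to a character, so it never raises an error. A successful Latin-1 read therefore does not prove the encoding is right, and it is worth spot-checking a few accented names after loading.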

In [5]:
# The analysis will be conducted ONLY on the listings in Paris.
# Filtering the dataset before any other operation reduces memory usage.
#print_memory()

listings_paris = listings.loc[listings['city'] =='Paris']

#print_memory()

In [6]:
# Exploring the memory usage of the dataset before conducting any memory reductions
listings_paris.info(memory_usage= 'deep')

<class 'pandas.core.frame.DataFrame'>
Index: 64690 entries, 0 to 279711
Data columns (total 33 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   listing_id                   64690 non-null  int64         
 1   name                         64627 non-null  object        
 2   host_id                      64690 non-null  int64         
 3   host_since                   64657 non-null  datetime64[ns]
 4   host_location                64522 non-null  object        
 5   host_response_time           23346 non-null  object        
 6   host_response_rate           23346 non-null  float64       
 7   host_acceptance_rate         31919 non-null  float64       
 8   host_is_superhost            64657 non-null  object        
 9   host_total_listings_count    64657 non-null  float64       
 10  host_has_profile_pic         64657 non-null  object        
 11  host_identity_verified       64657 non-null  

In [7]:
# Leveraging the user-defined function (set_data_types) to decrease the dataset's memory usage
#print_memory()

listings_sm = set_data_types(listings_paris)
listings_sm.info(memory_usage= 'deep')

#print_memory()

<class 'pandas.core.frame.DataFrame'>
Index: 64690 entries, 0 to 279711
Data columns (total 33 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   listing_id                   64690 non-null  int32         
 1   name                         64627 non-null  object        
 2   host_id                      64690 non-null  int32         
 3   host_since                   64657 non-null  datetime64[ns]
 4   host_location                64522 non-null  category      
 5   host_response_time           23346 non-null  category      
 6   host_response_rate           23346 non-null  float64       
 7   host_acceptance_rate         31919 non-null  float64       
 8   host_is_superhost            64657 non-null  category      
 9   host_total_listings_count    64657 non-null  float64       
 10  host_has_profile_pic         64657 non-null  category      
 11  host_identity_verified       64657 non-null  

*Setting the proper data type for each column reduced the dataset's memory usage by more than 60%.*
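
A reduction like this can be measured directly rather than read off two `info()` printouts. The following is a minimal sketch on a made-up column; the helper `deep_mb` and the sample values are assumptions, and only the column name `host_response_time` comes from the dataset.

```python
import pandas as pd


def deep_mb(df):
    """Total memory of a DataFrame in MB, counting the string contents too."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2


# Made-up values: two labels repeated many times, stored as Python objects.
before = pd.DataFrame({
    "host_response_time": pd.Series(
        ["within an hour", "within a day"] * 5000, dtype=object
    )
})
after = before.astype({"host_response_time": "category"})

saved_pct = 100 * (1 - deep_mb(after) / deep_mb(before))
print(f"{deep_mb(before):.2f} MB -> {deep_mb(after):.2f} MB ({saved_pct:.0f}% saved)")
```

`deep=True` matters for object columns: without it pandas counts only the pointers, not the strings they point to.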

In [8]:
listings_clean = rename_columns(listings_sm)

# listings_clean.columns
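
`rename_columns` is also defined in `data_cleaning` and its rules are not shown here. As an illustration of what standardizing header names can look like, the sketch below lowercases names and replaces runs of non-alphanumeric characters with underscores. The function `standardize_columns` and the messy sample headers are hypothetical.

```python
import re

import pandas as pd


def standardize_columns(df):
    """Hypothetical stand-in for rename_columns: snake_case every header."""
    def clean(name):
        name = str(name).strip().lower()
        # Collapse spaces, hyphens and other symbols into single underscores.
        name = re.sub(r"[^0-9a-z]+", "_", name)
        return name.strip("_")

    return df.rename(columns=clean)


# Made-up headers to show the effect; the last one is already clean.
demo = pd.DataFrame(columns=[" Listing ID", "Host-Since", "host_is_superhost"])
clean_names = list(standardize_columns(demo).columns)
print(clean_names)
```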

In [9]:
reviews_data_dictionary = pd.read_csv('Airbnb Data/Reviews_data_dictionary.csv',
                                     encoding= 'latin1')
# reviews_data_dictionary.head()

# del reviews_data_dictionary

In [10]:
reviews = pd.read_csv('Airbnb Data/Reviews.csv',
                     parse_dates= ['date'])

# reviews.head()

In [11]:
reviews.info(memory_usage= 'deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5373143 entries, 0 to 5373142
Data columns (total 4 columns):
 #   Column       Dtype         
---  ------       -----         
 0   listing_id   int64         
 1   review_id    int64         
 2   date         datetime64[ns]
 3   reviewer_id  int64         
dtypes: datetime64[ns](1), int64(3)
memory usage: 164.0 MB


In [12]:
reviews_sm = set_data_types(reviews)
reviews_sm.info(memory_usage= 'deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5373143 entries, 0 to 5373142
Data columns (total 4 columns):
 #   Column       Dtype         
---  ------       -----         
 0   listing_id   int32         
 1   review_id    int32         
 2   date         datetime64[ns]
 3   reviewer_id  int32         
dtypes: datetime64[ns](1), int32(3)
memory usage: 102.5 MB
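
The two `info()` figures for `reviews` can be checked by hand, since every column has a fixed width: `int64` and `datetime64[ns]` take 8 bytes per value and `int32` takes 4. The sketch below redoes that arithmetic, ignoring the few bytes used by the `RangeIndex`.

```python
ROWS = 5_373_143      # row count reported by info()
MIB = 1024 ** 2       # pandas reports "MB" in steps of 1024

# Before: three int64 columns plus one datetime64[ns] column.
before_bytes = ROWS * (3 * 8 + 8)
# After: three int32 columns plus the unchanged datetime64[ns] column.
after_bytes = ROWS * (3 * 4 + 8)

before_mb = before_bytes / MIB
after_mb = after_bytes / MIB
saved_pct = 100 * (1 - after_bytes / before_bytes)

print(f"{before_mb:.1f} MB -> {after_mb:.1f} MB, {saved_pct:.1f}% saved")
```

The result matches the printed 164.0 MB and 102.5 MB, a saving of 37.5%. The date column cannot shrink, which is why the gain here is smaller than for the listings table.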


In [13]:
reviews_clean = rename_columns(reviews_sm)

# reviews_clean.columns

In [14]:
print_memory()

del listings_data_dictionary
del reviews_data_dictionary
del listings
del listings_paris
del listings_sm
del reviews
del reviews_sm

print_memory()

Current memory usage: 1547.05 MB
Current memory usage: 1319.17 MB
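
One thing to keep in mind when reading these two numbers: `del` removes a name, not necessarily the object behind it. The memory is released only once no other name still refers to the same object. The sketch below shows this with a tiny stand-in class instead of a DataFrame; `Payload` is hypothetical, and the immediate release relies on CPython's reference counting.

```python
import weakref


class Payload:
    """Tiny stand-in for a large DataFrame."""


frame = Payload()
alias = frame                 # second name for the same object
probe = weakref.ref(frame)    # lets us check whether the object still exists

del frame
alive_after_first_del = probe() is not None    # still referenced by `alias`

del alias
alive_after_second_del = probe() is not None   # no names left, so it is freed

print(alive_after_first_del, alive_after_second_del)
```

If an intermediate frame shares its data with a frame that is kept, deleting the intermediate name alone may free less memory than its size suggests.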


## Data Exploration
*In this section we are going to explore the dataset and start building our analysis framework!*