# **Airbnb Listings**

This project will analyse the dataset of Airbnb listings in Melbourne as of August 2022. We will investigate the most common areas, the most popular hosts, and what factors may contribute to the price.

**Contents:**
1. Introduction
2. Summary
3. Data Wrangling
4. Analysis
5. Conclusion
6. Recommendations
7. References

# **1. Introduction**

Since 2008, Airbnb has enabled homeowners to control when they want to rent out their home for short-term homestays. Seen as an easy way to profit from a home that may be vacant for part of the year, it has become extremely popular.
For this project we will analyse the current market in Melbourne and extract insights such as what makes a home popular, what influences prices, and where demand and supply may not be at equilibrium.

Please note that because the full dataset exceeds the 20 MB limit, I have only included part of it, which may skew the results.

**Questions we will be answering:**

* As of 2022, what is the most common suburb to find an Airbnb with over a 4 star rating? 
* As of 2022, what suburbs have the most bookings?
* What is the distribution of prices for different suburbs?
* Which hosts are the most popular, and why?
* Does the number of reviews have an effect on the price/availability?
* What are the common terms used in reviews for some of the most popular hosts?
* Which amenities are most common in popular Airbnbs but not available in less popular Airbnbs?
* Does being a Superhost contribute to more bookings?


**Data Files**

This notebook uses the three datasets below. The hosts and reviews tables are linked to the listings table through host_id and listing_id, respectively. The datasets contain many columns, so we only describe the ones whose contents may not be clear from the name.

All files were obtained from: http://insideairbnb.com/get-the-data

**listings.csv**
* **Id**  (unique identifier)
* **Accommodates** (how many people can fit in the house)
* **Amenities** (a list of all the amenities included in the house)
* **Minimum_nights** (the minimum nights someone can book for)
* **Maximum_nights** (the maximum nights a person can book for)
* **availability_30** (how many nights are available in the next 30 days)
* **availability_60** (how many nights are available in the next 60 days)
* **availability_90** (how many nights are available in the next 90 days)
* **availability_365** (how many nights are available in the next 365 days)
* **Instant_bookable** (If the house can be booked without the host needing to approve)


**hosts.csv**
* **host_id** (Unique identifier)
* **host_since** (the date the user became a host)
* **host_about** (the description about the host)
* **host_acceptance_rate** (how often the host accepts requests to rent their house)
* **host_is_superhost** (a badge given to hosts who meet or exceed guest expectations)
* **host_total_listings_count** (how many listings the host has on Airbnb)

**reviews.csv**
* **Listing_id** (Foreign key, to link to listing table)
* **Id** (Unique identifier)

# **2. Summary**

Overall, our findings showed that listing price and availability are not well correlated with many of the attributes we investigated. The strongest factor we found for pricing was the area of the listing: areas further from the CBD seem to have higher prices.

# **3. Data Wrangling**



In [1]:
# import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import geopandas as gpd

%matplotlib inline

In [2]:
# read in data
listings = pd.read_csv('data/listings.csv')
hosts = pd.read_csv('data/hosts.csv')
reviews = pd.read_csv('data/reviews.csv')


In [3]:
# Have a look at the data
print(listings.columns)
print(listings.shape)
print(listings.dtypes)
print(listings.count())
print(listings[listings.duplicated()])
listings.head()


Most of the data looks good. All columns loaded correctly, the row count is as expected, and there are no duplicates.

Some fixes that we will do to clean the data include:
* Remove the text from the bathrooms_text column so it is only a float, making it easier to work with.
* Fix the data types of some columns:
  * first_review to a date type
  * last_review to a date type
* Rename neighbourhood_cleansed to just neighbourhood.
* Explore the null values and decide what to do with them.



In [4]:
# Using regex we can remove any letters '[A-Za-z]',
# leaving us only the number of bathrooms.
# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal by default
listings['bathrooms_text'] = listings['bathrooms_text'].str.replace(r'[A-Za-z]', '', regex=True)

# remove any blank spaces from the strings
listings['bathrooms_text'] = listings['bathrooms_text'].str.strip()

# assuming a dash '-' means no bathrooms, we replace this with 0
listings['bathrooms_text'] = listings['bathrooms_text'].str.replace('-', '0', regex=False)

# rename column to a more appropriate name
listings.rename(columns = {'bathrooms_text':'bathrooms'}, inplace=True)

# now we can change the data type to a float
listings['bathrooms'] = listings['bathrooms'].astype('float')
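As an alternative to stripping letters, the number could be pulled out directly with a capture group. The sketch below is not the method used above: it runs on a few made-up `bathrooms_text` values (assumed to resemble the real ones) and treats a "half-bath" entry with no number as 0.5, which is an assumption that differs from the 0 used in this notebook.

```python
import pandas as pd

# Made-up examples of bathrooms_text values (assumed format, not taken from the dataset)
bathrooms_text = pd.Series(['1 bath', '1.5 shared baths', 'Half-bath', None])

# Capture the first integer or decimal number in each string; rows without one become NaN
bathrooms = bathrooms_text.str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype('float')

# Assumption: an entry that mentions "half" but carries no number counts as 0.5 bathrooms
is_half = bathrooms_text.str.contains('half', case=False, na=False)
bathrooms = bathrooms.mask(is_half & bathrooms.isna(), 0.5)

print(bathrooms.tolist())
```

Missing values stay as NaN here, so the decision about how to fill them can be made separately.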

In [5]:
# convert first_review and last_review to datetime type
listings["last_review"] = pd.to_datetime(listings["last_review"], infer_datetime_format=True)
listings["first_review"] = pd.to_datetime(listings["first_review"], infer_datetime_format=True)

In [6]:
# rename columns 
listings.rename(columns={'neighbourhood_cleansed':'neighbourhood'}, inplace=True)

In [7]:
# We can focus on 30 day and 365 day availability and the overall score rating.
# filter out columns availability 60/90, and all review_scores except rating.
unwanted_columns = ['availability_60', 'availability_90','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value']
listings = listings.drop(columns=unwanted_columns)




In [8]:
# How many null values do we have?
listings.isna().sum()

From the empty columns:
* description is likely empty because the host has not put any information there.
* The review columns are empty for listings that have not received any reviews yet.
* Airbnb allows tents, single rooms and apartments that may not have separate rooms for the bed or bathroom, so we assume this is the reason for null values in the bathrooms/bedrooms/beds columns.

We will set NA values in bathrooms, bedrooms and beds to 0 instead of NaN.

In [9]:
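# fillna with inplace=True on a column selection only modifies a temporary copy, so assign the result back
listings[['bathrooms','bedrooms','beds']] = listings[['bathrooms','bedrooms','beds']].fillna(0)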
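A minimal sketch of filling several columns at once, using a toy frame with made-up values. The filled columns are assigned back to the frame so the change is kept, and columns outside the selection are left untouched.

```python
import pandas as pd

# Toy frame with made-up values; only bedrooms and beds should be filled
rooms_demo = pd.DataFrame({
    'bedrooms': [1.0, None],
    'beds': [None, 2.0],
    'price': [100.0, None],
})

room_columns = ['bedrooms', 'beds']
rooms_demo[room_columns] = rooms_demo[room_columns].fillna(0)

# Remaining missing values per column
print(rooms_demo.isna().sum())
```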

In [10]:
# final result of our cleaning.
print(listings.info())
listings.head()

In [11]:
listings.describe()

**File 2: hosts.csv**

In [12]:
# Inspect dataset

display(hosts.shape)
display(hosts[hosts.duplicated()])
display(hosts.info())
display(hosts.describe())

In [13]:
hosts.head()

**Some fixes that we can do include:**

* Remove the duplicate row.
* Remove unwanted columns.
* Change the data type of columns to something more appropriate.

In [14]:
# Drop duplicate entry
hosts.drop_duplicates(inplace=True)
# Drop unwanted columns 
unwanted_columns_hosts = ['host_listings_count', 'host_verifications', 'host_has_profile_pic']
hosts.drop(columns=unwanted_columns_hosts, inplace=True)

In [15]:
# change to date type
hosts['host_since'] = pd.to_datetime(hosts['host_since'], infer_datetime_format=True)

# remove percent symbol and make float type
hosts['host_response_rate'] = hosts['host_response_rate'].str.replace('%', '')
hosts['host_response_rate'] = hosts['host_response_rate'].astype('float')
hosts['host_acceptance_rate'] = hosts['host_acceptance_rate'].str.replace('%','') 
hosts['host_acceptance_rate'] = hosts['host_acceptance_rate'].astype('float')


In [16]:
# Merge dataset with listings
df = listings.merge(hosts, on=['host_id'], how = 'left')
df.head()
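Because the hosts table contained a duplicate row, it can be worth checking that a merge like this one does not multiply listings. The sketch below uses two tiny made-up tables (`listings_demo` and `hosts_demo` are hypothetical names) with the `validate` and `indicator` options of `merge`.

```python
import pandas as pd

# Tiny made-up tables: host 99 has no entry in the hosts table
listings_demo = pd.DataFrame({'id': [1, 2, 3], 'host_id': [10, 10, 99]})
hosts_demo = pd.DataFrame({'host_id': [10, 20], 'host_is_superhost': ['t', 'f']})

# validate='many_to_one' raises an error if host_id is duplicated in hosts_demo;
# indicator=True adds a _merge column showing where each row came from
merged = listings_demo.merge(
    hosts_demo, on='host_id', how='left', validate='many_to_one', indicator=True
)

# Listings whose host was not found in the hosts table
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched)
```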

**File 3: reviews.csv**

In [17]:
# Inspect dataset

display(reviews.shape)
display(reviews.info())
display(reviews.describe())

reviews.head()

In [18]:
# check the id column for uniqueness, since it should be a primary key that identifies each row
display(reviews['id'].is_unique)
reviews[reviews['id'].duplicated()]

We see that something has gone wrong with the data import: some of the values in the id column have been stored in scientific notation, so they are no longer unique.

This affects about 1/10th of our data. Given that we won't be looking at reviews individually but as a whole, we will leave these rows in.
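One way to spot such rows is to read the identifier as text and flag anything that is not purely digits. This is only a sketch on made-up CSV text; the column layout is an assumption and it does not recover the original ids.

```python
import io
import pandas as pd

# Made-up CSV text; the second id mimics a value that was saved in scientific notation
csv_text = "id,listing_id\n1234567890123,11\n6.89E+17,12\n"

# Reading the identifier column as a string keeps it exactly as it appears in the file
reviews_demo = pd.read_csv(io.StringIO(csv_text), dtype={'id': str})

# Flag ids that are not made up purely of digits
bad_ids = ~reviews_demo['id'].str.fullmatch(r'\d+')
print(reviews_demo[bad_ids])
```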

**Some fixes that we can do include:**

* Remove empty rows from the import error.
* Change the reviewer_id and listing_id columns to int instead of float.
* Change date to a datetime type.

In [19]:
reviews.dropna(inplace=True)

In [20]:
# change columns to appropriate data types
reviews[['reviewer_id','listing_id']] = reviews[['reviewer_id','listing_id']].astype('int')
reviews['date'] = pd.to_datetime(reviews['date'], infer_datetime_format=True) 
reviews.info()

---
Unfortunately we don't get to see the actual rating of each review, so we can perhaps focus on the contents of the reviews instead.


# **4. Analysis**

**Question 1:**

**As of 2022, what is the most common suburb to find an Airbnb with a rating over 4?**

We will explore the top 10 neighbourhoods with the most Airbnbs rated over 4, using a bar chart to clearly show the differences between the top 10 and a map to visualize the locations.

In [21]:
# First we keep only the listings rated above 4 stars
over_4_stars = df[df['review_scores_rating'] > 4]

# Group by neighbourhood 
top_neighbourhoods = over_4_stars.groupby('neighbourhood').size()

# Sort the values
top_neighbourhoods = top_neighbourhoods.nlargest(10).sort_values()

# Show on a horizontal bar graph
# A vertical bar graph gets harder to read with more values on the x axis so we go with horizontal
top_neighbourhoods.plot.barh(title = 'Top Airbnb neighbourhoods with 4+ star ratings', color = '#FF5A5F')
plt.xlabel('Count')



**Answer**

Looks like Melbourne CBD will be your best bet at finding an Airbnb with a rating of over 4.

---

**Question 2**

**As of 2022, what suburbs have the most bookings?**

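Using the availability_365 column we can estimate which areas have the most bookings. We treat every day that is not available as booked, although some of those days may simply have been blocked by the host. We then explore the percentage of all bookable days that are booked in each area.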

In [22]:
# Subtract days available from 365 for a clear int of how many days the listing has been booked 
listings['days_booked'] = 365 - listings['availability_365']

# Sum the days booked for each neighbourhood
bookings_per_neighbourhood = listings.groupby('neighbourhood')['days_booked'].sum()

# sort the values and create a bar chart
bookings_per_neighbourhood = bookings_per_neighbourhood.nlargest(10).sort_values()
bookings_per_neighbourhood.plot.barh(title = 'Days Booked', color = '#FF5A5F')

As expected the areas with the most listings have the most bookings. We will dig further into this by finding the percentage of bookings.

In [23]:
# Create column for calculation reference of total availability
listings['year_days'] = 365

# Create a new df to work out percentage
bookings_perc = listings[['neighbourhood','days_booked', 'year_days']].copy()


In [24]:
# Groupby neighbourhood and sum 
bookings_perc = bookings_perc.groupby('neighbourhood').sum()

# We can now divide the booked days by the total days to get a percentage
bookings_perc['total_booked_percent_365'] = round(bookings_perc['days_booked']/ bookings_perc['year_days'],2) * 100

# Sort values then visualize with bar chart
bookings_perc = bookings_perc['total_booked_percent_365'].nlargest(10).sort_values()
bookings_perc.plot.barh(title = '% of availability booked', color = '#FF5A5F')
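The percentage above is the total days booked in a neighbourhood divided by 365 times its number of listings, which is the same as the mean days booked per listing divided by 365. A small sketch on made-up numbers (neighbourhoods 'A' and 'B' are hypothetical):

```python
import pandas as pd

# Made-up availability figures for three listings in two hypothetical neighbourhoods
booking_demo = pd.DataFrame({
    'neighbourhood': ['A', 'A', 'B'],
    'availability_365': [0, 365, 73],
})
booking_demo['days_booked'] = 365 - booking_demo['availability_365']

# mean(days_booked) / 365 equals sum(days_booked) / (365 * number of listings)
booked_percent = booking_demo.groupby('neighbourhood')['days_booked'].mean() / 365 * 100
print(booked_percent.round(1))
```

Working from the mean avoids the helper column of constant 365 values, at the cost of being slightly less explicit about the denominator.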

**Answer**

Now we get a much more interesting result.

As we can see, Yarra has the highest booking percentage.
If you were thinking of purchasing an investment property to list on Airbnb, you could pick one of the top areas, as a high booked percentage suggests there is not an oversupply.

---

**Question 3**

**What is the distribution of prices for different suburbs?**

When thinking of either buying a property to host or looking to stay at an Airbnb, we want to know if the prices of these listings are appropriate.

We can compare them to the median/mean, but it may be better to get an understanding of the range of values.

We will create box plots for each of the neighbourhoods to visualize the distribution and compare easily to other areas.


In [25]:
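# Extra cleaning that we missed earlier to prepare the price column
# regex=False so that '$' is treated as a literal character rather than a regex end-of-string anchor
listings['price'] = listings['price'].str.replace('$', '', regex=False)
listings['price'] = listings['price'].str.replace(',', '', regex=False)
listings['price'] = listings['price'].astype('float')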

# plot the prices 
sns.boxplot(data=listings, x ='price', y='neighbourhood')

As we see, the outliers make it difficult to see any clear patterns amongst the areas.

Next we will remove the outliers to clearly see the bulk of the data.

In [26]:
# plot prices excluding the outliers.
sns.boxplot(data=listings, x ='price', y='neighbourhood',showfliers = False, color='#FF5A5F' )

In [27]:
# Create a new df excluding rows at or above the 95th percentile of price
# so the colour gradient in the visualization is clearer
listings_outlier_removed = listings[listings['price'] < listings['price'].quantile(.95)]

# Plot the location of all listings; darker colours represent higher median prices in that hexbin
# Prices vary quite a bit so a more robust average is the median.
ax1 = listings_outlier_removed.plot.hexbin(
    x = 'longitude',
    y = 'latitude',
    C = 'price',
    reduce_C_function = np.median,
    gridsize = 50)

# Plot a point representing Melbourne CBD for reference
ax2 = plt.scatter(x=144.9631,y=-37.8136, color='red') 

# set limits so map is focused on melbourne
ax1.set_xlim([144.4,145.7])
ax1.set_ylim(-38.3,-37.4)
ax1.set_facecolor('#E5E4E2')
plt.savefig('plot1.png')

**Answer**

As we can see, the distribution for most areas sits within the \$80 to \$200 range, with very few areas having listings over \$400 per night.

The further a neighbourhood is from the CBD, the higher its average listing price is likely to be.

---

**Question 4**

**Which hosts are the most popular? Why?**

It would be good to understand what successful hosts have in common so that we can replicate it if we like.

First we need to define popular/successful.

We will create a new column that calculates the earning potential of each listing (days booked * price).

We will then take our top hosts and see if we can find anything in common.

In [28]:
# create new earning potential column
listings['earning_potential'] = listings['days_booked'] * listings['price']

# group by hosts and sort
top_hosts = listings.groupby(by='host_id')['earning_potential'].sum()
top_hosts = top_hosts.sort_values(ascending=False)

# Split hosts into 20 equal-sized quantile bins and keep the 20th bin (top 5% by earning potential)
cut_df = pd.qcut(top_hosts, 20, labels=range(1,21))
cut_df = cut_df[cut_df==20]

#find the actual details of our hosts.
top_hosts = hosts[hosts['host_id'].isin(cut_df.index)]
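A different way to select the top 5%, sketched here on made-up numbers, is to compare each host against the 95th percentile directly. This is not the approach used in the cell above; it may be handy when many hosts share the same earning potential, since `pd.qcut` raises an error on duplicate bin edges unless `duplicates='drop'` is passed.

```python
import pandas as pd

# Made-up earning potential for 100 hypothetical hosts (host ids 1..100)
earning_demo = pd.Series(
    range(1, 101), index=pd.RangeIndex(1, 101, name='host_id'), dtype='float'
)

# Hosts at or above the 95th percentile of earning potential
threshold = earning_demo.quantile(0.95)
top_host_ids = earning_demo[earning_demo >= threshold].index

print(len(top_host_ids), 'hosts selected')
```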

In [29]:
# Analyse the hosts
display(top_hosts['host_since'].median())
display(top_hosts['host_response_rate'].value_counts())
display(top_hosts['host_response_time'].value_counts())
sns.boxplot(data=top_hosts, y='host_total_listings_count',showfliers=False, color='#FF5A5F')

In [30]:
fig, (ax1, ax2) = plt.subplots(1, 2)

# Find how many of the hosts are verified
verified = top_hosts.groupby('host_identity_verified').size()
#change the index from just 'f' and 't' to False and True
verified = verified.rename(index={'f':'False', 't':'True'})

ax1.pie(verified, colors=['grey','#FF5A5F'], labels = verified.index, autopct='%1.1f%%')
ax1.set_title('Host is verified')


# Find how many of the hosts are superhosts
superhost = top_hosts.groupby('host_is_superhost').size()
#change the index from just 'f' and 't' to False and True
superhost = superhost.rename(index = {'t':'True','f':'False'})

ax2.pie(superhost, colors=['grey','#FF5A5F'], labels=superhost.index, autopct='%1.1f%%')
ax2.set_title('Host is Superhost')



Most of the results are what we would expect to see.

On average, the top hosts have been hosting for a while (since 2015), have quick and high response rates, are verified, and own multiple listings.

The most interesting result is that most of them are not Superhosts.




In [31]:
# find the average rating of all listings per host
average_ratings = listings.groupby('host_id')['review_scores_rating'].mean()

#add the hosts average rating to the host table and find the average of the top hosts
top_hosts_avg = top_hosts.merge(average_ratings, on=['host_id'], how = 'left')
round(top_hosts_avg['review_scores_rating'].median(),2)

In [32]:
# get the hosts that are not in the top hosts dataframe and find their average.
# average_ratings is indexed by host_id, so we test the index rather than the rating values
not_top_hosts = average_ratings[~average_ratings.index.isin(top_hosts['host_id'])]
round(not_top_hosts.median(),2)
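When a Series is indexed by host_id, a membership test against host ids has to look at the index, because `Series.isin` compares the values (here, the ratings). A small sketch with made-up ratings shows the difference:

```python
import pandas as pd

# Made-up average ratings indexed by hypothetical host ids
ratings_demo = pd.Series(
    [4.5, 4.9, 3.0], index=pd.Index([10, 20, 30], name='host_id')
)
top_ids = [20]

# Comparing values: no rating equals a host id, so nothing is excluded
by_values = ratings_demo[~ratings_demo.isin(top_ids)]

# Comparing the index: host 20 is excluded as intended
by_index = ratings_demo[~ratings_demo.index.isin(top_ids)]

print(len(by_values), len(by_index))
```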

Interestingly, although the difference is very small, the median of the average ratings is actually higher for the hosts without high earning potential.



In [33]:
# create a df of hosts not in the top_hosts df
other_hosts = hosts[~hosts['host_id'].isin(top_hosts['host_id'])]
# count how many are superhosts
other_hosts_agg = other_hosts['host_is_superhost'].value_counts()
# relabel t to True and f to False
other_hosts_agg = other_hosts_agg.rename(index = {'t':'True','f':'False'})
# Plot as a pie chart
other_hosts_agg.plot.pie( autopct='%1.1f%%',colors=['grey','#FF5A5F'], labels=other_hosts_agg.index)

In [34]:
top_hosts

**Answer**

From this brief analysis we see that having the 'Superhost' badge or higher ratings might not actually result in higher earning potential.

---

**Question 5**

**Does the number of reviews have an effect on the price/availability?**

As we all know, every time we buy a service or a product we are prompted to leave a review, and are often offered some sort of reward, as reviews are supposed to encourage new customers.

We will look at whether listings with more reviews are priced higher and have more bookings, to find out if reviews do help.





In [35]:
# Create a scatter plot to see if we can identify any linear relationship 
#   between the price and number of reviews for each listings
plt.scatter(data=listings, x='number_of_reviews', y='price',alpha=0.5, color='#FF5A5F')

# limit the visual so we can easily see the bulk of the data
plt.ylim([0, 800])
plt.xlim([0,25])


In [36]:
# remove the outliers (any value greater than the 95th percentile price)
listings_outliers_removed = listings[listings['price'] < listings['price'].quantile(.95)]

# remove the outliers (any value greater than the 95th percentile number of reviews)
listings_outliers_removed = (listings_outliers_removed[listings_outliers_removed['number_of_reviews']
                 < listings_outliers_removed['number_of_reviews'].quantile(.95)])

# create a hex plot to see the relationship along with the distribution of each variable
sns.jointplot(data=listings_outliers_removed, x="number_of_reviews", y="price", kind="hex")

The plots don't immediately show any interesting patterns, so we will leave it there and assume that the price of an Airbnb has no clear relationship with its number of reviews.
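To put a number on what a scatter plot suggests, a rank correlation could be computed alongside it. The sketch below uses made-up values and computes Spearman's coefficient as the Pearson correlation of the ranks; a value near 0 would support the "no clear relationship" reading, while values near -1 or 1 indicate a monotonic trend.

```python
import pandas as pd

# Made-up data in which price falls steadily as the number of reviews rises
reviews_price_demo = pd.DataFrame({
    'number_of_reviews': [1, 2, 3, 4, 5],
    'price': [10.0, 9.0, 7.0, 4.0, 1.0],
})

# Spearman's rank correlation is the Pearson correlation of the ranked values
rho = reviews_price_demo['number_of_reviews'].rank().corr(
    reviews_price_demo['price'].rank()
)
print(round(rho, 2))
```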



In [37]:
# plot the relationship between the availability and price
plt.scatter(data=listings, x='days_booked', y='price',alpha=0.5, color='#FF5A5F')

# once again limit the x and y axis to get a clearer look at a bulk of the data.
plt.ylim([0, 1000])
plt.xlim([0,365])

**Answer**

The plots of price against number of reviews don't show any interesting patterns, so we assume that the price of an Airbnb has no clear relationship with its number of reviews.

The plot of price against days booked is also relatively flat, so we assume that price has no clear relationship with availability either.

---

**Question 6**

**Most common amenities in higher-rated Airbnbs that are not available in lower-rated Airbnbs**

This may show us which types of amenities are viewed as more valuable.

In [38]:
# we will create 2 tables based on rating: listings above the median rating and listings below it
# (listings rated exactly at the median fall into neither group)
median_rating = listings['review_scores_rating'].quantile(.5)

# create df of the higher-rated listings
above_fifty_percentile = listings[listings['review_scores_rating'] > median_rating]
# Only include columns we want
above_fifty_percentile = above_fifty_percentile[['id','amenities']].copy()

# create df of the lower-rated listings
below_fifty_percentile = listings[listings['review_scores_rating'] < median_rating]
# Only include columns we want
below_fifty_percentile = below_fifty_percentile[['id','amenities']].copy()

# Create a function that will transform our amenities column from a single string object to a list
def string_to_list(amenities):
    """Takes the amenities string from one row and converts it into a list of amenity names."""

    # split string into list by commas
    items = amenities.split(',')
    # initiate a new list to store the new clean values
    new_list = []
    # iterate over each item in the amenities string, clean it, then add it to our list
    for item in items:
        a = item.replace('[','')
        a = a.replace(']', '')
        a = a.replace('"','')
        # strip surrounding spaces so 'Wifi' and ' Wifi' are counted as the same amenity
        new_list.append(a.strip())
    return new_list

# apply our function to the amenities column
above_fifty_percentile['amenities'] = above_fifty_percentile['amenities'].apply(string_to_list)
below_fifty_percentile['amenities'] = below_fifty_percentile['amenities'].apply(string_to_list)
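If the amenities strings are valid JSON arrays (an assumption about the file format, illustrated here with made-up rows), they could also be parsed with the standard `json` module. This keeps amenity names that contain commas in one piece, which splitting on commas would break apart.

```python
import json
from collections import Counter

# Made-up amenities strings, assumed to be JSON arrays like those in listings.csv
amenity_rows = [
    '["Wifi", "Kitchen", "Pool"]',
    '["Wifi", "Heating"]',
]

# Parse each row into a list and tally how often each amenity appears
amenity_counts = Counter()
for raw in amenity_rows:
    amenity_counts.update(json.loads(raw))

print(amenity_counts.most_common())
```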

In [39]:
# transform the df so now there is a row for each of the amenities for each id
above_fifty_percentile = above_fifty_percentile.explode('amenities')
below_fifty_percentile = below_fifty_percentile.explode('amenities')

In [40]:
# Count the occurrences of each amenity and transpose so each row represents an amenity
above_fifty_pivot = above_fifty_percentile.pivot_table(columns= 'amenities', aggfunc='count')
above_fifty_pivot = above_fifty_pivot.transpose()

# Count the occurrences of each amenity and transpose so each row represents an amenity
below_fifty_pivot = below_fifty_percentile.pivot_table(columns= 'amenities', aggfunc='count')
below_fifty_pivot = below_fifty_pivot.transpose()

In [41]:
# keep the 20 most common amenities in each group (nlargest already returns them sorted)
above_fifty_pivot = above_fifty_pivot.nlargest(n=20, columns='id')
below_fifty_pivot = below_fifty_pivot.nlargest(n=20, columns='id')

In [42]:
print('Above 50th percentile')
print(above_fifty_pivot)

print('\n\nBelow 50th percentile')
print(below_fifty_pivot)

**Answer**

Overall, there don't seem to be any major differences in the top 20 amenities between the 2 groups.
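Comparing two top-20 lists by eye can hide smaller differences. One possible follow-up, sketched on made-up listings, is to compute the share of listings in each group that offer an amenity and then look at the difference in shares. The helper `amenity_share` is a hypothetical name introduced for this sketch.

```python
import pandas as pd

def amenity_share(frame):
    """Share of listings in the frame that offer each amenity."""
    exploded = frame.explode('amenities')
    return exploded.groupby('amenities')['id'].nunique() / frame['id'].nunique()

# Made-up higher-rated and lower-rated groups with amenities already parsed into lists
high_demo = pd.DataFrame({'id': [1, 2], 'amenities': [['Wifi', 'Pool'], ['Wifi']]})
low_demo = pd.DataFrame({'id': [3, 4], 'amenities': [['Wifi'], ['Wifi', 'Heating']]})

# Positive values are more common in the higher-rated group, negative in the lower-rated group
share_difference = (
    amenity_share(high_demo)
    .sub(amenity_share(low_demo), fill_value=0)
    .sort_values(ascending=False)
)
print(share_difference)
```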

---

**Question 7**

**From all the datasets, can we identify any correlations that may be worth investigating?**

We can see which factors are likely to affect each other.




In [43]:
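# create a correlation matrix for all the numeric listings variables
# text columns are excluded first, since newer pandas versions no longer drop them automatically
corr_listings = listings.select_dtypes(include='number').corr()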

# plot the values in a heatmap to easily identify relationships
plt.figure(figsize=(15,8))
sns.heatmap(corr_listings, annot=True)
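A large heatmap can be hard to scan, so the strongest pairs could also be listed directly. The sketch below works on a small made-up frame: it keeps one copy of each pair by masking everything except the upper triangle of the correlation matrix, then sorts by absolute correlation.

```python
import numpy as np
import pandas as pd

# Small made-up frame; the text column is ignored by select_dtypes
corr_demo = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'accommodates': [2, 4, 6, 8],
    'beds': [1, 2, 3, 4],
    'price': [100.0, 90.0, 120.0, 95.0],
})
corr = corr_demo.select_dtypes(include='number').corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).unstack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```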

In [44]:
# create a correlation matrix for all the numeric hosts variables
corr_hosts = hosts.select_dtypes(include='number').corr()

# plot the values in a heatmap to easily identify relationships
plt.figure(figsize=(15,8))
sns.heatmap(corr_hosts, annot=True)

In [45]:
# the review table lacks numerical columns so we will create a
# length column which will hold the length of the review
reviews['length'] = reviews['comments'].str.len()

#merge the reviews with the listing table to get the listing ratings
reviews = reviews.merge(listings[['id', 'review_scores_rating']], left_on=['listing_id'],right_on=['id'], how='left')
reviews = reviews[['listing_id','date','reviewer_id','comments','length', 'review_scores_rating']]

# create a correlation matrix for all the numeric reviews variables
corr_reviews = reviews.select_dtypes(include='number').corr()

# plot the values in a heatmap to easily identify relationships
plt.figure(figsize=(15,8))
sns.heatmap(corr_reviews, annot=True)

**Answer**

In the listings df, the noticeable relationships are what we would expect. The bathrooms, bedrooms, beds, and accommodates columns are all positively correlated: if a listing accommodates more people, it needs more beds, bedrooms and bathrooms to do so.

In the hosts df, the acceptance and response rates are the most related. Hosts with higher response rates often accept more requests.

The reviews df doesn't have many strong relationships, the strongest being between the length of the review and the reviewer_id. The reviewer_id is an identifier rather than a measurement, so this doesn't really tell us anything.

---

# **5. Conclusion**

We can conclude that the difference in price amongst listings may be heavily dependent on the area, and possibly lightly affected by other factors such as ratings.

We found that the area with the most listings rated over 4 was Melbourne CBD, with over 2000 listings, double the number in the next area, Port Phillip.

The ranking of areas by days booked is much the same as the ranking by number of listings when looking at absolute values, but when converting the values to percentages, Melbourne falls to 5th position and Yarra takes the top spot at approximately 80% booked. This can be used as an indicator of supply and demand.

When looking at the pricing of listings for each neighbourhood, the distribution was similar for each, with 50% of the listings priced between \$80 and \$200.
Bayside had the widest distribution, and Yarra Ranges had the highest median.
When plotting the locations and median prices, we saw that areas further from the CBD were likely to have higher median prices.

When investigating the data we had on the hosts, we hoped to identify what factors may result in a more successful host. Successful hosts were measured by earning potential (days booked * price), taking the top 5% of hosts. We didn't find any noticeable differences between the top 5% and the rest of the hosts.

When looking at the effect of reviews, the number of reviews for a listing did not show any relationship with price, nor did it show any relationship with availability.

Looking at the amenities offered for each listing, the top 50% rated listings didn't seem to have any different offerings compared to the lower 50%.

There were several subtle correlations amongst the columns that could perhaps be explored further, but no correlations stood out.

Overall, much of the analysis showed that listings did not clearly show any relationship with the explored variables. If you wanted to get an idea of the price of a property, your most accurate guess would be based on its location.








# **6. Recommendations**

Many of the factors explored gave no indication of pricing or availability. In the future, to better understand the data we could:


* Include all the data, as using only part of it may have skewed the results.
* Conduct some natural language processing on the review comments to get an idea of common terms and sentiment for listings.
* Group amenities and remove the standard ones to get an idea of the more unique amenities offered by listings with different ratings.
* Conduct analysis on a focused area to get a better idea of what factors outside of location affect pricing.
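As a starting point for the review-text recommendation, a simple term count could look something like the sketch below. The comments and the stop-word list are made up for illustration; a real analysis would read from the comments column and use a fuller stop-word list or a dedicated NLP library.

```python
import re
from collections import Counter

# Made-up review comments standing in for the comments column of reviews.csv
comments = [
    "Great location and very clean.",
    "Clean apartment, great host!",
    "The location was noisy.",
]

# A tiny, hypothetical stop-word list
stop_words = {'and', 'the', 'was', 'very', 'a'}

# Lowercase each comment, split it into words, and count the words that are not stop words
term_counts = Counter()
for comment in comments:
    words = re.findall(r"[a-z']+", comment.lower())
    term_counts.update(word for word in words if word not in stop_words)

print(term_counts.most_common(3))
```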

# **7. References**

Pandas Developers (2022). *Pandas Documentation*. Available at: https://pandas.pydata.org/docs/index.html [Accessed 12 Nov. 2022]

Matplotlib Developers (2020). *Matplotlib 3.1.2 Documentation*. Available at: https://matplotlib.org/3.1.1/contents.html [Accessed 12 Nov. 2022]

Kumar, Ajitesh (2022). *Correlation Concepts, Matrix & Heatmap using Seaborn*. Available at: https://vitalflux.com/correlation-heatmap-with-seaborn-pandas/ [Accessed 19 Nov. 2022]

Landup, David (2021). *How to set axis range*. Available at: https://stackabuse.com/how-to-set-axis-range-xlim-ylim-in-matplotlib/ [Accessed 23 Nov. 2022]

MachineLearningPlus (2021). *Pandas Series to List*. Available at: https://www.machinelearningplus.com/pandas/pandas-series-to-list/ [Accessed 25 Nov. 2022]

Seaborn Developers (2022). *seaborn.jointplot*. Available at: https://seaborn.pydata.org/generated/seaborn.jointplot.html [Accessed 25 Nov. 2022]
