In this project, we explore the Seattle Airbnb dataset following the [CRISP-DM](https://www.datascience-pm.com/crisp-dm-2/) process.

CRISP-DM follows six steps:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment

## Business Understanding

[Airbnb](https://en.wikipedia.org/wiki/Airbnb) is an online marketplace for lodging, primarily homestays for vacation rentals and tourism activities, operating in the US since 2008. The business runs on customer data, and analyzing that data supports faster growth and higher user engagement. Airbnb has published its Seattle data on [Kaggle](https://www.kaggle.com/datasets/airbnb/seattle?resource=download) for analysis.

This notebook uses the Seattle data from Kaggle to answer these questions:
1. How do Airbnb prices change over time (by week, by month)?
2. What factors impact the price of a house on Airbnb?
3. What factors impact users' experiences?
4. How should you set your Airbnb price?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from gensim.parsing import remove_stopwords, strip_punctuation
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

calendar = pd.read_csv("../input/seattle/calendar.csv")
listings = pd.read_csv("../input/seattle/listings.csv")
reviews = pd.read_csv("../input/seattle/reviews.csv")

## calendar.csv

In [2]:
calendar.head()

In [3]:
calendar.info()

calendar.csv has 4 columns, and we can see some characteristics of the data:
* `listing_id` contains the id of a room on Airbnb.
* `price` is the only column that contains null values.
* `date` is stored as a string in the format `yyyy-mm-dd`.
* `available` is stored as a string with only two values, `t` and `f`, which presumably mean True and False.
* `price` is stored as a string in the format `$x,xxx.xx`.

In [4]:
calendar_null_per = calendar.isnull().sum() / len(calendar)
sns.barplot(x=calendar_null_per.index, y=calendar_null_per.values).set_title(
    '% null value in calendar dataset')
plt.show()

In [5]:
calendar.available.value_counts(normalize=True)

In [6]:
calendar[calendar.available == 'f'].price.isnull().all()

* 32.94% of the values in the `price` column are null.
* Whenever `available` is `f`, `price` is null.

As we can see, `price` is the only column with null values.
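
The link between `available` and `price` can be checked in both directions. The sketch below uses a small hypothetical frame (`toy`) with the same column layout as calendar.csv; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of calendar.csv (values are illustrative only)
toy = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "available": ["t", "f", "t", "f"],
    "price": ["$85.00", None, "$1,200.00", None],
})

# Every unavailable row should have a missing price ...
unavailable_all_null = toy.loc[toy.available == "f", "price"].isnull().all()
# ... and every available row should carry a price
available_all_priced = toy.loc[toy.available == "t", "price"].notnull().all()

# If both hold, the share of missing prices equals the share of unavailable rows
null_share = toy.price.isnull().mean()
unavailable_share = (toy.available == "f").mean()
```

If both checks pass on the real data, the null prices carry no extra information beyond the `available` flag.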

In [7]:
calendar.listing_id.duplicated().any()

In [8]:
calendar.listing_id.value_counts().unique()

* Each `listing_id` appears 365 times, that is, once per day for one year.
* calendar.csv provides the availability status and price of each room over a year.
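
A slightly stricter version of this check also confirms that the rows per listing are distinct days covering one continuous window. This is a minimal sketch on a hypothetical two-listing calendar; the start date is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical calendar with two listings and a 365-day window (illustrative only)
dates = pd.date_range("2016-01-01", periods=365, freq="D")
toy = pd.DataFrame({
    "listing_id": [1] * 365 + [2] * 365,
    "date": list(dates) * 2,
})

# One row per listing per day: 365 rows each and no repeated (listing, date) pair
rows_per_listing = toy.groupby("listing_id").size()
all_listings_full_year = (rows_per_listing == 365).all()
has_duplicate_days = toy.duplicated(["listing_id", "date"]).any()

# Length of the covered window in days, counting both end points
span_days = (toy.date.max() - toy.date.min()).days + 1
```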

### Processing data
Before exploring the calendar data further, I will normalize it in two steps:
1. Cast the `date` column to datetime.
2. Convert the `price` column to float.

In [9]:
# convert a price string such as "$1,000.00" to a float
def price_to_float(x):
    return float(x.replace(",", "")[1:])

In [10]:
# cast column date to type datetime
calendar['date'] = pd.to_datetime(calendar.date)
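# fill missing prices with the default value $0.00
# (assign the result back: fillna(inplace=True) on a column selection may not update the frame in recent pandas)
calendar['price'] = calendar.price.fillna("$0.00")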
# cast type of column price to float
calendar['price'] = calendar.price.apply(price_to_float)

calendar.price
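
Filling missing prices with `$0.00` is safe only while later steps filter on `available == 't'`; otherwise the zeros pull averages down. As an alternative sketch (not what this notebook does), the conversion can be vectorized and missing prices kept as NaN, which `mean()` skips by default. The series `raw` below is hypothetical.

```python
import pandas as pd

# Hypothetical price strings in the "$x,xxx.xx" format, with one missing value
raw = pd.Series(["$85.00", "$1,200.00", None])

# Strip "$" and thousands separators, then convert; missing values stay NaN
clean = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True))

# NaN is ignored here, so the mean is taken over the two real prices only
mean_price = clean.mean()
```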

Let's take a look at the average price in the dataset.

In [11]:
plt.figure(figsize=(12,6))
calendar[calendar.available == 't'].groupby("date").price.mean().plot()
plt.title("Average price by day")
plt.show()

The average price increases from the beginning of the year and peaks in July, up around 29% (about $35).

After that, the price decreases slightly. Overall, the average price of Seattle Airbnb rooms increases by around 17% ($20) over the year.
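
Percentages like these are relative changes against the starting price. The sketch below shows the arithmetic on hypothetical monthly means; the numbers are illustrative and are not the values from this dataset.

```python
import pandas as pd

# Hypothetical monthly mean prices (illustrative numbers, not taken from the dataset)
monthly_mean = pd.Series({"January": 120.0, "July": 150.0, "December": 138.0})

start = monthly_mean["January"]
peak_abs = monthly_mean.max() - start            # absolute increase at the peak
peak_pct = peak_abs / start * 100                # relative increase at the peak, in %
year_pct = (monthly_mean["December"] - start) / start * 100  # change over the whole year, in %
```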

Let's look at the average price by month and see whether anything else stands out.

In [12]:
calendar['month'] = calendar.date.dt.month_name()
month_summary=calendar[calendar.available == 't'].groupby("month", sort=False).agg({"price": "mean", "available": "count"}).reset_index()

In [13]:
calendar['year'] = calendar.date.dt.year

In [14]:
month_price_mean = calendar[calendar.available == 't'].groupby("month", sort=False).price.mean()
month_available = calendar[calendar.available == 'f'].groupby("month", sort=False).available.count()

plt.figure(figsize=(10,7))
ax1 = plt.subplot(211)
sns.lineplot(data=month_price_mean)
ax1.set_title("Average price changed by month")

ax2 = plt.subplot(212)
sns.lineplot(data=month_available)
ax2.set_title("Amount of rented room by month")
plt.tight_layout()
plt.show()

From these two charts, we can see:

* From January to March, the number of available rooms and the average price both increase.
* From March to the end of the year, when the number of available rooms decreases, the average price tends to increase, and vice versa.
* In July, the average price is the highest of the year and the number of available rooms is the lowest.
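
Reading two stacked charts by eye can be complemented with a single number. The sketch below computes the Pearson correlation between the monthly mean price and the monthly count of unavailable nights; the `summary` frame is hypothetical and only illustrates the calculation.

```python
import pandas as pd

# Hypothetical monthly summary (illustrative numbers only)
summary = pd.DataFrame(
    {
        "mean_price": [122.0, 128.0, 135.0, 152.0, 140.0],
        "unavailable": [40_000, 33_000, 38_000, 45_000, 40_000],
    },
    index=["January", "March", "May", "July", "September"],
)

# Pearson correlation between the price level and the number of unavailable nights
corr = summary["mean_price"].corr(summary["unavailable"])
# Month with the highest mean price
peak_month = summary["mean_price"].idxmax()
```

With only twelve monthly points in the real data, such a correlation is a rough indication rather than strong evidence.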


In [15]:
calendar['day_of_week'] = calendar.date.dt.day_name()
dow_avg_price = calendar[calendar.available == 't'].sort_values('date').groupby("day_of_week", sort=False).price.mean()
dow_available = calendar[calendar.available == 't'].sort_values('date').groupby("day_of_week", sort=False).available.count()

plt.figure(figsize=(10,7))
plt.subplot(211)
sns.lineplot(data=dow_avg_price)

plt.subplot(212)
sns.lineplot(data=dow_available)
plt.show()
# sns.lineplot(x=dow_avg_price.index, y=dow_avg_price.values).set_title("Average price by day of week")
# plt.show()

On weekends, more people come to Seattle and the average price tends to increase slightly.

So, the answer to question 1, **How do Airbnb room prices change over time?**, is:

*From January to March, hosts raise their prices. In summer, many travelers come to Seattle, so the number of available rooms decreases, which tends to push room prices up. The average room price in Seattle also tends to increase on weekends.*
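
The weekend effect can be quantified directly. The sketch below orders the weekdays explicitly with a categorical type and compares weekend nights with the rest of the week. The data are hypothetical, and treating Friday and Saturday nights as the "weekend" is an assumption.

```python
import pandas as pd

# Hypothetical available-night prices over two weeks starting on a Monday (illustrative only)
toy = pd.DataFrame({"date": pd.date_range("2016-01-04", periods=14, freq="D")})
toy["price"] = [100, 100, 100, 100, 110, 120, 105] * 2  # Monday..Sunday pattern

order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
toy["day_of_week"] = pd.Categorical(toy.date.dt.day_name(), categories=order, ordered=True)

# Grouping on an ordered categorical returns the weekdays in calendar order
dow_mean = toy.groupby("day_of_week", observed=True).price.mean()

# dayofweek: Monday=0 ... Sunday=6; Friday and Saturday nights count as weekend here
is_weekend_night = toy.date.dt.dayofweek.isin([4, 5])
weekend_premium = (
    toy.loc[is_weekend_night, "price"].mean() - toy.loc[~is_weekend_night, "price"].mean()
)
```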

In [16]:
calendar['day'] = calendar.date.dt.day
avg_price_by_day = calendar.groupby(['month', 'day'], sort=False).price.mean().reset_index()
avg_price_by_day

In [17]:
avg_price_by_day_pivot = avg_price_by_day.pivot(index='day', columns='month', values='price')

In [18]:
# cm = sns.light_palette("green", as_cmap=True)
avg_price_by_day_pivot.describe().style.highlight_max(axis=1, props='color:white; font-weight:bold; background-color:darkblue;')

In [19]:
plt.figure(figsize=(12,6))
sns.lineplot(data=avg_price_by_day_pivot)
plt.show()

* **December** seems to have the **highest** price of the year.
* **January** has the **highest** price volatility.
* **July**, the busiest time of the year, has the **lowest** price volatility and the **highest** average price.
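
"Volatility" here can be read as the spread of daily prices within a month. One way to compare months with different price levels is the coefficient of variation (standard deviation divided by mean). This is a sketch on hypothetical daily prices, not the notebook's own calculation.

```python
import pandas as pd

# Hypothetical daily mean prices for two months (illustrative only)
daily = pd.DataFrame({
    "month": ["January"] * 4 + ["July"] * 4,
    "price": [110.0, 130.0, 115.0, 125.0, 150.0, 152.0, 151.0, 151.0],
})

stats = daily.groupby("month", sort=False).price.agg(["mean", "std"])
# Coefficient of variation: spread relative to the price level
stats["cv"] = stats["std"] / stats["mean"]

most_volatile = stats["cv"].idxmax()
highest_mean = stats["mean"].idxmax()
```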

In [20]:
listing_by_status = calendar.groupby("listing_id").available.value_counts().reset_index(name='count')

In [21]:
plt.figure(figsize=(8,8))
sns.boxplot(x='available', y='count', data=listing_by_status)
plt.show()

## reviews.csv

In [22]:
reviews

In [23]:
reviews.info()

In [24]:
reviews_null = reviews.isnull().sum() / len(reviews)

In [25]:
reviews_null

In [26]:
sns.barplot(x=reviews_null.index, y=reviews_null.values).set_title(
    "% null value in reviews.csv")
plt.show()

Only 0.02% of the values in the `comments` column are null. Let's look at the null rows to decide whether to delete them.

In [27]:
reviews[reviews.comments.isnull()]

`comments` contains the user's experience after staying in the host's room. Since only 0.02% of the `comments` values are null, we can drop those rows.

In [28]:
reviews = reviews.dropna().reset_index(drop=True)

In [29]:
reviews['date'] = pd.to_datetime(reviews.date)

In [30]:
reviews.groupby("date").date.count().plot(figsize=(10,7))

In [31]:
reviews.date.max()

The review dates span a wide range, from 2009 to 2016, and most comments come from the last year. Let's look at the distribution of reviews per room and the review timeline for more information.

In [32]:
room_count = reviews.listing_id.value_counts()

In [33]:
room_count.describe()

In [34]:
plt.figure(figsize=(12,6))
sns.histplot(room_count, bins=40)

* Most rooms have around 10 reviews.
* Some rooms have a lot of reviews. Let's look at these rooms to learn more.

In [35]:
top10_reviews = reviews.listing_id.value_counts().head(10).index
col,row = 2,5
plt.figure(figsize=(12,7))
for i in range(col*row):
    plt.subplot(row,col,i+1)
    d = reviews[reviews.listing_id == top10_reviews[i]].groupby('date').comments.count()
    # x and y are passed as vectors, so no `data` argument is needed
    sns.lineplot(x=d.index, y=d.values).set_title(f"Reviews of id:{top10_reviews[i]} by date")
plt.tight_layout()
plt.show()

Among the 10 rooms with the most reviews:
* The comments span a long time range (from 2010).
* They receive comments every year.

In [36]:
def tokenize_word(s):
    s = s.lower()
    s = remove_stopwords(s)
    s = strip_punctuation(s)
    return s

## listings.csv

In [37]:
listings.head()

In [38]:
listings["accommodates"].value_counts()

In [39]:
listings.info()

In [40]:
listings.dtypes.value_counts()

In [41]:
col_missing = listings.columns[listings.isnull().sum() > 0]
plt.figure(figsize=(10,10))
listings_null = (listings[col_missing].isnull().sum() / len(listings)).sort_values(ascending=False)
sns.barplot(x=listings_null.values, y=listings_null.index).set_title(
            "% null value in listings.csv")
# plt.xticks(rotation=90)
plt.show()

Some information about listings.csv:
* It contains 92 columns.
* 30 columns are numeric; the others are strings.
* Some columns have a very high percentage of null values (e.g. `license`: 100%, `square_feet`: 97%).
* It contains a lot of information related to each room: host information, house details, price, reviews, address, description, and so on.
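
Columns that are almost entirely empty carry little usable signal. One common way to handle them is to drop every column whose null share exceeds a threshold. The sketch below works on a hypothetical frame, and the 0.7 cut-off is an arbitrary assumption. The notebook itself takes a different route later and simply selects a handful of feature columns.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of a listings-like table (illustrative only)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "license": [np.nan] * 4,
    "square_feet": [np.nan, np.nan, np.nan, 800.0],
    "bedrooms": [1.0, 2.0, np.nan, 3.0],
})

# Fraction of missing values per column
null_share = toy.isnull().mean()
# The 0.7 cut-off is an arbitrary assumption for illustration
mostly_empty = null_share[null_share > 0.7].index.tolist()
trimmed = toy.drop(columns=mostly_empty)
```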

listings.csv is effectively the *metadata* of houses on Airbnb. It contains very detailed information about each house, such as address, host, and price.

In the next section, let's explore some of this metadata and build a model that predicts the house price from it.

In [42]:
listings.review_scores_rating.describe()

Most users give very high rating scores on Airbnb, which suggests they had a good experience with their Airbnb homes.

In [43]:
listings['price'] = listings.price.apply(price_to_float)

In [44]:
listings.price.plot(kind='box')

About house prices on Airbnb:

* Most prices are under $200.
* A few prices are much higher, with the highest reaching $1,000, which is very expensive.

In [45]:
negative_id = listings[listings.review_scores_value < 6].id
neg_review = reviews[reviews.listing_id.isin(negative_id)]
neg_review.comments

In [63]:
neg_text = neg_review.comments.apply(tokenize_word).tolist()
# text = [tokenize_word(x) for x in text]
neg_wordcloud = WordCloud().generate(" ".join([x for x in neg_text]))
plt.figure(figsize=(13,8))
plt.imshow(neg_wordcloud, interpolation='bilinear')
plt.axis("off")

Most users who give low scores complain that the service did not meet their expectations. Some of the reviews are automated postings. Reasons for a low score include a price that is too high, the host canceling the reservation, insufficiently detailed information, or extra fees.

In [64]:
top_100 = reviews.listing_id.value_counts().head(100).index  # the 100 most-reviewed listings
text = reviews[reviews.listing_id.isin(top_100)].comments.apply(tokenize_word).tolist()
# text = [tokenize_word(x) for x in text]
wordcloud = WordCloud().generate(" ".join([x for x in text]))

In [65]:
plt.figure(figsize=(13,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

Location and type of room appear to be the most important factors in the user's experience. Besides that, beautiful views, a friendly host, and a clean house also matter.

The answer to the question **What factors impact users' experiences?** is:

An apartment that has a *great view* and is *comfortable*, *near downtown*, and *clean* attracts more users. *Accurate* and *detailed information* on Airbnb also matters to users.
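
A word cloud shows frequency only visually. Explicit counts make the same reading easier to verify. The sketch below counts words with `collections.Counter` on a few hypothetical review snippets; the tiny stop-word set is an assumption for illustration, and a real analysis would use a fuller list.

```python
import re
from collections import Counter

# Hypothetical review snippets (illustrative only)
comments = [
    "Great location, clean room and a friendly host.",
    "The location was perfect and the room was clean.",
    "Friendly host, great view!",
]

# Tiny illustrative stop-word set; a real analysis would use a fuller list
stop_words = {"the", "and", "a", "was"}

# Lowercase, split into words, and drop stop words
tokens = [
    word
    for text in comments
    for word in re.findall(r"[a-z']+", text.lower())
    if word not in stop_words
]
counts = Counter(tokens)
top_words = counts.most_common(3)
```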

In the last section, let's build a model that helps predict the price of a house and find out which factors impact it.

In [48]:
feature_cols = ['id','room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'price']

In [49]:
df2train = listings[feature_cols].copy().reset_index(drop=True)

In [50]:
df2train = pd.concat([df2train.drop(columns='room_type'),
                      pd.get_dummies(df2train.room_type, prefix="room_type is ", prefix_sep=" ")], axis=1)
df2train = pd.concat([df2train.drop(columns='bed_type'),
                      pd.get_dummies(df2train.bed_type, prefix="bed_type is ", prefix_sep=" ")], axis=1)

In [51]:
df2train = df2train.fillna(0)

In [52]:
train = df2train[[x for x in df2train.columns if x not in ['id', 'price']]]
label = df2train.price

In [53]:
train_corr = train.corr()
plt.figure(figsize=(10,8))
sns.heatmap(data=train_corr, cmap="YlGnBu").set_title("Feature correlation")
plt.show()

In [54]:
price_vs_feature = train.corrwith(df2train.price)
plt.figure(figsize=(10,8))
sns.barplot(x=price_vs_feature.values, y=price_vs_feature.index).set_title("Correlation between price and house characteristics")
plt.show()

In [55]:
x_train, x_test, y_train, y_test = train_test_split(train, label, test_size=0.25, random_state=10)

In [56]:
model = LinearRegression()
model.fit(x_train, y_train)
y_test_pred = model.predict(x_test)
r2_score(y_test, y_test_pred)

In [57]:
coefs = pd.DataFrame(
   model.coef_,
   columns=['Coefficients'], index=train.columns
)

coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Feature importance for model')
plt.axvline(x=0, color='.5')
plt.subplots_adjust(left=.3)

The feature `room_type` with value `Entire home/apt` has the highest coefficient in the model, while `room_type` with value `Shared room` has the lowest. Besides that, `bedrooms` and `bathrooms` also have large coefficients.
So, **the type of room, the number of bedrooms, and the number of bathrooms** are important features that impact the **price** of the house.
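
One caveat when reading coefficients as importance: a raw coefficient is the price change per one unit of the feature, so features on different scales are not directly comparable. A common remedy is to standardize the features first. The sketch below uses synthetic data with known effects; the feature names mirror the notebook, but the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on different scales (illustrative only)
X = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=200),
    "bathrooms": rng.integers(1, 3, size=200),
    "accommodates": rng.integers(1, 10, size=200),
})
# Synthetic price: the true per-unit effects are 40, 25 and 5, plus noise
y = 40 * X.bedrooms + 25 * X.bathrooms + 5 * X.accommodates + rng.normal(0, 10, size=200)

# Same model fitted on raw and on standardized features
raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)

coefs = pd.DataFrame(
    {"per_unit": raw.coef_, "per_std_dev": scaled.coef_},
    index=X.columns,
)
```

The `per_std_dev` column expresses each effect per one standard deviation of the feature, which puts the features on a common footing when ranking them.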

Based on the result above, the answer to the question **What are the factors that impact the price of a house on Airbnb?** is:

Room type, number of bedrooms, number of bathrooms, and the number of people the house can accommodate are factors that impact the price of a house. In addition, as the calendar analysis above shows, the season and the number of available houses also impact the price of a house on Airbnb.

In [58]:
listings.room_type.value_counts()

In [59]:
room_type_mapping = listings.set_index("id").room_type.to_dict()
calendar['room_type'] = calendar.listing_id.map(room_type_mapping)
calendar

In [60]:
room_type_by_month = calendar[calendar.available == 'f'].groupby(["month"], sort=False).room_type.value_counts().reset_index(name='count')
room_type_pivot = room_type_by_month.pivot(index='month', columns='room_type', values='count')
room_type_pivot

In [61]:
plt.figure(figsize=(10,7))
plt.subplot(311)
sns.lineplot(data=room_type_pivot['Entire home/apt'])

plt.subplot(312)
sns.lineplot(data=room_type_pivot['Private room'])

plt.subplot(313)
sns.lineplot(data=room_type_pivot['Shared room'])

plt.show()

In [62]:
plt.figure(figsize=(12,6))
sns.lineplot(data=room_type_pivot)
plt.show()

## Conclusion

In this section, I will summarize the results of the first three questions and answer the last question: **How to set your Airbnb price?**

