# Objective
When hotels want to increase the value of their rooms, they offer the room amenities that matter most to travelers. In fact, per PwC’s Consumer Intelligence Series report on hotel brand loyalty, “Both business and leisure travelers say room quality is the #1 reason for choosing a hotel.” Hotels have learned the art of maximizing the value of their rooms, but how can Airbnb hosts do the same?

My goal is to help hosts maximize the value of their Airbnb rental. I will do that by estimating how much revenue a rental earns in a month, building a model that predicts whether a rental earns at least the average monthly revenue, and seeing which features contribute most to those earnings.

In [None]:
# Read the data
import re
import pandas as pd
df = pd.read_csv("../input/New York.csv")

In [None]:
df.head()

In [None]:
df.describe()

## Create Monthly Revenue: Price Multiplied by Number of Days Occupied 
Revenue per available room (RevPAR) is a performance metric used in the hotel industry. It is calculated by multiplying a hotel's average daily room rate (ADR) by its occupancy rate. Here I use the monthly equivalent, price multiplied by the number of occupied days in the next 30 days, and store it in a column called RevPAR.

To estimate occupancy, I use the availability_30 column, which records how many of the next 30 days the room is available. If a room is available for 14 days, I treat it as occupied for the remaining 16 days (30 - 14).

In [None]:
df["RevPAR"] = (df['price'] * (30 - df['availability_30']))
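As a quick sanity check on the arithmetic, the sketch below works through one made-up listing. The price and availability values are invented for illustration, and it assumes, as the notebook does, that every unavailable night is a booked night. It also shows that the monthly figure is the hotel-style RevPAR (daily rate times occupancy rate) multiplied by 30.

```python
# Hypothetical listing: the numbers are made up for illustration.
price = 150            # nightly price in dollars
availability_30 = 14   # nights still available in the next 30 days

occupied_nights = 30 - availability_30       # nights treated as booked
occupancy_rate = occupied_nights / 30        # share of the window that is booked
monthly_revenue = price * occupied_nights    # what the notebook stores as RevPAR
hotel_revpar = price * occupancy_rate        # daily rate x occupancy rate

print("Occupied nights:", occupied_nights)
print("Monthly revenue:", monthly_revenue)
print("Hotel-style RevPAR per day:", round(hotel_revpar, 2))
```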




## Amenities
The dataset has a column called amenities, which is the focus of this project. As you can see below, every amenity for a listing is packed into a single string, so I need to clean it.

In [None]:
df['amenities'][0]

## Replace all non-word characters in the amenities column with spaces


In [None]:
# Raw string so that \W is passed to the regex engine unchanged
sc_sub = re.compile(r'\W+')
df['amenities'] = [sc_sub.sub(' ', amenity) for amenity in df['amenities']]
print(df['amenities'][0])

## Transform each amenity into a binary feature 

In [None]:
amenities2 = ['Wireless Internet', 'Air conditioning', 'Pool', 'Kitchen',
       'Free parking on premises', 'Gym', 'Hot tub', 'Indoor fireplace',
       'Heating', 'Family kid friendly', 'Suitable for events', 'Washer',
       'Dryer', 'Essentials', 'Shampoo', 'Lock on bedroom door', 'Cable TV',
       '24 hour check in', 'Laptop friendly workspace', 'Hair dryer']
for amenity in amenities2:
    df[amenity] = df.amenities.str.contains(amenity)
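To make the two steps concrete, here is a minimal sketch on two made-up rows. The raw string format (braces, commas, quotes) is an assumption about what the column looks like before cleaning, and `regex=False` is my own addition so that each amenity name is matched as plain text.

```python
import re
import pandas as pd

# Toy data: the raw format below is an assumption, not taken from the dataset.
toy = pd.DataFrame({
    "amenities": [
        '{TV,"Wireless Internet","Family/kid friendly"}',
        '{Kitchen,"Hair dryer"}',
    ]
})

# Collapse every run of non-word characters into a single space.
non_word = re.compile(r"\W+")
toy["amenities"] = [non_word.sub(" ", text) for text in toy["amenities"]]

# One boolean column per amenity; the match is case-sensitive.
for amenity in ["Wireless Internet", "Family kid friendly", "Kitchen", "Dryer"]:
    toy[amenity] = toy["amenities"].str.contains(amenity, regex=False)

print(toy.drop(columns="amenities"))
```

Because the match is case-sensitive, the `Dryer` flag stays false for a listing that only has a `Hair dryer`, which keeps the two amenities separate.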

## What is the relationship between price and occupancy rate?


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks")

# Fit and plot a linear regression of price on availability_30
sns.lmplot(x="availability_30", y="price", data=df)
plt.show()


There is no clear relationship between price and occupancy rate. Some rooms are cheap and always occupied, and others are cheap and never occupied. This suggests that customers are not booking strictly on price, so hosts have to differentiate themselves in other ways.


## Relationship between RevPAR and Occupancy Rate

In [None]:
# Fit and plot a linear regression of RevPAR on availability_30
sns.lmplot(x="availability_30", y="RevPAR", data=df)
plt.show()

RevPAR decreases as availability increases. However, there appear to be outliers, so I will remove them and check again.

## Remove Outliers
Remove points whose RevPAR is more than 3 standard deviations from the mean.

In [None]:
import numpy as np
df = df[np.abs(df.RevPAR-df.RevPAR.mean())<=(3*df.RevPAR.std())]
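The filter above can be tried on a small made-up column to see what it keeps. The values below are invented for illustration; the logic is the same three-standard-deviation rule.

```python
import numpy as np
import pandas as pd

# Toy data: 19 ordinary listings and one extreme value (all made up).
toy = pd.DataFrame({"RevPAR": [100.0] * 19 + [10000.0]})

mean = toy["RevPAR"].mean()
std = toy["RevPAR"].std()  # pandas returns the sample standard deviation

# Keep rows whose distance from the mean is at most three standard deviations.
within_3_sigma = np.abs(toy["RevPAR"] - mean) <= 3 * std
filtered = toy[within_3_sigma]

print(len(toy), "rows before,", len(filtered), "rows after")
```

Note that the extreme value also inflates the mean and the standard deviation it is compared against, so this rule tends to be conservative on heavily skewed revenue data.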

## Relationship between price and occupancy rate after outlier removal

In [None]:
# Fit and plot a linear regression of price on availability_30
sns.lmplot(x="availability_30", y="price", data=df)
plt.show()



## Relationship between RevPAR and Occupancy Rate after outlier removal
With the outliers removed, the relationship between the two is clearer: RevPAR decreases as availability increases.


In [None]:
# Fit and plot a linear regression of RevPAR on availability_30
sns.lmplot(x="availability_30", y="RevPAR", data=df)
plt.show()

## Average Revenue Per Month
### Exploring our target variable


In [None]:
print("Average Monthly Revenue:", df["RevPAR"].mean())
print("Median Monthly Revenue:", df["RevPAR"].median())

## Create binary target variable
I will create a binary variable to turn this into a classification problem.

In [None]:
# Rooms that earn at least the average monthly revenue will be labeled as 1 (Successful),
# whereas rooms that earn less will be labeled as 0 (Not Successful)
AverageRevPAR = df["RevPAR"].mean()
df["Successful"] = [1 if x >= AverageRevPAR else 0 for x in df["RevPAR"]]
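An equivalent vectorized form of the labeling step is sketched below on made-up revenues. It also shows why the mean and the median were both printed earlier: when revenue is skewed, a mean threshold can leave far fewer than half of the listings labeled as successful.

```python
import pandas as pd

# Toy revenues, made up for illustration; one large value skews the mean.
toy = pd.DataFrame({"RevPAR": [500, 1500, 2500, 7500]})

threshold = toy["RevPAR"].mean()
# Boolean comparison converted to 0/1 labels, same rule as the list comprehension.
toy["Successful"] = (toy["RevPAR"] >= threshold).astype(int)

print("Threshold:", threshold)
print("Labels:", toy["Successful"].tolist())
print("Share labeled successful:", toy["Successful"].mean())
```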

## Check for columns with a lot of missing values

In [None]:
df.isnull().sum()

## Does an amenity contribute to how Successful a room is?

In [None]:
Exploratory_Analysis = ['Wireless Internet', 'Air conditioning', 'Pool', 'Kitchen',
       'Free parking on premises', 'Gym', 'Hot tub', 'Indoor fireplace',
       'Heating', 'Family kid friendly', 'Suitable for events', 'Washer',
       'Dryer', 'Essentials', 'Shampoo', 'Lock on bedroom door',
       '24 hour check in', 'Laptop friendly workspace', 'Hair dryer','is_business_travel_ready']
for r in Exploratory_Analysis :
    print(df.groupby(r)['Successful'].mean())

## Replace missing values with 0
I'm doing this because, when I manually checked rentals, those that had not received any reviews or ratings were recorded as null in this dataset.

In [None]:
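# Assign the result back rather than calling fillna(inplace=True) on a column selection,
# which may not update df in recent pandas versions
review_cols = ['number_of_reviews', 'review_scores_rating', 'reviews_per_month']
df[review_cols] = df[review_cols].fillna(0)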

## Drop columns that won't be used for modeling

__I tested different models and saw that adding more and more features was not necessarily a good thing.__

<br> __Columns that were transformed in some way:__ amenities, price, weekly_price, availability_30, RevPAR
<br> __Columns that won't be used due to a large number of missing values:__ review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value
<br> __Features that cannot be modeled:__ latitude, longitude, id
<br> __Features excluded because they made the model worse:__ cleaning_fee, security_deposit, host_has_profile_pic, bathrooms, bedrooms, beds, accommodates, square_feet, guests_included, minimum_nights, maximum_nights


In [None]:
df.drop(["amenities","id","price","weekly_price", 
         "availability_30", "review_scores_accuracy", "review_scores_cleanliness","review_scores_checkin", 
         "review_scores_communication", "review_scores_location", "review_scores_value",
         "cleaning_fee", "security_deposit",
         "host_has_profile_pic",'bathrooms','bedrooms','beds', 'latitude','longitude',
         'accommodates','square_feet','guests_included','minimum_nights','maximum_nights'], axis=1, inplace = True)

## Establish features to be used and target variable

In [None]:
X = df[[
        'neighbourhood_cleansed','is_location_exact',
       'bed_type',
        'number_of_reviews',
       'instant_bookable', 'cancellation_policy', 'Wireless Internet',
       'Air conditioning', 'Pool', 'Kitchen', 'Free parking on premises',
       'Gym', 'Hot tub', 'Indoor fireplace', 'Heating', 'Family kid friendly',
       'Suitable for events', 'Washer', 'Dryer', 'Essentials', 'Shampoo', 'Cable TV',
       'Lock on bedroom door', '24 hour check in', 'Laptop friendly workspace',
       # number_of_reviews is already listed above, so it is not repeated here
       'Hair dryer','is_business_travel_ready','review_scores_rating','reviews_per_month']]

y = df["Successful"]

## Check for missing values to be sure 

In [None]:
X.isnull().sum()

## Correlation Analysis
Machine learning algorithms like linear and logistic regression can perform poorly when input variables are highly correlated with one another. As a first check, the cell below shows how strongly each numeric column correlates with the target.

In [None]:
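# numeric_only=True (pandas 1.5+) skips text columns, which newer pandas versions refuse to correlate
df[df.columns[1:]].corr(numeric_only=True)['Successful'].drop('Successful')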
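The cell above looks at correlation with the target. To check the concern about correlated inputs directly, one option is to scan the feature-to-feature correlation matrix for strongly related pairs. The sketch below does that on made-up data; the 0.9 cut-off is an arbitrary assumption, not a value from this analysis.

```python
import pandas as pd

# Toy features, made up; reviews_per_month is built to track number_of_reviews.
toy = pd.DataFrame({
    "number_of_reviews": [0, 10, 20, 30, 40, 50],
    "reviews_per_month": [0.0, 1.1, 1.9, 3.2, 3.9, 5.1],
    "review_scores_rating": [90, 70, 95, 60, 85, 75],
})

threshold = 0.9  # hypothetical cut-off for "highly correlated"
corr = toy.corr().abs()

# Visit each pair once by pairing a column only with the columns after it.
columns = list(corr.columns)
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(columns)
    for b in columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_pairs)
```

If a pair like this shows up in the real data, dropping one of the two columns is a simple way to see whether the logistic regression changes.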

## Prepare data for modeling by turning categorical variables into dummy variables

In [None]:
X_train1 = pd.get_dummies(X)

## Logistic Regression 

In [None]:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

# Create Training and Test Dataset with 75% Training and 25% Test
X_train, X_test, y_train, y_test = train_test_split(X_train1, y, test_size=0.25)

# Run Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# Analyze results
print("Results:")
print("Accuracy", metrics.accuracy_score(y_test,y_pred))

# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

#Specificity: When the actual value is negative, how often is the prediction correct?
print("Specificity:",TN / float(TN + FP))

#False Positive Rate: When the actual value is negative, how often is the prediction incorrect?
print("False Positive Rate:",FP / float(TN + FP))

#Precision: When a positive value is predicted, how often is the prediction correct?
print("Precision:",metrics.precision_score(y_test, y_pred))

#Sensitivity (Recall): When the actual value is positive, how often is the prediction correct?
print("Recall:",metrics.recall_score(y_test, y_pred))

print("-----------------------------------------------------------------------")
# examine the class distribution of the testing set (using a Pandas Series method)
print("Class Distribution:", y_test.value_counts())
# calculate the percentage of ones
print("Percentage of Ones:", y_test.mean())

# calculate the percentage of zeros
print("Percentage of Zeros:", 1 - y_test.mean())

# calculate null accuracy (for binary classification problems coded as 0/1)
print("Null Accuracy:",max(y_test.mean(), 1 - y_test.mean()))

print('------------')
print("Improvement in accuracy compared to Naive Model", metrics.accuracy_score(y_test,y_pred) - max(y_test.mean(), 1 - y_test.mean()))


## Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

X_train2, X_test2, y_train2, y_test2 = train_test_split(X_train1, y, test_size=0.25, random_state = 42)

tree = DecisionTreeClassifier(max_depth=8, random_state=0)
tree.fit(X_train2, y_train2)
y_pred2 = tree.predict(X_test2)
print('Accuracy on the training subset: {:.3f}'.format(tree.score(X_train2, y_train2)))
print('Accuracy on the test subset: {:.3f}'.format(tree.score(X_test2, y_test2)))


# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test2, y_pred2)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

#Specificity: When the actual value is negative, how often is the prediction correct?
print("Specificity:",TN / float(TN + FP))

#False Positive Rate: When the actual value is negative, how often is the prediction incorrect?
print("False Positive Rate:",FP / float(TN + FP))

#Precision: When a positive value is predicted, how often is the prediction correct?
print("Precision:",metrics.precision_score(y_test2, y_pred2))

#Sensitivity (Recall): When the actual value is positive, how often is the prediction correct?
print("Recall:",metrics.recall_score(y_test2, y_pred2))

print("--------------------------------------------------------------")
# examine the class distribution of the testing set (using a Pandas Series method)
print("Class Distribution:", y_test2.value_counts())
# calculate the percentage of ones
print("Percentage of Ones:", y_test2.mean())

# calculate the percentage of zeros
print("Percentage of Zeros:", 1 - y_test2.mean())

# calculate null accuracy (for binary classification problems coded as 0/1)
print("Null Accuracy:",max(y_test2.mean(), 1 - y_test2.mean()))

print("--------------")
print("Improvement in accuracy compared to Naive Model", metrics.accuracy_score(y_test2,y_pred2) - max(y_test2.mean(), 1 - y_test2.mean()))
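Both model cells compute the same set of numbers, so the calculation could be collected in one place. The helper below is a hypothetical sketch, not part of scikit-learn; it uses plain Python so the formulas are visible, and it assumes 0/1 labels with both classes present and at least one positive prediction.

```python
def classification_summary(y_true, y_pred):
    """Return the metrics printed in the cells above as a dictionary.

    Hypothetical helper for illustration; labels must be 0 or 1.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    positives = tp + fn
    negatives = tn + fp
    total = positives + negatives
    accuracy = (tp + tn) / total
    # Null accuracy: the score from always predicting the majority class.
    null_accuracy = max(positives, negatives) / total

    return {
        "accuracy": accuracy,
        "specificity": tn / negatives,
        "false_positive_rate": fp / negatives,
        "precision": tp / (tp + fp),
        "recall": tp / positives,
        "null_accuracy": null_accuracy,
        "improvement_over_null": accuracy - null_accuracy,
    }


# Toy labels, made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
summary = classification_summary(y_true, y_pred)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With a helper like this, the same call could be made with `y_test, y_pred` and with `y_test2, y_pred2`, which makes the two models easier to compare side by side.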


## Choosing between Logistic Regression and the Decision Tree

Both models reach about the same accuracy (~70%), which is roughly 8 percentage points better than a naive model that always predicts the majority class. The decision tree, however, outperforms the logistic regression on recall (57% versus 49%). To choose a model, we need to decide which metric has the biggest payoff.

__True Positives__: The model predicts the host will be successful, and they are (this is good).
<br>__False Positives__: The model predicts the host will be successful, but they aren't (we do nothing to help the host, and they don't make money).
<br>__True Negatives__: The model predicts the host will not be successful, and they aren't (this is good, because we can recommend how they can improve).
<br>__False Negatives__: The model predicts the host will not be successful, but they would have been (we recommend changes that were not necessary, and the host wastes money).

__Most Important Metric__: Specificity / Recall
I am optimizing for specificity because a false negative (we reach out to a host when we didn't have to) is more acceptable than a false positive (we don't reach out to a host, and the host fails).
When a host doesn't make money, Airbnb doesn't make money, and no one is happy.
Based on that, the Decision Tree performs better even though its accuracy is marginally worse.


## Most important features according to the Decision Tree

In [None]:
a = zip(X_train1,tree.feature_importances_)
Important_Features = pd.DataFrame(list(a), columns = ['features','FeatureImportances'])
Important_Features.sort_values(by=['FeatureImportances'],ascending = False)

## Analyzing Most Important Features

In [None]:
Avg_rev_ff = df.groupby('Family kid friendly')['RevPAR'].mean()
Suc_ff = df.groupby('Family kid friendly')['Successful'].mean()

print(Avg_rev_ff)
print(Suc_ff)

Rooms that are family kid friendly were successful 52% of the time with an average monthly revenue of 3900 dollars, while those that were not family kid friendly were successful only 30% of the time with an average monthly revenue of 2700 dollars.

In [None]:
Avg_rev_Lock = df.groupby('Lock on bedroom door')['RevPAR'].mean()
Suc_Lock = df.groupby('Lock on bedroom door')['Successful'].mean()

print(Avg_rev_Lock)
print(Suc_Lock)

Rooms that had a lock on the bedroom door were successful only 25% of the time with an average monthly revenue of 2560 dollars, while those that did not have a lock on the bedroom door were successful 44% of the time with an average monthly revenue of 3400 dollars.

This seemed surprising at first, but my best guess is that these listings were private rooms. A host on Airbnb can rent out an entire home or a private room. If a host is renting out an entire home, the bedroom door doesn't need to lock, whereas if the host is staying next door, guests would want their room to lock. These numbers would then make sense, because a host renting out an entire home is bound to make more money, all else equal.

For the next version, I will compare amenities for private rooms versus entire homes to test my hypothesis.
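A rough sketch of how that comparison might look is below. It runs on made-up rows, and `room_type` (with values such as "Private room" and "Entire home/apt") is an assumed column name that this notebook has not used so far.

```python
import pandas as pd

# Toy data, made up; "room_type" is an assumed column, not one used in this notebook.
toy = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt", "Entire home/apt"],
    "Lock on bedroom door": [True, True, False, False, False, True],
    "RevPAR": [1500, 1700, 1400, 4000, 4200, 4100],
})

# Average revenue for each combination of room type and lock flag.
revenue_by_group = toy.groupby(["room_type", "Lock on bedroom door"])["RevPAR"].mean()
print(revenue_by_group)

# Share of listings with a lock within each room type.
lock_share = toy.groupby("room_type")["Lock on bedroom door"].mean()
print(lock_share)
```

If the hypothesis holds in the real data, locks should be much more common among private rooms, and the revenue gap between locked and unlocked listings should shrink once room type is held fixed.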

In [None]:
Avg_rev_AC = df.groupby('Air conditioning')['RevPAR'].mean()
Suc_AC = df.groupby('Air conditioning')['Successful'].mean()

print(Avg_rev_AC)
print(Suc_AC)

Rooms that provide air conditioning were successful 43% of the time with an average monthly revenue of 3371 dollars, while those that did not provide air conditioning were successful only 20% of the time with an average monthly revenue of 2222 dollars.


In [None]:
Avg_rev_BT = df.groupby('is_business_travel_ready')['RevPAR'].mean()
Suc_BT = df.groupby('is_business_travel_ready')['Successful'].mean()

print(Avg_rev_BT)
print(Suc_BT)

Rooms that are business travel ready were successful 74% of the time with an average monthly revenue of 5009 dollars, while those that are not business travel ready were successful only 36% of the time with an average monthly revenue of 3049 dollars.



While it is hard to make your home both family kid friendly and business travel ready, NYC is a big city, so both types of hosts can succeed. However, if I had to pick one, I would make it business travel ready, as these rooms brought in an average of 5009 dollars a month.