# Regression Project: AirBNB Price Prediction

Coded by Luna McBride

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from mpl_toolkits.basemap import Basemap #Plot onto map
import matplotlib.pyplot as plt #Plotting
from pandas import Series,DataFrame
import seaborn as sns
%matplotlib inline

plt.rcParams['figure.figsize'] = (15,10) #Set the default figure size
plt.style.use('ggplot') #Set the plotting method

from sklearn.model_selection import train_test_split #Split the data into train and test
from sklearn.ensemble import RandomForestRegressor #Forest for prediction and regression
from sklearn.linear_model import LinearRegression #Regression for prediction
from sklearn.preprocessing import StandardScaler #Scale the data
from sklearn.metrics import mean_squared_error #Error testing

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
bnb = pd.read_csv("../input/us-airbnb-open-data/AB_US_2020.csv", low_memory=False) #Read the airbnb csv
bnb.head() #Take a peek at the dataset

In [None]:
airbnb_raw = pd.read_csv('../input/us-airbnb-open-data/AB_US_2020.csv', low_memory=False)
airbnb_raw.head()

In [None]:
bnb["price"] = bnb["price"].apply(lambda x: 1 if x < 1 else x) #Make 0's 1 so the log function works

In [None]:
#Print some attributes about the prices
print("Max Price: ", np.max(bnb["price"]))
print("Min Price: ", np.min(bnb["price"]))
print("Num Prices Below 20: ", len(bnb.loc[bnb["price"] < 20]))
print("Num Prices Above 1000: ", len(bnb.loc[bnb["price"] > 1000]))
print("Num Locations", len(bnb))

---

# Check for Null Values

In [None]:
print(bnb.isnull().any()) #Check for null values

In [None]:
print(bnb.loc[bnb["reviews_per_month"].isnull()]) #See where reviews_per_month is null

Reviews_Per_Month and Last_Review appear to be null when there are no reviews. The best way to fill these is probably to set Reviews_Per_Month to 0 and use a dummy date (01-01-01) for Last_Review. That field has to be converted to yyyy-mm-dd anyway, so a non-date placeholder would cause problems.

In [None]:
print(bnb.loc[bnb["host_name"].isnull()]) #See where host_name is null

In [None]:
print(bnb.loc[bnb["name"].isnull()]) #See where name is null

Neither name nor host_name has any inherent importance, since the IDs are what matter. I can decide what to drop later after more exploration, so for now I will fill these with generic, obviously placeholder names: "AIRBNB HOUSING" and "AIRBNB HOST".

As for neighbourhood_group, it appears to exist only to mark out areas of certain cities like New York. I will change nulls to "Other" in this case, as not all cities have neighbourhood groups like New York. The neighbourhood names (besides NY) also look inconsistent in the other null printouts, so that is something to keep in mind.

---

# Fix the Null Values

In [None]:
bnb["name"] = bnb["name"].fillna("AIRBNB HOUSING") #Fill the null name values with "AIRBNB HOUSING"
print(bnb.loc[bnb["name"] == "AIRBNB HOUSING"]) #See where name is fixed to make sure this works

In [None]:
bnb["host_name"] = bnb["host_name"].fillna("AIRBNB HOST") #Fill the null host name values with "AIRBNB HOST"
print(bnb.loc[bnb["host_name"] == "AIRBNB HOST"]) #See where host_name is fixed to make sure this works

In [None]:
bnb["neighbourhood_group"] = bnb["neighbourhood_group"].fillna("Other") #Fill the null neighbourhood group values with "Other"

In [None]:
bnb["reviews_per_month"] = bnb["reviews_per_month"].fillna(0) #Fill the null reviews_per_month values with 0

In [None]:
bnb["last_review"] = bnb["last_review"].fillna("01/01/01") #Fill the null last_review values with 01/01/01
bnb["last_review"] = pd.to_datetime(bnb["last_review"]) #Convert the last review to datetime
print(bnb["last_review"]) #Print the last review

In [None]:
print(bnb.isnull().any()) #Check for null values

All of the null values have been fixed.

---

# Fix Column Names for My Comfort

In [None]:
#Change the column names to ones I prefer
bnb = bnb.rename(columns = {"host_id" : "hostId", "host_name" : "hostName", "neighbourhood_group" : "neighGroup",
                            "neighbourhood" : "neigh", "room_type" : "roomType", "minimum_nights" : "minNights",
                            "number_of_reviews" : "numReviews", "last_review" : "lastReview", "reviews_per_month" : "monthlyReviews",
                            "calculated_host_listings_count" : "numListings", "availability_365" : "available"})
bnb.head() #Take a peek at the dataset

---

# AirBNB Locations

In [None]:
import folium
import folium.plugins as plugins

latitude2020 = bnb["latitude"].tolist()
longitude2020 = bnb["longitude"].tolist()
locations = list(zip(latitude2020, longitude2020))

# Initialize the map:
usa_map = folium.Map(location = [35, -100], zoom_start = 5)
plugins.FastMarkerCluster(data = locations).add_to(usa_map)
usa_map

In [None]:
print(bnb["city"].unique()) #See all the unique "cities" in the data

# Fix City Names into Major Areas

In [None]:
#A list of areas in the dataset that are part of the San Francisco major area
SF = ["Oakland", "Pacific Grove", "San Clara Country", "Santa Cruz County", "San Mateo County", "San Francisco"]

#Fix the cities into their major areas
#Input: the city/state/county named state (this column has so many different things)
#Output: the fixed label
def fixState(state):
    
    #Fix the labels whose major areas are not as clear
    if state == "Broward County":
        return "Miami"
    if state == "Twin Cities MSA":
        return "Minneapolis"
    if state == "Clark County":
        return "Las Vegas"
    
    #Lump labels together if their major area is the same
    if state == "Boston" or state == "Cambridge":
        return "Boston"
    if state == "Portland" or state == "Salem":
        return "Portland"
    if state == "Jersey City":
        return "New York City"
    if state in SF:
        return "San Francisco"
    
    return state #Return the label if it does not need to change

bnb["city"] = bnb["city"].apply(fixState) #Fix the city column with its major areas

In [None]:
print(bnb["city"].unique())

## Split the Data

In [None]:
price = bnb["price"].copy() #Take the price as its own variable. That is what we are looking for
price = np.log(price) #Take the log of the set for normalization

In [None]:
print(bnb.loc[bnb["price"] > 10000])

In [None]:
characteristics = bnb.copy() #Take a copy of the dataframe for usage
characteristics = characteristics.drop(columns = {"price"}) #Remove the price, since we cannot predict price if it is already there
characteristics.head() #Take a peek at the data without the price

There are several columns that I should not take into account here. The name and hostName columns are categorical data in which almost every value is different, so there is not nearly enough memory for pandas to turn them into dummies. The same goes for lastReview if left as a string, and as a datetime (which I set it to) the scaler does not recognize it. Then there is neigh, which I actually tried to use. It created so many dummy variables that fitting the model became an hour-long endeavor, and it only increased accuracy by about 3%. That trade-off is not worth it.
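One way to see why these columns are expensive is to count the distinct values in each non-numeric column before calling `pd.get_dummies`, since each distinct value becomes its own dummy column. The sketch below uses a small invented frame (not rows from the dataset) and a hypothetical helper, `dummyColumnCount`, that is not part of the original notebook.

```python
import pandas as pd

#Toy stand-in for the listings table; every value here is invented for illustration
toy = pd.DataFrame({
    "name": ["Cozy loft", "Sunny room", "Beach house", "Cabin"],
    "roomType": ["Entire home/apt", "Private room", "Entire home/apt", "Private room"],
    "price": [120, 45, 300, 80],
})

#dummyColumnCount (hypothetical helper): count the dummy columns each text column would create
#Input: a dataframe
#Output: distinct-value counts for the non-numeric columns, largest first
def dummyColumnCount(data):
    textColumns = data.select_dtypes(exclude = "number") #Keep only the non-numeric columns
    return textColumns.nunique().sort_values(ascending = False) #One dummy column per distinct value

counts = dummyColumnCount(toy)
print(counts)
```

On the real data, a column whose count is close to the number of rows (such as a listing name) is a likely candidate to drop before building dummies.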

In [None]:
charact = characteristics.drop(columns = {"name", "hostName",  "neigh", "lastReview", "id"}) #Remove the variables discussed above
charact.head() #Take a peek at the data after removing the variables

In [None]:
charact = pd.get_dummies(charact) #Get the dummies for easier model training
scale = StandardScaler() #Add a standard scaler to scale our data for easier use later
scale.fit(charact) #Fit the scaler with our characteristics
chara = scale.transform(charact) #Transform the data with our scaler

In [None]:
print(len(chara[0])) #Print the number of features after encoding and scaling

In [None]:
charaTrain, charaTest, priceTrain, priceTest = train_test_split(chara, price, test_size = 0.1) #Split the data into train and test
print(priceTest) #Print one of the splits

## Fit the Forest Regressor

In [None]:
forest = RandomForestRegressor(n_estimators = 150) #Build a whole forest of trees
forest.fit(charaTrain, priceTrain) #Fit the forest

In [None]:
predict = forest.predict(charaTest) #Get the predictions for RMSE

In [None]:
overallAccuracy = ("Overall", forest.score(charaTest, priceTest)) #Get the overall accuracy (the R^2 score on the test set)
print("Forest Accuracy: ", forest.score(charaTest, priceTest)) #Print the accuracy (R^2)
print("Root Mean Square Error: ", np.sqrt(mean_squared_error(priceTest, predict))) #Print the root mean square error (on log prices)

In [None]:
attributes = charact.columns #Get the tested attributes
attributes = list(zip(attributes, forest.feature_importances_)) #Zip the attributes together with their feature importance
sortAtt = sorted(attributes, key = lambda x: x[1], reverse = True) #Sort the zipped attributes by importance, largest first

print("According to the Random Forest (most accurate), the most important factors for pricing are: ") #Start printing the most important labels

#For each of the top five attributes in the sorted attributes
for label, importance in sortAtt[:5]:
    print(label) #Print the label as an important factor

In [None]:
predictions = pd.DataFrame({"truePrice": priceTest.values, "predPrice": predict}) #Create a dataframe with the predictions
predictions.head(100) #Take a peek at the predictions

In [None]:
error = np.subtract(np.exp(predictions["truePrice"]), np.exp(predictions["predPrice"])) #Get the error in dollars by undoing the log and subtracting the prediction from the true price
b = plt.hlines(500, xmin = 0, xmax = 25000, lw = 3) #Draw an upper reference line at +500
c = plt.hlines(-300, xmin = 0, xmax = 25000, lw = 3) #Draw a lower reference line at -300
a = plt.plot(error, "b.") #Plot the error
plt.show() #Show the plot

The best accuracy I could get was around 60% (the score reported here is the forest's R² on the test set). This only happened when I added back longitude and latitude, which I had removed since I thought the city/major area would cover that. 60% is still not the best, but considering that users set their own prices, it is not surprising. There are listings at 24999 and at 1 in the price field, so I definitely think the hosts do not use the same criteria when determining price.

This is for the overall data, however. I wonder if looking at different locations in isolation will provide better accuracies.
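For reference, the `score` method of a scikit-learn regressor returns R² rather than a classification accuracy, and the RMSE printed above is measured on log prices. A minimal sketch with invented prices, using only NumPy, shows how the two numbers are computed and how a log-scale RMSE can be read as a rough multiplicative error.

```python
import numpy as np

#Invented prices in dollars, for illustration only
truePrice = np.array([50.0, 100.0, 200.0, 400.0])
predPrice = np.array([60.0, 90.0, 250.0, 300.0])

logTrue = np.log(truePrice) #Same log transform applied to the price column
logPred = np.log(predPrice)

#R^2 = 1 - (residual sum of squares / total sum of squares)
ssRes = np.sum((logTrue - logPred) ** 2)
ssTot = np.sum((logTrue - np.mean(logTrue)) ** 2)
r2 = 1 - ssRes / ssTot

rmseLog = np.sqrt(np.mean((logTrue - logPred) ** 2)) #RMSE in log units
typicalFactor = np.exp(rmseLog) #Rough multiplicative error once the log is undone

print("R^2:", r2)
print("RMSE (log scale):", rmseLog)
print("Typical multiplicative error:", typicalFactor)
```

An RMSE of 1 on the log scale corresponds to predictions that are off by a factor of roughly e (about 2.7), so an RMSE below 1 can still mean large errors in dollars.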

---

# Regression by City

In [None]:
#ExtractChara: extracts the desired characteristics like I did step by step for the full dataset
#Input: the area dataset
#Output: the extracted characteristics
def extractChara(data):
    characteristics = data.copy() #Take a copy of the dataframe for usage
    characteristics = characteristics.drop(columns = {"price"}) #Remove the price, since we cannot predict price if it is already there
    
    charact = characteristics.drop(columns = {"name", "hostName", "neigh", "lastReview", "id"}) #Remove the variables discussed before

    charact = pd.get_dummies(charact) #Get the dummies for easier model training
    scale = StandardScaler() #Add a standard scaler to scale our data for easier use later
    scale.fit(charact) #Fit the scaler with our characteristics
    chara = scale.transform(charact) #Transform the data with our scaler
    
    return chara #Return the extracted characteristics

#RandomForest: build a random forest for the given area
#Input: the given characteristics, the prices, and the area this is representing
#Output: the model accuracy with the area
def randomForest(chara, price, area):
    charaATrain, charaATest, priceATrain, priceATest = train_test_split(chara, price, test_size = 0.1) #Split the data into train and test
    
    forest = RandomForestRegressor(n_estimators = 150) #Build a whole forest of trees
    forest.fit(charaATrain, priceATrain) #Fit the forest
    predictA = forest.predict(charaATest) #Get the predictions for RMSE
    
    accuracyA = forest.score(charaATest, priceATest)
    
    print("{} Accuracy: {}".format(area, accuracyA)) #Print the accuracy
    print("{} Root Mean Square Error: {}".format(area, np.sqrt(mean_squared_error(priceATest, predictA)))) #Print the root mean square error
    
    return (area, accuracyA) #Return the accuracy with the area for visualization

In [None]:
areas = bnb["city"].unique() #Get all the unique major areas
accuracies = [overallAccuracy] #Build a list to build the accuracies

for area in areas:
    areaData = bnb.loc[bnb["city"] == area] #Look only at the data for the area
    
    priceArea = areaData["price"].copy() #Take the price as its own variable. That is what we are looking for
    priceArea = np.log(priceArea) #Take the log of the set for normalization
    
    areaData = areaData.drop(columns = {"city"})
    charaArea = extractChara(areaData) #Extract the wanted characteristics
    
    accuracies.append(randomForest(charaArea, priceArea, area)) #Call the random forest function for the specific area

In [None]:
accDF = pd.DataFrame(accuracies, columns = ["Area", "Accuracy"]) #Put the accuracies list into a dataframe

accSort = accDF.sort_values("Accuracy", ascending = False) #Sort the accuracies dataframe by its accuracies
accSort.plot.bar(x = "Area", y = "Accuracy") #Plot the model accuracy for each major area in a bar graph

---

# Overall Removing Extraneous Values

The max value is 24999 and the min value is 0, so all of these mean errors are going to have issues. So, I will see what happens if I remove unusually low and high values (keep 20 < x < 2000).
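Before trimming, it can help to check how many listings the bounds would remove. The sketch below runs on invented prices and uses `lower` and `upper` as assumed names for the same 20 and 2000 cutoffs that appear in the trimming cell.

```python
import pandas as pd

prices = pd.Series([0, 15, 45, 120, 300, 1500, 24999]) #Invented prices for illustration

lower, upper = 20, 2000 #Assumed cutoffs; adjust as needed
keep = (prices > lower) & (prices < upper) #Strict bounds on both sides

print("Kept:", keep.sum()) #Number of listings inside the bounds
print("Dropped:", (~keep).sum()) #Number of listings outside the bounds
print("Share kept:", keep.mean()) #Fraction of listings that survive the trim
```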

In [None]:
bnbTrim = bnb.loc[bnb["price"] > 20] #Trim out values of 20 and lower
bnbTrim = bnbTrim.loc[bnbTrim["price"] < 2000] #Trim out values of 2000 and higher
print("Max: ",np.max(bnbTrim["price"])) #Print the current max price
print("Min: ",np.min(bnbTrim["price"])) #Print the current min price

The accuracy seems to have gone down upon removing values, which suggests the extreme prices were not what was hurting the mean errors. The root mean square error has remained below 1 every time, though.

In [None]:
priceTrim = bnbTrim["price"].copy() #Take the price as its own variable. That is what we are looking for
priceTrim = np.log(priceTrim) #Take the log of the set for normalization
    
charaTrim = extractChara(bnbTrim) #Extract the wanted characteristics
    
area, accuracyTrim = randomForest(charaTrim, priceTrim, "Trimmed") #Call the random forest function for the specific area

---

# RMSE Linear Regression

In [None]:
airbnb = airbnb_raw.copy()

In [None]:
airbnb = airbnb.replace(np.nan,0)
airbnb.info()

In [None]:
airbnb['neighbourhood_group'].unique()

In [None]:
sns.countplot(x='neighbourhood_group',data=airbnb)
plt.xticks(rotation=90)

In [None]:
airbnb['last_review'] = pd.to_datetime(airbnb['last_review'])
airbnb['last_review'] = pd.to_numeric(airbnb['last_review'])
airbnb['price'].corr(airbnb['last_review']) 
#last_review can be dropped too since it has very little correlation with price

In [None]:
airbnb = airbnb.drop(['id','name','host_id','host_name','latitude','longitude','last_review'],axis=1)
airbnb.head()

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

airbnb['neighbourhood_group'] = airbnb['neighbourhood_group'].replace(0,'null')
neighbourhood_group = DataFrame({'Neighbourhood_group':airbnb['neighbourhood_group'].unique()})
code = encoder.fit_transform(neighbourhood_group['Neighbourhood_group'])
neighbourhood_group['Code'] = code
neighbourhood_group

In [None]:
neighbourhood = DataFrame({'Neighbourhood':airbnb['neighbourhood'].unique()})
neigh_code = encoder.fit_transform(neighbourhood['Neighbourhood'])
neighbourhood['Code'] = neigh_code               
neighbourhood

In [None]:
room_type = DataFrame({'Room type':airbnb['room_type'].unique()})
room_code = encoder.fit_transform(room_type['Room type'])
room_type['Code'] = room_code
room_type

In [None]:
city = DataFrame({'City' : airbnb['city'].unique()})
city_code = encoder.fit_transform(city['City'])
city['Code'] = city_code
city

In [None]:
airbnb['neighbourhood_group'] = encoder.fit_transform(airbnb['neighbourhood_group'])
airbnb['neighbourhood'] = encoder.fit_transform(airbnb['neighbourhood'])
airbnb['room_type'] = encoder.fit_transform(airbnb['room_type'])
airbnb['city'] = encoder.fit_transform(airbnb['city'])

In [None]:
airbnb.info()

In [None]:
def normalise(feature):
    nmx = 100
    nmn = 0
    
    mx = feature.max()
    mn = feature.min()
    
    return ((nmx-nmn) / (mx-mn) * (feature-mx) + nmx)

norairbnb = normalise(airbnb)

In [None]:
norairbnb.describe()

In [None]:
norairbnb['minimum_nights'] = norairbnb['minimum_nights'].astype(int)
norairbnb['reviews_per_month'] = norairbnb['reviews_per_month'].astype(int)

In [None]:
sns.heatmap(norairbnb.corr(),annot=True)

In [None]:
Y = norairbnb['price']
X = norairbnb.drop('price',axis=1)

#  Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X,Y)

In [None]:
print(f'Intercept of the best fit line is {reg.intercept_}')
print(f'Number of coefficients is {len(reg.coef_)}')
coef_df = DataFrame({'Variable':X.columns,'Coeff':reg.coef_})
coef_df

# Prediction

In [None]:
reg1 = LinearRegression()

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y, test_size = 0.2, random_state=42)

In [None]:
reg1.fit(x_train,y_train)
y_pred = reg1.predict(x_test)
rms = np.sqrt(np.mean((y_pred-y_test)**2)) #Square the errors, average them, then take the root
print(f'Root mean square error is {rms}')

In [None]:
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 150, random_state = 42)
regressor.fit(x_train, y_train)  
print("Forest Accuracy: ", regressor.score(x_test, y_test)) #Score (R^2) on the held-out test set rather than on the training data

In [None]:
pred_df = DataFrame({'Actual':y_test,'Predict':y_pred}) #y_pred holds the linear regression predictions from reg1
pred_df

# Conclusion

In this project, the model only reached about 60% accuracy with the random forest, but root mean square errors remained below 1 the whole time. This likely means price cannot be fully predicted, due to the strangeness that comes with users inputting their own prices. The accuracy could be higher or lower by major area, so it seems the areas are also inconsistent.

Looking at the characteristics the model determined were most important, these were roomType_Entire home/apt, longitude, latitude, id, and monthlyReviews. Of course higher monthly reviews and getting an entire house would make the price go higher. It means more people are coming in and do not want to share the AirBNB. This makes perfect sense. The ID being here means that price depends on the property itself, which could have been hidden in the title or similar fields. The method I was trying to practice just did not suit NLP. As for latitude and longitude, I am quite surprised. I originally cut them out since they were encompassed in the city/major area variable, but that variable never made it into the top spot, even when ignoring items like latitude and longitude. These appear to be more important in grouping major areas together rather than looking at one as a whole, which is pretty interesting.