# <u> Using Machine Learning Models to Predict Average Ratings for Airbnb Listings </u>

#### Blog by Shreya Kaundal

Every day, vast amounts of data are collected on a multitude of topics, from user behavior to natural phenomena. Machine learning lets us study the patterns in that data and use them for prediction and recognition. It is easy to be overwhelmed by the technical detail needed to fully understand machine learning, but this post aims to simplify the process of solving a machine learning problem.

The first step in any machine learning problem is to know and understand your data - if you don’t fully understand your dataset, the model you build may as well be useless. The data I am working with for this project is __[Airbnb listings in New York City from Kaggle](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data)__. The recorded columns, called features, are the names of the listings and their unique ids, the hosts’ names and their unique ids, location data, the type of residence, the price of the listing, the minimum length of stay, data about reviews, and the number of other listings the host has - the full list is in table 1 below. For this problem, we want to predict the reviews per month of a given listing, which leads to our first question: which of these features help predict the number of reviews per month? If we print out some statistics (like ranges) and draw scatter plots, we can see that the price of the listing and the length of stay are numeric features that correlate with the number of reviews per month (figures 1 and 2 respectively), and common sense suggests that location and the other review data can help predictions too.
![alt text](table_1.png "List of Features")
![alt text](1st_fig.png "Scatterplot of Price vs Reviews per Month")
![alt text](2nd_fig.png "Scatterplot of Length vs Reviews per Month")
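
The kind of summary statistics mentioned above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used for the figures: the five price and review values are invented for the example, and `value_range` and `pearson_correlation` are hypothetical helper names.

```python
from math import sqrt
from statistics import mean


def value_range(values):
    """Return the (minimum, maximum) of a numeric column."""
    return min(values), max(values)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented toy values, NOT real rows from the Kaggle dataset.
prices = [50, 80, 120, 200, 350]
reviews_per_month = [3.2, 2.5, 1.8, 1.0, 0.4]

print("price range:", value_range(prices))
print("correlation:", round(pearson_correlation(prices, reviews_per_month), 2))
```

A correlation close to -1 or 1 points to a strong linear relationship, while a value near 0 suggests the feature carries little linear signal on its own. On a full dataset, pandas offers `DataFrame.describe()` and `DataFrame.corr()` for the same purpose.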

After understanding our data and checking each feature for its relevance to what we want to predict, we have to format the data so that a machine learning algorithm can read it. For example, an algorithm does not know the difference between the words “sad” and “happy”, so we have to encode their meaning as numbers that the algorithm can associate with those words. In the NYC Airbnb dataset, I split the feature that stored the date of the last review of a listing into separate day, month, and year columns, so the algorithm could find patterns in how the timing of the last review relates to a listing getting more or fewer reviews on average. I also filled in missing values, scaled the numeric data so that features with different ranges can be compared fairly, and split categorical features into separate columns encoded as numbers. Again, all of this is done so that the machine learning algorithms we test can read the data accurately and find patterns in it more easily.
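
The preprocessing steps described above can be sketched as follows. This is an illustrative sketch using only the standard library, not the code used for the project: the three rows are invented, the column names (`last_review`, `price`, `room_type`) are assumptions, and the helper functions are hypothetical. Filling gaps with the column mean is also just one possible choice.

```python
from datetime import date
from statistics import mean, pstdev

# Invented toy rows; the column names are assumptions for illustration.
listings = [
    {"last_review": "2019-06-23", "price": 150.0, "room_type": "Private room"},
    {"last_review": "2018-11-02", "price": None, "room_type": "Entire home/apt"},
    {"last_review": "2019-07-05", "price": 90.0, "room_type": "Private room"},
]


def split_date(iso_string):
    """Turn 'YYYY-MM-DD' into separate year, month and day numbers."""
    parsed = date.fromisoformat(iso_string)
    return {"year": parsed.year, "month": parsed.month, "day": parsed.day}


def fill_missing(values):
    """Replace missing (None) entries with the mean of the known values."""
    known_mean = mean(v for v in values if v is not None)
    return [known_mean if v is None else v for v in values]


def standard_scale(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread for v in values]


def one_hot(values):
    """Encode each category as its own 0/1 column."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


prices = standard_scale(fill_missing([row["price"] for row in listings]))
room_types = one_hot([row["room_type"] for row in listings])

# Combine the date parts, the scaled price and the 0/1 columns per listing.
encoded = [
    {**split_date(row["last_review"]), "price": price, **room}
    for row, price, room in zip(listings, prices, room_types)
]
print(encoded[0])
```

With a real dataset, libraries such as pandas and scikit-learn provide ready-made versions of each of these steps, so hand-written helpers like these would rarely be needed.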

Of course, our final aim is to solve a machine learning problem, whether that is predicting something or finding something from patterns. To get there, we need to test many different machine learning algorithms. We start with an extremely simple baseline model that ignores the feature values entirely and always predicts the average of the target values it was trained on (in this case, the average of all the reviews-per-month values given to the model). That baseline gives us a reference point for judging how well the other models we test are predicting. For the Airbnb dataset, of all the models I tested, a model called LightGBM predicted the reviews per month most accurately. The model works by building many decision trees, each of which splits on certain feature values to output a value for the prediction, and then combining the trees so that the prediction errors are as small as possible.
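
To make the two ideas above concrete, here is a toy sketch of a mean baseline and of boosting with very small trees. It is a simplified illustration on one feature, not LightGBM's actual implementation, which is considerably more sophisticated. The data is invented, and `fit_stump`, `fit_boosted_stumps`, and `predict_one` are hypothetical names.

```python
from statistics import mean


def mean_squared_error(targets, predictions):
    """Average squared gap between the true and predicted values."""
    return mean((t - p) ** 2 for t, p in zip(targets, predictions))


def fit_stump(xs, residuals):
    """Pick the one-threshold split on xs that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # both sides stay non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_one(baseline, stumps, x):
    """Add the output of every (already shrunk) stump to the baseline."""
    return baseline + sum(left if x <= threshold else right
                          for threshold, left, right in stumps)


def fit_boosted_stumps(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then add small trees that each fix leftover error."""
    baseline = mean(ys)
    stumps = []
    for _ in range(n_trees):
        predictions = [predict_one(baseline, stumps, x) for x in xs]
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, learning_rate * left_value,
                       learning_rate * right_value))
    return baseline, stumps


# Invented toy data, NOT real rows from the Kaggle dataset.
min_nights = [1, 1, 2, 3, 5, 7, 14, 30]
reviews = [4.0, 3.5, 3.0, 2.5, 1.5, 1.0, 0.4, 0.1]

baseline, stumps = fit_boosted_stumps(min_nights, reviews)
baseline_predictions = [baseline] * len(reviews)
boosted_predictions = [predict_one(baseline, stumps, x) for x in min_nights]

print("baseline error:", mean_squared_error(reviews, baseline_predictions))
print("boosted error: ", mean_squared_error(reviews, boosted_predictions))
```

In practice one would reach for existing implementations rather than writing this by hand, for example scikit-learn's `DummyRegressor(strategy="mean")` for the baseline and `lightgbm.LGBMRegressor` for the boosted trees.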

Machine learning models for prediction take in all the data, including the values we want predicted, and find patterns between the features and those values. To test a model, we split the whole dataset into two parts: one for the model to learn those patterns from, and another, which the model never sees during training, to check whether it outputs the right predictions. On that unseen test data, the model achieved a score of 60%, meaning that about 60% of the variation in reviews per month is explained by the model. The features the model considered most important were whether the last review was left in 2019, the minimum amount of time a guest can stay at the property, and whether the last review was left in June or July (when the answer to these questions is yes, the model predicts a higher number of reviews per month), all shown in the table below.
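
The split-then-score procedure can be sketched as below. The sketch assumes the reported score is an R² value, which is what the "variation explained" wording suggests. The rows are invented, and `train_test_split` and `r_squared` here are hand-written stand-ins rather than the library functions of the same name.

```python
import random
from statistics import mean


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold out a fraction of them as unseen test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def r_squared(targets, predictions):
    """Share of the variation in targets that the predictions account for."""
    target_mean = mean(targets)
    unexplained = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    total = sum((t - target_mean) ** 2 for t in targets)
    return 1 - unexplained / total


# Invented (minimum nights, reviews per month) pairs, NOT real dataset rows.
rows = [(1, 4.0), (1, 3.5), (2, 3.0), (3, 2.5), (5, 1.5),
        (7, 1.0), (14, 0.4), (30, 0.1), (2, 2.8), (4, 2.0)]
train_rows, test_rows = train_test_split(rows)

# Mean baseline: learn the average from the training rows only,
# then score it on the held-out test rows.
train_mean = mean(y for _, y in train_rows)
test_targets = [y for _, y in test_rows]
baseline_score = r_squared(test_targets, [train_mean] * len(test_targets))
print("baseline R^2 on test data:", baseline_score)
```

An R² of 1.0 means the predictions match the targets exactly, and 0 means the model does no better than always predicting the average. Under this reading, a score of 0.60 means the model accounts for 60% of the variation in the test targets.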

![alt text](table_2.png "Most Important Features With Their Weights")

If I were to be more thorough, I would first create more feature columns from the given dataset. That would give my chosen model more information to detect patterns with, which could have increased its prediction accuracy. For example, I didn’t try to encode the names of the listings, which could have given the model more to work with. Secondly, LightGBM is not as easily explainable as other models - I had to use additional methods to explain its predictions and which features it found important - so I could build and test more models, together with the extra features, to try to get results similar to or better than LightGBM’s with the added benefit of being able to explain them. Finally, along with adding more features, I could have analyzed which features are most strongly correlated with the target and kept only those, which would lower the computation time and CPU usage for all the tested models.