# **AISaturdays Rental Challenge**

![AISaturdays](https://www.saturdays.ai/assets/images/ai-saturdays-122x122.png)

Welcome to the **AISaturdays** Artificial Intelligence challenge on predicting rental prices across the neighborhoods of a city. In this exercise we will estimate the price of a rental offer from the data described below.

**Instructions:**

- The Python 3 programming language will be used.
- The following Python libraries will be used: pandas, Matplotlib, NumPy.

**Through this exercise, we will learn:**
- Understand and run notebooks with Python.
- Use Python functions and additional libraries.
- Dataset:
 - Get the dataset and preview the dataset information.
 - Clean and normalize the dataset information.
 - Represent and analyze the dataset information.
- Apply the Random Forest algorithm.
- Improve the prediction using hyperparameter tuning, feature engineering and gradient boosting.

Let us begin!


# 1. Import libraries

In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.api as sm
import scipy.stats as stats

# 2. Dataset

0. Read the .csv with the data and show the first rows.

In [2]:
#two lines of code


1. Show the number of features and examples in the dataset.

In [3]:
# only one line of code


2. Get what datatypes (dtypes) the dataset contains.

In [4]:
# only one line of code
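As a reference for steps 0 to 2, here is a minimal sketch of the three calls involved. It assumes a tiny made-up CSV held in memory (the `csv_text` string and all of its values are hypothetical); with the real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the challenge CSV; the rows are invented.
csv_text = """id,name,neighbourhood_group,room_type,price,last_review
1,Cozy room,Zone A,Private room,60,2019-05-21
2,Sunny loft,Zone B,Entire home/apt,180,
3,Shared space,Zone A,Shared room,35,2018-11-02
"""

# Read the data and preview the first rows.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# (number of examples, number of features)
print(df.shape)

# Data type of every column
print(df.dtypes)
```

Note that `last_review` is read as plain text (dtype `object`) at this point, which is why a later step converts it.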


### Variables



* **id/name:** Identifier and name of the offer.

* **host_id/host_name:** Identifier and name of the host.

* **neighbourhood_group/neighbourhood:** Zone and neighborhood of the offer. Each zone is a grouping of neighborhoods.
* **latitude/longitude:** Latitude and longitude of the offer.

* **room_type:** What type of room is offered. It can be the whole apartment or house, a private room or a shared one.

* **minimum_nights:**  Minimum stay nights.

* **number_of_reviews:**  Total number of reviews of the offer.

* **last_review:**  Date of the last review made.

* **reviews_per_month:** Number of reviews per month. It is not always integer and most are less than 1.

* **calculated_host_listings_count:** Number of listings the host has on offer.

* **availability_365:** The availability of the offer in one year: maximum of 365 (all year on offer)

* **price:** Our objective! The price of the offer, in dollars.



Is this a regression or classification problem? Why?:

3. Before analyzing the dataset, we have to transform the dates (the last_review feature) into something we can work with. Pandas has a data type specifically for this, datetime. Transform last_review to datetime format.

In [5]:
# Only with one line of code



4. To analyze the data we also need to know how much information is missing. Use isnull() to find out which feature is missing the most data.

In [6]:
# Only with one line of code
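Steps 3 and 4 can be sketched as follows, assuming a small invented frame rather than the real dataset. `pd.to_datetime` turns the text dates into datetimes (missing entries become `NaT`), and `isnull().sum()` counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Invented example rows; only the column names come from the exercise.
df = pd.DataFrame({
    "last_review": ["2019-05-21", None, "2018-11-02", None],
    "reviews_per_month": [0.5, np.nan, 1.2, np.nan],
    "price": [60, 180, 35, 90],
})

# Step 3: convert the text dates to the datetime type.
df["last_review"] = pd.to_datetime(df["last_review"])

# Step 4: count missing values per feature, largest first.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)
```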


5. Finally, drop the features that serve only as identifiers and do not help the prediction.

In [7]:
# Only with one line of code


6. All ready! We can now analyze the distribution of the data with the function .describe()

In [8]:
# only with one line of code
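A possible shape for steps 5 and 6, on invented data. Which columns count as "identifiers" is an assumption here (`id`, `name`, `host_id`, `host_name`); check it against the variable list above before applying it to the real dataset.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Cozy room", "Sunny loft", "Shared space"],
    "host_id": [11, 12, 11],
    "host_name": ["Ana", "Luis", "Ana"],
    "price": [60, 180, 30],
    "minimum_nights": [1, 3, 2],
})

# Step 5: drop the columns that only identify the offer or the host.
df = df.drop(columns=["id", "name", "host_id", "host_name"])

# Step 6: summary statistics (count, mean, std, quartiles) of the numeric columns.
summary = df.describe()
print(summary)
```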


### Clean and normalize dataset information
![alternative text](https://i.imgur.com/8u4xTI7.png)

This dataset contains incomplete information that we must fill in before we can use it to predict the price of the offers.
We also have to transform last_review if we want to include it in the prediction (we cannot use a date directly as input).

For this cleaning we will use various functions of Pandas. Here's a [hint](https://new.paradigmadigital.com/wp-content/uploads/2019/02/Pandas_cheatsheet.pdf).

7. Find the number of offers that, because they have no reviews, have no information in the last_review and reviews_per_month columns.

In [9]:
# only with one line of code


8. We have to fill in this information if we don't want to discard these examples entirely. Fill all the NaNs in reviews_per_month with 0 (we will complete the last_review column later).

In [10]:
# Only with one line of code



9. We are going to transform the last_review variable. It is a date, which makes it difficult to work with. Let's first complete the examples that have no last review date. Replace these NaNs with the date of the earliest review in the dataset.

In [11]:
# Two lines of code


10. Now that we have no empty values, we can change the last_review variable into something more useful. We want smaller values to correspond to older reviews (or to offers that have never had one) and larger values to correspond to more recent reviews.

We can use the toordinal() function to find the number of days that have passed since day 1 of year 1, but those numbers are still too large and do not follow the distribution we are looking for.

Make last_review represent the number of days that have passed since the earliest review in the dataset.

In [12]:
# one line of code
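Steps 8 to 10 chain together, so here is one way they could look on invented data. This sketch subtracts the earliest date directly and reads the result in days, which is an alternative to the toordinal() route mentioned above; the `earliest` variable name is made up for the example.

```python
import numpy as np
import pandas as pd

# Invented rows; the second offer has never been reviewed.
df = pd.DataFrame({
    "last_review": pd.to_datetime(["2019-05-21", None, "2018-11-02", "2019-01-01"]),
    "reviews_per_month": [0.5, np.nan, 1.2, 0.1],
})

# Step 8: offers without reviews get 0 reviews per month.
df["reviews_per_month"] = df["reviews_per_month"].fillna(0)

# Step 9: fill missing dates with the earliest review date in the data.
earliest = df["last_review"].min()
df["last_review"] = df["last_review"].fillna(earliest)

# Step 10: number of days elapsed since that earliest review.
df["last_review"] = (df["last_review"] - earliest).dt.days
print(df)
```

With this encoding, offers with no reviews and the offer with the oldest review both end up at 0, and more recent reviews get larger values.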


11. To visualize the distribution of dates, generate a graph showing the variable last_review.

In [13]:
# one line of code


There appear to be two clearly distinct groups. What causes this distribution?

#### Study of the variable to predict and noise elimination

12. When it comes to predicting the price, it is much more favorable if we first transform and analyze the variable we are looking for to make it easier to predict.

First, let's see how the price of the offers is distributed. Generate a graph showing the price of the offers. Here's a [Hint](https://seaborn.pydata.org/generated/seaborn.distplot.html).

In [14]:
# Just one line of code.


We have a variable that follows an approximately log-normal distribution. We can transform it into an approximately normal distribution by applying log1p(), a function that computes the following equation:

$y = \log(x + 1)$

This makes the price easier to predict, having a normal distribution.
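A small worked example of the transformation, using made-up prices. It also shows `np.expm1`, the inverse of `np.log1p`, which is handy later for turning predictions back into dollars.

```python
import numpy as np

# Made-up prices in dollars, including a free offer (0).
prices = np.array([0.0, 50.0, 100.0, 1000.0])

# log1p(x) = log(x + 1), so a price of 0 maps to 0 instead of -infinity.
log_prices = np.log1p(prices)
print(log_prices)

# expm1(y) = exp(y) - 1 undoes the transformation.
recovered = np.expm1(log_prices)
print(recovered)
```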

13. We are going to visualize this transformation. Generate another graph of price after applying the log1p() function.

In [15]:
# one line of code


Now we have a much more appropriate distribution for making predictions. However, there are still many outliers that add noise to the sample.

14. Above and below what values is this noise present? Remove from the dataframe those values that do not fall within the normal distribution.

In [16]:
#two lines of code
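The filtering itself could be sketched like this. The cutoffs `lower` and `upper` are purely hypothetical placeholders; the point of the exercise is to read suitable values off your own plot of log1p(price).

```python
import numpy as np
import pandas as pd

# Made-up prices with an outlier at each end.
df = pd.DataFrame({"price": [0, 10, 50, 120, 300, 10000]})

# Hypothetical cutoffs on the log1p scale; choose yours from the plot.
lower, upper = 3.0, 8.0

# Keep only the rows whose log1p(price) lies inside [lower, upper].
mask = np.log1p(df["price"]).between(lower, upper)
filtered = df[mask]
print(filtered)
```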


14. Now, redraw the price plot and the plot of log1p of the price (use the same code as before, or put both in a [subplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html)).

In [17]:
#Four lines of code


15. Finally, we have a normalized, noise-free output variable that will improve our predictions. Change the variable price to the log1p of price.


In [18]:
#one line of code


#### Exploring variables


Let's explore the remaining variables that can affect the price of an offer.

16. Let's start by creating a histogram of the different zones of the city and the number of offers in each of them (you may need to enlarge the graph).

In [19]:
#Three lines of code


17. Now create a map of the offered apartments with the latitude and longitude (extra points if you color them by zones or neighborhoods). It is best to do it in a subplot so you can control the size of the map.

In [20]:
#two lines of code


18. Now we are going to generate another histogram, this time with the offered room type (It is also a good idea to adjust the size of the graph).

In [21]:
#three lines of code


#### Variable transformation

We can apply the same process that we applied to the price variable to our input variables and thus obtain a distribution that is easier for the models to work with.



19. Apply the log1p() transformation to minimum_nights, generate the before and after graphs, and compare them.

In [22]:
#three lines of code


20. Finally, replace minimum_nights with the log1p of minimum_nights.

In [23]:
#one line of code


21. Repeat the process, this time with reviews_per_month. Is the transformation relevant?

In [24]:
#three lines of code


#### Availability study in number of days (0 to 365)

22. Let's start by representing the availability in a distplot(). Since we know the limits of this variable, it is best to limit the range of the graph and make it larger.


In [25]:
#four lines of code


#### Add artificial variables

The previous plot suggests that there are two groups: one available most of the year and the other only a few days.

Intuitively, the sites that have no reviews may also behave differently... perhaps because they do not inspire much confidence? ;)

23. Add three binary features that indicate whether the apartment is available all year round, whether its availability is very low (fewer than 12 days a year), and whether it has no reviews.

In [26]:
#three lines of code
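One way the three flags could be built, shown on invented rows. The new column names (`all_year_avail`, `low_avail`, `no_reviews`) are made up for this sketch, and treating "all year round" as exactly 365 days is an assumption.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "availability_365": [365, 5, 120, 0],
    "number_of_reviews": [10, 0, 3, 0],
})

# Hypothetical feature names; each comparison yields a boolean column.
df["all_year_avail"] = df["availability_365"] == 365  # assumed: exactly 365 days
df["low_avail"] = df["availability_365"] < 12
df["no_reviews"] = df["number_of_reviews"] == 0
print(df)
```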


24. We are going to generate a heatmap that shows the relationship between all the input variables and price. Use corr() and seaborn's heatmap() function.

In [27]:
#three lines of code
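The heatmap is just a picture of the correlation matrix, so this sketch focuses on computing that matrix. The data is synthetic and built so that price depends on minimum_nights, which is an assumption made only to have something visible in the result.

```python
import numpy as np
import pandas as pd

# Synthetic data: price is constructed to grow with minimum_nights.
rng = np.random.default_rng(0)
n = 200
minimum_nights = rng.integers(1, 30, size=n)
availability_365 = rng.integers(0, 366, size=n)
price = 50 + 2 * minimum_nights + rng.normal(0, 5, size=n)

df = pd.DataFrame({
    "minimum_nights": minimum_nights,
    "availability_365": availability_365,
    "price": price,
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()
print(corr["price"].sort_values(ascending=False))
```

Passing `corr` to seaborn's `heatmap()` (for example with `annot=True`) draws the matrix as a colored grid.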


#### Convert categorical variables to one-hot


25. To make the categorical features easier for the model to interpret, we are going to apply one-hot encoding to them. Use the pandas get_dummies() function (you should end up with 241 columns).

In [28]:
#two lines of code
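To see what get_dummies() does, here is a minimal sketch on invented rows. Each distinct category becomes its own indicator column, so the column count here will not match the 241 expected on the real dataset.

```python
import pandas as pd

# Invented rows with two categorical features.
df = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"],
    "neighbourhood_group": ["Zone A", "Zone B", "Zone A", "Zone C"],
    "price": [60, 180, 35, 75],
})

# One indicator column per category; numeric columns are left untouched.
encoded = pd.get_dummies(df, columns=["room_type", "neighbourhood_group"])
print(encoded.columns.tolist())
```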


# Models, models, models

With all the data exploration, analysis and cleaning done, we move on to the fun part: The Models!
    
We start by importing all the classes that we will need to find a good predictive model:

In [29]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, StratifiedKFold, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

26. Divide the dataset into X_train, X_test, y_train and y_test using train_test_split(). Remember not to include price among the input features.

In [30]:
#three lines of code
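A sketch of the split on synthetic data. The 80/20 proportion and the `random_state` value are assumptions chosen for the example, not requirements of the challenge.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic dataset with two input features and the target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "minimum_nights": rng.integers(1, 30, size=100),
    "availability_365": rng.integers(0, 366, size=100),
    "price": rng.normal(4.5, 0.5, size=100),
})

# Inputs are every column except the target.
X = df.drop(columns=["price"])
y = df["price"]

# Assumed split: 80% train, 20% test, fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```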


27. We are going to use cross-validation to train our model, using KFold to find the score. Implement a KFold that performs 5 splits and calculate the mean error and its standard deviation for a RandomForestRegressor without changing its parameters (yet). [Hint](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).

In [31]:
#three lines of code
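The cross-validation loop might look like the sketch below, run on synthetic data. Using mean squared error as the metric, shuffling the folds, and fixing `random_state` are assumptions added here for a reproducible example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem: the target depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, size=200)

# 5 folds; shuffling and the seed are assumptions for reproducibility.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(random_state=42)

# scikit-learn maximizes scores, so the MSE is returned negated.
scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
mse = -scores
print("mean MSE:", mse.mean(), "std:", mse.std())
```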


28. When using a RandomForestRegressor, what hyperparameters were we using? List all the parameters that this model uses (use the get_params() function and the pprint library).

In [32]:
from pprint import pprint
#two lines of code
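For reference, this is how get_params() and pprint fit together; the exact set of parameters printed depends on your installed scikit-learn version.

```python
from pprint import pprint

from sklearn.ensemble import RandomForestRegressor

# A model created without arguments uses the library defaults.
model = RandomForestRegressor()

# get_params() returns a dict mapping parameter names to current values.
params = model.get_params()
pprint(params)
```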


We can adjust all these parameters to improve the accuracy of our model. One way to find which combination works best is to use GridSearchCV, which fits models with many different combinations and computes the score of each one to find the best model by brute force. For this, you pass a list of values for each parameter, and GridSearchCV tests them all. [More information](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

29. Decide which values you want each parameter to take, and put each of these lists in a dictionary so that you can run the search. Consider the possible values for each of the parameters.

In [33]:
# 8 lines of code
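A possible layout for the dictionary. Every candidate value below is an illustrative guess rather than a recommended setting; the keys are real RandomForestRegressor parameter names, while `param_grid` and `n_combinations` are names made up for this sketch.

```python
import math

from sklearn.ensemble import RandomForestRegressor

# Illustrative candidate values; adjust them to your data and time budget.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
    "bootstrap": [True, False],
}

# Sanity check: every key must be a real parameter of the model.
valid = RandomForestRegressor().get_params()
unknown = [name for name in param_grid if name not in valid]
print("unknown parameters:", unknown)

# A full grid search would fit this many combinations per fold.
n_combinations = math.prod(len(values) for values in param_grid.values())
print("combinations:", n_combinations)
```

Counting the combinations up front makes it clear why an exhaustive search gets expensive quickly.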


30. We can now implement the search. To make it faster, we use a variant that does not test every possible combination but only a random sample of them (hence its name, RandomizedSearchCV). Implement it, taking into account that its parameters include the model to tune and the dictionary defined above, among others. This step may take a few minutes, since many models have to be fitted to find the best one. Here is the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) for RandomizedSearchCV.

In [34]:
#two lines of code
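Here is a small sketch of a randomized search on synthetic data. The grid is deliberately tiny, and `n_iter=5`, `cv=3` and the scoring metric are assumptions picked so that the example runs in seconds; on the real dataset you would use your own dictionary and larger settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, size=120)

# Deliberately small, illustrative grid.
param_grid = {
    "n_estimators": [10, 20, 50],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 2, 4],
}

# Try 5 random combinations, each evaluated with 3-fold cross-validation.
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=5,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print("best CV MSE:", -search.best_score_)
```

After fitting, `search.best_estimator_` holds the best model refitted on all the data passed to `fit`.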


31. To finish, find the mean squared error and $R^2$ of the best model you can find.

In [35]:
#Six lines of code
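The two metrics can be computed as in this sketch, which fits a plain RandomForestRegressor on synthetic data as a stand-in for your best model. The split proportion and seeds are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem with a strong signal in the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] + rng.normal(0, 0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in for the best model found by the search.
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Lower MSE is better; R^2 closer to 1 is better.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE:", mse, "R^2:", r2)
```

If the model was trained on log1p(price), these metrics are on the log scale; apply `np.expm1` to predictions and targets first to measure the error in dollars.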


Now, to improve that score!
You can try:
- Delete features that are not relevant to prediction
- Implement gradient boosting using XGBoost or AdaBoost, among others
- Adjust hyperparameters manually to get better models
- Use a Tree Interpreter to see which decision trees are most important

At the end of the challenge, we will give you a validation set to see which group has achieved the best score. Whoever wins has a prize!