<img src="https://miro.medium.com/max/700/1*9onqVYdPPrCcwDX6mGKCpg.jpeg" width="600px">

Welcome everyone! This is my first notebook and I'm going to perform a predictive analysis of house rental prices in Brazil.

Our goals in this kernel are:
* Basic exploratory data analysis
* A guide to the brazilian_houses_to_rent dataset
* Feature analysis
* Training several models to predict the rent price

Our dependent variable is `rent amount (R$)`. This variable is the rental price of houses in Brazil, measured in Brazilian currency (reais, R$).

# Predictive analysis of house rental prices in Brazil

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

#metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# models
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

## Some functions to categorize and visualize data

In [None]:
def categorize(col):
    numerical,category=[],[]
    for i in col:
        if data[i].dtype ==object:
            category.append(i)
        else:
            numerical.append(i)
    print("The numerical features {}:".format(numerical))
    print("The categorical features {}:".format(category))
    return category,numerical

In [None]:
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

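    # kdeplot is axes-level, so both curves can share the same axes
    # (displot is figure-level and accepts neither hist= nor ax=)
    ax1 = sns.kdeplot(x=RedFunction, color="r", label=RedName)
    sns.kdeplot(x=BlueFunction, color="b", label=BlueName, ax=ax1)

    plt.title(Title)
    plt.legend()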

    plt.show()
    plt.close()

## Import Dataset

In [None]:
df = pd.read_csv('/kaggle/input/brasilian-houses-to-rent/houses_to_rent_v2.csv')

In [None]:
df.head()

## Initial look at the data

In [None]:
df.describe()

In [None]:
df.describe(include = 'object')

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df.info()

**Initial observations:**
*     There are 13 features and 10692 instances
*     There are no NaN values
*     Most columns are numerical
*     Most houses accept animals
*     Most houses are not furnished
*     São Paulo is the city with the most houses
*     There are potential outliers

## Cleaning the Data

**We can see that the dtype of floor is 'object', so let's check why**

In [None]:
df['floor'].unique()

**We can see that there are '-' values, so we have to clean them. Here we treat '-' as floor 0**

In [None]:
df.loc[df['floor'] == '-', 'floor'] = 0
df['floor'] = df['floor'].astype('int64')
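A minimal sketch of an alternative way to do the same cleaning with `pd.to_numeric`, shown on a made-up Series rather than the real column. The toy values and the names `floors` and `floors_clean` are assumptions for illustration; treating '-' as floor 0 follows the cell above.

```python
import pandas as pd

# Toy stand-in for the 'floor' column; the values are made up for illustration.
floors = pd.Series(['7', '-', '12', '-', '1'])

# Replace the '-' placeholder with '0', then parse the strings as integers.
floors_clean = pd.to_numeric(floors.replace('-', '0')).astype('int64')

print(floors_clean.tolist())
print(floors_clean.dtype)
```

Parsing with `pd.to_numeric` would also raise an error if some other unexpected string were hiding in the column, which can be a useful check.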

## Checking for Outliers

In [None]:
sns.boxplot(x = df['rent amount (R$)'])

**We can see that there are some outliers, so we have to treat them**

## Dealing with Outliers

In [None]:
# First let's make a copy of our dataset so the original stays untouched.
data = df.copy()

**To treat the outliers we will use the interquartile range (IQR), computed separately for each city: rents below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are removed**

In [None]:
city_group = data.groupby('city')['rent amount (R$)']

Q1 = city_group.quantile(.25)
Q3 = city_group.quantile(.75)

# IQR = Interquartile Range
IQR = Q3 - Q1

# Limits
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# DataFrame to store the new data
new_data = pd.DataFrame()

for city in city_group.groups.keys():
    is_city = data['city'] == city
    accepted_limit = ((data['rent amount (R$)'] >= lower[city]) &
                     (data['rent amount (R$)'] <= upper[city]))
    
    select = is_city & accepted_limit
    data_select = data[select]
    new_data = pd.concat([new_data, data_select])
    
data = new_data.copy()
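The same per-city IQR filter can also be written without an explicit loop. This is only a sketch on made-up data: the cities 'A' and 'B', the rent values and the names `toy`, `keep` and `filtered` are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy data with made-up values; the column names mirror the dataset.
toy = pd.DataFrame({
    'city': ['A'] * 6 + ['B'] * 6,
    'rent amount (R$)': [1000, 1100, 1200, 1300, 1400, 9000,
                         500, 550, 600, 650, 700, 5000],
})

rent = toy.groupby('city')['rent amount (R$)']

# transform() broadcasts each city's quartile back to every row of that city.
q1 = rent.transform(lambda s: s.quantile(0.25))
q3 = rent.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Keep only the rows whose rent lies inside the limits of their own city.
keep = toy['rent amount (R$)'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = toy[keep]

print(filtered)
```

Unlike the loop above, this version keeps the original row order, because it builds a boolean mask instead of concatenating one city at a time.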

In [None]:
# New dataset
data.describe()

**We can see that rents outside the per-city IQR limits are no longer in our dataset**

# Exploratory Data Analysis (EDA)

In [None]:
# Let's take a look at how the rent was distributed before and after treating the outliers

plt.figure(1, figsize=(20, 12))
plt.subplot(2, 2, 1)
# histplot(kde=True) replaces the deprecated distplot
sns.histplot(df['rent amount (R$)'], kde=True)
plt.title('Before Removing Outliers')
plt.subplot(2, 2, 2)
sns.histplot(data['rent amount (R$)'], kde=True)
plt.title('After Removing Outliers')
plt.subplot(2, 2, 3)
# recent seaborn versions require x and y to be passed as keywords
sns.boxplot(x=df['city'], y=df['rent amount (R$)']).set_title('Before Removing Outliers')
plt.subplot(2, 2, 4)
sns.boxplot(x=data['city'], y=data['rent amount (R$)']).set_title('After Removing Outliers')
plt.tight_layout(pad=5.0)
plt.show()

## Let's explore our numerical features

In [None]:
numerical1 = ['rooms', 'bathroom', 'parking spaces']
plt.figure(figsize=(20, 5))
sns.set(style = 'whitegrid')
i = 1
for feature in numerical1:
    plt.subplot(2, 3, i)
    sns.barplot(x = feature, y= 'rent amount (R$)', data=data)
    i+=1
plt.tight_layout()

* Houses with more rooms have more expensive rents, except for houses with 10 rooms, where the rent decreases
* The rent increases up to 8 bathrooms; beyond that it curiously decreases
* The rent increases up to 7 parking spaces; beyond that it behaves strangely, probably due to the small number of samples

In [None]:
numerical2 = ['area', 'fire insurance (R$)', 'property tax (R$)', 'hoa (R$)']
plt.figure(figsize=(20, 5))
j = 1
for feature2 in numerical2:
    plt.subplot(2, 2, j)
    sns.histplot(data[feature2], kde=True)
    j+=1
plt.tight_layout()

**All the distributions are right skewed**
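To put a number on the skew instead of reading it off the plots, pandas offers `Series.skew()`. The two Series below are made up for illustration; a value near 0 suggests a symmetric distribution and a positive value suggests a right skew.

```python
import pandas as pd

# Made-up samples: one symmetric, one with a long right tail.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])

print('symmetric:', symmetric.skew())
print('right skewed:', right_skewed.skew())
```

On the notebook's data the same check would look like `data[numerical2].skew()`, assuming the `numerical2` list from the cell above.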

## Let's take a deeper look at how the prices are distributed in each city

In [None]:
plt.figure(figsize=(18, 8))

i = 1
for city in data['city'].unique():
    plt.subplot(2, 3, i)
    plt.title(city)
    city_name = data.loc[data['city'] == city]
    sns.histplot(city_name['rent amount (R$)'], kde=True)
    i+=1
    

plt.tight_layout()
plt.show()

**We can see that we have right skewed distributions.**

In [None]:
plt.figure(figsize=(16, 8))

i = 1
for city in data['city'].unique():
    plt.subplot(2, 3, i)
    plt.title(city)
    city_name = data.loc[data['city'] == city]
    # pass the values as x= (recent seaborn versions require keyword arguments)
    sns.boxplot(x=city_name['rent amount (R$)'])
    i+=1

plt.tight_layout()
plt.show()

* São Paulo appears to have the most expensive rents.
* Belo Horizonte and Rio de Janeiro have slightly more expensive rents than Campinas and Porto Alegre.

## Getting an intuition for the categorical features

In [None]:
categorical,numerical = categorize(data.columns)

In [None]:
plt.figure(figsize=(20,5))
j =1
for i in categorical:
    plt.subplot(1,3,j)
    sns.countplot(x=data[i])
    j =j+1
plt.tight_layout()

* São Paulo is the city with the most houses
* The majority of the houses accept animals
* The majority of the houses are not furnished

In [None]:
# Let's take a look at how the rent is impacted by the furniture
plt.figure(figsize = (15, 5))
sns.violinplot(x ='furniture', y ='rent amount (R$)', data = data,hue ='city').legend(loc='upper center')

* Furnished houses are more expensive than unfurnished ones
* The rents of furnished houses are more spread out than those of unfurnished houses

In [None]:
# Let's take a look at how the rent is impacted by animal acceptance
plt.figure(figsize = (15, 5))
sns.violinplot(x ='animal', y ='rent amount (R$)', data = data,hue ='city').legend(loc='upper center')

**It seems that animal acceptance has little impact on the rent**

In [None]:
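# now let's see the correlation between the numerical features
# (corr() on text columns raises an error in recent pandas versions, so we select numeric columns first)
plt.figure(figsize=(12,12))
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, cmap='RdBu_r', linecolor='black',vmin=-1, vmax=1)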

## Let's transform our data and split it into train and test

In [None]:
cols = ['city', 'rooms', 'bathroom', 'parking spaces', 'fire insurance (R$)',
        'furniture']
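# .copy() gives us an independent DataFrame, so the encoding below does not trigger SettingWithCopyWarning
x = data[cols].copy()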
y = data['rent amount (R$)']

**We used the columns that are most correlated with the variable that we want to predict**

In [None]:
labelencoder = LabelEncoder()
x.loc[:, 'furniture'] = labelencoder.fit_transform(x.loc[:, 'furniture'])

**We used LabelEncoder for furniture because it only has two values**

In [None]:
dummy = pd.get_dummies(x, columns=['city'])
dummy.drop(columns = ['city_Belo Horizonte'], inplace=True)
x = dummy

**For the cities we use one-hot encoding (pd.get_dummies) and drop the first dummy column to avoid the dummy variable trap**
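As a small illustration of that encoding step, `pd.get_dummies` can drop the first category itself through `drop_first=True`. The toy frame below is made up; only the column names follow the dataset.

```python
import pandas as pd

# Made-up rows; 'city' is the only categorical column here.
toy = pd.DataFrame({
    'city': ['Campinas', 'Belo Horizonte', 'São Paulo', 'Campinas'],
    'rooms': [2, 3, 1, 2],
})

# drop_first=True removes the first category in sorted order (Belo Horizonte),
# which then becomes the baseline that the other city columns are compared to.
encoded = pd.get_dummies(toy, columns=['city'], drop_first=True)

print(encoded.columns.tolist())
```

A row with zeros in every remaining city column therefore stands for the dropped city.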

In [None]:
# Now we split into train and test
x_train, x_test, y_train, y_test = train_test_split(x,
                                                   y,
                                                   test_size = 0.3,
                                                   random_state = 0)

# Model Predictions

**Here we set up the models that we want to use and the parameters we want to adopt. In this notebook I will use:**
*     Linear Regression
*     Ridge Regression
*     Decision Tree
*     Random Forest
*     Support Vector Regression (SVR)
*     K-Nearest Neighbours (KNN)
*     Lasso Regression
*     GridSearch to find the best parameters on Lasso and Ridge

In [None]:
# we create a list to store all the results for later visualization
acc = []
# parameters1 holds the alpha values that we will try in the GridSearch
parameters1= [{'alpha': [0.0001, 0.001, 0.1, 1, 10, 100, 1000, 10000, 100000]}]
# on the regressors we define the models that we want use
regressors = {'Linear Regression': LinearRegression(),
              'Ridge Model': Ridge(alpha=0.1),
              'Decision Tree': DecisionTreeRegressor(),
              'Random Forest': RandomForestRegressor(random_state=1),
              'SVR': SVR(),
              'KNN': KNeighborsRegressor(),
              'Lasso': Lasso(),
              'GridSearchRidge': GridSearchCV(Ridge(), parameters1, cv=4),
              'GridSearchLasso': GridSearchCV(Lasso(), parameters1, cv=4)
             }
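To show what the two GridSearch entries do once fitted, here is a minimal sketch on synthetic data. The data, the shorter alpha grid and the names `search` and `best_model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (made up for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=80)

# 4-fold cross-validation over a small alpha grid.
search = GridSearchCV(Ridge(), {'alpha': [0.001, 0.1, 10, 1000]}, cv=4)
search.fit(X, y)

# best_params_ holds the winning alpha; best_estimator_ is the model
# refitted on all the data passed to fit() with that alpha.
print(search.best_params_)
best_model = search.best_estimator_

# With the default refit=True, predicting through the search object
# gives the same result as predicting with best_estimator_.
print(np.allclose(search.predict(X), best_model.predict(X)))
```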

In [None]:
# now we loop over each regressor to fit the model, predict the rent
# and extract the metrics
for i in regressors:
    model = regressors.get(i)
    model.fit(x_train, y_train)
    # for the grid searches we predict with the best estimator found
    # (GridSearchCV refits it on the whole training set by default)
    if i == 'GridSearchRidge' or i == 'GridSearchLasso':
        model = model.best_estimator_
    prediction = model.predict(x_test)
    # compute each metric once and reuse it for printing and storing
    mae = mean_absolute_error(y_test, prediction)
    rmse = np.sqrt(mean_squared_error(y_test, prediction))
    r2 = r2_score(y_test, prediction)
    print(i)
    print('MAE:', mae)
    print('RMSE:', rmse)
    print('R2:', r2)
    print('*' * 40)
    acc.append([i, mae, rmse, r2])
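For readers new to these metrics, the three numbers printed above can be reproduced by hand with NumPy. The `y_true` and `y_pred` arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted rents.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1100.0, 1900.0, 3200.0, 3800.0])

errors = y_true - y_pred

# MAE: average absolute error, in the same unit as the rent (R$).
mae = np.mean(np.abs(errors))
# RMSE: square root of the average squared error; penalizes large errors more.
rmse = np.sqrt(np.mean(errors ** 2))
# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, rmse, r2)

# The hand-made values agree with scikit-learn's functions.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred))))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```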

In [None]:
# now let's follow the same loop and plot actual vs. predicted values for each regressor
j = 1
plt.figure(figsize=(20,10))
for i in regressors:
    model = regressors.get(i)
    model.fit(x_train, y_train)
    prediction = model.predict(x_test)
    ax1 = plt.subplot(3, 3, j)
    # kdeplot replaces the deprecated distplot(hist=False, kde=True)
    sns.kdeplot(x=y_test, color="r", label="Actual Value", ax=ax1)
    sns.kdeplot(x=prediction, color="b", label="Predicted Value", ax=ax1)
    ax1.set_title(i)
    ax1.legend()
    j+=1
plt.tight_layout(pad = 0.5)
plt.tight_layout(pad = 0.5)

**Since our accuracy is very high, the actual and predicted curves overlap**

## Analysis of the results

In [None]:
# let's sort our list of results by R2 (best first) and transform it into a dataframe
acc.sort(key = lambda y:y[3], reverse=True)
acc = pd.DataFrame(data = acc, columns=['model', 'MAE', 'RMSE', 'R2'])

In [None]:
# now let's visualize it
acc.head(len(regressors))

**Random Forest is our best performer in all three metrics**

In [None]:
# since Random Forest is our best model, let's compute R^2 on the test set with different
# degrees of polynomial transformation to see if we can improve it
rfr = RandomForestRegressor(random_state=1)
Rsqu_test = []

order = [1, 2, 3, 4]
for n in order:
    pr = PolynomialFeatures(degree=n)

    # fit the transformer on the training data only, then reuse it on the test data
    x_train_pr = pr.fit_transform(x_train)
    x_test_pr = pr.transform(x_test)

    rfr.fit(x_train_pr, y_train)

    Rsqu_test.append(rfr.score(x_test_pr, y_test))

plt.plot(order, Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 Using Test Data')
## I would like to thank everyone who viewed this kernel. I'm new to this field, so if you have any questions, please post them in the comments so we can discuss them together.

<img src="https://www.betterteam.com/i/thank-you-letter-to-employees-420x320-20190212.jpg" width="400px">