# Introduction

Description
This dataset contains 10962 houses for rent, described by 13 different features.

### Columns:

**city**: City where the property is located

**area**: Property area

**rooms**: Number of rooms

**bathroom**: Number of bathrooms

**parking spaces**: Number of parking spaces

**floor**: Floor

**animal**: Whether animals are accepted

**furniture**: Whether the property is furnished

**hoa**: Homeowners association tax

**rent amount**: Rent amount

**property tax**: Property tax

**fire insurance**: Fire Insurance

**total**: Total

Note that the column called total is the sum of rent amount, property tax, hoa, and fire insurance.
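The relationship between total and its four components can be checked directly. The sketch below uses a tiny hand-made sample with the dataset's column names; the values are invented for illustration and are not rows from the real file.

```python
import pandas as pd

# Tiny hypothetical sample using the dataset's column names (values are made up).
sample = pd.DataFrame({
    "hoa (R$)": [2065, 1200, 0],
    "rent amount (R$)": [3300, 4960, 800],
    "property tax (R$)": [211, 1750, 25],
    "fire insurance (R$)": [42, 63, 11],
    "total (R$)": [5618, 7973, 836],
})

cost_columns = ["hoa (R$)", "rent amount (R$)", "property tax (R$)", "fire insurance (R$)"]

# True for each row where the four components add up to the reported total.
total_matches = sample[cost_columns].sum(axis=1) == sample["total (R$)"]
print(total_matches.all())
```

Running the same comparison on the full dataset would show whether any rows break the rule.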

# My Goal


I will try to build a model that **can predict the rent amount** from this dataset's features. Such a model could be useful to a real estate broker who has to set a rent price in line with the market using only these features. In that situation, of course, the broker does not have the total price or the rent amount, only the hoa (because it is defined by the homeowners association), the property tax (because it is defined by City Hall), and the fire insurance (because it is defined by the market).

# Importing the libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import ExtraTreesClassifier
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing the dataset

In [None]:
dataset = pd.read_csv('../input/brasilian-houses-to-rent/houses_to_rent_v2.csv')
dataset.head()

# Feature exploration

In [None]:
dataset.info()

In [None]:
dataset.describe().T

Let's have a look at which cities we have in this dataset

In [None]:
dataset['city'].value_counts()

In [None]:
dataset['floor'].value_counts()

I decided not to use this feature. Even if I cleaned the '-' values and dropped the outliers, to predict a rent it may be important to know whether a floor is the top one, because the top floor can (though not necessarily) have a different layout from the rest of the building. So I am going to try to build this model without this feature.
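For anyone who would rather keep the floor column, here is a minimal sketch of one possible cleanup. It assumes that '-' means the property has no floor number and that very large values are data-entry errors; both are assumptions for illustration, and the sample values are invented.

```python
import pandas as pd

# Hypothetical sample of the 'floor' column; '-' is assumed to mean "no floor number".
floor = pd.Series(["7", "-", "12", "301", "-"], name="floor")

# Convert to numbers; anything non-numeric such as '-' becomes NaN.
floor_num = pd.to_numeric(floor, errors="coerce")

# Assumption for illustration: treat missing floors as ground level (0).
floor_num = floor_num.fillna(0).astype(int)

# Assumption: floors above 50 are treated as data-entry errors and capped at 50.
floor_clean = floor_num.clip(upper=50)
print(floor_clean.tolist())
```

This still does not tell us whether a floor is the top one, so the concern above remains.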

In [None]:
del dataset['floor']
del dataset['area']
del dataset['fire insurance (R$)']
del dataset['total (R$)']

The other features I decided to drop are area, fire insurance, and total. Let me explain why:

**Total (R$)**: Because we are trying to predict the rent amount. In this situation, if we don't have the rent amount, we don't have the total price either.

**Area**: You are probably wondering why I did this. One thing I'm afraid could affect my model is the relationship between area and property tax. Here in Brazil, this tax is calculated based on the current sale price of the apartment. Factors such as the age of the building, the built area, the type of building, and the neighborhood all enter into the tax calculation. So, to avoid using features that are related to each other, I chose to drop area and keep property tax, because I think it is a good summary of an apartment's attributes.

**Fire Insurance**: Same reason as area. Some insurance companies price it using attributes that we have already considered, so I decided to drop this feature.
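One way to check this concern before dropping anything is to look at the pairwise correlations between the three columns. The sketch below uses a small invented sample with the dataset's column names; on the real data the same call would be run on `dataset` before the `del` statements.

```python
import pandas as pd

# Hypothetical sample (values are made up) with the columns under discussion.
sample = pd.DataFrame({
    "area": [70, 320, 80, 51, 25, 376],
    "property tax (R$)": [211, 1750, 0, 22, 25, 834],
    "fire insurance (R$)": [42, 63, 41, 17, 11, 121],
})

# Pearson correlation between every pair of columns.
corr = sample.corr()
print(corr.round(2))
```

A value close to 1 between two columns would support treating them as redundant, while a low value would suggest that the dropped column still carries information of its own.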

In [None]:
dataset['animal'].value_counts()

In [None]:
dataset['furniture'].value_counts()

# Data Preprocessing 

In [None]:
dt_trasformed = pd.get_dummies(dataset)
dt_trasformed = dt_trasformed[['hoa (R$)', 'property tax (R$)', 'rooms', 'bathroom', 'parking spaces', 'city_Belo Horizonte', 'city_Campinas', 'city_Porto Alegre', 'city_Rio de Janeiro', 'city_São Paulo', 'animal_acept', 'animal_not acept', 'furniture_furnished', 'furniture_not furnished',  'rent amount (R$)']]

In [None]:
dt_trasformed.info()

In [None]:
X = dt_trasformed.iloc[:, :-1]
y = dt_trasformed.iloc[:, -1]

# **Feature Selection**

# Univariate Selection

In [None]:
#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features
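A note on the score function: `chi2` is meant for classification targets, so with a continuous target like rent it treats every distinct rent value as its own class. For a regression target, scikit-learn also provides `f_regression`. The sketch below shows that alternative on synthetic data. The column names mirror the dataset, but the values and the relationship between them are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data; the real notebook would use X and y instead.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "hoa (R$)": rng.integers(0, 3000, 200),
    "property tax (R$)": rng.integers(0, 2000, 200),
    "rooms": rng.integers(1, 6, 200),
})
# Made-up target that depends on property tax and rooms, but not on hoa.
y_demo = 2.0 * X_demo["property tax (R$)"] + 300 * X_demo["rooms"] + rng.normal(0, 100, 200)

# Score each feature with a univariate F-test suited to a continuous target.
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
print(scores)
```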

# Feature Importance

In [None]:
#model = ExtraTreesClassifier()
#model.fit(X,y)
#print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers
#plot graph of feature importances for better visualization
#feat_importances = pd.Series(model.feature_importances_, index=X.columns)
#feat_importances.nlargest(10).plot(kind='barh')
#plt.show()

I left this code commented out because it exceeded the memory limit.
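A possible reason for the memory problem is that `ExtraTreesClassifier` treats every distinct rent value as a separate class. Since rent is a continuous target, `ExtraTreesRegressor` may be a better fit. Here is a minimal sketch on synthetic data; the values and the number of trees are assumptions for illustration, and I have not run this on the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in data; the real notebook would use X and y instead.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "hoa (R$)": rng.integers(0, 3000, 200),
    "property tax (R$)": rng.integers(0, 2000, 200),
    "rooms": rng.integers(1, 6, 200),
})
y_demo = 2.0 * X_demo["property tax (R$)"] + 300 * X_demo["rooms"] + rng.normal(0, 100, 200)

# The regressor handles a continuous target directly.
model = ExtraTreesRegressor(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

# Importances sum to 1; larger means the feature was more useful for the splits.
feat_importances = pd.Series(model.feature_importances_, index=X_demo.columns)
print(feat_importances.sort_values(ascending=False))
```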

## Correlation Matrix with Heatmap

In [None]:
#get the correlations between all features in the dataset
corrmat = dt_trasformed.corr()
plt.figure(figsize=(20,20))
#plot the heat map
g=sns.heatmap(corrmat,annot=True,cmap="RdYlGn")

# Splitting the dataset into the Training set and Test set

In [None]:
X = dt_trasformed.iloc[:, 0:5].values
y = dt_trasformed.iloc[:, -1].values
y = y.reshape(len(y),1)

For the same reasons I dropped area, and after looking at the feature selection results, I decided not to use the location dummy features. Their information is already contained in property tax, and for this dataset the category seems too generic. Even though we are considering only the metropolitan areas, not the whole states of São Paulo and Rio, these are huge cities with a wide range of rent prices.
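The claim about the wide range of rents inside each city can be inspected with a simple group-by. The sketch below uses an invented sample; on the real data the same aggregation would be run on `dataset`.

```python
import pandas as pd

# Hypothetical sample (values are made up) with the dataset's column names.
sample = pd.DataFrame({
    "city": ["São Paulo", "São Paulo", "São Paulo", "Rio de Janeiro", "Rio de Janeiro", "Campinas"],
    "rent amount (R$)": [800, 3300, 15000, 1200, 7000, 1500],
})

# Minimum, median and maximum rent per city show how spread out each city is.
spread = sample.groupby("city")["rent amount (R$)"].agg(["min", "median", "max"])
print(spread)
```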

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
print(X_test)

# Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X_train = sc_X.fit_transform(X_train)
y_train = sc_y.fit_transform(y_train)
X_test = sc_X.transform(X_test)
y_test = sc_y.transform(y_test)
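Because the target is scaled too, the models predict in standardized units, not in R$. To read a prediction as a rent value, it has to go back through the target scaler's `inverse_transform`, which expects a 2-D array. Here is a minimal sketch on a toy one-feature example. The `_demo` names and the numbers are made up so that the cell is self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data with an exact linear relation: y = 10 * x.
X_raw = np.array([[100.0], [200.0], [300.0], [400.0]])
y_raw = np.array([[1000.0], [2000.0], [3000.0], [4000.0]])

sc_X_demo = StandardScaler()
sc_y_demo = StandardScaler()
X_scaled = sc_X_demo.fit_transform(X_raw)
y_scaled = sc_y_demo.fit_transform(y_raw)

model = LinearRegression().fit(X_scaled, y_scaled.ravel())

# Scale the new input with the fitted feature scaler, then predict.
pred_scaled = model.predict(sc_X_demo.transform(np.array([[250.0]])))

# inverse_transform needs a 2-D array, so reshape the 1-D prediction first.
pred_reais = sc_y_demo.inverse_transform(pred_scaled.reshape(-1, 1))
print(pred_reais)
```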

In [None]:
print(X_train[144])
print(y_train)

# Training models on the Training set

In [None]:
##Linear Model
from sklearn.linear_model import LinearRegression
lin_regressor = LinearRegression()
lin_regressor.fit(X_train, y_train)

In [None]:
#Polynomial Model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X_train)
poly_regressor = LinearRegression()
poly_regressor.fit(X_poly, y_train)

In [None]:
#Random Forest
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
rf_regressor.fit(X_train, y_train.ravel())  # ravel() passes y as the 1-D array the regressor expects

In [None]:
#Decision Tree
from sklearn.tree import DecisionTreeRegressor
dt_regressor = DecisionTreeRegressor(random_state = 0)
dt_regressor.fit(X_train, y_train)

In [None]:
#SVR Model
from sklearn.svm import SVR
svr_regressor = SVR(kernel = 'rbf')
svr_regressor.fit(X_train, y_train.ravel())  # SVR expects a 1-D target array

# Predicting the Test set results

In [None]:
#Linear Prediction
y_pred_lin = lin_regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred_lin.reshape(len(y_pred_lin),1), y_test.reshape(len(y_test),1)),1))

In [None]:
#Polynomial Prediction
y_pred_poly = poly_regressor.predict(poly_reg.transform(X_test))
np.set_printoptions(precision=2)
print(np.concatenate((y_pred_poly.reshape(len(y_pred_poly),1), y_test.reshape(len(y_test),1)),1))

In [None]:
#Random Forest
y_pred_rf = rf_regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred_rf.reshape(len(y_pred_rf),1), y_test.reshape(len(y_test),1)),1))

In [None]:
#Decision Tree
y_pred_dt = dt_regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred_dt.reshape(len(y_pred_dt),1), y_test.reshape(len(y_test),1)),1))

In [None]:
#SVR Prediction
# X_test and y_test were already scaled above, so predict directly in the scaled space
y_pred_svr = svr_regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred_svr.reshape(len(y_pred_svr),1), y_test.reshape(len(y_test),1)),1))

# Evaluating the Models Performance

In [None]:
print('Multiple Linear Regression R2: ' + str(r2_score(y_test, y_pred_lin)))
print('Polynomial Regression R2: ' + str(r2_score(y_test, y_pred_poly)))
print('Random Forest Regression R2: ' + str(r2_score(y_test, y_pred_rf)))
print('Decision Tree Regression R2: ' + str(r2_score(y_test, y_pred_dt)))
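R2 alone does not say how far off the predictions are in money terms. An error metric in the target's own units, such as the mean absolute error, can complement it. The sketch below uses invented rent values in R$; on the real models the predictions would first have to be converted back from the scaled space.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true and predicted rents in R$ (values are made up).
y_true = np.array([1500.0, 2300.0, 800.0, 5000.0])
y_hat = np.array([1700.0, 2000.0, 1000.0, 4200.0])

# Average absolute miss per listing, in the same units as the rent.
mae = mean_absolute_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)
print('MAE (R$): ' + str(mae))
print('R2: ' + str(r2))
```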

According to the R2 score, my results are bad! I know! This is my first model and I need more tools to improve it. I'm asking you all to help me and guide my studies so I can do better in the next competitions.