# Problem description

We will try to classify rent in different cities of Brazil using categorical variables. This dataset contains 16079 houses to rent with 13 different features.
The rent classes that we're trying to predict do not exist in the original dataset. We will create our target variable using the rent amount variable.

# Uploading data (Kaggle)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
# Load the dataset into a pandas dataframe.
data = pd.read_csv('/kaggle/input/brasilian-houses-to-rent/houses_to_rent_v2.csv')

Note: To use this notebook in Google Colab, upload the data using the section below.

# Uploading data (Google Colab)


In [None]:

# Google file system
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
#extract the text file
#File='/drive/My Drive/Colab Notebooks/houses_to_rent_v2.csv'

In [None]:
import pandas as pd
# Load the dataset into a pandas dataframe.
#data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/houses_to_rent_v2.csv')

In [None]:
data

# Data Cleaning

Before starting to work with the data, we will proceed to clean the dataset. Although the data is relatively clean, we will add a rent category to classify the rent into: cheap, average, expensive and very expensive.

We check whether any of the attributes contain NaN values.

In [None]:
data.isna().sum()

From the output above, we can see that the dataset does not contain NaN values.

We will display the descriptive statistics of numerical variables:

In [None]:
data[['area', 'rooms', 'bathroom', 'parking spaces', 'floor' ,'hoa (R$)', 'rent amount (R$)', 'property tax (R$)', 'fire insurance (R$)', 'total (R$)']].describe()

Using the descriptive statistics of the rent amount variable, we will create our target variable: **rent-class**.
It categorizes the rent amount into: cheap, average, expensive and very expensive.

In [None]:
data['rent-class'] = pd.cut(x=data['rent amount (R$)'], bins=[450, 1530, 2661, 5000, 45000], labels=['cheap', 'average', 'expensive', 'very expensive'], include_lowest=True)
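To make the binning step concrete, here is a minimal sketch of how `pd.cut` assigns labels at the bin edges. The bin edges and labels are the ones used in the cell above; the rent values in `rents` are hypothetical and chosen only to hit the boundaries.

```python
import pandas as pd

# Hypothetical rent values chosen to sit on and around the bin edges
rents = pd.Series([450, 1530, 1531, 2661, 3000, 5000, 5001, 45000])

# Same bin edges and labels as in the notebook
bins = [450, 1530, 2661, 5000, 45000]
labels = ['cheap', 'average', 'expensive', 'very expensive']

# Intervals are closed on the right: (450, 1530], (1530, 2661], ...
# include_lowest=True makes the first interval also contain 450
classes = pd.cut(rents, bins=bins, labels=labels, include_lowest=True)
print(pd.DataFrame({'rent': rents, 'rent-class': classes}))

# Values outside the outer edges receive no class (NaN)
outside = pd.cut(pd.Series([400, 50000]), bins=bins, labels=labels, include_lowest=True)
print(outside)
```

Because any value below the first edge or above the last edge becomes NaN, it is worth re-checking for missing values in the new column if the edges are ever changed.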

Below, we list the unique values of the categorical variables:

In [None]:
data.animal.unique()

In [None]:
data.city.unique()

In [None]:
data.furniture.unique()

Finally, we display general information about the dataset.

In [None]:
data.info()

# Exploratory Data Analysis

Before proceeding with the training of machine learning models, we are interested in getting some insights about the data at hand. In particular, it would be interesting to know how different variables affect the price of the rent, to understand their potential quality as predictors.

We will be looking at the univariate distribution of variables. First, we will display the frequency of the target variable (rent-class).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
ax = data['rent-class'].value_counts().sort_values().plot(kind="barh")
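# Sum of all bar widths, used to convert each count into a percentage
total = sum(patch.get_width() for patch in ax.patches)
for patch in ax.patches:
    # Annotate each bar with its share of the dataset
    ax.text(patch.get_width() + .3, patch.get_y() + .20,
            str(round((patch.get_width() / total) * 100, 2)) + '%',
            fontsize=10, color='black')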
ax.grid(axis="x")
plt.suptitle('rent-class', fontsize=20)
plt.show()

We notice that each rent class represents approximately 1/4 of the whole dataset.
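The class shares can also be read off numerically rather than from the bar chart. The sketch below uses a short hypothetical label series in place of `data['rent-class']`; in the notebook the same call would be applied to that column.

```python
import pandas as pd

# Hypothetical labels standing in for data['rent-class']
rent_class = pd.Series(
    ['cheap', 'average', 'expensive', 'very expensive'] * 2 + ['cheap', 'average']
)

# normalize=True returns fractions instead of raw counts
shares = rent_class.value_counts(normalize=True).mul(100).round(2)
print(shares)
```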

### **Rent amount (R$) Distribution**



In [None]:

plt.figure(figsize=(15, 6))
plt.title('Rent amount (R$) distribution')
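# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the closest replacement
sns.histplot(data['rent amount (R$)'], kde=True, stat='density')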
plt.xticks(np.arange(data['rent amount (R$)'].min(), data['rent amount (R$)'].max(), step=3000));

The peaks of the graph show where the values of "Rent amount (R$)" are concentrated over the interval.

### **Rent amount (R$) distribution by rent class and city**

In [None]:
plt.figure(figsize=(15,6))
plt.title('Rent amount (R$) distribution by rent class and city')
sns.violinplot(x=data['city'], y=data['rent amount (R$)'], hue=data['rent-class'])

In general, violin plots can be considered a combination of a box plot and a kernel density plot. They provide information about the median (a white dot on the violin plot), the interquartile range (the black bar in the center of the violin) and the lower/upper adjacent values.
We chose this type of graph to observe the distribution of the rent amount and to compare it across cities and rent classes.

Here we compare the distribution of the rent amount for each group (city + rent class). For the cheap and average rent classes, rent amounts are concentrated around the median in all cities.
We can also observe that the very expensive rent class shows outliers in the city of Sao Paulo.
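The median and interquartile range drawn inside each violin can also be computed directly with a group-by. This is a minimal sketch on a hypothetical miniature frame (`toy`, with placeholder cities `A` and `B`); the column names follow the notebook, the values do not come from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset; column names follow the notebook
toy = pd.DataFrame({
    'city': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'rent amount (R$)': [1000, 1200, 1400, 1600, 2000, 3000, 4000, 9000],
})

# Quartiles per city: the quantities a violin plot marks in its center
summary = (
    toy.groupby('city')['rent amount (R$)']
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
summary.columns = ['q1', 'median', 'q3']
summary['iqr'] = summary['q3'] - summary['q1']
print(summary)
```

In the notebook, grouping by both `'city'` and `'rent-class'` would give one row per violin.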

We will delete the row with the highest rent amount, as it represents an outlier, before proceeding to the next analysis.

In [None]:
# Find the highest rent amount and the index of the row(s) holding it
k = max(data['rent amount (R$)'])
name = data[data['rent amount (R$)'] == k].index


In [None]:
data.drop(name, inplace=True)
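As an alternative sketch of the same step, `idxmax` returns the index label of the maximum directly. The frame `toy` below is hypothetical. Note one difference from the cells above: `idxmax` returns only the first row holding the maximum, whereas the boolean filter selects every row tied at the maximum.

```python
import pandas as pd

# Hypothetical rents; the second row plays the role of the outlier
toy = pd.DataFrame({'rent amount (R$)': [1200, 45000, 3000, 800]})

# Index label of the (first) row with the highest rent
outlier_index = toy['rent amount (R$)'].idxmax()

# drop returns a new frame; the original is left untouched here
cleaned = toy.drop(index=outlier_index)
print(cleaned)
```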

### **Relationship between rent and numerical variables**

In [None]:
sns.catplot(x ='bathroom', y ='rent amount (R$)', data = data, height=5, aspect=3)
plt.title("Relationship between rent and number of bathrooms", size=17)

From this plot, we see no clear relationship between the rent amount and the number of bathrooms in the house.

In [None]:
sns.catplot(x ='rooms', y ='rent amount (R$)', data = data, height=5, aspect=3)
plt.title("Relationship between rent and number of rooms", size=17)

Again, there is no clear relationship between the number of rooms and a high or low rent amount. This leads us to ask the following questions:


*   How many rooms do people ask for in a house to rent?
*   Is there a relationship between the number of rooms and the city?



In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='rooms', hue='city', data=data)

According to this graph, houses with 1 to 3 rooms are in high demand among people looking to rent, with the highest demand for one-room houses. Sao Paulo contains the majority of the houses regardless of the number of rooms.
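The counts behind such a grouped bar chart can be tabulated with `pd.crosstab`. The sketch below uses a hypothetical six-row frame; the column names follow the notebook.

```python
import pandas as pd

# Hypothetical listings; only the column names come from the notebook
toy = pd.DataFrame({
    'rooms': [1, 1, 2, 3, 1, 2],
    'city': ['Sao Paulo', 'Sao Paulo', 'Rio de Janeiro',
             'Sao Paulo', 'Rio de Janeiro', 'Sao Paulo'],
})

# One row per number of rooms, one column per city, cells hold listing counts
counts = pd.crosstab(toy['rooms'], toy['city'])
print(counts)
```

Passing `normalize='columns'` would instead give, for each city, the share of listings with each number of rooms, which makes cities of different sizes easier to compare.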

In [None]:
sns.catplot(x ='area', y ='rent amount (R$)', data = data, height=5, aspect=3)
plt.title("Relationship between rent and house area", size=17)

There is a low correlation between area and rent amount. Some small houses, probably with a single room, have a high rent amount, possibly because of their location (in the city center, for example).

**Correlation between fire insurance and rent**




To get a clear view of all the values, we use an interactive scatter plot that allows us to zoom in on the range of interest.

In [None]:
import plotly.express as px
fig = px.scatter(data, x="rent amount (R$)",
                y="fire insurance (R$)", 
                color="city"          
                 )
fig.update_traces(marker=dict(size=12,
                              line=dict(width=1, color='LightSkyBlue')),
                  selector=dict(mode='markers'))
fig.show(renderer='colab')

There is a high correlation between the fire insurance and the rent amount. As the fire insurance amount of a house increases, its rent increases.
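Statements about "low" or "high" correlation can be backed with a Pearson coefficient. The sketch below uses hypothetical values constructed so that fire insurance tracks rent closely and area does not; the column names follow the notebook, the numbers do not come from the dataset.

```python
import pandas as pd

# Hypothetical values: insurance roughly proportional to rent, area unrelated
toy = pd.DataFrame({
    'rent amount (R$)': [1000, 2000, 3000, 4000, 5000],
    'fire insurance (R$)': [14, 27, 40, 52, 66],
    'area': [80, 30, 120, 45, 70],
})

# Pearson correlation of every column with the rent amount
corr = toy.corr(method='pearson')['rent amount (R$)']
print(corr.round(3))
```

In the notebook the same call on the numerical columns of `data` would quantify the relationships seen in the plots.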

### **Insight about categorical data**

Do most houses accept animals or not? What is their distribution across the different cities?

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='animal', hue='city', data=data)

There are more houses that accept pets in the dataset than houses that don't. Sao Paulo and Rio de Janeiro have the largest counts.

**Are most houses in different cities furnished or not?**

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x ='furniture',  data = data)

More than 50% of the houses in the dataset are not furnished.

In [None]:
plt.figure(figsize=(15,6))
plt.title('Rent amount (R$) distribution by furniture and city')
sns.violinplot(x=data['furniture'], y=data['rent amount (R$)'], hue=data['city'])

# Predictive Analysis

After the exploratory data analysis, we proceed to fit our model.

In [None]:
# Select the independent variables (predictors) by column position
X = data.iloc[:,[0,1,2,3,6,7]].values
# Select the target variable (rent-class, the last column)
y = data.iloc[:,13].values

Encoding categorical variables as integer labels.

In [None]:
# Encode the categorical predictors (columns 0, 4 and 5 of X) as integer labels
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
X[:,4] = labelencoder_X.fit_transform(X[:,4])
X[:,5] = labelencoder_X.fit_transform(X[:,5])
# Encode the target classes as integers; class_names keeps the original labels
y,class_names = pd.factorize(y)
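To illustrate what the encoding step produces, the sketch below shows `LabelEncoder` on a hypothetical list of city names (`'City X'` is a placeholder). It also shows one-hot encoding with `pd.get_dummies` as an alternative that the notebook does not use; it is included only for comparison.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical city values; 'City X' is a placeholder name
cities = ['Sao Paulo', 'Rio de Janeiro', 'City X', 'Sao Paulo']

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ is sorted; each code is the position of the label in classes_
print(list(encoder.classes_))
print(list(codes))

# Alternative (not used in the notebook): one binary column per category
one_hot = pd.get_dummies(pd.Series(cities))
print(one_hot.shape)
```

Label encoding yields a single integer column, so a category with more than two values is not binary. Reusing one encoder object for several columns, as in the cell above, also overwrites `classes_` each time, so only the mapping of the last encoded column remains available.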

We will split our dataset into 2 sets: 80% for training and 20% for testing.

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
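The split above is random, so results change between runs. As a sketch of a reproducible variant, the example below fixes `random_state` and uses `stratify` to keep class proportions equal in both sets. The arrays `X_toy` and `y_toy` are hypothetical, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, 4 balanced classes
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1, 2, 3] * 5)

# random_state fixes the shuffle; stratify preserves the class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)
print(len(X_tr), len(X_te))
print(sorted(y_te.tolist()))
```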

Here, we fit a decision tree classifier on the training set.

In [None]:
# Fitting Classifier to the Training Set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy',max_depth=3)
classifier.fit(X_train, y_train)

In [None]:
# Predict on the training set to measure model performance
y_pred_train = classifier.predict(X_train)


We display the model accuracy and the classification metrics.

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report

accuracy = metrics.accuracy_score(y_train, y_pred_train)
print("Accuracy: {:.2f}".format(accuracy))

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
#print(confusion_matrix(y_train, y_pred_train))
print(classification_report(y_train, y_pred_train))

The decision tree model reaches an accuracy of 54% on the training set, which means that it does not predict the rent classes accurately.
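The accuracy reported above is measured on the training data. A sketch of how the held-out set could be scored as well is given below. It runs on synthetic data from `make_classification` (an assumption made to keep the example self-contained) with the same tree settings as the notebook. The seed values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem with 6 features, standing in for the rent data
X_syn, y_syn = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.20, random_state=0
)

# Same hyperparameters as the notebook's classifier
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
tree.fit(X_tr, y_tr)

# Compare accuracy on seen and unseen data
train_acc = accuracy_score(y_tr, tree.predict(X_tr))
test_acc = accuracy_score(y_te, tree.predict(X_te))
print("Train accuracy: {:.2f}".format(train_acc))
print("Test accuracy: {:.2f}".format(test_acc))
```

In the notebook this would amount to calling `classifier.predict(X_test)` and comparing the result with `y_test`.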