# House Prices: Different Data, Same Problem!

This is the same problem as the Kaggle competition at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview, but for a different location.
We therefore follow the same approach to data exploration.


![](https://storage.googleapis.com/kaggle-competitions/kaggle/5407/media/housesbanner.png)

Well, this is one of the famous cities:

![](https://i0.wp.com/www.mobileworldlive.com/wp-content/uploads/2019/12/Brazil.png?fit=650%2C400&ssl=1)

In my heart, I have always wanted to go to Brazil, work there, and watch some football games. I have always been on the Seleção's side at the World Cup, since I have never seen my own country play there.

![](https://i.pinimg.com/originals/21/ac/8f/21ac8f9241bff1948ff1d7c47ee74bc3.jpg)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
df = pd.read_csv('/kaggle/input/brasilian-houses-to-rent/houses_to_rent.csv')

In [None]:
df = df.drop(['Unnamed: 0'], axis = 1)

In [None]:
df.head()

# Turn "Sem info" and "Incluso" into "0"

We replace them with the string `'0'` rather than the number 0, so that the string cleaning applied later does not raise an AttributeError.

In [None]:
df['hoa'] = df['hoa'].replace('Sem info','0')
df['hoa'] = df['hoa'].replace('Incluso','0')

In [None]:
df['property tax'] = df['property tax'].replace('Sem info','0')
df['property tax'] = df['property tax'].replace('Incluso','0')
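As a small illustration of the replacement step, both placeholders can be mapped in one call by passing a dictionary to `Series.replace`. The values in this toy series are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy column mimicking the raw 'hoa' strings (made-up values, for illustration only).
hoa = pd.Series(['R$420', 'Sem info', 'Incluso', 'R$1,200'])

# Map both placeholders to the string '0' in a single call.
# Keeping '0' as a string means later string methods still work on every row.
hoa_clean = hoa.replace({'Sem info': '0', 'Incluso': '0'})

print(hoa_clean.tolist())
```

The result is equivalent to chaining two `replace` calls, just with the mapping kept in one place.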

In [None]:
df['hoa'].value_counts()

In [None]:
df['rent amount'].value_counts()

In [None]:
df['property tax'].value_counts()

In [None]:
df['fire insurance'].value_counts()

In [None]:
df['total'].value_counts()

## Cleaning the Real Sign (R$)

After that, note that the hoa, rent amount, property tax, fire insurance, and total columns are all amounts in Brazilian reais (R$) stored as strings.

In [None]:
def extract_value_from(value):
    """Strip the 'R$' prefix and thousands separators, then convert to float."""
    out = value.replace('R$', '').replace(',', '')
    return float(out)

In [None]:
# Apply the same cleaning to every monetary column.
money_columns = ['hoa', 'rent amount', 'property tax', 'fire insurance', 'total']
for col in money_columns:
    df[col] = df[col].apply(extract_value_from)
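The same cleaning can also be written with pandas' vectorized string methods instead of a row-by-row `apply`. This is only a sketch on a toy frame: the helper name `to_number` and the sample values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def to_number(series):
    """Strip 'R$' and thousands separators from a string column, return floats."""
    cleaned = (
        series.astype(str)
        .str.replace('R$', '', regex=False)  # literal match, not a regex
        .str.replace(',', '', regex=False)
    )
    return pd.to_numeric(cleaned).astype(float)

# Toy data with made-up amounts in the same string format.
demo = pd.DataFrame({
    'hoa': ['R$420', '0', 'R$1,200'],
    'total': ['R$3,190', 'R$1,019', 'R$10,500'],
})
for col in ['hoa', 'total']:
    demo[col] = to_number(demo[col])

print(demo.dtypes)
```

Passing `regex=False` matters here because `$` is a special character in regular expressions.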

In [None]:
df.head()

# Total Price Distribution

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df.describe()

In [None]:
print("Skewness: ", df['total'].skew())
print("Kurtosis: ", df['total'].kurt())

Okay then, we need to remove outliers using the interquartile range (IQR).

In [None]:
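# Restrict to numeric columns; recent pandas versions raise on text columns otherwise.
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print(IQR)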

### Rows with a total price above 5622.5 are removed

In [None]:
df = df[df['total']<= 5622.5]
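The cutoff above is hard-coded. A common convention is Tukey's rule, which treats anything above Q3 + 1.5 × IQR as an outlier. Whether 5622.5 was obtained exactly this way is an assumption. The helper name `iqr_upper_fence` and the toy values below are hypothetical, so treat this as a sketch of how the threshold could be computed rather than typed in.

```python
import pandas as pd

def iqr_upper_fence(values, k=1.5):
    """Return Q3 + k * IQR for a numeric Series (Tukey's upper fence)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)

# Toy totals with one obvious outlier (made-up values).
total = pd.Series([1000.0, 2000.0, 3000.0, 4000.0, 50000.0])

upper = iqr_upper_fence(total)
kept = total[total <= upper]

print("upper fence:", upper)
print("rows kept:", len(kept), "of", len(total))
```

On the real data, the equivalent would be `df[df['total'] <= iqr_upper_fence(df['total'])]`, which keeps the filter in sync with the data if the file changes.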

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
plt.figure(figsize=(10,10))
sns.set()
sns.kdeplot(df['total'])
plt.title('Total Price KDE')

## What we learn from the total price:
1. It is positively skewed
2. It deviates from a normal distribution
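To make these two observations concrete, the sketch below compares skewness and kurtosis for a synthetic, roughly normal sample and a synthetic right-skewed one. Both samples are generated here purely for illustration, and the distribution parameters are arbitrary assumptions. A normal sample gives values near 0 for both statistics, while a long right tail pushes both well above 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples, not the rental data.
normal_sample = pd.Series(rng.normal(loc=3000, scale=500, size=5000))
skewed_sample = pd.Series(rng.lognormal(mean=8, sigma=0.6, size=5000))

for name, s in [('normal', normal_sample), ('right-skewed', skewed_sample)]:
    # pandas reports excess kurtosis, so a normal distribution is close to 0.
    print(f"{name}: skew={s.skew():.2f}, kurtosis={s.kurt():.2f}")
```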

In [None]:
plt.figure(figsize=(10,10))
sns.set()
sns.kdeplot(df['fire insurance'], color = 'r')
plt.title('Fire Insurance Price KDE')

In [None]:
plt.figure(figsize=(10,10))
sns.set()
sns.kdeplot(df['property tax'], color = 'g')
plt.title('Property Tax Price KDE')

In [None]:
plt.figure(figsize=(10,10))
sns.set()
sns.kdeplot(df['rent amount'], color = 'c')
plt.title('Rent Price KDE')

In [None]:
plt.figure(figsize=(10,10))
sns.set()
sns.kdeplot(df['hoa'], color = 'y')
plt.title('HOA Price KDE')

## Total price and other price components

In [None]:
# Recent seaborn versions require x and y to be passed as keywords.
sns.jointplot(x='total', y='fire insurance', data=df, kind="hex", color="r")

In [None]:
sns.jointplot(x='total', y='property tax', data=df, kind="hex", color="g")

In [None]:
sns.jointplot(x='total', y='rent amount', data=df, kind="hex", color="b")

In [None]:
sns.jointplot(x='total', y='hoa', data=df, kind="hex", color="y")

# Relation

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x="city", y="total", palette=["m", "g"], data=df)
plt.title('City and Total Price')

City 1 tends to be more expensive.

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x="rooms", y="total", palette=["r", "b"], data=df)
plt.title('Room and Total Price')

In [None]:
df.head()

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x="bathroom", y="total", palette=["c", "gold"], data=df)
plt.title('Bathroom and Total Price')

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x="parking spaces", y="total", palette=["m", "silver"], data=df)
plt.title('Parking Spaces and Total Price')

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x="floor", y="total", palette=["m", "silver"], data=df)
plt.title('Floor and Total Price')

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x="animal", y="total", palette=["m", "silver"], data=df)
plt.title('Animal and Total Price')

Rentals that allow animals tend to be more expensive.
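Observations read off a boxplot can be checked numerically by comparing group medians, since the median is the line drawn inside each box. The sketch below uses a toy frame with made-up values. The category labels are placeholders and may not match the labels in the real `animal` column.

```python
import pandas as pd

# Toy data (made-up totals and placeholder category labels).
demo = pd.DataFrame({
    'animal': ['acept', 'acept', 'not acept', 'not acept', 'acept'],
    'total': [3000.0, 2500.0, 1800.0, 2200.0, 4000.0],
})

# Count and median of 'total' per category.
summary = demo.groupby('animal')['total'].agg(['count', 'median'])
print(summary)
```

The same pattern applies to any of the categorical columns plotted above, for example `df.groupby('city')['total'].median()`.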

# Correlation

In [None]:
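# Correlation is only defined for numeric columns, so select those first.
cor = df.select_dtypes(include='number').corr()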
f, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(cor, annot=True, square=True);
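A heatmap can be hard to read when the main interest is a single target. As a complement, the sketch below ranks the numeric features by their correlation with `total`. The toy frame and its values are made up for illustration. Only the column names mirror the dataset.

```python
import pandas as pd

# Toy data with one text column to show that it is excluded (made-up values).
demo = pd.DataFrame({
    'area': [50, 80, 120, 200, 35],
    'rooms': [1, 2, 3, 4, 1],
    'animal': ['acept', 'not acept', 'acept', 'acept', 'not acept'],
    'total': [1500.0, 2300.0, 3400.0, 5200.0, 1200.0],
})

# Pearson correlation on numeric columns only.
cor = demo.select_dtypes(include='number').corr()

# Correlation of each feature with 'total', strongest first.
ranked = cor['total'].drop('total').sort_values(ascending=False)
print(ranked)
```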