># **Rent analysis and prediction with PyTorch and scikit-learn**

![](https://images.unsplash.com/photo-1483729558449-99ef09a8c325?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=600&q=60)

> Let's first import our datasets; by the end of this notebook we will hopefully be able to estimate the prices of the houses in that image

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.


* We will be using Plotly for plotting our graphs
* It is a simple, easy-to-use Python library; here is a link: [Plotly](https://plotly.com/python/)

In [None]:
import plotly.express as px  

### We have two CSV files here; we will load each of them into a pandas DataFrame

In [None]:
df1 = pd.read_csv("/kaggle/input/brasilian-houses-to-rent/houses_to_rent.csv")
df2 = pd.read_csv("/kaggle/input/brasilian-houses-to-rent/houses_to_rent_v2.csv")

Checking real quick what kind of data we have 

In [None]:
df1.head()

In [None]:
df2.head()

## Looks like we have two dataframes: in the first one (df1) the city column only says whether a property is in a city or not, while the second one (df2) has the names of specific cities

In [None]:
#hoa =  Homeowners association tax

Let's first check df1, where the data is divided into city and not a city

1. First of all we need to remove the "R\$", which is the symbol of the Brazilian real
2. We need to replace the strings in the columns with numbers, e.g. acept can be 1 and not acept can be 0
3. There are also some ' - ' entries in our floor column; we will assume them to be the ground floor, i.e. floor = 0
4. Then there is a " , " in the total column; we need to remove it to convert the values to int

In [None]:
df1_new=df1.replace(regex=[r'\bR\$'],value='')
df1_new = df1_new.replace('acept',1)
df1_new.replace('not acept',0,inplace=True)
df1_new.replace('furnished',1,inplace=True)
df1_new.replace('not furnished',0,inplace=True)
df1_new.replace('-',0,inplace=True)
df1_new.replace(regex=[r'\b,'],value='',inplace=True)
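As a hedged, column-by-column sketch of the four steps listed above, the toy frame below cleans each column explicitly instead of calling `replace` on the whole frame. The rows are made up for illustration, and the `animal` column name is an assumption about where the acept / not acept values live.

```python
import pandas as pd

# Toy rows invented for illustration; column names are assumed to mirror df1
toy = pd.DataFrame({
    "floor": ["-", "3"],
    "animal": ["acept", "not acept"],
    "furniture": ["furnished", "not furnished"],
    "total": ["R$8,000", "R$1,250"],
})

clean = toy.copy()
# Steps 1 and 4: strip the currency symbol and the thousands separator
clean["total"] = clean["total"].str.replace(r"[R$,]", "", regex=True).astype(int)
# Step 2: map the string categories to 1/0
clean["animal"] = clean["animal"].map({"acept": 1, "not acept": 0})
clean["furniture"] = clean["furniture"].map({"furnished": 1, "not furnished": 0})
# Step 3: treat '-' as the ground floor
clean["floor"] = clean["floor"].replace("-", "0").astype(int)

print(clean)
```

Targeting one column at a time makes it harder to accidentally rewrite a value in an unrelated column.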

### Divide the new dataframe in two parts , where city = 1 and city = 0

In [None]:
df1_city = df1_new[df1_new['city']==1]
df1_notcity = df1_new[df1_new['city']==0]

Checking that we did everything right

In [None]:
df1_city.head()

In [None]:
df1_notcity.head()

We don't need the index column, so we drop it

In [None]:
df1_city = df1_city.copy().drop(columns=['Unnamed: 0'])

In [None]:
df1_notcity = df1_notcity.copy().drop(columns=['Unnamed: 0'])

In [None]:
print(df1_city.dtypes)

### This is why we removed the " , ": so that we can convert the number to int. If the comma is left in, the value stays a string and the conversion to int fails with an error

In [None]:
df1_city['total']=df1_city['total'].astype(int)
df1_notcity['total']=df1_notcity['total'].astype(int)

In [None]:
df1_city.head()

In [None]:
df1_notcity.head()

* ### We will draw a scatter plot here: we want to know how the total amount changes with the area of our property, and we will also check whether the number of rooms affects the total amount.
* ### Plotting is very simple with Plotly: just pass a dataframe, what you want on the x and y axes, and how you want to use color. Here we color the points according to the number of rooms.
* ### Plotly takes care of everything for you. We have two plots, one for city and one for not in city

In [None]:
fig = px.scatter(df1_notcity,x='area',y='total',trendline="lowess",color='rooms',title="Not in city")
fig1 = px.scatter(df1_city,x='area',y='total',trendline="lowess",color='rooms',title="In city")

1. What we are basically trying to do below is plot these two graphs side by side. This is just a very lazy method of doing it.
2. This is the correct method: [plotly subplots](https://plotly.com/python/subplots/)


In [None]:
from plotly.subplots import make_subplots
from plotly.offline import  init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
trace1 = fig['data'][0]
trace2 = fig1['data'][0]

fig2 = make_subplots(rows=1, cols=2, shared_xaxes=False,subplot_titles=("Not in city","In city"))
fig2.add_trace(trace1, row=1, col=1)
fig2.add_trace(fig['data'][1], row=1, col=1)
fig2.add_trace(trace2, row=1, col=2)
fig2.add_trace(fig1['data'][1], row=1, col=2)
fig2.update_layout(
    title="Area vs Total Amount \t Color: Rooms",
    xaxis_title="Area",
    yaxis_title="Total Amount",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="#7f7f7f",
        
        
    ))
iplot(fig2)

1. You can use your mouse to zoom in on the region you want to see *(if you see only a corner filled with dots, zoom in with your mouse)*
2. Hover the mouse over a point to see its value
3. Here we see an interesting thing about Plotly: the color scale follows the number of rooms, so the more rooms, the brighter the color gets.
4. We can see that in the \[*Not a city*\] graph there aren't many places with a large number of rooms .... *well, not really anything groundbreaking here* ... but in the \[*city*\] graph we can clearly see the color change: as the area increases, the likelihood of having more rooms increases, so the colors get brighter, and the total rent amount increases too ... *again, nothing worth the Nobel Prize*, but a simple graph like this makes it very easy to understand.
5. More expensive houses have more rooms

#### Another side-by-side plot, but now we want to know whether the number of bathrooms in our property means something.
> *I mean, obviously people are going to buy a house with a bathroom, but we want to know whether more bathrooms mean more money ..... just for fun* 

#### Also, going through the dataset I found that in the hoa and property tax columns some of the values were not integers.<br>
#### Column hoa had some values = 'Sem info' and column property tax had some values = 'Incluso'<br>
#### We won't be using these columns, because we only need to analyze the total amount: in the end that is what we want to know, how much money we need to pay.
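If those columns were needed after all, one possible way to handle the placeholders is sketched below. The values are toy data invented for illustration, and `to_number` is a hypothetical helper; `pd.to_numeric(..., errors="coerce")` turns anything non-numeric into NaN.

```python
import pandas as pd

# Toy values invented for illustration; 'Sem info' and 'Incluso' are the
# placeholders mentioned above
hoa = pd.Series(["R$420", "Sem info", "R$0"])
tax = pd.Series(["R$1,000", "Incluso", "R$122"])

def to_number(col):
    """Strip 'R$' and ',' then convert; anything non-numeric becomes NaN."""
    stripped = col.str.replace(r"[R$,]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")

hoa_num = to_number(hoa)
tax_num = to_number(tax)
print(hoa_num.isna().sum(), tax_num.isna().sum())
```

Whether a NaN should then be dropped, or filled with 0 for 'Incluso', is a modelling decision that this sketch leaves open.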

In [None]:
df1[df1['hoa']=='Sem info'].head(3)

In [None]:
df1[df1['property tax']=='Incluso'].head(3)

In [None]:
fig0 = px.scatter(df1_notcity,x='area',y='total',trendline="lowess",color='bathroom',title="Not in city")
fig10 = px.scatter(df1_city,x='area',y='total',trendline="lowess",color='bathroom',title="In city")
trace12 = fig0['data'][0]
trace21 = fig10['data'][0]

fig21 = make_subplots(rows=1, cols=2, shared_xaxes=False,subplot_titles=("Not in city","In city"))
fig21.add_trace(trace12, row=1, col=1)
fig21.add_trace(fig0['data'][1], row=1, col=1)
fig21.add_trace(trace21, row=1, col=2)
fig21.add_trace(fig10['data'][1], row=1, col=2)
fig21.update_layout(
    title="Area vs Total Amount \t Color: Bathroom",
    xaxis_title="Area",
    yaxis_title="Total Amount",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="#7f7f7f",
        
        
    ))
iplot(fig21)

Again we see something similar to our "rooms" graph ... more expensive houses have more bathrooms

In [None]:
fig01 = px.scatter(df1_notcity,x='area',y='total',trendline="lowess",color='furniture',title="Not in city")
fig101 = px.scatter(df1_city,x='area',y='total',trendline="lowess",color='furniture',title="In city")
trace121 = fig01['data'][0]
trace211 = fig101['data'][0]

fig211 = make_subplots(rows=1, cols=2, shared_xaxes=False,subplot_titles=("Not in city","In city"))
fig211.add_trace(trace121, row=1, col=1)
fig211.add_trace(fig01['data'][1], row=1, col=1)
fig211.add_trace(trace211, row=1, col=2)
fig211.add_trace(fig101['data'][1], row=1, col=2)
fig211.update_layout(
    title="Area vs Total Amount \t Color: Furniture",
    xaxis_title="Area",
    yaxis_title="Total Amount",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="#7f7f7f",
        
        
    ))
iplot(fig211)

## Now we move on to the other dataframe, the one with specific city names. Here we will try to predict the total price of a house

Let's check how many unique cities we have

In [None]:
uc = df2['city'].unique()

In [None]:
print(uc)

We apply the same replacements as before

In [None]:
df2 = df2.replace('acept',1)
df2.replace('not acept',0,inplace=True)
df2.replace('furnished',1,inplace=True)
df2.replace('not furnished',0,inplace=True)
df2.replace('-',0,inplace=True)

In [None]:
df2.head()

### Now let's see a plot, but colored according to the different cities

In [None]:
fig3 = px.scatter(df2,x='area',y='total (R$)',trendline="lowess",color='city',title="Area vs Total cost per city")
#fig3.show()
iplot(fig3)

### Again, if you see tiny dots, zoom in. You can also click on the legend to enable or disable the points of a particular city.
1. Looks like São Paulo is the most expensive city 
2. We don't have much data about Porto Alegre
3. Also, for the same area São Paulo and Rio de Janeiro are more expensive than Belo Horizonte.

### Here we also see that there are some outliers: most of our total prices are under 35K, except a few. Linear regression models are sometimes sensitive to outliers. We also seem to have more data for just one city, São Paulo

Change the city names to numbers, and also store the name-to-number mapping in a dict

In [None]:
city_dict = {}
for i in range(0,len(uc)):
    df2.replace(uc[i],i+1,inplace=True)
    city_dict[uc[i]]=i+1
    print("Now city {} is : {}".format(uc[i],i+1))
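Because `df2.replace(...)` called without a column looks at every column of the frame, a sketch that restricts the mapping to the city column might look like this. The toy frame is made up for illustration and `city_to_code` is a hypothetical name.

```python
import pandas as pd

# Toy frame invented for illustration
toy = pd.DataFrame({"city": ["São Paulo", "Campinas", "São Paulo", "Porto Alegre"]})

# Number the cities 1..n in order of first appearance, as the loop above does
city_to_code = {name: code for code, name in enumerate(toy["city"].unique(), start=1)}
toy["city"] = toy["city"].map(city_to_code)

print(city_to_code)
print(toy["city"].tolist())
```

Keeping the dict around makes it easy to translate a code back to a city name later.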

### In the above plot it looked like there were different numbers of data samples per city

We want to count the occurrences of each city in the df2\['city'\] column, so we use value_counts() 

In [None]:
print("Total length {}\nNumber of Examples per city \n{}".format(len(df2),df2['city'].value_counts()))

In [None]:
city_dict

### So what is the problem? I think it is that we have a dataset with an unequal number of examples per city. As we saw above, São Paulo has 5887 examples out of 10692 in total, which is more than half, while Campinas has the fewest, only 853

### Also let's check the mean total amount we have to spend in each city

In [None]:
mean_total=[]
for i in range(0,len(uc)):
    mean = df2[df2['city']==city_dict[uc[i]]]['total (R$)'].mean()
    mean_total.append(mean)
    print("Mean amount for city {}:{} is {:.2f}".format(i+1,uc[i],mean))
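The same per-city means and counts can also be obtained in one step with `groupby`. This is a sketch on toy numbers invented for illustration, not the notebook's data.

```python
import pandas as pd

# Toy numbers invented for illustration
toy = pd.DataFrame({
    "city": [1, 1, 2, 2],
    "total (R$)": [5000, 7000, 2000, 4000],
})

# One row per city with the mean total and the number of examples
summary = toy.groupby("city")["total (R$)"].agg(["mean", "count"])
print(summary)
```

Seeing the count next to the mean helps judge how much each average can be trusted when the cities are unevenly represented.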
    

* Well, as expected, São Paulo has the highest average = 6380, followed closely by Belo Horizonte at 6315
* The least expensive is Porto Alegre at nearly 2990

In [None]:
xdf = df2[['city','area','rooms','bathroom','floor','furniture']]
xdf.describe()

### Now let's predict the prices of these properties

In [None]:
xdf1= xdf.to_numpy(dtype='float')
y = df2['total (R$)'].to_numpy(dtype='float')

Standard train test split

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(xdf1,y,random_state=3,test_size=200)

### Regression:

In [None]:
from sklearn import svm
clf = svm.SVR(kernel='linear')
clf.fit(x_train,y_train)

The accuracy of a regression model cannot be measured like that of a classification model, because it is very unlikely to output exactly the same number as the label or true value. Instead we use the R² score (coefficient of determination): 1 means the predictions are perfect, 0 means the model captures no more variation than always predicting the mean, and the score can even be negative for a model that does worse than that.
Here is a good article that explains it: [r2.](https://www.datasciencecentral.com/profiles/blogs/regression-analysis-how-do-i-interpret-r-squared-and-assess-the)
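To make the score concrete, here is a small sketch that computes R² by hand as 1 - SS_res / SS_tot and compares it with `sklearn.metrics.r2_score`. The numbers are toy values invented for illustration and `r2_manual` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy values invented for illustration
y_true = [1000.0, 2000.0, 3000.0]
good = [1100.0, 1900.0, 3000.0]   # close to the truth
bad = [3000.0, 2000.0, 1000.0]    # worse than predicting the mean

print(r2_manual(y_true, good), r2_score(y_true, good))
print(r2_manual(y_true, bad), r2_score(y_true, bad))
```

Note that `r2_score` expects the true values first and the predictions second; swapping them generally gives a different number.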


In [None]:
clf.score(x_test,y_test)

Our R² score is 0.58, which is okay-ish ...... it is just very very okay. Also, given that we have far more data about one particular city (São Paulo, which is city number 1), it is again okaaaay.

### But now let's try to build a neural network with PyTorch.
#### *I mean, why else use deep learning if not to complicate simple tasks* 

In [None]:
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
from torch import Tensor
from torch.optim import Adam

In [None]:
xtrain =torch.from_numpy(x_train)
ytrain = torch.from_numpy(y_train)
xtest =torch.from_numpy(x_test)
ytest = torch.from_numpy(y_test)

In [None]:
dataset = TensorDataset(xtrain,ytrain)
dataset_test = TensorDataset(xtest,ytest)
loader = DataLoader(dataset,batch_size = 5)
test_loader = DataLoader(dataset_test,batch_size=5)

Up till now everything was pretty standard: we imported the libraries and then made a tensor dataset.
Now we are going to create a model 
1. In PyTorch you need to define your model in a class 
  1. We have 6 input features, so the first layer, self.layer1, takes a vector of (6,1) as input and outputs a vector of (25,1)
  2. Then we define the hidden layers similarly with the help of "Linear(input,output)", which is just a layer of neurons
  3. The final output layer has size = 1 
2. A forward function is a must ... it defines how the feed-forward step will happen.
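To see how the shapes flow through such a stack, here is a hedged sketch of the same layer sizes written with `nn.Sequential`. It assumes six input features, matching the six columns selected in `xdf`, and feeds in a dummy batch of zeros.

```python
import torch
import torch.nn as nn

# Same layer sizes as described above, written with nn.Sequential (a sketch,
# not the notebook's class)
net = nn.Sequential(
    nn.Linear(6, 25), nn.ReLU(),
    nn.Linear(25, 25), nn.ReLU(),
    nn.Linear(25, 25), nn.ReLU(),
    nn.Linear(25, 1),
).double()

# Dummy batch: 5 examples with 6 features each
batch = torch.zeros(5, 6, dtype=torch.float64)
out = net(batch)
print(out.shape)  # one prediction per example: torch.Size([5, 1])
```

PyTorch layers work on batches, so the leading dimension is the batch size and only the last dimension has to match the layer's input size.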

In [None]:
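class NN(nn.Module):
    def __init__(self):
        super(NN,self).__init__()
        self.layer1 = nn.Linear(6,25)
        self.layer2 = nn.Linear(25,25)
        self.layer3 = nn.Linear(25,25)
        self.layer4 = nn.Linear(25,1)
        # ReLU has no parameters, so one instance can be reused after each hidden layer
        self.relu = nn.ReLU()
    def forward(self,n):
        # The activation must be applied here; without it the stack is just one linear map
        out = self.relu(self.layer1(n))
        out = self.relu(self.layer2(out))
        out = self.relu(self.layer3(out))
        out = self.layer4(out)  # no activation on the regression output
        return out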

* We use the loss function L1Loss() (mean absolute error), which is more robust against outliers in the dataset than a squared-error loss.
* The optimizer is Adam.
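One way to see why the absolute error is less affected by outliers: the single constant that minimizes the mean absolute error is the median, while the constant that minimizes the mean squared error is the mean. The sketch below checks this with a grid search on toy prices invented for illustration.

```python
import numpy as np

# Toy prices invented for illustration; the last one is an outlier
prices = np.array([2000.0, 3000.0, 4000.0, 5000.0, 100000.0])
candidates = np.arange(0.0, 100001.0, 100.0)

# Average loss of predicting the same constant c for every example
l1 = np.array([np.mean(np.abs(prices - c)) for c in candidates])
l2 = np.array([np.mean((prices - c) ** 2) for c in candidates])

best_l1 = candidates[np.argmin(l1)]  # lands on the median
best_l2 = candidates[np.argmin(l2)]  # lands on the mean
print(best_l1, np.median(prices))
print(best_l2, prices.mean())
```

The outlier drags the squared-error optimum far above four of the five prices, while the absolute-error optimum stays in the middle of them.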

In [None]:
#learning_rate = 0.01
model = NN()
model = model.double()
model=model.cuda()
criterion = nn.L1Loss()
criterion.cuda()
optimizer = Adam(model.parameters(),lr=0.1)

In [None]:
total_steps = len(loader)
loss_list = []
num_epochs=5
total_train=0

Importing r2 score 'cause that's how we measure them regression models

In [None]:
from sklearn.metrics import r2_score

### This is our training loop :
#### Looks complicated but is very simple
1. We take values from our tensor loader, i.e. the data and the label or true value
2. We pass the data through our model<br>
(We append to the list o_list[] because our batch size is 5, so the model returns five outputs, one per input; we append them to a list of predicted outputs and also append the labels)
3. Now that our model has produced the output, we need to calculate the loss, so we pass it to our loss criterion 
4. The loss is calculated ... we need to propagate it backward, i.e. backpropagation, which computes the gradients.
5. Based on those gradients the optimizer takes a step ... hopefully in the right direction, towards the global minimum.
6. Print the output of each epoch

In [None]:
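model.train()

for epoch in range(0,num_epochs):
    o_list = []  # predictions collected over the epoch
    ll = []      # matching true values
    for i,(data,label) in enumerate(loader):
        outp = model(data.cuda())  # forward pass
        # detach and move to the CPU so sklearn receives plain floats
        o_list.extend(outp.detach().cpu().flatten().tolist())
        ll.extend(label.tolist())
        loss = criterion(outp,label.reshape(-1,1).cuda())  # L1 loss for this batch
        loss_list.append(loss.item())  # store a float, not the whole graph
        optimizer.zero_grad()  # clear gradients from the previous step
        loss.backward()        # backpropagation
        optimizer.step()       # update the weights
        total_train+=1
    print("-------------------* Output {} *------------------------".format(epoch+1))
    # r2_score expects the true values first, then the predictions
    print("Total steps {}/{}\nLoss {}\nR2 score {}".format(i+1,total_steps,loss.item(),r2_score(ll,o_list)))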

### Let us test our NN on new unseen data

In [None]:
o_list=[]
ll=[]
model.eval()
with torch.no_grad():  # no gradients are needed for evaluation
    for i,(data,label) in enumerate(test_loader):
        outp = model(data.cuda())
        o_list.extend(outp.cpu().flatten().tolist())
        ll.extend(label.tolist())
print("-------------------* Output *------------------------")
# r2_score expects the true values first, then the predictions
print("Total steps {}/{}\nR2 score {}".format(i+1,len(test_loader),r2_score(ll,o_list)))

Well, the R² score is definitely much worse than SVR's ... but by fine-tuning the hyperparameters a higher score can probably be achieved. 

In [None]:
svm_pred = []
for i in x_test:
    pred = clf.predict(i.reshape(1,-1))
    svm_pred.append(pred.item())

## Let's compare the outputs of SVR and the neural network with the true values, with the help of our good old friend Plotly

In [None]:
fig4 = px.line(x=range(0,200),y=ll)
fig4.add_scatter(y=svm_pred,name="SVR Prediction")
fig4.add_scatter(y=o_list,name="Neural Network")
fig4.update_layout(
    title="True vs Predicted",
    xaxis_title="Example number",
    yaxis_title="Total Amount R$",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="#7f7f7f",))
        
iplot(fig4)        

### The blue line is the true price

### Looks like our neural network is good at predicting high values but not so good at lower ones. SVR really does a decent job of giving an output value ... it is not perfect, and it shouldn't be, because we only want to give an estimate; real-life property deals are influenced by many factors, ranging from the real estate agent's abilities to people's faith in their horoscope.

We can also print each value and check the difference manually

In [None]:
for i in range(0,len(ll)):
    # o_list holds the neural network outputs, svm_pred the SVR outputs
    print("Output of neural network: {:.2f} , Output of SVR: {:.2f} ,  True: {}".format(o_list[i],svm_pred[i],ll[i]))