**1. Importing Libraries**

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv('../input/brasilian-houses-to-rent/houses_to_rent.csv')


In [None]:
print(data.shape)
data.head(10)

2. Dropping the first column, which duplicates the index

In [None]:
data.drop(data.columns[0], axis=1, inplace=True)

In [None]:
data.head(13)

3. Converting strings into numerical values

In [None]:
# Treat the '-' placeholder as floor 0
data['floor'] = data['floor'].replace(to_replace='-', value=0)

In [None]:
# Encode pet acceptance as 0/1 (the spelling 'acept' comes from the dataset)
data['animal'] = data['animal'].replace(to_replace='not acept', value=0)
data['animal'] = data['animal'].replace(to_replace='acept', value=1)

In [None]:
# Encode furnishing as 0/1
data['furniture'] = data['furniture'].replace(to_replace='not furnished', value=0)
data['furniture'] = data['furniture'].replace(to_replace='furnished', value=1)

In [None]:
# Strip the currency prefix and the thousands separator from the money columns
for col in ['hoa', 'rent amount', 'property tax', 'fire insurance', 'total']:
    data[col] = data[col].replace(to_replace=r'R\$', value='', regex=True)
    data[col] = data[col].replace(to_replace=',', value='', regex=True)

4. Getting rid of Sem info & Incluso from the data set

These placeholders ("Sem info" means "no information", "Incluso" means "included") have to be replaced before the type conversion, because they cannot be parsed as integers.

In [None]:
data['hoa'] = data['hoa'].replace(to_replace='Sem info', value='0')

In [None]:
data['hoa'] = data['hoa'].replace(to_replace='Incluso', value='0')
data['property tax'] = data['property tax'].replace(to_replace='Incluso', value='0')

In [None]:
data.isin(['Sem info']).any()

In [None]:
data.isin(['Incluso']).any()

5. Converting data types into int64

The scikit-learn models used below expect numeric input, so every column is converted to integers.

In [None]:
data = data.astype(dtype=np.int64)
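
The string clean-up and integer conversion can be checked end to end on a tiny frame. This is a minimal sketch: the `toy` frame and its values are made up for illustration, and only the column names and placeholder strings are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the raw string formats handled in the notebook
toy = pd.DataFrame({
    'hoa': ['R$1,200', 'Sem info', 'Incluso'],
    'property tax': ['R$85', 'Incluso', 'R$1,050'],
})

for col in ['hoa', 'property tax']:
    # Remove the currency prefix and the thousands separator
    toy[col] = toy[col].replace(to_replace=r'R\$', value='', regex=True)
    toy[col] = toy[col].replace(to_replace=',', value='', regex=True)
    # Map the text placeholders to '0' so the column can be parsed as integers
    toy[col] = toy[col].replace({'Sem info': '0', 'Incluso': '0'})

# Only now is the conversion safe: every remaining value is a digit string
toy = toy.astype(np.int64)
print(toy)
print(toy.dtypes)
```

Running `astype` before the placeholders are replaced would raise a `ValueError`, which is why the order of the steps matters.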

6. Shuffle the data so that row order does not affect model performance

In [None]:
data = data.sample(frac=1).reset_index(drop=True)

In [None]:
y = data['city']
X = data.drop('city', axis=1)

**7. Modelling**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.preprocessing import MinMaxScaler

7.1 Normalize values in dataset

Before we split the data into train & test sets, we want to make sure the features are prepared for modelling.

To help the models perform well, we normalize every feature to the range 0 to 1.
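
As a rough illustration of what this normalization does: min-max scaling maps each column with x' = (x - min) / (max - min). The sketch below applies that formula by hand to a small made-up array (`X_demo` and its values are hypothetical) and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: two columns on very different scales
X_demo = np.array([[50.0, 1000.0],
                   [100.0, 3000.0],
                   [150.0, 8000.0]])

# Manual min-max scaling, computed column by column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X_demo)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

After scaling, both columns span the same range, so no feature dominates simply because it is measured in larger numbers.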

In [None]:
# Normalizing values in dataset

scaler = MinMaxScaler()  # create the scaler object
scaler.fit(X)  # learn the min and max of each column
X = scaler.transform(X)  # rescale every column of X to the range 0-1

In [None]:
pd.DataFrame(X)

7.2 Perform split into train & test

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
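
If the labels are imbalanced, a purely random split can leave the test set with a somewhat different class ratio than the training set. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`. A minimal sketch on made-up data (`X_demo`, `y_demo`, and `random_state=0` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 80 % positive labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 80 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    train_size=0.8,
    stratify=y_demo,   # keep the class ratio the same in both parts
    random_state=0,    # fixed seed so the split is reproducible
)

# Share of positive labels in each part
print(y_tr.mean(), y_te.mean())
```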

In [None]:
X_train

7.3 Choosing & Training models

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

In [None]:
log_model = LogisticRegression(penalty='l2', verbose=1)
svm_model = SVC(kernel='rbf', verbose=1)
nn_model = MLPClassifier(hidden_layer_sizes=(16, 16), activation='relu', solver='adam', verbose=1)

I have 3 models, which I'm going to fit to the data to see how they perform.

In [None]:
log_model.fit(X_train, y_train)
svm_model.fit(X_train, y_train)
nn_model.fit(X_train, y_train)

**8. Score of models**

In [None]:
print(log_model.score(X_test, y_test))
print(svm_model.score(X_test, y_test))
print(nn_model.score(X_test, y_test))

In [None]:
# Share of positive examples: number of positives divided by the total number of examples

data[data.columns[0]].sum()/data.shape[0]

86 % of our dataset is classified as a positive example. That means the previous accuracy number (the number of correct predictions over the total number of predictions) is not telling us very much.

Let's say we predicted y=1 for every single example. Then we would be right 86 % of the time. So the accuracy metric is only informative when we have roughly equal numbers of positive and negative examples.

This is what we call "skewed data", where zeros and ones are not in equal proportion.
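
The argument can be checked with a few lines of code. This is a minimal sketch on made-up labels that reproduce the 86 % positive share; `y_true` and `y_pred` are hypothetical and not taken from the dataset.

```python
import numpy as np

# Hypothetical labels with an 86 % positive share
y_true = np.array([1] * 86 + [0] * 14)

# A "model" that ignores its input and always predicts the positive class
y_pred = np.ones_like(y_true)

# Accuracy = correct predictions / all predictions
accuracy = (y_pred == y_true).mean()
print(accuracy)
```

The constant predictor reaches an accuracy of 0.86 without learning anything, so any model score has to be read against that baseline.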

Because of this imbalance, we're going to use a different metric called the "F score".

**The F score combines two metrics, precision and recall, and gives you information about both of them.**
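
The two metrics behind the F score are precision and recall, and the F1 variant is their harmonic mean: F1 = 2 * precision * recall / (precision + recall). The sketch below computes it by hand on a handful of made-up labels (`y_true` and `y_pred` are hypothetical) and compares the result with scikit-learn's `f1_score`.

```python
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(f1_score(y_true, y_pred))  # should agree with the manual value
```

A model that predicts the positive class too often is penalized through precision, and one that misses positives is penalized through recall, so the score cannot be inflated as easily as plain accuracy.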

**9. F-score**

In [None]:
from sklearn.metrics import f1_score

In [None]:
log_predictions = log_model.predict(X_test)
svm_predictions = svm_model.predict(X_test)
nn_predictions = nn_model.predict(X_test)

In [None]:
# f1_score expects the true labels first, then the predictions
print(f1_score(y_test, log_predictions))
print(f1_score(y_test, svm_predictions))
print(f1_score(y_test, nn_predictions))

1. The logistic regression model and the support vector machine have almost identical scores.
2. The neural network model performed slightly better.