# Cleaning & adjusting geolocated data for house rent regression estimators

*Table of contents*

1. [Data import, analysis and initial feature selection](#divaifs)
2. [Datetime conversion and currency conversion](#dcarva)
3. [Variations of training dataframes](#ptd)
4. [<font color="red">Adjustment by inflation</font>](#abi)
5. [Data export](#de)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import datetime
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Data import, visualization and initial feature selection <a id="divaifs"></a>

In [None]:
df = pd.read_csv("../input/argentina-venta-de-propiedades/ar_properties.csv",index_col="id").dropna(subset=["price","currency"])

In [None]:
df_alq_ar = df[(df["operation_type"]=="Alquiler") & (df["l1"]=="Argentina")]

In [None]:
df_alq_ar.columns

## First feature selection  
Important features: lat, lon, rooms, bathrooms, surface_covered, surface_total (possibly redundant, since it is correlated with surface_covered), price_period, property_type, currency.  

We picked lat/lon because we'll make predictions with a KNN algorithm. These two coordinates carry enough location information to drop all the other location-related columns. If we kept the labeled categorical locations instead, converting them to numerical values would needlessly create hundreds of new columns. Bedrooms are left out because they are very highly correlated with rooms and bathrooms, and Argentines do not use them as a metric (domain-specific knowledge).
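
To make the column-count argument concrete, here is a minimal sketch on toy data. The `neighborhood` column and its labels are hypothetical and only stand in for the dataset's labeled location columns.

```python
import pandas as pd

# Toy listings; the `neighborhood` column and its labels are made up for illustration.
toy = pd.DataFrame({
    "lat": [-34.60, -34.58, -31.42, -32.95],
    "lon": [-58.38, -58.42, -64.18, -60.66],
    "neighborhood": ["Barrio A", "Barrio B", "Barrio C", "Barrio D"],
})

# One-hot encoding adds one column per distinct label...
encoded = pd.get_dummies(toy[["neighborhood"]])
print(encoded.shape[1])

# ...while the coordinates always stay at two numeric columns.
coords = toy[["lat", "lon"]]
print(coords.shape[1])
```

With hundreds of distinct location labels the encoded frame grows accordingly, while the coordinate representation does not.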

In [None]:
df_alq_ar[df_alq_ar["currency"]=="USD"]


In [None]:
features = ["created_on","lat", "lon", "rooms", 
"bathrooms", "surface_covered", "surface_total", "price_period", "property_type","price", "currency"]
df_alq_ar_reducido = df_alq_ar[features]

In [None]:
df_alq_ar_reducido.isnull().sum()/df_alq_ar_reducido.shape[0]

Different dataframes to be tested:  
0. "created_on","lat", "lon", "rooms", "bathrooms", "surface_covered", "surface_total", "price_period", "property_type" (drop rows with missing surfaces, impute the mean for everything else) 
1. "created_on","lat", "lon", "rooms", "bathrooms", "surface_covered", "surface_total", "price_period", "property_type"  
2. "created_on","lat", "lon", "rooms", "bathrooms", "surface_covered", "price_period", "property_type" (drop rows missing lat/lon; impute the mean for rooms, bathrooms and surface)
3. "created_on","lat", "lon", "rooms", "bathrooms", "surface_covered", "price_period", "property_type" (drop rows missing lat/lon or rooms; impute the mean for bathrooms)
4. "created_on","lat", "lon", "rooms", "bathrooms", "price_period", "property_type" (drop rows missing lat/lon)  

In [None]:
df_alq_ar_reducido["price_period"].value_counts()

All of the rentals are monthly, as is customary in Argentina, so I'll impute the missing "price_period" values as monthly. That is equivalent to simply dropping the column.

In [None]:
df_alq_ar_reducido = df_alq_ar_reducido.drop("price_period",axis=1)

In [None]:
df_alq_ar_reducido.info(verbose=True)

We only have one non-numeric column, the property type, and it has no missing values at all, which is very convenient.

# Datetime conversion and real value adjustments <a id="dcarva"></a>

## We'll:
    1. Convert to datetime
    2. Sort by date
    3. Replace the column with "months away from the last recorded date" (pick a better name)
The reason for this is that we may want to adjust all of the prices for inflation, so that every rent value is brought up to date.
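
As a minimal sketch of step 3, the month distance can be computed from the year and month parts of each date. The three dates below are invented for illustration.

```python
import pandas as pd

# Three made-up listing dates, already parsed as datetimes.
dates = pd.to_datetime(pd.Series(["2019-11-03", "2020-01-15", "2020-03-28"]))
last = dates.max()

# Whole calendar months between each date and the last one (days are ignored).
months_to_last = (last.year - dates.dt.year) * 12 + (last.month - dates.dt.month)
print(months_to_last.tolist())  # [4, 2, 0]
```

The most recent listing gets 0 and older listings get larger values, which is the sign convention the notebook ends up with after negating `month_diff`.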

In [None]:
# Assign the converted column directly so it ends up with a datetime64 dtype
df_alq_ar_reducido["created_on"] = pd.to_datetime(df_alq_ar_reducido["created_on"],format="%Y-%m-%d")
df_alq_ar_reducido = df_alq_ar_reducido.sort_values("created_on")

In [None]:
df_alq_ar_reducido

In [None]:
# Positional access: the index holds listing ids, so the label 0 is not guaranteed to exist
min_date = df_alq_ar_reducido["created_on"].iloc[0]
max_date = df_alq_ar_reducido["created_on"].iloc[-1]
print("First recorded date: ", min_date, "Last recorded date: ", max_date)
year_diff_to_end = df_alq_ar_reducido["created_on"].apply(lambda x: x.year)-max_date.year
month_diff_to_end = df_alq_ar_reducido["created_on"].apply(lambda x: x.month) - max_date.month

month_diff = month_diff_to_end + year_diff_to_end * 12
print("Month differences to last date: \n", month_diff.values)

In [None]:
df_alq_ar_reducido["created_on"]=-month_diff

In [None]:
df_alq_ar_reducido = df_alq_ar_reducido.rename(columns={"created_on":"mo_dist_to_last"})

In [None]:
# Keep only listings priced in ARS or USD
df_alq_ar_reducido = df_alq_ar_reducido[df_alq_ar_reducido["currency"].isin(["ARS", "USD"])].copy()
# Convert USD prices to ARS at a fixed rate of 150 ARS per USD
index_inDollars = (df_alq_ar_reducido["currency"]=="USD")
df_alq_ar_reducido.loc[index_inDollars, "price"] = df_alq_ar_reducido.loc[index_inDollars,"price"]*150
# Encode the original currency as a flag: 1 = USD, 0 = ARS
df_alq_ar_reducido["currency"] = index_inDollars.astype("int")
df_alq_ar_reducido = df_alq_ar_reducido.rename(columns={"currency":"original_currency"})

In [None]:
df_alq_ar_reducido["original_currency"] = df_alq_ar_reducido["original_currency"].astype("int")

In [None]:
df_alq_ar_reducido[df_alq_ar_reducido["original_currency"]==1]  # 1 = originally priced in USD

In [None]:
df_alq_ar_reducido[df_alq_ar_reducido["mo_dist_to_last"]<3]

# Variations of training dataframes <a id="ptd"></a>

Different dataframes to be tested:  
A: Only last 3 months

B: All months considered

0. "mo_dist_to_last","lat", "lon", "original_currency","rooms", "bathrooms", "surface_covered", "surface_total", "property_type" (drop rows missing surfaces or lat/lon, impute the median for everything else) 
1. "mo_dist_to_last","lat", "lon", "original_currency", "rooms", "bathrooms", "surface_covered", "surface_total", "property_type" (drop rows missing lat, lon or surface_total; impute surface_covered, rooms and bathrooms)
2. "mo_dist_to_last","lat", "lon", "original_currency", "rooms", "bathrooms", "surface_covered", "property_type" (drop rows missing lat/lon or surface; impute the median for rooms and bathrooms)
3. "mo_dist_to_last","lat", "lon", "original_currency", "rooms", "bathrooms", "surface_covered", "property_type" (drop rows missing lat/lon, rooms or surface; impute the median for bathrooms)
4. "mo_dist_to_last","lat", "lon", "original_currency", "rooms", "bathrooms", "property_type" (drop rows missing lat/lon or rooms; impute the median for missing bathrooms)
5. "mo_dist_to_last","lat", "lon", "original_currency", "rooms", "bathrooms", "property_type" (drop rows missing lat/lon, rooms or bathrooms)


<font color="red">Price adjustment idea: convert every price to dollars of the month it belongs to, then multiply by the current parallel-market dollar rate</font>
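
A rough sketch of how that idea could look, assuming a per-month exchange-rate table were available. The `rate_by_month` values are hypothetical, not real quotes, and `current_rate` simply reuses the fixed 150 from the conversion step earlier.

```python
import pandas as pd

# Hypothetical ARS-per-USD rates by listing month (illustrative numbers only).
rate_by_month = pd.Series({"2020-01": 80.0, "2020-02": 100.0})
current_rate = 150.0  # assumed current parallel rate

listings = pd.DataFrame({
    "month": ["2020-01", "2020-02", "2020-02"],
    "price": [16000.0, 20000.0, 300.0],
    "currency": ["ARS", "ARS", "USD"],
})

# Step 1: express every price in dollars of its own month.
is_ars = listings["currency"] == "ARS"
usd_price = listings["price"].where(
    ~is_ars, listings["price"] / listings["month"].map(rate_by_month)
)

# Step 2: bring everything to present-day pesos with the current rate.
listings["price_adjusted_ars"] = usd_price * current_rate
print(listings)
```

Listings already priced in USD skip step 1 and are only multiplied by the current rate.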

In [None]:
df_alq_ar_reducido.info()

We'll give the dataframes shorter names to keep things simple...

In [None]:
df_alq_ar_reducido["original_currency"].value_counts()

In [None]:
# Share of missing values per column once rows without lat, lon or surface_covered are dropped
df_alq_ar_reducido.dropna(subset=["lat", "lon", "surface_covered"]).isnull().mean()

In [None]:
df0 = df_alq_ar_reducido[["mo_dist_to_last","lat", "lon", "rooms", "bathrooms","surface_covered","surface_total","original_currency", "property_type","price"]]
df1 = df0
df2 = df_alq_ar_reducido[["mo_dist_to_last","lat", "lon", "rooms", "bathrooms","surface_covered","original_currency", "property_type","price"]]
df3 = df2
df4 = df_alq_ar_reducido[["mo_dist_to_last","lat", "lon", "rooms", "bathrooms","original_currency", "property_type","price"]]
df5 = df4

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
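def potentialTrainingDataGenerator(df, colsToDrop, colsToImpute, categoricalCols, nMonthsConsidered):
    """Given a dataframe with missing values in some rows, drop the rows with missing values in colsToDrop,
    keep only rows with mo_dist_to_last < nMonthsConsidered, one-hot encode categoricalCols and median-impute the rest.
    Note: colsToImpute is not used in the body; the imputer is applied to every column.
    Returns X_train, X_test, y_train, y_test, ready to be used for training."""
    ## Drop rows with missing values in colsToDrop and keep only the most recent months
    df = df.dropna(subset=colsToDrop)
    df = df[df["mo_dist_to_last"]<nMonthsConsidered]
    ## Split into train and test before imputing, so the imputer is fit on training data only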
    X = df.drop(columns=["price"])
    y = df["price"]
    X_train, X_test, y_train, y_test = train_test_split(X, y)
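    ## Encode (scikit-learn >= 1.2 spells the dense-output flag `sparse_output`; older releases use `sparse`)
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    X_train_encoded_cat = pd.DataFrame(encoder.fit_transform(X_train[categoricalCols]))
    X_test_encoded_cat = pd.DataFrame(encoder.transform(X_test[categoricalCols]))
    ## name the encoded columns with strings so they can sit next to the original column names
    X_train_encoded_cat.columns = encoder.get_feature_names_out(categoricalCols)
    X_test_encoded_cat.columns = encoder.get_feature_names_out(categoricalCols)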
    ## recover indexes
    X_train_encoded_cat.index = X_train[categoricalCols].index
    X_test_encoded_cat.index = X_test[categoricalCols].index
    
    ## reinsert encoded
    X_train = X_train.drop(categoricalCols, axis=1)
    X_test = X_test.drop(categoricalCols, axis=1)
    
    X_train = pd.concat([X_train, X_train_encoded_cat], axis=1)
    X_test = pd.concat([X_test, X_test_encoded_cat], axis=1)
    ## impute missing values for the median
    imputer = SimpleImputer(strategy='median')
    
    X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train), index = X_train.index)
    X_test_imputed = pd.DataFrame(imputer.transform(X_test), index = X_test.index)
    
    X_train_imputed.columns = X_train.columns
    X_test_imputed.columns = X_train.columns
    
    return X_train_imputed, X_test_imputed, y_train, y_test

In [None]:
X_train_5a, X_test_5a, y_train_5a, y_test_5a = potentialTrainingDataGenerator(df5, ["lat", "lon", "rooms","bathrooms"], [], ["property_type"], 3)

X_train_5b, X_test_5b, y_train_5b, y_test_5b = potentialTrainingDataGenerator(df5, ["lat", "lon", "rooms", "bathrooms"], [], ["property_type"], 14)

In [None]:
X_train_5a

In [None]:
X_train_4a, X_test_4a, y_train_4a, y_test_4a = potentialTrainingDataGenerator(df4, ["lat", "lon", "rooms"], ["bathrooms"], ["property_type"], 3)

X_train_4b, X_test_4b, y_train_4b, y_test_4b = potentialTrainingDataGenerator(df4, ["lat", "lon", "rooms"], ["bathrooms"], ["property_type"], 14)

In [None]:
X_train_3a, X_test_3a, y_train_3a, y_test_3a = potentialTrainingDataGenerator(df3, ["lat", "lon", "rooms", "surface_covered"], ["bathrooms"], ["property_type"], 3)

X_train_3b, X_test_3b, y_train_3b, y_test_3b = potentialTrainingDataGenerator(df3, ["lat", "lon", "rooms", "surface_covered"], ["bathrooms"], ["property_type"], 14)

In [None]:
X_train_2a, X_test_2a, y_train_2a, y_test_2a = potentialTrainingDataGenerator(df2, ["lat", "lon", "surface_covered"], ["bathrooms","rooms"], ["property_type"], 3)

X_train_2b, X_test_2b, y_train_2b, y_test_2b = potentialTrainingDataGenerator(df2, ["lat", "lon", "surface_covered"], ["bathrooms", "rooms"], ["property_type"], 14)

In [None]:
X_train_1a, X_test_1a, y_train_1a, y_test_1a = potentialTrainingDataGenerator(df1, ["lat", "lon","surface_total"], ["bathrooms","rooms","surface_covered"], ["property_type"], 3)

X_train_1b, X_test_1b, y_train_1b, y_test_1b = potentialTrainingDataGenerator(df1, ["lat", "lon", "surface_total"], ["bathrooms", "rooms","surface_covered"], ["property_type"], 14)

In [None]:
X_train_1a

In [None]:
X_train_0a, X_test_0a, y_train_0a, y_test_0a = potentialTrainingDataGenerator(df0, ["lat", "lon", "surface_covered","surface_total"], ["bathrooms","rooms"], ["property_type"], 3)

X_train_0b, X_test_0b, y_train_0b, y_test_0b = potentialTrainingDataGenerator(df0, ["lat", "lon", "surface_covered","surface_total"], ["bathrooms", "rooms"], ["property_type"], 14)
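
The twelve calls above follow one pattern, so they could also be driven from a small table. The sketch below is one possible arrangement, not the notebook's method: `VARIANTS`, `WINDOWS` and `build_all_splits` are hypothetical names, and the column groups are copied from the calls above.

```python
# Drop/impute column groups copied from the calls above, keyed by variant number.
VARIANTS = {
    "0": (["lat", "lon", "surface_covered", "surface_total"], ["bathrooms", "rooms"]),
    "1": (["lat", "lon", "surface_total"], ["bathrooms", "rooms", "surface_covered"]),
    "2": (["lat", "lon", "surface_covered"], ["bathrooms", "rooms"]),
    "3": (["lat", "lon", "rooms", "surface_covered"], ["bathrooms"]),
    "4": (["lat", "lon", "rooms"], ["bathrooms"]),
    "5": (["lat", "lon", "rooms", "bathrooms"], []),
}
# Month windows: "a" keeps rows with mo_dist_to_last < 3, "b" keeps rows with mo_dist_to_last < 14.
WINDOWS = {"a": 3, "b": 14}


def build_all_splits(generator, frames):
    """Call `generator` once per (variant, window) pair.

    `generator` is expected to take the same arguments as potentialTrainingDataGenerator;
    `frames` maps variant numbers ("0".."5") to their source dataframes.
    """
    splits = {}
    for variant, (cols_to_drop, cols_to_impute) in VARIANTS.items():
        for suffix, n_months in WINDOWS.items():
            splits[variant + suffix] = generator(
                frames[variant], cols_to_drop, cols_to_impute, ["property_type"], n_months
            )
    return splits
```

In the notebook this would be called as `build_all_splits(potentialTrainingDataGenerator, {"0": df0, "1": df1, "2": df2, "3": df3, "4": df4, "5": df5})`, giving a dictionary keyed `"0a"` through `"5b"` whose values are the `(X_train, X_test, y_train, y_test)` tuples.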


# Adjustment by inflation <a id="abi"></a>

This hasn't been done yet because I'm assuming the model can pick up inflationary patterns from the "mo_dist_to_last" column.
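
If an explicit adjustment were preferred over leaving it to the model, one simple option is to compound each price forward by its month distance. This is only a sketch: the constant `monthly_inflation` rate is a hypothetical placeholder, and a real adjustment would need an actual price index, possibly one per region.

```python
import pandas as pd

monthly_inflation = 0.03  # hypothetical constant monthly rate, for illustration only

toy = pd.DataFrame({
    "mo_dist_to_last": [0, 1, 2],
    "price": [10000.0, 10000.0, 10000.0],
})

# Compound each price forward by the number of months separating it from the last date.
toy["price_adjusted"] = toy["price"] * (1 + monthly_inflation) ** toy["mo_dist_to_last"]
print(toy)
```

Prices from the most recent month are left unchanged, and older prices are scaled up the further back they are.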

idea:
1. map locations to regions
2. 

In [None]:
# def adjustByInflation(df):
    

# Data export <a id="de"></a>

In [None]:
import pickle

In [None]:
cero_a = [(X_train_0a,"X_train_0a"), (X_test_0a,"X_test_0a"), (y_train_0a,"y_train_0a"), (y_test_0a,"y_test_0a")]
one_a = [(X_train_1a,"X_train_1a"), (X_test_1a,"X_test_1a"), (y_train_1a,"y_train_1a"), (y_test_1a,"y_test_1a")]
two_a = [(X_train_2a,"X_train_2a"), (X_test_2a,"X_test_2a"), (y_train_2a,"y_train_2a"), (y_test_2a,"y_test_2a")]
three_a = [(X_train_3a,"X_train_3a"), (X_test_3a,"X_test_3a"), (y_train_3a,"y_train_3a"), (y_test_3a,"y_test_3a")]
four_a = [(X_train_4a,"X_train_4a"), (X_test_4a,"X_test_4a"), (y_train_4a,"y_train_4a"), (y_test_4a,"y_test_4a")]
five_a = [(X_train_5a,"X_train_5a"), (X_test_5a,"X_test_5a"), (y_train_5a,"y_train_5a"), (y_test_5a,"y_test_5a")]
A = [cero_a,one_a,two_a,three_a,four_a,five_a]
cero_b = [(X_train_0b,"X_train_0b"), (X_test_0b,"X_test_0b"), (y_train_0b,"y_train_0b"), (y_test_0b,"y_test_0b")]
one_b = [(X_train_1b,"X_train_1b"), (X_test_1b,"X_test_1b"), (y_train_1b,"y_train_1b"), (y_test_1b,"y_test_1b")]
two_b = [(X_train_2b,"X_train_2b"), (X_test_2b,"X_test_2b"), (y_train_2b,"y_train_2b"), (y_test_2b,"y_test_2b")]
three_b = [(X_train_3b,"X_train_3b"), (X_test_3b,"X_test_3b"), (y_train_3b,"y_train_3b"), (y_test_3b,"y_test_3b")]
four_b = [(X_train_4b,"X_train_4b"), (X_test_4b,"X_test_4b"), (y_train_4b,"y_train_4b"), (y_test_4b,"y_test_4b")]
five_b = [(X_train_5b,"X_train_5b"), (X_test_5b,"X_test_5b"), (y_train_5b,"y_train_5b"), (y_test_5b,"y_test_5b")]
B = [cero_b,one_b,two_b,three_b,four_b,five_b]


In [None]:
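# Export every split of both groups (A: last 3 months, B: all months) to its own CSV file
for group in A + B:
    for frame, name in group:
        frame.to_csv(name + ".csv")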
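
Because `to_csv` writes the index as the first column, the files can be read back with the listing ids restored as the index. The following is a small round-trip sketch on a made-up frame; the file name `X_train_demo.csv` and the ids are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up frame indexed by listing id, mimicking the shape of an exported split.
frame = pd.DataFrame(
    {"lat": [-34.6, -31.4], "rooms": [2.0, 3.0]},
    index=pd.Index(["abc", "def"], name="id"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_train_demo.csv")
    frame.to_csv(path)                          # the index is written as the "id" column
    restored = pd.read_csv(path, index_col="id")  # and restored as the index on load

print(restored)
```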