#### PMR3508 - Machine Learning and Pattern Recognition

# Regression Models Applied to the California Housing Dataset
# Notebook


### Kaio Teles Ogawa - NUSP: 9345957

**Libraries used**

In [None]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
#plt.rcParams['figure.figsize'] = (15,15)

**California Housing dataset**

In this work, regression models are applied to predict the median house price of a region in California, using the provided California Housing dataset. Besides the identifier "Id", the dataset contains 8 attributes (longitude, latitude, median_age, total_rooms, total_bedrooms, population, households, and median_income) plus the target variable "median_house_value", whose numeric values are the "target" of this work.

TRAINING SET

In [None]:
train = pd.read_csv("/kaggle/input/atividade-regressao-PMR3508/train.csv",
        sep=r'\s*,\s*',
        engine='python',
        na_values="?",index_col=['Id'])

In [None]:
train.shape  # table size

In [None]:
train.head()  # first rows

All the observed variables are numeric, with wide ranges and different scales, so let us begin studying the data.

### Data study

**Missing data**

In [None]:
for label, content in train.items():
    print(pd.isna(train[label]).value_counts()) 

There are no missing values.

### Data analysis

As a starting point, we check how the data behave statistically:

GENERAL DESCRIPTION:

In [None]:
train.describe().T  # numeric data

All attributes show large relative variation except latitude and longitude, since all observations come from the same state.

In [None]:
sns.set()
train.hist(bins=50, figsize=(20,15))
plt.show()

Most of the distributions look approximately normal.

Geographic analysis

In [None]:
import descartes
import geopandas as gpd
from shapely.geometry import Point, Polygon

In [None]:
street_map = gpd.read_file('/kaggle/input/california/California/CA_Counties_TIGER2016.shp')

MAP OF CALIFORNIA

In [None]:
fig,ax=plt.subplots(figsize=(15,15))
street_map.plot(ax=ax)


In [None]:
crs = "EPSG:4326"  # WGS 84 latitude/longitude; replaces the deprecated {'init': 'epsg:4326'} form
geometry = gpd.points_from_xy(train.longitude, train.latitude)
geo_train = gpd.GeoDataFrame(train,  # data
                             crs=crs,  # coordinate reference system
                             geometry=geometry)  # list of created geometries
geo_train.head()

Locations on the map:

In [None]:
#base = street_map.plot(color='grey',alpha=0.4, edgecolor='black',figsize=(15,15))
geo_train.plot(marker='o', color='green', markersize=10,figsize=(15,15))


> #### The overlay on the map is inserted as an image, because the code did not run in this Jupyter environment
>
> ![img](https://drive.google.com/uc?id=14XZ1REY3_1VqjdXWId6TlNsgSToMU3fw)

The data points are concentrated along the coast.

New attributes

Let us analyze relationships between attributes that may provide more valuable information, using as a reference the code available at https://www.kaggle.com/camnugent/geospatial-feature-engineering-and-visualization/data

Datasets of California cities are used, given that the housing dataset is also from 1990.

In [None]:
city_lat_long = pd.read_csv('../input/california-housing-feature-engineering/cal_cities_lat_long.csv')
city_pop_data = pd.read_csv('../input/california-housing-feature-engineering/cal_populations_city.csv')
county_pop_data = pd.read_csv('../input/california-housing-feature-engineering/cal_populations_county.csv')

Function that computes the name of, and the distance to, the nearest city

In [None]:
from geopy import distance

# Keep only the cities that also appear in the population table
city_coords = {}
for _, row in city_lat_long.iterrows():
    if row['Name'] in city_pop_data['City'].values:
        city_coords[row['Name']] = (float(row['Latitude']), float(row['Longitude']))

def closest_point(location, location_dict):
    """Return (name, distance in km) of the city closest to `location`."""
    closest_location = None
    for city, coords in location_dict.items():
        d = distance.distance(location, coords).km  # geodesic distance
        if closest_location is None or d < closest_location[1]:
            closest_location = (city, d)

    return closest_location
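
To make the nearest-city lookup concrete, here is a minimal, dependency-free sketch. It swaps geopy's geodesic `distance.distance` for the haversine great-circle formula, an assumption made only to keep the example self-contained, so its results differ slightly from the geodesic distance. The names `haversine_km` and `closest_point_haversine` and the two sample coordinates are illustrative and not part of the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def closest_point_haversine(location, location_dict):
    """Return (name, distance_km) of the entry nearest to location, or None if empty."""
    if not location_dict:
        return None
    return min(((name, haversine_km(location, coords))
                for name, coords in location_dict.items()),
               key=lambda pair: pair[1])


# Approximate coordinates, used only for illustration
sample_cities = {"Los Angeles": (34.05, -118.24), "San Francisco": (37.77, -122.42)}
nearest = closest_point_haversine((37.88, -122.23), sample_cities)
print(nearest[0])
```

Using `min` with a key replaces the explicit "best so far" bookkeeping, and the empty-dictionary case is handled up front so the return type stays predictable.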

Similarly, the name of, and the distance to, the nearest big city (more than 500,000 inhabitants in 1990) are computed

In [None]:
city_pop_dict = {}
for dat in city_pop_data.iterrows():
    row = dat[1]
    city_pop_dict[row['City']] =  row['pop_april_1990']


big_cities = {}

for key, value in city_coords.items():
    if city_pop_dict[key] > 500000:
        big_cities[key] = value

In [None]:
print(big_cities)

In [None]:
train.insert(len(train.columns)-2, "close_city", train.apply(lambda x: 
                            closest_point((x['latitude'],x['longitude']),city_coords), axis = 1), True)

In [None]:
train.insert(len(train.columns)-2, "close_city_name", [x[0] for x in train['close_city'].values], True)
train.insert(len(train.columns)-2, "close_city_dist", [x[1] for x in train['close_city'].values], True) 

In [None]:
train.insert(len(train.columns)-2, "close_city_pop", [city_pop_dict[x] for x in train['close_city_name'].values], True) 

In [None]:
train.insert(len(train.columns)-2, "big_city", train.apply(lambda x: 
                    closest_point((x['latitude'],x['longitude']),big_cities), axis = 1), True) 
train.insert(len(train.columns)-2, "big_city_name", [x[0] for x in train['big_city'].values], True) 
train.insert(len(train.columns)-2, "big_city_dist", [x[1] for x in train['big_city'].values], True) 
train.head()

As suggested in the assignment, new features were created from ratios of existing features:

In [None]:
train.insert(len(train.columns)-2, "rooms_per_household", 
             train.apply(lambda x: round(x['total_rooms']/x['households'], 2), axis = 1), True) 

train.insert(len(train.columns)-2, "population_per_household", 
            train.apply(lambda x: round(x['population']/x['households'], 2), axis = 1), True) 

train.insert(len(train.columns)-2, "bedrooms_per_room", 
            train.apply(lambda x: round(x['total_bedrooms']/x['total_rooms'], 2), axis = 1), True) 

train.head()
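
The three ratios above divide by `households` and `total_rooms`. As a small sketch of the same computation on a single record, the hypothetical helper `ratio_features` below works on a plain dictionary and returns `None` when a denominator is zero. That guard is an assumption of this example and is not present in the notebook; the sample record is made up.

```python
def ratio_features(row):
    """Derive the three ratio features from one record given as a dict."""
    def safe_div(num, den):
        # Assumption of this sketch: return None instead of raising on a zero denominator
        return round(num / den, 2) if den else None

    return {
        "rooms_per_household": safe_div(row["total_rooms"], row["households"]),
        "population_per_household": safe_div(row["population"], row["households"]),
        "bedrooms_per_room": safe_div(row["total_bedrooms"], row["total_rooms"]),
    }


# Made-up record, for illustration only
example = {"total_rooms": 880, "total_bedrooms": 129, "population": 322, "households": 126}
features = ratio_features(example)
print(features)
```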

Distance between each location and the coast, taken as the distance to the nearest coastal city

In [None]:
# Coastal cities used as reference points for the coastline
names = ['San Francisco', 'Long Beach', 'San Diego', 'Orange',
         'Los Angeles', 'Ventura', 'Santa Barbara', 'San Luis Obispo']
# Keep only the names that have known coordinates
beach_cities = {name: city_coords[name] for name in names if name in city_coords}

In [None]:
print(beach_cities)

In [None]:
train.insert(len(train.columns)-2, "beach_city", train.apply(lambda x: 
                    closest_point((x['latitude'],x['longitude']),beach_cities), axis = 1), True) 


In [None]:
train.insert(len(train.columns)-2, "beach_city_name", [x[0] for x in train['beach_city'].values], True) 
train.insert(len(train.columns)-2, "beach_city_dist", [x[1] for x in train['beach_city'].values], True) 
train.head()

Now the initial features can be removed from the dataset, since they have already been analyzed:

In [None]:
trainOk = train.drop(['longitude','latitude','geometry','close_city','big_city','beach_city'], axis=1)
trainOk.head()

So far, each attribute has been plotted individually. Next, we analyze how each attribute behaves with respect to the target variable, using the following function:

In [None]:
import matplotlib as mpl

def comparative_histogram(data, obj_var, test_var, obj_labels=None, alpha=0.7):
    """Overlay histograms of `test_var` for rows below and at/above the mean of `obj_var`."""
    # Default legend: 0 = obj_var below its mean, 1 = obj_var at or above its mean
    if obj_labels is None:
        obj_labels = [0, 1]

    # Split test_var into the two groups defined by the mean of obj_var
    media = data[obj_var].mean()
    groups = [
        data.loc[data[obj_var] < media, test_var].to_numpy(),
        data.loc[data[obj_var] >= media, test_var].to_numpy(),
    ]

    # Figure size and global plot style
    fig = plt.figure(figsize=(13, 7))
    colors = ['brown', 'forestgreen']
    mpl.rcParams['figure.facecolor'] = '0.75'
    mpl.rcParams['grid.color'] = 'k'
    mpl.rcParams['grid.linestyle'] = ':'
    mpl.rcParams['grid.linewidth'] = 0.5

    # One semi-transparent histogram per group
    for values, color in zip(groups, colors):
        plt.hist(values, alpha=alpha, color=color)
    plt.xlabel(test_var)
    plt.ylabel('quantity')
    plt.title("Histogram over '" + test_var + "' filtered by '" + obj_var + "'")
    plt.grid()
    plt.legend(obj_labels)

In [None]:
comparative_histogram(trainOk, 'median_house_value','median_age')

In [None]:
comparative_histogram(trainOk, 'median_house_value','total_rooms')

In [None]:
comparative_histogram(trainOk, 'median_house_value','total_bedrooms')

In [None]:
comparative_histogram(trainOk, 'median_house_value','population')

In [None]:
comparative_histogram(trainOk, 'median_house_value','households')

In [None]:
comparative_histogram(trainOk, 'median_house_value','median_income')

In [None]:
comparative_histogram(trainOk, 'median_house_value','close_city_name')

In [None]:
comparative_histogram(trainOk, 'median_house_value','close_city_dist')

In [None]:
comparative_histogram(trainOk, 'median_house_value','close_city_pop')

In [None]:
comparative_histogram(trainOk, 'median_house_value','big_city_name')

In [None]:
comparative_histogram(trainOk, 'median_house_value','big_city_dist')

In [None]:
comparative_histogram(trainOk, 'median_house_value','rooms_per_household')

In [None]:
comparative_histogram(trainOk, 'median_house_value','population_per_household')

In [None]:
comparative_histogram(trainOk, 'median_house_value','bedrooms_per_room')

In [None]:
comparative_histogram(trainOk, 'median_house_value','beach_city_name')

In [None]:
comparative_histogram(trainOk, 'median_house_value','beach_city_dist')

In [None]:
corr_mat = trainOk.select_dtypes(include='number').corr()  # numeric columns only; the city-name columns are strings
mask = np.triu(np.ones_like(corr_mat,dtype = bool))
f, ax = plt.subplots(figsize=(20, 13))
sns.heatmap(corr_mat, vmax=.7, mask=mask, square=True, cmap="coolwarm", lw=0,annot = True)

From the correlation matrix, the five covariates most weakly correlated with the target are: population_per_household, close_city_pop, households, population, and total_bedrooms.
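
The selection above was read off the heatmap. As a hedged sketch of how such a ranking could be computed, the code below calculates the Pearson correlation of each feature with the target and returns the names with the smallest absolute value. The helpers `pearson` and `weakest_features` and the toy columns are hypothetical and use no notebook data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def weakest_features(columns, target, k):
    """Names of the k features with the smallest |correlation| with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get)[:k]


# Toy data, for illustration only
toy_target = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_columns = {
    "strong": [2.0, 4.1, 5.9, 8.2, 10.0],
    "weak": [3.0, 1.0, 4.0, 1.0, 3.0],
    "inverse": [5.0, 4.0, 3.0, 2.0, 1.0],
}
print(weakest_features(toy_columns, toy_target, 1))
```

Ranking by absolute value matters: a strongly negative correlation, like the `inverse` column here, is just as informative for a linear model as a strongly positive one.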

In [None]:
trainOk = trainOk.drop(['population_per_household','close_city_pop','households','population','total_bedrooms'], axis=1)
trainOk.head()

Data split

In [None]:
from sklearn.model_selection import train_test_split

Xdata = trainOk[["median_age","total_rooms","median_income","close_city_name","close_city_dist","big_city_name",
                 "big_city_dist","rooms_per_household","bedrooms_per_room","beach_city_name","beach_city_dist"]]

Ydata = trainOk.median_house_value

X_train, X_test, Y_train, Y_test = train_test_split(Xdata, Ydata, test_size = 0.3, random_state = 42)
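
With `test_size = 0.3`, about 30% of the rows are held out after shuffling, and `random_state = 42` makes that shuffle reproducible. The sketch below illustrates the idea with the standard library only. It is not scikit-learn's implementation: the hypothetical `shuffle_split` will not reproduce the same row order, and its rounding rule is a choice of this example.

```python
import random


def shuffle_split(n_rows, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) after a seeded shuffle of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))  # rounding rule is a choice of this sketch
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffle_split(10)
print(len(train_idx), len(test_idx))
```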

Handling the categorical covariates

In [None]:
import gc
from sklearn.preprocessing import MinMaxScaler

# Drop the city-name columns; reassigning avoids modifying a slice of trainOk in place
cat_cols = ['close_city_name', 'big_city_name', 'beach_city_name']
X_train = X_train.drop(columns=cat_cols)
X_test = X_test.drop(columns=cat_cols)

num_cols = X_train.columns
num_train_index = X_train.index
num_test_index = X_test.index

# Fit the scaler on the training split only, then apply it to both splits
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train = pd.DataFrame(X_train, index=num_train_index, columns=num_cols)
X_test = pd.DataFrame(X_test, index=num_test_index, columns=num_cols)

gc.collect()
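
Min-max scaling maps each column to `(x - min) / (max - min)`, where `min` and `max` are learned from the training split and then reused for any other data. The dependency-free sketch below illustrates that fit/transform separation. The helpers `fit_min_max` and `apply_min_max` are hypothetical, and mapping constant columns to 0.0 is an assumption of this example.

```python
def fit_min_max(rows):
    """Learn per-column (min, max) from training rows (a list of equal-length lists)."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(rows, params):
    """Scale rows with previously learned (min, max) pairs; constant columns map to 0.0."""
    return [[(value - lo) / (hi - lo) if hi > lo else 0.0
             for value, (lo, hi) in zip(row, params)]
            for row in rows]


# Toy data, for illustration only
train_rows = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
test_rows = [[2.0, 40.0]]

params = fit_min_max(train_rows)            # learned from training rows only
scaled_train = apply_min_max(train_rows, params)
scaled_test = apply_min_max(test_rows, params)
print(scaled_test)
```

Because the parameters come from the training rows, a held-out value can land outside the [0, 1] interval, as the second test column does here. That is expected and keeps both sets on the same scale.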

In [None]:
X_train.head()

In [None]:
Y_train.head()

RMSLE metric

In [None]:
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import make_scorer

def RMSLE(y_test, y_pred):
    # np.abs guards against negative predictions, which mean_squared_log_error rejects
    return np.sqrt(mean_squared_log_error(np.abs(y_test), np.abs(y_pred)))

scorer = make_scorer(RMSLE, greater_is_better=False)
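
For reference, RMSLE is the square root of the mean of `(log(1 + y_pred) - log(1 + y_true))**2`, so it penalizes relative rather than absolute errors. A minimal standard-library sketch of that formula follows. The function name `rmsle` and the sample values are illustrative, and unlike the notebook's `RMSLE` it does not take absolute values first.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [(math.log1p(p) - math.log1p(t)) ** 2
                          for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# With y_true = 0 and y_pred = e - 1, every log difference is exactly 1
unit_error = rmsle([0.0, 0.0], [math.e - 1, math.e - 1])
print(unit_error)
```

The formula is symmetric in its two arguments, which is why calling the notebook's `RMSLE` with the predictions first gives the same number.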

Linear Regression

In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
print('Score:', reg.score(X_test, Y_test), '\n' 'Coef:', reg.coef_)

In [None]:
lin_reg_RMSLE = RMSLE(Y_pred, Y_test)
print("RMSLE:", lin_reg_RMSLE)

In [None]:
from sklearn.model_selection import cross_val_score
lin_reg_score = cross_val_score(reg, X_train, Y_train, cv = 10)
lin_reg_score 
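
With `cv = 10` and a regressor, `cross_val_score` uses unshuffled K-fold splitting, and since no `scoring` argument is given it reports each fold's `reg.score`, that is R², rather than RMSLE. The sketch below shows how contiguous folds can be laid out. `k_fold_indices` is a hypothetical helper written for illustration, not scikit-learn code.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # The first n_rows % k folds receive one extra row
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
print([len(val) for _, val in folds])
```

Every row is used for validation exactly once, so the scores returned for the folds can be averaged into a single estimate.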

KNN Regression

In [None]:
from sklearn.neighbors import KNeighborsRegressor

reg = KNeighborsRegressor()
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
print('Score:', reg.score(X_test, Y_test))
knn_reg_RMSLE = RMSLE(Y_pred, Y_test)
print("RMSLE:", knn_reg_RMSLE)
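
`KNeighborsRegressor()` with default settings predicts the unweighted mean target of the 5 nearest training points under Euclidean distance, which is also why the features were scaled to a common range first. A brute-force sketch of that idea is shown below. `knn_predict` and the toy data are hypothetical, and the real estimator uses faster neighbor searches.

```python
import math


def knn_predict(train_X, train_y, query, k=5):
    """Mean target of the k training points nearest to query (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


# Toy one-feature data, for illustration only
toy_X = [[0.0], [1.0], [2.0], [10.0]]
toy_y = [0.0, 10.0, 20.0, 100.0]
prediction = knn_predict(toy_X, toy_y, [1.2], k=2)
print(prediction)
```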

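Logistic Regression

In [None]:
#from sklearn.linear_model import LogisticRegression

#reg = LogisticRegression()
#reg.fit(X_train, Y_train)
#Y_pred = reg.predict(X_test)
#log_reg_RMSLE = RMSLE(Y_pred, Y_test)
#print("RMSLE:", log_reg_RMSLE)

Logistic Regression raised an error: `LogisticRegression` is a classifier that expects discrete class labels, whereas `median_house_value` is continuous, so it is left out of the comparison.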

Ridge

In [None]:
reg = linear_model.Ridge(alpha=.5)
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
print('Score:', reg.score(X_test, Y_test), '\n' 'Coef:', reg.coef_)

In [None]:
rid_reg_RMSLE = RMSLE(Y_pred, Y_test)
print("RMSLE:", rid_reg_RMSLE)

In [None]:
from sklearn.model_selection import cross_val_score
rid_reg_score = cross_val_score(reg, X_train, Y_train, cv = 10)
rid_reg_score 
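
Ridge adds the penalty `alpha * ||w||²` to the least-squares objective, which shrinks the coefficients toward zero. For a single feature with an unpenalized intercept, the slope has the closed form `Sxy / (Sxx + alpha)` on centered data. The sketch below illustrates that shrinkage. `ridge_slope` and the toy data are hypothetical, and the one-feature case is a simplification of what `linear_model.Ridge` solves.

```python
def ridge_slope(xs, ys, alpha):
    """Slope of a one-feature ridge fit with an unpenalized intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / (sxx + alpha)


# Toy data lying exactly on y = 2x, for illustration only
xs = [0.0, 1.0, 2.0]
ys = [0.0, 2.0, 4.0]
slope_ols = ridge_slope(xs, ys, alpha=0.0)    # ordinary least squares
slope_ridge = ridge_slope(xs, ys, alpha=2.0)  # shrunk toward zero
print(slope_ols, slope_ridge)
```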

Lasso

In [None]:
from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.1)
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
print('Score:', reg.score(X_test, Y_test), '\n' 'Coef:', reg.coef_)

In [None]:
las_reg_RMSLE = RMSLE(Y_pred, Y_test)
print("RMSLE:", las_reg_RMSLE)

In [None]:
from sklearn.model_selection import cross_val_score
las_reg_score = cross_val_score(reg, X_train, Y_train, cv = 10)
las_reg_score 

Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor()
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
print('Score:', reg.score(X_test, Y_test))
rf_reg_RMSLE = RMSLE(Y_pred, Y_test)
print("RMSLE:", rf_reg_RMSLE)

Comparison of the models by RMSLE:

In [None]:
d = {'Regressor': ["Linear","KNN","Ridge","Lasso","Random Forest"], 
                      'RMSLE': [lin_reg_RMSLE,knn_reg_RMSLE,rid_reg_RMSLE,las_reg_RMSLE,rf_reg_RMSLE]}

tabela = pd.DataFrame(data=d)
tabela

Thus, the best regressor is Random Forest, which obtained the lowest RMSLE. We now apply the same procedures to the test set.

TEST SET

In [None]:
test = pd.read_csv("/kaggle/input/atividade-regressao-PMR3508/test.csv",
        sep=r'\s*,\s*',
        engine='python',
        na_values="?",
        index_col=['Id'])

In [None]:
for label, content in test.items():
    print(pd.isna(test[label]).value_counts()) 

In [None]:
test.insert(len(test.columns)-2, "close_city", test.apply(lambda x: 
                            closest_point((x['latitude'],x['longitude']),city_coords), axis = 1), True)

test.insert(len(test.columns)-2, "close_city_name", [x[0] for x in test['close_city'].values], True)
test.insert(len(test.columns)-2, "close_city_dist", [x[1] for x in test['close_city'].values], True) 

test.insert(len(test.columns)-2, "close_city_pop", [city_pop_dict[x] for x in test['close_city_name'].values], True)

test.insert(len(test.columns)-2, "big_city", test.apply(lambda x: 
                    closest_point((x['latitude'],x['longitude']),big_cities), axis = 1), True) 
test.insert(len(test.columns)-2, "big_city_name", [x[0] for x in test['big_city'].values], True) 
test.insert(len(test.columns)-2, "big_city_dist", [x[1] for x in test['big_city'].values], True) 
test.head()

In [None]:
test.insert(len(test.columns)-2, "rooms_per_household", 
             test.apply(lambda x: round(x['total_rooms']/x['households'], 2), axis = 1), True) 

test.insert(len(test.columns)-2, "population_per_household", 
            test.apply(lambda x: round(x['population']/x['households'], 2), axis = 1), True) 

test.insert(len(test.columns)-2, "bedrooms_per_room", 
            test.apply(lambda x: round(x['total_bedrooms']/x['total_rooms'], 2), axis = 1), True) 

test.head()

In [None]:
test.insert(len(test.columns)-2, "beach_city", test.apply(lambda x: 
                    closest_point((x['latitude'],x['longitude']),beach_cities), axis = 1), True) 

test.insert(len(test.columns)-2, "beach_city_name", [x[0] for x in test['beach_city'].values], True) 
test.insert(len(test.columns)-2, "beach_city_dist", [x[1] for x in test['beach_city'].values], True) 
test.head()

In [None]:
testOk = test.drop(['longitude','latitude','close_city','big_city','beach_city'], axis=1)
testOk = testOk.drop(['population_per_household','close_city_pop','households','population','total_bedrooms'], axis=1)
testOk.head()

In [None]:
Xdata = testOk[["median_age","total_rooms","median_income","close_city_name","close_city_dist","big_city_name",
                 "big_city_dist","rooms_per_household","bedrooms_per_room","beach_city_name","beach_city_dist"]]


In [None]:
import gc

# Drop the city-name columns; reassigning avoids modifying a slice of testOk in place
Xdata = Xdata.drop(columns=['close_city_name', 'big_city_name', 'beach_city_name'])

num_cols = Xdata.columns
num_test_index = Xdata.index

# Reuse the scaler already fitted on X_train; refitting on the test set would
# scale it differently from the data the model was trained on
Xdata = scaler.transform(Xdata)

Xdata = pd.DataFrame(Xdata, index=num_test_index, columns=num_cols)

gc.collect()

In [None]:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor()
reg.fit(X_train, Y_train)
Y_pred = reg.predict(Xdata)


Output file

In [None]:
saida = pd.DataFrame()
saida["Id"] = test.index
saida["median_house_value"] = Y_pred
saida.head()

In [None]:
saida.to_csv("Resultado_KaioTelesOgawa_9345957.csv", index = False)