# A - Final test, module 5

#### Introduction
Since 2008, guests and hosts have used Airbnb to expand their travel
possibilities and to offer a more unique, personalized way of experiencing the
world.
The purpose of this exercise is to answer some questions about the activity
generated by hosts and to make predictions about it.

#### Dataset
You’ll be using a dataset of 2019 Airbnb listings and metrics in New York
City. You can download the dataset here:

https://drive.google.com/file/d/1P0cxmchPoid4tjJ7gMEbZFY9ZQdnXdvv/view

This dataset has around 49,000 observations and 16 columns, with a
mix of categorical and numeric values.
It is public data from Airbnb, and the original source can be found
on this website: http://insideairbnb.com/

#### Columns
The fields in this dataset are:
- id: lodging id in listing
- name: lodging name in listing
- host_id: id of the host
- host_name: name of the host
- neighbourhood_group: location
- neighbourhood: area
- latitude
- longitude
- room_type: listing space type
- price: price in dollars
- minimum_nights: minimum number of nights
- number_of_reviews
- last_review
- reviews_per_month
- calculated_host_listings_count: number of listings (lodgings) per host
- availability_365: number of days when listing is available for booking

### Exploratory Data Analysis
#### Tasks
1. Present the code and methods for acquiring the data and loading it
into an appropriate format for analysis. Explain the process and results.
2. Explore the data by analyzing its statistics and visualizing the values of
features and correlations between different features. Explain the
process and the results.
3. In case you use a scatterplot based on latitude and longitude
coordinates, you may need this image to get a more readable
map plot:
https://drive.google.com/file/d/1Fu6L8pDt8ujRhIJbubEiIag9PEd_2H_E/view?usp=sharing

### Model Building
#### Task
The purpose of this task is to predict the price of NYC Airbnb rentals based on
the data provided and any external dataset(s) with relevant information. Two
main goals are suggested:
1. Users should submit a csv file with each listing from the data set and the
model-predicted price:
id, price
2539, 149
2595, 225
3647, 150
…. ,...
2. A solution with low root-mean-squared error (RMSE) based on
cross-validation that can be reproduced and interpreted is ideal

We load the libraries needed to work with numpy, pandas and matplotlib.

In [None]:
# load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as img

## 1. Exploratory Data Analysis

We load the CSV with the pandas function read_csv and show the first 10 rows.

In [None]:
data = pd.read_csv("./csv/AB_NYC_2019.csv")

data.head(10)

Let's check the shape of the dataset. It has 48895 rows and 16 columns.

In [None]:
data.shape

and we check the data type stored in each column:

In [None]:

data.dtypes

Now we count how many NaN values there are per column:

In [None]:
data.isnull().sum()

We replace the NaN values with placeholder data so that we can work with these columns later on.

In [None]:
data.fillna({'reviews_per_month':0}, inplace=True)
data.fillna({'name':"NoName"}, inplace=True)
data.fillna({'host_name':"NoName"}, inplace=True)
data.fillna({'last_review':"NotReviewed"}, inplace=True)
data.isnull().sum()
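
The same replacement can be written as a single `fillna` call that takes one dictionary mapping each column to its fill value. The sketch below runs on a small made-up table (`toy` and its values are assumptions for illustration, not rows from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are made up for illustration).
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room"],
    "host_name": ["Ana", "Ben", None],
    "last_review": ["2019-05-01", None, None],
    "reviews_per_month": [0.8, np.nan, np.nan],
})

# One dictionary maps each column to its own replacement value.
fill_values = {
    "reviews_per_month": 0,
    "name": "NoName",
    "host_name": "NoName",
    "last_review": "NotReviewed",
}
toy_filled = toy.fillna(fill_values)

# Total number of missing values left after filling.
remaining_nans = int(toy_filled.isnull().sum().sum())
print(toy_filled)
print("remaining NaN:", remaining_nans)
```

Returning a new dataframe instead of using `inplace=True` keeps the original table available for comparison.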

We trim the dataset so that it no longer carries columns that will not be useful to us. In this case, the columns 'id', 'name' and 'last_review'.

In [None]:
data.drop(['id','name','last_review'], axis=1, inplace=True)

data.head(10)

Let's see how many distinct values the 'neighbourhood_group' column has. We can work with 5 groups.

In [None]:
data['neighbourhood_group'].unique()

We do the same with the 'neighbourhood' column:

In [None]:
# We can also work with the 'neighbourhood' column
data['neighbourhood'].unique()

and finally with the 'room_type' column. There are 3 room types.

In [None]:
data['room_type'].unique()

Having seen these values, let's start working with them. We begin with the 'neighbourhood_group' column:
['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx']
We compute the price statistics for each group.

In [None]:
# Price subsets for each 'neighbourhood_group':
# ['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx']

# 'Brooklyn'
brooklyn = data.loc[data['neighbourhood_group'] == 'Brooklyn']
precio_brooklyn = brooklyn[['price']]

# 'Manhattan'
manhattan = data.loc[data['neighbourhood_group'] == 'Manhattan']
precio_manhattan = manhattan[['price']]

# 'Queens'
queens = data.loc[data['neighbourhood_group'] == 'Queens']
precio_queens = queens[['price']]

# 'Staten Island'
staten = data.loc[data['neighbourhood_group'] == 'Staten Island']
precio_staten = staten[['price']]

# 'Bronx'
bronx = data.loc[data['neighbourhood_group'] == 'Bronx']
precio_bronx = bronx[['price']]

# store the price subsets in a list.
precio_por_neighb = [precio_brooklyn, precio_manhattan, precio_queens, precio_staten, precio_bronx]

lista_neighb = ['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx']
tabla = []

for lista in precio_por_neighb:
    x = lista.describe(percentiles=[.25, .50, .75])
    x = x.iloc[3:]
    x.reset_index(inplace=True)
    x.rename(columns={'index':'Estadísticas'}, inplace=True)
    tabla.append(x)

# Rename each 'price' column to the neighbourhood group it belongs to
tabla[0].rename(columns={'price':lista_neighb[0]}, inplace=True)
tabla[1].rename(columns={'price':lista_neighb[1]}, inplace=True)
tabla[2].rename(columns={'price':lista_neighb[2]}, inplace=True)
tabla[3].rename(columns={'price':lista_neighb[3]}, inplace=True)
tabla[4].rename(columns={'price':lista_neighb[4]}, inplace=True)

tabla = [df.set_index('Estadísticas') for df in tabla]
tabla = tabla[0].join(tabla[1:])
print(tabla)
tabla
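
As a more compact alternative, the same kind of table can be sketched with `groupby` and `describe`, which avoids one filtered dataframe per group. The block below uses a tiny made-up table (`toy` and its prices are assumptions for illustration), so the numbers are not those of the real dataset:

```python
import pandas as pd

# Toy stand-in with two neighbourhood groups (made-up prices).
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Brooklyn",
                            "Manhattan", "Manhattan", "Manhattan"],
    "price": [50, 100, 150, 100, 200, 900],
})

# describe() per group, keep min/quartiles/max, then transpose so that
# each group becomes a column, like the table built above.
stats = (
    toy.groupby("neighbourhood_group")["price"]
       .describe(percentiles=[.25, .50, .75])
       .loc[:, ["min", "25%", "50%", "75%", "max"]]
       .T
)
stats.index.name = "Estadísticas"
print(stats)
```

On the real data, replacing `toy` with `data` should give one column per neighbourhood group without listing the groups by hand.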

Note the huge gap between the third quartile and the maximum in every group: the maxima are outliers, so we will have to deal with them.

Now we work with the 'neighbourhood' column. It has 221 distinct values. Since that is too many, we will work with the top 10.

In [None]:
len(data['neighbourhood'].unique())

In [None]:
data['neighbourhood'].value_counts().head(10)

I create a dataframe with only the data for that top 10:

In [None]:
data_neighb = data.loc[data['neighbourhood'].isin(['Williamsburg','Bedford-Stuyvesant','Harlem','Bushwick','Upper West Side','Hell\'s Kitchen','East Village','Upper East Side','Crown Heights','Midtown'])]
data_neighb
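
The list of names does not have to be typed by hand: it can be taken from the index of `value_counts()`. A minimal sketch on made-up data (`toy`, `top_n` and `top_names` are hypothetical names, and the toy uses a top 3 instead of a top 10):

```python
import pandas as pd

# Toy stand-in: four neighbourhoods with different numbers of listings.
toy = pd.DataFrame({
    "neighbourhood": ["Harlem"] * 4 + ["Midtown"] * 3 + ["Bushwick"] * 2 + ["Astoria"],
    "price": [80, 90, 100, 110, 200, 220, 240, 60, 70, 75],
})

top_n = 3  # the notebook uses 10

# value_counts() sorts by frequency, so the first top_n index labels
# are the most common neighbourhoods.
top_names = toy["neighbourhood"].value_counts().head(top_n).index
toy_top = toy.loc[toy["neighbourhood"].isin(top_names)]

print(list(top_names))
print(toy_top.shape)
```

This keeps the filter in sync with the data if the file changes.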

With this top-10 data, we split the listings into 3 groups according to the 'room_type' column. That way, we count how many listings are 'Entire home/apt', 'Private room' or 'Shared room' in each neighbourhood. I show the data in 3 bar charts.

In [None]:
midtown = data_neighb.loc[data_neighb['neighbourhood'] == 'Midtown']
harlem = data_neighb.loc[data_neighb['neighbourhood'] == 'Harlem']
bedford = data_neighb.loc[data_neighb['neighbourhood'] == 'Bedford-Stuyvesant']
hells = data_neighb.loc[data_neighb['neighbourhood'] == 'Hell\'s Kitchen']
upperW = data_neighb.loc[data_neighb['neighbourhood'] == 'Upper West Side']
william = data_neighb.loc[data_neighb['neighbourhood'] == 'Williamsburg']
crown = data_neighb.loc[data_neighb['neighbourhood'] == 'Crown Heights']
east = data_neighb.loc[data_neighb['neighbourhood'] == 'East Village']
bush = data_neighb.loc[data_neighb['neighbourhood'] == 'Bushwick']
upperE = data_neighb.loc[data_neighb['neighbourhood'] == 'Upper East Side']

# Entire home/apt
midtown_apt = midtown.loc[midtown['room_type'] == 'Entire home/apt'].value_counts()
harlem_apt = harlem.loc[harlem['room_type'] == 'Entire home/apt'].value_counts()
bedford_apt = bedford.loc[bedford['room_type'] == 'Entire home/apt'].value_counts()
hells_apt = hells.loc[hells['room_type'] == 'Entire home/apt'].value_counts()
upperW_apt = upperW.loc[upperW['room_type'] == 'Entire home/apt'].value_counts()
william_apt = william.loc[william['room_type'] == 'Entire home/apt'].value_counts()
crown_apt = crown.loc[crown['room_type'] == 'Entire home/apt'].value_counts()
east_apt = east.loc[east['room_type'] == 'Entire home/apt'].value_counts()
bush_apt = bush.loc[bush['room_type'] == 'Entire home/apt'].value_counts()
upperE_apt = upperE.loc[upperE['room_type'] == 'Entire home/apt'].value_counts()



# Plot the 'Entire home/apt' counts by 'neighbourhood'.
#labels = ['Midtown', 'Harlem', 'Bedford-Stuyvesant', 'Hell\'s Kitchen', 'Upper West Side','Williamsburg','Crown Heights','East Village','Bushwick','Upper East Side']
#apt = [20, 34, 30, 35, 27]

labels = ['Midtown','Harlem','Bedford-Stuyvesant','Hell\'s Kitchen','Upper West Side','Williamsburg','Crown Heights','East Village','Bushwick','Upper East Side']
apt = [midtown_apt.count(),harlem_apt.count(),bedford_apt.count(),hells_apt.count(),upperW_apt.count(),william_apt.count(),
      crown_apt.count(),east_apt.count(), bush_apt.count(), upperE_apt.count()]


x = np.arange(len(labels))  # the label locations
width= 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(x, apt, width, label='Entire home/apt')
#rects1 = ax.bar(x - width/2, apt, width, label='Entire home/apt')
#rects2 = ax.bar(x + width/2, priv, width, label='Private room')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Counts')
ax.set_title('Entire home/apt by Neighbourhood')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=90)
ax.legend()

#ax.bar_label(rects1, padding=3)
#ax.bar_label(rects2, padding=3)

fig.tight_layout()

plt.show()




In [None]:
# Private
midtown_priv = midtown.loc[midtown['room_type'] == 'Private room'].value_counts()
harlem_priv = harlem.loc[harlem['room_type'] == 'Private room'].value_counts()
bedford_priv = bedford.loc[bedford['room_type'] == 'Private room'].value_counts()
hells_priv = hells.loc[hells['room_type'] == 'Private room'].value_counts()
upperW_priv = upperW.loc[upperW['room_type'] == 'Private room'].value_counts()
william_priv = william.loc[william['room_type'] == 'Private room'].value_counts()
crown_priv = crown.loc[crown['room_type'] == 'Private room'].value_counts()
east_priv = east.loc[east['room_type'] == 'Private room'].value_counts()
bush_priv = bush.loc[bush['room_type'] == 'Private room'].value_counts()
upperE_priv = upperE.loc[upperE['room_type'] == 'Private room'].value_counts()

labels = ['Midtown','Harlem','Bedford-Stuyvesant','Hell\'s Kitchen','Upper West Side','Williamsburg','Crown Heights','East Village','Bushwick','Upper East Side']
priv = [midtown_priv.count(),harlem_priv.count(),bedford_priv.count(),hells_priv.count(),upperW_priv.count(),william_priv.count(),
      crown_priv.count(),east_priv.count(), bush_priv.count(), upperE_priv.count()]

x = np.arange(len(labels))  # the label locations
width= 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(x, priv, width, label='Private room')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Counts')
ax.set_title('Private room by Neighbourhood')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=90)
ax.legend()

#ax.bar_label(rects1, padding=3)
#ax.bar_label(rects2, padding=3)

fig.tight_layout()

plt.show()

In [None]:
# Shared room
midtown_shar = midtown.loc[midtown['room_type'] == 'Shared room'].value_counts()
harlem_shar = harlem.loc[harlem['room_type'] == 'Shared room'].value_counts()
bedford_shar = bedford.loc[bedford['room_type'] == 'Shared room'].value_counts()
hells_shar = hells.loc[hells['room_type'] == 'Shared room'].value_counts()
upperW_shar = upperW.loc[upperW['room_type'] == 'Shared room'].value_counts()
william_shar = william.loc[william['room_type'] == 'Shared room'].value_counts()
crown_shar = crown.loc[crown['room_type'] == 'Shared room'].value_counts()
east_shar = east.loc[east['room_type'] == 'Shared room'].value_counts()
bush_shar = bush.loc[bush['room_type'] == 'Shared room'].value_counts()
upperE_shar = upperE.loc[upperE['room_type'] == 'Shared room'].value_counts()

labels = ['Midtown','Harlem','Bedford-Stuyvesant','Hell\'s Kitchen','Upper West Side','Williamsburg','Crown Heights','East Village','Bushwick','Upper East Side']
share = [midtown_shar.count(),harlem_shar.count(),bedford_shar.count(),hells_shar.count(),upperW_shar.count(),william_shar.count(),
      crown_shar.count(),east_shar.count(), bush_shar.count(), upperE_shar.count()]

x = np.arange(len(labels))  # the label locations
width= 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(x, share, width, label='Shared room')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Counts')
ax.set_title('Shared room by Neighbourhood')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=90)
ax.legend()

fig.tight_layout()
plt.show()

To visualize the results better, I combine these data into a single grouped bar chart.

In [None]:
# grouped bar chart with the 3 room types
# (numpy, pandas and matplotlib were already imported in the first cell)

# sub-dataframes with the data for each neighbourhood
midtown = data_neighb.loc[data_neighb['neighbourhood'] == 'Midtown']
harlem = data_neighb.loc[data_neighb['neighbourhood'] == 'Harlem']
bedford = data_neighb.loc[data_neighb['neighbourhood'] == 'Bedford-Stuyvesant']
hells = data_neighb.loc[data_neighb['neighbourhood'] == 'Hell\'s Kitchen']
upperW = data_neighb.loc[data_neighb['neighbourhood'] == 'Upper West Side']
william = data_neighb.loc[data_neighb['neighbourhood'] == 'Williamsburg']
crown = data_neighb.loc[data_neighb['neighbourhood'] == 'Crown Heights']
east = data_neighb.loc[data_neighb['neighbourhood'] == 'East Village']
bush = data_neighb.loc[data_neighb['neighbourhood'] == 'Bushwick']
upperE = data_neighb.loc[data_neighb['neighbourhood'] == 'Upper East Side']

# Entire home/apt
midtown_apt = midtown.loc[midtown['room_type'] == 'Entire home/apt'].value_counts()
harlem_apt = harlem.loc[harlem['room_type'] == 'Entire home/apt'].value_counts()
bedford_apt = bedford.loc[bedford['room_type'] == 'Entire home/apt'].value_counts()
hells_apt = hells.loc[hells['room_type'] == 'Entire home/apt'].value_counts()
upperW_apt = upperW.loc[upperW['room_type'] == 'Entire home/apt'].value_counts()
william_apt = william.loc[william['room_type'] == 'Entire home/apt'].value_counts()
crown_apt = crown.loc[crown['room_type'] == 'Entire home/apt'].value_counts()
east_apt = east.loc[east['room_type'] == 'Entire home/apt'].value_counts()
bush_apt = bush.loc[bush['room_type'] == 'Entire home/apt'].value_counts()
upperE_apt = upperE.loc[upperE['room_type'] == 'Entire home/apt'].value_counts()

# Private room
midtown_priv = midtown.loc[midtown['room_type'] == 'Private room'].value_counts()
harlem_priv = harlem.loc[harlem['room_type'] == 'Private room'].value_counts()
bedford_priv = bedford.loc[bedford['room_type'] == 'Private room'].value_counts()
hells_priv = hells.loc[hells['room_type'] == 'Private room'].value_counts()
upperW_priv = upperW.loc[upperW['room_type'] == 'Private room'].value_counts()
william_priv = william.loc[william['room_type'] == 'Private room'].value_counts()
crown_priv = crown.loc[crown['room_type'] == 'Private room'].value_counts()
east_priv = east.loc[east['room_type'] == 'Private room'].value_counts()
bush_priv = bush.loc[bush['room_type'] == 'Private room'].value_counts()
upperE_priv = upperE.loc[upperE['room_type'] == 'Private room'].value_counts()

# Shared room
midtown_shar = midtown.loc[midtown['room_type'] == 'Shared room'].value_counts()
harlem_shar = harlem.loc[harlem['room_type'] == 'Shared room'].value_counts()
bedford_shar = bedford.loc[bedford['room_type'] == 'Shared room'].value_counts()
hells_shar = hells.loc[hells['room_type'] == 'Shared room'].value_counts()
upperW_shar = upperW.loc[upperW['room_type'] == 'Shared room'].value_counts()
william_shar = william.loc[william['room_type'] == 'Shared room'].value_counts()
crown_shar = crown.loc[crown['room_type'] == 'Shared room'].value_counts()
east_shar = east.loc[east['room_type'] == 'Shared room'].value_counts()
bush_shar = bush.loc[bush['room_type'] == 'Shared room'].value_counts()
upperE_shar = upperE.loc[upperE['room_type'] == 'Shared room'].value_counts()

names = ['Midtown','Harlem','Bedford-Stuyvesant','Hell\'s Kitchen','Upper West Side','Williamsburg','Crown Heights','East Village','Bushwick','Upper East Side']
values_apt = [midtown_apt.count(),harlem_apt.count(),bedford_apt.count(),hells_apt.count(),upperW_apt.count(),william_apt.count(),
      crown_apt.count(),east_apt.count(), bush_apt.count(), upperE_apt.count()]
values_priv = [midtown_priv.count(),harlem_priv.count(),bedford_priv.count(),hells_priv.count(),upperW_priv.count(),william_priv.count(),
      crown_priv.count(),east_priv.count(), bush_priv.count(), upperE_priv.count()]
values_share = [midtown_shar.count(),harlem_shar.count(),bedford_shar.count(),hells_shar.count(),upperW_shar.count(),william_shar.count(),
      crown_shar.count(),east_shar.count(), bush_shar.count(), upperE_shar.count()]

x = np.arange(len(names))  # the label locations
width= 0.25

fig, ax = plt.subplots()
rects1 = ax.bar(x - width, values_apt, width, label='Entire home/apt')
rects2 = ax.bar(x, values_priv, width, label='Private room')
rects3 = ax.bar(x + width, values_share, width, label='Shared room')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Counts')
ax.set_title('Room type by Neighbourhood')
ax.set_xticks(x)
ax.set_xticklabels(names, rotation=90)
ax.legend()

fig.tight_layout()
plt.show()
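
The counts behind this chart can also be obtained in one step with `pd.crosstab`, which builds a neighbourhood by room type table directly. The following is a sketch on a made-up table (`toy` and its rows are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in: a few listings with their neighbourhood and room type.
toy = pd.DataFrame({
    "neighbourhood": ["Harlem", "Harlem", "Harlem",
                      "Midtown", "Midtown", "Bushwick"],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Shared room", "Private room"],
})

# Rows: neighbourhoods, columns: room types, cells: number of listings.
counts = pd.crosstab(toy["neighbourhood"], toy["room_type"])
print(counts)
```

Calling `counts.plot(kind="bar")` on such a table draws one group of bars per neighbourhood, so on the real data `pd.crosstab(data_neighb["neighbourhood"], data_neighb["room_type"])` would be a natural starting point for the same figure.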

Let's define the outliers. To do so, I compute Q1 and Q3; their difference is the interquartile range (IQR). With it I compute the upper threshold, Q3 + 1.5 × IQR, which lets us deal with the gap between Q3 and the maximum, which was 10000 as we saw in the statistics table.

In [None]:
# define the outliers
x = data_neighb["price"]
Q1 = np.percentile(x, 25)
Q3 = np.percentile(x, 75)
Q4 = np.percentile(x, 100)  # the 100th percentile is simply the maximum
rangointer = Q3 - Q1
umbralsuperior = Q3 + (1.5 * rangointer)
umbralinferior = Q1 - (1.5 * rangointer)
# mean of a boolean mask = fraction of prices above the upper threshold
mediaSuperior = np.mean(x > umbralsuperior)
print("Q1:", Q1, "\nQ3:", Q3, "\nMax:", Q4, "\nInterquartile range:", rangointer, "\nUpper threshold:", umbralsuperior, "\nFraction above upper threshold:", mediaSuperior)
print("Dataset shape:", x.shape)
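
The IQR rule used here can be wrapped in a small helper so that it can be reused on any price column. This is a sketch, not part of the original notebook: `iqr_bounds` is a hypothetical helper and the prices are made up.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier thresholds using the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up prices with one extreme value.
prices = np.array([50, 60, 70, 80, 90, 100, 110, 120, 1000])

lower, upper = iqr_bounds(prices)
outlier_mask = (prices < lower) | (prices > upper)

print("lower:", lower, "upper:", upper)
print("outliers:", prices[outlier_mask])
print("share of outliers:", outlier_mask.mean())
```

For these values Q1 is 70 and Q3 is 110, so the IQR is 40 and the thresholds are 10 and 170; only the price of 1000 falls outside them.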

I generate a scatterplot with the dataset. To avoid outliers, I cap the price at the upper threshold computed earlier:

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as img

# keep only prices below 342 (the upper threshold computed above) to drop extreme values
data_price = data[data.price < 342]
# scatterplot used as a map
mapa=data_price.plot(kind='scatter', x='longitude', y='latitude', label='availability_365', c='price', 
                     cmap=plt.get_cmap('jet'), colorbar=True, alpha=0.4)
mapa.legend()

Finally, I add the NYC map image that the exercise asks for, and draw the scatterplot data on top of it.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as img

nyc_img=plt.imread("./images/New_York_City_.PNG")

# NYC bounding box: [lon_min, lon_max, lat_min, lat_max]
plt.imshow(nyc_img, zorder=0, extent=[-74.258, -73.7, 40.49,40.92])
ax=plt.gca()

#Scatterplot
data_price.plot(kind='scatter', x='longitude', y='latitude', label='availability_365', c='price', ax=ax, 
           cmap=plt.get_cmap('jet'), colorbar=True, alpha=0.4, zorder=5)

plt.legend()
plt.show()

## 2. Model Building

We import train_test_split from sklearn to generate the training data:

In [None]:
from sklearn.model_selection import train_test_split

I build a new dataset with the columns I am going to work with:

In [None]:
# .copy() makes all_data independent of data, so later column assignments are safe
all_data = data[['neighbourhood_group','room_type','price','minimum_nights','calculated_host_listings_count','availability_365']].copy()
all_data.head()

Factorize the categorical variables:

In [None]:
#Encoding categorical variables
pd.options.mode.chained_assignment = None
all_data['room_type']=all_data['room_type'].factorize()[0]
all_data['neighbourhood_group']=all_data['neighbourhood_group'].factorize()[0]
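
`factorize` assigns integer codes in order of first appearance, and a linear model reads those codes as an ordered numeric scale. A common alternative, shown here only as a sketch on made-up rows (`toy` is an assumption, and this is not what the notebook does), is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Toy stand-in with one categorical column (made-up rows).
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"],
    "price": [60, 150, 40, 70],
})

# factorize: one integer per category, numbered by first appearance.
codes, uniques = pd.factorize(toy["room_type"])
print(list(codes), list(uniques))

# One-hot encoding: one 0/1 column per category, no implied order.
encoded = pd.get_dummies(toy, columns=["room_type"], dtype=int)
print(encoded)
```

With a linear regression on one-hot columns, each category gets its own coefficient instead of sharing a single slope across arbitrary codes.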

I generate the training and test sets:

In [None]:
#Train test split
y = all_data['price']
x= all_data.drop(['price'],axis=1)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.1,random_state=105)

I fit a linear regression model and report the RMSE for the training and test data:

In [None]:
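# Modelling: linear regression
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(x_train, y_train)

# regr.score() returns R^2, not RMSE, so compute the RMSE from the predictions
rmse_train = np.sqrt(np.mean((y_train - regr.predict(x_train)) ** 2))
rmse_test = np.sqrt(np.mean((y_test - regr.predict(x_test)) ** 2))

print('RMSE train: {:.3f}'.format(rmse_train))
print('RMSE test: {:.3f}'.format(rmse_test))
print('R^2 train: {:.3f}'.format(regr.score(x_train, y_train)))
print('R^2 test: {:.3f}'.format(regr.score(x_test, y_test)))
print('coefficients: \n', regr.coef_)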
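
The task statement asks for an RMSE based on cross-validation, which a single train/test split does not provide. A minimal sketch with scikit-learn's `cross_val_score` follows; the features and target are synthetic (generated with NumPy, not taken from the Airbnb data), and the choice of 5 folds is an assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the listing features and price.
rng = np.random.default_rng(105)
X = rng.normal(size=(200, 3))
y = X @ np.array([30.0, -10.0, 5.0]) + 150 + rng.normal(scale=20.0, size=200)

# 5 shuffled folds; the scorer returns the negated RMSE (higher is better),
# so the sign is flipped to get the RMSE itself.
cv = KFold(n_splits=5, shuffle=True, random_state=105)
neg_rmse = cross_val_score(LinearRegression(), X, y, cv=cv,
                           scoring="neg_root_mean_squared_error")
rmse_per_fold = -neg_rmse

print("RMSE per fold:", np.round(rmse_per_fold, 2))
print("mean RMSE: {:.2f}".format(rmse_per_fold.mean()))
```

On the notebook's data, passing `x` and `y` from the split cell in place of the synthetic `X` and `y` should give a cross-validated RMSE for the same model.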