# Data engineer
A data engineer designs, builds, and maintains the infrastructure used to manage and process large volumes of data. Key responsibilities may include:

* **Data pipeline development**: Creating data pipelines that extract, transform, and load (ETL) data from various sources into a centralized data warehouse or data lake.  
* **Data modeling**: Developing data models that enable efficient storage, retrieval, and analysis of data.  
* **Data integration**: Integrating data from various sources such as databases, APIs, and files, into a unified format.  
* **Data architecture**: Designing and implementing data architectures that support the requirements of various data stakeholders, including data analysts, data scientists, and business users.  
* **Data quality management**: Ensuring that data is accurate, complete, and consistent by implementing data validation and data cleaning processes.  
* **Performance optimization**: Tuning data processing systems and databases to optimize performance and reduce latency.  
* **Data security and governance**: Ensuring that data is secure and meets regulatory compliance requirements.  

Overall, data engineers play a critical role in ensuring that organizations can effectively store, manage, and analyze large volumes of data to drive business insights and decision-making.
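
To make the extract, transform, and load steps from the list above concrete, here is a minimal sketch using only the Python standard library. The sample data, the `listing` table, and the function names are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real source file.
RAW_CSV = "ID;Room Type;Price\n1;Entire home/apt;90.0\n2;Private room;\n"


def extract(text):
    """Extract: parse semicolon-delimited text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def transform(rows):
    """Transform: rename fields, cast types, and map empty prices to None."""
    return [
        {
            "id": int(row["ID"]),
            "room_type": row["Room Type"],
            "price": float(row["Price"]) if row["Price"] else None,
        }
        for row in rows
    ]


def load(conn, rows):
    """Load: write the cleaned rows into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listing "
        "(id INTEGER PRIMARY KEY, room_type TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO listing (id, room_type, price) "
        "VALUES (:id, :room_type, :price)",
        rows,
    )
    conn.commit()


etl_conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
load(etl_conn, transform(extract(RAW_CSV)))
loaded = etl_conn.execute(
    "SELECT id, room_type, price FROM listing ORDER BY id"
).fetchall()
print(loaded)
```

The rest of this notebook follows the same three steps, with pandas for the transformations and Postgres as the target.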

## Airbnb Paris: how much can I charge? $$$

In [18]:
# importing libraries
import pandas as pd
import psycopg
import requests

import os

# defining constants
CITY = "Paris"

In [2]:
%load_ext dotenv
%dotenv

### Data collection
The first step of a data problem starts with ... data! Where do we get it? Can we collect it ourselves? Is it free? Is it legal?

Here we list some of the most common data sources:
* **Databases**: Databases are one of the primary data sources for many organizations. Data can be collected from SQL databases (such as MySQL, Oracle, and SQL Server) and NoSQL databases (such as MongoDB and Cassandra).  
* **APIs**: Many organizations expose their data through APIs (Application Programming Interfaces). APIs can be used to collect data from various sources, including social media platforms, e-commerce sites, and financial data providers.
* **File systems**: Data can be collected from various file systems, including local file systems, network file systems, and cloud-based file systems such as Amazon S3 and Google Cloud Storage.
* **Web**: Data can also be collected by scraping web pages and extracting data from HTML pages, XML, and JSON documents.

Our data is in a .csv file that can be read directly from the web. To speed things up, we have already downloaded it and read it from our local file system.

In [3]:
#df = pd.read_csv('https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/exports/csv?lang=en&facet=facet(name%3D%22host_verifications%22%2C%20disjunctive%3Dtrue)&facet=facet(name%3D%22amenities%22%2C%20disjunctive%3Dtrue)&facet=facet(name%3D%22features%22%2C%20disjunctive%3Dtrue)&refine=city%3A%22Paris%22&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B')
df = pd.read_csv('../data/airbnb-listings.csv', sep=";")
df.shape

(54513, 89)

In [4]:
df.columns

Index(['ID', 'Listing Url', 'Scrape ID', 'Last Scraped', 'Name', 'Summary',
       'Space', 'Description', 'Experiences Offered', 'Neighborhood Overview',
       'Notes', 'Transit', 'Access', 'Interaction', 'House Rules',
       'Thumbnail Url', 'Medium Url', 'Picture Url', 'XL Picture Url',
       'Host ID', 'Host URL', 'Host Name', 'Host Since', 'Host Location',
       'Host About', 'Host Response Time', 'Host Response Rate',
       'Host Acceptance Rate', 'Host Thumbnail Url', 'Host Picture Url',
       'Host Neighbourhood', 'Host Listings Count',
       'Host Total Listings Count', 'Host Verifications', 'Street',
       'Neighbourhood', 'Neighbourhood Cleansed',
       'Neighbourhood Group Cleansed', 'City', 'State', 'Zipcode', 'Market',
       'Smart Location', 'Country Code', 'Country', 'Latitude', 'Longitude',
       'Property Type', 'Room Type', 'Accommodates', 'Bathrooms', 'Bedrooms',
       'Beds', 'Bed Type', 'Amenities', 'Square Feet', 'Price', 'Weekly Price',
       'Month

In [5]:
# Removing columns we do not need
df.drop(list(df.filter(regex = 'Url')), axis = 1, inplace = True)
df.drop(list(df.filter(regex = 'URL')), axis = 1, inplace = True)
df.drop(list(df.filter(regex = 'Review')), axis = 1, inplace = True)

In [6]:
df.shape

(54513, 70)
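
The three `drop` calls above could probably be collapsed into one pattern. The sketch below only checks which names a combined regular expression selects, using a few column names from the `df.columns` output; `Review Scores Rating` is an assumed name, since the printed column list is truncated.

```python
import re

# Column names from the df.columns output above; the last one is an
# assumed review column, because the printed list is cut off.
columns = ["ID", "Thumbnail Url", "Host URL", "Host Name", "Price",
           "Review Scores Rating"]

# One case-sensitive pattern covering the three separate patterns.
pattern = re.compile(r"Url|URL|Review")

to_drop = [name for name in columns if pattern.search(name)]
to_keep = [name for name in columns if not pattern.search(name)]
print(to_drop)
print(to_keep)
```

With pandas, the same pattern should work as `df.drop(columns=df.filter(regex="Url|URL|Review").columns)`, which avoids `inplace=True` and does the work in a single call.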

In [7]:
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,ID,Scrape ID,Last Scraped,Name,Summary,Space,Description,Experiences Offered,Neighborhood Overview,Notes,Transit,Access,Interaction,House Rules,Host ID,Host Name,Host Since,Host Location,Host About,Host Response Time,Host Response Rate,Host Acceptance Rate,Host Neighbourhood,Host Listings Count,Host Total Listings Count,Host Verifications,Street,Neighbourhood,Neighbourhood Cleansed,Neighbourhood Group Cleansed,City,State,Zipcode,Market,Smart Location,Country Code,Country,Latitude,Longitude,Property Type,Room Type,Accommodates,Bathrooms,Bedrooms,Beds,Bed Type,Amenities,Square Feet,Price,Weekly Price,Monthly Price,Security Deposit,Cleaning Fee,Guests Included,Extra People,Minimum Nights,Maximum Nights,Calendar Updated,Has Availability,Availability 30,Availability 60,Availability 90,Availability 365,Calendar last Scraped,License,Jurisdiction Names,Cancellation Policy,Calculated host listings count,Geolocation,Features
0,7735531,20170404145355,2017-04-06,Petit cocon au coeur de Paris,"Bienvenue chez moi, j'habite en plein centre d...","Idéalement situé a 2 pas du métro, de Beaubour...","Bienvenue chez moi, j'habite en plein centre d...",none,"Mon quartier ""Le Marais"" est unique à Paris. A...",Si possible il est préférable que les voyageur...,Mon appartement est à seulement quelques mètre...,Guides et cartes de Paris à disposition ! Free...,Je serais entièrement disponible pour répondre...,De la propreté et du respect ! Prenez svp vos ...,35578778,Marion,2015-06-11,"Paris, Île-de-France, France","Bonjour, je travaille dans l'hôtellerie depuis...",within an hour,100.0,,Le Marais,1.0,1.0,"email,phone,reviews","Le Marais, Paris, Île-de-France 75004, France",Le Marais,Hôtel-de-Ville,,Paris,Île-de-France,75004.0,Paris,"Paris, France",FR,France,48.858654,2.353462,Apartment,Entire home/apt,2,1.0,0.0,1.0,Real Bed,"Wireless Internet,Kitchen,Elevator in building...",,90.0,,,,10.0,1,0,1,1125,yesterday,,28,52,82,357,2017-04-06,,Paris,moderate,1,"48.85865448642082, 2.35346198925107","Host Has Profile Pic,Is Location Exact"
1,3036231,20170404145355,2017-04-06,Enjoy the lovely heart of Paris !,Beautiful and lightful 58 m2 apartment close t...,Come to enjoy the real Paris !,Beautiful and lightful 58 m2 apartment close t...,none,Le Marais est un quartier incroyable pour se b...,,"Proche Ligne 1 (Saint-Paul, Bastille) et 8 (Ch...",,,,6466602,Pierre,2013-05-19,"Paris, Île-de-France, France","Bonjour tout le monde,\r\n\r\nJe me prénomme P...",within an hour,100.0,,Le Marais,2.0,2.0,"email,phone,reviews,jumio","Le Marais, Paris, Île-de-France 75004, France",Le Marais,Hôtel-de-Ville,,Paris,Île-de-France,75004.0,Paris,"Paris, France",FR,France,48.855628,2.365637,Apartment,Entire home/apt,6,1.0,2.0,2.0,Real Bed,"TV,Internet,Wireless Internet,Kitchen,Heating,...",,140.0,,,500.0,10.0,4,0,7,1125,a week ago,,0,3,3,21,2017-04-06,,Paris,moderate,2,"48.855627773678485, 2.3656368498036344","Host Has Profile Pic,Host Identity Verified,Is..."
2,2183529,20170404145355,2017-04-06,Studio Saint Paul,Colourful and bright studio for 2 guests in a ...,This charming 25m2 studio is very bright and c...,Colourful and bright studio for 2 guests in a ...,none,"Located on a great and quiet neighbourhood, wi...",If you need anything or something is missing i...,The closest subway station is Saint Paul on li...,The beds are prepared with a fitted sheet and ...,After confirming the booking I send you a welc...,No Pets. No Smoking. No Parties.,10574661,Carolyn,2013-12-11,Wales,I am an architect with a passion for photograp...,within a few hours,99.0,,Invalides - Ecole Militaire,37.0,37.0,"email,phone,reviews,jumio","Le Marais, Paris, Île-de-France 75004, France",Le Marais,Hôtel-de-Ville,,Paris,Île-de-France,75004.0,Paris,"Paris, France",FR,France,48.855027,2.365122,Apartment,Entire home/apt,2,1.0,0.0,1.0,Real Bed,"TV,Cable TV,Internet,Wireless Internet,Kitchen...",,80.0,600.0,,200.0,40.0,1,0,4,90,today,,9,33,37,152,2017-04-06,,Paris,strict,31,"48.855026887661616, 2.3651218949410855","Host Has Profile Pic,Host Identity Verified,Is..."
3,515970,20170404145355,2017-04-06,160 M2 Place des Vosges .Marais.,,Welcome to this beautiful apartment which is l...,Welcome to this beautiful apartment which is l...,none,"The apartment is in the center of Paris , in t...",,"The subway is at one minute , walking distance.",All apartment,I can help you all the time during your stay,This is a protected historical building which ...,2474755,Artiste Sandrine,2012-05-27,"Paris, Île-de-France, France",I am french and I am a painter. I'm in my for...,within an hour,100.0,,Le Marais,1.0,1.0,"email,phone,facebook,reviews,jumio","Le Marais, Paris, IDF 75004, France",Le Marais,Hôtel-de-Ville,,Paris,IDF,75004.0,Paris,"Paris, France",FR,France,48.854519,2.365711,Apartment,Entire home/apt,7,2.0,5.0,5.0,Real Bed,"TV,Cable TV,Internet,Wireless Internet,Kitchen...",,690.0,,,,200.0,5,90,3,30,yesterday,,17,38,61,301,2017-04-06,,Paris,strict,1,"48.85451946378233, 2.3657113179989033","Host Has Profile Pic,Host Identity Verified,Is..."
4,3144316,20170404145355,2017-04-06,Heart Marais-22m2 Lovely Studio,Located in the heart of Marais District (Rue d...,Located in the heart of Marais District (Rue d...,Located in the heart of Marais District (Rue d...,none,-A Large variety of restaurants (rue des rosie...,,The nearest tube stations are Saint Paul line ...,,,,15766162,Lea,2014-05-20,"Paris, Île-de-France, France","Hi my name is Lea, I was born in Paris have li...",within a few hours,80.0,,Le Marais,1.0,1.0,"email,phone,reviews","Le Marais, Paris, Île-de-France 75004, France",Le Marais,Hôtel-de-Ville,,Paris,Île-de-France,75004.0,Paris,"Paris, France",FR,France,48.857004,2.358326,Apartment,Entire home/apt,2,1.0,0.0,1.0,Real Bed,"TV,Cable TV,Internet,Wireless Internet,Kitchen...",,90.0,,,200.0,,1,0,5,1125,yesterday,,2,2,6,218,2017-04-06,,Paris,moderate,1,"48.85700449555545, 2.358325752229841","Host Has Profile Pic,Is Location Exact"


### Storing the data
We could set up a database locally, but a cloud database is a popular and easy way to share the data with the world. We set ours up using Amazon Web Services (and it is free! ... I hope).

![Alt text](../database.png)

In [None]:
#establishing the connection (local)
#conn = psycopg.connect(
#   dbname="postgres", user='postgres', password='password', host='127.0.0.1', port= '5432'
#)
#cursor = conn.cursor()

In [None]:
#establishing the connection (cloud)
conn = psycopg.connect(
   dbname=os.environ.get("DB_NAME"),
   user=os.environ.get("DB_USER"),
   password=os.environ.get("DB_PASSWORD"),
   host=os.environ.get("DB_HOST"),
   port= os.environ.get("DB_PORT")
)
cursor = conn.cursor()
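
If one of the environment variables is missing, `os.environ.get` silently returns `None` and the connection fails with a less obvious error. A small check before connecting can make this easier to debug. This is a sketch: `load_db_settings` is a hypothetical helper, and the example values are placeholders.

```python
import os

REQUIRED_KEYS = ("DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT")


def load_db_settings(env=os.environ):
    """Return connection settings, failing early if a variable is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "host": env["DB_HOST"],
        "port": env["DB_PORT"],
    }


# Example with a hypothetical environment (values are placeholders).
example_env = {
    "DB_NAME": "postgres",
    "DB_USER": "postgres",
    "DB_PASSWORD": "password",
    "DB_HOST": "127.0.0.1",
    "DB_PORT": "5432",
}
settings = load_db_settings(example_env)
print(settings["host"], settings["port"])
```

The resulting dictionary can then be passed on as keyword arguments, for example `psycopg.connect(**load_db_settings())`.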

In [14]:
unique_roomtype = df["Room Type"].unique()
tmp = [(i,) for i in unique_roomtype]
with conn.cursor() as cur:
    cur.executemany(
        'INSERT INTO roomtype ("room_type_name") VALUES (%s) ON CONFLICT DO NOTHING',
        tmp
    )
conn.commit()

In [23]:
unique_neighbourhoods = df["Neighbourhood Cleansed"].unique()
tmp = [(i,) for i in unique_neighbourhoods]
with conn.cursor() as cur:
    cur.executemany(
        'INSERT INTO neighbourhood ("neighbourhood_name") VALUES (%s) ON CONFLICT DO NOTHING',
        tmp
    )
conn.commit()

In [24]:
unique_propertytype = df["Property Type"].unique()
tmp = [(i,) for i in unique_propertytype]
with conn.cursor() as cur:
    cur.executemany(
        'INSERT INTO propertytype ("property_type_name") VALUES (%s) ON CONFLICT DO NOTHING',
        tmp
    )
conn.commit()

In [25]:
unique_bedtype = df["Bed Type"].unique()
tmp = [(i,) for i in unique_bedtype]
with conn.cursor() as cur:
    cur.executemany(
        'INSERT INTO bedtype ("bed_type_name") VALUES (%s) ON CONFLICT DO NOTHING',
        tmp
    )
conn.commit()

In [26]:
unique_cancelpolicy = df["Cancellation Policy"].unique()
tmp = [(i,) for i in unique_cancelpolicy]
with conn.cursor() as cur:
    cur.executemany(
        'INSERT INTO cancelpolicy ("cancel_policy_name") VALUES (%s) ON CONFLICT DO NOTHING',
        tmp
    )
conn.commit()
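
`ON CONFLICT DO NOTHING` only skips a row when it would violate a unique constraint, so the inserts above are repeatable only if the name columns are declared unique. The table definitions are not shown in this notebook, so the schema below is an assumption. The sketch uses an in-memory SQLite database as a stand-in for Postgres (it needs SQLite 3.24 or newer, and uses `?` placeholders instead of `%s`), and `insert_lookup_values` is a hypothetical helper.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")  # stand-in for the Postgres connection
# Hypothetical schema: the UNIQUE constraint is what ON CONFLICT reacts to.
demo_conn.execute(
    "CREATE TABLE roomtype "
    "(room_type_id INTEGER PRIMARY KEY, room_type_name TEXT UNIQUE)"
)


def insert_lookup_values(conn, values):
    """Insert distinct non-missing values; existing names are skipped."""
    rows = [(v,) for v in dict.fromkeys(values) if v is not None]
    conn.executemany(
        "INSERT INTO roomtype (room_type_name) VALUES (?) ON CONFLICT DO NOTHING",
        rows,
    )
    conn.commit()


room_types = ["Entire home/apt", "Private room", "Shared room", "Private room", None]
insert_lookup_values(demo_conn, room_types)
insert_lookup_values(demo_conn, room_types)  # second run adds nothing

stored = [
    row[0]
    for row in demo_conn.execute(
        "SELECT room_type_name FROM roomtype ORDER BY room_type_id"
    )
]
print(stored)
```

Filtering out missing values before inserting matters because `unique()` also returns `NaN` when a column has gaps.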

Next, we get additional data from an API: the coordinates of the city center, via the OpenStreetMap Nominatim search endpoint.

In [35]:
api_url = "https://nominatim.openstreetmap.org/search"
params = {
    "q": "paris",
    "format": "json"
}
headers = {
    "User-Agent": "my-app-name/1.0"
}

response = requests.get(api_url, headers=headers, params=params)
response_paris = response.json()[0]
response_paris

{'place_id': 88066702,
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright',
 'osm_type': 'relation',
 'osm_id': 71525,
 'lat': '48.8534951',
 'lon': '2.3483915',
 'class': 'boundary',
 'type': 'administrative',
 'place_rank': 12,
 'importance': 0.8845663630228834,
 'addresstype': 'city',
 'name': 'Paris',
 'display_name': 'Paris, Île-de-France, France métropolitaine, France',
 'boundingbox': ['48.8155755', '48.9021560', '2.2241220', '2.4697602']}

In [36]:
center_lon, center_lat = response_paris["lon"], response_paris["lat"]
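
Note that the API returns `lat` and `lon` as strings, and that `response.json()[0]` raises an `IndexError` when the search finds nothing. A slightly more defensive version might look like the sketch below; `extract_center` is a hypothetical helper, and the sample is trimmed from the response printed above.

```python
def extract_center(results):
    """Return (longitude, latitude) as floats from a Nominatim-style result list."""
    if not results:
        raise ValueError("The search returned no results")
    first = results[0]
    return float(first["lon"]), float(first["lat"])


# Sample trimmed from the response printed above.
sample_results = [{"name": "Paris", "lat": "48.8534951", "lon": "2.3483915"}]
center = extract_center(sample_results)
print(center)
```

In the notebook this would be called as `extract_center(response.json())`, ideally after `response.raise_for_status()` so that HTTP errors surface before parsing.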

In [38]:
with conn.cursor() as cur:
    cur.executemany(
        'INSERT INTO city ("city_name", "center_longitude", "center_latitude") VALUES (%s, %s, %s) ON CONFLICT DO NOTHING',
        [(CITY, center_lon, center_lat)]
    )
conn.commit()

In [39]:
# only keep the columns that we need for the project
df = df[["ID",
        "Neighbourhood Cleansed",
        "Property Type",
        "Room Type",
        "Accommodates",
        "Bathrooms",
        "Bedrooms",
        "Beds",
        "Bed Type",
        "Price",
        "Minimum Nights",
        "Cancellation Policy",
        "Features",
        "Amenities",
        "Longitude",
        "Latitude"]]

# rename columns
df = df.rename(columns={'ID': 'id', 'Neighbourhood Cleansed': 'neighbourhood', 'Property Type': 'property_type', 'Room Type': 'room_type',
                        'Accommodates': 'accommodates', 'Bathrooms': 'bathrooms', 'Bedrooms': 'bedrooms',
                        'Beds': 'beds', 'Bed Type': 'bed_type', 'Price': 'price', 'Minimum Nights': 'minimum_nights',
                        'Cancellation Policy': 'cancel_policy', 'Features': 'features', 'Amenities': 'amenities',
                        'Longitude': 'longitude', 'Latitude': 'latitude'})

# add the city
df['city'] = CITY

In [40]:
# send raw to the database

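# Postgres integer columns cannot store NaN, so missing values are replaced with -1 as a sentinel (ugly)
# note: astype(int) also truncates any fractional values in these columns toward zero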
# numeric_columns = df.select_dtypes(include='number').columns
numeric_columns = ["accommodates", "bathrooms", "bedrooms", "beds", "minimum_nights", "price"]
df[numeric_columns] = df[numeric_columns].fillna(-1).astype(int)

with conn.cursor() as cur:
    records = df.to_dict(orient="records")

    insert_query = """
        INSERT INTO raw
        (id, neighbourhood, room_type, property_type, accommodates, bathrooms, bedrooms, beds, bed_type, price,
         minimum_nights, cancel_policy, features, amenities, longitude, latitude, city)
        VALUES (
            %(id)s, %(neighbourhood)s, %(room_type)s, %(property_type)s, %(accommodates)s,
            %(bathrooms)s, %(bedrooms)s, %(beds)s, %(bed_type)s, %(price)s,
            %(minimum_nights)s, %(cancel_policy)s, %(features)s, %(amenities)s,
            %(longitude)s, %(latitude)s, %(city)s
        )
        ON CONFLICT DO NOTHING;
    """

    # Execute batch insert
    cur.executemany(insert_query, records)
conn.commit()
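
An alternative to the -1 sentinel is to send missing values as SQL `NULL`, since psycopg adapts Python `None` to `NULL`. This assumes the target columns are nullable, which the notebook does not show. The sketch below converts `NaN` to `None` in a single record; `nan_to_none` is a hypothetical helper and the record is made up to resemble one row of `df.to_dict(orient="records")`.

```python
import math


def nan_to_none(record):
    """Replace float NaN values with None so the driver sends SQL NULL."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


# Hypothetical record shaped like one row of df.to_dict(orient="records").
record = {"id": 515970, "bedrooms": 5.0, "beds": float("nan"), "bed_type": "Real Bed"}
cleaned = nan_to_none(record)
print(cleaned)
```

Applied to the batch, this would be `records = [nan_to_none(r) for r in records]` before `executemany`, which keeps "unknown" distinguishable from a real value later on.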

In [44]:
# execute stored procedure to fill up the listings table
with conn.cursor() as cur:
    cur.execute("CALL storelisting();")
conn.commit()

In [45]:
# get data for ds part:
ds_dat = pd.read_sql('SELECT * FROM vw_airbnb', con=conn)


In [46]:
ds_dat

Unnamed: 0,city_name,center_longitude,center_latitude,room_type_name,neighbourhood_name,longitude,latitude,price,minimum_nights,listing_given_id,property_type_name,accommodates,bathrooms,bedrooms,beds,bed_type_name,cancel_policy_name,features,amenities
0,Paris,2.348391,48.853495,Entire home/apt,Hôtel-de-Ville,2.353462,48.858654,$90.00,1,7735531,Apartment,2,1,0,1,Real Bed,moderate,"Host Has Profile Pic,Is Location Exact","Wireless Internet,Kitchen,Elevator in building..."
1,Paris,2.348391,48.853495,Entire home/apt,Hôtel-de-Ville,2.365637,48.855628,$140.00,7,3036231,Apartment,6,1,2,2,Real Bed,moderate,"Host Has Profile Pic,Host Identity Verified,Is...","TV,Internet,Wireless Internet,Kitchen,Heating,..."
2,Paris,2.348391,48.853495,Entire home/apt,Hôtel-de-Ville,2.365122,48.855027,$80.00,4,2183529,Apartment,2,1,0,1,Real Bed,strict,"Host Has Profile Pic,Host Identity Verified,Is...","TV,Cable TV,Internet,Wireless Internet,Kitchen..."
3,Paris,2.348391,48.853495,Entire home/apt,Hôtel-de-Ville,2.365711,48.854519,$690.00,3,515970,Apartment,7,2,5,5,Real Bed,strict,"Host Has Profile Pic,Host Identity Verified,Is...","TV,Cable TV,Internet,Wireless Internet,Kitchen..."
4,Paris,2.348391,48.853495,Entire home/apt,Hôtel-de-Ville,2.358326,48.857004,$90.00,5,3144316,Apartment,2,1,0,1,Real Bed,moderate,"Host Has Profile Pic,Is Location Exact","TV,Cable TV,Internet,Wireless Internet,Kitchen..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54508,Paris,2.348391,48.853495,Entire home/apt,Palais-Bourbon,2.306002,48.858121,$70.00,2,2610162,Apartment,2,1,1,2,Real Bed,moderate,"Host Has Profile Pic,Host Identity Verified,Is...","TV,Cable TV,Internet,Wireless Internet,Kitchen..."
54509,Paris,2.348391,48.853495,Entire home/apt,Palais-Bourbon,2.325696,48.852896,$400.00,2,13760682,Apartment,5,2,2,3,Real Bed,strict,"Host Has Profile Pic,Host Identity Verified,Is...","Wireless Internet,Kitchen,Elevator in building..."
54510,Paris,2.348391,48.853495,Shared room,Palais-Bourbon,2.324841,48.852811,$0.00,1,9173969,Apartment,1,0,1,1,Real Bed,flexible,"Host Has Profile Pic,Is Location Exact","Cable TV,Carbon monoxide detector"
54511,Paris,2.348391,48.853495,Entire home/apt,Palais-Bourbon,2.316800,48.858549,$99.00,4,13754942,Apartment,3,1,2,3,Real Bed,strict,"Host Has Profile Pic,Is Location Exact","TV,Cable TV,Internet,Wireless Internet,Kitchen..."
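
The `price` column comes back as formatted text such as `$90.00`, so it needs to be converted before any numeric analysis. A minimal sketch, assuming a dollar sign prefix and a comma as thousands separator (only the former is visible in the output above); `parse_price` is a hypothetical helper.

```python
def parse_price(text):
    """Convert a string such as '$1,250.00' to a float amount."""
    return float(text.replace("$", "").replace(",", ""))


# Values taken from the price column shown above.
prices = [parse_price(p) for p in ["$90.00", "$140.00", "$690.00"]]
print(prices)
```

On the DataFrame, this could be applied as `ds_dat["price"].map(parse_price)`.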


In [47]:
conn.close()