# 2. Data exploration

In the previous notebook, two types of data were loaded into a PostGIS database called 'bcn_traffic':

1. Monthly traffic reports for street sections of Barcelona
2. Geometries of the street sections of Barcelona

-------------------------

**Field information:**

1. Monthly traffic report
   * idTram: Section identification number
   * data: Timestamp of the record (format: YYYYMMDDHHMMSS)
   * estatActual: Current traffic flow (0 = no data / 1 = very fluid / 2 = fluid / 3 = dense / 4 = very dense / 5 = congestion / 6 = closed)
   * estatPrevist: Estimated traffic flow after 15 min (same codes as estatActual)

2. Geometries (each traffic section)
   * Tram: Section identification number (matches idTram)
   * Tram_Components: Different points within the same section
   * Descripció: Description of the section
   * Longitud: Longitude
   * Latitud: Latitude
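
The numeric state codes listed above are easier to read in tables and plots once they are mapped to labels. A minimal sketch, using English labels for the codes; the names `FLOW_LABELS` and `label_flow` are illustrative and not part of the original notebook:

```python
# Hypothetical lookup table: traffic state code -> English label
# (translated from the Catalan labels in the field description).
FLOW_LABELS = {
    0: "no data",
    1: "very fluid",
    2: "fluid",
    3: "dense",
    4: "very dense",
    5: "congestion",
    6: "closed",
}


def label_flow(code: int) -> str:
    """Return the label for a state code, or 'unknown' for unexpected values."""
    return FLOW_LABELS.get(code, "unknown")


print(label_flow(2))  # fluid
print(label_flow(6))  # closed
```

On a DataFrame column, `Series.map(FLOW_LABELS)` would apply the same lookup to every row.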

In [1]:
# Import libraries
import db  # local helper module (db.py) for database access
import numpy as np
import pandas as pd
import geopandas as gpd
import contextily as ctx
import matplotlib.pyplot as plt
import seaborn as sns
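# Plot styling: color palette plus font and line sizes.
# sns.despine() is omitted here: it only affects existing axes, so calling it
# before any plot just creates an empty figure.
sns.set_palette('Set2')
sns.set_context(rc={'axes.labelsize':18,
                    'axes.titlesize':18,
                    'font.size':15,
                    'legend.fontsize':15,
                    'lines.linewidth':2.2})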

In [2]:
table_name = 'Gener2019'

# Select all records from the table,
# keeping only rows that have current traffic flow data (values 1-6)
query = f"""
        select * from {table_name}
        where current_flow <> 0
        """

column_names = ['sectionID', 'time', 'current_flow', 'estimated_flow']

jan2019 = db.db_to_df(query, column_names)

print(jan2019.shape)
jan2019.head()

(471955, 4)


|   | sectionID | time | current_flow | estimated_flow |
|---|---|---|---|---|
| 0 | 37 | 20190101000551 | 2 | 5 |
| 1 | 62 | 20190101000551 | 1 | 0 |
| 2 | 67 | 20190101000551 | 2 | 0 |
| 3 | 69 | 20190101000551 | 2 | 0 |
| 4 | 81 | 20190101000551 | 6 | 6 |
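
The helper `db.db_to_df` lives in `db.py`, which is not shown in this notebook. As a rough idea of what such a helper might do, here is a sketch that runs a query and renames the columns. Everything in it is an assumption: the extra `conn` parameter, the use of `pandas.read_sql_query`, and the in-memory SQLite table that stands in for the PostGIS database (the second row is made-up toy data).

```python
import sqlite3

import pandas as pd


def db_to_df(conn, query, column_names):
    """Hypothetical helper: run `query` on `conn` and rename the columns."""
    df = pd.read_sql_query(query, conn)
    df.columns = column_names
    return df


# Stand-in database: in-memory SQLite instead of PostGIS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table demo (section_id integer, time integer, "
    "current_flow integer, estimated_flow integer)"
)
conn.executemany(
    "insert into demo values (?, ?, ?, ?)",
    [
        (37, 20190101000551, 2, 5),  # first row shown above
        (40, 20190101000551, 0, 0),  # toy row without flow data
    ],
)

column_names = ['sectionID', 'time', 'current_flow', 'estimated_flow']
demo = db_to_df(conn, "select * from demo where current_flow <> 0", column_names)
print(demo.shape)  # (1, 4)
```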


In [3]:
# Check whether the data has corrupted points:
# 1. rows with missing values
# 2. duplicated rows

# any(axis=1) flags each row that contains at least one null value
print(f'Proportion of rows with missing values: {jan2019.isnull().any(axis=1).mean()}')
print(f'Proportion of duplicated rows: {jan2019.duplicated().mean()}')

Proportion of rows with missing values: 0.0
Proportion of duplicated rows: 0.0
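
Since both checks return 0.0 on this dataset, it can help to see what they return when problems do exist. The sketch below uses a made-up toy frame (not the traffic data) to show how the row-wise and column-wise versions of the null check differ, and what `duplicated()` counts:

```python
import pandas as pd

# Toy frame (made-up values): row 1 duplicates row 0, row 2 has a missing value.
toy = pd.DataFrame({"a": [1, 1, 3, 4], "b": [2.0, 2.0, None, 5.0]})

share_null_rows = toy.isnull().any(axis=1).mean()  # share of rows with >= 1 null
share_null_cols = toy.isnull().any(axis=0).mean()  # share of columns with >= 1 null
share_dup_rows = toy.duplicated().mean()           # share of rows repeating an earlier row

print(share_null_rows, share_null_cols, share_dup_rows)  # 0.25 0.5 0.25
```

Only `axis=1` answers the question "what fraction of rows is incomplete?"; the default `axis=0` reports the fraction of columns that contain a null.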


In [4]:
# How many rows lack estimation data (value=0)?

no_data = (jan2019['estimated_flow'] == 0)
print(f'Rows lacking estimated traffic flow (value=0): {no_data.sum()}')

Rows lacking estimated traffic flow (value=0): 60151
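
To put that count in context, it can be expressed as a share of all loaded rows. A short worked calculation, using only the two numbers printed above (the variable names are illustrative):

```python
# Counts reported by the cells above
total_rows = 471955            # rows with current traffic flow data
rows_without_estimate = 60151  # rows where estimated_flow == 0

share = rows_without_estimate / total_rows
print(f"{share:.1%}")  # 12.7%
```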


In [5]:
# Print data types
jan2019.dtypes

sectionID         int64
time              int64
current_flow      int64
estimated_flow    int64
dtype: object

In [6]:
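# Parse the integer timestamps (YYYYMMDDHHMMSS) into datetimes;
# an explicit format avoids relying on pandas' format inference
jan2019['time'] = pd.to_datetime(jan2019['time'].astype(str), format='%Y%m%d%H%M%S')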

print(jan2019['time'].dtype)
jan2019.head()

datetime64[ns]


|   | sectionID | time | current_flow | estimated_flow |
|---|---|---|---|---|
| 0 | 37 | 2019-01-01 00:05:51 | 2 | 5 |
| 1 | 62 | 2019-01-01 00:05:51 | 1 | 0 |
| 2 | 67 | 2019-01-01 00:05:51 | 2 | 0 |
| 3 | 69 | 2019-01-01 00:05:51 | 2 | 0 |
| 4 | 81 | 2019-01-01 00:05:51 | 6 | 6 |
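
The raw `time` values pack year, month, day, hour, minute and second into one 14-digit integer. A standard-library sketch of the same parsing for a single value, taken from the first row shown earlier:

```python
from datetime import datetime

raw = 20190101000551  # first 'time' value in the table, stored as an integer
parsed = datetime.strptime(str(raw), "%Y%m%d%H%M%S")
print(parsed)  # 2019-01-01 00:05:51
```

`pd.to_datetime` accepts the same directives through its `format` argument.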


In [7]:
# How many unique sectionIDs are there?
jan2019['sectionID'].nunique()

403

In [9]:
# Load the section geometries, building a point geometry (SRID 4326, WGS 84)
# from each longitude/latitude pair
query = '''
        select *, ST_SetSRID(ST_MakePoint(longitude, latitude), 4326) as geometry
        from section_geom
        '''

geom_df = db.db_to_gdf(query, 'geometry')
geom_df.head()

|   | index | sectionID | section_components | description | longitude | latitude | geometry |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | Diagonal (Ronda de Dalt a Doctor Marañón) | 2.112035 | 41.384191 | POINT (2.11204 41.38419) |
| 1 | 1 | 1 | 2 | Diagonal (Ronda de Dalt a Doctor Marañón) | 2.101503 | 41.381631 | POINT (2.10150 41.38163) |
| 2 | 2 | 2 | 1 | Diagonal (Doctor Marañón a Ronda de Dalt) | 2.111944 | 41.384467 | POINT (2.11194 41.38447) |
| 3 | 3 | 2 | 2 | Diagonal (Doctor Marañón a Ronda de Dalt) | 2.101594 | 41.381868 | POINT (2.10159 41.38187) |
| 4 | 4 | 3 | 1 | Diagonal (Doctor Marañón a Pl. Pius XII) | 2.112093 | 41.384229 | POINT (2.11209 41.38423) |
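
Each section appears as several component points. As a quick sanity check on the coordinates, the straight-line distance between two components can be estimated with the haversine formula. This is a sketch under stated assumptions: that components 1 and 2 of section 1 are consecutive points along the section, and that a mean Earth radius of 6,371 km is accurate enough. The name `haversine_m` is illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points (degrees)."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Components 1 and 2 of section 1, copied from the table above
d = haversine_m(2.112035, 41.384191, 2.101503, 41.381631)
print(f"{d:.0f} m")
```

A result far outside the scale of a city street would suggest swapped longitude/latitude columns or a wrong coordinate reference system.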
