# 2. Data exploration

In the previous notebook, two types of data were loaded into a PostGIS database called 'bcn_traffic':

1. Monthly traffic report of different street sections of Barcelona
2. Geometries of street sections of Barcelona

-------------------------

**Field information:**

1. Monthly traffic report
   * sectionID: Section identification number

   * time: Date & time of registration (format: YYYYMMDDhhmmss, e.g. 20190901000054)

   * current_flow: Current traffic flow (0 = no data / 1 = very fluid / 2 = fluid / 3 = dense / 4 = very dense / 5 = congestion / 6 = closed)

   * estimated_flow: Estimated traffic flow after 15 min (0 = no data / 1 = very fluid / 2 = fluid / 3 = dense / 4 = very dense / 5 = congestion / 6 = closed)


2. Geometries (each traffic section)
   * Tram: Section identification number (idTram)
   * Tram_Components: Different points within the same section
   * Descripció: Description of the section
   * Longitud: Longitude
   * Latitud: Latitude
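
The flow codes listed above can be turned into readable labels when inspecting the data. The following is a minimal sketch on made-up rows; the dictionary name `FLOW_LABELS` and the `toy` dataframe are assumptions introduced for illustration and are not part of the dataset or of `db.py`.

```python
import pandas as pd

# English renderings of the flow codes listed above.
# FLOW_LABELS is a hypothetical helper name, not part of the dataset.
FLOW_LABELS = {
    0: 'no data',
    1: 'very fluid',
    2: 'fluid',
    3: 'dense',
    4: 'very dense',
    5: 'congestion',
    6: 'closed',
}

# Toy rows, made up for illustration only
toy = pd.DataFrame({'sectionID': [1, 3, 4],
                    'current_flow': [1, 0, 5]})

# Map each numeric code to its label
toy['current_label'] = toy['current_flow'].map(FLOW_LABELS)
print(toy)
```

Keeping the numeric codes in the dataframe and mapping them only for display keeps the columns numeric for later aggregation.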

In [1]:
# Import libraries
import db # db.py
import numpy as np
import pandas as pd
import geopandas as gpd
import contextily as ctx
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('Set2')
sns.despine()
sns.set_context(rc={'axes.labelsize':18,
                    'axes.titlesize':18,
                    'font.size':15,
                    'legend.fontsize':15,
                    'lines.linewidth':2.2})

<Figure size 432x288 with 0 Axes>

In [2]:
# Get the names of all tables in our database
table_names = db.get_table_names()
table_names

[('spatial_ref_sys',),
 ('gener2019',),
 ('febrer2019',),
 ('marc2019',),
 ('abril2019',),
 ('maig2019',),
 ('juny2019',),
 ('juliol2019',),
 ('agost2019',),
 ('setembre2019',),
 ('octubre2019',),
 ('novembre2019',),
 ('desembre2019',),
 ('gener2020',),
 ('febrer2020',),
 ('marc2020',),
 ('abril2020',),
 ('maig2020',),
 ('juny2020',),
 ('juliol2020',),
 ('agost2020',),
 ('setembre2020',),
 ('octubre2020',),
 ('section_geom',)]

In [3]:
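# Select the tables containing monthly traffic flow info
# (skip the PostGIS system table and the geometry table)
other_tables = {'spatial_ref_sys', 'section_geom'}
monthly_tables = [name for (name,) in table_names if name not in other_tables]
monthly_tables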

['gener2019',
 'febrer2019',
 'marc2019',
 'abril2019',
 'maig2019',
 'juny2019',
 'juliol2019',
 'agost2019',
 'setembre2019',
 'octubre2019',
 'novembre2019',
 'desembre2019',
 'gener2020',
 'febrer2020',
 'marc2020',
 'abril2020',
 'maig2020',
 'juny2020',
 'juliol2020',
 'agost2020',
 'setembre2020',
 'octubre2020']
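
Since each table name combines a Catalan month name with a year, it can be split into a (year, month) pair, which also gives a way to recognise monthly tables without relying on their position in the list. This is a sketch under that naming assumption; `CATALAN_MONTHS` and `parse_table_name` are hypothetical names introduced here.

```python
import re

# Catalan month names as spelled in the table names above (without accents)
CATALAN_MONTHS = ['gener', 'febrer', 'marc', 'abril', 'maig', 'juny',
                  'juliol', 'agost', 'setembre', 'octubre', 'novembre',
                  'desembre']

def parse_table_name(name):
    """Split a name such as 'setembre2019' into (year, month_number)."""
    match = re.fullmatch(r'([a-z]+)(\d{4})', name)
    if match is None or match.group(1) not in CATALAN_MONTHS:
        raise ValueError(f'not a monthly table: {name!r}')
    month = CATALAN_MONTHS.index(match.group(1)) + 1
    return int(match.group(2)), month

# A few of the table names listed above
names = ['spatial_ref_sys', 'gener2019', 'setembre2020', 'section_geom']

monthly = []
for name in names:
    try:
        monthly.append((name, *parse_table_name(name)))
    except ValueError:
        pass  # skip tables that do not follow the month+year pattern
print(monthly)
```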

In [14]:
# Let's compare September's traffic in 2019 and 2020

# Names of the tables containing data from the month of September
septembers = ['setembre2019', 'setembre2020']

# Empty dataframe that will collect the data from both months
df_sep = pd.DataFrame()

# Names for the per-year dataframes
df_septembers = ['sep2019', 'sep2020']

In [15]:
# Query to select data from the tables,
# keeping only rows where current traffic flow data exists (values 1-6)

# Predefine column names
column_names = ['sectionID', 'time', 'current_flow', 'estimated_flow']

frames = []
for table in septembers:
    query = f"""
            select * from {table}
            where current_flow <> 0
            """
    # Create a dataframe for this month
    frames.append(db.db_to_df(query, column_names))

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
df_sep = pd.concat(frames)

print(df_sep.shape)
df_sep.head()

(4993502, 4)


|   | sectionID | time | current_flow | estimated_flow |
|---|---|---|---|---|
| 0 | 1 | 20190901000054 | 1 | 0 |
| 1 | 3 | 20190901000054 | 1 | 0 |
| 2 | 4 | 20190901000054 | 2 | 0 |
| 3 | 5 | 20190901000054 | 1 | 0 |
| 4 | 6 | 20190901000054 | 1 | 0 |


In [27]:
# Check data quality:
# 1. rows with missing values
# 2. duplicated rows
# any(axis=1) checks each row; without it the check runs per column
print(f'Proportion of empty rows?: {df_sep.isnull().any(axis=1).mean()}')
print(f'Proportion of duplicated rows?: {df_sep.duplicated().mean()}')

Proportion of empty rows?: 0.0
Proportion of duplicated rows?: 0.950547531572031
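
Roughly 95% of the rows are flagged as duplicates, so it helps to be clear about what `duplicated()` counts. The sketch below uses a made-up `toy` dataframe to show that every repeat after the first occurrence is flagged, and how `drop_duplicates()` would keep one copy of each row. Whether the repeats in the real tables should be dropped is not established here.

```python
import pandas as pd

# Toy data, made up for illustration: the first three rows are identical
toy = pd.DataFrame({
    'sectionID':      [1, 1, 1, 3],
    'time':           [20190901000054] * 4,
    'current_flow':   [1, 1, 1, 2],
    'estimated_flow': [0, 0, 0, 0],
})

# duplicated() marks every repeat after the first occurrence as True,
# so the mean is the share of rows that repeat an earlier row
dup_share = toy.duplicated().mean()
print(dup_share)

# Keep the first occurrence of each distinct row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```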


In [18]:
# How many rows lack estimation data (value=0)?

no_data = (df_sep['estimated_flow'] == 0)
print(f'How many rows lack estimated traffic flow (value=0)?: {no_data.sum()}')

How many rows lack estimated traffic flow (value=0)?: 431653
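
Because 0 means "no data" rather than a flow level, leaving it in place would pull down any average of `estimated_flow`. One possible way to handle this, shown on made-up values, is to convert the 0 codes to missing values so that pandas skips them in aggregations. This is a sketch, not necessarily the treatment used later in the analysis.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({'current_flow':   [1, 2, 5],
                    'estimated_flow': [0, 2, 0]})

# Treat the "no data" code (0) as a missing value
toy['estimated_flow'] = toy['estimated_flow'].replace(0, np.nan)

# Share of rows without an estimate
missing_share = toy['estimated_flow'].isna().mean()
print(missing_share)

# mean() skips NaN by default, so only the valid estimate counts
print(toy['estimated_flow'].mean())
```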


In [19]:
# Print data types
df_sep.dtypes

sectionID         int64
time              int64
current_flow      int64
estimated_flow    int64
dtype: object

In [20]:
# Parse the YYYYMMDDhhmmss integers with an explicit format
df_sep['time'] = pd.to_datetime(df_sep['time'].astype(str), format='%Y%m%d%H%M%S')

print(df_sep['time'].dtype)
df_sep.head()

datetime64[ns]


|   | sectionID | time | current_flow | estimated_flow |
|---|---|---|---|---|
| 0 | 1 | 2019-09-01 00:00:54 | 1 | 0 |
| 1 | 3 | 2019-09-01 00:00:54 | 1 | 0 |
| 2 | 4 | 2019-09-01 00:00:54 | 2 | 0 |
| 3 | 5 | 2019-09-01 00:00:54 | 1 | 0 |
| 4 | 6 | 2019-09-01 00:00:54 | 1 | 0 |
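
Once `time` is a datetime column, the `.dt` accessor exposes the parts needed to compare the two Septembers, such as the year and the hour of day. A minimal sketch on two made-up timestamps in the same integer format as the raw data:

```python
import pandas as pd

# Made-up readings in the raw YYYYMMDDhhmmss integer format
raw = pd.Series([20190901000054, 20200915083010])
times = pd.to_datetime(raw.astype(str), format='%Y%m%d%H%M%S')

# Derive columns that are convenient for grouping
features = pd.DataFrame({
    'year': times.dt.year,
    'hour': times.dt.hour,
    'weekday': times.dt.day_name(),
})
print(features)
```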


In [21]:
# How many unique sectionIDs are there?
df_sep['sectionID'].nunique()

442
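
The count above covers both months together. To check whether the same sections report in each year, the unique IDs can be counted per year and compared as sets. The sketch below uses a made-up `toy` dataframe; the numbers are illustrative and say nothing about the real data.

```python
import pandas as pd

# Toy data, made up for illustration: section 4 only reports in 2019
toy = pd.DataFrame({
    'sectionID': [1, 3, 4, 1, 3],
    'time': pd.to_datetime(['2019-09-01 00:00:54', '2019-09-01 00:00:54',
                            '2019-09-01 00:05:54', '2020-09-01 00:00:54',
                            '2020-09-01 00:00:54']),
})

# Number of distinct sections per year
year = toy['time'].dt.year
sections_per_year = toy.groupby(year)['sectionID'].nunique()
print(sections_per_year)

# Sections present in 2019 but absent in 2020
ids_2019 = set(toy.loc[year == 2019, 'sectionID'])
ids_2020 = set(toy.loc[year == 2020, 'sectionID'])
only_2019 = ids_2019 - ids_2020
print(only_2019)
```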

In [None]:
# Load the section geometries, building a point (SRID 4326) from each longitude/latitude pair
query = '''
        select *, ST_SetSRID(ST_MakePoint(longitude, latitude), 4326) as geometry
        from section_geom
        '''

geom_df = db.db_to_gdf(query, 'geometry')
geom_df.head()