In [1]:
# Load my_utils.ipynb in Notebook
from ipynb.fs.full.my_utils import *

Opening connection to database
Add pythagore() function to SQLite engine
Fraction of the dataset used to train models: 10.00%
my_utils library loaded :-)


# Group dataset by *DATE*

What kind of feature engineering could I do on the categorical features built from my weather dataset?

Well, most of the work has been done while building this categorical dataset (refer to the [NYC Weather Data Preparation](13.NYC%20Weather%20Data%20Preparation.ipynb) notebook for more details), but there's one last thing I've decided to do: group all those categorical features by day.

Looking at my *weather_cat* dataset, I've found that some weather stations did not report any metrics on some days and, moreover, some stations did not always report all of their metrics. Were they out of order? Stopped for maintenance? Not equipped to measure some features that others do?

Whatever the reason, it would be nice to have a complete dataset, so I've decided to drop the *STATION* feature and group the lines of the *weather_cat* dataset by *DATE*. Doing so, I'll get a dataset with one line per day, holding the set of categorical features for that day.

To perform this *grouping* operation, I will use an SQL SELECT query to group by *DATE* and grab the max value of the other columns. As the values are either 0 or 1, taking the max will set a feature to 1 if at least one *STATION* reported 1 for that feature on that *DATE*.
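To make the effect of MAX() on 0/1 flags concrete, here is a minimal sketch on a toy table. The table name `toy_weather`, the station names, and the values are made up for illustration; only the `GROUP BY DATE` / `MAX()` pattern mirrors the approach described above.

```python
import sqlite3

# Hypothetical stand-in for weather_cat: two stations, two days, two 0/1 flags
rows = [
    ("STATION_A", "2016-01-01", 0, 1),
    ("STATION_B", "2016-01-01", 1, 0),
    ("STATION_A", "2016-01-02", 0, 0),
    # STATION_B reported nothing on 2016-01-02
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_weather (STATION TEXT, DATE TEXT, WT01 INTEGER, WT02 INTEGER)"
)
conn.executemany("INSERT INTO toy_weather VALUES (?, ?, ?, ?)", rows)

# MAX() over 0/1 values behaves like a logical OR across stations
grouped = conn.execute(
    "SELECT DATE, MAX(WT01) AS WT01, MAX(WT02) AS WT02 "
    "FROM toy_weather GROUP BY DATE ORDER BY DATE ASC"
).fetchall()
conn.close()

for row in grouped:
    print(row)
```

On the first day each flag was reported by a different station, yet both end up set to 1 in the single grouped line. The second day keeps its line even though only one station reported.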



## What will be the result?

Grouping by *DATE*, taking the maximum value of each categorical feature, and dropping the *STATION* feature, I'll obtain a dataset of 182 lines, which is the number of days from the 1st of January 2016 to the 30th of June 2016 inclusive.

Of course, some of the features might become multi-valued: taking the max value of each WDIR_* feature for one particular *DATE* might result in a day with multiple wind directions.

Is that a problem? I don't think so. Furthermore, it would be far more complicated and error-prone to try to extrapolate features for missing entries.
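As an illustration of what "multiple wind directions" looks like, the sketch below lists the active WDIR_* flags per day. The flag values are copied from the three rows that `df.head(3)` displays further down in this notebook; the variable names are my own.

```python
# Wind direction columns, in the order they appear in the dataset
WDIR_COLUMNS = ["WDIR_E", "WDIR_N", "WDIR_NE", "WDIR_NW",
                "WDIR_S", "WDIR_SE", "WDIR_SW", "WDIR_W"]

# WDIR_* flags of the first three grouped days (taken from the df.head(3) output)
sample = {
    "2016-01-01": [0, 0, 0, 1, 0, 0, 0, 1],
    "2016-01-02": [0, 0, 0, 0, 0, 0, 0, 1],
    "2016-01-03": [0, 0, 0, 0, 0, 0, 1, 1],
}

# Names of the directions flagged for each day
directions_per_day = {
    day: [name for name, flag in zip(WDIR_COLUMNS, flags) if flag == 1]
    for day, flags in sample.items()
}

# Days carrying more than one wind direction after grouping
multi_direction_days = [
    day for day, names in directions_per_day.items() if len(names) > 1
]

print(directions_per_day)
print(multi_direction_days)
```

Two of these three days carry two directions at once, so multi-valued days are not a rare corner case in this sample.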


## Let's do it...

Let's start by verifying that the SQL table name is defined.

In [2]:
# Verify SQL tablename is defined in my_utils library
print("Table name used to save the improved dataset:", WEATHER_CAT_TABLENAME)

Table name used to save the improved dataset: weather_cat_improved


Load the *weather_cat* dataset built in the [NYC Weather Data Preparation](13.NYC%20Weather%20Data%20Preparation.ipynb) notebook.

In [3]:
# Load weather categorical dataset from SQL Database
df=load_sql('weather_cat')

Query: SELECT * FROM weather_cat


Verify that it does not contain any NaN values.

In [4]:
# Check there are no NaN values
print("Number of NaN values in dataset: ", df.isna().sum().sum())

Number of NaN values in dataset:  0


Run the following query, which groups the lines by *DATE*, drops the *STATION* column and keeps the MAX() value of each of the other columns:

    SELECT
    DATE,
    MAX(WT01) AS WT01,
    MAX(WT02) AS WT02,
    MAX(WT03) AS WT03,
    MAX(WT04) AS WT04,
    MAX(WT06) AS WT06,
    MAX(WT08) AS WT08,
    MAX(WT09) AS WT09,
    MAX(WT11) AS WT11,
    MAX(WDIR_E) AS WDIR_E,
    MAX(WDIR_N) AS WDIR_N,
    MAX(WDIR_NE) AS WDIR_NE,
    MAX(WDIR_NW) AS WDIR_NW,
    MAX(WDIR_S) AS WDIR_S,
    MAX(WDIR_SE) AS WDIR_SE,
    MAX(WDIR_SW) AS WDIR_SW,
    MAX(WDIR_W) AS WDIR_W,
    MAX(PEAK_Y) AS PEAK_Y,
    MAX(SNOW_FALL) AS SNOW_FALL,
    MAX(SNOW_ROAD) AS SNOW_ROAD
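    FROM weather_cat
    WHERE DATE != '2016-07-01'
    GROUP BY DATE ORDER BY DATE ASC

> Note: Columns built with the MAX() function are renamed to keep the same feature names. The WHERE clause excludes the 1st of July 2016, so that the dataset stops at the 30th of June 2016.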


In [5]:
# Build query described above
query = 'SELECT DATE'

# Loop over each column name, skipping the first two columns, STATION and DATE ([2:])
for col in df.columns[2:]:
    query += f', MAX({col}) AS {col}'

# Finalize query
query+=" FROM weather_cat WHERE DATE!='2016-07-01' GROUP BY DATE ORDER BY DATE ASC"

# Run query
df=load_sql(query=query, verbose=False)

# Display some lines
df.head(3)

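| | DATE | WT01 | WT02 | WT03 | WT04 | WT06 | WT08 | WT09 | WT11 | WDIR_E | WDIR_N | WDIR_NE | WDIR_NW | WDIR_S | WDIR_SE | WDIR_SW | WDIR_W | PEAK_Y | SNOW_FALL | SNOW_ROAD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 2016-01-02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 2016-01-03 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |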


Check that the number of lines and columns is correct.

- The number of lines must be 182: it's the number of days from the 1st of January 2016 to the 30th of June 2016 inclusive

- The number of columns must be 20, one less than in the *weather_cat* dataset, as we've removed the *STATION* feature
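The figure of 182 days can be double-checked with the standard library. The sketch below also builds the full list of expected dates, which makes it possible to spot a missing day and not only a wrong count. The helper `missing_dates` is hypothetical and assumes dates are stored as ISO strings such as `2016-01-01`, as they appear in the output above.

```python
from datetime import date, timedelta

first_day = date(2016, 1, 1)
last_day = date(2016, 6, 30)

# Both bounds are included, hence the + 1 (2016 is a leap year)
expected_days = (last_day - first_day).days + 1
print("Expected number of days:", expected_days)

# Every calendar day of the period, as ISO 'YYYY-MM-DD' strings
expected_dates = [
    (first_day + timedelta(days=i)).isoformat() for i in range(expected_days)
]

def missing_dates(observed):
    """Return the expected dates that are absent from `observed`."""
    seen = set(observed)
    return [d for d in expected_dates if d not in seen]
```

Under that assumption, calling `missing_dates(df.DATE)` on the grouped dataset would return an empty list when no day is missing.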

In [6]:
# Get number of lines:
number_of_lines = len(df.DATE)

# Get number of columns:
number_of_columns = len(df.columns)

if number_of_lines == 182 and number_of_columns == 20:  # Number of lines and columns match
    print("Number of lines and columns in dataset is correct: {} x {}".format(number_of_lines, number_of_columns))
else:
    print("ERROR: Number of lines and columns in dataset is incorrect: {} x {}".format(number_of_lines, number_of_columns))


Number of lines and columns in dataset is correct: 182 x 20


In [7]:
# Save improved dataset to SQL Database
save_sql(df, tablename=WEATHER_CAT_TABLENAME)

Saving OK


True

# Go to next...

...notebook: [NYC Weather Numerical Dataset Feature Engineering](16.NYC%20Weather%20Numerical%20Dataset%20Feature%20Engineering.ipynb)