## Basic Exploration of the Daily Dataset

The objective of this notebook is to find problems in the daily history dataset and possible cleanup strategies before doing the full analysis.

### Libraries used in this analysis

Most of the work runs in Spark, with some local processing in pandas.

In [1]:
import pandas as pd
from pyspark.sql.functions import *
from pyspark.sql import Row

### PWS Information and Globals

This is the information for the sensor (personal weather station, PWS) that took the measurements. The ID is the one assigned by Wunderground. In this case we use only the Cinvestav Telchac station.

In [2]:
#PWS info
pwsID = 'IYUCATNT2'
pwsTz = 'America/Merida'

#Global info of the analysis
startDate='2016-01-01'
endDate='2016-12-31'

### Reading the dataset

The dataset is stored hierarchically, with one directory per year and one subdirectory per month.

In [3]:
basePath = 'dailyHistory/'+pwsID+'/'
datePath='%Y/%m/'

dirDateRange = pd.date_range(start=startDate,end=endDate,freq='MS',tz=pwsTz)

dailyHistoryFiles = map(lambda ts:  basePath+ts.strftime(datePath)+'*.csv',dirDateRange)

# Materialize the lazy map object into a list; the map object itself is empty after one pass
dailyHistoryFilesList = list(dailyHistoryFiles)
dailyHistoryFilesList

['dailyHistory/IYUCATNT2/2016/01/*.csv',
 'dailyHistory/IYUCATNT2/2016/02/*.csv',
 'dailyHistory/IYUCATNT2/2016/03/*.csv',
 'dailyHistory/IYUCATNT2/2016/04/*.csv',
 'dailyHistory/IYUCATNT2/2016/05/*.csv',
 'dailyHistory/IYUCATNT2/2016/06/*.csv',
 'dailyHistory/IYUCATNT2/2016/07/*.csv',
 'dailyHistory/IYUCATNT2/2016/08/*.csv',
 'dailyHistory/IYUCATNT2/2016/09/*.csv',
 'dailyHistory/IYUCATNT2/2016/10/*.csv',
 'dailyHistory/IYUCATNT2/2016/11/*.csv',
 'dailyHistory/IYUCATNT2/2016/12/*.csv']

Now that we have the list of files, we open them as a Spark DataFrame.

Some files appear to have come back empty when the URL was queried, and this seems to make schema inference fail. Every column will therefore be read as a string.

The `map` call returns a lazy iterator, which can be traversed only once. Passing it directly to `load` after `list()` had already consumed it meant that `load` received an empty sequence. To make the paths reusable, the previous cell converts the iterator to a list explicitly.
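The single-pass behaviour described above can be reproduced without Spark. The following is a minimal sketch using only the standard library; the three-month range and the variable names are illustrative assumptions, not part of the notebook.

```python
from datetime import date

# Illustrative inputs: the first three months of 2016 for the same station.
base_path = "dailyHistory/IYUCATNT2/"
months = [date(2016, m, 1) for m in range(1, 4)]

# map() returns a lazy iterator, not a list.
lazy_paths = map(lambda d: base_path + d.strftime("%Y/%m/") + "*.csv", months)

first_pass = list(lazy_paths)   # consumes the iterator
second_pass = list(lazy_paths)  # nothing left to yield

print(first_pass)
print(second_pass)
```

The second call returns an empty list, which is why the notebook keeps the materialized list and passes that to `load`.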

In [4]:
# Changing the default timezone for the analysis
spark.conf.set("spark.sql.session.timeZone", pwsTz)

# Inferring the schema causes a problem when the session timezone is changed
#    .option("inferSchema","true") \
    
staticDataFrame = spark.read \
    .format("csv") \
    .option("header","true") \
    .load(dailyHistoryFilesList)
#    .load('dailyHistory/IYUCATNT2/2016/01/IYUCATNT2-2016-01-02.csv')
#    .load('dailyHistory/IYUCATNT2/*/*/*.csv')
#    .load(list(dailyHistoryFiles))
#    .load(['dailyHistory/IYUCATNT2/2016/01/*.csv',
# 'dailyHistory/IYUCATNT2/2016/02/*.csv'])
    #.load('dailyHistory/IYUCATNT2/2016/*/*.csv')
    #.load(list(dailyHistoryFiles))

staticDataFrame.printSchema()
staticDataFrame.show()

root
 |-- Time: string (nullable = true)
 |-- TemperatureF: string (nullable = true)
 |-- DewpointF: string (nullable = true)
 |-- PressureIn: string (nullable = true)
 |-- WindDirection: string (nullable = true)
 |-- WindDirectionDegrees: string (nullable = true)
 |-- WindSpeedMPH: string (nullable = true)
 |-- WindSpeedGustMPH: string (nullable = true)
 |-- Humidity: string (nullable = true)
 |-- HourlyPrecipIn: string (nullable = true)
 |-- Conditions: string (nullable = true)
 |-- Clouds: string (nullable = true)
 |-- dailyrainin: string (nullable = true)
 |-- SolarRadiationWatts/m^2: string (nullable = true)
 |-- SoftwareType: string (nullable = true)
 |-- DateUTC: string (nullable = true)

+-------------------+------------+---------+----------+-------------+--------------------+------------+----------------+--------+--------------+----------+------+-----------+-----------------------+-------------------+-------------------+
|               Time|TemperatureF|DewpointF|PressureIn|W

## Look for Low-Sample or Zero-Sample Dates

We will see how many readings we have per day and whether there are days with no readings (their counts will be replaced by 0).

First, create a DataFrame with every date in the range.

In [5]:
allDateRange = pd.date_range(start=startDate,end=endDate,freq='D',tz=pwsTz)
# Range to RDD of strings
allDateStrRDD = sc.parallelize(allDateRange).map(lambda ts: ts.strftime('%Y-%m-%d'))
# RDD of strings to df of dates
allDateDF = allDateStrRDD.map(Row("Date")).toDF().selectExpr("cast(Date as date)")

allDateDF.printSchema()

root
 |-- Date: date (nullable = true)



Now convert the Time column to a timestamp.

In [6]:
tsDateDF = staticDataFrame \
    .selectExpr("to_timestamp(Time,'yyyy-MM-dd HH:mm:ss') as TS") \
    .withColumn("Date",col("TS").cast("date"))
tsDateDF.printSchema()

root
 |-- TS: timestamp (nullable = true)
 |-- Date: date (nullable = true)



In [7]:
dateCountsDF = tsDateDF.groupBy("Date").count() #.sort("count")
size = dateCountsDF.count()
dateCountsDF.show(size)

+----------+-----+
|      Date|count|
+----------+-----+
|2016-03-01|  188|
|2016-04-25|  183|
|2016-05-03|  280|
|2016-08-31|  275|
|2016-08-15|  222|
|2016-10-03|  178|
|2016-01-28|  211|
|2016-07-17|  249|
|2016-11-08|  176|
|2016-12-19|  173|
|2016-08-23|  181|
|2016-07-03|  180|
|2016-02-04|  197|
|2016-05-26|  179|
|2016-06-02|  164|
|2016-09-23|  172|
|2016-06-16|  177|
|2016-04-22|  176|
|2016-09-30|  175|
|2016-01-19|  176|
|2016-05-09|  285|
|2016-07-19|  233|
|2016-09-15|  176|
|2016-02-08|  273|
|2016-10-07|  178|
|2016-12-12|  177|
|2016-05-23|  181|
|2016-09-26|  177|
|2016-12-13|  172|
|2016-03-25|  278|
|2016-08-26|  166|
|2016-02-03|  165|
|2016-09-09|  176|
|2016-08-01|  167|
|2016-06-17|  176|
|2016-09-27|  173|
|2016-08-16|  277|
|2016-10-23|  177|
|2016-04-30|  159|
|2016-07-21|  235|
|2016-05-27|  188|
|2016-07-02|  178|
|2016-08-06|  188|
|2016-10-20|  188|
|2016-05-07|  273|
|2016-08-05|  192|
|2016-04-26|  177|
|2016-05-13|  169|
|2016-05-31|  164|
|2016-02-22|

Join the dataset of all dates with the counts.

In [8]:
# Passing the join key as a list of column names keeps a single Date column in the result
joinDF = allDateDF.join(dateCountsDF,["Date"],"left_outer").na.fill({'count':0})


size = joinDF.count()
joinDF.sort("count").show(size)



+----------+-----+
|      Date|count|
+----------+-----+
|2016-11-14|    0|
|2016-06-13|    0|
|2016-06-08|    0|
|2016-06-20|    0|
|2016-06-09|    0|
|2016-06-05|    0|
|2016-06-06|    0|
|2016-11-06|    0|
|2016-06-10|    0|
|2016-08-27|    0|
|2016-08-21|    0|
|2016-11-03|    0|
|2016-03-31|    0|
|2016-01-06|    0|
|2016-01-13|    0|
|2016-01-16|    0|
|2016-01-01|    0|
|2016-03-19|    0|
|2016-01-03|    0|
|2016-01-08|    0|
|2016-07-26|    0|
|2016-03-21|    0|
|2016-03-26|    0|
|2016-01-09|    0|
|2016-03-30|    0|
|2016-12-24|   89|
|2016-05-17|  144|
|2016-09-25|  156|
|2016-10-28|  158|
|2016-04-03|  158|
|2016-03-05|  158|
|2016-04-30|  159|
|2016-10-16|  162|
|2016-03-16|  162|
|2016-02-12|  163|
|2016-02-15|  163|
|2016-03-15|  163|
|2016-06-03|  164|
|2016-05-31|  164|
|2016-06-02|  164|
|2016-10-04|  164|
|2016-02-26|  164|
|2016-03-18|  164|
|2016-02-24|  164|
|2016-07-29|  165|
|2016-10-17|  165|
|2016-02-03|  165|
|2016-06-18|  166|
|2016-08-26|  166|
|2016-02-11|
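The left join with a zero fill can be illustrated on a small in-memory sample. This is a sketch with the standard library only; the function name `daily_counts` and the sample dates are hypothetical and do not come from the dataset.

```python
from collections import Counter
from datetime import date, timedelta

def daily_counts(reading_dates, start, end):
    """Count readings per day over [start, end], using 0 for days with no readings."""
    counts = Counter(reading_dates)
    n_days = (end - start).days + 1
    all_days = [start + timedelta(days=i) for i in range(n_days)]
    # Equivalent in spirit to: all dates LEFT JOIN counts, then fill nulls with 0.
    return {day: counts.get(day, 0) for day in all_days}

# Hypothetical sample: three readings on Jan 2 and two on Jan 4.
sample = [date(2016, 1, 2)] * 3 + [date(2016, 1, 4)] * 2
result = daily_counts(sample, date(2016, 1, 1), date(2016, 1, 4))
missing = [day for day, n in result.items() if n == 0]

print(result)
print(missing)
```

Starting from the full calendar rather than from the readings is what makes the empty days visible; grouping the readings alone can never produce a row for a day that has none.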

## Exploring the Wind Speed and Wind Direction

Now let's look at the ranges of wind speed and wind direction to see what we have.

We will be working with the following columns. This is the schema inferred from a good file:

 |-- Time: timestamp (nullable = true)

 |-- WindDirection: string (nullable = true)
 
 |-- WindDirectionDegrees: integer (nullable = true)
 
 |-- WindSpeedMPH: double (nullable = true)
 
 |-- WindSpeedGustMPH: double (nullable = true)
 
 

In [9]:
windDF = staticDataFrame.selectExpr(
    "Time",
    "to_timestamp(Time,'yyyy-MM-dd HH:mm:ss') as TS",
    "WindDirection",
    "cast(WindDirectionDegrees as integer)",
    "cast(WindSpeedMPH as double)",
    "cast(WindSpeedGustMPH as double)")
# Register a SQL view for later analysis
windDF.createOrReplaceTempView("windTable")
windDF.printSchema()
windDF.show()

root
 |-- Time: string (nullable = true)
 |-- TS: timestamp (nullable = true)
 |-- WindDirection: string (nullable = true)
 |-- WindDirectionDegrees: integer (nullable = true)
 |-- WindSpeedMPH: double (nullable = true)
 |-- WindSpeedGustMPH: double (nullable = true)

+-------------------+-------------------+-------------+--------------------+------------+----------------+
|               Time|                 TS|WindDirection|WindDirectionDegrees|WindSpeedMPH|WindSpeedGustMPH|
+-------------------+-------------------+-------------+--------------------+------------+----------------+
|2016-05-10 00:00:00|2016-05-10 00:00:00|          N/A|             -737280|         0.0|          -999.0|
|2016-05-10 00:05:00|2016-05-10 00:05:00|          N/A|             -737280|         0.0|          -999.0|
|2016-05-10 00:10:00|2016-05-10 00:10:00|          N/A|             -737280|         0.0|          -999.0|
|2016-05-10 00:15:00|2016-05-10 00:15:00|          N/A|             -737280|         0.0|

Summary statistics:

In [10]:
windDF.describe().show()

+-------+-------------------+-------------+--------------------+-----------------+-------------------+
|summary|               Time|WindDirection|WindDirectionDegrees|     WindSpeedMPH|   WindSpeedGustMPH|
+-------+-------------------+-------------+--------------------+-----------------+-------------------+
|  count|              67834|        67834|               67834|            67834|              67834|
|   mean|               null|         null|   -8694.20044520447|  9.7538726892119|-291.51708582716634|
| stddev|               null|         null|   80046.34975036867|8.566824638701357|  462.6558174471389|
|    min|2016-01-02 00:00:00|          ENE|             -737280|           -999.9|             -999.0|
|    max|2016-12-31 23:52:00|         West|               32767|             38.0|               40.0|
+-------+-------------------+-------------+--------------------+-----------------+-------------------+



Some values in the dataset need to be investigated. WindDirectionDegrees contains one very high and one very low value. Let's see what those rows look like.

### Odd Degrees

These are the 32767 and -737280 degrees:

In [11]:
spark.sql("""
SELECT *
FROM windTable
WHERE WindDirectionDegrees=32767
""").show()

+-------------------+-------------------+-------------+--------------------+------------+----------------+
|               Time|                 TS|WindDirection|WindDirectionDegrees|WindSpeedMPH|WindSpeedGustMPH|
+-------------------+-------------------+-------------+--------------------+------------+----------------+
|2016-12-15 12:12:00|2016-12-15 12:12:00|        North|               32767|         9.0|            12.0|
|2016-12-15 12:22:00|2016-12-15 12:22:00|        North|               32767|      -999.9|             0.0|
|2016-12-15 12:32:00|2016-12-15 12:32:00|        North|               32767|      -999.9|             0.0|
+-------------------+-------------------+-------------+--------------------+------------+----------------+



In [12]:
spark.sql("""
SELECT *
FROM windTable
WHERE WindDirectionDegrees=-737280
""").show(100)

+-------------------+-------------------+-------------+--------------------+------------+----------------+
|               Time|                 TS|WindDirection|WindDirectionDegrees|WindSpeedMPH|WindSpeedGustMPH|
+-------------------+-------------------+-------------+--------------------+------------+----------------+
|2016-05-10 00:00:00|2016-05-10 00:00:00|          N/A|             -737280|         0.0|          -999.0|
|2016-05-10 00:05:00|2016-05-10 00:05:00|          N/A|             -737280|         0.0|          -999.0|
|2016-05-10 00:10:00|2016-05-10 00:10:00|          N/A|             -737280|         0.0|          -999.0|
|2016-05-10 00:15:00|2016-05-10 00:15:00|          N/A|             -737280|         0.0|          -999.0|
|2016-05-10 00:20:00|2016-05-10 00:20:00|          N/A|             -737280|         0.0|          -999.0|
|2016-05-10 00:25:00|2016-05-10 00:25:00|          N/A|             -737280|         0.0|          -999.0|
|2016-05-10 00:30:00|2016-05-10 00:30

It looks like the very low value of -737280 degrees appears when the wind speed is 0, so it is not a big deal. The value of 32767 degrees appears in very few rows and looks like some sort of N/A marker.

So now let's see the frequency of the distinct values that fall outside the 0 to 360 degree range.
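As a side check on these two values (an observation about the numbers, not something the data source documents): 32767 is the largest value a signed 16-bit integer can hold, and -737280 is an exact multiple of 360. Both facts are consistent with placeholder values rather than real measurements, although how the station software actually encodes them is unknown here.

```python
# Arithmetic checks on the two out-of-range direction values.
INT16_MAX = 2**15 - 1      # largest signed 16-bit integer
high_value = 32767
low_value = -737280

is_int16_max = high_value == INT16_MAX
quotient, remainder = divmod(low_value, 360)

print(is_int16_max)         # True
print(quotient, remainder)  # -2048 0
```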

In [13]:
spark.sql("""
SELECT WindDirectionDegrees, COUNT(WindDirectionDegrees)
FROM windTable
WHERE WindDirectionDegrees NOT BETWEEN 0 and 360
GROUP BY WindDirectionDegrees
""").show()

+--------------------+---------------------------+
|WindDirectionDegrees|count(WindDirectionDegrees)|
+--------------------+---------------------------+
|               32767|                          3|
|             -737280|                        809|
+--------------------+---------------------------+



Now look at all the distinct combinations of values in those rows.

In [14]:
spark.sql("""
SELECT DISTINCT WindDirection, WindDirectionDegrees, WindSpeedMPH, WindSpeedGustMPH
FROM windTable
WHERE WindDirectionDegrees NOT BETWEEN 0 and 360
""").show()

+-------------+--------------------+------------+----------------+
|WindDirection|WindDirectionDegrees|WindSpeedMPH|WindSpeedGustMPH|
+-------------+--------------------+------------+----------------+
|        North|               32767|         9.0|            12.0|
|        North|               32767|      -999.9|             0.0|
|          N/A|             -737280|         0.0|          -999.0|
+-------------+--------------------+------------+----------------+



Now exclude the out-of-range angles and compute another summary.

In [15]:
spark.sql("""
SELECT *
FROM windTable
WHERE WindDirectionDegrees BETWEEN 0 AND 360
""").describe().show()

+-------+-------------------+-------------+--------------------+-----------------+-------------------+
|summary|               Time|WindDirection|WindDirectionDegrees|     WindSpeedMPH|   WindSpeedGustMPH|
+-------+-------------------+-------------+--------------------+-----------------+-------------------+
|  count|              67022|        67022|               67022|            67022|              67022|
|   mean|               null|         null|   98.45761093372326|9.901748679538063|-282.99052549908987|
| stddev|               null|         null|   73.22550994927565|6.533684870646145|  458.8303581371724|
|    min|2016-01-02 00:00:00|          ENE|                   0|              0.0|             -999.0|
|    max|2016-12-31 23:52:00|         West|                 360|             38.0|               40.0|
+-------+-------------------+-------------+--------------------+-----------------+-------------------+



The only negative values that remain are in the wind gust column. Let's find out what they are.

In [16]:
spark.sql("""
SELECT *
FROM windTable
WHERE WindSpeedGustMPH < 0 AND WindDirectionDegrees BETWEEN 0 AND 360
""").show(100)

+-------------------+-------------------+-------------+--------------------+------------+----------------+
|               Time|                 TS|WindDirection|WindDirectionDegrees|WindSpeedMPH|WindSpeedGustMPH|
+-------------------+-------------------+-------------+--------------------+------------+----------------+
|2016-05-10 08:35:00|2016-05-10 08:35:00|           SE|                 135|         0.0|          -999.0|
|2016-05-10 08:40:00|2016-05-10 08:40:00|           SE|                 135|         0.0|          -999.0|
|2016-05-10 18:45:00|2016-05-10 18:45:00|          ENE|                  68|         8.0|          -999.0|
|2016-05-10 18:50:00|2016-05-10 18:50:00|          ENE|                  68|         8.0|          -999.0|
|2016-05-10 18:55:00|2016-05-10 18:55:00|          ENE|                  68|        17.0|          -999.0|
|2016-05-10 19:00:00|2016-05-10 19:00:00|          ENE|                  68|        17.0|          -999.0|
|2016-05-10 19:05:00|2016-05-10 19:05

In [17]:
spark.sql("""
SELECT DISTINCT WindDirection, WindDirectionDegrees, WindSpeedMPH, WindSpeedGustMPH
FROM windTable
WHERE WindSpeedGustMPH < 0 AND WindDirectionDegrees BETWEEN 0 AND 360
""").show(100)

+-------------+--------------------+------------+----------------+
|WindDirection|WindDirectionDegrees|WindSpeedMPH|WindSpeedGustMPH|
+-------------+--------------------+------------+----------------+
|          NNW|                 338|        14.0|          -999.0|
|          WSW|                 248|         1.0|          -999.0|
|          ENE|                  68|        12.0|          -999.0|
|          ENE|                  68|         4.0|          -999.0|
|          SSE|                 158|         9.0|          -999.0|
|         East|                  90|         5.0|          -999.0|
|          WNW|                 293|        24.0|          -999.0|
|          ESE|                 113|         7.0|          -999.0|
|          WNW|                 293|         2.0|          -999.0|
|        North|                   0|        10.0|          -999.0|
|           NE|                  45|        12.0|          -999.0|
|          SSE|                 158|        11.0|          -99

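The -999.0 gust value is very common, and it is spread across many different direction and speed combinations rather than repeated for a single one. We are not taking the gust into account for now, but otherwise this value is a candidate for N/A.

Next, let's find out why there are wind directions of both 0 and 360.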
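One way to treat these values as missing is sketched below in plain Python. The helper names and the rule "any negative speed is missing" are assumptions made for illustration; the sample rows are copied from the query outputs shown earlier.

```python
from typing import Optional

def clean_degrees(value: int) -> Optional[int]:
    """Return the direction if it lies in [0, 360], else None."""
    return value if 0 <= value <= 360 else None

def clean_speed(value: float) -> Optional[float]:
    """Return the speed if it is non-negative, else None (assumed rule)."""
    return value if value >= 0 else None

# (WindDirection, WindDirectionDegrees, WindSpeedMPH, WindSpeedGustMPH)
rows = [
    ("N/A", -737280, 0.0, -999.0),
    ("North", 32767, -999.9, 0.0),
    ("ENE", 68, 8.0, -999.0),
]

cleaned = [
    (name, clean_degrees(deg), clean_speed(speed), clean_speed(gust))
    for name, deg, speed, gust in rows
]
print(cleaned)
```

Replacing the sentinels with missing values before aggregating keeps them out of means and standard deviations, which is what distorts the gust column in the summaries.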

In [18]:
spark.sql("""
SELECT *
FROM windTable
WHERE WindDirectionDegrees=0 OR WindDirectionDegrees=360
""").show()

+-------------------+-------------------+-------------+--------------------+------------+----------------+
|               Time|                 TS|WindDirection|WindDirectionDegrees|WindSpeedMPH|WindSpeedGustMPH|
+-------------------+-------------------+-------------+--------------------+------------+----------------+
|2016-05-05 00:00:00|2016-05-05 00:00:00|        North|                   0|        15.0|          -999.0|
|2016-05-05 00:05:00|2016-05-05 00:05:00|        North|                   0|        15.0|          -999.0|
|2016-05-05 00:10:00|2016-05-05 00:10:00|        North|                   0|        15.0|          -999.0|
|2016-05-05 00:15:00|2016-05-05 00:15:00|        North|                   0|        15.0|          -999.0|
|2016-05-05 00:20:00|2016-05-05 00:20:00|        North|                   0|        15.0|          -999.0|
|2016-05-05 00:45:00|2016-05-05 00:45:00|        North|                   0|        13.0|          -999.0|
|2016-05-05 00:50:00|2016-05-05 00:50

The distribution of 0 and 360 looks like:

In [19]:
spark.sql("""
SELECT WindDirection, WindDirectionDegrees, COUNT(1)
FROM windTable
WHERE WindDirectionDegrees=0 OR WindDirectionDegrees=360
GROUP BY WindDirection, WindDirectionDegrees
""").show()

+-------------+--------------------+--------+
|WindDirection|WindDirectionDegrees|count(1)|
+-------------+--------------------+--------+
|        North|                 360|     545|
|        North|                   0|    1231|
+-------------+--------------------+--------+
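Since 0 and 360 describe the same direction and both are labelled North, one possible cleanup is to fold 360 into 0 with a modulo. This is a sketch of that idea, not a step the notebook performs; `normalize_degrees` is a hypothetical helper and the counts are copied from the table above.

```python
from collections import Counter

def normalize_degrees(deg: int) -> int:
    """Map a direction in [0, 360] onto [0, 360), so 360 becomes 0."""
    return deg % 360

# Counts from the 0/360 distribution above.
north_counts = {360: 545, 0: 1231}

merged = Counter()
for deg, n in north_counts.items():
    merged[normalize_degrees(deg)] += n

print(merged)  # Counter({0: 1776})
```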



Now extract only the North readings and see what they look like.

In [20]:
spark.sql("""
SELECT *
FROM windTable
WHERE WindDirection='North' AND WindDirectionDegrees BETWEEN 0 AND 360
""").describe().show()

+-------+-------------------+-------------+--------------------+------------------+------------------+
|summary|               Time|WindDirection|WindDirectionDegrees|      WindSpeedMPH|  WindSpeedGustMPH|
+-------+-------------------+-------------+--------------------+------------------+------------------+
|  count|               3317|         3317|                3317|              3317|              3317|
|   mean|               null|         null|  113.24992463069039|14.592402773590594|-361.2595719023214|
| stddev|               null|         null|  163.72925315626318| 6.233275053937386| 490.0088043289027|
|    min|2016-01-04 00:10:00|        North|                   0|               0.0|            -999.0|
|    max|2016-12-31 10:27:00|        North|                 360|              38.0|              40.0|
+-------+-------------------+-------------+--------------------+------------------+------------------+



In [21]:
spark.sql("""
SELECT WindDirection, WindDirectionDegrees, COUNT(1)
FROM windTable
WHERE WindDirection='North' AND WindDirectionDegrees BETWEEN 0 AND 360
GROUP BY WindDirection, WindDirectionDegrees
""").show()

+-------------+--------------------+--------+
|WindDirection|WindDirectionDegrees|count(1)|
+-------------+--------------------+--------+
|        North|                 349|      31|
|        North|                 360|     545|
|        North|                   5|     573|
|        North|                   0|    1231|
|        North|                 350|      65|
|        North|                  11|     176|
|        North|                 355|     338|
|        North|                 351|      52|
|        North|                  10|      84|
|        North|                   9|     222|
+-------------+--------------------+--------+



#### Bad Average Around North

Extracting the North readings shows a problem when averaging them: the average does not point north, although it should. Min and max are uninformative since together they span the whole range.

In [22]:
spark.sql("""
SELECT AVG(WindDirectionDegrees), MIN(WindDirectionDegrees), MAX(WindDirectionDegrees)
FROM windTable
WHERE WindDirection='North' AND WindDirectionDegrees BETWEEN 0 AND 360
""").show()

+-------------------------+-------------------------+-------------------------+
|avg(WindDirectionDegrees)|min(WindDirectionDegrees)|max(WindDirectionDegrees)|
+-------------------------+-------------------------+-------------------------+
|       113.24992463069039|                        0|                      360|
+-------------------------+-------------------------+-------------------------+



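This is an attempt to calculate the average by turning the high angles into negative ones. Note that the query uses SELECT DISTINCT, so the average is taken over the distinct angle values rather than over all readings: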

In [32]:
spark.sql("""
SELECT AVG(Angle), MIN(Angle), MAX(Angle)
FROM (
  SELECT DISTINCT
    WindDirectionDegrees,
    CASE WHEN WindDirectionDegrees > 180 THEN WindDirectionDegrees-360 ELSE WindDirectionDegrees END AS Angle     
  FROM windTable
  WHERE WindDirection='North' AND WindDirectionDegrees BETWEEN 0 AND 360
)
""").show()

+----------+----------+----------+
|avg(Angle)|min(Angle)|max(Angle)|
+----------+----------+----------+
|       0.0|       -11|        11|
+----------+----------+----------+
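Because the query in In [32] selects DISTINCT angles, its average ignores how often each angle occurs. The sketch below recomputes the averages locally from the (degrees, count) pairs in the North breakdown of In [21]; the helper name `to_signed` is an assumption made for illustration.

```python
# (degrees, count) pairs copied from the North breakdown in In [21].
north_counts = [
    (349, 31), (360, 545), (5, 573), (0, 1231), (350, 65),
    (11, 176), (355, 338), (351, 52), (10, 84), (9, 222),
]

def to_signed(deg: int) -> int:
    """Map angles above 180 to their negative equivalent, e.g. 350 -> -10."""
    return deg - 360 if deg > 180 else deg

total = sum(n for _, n in north_counts)

# Plain average over all readings (reproduces the misleading ~113 degrees).
naive_mean = sum(deg * n for deg, n in north_counts) / total

# Average over distinct values only, as the DISTINCT query does.
signed_distinct_mean = sum(to_signed(deg) for deg, _ in north_counts) / len(north_counts)

# Average over all readings after the sign shift.
signed_weighted_mean = sum(to_signed(deg) * n for deg, n in north_counts) / total

print(naive_mean, signed_distinct_mean, signed_weighted_mean)
```

The sign shift only helps when the angles cluster around north; for readings clustered around south (180) the same wrap-around problem would reappear at the ±180 boundary.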



Now let's calculate the X and Y components of each direction and average those instead.
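A minimal sketch of that component-based (circular) mean, in plain Python rather than Spark: each angle becomes a unit vector, the vectors are averaged, and the result is converted back to an angle. The function name `circular_mean_degrees` is an assumption, and the counts are again copied from the North breakdown in In [21].

```python
import math

def circular_mean_degrees(angles, weights=None):
    """Weighted circular mean of angles in degrees, returned in [0, 360).

    The result is not meaningful if the summed vector is (close to) zero,
    for example for two opposite directions.
    """
    if weights is None:
        weights = [1.0] * len(angles)
    x = sum(w * math.cos(math.radians(a)) for a, w in zip(angles, weights))
    y = sum(w * math.sin(math.radians(a)) for a, w in zip(angles, weights))
    return math.degrees(math.atan2(y, x)) % 360.0

# Two directions on either side of north average to a northerly direction.
print(circular_mean_degrees([350, 20]))

# (degrees, count) pairs copied from the North breakdown in In [21].
north_counts = [
    (349, 31), (360, 545), (5, 573), (0, 1231), (350, 65),
    (11, 176), (355, 338), (351, 52), (10, 84), (9, 222),
]
north_mean = circular_mean_degrees(
    [deg for deg, _ in north_counts],
    [n for _, n in north_counts],
)
print(north_mean)
```

Unlike the sign shift, this approach does not depend on where the readings cluster, so the same function can be applied to any direction group.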