## Basic Exploration of the Daily Dataset

The objective of this notebook is to find problems in the daily-history data set, and possible cleanup strategies, before doing the full analysis.

### Libraries used in this analysis

Most of the analysis runs in Spark, with some local processing in pandas.

In [1]:
import pandas as pd
from pyspark.sql.functions import *
from pyspark.sql import Row

### PWS Information and Globals

This is the information for the personal weather station (PWS) that took the measurements. The ID is the one assigned by Wunderground. In this case we use only the Cinvestav Telchac station.

In [2]:
#PWS info
pwsID = 'IYUCATNT2'
pwsTz = 'America/Merida'

#Global info of the analysis
startDate='2016-01-01'
endDate='2016-12-31'

### Reading the dataset

The data set is stored hierarchically, with one directory per year and one subdirectory per month.

In [3]:
basePath = 'dailyHistory/'+pwsID+'/'
datePath='%Y/%m/'

dirDateRange = pd.date_range(start=startDate,end=endDate,freq='MS',tz=pwsTz)

dailyHistoryFiles = map(lambda ts:  basePath+ts.strftime(datePath)+'*.csv',dirDateRange)

# map() returns a one-shot iterator; list() consumes it, so keep the list for later calls
dailyHistoryFilesList = list(dailyHistoryFiles)
dailyHistoryFilesList

['dailyHistory/IYUCATNT2/2016/01/*.csv',
 'dailyHistory/IYUCATNT2/2016/02/*.csv',
 'dailyHistory/IYUCATNT2/2016/03/*.csv',
 'dailyHistory/IYUCATNT2/2016/04/*.csv',
 'dailyHistory/IYUCATNT2/2016/05/*.csv',
 'dailyHistory/IYUCATNT2/2016/06/*.csv',
 'dailyHistory/IYUCATNT2/2016/07/*.csv',
 'dailyHistory/IYUCATNT2/2016/08/*.csv',
 'dailyHistory/IYUCATNT2/2016/09/*.csv',
 'dailyHistory/IYUCATNT2/2016/10/*.csv',
 'dailyHistory/IYUCATNT2/2016/11/*.csv',
 'dailyHistory/IYUCATNT2/2016/12/*.csv']

Now that we have the list of files, we open them as a Spark DataFrame.

Some files apparently came back empty when the URL was queried, and this seems to make schema inference fail. Every column is therefore read as a string.

Passing the result of `map()` directly to `load()` did not work: in Python 3, `map()` returns a lazy iterator that can be consumed only once, and the earlier `list()` call had already exhausted it, so `load()` received nothing. To make the paths reusable, the previous cell explicitly converts them to a list.
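The one-shot behaviour is easy to reproduce without Spark. The following is a minimal sketch in plain Python; the `EXAMPLE` station ID and the three month directories are hypothetical values used only for illustration.

```python
# Hypothetical month directories, for illustration only
months = ["2016/01/", "2016/02/", "2016/03/"]

# map() is lazy: nothing is computed until the iterator is consumed
patterns = map(lambda m: "dailyHistory/EXAMPLE/" + m + "*.csv", months)

first_pass = list(patterns)   # consumes every element of the iterator
second_pass = list(patterns)  # the iterator is already exhausted

print(len(first_pass), len(second_pass))

# Materializing the paths once gives an object that can be reused freely
patterns_list = ["dailyHistory/EXAMPLE/" + m + "*.csv" for m in months]
```

Building the list up front (or using a list comprehension instead of `map()`) avoids the problem entirely, at the cost of holding all paths in memory, which is negligible here.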

In [4]:
# Changing the default timezone for the analysis
spark.conf.set("spark.sql.session.timeZone", pwsTz)

# Schema inference causes a problem when the session timezone is changed, so it stays disabled
#    .option("inferSchema","true") \
    
staticDataFrame = spark.read \
    .format("csv") \
    .option("header","true") \
    .load(dailyHistoryFilesList)
#    .load('dailyHistory/IYUCATNT2/*/*/*.csv')
#    .load(list(dailyHistoryFiles))
#    .load(['dailyHistory/IYUCATNT2/2016/01/*.csv',
# 'dailyHistory/IYUCATNT2/2016/02/*.csv'])
    #.load('dailyHistory/IYUCATNT2/2016/*/*.csv')
    #.load(list(dailyHistoryFiles))

staticDataFrame.printSchema()
staticDataFrame.show()

root
 |-- Time: string (nullable = true)
 |-- TemperatureF: string (nullable = true)
 |-- DewpointF: string (nullable = true)
 |-- PressureIn: string (nullable = true)
 |-- WindDirection: string (nullable = true)
 |-- WindDirectionDegrees: string (nullable = true)
 |-- WindSpeedMPH: string (nullable = true)
 |-- WindSpeedGustMPH: string (nullable = true)
 |-- Humidity: string (nullable = true)
 |-- HourlyPrecipIn: string (nullable = true)
 |-- Conditions: string (nullable = true)
 |-- Clouds: string (nullable = true)
 |-- dailyrainin: string (nullable = true)
 |-- SolarRadiationWatts/m^2: string (nullable = true)
 |-- SoftwareType: string (nullable = true)
 |-- DateUTC: string (nullable = true)

+-------------------+------------+---------+----------+-------------+--------------------+------------+----------------+--------+--------------+----------+------+-----------+-----------------------+-------------------+-------------------+
|               Time|TemperatureF|DewpointF|PressureIn|W

We will check how many readings there are per day and whether some days have none (a missing count will be replaced by 0).
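The idea can be sketched locally in plain Python before doing it in Spark: build every calendar day in the range, count readings per day, and give days without readings a count of 0. The timestamps and the five-day range below are hypothetical values, not taken from the dataset.

```python
from collections import Counter
from datetime import date, datetime, timedelta

# Hypothetical readings, for illustration only
readings = [
    "2016-01-02 00:05:00",
    "2016-01-02 00:10:00",
    "2016-01-04 13:30:00",
]

# Hypothetical range; the notebook itself covers all of 2016
start, end = date(2016, 1, 1), date(2016, 1, 5)

# Every calendar day in the range, both ends included
all_days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Number of readings per day
counts = Counter(
    datetime.strptime(r, "%Y-%m-%d %H:%M:%S").date() for r in readings
)

# Equivalent of a left outer join followed by filling missing counts with 0
daily_counts = {d: counts.get(d, 0) for d in all_days}
missing_days = [d for d, n in daily_counts.items() if n == 0]
```

The Spark cells that follow do the same thing at scale: a DataFrame of all dates, a grouped count, and a left outer join with the nulls filled.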

First, create a DataFrame with every date in the range.

In [5]:
allDateRange = pd.date_range(start=startDate,end=endDate,freq='D',tz=pwsTz)
# Date range to RDD of strings
allDateStrRDD = sc.parallelize(allDateRange).map(lambda ts: ts.strftime('%Y-%m-%d'))
# RDD of strings to df of dates
allDateDF = allDateStrRDD.map(Row("Date")).toDF().selectExpr("cast(Date as date)")

allDateDF.printSchema()

root
 |-- Date: date (nullable = true)



Now convert the Time column to a timestamp and derive the date from it.

In [6]:
tsDateDF = staticDataFrame \
    .selectExpr("to_timestamp(Time,'yyyy-MM-dd HH:mm:ss') as TS") \
    .withColumn("Date",col("TS").cast("date"))
tsDateDF.printSchema()

root
 |-- TS: timestamp (nullable = true)
 |-- Date: date (nullable = true)



In [9]:
dateCountsDF = tsDateDF.groupBy("Date").count() #.sort("count")
size = dateCountsDF.count()
dateCountsDF.show(size)

+----------+-----+
|      Date|count|
+----------+-----+
|2016-03-01|  188|
|2016-04-25|  183|
|2016-05-03|  280|
|2016-08-31|  275|
|2016-08-15|  222|
|2016-10-03|  178|
|2016-01-28|  211|
|2016-07-17|  249|
|2016-11-08|  176|
|2016-12-19|  173|
|2016-08-23|  181|
|2016-07-03|  180|
|2016-02-04|  197|
|2016-05-26|  179|
|2016-06-02|  164|
|2016-09-23|  172|
|2016-06-16|  177|
|2016-04-22|  176|
|2016-09-30|  175|
|2016-01-19|  176|
|2016-05-09|  285|
|2016-07-19|  233|
|2016-09-15|  176|
|2016-02-08|  273|
|2016-10-07|  178|
|2016-12-12|  177|
|2016-05-23|  181|
|2016-09-26|  177|
|2016-12-13|  172|
|2016-03-25|  278|
|2016-08-26|  166|
|2016-02-03|  165|
|2016-09-09|  176|
|2016-08-01|  167|
|2016-06-17|  176|
|2016-09-27|  173|
|2016-08-16|  277|
|2016-10-23|  177|
|2016-04-30|  159|
|2016-07-21|  235|
|2016-05-27|  188|
|2016-07-02|  178|
|2016-08-06|  188|
|2016-10-20|  188|
|2016-05-07|  273|
|2016-08-05|  192|
|2016-04-26|  177|
|2016-05-13|  169|
|2016-05-31|  164|
|2016-02-22|

Join the DataFrame of all dates with the per-day counts.

In [13]:
# Joining on a list of column names (["Date"]) keeps a single Date column in the result
joinDF = allDateDF.join(dateCountsDF,["Date"],"left_outer").na.fill({'count':0})


size = joinDF.count()
joinDF.sort("count").show(size)



+----------+-----+
|      Date|count|
+----------+-----+
|2016-03-31|    0|
|2016-01-06|    0|
|2016-01-13|    0|
|2016-01-16|    0|
|2016-11-14|    0|
|2016-01-01|    0|
|2016-11-06|    0|
|2016-03-19|    0|
|2016-11-03|    0|
|2016-01-03|    0|
|2016-01-08|    0|
|2016-06-13|    0|
|2016-06-08|    0|
|2016-03-21|    0|
|2016-06-20|    0|
|2016-06-09|    0|
|2016-06-05|    0|
|2016-03-26|    0|
|2016-08-27|    0|
|2016-06-06|    0|
|2016-01-09|    0|
|2016-08-21|    0|
|2016-06-10|    0|
|2016-03-30|    0|
|2016-07-26|    0|
|2016-12-24|   89|
|2016-05-17|  144|
|2016-09-25|  156|
|2016-03-05|  158|
|2016-10-28|  158|
|2016-04-03|  158|
|2016-04-30|  159|
|2016-03-16|  162|
|2016-10-16|  162|
|2016-02-12|  163|
|2016-02-15|  163|
|2016-03-15|  163|
|2016-02-26|  164|
|2016-10-04|  164|
|2016-06-03|  164|
|2016-05-31|  164|
|2016-06-02|  164|
|2016-03-18|  164|
|2016-02-24|  164|
|2016-10-17|  165|
|2016-02-03|  165|
|2016-07-29|  165|
|2016-02-11|  166|
|2016-06-18|  166|
|2016-02-16|