# Initial data analysis of weather data
This notebook contains an exploratory data analysis (EDA) of the weather data, done to gather information for the ETL process.

In [2]:
import os
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local') \
    .appName('weather') \
    .getOrCreate()

We begin by reading one weather attribute (temperature) and then:
* inspect the DataFrame schema
* verify the time period covered by the data
* check basic descriptive statistics of the data

In [3]:
data_folder = '/Users/tomra/Projects/data-engineering/udacity-data-engineer-nanodegree/06-capstone-project/data'
df = spark.read \
    .format('csv') \
    .options(header=True, inferSchema=True) \
    .load(os.path.join(data_folder, 'historical-hourly-weather-data', 'temperature.csv'))
df.printSchema()

root
 |-- datetime: timestamp (nullable = true)
 |-- Vancouver: double (nullable = true)
 |-- Portland: double (nullable = true)
 |-- San Francisco: double (nullable = true)
 |-- Seattle: double (nullable = true)
 |-- Los Angeles: double (nullable = true)
 |-- San Diego: double (nullable = true)
 |-- Las Vegas: double (nullable = true)
 |-- Phoenix: double (nullable = true)
 |-- Albuquerque: double (nullable = true)
 |-- Denver: double (nullable = true)
 |-- San Antonio: double (nullable = true)
 |-- Dallas: double (nullable = true)
 |-- Houston: double (nullable = true)
 |-- Kansas City: double (nullable = true)
 |-- Minneapolis: double (nullable = true)
 |-- Saint Louis: double (nullable = true)
 |-- Chicago: double (nullable = true)
 |-- Nashville: double (nullable = true)
 |-- Indianapolis: double (nullable = true)
 |-- Atlanta: double (nullable = true)
 |-- Detroit: double (nullable = true)
 |-- Jacksonville: double (nullable = true)
 |-- Charlotte: double (nullable = true)
 |-- M

In [12]:
df.select('datetime').take(1)

[Row(datetime=datetime.datetime(2012, 10, 1, 12, 0))]

In [17]:
df.select('datetime').sort('datetime', ascending=False).take(1)

[Row(datetime=datetime.datetime(2017, 11, 30, 0, 0))]
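As a rough completeness check on the period found above, the number of hourly timestamps between the first and last observation can be computed with the standard library. This is a minimal sketch and not part of the original notebook: `expected_hourly_rows` is a hypothetical helper, and it assumes the series holds exactly one row per hour with no gaps or duplicates.

```python
from datetime import datetime, timedelta


def expected_hourly_rows(start, end):
    """Count hourly timestamps from start to end, both endpoints included."""
    # Floor-dividing two timedeltas yields an int number of whole hours.
    return (end - start) // timedelta(hours=1) + 1


# First and last timestamps as reported by the two cells above.
start = datetime(2012, 10, 1, 12, 0)
end = datetime(2017, 11, 30, 0, 0)

expected = expected_hourly_rows(start, end)
print(expected)  # 45253
```

If `df.count()` differs from this number, the file has gaps or duplicate timestamps that are worth investigating before the ETL step.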

In [7]:
df.select('datetime', 'Chicago').describe().toPandas()

| | summary | Chicago |
|---|---|---|
| 0 | count | 45250.0 |
| 1 | mean | 283.3505727940784 |
| 2 | stddev | 10.997137350331892 |
| 3 | min | 248.89 |
| 4 | max | 308.48 |
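A mean near 283 with a range of 248.89 to 308.48 would be implausible in Celsius or Fahrenheit but fits Kelvin. Converting the summary values makes that easier to judge. The sketch below is an illustration only: `kelvin_to_celsius` is a hypothetical helper, and the assumption that the column is in Kelvin is not confirmed by anything in the data itself.

```python
KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in Kelvin


def kelvin_to_celsius(kelvin):
    """Convert a temperature from Kelvin to degrees Celsius."""
    return kelvin - KELVIN_OFFSET


# Summary statistics for Chicago, copied from the describe() output above.
summary_kelvin = {'mean': 283.3505727940784, 'min': 248.89, 'max': 308.48}

# Assumption: the values are in Kelvin; round to two decimals for readability.
summary_celsius = {
    name: round(kelvin_to_celsius(value), 2)
    for name, value in summary_kelvin.items()
}
print(summary_celsius)
```

An average of roughly 10 °C with extremes near -24 °C and 35 °C is plausible for Chicago. In Spark, the same subtraction could be applied to a column with `df.withColumn(...)`.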


In [9]:
df = spark.read \
    .format('csv') \
    .options(header=True, inferSchema=True) \
    .load(os.path.join(data_folder, 'historical-hourly-weather-data', 'weather_description.csv'))
df.select('datetime', 'Chicago').limit(5).toPandas()

| | datetime | Chicago |
|---|---|---|
| 0 | 2012-10-01 12:00:00 | |
| 1 | 2012-10-01 13:00:00 | overcast clouds |
| 2 | 2012-10-01 14:00:00 | overcast clouds |
| 3 | 2012-10-01 15:00:00 | overcast clouds |
| 4 | 2012-10-01 16:00:00 | overcast clouds |


## Summary
This dataset seems straightforward to use. The data has one-hour granularity. Since precipitation and taxi rides have a finer grain, we'll settle for summarizing the fact table on an hourly basis. The temperature appears to be in Kelvin (the Chicago values range from 248.89 to 308.48, with a mean of about 283.35), so a unit conversion during ETL would make the values easier to use.
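Summarizing on an hourly basis means that finer-grained events have to be mapped to the hour of the matching weather row. A minimal sketch of that mapping follows, assuming an event is assigned to the hour in which it starts. `truncate_to_hour` and the sample timestamps are hypothetical and are not taken from the taxi or precipitation data.

```python
from collections import Counter
from datetime import datetime


def truncate_to_hour(ts):
    """Drop minutes, seconds and microseconds so ts aligns with hourly data."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Hypothetical event timestamps with a finer grain than one hour.
events = [
    datetime(2016, 1, 1, 13, 5, 12),
    datetime(2016, 1, 1, 13, 47, 0),
    datetime(2016, 1, 1, 14, 0, 1),
]

# Count events per hour bucket; each bucket key can be joined to a weather row.
events_per_hour = Counter(truncate_to_hour(ts) for ts in events)
for hour, n in sorted(events_per_hour.items()):
    print(hour, n)
```

In Spark, `pyspark.sql.functions.date_trunc('hour', ...)` performs the same truncation on a timestamp column.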