# Setting up our Schema

Spark can infer a schema for CSV files automatically, but ours don't have a header row. Let's build the schema ourselves from the column names listed in features.txt:

In [1]:
import datetime
from pyspark.sql.types import StructType, StructField, FloatType, LongType, StringType

feats = []
# features.txt holds one column name per line, in the same order as the data columns
with open('features.txt') as f:
    for line_num, line in enumerate(f):
        if line_num == 0:
            # Timestamp
            feats.append(StructField(line.strip(), LongType(), True))
        elif line_num == 1:
            # Geohash
            feats.append(StructField(line.strip(), StringType(), True))
        else:
            # Other features
            feats.append(StructField(line.strip(), FloatType(), True))

schema = StructType(feats)

print(schema)

StructType(List(StructField(Timestamp,LongType,true),StructField(Geohash,StringType,true),StructField(geopotential_height_lltw,FloatType,true),StructField(water_equiv_of_accum_snow_depth_surface,FloatType,true),StructField(drag_coefficient_surface,FloatType,true),StructField(sensible_heat_net_flux_surface,FloatType,true),StructField(categorical_ice_pellets_yes1_no0_surface,FloatType,true),StructField(visibility_surface,FloatType,true),StructField(number_of_soil_layers_in_root_zone_surface,FloatType,true),StructField(categorical_freezing_rain_yes1_no0_surface,FloatType,true),StructField(pressure_reduced_to_msl_msl,FloatType,true),StructField(upward_short_wave_rad_flux_surface,FloatType,true),StructField(relative_humidity_zerodegc_isotherm,FloatType,true),StructField(categorical_snow_yes1_no0_surface,FloatType,true),StructField(u-component_of_wind_tropopause,FloatType,true),StructField(surface_wind_gust_surface,FloatType,true),StructField(total_cloud_cover_entire_atmosphere,FloatType,tru

# Creating a Dataframe

Let's load our tab-separated files into a DataFrame, Spark's abstraction for working with tabular data (built on top of RDDs).

In [2]:
# from pyspark.storagelevel import StorageLevel
# spark.conf.set("spark.sql.broadcastTimeout", 36000)
spark.conf.set("spark.sql.broadcastTimeout", 1200)
# df = spark.read.format('csv').option('sep', '\t').schema(schema).load('/Volumes/evo/Datasets/NAM_2015_S/*')
# df = spark.read.format('csv').option('sep', '\t').schema(schema).load('hdfs://orion11:37000/nam_tiny.tdv')
# df = spark.read.format('csv').option('sep', '\t').schema(schema).load('hdfs://orion11:37000/data/nam/nam_201509.tdv.gz')
# df = spark.read.format('csv').option('sep', '\t').schema(schema).load('hdfs://orion11:37000/data/nam_s/*')
df = spark.read.format('csv').option('sep', '\t').schema(schema).load('hdfs://orion11:37000/data/nam/*')

# df.cache()
# df.persist(StorageLevel.DISK_ONLY)
df.take(1)

[Row(Timestamp=1430438400000, Geohash='dndf9tz5r8eb', geopotential_height_lltw=1915.593994140625, water_equiv_of_accum_snow_depth_surface=0.0, drag_coefficient_surface=0.0, sensible_heat_net_flux_surface=-12.571273803710938, categorical_ice_pellets_yes1_no0_surface=0.0, visibility_surface=24220.529296875, number_of_soil_layers_in_root_zone_surface=3.0, categorical_freezing_rain_yes1_no0_surface=0.0, pressure_reduced_to_msl_msl=101235.0, upward_short_wave_rad_flux_surface=4.25, relative_humidity_zerodegc_isotherm=95.0, categorical_snow_yes1_no0_surface=0.0, u-component_of_wind_tropopause=20.28228759765625, surface_wind_gust_surface=3.9325132369995117, total_cloud_cover_entire_atmosphere=98.0, upward_long_wave_rad_flux_surface=371.25927734375, land_cover_land1_sea0_surface=1.0, vegitation_type_as_in_sib_surface=10.0, v-component_of_wind_pblri=-3.47259521484375, albedo_surface=17.25, lightning_surface=0.0, ice_cover_ice1_no_ice0_surface=0.0, convective_inhibition_surface=-12.582763671875,

### [0.5 pt] Unknown Feature: Choose a feature from the data dictionary above that you have never heard of before. Inspect some of the values for the feature (such as its average, min, max, etc.) and try to guess what it measures. Was your hypothesis correct? (Note: if you are a professional meteorologist, you can skip this question ;-))

* The surface_roughness_surface feature interests me. My guess was that it measures how rough the air feels to the body.
* Surface roughness, often shortened to roughness, is a component of surface texture. It is quantified by the deviations in the direction of the normal vector of a real surface from its ideal form. https://en.wikipedia.org/wiki/Surface_roughness
* Researchers have used a surface roughness model to explore the impact of isolated surface roughness anomalies on the model climate. https://journals.ametsoc.org/doi/full/10.1175/2007JAS2509.1
* Job running time: 35 mins (each of the three values seems to take about 11 mins to calculate)

In [3]:
from pyspark.sql import functions as F  # namespaced so Python's built-in min/max are not shadowed

a = datetime.datetime.now().replace(microsecond=0)

# What's the maximum value?
df.select(F.max(df.surface_roughness_surface)).show()

# What's the minimum value?
df.select(F.min(df.surface_roughness_surface)).show()

# What's the average value?
df.select(F.avg(df.surface_roughness_surface)).show()

b = datetime.datetime.now().replace(microsecond=0)

print('Job running time:', b-a)

+------------------------------+
|max(surface_roughness_surface)|
+------------------------------+
|                      2.750016|
+------------------------------+

+------------------------------+
|min(surface_roughness_surface)|
+------------------------------+
|                  1.5900003E-5|
+------------------------------+

+------------------------------+
|avg(surface_roughness_surface)|
+------------------------------+
|            0.4834922812580254|
+------------------------------+

Job running time: 0:35:14
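
Each `show()` call above is a separate Spark action, and since the DataFrame is not cached, the data is scanned once per aggregate. A minimal sketch of asking for all three values in one Spark SQL statement follows. The helper name `summary_query`, the column aliases, and the view name `NAM` are hypothetical and not part of the original notebook; the sketch only builds the query text, and running it still needs the live `spark` session.

```python
def summary_query(column, view):
    """Build one Spark SQL statement that computes min, max and avg together.

    Backticks quote the identifiers, which matters for column names such as
    u-component_of_wind_tropopause that contain a hyphen.
    """
    col = f"`{column}`"
    return (
        f"SELECT MIN({col}) AS min_value, "
        f"MAX({col}) AS max_value, "
        f"AVG({col}) AS avg_value "
        f"FROM `{view}`"
    )


query = summary_query("surface_roughness_surface", "NAM")
print(query)

# Hypothetical usage inside the notebook (assumes `spark` and `df` exist):
# df.createOrReplaceTempView("NAM")
# spark.sql(query).show()
```

Whether this actually cuts the running time to roughly a third depends on the cluster and on how the input is read, so it is worth timing it the same way as the cell above.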


From: https://data.planetos.com/datasets/noaa_nam_awips_12

![](./images/planetos_srs_title.png)<br>
![](./images/planetos_srs.png)

### [0.5 pt] Hot hot hot: When and where was the hottest temperature observed in the dataset? Is it an anomaly?

In [4]:
a = datetime.datetime.now().replace(microsecond=0)

df.createOrReplaceTempView("TEMP_DF")
hottest = spark.sql("SELECT Timestamp, Geohash, temperature_surface FROM TEMP_DF \
                        WHERE temperature_surface in (SELECT MAX(temperature_surface) FROM TEMP_DF)").collect()

b = datetime.datetime.now().replace(microsecond=0)

print('Hottest temperature observed:', hottest)

print('Job running time:', b-a)

Hottest temperature observed: [Row(Timestamp=1440266400000, Geohash='d5dpds10m55b', temperature_surface=331.390625)]
Job running time: 0:24:06


![](./images/hottest.png)<br>
* Timestamp=1440266400000 corresponds to Saturday, August 22, 2015, 6:00:00 PM GMT.
* Super hot: 331.39 K is roughly 58 °C, far too hot to live in.
* Job running time: 24 mins
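
The conversions behind these notes can be reproduced with the standard library. This is a small sketch that assumes the `Timestamp` column holds milliseconds since the Unix epoch and that `temperature_surface` is in kelvin (the notebook does not state the unit; the magnitude of the value suggests it). The function names are hypothetical.

```python
from datetime import datetime, timezone


def epoch_ms_to_utc(timestamp_ms):
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


def kelvin_to_celsius(kelvin):
    """Convert a temperature from kelvin to degrees Celsius (assumed unit)."""
    return kelvin - 273.15


# Values taken from the hottest row returned by the query above
when = epoch_ms_to_utc(1440266400000)
celsius = kelvin_to_celsius(331.390625)

print(when.isoformat())   # 2015-08-22T18:00:00+00:00
print(round(celsius, 2))  # 58.24
```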

### [1 pt] So Snowy: Find a location that is snowy all year (there are several). Locate a nearby town/city and provide a small writeup about it. Include pictures if you’d like.

In [5]:
a = datetime.datetime.now().replace(microsecond=0)

df.createOrReplaceTempView("SNOWY_DF")

snow_1 = spark.sql("SELECT count(*) as Count, Geohash FROM SNOWY_DF \
                       WHERE categorical_snow_yes1_no0_surface = 1 group by Geohash \
                           order by count(*) DESC").collect()

# Print the ten locations with the most snowy observations
for row in snow_1[:10]:
    print(row)

b = datetime.datetime.now().replace(microsecond=0)

print('Job running time:', b-a)

Row(Count=436, Geohash='c43k6uu1egxb')
Row(Count=436, Geohash='c43kcu3t702p')
Row(Count=434, Geohash='c41uhb4r5n00')
Row(Count=434, Geohash='c41ueb1jyypb')
Row(Count=432, Geohash='c41v48pupf00')
Row(Count=422, Geohash='c438x5esgf00')
Row(Count=421, Geohash='c43b05v7222p')
Row(Count=421, Geohash='c41v98n9w0xb')
Row(Count=417, Geohash='c438fqgmsm00')
Row(Count=417, Geohash='c439n53vsxzz')
Job running time: 0:11:51
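
The counts above say how often snow was recorded at a location, but not how many observations that location has in total. One possible refinement, sketched here as a query builder, is to rank locations by the share of their rows that are snowy. It assumes the flag column only takes the values 0 and 1, as its name suggests, so that its average is the snowy fraction. The helper name `snow_fraction_query` and the aliases are hypothetical.

```python
def snow_fraction_query(view, limit=10):
    """Build a Spark SQL query ranking geohashes by their share of snowy rows."""
    return (
        "SELECT Geohash, "
        "COUNT(*) AS observations, "
        "AVG(categorical_snow_yes1_no0_surface) AS snow_fraction "
        f"FROM {view} "
        "GROUP BY Geohash "
        "ORDER BY snow_fraction DESC, observations DESC "
        f"LIMIT {int(limit)}"
    )


query = snow_fraction_query("SNOWY_DF")
print(query)

# Hypothetical usage inside the notebook (assumes `spark` and the temp view exist):
# spark.sql(query).show(truncate=False)
```

Putting the `LIMIT` in the query also means only the top rows are sent back to the driver, instead of collecting every geohash and slicing the list afterwards.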


![](./images/snowy_location.png)<br>
* One of the snowy locations is Geohash='c43k6uu1egxb', which is near Juneau (the capital city of Alaska) and the Glacier Bay National Park and Preserve. In the Geohashes Google Map picture, the area is covered in white snow; it appears to be a mountain peak that is snowy most of the time. https://en.wikipedia.org/wiki/Juneau,_Alaska
* Job running time: 11 mins
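
To look up a location like this one without a map tool, a geohash can be decoded to approximate coordinates by hand. The sketch below follows the standard geohash scheme (base-32 characters, with bits alternating between longitude and latitude, starting with longitude). The function name `decode_geohash` is hypothetical and not part of the original notebook.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def decode_geohash(geohash):
    """Return the (latitude, longitude) at the center of a geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # the first bit refines longitude, then the bits alternate
    for ch in geohash.lower():
        value = _BASE32.index(ch)  # raises ValueError on an invalid character
        for shift in range(4, -1, -1):  # 5 bits per character, high bit first
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


# The snowiest geohash from the query output above
lat, lon = decode_geohash("c43k6uu1egxb")
print(f"latitude={lat:.4f}, longitude={lon:.4f}")
```

The printed coordinates can be pasted into any map service to find nearby towns.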