<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/spark/pyspark_sql_weather.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# How to use Apache Spark with Python


This notebook shows how to use Spark SQL to run SQL queries against data in Apache Spark.

PySpark can be installed as a Python package with `pip install pyspark`, but that is not necessary in Google Colab because it is already included.

## Program Steps

1.  Download sample CSV data.
2.  Read the data into a Spark DataFrame.
3.  Create a Spark view so we can run SQL queries.
4.  Write and run some SQL queries.


# Sample Data

First, download the weather data with a shell command.

> Note: See the **Addendum** below if you want to see how this data was created.

In [8]:
!wget https://raw.githubusercontent.com/werowe/HypatiaAcademy/refs/heads/master/basics/combined_paphos2024.csv

--2024-12-26 06:35:22--  https://raw.githubusercontent.com/werowe/HypatiaAcademy/refs/heads/master/basics/combined_paphos2024.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647674 (1.6M) [text/plain]
Saving to: ‘combined_paphos2024.csv.2’


2024-12-26 06:35:22 (33.4 MB/s) - ‘combined_paphos2024.csv.2’ saved [1647674/1647674]

Note that `wget` does not overwrite an existing file by default. The `.2` suffix in the output above shows that earlier copies of the file were already present when this cell ran. The code below reads `combined_paphos2024.csv`, the copy without a suffix.



# Read the CSV file into a Spark DataFrame

`pip install pyspark` is not necessary here because Google Colab already has the PySpark package installed.

Here we create a Spark DataFrame by reading the data we downloaded to the local drive of the Colab virtual machine.



In [9]:

from pyspark.sql import SparkSession



# Initialize SparkSession
spark = SparkSession.builder \
    .appName("weather") \
    .getOrCreate()

df = spark.read.csv(
    "combined_paphos2024.csv",
    header=True,        # Use the first row as column names
    inferSchema=True,   # Automatically infer data types
    sep=",",            # Specify delimiter (default is ',')
    encoding="UTF-8"    # Handle encoding
)

Then we show the columns so we can see what data we have downloaded.



In [10]:
df.columns

['name',
 'datetime',
 'temp',
 'feelslike',
 'dew',
 'humidity',
 'precip',
 'precipprob',
 'preciptype',
 'snow',
 'snowdepth',
 'windgust',
 'windspeed',
 'winddir',
 'sealevelpressure',
 'cloudcover',
 'visibility',
 'solarradiation',
 'solarenergy',
 'uvindex',
 'severerisk',
 'conditions',
 'icon',
 'stations']

Here we create a temporary view in Apache Spark so that we can run queries against it.

In [11]:

df.createOrReplaceTempView("weather")

# Write SQL Queries

Now it is just a matter of writing SQL queries and then running them.

In [12]:
sql = '''
SELECT SUM(precip) AS total_precip, MONTH(datetime) AS month
FROM weather
GROUP BY MONTH(datetime)

'''

result = spark.sql(sql)
result.show()

+------------------+-----+
|      total_precip|month|
+------------------+-----+
|205.63699999999983|   12|
|1.6430000000000005|    9|
|             0.008|    8|
|               0.0|    7|
|1.5000000000000002|   10|
|166.32799999999978|   11|
+------------------+-----+



In [13]:
sql = '''
select max(temp) from weather  WHERE MONTH(datetime) = 8
'''

result = spark.sql(sql)
result.show()

+---------+
|max(temp)|
+---------+
|     33.1|
+---------+



In [14]:
sql = '''
SELECT max(temp) AS max_temp, MONTH(datetime) AS month
FROM weather
GROUP BY MONTH(datetime)
'''

result = spark.sql(sql)
result.show()

+--------+-----+
|max_temp|month|
+--------+-----+
|    22.6|   12|
|    32.1|    9|
|    33.1|    8|
|    34.6|    7|
|    28.8|   10|
|    27.0|   11|
+--------+-----+



# Addendum: How this Data was created

This data was downloaded using the Visual Crossing weather API: https://www.visualcrossing.com/weather/weather-data-services#

The code to do that is [here](https://github.com/werowe/HypatiaAcademy/blob/master/stats/paphos_daily_weather_csv.ipynb). The code to combine multiple days of data into one CSV file is [here](https://github.com/werowe/HypatiaAcademy/blob/master/stats/consolidate.ipynb).


# Further Reading

Here's an example of [how to do logistic regression using PySpark](https://github.com/werowe/HypatiaAcademy/blob/841c6dbd22daa9dba58fc629cfb3b5135b837cd6/spark/sparkLR.ipynb#L4).