<a href="https://colab.research.google.com/github/rendzina/BigDataAndVisualisation/blob/main/Colab/Weekly_Fuel_PricesExample_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Weekly Fuel Price Analysis

*MKU: Big Data and Visualisation*

## Unit 4 Problem Definition and Suggested Solutions
Google Colab Notebook

# Useful links
## Data source
https://www.data.gov.uk/dataset/21db6396-3daf-4d90-8b3f-054995256018/petrol-and-diesel-prices

https://assets.publishing.service.gov.uk/media/66422e51b7249a4c6e9d3345/weekly_fuel_prices_130524.xlsx

## Programming
https://colab.research.google.com/github/datacamp/data-cleaning-with-pyspark-live-training/blob/master/notebooks/Cleaning_Data_with_PySpark.ipynb#scrollTo=2NRGmdeqa2L3

https://sparkbyexamples.com/pyspark/pyspark-split-dataframe-column-into-multiple-columns/

The `%%capture` cell magic below suppresses the output of the installation.

In [10]:
%%capture
%pip install mount-azure-blob==0.0.3

In [11]:
# Connection details are given in class
from mount_azure_blob import mount_storage
mount_storage(mount_path="bdv-2024-05-09t15-59-02-855z", config_file=None)

VBox(children=(HBox(children=(Text(value='', description='accountName'), Text(value='', description='accountKe…

In [12]:
# Install Pyspark
!pip install pyspark



In [13]:
# Create a PySpark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark

In [19]:
# Get the data. The file is an Excel spreadsheet (a change from loading CSV).
# We want the worksheet 'Data' and skip the header text lines above the table.
from datetime import datetime, date
import pandas as pd
pandas_df = pd.read_excel('/content/bdv-2024-05-09t15-59-02-855z/HdiSamples/weekly_fuel_prices_130524.xlsx', sheet_name='Data', skiprows=7)
# Note: drop() returns a new DataFrame rather than changing pandas_df in place, so this call has no lasting effect unless its result is assigned
pandas_df.drop([0, 7])
print('\nCheck that pandas_df is a pandas dataframe: ',isinstance(pandas_df, pd.DataFrame))


Check that pandas_df is a pandas dataframe:  True


In [20]:
# Convert the pandas dataframe to a Spark dataframe
#pd.DataFrame.iteritems = pd.DataFrame.items # see https://stackoverflow.com/questions/75926636/databricks-issue-while-creating-spark-data-frame-from-pandas

# pull the pandas df over into Spark as a Spark df (they are not the same)
spark_df = spark.createDataFrame(pandas_df)
spark_df

Traceback (most recent call last):
  File "/content/spark-3.1.1-bin-hadoop3.2/python/pyspark/serializers.py", line 437, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content/spark-3.1.1-bin-hadoop3.2/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/content/spark-3.1.1-bin-hadoop3.2/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/content/spark-3.1.1-bin-hadoop3.2/python/pyspark/cloudpickle/cloudpickle_fast.py", line 630, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content/spark-3.1.1-bin-hadoop3.2/python/pyspark/cloudpickle/cloudpickle_fast.py", line 503, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content/spark-3.1.1-bin-hadoop3.2/python/

PicklingError: Could not serialize object: IndexError: tuple index out of range

In [None]:
# Inspect the result - to view a Spark df we use the show() function (different from Pandas dfs)
spark_df.show()

# Problem 1
## Plot average price by year for both fuel types

In [None]:
# Time series prediction - we start with a forecast of fuel prices
# Use Facebook's Prophet library (it's built into Colab so doesn't need to be installed) - see https://facebook.github.io/prophet/docs/quick_start.html
from prophet import Prophet

# Take a copy of the two columns we need so that renaming them does not touch pandas_df
costs = pandas_df[['Date', ' ULSP:  Pump price (p/litre)']].copy()
costs.columns = ["ds", "y"] # Prophet requires the columns to be named 'ds' (date) and 'y' (value)

model = Prophet()
model.fit(costs)



In [None]:
# Time series cont.
# Import Matplotlib, needed for plt.show() below
import matplotlib.pyplot as plt

# Create a data frame for predictions: the historical dates plus 100 further daily periods (13/5/2024 - 21/8/2024 if the data ends on 13/5/2024)
future = model.make_future_dataframe(periods=100)

# Make sure the column is named 'ds', as Prophet expects
future.columns = ['ds']
#future.tail()
# Predict over the historical dates (in-sample) and the future dates (forecast)
prediction = model.predict(future)

# Plot
fig = model.plot(prediction, figsize=(10,5))
ax = fig.gca()
ax.set_title("ULSP Forecast", size=20)
ax.set_xlabel("Date", size=18)
ax.set_ylabel("Price", size=18)
ax.tick_params(axis='y', labelsize=15)
ax.tick_params(axis='x', rotation=45, labelsize=15)
ax.set_xlim(pd.to_datetime(['2003-06-09', '2024-08-21']))
plt.show()
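
The end of the forecast window follows from simple date arithmetic. The sketch below assumes the last observation in the dataset is 13 May 2024 (taken from the file name, not checked against the data) and that `periods=100` is counted in days, which is Prophet's default frequency.

```python
from datetime import date, timedelta

last_observation = date(2024, 5, 13)  # assumed last date in the dataset
horizon_days = 100                    # mirrors periods=100 at daily frequency

# The forecast ends this many days after the last observation
forecast_end = last_observation + timedelta(days=horizon_days)
print(forecast_end.isoformat())
```

Because the source data is weekly, a weekly frequency might be a closer match for the forecast steps; that choice is left open here.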

In [None]:
# Now back to the original challenge. To produce average values by year we first need to identify the year for each data row.
# To do this we split out just the year into a new column for subsequent grouping.


In [None]:
from pyspark.sql.functions import year
spark_df2 = spark_df.withColumn('Year', year(spark_df['Date']))
spark_df2.printSchema() # Now see how the schema looks
# Note as the Spark df is 'immutable', we change it by making a new df - thus spark_df goes to spark_df2 and so on.

In [None]:
spark_df3 = spark_df2.drop(spark_df2.Date) # Now we can drop the date column, but again as Spark dfs are immutable, we need to create another
spark_df3.printSchema() # and see how the schema looks

In [None]:
# Here's an alternative approach we could have used for splitting up the date
# Use PySpark's split; to get this we first need to load the sql functions
#from pyspark.sql import functions as F
#spark_df2 = spark_df.withColumn('Year', F.split(spark_df['Date'], '-').getItem(0)) \
#       .withColumn('Month', F.split(spark_df['Date'], '-').getItem(1)) \
#       .withColumn('Day', F.split(spark_df['Date'], '-').getItem(2))
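
The split approach treats the date as text and cuts it at each hyphen. A plain-Python sketch of the same idea (not PySpark; `split_date` is a hypothetical helper) shows two things worth knowing: the pieces come back as strings rather than integers, and if the text also carries a time, that time stays attached to the last piece.

```python
def split_date(text):
    """Split a 'YYYY-MM-DD...' string at the hyphens into three string parts."""
    year, month, day = text.split("-")
    return year, month, day

# A plain date splits cleanly into year, month and day strings
print(split_date("2024-05-13"))

# With a timestamp string, the time portion remains in the 'day' part
print(split_date("2024-05-13 00:00:00"))
```

This is one reason the `year()` function used earlier is the simpler route when only the year is needed.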

In [None]:
# Now we can do the grouping by year before we plot it out
# Note the way the commands can be chained together with the dot (.) separator.
# Note finally that toPandas() converts the Spark df to a pandas df - for the graphing
pd_df = spark_df3.groupby('Year').avg().sort('Year', ascending=[True]).toPandas()
pd_df.set_index(['Year'],drop=True, inplace=True) # an index is needed for the x axis in plots

#from IPython.display import display # print out the result if we want to see the data table
#display(pd_df)
pd_df.info() # print schema to check col names
print('\nCheck that pd_df is a pandas dataframe: ',isinstance(pd_df, pd.DataFrame)) # check it's a pandas df
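
For comparison, the same sequence of steps (derive the year, drop the date, group and average) can be sketched in pandas alone. The sample rows and the short column names `ULSP` and `ULSD` are made up for illustration; they are not values from the spreadsheet.

```python
import pandas as pd

# Hypothetical sample rows, standing in for the real worksheet data
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-03", "2022-01-10", "2023-01-02", "2023-01-09"]),
    "ULSP": [145.0, 147.0, 150.0, 154.0],
    "ULSD": [149.0, 151.0, 172.0, 170.0],
})

# Derive the year, drop the date, then average every remaining column per year
yearly = (
    sample.assign(Year=sample["Date"].dt.year)
          .drop(columns="Date")
          .groupby("Year")
          .mean()
)
print(yearly)
```

Here `groupby("Year")` already leaves `Year` as the index, so no separate `set_index` step is needed, and the averaged columns keep their original names instead of gaining an `avg(...)` wrapper.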

In [None]:
# We will now plot the data - first as a bar chart
# Note here we are using the '.plot()' member function of Pandas to achieve the plot. This is part of Pandas, but uses Matplotlib in the background.
import matplotlib.pyplot as plt
pd_df[['avg( ULSP:  Pump price (p/litre))', 'avg(ULSD: Pump price (p/litre))']].plot(kind="bar", stacked=False, width=0.6, figsize=(16, 5))
plt.title('Average fuel price by year')
plt.xlabel('Year')
plt.ylabel('Fuel Pump Price (p/Litre)')

In [None]:
# Using the same approach, we will do a line plot of the same data
import matplotlib.pyplot as plt
pd_df[['avg( ULSP:  Pump price (p/litre))', 'avg(ULSD: Pump price (p/litre))']].plot(kind="line", figsize=(16, 5))
plt.title('Average fuel price by year')
plt.xlabel('Year')
plt.ylabel('Fuel Pump Price (p/Litre)')

# Take a moment at this point to explore the Gemini AI feature to 'explain the code' - this can help your learning! Select the small star icon to the right.

In [None]:
# Here's an alternative bar chart form using Matplotlib axes directly - the approach above seems simpler - both plots are the same though!
import matplotlib.pyplot as plt
fig = plt.figure()
fig.set_figwidth(16)
axes = plt.axes()
width = 0.3
plt.title('Average fuel price by year')
plt.xlabel('Year')
plt.ylabel('Fuel Pump Price (p/Litre)')
axes.xaxis.set_tick_params(rotation=90)
pd_df.plot(color='steelblue', y='avg( ULSP:  Pump price (p/litre))', width=width, position=1, legend=True, kind='bar', ax=axes)
pd_df.plot(color='darkorange', y='avg(ULSD: Pump price (p/litre))', width=width, position=0, legend=True, kind='bar', ax=axes)
plt.show() # for colours, see https://matplotlib.org/2.0.2/examples/color/named_colors.html

# Problem 2
## Price variation by year

In [None]:
# Distribution of fuel prices in each year
pd_df_box = spark_df3.sort("Year", ascending=[True]).toPandas()
spark_df3.printSchema()

In [None]:
axes = pd_df_box.boxplot(figsize = (15,20), fontsize= '10', grid = True, by = 'Year', column = [' ULSP:  Pump price (p/litre)','ULSD: Pump price (p/litre)'], layout=(2, 1))
#plt.title('Variation in ULSP fuel prices by year (2003-2024)')
plt.xlabel('Year') # set up the horizontal 'x' axis label
plt.ylabel('Pump price (p/litre)') # set up the vertical 'y' axis label
plt.show()
# Note, the boxes in this 'box and whisker' plot extend from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). Outliers are plotted as separate dots.
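
To make the quartile description concrete, the sketch below computes Q1, the median and Q3 for a small list of made-up prices, then flags outliers using the 1.5 x IQR rule that Matplotlib applies by default for its whiskers. The numbers are hypothetical and chosen so that one value stands out.

```python
import statistics

# Hypothetical weekly prices (p/litre) for one year, not real data
prices = [120, 122, 125, 127, 130, 131, 133, 135, 190]

# method="inclusive" interpolates linearly between the data points
q1, q2, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Values further than 1.5 * IQR from the box are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print(q1, q2, q3, outliers)
```

The whiskers themselves stop at the most extreme data points that still lie inside the two fences.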

# Problem 3
## Additional visualisations
A final challenge, made all the easier by Google Colab, is to see some other potential visualisations of the data.

In Colab, if you present a pandas dataframe with the 'display' command as below, Colab places an option below the table offering auto-generation of a wide range of graph types. Selecting any of these graphs provides a further option to show the generated code that produces that graph - very helpful! Click on 'View recommended plots' below the table, then select a graph to see its source code.

In [None]:
from IPython.display import display # print out the result if we want to see the data table
display(pd_df)