# Introduction to Least Squares Fitting with Flight Data

In this notebook, we will explore the **least squares method** by applying it to a dataset containing information about flight prices and durations. Least squares is a fundamental technique in regression analysis that helps us find the best-fitting line through a set of data points by minimising the sum of the squared differences between observed values and the values predicted by the model.

### Objective
We aim to model the relationship between flight prices and durations. By applying the least squares method, we can estimate how flight duration affects prices and predict future prices based on the observed data.

### Dataset Overview
The dataset contains the following key features:
- **Flight Price**: The cost of a flight ticket in GBP.
- **Flight Duration**: The total time of the flight in hours.

By the end of this notebook, you will:
- Understand how the least squares method works.
- Visualise the relationship between flight prices and durations.
- Implement a simple linear regression model using the least squares approach.

Let’s begin by loading the data and exploring the relationship between these variables.

In [None]:
#Load relevant Python modules
import pandas as pd #for reading in and processing data
import numpy as np #for handling arrays, i.e. vector and matrix operations
import matplotlib.pyplot as plt #for plotting
import least_squares_utils as lsu #utility functions for least squares analysis and plotting

#Set style for plots
plt.style.use('ggplot')

In [None]:
#Load the data
flights_econ = "./basic_economy_fares.csv"
basic_economy_df = pd.read_csv(flights_econ)

#Inspect the first 10 rows
print(basic_economy_df.head(10))

In [None]:
# Create a figure and axis object (the plot style was set when the modules were loaded)
fig, ax = plt.subplots(figsize=(7.0, 5.5))

#Plot the data
lsu.plot_flights_scatter_data(basic_economy_df, ax)

plt.show()

### Finding the Line of Best Fit

To model the relationship between flight duration and price, we aim to find a line that best fits the data. This line will allow us to predict future flight prices based on the duration of the flight.

The equation of a line is typically written as:

\[
y = mx + c
\]

where:
- **m** represents the slope of the line, indicating how much the flight price (y) changes for each additional hour of flight duration (x).
- **c** is the y-intercept, the predicted price when the flight duration is zero. It can be read as the fixed cost per passenger.
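
In these terms, the least squares criterion described in the introduction can be written out explicitly. Assuming the dataset holds n flights, with duration x_i and observed price y_i for flight i (the symbols n, x_i, y_i and S are notation introduced here for illustration), the sum of squared differences for a given line is:

\[
S(m, c) = \sum_{i=1}^{n} \left( y_i - (m x_i + c) \right)^2
\]

Finding the line of best fit then means choosing m and c so that S is as small as possible.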

One way to approach this is to estimate the line by eye, guessing both the slope (m) and the intercept (c).

Take a moment to consider: What would your best estimate be for these parameters based on the data points? How well do you think your line would predict future prices?
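
One way to make this comparison concrete is to score each guess by its sum of squared differences between the observed prices and the prices the line predicts. The sketch below is illustrative only: the helper `sum_squared_error` and the sample values are hypothetical and are not taken from `least_squares_utils` or the flights dataset. How `lsu.plot_best_fit_line` computes its `total_error` is not shown in this notebook, so it may differ.

```python
import numpy as np

def sum_squared_error(x, y, slope, intercept):
    """Sum of squared differences between observed y and the line's predictions."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    predicted = slope * x + intercept  # y = mx + c for every data point
    residuals = y - predicted          # observed minus predicted
    return float(np.sum(residuals ** 2))

# Illustrative values only, not taken from the flights dataset
duration_hours = np.array([1.0, 2.0, 3.0, 4.0])
price_gbp = np.array([60.0, 85.0, 95.0, 130.0])

# Two candidate guesses: the smaller error indicates the better fit
error_a = sum_squared_error(duration_hours, price_gbp, slope=20.0, intercept=40.0)
error_b = sum_squared_error(duration_hours, price_gbp, slope=30.0, intercept=10.0)
print(f"Guess A: {error_a:.1f}, Guess B: {error_b:.1f}")
```

On these sample values, guess A gives a smaller error than guess B, so it would be the better of the two lines.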


In [None]:
# Create a figure and axis object
fig, ax = plt.subplots(figsize=(7.0, 5.5))

#Plot the flights data
lsu.plot_flights_scatter_data(basic_economy_df, ax)

# Guess the best fit parameters: replace each FIXME with a number before running this cell
intercept = FIXME
slope = FIXME

check_error = False

total_error = lsu.plot_best_fit_line(basic_economy_df, ax, slope=slope, intercept=intercept,
                                     error_check=check_error, show_error_on_plot=check_error)

plt.show()


In [None]:
# Create the initial scatter plot
fig, ax = plt.subplots(figsize=(10, 6))
lsu.plot_flights_scatter_data(basic_economy_df, ax)

# Plot the least squares line of best fit and print the equation on the plot
lsu.plot_least_squares_fit(basic_economy_df, ax)

# Show the final plot
plt.show()
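
The fitting itself happens inside `lsu.plot_least_squares_fit`, whose implementation is not shown in this notebook. As a minimal sketch of what a least squares fit for a straight line involves, the slope and intercept that minimise the sum of squared differences can be computed directly from the data: the slope is the sum of the products of the deviations of x and y from their means, divided by the sum of squared deviations of x, and the intercept follows from the means. The function name `least_squares_line` and the sample values are illustrative assumptions, not part of `least_squares_utils` or the flights dataset.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (slope, intercept) of the line minimising the sum of squared errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    intercept = y_mean - slope * x_mean
    return float(slope), float(intercept)

# Illustrative values only, not taken from the flights dataset
duration_hours = np.array([1.0, 2.0, 3.0, 4.0])
price_gbp = np.array([60.0, 85.0, 95.0, 130.0])

slope, intercept = least_squares_line(duration_hours, price_gbp)

# Cross-check with NumPy's polynomial fit (degree 1 returns [slope, intercept])
np_slope, np_intercept = np.polyfit(duration_hours, price_gbp, deg=1)

print(f"price = {slope:.2f} * duration + {intercept:.2f}")
```

For a single predictor such as flight duration, this closed-form calculation and `np.polyfit` with `deg=1` should agree up to floating-point rounding.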