<h1>Homework</h1>

The objective is to create a regression model using machine learning. 

You can work with the data science salary dataset or pick a new one from [Kaggle](https://www.kaggle.com/datasets?tags=14203-Regression).

If you select a new dataset, be mindful of its "usability" score, as it indicates how much work you will have to put into preprocessing the data.

You can use the lecture notebooks as a guide, but you are free to use any methods and tools you like.

Don't forget that understanding the data is part of the process.

And, above all, try to enjoy the process and be kind to yourself when you get stuck.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
df = pd.read_csv('dataset/flight_data.csv', sep=',', index_col=0)  # Forward slash works on Windows, macOS, and Linux
print(df.head())  # Display the first few rows of the dataset
print(df.info())  # Get information about the dataset

    airline   flight source_city departure_time stops   arrival_time  \
0  SpiceJet  SG-8709       Delhi        Evening  zero          Night   
1  SpiceJet  SG-8157       Delhi  Early_Morning  zero        Morning   
2   AirAsia   I5-764       Delhi  Early_Morning  zero  Early_Morning   
3   Vistara   UK-995       Delhi        Morning  zero      Afternoon   
4   Vistara   UK-963       Delhi        Morning  zero        Morning   

  destination_city    class  duration  days_left  price  
0           Mumbai  Economy      2.17          1   5953  
1           Mumbai  Economy      2.33          1   5953  
2           Mumbai  Economy      2.17          1   5956  
3           Mumbai  Economy      2.25          1   5955  
4           Mumbai  Economy      2.33          1   5955  
<class 'pandas.core.frame.DataFrame'>
Int64Index: 300153 entries, 0 to 300152
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   airline

In [3]:
# Handle missing values (if any)
df.dropna(inplace=True)

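# Convert categorical variables into numerical using one-hot encoding.
# Caution: 'flight' looks like an identifier with many distinct values, and each distinct value becomes its own column
df = pd.get_dummies(df, columns=['airline', 'flight', 'source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class'])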

# Split the data into features (X) and target (y)
X = df.drop('price', axis=1)
y = df['price']

# Split the data into training and testing sets (e.g., 80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
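
Before one-hot encoding, it can help to check how many distinct values each categorical column has, since every distinct value becomes its own column. The following is a minimal sketch on a small stand-in frame built from the first rows shown above; the names `toy`, `cat_cols`, and `encoded` are illustrative and not part of the homework code.

```python
import pandas as pd

# Small stand-in for the flight data (a few values copied from the head() output)
toy = pd.DataFrame({
    'airline': ['SpiceJet', 'SpiceJet', 'AirAsia', 'Vistara', 'Vistara'],
    'flight': ['SG-8709', 'SG-8157', 'I5-764', 'UK-995', 'UK-963'],
    'class': ['Economy'] * 5,
    'price': [5953, 5953, 5956, 5955, 5955],
})

cat_cols = ['airline', 'flight', 'class']

# Distinct values per categorical column = dummy columns that column would produce
cardinality = toy[cat_cols].nunique().sort_values(ascending=False)
print(cardinality)

# Compare the number of columns before and after encoding
encoded = pd.get_dummies(toy, columns=cat_cols)
print(toy.shape[1], '->', encoded.shape[1])
```

A column whose distinct-value count is close to the number of rows behaves like an identifier and is usually a candidate for dropping or grouping rather than encoding.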


In [None]:
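# Example: Visualize the correlation matrix of the numeric columns.
# At this point df also contains the one-hot columns, which would make the heatmap unreadable,
# so the original numeric columns are selected explicitly.
correlation_matrix = df[['duration', 'days_left', 'price']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')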
plt.title('Correlation Matrix')
plt.show()


In [4]:
# Create a Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)


In [5]:
# Make predictions on the testing data
y_pred = model.predict(X_test)

# Calculate Mean Squared Error (MSE) and R-squared (R2) for evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'R-squared (R2): {r2:.2f}')


Mean Squared Error (MSE): 1194773577107981056.00
R-squared (R2): -2317778200.83
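
A strongly negative R2 means the model predicts far worse than always guessing the mean price. A plausible cause here (an assumption, not verified against this dataset) is the combination of a full set of perfectly collinear dummy columns with the very wide `flight` encoding, which can make ordinary least squares numerically unstable. The sketch below shows one possible remedy on synthetic data: drop the identifier-like column, pass `drop_first=True` to `get_dummies`, and report RMSE, which is in the same units as the price. All data and names in it (`demo`, `travel_class`, `demo_model`, and so on) are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the real flight dataset)
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    'flight': [f'XX-{i}' for i in range(n)],  # unique per row, like an ID
    'travel_class': rng.choice(['Economy', 'Business'], size=n),
    'stops': rng.choice(['zero', 'one'], size=n),
    'duration': rng.uniform(1, 10, size=n),
})
demo['price'] = (
    3000
    + 20000 * (demo['travel_class'] == 'Business')
    + 1500 * (demo['stops'] == 'one')
    + 400 * demo['duration']
    + rng.normal(0, 500, size=n)
)

# Drop the identifier-like column, then encode the rest with drop_first=True
features = demo.drop(columns=['flight', 'price'])
X_demo = pd.get_dummies(
    features, columns=['travel_class', 'stops'], drop_first=True, dtype=float
)
y_demo = demo['price']

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_model = LinearRegression().fit(X_tr, y_tr)
pred = demo_model.predict(X_te)

rmse = np.sqrt(mean_squared_error(y_te, pred))  # same units as price
r2_demo = r2_score(y_te, pred)
print(f'RMSE: {rmse:.2f}')
print(f'R-squared (R2): {r2_demo:.2f}')
```

Whether the same change fixes the score on the flight data has to be checked by rerunning the notebook; a regularized model such as `Ridge` is another option worth trying.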


In [6]:
# Example: Predict the price for a new data point
new_data_point = X_test.iloc[[0]]  # Double brackets keep a one-row DataFrame with the feature names
predicted_price = model.predict(new_data_point)
print(f'Predicted Price: {predicted_price[0]:.2f}')


Predicted Price: 6328.28
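
To predict for a genuinely new flight rather than a row of `X_test`, the raw record has to be encoded into the same columns the model was trained on. One way to do that, sketched here on a toy frame with invented values, is to encode the new row and then `reindex` it to the training columns, filling any missing dummy columns with 0. The names `train`, `X_toy`, `toy_model`, and `new_raw` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data (invented values, for illustration only)
train = pd.DataFrame({
    'airline': ['SpiceJet', 'AirAsia', 'Vistara', 'Vistara', 'SpiceJet', 'AirAsia'],
    'duration': [2.17, 2.33, 2.25, 2.50, 2.00, 2.40],
    'price': [5953, 5956, 5955, 6100, 5900, 5960],
})
X_toy = pd.get_dummies(train.drop(columns='price'), columns=['airline'], dtype=float)
toy_model = LinearRegression().fit(X_toy, train['price'])

# A new raw record: encode it, then align it to the training columns
new_raw = pd.DataFrame({'airline': ['Vistara'], 'duration': [2.30]})
new_encoded = pd.get_dummies(new_raw, columns=['airline'], dtype=float)
new_aligned = new_encoded.reindex(columns=X_toy.columns, fill_value=0)

predicted = toy_model.predict(new_aligned)
print(f'Predicted Price: {predicted[0]:.2f}')
```

Encoding the single row on its own only produces the dummy columns for the categories it contains, which is why the alignment step is needed.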


