# CMSC320 Final Project


Introduction: In this project, we compare gas prices with public transportation usage. We have two datasets: one contains only gas prices, and the other covers many forms of transportation usage and costs.

**Step 1: Identifying NaNs**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import ttest_rel
from sklearn.impute import KNNImputer
import numpy as np
from scipy import stats
from scipy.stats import mannwhitneyu

transportation_df = pd.read_csv('Monthly_Transportation_Statistics.csv')
gasprices_df = pd.read_csv('USGasanddieselprices.csv')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)

#Selecting the desired columns
transportation_df = transportation_df[['Index', 'Date', 'U.S. Airline Traffic - Total - Seasonally Adjusted', 'Transit Ridership - Other Transit Modes - Adjusted',
'Transit Ridership - Fixed Route Bus - Adjusted', 'Transit Ridership - Urban Rail - Adjusted', 'Highway Fuel Price - Regular Gasoline', 'Passenger Rail Passengers',
'Passenger Rail Total Train Miles', 'Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted']]
gasprices_df = gasprices_df[['Date', 'A1']]

#Converting to Datetime object
transportation_df['Date'] = pd.to_datetime(transportation_df['Date'])
gasprices_df['Date'] = pd.to_datetime(gasprices_df['Date'])
transportation_df = transportation_df.loc[transportation_df['Date'] >= '01-01-1975']
display(transportation_df.shape)

#Observing the number of NaNs
total_nan = 0
for curr in range(0, len(transportation_df.columns)):
    print("Column: ", transportation_df.columns[curr] , "----- Number of NaN: ", transportation_df.iloc[:, curr].isnull().sum())
    total_nan += transportation_df.iloc[:, curr].isnull().sum()

print('Total NaN: ', total_nan)
print('Gas Prices Missing:  ', gasprices_df.loc[:, 'A1'].isnull().sum())

#Pie chart to show the distribution among the transportation
pie_df = transportation_df[['U.S. Airline Traffic - Total - Seasonally Adjusted', 'Transit Ridership - Other Transit Modes - Adjusted',
'Transit Ridership - Fixed Route Bus - Adjusted', 'Transit Ridership - Urban Rail - Adjusted', 'Highway Fuel Price - Regular Gasoline', 'Passenger Rail Passengers',
'Passenger Rail Total Train Miles', 'Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted']]
colors = sns.color_palette('bright')[0:10]
plt.pie(x = pie_df.isna().sum(), labels = pie_df.isna().sum().index, colors = colors, autopct='%.0f%%')
plt.tight_layout()
plt.show()

*The two overlapping partitions represent Passenger Rail Total Train Miles and Passenger Rail Passengers. Each is 1% of the pie chart.*

Conclusion: We have two datasets: one for transportation and one specifically for gas prices. As the chart above shows, a lot of data is missing in the transportation dataframe, especially for airline traffic and personal spending on transportation. Looking at the dataframes themselves, the missing values in each column fall in contiguous date ranges rather than being scattered: past a certain date, a series has no gaps. Because the missingness follows a clear time frame, we treat the data as missing not at random (MNAR). In fact, none of the chosen columns has any data before 1975. The gas dataframe, on the other hand, has no missing prices, since for simplicity we use only the "All Grades All Formulations Gasoline Prices" column. The overrepresented features are Passenger Rail Passengers, Passenger Rail Total Train Miles, and gas prices, because the other features are missing large portions of their data in this time frame. In addition, Highway Fuel Price and Personal Spending on Transportation may be correlated.
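The claim that missing values sit in date blocks, rather than being scattered, can be checked per column. The following is a minimal sketch on hypothetical toy data (the column names `series_a` and `series_b` and the helper `coverage_window` are illustrative assumptions, not part of the project): it reports the first and last observed date for each column and how many NaNs fall inside that window.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two monthly series that report over different date ranges.
dates = pd.date_range("2000-01-01", periods=8, freq="MS")
toy = pd.DataFrame(
    {
        "series_a": [np.nan, np.nan, np.nan, 1.0, 2.0, 3.0, 4.0, 5.0],
        "series_b": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, np.nan, np.nan],
    },
    index=dates,
)


def coverage_window(frame: pd.DataFrame) -> pd.DataFrame:
    """Per column: first/last observed date, total NaNs, and NaNs inside that window."""
    rows = {}
    for name in frame.columns:
        first = frame[name].first_valid_index()
        last = frame[name].last_valid_index()
        inside = frame.loc[first:last, name]
        rows[name] = {
            "first_valid": first,
            "last_valid": last,
            "nan_total": int(frame[name].isna().sum()),
            "nan_inside_window": int(inside.isna().sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


summary = coverage_window(toy)
print(summary)
```

When `nan_inside_window` is 0 for a column, every missing value lies before its first observation or after its last one, which is the block pattern described above.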

We would like to explore whether there is a correlation between gas prices and transit ridership. In this checkpoint, the important attributes are all forms of transit ridership and gas prices.

**Step 2: Removing NaNs**

In [None]:
gasprices_df.rename(columns={"A1": "Weekly Retail Gasoline Prices(All Grades Formulation)"}, inplace=True)
gasprices_df['Date'] = pd.to_datetime(gasprices_df['Date'])
display(gasprices_df.head())

The only column used from the gas prices CSV file was renamed from "A1" to "Weekly Retail Gasoline Prices(All Grades Formulation)" to indicate what its values represent.
The new name tells anyone viewing the gas prices data what the column contains, and it also makes clear what is being predicted during the ML model phase
of this project.

In [None]:
# List only the data columns that can be NaN (exclude 'Index' and 'Date')
cols = [
    'U.S. Airline Traffic - Total - Seasonally Adjusted',
    'Transit Ridership - Other Transit Modes - Adjusted',
    'Transit Ridership - Fixed Route Bus - Adjusted',
    'Transit Ridership - Urban Rail - Adjusted',
    'Highway Fuel Price - Regular Gasoline',
    'Passenger Rail Passengers',
    'Passenger Rail Total Train Miles',
    'Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted'
]
print(transportation_df[cols].isna().sum())
print(transportation_df.shape)
print("\n")
# Drop rows where all specified columns are NaN
transportation_df = transportation_df.dropna(subset=cols, how='all')
print(transportation_df[cols].isna().sum())
print(transportation_df.shape)

Looking at the three transit columns, the data only exists from 2002 to the end of 2022, which causes many NaNs. To handle this, we create a separate dataframe with just these three columns.

In [None]:
#Checking for NaNs
transit_df = transportation_df[[ 'Date','Transit Ridership - Other Transit Modes - Adjusted',
    'Transit Ridership - Fixed Route Bus - Adjusted',
    'Transit Ridership - Urban Rail - Adjusted']]

transit_df = transit_df.loc[ '2002-01-01' <= transit_df['Date'], :]
transit_df = transit_df.loc[ transit_df['Date'] <= '2022-12-01', :]
print(transit_df.isnull().sum())
#display(transit_df.head())

In the earlier cell, the dates for which all eight data columns are simultaneously missing were dropped, since those rows give us no information.
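As a small illustration of that row-dropping rule, here is a sketch on hypothetical toy data (the names `col_a`, `col_b` and `data_cols` are assumptions made for the example). With `how="all"`, a row is removed only when every listed column is NaN, so partially filled rows survive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the first row is missing in every data column.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2001-01-01", "2001-02-01", "2001-03-01"]),
        "col_a": [np.nan, 1.0, np.nan],
        "col_b": [np.nan, np.nan, 2.0],
    }
)
data_cols = ["col_a", "col_b"]

# Drop a row only if ALL of the listed columns are NaN in that row.
kept = toy.dropna(subset=data_cols, how="all")
print(kept)
```

The two remaining rows still contain one NaN each, which is why a separate imputation step is needed afterwards.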

In [None]:
rail_cols = ['Passenger Rail Passengers', 'Passenger Rail Total Train Miles']
# Fill NaN values in the rail columns with forward and backward fill
transportation_df[rail_cols] = (transportation_df[rail_cols].ffill().bfill())
print(transportation_df[rail_cols].isna().sum())

Since only 8 values are missing, we impute them from neighboring values (forward fill, then backward fill) for the two rail fields above.
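To make the behavior of this fill concrete, here is a minimal sketch on a hypothetical series. Note that `ffill().bfill()` carries the last observed value forward first and only uses the next observed value for gaps at the very start, so it is not strictly a "closest value" rule.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, an interior gap, and a trailing gap.
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

# Forward fill first (previous value wins), then backward fill the leading gap.
filled = s.ffill().bfill()
print(filled.tolist())
```

In this toy example the value at position 3 sits right next to 40.0 but is filled with 10.0, because the forward fill runs first.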

In [None]:
#Created a dataframe within 1975 till 1990-08-01. Outside of this time period, there are many columns where there are NaNs.
col = 'Highway Fuel Price - Regular Gasoline'
start, end = '1975-01-01', '1990-08-01'

c_hwy = (transportation_df['Date'] >= start) & (transportation_df['Date'] <= end)
df_pre1990 = transportation_df.loc[c_hwy].copy()

#Dropping all the columns where every single entry is a NaN.
df_pre1990 = df_pre1990.dropna(axis=1, how='all')
print(df_pre1990.shape)
print("Remaining columns:", list(df_pre1990.columns))
print(df_pre1990.isna().sum())

Since most of the columns are entirely NaN from 1975 to 1990, we create a separate dataframe (df_pre1990) for that period. Columns where all values are NaN are dropped, and the remaining NaN counts are printed.

In [None]:
#Creating a dataframe between 1990 and 2001
cols = [
  'Transit Ridership - Other Transit Modes - Adjusted',
  'Transit Ridership - Fixed Route Bus - Adjusted',
  'Transit Ridership - Urban Rail - Adjusted'
]

#Using the information found, creating a dataframe between 1990 and 2001 was beneficial.
start, end = '1990-09-01', '2001-12-01'
mask = (transportation_df['Date'] >= start) & (transportation_df['Date'] <= end)
df_pre2002 = transportation_df.loc[mask].copy()
df_pre2002 = df_pre2002.dropna(axis=1, how='all')
print(df_pre2002.shape)
print("Remaining columns:", list(df_pre2002.columns))
df_pre2002 = df_pre2002.ffill().bfill()
print(df_pre2002.isna().sum())

All three transit-ridership columns in transportation_df share the same missing-value dates. In other words, none of these series exists before a common "start date." Moreover, the NaNs are grouped from 1990-01-01 to 2001-12-01 and then from 2023-01-01 to 2023-07-01. Thus, we make a separate dataframe to analyze the data in this period. All the remaining NaNs are imputed from neighboring values (forward fill, then backward fill).

In [None]:
#Creating a new dataframe from 2002 to 2016
start_mid, end_mid = '2002-01-01', '2016-12-01'
mask_mid = (transportation_df['Date'] >= start_mid) & (transportation_df['Date'] <= end_mid)
df_2002_2016 = transportation_df.loc[mask_mid].copy()
df_2002_2016 = df_2002_2016.dropna(axis=1, how='all')

#Displaying results
print("Shape:", df_2002_2016.shape)
print("Columns remaining:", df_2002_2016.columns.tolist())
print(df_2002_2016.isna().sum())

In [None]:
col = 'U.S. Airline Traffic - Total - Seasonally Adjusted'
missing_air_dates = (transportation_df.loc[transportation_df[col].isna(), 'Date'].sort_values().unique())

#Creating a dataframe after 2016
cutoff = '2016-12-01'

df_post2016 = transportation_df.loc[transportation_df['Date'] > cutoff].copy()
#Dropped the columns where every value is a NaN
df_post2016 = df_post2016.dropna(axis=1, how='all')
df_post2016 = df_post2016.ffill().bfill()

print(df_post2016.shape)
print("Columns remaining:", df_post2016.columns.tolist())
#display(df_post2016)
#display(df_2002_2016)
#display(df_pre2002)
#display(df_pre1990)

**Step 3: Exploring Outliers**

Here we explore how different subsets of the data have or don't have outliers.

In [None]:
df_list = [df_pre1990, df_pre2002, df_2002_2016, df_post2016]
df_names = ["Pre 1990", "Pre 2002", "2002 to 2016", "Post 2016"]
updated_dfs = []
#Looping through lists containing all the dataframes and creating a whisker plot for each column of each dataframe
for df_name, df in zip(df_names, df_list):

  columns_plot = [col for col in df.columns[2:] if isinstance(col, str) and not col.startswith("Outliers")]

  fig, axes = plt.subplots(nrows=len(columns_plot), ncols=1, figsize=(6,3 * len(columns_plot)))
  if len(columns_plot)==1:
    axes = [axes]

  for ax, column in zip(axes,columns_plot):
    df[column].plot.box(ax=ax)
    ax.set_title(f"{df_name}: Whisker Plot for {column}")
    ax.ticklabel_format(style="plain", axis="y")
    #Format the tick value directly; casting to int first would truncate decimals (e.g. fuel prices)
    ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _:  f"{x:,.2f}"))

    #Added Isolation Forest for Outlier Detection
    iso_forest = IsolationForest(contamination=0.1, random_state=43)
    predictions = iso_forest.fit_predict(df[[column]])

    #Created a outlier flagging column for each respective column of each dataframe.
    flag_title = f"Outliers Flagged {column}"
    df[flag_title] = (predictions == -1).astype(int)
    columns = list(df.columns)
    if flag_title in columns:
      columns.remove(flag_title)
      index = columns.index(column)
      columns.insert(index + 1, flag_title)
      df = df[columns]

  plt.tight_layout()
  plt.show()

  print(f"\n{df_name}:")
  display(df)
  updated_dfs.append(df)

#Updated the dataframes to include the flagging column
df_pre1990, df_pre2002, df_2002_2016, df_post2016 = updated_dfs

We identified the outliers for each column of each dataframe and flagged them using IsolationForest. The flags are stored in separate flagging columns, each placed immediately to the right of the column it corresponds to. Outliers are marked with a one, while non-outliers keep the default zero.

We flag outliers so that we know which values are extreme within each era. When we eventually train our machine learning model, it can account for whether a value was an outlier in its era. The model can then rank the outliers by how extreme they are across all eras for that column, which lets us determine which time period, or even which year, was an anomaly compared to the others. Judging anomalies on a global level by their extremity is crucial for the model, because our primary objective is to predict how gas prices affect transportation and rail usage. Knowing whether an extreme value is an anomaly or part of a trend helps the model predict the future effect of gas prices on transportation and rail usage more accurately.
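One point worth keeping in mind: `IsolationForest(contamination=0.1)` flags roughly 10% of the rows in every column, whether or not the box plot shows any points beyond the whiskers. As a cross-check, the sketch below uses a different technique, the 1.5 × IQR rule that box-plot whiskers are based on, to build the same kind of 0/1 flag column. The toy data and the helper name `iqr_outlier_flags` are assumptions made for illustration; this is not the method used in the cell above.

```python
import pandas as pd


def iqr_outlier_flags(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return 1 for values outside [Q1 - k*IQR, Q3 + k*IQR], else 0."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return ((values < lower) | (values > upper)).astype(int)


# Hypothetical ridership values with one obvious spike.
toy = pd.DataFrame({"ridership": [100, 102, 98, 101, 99, 103, 97, 250]})
toy["Outliers Flagged ridership"] = iqr_outlier_flags(toy["ridership"])
print(toy)
```

Comparing the two kinds of flags side by side can help separate values that are genuinely extreme from values that were flagged only because the contamination rate requires a fixed share of outliers.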

**Step 4: Hypothesis Testing**

*Test 1: Mann-Whitney U Test*

At Significance level of α = 0.05

Null Hypothesis: There is no difference between the amount of rail passengers and the amount of bus passengers.


Alternative Hypothesis: There are significantly more bus passengers than rail passengers.

In [None]:
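#One-sided test (bus > rail) to match the alternative hypothesis; column 2 is Fixed Route Bus, column 3 is Urban Rail.
u_stat, pval = mannwhitneyu(transit_df.iloc[:, 2], transit_df.iloc[:, 3], alternative='greater')
print("P-value for Mann-Whitney U: ", pval)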

Conclusion: Since the p-value is less than 0.05, we reject the null hypothesis. Significantly more passengers use buses than rail. Rail may be more expensive and less accessible, which suggests that gas prices will not have the same effect on rail and bus ridership.
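Because the alternative hypothesis is directional (more bus passengers than rail passengers), a one-sided Mann-Whitney U test is the natural match. The sketch below shows the call on hypothetical synthetic samples (the arrays `bus` and `rail` are made up for illustration and are not the project data), passing 1-D arrays so that the p-value comes back as a single number.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical monthly ridership samples; bus is centered well above rail.
rng = np.random.default_rng(0)
bus = rng.normal(loc=500.0, scale=20.0, size=60)
rail = rng.normal(loc=300.0, scale=20.0, size=60)

# alternative="greater" tests whether values in the first sample tend to be larger.
u_stat, p_value = mannwhitneyu(bus, rail, alternative="greater")
print(f"U = {u_stat:.1f}, one-sided p = {p_value:.3g}")
```

Swapping the argument order while keeping `alternative="greater"` would instead test whether rail ridership tends to exceed bus ridership.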

In [None]:
plt.figure(figsize=(15,7))
plt.plot(transit_df['Date'], transit_df['Transit Ridership - Urban Rail - Adjusted'], label = 'Rail')
plt.plot(transit_df['Date'], transit_df['Transit Ridership - Fixed Route Bus - Adjusted'], label = 'Bus')
plt.legend()
plt.xlabel('Date')
plt.ylabel('Ridership')
plt.title('Line Plot of Rail and Bus Ridership')
plt.show()

This graph shows that passengers predominantly use buses over rail, which suggests that bus ridership matters more than rail ridership when considering the impact of gas prices.

*Test 2: ANOVA Test*

At a confidence level of 95% (significance level α = 0.05)

Null Hypothesis: There is no difference in Transit ridership when the gas prices are low/medium/high.


Alternative Hypothesis: There is a difference in transit ridership when the gas prices are low/medium/high.

In [None]:
#Creating fresh dataframes and setting up the dataframe from the CSV file.
gasprices_b_df = pd.read_csv('USGasanddieselprices.csv')
transportation_b_df = pd.read_csv('Monthly_Transportation_Statistics.csv')

#Selecting desirable columns from the freshly loaded dataframes (.copy() avoids SettingWithCopyWarning later).
gasprices_b_df = gasprices_b_df[['Date','A1']].copy()
transportation_b_df = transportation_b_df[['Index', 'Date', 'U.S. Airline Traffic - Total - Seasonally Adjusted', 'Transit Ridership - Other Transit Modes - Adjusted',
'Transit Ridership - Fixed Route Bus - Adjusted', 'Transit Ridership - Urban Rail - Adjusted', 'Highway Fuel Price - Regular Gasoline', 'Passenger Rail Passengers',
'Passenger Rail Total Train Miles', 'Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted']].copy()

#Converting to Datetime object.
gasprices_b_df['Date'] = pd.to_datetime(gasprices_b_df['Date'])
transportation_b_df['Date'] = pd.to_datetime(transportation_b_df['Date'])

#Limiting the range of the data to align the dataframes.
gasprices_b_df = gasprices_b_df[gasprices_b_df['Date'] >= '2002-01-01']
gasprices_b_df = gasprices_b_df[gasprices_b_df['Date'] <= '2021-01-04']
transportation_b_df = transportation_b_df.loc[transportation_b_df['Date'] >= '2002-01-01']
transportation_b_df = transportation_b_df.loc[transportation_b_df['Date'] <= '2021-01-01']

The "Transit Ridership - Urban Rail - Adjusted" column from transportation_b_df is dropped because it has less impact on transit ridership as a whole. This is supported by the Mann-Whitney U test, which rejected the null hypothesis of no difference between rail and bus ridership.

In [35]:
#Aggregating the gas prices dataframe from weekly to monthly (mean price per month).
gasprices_month_df = gasprices_b_df.groupby(pd.Grouper( key = 'Date', freq = 'ME')).agg({'A1': 'mean'}).reset_index()


#Setting the range to determine what is considered a low, medium, and high gasprice.
low_gasprices = gasprices_month_df[gasprices_month_df['A1'] < 1.5].copy()
medium_gasprices = gasprices_month_df[(gasprices_month_df['A1'] >= 1.5) & (gasprices_month_df['A1'] <= 2.3)].copy()
high_gasprices = gasprices_month_df[gasprices_month_df['A1'] > 2.3].copy()

#A new dataframe without the transit urban rail.
total_transit_ridership = transportation_b_df[['Date', 'Transit Ridership - Fixed Route Bus - Adjusted', 'Transit Ridership - Other Transit Modes - Adjusted']].copy()

#A new column where all ridership except urban rail is included.
total_transit_ridership['Total Transit Ridership'] = total_transit_ridership[['Transit Ridership - Fixed Route Bus - Adjusted', 'Transit Ridership - Other Transit Modes - Adjusted']].sum(axis = 1)
total_transit_ridership = total_transit_ridership.drop(columns = ['Transit Ridership - Fixed Route Bus - Adjusted', 'Transit Ridership - Other Transit Modes - Adjusted'])

#Converting total_transit_ridership, low_gasprices, medium_gasprices, and high_gasprices to a yyyy-mm basis so that they can be matched.
total_transit_ridership['YearMonth'] = total_transit_ridership['Date'].dt.strftime('%Y-%m') #yyyy-mm
low_gasprices['YearMonth'] = low_gasprices['Date'].dt.strftime('%Y-%m')
medium_gasprices['YearMonth'] = medium_gasprices['Date'].dt.strftime('%Y-%m')
high_gasprices['YearMonth'] = high_gasprices['Date'].dt.strftime('%Y-%m')

#Filtering out the transit ridership based on the gas price labels, low, medium, high.
low_merge = pd.merge(low_gasprices, total_transit_ridership, on = 'YearMonth', how = 'inner')
medium_merge = pd.merge(medium_gasprices, total_transit_ridership, on = 'YearMonth', how = 'inner')
high_merge = pd.merge(high_gasprices, total_transit_ridership, on = 'YearMonth', how = 'inner')

#extracting only the Transit ridership.
low_riders = low_merge['Total Transit Ridership']
medium_riders = medium_merge['Total Transit Ridership']
high_riders = high_merge['Total Transit Ridership']

#ANOVA Test
f_stat, p_val = f_oneway(low_riders, medium_riders, high_riders)
print(f"ANOVA Test results: \nF =  {f_stat:.2f}, p = {p_val:.2f}\n\n\n")

ANOVA Test results: 
F =  9.82, p = 0.00






Conclusion: The ANOVA test results indicate that transit ridership differs depending on the price of gas. This means that gas prices may have an effect on transit ridership, or that other related factors affect transit ridership. Since the p-value is close to 0, which is below α = 0.05, we reject the null hypothesis.
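The whole ANOVA pipeline (weekly prices to monthly means, labeling each month as low/medium/high with the $1.5 and $2.3 cut-offs, then comparing ridership across the labels) can be sketched compactly. Everything below runs on hypothetical synthetic data: the price levels, the ridership numbers, and the column names `price`, `group`, and `ridership` are assumptions for illustration, so the printed statistics say nothing about the real datasets.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical weekly prices over two years; the level depends only on the calendar month.
weekly = pd.DataFrame({"Date": pd.date_range("2019-01-07", "2020-12-28", freq="W-MON")})
levels = np.array([1.2, 1.9, 2.0, 2.8])
weekly["price"] = levels[(weekly["Date"].dt.month.to_numpy() - 1) % 4]

# Step 1: weekly -> monthly mean price, keyed by year-month period.
weekly["YearMonth"] = weekly["Date"].dt.to_period("M")
monthly_price = weekly.groupby("YearMonth")["price"].mean()

# Step 2: label each month with the cut-offs used above:
# low < 1.5, medium in [1.5, 2.3], high > 2.3 (first matching condition wins).
labels = np.select(
    [monthly_price < 1.5, monthly_price <= 2.3],
    ["Low", "Medium"],
    default="High",
)
monthly = pd.DataFrame({"price": monthly_price, "group": labels})

# Step 3: attach hypothetical monthly ridership that dips in the medium group.
base = {"Low": 700.0, "Medium": 600.0, "High": 720.0}
monthly["ridership"] = [base[g] + (i % 3) for i, g in enumerate(monthly["group"])]

# Step 4: one-way ANOVA across the three price groups.
samples = [grp["ridership"].to_numpy() for _, grp in monthly.groupby("group")]
f_stat, p_val = f_oneway(*samples)
print(monthly["group"].value_counts().to_dict())
print(f"F = {f_stat:.2f}, p = {p_val:.3g}")
```

Labeling the months in a single frame, rather than building three separate filtered frames, keeps the group sizes easy to inspect and avoids assigning new columns to slices.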

In [None]:
#Converting the date format to yyyy-mm
gasprices_month_df['YearMonth'] = gasprices_month_df['Date'].dt.strftime("%Y-%m")

#Dropping the original Date columns
gasprices_month_df = gasprices_month_df.drop(columns = ['Date'])
total_transit_ridership = total_transit_ridership.drop(columns = ['Date'])

#Normalizing the graph to visualize the difference between them
scaler = MinMaxScaler()
gasprices_month_df['A1 Normalized'] = scaler.fit_transform(gasprices_month_df[['A1']])
total_transit_ridership['Ridership Normalized'] = scaler.fit_transform(total_transit_ridership[['Total Transit Ridership']])

#Plotting
plt.figure(figsize=(30,7))
plt.plot(total_transit_ridership['YearMonth'], total_transit_ridership['Ridership Normalized'], label = 'Ridership')
plt.plot(gasprices_month_df['YearMonth'], gasprices_month_df['A1 Normalized'], label = 'gasprice')
plt.xticks(rotation = 90, fontsize = 6)
plt.xlabel('Date')
plt.ylabel('Normalized Value')
plt.legend()
plt.grid(True)
plt.title('Monthly Gas Prices and Transit Ridership (Normalized)')
plt.show()


*Test 3: Tukey's Honestly Significant Difference (HSD) Test*

At Significance level of α = 0.05

*Because Tukey's HSD test compares every pair of groups, we need three different null hypotheses and alternative hypotheses.*

Null Hypothesis: The average ridership is the same for low, medium, and high gas prices.

1) Average ridership at high gas prices = Average ridership at low gas prices

2) Average ridership at high gas prices = Average ridership at medium gas prices

3) Average ridership at low gas prices = Average ridership at medium gas prices

Alternative Hypothesis:

1) Average ridership at high gas prices $\neq$ Average ridership at low gas prices

2) Average ridership at high gas prices $\neq$ Average ridership at medium gas prices

3) Average ridership at low gas prices $\neq$ Average ridership at medium gas prices

In [None]:
# The raw monthly ridership values of each gas price group are used as-is (no per-group scaling).


all_ridership = pd.concat([
    low_merge['Total Transit Ridership'],
    medium_merge['Total Transit Ridership'],
    high_merge['Total Transit Ridership']
])

group_labels = (
    ['Low'] * len(low_merge) +
    ['Medium'] * len(medium_merge) +
    ['High'] * len(high_merge)
)

tukey_result = pairwise_tukeyhsd(endog=all_ridership, groups=group_labels, alpha=0.05)

# Print the result
print(tukey_result)

High vs. Low is not statistically significant, and neither is Low vs. Medium. Medium vs. High, however, is statistically significant. We reject the second null hypothesis but fail to reject the other two.
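Rather than reading the decisions off the printed table, they can also be pulled from the result object. The sketch below runs `pairwise_tukeyhsd` on hypothetical synthetic groups (the sample values are made up; only the group labels mirror the ones above) and pairs each group combination with its reject decision. It assumes the pairwise results are ordered like `itertools.combinations` over the sorted group names, which matches the order of the rows in the printed summary.

```python
from itertools import combinations

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical ridership samples: the medium group sits far below the other two.
rng = np.random.default_rng(1)
low = rng.normal(700.0, 10.0, 30)
medium = rng.normal(600.0, 10.0, 30)
high = rng.normal(705.0, 10.0, 30)

values = np.concatenate([low, medium, high])
groups = np.array(["Low"] * 30 + ["Medium"] * 30 + ["High"] * 30)

result = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)

# Map each pair of group names to its reject decision (True = significant difference).
pairs = combinations([str(g) for g in result.groupsunique], 2)
decisions = {pair: bool(rej) for pair, rej in zip(pairs, result.reject)}
for pair, rejected in decisions.items():
    print(pair, "reject H0" if rejected else "fail to reject H0")
```

Keeping the decisions in a dictionary makes it harder to mix up which pair a given "reject" refers to when writing the conclusion.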

In [None]:
group = ['Low', 'Medium', 'High']
low_avg = low_merge['Total Transit Ridership'].mean()
medium_avg = medium_merge['Total Transit Ridership'].mean()
high_avg = high_merge['Total Transit Ridership'].mean()

avg = [low_avg, medium_avg, high_avg]

plt.figure(figsize=(18,5))
plt.bar(group, avg, color = ['red', 'blue', 'green'])
plt.title("Average Ridership by Gas Price Level")
plt.xlabel('Gas Price Level')
plt.ylabel('Average Total Transit Ridership')
plt.show()

Conclusion: As Tukey's HSD and the bar chart show, ridership is lowest when gas prices are medium (\$1.5 to \$2.3). Higher ridership when gas prices are either high or low suggests that external factors may be involved. One potential factor is the economy, which can prompt people to choose public transportation over their own cars: high gas prices make driving more expensive, whereas low gas prices may indicate a recession or a loss of jobs.

Here, all the metrics in transportation_df are graphed to visualize the behavior and trends in the data.

In [None]:
transportation_b_df = transportation_b_df[transportation_b_df['Date'] >= '2002-01-01']

transportation_columns = ['U.S. Airline Traffic - Total - Seasonally Adjusted', 'Transit Ridership - Other Transit Modes - Adjusted',
'Transit Ridership - Fixed Route Bus - Adjusted', 'Transit Ridership - Urban Rail - Adjusted', 'Highway Fuel Price - Regular Gasoline', 'Passenger Rail Passengers',
'Passenger Rail Total Train Miles', 'Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted'
]

for col in transportation_columns:
    transportation_df[col] = pd.to_numeric(transportation_df[col], errors='coerce')

scaler = MinMaxScaler()
transportation_df[transportation_columns] = scaler.fit_transform(transportation_df[transportation_columns])

plt.figure(figsize=(14, 7))
colors = plt.get_cmap('tab10', len(transportation_columns))

for i, col in enumerate(transportation_columns):
    plt.plot(transportation_df['Date'], transportation_df[col], label=col, color=colors(i), linewidth=2, alpha=0.8)

plt.title("Transportation Metrics Over Time (Normalized)")
plt.xlabel("Date")
plt.ylabel("Normalized Value")
plt.legend(loc="best", bbox_to_anchor=(1, 1))
plt.grid(True)
plt.tight_layout()
plt.show()

As you can see, the lines overlap so much that the individual series are hard to distinguish. Columns such as Personal Spending on Transportation and Highway Fuel Price can therefore be dropped. It is also better to group columns that are similar, such as the Transit columns and the Passenger Rail columns.

In [None]:
transportation_df['Transit Ridership Avg'] = transportation_df[['Transit Ridership - Fixed Route Bus - Adjusted', 'Transit Ridership - Other Transit Modes - Adjusted', 'Transit Ridership - Urban Rail - Adjusted']].mean(axis = 1)
transportation_df['Passenger Rail Avg'] = transportation_df[['Passenger Rail Passengers', 'Passenger Rail Total Train Miles']].mean(axis = 1)
transportation_df['Date'] = pd.to_datetime(transportation_df['Date'])
transportation_df = transportation_df[transportation_df['Date'] >= '2002-01-01']
transportation_avg_df = transportation_df[['Date', 'Transit Ridership Avg', 'Passenger Rail Avg']].copy()
transportation_avg_columns = ['Transit Ridership Avg', 'Passenger Rail Avg']

for col in transportation_avg_columns:
    transportation_avg_df[col] = pd.to_numeric(transportation_avg_df[col], errors='coerce')

# No rescaling here: the source columns were already min-max scaled in the previous cell.

plt.figure(figsize=(14, 7))
colors = plt.get_cmap('tab10', len(transportation_avg_columns))  # or 'tab20'

for i, col in enumerate(transportation_avg_columns):
    plt.plot(transportation_avg_df['Date'], transportation_avg_df[col], label=col, color=colors(i), linewidth=2, alpha=0.8)

plt.title("Average Transportation Metrics Over Time (Normalized)")
plt.xlabel("Date")
plt.ylabel("Normalized Value")
plt.legend(loc="best", bbox_to_anchor=(1, 1))
plt.grid(True)
plt.tight_layout()
plt.show()

We combined the transit columns into one by taking their average, and did the same for the passenger rail columns. This reduces the total number of lines plotted and gives a general picture of transit and passenger rail usage. All other columns are left as they were. For both dataframes, the core of the data overlaps starting in 2002, so both timelines begin at 2002. In addition, normalization lets the two dataframes be compared and improves the visualization.
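For reference, min-max normalization maps each column to the range 0 to 1 via (x - min) / (max - min), which is the same formula `MinMaxScaler` applies with its default settings. A minimal sketch with plain pandas, on hypothetical values (the column names `gas_price` and `ridership` and the helper `min_max` are illustrative assumptions):

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Scale a numeric series to [0, 1]; a constant series maps to all zeros."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span


# Hypothetical columns on very different scales.
toy = pd.DataFrame(
    {
        "gas_price": [2.0, 2.5, 3.0, 4.0],
        "ridership": [800.0, 600.0, 700.0, 900.0],
    }
)
normalized = toy.apply(min_max)
print(normalized)
```

After scaling, both columns share the same axis range, so their shapes over time can be compared on one plot even though the original units differ by orders of magnitude.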

In [None]:
gasprices_df['Date'] = pd.to_datetime(gasprices_df['Date'])
gasprices_df = gasprices_df[gasprices_df['Date'] >= '2002-01-01']

plt.figure(figsize=(10, 6))
plt.plot(gasprices_b_df['Date'], gasprices_b_df['A1'], label='Gasoline Price', color='blue')
plt.grid(True)
plt.title('U.S. Gasoline Retail Prices (2002–2021)')
plt.xlabel('Year')
plt.ylabel('Price (USD per gallon)')
plt.legend()
plt.show()

<h5> Machine Learning Model </h5>

In [60]:
transportation_df = pd.read_csv('Monthly_Transportation_Statistics.csv')
gasprices_df = pd.read_csv('USGasanddieselprices.csv')
transportation_df['Date'] = pd.to_datetime(transportation_df['Date'])
gasprices_df['Date'] = pd.to_datetime(gasprices_df['Date'])
gasprices_df = gasprices_df[['Date', 'A1']]
transportation_df = transportation_df.loc[transportation_df['Date'] >= '01-01-1975']
gasprices_df.rename(columns={"A1": "Weekly Retail Gasoline Prices(All Grades Formulation)"}, inplace=True)
gasprices_df['Date'] = pd.to_datetime(gasprices_df['Date'])


gasprices_df = gasprices_df[gasprices_df['Date'] >= '2017-01-01']
gasprices_df = gasprices_df[gasprices_df['Date'] <= '2021-01-04']
transportation_df = transportation_df.loc[transportation_df['Date'] >= '2017-01-01']
transportation_df = transportation_df.loc[transportation_df['Date'] <= '2021-01-01']



gasprices_df.set_index('Date', inplace = True)
transportation_df.set_index('Date', inplace = True)
#'ME' (month end) replaces the deprecated 'M' alias, matching the Grouper call used earlier.
gasprices_df = gasprices_df.resample('ME').mean()
print(gasprices_df.shape)
print(transportation_df.shape)
display(gasprices_df.head())
#Convert to period so only year and month are used in order to merge dataframes
transportation_df.index = transportation_df.index.to_period('M')
gasprices_df.index = gasprices_df.index.to_period('M')
total_df = pd.merge( transportation_df, gasprices_df, on ='Date', how = 'outer')
total_df['Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted'] = total_df['Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted'].ffill()
display(total_df.head())
print(total_df.dtypes)
withoutdate_df = total_df.select_dtypes(include=['number'])
corr_matrix = withoutdate_df.corr()
display(withoutdate_df.head())
display(withoutdate_df[['U.S. Airline Traffic - Total - Seasonally Adjusted', 'Transit Ridership - Other Transit Modes - Adjusted',
'Transit Ridership - Fixed Route Bus - Adjusted', 'Transit Ridership - Urban Rail - Adjusted', 'Highway Fuel Price - Regular Gasoline', 'Passenger Rail Passengers',
'Passenger Rail Total Train Miles', 'Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted', 'Weekly Retail Gasoline Prices(All Grades Formulation)']].corr())

#Note: strftime works on datetimes; "%m" is the month ("%M" is minutes).




(49, 1)
(49, 135)




Date        Weekly Retail Gasoline Prices(All Grades Formulation)
2017-01-31  2.4584
2017-02-28  2.416
2017-03-31  2.43675
2017-04-30  2.5275
2017-05-31  2.5026


Unnamed: 0_level_0,Index,Air Safety - General Aviation Fatalities,Highway Fatalities Per 100 Million Vehicle Miles Traveled,Highway Fatalities,U.S. Airline Traffic - Total - Seasonally Adjusted,U.S. Airline Traffic - International - Seasonally Adjusted,U.S. Airline Traffic - Domestic - Seasonally Adjusted,Transit Ridership - Other Transit Modes - Adjusted,Transit Ridership - Fixed Route Bus - Adjusted,Transit Ridership - Urban Rail - Adjusted,Freight Rail Intermodal Units,Freight Rail Carloads,Highway Vehicle Miles Traveled - All Systems,Highway Vehicle Miles Traveled - Total Rural,Highway Vehicle Miles Traveled - Other Rural,Highway Vehicle Miles Traveled - Rural Other Arterial,Highway Vehicle Miles Traveled - Rural Interstate,State and Local Government Construction Spending - Breakwater/Jetty,State and Local Government Construction Spending - Dam/Levee,State and Local Government Construction Spending - Conservation and Development,State and Local Government Construction Spending - Pump Station,State and Local Government Construction Spending - Line,State and Local Government Construction Spending - Water Treatment Plant,State and Local Government Construction Spending - Water Supply,State and Local Government Construction Spending - Line/Drain,State and Local Government Construction Spending - Waste Water Treatment Plant,State and Local Government Construction Spending - Waste Water,State and Local Government Construction Spending - Line/Pump Station,State and Local Government Construction Spending - Sewage Treatment Plant,State and Local Government Construction Spending - Sewage / Dry Waste,State and Local Government Construction Spending - Sewage and Waste Disposal,State and Local Government Construction Spending - Rest Facility,State and Local Government Construction Spending - Bridge,State and Local Government Construction Spending - Lighting,State and Local Government Construction Spending - Pavement,State and Local Government Construction Spending - Highway 
and Street,State and Local Government Construction Spending - Power,State and Local Government Construction Spending - Dock / Marina,State and Local Government Construction Spending - Water,State and Local Government Construction Spending - Mass Transit,State and Local Government Construction Spending - Land Passenger Terminal,State and Local Government Construction Spending - Land,State and Local Government Construction Spending - Runway,State and Local Government Construction Spending - Air Passenger Terminal,State and Local Government Construction Spending - Air,State and Local Government Construction Spending - Transportation,State and Local Government Construction Spending - Park / Camp,State and Local Government Construction Spending - Neighborhood Center,State and Local Government Construction Spending - Social Center,State and Local Government Construction Spending - Convention Center,State and Local Government Construction Spending - Performance / Meeting Center,State and Local Government Construction Spending - Sports,State and Local Government Construction Spending - Amusement and Recreation,State and Local Government Construction Spending - Fire & Rescue,State and Local Government Construction Spending - Other Public Safety,State and Local Government Construction Spending - Police & Sheriff,State and Local Government Construction Spending - Detention,State and Local Government Construction Spending - Correctional,State and Local Government Construction Spending - Public Safety,State and Local Government Construction Spending - Library / Archive,State and Local Government Construction Spending - Other Educational,State and Local Government Construction Spending - Infrastructure,State and Local Government Construction Spending - Sports & Recreation,State and Local Government Construction Spending - Dormitory,State and Local Government Construction Spending - Instructional,State and Local Government Construction Spending - Higher Education,State and Local 
Government Construction Spending - High School,State and Local Government Construction Spending - Middle School / Junior High,State and Local Government Construction Spending - Elementary Schools,State and Local Government Construction Spending - Primary/Secondary Schools,State and Local Government Construction Spending - Educational,State and Local Government Construction Spending - Special Care,State and Local Government Construction Spending - Medical Building,State and Local Government Construction Spending - Hospital,State and Local Government Construction Spending - Health Care,State and Local Government Construction Spending - Parking,State and Local Government Construction Spending - Automotive,State and Local Government Construction Spending - Commercial,State and Local Government Construction Spending - Office,State and Local Government Construction Spending - Non Residential,State and Local Government Construction Spending - Multi Family,State and Local Government Construction Spending - Residential,State and Local Government Construction Spending - Total,National Highway Construction Cost Index (NHCCI),Highway Fuel Price - On-highway Diesel,Highway Fuel Price - Regular Gasoline,Transportation Employment - Pipeline Transportation,Transportation Employment - Water Transportation,Transportation Employment - Rail Transportation,Transportation Employment - Air Transportation,Transportation Employment - Transit and ground passenger transportation,Transportation Employment - Truck Transportation,Personal Spending on Transportation - Transportation Services - Seasonally Adjusted,Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted,Personal Spending on Transportation - Motor Vehicles and Parts - Seasonally Adjusted,Unemployment Rate - Seasonally Adjusted,Labor Force Participation Rate - Seasonally Adjusted,Unemployed - Seasonally Adjusted,Real Gross Domestic Product - Seasonally Adjusted,Passenger Rail Passengers,Passenger 
Rail Passenger Miles,Passenger Rail Total Train Miles,Passenger Rail Employee Hours Worked,Passenger Rail Yard Switching Miles,Passenger Rail Total Reports,U.S. Waterway Tonnage,Amtrak On-time Performance,Rail Fatalities,Rail Fatalities at Highway-Rail Crossings,Trespasser Fatalities Not at Highwaya-Rail Crossings,Transportation Services Index - Freight,Transportation Services Index - Passenger,Transportation Services Index - Combined,U.S.-Canada Incoming Person Crossings,U.S.-Canada Incoming Truck Crossings,U.S.-Mexico Incoming Person Crossings,Air Safety - Air Taxi and Commuter Fatalities,Heavy truck sales,U.S.-Mexico Incoming Truck Crossings,Light truck sales,Auto sales,Air Safety - Air Carrier Fatalities,U.S. Air Carrier Cargo (millions of revenue ton-miles) - International,Truck tonnage index,U.S. Air Carrier Cargo (millions of revenue ton-miles) - Domestic,Heavy truck sales SAAR (millions),U.S. Airline Traffic - Total - Non Seasonally Adjusted,Light truck sales SAAR (millions),U.S. Airline Traffic - International - Non Seasonally Adjusted,Auto sales SAAR (millions),U.S. Airline Traffic - Domestic - Non Seasonally Adjusted,Transborder - Total North American Freight,Transborder - U.S. - Mexico Freight,U.S. marketing air carriers on-time performance (percent),Transborder - U.S. - Canada Freight,Weekly Retail Gasoline Prices(All Grades Formulation)
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 
100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1,Unnamed: 106_level_1,Unnamed: 107_level_1,Unnamed: 108_level_1,Unnamed: 109_level_1,Unnamed: 110_level_1,Unnamed: 111_level_1,Unnamed: 112_level_1,Unnamed: 113_level_1,Unnamed: 114_level_1,Unnamed: 115_level_1,Unnamed: 116_level_1,Unnamed: 117_level_1,Unnamed: 118_level_1,Unnamed: 119_level_1,Unnamed: 120_level_1,Unnamed: 121_level_1,Unnamed: 122_level_1,Unnamed: 123_level_1,Unnamed: 124_level_1,Unnamed: 125_level_1,Unnamed: 126_level_1,Unnamed: 127_level_1,Unnamed: 128_level_1,Unnamed: 129_level_1,Unnamed: 130_level_1,Unnamed: 131_level_1,Unnamed: 132_level_1,Unnamed: 133_level_1,Unnamed: 134_level_1,Unnamed: 135_level_1,Unnamed: 136_level_1
2017-01,840,21.0,1.12,8301.0,70190000.0,8910000.0,61280000.0,15695383.0,379032287.0,385406234.0,999980.0,990999.0,,,,,,74000000.0,60000000.0,183000000.0,95000000.0,371000000.0,348000000.0,914000000.0,148000000.0,465000000.0,612000000.0,662000000.0,156000000.0,833000000.0,1445000000.0,36000000.0,1409000000.0,167000000.0,2316000000.0,4102000000.0,461000000.0,76000000.0,92000000.0,500000000.0,289000000.0,1053000000.0,159000000.0,482000000.0,726000000.0,1872000000.0,406000000.0,88000000.0,97000000.0,74000000.0,125000000.0,111000000.0,746000000.0,138000000.0,171000000.0,119000000.0,142000000.0,261000000.0,432000000.0,82000000.0,117000000.0,70000000.0,163000000.0,237000000.0,1036000000.0,1696000000.0,1412000000.0,613000000.0,896000000.0,2965000000.0,4814000000.0,49000000.0,96000000.0,273000000.0,418000000.0,40000000.0,40000000.0,128000000.0,442000000.0,16005000000.0,401000000.0,427000000.0,16433000000.0,,2.58,2.349,48900.0,62800.0,181800.0,481500.0,501600.0,1419200.0,420061000000.0,320998000000.0,513689000000.0,0.047,0.628,7468000.0,19398340000000.0,2330923.0,458016423.0,3432608.0,3215098.0,177548.0,2.0,43900000.0,,64.0,25.0,32.0,124.0,127.2,125.0,13537.0,483159.0,3424417.0,0.0,25700.0,482882.0,726600.0,412900.0,0.0,1861554000.0,103.0,1131530000.0,363000.0,70190000.0,11041000.0,8910000.0,6248000.0,61280000.0,87960380000.0,43015360000.0,,44945030000.0,2.4584
2017-02,841,28.0,,,69790000.0,8800000.0,60990000.0,15160465.0,373246708.0,367315442.0,1045836.0,1038433.0,,,,,,100000000.0,66000000.0,214000000.0,70000000.0,323000000.0,390000000.0,910000000.0,179000000.0,462000000.0,642000000.0,658000000.0,163000000.0,844000000.0,1486000000.0,35000000.0,1525000000.0,224000000.0,2474000000.0,4456000000.0,454000000.0,82000000.0,99000000.0,634000000.0,289000000.0,1183000000.0,147000000.0,522000000.0,732000000.0,2013000000.0,453000000.0,88000000.0,110000000.0,67000000.0,122000000.0,101000000.0,796000000.0,133000000.0,177000000.0,112000000.0,150000000.0,262000000.0,439000000.0,94000000.0,145000000.0,76000000.0,179000000.0,259000000.0,1052000000.0,1769000000.0,1300000000.0,637000000.0,927000000.0,2930000000.0,4877000000.0,50000000.0,105000000.0,324000000.0,480000000.0,33000000.0,33000000.0,121000000.0,450000000.0,16743000000.0,431000000.0,457000000.0,17200000000.0,,2.568,2.304,48700.0,61900.0,182800.0,482000.0,505200.0,1425100.0,,320998000000.0,,0.046,0.629,7379000.0,,2153891.0,405865664.0,3121837.0,2921855.0,164751.0,2.0,43000000.0,,58.0,17.0,39.0,124.8,126.0,125.1,12616.0,453800.0,3155207.0,1.0,28000.0,469120.0,840900.0,484300.0,0.0,1680568000.0,104.0,1029817000.0,384000.0,69790000.0,11093000.0,8800000.0,6221000.0,60990000.0,86474170000.0,42056620000.0,,44417550000.0,2.416
2017-03,842,22.0,,,69680000.0,8770000.0,60910000.0,17507282.0,415647479.0,430111836.0,1007335.0,1017513.0,,,,,,85000000.0,88000000.0,230000000.0,90000000.0,370000000.0,464000000.0,1059000000.0,132000000.0,526000000.0,658000000.0,751000000.0,180000000.0,944000000.0,1602000000.0,33000000.0,1899000000.0,170000000.0,2968000000.0,5258000000.0,426000000.0,90000000.0,110000000.0,679000000.0,341000000.0,1293000000.0,200000000.0,541000000.0,807000000.0,2210000000.0,539000000.0,108000000.0,134000000.0,81000000.0,141000000.0,121000000.0,946000000.0,134000000.0,176000000.0,138000000.0,164000000.0,301000000.0,477000000.0,82000000.0,123000000.0,83000000.0,206000000.0,267000000.0,1092000000.0,1890000000.0,1317000000.0,706000000.0,1001000000.0,3103000000.0,5165000000.0,59000000.0,129000000.0,334000000.0,522000000.0,35000000.0,35000000.0,119000000.0,464000000.0,18525000000.0,488000000.0,505000000.0,19030000000.0,1.617234,2.554,2.325,49500.0,62700.0,184000.0,486700.0,508200.0,1429800.0,,320998000000.0,,0.044,0.629,7073000.0,,2659548.0,530748447.0,3464598.0,3410941.0,196791.0,2.0,47700000.0,,49.0,18.0,31.0,125.1,126.2,125.4,14144.0,522488.0,3693446.0,0.0,35200.0,537552.0,958200.0,590600.0,0.0,2155153000.0,103.0,1280820000.0,405000.0,69680000.0,10568000.0,8770000.0,6088000.0,60910000.0,100288900000.0,49076100000.0,,51212840000.0,2.43675
2017-04,843,34.0,1.13,9460.0,70350000.0,9150000.0,61210000.0,17044306.0,388066756.0,407072908.0,1291296.0,1275287.0,,,,,,82000000.0,96000000.0,242000000.0,83000000.0,451000000.0,493000000.0,1192000000.0,149000000.0,515000000.0,664000000.0,764000000.0,205000000.0,983000000.0,1646000000.0,32000000.0,2233000000.0,150000000.0,3577000000.0,6222000000.0,354000000.0,94000000.0,116000000.0,634000000.0,360000000.0,1327000000.0,262000000.0,635000000.0,953000000.0,2396000000.0,544000000.0,117000000.0,137000000.0,95000000.0,157000000.0,104000000.0,953000000.0,157000000.0,195000000.0,138000000.0,164000000.0,302000000.0,497000000.0,85000000.0,134000000.0,86000000.0,201000000.0,271000000.0,1141000000.0,1935000000.0,1416000000.0,760000000.0,1068000000.0,3335000000.0,5482000000.0,65000000.0,117000000.0,336000000.0,518000000.0,35000000.0,35000000.0,130000000.0,429000000.0,20123000000.0,453000000.0,474000000.0,20597000000.0,,2.583,2.417,48900.0,64200.0,183700.0,488400.0,503600.0,1443100.0,423606000000.0,326000000000.0,514169000000.0,0.044,0.63,7089000.0,19506950000000.0,2764389.0,561148298.0,3342228.0,3156697.0,190095.0,2.0,44100000.0,0.76,57.0,13.0,40.0,125.5,127.3,126.0,25111.0,477189.0,3581996.0,3.0,32000.0,483439.0,880500.0,538500.0,0.0,2065234000.0,104.4,1184722000.0,395000.0,70350000.0,10614000.0,9150000.0,6157000.0,61210000.0,91067620000.0,44021260000.0,,47046360000.0,2.5275
2017-05,844,22.0,,,70720000.0,9060000.0,61660000.0,18539625.0,405517838.0,429163483.0,1069859.0,1026679.0,,,,,,103000000.0,87000000.0,272000000.0,80000000.0,547000000.0,499000000.0,1244000000.0,132000000.0,602000000.0,733000000.0,954000000.0,216000000.0,1180000000.0,1913000000.0,35000000.0,2629000000.0,167000000.0,4876000000.0,7902000000.0,449000000.0,83000000.0,108000000.0,665000000.0,364000000.0,1405000000.0,356000000.0,614000000.0,1031000000.0,2545000000.0,635000000.0,120000000.0,146000000.0,98000000.0,166000000.0,117000000.0,1076000000.0,167000000.0,205000000.0,129000000.0,172000000.0,301000000.0,507000000.0,84000000.0,148000000.0,120000000.0,306000000.0,364000000.0,1209000000.0,2270000000.0,1702000000.0,891000000.0,1157000000.0,3831000000.0,6351000000.0,74000000.0,99000000.0,375000000.0,548000000.0,54000000.0,57000000.0,151000000.0,490000000.0,23519000000.0,527000000.0,557000000.0,24076000000.0,,2.56,2.391,48400.0,66000.0,183600.0,492100.0,516200.0,1453100.0,,326000000000.0,,0.044,0.628,7000000.0,,2822981.0,574606036.0,3487216.0,3345829.0,188425.0,2.0,42600000.0,0.73,63.0,17.0,44.0,126.3,127.0,126.4,35628.0,527854.0,3542039.0,3.0,34600.0,526011.0,947900.0,562700.0,0.0,2159377000.0,104.7,1236290000.0,417000.0,70720000.0,10665000.0,9060000.0,6000000.0,61660000.0,98246030000.0,47044730000.0,,51201300000.0,2.5026


Index                                                                                            int64
Air Safety - General Aviation Fatalities                                                       float64
Highway Fatalities Per 100 Million Vehicle Miles Traveled                                      float64
Highway Fatalities                                                                             float64
U.S. Airline Traffic - Total - Seasonally Adjusted                                             float64
U.S. Airline Traffic - International - Seasonally Adjusted                                     float64
U.S. Airline Traffic - Domestic - Seasonally Adjusted                                          float64
Transit Ridership - Other Transit Modes - Adjusted                                             float64
Transit Ridership - Fixed Route Bus - Adjusted                                                 float64
Transit Ridership - Urban Rail - Adjusted                                



Unnamed: 0,U.S. Airline Traffic - Total - Seasonally Adjusted,Transit Ridership - Other Transit Modes - Adjusted,Transit Ridership - Fixed Route Bus - Adjusted,Transit Ridership - Urban Rail - Adjusted,Highway Fuel Price - Regular Gasoline,Passenger Rail Passengers,Passenger Rail Total Train Miles,Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted,Weekly Retail Gasoline Prices(All Grades Formulation)
U.S. Airline Traffic - Total - Seasonally Adjusted,1.0,0.933882,0.957653,0.974498,0.768452,0.952532,0.936701,0.908478,0.786711
Transit Ridership - Other Transit Modes - Adjusted,0.933882,1.0,0.956126,0.964632,0.809822,0.981961,0.959853,0.900677,0.829973
Transit Ridership - Fixed Route Bus - Adjusted,0.957653,0.956126,1.0,0.991559,0.757659,0.958575,0.964267,0.911501,0.784643
Transit Ridership - Urban Rail - Adjusted,0.974498,0.964632,0.991559,1.0,0.767087,0.977466,0.969233,0.909347,0.79207
Highway Fuel Price - Regular Gasoline,0.768452,0.809822,0.757659,0.767087,1.0,0.781423,0.740852,0.733075,0.997411
Passenger Rail Passengers,0.952532,0.981961,0.958575,0.977466,0.781423,1.0,0.966889,0.903897,0.803842
Passenger Rail Total Train Miles,0.936701,0.959853,0.964267,0.969233,0.740852,0.966889,1.0,0.880055,0.76674
Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted,0.908478,0.900677,0.911501,0.909347,0.733075,0.903897,0.880055,1.0,0.75436
Weekly Retail Gasoline Prices(All Grades Formulation),0.786711,0.829973,0.784643,0.79207,0.997411,0.803842,0.76674,0.75436,1.0
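
Several of the predictors in the matrix above are correlated with each other above 0.95 (for example, Fixed Route Bus and Urban Rail ridership at 0.991559), which can make individual regression coefficients unstable. The following is a minimal sketch of how such pairs could be listed automatically. The helper `high_correlation_pairs`, the 0.95 threshold, and the synthetic `demo` frame are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.95):
    """Return column pairs whose absolute Pearson correlation exceeds threshold.

    Hypothetical helper for illustration only.
    """
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs[pairs.abs() > threshold].sort_values(ascending=False)


# Synthetic stand-in data: "b" is a noisy copy of "a", "c" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

flagged = high_correlation_pairs(demo)
print(flagged)
```

Applied to the project data, the same call would take the predictor columns of `withoutdate_df` in place of `demo`.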


In [53]:
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler

withoutdate_df = withoutdate_df[['U.S. Airline Traffic - Total - Seasonally Adjusted', 'Transit Ridership - Other Transit Modes - Adjusted',
'Transit Ridership - Fixed Route Bus - Adjusted', 'Transit Ridership - Urban Rail - Adjusted', 'Highway Fuel Price - Regular Gasoline', 'Passenger Rail Passengers',
'Passenger Rail Total Train Miles', 'Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted', 'Weekly Retail Gasoline Prices(All Grades Formulation)']]
print(withoutdate_df.isna().sum().sum())
#display(df_combined)

X = withoutdate_df[['U.S. Airline Traffic - Total - Seasonally Adjusted', 'Transit Ridership - Other Transit Modes - Adjusted',
'Transit Ridership - Fixed Route Bus - Adjusted', 'Transit Ridership - Urban Rail - Adjusted', 'Passenger Rail Passengers',
'Passenger Rail Total Train Miles']]
Y = withoutdate_df[['Weekly Retail Gasoline Prices(All Grades Formulation)']]


print(X.shape)
print(Y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)
scaler = StandardScaler()
scaler2 = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
y_train_scaled = scaler2.fit_transform(y_train)

X_test_scaled = scaler.transform(X_test)
y_test_scaled = scaler2.transform(y_test)

0
(49, 6)
(49, 1)
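
With only 49 rows, an 80/20 split leaves about 10 rows for testing, so a single test score can vary a lot with the choice of `random_state`. The cell above imports `cross_val_score` without using it; below is a hedged sketch of how k-fold cross-validation could be applied instead. It uses `KFold` rather than `StratifiedKFold`, since stratification targets classification labels, and it wraps the scaler in a pipeline so it is refit on each training fold. The synthetic `X_demo` and `y_demo` arrays are assumptions standing in for the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the notebook's X and Y (49 rows, 6 features).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(49, 6))
y_demo = 2.5 + 0.4 * X_demo[:, 0] + rng.normal(scale=0.2, size=49)

# The pipeline fits the scaler on each training fold only,
# so no information from the validation fold leaks into scaling.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negated MSE so that larger is always better.
neg_mse = cross_val_score(pipeline, X_demo, y_demo, cv=cv,
                          scoring="neg_mean_squared_error")
cv_mse = -neg_mse
print("MSE per fold:", cv_mse)
print("Mean MSE:", cv_mse.mean())
```

Because this sketch does not standardize the target, the resulting MSE is in squared units of the target rather than in standardized units.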


In [54]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, silhouette_score
from sklearn.metrics import mean_squared_error



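# Seed NumPy's global RNG (call the function; assigning to it would overwrite it).
# LinearRegression itself is deterministic, so this does not change the fit.
np.random.seed(42)
model = LinearRegression()
model.fit(X_train_scaled, y_train_scaled)

# Evaluate on the held-out set. Both targets are standardized,
# so this is a mean squared error in standardized units, not an accuracy.
y_pred_scaled = model.predict(X_test_scaled)
mse = mean_squared_error(y_test_scaled, y_pred_scaled)
print(mse)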

0.2951857724921884
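
The value above is a mean squared error on the standardized target, so it is expressed in squared standard deviations of the training gas prices rather than in dollars. A minimal sketch of how the predictions could be mapped back to the original units and summarized with RMSE and R² follows. The synthetic data and the names `x_scaler`, `y_scaler`, and `demo_model` are assumptions for illustration; in the notebook the corresponding objects would be `scaler`, `scaler2`, and `model`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shapes as the notebook's X (49, 6) and Y (49, 1).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(49, 6))
y_demo = 2.5 + 0.4 * X_demo[:, [0]] + rng.normal(scale=0.2, size=(49, 1))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# Fit both scalers on the training split only.
x_scaler = StandardScaler().fit(X_tr)
y_scaler = StandardScaler().fit(y_tr)

demo_model = LinearRegression().fit(x_scaler.transform(X_tr),
                                    y_scaler.transform(y_tr))
pred_scaled = demo_model.predict(x_scaler.transform(X_te))

# MSE in standardized units, as computed in the notebook.
mse_scaled = mean_squared_error(y_scaler.transform(y_te), pred_scaled)

# Undo the target scaling to report the error in the original units.
pred_original = y_scaler.inverse_transform(pred_scaled)
rmse_original = np.sqrt(mean_squared_error(y_te, pred_original))
r2 = r2_score(y_te, pred_original)

print("Scaled MSE:", mse_scaled)
print("RMSE (original units):", rmse_original)
print("R^2:", r2)
```

Since standardization divides by the training standard deviation, the original-unit RMSE equals the square root of the scaled MSE multiplied by `y_scaler.scale_[0]`.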
