<a href="https://colab.research.google.com/github/roccoderosa1982/taxi-drive-demand-forecast/blob/main/bolt_taxi_demand_forecast.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import Data**

In [None]:
#%pip install pydantic

In [None]:
#%pip install holidays

In [1]:
import pandas as pd
from pydantic import BaseModel, ValidationError, field_validator
from datetime import datetime

# Define the schema for the DataFrame using Pydantic
class Ride(BaseModel):
    start_time: datetime
    start_lat: float
    start_lng: float
    end_lat: float
    end_lng: float
    ride_value: float

    @field_validator('*', mode='before')
    def check_empty_strings(cls, v):
        if v == '':
            raise ValueError('Empty strings are not allowed')
        return v

    @field_validator('start_lat', 'end_lat')
    def validate_latitude(cls, v):
        if not -90 <= v <= 90:
            raise ValueError('Latitude must be between -90 and 90')
        return v

    @field_validator('start_lng', 'end_lng')
    def validate_longitude(cls, v):
        if not -180 <= v <= 180:
            raise ValueError('Longitude must be between -180 and 180')
        return v

# Initialize lists to hold good lines and bad lines
data = []
bad_lines = []

# Read the CSV file line by line
with open('/content/drive/MyDrive/datasets/robotex5.csv', 'r') as file:
    for line_no, line in enumerate(file, start=0):
        # Skip the header line
        if line_no == 0:
            continue
        # Split the line into fields
        fields = line.strip().split(',')
        # Create a dictionary with the fields and their corresponding column names
        row_data = dict(zip(Ride.__fields__.keys(), fields))
        try:
            # Remove the last three digits from the start_time string to remove milliseconds
            start_time_str = row_data['start_time'][:-3]
            # Convert the start_time string to a datetime object
            row_data['start_time'] = datetime.strptime(start_time_str, '%Y-%m-%d %H:%M:%S.%f')
            # Validate the row data against the schema
            validated_data = Ride(**row_data)
            # If validation is successful, add the validated data to the cleaned_lines list
            data.append(validated_data.dict())
        except ValueError as e:
            # If the conversion or validation fails, add the line number and error message to the bad_lines list
            bad_lines.append([line_no, line, str(e)])
            # Print the bad line
            #print(f"Bad Line {line_no}: {line.strip()} - {str(e)}")

# Create DataFrames for correct lines and bad lines
data = pd.DataFrame(data).drop_duplicates()
bad_lines_df = pd.DataFrame(bad_lines, columns=['line_number', 'line_content', 'error'])

In [2]:
data.head()

Unnamed: 0,start_time,start_lat,start_lng,end_lat,end_lng,ride_value
0,2022-03-06 15:02:39.329452,59.40791,24.689836,59.513027,24.83163,3.51825
1,2022-03-10 11:15:55.177526,59.44165,24.762712,59.42645,24.783076,0.5075
2,2022-03-06 14:23:33.893257,59.435404,24.749795,59.431901,24.761588,0.19025
3,2022-03-03 09:11:59.104192,59.40692,24.659006,59.381093,24.641652,0.756
4,2022-03-06 00:13:01.290346,59.43494,24.753641,59.489203,24.87617,2.271


In [3]:
data.shape

(622646, 6)

In [4]:
bad_lines_df.head()

Unnamed: 0,line_number,line_content,error


In [5]:
bad_lines_df.shape

(0, 3)

In [6]:
data.isna().sum()

start_time    0
start_lat     0
start_lng     0
end_lat       0
end_lng       0
ride_value    0
dtype: int64

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 622646 entries, 0 to 627209
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   start_time  622646 non-null  datetime64[ns]
 1   start_lat   622646 non-null  float64       
 2   start_lng   622646 non-null  float64       
 3   end_lat     622646 non-null  float64       
 4   end_lng     622646 non-null  float64       
 5   ride_value  622646 non-null  float64       
dtypes: datetime64[ns](1), float64(5)
memory usage: 33.3 MB


In [8]:
data.describe()

Unnamed: 0,start_time,start_lat,start_lng,end_lat,end_lng,ride_value
count,622646,622646.0,622646.0,622646.0,622646.0,622646.0
mean,2022-03-15 18:59:21.545893376,59.428683,24.743474,59.397579,24.724682,2.268597
min,2022-03-01 00:00:07.936317,59.321557,24.505199,-37.819979,-122.453962,0.107628
25%,2022-03-09 00:03:57.959799552,59.418812,24.713154,59.415213,24.707899,0.54525
50%,2022-03-16 08:22:42.513948160,59.43207,24.744677,59.430697,24.744334,1.059
75%,2022-03-22 21:29:45.173893376,59.439024,24.768124,59.439262,24.773922,1.712
max,2022-03-28 23:59:53.175658,59.566998,24.973743,61.552744,144.96611,3172.701
std,,0.021761,0.05687,1.397846,1.656725,45.053886


In [9]:
data['start_time_hour'] = data.start_time.dt.hour
#data['start_time_minute'] = data.start_time.dt.minute
#data['start_time_second'] = data.start_time.dt.second
data.head()

Unnamed: 0,start_time,start_lat,start_lng,end_lat,end_lng,ride_value,start_time_hour
0,2022-03-06 15:02:39.329452,59.40791,24.689836,59.513027,24.83163,3.51825,15
1,2022-03-10 11:15:55.177526,59.44165,24.762712,59.42645,24.783076,0.5075,11
2,2022-03-06 14:23:33.893257,59.435404,24.749795,59.431901,24.761588,0.19025,14
3,2022-03-03 09:11:59.104192,59.40692,24.659006,59.381093,24.641652,0.756,9
4,2022-03-06 00:13:01.290346,59.43494,24.753641,59.489203,24.87617,2.271,0


**Feature Extraction**

In [10]:
import pandas as pd
from datetime import datetime
import holidays

# Assuming 'df' is your DataFrame and 'start_date' is a datetime column

# Initialize the holidays for Estonia
ee_holidays = holidays.country_holidays('EE')

# Function to determine if it's a weekend
def is_weekend(date):
    return date.weekday() >= 5  # 5 and 6 correspond to Saturday and Sunday

# Function to determine the time of day
def determine_time_of_day(hour):
    if 5 <= hour < 8:
        return 'Early Morning'
    elif 8 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 15:
        return 'Lunch'
    elif 15 <= hour < 18:
        return 'Afternoon'
    elif 18 <= hour < 20:
        return 'Evening'
    else:
        return 'Night'

# Function to check if a date is a holiday
def is_holiday(date):
    return date in ee_holidays

# Add new columns
data['is_weekend'] = data['start_time'].apply(is_weekend)
data['day_of_week'] = data['start_time'].dt.day_name()
data['is_holiday'] = data['start_time'].apply(is_holiday)
data['time_of_day'] = data['start_time'].dt.hour.apply(determine_time_of_day)
data.head(10)

Unnamed: 0,start_time,start_lat,start_lng,end_lat,end_lng,ride_value,start_time_hour,is_weekend,day_of_week,is_holiday,time_of_day
0,2022-03-06 15:02:39.329452,59.40791,24.689836,59.513027,24.83163,3.51825,15,True,Sunday,False,Afternoon
1,2022-03-10 11:15:55.177526,59.44165,24.762712,59.42645,24.783076,0.5075,11,False,Thursday,False,Morning
2,2022-03-06 14:23:33.893257,59.435404,24.749795,59.431901,24.761588,0.19025,14,True,Sunday,False,Lunch
3,2022-03-03 09:11:59.104192,59.40692,24.659006,59.381093,24.641652,0.756,9,False,Thursday,False,Morning
4,2022-03-06 00:13:01.290346,59.43494,24.753641,59.489203,24.87617,2.271,0,True,Sunday,False,Night
5,2022-03-02 07:17:34.858783,59.433606,24.712736,59.435205,24.748843,0.50275,7,False,Wednesday,False,Early Morning
6,2022-03-17 11:08:25.117959,59.39896,24.710864,59.440976,24.760222,1.352,11,False,Thursday,False,Morning
7,2022-03-18 14:34:56.333676,59.416808,24.799002,59.406496,24.683917,1.622,14,False,Friday,False,Lunch
8,2022-03-13 19:19:32.659761,59.432321,24.760523,59.423296,24.749209,0.2955,19,True,Sunday,False,Evening
9,2022-03-17 16:20:20.028387,59.410783,24.721219,59.439901,24.771756,1.06975,16,False,Thursday,False,Afternoon


In [11]:
data.query('is_holiday == True').head()

Unnamed: 0,start_time,start_lat,start_lng,end_lat,end_lng,ride_value,start_time_hour,is_weekend,day_of_week,is_holiday,time_of_day


In [12]:
data.start_time.min()

Timestamp('2022-03-01 00:00:07.936317')

In [13]:
data.start_time.max()

Timestamp('2022-03-28 23:59:53.175658')

In [14]:
%pip install geohash2

Collecting geohash2
  Downloading geohash2-1.1.tar.gz (15 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: geohash2
  Building wheel for geohash2 (setup.py) ... [?25l[?25hdone
  Created wheel for geohash2: filename=geohash2-1.1-py3-none-any.whl size=15544 sha256=bd7e75ddc5fdba31b29e1ac1319b56b2d3e01796fa5646f5ae41089f2424691b
  Stored in directory: /root/.cache/pip/wheels/c0/21/8d/fe65503f4f439aef35193e5ec10a14adc945e20ff87eb35895
Successfully built geohash2
Installing collected packages: geohash2
Successfully installed geohash2-1.1


**Cluster City Areas**

In [15]:
import pandas as pd
import geohash2

# Function to create geohash from longitude and latitude
def create_geohash(lat, lng, precision=5):
    full_geohash = geohash2.encode(lat, lng)
    return full_geohash[:precision]

# Example usage:
# Assuming you have a DataFrame 'data' with columns 'start_lat' and 'start_lng'
# Apply the function to each row and create a new column 'geohash'
data['geohash_start'] = data.apply(lambda row: create_geohash(row['start_lat'], row['start_lng']), axis=1)
data['geohash_end'] = data.apply(lambda row: create_geohash(row['end_lat'], row['end_lng']), axis=1)

# Display the first few rows of the DataFrame
data.head()

Unnamed: 0,start_time,start_lat,start_lng,end_lat,end_lng,ride_value,start_time_hour,is_weekend,day_of_week,is_holiday,time_of_day,geohash_start,geohash_end
0,2022-03-06 15:02:39.329452,59.40791,24.689836,59.513027,24.83163,3.51825,15,True,Sunday,False,Afternoon,ud99c,ud9dt
1,2022-03-10 11:15:55.177526,59.44165,24.762712,59.42645,24.783076,0.5075,11,False,Thursday,False,Morning,ud9d5,ud9d5
2,2022-03-06 14:23:33.893257,59.435404,24.749795,59.431901,24.761588,0.19025,14,True,Sunday,False,Lunch,ud9d5,ud9d5
3,2022-03-03 09:11:59.104192,59.40692,24.659006,59.381093,24.641652,0.756,9,False,Thursday,False,Morning,ud99c,ud99b
4,2022-03-06 00:13:01.290346,59.43494,24.753641,59.489203,24.87617,2.271,0,True,Sunday,False,Night,ud9d5,ud9dq


In [16]:
data['geohash_start'].value_counts().head(10)

geohash_start
ud9d5    224808
ud9d4    132631
ud99c     48413
ud9dh     44268
ud99f     43558
ud9dj     22562
ud9d1     21026
ud99b     13594
ud9d0     12963
ud9dn     11432
Name: count, dtype: int64

In [None]:
# consider only greater n areas
# Step 1: Get the three highest count 'geohash_start' values
cons_n=5
top_three_geohashes = data['geohash_start'].value_counts().head(cons_n).index.tolist()


# Step 2: Filter the DataFrame to include only rows where 'geohash_start' is in the top three
filtered_data = data[data['geohash_start'].isin(top_three_geohashes)]

filtered_data.head()

Unnamed: 0,start_time,start_lat,start_lng,end_lat,end_lng,ride_value,start_time_hour,is_weekend,day_of_week,is_holiday,time_of_day,geohash_start,geohash_end
1,2022-03-10 11:15:55.177526,59.44165,24.762712,59.42645,24.783076,0.5075,11,False,Thursday,False,Morning,ud9d5,ud9d5
2,2022-03-06 14:23:33.893257,59.435404,24.749795,59.431901,24.761588,0.19025,14,True,Sunday,False,Lunch,ud9d5,ud9d5
4,2022-03-06 00:13:01.290346,59.43494,24.753641,59.489203,24.87617,2.271,0,True,Sunday,False,Night,ud9d5,ud9dq
5,2022-03-02 07:17:34.858783,59.433606,24.712736,59.435205,24.748843,0.50275,7,False,Wednesday,False,Early Morning,ud9d4,ud9d5
8,2022-03-13 19:19:32.659761,59.432321,24.760523,59.423296,24.749209,0.2955,19,True,Sunday,False,Evening,ud9d5,ud9d5


In [None]:
filtered_data.shape

(357439, 13)

In [None]:
filtered_data['start_time'].dt.date.unique()

array([datetime.date(2022, 3, 10), datetime.date(2022, 3, 6),
       datetime.date(2022, 3, 2), datetime.date(2022, 3, 13),
       datetime.date(2022, 3, 26), datetime.date(2022, 3, 7),
       datetime.date(2022, 3, 28), datetime.date(2022, 3, 1),
       datetime.date(2022, 3, 27), datetime.date(2022, 3, 22),
       datetime.date(2022, 3, 9), datetime.date(2022, 3, 21),
       datetime.date(2022, 3, 18), datetime.date(2022, 3, 15),
       datetime.date(2022, 3, 25), datetime.date(2022, 3, 23),
       datetime.date(2022, 3, 16), datetime.date(2022, 3, 5),
       datetime.date(2022, 3, 8), datetime.date(2022, 3, 19),
       datetime.date(2022, 3, 12), datetime.date(2022, 3, 24),
       datetime.date(2022, 3, 11), datetime.date(2022, 3, 3),
       datetime.date(2022, 3, 14), datetime.date(2022, 3, 17),
       datetime.date(2022, 3, 4), datetime.date(2022, 3, 20)],
      dtype=object)

In [None]:
filtered_data['day'] = filtered_data['start_time'].dt.date
filtered_data['hour'] = filtered_data['start_time'].dt.hour

# Get the first X unique days sorted in ascending order
cons_days = len(filtered_data['day'].unique())
considered_days = sorted(filtered_data['day'].unique())[:cons_days]

# Filter the DataFrame for the first 8 days
filtered_data = filtered_data[filtered_data['day'].isin(considered_days)]

# Group the data by 'geohash_start', 'day', 'hour', and 'is_weekend'
grouped = filtered_data.groupby(['geohash_start', 'day', 'hour', 'is_weekend', 'day_of_week']).agg({
    'start_time': 'count',
    'ride_value': 'mean'
}).reset_index()

# Rename the columns
grouped.columns = ['geohash_start', 'day', 'hour', 'is_weekend', 'day_of_week', 'num_rides', 'avg_ride_value']

# Filter out days where any hour has zero rides to consider only complete time series
grouped = grouped[grouped['num_rides'] != 0]

# Reset the index
grouped = grouped.reset_index(drop=True)

# Filter out days that do not have all 24 hours with a value
days_with_all_hours = grouped.groupby(['geohash_start', 'day']).filter(lambda x: len(x['hour'].unique()) == 24)

# Reset the index
days_with_all_hours = days_with_all_hours.reset_index(drop=True)

grouped.head(500)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,geohash_start,day,hour,is_weekend,day_of_week,num_rides,avg_ride_value
0,ud9d4,2022-03-01,0,False,Tuesday,91,1.027250
1,ud9d4,2022-03-01,1,False,Tuesday,89,0.998199
2,ud9d4,2022-03-01,2,False,Tuesday,66,0.952383
3,ud9d4,2022-03-01,3,False,Tuesday,65,0.915919
4,ud9d4,2022-03-01,4,False,Tuesday,78,0.973538
...,...,...,...,...,...,...,...
495,ud9d4,2022-03-21,15,False,Monday,267,0.995265
496,ud9d4,2022-03-21,16,False,Monday,238,1.013950
497,ud9d4,2022-03-21,17,False,Monday,152,1.023213
498,ud9d4,2022-03-21,18,False,Monday,123,1.060986


In [None]:
grouped['day'].unique()

array([datetime.date(2022, 3, 1), datetime.date(2022, 3, 2),
       datetime.date(2022, 3, 3), datetime.date(2022, 3, 4),
       datetime.date(2022, 3, 5), datetime.date(2022, 3, 6),
       datetime.date(2022, 3, 7), datetime.date(2022, 3, 8),
       datetime.date(2022, 3, 9), datetime.date(2022, 3, 10),
       datetime.date(2022, 3, 11), datetime.date(2022, 3, 12),
       datetime.date(2022, 3, 13), datetime.date(2022, 3, 14),
       datetime.date(2022, 3, 15), datetime.date(2022, 3, 16),
       datetime.date(2022, 3, 17), datetime.date(2022, 3, 18),
       datetime.date(2022, 3, 19), datetime.date(2022, 3, 20),
       datetime.date(2022, 3, 21), datetime.date(2022, 3, 22),
       datetime.date(2022, 3, 23), datetime.date(2022, 3, 24),
       datetime.date(2022, 3, 25), datetime.date(2022, 3, 26),
       datetime.date(2022, 3, 27), datetime.date(2022, 3, 28)],
      dtype=object)

In [None]:
grouped[['day', 'hour', 'is_weekend', 'day_of_week', 'num_rides']].head(40).to_string()

'           day  hour  is_weekend day_of_week  num_rides\n0   2022-03-01     0       False     Tuesday         91\n1   2022-03-01     1       False     Tuesday         89\n2   2022-03-01     2       False     Tuesday         66\n3   2022-03-01     3       False     Tuesday         65\n4   2022-03-01     4       False     Tuesday         78\n5   2022-03-01     5       False     Tuesday        162\n6   2022-03-01     6       False     Tuesday        268\n7   2022-03-01     7       False     Tuesday        286\n8   2022-03-01     8       False     Tuesday        205\n9   2022-03-01     9       False     Tuesday        195\n10  2022-03-01    10       False     Tuesday        185\n11  2022-03-01    11       False     Tuesday        167\n12  2022-03-01    12       False     Tuesday        162\n13  2022-03-01    13       False     Tuesday        190\n14  2022-03-01    14       False     Tuesday        203\n15  2022-03-01    15       False     Tuesday        223\n16  2022-03-01    16       Fal

**EDA**

In [None]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Assuming 'grouped' is the DataFrame you have created after filtering

# Get unique geohash_start values
geohash_starts = grouped['geohash_start'].unique()

# Iterate over each geohash_start
for geohash in geohash_starts:
    # Filter the data for the current geohash_start
    geohash_data = grouped[grouped['geohash_start'] == geohash]

    # Get unique days for the current geohash_start
    days = geohash_data['day'].unique()

    # Iterate over each day
    for day in days:
        # Filter the data for the current day
        day_data = geohash_data[geohash_data['day'] == day]

        # Determine if the day is a weekend
        is_weekend = 'Weekend' if day_data['is_weekend'].any() else 'Weekday'

        # Determine the day of the week
        day_of_week = day_data['day_of_week'].iloc[0]  # Assuming all values are the same for the day

        # Create a figure with two subplots
        fig = make_subplots(rows=1, cols=2, subplot_titles=("Number of Rides", "Average Ride Value"))

        # Create the time series for num_rides
        fig.add_trace(go.Scatter(x=day_data['hour'], y=day_data['num_rides'], mode='lines+markers', name='Number of Rides'), row=1, col=1)

        # Create the time series for avg_ride_value
        fig.add_trace(go.Scatter(x=day_data['hour'], y=day_data['avg_ride_value'], mode='lines+markers', name='Average Ride Value'), row=1, col=2)

        # Update layout for both subplots
        fig.update_layout(title_text=f'Number of Rides and Average Ride Value for {geohash} on {day} ({is_weekend}, {day_of_week})')
        fig.update_xaxes(title_text="Hour", row=1, col=1)
        fig.update_xaxes(title_text="Hour", row=1, col=2)
        fig.update_yaxes(title_text="Number of Rides", row=1, col=1)
        fig.update_yaxes(title_text="Average Ride Value", row=1, col=2)

        # Show the plot
        fig.show()

In [None]:
#%pip install plotly --upgrade

In [None]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Assuming 'grouped' is the DataFrame you have created after filtering

# Get unique geohash_start values
geohash_starts = grouped['geohash_start'].unique()

# Define the order of days
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Iterate over each geohash_start
for geohash in geohash_starts:
    # Filter the data for the current geohash_start
    geohash_data = grouped[grouped['geohash_start'] == geohash]

    # Calculate the average ride counts and average ride values for each day of the week
    avg_rides_by_day = geohash_data.groupby('day_of_week')['num_rides'].mean().reset_index()
    avg_values_by_day = geohash_data.groupby('day_of_week')['avg_ride_value'].mean().reset_index()

    # Sort the data by day_of_week
    avg_rides_by_day['day_of_week'] = pd.Categorical(avg_rides_by_day['day_of_week'], categories=day_order, ordered=True)
    avg_values_by_day['day_of_week'] = pd.Categorical(avg_values_by_day['day_of_week'], categories=day_order, ordered=True)

    # Sort the dataframes
    avg_rides_by_day = avg_rides_by_day.sort_values('day_of_week')
    avg_values_by_day = avg_values_by_day.sort_values('day_of_week')

    # Create a figure with two subplots
    fig = make_subplots(rows=1, cols=2, subplot_titles=("Average Number of Rides", "Average Ride Value"))

    # Create the bar chart for average number of rides
    fig.add_trace(go.Bar(x=avg_rides_by_day['day_of_week'], y=avg_rides_by_day['num_rides'], name='Average Number of Rides'), row=1, col=1)

    # Create the bar chart for average ride value
    fig.add_trace(go.Bar(x=avg_values_by_day['day_of_week'], y=avg_values_by_day['avg_ride_value'], name='Average Ride Value'), row=1, col=2)

    # Update layout for both subplots
    fig.update_layout(title_text=f'Average Number of Rides and Average Ride Value for {geohash} by Day of the Week')
    fig.update_xaxes(title_text="Day of the Week", row=1, col=1)
    fig.update_xaxes(title_text="Day of the Week", row=1, col=2)
    fig.update_yaxes(title_text="Average Number of Rides", row=1, col=1)
    fig.update_yaxes(title_text="Average Ride Value", row=1, col=2)

    # Show the plot
    fig.show()

In [None]:
import pandas as pd
import plotly.graph_objects as go

# Assuming 'grouped' is the DataFrame you have created after filtering

# Get unique geohash_start values
geohash_starts = grouped['geohash_start'].unique()

# Iterate over each geohash_start
for geohash in geohash_starts:
    # Filter the data for the current geohash_start
    geohash_data = grouped[grouped['geohash_start'] == geohash]

    # Calculate the average and median number of rides and average and median ride values for each hour of the day
    avg_rides_by_hour = geohash_data.groupby('hour')['num_rides'].mean().reset_index()
    median_rides_by_hour = geohash_data.groupby('hour')['num_rides'].median().reset_index()
    avg_values_by_hour = geohash_data.groupby('hour')['avg_ride_value'].mean().reset_index()
    median_values_by_hour = geohash_data.groupby('hour')['avg_ride_value'].median().reset_index()

    # Create a figure for average and median number of rides
    fig_rides = go.Figure()
    fig_rides.add_trace(go.Bar(x=avg_rides_by_hour['hour'], y=avg_rides_by_hour['num_rides'], name='Average Number of Rides'))
    fig_rides.add_trace(go.Bar(x=median_rides_by_hour['hour'], y=median_rides_by_hour['num_rides'], name='Median Number of Rides'))
    fig_rides.update_layout(title_text=f'Average and Median Number of Rides for {geohash} by Hour of the Day', xaxis_title="Hour of the Day", yaxis_title="Number of Rides")
    fig_rides.show()

    # Create a figure for average and median ride value
    fig_values = go.Figure()
    fig_values.add_trace(go.Bar(x=avg_values_by_hour['hour'], y=avg_values_by_hour['avg_ride_value'], name='Average Ride Value'))
    fig_values.add_trace(go.Bar(x=median_values_by_hour['hour'], y=median_values_by_hour['avg_ride_value'], name='Median Ride Value'))
    fig_values.update_layout(title_text=f'Average and Median Ride Value for {geohash} by Hour of the Day', xaxis_title="Hour of the Day", yaxis_title="Ride Value")
    fig_values.show()

if the mean and median are different, it means that the data is not symmetrically distributed around the central value. This can occur for several reasons:

Outliers: If there are extreme values that are much larger or smaller than the rest of the data, they can pull the mean in one direction, while the median is not influenced as much by these outliers.

Skewness: If the data is skewed (not normally distributed), the mean will be pulled in the direction of the skew, while the median will be closer to the "middle" of the dataset.

Selection Bias: If the data has been intentionally selected to include certain values (e.g., only including values above a certain threshold), this can affect the mean but not the median.

**Forecasting**

We concentrate only on a city area showing that we can forecast the demand of taxi drivers on a particular day, time and location (the selected geohash
)

In [None]:
dataset = grouped[grouped['geohash_start'] == 'ud9d4']
dataset = dataset.drop('geohash_start', axis=1)
dataset.head(100)

Unnamed: 0,day,hour,is_weekend,day_of_week,num_rides,avg_ride_value
0,2022-03-01,0,False,Tuesday,91,1.027250
1,2022-03-01,1,False,Tuesday,89,0.998199
2,2022-03-01,2,False,Tuesday,66,0.952383
3,2022-03-01,3,False,Tuesday,65,0.915919
4,2022-03-01,4,False,Tuesday,78,0.973538
...,...,...,...,...,...,...
95,2022-03-04,23,False,Friday,154,1.018563
96,2022-03-05,0,True,Saturday,98,0.856862
97,2022-03-05,1,True,Saturday,80,1.092369
98,2022-03-05,2,True,Saturday,83,0.961630


In [None]:
dataset.day.unique()

array([datetime.date(2022, 3, 1), datetime.date(2022, 3, 2),
       datetime.date(2022, 3, 3), datetime.date(2022, 3, 4),
       datetime.date(2022, 3, 5), datetime.date(2022, 3, 6),
       datetime.date(2022, 3, 7), datetime.date(2022, 3, 8),
       datetime.date(2022, 3, 9), datetime.date(2022, 3, 10),
       datetime.date(2022, 3, 11), datetime.date(2022, 3, 12),
       datetime.date(2022, 3, 13), datetime.date(2022, 3, 14),
       datetime.date(2022, 3, 15), datetime.date(2022, 3, 16),
       datetime.date(2022, 3, 17), datetime.date(2022, 3, 18),
       datetime.date(2022, 3, 19), datetime.date(2022, 3, 20),
       datetime.date(2022, 3, 21), datetime.date(2022, 3, 22),
       datetime.date(2022, 3, 23), datetime.date(2022, 3, 24),
       datetime.date(2022, 3, 25), datetime.date(2022, 3, 26),
       datetime.date(2022, 3, 27), datetime.date(2022, 3, 28)],
      dtype=object)

In [None]:
import pandas as pd
import plotly.express as px

def plot_time_series(dataset, day_column, hour_column, y_column):
    # Create a copy of the DataFrame to avoid modifying the original data
    dataset_copy = dataset.copy()

    # Convert 'day' and 'hour' to a datetime format
    dataset_copy['datetime'] = pd.to_datetime(dataset_copy[day_column]) + pd.to_timedelta(dataset_copy[hour_column], unit='h')

    # Set 'datetime' as the index
    dataset_copy.set_index('datetime', inplace=True)

    # Drop the original 'day' and 'hour' columns
    dataset_copy.drop([day_column, hour_column], axis=1, inplace=True)

    # Create the plot
    fig = px.line(dataset_copy, y=y_column)

    # Customize the layout
    fig.update_layout(
        title=f'{y_column} Over Time',
        xaxis_title='Date and Time',
        yaxis_title=y_column,
        hovermode='x unified',  # Show date and time when hovering
        hoverlabel=dict(namelength=-1)  # Show all values in hover
    )

    # Show the plot
    fig.show()

# Example usage:
plot_time_series(dataset, 'day', 'hour', 'num_rides')

In [None]:
dataset.index

Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       662, 663, 664, 665, 666, 667, 668, 669, 670, 671],
      dtype='int64', length=672)

In [None]:
dataset.head().to_string()

'          day  hour  is_weekend day_of_week  num_rides  avg_ride_value\n0  2022-03-01     0       False     Tuesday         91        1.027250\n1  2022-03-01     1       False     Tuesday         89        0.998199\n2  2022-03-01     2       False     Tuesday         66        0.952383\n3  2022-03-01     3       False     Tuesday         65        0.915919\n4  2022-03-01     4       False     Tuesday         78        0.973538'

In [None]:
import pandas as pd

def create_train_test_set(dataset, train_size_percentage):
    # Convert 'day' and 'hour' to a datetime format
    dataset['datetime'] = pd.to_datetime(dataset['day']) + pd.to_timedelta(dataset['hour'], unit='h')

    # Set 'datetime' as the index
    dataset.set_index('datetime', inplace=True)

    # Drop the original 'day' and 'hour' columns
    #dataset.drop(['day', 'hour'], axis=1, inplace=True)

    # Calculate the number of days in the dataset
    total_days = (dataset.index.max() - dataset.index.min()).days + 1

    # Calculate the number of days for the training set
    train_days = int(total_days * train_size_percentage / 100)

    # Calculate the cutoff date based on the number of days for the training set
    cutoff_date = dataset.index.min() + pd.DateOffset(days=train_days)

    # Split the data into train and test sets
    train = dataset.loc[dataset.index < cutoff_date]
    test = dataset.loc[dataset.index >= cutoff_date]

    # Ensure that we have full days in the training set
    train_days_count = train.resample('D').count()['num_rides']
    while train_days_count.iloc[-1] != 24:
        cutoff_date -= pd.DateOffset(days=1)
        train = dataset.loc[dataset.index < cutoff_date]
        test = dataset.loc[dataset.index >= cutoff_date]
        train_days_count = train.resample('D').count()['num_rides']

    # Ensure that we have full days in the testing set
    test_days_count = test.resample('D').count()['num_rides']
    while test_days_count.iloc[0] != 24:
        cutoff_date += pd.DateOffset(days=1)
        train = dataset.loc[dataset.index < cutoff_date]
        test = dataset.loc[dataset.index >= cutoff_date]
        test_days_count = test.resample('D').count()['num_rides']

    return train, test

# Example usage:
train, test = create_train_test_set(dataset, 80)

In [None]:
train.head(100)

Unnamed: 0_level_0,day,hour,is_weekend,day_of_week,num_rides,avg_ride_value
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2022-03-01 00:00:00,2022-03-01,0,False,Tuesday,91,1.027250
2022-03-01 01:00:00,2022-03-01,1,False,Tuesday,89,0.998199
2022-03-01 02:00:00,2022-03-01,2,False,Tuesday,66,0.952383
2022-03-01 03:00:00,2022-03-01,3,False,Tuesday,65,0.915919
2022-03-01 04:00:00,2022-03-01,4,False,Tuesday,78,0.973538
...,...,...,...,...,...,...
2022-03-04 23:00:00,2022-03-04,23,False,Friday,154,1.018563
2022-03-05 00:00:00,2022-03-05,0,True,Saturday,98,0.856862
2022-03-05 01:00:00,2022-03-05,1,True,Saturday,80,1.092369
2022-03-05 02:00:00,2022-03-05,2,True,Saturday,83,0.961630


In [None]:
import pandas as pd
import plotly.graph_objs as go

def plot_time_series(dataset, day_column, hour_column, y_column, label):
    # Convert 'day' and 'hour' to a datetime format
    dataset['datetime'] = pd.to_datetime(dataset[day_column]) + pd.to_timedelta(dataset[hour_column], unit='h')

    # Set 'datetime' as the index
    dataset.set_index('datetime', inplace=True)

    # Drop the original 'day' and 'hour' columns
    dataset.drop([day_column, hour_column], axis=1, inplace=True)

    # Create the plot
    trace = go.Scatter(x=dataset.index, y=dataset[y_column], mode='lines', name=label)
    return trace

# Example usage with train and test sets:
train_trace = plot_time_series(train, 'day', 'hour', 'num_rides', 'Train')
test_trace = plot_time_series(test, 'day', 'hour', 'num_rides', 'Test')

# Combine the traces into a single plot
fig = go.Figure(data=[train_trace, test_trace])

# Customize the layout
fig.update_layout(
    title='Number of Rides Over Time',
    xaxis_title='Date and Time',
    yaxis_title='Number of Rides',
    hovermode='x unified',  # Show date and time when hovering
    hoverlabel=dict(namelength=-1),  # Show all values in hover
    showlegend=True  # Show the legend
)

# Show the plot
fig.show()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



**Simulate production**

In [None]:
# Tune the ARIMA model parameters only on the training set
model = auto_arima(train['num_rides'], start_p=1, start_q=1,
                   max_p=5, max_q=5, m=1,
                   start_P=0, seasonal=False,
                   d=None, D=1, trace=True,
                   error_action='ignore',
                   suppress_warnings=True,
                   stepwise=True)

incremental_train = train.copy()

# Fit the model on the training data
model.fit(incremental_train['num_rides'])

# Initialize an empty DataFrame to store the predictions
predictions = pd.DataFrame(index=test.index, columns=['num_rides_predicted'])

# Initialize an empty DataFrame to store the actual values
actuals = pd.DataFrame(index=test.index, columns=['num_rides_actual'])

# Iterate over the test set, making predictions and updating the model
for t in range(len(test)):
    # Forecast the next hour
    forecast, conf_int = model.predict(n_periods=1, return_conf_int=True)

    # Store the prediction and actual value
    predictions.loc[test.index[t], 'num_rides_predicted'] = forecast[0]
    actuals.loc[test.index[t], 'num_rides_actual'] = test.iloc[t]['num_rides']

    # Print the prediction and actual value
    print(f"Predicted: {forecast[0]}, Actual: {test.iloc[t]['num_rides']}")

    # Assign the actual value to the training set for the next iteration
    # Ensure the new row has the same columns as the train DataFrame
    new_row = test.iloc[t].copy()
    incremental_train.loc[test.index[t]] = new_row

    # Retrain the model on the updated training set with the same parameters
    model.fit(incremental_train['num_rides'])

# Plot the training time series, predicted values, and actual values
train_plot = go.Scatter(x=incremental_train.index, y=incremental_train['num_rides'], name='Train', mode='lines')
predictions_plot = go.Scatter(x=predictions.index, y=predictions['num_rides_predicted'], name='Predictions', mode='lines')
actuals_plot = go.Scatter(x=actuals.index, y=actuals['num_rides_actual'], name='Actuals', mode='lines')

layout = go.Layout(title='ARIMA Forecast in Production Environment',
                   xaxis=dict(title='Date'),
                   yaxis=dict(title='Number of Rides'))

fig = go.Figure(data=[train_plot, predictions_plot, actuals_plot], layout=layout)
fig.show()

Performing stepwise search to minimize aic
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=5528.316, Time=0.26 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=5598.610, Time=0.03 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=5534.874, Time=0.09 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=5530.201, Time=0.24 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=5596.620, Time=0.06 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=inf, Time=1.99 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=1.92 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=5519.323, Time=0.84 sec
 ARIMA(0,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=0.82 sec
 ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=0.66 sec
 ARIMA(0,1,2)(0,0,0)[0]             : AIC=5517.326, Time=0.12 sec
 ARIMA(0,1,1)(0,0,0)[0]             : AIC=5528.206, Time=0.09 sec
 ARIMA(1,1,2)(0,0,0)[0]             : AIC=inf, Time=0.29 sec
 ARIMA(0,1,3)(0,0,0)[0]             : AIC=5427.252, Time=0.41 sec
 ARIMA(1,1,3)(0,0,0)[0]             : AIC=inf, Time=0.77 s

In [None]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def calculate_forecast_metrics(actuals, predictions):
    """
    Calculate various forecasting metrics.

    Parameters:
    actuals (pandas Series): Series of actual values.
    predictions (pandas Series): Series of predicted values.

    Returns:
    dict: Dictionary containing the calculated metrics.
    """
    # Calculate MAE
    mae = mean_absolute_error(actuals, predictions)

    # Calculate Bias
    bias = np.mean(predictions - actuals)

    # Calculate MSE
    mse = mean_squared_error(actuals, predictions)

    # Calculate RMSE
    rmse = np.sqrt(mse)

    # Calculate MAPE
    mape = np.mean(np.abs((actuals - predictions) / actuals)) * 100

    # Return the metrics in a dictionary
    return {
        'MAE': mae,
        'Bias': bias,
        'MSE': mse,
        'RMSE': rmse,
        'MAPE': mape
    }

# Example usage:
# Assuming 'actuals' and 'predictions' are pandas Series with the actual and predicted values
metrics = calculate_forecast_metrics(actuals['num_rides_actual'], predictions['num_rides_predicted'])


# MAE: 39.62837477404839
# Bias: -1.5267489120483067
# MSE: 2582.5292864608473
# RMSE: 50.818591937015015
# MAPE: 22.111911033761086

# Print the metrics
for metric, value in metrics.items():
    print(f"{metric}: {value}")

MAE: 39.62837477404839
Bias: -1.5267489120483067
MSE: 2582.5292864608473
RMSE: 50.818591937015015
MAPE: 22.111911033761086


In [None]:
# Tune the ARIMA model parameters only on the training set
# Include exogenous variables 'is_weekend' and 'day_of_week'
model = auto_arima(train['num_rides'], exogenous=train[['is_weekend', 'day_of_week']],
                   start_p=1, start_q=1, max_p=3, max_q=3, m=1,
                   start_P=0, seasonal=True,
                   d=None, D=1, trace=True,
                   error_action='ignore',
                   suppress_warnings=True,
                   stepwise=True)


incremental_train = train.copy()

# Fit the model on the training data
model.fit(incremental_train['num_rides'], exogenous=incremental_train[['is_weekend', 'day_of_week']])

# Initialize an empty DataFrame to store the predictions
predictions = pd.DataFrame(index=test.index, columns=['num_rides_predicted'])

# Initialize an empty DataFrame to store the actual values
actuals = pd.DataFrame(index=test.index, columns=['num_rides_actual'])

# Iterate over the test set, making predictions and updating the model
for t in range(len(test)):
    # Forecast the next hour with exogenous variables
    forecast, conf_int = model.predict(n_periods=1,
                                       exogenous=test[['is_weekend', 'day_of_week']].iloc[t:t+1],
                                       return_conf_int=True)

    # Store the prediction and actual value
    predictions.loc[test.index[t], 'num_rides_predicted'] = forecast[0]
    actuals.loc[test.index[t], 'num_rides_actual'] = test.iloc[t]['num_rides']

    # Print the prediction and actual value
    print(f"Predicted: {forecast[0]}, Actual: {test.iloc[t]['num_rides']}")

    # Assign the actual value to the training set for the next iteration
    new_row = test.iloc[t].copy()
    incremental_train.loc[test.index[t]] = new_row

    # Retrain the model on the updated training set with the same parameters
    model.fit(incremental_train['num_rides'], exogenous=incremental_train[['is_weekend', 'day_of_week']])

# Plot the training time series, predicted values, and actual values
train_plot = go.Scatter(x=incremental_train.index, y=incremental_train['num_rides'], name='Train', mode='lines')
predictions_plot = go.Scatter(x=predictions.index, y=predictions['num_rides_predicted'], name='Predictions', mode='lines')
actuals_plot = go.Scatter(x=actuals.index, y=actuals['num_rides_actual'], name='Actuals', mode='lines')

layout = go.Layout(title='ARIMA Forecast in Production Environment',
                   xaxis=dict(title='Date'),
                   yaxis=dict(title='Number of Rides'))

fig = go.Figure(data=[train_plot, predictions_plot, actuals_plot], layout=layout)
fig.show()

Performing stepwise search to minimize aic
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=5528.316, Time=0.58 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=5598.610, Time=0.03 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=5534.874, Time=0.12 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=5530.201, Time=0.40 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=5596.620, Time=0.06 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=inf, Time=1.96 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=1.63 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=5519.323, Time=0.43 sec
 ARIMA(0,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=0.43 sec
 ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=0.59 sec
 ARIMA(0,1,2)(0,0,0)[0]             : AIC=5517.326, Time=0.13 sec
 ARIMA(0,1,1)(0,0,0)[0]             : AIC=5528.206, Time=0.09 sec
 ARIMA(1,1,2)(0,0,0)[0]             : AIC=inf, Time=0.27 sec
 ARIMA(0,1,3)(0,0,0)[0]             : AIC=5427.252, Time=0.52 sec
 ARIMA(1,1,3)(0,0,0)[0]             : AIC=inf, Time=1.92 s

In [None]:
# Example usage:
# Assuming 'actuals' and 'predictions' are pandas Series with the actual and predicted values
metrics = calculate_forecast_metrics(actuals['num_rides_actual'], predictions['num_rides_predicted'])

# previous
# MAE: 39.62837477404839
# Bias: -1.5267489120483067
# MSE: 2582.5292864608473
# RMSE: 50.818591937015015
# MAPE: 22.111911033761086

# Print the metrics
for metric, value in metrics.items():
    print(f"{metric}: {value}")

MAE: 39.62837477404839
Bias: -1.5267489120483067
MSE: 2582.5292864608473
RMSE: 50.818591937015015
MAPE: 22.111911033761086


In [None]:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from pmdarima import auto_arima
import plotly.graph_objects as go
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

# Assuming 'train' and 'test' are your DataFrames and 'is_weekend' and 'day_of_week' are your exogenous variables


incremental_train = train.copy()

# Convert 'day_of_week' to categorical type and then to numerical type
# First, ensure 'day_of_week' is a string type
incremental_train['day_of_week'] = incremental_train['day_of_week'].astype(str)
incremental_train['day_of_week'] = test['day_of_week'].astype(str)

# Convert 'day_of_week' to categorical type
incremental_train['day_of_week'] = pd.Categorical(train['day_of_week'])
test['day_of_week'] = pd.Categorical(test['day_of_week'])

# Now, convert the categorical type to numerical codes
incremental_train['day_of_week'] = incremental_train['day_of_week'].cat.codes
test['day_of_week'] = test['day_of_week'].cat.codes


incremental_train['is_weekend'] = incremental_train['is_weekend'].astype(int)

# Tune the ARIMA model parameters only on the training set
# Include exogenous variables 'is_weekend' and 'day_of_week'
model = auto_arima(incremental_train['num_rides'], exogenous=incremental_train[['is_weekend', 'day_of_week']],
                   start_p=1, start_q=1, max_p=3, max_q=3, m=1,
                   start_P=0, seasonal=True,
                   d=None, D=1, trace=True,
                   error_action='ignore',
                   suppress_warnings=True,
                   stepwise=True)

# Fit the model on the training data
model.fit(incremental_train['num_rides'], exogenous=incremental_train[['is_weekend', 'day_of_week']])

# Initialize an empty DataFrame to store the predictions
predictions = pd.DataFrame(index=test.index, columns=['num_rides_predicted'])

# Initialize an empty DataFrame to store the actual values
actuals = pd.DataFrame(index=test.index, columns=['num_rides_actual'])

# Iterate over the test set, making predictions and updating the model
for t in range(len(test)):
    # Forecast the next hour with exogenous variables
    forecast, conf_int = model.predict(n_periods=1,
                                       exogenous=test[['is_weekend', 'day_of_week']].iloc[t:t+1],
                                       return_conf_int=True)

    # Store the prediction and actual value
    predictions.loc[test.index[t], 'num_rides_predicted'] = forecast[0]
    actuals.loc[test.index[t], 'num_rides_actual'] = test.iloc[t]['num_rides']

    # Print the prediction and actual value
    print(f"Predicted: {forecast[0]}, Actual: {test.iloc[t]['num_rides']}")

    # Assign the actual value to the training set for the next iteration
    new_row = test.iloc[t].copy()
    incremental_train.loc[test.index[t]] = new_row


    # Retrain the model on the updated training set with the same parameters
    model.fit(incremental_train['num_rides'], exogenous=incremental_train[['is_weekend', 'day_of_week']])

# Plot the training time series, predicted values, and actual values
train_plot = go.Scatter(x=incremental_train.index, y=incremental_train['num_rides'], name='Train', mode='lines')
predictions_plot = go.Scatter(x=predictions.index, y=predictions['num_rides_predicted'], name='Predictions', mode='lines')
actuals_plot = go.Scatter(x=actuals.index, y=actuals['num_rides_actual'], name='Actuals', mode='lines')

layout = go.Layout(title='ARIMA Forecast in Production Environment',
                   xaxis=dict(title='Date'),
                   yaxis=dict(title='Number of Rides'))

fig = go.Figure(data=[train_plot, predictions_plot, actuals_plot], layout=layout)
fig.show()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Performing stepwise search to minimize aic
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=5528.316, Time=0.27 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=5598.610, Time=0.03 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=5534.874, Time=0.08 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=5530.201, Time=0.19 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=5596.620, Time=0.03 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=inf, Time=0.95 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=1.45 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=5519.323, Time=0.90 sec
 ARIMA(0,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=0.84 sec
 ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=1.17 sec
 ARIMA(0,1,2)(0,0,0)[0]             : AIC=5517.326, Time=0.28 sec
 ARIMA(0,1,1)(0,0,0)[0]             : AIC=5528.206, Time=0.22 sec
 ARIMA(1,1,2)(0,0,0)[0]             : AIC=inf, Time=0.98 sec
 ARIMA(0,1,3)(0,0,0)[0]             : AIC=5427.252, Time=1.12 sec
 ARIMA(1,1,3)(0,0,0)[0]             : AIC=inf, Time=0.90 s

In [None]:
# previous
# MAE: 39.62837477404839
# Bias: -1.5267489120483067
# MSE: 2582.5292864608473
# RMSE: 50.818591937015015
# MAPE: 22.111911033761086

# Print the metrics
for metric, value in metrics.items():
    print(f"{metric}: {value}")

MAE: 39.62837477404839
Bias: -1.5267489120483067
MSE: 2582.5292864608473
RMSE: 50.818591937015015
MAPE: 22.111911033761086


**Mean Absolute Error (MAE)**: This is the average of the absolute differences between the predicted and actual values. It gives an idea of the magnitude of the error, but no idea of the direction (e.g., over- or under-estimation). A lower MAE is better.

**Bias: **This is the difference between the average prediction and the actual values. A positive bias means that the model tends to over-predict, while a negative bias means it tends to under-predict. A bias close to zero is generally desirable.

**Mean Squared Error (MSE):** This is similar to MAE, but squares the differences before averaging them. This means that MSE is more sensitive to large errors because the square function increases the size of errors. A lower MSE is better.

**Root Mean Squared Error (RMSE):** This is the square root of the MSE. It's on the same scale as the original data, which can be more interpretable. A lower RMSE is better.

**Mean Absolute Percentage Error (MAPE):** This is the average of the absolute percentage differences between the predicted and actual values. It's often used when you want to measure the forecast error in terms of relative rather than absolute terms. A lower MAPE is better.

Here's when to use each one:

**MAE:** Use when you want a simple measure of the average error magnitude.

**Bias:** Use when you want to understand if the model tends to over- or under-predict.

**MSE/RMSE:** Use when large errors are particularly undesirable.

**MAPE:** Use when you want to measure the forecast error in terms of relative rather than absolute terms.

Sure, here's how you could use AWS cloud platform for your ride-hailing service:

**AWS Cloud Platform Usage**

1. **Data Storage**: You can use AWS S3 for storing your data. S3 is a scalable and durable storage service that allows you to store and retrieve any amount of data from anywhere on the web.

2. **Data Processing**: AWS provides several services for data processing. For example, you can use AWS Lambda for serverless computing, AWS Glue for ETL jobs, and AWS EMR for big data processing.

3. **Model Training**: You can use AWS SageMaker for building, training, and deploying machine learning models. SageMaker provides a Jupyter notebook instance that you can use to prepare your data and train your models.

4. **Model Deployment**: After training your model, you can use AWS Elastic Beanstalk or AWS Lambda to deploy your model. Elastic Beanstalk is a service for deploying and scaling web applications and services, while AWS Lambda lets you run your code without provisioning or managing servers.

5. **Real-time Predictions**: For real-time predictions, you can use AWS Kinesis for real-time data streaming and AWS Lambda for serverless computing.

**Possible Architecture**

1. **Data Collection**: Use AWS Kinesis to collect real-time data from your drivers and riders.

2. **Data Storage**: Store the data in AWS S3 for later analysis.

3. **Data Processing**: Use AWS Glue to clean and preprocess the data.

4. **Model Training**: Train your model using AWS SageMaker.

5. **Model Deployment**: Deploy your model using AWS Elastic Beanstalk or AWS Lambda.

6. **Real-time Predictions**: Use AWS Kinesis to process real-time data and AWS Lambda to make predictions in real-time.

**A/B Testing**

You can use AWS CloudWatch for monitoring and logging your application. For A/B testing, you can use AWS SageMaker A/B Testing feature, which allows you to compare the performance of different models or different versions of the same model.

**Defining Reasonable Metrics for Taxi Drivers**

1. **Ride Completion Rate**: The percentage of rides that are completed successfully.

2. **Average Ride Time**: The average time taken for a ride to be completed.

3. **Cancellation Rate**: The percentage of rides that are cancelled.

4. **Customer Satisfaction Score**: A score based on customer feedback and ratings.

5. **Driver Utilization**: The percentage of time that a driver is actively driving a ride.

Remember, the choice of metrics will depend on your specific business requirements and goals.


Sure, here is the information rewritten in Markdown format:

1. **Explore the data and suggest a solution to guide the drivers towards areas with higher expected demand at given time and location**

The data provided includes the start time, start and end latitude and longitude, and the ride value. To suggest a solution, we can use time series analysis to predict the demand at different times and locations. We can use ARIMA or other time series models to forecast the demand based on historical data.

We can also use geospatial analysis to understand the demand distribution across different areas. This can be done using techniques like heatmaps or kernel density estimation.

2. **Build and document a baseline model for your solution**

The baseline model for this solution could be a simple ARIMA model. The model would predict the number of rides at a given time and location based on historical data. The model would be trained on the past data and then used to predict the future demand.

3. **Describe how you would design and deploy such a model**

The model would be designed and deployed in the following steps:

- Data Preprocessing: Clean the data, handle missing values, and convert categorical variables to numerical variables if necessary.
- Feature Engineering: Create new features that might be useful for the model, such as time-related features (hour of the day, day of the week) and location-related features (distance between pick-up and drop-off points).
- Model Training: Train the model on the historical data. Use a validation set to tune the model's parameters.
- Model Evaluation: Evaluate the model's performance using appropriate metrics (e.g., RMSE, MAE).
- Model Deployment: Deploy the model in a production environment where it can make predictions in real-time.
- Monitoring and Updating: Continuously monitor the model's performance and retrain it with new data as needed.

4. **Describe how to communicate model recommendations to drivers**

The model can provide recommendations to drivers in the form of predicted demand at different times and locations. Drivers can use this information to plan their routes and schedules. For example, if the model predicts high demand at a certain time and location, drivers can choose to take a ride there.

5. **Think through and describe the design of the experiment that would validate your solution for live operations taking into account marketplace specifics**

The experiment could be a simulation where a portion of the real data is used as a test set and the model's performance is evaluated. The experiment could be designed in the following way:

- Split the data into training and test sets. The test set could be a portion of the data that is not used for training the model.
- Train the model on the training set.
- Use the model to predict the number of rides in the test set.
- Compare the predicted values with the actual values in the test set.
- Calculate the performance metrics (e.g., RMSE, MAE) to evaluate the model's performance.

This experiment would help validate the model's performance and provide insights into how well it can predict demand in a live marketplace setting.
