# Cleaning the data

In [10]:
import pandas as pd

df = pd.read_csv("../data/raw/taxi_trip_pricing.csv")
df.head(3)

| | Trip_Distance_km | Time_of_Day | Day_of_Week | Passenger_Count | Traffic_Conditions | Weather | Base_Fare | Per_Km_Rate | Per_Minute_Rate | Trip_Duration_Minutes | Trip_Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.35 | Morning | Weekday | 3.0 | Low | Clear | 3.56 | 0.80 | 0.32 | 53.82 | 36.2624 |
| 1 | 47.59 | Afternoon | Weekday | 1.0 | High | Clear | NaN | 0.62 | 0.43 | 40.57 | NaN |
| 2 | 36.87 | Evening | Weekend | 1.0 | High | Clear | 2.70 | 1.21 | 0.15 | 37.27 | 52.9032 |


### Lowercasing the column names, stripping surrounding whitespace, and replacing spaces with underscores

In [11]:
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)
df.columns

Index(['trip_distance_km', 'time_of_day', 'day_of_week', 'passenger_count',
       'traffic_conditions', 'weather', 'base_fare', 'per_km_rate',
       'per_minute_rate', 'trip_duration_minutes', 'trip_price'],
      dtype='object')
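
To make the effect of each step in the chain visible, here is a minimal sketch on a toy frame. The column names below are made up for illustration (the real file already uses underscores); `.str.strip()` removes leading and trailing whitespace, `.str.lower()` lowercases, and `.str.replace(" ", "_")` handles any spaces left inside a name.

```python
import pandas as pd

# Toy frame with deliberately messy, made-up column names (not from the taxi file).
toy = pd.DataFrame(columns=[" Trip Price ", "Time_of_Day", "Base Fare"])

# Same chain as in the cell above: strip, lowercase, then replace inner spaces.
toy.columns = (
    toy.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

cleaned = list(toy.columns)
print(cleaned)
```

The order matters: stripping first means padding around a name does not turn into leading or trailing underscores.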

### Since the NaN values were evenly distributed, I drop every row that contains one

In [12]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 562 entries, 0 to 998
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   trip_distance_km       562 non-null    float64
 1   time_of_day            562 non-null    object 
 2   day_of_week            562 non-null    object 
 3   passenger_count        562 non-null    float64
 4   traffic_conditions     562 non-null    object 
 5   weather                562 non-null    object 
 6   base_fare              562 non-null    float64
 7   per_km_rate            562 non-null    float64
 8   per_minute_rate        562 non-null    float64
 9   trip_duration_minutes  562 non-null    float64
 10  trip_price             562 non-null    float64
dtypes: float64(7), object(4)
memory usage: 52.7+ KB
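
Before a blanket `dropna()`, it can help to look at how the missing values are spread across columns and how many rows survive. The following is only a sketch on a toy frame with made-up values, not the check that was actually run during the EDA.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the taxi data (values are made up for illustration).
toy = pd.DataFrame({
    "trip_distance_km": [19.35, 47.59, np.nan, 12.0],
    "base_fare": [3.56, np.nan, 2.70, 3.10],
    "trip_price": [36.26, np.nan, 52.90, 20.15],
})

# Missing values per column: similar counts suggest no single column dominates.
missing_per_column = toy.isna().sum()

# Number and share of rows that survive dropping every row with a NaN.
rows_kept = len(toy.dropna())
share_kept = rows_kept / len(toy)

print(missing_per_column)
print(f"rows kept: {rows_kept} of {len(toy)} ({share_kept:.0%})")
```

Even when each column is missing only a few values, the rows lost add up because a row is dropped if any one of its columns is missing, so the share of rows kept is worth checking alongside the per-column counts.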


### From the EDA, I decided to drop some columns
(They made little to no difference to the final price.)

However, I first save the new "cleaned" data to a separate file, so I can try training on it later and compare.

In [13]:
df.to_csv("../data/processed/taxi_clean_full.csv", index=False)

And a version without the extra columns:

In [14]:
columns_to_drop = [
    "base_fare",
    "passenger_count",
    "per_minute_rate",
]
# `columns=` already selects the axis, so `axis=1` is redundant
df_dropped = df.drop(columns=columns_to_drop)
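
The reduced frame can then be written out the same way as the full one. This is a minimal sketch on a toy frame with made-up values; the file name `taxi_clean_reduced.csv` is a hypothetical choice, not one taken from the notebook, and a temporary directory is used so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with made-up values; only a few of the real columns are included.
toy = pd.DataFrame({
    "trip_distance_km": [19.35, 36.87],
    "passenger_count": [3.0, 1.0],
    "base_fare": [3.56, 2.70],
    "per_minute_rate": [0.32, 0.15],
    "trip_price": [36.2624, 52.9032],
})

columns_to_drop = ["base_fare", "passenger_count", "per_minute_rate"]
reduced = toy.drop(columns=columns_to_drop)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; in the notebook this would go under ../data/processed/.
    out_path = Path(tmp) / "taxi_clean_reduced.csv"
    reduced.to_csv(out_path, index=False)

    # Read the file back to confirm that only the kept columns were written.
    reloaded = pd.read_csv(out_path)

remaining = list(reloaded.columns)
print(remaining)
```

Keeping the full and reduced files side by side makes it easy to train on both later and compare the results.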