# Spring 2025 Data Science Project

By: Katherine, Sayee, Yiran, Asmita

## Contributions

## Introduction

## Data Curation

In [74]:
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

#importing dataset from file
df = pd.read_parquet('yellow_tripdata_2024-01.parquet')

#cleaning dataset by removing entriees with missing data or inappropiate dates
df = df.dropna()
date = pd.Timestamp(2024, 1, 1)
df = df[df['tpep_pickup_datetime'] >= date]
date = pd.Timestamp(2024, 2, 1)
df = df[df['tpep_pickup_datetime'] < date]


Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'Airport_fee'],
      dtype='object')

## Exploratory Data Analysis

In [75]:
#port the stuff from the other document

## Primary Analysis

### An Evaluation of Factors Impacting Tips

Here we are applying some filtering to ensure the exclusion of problematic data like significant outliers in the trip distance and negative fares. We are the creating a column with a calculation of the tip as a percentage of the total fare. In this section we are evaluating a few models for predicting this tip percentage based on a few other features. For the purpose of not overtaxing my computer and waiting 50 hours for a result, we are only using a sample of the whole data set. This is because I would like to look at how accurate Random Forest is at fitting the data, but I simply cannot get it to run on the whole dataset of over 2000000 points. 

In [155]:
from sklearn.model_selection import train_test_split
df = df.dropna()
date = pd.Timestamp(2024, 1, 1)
df = df[df['tpep_pickup_datetime'] >= date]
date = pd.Timestamp(2024, 2, 1)
df = df[df['tpep_pickup_datetime'] < date]

dff = df[df['trip_distance'] <= 800]
dff = dff[dff['tip_amount'] >= 0]
dff = dff[dff['total_amount'] > 0]
dff['tip_percent'] = dff['tip_amount']/dff['total_amount']
dff.dropna()
dft = dff.sample(10000)


In this section we are selecting the features we wish to evaluate and creating a training set. We are also scaling the data to ensure that none are unduely weighted in training.

In [147]:
from sklearn import preprocessing
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance



features = ['trip_distance', 'total_amount',
       'tolls_amount', 'PULocationID']
X = dft[features]
Y = dft['tip_percent']

random_state = 42
test_size = 0.5
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size= test_size, random_state=random_state)

scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
scaler = preprocessing.StandardScaler().fit(X_test)
X_test_scaled = scaler.transform(X_test)



Here we are evaluating the accuracy of three models: K nearest neighbors, Decision Tree, and Random Forest.

In [149]:
models = {
    "KNN": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
         }

for model_name, model in models.items():
    np.random.seed(42)
    model.fit(X_train_scaled, y_train)

for model_name, model in models.items():
    np.random.seed(42)
    y_pred = model.predict(X_test_scaled)
    accuracy = model.score(X_test_scaled, y_test)
    print(f"Accuracy of {model_name}: {accuracy:.3f}") # Your accuracy table header here


Accuracy of KNN: 0.183
Accuracy of Decision Tree: -0.402
Accuracy of Random Forest: 0.235


From these results we see that Random Forest has the highest accuracy of the tested models. K nearest neighbors is close behind while Decision Tree actually has negative accuracy. Unfortunately Random Forest is not very efficient on the particular machines we are using so we will proceed with K nearest neighbors. 


We sill now fit the selected model on the entirety of the dataset.

In [162]:
X = dff[features]
Y = dff['tip_percent']

random_state = 42
test_size = 0.5
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size= test_size, random_state=random_state)

scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
scaler = preprocessing.StandardScaler().fit(X_test)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsRegressor()
fit = knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

In [None]:
from sklearn.inspection import permutation_importance
result = permutation_importance(fit, X_train_scaled, y_train, n_repeats=10,
                                random_state=0)
importance = result.importances
print(result.importances)
feature_importance_df = pd.DataFrame({'feature':X_train.columns, 'importance':importance})
print(feature_importance_df.sort_values('importance', ascending = False))



In [None]:
plt.pie(feature_importance_df[['importance']], labels=feature_importance_df[['feature']], colors=colors, shadow=True)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Effect of various features on tip percentage')
plt.show()

In [None]:
sorted_importances_idx = importance.importances_mean.argsort()
importances = pd.DataFrame(
    result.importances[sorted_importances_idx].T,
    columns=X.columns[sorted_importances_idx],
)
ax = importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances (test set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in accuracy score")
ax.figure.tight_layout()

## Visualization

## Conclusion

Stuff written so nicely...