In [68]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import preprocessing

# Generate Dataframe
df = pd.read_csv("uber-fares-dataset/uber.csv")

## Cleaning the Data

Before we start formally cleaning the data, we made sure to take a look at the data we are working with as well as their data types and potential missing values.

Right off the bat, we noted that some latitude and longitude values do not fall within the actual latitude and longitude ranges.
For example, latitude values range between -90 and 90 whereas longitude values range between -180 and 180. Values like -3356.66630 for the longitude will skew the data, so we will need to handle their observational units accordingly.

In [69]:
df.describe()

Unnamed: 0.1,Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,200000.0,200000.0,200000.0,200000.0,199999.0,199999.0,200000.0
mean,27712500.0,11.359955,-72.527638,39.935885,-72.525292,39.92389,1.684535
std,16013820.0,9.901776,11.437787,7.720539,13.117408,6.794829,1.385997
min,1.0,-52.0,-1340.64841,-74.015515,-3356.6663,-881.985513,0.0
25%,13825350.0,6.0,-73.992065,40.734796,-73.991407,40.733823,1.0
50%,27745500.0,8.5,-73.981823,40.752592,-73.980093,40.753042,1.0
75%,41555300.0,12.5,-73.967154,40.767158,-73.963658,40.768001,2.0
max,55423570.0,499.0,57.418457,1644.421482,1153.572603,872.697628,208.0


From the .info() function we found that the longitude and latitude values contain a null value. We will need to address that later during our data cleanup.

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Unnamed: 0         200000 non-null  int64  
 1   key                200000 non-null  object 
 2   fare_amount        200000 non-null  float64
 3   pickup_datetime    200000 non-null  object 
 4   pickup_longitude   200000 non-null  float64
 5   pickup_latitude    200000 non-null  float64
 6   dropoff_longitude  199999 non-null  float64
 7   dropoff_latitude   199999 non-null  float64
 8   passenger_count    200000 non-null  int64  
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB


To combat the rather strange values in the latitude and longitude columns, we will need to remove any observational units that do not fall within the specified latitude and longitude range. 

Before doing so, we will first need to drop any null values from our dataset.

In [71]:
df = df.dropna()

In [72]:
# Drop Pickup Latitude & Longitude Values Outside of Range
df = df[(df.pickup_longitude < -180) & (df.pickup_longitude > 180)]
df = df[(df.pickup_latitude < -90) & (df.pickup_latitude > 90)]

# Drop Dropoff Latitude & Longitude Values Outside of Range
df = df[(df.dropoff_longitude < -180) & (df.dropoff_longitude > 180)]
df = df[(df.dropoff_latitude < -90) & (df.dropoff_latitude > 90)]


In [73]:
df_scale = df_new.drop({"key", "pickup_datetime"}, axis = 1)
df_labels = df_new[{"key", "pickup_datetime"}]

In [74]:
scaler = preprocessing.StandardScaler().fit(df_scale)

In [75]:
scaler.mean_

array([ 2.77124786e+07,  1.13598915e+01, -7.25276308e+01,  3.99358812e+01,
       -7.25252916e+01,  3.99238904e+01,  1.68454342e+00])

In [76]:
scaler.scale_

array([1.60138183e+07, 9.90173524e+00, 1.14377869e+01, 7.72053918e+00,
       1.31173750e+01, 6.79481185e+00, 1.38599143e+00])

In [77]:
df_scaled = scaler.transform(df_scale)
df_scaled.mean(axis=0)

array([-8.29385155e-17, -7.83022010e-17,  1.24395339e-15,  2.85710663e-16,
       -2.71908301e-16, -1.11335498e-15, -5.59555202e-18])

In [78]:
df_scaled.std(axis=0)

array([1., 1., 1., 1., 1., 1., 1.])