Exploratory Analysis For Uber Dataset
#pip install package_name
#pip freeze

In [1]:
# importing basic packages
import numpy as np
import pandas as pd
import os

In [2]:
# Setting the directory paths
WORKING_DIR = os.getcwd()
data_file_name = 'Uber_Data.csv'
DATA_DIR = 'data'
DATA = os.path.join(WORKING_DIR, DATA_DIR, data_file_name)

In [3]:
# Loading the dataset and making a copy
df_orig = pd.read_csv(DATA)
df = df_orig.copy()  # .copy() creates an independent DataFrame; plain assignment would only alias df_orig
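
Keeping an untouched original only works if the working frame is a real copy. The sketch below illustrates the difference between plain assignment and `DataFrame.copy()`. The frame `original` and the names `alias` and `independent` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate aliasing vs copying
original = pd.DataFrame({"pickups": [152, 1519]})

alias = original               # second name for the same object, no copy
independent = original.copy()  # separate object with its own data

# A change made through the alias is visible in the original
alias.loc[0, "pickups"] = 0

# A change made to the copy leaves the original untouched
independent.loc[1, "pickups"] = 0

print(original)
print(independent)
```

This matters later in the notebook, where `pickup_dt` is converted in place on `df`.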

In [5]:
# look at top 5 observations
print(df.head())

          pickup_dt    borough  pickups  spd   vsb  temp  dewp     slp  pcp01  \
0  01-01-2015 01:00      Bronx      152  5.0  10.0  30.0   7.0  1023.5    0.0   
1  01-01-2015 01:00   Brooklyn     1519  5.0  10.0   NaN   7.0  1023.5    0.0   
2  01-01-2015 01:00        EWR        0  5.0  10.0  30.0   7.0  1023.5    0.0   
3  01-01-2015 01:00  Manhattan     5258  5.0  10.0  30.0   7.0  1023.5    0.0   
4  01-01-2015 01:00     Queens      405  5.0  10.0  30.0   7.0  1023.5    0.0   

   pcp06  pcp24   sd hday  
0    0.0    0.0  0.0    Y  
1    0.0    0.0  0.0    Y  
2    0.0    0.0  0.0    Y  
3    0.0    0.0  0.0    Y  
4    0.0    0.0  0.0    Y  


The columns are:
* `pickup_dt` contains the date and time of pickup
* `borough` contains the name of the New York borough where the pickups were made
* `pickups` is the number of pickups made in the given borough at the given time
* `spd`, `vsb`, `temp`, `dewp`, `slp` and the remaining columns are weather-related data
* `hday` indicates whether the day was a holiday [Y: Holiday, N: Not a holiday]


In [4]:
#Checking the rows and columns
rows, columns = df.shape
print(f'Total Rows: {rows}')
print(f'Total Columns/Features: {columns}')

Total Rows: 29101
Total Columns/Features: 13


In [6]:
# look at column details
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29101 entries, 0 to 29100
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pickup_dt  29101 non-null  object 
 1   borough    26058 non-null  object 
 2   pickups    29101 non-null  int64  
 3   spd        29101 non-null  float64
 4   vsb        29101 non-null  float64
 5   temp       28742 non-null  float64
 6   dewp       29101 non-null  float64
 7   slp        29101 non-null  float64
 8   pcp01      29101 non-null  float64
 9   pcp06      29101 non-null  float64
 10  pcp24      29101 non-null  float64
 11  sd         29101 non-null  float64
 12  hday       29101 non-null  object 
dtypes: float64(9), int64(1), object(3)
memory usage: 2.9+ MB
None


* The dataset consists of 10 numerical columns and 3 object columns
* The data type of `pickup_dt` should be changed to date-time format
* The data types of `borough` and `hday` should be changed to categorical
* The dataset has 29101 rows, and two columns have missing values: `borough` (26058 non-null) and `temp` (28742 non-null)
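
The categorical conversion and the missing-value check described above could be sketched as follows. The frame `sample` is a made-up stand-in that reuses the dataset's column names with illustrative values, so the snippet runs without the CSV file. On the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: same column names, made-up values
sample = pd.DataFrame({
    "pickup_dt": ["01-01-2015 01:00", "01-01-2015 01:00", "01-01-2015 02:00"],
    "borough": ["Bronx", np.nan, "Queens"],
    "temp": [30.0, np.nan, 31.0],
    "hday": ["Y", "Y", "N"],
})

# Convert the low-cardinality text columns to the category dtype
for col in ["borough", "hday"]:
    sample[col] = sample[col].astype("category")

# Count missing values per column and keep only the affected columns
missing = sample.isnull().sum()
print(sample.dtypes)
print(missing[missing > 0])
```

Missing entries stay missing after the conversion; `astype("category")` does not create a category for NaN.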

In [7]:
# Statistical summary
print(df.describe().T)

           count         mean         std    min     25%     50%          75%  \
pickups  29101.0   490.215903  995.649536    0.0     1.0    54.0   449.000000   
spd      29101.0     5.984924    3.699007    0.0     3.0     6.0     8.000000   
vsb      29101.0     8.818125    2.442897    0.0     9.1    10.0    10.000000   
temp     28742.0    47.900019   19.798783    2.0    32.0    46.5    65.000000   
dewp     29101.0    30.823065   21.283444  -16.0    14.0    30.0    50.000000   
slp      29101.0  1017.817938    7.768796  991.4  1012.5  1018.2  1022.900000   
pcp01    29101.0     0.003830    0.018933    0.0     0.0     0.0     0.000000   
pcp06    29101.0     0.026129    0.093125    0.0     0.0     0.0     0.000000   
pcp24    29101.0     0.090464    0.219402    0.0     0.0     0.0     0.050000   
sd       29101.0     2.529169    4.520325    0.0     0.0     0.0     2.958333   

             max  
pickups  7883.00  
spd        21.00  
vsb        10.00  
temp       89.00  
dewp       73

* In the `pickups` and `sd` columns the gap between the 3rd quartile (75%) and the max value is large, which suggests right-skewed distributions
* The `temp` range is wide (2 to 89), so the data covers different weather conditions / seasons
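
One way to quantify the gap between the 3rd quartile and the maximum is to compare values against the conventional 1.5 × IQR upper fence and to look at the skewness. This is a minimal sketch on a made-up series; `pickups` below is a hypothetical stand-in for `df['pickups']`, and the fence rule is a common convention rather than something the notebook uses.

```python
import pandas as pd

# Hypothetical stand-in for df['pickups']: made-up, right-skewed values
pickups = pd.Series([0, 1, 2, 50, 60, 400, 450, 7000])

# Interquartile range and the conventional upper fence (Q3 + 1.5 * IQR)
q1, q3 = pickups.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Number of observations that lie above the fence
n_above = int((pickups > upper_fence).sum())

print(f"75%: {q3}, max: {pickups.max()}, upper fence: {upper_fence}")
print(f"Values above the fence: {n_above}")
print(f"Skewness: {pickups.skew():.2f}")
```

A positive skewness together with values above the fence would support reading the distribution as right-skewed.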


In [12]:
# Looking at unique categories in columns
print(df['borough'].unique())
print(df['borough'].value_counts())

['Bronx' 'Brooklyn' 'EWR' 'Manhattan' 'Queens' 'Staten Island' nan]
Bronx            4343
Brooklyn         4343
EWR              4343
Manhattan        4343
Queens           4343
Staten Island    4343
Name: borough, dtype: int64


There are six boroughs in the dataset, each with the same number of rows (4343). The remaining 3043 rows (29101 - 26058) have a missing `borough` value.
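
By default `value_counts()` leaves missing values out of the counts, which is why NaN shows up in `unique()` but not in the table above. A minimal sketch of how the missing rows could be surfaced, using a made-up series named `borough` as a hypothetical stand-in for `df['borough']`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['borough']: made-up values with two gaps
borough = pd.Series(["Bronx", "Queens", np.nan, "Bronx", np.nan, "EWR"])

# dropna=False adds a row for the missing values
counts = borough.value_counts(dropna=False)

# Missing rows and distinct non-missing labels
n_missing = int(borough.isna().sum())
n_unique = borough.nunique()  # NaN is ignored by default

print(counts)
print(f"Missing: {n_missing}, distinct boroughs: {n_unique}")
```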

In [10]:
# Label Distribution in %
print(df['hday'].value_counts(normalize=True))

N    0.961479
Y    0.038521
Name: hday, dtype: float64


* Over 96% of the rows have the label N
* About 3.9% of the rows have the label Y, since there are few holidays in a year

In [13]:
# Change datatype of `pickup_dt` column to date-time format
df['pickup_dt'] = pd.to_datetime(df['pickup_dt'],format= "%d-%m-%Y %H:%M")

In [15]:
# Checking the updated datatype of the pickup_dt column
print(f"Data type of pickup_dt column is: {df.dtypes['pickup_dt']}")

Data type of pickup_dt column is: datetime64[ns]


In [17]:
# Exploring date column
# Earliest date
df['pickup_dt'].min()

Timestamp('2015-01-01 01:00:00')

In [18]:
# Latest date
df['pickup_dt'].max()

Timestamp('2015-06-30 23:00:00')

The dataset covers six months of 2015, from January 1st to June 30th, which spans winter, spring and the start of summer. This explains why the `temp` column ranges from 2°F to 89°F.
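
The link between the date range and the temperature range could be checked by deriving the month from `pickup_dt` and summarising `temp` per month. This is a minimal sketch on made-up rows; `sample`, the `month` column and `monthly_temp` are hypothetical names, and the values are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df: made-up rows in the dataset's date format
sample = pd.DataFrame({
    "pickup_dt": ["01-01-2015 01:00", "15-03-2015 12:00", "30-06-2015 23:00"],
    "temp": [30.0, 45.0, 80.0],
})

# Same parsing format as used in the notebook
sample["pickup_dt"] = pd.to_datetime(sample["pickup_dt"], format="%d-%m-%Y %H:%M")

# Derive the calendar month and summarise temperature per month
sample["month"] = sample["pickup_dt"].dt.month
monthly_temp = sample.groupby("month")["temp"].agg(["min", "max"])

print(sorted(sample["month"].unique()))
print(monthly_temp)
```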
 