Exploration of the data from the [Di-Tech Challenge](http://research.xiaojukeji.com/competition), organized by Didi Chuxing, a ride-hailing company in China. The data is described [here](http://research.xiaojukeji.com/competition/detail.action?competitionId=DiTech2016).

In [1]:
import pandas as pd
# pd.read_csv?

#### Order Info Table
<table>
        <tr>
            <th>Field</th>
            <th>Type</th>
            <th>Meaning</th>
            <th>Example</th>
        </tr>
        <tr>
            <td>order_id</td>
            <td>string</td>
            <td>order ID</td>
            <td>70fc7c2bd2caf386bb50f8fd5dfef0cf</td>
        </tr>
        <tr>
            <td>driver_id</td>
            <td>string</td>
            <td>driver ID</td>
            <td>56018323b921dd2c5444f98fb45509de</td>
        </tr>
        <tr>
            <td>passenger_id</td>
            <td>string</td>
            <td>user ID</td>
            <td>238de35f44bbe8a67bdea86a5b0f4719</td>
        </tr>
        <tr>
            <td>start_district_hash</td>
            <td>string</td>
            <td>departure</td>
            <td>d4ec2125aff74eded207d2d915ef682f</td>
        </tr>
        <tr>
            <td>dest_district_hash</td>
            <td>string</td>
            <td>destination</td>
            <td>929ec6c160e6f52c20a4217c7978f681</td>
        </tr>
        <tr>
            <td>Price</td>
            <td>double</td>
            <td>Price</td>
            <td>37.5</td>
        </tr>
        <tr>
            <td>Time</td>
            <td>string</td>
            <td>Timestamp of the order</td>
            <td>2016-01-15 00:35:11</td>
        </tr>
</table>

The Order Info Table shows the basic information of an order, including the passenger and the driver (if driver_id = NULL, the order was not answered by any driver), place of origin, destination, price, and time. The fields order_id, driver_id, passenger_id, start_district_hash, and dest_district_hash have been anonymized.
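
As a minimal sketch of how the `driver_id = NULL` convention can become a per-order flag, the toy rows below (made-up IDs, not challenge data) mark unanswered orders. It assumes the files are read with `pd.read_csv`, whose default NA handling treats the literal string `NULL` as a missing value. The `unanswered` column name is only illustrative.

```python
import io
import pandas as pd

# Two toy tab-separated rows in the order-file layout (all IDs are made up).
raw = (
    "o1\td1\tp1\ts1\te1\t10.0\t2016-01-01 00:05:00\n"
    "o2\tNULL\tp2\ts1\te2\t12.5\t2016-01-01 00:07:30\n"
)
columns = ['order_id', 'driver_id', 'passenger_id', 'start_district_hash',
           'dest_district_hash', 'price', 'datetime']

# With default settings, read_csv parses the string "NULL" as NaN.
toy = pd.read_csv(io.StringIO(raw), sep="\t", names=columns,
                  parse_dates=['datetime'])

# 1 if no driver answered the order, 0 otherwise (hypothetical column name).
toy['unanswered'] = toy['driver_id'].isna().astype(int)
print(toy[['order_id', 'driver_id', 'unanswered']])
```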

In [8]:
# Columns in order files
columns = ['order_id', 'driver_id', 'passenger_id', 'start_district_hash', 'dest_district_hash', 'price', 'datetime']

# Files are organized by dates
# order_files = ["data/season_1/training_data/order_data/order_data_2016-01-{:02d}".format(i) for i in range(1, 22)]
order_files = ["data/season_1/training_data/order_data/order_data_2016-01-{:02d}".format(i) for i in range(1, 3)]

# Open all of them
order_dfs = []
for order_file in order_files:
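    # parse_dates expects a list of column names, not a bare string
    order_dfs.append(pd.read_csv(order_file, sep = "\t", names = columns, parse_dates = [columns[-1]]))
# ignore_index gives every row a unique label, which the index-based split below relies on
df = pd.concat(order_dfs, ignore_index = True)

# Keep a random 70% sample of the rows; the remaining 30% go to df_rest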
df_sampled = df.sample(frac = 0.70, random_state = 111)
df_rest = df.loc[~df.index.isin(df_sampled.index)]
df = df_sampled

# Open only one file
# order_file_1 = "data/season_1/training_data/order_data/order_data_2016-01-01"
# df = df_1 = pd.read_csv(order_file_1, sep = "\t", names = columns, parse_dates = [columns[-1]])

print(df.head(2))

Unnamed: 0,order_id,driver_id,passenger_id,start_district_hash,dest_district_hash,price,time
0,97ebd0c6680f7c0535dbfdead6e51b4b,dd65fa250fca2833a3a8c16d2cf0457c,ed180d7daf639d936f1aeae4f7fb482f,4725c39a5e5f4c188d382da3910b3f3f,3e12208dd0be281c92a6ab57d9a6fb32,24,2016-01-01 13:37:23
1,92c3ac9251cc9b5aab90b114a1e363be,c077e0297639edcb1df6189e8cda2c3d,191a180f0a262aff3267775c4fac8972,82cc4851f9e4faa4e54309f8bb73fd7c,b05379ac3f9b7d99370d443cfd5dcc28,2,2016-01-01 09:47:54
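
The sample/rest split above selects the held-out rows by index label, so it only works as intended when labels are unique across the concatenated files. Here is a small sketch on toy frames (made-up data), assuming each per-day frame comes back from `read_csv` with its own 0-based index. One way to get unique labels is `ignore_index=True` in `pd.concat`.

```python
import pandas as pd

# Two toy per-day frames, each with its own 0-based index.
day1 = pd.DataFrame({'order_id': ['a', 'b', 'c']})
day2 = pd.DataFrame({'order_id': ['d', 'e', 'f']})

# ignore_index=True renumbers the rows so that every index label is unique.
toy = pd.concat([day1, day2], ignore_index=True)

# Same pattern as in the notebook: sample a fraction, keep the rest aside.
sampled = toy.sample(frac=0.5, random_state=111)
rest = toy.loc[~toy.index.isin(sampled.index)]

# The two parts should be disjoint and together cover every row.
print(len(sampled), len(rest), len(toy))
```

Without `ignore_index=True`, labels 0, 1, 2 would each appear twice, and `isin` would drop rows from `rest` that were never sampled.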


In [29]:
# Quick look at data frame
print(df.describe())

# Find the range of dates
print("Dates from {} to {}.".format(df['datetime'].min(), df['datetime'].max()))

Unnamed: 0,price
count,823571.0
mean,18.632872
std,16.793238
min,0.0
25%,8.1
50%,14.0
75%,23.0
max,499.0


In [30]:
# Count non-null driver_id values per order_id
count = df[['order_id', 'driver_id']].groupby('order_id').count()
count = count['driver_id']

# Orders picked up by more than one driver?
print(sum(count > 1))
# Yes..? Surprising.

1452


In [10]:
# Turns out there are duplicate and almost-duplicate entries.
# For now, let's keep the last ones.
dup = df.duplicated(['order_id', 'driver_id', 'passenger_id', 'datetime'], keep = 'last')
df = df[~dup]
# Depending on the test data, it might be a better idea to leave them in.
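
To make the effect of `keep = 'last'` concrete, here is a small sketch on a toy frame (made-up values). Two rows share the same key columns but differ in price, which is one guess at what an "almost-duplicate" looks like; only the last of them survives.

```python
import pandas as pd

toy = pd.DataFrame({
    'order_id':     ['o1', 'o1', 'o2'],
    'driver_id':    ['d1', 'd1', None],
    'passenger_id': ['p1', 'p1', 'p2'],
    'price':        [10.0, 11.0, 8.0],
})

# Rows are compared on the key columns only; price is ignored.
key = ['order_id', 'driver_id', 'passenger_id']

# keep='last' flags every occurrence except the final one as a duplicate.
dup = toy.duplicated(key, keep='last')
deduped = toy[~dup]
print(deduped)
```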

In [11]:
# Count non-null driver_id values per order_id
count = df[['order_id', 'driver_id']].groupby('order_id').count()
count = count['driver_id']

# Orders picked up by more than one driver?
print(sum(count > 1))
# No more.

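# Create gap column: 1 if no driver answered the order, 0 otherwise.
# Computed row-wise so it stays aligned with df; count is ordered by order_id,
# so assigning it as a plain list would attach the flags to the wrong rows.
df['gap'] = df['driver_id'].isnull().astype('int')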

0


In [14]:
# Proportion of orders not picked up by a driver
s = sum(count == 0)
l = len(count)

print("There are {} orders-without-drivers out of {} orders: {:.1%}.".format(s, l, s/l))
# It appears the gap is simply the number of orders not picked up.

There are 174713 orders-without-drivers out of 498789 orders: 35.0%.


In [16]:
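# Compute time slot
# One day is uniformly divided into 144 ten-minute time slots.
# Note: timeslot here is zero-based (0 to 143), while the challenge labels appear
# to be one-based (2016-01-23-1 is the first time slot on Jan. 23rd, 2016).

df['date'] = pd.to_datetime(df.datetime.dt.date)
df['time'] = df.datetime.dt.time
# df = df.drop('datetime', axis = 1)

# Floor-divide the time elapsed since midnight by ten minutes
df['timeslot'] = (df['datetime'] - df['date']) // pd.Timedelta(minutes = 10)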

print(df.head(2))
print(df.describe())

Unnamed: 0,order_id,driver_id,passenger_id,start_district_hash,dest_district_hash,price,time,gap,datetime,date,timeslot
0,97ebd0c6680f7c0535dbfdead6e51b4b,dd65fa250fca2833a3a8c16d2cf0457c,ed180d7daf639d936f1aeae4f7fb482f,4725c39a5e5f4c188d382da3910b3f3f,3e12208dd0be281c92a6ab57d9a6fb32,24,13:37:23,0,2016-01-01 13:37:23,2016-01-01,81
1,92c3ac9251cc9b5aab90b114a1e363be,c077e0297639edcb1df6189e8cda2c3d,191a180f0a262aff3267775c4fac8972,82cc4851f9e4faa4e54309f8bb73fd7c,b05379ac3f9b7d99370d443cfd5dcc28,2,09:47:54,0,2016-01-01 09:47:54,2016-01-01,58
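
The slot numbers in the output above are zero-based (13:37:23 falls in slot 81), whereas the challenge's example label `2016-01-23-1` suggests one-based numbering. Below is a minimal sketch of the conversion on toy timestamps. The `date-slot` label format is inferred from that single example and should be treated as an assumption.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    '2016-01-01 00:00:00',
    '2016-01-01 13:37:23',
    '2016-01-01 23:59:59',
]))

# Midnight of the same day, then integer number of ten-minute blocks since then.
midnight = ts.dt.normalize()
slot0 = (ts - midnight) // pd.Timedelta(minutes=10)   # zero-based, 0..143

# Shift to one-based numbering and build "YYYY-MM-DD-slot" labels.
slot1 = slot0 + 1
labels = ts.dt.strftime('%Y-%m-%d') + '-' + slot1.astype(str)
print(labels.tolist())
```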


In [17]:
# Compute gap per time slot
s = df[['date', 'timeslot', 'gap']].groupby(['date', 'timeslot']).sum()
# print(s)

# Sanity check: do the numbers add up?
print(sum(s.gap))
# Yup.

174713


<table>
        <tr>
            <th>Data name</th>
            <th>Data type</th>
            <th>Example</th>
        </tr>
        <tr>
            <td>District ID</td>
            <td>string</td>
            <td>1,2,3,4 (the same as district mapping ID)</td>
        </tr>
        <tr>
            <td>Time slot</td>
            <td>string</td>
            <td>2016-01-23-1 (The first time slot on Jan. 23rd, 2016; one day is uniformly divided into 144 ten minute time slots)</td>
        </tr>
        <tr>
            <td>Prediction value</td>
            <td>double</td>
            <td>6.0</td>
        </tr>
</table>
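
The table above can be tied back to the per-slot gap sums with a short sketch. It uses toy data and several assumptions: the `district_id` column is hypothetical (the mapping from district hash to district ID is not shown in this notebook), the time-slot label is taken to be one-based, and the output column names are only illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'district_id': [1, 1, 1, 2],                      # hypothetical mapped IDs
    'date': pd.to_datetime(['2016-01-23'] * 4),
    'timeslot': [0, 0, 1, 0],                         # zero-based slots
    'gap': [1, 0, 1, 1],                              # 1 = order not answered
})

# Sum unanswered orders per district, day and time slot.
gaps = toy.groupby(['district_id', 'date', 'timeslot'], as_index=False)['gap'].sum()

# Build the "YYYY-MM-DD-slot" label with a one-based slot number.
gaps['time_slot'] = (gaps['date'].dt.strftime('%Y-%m-%d') + '-'
                     + (gaps['timeslot'] + 1).astype(str))

# The table lists the prediction value as a double.
gaps['gap'] = gaps['gap'].astype(float)

result = gaps[['district_id', 'time_slot', 'gap']]
print(result)
```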