# California Household Travel Survey: data prep

Notebook developed by Sam Maurer

This notebook loads anonymized trip records from the 2012 California Household Travel Survey (CHTS), drops most of the variables, and merges data into a single table for use in other notebooks.

You can access the data and learn more about it here: https://www.nrel.gov/transportation/secure-transportation-data/tsdc-california-travel-survey.html

The data dictionary is available here: [caltrans_data_dictionary.pdf](https://www.nrel.gov/transportation/secure-transportation-data/assets/pdfs/caltrans_data_dictionary.pdf)

#### A note about memory-efficient data loading:

Reading the full data into Pandas exhausts the memory of a 1GB JupyterHub instance, so this notebook also illustrates some tips and tricks for reading data more efficiently!

## Set up the data access

We're using a file called `caltrans_full_survey.zip`. It's available from the link above.

In [1]:
import zipfile         # a core library for working with zip files
import requests        # third-party library for making HTTP requests
import numpy as np
import pandas as pd

In [2]:
# Download the file to DataHub. It includes multiple CSVs in the same archive, so
# we can't work with it directly from a URL.

url = "https://www.dropbox.com/s/do6djp9vjzthe72/caltrans_full_survey.zip?dl=1"

# Fetch the archive first, and stop with an error if the download failed, so
# we never save an error page in place of the zip file.

r = requests.get(url)
r.raise_for_status()

# The next line is equivalent to 'f = open(...)', but it automatically "closes"
# the file when the indented code block ends, even if an error occurs.

with open('caltrans_full_survey.zip', 'wb') as f:
    f.write(r.content)

In [2]:
# Open up the file we just saved to disk. This function lets us access the
# individual CSVs separately.

z = zipfile.ZipFile('caltrans_full_survey.zip')
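
To check which CSVs an archive contains before loading any of them, `ZipFile.namelist()` returns the member paths. Here is a minimal sketch using a tiny in-memory archive; the file names and contents are made up for illustration and are not the real CHTS files.

```python
import io
import zipfile

import pandas as pd

# Build a tiny in-memory archive with two hypothetical CSVs (stand-ins for the survey files)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as demo_zip:
    demo_zip.writestr('demo/place.csv', 'sampno,perno,plano\n1,1,1\n1,1,2\n')
    demo_zip.writestr('demo/person.csv', 'sampno,perno,perwgt\n1,1,0.5\n')

# Reopen for reading, list the members, and load one CSV without extracting it to disk
with zipfile.ZipFile(buffer) as demo_zip:
    member_names = demo_zip.namelist()
    with demo_zip.open('demo/place.csv') as member:
        demo_places = pd.read_csv(member)

print(member_names)
print(len(demo_places))
```

The same `namelist()` call on `z` should show the paths used in the `z.open(...)` calls in this notebook.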

## Load the places table

Think of these as *place visits* -- each row represents a location where a trip began or ended. If a place was visited multiple times or by multiple people, each visit shows up as a separate row. 

The table also includes information about the travel that occurred to reach each place.

In [3]:
# 'usecols' lets us specify which columns to load, which for large tables can be 
# much faster and more memory-efficient.

# 'low_memory=False' tells Pandas to read the whole file before inferring each 
# column's data type, instead of inferring types chunk by chunk, which avoids 
# mixed-type columns. (But it uses more memory.)

places = pd.read_csv(z.open('caltrans_full_survey/survey_place.csv'), 
                     low_memory=False,
                     usecols=['sampno','perno','plano','mode','trip_distance_miles',
                              'prev_trip_duration_min','county_id','state','city'])
len(places)

460524

In [4]:
places.head()

Unnamed: 0,sampno,perno,plano,mode,trip_distance_miles,prev_trip_duration_min,county_id,city,state
0,1031985,1,1,,,,95.0,VALLEJO,CA
1,1031985,1,2,6.0,13.428271,22.0,95.0,BENICIA,CA
2,1031985,1,3,6.0,12.975526,20.0,95.0,VALLEJO,CA
3,1031985,2,1,,,,95.0,VALLEJO,CA
4,1031985,2,2,5.0,5.12596,10.0,95.0,VALLEJO,CA
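
To get a rough sense of how much `usecols` saves, `DataFrame.memory_usage` can compare two ways of loading the same file. This is a sketch on a made-up toy CSV, not the survey data; the `dtype` argument is an extra option not used in this notebook, and plain integer types like `int8` only work for columns with no missing values.

```python
import io

import pandas as pd

# Toy CSV standing in for a survey table (values are made up)
csv_text = (
    'sampno,perno,plano,notes\n'
    '1,1,1,aaa\n'
    '1,1,2,bbb\n'
    '2,1,1,ccc\n'
)

# Load every column with default data types
full = pd.read_csv(io.StringIO(csv_text))

# Load only the needed columns, with smaller integer types for the small ID columns
slim = pd.read_csv(io.StringIO(csv_text),
                   usecols=['sampno', 'perno', 'plano'],
                   dtype={'perno': 'int8', 'plano': 'int16'})

# 'deep=True' also counts the memory used by text values
full_bytes = full.memory_usage(deep=True).sum()
slim_bytes = slim.memory_usage(deep=True).sum()
print(full_bytes, slim_bytes)
```

On a three-row table the difference is tiny, but it scales with the number of rows and dropped columns.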


## Load the activities table

These describe what a person did at a place. Multiple activities can be associated with each place visit.

In [5]:
activities = pd.read_csv(z.open('caltrans_full_survey/survey_activity.csv'), 
                         low_memory=False,
                         usecols=['sampno','perno','plano','purpose'])
len(activities)

604711

In [6]:
activities.head()

Unnamed: 0,sampno,perno,plano,purpose
0,1039879,1,1,1
1,1041766,3,1,1
2,1043722,2,4,1
3,1050668,1,1,4
4,1051203,1,9,1


## Load the persons table

CHTS over- or under-sampled households with different characteristics, in order to get more detailed data about certain subsets of the population.

Each household and person has a "weight" indicating how to balance the observations to get accurate aggregate statistics (trip counts, mode splits, etc.).

In [10]:
persons = pd.read_csv(z.open('caltrans_full_survey/survey_person.csv'), 
                      low_memory=False,
                      usecols=['sampno','perno','perwgt'])
len(persons)

109113

In [11]:
persons.head()

Unnamed: 0,sampno,perno,perwgt
0,7128119,1,0.386276
1,7128119,3,0.607951
2,7128138,1,0.567107
3,7128262,1,0.826053
4,7128262,3,1.2054
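
As an illustration of how weights change an aggregate statistic, the sketch below compares an unweighted and a weighted mode share on made-up data. The mode labels and weight values are hypothetical, and the CHTS documentation should be consulted for which weight applies to a given statistic; this only shows the arithmetic.

```python
import pandas as pd

# Toy data (made up): three respondents, their mode choice, and person weights
toy = pd.DataFrame({
    'mode': ['walk', 'drive', 'drive'],
    'perwgt': [2.0, 0.5, 0.5],
})

# Unweighted share: each row counts equally
unweighted = toy['mode'].value_counts(normalize=True)

# Weighted share: sum the weights per mode, then divide by the total weight
weighted = toy.groupby('mode')['perwgt'].sum() / toy['perwgt'].sum()

print(unweighted)
print(weighted)
```

Here the walking respondent is one of three rows but carries two thirds of the total weight, so the weighted walk share is twice the unweighted one.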


## Build a trips table

This will include information from the places, activities, and persons tables.

In [12]:
# ACTIVITY PURPOSES

# 1- Personal activities (sleeping, personal care, leisure, chores); 
# 2- Preparing meals/eating; 
# 3- Hosting visitors/entertaining guests; 
# 4- Exercise (with or without equipment)/playing sports; 
# 5- Study/schoolwork; 
# 6- Work for pay at home using telecommunications equipment; 
# 7- Using computer/telephone/cell or smart phone, or other communications device for personal activities; 
# 8- All other activities at home; 
# 9- Work/job duties; 
# 10- Training; 
# 11- Meals at work; 
# 12- Work-sponsored social activities (holiday/birthday celebrations, etc.); 
# 13- Non-work-related activities (social clubs, etc.); 
# 14- Exercise/sports; 
# 15- Volunteer work/activities, 
# 16- All other work-related activities at work; 
# 17- School/ classroom/ laboratory; 
# 18- Meals at school/college; 
# 19- After-school or non-class-related sports/physical activities; 
# 20- All other after-school or non-class-related activities (library, music rehearsal, clubs, etc.); 
# 21- Change type of transportation/transfer (walk to bus, walk to/from parked car); 
# 22- pick up/drop off passenger(s); 
# 23- Drive-through meals (snacks, coffee, etc.) (show if PTYPE <> 1 [Home]); 
# 24- Drive-through other (ATM, bank, etc.) (show if PTYPE <> 1); 
# 25- Work-related (meetings, sales calls, deliveries); 
# 26- Service private vehicle (gas, oil, lubes, repairs), 
# 27- Routine shopping (groceries, clothing, convenience store, household maintenance, etc.); 
# 28- Shopping for major purchases or specialty items (appliance, electronics, new vehicles, major household repairs, etc.); 
# 29- Household errands (bank, dry cleaning, etc.); 
# 30- Personal business (visit government office, attorney, accountant, etc.); 
# 31- Eat meal at restaurant/diner; 
# 32- Health care (doctor, dentist, eye care, chiropractor, veterinarian, etc.); 
# 33- Civic/ religious activities; 
# 34- Outdoor exercise (outdoor sports, jogging, bicycling, walking the dog, etc.); 
# 35- Indoor exercise (gym, yoga, etc.); 
# 36- Entertainment (movies, sporting events, etc.); 
# 37- Social/visiting friends and relatives; 
# 38- Other (specify), 
# 39- Loop trip (for interviewer only- not listed on diary), 
# 99- DK/RF

In [13]:
activities.purpose.value_counts()

1     203178
2      63538
27     34344
9      30530
8      27406
22     25265
21     24073
37     21980
7      19989
31     18326
39     13778
17     11270
34     10339
25      9940
29      9204
33      7376
36      7331
30      7086
5       6099
32      6016
11      5099
26      5067
35      4911
23      4723
38      4635
4       4211
6       4157
3       3461
28      2846
18      2339
24      1275
20      1236
16      1077
19       790
15       575
13       343
14       330
10       268
99       194
12       106
Name: purpose, dtype: int64

In [18]:
# Each place visit can have more than one activity, so generate some dummy
# variables indicating which visit IDs are associated with various activity types.

# Keeping only the household + person + place IDs and then dropping duplicate rows
# gives us a list of the unique place visits for each activity category.

# Work trips
activities_filter = activities.purpose.isin([9])
work = activities.loc[activities_filter, ['sampno','perno','plano']].drop_duplicates()
work['work'] = 1

# Non-home, non-work ("secondary activity") trips
activities_filter = activities.purpose.isin(range(23, 38+1))
secondary = activities.loc[activities_filter, ['sampno','perno','plano']].drop_duplicates()
secondary['secondary'] = 1

# Shopping trips
activities_filter = activities.purpose.isin([23,24,26,27,28,29,31])
shopping = activities.loc[activities_filter, ['sampno','perno','plano']].drop_duplicates()
shopping['shopping'] = 1

# Outdoor recreation
activities_filter = activities.purpose.isin([34])
outdoor = activities.loc[activities_filter, ['sampno','perno','plano']].drop_duplicates()
outdoor['outdoors'] = 1
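
To see why the `drop_duplicates()` step matters, here is the same pattern on a toy activities table (made-up values), where a single place visit has two activities in the same category.

```python
import pandas as pd

# Toy activities table (made up): the visit to place 2 has two shopping-type activities
toy_activities = pd.DataFrame({
    'sampno':  [1, 1, 1, 1],
    'perno':   [1, 1, 1, 1],
    'plano':   [1, 2, 2, 3],
    'purpose': [1, 27, 31, 9],
})

id_cols = ['sampno', 'perno', 'plano']

# Select activities in the category, keep only the ID columns, and collapse repeats
is_shopping = toy_activities.purpose.isin([27, 31])
toy_shopping = toy_activities.loc[is_shopping, id_cols].drop_duplicates()
toy_shopping['shopping'] = 1

print(toy_shopping)
```

Without `drop_duplicates()`, the visit to place 2 would appear twice, and a later merge on the ID columns would duplicate that place visit in the output.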

In [22]:
# Build a trips table by merging these activity dummies into the place visits

trips = pd.merge(places, work, on=['sampno','perno','plano'], how='left')
trips = pd.merge(trips, secondary, on=['sampno','perno','plano'], how='left')
trips = pd.merge(trips, shopping, on=['sampno','perno','plano'], how='left')
trips = pd.merge(trips, outdoor, on=['sampno','perno','plano'], how='left')

# Also include the sampling weights

trips = pd.merge(trips, persons, on=['sampno','perno'], how='left')

In [23]:
trips.head()

Unnamed: 0,sampno,perno,plano,mode,trip_distance_miles,prev_trip_duration_min,county_id,city,state,work,secondary,shopping,outdoors,perwgt
0,1031985,1,1,,,,95.0,VALLEJO,CA,,,,,0.052086
1,1031985,1,2,6.0,13.428271,22.0,95.0,BENICIA,CA,,1.0,,,0.052086
2,1031985,1,3,6.0,12.975526,20.0,95.0,VALLEJO,CA,,,,,0.052086
3,1031985,2,1,,,,95.0,VALLEJO,CA,,,,,0.052086
4,1031985,2,2,5.0,5.12596,10.0,95.0,VALLEJO,CA,,1.0,,,0.052086
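
The left merges leave `NaN` in a dummy column wherever a place visit had no matching activity, which is why the flags above show up as `1.0` or blank. If a downstream notebook prefers clean 0/1 integers, one option is to fill the gaps after merging. This is an optional sketch on made-up toy tables, not a step this notebook performs.

```python
import pandas as pd

# Toy tables (made up): two place visits, only one of which has a work activity
toy_places = pd.DataFrame({'sampno': [1, 1], 'perno': [1, 1], 'plano': [1, 2]})
toy_work = pd.DataFrame({'sampno': [1], 'perno': [1], 'plano': [2], 'work': [1]})

# A left merge keeps every place visit; visits with no match get NaN in 'work'
toy_trips = pd.merge(toy_places, toy_work, on=['sampno', 'perno', 'plano'], how='left')

# Optionally convert the NaN/1.0 flags to 0/1 integers
toy_trips['work'] = toy_trips['work'].fillna(0).astype(int)

print(toy_trips)
```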


## Save cleaned data to disk

In [24]:
# 'index=False' drops the index column, which we're not using

trips.to_csv('trips.csv', index=False)