# Data Merging Basics
## Your first inner join
You have been tasked with finding the most popular fuel types used in Chicago taxis. To complete the analysis, merge the taxi_owners and taxi_veh tables on the vid column. You can then use the merged table with the .value_counts() method to find the most common fuel_type.

Since you'll be working with pandas throughout the course, the package is preloaded as pd in each exercise. The taxi_owners and taxi_veh DataFrames are also loaded for you.

In [1]:
import pandas as pd
taxi_owners = pd.read_pickle('dataset/taxi_owners.p')
taxi_veh = pd.read_pickle('dataset/taxi_vehicles.p')
display(taxi_owners.head())

# Merge the taxi_owners and taxi_veh tables, setting a suffix for overlapping columns
taxi_own_veh = taxi_owners.merge(taxi_veh, on='vid', suffixes=('own', 'veh'))

# Print the first few rows of taxi_own_veh
display(taxi_own_veh.head())

# Print the value_counts to find the most popular fuel_type
print(taxi_own_veh['fuel_type'].value_counts())

Unnamed: 0,rid,vid,owner,address,zip
0,T6285,6285,AGEAN TAXI LLC,4536 N. ELSTON AVE.,60630
1,T4862,4862,MANGIB CORP.,5717 N. WASHTENAW AVE.,60659
2,T1495,1495,"FUNRIDE, INC.",3351 W. ADDISON ST.,60618
3,T4231,4231,ALQUSH CORP.,6611 N. CAMPBELL AVE.,60645
4,T5971,5971,EUNIFFORD INC.,3351 W. ADDISON ST.,60618


Unnamed: 0,rid,vid,ownerown,address,zip,make,model,year,fuel_type,ownerveh
0,T6285,6285,AGEAN TAXI LLC,4536 N. ELSTON AVE.,60630,NISSAN,ALTIMA,2011,HYBRID,AGEAN TAXI LLC
1,T4862,4862,MANGIB CORP.,5717 N. WASHTENAW AVE.,60659,HONDA,CRV,2014,GASOLINE,MANGIB CORP.
2,T1495,1495,"FUNRIDE, INC.",3351 W. ADDISON ST.,60618,TOYOTA,SIENNA,2015,GASOLINE,"FUNRIDE, INC."
3,T4231,4231,ALQUSH CORP.,6611 N. CAMPBELL AVE.,60645,TOYOTA,CAMRY,2014,HYBRID,ALQUSH CORP.
4,T5971,5971,EUNIFFORD INC.,3351 W. ADDISON ST.,60618,TOYOTA,SIENNA,2015,GASOLINE,EUNIFFORD INC.


HYBRID                    2792
GASOLINE                   611
FLEX FUEL                   89
COMPRESSED NATURAL GAS      27
Name: fuel_type, dtype: int64
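
The role of the suffixes argument is easier to see on a small example. The sketch below uses hypothetical toy tables (owners_toy and veh_toy are made-up names and values, not the course data) that share the key vid and an overlapping owner column. It uses underscore-prefixed suffixes, which is only one possible choice, so the resulting column names read as owner_own and owner_veh.

```python
import pandas as pd

# Hypothetical toy tables: both share the key 'vid' and a column named 'owner'.
owners_toy = pd.DataFrame({
    "vid": [1, 2, 3],
    "owner": ["A CORP.", "B INC.", "C LLC"],
})
veh_toy = pd.DataFrame({
    "vid": [1, 2, 3],
    "fuel_type": ["HYBRID", "GASOLINE", "HYBRID"],
    "owner": ["A CORP.", "B INC.", "C LLC"],
})

# Inner join on 'vid'; the overlapping 'owner' column gets one suffix per side.
merged_toy = owners_toy.merge(veh_toy, on="vid", suffixes=("_own", "_veh"))
print(merged_toy.columns.tolist())

# Count how often each fuel type appears, most common first.
fuel_counts = merged_toy["fuel_type"].value_counts()
print(fuel_counts)
```

Only columns that appear in both tables and are not used as the join key receive a suffix; vid and fuel_type keep their original names.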


## Inner joins and number of rows returned
All of the merges you have studied to this point are called inner joins. Inner joins return only the rows with matching values in both tables. You will explore this further by reviewing the merge between the wards and census tables, then comparing it to merges of slightly altered copies of these tables, named wards_altered and census_altered. The first row of the ward column has been changed in the altered tables. You will examine how this affects the merge between them. The tables have been loaded for you.

For this exercise, it is important to know that the wards and census tables start with 50 rows.

In [11]:
# Print the first few rows of the census_altered table to view the change
# (census_altered comes from the exercise environment and is not loaded in this notebook)
print(census_altered[['ward']].head())

# Merge the wards and census_altered tables on the ward column
wards_census_altered = wards.merge(census_altered, on='ward')

# Print the shape of wards_census_altered
print('wards_census_altered table shape:', wards_census_altered.shape)

NameError: name 'census_altered' is not defined
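
Since the altered tables are not available in this notebook, the effect can be sketched on toy data instead. The tables below are hypothetical stand-ins (the alderman and pop_2010 columns and all values are assumptions made for illustration). Changing one key value in a copy of the right table means that row no longer has a partner, so the inner join returns one row fewer.

```python
import pandas as pd

# Hypothetical stand-ins for the wards and census tables (3 rows instead of 50).
wards_toy = pd.DataFrame({"ward": ["1", "2", "3"], "alderman": ["X", "Y", "Z"]})
census_toy = pd.DataFrame({"ward": ["1", "2", "3"], "pop_2010": [100, 200, 300]})

# Alter the first ward value in a copy, mimicking the altered table.
census_toy_altered = census_toy.copy()
census_toy_altered.loc[0, "ward"] = "None"

# Inner join keeps only keys present in both tables.
full = wards_toy.merge(census_toy, on="ward")
altered = wards_toy.merge(census_toy_altered, on="ward")
print("original merge shape:", full.shape)
print("altered merge shape:", altered.shape)

# Wards that found no match in the altered table.
unmatched = wards_toy.loc[~wards_toy["ward"].isin(census_toy_altered["ward"]), "ward"]
print(unmatched.tolist())
```

Comparing the two shapes is a quick way to notice that an inner join silently dropped rows.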

## One-to-many merge
A business may have one or multiple owners. In this exercise, you will continue to gain experience with one-to-many merges by merging a table of business owners, called biz_owners (loaded as owners in the cells below), to the licenses table. Recall from the video lesson that, with a one-to-many relationship, a row in the left table may be repeated if it is related to multiple rows in the right table. In this lesson, you will explore this further by finding the most common business owner title (e.g., secretary, CEO, or vice president).

The licenses and biz_owners DataFrames are loaded for you.


In [14]:
owners = pd.read_pickle('dataset/business_owners.p')
licenses = pd.read_pickle('dataset/licenses.p')

display(owners.head(), licenses.head())

Unnamed: 0,account,first_name,last_name,title
0,10,PEARL,SHERMAN,PRESIDENT
1,10,PEARL,SHERMAN,SECRETARY
2,10002,WALTER,MROZEK,PARTNER
3,10002,CELINA,BYRDAK,PARTNER
4,10005,IRENE,ROSENFELD,PRESIDENT


Unnamed: 0,account,ward,aid,business,address,zip
0,307071,3,743.0,REGGIE'S BAR & GRILL,2105 S STATE ST,60616
1,10,10,829.0,HONEYBEERS,13200 S HOUSTON AVE,60633
2,10002,14,775.0,CELINA DELI,5089 S ARCHER AVE,60632
3,10005,12,,KRAFT FOODS NORTH AMERICA,2005 W 43RD ST,60609
4,10044,44,638.0,NEYBOUR'S TAVERN & GRILLE,3651 N SOUTHPORT AVE,60613


In [27]:
# Merge owners with licenses on the shared account column
new_data = pd.merge(owners, licenses, on='account')
display(new_data.head())

Unnamed: 0,account,first_name,last_name,title,ward,aid,business,address,zip
0,10,PEARL,SHERMAN,PRESIDENT,10,829.0,HONEYBEERS,13200 S HOUSTON AVE,60633
1,10,PEARL,SHERMAN,SECRETARY,10,829.0,HONEYBEERS,13200 S HOUSTON AVE,60633
2,10002,WALTER,MROZEK,PARTNER,14,775.0,CELINA DELI,5089 S ARCHER AVE,60632
3,10002,CELINA,BYRDAK,PARTNER,14,775.0,CELINA DELI,5089 S ARCHER AVE,60632
4,10005,IRENE,ROSENFELD,PRESIDENT,12,,KRAFT FOODS NORTH AMERICA,2005 W 43RD ST,60609


In [30]:
display(owners[owners['first_name'] == 'PEARL'])
display(licenses[licenses['account'] == '10'])
display(new_data[new_data['first_name'] == 'PEARL'])

Unnamed: 0,account,first_name,last_name,title
0,10,PEARL,SHERMAN,PRESIDENT
1,10,PEARL,SHERMAN,SECRETARY


Unnamed: 0,account,ward,aid,business,address,zip
1,10,10,829,HONEYBEERS,13200 S HOUSTON AVE,60633


Unnamed: 0,account,first_name,last_name,title,ward,aid,business,address,zip
0,10,PEARL,SHERMAN,PRESIDENT,10,829,HONEYBEERS,13200 S HOUSTON AVE,60633
1,10,PEARL,SHERMAN,SECRETARY,10,829,HONEYBEERS,13200 S HOUSTON AVE,60633
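
The cells above stop short of counting titles. A minimal sketch of that last step, on hypothetical toy tables (licenses_toy, owners_toy, and their values are made up for illustration), might look as follows. It puts the licenses table on the left so that a license row repeats once per owner, and it passes validate="one_to_many" so that pandas raises an error if account turns out not to be unique in the left table.

```python
import pandas as pd

# Hypothetical toy tables: one license per account, possibly several owners per account.
licenses_toy = pd.DataFrame({
    "account": ["10", "20"],
    "business": ["SHOP A", "SHOP B"],
})
owners_toy = pd.DataFrame({
    "account": ["10", "10", "20"],
    "last_name": ["SHERMAN", "SHERMAN", "DOE"],
    "title": ["PRESIDENT", "SECRETARY", "PRESIDENT"],
})

# One-to-many merge: the license for account 10 appears twice, once per owner row.
licenses_owners = licenses_toy.merge(owners_toy, on="account", validate="one_to_many")
print(licenses_owners)

# Count titles, most common first.
title_counts = licenses_owners["title"].value_counts()
print(title_counts)
```

The same .value_counts() call on the title column of the real merged table would answer the question posed in this exercise.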


## Total riders in a month
Your goal is to find the total number of rides provided to passengers passing through the Wilson station (station_name == 'Wilson') when riding Chicago's public transportation system on weekdays (day_type == 'Weekday') in July (month == 7). Luckily, Chicago provides this detailed data, but it is spread across three different tables. You will merge these tables together to answer the question. This data is different from the business-related data you have seen so far, but all the information you need to answer the question is below.

The cal, ridership, and stations DataFrames have been loaded for you. The relationship between the tables can be seen in the diagram below.

Table diagram. The cal table relates to ridership via year, month, and day. The ridership table relates to the stations table via station_id.

In [7]:
import pandas as pd

stations = pd.read_pickle('dataset/cta_stations.p')
ridership = pd.read_pickle('dataset/cta_ridership.p')
cal = pd.read_pickle('dataset/cta_calendar.p')
display(stations.head(), cal.head(), ridership.head())

# Merge the ridership, cal, and stations tables
ridership_cal_stations = ridership.merge(cal, on=['year', 'month', 'day']) \
                            .merge(stations, on='station_id')

# Create a filter to filter ridership_cal_stations
filter_criteria = ( (ridership_cal_stations['month'] == 7)
                   & (ridership_cal_stations['day_type'] == 'Weekday')
                   & (ridership_cal_stations['station_name'] == 'Wilson')
)

# Use .loc and the filter to select the matching rides, then sum them
print(ridership_cal_stations.loc[filter_criteria, 'rides'].sum())

# Inspect the merged table and the boolean filter
display(ridership_cal_stations, filter_criteria)

Unnamed: 0,station_id,station_name,location
0,40010,Austin-Forest Park,"(41.870851, -87.776812)"
1,40020,Harlem-Lake,"(41.886848, -87.803176)"
2,40030,Pulaski-Lake,"(41.885412, -87.725404)"
3,40040,Quincy/Wells,"(41.878723, -87.63374)"
4,40050,Davis,"(42.04771, -87.683543)"


Unnamed: 0,year,month,day,day_type
0,2019,1,1,Sunday/Holiday
1,2019,1,2,Weekday
2,2019,1,3,Weekday
3,2019,1,4,Weekday
4,2019,1,5,Saturday


Unnamed: 0,station_id,year,month,day,rides
0,40010,2019,1,1,576
1,40010,2019,1,2,1457
2,40010,2019,1,3,1543
3,40010,2019,1,4,1621
4,40010,2019,1,5,719


140005


Unnamed: 0,station_id,year,month,day,rides,day_type,station_name,location
0,40010,2019,1,1,576,Sunday/Holiday,Austin-Forest Park,"(41.870851, -87.776812)"
1,40010,2019,1,2,1457,Weekday,Austin-Forest Park,"(41.870851, -87.776812)"
2,40010,2019,1,3,1543,Weekday,Austin-Forest Park,"(41.870851, -87.776812)"
3,40010,2019,1,4,1621,Weekday,Austin-Forest Park,"(41.870851, -87.776812)"
4,40010,2019,1,5,719,Saturday,Austin-Forest Park,"(41.870851, -87.776812)"
...,...,...,...,...,...,...,...,...
3280,41660,2019,12,27,13898,Weekday,Lake/State,"(41.884809, -87.627813)"
3281,41660,2019,12,28,9485,Saturday,Lake/State,"(41.884809, -87.627813)"
3282,41660,2019,12,29,7581,Sunday/Holiday,Lake/State,"(41.884809, -87.627813)"
3283,41660,2019,12,30,15332,Weekday,Lake/State,"(41.884809, -87.627813)"


0       False
1       False
2       False
3       False
4       False
        ...  
3280    False
3281    False
3282    False
3283    False
3284    False
Length: 3285, dtype: bool
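
The merge-then-filter pattern above can be wrapped in a small helper so it is easy to check on data where the answer is known. The sketch below is one possible arrangement, not part of the course material: total_rides is a hypothetical helper name, and the toy tables and their values are made up.

```python
import pandas as pd

def total_rides(ridership, cal, stations, station_name, month, day_type):
    """Sum rides for one station, month, and day type (hypothetical helper)."""
    # Chain two inner joins: ridership -> cal on the date parts, then -> stations.
    merged = (
        ridership.merge(cal, on=["year", "month", "day"])
                 .merge(stations, on="station_id")
    )
    # Boolean mask combining the three conditions.
    mask = (
        (merged["month"] == month)
        & (merged["day_type"] == day_type)
        & (merged["station_name"] == station_name)
    )
    return merged.loc[mask, "rides"].sum()

# Toy data with a known answer: Wilson has two weekday rows in July (100 + 120).
cal_toy = pd.DataFrame({
    "year": [2019, 2019, 2019],
    "month": [7, 7, 7],
    "day": [5, 6, 8],
    "day_type": ["Weekday", "Saturday", "Weekday"],
})
stations_toy = pd.DataFrame({
    "station_id": ["1", "2"],
    "station_name": ["Wilson", "Davis"],
})
ridership_toy = pd.DataFrame({
    "station_id": ["1", "1", "1", "2"],
    "year": [2019, 2019, 2019, 2019],
    "month": [7, 7, 7, 7],
    "day": [5, 6, 8, 5],
    "rides": [100, 50, 120, 999],
})

wilson_total = total_rides(ridership_toy, cal_toy, stations_toy, "Wilson", 7, "Weekday")
print(wilson_total)
```

The Saturday row and the Davis row are excluded by the mask, so only the two Wilson weekday rows contribute to the sum.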