# Inner join

Pandas can combine data from different tables by **merging or joining** them. This is done with the `merge()` function.

When two tables share a common column, merging aligns rows based on that column. By default, this creates an **inner join**, meaning only rows whose key value appears in both tables are kept.
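
As a minimal sketch of that default behaviour, the example below merges two small made-up tables. The names `left`, `right`, `key`, `left_val` and `right_val` are hypothetical and are not part of the course data.

```python
import pandas as pd

# Hypothetical toy tables, used only for illustration
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# how="inner" is the default for pd.merge, so it is not passed explicitly
inner = pd.merge(left, right, on="key")

# Keys 1 and 4 each appear in only one table, so they are dropped
print(inner)
```

Only the keys 2 and 3 survive, because they are the only values found in both `left` and `right`.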

If both tables have non-key columns with the same name, pandas appends the suffixes `_x` (left table) and `_y` (right table) to distinguish them. Custom suffixes can also be set with the `suffixes` argument for clarity.
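
The suffix behaviour can be sketched with two tiny tables that both carry an `owner` column. The tables `owners` and `vehicles` and their values are invented for illustration; only the `suffixes` argument of `pd.merge` is the point here.

```python
import pandas as pd

# Hypothetical toy tables that share the key "vid" and the column "owner"
owners = pd.DataFrame({"vid": [1, 2], "owner": ["Ann", "Bob"]})
vehicles = pd.DataFrame(
    {"vid": [1, 2], "owner": ["Fleet A", "Fleet B"], "fuel_type": ["HYBRID", "GASOLINE"]}
)

# Default suffixes: the overlapping column becomes owner_x and owner_y
default_merge = pd.merge(owners, vehicles, on="vid")
print(list(default_merge.columns))

# Custom suffixes make the origin of each column easier to read
custom_merge = pd.merge(owners, vehicles, on="vid", suffixes=("_own", "_veh"))
print(list(custom_merge.columns))
```

The key column `vid` is not suffixed, since it is shared by design.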

## Prepare Data

In [2]:
# Import pandas library
import pandas as pd

# Load the pickled tables
taxi_owners = pd.read_pickle("datasets/taxi_owners.p")
taxi_vehicles = pd.read_pickle("datasets/taxi_vehicles.p")
wards = pd.read_pickle("datasets/ward.p")
census = pd.read_pickle("datasets/census.p")

# Load the altered versions of wards and census from CSV
wards_altered = pd.read_csv("datasets/wards_altered.csv")
census_altered = pd.read_csv("datasets/census_altered.csv")

## Exercise: Your first inner join

You need to identify the most commonly used fuel type among Chicago taxis. To do this, combine the `taxi_owners` and `taxi_vehicles` tables using the `vid` column as the key. Once merged, examine the `fuel_type` column to determine the most frequent values.

### Instructions

1. Merge the `taxi_owners` and `taxi_vehicles` DataFrames on the column `vid` and store the result in `combined_taxi_data`.
2. Add suffixes `_own` and `_veh` to overlapping column names so they can be distinguished.
3. Use `.value_counts()` on the `fuel_type` column to see which fuel types are most common.

In [3]:
# Merge the DataFrames on vid with custom suffixes
combined_taxi_data = pd.merge(taxi_owners, taxi_vehicles, on="vid", suffixes=("_own", "_veh"))

# Check the available columns
print(combined_taxi_data.columns)

# Find the most common fuel type
fuel_counts = combined_taxi_data["fuel_type"].value_counts()
print(fuel_counts)

Index(['rid', 'vid', 'owner_own', 'address', 'zip', 'make', 'model', 'year',
       'fuel_type', 'owner_veh'],
      dtype='object')
fuel_type
HYBRID                    2792
GASOLINE                   611
FLEX FUEL                   89
COMPRESSED NATURAL GAS      27
Name: count, dtype: int64
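
If only the single most frequent category is needed rather than the full table of counts, one possible approach is sketched below on a small made-up Series. The data in `fuel` is hypothetical; `value_counts()`, `idxmax()` and the `normalize` argument are standard pandas calls.

```python
import pandas as pd

# Hypothetical fuel types, standing in for combined_taxi_data["fuel_type"]
fuel = pd.Series(
    ["HYBRID", "GASOLINE", "HYBRID", "FLEX FUEL", "HYBRID"], name="fuel_type"
)

# Absolute counts, sorted from most to least frequent
counts = fuel.value_counts()

# Label of the most frequent value
most_common = counts.idxmax()

# Relative frequencies instead of absolute counts
share = fuel.value_counts(normalize=True)

print(most_common, share[most_common])
```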


## Exercise: Inner joins and number of rows returned

So far, all the merges you’ve worked with are **inner joins**. An inner join returns only the rows where the key values exist in both tables. In this exercise, you’ll see how altering the data affects the number of rows returned.

You’ll first merge the original `wards` and `census` tables. Then, you’ll compare the results when merging with slightly modified versions of these tables (`wards_altered` and `census_altered`), where the first row in the `ward` column has been changed.

Note: Both `wards` and `census` originally contain 50 rows.

### Instructions

1. Merge the original `wards` and `census` tables on the `ward` column and check the size of the resulting DataFrame.
2. Merge `wards_altered` with `census` on the `ward` column and observe the change in the number of rows.
3. Merge `wards` with `census_altered` on the `ward` column and compare the result.

In [4]:
# Merge original wards and census tables

wards_census = pd.merge(wards, census, on="ward")
print("Shape of wards_census:", wards_census.shape)

Shape of wards_census: (50, 9)


In [5]:
# Preview the change in wards_altered
print("\nFirst few ward values in wards_altered:")
print(wards_altered["ward"].head())


First few ward values in wards_altered:
0    61
1     2
2     3
3     4
4     5
Name: ward, dtype: int64


In [6]:
# Merge wards_altered with census after aligning 'ward' column types
wards_altered["ward"] = wards_altered["ward"].astype(str)
census["ward"] = census["ward"].astype(str)
wards_altered_census = pd.merge(wards_altered, census, on="ward")
print("Shape of wards_altered_census:", wards_altered_census.shape)

Shape of wards_altered_census: (49, 9)
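
To see which key values caused a row to drop out, one option is an outer join with `indicator=True`, which adds a `_merge` column recording where each row came from. The sketch below uses small made-up tables (`wards_toy`, `census_toy`, and their columns are hypothetical) rather than the course data.

```python
import pandas as pd

# Hypothetical toy tables: the first ward value on the left was altered to "61"
wards_toy = pd.DataFrame({"ward": ["61", "2", "3"], "alderman": ["A", "B", "C"]})
census_toy = pd.DataFrame({"ward": ["1", "2", "3"], "pop_2010": [100, 200, 300]})

# Inner join: only wards "2" and "3" exist in both tables
inner = pd.merge(wards_toy, census_toy, on="ward")
print(inner.shape)

# Outer join with an indicator column to expose the unmatched keys
outer = pd.merge(wards_toy, census_toy, on="ward", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]
print(unmatched[["ward", "_merge"]])
```

Rows marked `left_only` or `right_only` are exactly the ones an inner join discards.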


In [7]:
# Preview the change in census_altered
print("\nFirst few ward values in census_altered:")
print(census_altered["ward"].head())


First few ward values in census_altered:
0    NaN
1    2.0
2    3.0
3    4.0
4    5.0
Name: ward, dtype: float64


In [8]:
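# Merge wards with census_altered
# Caution: 'ward' is float64 here (it contains NaN), so astype(str) produces
# keys such as '2.0', which presumably do not match the string keys in wards
census_altered["ward"] = census_altered["ward"].astype(str)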
wards_census_altered = pd.merge(wards, census_altered, on="ward")
print("Shape of wards_census_altered:", wards_census_altered.shape)

Shape of wards_census_altered: (0, 9)
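
The zero-row result above is most likely a side effect of the float-to-string conversion rather than of the single altered row: a float such as `2.0` becomes the string `'2.0'`, which does not equal `'2'`. The sketch below reproduces the effect on made-up tables and shows one possible fix, assuming the ward keys in `wards` are strings such as `'2'`. The names `wards_toy`, `census_toy` and `census_clean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables: string keys on the left, float keys with NaN on the right
wards_toy = pd.DataFrame({"ward": ["1", "2", "3"]})
census_toy = pd.DataFrame({"ward": [np.nan, 2.0, 3.0], "pop_2010": [100, 200, 300]})

# Naive conversion: floats keep their decimal part when turned into strings
naive_keys = census_toy["ward"].astype(str)
naive = pd.merge(wards_toy, census_toy.assign(ward=naive_keys), on="ward")
print(naive.shape)  # no keys match

# One possible fix: drop the missing key, then go through int before str
census_clean = census_toy.dropna(subset=["ward"]).copy()
census_clean["ward"] = census_clean["ward"].astype(int).astype(str)
fixed = pd.merge(wards_toy, census_clean, on="ward")
print(fixed.shape)  # wards "2" and "3" match
```

Dropping the row with the missing key does not change the inner join result, since a missing key cannot match any of the string keys on the left.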
