## Exercises: Explore the dataset

In [1]:
import pandas as pd
import seaborn as sns
taxis = sns.load_dataset("taxis")

**Explore the "taxis" dataset to answer the following questions:**

**Q1:** How many rows and columns are in the dataset?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;<b>Rows:</b> 6433
&nbsp;&nbsp;&nbsp;<b>Columns:</b> 14
</details>

In [127]:
rows, cols = taxis.shape
print(f"Rows: {rows}")
print(f"Columns: {cols}")

Rows: 6433
Columns: 14


**Q2:** What datatype is the most common in the set?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;object (6 columns)
</details>

In [21]:
column_dtypes = dict()
for column in taxis.columns:
    dtype = str(taxis[column].dtype)
    if dtype in column_dtypes:
        column_dtypes[dtype] += 1
    else:
        column_dtypes[dtype] = 1

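# pick the dtype with the highest count (a bare max() would compare the dtype names alphabetically)
most_common_dtype = max(column_dtypes, key=column_dtypes.get)
occurrences = column_dtypes[most_common_dtype]
print(f"{most_common_dtype} ({occurrences} columns)")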

object (6 columns)
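
or, as a shorter sketch, the counting loop can be replaced by calling `value_counts()` on the `dtypes` Series. The small frame below is hypothetical stand-in data (not the real taxis dataset), so the example runs without downloading anything:

```python
import pandas as pd

# Hypothetical stand-in data: one int64 column and three float64 columns
df = pd.DataFrame({
    "passengers": [1, 2, 1],
    "distance": [1.6, 0.79, 1.37],
    "fare": [7.0, 5.0, 7.5],
    "tip": [2.15, 0.0, 2.36],
})

# Count how many columns share each dtype
dtype_counts = df.dtypes.value_counts()

# idxmax() returns the dtype with the highest count
most_common_dtype = dtype_counts.idxmax()
occurrences = dtype_counts.max()
print(f"{most_common_dtype} ({occurrences} columns)")
```

With the real dataset, the same two calls on `taxis.dtypes` should reproduce the answer above.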


**Q3:** What is the average number of passengers in a taxi?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;1.54
</details>

In [130]:
average_passengers = taxis["passengers"].mean()
print(f"Average number of passengers: {average_passengers:.2f}")

Average number of passengers: 1.54


**Q4:** What is the most common number of passengers in a taxi?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;1
</details>

In [48]:
most_common_passengers = taxis["passengers"].value_counts().idxmax()
print(f"The most common number of passengers is {most_common_passengers}.")

The most common number of passengers is 1.


or:

In [137]:
most_common_passengers = taxis["passengers"].mode()[0]
print(f"The most common number of passengers is {most_common_passengers}.")

The most common number of passengers is 1.


**Q5:** What is the most common payment method?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;credit card
</details>

In [140]:
most_common_payment = taxis["payment"].value_counts().idxmax()
print(f"The most common payment method is {most_common_payment}.")

The most common payment method is credit card.


or:

In [141]:
most_common_payment = taxis["payment"].mode()[0]
print(f"The most common payment method is {most_common_payment}.")

The most common payment method is credit card.


**Q6:** Which of the categorical features has the most categories?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;dropoff_zone (203 categories)
</details>

In [142]:
taxis_categorized = taxis.dropna().astype({
    "color": "category",
    "payment": "category",
    "pickup_zone": "category",
    "dropoff_zone": "category",
    "pickup_borough": "category",
    "dropoff_borough": "category",
})

cols_unique_values = taxis_categorized.describe(include="category").loc["unique"]
cols_unique_values.sort_values(ascending=False, inplace=True)

col_most_values = cols_unique_values.index[0]
unique_values = taxis_categorized[col_most_values].unique()

print(f"{col_most_values} ({unique_values.size} categories)")

dropoff_zone (203 categories)


**Q7:** What percentage of cars in the set are yellow?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;84.7%
</details>

In [113]:
yellow_cars = taxis[taxis["color"] == "yellow"]
yellow_cars_portion = len(yellow_cars) / len(taxis)

print(f"{yellow_cars_portion:.1%} of the cars are yellow.")

84.7% of the cars are yellow.
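
or, as a sketch, `value_counts(normalize=True)` returns the share of each value directly. The Series below is hypothetical stand-in data, not the real `color` column:

```python
import pandas as pd

# Hypothetical stand-in for taxis["color"]
color = pd.Series(["yellow", "yellow", "green", "yellow"])

# normalize=True gives relative frequencies instead of counts
color_share = color.value_counts(normalize=True)
yellow_share = color_share["yellow"]

print(f"{yellow_share:.1%} of the cars are yellow.")
```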


**Q8:** Which dropoff borough is most common? Which one is least common?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;<b>Most common:</b> Manhattan (5206)<br>
&nbsp;&nbsp;&nbsp;<b>Least common:</b> Staten Island (2)<br>
</details>

In [148]:
taxis

,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.60,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.00,0.0,9.30,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.70,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.10,0.0,13.40,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6428,2019-03-31 09:51:53,2019-03-31 09:55:27,1,0.75,4.5,1.06,0.0,6.36,green,credit card,East Harlem North,Central Harlem North,Manhattan,Manhattan
6429,2019-03-31 17:38:00,2019-03-31 18:34:23,1,18.74,58.0,0.00,0.0,58.80,green,credit card,Jamaica,East Concourse/Concourse Village,Queens,Bronx
6430,2019-03-23 22:55:18,2019-03-23 23:14:25,1,4.14,16.0,0.00,0.0,17.30,green,cash,Crown Heights North,Bushwick North,Brooklyn,Brooklyn
6431,2019-03-04 10:09:25,2019-03-04 10:14:29,1,1.12,6.0,0.00,0.0,6.80,green,credit card,East New York,East Flatbush/Remsen Village,Brooklyn,Brooklyn


In [118]:
taxis["dropoff_borough"].value_counts()

dropoff_borough
Manhattan        5206
Queens            542
Brooklyn          501
Bronx             137
Staten Island       2
Name: count, dtype: int64

In [125]:
most_common_borough = taxis["dropoff_borough"].value_counts().idxmax()
most_common_count = len(taxis[taxis["dropoff_borough"] == most_common_borough])
print(f"Most common: {most_common_borough} ({most_common_count})")

least_common_borough = taxis["dropoff_borough"].value_counts().idxmin()
least_common_count = len(taxis[taxis["dropoff_borough"] == least_common_borough])
print(f"Least common: {least_common_borough} ({least_common_count})")

Most common: Manhattan (5206)
Least common: Staten Island (2)


**Q9:** Which column has the most missing values? How many?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;<i>dropoff_zone</i> and <i>dropoff_borough</i> both have 45 missing values.
</details>

In [199]:
missing_counts = taxis.isna().sum()

max_missing = missing_counts.max()

# get the columns with the most missing values
top_cols = missing_counts[missing_counts == max_missing]

if len(top_cols) == 1:
    print(f"{top_cols.index[0]} has {top_cols.iloc[0]} missing values.")
elif len(top_cols) == 2:
    print(
        f"{top_cols.index[0]} and {top_cols.index[1]} "
        f"both have {top_cols.iloc[0]} missing values."
    )
elif len(top_cols) > 2:
    for i in range(len(top_cols)):
        if i == len(top_cols) - 1:
            print(f"{top_cols.index[i]}", end="")
        elif i == len(top_cols) - 2:
            print(f"{top_cols.index[i]}", end=" and ")
        else:
            print(f"{top_cols.index[i]}", end=", ")
    print(f" all have {top_cols.iloc[0]} missing values.")

dropoff_zone and dropoff_borough both have 45 missing values.


### Memory usage
``` taxis.info(memory_usage="deep") ``` gives you the total memory usage of the dataframe.

``` taxis.memory_usage(deep=True) ``` gives you the memory usage of each column (and of the index).
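
To illustrate what `deep=True` changes, here is a minimal sketch on a hypothetical two-column frame (stand-in data, not the taxis dataset). For *object* columns, the shallow measurement only counts the 8-byte references, while the deep measurement also counts the Python strings they point to. For numeric columns, both give the same number.

```python
import pandas as pd

# Hypothetical stand-in data; dtypes are set explicitly for clarity
df = pd.DataFrame({
    "passengers": pd.Series([1, 3, 2], dtype="int64"),
    "borough": pd.Series(["Manhattan", "Queens", "Brooklyn"], dtype="object"),
})

shallow = df.memory_usage(deep=False)  # object columns: references only
deep = df.memory_usage(deep=True)      # object columns: references + string contents

# int64 column: 3 values * 8 bytes = 24 bytes in both measurements
print(shallow["passengers"], deep["passengers"])

# object column: the deep measurement is larger than the shallow one
print(shallow["borough"], deep["borough"])
```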

**Answer the following questions:**

**Q10:** What is the total memory usage of the dataframe?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;2.9 MB
</details>

In [229]:
mem_usage_bytes = taxis.memory_usage(deep=True).sum()
mem_usage_megabytes = mem_usage_bytes / (1024**2)
print(f"{mem_usage_megabytes:.1f} MB")

2.9 MB


**Q11:** Which column takes up the most memory? How many kilobytes?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;pickup_zone (470 KB)
</details>

In [219]:
max_usage_col = taxis.memory_usage(deep=True).idxmax()
col_usage_bytes = taxis.memory_usage(deep=True).max()
# note: decimal kilobytes here (1 KB = 1000 bytes); the other cells divide by 1024
col_usage_kilobytes = col_usage_bytes / 1000

print(f"{max_usage_col} ({col_usage_kilobytes:.0f} KB)")

pickup_zone (470 KB)


**Q12:** Why do the numeric columns all take up exactly 51464 bytes?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;They all use 64-bit datatypes. 64 bits = 8 bytes, and 6433 entries * 8 bytes = 51464 bytes.
</details>

In [225]:
51464 / len(taxis)

8.0
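
The same arithmetic can be sketched from the other direction by asking NumPy how many bytes one value of each datatype occupies. This assumes the row count from Q1, and that the `pickup` and `dropoff` columns are stored as `datetime64[ns]`, which is also a 64-bit type:

```python
import numpy as np

n_rows = 6433  # row count from Q1

# itemsize is the number of bytes a single value of the dtype occupies
dtype_names = ["int64", "float64", "datetime64[ns]"]
column_bytes = {name: np.dtype(name).itemsize * n_rows for name in dtype_names}

for name, total in column_bytes.items():
    print(f"{name}: {np.dtype(name).itemsize} bytes * {n_rows} rows = {total} bytes")
```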

**Q13:** What is the total memory usage after converting all *object* columns to *category*?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;494.0 KB
</details>

In [233]:
taxis_categorized = taxis.astype({
    "color": "category",
    "payment": "category",
    "pickup_zone": "category",
    "dropoff_zone": "category",
    "pickup_borough": "category",
    "dropoff_borough": "category",
})

mem_usage_bytes = taxis_categorized.memory_usage(deep=True).sum()
mem_usage_kilobytes = mem_usage_bytes / (1024)
print(f"{mem_usage_kilobytes:.1f} KB")

494.0 KB
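
or, as a sketch, the *object* columns can be picked out with `select_dtypes` instead of listing them by hand. The frame below is hypothetical stand-in data with explicitly typed *object* columns:

```python
import pandas as pd

# Hypothetical stand-in data (not the real taxis dataset)
df = pd.DataFrame({
    "color": pd.Series(["yellow", "green", "yellow", "yellow"], dtype="object"),
    "payment": pd.Series(["cash", "credit card", "cash", "cash"], dtype="object"),
    "fare": [7.0, 5.0, 7.5, 27.0],
})

# Collect the names of all object columns
object_cols = df.select_dtypes(include="object").columns

# Build the same kind of mapping as above, but automatically
df_categorized = df.astype({col: "category" for col in object_cols})

print(df_categorized.dtypes)
```

Note that on a frame this small the conversion does not necessarily save memory, since each category column also stores its list of categories. The savings show up when a column has many rows and few distinct values.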


**Q14:** ... and after also converting *float64* to *float32*?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;368.4 KB
</details>

In [235]:
taxis_float32 = taxis_categorized.astype({
    "distance": "float32",
    "fare": "float32",
    "tip": "float32",
    "tolls": "float32",
    "total": "float32",
})

mem_usage_bytes = taxis_float32.memory_usage(deep=True).sum()
mem_usage_kilobytes = mem_usage_bytes / (1024)
print(f"{mem_usage_kilobytes:.1f} KB")

368.4 KB


**Q15:** What is the smallest datatype we can convert passengers to? What is the total memory usage after converting passengers to the new type?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;The maximum number of passengers in the dataset is 6,<br>
&nbsp;&nbsp;&nbsp;and therefore the values easily fit into the <i>int8</i> type (8-bit integer).<br>
<br>
&nbsp;&nbsp;&nbsp;New size: 324.4 KB
</details>

In [238]:
taxis_int8 = taxis_float32.astype({
    "passengers": "int8",
})

mem_usage_bytes = taxis_int8.memory_usage(deep=True).sum()
mem_usage_kilobytes = mem_usage_bytes / (1024)
print(f"{mem_usage_kilobytes:.1f} KB")

324.4 KB
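
One way to check that the values fit, sketched on a hypothetical stand-in Series, is to compare the column's minimum and maximum with the range reported by `np.iinfo`. Alternatively, `pd.to_numeric` with `downcast="integer"` picks the smallest signed integer type that can hold the values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for taxis["passengers"]
passengers = pd.Series([1, 1, 3, 6, 0, 2], dtype="int64")

# Range of values an int8 can hold
int8_info = np.iinfo(np.int8)
print(f"int8 range: {int8_info.min} to {int8_info.max}")

# Do all values fit into that range?
fits_int8 = bool(int8_info.min <= passengers.min() and passengers.max() <= int8_info.max)
print(f"Fits into int8: {fits_int8}")

# Let pandas choose the smallest signed integer type
downcast = pd.to_numeric(passengers, downcast="integer")
print(downcast.dtype)
```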


**Q16:** What percentage of the original dataset's size is the new dataset after converting all the types as above?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;11.0 %
</details>

In [241]:
original_size = taxis.memory_usage(deep=True).sum()
new_size = taxis_int8.memory_usage(deep=True).sum()
portion = new_size / original_size
print(f"The new dataset is {portion:.1%} the size of the original dataset after converting all the types.")

The new dataset is 11.0% the size of the original dataset after converting all the types.


### Final note:
Just to be clear: if we want to limit memory usage by choosing datatypes with a smaller memory footprint, it makes more sense to do so when loading the dataset into pandas than to change the types afterwards (as in the examples above).

The most common ways to load data into pandas (such as `pd.read_csv` and `pd.read_json`) provide an optional `dtype` parameter for setting the datatypes as the files are read into dataframes.

Also, note that this is really only a concern when working with huge datasets. For smaller datasets, like the one in the exercises above, it doesn't really matter, and optimizing might just be unnecessary work. The exercises above simply serve as examples to better understand datatypes and their memory footprints.
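
A minimal sketch of setting the datatypes at load time, using `pd.read_csv` and its `dtype` parameter. The CSV text is a hypothetical in-memory sample with a few taxis-like columns, so the example does not depend on any file:

```python
import io

import pandas as pd

# Hypothetical sample data; in practice this would be a file path
csv_text = """passengers,distance,fare,color
1,1.60,7.0,yellow
1,0.79,5.0,yellow
3,2.16,9.0,green
"""

# Map each column to the compact datatype it should be read as
compact_dtypes = {
    "passengers": "int8",
    "distance": "float32",
    "fare": "float32",
    "color": "category",
}

df = pd.read_csv(io.StringIO(csv_text), dtype=compact_dtypes)
print(df.dtypes)
```

This way the columns never exist in their larger default types, so there is no temporary copy to convert afterwards.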