![](https://images.unsplash.com/photo-1521180104672-66e895511a57?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=1189&q=80)

In this notebook we'll see in which community areas of **Chicago** taxi drivers get **the highest tips**. Let's start by importing the necessary tools from **bigquery** and **bq_helper** and looking at the column descriptions. I'll comment my work in both English and German.

*In diesem Notebook werden wir sehen, in welchen Stadtbezirken von **Chicago** Taxifahrer die höchsten **Trinkgelder** bekommen. Wir starten mit dem Importieren von Werkzeugen aus **bigquery** und **bq_helper**. Unterwegs kommentiere ich auf Deutsch und Englisch.*

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "chicago_taxi_trips" dataset
dataset_ref = client.dataset("chicago_taxi_trips", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "taxi_trips" table
table_ref = dataset_ref.table("taxi_trips")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()

In [None]:
import bq_helper
from bq_helper import BigQueryHelper

In [None]:
bq_assistant = BigQueryHelper("bigquery-public-data", "chicago_taxi_trips")

In [None]:
bq_assistant.table_schema("taxi_trips")

In [None]:
def show_amount_of_data_scanned(query):
    # dry_run lets us see how much data the query uses without running it
    dry_run_config = bigquery.QueryJobConfig(dry_run=True)
    query_job = client.query(query, job_config=dry_run_config)
    print('Data processed: {} GB'.format(round(query_job.total_bytes_processed / 10**9, 3)))

I have 2 questions for this dataset:
1) What are the maximum, minimum and average fares for rides lasting 10 minutes or more?

2) Which drop-off areas have the highest average tip?

I can answer these questions directly with **SQL queries**. I do so in **Part 1** of this notebook, building on queries by Paul Mooney (https://www.kaggle.com/paultimothymooney/how-to-query-the-chicago-taxi-dataset).

In **Part 2** I create a relatively big DataFrame and answer these questions with **Pandas functions**.

*Für diese Datenbank habe ich 2 Fragen:*

*1) Max. - Min. - und Durchschnittskosten per Fahrt länger als 10 Minuten.*

*2) In welchen Zielstadtteilen wird das höchste Durchschnittstrinkgeld bezahlt?*

*Ich kann diese Fragen direkt per **SQL-Abfragen** beantworten. Das mache ich im **ersten Teil** dieser Arbeit, basierend auf Beispielen von Paul Mooney (https://www.kaggle.com/paultimothymooney/how-to-query-the-chicago-taxi-dataset).*

*Im **zweiten Teil** erstelle ich ein relativ großes DataFrame und beantworte diese Fragen mit **Pandas Funktionen**.*

# Part 1 / Teil 1

**1) What are the maximum, minimum and average fares for rides lasting 10 minutes or more?** I use only data for the year 2021.

***1) Max. - Min. - und Durchschnittskosten per Fahrt länger als 10 Minuten.*** *Ich benutze nur Daten für das Jahr 2021.*

In [None]:
query1 = """SELECT
  MAX(fare) AS maximum_fare,
  MIN(fare) AS minimum_fare,
  FORMAT('%3.2f', AVG(fare)) AS avg_fare
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  trip_seconds >= 600 AND EXTRACT(YEAR FROM trip_end_timestamp) = 2021
        """

show_amount_of_data_scanned(query1)

In [None]:
query1 = """SELECT
  MAX(fare) AS maximum_fare,
  MIN(fare) AS minimum_fare,
  FORMAT('%3.2f', AVG(fare)) AS avg_fare
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  trip_seconds >= 600 AND EXTRACT(YEAR FROM trip_end_timestamp) = 2021
        """

fare = client.query(query1).result().to_dataframe()
fare

**2) Which drop-off areas have the highest average tip in 2021?**

***2) In welchen Zielstadtteilen wird das höchste Trinkgeld bezahlt?***

In [None]:
query2 = """SELECT
  dropoff_community_area,
  -- ROUND keeps the values numeric; FORMAT returns strings, which ORDER BY would sort as text
  ROUND(AVG(tips), 2) AS average_tip,
  ROUND(MAX(tips), 2) AS max_tip
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE dropoff_community_area IS NOT NULL AND EXTRACT(YEAR FROM trip_end_timestamp) = 2021
GROUP BY
  dropoff_community_area
ORDER BY
  average_tip DESC
        """

show_amount_of_data_scanned(query2)

In [None]:
query2 = """SELECT
  dropoff_community_area,
  -- ROUND keeps the values numeric; FORMAT returns strings, which ORDER BY would sort as text
  ROUND(AVG(tips), 2) AS average_tip,
  ROUND(MAX(tips), 2) AS max_tip
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE dropoff_community_area IS NOT NULL AND EXTRACT(YEAR FROM trip_end_timestamp) = 2021
GROUP BY
  dropoff_community_area
ORDER BY
  average_tip DESC
        """

tip = client.query(query2).result().to_dataframe()
tip
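
A note on sorting averages: when a column of numbers is stored as text (for example the output of SQL `FORMAT('%3.2f', ...)`), it is sorted character by character rather than by value. The sketch below uses made-up values and hypothetical names (`tips_demo`, `average_tip_text`, `average_tip_num`) to show the difference.

```python
import pandas as pd

# Made-up averages, stored once as text and once as numbers
tips_demo = pd.DataFrame({
    "area": [76, 8, 32],
    "average_tip_text": ["9.50", "10.20", "2.00"],
})
tips_demo["average_tip_num"] = tips_demo["average_tip_text"].astype(float)

# Text is compared character by character, so "10.20" ends up below "2.00"
by_text = tips_demo.sort_values("average_tip_text", ascending=False)

# Numbers are compared by value
by_num = tips_demo.sort_values("average_tip_num", ascending=False)

print(by_text["area"].tolist())
print(by_num["area"].tolist())
```

Keeping averages numeric until the final display step avoids this kind of ordering surprise.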

We can read about [Community areas in Chicago](https://en.wikipedia.org/wiki/Community_areas_in_Chicago) on Wikipedia. Community area number 76 (O'Hare) contains an **airport**, which probably explains why tips are higher there. In general the average tips seem to mirror the **socioeconomic situation** in these communities.

*Über die [Community Areas in Chicago](https://en.wikipedia.org/wiki/Community_areas_in_Chicago) kann man bei Wikipedia nachlesen. In der Community Area Nummer 76 (O'Hare) gibt es einen **Flughafen**, deswegen bekommen die Taxifahrer hier vermutlich das höchste Trinkgeld. Ansonsten scheinen die Trinkgeldwerte die wirtschaftliche und **soziale Situation** der Bezirke widerzuspiegeln.*

Next I read my **CSV file** with **community area names** and **merge** it with the query result.

*Als Nächstes lese ich meine **CSV-Datei** mit **Bezirksnamen** und **füge** sie meinem DataFrame hinzu.*

In [None]:
import pandas as pd

areas_names = pd.read_csv("../input/chicagocsv/chicago_area.csv", sep=';')

areas_names.columns

In [None]:
tip_merged = pd.merge(tip, areas_names, on='dropoff_community_area')

In [None]:
columnsTitles = ['dropoff_community_area', 'community_name', 'average_tip', 'max_tip']

tip_merged = tip_merged.reindex(columns=columnsTitles)

tip_merged
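
By default `pd.merge` performs an inner join, so any area number missing from the names file silently disappears from the result. The following sketch, on made-up data with hypothetical names (`tip_demo`, `names_demo`), shows one way to check for such gaps using the `how`, `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Hypothetical stand-ins for the query result and the names file
tip_demo = pd.DataFrame({
    "dropoff_community_area": [8, 32, 76, 99],
    "average_tip": [2.1, 2.4, 6.3, 1.0],   # made-up values
})
names_demo = pd.DataFrame({
    "dropoff_community_area": [8, 32, 76],
    "community_name": ["Near North Side", "Loop", "O'Hare"],
})

# how="left" keeps every area from the tips table,
# validate raises an error if an area number is duplicated on either side,
# indicator adds a "_merge" column telling where each row was found
checked = pd.merge(
    tip_demo, names_demo,
    on="dropoff_community_area",
    how="left", validate="one_to_one", indicator=True,
)

# Areas that have tips but no name in the lookup table
unmatched = checked.loc[checked["_merge"] == "left_only", "dropoff_community_area"]
print(unmatched.tolist())
```

If `unmatched` is empty, the inner join used in the notebook loses nothing.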

We can see the average tips in the **scatterplot**.

*Man kann im **Scatterplot** die Durchschnittswerte von Trinkgelder sehen.*

In [None]:
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
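plt.figure(figsize=(10,12))
# Cast to float so the y-axis is numeric even if the query returned formatted strings
sns.scatterplot(x=tip['dropoff_community_area'], y=tip['average_tip'].astype(float))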

# Part 2 / Teil 2

Create a DataFrame and drop all rows with NaN values in the columns "fare" and "tips".

*Ein DataFrame erstellen und **Zeilen mit NaN-Werten in Spalten "fare" und "tips" entfernen**.*

In [None]:
full_query1 = """SELECT
            dropoff_community_area, trip_end_timestamp, fare, tips 
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  trip_seconds >= 600 AND EXTRACT(YEAR FROM trip_end_timestamp) = 2021
        """

show_amount_of_data_scanned(full_query1)

In [None]:
full_query1 = """SELECT
            dropoff_community_area, trip_end_timestamp, fare, tips 
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  trip_seconds >= 600 AND EXTRACT(YEAR FROM trip_end_timestamp) = 2021
        """

full = client.query(full_query1).result().to_dataframe()
full.head()

In [None]:
len(full)

In [None]:
full.describe()

In [None]:
missing_values_count = full.isnull().sum()
missing_values_count

In [None]:
full = full.dropna(subset=['fare', 'tips'])

In [None]:
missing_values_count = full.isnull().sum()
missing_values_count

Now we can **recalculate** the average fare for rides lasting 10 minutes or more.

*Jetzt können wir Durchschnittskosten per Fahrt länger als 10 Minuten **neu berechnen**.*

In [None]:
full

In [None]:
full.fare.mean()

In [None]:
full.tips.mean()
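
Question 1 also asks for the maximum and minimum fare. As a small sketch on made-up data (the frame `full_demo` is hypothetical and only mimics the `fare` and `tips` columns), all three statistics can be computed in one call with `Series.agg`.

```python
import numpy as np
import pandas as pd

# Made-up rides; NaN marks a missing value
full_demo = pd.DataFrame({
    "fare": [12.5, 30.0, np.nan, 7.5],
    "tips": [2.0, 5.0, 1.0, np.nan],
})

# Same cleaning step as in the notebook: keep rows with both fare and tips
clean_demo = full_demo.dropna(subset=["fare", "tips"])

# Maximum, minimum and mean fare in a single call
fare_summary = clean_demo["fare"].agg(["max", "min", "mean"])
print(fare_summary)
```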

In [None]:
mean_tips = full.groupby("dropoff_community_area").tips.mean()
mean_tips

I create **a new DataFrame** to answer the question about drop-off areas with the highest average tip.

*Ich erstelle **ein neues DataFrame**, um die Frage über die Durchnittswerte von Trinkgelder in Bezirken zu beantworten.*

In [None]:
mean_tips = mean_tips.to_frame()
mean_tips

**Rename** the column, **reset the index** and sort the values in **descending** order.

*Benenne die **Spalte** um, setze den **Index** zurück und **sortiere** die Werte **absteigend**.*

In [None]:
mean_tips = mean_tips.rename(columns={"tips": "avg_tips"})

mean_tips

In [None]:
mean_tips.reset_index(level=0, inplace=True)

In [None]:
mean_tips.sort_values(by='avg_tips', ascending=False)

Change the **format of values** to make them **easier to read**.

*Verändere die **Formatierung** der Werte, damit sie **lesbarer** sind.*

In [None]:
mean_tips = mean_tips.astype({'dropoff_community_area': 'int32'})
mean_tips = mean_tips.round({'avg_tips': 2})

In [None]:
mean_tips.sort_values(by='avg_tips', ascending=False)
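
The group, rename, reset-index, cast, round and sort steps above can also be written as one chain. This is only a sketch on made-up data (`trips_demo` is a hypothetical stand-in for `full`), using named aggregation so the column gets its final name directly.

```python
import pandas as pd

# Made-up trips; area numbers arrive as floats, as they often do after a query
trips_demo = pd.DataFrame({
    "dropoff_community_area": [8.0, 8.0, 76.0, 76.0, 32.0],
    "tips": [2.0, 3.0, 6.0, 7.0, 2.456],
})

mean_tips_demo = (
    trips_demo
    # as_index=False keeps the area as a regular column
    .groupby("dropoff_community_area", as_index=False)
    # named aggregation: new column "avg_tips" = mean of "tips"
    .agg(avg_tips=("tips", "mean"))
    .astype({"dropoff_community_area": "int32"})
    .round({"avg_tips": 2})
    .sort_values("avg_tips", ascending=False, ignore_index=True)
)
print(mean_tips_demo)
```

Because the sorted result is assigned to a variable, later steps such as the merge and the bar plot would see the areas in descending order.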

**Merge** two datasets to get **community area names**.

*Füge **2 Datasets** zusammen, um die Bezirksnamen zu sehen.*

In [None]:
mean_tips_join = pd.merge(mean_tips, areas_names, on='dropoff_community_area')

In [None]:
columnsTitles = ['dropoff_community_area', 'community_name', 'avg_tips']

mean_tips_join = mean_tips_join.reindex(columns=columnsTitles)

In [None]:
mean_tips_join.sort_values(by='avg_tips', ascending=False)

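**Summary**: The average fare was already calculated correctly with an SQL query in the first part of this notebook, because the number of NaN values was small and they didn't influence the result. The **average tips for community areas**, however, differ slightly from Part 1. Both SQL `AVG` and Pandas `mean()` skip missing values, so the difference most likely comes from two other changes: Part 2 only includes rides of 10 minutes or more (`query2` has no `trip_seconds` filter), and dropping rows with a missing fare also removes their tips. The work with the Pandas DataFrame was not in vain, because it shows exactly which rows enter each average.

*Fazit: Wir haben gesehen, dass die Durchschnittskosten pro Fahrt schon per SQL-Abfrage im ersten Teil richtig berechnet wurden, weil die Zahl der NaN-Werte gering war und sie keinen Einfluss auf das Endergebnis hatten. Die **Durchschnittswerte der Trinkgelder für die einzelnen Zielstadtteile** weichen dagegen ein bisschen von Teil 1 ab. Sowohl `AVG` in SQL als auch `mean()` in Pandas ignorieren fehlende Werte. Der Unterschied entsteht daher vermutlich vor allem dadurch, dass Teil 2 nur Fahrten ab 10 Minuten berücksichtigt (`query2` hat keinen `trip_seconds`-Filter) und dass mit den Zeilen ohne Fahrpreis auch deren Trinkgelder entfernt wurden. Die Mühe hat sich trotzdem gelohnt, weil wir jetzt genau sehen, welche Zeilen in jeden Durchschnitt eingehen.*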
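
To see how missing values enter these averages, here is a small sketch on made-up data (the frame `demo` is hypothetical). Pandas `mean()` skips NaN by default, so dropping rows where `tips` itself is NaN leaves the tip average unchanged, while dropping rows where only `fare` is NaN removes valid tips and can shift it.

```python
import numpy as np
import pandas as pd

# Made-up rides: one row lacks a fare, another lacks a tip
demo = pd.DataFrame({
    "fare": [10.0, np.nan, 20.0, 30.0],
    "tips": [1.0, 5.0, np.nan, 3.0],
})

# mean() ignores the NaN tip: (1 + 5 + 3) / 3
mean_all = demo["tips"].mean()

# Dropping rows with a NaN tip gives the same average
mean_tips_only = demo.dropna(subset=["tips"])["tips"].mean()

# Dropping rows with a NaN fare as well removes the valid tip of 5.0: (1 + 3) / 2
mean_both = demo.dropna(subset=["fare", "tips"])["tips"].mean()

print(mean_all, mean_tips_only, mean_both)
```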

Create a **bar plot** with **Chicago community area names**.

*Wir können jetzt eine **Grafik mit Namen** unserer Bezirke erstellen.*

In [None]:
mean_tips_join.plot.bar(x='community_name', y='avg_tips', rot=90)

plt.gcf().set_size_inches(20, 10)

plt.title("Average Taxi Tips in Chicago, by Community Areas")
plt.ylabel("Average tips (in dollars)")
plt.xlabel("Chicago Community Areas")


# Part 3 / Teil 3

In Part 3 we'll see whether there is any correlation between income, population groups and average taxi tips. For this purpose we'll use a dataset on basic demographics in Chicago community areas. Sadly it was last updated in 2013, so we cannot rely on it too much, but it can still give us a general picture.

*In Teil 3 werden wir sehen, ob es eine Korrelation zwischen Trinkgeldern, Einkommen und Einwohnergruppen gibt. Dafür benutzen wir eine Tabelle über die demographische Struktur von Chicagos Stadtbezirken. Leider wurde die Tabelle zuletzt 2013 aktualisiert. Das bedeutet, wir können ihr nicht zu sehr vertrauen, aber wir können mit ihrer Hilfe trotzdem nach Korrelationen suchen.*

In [None]:
areas_population = pd.read_csv("../input/chicago-community-areas-demographics/chicago_population.CSV", sep=';')

areas_population

We transform the table to make it fit for the join with our existing table.

*Wir transformieren die Tabelle, damit wir sie dann mit unserer Trinkgelder-Tabelle zusammenfügen können.*

In [None]:
population = areas_population.transpose()

population

In [None]:
population.dtypes

In [None]:
population = population.rename(columns=population.iloc[0])

In [None]:
population = population.iloc[1:]

In [None]:
population

In [None]:
population.index.name = 'dropoff_community_area'

population

In [None]:
population.reset_index(level=0, inplace=True)

population

Make dtypes usable for the plot.

*Mache dtypes verwendbar für die Grafik.*

In [None]:
population = population.astype({"dropoff_community_area": int, "population": int, "income": float, "latinos": float, "blacks": float, "white": float, "asian": float, "other": float,})
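
The reshaping steps above (transpose, promote the first row to the header, name the index, reset it, cast the dtypes) can be condensed into one chain. The sketch below assumes a wide layout with one row per attribute and one column per area number. The frame `wide_demo`, its `attribute` column and all values are made up for illustration, and the real CSV may be laid out differently.

```python
import pandas as pd

# Hypothetical wide layout: one row per attribute, one column per area number
wide_demo = pd.DataFrame({
    "attribute": ["name", "population", "income"],
    "8": ["Near North Side", "80000", "88000.0"],   # made-up numbers
    "76": ["O'Hare", "12000", "52000.0"],           # made-up numbers
})

population_demo = (
    wide_demo
    .set_index("attribute")              # attribute labels become the row index
    .T                                   # areas become rows, attributes become columns
    .rename_axis("dropoff_community_area")
    .reset_index()                       # area numbers become a regular column
    .astype({"dropoff_community_area": int, "population": int, "income": float})
)
print(population_demo.dtypes)
```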

In [None]:
sns.regplot(x=population['income'], y=population['white'])

Check if there is a correlation between income and population groups.

*Überprüfe, ob es eine Korrelation zwischen Einkommen und Einwohnergruppen gibt.*

In [None]:
round(population['income'].corr(population['white']), 2)

In [None]:
round(population['income'].corr(population['blacks']), 2)

In [None]:
round(population['income'].corr(population['latinos']), 2)

In [None]:
round(population['income'].corr(population['asian']), 2)

In [None]:
population_join = pd.merge(mean_tips_join, population, on='dropoff_community_area')

In [None]:
population_join

In [None]:
population_join.drop('name', axis=1, inplace=True)

In [None]:
population_join

In [None]:
print("Correlation between income and average taxi tips in community areas of Chicago: {}".format(round(population_join['income'].corr(population_join['avg_tips']), 2)))

In [None]:
print("Correlation between percentage of white population and average taxi tips in community areas of Chicago: {}".format(round(population_join['white'].corr(population_join['avg_tips']), 2)))

In [None]:
print("Correlation between percentage of black population and average taxi tips in community areas of Chicago: {}".format(round(population_join['blacks'].corr(population_join['avg_tips']), 2)))

In [None]:
print("Correlation between percentage of Latino population and average taxi tips in community areas of Chicago: {}".format(round(population_join['latinos'].corr(population_join['avg_tips']), 2)))

In [None]:
print("Correlation between percentage of Asian population and average taxi tips in community areas of Chicago: {}".format(round(population_join['asian'].corr(population_join['avg_tips']), 2)))
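
The individual correlations printed above can also be computed in one step with `DataFrame.corrwith`, which correlates every selected column with a single Series. The data below is made up and only reuses the notebook's column names.

```python
import pandas as pd

# Made-up values, chosen so the correlations are easy to verify by hand
demo = pd.DataFrame({
    "avg_tips": [1.0, 2.0, 3.0, 4.0],
    "income": [20.0, 40.0, 60.0, 80.0],   # rises in step with avg_tips
    "white": [10.0, 30.0, 20.0, 40.0],    # rises, but not monotonically
    "blacks": [80.0, 60.0, 40.0, 20.0],   # falls in step with avg_tips
})

# Pearson correlation of each column with avg_tips, rounded to 2 decimals
corr_demo = demo[["income", "white", "blacks"]].corrwith(demo["avg_tips"]).round(2)
print(corr_demo)
```

The result is a Series indexed by column name, which makes the values easy to compare or sort.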

In [None]:
sns.regplot(x=population_join['white'], y=population_join['avg_tips'])

We see that, surprisingly, average taxi tips correlate a bit more strongly with population groups in the community area (0.85 for the share of white residents) than with income (0.72). The correlation between taxi tips and population groups is even 0.01 higher than the correlation between income and population groups (0.84).

*Wir sehen überraschend, dass Trinkgelder mehr mit Einwohnergruppen als mit dem Einkommen korrelieren.*

# Part 4 / Teil 4

In this part I'll use the Chicago Crime dataset to check whether there is any negative correlation between the number of arrests and the level of average tips.

*In diesem Teil werde ich die Chicago-Crime-Datenbank benutzen, um zu sehen, ob es eine negative Korrelation zwischen der Zahl der Festnahmen und der Höhe der Trinkgelder gibt.*

In [None]:
# Construct a reference to the "chicago_crime" dataset
dataset_ref2 = client.dataset("chicago_crime", project="bigquery-public-data")

# API request - fetch the dataset
dataset2 = client.get_dataset(dataset_ref2)

# Construct a reference to the "crime" table
table_ref2 = dataset_ref2.table("crime")

In [None]:
community_query = """SELECT
  community_area,
  COUNT(*) AS arrests
FROM `bigquery-public-data.chicago_crime.crime`
  WHERE arrest = TRUE
    AND year = 2021
  GROUP BY
    community_area
ORDER BY
  arrests DESC
        """

community = client.query(community_query).result().to_dataframe()

community.head()

In [None]:
community.rename(columns={'community_area': 'dropoff_community_area'}, inplace=True)

In [None]:
community.head()

I join two tables and adjust the order of columns.

*Ich mache ein Join von zwei Tabellen und ordne die Spalten neu.*

In [None]:
community_join = pd.merge(community, population_join, on='dropoff_community_area')

community_join

In [None]:
community_join = community_join[['dropoff_community_area', 'community_name','arrests','avg_tips','population', 'income', 'latinos', 'blacks', 'white', 'asian', 'other']]

community_join

The strongest correlation is between the number of arrests and the number of residents in each community.

*Die stärkste Korrelation besteht zwischen der Zahl der Festnahmen und der Zahl der Bewohner.*

In [None]:
round(community_join['arrests'].corr(community_join['population']), 2)

In [None]:
round(community_join['arrests'].corr(community_join['income']), 2)

In [None]:
round(community_join['arrests'].corr(community_join['white']), 2)

In [None]:
round(community_join['arrests'].corr(community_join['blacks']), 2)

In [None]:
round(community_join['arrests'].corr(community_join['avg_tips']), 2)

If we calculate arrests as a percentage of each community's population, we'll see that other correlations become more apparent.

*Wenn wir aber den Prozentwert der Festnahmen pro Einwohner berechnen, sehen wir, dass andere Korrelationen deutlicher werden.*

In [None]:
community_join['arrests_pct'] = community_join['arrests']/community_join['population']*100

community_join

In [None]:
round(community_join['arrests_pct'].corr(community_join['avg_tips']), 2)

In [None]:
round(community_join['arrests_pct'].corr(community_join['income']), 2)

In [None]:
round(community_join['arrests_pct'].corr(community_join['white']), 2)

In [None]:
round(community_join['arrests_pct'].corr(community_join['blacks']), 2)
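
Why the per-capita step matters can be seen on a toy example. In the made-up frame below (`demo` is hypothetical), raw arrest counts rise with population simply because larger areas have more people. Dividing by population removes that size effect, and the sign of the correlation even flips.

```python
import pandas as pd

# Made-up areas: the middle one has the highest arrest rate
demo = pd.DataFrame({
    "population": [1000, 2000, 4000],
    "arrests": [10, 40, 40],
})

# Arrests as a percentage of residents, as in the notebook
demo["arrests_pct"] = demo["arrests"] / demo["population"] * 100

# Raw counts follow population size; the rate does not
raw_corr = demo["arrests"].corr(demo["population"])
rate_corr = demo["arrests_pct"].corr(demo["population"])

print(round(raw_corr, 2), round(rate_corr, 2))
```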