# Basic Data Science Statistics — Lecture notebook

Objectives:
- Load a dataset (or generate a small sample) into a pandas DataFrame
- Inspect the data (head, info, describe)
- Compute min, max, mean, median, std
- Filter the DataFrame into sub-dataframes
- Aggregate with groupby
- Create and use an automatic summary function

In [2]:
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

## 1) Load a dataset

In [22]:
df = pd.read_csv('work/introduction_to_data_science/data/formula-1-race-data/results.csv')
df.shape

(23777, 18)

In [23]:
print('\nDataFrame info:')
df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()


DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23777 entries, 0 to 23776
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   resultId         23777 non-null  int64  
 1   raceId           23777 non-null  int64  
 2   driverId         23777 non-null  int64  
 3   constructorId    23777 non-null  int64  
 4   number           23771 non-null  float64
 5   grid             23777 non-null  int64  
 6   position         13227 non-null  float64
 7   positionText     23777 non-null  object 
 8   positionOrder    23777 non-null  int64  
 9   points           23777 non-null  float64
 10  laps             23777 non-null  int64  
 11  time             6004 non-null   object 
 12  milliseconds     6003 non-null   float64
 13  fastestLap       5383 non-null   float64
 14  rank             5531 non-null   float64
 15  fastestLapTime   5383 non-null   object 
 16  fastestLapSpeed  5383 non-null   object 


In [24]:
print('\nDescriptive statistics:')
display(df.describe(include='all'))


Descriptive statistics:


| | resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 23777.0 | 23777.0 | 23777.0 | 23777.0 | 23771.0 | 23777.0 | 13227.0 | 23777 | 23777.0 | 23777.0 | 23777.0 | 6004 | 6003.0 | 5383.0 | 5531.0 | 5383 | 5383.0 | 23777.0 |
| unique | | | | | | | | 39 | | | | 5758 | | | | 551 | 5144.0 | |
| top | | | | | | | | R | | | | +8:22.19 | | | | 01:17.2 | 220.611 | |
| freq | | | | | | | | 8517 | | | | 5 | | | | 28 | 3.0 | |
| mean | 11889.481053 | 487.203937 | 226.515961 | 46.281785 | 16.965462 | 11.270303 | 7.782264 | | 13.081591 | 1.601403 | 45.270598 | | 6303313.0 | 41.061676 | 10.598807 | | | 18.242293 |
| std | 6864.691322 | 269.904857 | 231.386102 | 56.174091 | 13.644798 | 7.346436 | 4.745105 | | 7.824711 | 3.665154 | 30.525404 | | 1721748.0 | 17.156435 | 6.272457 | | | 26.380824 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | | 1.0 | 0.0 | 0.0 | | 1474899.0 | 2.0 | 0.0 | | | 1.0 |
| 25% | 5945.0 | 273.0 | 55.0 | 6.0 | 7.0 | 5.0 | 4.0 | | 7.0 | 0.0 | 20.0 | | 5442948.0 | 29.0 | 5.0 | | | 1.0 |
| 50% | 11889.0 | 478.0 | 154.0 | 25.0 | 15.0 | 11.0 | 7.0 | | 13.0 | 0.0 | 52.0 | | 5859428.0 | 44.0 | 11.0 | | | 11.0 |
| 75% | 17833.0 | 718.0 | 314.0 | 57.0 | 23.0 | 17.0 | 11.0 | | 19.0 | 1.0 | 66.0 | | 6495440.0 | 53.0 | 16.0 | | | 16.0 |


## 2) Inspect the data

Show a few rows, basic info, and descriptive statistics, then look at the extremes of a single column.
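The inspection calls named in the objectives (head, info, describe) can be tried on a small generated sample before running them on the full file. The following is a minimal sketch; the `sample` DataFrame and its values are hypothetical and only mimic a few columns of `results.csv`.

```python
import pandas as pd

# Hypothetical sample that mimics a few columns of results.csv
sample = pd.DataFrame({
    "resultId": [1, 2, 3, 4],
    "constructorId": [1, 1, 2, 2],
    "position": [1.0, 2.0, float("nan"), 4.0],  # NaN: no classified position
    "points": [10.0, 8.0, 0.0, 5.0],
})

first_rows = sample.head(3)   # first three rows
sample.info()                 # prints dtypes and non-null counts, returns None
stats = sample.describe()     # numeric summary; NaN values are ignored

print(first_rows)
print(stats)
```

The `count` row of `describe()` reports non-null values only, which is why `position` shows a smaller count than the other columns here, just as it does in the full dataset.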

In [25]:
# Print the min and max of the 'position' column (NaN values are skipped)
print('Position min:', df['position'].min())
print('Position max:', df['position'].max())

Position min: 1.0
Position max: 33.0
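Only 13227 of the 23777 rows have a non-null `position`, so it helps to know how `min()` and `max()` treat missing values. The sketch below uses a small hypothetical Series to show that pandas skips NaN by default and returns NaN only when `skipna=False` is passed.

```python
import pandas as pd

# Hypothetical positions; NaN stands for a result without a classified position
positions = pd.Series([3.0, float("nan"), 1.0, float("nan"), 7.0])

n_missing = int(positions.isna().sum())        # number of missing values
lowest = positions.min()                       # NaN skipped by default
highest = positions.max()                      # NaN skipped by default
highest_strict = positions.max(skipna=False)   # NaN, because missing values are not skipped

print(n_missing, lowest, highest, highest_strict)
```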


In [26]:
max_position = df['position'].max()
max_position_data = df[df['position'] == max_position]
display(max_position_data[['position', 'raceId', 'driverId', 'constructorId']])

| | position | raceId | driverId | constructorId |
|---|---|---|---|---|
| 18144 | 33.0 | 748 | 539 | 113 |
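An alternative to filtering with a boolean mask is `idxmax()`, which returns the index label of the first maximum. This is a sketch on a hypothetical four-row sample, not the full dataset. Note the difference in behavior: the boolean mask returns every row that ties for the maximum, while `idxmax()` returns only the first.

```python
import pandas as pd

# Hypothetical sample with custom index labels, as in the filtered output
sample = pd.DataFrame(
    {
        "position": [5.0, float("nan"), 33.0, 12.0],
        "raceId": [10, 10, 748, 11],
        "driverId": [1, 2, 539, 3],
        "constructorId": [6, 6, 113, 9],
    },
    index=[100, 101, 102, 103],
)

max_label = sample["position"].idxmax()   # label of the first maximum, NaN skipped
max_row = sample.loc[max_label]           # the full row as a Series

print(max_label)
print(max_row)
```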


In [34]:
constructors_df = pd.read_csv('work/introduction_to_data_science/data/formula-1-race-data/constructors.csv')
constructors_df.shape

(208, 6)

In [35]:
constructor_113 = constructors_df[constructors_df['constructorId'] == 113]
display(constructor_113[['constructorId', 'name', 'nationality']])

| | constructorId | name | nationality |
|---|---|---|---|
| 111 | 113 | Kurtis Kraft | American |
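Looking up one ID by hand works for a single constructor. To attach names to every result at once, the two tables can be joined on `constructorId`. The sketch below assumes small hypothetical stand-ins for `df` and `constructors_df`; "Example Team" and ID 999 are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the results and constructors tables
results = pd.DataFrame({
    "resultId": [1, 2, 3],
    "constructorId": [113, 999, 113],
    "position": [33.0, 1.0, float("nan")],
})
constructors = pd.DataFrame({
    "constructorId": [999, 113],
    "name": ["Example Team", "Kurtis Kraft"],
    "nationality": ["Unknown", "American"],
})

# Left join keeps every result; validate guards against duplicate constructor IDs
merged = results.merge(
    constructors[["constructorId", "name"]],
    on="constructorId",
    how="left",
    validate="many_to_one",
)

print(merged)
```

With `how="left"`, a result whose `constructorId` has no match would keep its row and get NaN in `name` instead of being dropped.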


Find out how many results there are in which Kurtis Kraft (constructorId 113) took part, what the average finishing position was, and what the standard deviation of the position was.

In [36]:
# Filter results for Kurtis Kraft (constructorId 113)
kurtis_kraft_results = df[df['constructorId'] == 113]

# Number of results (all rows, including those without a classified position)
num_results = kurtis_kraft_results.shape[0]

# Mean finishing position (valid positions only; NaN is skipped)
mean_position = kurtis_kraft_results['position'].mean()

# Standard deviation of the finishing position
std_position = kurtis_kraft_results['position'].std()

print(f"Number of results: {num_results}")
print(f"Mean position: {mean_position:.2f}")
print(f"Standard deviation of position: {std_position:.2f}")

Number of results: 226
Mean position: 10.30
Standard deviation of position: 6.39
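The same three numbers can be computed for every constructor at once with `groupby`, which also covers the "automatic summary function" objective. The following is one possible sketch, not a definitive implementation: the function name `summarize_positions` and the sample data are hypothetical. Note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd


def summarize_positions(results: pd.DataFrame) -> pd.DataFrame:
    """Per-constructor summary of the 'position' column (hypothetical helper)."""
    return results.groupby("constructorId")["position"].agg(
        n_results="size",        # all rows, including NaN positions
        n_classified="count",    # rows with a non-null position
        mean_position="mean",
        median_position="median",
        std_position="std",      # sample standard deviation (ddof=1)
    )


# Hypothetical sample data
sample = pd.DataFrame({
    "constructorId": [113, 113, 113, 113, 999, 999],
    "position": [10.0, 12.0, float("nan"), 14.0, 1.0, 3.0],
})

summary = summarize_positions(sample)
print(summary)
```

Separating `size` from `count` makes the gap visible between the number of entries and the number of classified finishes, which the single-constructor cell above does not show.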
