# Exploratory Data Analysis

**Goal** 

Explore and visualize Cologne's demographic, social, and economic structure and how it changes over the years.

## Setup & Imports

In [5]:
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

plt.style.use("seaborn-v0_8")

# Path to the cleaned dataset
DATA_PROCESSED = Path('../data/processed/cologne_data_clean_v1.csv')


## Load data

In [6]:
df = pd.read_csv(DATA_PROCESSED, low_memory=False)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8892 entries, 0 to 8891
Data columns (total 52 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   year                             8892 non-null   int64  
 1   area_code                        8892 non-null   int64  
 2   area                             8892 non-null   object 
 3   area_level_code                  8892 non-null   int64  
 4   area_level                       8892 non-null   object 
 5   avg_age_total                    8879 non-null   float64
 6   avg_age_male                     8879 non-null   float64
 7   avg_age_female                   8879 non-null   float64
 8   avg_age_german                   8879 non-null   float64
 9   avg_age_non_german               8879 non-null   float64
 10  population_total                 8879 non-null   float64
 11  non_german_total                 8879 non-null   float64
 12  non_german_share    

## Filtered views

In [None]:
CITY = 0
DISTRICTS = 1
NEIGHBORHOODS = 2
STATISTICAL_BLOCKS = 3
SOCIAL_SPACES = 4

df_city = df[df.area_level_code == CITY]
df_districts = df[df.area_level_code == DISTRICTS]
df_neighborhoods = df[df.area_level_code == NEIGHBORHOODS]
df_statistical_blocks = df[df.area_level_code == STATISTICAL_BLOCKS]
df_social_spaces = df[df.area_level_code == SOCIAL_SPACES]
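
The level constants above assume a particular meaning for each value of `area_level_code`. One way to confirm that assumption is to list the distinct code/label pairs. The sketch below uses a small made-up frame in place of `df`; the label strings are illustrative and may not match the actual values of `area_level`.

```python
import pandas as pd

# Toy stand-in for df; the labels are invented for illustration only.
toy = pd.DataFrame({
    "area_level_code": [0, 1, 1, 2, 2, 2],
    "area_level": ["city", "district", "district",
                   "neighborhood", "neighborhood", "neighborhood"],
})

# One row per distinct (code, label) pair, ordered by code.
level_map = (
    toy[["area_level_code", "area_level"]]
    .drop_duplicates()
    .sort_values("area_level_code")
    .reset_index(drop=True)
)

# Each code should correspond to exactly one label.
labels_per_code = toy.groupby("area_level_code")["area_level"].nunique()
codes_are_unambiguous = bool((labels_per_code == 1).all())

print(level_map)
print(codes_are_unambiguous)
```

Running the same two steps on `df` would show whether `CITY = 0`, `DISTRICTS = 1`, and so on line up with the labels stored in the data.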

## Quick checks

In [8]:
latest_year = df['year'].max()
earliest_year = df['year'].min()

print(earliest_year, latest_year)

2012 2024
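
The minimum and maximum do not show whether every year in between is present. A minimal sketch of a gap check, using an invented series of years rather than the real `df['year']`:

```python
import pandas as pd

# Made-up years with a deliberate gap, for illustration only.
years = pd.Series([2012, 2013, 2015, 2016], name="year")

# Every year we would expect between the first and the last observation.
expected = set(range(int(years.min()), int(years.max()) + 1))

# Years that are expected but never observed.
missing_years = sorted(expected - set(years.unique().tolist()))

print(missing_years)
```

An empty list would mean the range is covered without gaps.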


### Missing values (by area level)

In [19]:
# Share of missing values per float column, broken down by area level
float_cols = df.select_dtypes(include=["float64"]).columns
df[float_cols].isna().groupby(df['area_level_code']).mean().T

area_level_code,0,1,2,3,4
avg_age_total,0.0,0.0,0.0,0.0,0.055556
avg_age_male,0.0,0.0,0.0,0.0,0.055556
avg_age_female,0.0,0.0,0.0,0.0,0.055556
avg_age_german,0.0,0.0,0.0,0.0,0.055556
avg_age_non_german,0.0,0.0,0.0,0.0,0.055556
population_total,0.0,0.0,0.0,0.0,0.055556
non_german_total,0.0,0.0,0.0,0.0,0.055556
non_german_share,0.0,0.0,0.0,0.0,0.055556
migration_background_total,0.0,0.0,0.0,0.0,0.055556
migration_background_share,0.0,0.0,0.0,0.0,0.055556


Most indicators are fully available at the city, district, and neighborhood levels, while statistical blocks and social spaces show substantial structural missingness. In this project, I therefore focus on the city level (for trends) and the district level (for comparisons).
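
To tell structural missingness (an indicator that is never reported at a level) apart from partial gaps, the missing shares can be split at 0 and 1. The following is a sketch on a toy frame; `some_indicator` is a hypothetical column name and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; "some_indicator" is a hypothetical column.
toy = pd.DataFrame({
    "area_level_code": [0, 0, 3, 3, 4, 4],
    "population_total": [1.0, 2.0, 3.0, 4.0, np.nan, 5.0],
    "some_indicator": [1.0, 2.0, np.nan, np.nan, np.nan, np.nan],
})
value_cols = ["population_total", "some_indicator"]

# Share of missing values per area level (rows) and column.
missing_share = toy[value_cols].isna().groupby(toy["area_level_code"]).mean()

# Structural: the column is missing for every row of that level.
structural = missing_share.eq(1.0)

# Partial: some rows are missing, but not all of them.
partial = missing_share.gt(0.0) & missing_share.lt(1.0)

print(structural)
print(partial)
```

Applied to `df`, a `True` in `structural` would mark level/indicator pairs that cannot be analyzed at all, while `partial` points to gaps worth inspecting row by row.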

In [9]:
df_districts.groupby('year')[['new_registered_electric_cars', 'registered_electric_cars']].apply(lambda x: x.isna().mean())

year,new_registered_electric_cars,registered_electric_cars
2012,0.222222,0.222222
2013,0.333333,0.222222
2014,0.444444,0.111111
2015,0.555556,0.0
2016,0.222222,0.0
2017,0.0,0.0
2018,0.0,0.0
2019,0.0,0.0
2020,0.0,0.0
2021,0.0,0.0


#### Co-missingness

In [13]:
# df.isna().corr()
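
One caveat with `df.isna().corr()` is that columns without any missing values have a constant indicator, so their correlations come out as `NaN`. A possible variant, sketched on a toy frame with invented columns `a` to `d`, keeps only columns whose missingness actually varies:

```python
import numpy as np
import pandas as pd

# Toy data: a and b are missing together, c is missing when a is present,
# d is never missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],
    "c": [np.nan, 2.0, np.nan, 4.0],
    "d": [1.0, 2.0, 3.0, 4.0],
})

is_missing = toy.isna()

# Keep columns that contain both missing and non-missing entries.
varying = is_missing.columns[is_missing.nunique() > 1]

# Correlation of the 0/1 missingness indicators.
co_missing = is_missing[varying].astype(float).corr()

print(co_missing)
```

Values near 1 indicate columns that tend to be missing in the same rows; values near -1 indicate columns that are missing in complementary rows.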

### Duplicates

In [14]:
df.duplicated(subset=['year','area_code']).sum()

np.int64(0)
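
A count of zero confirms that `(year, area_code)` identifies each row. If the count were ever non-zero, listing the offending rows would be the natural next step. A minimal sketch on invented data:

```python
import pandas as pd

# Toy frame with one deliberately repeated (year, area_code) pair.
toy = pd.DataFrame({
    "year": [2012, 2013, 2012, 2013, 2013],
    "area_code": [1, 1, 2, 2, 2],
})
key = ["year", "area_code"]

# keep=False flags every row that takes part in a duplicate, not just the repeats.
dupes = toy[toy.duplicated(subset=key, keep=False)]

print(dupes)
```

With `keep=False`, both copies of a duplicated key are returned, which makes it easier to compare them side by side.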