# Project 1 - Data Engineering
## 2 Data Exploration and Analysis
This notebook is for the implementation of task "2 Data Exploration and Analysis", as listed in the **Project Instructions**.

<div class="alert alert-success">
<b>Overview:</b><br>
Data Exploration and Analysis consists of the following parts worth 70 points:
<ul>
    <li><b>E1 - Obtain and Scrub</b> (15 points)</li>
    <li><b>E2 - Exploratory data analysis (EDA) </b>(20 points)</li>
    <li><b>E3 - Formulate hypotheses </b>(25 points)</li>
    <li><b>Follow the guidelines for Data Exploration and Analysis below </b>(10 points)</li>
</ul></div>

<div class="alert alert-success">
<b>Guidelines for Data Exploration and Analysis:</b><br>
    <ol>
        <li>Use a single Jupyer notebook for your project.</li>
        <li>Use only Python-code for your project.</li>
        <li>The use of automatic and semi-automatic data analyis tools is not allowed (e. g., PandasGUI, D-Tale, Mito, etc.). Only use packages we used in the coded lectures.</li>
        <li>Export your environment for submission as 'prj01-environment.txt'.</li>
        <li>Upload your resulting work as a zip file containing only a single jupyter notebook and required files to run the notebook. All cell outputs and <b>figures must display in jupyter lab</b>. (Test this, in particular when you use another environment like VS Code.)</li>
        <li>All code cells in your notebook must be runnable without errors or warnings (e. g., deprecated functions). Each error/warning subtracts -2 points (up to the full 10 points for following the  guidelines).</li>
        <li>Use only relative paths in your project.</li>
        <li>Avoid (excessive) code duplication.</li>
        <li>Avoid loops iterating over pandas objects (Series, DataFrames). Explicitly justify each exception via a comment. </li>
        <li>All coded steps in your analysis must be commented.</li>
        <li>Keep your code as well as outputs short, precise and readable. Each long or unnecessary output subtracts -2 points (up to the full 10 points for following the project guidelines).</li>
    </ol>
    <b>Late submissions are not accepted and earn you 0 points on the python project. </b>
</div>

Explicitly list which notebook toolset was used (jupyter lab/jupyter notebook/VS Code/etc.)

**here**: MY_TOOL, MY_BROWSER

Explicitly and clearly state the chosen dataset number and title:
### Bevölkerung ODÖ Hunde
#### Hundebestand seit 2002 - Bezirke Wien --> contains info about dog density per districts over years
#### Hunde pro Bezirk Wien --> contains info about dog breeds count per district in 2024
#### Hunderassen Wien --> contains info about dog breeds count per district 2012 - 2017


## E1 - Obtain and Scrub

### Obtain
Download the dataset and understand both:
- format: wide vs. long, separaters, decimal points, encoding, etc., and
- content: what variables are in the columns, what is their meaning?
To this end, identify and download metadata such as headers, category listings, explanatory reports, etc.
### Scrub
The aim of scrub is to create a clean version of the data for further analysis.
- Load the dataset and take care of dtypes (dates, numbers, categories, etc.). Justify why you don't load/use specific columns.
- Check for footnotes or any other notifications on special cell content, such as time series breaks. Follow up that information, and document your decision how to deal with it. Remember: A homework contained such info in the cell "76.1 b". The metadata defines what that "b" stands for.
![image.png](attachment:8eab5647-0d31-4875-a3ac-990349e90b76.png)
- Choose an appropriate (Multi-)Index.
- Identify:
    1. missing values and get row and column overviews. Use graphical and/or numeric approaches. Once identified, handle missing values according to column type, time series property and data set size.
    2. duplicates (justify the used column subset). Remove duplicates - if any - and inspect what you removed.
- Transform to shape (tidy vs. wide) best suited for further analysis.
- Export the clean data to a file for inspection with an external data browser (e.g., MS Excel).
- Provide an overview of the clean dataset:
    1. show the dtypes
    2. quantitative column descriptions:
        1. categorical columns: number of unique values, counts
        2. numeric columns: range and median

In [1]:
# Importing necessary packages:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib import colormaps
import geopandas as gpd

## 1) Obtain the datasets and scrub data


#### 1) Dataset 1
Number of dogs (absolute and per 1,000 inhabitants) since 2002 - districts of Vienna
 	* NUTS | NUTS2-Region (Bundesland)
  * DISTRICT_CODE | Gemeindebezirkskennzahl (Schema: 9BBZZ, BB=Nummer Bezirk, ZZ=00)
  * SUB_DISTRICT_CODE | Zählbezirkskennzahl gemäß Stadt Wien (Schema: 9BBZZ, 9=Kennzeichnung Wien, BB=Nummer Bezirk, ZZ=Nummer Zählbezirk, ZZ=99 bei fehlender Zählbezirkskennzeichnung)
  * REF_YEAR | Referenzjahr
  * REF_DATE | Referenzdatum
  * DOG_VALUE | Anzahl der Hunde (absolut)
  * DOG_DENSITY | Anzahl der Hunde pro 1.000 EinwohnerInnen

   TODO: add more description?

##### Obtain

In [2]:
# Import the first data set
# import and drop all cols with NaN --> lots because of sep ';' but the sep splits the first col into its values
dogs_2002 = pd.read_csv("vie-bez-biz-spo-dog-2002f.csv", sep=';', skiprows=1).dropna(axis=1, how='any')

print("Dataframe shape: ", dogs_2002.shape)
dogs_2002.head(5)

Dataframe shape:  (528, 7)


Unnamed: 0,NUTS,DISTRICT_CODE,SUB_DISTRICT_CODE,REF_YEAR,REF_DATE,DOG_VALUE,DOG_DENSITY
0,AT13,90000,90000,2002,20020101,46.933,2987
1,AT13,90100,90100,2002,20020101,542.0,3074
2,AT13,90200,90200,2002,20020101,2.251,2529
3,AT13,90300,90300,2002,20020101,1.904,2316
4,AT13,90400,90400,2002,20020101,615.0,2123


* The separator is ';'
  * even though it creates many NaN columns --> due to the delimiter specified as ',', but importing with separator ';' splits the columns correctly and dropping NaN columns is better to handle.
* The encoding is 'utf-8'.
* The shape is (528, 7).
* Format is unsure:  
  * Contains entries for years by district of overall dog density sorted by years.
  * each row represents a single observation of district by year and dog breeds
  * but has multiple entries regarding time line and district
* The first row contains the csv title. It is dropped and the second row is used for column titles.



##### Scrub

In [3]:
print(dogs_2002.dtypes)

NUTS                  object
DISTRICT_CODE          int64
SUB_DISTRICT_CODE      int64
REF_YEAR               int64
REF_DATE               int64
DOG_VALUE            float64
DOG_DENSITY           object
dtype: object


Change DOG_DENSITY to use . instead of , and the format to float to be able to calculate with it, if needed.

Change REF_DATE to format datetime since its currently int64.

In [4]:
dogs_2002['DOG_DENSITY'] = dogs_2002['DOG_DENSITY'].str.replace(',', '.').astype(float)
dogs_2002['REF_DATE'] = pd.to_datetime(dogs_2002['REF_DATE'].astype(str), format='%Y%m%d')

In [5]:
print(dogs_2002.dtypes)
dogs_2002.head(2)

NUTS                         object
DISTRICT_CODE                 int64
SUB_DISTRICT_CODE             int64
REF_YEAR                      int64
REF_DATE             datetime64[ns]
DOG_VALUE                   float64
DOG_DENSITY                 float64
dtype: object


Unnamed: 0,NUTS,DISTRICT_CODE,SUB_DISTRICT_CODE,REF_YEAR,REF_DATE,DOG_VALUE,DOG_DENSITY
0,AT13,90000,90000,2002,2002-01-01,46.933,29.87
1,AT13,90100,90100,2002,2002-01-01,542.0,30.74


District Code and Sub District Code are the same --> removing Sub District Code.

In [6]:
# only execute if not already done (avoids errors)
if 'SUB_DISTRICT_CODE' in dogs_2002.columns:
    print(np.unique([dogs_2002['DISTRICT_CODE'] == dogs_2002['SUB_DISTRICT_CODE']], return_counts=True))
    dogs_2002.drop(columns=['SUB_DISTRICT_CODE'], inplace=True)

(array([ True]), array([528], dtype=int64))


Check for duplicates --> looks good.

In [7]:
key=['DOG_VALUE', 'REF_YEAR', 'DISTRICT_CODE']
dogs_2002.groupby(key)['REF_YEAR'].count().sort_values(ascending=False).head(5)

DOG_VALUE  REF_YEAR  DISTRICT_CODE
1.022      2015      90500            1
8.163      2014      92200            1
50.282     2006      90000            1
49.856     2005      90000            1
48.093     2004      90000            1
Name: REF_YEAR, dtype: int64

Check for null values --> looks good.

In [8]:
dogs_2002.isnull().values.any()

False

#### Dataset 2
  * NUTS1 NUTS2 NUTS3
  * DISTRICT_CODE
  * SUB_DISTRICT_CODE
  * Postal_CODE
  * Dog Breed
  * Anzahl
  * Ref_Date

##### Obtain

In [9]:
# Import dataset 2
dog_types_2024 = pd.read_csv("hunde-wien.csv", sep=';', encoding='latin-1', skiprows=1)

print("Dataframe shape: ", dog_types_2024.shape)
dog_types_2024.head(5)

Dataframe shape:  (11117, 9)


Unnamed: 0,NUTS1,NUTS2,NUTS3,DISTRICT_CODE,SUB_DISTRICT_CODE,Postal_CODE,Dog Breed,Anzahl,Ref_Date
0,AT1,AT13,AT113,90100,.,1010,Afghanischer Windhund / Mischling,1.0,20240603
1,AT1,AT13,AT113,90100,.,1010,Akita / Belgischer Schäferhund,1.0,20240603
2,AT1,AT13,AT113,90100,.,1010,Alaskan Malamute,1.0,20240603
3,AT1,AT13,AT113,90100,.,1010,American Cocker Spaniel,2.0,20240603
4,AT1,AT13,AT113,90100,.,1010,American Cocker Spaniel / Kleinpudel Schwarz,1.0,20240603


* The separator is ';'
* The encoding is 'latin-1'.
* The shape is (11117, 9).
* Format is long (?) :
  * contains a single count row for each dog breed (if we consider entries like 'Hovawart' and 'Hovawart / Golden Retriever' to be different)
  * but contains multiple entries on districts

##### Scrub

In [10]:
print(dog_types_2024.dtypes)

NUTS1                 object
NUTS2                 object
NUTS3                 object
DISTRICT_CODE          int64
SUB_DISTRICT_CODE     object
Postal_CODE            int64
Dog Breed             object
Anzahl               float64
Ref_Date               int64
dtype: object


Looks like columns could have only one value in some colums --> check for that. Also, no district is missing, which is good.

In [11]:
def check_unique_values(df):
    for col in df.columns:
        print(col, df[col].unique())

check_unique_values(dog_types_2024)

NUTS1 ['AT1']
NUTS2 ['AT13']
NUTS3 ['AT113']
DISTRICT_CODE [90100 90200 90300 90400 90500 90600 90700 90800 90900 91000 91100 91200
 91300 91400 91500 91600 91700 91800 91900 92000 92100 92200 92300]
SUB_DISTRICT_CODE ['.']
Postal_CODE [1010 1020 1030 1040 1050 1060 1070 1080 1090 1100 1110 1120 1130 1140
 1150 1160 1170 1180 1190 1200 1210 1220 1230]
Dog Breed ['Afghanischer Windhund / Mischling' 'Akita / Belgischer Schäferhund'
 'Alaskan Malamute' ... 'Zwergschnauzer schwarz / Deutsch Drahthaar'
 'Zwergspitz (Pomeranian) / Border Collie'
 'Zwergspitz (Pomeranian) / Zwergspitz (Pomeranian)']
Anzahl [  1.      2.      3.      4.      6.     17.      5.     10.      7.
   8.     13.     12.     25.     20.     19.     48.      9.     47.
  23.     29.     21.     11.    108.     49.     67.     52.     31.
  37.    103.     56.     27.     86.     14.     30.     84.    288.
  18.     83.     15.     45.     51.     28.     91.     40.     32.
  24.     63.     53.     22.     97.     3

Drop NUTS1, NUTS, NUTS3 and SUB_DISTRICT CODE. Don't want to drop Ref_Date yet, maybe could be used for some insights with the above data frame.

In [12]:
def adapt_df(df: pd.DataFrame):
    df = df.drop(['NUTS1', 'NUTS2', 'NUTS3', 'SUB_DISTRICT_CODE'], axis=1)
    df['Ref_Date'] = pd.to_datetime(df['Ref_Date'].astype(str), format='%Y%m%d')
    return df

dog_types_2024 = adapt_df(dog_types_2024)

In [13]:
print(dog_types_2024.dtypes)
dog_types_2024.head(2)

DISTRICT_CODE             int64
Postal_CODE               int64
Dog Breed                object
Anzahl                  float64
Ref_Date         datetime64[ns]
dtype: object


Unnamed: 0,DISTRICT_CODE,Postal_CODE,Dog Breed,Anzahl,Ref_Date
0,90100,1010,Afghanischer Windhund / Mischling,1.0,2024-06-03
1,90100,1010,Akita / Belgischer Schäferhund,1.0,2024-06-03


Check for null values --> looks good.

In [14]:
dog_types_2024.isnull().values.any()

False

Check if some Dog Breeds are duplicated --> looks good.

In [15]:
def check_for_duplicates(df):
    key=['Dog Breed', 'Postal_CODE']
    print(dog_types_2024.groupby(key)['Postal_CODE'].count().sort_values(ascending=False).head(5))

check_for_duplicates(dog_types_2024)

Dog Breed             Postal_CODE
Affenpinscher         1020           1
Malteser / Pudel      1020           1
Malteser / Pekingese  1140           1
                      1160           1
                      1180           1
Name: Postal_CODE, dtype: int64


Check if there are unknown Dog Breeds:
  * There appear to be 250 entries with at least partly unknown Dog Breeds.
  * They are kept for the moment.

In [16]:
def check_for_unknown_dog_breeds(df: pd.DataFrame):
    print(df[df['Dog Breed'].str.contains('Unbekannt')].size)
    print(df[df['Dog Breed'].str.contains('Unbekannt')]['Dog Breed'].unique())

check_for_unknown_dog_breeds(dog_types_2024)

250
['Unbekannt' 'Unbekannt / Mischling' 'Unbekannt / Dackel'
 'Unbekannt / Kleiner Münsterländer' 'Unbekannt / Mudi'
 'Unbekannt / Pit Bull Terrier']


In [17]:
dog_types_2024['Anzahl'] = dog_types_2024['Anzahl'].astype(int)

In [18]:
dog_types_2024.head()

Unnamed: 0,DISTRICT_CODE,Postal_CODE,Dog Breed,Anzahl,Ref_Date
0,90100,1010,Afghanischer Windhund / Mischling,1,2024-06-03
1,90100,1010,Akita / Belgischer Schäferhund,1,2024-06-03
2,90100,1010,Alaskan Malamute,1,2024-06-03
3,90100,1010,American Cocker Spaniel,2,2024-06-03
4,90100,1010,American Cocker Spaniel / Kleinpudel Schwarz,1,2024-06-03


Could combine colors to one col? If needed.

#### Dataset 3
  * NUTS1: AT1
  * NUTS2: AT13
  * NUTS3: AT113
  * DISTRICT_CODE: Bezirke, Format 9BB00
  * SUB_DISTRICT_CODE: Zählbezirk, leer
  * Postal_CODE: Postleitzahl, Format 1BB0
  * Dog Breed: Hunderasse
  * Anzahl: Anzahl der jeweiligen Hunderasse
  * Ref_Date: Jahr

In [19]:
# Import dataset 3
dog_types_2012 = pd.read_csv("hunde-vie.csv", sep=';', encoding='latin-1', skiprows=1)

print("Dataframe shape: ", dog_types_2012.shape)
dog_types_2012.head(5)

Dataframe shape:  (33793, 9)


Unnamed: 0,NUTS1,NUTS2,NUTS3,DISTRICT_CODE,SUB_DISTRICT_CODE,Postal_CODE,Dog Breed,Anzahl,Ref_Date
0,AT1,AT13,AT113,90100,.,1010,Afghanischer Windhund,1,20123112
1,AT1,AT13,AT113,90100,.,1010,Amerikanischer Cockerspaniel,1,20123112
2,AT1,AT13,AT113,90100,.,1010,Amerikanischer Staffordshire-Terrier,2,20123112
3,AT1,AT13,AT113,90100,.,1010,Australian Shepherd Dog,2,20123112
4,AT1,AT13,AT113,90100,.,1010,Australian Terrier,1,20123112


* The separator is ';'
* The encoding is 'latin-1'.
* The shape is (33793, 9).
* Format is long (?):
  * contains a single count row for each dog breed (if we consider entries like 'Hovawart' and 'Hovawart / Golden Retriever' to be different)
  * but again multiple entries


Repeat steps as with dataframe 2, since they have the same format.

In [20]:
check_unique_values(dog_types_2012)

NUTS1 ['AT1']
NUTS2 ['AT13']
NUTS3 ['AT113']
DISTRICT_CODE [90100 90200 90300 90400 90500 90600 90700 90800 90900 91000 91100 91200
 91300 91400 91500 91600 91700 91800 91900 92000 92100 92200 92300]
SUB_DISTRICT_CODE ['.']
Postal_CODE [1010 1020 1030 1040 1050 1060 1070 1080 1090 1100 1110 1120 1130 1140
 1150 1160 1170 1180 1190 1200 1210 1220 1230]
Dog Breed ['Afghanischer Windhund' 'Amerikanischer Cockerspaniel'
 'Amerikanischer Staffordshire-Terrier' ... 'Shikoku'
 'Olde English Bulldogge / Podenco Ibicenco' 'Whippet / Border-Collie']
Anzahl [   1    2    9    4    3    8    5    7    6   14  154   42   20   12
   31   15   17   86   13   19   33   41   47   32   30   72   22   81
   60   29   11  856   25   76   10   59   62   23   75   50   18  777
   44  206   24   21   26  330  229  196  143  263   40  127   43   16
   27  205   35   37  107  104  120   69   56  197   83   64   45  139
  175   67   65 1750  169   91   39   70  153   88   80   58   49  117
  109 1182  101   51 

Change date format to yyyymmdd to be the same as in dataframe above.

In [21]:
dog_types_2012['Ref_Date'] = pd.to_datetime(dog_types_2012['Ref_Date'], format='%Y%d%m').dt.strftime('%Y%m%d')
dog_types_2012 = adapt_df(dog_types_2012)
print(dog_types_2012.isnull().values.any())

False


Check for duplicates and unkown dog breeds --> has values with the breed not known completely as well as duplicates.
* Unkown can stay for now.
* Drop duplicates.



In [22]:
check_for_duplicates(dog_types_2012)
check_for_unknown_dog_breeds(dog_types_2012)

Dog Breed             Postal_CODE
Affenpinscher         1020           1
Malteser / Pudel      1020           1
Malteser / Pekingese  1140           1
                      1160           1
                      1180           1
Name: Postal_CODE, dtype: int64
690
['Unbekannt']


In [23]:
# drop duplicates
key=['Dog Breed', 'Postal_CODE']
dog_types_2012=dog_types_2012.drop_duplicates(subset=key)

### Merging Dataset 2 and 3 as they contain the same info for different years

In [24]:
print(dog_types_2024.dtypes)
print(dog_types_2012.dtypes)
dog_types_2012.head(5)

DISTRICT_CODE             int64
Postal_CODE               int64
Dog Breed                object
Anzahl                    int32
Ref_Date         datetime64[ns]
dtype: object
DISTRICT_CODE             int64
Postal_CODE               int64
Dog Breed                object
Anzahl                    int64
Ref_Date         datetime64[ns]
dtype: object


Unnamed: 0,DISTRICT_CODE,Postal_CODE,Dog Breed,Anzahl,Ref_Date
0,90100,1010,Afghanischer Windhund,1,2012-12-31
1,90100,1010,Amerikanischer Cockerspaniel,1,2012-12-31
2,90100,1010,Amerikanischer Staffordshire-Terrier,2,2012-12-31
3,90100,1010,Australian Shepherd Dog,2,2012-12-31
4,90100,1010,Australian Terrier,1,2012-12-31


In [25]:
combined_dog_types = pd.concat([dog_types_2012, dog_types_2024])

In [26]:
print(combined_dog_types.dtypes, combined_dog_types.shape)
combined_dog_types.head(5)

DISTRICT_CODE             int64
Postal_CODE               int64
Dog Breed                object
Anzahl                    int64
Ref_Date         datetime64[ns]
dtype: object (18178, 5)


Unnamed: 0,DISTRICT_CODE,Postal_CODE,Dog Breed,Anzahl,Ref_Date
0,90100,1010,Afghanischer Windhund,1,2012-12-31
1,90100,1010,Amerikanischer Cockerspaniel,1,2012-12-31
2,90100,1010,Amerikanischer Staffordshire-Terrier,2,2012-12-31
3,90100,1010,Australian Shepherd Dog,2,2012-12-31
4,90100,1010,Australian Terrier,1,2012-12-31


In [27]:
combined_dog_types_by_year_and_district = combined_dog_types.pivot_table(index=['Ref_Date', 'Postal_CODE'], columns='Dog Breed', values='Anzahl', fill_value=0).sort_values(by='Ref_Date', ascending=False)
combined_dog_types_by_year_and_district.head(2)

Unnamed: 0_level_0,Dog Breed,Affenpinscher,Affenpinscher / Afghanischer Windhund,Affenpinscher / Beagle,Affenpinscher / Border Terrier,Affenpinscher / Cairn Terrier,Affenpinscher / Griffon belge,Affenpinscher / Mischling,Affenpinscher / Scottish Terrier,Affenpinscher / Shih Tzu,Affenpinscher / Zwergschnauzer pfeffer-salz,...,Österreichischer Pinscher / Prager Rattler,Österreichischer Pinscher / Rauhhaar Dachshund Normal,Österreichischer Pinscher / Shar Pei,Österreichischer Pinscher / Spitz,Österreichischer Pinscher / Tibetan Spaniel,Österreichischer Pinscher / Weimaraner Kurzhaar,Österreichischer Pinscher / Whippet,Österreichischer Pinscher / Yorkshire Terrier,Österreichischer Pinscher / Zwergpudel Rot,Österreichischer Pinscher / Zwergspitz (Pomeranian)
Ref_Date,Postal_CODE,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
2024-06-03,1230,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2024-06-03,1120,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Sanity Check: 

Values are the same (apart from 0) --> looks good. 

In [35]:
unique_values = []
unique_values_single = []
for col in combined_dog_types_by_year_and_district.columns:
    unique_values.extend(combined_dog_types_by_year_and_district[col].unique())

unique_values_single.extend(dog_types_2012['Anzahl'].unique())
unique_values_single.extend(dog_types_2024['Anzahl'].unique())
unique_values_single = set(unique_values_single)
unique_values = set(unique_values)
differences = unique_values.symmetric_difference(unique_values_single)
print(differences)

{0}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 529, 103, 104, 105, 107, 108, 109, 112, 113, 114, 115, 628, 117, 118, 119, 120, 121, 122, 123, 125, 127, 129, 130, 132, 133, 136, 139, 141, 142, 143, 144, 145, 146, 148, 150, 153, 154, 156, 2717, 1182, 161, 673, 169, 171, 172, 173, 686, 175, 179, 186, 189, 191, 194, 195, 196, 197, 204, 205, 206, 207, 208, 213, 1750, 217, 222, 224, 225, 228, 229, 234, 749, 241, 242, 253, 258, 260, 772, 263, 777, 269, 279, 281, 284, 288, 2354, 310, 521, 317, 330, 856, 352, 362, 1397, 385, 400, 922, 411, 415, 430, 943, 431, 477}
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 

## E2 - Exploratory data analysis (EDA)
Use the clean dataset and understand and explore the relationships in the data (numerical, visual, statistical). This includes at least but is not limited to:
- A comprehensive textual description of meaning for relevant fields in the dataset
- Statistical/numerical descriptions and visualization techniques we learned in the course inluding correlations, distributions and groupings of variing degrees.
- Checks for data quality, e. g., completeness, plausibility, outliers
- Handling any identified problems with the data
- If necessary, use additional data wrangling in line with your EDA and only keep what's necessary for the following steps of your analysis with appropriate data granularity and form ("tidy data")

Don't:
- test/prove hypotheses here. EDA should only motivate hypotheses.

Advanced/bonus:
- Depending on your hypothesis you may want to join external data (e.g., merge external highest education level to existing vaccination data) for additional insights.

Explain all steps taken and your thinking why you deem them necessary.

In [28]:
# E2:








## E3 - Formulate hypotheses

*Note: Read this section entirely and understand it - every group member.*

A hypothesis is an idea or explanation for something that is based on known facts but has not yet been proved. A hypothesis is a compact, concise statement, such as: "Individuals with higher income have (on average) more offspring.", that will be answered based on facts (the data). https://gradcoach.com/what-is-a-research-hypothesis-or-scientific-hypothesis/

Formulate *N* non-trivial hypotheses, 1 per group member, and regard the following criteria:
- State the hypothesis explicitly in concise language.
- The hypothesis must be **motivated** by either **EDA results** or **literature** (citation in the report needed).
- The hypothesis must refer to **endpoints** that are **testable**. Specifically, the endpoint must be derived from the data.
- Think of real-life use cases/consequences of your results (textual description).
- For each hypothesis explain all executed steps.
- In case of extreme or implausible results check the validity of your data.
- For each hypothesis export the artifacts (figures, tables, etc.) required for the report.
- If you decide to use a statistical test, use it properly. In particular, check the validity and comparability of the samples.

Do not:
- State nebulous, vague hypotheses. These don't contain endpoints and are unclear to test (i.e., answer).
- Use post-hoc hypotheses. Portraying an empirically inspired **post hoc hypothesis as a priori** violates the **falsification principle** crucial for hypothesis-driven (that is, confirmatory) empirical research. Falsification is severe scientific fraud.
- State trivial hypotheses (e.g., hypthesis 2: "Not Hypothesis 1").
- Answer based on "common knowledge".
- Try to **produce positively tested hypotheses**. If a well motivated hypothesis is negative, this is an important finding (see Simpson's Paradox). The value of a tested hypothesis lies in the information or learning it provides.

Example: The homework with Simpson's Paradox. The pooled overall comparison between the genders would be the EDA motivating the hypothesis: "At UC Berkeley the by-department admissions rate for females is lower than for males." It should be tested using samples of department admission rates for the 2 **groups** male and female. No steps of the test should be done in EDA (or a priory to stating the hypothesis). The groups should be compared graphically, e.g., via a stripplot overlayed with a boxplot. The figure should be labelled properly and exported for the report. A (paired) t-test **may** be used (it's optional) to test this hypothesis statistically. For different data (e.g., time series) different approaches may be required. You don't have to use statistical tests, in particular if you don't know what they are doing.


### E3-H1: "The prevalence of specific dog breeds in Vienna is strongly influenced by real estate prices."
Rationale: Higher real estate prices may attract certain dog breeds, indicating a potential link between the types of dogs found in high-income areas and local housing costs.
Data Needed: Dog breed data ('hunde-vie.csv'/hunde-wien.csv), Real estate prices dataset.

Author: Carlos Eduardo Tichy


### E3-H2: "Dog ownership trends are significantly different between high-cost and low-cost real estate areas of Vienna."
Rationale: Real estate prices could affect the type of dog breeds owned, changes in real estate prices could correlate with changes in dog breeds owned. Data Needed: Dog breed data ('hunde-vie.csv'/hunde-wien.csv), Real estate prices dataset.


Author: Theresa Spiel



### E3-H3: "xxx"
Author: Group member 3


### E3-H4: "xxx"
Author: Group member 4


### E3-H5: "xxx"
Author: Group member 5
