<span style='color:red'> NOTE: You can only pass the lab if you provide both code and Markdown. </span>

Use Code for your analysis
Use Markdown to document and elaborate on your findings, conclusions, assertions, etc.

# DS_ML_I_P1: Dataset creation from raw data 
Provided is a set of Excel files that stem from a radar measurement using an array of 15 antennas and a frequency sweep. Another Excel sheet provides information on the type of object that should be detected and its orientation.

The overall task is to load the data into **a single dataframe** and add the **proper information on the object**, its **orientation**, and **the name of the image that shows the object** (the images themselves are not provided here).


## 1. Load the data and check proper loading
Load all the data into a single dataframe so that
* The name of the file is a separate column
* Only the first five columns and all rows per sheet tab are integrated (15 tabs in total, one per antenna)
* The sheet tab name is the major index in a multiindex column dataframe
* The tab column names are the minor index
* After this dataframe has been created, the object information, orientation, and image name are added as separate columns by integrating the information from the protocol Excel sheet.

In [78]:
import pandas as pd
import glob
import os
import numpy as np


In [79]:
pd.options.display.max_columns=1000
pd.options.display.max_rows=20


### 1.1 Load Measurement Data 
Measurement data are loaded from multiple Excel files, each containing the measurements of 15 antennas.

The final dataframe uses the antenna name as the major (first-level) column index and the five measurement values as the minor (second-level) column index.

A separate `filename` column is added to every row so that the additional information can be looked up later.


In [80]:
%%time
# Sort the paths so that the row order does not depend on the file system
file_paths = sorted(glob.glob("P1b/Measurements_8_April_2023_IMP-SIMO/*.xls"))

dfs = []

for file in file_paths:
    file_data = []
    # sheet_name=None loads every tab into a dict {tab name: dataframe}
    sheets = pd.read_excel(file, sheet_name=None, usecols=[0,1,2,3,4])

    for sheet_name, df_temp in sheets.items():
        multi_columns = [
            np.repeat(sheet_name,len(df_temp.columns)),
            df_temp.columns.to_list()
        ]
        # df_temp.columns = pd.MultiIndex.from_arrays(multi_columns, names=['major','minor'])
        df_temp.columns = pd.MultiIndex.from_arrays(multi_columns)

        file_data.append(df_temp)

    # Join the tabs side by side, then tag every row with the source file name
    df_file = pd.concat(file_data, axis=1)
    df_file["filename"] = os.path.splitext(os.path.basename(file))[0]
    dfs.append(df_file)

concatenated_data = pd.concat(dfs, ignore_index=True)

CPU times: total: 7.03 s
Wall time: 7.25 s
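
The two-level column layout described above can be illustrated on a small synthetic example. This is only a sketch: the two tabs, their values, and the name `demo_file` are made up, and `pd.concat` with a dict is used here as a shorter alternative to building the `MultiIndex` by hand.

```python
import pandas as pd

# Synthetic stand-ins for two sheet tabs (values are invented for illustration)
sheets = {
    "ANT 1": pd.DataFrame({"Magnitute": [-51.0, -50.2], "Phase": [-32.2, -30.1]}),
    "ANT 2": pd.DataFrame({"Magnitute": [-49.4, -48.9], "Phase": [-36.5, -35.0]}),
}

# Passing a dict to concat uses its keys as the major (first) column level
wide = pd.concat(sheets, axis=1)

# A plain string key on MultiIndex columns becomes the column ("filename", "")
wide["filename"] = "demo_file"  # hypothetical file name

print(wide.columns.tolist())
```

Selecting `wide["ANT 2"]` then returns the two measurement columns of that antenna, which is the access pattern the multiindex is meant to support.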


In [81]:
df_m = concatenated_data

### 1.2 Load Protocol Data
The measurement protocol Excel file is loaded into a dataframe, and its column names are translated to English.

In [82]:
df_p = pd.read_excel("P1b/Messprotokoll_18_04_2023_open_V1.xlsx", skiprows=6, usecols="C:H").rename(columns={
    "Messung": "measurement", 
    "Gegenstand": "object", 
    "Postion": "position", 
    "Dateienname ": "filename", 
    "Bild ": "image",
    "Anmerkungen": "comments"
})

The filename is set as the dataframe's index so that the protocol columns can be looked up by filename in the mapping step that follows.

In [83]:
df_p['filename'] = df_p['filename'].astype(str).str.strip()
df_p = df_p.set_index('filename')

### 1.3 Add Additional Information to Measurement Dataframe
The measurement dataframe is enriched with the object information, position, and image name as separate columns. Each value is obtained by mapping the `filename` column to the matching row of the protocol dataframe; filenames without a match in the protocol result in missing values.

In [84]:
df_m['filename'] = df_m['filename'].astype(str).str.strip()

df_m['object'] = df_m['filename'].map(df_p['object'])
df_m['position'] = df_m['filename'].map(df_p['position'])
df_m['image'] = df_m['filename'].map(df_p['image'])
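
The lookup behaviour of `Series.map` can be checked on a small synthetic example. This is a sketch under assumptions: the file names `f01` to `f03` and the object labels are invented and do not come from the protocol.

```python
import pandas as pd

# Hypothetical protocol with two known files, indexed by filename
protocol = pd.DataFrame(
    {"filename": ["f01", "f02"], "object": ["camera", "phone"]}
).set_index("filename")

# Hypothetical measurements, including one file that is absent from the protocol
measurements = pd.DataFrame({"filename": ["f01", "f01", "f02", "f03"]})

# map() looks each filename up in the index of the protocol Series
measurements["object"] = measurements["filename"].map(protocol["object"])

# Filenames that could not be matched end up as NaN in the mapped column
unmatched = sorted(set(measurements["filename"]) - set(protocol.index))
print(unmatched)
```

Listing the unmatched filenames separates missing values caused by a failed lookup from values that are empty in the protocol itself.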

In [85]:
df = df_m

## 2. Print some statistics and analyze

### 2.1 Statistic Analysis Result
* A total of 3,000 rows across 79 columns were loaded. There are 46 columns of dtype `float64`, 30 of `int64`, and 3 of `object`.

* The `image` column was read as `float64` instead of an integer or `object` type, because pandas represents missing values as `NaN`, which forces an integer column to float. A transformation is likely needed to represent the image numbers correctly later.
* The columns `object`, `position`, and `image` contain missing values.
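
One possible transformation for the `image` column is the nullable integer dtype `Int64`, which keeps whole numbers and missing values side by side. The following is a minimal sketch on a synthetic Series, not the conversion applied in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the image column: whole numbers plus a missing value
image = pd.Series([np.nan, 1.0, 2.0, 43.0], name="image")

# "Int64" (capital I) is the pandas nullable integer dtype; NaN becomes <NA>
image_int = image.astype("Int64")

print(image_int.dtype)
print(image_int.tolist())
```

The conversion only works when all non-missing values are whole numbers, which matches the values 1 to 43 listed for `image` further down.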


In [86]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 79 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   (ANT 1, DAC Value)       3000 non-null   float64
 1   (ANT 1, DAC Value RAW)   3000 non-null   int64  
 2   (ANT 1, Magnitute)       3000 non-null   float64
 3   (ANT 1, Phase)           3000 non-null   float64
 4   (ANT 1, Frequency)       3000 non-null   int64  
 5   (ANT 2, DAC Value)       3000 non-null   float64
 6   (ANT 2, DAC Value RAW)   3000 non-null   int64  
 7   (ANT 2, Magnitute)       3000 non-null   float64
 8   (ANT 2, Phase)           3000 non-null   float64
 9   (ANT 2, Frequency)       3000 non-null   int64  
 10  (ANT 3, DAC Value)       3000 non-null   float64
 11  (ANT 3, DAC Value RAW)   3000 non-null   int64  
 12  (ANT 3, Magnitute)       3000 non-null   float64
 13  (ANT 3, Phase)           3000 non-null   float64
 14  (ANT 3, Frequency)      

In total, 420 rows contain at least one missing value.

In [87]:
rows_with_missing_values = df[df.isna().any(axis=1)]
print(f"{len(rows_with_missing_values)} rows have missing values")

420 rows have missing values
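
To see where the missing values sit, the count can be broken down per column and per file. The sketch below uses a small synthetic frame with flat column names; the file names and labels are invented, and the real dataframe with its multiindex columns would need the column selection adapted.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: file f02 has no image recorded
demo = pd.DataFrame({
    "filename": ["f01", "f01", "f02", "f02"],
    "object": ["camera", "camera", "no camera", "no camera"],
    "image": [1.0, 1.0, np.nan, np.nan],
})

# Missing values per column, absolute and as a percentage
missing_per_column = demo.isna().sum()
missing_pct = demo.isna().mean() * 100

# Number of rows with at least one missing value, per source file
row_has_missing = demo.drop(columns="filename").isna().any(axis=1)
rows_missing_per_file = row_has_missing.groupby(demo["filename"]).sum()

print(missing_per_column)
print(rows_missing_per_file)
```

If the missing rows cluster in a few files, that points to incomplete protocol entries for those measurements rather than to randomly lost values.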


### What do these missing values mean? What do they say about the data quality?

The following shows the summary statistics of the final dataframe.

The summary shows that `Frequency` is identical for all antennas and all rows: its standard deviation is 0, and its minimum and maximum are both 2450000000.

In [88]:
df.describe()

Unnamed: 0_level_0,ANT 1,ANT 1,ANT 1,ANT 1,ANT 1,ANT 2,ANT 2,ANT 2,ANT 2,ANT 2,ANT 3,ANT 3,ANT 3,ANT 3,ANT 3,ANT 4,ANT 4,ANT 4,ANT 4,ANT 4,ANT 5,ANT 5,ANT 5,ANT 5,ANT 5,ANT 6,ANT 6,ANT 6,ANT 6,ANT 6,ANT 7,ANT 7,ANT 7,ANT 7,ANT 7,ANT 8,ANT 8,ANT 8,ANT 8,ANT 8,ANT 9,ANT 9,ANT 9,ANT 9,ANT 9,ANT 10,ANT 10,ANT 10,ANT 10,ANT 10,ANT 11,ANT 11,ANT 11,ANT 11,ANT 11,ANT 12,ANT 12,ANT 12,ANT 12,ANT 12,ANT 13,ANT 13,ANT 13,ANT 13,ANT 13,ANT 14,ANT 14,ANT 14,ANT 14,ANT 14,ANT 15,ANT 15,ANT 15,ANT 15,ANT 15,image
Unnamed: 0_level_1,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,Unnamed: 76_level_1
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,2580.0
mean,-0.3371,1390.0,-29.440857,-11.037441,2450000000.0,-0.3371,1390.0,-28.584241,-8.454822,2450000000.0,-0.3371,1390.0,-28.496339,6.162484,2450000000.0,-0.3371,1390.0,-27.625137,-5.295042,2450000000.0,-0.3371,1390.0,-28.421452,-14.991958,2450000000.0,-0.3371,1390.0,-24.960215,31.746377,2450000000.0,-0.3371,1390.0,-24.468644,6.721207,2450000000.0,-0.3371,1390.0,-24.302023,11.708171,2450000000.0,-0.3371,1390.0,-26.139537,1.314662,2450000000.0,-0.3371,1390.0,-25.865322,-3.758702,2450000000.0,-0.3371,1390.0,-27.790367,-10.607068,2450000000.0,-0.3371,1390.0,-25.176945,1.131744,2450000000.0,-0.3371,1390.0,-25.627766,7.391315,2450000000.0,-0.3371,1390.0,-26.44304,-2.713947,2450000000.0,-0.3371,1390.0,-27.908959,-17.62175,2450000000.0,22.0
std,0.727482,346.419787,21.308478,100.516179,0.0,0.727482,346.419787,20.660701,101.662437,0.0,0.727482,346.419787,19.896935,105.45424,0.0,0.727482,346.419787,20.533735,102.950631,0.0,0.727482,346.419787,20.865768,101.811997,0.0,0.727482,346.419787,19.237833,97.933551,0.0,0.727482,346.419787,19.978353,112.893332,0.0,0.727482,346.419787,20.381027,113.738913,0.0,0.727482,346.419787,20.061793,108.204689,0.0,0.727482,346.419787,20.366402,102.632036,0.0,0.727482,346.419787,21.118536,100.650432,0.0,0.727482,346.419787,20.238759,102.738083,0.0,0.727482,346.419787,19.945961,106.206655,0.0,0.727482,346.419787,20.653936,102.324115,0.0,0.727482,346.419787,22.365833,100.495661,0.0,12.412079
min,-1.5761,800.0,-76.1324,-179.746,2450000000.0,-1.5761,800.0,-78.798,-179.855,2450000000.0,-1.5761,800.0,-65.8506,-179.841,2450000000.0,-1.5761,800.0,-66.2139,-179.83,2450000000.0,-1.5761,800.0,-72.2257,-179.865,2450000000.0,-1.5761,800.0,-68.5037,-179.901,2450000000.0,-1.5761,800.0,-67.1878,-179.738,2450000000.0,-1.5761,800.0,-69.0334,-179.686,2450000000.0,-1.5761,800.0,-83.5241,-179.956,2450000000.0,-1.5761,800.0,-75.3889,-179.916,2450000000.0,-1.5761,800.0,-75.9411,-179.522,2450000000.0,-1.5761,800.0,-72.5596,-179.967,2450000000.0,-1.5761,800.0,-64.0716,-179.974,2450000000.0,-1.5761,800.0,-77.3509,-179.978,2450000000.0,-1.5761,800.0,-85.9119,-178.594,2450000000.0,1.0
25%,-0.9566,1095.0,-49.023,-98.42445,2450000000.0,-0.9566,1095.0,-47.094225,-104.9275,2450000000.0,-0.9566,1095.0,-46.240275,-93.691275,2450000000.0,-0.9566,1095.0,-47.068525,-108.0485,2450000000.0,-0.9566,1095.0,-48.134,-103.4285,2450000000.0,-0.9566,1095.0,-41.839825,-34.75435,2450000000.0,-0.9566,1095.0,-43.308825,-99.83635,2450000000.0,-0.9566,1095.0,-43.7201,-96.735925,2450000000.0,-0.9566,1095.0,-44.589625,-99.1971,2450000000.0,-0.9566,1095.0,-46.133775,-106.337,2450000000.0,-0.9566,1095.0,-48.34465,-98.241525,2450000000.0,-0.9566,1095.0,-45.028425,-101.07925,2450000000.0,-0.9566,1095.0,-44.521475,-95.738675,2450000000.0,-0.9566,1095.0,-45.712275,-104.57575,2450000000.0,-0.9566,1095.0,-48.573975,-102.61225,2450000000.0,11.0
50%,-0.3371,1390.0,-34.07985,-24.7267,2450000000.0,-0.3371,1390.0,-33.7176,-0.022706,2450000000.0,-0.3371,1390.0,-33.76835,20.15135,2450000000.0,-0.3371,1390.0,-32.08675,0.879465,2450000000.0,-0.3371,1390.0,-31.8168,-35.51635,2450000000.0,-0.3371,1390.0,-30.2288,43.5981,2450000000.0,-0.3371,1390.0,-27.1522,2.42822,2450000000.0,-0.3371,1390.0,-26.8414,20.315,2450000000.0,-0.3371,1390.0,-30.4976,14.80255,2450000000.0,-0.3371,1390.0,-28.64425,8.16926,2450000000.0,-0.3371,1390.0,-30.8484,-16.3945,2450000000.0,-0.3371,1390.0,-27.81345,18.13955,2450000000.0,-0.3371,1390.0,-29.024,28.89735,2450000000.0,-0.3371,1390.0,-30.5332,9.48673,2450000000.0,-0.3371,1390.0,-29.67785,-43.49395,2450000000.0,22.0
75%,0.2824,1685.0,-5.19755,71.956325,2450000000.0,0.2824,1685.0,-5.459223,80.736,2450000000.0,0.2824,1685.0,-6.18516,95.47015,2450000000.0,0.2824,1685.0,-4.311635,87.00505,2450000000.0,0.2824,1685.0,-4.999288,70.3394,2450000000.0,0.2824,1685.0,-2.598315,119.5485,2450000000.0,0.2824,1685.0,-1.677032,121.62525,2450000000.0,0.2824,1685.0,-0.551188,126.35625,2450000000.0,0.2824,1685.0,-3.197342,99.07375,2450000000.0,0.2824,1685.0,-3.236295,84.562275,2450000000.0,0.2824,1685.0,-3.834718,73.8607,2450000000.0,0.2824,1685.0,-1.81759,82.5745,2450000000.0,0.2824,1685.0,-2.617195,91.6349,2450000000.0,0.2824,1685.0,-2.84283,84.947625,2450000000.0,0.2824,1685.0,-3.38751,62.334175,2450000000.0,33.0
max,0.9019,1980.0,2.96704,179.945,2450000000.0,0.9019,1980.0,2.60007,179.882,2450000000.0,0.9019,1980.0,1.95198,179.968,2450000000.0,0.9019,1980.0,3.01007,179.962,2450000000.0,0.9019,1980.0,2.31273,179.475,2450000000.0,0.9019,1980.0,3.01017,179.927,2450000000.0,0.9019,1980.0,3.01017,179.841,2450000000.0,0.9019,1980.0,3.01017,179.695,2450000000.0,0.9019,1980.0,3.01017,179.963,2450000000.0,0.9019,1980.0,3.01017,179.995,2450000000.0,0.9019,1980.0,1.62582,179.972,2450000000.0,0.9019,1980.0,3.01017,179.856,2450000000.0,0.9019,1980.0,3.01017,179.92,2450000000.0,0.9019,1980.0,3.01017,179.94,2450000000.0,0.9019,1980.0,3.01017,179.79,2450000000.0,43.0


The check below confirms that every value in every `Frequency` column equals 2450000000. More details are shown in the visualization.

In [89]:
freq_columns = df.xs('Frequency', axis=1, level=1)
(freq_columns == 2450000000).all(axis=None)

np.True_
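
The same idea generalizes to any column: a feature is constant when it has exactly one distinct value. The following sketch runs on a synthetic frame with flat column names; the values are invented.

```python
import pandas as pd

# Synthetic stand-in with one constant and one varying column
demo = pd.DataFrame({
    "Frequency": [2450000000, 2450000000, 2450000000],
    "Phase": [-32.2, 1.2, 53.4],
})

# A column is constant if it holds a single distinct value (NaN counted as a value)
constant_columns = [c for c in demo.columns if demo[c].nunique(dropna=False) == 1]

print(constant_columns)
```

Constant columns carry no information for telling objects apart, so they are candidates for removal before modelling.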

In [90]:
print(df['image'].dtype)
print(df['image'].unique())
print(df['image'].min())
print(df['image'].max())

float64
[nan  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35.
 36. 37. 38. 39. 40. 41. 42. 43.]
1.0
43.0


## 3. Visualize the data
* Scatter Plot
* Box Plot
* Histogram

### 3.1 Flattening the dataframe
To make data visualization easier, the dataframe is flattened into a long format. Instead of multilevel columns, there is now a separate `Antenna` column specifying which antenna each value was taken from, so every original row becomes 15 rows (3,000 × 15 = 45,000 rows in total).

In [91]:
antenna_df = df.loc[:,df.columns.get_level_values(0).str.startswith('ANT')]
antenna_df = antenna_df.stack(level=0, future_stack=True).reset_index()
antenna_df.rename(columns={'level_0':'Index', 'level_1':'Antenna'}, inplace=True)
antenna_df

Unnamed: 0,Index,Antenna,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency
0,0,ANT 1,-1.5761,800,-51.029200,-32.22020,2450000000
1,0,ANT 2,-1.5761,800,-49.357300,-36.46740,2450000000
2,0,ANT 3,-1.5761,800,-48.255700,-29.19570,2450000000
3,0,ANT 4,-1.5761,800,-49.068000,-34.92130,2450000000
4,0,ANT 5,-1.5761,800,-49.537400,-32.57220,2450000000
...,...,...,...,...,...,...,...
44995,2999,ANT 11,0.9019,1980,0.001814,1.21299,2450000000
44996,2999,ANT 12,0.9019,1980,1.910730,53.37160,2450000000
44997,2999,ANT 13,0.9019,1980,0.064425,83.02310,2450000000
44998,2999,ANT 14,0.9019,1980,1.615870,52.62040,2450000000


In [92]:
object_details_df = df.loc[:,~df.columns.get_level_values(0).str.startswith('ANT')]
object_details_df.columns = object_details_df.columns.droplevel(level=1)
object_details_df

Unnamed: 0,filename,object,position,image
0,1804202300,Ohne Kamera,,
1,1804202300,Ohne Kamera,,
2,1804202300,Ohne Kamera,,
3,1804202300,Ohne Kamera,,
4,1804202300,Ohne Kamera,,
...,...,...,...,...
2995,1804202349,,,
2996,1804202349,,,
2997,1804202349,,,
2998,1804202349,,,


In [93]:
final_df = antenna_df.merge(object_details_df, how='left', left_on='Index', right_index=True)
final_df

Unnamed: 0,Index,Antenna,DAC Value,DAC Value RAW,Magnitute,Phase,Frequency,filename,object,position,image
0,0,ANT 1,-1.5761,800,-51.029200,-32.22020,2450000000,1804202300,Ohne Kamera,,
1,0,ANT 2,-1.5761,800,-49.357300,-36.46740,2450000000,1804202300,Ohne Kamera,,
2,0,ANT 3,-1.5761,800,-48.255700,-29.19570,2450000000,1804202300,Ohne Kamera,,
3,0,ANT 4,-1.5761,800,-49.068000,-34.92130,2450000000,1804202300,Ohne Kamera,,
4,0,ANT 5,-1.5761,800,-49.537400,-32.57220,2450000000,1804202300,Ohne Kamera,,
...,...,...,...,...,...,...,...,...,...,...,...
44995,2999,ANT 11,0.9019,1980,0.001814,1.21299,2450000000,1804202349,,,
44996,2999,ANT 12,0.9019,1980,1.910730,53.37160,2450000000,1804202349,,,
44997,2999,ANT 13,0.9019,1980,0.064425,83.02310,2450000000,1804202349,,,
44998,2999,ANT 14,0.9019,1980,1.615870,52.62040,2450000000,1804202349,,,
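
The three plot types listed at the start of this section could be produced from the long-format frame roughly as follows. This is a sketch on synthetic random data with only two antennas; the column names follow `final_df`, but the values are generated and the figure is not a result of the measurement.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic long-format data standing in for final_df
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Antenna": np.repeat(["ANT 1", "ANT 2"], 50),
    "Magnitute": rng.normal(-28, 20, 100),
    "Phase": rng.uniform(-180, 180, 100),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Scatter plot: phase against magnitude, one colour per antenna
for name, group in demo.groupby("Antenna"):
    axes[0].scatter(group["Phase"], group["Magnitute"], s=8, label=name)
axes[0].set_xlabel("Phase")
axes[0].set_ylabel("Magnitute")
axes[0].legend()

# Box plot: magnitude distribution per antenna
names = [name for name, _ in demo.groupby("Antenna")]
groups = [group["Magnitute"].to_numpy() for _, group in demo.groupby("Antenna")]
axes[1].boxplot(groups)
axes[1].set_xticks(range(1, len(names) + 1))
axes[1].set_xticklabels(names)
axes[1].set_ylabel("Magnitute")

# Histogram: magnitude over all antennas
axes[2].hist(demo["Magnitute"], bins=20)
axes[2].set_xlabel("Magnitute")

fig.tight_layout()
```

With the real data, `demo` would be replaced by `final_df`, and the box plot would show 15 boxes, one per antenna.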


## 4. Conclusion


### 📂 General Dataset Overview
* Identify and describe data sources (e.g., databases, CSVs, APIs).
* Document the number of records (rows) and features (columns).
* Determine the data types (numeric, categorical, datetime, etc.).
* Check units of measurement (if applicable).
* Record metadata (e.g., column names, descriptions, units).

### 📊 Feature-Level Summary
* Provide summary statistics:
  * Numeric: count, mean, median, min, max, standard deviation, percentiles.
  * Categorical: count of unique values, frequency of each category.
* Identify primary keys or identifiers (if present).
* Highlight target variable(s) for prediction or analysis.
* Note data ranges and value distributions.
* Detect constant or quasi-constant features (little to no variation).

### 🧹 Data Quality Overview
* Report missing values (count and percentage per column).
* Detect duplicates (rows or IDs).
* Check for data type mismatches (e.g., numbers stored as strings).
* Flag outliers or unexpected values.

### 📈 Visualization (Optional but Recommended)
* Histograms or boxplots for numerical data.
* Bar charts for categorical data.
* Heatmaps or pairplots to see relationships (if already relevant).

### 🗒️ Documentation
* Maintain a data dictionary (column descriptions, formats, value ranges).
* Note any initial assumptions or findings relevant to data quality or usage.
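
A data dictionary of the kind listed under Documentation could be started automatically from the dataframe and then completed by hand. The sketch below uses a synthetic frame with invented values; the `description` column is a placeholder that pandas cannot fill in.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with a numeric, a categorical, and an incomplete column
demo = pd.DataFrame({
    "Phase": [-32.2, 1.2, 53.4],
    "object": ["camera", "camera", None],
    "image": [1.0, 2.0, np.nan],
})

# One row per column: type, completeness, and number of distinct values
data_dictionary = pd.DataFrame({
    "dtype": demo.dtypes.astype(str),
    "non_null": demo.notna().sum(),
    "missing_pct": demo.isna().mean() * 100,
    "n_unique": demo.nunique(),
})
data_dictionary["description"] = ""  # to be written manually

print(data_dictionary)
```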

