# Problem Solving Phase

In [None]:
# Run this cell to import the necessary packages and datasets. Do not use any additional libraries.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from IPython.core.interactiveshell import InteractiveShell

%matplotlib inline
matplotlib.style.use('ggplot')
InteractiveShell.ast_node_interactivity = "all"

files = {
    "CAR COMPANY 1": ("AQ_phase1_dataset1.csv", "Dataset 1 (Company 1)"),
    "CAR COMPANY 2": ("AQ_phase1_dataset2.csv", "Dataset 2 (Company 2)"),
    "CAR COMPANY 3": ("AQ_phase1_dataset3.csv", "Dataset 3 (Company 3)"),
    "CAR COMPANY 4": ("AQ_phase1_dataset4.csv", "Dataset 4 (Company 4)")
}

label_col = "Input_Dataset_name"
dfs, dfs_plot = {}, {}
for company, (file, label) in files.items():
    df = pd.read_csv(file)
    dfs[company] = df
    df_plot = df.copy()
    df_plot[label_col] = label
    dfs_plot[company] = df_plot

## Dataset description
Each of the 4 dataframes shows **units sold** (in hundreds) and **employee satisfaction** (1–100 scale) from 182 sites worldwide, with one dataframe per car company (1–4).

Running the cell below returns **descriptive statistics** and **a parallel coordinates plot** for comparison:

1. **Median**: middle value that splits the data into two equal halves.

2. **Interquartile range (IQR)**: spread of the middle 50% of the data (75th percentile − 25th percentile).

3. **Spearman's correlation**: strength and direction of monotonic relationships (both variables rise/fall together, or one rises as the other falls).

4. **Parallel coordinates**: visualization of multivariate data, where each record is a line crossing vertical axes (one per variable).
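
To make the three statistics above concrete, here is a minimal sketch on a small made-up table. The column names `units_sold` and `satisfaction` and all values are assumptions for illustration only; they are not taken from the study datasets.

```python
import pandas as pd

# Hypothetical toy data (NOT from the study datasets)
toy = pd.DataFrame({
    "units_sold": [10, 20, 30, 40, 50],
    "satisfaction": [55, 60, 58, 70, 90],
})

# Median: middle value of each column
median = toy.median()

# IQR: 75th percentile minus 25th percentile of each column
iqr = toy.quantile(0.75) - toy.quantile(0.25)

# Spearman's correlation: correlation of the ranks of the two columns
spearman = toy.corr(method="spearman").loc["units_sold", "satisfaction"]

print(median)
print(iqr)
print(round(spearman, 2))
```

Because Spearman's correlation works on ranks, it stays high here even though satisfaction does not rise by a constant amount per unit sold.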

In [None]:
from IPython.display import display  # explicit import so the cell also runs outside a notebook

# Per-company median and interquartile range (75th - 25th percentile) of each column
median_df = pd.DataFrame({name: df.median().round(2) for name, df in dfs.items()})
iqr_df = pd.DataFrame({name: (df.quantile(0.75) - df.quantile(0.25)).round(2) for name, df in dfs.items()})

print("--- Median ---")
display(median_df)
print(" ")

print("--- Interquartile Range ---")
display(iqr_df)
print(" ")

# Spearman correlation matrix per company, shown side by side
print("--- Spearman Correlations ---")
corr_blocks = {name: df.corr(method="spearman").round(2) for name, df in dfs.items()}
corr_side_by_side = pd.concat(corr_blocks, axis=1)
display(corr_side_by_side)
print(" ")

# One parallel coordinates panel per company
print("--- Parallel Coordinates ---")
fig, axes = plt.subplots(1, 4, figsize=(22, 5), constrained_layout=True)
for ax, (name, dfc) in zip(axes, dfs_plot.items()):
    parallel_coordinates(dfc, label_col, ax=ax)
    ax.set_title(name, fontsize=10)

plt.show()

## Task description

Design **multiple measures** to rank the four car companies from **most** to **least successful**, using **all data points** in the datasets.

For **each measure**:
1. Print the resulting **company order** (e.g., 1234, 2134)
2. Print your **reasoning** — why you chose this approach or calculation
3. Show how you **used the descriptive statistics** from above in your reasoning
4. State your **confidence** in the measure

**Important:** Always print all four items listed above for each measure.

### How to create a new measure

Use the template cell provided below. To add a measure:
1. Insert a new cell below the template
2. Copy the template contents into it
3. Modify the copied cell with your new measure
4. Repeat for as many measures as you can within the allotted time
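
If a measure produces one number per company, the order string can be derived from those numbers instead of being typed by hand. The sketch below assumes a hypothetical `scores` dictionary in which a higher value means "more successful"; the values are placeholders, not results from the datasets, and the sketch does not suggest any particular measure.

```python
from itertools import permutations

# Hypothetical scores, one per company (placeholders, NOT computed from the data);
# higher is assumed to mean "more successful"
scores = {"1": 0.42, "2": 0.87, "3": 0.15, "4": 0.60}

# Sort the company labels by score, highest first, and join them into one string
carcompany_order = "".join(sorted(scores, key=scores.get, reverse=True))
print(carcompany_order)  # prints "2413"

# Sanity check: the result must be one of the 24 possible orderings
ranking_list = ["".join(p) for p in permutations("1234")]
assert carcompany_order in ranking_list
```

If lower values mean "more successful" for your measure, drop `reverse=True`.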

In [None]:
# Template for designing a measure (Note: Your intuitive ideas are more valuable than minor syntax details)
# ----------------------------------------------------------------------------------------------------------

# Your code (if any)

ranking_list = ["1234", "1243", "1324", "1342", "1423", "1432", "2134", "2143", "2314", "2341", "2413", "2431", "3124", "3142", "3214", "3241", "3412", "3421", "4123", "4132", "4213", "4231", "4312", "4321"]
carcompany_order = 'None' # Pick one string from the list as the car company order
print(carcompany_order)

carcompany_order_reasoning = 'None' # Write your reasoning here
print(carcompany_order_reasoning)

option_a = "I found the descriptive statistics HELPFUL in designing the measure (my measure is BASED ON one or more of them)"
option_b = "I found the descriptive statistics HELPFUL in designing the measure (my measure is NOT BASED ON them, but I still found them helpful in reasoning about what measures might or might not work)"
option_c = "I found the descriptive statistics NOT HELPFUL in designing the measure (my measure is NOT BASED ON them and I did not find information from them to be convincing enough to answer the task)"
option_d = "I found the descriptive statistics NOT HELPFUL in designing the measure (my measure is BASED ON them, but I did not find information from them to be convincing enough to answer the task)"
used_descriptive_statistics_in_reasoning = 'None' # Choose one of the four options above
print(used_descriptive_statistics_in_reasoning)

confidence_measure = 0 # Choose one confidence level (1-2-3-4-5)
print(confidence_measure)

# Instruction Phase

Please **complete the Problem Solving phase** before moving to this phase.

### Click the link to watch the first video (4 mins)
[Instruction Phase - Part 1](https://www.youtube.com/embed/zxh4JnVCwss?si=c7PN1ZqxDUHd6xzz)

### Click the link to watch the second video (11 mins)
[Instruction Phase - Part 2](https://www.youtube.com/embed/VzNpSVnSbhg?si=dsxKEvcHWxDvFIp-)

### Click the link to watch the third video (4 mins)
[Instruction Phase - Part 3](https://www.youtube.com/embed/bryiXWN3owI?si=B0wIgj7tyQ1yShY3)